
Configuring and Managing Extraction

Entities extracted from cluster services, and the metadata applied to them, support Cloudera Navigator features such as tracing entities to their source in lineage diagrams, searching for specific entities, and so on. Extraction is enabled by default for some services, such as Spark, while for other services it must be explicitly enabled. Because extraction consumes computing resources, such as memory and storage, administrators may want to disable extraction for some services entirely, or configure more selective extraction for specific services.

In addition to enabling or disabling extraction for specific services, you can configure filters that blacklist specific HDFS paths, removing them from the extraction process. Filtering speeds up extraction, cuts down on indexing time, and reduces the amount of storage consumed by the datadir. Filters can also be configured to blacklist or whitelist Amazon S3 buckets.

  Note: To ensure that the extraction process runs optimally, data stewards should regularly delete metadata for entities no longer needed. Deleting metadata should be followed by using the purge function, which cleans the datadir of Solr documents associated with the deleted entities. Using purge on a regular basis facilitates fast search and noise-free lineage diagrams.

Continue reading:

Enabling and Disabling Metadata Extraction

Cloudera Manager Required Role: Navigator Administrator (or Full Administrator)

Enabling Hive Metadata Extraction in a Secure Cluster

To extract entities from Hive, the Navigator Metadata Server authenticates to the Hive metastore as the hue user. By default, the hue user has sufficient privileges to log in to the Hive metastore. However, if the Hive metastore configuration has been changed from its defaults, the hue user may not be able to authenticate, and the Navigator Metadata Server is prevented from extracting from Hive. Authentication for Hive is controlled by the following two properties, which work together:
  • Hive Metastore Access Control and Proxy User Groups Override (inherits from Hive Proxy User Groups when left empty, the default)
  • Hive Proxy User Groups
To ensure that the hue user can authenticate and that Hive metadata extraction can proceed:
  1. Log in to Cloudera Manager Admin Console.
  2. Select Clusters > Hive-1.
  3. Click the Configuration tab.
  4. Select Proxy for the Category filter.
  5. Add hue to the Hive Metastore Access Control and Proxy User Groups Override list if necessary:
    • Click the plus icon to open an entry field to add a row to the property.
    • Type hue in the entry field.
  6. Click Save Changes, and repeat the process to add hue to the Hive Proxy User Groups property on the HDFS service:
    • Select Clusters > HDFS-1.
    • Click the Configuration tab.
    • Select Proxy for the Category filter.
    • Add hue to the Hive Proxy User Groups list by adding a row and typing hue in the entry field.
  7. Click Save Changes.
  8. Restart the Hive service.

Disabling Spark Metadata Extraction

Metadata and lineage for Spark are extracted by default (see Apache Spark Known Issues for limitations). To disable Spark metadata extraction:
  1. Log in to Cloudera Manager Admin Console.
  2. Select Clusters > Spark_on_YARN-1.
  3. Click the Configuration tab.
  4. Select Cloudera Navigator for the Category filter. The current setting of the Enable Lineage Collection property displays.
  5. To disable lineage collection, clear the Enable Lineage Collection checkbox.

Removing Manually Enabled Lineage Collection Property

Prior to Cloudera Navigator 2.10 (Cloudera Manager 5.11), enabling lineage collection from Spark required setting a safety valve. If the cluster was upgraded from a previous release of Cloudera Navigator and an Advanced Configuration Snippet (Safety Valve) was used to enable lineage, you must remove that snippet to avoid conflict with the new Enable Lineage Collection property. Using a safety valve to enable Spark metadata extraction has been deprecated.

  Note: This applies only to systems where a safety valve was configured to enable lineage collection from Spark before the default configuration introduced in Cloudera Navigator 2.10.
To remove the deprecated safety valve setting:
  1. Log in to the Cloudera Manager Admin Console.
  2. Select Clusters > Cloudera Management Service.
  3. Click the Configuration tab.
  4. Select Navigator Metadata Server for the Scope filter.
  5. Select Advanced for the Category filter.
  6. Scroll to the Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties setting.
  7. Remove any deprecated setting for Spark extraction, such as:
    nav.spark.extraction.enable
  8. Click Save Changes.
  9. Restart the Navigator Metadata Server role.

Filtering File System Metadata

You can choose to leave some file system paths out of the scope of information tracked in Cloudera Navigator. Cloudera Manager provides a blacklist where you can specify file system paths to be filtered out of the metadata extracted from HDFS and S3.

To filter file system paths from tracked metadata:

  1. Log in to Cloudera Manager Admin Console.
  2. Select Clusters > Cloudera Management Service.
  3. Click the Configuration tab.
  4. Select Navigator Metadata Server for the Scope filter.
  5. Select Extractor Filter for the Category filter.
  6. Enable the filter:
    • HDFS Filter Enable
    • S3 Filter Enable
  7. In the appropriate filter list, include the file system path that you want to exclude from Navigator Metadata Server tracking:
    • HDFS Filter Blacklist
    • S3 Filter list

    The entry can be a specific path or a Java regular expression specifying a path; a quick way to test such an expression is sketched after these steps. For example, to specify a directory and all of its subdirectories, use an expression such as

    hdfs://path/to/dir(?:/.*)?
  8. Enter additional entries in the filter list by clicking the plus icon to open another entry field.
  9. For S3, set the S3 Filter Default Action to DISCARD.
  10. Click Save Changes.
  11. Click the Instances tab.
  12. Restart the role.
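
Because the blacklist entries are standard Java regular expressions, you can sanity-check a pattern before adding it to the filter. The following minimal sketch is not a Navigator API; it uses hypothetical paths and assumes the filter applies the expression as a full match against the path, as the example in step 7 suggests:

    import java.util.regex.Pattern;

    public class BlacklistPatternCheck {
        public static void main(String[] args) {
            // The example expression from step 7: the directory itself plus everything below it.
            Pattern blacklist = Pattern.compile("hdfs://path/to/dir(?:/.*)?");

            String[] candidates = {
                "hdfs://path/to/dir",            // excluded: the directory itself
                "hdfs://path/to/dir/part-00000", // excluded: a file below the directory
                "hdfs://path/to/dir2"            // kept: a different directory name
            };
            for (String path : candidates) {
                System.out.println(path + " excluded? " + blacklist.matcher(path).matches());
            }
        }
    }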

Editing MapReduce Custom Metadata

You can associate custom metadata with MapReduce jobs and job executions through arbitrary configuration parameters. The configuration parameters to be extracted by Cloudera Navigator can be specified statically or dynamically.

To specify configuration parameters statically for all MapReduce jobs and job executions, do the following:
  1. Log in to Cloudera Manager Admin Console.
  2. Select Clusters > Cloudera Management Service.
  3. Click the Configuration tab.
  4. Select Navigator Metadata Server for the Scope filter.
  5. Select Advanced for the Category filter.
  6. Scroll to the Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties.
  7. Specify values for the following properties:
    • nav.user_defined_properties - A comma-separated list of user-defined property names.
    • nav.tags - A comma-separated list of property names that serve as tags. The property nav.tags can point to multiple property names that serve as tags, but each of those property names can only specify a single tag.
  8. Click Save Changes.
  9. Click the Instances tab.
  10. Restart the role.
  11. In the MapReduce job configuration, set the value of the property names you specified in step 7, as sketched below.
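
For example, here is a minimal sketch of steps 7 and 11 together, using a hypothetical user-defined property named project and a hypothetical tag property named business_unit (the names and values are illustrations only, not properties Navigator defines):

    import org.apache.hadoop.conf.Configuration;

    public class StaticCustomMetadataExample {
        public static void main(String[] args) {
            // Hypothetical declarations entered in the Navigator Metadata Server
            // safety valve in step 7:
            //   nav.user_defined_properties=project
            //   nav.tags=business_unit
            // In step 11, the MapReduce job configuration supplies only the values:
            Configuration conf = new Configuration();
            conf.set("project", "customer-churn");  // value of the user-defined property
            conf.set("business_unit", "marketing"); // the single tag carried by business_unit
        }
    }
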
To specify configuration parameters dynamically:
  1. Specify one or more of the following properties in a job configuration:
    • Job properties (type:OPERATION)
      • nav.job.user_defined_properties - A comma-separated list of user-defined property names
      • nav.job.tags - A comma-separated list of property names that serve as tags
    • Job execution properties (type:OPERATION_EXECUTION)
      • nav.jobexec.user_defined_properties - A comma-separated list of user-defined property names
      • nav.jobexec.tags - A comma-separated list of property names that serve as tags
    The properties nav.job.tags and nav.jobexec.tags can point to multiple property names that serve as tags, but each of those property names can only specify a single tag.
  2. In the MapReduce job configuration, set the value of the property names you specified in step 1.

Setting Properties Dynamically

Add the tags onetag and twotag to a job:
  1. Dynamically add the job_tag1 and job_tag2 properties:
    conf.set("nav.job.tags", "job_tag1, job_tag2");
  2. Set the job_tag1 property to onetag:
    conf.set("job_tag1", "onetag");
  3. Set the job_tag2 property to twotag:
    conf.set("job_tag2", "twotag");
Add the tag atag to a job execution:
  1. Dynamically add the job_exec_tag property:
    conf.set("nav.jobexec.tags","job_exec_tag");
  2. Set the job_exec_tag property to atag:
    conf.set("job_exec_tag", "atag"); 
Add the user-defined key key with the value value:
  1. Dynamically add the user-defined key key:
    conf.set("nav.job.user_defined_properties", "key");
  2. Set the value of the user-defined key key to value:
    conf.set("key", "value")

Enabling Inputs and Outputs to Display

The Cloudera Navigator console displays a Details page for selected entities. Details include an entity's type and can optionally include table inputs and operation inputs and outputs. The inputs and outputs are not displayed by default because rendering them can slow down the display. To enable the display of inputs and outputs on the Details page, set the nav.ui.details_io_enabled property on the Navigator Metadata Server to true, as follows:

  1. Log in to the Cloudera Manager Admin Console.
  2. Select Clusters > Cloudera Management Service.
  3. Click the Configuration tab.
  4. Select Navigator Metadata Server for the Scope filter.
  5. Select Advanced for the Category filter.
  6. In the Navigator Metadata Server Advanced Configuration Snippet (Safety Valve) for cloudera-navigator.properties, enter the following:
    nav.ui.details_io_enabled=true
  7. Click Save Changes.
  8. Restart the Navigator Metadata Server role.