Using whitelists and blacklists

A whitelist specifies which Hive tables should be processed in Big Data Discovery, while a blacklist specifies which Hive tables should be ignored during data processing.

Both lists are optional when running the DP CLI. For example, if you manually run the DP CLI with the --table flag to process a specific table, you do not have to specify the lists.

Default lists are provided in the DP CLI package:

cli_whitelist.txt is the default whitelist name (you can use your own name for this file).
cli_blacklist.txt is the default blacklist name (you can use your own name for this file).

Both default lists are essentially empty: they contain only commented-out samples of regular expressions that you can use as patterns for your tables.

To specify the whitelist, use this syntax:

--whiteList cli_whitelist.txt

To specify the blacklist, use this syntax:

--blackList cli_blacklist.txt

List syntax

The --whiteList and --blackList flags each take a text file as their argument. Each text file contains one or more regular expressions (regex), with one pattern per line. The patterns are matched only against Hive table names, and a table name matches a list as long as at least one pattern in that list matches it.

Because the default whitelist and blacklist contain only commented-out sample regular expressions, they are effectively empty. To use the whitelist, you must edit the file to include at least one regular expression that specifies the tables to be ingested. Similarly, to exclude tables, add the corresponding patterns to the blacklist.

For example, suppose you want to process every table whose name starts with bdd, such as bdd_sales. The whitelist would have this regex entry:

^bdd.*
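
A pattern like this can be checked against a few table names before it goes into the whitelist. The following is a minimal sketch in Python, not part of the DP CLI: the table names are hypothetical, Python's re module may differ in details from the regex engine that Data Processing uses, and the sketch assumes that a pattern must match the whole table name.

```python
import re

# Sample whitelist entry from the text above.
pattern = re.compile(r"^bdd.*")

# Hypothetical Hive table names, used only for illustration.
tables = ["bdd_sales", "bdd", "sales_bdd", "my_bdd_table"]

# Assumption: a pattern must match the entire table name (fullmatch).
matched = [name for name in tables if pattern.fullmatch(name)]

print(matched)  # ['bdd_sales', 'bdd']
```

Only the names that start with bdd are selected; sales_bdd and my_bdd_table are not, because the ^ anchor ties the pattern to the start of the name.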

List processing

The pattern matcher in the Data Processing workflow uses this algorithm:

The whitelist is parsed first. If the whitelist is not empty, then a list of Hive tables to process is generated. If the whitelist is empty, then no Hive tables are ingested.
If the blacklist is present, the blacklist pattern matching is performed. Otherwise, blacklist matching is ignored.

To summarize, the whitelist is parsed first and generates a list of Hive tables to process; the blacklist is parsed second and generates a list of Hive table names to skip. Typically, the names from the blacklist modify those generated by the whitelist: if the same name appears in both lists, that table is not processed. In effect, the blacklist can "remove" names from the whitelist.
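
The two-step logic described above can be sketched as a small Python function. This is an illustration only, not the Data Processing implementation: the function name select_tables, its parameters, and the sample table names are hypothetical, and whole-name matching is an assumption.

```python
import re
from typing import Iterable, List, Optional


def select_tables(
    tables: Iterable[str],
    whitelist: List[str],
    blacklist: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper that mirrors the whitelist-then-blacklist order."""
    # Step 1: an empty whitelist means no tables are ingested.
    if not whitelist:
        return []
    # A table is selected if at least one whitelist pattern matches its name.
    # Assumption: patterns must match the whole name (fullmatch).
    selected = [
        t for t in tables if any(re.fullmatch(p, t) for p in whitelist)
    ]
    # Step 2: blacklist matching happens only if a blacklist is present.
    if blacklist:
        selected = [
            t for t in selected
            if not any(re.fullmatch(p, t) for p in blacklist)
        ]
    return selected


# Hypothetical table names for illustration.
tables = ["bdd_sales", "bdd_claims", "tmp_orders"]

with_both = select_tables(tables, ["^bdd.*"], ["bdd_claims"])
whitelist_only = select_tables(tables, ["^bdd.*"])
empty_whitelist = select_tables(tables, [], ["bdd_claims"])

print(with_both)        # ['bdd_sales']
print(whitelist_only)   # ['bdd_sales', 'bdd_claims']
print(empty_whitelist)  # []
```

In the first call, bdd_claims matches both lists, so the blacklist removes it from the whitelist's result.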

Example

To illustrate how these lists work, assume that you have 10 Hive tables with sales-related information. Those 10 tables have a _bdd suffix in their names, such as claims_bdd. To include them in data processing, you create a whitelist.txt file with this regex entry:

^.*_bdd$

If you then want to process all *_bdd tables except for the claims_bdd table, you create a blacklist.txt file with this entry:

claims_bdd

When you run the DP CLI with both the --whiteList and --blackList flags, all the *_bdd tables will be processed except for the claims_bdd table.
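
The outcome of this example can be reproduced outside the DP CLI with a short Python sketch. It writes the two list files to a temporary directory, reads one pattern per line, and applies the whitelist and then the blacklist. The helper read_patterns, the sample table names, and whole-name matching are assumptions made for illustration; the sketch does not handle commented-out lines.

```python
import re
import tempfile
from pathlib import Path
from typing import List


def read_patterns(path: Path) -> List[str]:
    """Hypothetical helper: return one regex per non-blank line of the file."""
    return [
        line.strip()
        for line in path.read_text().splitlines()
        if line.strip()
    ]


with tempfile.TemporaryDirectory() as tmp:
    whitelist_file = Path(tmp) / "whitelist.txt"
    blacklist_file = Path(tmp) / "blacklist.txt"
    # The entries from the example above, one pattern per line.
    whitelist_file.write_text("^.*_bdd$\n")
    blacklist_file.write_text("claims_bdd\n")
    whitelist = read_patterns(whitelist_file)
    blacklist = read_patterns(blacklist_file)

# Hypothetical table names; only the first three carry the _bdd suffix.
tables = ["claims_bdd", "orders_bdd", "returns_bdd", "staging_orders"]

# Keep tables that match the whitelist and do not match the blacklist.
# Assumption: patterns must match the whole table name (fullmatch).
processed = [
    t for t in tables
    if any(re.fullmatch(p, t) for p in whitelist)
    and not any(re.fullmatch(p, t) for p in blacklist)
]

print(processed)  # ['orders_bdd', 'returns_bdd']
```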