Using whitelists and blacklists

A whitelist specifies which Hive tables should be processed in Big Data Discovery, while a blacklist specifies which Hive tables should be ignored during data processing.

Default lists are provided in the DP CLI package:

cli_whitelist.txt is the default whitelist name. The default whitelist is empty, so it does not select any Hive tables.
cli_blacklist.txt is the default blacklist name. The default blacklist contains a single regex, .+, which matches all Hive table names (therefore all Hive tables are blacklisted and will not be imported).

Both files include commented-out samples of regular expressions that you can use as patterns for your tables.

To specify the whitelist, use this syntax:

--whiteList cli_whitelist.txt

To specify the blacklist, use this syntax:

--blackList cli_blacklist.txt

Both lists are optional when running the DP CLI. However, you must use the --database flag if you want to use one or both of the lists.

If you manually run the DP CLI with the --table flag to process a specific table, the whitelist and blacklist validations will not be applied.

List syntax

The --whiteList and the --blackList flags each take a text file as their argument. Each text file contains one or more regular expressions (regex), with one regex pattern per line. The patterns are used only to match Hive table names. A table name matches a list as long as at least one pattern in that list matches it.
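The one-pattern-per-line format and the "any pattern matches" rule can be sketched in Python. This is only an illustration, not the DP CLI's own parser: the function names load_patterns and matches_any are hypothetical, the # comment prefix is an assumption (the source does not say how sample lines are commented out), and the use of a full-string match is also an assumption.

```python
import re

def load_patterns(text, comment_prefix="#"):
    """Compile one regex per non-blank, non-comment line of a list file's text.

    The "#" comment prefix is an assumption made for this sketch.
    """
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(comment_prefix):
            continue  # skip blank lines and commented-out samples
        patterns.append(re.compile(line))
    return patterns

def matches_any(table_name, patterns):
    """A name matches the list if at least one pattern matches it.

    Full-string matching is an assumption; the source does not state
    whether partial matches count.
    """
    return any(p.fullmatch(table_name) for p in patterns)

# Hypothetical file content: one commented-out sample and one active pattern.
sample_text = "# ^sample.*\n^bdd.*\n"
patterns = load_patterns(sample_text)
print(len(patterns))                       # 1
print(matches_any("bdd_sales", patterns))  # True
```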

The default whitelist and blacklist contain commented-out sample regular expressions that you can use as patterns for your tables. You must edit the whitelist file to include at least one regular expression that specifies the tables to be ingested. The blacklist by default excludes all tables with the .+ regex, which means you have to edit the blacklist if you want to exclude only specific tables.

For example, suppose you wanted to process any table whose name started with bdd, such as bdd_sales. The whitelist would have this regex entry:

^bdd.*

You could then run the DP CLI with the whitelist, and not specify the blacklist.
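As a quick check of what the ^bdd.* entry selects, the following Python sketch tests it against a few table names. Apart from bdd_sales, the names are hypothetical. Because the pattern is anchored with ^ and ends in .*, it gives the same result under full-string and partial matching.

```python
import re

pattern = re.compile(r"^bdd.*")

# Hypothetical table names, except bdd_sales, which comes from the text above.
for name in ["bdd_sales", "bdd", "sales_bdd", "my_bdd_table"]:
    print(name, bool(pattern.fullmatch(name)))
# bdd_sales True
# bdd True
# sales_bdd False
# my_bdd_table False
```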

List processing

The pattern matcher in the Data Processing workflow uses this algorithm:

The whitelist is parsed first. If the whitelist is not empty, then a list of Hive tables to process is generated. If the whitelist is empty, then no Hive tables are ingested.
If the blacklist is present, the blacklist pattern matching is performed. Otherwise, blacklist matching is ignored.

To summarize, the whitelist is parsed first and generates a list of Hive tables to process; the blacklist is parsed second and generates a list of Hive table names to skip. The blacklist typically narrows the set produced by the whitelist: if the same name appears in both lists, that table is not processed. In effect, the blacklist can remove names from the whitelist.
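The two-step selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual Data Processing implementation: select_tables is a hypothetical name, the lists are passed as already-loaded pattern strings, and full-string matching is assumed.

```python
import re

def select_tables(table_names, whitelist, blacklist=None):
    """Return the tables to process, whitelist first, then blacklist.

    whitelist and blacklist are lists of regex strings; blacklist=None
    models running without the --blackList flag.
    """
    if not whitelist:
        return []  # an empty whitelist selects no tables

    allowed = [re.compile(p) for p in whitelist]
    # Full-string matching is an assumption of this sketch.
    selected = [t for t in table_names
                if any(p.fullmatch(t) for p in allowed)]

    if blacklist is None:
        return selected  # no blacklist given, so nothing is removed

    blocked = [re.compile(p) for p in blacklist]
    return [t for t in selected
            if not any(p.fullmatch(t) for p in blocked)]

# Hypothetical table names.
tables = ["bdd_sales", "bdd_claims", "web_logs"]
print(select_tables(tables, [r"^bdd.*"]))                   # ['bdd_sales', 'bdd_claims']
print(select_tables(tables, [r"^bdd.*"], [r"bdd_claims"]))  # ['bdd_sales']
print(select_tables(tables, []))                            # []
```

A blacklist containing only .+ removes every whitelisted name in this sketch, which mirrors the behavior of the default blacklist.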

Example

To illustrate how these lists work, assume that you have 10 Hive tables with sales-related information. Those 10 tables have a _bdd suffix in their names, such as claims_bdd. To include them in data processing, you create a whitelist.txt file with this regex entry:

^.*_bdd$

If you then want to process all *_bdd tables except for the claims_bdd table, you create a blacklist.txt file with this entry:

claims_bdd

When you run the DP CLI with both the --whiteList and --blackList flags, all the *_bdd tables will be processed except for the claims_bdd table.
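The outcome of this example can be reproduced with a short Python sketch. Only claims_bdd comes from the text; the other table names are hypothetical, and full-string matching is assumed. If the matcher accepted partial matches instead, the unanchored claims_bdd entry would also exclude names that merely contain that string.

```python
import re

# claims_bdd is from the example; the other names are hypothetical.
tables = ["claims_bdd", "orders_bdd", "returns_bdd", "web_logs"]

whitelist = [re.compile(r"^.*_bdd$")]
blacklist = [re.compile(r"claims_bdd")]

# Step 1: keep names matched by at least one whitelist pattern.
whitelisted = [t for t in tables if any(p.fullmatch(t) for p in whitelist)]

# Step 2: drop names matched by at least one blacklist pattern.
processed = [t for t in whitelisted
             if not any(p.fullmatch(t) for p in blacklist)]

print(whitelisted)  # ['claims_bdd', 'orders_bdd', 'returns_bdd']
print(processed)    # ['orders_bdd', 'returns_bdd']
```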