A whitelist specifies which Hive tables should be processed in Big Data Discovery, while a blacklist specifies which Hive tables should be ignored during data processing.
cli_whitelist.txt is the default whitelist name. The default whitelist is empty, so it does not select any Hive tables.
cli_blacklist.txt is the default blacklist name. The default blacklist has one .+ regex, which matches all Hive table names (therefore all Hive tables are blacklisted and will not be imported).
Both files include commented-out samples of regular expressions that you can use as patterns for your tables.
The lists are passed to the DP CLI with these flags:
--whiteList cli_whitelist.txt
--blackList cli_blacklist.txt
Both lists are optional when running the DP CLI. However, you must use the --database flag if you want to use one or both of the lists.
If you manually run the DP CLI with the --table flag to process a specific table, the whitelist and blacklist validations will not be applied.
List syntax
The --whiteList and --blackList flags each take a text file as their argument. Each text file contains one or more regular expressions (regex), with one regex pattern per line. The patterns are used only to match Hive table names; a table name matches a list as long as at least one pattern in that list matches it.
The default whitelist and blacklist contain commented-out sample regular expressions that you can use as patterns for your tables. You must edit the whitelist file to include at least one regular expression that specifies the tables to be ingested. The blacklist by default excludes all tables with the .+ regex, which means you have to edit the blacklist if you want to exclude only specific tables.
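As a rough illustration of the file layout described above (one regex per line, with some lines commented out), the following Python sketch reads the active patterns from the text of a list file. It is not the DP CLI's own parser. The function name parse_patterns is hypothetical, and the # comment marker is an assumption; check the default files in your installation for the marker they actually use.

```python
import re


def parse_patterns(text, comment_prefix="#"):
    """Return compiled regexes, one per non-blank, non-commented line.

    The comment prefix is an assumption, not taken from the DP CLI itself.
    """
    patterns = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and commented-out sample patterns.
        if not line or line.startswith(comment_prefix):
            continue
        patterns.append(re.compile(line))
    return patterns


# Hypothetical whitelist contents: one commented-out sample, one active entry.
sample_whitelist = """\
# ^sales_.*
^bdd.*
"""

active = [p.pattern for p in parse_patterns(sample_whitelist)]
print(active)  # ['^bdd.*']
```

A file that contains only commented-out samples yields no active patterns, which corresponds to the empty default whitelist.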
For example, suppose you wanted to process any table whose name started with bdd, such as bdd_sales. The whitelist would have this regex entry:
^bdd.*
You could then run the DP CLI with the whitelist, and not specify the blacklist.
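To check what a whitelist entry such as ^bdd.* would select before running the DP CLI, you can test it against a few table names. The sketch below uses Python's re module purely as an illustration. The table names are hypothetical, and the DP CLI's own regex engine may differ in details, although a simple pattern like this one behaves the same in most regex dialects.

```python
import re

# The whitelist entry from the example above.
whitelist_entry = re.compile(r"^bdd.*")

# Hypothetical Hive table names, used only for illustration.
table_names = ["bdd_sales", "bdd", "sales_bdd", "my_bdd_table"]

# Keep only the names that start with "bdd".
selected = [name for name in table_names if whitelist_entry.match(name)]
print(selected)  # ['bdd_sales', 'bdd']
```

Names that merely contain bdd, such as sales_bdd, are not selected because the pattern is anchored to the start of the name.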
List processing
To summarize, the whitelist is parsed first, which generates a list of Hive tables to process, and the blacklist is parsed second, which generates a list of Hive table names to skip. Typically, the names from the blacklist modify those generated by the whitelist. If the same name appears in both lists, that table is not processed; in effect, the blacklist can remove names from the whitelist.
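The processing order can be sketched in Python as follows. This is a minimal model of the behavior described above, not the DP CLI implementation. The function select_tables and the table names are hypothetical, and the use of full-name matching (re.fullmatch) is an assumption. An empty whitelist is modeled as selecting nothing, in line with the description of the default whitelist; the case where the --whiteList flag is omitted entirely is not modeled.

```python
import re


def select_tables(table_names, whitelist, blacklist):
    """Return names matching any whitelist regex and no blacklist regex."""

    def matches_any(name, patterns):
        # Assumption: a pattern must match the whole table name.
        return any(re.fullmatch(p, name) for p in patterns)

    # Step 1: the whitelist generates the candidate tables.
    whitelisted = [n for n in table_names if matches_any(n, whitelist)]
    # Step 2: the blacklist generates the names to skip.
    skipped = {n for n in whitelisted if matches_any(n, blacklist)}
    # A name that appears in both lists is not processed.
    return [n for n in whitelisted if n not in skipped]


tables = ["bdd_sales", "bdd_claims", "web_logs"]  # hypothetical names

print(select_tables(tables, [], []))                      # []
print(select_tables(tables, ["^bdd.*"], []))              # ['bdd_sales', 'bdd_claims']
print(select_tables(tables, ["^bdd.*"], ["bdd_claims"]))  # ['bdd_sales']
print(select_tables(tables, ["^bdd.*"], [".+"]))          # []
```

The last call mirrors the default blacklist: its single .+ entry matches every name, so nothing is processed until the blacklist is edited.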
Example
Suppose that several Hive tables have a _bdd suffix in their names, such as claims_bdd. To include them in data processing, you create a whitelist.txt file with this regex entry:
^.*_bdd$
If you then want to process all of the *_bdd tables except for the claims_bdd table, you create a blacklist.txt file with this entry:
claims_bdd
When you run the DP CLI with both the --whiteList and --blackList flags, all the *_bdd tables will be processed except for the claims_bdd table.