Harvesting Object Storage Files as Logical Data Entities

Your data lake typically has a large number of files that represent a single data set. The file naming conventions indicate that multiple files are part of a single logical data entity.

You can group multiple Object Storage files into logical data entities in your data catalog using filename patterns. A logical data entity is like any other data entity and can be used for search and discovery. Using logical data entities, you can organize your data lake content meaningfully and prevent the explosion of data entities and attributes in your data catalog.

Typical tasks you perform while harvesting Object Storage files as logical data entities:

  1. Create a pattern.
  2. Assign the pattern to an Object Storage data asset.
  3. Harvest the data asset.
  4. View harvested logical data entities.

Understanding Logical Data Entities

Consider the following set of files:

myserv/20191205_yny_myIOTSensor.json
myserv/20191105_yny_myIOTSensor.json
myserv/20191005_yny_myIOTSensor.json
myserv/20190905_yny_myIOTSensor.json
myserv/20191005_hyd_my2ndIOTSensor.json
myserv/20190905_hyd_my2ndIOTSensor.json
myserv/20191005_bom_my3rdIOTSensor.json
myserv/20190905_bom_my3rdIOTSensor.json
myserv/somerandomfile_2019AUG05.json

If you harvest these files in your Oracle Object Storage data source without creating filename patterns, Data Catalog creates nine individual data entities in your data catalog. With hundreds of files in your data source, you would end up with hundreds of data entities in your data catalog.

Using filename patterns, you can group the example set of files into logical data entities. Any files that are not matched are created as separate File type data entities.

myserv/20191205_yny_myIOTSensor.json
myserv/20191105_yny_myIOTSensor.json
myserv/20191005_yny_myIOTSensor.json
myserv/20190905_yny_myIOTSensor.json
myserv/20191005_hyd_my2ndIOTSensor.json
myserv/20190905_hyd_my2ndIOTSensor.json
myserv/20191005_bom_my3rdIOTSensor.json
myserv/20190905_bom_my3rdIOTSensor.json
myserv/somerandomfile_2019AUG05.json

Understanding Expressions

In Data Catalog, a filename pattern is defined using expressions.

An expression can have one or more components that you separate using a delimiter. Each component specifies a matching rule for the pattern. Filename patterns are created using Java regular expressions. You specify the regular expression that should be used to group your files into required logical data entities.

You can specify qualifiers that are used when parsing the expression. You can use the following qualifiers:

  • bucketName: Use this qualifier to specify that the bucket name should be derived from the path that matches the given expression. The bucketName qualifier is used only once in the expression and always as the first component of the expression. The bucketName qualifier value can be a static text or an expression.
  • logicalEntity: Use this qualifier to specify that the logical data entity name should be derived from the path that matches the given expression. You can use logicalEntity multiple times in an expression. The logicalEntity qualifier values can consist of static text or expressions.
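
To make the qualifier syntax concrete, here is a minimal sketch that turns an expression into a plain regular expression by rewriting each {qualifier:value} component as a capture group. It is an illustration under stated assumptions, not Data Catalog's implementation. Python's re module stands in for Java regular expressions (the simple constructs used here behave the same in both). The names QUALIFIER, to_regex, and derive_entity_name are hypothetical, as is the sample expression for the myserv files. The rule of joining the captured values with underscores is inferred from the examples in this topic.

```python
import re

# Matches one {bucketName:value} or {logicalEntity:value} component.
# Limitation of this sketch: the value may not contain braces itself.
QUALIFIER = re.compile(r"\{(?:bucketName|logicalEntity):([^{}]*)\}")

def to_regex(expression):
    """Rewrite every qualifier component as a capture group (assumed semantics)."""
    return re.compile(QUALIFIER.sub(lambda m: "(" + m.group(1) + ")", expression))

def derive_entity_name(expression, path):
    """Return the derived logical entity name, or None if the path does not match."""
    match = to_regex(expression).fullmatch(path)
    if match is None:
        return None
    # Assumption: bucket name and logicalEntity values are joined with "_".
    return "_".join(match.groups())

# Hypothetical expression for the myserv files shown earlier.
EXPRESSION = "{bucketName:myserv}/[0-9]*_[a-z]*_{logicalEntity:[a-zA-Z0-9]*}.json"

print(derive_entity_name(EXPRESSION, "myserv/20191205_yny_myIOTSensor.json"))
print(derive_entity_name(EXPRESSION, "myserv/somerandomfile_2019AUG05.json"))
```

The first call prints myserv_myIOTSensor and the second prints None, because somerandomfile_2019AUG05.json does not fit the expression and would remain a separate file entity.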

Logical Data Entities Examples

Consider the following filenames:

bling_metering/1970120520_yny_hourly_region_res_delayed.json
bling_metering/1973110523_yny_hourly_region_res_delayed.json
bling_metering/1988101605_hyd_daily_region_res_delayed.json
bling_metering/1991042302_yny_hourly_region_res_delayed.json
bling_metering/2019073019_zrh_daily_region_res_delayed.json
bling_metering/2019073020_zrh_monthly_region_res_delayed.json
bling_metering/some_random_file_123.json

Expression with one logicalEntity qualifier

To derive logical data entities based on frequency (hourly, daily, monthly) mentioned in the filename, you can use the following pattern expression:

{bucketName:bling_metering}/[0-9]*_[a-z]*_{logicalEntity:[a-z]*}_.*.json

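This expression uses the bucketName and logicalEntity qualifiers. In this example, [0-9]* matches any sequence of digits, [a-z]* matches any sequence of lowercase letters, and .* matches any sequence of characters. The unescaped . before json also matches any single character; use \. if you want to match only a literal period. The expression results in the following logical data entities: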

  1. bling_metering_monthly
    bling_metering/2019073020_zrh_monthly_region_res_delayed.json
  2. bling_metering_hourly
    bling_metering/1970120520_yny_hourly_region_res_delayed.json
    bling_metering/1973110523_yny_hourly_region_res_delayed.json
    bling_metering/1991042302_yny_hourly_region_res_delayed.json
  3. bling_metering_daily
    bling_metering/1988101605_hyd_daily_region_res_delayed.json
    bling_metering/2019073019_zrh_daily_region_res_delayed.json

Unmatched

bling_metering/some_random_file_123.json
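
The grouping above can be reproduced with a short sketch. It assumes that each qualifier acts like a capture group and that the entity name is the captured values joined with underscores; both are inferences from the listed results, not documented behavior. The regular expression is a hand translation of the pattern expression, Python's re module stands in for Java regular expressions, and group_files is a hypothetical helper.

```python
import re
from collections import defaultdict

# Hand translation of the expression above: {bucketName:...} and
# {logicalEntity:...} are written as plain capture groups (an assumption).
PATTERN = re.compile(r"(bling_metering)/[0-9]*_[a-z]*_([a-z]*)_.*.json")

FILES = [
    "bling_metering/1970120520_yny_hourly_region_res_delayed.json",
    "bling_metering/1973110523_yny_hourly_region_res_delayed.json",
    "bling_metering/1988101605_hyd_daily_region_res_delayed.json",
    "bling_metering/1991042302_yny_hourly_region_res_delayed.json",
    "bling_metering/2019073019_zrh_daily_region_res_delayed.json",
    "bling_metering/2019073020_zrh_monthly_region_res_delayed.json",
    "bling_metering/some_random_file_123.json",
]

def group_files(pattern, paths):
    """Split paths into {entity name: [paths]} and a list of unmatched paths."""
    entities = defaultdict(list)
    unmatched = []
    for path in paths:
        match = pattern.fullmatch(path)
        if match:
            entities["_".join(match.groups())].append(path)
        else:
            unmatched.append(path)
    return dict(entities), unmatched

entities, unmatched = group_files(PATTERN, FILES)
for name in sorted(entities):
    print(name, len(entities[name]))
print("unmatched:", unmatched)
```

Running the sketch reports two files for bling_metering_daily, three for bling_metering_hourly, one for bling_metering_monthly, and some_random_file_123.json as unmatched.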

To derive logical data entities based on regions (yny, hyd, zrh) mentioned in the filename, you can use either of the following pattern expressions:

{bucketName:bling_metering}/[0-9]*_{logicalEntity:yny|hyd|zrh}_[a-z]*_region_res_delayed.json
{bucketName:bling_metering}/[0-9]*_{logicalEntity:[a-z]*}_[a-z]*_.*.json

Either expression results in the following logical data entities:

  1. bling_metering_zrh
    bling_metering/2019073020_zrh_monthly_region_res_delayed.json
    bling_metering/2019073019_zrh_daily_region_res_delayed.json
  2. bling_metering_yny
    bling_metering/1970120520_yny_hourly_region_res_delayed.json
    bling_metering/1973110523_yny_hourly_region_res_delayed.json
    bling_metering/1991042302_yny_hourly_region_res_delayed.json
  3. bling_metering_hyd
    bling_metering/1988101605_hyd_daily_region_res_delayed.json
    

Unmatched

bling_metering/some_random_file_123.json

Expression with multiple logicalEntity qualifiers

To derive logical data entities based on regions and frequency (hourly, daily, monthly) mentioned in the filename, you can use the following pattern expression:

{bucketName:bling_metering}/[0-9]*_{logicalEntity:[a-z]*}_{logicalEntity:[a-z]*}_region_res_delayed.json

The above expression uses the bucketName and two logicalEntity qualifiers. The expression results in the following logical data entities:

  1. bling_metering_zrh_monthly
    bling_metering/2019073020_zrh_monthly_region_res_delayed.json
  2. bling_metering_hyd_daily
    bling_metering/1988101605_hyd_daily_region_res_delayed.json
  3. bling_metering_zrh_daily
    bling_metering/2019073019_zrh_daily_region_res_delayed.json
  4. bling_metering_yny_hourly
    bling_metering/1970120520_yny_hourly_region_res_delayed.json
    bling_metering/1973110523_yny_hourly_region_res_delayed.json
    bling_metering/1991042302_yny_hourly_region_res_delayed.json

Unmatched

bling_metering/some_random_file_123.json

Expression with no logicalEntity qualifier

If no logicalEntity qualifier is specified, the filename pattern name is used as the logical data entity name. For example, consider the following expression for a filename pattern named "bling pattern":

{bucketName:bling_metering}/[0-9]*_[a-z]*_[a-z]*_.*.json

The above expression uses the bucketName qualifier, but no logicalEntity qualifier. The expression results in the following logical data entities:

  1. bling pattern
    bling_metering/2019073020_zrh_monthly_region_res_delayed.json
    bling_metering/1970120520_yny_hourly_region_res_delayed.json
    bling_metering/1973110523_yny_hourly_region_res_delayed.json
    bling_metering/1991042302_yny_hourly_region_res_delayed.json
    bling_metering/1988101605_hyd_daily_region_res_delayed.json
    bling_metering/2019073019_zrh_daily_region_res_delayed.json

Unmatched

bling_metering/some_random_file_123.json
Note

When you test an expression that has no logicalEntity qualifier, the test results show the expression itself as the logical data entity name. On harvesting, however, the name of the filename pattern is used as the logical data entity name.
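
The naming rule at harvest time, including the fallback to the filename pattern name, might be sketched as follows. This is an assumption-laden illustration rather than the service's logic: harvested_name is a hypothetical function, the regular expressions are hand translations with each qualifier written as a capture group, the first group is assumed to be bucketName, and Python's re module stands in for Java regular expressions.

```python
import re

# Hand translations of two expressions from this topic (assumed semantics).
NO_QUALIFIER = r"(bling_metering)/[0-9]*_[a-z]*_[a-z]*_.*.json"
TWO_QUALIFIERS = r"(bling_metering)/[0-9]*_([a-z]*)_([a-z]*)_region_res_delayed.json"

def harvested_name(pattern_name, regex, path):
    """Name a harvested entity; the first capture group is assumed to be bucketName."""
    match = re.fullmatch(regex, path)
    if match is None:
        return None                    # unmatched: stays a separate file entity
    bucket, *logical = match.groups()
    if not logical:
        return pattern_name            # no logicalEntity qualifier: use the pattern name
    return "_".join([bucket, *logical])

path = "bling_metering/1988101605_hyd_daily_region_res_delayed.json"
print(harvested_name("bling pattern", NO_QUALIFIER, path))    # bling pattern
print(harvested_name("bling pattern", TWO_QUALIFIERS, path))  # bling_metering_hyd_daily
```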

Creating a Filename Pattern

Filename patterns let you group multiple Object Storage files as logical data entities in your data catalog. If an Object Storage file matches multiple filename patterns, it can be part of multiple logical data entities.

Here's how you create a filename pattern:
  1. On your data catalog instance Home tab, click Manage Filename Patterns from Quick Actions to access the Filename Patterns tab.
  2. Click Create Filename Patterns.
  3. In the Create Filename Pattern panel, enter a Name for your filename pattern.
  4. Optionally, enter a Description for the filename pattern.
  5. For Expression, enter a regular expression that corresponds to your Oracle Object Storage files. The files matching this regular expression are grouped into logical data entities in the data catalog.
  6. To test your expression, enter a list of filenames in Test filenames and click Test Expression. The test results show the logical data entities that are derived using the specified expression. Any unmatched files are created as separate file type data entities.
  7. Modify the filename pattern expression and test it until you get the required logical data entities for your file set.
  8. Click Create.
Your filename pattern is successfully created. You can now assign this pattern to a data asset. During harvest, these patterns are used to derive logical data entities.

Viewing All Filename Patterns

On your data catalog instance Home tab, click Filename Patterns to access the Filename Patterns tab. You can also click + from tabs and select Filename Patterns.

All filename patterns created in your data catalog are listed in the Filename Patterns tab. Use the available filters to refine the list to display filename patterns that were updated before or after a specified date.

Additionally, from the Filename Patterns tab, you can create, view, edit, test, or delete a filename pattern.

Viewing a Filename Pattern's Details

  1. On your data catalog instance Home tab, click Filename Patterns to access the Filename Patterns tab.
  2. Click the filename pattern name. Alternatively, click the Actions icon (three dots) for the filename pattern and select View Details. The filename pattern details tab displays.
The filename pattern details tab displays the filename pattern information, such as description, updated by, last updated, and expression. The Associated Logical Data Entities list shows the logical data entities derived using the filename pattern. If this list is empty, then either the filename pattern is not assigned to any data asset yet, or the data asset has not yet been harvested with the assigned filename patterns. Use the Refresh link to retrieve the latest information for the filename pattern.

Additionally, you can edit, test, or delete the filename pattern.

Assigning Filename Patterns to Data Assets

  1. On the Home tab, click Data Assets to access the Data Assets tab.
  2. In the Data Assets list, select the Object Storage data asset for which you want to assign filename patterns.
  3. In the Summary tab on the data asset details tab, under Filename Patterns, click Assign Filename Patterns.
  4. From the Assign Filename Patterns panel, select the filename patterns that you want to assign to this data asset. You can use the filter box to filter the filename patterns by name. You can also deselect already assigned filename patterns to unassign them from this data asset.
  5. Click Assign.
The selected filename patterns are assigned to the data asset. When you harvest the data asset, these filename patterns are used to derive logical data entities. The names of the files in the Object Storage bucket are matched to the pattern expression and logical data entities are formed.
Note

When you assign a new filename pattern to a data asset, the status of any harvested logical data entities is set to Inactive. You need to harvest the data asset again to derive valid logical data entities.
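
As a rough illustration of what happens when several filename patterns are assigned to one data asset, the sketch below applies two patterns to the same file and shows that the file ends up in two logical data entities. The pattern names by_frequency and by_region and the harvest function are hypothetical, the regular expressions are hand translations with qualifiers written as capture groups, and Python's re module stands in for Java regular expressions.

```python
import re
from collections import defaultdict

# Hypothetical assigned patterns, hand translated to plain regular expressions.
ASSIGNED = {
    "by_frequency": re.compile(r"(bling_metering)/[0-9]*_[a-z]*_([a-z]*)_.*.json"),
    "by_region": re.compile(r"(bling_metering)/[0-9]*_([a-z]*)_[a-z]*_.*.json"),
}

def harvest(paths):
    """Match every path against every assigned pattern and collect memberships."""
    entities = defaultdict(set)
    for path in paths:
        for regex in ASSIGNED.values():
            match = regex.fullmatch(path)
            if match:
                entities["_".join(match.groups())].add(path)
    return dict(entities)

path = "bling_metering/2019073019_zrh_daily_region_res_delayed.json"
result = harvest([path])
print(sorted(result))  # ['bling_metering_daily', 'bling_metering_zrh']
```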

Viewing Harvested Logical Data Entities

  1. On the Home tab, click Data Entities to access the Data Entities tab.
  2. From the Filters, select Logical for Data Entity Type.
  3. From the filtered list, select the logical data entity you want to view.
  4. View general information for the logical data entity from the Summary tab.
  5. View the attributes for the logical data entity from the Attributes tab.
  6. View the Object Storage files associated with this logical data entity from the Files tab. You can also view the filename pattern that was used to derive this logical data entity. Expression displays the realized expression that resulted in the logical data entity from the assigned filename pattern expression.
    Note

    An Object Storage file can be part of multiple logical data entities if it is matched by multiple filename patterns.
You can now annotate the harvested logical data entities and attributes with terms and tags.

Editing a Filename Pattern

Here's how you edit a filename pattern:
  1. On your data catalog instance Home tab, click Filename Patterns to access the Filename Patterns tab.
  2. In the Filename Patterns list, click the Actions icon (three dots) for the filename pattern you want to edit and select Edit. The Edit Filename Pattern panel opens. You can also edit a filename pattern by using the Edit button on the filename pattern's details page.
  3. Make the required edits to the filename pattern. If you edit the Expression, ensure that you test it.
    Note

    When you modify a filename pattern expression, the status of any harvested logical data entities is set to Inactive. You need to harvest the associated data asset again to derive valid logical data entities.
  4. Click Save Changes.
Your filename pattern is successfully updated. You can now assign this pattern to a data asset. During harvest, these patterns are used to derive logical data entities.

Testing a Filename Pattern

Here's how you test a filename pattern:
  1. On your data catalog instance Home tab, click Filename Patterns to access the Filename Patterns tab.
  2. In the Filename Patterns list, click the Actions icon (three dots) for the filename pattern you want to test and select Test. The Test Filename Pattern panel opens. You can also test a filename pattern expression by using the Test button on the filename pattern's details page.
  3. Enter the filenames you want to test against the expression in Test filenames.
  4. Click Test Expression.
  5. Verify the resulting logical data entities. If the results are not what you require, edit the filename pattern and enter a different expression.
  6. Click Close.
Your filename pattern expression is successfully verified. You can now assign this pattern to a data asset. During harvest, these patterns are used to derive logical data entities.

Deleting a Filename Pattern

Here's how you delete a filename pattern:
  1. On your data catalog instance Home tab, click Filename Patterns to access the Filename Patterns tab.
  2. In the Filename Patterns list, click the Actions icon (three dots) for the filename pattern you want to delete and select Delete. The Delete Filename Pattern panel opens. You can also delete a filename pattern by using the Delete button on the filename pattern's details page.
  3. Click Delete to confirm the deletion.
The filename pattern is successfully deleted.

Unassigning a Filename Pattern

When you unassign a filename pattern from a data asset, the status of any harvested logical data entities is set to Inactive. You need to harvest the data asset again to derive valid logical data entities.

  1. On the Home tab, click Data Assets to access the Data Assets tab.
  2. In the Data Assets list, select the Object Storage data asset from which you want to unassign filename patterns.
  3. In the Summary tab on the data asset details tab, under Filename Patterns, click the actions menu for the filename pattern you want to unassign from the data asset. Then, click Unassign Filename Patterns.
The filename pattern is successfully unassigned from the data asset.

Understanding Logical Data Entity Statuses

When a logical data entity is created during harvest, the status of the logical data entity is Active.

Logical data entities are created from the actual names of the files in the Object Storage buckets, the filename pattern expression, and the filename patterns assigned to the data assets. When there is a change to the Object Storage files, filename pattern expression, or the associated filename patterns, the status of the logical data entity can change.

  • Object Storage: If your Object Storage buckets have more or fewer files since the last harvest, your data catalog reflects the changes on reharvest. New files are added to the logical data entity and deleted files are removed from the logical data entity.
  • Filename pattern expression: When you modify an expression in a filename pattern, there is no certainty that the logical data entities created on reharvest are the same as the logical data entities created during the previous harvest. To indicate this state of uncertainty, Data Catalog sets the status of the logical data entities harvested using the expression to Inactive. During reharvest, if the same logical data entities are derived, Data Catalog sets their status to Active again. Logical data entities from the previous harvest that were not derived during reharvest remain Inactive. You need to explicitly delete them as needed.
  • Filename patterns: Suppose you assign a new filename pattern to a data asset or unassign an existing filename pattern from a data asset. There is no certainty that the logical data entities created on reharvest are the same as the logical data entities created during the previous harvest. To indicate this state of uncertainty, Data Catalog sets the status of the logical data entities harvested using the filename patterns to Inactive. During reharvest, if the same logical data entities are derived, Data Catalog sets their status to Active again. Logical data entities from the previous harvest that were not derived during reharvest remain Inactive. You need to explicitly delete them as needed.
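
The status changes described above can be modeled as a small state table. The sketch below is a simplified model, not the service's implementation: invalidate and reharvest are hypothetical function names, and the scenario (an edited expression that no longer derives the monthly entity) is made up for illustration.

```python
ACTIVE, INACTIVE = "Active", "Inactive"

def invalidate(statuses):
    """Expression edited or patterns assigned/unassigned: affected entities become Inactive."""
    return {name: INACTIVE for name in statuses}

def reharvest(statuses, derived_names):
    """Entities derived on reharvest are Active; entities not derived keep their status."""
    updated = dict(statuses)
    for name in derived_names:
        updated[name] = ACTIVE
    return updated

# After the first harvest, every derived entity is Active.
statuses = {
    "bling_metering_hourly": ACTIVE,
    "bling_metering_daily": ACTIVE,
    "bling_metering_monthly": ACTIVE,
}

statuses = invalidate(statuses)  # the expression was edited
statuses = reharvest(statuses, {"bling_metering_hourly", "bling_metering_daily"})

# Entities that stay Inactive are candidates for explicit deletion.
stale = sorted(name for name, status in statuses.items() if status == INACTIVE)
print(stale)  # ['bling_metering_monthly']
```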

Understanding Limitations

  • If an Object Storage bucket contains archived files, then you must select the Include unrecognized option to create a logical data entity with only the archived files.
  • If you select the Incremental harvest option while harvesting data assets with associated filename patterns, the following changes in your Object Storage are reflected in your data catalog:
    • When new files are added to your Object Storage bucket that match any existing logical data entity, they are added to the logical data entity.
    • When new files are added to your Object Storage bucket that derive new logical data entities, the new logical data entities are created and the files are added to the new logical data entity.
    • When files that were associated with a logical data entity are deleted from your Object Storage bucket, they are removed from the logical data entity.
    • When all the files associated with a logical data entity are deleted from your Object Storage bucket, the logical data entity itself is removed from your data catalog.
    Note

    Incremental harvest for logical data entities doesn't track the changes made to your filename pattern expressions, or the filename patterns associated with the data asset. You must perform a full harvest to be in sync with all the updates.
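
The four incremental harvest behaviors listed above can be sketched as a set reconciliation. This is a simplified model under stated assumptions: incremental_harvest and name_for are hypothetical names, the regular expression is a hand translation with qualifiers written as capture groups, and Python's re module stands in for Java regular expressions. Only files that are new to the bucket are matched, which mirrors the note that pattern changes are not picked up without a full harvest.

```python
import re

PATTERN = re.compile(r"(bling_metering)/[0-9]*_[a-z]*_([a-z]*)_.*.json")

def name_for(path):
    """Derive the entity name for a path, or None if it does not match."""
    match = PATTERN.fullmatch(path)
    return "_".join(match.groups()) if match else None

def incremental_harvest(entities, bucket_files):
    """Reconcile {entity name: set of files} with the files now in the bucket."""
    known = set().union(*entities.values())
    updated = {name: set(files) for name, files in entities.items()}
    # New files join an existing entity or create a new one.
    # Unmatched new files are ignored in this sketch.
    for path in bucket_files - known:
        name = name_for(path)
        if name is not None:
            updated.setdefault(name, set()).add(path)
    # Deleted files leave their entity; an entity with no files is removed.
    for name in list(updated):
        updated[name] &= bucket_files
        if not updated[name]:
            del updated[name]
    return updated

daily_a = "bling_metering/1988101605_hyd_daily_region_res_delayed.json"
daily_b = "bling_metering/2019073019_zrh_daily_region_res_delayed.json"
monthly = "bling_metering/2019073020_zrh_monthly_region_res_delayed.json"
hourly = "bling_metering/1970120520_yny_hourly_region_res_delayed.json"

before = {"bling_metering_daily": {daily_a}, "bling_metering_monthly": {monthly}}
after = incremental_harvest(before, {daily_a, daily_b, hourly})
print(sorted(after))  # ['bling_metering_daily', 'bling_metering_hourly']
```

In this run, daily_b joins the existing daily entity, the hourly file creates a new entity, and the monthly entity disappears because its only file was deleted.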