3 AI Foundation Data Batch Architecture
The AIF DATA batch schedule is responsible for managing foundation data for the Retail Analytics and Planning solutions and storing it in a shared data warehouse architecture. Understanding the core features of the data warehouse can help you make better use of the jobs and processes available to you. When working with Oracle Support, you may also be asked to modify the behavior of some jobs to resolve issues, and this chapter provides additional information on the AIF DATA processes you will be expected to interact with.
This chapter will cover the following major functions of the AIF DATA batch schedule:
- File ingestion – How customer data files are received and processed
- Data extract – How data is extracted from a source system within the Oracle Cloud for use in Analytics and Planning applications
- Data import – How customer data files are pulled into the Oracle database
- Data transform and load – How customer data is transformed and loaded into the data warehouse
- Data reclassification – How customer data is reorganized when an item or location is reclassified to a new hierarchy node
- Data aggregation – How customer data is aggregated and summarized for use in BI reports
- Data export – How customer data is exported from the data warehouse for use in other RAP applications
- Other supporting processes – Other technical aspects of the batch processing that supports the overall flow of data throughout the system
File Ingestion
Data files are accepted from external sources using a private cloud instance of Oracle Object Storage (OS). This OS instance is not comparable to the Oracle Public Cloud version: it does not have a user interface and cannot be accessed except through specific programming APIs provided by Oracle (such as the File Transfer Services or FTS). It is used for short-term data storage and retrieval only as a means of transferring files into or out of the Oracle Retail cloud. If you are not providing any files and instead using direct integration from Merchandising Foundation Cloud Service (MFCS) then these programs will not be used.
AIF DATA programs expect customer files to come in the form of ZIP files. The ZIP file must contain only a set of files from the pre-determined list of interfaces that AIF DATA batch supports. It cannot contain a folder structure; all files must be directly packaged into the ZIP file itself. The general steps performed once a ZIP file is uploaded to Object Storage are:
- The ZIP file is passed through a virus scanner and will not be consumed until the scan completes. A batch program (ZIP_FILE_WAIT_JOB) will wait up to a maximum of 4 hours for the ZIP file to become available for import, and will fail if the file does not become available in that time.
- The ZIP file is unpacked by another batch program (ZIP_FILE_UNLOAD_JOB) into a temporary directory where the files are cleansed using a variety of standard Unix functions, including dos2unix and methods to apply UTF-8 encoding.
- After file conversion is completed, all files used in prior batch runs are deleted and the new files are moved to the next stage of processing.
- Files are compared to a list of required filenames as specified in Retail Home. If any required files are missing, the batch program (DAT_FILE_VALIDATE_JOB) will fail. If all required files are found, then the entire set of files is moved into the final server directory where they can be read by later jobs.
Files cannot be consumed by the rest of the batch unless DAT_FILE_VALIDATE_JOB is run successfully, so it is a critical program that cannot be skipped or disabled. Even if you have marked all files as optional in Retail Home by disabling the CONTROLFILES modules, you must still run DAT_FILE_VALIDATE_JOB because it will move the files to the required folder structure on the application server.
Data Extract
When you subscribe to other Oracle Cloud applications that support integration with Retail Analytics and Planning (RAP) applications, such as Merchandising Foundation Cloud Service, you have the option to leverage direct database-to-database extracts that run as part of the AIF DATA batch schedule. These extracts are collectively referred to as the Retail Data Extractor (RDE) component of RAP. The key points of the RDE architecture are:
- Oracle GoldenGate technology is used to take a near-real-time copy of the source system’s data and place it on the local ADW instance where RAP is installed
- The local copy of the data is overlaid with database synonyms and permission grants so that AIF DATA batch programs can access it
- Extract programs in the AIF DATA schedule are usually prefixed with RDE to more easily identify them
- RDE programs read from the cloned tables, transform the data, and then insert the result to the staging tables in the data warehouse
For all other implementation details on using RDE programs, refer to the Retail Analytics and Planning Implementation Guide chapter on “Integration with Merchandising”.
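If you want to confirm that the replicated source data is accessible before relying on the RDE jobs, a quick query through the synonym layer is usually enough. The sketch below assumes ITEM_MASTER is one of the replicated Merchandising tables available in your environment; substitute any table included in your subscription.

```sql
-- Minimal check that the GoldenGate-replicated Merchandising data is visible
-- through the synonyms and grants described above. ITEM_MASTER is used only
-- as an illustrative table name.
SELECT COUNT(*) AS replicated_rows
FROM   ITEM_MASTER;
```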
Data Import
Data files that have been successfully prepared through the file ingestion process will then go through a two-step process to import the contents into the database. The first step will copy a specific file from the application server staging directory to another location where the database can access it. The second step will perform an External Table Load procedure, which is a standard feature of Oracle Autonomous Data Warehouse (ADW). The file is mapped to a temporary external table name, then the rows of data are read from the file and inserted into a pre-defined target table in our data model.
The file copy step is performed by batch jobs having COPY in their names. For example, W_MCAL_PERIOD_DS_COPY_JOB will copy the W_MCAL_PERIOD_DS.dat file. These jobs have the following parameters in POM that determine the behavior of the process:

- disJobType – The internal code for the program type, which affects the API call used when the job is run
- object – The filename being written to the destination
- objectFilepath – The source path and filename for the copy action

If these COPY jobs are not run successfully, then there will not be any file available for the next step to consume, and later steps in the batch will fail as a result.
The external table load step is performed by batch jobs having STG in their names. For example, W_MCAL_PERIOD_DS_STG_JOB will read the file data from W_MCAL_PERIOD_DS.dat and write it to the database. These jobs have the following parameters in POM that determine the behavior of the process:

- disJobType – The internal code for the program type, which affects the API call used when the job is run
- tableName – The target table in the database where the file data will be written when this job is executed
- filename – The source file that will be mapped to a temporary external table for processing, which must match an object in the COPY job parameters
- connectionType – The ADW connection tier used for this data load, one of low, medium, or high
- truncateTable – Instructs the load to truncate the target table before inserting (true) or append to the table (false)
You may modify the parameters connectionType and truncateTable based on guidance from Oracle Support or as needed during an implementation. The connection type of “low” is optimal for low volumes of data, but if you find yourself pushing large amounts of data into a program that uses a low connection, you can change the parameter value to “medium”, which can better handle high volumes. Do not make this change to the connection type unless you are sure it is necessary, and revert the setting to its original value after you are done loading data. You might also want to change truncateTable to “false” if you do not want to lose the data currently in a staging table when loading a new file to that table, such as when a customer provides their implementer with an incremental file that needs to be appended to data previously staged for loading. Like the connection type, do not change this setting unless required for a specific use case, and revert it to the default value (true) when finished with your loads.
The external table load (STG) programs also use the configuration options provided on context (CTX) files to further change the behavior of the data imports. Details on creating context files are provided in the Retail Analytics and Planning Implementation Guide chapter on “Data File Generation”. The context files are not directly used; they are pushed into ADW metadata tables using a dedicated POM job (RI_UPDATE_TENANT_JOB) and then the external table loads will read from that metadata to configure the imports. This metadata controls the following aspects of the file import: columns to read from the file, column delimiters, data formats, trimming of spaces, header row processing, rejected record tolerance, and any optional parameters supported by the DBMS_CLOUD package in ADW.
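As a rough illustration of the external table pattern (not the actual AIF DATA implementation), the sketch below shows how a pipe-delimited file could be mapped to a temporary external table with DBMS_CLOUD and then inserted into a staging table. The credential name, file URI, column list, and format values are placeholders; the real jobs derive these values from the POM parameters and the context-file metadata described above.

```sql
-- Illustrative only: simplified external table load for one staging file.
-- Credential, URI, column list, and format values are placeholders.
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
    table_name      => 'W_MCAL_PERIOD_DS_EXT',
    credential_name => 'OBJ_STORE_CRED',    -- hypothetical credential
    file_uri_list   => 'https://objectstorage.example.com/b/bucket/o/W_MCAL_PERIOD_DS.dat',
    column_list     => 'MCAL_CAL_ID VARCHAR2(30), MCAL_PERIOD_NAME VARCHAR2(60)',
    format          => JSON_OBJECT('delimiter'   VALUE '|',       -- column delimiter
                                   'skipheaders' VALUE '0',       -- header row processing
                                   'trimspaces'  VALUE 'lrtrim',  -- trimming of spaces
                                   'rejectlimit' VALUE '10')      -- rejected record tolerance
  );
END;
/

-- truncateTable = true clears the target first; false would append instead.
TRUNCATE TABLE W_MCAL_PERIOD_DS;

INSERT INTO W_MCAL_PERIOD_DS (MCAL_CAL_ID, MCAL_PERIOD_NAME)
SELECT MCAL_CAL_ID, MCAL_PERIOD_NAME
FROM   W_MCAL_PERIOD_DS_EXT;
```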
Transform and Load
Data that has been staged into the data warehouse (either by file imports or direct integrations) may go through additional transformations before being written to the final warehouse tables. When the data is directly loaded using RDE programs, all the transformation is already done and the next step would be to load it into the data warehouse. If the data came from a file with a CSV file extension, it needs additional processing before it can be used.
CSV file formats are provided to reduce the development overhead of creating interfaces for a set of related data warehouse tables. CSV files often combine two or more internal tables into one input file, so additional transformation is needed to split the data fields out after importing the file. The transformation of CSV file data is done using a combination of Oracle Data Integrator (ODI) programs and PL/SQL scripting, and the programs are collectively called Simplified Interface (SI) programs. The SI program’s responsibility is to split out the relevant fields from the input to all the other staging tables which require something from that file. For example, the PRODUCT.csv file is imported to a table named W_PRODUCT_DTS. From here, the SI program SI_W_PRODUCT_DS_JOB takes a subset of this data and moves it to W_PRODUCT_DS. Another SI program, SI_W_PROD_CAT_DHS_JOB, reads from the same source table and writes to W_PROD_CAT_DHS. This continues until all the fields from W_PRODUCT_DTS have been parsed out into their designated areas by SI jobs.
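Conceptually, each SI job performs a “select the columns I need” movement from the combined staging table into its own target. The sketch below is a simplified illustration of that split, using a handful of hypothetical column names; the real jobs handle many more columns along with data cleansing rules.

```sql
-- Simplified illustration of the SI split step. Column names are examples only.
INSERT INTO W_PRODUCT_DS (ITEM, ITEM_DESC, BRAND_NAME)
SELECT DISTINCT ITEM, ITEM_DESC, BRAND_NAME
FROM   W_PRODUCT_DTS;

-- A second SI job reads the same source table for the hierarchy columns.
INSERT INTO W_PROD_CAT_DHS (CATEGORY_ID, CATEGORY_DESC)
SELECT DISTINCT CATEGORY_ID, CATEGORY_DESC
FROM   W_PRODUCT_DTS;
```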
The target tables of the SI programs are the same tables directly populated by the RDE extracts, meaning the two data flows are mutually exclusive. Either RDE jobs or file-based jobs can be used to populate data into the data warehouse for a particular interface. Attempting to run both for the same table will result in errors and data conflicts. For additional examples of these data flows, refer to the Retail Analytics and Planning Implementation Guide chapter on “Batch Orchestration”, under the section “Managing Multiple Data Sources”.
After any transformation steps are complete, loading into the data warehouse target tables can begin. Load jobs take the data in staging tables like W_PRODUCT_DS and move it into internal tables like W_PRODUCT_D. The load programs are complex and perform many operations to get from the source to the target. The data warehouse applies primary and foreign key constraints, internal sequences for row tracking, audit fields such as insert/update timestamps, historical versions of modified records, data quality validations, and many other functions. The basic load jobs can be broadly split into dimension jobs (for tables ending with _D), fact jobs (for tables ending with _F), translation lookup jobs (for tables ending with _TL), and general load jobs (for tables ending with _G). For these programs, the name of the job in the AIF DATA batch schedule will correlate with the target table being loaded. For example, W_PRODUCT_D_JOB is responsible for loading the W_PRODUCT_D table. Understanding this naming scheme will allow you to connect the jobs being run in the batch schedule with the tables being populated in the data warehouse.
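Because of this naming scheme, a quick way to verify that a load job moved your data is to compare the staging table to its target. The query below is only a sanity-check sketch; target counts will normally exceed staging counts because the target retains history and previously loaded records.

```sql
-- Compare staged rows to the rows present in the target dimension after
-- W_PRODUCT_D_JOB completes. The target typically holds more rows because
-- it keeps historical versions of records.
SELECT (SELECT COUNT(*) FROM W_PRODUCT_DS) AS staged_rows,
       (SELECT COUNT(*) FROM W_PRODUCT_D)  AS target_rows
FROM   dual;
```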
Data Reclassification
A core feature of the data warehouse is its ability to manage reclassifications, which is the term used to describe the reassignment of an item or location from one position in your hierarchies to another. The item and location dimensions (and their hierarchies) are the only ones to support reclassification management. All other dimensions are taken “as-is”, meaning the data warehouse will not track major changes done to those records as a reclassification.
As it pertains to the AIF DATA batch schedule, the system applies a reclassification activity across two days, following this overall flow of events:
- Day 1 Batch
  - Incoming product and location data is compared to the data already in the data warehouse and changes to the hierarchy assignments are recorded in temporary tables. This comparison is done automatically and does not require you to tell the system about reclassified records separately.
  - For the initial batch when a reclassified item or location is sent in, the data is applied in the dimension tables like W_PRODUCT_D immediately, inserting a new record for the reclassified data and updating the existing record as inactive. Effective dates are assigned to both the old and new records to indicate the periods of time when that hierarchy classification was active. The CURRENT_FLG column indicates the latest version of the record.
  - If any fact records are provided for a reclassified item or location, that data is loaded relative to the new record’s unique identifier, but existing data remains on the old record’s identifier. Joining the fact and dimension tables together is how you cross-reference the old and new identifiers with the same item or location (see the sketch after this list).
  - A set of jobs prefixed with FACTOPEN and FACTCLOSE are responsible for maintaining positional facts like inventory and pricing, end-dating the old version of reclassed records and inserting new ones for the new dimension keys, even if the fact itself had no changes on that day. This is necessary to carry forward the positional balances on the new dimension keys, not the inactive ones.
- Day 2 Batch
  - At the very start of the next day’s batch processing, a series of jobs having RECLASS in the name will take effect. These jobs are responsible for recalculating aggregation tables in the data warehouse to regenerate fact data based on the latest classifications. This is done on the second day’s batch due to the amount of time it can take to run, so it’s best to do it early in the morning before users enter the system instead of doing it at the end of Day 1’s batch.
  - A side-effect of the split approach is that BI reporting will not reflect a reclassification fully until a day has elapsed and the aggregate tables are recalculated. A user may only notice this if they attempt to report on the current day’s fact data at an aggregate level on the same day a reclassification occurs, which will be a small subset of reports (if any). It is still better to run the aggregation programs outside of normal business hours versus running them during the business day in the first batch, which will impact the user experience more than the temporary data inconsistency does. This approach does not impact the other Analytics and Planning applications because they receive their data directly from the item/location fact tables, not the BI aggregate tables.
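The sketch below shows one way to cross-reference the old and new versions of a reclassified item by joining a fact aggregate back to the product dimension. The CURRENT_FLG column and the table names come from this chapter; the surrogate key and effective-date column names (ROW_WID, PROD_WID, ITEM, EFF_FROM_DT, EFF_TO_DT) are assumptions for illustration and may differ from the actual data model.

```sql
-- Hedged example: list every dimension version of one reclassified item and
-- the sales quantity recorded against each version's surrogate key.
-- Key and effective-date column names are assumed for illustration.
SELECT d.ROW_WID,
       d.ITEM,
       d.CURRENT_FLG,
       d.EFF_FROM_DT,
       d.EFF_TO_DT,
       SUM(f.SLS_QTY) AS sls_qty
FROM   W_PRODUCT_D d
LEFT JOIN W_RTL_SLS_IT_DY_A f
       ON f.PROD_WID = d.ROW_WID
WHERE  d.ITEM = '100001234'                -- example item number
GROUP BY d.ROW_WID, d.ITEM, d.CURRENT_FLG, d.EFF_FROM_DT, d.EFF_TO_DT
ORDER BY d.EFF_FROM_DT;
```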
Data Aggregation
The Analytics and Planning data warehouse supports the Retail Insights solution for all business intelligence and reporting needs. It does this using a comprehensive set of pre-aggregated data tables holding information at levels commonly used in reporting, such as department/week or subclass/store/week. The AIF DATA batch schedule is responsible for maintaining all of the aggregate tables. Each job in the batch schedule ending with _A_JOB populates an aggregate table in the data warehouse, with the job name directly corresponding to the table being populated (for example, the W_RTL_SLS_SC_LC_DY_A_JOB populates the W_RTL_SLS_SC_LC_DY_A table).
It is important to understand how the data warehouse performs aggregations as it can directly impact the way you provide your data to the system. Aggregations are performed using a different approach depending on the type of fact data:
- Transaction facts such as sales and receipts are aggregated incrementally and additively. The incoming data is summarized at each level of aggregation and added to the data already in the target table. This means that the incoming fact data must only contain the values that can be summed with existing records to reach the desired final numbers. Loading data in this manner has the best performance in batch processing as it requires the least number of changes to the aggregate tables. Data will be updated only if it is being altered by an incoming record.
- Positional facts such as inventory are aggregated as complete snapshots of the base fact data for daily aggregates and on the first day of a new week. The item/location records are rolled up to the desired level and then inserted to the aggregation table for the new daily or weekly positions. Weekly positional data is also updated incrementally during the week. The incoming records will update the current week’s positions with the changed values (non-additively, replacing the existing data with the new data).
There are multiple distinct batch jobs involved in fact aggregation and all must be enabled in the nightly batch cycles before any BI aggregates can be populated. Using the sales transaction fact aggregate W_RTL_SLS_IT_DY_A as an example, the data is processed in the following steps:

- Incremental sales data for the current date will be staged into the W_RTL_SLS_TRX_IT_LC_DY_FS table using either RDE programs or a file-based load.
- W_RTL_SLS_IT_DY_TMP_JOB aggregates the incoming transaction records from the staging layer to a temporary table W_RTL_SLS_IT_DY_TMP at the summarized level of item/day. The internal base fact table W_RTL_SLS_TRX_IT_LC_DY_F is not used; aggregate table data always comes from the staging layer only.
- W_RTL_SLS_IT_DY_A_JOB will take the data from W_RTL_SLS_IT_DY_TMP and load it to W_RTL_SLS_IT_DY_A using an additive MERGE statement. Existing records will be merged by summing the same column in the source and target (for example, t.SLS_QTY=s.SLS_QTY+t.SLS_QTY) while unmatched records will be inserted with all the values from the temporary table. Positional facts such as inventory will use a non-additive merge, overwriting the target column from the source table.
- Other aggregate tables that need item-level data, such as W_RTL_SLS_IT_WK_A, will also source from W_RTL_SLS_IT_DY_TMP to update their data using a similar merge statement. Week-level aggregates merge the daily data onto the week-level records using the system’s fiscal calendar to map the days to weeks.
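The additive merge described in the steps above can be pictured with the simplified statement below. Only two measures and two join keys are shown, and the key column names (PROD_WID, DT_WID) and the SLS_AMT_LCL measure are illustrative assumptions; the real program merges a much longer column list.

```sql
-- Simplified additive merge from the item/day temporary table into the
-- item/day aggregate. Key columns and most measures are omitted.
MERGE INTO W_RTL_SLS_IT_DY_A t
USING W_RTL_SLS_IT_DY_TMP s
   ON (t.PROD_WID = s.PROD_WID AND t.DT_WID = s.DT_WID)
WHEN MATCHED THEN UPDATE SET
     t.SLS_QTY     = s.SLS_QTY     + t.SLS_QTY,     -- additive: increment existing values
     t.SLS_AMT_LCL = s.SLS_AMT_LCL + t.SLS_AMT_LCL
WHEN NOT MATCHED THEN
     INSERT (PROD_WID, DT_WID, SLS_QTY, SLS_AMT_LCL)
     VALUES (s.PROD_WID, s.DT_WID, s.SLS_QTY, s.SLS_AMT_LCL);
-- A positional fact such as inventory would overwrite instead of add,
-- for example t.INV_QTY = s.INV_QTY.
```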
The most important behavior to understand is that the aggregate tables are updated from the daily staged data, not rebuilt from the base fact tables. A common point of confusion about the aggregation process is the expectation that lower-level aggregate tables directly populate the higher-level aggregates, which leads to the incorrect assumption that all aggregate tables are regenerated from the base fact every day. An implementer might think that a manual data change on a base fact table like W_RTL_SLS_TRX_IT_LC_DY_F will automatically appear in the BI aggregates after batch, which is not the case. For aggregates to be updated, the data must be loaded through the staging tables and then processed through the aggregation programs.
Data Exports
The AIF data warehouse is used to export certain datasets to planning applications such as Merchandise Financial Planning (MFP). Separate AIF DATA nightly batch jobs are responsible for pulling data out of the data warehouse tables and exporting it to tables in the Retail Data Exchange (RDX) schema. Only data exported to RDX is available for use in the planning applications; all other data warehouse tables are used only by Retail Insights or AI Foundation processes. This method of exporting data is specific to the planning applications; AI Foundation applications instead get their data by pulling from the data warehouse using AIF APPS batch schedule programs.
Data export programs use a different naming scheme starting with W_PDS, such as W_PDS_PRODUCT_D_JOB. These jobs use PL/SQL to read from the data warehouse tables and write to RDX schema tables. Export programs operate on a run-based architecture where each execution of the job will create a new Run ID to track the execution and write the complete set of outputs. Runs are managed using the tables RAP_INTF_CFG and RAP_INTF_RUN_STATUS, which can be monitored from Innovation Workbench. The full mapping between the source tables in the data warehouse and the target tables in the RDX schema is available in the Retail Analytics and Planning Implementation Guide chapter on “Data Processing and Transformations”.
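If you have access to Innovation Workbench, a simple query of the run-status table is usually the quickest way to review recent export runs. The sketch below assumes the table carries a run identifier column (the chapter only states that each execution creates a new Run ID), so adjust the column names to match the actual structure.

```sql
-- Review the most recent export runs. The RUN_ID column name is assumed;
-- inspect RAP_INTF_RUN_STATUS in Innovation Workbench for the exact columns.
SELECT *
FROM   RAP_INTF_RUN_STATUS
ORDER  BY RUN_ID DESC
FETCH FIRST 20 ROWS ONLY;
```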
Exports from the data warehouse to RDX tables share some common design patterns:
- Export programs, like all other AIF DATA jobs, leverage the C_LOAD_DATES table to track the job-level run status, and any existing entry in C_LOAD_DATES will prevent running the export program again if it has already been run once.
- Exported facts will join on their associated export dimensions to ensure data consistency across tables, such as W_PDS_SLS_IT_LC_WK_A_JOB only exporting results for item/locations also present on the W_PDS_PRODUCT_D and W_PDS_ORGANIZATION_D dimension export tables.
- The source and target tables use the same intersections of data, as defined in the table names. For example, the job W_PDS_SLS_IT_LC_WK_A_JOB reads from the table W_RTL_SLS_IT_LC_WK_A and writes to W_PDS_SLS_IT_LC_WK_A, and both the source and target tables are aggregated at the item/location/week levels.
- Identifiers for the item, location, and date values on the PDS export tables will be converted to their external-facing values, meaning it is not possible to directly join the input and output tables without also using dimension tables to convert the identifiers (the data warehouse tables do not hold the external codes; instead they have foreign key values referring to their associated dimensions).
- Exports of dimension data will be a full snapshot of currently active source records. Exports of positional fact data will be full snapshots of the current positions only (no historical data). Exports of transaction fact data will be incremental based on the change-tracking maintained in the W_UPDATE_DT columns of the source tables. Incremental export dates are tracked on the C_SOURCE_CDC table.
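When troubleshooting an export that refuses to re-run or that exported less data than expected, the two tracking tables above are the first things to inspect. The queries below are a sketch; the filter column on C_LOAD_DATES is an assumption and should be replaced with whatever identifies the export job in your environment.

```sql
-- Check whether the export already has a run entry for the business date.
-- The PACKAGE_NAME filter column is an assumption; confirm the real columns
-- in Innovation Workbench before relying on this query.
SELECT *
FROM   C_LOAD_DATES
WHERE  PACKAGE_NAME LIKE '%SLS_IT_LC_WK%';

-- Review the incremental change-tracking dates used by transaction exports.
SELECT *
FROM   C_SOURCE_CDC;
```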
Supporting Processes
The AIF DATA batch schedule has many programs that support the other major functions described in this chapter but whose purpose is not readily apparent from the job name. Review the table below for a list of such supporting processes and their primary functions within the data warehouse architecture.
| Job Name | Primary Functions |
|---|---|
| *_MILE_MARKER_JOB | Any job having the term “mile marker” in the name is a simple placeholder in POM that allows for easy tracking of key points in the batch sequence. They may also be used to attach job dependencies for related groups of processes that need a single starting or ending point in the batch flow. The job does not perform any operations on the system itself. |
| VARIABLE_REFRESH_JOB | The underlying batch execution tool, Oracle Data Integrator (ODI), uses temporary variables to hold onto any values that are specific to one batch execution, but in some cases the variables can persist across executions unless they are forcibly cleared by this program. If this program does not run, ODI may use invalid variable values from a prior run. |
| ETL_REFRESH_JOB | Erases the table C_LOAD_DATES, which tracks the run status of batch programs for the current business date. The table must be cleared before each new batch cycle so that jobs which have already run once are allowed to run again. |
| REFRESH_RADM_JOB | An important part of data warehouse operations is maintaining accurate statistics, like row counts on all database tables. Statistics are used by the database to select the optimal execution plan for a given process. This program dynamically detects tables with altered data and collects new statistics on those tables and their related objects. This job’s runtime will increase relative to the amount of data changing in the system every day. |
| ANALYZE_TEMP_TABLES_JOB | This process is similar to REFRESH_RADM_JOB, but it gathers statistics on the temporary (TMP) tables used as intermediate steps during batch processing, so that programs reading from those tables have accurate statistics available. |
| REFRESH_PDS_STATS_JOB | This process is similar to REFRESH_RADM_JOB, but it gathers statistics on the tables in the RDX schema that are populated by the data export programs for the planning applications. |
| RA_ERROR_COLLECTION_JOB | Maintains the table used to collect rejected records and data errors raised by other batch programs, so they can be reviewed after the batch completes. |
| RA_BATCHEND_VALIDATION_JOB | Performs one final check on the results of the nightly batch before it is marked complete, failing if unresolved issues from earlier in the batch are detected. |
| OBIEE_CACHE_CLEAR_JOB | Retail Insights relies on a data caching mechanism that is part of Oracle Analytics Server. This cache needs to be cleared at the end of every nightly batch cycle to ensure the next day’s reports do not return stale results from the cache instead of accessing the latest data in the warehouse. The job is only useful to RI customers and could be disabled otherwise. |
| AGG_UTILITY_PRE_JOB | This program is responsible for updating a dimension lookup temporary table for the product hierarchy, which the aggregation programs use to roll item-level data up to higher product levels. |
| AGG_UTILITY_ORG_PRE_JOB | This program is responsible for updating a dimension lookup temporary table for the organization hierarchy, which the aggregation programs use to roll location-level data up to higher organization levels. |
| RESET_ETL_THREAD_VAL_STG_JOB | This program updates the column ETL_THREAD_VAL on the staging tables, resetting the thread values that control how data is split across parallel processes in later load programs. |
| *_SDE_JOB | Any AIF DATA program ending in SDE (source-dependent extract) reads data directly from a source system’s tables rather than from flat files and writes it to the data warehouse staging layer. |
| LOCALIZATION_REFRESH_JOB | Retail Insights has customizable labels for metrics and folder names which come from Retail Home during the nightly batch, but they are not immediately applied to the Oracle Analytics front-end. This program is responsible for refreshing Oracle Analytics to pick up the latest labels from the database. |
| RI_HEALTH_CHECK_JOB | This job queries a common platform API to check on the status of various components used within our applications and will fail if any components return a status other than READY or LIVE. A failure here does not necessarily mean the batch will have any problems in other programs. |