3 AI Foundation Data Batch Architecture
The AIF DATA batch schedule is responsible for managing foundation data for the Retail Analytics and Planning solutions and storing it in a shared data warehouse architecture. Understanding the core features of the data warehouse can help you make better use of the jobs and processes available to you. When working with Oracle Support, you may also be asked to modify the behavior of some jobs to resolve issues, and this chapter provides additional information on the AIF DATA processes you will be expected to interact with.
This chapter will cover the following major functions of the AIF DATA batch schedule:
- File ingestion – How customer data files are received and processed
- Data extract – How data is extracted from a source system within the Oracle Cloud for use in Analytics and Planning applications
- Data import – How customer data files are pulled into the Oracle database
- Data transform and load – How customer data is transformed and loaded into the data warehouse
- Data reclassification – How customer data is reorganized when an item or location is reclassified to a new hierarchy node
- Data aggregation – How customer data is aggregated and summarized for use in BI reports
- Data export – How customer data is exported from the data warehouse for use in other RAP applications
- Other supporting processes – Other technical aspects of the batch processing that supports the overall flow of data throughout the system
File Ingestion
Data files are accepted from external sources using a private cloud instance of Oracle Object Storage (OS). This OS instance is not comparable to the Oracle Public Cloud version: it does not have a user interface and cannot be accessed except through specific programming APIs provided by Oracle (such as the File Transfer Services or FTS). It is used for short-term data storage and retrieval only as a means of transferring files into or out of the Oracle Retail cloud. If you are not providing any files and instead using direct integration from Merchandising Foundation Cloud Service (MFCS) then these programs will not be used.
AIF DATA programs expect customer files to come in the form of ZIP files. The ZIP file must contain only a set of files from the pre-determined list of interfaces that AIF DATA batch supports. It cannot contain a folder structure; all files must be directly packaged into the ZIP file itself. The general steps performed once a ZIP file is uploaded to Object Storage are:
- The ZIP file is passed through a virus scanner and will not be consumed until the scan completes. A batch program (ZIP_FILE_WAIT_JOB) will wait up to a maximum of 4 hours for the ZIP file to become available for import, and will fail if the file does not become available in that time.
- The ZIP file is unpacked by another batch program (ZIP_FILE_UNLOAD_JOB) into a temporary directory where the files are cleansed using a variety of standard Unix functions, including dos2unix and methods to apply UTF-8 encoding.
- After file conversion is completed, all files used in prior batch runs are deleted and the new files are moved to the next stage of processing.
- Files are compared to a list of required filenames as specified in Retail Home. If any required files are missing, the batch program (DAT_FILE_VALIDATE_JOB) will fail. If all required files are found, then the entire set of files is moved into the final server directory where they can be read by later jobs.
Files cannot be consumed by the rest of the batch unless DAT_FILE_VALIDATE_JOB is run successfully, so it is a critical program that cannot be skipped or disabled. Even if you have marked all files as optional in Retail Home by disabling the CONTROLFILES modules, you must still run DAT_FILE_VALIDATE_JOB because it will move the files to the required folder structure on the application server.
Data Extract
When you subscribe to other Oracle Cloud applications that support integration with Retail Analytics and Planning (RAP) applications, such as Merchandising Foundation Cloud Service, you have the option to leverage direct database-to-database extracts that run as part of the AIF DATA batch schedule. These extracts are collectively referred to as the Retail Data Extractor (RDE) component of RAP. The key points of the RDE architecture are:
- Oracle GoldenGate technology is used to take a near-real-time copy of the source system’s data and place it on the local ADW instance where RAP is installed
- The local copy of the data is overlaid with database synonyms and permission grants so that AIF DATA batch programs can access it
- Extract programs in the AIF DATA schedule are usually prefixed with RDE to more easily identify them
- RDE programs read from the cloned tables, transform the data, and then insert the result to the staging tables in the data warehouse
For all other implementation details on using RDE programs, refer to the Retail Analytics and Planning Implementation Guide chapter on “Integration with Merchandising”.
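If you want to confirm that the replicated source data is accessible before relying on the RDE jobs, a quick query through the synonym layer is usually enough. The sketch below assumes ITEM_MASTER is one of the replicated Merchandising tables available in your environment; substitute any table included in your subscription.

```sql
-- Minimal check that the GoldenGate-replicated Merchandising data is visible
-- through the synonyms and grants described above. ITEM_MASTER is used only
-- as an illustrative table name.
SELECT COUNT(*) AS replicated_rows
FROM   ITEM_MASTER;
```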
Data Import
Data files that have been successfully prepared through the file ingestion process will then go through a two-step process to import the contents into the database. The first step will copy a specific file from the application server staging directory to another location where the database can access it. The second step will perform an External Table Load procedure, which is a standard feature of Oracle Autonomous Data Warehouse (ADW). The file is mapped to a temporary external table name, then the rows of data are read from the file and inserted into a pre-defined target table in our data model.
The file copy step is performed by batch jobs having COPY in their names. For example, W_MCAL_PERIOD_DS_COPY_JOB will copy the W_MCAL_PERIOD_DS.dat file. These jobs have the following parameters in POM that determine the behavior of the process:

- disJobType – The internal code for the program type, which affects the API call used when the job is run
- object – The filename being written to the destination
- objectFilepath – The source path and filename for the copy action

If these COPY jobs are not run successfully, then there will not be any file available for the next step to consume, and later steps in the batch will fail as a result.
The external table load step is performed by batch jobs having STG in their names. For example, W_MCAL_PERIOD_DS_STG_JOB will read the file data from W_MCAL_PERIOD_DS.dat and write it to the database. These jobs have the following parameters in POM that determine the behavior of the process:

- disJobType – The internal code for the program type, which affects the API call used when the job is run
- tableName – The target table in the database where the file data will be written when this job is executed
- filename – The source file that will be mapped to a temporary external table for processing, which must match an object in the COPY job parameters
- connectionType – The ADW connection tier used for this data load, one of low, medium, or high
- truncateTable – Instructs the load to truncate the target table before inserting (true) or append to the table (false)
You may modify the parameters connectionType and truncateTable based on guidance from Oracle Support or as needed during an implementation. The connection type of “low” is optimal for low volumes of data, but if you find yourself pushing large amounts of data into a program that uses a low connection, you can change the parameter value to “medium”, which can better handle high volumes. Do not make this change to the connection type unless you are sure it is necessary, and revert the setting to its original value after you are done loading data. You might also want to change truncateTable to “false” if you do not want to lose the data currently in a staging table when loading a new file to that table, such as when a customer provides their implementer with an incremental file that needs to be appended to data previously staged for loading. Like the connection type, do not change this setting unless required for a specific use case, and revert it to the default value (true) when finished with your loads.
The external table load (STG) programs also use the configuration options provided on context (CTX) files to further change the behavior of the data imports. Details on creating context files are provided in the Retail Analytics and Planning Implementation Guide chapter on “Data File Generation”. The context files are not directly used; they are pushed into ADW metadata tables using a dedicated POM job (RI_UPDATE_TENANT_JOB) and then the external table loads will read from that metadata to configure the imports. This metadata controls the following aspects of the file import: columns to read from the file, column delimiters, data formats, trimming of spaces, header row processing, rejected record tolerance, and any optional parameters supported by the DBMS_CLOUD package in ADW.
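As a rough illustration of the external table pattern (not the actual AIF DATA implementation), the sketch below shows how a pipe-delimited file could be mapped to a temporary external table with DBMS_CLOUD and then inserted into a staging table. The credential name, file URI, column list, and format values are placeholders; the real jobs derive these values from the POM parameters and the context-file metadata described above.

```sql
-- Illustrative only: simplified external table load for one staging file.
-- Credential, URI, column list, and format values are placeholders.
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
    table_name      => 'W_MCAL_PERIOD_DS_EXT',
    credential_name => 'OBJ_STORE_CRED',    -- hypothetical credential
    file_uri_list   => 'https://objectstorage.example.com/b/bucket/o/W_MCAL_PERIOD_DS.dat',
    column_list     => 'MCAL_CAL_ID VARCHAR2(30), MCAL_PERIOD_NAME VARCHAR2(60)',
    format          => JSON_OBJECT('delimiter'   VALUE '|',       -- column delimiter
                                   'skipheaders' VALUE '0',       -- header row processing
                                   'trimspaces'  VALUE 'lrtrim',  -- trimming of spaces
                                   'rejectlimit' VALUE '10')      -- rejected record tolerance
  );
END;
/

-- truncateTable = true clears the target first; false would append instead.
TRUNCATE TABLE W_MCAL_PERIOD_DS;

INSERT INTO W_MCAL_PERIOD_DS (MCAL_CAL_ID, MCAL_PERIOD_NAME)
SELECT MCAL_CAL_ID, MCAL_PERIOD_NAME
FROM   W_MCAL_PERIOD_DS_EXT;
```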
Transform and Load
Data that has been staged into the data warehouse (either by file imports or direct integrations) may go through additional transformations before being written to the final warehouse tables. When the data is directly loaded using RDE programs, all the transformation is already done and the next step would be to load it into the data warehouse. If the data came from a file with a CSV file extension, it needs additional processing before it can be used.
CSV file formats are provided to reduce the development overhead of creating interfaces for a set of related data warehouse tables. CSV files often combine two or more internal tables into one input file, so additional transformation is needed to split the data fields out after importing the file. The transformation of CSV file data is done using a combination of Oracle Data Integrator (ODI) programs and PL/SQL scripting, and the programs are collectively called Simplified Interface (SI) programs. The SI program’s responsibility is to split out the relevant fields from the input to all the other staging tables which require something from that file. For example, the PRODUCT.csv file is imported to a table named W_PRODUCT_DTS. From here, the SI program SI_W_PRODUCT_DS_JOB takes a subset of this data and moves it to W_PRODUCT_DS. Another SI program, SI_W_PROD_CAT_DHS_JOB, reads from the same source table and writes to W_PROD_CAT_DHS. This continues until all the fields from W_PRODUCT_DTS have been parsed out into their designated areas by SI jobs.
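Conceptually, each SI job performs a “select the columns I need” movement from the combined staging table into its own target. The sketch below is a simplified illustration of that split, using a handful of hypothetical column names; the real jobs handle many more columns along with data cleansing rules.

```sql
-- Simplified illustration of the SI split step. Column names are examples only.
INSERT INTO W_PRODUCT_DS (ITEM, ITEM_DESC, BRAND_NAME)
SELECT DISTINCT ITEM, ITEM_DESC, BRAND_NAME
FROM   W_PRODUCT_DTS;

-- A second SI job reads the same source table for the hierarchy columns.
INSERT INTO W_PROD_CAT_DHS (CATEGORY_ID, CATEGORY_DESC)
SELECT DISTINCT CATEGORY_ID, CATEGORY_DESC
FROM   W_PRODUCT_DTS;
```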
The target tables of the SI programs are the same tables directly populated by the RDE extracts, meaning the two data flows are mutually exclusive. Either RDE jobs or file-based jobs can be used to populate data into the data warehouse for a particular interface. Attempting to run both for the same table will result in errors and data conflicts. For additional examples of these data flows, refer to the Retail Analytics and Planning Implementation Guide chapter on “Batch Orchestration”, under the section “Managing Multiple Data Sources”.
After any transformation steps are complete, loading into the data warehouse target tables can begin. Load jobs take the data in staging tables like W_PRODUCT_DS and move it into internal tables like W_PRODUCT_D. The load programs are complex and perform many operations to get from the source to the target. The data warehouse applies primary and foreign key constraints, internal sequences for row tracking, audit fields such as insert/update timestamps, historical versions of modified records, data quality validations, and many other functions. The basic load jobs can be broadly split into dimension jobs (for tables ending with _D), fact jobs (for tables ending with _F), translation lookup jobs (for tables ending with _TL), and general load jobs (for tables ending with _G). For these programs, the name of the job in the AIF DATA batch schedule will correlate with the target table being loaded. For example, W_PRODUCT_D_JOB is responsible for loading the W_PRODUCT_D table. Understanding this naming scheme will allow you to connect the jobs being run in the batch schedule with the tables being populated in the data warehouse.
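Because of this naming scheme, a quick way to verify that a load job moved your data is to compare the staging table to its target. The query below is only a sanity-check sketch; target counts will normally exceed staging counts because the target retains history and previously loaded records.

```sql
-- Compare staged rows to the rows present in the target dimension after
-- W_PRODUCT_D_JOB completes. The target typically holds more rows because
-- it keeps historical versions of records.
SELECT (SELECT COUNT(*) FROM W_PRODUCT_DS) AS staged_rows,
       (SELECT COUNT(*) FROM W_PRODUCT_D)  AS target_rows
FROM   dual;
```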
Data Reclassification
A core feature of the data warehouse is its ability to manage reclassifications, which is the term used to describe the reassignment of an item or location from one position in your hierarchies to another. The item and location dimensions (and their hierarchies) are the only ones to support reclassification management. All other dimensions are taken “as-is”, meaning the data warehouse will not track major changes done to those records as a reclassification.
As it pertains to the AIF DATA batch schedule, the system applies a reclassification activity across two days, following this overall flow of events:
- Day 1 Batch
  - Incoming product and location data is compared to the data already in the data warehouse and changes to the hierarchy assignments are recorded in temporary tables. This comparison is done automatically and does not require you to tell the system about reclassified records separately.
  - For the initial batch when a reclassified item or location is sent in, the data is applied in the dimension tables like W_PRODUCT_D immediately, inserting a new record for the reclassified data and updating the existing record as inactive. Effective dates are assigned to both the old and new records to indicate the periods of time when that hierarchy classification was active. The CURRENT_FLG column indicates the latest version of the record.
  - If any fact records are provided for a reclassified item or location, that data is loaded relative to the new record’s unique identifier, but existing data remains on the old record’s identifier. Joining the fact and dimension tables together is how you cross-reference the old and new identifiers with the same item or location (see the sketch after this list).
  - A set of jobs prefixed with FACTOPEN and FACTCLOSE are responsible for maintaining positional facts like inventory and pricing, end-dating the old version of reclassed records and inserting new ones for the new dimension keys, even if the fact itself had no changes on that day. This is necessary to carry forward the positional balances on the new dimension keys, not the inactive ones.
- Day 2 Batch
  - At the very start of the next day’s batch processing, a series of jobs having RECLASS in the name will take effect. These jobs are responsible for recalculating aggregation tables in the data warehouse to regenerate fact data based on the latest classifications. This is done on the second day’s batch due to the amount of time it can take to run, so it’s best to do it early in the morning before users enter the system instead of doing it at the end of Day 1’s batch.
  - A side-effect of the split approach is that BI reporting will not reflect a reclassification fully until a day has elapsed and the aggregate tables are recalculated. A user may only notice this if they attempt to report on the current day’s fact data at an aggregate level on the same day a reclassification occurs, which will be a small subset of reports (if any). It is still better to run the aggregation programs outside of normal business hours versus running them during the business day in the first batch, which will impact the user experience more than the temporary data inconsistency does. This approach does not impact the other Analytics and Planning applications because they receive their data directly from the item/location fact tables, not the BI aggregate tables.
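The sketch below shows one way to cross-reference the old and new versions of a reclassified item by joining a fact aggregate back to the product dimension. The CURRENT_FLG column and the table names come from this chapter; the surrogate key and effective-date column names (ROW_WID, PROD_WID, ITEM, EFF_FROM_DT, EFF_TO_DT) are assumptions for illustration and may differ from the actual data model.

```sql
-- Hedged example: list every dimension version of one reclassified item and
-- the sales quantity recorded against each version's surrogate key.
-- Key and effective-date column names are assumed for illustration.
SELECT d.ROW_WID,
       d.ITEM,
       d.CURRENT_FLG,
       d.EFF_FROM_DT,
       d.EFF_TO_DT,
       SUM(f.SLS_QTY) AS sls_qty
FROM   W_PRODUCT_D d
LEFT JOIN W_RTL_SLS_IT_DY_A f
       ON f.PROD_WID = d.ROW_WID
WHERE  d.ITEM = '100001234'                -- example item number
GROUP BY d.ROW_WID, d.ITEM, d.CURRENT_FLG, d.EFF_FROM_DT, d.EFF_TO_DT
ORDER BY d.EFF_FROM_DT;
```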
Data Aggregation
The Analytics and Planning data warehouse supports the Retail Insights solution for all business intelligence and reporting needs. It does this using a comprehensive set of pre-aggregated data tables holding information at levels commonly used in reporting, such as department/week or subclass/store/week. The AIF DATA batch schedule is responsible for maintaining all of the aggregate tables. Each job in the batch schedule ending with _A_JOB populates an aggregate table in the data warehouse, with the job name directly corresponding to the table being populated (for example, the W_RTL_SLS_SC_LC_DY_A_JOB populates the W_RTL_SLS_SC_LC_DY_A table).
It is important to understand how the data warehouse performs aggregations as it can directly impact the way you provide your data to the system. Aggregations are performed using a different approach depending on the type of fact data:
- Transaction facts such as sales and receipts are aggregated incrementally and additively. The incoming data is summarized at each level of aggregation and added to the data already in the target table. This means that the incoming fact data must only contain the values that can be summed with existing records to reach the desired final numbers. Loading data in this manner has the best performance in batch processing as it requires the least number of changes to the aggregate tables. Data will be updated only if it is being altered by an incoming record.
- Positional facts such as inventory are aggregated as complete snapshots of the base fact data for daily aggregates and on the first day of a new week. The item/location records are rolled up to the desired level and then inserted to the aggregation table for the new daily or weekly positions. Weekly positional data is also updated incrementally during the week. The incoming records will update the current week’s positions with the changed values (non-additively, replacing the existing data with the new data).
There are multiple distinct batch jobs involved in fact aggregation and all must be enabled in the nightly batch cycles before any BI aggregates can be populated. Using the sales transaction fact aggregate W_RTL_SLS_IT_DY_A as an example, the data is processed in the following steps:

- Incremental sales data for the current date will be staged into the W_RTL_SLS_TRX_IT_LC_DY_FS table using either RDE programs or a file-based load.
- W_RTL_SLS_IT_DY_TMP_JOB aggregates the incoming transaction records from the staging layer to a temporary table W_RTL_SLS_IT_DY_TMP at the summarized level of item/day. The internal base fact table W_RTL_SLS_TRX_IT_LC_DY_F is not used; aggregate table data always comes from the staging layer only.
- W_RTL_SLS_IT_DY_A_JOB will take the data from W_RTL_SLS_IT_DY_TMP and load it to W_RTL_SLS_IT_DY_A using an additive MERGE statement. Existing records will be merged by summing the same column in the source and target (for example, t.SLS_QTY=s.SLS_QTY+t.SLS_QTY) while unmatched records will be inserted with all the values from the temporary table. Positional facts such as inventory will use a non-additive merge, overwriting the target column from the source table.
- Other aggregate tables that need item-level data, such as W_RTL_SLS_IT_WK_A, will also source from W_RTL_SLS_IT_DY_TMP to update their data using a similar merge statement. Week-level aggregates merge the daily data onto the week-level records using the system’s fiscal calendar to map the days to weeks.
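The additive merge described in the steps above can be pictured with the simplified statement below. Only two measures and two join keys are shown, and the key column names (PROD_WID, DT_WID) and the SLS_AMT_LCL measure are illustrative assumptions; the real program merges a much longer column list.

```sql
-- Simplified additive merge from the item/day temporary table into the
-- item/day aggregate. Key columns and most measures are omitted.
MERGE INTO W_RTL_SLS_IT_DY_A t
USING W_RTL_SLS_IT_DY_TMP s
   ON (t.PROD_WID = s.PROD_WID AND t.DT_WID = s.DT_WID)
WHEN MATCHED THEN UPDATE SET
     t.SLS_QTY     = s.SLS_QTY     + t.SLS_QTY,     -- additive: increment existing values
     t.SLS_AMT_LCL = s.SLS_AMT_LCL + t.SLS_AMT_LCL
WHEN NOT MATCHED THEN
     INSERT (PROD_WID, DT_WID, SLS_QTY, SLS_AMT_LCL)
     VALUES (s.PROD_WID, s.DT_WID, s.SLS_QTY, s.SLS_AMT_LCL);
-- A positional fact such as inventory would overwrite instead of add,
-- for example t.INV_QTY = s.INV_QTY.
```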
The most important behavior to understand is that the aggregate tables are updated from the daily staged data, not rebuilt from the base fact tables. A common point of confusion about the aggregation process is the expectation that lower-level aggregate tables directly populate the higher-level aggregates, which leads to the incorrect assumption that all aggregate tables are regenerated from the base fact every day. An implementer might think that a manual data change on a base fact table like W_RTL_SLS_TRX_IT_LC_DY_F will automatically appear in the BI aggregates after batch, which is not the case. For aggregates to be updated, the data must be loaded through the staging tables and then processed through the aggregation programs.
Data Exports
The AIF data warehouse is used to export certain datasets to planning applications such as Merchandise Financial Planning (MFP). Separate AIF DATA nightly batch jobs are responsible for pulling data out of the data warehouse tables and exporting it to tables in the Retail Data Exchange (RDX) schema. Only data exported to RDX is available for use in the planning applications; all other data warehouse tables are used only by Retail Insights or AI Foundation processes. This method of exporting data is specific to the planning applications; AI Foundation applications instead get their data by pulling from the data warehouse using AIF APPS batch schedule programs.
Data export programs use a different naming scheme starting with W_PDS, such as W_PDS_PRODUCT_D_JOB. These jobs use PL/SQL to read from the data warehouse tables and write to RDX schema tables. Export programs operate on a run-based architecture where each execution of the job will create a new Run ID to track the execution and write the complete set of outputs. Runs are managed using the tables RAP_INTF_CFG and RAP_INTF_RUN_STATUS, which can be monitored from Innovation Workbench. The full mapping between the source tables in the data warehouse and the target tables in the RDX schema is available in the Retail Analytics and Planning Implementation Guide chapter on “Data Processing and Transformations”.
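If you have access to Innovation Workbench, a simple query of the run-status table is usually the quickest way to review recent export runs. The sketch below assumes the table carries a run identifier column (the chapter only states that each execution creates a new Run ID), so adjust the column names to match the actual structure.

```sql
-- Review the most recent export runs. The RUN_ID column name is assumed;
-- inspect RAP_INTF_RUN_STATUS in Innovation Workbench for the exact columns.
SELECT *
FROM   RAP_INTF_RUN_STATUS
ORDER  BY RUN_ID DESC
FETCH FIRST 20 ROWS ONLY;
```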
Exports from the data warehouse to RDX tables share some common design patterns:
- Export programs, like all other AIF DATA jobs, leverage the C_LOAD_DATES table to track the job-level run status, and any existing entry in C_LOAD_DATES will prevent running the export program again if it has already been run once.
- Exported facts will join on their associated export dimensions to ensure data consistency across tables, such as W_PDS_SLS_IT_LC_WK_A_JOB only exporting results for item/locations also present on the W_PDS_PRODUCT_D and W_PDS_ORGANIZATION_D dimension export tables.
- The source and target tables use the same intersections of data, as defined in the table names. For example, the job W_PDS_SLS_IT_LC_WK_A_JOB reads from the table W_RTL_SLS_IT_LC_WK_A and writes to W_PDS_SLS_IT_LC_WK_A, and both the source and target tables are aggregated at the item/location/week levels.
- Identifiers for the item, location, and date values on the PDS export tables will be converted to their external-facing values, meaning it is not possible to directly join the input and output tables without also using dimension tables to convert the identifiers (the data warehouse tables do not hold the external codes; instead they have foreign key values referring to their associated dimensions).
- Exports of dimension data will be a full snapshot of currently active source records. Exports of positional fact data will be full snapshots of the current positions only (no historical data). Exports of transaction fact data will be incremental based on the change-tracking maintained in the W_UPDATE_DT columns of the source tables. Incremental export dates are tracked on the C_SOURCE_CDC table.
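When troubleshooting an export that refuses to re-run or that exported less data than expected, the two tracking tables above are the first things to inspect. The queries below are a sketch; the filter column on C_LOAD_DATES is an assumption and should be replaced with whatever identifies the export job in your environment.

```sql
-- Check whether the export already has a run entry for the business date.
-- The PACKAGE_NAME filter column is an assumption; confirm the real columns
-- in Innovation Workbench before relying on this query.
SELECT *
FROM   C_LOAD_DATES
WHERE  PACKAGE_NAME LIKE '%SLS_IT_LC_WK%';

-- Review the incremental change-tracking dates used by transaction exports.
SELECT *
FROM   C_SOURCE_CDC;
```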
Supporting Processes
The AIF DATA batch schedule has many programs that support the other major functions described in this chapter but whose purpose is not readily apparent from the job name. Review the table below for a list of such supporting processes and their primary functions within the data warehouse architecture.
| Job Name | Primary Functions |
|---|---|
| *_MILE_MARKER_JOB | Any job having the term “mile marker” in the name is a simple placeholder in POM that allows for easy tracking of key points in the batch sequence. They may also be used to attach job dependencies for related groups of processes that need a single starting or ending point in the batch flow. The job does not perform any operations on the system itself. |
| VARIABLE_REFRESH_JOB | The underlying batch execution tool, Oracle Data Integrator (ODI), uses temporary variables to hold onto any values that are specific to one batch execution, but in some cases the variables can persist across executions unless they are forcibly cleared by this program. If this program does not run, ODI may use invalid variable values from a prior run. |
| ETL_REFRESH_JOB | Erases the table C_LOAD_DATES, which tracks the run status of batch programs for the current business date. The table must be cleared before each new batch cycle so that jobs which have already run once are allowed to run again. |
| REFRESH_RADM_JOB | An important part of data warehouse operations is maintaining accurate statistics, like row counts on all database tables. Statistics are used by the database to select the optimal execution plan for a given process. This program dynamically detects tables with altered data and collects new statistics on those tables and their related objects. This job’s runtime will increase relative to the amount of data changing in the system every day. |
| ANALYZE_TEMP_TABLES_JOB | This process is similar to REFRESH_RADM_JOB, but it gathers statistics on the temporary (TMP) tables used as intermediate steps during batch processing, so that programs reading from those tables have accurate statistics available. |
| REFRESH_PDS_STATS_JOB | This process is similar to REFRESH_RADM_JOB, but it gathers statistics on the tables in the RDX schema that are populated by the data export programs for the planning applications. |
| RA_ERROR_COLLECTION_JOB | Maintains the table used to collect rejected records and data errors raised by other batch programs, so they can be reviewed after the batch completes. |
| RA_BATCHEND_VALIDATION_JOB | Performs one final check on the results of the nightly batch before it is marked complete, failing if unresolved issues from earlier in the batch are detected. |
| OBIEE_CACHE_CLEAR_JOB | Retail Insights relies on a data caching mechanism that is part of Oracle Analytics Server. This cache needs to be cleared at the end of every nightly batch cycle to ensure the next day’s reports do not return stale results from the cache instead of accessing the latest data in the warehouse. The job is only useful to RI customers and could be disabled otherwise. |
| AGG_UTILITY_PRE_JOB | This program is responsible for updating a dimension lookup temporary table for the product hierarchy, which the aggregation programs use to roll item-level data up to higher product levels. |
| AGG_UTILITY_ORG_PRE_JOB | This program is responsible for updating a dimension lookup temporary table for the organization hierarchy, which the aggregation programs use to roll location-level data up to higher organization levels. |
| RESET_ETL_THREAD_VAL_STG_JOB | This program updates the column ETL_THREAD_VAL on the staging tables, resetting the thread values that control how data is split across parallel processes in later load programs. |
| *_SDE_JOB | Any AIF DATA program ending in SDE (source-dependent extract) reads data directly from a source system’s tables rather than from flat files and writes it to the data warehouse staging layer. |
| LOCALIZATION_REFRESH_JOB | Retail Insights has customizable labels for metrics and folder names which come from Retail Home during the nightly batch, but they are not immediately applied to the Oracle Analytics front-end. This program is responsible for refreshing Oracle Analytics to pick up the latest labels from the database. |
| RI_HEALTH_CHECK_JOB | This job queries a common platform API to check on the status of various components used within our applications and will fail if any components return a status other than READY or LIVE. A failure here does not necessarily mean the batch will have any problems in other programs. |