Time Series Use Case Scenario

You work in an electronic store, and sales of laptops and tablets have increased over the last two quarters. You want to forecast your product sales for the next four quarters using historical timestamped data. You forecast sales using the Exponential Smoothing algorithm, predicting changes over evenly spaced intervals of time using historical data.

Related Content

Topic Link
OML4SQL GitHub Example Time Series - Exponential Smoothing
CREATE_MODEL2 Procedure CREATE_MODEL2 Procedure
Generic Model Settings DBMS_DATA_MINING - Model Settings
Exponential Smoothing Model (ESM) Settings DBMS_DATA_MINING — Algorithm Settings:Exponential Smoothing
Data Dictionary Settings Oracle Machine Learning Data Dictionary Views
Exponential Smoothing Model - Model Detail Views Model Detail Views for Exponential Smoothing
About Time Series About Time Series

Before you start your OML4SQL use case journey, ensure that you have the following:

  • Data Set

    The data set used for this use case is from the SH schema. The SH schema can be readily accessed in Oracle Autonomous Database. For on-premises databases, the schema is installed during the installation or can be manually installed by downloading the scripts. See Installing the Sample Schemas.

    You will use the SALES table from the SH schema. You can access the table by running the SELECT statements in OML Notebooks.

  • Database
    Select or create database out of the following options:
  • Machine Learning Tools
    Depending on your database selection,
  • Other Requirements

    Data Mining Privileges (this is automatically set for ADW). See System Privileges for Oracle Machine Learning for SQL.

Load Data

Access the data set from the SH schema and explore the data to understand the attributes.

Remember:

The data set used for this use case is from the SH schema. The SH schema can be readily accessed in Oracle Autonomous Database. For on-premises databases, the schema is installed during the installation or can be manually installed by downloading the scripts. See Installing the Sample Schemas.

To understand the data, you will perform the following:
  • Access the data.
  • Examine the various attributes or columns of the data set.
  • Assess data quality (by exploring the data).

Access Data

You will use SALES table data from the SH schema.

Examine Data

The following table displays information about the attributes from SALES:

Attribute Name Information
PROD_ID The ID of the product
CUST_ID The ID of the customer
TIME_ID The timestamp of the purchase of the product in yyy-mm-dd hh:mm:ss format
CHANNEL_ID The channel ID of the channel sales data
PROMO_ID The product promotion ID
QUANTITY_SOLD The number of items sold
AMOUNT_SOLD The amount or sales data

Identify Target Variable

In this use case, the task is to train a model that predicts the amount sold. Therefore, the target variable is the attribute AMOUNT_SOLD.

Explore Data

Explore the data to understand and assess the quality of the data. At this stage assess the data to identify data types and noise in the data. Look for missing values and numeric outlier values.

If you are working with Oracle Autonomous Database, you can use the Oracle Machine Learning (OML) Notebooks for your data science project, including assessing data quality. If you are using an on-premise Oracle Database, you can use the Oracle SQL Developer to assess data quality. Query the SH schema as described.

Note:

Each record in the database is called a case and each case is identified by a case_id. In this use case TIME_ID is the case_id as it is an independent variable and you are forecasting the sales for evenly spaced time.

The following steps help you with exploratory analysis of the data.

  1. View the data in the SH.SALES table by running the following statement:
    SELECT * FROM SH.SALES;
  2. To find the number of rows in SH.SALES table, run the following statement:
    %script
    SELECT COUNT(*) from SH.SALES;
    
    COUNT(*)   
        918843 
    ---------------------------
  3. Find the distinct users in the table, run the following query:
    %sql SELECT COUNT (DISTINCT CUST_ID) FROM SH.SALES;
    
    COUNT(DISTINCTCUST_ID)   
                        7059 
    ---------------------------
  4. To view the datatype of the sales table, run the following query:
    %script 
    DESCRIBE SH.SALES;
    
    Name          	Null?    	Type         
    ------------- 	-------- 	------------ 
    PROD_ID 	NOT NULL 	NUMBER 
    CUST_ID 	NOT NULL 	NUMBER 
    TIME_ID  	NOT NULL 	DATE     
    CHANNEL_ID 	NOT NULL 	NUMBER 
    PROMO_ID 	NOT NULL 	NUMBER 
    QUANTITY_SOLD 	NOT NULL 	NUMBER(10,2) 
    AMOUNT_SOLD 	NOT NULL 	NUMBER(10,2) 
    
    ---------------------------
  5. To view all the NULLs and missing values, run the following query:
    %sql SELECT COUNT(*) FROM SH.SALES WHERE PROD_ID=NULL OR CUST_ID=NULL OR
        TIME_ID=NULL OR CHANNEL_ID=NULL OR PROMO_ID=NULL OR QUANTITY_SOLD=NULL OR
        AMOUNT_SOLD=NULL;
    
    COUNT(*)   
             0 
    ---------------------------

    NULLs, if found, are automatically handled by the OML algorithms.

  6. Now, prepare a view called ESM_SH_DATA by selecting the necessary columns from SH.SALES table. For this use case, select TIME_ID and AMOUNT_SOLD.
    %script
    CREATE OR REPLACE VIEW ESM_SH_DATA AS 
      SELECT TIME_ID, AMOUNT_SOLD FROM SH.SALES;
    
    View ESM_SH_DATA created.
    ---------------------------
  7. Count the number of rows to ensure that we have the same amount of data. Run the following query:
    %script
    SELECT count(*) from ESM_SH_DATA;
    
    COUNT(*)   
        918843 
    ---------------------------

This completes the data understanding and data exploration stage. Time series data can contain missing values. The setting EXSM_SETMISSING can be used to specify how to handle missing values. The special value EXSM_MISS_AUTO indicates that, if the series contains missing values it is to be treated as an irregular time series. The Automatic Data Preparation (ADP) setting does not impact this data for time series. See How ADP Transforms the Data to understand how ADP prepares the data for some algorithms.

Build Model

To build a model using the time series data, you will use Exponential Smoothing algorithm on the ESM_SH_DATA view that is generated during the exploratory stage.

Oracle offers the Exponential Smoothing algorithm for time series.

Exponential smoothing is a forecasting method for time series data. It is a moving average method where exponentially decreasing weights are assigned to past observations. Components of Exponential Smoothing Model (ESM) such as trend and seasonality extensions, can have an additive or multiplicative form. For additive forms, the amplitude of the variation is independent of the level, whereas for multiplicative forms, the variation is connected to the level. The simpler additive models assume that error or noise, trend, and seasonality are linear effects within the recursive formulation.

To build a model using a supervised learning algorithm you may use a subset of the data into training and test data. Time series models usually use historical data to predict the future. This is different from model validation for classification and regression, which normally involves splitting data randomly into training and test sets. In this use case, there is no need to split the data set because the model is always predicting the current value based on information from the past. This means that although it seems that you train and test on the same data set, but when the model is applied, the forecast is always based on the previous date. In this use case, you will use the ESM_SH_DATA view.

  1. To see the data in the ESM_SH_DATA view, run the following statement:
    %sql
    SELECT * from ESM_SH_DATA;
    TIME_ID     AMOUNT_SOLD   
    20-JAN-98         1205.99 
    05-APR-98         1250.25 
    05-APR-98         1250.25 
    05-APR-98         1250.25 
    05-APR-98         1250.25 
    05-APR-98         1250.25 
    05-APR-98         1250.25 
    05-APR-98         1250.25 
    05-APR-98         1250.25 
    05-JUL-98         1210.21 
    05-JUL-98         1210.21 
    05-JUL-98         1210.21 
    05-JUL-98         1210.21 
    05-JUL-98         1210.21 ...
    
  2. Build a model with the ESM_SH_DATA table, run the following script:
     %script
    
        BEGIN DBMS_DATA_MINING.DROP_MODEL('ESM_SALES_FORECAST_2');
        EXCEPTION WHEN OTHERS THEN NULL; END;
        /
        DECLARE
            v_setlist DBMS_DATA_MINING.SETTING_LIST;
        BEGIN
            v_setlist('ALGO_NAME')            := 'ALGO_EXPONENTIAL_SMOOTHING';
            V_setlist('EXSM_INTERVAL')        := 'EXSM_INTERVAL_QTR';
            V_setlist('EXSM_PREDICTION_STEP') := '4';
            V_setlist('EXSM_MODEL')           := 'EXSM_WINTERS';
            V_setlist('EXSM_SEASONALITY')     := '4';
    	 V_setlist('EXSM_SETMISSING')	:= 'EXSM_MISS_AUTO');
    
            DBMS_DATA_MINING.CREATE_MODEL2(
                MODEL_NAME          => 'ESM_SALES_FORECAST_1',
                MINING_FUNCTION     => 'TIME_SERIES',
                DATA_QUERY          => 'select * from ESM_SH_DATA',
                SET_LIST            => v_setlst,
                CASE_ID_COLUMN_NAME => 'TIME_ID',
                TARGET_COLUMN_NAME  =>'AMOUNT_SOLD');
        END; 
    PL/SQL procedure successfully completed.
        ---------------------------
        PL/SQL procedure successfully completed.

    Examine the script:

    • v_setlist is a variable to store SETTING_LIST.
    • SETTING_LIST specifies model settings or hyperparameters for the model.
    • DBMS_DATA_MINING is the PL/SQL package used for machine learning. These settings are described in DBMS_DATA_MINING - Model Settings.
    • ALGO_NAME specifies the algorithm name. Since you are using Exponential Smoothing as the algorithm, the value of the setting is ALGO_EXPONENTIAL_SMOOTHING.
    • EXSM_INTERVAL indicates the interval of the data set or a unit of interval size. For example, day, week, month, and so on. You want to predict for quarterly sales. Hence, the setting is EXSM_INTERVAL_QTR. This setting applies only to the time column with datetime type.
    • EXSM_PREDICTION_STEP specifies how many predictions to make. You want to display each value representing a quarter. Hence, a value of 4 gives four values ahead prediction.
    • EXSM_MODEL specifies the type of exponential smoothing model to be used. Here the value is EXSM_HW. The Holt-Winters triple exponential smoothing model with additive trend and multiplicative seasonality is applied. This type of model considers various combinations of additive and multiplicative trend, seasonality and error, with and without trend damping. Other options are EXSM_SIMPLE, EXSM_SIMPLE_MULT, EXSM_HOLT, EXSM_HOLT_DMP, EXSM_MUL_TRND, EXSM_MULTRD_DMP, EXSM_SEAS_ADD, EXSM_SEAS_MUL, EXSM_HW, EXSM_HW_DMP, EXSM_HW_ADDSEA, EXSM_DHW_ADDSEA, EXSM_HWMT, EXSM_HWMT_DMP.
    • EXSM_SEASONALITY indicates how long a season lasts. The parameter specifies a positive integer value as the length of seasonal cycle. The value it takes must be larger than 1. For example, 4 means that every group of four values forms a seasonal cycle.
    • EXSM_SETMISSING specifies how to handle missing values. In time series, the special value EXSM_MISS_AUTO indicates that, if the series contains missing values it is to be treated as an irregular time series.

    The CREATE_MODEL2 procedure has the following settings:

    • MODEL_NAME: A unique name that you will give to the model. Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. Here, the model name is ESM_SALES_FORECAST_1.
    • MINING_FUNCTION: Specifies the machine learning function. Since it is a time series problem, select TIME_SERIES.
    • DATA_QUERY: A query that provides training data for building the model. Here, the query is SELECT * FROM ESM_SH_DATA.
    • SET_LIST: Specifies SETTING_LIST.
    • CASE_ID_COLUMN_NAME: A unique case identifier column in the training data. In this use case, case_id is TIME_ID. If there is a composite key, you must create a new attribute before creating the model.
    • TARGET_COLUMN_NAME: Specifies the column that is to be predicted. Also referred to as the target variable of the model. In other words, the value the model predicts. In this use case, you are predicting the sale of products in terms of their dollar price. Therefore, in this use case, the TARGET_COLUMN_NAME is AMOUNT_SOLD.

      Note:

      Any parameters or settings not specified are either system-determined or default values are used.

Evaluate

Evaluate your model by viewing diagnostic metrics and performing quality checks.

Sometimes querying dictionary views and model detail views is sufficient to measure your model's performance. However, you can evaluate your model by computing test metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), confusion matrix, lift statistics, cost matrix, and so on. For Association Rules, you can inspect various rules to see if they reveal new insights for item dependencies (antecedent itemset implying consequent) or for unexpected relationships among items.

Dictionary and Model Views

To obtain information about the model and view model settings, you can query data dictionary views and model detail views. Specific views in model detail views display model statistics which can help you evaluate the model.

By examining various statistics in the model detail views, you can compare models to arrive at one model that satisfies your evaluation criteria.

The data dictionary views for Oracle Machine Learning are listed in the following table. A database administrator (DBA) and USER versions of the views are also available.

View Name Description
ALL_MINING_MODELS Provides information about all accessible machine learning models
ALL_MINING_MODEL_ATTRIBUTES Provides information about the attributes of all accessible machine learning models
ALL_MINING_MODEL_SETTINGS Provides information about the configuration settings for all accessible machine learning models
ALL_MINING_MODEL_VIEWS Provides information about the model views for all accessible machine learning models
ALL_MINING_MODEL_XFORMS Provides the user-specified transformations embedded in all accessible machine learning models.

Model detail views are specific to the algorithm. You can obtain more insights about the model you created by viewing the model detail views. The names of model detail views begin with DM$xx where xx corresponds to the view prefix. See Model Detail Views.

  1. You can review the model settings by running the following query:
     %sql
     
    SELECT SETTING_NAME, SETTING_VALUE
      FROM USER_MINING_MODEL_SETTINGS
      WHERE MODEL_NAME = UPPER('ESM_SALES_FORECAST_1')
      ORDER BY SETTING_NAME;
    
    SETTING_NAME                   SETTING_VALUE               
    ALGO_NAME                      ALGO_EXPONENTIAL_SMOOTHING  
    EXSM_ACCUMULATE                EXSM_ACCU_TOTAL             
    EXSM_CONFIDENCE_LEVEL          .95                         
    EXSM_INTERVAL                  EXSM_INTERVAL_QTR           
    EXSM_MODEL                     EXSM_WINTERS                
    EXSM_NMSE                      3                           
    EXSM_OPTIMIZATION_CRIT         EXSM_OPT_CRIT_LIK           
    EXSM_PREDICTION_STEP           4                           
    EXSM_SEASONALITY               4                           
    EXSM_SETMISSING                EXSM_MISS_AUTO              
    ODMS_DETAILS                   ODMS_ENABLE                 
    ODMS_MISSING_VALUE_TREATMENT   ODMS_MISSING_VALUE_AUTO     
    ODMS_SAMPLING                  ODMS_SAMPLING_DISABLE       
    PREP_AUTO                      ON                          
     
    14 rows selected.
    ---------------------------
  2. To view the DM$VP model view, run the following statement:

    The DM$VP view for time series contains the result of an ESM model. The output has a set of records such as partition, CASE_ID, value, prediction, lower, upper, and so on and ordered by partition and CASE_ID (time).

    %script
    SELECT CASE_ID, VALUE, PREDICTION, LOWER, UPPER FROM DM$VPESM_SALES_FORECAST_1
    ORDER BY CASE_ID;
    
    CASE_ID     VALUE                PREDICTION           LOWER   UPPER  
    01-JAN-98     6480684.0000011446   6452375.7547333492                
    01-APR-98     5593994.1400007578   5848724.7899219571                
    01-JUL-98     6071823.1000010688   6214546.3092128271                
    01-OCT-98     5937413.7100012964   5869219.4189072186                
    01-JAN-99      6093747.209999715    6132016.410793812                
    01-APR-99     4925471.6299999086   5385954.0785653945                
    01-JUL-99     5827050.1500000218   5350240.2540956484                
    01-OCT-99     5373678.6700002998   5304626.0456054937                
    01-JAN-00     5984889.4899995513   5541123.2442497462                
    01-APR-00     5371730.9200002486     5236126.09628068                
    01-JUL-00     6121239.2899996703   5955258.7436284116                
    01-OCT-00     6287646.9199997969   6089446.4024073323                
    01-JAN-01     6547097.4400001625   6837567.1739504253                
    01-APR-01     6922468.3900004178   6188944.0536819538                 ...
     
    --------------------------------------------------------------------------------------
    Examine the statement:
    • CASE_ID: Specifies the timestamp.
    • VALUE: Specifies the AMOUNT_SOLD.
    • PREDICTION: Indicates the predicted value for the model.
    • LOWER and UPPER: Indicate the confidence bounds.
  3. To view the model diagonistic view, DM$VG, and evaluate the model, run the following query:

    The DM$VG view for time series contains the global information of the model along with the estimated smoothing constants, the estimated initial state, and global diagnostic measures.

    %sql
    SELECT NAME, round(NUMERIC_VALUE,4), STRING_VALUE
      FROM DM$VGESM_SALES_FORECAST_1
      ORDER BY NAME;
     
    
    NAME                ROUND(NUMERIC_VALUE,4)   STRING_VALUE  
    -2 LOG-LIKELIHOOD                   450.7508               
    AIC                                 466.7508               
    AICC                                487.3223               
    ALPHA                                 0.4525               
    AMSE                       157764777942.4555               
    BETA                                  0.4195               
    BIC                                 472.9315               
    CONVERGED                                    YES           
    GAMMA                                 0.0001               
    INITIAL LEVEL                   6110212.8741               
    INITIAL SEASON 1                      0.9939               
    INITIAL SEASON 2                      1.0231               
    INITIAL SEASON 3                      0.9366               
    INITIAL SEASON 4                      1.0465               
     
    NAME            ROUND(NUMERIC_VALUE,4)   STRING_VALUE  
    INITIAL TREND                 55478.0794               
    MAE                               0.0424               
    MSE                    104400146583.6485               
    NUM_ROWS                          918843               
    SIGMA                              0.054               
    STD                          323110.1153               
     
     
    20 rows selected.
     
    ---------------------------
    • NAME: Indicates the diagnostic attribute name.
    • NUMERIC_VALUE: Indicates the calculated statistical value for the model.
    • STRING_VALUE: Indicates alphanumeric values for the diagnostic parameter.

      A few parameters to note for an exponential smoothing algorithm are:

      • ALPHA: Indicates the smoothing constant.
      • BETA: Indicates the trend smoothing constant.
      • GAMMA: Indicates the seasonal smoothing constant.
      • MAE: Indicates Mean Absolute Error.
      • MSE: Indicates Mean Square Error.

Exponential smoothing assumes that a series extends infinitely into the past, but that influence of past on future, decays smoothly and exponentially fast. The smooth rate of decay is expressed by one or more smoothing constants. The smoothing constants are parameters that the model estimates. These smoothing constants are represented as α, β, and γ. Values of a smoothing constant near one put almost all weight on the most recent observations. Values of a smoothing constant near zero allow the distant past observations to have a large influence.

Note that α is associated with the error or noise of the series, β is associated with the trend, and γ is associated with the seasonality factors. The γ value is closest to zero which means seasonality has an influence on the data set.

The MAE and MSE values are low which means that the model is good. The MSE magnitude depends on the actual scale of your original data. In this case, the STD is around 105. The square of it is roughly in the scale of 1010. The error percentage is low and hence, the model is good.

Score

You are ready to forecast sales for the next four quarters.

For a time series model, you can use the DM$VP view to perform scoring or prediction.

  1. Query the DM$VP model detail view to see the forecast (sales for four quarters). Run the following statement:
    %sql
    SELECT TO_CHAR(CASE_ID,'YYYY-MON') DATE_ID,
           round(VALUE,2) ACTUAL_SOLD,
           round(PREDICTION,2) FORECAST_SOLD,
           round(LOWER,2) LOWER_BOUND, round(UPPER,2) UPPER_BOUND
      FROM DM$VPESM_SALES_FORECAST_1
      ORDER BY CASE_ID;

    In this step, the prediction shows amount sold along with the case_id. The predictions display upper and lower confidence bounds showing that the estimates can vary between those values.

    Examine the statement:
    • TO_CHAR(CASE_ID,'YYYY-MON') DATE_ID: The DATE_ID column has timestamp or case_id extracted in year-month (yyyy-mon) format.
    • round(VALUE,2) ACTUAL_SOLD: Specifies the AMOUNT_SOLD value as ACTUAL_SOLD rounded to two numericals after the decimal.
    • round(PREDICTION,2) FORECAST_SOLD: Specifies the predicted value as FORECAST_SOLD rounded to two numericals after the decimal.
    • round(LOWER,2) LOWER_BOUND, round(UPPER,2) UPPER_BOUND: Specifies the lower and upper confidence levels rounded to two numericals after the decimal.
    
    DATE_ID    ACTUAL_SOLD   FORECAST_SOLD   LOWER_BOUND   UPPER_BOUND   
    1998-JAN         6480684      6452375.75                             
    1998-APR      5593994.14      5848724.79                             
    1998-JUL       6071823.1      6214546.31                             
    1998-OCT      5937413.71      5869219.42                             
    1999-JAN      6093747.21      6132016.41                             
    1999-APR      4925471.63      5385954.08                             
    1999-JUL      5827050.15      5350240.25                             
    1999-OCT      5373678.67      5304626.05                             
    2000-JAN      5984889.49      5541123.24                             
    2000-APR      5371730.92       5236126.1                             
    2000-JUL      6121239.29      5955258.74                             
    2000-OCT      6287646.92       6089446.4                             
    2001-JAN      6547097.44      6837567.17                             
    2001-APR      6922468.39      6188944.05                             
    
    DATE_ID    ACTUAL_SOLD   FORECAST_SOLD   LOWER_BOUND   UPPER_BOUND   
    2001-JUL      7195998.63      7663836.77                             
    2001-OCT      7470897.52      7573926.96                             
    2002-JAN                      8232820.51    7360847.49    9104793.54 
    2002-APR                      7642694.94    6584565.24    8700824.63 
    2002-JUL                      8648402.54    7019914.28   10276890.81 
    2002-OCT                      8692842.46    6523676.33    10862008.6 
    
    
    20 rows selected. 
    
    
    ---------------------------
    
    
  2. To see a visual representation of the predictions in OML Notebooks, run the above same query with the following settings:

    Click settings and drag DATE_ID to keys and FORECASTED_SOLD (avg), ACTUAL_SOLD (avge), LOWER_BOUND (avg), and UPPER_BOUND(avg) to values.

    %sql
    SELECT TO_CHAR(CASE_ID,'YYYY-MON') DATE_ID, VALUE ACTUAL_SOLD,
           round(PREDICTION,2) FORECAST_SOLD,
           round(LOWER,2) LOWER_BOUND, round(UPPER,2) UPPER_BOUND
      FROM DM$VPESM_SALES_FORECAST_1
      ORDER BY CASE_ID;

This completes the prediction step. The model has successfully forecast sales for the next four quarters. This helps in tracking the sales and also gives us an idea on stocking our products.