3.1 Regression Use Case

The Brooklyn housing dataset contains the sale prices of homes in the Brooklyn borough, along with various factors that influence these prices, such as the area of the house, its location, and the type of dwelling. You are tasked with analyzing years of historical home sales data to estimate sale prices, which will help optimize real estate operations. In this case study, you will learn how to predict sale prices using the regression technique and the GLM algorithm.

Related Contents

Topic                                           Link
OML4Py GitHub Example                           OML4Py Regression GLM
About Generalized Linear Model                  About Generalized Linear Model
About Machine Learning Classes and Algorithms   About Machine Learning Classes and Algorithms
Shared Settings                                 Shared Settings

Before you start your OML4Py use case journey, ensure that you have completed the prerequisites described in Shared Settings.

3.1.1 Load Data

Load the data in your database and examine the data set and its attributes.

You can download the dataset from Rolling sales data for the Brooklyn borough. If you are using the Oracle Autonomous AI Database, you will upload files to the Oracle Cloud Infrastructure (OCI) Object Storage, create a sample table, load data into the sample table from files on the OCI Object Storage, and explore the data.

Examine Data

There are 110 attributes in the dataset; below are descriptions of a few important ones. For a complete description of the attributes, see Rolling sales data.

Attribute Name           Information
Borough                  The borough in which the tax lot is located.
BoroCode                 The borough in which the tax lot is located.
Year Built               The year construction of the building was completed.
Zip Code                 A ZIP code that is valid for one of the addresses assigned to the tax lot.
Address                  An address for the tax lot.
BUILDING CLASS           A code describing the major use of structures on the tax lot.
HEALTH CENTER DISTRICT   The health center district in which the tax lot is located. Thirty health center districts were created by the City in 1930 to conduct neighborhood-focused health interventions.
LAND USE CATEGORY        A code for the tax lot's land use category.
LOT AREA                 Total area of the tax lot, expressed in square feet rounded to the nearest integer.
Building Area            The total gross area in square feet. Same as 'gross_sqft'.
Gross Square Feet        The total area of all the floors of a building as measured from the exterior surfaces of the outside walls of the building, including the land area and space within any building or structure on the property.
Year Built               The year the structure on the property was built.
Sale Price               The price paid for the property.

3.1.1.1 Import Data

Import data into the Oracle AI Database by using Object Storage (for Cloud).

If you are using a cloud account, one method of importing the data is through Object Storage. Upload the data set to an Object Storage bucket; the Object Storage URI is used in a later procedure. You can load data into your Oracle Autonomous AI Database (Autonomous Data Warehouse [ADW] or Autonomous Transaction Processing [ATP]) using Oracle AI Database tools, as well as Oracle and third-party data integration tools. You can load data:
  • from local files in your client computer, or
  • from files stored in a cloud-based object store

Follow the steps to upload your data file to the Object Storage bucket.

  1. Log in to your cloud account.
  2. Click the left-side hamburger menu and select Storage from the menu.
  3. Select Buckets from the Object Storage & Archive Storage option.
  4. Select the compartment in which you want to upload the data.
  5. Click Create Bucket.
  6. Enter a name for your bucket. For example, Bucket1. Leave the rest of the fields as default.
  7. Click Create.
  8. Click on the bucket that you created. Scroll down and click Upload under Objects.
  9. Leave the Object Name Prefix field blank. Click select files to navigate to the data file that you want to upload, or drag and drop the data file. In this use case, select the modified .csv file.
  10. Click Upload. The data file appears under Objects.
  11. Click the ellipsis on the right side of the data file to view the menu. Click View Object Details.
  12. Copy the URL PATH (URI) to a text file. This URI is used in the DBMS_CLOUD.COPY_DATA procedure.

This procedure creates an Object Storage bucket containing the data file in your cloud account.

Create Auth Token

The Auth Token is required in the DBMS_CLOUD.CREATE_CREDENTIAL procedure. You can generate the Auth Token in your cloud account.

  1. Log in to your ADW Cloud account.
  2. Hover your mouse cursor over the human figure icon at the top right of the console and click User Settings from the drop-down menu.
  3. Click Auth Tokens under Resources on the left of the console.
  4. Click Generate Token. A pop-up dialog appears.
  5. Enter a description (optional).
  6. Click Generate Token.
  7. Copy the generated token to a text file. The token does not appear again.
  8. Click Close.

Create Object Storage Credential

The object storage credential is used in the DBMS_CLOUD.COPY_DATA procedure.

  1. Log in to the OML Notebooks page and create a notebook. See Create a Notebook Classic.
  2. Open the notebook that you just created.
  3. Enter the following query to create an object storage credential:
    %script
    begin
      DBMS_CLOUD.create_credential (
        credential_name => 'CRED',
        username => '<your cloud account username>',
        password => '<your Auth Token>'
      );
    end;
    /
    ---------------------------
    PL/SQL procedure successfully completed.
    ---------------------------

    Examine the query:

    • credential_name: The name of the credential to be stored. Provide any name. Here, CRED is the name given.
    • username: This is your cloud account username.
    • password: Enter your Auth Token password that you copied after generating the Auth Token.
  4. Click the play icon to run the query in your notebook. Your credentials are stored in the ADW user schema.
  5. In another paragraph, run the following query to check the user credentials:
    SELECT * FROM USER_CREDENTIALS;
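
After the credential is created, you can load the data file from Object Storage into a database table with the DBMS_CLOUD.COPY_DATA procedure referenced earlier. The following is a minimal sketch and not part of the original steps; the target table BROOKLYN_RAW is a placeholder that must already exist with columns matching the .csv file, and <your Object Storage URI> is the URI that you copied from the object details.

%script
begin
  DBMS_CLOUD.COPY_DATA (
    table_name      => 'BROOKLYN_RAW',               -- placeholder target table; create it first
    credential_name => 'CRED',                       -- credential created above
    file_uri_list   => '<your Object Storage URI>',  -- URI copied from View Object Details
    format          => json_object('type' value 'csv', 'skipheaders' value '1')
  );
end;
/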

3.1.2 Explore Data

Explore the data to understand and assess its quality. At this stage, identify the data types and any noise in the data, and look for missing values and numeric outliers.

Data Understanding and Preparation

This stage focuses on building a clear understanding of the dataset through the following steps:

  • Import necessary libraries: Load essential Python libraries along with Oracle Machine Learning (OML).
  • Load the dataset: Import the dataset for initial exploration.
  • Create a DataFrame proxy object: Represent the table using a proxy object to simplify data manipulation.
  • Perform initial analysis: Examine the dataset's structure, including its shape, data types, missing values, and categorical feature cardinality.

These steps provide a solid foundation for deeper data exploration and preprocessing.

For data understanding and preparation, run the following steps:

  1. Import necessary libraries

    Run the following script in a %python interpreter paragraph to import the oml module, the pandas module, and other supporting libraries, and to set the display options:
    
    import oml
    import ssl
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
    
    # Set pandas display options (example setting: show all columns)
    pd.set_option('display.max_columns', None)
  2. Load the dataset

    
    url="https://objectstorage.us-ashburn-1.oraclecloud.com/n/adwc4pm/b/OML_Data/o/brooklyn_sales.csv"
    ssl._create_default_https_context = ssl._create_unverified_context
    brooklyn_sales = pd.read_csv(url, engine='python')
    z.show(brooklyn_sales.head())

    Figure 3-1 Raw Brooklyn Data


    This screenshot displays the initial rows of a raw dataset from Brooklyn. The data is unprocessed and includes various attributes, such as ID, borough, etc.

  3. Replace missing values (NaN with None)

    
    df = brooklyn_sales.apply(lambda x: x.replace(np.nan, None) if x.dtypes == 'object' else x)
  4. Impute missing values based on data type

    First, the code removes any columns that are entirely empty or contain only missing values. Then, it goes through each remaining column, checks the data type of the non-missing values, and fills in any missing data with the most appropriate replacement based on the column's type.
     
    # Drop columns where all values are missing
    brooklyn_sales1 = brooklyn_sales.dropna(axis=1, how="all")
    
    d = {}
    
    # Iterate through each column and fill missing values based on its data type
    for col in brooklyn_sales1:
        x = brooklyn_sales1[col].dropna().tolist()  # Get non-null values to check the column type
        
        if len(x) > 0:  
            # For text columns, replace missing values with an empty string
            if isinstance(x[0], str):
                y = brooklyn_sales1[col].fillna("")
            # For integer columns, missing values are left unchanged
            elif isinstance(x[0], int):
                y = brooklyn_sales1[col]
            # For other numeric columns (e.g., floats), replace missing values with 0.0
            else:
                y = brooklyn_sales1[col].fillna(float(0))
            
            d[col] = y  # Store the modified column
    
    # Convert the dictionary back into a DataFrame
    brooklyn_sales2 = pd.DataFrame.from_dict(d)
    
    # Print the shape of the updated DataFrame
    print(brooklyn_sales2.shape)
    (390883, 110)
  5. Create dataframe proxy object

    
    try:
        oml.drop(table = 'BROOKLYN')
    except:
        pass
    
    # Create a persistent table named BROOKLYN in the Oracle AI Database
    BROOKLYN = oml.create(brooklyn_sales2, table="BROOKLYN")
  6. Analyze the dataframe

    Examine and interpret the shape, the data types, and missing values, and find columns with low cardinality.
    • Shape of DataFrame:
      BROOKLYN.shape
      (390883, 110)
    • Data Types of Columns:
      BROOKLYN.dtypes
      ID                         <class 'oml.core.integer.Integer'>
      borough                    <class 'oml.core.integer.Integer'>
      neighborhood                 <class 'oml.core.string.String'>
      building_class_category      <class 'oml.core.string.String'>
      tax_class                    <class 'oml.core.string.String'>
                                                ...                
      PFIRM15_FL                     <class 'oml.core.float.Float'>
      Version                      <class 'oml.core.string.String'>
      MAPPLUTO_F                     <class 'oml.core.float.Float'>
      SHAPE_Leng                     <class 'oml.core.float.Float'>
      SHAPE_Area                     <class 'oml.core.float.Float'>
      Length: 110, dtype: object
    • Identify columns with missing values (75% or more):
      
      def percent_missing(dat):
          # Return columns with 75% or more missing values and their missing percentage
          per_miss = {}
          n_rows = len(dat)
          for col in dat.columns:
              pct = 100 - (dat[col].count() / n_rows) * 100
              if pct >= 75:
                  per_miss[col] = round(pct)
          return per_miss
          
      z.show(pd.DataFrame(list(percent_missing(BROOKLYN).items()), columns=["Columns", "% Missing"]))

      Figure 3-2 View columns and their missing percentage.


      This screenshot shows columns that have 75 percent or more missing values.

    • Identify columns with low cardinality:
      
      def unique_values_less_10(data):
          cols=[]
          for x in data.columns:
              unique_values=data[x].nunique()
              if unique_values < 10:
                  cols.append(x)
          return cols
      
      print(unique_values_less_10(BROOKLYN)) 
      ['borough', 'tax_class_at_sale', 'Borough', 'SanitBoro', 'ZoneDist4', 'Overlay1', 'Overlay2',
            'SPDist2', 'SPDist3', 'LtdHeight', 'SplitZone', 'Easements', 'OwnerType', 'AreaSource', 'Ext',
            'ProxCode', 'IrrLotCode', 'BsmtCode', 'BoroCode', 'CondoNo', 'ZMCode', 'PLUTOMapID',
            'FIRM07_FLA', 'PFIRM15_FL', 'Version', 'MAPPLUTO_F']
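
If you prefer to drop the sparsely populated columns directly rather than selecting a subset of columns (as done in the next section), a minimal sketch using the percent_missing helper defined above could look like the following. The proxy object name BROOKLYN_REDUCED is a placeholder and is not used in the rest of this use case.

# Drop columns that have 75% or more missing values (alternative to the
# explicit column selection used in the Data Preparation section)
high_missing_cols = list(percent_missing(BROOKLYN).keys())
BROOKLYN_REDUCED = BROOKLYN.drop(high_missing_cols)
print(BROOKLYN_REDUCED.shape)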

3.1.2.1 Data Preparation

Data preparation is the process of cleaning (handling missing values, outliers, and inconsistencies), transforming (scaling, encoding, and creating new features) and organizing raw data to make it more compatible with machine learning algorithms.

Data Preparation

This stage focuses on building a clear understanding of the dataset through the following steps:

  • Redundant Columns: Identify and remove duplicate records, redundant columns, and highly correlated columns.
  • Data Subset Creation: Filter/select relevant columns.
  • Visualizations: Generate visual representations to understand data patterns and relationships.
  • Feature Engineering: Create new features like "Decade Built", "Sale Age", and "Street Address".
  • Clean the DataFrame: Handle missing values, outliers, and inconsistencies.
  • Dataframe Dimensions: Ensure correct dimensions after cleaning.

These steps provide a solid foundation for data preparation.

Redundant Columns: Cleaning Up the Data

To identify redundant columns, compare column names and descriptions for clues, then analyze the data within the columns to see whether the values are identical or highly correlated. Once you identify the redundant columns, decide which one to remove: typically the column with more missing values, inconsistent formatting, or values that are mostly constant.

Analyze Columns with Similar Naming Conventions

Analyze and compare pairs or groups of columns that appear similar due to their content or due to variations in naming conventions or formatting. The goal is to ensure data consistency, detect redundancy, and verify whether the columns represent the same information.

The following columns are compared:

  • borough, Borough, and BoroCode
  • YearBuilt and year_built
  • ZipCode and zip_code

Similarly, this can be applied to other columns, such as building_class, BldgClass and building_class_at_sale; Address and address; etc.

  1. Compare columns: borough, Borough, and BoroCode

    The column names “borough,” “Borough,” and “BoroCode” are quite similar, suggesting they may contain the same information. To verify this, print five random samples from each column for comparison. Additionally, based on the column descriptions, all three columns represent the Brooklyn borough, with values such as “BK” or “3.0.” Since the data is specific to Brooklyn, these columns are redundant and can be safely removed from the dataset.

    
    z.show(BROOKLYN[['borough','Borough', 'BoroCode']].sample(n=5))

    Sample data from columns borough, Borough, and BoroCode, displaying 5 random rows.

  2. Identify Matching Data Percentage

    def matching_percentage(data, cols, threshold=0.9):
        # Filter rows where both columns have non-zero values
        filtered_df = data[(data[cols[0]] != 0) & (data[cols[1]] != 0)]
        
        # Calculate matching percentage
        total_rows = len(filtered_df)
        matching_rows = len(filtered_df[filtered_df[cols[0]] == filtered_df[cols[1]]])
        matching_percentage = matching_rows / total_rows if total_rows else 0
        
        # Output result
        if matching_percentage >= threshold:
            print(f"The columns have a high percentage ({matching_percentage * 100:.2f}%) of matching values, suggesting similarity.")
        else:
            print("The columns do not have a high percentage of matching values.")
  3. Compare columns: YearBuilt and year_built

    The column names "YearBuilt" and "year_built" are quite similar, suggesting they may contain the same information. To verify this, we should print out 5 random samples from each column.

    
    matching_percentage(BROOKLYN, ['YearBuilt', 'year_built'])
    The columns have a high percentage (99.25%) of matching values, suggesting similarity.

    The columns "YearBuilt" and "year_built" contain similar information, so remove one of them. Remove "year_built" from the dataset.

  4. Compare columns: ZipCode and zip_code

    
    matching_percentage(BROOKLYN, ['ZipCode', 'zip_code'])
    The columns have a high percentage (98.72%) of matching values, suggesting similarity.

    The column names "ZipCode" and "zip_code" are quite similar, suggesting they may contain the same information. To verify this further, print out 5 random samples from each column.

    
    z.show(BROOKLYN[['ZipCode', 'zip_code']].sample(n=5))

    Sample data from columns ZipCode and zip_code, displaying 5 random rows.

Filter data by selecting the desired columns

To focus on more reliable data, only houses built in 1800 or later are included in the dataset. This is because houses recorded as built earlier frequently have a YearBuilt value of 0, indicating potentially missing or inaccurate information; this is explored further in the next step.

BROOKLYN2 = BROOKLYN[(BROOKLYN['YearBuilt']>= 1800)][['building_class_at_sale', 'HealthCent', 'YearBuilt', 'ResidFAR', 
'sale_date', 'building_class_category', 'GarageArea', 'CD', 'YearAlter1', 'ID', 'SchoolDist', 'SanitDistr', 'PolicePrct','address',
'CT2010', 'commercial_units', 'BldgArea','NumFloors', 'sale_price','AssessTot', 'ResArea','land_sqft','LotFront', 
'LotArea','AssessLand', 'SHAPE_Area','year_of_sale', 'gross_sqft','XCoord','YCoord', 'SHAPE_Leng']]

# Dataframe dimension
BROOKLYN2.shape
(295356, 31)

Feature Engineering and Visualization

Create new columns and modify existing features based on insights gathered from visualizations. The newly engineered features are then merged back into the dataset to enhance its quality and readiness for modeling.

  1. Built periods and their counts:

    • Analyze the distribution of the periods in which the houses were built and identify the least and most common periods within the dataset. The YearBuilt column is first rounded to the nearest decade. These rounded values are then categorized into predefined intervals, or bins. Count the number of YearBuilt values within each bin to determine the frequency distribution of built periods.

      built_period = (BROOKLYN2['YearBuilt'] // 10) * 10 + oml.Integer(BROOKLYN2['YearBuilt'] % 10 >= 5) * 10
      bins_str = built_period.cut(bins=[1700,1800,1880,1900,1920,1940,1960,1980,2000,2020,2040])
      bins = sorted(bins_str.drop_duplicates().pull())
       
      z.show(pd.DataFrame({'Built Period':bins, 'Count':[oml.Integer(bins_str == b).sum() for b in bins]}))

      The count of category occurrences for column, built period.

    • Visualization of built periods and their counts

      # Data might be incomplete when DECADE_BUILT < 1900
       
      Nbins = 141
      n, bins, patches = plt.hist(built_period.pull(), Nbins)
      plt.xlabel('Built_Period')
      plt.ylabel('number of records')
      plt.yscale('log')
      p = plt.xlim(1795, 2025)

      Visual for the count of category occurrences for column, built period.

  2. Preview sale price and count by binning

    • Analyze the distribution of house sale prices and identify the least and most common sale prices within the dataset. The sale_price column values are categorized into predefined intervals, or bins. Count the number of sale_price values within each bin to determine the frequency distribution of sale_price.

      Sale_Price=(BROOKLYN2['sale_price'].cut(bins=[-100000000,0,20000,40000,60000,80000,100000,1000000,10000000,500000000]))
      
      # bins_str = decade_built.cut(bins=[1700,1800,1880,1900,1920,1940,1960,1980,2000,2020,2040])
      bins = sorted(Sale_Price.drop_duplicates().pull())
       
      z.show(pd.DataFrame({'Sale Price':bins, 'Count':[oml.Integer(Sale_Price == b).sum() for b in bins]})) 

      This table shows the count of occurrences for each sale price bin.

    • Examine logarithmic sales price distribution

      
      # Most properties have 10^5 < sale_price < 10^6.5=3.2M
      Nbins = 101
      
      n, bins, patches = plt.hist((BROOKLYN2[BROOKLYN2['sale_price']>0]['sale_price']).log(10).pull(), Nbins)
      plt.xlabel('log10(sale_price)')
      plt.ylabel('number of records')
      p = plt.xlim(3.9, 7.1)

      The visual shows the count of occurrences for column, sales price.

  3. Preview building class category, count and count percentage

    Analyze the categorical column, building_class_category, by computing a cross-tabulation. Sort this table in descending order. Finally, determine the frequency of each building class category.

    build_category= BROOKLYN2.crosstab('building_class_category').sort_values('count', ascending=False)
    count_percentage= ((build_category['count'] / len(BROOKLYN2)) * 100).round(decimals=2)
    z.show(build_category.concat({'count_percentage':count_percentage}))

    This figure shows the count of occurrences for each category.

  4. Examine logarithmic gross sqft distribution

    Values in the gross_sqft column, when log-transformed, approximate a normal distribution.

    # Most properties have 10^2.9=800 < gross_sqft < 10^3.7=5000
    Nbins = 201
    
    n, bins, patches = plt.hist((BROOKLYN2[BROOKLYN2['gross_sqft']>0]['gross_sqft']).log(10).pull(), Nbins)
    plt.xlabel('log10(gross_sqft)')
    plt.ylabel('number of records')
    p = plt.xlim(2.5, 4.2)

    Shows the logarithmic gross sqft distribution

Feature Engineering

Feature engineering is the process of creating new input features from existing data to capture the underlying patterns in the data. These new features help improve the model's predictive performance.

The following features have been engineered:

  • Built Period: The Period in which the house was built.
    built_period=(BROOKLYN2['YearBuilt'] // 10) * 10 + oml.Integer(BROOKLYN2['YearBuilt'] % 10 >= 5) * 10
    BROOKLYN2=BROOKLYN2.concat({'Built_Period':built_period})
  • Age_At_Sale: Age of the house at sale is the number of years from its construction to the sale date.
    Age_At_Sale2  = abs(BROOKLYN2['year_of_sale'] - BROOKLYN2['YearBuilt'])
    BROOKLYN2= BROOKLYN2.concat({'Age_At_Sale2': Age_At_Sale2})
  • Quarter: Refers to the quarter of the year in which the house was sold, derived from the sale date.
    time_period = oml.Datetime.strptime(BROOKLYN2['sale_date'], format="MM/DD/YYYY")
    Quarter= (oml.Integer((time_period.month - 1)// 3 + 1))

Clean the Dataframe

After feature engineering, remove columns that no longer contribute to the analysis and drop any rows that have a missing value.

BROOKLYN2=BROOKLYN2.drop(['sale_date'])
BROOKLYN2=BROOKLYN2.dropna()

Filter the data to include properties whose sale price, gross square footage, and decade of construction fall within a specific, relevant range.

BROOKLYN3 = BROOKLYN2[(BROOKLYN2['sale_price']>=1.0e5) & (BROOKLYN2['sale_price']<=5.0e6) & 
            (BROOKLYN2['gross_sqft']>=800) & (BROOKLYN2['gross_sqft']<=5000) & 
            (BROOKLYN2['Built_Period']>=1900) & (BROOKLYN2['Built_Period']<=2010) ]

Apply a log transformation to normalize the gross_sqft column. The original, untransformed column is then removed.

BROOKLYN3 = BROOKLYN3.concat({'log_gross_sqft': BROOKLYN3['gross_sqft'].log(10)})
BROOKLYN3=BROOKLYN3.drop(['gross_sqft'])

To improve the model's performance, filter the data to focus on properties with higher probabilities of belonging to specific building classes and categories.

BROOKLYN4 = BROOKLYN3[(BROOKLYN3['building_class_at_sale']=='A1') | (BROOKLYN3['building_class_at_sale']=='A2') 
            | (BROOKLYN3['building_class_at_sale']=='A4') | (BROOKLYN3['building_class_at_sale']=='A5') 
            | (BROOKLYN3['building_class_at_sale']=='B1') | (BROOKLYN3['building_class_at_sale']=='B2') 
            |  (BROOKLYN3['building_class_at_sale']=='B3') ]
BROOKLYN5 = BROOKLYN4[((BROOKLYN4['building_class_category']=='02 TWO FAMILY HOMES') | (BROOKLYN4['building_class_category']=='01 ONE FAMILY HOMES'))]

Dataframe Dimensions

Run the following script to verify the dataframe dimensions:

BROOKLYN5.shape
(67083, 32)
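
Optionally, you can persist the prepared data set as a database table so that it can be reused in later notebook sessions without repeating the preparation steps. This is a minimal sketch and not part of the original workflow; the table name BROOKLYN_PREPARED is a placeholder.

# Persist the prepared proxy object as a permanent table (optional)
try:
    oml.drop(table = 'BROOKLYN_PREPARED')
except:
    pass
_ = BROOKLYN5.materialize(table = 'BROOKLYN_PREPARED')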

3.1.3 Build Model

Build your model using the training data set. Use the oml.glm function to build your model and specify model settings.

For supervised learning techniques such as regression, split the data into training and test sets before creating the model. Although you can use the entire data set to build a model, it is difficult to validate the model unless new data sets are available. Therefore, to evaluate the model and accurately assess its performance, you generally split the data into training and test sets. You use the training data set to train the model and the test data set to test the model's accuracy by running prediction queries. The test data set already contains known values for the attribute that you want to predict, so it is easy to determine whether the model's predictions are correct.

Algorithm Selection

Before you build a model, choose the suitable algorithm. You can choose one of the following algorithms to solve a regression problem:

  • Extreme Gradient Boosting
  • Generalized Linear Model
  • Neural Network
  • Support Vector Machine

When you want to understand a data set, you generally start with a simple, easily interpretable baseline model. The Generalized Linear Model algorithm is a good choice because it fits a linear relationship between the features and the target, which makes it simple to interpret. The results of the linear model give you an initial understanding of a new data set.

The following steps guide you to split your data and build your model with the selected algorithm.

  1. Split Data: Train/Test:

    Split the data into training and test sets with an 80/20 ratio. The seed parameter makes the random split reproducible. The split method splits the data referenced by the DataFrame proxy object BROOKLYN5 into two new DataFrame proxy objects, TRAIN and TEST.
    TRAIN, TEST = BROOKLYN5.split(ratio = (0.8,0.2), seed=15)
    TRAIN_X = TRAIN.drop('sale_price')
    TRAIN_Y = TRAIN['sale_price']
    TEST_X = TEST
    TEST_Y = TEST['sale_price']
  2. Model Building:

    Specify the model settings and build a Generalized Linear Model (GLM) object for predicting the sale_price attribute by running the following script. The settings are passed as a dictionary of key-value pairs, where each key is a parameter name and each value is its setting.

    try:
        oml.drop(model = 'BROOKLYN_GLM_REGRESSION_MODEL')
    except:
        print('No such model')
     
    setting = {'PREP_AUTO':'ON',
               'GLMS_ROW_DIAGNOSTICS':'GLMS_ROW_DIAG_ENABLE',
               'GLMS_FTR_SELECTION':'GLMS_FTR_SELECTION_ENABLE',
               'GLMS_FTR_GENERATION':'GLMS_FTR_GENERATION_ENABLE'}
                
    glm_mod = oml.glm("regression", **setting)
    glm_mod = glm_mod.fit(TRAIN_X,TRAIN_Y,model_name = 'BROOKLYN_GLM_REGRESSION_MODEL',case_id = 'ID')
    

    Model setting parameters:

    • PREP_AUTO: Used to specify fully automated or user-directed general data preparation. By default, automatic data preparation is enabled ('PREP_AUTO': 'PREP_AUTO_ON').
    • GLMS_ROW_DIAGNOSTICS: Enables or disables the row diagnostics. By default, row diagnostics are disabled.
    • GLMS_FTR_SELECTION: Enables or disables feature selection for GLM. By default, feature selection is not enabled.
    • GLMS_FTR_GENERATION: Specifies whether or not feature generation is enabled for GLM. By default, feature generation is not enabled.

      Note:

      Feature generation can only be enabled when feature selection is also enabled.
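
Optionally, you can confirm that the model was created in your schema by querying the USER_MINING_MODELS data dictionary view in a SQL paragraph. This is a minimal sketch and not part of the original workflow.

%sql
SELECT model_name, mining_function, algorithm
FROM   user_mining_models
WHERE  model_name = 'BROOKLYN_GLM_REGRESSION_MODEL'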

3.1.4 Evaluate

Before you make predictions using your model on new data, you should first evaluate model accuracy. You can evaluate the model using different methods.

Information about Model settings

Evaluate the model by examining the various statistics generated after building the model. The statistics indicate the model's quality.

  • Model details: Run the following script to display the model details available through the GLM model object, such as the model settings, attribute coefficients, and fit details.
    glm_mod

    They can also be displayed and viewed individually as shown below.

  • Attribute Coefficients: Run the following script to display the model's attribute coefficients.
    z.show(glm_mod.coef.round(2).head())

    Shows the attribute coefficients.

  • Fit Details: Run the following script to display the fit details of the model.
    z.show(glm_mod.fit_details.round(2).head())

    Shows the Fit Details.

Score

Scoring is the process of applying the model to the test data to assess its performance.

  1. Predict sale price: Use the model to make predictions on test data.
    BROOKLYN_RES = glm_mod.predict(TEST.drop('sale_price'), supplemental_cols = TEST[:,['ID','sale_price']])
    z.show(BROOKLYN_RES.round(2).head())

    Shows the predicted sale prices along with the ID and actual sale_price columns.

  2. Evaluate Model Performance: Evaluate the model's performance by using the score function.
    model_coef = len(glm_mod.coef)
    no_rows_test = TEST_X.shape[0]
    R_squared = glm_mod.score(TEST_X, TEST_Y).round(3)
    
    print(
        f"RMSE : {(((BROOKLYN_RES['PREDICTION'] - BROOKLYN_RES['sale_price']) ** 2).mean() ** .5).round(2)}\n"
        f"MAE: {((abs(BROOKLYN_RES['PREDICTION'] - BROOKLYN_RES['sale_price'])).mean()).round(2)}\n"
        f"R squared: {R_squared}\n"
        f"Adjusted R^2: {(1 - ( 1 - R_squared)*(no_rows_test-1)/(no_rows_test - model_coef -1)).round(4)}"
    )

    The interpretation of the model's performance based on these metrics is as follows:

    • RMSE (286,137.56): The model's predictions, on average, deviate by approximately 286,137.56 units from the actual values.
    • MAE (170,992.05): The average absolute error in predictions is 170,992.05 units, which provides a sense of the model's accuracy without penalizing large errors.
    • R² (0.581): The model explains 58.1% of the variance in the target variable, suggesting a moderate fit to the data.
    • Adjusted R² (0.5604): After adjusting for the number of predictors, the model explains about 56.04% of the variance.

    Overall, the model demonstrates moderate predictive accuracy, with potential for further improvement.

  3. Residual Graph: The residual plot shows a heteroscedastic pattern, which indicates that the model's errors are inconsistent (their spread changes with the predicted value). The non-linear relationship between the fitted (predicted) values and the residuals suggests that the model can be improved further.
    import matplotlib.pyplot as plt
     
    y_pred = BROOKLYN_RES['PREDICTION'].pull()
    residuals = (BROOKLYN_RES['sale_price'] - BROOKLYN_RES['PREDICTION']).pull()
     
    plt.scatter(y_pred, residuals)
     
    plt.xlabel('Predicted Sale Prices')
    plt.ylabel('Residuals')
    plt.title('Residual Plot', fontsize=16)
    plt.grid(True)
    plt.axhline(y=0, color='r', linestyle='--')
    plt.show()

    Visual for Residual Graph
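
As a complementary check (not part of the original workflow), you can also plot predicted against actual sale prices; points close to the diagonal line indicate accurate predictions. The following is a minimal sketch that reuses the BROOKLYN_RES proxy object from the scoring step.

# Predicted vs. actual sale prices (optional diagnostic)
y_true = BROOKLYN_RES['sale_price'].pull()
y_pred = BROOKLYN_RES['PREDICTION'].pull()

plt.scatter(y_true, y_pred, s=2)
plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--')  # ideal fit reference line
plt.xlabel('Actual Sale Price')
plt.ylabel('Predicted Sale Price')
plt.title('Predicted vs. Actual Sale Price', fontsize=16)
plt.grid(True)
plt.show()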

SQL Interface to Score Data and Display Prediction Details

You can score data and make predictions using the SQL interface. The test data is materialized into the table BROOKLYN_GLM_TEST_DATA so that you can query it using SQL. The materialize method writes the contents of an Oracle Machine Learning oml.DataFrame proxy object to a table in the Oracle AI Database.

  1. Materialize BROOKLYN_GLM_TEST_DATA for use in query below:
    try:
        oml.drop(table = 'BROOKLYN_GLM_TEST_DATA')
    except:
        pass
    _ = TEST.materialize(table = 'BROOKLYN_GLM_TEST_DATA')
  2. Display prediction with explanatory prediction details in SQL: Use the following SQL command to score the data and display the prediction details. The prediction functions apply the GLM regression model named BROOKLYN_GLM_REGRESSION_MODEL to the data from the materialized table BROOKLYN_GLM_TEST_DATA. The query includes information about the predictors that have the greatest influence on each prediction.

    SELECT ID,
           round(PREDICTION_YRS_RES,3) PRED_YRS_RES,
           round(PRED_LOWER_LIMIT,1) LOWER_BOUND,
           round(PRED_UPPER_LIMIT,1) UPPER_BOUND,
           RTRIM(TRIM(SUBSTR(OUTPRED."Attribute1",17,100)),'rank="1"/>') FIRST_ATTRIBUTE,
           RTRIM(TRIM(SUBSTR(OUTPRED."Attribute2",17,100)),'rank="2"/>') SECOND_ATTRIBUTE,
           RTRIM(TRIM(SUBSTR(OUTPRED."Attribute3",17,100)),'rank="3"/>') THIRD_ATTRIBUTE
    FROM (SELECT ID,
                 PREDICTION(BROOKLYN_GLM_REGRESSION_MODEL USING *) PREDICTION_YRS_RES,
                 PREDICTION_BOUNDS(BROOKLYN_GLM_REGRESSION_MODEL USING *).LOWER PRED_LOWER_LIMIT,
                 PREDICTION_BOUNDS(BROOKLYN_GLM_REGRESSION_MODEL USING *).UPPER PRED_UPPER_LIMIT,
                 PREDICTION_DETAILS(BROOKLYN_GLM_REGRESSION_MODEL USING *) PD
          FROM BROOKLYN_GLM_TEST_DATA
          WHERE ID < 100015
          ORDER BY ID) OUT,
         XMLTABLE('/Details'
                  PASSING OUT.PD
                  COLUMNS 
                  "Attribute1" XMLType PATH 'Attribute[1]',
                  "Attribute2" XMLType PATH 'Attribute[2]',
                   "Attribute3" XMLType PATH 'Attribute[3]') OUTPRED

    Displays the prediction with explanatory prediction details in SQL

To conclude, you have successfully predicted house sale prices in Brooklyn using the Generalized Linear Model algorithm.