3.1 Regression Use Case

The Brooklyn housing dataset contains the sale prices of homes in the Brooklyn borough, along with various factors that influence these prices, such as the area of the house, its location, and the type of dwelling. You are tasked with analyzing years of historical home sales data to estimate sale prices, which will help optimize real estate operations. In this case study, you will learn how to predict sale prices using the regression technique and the GLM algorithm.

Related Contents

Topic                                           Link
OML4Py GitHub Example                           OML4Py Regression GLM
About Generalized Linear Model                  About Generalized Linear Model
About Machine Learning Classes and Algorithms   About Machine Learning Classes and Algorithms
Shared Settings                                 Shared Settings

Before you start your OML4Py use case journey, ensure that you have completed the prerequisites described in Shared Settings.

3.1.1 Load Data

Load the data in your database and examine the data set and its attributes.

You can download the dataset from Rolling sales data for the Brooklyn borough. If you are using the Oracle Autonomous AI Database, you will upload files to the Oracle Cloud Infrastructure (OCI) Object Storage, create a sample table, load data into the sample table from files on the OCI Object Storage, and explore the data.

Examine Data

There are 110 attributes in the dataset; below are descriptions of a few important ones. For a complete description of the attributes, see Rolling sales data.

Attribute Name           Information
Borough                  The borough in which the tax lot is located.
BoroCode                 The borough in which the tax lot is located.
Year Built               The year construction of the building was completed.
Zip Code                 A ZIP code that is valid for one of the addresses assigned to the tax lot.
Address                  An address for the tax lot.
BUILDING CLASS           A code describing the major use of structures on the tax lot.
HEALTH CENTER DISTRICT   The health center district in which the tax lot is located. Thirty health center districts were created by the City in 1930 to conduct neighborhood-focused health interventions.
LAND USE CATEGORY        A code for the tax lot's land use category.
LOT AREA                 Total area of the tax lot, expressed in square feet rounded to the nearest integer.
Building Area            The total gross area in square feet. Same as 'gross_sqft'.
Gross Square Feet        The total area of all the floors of a building as measured from the exterior surfaces of the outside walls of the building, including the land area and space within any building or structure on the property.
Year Built               The year the structure on the property was built.
Sale Price               The price paid for the property.

3.1.1.1 Import Data

Import data into the Oracle AI Database by using Object Storage (for Cloud).

If you are using a cloud account, one method of importing the data is through Object Storage. Upload the data set to an Object Storage bucket; the Object Storage URI is used in a later procedure. You can load data into your Oracle Autonomous AI Database (Autonomous Data Warehouse [ADW] or Autonomous Transaction Processing [ATP]) using Oracle AI Database tools, as well as Oracle and third-party data integration tools. You can load data:
  • from local files in your client computer, or
  • from files stored in a cloud-based object store

Follow the steps to upload your data file to the Object Storage bucket.

  1. Log in to your cloud account.
  2. Click the left-side hamburger menu and select Storage from the menu.
  3. Select Buckets from the Object Storage & Archive Storage option.
  4. Select the compartment in which you want to upload the data.
  5. Click Create Bucket.
  6. Enter a name for your bucket. For example, Bucket1. Leave the rest of the fields as default.
  7. Click Create.
  8. Click on the bucket that you created. Scroll down and click Upload under Objects.
  9. Leave the Object Name Prefix field blank. Click select files to navigate to the data file that you want to upload, or drag and drop the data file. In this use case, select the modified .csv file.
  10. Click Upload. The data file appears under Objects.
  11. Click the ellipsis on the right side of the data file to view the menu. Click View Object Details.
  12. Copy the URL PATH (URI) to a text file. This URI is used in the DBMS_CLOUD.COPY_DATA procedure.

This procedure creates an Object Storage bucket containing the data file in your cloud account.

Create Auth Token

The Auth Token is required in the DBMS_CLOUD.CREATE_CREDENTIAL procedure. You can generate the Auth Token in your cloud account.

  1. Log in to your ADW Cloud account.
  2. Hover your mouse cursor over the human figure icon at the top right of the console and click User Settings from the drop-down menu.
  3. Click Auth Tokens under Resources on the left of the console.
  4. Click Generate Token. A pop-up dialog appears.
  5. Enter a description (optional).
  6. Click Generate Token.
  7. Copy the generated token to a text file. The token does not appear again.
  8. Click Close.

Create Object Storage Credential

The object storage credential is used in the DBMS_CLOUD.COPY_DATA procedure.

  1. Log in to the OML Notebooks page and create a notebook. See Create a Notebook Classic.
  2. Open the notebook that you just created.
  3. Enter the following query to create an object storage credential:
    %script
    begin
      DBMS_CLOUD.create_credential (
        credential_name => 'CRED',
        username => '<your cloud account username>',
        password => '<your Auth Token>'
      );
    end;
    /
    ---------------------------
    PL/SQL procedure successfully completed.
    ---------------------------

    Examine the query:

    • credential_name: The name of the credential to be stored. Provide any name. Here, CRED is the name given.
    • username: This is your cloud account username.
    • password: Enter your Auth Token password that you copied after generating the Auth Token.
  4. Click the play icon to run the query in your notebook. Your credentials are stored in the ADW user schema.
  5. In another paragraph, run the following query to check the user credentials:
    SELECT * FROM USER_CREDENTIALS;
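
After the credential is created, you can load the data file from Object Storage into a database table with the DBMS_CLOUD.COPY_DATA procedure referenced earlier. The following is a minimal sketch and not part of the original steps; the target table BROOKLYN_RAW is a placeholder that must already exist with columns matching the .csv file, and <your Object Storage URI> is the URI that you copied from the object details.

%script
begin
  DBMS_CLOUD.COPY_DATA (
    table_name      => 'BROOKLYN_RAW',               -- placeholder target table; create it first
    credential_name => 'CRED',                       -- credential created above
    file_uri_list   => '<your Object Storage URI>',  -- URI copied from View Object Details
    format          => json_object('type' value 'csv', 'skipheaders' value '1')
  );
end;
/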

3.1.2 Explore Data

Explore the data to understand and assess its quality. At this stage, identify the data types and any noise in the data, and look for missing values and numeric outliers.

Data Understanding and Preparation

This stage focuses on building a clear understanding of the dataset through the following steps:

  • Import necessary libraries: Load essential Python libraries along with Oracle Machine Learning (OML).
  • Load the dataset: Import the dataset for initial exploration.
  • Create a DataFrame proxy object: Represent the table using a proxy object to simplify data manipulation.
  • Perform initial analysis: Examine the dataset's structure, including its shape, data types, missing values, and categorical feature cardinality.

These steps provide a solid foundation for deeper data exploration and preprocessing.

For data understanding and preparation, run the following steps:

  1. Import necessary libraries

    Run the following script in a %python interpreter paragraph to import the oml module, the pandas module, and other supporting libraries, and to set the display options:
    
    import oml
    import ssl
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
    
    # Set pandas display options (example setting: show all columns)
    pd.set_option('display.max_columns', None)
  2. Load the dataset

    
    url="https://objectstorage.us-ashburn-1.oraclecloud.com/n/adwc4pm/b/OML_Data/o/brooklyn_sales.csv"
    ssl._create_default_https_context = ssl._create_unverified_context
    brooklyn_sales = pd.read_csv(url, engine='python')
    z.show(brooklyn_sales.head())

    Figure 3-1 Raw Brooklyn Data


    This screenshot displays the initial rows of a raw dataset from Brooklyn. The data is unprocessed and includes various attributes, such as ID, borough, etc.

  3. Replace missing values (NaN with None)

    
    df = brooklyn_sales.apply(lambda x: x.replace(np.nan, None) if x.dtypes == 'object' else x)
  4. Impute missing values based on data type

    First, the code removes any columns that are entirely empty or contain only missing values. Then, it goes through each remaining column, checks the data type of the non-missing values, and fills in any missing data with the most appropriate replacement based on the column's type.
     
    # Drop columns where all values are missing
    brooklyn_sales1 = brooklyn_sales.dropna(axis=1, how="all")
    
    d = {}
    
    # Iterate through each column and fill missing values based on its data type
    for col in brooklyn_sales1:
        x = brooklyn_sales1[col].dropna().tolist()  # Get non-null values to check the column type
        
        if len(x) > 0:  
            # For text columns, replace missing values with an empty string
            if isinstance(x[0], str):
                y = brooklyn_sales1[col].fillna("")
            # For integer columns, missing values are left unchanged
            elif isinstance(x[0], int):
                y = brooklyn_sales1[col]
            # For other numeric columns (e.g., floats), replace missing values with 0.0
            else:
                y = brooklyn_sales1[col].fillna(float(0))
            
            d[col] = y  # Store the modified column
    
    # Convert the dictionary back into a DataFrame
    brooklyn_sales2 = pd.DataFrame.from_dict(d)
    
    # Print the shape of the updated DataFrame
    print(brooklyn_sales2.shape)
    (390883, 110)
  5. Create dataframe proxy object

    
    try:
        oml.drop(table = 'BROOKLYN')
    except:
        pass
    
    # Create a persistent table named BROOKLYN in the Oracle AI Database
    BROOKLYN = oml.create(brooklyn_sales2, table="BROOKLYN")
  6. Analyze the dataframe

    Examine and interpret the shape, the data types, and missing values, and find columns with low cardinality.
    • Shape of DataFrame:
      BROOKLYN.shape
      (390883, 110)
    • Data Types of Columns:
      BROOKLYN.dtypes
      ID                         <class 'oml.core.integer.Integer'>
      borough                    <class 'oml.core.integer.Integer'>
      neighborhood                 <class 'oml.core.string.String'>
      building_class_category      <class 'oml.core.string.String'>
      tax_class                    <class 'oml.core.string.String'>
                                                ...                
      PFIRM15_FL                     <class 'oml.core.float.Float'>
      Version                      <class 'oml.core.string.String'>
      MAPPLUTO_F                     <class 'oml.core.float.Float'>
      SHAPE_Leng                     <class 'oml.core.float.Float'>
      SHAPE_Area                     <class 'oml.core.float.Float'>
      Length: 110, dtype: object
    • Identify columns with missing values (75% or more):
      
      def percent_missing(dat):
          # Return columns with 75% or more missing values and their missing percentage
          per_miss = {}
          n_rows = len(dat)
          for col in dat.columns:
              pct = 100 - (dat[col].count() / n_rows) * 100
              if pct >= 75:
                  per_miss[col] = round(pct)
          return per_miss
          
      z.show(pd.DataFrame(list(percent_missing(BROOKLYN).items()), columns=["Columns", "% Missing"]))

      Figure 3-2 View columns and their missing percentage.


      This screenshot shows columns that have 75 percent or more missing values.

    • Identify columns with low cardinality:
      
      def unique_values_less_10(data):
          cols=[]
          for x in data.columns:
              unique_values=data[x].nunique()
              if unique_values < 10:
                  cols.append(x)
          return cols
      
      print(unique_values_less_10(BROOKLYN)) 
      ['borough', 'tax_class_at_sale', 'Borough', 'SanitBoro', 'ZoneDist4', 'Overlay1', 'Overlay2',
            'SPDist2', 'SPDist3', 'LtdHeight', 'SplitZone', 'Easements', 'OwnerType', 'AreaSource', 'Ext',
            'ProxCode', 'IrrLotCode', 'BsmtCode', 'BoroCode', 'CondoNo', 'ZMCode', 'PLUTOMapID',
            'FIRM07_FLA', 'PFIRM15_FL', 'Version', 'MAPPLUTO_F']
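
If you prefer to drop the sparsely populated columns directly rather than selecting a subset of columns (as done in the next section), a minimal sketch using the percent_missing helper defined above could look like the following. The proxy object name BROOKLYN_REDUCED is a placeholder and is not used in the rest of this use case.

# Drop columns that have 75% or more missing values (alternative to the
# explicit column selection used in the Data Preparation section)
high_missing_cols = list(percent_missing(BROOKLYN).keys())
BROOKLYN_REDUCED = BROOKLYN.drop(high_missing_cols)
print(BROOKLYN_REDUCED.shape)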

3.1.2.1 Data Preparation

Data preparation is the process of cleaning (handling missing values, outliers, and inconsistencies), transforming (scaling, encoding, and creating new features) and organizing raw data to make it more compatible with machine learning algorithms.

Data Preparation

This stage focuses on building a clear understanding of the dataset through the following steps:

  • Redundant Columns: Identify and remove duplicate records, redundant columns, and highly correlated columns.
  • Data Subset Creation: Filter/select relevant columns.
  • Visualizations: Generate visual representations to understand data patterns and relationships.
  • Feature Engineering: Create new features like "Decade Built", "Sale Age", and "Street Address".
  • Clean the DataFrame: Handle missing values, outliers, and inconsistencies.
  • Dataframe Dimensions: Ensure correct dimensions after cleaning.

These steps provide a solid foundation for data preparation.

Redundant Columns: Cleaning Up the Data

To identify redundant columns, compare column names and descriptions for clues, then analyze the data within the columns to see whether the values are identical or highly correlated. Once you identify the redundant columns, decide which one to remove: typically the column with more missing values, inconsistent formatting, or values that are mostly constant.

Analyze Columns with Similar Naming Conventions

Analyze and compare pairs or groups of columns that appear similar due to their content or due to variations in naming conventions or formatting. The goal is to ensure data consistency, detect redundancy, and verify whether the columns represent the same information.

The following columns are compared:

  • borough, Borough, and BoroCode
  • YearBuilt and year_built
  • ZipCode and zip_code

Similarly, this can be applied to other columns, such as building_class, BldgClass and building_class_at_sale; Address and address; etc.

  1. Compare columns: borough, Borough, and BoroCode

    The column names “borough,” “Borough,” and “BoroCode” are quite similar, suggesting they may contain the same information. To verify this, print five random samples from each column for comparison. Additionally, based on the column descriptions, all three columns represent the Brooklyn borough, with values such as “BK” or “3.0.” Since the data is specific to Brooklyn, these columns are redundant and can be safely removed from the dataset.

    
    z.show(BROOKLYN[['borough','Borough', 'BoroCode']].sample(n=5))

    Sample data from columns borough, Borough, and BoroCode, displaying 5 random rows.

  2. Identify Matching Data Percentage

    def matching_percentage(data, cols, threshold=0.9):
        # Filter rows where both columns have non-zero values
        filtered_df = data[(data[cols[0]] != 0) & (data[cols[1]] != 0)]
        
        # Calculate matching percentage
        total_rows = len(filtered_df)
        matching_rows = len(filtered_df[filtered_df[cols[0]] == filtered_df[cols[1]]])
        matching_percentage = matching_rows / total_rows if total_rows else 0
        
        # Output result
        if matching_percentage >= threshold:
            print(f"The columns have a high percentage ({matching_percentage * 100:.2f}%) of matching values, suggesting similarity.")
        else:
            print("The columns do not have a high percentage of matching values.")
  3. Compare columns: YearBuilt and year_built

    The column names "YearBuilt" and "year_built" are quite similar, suggesting they may contain the same information. To verify this, we should print out 5 random samples from each column.

    
    matching_percentage(BROOKLYN, ['YearBuilt', 'year_built'])
    The columns have a high percentage (99.25%) of matching values, suggesting similarity.

    The columns "YearBuilt" and "year_built" contain similar information, so remove one of them. Remove "year_built" from the dataset.

  4. Compare columns: ZipCode and zip_code

    
    matching_percentage(BROOKLYN, ['ZipCode', 'zip_code'])
    The columns have a high percentage (98.72%) of matching values, suggesting similarity.

    The column names "ZipCode" and "zip_code" are quite similar, suggesting they may contain the same information. To verify this further, print out 5 random samples from each column.

    
    z.show(BROOKLYN[['ZipCode', 'zip_code']].sample(n=5))

    Sample data from columns ZipCode and zip_code, displaying 5 random rows.

Filter data by selecting the desired columns

To focus on more reliable data, only houses built in 1800 or later are included in the dataset. This is because houses recorded as built earlier frequently have a YearBuilt value of 0, indicating potentially missing or inaccurate information; this is explored further in the next step.

BROOKLYN2 = BROOKLYN[(BROOKLYN['YearBuilt']>= 1800)][['building_class_at_sale', 'HealthCent', 'YearBuilt', 'ResidFAR', 
'sale_date', 'building_class_category', 'GarageArea', 'CD', 'YearAlter1', 'ID', 'SchoolDist', 'SanitDistr', 'PolicePrct','address',
'CT2010', 'commercial_units', 'BldgArea','NumFloors', 'sale_price','AssessTot', 'ResArea','land_sqft','LotFront', 
'LotArea','AssessLand', 'SHAPE_Area','year_of_sale', 'gross_sqft','XCoord','YCoord', 'SHAPE_Leng']]

# Dataframe dimension
BROOKLYN2.shape
(295356, 31)

Feature Engineering and Visualization

Create new columns and modify existing features based on insights gathered from visualizations. The newly engineered features are then merged back into the dataset to enhance its quality and readiness for modeling.

  1. Built periods and their counts:

    • Analyze the distribution of the periods in which the houses were built and identify the least and most common periods within the dataset. The YearBuilt column is first rounded to the nearest decade. These rounded values are then categorized into predefined intervals, or bins. Count the number of YearBuilt values within each bin to determine the frequency distribution of built periods.

      built_period = (BROOKLYN2['YearBuilt'] // 10) * 10 + oml.Integer(BROOKLYN2['YearBuilt'] % 10 >= 5) * 10
      bins_str = built_period.cut(bins=[1700,1800,1880,1900,1920,1940,1960,1980,2000,2020,2040])
      bins = sorted(bins_str.drop_duplicates().pull())
       
      z.show(pd.DataFrame({'Built Period':bins, 'Count':[oml.Integer(bins_str == b).sum() for b in bins]}))

      The count of category occurrences for column, built period.

    • Visualization of built periods and their counts

      # Data might be incomplete when DECADE_BUILT < 1900
       
      Nbins = 141
      n, bins, patches = plt.hist(built_period.pull(), Nbins)
      plt.xlabel('Built_Period')
      plt.ylabel('number of records')
      plt.yscale('log')
      p = plt.xlim(1795, 2025)

      Visual for the count of category occurrences for column, built period.

  2. Preview sale price and count by binning

    • Analyze the distribution of house sale prices and identify the least and most common sale prices within the dataset. The sale_price column values are categorized into predefined intervals, or bins. Count the number of sale_price values within each bin to determine the frequency distribution of sale_price.

      Sale_Price=(BROOKLYN2['sale_price'].cut(bins=[-100000000,0,20000,40000,60000,80000,100000,1000000,10000000,500000000]))
      
      # bins_str = decade_built.cut(bins=[1700,1800,1880,1900,1920,1940,1960,1980,2000,2020,2040])
      bins = sorted(Sale_Price.drop_duplicates().pull())
       
      z.show(pd.DataFrame({'Sale Price':bins, 'Count':[oml.Integer(Sale_Price == b).sum() for b in bins]})) 

      This table shows the count of occurrences for each sale price bin.

    • Examine logarithmic sales price distribution

      
      # Most properties have 10^5 < sale_price < 10^6.5=3.2M
      Nbins = 101
      
      n, bins, patches = plt.hist((BROOKLYN2[BROOKLYN2['sale_price']>0]['sale_price']).log(10).pull(), Nbins)
      plt.xlabel('log10(sale_price)')
      plt.ylabel('number of records')
      p = plt.xlim(3.9, 7.1)

      The visual shows the count of occurrences for column, sales price.

  3. Preview building class category, count and count percentage

    Analyze the categorical column, building_class_category, by computing a cross-tabulation. Sort this table in descending order. Finally, determine the frequency of each building class category.

    build_category= BROOKLYN2.crosstab('building_class_category').sort_values('count', ascending=False)
    count_percentage= ((build_category['count'] / len(BROOKLYN2)) * 100).round(decimals=2)
    z.show(build_category.concat({'count_percentage':count_percentage}))

    This figure shows the count of occurrences for each category.

  4. Examine logarithmic gross sqft distribution

    Values in the gross_sqft column, when log-transformed, approximate a normal distribution.

    # Most properties have 10^2.9=800 < gross_sqft < 10^3.7=5000
    Nbins = 201
    
    n, bins, patches = plt.hist((BROOKLYN2[BROOKLYN2['gross_sqft']>0]['gross_sqft']).log(10).pull(), Nbins)
    plt.xlabel('log10(gross_sqft)')
    plt.ylabel('number of records')
    p = plt.xlim(2.5, 4.2)

    Shows the logarithmic gross sqft distribution

Feature Engineering

Feature engineering is the process of creating new input features from existing data to capture the underlying patterns in the data. These new features help improve the model's predictive performance.

The following features have been engineered:

  • Built Period: The Period in which the house was built.
    built_period=(BROOKLYN2['YearBuilt'] // 10) * 10 + oml.Integer(BROOKLYN2['YearBuilt'] % 10 >= 5) * 10
    BROOKLYN2=BROOKLYN2.concat({'Built_Period':built_period})
  • Age_At_Sale: Age of the house at sale is the number of years from its construction to the sale date.
    Age_At_Sale2  = abs(BROOKLYN2['year_of_sale'] - BROOKLYN2['YearBuilt'])
    BROOKLYN2= BROOKLYN2.concat({'Age_At_Sale2': Age_At_Sale2})
  • Quarter: Refers to the quarter of the year in which the house was sold, derived from the sale date.
    time_period = oml.Datetime.strptime(BROOKLYN2['sale_date'], format="MM/DD/YYYY")
    Quarter= (oml.Integer((time_period.month - 1)// 3 + 1))

Clean the Dataframe

After feature engineering, remove columns that no longer contribute to the analysis and drop any rows that have a missing value.

BROOKLYN2=BROOKLYN2.drop(['sale_date'])
BROOKLYN2=BROOKLYN2.dropna()

Filter the data to include properties whose sale price, gross square footage, and decade of construction fall within a specific, relevant range.

BROOKLYN3 = BROOKLYN2[(BROOKLYN2['sale_price']>=1.0e5) & (BROOKLYN2['sale_price']<=5.0e6) & 
            (BROOKLYN2['gross_sqft']>=800) & (BROOKLYN2['gross_sqft']<=5000) & 
            (BROOKLYN2['Built_Period']>=1900) & (BROOKLYN2['Built_Period']<=2010) ]

Apply a log transformation to normalize the gross_sqft column. The original, untransformed column is then removed.

BROOKLYN3 = BROOKLYN3.concat({'log_gross_sqft': BROOKLYN3['gross_sqft'].log(10)})
BROOKLYN3=BROOKLYN3.drop(['gross_sqft'])

To improve the model's performance, filter the data to focus on properties with higher probabilities of belonging to specific building classes and categories.

BROOKLYN4 = BROOKLYN3[(BROOKLYN3['building_class_at_sale']=='A1') | (BROOKLYN3['building_class_at_sale']=='A2') 
            | (BROOKLYN3['building_class_at_sale']=='A4') | (BROOKLYN3['building_class_at_sale']=='A5') 
            | (BROOKLYN3['building_class_at_sale']=='B1') | (BROOKLYN3['building_class_at_sale']=='B2') 
            |  (BROOKLYN3['building_class_at_sale']=='B3') ]
BROOKLYN5 = BROOKLYN4[((BROOKLYN4['building_class_category']=='02 TWO FAMILY HOMES') | (BROOKLYN4['building_class_category']=='01 ONE FAMILY HOMES'))]

Dataframe Dimensions

Run the following script to verify the dataframe dimensions:

BROOKLYN5.shape
(67083, 32)
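
Optionally, you can persist the prepared data set as a database table so that it can be reused in later notebook sessions without repeating the preparation steps. This is a minimal sketch and not part of the original workflow; the table name BROOKLYN_PREPARED is a placeholder.

# Persist the prepared proxy object as a permanent table (optional)
try:
    oml.drop(table = 'BROOKLYN_PREPARED')
except:
    pass
_ = BROOKLYN5.materialize(table = 'BROOKLYN_PREPARED')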

3.1.3 Build Model

Build your model using the training data set. Use the oml.glm function to build your model and specify model settings.

For supervised learning techniques such as regression, split the data into training and test sets before creating the model. Although you can use the entire data set to build a model, it is difficult to validate the model unless new data sets are available. Therefore, to evaluate the model and accurately assess its performance, you generally split the data into training and test sets. You use the training data set to train the model and the test data set to test the model's accuracy by running prediction queries. The test data set already contains known values for the attribute that you want to predict, so it is easy to determine whether the model's predictions are correct.

Algorithm Selection

Before you build a model, choose the suitable algorithm. You can choose one of the following algorithms to solve a regression problem:

  • Extreme Gradient Boosting
  • Generalized Linear Model
  • Neural Network
  • Support Vector Machine

When you want to understand a data set, you generally start with a simple, easily interpretable baseline model. The Generalized Linear Model algorithm is a good choice because it fits a linear relationship between the features and the target, which makes it simple to interpret. The results of the linear model give you an initial understanding of a new data set.

The following steps guide you to split your data and build your model with the selected algorithm.

  1. Split Data: Train/Test:

    Split the data into training and test sets with an 80/20 ratio. The seed parameter makes the random split reproducible. The split method splits the data referenced by the DataFrame proxy object BROOKLYN5 into two new DataFrame proxy objects, TRAIN and TEST.
    TRAIN, TEST = BROOKLYN5.split(ratio = (0.8,0.2), seed=15)
    TRAIN_X = TRAIN.drop('sale_price')
    TRAIN_Y = TRAIN['sale_price']
    TEST_X = TEST
    TEST_Y = TEST['sale_price']
  2. Model Building:

    Specify the model settings and build a Generalized Linear Model (GLM) object for predicting the sale_price attribute by running the following script. The settings are passed as a dictionary of key-value pairs, where each key is a parameter name and each value is its setting.

    try:
        oml.drop(model = 'BROOKLYN_GLM_REGRESSION_MODEL')
    except:
        print('No such model')
     
    setting = {'PREP_AUTO':'ON',
               'GLMS_ROW_DIAGNOSTICS':'GLMS_ROW_DIAG_ENABLE',
               'GLMS_FTR_SELECTION':'GLMS_FTR_SELECTION_ENABLE',
               'GLMS_FTR_GENERATION':'GLMS_FTR_GENERATION_ENABLE'}
                
    glm_mod = oml.glm("regression", **setting)
    glm_mod = glm_mod.fit(TRAIN_X,TRAIN_Y,model_name = 'BROOKLYN_GLM_REGRESSION_MODEL',case_id = 'ID')
    

    Model setting parameters:

    • PREP_AUTO: Used to specify fully automated or user-directed general data preparation. By default, automatic data preparation is enabled ('PREP_AUTO': 'PREP_AUTO_ON').
    • GLMS_ROW_DIAGNOSTICS: Enables or disables the row diagnostics. By default, row diagnostics are disabled.
    • GLMS_FTR_SELECTION: Enables or disables feature selection for GLM. By default, feature selection is not enabled.
    • GLMS_FTR_GENERATION: Specifies whether or not feature generation is enabled for GLM. By default, feature generation is not enabled.

      Note:

      Feature generation can only be enabled when feature selection is also enabled.
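
Optionally, you can confirm that the model was created in your schema by querying the USER_MINING_MODELS data dictionary view in a SQL paragraph. This is a minimal sketch and not part of the original workflow.

%sql
SELECT model_name, mining_function, algorithm
FROM   user_mining_models
WHERE  model_name = 'BROOKLYN_GLM_REGRESSION_MODEL'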

3.1.4 Evaluate

Before you make predictions using your model on new data, you should first evaluate model accuracy. You can evaluate the model using different methods.

Information about Model settings

Evaluate the model by examining the various statistics generated after building the model. The statistics indicate the model's quality.

  • Model details: Run the following script to display the model details available through the GLM model object, such as the model settings, attribute coefficients, and fit details.
    glm_mod

    They can also be displayed and viewed individually as shown below.

  • Attribute Coefficients: Run the following script to display the model's attribute coefficients.
    z.show(glm_mod.coef.round(2).head())

    Shows the attribute coefficients.

  • Fit Details: Run the following script to display the fit details of the model.
    z.show(glm_mod.fit_details.round(2).head())

    Shows the Fit Details.

Score

Scoring is the process of applying the model to the test data to assess its performance.

  1. Predict sale price: Use the model to make predictions on test data.
    BROOKLYN_RES = glm_mod.predict(TEST.drop('sale_price'), supplemental_cols = TEST[:,['ID','sale_price']])
    z.show(BROOKLYN_RES.round(2).head())

    Shows the predicted sale prices along with the ID and actual sale_price columns.

  2. Evaluate Model Performance: Evaluate the model's performance by using the score function.
    model_coef = len(glm_mod.coef)
    no_rows_test = TEST_X.shape[0]
    R_squared = glm_mod.score(TEST_X, TEST_Y).round(3)
    
    print(
        f"RMSE : {(((BROOKLYN_RES['PREDICTION'] - BROOKLYN_RES['sale_price']) ** 2).mean() ** .5).round(2)}\n"
        f"MAE: {((abs(BROOKLYN_RES['PREDICTION'] - BROOKLYN_RES['sale_price'])).mean()).round(2)}\n"
        f"R squared: {R_squared}\n"
        f"Adjusted R^2: {(1 - ( 1 - R_squared)*(no_rows_test-1)/(no_rows_test - model_coef -1)).round(4)}"
    )

    The interpretation of the model's performance based on these metrics is as follows:

    • RMSE (286,137.56): The model's predictions, on average, deviate by approximately 286,137.56 units from the actual values.
    • MAE (170,992.05): The average absolute error in predictions is 170,992.05 units, which provides a sense of the model's accuracy without penalizing large errors.
    • R² (0.581): The model explains 58.1% of the variance in the target variable, suggesting a moderate fit to the data.
    • Adjusted R² (0.5604): After adjusting for the number of predictors, the model explains about 56.04% of the variance.

    Overall, the model demonstrates moderate predictive accuracy, with potential for further improvement.

  3. Residual Graph: The residual plot shows a heteroscedastic pattern, which indicates that the model's errors are inconsistent (their spread changes with the predicted value). The non-linear relationship between the fitted (predicted) values and the residuals suggests that the model can be improved further.
    import matplotlib.pyplot as plt
     
    y_pred = BROOKLYN_RES['PREDICTION'].pull()
    residuals = (BROOKLYN_RES['sale_price'] - BROOKLYN_RES['PREDICTION']).pull()
     
    plt.scatter(y_pred, residuals)
     
    plt.xlabel('Predicted Sale Prices')
    plt.ylabel('Residuals')
    plt.title('Residual Plot', fontsize=16)
    plt.grid(True)
    plt.axhline(y=0, color='r', linestyle='--')
    plt.show()

    Visual for Residual Graph
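
As a complementary check (not part of the original workflow), you can also plot predicted against actual sale prices; points close to the diagonal line indicate accurate predictions. The following is a minimal sketch that reuses the BROOKLYN_RES proxy object from the scoring step.

# Predicted vs. actual sale prices (optional diagnostic)
y_true = BROOKLYN_RES['sale_price'].pull()
y_pred = BROOKLYN_RES['PREDICTION'].pull()

plt.scatter(y_true, y_pred, s=2)
plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--')  # ideal fit reference line
plt.xlabel('Actual Sale Price')
plt.ylabel('Predicted Sale Price')
plt.title('Predicted vs. Actual Sale Price', fontsize=16)
plt.grid(True)
plt.show()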

SQL Interface to Score Data and Display Prediction Details

You can score data and make predictions using the SQL interface. The test data is materialized into the table BROOKLYN_GLM_TEST_DATA so that you can query it using SQL. The materialize method writes the contents of an Oracle Machine Learning oml.DataFrame proxy object to a table in the Oracle AI Database.

  1. Materialize BROOKLYN_GLM_TEST_DATA for use in query below:
    try:
        oml.drop(table = 'BROOKLYN_GLM_TEST_DATA')
    except:
        pass
    _ = TEST.materialize(table = 'BROOKLYN_GLM_TEST_DATA')
  2. Display prediction with explanatory prediction details in SQL: Use the following SQL command to score the data and display the prediction details. The prediction functions apply the GLM regression model named BROOKLYN_GLM_REGRESSION_MODEL to the data from the materialized table BROOKLYN_GLM_TEST_DATA. The query includes information about the predictors that have the greatest influence on each prediction.

    SELECT ID,
           round(PREDICTION_YRS_RES,3) PRED_YRS_RES,
           round(PRED_LOWER_LIMIT,1) LOWER_BOUND,
           round(PRED_UPPER_LIMIT,1) UPPER_BOUND,
           RTRIM(TRIM(SUBSTR(OUTPRED."Attribute1",17,100)),'rank="1"/>') FIRST_ATTRIBUTE,
           RTRIM(TRIM(SUBSTR(OUTPRED."Attribute2",17,100)),'rank="2"/>') SECOND_ATTRIBUTE,
           RTRIM(TRIM(SUBSTR(OUTPRED."Attribute3",17,100)),'rank="3"/>') THIRD_ATTRIBUTE
    FROM (SELECT ID,
                 PREDICTION(BROOKLYN_GLM_REGRESSION_MODEL USING *) PREDICTION_YRS_RES,
                 PREDICTION_BOUNDS(BROOKLYN_GLM_REGRESSION_MODEL USING *).LOWER PRED_LOWER_LIMIT,
                 PREDICTION_BOUNDS(BROOKLYN_GLM_REGRESSION_MODEL USING *).UPPER PRED_UPPER_LIMIT,
                 PREDICTION_DETAILS(BROOKLYN_GLM_REGRESSION_MODEL USING *) PD
          FROM BROOKLYN_GLM_TEST_DATA
          WHERE ID < 100015
          ORDER BY ID) OUT,
         XMLTABLE('/Details'
                  PASSING OUT.PD
                  COLUMNS 
                  "Attribute1" XMLType PATH 'Attribute[1]',
                  "Attribute2" XMLType PATH 'Attribute[2]',
                   "Attribute3" XMLType PATH 'Attribute[3]') OUTPRED

    Displays the prediction with explanatory prediction details in SQL

To conclude, you have successfully predicted house sale prices in Brooklyn using the Generalized Linear Model algorithm.