Load the Data
Perform the following steps to load the data:
- Create an instance of SpatialDataFrame.
  The census dataset is stored in the la_block_groups table in the database. To load it into Python, use a DBSpatialDataset and create an instance of SpatialDataFrame.

      import oml
      from oraclesai import SpatialDataFrame, DBSpatialDataset

      block_groups = SpatialDataFrame.create(DBSpatialDataset(table='la_block_groups', schema='oml_user'))

  The dataset contains information about different regions in the city of Los Angeles, and features such as median_income and house_value provide information about each region's income. Other features provide demographic information about gender, race, and age.
- Review the variables (shown in the following table) of the SpatialDataFrame instance and define the columns that represent the target variable, the explanatory variables, and the geometries.

  | Variable | Description |
  | --- | --- |
  | MEDIAN_INCOME | The target variable representing the median income. |
  | MEAN_AGE | The average age. |
  | MEAN_EDUCATION_LEVEL | Score based on the different education levels listed in the Census table. |
  | HOUSE_VALUE | Median value of houses in the region. |
  | PER_WHITE | Proportion of the white population in the region. |
  | PER_BLACK | Proportion of the black population in the region. |

  The following code selects a subset of columns from the SpatialDataFrame instance.

      X = block_groups[['MEDIAN_INCOME', 'MEAN_AGE', 'MEAN_EDUCATION_LEVEL', 'HOUSE_VALUE', 'INTERNET', 'geometry']]

- Define the training, validation, and test sets.
  - Split the data into training and test sets using the spatial_train_test_split function from oraclesai.preprocessing. Assign 20% of the data for testing.

        from oraclesai.preprocessing import spatial_train_test_split

        X_train_valid, X_test, _, _, _, _ = spatial_train_test_split(X, y="MEDIAN_INCOME", test_size=0.2, random_state=32)

  - Split the remaining 80% of the data again to create the training and validation sets, using 10% of that remainder for validation and the rest for training. The validation set helps evaluate the model's performance before using it with the test set.

        X_train, X_valid, _, _, _, _ = spatial_train_test_split(X_train_valid, y="MEDIAN_INCOME", test_size=0.1, random_state=32)
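Because the second split applies to the 80% left after the first one, the validation set ends up as roughly 8% of the full dataset and the training set as roughly 72%. The sketch below checks that arithmetic with a plain random split over row indices, using only the Python standard library. It is an illustration of the proportions only: the helper two_stage_split and the row count of 1000 are hypothetical, and the way spatial_train_test_split actually selects rows may differ from a simple shuffle.

```python
import random


def two_stage_split(n_rows, test_size=0.2, valid_size=0.1, seed=32):
    """Return (train, valid, test) index lists from two successive random splits.

    Hypothetical helper for illustration; it is not part of oraclesai.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # First split: hold out test_size of all rows for testing.
    n_test = round(n_rows * test_size)
    test, train_valid = indices[:n_test], indices[n_test:]

    # Second split: hold out valid_size of the REMAINING rows for validation.
    n_valid = round(len(train_valid) * valid_size)
    valid, train = train_valid[:n_valid], train_valid[n_valid:]
    return train, valid, test


# Assumed row count, chosen only to make the percentages easy to read.
train, valid, test = two_stage_split(1000)
print(len(train), len(valid), len(test))  # 720 80 200
```

With 1000 rows, the sizes correspond to 72% training, 8% validation, and 20% test, and every row lands in exactly one of the three sets.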