Geographical Classifier

Like GeographicalRegressor, the GeographicalClassifier class trains one global model and several local models, and makes predictions by combining the weighted results of both.

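You can specify the scikit-learn classifiers used for the global model and the local models by setting the global_model and model_cls parameters, respectively. Any scikit-learn classifier can be used, such as a random forest, support vector, gradient boosting, or decision tree classifier.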
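
Any scikit-learn classifier can be swapped in because they all share the same estimator interface (fit, predict_proba, score). The following is a minimal sketch of that idea using only scikit-learn and synthetic data; it does not call oraclesai, and the dataset and hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for real features (assumption)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=32)

# The four classifier families named in the text
candidates = {
    "random forest": RandomForestClassifier(n_estimators=10, random_state=32),
    "support vector": SVC(probability=True, random_state=32),
    "gradient boosting": GradientBoostingClassifier(n_estimators=20, random_state=32),
    "decision tree": DecisionTreeClassifier(random_state=32),
}

train_accuracy = {}
proba_shapes = {}
for name, clf in candidates.items():
    clf.fit(X, y)                      # same call for every classifier
    proba = clf.predict_proba(X)       # one column per class
    proba_shapes[name] = proba.shape
    train_accuracy[name] = clf.score(X, y)
    print(f"{name}: proba shape {proba.shape}, training accuracy {train_accuracy[name]:.3f}")
```

Note that SVC only exposes predict_proba when it is created with probability=True, which matters for any approach that averages class probabilities.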

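Both GeographicalClassifier and GeographicalRegressor extend the geographical random forest algorithm in two ways: they allow underlying machine learning algorithms other than random forest, and they support parallelism when training the local models, for robust and scalable performance. For more information on the geographical random forest algorithm, see [4].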

The following table describes the main methods of the GeographicalClassifier class.

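Method	Description
`fit`	First builds the global model using the parameters provided at creation time. The spatial relationship is taken from the `spatial_weights_definition` or `bandwidth` parameter; if neither is specified, it is computed internally. Several local models are then trained.
`predict`	For each observation to predict, finds the local models close to it. Using a weighted average of the predictions from the global and local models, the algorithm estimates a discrete set of values, one per class, representing the probability that the observation belongs to each class. The category with the highest probability is the predicted value.
`fit_predict`	Calls the `fit` and `predict` methods in sequence on the training data.
`score`	Returns the accuracy of the model on the given data.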

For more information, see the GeographicalClassifier class in the Oracle Spatial AI Python API Reference.
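
The weighted combination described for `predict` in the table above can be sketched with scikit-learn and NumPy alone. This is not the oraclesai implementation: the synthetic data, the anchor points, the choice of a single nearest local model, and the blending formula `local_weight * p_local + (1 - local_weight) * p_global` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(32)

# Synthetic observations: 2-D coordinates, 3 features, 3 classes (assumptions)
coords = rng.uniform(0, 100, size=(400, 2))
X = rng.normal(size=(400, 3))
y = rng.integers(0, 3, size=400)
classes = np.array([0, 1, 2])

def aligned_proba(model, X_new):
    """Return probabilities with one column per entry in `classes`,
    even if the model did not see every class during training."""
    proba = np.zeros((len(X_new), len(classes)))
    cols = np.searchsorted(classes, model.classes_)
    proba[:, cols] = model.predict_proba(X_new)
    return proba

# One global model trained on all observations
global_model = RandomForestClassifier(n_estimators=10, random_state=32).fit(X, y)

# One local model per anchor point, trained on neighbors within a distance band
bandwidth = 30.0          # hypothetical threshold
anchors = coords[:20]     # hypothetical local-model locations
local_models = []
for anchor in anchors:
    mask = np.linalg.norm(coords - anchor, axis=1) <= bandwidth
    local_models.append(
        RandomForestClassifier(n_estimators=10, random_state=32).fit(X[mask], y[mask])
    )

def predict(X_new, coords_new, local_weight=0.8):
    """Blend global and nearest-local class probabilities, then take the argmax."""
    p_global = aligned_proba(global_model, X_new)
    blended = np.empty_like(p_global)
    for i, (x, c) in enumerate(zip(X_new, coords_new)):
        nearest = np.argmin(np.linalg.norm(anchors - c, axis=1))
        p_local = aligned_proba(local_models[nearest], x.reshape(1, -1))[0]
        blended[i] = local_weight * p_local + (1 - local_weight) * p_global[i]
    return classes[np.argmax(blended, axis=1)], blended

labels, proba = predict(X[:5], coords[:5])
print(labels, proba.sum(axis=1))
```

Because both inputs are probability vectors and the weights sum to 1, each blended row still sums to 1, so the largest entry can be read directly as the predicted class.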

The following code uses the houses_full SpatialDataFrame, which contains housing information for the city of Los Angeles. The example performs the following steps:

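Creates a categorical variable based on the HOUSE_VALUE_MEDIAN column.
Defines the training and test sets.
Creates an instance of GeographicalClassifier.
Trains the local models using RandomForestClassifier from scikit-learn.
Calls the predict and score methods to estimate the target variable and the accuracy of the model on the test set, respectively.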

from oraclesai.preprocessing import spatial_train_test_split
from oraclesai.weights import DistanceBandWeightsDefinition
from sklearn.ensemble import RandomForestClassifier
from oraclesai.classification import GeographicalClassifier

# Define explanatory variables
feature_columns = [
    'BEDROOMS_TOTAL',
    'EDU_LEVEL_SCORE_MEDIAN',
    'POPULATION_DENSITY',
    'ROOMS_TOTAL',
    'COMPLETE_PLUMBING_PERC',
    'COMPLETE_KITCHEN_PERC',
    'HOUSE_AGE_MEDIAN',
    'RENTED_PERC',
    'UNITS_TOTAL'
]

# The target variable will be built from this column
target_column = 'HOUSE_VALUE_MEDIAN'

# Select a subset of columns
houses = houses_full[[target_column] + feature_columns]

# Remove rows with null values
houses = houses.dropna()

# Define training and test sets
X_train, X_test, y_train, y_test, geom_train, geom_test = spatial_train_test_split(houses,
                                                                                   y=target_column, 
                                                                                   test_size=0.33,
                                                                                   numpy_result=True,
                                                                                   random_state=32)

# Define constants to create a categorical variable
y = houses[target_column].values
y_mean = y.mean()
y_std = y.std()

# House values below the mean minus 0.5 std are considered low-value
# House values above the mean plus 0.5 std are considered high-value
mid_low_price =  y_mean - y_std * 0.5
mid_hi_price = y_mean + y_std * 0.5

# Define the function that generates the target variable based on the house value
def classify_house_value(house_value):
    if house_value < mid_low_price:
        return 0.0
    if house_value > mid_hi_price:
        return 2.0
    return 1.0

# Generate the target variable for the training and test sets
y_c_train = [classify_house_value(value) for value in y_train]
y_c_test = [classify_house_value(value) for value in y_test]

# Define the spatial weights
weights_definition = DistanceBandWeightsDefinition(threshold=2388.51)

# Create an instance of GeographicalClassifier
grfc_model = GeographicalClassifier(model_cls=RandomForestClassifier, 
                                    n_estimators=10, 
                                    local_weight=0.80, 
                                    spatial_weights_definition=weights_definition, 
                                    random_state=32) 
# Train the model
grfc_model.fit(X_train, y=y_c_train, geometries=geom_train, n_jobs=-1)

# Print the predictions with the test set
grfc_predictions_test = grfc_model.predict(X_test, geometries=geom_test).flatten()
print(f"\n>> predictions (X_test):\n {grfc_predictions_test[:10]}")

# Print the score with the test set
grfc_accuracy = grfc_model.score(X_test, y_c_test, geometries=geom_test)
print(f"\n>> accuracy (X_test):\n {grfc_accuracy}")

The output consists of the predictions for the first 10 observations of the test set and the accuracy of the model on the same test set.

>> predictions (X_test):
 [1 1 0 2 2 1 1 0 0 0]

>> accuracy (X_test):
 0.7343004295345901
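
The example builds the categorical target by applying a Python function to each value. A vectorized equivalent can be sketched with numpy.select, which applies the same thresholds (mean minus and plus 0.5 standard deviations) in one call. The house values below are made-up numbers for illustration, not data from houses_full, and the labels are integers rather than the floats returned by classify_house_value.

```python
import numpy as np

# Illustrative house values (made-up numbers, not from houses_full)
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

mid_low_price = y.mean() - 0.5 * y.std()
mid_hi_price = y.mean() + 0.5 * y.std()

# 0 = low value, 2 = high value, everything else
# (including values exactly equal to a threshold) = 1
y_c = np.select([y < mid_low_price, y > mid_hi_price], [0, 2], default=1)
print(y_c)  # [0 0 0 0 1 1 2 2 2 2]
```

Here the mean is 5.5 and the standard deviation is about 2.87, so the thresholds fall near 4.06 and 6.94.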