データのクロス集計

7.2.3 データのクロス集計

crosstabメソッドを使用してoml.DataFrameオブジェクトのクロス集計分析を実行し、pivot_tableメソッドを使用してoml.DataFrameをスプレッドシート形式のピボット・テーブルに変換します。

クロス集計は、2つの列の値の間にある相互依存関係を見つける統計手法です。crosstabメソッドは、2つ以上の列のクロス集計を計算します。デフォルトでは、列と集計関数が渡されないかぎり、列の頻度表が計算されます。

pivot_tableメソッドは、データセットをピボット・テーブルに変換します。1000列というデータベースの制限があるため、1000列を超えるピボット・テーブルは自動的に切り捨てられ、それぞれの列値についてエントリが最も多いカテゴリが表示されます。

これらのメソッドの引数の詳細は、help(oml.DataFrame.crosstab)またはhelp(oml.DataFrame.pivot_table)を呼び出すか、Oracle Machine Learning for Python APIリファレンスを参照してください。

例7-10 クロス集計およびピボット・テーブルの作成

この例では、crosstabおよびpivot_tableメソッドの使用方法を示します。

import pandas as pd
import oml

x = pd.DataFrame({
     'GENDER': ['M', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 
                None, 'F', 'M', 'F'],
     'HAND': ['L', 'R', 'R', 'L', 'R', None, 'L', 'R',
              'R', 'R', 'R', 'R'],
     'SPEED': [40.5, 30.4, 60.8, 51.2, 54, 29.3, 34.1, 
               39.6, 46.4, 12, 25.3, 37.5],
     'ACCURACY': [.92, .94, .87, .9, .85, .97, .96, .93,
                  .89, .84, .91, .95]
    })
x = oml.push(x)

# Find the categories that the most entries belonged to.
x.crosstab('GENDER', 'HAND').sort_values('count', ascending=False)

# For each gender value and across all entries, find the ratio of entries
# with different hand values.
x.crosstab('GENDER', 'HAND', pivot = True, margins = True, normalize = 0)

# Find the mean speed across all gender and hand combinations.
x.pivot_table('GENDER', 'HAND', 'SPEED')

# Find the median accuracy and speed for every gender and hand combination.
x.pivot_table('GENDER', 'HAND', aggfunc = oml.DataFrame.median)

# Find the max and min speeds for every gender and hand combination and 
# across all combinations.
x.pivot_table('GENDER', 'HAND', 'SPEED', 
               aggfunc = [oml.DataFrame.max, oml.DataFrame.min],
               margins = True)

この例のリスト

>>> import pandas as pd
>>> import oml
>>> 
>>> x = pd.DataFrame({
...       'GENDER': ['M', 'M', 'F', 'M', 'F', 'M', 'F', 'F',
...                  None, 'F', 'M', 'F'],
...       'HAND': ['L', 'R', 'R', 'L', 'R', None, 'L', 'R',
...                'R', 'R', 'R', 'R'],
...       'SPEED': [40.5, 30.4, 60.8, 51.2, 54, 29.3, 34.1,
...                 39.6, 46.4, 12, 25.3, 37.5],
...       'ACCURACY': [.92, .94, .87, .9, .85, .97, .96, .93,
...                    .89, .84, .91, .95]
...     })
>>> x = oml.push(x)
>>> 
>>> # Find the categories that the most entries belonged to.
... x.crosstab('GENDER', 'HAND').sort_values('count', ascending=False)
  GENDER  HAND  count
0      F     R      5
1      M     L      2
2      M     R      2
3      M  None      1
4      F     L      1
5   None     R      1
>>> 
>>> # For each gender value and across all entries, find the ratio of entries
... # with different hand values.
... x.crosstab('GENDER', 'HAND', pivot = True, margins = True, normalize = 0)
  GENDER  count_(L)  count_(R)  count_(None)
0   None   0.000000   1.000000      0.000000
1      F   0.166667   0.833333      0.000000
2      M   0.400000   0.400000      0.200000
3    All   0.250000   0.666667      0.083333
>>>
>>> # Find the mean speed across all gender and hand combinations.
... x.pivot_table('GENDER', 'HAND', 'SPEED')
  GENDER  mean(SPEED)_(L)  mean(SPEED)_(R)  mean(SPEED)_(None)
0   None              NaN            46.40                 NaN
1      F            34.10            40.78                 NaN
2      M            45.85            27.85                29.3
>>>
>>> # Find the median accuracy and speed for every gender and hand combination.
... x.pivot_table('GENDER', 'HAND', aggfunc = oml.DataFrame.median)
  GENDER  median(ACCURACY)_(L)  median(ACCURACY)_(R)  median(ACCURACY)_(None)  \
0   None                   NaN                 0.890                      NaN   
1      F                  0.96                 0.870                      NaN   
2      M                  0.91                 0.925                     0.97   

   median(SPEED)_(L)  median(SPEED)_(R)  median(SPEED)_(None)  
0                NaN              46.40                   NaN  
1              34.10              39.60                   NaN  
2              45.85              27.85                  29.3  
>>> 
>>> # Find the max and min speeds for every gender and hand combination and 
... # across all combinations.
... x.pivot_table('GENDER', 'HAND', 'SPEED',
...                aggfunc = [oml.DataFrame.max, oml.DataFrame.min],
...                margins = True)
  GENDER  max(SPEED)_(L)  max(SPEED)_(R)  max(SPEED)_(None)  max(SPEED)_(All)  \
0   None             NaN            46.4                NaN              46.4   
1      F            34.1            60.8                NaN              60.8   
2      M            51.2            30.4               29.3              51.2   
3    All            51.2            60.8               29.3              60.8   

   min(SPEED)_(L)  min(SPEED)_(R)  min(SPEED)_(None)  min(SPEED)_(All)  
0             NaN            46.4                NaN              46.4  
1            34.1            12.0                NaN              12.0  
2            40.5            25.3               29.3              25.3  
3            34.1            12.0               29.3              12.0

親トピック: データの探索