7.2.6 データの集計
describe
メソッドは、各列のデータの中心傾向、分散および形状を要約する記述統計を計算します。
また、結果に含める列または結果から除外する列の型を指定することもできます。
sum
およびcumsum
メソッドを使用すると、oml.DataFrame
の各Float列またはBoolean列の合計および累積合計を計算できます。
describe
メソッドは、次の統計の検出をサポートしています。
-
平均、最小、最大、中央値、上位文字、標準偏差
-
Null以外の値、一意の値、上位文字の数
-
0から1の間のパーセンタイル
例7-12 記述統計の計算
次の例にこれらの操作を示します。
import pandas as pd
import oml
df = pd.DataFrame({'numeric': [1, 1.4, -4, 3.145, 5, None],
'string' : [None, None, 'a', 'a', 'a', 'b'],
'bytes' : [b'a', b'b', b'c', b'c', b'd', b'e']})
oml_df = oml.push(df, dbtypes = {'numeric': 'BINARY_DOUBLE',
'string':'CHAR(1)',
'bytes':'RAW(1)'})
# Combine a Boolean column with oml_df.
oml_bool = oml_df['numeric'] > 3
oml_df = oml_df.concat(oml_bool)
oml_df.rename({'COL4':'boolean'})
# Describe all of the columns.
oml_df.describe(include='all')
# Exclude Float columns.
oml_df.describe(exclude=[oml.Float])
# Get the sum of values in each Float or Boolean column.
oml_df.sum()
# Find the cumulative sum of values in each Float or Boolean column
# after oml_df is sorted by the bytes column in descending order.
oml_df.cumsum(by = 'bytes', ascending = False)
# Compute the skewness of values in the Float columns.
oml_df.skew()
# Find the median value of Float columns.
oml_df.median()
# Calculate the kurtosis of Float columns.
oml_df.kurtosis()
この例のリスト
>>> import pandas as pd
>>> import oml
>>>
>>> df = pd.DataFrame({'numeric': [1, 1.4, -4, 3.145, 5, None],
... 'string' : [None, None, 'a', 'a', 'a', 'b'],
... 'bytes' : [b'a', b'b', b'c', b'c', b'd', b'e']})
>>>
>>> oml_df = oml.push(df, dbtypes = {'numeric': 'BINARY_DOUBLE',
... 'string':'CHAR(1)',
... 'bytes':'RAW(1)'})
>>>
>>> # Combine a Boolean column with oml_df.
... oml_bool = oml_df['numeric'] > 3
>>> oml_df = oml_df.concat(oml_bool)
>>> oml_df.rename({'COL4':'boolean'})
bytes numeric string boolean
0 b'a' 1.000 None False
1 b'b' 1.400 None False
2 b'c' -4.000 a False
3 b'c' 3.145 a True
4 b'd' 5.000 a True
5 b'e' NaN b True
>>>
>>> # Describe all of the columns.
... oml_df.describe(include='all')
bytes numeric string boolean
count 6 5.000000 4 6
unique 5 NaN 2 2
top b'c' NaN a False
freq 2 NaN 3 3
mean NaN 1.309000 NaN NaN
std NaN 3.364655 NaN NaN
min NaN -4.000000 NaN NaN
25% NaN 1.000000 NaN NaN
50% NaN 1.400000 NaN NaN
75% NaN 3.145000 NaN NaN
max NaN 5.000000 NaN NaN
>>>
>>> # Exclude Float columns.
... oml_df.describe(exclude=[oml.Float])
bytes string boolean
count 6 4 6
unique 5 2 2
top b'c' a False
freq 2 3 3
>>>
>>> # Get the sum of values in each Float or Boolean column.
... oml_df.sum()
numeric 6.545
boolean 3.000
dtype: float64
>>>
>>> # Find the cumulative sum of values in each Float or Boolean column
... # after oml_df is sorted by the bytes column in descending order.
... oml_df.cumsum(by = 'bytes', ascending = False)
numeric boolean
0 NaN 1
1 5.000 2
2 1.000 2
3 4.145 3
4 5.545 3
5 6.545 3
>>>
>>> # Compute the skewness of values in the Float columns.
... oml_df.skew()
numeric -0.683838
dtype: float64
>>>
>>> # Find the median value of Float columns.
... oml_df.median()
numeric 1.4
dtype: float64
>>>
>>> # Calculate the kurtosis of Float columns.
... oml_df.kurtosis()
numeric -0.582684
dtype: float64
親トピック: データの探索