演習: KURTOSIS関数を使用したデータの依存性の測定

この演習では、KURTOSIS_POPおよびKURTOSIS_SAMP集計関数を使用してデータの依存性を測定する方法を示します。尖度が高いほど、適度なサイズの偏差が頻繁に発生するのとは対照的に、頻度が低い極端な偏差が発生した結果として、より多くの分散が生じることを意味します。正規分布の尖度はゼロです。

  1. 新しい演習を開始する前に、演習の環境の推奨事項を参照してください。
  2. HRとしてPDB20に接続し、/home/oracle/labs/M104784GC10/Houses_Prices.sql SQLスクリプトを実行します。このSQLスクリプトでは、データを含む表を作成します。
    $ cd /home/oracle/labs/M104784GC10
    $ sqlplus hr@PDB20
    
    SQL*Plus: Release 20.0.0.0.0 - Production on Mon Mar 16 08:49:39 2020
    Version 20.2.0.0.0
    
    Copyright (c) 1982, 2020, Oracle.  All rights reserved.
    
    Enter password: password
    Last Successful login time: Mon Mar 16 2020 08:48:58 +00:00
    
    Connected to:
    Oracle Database 20c Enterprise Edition Release 20.0.0.0.0 - Production
    Version 20.2.0.0.
    SQL> @/home/oracle/labs/M104784GC10/Houses_Prices.sql
    SQL>
    SQL> DROP TABLE houses;
    
    Table dropped.
    
    SQL> CREATE TABLE houses (house NUMBER, price_big_city NUMBER, price_small_city NUMBER, price_date DATE);
    
    Table created.
    
    SQL> INSERT INTO houses VALUES (1,100000,10000, sysdate);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (1,200000,15000, sysdate+1);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (1,300000,25000, sysdate+1);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (1,400000,28000, sysdate+2);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (1,500000,30000, sysdate+3);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (1,600000,32000, sysdate+3);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (1,700000,35000, sysdate+4);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (1,800000,38000, sysdate+4);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (1,900000,40000, sysdate+5);
    
    1 row created.
    
    SQL>
    SQL> INSERT INTO houses VALUES (2,2000000,1000000, sysdate+6);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (2,200000,20000, sysdate);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (2,400000,35000, sysdate+1);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (2,600000,55000, sysdate+1);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (2,800000,48000, sysdate+2);
    
    1 row created.
    
    SQL>
    SQL> INSERT INTO houses VALUES (3,400000,40000, sysdate+3);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (3,500000,42000, sysdate+3);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (3,600000,45000, sysdate+4);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (3,700000,48000, sysdate+4);
    
    1 row created.
    
    SQL> INSERT INTO houses VALUES (3,800000,49000, sysdate+5);
    
    1 row created.
    
    SQL> COMMIT;
    
    Commit complete.
    
    SQL>
  3. 表の行を表示します。HOUSE列の値は、調べる住宅のタイプを指し、統計的に調べて相互に比較するデータを分類します。
    SQL> SET PAGES 100
    SQL> SELECT * FROM houses;
    
         HOUSE PRICE_BIG_CITY PRICE_SMALL_CITY PRICE_DAT
    ---------- -------------- ---------------- ---------
             1         100000            10000 06-FEB-20
             1         200000            15000 07-FEB-20
             1         300000            25000 07-FEB-20
             1         400000            28000 08-FEB-20
             1         500000            30000 09-FEB-20
             1         600000            32000 09-FEB-20
             1         700000            35000 10-FEB-20
             1         800000            38000 10-FEB-20
             1         900000            40000 11-FEB-20
             2        2000000          1000000 12-FEB-20
             2         200000            20000 06-FEB-20
             2         400000            35000 07-FEB-20
             2         600000            55000 07-FEB-20
             2         800000            48000 08-FEB-20
             3         400000            40000 09-FEB-20
             3         500000            42000 09-FEB-20
             3         600000            45000 10-FEB-20
             3         700000            48000 10-FEB-20
             3         800000            49000 11-FEB-20
    
    19 rows selected.
    
    SQL>
  4. 3種類の住宅について、母集団の尖度(KURTOSIS_POP)および標本の尖度(KURTOSIS_SAMP)の結果を表示します。
    SQL> SELECT house, kurtosis_pop(price_big_city), kurtosis_pop(price_small_city) FROM houses
          GROUP BY house;
    
         HOUSE KURTOSIS_POP(PRICE_BIG_CITY) KURTOSIS_POP(PRICE_SMALL_CITY)
    ---------- ---------------------------- ------------------------------
             1                        -1.23                      -.7058169
             2                        -.212                     .245200191
             3                         -1.3                     -1.5417881
    
    SQL> SELECT house, kurtosis_samp(price_big_city), kurtosis_samp(price_small_city) FROM houses
         GROUP BY house;
    
         HOUSE KURTOSIS_SAMP(PRICE_BIG_CITY) KURTOSIS_SAMP(PRICE_SMALL_CITY)
    ---------- ----------------------------- -------------------------------
             1                          -1.2                        -.201556
             2                         3.152                      4.98080076
             3                          -1.2                      -2.1671526
    
    SQL>

    PRICE_SMALL_CITYは、PRICE_BIG_CITYと比較して尖度が高くなります。PRICE_SMALL_CITYおよびPRICE_BIG_CITYで、より多くのデータが尾にあるのか、ピーク前後にあるのかに注目します。

  5. 表に行をさらに挿入します。
    SQL> INSERT INTO houses SELECT * FROM houses;
    
    19 rows created.
    
    SQL> /
    
    38 rows created.
    
    SQL> /
    
    76 rows created.
    
    SQL> /
    
    152 rows created.
    
    SQL> COMMIT;
    
    Commit complete.
    
    SQL> SELECT house, KURTOSIS_POP(price_big_city), KURTOSIS_POP(price_small_city) FROM houses
         GROUP BY house ORDER BY 1;
    
         HOUSE KURTOSIS_POP(PRICE_BIG_CITY) KURTOSIS_POP(PRICE_SMALL_CITY)
    ---------- ---------------------------- ------------------------------
             1                        -1.23                      -.7058169
             2                        -.212                     .245200191
             3                         -1.3                     -1.5417881
    
    SQL> SELECT house, KURTOSIS_SAMP(price_big_city), KURTOSIS_SAMP(price_small_city) FROM houses
         GROUP BY house ORDER BY 1;
    
         HOUSE KURTOSIS_SAMP(PRICE_BIG_CITY) KURTOSIS_SAMP(PRICE_SMALL_CITY)
    ---------- ----------------------------- -------------------------------
             1                    -1.2309485                      -.68809876
             2                    -.14695105                      .340165838
             3                    -1.3061439                      -1.5637533
    
    SQL>

    ご覧のとおり、データセット内の値の数が増加すると、KURTOSIS_SAMPおよびKURTOSIS_POPの計算値の差は減少します。

  6. PRICE_SMALL_CITYおよびPRICE_BIG_CITYの個別値の尖度を確認します。
    SQL> SELECT house, 
                         KURTOSIS_POP(DISTINCT price_big_city) pop_big_city, 
                         KURTOSIS_SAMP(DISTINCT price_big_city) samp_big_city,
                         KURTOSIS_POP(DISTINCT price_small_city) pop_small_city, 
                         KURTOSIS_SAMP(DISTINCT price_small_city) samp_small_city  
                  FROM houses
                  GROUP BY house;
    
         HOUSE POP_BIG_CITY SAMP_BIG_CITY POP_SMALL_CITY SAMP_SMALL_CITY
    ---------- ------------ ------------- -------------- ---------------
             1        -1.23          -1.2      -.7058169        -.201556
             2        -.212         3.152     .245200191      4.98080076
             3         -1.3          -1.2     -1.5417881      -2.1671526
    
    SQL>

    問合せで列PRICE_BIG_CITYおよびPRICE_SMALL_CITYの個別値が評価されない場合、結果は大きく異なりますか。

    SQL> SELECT house, 
                         KURTOSIS_POP(price_big_city) pop_big_city, 
                         KURTOSIS_SAMP(price_big_city) samp_big_city,
                         KURTOSIS_POP(price_small_city) pop_small_city, 
                         KURTOSIS_SAMP(price_small_city) samp_small_city  
                  FROM houses
                  GROUP BY house;
    
         HOUSE POP_BIG_CITY SAMP_BIG_CITY POP_SMALL_CITY SAMP_SMALL_CITY
    ---------- ------------ ------------- -------------- ---------------
             1        -1.23    -1.2309485      -.7058169      -.68809876
             2        -.212    -.14695105     .245200191      .340165838
             3         -1.3    -1.3061439     -1.5417881      -1.5637533
    
    SQL>

    まったく同じ行が挿入されたため、母集団の依存性の値は異なりません。

  7. HOUSE番号1のビッグ・データ・セットが含まれる表にさらに行を挿入します。
    SQL> INSERT INTO houses (house, price_big_city, price_small_city)
                  SELECT house, price_big_city*0.5, price_small_city*0.1 
                  FROM houses WHERE house=1;
    
    144 rows created.
    
    SQL> /
    
    288 rows created.
    
    SQL> /
    
    576 rows created.
    
    SQL> /
    
    1152 rows created.
    
    SQL> /
    
    2304 rows created.
    
    SQL> COMMIT;
    
    Commit complete.
    
    SQL> SELECT house, count(house) FROM houses GROUP BY house ORDER BY 1;
    
          HOUSE COUNT(HOUSE)
    ---------- ------------
             1         4608
             2           80
             3           80
    		 
    SQL> SELECT house, 
                         KURTOSIS_POP(price_big_city) pop_big_city, 
                         KURTOSIS_SAMP(price_big_city) samp_big_city,
                         KURTOSIS_POP(price_small_city) pop_small_city, 
                         KURTOSIS_SAMP(price_small_city) samp_small_city  
                  FROM houses
                  GROUP BY house;
    
         HOUSE POP_BIG_CITY SAMP_BIG_CITY POP_SMALL_CITY SAMP_SMALL_CITY
    ---------- ------------ ------------- -------------- ---------------
             1   9.12746931    9.13868421     33.7452495      33.7831972
             2        -.212    -.14695105     .245200191      .340165838
             3         -1.3    -1.3061439     -1.5417881      -1.5637533
    
    SQL>EXIT
    $

    これで、住宅番号1のデータの依存性が正になり、データが右に偏ります。

    PRICE_SMALL_CITYは、PRICE_BIG_CITYと比較して尖度がずっと高くなります。これは、PRICE_BIG_CITYでは、適度なサイズの偏差が非常に頻繁に発生することに分散が起因するのに対し、PRICE_SMALL_CITYでは、頻度が低い極端な偏差が多数発生した結果としてより多くの分散が生じることを意味します。