Practice: Measuring Asymmetry in Data with the SKEWNESS Functions

Overview

This practice shows how to use the SKEWNESS_POP and SKEWNESS_SAMP aggregate functions to measure asymmetry in data. For a given set of values, the result of population skewness (SKEWNESS_POP) and sample skewness (SKEWNESS_SAMP) is always deterministic.

Before starting any new practice, refer to the Practices Environment recommendations.

Step 1 : Set up the environment

  • Connect to PDB21 as HR and execute the /home/oracle/labs/M104784GC10/Houses_Prices.sql SQL script to create a table with skewed data.

    
    $ cd /home/oracle/labs/M104784GC10
    $ sqlplus system@PDB21
    
    Copyright (c) 1982, 2020, Oracle.  All rights reserved.
    
    Enter password:
    Last Successful login time: Mon Mar 16 2020 08:49:41 +00:00
    
    Connected to:
    
    SQL> @/home/oracle/labs/M104784GC10/Houses_Prices.sql
    SQL> SET ECHO ON
    SQL>SQL> DROP TABLE houses;
    DROP TABLE houses
               *
    ERROR at line 1:
    ORA-00942: table or view does not exist
    
    SQL> CREATE TABLE houses (house NUMBER, price_big_city NUMBER, price_small_city NUMBER, price_date DATE);
    
    Table created.
    
    SQL> INSERT INTO houses VALUES (1,100000,10000, sysdate);
    
    1 row created.
    ...
    SQL> INSERT INTO houses VALUES (1,900000,40000, sysdate+5);
    
    1 row created.
    ...
    SQL> COMMIT;
    
    Commit complete.
    
    SQL> 

Step 2 : Examine skewed data

  • Display the table rows. The HOUSE column values refer to types of houses that you want to look at and is used to categorize the data that you look at statistically and compare. With skewness you measure whether there is more data towards the left or the right end of the tail (positive/negative) or how close you are to a normal distribution (skewness = 0).

    
    SQL> SET PAGES 100
    SQL> SELECT * FROM houses;
    
         HOUSE PRICE_BIG_CITY PRICE_SMALL_CITY PRICE_DAT
    ---------- -------------- ---------------- ---------
             1         100000            10000 05-FEB-20
             1         200000            15000 06-FEB-20
             1         300000            25000 06-FEB-20
             1         400000            28000 07-FEB-20
             1         500000            30000 08-FEB-20
             1         600000            32000 08-FEB-20
             1         700000            35000 09-FEB-20
             1         800000            38000 09-FEB-20
             1         900000            40000 10-FEB-20
             2        2000000          1000000 11-FEB-20
             2         200000            20000 05-FEB-20
             2         400000            35000 06-FEB-20
             2         600000            55000 06-FEB-20
             2         800000            48000 07-FEB-20
             3         400000            40000 08-FEB-20
             3         500000            42000 08-FEB-20
             3         600000            45000 09-FEB-20
             3         700000            48000 09-FEB-20
             3         800000            49000 10-FEB-20
    
    19 rows selected.
    
    SQL> 
  • Display the result of population skewness prices (SKEWNESS_POP) and sample skewness prices (SKEWNESS_SAMP) for the three houses in the table.

    
    SQL> SELECT house, count(house) FROM houses GROUP BY house ORDER BY 1;
    
         HOUSE COUNT(HOUSE)
    ---------- ------------
             1            9
             2            5
             3            5
    
    SQL> SELECT house, SKEWNESS_POP(price_big_city), SKEWNESS_POP(price_small_city) FROM houses
            GROUP BY house;
    
         HOUSE SKEWNESS_POP(PRICE_BIG_CITY) SKEWNESS_POP(PRICE_SMALL_CITY)
    ---------- ---------------------------- ------------------------------
             1                            0                     -.66864012
             2                   1.13841996                     1.49637083
             3                            0                     -.12735442
    
    SQL> SELECT house, SKEWNESS_SAMP(price_big_city), SKEWNESS_SAMP(price_small_city) FROM houses
            GROUP BY house;
    
         HOUSE SKEWNESS_SAMP(PRICE_BIG_CITY) SKEWNESS_SAMP(PRICE_SMALL_CITY)
    ---------- ----------------------------- -------------------------------
             1                             0                      -.81051422
             2                    1.69705627                      2.23065793
             3                             0                      -.18984876
    
    SQL> 

    Skewness is important in a situation where PRICE_BIG_CITY and PRICE_SMALL_CITY represent the prices of houses to buy and you want to determine whether the outliers in data are biased towards the left end or right end of the distribution, that is, if there are more values to the left of the mean when compared to the number of values to the right of the mean.

Step 3 : Examine skewed data after data evolution

  • Insert more rows into the table.

    
    SQL> INSERT INTO houses SELECT * FROM houses;
    
    19 rows created.
    
    SQL> /
    
    38 rows created.
    
    SQL> /
    
    76 rows created.
    
    SQL> /
    
    152 rows created.
    
    SQL> COMMIT;
    
    Commit complete.
    
    SQL> SELECT house, SKEWNESS_POP(price_big_city), SKEWNESS_POP(price_small_city) FROM houses
         GROUP BY house ORDER BY 1;
    
         HOUSE SKEWNESS_POP(PRICE_BIG_CITY) SKEWNESS_POP(PRICE_SMALL_CITY)
    ---------- ---------------------------- ------------------------------
             1                            0                     -.66864012
             2                   1.13841996                     1.49637083
             3                            0                     -.12735442
    
    SQL> SELECT house, SKEWNESS_SAMP(price_big_city), SKEWNESS_SAMP(price_small_city) FROM houses
         GROUP BY house ORDER BY 1;
    
         HOUSE SKEWNESS_SAMP(PRICE_BIG_CITY) SKEWNESS_SAMP(PRICE_SMALL_CITY)
    ---------- ----------------------------- -------------------------------
             1                             0                      -.67569912
             2                     1.1602897                      1.52511703
             3                             0                      -.12980098
    
    SQL> 

    As the number of values in the data set increases, the difference between the computed values of SKEWNESS_SAMP and SKEWNESS_POP decreases.

  • Determine the skewness of distinct values in the PRICE_BIG_CITY and PRICE_SMALL_CITY columns.

    
    SQL> SELECT house, 
                         SKEWNESS_POP(DISTINCT price_big_city) pop_big_city, 
                         SKEWNESS_SAMP(DISTINCT price_big_city) samp_big_city,
                         SKEWNESS_POP(DISTINCT price_small_city) pop_small_city, 
                         SKEWNESS_SAMP(DISTINCT price_small_city) samp_small_city  
                  FROM houses
                  GROUP BY house;
    
         HOUSE POP_BIG_CITY SAMP_BIG_CITY POP_SMALL_CITY SAMP_SMALL_CITY
    ---------- ------------ ------------- -------------- ---------------
             1            0             0     -.66864012      -.81051422
             2   1.13841996    1.69705627     1.49637083      2.23065793
             3            0             0     -.12735442      -.18984876
    
    SQL> 

    Is the result much different if the query does not evaluate the distinct values in PRICE_BIG_CITY and PRICE_SMALL_CITY?

    
    SQL> SELECT house, 
                         SKEWNESS_POP(price_big_city) pop_big_city, 
                         SKEWNESS_SAMP(price_big_city) samp_big_city,
                         SKEWNESS_POP(price_small_city) pop_small_city, 
                         SKEWNESS_SAMP(price_small_city) samp_small_city  
                  FROM houses
                  GROUP BY house;
    
         HOUSE POP_BIG_CITY SAMP_BIG_CITY POP_SMALL_CITY SAMP_SMALL_CITY
    ---------- ------------ ------------- -------------- ---------------
             1            0             0     -.66864012      -.67569912
             2   1.13841996     1.1602897     1.49637083      1.52511703
             3            0             0     -.12735442      -.12980098
    
    SQL> 

    The population skewness value is not different because the same exact rows were inserted.

  • Insert more rows into the table with a big data set for HOUSE number 1.

    
    SQL> INSERT INTO houses (house, price_big_city, price_small_city)
                  SELECT house, price_big_city*0.5, price_small_city*0.1 
                  FROM houses WHERE house=1;
    
    144 rows created.
    
    SQL> /
    
    288 rows created.
    
    SQL> /
    
    576 rows created.
    
    SQL> /
    
    1152 rows created.
    
    SQL> /
    
    2304 rows created.
    
    SQL> COMMIT;
    
    Commit complete.
    
    SQL> SELECT house, count(house) FROM houses GROUP BY house ORDER BY 1;
    
          HOUSE COUNT(HOUSE)
    ---------- ------------
             1         4608
             2           80
             3           80
    		 
    SQL> SELECT house, 
                         SKEWNESS_POP(price_big_city) pop_big_city, 
                         SKEWNESS_SAMP(price_big_city) samp_big_city,
                         SKEWNESS_POP(price_small_city) pop_small_city, 
                         SKEWNESS_SAMP(price_small_city) samp_small_city  
                  FROM houses
                  GROUP BY house;
    
         HOUSE POP_BIG_CITY SAMP_BIG_CITY POP_SMALL_CITY SAMP_SMALL_CITY
    ---------- ------------ ------------- -------------- ---------------
             1   2.57050631    2.57134341      5.7418481      5.74371797
             2   1.13841996     1.1602897     1.49637083      1.52511703
             3            0             0     -.12735442      -.12980098
    
    SQL> EXIT
    $