Skip Headers
Oracle® R Enterprise User's Guide
Release 11.2 for Linux, Solaris, AIX, and Windows

Part Number E26499-05
Go to Documentation Home
Home
Go to Book List
Book List
Go to Table of Contents
Contents
Go to Index
Index
Go to Feedback page
Contact Us

Go to previous page
Previous
Go to next page
Next
PDF · Mobi · ePub

4 Oracle R Enterprise Statistical Functions

This chapter describes Oracle R Enterprise functions that perform most common or base statistical procedures. These functions are designed to help users who are converting from commercially available products to Oracle R Enterprise.

Oracle R Enterprise provides these collections of functions:

The use of the functions is illustrated with examples. Most of the examples use the same data, described in Data for Examples.

Data for Examples

Most of the examples use the table NARROW, which is installed in your database when you install with Oracle R Enterprise.

NARROW is an ore.frame with 9 columns:

R> class(NARROW)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> names(NARROW)
[1] "ID"             "GENDER"         "AGE"            "MARITAL_STATUS"
[5] "COUNTRY"        "EDUCATION"      "OCCUPATION"     "YRS_RESIDENCE" 
[9] "CLASS"    

ore.corr

ore.corr performs correlation analysis across numeric columns in an ore.frame.

ore.corr supports partial correlations with a control column.

ore.corr enables aggregations prior to correlations.

ore.corr allows post-processing of results and integration into an R code flow.

The output of ore.corr can be made to conform to output of the R cor() function; this allows the output of ore.corr to be post-processed by any R function or graphics.

See ore.corr Parameters for syntax and output and ore.corr Examples for examples.

ore.corr Parameters

ore.corr has these parameters:

  • data: The data for which to compute correlation coefficients as an ore.frame.

  • var: The numeric column(s) of data for which to build correlation matrix

  • group.by: Indicates the correlation matrices to calculate; ore.corr calculates as many correlation matrices as unique values in group.by columns; default value is NULL

  • weight: A column of the data whose numeric values provide a multiplicative factor for var columns; default value is NULL

  • partial: columns of data to use as control variables for partial correlation; default value is NULL

  • stats: The method of calculating correlations; one of pearson (default), spearman, kendall

ore.corr returns an ore.frame as output in all cases except when group.by is used. If group.by is used, returns an Oracle R Enterprise list object.

To convert the output of ore.corr into R cor()-compatible output format, use:

OREeda:::ore.corr.as.matrix()

ore.corr Examples

These examples show how to use ore.corr:

These examples use the NARROW data set; for more information, see Data for Examples.

Basic Correlation Calculations

Before you can use ore.corr, you must project out all non-numerical values:

R> names(NARROW)
 [1] "ID"             "GENDER"     "AGE"    "MARITAL_STATUS" 
"COUNTRY"        "EDUCATION"      "OCCUPATION"
 [8] "YRS_RESIDENCE"  "CLASS"          "AGEBINS"
R> NARROW=NARROW[,c(3,8,9)]

Now calculate correlation in several ways:

R> x=ore.corr(NARROW,var='AGE,YRS_RESIDENCE,CLASS')
#Calculate using Spearman
R> x=ore.corr(NARROW,var='AGE,YRS_RESIDENCE,CLASS', stats='spearman')
# Calculate using Kendall
R> x=ore.corr(NARROW,var='AGE,YRS_RESIDENCE,CLASS', stats='kendall')
# Convert output so that it is compatible with output of R cor
R> cor_compatible_matrix = OREeda:::.ore.corr.as.matrix(x)
R> class(cor_compatible_matrix)
[1] "matrix"

Partial Correlation

Use the version of NARROW with non-numeric values that was created in Basic Correlation Calculations.

Calculate partial correlation using Spearman's methods:

R> x=ore.corr(NARROW,var='AGE,YRS_RESIDENCE,CLASS', stats='spearman', partial='GENDER')

Create Several Correlation Matrices

Use the version of NARROW with non-numeric values that was created in Basic Correlation Calculations.

Create several correlation matrices and then convert the output so that it is compatible with R output:

R> x=ore.corr(NARROW,var='AGE,YRS_RESIDENCE,CLASS', stats='pearson', partial='GENDER', group.by='COUNTRY')
R> class(x)
[1] "list"
R> cor_compatible_matrix = OREeda:::.ore.corr.as.matrix(x[[1]])

Visualization of Correlations

If you calculate several matrices, you can use R packages to visualize them.

ore.crosstab

Cross tabulation is a statistical technique that finds an interdependent relationship between two tables of values.

ore.crosstab enables cross column analysis of an ore.frame. This function is a sophisticated variant of the R table() function.

ore.crosstab must be performed before frequency analysis is done using ore.freq.

You can extend the cross tab calculation with various sums as described in ore.extend.

ore.crosstab is written in R. The function is mapped to SQL that gets executed at the database server.

See ore.crosstab Parameters for syntax and output and ore.crosstab Examples for examples.

You can use ore.extend to augment crass tabulation.

ore.crosstab Parameters

ore.crosstab has these parameters:

  • expr: the cross tabulation definition

    [COLUMN_SPEC] ~ COLUMN_SPEC [*<WEIGHTING COLUMN>] [/<GROUPING COLUMN>]
     [^<STRATIFICATION COLUMN>] [|ORDER_SPECIFICATION]
            COLUMN_SPEC is <column-name>[+COLUMN_SET][+COLUMN_RANGE]
            COLUMN_SET is <column_name>[+COLUMN_SET]
            COLUMN_RANGE is <FROM COLUMN>-<TO COLUMN
    

    where

    COLUMN_SPEC is <column>[+COLUMN_SET][+COLUMN_RANGE]
    COLUMN_SET is <column>[+COLUMN_SET]
    COLUMN_RANGE is (<from column>-<to column>)
    ORDER_SPECIFICATION is one of [-]NAME, [-]DATA, [-]FREQ, or INTERNAL
    

    The stratification column is used to cluster, or group, data. When used, the values contribute to the ORE$STRATA column of the resulting cross-tabulated table.

  • data: the ore.frame containing the data to cross tabulate

  • grouping column: as many cross tabulations as unique values in grouping columns; default value is NULL

  • order: defines optional sorting of output data. Specify [-]NAME to sort by tabulation columns, [-]FREQ to sort by frequency counts in table. Unspecified order is the most efficient. The optional '-' reverses the order direction.

  • weights: column of the data that indicates the frequency of the corresponding row; default value is NULL

  • partial: columns of data to use as control variables for partial correlation; default value is NULL

ore.crosstab returns an ore.frame as output in all cases except when multiple tables are created. If multiple tables are created, ore.crosstab returns an Oracle R Enterprise list object.

ore.crosstab Examples

These examples illustrate use of ore.crosstab:

These examples use the NARROW data set; for more information, see Data for Examples.

Single-Column Frequency Table

The most basic use case is to create a single column frequency table. The following command filters NARROW) grouping by GENDER:

R> ct = ore.crosstab(AGE, data=NARROW)
R> ct

Analyze Two Columns

This command analyses AGE by GENDER aid AGE by CLASS:

R> ct = ore.crosstab(AGE~GENDER+CLASS, data=NARROW)
R> head(ct)

Weighting Rows

To weight rows, include count based on another column; this example weights values in AGE and GENDER using values in YRS_RESIDENCE:

R> ct = ore.crosstab(AGE~GENDER*YRS_RESIDENCE, data=NARROW)
R> head(ct)

Order Rows in the Cross Tabulated Table

There are several possibilities:

  • Default or NAME- Order by the columns being analyzed

  • FREQ - Order by frequency counts

  • -NAME or -FREQ does reverse ordering

  • INTERNAL - Bypass ordering

Here are two examples:

R> ct = ore.crosstab(AGE~GENDER|FREQ, data=NARROW)
R> head(ct)
  AGE GENDER ORE$FREQ ORE$STRATA ORE$GROUP

R> ct = ore.crosstab(AGE~GENDER|-FREQ, data=NARROW)
R> head(ct)

Analyze Three or More Columns

This is similar to what SQL GROUPING SETs accomplish:

ct = ore.crosstab(AGE+COUNTRY~GENDER, NARROW)

Specify a Range of Columns

You can specify a range of columns instead of having to type all the column names, as illustrated in this example:

R> names(NARROW)
[1] "ID"             "GENDER"         "AGE"            "MARITAL_STATUS"
[5] "COUNTRY"        "EDUCATION"      "OCCUPATION"     "YRS_RESIDENCE" 
[9] "CLASS"    

Since AGE, MARITAL_STATUS and COUNTRY are successive columns, you can simply use

ct = ore.crosstab(AGE-COUNTRY~GENDER, NARROW)

An equivalent version is

ct = ore.crosstab(AGE+MARITAL_STATUS+COUNTRY~GENDER, NARROW)

Produce One Cross Table for Each Value of Another Column

This command produces one cross table (AGE, GENDER) for each unique value of another column COUNTRY:

R> ct=ore.crosstab(~AGE/COUNTRY, data=NARROW)
R> head(ct)

You can extend this to more than one column. For example, this command produces one (AGE,EDUCATION) table for each unique combination of (COUNTRY, GENDER):

R> ct = ore.crosstab(AGE~EDUCATION/COUNTRY+GENDER, data=NARROW)

Augment Cross Tabulation with Stratification

All of the above cross tabs can be augmented with stratification. For example,

R> ct = ore.crosstab(AGE~GENDER^CLASS, data=NARROW) 
R> head(ct)

The command in this example is the same as

ct = ore.crosstab(~GENDER, NARROW, strata="CLASS")

Custom Binning Followed by Cross Tabulation

First bin AGE, the calculate cross tabulation for GENDER and the bins:

R> NARROW$AGEBINS=ifelse(NARROW$AGE<20, 1, ifelse(NARROW$AGE<30,2, ifelse(NARROW$AGE<40,3,4)))
R> ore.crosstab(GENDER~AGEBINS, NARROW)

ore.extend

The cross tabulation produced using ore.crosstab can be further augmented with these three basic statistics:

  • Row and Column Sums

    crosstab = ore.extend.sum(crosstab)
    
  • Cumulative sums for each cell of the table

    crosstab = ore.extend.cumsum(crosstab)
    
  • Total for the entire table

    crosstab = ore.extend.total(crosstab)
    

The following example illustrates ore.extend:

R> ct = ore.crosstab(~GENDER, NARROW)
R> ct = ore.extend.sum(ct)
R> ct
  GENDER ORE$FREQ ORE$STRATA ORE$GROUP ORE$SUM$GENDER
0      F      421          1         1            421
1      M      880          1         1            880

ore.freq

ore.crosstab must be performed before frequency analysis is done using ore.freq.

ore.freq analyses the output of ore.crosstab and automatically determines the techniques that are relevant to an ore.crosstab result. The techniques depend on the kind of cross tables:

ore.freq uses Oracle Database SQL functions when available.

See ore.freq Parameters for syntax and output and ore.freq Examples for examples.

ore.freq Parameters

ore.freq supports these parameters:

  • crosstab: ore.frame output from ore.crosstab()

  • stats: List of statistics required; these statistics are supported:

    • Chi Square: AJCHI, LRCHI, MHCHI, PCHISQ

    • Kappa: KAPPA, WTKAP

    • Lambda: LAMCR, LAMRC, LAMDAS

    • Correlation: KENTB,PCORR, SCORR

    • Stuart's Tau, Somers: D|C, STUTC, SMDCR,SMDRC

    • Fisher's, Cochran's Q, FISHER, COCHQ

    • Odds Ratio: OR, MHOR, LGOR

    • Relative Risk: RR,MHRR,ALRR

    • Others: MCNEM, PHI, CRAMV, CONTGY, TSYM, TREND, GAMMA

    The default value is NULL.

  • Params: Control parameters specific to the statistical function specified in stats:

    • SCORE: TABLE|RANK|RIDIT|MODRIDIT

    • ALPHA: number

    • WEIGHTS: number

    The default value is NULL.

  • skip.missing: Either TRUE or FALSE; skip cells with missing values in the cross table; default value is FALSE

  • skip.failed: Either TRUE or FALSE; if a statistical test required fails on the cross table because it is found to be in-applicable to the table then return immediately; ; default value is FALSE

ore.freq returns an ore.frame in all cases.

ore.freq Examples

These examples use the NARROW data set; for more information, see Data for Examples.

Before you use ore.freq, you must calculate cross tabs.

For example:

R> ct = ore.crosstab(~GENDER, NARROW)
R> ore.freq(ct)
  METHOD     FREQ DF PVALUE      DESCR GROUP
0   PCHI 161.9377  1      0 Chi-Square     1

ore.rank

ore.rank analyzes distribution of values in numeric columns of an ore.frame.

ore.rank supports useful functionality, including:

ore.rank syntax is simpler that the corresponding SQL queries.

See ore.rank Parameters for syntax and ore.rank Examples for examples.

ore.rank Parameters

ore.rank supports these parameters:

  • data: The ore.frame containing the data to rank

  • var: numeric columns in data to rank

  • desc: If desc=TRUE, rank in descending order; otherwise, rank in ascending order. (The default is to rank in ascending order.)

  • groups: Partition rows into #groups based on ranks. For percentiles, groups=100, For deciles, groups=10, For quartiles, groups=4.

    The default value is NULL.

  • group.by: Rank each group identified by group.by columns separately

    The default value is NULL.

  • ties: Specify how to treat ties. Assign the largest of, or smallest of, or mean of corresponding ranks to tied values

    The default value is NULL.

  • fraction: The rank of a column value divided by the number of non-missing column values; the default value is FALSE.

    Use with nplus1 to estimate the cumulative distribution function

  • nplus1: fraction plus 1, that is, 1 plus the rank of a column value divided by the number of non-missing column values; the default value is FALSE.

    Use with fraction to estimate the cumulative distribution function

  • percent: fraction converted to a percent value, that is fraction * 100.

ore.rank returns anore.frame in all instances.

You can use these R scoring methods with ore.rank:

  • To compute exponential scores from ranks, use savage.

  • To compute normal scores, use one of blom, tukey, or vw(van der Waerden).

ore.rank Examples

These examples illustrate using ore.rank:

These examples use the NARROW data set; for more information, see Data for Examples.

Rank Two Columns

This example ranks the two columns AGE and CLASS and reports the results as derived columns; values are ranked in the default order (ascending):

R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass')

Handle Ties

This example ranks the two columns AGE and CLASS. If there is a tie, the smallest value is signed to all tied values:

R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', ties='low')

Rank Within Groups

This example ranks the two columns AGE and CLASS and ranks the values according to COUNTRY:

R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', group.by='COUNTRY')

Partition into Deciles

This example ranks the two columns AGE and CLASS and partitions the columns into deciles (10 partitions):

R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',groups=10)

To partition the columns into a different number of partitions, change the value of groups. For example, groups=4 partitions into quartiles.

Estimate Cumulative Distribution Function

This example ranks the two columns AGE and CLASS and estimates the cumulative distribution function for both columns:

R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',nplus1=TRUE)

Score Ranks

This example ranks the two columns AGE and CLASS and scores the ranks in tow different ways. The first command all partitions the columns into percentiles (100 groups). savage calculates exponential scores and blom calculates normal scores:

R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge,            CLASS=RankOfClass',score='savage', groups=100, group.by='COUNTRY')
R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',score='blom')

ore.sort

ore.sort enables flexible sorting of a data frame along one or more columns specified in a by clause.

ore.sort can be used with other data pre-processing functions. The results of sorting can provide input to R visualization.

ore.sort sorting takes places in the Oracle database. ore.sort supports the database nls.sort option.

See ore.sort Parameters for syntax and ore.sort Examplesfor examples.

ore.sort Parameters

ore.sort supports these parameters:

  • data: ore.frame containing the data to be sorted; required

  • by: the column(s) in data by which to sort the data; required

  • stable: Relative order is maintained within sorted group (TRUE or FALSE); default value is FALSE

  • reverse: Optional reversal of collation order for character variables (TRUE or FALSE); default value is FALSE

  • unique.keys: Optional deletion of observations with duplicate values in the columns being sorted, TRUE or FALSE; default value is FALSE

  • unique.data: Optional deletion of observations duplicate values in all columns, TRUE or FALSE; default value is FALSE

data and by are required parameters; all other parameters are optional

ore.sort returns an ore.frame.

ore.sort Examples

The following examples illustrate using ore.sort:

Most of these examples use the NARROW data set; for more information, see Data for Examples. There are also Examples Using ONTIME_S.

Sort Columns in Descending Order

Sort the columns AGE and GENDER in descending order:

R> x=ore.sort(data=NARROW,by='AGE,GENDER', reverse=TRUE)

Sort Different Columns in Different Orders

Sort AGE in descending order and GENDER in ascending order:

R> x=ore.sort(data=NARROW,by='-AGE,GENDER')

Sort and Return One Row per Unique Value

Sort by AGE and keep one row per unique value of AGE:

R> x=ore.sort(data=NARROW,by='AGE', unique.key=TRUE)

Remove Duplicate Columns

Sort by AGE and remove duplicate rows:

R> x=ore.sort(data=NARROW,by='AGE', unique.data=TRUE)

Remove Duplicate Columns and Return One Row per Unique Value

Sort by AGE. Also remove duplicate rows, and return one row per unique value of AGE:

R> x=ore.sort(data=NARROW,by='AGE', unique.data=TRUE,unique.key = TRUE)

Preserve Relative Order in Output

Maintain the relative order in the sorted output:

R> x=ore.sort(data=NARROW,by='AGE', stable=TRUE)

Examples Using ONTIME_S

These examples use the ONTIME_S airline data that is installed when you install Oracle R Enterprise:

  • Sort ONTIME_S by airline name in descending order and departure delay in ascending order:

    R> sortedOnTime1 <- ore.sort(data=ONTIME_S, by='-UNIQUECARRIER,DEPDELAY')
    
  • Sort ONTIME_S by airline name and departure delay and select one of each combination (that is, return a unique key):

    R> sortedOnTime1 <- ore.sort(data=ONTIME_S, by='-UNIQUECARRIER,DEPDELAY', unique.key=TRUE)
    

ore.summary

ore.summary calculates descriptive statistics and supports extensive analysis of columns in an ore.frame, along with flexible row aggregations.

ore.summary supports these statistics:

ore.summary provides a relatively simple syntax compared with SQL queries for the same results.

See ore.summary Parameters for syntax and ore.summary Examples for examples.

ore.summary Parameters

ore.summary supports these parameters:

  • data: the data to aggregate as an ore.frame

  • class: column(s) of data to aggregate (that is, SQL GROUP BY); default value is NULL

  • var: column(s) of data on which to apply statistics functions (SQL SELECT list)

  • stats: list of statistics functions to be applied on var columns

    mean, min, max, cnt, n, nmiss, css, uss, cv, sum, sumwgt, range, stddev, stderr, var, t, kurt, skew,p1, p5, p10, p25, p50, p75, p90, p95, p99, qrange, lclm, rclm, clm, mode that can be requested on var columns.

    The default value are n, mean, min, max.

  • weight: A column of data whose numeric values provide a multiplicative factor for var columns

  • maxid, minid: for each group optionally list maximum or minimum value from other columns in data; default value is NULL

  • ways: restrict output to only certain grouping levels of the class variables; default value is NULL

  • group.by: column(s) of data to stratify summary results across; default value is NULL

  • order: defines optional sorting of output data. Specify [-]NAME to sort by tabulation columns, [-]FREQ to sort by frequency counts in table. Unspecified order is the most efficient. The optional '-' reverses the order direction

  • _FREQ: frequency, number of observations in a group

  • _TYPE: identifies the grouping, binary code based

  • _LEVEL: identifies number of variables used in grouping

ore.summary returns an ore.frame as output in all cases except when a group.by clause is used. If a group.by clause is used, ore.summary returns a list of ore.frames, one frame per stratum.

ore.summary Examples

These examples illustrate the use of ore.summary:

These examples use the NARROW data set; for more information, see Data for Examples.

Calculate Default Statistics

This example calculates mean, min, max for columns AGE and CLASS and rolls up (aggregates) GENDER:

R> ore.summary(NARROW, class='GENDER', var ='AGE,CLASS'. order='freq')

Skew and t Test

This example calculates skew for skew of AGE as column A and the t-test for CLASS as column B:

R> ore.summary(NARROW, class='GENDER', var='AGE,CLASS',  stats='skew(AGE)=A, probt(CLASS)=B')

Weighted Sum

This example calculates weighted sum for AGE aggregated by GENDER with YRS_RESIDENCE as weights; in other words, it calculates sum(var*weight):

R> ore.summary(NARROW, class='GENDER', var='AGE', stat='sum=X', weight='YRS_RESIDENCE')

Two Separate Group By Columns

Group CLASS by GENDER and MARITAL_STATUS:

r> ore.summary(NARROW, class='GENDER, MARITAL_STATUS', var='CLASS', ways=1)

All Possible Group By

This example groups CLASS in all possible ways by GENDER and MARTIAL_STATUS:

R> ore.summary(NARROW, class='GENDER, MARITAL_STATUS', var='CLASS', ways='nway')

ore.univariate

ore.univariate provides distribution analysis of numeric variables in an ore.frame.

ore.univariate provides these statistics:

See ore.univariate Parameters for syntax and ore.univariate Examples for examples.

ore.univariate Parameters

ore.univariate supports these parameters:

  • data: The data to aggregate as an ore.frame

  • var: Numerical column(s) of data to analyze

  • weight: A column of the data whose numeric values provide a multiplicative factor for var columns; the default value is NULL

  • stats: Optional specification of a subset of statistics to calculate and display:

    • moments: n, sumwgt, mean, sum, stddev, var, skew, kurt., uss.css.cv, stderr

    • measures: mean, stddev, median, var, mode, range, iqr

    • quantiles: p100, p99, p95, p90, p75, p50, p25, p10, p5, p1, p0

    • location: studentt, studentp, signt, signp, srankt, srankp

    • normality

    • loccount: loc<,loc>,loc!

    • extremes

    The default value is NULL.

ore.univariate returns anore.frame as output in all cases.

ore.univariate Examples

These examples illustrate the use of ore.univariate:

These examples use the NARROW data set; for more information, see Data for Examples.

Default Univariate Statistics

This example calculates the default univariate statistics for AGE, YRS_RESIDENCE, and CLASS:

R> ore.univariate(NARROW, var="AGE,YRS_RESIDENCE,CLASS")

Location Statistics

This example calculates location statistics for YRS_RESIDENCE:

R> ore.univariate(NARROW, var="YRS_RESIDENCE",stats="location")

Complete Quantile Statistics

This example calculates complete quantile statistics for AGE and YRS_RESIDENCE:

R> ore.univariate(NARROW, var="AGE,YRS_RESIDENCE",stats="quantiles")