MGPS independence model

Prior to Version 5.0 of WebVDME (the previous name of the Oracle Empirica Signal application), signal scores for 3D data mining results were computed using the MGPS independence model described here. Starting in WebVDME 5.0, the primary way of computing signal scores for 3D results changed to the MGPS interaction model. For compatibility with older data mining runs, the application continues to compute scores using the independence model as well.

The MGPS interaction model computes values of INTSS, described in MGPS computations. The independence model computes values of EBGMDIF_IND. The names of columns related to the independence model end in _IND.

The difference between the independence model and the interaction model affects only 3D results. Thus, the values of EBGM and EBGM_IND, which are for 2D results, are the same.

2D results

For a 2D result, it is assumed that the probability of finding items i and j in the same report is the same as the probability of finding i times the probability of finding j:

P(i,j) = P(i) x P(j)

For a two-dimensional run that is not stratified, expected counts are computed as follows:

E(i,j) = Ntotal x P(i,j) = Ntotal x P(i) x P(j)
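
The unstratified formula can be illustrated with a minimal Python sketch. It assumes that each report is represented as a set of item names and that P(i) is estimated as the fraction of reports containing item i; the function name, the toy data, and this estimation choice are hypothetical and are not taken from the application.

```python
def expected_count_2d(reports, item_i, item_j):
    """Expected count E(i,j) = Ntotal x P(i) x P(j) under independence."""
    n_total = len(reports)
    if n_total == 0:
        return 0.0
    # Assumption: each marginal probability is the share of reports
    # that contain the item.
    p_i = sum(item_i in r for r in reports) / n_total
    p_j = sum(item_j in r for r in reports) / n_total
    return n_total * p_i * p_j


# Hypothetical toy data: each report is a set of item names.
reports = [
    {"DrugA", "Rash"},
    {"DrugA", "Nausea"},
    {"DrugB", "Rash"},
    {"DrugA", "Rash"},
]

expected = expected_count_2d(reports, "DrugA", "Rash")
# Observed count: reports that contain both items.
observed = sum({"DrugA", "Rash"} <= r for r in reports)
print(expected, observed)
```

In this toy data, P(DrugA) and P(Rash) are both 3/4, so the expected count is 4 x 3/4 x 3/4 = 2.25, against an observed count of 2.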

For a two-dimensional run that is stratified, computations are as follows:

E(i,j) = ΣEc(i,j) = ΣNtotal,c x Pc(i,j) = ΣNtotal,c x Pc(i) x Pc(j)

where the summations occur over the C strata.
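
As a rough illustration of the stratified sum, the following Python sketch computes Ec(i,j) within each stratum and adds the results. The dictionary layout, the stratum labels, and the estimation of Pc(i) as the within-stratum fraction of reports containing item i are assumptions made for the example, not details of the application.

```python
def stratified_expected_count_2d(strata, item_i, item_j):
    """E(i,j) = sum over strata c of Ntotal,c x Pc(i) x Pc(j)."""
    total = 0.0
    for reports in strata.values():
        n_c = len(reports)
        if n_c == 0:
            continue  # an empty stratum contributes nothing
        # Assumption: probabilities are estimated within the stratum.
        p_i = sum(item_i in r for r in reports) / n_c
        p_j = sum(item_j in r for r in reports) / n_c
        total += n_c * p_i * p_j
    return total


# Hypothetical strata: stratum label -> list of reports (sets of items).
strata = {
    "female": [{"DrugA", "Rash"}, {"DrugA", "Rash"}],
    "male": [{"DrugA", "Nausea"}, {"DrugB", "Rash"}],
}

stratified_expected = stratified_expected_count_2d(strata, "DrugA", "Rash")
print(stratified_expected)
```

Here the first stratum contributes 2 x 1 x 1 = 2 and the second contributes 2 x 0.5 x 0.5 = 0.5, for a total of 2.5. Pooling the same four reports into a single stratum would give 2.25, which shows how stratification can change the expected count.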

The signal score is the EBGM_IND value, a shrinkage estimate that is more stable than RR. The EBGM value is computed as the geometric mean of the posterior distribution of the true Relative Ratio.

3D results

For a 3D result, it is assumed that the probability of finding items i, j, and k in the same report is the same as the probability of finding i times the probability of finding j times the probability of finding k:

P(i,j,k) = P(i) x P(j) x P(k)

For a three-dimensional unstratified run, the expected count is computed as follows:

E(i,j,k) = Ntotal x P(i,j,k) = Ntotal x P(i) x P(j) x P(k)

In the stratified independence model, it is assumed that the above equations hold within each stratum. So, if there are C strata:

Pc(i,j,k) = Pc(i) x Pc(j) x Pc(k)

for a given stratum c.

For a three-dimensional stratified analysis, the expected count is computed as follows:

E(i,j,k) = ΣEc(i,j,k) = ΣNtotal,c x Pc(i,j,k) = ΣNtotal,c x Pc(i) x Pc(j) x Pc(k)

where the summations occur over the C strata.
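
The three-dimensional expected count can be sketched in Python in the same spirit. The sketch treats an unstratified run as the special case of a single stratum. Representing reports as sets of item names and estimating Pc(i) as the within-stratum fraction of reports containing item i are assumptions for illustration only; the function name and data are hypothetical.

```python
def expected_count_3d(strata, item_i, item_j, item_k):
    """E(i,j,k) = sum over strata c of Ntotal,c x Pc(i) x Pc(j) x Pc(k)."""
    total = 0.0
    for reports in strata:
        n_c = len(reports)
        if n_c == 0:
            continue  # an empty stratum contributes nothing
        p_joint = 1.0
        for item in (item_i, item_j, item_k):
            # Assumption: marginal probability estimated within the stratum.
            p_joint *= sum(item in r for r in reports) / n_c
        total += n_c * p_joint
    return total


# Hypothetical data: two strata, each a list of reports (sets of items).
stratum_1 = [{"DrugA", "DrugB", "Rash"}, {"DrugA", "Rash"}]
stratum_2 = [{"DrugA", "DrugB"}, {"DrugB", "Rash"}, {"DrugA"}, {"Nausea"}]

stratified = expected_count_3d([stratum_1, stratum_2], "DrugA", "DrugB", "Rash")
# Unstratified run: all reports pooled into one stratum.
unstratified = expected_count_3d([stratum_1 + stratum_2], "DrugA", "DrugB", "Rash")
print(stratified, unstratified)
```

With this data the first stratum contributes 2 x 1 x 0.5 x 1 = 1.0 and the second contributes 4 x 0.5 x 0.5 x 0.25 = 0.25, so the stratified expected count is 1.25. The pooled (unstratified) expected count is 6 x 4/6 x 3/6 x 3/6 = 1.0.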

The signal score for a 3D combination is the EBGMDIF_IND value, computed as follows:

EBGMDIF_IND = EBGM_IND - (E2D_IND / E_IND)

where E2D_IND is the expected count based on the all-2-factor log linear model.
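
The score arithmetic can be sketched in Python as follows, assuming the division is evaluated before the subtraction (standard operator precedence). The function name and the input values are hypothetical; in practice EBGM_IND, E2D_IND, and E_IND come from the data mining run, and the sketch does not show how they are estimated.

```python
def ebgmdif_ind(ebgm_ind, e2d_ind, e_ind):
    """EBGMDIF_IND = EBGM_IND - (E2D_IND / E_IND)."""
    if e_ind <= 0:
        raise ValueError("E_IND must be positive")
    # E2D_IND / E_IND is the ratio predicted by the all-2-factor model
    # relative to the independence expectation; subtracting it leaves
    # the part of the score not accounted for by pairwise associations.
    return ebgm_ind - e2d_ind / e_ind


# Hypothetical values chosen only to make the arithmetic easy to follow.
score = ebgmdif_ind(ebgm_ind=4.0, e2d_ind=3.0, e_ind=1.5)
print(score)
```

With these made-up inputs, E2D_IND / E_IND is 2.0, so the score is 4.0 - 2.0 = 2.0.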