MGPS computations

The analytical core of the Oracle Empirica Signal application is a high-performance implementation of the MGPS (Multi-Item Gamma Poisson Shrinker) algorithm. MGPS is based on the metaphor of the market basket problem, in which a database of transactions (adverse event reports) is mined for the occurrence of interesting (unexpectedly frequent) itemsets. For example, these itemsets can represent simple drug-event pairings, or more complex combinations of multiple drugs and events representing interactions and/or syndromes. Interestingness is related to the factor by which the observed frequency of an itemset differs from a nominal baseline frequency.

For each itemset in the database, a Relative Reporting Ratio (RR) is defined as the observed count (N) for that itemset divided by the expected count (E). MGPS allows for the possibility that the database contains heterogeneous strata with significantly different item frequency distributions. To avoid concluding that an itemset is unusually frequent just because the items involved individually all tend to occur more frequently in a particular stratum (Simpson's paradox), MGPS uses the Mantel-Haenszel approach of computing stratum-specific expected counts and then summing over the strata to obtain a database-wide value for the expected count.
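The stratified calculation can be sketched for a single drug-event pair as follows. This is an illustrative sketch, not the application's implementation: it assumes that the expected count within a stratum is the independence value (reports with the drug times reports with the event, divided by all reports in the stratum), and the Stratum class, the function names, and the counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    reports: int     # all reports in the stratum
    with_drug: int   # reports that mention the drug
    with_event: int  # reports that mention the event
    with_both: int   # reports that mention both (observed count)


def expected_count(strata):
    """Sum the stratum-specific expected counts (assumed independence model)."""
    return sum(s.with_drug * s.with_event / s.reports for s in strata if s.reports > 0)


def relative_reporting_ratio(strata):
    """RR = N / E, with E accumulated stratum by stratum."""
    n = sum(s.with_both for s in strata)
    return n / expected_count(strata)


# Invented data: the drug and the event are both common in stratum A and
# both rare in stratum B, but they are independent within each stratum.
strata = [
    Stratum(reports=1000, with_drug=500, with_event=500, with_both=250),
    Stratum(reports=1000, with_drug=100, with_event=100, with_both=10),
]

# Pooling the strata into one ignores the stratification.
pooled = Stratum(
    reports=sum(s.reports for s in strata),
    with_drug=sum(s.with_drug for s in strata),
    with_event=sum(s.with_event for s in strata),
    with_both=sum(s.with_both for s in strata),
)

rr_stratified = relative_reporting_ratio(strata)
rr_pooled = relative_reporting_ratio([pooled])
print(f"stratified RR = {rr_stratified:.3f}, pooled RR = {rr_pooled:.3f}")
```

With these invented counts, the pooled RR is noticeably above 1 even though the pair is no more frequent than expected in either stratum, while the stratified RR stays at 1.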

To improve the estimation of the true value of each RR (especially for small counts), the empirical Bayesian approach of MGPS assumes that the many observed values of RR are related in that they can be treated as having arisen from a common superpopulation of unknown, true RR values. The method assumes that the set of unknown RR values is distributed according to a mixture of two parameterized Gamma Poisson density functions, and the parameters are estimated from a maximum likelihood fit to the data. This process provides a prior distribution for all the RRs, and Bayes' rule can then be used to compute a posterior distribution for each RR. Since this method improves over the simple use of each N/E as the estimate of the corresponding RR, it can be said that the values of N/E borrow strength from each other to improve the reliability of every estimate.

The improved estimates of RR are actually derived from the expectation value of the logarithm of RR under the posterior probability distributions for each true RR.

If you select subset variables for the run, MGPS computations are performed for each subset. If you select stratification variables for the run, the MGPS computations are modified to use stratification.

Item variables

When you create an MGPS run, you select item variables. Typically, the item variables represent drugs and events. The counting and estimation that the MGPS computation performs divides the combinations of the selected item types into separate itemsets, based on the combinations possible for the number of dimensions in the run. For example, for the two item variables D and E in a three-dimensional run, MGPS performs computations on the itemsets DD, DDD, DE, DDE, DEE, EE, and EEE.
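The itemset patterns in this example can be reproduced with a short sketch. It assumes that the patterns are all unordered selections, with repetition, of two up to the run's number of dimensions from the selected item types; the function name itemset_patterns is hypothetical.

```python
from itertools import combinations_with_replacement


def itemset_patterns(item_types, dimensions):
    """List itemset patterns of size 2 up to the number of dimensions."""
    patterns = []
    for size in range(2, dimensions + 1):
        # Order within an itemset does not matter, so DE and ED are one pattern.
        for combo in combinations_with_replacement(item_types, size):
            patterns.append("".join(combo))
    return patterns


patterns = itemset_patterns(["D", "E"], 3)
print(sorted(patterns))
```

For the item types D and E in a three-dimensional run, the sketch yields the same seven patterns listed above.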

Drug-event combination scores

For a combination of drug and event (a two-dimensional or 2D combination), the signal score is the EBGM (Empirical Bayes Geometric Mean) value.

EBGM is defined as the exponential of the expectation value of log(true RR). EBGM has the property that it is nearly identical to N/E when the counts are moderately large, but is shrunk towards the average value of N/E (typically ~1.0) when N/E is unreliable because small counts make it unstable. The posterior probability distribution also supports the computation of lower and upper 95% confidence limits (EB05 and EB95, the 5th and 95th percentiles of the posterior distribution) for the relative reporting ratio that EBGM estimates.
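The shrinkage behavior can be illustrated with a minimal sketch. It assumes the formulation published for the Gamma-Poisson Shrinker, in which the prior on the true RR is a mixture of two gamma distributions and N is Poisson with mean RR times E. Under that assumption the posterior is again a two-component gamma mixture, and the expectation of log(RR) for a gamma component is digamma(shape) minus log(rate). The MixturePrior class, the function names, and the hyperparameter values are hypothetical; in a real run the hyperparameters come from the maximum likelihood fit.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MixturePrior:
    a1: float  # shape of component 1
    b1: float  # rate of component 1
    a2: float  # shape of component 2
    b2: float  # rate of component 2
    p: float   # mixing weight of component 1


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def log_marginal(n, e, a, b):
    """Log negative binomial likelihood of N given one gamma(a, b) component."""
    return (math.lgamma(a + n) - math.lgamma(a) - math.lgamma(n + 1)
            + a * math.log(b / (b + e)) + n * math.log(e / (b + e)))


def ebgm(n, e, prior):
    """exp(E[log RR]) under the posterior mixture; requires e > 0."""
    l1 = math.log(prior.p) + log_marginal(n, e, prior.a1, prior.b1)
    l2 = math.log(1.0 - prior.p) + log_marginal(n, e, prior.a2, prior.b2)
    m = max(l1, l2)  # subtract the maximum for numerical stability
    w1, w2 = math.exp(l1 - m), math.exp(l2 - m)
    q = w1 / (w1 + w2)  # posterior weight of component 1
    mean_log = (q * (digamma(prior.a1 + n) - math.log(prior.b1 + e))
                + (1.0 - q) * (digamma(prior.a2 + n) - math.log(prior.b2 + e)))
    return math.exp(mean_log)


# Hypothetical hyperparameters, chosen only for illustration (not fitted).
PRIOR = MixturePrior(a1=0.2, b1=0.1, a2=2.0, b2=4.0, p=1.0 / 3.0)

for n, e in [(1, 0.01), (1000, 100.0)]:
    print(f"N={n}, E={e}: N/E={n / e:.2f}, EBGM={ebgm(n, e, PRIOR):.2f}")
```

With these illustrative hyperparameters, the large-count case stays close to N/E, while the single report with a tiny expected count is pulled far below its raw N/E of 100.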

Interaction scores

Computations for combinations of more than two dimensions focus on measuring and scoring cross-item-type combinations; any observed dependence within an item type is fit completely without generating a score. The only exception is a homogeneous itemset, such as three events (E1,E2,E3), in which all items are of the same type. In that case, high EBGM scores do alert reviewers to a potential within-item-type association. Thus, for a mixed-type 3D combination such as (D1,D2,E1), the values of EBGM, EB05, and so on measure how many times more frequently the pair (D1,D2) occurs with E1 than would be expected if the pair (D1,D2) were independent of E1. They make no statement about whether the drugs D1 and D2 appear in the same reports more frequently than independence would predict.

The signal score for a 3D combination is the INTSS (Interaction Signal Score) value, which is essentially a way of measuring the strength of a higher-order association above and beyond what would be expected from any of the component pairs of items of different types. INTSS is computed as follows:

INTSS = EB05 / EB95MAX

where:

  • EB05 is the conservative estimate score for the 3D combination.
  • EB95MAX is the highest EB95 score found for the component pairs of items of different types.

An INTSS greater than 1 indicates a stronger association than is found for the component pairs of items.
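As a worked illustration of the formula, the following sketch computes INTSS for a mixed-type combination (D1,D2,E1). It assumes that the EB05 score for the 3D combination and the EB95 scores for the 2D pairs are already available; the function names, the data layout, and the score values are hypothetical. The sketch does not cover homogeneous itemsets, which have no component pairs of different types.

```python
from itertools import combinations


def cross_type_pairs(items):
    """Return the component pairs whose two items have different types."""
    return [
        frozenset({name_a, name_b})
        for (name_a, type_a), (name_b, type_b) in combinations(items, 2)
        if type_a != type_b
    ]


def intss(eb05_combo, items, pair_eb95):
    """INTSS = EB05 of the combination / highest EB95 among cross-type pairs."""
    pairs = cross_type_pairs(items)
    if not pairs:
        raise ValueError("no component pairs of different types")
    eb95max = max(pair_eb95[pair] for pair in pairs)
    return eb05_combo / eb95max


# Hypothetical scores for the combination (D1, D2, E1).
items = [("D1", "drug"), ("D2", "drug"), ("E1", "event")]
pair_eb95 = {
    frozenset({"D1", "E1"}): 2.0,
    frozenset({"D2", "E1"}): 1.5,
}

score = intss(3.0, items, pair_eb95)
print(f"INTSS = {score:.2f}")
```

Here the same-type pair (D1,D2) is excluded, EB95MAX is 2.0, and the resulting INTSS of 1.5 indicates a stronger association than either drug-event pair shows on its own.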

Note:

Typically, a potential interaction is not considered interesting unless the EB05 score for that DDE combination is large (for example, over 1.5 or 2.0).

To support backward compatibility with versions of the application prior to 5.0, signal scores based on the independence model are still computed for dimensions higher than 2. These values are stored in columns ending in _IND. See MGPS independence model.

Custom terms and synthetic values

When estimating shrinkage parameters for EBGM scores, MGPS does not consider combinations that involve custom terms or excluded synthetic values. The raw RR scores for these combinations are shrunk by the Bayesian formula, but they do not participate in the determination of the formula itself.

Note:

Users creating a data mining run do not need to take any additional steps to exclude custom terms or already-excluded synthetic values.