Signal detection methods

Note:

This information was excerpted from the WebVDME Newsletter, Volume 2, Issue 2 (Autumn 2005). WebVDME is the previous name of the Oracle Empirica Signal application.

What benefits can signal detection methods bring to the pharmacovigilance effort, and what are the costs? How can different methods be compared and assessed? What do data mining statistics do? The information in this article is intended to help non-statisticians understand the similarities and differences between the statistical algorithms used for safety data mining. It can also help reframe these questions in ways that make your evaluation and implementation of signal detection methods more effective.

What do data mining statistics do?

Data mining is the process of applying statistical algorithms to a database, resulting in the generation of statistical values or scores. As part of a pharmacovigilance effort, these scores can alert safety evaluators to potential safety issues, including actual safety signals.

At Oracle, we believe that data mining statistics yield information that safety evaluators can use to help prioritize investigations. Whether a pharmacovigilance group uses data mining or not does not change the need for experienced medical professionals to perform careful evaluations to determine causal relationships between drugs and adverse events.

What data mining statistics do is indicate the strength of the association between a drug of interest and an event in the reporting database: the higher the score, the stronger the statistical association.

Statistical association vs. causal relationship

While a strong association (high score) may indicate a causal relationship between the drug and the event, it could also result from some other circumstance: for example, an AE that is an indication for the drug (known as “reverse causality”), or a product-event combination that has received media attention and is, as a result, reported at a higher rate than it otherwise would be (often called “publicity effects”). A strong association can also reflect a causal relationship that is already known and labeled.

So, while a high score may serve as an alert to a previously undetected problem, medical knowledge is crucial to the interpretation of data mining results.

Prioritizing with rankings or thresholds

When used with a large database, such as the public FDA AERS data, data mining statistics help users sort through several million potential combinations of drugs and events. Even a hypothetical pharmacovigilance department with unlimited staff and budget, one that could investigate every possible product-event relationship for causality, could use data mining statistics to help determine which reviews to undertake first.

Using ranking

A straightforward way to prioritize investigations using statistical scores would be to start with the highest score generated and work down through the list to the lowest score. Outside of our hypothetical big-bucks department, practical considerations mean that the number of investigations must be limited: a prioritization approach that ranks scores might result in investigations into the top 10, 250, or 10,000 scores, whatever number is feasible for a group’s resources.

Choosing a threshold

Another way to limit the number of investigations is to choose one score as a threshold. Any score that exceeds this predetermined threshold alerts reviewers to a potential signal for investigation. Unlike the cutoff for ranking, the number of scores above the threshold will fluctuate over time. But, like ranking, departments can select a threshold score that makes the best use of their resources while helping prioritize investigations.
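To make the two approaches concrete, here is a minimal sketch in Python; the product-event combinations, scores, top-k value, and threshold are all hypothetical, chosen only to illustrate the mechanics.

    # Hypothetical signal scores for a handful of product-event combinations.
    scores = {
        ("drugA", "rash"): 8.2,
        ("drugB", "nausea"): 3.1,
        ("drugC", "QT prolongation"): 5.7,
        ("drugD", "headache"): 1.4,
    }

    # Ranking: review a fixed number of the highest-scoring combinations.
    k = 2
    top_k = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

    # Threshold: review every combination whose score exceeds a fixed cutoff;
    # the number of flagged combinations will fluctuate as the data change.
    threshold = 2.0
    flagged = [(combo, score) for combo, score in scores.items() if score > threshold]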

Other factors to consider

Regardless of whether ranking or a threshold is used for the scores, some reports are likely to be prioritized ahead of any others: a safety evaluator wouldn’t have to wait for a high data mining score to flag just one new report of Torsade de pointes, QT prolongation, Stevens-Johnson syndrome, or a similar sentinel event as a high priority for investigation. Serious events in general are likely to need investigation earlier than less serious events. To meet this need for different prioritization levels, a low threshold could be set for sentinel events, a slightly higher one for other serious events, and a third threshold for all other events.
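A tiered scheme like the one just described can be expressed as a simple lookup of thresholds by event category; the categories and cutoff values below are illustrative assumptions, not recommendations.

    # Hypothetical tiered thresholds: the more serious the event category,
    # the lower the score needed to trigger a review.
    tiers = {"sentinel": 0.0, "serious": 2.0, "other": 4.0}

    def needs_review(score, seriousness):
        # Any positive score flags a sentinel event; other events must clear
        # the threshold for their tier.
        return score > tiers[seriousness]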

Problems in nonscientific public safety databases, including the biases that result from underreporting, overreporting, and publicity effects, have been widely discussed. When working with signal scores generated from these databases, a higher threshold (or set of thresholds) might be used than when working with a smaller, more controlled and complete company database.

For departments that use ranking as the prioritization method, the reliability of the ordering of the scores produced by a method may be a consideration. When a threshold is used, any score above that threshold is examined regardless of its position in the list, but with ranking, the highest scores are assumed to reflect the strongest associations.

Other considerations, including company goals, the medical knowledge and safety evaluation experience of individual evaluators, and other tools available for the pharmacovigilance effort, also contribute to the prioritization effort.

What are the data mining statistics?

Some of the data mining statistics that are widely used for signal detection are:

  • EBGM – Empirical Bayes Geometric Mean
  • IC – Information Component
  • PRR – Proportional Reporting Ratio
  • ROR – Reporting Odds Ratio

Each of these statistics is the result of a different statistical algorithm. All of the algorithms were designed to uncover the same type of information: a disproportionately high occurrence of reports for an event and drug, when compared to reports for that event in the entire database. As a result, for large sample sizes, the score that each of these statistics produces for any given product-event combination is likely to be similar.
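As background for the discussion that follows, PRR and ROR can be computed from a simple 2x2 table of report counts; the counts in this sketch are made up, and a, b, c, and d follow the conventional layout.

    # Hypothetical 2x2 contingency table of report counts:
    #   a: reports with the drug and the event
    #   b: reports with the drug, without the event
    #   c: reports without the drug, with the event
    #   d: reports with neither
    a, b, c, d = 25, 975, 500, 98500

    prr = (a / (a + b)) / (c / (c + d))   # proportional reporting ratio
    ror = (a * d) / (b * c)               # reporting odds ratio
    print(f"PRR = {prr:.2f}, ROR = {ror:.2f}")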

Note:

The sample size (N) generally is defined as large if there are 20 or more instances of a product-event combination in the database.

Drugs that are new to the market, or that are prescribed to a small number of patients, may have a small presence in the database. For these drugs, the N may be quite low, which tends to make scores calculated for their product-event combinations low also.

Some data mining algorithms include statistical techniques that compensate for disparate sample sizes in the safety database and the extreme scores that can result.

How does the method cope with statistical uncertainty?

The PRR and ROR statistics do not in themselves indicate a level of certainty or uncertainty due to a small sample size. Rather, each product-event combination is often accompanied by an additional statistic, a chi-squared test statistic, and its corresponding p (significance) value. When you assess a PRR or ROR value, a large value should be treated with suspicion unless the chi-squared value is very large (or, equivalently, the p value is very small). However, there is no clear rule for evaluating a large PRR value that has a moderate p value. Similarly, there is no rule for adjusting for the “multiple comparisons” problem, which arises because you are focusing on the few largest values of PRR or ROR out of the millions that may have been computed in a data mining run.
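For reference, the chi-squared statistic and p value for a 2x2 table can be computed as follows; this sketch reuses the hypothetical counts from the earlier PRR/ROR example and relies on scipy.

    from scipy.stats import chi2_contingency

    # The same hypothetical 2x2 table as in the PRR/ROR sketch above.
    table = [[25, 975], [500, 98500]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"chi-squared = {chi2:.1f}, p = {p_value:.2g}")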

The algorithms that calculate IC and EBGM address both statistical variability due to small N and multiple comparisons by using a technique called “Bayesian shrinkage.” This statistical technique results in a single relative reporting ratio (with confidence limits) that is easier to interpret directly, without the added complexity of a separate chi-squared or p value. Reporting ratios that are based on small counts (and that are therefore less reliable) are “shrunk” or moved toward the value 1 using an appropriate Bayesian probability formula. The set of EBGM scores across all product-event combinations is less dispersed than the set of unshrunk scores, so focusing on the most extreme EBGM scores is less likely to lead to overestimates of the true reporting ratios than searching for the largest values of PRR or ROR. The IC method has a similar advantage.
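To give a feel for how shrinkage behaves, here is a deliberately simplified sketch that uses a single gamma prior centered on a reporting ratio of 1; the actual MGPS algorithm fits a two-component gamma mixture to the whole database, so both the prior and its parameters here are illustrative assumptions.

    import math
    from scipy.special import digamma

    # Simplified stand-in for Bayesian shrinkage: a single Gamma(alpha, beta)
    # prior on the reporting ratio (MGPS actually uses a two-component gamma
    # mixture fit to the data; these parameters are made up for illustration).
    alpha, beta = 1.0, 1.0

    def ebgm_like(n, e):
        # The posterior for the reporting ratio is Gamma(alpha + n, beta + e);
        # the geometric mean of that posterior is exp(E[log ratio]).
        return math.exp(digamma(alpha + n) - math.log(beta + e))

    print(ebgm_like(n=1, e=0.1))    # raw ratio 10, shrunk heavily toward 1 (small count)
    print(ebgm_like(n=100, e=10))   # raw ratio 10, shrunk only slightly (large count)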

The Bayesian shrinkage technique primarily affects estimates of product-event reporting ratios based on small counts or small expected counts (that is, when reports mentioning either the drug or the event are quite rare in the database). Other statistical issues, such as confounding, can cause biased relative reporting ratios even for more frequent product-event combinations.

Can the method adjust for confounding?

When data mining scores are calculated, adjustments can be made for “confounding” variables. Confounding variables occur in any database that is not completely random. In a drug safety database, for example, a drug that is administered only to the elderly, or only to male subjects, or that is new on the market, has the subject’s age group or gender or the report’s receipt year (or all three) as a confounding variable.

The presence of confounding variables in the data can result in high scores for product-event combinations that otherwise would not have a strong statistical association. For example, a drug that is generally prescribed for women over 50, and an event that occurs more frequently for women over 50, may well result in a high score for the combination of that drug and event, based only on the coincidence of the gender and age group involved.

Statistical techniques are available to adjust for confounding variables: for example, a technique called Mantel-Haenszel stratification (used in MGPS, the Multi-Item Gamma Poisson Shrinker) divides the database into groups, one for each value that the variable in question has in the database, and then recalculates an overall score. When a technique like stratification is used, the result is often a lower score (and investigational priority) for a product-event combination that is affected by a confounding variable, while the effect on scores for combinations that are not subject to confounding is minimal.
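The core of the stratified calculation is to compute an expected count within each stratum and sum them before forming the ratio. The strata and counts below are made up, and this sketch covers only the expected-count step, not the full MGPS computation.

    # Hypothetical strata (for example, age-group by gender cells). Each entry
    # holds the reports mentioning the drug, the reports mentioning the event,
    # and the total reports within that stratum.
    strata = [
        (200, 150, 10000),
        (50, 400, 20000),
    ]

    n_observed = 12  # made-up count of reports with both the drug and the event

    # Expected count under independence, accumulated stratum by stratum.
    e_stratified = sum(n_drug * n_event / n_total for n_drug, n_event, n_total in strata)
    rr_adjusted = n_observed / e_stratified
    print(f"expected = {e_stratified:.2f}, adjusted reporting ratio = {rr_adjusted:.2f}")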

Why do statistics for significance matter?

Each of the data mining statistics listed above (IC, EBGM, PRR, and ROR) indicates the strength of the association between a drug and an event. Each one can also have associated scores that indicate the statistical significance (or uncertainty) of the relationship. As mentioned above, for the PRR and ROR statistics, the chi-squared test of statistical significance can be used to calculate a p value. Unfortunately, the chi-squared and p values are in different units from the basic reporting ratio values, making it hard to interpret them all together. The algorithms for EBGM and IC represent statistical uncertainty with confidence limits, which are less likely to be misinterpreted.

Calculations for significance incorporate information about the sample size. The p value indicates how likely such a large reporting ratio would be to occur by chance alone if the drug and the event were completely unrelated, but it does not indicate which range of reporting ratios is most likely. A confidence interval shows what range of reporting ratios is consistent with the data. A small N results in a wide confidence interval: the larger the N, the narrower the confidence interval.
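For PRR, a 95% confidence interval is conventionally computed on the log scale using a large-sample standard error; this sketch applies that standard formula to the hypothetical 2x2 counts used earlier.

    import math

    # The same hypothetical 2x2 counts as in the earlier PRR/ROR sketch.
    a, b, c, d = 25, 975, 500, 98500

    prr = (a / (a + b)) / (c / (c + d))
    se_log = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))  # SE of log(PRR)
    lo = math.exp(math.log(prr) - 1.96 * se_log)
    hi = math.exp(math.log(prr) + 1.96 * se_log)
    print(f"PRR = {prr:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")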

Using significance scores

A pharmacovigilance group might choose to prioritize investigations based on scores for statistical significance, rather than for association, to avoid following up potential associations that could have arisen merely by chance. However, using a PRR or ROR p value to rank associations may result in excessive focus on drugs and events that are common overall in the database. This may occur because frequently reported drugs and events may have reporting ratios that are only slightly greater than 1, but have very tiny p values. Investigating these combinations can crowd out investigation of larger reporting ratios for less frequently reported drugs or events.

Both the statistical significance and the reporting ratio can contribute to a prioritization system. This is easier to accomplish with confidence limits than with p values. Ranking product-event combinations by their lower confidence limits (for MGPS, by using EB05 rather than EBGM) provides an extra layer of protection against false alarms due to chance fluctuation (beyond that afforded by the Bayesian shrinkage discussed above).
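Continuing the simplified shrinkage sketch from above, the lower confidence limit can be read off as the 5th percentile of the posterior distribution; in MGPS this quantity is the EB05 score, though MGPS derives it from its mixture posterior rather than the single gamma prior assumed here.

    from scipy.stats import gamma

    # Simplified EB05: the 5th percentile of the Gamma(alpha + n, beta + e)
    # posterior from the earlier sketch (prior parameters remain made up).
    alpha, beta = 1.0, 1.0

    def eb05_like(n, e):
        return gamma.ppf(0.05, alpha + n, scale=1.0 / (beta + e))

    # Ranking by the lower limit penalizes small-N combinations more strongly
    # than ranking by the point estimate does.
    print(eb05_like(n=1, e=0.1), eb05_like(n=100, e=10))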

Comparing significance and association scores

Comparing the scores for both types of statistics can also be revealing. The illustration below shows two product-event combinations that have identical EB05 scores. The higher EBGM score for the first combination, however, indicates a stronger estimated association than the EBGM score for the second combination. The graph also gives an indication of the relative number of reports in the database for each combination: Combination 1 has fewer reports than Combination 2, since its confidence interval is relatively wide.

Figure: Comparing the statistical scores of two drug-event combinations.

Some pharmacovigilance groups may find it effective to prioritize investigations based on both significance and association scores, rather than relying on only one score. For example, a threshold of EBGM>2 AND EB05>1 could be used. This is roughly equivalent to the common recommendation for PRR, which requires PRR > 2 AND N > 2 AND chi-squared > 4.
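Expressed as code, the two combined rules from the preceding paragraph are simple conjunctions:

    # Combined decision rules, with the thresholds quoted in the text above.
    def flag_ebgm(ebgm, eb05):
        return ebgm > 2 and eb05 > 1

    def flag_prr(prr, n, chi2):
        return prr > 2 and n > 2 and chi2 > 4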

Benefits: automated, objective analysis

Signal detection methods provide an independent, automated process for assessing all of the product-event combinations in a safety database, which may contain millions of records. Using these statistics can provide earlier warning than methods that rely on human knowledge, skill, and intuition, and can reduce the amount of effort needed to discover signals in safety data.

The value of using a signal detection method is highest when scores alert pharmacovigilance professionals to previously unknown, unexpected, or uncharacteristic events that happen more frequently in combination with one drug than with most other drugs. These objectively produced alerts can then be evaluated using other pharmacovigilance tools to determine the possibility of a causal relationship.

Costs: false alarms

When a pharmacovigilance group uses data mining scores to prioritize investigations, resources may be spent investigating “false positives”: high scores that do not, in fact, reflect a causal relationship (or any real relationship at all).

Conversely, a “false negative” can occur when a low score is calculated for a product-event combination that is, in fact, causally related. Errors of both types can occur with any signal detection method.

If data mining scores are used alone to determine investigational priorities, the group is faced with a conundrum: tightening the criteria so that fewer high scores are investigated reduces the effort spent on false positives, but it may also increase the number of false negatives. The need for efficiency must be balanced against the cost of missing any true signals.

Fortunately, pharmacovigilance professionals have access to other tools to help make the best choices they can for the situations they need to address. Impact analysis, tracking changes in scores over time, and the “other factors to consider” described above can help a department make the best use of the information that the statistical scores are designed to give them.

Summary: comparing signal detection methods

One way to assess the comparative value of different signal scores is to evaluate the results of different algorithms when used to data mine the same database. When setting up such a test, these questions can act as guidelines for the comparison:

  • Is the same minimum sample size used by all of the algorithms?
  • Is stratification used for all of the algorithms? What confounding variables were selected?
  • Are the scores selected for comparison all scores for statistical association, or are they all scores for statistical significance?
  • How many signal scores does each algorithm generate overall?
  • Is the same threshold used for all scores?

It may be difficult to know how many true signals there are in the database used for the test. If the number is known, then the number of false positives and false negatives produced by each algorithm can also be compared.

Improving statistical algorithms

With WebVDME 5.0, Oracle introduced a new statistical method for evaluating database disproportionality ratios: Logistic Regression. This statistical technique is intended to support focused investigations into possible polypharmacy effects. Logistic Regression currently alerts WebVDME users to potential signals in cases where the scores calculated by other algorithms do not adjust for confounding by polypharmacy or cloaking. Oracle is also researching the potential for this statistical technique to serve as a database screening tool.
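As a generic illustration of the idea (not Oracle's implementation), a logistic regression over per-report drug indicators estimates each drug's association with the event while adjusting for the other drugs reported at the same time; the data below are simulated and the effect sizes arbitrary.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Simulated reports: one 0/1 indicator per hypothetical drug.
    X = rng.integers(0, 2, size=(1000, 3))
    # Only drug 0 actually drives the event; the rest is background noise.
    y = (X[:, 0] | (rng.random(1000) < 0.05)).astype(int)

    # Each coefficient is adjusted for the other drugs on the same reports,
    # which is how regression can address confounding by polypharmacy.
    model = LogisticRegression().fit(X, y)
    print(model.coef_)  # drug 0 should dominate drugs 1 and 2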

Data mining bibliography

The following articles can provide additional information and detail.

Faich G. Department of Health and Human Services, Food and Drug Administration. Risk Management Public Workshop: Pharmacovigilance Practices and Pharmacoepidemiologic Assessment. Transcript of April 11, 2003: 53-65 [online]. Available from URL: www.fda.gov/cder/meeting/RMtranscript3.doc [Accessed 2005 Sep 12]

Clark JA, Klincewicz SL, Stang PE. Spontaneous Adverse Event Signaling Methods: Classification and Use with Health Care Treatment Products. Epidemiologic Reviews 2001; 23 (2): 191-210

DuMouchel W. Bayesian data mining in large frequency tables, with an application to the FDA Spontaneous Reporting System (with discussion). The American Statistician 1999; 53 (3): 177-202

DuMouchel W, Pregibon D. Empirical Bayes screening for multi-item associations. In: Conference on Knowledge Discovery in Data. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. 2001 Aug 26-29; San Francisco (CA). New York: ACM Press, 2001: 67-76

Fram DM, Almenoff JS, DuMouchel W. Empirical Bayesian data mining for discovering patterns in post-marketing drug safety. In: Conference on Knowledge Discovery in Data. Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. 2003 Aug 24-27; Washington, D.C. New York: ACM Press, 2003: 359-368

Hauben M, Reich L. Safety related drug-labelling changes: findings from two data mining algorithms. Drug Saf 2004; 27(10):735-44

Lilienfeld DE. A challenge to the data miners. Pharmacoepidemiology and Drug Safety 2004 Dec; 13 (12): 881-884

van Puijenbroek EP, Bate A, Leufkens HGM, Lindquist M et al. A comparison of measures of disproportionality for signal detection in spontaneous reporting systems for adverse drug reactions. Pharmacoepidemiology and Drug Safety 2002; 11: 3-10