16.7.1 Introduction

A Cochrane review might include multiple analyses because of a choice of several outcome measures, outcomes measured at multiple time points, a desire to explore subgroup analyses, the inclusion of multiple intervention comparisons, or other reasons. The more analyses that are done, the more likely it is that some of them will be found to be ‘statistically significant’ by chance alone. Using the conventional significance level of 5%, it is expected that, on average, one in 20 tests will be statistically significant even when there is truly no difference between the interventions being compared. The probability of finding at least one statistically significant result therefore increases with the number of tests performed: with 14 independent tests, it is more likely than not (probability 1 − 0.95^14 ≈ 0.51, that is, greater than 0.5) that at least one test will be significant, even when there is no true effect. The likelihood of a spurious finding by chance is higher when the analyses are independent. For example, multiple analyses of different subgroups are usually more problematic in this regard than multiple analyses of various outcomes, since the latter involve the same participants and so are not independent.
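The arithmetic behind the figure of 14 tests can be checked with a short calculation. The following is a minimal sketch, assuming every test is independent and every null hypothesis is true; the function names are illustrative and do not come from the Handbook.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one significant result among n_tests
    independent tests when there is truly no effect in any of them."""
    return 1.0 - (1.0 - alpha) ** n_tests


def min_tests_exceeding(threshold: float = 0.5, alpha: float = 0.05) -> int:
    """Smallest number of independent tests for which the probability of
    at least one spurious significant result exceeds the threshold."""
    n_tests = 1
    while familywise_error_rate(n_tests, alpha) <= threshold:
        n_tests += 1
    return n_tests


if __name__ == "__main__":
    for n in (1, 5, 13, 14, 20):
        print(f"{n:2d} tests: {familywise_error_rate(n):.3f}")
    print("Tests needed to exceed 0.5:", min_tests_exceeding())
```

Under these assumptions the probability first exceeds 0.5 at 14 tests. For tests that are not independent, such as several outcomes measured on the same participants, this formula would generally overstate the probability.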

The problem of multiple significance tests occurs in clinical trials, epidemiology and public health research (Bauer 1991, Ottenbacher 1998) as well as in systematic reviews (Bender 2008). There is an extensive statistical literature about the multiplicity issue. Many statistical approaches have been developed to adjust for multiple testing in various situations (Bender 2001, Cook 2005, Dmitrienko 2006). However, there is no consensus about when multiplicity should be taken into account, or about which statistical approach should be used if an adjustment for multiple testing is made. For example, the use of adjustments appropriate for independent tests will lead to P values that are too large when the multiple tests are not independent. Adjustments for multiple testing are used in confirmatory clinical trials to protect against spuriously significant conclusions when multiple hypothesis tests are used (Koch 1996) and have been incorporated in corresponding statistical guidelines (CPMP Working Party on Efficacy of Medicinal Products 1995). In exploratory studies, in which there is no pre-specified key hypothesis, adjustments for multiple testing might not be required and are often not feasible (Bender 2001). Statistically significant results from exploratory studies should be thought of as ‘hypothesis generating’, regardless of whether adjustments for multiple testing have been performed.
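To illustrate why an adjustment designed for independent tests can give P values that are too large when tests are dependent, the sketch below uses the Šidák adjustment, one standard adjustment that assumes independence. It is chosen here only as an example and is not a method recommended by the text above; the function name and the numbers are hypothetical.

```python
def sidak_adjusted_p(p_value: float, n_tests: int) -> float:
    """Sidak-adjusted P value: the probability that the smallest of
    n_tests independent P values is at most p_value when all null
    hypotheses are true."""
    return 1.0 - (1.0 - p_value) ** n_tests


# Hypothetical example: an unadjusted P value of 0.02 among five tests.
p_raw = 0.02
n_tests = 5

# Adjustment that treats the five tests as independent.
p_assuming_independence = sidak_adjusted_p(p_raw, n_tests)

# Extreme dependence: if the five tests were effectively the same test
# repeated, only one test has really been done, so no inflation is needed.
p_if_identical = sidak_adjusted_p(p_raw, 1)

if __name__ == "__main__":
    print(f"Adjusted, assuming independence: {p_assuming_independence:.4f}")
    print(f"Appropriate if tests are identical: {p_if_identical:.4f}")
```

Real analyses fall somewhere between these two extremes, which is one reason the choice of adjustment depends on how strongly the tests are related.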