16.7.2 Multiplicity in systematic reviews

This is an archived version of the Handbook. For the current version, please go to training.cochrane.org/handbook/current.


Adjustments for multiple tests are not routinely used in systematic reviews, and we do not recommend their use in general. Nevertheless, issues of multiplicity apply just as much to systematic reviews as to other types of research. Review authors should remember that in a Cochrane review the emphasis should generally be on estimating intervention effects rather than testing for them. However, the general problem of multiple comparisons affects interval estimation just as much as hypothesis testing (Chen 2005, Bender 2008).
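Multiplicity affects interval estimation as well as testing, and a short numerical sketch can show why. The Python below is illustrative only. The function names are hypothetical, and the joint-coverage calculation assumes the intervals are independent. It shows how the chance that every one of k 95% confidence intervals covers its true effect falls as k grows. It also computes the wider critical value that a Bonferroni adjustment would apply to each interval. Bonferroni is one simple option and is not a method this section prescribes.

```python
from statistics import NormalDist


def familywise_coverage(k: int, level: float = 0.95) -> float:
    """Chance that all k intervals cover their true values.

    Assumes the k intervals are statistically independent.
    """
    return level ** k


def bonferroni_z(k: int, alpha: float = 0.05) -> float:
    """Two-sided normal critical value for k Bonferroni-adjusted intervals.

    Each interval gets confidence level 1 - alpha/k, so by Boole's inequality
    the chance that all k cover their true values is at least 1 - alpha,
    whether or not the intervals are independent.
    """
    return NormalDist().inv_cdf(1 - alpha / (2 * k))


def interval(estimate: float, se: float, z: float) -> tuple[float, float]:
    """Symmetric normal-approximation interval, e.g. for a log risk ratio."""
    return estimate - z * se, estimate + z * se


if __name__ == "__main__":
    for k in (1, 5, 10):
        print(f"k={k}: unadjusted joint coverage {familywise_coverage(k):.3f}, "
              f"Bonferroni z {bonferroni_z(k):.2f}")
    # Hypothetical log risk ratio of -0.20 with standard error 0.10,
    # reported as one of five outcomes.
    print(interval(-0.20, 0.10, bonferroni_z(5)))
```

With five outcomes, five unadjusted 95% intervals all cover their true values only about 77% of the time (0.95^5 ≈ 0.774). Bonferroni-adjusted intervals use z ≈ 2.58 instead of 1.96 and are correspondingly wider.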

Some additional problems associated with multiplicity occur in systematic reviews. For instance, when the results of a study are presented, it is not always possible to know how many tests or analyses were done. It is likely that some studies selected interesting findings for presentation or publication on the basis of their statistical significance and omitted other ‘uninteresting’ findings. This leads to misleading results and spurious conclusions. Such selective reporting is discussed in more detail in Chapter 8 (Section 8.14).

Adequate planning of the statistical testing of hypotheses (including any adjustments for multiple testing) should ideally be done at the design stage. Unfortunately, this can be difficult for systematic reviews, because it may not be known at the outset which outcomes and which effect measures will be available from the included studies. This makes the a priori planning of multiple test procedures for systematic reviews more difficult or even impossible. Moreover, only some of the multiple comparison procedures developed for single studies can be used in meta-analyses of summary data. More research is required to develop adequate multiple comparison procedures for use in systematic reviews (Bender 2008).

In summary, there is no simple or completely satisfactory solution to the problem of multiple testing and multiple interval estimation in systematic reviews. However, the following general advice can be offered. More detailed advice can be found elsewhere (Bender 2008).

In the protocol for the review, state which analyses and outcomes are of particular interest (the fewer the better). Outcomes should be classified in advance as primary and secondary outcomes, and the main outcomes to appear in the ‘Summary of findings’ table should be pre-specified. If there is a clear key hypothesis that will be tested by means of multiple significance tests, an adequate adjustment for multiple testing will give stronger confidence in any conclusions that are drawn.
Although Cochrane reviews should seek to include all outcomes that are likely to be important to users of the review, overall conclusions are more difficult to draw when there are multiple analyses. When drawing conclusions, bear in mind that, on average, about one in 20 independent statistical tests will be statistically significant at the 5% level by chance alone when there is no real difference between the groups.
Do not select results for emphasis (e.g. in the abstract) on the basis of a statistically significant P value.
If there is a choice of time-points for an outcome, attempt to present a summary effect over all time-points, or choose the single most appropriate time-point (although suitable data may not be available from all trials). Avoid testing the effect separately at each time-point.
Keep subgroup analyses to a minimum and interpret them cautiously.
Interpret cautiously any findings that were not hypothesized in advance, even when they are ‘statistically significant’. Such findings should only be used to generate hypotheses, not to prove them.
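A brief sketch can show the arithmetic behind the ‘one in 20’ rule of thumb and what an adjustment for multiple testing does. In the Python below, the function names and the four P values are hypothetical. The chance calculation assumes independent tests with every null hypothesis true. Holm's step-down procedure is used only as one common example of an adjustment. It is not a method the Handbook specifies.

```python
def chance_findings(m: int, alpha: float = 0.05) -> tuple[float, float]:
    """Expected number of 'significant' results, and the probability of at
    least one, among m independent tests when no real differences exist."""
    return m * alpha, 1 - (1 - alpha) ** m


def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The rank-th smallest P value (rank starting at 0) is multiplied by
        # (m - rank) and capped at 1; the running maximum keeps the adjusted
        # values non-decreasing in the order of the raw P values.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


if __name__ == "__main__":
    expected, p_any = chance_findings(20)
    print(f"20 tests: expected false positives {expected:.1f}, "
          f"P(at least one) {p_any:.2f}")
    # Hypothetical unadjusted P values for four outcomes in one review.
    raw = [0.01, 0.04, 0.03, 0.20]
    adjusted = holm_adjust(raw)
    for p, p_adj in zip(raw, adjusted):
        print(f"raw {p:.2f} -> Holm {p_adj:.2f}")
```

With 20 independent tests and no real effects, one significant result is expected, and the chance of at least one is about 64%. In the hypothetical example, three of the four unadjusted P values fall below 0.05. After Holm adjustment, only the first (0.04) does.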