This is an archived version of the Handbook.

17.7  Comparability of different patient-reported outcome measures

Investigators may choose different instruments to measure PROs, either because they use different definitions of a particular PRO or because they choose different instruments to measure the same PRO. For example, an investigator may choose to use a generic instrument to measure functional status or a different disease-specific instrument to measure functional status. The definition of the outcome may or may not differ. Review authors must decide how to categorize PROs across studies, and when to pool results. These decisions will be based on the characteristics of the PRO, which will need to be extracted and reported in the review.


On many occasions, studies using PROs will make baseline and follow-up measurements and the outcome of interest will thus be the difference in change from baseline to follow-up between intervention and control groups. Ideally then, to pool data across two PROs that are conceptually related, one will have evidence of strong longitudinal correlations of change in the two measures in individual patient data, and evidence of similar responsiveness of the instruments. Further supportive evidence could come from correlations of differences between treatment and control, or difference between before and after measurements, across studies. If one cannot find any of these data, one could fall back on cross-sectional correlations in individual patients at a point in time.
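The distinction between longitudinal and cross-sectional correlations can be illustrated with a minimal sketch. The data below are hypothetical (five patients, two invented instruments A and B, both scored so that higher is better), and the helper names `pearson` and `change_scores` are introduced here for illustration only.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def change_scores(baseline, follow_up):
    """Per-patient change from baseline to follow-up."""
    return [f - b for b, f in zip(baseline, follow_up)]


# Hypothetical scores for the same five patients on two instruments.
a_base = [3.0, 4.0, 2.5, 5.0, 3.5]
a_follow = [4.0, 4.5, 3.5, 5.0, 5.0]
b_base = [42.0, 50.0, 38.0, 58.0, 47.0]
b_follow = [48.0, 52.0, 46.0, 57.0, 56.0]

# Longitudinal correlation: do the two instruments change together?
r_change = pearson(change_scores(a_base, a_follow),
                   change_scores(b_base, b_follow))

# Cross-sectional fallback: correlation at a single point in time.
r_cross = pearson(a_base, b_base)

print(f"correlation of change: {r_change:.2f}")
print(f"cross-sectional correlation at baseline: {r_cross:.2f}")
```

A high correlation of change supports pooling more directly than a high cross-sectional correlation, because the outcome of interest is itself a change score.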


For example, the two major instruments used to measure health-related quality of life in patients with chronic obstructive pulmonary disease are the Chronic Respiratory Questionnaire (CRQ) and the St. George’s Respiratory Questionnaire (SGRQ). Correlations between the two questionnaires in individual studies have varied from 0.3 to 0.6 in both cross-sectional (correlations at a point in time) and longitudinal (correlations of change) comparisons (Rutten-van Mölken 1999, Singh 2001, Schünemann 2003, Schünemann 2005).


In a subsequent investigation, investigators examined the correlations between mean changes in the CRQ and SGRQ in 15 studies including 23 patient groups and found a correlation of 0.88 (Puhan 2006). Despite this strong correlation, the CRQ proved more responsive than the SGRQ: standardized response means of the CRQ (median of the standardized response means 0.51, IQR 0.19 to 0.98) were significantly higher (P<0.001) than those associated with the SGRQ (median of the standardized response means 0.26, IQR –0.03 to 0.40). That is, when both instruments were used in the same study, the CRQ yielded systematically larger treatment effects. As a result, pooling results from trials using these two instruments could lead to underestimates of treatment effect in studies using the SGRQ.
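As a rough illustration of how responsiveness is compared, the standardized response mean is conventionally computed as the mean change divided by the standard deviation of change. The sketch below uses hypothetical change scores for two invented instruments; it does not reproduce the CRQ or SGRQ data, and the function name is an assumption made for this example.

```python
from statistics import mean, stdev


def standardized_response_mean(changes):
    """Mean change divided by the sample standard deviation of change."""
    sd = stdev(changes)
    if sd == 0:
        raise ValueError("standard deviation of change is zero")
    return mean(changes) / sd


# Hypothetical change scores for the same five patients on two instruments.
changes_a = [1.0, 0.5, 1.0, 0.0, 1.5]
changes_b = [6.0, 2.0, 8.0, -1.0, 9.0]

srm_a = standardized_response_mean(changes_a)
srm_b = standardized_response_mean(changes_b)

print(f"SRM, instrument A: {srm_a:.2f}")
print(f"SRM, instrument B: {srm_b:.2f}")
```

In this invented example the changes on the two instruments are highly correlated, yet instrument A has the larger standardized response mean, which mirrors the pattern described above: strong correlation does not guarantee equal responsiveness.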


Most of the time, unfortunately, detailed data such as those described in the previous paragraph will be unavailable. Investigators must then fall back on intuitive decisions about the extent to which different instruments are measuring the same underlying construct. For example, the authors of a meta-analysis of psychosocial interventions in the treatment of pre-menstrual syndrome faced a profusion of outcome measures, with 25 PROs reported in their nine eligible studies. They dealt with this problem by having two investigators independently examine each instrument – including all domains – and group them into six discrete conceptual categories; discrepancies were resolved by discussion to achieve consensus. The pooled analysis of each category included between two and six studies. 


Meta-analyses of studies using different measurement scales will usually be undertaken using standardized mean differences (SMDs; see Chapter 9, Section 9.2.3). However, SMDs are highly problematic when the focus is on comparing change from baseline in intervention and control groups, because standard deviations of change do not measure between-patient variation (they depend also on the correlation between baseline and final measurements; see Chapter 9, Section
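The dependence of the standard deviation of change on the baseline-final correlation can be shown numerically. The sketch below uses the standard identity for the variance of a difference and a deliberately simplified standardized difference (a single assumed standard deviation, with no pooling across groups and no small-sample correction); all numbers and function names are hypothetical.

```python
from math import sqrt


def sd_of_change(sd_baseline, sd_final, r):
    """SD of (final - baseline) given the two SDs and their correlation r."""
    return sqrt(sd_baseline ** 2 + sd_final ** 2
                - 2 * r * sd_baseline * sd_final)


def simple_smd(mean_difference, sd):
    """Simplified standardized mean difference: difference divided by SD."""
    return mean_difference / sd


# Hypothetical trial: between-patient SD of 10 at both time points, and a
# 5-point difference between groups in mean change from baseline.
sd_baseline = sd_final = 10.0
difference_in_change = 5.0

smd_by_r = {}
for r in (0.1, 0.5, 0.9):
    sd_change = sd_of_change(sd_baseline, sd_final, r)
    smd_by_r[r] = simple_smd(difference_in_change, sd_change)
    print(f"r = {r:.1f}: SD of change = {sd_change:.2f}, "
          f"SMD = {smd_by_r[r]:.2f}")

# Standardizing by the between-patient SD of final values instead.
smd_final_sd = simple_smd(difference_in_change, sd_final)
print(f"SMD using SD of final values = {smd_final_sd:.2f}")
```

The same 5-point difference yields very different standardized values depending only on the baseline-final correlation, which is why standardizing by the standard deviation of change is hard to interpret across studies.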


Similar principles apply to studies in which review authors choose to focus on available data that are presented in dichotomous fashion, or from which review authors can extract dichotomous outcome data with relative ease. For example, investigators studying the impact of flavonoids on symptoms of haemorrhoids found that eligible randomized trials did not consistently use similar symptom measures; all but one of 14 trials, however, recorded the proportion of patients either free of symptoms, with symptom improvement, still symptomatic, or worse (Alonso-Coello 2006). In the primary analysis, investigators considered outcomes of patients free of symptoms and patients with some symptom improvement as equivalent, and pooled each outcome of interest based on the a priori expectation of a similar magnitude and direction of treatment effect.


This left a question of how to deal with studies that reported that patients experienced ‘some improvement’. The investigators undertook analyses comparing two ways of dichotomizing: counting ‘some improvement’ as a positive outcome, and counting it as a negative outcome (equivalent to no improvement). Dichotomizing outcomes is often very useful, particularly for making results easily interpretable for clinicians and patients. Imaginative and yet rigorous ways of dichotomizing will result in summary statistics that provide useful guides to clinical practice.
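A comparison of this kind can be sketched for a single trial as follows. The counts are hypothetical and are not taken from Alonso-Coello 2006; the category labels and function names are assumptions made for illustration, and a real review would pool such estimates across trials rather than inspect one study.

```python
def dichotomize(counts, positive_categories):
    """Return (events, total) treating the listed categories as positive."""
    events = sum(counts[c] for c in positive_categories)
    total = sum(counts.values())
    return events, total


def risk_ratio(events_t, n_t, events_c, n_c):
    """Risk of a positive outcome in treatment relative to control."""
    return (events_t / n_t) / (events_c / n_c)


# Hypothetical four-category results from one trial.
treatment = {"free": 20, "improved": 15, "still_symptomatic": 10, "worse": 5}
control = {"free": 10, "improved": 10, "still_symptomatic": 20, "worse": 10}

# Analysis 1: 'some improvement' counts as a positive outcome.
rr_improved_positive = risk_ratio(
    *dichotomize(treatment, ("free", "improved")),
    *dichotomize(control, ("free", "improved")),
)

# Analysis 2: 'some improvement' counts as a negative outcome.
rr_improved_negative = risk_ratio(
    *dichotomize(treatment, ("free",)),
    *dichotomize(control, ("free",)),
)

print(f"RR, improvement positive: {rr_improved_positive:.2f}")
print(f"RR, improvement negative: {rr_improved_negative:.2f}")
```

If the two estimates point in the same direction and are of similar size, the conclusion is robust to the choice of cut-point; a marked difference would need to be reported and explored.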


The use of multiple instruments for measuring a particular PRO, and experimentation with multiple methods for analysis, can lead to selective reporting of the most interesting findings and introduce serious bias into a systematic review. Review authors focusing on PROs should be alert to this problem. When only a small number of eligible studies have reported a particular outcome, particularly if it is a salient outcome that one would expect conscientious investigators to measure, authors should note the possibility of reporting bias (see Chapter 10).