How to Read a Clinical Trial: RCTs, NNT, Confidence Intervals & the 5 Questions Every GP Should Ask
Before you change your practice based on a headline, ask these five questions — your patients deserve no less

A trial is published. The headline says "Drug X reduces heart attacks by 25%." Your inbox fills with patient queries. A colleague changes their prescribing. Should you? This guide gives GPs the statistical literacy to read any clinical trial critically — understanding RCT design, absolute vs relative risk, NNT, confidence intervals, p-values, and the five questions that separate practice-changing evidence from noise.
Clinical Decision Support: This article is for educational purposes and supports — not replaces — clinical judgment. Always verify with current national guidelines, BNF, and specialist consultation when needed.
In November 2015, the SPRINT trial was published in the New England Journal of Medicine. The headline: intensive blood pressure control (target systolic <120 mmHg) reduced major cardiovascular events by 25% and all-cause mortality by 27% compared with standard control (<140 mmHg). Within weeks, cardiologists were calling for a revision of hypertension guidelines. In 2017, the AHA/ACC lowered the hypertension threshold to 130/80 mmHg — a definition under which 46% of American adults counted as hypertensive overnight. NICE, after careful appraisal, did not change its threshold. Same trial. Same data. Opposite conclusions. The difference was not ideology — it was critical appraisal. NICE asked five questions that the headline did not answer, and the answers changed everything. This guide teaches you to ask those questions for every trial you read.
Part 1: Understanding RCT Design — The Foundation
The randomised controlled trial (RCT) is the gold standard of clinical evidence because randomisation — when done properly — eliminates confounding. If you randomly assign 5,000 patients to Drug X and 5,000 to placebo, any difference in outcomes between the groups is attributable to the drug or to chance — not to differences in age, sex, comorbidities, or lifestyle — and the statistics exist to quantify the role of chance. This is the power of the RCT. But it is also its limitation: the trial only tells you what happened in those 10,000 patients, in that setting, over that time period. Whether it applies to your patient is a separate question entirely.
The Anatomy of an RCT: What Every Section Tells You
| Section | What It Contains | What to Look For | Red Flags |
|---|---|---|---|
| Abstract | Summary of design, population, intervention, outcomes, results | Primary outcome; effect size; p-value; follow-up duration | Relative risk only (no absolute risk); composite outcomes buried in footnotes |
| Introduction | Background and rationale | What gap the trial claims to fill; prior evidence | Overstated prior uncertainty; cherry-picked prior studies |
| Methods — Population | Inclusion and exclusion criteria | Who was enrolled; who was excluded; how representative | Highly selected population; exclusion of elderly, women, comorbidities |
| Methods — Intervention | What was done to each group | Dose, duration, co-interventions; what "usual care" actually was | Unusual dosing; unusually good usual care (makes drug look worse) |
| Methods — Outcomes | Primary and secondary outcomes; how measured | Pre-specified primary outcome; composite vs single outcomes | Primary outcome changed after trial started (outcome switching); composite outcomes that inflate effect |
| Methods — Randomisation | How patients were allocated | Allocation concealment; stratification | Inadequate concealment; post-randomisation exclusions |
| Methods — Blinding | Who was blinded to treatment allocation | Double-blind (patient + assessor); open-label | Open-label trials for subjective outcomes (bias risk) |
| Results — Baseline | Characteristics of each group at randomisation | Are groups balanced? Any important differences? | Imbalanced groups despite randomisation (chance or selective reporting) |
| Results — Primary Outcome | Main result | Absolute risk reduction; relative risk reduction; NNT; confidence interval; p-value | Only relative risk reported; wide confidence intervals; p just below 0.05 |
| Results — Secondary Outcomes | Additional outcomes | Consistency with primary outcome; pre-specified vs post-hoc | Multiple secondary outcomes without correction; post-hoc subgroup analyses presented as primary |
| Discussion | Interpretation of results | Authors' own limitations section; generalisability | Overstated conclusions; minimised harms; ignored negative secondary outcomes |
| Funding | Who paid for the trial | Industry vs independent funding | Industry-funded trials are 4× more likely to report positive results |
The CONSORT statement (Consolidated Standards of Reporting Trials) provides a 25-item checklist for reporting RCTs. Journals that require CONSORT compliance produce more transparent, reproducible trial reports. When reading a trial, check whether it reports a CONSORT flow diagram — this shows how many patients were screened, enrolled, randomised, completed the trial, and were analysed. Missing patients are not missing data — they are a signal.
Part 2: The Statistics You Actually Need
You do not need a statistics degree to read a clinical trial critically. You need to understand six concepts: absolute risk, relative risk, absolute risk reduction, relative risk reduction, number needed to treat, and confidence intervals. Everything else is detail. Master these six, and you can appraise any trial result.
Absolute Risk vs Relative Risk: The Most Important Distinction in Medicine
This is the single most important statistical concept for clinical practice, and the one most commonly misrepresented in medical headlines. Consider a trial where 2% of patients in the control group have a heart attack over 5 years, and 1% of patients in the treatment group have a heart attack. The relative risk reduction is 50% — treatment halved the risk. The absolute risk reduction is 1% — treatment prevented 1 heart attack per 100 patients over 5 years. Both statements are mathematically correct. The relative risk reduction sounds dramatic. The absolute risk reduction tells you what actually happened.
| Measure | Formula | Example (Control 2%, Treatment 1%) | Clinical Meaning | When Misleading |
|---|---|---|---|---|
| Control Event Rate (CER) | Events in control group / total in control group | 2/100 = 2% | Baseline risk without treatment | Never — always report this |
| Experimental Event Rate (EER) | Events in treatment group / total in treatment group | 1/100 = 1% | Risk with treatment | Never — always report this |
| Absolute Risk Reduction (ARR) | CER − EER | 2% − 1% = 1% | The actual reduction in risk per patient treated | Low ARR with high RRR can mislead — always calculate |
| Relative Risk Reduction (RRR) | (CER − EER) / CER | (2% − 1%) / 2% = 50% | Proportional reduction in risk | Misleading when baseline risk is low — 50% of a tiny risk is still tiny |
| Number Needed to Treat (NNT) | 1 / ARR | 1 / 0.01 = 100 | How many patients need treatment for 1 to benefit | Must be interpreted alongside treatment duration and harms |
| Relative Risk (RR) | EER / CER | 1% / 2% = 0.5 | Risk in treatment group relative to control | Does not convey absolute magnitude of benefit |
| Odds Ratio (OR) | (EER/(1−EER)) / (CER/(1−CER)) | (0.01/0.99) / (0.02/0.98) ≈ 0.495 | Ratio of odds of event in each group | Diverges from RR (exaggerates the apparent effect) when event rates exceed ~10% |
The pharmaceutical industry and medical journals preferentially report relative risk reductions because they sound more impressive. A drug that reduces heart attacks from 2% to 1% will be marketed as "reduces heart attack risk by 50%" — not "prevents 1 heart attack per 100 patients treated over 5 years." Both are true. Only one is useful for clinical decision-making. Always convert relative risk to absolute risk before making a prescribing decision.
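The conversion urged above is simple arithmetic, and it is worth internalising. A minimal Python sketch of the formulas in the table (the function name is illustrative, not from any standard library):

```python
def risk_measures(cer: float, eer: float) -> dict:
    """Convert control (CER) and experimental (EER) event rates
    into the measures a prescriber actually needs."""
    arr = cer - eer        # absolute risk reduction
    rrr = arr / cer        # relative risk reduction
    nnt = 1 / arr          # number needed to treat
    return {"ARR": arr, "RRR": rrr, "NNT": nnt}

# The headline example: control 2%, treatment 1% over 5 years.
m = risk_measures(cer=0.02, eer=0.01)
print(f"RRR {m['RRR']:.0%}  ARR {m['ARR']:.1%}  NNT {m['NNT']:.0f}")
# The "50% risk reduction" headline resolves to 1 event prevented
# per 100 patients treated over 5 years.
```

The same three lines of arithmetic work for any trial that reports event rates in both arms, which is why the CER and EER should always be extracted first.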
Number Needed to Treat (NNT): The Clinician's Statistic
The NNT is the most clinically intuitive statistic in medicine. It answers the question: "How many patients do I need to treat for one to benefit?" An NNT of 10 means 1 in 10 patients benefits — the other 9 receive the drug without benefit (though they may still experience side effects). An NNT of 1,000 means you need to treat 1,000 patients for one to benefit. NNT must always be interpreted in context: an NNT of 100 for a cheap, safe drug with no side effects may be entirely acceptable. An NNT of 100 for an expensive drug with serious adverse effects is not.
| Intervention | NNT | Time Period | Outcome Prevented | Clinical Interpretation |
|---|---|---|---|---|
| Aspirin post-MI (secondary prevention) | ~50 | 2 years | Non-fatal MI or death | Highly worthwhile — cheap, safe, clear benefit |
| Statin for primary prevention (QRISK3 ≥10%) | ~100–200 | 5 years | Major cardiovascular event | Acceptable — cheap, generally safe; long-term benefit accumulates |
| Antihypertensive (Stage 1, low risk) | ~300–500 | 5 years | Major cardiovascular event | Marginal — benefit depends on individual risk; lifestyle first |
| Semaglutide 2.4 mg (SELECT trial, CVD + obesity) | ~67 | 3.3 years | MACE (MI, stroke, CV death) | Meaningful — high-risk population; dual weight + CV benefit |
| Antibiotics for acute otitis media (>2 years) | ~15 | 7 days | Pain relief at 24 hours | Modest — NNT 15 for symptom relief; weigh against resistance |
| Tamoxifen for breast cancer prevention (high risk) | ~22 | 5 years | Invasive breast cancer | Significant — but NNH for serious adverse events also relevant |
| Statins for primary prevention (low risk, QRISK3 <5%) | ~500–1000 | 5 years | Major cardiovascular event | Very marginal — lifestyle intervention likely more appropriate |
The NNT is only meaningful when paired with the Number Needed to Harm (NNH) — the number of patients who need to receive the treatment for one to experience a significant adverse effect. A drug with NNT 50 and NNH 500 has a favourable benefit-harm ratio (10:1). A drug with NNT 50 and NNH 50 has a neutral ratio — you benefit one patient for every one you harm. Always look for the NNH in the safety data.
Confidence Intervals: What They Actually Mean
A confidence interval (CI) is a range of values within which the true population effect is likely to lie, with a specified level of certainty (usually 95%). A 95% CI of 0.75 to 0.85 for a relative risk means: if we repeated this trial many times and calculated a 95% CI each time, 95% of those intervals would contain the true relative risk. It does not mean there is a 95% probability that the true relative risk lies within this particular interval — a common misconception. The width of the CI tells you about precision: a narrow CI (e.g., 0.78–0.82) indicates a precise estimate; a wide CI (e.g., 0.50–0.95) indicates uncertainty. The location of the CI tells you about statistical significance: if the CI for a relative risk crosses 1.0 (the null value), the result is not statistically significant.
| CI Result | Statistical Significance | Clinical Interpretation | Example |
|---|---|---|---|
| RR 0.80 (95% CI 0.72–0.88) | Significant (CI does not cross 1.0) | Treatment reduces risk by 20%; true effect likely between 12–28% reduction | EMPA-REG OUTCOME: empagliflozin CV death RR 0.62 (0.49–0.77) |
| RR 0.90 (95% CI 0.78–1.04) | Not significant (CI crosses 1.0) | No statistically significant effect; could be 22% reduction or 4% increase | Inconclusive trial — do not change practice |
| RR 0.80 (95% CI 0.40–1.60) | Not significant (wide CI, crosses 1.0) | Underpowered trial — too few patients to detect a real effect | Small pilot trial — hypothesis-generating only |
| RR 0.99 (95% CI 0.98–1.00) | Borderline significant | Statistically significant but clinically trivial — 1% relative risk reduction | Large trial with tiny effect — statistical significance ≠ clinical significance |
| RR 0.50 (95% CI 0.30–0.85) | Significant but imprecise | Large effect but wide CI — small trial; replicate before changing practice | Small RCT — needs confirmation in larger trial |
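Confidence intervals like those in the table can be reproduced from raw event counts using the standard normal approximation on the log relative risk. A pure-Python sketch with invented trial numbers (1.96 is the multiplier for a 95% interval):

```python
import math

def rr_with_ci(events_t, n_t, events_c, n_c, z=1.96):
    """Relative risk with a 95% CI via the log-RR normal approximation."""
    rr = (events_t / n_t) / (events_c / n_c)
    # Standard error of ln(RR): sqrt(1/a - 1/n1 + 1/c - 1/n2)
    se = math.sqrt(1/events_t - 1/n_t + 1/events_c - 1/n_c)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# Hypothetical trial: 80/1000 events on treatment vs 100/1000 on control.
rr, lo, hi = rr_with_ci(80, 1000, 100, 1000)
print(f"RR {rr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
significant = hi < 1.0 or lo > 1.0  # a CI crossing 1.0 is not significant
```

The output illustrates the table's second row: a 20% relative reduction in a 2,000-patient trial yields a CI of roughly 0.60–1.06, which crosses 1.0 — the trial is inconclusive despite the impressive point estimate.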
The P-Value: The Most Misunderstood Number in Medicine
The p-value is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis (no effect) is true. A p-value of 0.03 means: if the drug had no effect, there is a 3% chance of seeing a result this extreme by chance alone. It does not mean there is a 97% probability that the drug works. It does not tell you the size of the effect. It does not tell you whether the effect is clinically meaningful. Used as a binary gate (significant/not significant), the p-value tells you almost nothing about the clinical importance of a finding.
- p < 0.05 is an arbitrary threshold — Ronald Fisher, who introduced it in 1925, explicitly stated it should not be used as a fixed decision rule
- A p-value of 0.049 and 0.051 are statistically indistinguishable — treating them as categorically different (significant vs not significant) is scientifically indefensible
- Large trials can produce statistically significant results for clinically trivial effects — a trial of 100,000 patients can detect a 0.1% difference with p < 0.001
- Small trials can produce non-significant results for clinically important effects — a trial of 100 patients may be underpowered to detect a 20% risk reduction
- The American Statistical Association (2019) has called for the abandonment of "statistical significance" as a binary concept — focus on effect size and confidence intervals instead
- p-hacking: running multiple analyses until p < 0.05 is achieved — inflates false positive rate; look for pre-registered primary outcomes
The SPRINT trial had a p-value of <0.001 for its primary outcome. This tells you the result was very unlikely to be due to chance. It does not tell you whether the result applies to your patients, whether the BP measurement method was comparable to standard clinic practice, or whether the harms of intensive treatment outweigh the benefits in lower-risk patients. Statistical significance is the beginning of critical appraisal, not the end.
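The "large trial, trivial effect" problem can be made concrete with a two-proportion z-test on an invented mega-trial (pure Python; the patient numbers are hypothetical, chosen only to illustrate the point):

```python
import math

def two_proportion_p(e1, n1, e2, n2):
    """Two-sided p-value for a difference in proportions (normal approximation)."""
    p1, p2 = e1 / n1, e2 / n2
    pooled = (e1 + e2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1/n1 + 1/n2))
    z = abs(p1 - p2) / se
    return math.erfc(z / math.sqrt(2))  # two-sided tail probability

# Hypothetical trial, 100,000 patients per arm: 2.0% vs 1.8% event rates.
p = two_proportion_p(2000, 100_000, 1800, 100_000)
arr = 2000/100_000 - 1800/100_000
nnt = 1 / arr
print(f"p = {p:.4f}, ARR = {arr:.1%}, NNT = {nnt:.0f}")
# A highly "significant" p-value, yet 500 patients must be treated
# for one to benefit: statistical significance is not clinical significance.
```

Here a 0.2 percentage-point difference clears p < 0.01 purely because of the sample size, while the NNT of 500 says the effect may be clinically negligible.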
Part 3: Study Design Hierarchy — Not All Evidence Is Equal
Evidence-based medicine operates on a hierarchy of study designs, from the most to least reliable for establishing causation. Understanding where a study sits in this hierarchy is the first step in appraising its clinical relevance.
| Level | Study Design | Strength | Limitation | Clinical Use |
|---|---|---|---|---|
| 1a | Systematic review + meta-analysis of RCTs | Highest — synthesises all available RCT evidence | Only as good as the included trials; heterogeneity can mislead | Guideline development; definitive treatment decisions |
| 1b | Individual RCT (well-designed, adequately powered) | High — randomisation eliminates confounding | Specific population; may not generalise; short follow-up | Practice-changing evidence if well-designed and replicated |
| 2a | Systematic review of cohort studies | Moderate — large populations; real-world data | Confounding by indication; selection bias | Long-term outcomes; rare adverse effects |
| 2b | Individual cohort study | Moderate — observational; no randomisation | Confounding; recall bias; loss to follow-up | Hypothesis generation; long-term safety data |
| 3a | Systematic review of case-control studies | Lower — retrospective; recall bias | Cannot establish causation; selection bias | Rare outcomes; hypothesis generation |
| 3b | Individual case-control study | Lower — retrospective; significant bias risk | Cannot establish causation | Rare diseases; adverse drug reactions |
| 4 | Case series / case reports | Very low — no control group | Cannot establish causation; selection bias | Signal generation; rare adverse effects; novel presentations |
| 5 | Expert opinion / editorials | Lowest — opinion, not evidence | Bias; conflicts of interest | Background context only; not for clinical decisions |
Observational studies (cohort, case-control) can generate hypotheses but cannot establish causation. The classic example: coffee drinkers have higher rates of lung cancer in observational studies — not because coffee causes lung cancer, but because coffee drinkers are more likely to smoke (confounding). The only reliable way to establish causation is randomisation. When a headline says "X is associated with Y," it means correlation — not causation.
Meta-Analysis: Power and Pitfalls
A meta-analysis pools data from multiple trials to produce a single, more precise estimate of effect. It is represented as a forest plot — a visual display of individual trial results and the pooled estimate. The pooled estimate is shown as a diamond at the bottom; the width of the diamond represents the confidence interval. If the diamond does not cross the vertical line of no effect (relative risk = 1.0), the pooled result is statistically significant. Meta-analyses are powerful but have important limitations: they are only as good as the trials they include (garbage in, garbage out), and heterogeneity between trials (different populations, doses, follow-up periods) can make pooling misleading.
- Publication bias: Positive trials are more likely to be published than negative trials — meta-analyses that include only published trials overestimate treatment effects. Check for funnel plot asymmetry.
- Heterogeneity (I²): Measures the proportion of variability between trials due to true differences rather than chance. I² >50% indicates substantial heterogeneity — pooling may be inappropriate.
- Fixed vs random effects: Fixed effects assume all trials estimate the same true effect; random effects allow for variation between trials. Random effects models are more conservative and generally more appropriate.
- GRADE system: Cochrane and NICE use the GRADE system to rate the quality of evidence from meta-analyses: High, Moderate, Low, or Very Low. A meta-analysis of poor-quality trials produces Low-quality evidence regardless of the pooled p-value.
- Network meta-analysis: Compares multiple treatments simultaneously, including indirect comparisons (A vs B and B vs C to infer A vs C). Powerful but complex — indirect comparisons are less reliable than direct head-to-head trials.
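The mechanics behind a forest plot's pooled diamond and the I² statistic can be sketched in a few lines. This is a generic inverse-variance fixed-effect pool over hypothetical trial results, not a substitute for RevMan or a statistics package:

```python
import math

def pool_fixed_effect(trials):
    """Inverse-variance fixed-effect pooling of relative risks.
    Input: list of (rr, ci_lo, ci_hi) tuples, with 95% CIs."""
    log_rrs, weights = [], []
    for rr, lo, hi in trials:
        se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # back out SE of ln(RR)
        log_rrs.append(math.log(rr))
        weights.append(1 / se**2)
    pooled_log = sum(w * lr for w, lr in zip(weights, log_rrs)) / sum(weights)
    # Cochran's Q and I-squared for heterogeneity
    q = sum(w * (lr - pooled_log) ** 2 for w, lr in zip(weights, log_rrs))
    df = len(trials) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return math.exp(pooled_log), i2

# Three hypothetical trials with similar point estimates:
pooled_rr, i2 = pool_fixed_effect([(0.80, 0.65, 0.98),
                                   (0.85, 0.70, 1.03),
                                   (0.78, 0.60, 1.01)])
print(f"Pooled RR {pooled_rr:.2f}, I-squared = {i2:.0%}")
```

None of the three trials is individually significant (each CI crosses 1.0), yet the pooled estimate of about 0.82 is precise enough to be informative, and I² of 0% indicates the trials are consistent — exactly the situation where meta-analysis adds value.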
Part 4: Common Trial Design Traps
Clinical trials are designed by humans with interests, deadlines, and funding pressures. Understanding the most common design traps helps you identify when a trial result may be less reliable than it appears.
Composite Outcomes: The Inflation Trick
Composite outcomes combine multiple endpoints into a single outcome measure — for example, "major adverse cardiovascular events (MACE)" typically includes cardiovascular death, non-fatal MI, and non-fatal stroke. Composite outcomes increase statistical power (more events occur) and make trials more feasible. But they can be misleading when the components have very different clinical importance. A drug that reduces non-fatal MI by 30% but has no effect on cardiovascular death or stroke will show a statistically significant reduction in MACE — but the most important component (death) is unchanged. Always decompose composite outcomes and look at each component separately.
| Trial | Composite Outcome | Headline Result | What Actually Drove It | Clinical Implication |
|---|---|---|---|---|
| EMPA-REG OUTCOME (empagliflozin) | CV death, non-fatal MI, non-fatal stroke | MACE reduced by 14% | CV death reduced by 38%; MI and stroke not significantly reduced | Empagliflozin reduces CV death — the most important component |
| LEADER (liraglutide) | CV death, non-fatal MI, non-fatal stroke | MACE reduced by 13% | CV death reduced by 22%; MI reduced by 14%; stroke not significant | Liraglutide reduces CV death and MI |
| ACCORD (intensive glycaemia) | CV death, non-fatal MI, non-fatal stroke | MACE not significantly reduced | All-cause mortality increased by 22% in the intensive arm | Intensive glycaemia increased mortality — the composite masked the harm |
| SPRINT (intensive BP) | MI, ACS, stroke, HF, CV death | MACE reduced by 25% | HF and CV death drove the result; stroke not significant | Benefit concentrated in HF prevention; stroke benefit uncertain |
| HOPE-3 (rosuvastatin + candesartan) | CV death, MI, stroke | Rosuvastatin reduced MACE by 24% | MI reduced; stroke reduced; CV death not significant | Statin benefit in intermediate-risk patients without established CVD |
Surrogate Outcomes: When the Proxy Is Not the Prize
A surrogate outcome is a measurable variable used as a proxy for a clinical outcome that is harder to measure. HbA1c is a surrogate for diabetic complications. LDL-C is a surrogate for cardiovascular events. Blood pressure is a surrogate for stroke and MI. Surrogate outcomes are useful for early-phase trials and regulatory approval, but they can mislead when the surrogate does not reliably predict the clinical outcome. The most famous example: the CAST trial (1989) found that antiarrhythmic drugs (flecainide, encainide) suppressed ventricular ectopics (surrogate) but increased mortality (clinical outcome) — the surrogate pointed in the opposite direction to the clinical outcome.
- Validated surrogates: LDL-C (for CV events), HbA1c (for microvascular complications), blood pressure (for stroke/MI) — these have strong biological plausibility and consistent trial validation
- Unvalidated surrogates: Bone mineral density (for fractures — not always predictive), tumour shrinkage (for survival — often not predictive), CD4 count (for HIV outcomes — validated in some contexts)
- The surrogate trap: A drug that improves a surrogate but worsens clinical outcomes is worse than useless — it is harmful. Always ask: has this surrogate been validated in trials that measured clinical outcomes?
- Regulatory approval vs clinical use: Drugs can be approved on surrogate endpoints (faster, cheaper trials) but may not improve clinical outcomes. FDA accelerated approval pathway has produced several drugs later withdrawn when clinical outcome trials were negative.
Subgroup Analyses: The Garden of Forking Paths
Subgroup analyses divide the trial population into subgroups (by age, sex, diabetes status, baseline risk, etc.) and examine whether the treatment effect differs between subgroups. They are seductive — they seem to offer personalised medicine insights. But they are statistically treacherous. If you divide a trial into 20 subgroups and test each one, you expect 1 to show a statistically significant result by chance alone (at p < 0.05). The more subgroups you test, the more false positives you generate. The ISIS-2 trial famously found that aspirin was ineffective in patients born under the star signs Gemini and Libra — a subgroup analysis of a real trial, illustrating the absurdity of data dredging.
- Pre-specified vs post-hoc: Pre-specified subgroup analyses (defined in the protocol before the trial started) are more reliable than post-hoc analyses (defined after seeing the data). Always check the trial protocol.
- Interaction test: A valid subgroup analysis requires a statistically significant interaction test (p for interaction < 0.05) — this tests whether the treatment effect genuinely differs between subgroups, not just whether it is significant in one and not the other.
- Biological plausibility: A subgroup finding is more credible if there is a plausible biological mechanism. "Drug X works better in patients with high baseline CRP" is more credible if CRP is part of the drug's mechanism of action.
- Replication: A subgroup finding should be replicated in an independent trial before changing practice. The PLATO trial subgroup suggesting ticagrelor was less effective in North American patients was not replicated and is now considered a chance finding.
- Clinical application: Even a valid subgroup finding should be applied cautiously — it is based on a subset of the trial population, with reduced statistical power and increased uncertainty.
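The arithmetic behind the Gemini-and-Libra problem is worth seeing once: with independent tests at α = 0.05, the chance of at least one spurious "significant" subgroup grows rapidly with the number of tests. A quick sketch:

```python
def p_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of >= 1 false positive across n independent tests
    when the null hypothesis is true for all of them."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 20, 100):
    print(f"{n:>3} subgroup tests -> "
          f"{p_at_least_one_false_positive(n):.0%} chance of at least one "
          "spurious 'significant' result")
```

At 20 subgroups the chance of at least one false positive is about 64% — so a single "significant" subgroup in a trial that tested twenty of them is closer to the expected outcome under no effect than to evidence of one. (Real subgroup tests are not fully independent, so this is an upper-bound sketch, but the direction of the problem is the same.)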
Part 5: The 5 Questions Every GP Should Ask
You do not need to read every trial in full. You need a rapid, systematic framework for deciding whether a trial result should change your practice. These five questions can be applied to any trial in under 10 minutes — and they will protect your patients from the harms of premature practice change.
Question 1: Does My Patient Resemble the Trial Population?
This is the most important question, and the one most often skipped. Every trial has inclusion and exclusion criteria that define the population in which the results are valid. If your patient does not resemble that population, the trial result may not apply to them — or may apply in a different direction. The SPRINT trial excluded patients with diabetes, prior stroke, heart failure, and eGFR <20 mL/min/1.73m² — four of the most common comorbidities in hypertensive patients in primary care. The trial also used an unattended automated BP measurement protocol that reads 5–10 mmHg lower than standard clinic measurements. A patient with diabetes, prior stroke, and a clinic BP of 138/88 mmHg is not the SPRINT patient.
- Check the inclusion criteria: What was the minimum age? What was the required diagnosis? What was the baseline risk level?
- Check the exclusion criteria: Were patients with your patient's comorbidities excluded? Were elderly patients excluded? Were women underrepresented?
- Check the baseline characteristics table: What was the mean age? What proportion had diabetes, CKD, prior CVD? What was the mean baseline risk?
- Check the setting: Was this a specialist centre or primary care? Was it a high-income country? Was the "usual care" comparator comparable to your practice?
- The generalisability question: "Would my patient have been eligible for this trial?" If the answer is no, apply the results with caution.
Question 2: What Was the Absolute Benefit — and Is It Clinically Meaningful?
Convert every relative risk to an absolute risk reduction. Calculate the NNT. Then ask: is this NNT clinically meaningful given the cost, side effects, and patient burden of the treatment? A drug that reduces cardiovascular events from 10% to 8% (ARR 2%, NNT 50) over 5 years is meaningfully different from a drug that reduces events from 1% to 0.8% (ARR 0.2%, NNT 500) over 5 years — even if both have a relative risk reduction of 20%. The first drug prevents 1 event per 50 patients treated; the second prevents 1 event per 500 patients treated. The clinical decision is very different.
| Scenario | Control Rate | Treatment Rate | RRR | ARR | NNT (5 years) | Clinical Verdict |
|---|---|---|---|---|---|---|
| High-risk patient (QRISK3 20%) | 20% | 14% | 30% | 6% | 17 | Statin clearly worthwhile — 1 in 17 patients benefits |
| Intermediate-risk patient (QRISK3 10%) | 10% | 7% | 30% | 3% | 33 | Statin worthwhile — discuss with patient |
| Low-risk patient (QRISK3 5%) | 5% | 3.5% | 30% | 1.5% | 67 | Marginal — lifestyle intervention may be preferable |
| Very low-risk patient (QRISK3 2%) | 2% | 1.4% | 30% | 0.6% | 167 | Not worthwhile — NNT too high; harms likely outweigh benefits |
The NNT is time-dependent — it changes with the duration of treatment. An NNT of 100 over 5 years is equivalent to an NNT of 50 over 10 years (assuming constant annual risk reduction). When comparing NNTs across trials, always check the follow-up duration. A trial with a 1-year follow-up will have a higher NNT than a trial with a 5-year follow-up for the same drug — not because the drug is less effective, but because fewer events occur in a shorter time.
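Both points — the NNT falling as baseline risk rises, and the NNT scaling with treatment duration — follow from the same arithmetic. A sketch reproducing the QRISK3 table above, assuming the same 30% relative risk reduction and (as the paragraph notes, a simplification) a constant annual absolute benefit:

```python
def nnt(baseline_risk: float, rrr: float) -> float:
    """NNT over the trial's follow-up period: 1 / (baseline risk x RRR)."""
    return 1 / (baseline_risk * rrr)

def rescale_nnt(nnt_ref: float, years_ref: float, years: float) -> float:
    """Rescale an NNT to a different treatment duration, assuming the
    absolute risk reduction accrues at a constant annual rate."""
    return nnt_ref * years_ref / years

# The QRISK3 scenarios from the table (30% RRR over 5 years):
for risk in (0.20, 0.10, 0.05, 0.02):
    print(f"QRISK3 {risk:.0%}: NNT {nnt(risk, 0.30):.0f} over 5 years")

# NNT 100 over 5 years becomes NNT 50 over 10 years:
print(rescale_nnt(100, years_ref=5, years=10))
```

The loop reproduces the table's NNTs of 17, 33, 67, and 167, which is a useful sanity check: the same drug, with the same relative effect, spans an order of magnitude in NNT purely because of the patient's baseline risk.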
Question 3: What Were the Harms — and Were They Fully Reported?
Every effective treatment has harms. The question is not whether harms exist, but whether the benefit-harm ratio is favourable for your patient. Harms are systematically underreported in clinical trials — particularly industry-funded trials. A 2012 Cochrane review found that harms data were inadequately reported in 65% of RCTs. The reasons are structural: trials are powered to detect benefits, not harms; follow-up is often too short to detect long-term harms; and industry sponsors have financial incentives to minimise harm reporting.
- Look for the Number Needed to Harm (NNH): If not reported, calculate it from the adverse event rates in the safety table
- Check the duration of follow-up: Short trials (1–2 years) cannot detect long-term harms (e.g., cancer risk, cognitive effects, bone density changes)
- Check for serious adverse events (SAEs): These are pre-defined events requiring hospitalisation or causing death — they are the most clinically important harms
- Check for discontinuation rates: High discontinuation due to adverse effects in the treatment arm suggests real-world tolerability is worse than the trial suggests
- Check for harms in subgroups: Harms may be concentrated in specific subgroups (elderly, renal impairment, drug interactions) that are underrepresented in the trial
- Look for the trial registration and protocol: Compare reported outcomes with pre-specified outcomes — selective outcome reporting (reporting only favourable outcomes) is a form of research misconduct
Question 4: Is the Result Biologically Plausible and Consistent with Other Evidence?
A single trial, however well-designed, is not sufficient to change practice. The result must be biologically plausible (consistent with what we know about the mechanism of action), consistent with prior evidence (not contradicting well-established findings), and ideally replicated in independent trials. The Bradford Hill criteria — originally developed for establishing causation in epidemiology — provide a useful framework: strength of association, consistency across studies, specificity, temporality, biological gradient (dose-response), plausibility, coherence, experiment (does removing the cause remove the effect?), and analogy.
- Biological plausibility: Does the mechanism of action explain the observed effect? Statins reduce LDL-C → LDL-C causes atherosclerosis → statins reduce cardiovascular events. Plausible and confirmed.
- Consistency: Is the result consistent with prior trials of the same drug class? The cardiovascular benefit of SGLT2 inhibitors has been replicated across empagliflozin (EMPA-REG), canagliflozin (CANVAS), dapagliflozin (DECLARE), and ertugliflozin (VERTIS) — consistent class effect.
- Dose-response: Does a higher dose produce a greater effect? The DiRECT trial showed a clear dose-response between weight loss and T2DM remission — more weight lost, higher remission rate. This strengthens the causal inference.
- Replication: Has the result been replicated in an independent trial by a different research group? A single positive trial from one group, not yet replicated, should be treated as hypothesis-generating.
- Coherence: Does the result fit with what we know from other sources (animal studies, mechanistic studies, epidemiology)? A result that contradicts all prior evidence requires extraordinary evidence.
Question 5: Who Funded the Trial — and Does It Matter?
Funding source is not a reason to dismiss a trial, but it is a reason to scrutinise it more carefully. A 2017 systematic review in the BMJ found that industry-funded trials were 4 times more likely to report positive results than independently funded trials, even after adjusting for study quality. The mechanisms are multiple: selective publication (negative trials not published), outcome switching (changing the primary outcome after seeing the data), selective reporting (reporting only favourable secondary outcomes), and spin (framing negative results positively in the abstract and discussion). The most reliable trials are independently funded (NIHR, MRC, NIH), pre-registered, and published with full data transparency.
| Funding Source | Positive Result Rate | Key Risk | How to Mitigate |
|---|---|---|---|
| Industry-funded | ~85% positive | Publication bias; outcome switching; spin | Check trial registration; compare protocol with published outcomes; look for independent replication |
| Independently funded (NIHR, MRC, NIH) | ~50% positive | Lower risk of bias; more likely to publish negative results | Still check methodology; independent funding does not guarantee quality |
| Investigator-initiated (industry drug, independent design) | Intermediate | Drug supply from industry but independent design and analysis | Check data access agreement; who controlled the analysis? |
| Cochrane systematic review | N/A — synthesises all evidence | Depends on quality of included trials; publication bias in included trials | Check for funnel plot asymmetry; GRADE rating of evidence quality |
ClinicalTrials.gov and the ISRCTN registry allow you to look up the pre-registered protocol for any registered trial. Compare the pre-specified primary outcome with the published primary outcome. If they differ, the trial may have engaged in outcome switching — a serious breach of research integrity that inflates false-positive rates. The AllTrials campaign (alltrials.net) advocates for mandatory registration and publication of all clinical trials.
Part 6: Applying the Framework — The SPRINT Trial Revisited
Let us apply the five questions to the SPRINT trial — the trial that drove the AHA/ACC to lower the hypertension threshold to 130/80 mmHg, and that NICE chose not to follow.
| Question | SPRINT Answer | Clinical Implication |
|---|---|---|
| 1. Does my patient resemble the trial population? | SPRINT excluded: diabetes, prior stroke, HF, eGFR <20, and nursing home residents. Used unattended automated BP (reads 5–10 mmHg lower than standard clinic BP). | Most primary care hypertensive patients do NOT resemble the SPRINT population. The BP measurement method is not standard in UK primary care. |
| 2. What was the absolute benefit? | Primary outcome (MACE): 5.2% vs 6.8% over 3.26 years. ARR = 1.6%. NNT = 63 over 3.26 years. CV death: 0.8% vs 1.4%. ARR = 0.6%. NNT = 167. | Meaningful benefit in the SPRINT population — but this is a high-risk population (mean 10-year CVD risk ~20%). In lower-risk patients, the ARR would be much smaller. |
| 3. What were the harms? | Serious adverse events: 38.3% vs 37.1% (not significantly different overall). Hypotension: 2.4% vs 1.4% (p<0.001). Syncope: 2.3% vs 1.7% (p=0.05). AKI: 4.4% vs 2.6% (p<0.001). Electrolyte abnormalities: 3.1% vs 1.5% (p<0.001). | Significant increase in AKI, hypotension, and electrolyte abnormalities. In elderly patients and those with CKD, these harms may outweigh the cardiovascular benefit. |
| 4. Is the result biologically plausible and consistent? | Plausible — lower BP reduces cardiovascular events. But inconsistent with ACCORD (intensive BP in diabetes showed no benefit and possible harm) and HOT trial (no benefit below 140 mmHg in most patients). | The SPRINT result may be specific to the SPRINT population (high-risk, non-diabetic, non-stroke). It does not generalise to all hypertensive patients. |
| 5. Who funded the trial? | SPRINT was funded by the NIH (National Heart, Lung, and Blood Institute) — independent funding. No industry involvement. | Independent funding reduces publication bias risk. The result is more likely to be genuine. But generalisability questions remain. |
NICE's conclusion: the SPRINT result is real and applies to the SPRINT population. But the SPRINT population is not representative of the general hypertensive population in UK primary care. The BP measurement method is not standard. The harms (AKI, hypotension, electrolyte abnormalities) are clinically significant, particularly in elderly patients and those with CKD. The absolute benefit in lower-risk patients (QRISK3 <10%) is too small to justify the harms. Therefore, NICE retained the 140/90 mmHg threshold for most patients, while acknowledging that lower targets may be appropriate for high-risk subgroups. This is not NICE being conservative — it is NICE being rigorous.
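The absolute-risk arithmetic in Question 2 is worth making mechanical. A minimal sketch in Python, using the SPRINT event rates quoted in the table above (the helper names are illustrative, not from any library):

```python
import math

def arr(control_rate, treatment_rate):
    """Absolute risk reduction: control-group event rate minus treatment-group rate."""
    return control_rate - treatment_rate

def nnt(control_rate, treatment_rate):
    """Number needed to treat (or harm): 1/ARR, rounded up to a whole patient."""
    return math.ceil(1 / abs(arr(control_rate, treatment_rate)))

# SPRINT primary outcome (MACE) over 3.26 years: 6.8% standard vs 5.2% intensive
print(nnt(0.068, 0.052))  # NNT = 63

# AKI (a harm, so the rates run the other way): 2.6% standard vs 4.4% intensive
print(nnt(0.026, 0.044))  # NNH = 56
```

An NNT of 63 for the primary outcome alongside an NNH of 56 for AKI, over the same 3.26 years, is precisely the benefit–harm trade-off NICE weighed.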
Part 7: A Rapid Appraisal Checklist for Busy GPs
You will not always have time for a full critical appraisal. The following 10-point checklist can be completed in under 5 minutes and will identify the most important issues with any trial result.
- 1. What was the primary outcome? (Not a surrogate — a clinical outcome)
- 2. What was the absolute risk reduction? (Convert relative risk to absolute risk)
- 3. What is the NNT? (And over what time period?)
- 4. What were the serious adverse events? (Calculate NNH)
- 5. Does my patient resemble the trial population? (Check inclusion/exclusion criteria)
- 6. Was the trial adequately powered? (Check sample size calculation)
- 7. Were the groups balanced at baseline? (Check Table 1)
- 8. Was the primary outcome pre-specified? (Check trial registration)
- 9. Who funded the trial? (Industry vs independent)
- 10. Has the result been replicated? (One trial is not enough to change practice)
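Checklist items 2 and 3 — converting a headline relative risk reduction into absolute terms — can be sketched in a few lines of Python (the function name is illustrative):

```python
import math

def absolute_from_relative(baseline_risk, rrr):
    """Turn a relative risk reduction into ARR and NNT for a patient
    whose untreated (baseline) risk is known, e.g. from QRISK3."""
    arr = baseline_risk * rrr          # absolute risk reduction
    return arr, math.ceil(1 / arr)     # NNT, rounded up to a whole patient

# The same "50% relative risk reduction" means very different things
# at different baseline risks:
print(absolute_from_relative(0.20, 0.50))  # high-risk patient: ARR 10%, NNT 10
print(absolute_from_relative(0.02, 0.50))  # low-risk patient: ARR 1%, NNT 100
```

This is why an identical headline figure can translate into an NNT of 10 for one patient and 100 for another — the baseline risk, not the relative reduction, does most of the work.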
The CASP (Critical Appraisal Skills Programme) provides free, validated checklists for appraising RCTs, systematic reviews, cohort studies, and diagnostic test accuracy studies. Available at casp-uk.net. The BMJ's "How to read a paper" series by Trisha Greenhalgh is the definitive accessible guide to critical appraisal for clinicians — available free online.
Part 8: When to Change Practice — and When to Wait
The hardest clinical skill is knowing when a new trial result should change your practice and when it should not. The following framework provides a structured approach to this decision.
| Scenario | Action | Rationale |
|---|---|---|
| Single RCT, large effect, high-quality, independent funding, replicated | Change practice | Strong evidence; low risk of false positive; consistent with prior evidence |
| Single RCT, large effect, industry-funded, not yet replicated | Watch and wait; discuss with patients who ask | Possible true effect but replication needed; publication bias risk |
| Single RCT, small effect (NNT >200), statistically significant | Do not change practice | Statistical significance ≠ clinical significance; harms may outweigh marginal benefit |
| Meta-analysis with high heterogeneity (I² >50%) | Interpret with caution; look at individual trials | Pooling heterogeneous trials can produce misleading estimates |
| Subgroup analysis showing benefit in your patient type | Do not change practice unless pre-specified and replicated | Subgroup analyses are hypothesis-generating; high false positive rate |
| Surrogate outcome trial (e.g., HbA1c, LDL-C) | Do not change practice until clinical outcome trial available | Surrogate improvement does not guarantee clinical benefit |
| Observational study showing association | Do not change practice; use as hypothesis for RCT | Association ≠ causation; confounding likely |
| NICE guideline update based on new evidence | Change practice in line with NICE | NICE has appraised the evidence; follow unless patient-specific reason not to |
The most dangerous moment in evidence-based medicine is the period between a single positive trial and its replication — when the result is new, exciting, and widely publicised, but not yet confirmed. This is when premature practice change causes the most harm. The history of medicine is littered with interventions that seemed promising in early trials and were later shown to be ineffective or harmful: hormone replacement therapy for cardiovascular prevention (WHI trial), intensive glycaemic control in T2DM (ACCORD), antiarrhythmics for ventricular ectopics (CAST), and many others. The discipline of waiting for replication before changing practice is not conservatism — it is the highest form of patient advocacy.
The GRADE system (Grading of Recommendations Assessment, Development and Evaluation) provides a transparent framework for rating the quality of evidence and the strength of recommendations. Evidence is rated High, Moderate, Low, or Very Low based on study design, risk of bias, inconsistency, indirectness, imprecision, and publication bias. A strong recommendation requires High or Moderate quality evidence. NICE uses GRADE for all its guideline recommendations.
Key Clinical Takeaways
- Always convert relative risk to absolute risk reduction — a 50% relative risk reduction from 2% to 1% is an NNT of 100, not a dramatic benefit
- The NNT is the most clinically useful statistic: it tells you how many patients need treatment for one to benefit — always pair it with the NNH (Number Needed to Harm)
- P < 0.05 means that, if there were truly no effect, a result at least this extreme would arise by chance less than 5% of the time — it does not mean the effect is clinically important, large, or applicable to your patient
- Confidence intervals show precision: the range of effect sizes compatible with the data — a 95% CI that crosses 1.0 (for a ratio measure such as relative risk) or 0 (for a difference measure such as ARR) is not statistically significant at the conventional threshold
- The 5 questions: Does my patient resemble the trial population? What was the absolute benefit? What were the harms? Is the result biologically plausible and consistent? Who funded the trial?
- Composite outcomes can inflate apparent benefit — always decompose and examine each component separately
- Subgroup analyses are hypothesis-generating, not practice-changing — require pre-specification, significant interaction test, and independent replication
- One trial is not enough to change practice — wait for replication, especially for industry-funded trials with large relative risk reductions
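The confidence-interval rule above can be checked numerically. A minimal sketch of the standard Wald interval for an absolute risk difference, using illustrative round numbers rather than figures from any named trial:

```python
import math

def arr_ci(events_control, n_control, events_treat, n_treat, z=1.96):
    """95% Wald confidence interval for the absolute risk reduction
    (control risk minus treatment risk). If the interval includes 0,
    the result is not statistically significant at p < 0.05."""
    p1, p2 = events_control / n_control, events_treat / n_treat
    arr = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n_control + p2 * (1 - p2) / n_treat)
    return arr - z * se, arr + z * se

# Event rates of 6.8% vs 5.2% in two groups of 5,000:
lo, hi = arr_ci(340, 5000, 260, 5000)
print(round(lo, 4), round(hi, 4))  # 0.0067 0.0253 — excludes 0, so significant
```

Note how the width of the interval — the precision — is driven by the number of events, not just the number of patients: the same rates in groups of 500 would produce an interval that crosses zero.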
Priya Nair