Clinical Practice · 17 min read · Updated 14 April 2026

How to Read a Clinical Trial: RCTs, NNT, Confidence Intervals & the 5 Questions Every GP Should Ask

Before you change your practice based on a headline, ask these five questions — your patients deserve no less

Dr. Marcus Chen
GP & Clinical Educator, Cardiology
Published 14 April 2026

A trial is published. The headline says "Drug X reduces heart attacks by 25%." Your inbox fills with patient queries. A colleague changes their prescribing. Should you? This guide gives GPs the statistical literacy to read any clinical trial critically — understanding RCT design, absolute vs relative risk, NNT, confidence intervals, p-values, and the five questions that separate practice-changing evidence from noise.

Clinical Decision Support: This article is for educational purposes and supports — not replaces — clinical judgment. Always verify with current national guidelines, BNF, and specialist consultation when needed.

In November 2015, the SPRINT trial was published in the New England Journal of Medicine. The headline: intensive blood pressure control (target systolic <120 mmHg) reduced major cardiovascular events by 25% and all-cause mortality by 27% compared to standard control (<140 mmHg). Within weeks, cardiologists were calling for a revision of hypertension guidelines. Two years later, the 2017 AHA/ACC guideline lowered the hypertension threshold to 130/80 mmHg — a definition under which 46% of American adults were hypertensive. NICE, after careful appraisal, did not change its threshold. Same trial. Same data. Opposite conclusions. The difference was not ideology — it was critical appraisal. NICE asked five questions that the headline did not answer, and the answers changed everything. This guide teaches you to ask those questions for every trial you read.

Part 1: Understanding RCT Design — The Foundation

The randomised controlled trial (RCT) is the gold standard of clinical evidence because randomisation — when done properly — eliminates confounding. If you randomly assign 5,000 patients to Drug X and 5,000 to placebo, any difference in outcomes between the groups is attributable to the drug, not to differences in age, sex, comorbidities, or lifestyle. This is the power of the RCT. But it is also its limitation: the trial only tells you what happened in those 10,000 patients, in that setting, over that time period. Whether it applies to your patient is a separate question entirely.

The Anatomy of an RCT: What Every Section Tells You

| Section | What It Contains | What to Look For | Red Flags |
| --- | --- | --- | --- |
| Abstract | Summary of design, population, intervention, outcomes, results | Primary outcome; effect size; p-value; follow-up duration | Relative risk only (no absolute risk); composite outcomes buried in footnotes |
| Introduction | Background and rationale | What gap the trial claims to fill; prior evidence | Overstated prior uncertainty; cherry-picked prior studies |
| Methods — Population | Inclusion and exclusion criteria | Who was enrolled; who was excluded; how representative | Highly selected population; exclusion of elderly, women, comorbidities |
| Methods — Intervention | What was done to each group | Dose, duration, co-interventions; what "usual care" actually was | Unusual dosing; unusually good usual care (makes drug look worse) |
| Methods — Outcomes | Primary and secondary outcomes; how measured | Pre-specified primary outcome; composite vs single outcomes | Primary outcome changed after trial started (outcome switching); composite outcomes that inflate effect |
| Methods — Randomisation | How patients were allocated | Allocation concealment; stratification | Inadequate concealment; post-randomisation exclusions |
| Methods — Blinding | Who was blinded to treatment allocation | Double-blind (patient + assessor); open-label | Open-label trials for subjective outcomes (bias risk) |
| Results — Baseline | Characteristics of each group at randomisation | Are groups balanced? Any important differences? | Imbalanced groups despite randomisation (chance or selective reporting) |
| Results — Primary Outcome | Main result | Absolute risk reduction; relative risk reduction; NNT; confidence interval; p-value | Only relative risk reported; wide confidence intervals; p just below 0.05 |
| Results — Secondary Outcomes | Additional outcomes | Consistency with primary outcome; pre-specified vs post-hoc | Multiple secondary outcomes without correction; post-hoc subgroup analyses presented as primary |
| Discussion | Interpretation of results | Authors' own limitations section; generalisability | Overstated conclusions; minimised harms; ignored negative secondary outcomes |
| Funding | Who paid for the trial | Industry vs independent funding | Industry-funded trials are 4× more likely to report positive results |

The CONSORT statement (Consolidated Standards of Reporting Trials) provides a 25-item checklist for reporting RCTs. Journals that require CONSORT compliance produce more transparent, reproducible trial reports. When reading a trial, check whether it reports a CONSORT flow diagram — this shows how many patients were screened, enrolled, randomised, completed the trial, and were analysed. Missing patients are not missing data — they are a signal.

Part 2: The Statistics You Actually Need

You do not need a statistics degree to read a clinical trial critically. You need to understand six concepts: absolute risk, relative risk, absolute risk reduction, relative risk reduction, number needed to treat, and confidence intervals. Everything else is detail. Master these six, and you can appraise any trial result.

Absolute Risk vs Relative Risk: The Most Important Distinction in Medicine

This is the single most important statistical concept for clinical practice, and the one most commonly misrepresented in medical headlines. Consider a trial where 2% of patients in the control group have a heart attack over 5 years, and 1% of patients in the treatment group have a heart attack. The relative risk reduction is 50% — treatment halved the risk. The absolute risk reduction is 1% — treatment prevented 1 heart attack per 100 patients over 5 years. Both statements are mathematically correct. The relative risk reduction sounds dramatic. The absolute risk reduction tells you what actually happened.

| Measure | Formula | Example (Control 2%, Treatment 1%) | Clinical Meaning | When Misleading |
| --- | --- | --- | --- | --- |
| Control Event Rate (CER) | Events in control group / total in control group | 2/100 = 2% | Baseline risk without treatment | Never — always report this |
| Experimental Event Rate (EER) | Events in treatment group / total in treatment group | 1/100 = 1% | Risk with treatment | Never — always report this |
| Absolute Risk Reduction (ARR) | CER − EER | 2% − 1% = 1% | The actual reduction in risk per patient treated | Low ARR with high RRR can mislead — always calculate |
| Relative Risk Reduction (RRR) | (CER − EER) / CER | (2% − 1%) / 2% = 50% | Proportional reduction in risk | Misleading when baseline risk is low — 50% of a tiny risk is still tiny |
| Number Needed to Treat (NNT) | 1 / ARR | 1 / 0.01 = 100 | How many patients need treatment for 1 to benefit | Must be interpreted alongside treatment duration and harms |
| Relative Risk (RR) | EER / CER | 1% / 2% = 0.5 | Risk in treatment group relative to control | Does not convey absolute magnitude of benefit |
| Odds Ratio (OR) | (EER/(1−EER)) / (CER/(1−CER)) | (0.01/0.99) / (0.02/0.98) ≈ 0.495 | Ratio of odds of event in each group | Overestimates RR when event rates are high (>10%) |

The pharmaceutical industry and medical journals preferentially report relative risk reductions because they sound more impressive. A drug that reduces heart attacks from 2% to 1% will be marketed as "reduces heart attack risk by 50%" — not "prevents 1 heart attack per 100 patients treated over 5 years." Both are true. Only one is useful for clinical decision-making. Always convert relative risk to absolute risk before making a prescribing decision.
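These conversions are mechanical enough to script once and reuse. A minimal sketch in Python, using the worked example above (the `risk_measures` helper is illustrative, not a library function):

```python
# Sketch: converting trial event rates into the measures in the table above.
# Numbers are the worked example from the text (control 2%, treatment 1%).

def risk_measures(cer: float, eer: float) -> dict:
    """Compute absolute and relative risk measures from event rates (proportions)."""
    arr = cer - eer                      # absolute risk reduction
    rrr = arr / cer                      # relative risk reduction
    rr = eer / cer                       # relative risk
    nnt = 1 / arr                        # number needed to treat
    odds_ratio = (eer / (1 - eer)) / (cer / (1 - cer))
    return {"ARR": arr, "RRR": rrr, "RR": rr, "NNT": nnt, "OR": odds_ratio}

m = risk_measures(cer=0.02, eer=0.01)
print(f"ARR {m['ARR']:.1%}, RRR {m['RRR']:.0%}, NNT {m['NNT']:.0f}")
# ARR 1.0%, RRR 50%, NNT 100
```

The same drug produces a 50% headline and a 1% reality; the calculation takes seconds and is worth doing before any prescribing decision.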

Number Needed to Treat (NNT): The Clinician's Statistic

The NNT is the most clinically intuitive statistic in medicine. It answers the question: "How many patients do I need to treat for one to benefit?" An NNT of 10 means 1 in 10 patients benefits — the other 9 receive the drug without benefit (though they may still experience side effects). An NNT of 1,000 means you need to treat 1,000 patients for one to benefit. NNT must always be interpreted in context: an NNT of 100 for a cheap, safe drug with no side effects may be entirely acceptable. An NNT of 100 for an expensive drug with serious adverse effects is not.

| Intervention | NNT | Time Period | Outcome Prevented | Clinical Interpretation |
| --- | --- | --- | --- | --- |
| Aspirin post-MI (secondary prevention) | ~50 | 2 years | Non-fatal MI or death | Highly worthwhile — cheap, safe, clear benefit |
| Statin for primary prevention (QRISK3 ≥10%) | ~100–200 | 5 years | Major cardiovascular event | Acceptable — cheap, generally safe; long-term benefit accumulates |
| Antihypertensive (Stage 1, low risk) | ~300–500 | 5 years | Major cardiovascular event | Marginal — benefit depends on individual risk; lifestyle first |
| Semaglutide 2.4 mg (SELECT trial, CVD + obesity) | ~67 | 3.3 years | MACE (MI, stroke, CV death) | Meaningful — high-risk population; dual weight + CV benefit |
| Antibiotics for acute otitis media (>2 years) | ~15 | 7 days | Pain relief at 24 hours | Modest — NNT 15 for symptom relief; weigh against resistance |
| Tamoxifen for breast cancer prevention (high risk) | ~22 | 5 years | Invasive breast cancer | Significant — but NNH for serious adverse events also relevant |
| Statins for primary prevention (low risk, QRISK3 <5%) | ~500–1000 | 5 years | Major cardiovascular event | Very marginal — lifestyle intervention likely more appropriate |

The NNT is only meaningful when paired with the Number Needed to Harm (NNH) — the number of patients who need to receive the treatment for one to experience a significant adverse effect. A drug with NNT 50 and NNH 500 has a favourable benefit-harm ratio (10:1). A drug with NNT 50 and NNH 50 has a neutral ratio — you benefit one patient for every one you harm. Always look for the NNH in the safety data.
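The NNH comes from the same arithmetic applied to the adverse event rates, and the NNH:NNT ratio is the benefit-harm comparison described above. A short sketch with invented rates (not drawn from any named trial):

```python
# Sketch: pairing NNT with NNH. Event rates below are invented for illustration,
# standing in for a trial's efficacy table and safety table respectively.

def number_needed(control_rate: float, treatment_rate: float) -> float:
    """NNT (or NNH) from two event rates: 1 / absolute risk difference."""
    return 1 / abs(control_rate - treatment_rate)

benefit_nnt = number_needed(0.10, 0.08)   # primary events: 10% vs 8%
harm_nnh = number_needed(0.012, 0.010)    # serious adverse events: 1.2% vs 1.0%
print(f"NNT {benefit_nnt:.0f}, NNH {harm_nnh:.0f}, ratio {harm_nnh/benefit_nnt:.0f}:1")
# NNT 50, NNH 500, ratio 10:1
```

A 10:1 ratio is the favourable scenario from the paragraph above; if the safety table gave an NNH near the NNT, the calculation alone would flag the problem.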

Confidence Intervals: What They Actually Mean

A confidence interval (CI) is a range of values within which the true population effect is likely to lie, with a specified level of certainty (usually 95%). The precise frequentist interpretation: if we repeated this trial many times and calculated a 95% CI each time, 95% of those intervals would contain the true relative risk. It does not mean there is a 95% probability that the true relative risk lies within this particular interval — a common misconception. The width of the CI tells you about precision: a narrow CI (e.g., 0.78–0.82) indicates a precise estimate; a wide CI (e.g., 0.50–0.95) indicates uncertainty. The location of the CI tells you about statistical significance: if the CI for a relative risk crosses 1.0 (the null value), the result is not statistically significant.

| CI Result | Statistical Significance | Clinical Interpretation | Example |
| --- | --- | --- | --- |
| RR 0.80 (95% CI 0.72–0.88) | Significant (CI does not cross 1.0) | Treatment reduces risk by 20%; true effect likely between 12–28% reduction | EMPA-REG OUTCOME: empagliflozin CV death RR 0.62 (0.49–0.77) |
| RR 0.90 (95% CI 0.78–1.04) | Not significant (CI crosses 1.0) | No statistically significant effect; could be 22% reduction or 4% increase | Inconclusive trial — do not change practice |
| RR 0.80 (95% CI 0.40–1.60) | Not significant (wide CI, crosses 1.0) | Underpowered trial — too few patients to detect a real effect | Small pilot trial — hypothesis-generating only |
| RR 0.99 (95% CI 0.98–1.00) | Borderline significant | Statistically significant but clinically trivial — 1% relative risk reduction | Large trial with tiny effect — statistical significance ≠ clinical significance |
| RR 0.50 (95% CI 0.30–0.85) | Significant but imprecise | Large effect but wide CI — small trial; replicate before changing practice | Small RCT — needs confirmation in larger trial |
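If a paper reports raw event counts but no CI, you can reconstruct one with the standard log-RR (normal approximation) method. A sketch with invented counts — note how an RR of 0.80 can still be non-significant when the trial is small:

```python
import math

# Sketch: 95% CI for a relative risk from raw 2x2 counts, using the standard
# log-RR method. The counts are invented; rr_ci is an illustrative helper.

def rr_ci(a, n1, c, n0, z=1.96):
    """RR and 95% CI: a events among n1 treated, c events among n0 controls."""
    rr = (a / n1) / (c / n0)
    se_log_rr = math.sqrt(1/a - 1/n1 + 1/c - 1/n0)   # SE of ln(RR)
    lo = math.exp(math.log(rr) - z * se_log_rr)
    hi = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lo, hi

rr, lo, hi = rr_ci(a=80, n1=1000, c=100, n0=1000)   # 8% vs 10% event rates
significant = hi < 1.0 or lo > 1.0                  # CI excludes the null value
print(f"RR {rr:.2f} (95% CI {lo:.2f}-{hi:.2f}), significant: {significant}")
# RR 0.80 (95% CI 0.60-1.06), significant: False
```

Two thousand patients, a 20% relative risk reduction, and still an inconclusive result: this is exactly the underpowered-trial row in the table above.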

The P-Value: The Most Misunderstood Number in Medicine

The p-value is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis (no effect) is true. A p-value of 0.03 means: if the drug had no effect, there is a 3% chance of seeing a result this extreme by chance alone. It does not mean there is a 97% probability that the drug works. It does not tell you the size of the effect. It does not tell you whether the effect is clinically meaningful. The p-value is a binary gate (significant/not significant) that tells you almost nothing about the clinical importance of a finding.

  • p < 0.05 is an arbitrary threshold — Ronald Fisher, who introduced it in 1925, explicitly stated it should not be used as a fixed decision rule
  • P-values of 0.049 and 0.051 are statistically indistinguishable — treating them as categorically different (significant vs not significant) is scientifically indefensible
  • Large trials can produce statistically significant results for clinically trivial effects — a trial of 100,000 patients can detect a 0.1% difference with p < 0.001
  • Small trials can produce non-significant results for clinically important effects — a trial of 100 patients may be underpowered to detect a 20% risk reduction
  • The American Statistical Association (2019) has called for the abandonment of "statistical significance" as a binary concept — focus on effect size and confidence intervals instead
  • p-hacking: running multiple analyses until p < 0.05 is achieved — inflates false positive rate; look for pre-registered primary outcomes

The SPRINT trial had a p-value of <0.001 for its primary outcome. This tells you the result was very unlikely to be due to chance. It does not tell you whether the result applies to your patients, whether the BP measurement method was comparable to standard clinic practice, or whether the harms of intensive treatment outweigh the benefits in lower-risk patients. Statistical significance is the beginning of critical appraisal, not the end.
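The dependence of p-values on sample size can be made concrete with a two-proportion z-test. In the sketch below (illustrative numbers), the same 0.1% absolute difference is nowhere near significant in a 2,000-patient trial but falls below p = 0.001 in a 1,000,000-patient one:

```python
import math

# Sketch: the same tiny absolute difference (2.1% vs 2.0%) tested at two trial
# sizes. Significance tracks sample size, not clinical importance.

def two_prop_p(x1, n1, x2, n2):
    """Two-sided p-value for a two-proportion z-test (pooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1/n1 + 1/n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))   # = 2 * (1 - Phi(|z|))

small = two_prop_p(21, 1000, 20, 1000)          # 1,000 patients per arm
huge = two_prop_p(10500, 500000, 10000, 500000)  # 500,000 patients per arm
print(f"small trial p = {small:.2f}, huge trial p = {huge:.4f}")
```

The effect size is identical in both cases — an NNT of 1,000 — and no p-value, however small, makes it clinically important.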

Part 3: Study Design Hierarchy — Not All Evidence Is Equal

Evidence-based medicine operates on a hierarchy of study designs, from the most to least reliable for establishing causation. Understanding where a study sits in this hierarchy is the first step in appraising its clinical relevance.

| Level | Study Design | Strength | Limitation | Clinical Use |
| --- | --- | --- | --- | --- |
| 1a | Systematic review + meta-analysis of RCTs | Highest — synthesises all available RCT evidence | Only as good as the included trials; heterogeneity can mislead | Guideline development; definitive treatment decisions |
| 1b | Individual RCT (well-designed, adequately powered) | High — randomisation eliminates confounding | Specific population; may not generalise; short follow-up | Practice-changing evidence if well-designed and replicated |
| 2a | Systematic review of cohort studies | Moderate — large populations; real-world data | Confounding by indication; selection bias | Long-term outcomes; rare adverse effects |
| 2b | Individual cohort study | Moderate — observational; no randomisation | Confounding; recall bias; loss to follow-up | Hypothesis generation; long-term safety data |
| 3a | Systematic review of case-control studies | Lower — retrospective; recall bias | Cannot establish causation; selection bias | Rare outcomes; hypothesis generation |
| 3b | Individual case-control study | Lower — retrospective; significant bias risk | Cannot establish causation | Rare diseases; adverse drug reactions |
| 4 | Case series / case reports | Very low — no control group | Cannot establish causation; selection bias | Signal generation; rare adverse effects; novel presentations |
| 5 | Expert opinion / editorials | Lowest — opinion, not evidence | Bias; conflicts of interest | Background context only; not for clinical decisions |

Observational studies (cohort, case-control) can generate hypotheses but cannot establish causation. The classic example: coffee drinkers have higher rates of lung cancer in observational studies — not because coffee causes lung cancer, but because coffee drinkers are more likely to smoke (confounding). The only reliable way to establish causation is randomisation. When a headline says "X is associated with Y," it means correlation — not causation.

Meta-Analysis: Power and Pitfalls

A meta-analysis pools data from multiple trials to produce a single, more precise estimate of effect. It is represented as a forest plot — a visual display of individual trial results and the pooled estimate. The pooled estimate is shown as a diamond at the bottom; the width of the diamond represents the confidence interval. If the diamond does not cross the vertical line of no effect (relative risk = 1.0), the pooled result is statistically significant. Meta-analyses are powerful but have important limitations: they are only as good as the trials they include (garbage in, garbage out), and heterogeneity between trials (different populations, doses, follow-up periods) can make pooling misleading.

  • Publication bias: Positive trials are more likely to be published than negative trials — meta-analyses that include only published trials overestimate treatment effects. Check for funnel plot asymmetry.
  • Heterogeneity (I²): Measures the proportion of variability between trials due to true differences rather than chance. I² >50% indicates substantial heterogeneity — pooling may be inappropriate.
  • Fixed vs random effects: Fixed effects assume all trials estimate the same true effect; random effects allow for variation between trials. Random effects models are more conservative and generally more appropriate.
  • GRADE system: Cochrane and NICE use the GRADE system to rate the quality of evidence from meta-analyses: High, Moderate, Low, or Very Low. A meta-analysis of poor-quality trials produces Low-quality evidence regardless of the pooled p-value.
  • Network meta-analysis: Compares multiple treatments simultaneously, including indirect comparisons (A vs B and B vs C to infer A vs C). Powerful but complex — indirect comparisons are less reliable than direct head-to-head trials.
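The fixed-effect pooling and the I² statistic described above can be computed directly from each trial's RR and CI. A sketch using inverse-variance weighting on the log scale (the trial results are invented for illustration):

```python
import math

# Sketch: inverse-variance fixed-effect pooling of log relative risks, plus
# Cochran's Q and I^2 for heterogeneity. Trial data are invented.

trials = [              # (RR, lower 95% CI, upper 95% CI) for each trial
    (0.80, 0.65, 0.98),
    (0.85, 0.70, 1.03),
    (0.78, 0.60, 1.01),
]

log_rrs, weights = [], []
for rr, lo, hi in trials:
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)   # SE recovered from CI width
    log_rrs.append(math.log(rr))
    weights.append(1 / se**2)                         # inverse-variance weight

pooled = sum(w * y for w, y in zip(weights, log_rrs)) / sum(weights)
q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_rrs))  # Cochran's Q
df = len(trials) - 1
i2 = max(0.0, (q - df) / q) if q > 0 else 0.0         # I^2, floored at zero
print(f"pooled RR {math.exp(pooled):.2f}, I^2 = {i2:.0%}")
```

Three individually non-significant or borderline trials pool to a significant, more precise estimate — the core promise of meta-analysis — and the low I² here reflects the deliberately consistent invented inputs.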

Part 4: Common Trial Design Traps

Clinical trials are designed by humans with interests, deadlines, and funding pressures. Understanding the most common design traps helps you identify when a trial result may be less reliable than it appears.

Composite Outcomes: The Inflation Trick

Composite outcomes combine multiple endpoints into a single outcome measure — for example, "major adverse cardiovascular events (MACE)" typically includes cardiovascular death, non-fatal MI, and non-fatal stroke. Composite outcomes increase statistical power (more events occur) and make trials more feasible. But they can be misleading when the components have very different clinical importance. A drug that reduces non-fatal MI by 30% but has no effect on cardiovascular death or stroke will show a statistically significant reduction in MACE — but the most important component (death) is unchanged. Always decompose composite outcomes and look at each component separately.

| Trial | Composite Outcome | Headline Result | What Actually Drove It | Clinical Implication |
| --- | --- | --- | --- | --- |
| EMPA-REG OUTCOME (empagliflozin) | CV death, non-fatal MI, non-fatal stroke | MACE reduced by 14% | CV death reduced by 38%; MI and stroke not significantly reduced | Empagliflozin reduces CV death — the most important component |
| LEADER (liraglutide) | CV death, non-fatal MI, non-fatal stroke | MACE reduced by 13% | CV death reduced by 22%; MI reduced by 14%; stroke not significant | Liraglutide reduces CV death and MI |
| ACCORD (intensive glycaemia) | CV death, non-fatal MI, non-fatal stroke | MACE not significantly reduced | CV death increased by 22% in intensive arm | Intensive glycaemia increased mortality — composite masked the harm |
| SPRINT (intensive BP) | MI, ACS, stroke, HF, CV death | MACE reduced by 25% | HF and CV death drove the result; stroke not significant | Benefit concentrated in HF prevention; stroke benefit uncertain |
| HOPE-3 (rosuvastatin + candesartan) | CV death, MI, stroke | Rosuvastatin reduced MACE by 24% | MI reduced; stroke reduced; CV death not significant | Statin benefit in intermediate-risk patients without established CVD |

Surrogate Outcomes: When the Proxy Is Not the Prize

A surrogate outcome is a measurable variable used as a proxy for a clinical outcome that is harder to measure. HbA1c is a surrogate for diabetic complications. LDL-C is a surrogate for cardiovascular events. Blood pressure is a surrogate for stroke and MI. Surrogate outcomes are useful for early-phase trials and regulatory approval, but they can mislead when the surrogate does not reliably predict the clinical outcome. The most famous example: the CAST trial (1989) found that antiarrhythmic drugs (flecainide, encainide) suppressed ventricular ectopics (surrogate) but increased mortality (clinical outcome) — the surrogate pointed in the opposite direction to the clinical outcome.

  • Validated surrogates: LDL-C (for CV events), HbA1c (for microvascular complications), blood pressure (for stroke/MI) — these have strong biological plausibility and consistent trial validation
  • Unvalidated surrogates: Bone mineral density (for fractures — not always predictive), tumour shrinkage (for survival — often not predictive), CD4 count (for HIV outcomes — validated in some contexts)
  • The surrogate trap: A drug that improves a surrogate but worsens clinical outcomes is worse than useless — it is harmful. Always ask: has this surrogate been validated in trials that measured clinical outcomes?
  • Regulatory approval vs clinical use: Drugs can be approved on surrogate endpoints (faster, cheaper trials) but may not improve clinical outcomes. FDA accelerated approval pathway has produced several drugs later withdrawn when clinical outcome trials were negative.

Subgroup Analyses: The Garden of Forking Paths

Subgroup analyses divide the trial population into subgroups (by age, sex, diabetes status, baseline risk, etc.) and examine whether the treatment effect differs between subgroups. They are seductive — they seem to offer personalised medicine insights. But they are statistically treacherous. If you divide a trial into 20 subgroups and test each one, you expect 1 to show a statistically significant result by chance alone (at p < 0.05). The more subgroups you test, the more false positives you generate. The ISIS-2 trial famously found that aspirin was ineffective in patients born under the star signs Gemini and Libra — a subgroup analysis of a real trial, illustrating the absurdity of data dredging.

  • Pre-specified vs post-hoc: Pre-specified subgroup analyses (defined in the protocol before the trial started) are more reliable than post-hoc analyses (defined after seeing the data). Always check the trial protocol.
  • Interaction test: A valid subgroup analysis requires a statistically significant interaction test (p for interaction < 0.05) — this tests whether the treatment effect genuinely differs between subgroups, not just whether it is significant in one and not the other.
  • Biological plausibility: A subgroup finding is more credible if there is a plausible biological mechanism. "Drug X works better in patients with high baseline CRP" is more credible if CRP is part of the drug's mechanism of action.
  • Replication: A subgroup finding should be replicated in an independent trial before changing practice. The PLATO trial subgroup suggesting ticagrelor was less effective in North American patients was not replicated and is now considered a chance finding.
  • Clinical application: Even a valid subgroup finding should be applied cautiously — it is based on a subset of the trial population, with reduced statistical power and increased uncertainty.
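The multiplicity problem is easy to verify: under a true null, p-values are uniformly distributed, so the chance that at least one of 20 subgroups crosses p < 0.05 is 1 − 0.95^20 ≈ 64%. A quick simulation (illustrative, seeded for reproducibility):

```python
import random

# Sketch: how often does at least one of 20 subgroup tests cross p < 0.05
# purely by chance? Under a true null, p-values are uniform on [0, 1],
# so we can simulate them by drawing uniform random numbers.

random.seed(42)

def any_false_positive(n_subgroups=20, alpha=0.05):
    """One simulated trial: did any null subgroup come out 'significant'?"""
    return any(random.random() < alpha for _ in range(n_subgroups))

n_sims = 20_000
hit_rate = sum(any_false_positive() for _ in range(n_sims)) / n_sims
print(f"analytic: {1 - 0.95**20:.0%}, simulated: {hit_rate:.0%}")
```

Roughly two trials in three will produce at least one spurious subgroup finding if twenty are tested — the statistical version of the Gemini-and-Libra result.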

Part 5: The 5 Questions Every GP Should Ask

You do not need to read every trial in full. You need a rapid, systematic framework for deciding whether a trial result should change your practice. These five questions can be applied to any trial in under 10 minutes — and they will protect your patients from the harms of premature practice change.

Question 1: Does My Patient Resemble the Trial Population?

This is the most important question, and the one most often skipped. Every trial has inclusion and exclusion criteria that define the population in which the results are valid. If your patient does not resemble that population, the trial result may not apply to them — or may apply in a different direction. The SPRINT trial excluded patients with diabetes, prior stroke, heart failure, and eGFR <20 mL/min/1.73m² — four of the most common comorbidities in hypertensive patients in primary care. The trial also used an unattended automated BP measurement protocol that reads 5–10 mmHg lower than standard clinic measurements. A patient with diabetes, prior stroke, and a clinic BP of 138/88 mmHg is not the SPRINT patient.

  • Check the inclusion criteria: What was the minimum age? What was the required diagnosis? What was the baseline risk level?
  • Check the exclusion criteria: Were patients with your patient's comorbidities excluded? Were elderly patients excluded? Were women underrepresented?
  • Check the baseline characteristics table: What was the mean age? What proportion had diabetes, CKD, prior CVD? What was the mean baseline risk?
  • Check the setting: Was this a specialist centre or primary care? Was it a high-income country? Was the "usual care" comparator comparable to your practice?
  • The generalisability question: "Would my patient have been eligible for this trial?" If the answer is no, apply the results with caution.

Question 2: What Was the Absolute Benefit — and Is It Clinically Meaningful?

Convert every relative risk to an absolute risk reduction. Calculate the NNT. Then ask: is this NNT clinically meaningful given the cost, side effects, and patient burden of the treatment? A drug that reduces cardiovascular events from 10% to 8% (ARR 2%, NNT 50) over 5 years is meaningfully different from a drug that reduces events from 1% to 0.8% (ARR 0.2%, NNT 500) over 5 years — even if both have a relative risk reduction of 20%. The first drug prevents 1 event per 50 patients treated; the second prevents 1 event per 500 patients treated. The clinical decision is very different.

| Scenario | Control Rate | Treatment Rate | RRR | ARR | NNT (5 years) | Clinical Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| High-risk patient (QRISK3 20%) | 20% | 14% | 30% | 6% | 17 | Statin clearly worthwhile — 1 in 17 patients benefits |
| Intermediate-risk patient (QRISK3 10%) | 10% | 7% | 30% | 3% | 33 | Statin worthwhile — discuss with patient |
| Low-risk patient (QRISK3 5%) | 5% | 3.5% | 30% | 1.5% | 67 | Marginal — lifestyle intervention may be preferable |
| Very low-risk patient (QRISK3 2%) | 2% | 1.4% | 30% | 0.6% | 167 | Not worthwhile — NNT too high; harms likely outweigh benefits |

The NNT is time-dependent — it changes with the duration of treatment. An NNT of 100 over 5 years is equivalent to an NNT of 50 over 10 years (assuming constant annual risk reduction). When comparing NNTs across trials, always check the follow-up duration. A trial with a 1-year follow-up will have a higher NNT than a trial with a 5-year follow-up for the same drug — not because the drug is less effective, but because fewer events occur in a shorter time.
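Under the stated constant-annual-ARR assumption, rescaling an NNT to a different treatment duration is one line of arithmetic. A sketch (the `rescale_nnt` helper is illustrative):

```python
# Sketch: rescaling an NNT to a different treatment duration, assuming a
# constant annual absolute risk reduction (the simplification stated above).

def rescale_nnt(nnt: float, years: float, target_years: float) -> float:
    """Convert an NNT observed over `years` to an NNT over `target_years`."""
    arr_per_year = (1 / nnt) / years          # annual absolute risk reduction
    return 1 / (arr_per_year * target_years)

print(f"{rescale_nnt(100, years=5, target_years=10):.0f}")  # 50
print(f"{rescale_nnt(100, years=5, target_years=1):.0f}")   # 500
```

The same drug looks five times less effective in a 1-year trial than in a 5-year trial, which is why NNTs should never be compared across trials without checking follow-up duration.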

Question 3: What Were the Harms — and Were They Fully Reported?

Every effective treatment has harms. The question is not whether harms exist, but whether the benefit-harm ratio is favourable for your patient. Harms are systematically underreported in clinical trials — particularly industry-funded trials. A 2012 Cochrane review found that harms data were inadequately reported in 65% of RCTs. The reasons are structural: trials are powered to detect benefits, not harms; follow-up is often too short to detect long-term harms; and industry sponsors have financial incentives to minimise harm reporting.

  • Look for the Number Needed to Harm (NNH): If not reported, calculate it from the adverse event rates in the safety table
  • Check the duration of follow-up: Short trials (1–2 years) cannot detect long-term harms (e.g., cancer risk, cognitive effects, bone density changes)
  • Check for serious adverse events (SAEs): These are pre-defined events requiring hospitalisation or causing death — they are the most clinically important harms
  • Check for discontinuation rates: High discontinuation due to adverse effects in the treatment arm suggests real-world tolerability is worse than the trial suggests
  • Check for harms in subgroups: Harms may be concentrated in specific subgroups (elderly, renal impairment, drug interactions) that are underrepresented in the trial
  • Look for the trial registration and protocol: Compare reported outcomes with pre-specified outcomes — selective outcome reporting (reporting only favourable outcomes) is a form of research misconduct

Question 4: Is the Result Biologically Plausible and Consistent with Other Evidence?

A single trial, however well-designed, is not sufficient to change practice. The result must be biologically plausible (consistent with what we know about the mechanism of action), consistent with prior evidence (not contradicting well-established findings), and ideally replicated in independent trials. The Bradford Hill criteria — originally developed for establishing causation in epidemiology — provide a useful framework: strength of association, consistency across studies, specificity, temporality, biological gradient (dose-response), plausibility, coherence, experiment (does removing the cause remove the effect?), and analogy.

  • Biological plausibility: Does the mechanism of action explain the observed effect? Statins reduce LDL-C → LDL-C causes atherosclerosis → statins reduce cardiovascular events. Plausible and confirmed.
  • Consistency: Is the result consistent with prior trials of the same drug class? The cardiovascular benefit of SGLT2 inhibitors has been replicated across empagliflozin (EMPA-REG), canagliflozin (CANVAS), dapagliflozin (DECLARE), and ertugliflozin (VERTIS) — consistent class effect.
  • Dose-response: Does a higher dose produce a greater effect? The DiRECT trial showed a clear dose-response between weight loss and T2DM remission — more weight lost, higher remission rate. This strengthens the causal inference.
  • Replication: Has the result been replicated in an independent trial by a different research group? A single positive trial from one group, not yet replicated, should be treated as hypothesis-generating.
  • Coherence: Does the result fit with what we know from other sources (animal studies, mechanistic studies, epidemiology)? A result that contradicts all prior evidence requires extraordinary evidence.

Question 5: Who Funded the Trial — and Does It Matter?

Funding source is not a reason to dismiss a trial, but it is a reason to scrutinise it more carefully. A 2017 systematic review in the BMJ found that industry-funded trials were 4 times more likely to report positive results than independently funded trials, even after adjusting for study quality. The mechanisms are multiple: selective publication (negative trials not published), outcome switching (changing the primary outcome after seeing the data), selective reporting (reporting only favourable secondary outcomes), and spin (framing negative results positively in the abstract and discussion). The most reliable trials are independently funded (NIHR, MRC, NIH), pre-registered, and published with full data transparency.

| Funding Source | Positive Result Rate | Key Risk | How to Mitigate |
| --- | --- | --- | --- |
| Industry-funded | ~85% positive | Publication bias; outcome switching; spin | Check trial registration; compare protocol with published outcomes; look for independent replication |
| Independently funded (NIHR, MRC, NIH) | ~50% positive | Lower risk of bias; more likely to publish negative results | Still check methodology; independent funding does not guarantee quality |
| Investigator-initiated (industry drug, independent design) | Intermediate | Drug supply from industry but independent design and analysis | Check data access agreement; who controlled the analysis? |
| Cochrane systematic review | N/A — synthesises all evidence | Depends on quality of included trials; publication bias in included trials | Check for funnel plot asymmetry; GRADE rating of evidence quality |

ClinicalTrials.gov and the ISRCTN registry allow you to look up the pre-registered protocol for any trial. Compare the pre-specified primary outcome with the published primary outcome. If they differ, the trial may have engaged in outcome switching — a serious form of research misconduct that inflates false positive rates. The AllTrials campaign (alltrials.net) advocates for mandatory registration and publication of all clinical trials.

Part 6: Applying the Framework — The SPRINT Trial Revisited

Let us apply the five questions to the SPRINT trial — the trial that drove the AHA/ACC to lower the hypertension threshold to 130/80 mmHg, and that NICE chose not to follow.

Question | SPRINT Answer | Clinical Implication
1. Does my patient resemble the trial population? | SPRINT excluded: diabetes, prior stroke, HF, eGFR <20, nursing home residents. Used unattended automated BP (reads 5–10 mmHg lower than standard clinic BP). | Most primary care hypertensive patients do NOT resemble the SPRINT population. The BP measurement method is not standard in UK primary care.
2. What was the absolute benefit? | Primary outcome (MACE): 5.2% vs 6.8% over 3.26 years. ARR = 1.6%. NNT = 63 over 3.26 years. CV death: 0.8% vs 1.4%. ARR = 0.6%. NNT = 167. | Meaningful benefit in the SPRINT population — but this is a high-risk population (mean 10-year CVD risk ~20%). In lower-risk patients, the ARR would be much smaller.
3. What were the harms? | Serious adverse events: 38.3% vs 37.1% (not significantly different overall). Hypotension: 2.4% vs 1.4% (p<0.001). Syncope: 2.3% vs 1.7% (p=0.05). AKI: 4.4% vs 2.6% (p<0.001). Electrolyte abnormalities: 3.1% vs 1.5% (p<0.001). | Significant increase in AKI, hypotension, and electrolyte abnormalities. In elderly patients and those with CKD, these harms may outweigh the cardiovascular benefit.
4. Is the result biologically plausible and consistent? | Plausible — lower BP reduces cardiovascular events. But inconsistent with ACCORD (intensive BP in diabetes showed no benefit and possible harm) and the HOT trial (no benefit below 140 mmHg in most patients). | The SPRINT result may be specific to the SPRINT population (high-risk, non-diabetic, non-stroke). It does not generalise to all hypertensive patients.
5. Who funded the trial? | SPRINT was funded by the NIH (National Heart, Lung, and Blood Institute) — independent funding. No industry involvement. | Independent funding reduces publication bias risk. The result is more likely to be genuine. But generalisability questions remain.
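The absolute-benefit arithmetic in Question 2 can be checked in a few lines. This is a minimal sketch using SPRINT's published event rates; the helper functions are illustrative, not from any statistics library.

```python
import math

# Worked check of SPRINT's Question 2 arithmetic.
# Event rates are the published figures; helper names are illustrative.

def arr(control_rate, treatment_rate):
    """Absolute risk reduction: the difference in event rates."""
    return control_rate - treatment_rate

def rrr(control_rate, treatment_rate):
    """Relative risk reduction: ARR as a fraction of the control rate."""
    return arr(control_rate, treatment_rate) / control_rate

def nnt(control_rate, treatment_rate):
    """Number needed to treat: 1 / ARR, rounded up to a whole patient."""
    return math.ceil(1 / arr(control_rate, treatment_rate))

# Primary outcome (MACE) over 3.26 years: 6.8% standard vs 5.2% intensive
print(f"ARR = {arr(0.068, 0.052):.1%}")  # 1.6%
print(f"RRR = {rrr(0.068, 0.052):.0%}")  # ~24%, the headline '25%' figure
print(f"NNT = {nnt(0.068, 0.052)}")      # 63

# CV death: 1.4% standard vs 0.8% intensive
print(f"NNT (CV death) = {nnt(0.014, 0.008)}")  # 167
```

Note how the impressive-sounding relative reduction and the more sober NNT are the same result viewed from different angles.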

NICE's conclusion: the SPRINT result is real and applies to the SPRINT population. But the SPRINT population is not representative of the general hypertensive population in UK primary care. The BP measurement method is not standard. The harms (AKI, hypotension, electrolyte abnormalities) are clinically significant, particularly in elderly patients and those with CKD. The absolute benefit in lower-risk patients (QRISK3 <10%) is too small to justify the harms. Therefore, NICE retained the 140/90 mmHg threshold for most patients, while acknowledging that lower targets may be appropriate for high-risk subgroups. This is not NICE being conservative — it is NICE being rigorous.

Part 7: A Rapid Appraisal Checklist for Busy GPs

You will not always have time for a full critical appraisal. The following 10-point checklist can be completed in under 5 minutes and will identify the most important issues with any trial result.

  • 1. What was the primary outcome? (Not a surrogate — a clinical outcome)
  • 2. What was the absolute risk reduction? (Convert relative risk to absolute risk)
  • 3. What is the NNT? (And over what time period?)
  • 4. What were the serious adverse events? (Calculate NNH)
  • 5. Does my patient resemble the trial population? (Check inclusion/exclusion criteria)
  • 6. Was the trial adequately powered? (Check sample size calculation)
  • 7. Were the groups balanced at baseline? (Check Table 1)
  • 8. Was the primary outcome pre-specified? (Check trial registration)
  • 9. Who funded the trial? (Industry vs independent)
  • 10. Has the result been replicated? (One trial is not enough to change practice)
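Point 4 of the checklist (NNH) mirrors the NNT arithmetic: divide 1 by the absolute risk increase. A minimal sketch using SPRINT's adverse-event rates from Part 6; the function name is illustrative.

```python
def nnh(treatment_rate, control_rate):
    """Number needed to harm: 1 / absolute risk increase,
    rounded to the nearest whole patient."""
    return round(1 / (treatment_rate - control_rate))

# SPRINT harms over 3.26 years (intensive vs standard arm)
print(nnh(0.044, 0.026))  # AKI: ~56 -- one extra AKI per ~56 patients treated
print(nnh(0.024, 0.014))  # hypotension: ~100
```

Setting the NNT of 63 against an NNH of ~56 for AKI makes the benefit-harm trade-off explicit in a way relative risks never do.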

The CASP (Critical Appraisal Skills Programme) provides free, validated checklists for appraising RCTs, systematic reviews, cohort studies, and diagnostic test accuracy studies. Available at casp-uk.net. The BMJ's "How to read a paper" series by Trisha Greenhalgh is the definitive accessible guide to critical appraisal for clinicians — available free online.

Part 8: When to Change Practice — and When to Wait

The hardest clinical skill is knowing when a new trial result should change your practice and when it should not. The following framework provides a structured approach to this decision.

Scenario | Action | Rationale
Single RCT, large effect, high-quality, independent funding, replicated | Change practice | Strong evidence; low risk of false positive; consistent with prior evidence
Single RCT, large effect, industry-funded, not yet replicated | Watch and wait; discuss with patients who ask | Possible true effect but replication needed; publication bias risk
Single RCT, small effect (NNT >200), statistically significant | Do not change practice | Statistical significance ≠ clinical significance; harms may outweigh marginal benefit
Meta-analysis with high heterogeneity (I² >50%) | Interpret with caution; look at individual trials | Pooling heterogeneous trials can produce misleading estimates
Subgroup analysis showing benefit in your patient type | Do not change practice unless pre-specified and replicated | Subgroup analyses are hypothesis-generating; high false positive rate
Surrogate outcome trial (e.g., HbA1c, LDL-C) | Do not change practice until clinical outcome trial available | Surrogate improvement does not guarantee clinical benefit
Observational study showing association | Do not change practice; use as hypothesis for RCT | Association ≠ causation; confounding likely
NICE guideline update based on new evidence | Change practice in line with NICE | NICE has appraised the evidence; follow unless patient-specific reason not to
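The I² figure in the heterogeneity row is derived from Cochran's Q statistic: I² = max(0, (Q - df)/Q) × 100%. Below is a minimal sketch under a fixed-effect, inverse-variance model; the five trials' log risk ratios and variances are invented purely for illustration.

```python
def i_squared(effects, variances):
    """I-squared heterogeneity from Cochran's Q under a fixed-effect,
    inverse-variance model. Returns a percentage."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    return max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

# Five hypothetical trials: log risk ratios with equal variances (invented data)
log_rrs = [-0.5, 0.1, -0.6, 0.2, -0.3]
variances = [0.02] * 5
print(f"I2 = {i_squared(log_rrs, variances):.0f}%")  # well above 50%: pool with caution
```

When I² is this high, the pooled estimate averages trials that may be answering different questions, which is exactly why the table advises looking at the individual trials.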

The most dangerous moment in evidence-based medicine is the period between a single positive trial and its replication — when the result is new, exciting, and widely publicised, but not yet confirmed. This is when premature practice change causes the most harm. The history of medicine is littered with interventions that seemed promising in early trials and were later shown to be ineffective or harmful: hormone replacement therapy for cardiovascular prevention (WHI trial), intensive glycaemic control in T2DM (ACCORD), antiarrhythmics for ventricular ectopics (CAST), and many others. The discipline of waiting for replication before changing practice is not conservatism — it is the highest form of patient advocacy.

The GRADE system (Grading of Recommendations Assessment, Development and Evaluation) provides a transparent framework for rating the quality of evidence and the strength of recommendations. Evidence is rated High, Moderate, Low, or Very Low based on study design, risk of bias, inconsistency, indirectness, imprecision, and publication bias. A strong recommendation requires High or Moderate quality evidence. NICE uses GRADE for all its guideline recommendations.

Key Clinical Takeaways

  • Always convert relative risk to absolute risk reduction — a 50% relative risk reduction from 2% to 1% is an NNT of 100, not a dramatic benefit
  • The NNT is the most clinically useful statistic: it tells you how many patients need treatment for one to benefit — always pair it with the NNH (Number Needed to Harm)
  • P < 0.05 means a result at least this extreme would occur less than 5% of the time if there were no true effect; it does not mean the effect is clinically important, large, or applicable to your patient
  • Confidence intervals tell you about precision as well as statistical significance: if the 95% CI crosses 1.0 (for a relative risk) or 0 (for an absolute risk difference), the result is not statistically significant
  • The 5 questions: Does my patient resemble the trial population? What was the absolute benefit? What were the harms? Is the result biologically plausible and consistent? Who funded the trial?
  • Composite outcomes can inflate apparent benefit — always decompose and examine each component separately
  • Subgroup analyses are hypothesis-generating, not practice-changing — require pre-specification, significant interaction test, and independent replication
  • One trial is not enough to change practice — wait for replication, especially for industry-funded trials with large relative risk reductions
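The confidence-interval takeaway can be made concrete. The standard large-sample 95% CI for a risk ratio is computed on the log scale: ln(RR) ± 1.96 × SE, with SE = sqrt(1/a - 1/n1 + 1/c - 1/n2) for a events among n1 treated patients and c events among n2 controls. A sketch with invented counts:

```python
import math

def risk_ratio_ci(events_tx, n_tx, events_ctrl, n_ctrl, z=1.96):
    """Risk ratio with a large-sample 95% CI, computed on the log scale."""
    rr = (events_tx / n_tx) / (events_ctrl / n_ctrl)
    se = math.sqrt(1 / events_tx - 1 / n_tx + 1 / events_ctrl - 1 / n_ctrl)
    return rr, math.exp(math.log(rr) - z * se), math.exp(math.log(rr) + z * se)

# Invented counts: 40/1000 events on treatment vs 60/1000 on control
rr, lo, hi = risk_ratio_ci(40, 1000, 60, 1000)
print(f"RR {rr:.2f} (95% CI {lo:.2f} to {hi:.2f})")
# Upper limit sits just below 1.0: statistically significant, but only barely
```

A wide interval whose upper limit grazes 1.0 is a reminder that "significant" and "precisely estimated" are not the same thing.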
