Research Methodology · June 25, 2026

Sample Size Calculation with Claude: When to Trust the Output

I got burned on a cluster trial. I fed Claude the basic parameters (expected proportions, alpha, power) and it returned a clean number. It looked reasonable. It passed the IRB. During peer review, a biostatistician flagged it in the opening paragraph of the report: I'd used a parallel-arm formula for a cluster design. The ICC inflation factor was missing. With it applied, the required n was 56% higher than my estimate.

Claude hadn't fabricated anything. It calculated exactly what I asked. That was the problem.

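With AI-assisted sample size calculation, the failure mode is almost never arithmetic. It's design family selection: when the prompt contains only parameters, Claude tends to fall back on the simplest formula that fits them, and it rarely warns you that it has done so.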

Feed the Design Narrative, Not the Parameters

Don't give Claude your parameters and ask for a number. Give Claude your design narrative first and ask it to name the right formula.

Wrong:

"Expected proportions: control 30%, intervention 45%. Alpha 0.05, power 80%. Calculate sample size."

Right:

"Here is my full study design: [paste narrative — allocation method, cluster structure, primary outcome type, randomization unit]. Name the statistical test family appropriate for this design, list every required parameter, then ask for any missing information before calculating."

The second prompt forces Claude to commit to a design family before touching the arithmetic. It also surfaces what it doesn't know. That one shift catches most errors.

Three Design Types and Where It Goes Wrong

Two-arm parallel RCT

Claude performs reliably here. It defaults to the right formula for binary or continuous outcomes, handles two-tailed testing correctly, and usually asks about allocation ratio unprompted. The gaps to verify manually: interim analysis inflation factors and non-compliance rate (separate from expected dropout). Both are frequently omitted.
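For reference, here is a minimal sketch of the arithmetic the "wrong" prompt above is asking for, so you can see what a parallel-arm answer looks like. It assumes equal allocation, a two-sided test, and the unpooled normal approximation for two proportions with no continuity correction. The function name is mine, not from any package.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-arm n for comparing two independent proportions.

    Assumptions (illustrative): 1:1 allocation, two-sided alpha,
    unpooled normal approximation, no continuity correction.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)            # quantile for the target power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


# Parameters from the example prompt: control 30%, intervention 45%.
print(n_per_arm_two_proportions(0.30, 0.45))  # 160 per arm under these assumptions
```

Tools that use a pooled variance or a continuity correction will return a somewhat larger figure, so a small gap against G*Power or R is not by itself a red flag. Interim-analysis inflation, dropout, and non-compliance are not handled here.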

Cluster trial

High error rate. Left to defaults, Claude treats every cluster member as independent. You have to specify the cluster design, provide your ICC estimate, and ask it to compute the design effect (DEFF = 1 + (m−1) × ICC, where m = mean cluster size).

In my case: ICC = 0.04, mean cluster size = 15, so DEFF = 1 + 14 × 0.04 = 1.56. That's a 56% inflation over the naive parallel-arm estimate. Put the other way, Claude's default covered only about 64% (1/1.56) of the required recruitment.

Prompt fix: "This is a cluster RCT. Randomization unit is the cluster. Mean cluster size [m], ICC estimate [ρ]. Compute DEFF and apply it before giving me n."
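The design-effect step is short enough to check yourself. A minimal sketch, assuming roughly equal cluster sizes and using my ICC and cluster size. The naive per-arm n of 160 is a placeholder taken from the 30% vs 45% example prompt, not the figure from my trial, and the function names are mine.

```python
from math import ceil


def design_effect(mean_cluster_size, icc):
    """DEFF = 1 + (m - 1) * ICC, assuming roughly equal cluster sizes."""
    return 1 + (mean_cluster_size - 1) * icc


def cluster_adjusted_n(n_individual, mean_cluster_size, icc):
    """Inflate an individually randomized n by the design effect.

    Returns (participants per arm, clusters per arm), both rounded up.
    """
    n = ceil(n_individual * design_effect(mean_cluster_size, icc))
    return n, ceil(n / mean_cluster_size)


deff = design_effect(15, 0.04)
n_arm, clusters_arm = cluster_adjusted_n(160, 15, 0.04)  # 160 is a placeholder
print(f"DEFF = {deff:.2f}")
print(f"{n_arm} participants per arm, {clusters_arm} clusters per arm")
```

Rounding up to whole clusters matters: you recruit clusters, not individuals, so the cluster count is the number to carry into the protocol.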

Diagnostic accuracy study

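Completely different framework. A diagnostic accuracy study is sized for precision, not power: you choose n so that the confidence interval around sensitivity or specificity clears a target bound, rather than to detect a difference between groups. Claude regularly conflates this with a two-sample proportion test.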

Correct frame: "Estimating sensitivity. Expected sensitivity = [p]%. Target lower 95% CI bound = [x]%. Calculate the required number of disease-positive participants using the Wilson CI for a single proportion. This is not a power calculation."

Without that framing, you get the wrong formula category.
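To make the contrast concrete, here is a sketch of the precision-based approach: search for the smallest n at which the Wilson lower bound clears the target. The expected sensitivity of 90% and target bound of 80% are hypothetical values for illustration, and the function names are mine. The search assumes the observed sensitivity lands exactly on the expected value.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower(p_hat, n, conf=0.95):
    """Lower limit of the Wilson score interval for a single proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    centre = p_hat + z ** 2 / (2 * n)
    margin = z * sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2))
    return (centre - margin) / (1 + z ** 2 / n)


def n_for_lower_bound(expected, target_lower, conf=0.95, n_max=100_000):
    """Smallest n whose Wilson lower bound reaches target_lower."""
    if not 0 < target_lower < expected < 1:
        raise ValueError("need 0 < target_lower < expected < 1")
    for n in range(1, n_max + 1):
        if wilson_lower(expected, n, conf) >= target_lower:
            return n
    raise ValueError("target not reached within n_max")


# Hypothetical inputs: expected sensitivity 90%, lower bound of at least 80%.
n_cases = n_for_lower_bound(0.90, 0.80)
print(n_cases, round(wilson_lower(0.90, n_cases), 4))
```

The result counts disease-positive participants only. Total enrolment depends on prevalence in your recruitment setting, and specificity needs its own calculation on the disease-negative group.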

What Claude Actually Gets Right

Once you've constrained the design family, the arithmetic is clean. Claude also runs sensitivity analyses quickly — how does required n shift with different ICC assumptions or dropout rates? That kind of exploration is genuinely useful. Tedious by hand, fast with Claude.
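The same exploration is easy to reproduce as a cross-check. A minimal sketch of a sensitivity grid over ICC and dropout, assuming a cluster design with mean cluster size 15. The naive per-arm n of 160 and the grid values are placeholders, not figures from my trial, and dropout is handled with the simple n / (1 − dropout) inflation.

```python
from math import ceil


def required_n(n_naive, mean_cluster_size, icc, dropout):
    """Per-arm n after design-effect and dropout inflation (simple approximation)."""
    deff = 1 + (mean_cluster_size - 1) * icc
    return ceil(n_naive * deff / (1 - dropout))


N_NAIVE = 160          # placeholder individually randomized n per arm
CLUSTER_SIZE = 15
ICCS = [0.01, 0.02, 0.04, 0.08]
DROPOUTS = [0.00, 0.10, 0.20]

# Print one row per ICC, one column per dropout rate.
print("ICC    " + "".join(f"drop={d:.0%}".rjust(10) for d in DROPOUTS))
for icc in ICCS:
    row = [required_n(N_NAIVE, CLUSTER_SIZE, icc, d) for d in DROPOUTS]
    print(f"{icc:<7.2f}" + "".join(str(n).rjust(10) for n in row))
```

If the required n swings widely across plausible ICC values, that is a signal to pin down the ICC from pilot data or published estimates before committing to a number.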

Use it as a formula executor with the design constraint set by you. Not as a design consultant.

In practice, that means making Claude spell out the assumptions in plain language before you accept a number: unit of randomization, anticipated attrition, multiplicity, endpoint scale, and whether analysis happens at the individual or cluster level. If any of those are unclear, stop and restate the design first. That extra pause is cheap. It prevents the most embarrassing failure mode: presenting a confident number that answers the wrong question.

The Manual Gate

After any estimate: confirm the formula name matches your actual study structure, verify that dropout and non-compliance are both accounted for, and do a magnitude plausibility check against your resources.
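For the dropout and non-compliance check, a rough sketch of one common pair of adjustments: divide by (1 − dropout) for lost outcome data, and divide by (1 − non-compliance)² for effect dilution under intention-to-treat. The dilution term assumes non-compliers in the intervention arm respond like controls and that nobody in the control arm crosses over. All rates below are hypothetical, and the function name is mine.

```python
from math import ceil


def inflate_for_attrition(n_per_arm, dropout=0.0, noncompliance=0.0):
    """Inflate a per-arm n for dropout and one-way non-compliance.

    Assumptions (illustrative): dropout removes outcome data entirely;
    non-compliance dilutes the effect by (1 - noncompliance), so n
    scales by 1 / (1 - noncompliance) ** 2.
    """
    if not (0 <= dropout < 1 and 0 <= noncompliance < 1):
        raise ValueError("rates must be in [0, 1)")
    diluted = n_per_arm / (1 - noncompliance) ** 2
    return ceil(diluted / (1 - dropout))


# Hypothetical: base n of 160 per arm, 10% dropout, 5% non-compliance.
print(inflate_for_attrition(160, dropout=0.10, noncompliance=0.05))  # 197
```

The point of keeping the two terms separate is that they are different mechanisms. If Claude's output mentions only one of them, ask which one and why the other was left out.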

For pre-registered studies and RCTs, run the same calculation independently in G*Power or the relevant R package. If Claude's output and the external tool disagree by more than rounding or a known difference in approximation (pooled versus unpooled variance, continuity correction), the design family is probably wrong.

The framing problem that precedes calculation, whether your study is powered to answer the right question at all, is covered in "When a Study Is Too Small to Matter." The same attention to assumption documentation applies to your analysis plan: see "Statistical Assumption Check with Claude" for the equivalent audit at the results stage.

The Paper Checker at aiforacademic.world runs a pre-submission audit on your methods section — including whether your reported sample size formula matches your stated design, and whether assumption testing is documented before reviewers ask for it.