The intraclass correlation coefficient (ICC) is the standard measure of reliability for continuous measurements — it quantifies how consistent or interchangeable measurements are when made by different raters, instruments, or test occasions. Unlike Pearson's r (which requires exactly two variables and ignores systematic bias), ICC can handle any number of raters, accounts for both systematic and random sources of disagreement, and is defined as the ratio of between-subject variance to total variance (between-subject plus rater and residual error variance). ICC ranges from 0 (no reliability — all variance is random error) to 1 (perfect reliability — all variance reflects true between-subject differences).
The key conceptual distinction in ICC is between consistency and absolute agreement. Consistency asks whether raters rank subjects in the same order — it ignores systematic bias (one rater always scoring 3 points higher than another). Absolute agreement additionally requires that raters give the same numerical values — it penalizes systematic differences between raters. For instrument interchangeability (e.g., can device A replace device B?), absolute agreement is appropriate. For assessing whether raters can discriminate between subjects (e.g., ranking pain severity), consistency is sufficient. Shrout and Fleiss (1979) and McGraw and Wong (1996) provide the definitive taxonomy: ICC(1,1) — one-way random, single measures; ICC(2,1) — two-way random, absolute agreement, single measures; ICC(3,1) — two-way mixed, consistency, single measures; and their average-measures counterparts ICC(1,k), ICC(2,k), ICC(3,k).
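To make the taxonomy concrete, here is a minimal Python sketch of the single-measures formulas, computing ICC(2,1) and ICC(3,1) from the two-way ANOVA mean squares as defined by Shrout and Fleiss; the ratings matrix is hypothetical.

```python
import numpy as np

# Hypothetical wide-format ratings: rows = subjects, columns = raters.
scores = np.array([
    [6.2, 5.8, 6.0],
    [8.5, 8.9, 8.3],
    [4.1, 4.4, 4.6],
    [7.0, 7.4, 7.1],
    [5.5, 5.2, 5.9],
])
n, k = scores.shape                         # n subjects, k raters

grand = scores.mean()
ss_subjects = k * np.sum((scores.mean(axis=1) - grand) ** 2)
ss_raters = n * np.sum((scores.mean(axis=0) - grand) ** 2)
ss_error = np.sum((scores - grand) ** 2) - ss_subjects - ss_raters

msr = ss_subjects / (n - 1)                 # between-subject mean square
msc = ss_raters / (k - 1)                   # between-rater mean square
mse = ss_error / ((n - 1) * (k - 1))        # residual mean square

# Shrout & Fleiss (1979) single-measures formulas:
icc_2_1 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)  # absolute agreement
icc_3_1 = (msr - mse) / (msr + (k - 1) * mse)                        # consistency

print(f"ICC(2,1) absolute agreement: {icc_2_1:.3f}")
print(f"ICC(3,1) consistency:        {icc_3_1:.3f}")
```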
A practical example: three physical therapists rate shoulder abduction range-of-motion in 30 patients. ICC(2,1) absolute agreement = 0.89 (95% CI: 0.82–0.94), classified as "good" reliability. The profile plot reveals that Rater 2 consistently scores 3° higher than the other two raters — a systematic bias that makes absolute agreement lower than consistency (ICC(2,1) consistency = 0.92). The standard error of measurement (SEM = SD × √(1−ICC)) = 4.2° indicates the typical measurement error for a single rater. The minimal detectable change (MDC₉₅ = 1.96 × √2 × SEM) = 11.7° gives the threshold above which a change in a patient's score can be attributed to a real change rather than measurement error.
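A quick check of the SEM and MDC₉₅ arithmetic from this example; the SD used below (≈12.7°) is a hypothetical value chosen so that SD × √(1−0.89) roughly reproduces the reported SEM.

```python
import math

# Hypothetical inputs: pooled SD of the measurements and the estimated ICC.
sd, icc = 12.7, 0.89                      # SD chosen to roughly match the example above

sem = sd * math.sqrt(1 - icc)             # standard error of measurement
mdc95 = 1.96 * math.sqrt(2) * sem         # minimal detectable change, 95% confidence

print(f"SEM   = {sem:.1f} deg")           # ~4.2 deg
print(f"MDC95 = {mdc95:.1f} deg")         # ~11.7 deg
```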
| Column | Description | Example |
|---|---|---|
| subject | Subject/item identifier | P001, P002 |
| rater1 | Scores from rater 1 | 6.2, 8.5, 4.1 |
| rater2 | Scores from rater 2 | 5.8, 8.9, 4.4 |
| rater3 | Optional: additional rater | 6.0, 8.3, 4.6 |
Wide format only (one row per subject, one column per rater). If data are in long format (one row per rating), ask the AI to pivot to wide format first. All measurements must be on the same numeric scale. Missing values exclude that subject from the analysis.
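If your data arrive in long format, a one-line pandas pivot (sketched below with made-up values) produces the wide layout described above.

```python
import pandas as pd

# Hypothetical long-format data: one row per rating.
long_df = pd.DataFrame({
    "subject": ["P001", "P001", "P001", "P002", "P002", "P002"],
    "rater":   ["rater1", "rater2", "rater3"] * 2,
    "score":   [6.2, 5.8, 6.0, 8.5, 8.9, 8.3],
})

# One row per subject, one column per rater -- the wide layout described above.
wide_df = long_df.pivot(index="subject", columns="rater", values="score").reset_index()
print(wide_df)
```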
| Output | What it means |
|---|---|
| ICC point estimate | Proportion of total variance due to true subject differences — higher = more reliable |
| 95% CI | Uncertainty in the ICC estimate — always report; wide CI with small n is common |
| ICC model | Which of the 6 Shrout-Fleiss models was used — must match the study design |
| Reliability classification | Poor < 0.50 · Moderate 0.50–0.75 · Good 0.75–0.90 · Excellent ≥ 0.90 (Koo & Li, 2016) |
| SEM | Standard Error of Measurement = SD × √(1−ICC) — absolute measurement precision in original units |
| MDC₉₅ | Minimal Detectable Change = 1.96 × √2 × SEM — smallest real change detectable above measurement noise |
| Variance components | Between-subject, between-rater, and residual variance — identifies the primary source of unreliability (see the sketch below this table) |
| Profile plot | Rater scores per subject with group means — reveals systematic rater bias visually |
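As a sketch of how the variance-components output might be reproduced, the snippet below fits the underlying two-way ANOVA with statsmodels and converts the mean squares to variance components via their expected values; the data frame, column names, and values are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per rating (subject, rater, score).
long_df = pd.DataFrame({
    "subject": ["P001"] * 3 + ["P002"] * 3 + ["P003"] * 3 + ["P004"] * 3,
    "rater":   ["rater1", "rater2", "rater3"] * 4,
    "score":   [6.2, 5.8, 6.0, 8.5, 8.9, 8.3, 4.1, 4.4, 4.6, 7.0, 7.4, 7.1],
})
n = long_df["subject"].nunique()
k = long_df["rater"].nunique()

# Two-way ANOVA with subjects and raters as crossed factors (no interaction).
fit = smf.ols("score ~ C(subject) + C(rater)", data=long_df).fit()
anova = sm.stats.anova_lm(fit, typ=2)
msr = anova.loc["C(subject)", "sum_sq"] / anova.loc["C(subject)", "df"]
msc = anova.loc["C(rater)", "sum_sq"] / anova.loc["C(rater)", "df"]
mse = anova.loc["Residual", "sum_sq"] / anova.loc["Residual", "df"]

# Variance components from the expected mean squares (negative estimates set to 0).
var_subject = max((msr - mse) / k, 0.0)
var_rater = max((msc - mse) / n, 0.0)
var_error = mse
total = var_subject + var_rater + var_error
for name, v in [("subject", var_subject), ("rater", var_rater), ("error", var_error)]:
    print(f"{name:>7}: {v:6.3f}  ({100 * v / total:5.1f}% of total)")
```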
| Scenario | What to type |
|---|---|
| Basic 2-rater ICC | 2 raters, 25 subjects; ICC(2,1) absolute agreement; 95% CI; classify reliability; scatter plot rater1 vs rater2 |
| Test-retest | test and retest scores for same instrument; ICC(3,1) consistency; SEM; MDC95; Bland-Altman plot of difference vs mean |
| 3+ raters | 4 raters, 30 subjects; ICC(2,1) and ICC(2,4) average measures; compare single vs average reliability; profile plot |
| Variance components | compute variance components: between-subject, between-rater, residual; pie chart; identify main source of unreliability |
| Absolute vs consistency | compute both ICC(2,1) absolute and ICC(2,1) consistency; compare; if they differ substantially, report rater bias |
| SEM and MDC | ICC(2,1); compute SEM and MDC95 in original units; interpret: what score change is clinically meaningful vs noise? |
| Two-way ANOVA table | full two-way ANOVA table underlying the ICC: SS, df, MS for subjects, raters, residual; F-tests |
| Minimum sample size | how many subjects needed to estimate ICC ≥ 0.75 with CI width ≤ 0.20? compute with 3 raters |
Use the Bland-Altman Plot Generator alongside ICC — ICC quantifies the proportion of variance due to subjects, while the Bland-Altman plot visualizes the actual magnitude of disagreement between two measurement methods in original units; for method comparison studies, both are required. Use the Cronbach's Alpha Calculator when items are parallel indicators of a latent construct (psychometric reliability) rather than repeated measurements of the same physical quantity — Cronbach's alpha and ICC(3,k) consistency are mathematically equivalent in the two-way mixed model. Use the Linear Mixed Effects Model Calculator when ICC is a secondary output (the model's random intercept variance / total variance) in a longitudinal study where the primary goal is estimating fixed effects. Use the Power Analysis Calculator to determine sample size needed to estimate ICC with a target CI width.
Which ICC model should I use? Follow the Koo and Li (2016) decision tree: (1) Are all subjects rated by the same raters? If NO → ICC(1,1). If YES → (2) Are the raters a random sample from a larger pool (you want to generalize to new raters)? If YES → ICC(2,1). If NO (these specific raters are the only ones of interest) → ICC(3,1). Then: (3) Is the goal to use a single rater's score in practice? → single measures ICC (the k=1 version). Is the goal to average multiple raters' scores? → average measures ICC(k). For clinical outcome measurement validation, ICC(2,1) absolute agreement is typically most appropriate because you want the measurement to be reliable across any trained clinician, not just the specific raters in the study.
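The branching above can be written down as a small helper, purely to make the logic explicit; the function name and arguments are ours for illustration, not a standard API.

```python
def choose_icc_model(same_raters_for_all_subjects: bool,
                     raters_are_random_sample: bool,
                     score_used_in_practice_is_average: bool,
                     k: int) -> str:
    """Return a Shrout-Fleiss label following the decision tree above (illustrative only)."""
    if not same_raters_for_all_subjects:
        base = "ICC(1"          # one-way random effects
    elif raters_are_random_sample:
        base = "ICC(2"          # two-way random effects
    else:
        base = "ICC(3"          # two-way mixed effects
    return base + (f",{k})" if score_used_in_practice_is_average else ",1)")

# Clinical validation: any trained clinician may rate, and a single score is used in practice.
print(choose_icc_model(True, True, False, k=3))   # -> ICC(2,1)
```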
What is the difference between SEM and MDC? The Standard Error of Measurement (SEM) = SD × √(1−ICC) quantifies the typical magnitude of measurement error for a single measurement — it is in the same units as the original scale. A patient's true score is estimated to be within ±1.96 × SEM of their observed score with 95% probability. The Minimal Detectable Change (MDC₉₅) = 1.96 × √2 × SEM is the smallest change in score between two measurement occasions that exceeds measurement error with 95% confidence — any change smaller than MDC₉₅ cannot be distinguished from random fluctuation. Example: if SEM = 3 points on a pain scale, MDC₉₅ = 1.96 × √2 × 3 = 8.3 points — a patient's pain must decrease by at least 8.3 points to confidently claim a real improvement. MDC should be contrasted with the Minimal Clinically Important Difference (MCID), which is derived from patient-anchored judgments of meaningful change rather than from measurement-error statistics.
My ICC is high (0.85) but raters clearly disagree — what went wrong? High ICC with visible rater disagreement usually means: (1) large between-subject variance — if subjects vary enormously in their true scores, even large absolute rater differences produce a high ICC (ICC measures relative agreement, not absolute agreement); (2) wrong ICC model — if you used consistency ICC when absolute agreement was needed, systematic rater biases are ignored; (3) small absolute error relative to range — with a 0–100 scale and subjects spanning the full range, 5-point rater differences produce high ICC even though 5 points may be clinically meaningful. Always report both ICC and SEM/MDC together — ICC alone is insufficient for clinical decision-making.
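A small simulation sketch of points (1) and (3): with subjects spanning a 0–100 scale, a constant 5-point rater offset leaves both ICC forms high even though the absolute error may matter clinically. All parameters below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 40, 2
true = rng.uniform(0, 100, n)                       # subjects span the full 0-100 scale
scores = np.column_stack([
    true + rng.normal(0, 3, n),                     # rater 1
    true + 5 + rng.normal(0, 3, n),                 # rater 2: constant +5 bias
])

# Two-way ANOVA mean squares (same algebra as the sketch earlier on this page).
grand = scores.mean()
ss_subj = k * np.sum((scores.mean(axis=1) - grand) ** 2)
ss_rater = n * np.sum((scores.mean(axis=0) - grand) ** 2)
ss_err = np.sum((scores - grand) ** 2) - ss_subj - ss_rater
msr, msc, mse = ss_subj / (n - 1), ss_rater / (k - 1), ss_err / ((n - 1) * (k - 1))

icc_c = (msr - mse) / (msr + (k - 1) * mse)                          # consistency
icc_a = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)    # absolute agreement
sem = scores.std(ddof=1) * np.sqrt(1 - icc_a)

print(f"ICC consistency: {icc_c:.3f}")    # very high: the offset is invisible to consistency
print(f"ICC agreement:   {icc_a:.3f}")    # still high: the bias is tiny vs. the subject spread
print(f"SEM:             {sem:.1f} pts")  # yet single-measurement error is still several points
```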
How many raters and subjects do I need? The precision of the ICC estimate (CI width) depends on both the number of subjects (n) and raters (k). Adding subjects narrows the CI more efficiently than adding raters. Rough guidance for ICC ≈ 0.70–0.85: to achieve 95% CI width ≤ 0.20, you need approximately n = 30 subjects with k = 2 raters, or n = 20 with k = 3 raters. For CI width ≤ 0.10 (regulatory-grade precision): n ≈ 100 with k = 2, or n ≈ 60 with k = 3. Use a dedicated ICC sample size calculator (e.g., Bonett's method) for exact calculations with your target ICC and acceptable CI width.
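For planning purposes, here is a rough sketch of a Bonett-style sample-size approximation; the formula is written as we understand the published approximation, so treat it as a planning aid and verify against a dedicated ICC sample-size calculator before relying on it.

```python
import math
from statistics import NormalDist

def subjects_needed(rho: float, k: int, width: float, alpha: float = 0.05) -> int:
    """Approximate n for a target ICC confidence-interval width (Bonett-style
    approximation as sketched here -- verify with a dedicated tool)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n = 8 * z ** 2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2 / (k * (k - 1) * width ** 2) + 1
    return math.ceil(n)

print(subjects_needed(rho=0.85, k=2, width=0.20))   # ~31, close to the n = 30 guideline above
print(subjects_needed(rho=0.85, k=3, width=0.20))   # ~22, close to the n = 20 guideline above
```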