[{"data":1,"prerenderedAt":982},["ShallowReactive",2],{"content-query-6s597ReCuv":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"heading":10,"prompt":11,"tags":15,"files":17,"nav":6,"presets":18,"gallery":36,"body":38,"_type":975,"_id":976,"_source":977,"_file":978,"_stem":979,"_extension":980,"sitemap":981},"/tools/ab-test-calculator","tools",false,"","A/B Test Calculator for Conversion Experiments","Analyze A/B tests online from Excel or CSV data. Calculate conversion rates, lift, p-values, confidence intervals, and sample size checks with AI.","A/B Test Calculator",{"prefix":12,"label":13,"placeholder":14},"Analyze A/B test results","Describe your test variants and metric","e.g. Control vs Variant B, binary conversion outcome; compute conversion rates, z-test p-value, 95% CI, lift %; bar chart with CI; check if sample size was sufficient for 80% power",[16],"statistics",true,[19,25,30],{"label":20,"prompt":21,"dataset_url":22,"dataset_title":23,"dataset_citation":24},"Email Campaign Conversion Test","A/B test: control vs variant email; binary conversion outcome (0/1); compute conversion rates, z-test for proportions, p-value, 95% CI, relative lift %; bar chart with CI error bars; sampling distribution plot; was sample size adequate for 80% power?","https://data.cdc.gov/api/views/iuq5-y9ct/rows.csv?accessType=DOWNLOAD","NHANES Mental Health Assessment","CDC",{"label":26,"prompt":27,"dataset_url":28,"dataset_title":29,"dataset_citation":24},"Health Intervention Comparison","A/B test comparing two health interventions: binary outcome (improved/not improved); proportion z-test with continuity correction; relative risk and odds ratio; 95% CI; p-value; minimum detectable effect at observed sample size","https://data.cdc.gov/api/views/dppn-5tm3/rows.csv?accessType=DOWNLOAD","NCHS Health and Nutrition Examination Survey",{"label":31,"prompt":32,"dataset_url":33,"dataset_title":34,"dataset_citation":35},"Multi-Variant Test 
(A/B/C)","Multi-variant test with 3 groups (A/B/C); pairwise proportion z-tests with Bonferroni correction for multiple comparisons; conversion rate per variant; bar chart; identify winning variant; Bonferroni-adjusted p-values","https://ourworldindata.org/grapher/happiness-cantril-ladder.csv","Self-Reported Life Satisfaction","Our World in Data",[37],"/img/tools/ab-test-calculator.png",{"type":39,"children":40,"toc":964},"root",[41,50,78,125,145,151,225,231,414,426,432,582,588,746,752,820,826,869,875,892,916,947],{"type":42,"tag":43,"props":44,"children":46},"element","h2",{"id":45},"what-is-an-ab-test",[47],{"type":48,"value":49},"text","What Is an A/B Test?",{"type":42,"tag":51,"props":52,"children":53},"p",{},[54,56,62,64,69,71,76],{"type":48,"value":55},"An ",{"type":42,"tag":57,"props":58,"children":59},"strong",{},[60],{"type":48,"value":61},"A/B test",{"type":48,"value":63}," (also called a split test or controlled experiment) is a randomized experiment that compares two (or more) versions of a product, message, or intervention by exposing different groups of users to each version and measuring which performs better on a target metric. The core logic is identical to a clinical randomized controlled trial applied to digital products and marketing: the ",{"type":42,"tag":57,"props":65,"children":66},{},[67],{"type":48,"value":68},"control group",{"type":48,"value":70}," (A) receives the current version, the ",{"type":42,"tag":57,"props":72,"children":73},{},[74],{"type":48,"value":75},"treatment group",{"type":48,"value":77}," (B) receives the changed version, and outcomes (click-through rate, purchase conversion, sign-up rate, revenue) are compared after sufficient data has been collected. 
The statistical test determines whether the observed difference in performance could plausibly be explained by chance sampling variation or represents a genuine effect of the change.",{"type":42,"tag":51,"props":79,"children":80},{},[81,83,88,90,95,97,102,104,109,111,116,118,123],{"type":48,"value":82},"For ",{"type":42,"tag":57,"props":84,"children":85},{},[86],{"type":48,"value":87},"binary outcomes",{"type":48,"value":89}," (converted / not converted), the appropriate test is the ",{"type":42,"tag":57,"props":91,"children":92},{},[93],{"type":48,"value":94},"two-proportion z-test",{"type":48,"value":96},": it computes a z-statistic based on the observed proportions and their pooled standard error, then derives a p-value — the probability of observing a difference at least this large if the true conversion rates were equal. For ",{"type":42,"tag":57,"props":98,"children":99},{},[100],{"type":48,"value":101},"continuous outcomes",{"type":48,"value":103}," (revenue per user, time on page), a ",{"type":42,"tag":57,"props":105,"children":106},{},[107],{"type":48,"value":108},"two-sample t-test",{"type":48,"value":110}," is used instead. The ",{"type":42,"tag":57,"props":112,"children":113},{},[114],{"type":48,"value":115},"relative lift",{"type":48,"value":117}," — (p_B − p_A) / p_A × 100% — is the most business-relevant effect size: it expresses how much better (or worse) the variant performs relative to the control in percentage terms. A statistically significant test tells you the difference is unlikely due to chance; ",{"type":42,"tag":57,"props":119,"children":120},{},[121],{"type":48,"value":122},"lift",{"type":48,"value":124}," tells you the magnitude of the business impact.",{"type":42,"tag":51,"props":126,"children":127},{},[128,130,135,137,143],{"type":48,"value":129},"A concrete example: an e-commerce site tests a new checkout button color. Control: 4,821 users, 412 conversions (8.54%). Variant: 4,956 users, 531 conversions (10.71%). 
Lift = +25.4%, z = 3.63, p = 0.0003 — the result is highly significant. However, a critical follow-up question is whether the test was ",{"type":42,"tag":57,"props":131,"children":132},{},[133],{"type":48,"value":134},"pre-registered with a predetermined sample size",{"type":48,"value":136},": if the team peeked at the data daily and stopped the test early when it looked significant, the reported p-value is invalid due to multiple looks (the \"peeking problem\"). Proper A/B testing requires deciding the sample size ",{"type":42,"tag":138,"props":139,"children":140},"em",{},[141],{"type":48,"value":142},"before",{"type":48,"value":144}," running the test, based on the minimum detectable effect and desired power.",{"type":42,"tag":43,"props":146,"children":148},{"id":147},"how-it-works",[149],{"type":48,"value":150},"How It Works",{"type":42,"tag":152,"props":153,"children":154},"ol",{},[155,183,198],{"type":42,"tag":156,"props":157,"children":158},"li",{},[159,164,166,173,175,181],{"type":42,"tag":57,"props":160,"children":161},{},[162],{"type":48,"value":163},"Upload your data",{"type":48,"value":165}," — provide a CSV or Excel file with one row per user/observation, a column indicating the variant assignment (e.g., ",{"type":42,"tag":167,"props":168,"children":170},"code",{"className":169},[],[171],{"type":48,"value":172},"group",{"type":48,"value":174},": \"control\"/\"variant\"), and a column for the outcome (e.g., ",{"type":42,"tag":167,"props":176,"children":178},{"className":177},[],[179],{"type":48,"value":180},"converted",{"type":48,"value":182},": 0/1). Aggregate summaries (just counts and totals) can also be described directly in the prompt.",{"type":42,"tag":156,"props":184,"children":185},{},[186,191,193],{"type":42,"tag":57,"props":187,"children":188},{},[189],{"type":48,"value":190},"Describe the test",{"type":48,"value":192}," — e.g. 
",{"type":42,"tag":138,"props":194,"children":195},{},[196],{"type":48,"value":197},"\"control vs variant, binary conversion outcome in column 'converted', group column is 'group'; z-test for proportions; p-value, 95% CI, relative lift; bar chart; check if sample size was adequate for 80% power\"",{"type":42,"tag":156,"props":199,"children":200},{},[201,206,208,215,217,223],{"type":42,"tag":57,"props":202,"children":203},{},[204],{"type":48,"value":205},"Get full results",{"type":48,"value":207}," — the AI writes Python code using ",{"type":42,"tag":209,"props":210,"children":212},"a",{"href":211},"https://docs.scipy.org/doc/scipy/",[213],{"type":48,"value":214},"scipy",{"type":48,"value":216}," and ",{"type":42,"tag":209,"props":218,"children":220},{"href":219},"https://plotly.com/python/",[221],{"type":48,"value":222},"Plotly",{"type":48,"value":224}," to compute conversion rates, run the significance test, compute confidence intervals, and produce the conversion rate bar chart and sampling distribution visualization",{"type":42,"tag":43,"props":226,"children":228},{"id":227},"required-data-format",[229],{"type":48,"value":230},"Required Data 
Format",{"type":42,"tag":232,"props":233,"children":234},"table",{},[235,259],{"type":42,"tag":236,"props":237,"children":238},"thead",{},[239],{"type":42,"tag":240,"props":241,"children":242},"tr",{},[243,249,254],{"type":42,"tag":244,"props":245,"children":246},"th",{},[247],{"type":48,"value":248},"Column",{"type":42,"tag":244,"props":250,"children":251},{},[252],{"type":48,"value":253},"Description",{"type":42,"tag":244,"props":255,"children":256},{},[257],{"type":48,"value":258},"Example",{"type":42,"tag":260,"props":261,"children":262},"tbody",{},[263,314,349,388],{"type":42,"tag":240,"props":264,"children":265},{},[266,275,280],{"type":42,"tag":267,"props":268,"children":269},"td",{},[270],{"type":42,"tag":167,"props":271,"children":273},{"className":272},[],[274],{"type":48,"value":172},{"type":42,"tag":267,"props":276,"children":277},{},[278],{"type":48,"value":279},"Variant assignment",{"type":42,"tag":267,"props":281,"children":282},{},[283,289,291,297,299,305,306,312],{"type":42,"tag":167,"props":284,"children":286},{"className":285},[],[287],{"type":48,"value":288},"control",{"type":48,"value":290},", ",{"type":42,"tag":167,"props":292,"children":294},{"className":293},[],[295],{"type":48,"value":296},"variant",{"type":48,"value":298}," (or ",{"type":42,"tag":167,"props":300,"children":302},{"className":301},[],[303],{"type":48,"value":304},"A",{"type":48,"value":290},{"type":42,"tag":167,"props":307,"children":309},{"className":308},[],[310],{"type":48,"value":311},"B",{"type":48,"value":313},")",{"type":42,"tag":240,"props":315,"children":316},{},[317,325,330],{"type":42,"tag":267,"props":318,"children":319},{},[320],{"type":42,"tag":167,"props":321,"children":323},{"className":322},[],[324],{"type":48,"value":180},{"type":42,"tag":267,"props":326,"children":327},{},[328],{"type":48,"value":329},"Binary 
outcome",{"type":42,"tag":267,"props":331,"children":332},{},[333,339,341,347],{"type":42,"tag":167,"props":334,"children":336},{"className":335},[],[337],{"type":48,"value":338},"1",{"type":48,"value":340}," (converted) or ",{"type":42,"tag":167,"props":342,"children":344},{"className":343},[],[345],{"type":48,"value":346},"0",{"type":48,"value":348}," (not converted)",{"type":42,"tag":240,"props":350,"children":351},{},[352,361,366],{"type":42,"tag":267,"props":353,"children":354},{},[355],{"type":42,"tag":167,"props":356,"children":358},{"className":357},[],[359],{"type":48,"value":360},"revenue",{"type":42,"tag":267,"props":362,"children":363},{},[364],{"type":48,"value":365},"Optional: continuous outcome",{"type":42,"tag":267,"props":367,"children":368},{},[369,374,375,381,382],{"type":42,"tag":167,"props":370,"children":372},{"className":371},[],[373],{"type":48,"value":346},{"type":48,"value":290},{"type":42,"tag":167,"props":376,"children":378},{"className":377},[],[379],{"type":48,"value":380},"29.99",{"type":48,"value":290},{"type":42,"tag":167,"props":383,"children":385},{"className":384},[],[386],{"type":48,"value":387},"149.00",{"type":42,"tag":240,"props":389,"children":390},{},[391,400,405],{"type":42,"tag":267,"props":392,"children":393},{},[394],{"type":42,"tag":167,"props":395,"children":397},{"className":396},[],[398],{"type":48,"value":399},"user_id",{"type":42,"tag":267,"props":401,"children":402},{},[403],{"type":48,"value":404},"Optional: unique identifier",{"type":42,"tag":267,"props":406,"children":407},{},[408],{"type":42,"tag":167,"props":409,"children":411},{"className":410},[],[412],{"type":48,"value":413},"U12345",{"type":42,"tag":51,"props":415,"children":416},{},[417,419,424],{"type":48,"value":418},"Any column names work — describe them in your prompt. 
For aggregate data (you only have totals, not individual rows), describe the numbers directly: ",{"type":42,"tag":138,"props":420,"children":421},{},[422],{"type":48,"value":423},"\"Control: 4821 users, 412 conversions; Variant: 4956 users, 531 conversions\"",{"type":48,"value":425},".",{"type":42,"tag":43,"props":427,"children":429},{"id":428},"interpreting-the-results",[430],{"type":48,"value":431},"Interpreting the Results",{"type":42,"tag":232,"props":433,"children":434},{},[435,451],{"type":42,"tag":236,"props":436,"children":437},{},[438],{"type":42,"tag":240,"props":439,"children":440},{},[441,446],{"type":42,"tag":244,"props":442,"children":443},{},[444],{"type":48,"value":445},"Output",{"type":42,"tag":244,"props":447,"children":448},{},[449],{"type":48,"value":450},"What it means",{"type":42,"tag":260,"props":452,"children":453},{},[454,470,486,502,518,534,550,566],{"type":42,"tag":240,"props":455,"children":456},{},[457,465],{"type":42,"tag":267,"props":458,"children":459},{},[460],{"type":42,"tag":57,"props":461,"children":462},{},[463],{"type":48,"value":464},"Conversion rate (p_A, p_B)",{"type":42,"tag":267,"props":466,"children":467},{},[468],{"type":48,"value":469},"Proportion of users who converted in each group",{"type":42,"tag":240,"props":471,"children":472},{},[473,481],{"type":42,"tag":267,"props":474,"children":475},{},[476],{"type":42,"tag":57,"props":477,"children":478},{},[479],{"type":48,"value":480},"Absolute difference",{"type":42,"tag":267,"props":482,"children":483},{},[484],{"type":48,"value":485},"p_B − p_A — the raw percentage point improvement",{"type":42,"tag":240,"props":487,"children":488},{},[489,497],{"type":42,"tag":267,"props":490,"children":491},{},[492],{"type":42,"tag":57,"props":493,"children":494},{},[495],{"type":48,"value":496},"Relative lift",{"type":42,"tag":267,"props":498,"children":499},{},[500],{"type":48,"value":501},"(p_B − p_A) / p_A × 100% — how much better the variant is relative to 
control",{"type":42,"tag":240,"props":503,"children":504},{},[505,513],{"type":42,"tag":267,"props":506,"children":507},{},[508],{"type":42,"tag":57,"props":509,"children":510},{},[511],{"type":48,"value":512},"z-statistic",{"type":42,"tag":267,"props":514,"children":515},{},[516],{"type":48,"value":517},"Test statistic comparing the two proportions under the null hypothesis of no difference",{"type":42,"tag":240,"props":519,"children":520},{},[521,529],{"type":42,"tag":267,"props":522,"children":523},{},[524],{"type":42,"tag":57,"props":525,"children":526},{},[527],{"type":48,"value":528},"p-value",{"type":42,"tag":267,"props":530,"children":531},{},[532],{"type":48,"value":533},"Probability of observing a difference this large if the null (no effect) is true — p \u003C 0.05 is the conventional significance threshold",{"type":42,"tag":240,"props":535,"children":536},{},[537,545],{"type":42,"tag":267,"props":538,"children":539},{},[540],{"type":42,"tag":57,"props":541,"children":542},{},[543],{"type":48,"value":544},"95% CI on difference",{"type":42,"tag":267,"props":546,"children":547},{},[548],{"type":48,"value":549},"Confidence interval for the true difference p_B − p_A — excludes 0 = significant",{"type":42,"tag":240,"props":551,"children":552},{},[553,561],{"type":42,"tag":267,"props":554,"children":555},{},[556],{"type":42,"tag":57,"props":557,"children":558},{},[559],{"type":48,"value":560},"Statistical power",{"type":42,"tag":267,"props":562,"children":563},{},[564],{"type":48,"value":565},"Probability the test would detect a true effect of the observed size (retrospective power)",{"type":42,"tag":240,"props":567,"children":568},{},[569,577],{"type":42,"tag":267,"props":570,"children":571},{},[572],{"type":42,"tag":57,"props":573,"children":574},{},[575],{"type":48,"value":576},"Required sample size",{"type":42,"tag":267,"props":578,"children":579},{},[580],{"type":48,"value":581},"N per group needed to detect a given minimum effect with target 
power",{"type":42,"tag":43,"props":583,"children":585},{"id":584},"example-prompts",[586],{"type":48,"value":587},"Example Prompts",{"type":42,"tag":232,"props":589,"children":590},{},[591,607],{"type":42,"tag":236,"props":592,"children":593},{},[594],{"type":42,"tag":240,"props":595,"children":596},{},[597,602],{"type":42,"tag":244,"props":598,"children":599},{},[600],{"type":48,"value":601},"Scenario",{"type":42,"tag":244,"props":603,"children":604},{},[605],{"type":48,"value":606},"What to type",{"type":42,"tag":260,"props":608,"children":609},{},[610,627,644,661,678,695,712,729],{"type":42,"tag":240,"props":611,"children":612},{},[613,618],{"type":42,"tag":267,"props":614,"children":615},{},[616],{"type":48,"value":617},"Basic conversion test",{"type":42,"tag":267,"props":619,"children":620},{},[621],{"type":42,"tag":167,"props":622,"children":624},{"className":623},[],[625],{"type":48,"value":626},"control vs variant, binary conversion column; z-test; p-value; relative lift; bar chart with 95% CI",{"type":42,"tag":240,"props":628,"children":629},{},[630,635],{"type":42,"tag":267,"props":631,"children":632},{},[633],{"type":48,"value":634},"Two-sided vs one-sided",{"type":42,"tag":267,"props":636,"children":637},{},[638],{"type":42,"tag":167,"props":639,"children":641},{"className":640},[],[642],{"type":48,"value":643},"one-sided z-test (variant > control); p-value for directional hypothesis; 95% one-sided CI",{"type":42,"tag":240,"props":645,"children":646},{},[647,652],{"type":42,"tag":267,"props":648,"children":649},{},[650],{"type":48,"value":651},"Continuous outcome",{"type":42,"tag":267,"props":653,"children":654},{},[655],{"type":42,"tag":167,"props":656,"children":658},{"className":657},[],[659],{"type":48,"value":660},"revenue per user is continuous; two-sample t-test; mean revenue by group; 95% CI on difference; Cohen's d effect 
size",{"type":42,"tag":240,"props":662,"children":663},{},[664,669],{"type":42,"tag":267,"props":665,"children":666},{},[667],{"type":48,"value":668},"Multiple variants (A/B/C)",{"type":42,"tag":267,"props":670,"children":671},{},[672],{"type":42,"tag":167,"props":673,"children":675},{"className":674},[],[676],{"type":48,"value":677},"3 variants (A, B, C); pairwise z-tests; Bonferroni correction for multiple comparisons; adjusted p-values",{"type":42,"tag":240,"props":679,"children":680},{},[681,686],{"type":42,"tag":267,"props":682,"children":683},{},[684],{"type":48,"value":685},"Sample size check",{"type":42,"tag":267,"props":687,"children":688},{},[689],{"type":42,"tag":167,"props":690,"children":692},{"className":691},[],[693],{"type":48,"value":694},"was sample size adequate? compute minimum detectable effect at 80% power and α=0.05 given observed n",{"type":42,"tag":240,"props":696,"children":697},{},[698,703],{"type":42,"tag":267,"props":699,"children":700},{},[701],{"type":48,"value":702},"Pre-test power analysis",{"type":42,"tag":267,"props":704,"children":705},{},[706],{"type":42,"tag":167,"props":707,"children":709},{"className":708},[],[710],{"type":48,"value":711},"what sample size per group needed to detect 10% relative lift from baseline 8% conversion rate with 80% power?",{"type":42,"tag":240,"props":713,"children":714},{},[715,720],{"type":42,"tag":267,"props":716,"children":717},{},[718],{"type":48,"value":719},"Sequential testing",{"type":42,"tag":267,"props":721,"children":722},{},[723],{"type":42,"tag":167,"props":724,"children":726},{"className":725},[],[727],{"type":48,"value":728},"compute p-value at each week using group sequential testing (O'Brien-Fleming boundaries); plot alpha spending",{"type":42,"tag":240,"props":730,"children":731},{},[732,737],{"type":42,"tag":267,"props":733,"children":734},{},[735],{"type":48,"value":736},"Segmented 
analysis",{"type":42,"tag":267,"props":738,"children":739},{},[740],{"type":42,"tag":167,"props":741,"children":743},{"className":742},[],[744],{"type":48,"value":745},"run A/B test overall and separately by device (mobile vs desktop); check for heterogeneous treatment effects",{"type":42,"tag":43,"props":747,"children":749},{"id":748},"assumptions-to-check",[750],{"type":48,"value":751},"Assumptions to Check",{"type":42,"tag":753,"props":754,"children":755},"ul",{},[756,773,783,793,810],{"type":42,"tag":156,"props":757,"children":758},{},[759,764,766,771],{"type":42,"tag":57,"props":760,"children":761},{},[762],{"type":48,"value":763},"Randomization",{"type":48,"value":765}," — users must be randomly and independently assigned to control or variant; if the same user can appear in both groups, or assignment was not random (e.g., all morning users got the variant), the test is invalid; check for ",{"type":42,"tag":57,"props":767,"children":768},{},[769],{"type":48,"value":770},"sample ratio mismatch (SRM)",{"type":48,"value":772}," — a significant chi-square test on the group sizes indicates a randomization failure",{"type":42,"tag":156,"props":774,"children":775},{},[776,781],{"type":42,"tag":57,"props":777,"children":778},{},[779],{"type":48,"value":780},"No peeking / pre-specified sample size",{"type":48,"value":782}," — the decision to stop the test must be made on pre-specified criteria, not by continuously monitoring the p-value and stopping when it drops below 0.05; repeated peeking inflates the Type I error rate substantially above the nominal α; if peeking is unavoidable, use sequential testing methods (O'Brien-Fleming, alpha spending) or Bayesian methods",{"type":42,"tag":156,"props":784,"children":785},{},[786,791],{"type":42,"tag":57,"props":787,"children":788},{},[789],{"type":48,"value":790},"Stable unit treatment value assumption (SUTVA)",{"type":48,"value":792}," — one user's exposure to the variant should not affect another user's outcome; 
violations occur in social networks (one user sees a variant and tells friends) or two-sided marketplaces (variant users compete with control users for the same inventory)",{"type":42,"tag":156,"props":794,"children":795},{},[796,801,803,808],{"type":42,"tag":57,"props":797,"children":798},{},[799],{"type":48,"value":800},"Single metric focus",{"type":48,"value":802}," — testing many metrics simultaneously without correction inflates false positives; designate one ",{"type":42,"tag":57,"props":804,"children":805},{},[806],{"type":48,"value":807},"primary metric",{"type":48,"value":809}," before running the test; secondary metrics require multiple comparison correction (Bonferroni, Benjamini-Hochberg)",{"type":42,"tag":156,"props":811,"children":812},{},[813,818],{"type":42,"tag":57,"props":814,"children":815},{},[816],{"type":48,"value":817},"Novelty effect",{"type":48,"value":819}," — a change in behavior due to users noticing something new (independent of its actual value) can create temporarily inflated treatment effects that fade over time; run the test for at least one full weekly cycle and monitor for effect decay",{"type":42,"tag":43,"props":821,"children":823},{"id":822},"related-tools",[824],{"type":48,"value":825},"Related Tools",{"type":42,"tag":51,"props":827,"children":828},{},[829,831,837,839,843,845,851,853,859,861,867],{"type":48,"value":830},"Use the ",{"type":42,"tag":209,"props":832,"children":834},{"href":833},"/tools/power-analysis",[835],{"type":48,"value":836},"Power Analysis Calculator",{"type":48,"value":838}," to design the test ",{"type":42,"tag":138,"props":840,"children":841},{},[842],{"type":48,"value":142},{"type":48,"value":844}," running it — determine the required sample size based on your baseline conversion rate, minimum detectable effect, significance level, and desired power. 
Use the ",{"type":42,"tag":209,"props":846,"children":848},{"href":847},"/tools/fishers-exact-test",[849],{"type":48,"value":850},"Fisher's Exact Test Calculator",{"type":48,"value":852}," for small-sample A/B tests (fewer than ~30 conversions per group) where the normal approximation underlying the z-test is unreliable. Use the ",{"type":42,"tag":209,"props":854,"children":856},{"href":855},"/tools/chi-square-test",[857],{"type":48,"value":858},"Chi-Square Test Calculator",{"type":48,"value":860}," for multi-variant tests (A/B/C/D) testing whether conversion rates differ across groups overall before running pairwise comparisons. Use the ",{"type":42,"tag":209,"props":862,"children":864},{"href":863},"/tools/logistic-regression",[865],{"type":48,"value":866},"Logistic Regression",{"type":48,"value":868}," calculator when you want to control for covariates (user demographics, device type, traffic source) in the A/B test analysis — regression-adjusted estimators can improve precision and account for imbalance.",{"type":42,"tag":43,"props":870,"children":872},{"id":871},"frequently-asked-questions",[873],{"type":48,"value":874},"Frequently Asked Questions",{"type":42,"tag":51,"props":876,"children":877},{},[878,883,885,890],{"type":42,"tag":57,"props":879,"children":880},{},[881],{"type":48,"value":882},"What p-value threshold should I use?",{"type":48,"value":884},"\nThe conventional threshold is ",{"type":42,"tag":57,"props":886,"children":887},{},[888],{"type":48,"value":889},"α = 0.05",{"type":48,"value":891}," (5% false positive rate), but the right threshold depends on the stakes. For high-traffic tests where a wrong decision is cheap to reverse, α = 0.05 or even α = 0.10 may be appropriate to avoid slowing down iteration. For tests with large irreversible consequences (pricing changes, major product overhauls), α = 0.01 reduces false positives. 
The key insight is that p-value thresholds are a policy decision about the acceptable false positive rate, not a mathematical law — always consider the cost of a false positive (shipping a harmful change) versus a false negative (missing a real improvement) in context.",{"type":42,"tag":51,"props":893,"children":894},{},[895,900,902,907,909,914],{"type":42,"tag":57,"props":896,"children":897},{},[898],{"type":48,"value":899},"How long should I run the test?",{"type":48,"value":901},"\nRun the test for at least the pre-specified duration determined by your sample size calculation — typically until you reach the target N per group. Never stop early just because the result looks significant (peeking problem) or because it has been running for a round number of days. As a minimum: (1) wait for at least ",{"type":42,"tag":57,"props":903,"children":904},{},[905],{"type":48,"value":906},"one full business cycle",{"type":48,"value":908}," (typically 1–2 weeks) to capture weekly seasonality in user behavior; (2) ensure you have reached the ",{"type":42,"tag":57,"props":910,"children":911},{},[912],{"type":48,"value":913},"minimum detectable effect sample size",{"type":48,"value":915}," before concluding no effect; (3) if you see the effect decaying over time, you may be observing a novelty effect rather than a real treatment effect.",{"type":42,"tag":51,"props":917,"children":918},{},[919,924,926,931,933,938,940,945],{"type":42,"tag":57,"props":920,"children":921},{},[922],{"type":48,"value":923},"My test is significant but the lift is tiny — should I ship it?",{"type":48,"value":925},"\nStatistical significance and ",{"type":42,"tag":57,"props":927,"children":928},{},[929],{"type":48,"value":930},"practical significance",{"type":48,"value":932}," are separate questions. A very large sample can make even a 0.1% relative lift statistically significant (p \u003C 0.05) when the true effect is negligibly small. 
Always evaluate results against a ",{"type":42,"tag":57,"props":934,"children":935},{},[936],{"type":48,"value":937},"minimum detectable effect (MDE)",{"type":48,"value":939}," that represents the smallest improvement worth the engineering cost of shipping the change. A useful rule of thumb: if the 95% confidence interval ",{"type":42,"tag":138,"props":941,"children":942},{},[943],{"type":48,"value":944},"entirely",{"type":48,"value":946}," falls below your MDE (even though it excludes 0), the test is statistically significant but practically insignificant — the true effect is likely too small to matter. Report both the p-value and the confidence interval on the absolute and relative lift so stakeholders can make an informed decision.",{"type":42,"tag":51,"props":948,"children":949},{},[950,955,957,962],{"type":42,"tag":57,"props":951,"children":952},{},[953],{"type":48,"value":954},"What is a sample ratio mismatch (SRM) and why does it matter?",{"type":48,"value":956},"\nAn ",{"type":42,"tag":57,"props":958,"children":959},{},[960],{"type":48,"value":961},"SRM",{"type":48,"value":963}," occurs when the actual ratio of users in the control and treatment groups differs significantly from the intended ratio (e.g., you aimed for 50/50 but observed 48/52). SRMs are diagnosed with a chi-square goodness-of-fit test on group sizes. Common causes: logging bugs that miss events in one variant, bot filtering that affects groups differently, or assignment cache inconsistencies. SRM is critical to investigate before analyzing results because it indicates the randomization mechanism is broken — the groups are no longer comparable, and the estimated treatment effect is biased. 
Always check SRM before reporting A/B test results.",{"title":7,"searchDepth":965,"depth":965,"links":966},2,[967,968,969,970,971,972,973,974],{"id":45,"depth":965,"text":49},{"id":147,"depth":965,"text":150},{"id":227,"depth":965,"text":230},{"id":428,"depth":965,"text":431},{"id":584,"depth":965,"text":587},{"id":748,"depth":965,"text":751},{"id":822,"depth":965,"text":825},{"id":871,"depth":965,"text":874},"markdown","content:tools:082.ab-test-calculator.md","content","tools/082.ab-test-calculator.md","tools/082.ab-test-calculator","md",{"loc":4},1775502475404]