📊 A/B Test Calculator

Easily calculate A/B test statistical significance, required sample size, and CVR improvement rate.

Ⓐ Pattern A (Control)

Ⓑ Pattern B (Variant)

Explanation of Formulas

Z-test (Normal approximation) Significance Testing

Pooled estimate: p̂ = (xA + xB) / (nA + nB)

Standard error: SE = √( p̂(1 - p̂)(1/nA + 1/nB) )

Z-statistic: Z = (pB - pA) / SE

P-value (two-tailed test): p = 2 × (1 - Φ(|Z|))
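As an illustration, the test above can be computed with the Python standard library alone (a minimal sketch; the function name `z_test` and the sample counts are our own, not part of the calculator):

```python
import math

def z_test(x_a: int, n_a: int, x_b: int, n_b: int) -> tuple[float, float]:
    """Two-proportion z-test; returns (Z-statistic, two-tailed p-value)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)                            # pooled estimate p̂
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error
    z = (p_b - p_a) / se                                          # Z-statistic
    # Φ(x) via the error function: Φ(x) = (1 + erf(x / √2)) / 2
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = z_test(500, 10_000, 580, 10_000)
print(f"Z = {z:.2f}, p = {p:.3f}")  # Z ≈ 2.50, p ≈ 0.012 → significant at 95%
```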

Sample size calculation

n = (Zα/2 + Zβ)² × (p₁(1 - p₁) + p₂(1 - p₂)) / (p₂ - p₁)²

p₁ = baseline CVR, p₂ = p₁ × (1 + MDE/100), Zα/2 = critical value for the significance level, Zβ = critical value for statistical power
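The formula above translates directly to code. Here is a minimal sketch in Python (the helper name and defaults are our own; critical values are the standard two-sided normal quantiles):

```python
import math

# Two-sided critical values: Z_{α/2} = 1.96 for α = 0.05; Z_β = 0.8416 for 80% power.
Z_ALPHA = {0.10: 1.6449, 0.05: 1.9600, 0.01: 2.5758}
Z_BETA = {0.80: 0.8416, 0.90: 1.2816}

def sample_size_per_group(baseline_cvr: float, mde_pct: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Required visitors per group for a relative MDE (in %) over the baseline CVR."""
    p1 = baseline_cvr
    p2 = p1 * (1 + mde_pct / 100)           # expected variant CVR
    z = Z_ALPHA[alpha] + Z_BETA[power]
    n = z**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return math.ceil(n)

# Baseline 5% CVR, detect a 20% relative lift, 5% significance, 80% power:
print(sample_size_per_group(0.05, 20))  # → 8156 per group
```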

How to Read Results

| Metric | Meaning | Reference |
| --- | --- | --- |
| P-value | Probability of a difference this extreme arising by chance alone | Significant if < 0.05 |
| Z-value | Size of the A/B difference in standard errors | Significant at 95% when absolute Z > 1.96 |
| Improvement Rate | Relative CVR change from A to B | (CVR_B - CVR_A) / CVR_A |
| Confidence Interval | Estimated range of the true difference | Significant if it excludes 0 |
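The document gives no formula for the confidence interval, but a conventional choice is the Wald interval for the difference in rates, using the unpooled standard error (a sketch under that assumption; the function name is our own):

```python
import math

def diff_confidence_interval(x_a: int, n_a: int, x_b: int, n_b: int,
                             z_crit: float = 1.96) -> tuple[float, float]:
    """95% Wald interval for the CVR difference (pB - pA)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    # Unpooled standard error is conventional for the interval (pooled is for the test)
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se

lo, hi = diff_confidence_interval(500, 10_000, 580, 10_000)
print(f"[{lo:.4f}, {hi:.4f}]")  # → [0.0017, 0.0143]; excludes 0, so significant at 95%
```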

Usage and Application Examples

  • In 'Significance Testing Mode,' you can enter the number of visitors and conversions for A and B to determine statistical significance.
  • In 'Sample Size Calculation Mode,' you can estimate the required sample size in advance.
  • You can select from 90%, 95%, or 99% significance levels. 95% is typically recommended.
  • Visually compare CVR between A and B in the comparison chart.
  • Copy results to clipboard and paste into your report.

What is an A/B Test Calculator?

An A/B Test Calculator is a statistical tool that determines whether the difference between two groups in an experiment is statistically significant or just due to random chance. Marketers, product managers, and UX designers use it to validate that improvements actually work. Instead of guessing, you get confidence-backed answers about which version performs better.

How to Use

The calculator requires four inputs: Control visitors (Group A traffic), Control conversions (Group A successes), Variation visitors (Group B traffic), and Variation conversions (Group B successes). Enter the raw counts from your experiment period—no percentages needed. The tool automatically computes conversion rates, the uplift percentage, statistical significance (p-value), and confidence intervals. Most calculators use a standard 95% confidence threshold; a p-value under 0.05, or a confidence interval that excludes zero, indicates your result is statistically sound.

Use Cases

Scenario 1: An e-commerce site tests a new checkout button color. Control: 10,000 visitors with 450 purchases (4.5% CVR). Variation: 10,500 visitors with 525 purchases (5.0% CVR). The calculator shows an 11.1% relative uplift, but with p ≈ 0.09 it is significant only at the 90% level—worth extending the test before committing to a rollout. Scenario 2: A SaaS company tests pricing. Control: 2,000 free-trial signups with 300 conversions (15.0%). Variation: 2,100 signups with 280 conversions (13.3%). The 1.7-point decline is not statistically significant (p ≈ 0.13)—the price change didn't clearly hurt, but didn't help either. Scenario 3: A content site tests headline wording. Control: 5,000 visitors, 120 article clicks (2.4%). Variation: 4,800 visitors, 180 clicks (3.75%). The 56% relative uplift is statistically significant (p < 0.001) even at this scale, informing broader content strategy changes.

Tips & Insights

Sample size matters: tiny experiments with hundreds of visitors often show false positives. Run tests long enough to capture weekly variation—weekends differ from weekdays. Statistical significance at 95% confidence means that, if there were truly no difference, a result this extreme would occur only 5% of the time. Practical significance is different: a 1% lift on revenue might be statistically valid but negligible. Always decide sample size before running the test to avoid peeking bias—stopping early when results happen to look good inflates the false positive rate.

Frequently Asked Questions

What is statistical significance in A/B testing?

Statistical significance means the difference between pattern A and pattern B can be statistically determined to be real rather than due to chance. Generally, if the p-value is less than 0.05 (95% confidence level), it is considered statistically significant.

How is the required sample size calculated?

It is calculated statistically from four parameters: baseline CVR, minimum detectable effect (MDE), significance level, and statistical power. Typically, 5% significance level and 80% statistical power are used.

What is the p-value?

The p-value is the probability of obtaining results as extreme as or more extreme than those observed, assuming the null hypothesis (no difference between A and B) is correct. The smaller the p-value, the stronger the evidence that the observed difference is not due to chance.

How is CVR (conversion rate) calculated?

CVR = Conversions / Visitors × 100 (%). For example, if 50 out of 1,000 visitors convert, the CVR is 5.0%.
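The worked example above, expressed as a trivially short Python snippet:

```python
visitors, conversions = 1_000, 50
cvr = conversions / visitors * 100  # CVR in percent
print(f"{cvr:.1f}%")  # → 5.0%
```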

How long should the test period be?

We recommend a minimum of 1–2 weeks, covering at least one complete business cycle. Continue testing until the planned sample size is reached, and avoid stopping early based on interim peeks—this is key to obtaining statistically accurate results.

When should I stop an A/B test early?

You should avoid stopping tests early even if results look promising, as this can introduce selection bias and inflate your false positive rate. A best practice is to set your sample size in advance and let the test run to completion. If you must stop early, consider using a Bayesian approach or sequential testing methods that account for multiple peeks.

What's the difference between one-tailed and two-tailed tests?

A two-tailed test checks if variant B is different from variant A in either direction (better or worse), while a one-tailed test only checks if B is better. Two-tailed tests are more conservative and recommended for most A/B tests unless you have a strong reason to only care about improvement in one direction. If you use a one-tailed test incorrectly, you risk missing important negative effects.
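The practical difference is easy to see numerically. This sketch (our own example; `phi` is a hypothetical helper for the standard normal CDF) shows a Z-statistic that clears the 0.05 bar one-tailed but not two-tailed:

```python
import math

def phi(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

z = 1.80                               # example Z-statistic
p_one_tailed = 1 - phi(z)              # tests only "B is better than A"
p_two_tailed = 2 * (1 - phi(abs(z)))   # tests a difference in either direction
print(f"{p_one_tailed:.3f} {p_two_tailed:.3f}")  # → 0.036 0.072
```

Here the one-tailed test would declare significance at 0.05 while the more conservative two-tailed test would not—exactly why the two-tailed default is the safer choice.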

How do I account for testing multiple hypotheses simultaneously?

When running multiple A/B tests in parallel, you increase the risk of false positives due to the multiple comparison problem. You can adjust your significance level using Bonferroni correction (divide 0.05 by the number of tests) or more advanced methods like FDR control. This calculator tests one hypothesis at a time, so apply corrections manually if you're testing several variants.
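The Bonferroni adjustment described above is a one-line computation (the function name is our own):

```python
def bonferroni_alpha(alpha: float, num_tests: int) -> float:
    """Per-test significance level under Bonferroni correction."""
    return alpha / num_tests

# Running 4 variant comparisons at an overall α of 0.05:
print(bonferroni_alpha(0.05, 4))  # each test must reach p < 0.0125
```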

What's a good baseline conversion rate to use?

Baseline conversion rates vary widely by industry—e-commerce averages 2-3%, SaaS may be 5-10%, and form signups could be 1-5%. You should use your own historical data as the baseline rather than industry benchmarks to ensure your sample size accounts for your actual traffic patterns. Higher baseline rates require smaller sample sizes to achieve the same statistical power.

Can I use this calculator for metrics other than conversion rates?

Yes, this calculator works for any binary metric where you're comparing two rates, including email open rates, click-through rates, sign-up rates, or engagement metrics. Simply input your baseline metric rate and your expected improvement to get the required sample size. The statistical principles are identical whether you're testing a button color or an email subject line.

What should I do if my test doesn't reach statistical significance?

If your test is underpowered, you can run it longer to collect more data, accept a larger minimum effect size, or declare the test inconclusive. Never draw strong conclusions from non-significant results—you simply don't have enough evidence either way. Consider whether your minimum detectable effect was realistic, as sometimes the true difference is smaller than initially assumed.