Why Most Winners Aren't

A split test ends. Version B is ahead by 12%. B gets shipped, the result gets written up, everyone moves on to the next test. The problem is that B being ahead and B being better are not the same thing.

Run two identical pages against each other — the exact same page, tested against itself — and one of them will still "win." It has to. With a few hundred visitors and a handful of conversions on each side, random variation alone produces a gap. Flip two fair coins forty times each and one will usually come out ahead. Nobody would call that coin better.
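The identical-page thought experiment is easy to try for yourself. Below is a minimal simulation sketch, assuming 300 visitors per copy and a 3% true conversion rate; both figures, and the function name, are illustrative choices rather than numbers from a real test.

```python
import random


def simulate_aa_tests(visitors=300, true_rate=0.03, trials=2000, seed=42):
    """Pit a page against an identical copy of itself, many times over.

    Returns the share of simulated tests in which one copy finished with
    strictly more conversions than the other.
    """
    rng = random.Random(seed)
    tests_with_a_leader = 0
    for _ in range(trials):
        # Both copies share the same true conversion rate by construction.
        conversions_a = sum(rng.random() < true_rate for _ in range(visitors))
        conversions_b = sum(rng.random() < true_rate for _ in range(visitors))
        if conversions_a != conversions_b:
            tests_with_a_leader += 1
    return tests_with_a_leader / trials


if __name__ == "__main__":
    share = simulate_aa_tests()
    print(f"Identical pages produced a 'winner' in {share:.0%} of tests")
```

Under these assumptions the large majority of simulated tests end with one copy ahead, even though nothing differs between them.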

This is the quiet failure mode of conversion testing. Teams run tests, declare winners, and ship changes whose gains were never real. Over a year of that, you have made a dozen confident decisions on a dozen coin flips, and your conversion rate has not moved.

What Significance Means

Statistical significance is the tool that separates a real difference from a lucky one. It answers a precise question: if there were truly no difference between A and B, how often would pure chance hand me a gap at least this big?

If the answer is "rarely," the difference is probably real. If the answer is "fairly often," you are looking at noise. The common convention is 95% confidence, which means you accept a result only when chance alone would have produced a gap that large less than 5% of the time.
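That question can be answered by brute force. The sketch below assumes there is no real difference, so both variants convert at the pooled rate, then replays the test many times and counts how often chance alone produces a gap at least as big as the one observed. The function name and the example numbers are hypothetical.

```python
import random


def chance_of_gap(visitors_a, conversions_a, visitors_b, conversions_b,
                  trials=2000, seed=7):
    """Estimate how often pure chance yields a gap at least as big as observed.

    Assumes no real difference: both variants convert at the pooled rate.
    """
    rng = random.Random(seed)
    observed_gap = abs(conversions_a / visitors_a - conversions_b / visitors_b)
    pooled_rate = (conversions_a + conversions_b) / (visitors_a + visitors_b)

    at_least_as_big = 0
    for _ in range(trials):
        sim_a = sum(rng.random() < pooled_rate for _ in range(visitors_a))
        sim_b = sum(rng.random() < pooled_rate for _ in range(visitors_b))
        simulated_gap = abs(sim_a / visitors_a - sim_b / visitors_b)
        # Small tolerance so float rounding does not drop exact ties.
        if simulated_gap >= observed_gap - 1e-12:
            at_least_as_big += 1
    return at_least_as_big / trials


if __name__ == "__main__":
    # 200 visitors each, 10 vs 12 conversions: B is "ahead by 20%".
    print(chance_of_gap(200, 10, 200, 12))
```

A returned share well above 0.05 means chance produces a gap like this fairly often, so a lead of that kind does not clear the 95% bar.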

"Statistical significance does not tell you that you are right. It tells you that you are probably not being fooled by randomness. That is a lower bar, and it is the one most tests fail."

Two things drive significance: the size of the gap between the variants and the amount of data behind it. A huge gap on tiny traffic is still a coin flip. A small gap on large traffic can be rock solid. You cannot eyeball this. You have to run the numbers.

Using the Calculator

Running the numbers is exactly what the calculator does. You give it four inputs — the visitors and the conversions for each variant — and it returns the confidence level, so you know whether your result has cleared the bar or not.
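For readers who prefer code to a spreadsheet, here is a rough equivalent. The spreadsheet's own formula is not described here, so this sketch assumes a pooled two-proportion z-test, which is one common way to turn those four inputs into a confidence level. The function name is hypothetical.

```python
import math


def confidence_level(visitors_a, conversions_a, visitors_b, conversions_b):
    """Two-sided confidence that A and B differ (pooled two-proportion z-test)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_error == 0:
        return 0.0  # zero (or total) conversions on both sides: nothing to compare
    z = abs(rate_a - rate_b) / std_error
    # For a standard normal Z, P(|Z| < z) = erf(z / sqrt(2)),
    # which equals 1 minus the two-sided p-value.
    return math.erf(z / math.sqrt(2))


if __name__ == "__main__":
    # Huge gap, tiny traffic: 2 vs 4 conversions on 20 visitors each.
    print(f"{confidence_level(20, 2, 20, 4):.1%}")
    # Small gap, large traffic: 5,000 vs 5,300 conversions on 100,000 visitors each.
    print(f"{confidence_level(100_000, 5_000, 100_000, 5_300):.1%}")
```

The first example doubles the conversion count yet stays well below 95%, while the second, a 6% relative lift, clears it comfortably. The normal approximation behind this test gets unreliable when conversion counts are very small, so treat low-traffic readings as rough.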

Free Download

This is the spreadsheet I have used for years to pressure-test split-test results before acting on them. Download the Split Test Calculator (Excel) — enter your numbers, read the confidence level, and decide with evidence instead of hope.

Reading the result is simple. Above roughly 95% confidence, you have a result worth acting on. Below it, you have not finished the test — the honest move is to keep it running or accept that the variants are effectively tied. A result at 80% confidence is not a small win. It is an unfinished test.

Rules for Tests You Can Trust

Framework: Tests You Can Trust

Five rules that keep a result honest

One: decide the sample size before you start, not after.

Two: do not peek and stop the moment it looks good. Early leads evaporate, and stopping at a flattering moment manufactures fake winners.

Three: run full business cycles, complete weeks, so weekday and weekend behavior are both represented.

Four: change one variable at a time, or a win tells you nothing you can repeat.

Five: the more things you test at once, the more "winners" chance hands you, so raise your bar accordingly.
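Rules one and five can both be put into numbers before a test starts. The sketch below uses the standard normal-approximation formula for comparing two proportions. The 80% power default and the Bonferroni correction (dividing the allowed error rate by the number of comparisons) are assumptions added for illustration, not something prescribed above, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline_rate, relative_lift, confidence=0.95,
                         power=0.80, comparisons=1):
    """Approximate visitors needed per variant to detect a given relative lift.

    comparisons > 1 applies a Bonferroni correction (alpha / comparisons).
    """
    alpha = (1 - confidence) / comparisons
    variant_rate = baseline_rate * (1 + relative_lift)
    average_rate = (baseline_rate + variant_rate) / 2

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_power = NormalDist().inv_cdf(power)

    numerator = (
        z_alpha * math.sqrt(2 * average_rate * (1 - average_rate))
        + z_power * math.sqrt(baseline_rate * (1 - baseline_rate)
                              + variant_rate * (1 - variant_rate))
    ) ** 2
    return math.ceil(numerator / (variant_rate - baseline_rate) ** 2)


if __name__ == "__main__":
    # 5% baseline conversion rate, hoping to detect a 20% relative lift.
    print(visitors_per_variant(0.05, 0.20))
    # Same goal, but with three comparisons running at once.
    print(visitors_per_variant(0.05, 0.20, comparisons=3))
```

Under these assumptions, detecting a 20% lift on a 5% baseline takes several thousand visitors per variant, and running several comparisons at once raises the requirement further.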

None of this slows you down in any way that matters. It simply moves the slow part to the front, before you have shipped a change and built a quarter of strategy on top of a result that was never there. A test you can trust is worth three tests you cannot.


A trustworthy testing habit is the engine behind continuous improvement. The framework that turns a stream of small, verified wins into serious growth is marginal gains for business.