## Question 515:

Your question touches on several aspects of statistical inference, so it's worth going into some detail to get to the solutions. It looks like there are a few different things you want to know for your A/B testing. The first is: given the observed difference in proportions from the sample, can you conclude the difference is greater than chance alone? The second is: how large a sample would you need in order to detect a difference? I'll start with the first one.

Typically these questions involve using the binomial distribution to compare probabilities. Since your sample sizes are relatively large, you can use a technique called the normal approximation to the binomial, which makes the computations easier and faster. It's usually OK to use the normal approximation when n*p and n*q are both greater than 5, where p is the proportion converting and q is the proportion not converting. Using the smallest sample you have provided, 36/1532 gives a p of .023 and a q of .977. That makes n*p = .023*1532 = 36 and n*q = .977*1532 = 1496, so you're fine using this approach, and probably will be for most sample sizes above 300.
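Here's a quick sketch of that rule-of-thumb check in Python (the variable names are just illustrative; any language would do):

```python
# Check the n*p > 5 and n*q > 5 rule of thumb for the smallest sample,
# 36 conversions out of 1532, using only plain arithmetic.
n, x = 1532, 36
p = x / n                       # proportion converting, about .023
q = 1 - p                       # proportion not converting, about .977
ok = n * p > 5 and n * q > 5    # True, so the normal approximation applies
```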

Let's walk through your first example. You observed 82/4616 conversions in one version and 76/4616 in the other. The two proportions are then .0177 and .0164, and the difference is .001299. Given your sample size of 4616, can we conclude the difference of .001299 is greater than chance?

You divide this difference by a standard error which accounts for chance; the result is a z-score. The standard error is

SQRT( (1/n1 + 1/n2) * P*Q )

where P = (x1 + x2)/(n1 + n2) and Q = 1 - P. The x's are just the numbers of conversions and the n's are the sample sizes.

P = (82+76)/(4616+4616) = .0171
Q = 1 - .0171 = .9829
PQ = .0171*.9829 = .0168

1/n1 + 1/n2 = 1/4616 + 1/4616 = .00043

So multiply .0168 * .00043 = 0.000007224

Now the square root of this is SQRT(0.000007224) = .002688

So the z-score is the observed difference divided by this: .001299/.002688 = .483

That last result of .483 is the z-score, which is your test statistic. You now look this value up in a z-score-to-percentile calculator using the 2-sided area. This gets us a p-value of .629, which means that if there were truly no difference, you'd see a difference this large about 62.9% of the time. That's pretty high, and most people would conclude there is no difference.

Using the same procedure for your second set of data, I get a z-score of 1.53 and a p-value of .126, which means there is only about a 12.6% chance of seeing a difference as large as the one between 50/1532 (.032) and 36/1532 (.023) if there were really no difference. I don't know the context, but I'd feel pretty good about concluding there was a difference there. Loosely speaking, that's about 87.4% confidence the difference is not due to chance, which seems pretty convincing to me.
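The whole procedure above can be sketched in a few lines of Python using only the standard library (the function name is just illustrative):

```python
# A sketch of the pooled two-proportion z-test walked through above.
from statistics import NormalDist
from math import sqrt

def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: the proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    P = (x1 + x2) / (n1 + n2)                 # pooled proportion
    Q = 1 - P
    se = sqrt((1 / n1 + 1 / n2) * P * Q)      # standard error under H0
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# First example: 82/4616 vs 76/4616
z_a, p_a = two_proportion_z_test(82, 4616, 76, 4616)   # z ≈ 0.48, p ≈ 0.63

# Second example: 50/1532 vs 36/1532
z_b, p_b = two_proportion_z_test(50, 1532, 36, 1532)   # z ≈ 1.53, p ≈ 0.13
```

You'd just pass in the number converted and the sample size for each version.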

Next you'll likely want a confidence interval around the observed difference in proportions. To compute it you'll use much of the same information you used for the 2-proportion test.

p1 - p2 +/- z(a/2) * SQRT( p1*q1/n1 + p2*q2/n2 )

Let's plug in the numbers for the second example.

p1= .032
p2 = .023
za/2 = 1.96
q1 = .968
q2 = .977
n1 = 1532
n2 = 1532

Plugging in, we get .032 - .023 +/- 1.96 * SQRT( (.032*.968)/1532 + (.023*.977)/1532 ), which gets us .0092 +/- .0116 and a 95% confidence interval of (-0.0025, 0.0208).
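That confidence interval can be sketched the same way (again stdlib-only Python, with an illustrative function name):

```python
# A sketch of the unpooled confidence interval for a difference in
# proportions, following the formula above.
from statistics import NormalDist
from math import sqrt

def diff_ci(x1, n1, x2, n2, confidence=0.95):
    """Return (low, high) bounds on the difference p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    q1, q2 = 1 - p1, 1 - p2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 for 95%
    margin = z * sqrt(p1 * q1 / n1 + p2 * q2 / n2)
    diff = p1 - p2
    return diff - margin, diff + margin

# Second example: 50/1532 vs 36/1532
low, high = diff_ci(50, 1532, 36, 1532)   # ≈ (-0.0026, 0.0208)
```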

Notice how the confidence interval crosses 0 (the low end is below zero and the high end is above it), so we cannot conclude with 95% confidence that the difference is not due to chance; but as we calculated earlier, we can conclude it with about 87.4% confidence. You'll need to consider what threshold of confidence you are comfortable with. I'd suggest requiring strong evidence, not just a preponderance of evidence (e.g., well above 70%); again, this depends on the cost of being wrong about your conversions.

This will tell you the likely range of the actual difference. The nice thing about the confidence interval is that it also tells us significance: if the interval does not cross zero, then the difference in proportions is statistically different from 0, and you can also see how much higher the difference is. The other nice thing about a confidence interval is that it makes a nice visual of the difference, so you don't need to get caught up in the numbers; you just look at the width and location of the interval.

Something to keep in mind is that while the difference might not be "statistically" different given the samples you've observed, it is also likely that if you were to continue testing you would see only a very small difference between the two versions. I suspect you'd still go with the one which had the better conversion rate, even if the edge was a fraction of a percent, because after all, the big numbers of visitors add up. The problem is, you need a very large sample to detect such a small difference.

Your final question deals with estimating the sample size prior to testing. This approach is called a power analysis, and you need to provide some additional information. First, you need to specify how large a difference you're interested in detecting. Recall the general rule: to detect smaller differences between proportions you need a larger sample size. For very large differences (e.g., 20% versus 30%) you can use much smaller samples than for differences of 1% to 2%. The most difficult thing about power analysis is not knowing ahead of time what difference is important. As we said earlier, while larger differences are better, pretty much any difference is better than no difference, and rarely do we know what the likely difference is ahead of time.

The next pieces of information are the sample size, the power, and the significance level. The typical power used is .8 (power is 1 - beta) and the typical alpha used is .05 (alpha is 1 - confidence level). By using a power of .8 (80%), we're saying that we're OK with a 20% probability of not detecting a statistical difference when one in fact exists. I'll walk through an example using the data. The formulas are a bit complicated, so I'll show you the results; let me know if this is a path you're interested in pursuing and I'll be happy to provide more formula details.

Let's say you're interested in detecting a difference in proportions of 1%, between a 2% conversion rate and a 3% conversion rate, using a power of 80% and a confidence level of 95%. In order to have an 80% chance of detecting this difference, we'd need to test 3826 page views on each version. We can also fix the sample size and calculate power. If we wanted to test only 1000 page views, the probability of detecting a 1% conversion rate difference is about 30%. So if we viewed 1000 pages on each version and did not observe a statistical difference (p < .05), there would still be a 70% chance of failing to detect a difference that is in fact there; we'd just need to test more page views.
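To show where those numbers come from, here's a sketch of the standard normal-approximation formulas for per-group sample size and power (stdlib-only Python; the function names are just illustrative, and this is the textbook approximation, not the only way to do it):

```python
# Sample-size and power sketches for comparing two proportions,
# using the normal approximation with a two-sided test.
from statistics import NormalDist
from math import sqrt, ceil

def sample_size(p1, p2, alpha=0.05, power=0.8):
    """Per-group n needed to detect p1 vs p2."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)        # 1.96 for alpha = .05
    z_b = nd.inv_cdf(power)                # 0.84 for power = .80
    pbar = (p1 + p2) / 2                   # average proportion under H0
    num = (z_a * sqrt(2 * pbar * (1 - pbar)) +
           z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

def power_for_n(p1, p2, n, alpha=0.05):
    """Power of the test with n page views per version."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)
    pbar = (p1 + p2) / 2
    se0 = sqrt(2 * pbar * (1 - pbar) / n)           # SE under H0 (pooled)
    se1 = sqrt((p1*(1-p1) + p2*(1-p2)) / n)         # SE under H1
    return 1 - nd.cdf((z_a * se0 - abs(p1 - p2)) / se1)

n = sample_size(0.02, 0.03)         # 3826 per version
pw = power_for_n(0.02, 0.03, 1000)  # ≈ 0.30
```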

Power analysis involves trading off the confidence level, sample size and power values to find a mix that works.

My recommendation to you is start with the first two approaches as they are easier to implement. I'm happy to help you with the code. I sell PHP and JavaScript algorithms that will run these calculations for you and can document what they are doing. You'll just need to pass them the parameters of number converted and total sample size. Let me know if that's something you're interested in pursuing and if you want more detail on the power calculations and anything else I covered in this answer.