Optimizely Stats Engine - Sequential Testing Vs T Testing

Published On Thu 28 October, 2021

In this tutorial, I will hopefully explain some of the magic behind the Optimizely Stats Engine framework. We have all heard of companies like Amazon and Netflix who have successfully used data-driven decisions to turn their companies into household names. In order for your company to make these data-driven decisions, you need to be able to capture the correct data. The way to get this data is to run experiments at all levels within your software development life cycle. Using painted door experiments in the design phase will allow you to validate that an idea has merit. Being data-driven means no more wasted time and money building features that your customers do not want. Product investment can be spent only on building validated ideas. As we all know, building something and then checking if customers will use it is the most expensive and time-consuming way to improve.

The only way to be data-driven is through experimentation. Running tests is great, however, tests on their own are pretty meaningless. The important part of experimentation is the data. Optimizely recognised this and in 2015 created a new type of stats framework, called Stats Engine. Stats Engine is a massive jump forward in getting access to your experimentation data quicker and more accurately. Stats Engine will give a company a competitive advantage compared to companies that only use classical statistical modelling methodologies in their experiments.

Stats Engine was built in collaboration with a team of statisticians at Stanford University. Its unique approach combines sequential testing with false discovery rate control. If you are a business wanting to make data-driven decisions, your ability to iterate through the learning process will be directly tied to the number of new changes you can introduce. Basically, the more sophisticated the stats engine, the more competitive advantage you will have over your competitors. The takeaway is that being able to prove that your hypothesis is either correct or not is really important.

It is all very well saying that sequential testing and false discovery rate control are clever, but what does that actually mean?! This is where this article comes in. I am by no means a maths or stats guru and most of the articles that I have read on this topic seem to assume that you are. Well... either that or my comprehension of maths is so low, I need my own special dummies guide to understand it! The purpose of this post is to give a plain English account of what Stats Engine does and why you should care. If I can get it, then I'm confident you can 😊. We start this journey talking about Guinness 🍺☘️🍺.

T Testing

Around 1900 Guinness was experiencing production issues. They wanted to create more of the black stuff, however, they did not want to impact quality. Up until this point, Guinness used to manually sample large quantities of hops to ensure high quality, however, they realised that this process would block them from scaling. They needed a new way to measure quality. This is where William S. Gosset comes in with the work that became the T Test, one of the foundations of statistical significance testing. In essence, he created a method that allowed Guinness to judge quality from small samples of their hops. His method could be used to work out how many samples were required in order to be statistically confident that the quality assurance results were accurate. This is what statistical significance is about. His model allowed Guinness to use more hops and reduce the time needed to check that the hops were up to scratch, improving their throughput.

Fast-forward 100 years, people are now building websites. Websites and web pages are not hops. Web pages can be updated and changed very easily. In fact, I will make a bold claim... there are pretty much no similarities between hops and web pages! A lot of web experimentation companies' stats engines were mainly based on the original framework which was first developed within Guinness for agriculture. This model is known as the T distribution model (T Test). This model has been used for well over 100 years, so we can definitely say it works. As this model was not specifically designed for web experimentation, this raises the question, does it calculate results in the fastest way possible?

When Stats Models Meet Web Experiments

We've established that the classic statistical models were originally created to evaluate a fixed sample size of hops during the harvest period (fixed time) to ensure a consistent high-quality yield. Websites don't really work like this. First, people will visit your site when they feel like it. On Black Friday you may have thousands of visitors, on Christmas Day none. Most companies are also not static, they change and evolve over time. Ask any tech-focused company and it will have scheduled activities, like code releases, content updates, server patches, or marketing campaigns planned in some shape. It would be strange if a company was not making constant improvements to its digital estate.

The problem with any statistical predictive modelling is that there will always be some form of error margin. No model will be 100% accurate all the time. In general, the longer an experiment runs, the more accurate the data. This is where statistical significance comes into play. When an experiment hits this figure, you can trust it is accurate to a high probability. What happens though if something changes three-quarters of the way through the experiment? As the result cannot be trusted until you hit statistical significance, do you start the experiment again?

Speed Vs Accuracy

When you are performing a web experiment you are testing a hypothesis. You are trying to measure the difference between a variation and an original to see if the variation provides some sort of benefit. The point of an experiment is to compare two things. When using any predictive analytics, there is always a chance of false positives occurring. A false positive is a result that reports a difference between the variation and the original when no real difference exists.

There are two main ways that false positives can occur when performing a web test. Every time you add an extra variation, or, add a new metric to an experiment, you increase the chances that a false-positive result can occur. You can also find false positives in a T distribution model by testing the same metric multiple times over the course of an experiment. These challenges are officially known as the multiple comparison problem and the peeking problem.
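To put a rough number on the multiple comparison problem, here is a minimal sketch. It assumes every comparison is independent and uses a 5% false-positive rate per comparison. Both are simplifying assumptions, and the function name is made up for illustration.

```python
def chance_of_any_false_positive(alpha: float, comparisons: int) -> float:
    """Chance that at least one of several independent comparisons
    shows a false positive, when each has a false-positive rate of alpha."""
    return 1 - (1 - alpha) ** comparisons


# Each variation/metric combination counts as one comparison.
for comparisons in (1, 3, 5, 10, 20):
    risk = chance_of_any_false_positive(0.05, comparisons)
    print(f"{comparisons:>2} comparisons -> {risk:.1%} chance of a false positive")
```

One comparison carries the 5% risk you signed up for. By ten comparisons the chance of seeing at least one false winner is roughly 40%.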

The easiest way I found to get my head around the peeking problem was summed up in this blog post. I found that it really helped me to understand some of the challenges around reporting. When statisticians try to validate statistical models, one widely used approach is to run a number of A/A tests and compare the results. An A/A test is the same as an A/B test, however, both groups of visitors are shown exactly the same variation. In theory, if a testing model is sound, the experiments should report no winner, because there is no real difference to find. The researchers from this blog post found that in nearly 60% of their experiments, a variation would be flagged as a winner at some point during the lifetime of the experiment.

![Optimizely Stats Engine Explained For Dummies 1][1]

Take my simple graph above. If you peeked in week 3, Line 1 would be flagged as a winner. If you peeked in week 6, Line 2 would be the winner. If you looked in week 7 the results would be a draw. This is exactly what a real-life study would show. Having a level of noise in the data is unavoidable. The longer you run the experiments, the more the results even out. This means that with traditional statistical modelling you cannot speed things up. You have to wait for the fixed horizon time to hit before you know for sure that the data is valid. Taking any actions on the data before the allocated experiment time has been reached would mean that you would be making decisions on potentially incorrect data, completely invalidating the purpose of being data-driven.
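You can reproduce the peeking problem yourself with a small simulation. The sketch below is my own illustration, not the method from the blog post mentioned earlier. It runs a batch of A/A tests where both groups share the same 10% conversion rate, checks a standard two-sided z-test every 200 visitors, and counts how often a "winner" shows up at any peek compared with only at the end. The traffic numbers, peek interval and function names are all assumptions picked for the example.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, conv_b, n):
    """Two-sided z-test p-value for two groups of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = sqrt(pooled * (1 - pooled) * 2 / n)
    if std_err == 0:
        return 1.0  # no conversions yet (or all converted): nothing to compare
    z = (conv_a - conv_b) / n / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def run_aa_test(rng, rate=0.10, visitors=4000, peek_every=200, alpha=0.05):
    """Simulate one A/A test; return (flagged at any peek, flagged at the end)."""
    conv_a = conv_b = 0
    flagged_at_peek = False
    p = 1.0
    for n in range(1, visitors + 1):
        conv_a += rng.random() < rate  # both groups share the same true rate
        conv_b += rng.random() < rate
        if n % peek_every == 0:
            p = p_value(conv_a, conv_b, n)
            flagged_at_peek = flagged_at_peek or p < alpha
    # The last peek happens on the final visitor, so p is the end-of-test value.
    return flagged_at_peek, p < alpha


def false_positive_rates(tests=300, seed=1):
    """Share of A/A tests that show a 'winner' when peeking vs. waiting."""
    rng = random.Random(seed)
    results = [run_aa_test(rng) for _ in range(tests)]
    peeking = sum(flagged for flagged, _ in results) / tests
    waiting = sum(flagged for _, flagged in results) / tests
    return peeking, waiting


peeking, waiting = false_positive_rates()
print(f"Winner flagged at some peek: {peeking:.0%}")
print(f"Winner flagged at the end:   {waiting:.0%}")
```

With twenty peeks per test, the first number typically lands well above the 5% you signed up for, while the second stays close to it. The data is identical in both cases. The only thing that changes is how often you look.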

Let us talk about maths 🤕. When comparing a variation against the original, the starting assumption is that there is no real difference between them. This assumption is called the null hypothesis. Something called a p-value is then used to express how likely it is that you would see a difference at least as large as the one in your data if the null hypothesis were true. The smaller the p-value, the harder it is to explain the result as random noise. In order for the data within an experiment to be trustworthy enough, it needs to hit something called statistical significance, which happens when the p-value drops below a threshold chosen before the test starts. When the data hits that threshold, the data can be trusted and the hypothesis is treated as either proved or disproved.

In a normal A/B test the significance threshold for the p-value is set at 0.05, which means accepting a 5% chance of a false positive. This value, together with the size of the uplift you want to detect, can then be used to determine how long a test needs to run before we can be confident that the reported results can be trusted. When the sample size is fixed up front like this, it is known as a fixed horizon test. You need to wait x amount of time before you can trust the data. There are no shortcuts!
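To get a feel for how long "x amount of time" can be, here is a rough sketch of the kind of sum a sample size calculator does for a fixed horizon test. It uses the textbook normal-approximation formula for comparing two conversion rates. The 80% power default, the function name and the example numbers are my assumptions, not values taken from Optimizely.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variation(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH variation for a fixed horizon test.

    baseline_rate: conversion rate of the original, e.g. 0.10 for 10%
    relative_lift: smallest uplift worth detecting, e.g. 0.10 for +10%
    alpha:         accepted false-positive rate (two-sided)
    power:         chance of detecting the uplift if it is real
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# A 10% baseline conversion rate, hoping to detect a 10% relative uplift.
print(visitors_per_variation(0.10, 0.10))
```

For that example the answer comes out at roughly 15,000 visitors per variation. Halve the uplift you want to detect and the number roughly quadruples, which is why small improvements take so long to prove with a fixed horizon.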

As a society, we do not like to wait. We want things instantly. Imagine being halfway through a test at work, and you have Black Friday coming up in a week. Everyone wants to increase sales to hit their bonuses...

Variation A is winning, let's use it!
Variation B is not performing, we can't afford to leave it running through the sale!
There's no clear winner... we'll extend the experiment for another few months!

If you are using fixed horizon analysis, in all the scenarios listed above, there is nothing you can do. You can make a best guess on the data and make a judgement call, however, you are not being data-driven. You are still delivering business value based on guesswork. This is why traditional models do not fit nicely with web experimentation. To get a competitive advantage, you need to be given enough information about the risks in the data to be able to make real-time decisions. A p-value that can only be trusted once the test has finished will not give you this.

Sequential Testing And Low False Discovery Rate

Around the time of World War II, a statistical modelling approach called sequential testing was invented. This approach uses a different equation compared to the T Testing model/fixed horizon approach. In sequential testing, the sample size does not need to be fixed in advance. This means that the data can be evaluated in real-time as soon as it is collected. The business value in sequential testing is that you do not need to wait x amount of time until you can trust your data. You can hit statistical significance sooner, without having to worry about false positives corrupting the results. In essence, you can prove your hypothesis quicker. This is obviously very beneficial for any business running experiments. Sequential testing will increase your test velocity, which can be a key factor in being competitive.
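To show the general idea, below is a minimal sketch of the classic sequential probability ratio test from that era, applied to a single stream of visitors. This is not the algorithm inside Stats Engine, which this post does not spell out. It is only an illustration of how a test can make a decision after every visitor without a fixed sample size. The two candidate conversion rates, the error rates and the function name are assumptions for the example.

```python
import random
from math import log


def sequential_test(observations, p0, p1, alpha=0.05, beta=0.20):
    """Classic sequential probability ratio test on a stream of True/False
    conversions.

    p0:    conversion rate if nothing changed (the null hypothesis)
    p1:    conversion rate if the variation really is better
    alpha: accepted false-positive rate
    beta:  accepted false-negative rate
    Returns (decision, number of visitors seen).
    """
    upper = log((1 - beta) / alpha)  # crossing this means enough evidence for p1
    lower = log(beta / (1 - alpha))  # crossing this means enough evidence for p0
    evidence = 0.0
    n = 0
    for n, converted in enumerate(observations, start=1):
        # Every visitor nudges the running evidence up or down.
        evidence += log(p1 / p0) if converted else log((1 - p1) / (1 - p0))
        if evidence >= upper:
            return "variation is better", n
        if evidence <= lower:
            return "no improvement", n
    return "keep collecting data", n


# Simulated visitors whose true conversion rate is 12%.
rng = random.Random(7)
visitors = (rng.random() < 0.12 for _ in range(50_000))
decision, seen = sequential_test(visitors, p0=0.10, p1=0.12)
print(f"{decision} after {seen} visitors")
```

The point to notice is that the test checks the evidence after every single visitor and stops as soon as a boundary is crossed. The boundaries are built so that looking continuously does not inflate the false-positive rate.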

The other cool thing about this stats engine is that it also reports winners and losers using false discovery rate control. That's a lot of words, what exactly does this mean and why should you care? The more goals and variations you add to your experiments, the higher the chance you will see false positives. This makes sense to me. The more complex your experiment, the higher the chance that at least one of the results will be wrong. False discovery rate control dynamically shifts the threshold a result has to clear before it is reported as significant. In complex tests, the algorithm will be more conservative in calling a winner or loser compared to a simpler test. In essence, it will take longer to reach statistical significance in complex experiments, as the probability of false-positive data corrupting your business decisions is higher.
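A well-known textbook way of controlling the false discovery rate is the Benjamini-Hochberg procedure. The sketch below shows the idea on a handful of made-up p-values. I am using it as an illustration of the concept only, and the exact correction Stats Engine applies may differ.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of True/False flags, one per p-value, marking which
    results count as significant at the given false discovery rate."""
    total = len(p_values)
    # Indexes of the p-values, ordered from smallest to largest.
    order = sorted(range(total), key=lambda i: p_values[i])

    # Find the largest rank whose p-value sits under its own sliding threshold.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / total * fdr:
            cutoff_rank = rank

    # Everything ranked at or below the cutoff is significant.
    significant = [False] * total
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[index] = True
    return significant


# Made-up p-values for five goal/variation combinations in one experiment.
p_values = [0.001, 0.012, 0.025, 0.045, 0.200]
print(benjamini_hochberg(p_values))
# -> [True, True, True, False, False]
```

The fourth result has a p-value of 0.045, which would pass a plain 0.05 threshold on its own. Because it is one of five results, the bar it has to clear is stricter and it is not called a winner. Add more goals and variations and the bar gets stricter still.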

Stats Engine can provide a real-time and accurate confidence level on all experiments, indicating how confident the algorithm is that the uplift it is reporting is correct and not false-positive data. The engine automatically adjusts how strict it is in calling a result based on the complexity of the experiment.

Using this combined approach will massively improve your test program's velocity. Instead of using a single baseline for every test and having to wait longer than you need to, the Optimizely Stats Engine auto-adjusts to your tests to give you the results for each experiment as quickly as possible. This is much better than having to wait for a fixed time for every single test. You get these adjustments without having to do anything.

This auto-correcting algorithm is a nice touch. In stats engines that do not provide this ability, it is left up to you to use a manual sample size calculator like this one to try and figure out if you can close an experiment sooner. This not only wastes your time but also adds another touchpoint where false-positive data can occur. This in essence is what's unique to the Optimizely Stats Engine. Sequential testing, combined with false discovery rate control, provides a real-time view of your experiments that also gives you a real-time assessment of how confident the algorithm is of being correct. This means you can test as many goals and variations as you want while keeping the risk of false positives under control. This is the first time a web experimentation platform has been able to provide this ability. Pretty cool, hey? Happy Coding 🤘

[1]: /media/cmrhnhmv/optimizely-stats-engine-explained-for-dummies.jpg