In this tutorial, you will learn about a new category of software tooling: feature experimentation. The goal is to give you a clearer picture of the capabilities a feature experimentation tool has to offer.

The trouble you will encounter when researching this category of software is finding unbiased information. Most articles have been written by software vendors, making it very hard to know which capabilities are essential and which are nice-to-haves.

This is why this article aims to provide you with an unbiased checklist of the decision criteria you should consider when picking a feature experimentation tool for your company.

If you are new to this category and you want to learn how to become a feature experimentation authority, read on 🔥🔥🔥


What is feature experimentation?

The first question to answer is: what is a feature experimentation tool? The simple answer is a feature flagging tool combined with an experimentation tool. Historically, software vendors came from either an experimentation background or a feature management background.

As with all software, each type of vendor has continued to improve its product year on year. In recent years, the gap between these two types of tools has become increasingly small. Nowadays, each type of tool has features that were historically part of the other's skill set. The gap is now so small it's barely noticeable, which is why a new term is needed: feature experimentation.

Feature experimentation encompasses both feature flagging and server-side experimentation capabilities within a single tool. A decent feature experimentation tool should offer these capabilities:

  • Feature flagging
  • Dark launches, staged, progressive, and targeted releases
  • Feature configuration and management
  • Experimentation

These four factors combined make up the category of feature experimentation.
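To make the release styles above concrete: under the hood, a staged or progressive rollout usually comes down to deterministic bucketing, i.e. hashing the user ID and comparing the bucket against the rollout percentage. Here is a minimal sketch (my own illustration, not any specific vendor's algorithm):

```typescript
// Minimal sketch of percentage-based rollout bucketing.
// Hashing keeps bucketing deterministic, so the same user keeps
// the same experience as the rollout percentage grows.
function isFeatureEnabled(rolloutPercent: number, userId: string): boolean {
  let hash = 0;
  for (const char of userId) {
    hash = (hash * 31 + char.charCodeAt(0)) >>> 0; // simple 32-bit hash
  }
  const bucket = hash % 100; // place the user in bucket 0..99
  return bucket < rolloutPercent;
}

// Stage 1 of a progressive release: enable for roughly 10% of users.
console.log(isFeatureEnabled(10, "user-42"));
```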

Client-Side vs Server-Side Experimentation

Of the list above, the capability with the broadest scope is experimentation. Feature experimentation does not encompass all types of experimentation; instead, it focuses on server-side experimentation only. In this section, we will look at what that means.

When it comes to A/B testing, client-side experimentation was the original technique that allowed companies to run experimentation campaigns on their websites. To get started with client-side experimentation, you add a vendor's JavaScript snippet to your website. Non-technical people then get access to a visual editor, which they can use to test hypotheses and design changes on the website. Using statistical significance, they can then compare the performance of the variations.

Server-side experimentation is a different beast. It is aimed more at developers and product owners, because server-side experimentation translates into a more technical definition: code-first experimentation.

Regardless of whether you want to use server-side experimentation or feature flagging, at some point a developer will need to write some code. The code required to implement a feature flag is very similar to the code required to implement a server-side experiment. Technically, it makes little sense for a developer to write one lot of code for feature flagging in your codebase and another lot of code just to run experiments. This is why feature flagging and experimentation go hand in hand.
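To illustrate how similar the two call sites are, here is a minimal sketch. The SDK client and method names are hypothetical stand-ins, not any specific vendor's API:

```typescript
// Hypothetical SDK client; real vendor SDKs expose very similar calls.
const client = {
  isEnabled: (flagKey: string, userId: string): boolean => true,
  getVariation: (experimentKey: string, userId: string): string => "one-step",
};

// Feature flag: a single boolean decision.
if (client.isEnabled("new-checkout", "user-42")) {
  console.log("render the new checkout");
}

// Server-side experiment: a variation decision, but the same shape of code.
switch (client.getVariation("checkout-experiment", "user-42")) {
  case "one-step":
    console.log("render the one-step checkout variation");
    break;
  default:
    console.log("render the control checkout");
}
```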

Feature management capabilities

Assuming you are sold on the idea of needing a feature experimentation tool, the hard part of the process starts: how do you pick the best tool to use?

At one end of the spectrum, some teams might want to build something in-house. It is not uncommon for teams to use a tool like Firebase as a feature toggling solution. The big issue with a homegrown solution is the hidden cost of time and maintenance.

An in-house team will never be able to build something better than a company that specializes in experimentation and feature flagging day in, day out. The hidden cost of a homegrown solution might not be obvious at first glance. That hidden cost is time.

After that homegrown tool is built, someone needs to keep the lights on. This ongoing maintenance work includes fixing it when it breaks, lost productivity during downtime, server upgrades, server costs, updates so the solution works with new technologies, and so on.

It is impossible to say exactly how much time this will take; however, one thing you can say is that a number of your developers will not be focused on activities that meet the company's KPIs. Instead, they will be spending more time on busy work to keep the lights on.

Assuming you decide on an off-the-shelf approach, when you start looking at potential vendors you will quickly notice a theme. Some companies, like Optimizely and AB Tasty, come from an experimentation background, while other vendors, like LaunchDarkly and Flagsmith, come from a pure feature-flagging background.

This might sound like I'm stating the obvious, but as you would expect, the experimentation companies tend to be much stronger in experimentation, and the feature flagging tools tend to have a slight edge in feature flagging capabilities.

This makes sense: for experimentation vendors, server-side experimentation is a natural next step. Feature flagging vendors came from a different angle but reached the same conclusion. At some point, these vendors realized that they also needed to provide testing capabilities, to ensure that any component wrapped in a feature flag is the best version it can be.

Granted, experimentation-first companies can lack a few features that feature-flag-first companies possess, and the same is true the other way around. For example, LaunchDarkly only started doing experimentation around 2022, while Optimizely helped define the experimentation category back in 2010. Obviously, there will be no comparison between someone who has been a market leader for 13 years and someone who has just got into it.

When you look at the current set of tools, this is the trade-off you will need to make: is feature flagging or experimentation more important to you? Looking towards the future, I personally think it will be easier for an experimentation company to become the leading feature flagging tool than vice versa. The reason is that being really good at experimentation is hard: you need data, algorithms, partnerships with universities, clients, data sets, and so on. Feature flagging, while still hard, is a much easier thing to get into.

So what do you need to think about when it comes to experimenting with features? In my view, the essentials are:

  • A/B testing

  • Multi-armed bandits (see the sketch below)

  • Multi-flag experiments

There are also features worth considering, like:

  • Mutually exclusive tests

  • Multivariate tests (MVT)

In my experience, these types of features are only needed by really large companies that run experiments at scale.
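As a quick illustration of the multi-armed bandit item above, the simplest bandit strategy is epsilon-greedy: mostly serve the best-performing variation, but keep exploring the others. A toy sketch (my own illustration; real platforms use more sophisticated approaches):

```typescript
// Toy epsilon-greedy bandit: mostly exploit the best-converting arm,
// occasionally explore a random one.
type Arm = { key: string; conversions: number; visitors: number };

function chooseArm(arms: Arm[], epsilon = 0.1): Arm {
  if (Math.random() < epsilon) {
    // Explore: pick a random arm.
    return arms[Math.floor(Math.random() * arms.length)];
  }
  // Exploit: pick the arm with the best observed conversion rate.
  return arms.reduce((best, arm) =>
    arm.conversions / (arm.visitors || 1) >
    best.conversions / (best.visitors || 1)
      ? arm
      : best
  );
}

const arms: Arm[] = [
  { key: "control", conversions: 90, visitors: 1000 },
  { key: "variation-a", conversions: 110, visitors: 1000 },
];
console.log(chooseArm(arms).key); // usually "variation-a"
```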

Aside from the types of experiments the platform offers, you also need to consider the results. The biggest value of an experimentation tool is its ability to give you reliable results in the shortest amount of time. Personally, I would rule out anything that uses a technique known as a t-test to calculate statistical significance. This model requires up-front sample size calculations and is prone to something called the peeking problem.
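To give a feel for what those up-front sample size calculations look like, here is a rough sketch of the standard fixed-horizon calculation for a two-proportion test. The z-values are textbook statistics; the function itself is just my illustration:

```typescript
// Rough fixed-horizon sample size per variation for a two-proportion
// test: the kind of calculation a t-test style approach forces you to
// do before you are "allowed" to look at the results.
function sampleSizePerVariation(
  baselineRate: number, // e.g. 0.10 for a 10% conversion rate
  relativeLift: number, // minimum detectable effect, e.g. 0.05 for +5%
  zAlpha = 1.96, // two-sided significance, alpha = 0.05
  zBeta = 0.84 // statistical power = 0.8
): number {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (p2 - p1) ** 2);
}

// A 10% baseline with a 5% relative lift needs roughly 58,000 visitors
// per variation; peeking at the data before then inflates false positives.
console.log(sampleSizePerVariation(0.1, 0.05));
```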

The market leader, Optimizely, uses a different way of doing stats: a sequential model built in partnership with Stanford. This approach is documented in widely available white papers, and the maths exists to prove it is a good way of working with data in an experimentation platform. The other model used by other vendors is the Bayesian model. In my opinion, it is not as powerful as a sequential model, but it is still better than a t-test.

Again, the aim of this article is not to say which capabilities are better than others; however, in terms of the criteria I would deem essential in a feature experimentation tool, the statistical approach is one of the most important factors to consider.

A final point worth mentioning for your decision criteria is client-side experiments. One of the defining traits of server-side experimentation is that it requires a developer to implement every test. A lot of companies will want non-technical people AND technical people to be able to run experiments. Picking a tool today that will not grow with your team in the future could be a hidden danger. This is why I recommend you also think about whether client-side experiments will be important later on.

Streaming Flags vs Config Flags

The final area I want to focus on is feature flagging. When it comes to picking a feature experimentation tool, flexibility is an important consideration. You will want to pick a platform that allows the team to create feature flags and tests anywhere you want and in any way you see fit.

As feature management is about code, this equates to what I class as developer experience: how does a programmer actually write a feature flag, and in which programming languages can they write it?

When it comes to feature flag creation, there are two main approaches that the different vendors take.

The first approach, and the most widely used, is called a streaming flag. Just like an event stream or a pub/sub model, in this model your code maintains a constant connection to the vendor's server. The nice thing about a streaming flag is that the code required to implement the flag is standardized: write a little code and the SDK in question has your back.
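As a rough sketch of what that looks like in code, here is the general pattern using LaunchDarkly's Node server SDK as one example of the streaming approach (check the vendor's current docs for exact package names and signatures):

```typescript
// Streaming approach: the SDK opens a long-lived connection and keeps
// flag values up to date in the background.
import * as LaunchDarkly from "launchdarkly-node-server-sdk";

const client = LaunchDarkly.init("your-server-side-sdk-key");

async function handleRequest(userId: string): Promise<void> {
  await client.waitForInitialization();

  // The last argument is the fallback value served if the flag cannot
  // be evaluated, e.g. if the vendor's service is unreachable.
  const enabled = await client.variation("new-checkout", { key: userId }, false);

  console.log(enabled ? "serve new checkout" : "serve old checkout");
}

handleRequest("user-42");
```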

Playing devil's advocate, a potential concern with this type of service is what happens if the vendor's server goes down. As we are talking about enterprise software, I'm not saying a vendor's service will go down frequently; all vendors provide SLAs, so it's an unlikely scenario. I'm also not saying your application would blow up if the vendor's servers were down; all the leading vendors' SDKs have capabilities to prevent that. However, it might be a consideration for your list.

The second approach to implementing a feature flag uses a configuration file (a datafile) to bridge the gap between the codebase and the vendor's platform. The tool I am most familiar with that uses this approach is Optimizely; as far as I'm aware, it is the only tool on the market to do so.

Personally, as someone who has worked with this type of architecture a lot, I can say from first-hand experience that the benefits are not immediately obvious. It is only after working on a few projects with this approach that you see the additional benefit of having full control over how and when a flag gets updated.

When using a datafile-based solution, you have more options for deciding how often your flags are updated. You may opt to update the configuration file only when you do a release. Alternatively, you may opt to poll Optimizely and update the configuration file every 5 minutes. You could even configure the SDK to make a web request for the datafile every time it loads. The takeaway is that the datafile approach tends to give you more options.
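For example, with Optimizely's JavaScript SDK the datafile behaviour is a configuration choice. Here is a sketch based on the SDK's documented datafile options (confirm exact option names against the current docs):

```typescript
// Sketch of the two ends of the datafile spectrum with the
// Optimizely JavaScript SDK.
import { createInstance } from "@optimizely/optimizely-sdk";
import * as fs from "fs";

// Option A: poll for a fresh datafile every 5 minutes.
const polling = createInstance({
  sdkKey: "your-sdk-key",
  datafileOptions: {
    autoUpdate: true,
    updateInterval: 5 * 60 * 1000, // milliseconds
  },
});

// Option B: pin the datafile you shipped with this release, so flag
// changes only go live when you deploy.
const pinned = createInstance({
  datafile: JSON.parse(fs.readFileSync("./optimizely-datafile.json", "utf8")),
});
```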

Again playing devil's advocate, any system that offers more flexibility comes with a complexity trade-off. As developers need to think about state management, the code to implement a config-based solution often requires a few extra lines compared to a streaming flag. This is also true of Optimizely's multiple-experiments-per-flag feature. Being able to run multiple tests against the same feature flag can be really useful; however, a developer will likely need to write an extra line of code to get that benefit.
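To illustrate that extra line, a decision call in this style returns both whether the flag is on and which rule (i.e. which experiment) the user matched. A sketch based on Optimizely's decide API (verify names against the current docs):

```typescript
// Sketch: one flag carrying multiple experiments, based on
// Optimizely's decide API.
import { createInstance } from "@optimizely/optimizely-sdk";

const optimizely = createInstance({ sdkKey: "your-sdk-key" });

async function main(): Promise<void> {
  if (!optimizely) return;
  await optimizely.onReady();

  const user = optimizely.createUserContext("user-42");
  if (!user) return;

  const decision = user.decide("checkout_flag");
  if (decision.enabled) {
    // The extra line: ruleKey identifies which of the flag's
    // experiments this user was bucketed into.
    console.log(`experiment: ${decision.ruleKey}, variation: ${decision.variationKey}`);
  }
}

main();
```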

Aside from implementation, other feature flagging criteria you should consider are:

  • Supported languages
  • Documentation
  • Variables
  • REST API Access
  • CI/CD
  • Scheduling
  • Permissions

At this point, I'm hoping you have a solid understanding of what a feature experimentation tool is and some of the key decision criteria that you will need to consider when picking one. Happy Coding 🤘