Multi-Armed Bandit Algorithms for Website Optimization


If you are thinking of optimizing your website for more conversions, A/B testing is the default option. It is a reliable way to find out which versions of your website work best and bring in more conversions. Here we will discuss bandit algorithms for website optimization, which can be seen as a more advanced form of A/B testing. Multi-armed bandit algorithms are machine learning algorithms used to optimize A/B testing.

A Recap on standard A/B testing

Before we jump on to bandit algorithms for website optimization, let’s take a quick look at the standard A/B testing process and why you need Bandit algorithms.

In a standard A/B test, traffic is split equally between the two variations, and the test runs until it reaches statistical significance (a measure of confidence that the effect we measure is actually present), which determines the winning variant. This is the exploration phase. Once you find the winner, you usually send 100% of your traffic to the winning variant, which is the exploitation phase.
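As a rough illustration of the significance check described above, here is a minimal sketch of a two-proportion z-test, one common way to compare two conversion rates. The function name, the visitor counts and the 0.05 threshold are assumptions made for this example; real testing tools may use different statistical methods.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 5,000 visitors per variant.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")

# 0.05 is a conventional threshold, not a mandatory one.
significant = p < 0.05
```

The smaller the p-value, the less likely it is that the observed difference is due to chance alone, which is why a test with little traffic often has to keep running.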

Well-designed A/B tests can give marketers a wealth of information and insights on how to optimize the website, landing page or marketing campaign on which the tests are run.

So, what are the limitations of standard A/B testing?

  • During the exploratory phase, there is a wastage of resources while exploring the inferior options in order to gather as much data as possible.
  • You would have to run the tests for quite some time in order to reach the statistical confidence and get enough traffic to the variation for the results to be accurate.
  • It jumps directly from exploration to exploitation rather than a smooth transition.

Multi-armed bandit algorithms try to address these limitations by being more adaptive: they move from exploration to exploitation gradually and automatically, which reduces the opportunity cost incurred during a standard A/B test. We will look at this in detail in the coming sections. Let’s first understand what bandit algorithms are.

The Multi-Armed bandit problem

Multi-armed bandits in casino

The term “multi-armed bandit” comes from the classic slot machines in casinos, where a gambler must choose between multiple levers, each giving an unknown payout (they are called bandits because casinos usually rob you of your money). The goal of the gambler is to determine which lever to pull in order to get the maximum reward over a set of trials. The question is how the gambler can learn which arm is the best one to pull to get the most money in the shortest amount of time.

An approach to this problem is doing a set of trials for all the arms to identify which arm gives returns more frequently and more money (exploration) and after a set of trials when you get confidence, to decide on which arm to play often (exploitation). Exploration is the learning phase and exploitation is the earning phase.
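The try-everything-then-commit approach described above can be sketched as a small simulation. This is only an illustration: the function name, the payout probabilities and the number of pulls are made up for the example, and each lever is assumed to pay out 1 unit with a fixed probability.

```python
import random


def explore_then_commit(payout_probs, pulls_per_arm, total_pulls, seed=0):
    """Pull every arm a fixed number of times, then play the best-looking arm."""
    rng = random.Random(seed)
    n_arms = len(payout_probs)
    wins = [0] * n_arms
    reward = 0

    # Exploration (learning): try each arm the same number of times.
    for arm in range(n_arms):
        for _ in range(pulls_per_arm):
            win = rng.random() < payout_probs[arm]
            wins[arm] += win
            reward += win

    # Pick the arm with the most observed wins (ties go to the first arm).
    best_arm = max(range(n_arms), key=lambda a: wins[a])

    # Exploitation (earning): spend all remaining pulls on that arm.
    for _ in range(total_pulls - n_arms * pulls_per_arm):
        reward += rng.random() < payout_probs[best_arm]

    return best_arm, reward


# Hypothetical machine with three levers.
best, total = explore_then_commit([0.2, 0.5, 0.3], pulls_per_arm=100, total_pulls=2000)
print(f"committed to arm {best}, total reward {total}")
```

Note the trade-off in `pulls_per_arm`: too few exploratory pulls risk committing to the wrong arm, while too many waste pulls on inferior arms.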

Thus, the multi-armed bandit problem is all about balancing two things: gathering enough information to find the best strategy, and acting on the best option found so far. In machine learning, this exploration and exploitation approach can be achieved through learning algorithms that acquire knowledge and take actions in their environment to maximize rewards.

Standard A/B Testing vs Multi-Armed Bandit Algorithms: Why Do You Need Multi-Armed Bandits?

We have already seen a few of the limitations of standard A/B testing. The bandit algorithms discussed above help balance the exploration and exploitation phases. A standard A/B test starts as a purely exploratory method, where you assign an equal number of users to the different versions of your website, landing page or campaign, and then jumps to complete exploitation based on the results.

A multi-armed bandit (MAB) solution, by contrast, does not have two distinct periods of exploration and exploitation; it does both simultaneously. It uses the results collected so far to allocate more traffic to variants that are performing well and less traffic to underperforming variants.

The figure below shows traffic distribution over time and the cumulative count of conversions for A/B testing and multi-armed bandit optimization. It can be clearly seen that conversions are higher in the case of MAB.

Image Courtesy: Optimizely

Hence, with bandits, traffic will be automatically steered towards better-performing pages. Thus, multi-armed bandits can produce higher overall payoff while still allowing you to gather information on how users react to each variant.

Types of Bandit Algorithms

Bandit algorithms use the reinforcement learning paradigm of machine learning to balance the exploration and exploitation phases during learning, taking action based on the current best information. There are different bandit algorithms, and each of them balances the exploration-exploitation dilemma to a different degree. Here we will take a look at some of the popular ones:

  1. Epsilon Greedy: The epsilon greedy algorithm, as the name implies, is “greedy”. What do greedy people do? They take the action that seems most beneficial to them at the moment. So does this algorithm: most of the time (say 80%), it picks the option that looks best at that moment (exploitation). For the remaining share of the time (say 20%), it randomly explores the other available options (exploration). Thus, the epsilon greedy algorithm explores with probability e and exploits with probability (1-e).

However, epsilon greedy is not a very principled approach: it explores at a fixed rate and does not take statistical significance into account.

Image Courtesy: Conductrics
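The epsilon greedy rule described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the class name, the simulated conversion rates and the value e = 0.2 are assumptions made for the example. One simplification to note: when exploring, this sketch picks uniformly among all variants, including the current best one.

```python
import random


class EpsilonGreedy:
    """Explore with probability epsilon, otherwise play the best-looking arm."""

    def __init__(self, n_arms, epsilon=0.2, seed=None):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # times each variant was shown
        self.values = [0.0] * n_arms  # observed mean reward per variant
        self.rng = random.Random(seed)

    def select_arm(self):
        if self.rng.random() < self.epsilon:
            # Exploration: pick any variant uniformly at random.
            return self.rng.randrange(len(self.counts))
        # Exploitation: pick the variant with the highest observed mean.
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the running mean.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Simulated visitors; the true conversion rates are hypothetical.
true_rates = [0.05, 0.10]
bandit = EpsilonGreedy(n_arms=2, epsilon=0.2, seed=42)
visitor_rng = random.Random(7)
for _ in range(10_000):
    arm = bandit.select_arm()
    converted = visitor_rng.random() < true_rates[arm]
    bandit.update(arm, 1.0 if converted else 0.0)

print(bandit.counts, [round(v, 3) for v in bandit.values])
```

Because e stays fixed, a share of the traffic keeps going to random variants even after a clear winner has emerged.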


2. Upper Confidence Bound: This is a more complicated algorithm. It figures out which option could be the best one based on your statistical level of confidence, and plays that option as much as possible. For each variant, the algorithm computes an upper confidence bound (UCB), which represents the highest plausible estimate of the payoff for that option.
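As an illustration, here is a minimal sketch of UCB1, one well-known member of the upper confidence bound family. The article does not say which UCB variant a given tool uses, so treat the class name and the exact bonus term as assumptions of this example. Rewards are assumed to lie between 0 and 1 (for instance, 1 for a conversion and 0 otherwise).

```python
import math


class UCB1:
    """Play the arm with the highest optimistic estimate (mean + bonus)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select_arm(self):
        # Play every arm once before any bound can be computed.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm

        total = sum(self.counts)

        def upper_bound(arm):
            # The bonus shrinks as an arm is played more often.
            bonus = math.sqrt(2 * math.log(total) / self.counts[arm])
            return self.values[arm] + bonus

        return max(range(len(self.counts)), key=upper_bound)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


ucb = UCB1(n_arms=3)
first_choice = ucb.select_arm()  # no data yet, so the first arm is tried
```

Rarely shown variants get a large bonus, so they are revisited from time to time; variants with many impressions are judged almost entirely on their observed mean.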

3. Thompson Sampling (Bayesian Bandits): Thompson sampling is also a more principled approach. It builds a probability distribution over the success rate of each variant from the observed results. For each decision, we sample one possible success rate from each variant’s distribution and select the variant with the largest sampled rate. With more observed data points, the sampled success rates become more and more likely to be close to the true rates.

Image Courtesy: Conductrics
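The sampling step described above can be sketched as follows for conversion data, where each visit either converts or does not. This sketch assumes a Beta distribution per variant with a uniform Beta(1, 1) starting point, which is a common choice for yes/no outcomes but not the only possible one; the class and method names are made up for the example.

```python
import random


class ThompsonSampling:
    """Beta-Bernoulli Thompson sampling for conversion (yes/no) rewards."""

    def __init__(self, n_arms, seed=None):
        self.successes = [0] * n_arms
        self.failures = [0] * n_arms
        self.rng = random.Random(seed)

    def select_arm(self):
        # Draw one plausible success rate per variant from its Beta
        # distribution; the "+ 1" terms encode the uniform Beta(1, 1) prior.
        samples = [
            self.rng.betavariate(s + 1, f + 1)
            for s, f in zip(self.successes, self.failures)
        ]
        # Show the variant whose sampled rate is the largest.
        return max(range(len(samples)), key=lambda a: samples[a])

    def update(self, arm, converted):
        if converted:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


ts = ThompsonSampling(n_arms=2, seed=0)
arm = ts.select_arm()
ts.update(arm, converted=True)
```

With little data the distributions are wide, so every variant has a real chance of being picked; as data accumulates they narrow, and traffic concentrates on the best variant.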

4. Contextual Bandits: Contextual bandits are extensions of the multi-armed bandit in which the situation of the experiment, or the state of the environment, is also considered when making a decision. This makes the algorithm context-aware. The model thus lets you optimize decisions based on previous observations and also personalize decisions for every situation. A contextual bandit algorithm observes a context and makes a decision based on it, with the goal of maximizing the average reward.
In website optimization, the context is information about the user, such as where they come from or which pages of the site they visited previously. It is used to give the user a more personalized experience, for example by choosing which content to display or selecting the best image to show on the page.
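To make the idea of context concrete, here is a deliberately simple sketch that keeps separate epsilon greedy statistics for each context value, such as the traffic source. This is a simplification chosen for illustration: practical contextual bandits (LinUCB is one known example) usually fit a model over user features so that learning carries over between similar contexts. The class name and the context labels are hypothetical.

```python
import random
from collections import defaultdict


class PerContextEpsilonGreedy:
    """Keep independent epsilon greedy statistics for each context value."""

    def __init__(self, n_arms, epsilon=0.1, seed=None):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # One list of counts and mean rewards per context, created on demand.
        self.counts = defaultdict(lambda: [0] * n_arms)
        self.values = defaultdict(lambda: [0.0] * n_arms)

    def select_arm(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)
        values = self.values[context]
        return max(range(self.n_arms), key=lambda a: values[a])

    def update(self, context, arm, reward):
        counts = self.counts[context]
        values = self.values[context]
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]


# Hypothetical contexts: where the visitor came from.
cb = PerContextEpsilonGreedy(n_arms=2, epsilon=0.0)
cb.update("search", arm=0, reward=1.0)  # search visitors liked variant 0
cb.update("social", arm=1, reward=1.0)  # social visitors liked variant 1
print(cb.select_arm("search"), cb.select_arm("social"))
```

The same page can therefore settle on different winning variants for different groups of visitors.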

Pros and Cons of Multi-Armed bandit tests

Multi-armed bandits can give better optimization results faster and can be a better option to standard A/B testing in many instances. Here we are summarizing some of the advantages of using Bandit algorithms for website optimization:

  • Speed: They can give you answers more quickly.
  • Automation: They automate the optimization process and gradually move traffic toward winning variations using machine learning.
  • Opportunity Cost: Minimizes the loss incurred during data collection while running optimization experiments.

But, every coin has another side and here are some of the disadvantages of Bandits:

  • Computational complexity: It is more difficult and resource-intensive to run multi-armed bandit tests.
  • Loss of control: With bandits, you hand over all the decision making to a system. There can be instances where the system becomes biased by its inputs and makes wrong decisions because it cannot understand the situation behind the data.

When to use Multi-Armed Bandits instead of standard A/B tests?

There have been debates on multi-armed bandits vs A/B tests for website optimization, but I feel both have their own benefits when used for the right purposes. A clear-cut comparison of the two shouldn’t be made, as they serve different purposes.

A/B testing is the right choice for strict experiments. If you want to measure the effect of a particular treatment or element, A/B testing might be the best option. If you care more about optimization and conversions than about understanding the effect, bandits are probably the best bet to optimize your average conversions. Multi-armed bandits let you earn while you learn.

Here are some of the best use cases for Bandit algorithms.

  1. Short tests: Bandit algorithms are best suited to short-term tests where you have only a short time for both exploration and exploitation. The best examples are news headlines and short-term campaigns or promotions. Say you are running tests for an eCommerce site on Black Friday: with A/B testing you might only be able to decide at the end of the day, but with a bandit algorithm you can drive more traffic to the better-performing variation sooner and thus increase your revenue.
  2. Long-term changes: Bandit algorithms are suited to long-term or ongoing tests as well. They let you automate ongoing optimization of your site at a lower risk. For a news website, say, a bandit algorithm can be used to determine the best order in which to display the top stories.
  3. Targeting: Contextual bandit algorithms can be used to target specific ads or content at particular sets of users.


Bandit algorithms can definitely be a great choice for conversion optimization of landing pages, promo pages and digital advertising. You can benefit from faster results and increased average conversions using these algorithms, so they are worth a try. Many A/B testing tools today have multi-armed bandit algorithms incorporated in them.

For example, Optimizely, one of the most popular A/B testing tools in the market today, allows its users to set up multi-armed bandit optimization in their experiments easily. It also offers another feature, Stats Accelerator, which can be described as a multi-armed bandit: it monitors ongoing experiments and automatically adjusts traffic distribution among variations, helping users capture more value by reducing the time to statistical significance or increasing conversions.

I hope this article helps you optimize your website conversions. Leave a comment in case of any questions!
