FAQ
How does user distribution in A/B testing work?
The service offers manual and automatic traffic distribution between testing options.

Using manual (classic) distribution, you set the share of traffic for each testing option yourself: for instance, 50/50 or 70/30.

Manual distribution works great when you need to control the share of the audience on which you test a new feature — for instance, in classic A/B tests.

Automatic distribution is an algorithm that identifies the testing option with better performance and drives the larger share of traffic to it. What makes this method beneficial is that the algorithm can adjust the distribution on the go, without interrupting the experiment.

As a result, you get more conversions during the test, which saves time and earns more than classic A/B testing would. This distribution technique is underpinned by Bayesian statistics and Thompson sampling.

What is the difference between automatic and manual distribution?
In classic A/B testing, users are divided into groups, and each group is shown a different option. After some time, the experiment is evaluated by a specific metric (e.g., conversion into purchase). In this model, traffic shares remain fixed throughout the experiment.

The algorithm behind automatic distribution allocates traffic dynamically. During the test, it continuously analyzes the performance of the options on the selected metric and distributes users accordingly: the better an option performs, the more traffic is driven to it.

For instance, you are testing subscription screens and have selected conversion into purchase as the key metric. You have 4 paywall options, and all traffic is initially distributed evenly between them.
After a preset time from launch, Paywall 1 performs better than the others. In this case, the algorithm allocates more users to this paywall and continues to analyze the metrics. If another option starts showing a higher conversion rate, the algorithm will redistribute traffic again.
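
A minimal sketch of one such reallocation step, assuming a Beta-Bernoulli model with Thompson sampling (all counts below are illustrative, not real experiment data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Observed stats per paywall: purchases and views (illustrative numbers).
purchases = np.array([30, 18, 22, 25])
views = np.array([1000, 1000, 1000, 1000])

# A Beta(1, 1) prior updated with the observed data gives the posterior
# over each paywall's conversion rate.
alpha = 1 + purchases
beta = 1 + (views - purchases)

# Thompson sampling: draw a conversion rate for every option from its
# posterior many times; the share of draws in which an option wins
# becomes its traffic share for the next period.
draws = rng.beta(alpha, beta, size=(100_000, 4))
winners = draws.argmax(axis=1)
shares = np.bincount(winners, minlength=4) / len(winners)

for i, share in enumerate(shares):
    print(f"Paywall {i + 1}: {share:.1%} of traffic")
```

Weak options keep a small but nonzero share under this scheme, which is exactly what lets the algorithm notice and reward a laggard whose conversion rate later improves.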

The algorithm collects the metrics it needs and revenue data on its own. In such tests, poorly performing options are shown to fewer users, so experiment costs decrease.
How does A/B testing with automatic distribution make money faster?
We conducted an A/B test of paywalls with manual traffic distribution for a client from the dating vertical. Based on the results, we adopted the best-performing subscription screen.
Some time later, we fed the historical data into the automatic distribution system to see how the algorithm would have distributed the traffic.

Results:  

- Perfect option: 1,070 purchases (100%). This is the number of purchases we could have generated if we had guessed the best option at the very beginning and applied it without further testing.
- Historic: 745 purchases (69%). This is the number of purchases we actually generated with manual distribution.
- Thompson sampling: 935 purchases (87%). This is the number of purchases the automatic distribution algorithm could have delivered.

It turns out the algorithm would have handled the task better. According to the test results, the option we picked with manual distribution would also have been recognized as the best by the automatic algorithm. But for the same money actually spent, the client could have generated more purchases at a lower CAC. In monetary terms, the project missed out on some $7,500 on the focus group over 5 days.
How we lost $7,500 on mobile app A/B tests but learned how to conduct them
Why do I need to segment traffic when carrying out A/B testing?
Users coming from different channels, in different regions, and on different devices may behave differently in the app.

A great case from our experience: for one of our clients' apps, we bought traffic from Facebook and Snapchat. The test showed that Facebook campaigns were better optimized for conversion into trials, while major in-app purchases came from Snapchat.
This is why you need to keep the specifics of user acquisition in mind and segment traffic when carrying out A/B testing. Our tool helps segment traffic by source, region, device, operating system, and other parameters.

Who would benefit from the A/B testing service?
  • Those who continuously put new hypotheses to the test and want to automate their validation;
  • Those who make informed, statistics-backed decisions about a product;
  • Those who want to cut the costs of testing hypotheses and accelerate the process.
How does automatic traffic distribution allocate the audience?
Based on the current metric values of the test variants, we compute the probability of superiority (the probability to be best) for each variant and allocate the audience accordingly. For example, one hour after the start of the experiment, the test version's probability to be best is 55%, compared to 45% for the control. Therefore, 55% of users will subsequently be assigned the test version, and 45% the control one.
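A sketch of how such probabilities of superiority can be computed, assuming Beta posteriors over conversion rates and Monte Carlo draws (the counts are made up to produce a split close to the 55/45 example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data one hour into the experiment.
control = {"conversions": 20, "users": 400}
test = {"conversions": 21, "users": 400}

# Posterior conversion rates under a Beta(1, 1) prior.
p_control = rng.beta(1 + control["conversions"],
                     1 + control["users"] - control["conversions"],
                     size=200_000)
p_test = rng.beta(1 + test["conversions"],
                  1 + test["users"] - test["conversions"],
                  size=200_000)

# Probability of superiority: the fraction of posterior draws in which
# each variant has the higher conversion rate.
prob_test_best = (p_test > p_control).mean()
print(f"P(test is best)    = {prob_test_best:.0%}")  # becomes the test's traffic share
print(f"P(control is best) = {1 - prob_test_best:.0%}")
```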
From the statistical point of view, what criteria determine the sample size, the duration of the experiment, and the winner in classical A/B testing versus automatic audience distribution?
The sample size depends on the chosen metric: a z-test for proportions is used for conversions, and Student's t-test for ARPU. The test duration depends on how important it is to account for the effects of weekdays, weekends, and holidays. The winner of a classical test is determined by statistical criteria: the z-test for proportions, Student's t-test, the Mann-Whitney-Wilcoxon test, and bootstrap.
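
As a rough illustration, the standard per-group sample size formula for a two-sided z-test for proportions (the baseline rate and the lift you hope to detect are assumptions for the example):

```python
from scipy.stats import norm

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided z-test for proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2) + 1

# E.g., a 5% baseline conversion and a hoped-for lift to 6%:
print(sample_size_two_proportions(0.05, 0.06))  # about 8,150 users per group
```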
In the classical approach, you should not stop the test early even if statistical significance has already been reached, because of the peeking problem.
In the case of Bayesian testing, the winner is determined based on probabilities of superiority, credible intervals, expected losses, the size of the effect, and ROPE (the region of practical equivalence).
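A simplified sketch of these decision quantities for two variants, assuming Beta posteriors and an illustrative ROPE of half a percentage point:

```python
import numpy as np

rng = np.random.default_rng(1)

# Posterior draws of each variant's conversion rate (Beta(1, 1) prior);
# the counts 50/1000 and 65/1000 are illustrative.
p_a = rng.beta(1 + 50, 1 + 950, size=200_000)
p_b = rng.beta(1 + 65, 1 + 935, size=200_000)

# Expected loss of choosing B: the average conversion rate we give up
# in the scenarios where A is actually better (zero where B wins).
expected_loss_b = np.maximum(p_a - p_b, 0).mean()

# ROPE: if most of the credible mass of the difference lies inside a
# region of practical equivalence, the variants are treated as equal.
diff = p_b - p_a
in_rope = (np.abs(diff) < 0.005).mean()  # under 0.5 p.p. treated as negligible

print(f"P(B is best)      = {(p_b > p_a).mean():.1%}")
print(f"Expected loss (B) = {expected_loss_b:.5f}")
print(f"Mass inside ROPE  = {in_rope:.1%}")
```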
Why do Bayesian algorithms need smaller sample sizes to produce reliable results?
Bayesian statistics let us think in terms of the probability distribution of our metrics. We can calculate the probability of the target metric taking a given value, whereas classical tests yield only a single point estimate of the metric. Because of that, we can make a decision sooner and with fewer users.
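For example, after only a couple of hundred users, the full posterior of a conversion rate already supports direct probability statements (the counts are illustrative):

```python
from scipy.stats import beta

# 14 purchases out of 200 users, with a Beta(1, 1) prior.
posterior = beta(1 + 14, 1 + 200 - 14)

# The probability that the conversion rate exceeds 5% -- a statement a
# frequentist point estimate cannot make directly.
print(f"P(CR > 5%) = {posterior.sf(0.05):.1%}")

# A 95% credible interval for the conversion rate:
low, high = posterior.ppf([0.025, 0.975])
print(f"95% credible interval: [{low:.3f}, {high:.3f}]")
```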
What is the point of the multi-armed bandit algorithm?
The multi-armed bandit algorithm distributes traffic during an experiment depending on the results obtained in the previous steps. The algorithm gradually gives more traffic to the better variant, increasing the total gain over the entire experiment compared to classical tests with an equal split.
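A small simulation under assumed true conversion rates, showing how a Thompson-sampling bandit accumulates more conversions than an equal split over the same number of users:

```python
import numpy as np

rng = np.random.default_rng(7)
true_cr = np.array([0.04, 0.06])  # true rates, unknown to the algorithm
n_users = 20_000

def run(bandit):
    wins, losses = np.ones(2), np.ones(2)  # Beta posterior parameters per arm
    total = 0
    for _ in range(n_users):
        if bandit:
            arm = int(rng.beta(wins, losses).argmax())  # Thompson draw
        else:
            arm = int(rng.integers(2))                  # 50/50 split
        reward = rng.random() < true_cr[arm]
        wins[arm] += reward
        losses[arm] += 1 - reward
        total += reward
    return int(total)

print("50/50 split:", run(bandit=False), "conversions")
print("bandit     :", run(bandit=True), "conversions")
```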
Why can't we apply multi-armed bandits to financial metrics?
Thompson's algorithm, which we use for multi-armed bandits, assumes only two outcomes: 0 (no event) and 1 (event occurred). As long as the metric stays within these limits, we can guarantee that the algorithm will converge and correctly determine the winner; financial metrics can take more than two values.
What are the differences between the classical approach to A/B testing and Bayesian methods?
In classical tests, we use a frequentist statistical approach to determine the winner (a minimal sketch of such a test follows the list):
- data is collected
- type I and type II error rates are fixed
- when the assumptions are met, the test statistic is calculated: a random variable with a known distribution, which may vary depending on the distribution of the target metric
- a conclusion is made about a statistically significant difference between the options
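
A minimal two-proportion z-test on illustrative counts, following the steps above:

```python
import numpy as np
from scipy.stats import norm

# Illustrative data: conversions and users per option.
x_a, n_a = 500, 10_000
x_b, n_b = 570, 10_000

# Pooled two-proportion z-test.
p_a, p_b = x_a / n_a, x_b / n_b
p_pool = (x_a + x_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
# Reject H0 (no difference) only against the alpha fixed in advance,
# and only at the sample size planned in advance (no peeking).
```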

The following problems are encountered when using the frequentist approach in A/B testing:
- multiple testing
- the peeking problem
- violated test assumptions
- a p-value that is hard to interpret

Bayesian tests are based on a probabilistic approach to calculating our metrics. However, unlike frequentist statistics, we calculate the distribution not of the test statistic but of the metric itself, as the sketch after the list below illustrates.
The main advantages compared to the frequentist approach:
 - results are easier to interpret
 - no multiple testing problem
 - the test can be stopped early safely (no peeking problem)
 - results are obtained more quickly
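
As a contrast to the z-test sketch above, the same comparison done the Bayesian way computes the posterior of the metric difference itself (same illustrative counts):

```python
import numpy as np

rng = np.random.default_rng(3)

# Same illustrative data as in the z-test sketch.
x_a, n_a = 500, 10_000
x_b, n_b = 570, 10_000

# Posterior of each option's conversion rate, then of their difference.
p_a = rng.beta(1 + x_a, 1 + n_a - x_a, size=200_000)
p_b = rng.beta(1 + x_b, 1 + n_b - x_b, size=200_000)
diff = p_b - p_a

low, high = np.percentile(diff, [2.5, 97.5])
print(f"P(B better than A) = {(diff > 0).mean():.1%}")
print(f"95% credible interval for the lift: [{low:.4f}, {high:.4f}]")
```

These are direct probability statements about the metric itself, which is what makes the result easier to interpret and safe to monitor as data accumulates.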