Glossary

A/B testing

A decision-making method for product changes in which different groups of users are shown the old and new options and their key metrics are compared.

A/A testing

A method of checking the accuracy of the chosen approach for A/B testing in which different groups of users are shown the same option. If statistically significant differences are found, it is concluded that there are problems in the design or the chosen experimental methodology.

Frequentist statistical approach

An approach in which a point estimate of an unknown parameter is calculated, along with test statistics that have known distributions. Conclusions about the winner are drawn from the p-value or confidence intervals. __Read more__

Statistical test

A method of hypothesis testing in frequentist statistics that, with a certain probability, indicates whether to reject the null hypothesis. The appropriate statistical test depends on the experiment's parameters. __Read more__

Statistical significance

A situation in which the null hypothesis of no difference is rejected by the collected data.

Null hypothesis

A hypothesis put forward before an experiment, usually stating that there is no difference between the variants. After the experiment, we decide whether or not to reject it.

Alternative hypothesis

A hypothesis proposed as an alternative to the null one.

Type 1 error (FPR, false positive rate)

The probability of detecting a statistically significant effect when it does not actually exist, e.g., an A/A test showing statistically significant differences.

Type 2 error (FNR, false negative rate)

The probability of not detecting a statistically significant effect when it does actually exist, e.g., an A/B test showing no statistically significant differences when they do exist.

Statistical power

The ability of a test to detect an effect when it actually exists, calculated as 100% minus the FNR.

Sample size

The number of users that must be collected to identify a statistically significant effect with fixed error probabilities.
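
As an illustration, the required number of users per group can be estimated with the widely used normal-approximation formula for comparing two proportions. This is a minimal sketch with stdlib tools only; the function name and defaults are illustrative:

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base, mde, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect an absolute uplift
    `mde` over a baseline conversion rate `p_base` (two-sided test)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # quantile for the type 1 error
    z_beta = nd.inv_cdf(power)            # quantile for the power (1 - FNR)
    p_avg = p_base + mde / 2              # average rate across the two groups
    variance = 2 * p_avg * (1 - p_avg)    # pooled-variance approximation
    return math.ceil(variance * (z_alpha + z_beta) ** 2 / mde ** 2)
```

For example, detecting a 2-point uplift over a 10% baseline at the default error levels requires a few thousand users in each group.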

p-value (probability value)

The result of hypothesis testing under the frequentist statistical approach. The value is compared with the significance level (FPR): if the p-value is less than the fixed type 1 error, the null hypothesis is rejected and the difference between the test variants is considered statistically significant.
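
To make the decision rule concrete, here is a minimal stdlib-only sketch (the function name is hypothetical) of a two-proportion z-test that returns a two-sided p-value:

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a z-test comparing two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)       # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))        # equals 2 * (1 - Phi(|z|))

p = two_proportion_p_value(100, 1000, 130, 1000)   # 10% vs 13% conversion
reject_null = p < 0.05                             # True here: p is about 0.036
```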

Confidence interval

An interval that covers the true value of a parameter with a certain level of confidence. It is one of the results of a frequentist test and can be used interchangeably with the p-value when making a decision.

Note that it cannot be interpreted as an interval containing a certain fraction of all possible true values. __Read more__

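
As a sketch, a normal-approximation confidence interval for a conversion rate can be computed as follows (the function name is illustrative; stdlib only):

```python
from statistics import NormalDist

def conversion_ci(conversions, users, confidence=0.95):
    """Normal-approximation confidence interval for a conversion rate."""
    p = conversions / users
    z = NormalDist().inv_cdf((1 + confidence) / 2)     # e.g. 1.96 for 95%
    margin = z * (p * (1 - p) / users) ** 0.5
    return p - margin, p + margin

lo, hi = conversion_ci(120, 1000)   # roughly (0.10, 0.14) around the 12% rate
```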

Multiple testing

A situation in which more than one hypothesis needs to be tested in a single experiment.

Multiple comparisons problem

An increase in the probability of a type 1 error that occurs during multiple testing. __Read more__

Multiple testing correction

Methods that adjust the significance level (or the p-values) when several hypotheses are tested in one experiment, keeping the overall error rate under control.
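
The simplest such correction, Bonferroni, tests each hypothesis at the significance level divided by the number of hypotheses; a minimal sketch:

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni correction: each hypothesis is tested at alpha / m,
    which keeps the family-wise false positive rate at or below alpha."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

print(bonferroni_reject([0.003, 0.020, 0.040]))   # [True, False, False]
```

With three hypotheses the per-test threshold drops to about 0.0167, so only the smallest p-value survives.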

Peeking problem

An error caused by the premature completion of a classic A/B test, usually as soon as a statistically significant result is observed. It leads to an increase in both the FPR and the FNR.

Bayesian inference

An approach in which not just a point estimate of a statistical parameter is calculated but its entire distribution, based on our assumptions about its form (prior expectations) and the information received (collected data). __Read more__

Prior distribution

The distribution of values assumed for the studied parameter before a Bayesian test is conducted. __Read more__

Posterior distribution

The distribution of all possible values of the metric, recalculated by combining the prior expectations with the data obtained during the experiment. __Read more__
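
For a conversion rate, the standard conjugate Beta-Binomial update makes this recalculation a one-liner; a minimal sketch with an illustrative function name:

```python
def beta_posterior(alpha_prior, beta_prior, conversions, users):
    """Conjugate Beta-Binomial update: a Beta(a, b) prior plus observed
    data yields a Beta(a + successes, b + failures) posterior."""
    return alpha_prior + conversions, beta_prior + users - conversions

a, b = beta_posterior(1, 1, 30, 200)   # uniform prior + 30/200 conversions
mean = a / (a + b)                     # posterior mean of the conversion rate
```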

Probability of superiority (chance to beat control)

The probability that the selected variant is better than the other test variants. It can be used as a criterion for completing a Bayesian experiment.

Expected losses

The average amount we expect to lose if we choose a given test variant when it is not actually the best. It can be used as a criterion for completing a Bayesian experiment.
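
Both the chance to beat control and the expected loss can be estimated by Monte Carlo sampling from the posteriors. A minimal sketch assuming Beta posteriors over two conversion rates (the function name and prior are illustrative):

```python
import random

def bayesian_summary(conv_a, n_a, conv_b, n_b, draws=20_000, seed=42):
    """Monte Carlo estimate of B's chance to beat A and of B's expected
    loss, using Beta(1 + conversions, 1 + non-conversions) posteriors."""
    rng = random.Random(seed)
    beats, loss = 0, 0.0
    for _ in range(draws):
        a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        beats += b > a
        loss += max(a - b, 0)   # regret if we ship B while A is actually better
    return beats / draws, loss / draws

chance_to_beat, expected_loss = bayesian_summary(100, 1000, 130, 1000)
```

With 10% vs 13% conversion on 1000 users each, B beats A in the vast majority of draws and its expected loss is close to zero.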

Credible interval

The interval that contains a certain fraction of all possible values of the studied parameter. It is one of the results of the Bayesian approach and can be used as a criterion for completing a Bayesian experiment. __Read more__

The multi-armed bandit problem in A/B testing

A problem in which users must be optimally distributed among the test variants during the experiment in order to maximize total revenue. With multi-armed bandits, the proportion of users per variant changes over the course of the experiment, unlike classical and Bayesian tests, where users are split into equal groups.

Thompson sampling (Thompson's algorithm)

An algorithm based on the Bayesian statistical approach. It maximizes revenue in the multi-armed bandit task by diverting more and more traffic to the leading variant as the experiment progresses.
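
The core step can be sketched in a few lines: sample one conversion rate per variant from its posterior and show the variant whose sample is highest. This is a minimal illustration assuming Beta posteriors; the function name and tallies are hypothetical:

```python
import random

def thompson_choice(stats, rng=random):
    """Pick the variant whose conversion rate, sampled from its
    Beta(1 + conversions, 1 + failures) posterior, is the highest."""
    samples = {
        name: rng.betavariate(1 + conv, 1 + shown - conv)
        for name, (conv, shown) in stats.items()
    }
    return max(samples, key=samples.get)

# Running tallies: (conversions, users shown) per variant.
stats = {"A": (40, 500), "B": (70, 500)}
variant = thompson_choice(stats)                              # usually "B"
stats[variant] = (stats[variant][0], stats[variant][1] + 1)   # show it, record outcome
```

Because the better-performing variant wins the sampled draw more often, traffic shifts toward it automatically while weaker variants still get occasional exploration.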