Study Notes on Online Controlled Experiments — Part 2 (Multivariant testing, practical lessons)

Multivariable Testing

Multivariable testing (MVT): an experiment that includes more than one factor (experimental variable), i.e., the treatment variant differs from the control variant in more than one way. With this single test, we can estimate the main effect of each factor as well as the interaction effects between factors.

  • Test many factors in a short period of time, accelerating improvement.
  • You can estimate the interaction between two factors. Two factors have an interaction effect if their combined effect differs from the sum of their individual effects.
  • Some combinations of factors may give a poor user experience.
  • Analysis and interpretation are more difficult.
  • It can take longer to begin the test.
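As a toy illustration of the interaction-effect definition above (all rates are made-up numbers, not from any real experiment):

```python
# Hypothetical 2x2 factorial: factor A (e.g. button color) and
# factor B (e.g. headline), measured against a shared control.
baseline = 0.100   # control conversion rate
lift_a = 0.010     # lift from applying change A alone
lift_b = 0.005     # lift from applying change B alone
combined = 0.125   # observed rate with both changes applied

# Under a purely additive model, applying both changes would yield:
expected_additive = baseline + lift_a + lift_b   # 0.115

# The interaction effect is the deviation from additivity:
interaction = combined - expected_additive
print(f"interaction effect: {interaction:+.4f}")
```

Here the two changes together beat the additive prediction, so A and B interact positively; a negative value would mean the changes partially cancel each other.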

3 Approaches to Conducting MVT

  1. Traditional MVT — fractional factorial and Plackett-Burman (Taguchi) designs
  • With a full factorial design, you can estimate any interactions you want (analyzed via ANOVA).
  • You can turn off any factor at any time if it proves disastrous, without affecting the other factors.
  • Increasing the number of levels (variants) for a factor.
  • Unbalanced population between treatments and control.

Implementation architecture

Implementing an experiment on a website involves three components:

  1. randomization algorithm
  2. assignment method
  3. data path

Randomization algorithms

A randomization algorithm is a function that maps end users to variants. It must have the following three properties to support statistically correct experiments:

  1. End users must be equally likely to see each variant of an experiment.
  2. Repeat assignments of a single end user must be consistent: the end user should be assigned to the same variant on each successive visit to the site (a concern with cross-device tracking and incognito visits).
  3. When multiple experiments are run, there must be no correlation between experiments. An end user's assignment to a variant in one experiment must have no effect on the probability of being assigned to a variant in any other experiment.
  • Support monotonic ramp-up: the percentage of users who see a treatment can be slowly increased without changing the assignments of users who were previously assigned to that treatment. The treatment percentage can thus grow without disturbing the user experience or damaging the validity of the experiment.
  • Support external control, meaning that users can be manually forced into and out of variants.
  1. Pseudorandom with caching
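A minimal hash-based sketch (my illustration, not any specific product's implementation) showing how one deterministic function can satisfy all three required properties plus monotonic ramp-up:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, treatment_pct: int) -> str:
    # Hashing the user id together with the experiment id makes assignment
    # consistent for a given user (property 2) and uncorrelated across
    # experiments (property 3); a roughly uniform hash makes every bucket
    # equally likely (property 1).
    key = f"{user_id}:{experiment_id}".encode()
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 100
    # Buckets below the threshold get the treatment.  Raising treatment_pct
    # only adds users to the treatment, never reassigns an existing one,
    # which is exactly the monotonic ramp-up property.
    return "treatment" if bucket < treatment_pct else "control"
```

External control can be layered on top as an override table consulted before the hash.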

Assignment method

  1. Traffic splitting
  • Amazon’s home page is built on a content management system that assembles pages from individual units — slots.
  • The system refers to page metadata at render time to determine how to assemble the page.
  • Non-technical content editors schedule pieces of content in each slot through a UI that lets them edit this page metadata for a specific experiment.
  • As a page request comes in, the system executes the assignment logic for each scheduled experiment and saves the results to the page context, where the page assembly mechanism can react to them.
  • The content management system only needs to be modified once; thereafter, experiments can be designed, implemented, and removed by editing the page metadata through the UI.
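The render-time flow above can be sketched roughly as follows (field names, the metadata shape, and the hash-based assignment stand-in are all my assumptions, not Amazon's actual system):

```python
import hashlib

def assign(user_id: str, experiment_id: str, treatment_pct: int) -> str:
    # Stand-in for the real assignment logic: deterministic hash bucketing.
    bucket = int(hashlib.md5(f"{user_id}:{experiment_id}".encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

def build_page_context(user_id: str, scheduled_experiments: list) -> dict:
    # Execute the assignment logic for each experiment scheduled in the page
    # metadata and save the results to the page context; the page assembly
    # step can then choose each slot's content accordingly.
    return {exp["id"]: assign(user_id, exp["id"], exp["treatment_pct"])
            for exp in scheduled_experiments}

# Metadata as a content editor might schedule it through the UI:
experiments = [{"id": "homepage-hero", "treatment_pct": 50}]
ctx = build_page_context("user-42", experiments)
```

The key design point is that slot renderers read variants from the page context instead of calling the assignment logic themselves, so experiments come and go purely through metadata edits.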

Summary of relative advantages and disadvantages of different assignment methods

Data Path

To compare metrics across experiment variants, a site must record the treatment assignment, page views, clicks, revenue, user identifier, variant identifier, experiment identifier, etc. Site experimentation raises some specific data issues:

  1. Event-triggered filtering
  • Using existing data collection: usually not set up for statistical analysis, so it requires manual work and frequent code changes to accommodate new experiments.
  • Local data collection.
  • Server-based collection: a service call centralizes all observation data and makes analysis easy. This is the preferred method when possible.
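A minimal sketch of what one observation record sent to such a centralized collection service might look like (the field names and JSON shape are my assumptions):

```python
import json
import time

def observation(user_id, experiment_id, variant_id, event, value=None):
    # One observation record, serialized for a centralized collection
    # service; downstream analysis joins these on user/experiment/variant.
    return json.dumps({
        "ts": time.time(),          # event timestamp
        "user_id": user_id,
        "experiment_id": experiment_id,
        "variant_id": variant_id,
        "event": event,             # e.g. "page_view", "click", "purchase"
        "value": value,             # e.g. revenue, for purchase events
    })

record = observation("user-42", "homepage-hero", "treatment", "click")
```

Keeping the experiment and variant identifiers on every record is what makes per-variant metric comparison a simple group-by at analysis time.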

Practical Lessons Learned


  1. Mine the data. An experiment provides rich data — much more than a single bit about whether the difference in the OEC is statistically significant. E.g., one experiment showed no significant difference overall, but users on a specific browser version did significantly worse in the treatment; investigation revealed that the treatment's JavaScript was buggy for that browser version.
  • Do single-factor experiments for gaining insights and when you make incremental changes that can be decoupled.
  • Try some bold bets and very different designs; you can then perturb the winning version to improve it further.
  • Use full or fractional factorial designs to estimate interactions when several factors are suspected to interact strongly.

Trust and execution

  1. Run continuous A/A tests in parallel with other experiments to validate the following:
  • Are users split according to the planned percentages?
  • Does the data collected match the system of record?
  • Are the results statistically non-significant 95% of the time?
  • It is recommended to gradually increase the percentage of users assigned to the treatment(s).
  • An experimentation system that analyzes experiment data in near real time can then automatically shut down a treatment that is significantly underperforming relative to the control.
  • An auto-abort reduces the percentage of users assigned to an underperforming treatment to zero. With this safety net, you can make bold bets and innovate faster.
  • Decide on the statistical power and the effect size you would like to detect, and estimate the variability of the OEC through an A/A test. From these you can compute the minimum sample size needed and the run time for your website.
  • A common mistake is to run experiments that are underpowered.
  • Instead of running a 99%/1% split out of worry that the treatment would affect too many users, we recommend ramping up the treatment and maintaining a 50/50 split.
  • An unbalanced split means the test must run longer. The increase in run time for an A/B test relative to a 50/50 split is approximately 1/(4p(1−p)), where the treatment receives a fraction p of the traffic. A 99%/1% split would run about 25 times longer than a 50%/50% test.
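The run-time factor above, together with a standard sample-size rule of thumb (the 16·σ²/Δ² approximation for roughly 80% power at a 5% significance level is my addition, not stated in these notes), can be computed directly:

```python
from math import ceil

def min_sample_size(std_dev: float, delta: float) -> int:
    # Rule-of-thumb minimum sample size per variant for ~80% power at a
    # 5% significance level: n ≈ 16 * sigma^2 / delta^2, where sigma is
    # the OEC's standard deviation (e.g. estimated from an A/A test) and
    # delta is the smallest effect you want to detect.
    return ceil(16 * std_dev ** 2 / delta ** 2)

def runtime_factor(p: float) -> float:
    # Run-time increase relative to a 50/50 split when the treatment
    # receives a fraction p of the traffic: 1 / (4 * p * (1 - p)).
    return 1.0 / (4 * p * (1 - p))

print(min_sample_size(1.0, 0.5))   # 64 users per variant in this toy case
print(runtime_factor(0.01))        # ~25x longer for a 99%/1% split
```

Note how `runtime_factor(0.5)` is exactly 1, confirming that a balanced split minimizes run time.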

Culture and Business

  1. Agree on the OEC upfront


