CUPED: Old Wine in a New Bottle?
Anyone running A/B tests in industry knows the challenges. Detecting small but meaningful effects is difficult and time consuming. You can run the experiment longer, recruit more users, or accept that some effects will go undetected. None of those options is appealing when you’re running lots of tests and your stakeholders wanted the results yesterday.
Deng et al. (2013) proposed CUPED (Controlled-experiment Using Pre-Experiment Data) to address this problem. The idea is to collect outcome data on users before the experiment starts and statistically adjust for those pre-experiment observations when analyzing results. The pre-treatment version of the outcome captures user-specific variance that adds noise to your estimates. Removing it lets you detect a smaller effect with the same sample size, or the same effect with a smaller sample size. Let’s see how this works in practice.
A Working Example
Imagine we are running an experiment to test if an updated page layout increases the use of a search feature on our platform. We end up randomly assigning about 10,000 users to either treatment (new layout) or control (old layout). Users vary in how likely they are to use the search function. Some search more while others search less, on average. This natural variation acts as noise when estimating the treatment effect.
Most of those 10,000 users used the search feature at least once before the experiment started, and we have their prior usage rate. The average pre-experiment search rate across the users is 17%. For each user, CUPED subtracts that average, $\bar{X}$, from their individual pre-experiment search rate, $X_i$, adjusts this difference by a weight $\theta$, and then subtracts the result from the rate observed during the experiment, $Y_i$. The result is a version of the outcome, $Y_i^{\text{cuped}}$, free of individual differences in search tendencies. Putting it all together, we get the following adjustment formula:

$$Y_i^{\text{cuped}} = Y_i - \theta (X_i - \bar{X})$$
The coefficient $\theta$ represents how strongly the pre-experiment search behavior correlates with behavior measured during the experiment. A user with a search rate 10 percentage points higher than average before the experiment will tend to search more during the experiment, for reasons unrelated to the treatment. The adjustment subtracts that expected excess, leaving behind variation attributable to the treatment.
To estimate $\theta$, we need to calculate the covariance between the pre-experiment and in-experiment search rates, and divide by the variance of the pre-experiment metric:

$$\theta = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}$$
Astute readers might recognize this as the slope coefficient for a simple linear regression of $Y$ on $X$. In our hypothetical experiment, $\theta = 0.60$. A user with a pre-experiment search rate of 27% (10 points above the 17% average) gets their observed rate adjusted downward by $0.60 \times 10 = 6.0$ percentage points. A user at 7% (10 points below average) gets adjusted upward by 6.0 points. Users at the mean get no adjustment.
After performing the adjustment, we can estimate the treatment effect as a simple difference in means of the new outcome values, $\hat{\delta} = \bar{Y}^{\text{cuped}}_{\text{treatment}} - \bar{Y}^{\text{cuped}}_{\text{control}}$.
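The two-step procedure can be sketched in a few lines of numpy. The data here are simulated stand-ins for the article's hypothetical experiment (the means, noise scales, and true effect of 3 percentage points are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Simulated stand-in for the article's data (all numbers illustrative):
# x is the pre-experiment search rate in %, y the in-experiment rate,
# and the true treatment effect is 3 percentage points.
x = rng.normal(17.0, 5.0, n)
treat = rng.integers(0, 2, n)
y = 5.0 + 0.6 * x + 3.0 * treat + rng.normal(0.0, 4.0, n)

# Step 1: theta = Cov(X, Y) / Var(X)
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Step 2: subtract the weighted, mean-centered covariate from each outcome
y_cuped = y - theta * (x - x.mean())

# Step 3: the treatment effect is a simple difference in adjusted means
effect = y_cuped[treat == 1].mean() - y_cuped[treat == 0].mean()
```

With this setup, `theta` lands near the 0.60 used in the running example and `effect` near the simulated true effect of 3.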
The variance reduction depends on the correlation between the pre-experiment and in-experiment metrics. In our data, pre-experiment and in-experiment search rates are positively correlated, $\rho = 0.69$. The variance of the adjusted outcome is:

$$\operatorname{Var}(Y^{\text{cuped}}) = (1 - \rho^2)\operatorname{Var}(Y)$$
That’s a 47% reduction in the variance of $Y$. The standard error of the treatment effect drops from 0.17 to 0.12, a 29% reduction. The sampling distribution of $\hat{\delta}$ (the estimated treatment effect) gets visibly tighter.
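The variance identity above is easy to verify numerically. In this sketch the noise scale is tuned so the simulated correlation lands near the 0.69 from the running example (all numbers invented, not the article's actual data); the measured reduction matches $\rho^2$ exactly because $\theta$ is the in-sample regression slope:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulated pre/in-experiment metrics with correlation near 0.69
# (the noise scale 3.15 was chosen to land there; values illustrative)
x = rng.normal(17.0, 5.0, n)
y = 0.6 * x + rng.normal(0.0, 3.15, n)

rho = np.corrcoef(x, y)[0, 1]
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())

# Fraction of outcome variance removed by the adjustment: equals rho^2
reduction = 1.0 - np.var(y_cuped) / np.var(y)
```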
The adjusted estimate (blue) is more precise, concentrating probability mass closer to the true effect. In practice, this means you’d need roughly half as many users to achieve the same statistical power.
The relationship between $\rho$ and the variance remaining follows a curve that accelerates as $\rho$ increases: the remaining fraction is $1 - \rho^2$, so gains are modest at low correlations and steep at high ones.
CUPED in Context
Anyone trained in the design and analysis of experiments will likely recognize the technique CUPED is employing. Analysis of covariance (ANCOVA) has traditionally been used to adjust for covariates to produce more precise treatment estimates (Keppel, 1991). The ideal covariate has a strong linear relationship with the outcome and no relationship with the treatment. Random assignment guarantees the second condition for all pre-treatment variables, and the optimal covariate is almost always the pre-treatment version of the outcome.
Instead of first adjusting each user’s outcome and then comparing group means, ANCOVA fits a single regression model that estimates the treatment effect and the covariate adjustment simultaneously:1

$$Y_i = \alpha + \delta T_i + \beta X_i + \varepsilon_i$$
The treatment effect estimate is $\hat{\delta}$. The coefficient $\beta$ plays the same role as $\theta$ in CUPED. It accounts for the fact that users with higher pre-experiment search rates tend to have higher in-experiment rates. By explaining that variation, the model reduces residual error and produces a more precise estimate of $\delta$.
Here’s what the two methods look like side-by-side using data from our hypothetical experiment.
The treatment effect estimates are functionally equivalent.2 Our ANCOVA model produces $\hat{\delta} = 3.03$ with $\hat{\beta} = 0.61$. CUPED produces $\hat{\delta} = 3.03$ with $\hat{\theta} = 0.60$.
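The side-by-side comparison can be reproduced on simulated data (illustrative numbers with a true effect of 3.0; the ANCOVA fit uses a plain least-squares design matrix rather than any particular stats package):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# Simulated experiment (illustrative numbers; true effect = 3.0)
x = rng.normal(17.0, 5.0, n)
treat = rng.integers(0, 2, n)
y = 5.0 + 0.6 * x + 3.0 * treat + rng.normal(0.0, 4.0, n)

# CUPED: estimate theta, adjust, then take a difference in means
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())
delta_cuped = y_cuped[treat == 1].mean() - y_cuped[treat == 0].mean()

# ANCOVA: one regression of y on an intercept, treatment, and covariate;
# the treatment effect is the coefficient on treat
design = np.column_stack([np.ones(n), treat, x])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
delta_ancova, beta = coefs[1], coefs[2]
```

Under random assignment the two estimates agree to a few decimal places, because the marginal slope and the partial slope differ only through the (small) in-sample correlation between the covariate and the assignment.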
The primary difference between methods appears to be mechanical. CUPED multiplies $\theta$ by the mean-centered pre-experiment outcome, which zeros out the slopes and collapses the parallel lines into flat group means. The methodological distinction is that ANCOVA estimates $\beta$ simultaneously with $\delta$, while CUPED estimates $\theta$ in a prior step, treating it as a known constant during inference.3
Under random assignment, a user’s pre-experiment search behavior is independent of their assigned condition. Estimating the covariate coefficient from experiment data versus historical data converges to the same answer as sample size increases. Both methods achieve a variance reduction of $\rho^2$.
The Case for CUPED
The main argument for using CUPED over ANCOVA and regression adjustment seems to be the concern that assumptions are often violated in real data. In motivating CUPED, Deng et al. state:
Moreover, the technique should preferably not be based on any parametric model because model assumptions tend to be unreliable and a model that works for one metric does not necessarily work for another.
However, the linear model makes strong assumptions that are usually not satisfied in practice, i.e., the conditional expectation of the outcome metric is linear in the treatment assignment and covariates. In addition, it also requires all residuals to have a common variance.
Whether these are actually “strong” assumptions is open to debate. In their classic book on regression modeling, Gelman and Hill (2006) point out that not all assumptions are created equal. They rank them as follows:
- Validity
- Additivity and linearity
- Independence of errors
- Equal variance of errors
- Normality of errors
The linearity assumption is concerned with the model coefficients, not the shape of the relationship between $X$ and $Y$. When a straight line is inappropriate, we can add polynomial terms without violating linearity. The same goes for covariate adjustment in experiments. The real issue is whether the regression slopes of $Y$ on $X$ in treatment and control are parallel. Non-parallel slopes mean the treatment effect varies with $X$ (maybe the new layout helps light searchers more than heavy ones). Under randomization, both ANCOVA and CUPED still produce consistent estimates of the average treatment effect even when slopes differ. What you lose is efficiency and the ability to interpret $\delta$ as a constant effect for all users. An interaction term would capture the heterogeneity and recover additional variance reduction.
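The interaction fix is one extra column in the design matrix. This sketch simulates a heterogeneous effect (the layout helps light searchers more; all numbers invented for illustration) and mean-centers the covariate so the treatment coefficient can still be read as the average treatment effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Simulated heterogeneous effect: the benefit shrinks by 0.2 points per
# point of pre-experiment search rate above the mean (illustrative).
x = rng.normal(17.0, 5.0, n)
x = x - x.mean()   # center so the treat coefficient is the average effect
treat = rng.integers(0, 2, n)
y = 0.6 * x + (3.0 - 0.2 * x) * treat + rng.normal(0.0, 4.0, n)

# ANCOVA design with a treatment-by-covariate interaction column
design = np.column_stack([np.ones(n), treat, x, treat * x])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
ate, interaction = coefs[1], coefs[3]
```

The `interaction` coefficient recovers the simulated slope difference, and `ate` recovers the average effect at the covariate mean.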
The equal variance assumption (aka, homoscedasticity) sits near the bottom of the list. This assumption is weaker than most econometricians would have you believe. In large samples, regression estimates are generally robust to unequal error variance. Robust standard errors have been available for decades to deal with this problem. While they can be difficult to use in some situations, robust methods are perfectly suitable for a simple treatment model with only two variables.
CUPED does have a practical advantage worth noting. The two-step structure makes it natural to pre-compute $\theta$ outside the experiment. If a platform has historical paired observations (e.g., user behavior in two consecutive weeks before the experiment), it can estimate the covariance structure once and reuse it across experiments. You could technically do the same with regression, pre-computing $\beta$ from historical data and plugging it in. CUPED’s formulation makes this separation obvious, which has engineering value for platforms running hundreds of experiments. The statistical result is identical either way.
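A minimal sketch of that workflow, assuming a platform with historical paired observations (two consecutive pre-experiment weeks; all data simulated and illustrative): estimate $\theta$ once from history, then reuse it in a later experiment without refitting.

```python
import numpy as np

rng = np.random.default_rng(5)

# Historical paired observations: the same users' search rates in two
# consecutive weeks before any experiment (simulated, illustrative)
week1 = rng.normal(17.0, 5.0, 50_000)
week2 = 0.6 * week1 + rng.normal(0.0, 3.2, 50_000)

# Estimate theta once from history...
theta_hist = np.cov(week1, week2)[0, 1] / np.var(week1, ddof=1)

# ...then reuse it in a later experiment (true effect = 3.0 here)
x = rng.normal(17.0, 5.0, 10_000)
treat = rng.integers(0, 2, 10_000)
y = 5.0 + 0.6 * x + 3.0 * treat + rng.normal(0.0, 4.0, 10_000)

y_cuped = y - theta_hist * (x - x.mean())
effect = y_cuped[treat == 1].mean() - y_cuped[treat == 0].mean()
```

Because the historical slope converges to the same quantity as the in-experiment slope, the adjusted estimate is essentially unchanged, and the expensive step amortizes across every experiment the platform runs.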
What to Make of CUPED
On the surface, CUPED appears to be a novel method for improving statistical power when analyzing online experiments. Once you peer below the surface, the novelty is less obvious. Both CUPED and ANCOVA can be used to reduce residual variance. Both achieve the same amount of variance reduction because they are governed by the same value, . The differences are largely mechanical. CUPED pre-computes a single adjustment coefficient and applies it before comparing group means. ANCOVA estimates everything in one shot. In large samples, the methods will converge.
Recognizing CUPED as a reformulation of regression adjustment opens up possibilities that the formula alone obscures. If there are multiple variables correlated with the outcome (prior click rate, session length, user tenure), you can use them all. Variance reduction depends on how much additional outcome variance each predictor explains beyond what the others already capture. With CUPED, incorporating multiple covariates would require constructing a composite score, which amounts to fitting a regression anyway. Non-linear relationships between pre-experiment and in-experiment metrics could be handled with polynomial terms. If you suspect the layout change works differently for new users versus veteran users, include an interaction term. None of these require exotic techniques. They are standard regression tools.
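The multiple-covariate case is just a wider design matrix. In this sketch the covariate names and their relationships to the outcome are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Several pre-experiment covariates, all correlated with the outcome
# (names and coefficients are illustrative, not from the article)
clicks = rng.normal(10.0, 3.0, n)     # prior click rate
sessions = rng.normal(20.0, 6.0, n)   # session length
tenure = rng.normal(12.0, 4.0, n)     # user tenure
treat = rng.integers(0, 2, n)
y = (0.5 * clicks + 0.3 * sessions + 0.1 * tenure
     + 3.0 * treat + rng.normal(0.0, 4.0, n))

# One regression handles all covariates at once; each contributes
# whatever outcome variance the others don't already explain
design = np.column_stack([np.ones(n), treat, clicks, sessions, tenure])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
delta = coefs[1]

# Residual variance shrinks relative to the raw outcome variance
resid = y - design @ coefs
```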
Treating CUPED as a standalone formula, rather than a special case of regression, leaves flexibility on the table. None of this diminishes the value of CUPED. The method brought covariate adjustment into the online experimentation mainstream at a time when many platforms still analyzed raw means. The next step is realizing that regression works just fine.
References
Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, 123–132.
Gelman, A. & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Keppel, G. (1991). Design and analysis: A researcher’s handbook (3rd ed.). Prentice-Hall, Inc.
I’m using ANCOVA and regression interchangeably to emphasize the fact that all you’re doing is fitting a linear model to estimate a treatment effect, $\delta$, while controlling for a covariate, $X$. ANCOVA is typically used when the treatment variable is categorical. Regression can handle categorical treatment variables, as well as continuous treatments. Thus, ANCOVA can be seen as a special case of regression. ↩︎
Technically, they are not algebraically identical in finite samples. CUPED uses the marginal slope $\theta = \operatorname{Cov}(X, Y)/\operatorname{Var}(X)$; ANCOVA estimates a partial regression coefficient that also conditions on treatment assignment. These converge asymptotically under random assignment but can differ in any given sample. ↩︎
Because CUPED ignores estimation uncertainty in $\theta$, its standard errors will tend to be optimistic (too small). ANCOVA accounts for this through joint estimation of $\beta$ and $\delta$. The difference will be negligible in large samples but matters for smaller experiments. ↩︎