
Conversation

@bertilhatt
Contributor

Customers often ask why results that were significant without CUPED are no longer significant with CUPED. They were promised faster results, and this feels unfair to them.

Those tend to be non-technical stakeholders, so this guide is meant as an introduction to CUPED and how it impacts results.

I’m hesitant to add graphical elements to be clearer.

We could also give people hints about how to break down a large CUPED correction.

@netlify

netlify bot commented Oct 22, 2025

Deploy Preview for eppo-data-docs ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 1f9f791 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/eppo-data-docs/deploys/6904b8c3e7a10c0007af6ee3 |
| 😎 Deploy Preview | https://deploy-preview-702--eppo-data-docs.netlify.app |

@Dpananos

LGTM

@bertilhatt
Contributor Author

@tbuffington7 @Dpananos, I’ve made material edits; let me know if you like it better without so many typos.

Contributor

@tbuffington7 left a comment


Looks good overall. Left a few minor comments that are mostly nits that you can take or leave


When splitting a large group of thousands or millions of subjects in two, you expect both halves to be identical, or rather, almost identical. Out of randomness, there could be small differences. With smaller samples, or if you have a few very active subjects, even if we split randomly and fairly, those differences between the two groups might be more noticeable.

Say Control has a few more very active customers, then the comparison will be unfair, because that will make Treatment look worse.
Contributor


Small nit: it would be good to clarify that they are not active because of the experiment. Without that, one might think "why's it unfair? The treatment could just be very bad"
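
As an aside, the effect described above is easy to see in a small simulation (purely illustrative, with made-up numbers, not part of the proposed doc): a population that contains a handful of very active subjects is split fairly in two many times, and the averages of the two halves can still differ noticeably.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: most subjects have modest activity,
# while a handful of "VIP" subjects are extremely active.
n = 10_000
activity = rng.poisson(lam=5, size=n).astype(float)
vip_idx = rng.choice(n, size=20, replace=False)
activity[vip_idx] += rng.normal(loc=500, scale=100, size=20)

# Repeat a fair 50/50 random split many times and record the gap
# between the two halves' average activity.
gaps = []
for _ in range(1_000):
    in_treatment = rng.random(n) < 0.5
    gaps.append(abs(activity[in_treatment].mean() - activity[~in_treatment].mean()))

print(f"median gap:          {np.median(gaps):.2f}")
print(f"95th percentile gap: {np.percentile(gaps, 95):.2f}")
# Every split is perfectly fair, yet the few very active subjects can
# land unevenly and create a visible difference between the halves.
```

The heavier the tail of the activity distribution, the larger these purely random gaps tend to be, which is also why the CUPED correction can end up being large.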


## Why is the correction so large?

The larger the imbalance, the larger the correction. With a homogeneous audience of millions of users, there’s little risk to see big differences. However, if you happen to have a handful of users that represent a large part of your activity, they might not split evenly between variances, and larger gaps are possible.
Contributor


Suggested change
The larger the imbalance, the larger the correction. With a homogeneous audience of millions of users, there’s little risk to see big differences. However, if you happen to have a handful of users that represent a large part of your activity, they might not split evenly between variances, and larger gaps are possible.
The larger the imbalance, the larger the correction. With a homogeneous audience of millions of users, there’s little risk to see big differences. However, if you happen to have a handful of users that represent a large part of your activity, they might not split evenly between variants, and larger differences are possible.


The larger the imbalance, the larger the correction. With a homogeneous audience of millions of users, there’s little risk to see big differences. However, if you happen to have a handful of users that represent a large part of your activity, they might not split evenly between variances, and larger gaps are possible.

There is a limit though: some gaps are too large to be caused by randomness. As we’ll see further down, we flag those cases as suspicious.
Contributor


Another small nit: I think this phrasing is too strong. Technically any amount of imbalance is possible with proper randomness even if quite unlikely. I'd say something like "some imbalances are large enough to indicate a potential problem with the experiment's randomization."


## Significance

Why do we present CUPED as "reducing noise" and accelerating experiments if it’s correcting for uneven splits, then?
Contributor


Another nit: "uneven splits" reads to me like uneven traffic splits (SRM). Maybe "pre-experiment imbalances" is better?

Technically that ignores assignment properties, so "covariate imbalances" is the most precise, but arguably too jargon-y for the audience you have in mind here


CUPED removes some of the noise coming from one possible source of irrelevant differences between Control and Treatment. By doing so, it reduces the uncertainty of the experiment. If we keep the significance threshold (the acceptable level of wrongly detecting something when there’s no difference) at 5%, removing that source of noise lets the confidence interval be narrower. With more precision, the same measured impact is more likely to be significant. However, CUPED can also change the measured impact itself, so significance is not a guarantee.

That is the benefit of CUPED when looking at your overall experimentation program: it makes results faster, however, it does not just shrink the confidence interval. The effect on each result is that it corrects for small, measurable engagement imbalances, changing the estimated impact to make a fairer comparison. That correction allows us to reduce the confidence interval around the new, corrected value.
Contributor


Suggested change
That is the benefit of CUPED when looking at your overall experimentation program: it makes results faster, however, it does not just shrink the confidence interval. The effect on each result is that it corrects for small, measurable engagement imbalances, changing the estimated impact to make a fairer comparison. That correction allows us to reduce the confidence interval around the new, corrected value.
That is the benefit of CUPED when looking at your overall experimentation program: it makes results faster; however, it does not just shrink the confidence interval. The effect on each result is that it corrects for small, measurable engagement imbalances, changing the estimated impact to make a fairer comparison. That correction allows us to reduce the confidence interval around the new, corrected value.
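
For readers who want to see the mechanics behind this paragraph, here is a minimal sketch of the textbook CUPED adjustment on simulated data (an illustration of the general technique, not Eppo's exact implementation): each subject's metric is adjusted using their pre-experiment value, which both shifts the estimated lift toward a fairer comparison and narrows the confidence interval.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000

# Simulated subjects: a pre-experiment metric and a correlated
# in-experiment metric, plus a small true treatment effect.
pre = rng.gamma(shape=2.0, scale=5.0, size=n)
in_treatment = rng.random(n) < 0.5
true_effect = 0.5
post = 0.8 * pre + rng.normal(0.0, 3.0, size=n) + true_effect * in_treatment

# Textbook CUPED: theta = cov(pre, post) / var(pre),
# adjusted metric   = post - theta * (pre - mean(pre)).
theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
adjusted = post - theta * (pre - pre.mean())

def lift_and_ci(metric, in_treatment):
    """Difference in means and a 95% confidence half-width."""
    t, c = metric[in_treatment], metric[~in_treatment]
    lift = t.mean() - c.mean()
    se = np.sqrt(t.var(ddof=1) / t.size + c.var(ddof=1) / c.size)
    return lift, 1.96 * se

for label, metric in [("raw", post), ("CUPED", adjusted)]:
    lift, half_width = lift_and_ci(metric, in_treatment)
    print(f"{label:>5}: lift = {lift:+.3f} +/- {half_width:.3f}")
# The CUPED confidence interval is noticeably narrower, and the point
# estimate moves when the two groups happened to differ on `pre`.
```

Because `theta` depends on how well the pre-experiment metric predicts the in-experiment metric, the gain is large for sticky, predictable metrics and small for unpredictable ones, which is exactly the caveat in the next paragraph.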

Is CUPED always the right approach? Generally, yes. If your metrics are hard to predict based on past activity, CUPED might have no material effect. However, with the right precautions, it can’t make your results less reliable or slower.

What are those precautions? When applying the CUPED correction, we assume that your split is fair. In practice, that means that we make two assumptions:
1. The split between Control and Treatment is balanced through randomness: the users in both groups had the same level of engagement before the experiment, joined at the same time, had the same number of new or VIP users, etc.
Contributor


I think this is a bit confusing because you say previously that imbalances happen due to randomness. I think you mean "balanced in expectation" here.

Strictly speaking, the assumption of CUPED is statistical independence between covariates and the treatment assignment. Similar pre-experiment behavior helps us verify the assumption but is not an assumption of CUPED in itself. You could state the assumption of independence as two separate assumptions:

  1. The experiment was properly randomized in the sense that all subjects are equally likely to be in a given variant regardless of their pre-experiment behavior (pre-experiment behavior has no impact on treatment assignment)
  2. The treatment assignment has no impact on pre-experiment data

1. The split between Control and Treatment is balanced through randomness: the users in both groups had the same level of engagement before the experiment, joined at the same time, had the same number of new or VIP users, etc.
2. In particular, their behavior before the experiment should be similar. We should expect small differences, but not large ones: they should be close to identical, depending on how large the samples are.

If we notice larger differences than expected, then we flag this as a Diagnostic error, either a *Traffic imbalance* by assignment properties or a *Pre-experiment metric imbalance*. Those imbalances should not happen: Control and Treatment should be taken from the same population and split randomly.
Contributor


Might be good to link to the diagnostics page here


If we notice larger differences than expected, then we flag this as a Diagnostic error, either a *Traffic imbalance* by assignment properties or a *Pre-experiment metric imbalance*. Those imbalances should not happen: Control and Treatment should be taken from the same population and split randomly.

If either of those diagnostic warnings or errors appear, we strongly recommend that you investigate that before looking at results; notably, we recommend you address those before looking at CUPED- or non-CUPED-corrected results. Do not hesitate to reach out to support if you are not sure what to do.
Contributor


Suggested change
If either of those diagnostic warnings or errors appear, we strongly recommend that you investigate that before looking at results; notably, we recommend you address those before looking at CUPED- or non-CUPED-corrected results. Do not hesitate to reach out to support if you are not sure what to do.
If either of those diagnostic warnings or errors appear, we strongly recommend that you investigate that before looking at results; notably, we recommend you address those before looking at CUPED or non-CUPED-corrected results. Do not hesitate to reach out to support if you are not sure what to do.
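
As a rough illustration of the kind of statistical check these diagnostics rely on (a sketch with made-up counts, not Eppo's actual diagnostic code), one can compare the observed traffic split against the intended allocation and compare the pre-experiment metric across variants:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Hypothetical observed data for an experiment intended to split 50/50.
n_control, n_treatment = 50_480, 49_520
pre_control = rng.gamma(2.0, 5.0, size=n_control)      # pre-experiment metric, Control
pre_treatment = rng.gamma(2.0, 5.0, size=n_treatment)  # pre-experiment metric, Treatment

# Traffic imbalance (sample ratio mismatch): chi-square test of the
# observed counts against the intended 50/50 allocation.
total = n_control + n_treatment
_, srm_p = stats.chisquare([n_control, n_treatment], f_exp=[total / 2, total / 2])

# Pre-experiment metric imbalance: before the experiment starts, the
# metric should not differ between variants beyond random noise.
_, pre_p = stats.ttest_ind(pre_control, pre_treatment, equal_var=False)

print(f"traffic split p-value:         {srm_p:.4f}")
print(f"pre-experiment metric p-value: {pre_p:.4f}")
# Very small p-values point to a randomization or setup problem that
# should be investigated before trusting CUPED or non-CUPED results.
```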


All common causes for Traffic imbalance (telemetry issues, including non-participants, etc.) would also affect CUPED.

What are patterns that specifically trigger an imbalance in pre-experiment metrics?
Contributor


I think incorrectly specified experiment dates (without other complications like gradual rollouts) is the simplest and most intuitive example. If the experiment started before Eppo thinks it did, then the treatment can cause an imbalance before the incorrectly specified start date.
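
A toy simulation of that scenario (hypothetical numbers, not Eppo behavior) shows how a start date that is configured too late lets the treatment leak into the "pre-experiment" window and create exactly this kind of imbalance:

```python
import numpy as np

rng = np.random.default_rng(3)
n_days, n_subjects = 60, 5_000

# Each subject has some baseline daily activity; the treatment adds +1
# event per day once it is actually live.
in_treatment = rng.random(n_subjects) < 0.5
baseline = rng.gamma(2.0, 1.0, size=n_subjects)
daily = rng.poisson(lam=np.tile(baseline, (n_days, 1)))    # shape: (day, subject)

true_start = 20        # the treatment really went live on day 20
daily = daily + (np.arange(n_days)[:, None] >= true_start) * in_treatment

configured_start = 30  # ...but the experiment dates were set to day 30
pre_metric = daily[:configured_start].sum(axis=0)          # "pre-experiment" activity as configured

gap = pre_metric[in_treatment].mean() - pre_metric[~in_treatment].mean()
print(f"pre-experiment gap (Treatment - Control): {gap:+.2f}")
# Days 20-29 already carry the treatment effect, so the configured
# pre-experiment metric is imbalanced; CUPED would then "correct" away
# part of the real effect. Starting the experiment on day 20 avoids it.
```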

What are patterns that specifically trigger an imbalance in pre-experiment metrics?

- **Iterating features**: If you test a first version of your new feature, and it fails because of a bug, then you might try again. You might want to use the same split so that users who saw the feature don’t see it vanish. Then what happened before the second experiment is different for Control (who didn’t see anything new) and Treatment (who saw a buggy version of the same feature). In that case, CUPED wouldn’t apply fairly. We recommend that you start the experiment assignments when you started the split originally, at the start of the first version; you can exclude events when the feature wasn’t working properly and use a later event start date.
- **Gradual roll-out**: If you want to roll out a feature gradually, we recommend that you expose a small portion of customers, split that group between Control and Treatment, and expose more users gradually while maintaining the split. This allows users to stay in their assigned variations once they are exposed. If you set the start date of your experiment to once the test was fully rolled out, then the users who were assigned early would have a differentiated experiment prior to that. In that case, CUPED wouldn’t apply fairly either. Instead, start your test from the earliest roll-out. Eppo flags are designed to exclude the subjects who are not part of the test yet.
Contributor


I don't understand the phrasing "differentiated experiment prior to that." Maybe something like "who were assigned early could exhibit metric imbalance due to the treatment affecting behavior before the specified start date"
