
Conversation

@bertilhatt
Contributor

Customers often ask why results that were significant without CUPED are no longer significant with CUPED. They were promised faster results, and this feels unfair to them.

Those tend to be non-technical stakeholders, so this guide is meant as an introduction to CUPED and how it impacts results.

I’m hesitant to add graphical elements to be clearer.

We could also give people hints about how to break down a large CUPED correction.

@netlify

netlify bot commented Oct 22, 2025

Deploy Preview for eppo-data-docs ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 1f9f791 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/eppo-data-docs/deploys/6904b8c3e7a10c0007af6ee3 |
| 😎 Deploy Preview | https://deploy-preview-702--eppo-data-docs.netlify.app |

@Dpananos

LGTM

@bertilhatt
Contributor Author

@tbuffington7 @Dpananos, I’ve made material edits; let me know if you like it better without so many typos.

Contributor

@tbuffington7 left a comment


Looks good overall. Left a few minor comments that are mostly nits that you can take or leave


When splitting a large group of thousands or millions of subjects in two, you expect both halves to be identical, or rather, almost identical. Out of randomness, there could be small differences. With smaller samples, or if you have a few very active subjects, even if we split randomly and fairly, those differences between the two groups might be more noticeable.

Say Control has a few more very active customers, then the comparison will be unfair, because that will make Treatment look worse.
Contributor


Small nit: it would be good to clarify that they are not active because of the experiment. Without that, one might think "why's it unfair? The treatment could just be very bad"
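
As an aside, the effect described above is easy to see in a small simulation (purely illustrative, with made-up numbers, not part of the proposed doc): a population that contains a handful of very active subjects is split fairly in two many times, and the averages of the two halves can still differ noticeably.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: most subjects have modest activity,
# while a handful of "VIP" subjects are extremely active.
n = 10_000
activity = rng.poisson(lam=5, size=n).astype(float)
vip_idx = rng.choice(n, size=20, replace=False)
activity[vip_idx] += rng.normal(loc=500, scale=100, size=20)

# Repeat a fair 50/50 random split many times and record the gap
# between the two halves' average activity.
gaps = []
for _ in range(1_000):
    in_treatment = rng.random(n) < 0.5
    gaps.append(abs(activity[in_treatment].mean() - activity[~in_treatment].mean()))

print(f"median gap:          {np.median(gaps):.2f}")
print(f"95th percentile gap: {np.percentile(gaps, 95):.2f}")
# Every split is perfectly fair, yet the few very active subjects can
# land unevenly and create a visible difference between the halves.
```

The heavier the tail of the activity distribution, the larger these purely random gaps tend to be, which is also why the CUPED correction can end up being large.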


## Why is the correction so large?

The larger the imbalance, the larger the correction. With a homogeneous audience of millions of users, there’s little risk to see big differences. However, if you happen to have a handful of users that represent a large part of your activity, they might not split evenly between variances, and larger gaps are possible.
Contributor


Suggested change
The larger the imbalance, the larger the correction. With a homogeneous audience of millions of users, there’s little risk to see big differences. However, if you happen to have a handful of users that represent a large part of your activity, they might not split evenly between variances, and larger gaps are possible.
The larger the imbalance, the larger the correction. With a homogeneous audience of millions of users, there’s little risk to see big differences. However, if you happen to have a handful of users that represent a large part of your activity, they might not split evenly between variants, and larger differences are possible.


The larger the imbalance, the larger the correction. With a homogeneous audience of millions of users, there’s little risk to see big differences. However, if you happen to have a handful of users that represent a large part of your activity, they might not split evenly between variances, and larger gaps are possible.

There is a limit though: some gaps are too large to be caused by randomness. As we’ll see further down, we flag those cases as suspicious.
Contributor


Another small nit: I think this phrasing is too strong. Technically any amount of imbalance is possible with proper randomness even if quite unlikely. I'd say something like "some imbalances are large enough to indicate a potential problem with the experiment's randomization."


## Significance

Why do we present CUPED as "reducing noise" and accelerating experiments if it’s correcting for uneven splits, then?
Contributor


Another nit: "uneven splits" reads to me like uneven traffic splits (SRM). Maybe "pre-experiment imbalances" is better?

Technically that ignores assignment properties, so "covariate imbalances" is the most precise, but arguably too jargon-y for the audience you have in mind here


CUPED removes some of the noise coming from one possible source of irrelevant differences between Control and Treatment. By doing so, it reduces the uncertainty of the experiment. If we keep the significance threshold (the acceptable level of wrongly detecting something when there’s no difference) at 5%, removing that source of noise lets the confidence interval be narrower. With more precision, the same measured impact is more likely to be significant. However, CUPED can also change the measured impact itself, so significance is not a guarantee.

That is the benefit of CUPED when looking at your overall experimentation program: it makes results faster, however, it does not just shrink the confidence interval. The effect on each result is that it corrects for small, measurable engagement imbalances, changing the estimated impact to make a fairer comparison. That correction allows us to reduce the confidence interval around the new, corrected value.
Contributor


Suggested change
That is the benefit of CUPED when looking at your overall experimentation program: it makes results faster, however, it does not just shrink the confidence interval. The effect on each result is that it corrects for small, measurable engagement imbalances, changing the estimated impact to make a fairer comparison. That correction allows us to reduce the confidence interval around the new, corrected value.
That is the benefit of CUPED when looking at your overall experimentation program: it makes results faster; however, it does not just shrink the confidence interval. The effect on each result is that it corrects for small, measurable engagement imbalances, changing the estimated impact to make a fairer comparison. That correction allows us to reduce the confidence interval around the new, corrected value.
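
For readers who want to see the mechanics behind this paragraph, here is a minimal sketch of the textbook CUPED adjustment on simulated data (an illustration of the general technique, not Eppo's exact implementation): each subject's metric is adjusted using their pre-experiment value, which both shifts the estimated lift toward a fairer comparison and narrows the confidence interval.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000

# Simulated subjects: a pre-experiment metric and a correlated
# in-experiment metric, plus a small true treatment effect.
pre = rng.gamma(shape=2.0, scale=5.0, size=n)
in_treatment = rng.random(n) < 0.5
true_effect = 0.5
post = 0.8 * pre + rng.normal(0.0, 3.0, size=n) + true_effect * in_treatment

# Textbook CUPED: theta = cov(pre, post) / var(pre),
# adjusted metric   = post - theta * (pre - mean(pre)).
theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
adjusted = post - theta * (pre - pre.mean())

def lift_and_ci(metric, in_treatment):
    """Difference in means and a 95% confidence half-width."""
    t, c = metric[in_treatment], metric[~in_treatment]
    lift = t.mean() - c.mean()
    se = np.sqrt(t.var(ddof=1) / t.size + c.var(ddof=1) / c.size)
    return lift, 1.96 * se

for label, metric in [("raw", post), ("CUPED", adjusted)]:
    lift, half_width = lift_and_ci(metric, in_treatment)
    print(f"{label:>5}: lift = {lift:+.3f} +/- {half_width:.3f}")
# The CUPED confidence interval is noticeably narrower, and the point
# estimate moves when the two groups happened to differ on `pre`.
```

Because `theta` depends on how well the pre-experiment metric predicts the in-experiment metric, the gain is large for sticky, predictable metrics and small for unpredictable ones, which is exactly the caveat in the next paragraph.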

Is CUPED always the right approach? Generally, yes. If your metrics are hard to predict based on past activity, CUPED might have no material effect. However, with the right precautions, it can’t make your results less reliable or slower.

What are those precautions? When applying the CUPED correction, we assume that your split is fair. In practice, that means that we make two assumptions:
1. The split between Control and Treatment is balanced through randomness: the users in both groups had the same level of engagement before the experiment, joined at the same time, had the same number of new or VIP users, etc.
Contributor


I think this is a bit confusing because you say previously that imbalances happen due to randomness. I think you mean "balanced in expectation" here.

Strictly speaking, the assumption of CUPED is statistical independence between covariates and the treatment assignment. Similar pre-experiment behavior helps us verify the assumption but is not an assumption of CUPED in itself. You could state the assumption of independence as two separate assumptions:

  1. The experiment was properly randomized in the sense that all subjects are equally likely to be in a given variant regardless of their pre-experiment behavior (pre-experiment behavior has no impact on treatment assignment)
  2. The treatment assignment has no impact on pre-experiment data

1. The split between Control and Treatment is balanced through randomness: the users in both groups had the same level of engagement before the experiment, joined at the same time, had the same number of new or VIP users, etc.
2. In particular, their behavior before the experiment should be similar. We should expect small differences, but not large ones: they should be close to identical, depending on how large the samples are.

If we notice larger differences than expected, then we flag this as a Diagnostic error, either a *Traffic imbalance* by assignment properties or a *Pre-experiment metric imbalance*. Those imbalances should not happen: Control and Treatment should be taken from the same population and split randomly.
Contributor


Might be good to link to the diagnostics page here


If we notice larger differences than expected, then we flag this as a Diagnostic error, either a *Traffic imbalance* by assignment properties or a *Pre-experiment metric imbalance*. Those imbalances should not happen: Control and Treatment should be taken from the same population and split randomly.

If either of those diagnostic warnings or errors appear, we strongly recommend that you investigate that before looking at results; notably, we recommend you address those before looking at CUPED- or non-CUPED-corrected results. Do not hesitate to reach out to support if you are not sure what to do.
Contributor


Suggested change
If either of those diagnostic warnings or errors appear, we strongly recommend that you investigate that before looking at results; notably, we recommend you address those before looking at CUPED- or non-CUPED-corrected results. Do not hesitate to reach out to support if you are not sure what to do.
If either of those diagnostic warnings or errors appear, we strongly recommend that you investigate that before looking at results; notably, we recommend you address those before looking at CUPED or non-CUPED-corrected results. Do not hesitate to reach out to support if you are not sure what to do.
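
As a rough illustration of the kind of statistical check these diagnostics rely on (a sketch with made-up counts, not Eppo's actual diagnostic code), one can compare the observed traffic split against the intended allocation and compare the pre-experiment metric across variants:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Hypothetical observed data for an experiment intended to split 50/50.
n_control, n_treatment = 50_480, 49_520
pre_control = rng.gamma(2.0, 5.0, size=n_control)      # pre-experiment metric, Control
pre_treatment = rng.gamma(2.0, 5.0, size=n_treatment)  # pre-experiment metric, Treatment

# Traffic imbalance (sample ratio mismatch): chi-square test of the
# observed counts against the intended 50/50 allocation.
total = n_control + n_treatment
_, srm_p = stats.chisquare([n_control, n_treatment], f_exp=[total / 2, total / 2])

# Pre-experiment metric imbalance: before the experiment starts, the
# metric should not differ between variants beyond random noise.
_, pre_p = stats.ttest_ind(pre_control, pre_treatment, equal_var=False)

print(f"traffic split p-value:         {srm_p:.4f}")
print(f"pre-experiment metric p-value: {pre_p:.4f}")
# Very small p-values point to a randomization or setup problem that
# should be investigated before trusting CUPED or non-CUPED results.
```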


All common causes for Traffic imbalance (telemetry issues, including non-participants, etc.) would also affect CUPED.

What are patterns that specifically trigger an imbalance in pre-experiment metrics?
Contributor


I think incorrectly specified experiment dates (without other complications like gradual rollouts) is the simplest and most intuitive example. If the experiment started before Eppo thinks it did, then the treatment can cause an imbalance before the incorrectly specified start date.
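
A toy simulation of that scenario (hypothetical numbers, not Eppo behavior) shows how a start date that is configured too late lets the treatment leak into the "pre-experiment" window and create exactly this kind of imbalance:

```python
import numpy as np

rng = np.random.default_rng(3)
n_days, n_subjects = 60, 5_000

# Each subject has some baseline daily activity; the treatment adds +1
# event per day once it is actually live.
in_treatment = rng.random(n_subjects) < 0.5
baseline = rng.gamma(2.0, 1.0, size=n_subjects)
daily = rng.poisson(lam=np.tile(baseline, (n_days, 1)))    # shape: (day, subject)

true_start = 20        # the treatment really went live on day 20
daily = daily + (np.arange(n_days)[:, None] >= true_start) * in_treatment

configured_start = 30  # ...but the experiment dates were set to day 30
pre_metric = daily[:configured_start].sum(axis=0)          # "pre-experiment" activity as configured

gap = pre_metric[in_treatment].mean() - pre_metric[~in_treatment].mean()
print(f"pre-experiment gap (Treatment - Control): {gap:+.2f}")
# Days 20-29 already carry the treatment effect, so the configured
# pre-experiment metric is imbalanced; CUPED would then "correct" away
# part of the real effect. Starting the experiment on day 20 avoids it.
```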

What are patterns that specifically trigger an imbalance in pre-experiment metrics?

- **Iterating features**: If you test a first version of your new feature, and it fails because of a bug, then you might try again. You might want to use the same split so that users who saw the feature don’t see it vanish. Then what happened before the second experiment is different for Control (who didn’t see anything new) and Treatment (who saw a buggy version of the same feature). In that case, CUPED wouldn’t apply fairly. We recommend that you start the experiment assignments when you started the split originally, at the start of the first version; you can exclude events when the feature wasn’t working properly and use a later event start date.
- **Gradual roll-out**: If you want to roll out a feature gradually, we recommend that you expose a small portion of customers, split that group between Control and Treatment, and expose more users gradually while maintaining the split. This allows users to stay in their assigned variations once they are exposed. If you set the start date of your experiment to once the test was fully rolled out, then the users who were assigned early would have a differentiated experiment prior to that. In that case, CUPED wouldn’t apply fairly either. Instead, start your test from the earliest roll-out. Eppo flags are designed to exclude the subjects who are not part of the test yet.
Contributor


I don't understand the phrasing "differentiated experiment prior to that." Maybe something like "who were assigned early could exhibit metric imbalance due to the treatment affecting behavior before the specified start date"
