NO-JIRA: job-run-aggregator: remove relaxing in disruption aggregation #4899

petr-muller · 2026-01-14T16:40:19Z

I do not entirely understand why the relaxation for disruption tests exists here, but given the existing TODO, this may be worth tightening. The `pityFactor` will relax the threshold in some cases (high attempts, high pass rate), so the additional relaxation would only apply in the remaining cases.

Our current aggregation logic is too sensitive, leading to the rejection of payloads for non-regressions. Analysis shows that a significant portion of rejected payloads are failing due to infrastructure noise or existing flakes rather than genuine code regressions. We have component readiness as a backstop to identify regressions with greater sample sizes.

openshift-ci-robot · 2026-01-14T16:40:22Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

openshift-ci-robot · 2026-01-14T16:40:23Z

@petr-muller: This pull request explicitly references no jira issue.

Details

In response to this:

I do not entirely understand why the relaxation for disruption tests exists
here but the TODO exists this may be worth tightening. The pityFactor will
relax the threshold in some cases (high attempts, high pass rate) so the
additional relaxation would only apply in the remaining cases.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-01-14T16:40:33Z

Walkthrough

The changes introduce a "pity factor" mechanism that relaxes pass requirements for small sample sizes in job run aggregation analysis. A new helper function computes adjusted pass thresholds allowing up to 2 failures for smaller samples, reducing false failures from infrastructure noise. This affects percentile disruption checks and failure determination logic across multiple functions.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Implementation - Pity Factor Logic `pkg/jobrunaggregator/jobrunaggregatoranalyzer/pass_fail.go` | Introduces `pityFactor()` helper function that relaxes required pass thresholds for small sample sizes by allowing up to 2 failures. Modifies `CheckFailed()` and `innerCheckPercentileDisruptionWithGrace()` to compute `strictRequiredNumberOfPasses` and apply pity factor logic. Returns adjusted required passes and descriptive messages. Propagates pity factor messaging into summary strings conditionally. |
| Testing - Pity Factor Coverage `pkg/jobrunaggregator/jobrunaggregatoranalyzer/pass_fail_test.go` | Adds two new test functions: `TestCheckFailedWithPityFactor()` and `TestInnerCheckPercentileDisruptionWithPityFactor()` to validate pity factor behavior across varying passes, failures, skips, and percentile thresholds. Updates existing percentile-disruption test cases to reflect new expected pass requirements and introduces test scaffolding for dynamic test case construction. Adds `jobrunaggregatorapi` import. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 69ff8fa and f180675.

📒 Files selected for processing (2)

pkg/jobrunaggregator/jobrunaggregatoranalyzer/pass_fail.go
pkg/jobrunaggregator/jobrunaggregatoranalyzer/pass_fail_test.go

🧰 Additional context used

📓 Path-based instructions (1)

**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

pkg/jobrunaggregator/jobrunaggregatoranalyzer/pass_fail_test.go
pkg/jobrunaggregator/jobrunaggregatoranalyzer/pass_fail.go

🧬 Code graph analysis (1)

pkg/jobrunaggregator/jobrunaggregatoranalyzer/pass_fail_test.go (2)

pkg/jobrunaggregator/jobrunaggregatoranalyzer/pass_fail.go (1)

TestKey (166-169)

pkg/jobrunaggregator/jobrunaggregatorapi/types_row_aggregatedtestrun.go (1)

AggregatedTestRunRow (9-22)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Agent
GitHub Check: CodeQL analysis (go)

🔇 Additional comments (9)

pkg/jobrunaggregator/jobrunaggregatoranalyzer/pass_fail.go (4)

540-554: LGTM!

The conditional logic correctly appends the pity factor message only when the relaxation actually helped the test pass (i.e., when numberOfPasses < strictRequiredNumberOfPasses). This provides useful diagnostic information in the summary.

623-648: LGTM!

The implementation is consistent with innerCheckPercentileDisruptionWithGrace. The pity factor logic is applied correctly, and the message formatting provides clear diagnostic output.

651-659: LGTM!

The pityFactor function is well-documented and implements the intended relaxation correctly:

Allows up to 2 failures by capping required passes at numberOfAttempts - 2

Takes the minimum of this cap and the strict requirement, ensuring no over-relaxation

Returns both the adjusted requirement and a descriptive message for diagnostics

The implementation aligns with the PR objective of reducing aggregation failures caused by infrastructure noise while relying on component readiness for regression detection over larger samples.
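The three points above can be sketched in a few lines of Go. This is a minimal illustration based only on the review's description, not the code from `pass_fail.go`: the parameter names, the message wording, and the clamp at zero for very small samples are assumptions.

```go
package main

import "fmt"

// pityFactor is an illustrative sketch, not the function from pass_fail.go.
// It caps the required passes at numberOfAttempts-2 (tolerating up to two
// failures) and never raises the requirement above the strict value.
func pityFactor(numberOfAttempts, strictRequiredPasses int) (int, string) {
	capped := numberOfAttempts - 2
	if capped < 0 {
		capped = 0 // assumption: never require a negative number of passes
	}
	if capped < strictRequiredPasses {
		// Assumption: the message wording here is invented for illustration.
		msg := fmt.Sprintf("(pity factor: required passes relaxed from %d to %d)",
			strictRequiredPasses, capped)
		return capped, msg
	}
	// The strict requirement already tolerates two or more failures.
	return strictRequiredPasses, ""
}

func main() {
	cases := []struct{ attempts, strict int }{{10, 10}, {10, 9}, {10, 7}, {3, 3}}
	for _, tc := range cases {
		required, msg := pityFactor(tc.attempts, tc.strict)
		fmt.Printf("attempts=%d strict=%d required=%d %s\n",
			tc.attempts, tc.strict, required, msg)
	}
}
```

With 10 attempts, a strict requirement of 10 or 9 passes is relaxed to 8, while a strict requirement of 7 is left unchanged because it already allows three failures.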

504-506: Potential index out-of-bounds if numberOfAttempts exceeds 12.

The requiredPassesByPassPercentageByNumberOfAttempts array only has entries for attempts 0-12. If numberOfAttempts > 12 (i.e., more than 12 job runs are passed to innerCheckPercentileDisruptionWithGrace or CheckFailed), accessing the array at line 505 or 623 will panic. No bounds validation prevents this scenario. This is a pre-existing issue, not introduced by this PR, but it's exposed by the modified code paths.
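One possible guard for the out-of-bounds case is sketched below. The table here is a hypothetical, simplified stand-in (one dimension, placeholder values) for `requiredPassesByPassPercentageByNumberOfAttempts`, whose real shape and contents are not shown in this conversation; the helper name `lookupRequiredPasses` is also an assumption.

```go
package main

import "fmt"

// requiredPasses is a hypothetical stand-in for the real lookup table.
// Index = number of attempts (0 through 12); the values are placeholders.
var requiredPasses = [13]int{0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8}

// lookupRequiredPasses returns an error instead of panicking when the
// number of attempts falls outside the range covered by the table.
func lookupRequiredPasses(numberOfAttempts int) (int, error) {
	if numberOfAttempts < 0 || numberOfAttempts >= len(requiredPasses) {
		return 0, fmt.Errorf("number of attempts %d is outside the supported range 0-%d",
			numberOfAttempts, len(requiredPasses)-1)
	}
	return requiredPasses[numberOfAttempts], nil
}

func main() {
	for _, attempts := range []int{10, 12, 13} {
		required, err := lookupRequiredPasses(attempts)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("attempts=%d required=%d\n", attempts, required)
	}
}
```

Whether to return an error or to clamp to the last table entry is a design choice for the maintainers; the error variant is shown only because it makes the unsupported input visible.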

pkg/jobrunaggregator/jobrunaggregatoranalyzer/pass_fail_test.go (5)

9-9: LGTM!

The import is required for the AggregatedTestRunRow type used in the new pity factor tests.

61-87: LGTM!

Test expectations correctly updated to reflect the removal of the additional relaxation. The required passes and disruption arrays are consistent with the new pity factor calculation logic.

207-219: LGTM!

The updated expectations for the 80th percentile test correctly reflect 5 natural passes needed, with the disruption array producing exactly 5 successes (values ≤ 1).

295-486: Well-structured test coverage for pity factor behavior.

The table-driven tests comprehensively cover the pity factor logic across different reliability tiers (70%-100% working percentage). The test setup pattern using queryTestRunsOnce.Do(func() {}) to bypass the BigQuery client is a pragmatic approach for unit testing.
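The `sync.Once` trick mentioned above can be illustrated with a small, self-contained sketch. The `analyzer` type, its fields, and `newTestAnalyzer` are hypothetical names chosen for this example and are not taken from the repository; only the idea of calling `Do(func() {})` up front comes from the review.

```go
package main

import (
	"fmt"
	"sync"
)

// analyzer is a hypothetical type that lazily loads rows from a backend
// exactly once. In production, load() would call the real client.
type analyzer struct {
	queryOnce sync.Once
	rows      []string
	queried   bool // records whether the backend query actually ran
}

// load runs the expensive query on first use and returns the cached rows.
func (a *analyzer) load() []string {
	a.queryOnce.Do(func() {
		a.queried = true
		a.rows = []string{"row-from-backend"}
	})
	return a.rows
}

// newTestAnalyzer pre-populates rows and consumes the Once with a no-op,
// so later calls to load() never reach the backend query.
func newTestAnalyzer(rows []string) *analyzer {
	a := &analyzer{rows: rows}
	a.queryOnce.Do(func() {})
	return a
}

func main() {
	a := newTestAnalyzer([]string{"canned-row"})
	fmt.Println(a.load(), "backend queried:", a.queried)
}
```

Because `sync.Once` runs only the first function passed to `Do`, the no-op call in the test constructor guarantees the closure inside `load` is skipped and the canned rows are returned unchanged.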

488-571: LGTM!

Good test coverage for innerCheckPercentileDisruptionWithGrace. The invariant assertion on line 567-568 ensuring totalAttempts == successes + failures is a valuable sanity check. Tests correctly validate both the required passes calculation and the final status.

petr-muller · 2026-01-14T16:40:57Z

/hold

Needs #4894

openshift-ci · 2026-01-14T16:42:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~pkg/jobrunaggregator/OWNERS~~ [petr-muller]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

petr-muller · 2026-01-14T16:44:39Z

/uncc @danilo-gemoli @droslean
/cc @neisw @xueqzhan

Copilot

Pull request overview

This pull request removes the simple -1 relaxation for disruption tests and replaces it with a more sophisticated pityFactor function that allows up to 2 failures regardless of the strict pass requirement. The change affects both disruption percentile checks and regular test failure checks.

Changes:

Replaced the -1 adjustment with a pityFactor function that allows up to 2 failures
Updated test cases to reflect new expected pass requirements (7 instead of 6, 5 instead of 4)
Added comprehensive test coverage for the pity factor behavior

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| pkg/jobrunaggregator/jobrunaggregatoranalyzer/pass_fail.go | Removed `-1` relaxation, added `pityFactor` function, integrated pity factor into disruption and regular checks, updated summary messages |
| pkg/jobrunaggregator/jobrunaggregatoranalyzer/pass_fail_test.go | Updated existing test expectations, added comprehensive tests for `CheckFailedWithPityFactor` and `innerCheckPercentileDisruptionWithPityFactor` |

petr-muller · 2026-01-15T15:54:01Z

/test all

openshift-ci · 2026-01-15T17:20:14Z

@petr-muller: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/breaking-changes | `f180675` | link | false | `/test breaking-changes` |
| ci/prow/images | `f180675` | link | true | `/test images` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

petr-muller added 2 commits January 12, 2026 18:39

Copilot AI review requested due to automatic review settings January 14, 2026 16:40

openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) Jan 14, 2026

Copilot started reviewing on behalf of petr-muller January 14, 2026 16:40

petr-muller marked this pull request as draft January 14, 2026 16:41

openshift-ci bot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) and do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) labels Jan 14, 2026

openshift-ci bot requested review from danilo-gemoli and droslean January 14, 2026 16:42

openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Jan 14, 2026

Copilot AI reviewed Jan 14, 2026

