
[COST-7249] Phase 2: Rate Calculation Isolation — RatesToUsage pipeline #6017

Open
jordigilh wants to merge 2 commits into project-koku:main from jordigilh:COST-7249/phase2-rate-isolation

Conversation

@jordigilh
Contributor

Summary

PR 2 of 5 in the COST-7249 phased delivery plan.

| PR | Status | Scope |
| --- | --- | --- |
| #5980 Phase 1a: Design docs | Merged | Architecture documentation |
| #5984 Phase 1b: Rate table backfill | Open | Data migration (JSON → Rate rows) |
| This PR — Phase 2: Rate Calculation Isolation | Open | RatesToUsage pipeline |
| Phase 3+4: Per-rate distribution, breakdown API | Planned | Distribution SQL, API endpoint |
| Phase 5: Legacy cleanup | Planned | Remove usage_costs.sql and CostModel.rates JSON |

What this PR does

Introduces the RatesToUsage table and pipeline, which replaces the direct-write path in usage_costs.sql with a two-step approach:

  1. INSERT per-rate cost rows into rates_to_usage (partitioned by month, one row per rate × namespace × node × day at full label granularity)
  2. Aggregate those rows back into the existing reporting_ocpusagelineitem_daily_summary columns (cost_model_cpu_cost, cost_model_memory_cost, cost_model_volume_cost)

This makes rates_to_usage the single source of truth for cost model calculations, enabling per-rate cost breakdown in Phases 3-4.
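To make the two-step shape concrete, here is an in-memory sketch of step 2 over plain dicts (the summary column names come from this PR; the rows, namespace, and cost figures are invented for illustration, and the real table also carries node, labels, rate_id, and more):

```python
from collections import defaultdict

# Invented per-rate rows: one per rate x namespace x day (simplified).
rtu_rows = [
    {"namespace": "web", "day": "2026-04-01", "metric_type": "cpu", "cost": 1.50},
    {"namespace": "web", "day": "2026-04-01", "metric_type": "cpu", "cost": 0.75},
    {"namespace": "web", "day": "2026-04-01", "metric_type": "memory", "cost": 2.00},
    {"namespace": "web", "day": "2026-04-01", "metric_type": "storage", "cost": 0.25},
]

# Step 2: roll the per-rate rows up into the three summary columns.
COLUMN = {
    "cpu": "cost_model_cpu_cost",
    "memory": "cost_model_memory_cost",
    "storage": "cost_model_volume_cost",
}

summary = defaultdict(lambda: dict.fromkeys(COLUMN.values(), 0.0))
for row in rtu_rows:
    key = (row["namespace"], row["day"])
    summary[key][COLUMN[row["metric_type"]]] += row["cost"]

day = summary[("web", "2026-04-01")]
assert day["cost_model_cpu_cost"] == 2.25    # two cpu rates summed
assert day["cost_model_memory_cost"] == 2.00
assert day["cost_model_volume_cost"] == 0.25
```

Because the per-rate rows survive in rates_to_usage, the same data can later be regrouped per rate for the Phase 3-4 breakdown without recomputation.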

Pipeline flow

DELETE stale RTU rows (by source_uuid + date range)
    ↓
INSERT per-rate rows into rates_to_usage
  (CTE-based SQL: 6 usage metrics + cluster_cost_per_hour + raw infra cost)
    ↓
UPDATE usage_costs (legacy path — still active, removed in Phase 5)
    ↓
UPDATE markup, monthly, tag-based costs
    ↓
AGGREGATE rates_to_usage → daily summary (DELETE + INSERT)
    ↓
DISTRIBUTE costs (reads cost_model_*_cost from daily summary)
    ↓
UPDATE UI summary tables

Artifacts

| Artifact | File | Description |
| --- | --- | --- |
| RatesToUsage model | reporting/provider/ocp/models.py | New partitioned Django model with label_hash for performant GROUP BY |
| Migration M3 | reporting/migrations/0345_create_rates_to_usage.py | DDL: partitioned table, indexes on (source_uuid, usage_start) and label_hash |
| RTU INSERT SQL | masu/database/sql/openshift/cost_model/insert_usage_rates_to_usage.sql | CTE + UNION ALL producing per-rate rows; replaces the usage_costs.sql direct-write |
| Aggregation SQL | masu/database/sql/openshift/cost_model/aggregate_rates_to_daily_summary.sql | DELETE + INSERT: rolls up RTU into the daily summary cost_model_*_cost columns |
| DELETE SQL | masu/database/sql/openshift/cost_model/delete_rates_to_usage.sql | Cleans stale RTU rows before recalculation |
| CI validation SQL | masu/database/sql/openshift/cost_model/validate_rates_against_daily_summary.sql | Read-only comparison verifying aggregation correctness (CI-only) |
| Orchestration | masu/processor/ocp/ocp_cost_model_cost_updater.py | _update_usage_rates_to_usage() and _aggregate_rates_to_daily_summary() methods; wired into update_summary_cost_model_costs() |
| Accessor methods | masu/database/ocp_report_db_accessor.py | populate_usage_rates_to_usage(), aggregate_rates_to_daily_summary(), delete_rates_to_usage() |
| Rate-table accessor | masu/database/cost_model_db_accessor.py | price_list property reads from the Rate table instead of JSON |
| Purge wiring | masu/processor/ocp/ocp_report_db_cleaner.py | rates_to_usage added to partition cleanup |
| Partition management | ocp_cost_model_cost_updater.py | _ensure_rates_to_usage_partitions() called before RTU writes |
| Test suite | masu/test/processor/ocp/test_phase2_rates_to_usage.py | Unit, integration, behavioral, and e2e tests (~984 lines) |
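The label_hash column in the model above exists to make the GROUP BY cheap; one plausible derivation (a sketch only, the actual implementation in models.py may differ) hashes a canonical serialization of the label set:

```python
import hashlib
import json

def label_hash(labels):
    """Deterministic digest of a label set: equal label dicts hash
    identically regardless of key order, so rows can be grouped on a
    short fixed-width column instead of comparing full JSON values."""
    canonical = json.dumps(labels, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Key order does not matter once the serialization is canonical.
assert label_hash({"app": "web", "env": "prod"}) == \
       label_hash({"env": "prod", "app": "web"})
assert label_hash({"app": "web"}) != label_hash({"app": "api"})
```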

Design documentation

Dependencies

  • Requires #5984 (Phase 1b: Rate table backfill) — Phase 2 reads from the Rate table; without the backfill migration, Rate rows would be empty and the RTU pipeline would produce no output.

Test plan

  • New test suite (test_phase2_rates_to_usage.py) passes
  • Existing cost model test suite passes (aggregation produces identical daily summary values)
  • CI validation SQL (validate_rates_against_daily_summary.sql) returns zero rows (no divergence)
  • rates_to_usage partitions created and cleaned by purge
  • Migration 0345 applies cleanly on fresh and existing databases

Made with Cursor

@jordigilh jordigilh requested review from a team as code owners April 21, 2026 19:03
@github-actions github-actions Bot added the smokes-required label (smoke tests should be run against these changes) Apr 21, 2026
@jordigilh jordigilh marked this pull request as draft April 21, 2026 19:05
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements Phase 2 of the cost breakdown project, introducing the RatesToUsage table and a new SQL-based pipeline for OCP cost calculations. Key changes include normalizing JSON rates into a dedicated Rate table, updating the cost model manager for diff-based synchronization, and integrating the RatesToUsage logic into the OCPCostModelCostUpdater. Feedback identifies a critical logic error where the aggregation step is positioned outside the month loop, which would result in missing data for multi-month ranges. Furthermore, multiple SQL templates require single quotes around Jinja placeholders for date, string, and UUID literals to prevent syntax errors. A design concern was also raised regarding the level of aggregation in the RatesToUsage table, which may need to be more granular to support future per-rate breakdown requirements.

Comment thread koku/masu/processor/ocp/ocp_cost_model_cost_updater.py Outdated
Comment thread koku/masu/database/sql/openshift/cost_model/delete_rates_to_usage.sql Outdated
Comment on lines +146 to +151
SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
'cpu_core_usage_per_hour', 'cpu', {{rate_type}},
NULL, b.cpu_usage_hours * {{cpu_core_usage_per_hour}}, b.cost_category_id
FROM base b WHERE {{cpu_core_usage_per_hour}} != 0
Contributor

high

Missing quotes around Jinja placeholders for UUID and string values ({{cost_model_id}}, {{source_uuid}}, {{rate_type}}). This issue persists across all UNION ALL components in this file.

SELECT uuid_generate_v4(), '{{cost_model_id}}', {{report_period_id}}, '{{source_uuid}}',
    b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
    b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
    'cpu_core_usage_per_hour', 'cpu', '{{rate_type}}',
    NULL, b.cpu_usage_hours * {{cpu_core_usage_per_hour}}, b.cost_category_id
FROM base b WHERE {{cpu_core_usage_per_hour}} != 0

Contributor Author

False positive — same JinjaSql parameterization applies. {{cost_model_id}}, {{source_uuid}}, and {{rate_type}} are all rendered as %(param)s placeholders by JinjaSql and quoted by psycopg2 at execution time. No manual quotes needed.
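As background on the mechanism described here, a minimal mock of JinjaSql-style rendering (this is not JinjaSql itself; the query and the rtu table name are invented) shows how a {{name}} placeholder becomes a pyformat placeholder whose value the driver quotes at execution time:

```python
import re

def prepare_query(template, params):
    """Simplified stand-in for JinjaSql's bind-parameter rendering:
    each {{name}} becomes a %(name)s pyformat placeholder and the
    value travels separately as a bind parameter, so the driver
    (psycopg2 in koku) quotes and escapes it at execution time.
    Manual quotes in the template would wrap the placeholder text,
    not the value."""
    bind = {}

    def to_placeholder(match):
        name = match.group(1)
        bind[name] = params[name]
        return "%(" + name + ")s"

    query = re.sub(r"\{\{\s*(\w+)\s*\}\}", to_placeholder, template)
    return query, bind

sql, bind = prepare_query(
    "SELECT cost FROM rtu WHERE source_uuid = {{source_uuid}}",
    {"source_uuid": "6e3c-example-uuid"},
)
assert sql == "SELECT cost FROM rtu WHERE source_uuid = %(source_uuid)s"
assert bind == {"source_uuid": "6e3c-example-uuid"}
```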

Comment thread koku/masu/database/sql/openshift/cost_model/insert_usage_rates_to_usage.sql Outdated
@jordigilh
Contributor Author

New Risks Identified: R20 & R21

@myersCody — Two new risks surfaced during an adversarial audit of the Phase 2 implementation. Full details are in the risk register (v1.2). Summary below for your review.


R20 — Aggregation DELETE Scope Too Broad for Phase 2

Problem: aggregate_rates_to_daily_summary.sql deletes all daily summary rows matching cost_model_rate_type IN ('Infrastructure', 'Supplementary') AND monthly_cost_type IS NULL, but the subsequent INSERT only re-creates rows from rates_to_usage for metric_type IN ('cpu', 'memory', 'storage'). During the Phase 2 → Phase 3 gap, VM, monthly, and tag costs still written by legacy SQL are silently deleted and never re-created.

Options (details in risk register):

  • A — Narrow the DELETE scope: Only remove rows that will be re-created by the aggregation INSERT.
  • B — Pull forward VM/monthly/tag costs into RTU in Phase 2: Eliminates the gap entirely.
  • C — Defer aggregation to Phase 3: Phase 2 populates RTU without aggregating; validation via CI queries.
  • D — Reorder orchestration: Run legacy SQL after aggregation to restore deleted rows.

Recommendation: Option A or C.

R21 — Transitional VM Cost Handling (Phase 2 → 3)

Problem: phased-delivery.md says Phase 2 should "Replace _update_usage_costs()", but that method handles both usage costs (migrated in Phase 2) and VM costs (assigned to Phase 3). The transitional behavior is unspecified.

Resolution: R21 resolves mechanically once R20 is decided — each R20 option dictates exactly what happens to the VM cost path.


Also in this push

Finding B fix (commit 01d4df7): Restored the Phase 1a rate_sync.py extraction pattern. CostModelManager now delegates to rate_sync.sync_rate_table() instead of inlining sync methods — fixing a regression where the inlined copy diverged from the canonical implementation.

@jordigilh
Contributor Author

R20 & R21 — Resolved (Option D: Reorder orchestration)

@myersCody — Follow-up to my earlier comment. Both risks are now mitigated. The approach was derived directly from your feedback on the tech design PR.

R20: Aggregation DELETE scope

Root cause: _aggregate_rates_to_daily_summary() ran after the legacy VM/tag direct-write paths in update_summary_cost_model_costs(). Its DELETE (cost_model_rate_type IN ('Infrastructure', 'Supplementary') AND monthly_cost_type IS NULL) removed VM and tag rows that the legacy paths had just written, but the INSERT only re-created cpu/memory/storage rows from RTU.

Fix: Moved the aggregation to run before the legacy direct-write paths. The DELETE now only removes stale rows from the previous cycle, and legacy VM/tag paths write their rows after — unaffected.

Why this approach (from your PR #5948 reviews):

  • You rejected dual-path architectures: "My concern with removing the aggregation step is that we will be creating a dual path approach" — ruling out deferring aggregation to Phase 3.
  • You accepted orchestration ordering: "sequencing has always been important even with the current implementation" — confirming call ordering as a valid constraint.
  • No schema change or Phase 2 scope increase needed.

R21: Transitional VM cost handling

Already resolved on this branch: The original _update_usage_costs() (which contained both populate_usage_costs() and populate_vm_usage_costs()) was correctly split into:

  1. RTU pipeline → replaces populate_usage_costs() for cpu/memory/storage
  2. _update_vm_usage_costs() → keeps populate_vm_usage_costs() as a standalone method

With the R20 orchestration reorder, _update_vm_usage_costs() now runs after the aggregation DELETE, so its rows survive. Phase 3 will migrate this method to write to RTU (step 4a in sql-pipeline.md).

New orchestration order (Phase 2)

1. _update_usage_rates_to_usage()        ← RTU INSERT (usage costs)
2. _update_per_rate_distributed_cost()   ← Phase 4 distribution on RTU
3. _aggregate_rates_to_daily_summary()   ← DELETE+INSERT from RTU
4. _update_vm_usage_costs()              ← Legacy VM (writes after DELETE)
5. _update_markup_cost()                 ← Markup ORM UPDATE
6. _update_monthly_cost()                ← Monthly (monthly_cost_type set, not in DELETE scope)
7. Tag cost methods                      ← Legacy tag (writes after DELETE)
8. distribute_costs_and_update_ui_summary()

Full decision rationale with code references in risk-register.md (v1.2, R20 § "Why Option D" and R21 § "Resolution").

Commits in this push:

  • 01d4df7 — Finding B: Extract rate sync logic to rate_sync.py (restores Phase 1a pattern)
  • 3c404fe — R20/R21 risk register entries (initial)
  • 07addee — R20/R21 fix: reorder aggregation + risk register update with resolution

@jordigilh
Contributor Author

Phase 2 Drift Fixes — Post-Preflight Execution

Three drift fixes committed, resolving the remaining items from the adversarial audit:

Commit 1: 12694c1 — drift8-test: Fix sync_test_rate_rows naming

  • Replaced inline custom_name generation with canonical generate_custom_name from rate_sync
  • Test Rate rows now use metric-cost_type format (e.g. cpu_core_usage_per_hour-Infrastructure)
  • PF-1 mitigation: cost_type set on rate_data dict before calling generate_custom_name

Commit 2: 9420f6b — drift5-sql: Resolve rate_id and custom_name from Rate table

  • Added rate_names CTE joining cost_model_rate → cost_model_price_list → cost_model_price_list_cost_model_map
  • Each of 11 UNION ALL components: LEFT JOIN rate_names with COALESCE fallback
  • rate_id FK now populated in INSERT (was NULL before)
  • PF-2 cascade fix: PRD tree test updated to custom_name__startswith
  • New tests: TC-D8-01..04, TC-D5-01..05, TC-R20-01/02 (11 total)

Commit 3: c83a30e — drift2-7-docs: Fix data-model.md source_uuid type

  • 3 locations updated from TextField/TEXT to UUIDField/UUID

Preflight Verification

All changes verified via 5 preflight checks (PF-1, PF-2, PF-D5, PF-D8, PF-D2). Key findings:

  • Aggregation SQL does not use custom_name — zero impact
  • Distribution SQL GROUP BY custom_name produces finer, correct per-rate grouping
  • Breakdown SQL tree paths get distinct leaves per cost_type (intended behavior)
  • rate_id addition is purely additive — no existing SQL reads it

Confidence: 98% (residual 2%: CTE not runtime-tested yet, validated structurally)

@myersCody — drift fixes ready for review. See risk-register.md R20/R21 for the orchestration changes from the prior commits.

@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch 3 times, most recently from c9b388b to 545ae8d Compare April 22, 2026 01:51
@codecov

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 95.55556% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.3%. Comparing base (87ad6a1) to head (137f77c).

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #6017     +/-   ##
=======================================
- Coverage   94.3%   94.3%   -0.0%     
=======================================
  Files        361     361             
  Lines      31805   31909    +104     
  Branches    3484    3494     +10     
=======================================
+ Hits       30007   30101     +94     
  Misses      1170    1170             
- Partials     628     638     +10     

@myersCody myersCody marked this pull request as ready for review April 22, 2026 13:34
@myersCody myersCody added the flightpath-pr (issues being worked on by the flight path team) and ocp-smoke-tests (pr_check will run ocp + ocp-on-cloud smoke tests; used when changes affect ocp) labels Apr 22, 2026
@myersCody myersCody self-assigned this Apr 22, 2026
Contributor

@myersCody myersCody left a comment


The current structure makes it really hard to validate through our integration tests. Overall, mostly good structure. I would like to see us utilize the rate table more in the sql though.

# cycle. Legacy paths write their rows AFTER this step and are unaffected.
self._aggregate_rates_to_daily_summary(start_date, end_date)

self._update_usage_costs(start_date, end_date)
Contributor

We won't be able to utilize our integration tests to confirm your changes with this current structure.

self._aggregate_rates_to_daily_summary(start_date, end_date) -> populates daily summary with cost_model_cost_type = Infrastructure & Supplementary

self._update_usage_costs(start_date, end_date) -> Deletes Infrastructure & Supplementary,
then reinserts Infrastructure and Supplementary.

Essentially we are doing twice the work at the moment with zero gain and no way to confirm functionality outside of the validate_ script that you wrote. Not that I don't trust your validate script, I would just like the ability to use integrated tests instead.

What we can do is add an unleash flag to conditionally turn on your feature pathway for integration tests:

if feature_enabled:
    self._aggregate_rates_to_daily_summary(start_date, end_date)
else:
    self._update_usage_costs(start_date, end_date)

Contributor Author

Sounds good — here's the flag proposal for the gating:

Flag name: cost-management.backend.cost_breakdown_rates_to_usage
Constant: COST_BREAKDOWN_RTU_UNLEASH_FLAG in masu/processor/__init__.py

Following the GPU flag pattern (dev_fallback=True):

  • SaaS: off by default, enabled per-schema via Unleash for integration testing
  • Dev: on by default via fallback_development_true
  • On-prem: off until promoted to MockUnleashClient.ONPREM_FLAG_DEFAULTS after validation

Gating follows your either/or suggestion — when enabled, the RTU pipeline replaces the legacy _update_usage_costs path entirely (no dual-write):

rtu_enabled = is_feature_flag_enabled_by_schema(
    self._schema, COST_BREAKDOWN_RTU_UNLEASH_FLAG, dev_fallback=True
)
if rtu_enabled:
    self._update_usage_rates_to_usage(start_date, end_date)
    self._aggregate_rates_to_daily_summary(start_date, end_date)
else:
    self._update_usage_costs(start_date, end_date)

This way existing integration tests run the legacy path by default, and we can flip the flag per-schema to exercise the RTU pipeline independently.

Also bundled in this update: the SQL refactor you suggested (comments 4+5) — rate_names CTE now JOINs cost_model_rate for default_rate and cost_type directly, eliminating 11 per-metric Jinja params and the two-call Infrastructure/Supplementary loop (single-pass). cluster_cost_per_hour remains as the sole rate Jinja param for the cte_node_cost pre-computation. DELETE is merged into the INSERT file, and all RTU SQL is under usage_rates/.

Contributor

So, I traced down the _update_usage_costs path, and I saw:

report_accessor.populate_vm_usage_costs(
                    report_type,
                    filter_dictionary(report_type_dict, metric_constants.COST_MODEL_VM_USAGE_RATES),
                    start_date,
                    end_date,
                    self._provider.uuid,
                    report_period_id,
                )

Your:

self._update_usage_rates_to_usage(start_date, end_date)
self._aggregate_rates_to_daily_summary(start_date, end_date)

does not calculate the VM rates, which is likely the source of your failing integration tests.

You can decouple the populate_vm_usage_costs and run it independently of the usage costs and that may get your integration tests passing.

Contributor Author

Exactly right — this was the root cause. Extracted _update_vm_usage_costs() as a standalone method and wired it into the RTU path after aggregation (commit ce7fcd2). The orchestration is now:

if rtu_enabled and cost_model_id:
    _update_usage_rates_to_usage()      # RTU INSERT
    _aggregate_rates_to_daily_summary() # DELETE+INSERT from RTU
    _update_vm_usage_costs()            # VM costs (decoupled)
else:
    _update_usage_costs()               # Legacy (includes VM)

Also added _cleanup_stale_rtu_costs() for the edge case where rtu_enabled=True but the cost model has been removed — deletes orphaned rows from both daily_summary and rates_to_usage.

Comment thread koku/masu/database/sql/openshift/cost_model/delete_rates_to_usage.sql Outdated
SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'cpu_core_usage_per_hour'), 'cpu', {{rate_type}},
Contributor

  1. COALESCE(rn.custom_name, 'cpu_core_usage_per_hour'): I would prefer to use rn.metric instead of a hard-coded value. Less chance of a typo making it to production in the future.
  2. Instead of cpu could we just use the metric_type field from rate table?
  3. {{rate_type}} can be replaced with cost_type from the rate field.

Utilizing the cost_type from the rate table means that we don't have to run this SQL twice, once for infrastructure and once for supplementary.

Contributor Author

All three addressed:

  1. COALESCE(rn.custom_name, rn.metric) — all 11 components now fall back to rn.metric instead of hardcoded metric names (commit e7e3c3e).
  2. metric_type — added r.metric_type to the rate_names CTE and replaced hardcoded 'cpu'/'memory'/'storage' with rn.metric_type for Components 1-3 and 7-11 (commit 45bf072). Components 4-5 (node/cluster core) keep 'cpu' because the Rate table stores 'node'/'cluster' but the aggregation SQL routes costs via metric_type IN ('cpu','memory','storage') into cost_model_*_cost columns. Component 6 (cluster_cost_per_hour) keeps the CASE WHEN distribution expression since its target column is distribution-dependent.
  3. {{rate_type}} eliminated — rn.cost_type is read directly from the Rate table via the rate_names CTE (commit 8991953). This also eliminated the two-call loop (Infra/Supplementary in a single pass).

@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch 2 times, most recently from 26777f3 to 2f2cd73 Compare April 23, 2026 00:51
@jordigilh
Contributor Author

Post-Review Polish: Audit Findings Summary

Following a comprehensive adversarial audit of the entire PR (security, correctness, performance, design quality, maintainability), these two commits address the identified findings for long-term maintainability. All changes are backward-compatible and do not change happy-path behavior.


Commit 1: [COST-7249] Fix O(n^2) rate dedup, add composite index, document epsilon

4 targeted improvements:

  1. O(n^2) → O(n) in rate_sync._classify_incoming_rates (LOW)

    • The used_names set was rebuilt from scratch on every loop iteration via a set comprehension over all rates_data. This is O(n) per iteration = O(n²) total.
    • Fix: hoist the set above the loop and grow it with .add() after each _resolve_custom_name call. This preserves the deduplication behavior (names like cpu_core_usage_per_hour-Infrastructure-2) while reducing to O(n).
    • The in-place mutation by _resolve_custom_name was the reason a naive hoist would have been unsafe — the .add() in the loop body is the key.
  2. Composite index for RatesToUsage DELETE/SELECT (LOW)

    • Both insert_usage_rates_to_usage.sql (DELETE) and aggregate_rates_to_daily_summary.sql (SELECT) filter on (usage_start, source_uuid, report_period_id). The previous two separate indexes required a bitmap AND.
    • Fix: replace (usage_start, source_uuid) + (report_period_id) with a single composite (usage_start, source_uuid, report_period_id). Migration 0349 applies this.
  3. Document validation epsilon (INFO)

    • Added inline SQL comment explaining that 0.000000000000001 (1e-15) matches calculated_cost's decimal_places=15 — differences below this are rounding noise.
  4. Clarify re-export pattern (INFO)

    • Added block comment above the derive_metric_type, extract_default_rate, generate_custom_name re-exports in cost_model_manager.py explaining they exist for backward compatibility with common.py and test_rate_helpers.py.
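The hoist described in item 1 can be sketched as follows (simplified stand-ins for _classify_incoming_rates and _resolve_custom_name; the real signatures and naming scheme differ):

```python
def resolve_custom_name(base, used):
    """Return base, or base-2, base-3, ... if base is already taken."""
    if base not in used:
        return base
    n = 2
    while f"{base}-{n}" in used:
        n += 1
    return f"{base}-{n}"

def classify(names):
    # O(n): used is built once and grown with .add() after each
    # resolution, instead of being rebuilt from all rates on every
    # iteration (the O(n^2) shape this commit removes).
    used = set()
    resolved = []
    for name in names:
        unique = resolve_custom_name(name, used)
        used.add(unique)  # this .add() is what makes the hoist safe
        resolved.append(unique)
    return resolved

assert classify(["cpu-Infra", "cpu-Infra", "cpu-Infra"]) == [
    "cpu-Infra", "cpu-Infra-2", "cpu-Infra-3"
]
```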

Commit 2: [COST-7249] Replace inspect.getsource ordering tests with mock-based assertions

Test refactor for robustness:

  • TestOrchestrationOrder previously used inspect.getsource() + str.find() to verify method ordering in the source text. This is fragile and breaks on any refactor.
  • In the either/or execution model, _aggregate_rates_to_daily_summary and _update_usage_costs never run in the same invocation, making the original "agg before usage" test conceptually moot.
  • TC-R20-01 now asserts the full RTU-enabled call ordering: rtu → agg → markup → monthly → dist, and verifies _update_usage_costs is NOT called. Uses the existing _make_orchestration_patches decorator with side_effect lambdas and a call_order list.
  • TC-R20-02 asserts agg < markup as a proxy for agg-before-tags, since tag methods run strictly after _update_markup_cost / _update_monthly_cost in the source. This avoids extending the mock stack to include tag methods.
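A minimal version of that mock-based ordering assertion might look like this (the class here is a stand-in for OCPCostModelCostUpdater, with only three of the patched methods shown):

```python
from unittest import mock

class Updater:
    """Stand-in for OCPCostModelCostUpdater; only ordering is under test."""

    def _update_usage_rates_to_usage(self):
        pass

    def _aggregate_rates_to_daily_summary(self):
        pass

    def _update_markup_cost(self):
        pass

    def update_summary_cost_model_costs(self):
        self._update_usage_rates_to_usage()
        self._aggregate_rates_to_daily_summary()
        self._update_markup_cost()

call_order = []
with mock.patch.object(Updater, "_update_usage_rates_to_usage",
                       side_effect=lambda: call_order.append("rtu")), \
     mock.patch.object(Updater, "_aggregate_rates_to_daily_summary",
                       side_effect=lambda: call_order.append("agg")), \
     mock.patch.object(Updater, "_update_markup_cost",
                       side_effect=lambda: call_order.append("markup")):
    Updater().update_summary_cost_model_costs()

# Robust to refactors: asserts the call order, not the source text.
assert call_order == ["rtu", "agg", "markup"]
assert call_order.index("agg") < call_order.index("markup")
```

Unlike inspect.getsource() + str.find(), this survives renames of local variables, comment changes, and method extraction, and fails only when the actual call ordering changes.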

What's NOT changed (and why)

  • Migration reversibility: All Koku partitioned-table migrations are forward-only. Adding a reverse RunPython that drops the table would risk data loss in production. This is consistent with existing patterns.
  • No new feature flag behavior: These commits do not change any runtime behavior when the flag is enabled or disabled.

Contributor Author

@jordigilh jordigilh left a comment


/retest

Contributor

@myersCody myersCody left a comment


There are some failing regressions with your integration tests. Setting dev_fallback=True means the unleash flag is turned on when the integration tests run. Therefore, your changes are what is causing those test failures.

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'cpu_core_usage_per_hour'), 'cpu', rn.cost_type,
Contributor

Suggested change
COALESCE(rn.custom_name, 'cpu_core_usage_per_hour'), 'cpu', rn.cost_type,
COALESCE(rn.custom_name, rn.metric), 'cpu', rn.cost_type,

Contributor Author

Applied all 11 — every COALESCE(rn.custom_name, '<hardcoded>') is now COALESCE(rn.custom_name, rn.metric) (commit e7e3c3e).

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'cpu_core_request_per_hour'), 'cpu', rn.cost_type,
Contributor

Suggested change
COALESCE(rn.custom_name, 'cpu_core_request_per_hour'), 'cpu', rn.cost_type,
COALESCE(rn.custom_name, rn.metric), 'cpu', rn.cost_type,

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'cpu_core_effective_usage_per_hour'), 'cpu', rn.cost_type,
Contributor

Suggested change
COALESCE(rn.custom_name, 'cpu_core_effective_usage_per_hour'), 'cpu', rn.cost_type,
COALESCE(rn.custom_name, rn.metric), 'cpu', rn.cost_type,

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'node_core_cost_per_hour'), 'cpu', rn.cost_type,
Contributor

Suggested change
COALESCE(rn.custom_name, 'node_core_cost_per_hour'), 'cpu', rn.cost_type,
COALESCE(rn.custom_name, rn.metric), 'cpu', rn.cost_type,

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'cluster_core_cost_per_hour'), 'cpu', rn.cost_type,
Contributor

Suggested change
COALESCE(rn.custom_name, 'cluster_core_cost_per_hour'), 'cpu', rn.cost_type,
COALESCE(rn.custom_name, rn.metric), 'cpu', rn.cost_type,

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'memory_gb_effective_usage_per_hour'), 'memory', rn.cost_type,
Contributor

Suggested change
COALESCE(rn.custom_name, 'memory_gb_effective_usage_per_hour'), 'memory', rn.cost_type,
COALESCE(rn.custom_name, rn.metric), 'memory', rn.cost_type,

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'storage_gb_usage_per_month'), 'storage', rn.cost_type,
Contributor

Suggested change
COALESCE(rn.custom_name, 'storage_gb_usage_per_month'), 'storage', rn.cost_type,
COALESCE(rn.custom_name, rn.metric), 'storage', rn.cost_type,

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'storage_gb_request_per_month'), 'storage', rn.cost_type,
Contributor

Suggested change:
- COALESCE(rn.custom_name, 'storage_gb_request_per_month'), 'storage', rn.cost_type,
+ COALESCE(rn.custom_name, rn.metric), 'storage', rn.cost_type,

lids.cost_category_id
),

rate_names AS (
Contributor

Eventually there could be more than one price list per cost model. We need to choose the price list here that is the highest priority for the start & end date passed in.

Contributor Author

Acknowledged — the current rate_names CTE joins through price_list_cost_model_map without priority filtering, which works today because there's only one price list per cost model. When COST-575 lands the multi-price-list lifecycle support, this CTE will need to filter by pcm.priority and pl.effective_start_date <= start_date AND (pl.effective_end_date IS NULL OR pl.effective_end_date >= end_date). Tracking as a follow-up for when that schema is available.

Contributor

That schema is already available. The only thing stopping 575 from landing right now is the UI work and final verification, so I think it would be better to adjust the SQL now to handle more than one price list.

Contributor Author

Addressed in commit e4c7add — added an effective_price_list CTE that resolves the active price list by effective_start_date/effective_end_date date range and priority, then scopes rate_names to that single price list. This handles the multi-price-list scenario from 575.
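The resolution rule described above (filter by effective date range, then pick the highest-priority price list) can be sketched against a toy schema. Everything here is illustrative: the table layout, the assumption that a lower `priority` number wins, and the sample data are not koku's actual migration.

```python
import sqlite3

# Toy schema standing in for the price_list / cost-model mapping tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE price_list (
    id INTEGER, cost_model_id INTEGER, priority INTEGER,
    effective_start_date TEXT, effective_end_date TEXT
);
INSERT INTO price_list VALUES
    (1, 10, 2, '2026-01-01', NULL),          -- open-ended, lower priority
    (2, 10, 1, '2026-03-01', '2026-12-31'),  -- active, highest priority
    (3, 10, 1, '2025-01-01', '2025-12-31');  -- expired
""")

# Mirrors the filter from the review comment:
#   effective_start_date <= start_date AND
#   (effective_end_date IS NULL OR effective_end_date >= end_date)
row = con.execute("""
    SELECT id FROM price_list
    WHERE cost_model_id = ?
      AND effective_start_date <= ?
      AND (effective_end_date IS NULL OR effective_end_date >= ?)
    ORDER BY priority ASC, effective_start_date DESC
    LIMIT 1
""", (10, "2026-04-01", "2026-04-30")).fetchone()
print(row[0])  # 2: the expired list is excluded, priority breaks the tie
```

The `LIMIT 1` is what guarantees `rate_names` is scoped to a single price list even after multi-price-list support lands.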

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'cpu_core_usage_per_hour'), 'cpu', rn.cost_type,
Contributor

We added a metric_type column to Rate table. Why are we hardcoding cpu, memory or storage for all of the select statement in order to insert the metric_type? I find that a bit confusing.

Contributor Author

Fixed in commit 45bf072 — added r.metric_type to the rate_names CTE and replaced hardcoded values with rn.metric_type for 8 of the 11 components (1-3, 7-11). The remaining 3 components (4, 5, 6) keep explicit values because the Rate table stores 'node'/'cluster' for those metrics, but the aggregation SQL filters on metric_type IN ('cpu', 'memory', 'storage') to route costs into the daily summary cost_model_*_cost columns. Using rn.metric_type there would silently drop those rows from aggregation.
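The routing concern can be shown with a toy filter. This is illustrative only (hypothetical rows, not the actual aggregation SQL): it demonstrates why using the Rate table's `metric_type` verbatim for components 4-6 would drop those rows.

```python
# The aggregation step only routes these metric types into the
# daily-summary cost_model_*_cost columns.
AGG_METRIC_TYPES = {"cpu", "memory", "storage"}

# Hypothetical per-rate rows; for components 4-6 the Rate table
# stores 'node'/'cluster' rather than 'cpu'.
rows = [
    {"metric": "cpu_core_usage_per_hour", "metric_type": "cpu"},
    {"metric": "node_core_cost_per_hour", "metric_type": "node"},
    {"metric": "cluster_core_cost_per_hour", "metric_type": "cluster"},
]

kept = [r for r in rows if r["metric_type"] in AGG_METRIC_TYPES]
dropped = [r["metric"] for r in rows if r["metric_type"] not in AGG_METRIC_TYPES]
print(dropped)  # the node/cluster rows would silently miss aggregation
```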

@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch from efd3aae to 058ed69 Compare April 24, 2026 19:15
@jordigilh (Contributor Author)

/retest

1 similar comment
@jordigilh (Contributor Author)

/retest

jordigilh added a commit to jordigilh/koku that referenced this pull request Apr 27, 2026
Add r.metric_type to the rate_names CTE and replace hardcoded
'cpu'/'memory'/'storage' with rn.metric_type in Components 1-3, 7-11.

Components 4-5 (node_core_cost_per_hour, cluster_core_cost_per_hour) keep
'cpu' because the Rate table stores 'node'/'cluster' but the aggregation
SQL routes costs via metric_type IN ('cpu','memory','storage') into the
daily summary cost_model_*_cost columns.

Component 6 (cluster_cost_per_hour) keeps the CASE WHEN distribution
expression since its metric_type is distribution-dependent.

Addresses TL review feedback on PR project-koku#6017.

Made-with: Cursor
@jordigilh (Contributor Author)

Note on dev_fallback=False

Changed the Unleash check from dev_fallback=True to dev_fallback=False (commit 6388df8).

Why: dev_fallback=True meant the RTU pipeline activated in any environment without an Unleash server — CI, local dev, on-prem — which caused integration test failures because the tests did not set up rates_to_usage data. With dev_fallback=False, the RTU path only activates when the flag is explicitly enabled in Unleash, which is the correct behavior for a gated rollout. SaaS environments with Unleash are unaffected.
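A minimal sketch of the fallback semantics described above. This simplifies the real helper heavily (the client handling and signature are assumptions, not koku's implementation): the point is only that `dev_fallback` is the answer used when no Unleash server is reachable.

```python
def is_feature_flag_enabled_by_schema(schema, flag, dev_fallback=False, client=None):
    """Sketch: return the flag state for a tenant schema.

    When no Unleash client is configured (CI, local dev, on-prem),
    fall back to dev_fallback instead of querying the server.
    """
    if client is None:
        return dev_fallback
    return client.is_enabled(flag, context={"schema": schema})

# With dev_fallback=False and no Unleash server, the RTU path stays off:
flag = "cost-management.backend.cost_breakdown_rates_to_usage"
print(is_feature_flag_enabled_by_schema("org1234567", flag))  # False
```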

@pedrolp85 (Contributor)

/retest

@esebesto esebesto added full-run-smoke-tests pr_check will run all smoke tests. Used for large or wider reaching changes. and removed ocp-smoke-tests pr_check will run ocp + ocp on cloud smoke tests, used when changes affect ocp. labels Apr 27, 2026
@esebesto esebesto added the hot-fix-smoke-tests pr_check label to run minimal smoke tests for fast moving bug-fix label Apr 28, 2026
@esebesto (Contributor)

Tested with full smokes with just one failure, unrelated to the PR -> we can switch to hot-fix-smoke tests for the rebase run (in case the PR itself is updated, we will need a different label).
ERROR tests/rest_api/v1/test__cost_model.py::test_api_cost_model_ocp_on_azure_markup_calculation - LookupError: No unused azure storage account has been found.

@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch from e8fc443 to e4c7add Compare April 28, 2026 13:25
@myersCody myersCody added smoke-tests pr_check will run minimal required smokes. Used when changes hit multiple providers. ocp-smoke-tests pr_check will run ocp + ocp on cloud smoke tests, used when changes affect ocp. full-run-smoke-tests pr_check will run all smoke tests. Used for large or wider reaching changes. and removed hot-fix-smoke-tests pr_check label to run minimal smoke tests for fast moving bug-fix smoke-tests pr_check will run minimal required smokes. Used when changes hit multiple providers. full-run-smoke-tests pr_check will run all smoke tests. Used for large or wider reaching changes. ocp-smoke-tests pr_check will run ocp + ocp on cloud smoke tests, used when changes affect ocp. labels Apr 28, 2026
)

rtu_enabled = is_feature_flag_enabled_by_schema(
self._schema, COST_BREAKDOWN_RTU_UNLEASH_FLAG, dev_fallback=False
Contributor

So, I chatted with QE during our boardwalk today and the preference is to set dev_fallback=True. This means all smoke test jobs moving forward will be using your flow.

We have stage running the old flow for any regressions.

Contributor Author

Done — flipped dev_fallback=True so all smoke tests use the RTU flow going forward.

]

operations = [
migrations.RemoveIndex(
Contributor

These indexes are added in the previous migration; we could just put the changes we want in the original migration 0348.

Contributor Author

Good call — squashed the composite index directly into 0348 and removed 0349.

@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch 3 times, most recently from 9d33130 to 4ba9477 Compare April 28, 2026 17:26
@jordigilh (Contributor Author)

koku-ci timed out after ~6h during the data ingestion setup phase — the actual smoke tests never ran (still at 0% progress). The last activity was polling test_cost_gcp_source_no_markup waiting for GCP report file processing. Zero test failures in the log.

This is a resource/time constraint of full-run-smoke-tests (10,161 tests) in the ephemeral environment, not related to PR changes.

/retest

@esebesto (Contributor)

esebesto commented Apr 29, 2026

> koku-ci timed out after ~6h during the data ingestion setup phase — the actual smoke tests never ran (still at 0% progress). The last activity was polling test_cost_gcp_source_no_markup waiting for GCP report file processing. Zero test failures in the log.
>
> This is a resource/time constraint of full-run-smoke-tests (10,161 tests) in the ephemeral environment, not related to PR changes.

@jordigilh The PR check koku-ci-hrlz6 was actually not aborted; note that the logs in the Konflux UI are truncated (so it is definitely confusing).

I downloaded the results of that pr-check and see the following failures directly related to you PR:

FAILED tests/rest_api/v1/test__i_source.py::test_api_ocp_on_aws_source_raw_calc - AssertionError: Raw calculated value does not match report value, supplied=19.839636879244534, expected=14.356314616774196 and dynamic_tolerance=0.014356314616774196 absolute_tolerance=1e-08
assert (5.483322262470338 <= 0.014356314616774196 or 5.483322262470338 <= 1e-08)
FAILED tests/rest_api/v1/test__i_source.py::test_api_ocp_on_aws_source_raw_calc_multi_cluster - AssertionError: Raw calculated value does not match report value, supplied=138843.39823139575, expected=120784.39198605249 and dynamic_tolerance=0.1 absolute_tolerance=1e-08
assert (18059.006245343262 <= 0.1 or 18059.006245343262 <= 1e-08)
FAILED tests/rest_api/v1/test__i_source.py::test_api_ocp_on_azure_source_raw_calc - AssertionError: Raw calculated value does not match report value, supplied=245.26385915438297, expected=195.93512241792 and dynamic_tolerance=0.1 absolute_tolerance=1e-08
assert (49.32873673646296 <= 0.1 or 49.32873673646296 <= 1e-08)
FAILED tests/rest_api/v1/test__i_source.py::test_api_ocp_on_azure_source_raw_calc_multi_cluster - AssertionError: Supplementary cost in OCP cost path unmatches calculated for date='2026-03-01' and project_name='Platform unallocated', supplied=5378.55709132931, expected=2689.2785464262406 and dynamic_tolerance=0.1 absolute_tolerance=1e-08
assert (2689.27854490307 <= 0.1 or 2689.27854490307 <= 1e-08)
FAILED tests/rest_api/v1/test__i_source.py::test_api_ocp_on_gcp_source_raw_calc - AssertionError: kikis ocp worker unallocated distributed cost mismatch, supplied=202.61313590940347, expected=116.2304124744 and dynamic_tolerance=0.1 absolute_tolerance=1e-08
assert (86.38272343500347 <= 0.1 or 86.38272343500347 <= 1e-08)

These failures indicate that the cost calculations are not as expected.

@esebesto esebesto removed the full-run-smoke-tests pr_check will run all smoke tests. Used for large or wider reaching changes. label Apr 29, 2026
@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch from 4ba9477 to da15ea8 Compare April 30, 2026 17:24
jordigilh added a commit to jordigilh/koku that referenced this pull request May 1, 2026
Add r.metric_type to the rate_names CTE and replace hardcoded
'cpu'/'memory'/'storage' with rn.metric_type in Components 1-3, 7-11.

Components 4-5 (node_core_cost_per_hour, cluster_core_cost_per_hour) keep
'cpu' because the Rate table stores 'node'/'cluster' but the aggregation
SQL routes costs via metric_type IN ('cpu','memory','storage') into the
daily summary cost_model_*_cost columns.

Component 6 (cluster_cost_per_hour) keeps the CASE WHEN distribution
expression since its metric_type is distribution-dependent.

Addresses TL review feedback on PR project-koku#6017.

Made-with: Cursor
@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch 2 times, most recently from 2256f40 to acd4e74 Compare May 1, 2026 01:13
…, and tests

Introduces the RatesToUsage table and pipeline for cost model calculations:

- RatesToUsage model and migration (M4) with composite index
- SQL pipeline: insert per-rate rows, aggregate to daily summary
- Orchestration: feature-flagged RTU path in OCPCostModelCostUpdater
- Unleash flag: cost-management.backend.cost_breakdown_rates_to_usage
- Rate sync logic extracted to rate_sync.py with O(n^2) dedup fix
- Custom name generation consolidated into _resolve_custom_name
- Full test suite: unit, integration, behavioral, and e2e
- Legacy-path tests updated to disable RTU flag

Made-with: Cursor
@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch from acd4e74 to 7f76f7d Compare May 1, 2026 01:24
…nd orchestration

Root cause: label_hash collision in RTU aggregate SQL caused join expansion
and inflated costs (~38% higher). The md5 hash concatenated pod_labels,
volume_labels, and all_labels without delimiters, so rows differing only
in which label field was NULL vs '{}' produced identical hashes.

Fix: add '|' delimiter between label fields in md5() across
insert_usage_rates_to_usage.sql, aggregate_rates_to_daily_summary.sql,
and validate_rates_against_daily_summary.sql.

Additional fixes:
- Tighten base CTE filter to exclude rows with existing cost_model_rate_type
- Fix distribution lookup: read from cost_model table in SQL
- Read metric_type and cost_type from Rate table directly
- Extract _update_vm_usage_costs and call after RTU aggregation
- Factor cluster_cost_per_hour into distribution-aware component
- Enable dev_fallback=True for RTU pipeline in dev/test

Verified: both IQE raw_calc tests pass with dev_fallback=True.
Co-authored-by: Cursor <cursoragent@cursor.com>
@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch from 0bec8fb to 137f77c Compare May 1, 2026 17:20
@jordigilh jordigilh added the full-run-smoke-tests pr_check will run all smoke tests. Used for large or wider reaching changes. label May 1, 2026
@jordigilh (Contributor Author)

RCA: RTU Pipeline Cost Inflation with dev_fallback=True

Symptom

When dev_fallback=True enabled the RTU pipeline, two IQE integration tests failed with ~38% cost inflation:

  • test_api_ocp_on_aws_source_raw_calc
  • test_api_ocp_on_aws_source_raw_calc_multi_cluster

API-returned costs were consistently higher than expected (e.g., supplied=66.0 vs expected=44.0 for CPU).

Root Cause: label_hash collision in aggregation JOIN

The aggregate_rates_to_daily_summary.sql and insert_usage_rates_to_usage.sql files use an md5 hash over label fields to group rows:

-- BEFORE (buggy):
md5(COALESCE(lids.pod_labels::text, '')
    || COALESCE(lids.volume_labels::text, '')
    || COALESCE(lids.all_labels::text, '')) AS label_hash

This hash lacked delimiters between concatenated fields, causing collisions when field boundaries shifted. For example:

| pod_labels | volume_labels | all_labels | Concatenation | Hash |
| --- | --- | --- | --- | --- |
| NULL → '' | '{}' | NULL → '' | '{}' | abc123 |
| NULL → '' | NULL → '' | '{}' | '{}' | abc123 ← collision |

Two semantically distinct rows produced the same label_hash. During the LEFT JOIN in aggregation, a single rates_to_usage row matched multiple base rows, inflating costs by 1.5x.
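The collision is easy to reproduce outside SQL. This Python sketch mirrors the md5-over-concatenation logic; the `label_hash` helper and sample rows are illustrative, not koku code.

```python
import hashlib

def label_hash(pod, vol, all_, delim=""):
    # Equivalent of COALESCE(x::text, ''): NULL becomes an empty string.
    parts = [p if p is not None else "" for p in (pod, vol, all_)]
    return hashlib.md5(delim.join(parts).encode()).hexdigest()

# Two semantically distinct rows: which field holds '{}' differs.
row_a = (None, "{}", None)   # pod NULL, volume '{}', all NULL
row_b = (None, None, "{}")   # pod NULL, volume NULL, all '{}'

# Without a delimiter both concatenate to '{}' and collide:
assert label_hash(*row_a) == label_hash(*row_b)

# With '|' between fields the boundaries survive, so the hashes differ:
assert label_hash(*row_a, delim="|") != label_hash(*row_b, delim="|")
```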

Fix

Added | delimiter between label fields to prevent boundary collisions:

-- AFTER (fixed):
md5(COALESCE(lids.pod_labels::text, '')
    || '|' || COALESCE(lids.volume_labels::text, '')
    || '|' || COALESCE(lids.all_labels::text, '')) AS label_hash

Applied consistently across:

  • insert_usage_rates_to_usage.sql
  • aggregate_rates_to_daily_summary.sql
  • validate_rates_against_daily_summary.sql

Verification

Both IQE tests pass with dev_fallback=True:

========= 2 passed, 11369 deselected, 16 warnings in 475.70s (0:07:55) =========

Added full-run-smoke-tests label for full CI validation.

@jordigilh (Contributor Author)

/retest
