
[COST-7249] Phase 2: Rate Calculation Isolation — RatesToUsage pipeline #6017

Open
jordigilh wants to merge 2 commits into project-koku:main from jordigilh:COST-7249/phase2-rate-isolation

Conversation

@jordigilh
Contributor

Summary

PR 2 of 5 in the COST-7249 phased delivery plan.

| PR | Status | Scope |
| --- | --- | --- |
| #5980 Phase 1a: Design docs | Merged | Architecture documentation |
| #5984 Phase 1b: Rate table backfill | Open | Data migration (JSON → Rate rows) |
| This PR — Phase 2: Rate Calculation Isolation | Open | RatesToUsage pipeline |
| Phase 3+4: Per-rate distribution, breakdown API | Planned | Distribution SQL, API endpoint |
| Phase 5: Legacy cleanup | Planned | Remove usage_costs.sql and CostModel.rates JSON |

What this PR does

Introduces the RatesToUsage table and pipeline, which replaces the direct-write path in usage_costs.sql with a two-step approach:

  1. INSERT per-rate cost rows into rates_to_usage (partitioned by month, one row per rate × namespace × node × day at full label granularity)
  2. Aggregate those rows back into the existing reporting_ocpusagelineitem_daily_summary columns (cost_model_cpu_cost, cost_model_memory_cost, cost_model_volume_cost)

This makes rates_to_usage the single source of truth for cost model calculations, enabling per-rate cost breakdown in Phases 3-4.
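To make the two-step shape concrete, here is an in-memory sketch of step 2 over plain dicts (the summary column names come from this PR; the rows, namespace, and cost figures are invented for illustration, and the real table also carries node, labels, rate_id, and more):

```python
from collections import defaultdict

# Invented per-rate rows: one per rate x namespace x day (simplified).
rtu_rows = [
    {"namespace": "web", "day": "2026-04-01", "metric_type": "cpu", "cost": 1.50},
    {"namespace": "web", "day": "2026-04-01", "metric_type": "cpu", "cost": 0.75},
    {"namespace": "web", "day": "2026-04-01", "metric_type": "memory", "cost": 2.00},
    {"namespace": "web", "day": "2026-04-01", "metric_type": "storage", "cost": 0.25},
]

# Step 2: roll the per-rate rows up into the three summary columns.
COLUMN = {
    "cpu": "cost_model_cpu_cost",
    "memory": "cost_model_memory_cost",
    "storage": "cost_model_volume_cost",
}

summary = defaultdict(lambda: dict.fromkeys(COLUMN.values(), 0.0))
for row in rtu_rows:
    key = (row["namespace"], row["day"])
    summary[key][COLUMN[row["metric_type"]]] += row["cost"]

day = summary[("web", "2026-04-01")]
assert day["cost_model_cpu_cost"] == 2.25    # two cpu rates summed
assert day["cost_model_memory_cost"] == 2.00
assert day["cost_model_volume_cost"] == 0.25
```

Because the per-rate rows survive in rates_to_usage, the same data can later be regrouped per rate for the Phase 3-4 breakdown without recomputation.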

Pipeline flow

DELETE stale RTU rows (by source_uuid + date range)
    ↓
INSERT per-rate rows into rates_to_usage
  (CTE-based SQL: 6 usage metrics + cluster_cost_per_hour + raw infra cost)
    ↓
UPDATE usage_costs (legacy path — still active, removed in Phase 5)
    ↓
UPDATE markup, monthly, tag-based costs
    ↓
AGGREGATE rates_to_usage → daily summary (DELETE + INSERT)
    ↓
DISTRIBUTE costs (reads cost_model_*_cost from daily summary)
    ↓
UPDATE UI summary tables

Artifacts

| Artifact | File | Description |
| --- | --- | --- |
| RatesToUsage model | reporting/provider/ocp/models.py | New partitioned Django model with label_hash for performant GROUP BY |
| Migration M3 | reporting/migrations/0345_create_rates_to_usage.py | DDL: partitioned table, indexes on (source_uuid, usage_start) and label_hash |
| RTU INSERT SQL | masu/database/sql/openshift/cost_model/insert_usage_rates_to_usage.sql | CTE + UNION ALL producing per-rate rows; replaces the usage_costs.sql direct-write |
| Aggregation SQL | masu/database/sql/openshift/cost_model/aggregate_rates_to_daily_summary.sql | DELETE + INSERT: rolls up RTU into the daily summary cost_model_*_cost columns |
| DELETE SQL | masu/database/sql/openshift/cost_model/delete_rates_to_usage.sql | Cleans stale RTU rows before recalculation |
| CI validation SQL | masu/database/sql/openshift/cost_model/validate_rates_against_daily_summary.sql | Read-only comparison verifying aggregation correctness (CI-only) |
| Orchestration | masu/processor/ocp/ocp_cost_model_cost_updater.py | _update_usage_rates_to_usage() and _aggregate_rates_to_daily_summary() methods; wired into update_summary_cost_model_costs() |
| Accessor methods | masu/database/ocp_report_db_accessor.py | populate_usage_rates_to_usage(), aggregate_rates_to_daily_summary(), delete_rates_to_usage() |
| Rate-table accessor | masu/database/cost_model_db_accessor.py | price_list property reads from the Rate table instead of JSON |
| Purge wiring | masu/processor/ocp/ocp_report_db_cleaner.py | rates_to_usage added to partition cleanup |
| Partition management | ocp_cost_model_cost_updater.py | _ensure_rates_to_usage_partitions() called before RTU writes |
| Test suite | masu/test/processor/ocp/test_phase2_rates_to_usage.py | Unit, integration, behavioral, and e2e tests (~984 lines) |
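The label_hash column in the model above exists to make the GROUP BY cheap; one plausible derivation (a sketch only, the actual implementation in models.py may differ) hashes a canonical serialization of the label set:

```python
import hashlib
import json

def label_hash(labels):
    """Deterministic digest of a label set: equal label dicts hash
    identically regardless of key order, so rows can be grouped on a
    short fixed-width column instead of comparing full JSON values."""
    canonical = json.dumps(labels, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Key order does not matter once the serialization is canonical.
assert label_hash({"app": "web", "env": "prod"}) == \
       label_hash({"env": "prod", "app": "web"})
assert label_hash({"app": "web"}) != label_hash({"app": "api"})
```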

Design documentation

Dependencies

  • Requires #5984 (Phase 1b: Rate table backfill) — Phase 2 reads from the Rate table; without the backfill migration, Rate rows would be empty and the RTU pipeline would produce no output.

Test plan

  • New test suite (test_phase2_rates_to_usage.py) passes
  • Existing cost model test suite passes (aggregation produces identical daily summary values)
  • CI validation SQL (validate_rates_against_daily_summary.sql) returns zero rows (no divergence)
  • rates_to_usage partitions created and cleaned by purge
  • Migration 0345 applies cleanly on fresh and existing databases

Made with Cursor

@jordigilh jordigilh requested review from a team as code owners April 21, 2026 19:03
@github-actions github-actions Bot added the smokes-required label (smoke tests should be run against these changes) Apr 21, 2026
@jordigilh jordigilh marked this pull request as draft April 21, 2026 19:05
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements Phase 2 of the cost breakdown project, introducing the RatesToUsage table and a new SQL-based pipeline for OCP cost calculations. Key changes include normalizing JSON rates into a dedicated Rate table, updating the cost model manager for diff-based synchronization, and integrating the RatesToUsage logic into the OCPCostModelCostUpdater. Feedback identifies a critical logic error where the aggregation step is positioned outside the month loop, which would result in missing data for multi-month ranges. Furthermore, multiple SQL templates require single quotes around Jinja placeholders for date, string, and UUID literals to prevent syntax errors. A design concern was also raised regarding the level of aggregation in the RatesToUsage table, which may need to be more granular to support future per-rate breakdown requirements.

Comment thread koku/masu/processor/ocp/ocp_cost_model_cost_updater.py Outdated
Comment thread koku/masu/database/sql/openshift/cost_model/delete_rates_to_usage.sql Outdated
Comment on lines +146 to +151
SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
'cpu_core_usage_per_hour', 'cpu', {{rate_type}},
NULL, b.cpu_usage_hours * {{cpu_core_usage_per_hour}}, b.cost_category_id
FROM base b WHERE {{cpu_core_usage_per_hour}} != 0
Contributor

high

Missing quotes around Jinja placeholders for UUID and string values ({{cost_model_id}}, {{source_uuid}}, {{rate_type}}). This issue persists across all UNION ALL components in this file.

SELECT uuid_generate_v4(), '{{cost_model_id}}', {{report_period_id}}, '{{source_uuid}}',
    b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
    b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
    'cpu_core_usage_per_hour', 'cpu', '{{rate_type}}',
    NULL, b.cpu_usage_hours * {{cpu_core_usage_per_hour}}, b.cost_category_id
FROM base b WHERE {{cpu_core_usage_per_hour}} != 0

Contributor Author

False positive — same JinjaSql parameterization applies. {{cost_model_id}}, {{source_uuid}}, and {{rate_type}} are all rendered as %(param)s placeholders by JinjaSql and quoted by psycopg2 at execution time. No manual quotes needed.
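As background on the mechanism described here, a minimal mock of JinjaSql-style rendering (this is not JinjaSql itself; the query and the rtu table name are invented) shows how a {{name}} placeholder becomes a pyformat placeholder whose value the driver quotes at execution time:

```python
import re

def prepare_query(template, params):
    """Simplified stand-in for JinjaSql's bind-parameter rendering:
    each {{name}} becomes a %(name)s pyformat placeholder and the
    value travels separately as a bind parameter, so the driver
    (psycopg2 in koku) quotes and escapes it at execution time.
    Manual quotes in the template would wrap the placeholder text,
    not the value."""
    bind = {}

    def to_placeholder(match):
        name = match.group(1)
        bind[name] = params[name]
        return "%(" + name + ")s"

    query = re.sub(r"\{\{\s*(\w+)\s*\}\}", to_placeholder, template)
    return query, bind

sql, bind = prepare_query(
    "SELECT cost FROM rtu WHERE source_uuid = {{source_uuid}}",
    {"source_uuid": "6e3c-example-uuid"},
)
assert sql == "SELECT cost FROM rtu WHERE source_uuid = %(source_uuid)s"
assert bind == {"source_uuid": "6e3c-example-uuid"}
```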

Comment thread koku/masu/database/sql/openshift/cost_model/insert_usage_rates_to_usage.sql Outdated
@jordigilh
Contributor Author

New Risks Identified: R20 & R21

@myersCody — Two new risks surfaced during an adversarial audit of the Phase 2 implementation. Full details are in the risk register (v1.2). Summary below for your review.


R20 — Aggregation DELETE Scope Too Broad for Phase 2

Problem: aggregate_rates_to_daily_summary.sql deletes all daily summary rows matching cost_model_rate_type IN ('Infrastructure', 'Supplementary') AND monthly_cost_type IS NULL, but the subsequent INSERT only re-creates rows from rates_to_usage for metric_type IN ('cpu', 'memory', 'storage'). During the Phase 2 → Phase 3 gap, VM, monthly, and tag costs still written by legacy SQL are silently deleted and never re-created.

Options (details in risk register):

  • A — Narrow the DELETE scope: Only remove rows that will be re-created by the aggregation INSERT.
  • B — Pull forward VM/monthly/tag costs into RTU in Phase 2: Eliminates the gap entirely.
  • C — Defer aggregation to Phase 3: Phase 2 populates RTU without aggregating; validation via CI queries.
  • D — Reorder orchestration: Run legacy SQL after aggregation to restore deleted rows.

Recommendation: Option A or C.

R21 — Transitional VM Cost Handling (Phase 2 → 3)

Problem: phased-delivery.md says Phase 2 should "Replace _update_usage_costs()", but that method handles both usage costs (migrated in Phase 2) and VM costs (assigned to Phase 3). The transitional behavior is unspecified.

Resolution: R21 resolves mechanically once R20 is decided — each R20 option dictates exactly what happens to the VM cost path.


Also in this push

Finding B fix (commit 01d4df7): Restored the Phase 1a rate_sync.py extraction pattern. CostModelManager now delegates to rate_sync.sync_rate_table() instead of inlining sync methods — fixing a regression where the inlined copy diverged from the canonical implementation.

@jordigilh
Contributor Author

R20 & R21 — Resolved (Option D: Reorder orchestration)

@myersCody — Follow-up to my earlier comment. Both risks are now mitigated. The approach was derived directly from your feedback on the tech design PR.

R20: Aggregation DELETE scope

Root cause: _aggregate_rates_to_daily_summary() ran after the legacy VM/tag direct-write paths in update_summary_cost_model_costs(). Its DELETE (cost_model_rate_type IN ('Infrastructure', 'Supplementary') AND monthly_cost_type IS NULL) removed VM and tag rows that the legacy paths had just written, but the INSERT only re-created cpu/memory/storage rows from RTU.

Fix: Moved the aggregation to run before the legacy direct-write paths. The DELETE now only removes stale rows from the previous cycle, and legacy VM/tag paths write their rows after — unaffected.

Why this approach (from your PR #5948 reviews):

  • You rejected dual-path architectures: "My concern with removing the aggregation step is that we will be creating a dual path approach" — ruling out deferring aggregation to Phase 3.
  • You accepted orchestration ordering: "sequencing has always been important even with the current implementation" — confirming call ordering as a valid constraint.
  • No schema change or Phase 2 scope increase needed.

R21: Transitional VM cost handling

Already resolved on this branch: The original _update_usage_costs() (which contained both populate_usage_costs() and populate_vm_usage_costs()) was correctly split into:

  1. RTU pipeline → replaces populate_usage_costs() for cpu/memory/storage
  2. _update_vm_usage_costs() → keeps populate_vm_usage_costs() as a standalone method

With the R20 orchestration reorder, _update_vm_usage_costs() now runs after the aggregation DELETE, so its rows survive. Phase 3 will migrate this method to write to RTU (step 4a in sql-pipeline.md).

New orchestration order (Phase 2)

1. _update_usage_rates_to_usage()        ← RTU INSERT (usage costs)
2. _update_per_rate_distributed_cost()   ← Phase 4 distribution on RTU
3. _aggregate_rates_to_daily_summary()   ← DELETE+INSERT from RTU
4. _update_vm_usage_costs()              ← Legacy VM (writes after DELETE)
5. _update_markup_cost()                 ← Markup ORM UPDATE
6. _update_monthly_cost()                ← Monthly (monthly_cost_type set, not in DELETE scope)
7. Tag cost methods                      ← Legacy tag (writes after DELETE)
8. distribute_costs_and_update_ui_summary()

Full decision rationale with code references in risk-register.md (v1.2, R20 § "Why Option D" and R21 § "Resolution").

Commits in this push:

  • 01d4df7 — Finding B: Extract rate sync logic to rate_sync.py (restores Phase 1a pattern)
  • 3c404fe — R20/R21 risk register entries (initial)
  • 07addee — R20/R21 fix: reorder aggregation + risk register update with resolution

@jordigilh
Contributor Author

Phase 2 Drift Fixes — Post-Preflight Execution

Three drift fixes committed, resolving the remaining items from the adversarial audit:

Commit 1: 12694c1 — drift8-test: Fix sync_test_rate_rows naming

  • Replaced inline custom_name generation with canonical generate_custom_name from rate_sync
  • Test Rate rows now use metric-cost_type format (e.g. cpu_core_usage_per_hour-Infrastructure)
  • PF-1 mitigation: cost_type set on rate_data dict before calling generate_custom_name

Commit 2: 9420f6b — drift5-sql: Resolve rate_id and custom_name from Rate table

  • Added rate_names CTE joining cost_model_rate → cost_model_price_list → cost_model_price_list_cost_model_map
  • Each of 11 UNION ALL components: LEFT JOIN rate_names with COALESCE fallback
  • rate_id FK now populated in INSERT (was NULL before)
  • PF-2 cascade fix: PRD tree test updated to custom_name__startswith
  • New tests: TC-D8-01..04, TC-D5-01..05, TC-R20-01/02 (11 total)

Commit 3: c83a30e — drift2-7-docs: Fix data-model.md source_uuid type

  • 3 locations updated from TextField/TEXT to UUIDField/UUID

Preflight Verification

All changes verified via 5 preflight checks (PF-1, PF-2, PF-D5, PF-D8, PF-D2). Key findings:

  • Aggregation SQL does not use custom_name — zero impact
  • Distribution SQL GROUP BY custom_name produces finer, correct per-rate grouping
  • Breakdown SQL tree paths get distinct leaves per cost_type (intended behavior)
  • rate_id addition is purely additive — no existing SQL reads it

Confidence: 98% (residual 2%: CTE not runtime-tested yet, validated structurally)

@myersCody — drift fixes ready for review. See risk-register.md R20/R21 for the orchestration changes from the prior commits.

@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch 3 times, most recently from c9b388b to 545ae8d Compare April 22, 2026 01:51
@codecov

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 95.55556% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.3%. Comparing base (87ad6a1) to head (137f77c).

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #6017     +/-   ##
=======================================
- Coverage   94.3%   94.3%   -0.0%     
=======================================
  Files        361     361             
  Lines      31805   31909    +104     
  Branches    3484    3494     +10     
=======================================
+ Hits       30007   30101     +94     
  Misses      1170    1170             
- Partials     628     638     +10     

@myersCody myersCody marked this pull request as ready for review April 22, 2026 13:34
@myersCody myersCody added the flightpath-pr (issues being worked on by the flight path team) and ocp-smoke-tests (pr_check will run ocp + ocp-on-cloud smoke tests; used when changes affect ocp) labels Apr 22, 2026
@myersCody myersCody self-assigned this Apr 22, 2026
Contributor

@myersCody myersCody left a comment


The current structure makes it really hard to validate through our integration tests. Overall, mostly good structure. I would like to see us utilize the rate table more in the sql though.

# cycle. Legacy paths write their rows AFTER this step and are unaffected.
self._aggregate_rates_to_daily_summary(start_date, end_date)

self._update_usage_costs(start_date, end_date)
Contributor

We won't be able to utilize our integration tests to confirm your changes with this current structure.

self._aggregate_rates_to_daily_summary(start_date, end_date) -> populates daily summary with cost_model_cost_type = Infrastructure & Supplementary

self._update_usage_costs(start_date, end_date) -> Deletes Infrastructure & Supplementary,
then reinserts Infrastructure and Supplementary.

Essentially we are doing twice the work at the moment with zero gain and no way to confirm functionality outside of the validate_ script that you wrote. Not that I don't trust your validate script, I would just like the ability to use integrated tests instead.

What we can do is add an unleash flag to conditionally turn on your feature pathway for integration tests:

if feature_enabled:
    self._aggregate_rates_to_daily_summary(start_date, end_date)
else:
    self._update_usage_costs(start_date, end_date)

Contributor Author

Sounds good — here's the flag proposal for the gating:

Flag name: cost-management.backend.cost_breakdown_rates_to_usage
Constant: COST_BREAKDOWN_RTU_UNLEASH_FLAG in masu/processor/__init__.py

Following the GPU flag pattern (dev_fallback=True):

  • SaaS: off by default, enabled per-schema via Unleash for integration testing
  • Dev: on by default via fallback_development_true
  • On-prem: off until promoted to MockUnleashClient.ONPREM_FLAG_DEFAULTS after validation

Gating follows your either/or suggestion — when enabled, the RTU pipeline replaces the legacy _update_usage_costs path entirely (no dual-write):

rtu_enabled = is_feature_flag_enabled_by_schema(
    self._schema, COST_BREAKDOWN_RTU_UNLEASH_FLAG, dev_fallback=True
)
if rtu_enabled:
    self._update_usage_rates_to_usage(start_date, end_date)
    self._aggregate_rates_to_daily_summary(start_date, end_date)
else:
    self._update_usage_costs(start_date, end_date)

This way existing integration tests run the legacy path by default, and we can flip the flag per-schema to exercise the RTU pipeline independently.

Also bundled in this update: the SQL refactor you suggested (comments 4+5) — rate_names CTE now JOINs cost_model_rate for default_rate and cost_type directly, eliminating 11 per-metric Jinja params and the two-call Infrastructure/Supplementary loop (single-pass). cluster_cost_per_hour remains as the sole rate Jinja param for the cte_node_cost pre-computation. DELETE is merged into the INSERT file, and all RTU SQL is under usage_rates/.

Contributor

So, I traced down the _update_usage_costs path, and I saw:

report_accessor.populate_vm_usage_costs(
                    report_type,
                    filter_dictionary(report_type_dict, metric_constants.COST_MODEL_VM_USAGE_RATES),
                    start_date,
                    end_date,
                    self._provider.uuid,
                    report_period_id,
                )

Your:

self._update_usage_rates_to_usage(start_date, end_date)
self._aggregate_rates_to_daily_summary(start_date, end_date)

does not calculate the VM rates, which is likely the source of your failing integration tests.

You can decouple the populate_vm_usage_costs and run it independently of the usage costs and that may get your integration tests passing.

Contributor Author

Exactly right — this was the root cause. Extracted _update_vm_usage_costs() as a standalone method and wired it into the RTU path after aggregation (commit ce7fcd2). The orchestration is now:

if rtu_enabled and cost_model_id:
    _update_usage_rates_to_usage()      # RTU INSERT
    _aggregate_rates_to_daily_summary() # DELETE+INSERT from RTU
    _update_vm_usage_costs()            # VM costs (decoupled)
else:
    _update_usage_costs()               # Legacy (includes VM)

Also added _cleanup_stale_rtu_costs() for the edge case where rtu_enabled=True but the cost model has been removed — deletes orphaned rows from both daily_summary and rates_to_usage.

Comment thread koku/masu/database/sql/openshift/cost_model/delete_rates_to_usage.sql Outdated
SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'cpu_core_usage_per_hour'), 'cpu', {{rate_type}},
Contributor

  1. COALESCE(rn.custom_name, 'cpu_core_usage_per_hour'): I would prefer to use rn.metric instead of a hard-coded value. Less chance of a typo making it to production in the future.
  2. Instead of cpu could we just use the metric_type field from rate table?
  3. {{rate_type}} can be replaced with cost_type from the rate field.

Utilizing the cost_type from the rate table means that we don't have to run this SQL twice, once for infrastructure and once for supplementary.

Contributor Author

All three addressed:

  1. COALESCE(rn.custom_name, rn.metric) — all 11 components now fall back to rn.metric instead of hardcoded metric names (commit e7e3c3e).
  2. metric_type — added r.metric_type to the rate_names CTE and replaced hardcoded 'cpu'/'memory'/'storage' with rn.metric_type for Components 1-3 and 7-11 (commit 45bf072). Components 4-5 (node/cluster core) keep 'cpu' because the Rate table stores 'node'/'cluster' but the aggregation SQL routes costs via metric_type IN ('cpu','memory','storage') into cost_model_*_cost columns. Component 6 (cluster_cost_per_hour) keeps the CASE WHEN distribution expression since its target column is distribution-dependent.
  3. {{rate_type}} eliminated — rn.cost_type is read directly from the Rate table via the rate_names CTE (commit 8991953). This also eliminated the two-call loop (Infra/Supplementary in a single pass).

@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch 2 times, most recently from 26777f3 to 2f2cd73 Compare April 23, 2026 00:51
@jordigilh
Contributor Author

Post-Review Polish: Audit Findings Summary

Following a comprehensive adversarial audit of the entire PR (security, correctness, performance, design quality, maintainability), these two commits address the identified findings for long-term maintainability. All changes are backward-compatible and do not change happy-path behavior.


Commit 1: [COST-7249] Fix O(n^2) rate dedup, add composite index, document epsilon

4 targeted improvements:

  1. O(n^2) → O(n) in rate_sync._classify_incoming_rates (LOW)

    • The used_names set was rebuilt from scratch on every loop iteration via a set comprehension over all rates_data. This is O(n) per iteration = O(n²) total.
    • Fix: hoist the set above the loop and grow it with .add() after each _resolve_custom_name call. This preserves the deduplication behavior (names like cpu_core_usage_per_hour-Infrastructure-2) while reducing to O(n).
    • The in-place mutation by _resolve_custom_name was the reason a naive hoist would have been unsafe — the .add() in the loop body is the key.
  2. Composite index for RatesToUsage DELETE/SELECT (LOW)

    • Both insert_usage_rates_to_usage.sql (DELETE) and aggregate_rates_to_daily_summary.sql (SELECT) filter on (usage_start, source_uuid, report_period_id). The previous two separate indexes required a bitmap AND.
    • Fix: replace (usage_start, source_uuid) + (report_period_id) with a single composite (usage_start, source_uuid, report_period_id). Migration 0349 applies this.
  3. Document validation epsilon (INFO)

    • Added inline SQL comment explaining that 0.000000000000001 (1e-15) matches calculated_cost's decimal_places=15 — differences below this are rounding noise.
  4. Clarify re-export pattern (INFO)

    • Added block comment above the derive_metric_type, extract_default_rate, generate_custom_name re-exports in cost_model_manager.py explaining they exist for backward compatibility with common.py and test_rate_helpers.py.
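The hoist described in item 1 can be sketched as follows (simplified stand-ins for _classify_incoming_rates and _resolve_custom_name; the real signatures and naming scheme differ):

```python
def resolve_custom_name(base, used):
    """Return base, or base-2, base-3, ... if base is already taken."""
    if base not in used:
        return base
    n = 2
    while f"{base}-{n}" in used:
        n += 1
    return f"{base}-{n}"

def classify(names):
    # O(n): used is built once and grown with .add() after each
    # resolution, instead of being rebuilt from all rates on every
    # iteration (the O(n^2) shape this commit removes).
    used = set()
    resolved = []
    for name in names:
        unique = resolve_custom_name(name, used)
        used.add(unique)  # this .add() is what makes the hoist safe
        resolved.append(unique)
    return resolved

assert classify(["cpu-Infra", "cpu-Infra", "cpu-Infra"]) == [
    "cpu-Infra", "cpu-Infra-2", "cpu-Infra-3"
]
```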

Commit 2: [COST-7249] Replace inspect.getsource ordering tests with mock-based assertions

Test refactor for robustness:

  • TestOrchestrationOrder previously used inspect.getsource() + str.find() to verify method ordering in the source text. This is fragile and breaks on any refactor.
  • In the either/or execution model, _aggregate_rates_to_daily_summary and _update_usage_costs never run in the same invocation, making the original "agg before usage" test conceptually moot.
  • TC-R20-01 now asserts the full RTU-enabled call ordering: rtu → agg → markup → monthly → dist, and verifies _update_usage_costs is NOT called. Uses the existing _make_orchestration_patches decorator with side_effect lambdas and a call_order list.
  • TC-R20-02 asserts agg < markup as a proxy for agg-before-tags, since tag methods run strictly after _update_markup_cost / _update_monthly_cost in the source. This avoids extending the mock stack to include tag methods.
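A minimal version of that mock-based ordering assertion might look like this (the class here is a stand-in for OCPCostModelCostUpdater, with only three of the patched methods shown):

```python
from unittest import mock

class Updater:
    """Stand-in for OCPCostModelCostUpdater; only ordering is under test."""

    def _update_usage_rates_to_usage(self):
        pass

    def _aggregate_rates_to_daily_summary(self):
        pass

    def _update_markup_cost(self):
        pass

    def update_summary_cost_model_costs(self):
        self._update_usage_rates_to_usage()
        self._aggregate_rates_to_daily_summary()
        self._update_markup_cost()

call_order = []
with mock.patch.object(Updater, "_update_usage_rates_to_usage",
                       side_effect=lambda: call_order.append("rtu")), \
     mock.patch.object(Updater, "_aggregate_rates_to_daily_summary",
                       side_effect=lambda: call_order.append("agg")), \
     mock.patch.object(Updater, "_update_markup_cost",
                       side_effect=lambda: call_order.append("markup")):
    Updater().update_summary_cost_model_costs()

# Robust to refactors: asserts the call order, not the source text.
assert call_order == ["rtu", "agg", "markup"]
assert call_order.index("agg") < call_order.index("markup")
```

Unlike inspect.getsource() + str.find(), this survives renames of local variables, comment changes, and method extraction, and fails only when the actual call ordering changes.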

What's NOT changed (and why)

  • Migration reversibility: All Koku partitioned-table migrations are forward-only. Adding a reverse RunPython that drops the table would risk data loss in production. This is consistent with existing patterns.
  • No new feature flag behavior: These commits do not change any runtime behavior when the flag is enabled or disabled.

Contributor Author

@jordigilh jordigilh left a comment


/retest

Contributor

@myersCody myersCody left a comment


There are some failing regressions with your integration tests. Setting dev_fallback=True means the unleash flag is turned on when the integration tests run. Therefore, your changes are what is causing those test failures.

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'cpu_core_usage_per_hour'), 'cpu', rn.cost_type,
Contributor

Suggested change
COALESCE(rn.custom_name, 'cpu_core_usage_per_hour'), 'cpu', rn.cost_type,
COALESCE(rn.custom_name, rn.metric), 'cpu', rn.cost_type,

Contributor Author

Applied all 11 — every COALESCE(rn.custom_name, '<hardcoded>') is now COALESCE(rn.custom_name, rn.metric) (commit e7e3c3e).

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'cpu_core_request_per_hour'), 'cpu', rn.cost_type,
Contributor

Suggested change
COALESCE(rn.custom_name, 'cpu_core_request_per_hour'), 'cpu', rn.cost_type,
COALESCE(rn.custom_name, rn.metric), 'cpu', rn.cost_type,

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'cpu_core_effective_usage_per_hour'), 'cpu', rn.cost_type,
Contributor

Suggested change
COALESCE(rn.custom_name, 'cpu_core_effective_usage_per_hour'), 'cpu', rn.cost_type,
COALESCE(rn.custom_name, rn.metric), 'cpu', rn.cost_type,

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'node_core_cost_per_hour'), 'cpu', rn.cost_type,
Contributor

Suggested change
COALESCE(rn.custom_name, 'node_core_cost_per_hour'), 'cpu', rn.cost_type,
COALESCE(rn.custom_name, rn.metric), 'cpu', rn.cost_type,

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'cluster_core_cost_per_hour'), 'cpu', rn.cost_type,
Contributor

Suggested change
COALESCE(rn.custom_name, 'cluster_core_cost_per_hour'), 'cpu', rn.cost_type,
COALESCE(rn.custom_name, rn.metric), 'cpu', rn.cost_type,

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'memory_gb_effective_usage_per_hour'), 'memory', rn.cost_type,
Contributor

Suggested change
COALESCE(rn.custom_name, 'memory_gb_effective_usage_per_hour'), 'memory', rn.cost_type,
COALESCE(rn.custom_name, rn.metric), 'memory', rn.cost_type,

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'storage_gb_usage_per_month'), 'storage', rn.cost_type,
Contributor

Suggested change
COALESCE(rn.custom_name, 'storage_gb_usage_per_month'), 'storage', rn.cost_type,
COALESCE(rn.custom_name, rn.metric), 'storage', rn.cost_type,

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'storage_gb_request_per_month'), 'storage', rn.cost_type,
Contributor

Suggested change:
- COALESCE(rn.custom_name, 'storage_gb_request_per_month'), 'storage', rn.cost_type,
+ COALESCE(rn.custom_name, rn.metric), 'storage', rn.cost_type,

lids.cost_category_id
),

rate_names AS (
Contributor

Eventually there could be more than one price list per cost model. We need to choose the price list here that is the highest priority for the start & end date passed in.

Contributor Author

Acknowledged — the current rate_names CTE joins through price_list_cost_model_map without priority filtering, which works today because there's only one price list per cost model. When COST-575 lands the multi-price-list lifecycle support, this CTE will need to filter by pcm.priority and pl.effective_start_date <= start_date AND (pl.effective_end_date IS NULL OR pl.effective_end_date >= end_date). Tracking as a follow-up for when that schema is available.

Contributor

That schema is already available. The only thing stopping 575 from landing right now is the UI work and final verification, so I think it would be better to adjust the SQL now to handle more than one price list.

Contributor Author

Addressed in commit e4c7add — added an effective_price_list CTE that resolves the active price list by effective_start_date/effective_end_date date range and priority, then scopes rate_names to that single price list. This handles the multi-price-list scenario from 575.
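The resolution rule described above (filter by effective date range, then pick the highest-priority price list) can be sketched against a toy schema. Everything here is illustrative: the table layout, the assumption that a lower `priority` number wins, and the sample data are not koku's actual migration.

```python
import sqlite3

# Toy schema standing in for the price_list / cost-model mapping tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE price_list (
    id INTEGER, cost_model_id INTEGER, priority INTEGER,
    effective_start_date TEXT, effective_end_date TEXT
);
INSERT INTO price_list VALUES
    (1, 10, 2, '2026-01-01', NULL),          -- open-ended, lower priority
    (2, 10, 1, '2026-03-01', '2026-12-31'),  -- active, highest priority
    (3, 10, 1, '2025-01-01', '2025-12-31');  -- expired
""")

# Mirrors the filter from the review comment:
#   effective_start_date <= start_date AND
#   (effective_end_date IS NULL OR effective_end_date >= end_date)
row = con.execute("""
    SELECT id FROM price_list
    WHERE cost_model_id = ?
      AND effective_start_date <= ?
      AND (effective_end_date IS NULL OR effective_end_date >= ?)
    ORDER BY priority ASC, effective_start_date DESC
    LIMIT 1
""", (10, "2026-04-01", "2026-04-30")).fetchone()
print(row[0])  # 2: the expired list is excluded, priority breaks the tie
```

The `LIMIT 1` is what guarantees `rate_names` is scoped to a single price list even after multi-price-list support lands.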

SELECT uuid_generate_v4(), {{cost_model_id}}, {{report_period_id}}, {{source_uuid}},
b.usage_start, b.usage_start, b.node, b.namespace, b.cluster_id, b.cluster_alias,
b.data_source, b.persistentvolumeclaim, b.pod_labels, b.volume_labels, b.all_labels, b.label_hash,
COALESCE(rn.custom_name, 'cpu_core_usage_per_hour'), 'cpu', rn.cost_type,
Contributor

We added a metric_type column to Rate table. Why are we hardcoding cpu, memory or storage for all of the select statement in order to insert the metric_type? I find that a bit confusing.

Contributor Author

Fixed in commit 45bf072 — added r.metric_type to the rate_names CTE and replaced hardcoded values with rn.metric_type for 8 of the 11 components (1-3, 7-11). The remaining 3 components (4, 5, 6) keep explicit values because the Rate table stores 'node'/'cluster' for those metrics, but the aggregation SQL filters on metric_type IN ('cpu', 'memory', 'storage') to route costs into the daily summary cost_model_*_cost columns. Using rn.metric_type there would silently drop those rows from aggregation.
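The routing concern can be shown with a toy filter. This is illustrative only (hypothetical rows, not the actual aggregation SQL): it demonstrates why using the Rate table's `metric_type` verbatim for components 4-6 would drop those rows.

```python
# The aggregation step only routes these metric types into the
# daily-summary cost_model_*_cost columns.
AGG_METRIC_TYPES = {"cpu", "memory", "storage"}

# Hypothetical per-rate rows; for components 4-6 the Rate table
# stores 'node'/'cluster' rather than 'cpu'.
rows = [
    {"metric": "cpu_core_usage_per_hour", "metric_type": "cpu"},
    {"metric": "node_core_cost_per_hour", "metric_type": "node"},
    {"metric": "cluster_core_cost_per_hour", "metric_type": "cluster"},
]

kept = [r for r in rows if r["metric_type"] in AGG_METRIC_TYPES]
dropped = [r["metric"] for r in rows if r["metric_type"] not in AGG_METRIC_TYPES]
print(dropped)  # the node/cluster rows would silently miss aggregation
```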

@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch from efd3aae to 058ed69 Compare April 24, 2026 19:15
@jordigilh (Contributor Author)

/retest

1 similar comment
@jordigilh (Contributor Author)

/retest

jordigilh added a commit to jordigilh/koku that referenced this pull request Apr 27, 2026
Add r.metric_type to the rate_names CTE and replace hardcoded
'cpu'/'memory'/'storage' with rn.metric_type in Components 1-3, 7-11.

Components 4-5 (node_core_cost_per_hour, cluster_core_cost_per_hour) keep
'cpu' because the Rate table stores 'node'/'cluster' but the aggregation
SQL routes costs via metric_type IN ('cpu','memory','storage') into the
daily summary cost_model_*_cost columns.

Component 6 (cluster_cost_per_hour) keeps the CASE WHEN distribution
expression since its metric_type is distribution-dependent.

Addresses TL review feedback on PR project-koku#6017.

Made-with: Cursor
@jordigilh (Contributor Author)

Note on dev_fallback=False

Changed the Unleash check from dev_fallback=True to dev_fallback=False (commit 6388df8).

Why: dev_fallback=True meant the RTU pipeline activated in any environment without an Unleash server — CI, local dev, on-prem — which caused integration test failures because the tests did not set up rates_to_usage data. With dev_fallback=False, the RTU path only activates when the flag is explicitly enabled in Unleash, which is the correct behavior for a gated rollout. SaaS environments with Unleash are unaffected.
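A minimal sketch of the fallback semantics described above. This simplifies the real helper heavily (the client handling and signature are assumptions, not koku's implementation): the point is only that `dev_fallback` is the answer used when no Unleash server is reachable.

```python
def is_feature_flag_enabled_by_schema(schema, flag, dev_fallback=False, client=None):
    """Sketch: return the flag state for a tenant schema.

    When no Unleash client is configured (CI, local dev, on-prem),
    fall back to dev_fallback instead of querying the server.
    """
    if client is None:
        return dev_fallback
    return client.is_enabled(flag, context={"schema": schema})

# With dev_fallback=False and no Unleash server, the RTU path stays off:
flag = "cost-management.backend.cost_breakdown_rates_to_usage"
print(is_feature_flag_enabled_by_schema("org1234567", flag))  # False
```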

@pedrolp85 (Contributor)

/retest

@esebesto esebesto added full-run-smoke-tests pr_check will run all smoke tests. Used for large or wider reaching changes. and removed ocp-smoke-tests pr_check will run ocp + ocp on cloud smoke tests, used when changes affect ocp. labels Apr 27, 2026
@esebesto esebesto added the hot-fix-smoke-tests pr_check label to run minimal smoke tests for fast moving bug-fix label Apr 28, 2026
@esebesto (Contributor)

Tested with full smokes with just one failure, unrelated to the PR -> we can switch to hot-fix-smoke tests for the rebase run (in case the PR itself is updated, we will need a different label).
ERROR tests/rest_api/v1/test__cost_model.py::test_api_cost_model_ocp_on_azure_markup_calculation - LookupError: No unused azure storage account has been found.

@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch from e8fc443 to e4c7add Compare April 28, 2026 13:25
@myersCody myersCody added smoke-tests pr_check will run minimal required smokes. Used when changes hit multiple providers. ocp-smoke-tests pr_check will run ocp + ocp on cloud smoke tests, used when changes affect ocp. full-run-smoke-tests pr_check will run all smoke tests. Used for large or wider reaching changes. and removed hot-fix-smoke-tests pr_check label to run minimal smoke tests for fast moving bug-fix smoke-tests pr_check will run minimal required smokes. Used when changes hit multiple providers. full-run-smoke-tests pr_check will run all smoke tests. Used for large or wider reaching changes. ocp-smoke-tests pr_check will run ocp + ocp on cloud smoke tests, used when changes affect ocp. labels Apr 28, 2026
)

rtu_enabled = is_feature_flag_enabled_by_schema(
self._schema, COST_BREAKDOWN_RTU_UNLEASH_FLAG, dev_fallback=False
Contributor

So, I chatted with QE during our boardwalk today and the preference is to set dev_fallback=True. This means all smoke test jobs moving forward will be using your flow.

We have stage running the old flow for any regressions.

Contributor Author

Done — flipped dev_fallback=True so all smoke tests use the RTU flow going forward.

]

operations = [
migrations.RemoveIndex(
Contributor

These indexes are added in the previous migration; we could just put the changes we want in the original migration 0348.

Contributor Author

Good call — squashed the composite index directly into 0348 and removed 0349.

@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch 3 times, most recently from 9d33130 to 4ba9477 Compare April 28, 2026 17:26
@jordigilh (Contributor Author)

koku-ci timed out after ~6h during the data ingestion setup phase — the actual smoke tests never ran (still at 0% progress). The last activity was polling test_cost_gcp_source_no_markup waiting for GCP report file processing. Zero test failures in the log.

This is a resource/time constraint of full-run-smoke-tests (10,161 tests) in the ephemeral environment, not related to PR changes.

/retest

@esebesto (Contributor)

esebesto commented Apr 29, 2026

> koku-ci timed out after ~6h during the data ingestion setup phase — the actual smoke tests never ran (still at 0% progress). The last activity was polling test_cost_gcp_source_no_markup waiting for GCP report file processing. Zero test failures in the log.
>
> This is a resource/time constraint of full-run-smoke-tests (10,161 tests) in the ephemeral environment, not related to PR changes.

@jordigilh The PR check koku-ci-hrlz6 was actually not aborted; note that the logs in the Konflux UI are truncated (so it is definitely confusing).

I downloaded the results of that pr-check and see the following failures directly related to you PR:

FAILED tests/rest_api/v1/test__i_source.py::test_api_ocp_on_aws_source_raw_calc - AssertionError: Raw calculated value does not match report value, supplied=19.839636879244534, expected=14.356314616774196 and dynamic_tolerance=0.014356314616774196 absolute_tolerance=1e-08
assert (5.483322262470338 <= 0.014356314616774196 or 5.483322262470338 <= 1e-08)
FAILED tests/rest_api/v1/test__i_source.py::test_api_ocp_on_aws_source_raw_calc_multi_cluster - AssertionError: Raw calculated value does not match report value, supplied=138843.39823139575, expected=120784.39198605249 and dynamic_tolerance=0.1 absolute_tolerance=1e-08
assert (18059.006245343262 <= 0.1 or 18059.006245343262 <= 1e-08)
FAILED tests/rest_api/v1/test__i_source.py::test_api_ocp_on_azure_source_raw_calc - AssertionError: Raw calculated value does not match report value, supplied=245.26385915438297, expected=195.93512241792 and dynamic_tolerance=0.1 absolute_tolerance=1e-08
assert (49.32873673646296 <= 0.1 or 49.32873673646296 <= 1e-08)
FAILED tests/rest_api/v1/test__i_source.py::test_api_ocp_on_azure_source_raw_calc_multi_cluster - AssertionError: Supplementary cost in OCP cost path unmatches calculated for date='2026-03-01' and project_name='Platform unallocated', supplied=5378.55709132931, expected=2689.2785464262406 and dynamic_tolerance=0.1 absolute_tolerance=1e-08
assert (2689.27854490307 <= 0.1 or 2689.27854490307 <= 1e-08)
FAILED tests/rest_api/v1/test__i_source.py::test_api_ocp_on_gcp_source_raw_calc - AssertionError: kikis ocp worker unallocated distributed cost mismatch, supplied=202.61313590940347, expected=116.2304124744 and dynamic_tolerance=0.1 absolute_tolerance=1e-08
assert (86.38272343500347 <= 0.1 or 86.38272343500347 <= 1e-08)

These failures indicate that the cost calculations are not as expected.

@esebesto esebesto removed the full-run-smoke-tests pr_check will run all smoke tests. Used for large or wider reaching changes. label Apr 29, 2026
@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch from 4ba9477 to da15ea8 Compare April 30, 2026 17:24
jordigilh added a commit to jordigilh/koku that referenced this pull request May 1, 2026
Add r.metric_type to the rate_names CTE and replace hardcoded
'cpu'/'memory'/'storage' with rn.metric_type in Components 1-3, 7-11.

Components 4-5 (node_core_cost_per_hour, cluster_core_cost_per_hour) keep
'cpu' because the Rate table stores 'node'/'cluster' but the aggregation
SQL routes costs via metric_type IN ('cpu','memory','storage') into the
daily summary cost_model_*_cost columns.

Component 6 (cluster_cost_per_hour) keeps the CASE WHEN distribution
expression since its metric_type is distribution-dependent.

Addresses TL review feedback on PR project-koku#6017.

Made-with: Cursor
@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch 2 times, most recently from 2256f40 to acd4e74 Compare May 1, 2026 01:13
…, and tests

Introduces the RatesToUsage table and pipeline for cost model calculations:

- RatesToUsage model and migration (M4) with composite index
- SQL pipeline: insert per-rate rows, aggregate to daily summary
- Orchestration: feature-flagged RTU path in OCPCostModelCostUpdater
- Unleash flag: cost-management.backend.cost_breakdown_rates_to_usage
- Rate sync logic extracted to rate_sync.py with O(n^2) dedup fix
- Custom name generation consolidated into _resolve_custom_name
- Full test suite: unit, integration, behavioral, and e2e
- Legacy-path tests updated to disable RTU flag

Made-with: Cursor
@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch from acd4e74 to 7f76f7d Compare May 1, 2026 01:24
…nd orchestration

Root cause: label_hash collision in RTU aggregate SQL caused join expansion
and inflated costs (~38% higher). The md5 hash concatenated pod_labels,
volume_labels, and all_labels without delimiters, so rows differing only
in which label field was NULL vs '{}' produced identical hashes.

Fix: add '|' delimiter between label fields in md5() across
insert_usage_rates_to_usage.sql, aggregate_rates_to_daily_summary.sql,
and validate_rates_against_daily_summary.sql.

Additional fixes:
- Tighten base CTE filter to exclude rows with existing cost_model_rate_type
- Fix distribution lookup: read from cost_model table in SQL
- Read metric_type and cost_type from Rate table directly
- Extract _update_vm_usage_costs and call after RTU aggregation
- Factor cluster_cost_per_hour into distribution-aware component
- Enable dev_fallback=True for RTU pipeline in dev/test

Verified: both IQE raw_calc tests pass with dev_fallback=True.
Co-authored-by: Cursor <cursoragent@cursor.com>
@jordigilh jordigilh force-pushed the COST-7249/phase2-rate-isolation branch from 0bec8fb to 137f77c Compare May 1, 2026 17:20
@jordigilh jordigilh added the full-run-smoke-tests pr_check will run all smoke tests. Used for large or wider reaching changes. label May 1, 2026
@jordigilh (Contributor Author)

RCA: RTU Pipeline Cost Inflation with dev_fallback=True

Symptom

When dev_fallback=True enabled the RTU pipeline, two IQE integration tests failed with ~38% cost inflation:

  • test_api_ocp_on_aws_source_raw_calc
  • test_api_ocp_on_aws_source_raw_calc_multi_cluster

API-returned costs were consistently higher than expected (e.g., supplied=66.0 vs expected=44.0 for CPU).

Root Cause: label_hash collision in aggregation JOIN

The aggregate_rates_to_daily_summary.sql and insert_usage_rates_to_usage.sql files use an md5 hash over label fields to group rows:

-- BEFORE (buggy):
md5(COALESCE(lids.pod_labels::text, '')
    || COALESCE(lids.volume_labels::text, '')
    || COALESCE(lids.all_labels::text, '')) AS label_hash

This hash lacked delimiters between concatenated fields, causing collisions when field boundaries shifted. For example:

| pod_labels | volume_labels | all_labels | Concatenation | Hash |
| --- | --- | --- | --- | --- |
| NULL → '' | '{}' | NULL → '' | '{}' | abc123 |
| NULL → '' | NULL → '' | '{}' | '{}' | abc123 ← collision |

Two semantically distinct rows produced the same label_hash. During the LEFT JOIN in aggregation, a single rates_to_usage row matched multiple base rows, inflating costs by 1.5x.
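The collision is easy to reproduce outside SQL. This Python sketch mirrors the md5-over-concatenation logic; the `label_hash` helper and sample rows are illustrative, not koku code.

```python
import hashlib

def label_hash(pod, vol, all_, delim=""):
    # Equivalent of COALESCE(x::text, ''): NULL becomes an empty string.
    parts = [p if p is not None else "" for p in (pod, vol, all_)]
    return hashlib.md5(delim.join(parts).encode()).hexdigest()

# Two semantically distinct rows: which field holds '{}' differs.
row_a = (None, "{}", None)   # pod NULL, volume '{}', all NULL
row_b = (None, None, "{}")   # pod NULL, volume NULL, all '{}'

# Without a delimiter both concatenate to '{}' and collide:
assert label_hash(*row_a) == label_hash(*row_b)

# With '|' between fields the boundaries survive, so the hashes differ:
assert label_hash(*row_a, delim="|") != label_hash(*row_b, delim="|")
```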

Fix

Added | delimiter between label fields to prevent boundary collisions:

-- AFTER (fixed):
md5(COALESCE(lids.pod_labels::text, '')
    || '|' || COALESCE(lids.volume_labels::text, '')
    || '|' || COALESCE(lids.all_labels::text, '')) AS label_hash

Applied consistently across:

  • insert_usage_rates_to_usage.sql
  • aggregate_rates_to_daily_summary.sql
  • validate_rates_against_daily_summary.sql

Verification

Both IQE tests pass with dev_fallback=True:

========= 2 passed, 11369 deselected, 16 warnings in 475.70s (0:07:55) =========

Added full-run-smoke-tests label for full CI validation.

@jordigilh (Contributor Author)

/retest
