
CBO Phase 2: filter selectivity + join reordering #79

Merged
poyrazK merged 10 commits into main from feature/cbo-phase2 on May 7, 2026

Conversation

@poyrazK
Owner

@poyrazK poyrazK commented May 7, 2026

Summary

  • Filter selectivity: execute_select() now estimates the rows remaining after the WHERE clause using RowEstimator::estimate_filter_rows() (NDV-based selectivity for equality, range-based for int, string-length-based for LIKE prefixes) and compares that estimate against kVectorizedRowThreshold (10k rows)
  • Join reordering: build_vectorized_plan() estimates both A⋈B and B⋈A join cardinalities via RowEstimator::estimate_join_rows() — when the reverse order is smaller, key expressions are swapped so the smaller table feeds the build-side hash table
  • current_est_rows tracked across the join chain for multi-join queries
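To make the selectivity strategies above concrete, here is a minimal sketch of NDV-based equality and range-based integer estimation. It is not the RowEstimator implementation: `ColumnStats`, `sketch_eq_rows`, and `sketch_gt_rows` are hypothetical names, and the uniform-distribution assumption is the textbook one rather than something confirmed by this PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-column statistics; the real TableInfo layout may differ.
struct ColumnStats {
    uint64_t num_rows;   // table row count recorded by ANALYZE
    uint64_t ndv;        // number of distinct values in the column
    int64_t min_value;   // smallest value seen (integer columns)
    int64_t max_value;   // largest value seen (integer columns)
};

// Equality (col = c): assume a uniform distribution, so selectivity is 1 / NDV.
inline uint64_t sketch_eq_rows(const ColumnStats& s) {
    return s.ndv == 0 ? 0 : s.num_rows / s.ndv;
}

// Range (col > c): fraction of the integer interval [min, max] that lies above c.
inline uint64_t sketch_gt_rows(const ColumnStats& s, int64_t c) {
    assert(s.min_value <= s.max_value);
    if (c >= s.max_value) return 0;          // nothing can match
    if (c < s.min_value) return s.num_rows;  // everything matches
    const double width =
        static_cast<double>(s.max_value) - static_cast<double>(s.min_value) + 1.0;
    const double above = static_cast<double>(s.max_value) - static_cast<double>(c);
    return static_cast<uint64_t>(static_cast<double>(s.num_rows) * above / width);
}
```

With 15,000 rows whose ids run from 0 to 14999, `sketch_gt_rows(stats, 9999)` yields 5000, which is the kind of estimate that would then be compared against the threshold.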

Tests

  • AnalyzeFilterSelectivity — verifies that a SELECT with a WHERE column > constant predicate uses the Vectorized path after ANALYZE (15k rows, filter estimated at ~5k, threshold 10k)
  • AnalyzeJoinOrder — creates big (10k) and small (100) tables, ANALYZEs both, verifies join succeeds using Vectorized path

Verification

  • 37/37 tests pass
  • Build: clean (1 self-assign warning in vectorized_operator.hpp, unrelated)

Summary by CodeRabbit

  • Improvements
    • Optimized query execution with smarter filtering estimation for WHERE clauses.
    • Enhanced join operation ordering to improve query performance on multi-table queries.
    • Added validation tests to ensure query planning and execution work correctly with table statistics.

poyrazK added 3 commits May 7, 2026 14:46
When a WHERE clause is present with a simple column OP constant
predicate (e.g., id > 5000), use estimate_filter_rows to get a more
accurate row estimate before comparing against kVectorizedRowThreshold.
Falls back to estimate_scan_rows when no filter is present or the
filter is too complex to analyze.

Add AnalyzeFilterSelectivity test to verify SELECT with WHERE clause
works after ANALYZE and the chooser path is exercised.

Use RowEstimator::estimate_join_rows() to compare forward (A⋈B) vs
reverse (B⋈A) join cardinalities. When the reverse order is estimated
to produce fewer rows, swap the key expressions so the smaller side
feeds as the build-side hash table. This reduces hash table size and
probe cost without changing the operator tree topology.

Track current_est_rows across the join chain so subsequent joins in
a multi-join query also benefit from the heuristic.
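The chain tracking described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code in build_vectorized_plan: `JoinInput`, `JoinStep`, `sketch_join_rows`, and `sketch_join_chain` are hypothetical names, the cardinality formula is the textbook |L| * |R| / max(ndv_L, ndv_R) rather than whatever estimate_join_rows computes, and carrying the smaller NDV forward is a simplification.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical summary of one join input: row count and distinct join-key values.
struct JoinInput {
    uint64_t rows;
    uint64_t key_ndv;
};

struct JoinStep {
    bool build_on_new_table;  // true when the newly joined table is the smaller side
    uint64_t est_rows;        // estimated rows flowing out of this join
};

// Textbook equi-join estimate: |L| * |R| / max(ndv_L, ndv_R).
inline uint64_t sketch_join_rows(const JoinInput& l, const JoinInput& r) {
    const uint64_t d = std::max<uint64_t>(std::max(l.key_ndv, r.key_ndv), 1);
    return static_cast<uint64_t>(static_cast<double>(l.rows) *
                                 static_cast<double>(r.rows) / static_cast<double>(d));
}

// Walk a left-deep join chain, carrying the running estimate into each later join.
inline std::vector<JoinStep> sketch_join_chain(JoinInput current,
                                               const std::vector<JoinInput>& joins) {
    std::vector<JoinStep> steps;
    for (const JoinInput& next : joins) {
        const uint64_t est = sketch_join_rows(current, next);
        steps.push_back({next.rows < current.rows, est});
        // The join result becomes the left input of the next join.
        current.key_ndv = std::min(current.key_ndv, next.key_ndv);
        current.rows = est;
    }
    return steps;
}
```

For a 10,000-row table joined to a 100-row table on a unique key, the sketch estimates 100 output rows and marks the 100-row table as the preferred build side; a third table is then compared against that running estimate of 100 rather than against the original 10,000.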
Creates big (10000 rows) and small (100 rows) tables, runs ANALYZE
on both, then executes a join. Verifies the join succeeds using the
Vectorized path without error, exercising the join reordering logic.
@coderabbitai

coderabbitai Bot commented May 7, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 839486c0-ce79-4bc5-a393-014cd027c226

📥 Commits

Reviewing files that changed from the base of the PR and between 133410b and 766c703.

📒 Files selected for processing (4)
  • docs/adr/001-analyze-table-and-chooser.md
  • docs/adr/002-join-reordering-inner-only.md
  • src/executor/query_executor.cpp
  • tests/cloudSQL_tests.cpp
📝 Walkthrough


This PR refines vectorized query planning by integrating row-count estimation. The executor now selects plans based on filtered-row estimates for simple WHERE clauses and reorders equi-joins by comparing estimated cardinalities to place the smaller input on the hash-build side, with updated cardinality tracking across join chains.

Changes

Query Optimizer – Vectorized Plan Selection and Join Reordering

| Layer / File(s) | Summary |
|---|---|
| Filter Row Estimation (src/executor/query_executor.cpp) | execute_select now conditionally calls estimate_filter_rows for simple column–constant binary WHERE predicates, falling back to estimate_scan_rows when filtering is unavailable or ineligible. |
| Join Reordering & Cardinality Tracking (src/executor/query_executor.cpp) | build_vectorized_plan initializes a current_est_rows cardinality estimate and, for each join, compares forward and reverse join orders via estimate_join_rows; if reverse is smaller, join keys are swapped, and cardinality is updated for subsequent joins. |
| Execution Tests (tests/cloudSQL_tests.cpp) | Two new test cases (AnalyzeFilterSelectivity, AnalyzeJoinOrder) validate that table statistics collected by ANALYZE TABLE correctly inform filter selectivity and join-order decisions at execution time. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • poyrazK/cloudSQL#75: This PR builds directly on PR #75's ANALYZE/RowEstimator work: it uses RowEstimator::estimate_filter_rows and RowEstimator::estimate_join_rows inside QueryExecutor (execute_select/build_vectorized_plan) to refine vectorized selection and join-side swapping.

Poem

🐰 A rabbit hops through join trees fair,
Reordering sides with careless care—
Stats whisper: "build the smaller one!"
Now queries dance and run, and run!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'CBO Phase 2: filter selectivity + join reordering' directly and clearly summarizes the main changes: implementation of filter selectivity estimation and join reordering logic as part of the cost-based optimization phase 2. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/executor/query_executor.cpp`:
- Around line 418-453: The code treats estimated_rows == 0 as "no filter
estimate" but eligible is scoped inside the bin_expr block and lost, so a
legitimate zero returned by optimizer::RowEstimator::estimate_filter_rows gets
overridden; fix by introducing a boolean (e.g., filter_estimated) declared
before checking stmt.where() and set to true only when estimate_filter_rows is
called (keep existing bin_expr, col_name, pred_val logic and call
optimizer::RowEstimator::estimate_filter_rows as before), then change the
fallback to call estimate_scan_rows only when filter_estimated is false (not
when estimated_rows == 0), ensuring zero-row estimates from
RowEstimator::estimate_filter_rows are preserved.
- Around line 1602-1633: The join-reorder currently swaps only
left_key/right_key (and unused left_key_col/right_key_col) which breaks
pre-resolved column indices in VectorizedHashJoinOperator; instead, when you
decide to flip build/probe (est_reverse < est_forward), swap the operator
children as well: exchange current_root and right_scan alongside swapping
left_key and right_key so that the left child truly matches left_key and the
right child matches right_key; remove the dead std::swap of
left_key_col/right_key_col (they are unused), and ensure the
VectorizedHashJoinOperator is constructed from the possibly-swapped current_root
and right_scan so its constructor resolves column indices against the correct
schemas (if SQL-declared column order must be preserved, add a projection after
the join).

In `@tests/cloudSQL_tests.cpp`:
- Around line 1477-1601: The tests AnalyzeFilterSelectivity and AnalyzeJoinOrder
never enable the vectorized path because QueryExecutor is constructed but never
configured with parallel_ and storage_manager_, so execute_select() skips the
CBO logic; before running the SELECT/JOIN/ANALYZE queries call
QueryExecutor::set_parallel(true) and
QueryExecutor::set_storage_manager(&storage) so the vectorized gate (parallel_
&& storage_manager_) is true and build_vectorized_plan / CBO filter selectivity
and join reordering (kVectorizedRowThreshold logic) are exercised; also correct
the comment about the estimated row count to reflect the actual threshold
comparison.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: da842d5c-a7db-4897-a539-b98b8f05458d

📥 Commits

Reviewing files that changed from the base of the PR and between b6eaad9 and 133410b.

📒 Files selected for processing (2)
  • src/executor/query_executor.cpp
  • tests/cloudSQL_tests.cpp

Comment thread src/executor/query_executor.cpp Outdated
Comment thread src/executor/query_executor.cpp Outdated
Comment on lines +1602 to +1633
// Join reordering: estimate both join orders and pick the smaller-first approach.
// This heuristic puts the side estimated to produce fewer rows as the build (probe-side
// hash table) to reduce memory footprint and hash probe cost.
// Note: For multi-join chains, current_root may already be a join result — we use its
// estimated cardinality as the "left" side estimate.
std::string left_key_col = left_key ? left_key->to_string() : "";
std::string right_key_col = right_key ? right_key->to_string() : "";
uint64_t est_forward = 0;
uint64_t est_reverse = 0;
if (!left_key_col.empty() && !right_key_col.empty()) {
    // Estimate forward: current_est_rows ⋈ join_table_meta (probe = right)
    est_forward = optimizer::RowEstimator::estimate_join_rows(
        ::cloudsql::TableInfo{0, "", {}, {}, {}, current_est_rows, "", 0, 0},
        *join_table_meta, right_key_col);
    // Estimate reverse: join_table_meta ⋈ current_est_rows (probe = left)
    ::cloudsql::TableInfo left_est;
    left_est.num_rows = current_est_rows;
    est_reverse = optimizer::RowEstimator::estimate_join_rows(*join_table_meta, left_est,
                                                              left_key_col);
    // If reverse order is smaller, swap the key expressions so build/probe flip.
    // The VectorizedHashJoinOperator uses left child as build, right as probe —
    // swapping keys redirects the hash table to the smaller side.
    if (est_reverse < est_forward && est_reverse > 0) {
        // Swap left_key and right_key to redirect build to the smaller side
        auto swapped_left = std::move(right_key);
        auto swapped_right = std::move(left_key);
        left_key = std::move(swapped_left);
        right_key = std::move(swapped_right);
        // Also swap the schema-based key column names for output schema ordering
        std::swap(left_key_col, right_key_col);
    }
}

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Locate VectorizedHashJoinOperator definition and its key-evaluation logic
rg -n --type=cpp -C 8 "class VectorizedHashJoinOperator"
rg -n --type=cpp -C 5 "left_key_\|right_key_\|build_key\|probe_key" src/executor/vectorized_operator.cpp

Repository: poyrazK/cloudSQL

Length of output: 1580


🏁 Script executed:

# Find all cpp files in the executor directory
fd --type f "\.cpp$" src/executor/ | head -20

Repository: poyrazK/cloudSQL

Length of output: 147


🏁 Script executed:

# Search for VectorizedHashJoinOperator implementation in all cpp files
rg -n "VectorizedHashJoinOperator::" --type=cpp -A 15 | head -100

Repository: poyrazK/cloudSQL

Length of output: 42


🏁 Script executed:

# Look for where left_key_ is used (evaluation against children)
rg -n "left_key_" --type=cpp -B 2 -A 2 include/executor/

Repository: poyrazK/cloudSQL

Length of output: 2938


🏁 Script executed:

# Look at the context after line 1633 in query_executor.cpp to find operator construction
sed -n '1633,1680p' src/executor/query_executor.cpp | cat -n

Repository: poyrazK/cloudSQL

Length of output: 2640


🏁 Script executed:

# Search for VectorizedHashJoinOperator construction in query_executor.cpp
rg -n "VectorizedHashJoinOperator\|make_unique.*Join" src/executor/query_executor.cpp -A 3

Repository: poyrazK/cloudSQL

Length of output: 42


🏁 Script executed:

# Check if left_key_col or right_key_col are used after line 1631
sed -n '1631,1660p' src/executor/query_executor.cpp | cat -n

Repository: poyrazK/cloudSQL

Length of output: 1651


Join reordering swaps key expressions but not the operator inputs — causes column resolution failure when reorder condition fires

Prior to this block, the key assignment logic (lines 1569–1594) guarantees left_key refers to a column in current_root's schema and right_key refers to a column in right_scan's schema. The swap at lines 1626–1629 inverts only the key expressions. current_root (left child) and right_scan (right child) are not swapped.

When VectorizedHashJoinOperator is constructed (lines 1653–1654), its constructor pre-resolves column indices once:

left_key_col_idx_ = left_->output_schema().find_column(left_key_->to_string());

After the swap, left_key_ now contains the original right_scan column name, but the constructor searches for it in current_root->output_schema(). The column does not exist in that schema, causing the lookup to fail or silently select the wrong column. Similarly, right_key_ (now the original current_root column) is looked up in right_scan's schema and fails to match.

Additionally, std::swap(left_key_col, right_key_col) at line 1631 is dead code — neither variable is used after this point.

The correct fix is to swap both the operator children and the key expressions together; the output schema construction at lines 1644–1650 then rebuilds itself correctly from the swapped children:

🛠️ Proposed fix
         if (est_reverse < est_forward && est_reverse > 0) {
-            // Swap left_key and right_key to redirect build to the smaller side
-            auto swapped_left = std::move(right_key);
-            auto swapped_right = std::move(left_key);
-            left_key = std::move(swapped_left);
-            right_key = std::move(swapped_right);
-            // Also swap the schema-based key column names for output schema ordering
-            std::swap(left_key_col, right_key_col);
+            // Move the smaller estimated table to the build (left) side
+            std::swap(current_root, right_scan);
+            std::swap(left_key, right_key);
         }

Note: Swapping the children changes the visible column order for SELECT * (the new build-side table's columns appear first). If preserving SQL-declaration column order is required, add a reordering projection after the join.
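The failure mode and the proposed fix can be reproduced in a toy model. Everything below is a hypothetical stand-in (`Schema`, `JoinPlan`, `resolve_keys`, and the two swap helpers are not cloudSQL types), and returning -1 for a missing column is an assumption about how a failed lookup would surface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an operator's output schema.
struct Schema {
    std::vector<std::string> columns;
    // Returns the column index, or -1 when the name is not in this schema.
    int find_column(const std::string& name) const {
        for (std::size_t i = 0; i < columns.size(); ++i) {
            if (columns[i] == name) return static_cast<int>(i);
        }
        return -1;
    }
};

// Toy join description: two child schemas plus the key looked up in each.
struct JoinPlan {
    Schema left;
    Schema right;
    std::string left_key;
    std::string right_key;
};

struct ResolvedKeys {
    int left_idx;
    int right_idx;
};

// Resolve each key against its own child, once, as a constructor might.
inline ResolvedKeys resolve_keys(const JoinPlan& p) {
    return {p.left.find_column(p.left_key), p.right.find_column(p.right_key)};
}

// The reviewed behavior: only the key expressions trade places.
inline void swap_keys_only(JoinPlan& p) { std::swap(p.left_key, p.right_key); }

// The proposed fix: children and keys move together.
inline void swap_sides(JoinPlan& p) {
    std::swap(p.left, p.right);
    std::swap(p.left_key, p.right_key);
}
```

After `swap_keys_only`, both lookups miss because each key is searched in the other table's schema. After `swap_sides`, both keys resolve again and the smaller table sits on the left, at the cost of a changed column order in the joined output.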


Comment thread tests/cloudSQL_tests.cpp
poyrazK and others added 6 commits May 7, 2026 16:04
…comment

1. Filter selectivity: start with scan estimate as baseline, override
   only when filter is eligible. Fixes 0-ambiguity bug where empty tables
   would incorrectly fall back to full scan estimate.

2. Join reordering comment: "build (probe-side hash table)" → "build
   (hashed) side" — code was correct, comment was wrong.

3. Add NOTE comment about to_string() format stability dependency in
   column key lookup for join reordering.
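Item 1 above, the baseline-then-override pattern, might look like the following sketch. `RowEstimate` and `pick_estimate` are hypothetical names; the point is only that an explicit flag, rather than the value 0, records whether a filter estimate was produced.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: the chosen estimate plus where it came from.
struct RowEstimate {
    uint64_t rows;
    bool from_filter;  // true only when a filter estimate was actually computed
};

// Start from the scan estimate and override it only for an eligible filter.
// A legitimate filter estimate of 0 is kept instead of being treated as "missing".
inline RowEstimate pick_estimate(uint64_t scan_rows, bool filter_eligible,
                                 uint64_t filter_rows) {
    RowEstimate est{scan_rows, false};  // baseline: full scan estimate
    if (filter_eligible) {
        est = {filter_rows, true};
    }
    return est;
}
```

With this shape, `pick_estimate(15000, true, 0)` keeps the zero-row estimate, while `pick_estimate(15000, false, 0)` falls back to 15000.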
Verify the join returns exactly 100 rows (small.id 0-99 match
big.id 0-99), not just that it succeeds. Confirms correctness.

Outer joins (LEFT/RIGHT/FULL) must preserve build/probe semantics
— reordering keys could change which side is the outer side, breaking
query correctness. The reordering block is now wrapped with:

    if (exec_join_type == executor::JoinType::Inner) { ... }

Also moves join type determination before the reordering block so
the gating variable is available. Row estimate update is also scoped
to inner joins only.
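As a rough illustration of the gating described in this commit message, the swap decision can be reduced to a single predicate. `should_swap_sides` is a hypothetical helper, and only the `Inner` enumerator is taken from the commit message; the remaining enumerator names are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Inner comes from the commit message; the other enumerator names are assumed.
enum class JoinType { Inner, Left, Right, Full };

// Swap build and probe sides only for inner joins, and only when the reverse
// order has a usable (non-zero) estimate that beats the forward order.
inline bool should_swap_sides(JoinType type, uint64_t est_forward, uint64_t est_reverse) {
    if (type != JoinType::Inner) return false;  // outer joins keep their declared sides
    return est_reverse > 0 && est_reverse < est_forward;
}
```

Keeping the join-type check first means the estimates are never consulted for outer joins, so a misleading statistic cannot change which side is preserved.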
ADR 002 documents the outer join safety decision (inner joins only
for reordering) from PR #79 review. ADR 001 gets a Phase 2 Extensions
section covering filter selectivity strategies and join reordering
mechanics, with a reference to ADR 002.
Owner Author

@poyrazK poyrazK left a comment


It's okay to merge

@poyrazK poyrazK merged commit e4a1f57 into main May 7, 2026
9 checks passed