-
Notifications
You must be signed in to change notification settings - Fork 735
Brute force KNN vector search pushdown #29621
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
🟢 |
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
⚪ Ya make output | Test bloat | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements brute force vector search pushdown to datashards by adding a new optimization rule KqpApplyVectorTopKToReadTable. This enables efficient ORDER BY Knn::*Distance/Similarity(...) LIMIT N queries without requiring a vector index.
Key Changes:
- New optimization rule pushes vector top-K operations down to datashard read operations
- Auto-detection of vector type and dimension from target vectors when no index is present
- Comprehensive test coverage for various distance metrics and edge cases
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| ydb/core/kqp/opt/physical/kqp_opt_phy_limit.cpp | Implements KqpApplyVectorTopKToReadTable optimization rule with pattern matching for Knn distance functions |
| ydb/core/kqp/query_compiler/kqp_query_compiler.cpp | Adds helper functions to populate VectorTopK protobuf settings from query AST |
| ydb/core/base/kmeans_clusters.cpp | Implements CreateClustersAutoDetect for auto-detecting vector type/dimension from target vectors |
| ydb/core/tx/datashard/datashard__read_iterator.cpp | Handles VectorTopK settings in datashard read operations with auto-detection support |
| ydb/core/kqp/runtime/kqp_read_actor.cpp | Propagates VectorTopK settings from source to datashard read requests |
| ydb/core/kqp/executer_actor/kqp_tasks_graph.cpp | Extracts and sets VectorTopK parameters for scan tasks |
| ydb/core/kqp/opt/kqp_opt_build_txs.cpp | Handles precompute nodes in stage program bodies for VectorTopK settings |
| ydb/core/kqp/common/kqp_yql.h/.cpp | Adds VectorTopK settings to TKqpReadTableSettings structure |
| ydb/core/protos/*.proto | Adds VectorTopK message fields to support pushdown configuration |
| ydb/core/kqp/ut/knn/* | Comprehensive test suite for vector search pushdown across various metrics and configurations |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…opK helper functions
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
⚪ Ya make output | Test bloat | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
⚪ Ya make output | Test bloat | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
| targetProto->MutableParamValue()->SetParamName(expr.Cast<TCoParameter>().Name().StringValue()); | ||
| } else if (auto maybeBinding = expr.Maybe<TKqpTxResultBinding>()) { | ||
| // TKqpTxResultBinding should have been replaced with TCoParameter by kqp_opt_build_txs, | ||
| // but handle it defensively by constructing the expected parameter name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
вот это место немного пугает, откуда здесь берётся TKqpTxResultBinding ? в аналогичном pushdown-е поиска по индексу его тут не было
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| argsMap.emplace(inputArg.Raw(), makeParameterBinding(maybeBinding.Cast(), input.Pos()).Ptr()); | ||
| } | ||
|
|
||
| // Also scan the program body for TKqpTxResultBinding (for VectorTopK precompute settings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
и вот тут, почему тут понадобилась отдельная обработка? вроде бы это не нужно было в pushdown поиска по индексу. может быть можно общую реализацию использовать, чтобы не копипастить обработку TxResultBinding-ов?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| } | ||
| } | ||
|
|
||
| // Also scan the program body for precomputes in read settings (for VectorTopK pushdown) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
И вот тут - там где-то выше был поиск этих precompute, который просто фильтровал ноды, может быть можно просто добавить туда фильтр по типу нод чтобы там тоже зацепило precompute нам нужных параметров
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| checkDistance(results[0].second, 0.000882f); | ||
| checkDistance(results[1].second, 0.000985f); | ||
| checkDistance(results[2].second, 0.001070f); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
тут можно ещё добавить кейс где target сам выбирается из другой таблицы
т.к. там выше есть 3 кейса которые как я понял аналогичны тому что у меня в kqp_opt_log_indexes были, а там 3 кейса
- литерал
- переменная равная вызову функции
- переменная равная подзапросу
…DqPhyPrecompute and TKqpTxResultBinding collection
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
Changelog entry
Add optimization rule KqpApplyVectorTopKToReadTable that pushes down
ORDER BY Knn::*Distance/Similarity(...) LIMIT N queries to datashards,
enabling brute force vector search without requiring a vector index.
Changelog category
Description for reviewers
...