-
Notifications
You must be signed in to change notification settings - Fork 45
feat: Improve search performance #195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This will improve search result filtering performance.
Reviewer's GuideIndexes ancestor directory paths for each file and updates the searcher to enforce both filename and path-prefix constraints via a single Lucene boolean query, removing client-side path filtering and bumping the index schema version. Sequence diagram for the updated search with ancestor path constraintsequenceDiagram
actor User
participant Searcher
participant BooleanQuery
participant IndexSearcher
participant LuceneIndex
User->>Searcher: search(path, query, wildcard_query, max_results)
activate Searcher
Searcher->>BooleanQuery: create finalQuery
alt wildcard_query is true
Searcher->>BooleanQuery: add WildcardQuery(file_name_lower) MUST
else
Searcher->>BooleanQuery: add ParserQuery(file_name) MUST
end
Searcher->>Searcher: canonical = canonicalize_filename(path)
Searcher->>BooleanQuery: add TermQuery(ancestor_paths = canonical) MUST
Searcher->>IndexSearcher: search(finalQuery, max_results)
activate IndexSearcher
IndexSearcher->>LuceneIndex: execute boolean query
LuceneIndex-->>IndexSearcher: TopDocs
deactivate IndexSearcher
IndexSearcher-->>Searcher: TopDocs
loop for each hit in TopDocs
Searcher->>LuceneIndex: doc(docId)
LuceneIndex-->>Searcher: Document(full_path, file_type, ...)
Searcher->>Searcher: serialize fields into result string
end
Searcher-->>User: vector<string> results
deactivate Searcher
Class diagram for updated Searcher and indexing helpersclassDiagram
class file_record {
+string full_path
+string file_name
+string file_type
+string file_ext
+string modify_time_str
+string file_size_str
+string pinyin
+string pinyin_acronym
+bool is_hidden
}
class Document {
+add(field : FieldPtr) : void
}
class Field {
+name : wstring
+value : wstring
+STORE_NO : int
+STORE_YES : int
+INDEX_NOT_ANALYZED : int
}
class IndexerHelpers {
+add_ancestor_paths(doc : DocumentPtr, full_path : string) : void
+create_document(record : file_record) : DocumentPtr
}
class Searcher {
+reader : IndexReaderPtr
+searcher : IndexSearcherPtr
+search(path : string, query : string, wildcard_query : bool, max_results : int) : vector~string~
}
class BooleanQuery {
+add(query : QueryPtr, occur : BooleanClauseOccur) : void
}
class Queries {
+WildcardQuery(term : TermPtr)
+TermQuery(term : TermPtr)
+parse_with_analyzer(field : wstring, query_string : wstring) : QueryPtr
}
file_record --> IndexerHelpers : used by
IndexerHelpers --> Document : populates
IndexerHelpers --> Field : creates
Searcher ..> BooleanQuery : builds finalQuery
Searcher ..> Queries : creates filename and
path queries
Searcher ..> Document : reads indexed fields
Document --> Field : contains multiple
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- The path indexing in
add_ancestor_pathsassumes/as the separator; if this code ever runs on Windows-style paths (e.g.,C:\dir\file), consider normalizing paths or usingstd::filesystemto derive parent directories to avoid mismatches with the search-side canonicalization. - The new path constraint uses an exact
TermQueryonancestor_paths; this changes semantics when thepathargument points to a file rather than a directory (or is empty), so it’s worth confirming or enforcing thatpathis always a directory and handling or rejecting other inputs explicitly. - Since the filtering has moved entirely into the Lucene query, the debug log still printing
reader->numDocs()can be misleading when diagnosing path-filtered results; consider logging the final query or the number of hits instead/as well to aid troubleshooting of the new behavior.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The path indexing in `add_ancestor_paths` assumes `/` as the separator; if this code ever runs on Windows-style paths (e.g., `C:\dir\file`), consider normalizing paths or using `std::filesystem` to derive parent directories to avoid mismatches with the search-side canonicalization.
- The new path constraint uses an exact `TermQuery` on `ancestor_paths`; this changes semantics when the `path` argument points to a file rather than a directory (or is empty), so it’s worth confirming or enforcing that `path` is always a directory and handling or rejecting other inputs explicitly.
- Since the filtering has moved entirely into the Lucene query, the debug log still printing `reader->numDocs()` can be misleading when diagnosing path-filtered results; consider logging the final query or the number of hits instead/as well to aid troubleshooting of the new behavior.
## Individual Comments
### Comment 1
<location> `src/daemon/src/core/file_index_manager.cpp:124` </location>
<code_context>
+ * 3. "/home"
+ * 4. "/"
+ */
+static void add_ancestor_paths(DocumentPtr doc, const std::string &full_path)
+{
+ if (full_path.empty()) return;
</code_context>
<issue_to_address>
**issue (complexity):** Consider rewriting the ancestor-path building loop using std::filesystem::path to simplify path manipulation and clarify loop termination.
You can reduce the custom string/path handling and make the termination conditions clearer by delegating to `std::filesystem::path` and using an explicit loop condition instead of `while (true)` + multiple `break`s.
For example:
```cpp
#include <filesystem>
// namespace fs = std::filesystem; // if you use a namespace alias
static void add_ancestor_paths(DocumentPtr doc, const std::string& full_path)
{
if (full_path.empty()) return;
std::filesystem::path p(full_path);
// Start from the parent directory of the file
for (p = p.parent_path(); !p.empty(); p = p.parent_path()) {
auto s = p.generic_string(); // normalized '/' separators
// Map the root path consistently to "/"
if (p == p.root_path()) {
s = "/";
}
doc->add(newLucene<Field>(
ANCESTOR_PATHS_FIELD,
StringUtils::toUnicode(s),
Field::STORE_NO,
Field::INDEX_NOT_ANALYZED));
// Once we've added the root, stop
if (p == p.root_path()) {
break;
}
}
}
```
This keeps all existing behavior:
- Skips pure filenames without a directory component.
- Adds all ancestor directories from the immediate parent up to and including `/`.
- Avoids manual `find_last_of('/')`, `substr`, and explicit root special cases in multiple branches.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
deepin pr auto review我来对这段代码进行详细审查:
在
具体改进代码示例: // 建议将宏定义改为常量
const std::wstring ANCESTOR_PATHS_FIELD = L"ancestor_paths";
// 改进后的 add_ancestor_paths 函数
static void add_ancestor_paths(DocumentPtr doc, const std::string &full_path) {
if (full_path.empty() || full_path.length() > PATH_MAX) return;
// 规范化路径
std::string normalized_path = std::filesystem::canonical(full_path).string();
// 防止路径遍历攻击
if (normalized_path.find("..") != std::string::npos) return;
std::string current_path = normalized_path;
// 获取父目录
size_t last_slash = current_path.find_last_of('/');
if (last_slash == std::string::npos) return;
// 处理根目录情况
if (last_slash == 0) {
current_path = "/";
} else {
current_path = current_path.substr(0, last_slash);
}
// 使用 vector 预分配空间
std::vector<std::string> ancestors;
ancestors.reserve(16); // 预估合理的路径深度
while (true) {
ancestors.push_back(current_path);
if (current_path == "/" || current_path.empty()) {
break;
}
last_slash = current_path.find_last_of('/');
if (last_slash == std::string::npos) break;
current_path = (last_slash == 0) ? "/" : current_path.substr(0, last_slash);
}
// 批量添加到文档
for (const auto& path : ancestors) {
doc->add(newLucene<Field>(ANCESTOR_PATHS_FIELD,
StringUtils::toUnicode(path),
Field::STORE_NO,
Field::INDEX_NOT_ANALYZED));
}
}
// 改进后的搜索结果构建函数
static std::string build_search_result(const DocumentPtr& doc) {
std::stringstream ss;
ss << StringUtils::toUTF8(doc->get(L"full_path"))
<< "<\\>" << StringUtils::toUTF8(doc->get(L"file_type"))
<< "<\\>" << StringUtils::toUTF8(doc->get(L"file_ext"))
<< "<\\>" << StringUtils::toUTF8(doc->get(L"modify_time_str"))
<< "<\\>" << StringUtils::toUTF8(doc->get(L"file_size_str"))
<< "<\\>" << StringUtils::toUTF8(doc->get(L"pinyin"))
<< "<\\>" << StringUtils::toUTF8(doc->get(L"pinyin_acronym"))
<< "<\\>" << StringUtils::toUTF8(doc->get(L"is_hidden"));
return ss.str();
}这些改进主要关注了:
建议在实际应用中根据具体需求和环境进一步调整这些改进。 |
This will improve search result filtering performance.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: lzwind, wangrong1069 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/forcemerge |
|
This pr force merged! (status: unstable) |
26381f8
into
linuxdeepin:develop/snipe
Summary by Sourcery
Index ancestor directory paths for each file and use them for path-prefixed filtering during search, updating the index version accordingly.
New Features:
Enhancements: