feat: Improve search performance #195

wangrong1069 · 2025-12-15T08:59:20Z

Summary by Sourcery

Index ancestor directory paths for each file and use them for path-prefixed filtering during search, updating the index version accordingly.

New Features:

Store ancestor directory paths in the index to support directory-scoped file searches.

Enhancements:

Replace client-side string prefix filtering with Lucene term-based directory filtering to improve search performance and simplify result handling.

This will improve search result filtering performance.
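
To make the idea above concrete, here is a minimal standalone sketch of what "ancestor directory paths" means for one file and why an exact match on them can scope a search to a directory. It assumes absolute POSIX-style paths; `ancestor_paths` and `is_under` are hypothetical names for illustration and are not the functions added by this PR.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns every ancestor directory of a path, from the immediate parent
// up to and including "/". A bare file name yields an empty list.
std::vector<std::string> ancestor_paths(const std::string& full_path)
{
    std::vector<std::string> out;
    std::string current = full_path;
    while (true) {
        const std::size_t slash = current.find_last_of('/');
        if (slash == std::string::npos) break;   // no directory component left
        if (slash == 0) {                        // the parent is the root
            if (current != "/") out.push_back("/");
            break;
        }
        current.erase(slash);                    // drop the last component
        out.push_back(current);
    }
    return out;
}

// Directory scoping as an exact match against the ancestor list,
// which is what a term lookup on a not-analyzed field amounts to.
bool is_under(const std::string& full_path, const std::string& dir)
{
    for (const auto& ancestor : ancestor_paths(full_path)) {
        if (ancestor == dir) return true;
    }
    return false;
}
```

With this model, `/home/user/docs/a.txt` is under `/home/user`, while `/home/user2/a.txt` is not, even though the two paths share a string prefix.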

sourcery-ai · 2025-12-15T08:59:53Z

Reviewer's Guide

Indexes ancestor directory paths for each file and updates the searcher to enforce both filename and path-prefix constraints via a single Lucene boolean query, removing client-side path filtering and bumping the index schema version.
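
As a rough mental model of the combined query, two MUST clauses behave like an intersection of the document sets matched by each clause. The sketch below is a toy inverted index in plain C++, not the Lucene++ API; `TermIndex`, `postings`, `must_both`, and `sample_index` are all made-up names, and the sample data is invented for illustration.

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <string>
#include <vector>

using DocId = int;
using PostingList = std::vector<DocId>;                 // sorted ascending
using TermIndex = std::map<std::string, PostingList>;   // "field:value" -> docs

// Looks up the documents that contain an exact term; empty if none.
PostingList postings(const TermIndex& index, const std::string& term)
{
    const auto it = index.find(term);
    return it == index.end() ? PostingList{} : it->second;
}

// Two MUST clauses: a document is a hit only if both clauses match it.
PostingList must_both(const PostingList& a, const PostingList& b)
{
    PostingList out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// Invented sample data: three files named "report", two of them under /home/user.
TermIndex sample_index()
{
    return {
        {"file_name:report", {1, 2, 3}},
        {"ancestor_paths:/home/user", {2, 3, 5}},
        {"ancestor_paths:/home", {1, 2, 3, 5}},
    };
}
```

In this toy model, combining `file_name:report` with `ancestor_paths:/home/user` yields documents 2 and 3, so no per-hit path comparison is needed after the search returns.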

Sequence diagram for the updated search with ancestor path constraint

sequenceDiagram
    actor User
    participant Searcher
    participant BooleanQuery
    participant IndexSearcher
    participant LuceneIndex

    User->>Searcher: search(path, query, wildcard_query, max_results)
    activate Searcher

    Searcher->>BooleanQuery: create finalQuery

    alt wildcard_query is true
        Searcher->>BooleanQuery: add WildcardQuery(file_name_lower) MUST
    else
        Searcher->>BooleanQuery: add ParserQuery(file_name) MUST
    end

    Searcher->>Searcher: canonical = canonicalize_filename(path)
    Searcher->>BooleanQuery: add TermQuery(ancestor_paths = canonical) MUST

    Searcher->>IndexSearcher: search(finalQuery, max_results)
    activate IndexSearcher
    IndexSearcher->>LuceneIndex: execute boolean query
    LuceneIndex-->>IndexSearcher: TopDocs
    deactivate IndexSearcher

    IndexSearcher-->>Searcher: TopDocs

    loop for each hit in TopDocs
        Searcher->>LuceneIndex: doc(docId)
        LuceneIndex-->>Searcher: Document(full_path, file_type, ...)
        Searcher->>Searcher: serialize fields into result string
    end

    Searcher-->>User: vector<string> results
    deactivate Searcher

Class diagram for updated Searcher and indexing helpers

classDiagram
    class file_record {
        +string full_path
        +string file_name
        +string file_type
        +string file_ext
        +string modify_time_str
        +string file_size_str
        +string pinyin
        +string pinyin_acronym
        +bool is_hidden
    }

    class Document {
        +add(field : FieldPtr) : void
    }

    class Field {
        +name : wstring
        +value : wstring
        +STORE_NO : int
        +STORE_YES : int
        +INDEX_NOT_ANALYZED : int
    }

    class IndexerHelpers {
        +add_ancestor_paths(doc : DocumentPtr, full_path : string) : void
        +create_document(record : file_record) : DocumentPtr
    }

    class Searcher {
        +reader : IndexReaderPtr
        +searcher : IndexSearcherPtr
        +search(path : string, query : string, wildcard_query : bool, max_results : int) : vector~string~
    }

    class BooleanQuery {
        +add(query : QueryPtr, occur : BooleanClauseOccur) : void
    }

    class Queries {
        +WildcardQuery(term : TermPtr)
        +TermQuery(term : TermPtr)
        +parse_with_analyzer(field : wstring, query_string : wstring) : QueryPtr
    }

    file_record --> IndexerHelpers : used by
    IndexerHelpers --> Document : populates
    IndexerHelpers --> Field : creates

    Searcher ..> BooleanQuery : builds finalQuery
    Searcher ..> Queries : creates filename and path queries
    Searcher ..> Document : reads indexed fields

    Document --> Field : contains multiple

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Index ancestor directory paths for every file to support efficient path-based queries and bump the index schema version. | Introduce the `ANCESTOR_PATHS_FIELD` constant used to store ancestor directory paths in the index. Add an `add_ancestor_paths` helper that computes all ancestor directories for a file path and indexes them as non-stored, not-analyzed fields. Invoke `add_ancestor_paths` when creating a document so each file record has its ancestor directories indexed. Increment `INDEX_VERSION` from 3 to 4 to reflect the new index schema with ancestor path fields. | `src/daemon/src/core/file_index_manager.cpp` |
| Refactor search logic to build a combined Lucene `BooleanQuery` that enforces both filename and ancestor-path constraints, eliminating manual path-prefix filtering. | Replace a single `QueryPtr` with a `BooleanQuery` that aggregates required clauses. Wrap the wildcard filename query or the parsed analyzer-based filename query as a MUST clause in the `BooleanQuery`. Add a MUST `TermQuery` on `ANCESTOR_PATHS_FIELD` using the canonicalized input search path to constrain results by ancestor directory. Remove post-search string-based path prefix filtering and always emit all hits returned by Lucene. Use the new `BooleanQuery` when invoking `searcher->search` instead of the old standalone filename query. | `src/searcher/searcher.cpp` |
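
The version bump matters because documents written under schema version 3 carry no ancestor-path terms, so a MUST clause on that field would match none of them. A minimal sketch of how a stored version might gate index reuse follows; the value 4 comes from this PR, while `IndexAction` and `decide_index_action` are hypothetical and the project's actual handling of `INDEX_VERSION` is not shown in this conversation.

```cpp
// Schema version after this PR (previously 3).
constexpr int INDEX_VERSION = 4;

enum class IndexAction { Reuse, Rebuild };

// Hypothetical policy: any on-disk index written with a different
// schema version is rebuilt rather than queried.
IndexAction decide_index_action(int stored_version)
{
    return stored_version == INDEX_VERSION ? IndexAction::Reuse
                                           : IndexAction::Rebuild;
}
```

Under this assumed policy, an index stamped with version 3 is rebuilt before the new path-constrained queries run against it.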

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

- The path indexing in `add_ancestor_paths` assumes `/` as the separator. If this code ever runs on Windows-style paths (e.g., `C:\dir\file`), consider normalizing paths or using `std::filesystem` to derive parent directories, to avoid mismatches with the search-side canonicalization.
- The new path constraint uses an exact `TermQuery` on `ancestor_paths`. This changes semantics when the `path` argument points to a file rather than a directory (or is empty), so it's worth confirming or enforcing that `path` is always a directory, and handling or rejecting other inputs explicitly.
- Since the filtering has moved entirely into the Lucene query, the debug log that still prints `reader->numDocs()` can be misleading when diagnosing path-filtered results. Consider logging the final query or the number of hits instead of, or in addition to, that count to aid troubleshooting of the new behavior.
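
Because the path constraint is an exact term match, the string used at search time has to be byte-identical to one of the indexed ancestor terms. The sketch below shows one way to bring a directory argument into a single form; `normalize_dir_term` is a hypothetical helper, not part of this PR. Converting backslashes is an assumption that only makes sense for Windows-style input, since a backslash is a legal file name character on POSIX systems.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: converts '\' to '/' and strips trailing slashes
// (keeping a lone "/" intact) so that "/home/user/" and "/home/user"
// produce the same term.
std::string normalize_dir_term(std::string path)
{
    std::replace(path.begin(), path.end(), '\\', '/');
    while (path.size() > 1 && path.back() == '/') {
        path.pop_back();
    }
    return path;
}
```

An empty result could then be rejected explicitly before the query is built, which covers the empty-`path` case raised above.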

Prompt for AI Agents

Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `src/daemon/src/core/file_index_manager.cpp:124` </location>
<code_context>
+ * 3. "/home"
+ * 4. "/"
+ */
+static void add_ancestor_paths(DocumentPtr doc, const std::string &full_path)
+{
+    if (full_path.empty()) return;
</code_context>

<issue_to_address>
**issue (complexity):** Consider rewriting the ancestor-path building loop using std::filesystem::path to simplify path manipulation and clarify loop termination.

You can reduce the custom string/path handling and make the termination conditions clearer by delegating to `std::filesystem::path` and using an explicit loop condition instead of `while (true)` + multiple `break`s.

For example:

```cpp
#include <filesystem>
// namespace fs = std::filesystem; // if you use a namespace alias

static void add_ancestor_paths(DocumentPtr doc, const std::string& full_path)
{
    if (full_path.empty()) return;

    std::filesystem::path p(full_path);

    // Start from the parent directory of the file
    for (p = p.parent_path(); !p.empty(); p = p.parent_path()) {
        auto s = p.generic_string();  // normalized '/' separators

        // Map the root path consistently to "/"
        if (p == p.root_path()) {
            s = "/";
        }

        doc->add(newLucene<Field>(
            ANCESTOR_PATHS_FIELD,
            StringUtils::toUnicode(s),
            Field::STORE_NO,
            Field::INDEX_NOT_ANALYZED));

        // Once we've added the root, stop
        if (p == p.root_path()) {
            break;
        }
    }
}
```

This keeps all existing behavior:

- Skips pure filenames without a directory component.
- Adds all ancestor directories from the immediate parent up to and including `/`.
- Avoids manual `find_last_of('/')`, `substr`, and explicit root special cases in multiple branches.
</issue_to_address>

deepin-ci-robot · 2025-12-16T05:22:54Z

deepin pr auto review

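Here is a detailed review of this code:

Code logic suggestions:

In the add_ancestor_paths function:

- When handling a file located in the root directory, the root directory "/" should be added directly rather than derived through a substr operation.
- Add a path normalization step at the start of the function to ensure a uniform path format.
- The Windows path separator (\) is not handled; consider adding cross-platform compatibility.

Code quality suggestions:

- The add_ancestor_paths function lacks a null pointer check on its input parameters.
- Define ANCESTOR_PATHS_FIELD as a constant string rather than a macro.
- In the search function, g_autofree is specific to GLib; add a note about the header it requires.
- The result-building logic in the search function could be extracted into a separate function to improve reuse.

Performance suggestions:

- The repeated string operations (substr) in add_ancestor_paths may affect performance; consider a more efficient way of processing paths.
- In the search function, the size of results can be computed in advance to avoid repeated dynamic reallocation.
- Path comparison could be optimized with a more efficient string comparison algorithm.

Security suggestions:

- The add_ancestor_paths function does not handle path traversal (../).
- In the search function, normalization of the path parameter may be incomplete.
- Add a path length limit to prevent overly long paths from causing memory problems.

Concrete code examples for these improvements: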
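
On the path traversal point, a minimal sketch of two `std::filesystem` building blocks that operate purely on the path text, without touching the disk. The helper names `normalize_lexically` and `has_dotdot_component` are hypothetical. Note that a component-wise check differs from searching the string for "..", which would also flag an ordinary name such as `a..b`.

```cpp
#include <filesystem>
#include <string>

// Collapses "." and ".." components lexically (no filesystem access,
// so symlinks are not resolved).
std::string normalize_lexically(const std::string& raw)
{
    return std::filesystem::path(raw).lexically_normal().generic_string();
}

// True only if some whole path component is exactly "..".
bool has_dotdot_component(const std::string& raw)
{
    const std::filesystem::path dotdot("..");
    for (const auto& part : std::filesystem::path(raw)) {
        if (part == dotdot) return true;
    }
    return false;
}
```

For example, `normalize_lexically("/home/user/../docs/a.txt")` gives `/home/docs/a.txt`, and `has_dotdot_component("/home/a..b/c.txt")` is false.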

#include <climits>       // PATH_MAX
#include <filesystem>
#include <sstream>
#include <string>
#include <system_error>
#include <vector>

// Suggested: replace the macro definition with a constant
const std::wstring ANCESTOR_PATHS_FIELD = L"ancestor_paths";

// Improved add_ancestor_paths function
static void add_ancestor_paths(DocumentPtr doc, const std::string &full_path) {
    if (full_path.empty() || full_path.length() > PATH_MAX) return;

    // Normalize the path; the error_code overload avoids throwing
    // when the path does not exist or cannot be resolved
    std::error_code ec;
    const std::filesystem::path canonical_path = std::filesystem::canonical(full_path, ec);
    if (ec) return;
    std::string normalized_path = canonical_path.string();

    // Guard against path traversal: reject any ".." component
    // (a substring search would also reject names such as "a..b")
    for (const auto &part : canonical_path) {
        if (part == "..") return;
    }

    std::string current_path = normalized_path;

    // Get the parent directory
    size_t last_slash = current_path.find_last_of('/');
    if (last_slash == std::string::npos) return;

    // Handle the root directory case
    if (last_slash == 0) {
        current_path = "/";
    } else {
        current_path = current_path.substr(0, last_slash);
    }

    // Use a vector with preallocated space
    std::vector<std::string> ancestors;
    ancestors.reserve(16); // a reasonable estimate of path depth

    while (true) {
        ancestors.push_back(current_path);

        if (current_path == "/" || current_path.empty()) {
            break;
        }

        last_slash = current_path.find_last_of('/');
        if (last_slash == std::string::npos) break;

        current_path = (last_slash == 0) ? "/" : current_path.substr(0, last_slash);
    }

    // Add all ancestors to the document in one pass
    for (const auto& path : ancestors) {
        doc->add(newLucene<Field>(ANCESTOR_PATHS_FIELD,
                                  StringUtils::toUnicode(path),
                                  Field::STORE_NO,
                                  Field::INDEX_NOT_ANALYZED));
    }
}

// Improved function for building a search result string
static std::string build_search_result(const DocumentPtr& doc) {
    std::stringstream ss;
    ss << StringUtils::toUTF8(doc->get(L"full_path"))
       << "<\\>" << StringUtils::toUTF8(doc->get(L"file_type"))
       << "<\\>" << StringUtils::toUTF8(doc->get(L"file_ext"))
       << "<\\>" << StringUtils::toUTF8(doc->get(L"modify_time_str"))
       << "<\\>" << StringUtils::toUTF8(doc->get(L"file_size_str"))
       << "<\\>" << StringUtils::toUTF8(doc->get(L"pinyin"))
       << "<\\>" << StringUtils::toUTF8(doc->get(L"pinyin_acronym"))
       << "<\\>" << StringUtils::toUTF8(doc->get(L"is_hidden"));
    return ss.str();
}

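These improvements focus mainly on:

- Better error handling and input validation
- Performance optimization (preallocated space, batch operations)
- Stronger security (path normalization, length limits)
- Better code structure (function extraction, constant definitions)
- Cross-platform compatibility

In practice, adjust these improvements further according to the specific requirements and environment.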

deepin-ci-robot · 2025-12-16T07:57:07Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lzwind, wangrong1069

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

debian/deepin/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wangrong1069 · 2025-12-16T08:15:38Z

/forcemerge

deepin-bot · 2025-12-16T08:15:56Z

This pr force merged! (status: unstable)

feat: Add ancestor_paths field for lucene document

cca9511

This will improve search result filtering performance.

sourcery-ai bot reviewed Dec 15, 2025

wangrong1069 force-pushed the pr1215 branch from 230b920 to 12e1095 Compare December 16, 2025 05:22

feat: Searcher filters results by ancestor_paths field

ebd8ca8

This will improve search result filtering performance.

wangrong1069 force-pushed the pr1215 branch from 12e1095 to ebd8ca8 Compare December 16, 2025 07:07

lzwind approved these changes Dec 16, 2025

deepin-bot bot merged commit 26381f8 into linuxdeepin:develop/snipe Dec 16, 2025
16 of 18 checks passed

wangrong1069 deleted the pr1215 branch December 16, 2025 12:35
