feat: Supports hybrid search #198
Conversation
- Introduce updating index status to signify bulk index modifications. - Add batch_count_ to track pending event batches for processing. - Transition to updating status when pending events exceed a configured threshold. - Revert to monitoring status upon completion of batch updates.
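A rough sketch of the bookkeeping this describes, assuming GLib's atomic integer helpers (which the diagrams below also reference); the member names, batch size, and threshold values are illustrative, not the project's actual ones:

```cpp
#include <glib.h>
#include <cstddef>

// Illustrative fragment, not the project's actual class.
struct batch_tracker {
    gint batch_count_ = 0;          // batches handed to the thread pool, not yet finished
    std::size_t batch_size_ = 100;  // jobs per batch (assumed value)
    int trigger_ = 10000;           // pending_events_trigger_updating (assumed value)

    void on_batch_enqueued() { g_atomic_int_inc(&batch_count_); }

    // Returns true when this was the last outstanding batch.
    bool on_batch_finished() { return g_atomic_int_dec_and_test(&batch_count_); }

    // True when the estimated number of pending events crosses the threshold,
    // i.e. the daemon should switch the index status from monitoring to updating.
    bool should_enter_updating() {
        int pending = g_atomic_int_get(&batch_count_) * static_cast<int>(batch_size_);
        return pending >= trigger_;
    }
};
```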
Introduces a dynamic search mechanism in the deepin-anything-searcher that intelligently switches between an indexed search and a direct filesystem scan. - When the index is identified as "updating" (via status.json), the searcher will perform a recursive filesystem scan to find matching files and directories. - Otherwise, the more efficient indexed search will be utilized. - Adds necessary C++ filesystem library (stdc++fs) for scan operations. - Enhances path canonicalization for robustness. Task: https://pms.uniontech.com/task-view-385127.html
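A minimal sketch of that decision path, assuming the status string can be found verbatim in status.json and using a simple substring match for the scan; the function names and the crude status check are illustrative rather than the PR's exact code:

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Recursive scan: collect paths whose filename contains the query substring.
void search_by_scan(const std::string &path, const std::string &query,
                    std::vector<std::string> &results) {
    const auto opts = fs::directory_options::skip_permission_denied;
    for (const auto &entry : fs::recursive_directory_iterator(path, opts)) {
        if (entry.path().filename().string().find(query) != std::string::npos)
            results.push_back(entry.path().string());
    }
}

// Pick indexed search or filesystem scan based on the persisted index status.
void hybrid_search(const std::string &path, const std::string &query,
                   const std::string &index_path,
                   std::vector<std::string> &results) {
    std::ifstream file(index_path + "/status.json");
    std::string content((std::istreambuf_iterator<char>(file)),
                        std::istreambuf_iterator<char>());
    if (content.find("updating") != std::string::npos) {
        search_by_scan(path, query, results);  // index is being rebuilt: scan directly
    } else {
        // search_by_index(path, query, ...);  // normal, more efficient indexed search
    }
}
```

This kind of std::filesystem code is what motivates linking stdc++fs on older GCC toolchains; a real implementation would parse status.json properly instead of the substring check above.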
deepin pr auto review

I will review this code in detail and offer improvement suggestions covering syntax and logic, code quality, performance, and security.
The condition check in base_event_handler.cpp can be optimized:

if ((g_atomic_int_get(&batch_count_)*(int)batch_size_) >= config_.pending_events_trigger_updating &&
    index_status_ == anything::index_status::monitoring) {

Suggested improvement:

int pending_events = g_atomic_int_get(&batch_count_) * static_cast<int>(batch_size_);
if (pending_events >= config_.pending_events_trigger_updating &&
    index_status_ == anything::index_status::monitoring) {

This avoids the implicit C-style conversion and makes the code clearer.
In searcher/main.cpp, the exception handling in the search_by_scan function could be more specific:

try {
    // ... existing code ...
} catch (const std::filesystem::filesystem_error& ex) {
std::cerr << "Filesystem error while scanning: " << ex.what() << std::endl;
return;
} catch (const std::exception& ex) {
std::cerr << "Error while scanning: " << ex.what() << std::endl;
return;
} catch (...) {
std::cerr << "Unknown error while scanning" << std::endl;
return;
}
In the search_by_scan function, consider using reserve to pre-allocate the result vector:

void search_by_scan(const std::string &path,
                    const std::string &query,
                    std::vector<std::string> &results)
{
    results.reserve(1000); // pre-allocate to reduce reallocations
    // ... rest of the code ...
}
In searcher/main.cpp, file path handling needs stricter validation:

void search_by_scan(const std::string &path,
                    const std::string &query,
                    std::vector<std::string> &results)
{
    // validate the path
    if (path.empty() || path.find("..") != std::string::npos) {
        std::cerr << "Invalid path: " << path << std::endl;
        return;
    }
    // ... rest of the code ...
}
In config.cpp, consider adding configuration validation:

bool Config::validate_config() {
    if (pending_events_trigger_updating_ < 0) {
        spdlog::error("Invalid pending_events_trigger_updating value: {}",
                      pending_events_trigger_updating_);
        return false;
    }
    return true;
}
In searcher/main.cpp, file reading should use RAII:

void search(const std::string &path,
            const std::string &query,
            int max_results,
            bool wildcard_query,
            const std::string &index_path,
            std::vector<std::string> &results)
{
    std::string status_file = index_path + "/status.json";
    std::ifstream file(status_file);
    if (!file.is_open()) {
        std::cerr << "Fail to open status file: " << status_file << std::endl;
        return;
    }
    // the std::string constructor manages the buffer automatically
    std::string content((std::istreambuf_iterator<char>(file)),
                        std::istreambuf_iterator<char>());
    // ... rest of the code ...
}
In base_event_handler.h, batch_count_ should be declared with an atomic type:

class base_event_handler {
    // ... other members ...
private:
    std::atomic<gint> batch_count_; // atomic type to ensure thread safety
};
In file_index_manager.cpp, add validation of status transitions:

void file_index_manager::set_index_updating()
{
    std::lock_guard<std::mutex> lock(status_mtx_); // guard with a mutex
    if (current_status_ != index_status::monitoring) {
        spdlog::warn("Invalid status transition to updating from {}",
                     static_cast<int>(current_status_));
        return;
    }
    save_index_status(index_status::updating);
    current_status_ = index_status::updating;
}

These suggestions focus mainly on robustness, performance, and security. When applying them, make sure to test thoroughly, especially under concurrent scenarios.
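For the concurrent-scenario testing mentioned above, a minimal stress check of the monitoring-to-updating transition could look like the following; it uses plain std::thread and assert rather than the project's test setup, and the enum here is a stand-in:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

enum class index_status { monitoring, updating };

int main() {
    std::atomic<index_status> status{index_status::monitoring};
    std::atomic<int> winners{0};

    // Many threads race to switch monitoring -> updating; exactly one may win.
    std::vector<std::thread> threads;
    for (int i = 0; i < 16; ++i) {
        threads.emplace_back([&] {
            index_status expected = index_status::monitoring;
            if (status.compare_exchange_strong(expected, index_status::updating))
                winners.fetch_add(1);
        });
    }
    for (auto &t : threads) t.join();

    assert(winners.load() == 1);
    assert(status.load() == index_status::updating);
    return 0;
}
```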
Reviewer's Guide

Adds hybrid search support by introducing an index status–aware search path in the CLI, a new index status "updating" in the daemon, and a configurable threshold of pending events that switches the index between monitoring and updating modes while persisting this to status.json for the searcher to decide between indexed and filesystem scan search.

Sequence diagram for hybrid search decision in CLI

sequenceDiagram
actor User
participant CLI as deepin_anything_searcher
participant Search as search
participant IndexSearch as search_by_index
participant ScanSearch as search_by_scan
participant StatusFile as status_json
User->>CLI: Run with path and query
CLI->>Search: search(path, query, max_results, wildcard_query, index_path, results)
Search->>StatusFile: Open index_path/status.json
StatusFile-->>Search: Read content
alt status is updating
Search->>ScanSearch: search_by_scan(path, query, results)
ScanSearch->>ScanSearch: recursive_directory_iterator over path
ScanSearch-->>Search: results (filesystem scan)
else status is not updating
Search->>IndexSearch: search_by_index(path, query, max_results, wildcard_query, index_path, results)
IndexSearch->>IndexSearch: initialize(anything::Searcher)
IndexSearch->>IndexSearch: search(path, query, max_results, wildcard_query)
IndexSearch-->>Search: results (indexed search)
end
Search-->>CLI: results
CLI-->>User: Print search results
Sequence diagram for index status lifecycle in daemon

sequenceDiagram
participant FS as FileSystemEvents
participant Handler as base_event_handler
participant Pool as ThreadPool
participant Manager as file_index_manager
participant Timer as timer_worker
participant StatusFile as status_json
FS->>Handler: New index_job events
Handler->>Handler: eat_jobs(jobs, batch_size)
Handler->>Handler: Move first batch_size jobs to processing_jobs
Handler->>Handler: g_atomic_int_inc(batch_count_)
Handler->>Pool: enqueue_detach(processing_jobs)
Pool->>Handler: For each job call eat_job(job)
Pool->>Handler: g_atomic_int_dec_and_test(event_process_thread_count_)
Pool->>Handler: g_atomic_int_dec_and_test(batch_count_)
alt pending events threshold reached and index_status_ is monitoring
Handler->>Handler: Set index_status_ = updating
Handler->>Manager: set_index_updating()
Manager->>StatusFile: save_index_status(updating)
end
loop periodic timer_worker
Timer->>Handler: Check jobs_, pool_.busy, event_process_thread_count_
alt Timeout reached and no pending work
Timer->>Handler: If index_status_ == updating set to monitoring
Timer->>Manager: commit(index_status_)
Manager->>StatusFile: save_index_status(monitoring)
end
end
Updated class diagram for daemon configuration and index management

classDiagram
class event_handler_config {
+std::string persistent_index_dir
+std::string volatile_index_dir
+std::map~std::string,std::string~ file_type_mapping
+std::map~std::string,std::string~ file_type_mapping_original
+int commit_volatile_index_timeout
+int commit_persistent_index_timeout
+int pending_events_trigger_updating
}
class Config {
-std::string persistent_index_dir_
-std::string volatile_index_dir_
-int commit_volatile_index_timeout_
-int commit_persistent_index_timeout_
-int pending_events_trigger_updating_
-std::string log_level_
-void* dbus_connection_
-std::string resource_path_
+event_handler_config make_event_handler_config()
+bool update_config()
}
class index_status {
<<enumeration>>
loading
scanning
monitoring
updating
closed
}
class file_index_manager {
+bool refresh_indexes(std::vector~std::string~ blacklist_paths, bool nrt, bool check_exist)
+void set_index_invalid()
+void set_index_updating()
-void try_refresh_reader(bool nrt)
-void save_index_status(index_status status)
}
class base_event_handler {
-event_handler_config config_
-file_index_manager index_manager_
-index_status index_status_
-std::atomic~int~ event_process_thread_count_
-std::atomic~bool~ stop_scan_directory_
-std::mutex config_access_mtx_
-gint batch_count_
+base_event_handler(event_handler_config config)
+bool handle_config_change(std::string key, event_handler_config new_config)
+void eat_jobs(std::vector~anything::index_job~ jobs, std::size_t batch_size)
+void timer_worker(int64_t interval)
-void eat_job(const anything::index_job& job)
}
class default_event_handler {
+bool handle_config_change(std::string key, event_handler_config new_config)
}
event_handler_config <-- Config : builds
base_event_handler o-- event_handler_config : has
base_event_handler o-- file_index_manager : manages
base_event_handler --> index_status : uses
file_index_manager --> index_status : saves
default_event_handler --|> base_event_handler
File-Level Changes
Hey - I've found 1 issue, and left some high level feedback:
- The new `index_status_` transitions to and from `updating` are performed without any locking or atomic guarantees, but this field is read and written from multiple threads (e.g., in `eat_jobs` and `timer_worker`), so you may want to protect it with a mutex or make it atomic to avoid data races (see the sketch after this list).
- In `search()` you now return early when `status.json` cannot be opened, which is a behavior change from always performing an index search; consider falling back to index or scan search instead of failing hard so that search keeps working when the status file is missing or corrupt.
- `search_by_scan` ignores `max_results` and always traverses the full directory tree, which can be very expensive; you might want to stop the recursion once `max_results` is reached to align with the indexed search behavior.
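As a sketch of the first point, one way to make the transition race-free is an atomic status driven by compare-and-swap; the enum values match the class diagram above, but the class and method names are illustrative, not the project's API:

```cpp
#include <atomic>

enum class index_status { loading, scanning, monitoring, updating, closed };

class status_holder {
public:
    // Attempt the monitoring -> updating transition atomically; returns true
    // only for the thread that actually performed the switch.
    bool try_enter_updating() {
        index_status expected = index_status::monitoring;
        return status_.compare_exchange_strong(expected, index_status::updating);
    }

    // Called once all pending batches are drained (e.g. from the timer worker).
    void back_to_monitoring() {
        index_status expected = index_status::updating;
        status_.compare_exchange_strong(expected, index_status::monitoring);
    }

    index_status current() const { return status_.load(); }

private:
    std::atomic<index_status> status_{index_status::monitoring};
};
```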
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `index_status_` transitions to and from `updating` are performed without any locking or atomic guarantees, but this field is read and written from multiple threads (e.g., in `eat_jobs` and `timer_worker`), so you may want to protect it with a mutex or make it atomic to avoid data races.
- In `search()` you now return early when `status.json` cannot be opened, which is a behavior change from always performing an index search; consider falling back to index or scan search instead of failing hard so that search keeps working when the status file is missing or corrupt.
- `search_by_scan` ignores `max_results` and always traverses the full directory tree, which can be very expensive; you might want to stop the recursion once `max_results` is reached to align with the indexed search behavior.
## Individual Comments
### Comment 1
<location> `src/searcher/main.cpp:49-58` </location>
<code_context>
+void search_by_scan(const std::string &path,
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Scan-based search ignores `max_results` and wildcard semantics, which can diverge from index-based behavior.
When `status` is "updating", `search()` falls back to `search_by_scan`, but that function (1) can return an unbounded number of paths because it doesn’t honor `max_results`, and (2) uses a simple substring match, ignoring wildcard semantics from `wildcard_query`. This can produce noticeably different results vs index-based search. If the goal is to approximate index search while updating, consider passing `max_results` and `wildcard_query` into `search_by_scan`, enforcing the limit during iteration, and aligning its matching rules with the indexed search (or clearly documenting and naming it as intentionally different).
Suggested implementation:
```cpp
namespace {
// Simple glob-style wildcard matcher supporting '*' and '?'.
// Intended to approximate the semantics used by the indexed search.
bool wildcard_match(const std::string &pattern, const std::string &value) {
const char *p = pattern.c_str();
const char *v = value.c_str();
const char *star = nullptr;
const char *star_match = nullptr;
while (*v) {
if (*p == '?' || *p == *v) {
++p;
++v;
} else if (*p == '*') {
star = p++;
star_match = v;
} else if (star) {
p = star + 1;
v = ++star_match;
} else {
return false;
}
}
while (*p == '*') {
++p;
}
return *p == '\0';
}
}
void search_by_scan(const std::string &path,
const std::string &query,
std::vector<std::string> &results,
std::size_t max_results,
const std::string &wildcard_query)
{
    // Under directory `path`, find file and directory names containing the `query` string and store them in `results`,
    // reusing the wildcard (`wildcard_query`) semantics and `max_results` behavior as far as possible to approximate the indexed search results
```
To fully implement the behavior described in your review comment, the following additional edits are needed elsewhere in `src/searcher/main.cpp` (and possibly other translation units that reference `search_by_scan`):
1. **Update all callers of `search_by_scan`**
Wherever `search_by_scan` is called (e.g., in `search()` when status is `"updating"`), update the call to pass the new parameters:
- `max_results`
- `wildcard_query`
For example, change something like:
```cpp
search_by_scan(path, query, results);
```
to:
```cpp
search_by_scan(path, query, results, max_results, wildcard_query);
```
2. **Honor `max_results` inside the scan loop**
Inside the body of `search_by_scan`, in the loop where you iterate over `std::filesystem::directory_iterator` / `recursive_directory_iterator`:
- Before pushing back a new match into `results`, check:
```cpp
if (max_results > 0 && results.size() >= max_results) {
return; // or break the traversal early
}
```
- This ensures `search_by_scan` does not return more than `max_results` entries and avoids unbounded result growth.
3. **Align matching semantics with `wildcard_query`**
In the same loop:
- Extract the candidate name (file or directory) into a `std::string name`.
- If `!wildcard_query.empty()`, use `wildcard_match(wildcard_query, name)` to decide whether to include the entry.
- Otherwise, fall back to the existing substring behavior using `query`:
```cpp
bool matched = false;
if (!wildcard_query.empty()) {
matched = wildcard_match(wildcard_query, name);
} else if (!query.empty()) {
matched = name.find(query) != std::string::npos;
}
if (matched) {
// push_back if under max_results, as described above
}
```
4. **Edge cases / documentation**
- Decide and document what `max_results == 0` means. A common convention is "no limit", i.e., only enforce the size check when `max_results > 0`.
- If the indexed search has a dedicated helper or class to interpret `wildcard_query`, you should **replace** the `wildcard_match` implementation above with calls to that existing logic to ensure behavior matches as closely as possible.
These changes will ensure `search_by_scan` respects `max_results` and uses wildcard semantics similar to the indexed search, reducing divergence between the two code paths.
</issue_to_address>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lzwind, wangrong1069

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.

/merge
Summary by Sourcery
Introduce a hybrid search mode that falls back to filesystem scanning when the index is updating and enhance index status handling in the daemon.
New Features:
- Hybrid search in the searcher CLI: fall back to a recursive filesystem scan when the index status is "updating", otherwise use the indexed search.

Enhancements:
- Add an "updating" index status in the daemon, entered when pending event batches exceed a configurable threshold and persisted to status.json; revert to "monitoring" once batch updates complete.

Build:
- Link the C++ filesystem library (stdc++fs) for the scan-based search.