feat: Supports hybrid search #198
Conversation
- Introduce updating index status to signify bulk index modifications. - Add batch_count_ to track pending event batches for processing. - Transition to updating status when pending events exceed a configured threshold. - Revert to monitoring status upon completion of batch updates.
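A rough sketch of the bookkeeping this describes, assuming GLib's atomic integer helpers (which the diagrams below also reference); the member names, batch size, and threshold values are illustrative, not the project's actual ones:

```cpp
#include <glib.h>
#include <cstddef>

// Illustrative fragment, not the project's actual class.
struct batch_tracker {
    gint batch_count_ = 0;          // batches handed to the thread pool, not yet finished
    std::size_t batch_size_ = 100;  // jobs per batch (assumed value)
    int trigger_ = 10000;           // pending_events_trigger_updating (assumed value)

    void on_batch_enqueued() { g_atomic_int_inc(&batch_count_); }

    // Returns true when this was the last outstanding batch.
    bool on_batch_finished() { return g_atomic_int_dec_and_test(&batch_count_); }

    // True when the estimated number of pending events crosses the threshold,
    // i.e. the daemon should switch the index status from monitoring to updating.
    bool should_enter_updating() {
        int pending = g_atomic_int_get(&batch_count_) * static_cast<int>(batch_size_);
        return pending >= trigger_;
    }
};
```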
Introduces a dynamic search mechanism in the deepin-anything-searcher that intelligently switches between an indexed search and a direct filesystem scan. - When the index is identified as "updating" (via status.json), the searcher will perform a recursive filesystem scan to find matching files and directories. - Otherwise, the more efficient indexed search will be utilized. - Adds necessary C++ filesystem library (stdc++fs) for scan operations. - Enhances path canonicalization for robustness. Task: https://pms.uniontech.com/task-view-385127.html
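A minimal sketch of that decision path, assuming the status string can be found verbatim in status.json and using a simple substring match for the scan; the function names and the crude status check are illustrative rather than the PR's exact code:

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Recursive scan: collect paths whose filename contains the query substring.
void search_by_scan(const std::string &path, const std::string &query,
                    std::vector<std::string> &results) {
    const auto opts = fs::directory_options::skip_permission_denied;
    for (const auto &entry : fs::recursive_directory_iterator(path, opts)) {
        if (entry.path().filename().string().find(query) != std::string::npos)
            results.push_back(entry.path().string());
    }
}

// Pick indexed search or filesystem scan based on the persisted index status.
void hybrid_search(const std::string &path, const std::string &query,
                   const std::string &index_path,
                   std::vector<std::string> &results) {
    std::ifstream file(index_path + "/status.json");
    std::string content((std::istreambuf_iterator<char>(file)),
                        std::istreambuf_iterator<char>());
    if (content.find("updating") != std::string::npos) {
        search_by_scan(path, query, results);  // index is being rebuilt: scan directly
    } else {
        // search_by_index(path, query, ...);  // normal, more efficient indexed search
    }
}
```

This kind of std::filesystem code is what motivates linking stdc++fs on older GCC toolchains; a real implementation would parse status.json properly instead of the substring check above.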
deepin pr auto review

I will review this code in detail and offer improvement suggestions covering syntax and logic, code quality, performance, and security.
The condition check in base_event_handler.cpp can be optimized:

if ((g_atomic_int_get(&batch_count_)*(int)batch_size_) >= config_.pending_events_trigger_updating &&
    index_status_ == anything::index_status::monitoring) {

Suggested improvement:

int pending_events = g_atomic_int_get(&batch_count_) * static_cast<int>(batch_size_);
if (pending_events >= config_.pending_events_trigger_updating &&
    index_status_ == anything::index_status::monitoring) {

This avoids the implicit C-style conversion and makes the code clearer.
In searcher/main.cpp, the exception handling in the search_by_scan function could be more specific:

try {
    // ... existing code ...
} catch (const std::filesystem::filesystem_error& ex) {
std::cerr << "Filesystem error while scanning: " << ex.what() << std::endl;
return;
} catch (const std::exception& ex) {
std::cerr << "Error while scanning: " << ex.what() << std::endl;
return;
} catch (...) {
std::cerr << "Unknown error while scanning" << std::endl;
return;
}
In the search_by_scan function, consider using reserve to pre-allocate the result vector:

void search_by_scan(const std::string &path,
                    const std::string &query,
                    std::vector<std::string> &results)
{
    results.reserve(1000); // pre-allocate to reduce reallocations
    // ... rest of the code ...
}
In searcher/main.cpp, file path handling needs stricter validation:

void search_by_scan(const std::string &path,
                    const std::string &query,
                    std::vector<std::string> &results)
{
    // validate the path
    if (path.empty() || path.find("..") != std::string::npos) {
        std::cerr << "Invalid path: " << path << std::endl;
        return;
    }
    // ... rest of the code ...
}
In config.cpp, consider adding configuration validation:

bool Config::validate_config() {
    if (pending_events_trigger_updating_ < 0) {
        spdlog::error("Invalid pending_events_trigger_updating value: {}",
                      pending_events_trigger_updating_);
        return false;
    }
    return true;
}
In searcher/main.cpp, file reading should use RAII:

void search(const std::string &path,
            const std::string &query,
            int max_results,
            bool wildcard_query,
            const std::string &index_path,
            std::vector<std::string> &results)
{
    std::string status_file = index_path + "/status.json";
    std::ifstream file(status_file);
    if (!file.is_open()) {
        std::cerr << "Fail to open status file: " << status_file << std::endl;
        return;
    }
    // the std::string constructor manages the buffer automatically
    std::string content((std::istreambuf_iterator<char>(file)),
                        std::istreambuf_iterator<char>());
    // ... rest of the code ...
}
In base_event_handler.h, batch_count_ should be declared with an atomic type:

class base_event_handler {
    // ... other members ...
private:
    std::atomic<gint> batch_count_; // atomic type to ensure thread safety
};
In file_index_manager.cpp, add validation of status transitions:

void file_index_manager::set_index_updating()
{
    std::lock_guard<std::mutex> lock(status_mtx_); // guard with a mutex
    if (current_status_ != index_status::monitoring) {
        spdlog::warn("Invalid status transition to updating from {}",
                     static_cast<int>(current_status_));
        return;
    }
    save_index_status(index_status::updating);
    current_status_ = index_status::updating;
}

These suggestions focus mainly on robustness, performance, and security. When applying them, make sure to test thoroughly, especially under concurrent scenarios.
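For the concurrent-scenario testing mentioned above, a minimal stress check of the monitoring-to-updating transition could look like the following; it uses plain std::thread and assert rather than the project's test setup, and the enum here is a stand-in:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

enum class index_status { monitoring, updating };

int main() {
    std::atomic<index_status> status{index_status::monitoring};
    std::atomic<int> winners{0};

    // Many threads race to switch monitoring -> updating; exactly one may win.
    std::vector<std::thread> threads;
    for (int i = 0; i < 16; ++i) {
        threads.emplace_back([&] {
            index_status expected = index_status::monitoring;
            if (status.compare_exchange_strong(expected, index_status::updating))
                winners.fetch_add(1);
        });
    }
    for (auto &t : threads) t.join();

    assert(winners.load() == 1);
    assert(status.load() == index_status::updating);
    return 0;
}
```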
Reviewer's Guide

Adds hybrid search support by introducing an index status–aware search path in the CLI, a new index status "updating" in the daemon, and a configurable threshold of pending events that switches the index between monitoring and updating modes while persisting this to status.json for the searcher to decide between indexed and filesystem scan search.

Sequence diagram for hybrid search decision in CLI

sequenceDiagram
actor User
participant CLI as deepin_anything_searcher
participant Search as search
participant IndexSearch as search_by_index
participant ScanSearch as search_by_scan
participant StatusFile as status_json
User->>CLI: Run with path and query
CLI->>Search: search(path, query, max_results, wildcard_query, index_path, results)
Search->>StatusFile: Open index_path/status.json
StatusFile-->>Search: Read content
alt status is updating
Search->>ScanSearch: search_by_scan(path, query, results)
ScanSearch->>ScanSearch: recursive_directory_iterator over path
ScanSearch-->>Search: results (filesystem scan)
else status is not updating
Search->>IndexSearch: search_by_index(path, query, max_results, wildcard_query, index_path, results)
IndexSearch->>IndexSearch: initialize(anything::Searcher)
IndexSearch->>IndexSearch: search(path, query, max_results, wildcard_query)
IndexSearch-->>Search: results (indexed search)
end
Search-->>CLI: results
CLI-->>User: Print search results
Sequence diagram for index status lifecycle in daemon

sequenceDiagram
participant FS as FileSystemEvents
participant Handler as base_event_handler
participant Pool as ThreadPool
participant Manager as file_index_manager
participant Timer as timer_worker
participant StatusFile as status_json
FS->>Handler: New index_job events
Handler->>Handler: eat_jobs(jobs, batch_size)
Handler->>Handler: Move first batch_size jobs to processing_jobs
Handler->>Handler: g_atomic_int_inc(batch_count_)
Handler->>Pool: enqueue_detach(processing_jobs)
Pool->>Handler: For each job call eat_job(job)
Pool->>Handler: g_atomic_int_dec_and_test(event_process_thread_count_)
Pool->>Handler: g_atomic_int_dec_and_test(batch_count_)
alt pending events threshold reached and index_status_ is monitoring
Handler->>Handler: Set index_status_ = updating
Handler->>Manager: set_index_updating()
Manager->>StatusFile: save_index_status(updating)
end
loop periodic timer_worker
Timer->>Handler: Check jobs_, pool_.busy, event_process_thread_count_
alt Timeout reached and no pending work
Timer->>Handler: If index_status_ == updating set to monitoring
Timer->>Manager: commit(index_status_)
Manager->>StatusFile: save_index_status(monitoring)
end
end
Updated class diagram for daemon configuration and index management

classDiagram
class event_handler_config {
+std::string persistent_index_dir
+std::string volatile_index_dir
+std::map~std::string,std::string~ file_type_mapping
+std::map~std::string,std::string~ file_type_mapping_original
+int commit_volatile_index_timeout
+int commit_persistent_index_timeout
+int pending_events_trigger_updating
}
class Config {
-std::string persistent_index_dir_
-std::string volatile_index_dir_
-int commit_volatile_index_timeout_
-int commit_persistent_index_timeout_
-int pending_events_trigger_updating_
-std::string log_level_
-void* dbus_connection_
-std::string resource_path_
+event_handler_config make_event_handler_config()
+bool update_config()
}
class index_status {
<<enumeration>>
loading
scanning
monitoring
updating
closed
}
class file_index_manager {
+bool refresh_indexes(std::vector~std::string~ blacklist_paths, bool nrt, bool check_exist)
+void set_index_invalid()
+void set_index_updating()
-void try_refresh_reader(bool nrt)
-void save_index_status(index_status status)
}
class base_event_handler {
-event_handler_config config_
-file_index_manager index_manager_
-index_status index_status_
-std::atomic~int~ event_process_thread_count_
-std::atomic~bool~ stop_scan_directory_
-std::mutex config_access_mtx_
-gint batch_count_
+base_event_handler(event_handler_config config)
+bool handle_config_change(std::string key, event_handler_config new_config)
+void eat_jobs(std::vector~anything::index_job~ jobs, std::size_t batch_size)
+void timer_worker(int64_t interval)
-void eat_job(const anything::index_job& job)
}
class default_event_handler {
+bool handle_config_change(std::string key, event_handler_config new_config)
}
event_handler_config <-- Config : builds
base_event_handler o-- event_handler_config : has
base_event_handler o-- file_index_manager : manages
base_event_handler --> index_status : uses
file_index_manager --> index_status : saves
default_event_handler --|> base_event_handler
File-Level Changes
Hey - I've found 1 issue, and left some high level feedback:
- The new `index_status_` transitions to and from `updating` are performed without any locking or atomic guarantees, but this field is read and written from multiple threads (e.g., in `eat_jobs` and `timer_worker`), so you may want to protect it with a mutex or make it atomic to avoid data races (see the sketch after this list).
- In `search()` you now return early when `status.json` cannot be opened, which is a behavior change from always performing an index search; consider falling back to index or scan search instead of failing hard so that search keeps working when the status file is missing or corrupt.
- `search_by_scan` ignores `max_results` and always traverses the full directory tree, which can be very expensive; you might want to stop the recursion once `max_results` is reached to align with the indexed search behavior.
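As a sketch of the first point, one way to make the transition race-free is an atomic status driven by compare-and-swap; the enum values match the class diagram above, but the class and method names are illustrative, not the project's API:

```cpp
#include <atomic>

enum class index_status { loading, scanning, monitoring, updating, closed };

class status_holder {
public:
    // Attempt the monitoring -> updating transition atomically; returns true
    // only for the thread that actually performed the switch.
    bool try_enter_updating() {
        index_status expected = index_status::monitoring;
        return status_.compare_exchange_strong(expected, index_status::updating);
    }

    // Called once all pending batches are drained (e.g. from the timer worker).
    void back_to_monitoring() {
        index_status expected = index_status::updating;
        status_.compare_exchange_strong(expected, index_status::monitoring);
    }

    index_status current() const { return status_.load(); }

private:
    std::atomic<index_status> status_{index_status::monitoring};
};
```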
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `index_status_` transitions to and from `updating` are performed without any locking or atomic guarantees, but this field is read and written from multiple threads (e.g., in `eat_jobs` and `timer_worker`), so you may want to protect it with a mutex or make it atomic to avoid data races.
- In `search()` you now return early when `status.json` cannot be opened, which is a behavior change from always performing an index search; consider falling back to index or scan search instead of failing hard so that search keeps working when the status file is missing or corrupt.
- `search_by_scan` ignores `max_results` and always traverses the full directory tree, which can be very expensive; you might want to stop the recursion once `max_results` is reached to align with the indexed search behavior.
## Individual Comments
### Comment 1
<location> `src/searcher/main.cpp:49-58` </location>
<code_context>
+void search_by_scan(const std::string &path,
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Scan-based search ignores `max_results` and wildcard semantics, which can diverge from index-based behavior.
When `status` is "updating", `search()` falls back to `search_by_scan`, but that function (1) can return an unbounded number of paths because it doesn’t honor `max_results`, and (2) uses a simple substring match, ignoring wildcard semantics from `wildcard_query`. This can produce noticeably different results vs index-based search. If the goal is to approximate index search while updating, consider passing `max_results` and `wildcard_query` into `search_by_scan`, enforcing the limit during iteration, and aligning its matching rules with the indexed search (or clearly documenting and naming it as intentionally different).
Suggested implementation:
```cpp
namespace {
// Simple glob-style wildcard matcher supporting '*' and '?'.
// Intended to approximate the semantics used by the indexed search.
bool wildcard_match(const std::string &pattern, const std::string &value) {
const char *p = pattern.c_str();
const char *v = value.c_str();
const char *star = nullptr;
const char *star_match = nullptr;
while (*v) {
if (*p == '?' || *p == *v) {
++p;
++v;
} else if (*p == '*') {
star = p++;
star_match = v;
} else if (star) {
p = star + 1;
v = ++star_match;
} else {
return false;
}
}
while (*p == '*') {
++p;
}
return *p == '\0';
}
}
void search_by_scan(const std::string &path,
const std::string &query,
std::vector<std::string> &results,
std::size_t max_results,
const std::string &wildcard_query)
{
    // Under directory `path`, find file and directory names containing the `query` string and store them in `results`,
    // reusing the wildcard (`wildcard_query`) semantics and `max_results` behavior as far as possible to approximate the indexed search results
```
To fully implement the behavior described in your review comment, the following additional edits are needed elsewhere in `src/searcher/main.cpp` (and possibly other translation units that reference `search_by_scan`):
1. **Update all callers of `search_by_scan`**
Wherever `search_by_scan` is called (e.g., in `search()` when status is `"updating"`), update the call to pass the new parameters:
- `max_results`
- `wildcard_query`
For example, change something like:
```cpp
search_by_scan(path, query, results);
```
to:
```cpp
search_by_scan(path, query, results, max_results, wildcard_query);
```
2. **Honor `max_results` inside the scan loop**
Inside the body of `search_by_scan`, in the loop where you iterate over `std::filesystem::directory_iterator` / `recursive_directory_iterator`:
- Before pushing back a new match into `results`, check:
```cpp
if (max_results > 0 && results.size() >= max_results) {
return; // or break the traversal early
}
```
- This ensures `search_by_scan` does not return more than `max_results` entries and avoids unbounded result growth.
3. **Align matching semantics with `wildcard_query`**
In the same loop:
- Extract the candidate name (file or directory) into a `std::string name`.
- If `!wildcard_query.empty()`, use `wildcard_match(wildcard_query, name)` to decide whether to include the entry.
- Otherwise, fall back to the existing substring behavior using `query`:
```cpp
bool matched = false;
if (!wildcard_query.empty()) {
matched = wildcard_match(wildcard_query, name);
} else if (!query.empty()) {
matched = name.find(query) != std::string::npos;
}
if (matched) {
// push_back if under max_results, as described above
}
```
4. **Edge cases / documentation**
- Decide and document what `max_results == 0` means. A common convention is "no limit", i.e., only enforce the size check when `max_results > 0`.
- If the indexed search has a dedicated helper or class to interpret `wildcard_query`, you should **replace** the `wildcard_match` implementation above with calls to that existing logic to ensure behavior matches as closely as possible.
These changes will ensure `search_by_scan` respects `max_results` and uses wildcard semantics similar to the indexed search, reducing divergence between the two code paths.
</issue_to_address>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lzwind, wangrong1069

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.

/merge
Summary by Sourcery
Introduce a hybrid search mode that falls back to filesystem scanning when the index is updating and enhance index status handling in the daemon.
New Features:
- Hybrid search in the searcher CLI: fall back to a recursive filesystem scan when the index status is "updating", otherwise use the indexed search.

Enhancements:
- Add an "updating" index status in the daemon, entered when pending event batches exceed a configurable threshold and persisted to status.json; revert to "monitoring" once batch updates complete.

Build:
- Link the C++ filesystem library (stdc++fs) for the scan-based search.