Improve research search with Tantivy-backed snippets #152
Draft
akseljoonas wants to merge 1 commit into main from
Conversation
Whoosh is unmaintained and emits Python 3.12 syntax warnings. More importantly, the existing research tools ranked whole pages/files and often forced the agent to spend tokens reading broad results before finding the useful passage. This moves HF docs, HF OpenAPI, and GitHub example search onto a small Tantivy-backed search layer with passage/snippet chunking, source line ranges, and disk caches for network-backed research data. GitHub example lookup now searches file contents as well as paths, tolerates missing or rejected GitHub tokens for public repos, and returns focused snippets that the agent can follow up with `github_read_file` line ranges.

Constraint: Keep the PR scoped to search quality; do not introduce RAG or embedding infra.
Rejected: Keep Whoosh and suppress warnings | leaves the stale dependency and weaker result granularity in place.
Rejected: Index raw notebooks as snippets | raw ipynb JSON produced noisy excerpts and misleading line ranges.
Confidence: high
Scope-risk: moderate
Directive: Treat this as the search substrate for future research-tool consolidation; broader gh/hf CLI exposure should build on this rather than reintroducing independent search paths.
Tested: `uv run pytest tests/unit/test_tantivy_search.py tests/unit/test_docs_tantivy_search.py tests/unit/test_github_find_examples_tantivy.py -q`
Tested: `uv run python -m compileall -q agent/search agent/tools/docs_tools.py agent/tools/github_find_examples.py`
Tested: live `explore_hf_docs`, `find_hf_api`, `github_find_examples` calls with cached follow-up timings
Tested: real ml-intern CLI research prompt exercised `explore_hf_docs`, `github_find_examples`, `fetch_hf_docs`, and `github_read_file`
Not-tested: Full unit suite has two pre-existing doom-loop wording assertion failures unrelated to search.
closed per maintainer request
What this PR does
This replaces the old Whoosh-backed search inside ml-intern's research tools with a small Tantivy-based search layer. The goal is not to add RAG or embeddings; it is to make the existing research tools return more precise, source-addressable results so the agent spends fewer tokens finding the right docs or examples.
Whoosh is unmaintained and emits syntax warnings under Python 3.12 in local runs. More importantly, the old search ranked whole docs/pages and GitHub paths, so research calls often sent the model broad results instead of the exact useful passage.
User-visible behavior
- `explore_hf_docs` now ranks markdown passages instead of whole pages. Results include the heading and line range for the matched section.
- `find_hf_api` now uses the same Tantivy search layer for OpenAPI endpoint search.
- `github_find_examples` still starts from example-like files, but now also indexes source snippets from public repo contents when a keyword is provided.
- Results include `github_read_file` line ranges and focused excerpts around the query terms.
- Caches live in `.ml-intern-cache/search` by default, or `ML_INTERN_SEARCH_CACHE_DIR` when set.

Implementation notes
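The heading-plus-line-range results described above imply chunking markdown at headings while tracking 1-based line numbers. A minimal sketch of that idea; `chunk_markdown` is a hypothetical helper, not the PR's actual function:

```python
import re


def chunk_markdown(text):
    """Split markdown into heading-delimited passages with 1-based line ranges."""
    lines = text.splitlines()
    passages = []
    heading, start, buf = "", 1, []
    for i, line in enumerate(lines, start=1):
        if re.match(r"#{1,6}\s", line):
            if buf:  # close out the previous passage
                passages.append({"heading": heading, "start": start,
                                 "end": i - 1, "text": "\n".join(buf).strip()})
            heading, start, buf = line.lstrip("#").strip(), i, []
        else:
            buf.append(line)
    if buf:
        passages.append({"heading": heading, "start": start,
                         "end": len(lines), "text": "\n".join(buf).strip()})
    return passages
```

Each passage's `start`/`end` can then be stored alongside the indexed text, so a search hit maps directly to a source line range for follow-up reads.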
Adds `agent/search/` with:

- `TantivyTextIndex`: small wrapper around `tantivy` for field-boosted BM25 search.
- Skips `.ipynb` content indexing for now because notebook JSON produced noisy snippets and misleading line ranges; notebooks can still appear as path-level example results.

Validation
- `UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/unit/test_tantivy_search.py tests/unit/test_docs_tantivy_search.py tests/unit/test_github_find_examples_tantivy.py -q`: 11 passed.
- `UV_CACHE_DIR=/tmp/uv-cache uv run python -m compileall -q agent/search agent/tools/docs_tools.py agent/tools/github_find_examples.py`
- `explore_hf_docs` on TRL query `dataset_text_field SFTConfig packing` returned SFT / Packing with source lines. Cached repeat was about 0.055s.
- `find_hf_api` returned correct top endpoints for `create repository`, `upload file`, and `space logs`.
- `github_find_examples` on `huggingface/trl` query `grpo trainer` returned focused source snippets and cached repeat was about 0.031s.
- `ml-intern --max-iterations 6 --no-stream "Research current TRL GRPOTrainer usage..."` naturally called `explore_hf_docs`, `github_find_examples`, `fetch_hf_docs`, and `github_read_file`, then returned a researched GRPOTrainer answer.

Known unrelated issue
The full unit suite currently reports two existing `tests/unit/test_doom_loop.py` failures because tests still expect `DOOM LOOP DETECTED` while the runtime returns `[SYSTEM: REPETITION GUARD]`. This PR does not change that behavior.

Follow-up direction
This PR intentionally keeps scope to the search substrate. A natural next step is consolidating the research tools around a broader GitHub/HF interface, including model-accessible `gh`/`hf` CLI-style capabilities and more GitHub operations. The Tantivy layer here should give that future consolidation one shared, precise search path instead of several independent ones.