feat(embedder): use summary for file embedding in semantic pipeline #765
Merged
qin-ctx merged 3 commits into volcengine:main on Mar 19, 2026
Conversation
When files are processed through the semantic pipeline (SemanticDag), use the pre-generated summary (AST skeleton or LLM summary) for embedding instead of reading raw file content. This ensures code files, markdown, and other text files within a repository are indexed by their semantic summary rather than truncated raw content.

- Add a `use_summary` flag to `VectorizeTask`, `_vectorize_single_file`, and `vectorize_file`
- Set `use_summary=True` in `_file_summary_task` when a non-empty summary is available
- Truncate the AST skeleton to `max_skeleton_chars` (12000 chars, ~3000 tokens) before embedding
- Add a `max_skeleton_chars` config field to `SemanticConfig`
- `index_resource` and memory paths are unaffected (`use_summary` defaults to `False`)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
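A minimal sketch of the embedding-input selection described above. `VectorizeTask`, `use_summary`, and `max_skeleton_chars` are named in the PR; the dataclass layout and the `embedding_input` helper are hypothetical illustrations, not the actual implementation.

```python
from dataclasses import dataclass

MAX_SKELETON_CHARS = 12000  # per SemanticConfig.max_skeleton_chars (~3000 tokens)


@dataclass
class VectorizeTask:
    file_path: str
    # Defaults to False so the index_resource and memory paths are unaffected.
    use_summary: bool = False


def embedding_input(task: VectorizeTask, raw_content: str, summary: str) -> str:
    """Choose the text sent to the embedding API for one file."""
    if task.use_summary and summary:
        # Truncate the AST skeleton / LLM summary so it stays bounded.
        return summary[:MAX_SKELETON_CHARS]
    # Legacy behavior: embed the raw file content.
    return raw_content
```

With `use_summary=False` (the default), behavior is unchanged; only the semantic-pipeline path opts in.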
qin-ctx
reviewed
Mar 19, 2026
```python
try:
    if need_vectorize:
        use_summary = bool(summary_dict.get("summary"))
```
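The reviewed line above can be isolated as a tiny predicate: `use_summary` is true only when a non-empty summary string is present. `summary_dict` is the PR's name; this standalone helper is illustrative only.

```python
def should_use_summary(summary_dict: dict) -> bool:
    # bool("") is False, so a missing or empty summary falls back
    # to raw-content embedding.
    return bool(summary_dict.get("summary"))
```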
Collaborator
Author
Something seems off here; let me check it again.
Collaborator
Author
Does only code go through this path?

This has been updated: it now applies only in the code-repo case.
…ext/doc files

Add an `is_code_repo` flag to `SemanticMsg` and propagate it through the pipeline so that summary-based embedding (AST skeleton) is only applied when processing a code repository (`source_format == "repository"`). For plain text, markdown, and other non-repo resources, raw file content is used for embedding as before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
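The gating this commit adds can be sketched as follows. `SemanticMsg`, `is_code_repo`, and the `"repository"` source format come from the PR; the field layout and helper functions are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class SemanticMsg:
    source_format: str
    is_code_repo: bool = False


def make_semantic_msg(source_format: str) -> SemanticMsg:
    # Only resources parsed as a repository (by CodeRepositoryParser in the PR)
    # are flagged as code repos when enqueuing semantic processing.
    return SemanticMsg(source_format, is_code_repo=(source_format == "repository"))


def use_summary_for(msg: SemanticMsg, summary: str) -> bool:
    # Summary-based embedding requires both a code repo and a non-empty summary;
    # plain text / markdown keep raw-content embedding.
    return msg.is_code_repo and bool(summary)
```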
qin-ctx
approved these changes
Mar 19, 2026
Summary

- Add an `is_code_repo` flag to `SemanticMsg` and propagate it through the pipeline: `ResourceProcessor` → `Summarizer` → `SemanticMsg` → `SemanticDagExecutor`
- Detect `source_format == "repository"` (set by `CodeRepositoryParser`) and pass `is_code_repo=True` when enqueuing semantic processing
- `use_summary` in `_file_summary_task` is now gated on `is_code_repo`, so plain text / markdown / other non-repo resources continue to embed raw file content
- Truncate the AST skeleton to `max_skeleton_chars` (12000 chars, ~3000 tokens) before embedding to prevent oversized input
- Add a `max_skeleton_chars` config field to `SemanticConfig`

Why
Raw file content was being sent directly to the embedding API even when a semantic summary had already been generated. For large files this caused the embedding API to reject the request with a token limit error (e.g. OpenAI 8192 token limit). Using the bounded summary instead of raw content fixes this.
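A back-of-the-envelope check of the limits mentioned above, assuming the common rough heuristic of ~4 characters per token for code and English text (the real count depends on the tokenizer):

```python
CHARS_PER_TOKEN = 4          # rough heuristic, not exact tokenizer output
MAX_SKELETON_CHARS = 12000   # the PR's max_skeleton_chars default
EMBED_TOKEN_LIMIT = 8192     # e.g. the OpenAI embedding limit cited above

# ~3000 tokens, comfortably under the 8192-token limit that raw
# file content was previously exceeding.
approx_tokens = MAX_SKELETON_CHARS // CHARS_PER_TOKEN
assert approx_tokens < EMBED_TOKEN_LIMIT
```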
However, using summary for all file types (including markdown, plain text) was incorrect — for those files the raw content is the meaningful representation. Summary-based embedding is only appropriate for code files where AST skeletons provide a better semantic signal.
Paths unaffected:
- `index_resource` direct indexing path (`use_summary` defaults to `False`)
- memory path (`memory_extractor.py`)

Closes
Closes #616