fix: preserve duplicate keyword delimiter matches#282
fix: preserve duplicate keyword delimiter matches#282bashandbone merged 3 commits intoknitli:mainfrom
Conversation
This commit modifies `_match_keyword_delimiters` in `src/codeweaver/engine/chunker/delimiter.py` to significantly improve chunking performance. Instead of calling `re.finditer` for every individual keyword delimiter, the optimization combines all start strings into a single compiled regex pattern. This reduces regex execution overhead and limits the algorithm to making a single pass over the content. Additionally, an early return checks for empty lists to prevent compiling a dangerous empty regex. Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Reviewer's GuideOptimizes and corrects keyword delimiter matching so that duplicate keyword start strings preserve all associated delimiters, while adding a regression test and a minor docstring typo fix. Sequence diagram for optimized keyword delimiter matchingsequenceDiagram
participant Caller as Chunker
participant M as _match_keyword_delimiters
participant R as re
participant S as _is_inside_string_or_comment
participant F as _find_next_structural_with_char
Caller->>M: _match_keyword_delimiters(content, keyword_delimiters, matches)
M->>M: Filter keyword_delimiters where d.start is truthy
alt no keyword_delimiters
M-->>Caller: return matches
else keyword_delimiters present
M->>R: compile combined_pattern from unique delimiter.start values
M->>M: Build delimiter_map[start_string] -> list[Delimiter]
loop for each match in combined_pattern.finditer(content)
M->>R: combined_pattern.finditer(content)
R-->>M: match with matched_text and keyword_pos
M->>S: _is_inside_string_or_comment(content, keyword_pos)
S-->>M: bool is_inside
alt is_inside
M->>M: continue to next match
else not inside
loop for each delimiter in delimiter_map[matched_text]
M->>F: _find_next_structural_with_char(content, keyword_pos, structural_pairs)
F-->>M: struct_start, struct_char
M->>M: Evaluate delimiter rules and update matches
end
end
end
M-->>Caller: return matches
end
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
👋 Hey @aiedwardyi, Thanks for your contribution to codeweaver! 🧵You need to agree to the CLA first... 🖊️Before we can accept your contribution, you need to agree to our Contributor License Agreement (CLA). To agree to the CLA, please comment:
Those exact words are important1, so please don't change them. 😉 You can read the full CLA here: Contributor License Agreement ✅ @aiedwardyi has signed the CLA. You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot. Footnotes
|
There was a problem hiding this comment.
Pull request overview
Fixes the regression introduced by the single-regex keyword matching optimization so that multiple delimiters sharing the same keyword start are all preserved (instead of being overwritten), and adds a regression test for the duplicate-keyword-start case reported in #281.
Changes:
- Update
_match_keyword_delimiters()to group keyword delimiters bystartand emit matches for all delimiters that share a matched keyword. - Deduplicate keyword start strings when building the combined regex alternation to avoid repeated alternation entries.
- Add a unit regression test ensuring duplicate keyword starts (e.g.,
type) return multiple delimiter matches.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/codeweaver/engine/chunker/delimiter.py |
Preserves all delimiters for duplicate keyword starts by mapping matched text to a list of delimiters; also avoids compiling an empty/degenerate combined pattern after filtering. |
tests/unit/engine/chunker/test_delimiter_edge_cases.py |
Adds regression coverage asserting duplicate keyword starts produce multiple matches with distinct DelimiterKinds. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
hoist structural_pairs construction above the loop, use frozenset for thread safety Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
bashandbone
left a comment
There was a problem hiding this comment.
Thanks for the PR @aiedwardyi -- merging. I'll close #281
Follows up on the duplicate-keyword-start bug reported on #281.
What changed
Validation
_match_keyword_delimiters()in an isolated env returned bothSTRUCTandTYPE_ALIASmatches for the sametypekeywordpy_compilepassed for the touched filesNote
I don't have push access to the original
bolt-optimize-chunker-delimiter-7248057790755398651branch, so this follow-up PR carries the minimal fix needed to make that optimization mergeable.Summary by Sourcery
Preserve behavior for duplicate keyword-start delimiters while keeping the optimized single-regex matching path.
Bug Fixes:
Enhancements:
Tests: