⚡ Bolt: Combine keyword delimiters into single regex pattern #283

bashandbone wants to merge 3 commits into main
Conversation
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me. New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task.
Reviewer's Guide

Optimizes keyword delimiter matching in the chunker by building a single alternation regex over all keyword delimiters, short-circuiting when no non-empty keywords exist, and fixing a minor docstring typo.

Flow diagram for optimized _match_keyword_delimiters keyword scanning:

```mermaid
flowchart TD
    A_start["Start _match_keyword_delimiters"] --> B_filter
    B_filter["Filter keyword_delimiters where start is non_empty"] --> C_any
    C_any{"keyword_delimiters list is empty?"} -->|yes| Z_return_empty["Return matches (no keyword delimiters)"]
    C_any -->|no| D_build_struct_pairs
    D_build_struct_pairs["Initialize structural_pairs map"] --> E_build_map
    E_build_map["Build delimiter_map from start_text to delimiter_instance"] --> F_build_regex
    F_build_regex["Build combined_pattern with alternation over all delimiter.start values"] --> G_finditer
    G_finditer["Iterate over re.finditer on combined_pattern and content"] --> H_keyword_pos
    H_keyword_pos["For each match: compute keyword_pos and matched_text"] --> I_lookup_delim
    I_lookup_delim["Lookup delimiter in delimiter_map using matched_text"] --> J_inside_comment
    J_inside_comment{"_is_inside_string_or_comment at keyword_pos?"} -->|yes| G_finditer
    J_inside_comment -->|no| K_find_struct_open
    K_find_struct_open["_find_next_structural_with_char starting after keyword"] --> L_struct_found
    L_struct_found{"struct_start is None?"} -->|yes| G_finditer
    L_struct_found -->|no| M_find_struct_close
    M_find_struct_close["_find_matching_close using struct_start and structural_pairs"] --> N_struct_end
    N_struct_end{"struct_end is None?"} -->|yes| G_finditer
    N_struct_end -->|no| O_calc_nesting
    O_calc_nesting["_calculate_nesting_level at keyword_pos"] --> P_append_match
    P_append_match["Append DelimiterMatch(delimiter, start_pos, end_pos, nesting_level) to matches"] --> G_finditer
    G_finditer -->|no more matches| Q_return_matches
    Q_return_matches["Return matches with all keyword-based DelimiterMatch instances"] --> R_end["End _match_keyword_delimiters"]
```
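The flow above can be sketched as a minimal, self-contained Python version. The `Delimiter` and `DelimiterMatch` dataclasses and the function body here are simplified stand-ins for the real chunker internals; the string/comment filtering and structural-block resolution steps are elided:

```python
import re
from dataclasses import dataclass


@dataclass
class Delimiter:
    start: str
    kind: str


@dataclass
class DelimiterMatch:
    delimiter: Delimiter
    start_pos: int
    end_pos: int


def match_keyword_delimiters(content: str, delimiters: list[Delimiter]) -> list[DelimiterMatch]:
    # Keep only delimiters with a non-empty start keyword.
    keyword_delimiters = [d for d in delimiters if d.start]
    if not keyword_delimiters:
        return []  # short-circuit: nothing to scan for

    # Map each start keyword back to its delimiter and build one alternation regex.
    delimiter_map = {d.start: d for d in keyword_delimiters}
    combined = re.compile(
        rf"\b(?:{'|'.join(re.escape(d.start) for d in keyword_delimiters)})\b"
    )

    # One pass over the content instead of one pass per keyword delimiter.
    matches = []
    for m in combined.finditer(content):
        delimiter = delimiter_map[m.group(0)]
        matches.append(DelimiterMatch(delimiter, m.start(), m.end()))
    return matches
```

The single `finditer` pass is what drops the scanning cost from one traversal per keyword to one traversal total.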
File-Level Changes
Hey - I've found 1 issue
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="src/codeweaver/engine/chunker/delimiter.py" line_range="474-477" />
<code_context>
+ continue
- # Find the next structural opening after the keyword
- struct_start, struct_char = self._find_next_structural_with_char(
- content,
- start=keyword_pos + len(delimiter.start),
- allowed=set(structural_pairs.keys()),
- )
+ # Find the next structural opening after the keyword
</code_context>
<issue_to_address>
**suggestion (performance):** Avoid recreating the `allowed` set on each iteration by precomputing it once outside the loop.
Since `structural_pairs` doesn’t change within this function, compute `allowed_structurals = set(structural_pairs.keys())` once before the `for match in re.finditer(...)` loop and reuse it. This keeps the per-match path leaner, which matters on large files with many keyword hits.
Suggested implementation:
```python
struct_start, struct_char = self._find_next_structural_with_char(
content,
start=keyword_pos + len(delimiter.start),
    allowed=allowed_structurals,
)
```
To fully implement the optimization, define the `allowed_structurals` set once before the `for match in re.finditer(...):` loop that contains this snippet. For example:
- Just above the loop (and after `structural_pairs` is defined), add:
`allowed_structurals = set(structural_pairs.keys())`
Ensure that this variable is in the same scope as the loop so it is available where `_find_next_structural_with_char` is called.
</issue_to_address>
```python
# Skip if keyword is inside a string or comment
if self._is_inside_string_or_comment(content, keyword_pos):
    continue
```
suggestion (performance): Avoid recreating the `allowed` set on each iteration by precomputing it once outside the loop.

Since `structural_pairs` doesn't change within this function, compute `allowed_structurals = set(structural_pairs.keys())` once before the `for match in re.finditer(...)` loop and reuse it. This keeps the per-match path leaner, which matters on large files with many keyword hits.

Suggested implementation:

```python
struct_start, struct_char = self._find_next_structural_with_char(
    content,
    start=keyword_pos + len(delimiter.start),
    allowed=allowed_structurals,
)
```

To fully implement the optimization, define the `allowed_structurals` set once before the `for match in re.finditer(...):` loop that contains this snippet. For example, just above the loop (and after `structural_pairs` is defined), add `allowed_structurals = set(structural_pairs.keys())`. Ensure that this variable is in the same scope as the loop so it is available where `_find_next_structural_with_char` is called.
Pull request overview
Optimizes delimiter-based chunking by scanning the document once for all keyword-based delimiters, rather than running a separate regex scan per keyword delimiter.
Changes:
- Combine keyword delimiter starts into a single alternation regex and map match text back to a delimiter.
- Add an early return when keyword delimiters are empty after filtering.
- Fix docstring spelling (“normalised” → “normalized”).
```diff
+ # Optimization: combining multiple keyword delimiters into a single compiled regex pattern
+ # using alternation (?:...) avoids looping over keywords with individual re.finditer calls.
+ delimiter_map = {d.start: d for d in keyword_delimiters}
+ combined_pattern = rf"\b(?:{'|'.join(re.escape(d.start) for d in keyword_delimiters)})\b"

- for match in re.finditer(pattern, content):
+ for match in re.finditer(combined_pattern, content):
      keyword_pos = match.start()
+     matched_text = match.group(0)
+     delimiter = delimiter_map[matched_text]
```
`delimiter_map = {d.start: d ...}` will silently drop keyword delimiters that share the same start string (later entries overwrite earlier ones). This codebase has overlapping keyword starts (e.g., both `STRUCT_PATTERN` and `TYPE_ALIAS_PATTERN` include `"type"`), so after this change only one delimiter kind will ever be produced for `type` matches, changing chunking behavior. Use a mapping from start string to a list of delimiters (or otherwise disambiguate) and emit matches for each applicable delimiter to preserve prior behavior.
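The overwrite behavior described here is easy to reproduce in isolation; the `Delimiter` dataclass below is a minimal stand-in for the real one:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Delimiter:
    start: str
    kind: str


# Two delimiter kinds sharing the start keyword "type", mirroring the
# STRUCT_PATTERN / TYPE_ALIAS_PATTERN overlap described in the review.
keyword_delimiters = [
    Delimiter("type", "struct"),
    Delimiter("type", "type_alias"),
]

# Plain dict comprehension: later entries silently overwrite earlier ones.
flat_map = {d.start: d for d in keyword_delimiters}
assert flat_map["type"].kind == "type_alias"  # the "struct" delimiter is lost

# Mapping each start to a list keeps every delimiter kind.
multi_map: defaultdict[str, list[Delimiter]] = defaultdict(list)
for d in keyword_delimiters:
    multi_map[d.start].append(d)
assert [d.kind for d in multi_map["type"]] == ["struct", "type_alias"]
```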
```diff
  # Optimization: combining multiple keyword delimiters into a single compiled regex pattern
  # using alternation (?:...) avoids looping over keywords with individual re.finditer calls.
  delimiter_map = {d.start: d for d in keyword_delimiters}
  combined_pattern = rf"\b(?:{'|'.join(re.escape(d.start) for d in keyword_delimiters)})\b"

- for match in re.finditer(pattern, content):
+ for match in re.finditer(combined_pattern, content):
      keyword_pos = match.start()
```
This optimization changes how keyword delimiters are selected when multiple delimiter kinds share the same keyword (e.g., type). There doesn’t appear to be a unit test exercising overlapping keyword starts for delimiter-based chunking; adding a regression test (e.g., a snippet containing type Foo struct { ... } and/or type Foo = ...) would help prevent future regressions when refactoring this matcher.
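A self-contained sketch of such a regression test follows. It uses a toy matcher rather than the real chunker API, so the names (`Delimiter`, `match_keywords`) are illustrative only:

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Delimiter:
    start: str
    kind: str


def match_keywords(content: str, delimiters: list[Delimiter]) -> list[tuple[str, int]]:
    """Return (kind, position) for every delimiter whose keyword matches."""
    by_start: defaultdict[str, list[Delimiter]] = defaultdict(list)
    for d in delimiters:
        by_start[d.start].append(d)
    pattern = rf"\b(?:{'|'.join(re.escape(s) for s in by_start)})\b"
    hits = []
    for m in re.finditer(pattern, content):
        for d in by_start[m.group(0)]:  # emit one hit per delimiter kind
            hits.append((d.kind, m.start()))
    return hits


def test_overlapping_keyword_starts():
    delimiters = [Delimiter("type", "struct"), Delimiter("type", "type_alias")]
    hits = match_keywords("type Foo struct { }", delimiters)
    # Both delimiter kinds must be reported for the shared "type" keyword.
    assert ("struct", 0) in hits
    assert ("type_alias", 0) in hits


test_overlapping_keyword_starts()
```

A real test in `tests/unit/engine/chunker` would assert the same property against the production matcher.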
Review of PR #283: Combine keyword delimiters into single regex pattern
Summary
This PR optimizes the _match_keyword_delimiters method by combining multiple keyword delimiter patterns into a single regex using alternation. This is a smart optimization that reduces the complexity from O(K * N) to O(N), where K is the number of keyword delimiters and N is the document size.
Positive Changes

- Performance Improvement: The optimization is well-motivated and should provide significant performance gains for large files with multiple keyword delimiters.
- Code Quality:
  - Clear explanatory comment added (lines 465-466)
  - Early return optimization for empty delimiter lists (lines 454-455)
  - Logic remains functionally equivalent to the original
- Documentation: Minor typo fix ("normalised" → "normalized") at line 1289
Critical Issue Found
At line 467:

```python
delimiter_map = {d.start: d for d in keyword_delimiters}
```

This dictionary comprehension will silently overwrite delimiters if multiple delimiters have the same start string. I found at least two cases where this occurs:
- `"type"` appears in:
  - `STRUCT_PATTERN` (line 146: `starts=["struct", "type"]`)
  - `TYPE_ALIAS_PATTERN` (line 161: `starts=["type", "typedef", "typealias"]`)
- `"extension"` appears in:
  - `MODULE_PATTERN` (line 153: `starts=["module", "namespace", "package", "mod", "extension"]`)
  - `EXTENSION_PATTERN` (line 167: `starts=["extension", "extend"]`)

Since `ends="ANY"` expands to empty strings (making these keyword delimiters), these patterns will have the same start but different `kind`, `priority`, or other properties. The current code will only keep the last one in the list.
Recommendation

The fix should handle duplicate start strings by one of the following:

1. Use a list of delimiters per key:

   ```python
   from collections import defaultdict

   delimiter_map = defaultdict(list)
   for d in keyword_delimiters:
       delimiter_map[d.start].append(d)
   ```

   Then in the loop, iterate over `delimiter_map[matched_text]` to create multiple matches.

2. Add validation to ensure no duplicates exist (if duplicates aren't expected by design).

3. Sort by priority and only keep the highest-priority delimiter for each start string (if this is the intended behavior).

The original code didn't have this issue because it iterated over all delimiters individually, so each one was processed.
Test Status
- Most CI checks passing
- Python 3.14t experimental test failing (appears unrelated to this change)
- Recommend adding a test case that verifies behavior when multiple delimiters share the same start string
Verdict: ❌ Needs changes before merging due to the dictionary collision bug.
Additional Finding on Dictionary Collision Bug

I want to emphasize the critical bug identified by @Copilot at src/codeweaver/engine/chunker/delimiter.py:467. The conflicting keywords (`type`, `extension`) are detailed in the review above.

Impact: Only the last delimiter in the list will be processed for these keywords, silently changing chunking behavior and potentially missing important code structures.

Recommended Fix:

```python
from collections import defaultdict

# Build map of start strings to lists of delimiters
delimiter_map = defaultdict(list)
for d in keyword_delimiters:
    delimiter_map[d.start].append(d)

combined_pattern = rf"\b(?:{'|'.join(re.escape(start) for start in delimiter_map.keys())})\b"

for match in re.finditer(combined_pattern, content):
    keyword_pos = match.start()
    matched_text = match.group(0)
    # Skip if keyword is inside a string or comment
    if self._is_inside_string_or_comment(content, keyword_pos):
        continue
    # Process all delimiters with this start string
    for delimiter in delimiter_map[matched_text]:
        # ... rest of existing logic
```

This maintains the O(N) performance optimization while correctly handling all matching delimiters.
Code Review - PR #283
Summary
This PR optimizes the _match_keyword_delimiters method by combining multiple keyword delimiters into a single regex pattern, reducing complexity from O(K × N) to O(N) where K is the number of keyword delimiters and N is the document size.
✅ Strengths

- Correct optimization approach: Combining multiple regex patterns using alternation (`(?:pattern1|pattern2|...)`) is a well-established optimization technique that reduces the number of passes over the input string.
- Functionally equivalent: The logic correctly preserves the original behavior by:
  - Using `match.group(0)` to get the matched keyword text
  - Looking up the corresponding delimiter from `delimiter_map`
  - Maintaining the same filtering and structural matching logic
- Performance improvement: This change will significantly improve performance for large files with many keyword delimiters, especially since the regex engine only scans the document once instead of K times.
- Early exit optimization: The addition of the early exit check (lines 454-455) is a good defensive programming practice.
- Minor typo fix: Fixed "normalised" to "normalized" for consistency with American English used elsewhere in the codebase.
⚠️ Potential Issues
1. Duplicate delimiter handling (Medium Severity)
The code uses a dictionary to map delimiter start strings to delimiter objects:

```python
delimiter_map = {d.start: d for d in keyword_delimiters}
```

If multiple delimiters have the same start string, only the last one will be retained in the map. This could lead to:

- Lost delimiter configurations: If custom delimiters override built-in ones with the same start but different properties (kind, priority, etc.), only one will be used
- Inconsistent behavior: The behavior would depend on the order of delimiters in the list

Analysis: Based on the deduplication logic in `_load_delimiters_for_language` (lines 1381-1385), this scenario should be rare since the code checks for `(start, end, kind)` uniqueness. However, it's theoretically possible for keyword delimiters (which have empty end strings) to have the same start with different `kind` values.
Recommendation: Consider one of these approaches:

- Add assertion: Add a debug assertion to verify no duplicate starts exist in `keyword_delimiters`
- Deduplicate: Keep only unique delimiter starts using a seen set
- Document assumption: Add a comment documenting that duplicate starts are not expected

Example fix:

```python
# Build delimiter map - ensure no duplicate starts (each start should be unique)
seen_starts = set()
unique_delimiters = []
for d in keyword_delimiters:
    if d.start not in seen_starts:
        seen_starts.add(d.start)
        unique_delimiters.append(d)
delimiter_map = {d.start: d for d in unique_delimiters}
```

2. Regex pattern efficiency with duplicates (Low Severity)
If there are duplicate starts in `keyword_delimiters`, the combined pattern will contain redundant alternatives:

```python
combined_pattern = rf"\b(?:{'|'.join(re.escape(d.start) for d in keyword_delimiters)})\b"
# Could produce: \b(?:function|function|def)\b
```

This is inefficient but functionally correct. The same deduplication fix above would address this.
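One lightweight way to drop the redundant alternatives without a seen-set loop is `dict.fromkeys`, which deduplicates while preserving first-seen order; a small sketch:

```python
import re

keyword_starts = ["function", "function", "def"]  # duplicate start strings

# Joining the raw list repeats alternatives...
raw = rf"\b(?:{'|'.join(map(re.escape, keyword_starts))})\b"
assert raw == r"\b(?:function|function|def)\b"

# ...while dict.fromkeys() deduplicates and preserves first-seen order.
unique = rf"\b(?:{'|'.join(map(re.escape, dict.fromkeys(keyword_starts)))})\b"
assert unique == r"\b(?:function|def)\b"
```

Note this only removes redundant regex alternatives; it does not by itself resolve the delimiter-kind collision discussed above.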
📝 Minor Observations

- Comment quality: The optimization comment (lines 465-466) clearly explains the benefit. Well done!
- Code style: The change maintains consistency with the surrounding code style.
- Testing: The PR description mentions running tests via `uv run pytest tests/unit/engine/chunker`. The existing integration tests should catch any behavioral regressions.
🎯 Recommendation

Approve with minor suggestion: The optimization is sound and provides real performance benefits. The potential duplicate-handling issue is theoretical and may not occur in practice due to existing deduplication logic. However, I recommend adding either:

- A comment documenting the assumption that `keyword_delimiters` contains unique start strings, OR
- Defensive code to deduplicate by start string

The CI failure for Python 3.14t appears to be experimental and can be safely ignored based on the workflow configuration (`continue-on-error: true` for experimental versions).
Verdict
✅ LGTM with minor suggestion - The optimization is correct and valuable. Consider adding duplicate handling or documentation for robustness.
@jules please address the identified key collision bug. See the suggestion below; another option would be to use tuples.

> Additional Finding on Dictionary Collision Bug
>
> I want to emphasize the critical bug identified by @copilot at src/codeweaver/engine/chunker/delimiter.py:467.
>
> Impact: Only the last delimiter in the list will be processed for these keywords, silently changing chunking behavior and potentially missing important code structures.
>
> Recommended Fix:
>
> ```python
> from collections import defaultdict
>
> # Build map of start strings to lists of delimiters
> delimiter_map = defaultdict(list)
> for d in keyword_delimiters:
>     delimiter_map[d.start].append(d)
>
> combined_pattern = rf"\b(?:{'|'.join(re.escape(start) for start in delimiter_map.keys())})\b"
>
> for match in re.finditer(combined_pattern, content):
>     keyword_pos = match.start()
>     matched_text = match.group(0)
>     # Skip if keyword is inside a string or comment
>     if self._is_inside_string_or_comment(content, keyword_pos):
>         continue
>     # Process all delimiters with this start string
>     for delimiter in delimiter_map[matched_text]:
>         # ... rest of existing logic
> ```
>
> This maintains the O(N) performance optimization while correctly handling all matching delimiters.
Thank you for finding this collision bug. I've updated the implementation to use a `defaultdict[str, list[Delimiter]]`.
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
…sition lookups

- Move `from collections import defaultdict` from inline function body to module-level imports
- Add type annotation: `defaultdict[str, list[Delimiter]]`
- Check `_is_inside_string_or_comment` once per match position (not once per delimiter)
- Call `_find_next_structural_with_char` and `_find_matching_close` once per position
- Use `matches.extend()` with a generator instead of an append loop
Fixed in commit. Additionally moved the `defaultdict` import to module level.
Addressed by #282
💡 What: Optimize the `_match_keyword_delimiters` method by combining multiple keyword delimiters into a single compiled regex pattern.

🎯 Why: Previously, the engine iterated over every registered keyword delimiter and scanned the entire document string with a separate `re.finditer` call for each one, resulting in O(K * N) complexity. By combining them using alternation `(?:...)` and looking up the matching delimiter from a dictionary, the regex engine only needs to scan the document exactly once, lowering the text scanning to O(N).

📊 Impact: Considerably faster document text chunking due to fewer string reads, especially for large source code files with multiple delimiters.

🔬 Measurement: Verify tests run smoothly without regressions on the chunker code via `uv run pytest tests/unit/engine/chunker`.

PR created automatically by Jules for task 15926007155550964703 started by @bashandbone
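The O(K * N) → O(N) claim can be sanity-checked with a small equivalence sketch; the keyword list and content below are made up for illustration (one could additionally wrap the two functions in `timeit` to compare wall-clock cost):

```python
import re

keywords = ["def", "class", "async", "with", "lambda", "yield", "return", "import"]
content = "def f():\n    return 1\n" * 2000


def separate_scans() -> int:
    # O(K * N): one full pass over content per keyword.
    total = 0
    for kw in keywords:
        total += sum(1 for _ in re.finditer(rf"\b{re.escape(kw)}\b", content))
    return total


combined = re.compile(rf"\b(?:{'|'.join(map(re.escape, keywords))})\b")


def combined_scan() -> int:
    # O(N): a single pass with an alternation pattern.
    return sum(1 for _ in combined.finditer(content))


# Both strategies must report the same total number of keyword hits.
assert separate_scans() == combined_scan()
```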
Summary by Sourcery
Optimize keyword delimiter matching in the chunker to reduce redundant regex scans and improve performance when processing source content.
Enhancements:
Documentation: