
⚡ Bolt: [performance improvement] Optimize keyword delimiter matching#281

Closed
bashandbone wants to merge 1 commit into main from bolt-optimize-chunker-delimiter-7248057790755398651

Conversation


@bashandbone bashandbone commented Mar 27, 2026

💡 What:
The `_match_keyword_delimiters` method in `DelimiterChunker` has been optimized. Instead of looping through each potential keyword delimiter and running a separate `re.finditer` search across the entire document text, the code now combines all keyword start strings into a single compiled regex pattern using alternation (`(?:...)`) and maps matched text back to the delimiter objects.
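The combined-pattern approach can be sketched roughly as follows; the keyword list, sample text, and variable names here are illustrative stand-ins, not the actual `DelimiterChunker` internals:

```python
import re

# Hypothetical keyword starts; the real list comes from the chunker's delimiters.
keyword_starts = ["def", "class", "async"]

# Escape each keyword and join with (?:...) alternation so one compiled
# pattern can find any of them in a single pass over the text.
combined = re.compile(r"\b(?:" + "|".join(map(re.escape, keyword_starts)) + r")\b")

text = "class Foo:\n    def bar(self): ...\n"
hits = [(m.group(0), m.start()) for m in combined.finditer(text)]
print(hits)  # → [('class', 0), ('def', 15)]
```

On each match, `m.group(0)` yields the matched keyword, which the real implementation then maps back to its delimiter object.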

🎯 Why:
Running `re.finditer` once per keyword costs $O(K \times N)$ regex evaluation time (where $K$ is the number of keywords and $N$ is the size of the text). Compiling the keywords into a single pattern lets the algorithm make one $O(N)$ pass, avoiding redundant full-text scans. An added guard clause ensures no empty expression is compiled if the `keyword_delimiters` list is unexpectedly empty.

📊 Impact:
Significantly reduces parsing overhead for code chunking, especially on larger files with many potential keyword delimiters. Micro-benchmarking confirms the single-pass combined pattern reduces execution time relative to per-keyword regex loops, with the largest gains when the keyword set is large.
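A quick way to reproduce this comparison is a standalone micro-benchmark with synthetic keywords and text (this is a sketch, not the project's benchmark harness):

```python
import re
import timeit

keywords = [f"kw{i}" for i in range(50)]              # synthetic keyword set
text = "kw7 filler kw23 filler kw48 filler " * 2000   # synthetic document

def per_keyword():
    # Old approach: one finditer scan of the full text per keyword.
    out = []
    for kw in keywords:
        out.extend(m.start() for m in re.finditer(rf"\b{re.escape(kw)}\b", text))
    return out

combined = re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b")

def single_pass():
    # New approach: one scan with the combined alternation pattern.
    return [m.start() for m in combined.finditer(text)]

# Both approaches must find the same positions.
assert sorted(per_keyword()) == sorted(single_pass())

print("per-keyword:", timeit.timeit(per_keyword, number=3))
print("single-pass:", timeit.timeit(single_pass, number=3))
```

The assertion checks that the two strategies agree on match positions; the timing printout shows the relative cost of the $K$ scans versus one.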

🔬 Measurement:
To verify the change, run `mise //:test tests/unit/engine/chunker/test_delimiter_patterns.py` and confirm that matching behavior remains exactly equivalent under the new combined regex.


PR created automatically by Jules for task 7248057790755398651 started by @bashandbone

Summary by Sourcery

Optimize keyword delimiter matching in the delimiter chunker for improved performance while preserving existing behavior.

Enhancements:

  • Replace per-keyword regex scanning with a single combined regex over all keyword delimiters to reduce parsing overhead.
  • Add a guard clause to skip keyword matching when no keyword delimiters are present.
  • Correct a parameter docstring to use standard American spelling for "normalized".

This commit modifies `_match_keyword_delimiters` in `src/codeweaver/engine/chunker/delimiter.py` to significantly improve chunking performance.

Instead of calling `re.finditer` for every individual keyword delimiter, the optimization combines all start strings into a single compiled regex pattern. This reduces regex execution overhead and limits the algorithm to making a single pass over the content. Additionally, an early return checks for empty lists to prevent compiling a dangerous empty regex.

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
@google-labs-jules

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

Copilot AI review requested due to automatic review settings March 27, 2026 12:40

sourcery-ai bot commented Mar 27, 2026

Reviewer's Guide

Optimizes keyword delimiter matching in DelimiterChunker by replacing per-keyword regex scans with a single combined compiled regex pass, adding a guard for empty keyword sets, and fixing a minor docstring typo.

Sequence diagram for optimized keyword delimiter matching

sequenceDiagram
    actor Caller
    participant DelimiterChunker
    participant RegexEngine
    participant DelimiterMatchList as Matches

    activate DelimiterChunker
    Caller->>DelimiterChunker: _match_keyword_delimiters(content, keyword_delimiters, matches)
    DelimiterChunker->>DelimiterChunker: filter keyword_delimiters with nonempty start
    DelimiterChunker->>Caller: return matches if keyword_delimiters is empty
    deactivate DelimiterChunker

    activate DelimiterChunker
    DelimiterChunker->>DelimiterChunker: start_strings = [d.start for d in keyword_delimiters]
    DelimiterChunker->>RegexEngine: compile combined_pattern with alternation
    RegexEngine-->>DelimiterChunker: combined_pattern
    DelimiterChunker->>DelimiterChunker: delimiter_map = {d.start: d}

    loop for each match in combined_pattern.finditer(content)
        DelimiterChunker->>RegexEngine: combined_pattern.finditer(content)
        RegexEngine-->>DelimiterChunker: match
        DelimiterChunker->>DelimiterChunker: matched_text = match.group(0)
        DelimiterChunker->>DelimiterChunker: delimiter = delimiter_map[matched_text]
        DelimiterChunker->>DelimiterChunker: keyword_pos = match.start()
        DelimiterChunker->>DelimiterChunker: _is_inside_string_or_comment(content, keyword_pos)
        alt inside string or comment
            DelimiterChunker->>DelimiterChunker: continue to next match
        else outside string and comment
            DelimiterChunker->>DelimiterChunker: _find_next_structural_with_char(content, start, allowed)
            DelimiterChunker-->>DelimiterChunker: struct_start, struct_char
            alt no structural opening
                DelimiterChunker->>DelimiterChunker: continue to next match
            else structural opening found
                DelimiterChunker->>DelimiterChunker: _find_matching_close(content, struct_start, struct_char, closing_char)
                DelimiterChunker-->>DelimiterChunker: struct_end
                alt struct_end is not None
                    DelimiterChunker->>DelimiterChunker: nesting_level = _calculate_nesting_level(content, keyword_pos)
                    DelimiterChunker->>Matches: append DelimiterMatch(delimiter, keyword_pos, struct_end, nesting_level)
                else struct_end is None
                    DelimiterChunker->>DelimiterChunker: continue to next match
                end
            end
        end
    end

    DelimiterChunker-->>Caller: matches

Updated class diagram for DelimiterChunker keyword delimiter matching

classDiagram
    class DelimiterChunker {
        +_match_keyword_delimiters(content: str, keyword_delimiters: list[Delimiter], matches: list[DelimiterMatch]) list[DelimiterMatch]
        -_is_inside_string_or_comment(content: str, position: int) bool
        -_find_next_structural_with_char(content: str, start: int, allowed: set[str]) tuple[int,str]
        -_find_matching_close(content: str, struct_start: int, struct_char: str, closing_char: str) int
        -_calculate_nesting_level(content: str, position: int) int
    }

    class Delimiter {
        +start: str
        +end: str
        +type: str
    }

    class DelimiterMatch {
        +delimiter: Delimiter
        +start_pos: int
        +end_pos: int
        +nesting_level: int
    }

    DelimiterChunker ..> Delimiter : uses
    DelimiterChunker ..> DelimiterMatch : creates

    note for DelimiterChunker "_match_keyword_delimiters now uses a single combined regex pattern and delimiter_map for keyword matching"

File-Level Changes

Change Details Files
Optimize keyword delimiter matching by using a single combined regex pattern instead of iterating per keyword.
  • Add guard clause to early-return when the filtered keyword_delimiters list is empty.
  • Build a combined regex pattern using alternation over all keyword start strings with word boundaries and compile it once before scanning the content.
  • Create a mapping from keyword start strings to their corresponding delimiter objects.
  • Replace the outer loop over keyword_delimiters and per-keyword re.finditer calls with a single loop over combined_pattern.finditer, reusing the existing structural and nesting computations for each match.
src/codeweaver/engine/chunker/delimiter.py
Minor documentation cleanup in delimiter loading helper.
  • Correct spelling in _load_custom_delimiters docstring from 'normalised' to 'normalized' for the normalized_language parameter description.
src/codeweaver/engine/chunker/delimiter.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues, and left some high level feedback:

  • The new delimiter_map = {d.start: d ...} assumes keyword start strings are unique; if multiple delimiters can share the same start token, this will silently drop all but one, whereas the previous per-delimiter loop would have produced matches for each.
  • The combined regex changes the ordering of matches from grouped-by-delimiter to strictly text order; if any downstream logic relies on the previous grouping semantics, consider either preserving the old order or documenting the change explicitly.
  • You can slightly simplify/optimize by computing allowed = set(structural_pairs.keys()) once before the regex loop and reusing it in _find_next_structural_with_char instead of recreating the set on every match.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The new `delimiter_map = {d.start: d ...}` assumes keyword `start` strings are unique; if multiple delimiters can share the same start token, this will silently drop all but one, whereas the previous per-delimiter loop would have produced matches for each.
- The combined regex changes the ordering of `matches` from grouped-by-delimiter to strictly text order; if any downstream logic relies on the previous grouping semantics, consider either preserving the old order or documenting the change explicitly.
- You can slightly simplify/optimize by computing `allowed = set(structural_pairs.keys())` once before the regex loop and reusing it in `_find_next_structural_with_char` instead of recreating the set on every match.

## Individual Comments

### Comment 1
<location path="src/codeweaver/engine/chunker/delimiter.py" line_range="472" />
<code_context>
-            for match in re.finditer(pattern, content):
-                keyword_pos = match.start()
+        # Create a mapping to quickly look up the delimiter by its matched start string
+        delimiter_map = {d.start: d for d in keyword_delimiters}

-                # Skip if keyword is inside a string or comment
</code_context>
<issue_to_address>
**issue (bug_risk):** Handling duplicate keyword start strings may change behavior compared to the previous implementation.

Previously, each delimiter was processed independently with `re.finditer`, so duplicate `start` values would each be applied (possibly with different end behavior/metadata). With `delimiter_map = {d.start: d for d in keyword_delimiters}`, only the last delimiter for a given `start` is ever used. If duplicate `start` values are valid in this context, this changes semantics. Consider either enforcing unique `start` values up front and failing fast, or mapping each `start` to a list of delimiters to preserve the prior behavior.
</issue_to_address>
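The last-wins behavior this comment describes is easy to demonstrate with a minimal stand-in for the delimiter objects (the `Delim` class and kind strings here are illustrative, not the codebase's real types):

```python
from dataclasses import dataclass

@dataclass
class Delim:
    # Minimal stand-in for the real Delimiter type.
    start: str
    kind: str

delims = [Delim("type", "STRUCT"), Delim("type", "TYPE_ALIAS")]

# Dict comprehension: on a key collision, the later entry silently wins.
last_wins = {d.start: d for d in delims}
print(last_wins["type"].kind)  # → TYPE_ALIAS (the STRUCT entry is dropped)

# Preserving every delimiter per start string instead:
multi: dict[str, list[Delim]] = {}
for d in delims:
    multi.setdefault(d.start, []).append(d)
print([d.kind for d in multi["type"]])  # → ['STRUCT', 'TYPE_ALIAS']
```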

### Comment 2
<location path="src/codeweaver/engine/chunker/delimiter.py" line_range="477" />
<code_context>
-                struct_start, struct_char = self._find_next_structural_with_char(
-                    content,
-                    start=keyword_pos + len(delimiter.start),
-                    allowed=set(structural_pairs.keys()),
-                )
+            # Skip if keyword is inside a string or comment
</code_context>
<issue_to_address>
**suggestion (performance):** The `allowed` set is rebuilt on every match and could be hoisted out of the loop.

This recreates `set(structural_pairs.keys())` for every keyword occurrence. Since `structural_pairs` is constant in this method, compute `allowed_structurals = set(structural_pairs.keys())` once before `for match in combined_pattern.finditer(content):` and pass `allowed=allowed_structurals` here to avoid repeated allocation in the hot inner loop.

Suggested implementation:

```python
        allowed_structurals = set(structural_pairs.keys())

        for match in combined_pattern.finditer(content):
            matched_text = match.group(0)
            delimiter = delimiter_map[matched_text]
            keyword_pos = match.start()

```

Within the same method where this loop lives, find the call(s) to:
`self._find_next_structural_with_char(...)` that currently pass `allowed=set(structural_pairs.keys())` and change those arguments to `allowed=allowed_structurals` to reuse the precomputed set.
If there are multiple such calls (e.g., for different delimiters or branches), ensure all are updated to use `allowed_structurals`.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

```python
for match in re.finditer(pattern, content):
    keyword_pos = match.start()
# Create a mapping to quickly look up the delimiter by its matched start string
delimiter_map = {d.start: d for d in keyword_delimiters}
```


issue (bug_risk): Handling duplicate keyword start strings may change behavior compared to the previous implementation.

Previously, each delimiter was processed independently with re.finditer, so duplicate start values would each be applied (possibly with different end behavior/metadata). With delimiter_map = {d.start: d for d in keyword_delimiters}, only the last delimiter for a given start is ever used. If duplicate start values are valid in this context, this changes semantics. Consider either enforcing unique start values up front and failing fast, or mapping each start to a list of delimiters to preserve the prior behavior.

```python
for match in combined_pattern.finditer(content):
    matched_text = match.group(0)
    delimiter = delimiter_map[matched_text]
    keyword_pos = match.start()
```


suggestion (performance): The allowed set is rebuilt on every match and could be hoisted out of the loop.

This recreates set(structural_pairs.keys()) for every keyword occurrence. Since structural_pairs is constant in this method, compute allowed_structurals = set(structural_pairs.keys()) once before for match in combined_pattern.finditer(content): and pass allowed=allowed_structurals here to avoid repeated allocation in the hot inner loop.

Suggested implementation:

```python
        allowed_structurals = set(structural_pairs.keys())

        for match in combined_pattern.finditer(content):
            matched_text = match.group(0)
            delimiter = delimiter_map[matched_text]
            keyword_pos = match.start()
```

Within the same method where this loop lives, find the call(s) to:
self._find_next_structural_with_char(...) that currently pass allowed=set(structural_pairs.keys()) and change those arguments to allowed=allowed_structurals to reuse the precomputed set.
If there are multiple such calls (e.g., for different delimiters or branches), ensure all are updated to use allowed_structurals.


Copilot AI left a comment


Pull request overview

Optimizes DelimiterChunker._match_keyword_delimiters by scanning the input text once with a combined keyword regex instead of running re.finditer per keyword, aiming to reduce chunking overhead on large files.

Changes:

  • Combine all keyword-start strings into a single compiled regex and map matches back to delimiters.
  • Add a guard to avoid compiling an empty keyword regex when keyword delimiters are filtered down to none.
  • Fix a docstring spelling (“normalised” → “normalized”).


Comment on lines +468 to +477
```python
start_strings = [d.start for d in keyword_delimiters]
combined_pattern = re.compile(rf"\b(?:{'|'.join(map(re.escape, start_strings))})\b")

for match in re.finditer(pattern, content):
    keyword_pos = match.start()
# Create a mapping to quickly look up the delimiter by its matched start string
delimiter_map = {d.start: d for d in keyword_delimiters}

# Skip if keyword is inside a string or comment
if self._is_inside_string_or_comment(content, keyword_pos):
    continue
for match in combined_pattern.finditer(content):
    matched_text = match.group(0)
    delimiter = delimiter_map[matched_text]
    keyword_pos = match.start()
```

Copilot AI Mar 27, 2026


delimiter_map = {d.start: d ...} assumes keyword starts are unique. In this codebase they are not (e.g., STRUCT_PATTERN and TYPE_ALIAS_PATTERN both include start string "type" via families.py), so this change will drop one delimiter and change which kind/priority wins during overlap resolution. Consider mapping start -> list[Delimiter] (or selecting the highest-priority delimiter per start) and emitting matches for each delimiter associated with the matched start string to preserve prior behavior.

Comment on lines +465 to +472
```python
# Optimization: Combine all keyword start strings into a single compiled regex pattern.
# This allows us to make a single pass over the content rather than iterating over
# `re.finditer` for each keyword delimiter individually, significantly reducing overhead.
start_strings = [d.start for d in keyword_delimiters]
combined_pattern = re.compile(rf"\b(?:{'|'.join(map(re.escape, start_strings))})\b")

for match in re.finditer(pattern, content):
    keyword_pos = match.start()
# Create a mapping to quickly look up the delimiter by its matched start string
delimiter_map = {d.start: d for d in keyword_delimiters}
```

Copilot AI Mar 27, 2026


This optimization changes how keyword delimiters are associated to matches. There are currently unit tests for delimiter chunking behavior, but none appear to cover the case where multiple keyword delimiters share the same start (e.g., type being both STRUCT and TYPE_ALIAS in family patterns). Adding a test that asserts the resolved boundary kind/priority for such an input would help prevent regressions here.

@github-actions

Code Review - PR #281: Optimize keyword delimiter matching

Summary

This PR implements a valuable performance optimization by consolidating multiple regex searches into a single pass. The core idea is sound and addresses a real O(K × N) → O(N) performance improvement. However, there are a few considerations to address.


✅ Strengths

  1. Excellent optimization approach: Combining all keyword patterns into a single regex with alternation ((?:pattern1|pattern2|...)) is the correct approach for reducing redundant text scanning.

  2. Good documentation: The inline comments clearly explain the optimization rationale and approach.

  3. Correctness preserved: The logic flow remains identical to the original implementation - the optimization is purely algorithmic, not behavioral.

  4. Early exit guard: The added early return for empty keyword_delimiters list (line 454-455) prevents potential issues with compiling an empty regex pattern.

  5. Minor fix: The typo correction in the docstring (`normalised` → `normalized`) improves consistency.


🔍 Issues & Suggestions

1. Performance issue: Set allocation in hot loop ⚠️

Location: src/codeweaver/engine/chunker/delimiter.py:487

The suggested optimization in the PR comment is valid and should be implemented. Currently, set(structural_pairs.keys()) is being recreated on every iteration of the match loop, which partially negates the performance benefits of the optimization.

Current code:

```python
for match in combined_pattern.finditer(content):
    # ...
    struct_start, struct_char = self._find_next_structural_with_char(
        content,
        start=keyword_pos + len(delimiter.start),
        allowed=set(structural_pairs.keys()),  # ❌ Recreated every iteration
    )
```

Recommended fix:

```python
# Define structural delimiters that can complete keywords
structural_pairs = {
    "{": "}",
    ":": "\n",
    "=>": "",
}
allowed_structurals = set(structural_pairs.keys())  # ✅ Compute once

# ... pattern compilation ...

for match in combined_pattern.finditer(content):
    # ...
    struct_start, struct_char = self._find_next_structural_with_char(
        content,
        start=keyword_pos + len(delimiter.start),
        allowed=allowed_structurals,  # ✅ Reuse
    )
```

2. Edge case: Duplicate keyword starts

Severity: Low (likely not an issue in practice)

The delimiter_map dictionary uses d.start as the key:

```python
delimiter_map = {d.start: d for d in keyword_delimiters}
```

If multiple Delimiter objects share the same .start string but differ in other attributes (e.g., kind, priority, end), only the last one will be retained in the map due to dictionary key collision.

Questions:

  • Is it possible for keyword_delimiters to contain multiple delimiters with the same start string?
  • If so, should all such delimiters be matched, or is taking the last one acceptable?

If this is a concern, consider using a dict mapping to a list:

```python
from collections import defaultdict

delimiter_map = defaultdict(list)
for d in keyword_delimiters:
    delimiter_map[d.start].append(d)

# Then in the loop:
for match in combined_pattern.finditer(content):
    matched_text = match.group(0)
    for delimiter in delimiter_map[matched_text]:
        ...  # process each delimiter
```

However, this may be over-engineering if the codebase guarantees unique keyword starts.


🧪 Testing

The PR description mentions running:

```shell
mise //:test tests/unit/engine/chunker/test_delimiter_patterns.py
```

Recommendations:

  1. Ensure tests cover cases with multiple keyword delimiters
  2. Verify edge cases like empty delimiter lists (already handled)
  3. Consider adding a performance benchmark test to validate the O(N) improvement

📊 Overall Assessment

Approval status: ⏸️ Conditional approval

This is a solid optimization that significantly improves performance for code with many keyword delimiters. The approach is correct and the code is well-documented.

Required changes:

  1. Must fix: Hoist set(structural_pairs.keys()) out of the loop (as suggested in comment #3000778585)

Optional improvements:
2. Consider edge case handling for duplicate keyword starts (if applicable)
3. Verify test coverage for the optimization

Once the set allocation issue is addressed, this PR will be ready to merge.


🎯 Recommendation

Please apply the suggested fix from comment #3000778585 to complete the optimization. The change is straightforward and will ensure the performance benefits are fully realized.

Great work on identifying and implementing this optimization! 🚀

@github-actions

Review of PR #281: Optimize keyword delimiter matching

Thanks for this performance optimization! The approach of combining keyword delimiters into a single regex pattern is sound and should indeed provide O(N) performance instead of O(K × N). However, I've identified a critical correctness bug that needs to be addressed before merging.

🐛 Critical Bug: Dictionary Mapping Loses Duplicate Start Strings

Issue Location: src/codeweaver/engine/chunker/delimiter.py:472

The new implementation uses a dictionary to map start strings to delimiter objects:

```python
delimiter_map = {d.start: d for d in keyword_delimiters}
```

Problem: When multiple delimiters have the same `start` string but different semantic properties (kind, priority), the dictionary only keeps the last one, silently dropping the others.

Real-world example: The codebase defines:

  • `STRUCT_PATTERN` with starts `["struct", "type"]` (kind: STRUCT)
  • `TYPE_ALIAS_PATTERN` with starts `["type", "typedef", "typealias"]` (kind: TYPE_ALIAS)

When the code encounters the keyword `type`, the old implementation would:

  • ✅ Create matches for BOTH STRUCT and TYPE_ALIAS delimiters
  • ✅ Allow the priority system to resolve which interpretation is correct

The new implementation will:

  • ❌ Only use ONE of the two delimiters (whichever is last in the list)
  • ❌ Miss valid matches for the other delimiter type

🔧 Suggested Fix

Replace the simple dictionary with a multi-value mapping:

```python
# Create a mapping to look up ALL delimiters by their matched start string
from collections import defaultdict

delimiter_map = defaultdict(list)
for d in keyword_delimiters:
    delimiter_map[d.start].append(d)
```

Then update the matching loop to iterate through all matching delimiters:

```python
for match in combined_pattern.finditer(content):
    matched_text = match.group(0)
    matching_delimiters = delimiter_map[matched_text]  # Get all delimiters with this start

    for delimiter in matching_delimiters:  # Process each one
        keyword_pos = match.start()

        # Skip if keyword is inside a string or comment
        if self._is_inside_string_or_comment(content, keyword_pos):
            continue

        # ... rest of the logic remains the same
```

✅ What's Good

  1. Performance optimization is well-motivated - The O(K × N) → O(N) improvement is significant for large files
  2. Guard clause for empty list - Good defensive programming (line 454)
  3. Code is well-documented - Clear comments explaining the optimization
  4. Maintains single regex compilation - Efficient approach

📋 Additional Recommendations

  1. Add a test case that verifies behavior when multiple delimiters share the same start string (e.g., test "type" keyword matching both STRUCT and TYPE_ALIAS)
  2. Consider deduplication - The `start_strings` list at line 468 could have duplicates. Use `list(set(...))` or `list(dict.fromkeys(...))` to deduplicate while preserving order
  3. Minor: The docstring typo fix (`normalised` → `normalized`) at line 1293 is good, but unrelated to this PR
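For recommendation 2, an order-preserving deduplication can be done with `dict.fromkeys` (the sample start strings below are illustrative):

```python
start_strings = ["type", "struct", "type", "typedef"]  # hypothetical duplicates

# set() would lose order; dict.fromkeys keeps first-seen order (Python 3.7+).
deduped = list(dict.fromkeys(start_strings))
print(deduped)  # → ['type', 'struct', 'typedef']
```

Deduplicating the alternation branches shrinks the compiled pattern without changing which positions match.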

🎯 Verdict

Request Changes - The correctness bug must be fixed before merging. The optimization approach is excellent, but the implementation needs to handle multiple delimiters per start string correctly.


Test Coverage: The existing test file `tests/unit/engine/chunker/test_delimiter_patterns.py` doesn't appear to have specific tests for `_match_keyword_delimiters` behavior. Consider adding integration tests that verify "type" keyword matches both STRUCT and TYPE_ALIAS patterns.

@github-actions

Code Review

I've reviewed this PR and identified a critical bug in the optimization that needs to be addressed before merging.

🐛 Critical Issue: Duplicate Start String Handling

Location: src/codeweaver/engine/chunker/delimiter.py:472

The optimization introduces a bug when multiple keyword delimiters share the same start string. The current implementation creates a dictionary mapping:

```python
delimiter_map = {d.start: d for d in keyword_delimiters}
```

Problem: When multiple delimiters have the same start value, only the last one is retained in the dictionary, causing earlier delimiters to be lost.

Concrete Example:
Looking at src/codeweaver/engine/chunker/delimiters/patterns.py:

  • Line 146: STRUCT_PATTERN = DelimiterPattern(starts=["struct", "type"], ...) with kind DelimiterKind.STRUCT
  • Line 161: TYPE_ALIAS_PATTERN = DelimiterPattern(starts=["type", "typedef", "typealias"], ...) with kind DelimiterKind.TYPE_ALIAS

Both patterns include "type" as a start keyword but with different kinds (STRUCT vs TYPE_ALIAS) and potentially different priorities. With the current optimization, when the code encounters the keyword type, only one of these delimiters will be considered—whichever was processed last when building the map.

🔧 Suggested Fix

The delimiter map needs to support multiple delimiters per start string. Here are two approaches:

Option 1: Map to list of delimiters

```python
from collections import defaultdict

delimiter_map = defaultdict(list)
for d in keyword_delimiters:
    delimiter_map[d.start].append(d)

for match in combined_pattern.finditer(content):
    matched_text = match.group(0)
    matching_delimiters = delimiter_map[matched_text]

    for delimiter in matching_delimiters:
        keyword_pos = match.start()
        # ... rest of the matching logic
```

Option 2: Include all duplicates in regex and match all
Keep the combined pattern but process all delimiters that match the found text:

```python
for match in combined_pattern.finditer(content):
    matched_text = match.group(0)
    matching_delimiters = [d for d in keyword_delimiters if d.start == matched_text]

    for delimiter in matching_delimiters:
        ...  # matching logic
```

Option 1 is more efficient as it pre-builds the lookup structure.

✅ Positive Aspects

  1. Performance optimization is sound in principle - combining regex patterns into a single pass is a valid optimization technique
  2. Good documentation - the inline comments clearly explain the optimization intent
  3. Guard clause added - the early return for empty keyword_delimiters is a good addition
  4. Spelling fix - correcting "normalised" to "normalized" in the docstring (line 1293)

📋 Testing Recommendation

As mentioned in the Copilot review comment, this PR should include a test that validates behavior when multiple keyword delimiters share the same start string. Suggested test case:

def test_keyword_delimiters_with_duplicate_starts():
    """Test that multiple delimiters with same start string are all matched."""
    # Test code that uses 'type' keyword which appears in both STRUCT and TYPE_ALIAS patterns
    code = """
    type MyStruct struct {
        field string
    }
    
    type MyAlias = string
    """
    
    # Verify that both STRUCT and TYPE_ALIAS boundaries are detected
    # and resolved based on context/priority

🎯 Recommendation

Request Changes - The bug must be fixed before merging. Once the duplicate start handling is corrected and appropriate tests are added, this optimization will be a valuable improvement to the codebase.

📊 Impact Assessment

  • Correctness: ❌ Critical bug affecting delimiter matching
  • Performance: ✅ Optimization approach is sound
  • Test Coverage: ⚠️ Missing tests for edge case
  • Code Quality: ✅ Well-documented, clean code style
  • Security: ✅ No security concerns

Next Steps:

  1. Fix the delimiter_map to handle multiple delimiters with the same start string
  2. Add test coverage for this edge case
  3. Verify all existing tests still pass
  4. Re-run performance benchmarks to confirm optimization still delivers expected gains

@github-actions
Contributor

Code Review

I've reviewed this PR and found a critical bug in the optimization that will cause incorrect behavior when there are multiple keyword delimiters with the same start string.

Issue: Dictionary Comprehension Loses Duplicate Start Strings

On line 472 in src/codeweaver/engine/chunker/delimiter.py:

delimiter_map = {d.start: d for d in keyword_delimiters}

This dictionary comprehension will silently drop all but the last delimiter when multiple delimiters share the same start string. For example, if you have:

  • Delimiter 1: start="function", end="{", priority=70
  • Delimiter 2: start="function", end="=>", priority=65

The map will only contain Delimiter 2, and Delimiter 1 will be completely lost. This breaks the original behavior where each delimiter was processed independently.
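Reproducing this with the two "function" delimiters described above (using hypothetical dict stand-ins for the real Delimiter objects):

```python
# Hypothetical stand-ins for the two delimiters in the example above.
delims = [
    {"start": "function", "end": "{", "priority": 70},
    {"start": "function", "end": "=>", "priority": 65},
]

# The dict comprehension from the PR keeps only the last duplicate.
delimiter_map = {d["start"]: d for d in delims}

# Delimiter 1 (end="{") is gone; only Delimiter 2 remains.
print(delimiter_map["function"]["end"])  # =>
print(len(delimiter_map))                # 1
```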

The Fix

To maintain correctness while keeping the performance optimization, you need to use a multi-value map. Here's the corrected approach:

# Create a mapping to quickly look up delimiters by their matched start string
# Use a list to handle multiple delimiters with the same start string
from collections import defaultdict
delimiter_map: dict[str, list[Delimiter]] = defaultdict(list)
for d in keyword_delimiters:
    delimiter_map[d.start].append(d)

for match in combined_pattern.finditer(content):
    matched_text = match.group(0)
    keyword_pos = match.start()

    # Skip if the keyword is inside a string or comment
    # (this depends only on the position, so check once per match)
    if self._is_inside_string_or_comment(content, keyword_pos):
        continue

    # Process ALL delimiters that match this start string
    for delimiter in delimiter_map[matched_text]:
        # ... rest of the logic

Additional Observations

Positives:
✅ The core optimization idea is sound - combining regexes into a single pattern is the right approach
✅ Code is well-commented explaining the optimization rationale
✅ Guard clause for empty keyword_delimiters is good
✅ Minor fix: corrected spelling "normalised" → "normalized" (line 1293)

Concerns:
⚠️ The experimental Python 3.13t and 3.14t tests are failing (though these are experimental free-threaded builds)
⚠️ CI logs aren't available yet to determine if failures are related to this change

Recommendation

Request Changes - The duplicate key issue must be fixed before merging. The optimization is valuable, but correctness cannot be compromised.

Once fixed, this will be an excellent performance improvement for keyword delimiter matching.

@aiedwardyi
Contributor

I pulled this down locally to address the duplicate-keyword-start bug from the review. I don't have push access to bolt-optimize-chunker-delimiter-7248057790755398651, so I opened #282 with the minimal fix plus a regression test.

The change preserves all delimiters that share the same keyword start instead of collapsing them to the last map entry, which restores the original per-delimiter behavior while keeping the single-pass regex optimization. If you'd rather keep everything on #281, feel free to cherry-pick 196c3ad5.

@bashandbone
Contributor Author

closing in favor of #282

@bashandbone bashandbone reopened this Mar 28, 2026
@google-labs-jules
Contributor

closing in favor of #282

Understood. Acknowledging that this work is now obsolete and stopping work on this task.

