fix: preserve duplicate keyword delimiter matches by aiedwardyi · Pull Request #282 · knitli/codeweaver

aiedwardyi · 2026-03-28T08:17:13Z

Follows up on the duplicate-keyword-start bug reported on #281.

What changed

Preserve all delimiters that share the same keyword start instead of collapsing them to the last map entry.
Deduplicate the regex alternation inputs without changing per-delimiter behavior.
Add a regression test covering duplicate keyword starts.
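
For reviewers who want to see the collapse in isolation, here is a minimal sketch. The `Delimiter` dataclass and the kind strings are hypothetical stand-ins, not the project's real types; the only point is the difference between a one-value-per-key map and a map of lists.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Delimiter:
    """Hypothetical stand-in for the project's Delimiter type."""

    start: str
    kind: str


delimiters = [
    Delimiter(start="type", kind="STRUCT"),
    Delimiter(start="type", kind="TYPE_ALIAS"),
    Delimiter(start="func", kind="FUNCTION"),
]

# Collapsing form: a later delimiter overwrites an earlier one that has
# the same start string, so only one entry survives per keyword.
collapsed = {d.start: d for d in delimiters}

# Preserving form: every delimiter is kept, grouped under its start string.
grouped = defaultdict(list)
for d in delimiters:
    grouped[d.start].append(d)

# Order-preserving de-duplication of the strings fed to the alternation.
unique_starts = list(dict.fromkeys(d.start for d in delimiters))
```

With the collapsing form, a lookup for `type` can only ever yield one delimiter; with the grouped form, both are available when the keyword matches.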

Validation

Direct execution of _match_keyword_delimiters() in an isolated environment returned both STRUCT and TYPE_ALIAS matches for the same `type` keyword.
py_compile passed for the touched files.

Note

I don't have push access to the original bolt-optimize-chunker-delimiter-7248057790755398651 branch, so this follow-up PR carries the minimal fix needed to make that optimization mergeable.

Summary by Sourcery

Preserve behavior for duplicate keyword-start delimiters while keeping the optimized single-regex matching path.

Bug Fixes:

Ensure keyword delimiters sharing the same start string all produce matches instead of collapsing to the last delimiter.

Enhancements:

Optimize keyword delimiter matching by using a single combined regex with deduplicated start strings and grouping delimiters by keyword.
Return early from keyword matching when no non-empty-start keyword delimiters are present.
Fix a minor spelling issue in the custom delimiter loading docstring.

Tests:

Add a regression test ensuring duplicate keyword starts preserve all corresponding delimiter matches.

This commit modifies `_match_keyword_delimiters` in `src/codeweaver/engine/chunker/delimiter.py` to improve chunking performance. Instead of calling `re.finditer` once per keyword delimiter, the optimization combines all start strings into a single compiled regex pattern. This reduces regex execution overhead and limits the algorithm to a single pass over the content. Additionally, an early return handles an empty delimiter list, so the function never compiles an empty pattern, which would match the empty string at positions that are not keywords.

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
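
To illustrate why the early return matters, the sketch below builds a combined pattern from a list of start strings and shows what an empty alternation does. The helper name `build_keyword_pattern` and the exact pattern shape are assumptions for illustration; the real function may assemble its regex differently.

```python
import re


def build_keyword_pattern(starts):
    """Return one compiled pattern for all non-empty starts, or None.

    Hypothetical helper: it drops empty strings and duplicates, and
    refuses to build a pattern when nothing is left.
    """
    unique = [s for s in dict.fromkeys(starts) if s]
    if not unique:
        return None  # early return instead of compiling an empty alternation
    alternation = "|".join(re.escape(s) for s in unique)
    return re.compile(r"\b(?:" + alternation + r")\b")


# What the guard protects against: an empty alternation still compiles,
# but it matches the empty string at every word boundary.
degenerate = re.compile(r"\b(?:)\b")
degenerate_hits = [m.start() for m in degenerate.finditer("type A")]

# A real pattern only reports actual keyword positions.
pattern = build_keyword_pattern(["type", "func", "type"])
keyword_hits = [(m.start(), m.group(0)) for m in pattern.finditer("type A")]
```

Returning `None` (or returning the existing matches unchanged, as the PR does) keeps the caller from iterating over a stream of empty matches.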

sourcery-ai · 2026-03-28T08:17:25Z

Reviewer's Guide

Optimizes and corrects keyword delimiter matching so that duplicate keyword start strings preserve all associated delimiters, while adding a regression test and a minor docstring typo fix.

Sequence diagram for optimized keyword delimiter matching

sequenceDiagram
    participant Caller as Chunker
    participant M as _match_keyword_delimiters
    participant R as re
    participant S as _is_inside_string_or_comment
    participant F as _find_next_structural_with_char

    Caller->>M: _match_keyword_delimiters(content, keyword_delimiters, matches)
    M->>M: Filter keyword_delimiters where d.start is truthy
    alt no keyword_delimiters
        M-->>Caller: return matches
    else keyword_delimiters present
        M->>R: compile combined_pattern from unique delimiter.start values
        M->>M: Build delimiter_map[start_string] -> list[Delimiter]
        loop for each match in combined_pattern.finditer(content)
            M->>R: combined_pattern.finditer(content)
            R-->>M: match with matched_text and keyword_pos
            M->>S: _is_inside_string_or_comment(content, keyword_pos)
            S-->>M: bool is_inside
            alt is_inside
                M->>M: continue to next match
            else not inside
                loop for each delimiter in delimiter_map[matched_text]
                    M->>F: _find_next_structural_with_char(content, keyword_pos, structural_pairs)
                    F-->>M: struct_start, struct_char
                    M->>M: Evaluate delimiter rules and update matches
                end
            end
        end
        M-->>Caller: return matches
    end
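
The flow in the diagram can be approximated in a few lines. This is a simplified sketch under stated assumptions: `Delimiter`, `is_inside_line_comment`, and `match_keywords` are hypothetical names, the comment check is a crude stand-in for `_is_inside_string_or_comment`, and the structural lookup done by `_find_next_structural_with_char` is omitted entirely.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class Delimiter(NamedTuple):
    """Hypothetical stand-in for the project's Delimiter type."""

    start: str
    kind: str


def is_inside_line_comment(content: str, pos: int) -> bool:
    """Crude stand-in: True if a '#' appears earlier on the same line."""
    line_start = content.rfind("\n", 0, pos) + 1
    return "#" in content[line_start:pos]


def match_keywords(content: str, delimiters: list[Delimiter]) -> list[tuple[int, str]]:
    # Filter out delimiters without a start string, then bail out early.
    keyword_delimiters = [d for d in delimiters if d.start]
    if not keyword_delimiters:
        return []

    # Group delimiters by start; the dict keys double as the unique starts.
    by_start: dict[str, list[Delimiter]] = defaultdict(list)
    for d in keyword_delimiters:
        by_start[d.start].append(d)

    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(s) for s in by_start) + r")\b"
    )

    matches = []
    for m in pattern.finditer(content):  # single pass over the content
        if is_inside_line_comment(content, m.start()):
            continue
        # Fan out: one result per delimiter that shares this keyword.
        for d in by_start[m.group(0)]:
            matches.append((m.start(), d.kind))
    return matches


source = "# type in a comment\ntype Point struct {}\n"
result = match_keywords(
    source,
    [Delimiter("type", "STRUCT"), Delimiter("type", "TYPE_ALIAS")],
)
```

The keyword inside the comment is skipped, and the one on the second line yields an entry for each delimiter registered under `type`.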

File-Level Changes

Change	Details	Files
Preserve all delimiters when multiple keyword delimiters share the same start string while using a single optimized regex scan.	Short-circuit and return early when the filtered keyword delimiter list is empty. Build a de-duplicated list of keyword start strings and compile a single combined regex pattern using word boundaries and escaped starts. Create a mapping from each start string to its list of corresponding Delimiter objects to handle duplicates. Iterate over matches from the combined regex, skip those inside strings or comments, then apply the original structural matching logic for every delimiter associated with the matched start text.	`src/codeweaver/engine/chunker/delimiter.py`
Add regression coverage for duplicate keyword start delimiters and import necessary test dependencies.	Introduce a new unit test that defines two Delimiter instances with the same 'type' start but different kinds and asserts both matches are returned. Import DelimiterKind and Delimiter in the test module to construct the test delimiters.	`tests/unit/engine/chunker/test_delimiter_edge_cases.py`
Fix a minor spelling issue in a docstring parameter description.	Change 'normalised' to 'normalized' in the _load_custom_delimiters docstring parameter description.	`src/codeweaver/engine/chunker/delimiter.py`

github-actions · 2026-03-28T08:17:28Z

👋 Hey @aiedwardyi,

Thanks for your contribution to codeweaver! 🧵

You need to agree to the CLA first... 🖊️

Before we can accept your contribution, you need to agree to our Contributor License Agreement (CLA).

To agree to the CLA, please comment:

I read the contributors license agreement and I agree to it.

Those exact words are important¹, so please don't change them. 😉

You can read the full CLA here: Contributor License Agreement

✅ @aiedwardyi has signed the CLA.

You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

¹ Our bot needs those exact words to recognize that you agree to the CLA.

sourcery-ai

Hey - I've reviewed your changes and they look great!

Copilot

Pull request overview

Fixes the regression introduced by the single-regex keyword matching optimization so that multiple delimiters sharing the same keyword start are all preserved (instead of being overwritten), and adds a regression test for the duplicate-keyword-start case reported in #281.

Changes:

Update _match_keyword_delimiters() to group keyword delimiters by start and emit matches for all delimiters that share a matched keyword.
Deduplicate keyword start strings when building the combined regex alternation to avoid repeated alternation entries.
Add a unit regression test ensuring duplicate keyword starts (e.g., type) return multiple delimiter matches.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
`src/codeweaver/engine/chunker/delimiter.py`	Preserves all delimiters for duplicate keyword starts by mapping matched text to a list of delimiters; also avoids compiling an empty/degenerate combined pattern after filtering.
`tests/unit/engine/chunker/test_delimiter_edge_cases.py`	Adds regression coverage asserting duplicate keyword starts produce multiple matches with distinct `DelimiterKind`s.

codecov · 2026-03-28T08:31:53Z

Codecov Report

❌ Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/codeweaver/engine/chunker/delimiter.py	91.66%	1 Missing ⚠️

hoist structural_pairs construction above the loop, use frozenset for thread safety Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>

bashandbone

Thanks for the PR @aiedwardyi -- merging. I'll close #281

google-labs-jules bot and others added 2 commits March 27, 2026 12:40

fix: preserve duplicate keyword delimiters

196c3ad

Copilot AI review requested due to automatic review settings March 28, 2026 08:17

Copilot started reviewing on behalf of aiedwardyi March 28, 2026 08:17

aiedwardyi mentioned this pull request Mar 28, 2026

⚡ Bolt: [performance improvement] Optimize keyword delimiter matching #281

Closed

sourcery-ai bot reviewed Mar 28, 2026

Copilot AI reviewed Mar 28, 2026

Optimize allowed_keys assignment for structural pairs

bcce8bd

bashandbone approved these changes Mar 28, 2026

bashandbone merged commit d3ebb33 into knitli:main Mar 28, 2026
11 of 15 checks passed

github-actions bot locked and limited conversation to collaborators Mar 28, 2026

aiedwardyi deleted the fix/pr-281-duplicate-keyword-delimiters branch March 29, 2026 06:20

bashandbone merged 3 commits into knitli:main from aiedwardyi:fix/pr-281-duplicate-keyword-delimiters
