fix: preserve duplicate keyword delimiter matches #282

Merged
bashandbone merged 3 commits into knitli:main from aiedwardyi:fix/pr-281-duplicate-keyword-delimiters
Mar 28, 2026

@aiedwardyi aiedwardyi commented Mar 28, 2026

Follows up on the duplicate-keyword-start bug reported on #281.

What changed

  • preserve all delimiters that share the same keyword start instead of collapsing them to the last map entry
  • deduplicate regex alternation inputs without changing per-delimiter behavior
  • add a regression test covering duplicate keyword starts
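
The collapse described in the first two bullets can be illustrated with a minimal sketch. The `Delimiter` dataclass and the string `kind` values below are simplified stand-ins assumed for illustration; the project's real class and its `DelimiterKind` enum are not shown on this page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Delimiter:
    """Hypothetical stand-in; the real Delimiter has more fields."""
    start: str
    kind: str


delimiters = [
    Delimiter(start="type", kind="STRUCT"),
    Delimiter(start="type", kind="TYPE_ALIAS"),
    Delimiter(start="func", kind="FUNCTION"),
]

# Buggy shape: one delimiter per start string, so a later entry
# silently overwrites an earlier one that shares the same keyword.
collapsed = {d.start: d for d in delimiters}

# Fixed shape: every delimiter is kept, grouped under its start string.
grouped = defaultdict(list)
for d in delimiters:
    grouped[d.start].append(d)

# Order-preserving deduplication of start strings for the regex alternation.
unique_starts = list(dict.fromkeys(d.start for d in delimiters))
```

With the one-to-one map, only the `TYPE_ALIAS` entry survives for `type`; the grouped map keeps both, while the alternation still lists `type` once.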

Validation

  • direct execution of _match_keyword_delimiters() in an isolated env returned both STRUCT and TYPE_ALIAS matches for the same type keyword
  • py_compile passed for the touched files

Note

I don't have push access to the original bolt-optimize-chunker-delimiter-7248057790755398651 branch, so this follow-up PR carries the minimal fix needed to make that optimization mergeable.

Summary by Sourcery

Preserve behavior for duplicate keyword-start delimiters while keeping the optimized single-regex matching path.

Bug Fixes:

  • Ensure keyword delimiters sharing the same start string all produce matches instead of collapsing to the last delimiter.

Enhancements:

  • Optimize keyword delimiter matching by using a single combined regex with deduplicated start strings and grouping delimiters by keyword.
  • Return early from keyword matching when no non-empty-start keyword delimiters are present.
  • Fix a minor spelling issue in the custom delimiter loading docstring.

Tests:

  • Add a regression test ensuring duplicate keyword starts preserve all corresponding delimiter matches.

google-labs-jules bot and others added 2 commits March 27, 2026 12:40
This commit modifies `_match_keyword_delimiters` in `src/codeweaver/engine/chunker/delimiter.py` to significantly improve chunking performance.

Instead of calling `re.finditer` for every individual keyword delimiter, the optimization combines all start strings into a single compiled regex pattern. This reduces regex execution overhead and limits the algorithm to making a single pass over the content. Additionally, an early return checks for empty lists to prevent compiling a dangerous empty regex.
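
A rough sketch of the single-pattern idea, assuming word-boundary anchors and escaped start strings as the reviewer's guide describes. The function name `find_keyword_positions` is hypothetical and is not part of `delimiter.py`.

```python
import re


def find_keyword_positions(content: str, starts: list[str]) -> list[tuple[int, str]]:
    """Return (position, keyword) pairs found in one pass over content."""
    # Drop empty strings and duplicates while preserving order.
    unique = list(dict.fromkeys(s for s in starts if s))
    if not unique:
        # An empty alternation would still compile, but it would match the
        # empty string at word boundaries instead of matching any keyword.
        return []
    combined = re.compile(
        r"\b(?:" + "|".join(re.escape(s) for s in unique) + r")\b"
    )
    return [(m.start(), m.group(0)) for m in combined.finditer(content)]


positions = find_keyword_positions(
    "type Foo struct {}\nfunc main() {}", ["type", "func", "type", ""]
)
```

The early return matters because `re.compile(r"\b(?:)\b")` is a valid pattern that yields empty matches rather than raising an error, so the failure would be silent.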

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 28, 2026 08:17

sourcery-ai bot commented Mar 28, 2026

Reviewer's Guide

Optimizes and corrects keyword delimiter matching so that duplicate keyword start strings preserve all associated delimiters, while adding a regression test and a minor docstring typo fix.

Sequence diagram for optimized keyword delimiter matching

sequenceDiagram
    participant Caller as Chunker
    participant M as _match_keyword_delimiters
    participant R as re
    participant S as _is_inside_string_or_comment
    participant F as _find_next_structural_with_char

    Caller->>M: _match_keyword_delimiters(content, keyword_delimiters, matches)
    M->>M: Filter keyword_delimiters where d.start is truthy
    alt no keyword_delimiters
        M-->>Caller: return matches
    else keyword_delimiters present
        M->>R: compile combined_pattern from unique delimiter.start values
        M->>M: Build delimiter_map[start_string] -> list[Delimiter]
        loop for each match in combined_pattern.finditer(content)
            M->>R: combined_pattern.finditer(content)
            R-->>M: match with matched_text and keyword_pos
            M->>S: _is_inside_string_or_comment(content, keyword_pos)
            S-->>M: bool is_inside
            alt is_inside
                M->>M: continue to next match
            else not inside
                loop for each delimiter in delimiter_map[matched_text]
                    M->>F: _find_next_structural_with_char(content, keyword_pos, structural_pairs)
                    F-->>M: struct_start, struct_char
                    M->>M: Evaluate delimiter rules and update matches
                end
            end
        end
        M-->>Caller: return matches
    end
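
The flow in the diagram above could be sketched roughly as follows. This is not the project's implementation: `is_inside_line_comment` is a toy replacement for `_is_inside_string_or_comment` that only recognizes `#` comments, the structural lookup done by `_find_next_structural_with_char` is omitted, and `Delimiter` is a simplified stand-in.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Delimiter:
    """Hypothetical stand-in; the real Delimiter has more fields."""
    start: str
    kind: str


def is_inside_line_comment(content: str, pos: int) -> bool:
    """Toy check: is there a '#' earlier on the same line as pos?"""
    line_start = content.rfind("\n", 0, pos) + 1
    return "#" in content[line_start:pos]


def match_keywords(content: str, delimiters: list[Delimiter]) -> list[tuple[int, str]]:
    # Keep only delimiters with a non-empty start, then return early if none remain.
    keyword_delimiters = [d for d in delimiters if d.start]
    if not keyword_delimiters:
        return []

    # Group delimiters by start string so duplicates are preserved.
    by_start = defaultdict(list)
    for d in keyword_delimiters:
        by_start[d.start].append(d)

    # The dict keys are already unique, so each keyword appears once in the pattern.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(s) for s in by_start) + r")\b"
    )

    matches = []
    for m in pattern.finditer(content):
        if is_inside_line_comment(content, m.start()):
            continue
        # The real code evaluates structural rules per delimiter here (omitted).
        for d in by_start[m.group(0)]:
            matches.append((m.start(), d.kind))
    return matches


found = match_keywords(
    "# type in a comment\ntype Foo = Bar\n",
    [Delimiter("type", "STRUCT"), Delimiter("type", "TYPE_ALIAS")],
)
```

Each keyword hit outside a comment produces one entry per delimiter registered for that keyword, which is the behavior the regression test checks for.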

File-Level Changes

Change: Preserve all delimiters when multiple keyword delimiters share the same start string while using a single optimized regex scan.
Details:
  • Short-circuit and return early when the filtered keyword delimiter list is empty.
  • Build a de-duplicated list of keyword start strings and compile a single combined regex pattern using word boundaries and escaped starts.
  • Create a mapping from each start string to its list of corresponding Delimiter objects to handle duplicates.
  • Iterate over matches from the combined regex, skip those inside strings or comments, then apply the original structural matching logic for every delimiter associated with the matched start text.
Files: src/codeweaver/engine/chunker/delimiter.py

Change: Add regression coverage for duplicate keyword start delimiters and import necessary test dependencies.
Details:
  • Introduce a new unit test that defines two Delimiter instances with the same 'type' start but different kinds and asserts both matches are returned.
  • Import DelimiterKind and Delimiter in the test module to construct the test delimiters.
Files: tests/unit/engine/chunker/test_delimiter_edge_cases.py

Change: Fix a minor spelling issue in a docstring parameter description.
Details:
  • Change 'normalised' to 'normalized' in the _load_custom_delimiters docstring parameter description.
Files: src/codeweaver/engine/chunker/delimiter.py

github-actions bot commented Mar 28, 2026

👋 Hey @aiedwardyi,

Thanks for your contribution to codeweaver! 🧵

You need to agree to the CLA first... 🖊️

Before we can accept your contribution, you need to agree to our Contributor License Agreement (CLA).

To agree to the CLA, please comment:

I read the contributors license agreement and I agree to it.

Those exact words are important [1], so please don't change them. 😉

You can read the full CLA here: Contributor License Agreement


@aiedwardyi has signed the CLA.


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

Footnotes

  1. Our bot needs those exact words to recognize that you agree to the CLA.

@sourcery-ai sourcery-ai bot left a comment

Hey - I've reviewed your changes and they look great!


Copilot AI left a comment

Pull request overview

Fixes the regression introduced by the single-regex keyword matching optimization so that multiple delimiters sharing the same keyword start are all preserved (instead of being overwritten), and adds a regression test for the duplicate-keyword-start case reported in #281.

Changes:

  • Update _match_keyword_delimiters() to group keyword delimiters by start and emit matches for all delimiters that share a matched keyword.
  • Deduplicate keyword start strings when building the combined regex alternation to avoid repeated alternation entries.
  • Add a unit regression test ensuring duplicate keyword starts (e.g., type) return multiple delimiter matches.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
src/codeweaver/engine/chunker/delimiter.py | Preserves all delimiters for duplicate keyword starts by mapping matched text to a list of delimiters; also avoids compiling an empty/degenerate combined pattern after filtering.
tests/unit/engine/chunker/test_delimiter_edge_cases.py | Adds regression coverage asserting duplicate keyword starts produce multiple matches with distinct DelimiterKinds.

codecov bot commented Mar 28, 2026

Codecov Report

❌ Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/codeweaver/engine/chunker/delimiter.py | 91.66% | 1 Missing ⚠️

hoist structural_pairs construction above the loop, use frozenset for thread safety

Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>

@bashandbone bashandbone left a comment

Thanks for the PR @aiedwardyi -- merging. I'll close #281

@bashandbone bashandbone merged commit d3ebb33 into knitli:main Mar 28, 2026
11 of 15 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 28, 2026
@aiedwardyi aiedwardyi deleted the fix/pr-281-duplicate-keyword-delimiters branch March 29, 2026 06:20