82 changes: 46 additions & 36 deletions src/codeweaver/engine/chunker/delimiter.py
@@ -451,6 +451,9 @@ def _match_keyword_delimiters(
# Filter out delimiters with empty start strings - they match everywhere!
keyword_delimiters = [d for d in keyword_delimiters if d.start]

if not keyword_delimiters:
return matches

# Define structural delimiters that can complete keywords
# Map opening structural chars to their closing counterparts
structural_pairs = {
@@ -459,49 +462,56 @@ def _match_keyword_delimiters(
"=>": "", # Arrow functions often have expression bodies
}

for delimiter in keyword_delimiters:
# Find all keyword occurrences using word boundary matching
pattern = rf"\b{re.escape(delimiter.start)}\b"
# Optimization: Combine all keyword start strings into a single compiled regex pattern.
# This allows us to make a single pass over the content rather than iterating over
# `re.finditer` for each keyword delimiter individually, significantly reducing overhead.
start_strings = [d.start for d in keyword_delimiters]
combined_pattern = re.compile(rf"\b(?:{'|'.join(map(re.escape, start_strings))})\b")

for match in re.finditer(pattern, content):
keyword_pos = match.start()
# Create a mapping to quickly look up the delimiter by its matched start string
delimiter_map = {d.start: d for d in keyword_delimiters}
Contributor


issue (bug_risk): Handling duplicate keyword start strings may change behavior compared to the previous implementation.

Previously, each delimiter was processed independently with re.finditer, so duplicate start values would each be applied (possibly with different end behavior/metadata). With delimiter_map = {d.start: d for d in keyword_delimiters}, only the last delimiter for a given start is ever used. If duplicate start values are valid in this context, this changes semantics. Consider either enforcing unique start values up front and failing fast, or mapping each start to a list of delimiters to preserve the prior behavior.
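The semantic difference can be shown with a minimal, self-contained sketch. `FakeDelimiter` is a hypothetical stand-in for the codebase's delimiter type, not its real API:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeDelimiter:
    # Illustrative fields only; the real Delimiter class may differ.
    start: str
    kind: str


delims = [
    FakeDelimiter("type", "STRUCT"),
    FakeDelimiter("type", "TYPE_ALIAS"),
    FakeDelimiter("func", "FUNCTION"),
]

# The PR's dict comprehension keeps only the LAST delimiter per start string:
flat_map = {d.start: d for d in delims}
assert flat_map["type"].kind == "TYPE_ALIAS"  # the STRUCT entry was silently dropped

# A list-valued map preserves every delimiter, matching the old
# per-delimiter finditer behavior:
list_map: dict[str, list[FakeDelimiter]] = defaultdict(list)
for d in delims:
    list_map[d.start].append(d)
assert [d.kind for d in list_map["type"]] == ["STRUCT", "TYPE_ALIAS"]
```

With the list-valued map, the inner loop would emit a match for each delimiter associated with the matched start string, as the per-keyword `re.finditer` loop did.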

Comment on lines +465 to +472

Copilot AI Mar 27, 2026


This optimization changes how keyword delimiters are associated to matches. There are currently unit tests for delimiter chunking behavior, but none appear to cover the case where multiple keyword delimiters share the same start (e.g., type being both STRUCT and TYPE_ALIAS in family patterns). Adding a test that asserts the resolved boundary kind/priority for such an input would help prevent regressions here.


# Skip if keyword is inside a string or comment
if self._is_inside_string_or_comment(content, keyword_pos):
continue
for match in combined_pattern.finditer(content):
matched_text = match.group(0)
delimiter = delimiter_map[matched_text]
keyword_pos = match.start()
Contributor


suggestion (performance): The allowed set is rebuilt on every match and could be hoisted out of the loop.

This recreates set(structural_pairs.keys()) for every keyword occurrence. Since structural_pairs is constant in this method, compute allowed_structurals = set(structural_pairs.keys()) once before for match in combined_pattern.finditer(content): and pass allowed=allowed_structurals here to avoid repeated allocation in the hot inner loop.

Suggested implementation:

        allowed_structurals = set(structural_pairs.keys())

        for match in combined_pattern.finditer(content):
            matched_text = match.group(0)
            delimiter = delimiter_map[matched_text]
            keyword_pos = match.start()

Within the same method, find the call(s) to self._find_next_structural_with_char(...) that currently pass allowed=set(structural_pairs.keys()) and change them to allowed=allowed_structurals so the precomputed set is reused. If there are multiple such calls (e.g., for different delimiters or branches), update all of them.

Comment on lines +468 to +477

Copilot AI Mar 27, 2026


delimiter_map = {d.start: d ...} assumes keyword starts are unique. In this codebase they are not (e.g., STRUCT_PATTERN and TYPE_ALIAS_PATTERN both include start string "type" via families.py), so this change will drop one delimiter and change which kind/priority wins during overlap resolution. Consider mapping start -> list[Delimiter] (or selecting the highest-priority delimiter per start) and emitting matches for each delimiter associated with the matched start string to preserve prior behavior.
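The "highest-priority delimiter per start" alternative could be sketched as follows. The `priority` field and its comparison direction are assumptions for illustration; the real Delimiter class may expose this differently:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeDelimiter:
    # Hypothetical stand-in; field names are illustrative only.
    start: str
    kind: str
    priority: int


delims = [
    FakeDelimiter("type", "STRUCT", priority=5),
    FakeDelimiter("type", "TYPE_ALIAS", priority=3),
]

# Keep only the highest-priority delimiter for each start string, so the
# winner no longer depends on iteration order.
best: dict[str, FakeDelimiter] = {}
for d in delims:
    cur = best.get(d.start)
    if cur is None or d.priority > cur.priority:
        best[d.start] = d

assert best["type"].kind == "STRUCT"  # highest priority wins, not last-seen
```

Unlike the last-one-wins dict comprehension, this makes the overlap-resolution outcome deterministic with respect to the delimiter list's ordering.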


# Find the next structural opening after the keyword
struct_start, struct_char = self._find_next_structural_with_char(
content,
start=keyword_pos + len(delimiter.start),
allowed=set(structural_pairs.keys()),
)
# Skip if keyword is inside a string or comment
if self._is_inside_string_or_comment(content, keyword_pos):
continue

if struct_start is None:
continue
# Find the next structural opening after the keyword
struct_start, struct_char = self._find_next_structural_with_char(
content,
start=keyword_pos + len(delimiter.start),
allowed=set(structural_pairs.keys()),
)

# Find the matching closing delimiter for the structural character
struct_end = self._find_matching_close(
content,
struct_start,
struct_char or "",
structural_pairs.get(cast(str, struct_char), ""),
)
if struct_start is None:
continue

if struct_end is not None:
# Calculate nesting level by counting parent structures
nesting_level = self._calculate_nesting_level(content, keyword_pos)
# Find the matching closing delimiter for the structural character
struct_end = self._find_matching_close(
content,
struct_start,
struct_char or "",
structural_pairs.get(cast(str, struct_char), ""),
)

# Create a complete match from keyword to closing structure
# This represents the entire construct (e.g., function...})
matches.append(
DelimiterMatch(
delimiter=delimiter,
start_pos=keyword_pos,
end_pos=struct_end,
nesting_level=nesting_level,
)
if struct_end is not None:
# Calculate nesting level by counting parent structures
nesting_level = self._calculate_nesting_level(content, keyword_pos)

# Create a complete match from keyword to closing structure
# This represents the entire construct (e.g., function...})
matches.append(
DelimiterMatch(
delimiter=delimiter,
start_pos=keyword_pos,
end_pos=struct_end,
nesting_level=nesting_level,
)
)

return matches

@@ -1280,7 +1290,7 @@ def _load_custom_delimiters(
patterns when merged with the full delimiter list.

Args:
normalized_language: Snake-case normalised language identifier.
normalized_language: Snake-case normalized language identifier.
language: Original language string (used for logging only).

Returns: