feat: Implement custom delimiter loading with override, type safety, and new-language detection#275
Conversation
Complete delimiter family system integration by implementing proper conversion from DelimiterPattern to Delimiter objects for custom delimiters loaded from user settings. Changes: - Remove TODO placeholder at delimiter.py:1289-1290 - Initialize delimiters list before custom delimiter processing - Add conversion loop using Delimiter.from_pattern() for custom delimiters - Maintain consistent pattern with family delimiter conversion The keyword matching system blocker (documented in previous analysis) has been resolved - _match_keyword_delimiters() was implemented at delimiter.py:417, enabling proper handling of delimiters with empty end strings and keyword-based structure detection. Resolves #99 Co-authored-by: Adam Poulemanos <bashandbone@users.noreply.github.com>
Reviewer's GuideImplements full custom-delimiter loading and language detection: custom patterns are converted to runtime delimiters with correct enum/string normalisation and error handling, prepended ahead of family patterns, and the chunker selector can now resolve new languages from user-defined extension mappings. Sequence diagram for custom language detection and delimiter loadingsequenceDiagram
actor User
participant ChunkerSelector
participant Governor as GovernorSettings
participant DelimiterChunker
participant DelimFamilies as DelimiterFamilies
User->>ChunkerSelector: select_for_file(file)
activate ChunkerSelector
ChunkerSelector->>ChunkerSelector: _detect_language(file)
alt ext_category is known
ChunkerSelector-->>ChunkerSelector: return enum or ext_category.language
else ext_category is None
ChunkerSelector->>ChunkerSelector: _detect_language_from_custom_ext(file_ext)
alt custom mapping found
ChunkerSelector-->>ChunkerSelector: return custom language
else no custom mapping
ChunkerSelector-->>ChunkerSelector: return file_ext.lstrip(".").lower()
end
end
ChunkerSelector-->>User: return DelimiterChunker(governor, language)
deactivate ChunkerSelector
User->>DelimiterChunker: process(file)
activate DelimiterChunker
DelimiterChunker->>DelimiterChunker: _load_delimiters_for_language(language)
DelimiterChunker->>DelimiterChunker: _load_custom_delimiters(normalized_language, language)
DelimiterChunker->>Governor: read settings.custom_delimiters
alt settings has matching custom_delimiters
loop each matching custom_delim
DelimiterChunker->>DelimiterChunker: Delimiter.from_pattern(pattern)
alt pattern valid
DelimiterChunker-->>DelimiterChunker: append custom delimiters
else pattern invalid
DelimiterChunker-->>DelimiterChunker: log warning and skip
end
end
else no matching custom_delimiters
DelimiterChunker-->>DelimiterChunker: custom list is empty
end
DelimiterChunker->>DelimFamilies: LanguageFamily.from_known_language(language)
DelimiterChunker->>DelimFamilies: get_family_patterns(family)
DelimFamilies-->>DelimiterChunker: built-in patterns
loop each family pattern
DelimiterChunker->>DelimiterChunker: Delimiter.from_pattern(pattern)
DelimiterChunker-->>DelimiterChunker: append built-in delimiters
end
DelimiterChunker-->>User: use merged delimiters (custom prepended)
deactivate DelimiterChunker
Class diagram for custom delimiter loading and language detectionclassDiagram
class GovernorSettings {
+custom_delimiters: list~CustomDelimiter~
}
class CustomDelimiter {
+language: SemanticSearchLanguage | ConfigLanguage | str | None
+extensions: list~ExtensionPair~
+delimiters: list~DelimiterPattern~
}
class ExtensionPair {
+ext: str | None
+language: SemanticSearchLanguage | ConfigLanguage | str | None
}
class DelimiterPattern {
}
class Delimiter {
+from_pattern(pattern: DelimiterPattern) list~Delimiter~
}
class SemanticSearchLanguage {
+variable: str
}
class ConfigLanguage {
+variable: str
}
class DelimiterChunker {
-governor: Governor
+_load_custom_delimiters(normalized_language: str, language: str) list~object~
+_load_delimiters_for_language(language: str) list~Delimiter~
}
class Governor {
+settings: GovernorSettings | None
}
class ChunkerSelector {
+governor: Governor
+select_for_file(file: DiscoveredFile) BaseChunker
+_detect_language(file: DiscoveredFile) SemanticSearchLanguage | ConfigLanguage | str
+_detect_language_from_custom_ext(file_ext: str) SemanticSearchLanguage | ConfigLanguage | str | None
+_match_custom_ext_pair(custom_delim: object, file_ext: str, textcase: object) SemanticSearchLanguage | ConfigLanguage | str | None
}
class DiscoveredFile {
+absolute_path: Path
+ext_category: ExtCategory | None
}
class ExtCategory {
+language: SemanticSearchLanguage | ConfigLanguage | str
}
class BaseChunker {
}
class LanguageFamily {
+from_known_language(language: str) LanguageFamily
}
class DelimiterFamiliesAPI {
+get_family_patterns(family: LanguageFamily) list~DelimiterPattern~
}
Governor --> GovernorSettings : has
GovernorSettings "*" --> "*" CustomDelimiter : contains
CustomDelimiter "*" --> "*" ExtensionPair : maps
CustomDelimiter "*" --> "*" DelimiterPattern : provides
DelimiterPattern --> Delimiter : converted_by
DelimiterChunker --> Governor : uses
DelimiterChunker --> Delimiter : creates
DelimiterChunker --> LanguageFamily : queries
DelimiterChunker --> DelimiterFamiliesAPI : calls
ChunkerSelector --> Governor : uses
ChunkerSelector --> DiscoveredFile : reads
ChunkerSelector --> BaseChunker : returns
ChunkerSelector --> CustomDelimiter : inspects
ChunkerSelector --> ExtensionPair : inspects
ChunkerSelector --> SemanticSearchLanguage : may_return
ChunkerSelector --> ConfigLanguage : may_return
DiscoveredFile --> ExtCategory : has
ExtCategory --> SemanticSearchLanguage : may_hold
ExtCategory --> ConfigLanguage : may_hold
SemanticSearchLanguage <|-- ConfigLanguage
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- Consider whether custom delimiters should override or only extend family delimiters; if they should take precedence in matching, you may want to prepend them or otherwise ensure their priority rather than appending to the same list before family patterns.
- If there is any chance of
custom_delim.delimiterscontaining invalid patterns, it might be worth guardingDelimiter.from_pattern(pattern)with basic validation or error handling so a single bad custom delimiter does not break loading for the entire language.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider whether custom delimiters should override or only extend family delimiters; if they should take precedence in matching, you may want to prepend them or otherwise ensure their priority rather than appending to the same list before family patterns.
- If there is any chance of `custom_delim.delimiters` containing invalid patterns, it might be worth guarding `Delimiter.from_pattern(pattern)` with basic validation or error handling so a single bad custom delimiter does not break loading for the entire language.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Pull request overview
Implements the previously stubbed conversion of user-configured custom delimiter patterns into concrete Delimiter objects during delimiter loading, aligning custom-delimiter handling with the existing family-pattern conversion flow.
Changes:
- Initialize the
delimiterslist before processing settings-driven custom delimiters. - Convert each
DelimiterPatternfromsettings.custom_delimitersviaDelimiter.from_pattern(). - Remove the old TODO/pass placeholder and avoid re-initializing
delimitersbefore family conversion.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Convert DelimiterPattern to Delimiter objects | ||
| # TODO: Implement proper conversion when delimiter families are integrated | ||
| pass | ||
| for pattern in custom_delim.delimiters: | ||
| delimiters.extend(Delimiter.from_pattern(pattern)) |
There was a problem hiding this comment.
This conversion loop will rarely run for custom delimiters defined via extensions because the match predicate above compares ext.language directly to the language string. ExtLangPair.language can be SemanticSearchLanguage/ConfigLanguage enums (see codeweaver.core.metadata.ExtLangPair), so normalize both sides (e.g., .variable for BaseEnum + snake-case) similar to engine/chunker/delimiters/custom.py:get_custom_patterns().
| # Convert DelimiterPattern to Delimiter objects | ||
| # TODO: Implement proper conversion when delimiter families are integrated | ||
| pass | ||
| for pattern in custom_delim.delimiters: | ||
| delimiters.extend(Delimiter.from_pattern(pattern)) |
There was a problem hiding this comment.
The new conversion path for settings.custom_delimiters is untested. There are existing unit tests for DelimiterChunker, but none that exercise loading ChunkGovernor(settings=ChunkerSettings(custom_delimiters=...)) and verifying the resulting delimiters affect chunking. Adding a focused unit test here would prevent regressions in settings-driven delimiter behavior.
There was a problem hiding this comment.
Code Review Summary
Overview
This PR implements the delimiter conversion for custom delimiters from user settings, completing a TODO at delimiter.py:1289-1290. The implementation is clean, straightforward, and follows the existing pattern used for family delimiter conversion.
Changes Analysis
What Changed:
- ✅ Moved
delimiters: list[Delimiter] = []initialization before custom delimiter processing - ✅ Replaced TODO placeholder with actual conversion loop using
Delimiter.from_pattern() - ✅ Maintained consistency with the family delimiter conversion pattern below
Code Quality: ⭐⭐⭐⭐⭐
The implementation is excellent:
- Consistent: Uses the exact same pattern as family delimiter conversion (
delimiter.py:1304-1305) - Correct: The
Delimiter.from_pattern()method (defined atdelimiter_model.py:82-119) properly handles pattern expansion - Minimal: No unnecessary complexity or over-engineering
- Well-placed: List initialization happens at the right time, allowing custom delimiters to be added first
Architecture Review
Looking at the broader context:
-
CustomDelimiter structure (
chunker.py:109-160):- Has a
delimiters: list[DelimiterPattern]field - Validates that at least one delimiter is provided
- Supports both
languageandextensionsfor matching
- Has a
-
Delimiter.from_pattern() (
delimiter_model.py:82-119):- Expands DelimiterPattern (which may have multiple start/end combinations)
- Returns a list of concrete Delimiter instances
- Handles all the semantic metadata (kind, priority, inclusive, etc.)
-
Processing order (correct):
- Custom delimiters from settings are processed first
- Then family patterns are loaded and appended
- This gives users the ability to override or supplement built-in patterns
Potential Concerns
Test Failures
The CI shows test failures on Python 3.12, 3.13t, and 3.14. I couldn't access the logs yet as the run is still in progress, but this needs investigation:
- Are the failures related to this change?
- Could there be edge cases with custom delimiter conversion?
- Are there any type annotation issues with the newer Python versions?
Recommendation: Wait for test logs to become available and investigate the root cause.
Code Correctness Verification
The implementation assumes:
custom_delim.delimitersis always a valid list of DelimiterPattern objects ✅ (validated by CustomDelimiter model)Delimiter.from_pattern()handles all DelimiterPattern types correctly ✅ (implementation looks solid)- The order matters (custom first, then family) ✅ (makes sense for overrides)
All assumptions appear valid.
Minor Observations
-
Duplicate logic: The custom delimiter matching logic (
delimiter.py:1287-1294) is similar to logic inget_custom_patterns()(custom.py:693-706). This could potentially be extracted to a shared helper, but it's not critical. -
hasattr check: Line 1283 uses
hasattr(self._governor.settings, 'custom_delimiters')which is somewhat defensive. Since ChunkerSettings defines custom_delimiters with a default of None, this should always be present. However, the defensive check doesn't hurt.
Security & Performance
- ✅ No security concerns (defensive code, validates inputs via Pydantic)
- ✅ No performance concerns (iteration is minimal, happens once per language load)
- ✅ No memory leaks or resource issues
Recommendations
- MUST: Investigate and resolve test failures before merging
- SHOULD: Add test coverage specifically for custom delimiter conversion if not already present
- COULD: Consider extracting the language/extension matching logic to reduce duplication (future refactor)
Verdict
The code changes are correct and well-implemented. However, I cannot approve until the test failures are resolved. Once tests pass, this is ready to merge.
Status: ⏳ Awaiting test resolution
Review generated with Claude Code
There was a problem hiding this comment.
Code Review
Summary
This PR implements proper delimiter conversion for custom delimiters loaded from user settings, resolving issue #99 and removing the TODO placeholder at delimiter.py:1289-1290.
Changes Analysis
What Changed:
- Removed TODO placeholder
- Added initialization of
delimiterslist before custom delimiter processing (line 1278) - Implemented conversion loop using
Delimiter.from_pattern()for custom delimiters (lines 1296-1297) - Maintains consistent pattern with existing family delimiter conversion
Code Quality Assessment
✅ Strengths:
- Consistent Implementation: The conversion logic mirrors the existing family delimiter pattern conversion (lines 1304-1305), maintaining code consistency
- Proper Initialization: Moving the
delimiterslist initialization to the top (line 1278) ensures custom delimiters are processed first, giving them priority - Clean Conversion: Uses the existing
Delimiter.from_pattern()method, which properly handles pattern expansion and delimiter instantiation - Maintains Order: Custom delimiters → Family delimiters → Common fallbacks, which is a sensible priority ordering
✅ Type Safety: The code properly uses type annotations and follows the existing codebase patterns
✅ Integration: The implementation correctly integrates with the existing delimiter family system that was previously implemented
Potential Concerns
- The failures may be pre-existing (unrelated to this PR)
- Or they may indicate edge cases not handled by this implementation
Recommendation: Please investigate the test failures. If they're related to this change, we need to:
- Review what the tests expect
- Ensure the custom delimiter conversion handles all edge cases
- Potentially add test coverage for custom delimiter loading
Architecture & Design
✅ Separation of Concerns: Properly delegates pattern expansion to Delimiter.from_pattern()
✅ Backwards Compatible: Existing behavior is preserved; this only removes a TODO and implements the planned functionality
Security & Performance
✅ No Security Issues: The code only processes user settings that are already validated by the settings schema
✅ Performance: Uses the same conversion pattern as family delimiters, so no performance regression expected
Suggestions
-
Priority Ordering: Consider if custom delimiters should always take precedence, or if there should be a way to merge/override specific patterns. The current implementation adds custom delimiters first, which seems reasonable.
-
Documentation: The docstring could be updated to mention that custom delimiters are checked first, before family patterns.
-
Test Coverage: Add unit tests specifically for custom delimiter loading to prevent regressions.
Verdict
The code implementation is well-structured and follows best practices. However, I cannot approve until the test failures are resolved.
Action Items:
- Investigate and fix test failures
- Verify tests pass on all Python versions
- Consider adding specific test coverage for custom delimiter conversion
Once tests are green, this PR should be good to merge! 🚀
|
@copilot we need some adjusting for the implementation to reflect design intent:
|
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com> Agent-Logs-Url: https://github.com/knitli/codeweaver/sessions/66279302-fae6-4887-8c87-7b6eadfac587
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com> Agent-Logs-Url: https://github.com/knitli/codeweaver/sessions/66279302-fae6-4887-8c87-7b6eadfac587
|
@sourcery-ai review |
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- In
_load_custom_delimitersand_load_delimiters_for_languageyou’re working withlist[object]and using# type: ignore[assignment]; consider tightening this tolist[Delimiter]end-to-end so you can drop theobjecttyping and the type ignore and get better static checking. - Both
_load_custom_delimitersand_detect_language_from_custom_extimporttextcaseinside the function body; since these are hot paths, it would be cleaner and slightly more efficient to move the import to module scope and pass nothing around instead of thetextcasemodule object.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_load_custom_delimiters` and `_load_delimiters_for_language` you’re working with `list[object]` and using `# type: ignore[assignment]`; consider tightening this to `list[Delimiter]` end-to-end so you can drop the `object` typing and the type ignore and get better static checking.
- Both `_load_custom_delimiters` and `_detect_language_from_custom_ext` import `textcase` inside the function body; since these are hot paths, it would be cleaner and slightly more efficient to move the import to module scope and pass nothing around instead of the `textcase` module object.
## Individual Comments
### Comment 1
<location path="src/codeweaver/engine/chunker/selector.py" line_range="257-264" />
<code_context>
+ pair_ext = f".{pair_ext}"
+ if pair_ext.lower() != file_ext.lower():
+ continue
+ delim_lang = getattr(custom_delim, "language", None)
+ if delim_lang is not None:
+ return textcase.snake(str(delim_lang)) # type: ignore[attr-defined]
+ pair_lang = getattr(pair, "language", None)
+ if pair_lang is not None:
+ if isinstance(pair_lang, SemanticSearchLanguage | ConfigLanguage):
+ return pair_lang
+ return textcase.snake(str(pair_lang)) # type: ignore[attr-defined]
+ return None
+
</code_context>
<issue_to_address>
**issue:** Inconsistent handling of enum vs string return types when resolving the language from custom delimiters.
`custom_delim.language` is always converted to a snake-cased string (`textcase.snake(str(delim_lang))`), even when it is a `SemanticSearchLanguage` or `ConfigLanguage`, while `pair.language` preserves enum instances and only stringifies non-enum values. As a result, this function may return enums for pair-based mappings but only strings for custom delimiter mappings, despite the shared return type `SemanticSearchLanguage | ConfigLanguage | str | None`. It would be clearer to make both branches behave the same—either always return enums when the value is an enum, or always normalize both branches to strings.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return textcase.snake(lang.variable) | ||
| return textcase.snake(str(lang)) | ||
|
|
||
| delimiters: list[object] = [] |
There was a problem hiding this comment.
The local delimiters list is typed as list[object], which then forces a # type: ignore[assignment] when assigning into list[Delimiter]. Type this as list[Delimiter] and you can drop the ignore at the call site.
| delimiters: list[object] = [] | |
| delimiters: list[Delimiter] = [] |
| delimiters: list[Delimiter] = self._load_custom_delimiters( # type: ignore[assignment] | ||
| normalized_language, language |
There was a problem hiding this comment.
# type: ignore[assignment] here is a symptom of _load_custom_delimiters() being typed as list[object]. Once _load_custom_delimiters() returns list[Delimiter], this ignore should be removable.
| delimiters: list[Delimiter] = self._load_custom_delimiters( # type: ignore[assignment] | |
| normalized_language, language | |
| delimiters: list[Delimiter] = cast( | |
| list[Delimiter], | |
| self._load_custom_delimiters(normalized_language, language), |
| return textcase.snake(str(delim_lang)) # type: ignore[attr-defined] | ||
| pair_lang = getattr(pair, "language", None) | ||
| if pair_lang is not None: | ||
| if isinstance(pair_lang, SemanticSearchLanguage | ConfigLanguage): |
There was a problem hiding this comment.
isinstance(pair_lang, SemanticSearchLanguage | ConfigLanguage) will raise TypeError at runtime because isinstance doesn’t accept PEP604 union types. Use (SemanticSearchLanguage, ConfigLanguage) instead.
| if isinstance(pair_lang, SemanticSearchLanguage | ConfigLanguage): | |
| if isinstance(pair_lang, (SemanticSearchLanguage, ConfigLanguage)): |
| delim_lang = getattr(custom_delim, "language", None) | ||
| if delim_lang is not None: | ||
| return textcase.snake(str(delim_lang)) # type: ignore[attr-defined] | ||
| pair_lang = getattr(pair, "language", None) | ||
| if pair_lang is not None: | ||
| if isinstance(pair_lang, SemanticSearchLanguage | ConfigLanguage): | ||
| return pair_lang | ||
| return textcase.snake(str(pair_lang)) # type: ignore[attr-defined] |
There was a problem hiding this comment.
When custom_delim.language is a SemanticSearchLanguage/ConfigLanguage, textcase.snake(str(delim_lang)) will normalize the enum’s repr (e.g. ConfigLanguage.BASH) into something like config_language_bash instead of the intended language key (bash). Handle enums via .variable (or return the enum itself) before snake-casing.
| logger = logging.getLogger(__name__) | ||
|
|
||
| def _normalize(lang: object) -> str: | ||
| if isinstance(lang, SemanticSearchLanguage | ConfigLanguage): |
There was a problem hiding this comment.
isinstance(lang, SemanticSearchLanguage | ConfigLanguage) will raise TypeError at runtime because isinstance does not accept PEP604 union types. Use a tuple (SemanticSearchLanguage, ConfigLanguage) (or a shared base enum type if available) instead so enum normalization works.
| if isinstance(lang, SemanticSearchLanguage | ConfigLanguage): | |
| if isinstance(lang, (SemanticSearchLanguage, ConfigLanguage)): |
| def _load_custom_delimiters( | ||
| self, | ||
| normalized_language: str, | ||
| language: str, | ||
| ) -> list[object]: | ||
| """Load custom delimiter patterns from settings that match the given language. |
There was a problem hiding this comment.
_load_custom_delimiters() is annotated as returning list[object], but it builds Delimiter instances via Delimiter.from_pattern(...). Update the return type to list[Delimiter] so typing reflects actual behavior.
|
@copilot open a new pull request to apply changes based on the comments in this thread With one exception: do not update the isinstance call to a tuple. Using union syntax is not a type error and is the preferred idiom in CodeWeaver (and ruff errors when not used) -- Also, please incorporate this feedback from sourcery: Overall Comments
Individual CommentsComment 1+ pair_ext = f".{pair_ext}" + if pair_ext.lower() != file_ext.lower(): + continue + delim_lang = getattr(custom_delim, "language", None) + if delim_lang is not None: + return textcase.snake(str(delim_lang)) # type: ignore[attr-defined] + pair_lang = getattr(pair, "language", None) + if pair_lang is not None: + if isinstance(pair_lang, SemanticSearchLanguage | ConfigLanguage): + return pair_lang + return textcase.snake(str(pair_lang)) # type: ignore[attr-defined] + return None + **issue:** Inconsistent handling of enum vs string return types when resolving the language from custom delimiters.
|
|
@bashandbone I've opened a new pull request, #277, to work on those changes. Once the pull request is ready, I'll request review from you. |
…ter loading (#277) * Initial plan * fix: tighten types and move textcase to module scope in custom delimiter loading Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com> Agent-Logs-Url: https://github.com/knitli/codeweaver/sessions/0cc04151-09d9-4c2f-9eb2-9a1f233a187f * Update src/codeweaver/engine/chunker/selector.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com> --------- Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ext_match = custom_delim.extensions and any( | ||
| _normalize(ext.language) == normalized_language | ||
| for ext in custom_delim.extensions | ||
| if hasattr(ext, "language") and ext.language is not None | ||
| ) |
There was a problem hiding this comment.
ext_match = custom_delim.extensions and any(...) can evaluate to the list itself (e.g., when extensions=[]), so ext_match isn’t consistently a boolean. Consider ext_match = bool(custom_delim.extensions) and any(...) for type safety and clearer intent.
There was a problem hiding this comment.
@copilot open a new pull request to apply changes based on this feedback
There was a problem hiding this comment.
Review of PR #275: Custom Delimiter Loading Implementation
I've reviewed this PR implementing custom delimiter loading with override behavior, type safety, and new-language detection. The implementation is well-structured with comprehensive tests, but there are several important issues that need to be addressed.
Critical Issues
1. CI Test Failures (Blocking)
All test jobs are failing with error: Failed to spawn: pytest - this appears to be an environment/dependency issue unrelated to your code changes, but it blocks verification that the new tests actually pass. The failures need to be resolved before this can be merged.
2. Language Priority Logic Issue (High Priority)
Location: src/codeweaver/engine/chunker/selector.py:255-264
There's a significant logic flaw in _match_custom_ext_pair() that the Copilot reviewer correctly identified. When an extension match is found, the method returns custom_delim.language first, even though the user explicitly mapped that extension to pair.language. This can cause incorrect language detection when the top-level language field differs from the per-extension language values.
Current logic (lines 255-264):
delim_lang = getattr(custom_delim, "language", None)
if delim_lang is not None:
# Returns custom_delim.language even for extension-specific matches
if isinstance(delim_lang, SemanticSearchLanguage | ConfigLanguage):
return delim_lang
return textcase.snake(str(delim_lang))
pair_lang = getattr(pair, "language", None)
# This only runs if delim_lang is NoneRecommended fix: Prioritize pair.language over custom_delim.language for extension matches, as suggested in comment #2990624081. When matching by extension, the extension's declared language should take precedence.
Code Quality Issues
3. Missing Import for textcase in delimiter.py
Location: src/codeweaver/engine/chunker/delimiter.py:11
Good: You added import textcase at module scope (line 21).
Issue: The import is buried among other imports. For clarity and to prevent future "module not found" issues, consider adding a comment explaining this is needed for custom delimiter language normalization, or ensure it's grouped logically with other third-party imports.
4. Inconsistent Error Handling
Location: src/codeweaver/engine/chunker/delimiter.py:1314-1322
The error handling catches (ValueError, TypeError) when converting patterns, which is good defensive programming. However:
- The warning message could be more actionable for users
- Consider whether you should also catch other exceptions (e.g.,
AttributeErrorif the pattern structure is unexpected) - The current approach silently skips bad patterns; consider if there should be a config option to fail-fast on invalid custom delimiters
5. Type Safety Concerns
Location: Multiple places using getattr(custom_delim, ...)
The code uses getattr() with object types in several places (e.g., selector.py:244, 255, 260). While this provides flexibility, it sacrifices type safety. Consider:
- Adding runtime type assertions
- Using
Protocolclasses to define expected interfaces - Adding docstring warnings about the duck-typing approach
Test Quality
6. Comprehensive Test Coverage ✅
Location: tests/unit/engine/chunker/test_custom_delimiter_loading.py
Excellent test coverage with 8 focused unit tests covering:
- Override/prepend behavior
- Enum type comparison
- Error handling for bad patterns
- New language detection
The tests are well-structured with clear helper functions and good documentation.
Minor suggestion: Consider adding integration tests that verify the end-to-end behavior with actual file chunking, not just unit tests with mocks.
7. Test Markers
The tests use multiple markers: @pytest.mark.benchmark, @pytest.mark.mock_only, @pytest.mark.performance, @pytest.mark.unit.
Verify that:
- These markers are necessary (e.g., are these actually benchmark/performance tests?)
- The CI is configured to run tests with these markers
Documentation
8. Good Docstring Quality ✅
The docstrings for _load_custom_delimiters(), _detect_language_from_custom_ext(), and _match_custom_ext_pair() are clear and well-formatted with proper Args/Returns sections.
9. PR Description ✅
The PR description is excellent - it clearly explains the changes, provides context, and links to resolved issues.
Performance Considerations
10. Language Normalization Performance
Location: delimiter.py:1296-1299 and selector.py:259, 264
The _normalize() helper and direct textcase.snake() calls are executed in loops. For large numbers of custom delimiters:
- Consider caching normalized language names
- Profile to see if this becomes a bottleneck
- The current implementation is probably fine for typical use cases (< 100 custom delimiters)
Security
11. Input Validation ✅
The code safely handles malformed input:
- Checks for
Nonevalues - Uses
hasattr()before attribute access - Catches exceptions during pattern conversion
No security concerns identified.
Recommendations
Must Fix (Blocking):
- Resolve CI test failures
- Fix the language priority logic in
_match_custom_ext_pair()as suggested in comment #2990624081
Should Fix (High Priority):
3. Add a test case specifically for the scenario where custom_delim.language differs from pair.language to verify the fix for #2
4. Consider more specific exception handling or user-facing error messages for invalid patterns
Nice to Have:
5. Add integration tests
6. Review test markers for accuracy
7. Consider performance optimizations if custom delimiter counts scale significantly
Conclusion
This is a solid implementation with good separation of concerns, comprehensive unit tests, and clear documentation. The core algorithm (prepending custom delimiters to override defaults, normalizing language types) is sound. However, the language priority bug and CI failures are blocking issues that must be resolved before merge.
Once the critical issues are addressed, this will be a valuable addition to the codeweaver chunking system.
Overall Assessment: Request Changes - address the language priority logic and CI failures before merging.
There was a problem hiding this comment.
Code Review - PR #275: Custom Delimiter Loading Implementation
I've reviewed the changes in this PR and they look excellent overall. Here's my detailed analysis:
✅ Strengths
-
Clean Architecture: The extraction of
_load_custom_delimiters()helper method is well-designed, keeping the code modular and maintainable. -
Type Safety: The normalization logic for
SemanticSearchLanguage,ConfigLanguage, and plain language names is implemented correctly and consistently with the existing pattern incustom.py:694-704. -
Robust Error Handling: The try-except block around
Delimiter.from_pattern()(lines 1314-1322) ensures that a single invalid pattern doesn't break the entire loading process - this is defensive programming at its best. -
Prepend Strategy: Custom delimiters are correctly prepended (line 1348-1350) to override built-in family patterns, which is the intended behavior.
-
Extension Detection: The new
_detect_language_from_custom_ext()and_match_custom_ext_pair()methods inselector.pyenable detection of user-defined languages, which is a valuable feature. -
Comprehensive Tests: The test suite (
test_custom_delimiter_loading.py) is excellent:- 8 focused unit tests covering all key scenarios
- Tests for prepend/override behavior
- Tests for type comparison (all enum variants)
- Error handling verification
- New-language detection
- All tests are passing ✓
📊 Test Results
- ✅ All 8 new custom delimiter tests passed
- ✅ 1,446 total tests passed
- ❌ 1 unrelated test failed:
test_indexing_performance_with_real_providers- This is a flaky integration test with a
FileNotFoundErrorin vector storage - Not related to the PR changes - this test doesn't touch delimiter loading
- This is a flaky integration test with a
🔍 Code Quality Observations
-
Consistency: The implementation matches the existing pattern in
delimiters/custom.py(lines 685-714), using the same normalization approach withtextcase.snake(). -
Documentation: Excellent docstrings with clear parameter descriptions and return types.
-
Logging: Appropriate use of
logger.warning()for skipped patterns with helpful context. -
Defensive Coding: Good use of
hasattr()and null checks when accessing settings attributes.
⚡ Performance & Security
- Performance: No concerns. The custom delimiter loading is O(n*m) where n=custom delimiters and m=patterns per delimiter, which is expected and acceptable.
- Security: No security issues detected. The code safely handles user input and validates patterns.
🎯 Recommendations
-
Minor: Consider caching - If custom delimiters are loaded frequently for the same language, consider caching the normalized results (though this may already be handled at a higher level).
-
The flaky test - The failing
test_indexing_performance_with_real_providersshould be addressed separately (not in this PR). It appears to be a race condition or cleanup issue in the test infrastructure.
✅ Conclusion
This PR successfully implements the TODO at delimiter.py:1289-1290 with a well-designed, robust, and thoroughly tested solution. The code quality is high, follows existing patterns, and adds valuable functionality for custom delimiter support.
Recommendation: APPROVE ✓
The single failing test is unrelated to these changes and should not block this PR.
There was a problem hiding this comment.
Response to Comment #2990624081 (Language Priority Logic)
I agree with the Copilot reviewer's suggestion on the _match_custom_ext_pair method. The current implementation has a logic flaw:
The Problem:
When an extension match is found, the method returns custom_delim.language first, even though the extension explicitly declares pair.language. This means:
CustomDelimiter(
language="foo",
extensions=[ExtLangPair(ext=".bar", language="baz")],
delimiters=[...]
)A .bar file would be detected as "foo" instead of "baz".
The Fix:
Prioritize pair.language over custom_delim.language for extension-based matches, as suggested. The warning for mismatches is also valuable for catching configuration errors.
Additional Recommendations:
- Apply the suggested code change at
selector.py:255-264 - Add a test case validating this scenario (where extension language ≠ delimiter language)
- Import
loggerat module level inselector.pyfor the warning
This is a high-priority fix that should be addressed before merging.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if pair_lang is not None: | ||
| if ( | ||
| delim_lang is not None | ||
| and str(delim_lang) != str(pair_lang) |
There was a problem hiding this comment.
The mismatch check uses str(delim_lang) != str(pair_lang), which can produce false positives (e.g., enum values often stringify as SemanticSearchLanguage.PYTHON vs 'python'). This can cause noisy warnings even when languages are semantically the same. Normalize both sides (e.g., using .variable/textcase.snake) before comparing, or drop the warning given the config model already validates consistency.
| and str(delim_lang) != str(pair_lang) | |
| and textcase.snake(str(delim_lang)) != textcase.snake(str(pair_lang)) |
| # Custom entries are prepended so they override the built-in family | ||
| # patterns for known languages and are the sole source for new languages. | ||
| delimiters: list[Delimiter] = self._load_custom_delimiters( |
There was a problem hiding this comment.
The comment says custom delimiters are “the sole source for new languages”, but the implementation still always appends LanguageFamily.UNKNOWN patterns (and additional generic fallbacks) for unknown languages. Either update the comment/docstring to match behavior, or change the logic to skip family patterns when the language is unknown and custom delimiters were provided.
Code Review - PR #275I've reviewed the implementation of custom delimiter loading with override, type safety, and new-language detection. This is a well-structured PR that addresses important functionality gaps, but there are several issues that need to be resolved before approval. Critical Issues1. Test Failures (Blocking)The PR currently has failing tests on Python 3.12, 3.13t, and 3.14. These need to be investigated and fixed before the PR can be merged. The tests are still running at the time of this review, but the failures need to be addressed. 2. Inconsistent Enum/String Handling in
|
There was a problem hiding this comment.
Code Review Summary
This PR implements custom delimiter loading with good architecture and comprehensive testing. However, there is one critical type safety issue that must be addressed before merging.
🔴 Critical Issue: Type Safety Bug
Location: src/codeweaver/engine/chunker/delimiter.py:1307
The expression:
ext_match = custom_delim.extensions and any(...)Can evaluate to the list itself when extensions=[], making ext_match inconsistently typed (could be list, bool, or result of any()).
Fix:
ext_match = bool(custom_delim.extensions) and any(
_normalize(ext.language) == normalized_language
for ext in custom_delim.extensions
if hasattr(ext, "language") and ext.language is not None
)This is the same issue noted in this feedback comment.
✅ Strengths
- Well-structured refactoring - Clean extraction of
_load_custom_delimiters()helper - Proper error handling - Individual pattern failures don't crash the system
- Type normalization - Correctly handles
SemanticSearchLanguage,ConfigLanguage, and plain strings - Comprehensive testing - 8 focused unit tests covering major scenarios
- Language detection extension - Clean implementation in selector.py
🟡 Minor Issues
- CI Tests Failing - Python 3.12, 3.13t, and 3.14 tests are failing. Need to investigate if related to this PR.
- Performance consideration - Repeated iteration over custom_delimiters. Consider caching for large lists.
Security & Code Quality
✅ No security concerns - proper input validation and error handling
✅ Follows codebase patterns - consistent use of textcase.snake() and .variable
✅ Good documentation - clear docstrings and test documentation
Recommendation
Conditional Approval - Fix the critical type safety issue, ensure CI tests pass, then this is ready to merge.
The implementation is well-designed and thorough, but the ext_match type issue must be resolved for correctness.
There was a problem hiding this comment.
PR Review: Custom Delimiter Loading with Override, Type Safety, and New-Language Detection
I've completed a thorough review of PR #275. This is a well-structured implementation that properly addresses the TODO at delimiter.py:1289-1290 and extends the custom delimiter system with comprehensive type safety and new language detection capabilities.
Strengths
1. Excellent Implementation Design
- ✅ Custom delimiters are correctly prepended to override built-in family patterns (delimiter.py:1348)
- ✅ Proper separation of concerns with
_load_custom_delimiters()helper method - ✅ Graceful error handling prevents a single bad pattern from breaking the entire language
- ✅ ChunkerSelector correctly detects custom extensions for new languages
2. Type Safety & Normalization
- ✅ Consistent use of
textcase.snake()for language normalization matching the pattern incustom.py:694-704 - ✅ Proper handling of
SemanticSearchLanguage,ConfigLanguage, and plainLanguageNametypes - ✅ Correct enum property access via
.variable(delimiter.py:1298)
3. Comprehensive Test Coverage
- ✅ 8 focused unit tests covering all major scenarios
- ✅ Tests verify prepend/override behavior, type comparisons, error handling, and custom extension detection
- ✅ Good use of mocking and caplog for testing warnings
Issue Found: Language Mismatch Warning
Location: selector.py:259-269
The comparison str(delim_lang) \!= str(pair_lang) can produce false positives because enum values stringify differently than their underlying values:
# Current behavior:
str(SemanticSearchLanguage.PYTHON) # Returns "SemanticSearchLanguage.PYTHON"
str("python") # Returns "python"
# These would be considered "different" even though semantically they're the sameRecommendation: Apply the normalization function before comparison to avoid spurious warnings:
def _normalize(lang: object) -> str:
if isinstance(lang, SemanticSearchLanguage | ConfigLanguage):
return textcase.snake(lang.variable)
return textcase.snake(str(lang))
# Then in the comparison:
if (
delim_lang is not None
and _normalize(delim_lang) \!= _normalize(pair_lang)
):
logger.warning(...)This matches the normalization pattern already used in delimiter.py:1296-1299 and custom.py:699-704.
Minor Observations
-
Consistency: The
_normalize()helper is defined locally in_load_custom_delimiters(). Consider extracting it to module scope (liketextcaseimport) to reuse in_match_custom_ext_pair()for the mismatch check. -
Test Status: Some tests are failing but the run is still in progress. The test file itself looks well-structured.
-
Code Style: Excellent adherence to the project's style - proper typing, clear docstrings, and consistent formatting.
Recommendation
This is solid work that properly implements the feature. The only issue is the false-positive warning scenario. I recommend:
- Fix the language mismatch comparison in
selector.py:261using normalized values - Wait for CI to complete to verify all tests pass
- Consider extracting the normalization logic to reduce duplication
Once the mismatch comparison is fixed, this PR should be ready to merge.
Code Quality: ⭐⭐⭐⭐ (4/5)
- Deducting one star only for the mismatch comparison issue
- Otherwise excellent implementation with proper error handling, type safety, and comprehensive tests
Reviewed with Claude Code
There was a problem hiding this comment.
PR Review - Issue with Generic Fallback Patterns for New Languages
Thank you for this comprehensive implementation of custom delimiter loading! The overall design is solid, with good error handling, type safety, and thorough test coverage. However, there is a legitimate inconsistency identified in comment #2990743944 that should be addressed.
Critical Issue: Documentation vs. Implementation Mismatch
Location: src/codeweaver/engine/chunker/delimiter.py:1346-1347
The Problem:
The comment states that custom delimiters are "the sole source for new languages":
# Custom entries are prepended so they override the built-in family
# patterns for known languages and are the sole source for new languages.
delimiters: list[Delimiter] = self._load_custom_delimiters(
normalized_language, language
)However, the implementation always appends family patterns, even for unknown languages:
# Load from delimiter families system
family = LanguageFamily.from_known_language(language) # Returns UNKNOWN for new languages
patterns = get_family_patterns(family) # Returns generic fallback patterns for UNKNOWN
# Convert patterns to delimiters
for pattern in patterns:
delimiters.extend(Delimiter.from_pattern(pattern)) # Always appendsWhen a truly new language (e.g., "mylang") is used:
LanguageFamily.from_known_language("mylang")returnsLanguageFamily.UNKNOWNget_family_patterns(LanguageFamily.UNKNOWN)returns generic fallback patterns (PARAGRAPH_PATTERN, BRACE_BLOCK_PATTERN, etc. perfamilies.py:371-376)- These patterns are appended to the custom delimiters
Impact:
- For new custom languages, users get both their custom delimiters AND generic fallback patterns, not just their custom delimiters as documented
- This might be unintended behavior if users want precise control over chunking for their custom language
Suggested Fix
You have two options to align code with documentation:
Option 1: Skip family patterns for unknown languages with custom delimiters
# Load from delimiter families system
family = LanguageFamily.from_known_language(language)
# Only load family patterns if:
# - Language is known (not UNKNOWN), OR
# - No custom delimiters were provided
if family != LanguageFamily.UNKNOWN or not delimiters:
patterns = get_family_patterns(family)
for pattern in patterns:
delimiters.extend(Delimiter.from_pattern(pattern))Option 2: Update documentation to reflect current behavior
# Custom entries are prepended so they override the built-in family
# patterns for known languages. For unknown languages, custom delimiters
# are combined with generic fallback patterns from LanguageFamily.UNKNOWN.Recommendation: Option 1 seems more aligned with the design intent of allowing users to define new languages entirely via custom delimiters. However, Option 2 might be safer if the generic fallback patterns are desired as a safety net.
Positive Aspects
-
Excellent type safety: The normalization logic in
_normalize()correctly handlesSemanticSearchLanguage,ConfigLanguage, and plain strings usingtextcase.snake()- this matches the pattern used elsewhere in the codebase -
Robust error handling: The try-except wrapper around
Delimiter.from_pattern()with warning logs ensures one bad pattern doesn't break the entire language configuration -
Well-structured extraction: Moving custom delimiter loading into
_load_custom_delimiters()keeps complexity manageable and improves testability -
Comprehensive language detection: The
ChunkerSelectorextensions (_detect_language_from_custom_extand_match_custom_ext_pair) properly enable new language detection via custom extension mappings -
Thorough test coverage: The test suite covers prepend/override behavior, all enum type variants, error handling, and new language detection
-
Good logging: Warning messages for mismatched extension languages (selector.py:263-269) help users debug configuration issues
Minor Observations
-
Performance: The nested loops in
_load_custom_delimiters()and_match_custom_ext_pair()should be fine for typical use cases, but might be worth profiling if custom_delimiters lists become very large -
Test markers: The test file uses multiple pytest marks (
@pytest.mark.benchmark,@pytest.mark.performance, etc.) - ensure these align with your CI/CD pipeline expectations -
CI Failures: The PR shows test failures on multiple Python versions (3.12, 3.13t, 3.14). These should be investigated, though they may be unrelated to this specific issue
Recommendation
Request changes to address the documentation/implementation mismatch. Once that's resolved and CI passes, this will be a solid contribution that properly implements custom delimiter loading with good engineering practices.
Please clarify the intended behavior for unknown languages with custom delimiters and update either the code or documentation accordingly.
|
@copilot reference this feedback: #275 (review) It's good feedback but doesn't precisely reflect the design intent here. Please align your work with the intended design. The intent:
|
Summary
Implements proper delimiter conversion for custom delimiters loaded from user settings, resolving the TODO at
delimiter.py:1289-1290. Extends the implementation to reflect full design intent: custom delimiters override defaults, type comparisons are correct for all language union types, invalid patterns are handled gracefully, and new languages defined via custom extensions are detected by the chunker selector.Changes
src/codeweaver/engine/chunker/delimiter.py_load_custom_delimiters()helper to encapsulate custom delimiter loading logic and keep method complexity within limitsSemanticSearchLanguage/ConfigLanguageenum values via.variable+textcase.snake()and plain strings viatextcase.snake(str(...)), matching the pattern incustom.py:get_custom_patterns()Delimiter.from_pattern()call inexcept (ValueError, TypeError)with alogging.WARNING, so a single bad custom pattern does not abort loading for the entire languagesrc/codeweaver/engine/chunker/selector.py_detect_language_from_custom_ext()and_match_custom_ext_pair()toChunkerSelectorsettings.custom_delimiters[*].extensionsto resolve the declared language, enabling chunker routing for user-defined new languagestests/unit/engine/chunker/test_custom_delimiter_loading.py(new)SemanticSearchLanguage,ConfigLanguage, plainLanguageName), error handling (bad pattern skipped with warning), and new-language detection via custom extension mapping inChunkerSelectorContext
The keyword matching system blocker (documented in previous analysis) has been resolved -
_match_keyword_delimiters()was implemented, enabling proper handling of delimiters with empty end strings and keyword-based structure detection.Generated with Claude Code
Summary by Sourcery
Implement robust loading and precedence of custom delimiters and extend language detection to support user-defined languages via custom extensions.
New Features:
Bug Fixes:
Enhancements:
Tests:
📍 Connect Copilot coding agent with Jira, Azure Boards or Linear to delegate work to Copilot in one click without leaving your project management tool.