Skip to content

feat: Implement custom delimiter loading with override, type safety, and new-language detection#275

Merged
bashandbone merged 5 commits intomainfrom
claude/issue-99-20260319-0103
Mar 27, 2026
Merged

feat: Implement custom delimiter loading with override, type safety, and new-language detection#275
bashandbone merged 5 commits intomainfrom
claude/issue-99-20260319-0103

Conversation

@bashandbone
Copy link
Copy Markdown
Contributor

@bashandbone bashandbone commented Mar 24, 2026

Summary

Implements proper delimiter conversion for custom delimiters loaded from user settings, resolving the TODO at delimiter.py:1289-1290. Extends the implementation to reflect full design intent: custom delimiters override defaults, type comparisons are correct for all language union types, invalid patterns are handled gracefully, and new languages defined via custom extensions are detected by the chunker selector.

Changes

src/codeweaver/engine/chunker/delimiter.py

  • Extract _load_custom_delimiters() helper to encapsulate custom delimiter loading logic and keep method complexity within limits
  • Override/prepend: custom delimiters are collected before family patterns so they take priority for known languages and are the sole source for new/unknown languages
  • Type comparison fix: normalize SemanticSearchLanguage/ConfigLanguage enum values via .variable + textcase.snake() and plain strings via textcase.snake(str(...)), matching the pattern in custom.py:get_custom_patterns()
  • Error handling: wrap each Delimiter.from_pattern() call in except (ValueError, TypeError) with a logging.WARNING, so a single bad custom pattern does not abort loading for the entire language

src/codeweaver/engine/chunker/selector.py

  • Add _detect_language_from_custom_ext() and _match_custom_ext_pair() to ChunkerSelector
  • When a file's extension is not in the standard registry, the selector now walks settings.custom_delimiters[*].extensions to resolve the declared language, enabling chunker routing for user-defined new languages

tests/unit/engine/chunker/test_custom_delimiter_loading.py (new)

  • 8 focused unit tests covering: prepend/override behaviour, enum type comparison (SemanticSearchLanguage, ConfigLanguage, plain LanguageName), error handling (bad pattern skipped with warning), and new-language detection via custom extension mapping in ChunkerSelector

Context

The keyword matching system blocker (documented in previous analysis) has been resolved - _match_keyword_delimiters() was implemented, enabling proper handling of delimiters with empty end strings and keyword-based structure detection.


Generated with Claude Code

Summary by Sourcery

Implement robust loading and precedence of custom delimiters and extend language detection to support user-defined languages via custom extensions.

New Features:

  • Support custom delimiter patterns that override built-in family delimiters for known languages and define delimiters for entirely new languages via settings.
  • Enable ChunkerSelector to resolve languages from user-defined file extensions declared in custom delimiter settings.

Bug Fixes:

  • Fix language matching for custom delimiters by normalising SemanticSearchLanguage, ConfigLanguage, and plain language name values before comparison.
  • Prevent invalid custom delimiter patterns from breaking delimiter loading by skipping them with a logged warning.

Enhancements:

  • Refactor delimiter loading to extract a dedicated helper for custom delimiter handling and keep method complexity in check.

Tests:

  • Add unit test suite covering custom delimiter precedence, language type matching, error handling for bad patterns, and new-language detection via custom extensions.

📍 Connect Copilot coding agent with Jira, Azure Boards or Linear to delegate work to Copilot in one click without leaving your project management tool.

Complete delimiter family system integration by implementing proper
conversion from DelimiterPattern to Delimiter objects for custom
delimiters loaded from user settings.

Changes:
- Remove TODO placeholder at delimiter.py:1289-1290
- Initialize delimiters list before custom delimiter processing
- Add conversion loop using Delimiter.from_pattern() for custom delimiters
- Maintain consistent pattern with family delimiter conversion

The keyword matching system blocker (documented in previous analysis)
has been resolved - _match_keyword_delimiters() was implemented at
delimiter.py:417, enabling proper handling of delimiters with empty
end strings and keyword-based structure detection.

Resolves #99

Co-authored-by: Adam Poulemanos <bashandbone@users.noreply.github.com>
@bashandbone bashandbone added bug Something isn't working providers labels Mar 24, 2026
Copilot AI review requested due to automatic review settings March 24, 2026 18:14
@sourcery-ai
Copy link
Copy Markdown
Contributor

sourcery-ai bot commented Mar 24, 2026

Reviewer's Guide

Implements full custom-delimiter loading and language detection: custom patterns are converted to runtime delimiters with correct enum/string normalisation and error handling, prepended ahead of family patterns, and the chunker selector can now resolve new languages from user-defined extension mappings.

Sequence diagram for custom language detection and delimiter loading

sequenceDiagram
    actor User
    participant ChunkerSelector
    participant Governor as GovernorSettings
    participant DelimiterChunker
    participant DelimFamilies as DelimiterFamilies

    User->>ChunkerSelector: select_for_file(file)
    activate ChunkerSelector
    ChunkerSelector->>ChunkerSelector: _detect_language(file)
    alt ext_category is known
        ChunkerSelector-->>ChunkerSelector: return enum or ext_category.language
    else ext_category is None
        ChunkerSelector->>ChunkerSelector: _detect_language_from_custom_ext(file_ext)
        alt custom mapping found
            ChunkerSelector-->>ChunkerSelector: return custom language
        else no custom mapping
            ChunkerSelector-->>ChunkerSelector: return file_ext.lstrip(".").lower()
        end
    end
    ChunkerSelector-->>User: return DelimiterChunker(governor, language)
    deactivate ChunkerSelector

    User->>DelimiterChunker: process(file)
    activate DelimiterChunker
    DelimiterChunker->>DelimiterChunker: _load_delimiters_for_language(language)
    DelimiterChunker->>DelimiterChunker: _load_custom_delimiters(normalized_language, language)
    DelimiterChunker->>Governor: read settings.custom_delimiters
    alt settings has matching custom_delimiters
        loop each matching custom_delim
            DelimiterChunker->>DelimiterChunker: Delimiter.from_pattern(pattern)
            alt pattern valid
                DelimiterChunker-->>DelimiterChunker: append custom delimiters
            else pattern invalid
                DelimiterChunker-->>DelimiterChunker: log warning and skip
            end
        end
    else no matching custom_delimiters
        DelimiterChunker-->>DelimiterChunker: custom list is empty
    end

    DelimiterChunker->>DelimFamilies: LanguageFamily.from_known_language(language)
    DelimiterChunker->>DelimFamilies: get_family_patterns(family)
    DelimFamilies-->>DelimiterChunker: built-in patterns
    loop each family pattern
        DelimiterChunker->>DelimiterChunker: Delimiter.from_pattern(pattern)
        DelimiterChunker-->>DelimiterChunker: append built-in delimiters
    end

    DelimiterChunker-->>User: use merged delimiters (custom prepended)
    deactivate DelimiterChunker
Loading

Class diagram for custom delimiter loading and language detection

classDiagram
    class GovernorSettings {
        +custom_delimiters: list~CustomDelimiter~
    }

    class CustomDelimiter {
        +language: SemanticSearchLanguage | ConfigLanguage | str | None
        +extensions: list~ExtensionPair~
        +delimiters: list~DelimiterPattern~
    }

    class ExtensionPair {
        +ext: str | None
        +language: SemanticSearchLanguage | ConfigLanguage | str | None
    }

    class DelimiterPattern {
    }

    class Delimiter {
        +from_pattern(pattern: DelimiterPattern) list~Delimiter~
    }

    class SemanticSearchLanguage {
        +variable: str
    }

    class ConfigLanguage {
        +variable: str
    }

    class DelimiterChunker {
        -governor: Governor
        +_load_custom_delimiters(normalized_language: str, language: str) list~object~
        +_load_delimiters_for_language(language: str) list~Delimiter~
    }

    class Governor {
        +settings: GovernorSettings | None
    }

    class ChunkerSelector {
        +governor: Governor
        +select_for_file(file: DiscoveredFile) BaseChunker
        +_detect_language(file: DiscoveredFile) SemanticSearchLanguage | ConfigLanguage | str
        +_detect_language_from_custom_ext(file_ext: str) SemanticSearchLanguage | ConfigLanguage | str | None
        +_match_custom_ext_pair(custom_delim: object, file_ext: str, textcase: object) SemanticSearchLanguage | ConfigLanguage | str | None
    }

    class DiscoveredFile {
        +absolute_path: Path
        +ext_category: ExtCategory | None
    }

    class ExtCategory {
        +language: SemanticSearchLanguage | ConfigLanguage | str
    }

    class BaseChunker {
    }

    class LanguageFamily {
        +from_known_language(language: str) LanguageFamily
    }

    class DelimiterFamiliesAPI {
        +get_family_patterns(family: LanguageFamily) list~DelimiterPattern~
    }

    Governor --> GovernorSettings : has
    GovernorSettings "*" --> "*" CustomDelimiter : contains
    CustomDelimiter "*" --> "*" ExtensionPair : maps
    CustomDelimiter "*" --> "*" DelimiterPattern : provides
    DelimiterPattern --> Delimiter : converted_by

    DelimiterChunker --> Governor : uses
    DelimiterChunker --> Delimiter : creates
    DelimiterChunker --> LanguageFamily : queries
    DelimiterChunker --> DelimiterFamiliesAPI : calls

    ChunkerSelector --> Governor : uses
    ChunkerSelector --> DiscoveredFile : reads
    ChunkerSelector --> BaseChunker : returns
    ChunkerSelector --> CustomDelimiter : inspects
    ChunkerSelector --> ExtensionPair : inspects
    ChunkerSelector --> SemanticSearchLanguage : may_return
    ChunkerSelector --> ConfigLanguage : may_return

    DiscoveredFile --> ExtCategory : has
    ExtCategory --> SemanticSearchLanguage : may_hold
    ExtCategory --> ConfigLanguage : may_hold

    SemanticSearchLanguage <|-- ConfigLanguage
Loading

File-Level Changes

Change Details Files
Implement reusable helper to load custom delimiters with precedence over family patterns and robust error handling.
  • Introduce _load_custom_delimiters(normalized_language, language) to encapsulate reading settings.custom_delimiters and converting DelimiterPattern entries to Delimiter objects.
  • Normalise SemanticSearchLanguage/ConfigLanguage enums via .variable and textcase.snake, and plain values via textcase.snake(str(...)) to compare against the requested language key.
  • Filter custom delimiters by either direct language match or matching extension.language entries, then call Delimiter.from_pattern for each pattern.
  • Wrap each Delimiter.from_pattern call in a try/except (ValueError, TypeError) and log a warning when a pattern is invalid so loading continues for remaining patterns.
  • Update _load_delimiters_for_language to compute a normalized_language, prepend the custom delimiters returned by _load_custom_delimiters, then append family-derived patterns as before.
src/codeweaver/engine/chunker/delimiter.py
Enhance ChunkerSelector to resolve languages from custom extension mappings when the standard registry has no match.
  • Add _detect_language_from_custom_ext(file_ext) to walk settings.custom_delimiters[*].extensions when extension lookup fails, returning the mapped language or None.
  • Add static helper _match_custom_ext_pair(custom_delim, file_ext, textcase) that normalises and compares extension strings and returns the appropriate language from either the CustomDelimiter or the ExtLangPair.
  • Modify _detect_language to call _detect_language_from_custom_ext on unknown extensions and fall back to lowercased extension string when no custom mapping exists.
  • Clarify docstrings to reflect that detection can return enums or plain language strings and that custom delimiter mappings can introduce new languages.
src/codeweaver/engine/chunker/selector.py
Add focused tests covering custom delimiter precedence, type-safe language matching, error handling, and new-language detection.
  • Create helpers to build minimal DelimiterPattern instances, ChunkGovernor/ChunkerSettings, and DiscoveredFile-like mocks.
  • Test that custom delimiters for a known language are prepended ahead of family patterns while family patterns are still present.
  • Test that SemanticSearchLanguage, ConfigLanguage, and plain LanguageName values are all matched correctly against the requested language.
  • Test that a simulated invalid pattern in Delimiter.from_pattern is skipped with a logged warning while valid patterns still load.
  • Test that ChunkerSelector._detect_language uses custom extension mappings to resolve new languages and still falls back to the bare extension string for truly unknown extensions.
tests/unit/engine/chunker/test_custom_delimiter_loading.py

Possibly linked issues


Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • Consider whether custom delimiters should override or only extend family delimiters; if they should take precedence in matching, you may want to prepend them or otherwise ensure their priority rather than appending to the same list before family patterns.
  • If there is any chance of custom_delim.delimiters containing invalid patterns, it might be worth guarding Delimiter.from_pattern(pattern) with basic validation or error handling so a single bad custom delimiter does not break loading for the entire language.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Consider whether custom delimiters should override or only extend family delimiters; if they should take precedence in matching, you may want to prepend them or otherwise ensure their priority rather than appending to the same list before family patterns.
- If there is any chance of `custom_delim.delimiters` containing invalid patterns, it might be worth guarding `Delimiter.from_pattern(pattern)` with basic validation or error handling so a single bad custom delimiter does not break loading for the entire language.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements the previously stubbed conversion of user-configured custom delimiter patterns into concrete Delimiter objects during delimiter loading, aligning custom-delimiter handling with the existing family-pattern conversion flow.

Changes:

  • Initialize the delimiters list before processing settings-driven custom delimiters.
  • Convert each DelimiterPattern from settings.custom_delimiters via Delimiter.from_pattern().
  • Remove the old TODO/pass placeholder and avoid re-initializing delimiters before family conversion.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1291 to +1293
# Convert DelimiterPattern to Delimiter objects
# TODO: Implement proper conversion when delimiter families are integrated
pass
for pattern in custom_delim.delimiters:
delimiters.extend(Delimiter.from_pattern(pattern))
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conversion loop will rarely run for custom delimiters defined via extensions because the match predicate above compares ext.language directly to the language string. ExtLangPair.language can be SemanticSearchLanguage/ConfigLanguage enums (see codeweaver.core.metadata.ExtLangPair), so normalize both sides (e.g., .variable for BaseEnum + snake-case) similar to engine/chunker/delimiters/custom.py:get_custom_patterns().

Copilot uses AI. Check for mistakes.
Comment on lines +1291 to +1293
# Convert DelimiterPattern to Delimiter objects
# TODO: Implement proper conversion when delimiter families are integrated
pass
for pattern in custom_delim.delimiters:
delimiters.extend(Delimiter.from_pattern(pattern))
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new conversion path for settings.custom_delimiters is untested. There are existing unit tests for DelimiterChunker, but none that exercise loading ChunkGovernor(settings=ChunkerSettings(custom_delimiters=...)) and verifying the resulting delimiters affect chunking. Adding a focused unit test here would prevent regressions in settings-driven delimiter behavior.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Overview

This PR implements the delimiter conversion for custom delimiters from user settings, completing a TODO at delimiter.py:1289-1290. The implementation is clean, straightforward, and follows the existing pattern used for family delimiter conversion.

Changes Analysis

What Changed:

  1. ✅ Moved delimiters: list[Delimiter] = [] initialization before custom delimiter processing
  2. ✅ Replaced TODO placeholder with actual conversion loop using Delimiter.from_pattern()
  3. ✅ Maintained consistency with the family delimiter conversion pattern below

Code Quality: ⭐⭐⭐⭐⭐

The implementation is excellent:

  • Consistent: Uses the exact same pattern as family delimiter conversion (delimiter.py:1304-1305)
  • Correct: The Delimiter.from_pattern() method (defined at delimiter_model.py:82-119) properly handles pattern expansion
  • Minimal: No unnecessary complexity or over-engineering
  • Well-placed: List initialization happens at the right time, allowing custom delimiters to be added first

Architecture Review

Looking at the broader context:

  1. CustomDelimiter structure (chunker.py:109-160):

    • Has a delimiters: list[DelimiterPattern] field
    • Validates that at least one delimiter is provided
    • Supports both language and extensions for matching
  2. Delimiter.from_pattern() (delimiter_model.py:82-119):

    • Expands DelimiterPattern (which may have multiple start/end combinations)
    • Returns a list of concrete Delimiter instances
    • Handles all the semantic metadata (kind, priority, inclusive, etc.)
  3. Processing order (correct):

    • Custom delimiters from settings are processed first
    • Then family patterns are loaded and appended
    • This gives users the ability to override or supplement built-in patterns

Potential Concerns

Test Failures

The CI shows test failures on Python 3.12, 3.13t, and 3.14. I couldn't access the logs yet as the run is still in progress, but this needs investigation:

  • Are the failures related to this change?
  • Could there be edge cases with custom delimiter conversion?
  • Are there any type annotation issues with the newer Python versions?

Recommendation: Wait for test logs to become available and investigate the root cause.

Code Correctness Verification

The implementation assumes:

  1. custom_delim.delimiters is always a valid list of DelimiterPattern objects ✅ (validated by CustomDelimiter model)
  2. Delimiter.from_pattern() handles all DelimiterPattern types correctly ✅ (implementation looks solid)
  3. The order matters (custom first, then family) ✅ (makes sense for overrides)

All assumptions appear valid.

Minor Observations

  1. Duplicate logic: The custom delimiter matching logic (delimiter.py:1287-1294) is similar to logic in get_custom_patterns() (custom.py:693-706). This could potentially be extracted to a shared helper, but it's not critical.

  2. hasattr check: Line 1283 uses hasattr(self._governor.settings, 'custom_delimiters') which is somewhat defensive. Since ChunkerSettings defines custom_delimiters with a default of None, this should always be present. However, the defensive check doesn't hurt.

Security & Performance

  • ✅ No security concerns (defensive code, validates inputs via Pydantic)
  • ✅ No performance concerns (iteration is minimal, happens once per language load)
  • ✅ No memory leaks or resource issues

Recommendations

  1. MUST: Investigate and resolve test failures before merging
  2. SHOULD: Add test coverage specifically for custom delimiter conversion if not already present
  3. COULD: Consider extracting the language/extension matching logic to reduce duplication (future refactor)

Verdict

The code changes are correct and well-implemented. However, I cannot approve until the test failures are resolved. Once tests pass, this is ready to merge.

Status: ⏳ Awaiting test resolution


Review generated with Claude Code

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Summary

This PR implements proper delimiter conversion for custom delimiters loaded from user settings, resolving issue #99 and removing the TODO placeholder at delimiter.py:1289-1290.

Changes Analysis

What Changed:

  • Removed TODO placeholder
  • Added initialization of delimiters list before custom delimiter processing (line 1278)
  • Implemented conversion loop using Delimiter.from_pattern() for custom delimiters (lines 1296-1297)
  • Maintains consistent pattern with existing family delimiter conversion

Code Quality Assessment

Strengths:

  1. Consistent Implementation: The conversion logic mirrors the existing family delimiter pattern conversion (lines 1304-1305), maintaining code consistency
  2. Proper Initialization: Moving the delimiters list initialization to the top (line 1278) ensures custom delimiters are processed first, giving them priority
  3. Clean Conversion: Uses the existing Delimiter.from_pattern() method, which properly handles pattern expansion and delimiter instantiation
  4. Maintains Order: Custom delimiters → Family delimiters → Common fallbacks, which is a sensible priority ordering

Type Safety: The code properly uses type annotations and follows the existing codebase patterns

Integration: The implementation correctly integrates with the existing delimiter family system that was previously implemented

Potential Concerns

⚠️ Test Failures: The CI shows test failures on Python 3.12, 3.13t, and 3.14. While the code logic appears sound, these need to be investigated:

  • The failures may be pre-existing (unrelated to this PR)
  • Or they may indicate edge cases not handled by this implementation

Recommendation: Please investigate the test failures. If they're related to this change, we need to:

  1. Review what the tests expect
  2. Ensure the custom delimiter conversion handles all edge cases
  3. Potentially add test coverage for custom delimiter loading

Architecture & Design

Separation of Concerns: Properly delegates pattern expansion to Delimiter.from_pattern()

Backwards Compatible: Existing behavior is preserved; this only removes a TODO and implements the planned functionality

Security & Performance

No Security Issues: The code only processes user settings that are already validated by the settings schema

Performance: Uses the same conversion pattern as family delimiters, so no performance regression expected

Suggestions

  1. Priority Ordering: Consider if custom delimiters should always take precedence, or if there should be a way to merge/override specific patterns. The current implementation adds custom delimiters first, which seems reasonable.

  2. Documentation: The docstring could be updated to mention that custom delimiters are checked first, before family patterns.

  3. Test Coverage: Add unit tests specifically for custom delimiter loading to prevent regressions.

Verdict

The code implementation is well-structured and follows best practices. However, I cannot approve until the test failures are resolved.

Action Items:

  • Investigate and fix test failures
  • Verify tests pass on all Python versions
  • Consider adding specific test coverage for custom delimiter conversion

Once tests are green, this PR should be good to merge! 🚀

@bashandbone
Copy link
Copy Markdown
Contributor Author

@copilot we need some adjusting for the implementation to reflect design intent:

  1. If the user defines custom delimiters for a language or extension that codeweaver already has, then their custom delimiters should override the defaults (i.e. prepend).

  2. Users can define new languages and supported extensions this way. So: a) If it's not a known language it should still get added, b) The required changes extend beyond this module -- CodeWeaver would need to detect the new language and extension; this would probably happen somewhere else in the engine package.

  3. If there is any chance of custom_delim.delimiters containing invalid patterns, it might be worth guarding Delimiter.from_pattern(pattern) with basic validation or error handling so a single bad custom delimiter does not break loading for the entire language.

Other items:

  1. Type error: The comparison in delimiters.py does not correctly compare types. ext.language is a union type of SemanticSearchLanguage enum, ConfigLanguage enum, and LanguageName (newtype string). To compare them with a string, the correct approach would be to detect the enums and convert them to a string first (using their .variable method); similar to how it's handled in engine/chunker/delimiters/custom.py:get_custom_patterns().

  2. I'd like to see some test coverage of the conversions.

Copilot AI and others added 2 commits March 24, 2026 22:23
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Agent-Logs-Url: https://github.com/knitli/codeweaver/sessions/66279302-fae6-4887-8c87-7b6eadfac587
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Agent-Logs-Url: https://github.com/knitli/codeweaver/sessions/66279302-fae6-4887-8c87-7b6eadfac587
Copilot AI review requested due to automatic review settings March 24, 2026 22:26
@bashandbone bashandbone review requested due to automatic review settings March 24, 2026 22:26
Copilot AI changed the title feat: Implement delimiter conversion for custom delimiters from settings feat: Implement custom delimiter loading with override, type safety, and new-language detection Mar 24, 2026
@bashandbone bashandbone requested a review from Copilot March 25, 2026 13:40
@bashandbone
Copy link
Copy Markdown
Contributor Author

@sourcery-ai review

Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • In _load_custom_delimiters and _load_delimiters_for_language you’re working with list[object] and using # type: ignore[assignment]; consider tightening this to list[Delimiter] end-to-end so you can drop the object typing and the type ignore and get better static checking.
  • Both _load_custom_delimiters and _detect_language_from_custom_ext import textcase inside the function body; since these are hot paths, it would be cleaner and slightly more efficient to move the import to module scope and pass nothing around instead of the textcase module object.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `_load_custom_delimiters` and `_load_delimiters_for_language` you’re working with `list[object]` and using `# type: ignore[assignment]`; consider tightening this to `list[Delimiter]` end-to-end so you can drop the `object` typing and the type ignore and get better static checking.
- Both `_load_custom_delimiters` and `_detect_language_from_custom_ext` import `textcase` inside the function body; since these are hot paths, it would be cleaner and slightly more efficient to move the import to module scope and pass nothing around instead of the `textcase` module object.

## Individual Comments

### Comment 1
<location path="src/codeweaver/engine/chunker/selector.py" line_range="257-264" />
<code_context>
+                pair_ext = f".{pair_ext}"
+            if pair_ext.lower() != file_ext.lower():
+                continue
+            delim_lang = getattr(custom_delim, "language", None)
+            if delim_lang is not None:
+                return textcase.snake(str(delim_lang))  # type: ignore[attr-defined]
+            pair_lang = getattr(pair, "language", None)
+            if pair_lang is not None:
+                if isinstance(pair_lang, SemanticSearchLanguage | ConfigLanguage):
+                    return pair_lang
+                return textcase.snake(str(pair_lang))  # type: ignore[attr-defined]
+        return None
+
</code_context>
<issue_to_address>
**issue:** Inconsistent handling of enum vs string return types when resolving the language from custom delimiters.

`custom_delim.language` is always converted to a snake-cased string (`textcase.snake(str(delim_lang))`), even when it is a `SemanticSearchLanguage` or `ConfigLanguage`, while `pair.language` preserves enum instances and only stringifies non-enum values. As a result, this function may return enums for pair-based mappings but only strings for custom delimiter mappings, despite the shared return type `SemanticSearchLanguage | ConfigLanguage | str | None`. It would be clearer to make both branches behave the same—either always return enums when the value is an enum, or always normalize both branches to strings.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

return textcase.snake(lang.variable)
return textcase.snake(str(lang))

delimiters: list[object] = []
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The local delimiters list is typed as list[object], which then forces a # type: ignore[assignment] when assigning into list[Delimiter]. Type this as list[Delimiter] and you can drop the ignore at the call site.

Suggested change
delimiters: list[object] = []
delimiters: list[Delimiter] = []

Copilot uses AI. Check for mistakes.
Comment on lines +1342 to +1343
delimiters: list[Delimiter] = self._load_custom_delimiters( # type: ignore[assignment]
normalized_language, language
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# type: ignore[assignment] here is a symptom of _load_custom_delimiters() being typed as list[object]. Once _load_custom_delimiters() returns list[Delimiter], this ignore should be removable.

Suggested change
delimiters: list[Delimiter] = self._load_custom_delimiters( # type: ignore[assignment]
normalized_language, language
delimiters: list[Delimiter] = cast(
list[Delimiter],
self._load_custom_delimiters(normalized_language, language),

Copilot uses AI. Check for mistakes.
return textcase.snake(str(delim_lang)) # type: ignore[attr-defined]
pair_lang = getattr(pair, "language", None)
if pair_lang is not None:
if isinstance(pair_lang, SemanticSearchLanguage | ConfigLanguage):
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isinstance(pair_lang, SemanticSearchLanguage | ConfigLanguage) will raise TypeError at runtime because isinstance doesn’t accept PEP604 union types. Use (SemanticSearchLanguage, ConfigLanguage) instead.

Suggested change
if isinstance(pair_lang, SemanticSearchLanguage | ConfigLanguage):
if isinstance(pair_lang, (SemanticSearchLanguage, ConfigLanguage)):

Copilot uses AI. Check for mistakes.
Comment on lines +257 to +264
delim_lang = getattr(custom_delim, "language", None)
if delim_lang is not None:
return textcase.snake(str(delim_lang)) # type: ignore[attr-defined]
pair_lang = getattr(pair, "language", None)
if pair_lang is not None:
if isinstance(pair_lang, SemanticSearchLanguage | ConfigLanguage):
return pair_lang
return textcase.snake(str(pair_lang)) # type: ignore[attr-defined]
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When custom_delim.language is a SemanticSearchLanguage/ConfigLanguage, textcase.snake(str(delim_lang)) will normalize the enum’s repr (e.g. ConfigLanguage.BASH) into something like config_language_bash instead of the intended language key (bash). Handle enums via .variable (or return the enum itself) before snake-casing.

Copilot uses AI. Check for mistakes.
logger = logging.getLogger(__name__)

def _normalize(lang: object) -> str:
if isinstance(lang, SemanticSearchLanguage | ConfigLanguage):
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isinstance(lang, SemanticSearchLanguage | ConfigLanguage) will raise TypeError at runtime because isinstance does not accept PEP604 union types. Use a tuple (SemanticSearchLanguage, ConfigLanguage) (or a shared base enum type if available) instead so enum normalization works.

Suggested change
if isinstance(lang, SemanticSearchLanguage | ConfigLanguage):
if isinstance(lang, (SemanticSearchLanguage, ConfigLanguage)):

Copilot uses AI. Check for mistakes.
Comment on lines +1255 to +1260
def _load_custom_delimiters(
self,
normalized_language: str,
language: str,
) -> list[object]:
"""Load custom delimiter patterns from settings that match the given language.
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_load_custom_delimiters() is annotated as returning list[object], but it builds Delimiter instances via Delimiter.from_pattern(...). Update the return type to list[Delimiter] so typing reflects actual behavior.

Copilot uses AI. Check for mistakes.
@bashandbone
Copy link
Copy Markdown
Contributor Author

@copilot open a new pull request to apply changes based on the comments in this thread

With one exception: do not update the isinstance call to a tuple. Using union syntax is not a type error and is the preferred idiom in CodeWeaver (and ruff errors when not used) -- isinstance(some_type, str | int) or similar is correct; tuples are not (styleistically).

Also, please incorporate this feedback from sourcery:

Overall Comments

  • In _load_custom_delimiters and _load_delimiters_for_language you’re working with list[object] and using # type: ignore[assignment]; consider tightening this to list[Delimiter] end-to-end so you can drop the object typing and the type ignore and get better static checking.
  • Both _load_custom_delimiters and _detect_language_from_custom_ext import textcase inside the function body; since these are hot paths, it would be cleaner and slightly more efficient to move the import to module scope and pass nothing around instead of the textcase module object.

Individual Comments

Comment 1

+ pair_ext = f".{pair_ext}" + if pair_ext.lower() != file_ext.lower(): + continue + delim_lang = getattr(custom_delim, "language", None) + if delim_lang is not None: + return textcase.snake(str(delim_lang)) # type: ignore[attr-defined] + pair_lang = getattr(pair, "language", None) + if pair_lang is not None: + if isinstance(pair_lang, SemanticSearchLanguage | ConfigLanguage): + return pair_lang + return textcase.snake(str(pair_lang)) # type: ignore[attr-defined] + return None + **issue:** Inconsistent handling of enum vs string return types when resolving the language from custom delimiters.

custom_delim.language is always converted to a snake-cased string (textcase.snake(str(delim_lang))), even when it is a SemanticSearchLanguage or ConfigLanguage, while pair.language preserves enum instances and only stringifies non-enum values. As a result, this function may return enums for pair-based mappings but only strings for custom delimiter mappings, despite the shared return type SemanticSearchLanguage | ConfigLanguage | str | None. It would be clearer to make both branches behave the same—either always return enums when the value is an enum, or always normalize both branches to strings.
</issue_to_address>

Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 25, 2026

@bashandbone I've opened a new pull request, #277, to work on those changes. Once the pull request is ready, I'll request review from you.

…ter loading (#277)

* Initial plan

* fix: tighten types and move textcase to module scope in custom delimiter loading

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Agent-Logs-Url: https://github.com/knitli/codeweaver/sessions/0cc04151-09d9-4c2f-9eb2-9a1f233a187f

* Update src/codeweaver/engine/chunker/selector.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>

---------

Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1303 to +1307
ext_match = custom_delim.extensions and any(
_normalize(ext.language) == normalized_language
for ext in custom_delim.extensions
if hasattr(ext, "language") and ext.language is not None
)
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ext_match = custom_delim.extensions and any(...) can evaluate to the list itself (e.g., when extensions=[]), so ext_match isn’t consistently a boolean. Consider ext_match = bool(custom_delim.extensions) and any(...) for type safety and clearer intent.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedback

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of PR #275: Custom Delimiter Loading Implementation

I've reviewed this PR implementing custom delimiter loading with override behavior, type safety, and new-language detection. The implementation is well-structured with comprehensive tests, but there are several important issues that need to be addressed.

Critical Issues

1. CI Test Failures (Blocking)

All test jobs are failing with error: Failed to spawn: pytest - this appears to be an environment/dependency issue unrelated to your code changes, but it blocks verification that the new tests actually pass. The failures need to be resolved before this can be merged.

2. Language Priority Logic Issue (High Priority)

Location: src/codeweaver/engine/chunker/selector.py:255-264

There's a significant logic flaw in _match_custom_ext_pair() that the Copilot reviewer correctly identified. When an extension match is found, the method returns custom_delim.language first, even though the user explicitly mapped that extension to pair.language. This can cause incorrect language detection when the top-level language field differs from the per-extension language values.

Current logic (lines 255-264):

delim_lang = getattr(custom_delim, "language", None)
if delim_lang is not None:
    # Returns custom_delim.language even for extension-specific matches
    if isinstance(delim_lang, SemanticSearchLanguage | ConfigLanguage):
        return delim_lang
    return textcase.snake(str(delim_lang))
pair_lang = getattr(pair, "language", None)
# This only runs if delim_lang is None

Recommended fix: Prioritize pair.language over custom_delim.language for extension matches, as suggested in comment #2990624081. When matching by extension, the extension's declared language should take precedence.

Code Quality Issues

3. Missing Import for textcase in delimiter.py

Location: src/codeweaver/engine/chunker/delimiter.py:11

Good: You added import textcase at module scope (line 21).
Issue: The import is buried among other imports. For clarity and to prevent future "module not found" issues, consider adding a comment explaining this is needed for custom delimiter language normalization, or ensure it's grouped logically with other third-party imports.

4. Inconsistent Error Handling

Location: src/codeweaver/engine/chunker/delimiter.py:1314-1322

The error handling catches (ValueError, TypeError) when converting patterns, which is good defensive programming. However:

  • The warning message could be more actionable for users
  • Consider whether you should also catch other exceptions (e.g., AttributeError if the pattern structure is unexpected)
  • The current approach silently skips bad patterns; consider if there should be a config option to fail-fast on invalid custom delimiters

5. Type Safety Concerns

Location: Multiple places using getattr(custom_delim, ...)

The code uses getattr() with object types in several places (e.g., selector.py:244, 255, 260). While this provides flexibility, it sacrifices type safety. Consider:

  • Adding runtime type assertions
  • Using Protocol classes to define expected interfaces
  • Adding docstring warnings about the duck-typing approach

Test Quality

6. Comprehensive Test Coverage ✅

Location: tests/unit/engine/chunker/test_custom_delimiter_loading.py

Excellent test coverage with 8 focused unit tests covering:

  • Override/prepend behavior
  • Enum type comparison
  • Error handling for bad patterns
  • New language detection

The tests are well-structured with clear helper functions and good documentation.

Minor suggestion: Consider adding integration tests that verify the end-to-end behavior with actual file chunking, not just unit tests with mocks.

7. Test Markers

The tests use multiple markers: @pytest.mark.benchmark, @pytest.mark.mock_only, @pytest.mark.performance, @pytest.mark.unit.

Verify that:

  • These markers are necessary (e.g., are these actually benchmark/performance tests?)
  • The CI is configured to run tests with these markers

Documentation

8. Good Docstring Quality ✅

The docstrings for _load_custom_delimiters(), _detect_language_from_custom_ext(), and _match_custom_ext_pair() are clear and well-formatted with proper Args/Returns sections.

9. PR Description ✅

The PR description is excellent - it clearly explains the changes, provides context, and links to resolved issues.

Performance Considerations

10. Language Normalization Performance

Location: delimiter.py:1296-1299 and selector.py:259, 264

The _normalize() helper and direct textcase.snake() calls are executed in loops. For large numbers of custom delimiters:

  • Consider caching normalized language names
  • Profile to see if this becomes a bottleneck
  • The current implementation is probably fine for typical use cases (< 100 custom delimiters)

Security

11. Input Validation ✅

The code safely handles malformed input:

  • Checks for None values
  • Uses hasattr() before attribute access
  • Catches exceptions during pattern conversion

No security concerns identified.

Recommendations

Must Fix (Blocking):

  1. Resolve CI test failures
  2. Fix the language priority logic in _match_custom_ext_pair() as suggested in comment #2990624081

Should Fix (High Priority):
3. Add a test case specifically for the scenario where custom_delim.language differs from pair.language to verify the fix for #2
4. Consider more specific exception handling or user-facing error messages for invalid patterns

Nice to Have:
5. Add integration tests
6. Review test markers for accuracy
7. Consider performance optimizations if custom delimiter counts scale significantly

Conclusion

This is a solid implementation with good separation of concerns, comprehensive unit tests, and clear documentation. The core algorithm (prepending custom delimiters to override defaults, normalizing language types) is sound. However, the language priority bug and CI failures are blocking issues that must be resolved before merge.

Once the critical issues are addressed, this will be a valuable addition to the codeweaver chunking system.


Overall Assessment: Request Changes - address the language priority logic and CI failures before merging.

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review - PR #275: Custom Delimiter Loading Implementation

I've reviewed the changes in this PR and they look excellent overall. Here's my detailed analysis:

Strengths

  1. Clean Architecture: The extraction of _load_custom_delimiters() helper method is well-designed, keeping the code modular and maintainable.

  2. Type Safety: The normalization logic for SemanticSearchLanguage, ConfigLanguage, and plain language names is implemented correctly and consistently with the existing pattern in custom.py:694-704.

  3. Robust Error Handling: The try-except block around Delimiter.from_pattern() (lines 1314-1322) ensures that a single invalid pattern doesn't break the entire loading process - this is defensive programming at its best.

  4. Prepend Strategy: Custom delimiters are correctly prepended (line 1348-1350) to override built-in family patterns, which is the intended behavior.

  5. Extension Detection: The new _detect_language_from_custom_ext() and _match_custom_ext_pair() methods in selector.py enable detection of user-defined languages, which is a valuable feature.

  6. Comprehensive Tests: The test suite (test_custom_delimiter_loading.py) is excellent:

    • 8 focused unit tests covering all key scenarios
    • Tests for prepend/override behavior
    • Tests for type comparison (all enum variants)
    • Error handling verification
    • New-language detection
    • All tests are passing ✓

📊 Test Results

  • ✅ All 8 new custom delimiter tests passed
  • ✅ 1,446 total tests passed
  • ❌ 1 unrelated test failed: test_indexing_performance_with_real_providers
    • This is a flaky integration test with a FileNotFoundError in vector storage
    • Not related to the PR changes - this test doesn't touch delimiter loading

🔍 Code Quality Observations

  1. Consistency: The implementation matches the existing pattern in delimiters/custom.py (lines 685-714), using the same normalization approach with textcase.snake().

  2. Documentation: Excellent docstrings with clear parameter descriptions and return types.

  3. Logging: Appropriate use of logger.warning() for skipped patterns with helpful context.

  4. Defensive Coding: Good use of hasattr() and null checks when accessing settings attributes.

Performance & Security

  • Performance: No concerns. The custom delimiter loading is O(n*m) where n=custom delimiters and m=patterns per delimiter, which is expected and acceptable.
  • Security: No security issues detected. The code safely handles user input and validates patterns.

🎯 Recommendations

  1. Minor: Consider caching - If custom delimiters are loaded frequently for the same language, consider caching the normalized results (though this may already be handled at a higher level).

  2. The flaky test - The failing test_indexing_performance_with_real_providers should be addressed separately (not in this PR). It appears to be a race condition or cleanup issue in the test infrastructure.

Conclusion

This PR successfully implements the TODO at delimiter.py:1289-1290 with a well-designed, robust, and thoroughly tested solution. The code quality is high, follows existing patterns, and adds valuable functionality for custom delimiter support.

Recommendation: APPROVE

The single failing test is unrelated to these changes and should not block this PR.

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Response to Comment #2990624081 (Language Priority Logic)

I agree with the Copilot reviewer's suggestion on the _match_custom_ext_pair method. The current implementation has a logic flaw:

The Problem:
When an extension match is found, the method returns custom_delim.language first, even though the extension explicitly declares pair.language. This means:

CustomDelimiter(
    language="foo",
    extensions=[ExtLangPair(ext=".bar", language="baz")],
    delimiters=[...]
)

A .bar file would be detected as "foo" instead of "baz".

The Fix:
Prioritize pair.language over custom_delim.language for extension-based matches, as suggested. The warning for mismatches is also valuable for catching configuration errors.

Additional Recommendations:

  1. Apply the suggested code change at selector.py:255-264
  2. Add a test case validating this scenario (where extension language ≠ delimiter language)
  3. Import logger at module level in selector.py for the warning

This is a high-priority fix that should be addressed before merging.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 25, 2026 20:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if pair_lang is not None:
if (
delim_lang is not None
and str(delim_lang) != str(pair_lang)
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mismatch check uses str(delim_lang) != str(pair_lang), which can produce false positives (e.g., enum values often stringify as SemanticSearchLanguage.PYTHON vs 'python'). This can cause noisy warnings even when languages are semantically the same. Normalize both sides (e.g., using .variable/textcase.snake) before comparing, or drop the warning given the config model already validates consistency.

Suggested change
and str(delim_lang) != str(pair_lang)
and textcase.snake(str(delim_lang)) != textcase.snake(str(pair_lang))

Copilot uses AI. Check for mistakes.
Comment on lines +1342 to +1344
# Custom entries are prepended so they override the built-in family
# patterns for known languages and are the sole source for new languages.
delimiters: list[Delimiter] = self._load_custom_delimiters(
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says custom delimiters are “the sole source for new languages”, but the implementation still always appends LanguageFamily.UNKNOWN patterns (and additional generic fallbacks) for unknown languages. Either update the comment/docstring to match behavior, or change the logic to skip family patterns when the language is unknown and custom delimiters were provided.

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Copy Markdown
Contributor

Code Review - PR #275

I've reviewed the implementation of custom delimiter loading with override, type safety, and new-language detection. This is a well-structured PR that addresses important functionality gaps, but there are several issues that need to be resolved before approval.

Critical Issues

1. Test Failures (Blocking)

The PR currently has failing tests on Python 3.12, 3.13t, and 3.14. These need to be investigated and fixed before the PR can be merged. The tests are still running at the time of this review, but the failures need to be addressed.

2. Inconsistent Enum/String Handling in selector.py:_match_custom_ext_pair (High Priority)

At src/codeweaver/engine/chunker/selector.py:255-276, there's an inconsistency in how the function handles enum vs string return types:

  • When resolving from custom_delim.language, the code always converts to a snake-cased string, even for SemanticSearchLanguage or ConfigLanguage enums
  • When resolving from pair.language, the code preserves enum instances and only stringifies non-enum values

This violates the principle of least surprise and could lead to subtle bugs. The function signature promises SemanticSearchLanguage | ConfigLanguage | str | None, but the two code paths return different types for the same input enum types.

Recommendation: Make both branches consistent. Since the code at line 270-272 already handles enums correctly by checking isinstance(pair_lang, SemanticSearchLanguage | ConfigLanguage), apply the same pattern to delim_lang. The logic should be:

if pair_lang is not None:
    if delim_lang is not None and str(delim_lang) \!= str(pair_lang):
        logger.warning(...)
    if isinstance(pair_lang, SemanticSearchLanguage | ConfigLanguage):
        return pair_lang
    return textcase.snake(str(pair_lang))
if delim_lang is not None:
    if isinstance(delim_lang, SemanticSearchLanguage | ConfigLanguage):
        return delim_lang
    return textcase.snake(str(delim_lang))

3. Missing Null Safety Check (Medium Priority)

At src/codeweaver/engine/chunker/delimiter.py:1307-1311, the extension matching logic doesn't safely handle cases where ext.language might not exist:

ext_match = custom_delim.extensions and any(
    _normalize(ext.language) == normalized_language
    for ext in custom_delim.extensions
    if hasattr(ext, "language") and ext.language is not None
)

While hasattr and null checks are present, this could still raise if ext itself is unexpectedly None or malformed. Consider adding an outer check or using getattr with a default.

Positive Observations

  1. Excellent Test Coverage: The new test file test_custom_delimiter_loading.py provides comprehensive coverage of:

    • Custom delimiter prepending/override behavior
    • Type-correct language comparison across all enum types
    • Error handling for invalid patterns
    • New language detection via custom extensions
  2. Good Error Handling: The try-except wrapper around Delimiter.from_pattern() at delimiter.py:1314-1322 properly handles invalid patterns with logging, preventing a single bad pattern from breaking the entire system.

  3. Clean Separation of Concerns: Extracting _load_custom_delimiters() as a helper method keeps complexity manageable and makes the code more testable.

  4. Proper Documentation: Docstrings are clear, comprehensive, and include examples where appropriate.

  5. Correct Type Normalization: The _normalize() helper function at delimiter.py:1296-1299 properly handles the different language type variants using .variable for enums and textcase.snake() for strings.

Minor Issues

Code Style and Performance

  1. Import Location: Both _load_custom_delimiters and _detect_language_from_custom_ext import or use textcase at module scope (which is correct in the current code), so this is good.

  2. Type Annotations: The code uses proper type hints throughout, though there might be opportunities to tighten some object types to more specific types for better static checking.

  3. Pytest Markers: The test file uses multiple pytest markers (@pytest.mark.benchmark, @pytest.mark.mock_only, @pytest.mark.performance, @pytest.mark.unit) on each test class. Verify these are all necessary and correctly configured in the project's pytest setup.

Architecture and Design

The implementation correctly addresses the design requirements mentioned in the comments:

Requirement 1 (Override/Prepend): Custom delimiters are correctly prepended to override defaults
Requirement 2 (New Languages): ChunkerSelector can detect new languages via custom extension mappings
Requirement 3 (Error Handling): Invalid patterns are skipped with warnings
Requirement 4 (Type Comparison): Language type comparison is implemented (but needs the consistency fix mentioned above)
Requirement 5 (Test Coverage): Comprehensive tests are included

Recommendations

Before Merge:

  1. Fix failing tests - Must investigate and resolve test failures
  2. Fix enum/string inconsistency in _match_custom_ext_pair
  3. Run the full test suite locally to ensure all tests pass
  4. Consider adding a test case specifically for the enum preservation behavior to prevent regression

Nice to Have:

  • Add integration tests showing the full flow from custom delimiter definition to chunking
  • Consider adding validation for the custom delimiter settings at config load time
  • Document the priority/precedence of custom delimiters in user-facing documentation

Conclusion

This is a solid implementation that addresses an important TODO and extends functionality significantly. The code is well-structured, properly tested, and follows good engineering practices. However, the failing tests and the enum/string inconsistency are blocking issues that must be resolved before approval.

Once the critical issues are addressed, this PR will be ready for merge.


Status: ⏸️ Changes Requested
Estimated Effort to Fix: ~30-60 minutes

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This PR implements custom delimiter loading with good architecture and comprehensive testing. However, there is one critical type safety issue that must be addressed before merging.

🔴 Critical Issue: Type Safety Bug

Location: src/codeweaver/engine/chunker/delimiter.py:1307

The expression:

ext_match = custom_delim.extensions and any(...)

Can evaluate to the list itself when extensions=[], making ext_match inconsistently typed (could be list, bool, or result of any()).

Fix:

ext_match = bool(custom_delim.extensions) and any(
    _normalize(ext.language) == normalized_language
    for ext in custom_delim.extensions
    if hasattr(ext, "language") and ext.language is not None
)

This is the same issue noted in this feedback comment.

✅ Strengths

  1. Well-structured refactoring - Clean extraction of _load_custom_delimiters() helper
  2. Proper error handling - Individual pattern failures don't crash the system
  3. Type normalization - Correctly handles SemanticSearchLanguage, ConfigLanguage, and plain strings
  4. Comprehensive testing - 8 focused unit tests covering major scenarios
  5. Language detection extension - Clean implementation in selector.py

🟡 Minor Issues

  1. CI Tests Failing - Python 3.12, 3.13t, and 3.14 tests are failing. Need to investigate if related to this PR.
  2. Performance consideration - Repeated iteration over custom_delimiters. Consider caching for large lists.

Security & Code Quality

✅ No security concerns - proper input validation and error handling
✅ Follows codebase patterns - consistent use of textcase.snake() and .variable
✅ Good documentation - clear docstrings and test documentation

Recommendation

Conditional Approval - Fix the critical type safety issue, ensure CI tests pass, then this is ready to merge.

The implementation is well-designed and thorough, but the ext_match type issue must be resolved for correctness.

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: Custom Delimiter Loading with Override, Type Safety, and New-Language Detection

I've completed a thorough review of PR #275. This is a well-structured implementation that properly addresses the TODO at delimiter.py:1289-1290 and extends the custom delimiter system with comprehensive type safety and new language detection capabilities.

Strengths

1. Excellent Implementation Design

  • ✅ Custom delimiters are correctly prepended to override built-in family patterns (delimiter.py:1348)
  • ✅ Proper separation of concerns with _load_custom_delimiters() helper method
  • ✅ Graceful error handling prevents a single bad pattern from breaking the entire language
  • ✅ ChunkerSelector correctly detects custom extensions for new languages

2. Type Safety & Normalization

  • ✅ Consistent use of textcase.snake() for language normalization matching the pattern in custom.py:694-704
  • ✅ Proper handling of SemanticSearchLanguage, ConfigLanguage, and plain LanguageName types
  • ✅ Correct enum property access via .variable (delimiter.py:1298)

3. Comprehensive Test Coverage

  • ✅ 8 focused unit tests covering all major scenarios
  • ✅ Tests verify prepend/override behavior, type comparisons, error handling, and custom extension detection
  • ✅ Good use of mocking and caplog for testing warnings

Issue Found: Language Mismatch Warning

Location: selector.py:259-269

The comparison str(delim_lang) \!= str(pair_lang) can produce false positives because enum values stringify differently than their underlying values:

# Current behavior:
str(SemanticSearchLanguage.PYTHON)  # Returns "SemanticSearchLanguage.PYTHON"
str("python")  # Returns "python"
# These would be considered "different" even though semantically they're the same

Recommendation: Apply the normalization function before comparison to avoid spurious warnings:

def _normalize(lang: object) -> str:
    if isinstance(lang, SemanticSearchLanguage | ConfigLanguage):
        return textcase.snake(lang.variable)
    return textcase.snake(str(lang))

# Then in the comparison:
if (
    delim_lang is not None
    and _normalize(delim_lang) \!= _normalize(pair_lang)
):
    logger.warning(...)

This matches the normalization pattern already used in delimiter.py:1296-1299 and custom.py:699-704.

Minor Observations

  1. Consistency: The _normalize() helper is defined locally in _load_custom_delimiters(). Consider extracting it to module scope (like textcase import) to reuse in _match_custom_ext_pair() for the mismatch check.

  2. Test Status: Some tests are failing but the run is still in progress. The test file itself looks well-structured.

  3. Code Style: Excellent adherence to the project's style - proper typing, clear docstrings, and consistent formatting.

Recommendation

This is solid work that properly implements the feature. The only issue is the false-positive warning scenario. I recommend:

  1. Fix the language mismatch comparison in selector.py:261 using normalized values
  2. Wait for CI to complete to verify all tests pass
  3. Consider extracting the normalization logic to reduce duplication

Once the mismatch comparison is fixed, this PR should be ready to merge.

Code Quality: ⭐⭐⭐⭐ (4/5)

  • Deducting one star only for the mismatch comparison issue
  • Otherwise excellent implementation with proper error handling, type safety, and comprehensive tests

Reviewed with Claude Code

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review - Issue with Generic Fallback Patterns for New Languages

Thank you for this comprehensive implementation of custom delimiter loading! The overall design is solid, with good error handling, type safety, and thorough test coverage. However, there is a legitimate inconsistency identified in comment #2990743944 that should be addressed.


Critical Issue: Documentation vs. Implementation Mismatch

Location: src/codeweaver/engine/chunker/delimiter.py:1346-1347

The Problem:
The comment states that custom delimiters are "the sole source for new languages":

# Custom entries are prepended so they override the built-in family
# patterns for known languages and are the sole source for new languages.
delimiters: list[Delimiter] = self._load_custom_delimiters(
    normalized_language, language
)

However, the implementation always appends family patterns, even for unknown languages:

# Load from delimiter families system
family = LanguageFamily.from_known_language(language)  # Returns UNKNOWN for new languages
patterns = get_family_patterns(family)  # Returns generic fallback patterns for UNKNOWN

# Convert patterns to delimiters
for pattern in patterns:
    delimiters.extend(Delimiter.from_pattern(pattern))  # Always appends

When a truly new language (e.g., "mylang") is used:

  1. LanguageFamily.from_known_language("mylang") returns LanguageFamily.UNKNOWN
  2. get_family_patterns(LanguageFamily.UNKNOWN) returns generic fallback patterns (PARAGRAPH_PATTERN, BRACE_BLOCK_PATTERN, etc. per families.py:371-376)
  3. These patterns are appended to the custom delimiters

Impact:

  • For new custom languages, users get both their custom delimiters AND generic fallback patterns, not just their custom delimiters as documented
  • This might be unintended behavior if users want precise control over chunking for their custom language

Suggested Fix

You have two options to align code with documentation:

Option 1: Skip family patterns for unknown languages with custom delimiters

# Load from delimiter families system
family = LanguageFamily.from_known_language(language)

# Only load family patterns if:
# - Language is known (not UNKNOWN), OR
# - No custom delimiters were provided
if family != LanguageFamily.UNKNOWN or not delimiters:
    patterns = get_family_patterns(family)
    for pattern in patterns:
        delimiters.extend(Delimiter.from_pattern(pattern))

Option 2: Update documentation to reflect current behavior

# Custom entries are prepended so they override the built-in family
# patterns for known languages. For unknown languages, custom delimiters
# are combined with generic fallback patterns from LanguageFamily.UNKNOWN.

Recommendation: Option 1 seems more aligned with the design intent of allowing users to define new languages entirely via custom delimiters. However, Option 2 might be safer if the generic fallback patterns are desired as a safety net.


Positive Aspects

  1. Excellent type safety: The normalization logic in _normalize() correctly handles SemanticSearchLanguage, ConfigLanguage, and plain strings using textcase.snake() - this matches the pattern used elsewhere in the codebase

  2. Robust error handling: The try-except wrapper around Delimiter.from_pattern() with warning logs ensures one bad pattern doesn't break the entire language configuration

  3. Well-structured extraction: Moving custom delimiter loading into _load_custom_delimiters() keeps complexity manageable and improves testability

  4. Comprehensive language detection: The ChunkerSelector extensions (_detect_language_from_custom_ext and _match_custom_ext_pair) properly enable new language detection via custom extension mappings

  5. Thorough test coverage: The test suite covers prepend/override behavior, all enum type variants, error handling, and new language detection

  6. Good logging: Warning messages for mismatched extension languages (selector.py:263-269) help users debug configuration issues


Minor Observations

  1. Performance: The nested loops in _load_custom_delimiters() and _match_custom_ext_pair() should be fine for typical use cases, but might be worth profiling if custom_delimiters lists become very large

  2. Test markers: The test file uses multiple pytest marks (@pytest.mark.benchmark, @pytest.mark.performance, etc.) - ensure these align with your CI/CD pipeline expectations

  3. CI Failures: The PR shows test failures on multiple Python versions (3.12, 3.13t, 3.14). These should be investigated, though they may be unrelated to this specific issue


Recommendation

Request changes to address the documentation/implementation mismatch. Once that's resolved and CI passes, this will be a solid contribution that properly implements custom delimiter loading with good engineering practices.

Please clarify the intended behavior for unknown languages with custom delimiters and update either the code or documentation accordingly.

@bashandbone
Copy link
Copy Markdown
Contributor Author

@copilot reference this feedback: #275 (review)

It's good feedback but doesn't precisely reflect the design intent here. Please align your work with the intended design.

The intent:

  • Providing custom delimiters is a prerequisite for adding a new language to codeweaver. If users don't provide custom delimiters -- then they must provide a family. Please double check the config handling in engine.config. This was the original intent and it should align with this (if it doesn't it needs correcting).
  • A user should not be able to add a language or extension type to codeweaver without providing either delimiters or a language family.
  • We don't index unknown data -- it must be explicitly defined because the alternative is a recipe for context noise.

@bashandbone bashandbone merged commit f6952a6 into main Mar 27, 2026
29 of 35 checks passed
@bashandbone bashandbone deleted the claude/issue-99-20260319-0103 branch March 27, 2026 01:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants