Hardening: make arrow-precedence preprocessing robust to literals/comments #73

@ndenev

Summary

fix_arrow_precedence performs regex-based rewriting over raw SQL text. This works for common cases but is structurally brittle because it is not token/AST-aware.

Affected code

  • src/datafusion_integration/preprocess.rs (fix_arrow_precedence, LEFT_ARROW_PATTERN, RIGHT_ARROW_PATTERN)

Current status

  • Existing tests cover many happy/idempotency paths.
  • Quick live checks with simple probes did not reproduce corruption of literals or comments.
  • This is filed as hardening (low priority), not a confirmed semantic bug.

Why track this

Regex transforms over raw SQL can accidentally match inside contexts that should be inert (string literals, comments) or miss unusual syntax forms. Even if the current behavior is correct, nothing structural guarantees it, so explicit guardrails are worth adding.
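To make the failure mode concrete, here is a minimal std-only sketch contrasting a plain text search with a scanner that skips inert regions. It is not the code in preprocess.rs: the function names are hypothetical, `->` is only assumed to be one of the arrow forms the real patterns target, and the scanner deliberately ignores nested block comments and double-quoted identifiers.

```rust
/// Naive search: byte offset of every `->` in the text, regardless of context.
fn naive_arrow_offsets(sql: &str) -> Vec<usize> {
    sql.match_indices("->").map(|(i, _)| i).collect()
}

/// Context-aware search: byte offsets of `->` that lie outside single-quoted
/// literals, `--` line comments, and `/* ... */` block comments.
/// Scanning bytes is safe here because every delimiter is ASCII.
fn arrow_offsets_outside_inert(sql: &str) -> Vec<usize> {
    let b = sql.as_bytes();
    let mut hits = Vec::new();
    let mut i = 0;
    while i < b.len() {
        match b[i] {
            b'\'' => {
                // Single-quoted literal; a doubled quote ('') is an escape.
                i += 1;
                while i < b.len() {
                    if b[i] == b'\'' {
                        if b.get(i + 1) == Some(&b'\'') {
                            i += 2;
                            continue;
                        }
                        break;
                    }
                    i += 1;
                }
                i += 1; // step past the closing quote
            }
            b'-' if b.get(i + 1) == Some(&b'-') => {
                // Line comment: skip to end of line.
                while i < b.len() && b[i] != b'\n' {
                    i += 1;
                }
            }
            b'/' if b.get(i + 1) == Some(&b'*') => {
                // Block comment: skip to the closing `*/` (no nesting).
                i += 2;
                while i + 1 < b.len() && !(b[i] == b'*' && b[i + 1] == b'/') {
                    i += 1;
                }
                i = (i + 2).min(b.len());
            }
            b'-' if b.get(i + 1) == Some(&b'>') => {
                hits.push(i);
                i += 2;
            }
            _ => i += 1,
        }
    }
    hits
}
```

For an input such as `SELECT a -> b, 'x -> y' FROM t -- c -> d`, the naive search reports three matches while the scanner reports only the first, which is the gap a regex over raw text has to close by other means.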

Proposed improvements

  1. Add explicit regression tests for:
    • arrow-like text in single-quoted literals
    • arrow-like text in -- and /* ... */ comments
    • unusual whitespace/operator layouts
  2. If edge-case breakage appears, migrate to tokenizer- or AST-aware rewriting for precedence wrapping.
  3. Keep idempotency guarantees: running the rewrite on already-rewritten SQL must leave it unchanged.
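The regression tests in item 1 and the idempotency guarantee in item 3 could be expressed as two small properties. The sketch below is illustrative only: it assumes the rewrite can be called as a function from &str to String (the real signature of fix_arrow_precedence is not shown in this issue), it uses `->` and `<-` as stand-ins for whatever the real patterns match, and it assumes that input whose only arrow-like text is inside a literal or comment should come back unchanged. All names are hypothetical.

```rust
/// Inputs whose only arrow-like text sits inside an inert context.
/// Assumption: `->` and `<-` stand in for the real arrow forms.
const INERT_CASES: &[&str] = &[
    "SELECT 'a -> b' FROM t",
    "SELECT 'a <- b' FROM t",
    "SELECT 1 -- a -> b\nFROM t",
    "SELECT /* a <- b */ 1 FROM t",
    "SELECT 'it''s -> fine' FROM t",
];

/// Inputs with unusual whitespace/operator layouts, used for idempotency checks.
const LAYOUT_CASES: &[&str] = &[
    "SELECT a->b FROM t",
    "SELECT a   ->   b FROM t",
    "SELECT a\n->\nb FROM t",
    "SELECT a\t<-\tb FROM t",
];

/// Returns the inert cases that `rewrite` modified; the expected result is empty.
fn changed_inert_cases(rewrite: impl Fn(&str) -> String) -> Vec<&'static str> {
    INERT_CASES
        .iter()
        .copied()
        .filter(|sql| rewrite(*sql) != *sql)
        .collect()
}

/// True when applying `rewrite` twice yields the same text as applying it once.
fn is_idempotent(rewrite: impl Fn(&str) -> String, sql: &str) -> bool {
    let once = rewrite(sql);
    rewrite(&once) == once
}

/// Returns the layout cases for which `rewrite` is not idempotent.
fn non_idempotent_layouts(rewrite: impl Fn(&str) -> String) -> Vec<&'static str> {
    LAYOUT_CASES
        .iter()
        .copied()
        .filter(|sql| !is_idempotent(&rewrite, sql))
        .collect()
}
```

In a real test module these helpers would be called with the actual rewrite function, asserting that both returned lists are empty. Returning the offending inputs rather than a bool keeps failure messages specific.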

Related issue

A related selector-side robustness problem (validated in a live run) is tracked here:

That issue is about selector representability/validation; this issue is about SQL preprocessing robustness.
