SNOW-3440288: Enhance schema string parser for quotes #4206

Open
sfc-gh-wshangguan wants to merge 3 commits into main from wshangguan-SNOW-3440288-enhance-schema-string-parser-for-quotes

@sfc-gh-wshangguan sfc-gh-wshangguan commented Apr 29, 2026

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-3440288

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
  3. Please describe how your code solves the related issue.

Problem

The existing parser broke on INFER_SCHEMA outputs whose "..." field names contain a space, comma, parenthesis, or mixed-case letters. Such names need quoting per Snowflake's identifier grammar.
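
To illustrate the failure mode described above, here is a minimal sketch of what quote-unaware splitting does to such names. The type string and its field names are hypothetical and are not taken from the PR's tests.

```python
# Hypothetical structured-type string; the field names are illustrative only.
type_str = 'OBJECT("first name" VARCHAR, "a,b" NUMBER)'
inner = type_str[len("OBJECT("):-1]

# Splitting on every comma cuts the quoted name "a,b" in half.
naive_fields = [f.strip() for f in inner.split(",")]
# naive_fields == ['"first name" VARCHAR', '"a', 'b" NUMBER']

# Splitting name from type on the first space cuts "first name" in half.
naive_name, naive_type = naive_fields[0].split(" ", 1)
# naive_name == '"first', naive_type == 'name" VARCHAR'
```

Both splits need to treat everything between the opening and closing double quote as part of the name.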

Solution

Two new helpers (_scan_quoted_identifier, _split_object_field) and three updated callers (split_top_level_comma_fields, _extract_paren_content, and the OBJECT branch of _sf_type_to_type_object). The logic references the server-side SFSqlLexer.g / SqlIdentifierUtils.java grammar, which pins the "" escape.

Backward compatibility

Bare names still take the original split path; non-OBJECT structured strings are unchanged.
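
As a rough illustration of a quote-aware top-level split, the sketch below tracks whether the scan is inside a "..." name and how deeply it is nested in parentheses. The function name split_top_level_commas_sketch is hypothetical, and this is not the PR's split_top_level_comma_fields; it only shows the general idea under those assumptions.

```python
from typing import List


def split_top_level_commas_sketch(s: str) -> List[str]:
    """Split on commas that are outside "..." names and outside parentheses.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    fields: List[str] = []
    depth = 0          # current parenthesis nesting outside quoted names
    in_quotes = False  # True while inside a "..." name
    start = 0
    for i, ch in enumerate(s):
        if ch == '"':
            # A doubled "" toggles twice, so the quoted state is unchanged.
            in_quotes = not in_quotes
        elif not in_quotes:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                fields.append(s[start:i].strip())
                start = i + 1
    if in_quotes:
        raise ValueError(f"missing closing quote in {s!r}")
    fields.append(s[start:].strip())
    return fields


fields = split_top_level_commas_sketch(
    '"a,b" NUMBER(38, 0), "c(d" VARCHAR, e OBJECT("x y" INT, z INT)'
)
bare = split_top_level_commas_sketch("a INT, b VARCHAR")
```

Bare names pass through the same loop without ever setting in_quotes, which is one way to keep unquoted input behaving as before.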

Comment on lines +109 to +111
# "a""b" is the 7-char span 0..6 inclusive; index past it is 7
s = '"a""b" rest'
assert _scan_quoted_identifier(s, 0) == 6
Contributor

The test comment contains an error. The comment states "a""b" is a "7-char span 0..6 inclusive" with "index past it is 7", but:

  • "a""b" is 6 characters (positions 0-5): ", a, ", ", b, "
  • The index just past it is 6 (not 7)

The assertion assert _scan_quoted_identifier(s, 0) == 6 is correct, but the comment should read:

# "a""b" is a 6-char span (positions 0-5); index just past it is 6

While this is only a comment error and won't cause production failures, it could confuse future maintainers debugging this code.

Suggested change
- # "a""b" is the 7-char span 0..6 inclusive; index past it is 7
- s = '"a""b" rest'
- assert _scan_quoted_identifier(s, 0) == 6
+ # "a""b" is a 6-char span (positions 0-5); index just past it is 6
+ s = '"a""b" rest'
+ assert _scan_quoted_identifier(s, 0) == 6
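
The index arithmetic in this thread can be checked directly with plain string operations, without relying on the PR's helper:

```python
s = '"a""b" rest'
quoted = '"a""b"'  # the quoted identifier at the start of s

# Enumerate each character with its position: 0..5, six characters in total.
positions = list(enumerate(quoted))

# The index just past the closing quote equals the length of the span.
end = len(quoted)
```

Here end is 6, s[:end] is the quoted identifier, and s[end] is the space that follows it.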

Spotted by Graphite


codecov-commenter commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.04%. Comparing base (75260b9) to head (5857d24).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4206      +/-   ##
==========================================
- Coverage   95.42%   95.04%   -0.38%     
==========================================
  Files         171      171              
  Lines       43801    43835      +34     
  Branches     7505     7513       +8     
==========================================
- Hits        41795    41665     -130     
- Misses       1226     1345     +119     
- Partials      780      825      +45     


@sfc-gh-wshangguan sfc-gh-wshangguan changed the title from "first attemp with tests" to "SNOW-3440288: Enhance schema string parser for quotes" Apr 30, 2026
@sfc-gh-wshangguan sfc-gh-wshangguan marked this pull request as ready for review April 30, 2026 22:00
@sfc-gh-yuwang
Collaborator

Can you also run this change against SCOS's regression tests?


Raises ``ValueError`` if the closing quote is missing.
"""
assert s[start] == '"'
Collaborator

Is this assert necessary? Is there a case where this function is called on a string whose first character is not a double quote?

Collaborator Author

Changed.



def _split_object_field(field_def: str) -> Tuple[str, str]:
"""Split a single OBJECT field definition into ``(name_token, remainder)``.
Collaborator

Is it possible for a malformed field_def like "a NUM"BER to reach here? Or is this already handled upstream?

Collaborator Author

Yes, it's possible. Added a test verifying that we raise an exception with a clear error message.
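
For readers following the thread, here is one way such a malformed field could be rejected. This is only a sketch: it swaps in a regular expression for the quoted name rather than the PR's scanner, and the function name, error type, and messages are assumptions.

```python
import re
from typing import Tuple

# A quoted name: opening ", then any run of non-quote characters or "" escapes, then closing ".
_QUOTED_NAME = re.compile(r'"(?:[^"]|"")*"')


def split_field_sketch(field_def: str) -> Tuple[str, str]:
    """Split 'name TYPE' into (name_token, remainder). Hypothetical helper."""
    field_def = field_def.strip()
    if field_def.startswith('"'):
        match = _QUOTED_NAME.match(field_def)
        if match is None:
            raise ValueError(f"missing closing quote in field definition {field_def!r}")
        name, rest = match.group(0), field_def[match.end():]
        # After the closing quote there must be whitespace before the type.
        if not rest[:1].isspace():
            raise ValueError(
                f"expected whitespace and a type after quoted name in {field_def!r}"
            )
        return name, rest.strip()
    # Bare names: split on the first space, as a quote-unaware parser would.
    name, _, rest = field_def.partition(" ")
    return name, rest.strip()


ok = split_field_sketch('"first name" VARCHAR')
try:
    split_field_sketch('"a NUM"BER')
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

With this sketch, "a NUM"BER is rejected because BER follows the closing quote with no separating whitespace.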

Comment on lines +1436 to +1440
if s[i] == '"':
    if i + 1 < len(s) and s[i + 1] == '"':
        i += 2  # escaped "" inside the name; keep scanning
        continue
    return i + 1  # index just past the closing quote
Contributor

nit: prefer writing like this so it's slightly easier to tell what this check is doing

Suggested change
- if s[i] == '"':
-     if i + 1 < len(s) and s[i + 1] == '"':
-         i += 2  # escaped "" inside the name; keep scanning
-         continue
-     return i + 1  # index just past the closing quote
+ if s[i:i + 2] == '""':  # check for a "" escape sequence in the name
+     i += 2
+     continue
+ elif s[i] == '"':
+     # found closing quote; return the index just past it
+     return i + 1
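
A self-contained sketch of the scanning loop in the slice-based style may help when comparing the two forms. The function name is hypothetical, and raising ValueError for a non-quote start is an assumption about what the assert was changed to. Note that the comparison against '""' needs a two-character slice, s[i:i + 2]; a one-character slice such as s[i:i + 1] can never equal a two-character string.

```python
def scan_quoted_identifier_sketch(s: str, start: int) -> int:
    """Return the index just past the closing quote of the name opening at s[start].

    Hypothetical stand-in for the PR's _scan_quoted_identifier.
    """
    if s[start:start + 1] != '"':
        raise ValueError(f"expected opening quote at index {start} in {s!r}")
    i = start + 1
    while i < len(s):
        if s[i:i + 2] == '""':  # escaped "" inside the name; keep scanning
            i += 2
            continue
        if s[i] == '"':
            return i + 1  # index just past the closing quote
        i += 1
    raise ValueError(f"missing closing quote in {s!r}")


end_index = scan_quoted_identifier_sketch('"a""b" rest', 0)
```

Slicing also avoids the explicit i + 1 < len(s) bounds check, because an out-of-range slice yields a shorter string instead of raising IndexError.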

Collaborator

@sfc-gh-yuwang sfc-gh-yuwang left a comment

LGTM, please check SCOS regression test before merge

@sfc-gh-wshangguan sfc-gh-wshangguan force-pushed the wshangguan-SNOW-3440288-enhance-schema-string-parser-for-quotes branch from 6a2cf59 to 5857d24 Compare May 1, 2026 23:48
4 participants