SNOW-3440288: Enhance schema string parser for quotes#4206
SNOW-3440288: Enhance schema string parser for quotes#4206sfc-gh-wshangguan wants to merge 3 commits intomainfrom
Conversation
| # "a""b" is the 7-char span 0..6 inclusive; index past it is 7 | ||
| s = '"a""b" rest' | ||
| assert _scan_quoted_identifier(s, 0) == 6 |
There was a problem hiding this comment.
The test comment contains an error. The comment states "a""b" is a "7-char span 0..6 inclusive" with "index past it is 7", but:
"a""b"is 6 characters (positions 0-5):",a,",",b,"- The index just past it is 6 (not 7)
The assertion assert _scan_quoted_identifier(s, 0) == 6 is correct, but the comment should read:
# "a""b" is a 6-char span (positions 0-5); index just past it is 6While this is only a comment error and won't cause production failures, it could confuse future maintainers debugging this code.
| # "a""b" is the 7-char span 0..6 inclusive; index past it is 7 | |
| s = '"a""b" rest' | |
| assert _scan_quoted_identifier(s, 0) == 6 | |
| # "a""b" is a 6-char span (positions 0-5); index just past it is 6 | |
| s = '"a""b" rest' | |
| assert _scan_quoted_identifier(s, 0) == 6 | |
Spotted by Graphite
Is this helpful? React 👍 or 👎 to let us know.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4206 +/- ##
==========================================
- Coverage 95.42% 95.04% -0.38%
==========================================
Files 171 171
Lines 43801 43835 +34
Branches 7505 7513 +8
==========================================
- Hits 41795 41665 -130
- Misses 1226 1345 +119
- Partials 780 825 +45 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
can you also run this change against SCOS's regression test? |
|
|
||
| Raises ``ValueError`` if the closing quote is missing. | ||
| """ | ||
| assert s[start] == '"' |
There was a problem hiding this comment.
is this assert necessary? Is there a case that this function is called on a string that its first character is not double quote?
|
|
||
|
|
||
| def _split_object_field(field_def: str) -> Tuple[str, str]: | ||
| """Split a single OBJECT field definition into ``(name_token, remainder)``. |
There was a problem hiding this comment.
is it possible that a malformed field_def like "a NUM"BER reach here? Or is this already handled in the upstream?
There was a problem hiding this comment.
Yes, it's possible. Added a test that we raise exception with clear error message.
| if s[i] == '"': | ||
| if i + 1 < len(s) and s[i + 1] == '"': | ||
| i += 2 # escaped "" inside the name; keep scanning | ||
| continue | ||
| return i + 1 # index just past the closing quote |
There was a problem hiding this comment.
nit: prefer writing like this so it's slightly easier to tell what this check is doing
| if s[i] == '"': | |
| if i + 1 < len(s) and s[i + 1] == '"': | |
| i += 2 # escaped "" inside the name; keep scanning | |
| continue | |
| return i + 1 # index just past the closing quote | |
| if s[i:i + 1] == '""': # check for a "" escape sequence in the name | |
| i += 2 | |
| continue | |
| elif s[i] == '"': | |
| # found closing quote, return the index just past it | |
| return i + 1 # index just past the closing quote |
sfc-gh-yuwang
left a comment
There was a problem hiding this comment.
LGTM, please check SCOS regression test before merge
6a2cf59 to
5857d24
Compare
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-3440288
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Problem
concrete examples of INFER_SCHEMA outputs that the existing parser broke on (space / comma / paren / mixed-case inside "..." field names) and why those names need quoting per Snowflake's identifier grammar.
Solution
the two new helpers (_scan_quoted_identifier, _split_object_field) and the three updated callers (split_top_level_comma_fields, _extract_paren_content, OBJECT branch of _sf_type_to_type_object), referencing the server-side SFSqlLexer.g / SqlIdentifierUtils.java grammar that pins the "" escape.
Backward compatibility
bare names still take the original split path; non-OBJECT structured strings are unchanged.