-
Notifications
You must be signed in to change notification settings - Fork 629
Expand parse without semicolons #1949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
4b4f5a1
7c17355
b83c245
53d7930
0026bb7
75cb98e
bbd32ed
4bde5b4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -1036,8 +1036,14 @@ pub trait Dialect: Debug + Any { | |||||||||||||||||||
/// Returns true if the specified keyword should be parsed as a table factor alias. | ||||||||||||||||||||
/// When explicit is true, the keyword is preceded by an `AS` word. Parser is provided | ||||||||||||||||||||
/// to enable looking ahead if needed. | ||||||||||||||||||||
fn is_table_factor_alias(&self, explicit: bool, kw: &Keyword, parser: &mut Parser) -> bool { | ||||||||||||||||||||
explicit || self.is_table_alias(kw, parser) | ||||||||||||||||||||
/// | ||||||||||||||||||||
/// When the dialect supports statements without semicolon delimiter, actual keywords aren't parsed as aliases. | ||||||||||||||||||||
fn is_table_factor_alias(&self, explicit: bool, kw: &Keyword, _parser: &mut Parser) -> bool { | ||||||||||||||||||||
if self.supports_statements_without_semicolon_delimiter() { | ||||||||||||||||||||
kw == &Keyword::NoKeyword | ||||||||||||||||||||
} else { | ||||||||||||||||||||
explicit || self.is_table_alias(kw, _parser) | ||||||||||||||||||||
} | ||||||||||||||||||||
} | ||||||||||||||||||||
|
||||||||||||||||||||
/// Returns true if this dialect supports querying historical table data | ||||||||||||||||||||
|
@@ -1136,6 +1142,18 @@ pub trait Dialect: Debug + Any { | |||||||||||||||||||
fn supports_notnull_operator(&self) -> bool { | ||||||||||||||||||||
false | ||||||||||||||||||||
} | ||||||||||||||||||||
|
||||||||||||||||||||
/// Returns true if the dialect supports parsing statements without a semicolon delimiter. | ||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would it be possible to give an example here? Something like
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done 👍 |
||||||||||||||||||||
/// | ||||||||||||||||||||
/// If returns true, the following SQL will not parse. If returns `false` the SQL will parse | ||||||||||||||||||||
/// | ||||||||||||||||||||
/// ```sql | ||||||||||||||||||||
/// SELECT 1 | ||||||||||||||||||||
/// SELECT 2 | ||||||||||||||||||||
/// ``` | ||||||||||||||||||||
fn supports_statements_without_semicolon_delimiter(&self) -> bool { | ||||||||||||||||||||
false | ||||||||||||||||||||
} | ||||||||||||||||||||
} | ||||||||||||||||||||
|
||||||||||||||||||||
/// This represents the operators for which precedence must be defined | ||||||||||||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -67,7 +67,7 @@ impl Dialect for MsSqlDialect { | |
} | ||
|
||
fn supports_connect_by(&self) -> bool { | ||
true | ||
false | ||
} | ||
|
||
fn supports_eq_alias_assignment(&self) -> bool { | ||
|
@@ -123,6 +123,10 @@ impl Dialect for MsSqlDialect { | |
true | ||
} | ||
|
||
fn supports_statements_without_semicolon_delimiter(&self) -> bool { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🤦 MS SQL!!!! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🤷♂️🤣 |
||
true | ||
} | ||
|
||
/// See <https://learn.microsoft.com/en-us/sql/relational-databases/security/authentication-access/server-level-roles> | ||
fn get_reserved_grantees_types(&self) -> &[GranteesType] { | ||
&[GranteesType::Public] | ||
|
@@ -280,6 +284,9 @@ impl MsSqlDialect { | |
) -> Result<Vec<Statement>, ParserError> { | ||
let mut stmts = Vec::new(); | ||
loop { | ||
while let Token::SemiColon = parser.peek_token_ref().token { | ||
parser.advance_token(); | ||
} | ||
if let Token::EOF = parser.peek_token_ref().token { | ||
break; | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -266,6 +266,22 @@ impl ParserOptions { | |
self.unescape = unescape; | ||
self | ||
} | ||
|
||
/// Set if semicolon statement delimiters are required. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the usecase to have the same configuration information on the It seems there are a few places in the Parser that inconsistently check the options and/or the dialect flag. If we had only one way to specify the option, there would not be any room for inconsistencies Or maybe I misunderstand what is going on here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It's similar to the existing
It seems like testing parsing without semicolons would get a lot worse if it wasn't a dialect option There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah yeah the My current thinking is be that ideally we have one way to configure the semicolon option since it would be nice to not have two knobs for the flag. In terms of which way to configure, I'm guessing it could make more sense to have the flag as a parser setting - i.e. whether or not semicolon is required is more reflecting the tool that originally processed the input file, vs an aspect of the sql language/dialect itself, so that it can differ in certain contexts for the same dialect. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This seems reasonable to me in terms of philosophy. However, I don't know how that could be implemented practically. I.e., how would you write that code in this project to be able to have the same test coverage? The benefit of setting the parse option based on the dialect configuration is:
|
||
/// | ||
/// If this option is `true`, the following SQL will not parse. If the option is `false`, the SQL will parse. | ||
/// | ||
/// ```sql | ||
/// SELECT 1 | ||
/// SELECT 2 | ||
/// ``` | ||
pub fn with_require_semicolon_stmt_delimiter( | ||
mut self, | ||
require_semicolon_stmt_delimiter: bool, | ||
) -> Self { | ||
self.require_semicolon_stmt_delimiter = require_semicolon_stmt_delimiter; | ||
self | ||
} | ||
} | ||
|
||
#[derive(Copy, Clone)] | ||
|
@@ -362,7 +378,11 @@ impl<'a> Parser<'a> { | |
state: ParserState::Normal, | ||
dialect, | ||
recursion_counter: RecursionCounter::new(DEFAULT_REMAINING_DEPTH), | ||
options: ParserOptions::new().with_trailing_commas(dialect.supports_trailing_commas()), | ||
options: ParserOptions::new() | ||
.with_trailing_commas(dialect.supports_trailing_commas()) | ||
.with_require_semicolon_stmt_delimiter( | ||
!dialect.supports_statements_without_semicolon_delimiter(), | ||
), | ||
} | ||
} | ||
|
||
|
@@ -484,13 +504,18 @@ impl<'a> Parser<'a> { | |
|
||
match self.peek_token().token { | ||
Token::EOF => break, | ||
|
||
// end of statement | ||
Token::Word(word) => { | ||
if expecting_statement_delimiter && word.keyword == Keyword::END { | ||
break; | ||
} | ||
} | ||
// don't expect a semicolon statement delimiter after a newline when not otherwise required | ||
Token::Whitespace(Whitespace::Newline) => { | ||
if !self.options.require_semicolon_stmt_delimiter { | ||
expecting_statement_delimiter = false; | ||
} | ||
} | ||
_ => {} | ||
} | ||
|
||
|
@@ -500,7 +525,7 @@ impl<'a> Parser<'a> { | |
|
||
let statement = self.parse_statement()?; | ||
stmts.push(statement); | ||
expecting_statement_delimiter = true; | ||
expecting_statement_delimiter = self.options.require_semicolon_stmt_delimiter; | ||
} | ||
Ok(stmts) | ||
} | ||
|
@@ -4541,6 +4566,14 @@ impl<'a> Parser<'a> { | |
return Ok(vec![]); | ||
} | ||
|
||
if end_token == Token::SemiColon && !self.options.require_semicolon_stmt_delimiter { | ||
if let Token::Word(ref kw) = self.peek_token().token { | ||
if kw.keyword != Keyword::NoKeyword { | ||
return Ok(vec![]); | ||
} | ||
} | ||
} | ||
|
||
if self.options.trailing_commas && self.peek_tokens() == [Token::Comma, end_token] { | ||
let _ = self.consume_token(&Token::Comma); | ||
return Ok(vec![]); | ||
|
@@ -4558,6 +4591,9 @@ impl<'a> Parser<'a> { | |
) -> Result<Vec<Statement>, ParserError> { | ||
let mut values = vec![]; | ||
loop { | ||
// ignore empty statements (between successive statement delimiters) | ||
while self.consume_token(&Token::SemiColon) {} | ||
|
||
match &self.peek_nth_token_ref(0).token { | ||
Token::EOF => break, | ||
Token::Word(w) => { | ||
|
@@ -4569,7 +4605,13 @@ impl<'a> Parser<'a> { | |
} | ||
|
||
values.push(self.parse_statement()?); | ||
self.expect_token(&Token::SemiColon)?; | ||
|
||
if self.options.require_semicolon_stmt_delimiter { | ||
self.expect_token(&Token::SemiColon)?; | ||
} | ||
|
||
// ignore empty statements (between successive statement delimiters) | ||
while self.consume_token(&Token::SemiColon) {} | ||
} | ||
Ok(values) | ||
} | ||
|
@@ -16464,7 +16506,28 @@ impl<'a> Parser<'a> { | |
|
||
/// Parse [Statement::Return] | ||
fn parse_return(&mut self) -> Result<Statement, ParserError> { | ||
match self.maybe_parse(|p| p.parse_expr())? { | ||
let rs = self.maybe_parse(|p| { | ||
let expr = p.parse_expr()?; | ||
|
||
match &expr { | ||
Expr::Value(_) | ||
| Expr::Function(_) | ||
| Expr::UnaryOp { .. } | ||
| Expr::BinaryOp { .. } | ||
| Expr::Case { .. } | ||
| Expr::Cast { .. } | ||
| Expr::Convert { .. } | ||
| Expr::Subquery(_) => Ok(expr), | ||
// todo: how to retstrict to variables? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems more like a semantic check -- this crate is focused on syntax parsing. Read more hre: https://github.com/apache/datafusion-sqlparser-rs?tab=readme-ov-file#syntax-vs-semantics I think the error is inappropriate in this case There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
What do you suggest?
The difficulty here is that without semicolon tokens the return statement is ambiguous. Consider the following perfectly valid T-SQL statement (which is also a test case example) which has several variations of
If you revert this change and run that test, you get this error:
The reason is that for the first I think I had said when introducing parse_return in a previous PR that we'd have to come back and tighten it up eventually, but I can't find that discussion 😐. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @aharpervc in the example, is the ambiguity stemming from tsql not supporting There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't see this being about the if statement, but rather about the return statement. The question for the parse is, given a return keyword, how many more tokens (or similarly, which expr's) should be considered part of that ReturnStatement? For T-SQL, there is a restricted answer to that question, but it is somewhat awkward to express here.
Newlines don't really help you here, I put them in the example above for readability, but this should parse identically (and does if you run on a real SQL Server instance):
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah right, makes sense that they're not newline dependent. To clarify what I meant in the if self.peek_one_of_keywords(
// hmm noticed while enumerating this list that the only ambiguous case really
// seems to be IF - since offhand the others can't be an expression.
// So that the list might not be necessary after all.
[BEGIN, BREAK, CONTINUE, GOTO, IF, RETURN, THROW, TRY, WAITFOR, WHILE]
) {
// bare return statement
}
// else maybe_parse_expr would something like this work? |
||
Expr::Identifier(id) if id.value.starts_with('@') => Ok(expr), | ||
_ => parser_err!( | ||
"Non-returnable expression found following RETURN", | ||
p.peek_token().span.start | ||
), | ||
} | ||
})?; | ||
|
||
match rs { | ||
Some(expr) => Ok(Statement::Return(ReturnStatement { | ||
value: Some(ReturnStatementValue::Expr(expr)), | ||
})), | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I followed the intent here, would you be able to provide example sqls to illustrate and how they should be parsed with vs without the flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the example sql is from the sqlparse_common test case
parse_select_with_table_alias_keyword
&parse_invalid_limit_by
, such asSELECT a FROM lineitem DECLARE
. Commenting out the new logic and running the tests shows the difficulty. My thinking here is that if semicolons are optional, we can forbid keywords as table aliases. Folks who use this option are going to be opting in to a certain level of ambiguity regardless, but having the parser do reasonable things (with test coverage) is my objective here.Looking at this code again I think this should be checking the actual configured parser option rather than the dialect's capability, similar to the other feedback comment. I will change it.