Expand parse without semicolons #1949

Open
wants to merge 8 commits into
base: main
22 changes: 20 additions & 2 deletions src/dialect/mod.rs
@@ -1036,8 +1036,14 @@ pub trait Dialect: Debug + Any {
/// Returns true if the specified keyword should be parsed as a table factor alias.
/// When explicit is true, the keyword is preceded by an `AS` word. Parser is provided
/// to enable looking ahead if needed.
fn is_table_factor_alias(&self, explicit: bool, kw: &Keyword, parser: &mut Parser) -> bool {
explicit || self.is_table_alias(kw, parser)
///
/// When the dialect supports statements without semicolon delimiter, actual keywords aren't parsed as aliases.
fn is_table_factor_alias(&self, explicit: bool, kw: &Keyword, _parser: &mut Parser) -> bool {
if self.supports_statements_without_semicolon_delimiter() {
kw == &Keyword::NoKeyword
} else {
explicit || self.is_table_alias(kw, _parser)
}
Comment on lines +1040 to +1046
Contributor

I'm not sure I follow the intent here. Could you provide example SQL statements to illustrate it, and show how they should be parsed with and without the flag?

Contributor Author

Yes, the example SQL is from the sqlparser_common test cases parse_select_with_table_alias_keyword and parse_invalid_limit_by, such as SELECT a FROM lineitem DECLARE. Commenting out the new logic and running the tests shows the difficulty. My thinking here is that if semicolons are optional, we can forbid keywords as table aliases. Folks who use this option are opting in to a certain level of ambiguity regardless, but my objective here is to have the parser do reasonable things (with test coverage).

Looking at this code again I think this should be checking the actual configured parser option rather than the dialect's capability, similar to the other feedback comment. I will change it.
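
The rule described in this reply can be sketched in isolation. This is a minimal illustration, not the crate's code: the `Keyword` enum, the one-entry `RESERVED_FOR_TABLE_ALIAS` slice, and the free function `is_table_factor_alias` are simplified stand-ins, and the `semicolons_optional` flag is assumed to come from the configured parser option, as the reply proposes.

```rust
/// Simplified stand-in for the crate's `Keyword` enum (assumption: only the
/// variants needed for this illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Keyword {
    NoKeyword, // a plain identifier such as `l`
    Declare,
    Limit,
}

/// Simplified stand-in for the reserved list; the real list is much longer.
const RESERVED_FOR_TABLE_ALIAS: &[Keyword] = &[Keyword::Limit];

/// Decides whether the word following a table factor may be taken as its alias.
fn is_table_factor_alias(semicolons_optional: bool, explicit: bool, kw: Keyword) -> bool {
    if semicolons_optional {
        // Without mandatory semicolons a keyword may start the next statement,
        // so only non-keyword words are accepted as aliases.
        kw == Keyword::NoKeyword
    } else {
        // Otherwise an explicit `AS` always wins, and an implicit alias is
        // allowed unless the keyword is reserved.
        explicit || !RESERVED_FOR_TABLE_ALIAS.contains(&kw)
    }
}
```

Under these assumptions, `SELECT a FROM lineitem DECLARE` treats `DECLARE` as an alias when semicolons are required, and leaves it for the next statement when they are optional.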

}

/// Returns true if this dialect supports querying historical table data
@@ -1136,6 +1142,18 @@ pub trait Dialect: Debug + Any {
fn supports_notnull_operator(&self) -> bool {
false
}

/// Returns true if the dialect supports parsing statements without a semicolon delimiter.
Contributor

Would it be possible to give an example here?

Something like

Suggested change
/// Returns true if the dialect supports parsing statements without a semicolon delimiter.
/// Returns true if the dialect supports parsing statements without a semicolon delimiter.
///
/// If this returns `true`, the following SQL will parse. If it returns `false`, the SQL will not parse.
///
/// ```sql
/// SELECT 1
/// SELECT 2
/// ```

Contributor Author

Done 👍

///
/// If this returns `true`, the following SQL will parse. If it returns `false`, the SQL will not parse.
///
/// ```sql
/// SELECT 1
/// SELECT 2
/// ```
fn supports_statements_without_semicolon_delimiter(&self) -> bool {
false
}
}

/// This represents the operators for which precedence must be defined
9 changes: 8 additions & 1 deletion src/dialect/mssql.rs
@@ -67,7 +67,7 @@ impl Dialect for MsSqlDialect {
}

fn supports_connect_by(&self) -> bool {
true
false
}

fn supports_eq_alias_assignment(&self) -> bool {
@@ -123,6 +123,10 @@ impl Dialect for MsSqlDialect {
true
}

fn supports_statements_without_semicolon_delimiter(&self) -> bool {
Contributor

🤦 MS SQL!!!!

Contributor Author

🤷‍♂️🤣

true
}

/// See <https://learn.microsoft.com/en-us/sql/relational-databases/security/authentication-access/server-level-roles>
fn get_reserved_grantees_types(&self) -> &[GranteesType] {
&[GranteesType::Public]
@@ -280,6 +284,9 @@ impl MsSqlDialect {
) -> Result<Vec<Statement>, ParserError> {
let mut stmts = Vec::new();
loop {
while let Token::SemiColon = parser.peek_token_ref().token {
parser.advance_token();
}
if let Token::EOF = parser.peek_token_ref().token {
break;
}
9 changes: 9 additions & 0 deletions src/keywords.rs
@@ -1072,6 +1072,7 @@ pub const RESERVED_FOR_TABLE_ALIAS: &[Keyword] = &[
Keyword::ANTI,
Keyword::SEMI,
Keyword::RETURNING,
Keyword::RETURN,
Keyword::ASOF,
Keyword::MATCH_CONDITION,
// for MSSQL-specific OUTER APPLY (seems reserved in most dialects)
@@ -1097,6 +1098,11 @@ pub const RESERVED_FOR_TABLE_ALIAS: &[Keyword] = &[
Keyword::TABLESAMPLE,
Keyword::FROM,
Keyword::OPEN,
Keyword::INSERT,
Keyword::UPDATE,
Keyword::DELETE,
Keyword::EXEC,
Keyword::EXECUTE,
];

/// Can't be used as a column alias, so that `SELECT <expr> alias`
@@ -1126,6 +1132,7 @@ pub const RESERVED_FOR_COLUMN_ALIAS: &[Keyword] = &[
Keyword::CLUSTER,
Keyword::DISTRIBUTE,
Keyword::RETURNING,
Keyword::RETURN,
// Reserved only as a column alias in the `SELECT` clause
Keyword::FROM,
Keyword::INTO,
@@ -1140,6 +1147,7 @@ pub const RESERVED_FOR_TABLE_FACTOR: &[Keyword] = &[
Keyword::LIMIT,
Keyword::HAVING,
Keyword::WHERE,
Keyword::RETURN,
];

/// Global list of reserved keywords that cannot be parsed as identifiers
@@ -1150,4 +1158,5 @@ pub const RESERVED_FOR_IDENTIFIER: &[Keyword] = &[
Keyword::INTERVAL,
Keyword::STRUCT,
Keyword::TRIM,
Keyword::RETURN,
];
73 changes: 68 additions & 5 deletions src/parser/mod.rs
@@ -266,6 +266,22 @@ impl ParserOptions {
self.unescape = unescape;
self
}

/// Set if semicolon statement delimiters are required.
Contributor

What is the use case for having the same configuration information on both the ParserOptions and the Dialect?

It seems there are a few places in the Parser that inconsistently check the options and/or the dialect flag. If we had only one way to specify the option, there would be no room for inconsistencies.

Or maybe I misunderstand what is going on here.

Contributor Author

What is the usecase to have the same configuration information on the ParserOptions and the Dialect?

It's similar to the existing with_trailing_commas logic, and has several practical use cases:

  1. it can be set to true for SQL Server and left false for all others (until someone discovers this is supported in other dialects)
  2. it integrates with existing testing patterns, e.g., we can have all_dialects_requiring_semicolon_statement_delimiter & all_dialects_not_requiring_semicolon_statement_delimiter
  3. it is used to set the corresponding default parser option

It seems like testing parsing without semicolons would get a lot worse if it wasn't a dialect option.

Contributor

Ah yeah, the with_trailing_commas option was originally introduced as a parser setting and later had to be added as a dialect option because it was dialect-specific. I figure ideally it would have been present only on the dialect.

My current thinking is that ideally we have one way to configure the semicolon option, since it would be nice not to have two knobs for the flag. As for which way to configure it, I'm guessing it could make more sense to have the flag as a parser setting, i.e. whether or not a semicolon is required reflects the tool that originally processed the input file more than an aspect of the SQL language/dialect itself, so it can differ in certain contexts for the same dialect.

Contributor Author

In terms of which way to configure, I'm guessing it could make more sense to have the flag as a parser setting

This seems reasonable to me in terms of philosophy. However, I don't know how that could be implemented practically. I.e., how would you write that code in this project to be able to have the same test coverage?

The benefits of setting the parse option based on the dialect configuration are:

  1. not all dialects support parsing without semicolons (so far, only one)
  2. for dialects that require semicolons, we probably don't want anything to change (as is implemented here in the PR)
  3. this project's tests are oriented around dialect features, so implementing the semicolon requirement as a dialect feature makes it more clear for how to run the tests
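
The arrangement being debated, where the dialect capability only seeds the default of the parser option, can be sketched as follows. This is a toy model under stated assumptions: the trait and structs are reduced stand-ins that borrow names from the PR, and `default_for` is a hypothetical helper, not an API of the crate.

```rust
/// Toy stand-in for the crate's `Dialect` trait (assumption: one method only).
trait Dialect {
    fn supports_statements_without_semicolon_delimiter(&self) -> bool {
        false
    }
}

struct GenericDialect;
struct MsSqlDialect;

impl Dialect for GenericDialect {}
impl Dialect for MsSqlDialect {
    fn supports_statements_without_semicolon_delimiter(&self) -> bool {
        true
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct ParserOptions {
    require_semicolon_stmt_delimiter: bool,
}

impl ParserOptions {
    /// Hypothetical helper: the option defaults to the negation of the
    /// dialect capability.
    fn default_for(dialect: &dyn Dialect) -> Self {
        Self {
            require_semicolon_stmt_delimiter: !dialect
                .supports_statements_without_semicolon_delimiter(),
        }
    }

    /// Callers can still override the dialect-derived default.
    fn with_require_semicolon_stmt_delimiter(mut self, required: bool) -> Self {
        self.require_semicolon_stmt_delimiter = required;
        self
    }
}
```

In this model the parser would consult only `ParserOptions` at parse time, which is one way to address the reviewer's concern about checking two knobs inconsistently while keeping dialect-oriented tests.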

///
/// If this option is `true`, the following SQL will not parse. If the option is `false`, the SQL will parse.
///
/// ```sql
/// SELECT 1
/// SELECT 2
/// ```
pub fn with_require_semicolon_stmt_delimiter(
mut self,
require_semicolon_stmt_delimiter: bool,
) -> Self {
self.require_semicolon_stmt_delimiter = require_semicolon_stmt_delimiter;
self
}
}

#[derive(Copy, Clone)]
@@ -362,7 +378,11 @@ impl<'a> Parser<'a> {
state: ParserState::Normal,
dialect,
recursion_counter: RecursionCounter::new(DEFAULT_REMAINING_DEPTH),
options: ParserOptions::new().with_trailing_commas(dialect.supports_trailing_commas()),
options: ParserOptions::new()
.with_trailing_commas(dialect.supports_trailing_commas())
.with_require_semicolon_stmt_delimiter(
!dialect.supports_statements_without_semicolon_delimiter(),
),
}
}

@@ -484,13 +504,18 @@ impl<'a> Parser<'a> {

match self.peek_token().token {
Token::EOF => break,

// end of statement
Token::Word(word) => {
if expecting_statement_delimiter && word.keyword == Keyword::END {
break;
}
}
// don't expect a semicolon statement delimiter after a newline when not otherwise required
Token::Whitespace(Whitespace::Newline) => {
if !self.options.require_semicolon_stmt_delimiter {
expecting_statement_delimiter = false;
}
}
_ => {}
}

@@ -500,7 +525,7 @@ impl<'a> Parser<'a> {

let statement = self.parse_statement()?;
stmts.push(statement);
expecting_statement_delimiter = true;
expecting_statement_delimiter = self.options.require_semicolon_stmt_delimiter;
}
Ok(stmts)
}
@@ -4541,6 +4566,14 @@ impl<'a> Parser<'a> {
return Ok(vec![]);
}

if end_token == Token::SemiColon && !self.options.require_semicolon_stmt_delimiter {
if let Token::Word(ref kw) = self.peek_token().token {
if kw.keyword != Keyword::NoKeyword {
return Ok(vec![]);
}
}
}

if self.options.trailing_commas && self.peek_tokens() == [Token::Comma, end_token] {
let _ = self.consume_token(&Token::Comma);
return Ok(vec![]);
@@ -4558,6 +4591,9 @@ impl<'a> Parser<'a> {
) -> Result<Vec<Statement>, ParserError> {
let mut values = vec![];
loop {
// ignore empty statements (between successive statement delimiters)
while self.consume_token(&Token::SemiColon) {}

match &self.peek_nth_token_ref(0).token {
Token::EOF => break,
Token::Word(w) => {
@@ -4569,7 +4605,13 @@ impl<'a> Parser<'a> {
}

values.push(self.parse_statement()?);
self.expect_token(&Token::SemiColon)?;

if self.options.require_semicolon_stmt_delimiter {
self.expect_token(&Token::SemiColon)?;
}

// ignore empty statements (between successive statement delimiters)
while self.consume_token(&Token::SemiColon) {}
}
Ok(values)
}
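
The delimiter handling in the hunk above can be illustrated with a toy loop. This sketch assumes the input is already split into whole-statement tokens; `Token` and `parse_statement_list` here are simplified stand-ins, not the parser's own types or signatures.

```rust
/// Toy token stream: each statement is pre-tokenized as a single unit.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Stmt(String),
    SemiColon,
}

/// Collects statements, optionally insisting on a `;` between two of them.
fn parse_statement_list(
    tokens: &[Token],
    require_semicolon: bool,
) -> Result<Vec<String>, String> {
    let mut stmts = Vec::new();
    let mut i = 0;
    loop {
        // Ignore empty statements (runs of successive delimiters).
        while tokens.get(i) == Some(&Token::SemiColon) {
            i += 1;
        }
        // Anything left at this point is either a statement or end of input.
        let Some(Token::Stmt(text)) = tokens.get(i) else { break };
        stmts.push(text.clone());
        i += 1;

        match tokens.get(i) {
            None => break,
            Some(Token::SemiColon) => {}
            // Two statements back to back: only an error in strict mode.
            Some(Token::Stmt(next)) if require_semicolon => {
                return Err(format!("Expected: end of statement, found: {next}"));
            }
            Some(Token::Stmt(_)) => {}
        }
    }
    Ok(stmts)
}
```

With `require_semicolon` set to `false`, the two-statement input `SELECT 1` followed by `SELECT 2` yields both statements; with `true`, it is rejected at the second statement.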
@@ -16464,7 +16506,28 @@ impl<'a> Parser<'a> {

/// Parse [Statement::Return]
fn parse_return(&mut self) -> Result<Statement, ParserError> {
match self.maybe_parse(|p| p.parse_expr())? {
let rs = self.maybe_parse(|p| {
let expr = p.parse_expr()?;

match &expr {
Expr::Value(_)
| Expr::Function(_)
| Expr::UnaryOp { .. }
| Expr::BinaryOp { .. }
| Expr::Case { .. }
| Expr::Cast { .. }
| Expr::Convert { .. }
| Expr::Subquery(_) => Ok(expr),
// todo: how to restrict to variables?
Contributor

This seems more like a semantic check -- this crate is focused on syntax parsing.

Read more here:

https://github.com/apache/datafusion-sqlparser-rs?tab=readme-ov-file#syntax-vs-semantics

I think the error is inappropriate in this case

Contributor Author

I think the error is inappropriate in this case

What do you suggest?

This seems more like a semantic check -- this crate is focused on syntax parsing.

The difficulty here is that without semicolon tokens the return statement is ambiguous. Consider the following perfectly valid T-SQL statement (which is also a test case example) which has several variations of return:

CREATE OR ALTER PROCEDURE example_sp
AS
    IF USER_NAME() = 'X'
        RETURN
    
    IF 1 = 2
        RETURN (SELECT 1)

    RETURN CONVERT(INT, 123)

If you revert this change and run that test, you get this error:

thread 'test_supports_statements_without_semicolon_delimiter' panicked at tests/sqlparser_mssql.rs:2533:14:
called `Result::unwrap()` on an `Err` value: ParserError("Expected: an SQL statement, found: 1 at Line: 1, Column: 72")

The reason is that for the first return, the self.maybe_parse(|p| p.parse_expr())? "successfully" parses/consumes the IF keyword. However, this is contrary to the keyword documentation for SQL Server (ref) which requires an "integer expression".

I think I had said when introducing parse_return in a previous PR that we'd have to come back and tighten it up eventually, but I can't find that discussion 😐.

Contributor

@aharpervc in the example, is the ambiguity stemming from tsql not supporting IF expressions but only IF statements? If so we could flag that difference in the parser dialect, avoiding this diff. If it does support IF statements then the grammar indeed looks either ambiguous or relying on the newline character to differentiate between the two scenarios?

Contributor Author

I don't see this being about the if statement, but rather about the return statement. The question for the parser is: given a return keyword, how many more tokens (or similarly, which exprs) should be considered part of that ReturnStatement? For T-SQL, there is a restricted answer to that question, but it is somewhat awkward to express here.

if it does support IF statements then the grammar indeed looks either ambiguous or relying on the newline character to differentiate between the two scenarios?

Newlines don't really help you here. I put them in the example above for readability, but this should parse identically (and does if you run it on a real SQL Server instance):

CREATE OR ALTER PROCEDURE example_sp AS IF USER_NAME() = 'X' RETURN IF 1 = 2 RETURN (SELECT 1) RETURN CONVERT(INT, 123)

Contributor

@iffyio, Aug 1, 2025

Ah right, makes sense that they're not newline-dependent. To clarify what I meant in the IF scenario: the idea would be to check whether the next token cannot be an expression. From a look at their docs, the options seem limited to:
https://learn.microsoft.com/en-us/sql/t-sql/language-elements/control-of-flow?view=sql-server-ver15

if self.peek_one_of_keywords(
  // hmm noticed while enumerating this list that the only ambiguous case really
  // seems to be IF - since offhand the others can't be an expression.
  // So that the list might not be necessary after all.
  [BEGIN, BREAK, CONTINUE, GOTO, IF, RETURN, THROW, TRY, WAITFOR, WHILE]
) {
  // bare return statement
}
// else maybe_parse_expr

would something like this work?
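
A runnable version of the reviewer's idea might look like the sketch below. It is only an illustration under assumptions: `Keyword` is a reduced stand-in enum, and `ReturnForm` and `classify_return` are hypothetical names. The keyword list is copied from the comment above and is not claimed to be complete.

```rust
/// Reduced stand-in for the crate's `Keyword` enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Keyword {
    Begin, Break, Continue, Goto, If, Return, Throw, Try, Waitfor, While,
    NoKeyword,
}

/// Keywords that begin a control-of-flow statement, per the list in the comment.
const STATEMENT_START: &[Keyword] = &[
    Keyword::Begin, Keyword::Break, Keyword::Continue, Keyword::Goto,
    Keyword::If, Keyword::Return, Keyword::Throw, Keyword::Try,
    Keyword::Waitfor, Keyword::While,
];

#[derive(Debug, PartialEq)]
enum ReturnForm {
    /// `RETURN` with no value; the next token belongs to another statement.
    Bare,
    /// The next token might start an expression, so try parsing one.
    MaybeExpr,
}

/// Decides how to treat `RETURN` based on the keyword of the token after it
/// (`None` stands for end of input).
fn classify_return(next: Option<Keyword>) -> ReturnForm {
    match next {
        None => ReturnForm::Bare,
        Some(kw) if STATEMENT_START.contains(&kw) => ReturnForm::Bare,
        Some(_) => ReturnForm::MaybeExpr,
    }
}
```

In the example procedure, the first `RETURN` is followed by `IF`, so this check would classify it as bare without attempting an expression parse.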

Expr::Identifier(id) if id.value.starts_with('@') => Ok(expr),
_ => parser_err!(
"Non-returnable expression found following RETURN",
p.peek_token().span.start
),
}
})?;

match rs {
Some(expr) => Ok(Statement::Return(ReturnStatement {
value: Some(ReturnStatementValue::Expr(expr)),
})),
64 changes: 64 additions & 0 deletions src/test_utils.rs
@@ -25,6 +25,7 @@
#[cfg(not(feature = "std"))]
use alloc::{
boxed::Box,
format,
string::{String, ToString},
vec,
vec::Vec,
@@ -186,6 +187,32 @@ impl TestedDialects {
statements
}

/// The same as [`statements_parse_to`] but it will strip semicolons from the SQL text.
pub fn statements_without_semicolons_parse_to(
&self,
sql: &str,
canonical: &str,
) -> Vec<Statement> {
let sql_without_semicolons = sql.replace(";", " ");
let statements = self
.parse_sql_statements(&sql_without_semicolons)
.expect(&sql_without_semicolons);
if !canonical.is_empty() && sql != canonical {
assert_eq!(self.parse_sql_statements(canonical).unwrap(), statements);
} else {
assert_eq!(
sql,
statements
.iter()
// note: account for format_statement_list manually inserted semicolons
.map(|s| s.to_string().trim_end_matches(";").to_string())
.collect::<Vec<_>>()
.join("; ")
);
}
statements
}
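
For illustration, the semicolon-stripping step of this helper can be tried in isolation. `strip_semicolons` is a hypothetical name used only in this sketch. One point worth keeping in mind is that a plain textual replacement also rewrites semicolons inside string literals; that is an observation about `str::replace`, not something discussed in the PR.

```rust
/// Replaces every `;` with a space, as the test helper does before parsing.
fn strip_semicolons(sql: &str) -> String {
    sql.replace(';', " ")
}

/// Reports whether stripping changed the text at all, which makes it easy to
/// spot inputs (such as string literals containing `;`) that get rewritten.
fn stripping_changes_text(sql: &str) -> bool {
    strip_semicolons(sql) != sql
}
```

Test inputs whose literals contain semicolons would presumably need separate handling when used with a helper of this shape.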

/// Ensures that `sql` parses as an [`Expr`], and that
/// re-serializing the parse result produces canonical
pub fn expr_parses_to(&self, sql: &str, canonical: &str) -> Expr {
@@ -318,6 +345,43 @@ where
all_dialects_where(|d| !except(d))
}

/// Returns all dialects that don't support statements without semicolon delimiters.
/// (i.e. dialects that require semicolon delimiters.)
pub fn all_dialects_requiring_semicolon_statement_delimiter() -> TestedDialects {
let tested_dialects =
all_dialects_except(|d| d.supports_statements_without_semicolon_delimiter());
assert_ne!(tested_dialects.dialects.len(), 0);
tested_dialects
}

/// Returns all dialects that do support statements without semicolon delimiters.
/// (i.e. dialects not requiring semicolon delimiters.)
pub fn all_dialects_not_requiring_semicolon_statement_delimiter() -> TestedDialects {
let tested_dialects =
all_dialects_where(|d| d.supports_statements_without_semicolon_delimiter());
assert_ne!(tested_dialects.dialects.len(), 0);
tested_dialects
}

/// Asserts an error for `parse_sql_statements`:
/// - "end of statement" for dialects that require semicolon delimiters
/// - "an SQL statement" for dialects that don't require semicolon delimiters.
pub fn assert_err_parse_statements(sql: &str, found: &str) {
assert_eq!(
ParserError::ParserError(format!("Expected: end of statement, found: {found}")),
all_dialects_requiring_semicolon_statement_delimiter()
.parse_sql_statements(sql)
.unwrap_err()
);

assert_eq!(
ParserError::ParserError(format!("Expected: an SQL statement, found: {found}")),
all_dialects_not_requiring_semicolon_statement_delimiter()
.parse_sql_statements(sql)
.unwrap_err()
);
}

pub fn assert_eq_vec<T: ToString>(expected: &[&str], actual: &[T]) {
assert_eq!(
expected,