Skip to content

Analyse queries to get table name and columns used #74

@unruledboy

Description

@unruledboy

We have quite some large SQL Server databases in terms of data volume (in TBs) and database objects (thousands).

Overtime, we have developed manually created stored procedures (hundreds of them) and the ones auto generated by Entity Framework (thousands of them).

Now we in this constant struggle with performance issues, mainly due to lack of optimal indexes.

I know SQL Server have the sys.dm_db_missing_index_groups and related DMVs to suggest possible missing indexes, but it has limitations like the info will be reset after a server restart etc.

I tried to use this SqlParser-cs project to automatically scan through all the unique captured queries in the audit logs (via extended event), and if what indexes are missing.

The logic flow is relative straight forward:

  1. For every query, check the operation (CRUD)
  2. And for each operation, obtain the table name and columns used (either in WHERE clause or returned columns)
  3. Compare the existing indexes and see if it's missing

But I found there could be many different cases regarding the types of the expression in this project. It would be very verbose and clumsy for me to do lots of switch/case statements to have the exhausted support for all expressions etc. etc.

What should be simplest way to obtain the object/table name and referenced columns (in both selection, relationship and returned columns) so that I can produce a POCO to finally match with the existing indexes in a statement?

Note that there could be multiple tables in a statement, especially with joins. I need to work out the exhausted list of tables, their associated columns and the location/nature (in selection, relationship or returned columns), and the outcome should be like below:

  1. TableA: Column1 (selection), Column2 (selection), Column3 (relationship), Column4 (returned column)
  2. TableB: Column1 (selection), Column2 (returned column), Column3 (returned column)
  3. TableC: Column1 (returned column)

Sample query:

SELECT * from t1 where EXISTS(SELECT column from t2 where a<>b) and a=1 and 'x'=b

Attempt 1

var statements = new Parser().ParseSql(sql);

foreach (var item in statements)
{
    string tableName = null;

    if (item is Statement.Update update)
    {
        if (update.Table.Relation is TableFactor.Table table)
        {
            tableName = table.Name;
        }

        // handle different expression type
        // both left/right operand can be expression
        if (update.Selection is Expression.InSubquery inSubquery)
        {
            if (inSubquery.Expression is Expression.Identifier)
            {

            }
        }
    }
}

Then I tried with visitor pattern.

Attempt 2

using SqlParser.Ast;

namespace SqlOptimizer
{
    internal class SqlTableColumnVisitor : Visitor
{
    private readonly List<SqlTable> sqlTables = [];

    public List<SqlTable> SqlTables => sqlTables;

    public override ControlFlow PreVisitTableFactor(TableFactor tableFactor)
    {
        string? tableName = default;

        if (tableFactor is TableFactor.Table table)
        {
            tableName = table.Alias?.Name ?? table.Name.Values[0].Value;
        }

        if (tableName != default)
        {
            sqlTables.Add(new SqlTable
            {
                Alias = tableName
            });
        }

        return ControlFlow.Continue;
    }

    public override ControlFlow PostVisitExpression(Expression expression)
    {
        if (expression is Expression.Identifier identifier)
        {
            var sqlTable = sqlTables[^1];

            sqlTable.Columns.Add(new SqlTableColumn
            {
                ColumnName = identifier.Ident.Value,
                Location = "TBC"
            });
        }

        return ControlFlow.Continue;
    }

    //public override ControlFlow PreVisitRelation(ObjectName relation)
    //{
    //    var sqlTable = sqlTables[^1];

    //    if (relation.Values.Count > 0)
    //    {
    //        sqlTable.Alias = relation.Values[0].Value;
    //    }

    //    return ControlFlow.Continue;
    //}
    

    internal class SqlTable
    {
        public string? Alias { get; set; }

        public List<SqlTableColumn> Columns { get; set; } = [];
    }

    internal class SqlTableColumn
    {
        public required string Location { get; set; }
     
        public required string ColumnName { get; set; }
    }
}

That's based on the Visitor_Visits_Sql_Parts sample test:

"SELECT * from t1 where EXISTS(SELECT column from t2) UNION SELECT * from t3",
[
    "PRE: STATEMENT: SELECT * FROM t1 WHERE EXISTS (SELECT column FROM t2) UNION SELECT * FROM t3",
    "PRE: TABLE FACTOR: t1",
    "PRE: RELATION: t1",
    "POST: RELATION: t1",
    "POST: TABLE FACTOR: t1",
    "PRE: EXPR: EXISTS (SELECT column FROM t2)",
    "PRE: EXPR: column",
    "POST: EXPR: column",
    "PRE: TABLE FACTOR: t2",
    "PRE: RELATION: t2",
    "POST: RELATION: t2",
    "POST: TABLE FACTOR: t2",
    "POST: EXPR: EXISTS (SELECT column FROM t2)",
    "PRE: TABLE FACTOR: t3",
    "PRE: RELATION: t3",
    "POST: RELATION: t3",
    "POST: TABLE FACTOR: t3",
    "POST: STATEMENT: SELECT * FROM t1 WHERE EXISTS (SELECT column FROM t2) UNION SELECT * FROM t3"
]

And I realized it will not work because the scope of the current table will be changed in nested statement. We need to some sort of stack to correctly track the scope to have the right table/columns association.

What should be right way to correctly capture individual table/columns combination?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions