Analyse queries to get table name and columns used

We have quite some large SQL Server databases in terms of data volume (in TBs) and database objects (thousands).

Overtime, we have developed manually created stored procedures (hundreds of them) and the ones auto generated by Entity Framework (thousands of them).

Now we in this constant struggle with performance issues, mainly due to lack of optimal indexes.

I know SQL Server have the `sys.dm_db_missing_index_groups` and related DMVs to suggest possible missing indexes, but it has limitations like the info will be reset after a server restart etc.

I tried to use this SqlParser-cs project to automatically scan through all the unique captured queries in the audit logs (via extended event), and if what indexes are missing.

The logic flow is relative straight forward:

1. For every query, check the operation (CRUD)
2. And for each operation, obtain the table name and columns used (either in WHERE clause or returned columns)
3. Compare the existing indexes and see if it's missing

But I found there could be many different cases regarding the types of the expression in this project. It would be very verbose and clumsy for me to do lots of switch/case statements to have the exhausted support for all expressions etc. etc.

What should be simplest way to obtain the object/table name and referenced columns (in both selection, relationship and returned columns) so that I can produce a POCO to finally match with the existing indexes in a statement?

Note that there could be multiple tables in a statement, especially with joins. I need to work out the exhausted list of tables, their associated columns and the location/nature (in selection, relationship or returned columns), and the outcome should be like below:

1. TableA: Column1 (selection), Column2 (selection), Column3 (relationship), Column4 (returned column)
2. TableB: Column1 (selection), Column2 (returned column), Column3 (returned column)
3. TableC: Column1 (returned column)

Sample query:
```
SELECT * from t1 where EXISTS(SELECT column from t2 where a<>b) and a=1 and 'x'=b
```

**Attempt 1**
```
var statements = new Parser().ParseSql(sql);

foreach (var item in statements)
{
    string tableName = null;

    if (item is Statement.Update update)
    {
        if (update.Table.Relation is TableFactor.Table table)
        {
            tableName = table.Name;
        }

        // handle different expression type
        // both left/right operand can be expression
        if (update.Selection is Expression.InSubquery inSubquery)
        {
            if (inSubquery.Expression is Expression.Identifier)
            {

            }
        }
    }
}
```


Then I tried with visitor pattern.

**Attempt 2**
```
using SqlParser.Ast;

namespace SqlOptimizer
{
    internal class SqlTableColumnVisitor : Visitor
{
    private readonly List<SqlTable> sqlTables = [];

    public List<SqlTable> SqlTables => sqlTables;

    public override ControlFlow PreVisitTableFactor(TableFactor tableFactor)
    {
        string? tableName = default;

        if (tableFactor is TableFactor.Table table)
        {
            tableName = table.Alias?.Name ?? table.Name.Values[0].Value;
        }

        if (tableName != default)
        {
            sqlTables.Add(new SqlTable
            {
                Alias = tableName
            });
        }

        return ControlFlow.Continue;
    }

    public override ControlFlow PostVisitExpression(Expression expression)
    {
        if (expression is Expression.Identifier identifier)
        {
            var sqlTable = sqlTables[^1];

            sqlTable.Columns.Add(new SqlTableColumn
            {
                ColumnName = identifier.Ident.Value,
                Location = "TBC"
            });
        }

        return ControlFlow.Continue;
    }

    //public override ControlFlow PreVisitRelation(ObjectName relation)
    //{
    //    var sqlTable = sqlTables[^1];

    //    if (relation.Values.Count > 0)
    //    {
    //        sqlTable.Alias = relation.Values[0].Value;
    //    }

    //    return ControlFlow.Continue;
    //}
    

    internal class SqlTable
    {
        public string? Alias { get; set; }

        public List<SqlTableColumn> Columns { get; set; } = [];
    }

    internal class SqlTableColumn
    {
        public required string Location { get; set; }
     
        public required string ColumnName { get; set; }
    }
}
```

That's based on the `Visitor_Visits_Sql_Parts` sample test:

```
"SELECT * from t1 where EXISTS(SELECT column from t2) UNION SELECT * from t3",
[
    "PRE: STATEMENT: SELECT * FROM t1 WHERE EXISTS (SELECT column FROM t2) UNION SELECT * FROM t3",
    "PRE: TABLE FACTOR: t1",
    "PRE: RELATION: t1",
    "POST: RELATION: t1",
    "POST: TABLE FACTOR: t1",
    "PRE: EXPR: EXISTS (SELECT column FROM t2)",
    "PRE: EXPR: column",
    "POST: EXPR: column",
    "PRE: TABLE FACTOR: t2",
    "PRE: RELATION: t2",
    "POST: RELATION: t2",
    "POST: TABLE FACTOR: t2",
    "POST: EXPR: EXISTS (SELECT column FROM t2)",
    "PRE: TABLE FACTOR: t3",
    "PRE: RELATION: t3",
    "POST: RELATION: t3",
    "POST: TABLE FACTOR: t3",
    "POST: STATEMENT: SELECT * FROM t1 WHERE EXISTS (SELECT column FROM t2) UNION SELECT * FROM t3"
]
```

And I realized it will not work because the scope of the current table will be changed in nested statement. We need to some sort of stack to correctly track the scope to have the right table/columns association.

What should be right way to correctly capture individual table/columns combination?


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Analyse queries to get table name and columns used #74

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Analyse queries to get table name and columns used #74

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions