Conversation

@amotl (Member) commented Aug 22, 2025

About

Find out about table usage frequency.

We want to know which tables are being accessed, and by which queries.

References


coderabbitai bot commented Aug 22, 2025

Walkthrough

Adds a new CLI command table-traffic that invokes a new job analysis pipeline. Introduces cratedb_toolkit/info/job.py with SQL classification, job log reading, operation modeling, and rendering. Adjusts get_table_names to return fully qualified table names. The CLI wires context to a DatabaseCluster and TableTraffic to render results.

Changes

Cohort / File(s) Summary
CLI command addition
cratedb_toolkit/info/cli.py
Adds table_traffic(ctx) command. Builds DatabaseCluster from ctx.meta["address"], initializes TableTraffic with adapter, and calls render(). Imports TableTraffic. Does not pass scrub to render.
Job analysis framework
cratedb_toolkit/info/job.py
New module implementing: Operation and Operations data models; SqlStatementClassifier for parsing/classification via sqlparse; TableTraffic orchestrator with methods to read sys.jobs_log, parse statements to operations, analyze via Polars, and render output. Includes heuristics for DQL/select, INTO, evasive queries, and table extraction.
Database utility tweak
cratedb_toolkit/util/database.py
Changes get_table_names to collect table.fqn instead of table.name, returning fully qualified identifiers; signature and control flow unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as CLI (table-traffic)
  participant DC as DatabaseCluster
  participant TA as TableTraffic
  participant DB as CrateDB
  participant Class as SqlStatementClassifier
  participant Polars as Polars/DataFrame

  User->>CLI: crate info table-traffic
  CLI->>DC: init(address from ctx.meta)
  CLI->>TA: init(adapter=DC.adapter)
  CLI->>TA: render()
  activate TA
  TA->>DB: SELECT ... FROM sys.jobs_log WHERE ... (time window, non-system)
  DB-->>TA: jobs rows (stmt, timestamps)
  TA->>TA: read_jobs(map rows -> parse_expression)
  loop per stmt
    TA->>Class: classify(stmt)
    Class-->>TA: operation, table_names
  end
  TA->>Polars: build/analyze Operations
  Polars-->>TA: aggregated results
  TA-->>CLI: output rendered
  deactivate TA
  CLI-->>User: display table usage/traffic

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

A rabbit taps at logs so neat,
Sniffs SQL crumbs from every seat.
Tables named with schema flair,
Traffic mapped with careful care.
Hop, parse, render—job complete!
Little paws, big data beat.
(thump-thump) 🐇📊

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

🧹 Nitpick comments (7)
cratedb_toolkit/info/job.py (7)

41-56: Parameterize SQL to avoid SQL injection vectors and improve readability.

Even though begin/end are computed locally, binding parameters is safer and consistent with the rest of the codebase.

Apply this diff:

     def read_jobs_database(self, begin: int = 0, end: int = 0):
         logger.info("Reading sys.jobs_log")
         now = int(time.time() * 1000)
         end = end or now
         begin = begin or now - 600 * 60 * 1000
-        stmt = (
-            f"SELECT "
-            f"started, ended, classification, stmt, username, node "
-            f"FROM sys.jobs_log "
-            f"WHERE "
-            f"stmt NOT LIKE '%sys.%' AND "
-            f"stmt NOT LIKE '%information_schema.%' "
-            f"AND ended BETWEEN {begin} AND {end} "
-            f"ORDER BY ended ASC"
-        )
-        return self.adapter.run_sql(stmt, records=True)
+        stmt = (
+            "SELECT started, ended, classification, stmt, username, node "
+            "FROM sys.jobs_log "
+            "WHERE stmt NOT LIKE :sys AND stmt NOT LIKE :info "
+            "AND ended BETWEEN :begin AND :end "
+            "ORDER BY ended ASC"
+        )
+        params = {"sys": "%sys.%", "info": "%information_schema.%", "begin": begin, "end": end}
+        return self.adapter.run_sql(stmt, parameters=params, records=True)

1-12: Pick one data model library (attrs or dataclasses) for consistency.

This module mixes @attr.define and @dataclasses.dataclass. Prefer one approach across the module for maintainability.

If the project favors attrs, convert SqlStatementClassifier to @attr.define. Otherwise, convert Operation/Operations to @dataclasses.dataclass. I can provide a patch if you confirm the preferred style.


47-54: Time window readability.

600 * 60 * 1000 is “10 hours” but non-obvious. Consider 10 * 60 * 60 * 1000 or use datetime.timedelta(hours=10) to compute milliseconds for clarity.

I can send a small patch using timedelta if you prefer.
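For illustration, the same window expressed with timedelta (the constant and helper names are made up for this sketch; the ten-hour default matches the diff above):

```python
from datetime import timedelta
import time

# Ten-hour look-back window in milliseconds (equivalent to 600 * 60 * 1000).
WINDOW_MS = int(timedelta(hours=10).total_seconds() * 1000)

def default_bounds(now_ms: int = 0) -> "tuple[int, int]":
    # Mirror read_jobs_database: default `end` to now, `begin` to now minus 10h.
    end = now_ms or int(time.time() * 1000)
    return end - WINDOW_MS, end
```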


27-34: Unit-test coverage for aggregation and rendering paths.

Add tests that:

  • verify a single statement with two tables yields two counts after explode;
  • verify duplicates within the same statement are counted once (via your new dedup logic);
  • ensure render() returns a list of dicts suitable for jd().

I can scaffold tests/info/test_table_traffic.py with fixtures for sys.jobs_log rows and mocked adapter responses.

Also applies to: 83-88


30-30: Remove prints per Ruff T201.

All output should go through the structured return and CLI serialization.

Also applies to: 33-33


21-21: Remove commented-out code per Ruff ERA001.

Delete the commented fields/lines to keep the module clean.

Also applies to: 31-31


151-156: Leverage FQN change upstream; consider normalization.

flatten(get_table_names(self.expression)) now yields FQNs. If you anticipate comparing or joining across sources, consider normalizing (e.g., lowercasing schema/table where appropriate) to avoid case/quote mismatches, especially when tables are referenced with quotes.

I can provide a helper to canonicalize identifiers if needed.
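A minimal canonicalization helper could look like this (the function name is illustrative, and it assumes identifiers do not contain embedded dots inside quotes):

```python
def canonicalize_fqn(fqn: str) -> str:
    """
    Normalize a possibly-quoted `schema.table` identifier (sketch).

    Unquoted parts fold to lowercase, matching CrateDB's case folding for
    unquoted identifiers; quoted parts keep their case but drop the quotes.
    """
    parts = []
    for part in fqn.split("."):
        if len(part) >= 2 and part.startswith('"') and part.endswith('"'):
            parts.append(part[1:-1])  # quoted: preserve case
        else:
            parts.append(part.lower())  # unquoted: fold to lowercase
    return ".".join(parts)
```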

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3816479 and fa299d6.

📒 Files selected for processing (3)
  • cratedb_toolkit/info/cli.py (2 hunks)
  • cratedb_toolkit/info/job.py (1 hunks)
  • cratedb_toolkit/util/database.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
cratedb_toolkit/info/cli.py (3)
cratedb_toolkit/cluster/core.py (5)
  • info (208-211)
  • DatabaseCluster (623-676)
  • from_options (636-640)
  • adapter (399-405)
  • adapter (484-488)
cratedb_toolkit/info/job.py (2)
  • TableTraffic (36-87)
  • render (82-87)
cratedb_toolkit/util/app.py (1)
  • cli (21-50)
cratedb_toolkit/info/job.py (1)
cratedb_toolkit/util/database.py (4)
  • DatabaseAdapter (43-421)
  • get_table_names (477-488)
  • run_sql (35-36)
  • run_sql (109-137)
🪛 Ruff (0.12.2)
cratedb_toolkit/info/cli.py

95-95: Local variable scrub is assigned to but never used

Remove assignment to unused variable scrub

(F841)

cratedb_toolkit/info/job.py

21-21: Found commented-out code

Remove commented-out code

(ERA001)


30-30: print found

Remove print

(T201)


31-31: Found commented-out code

Remove commented-out code

(ERA001)


33-33: print found

Remove print

(T201)


47-54: Possible SQL injection vector through string-based query construction

(S608)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-and-publish
  • GitHub Check: build-and-test
  • GitHub Check: Generic: Python 3.9 on OS ubuntu-latest
  • GitHub Check: Generic: Python 3.13 on OS ubuntu-latest
  • GitHub Check: Cloud: Python 3.13 on OS ubuntu-latest
  • GitHub Check: CFR for OS ubuntu-latest
  • GitHub Check: CFR for OS windows-latest
  • GitHub Check: CFR for OS macos-13
  • GitHub Check: CFR for OS macos-latest
🔇 Additional comments (3)
cratedb_toolkit/util/database.py (1)

478-489: Ensure safe fallback to bare table names when fully-qualified names might break downstream

The change from table.name to table.fqn will now always surface schema‐qualified names (including surrounding quotes), which may not be expected by consumers like the Rockset adapter or MLflow/LangChain integrations. To avoid unintended breakage:

• Update the docstring in cratedb_toolkit/util/database.py to reflect that the function returns fully-qualified names when available, but will fall back to the plain table name.
• Modify the loop at lines 480–488 in cratedb_toolkit/util/database.py to prefer table.fqn yet safely fall back to table.name.
• Manually review all callsites of get_table_names(sql), notably in cratedb_toolkit/info/job.py (populating tables_symbols) and cratedb_toolkit/adapter/rockset/server/api/query.py (setting "collections"), to confirm that passing FQNs (or bare names) behaves as intended in each context.

Proposed diff in cratedb_toolkit/util/database.py:

 def get_table_names(sql: str) -> t.List[t.List[str]]:
-    """
-    Decode table names from SQL statements.
-    """
+    """
+    Decode table names from SQL statements.
+    Returns fully-qualified names (schema.table) when available;
+    falls back to the plain table name otherwise.
+    """
     names = []
     statements = sqlparse_cratedb(sql)
     for statement in statements:
         local_names = []
         for table in statement.metadata.tables:
-            local_names.append(table.fqn)
+            # Prefer FQN (with schema); fall back to the bare table name.
+            local_names.append(getattr(table, "fqn", None) or table.name)
         names.append(local_names)
     return names

Please verify this change against each downstream consumer to ensure compatibility.

cratedb_toolkit/info/cli.py (1)

9-9: Import looks fine.

No issues with the new import. Keeping imports top-level matches existing style in this module.

cratedb_toolkit/info/job.py (1)

118-134: Semantics of DQL detection look good.

The is_dql logic combining is_select and not is_camouflage is a sensible baseline.

Comment on lines +89 to +99
@make_command(cli, "table-traffic", "Display information about table use.")
@click.pass_context
def table_traffic(ctx: click.Context):
    """
    Display ad hoc job information.
    """
    scrub = ctx.meta.get("scrub", False)
    dc = DatabaseCluster.from_options(ctx.meta["address"])
    traffic = TableTraffic(adapter=dc.adapter)
    traffic.render()


🛠️ Refactor suggestion

Wire scrub, return structured data, and emit JSON like other info commands.

  • scrub is read but unused (Ruff F841).
  • TableTraffic.render() currently logs and prints in its implementation; the CLI should output structured JSON like the other commands.

Refactor to pass scrub, return data, and call jd().

Apply this diff:

-@make_command(cli, "table-traffic", "Display information about table use.")
+@make_command(cli, "table-traffic", "Display information about table usage.")
 @click.pass_context
 def table_traffic(ctx: click.Context):
     """
-    Display ad hoc job information.
+    Display aggregate table usage derived from sys.jobs_log.
     """
     scrub = ctx.meta.get("scrub", False)
     dc = DatabaseCluster.from_options(ctx.meta["address"])
-    traffic = TableTraffic(adapter=dc.adapter)
-    traffic.render()
+    traffic = TableTraffic(adapter=dc.adapter, scrub=scrub)
+    data = traffic.render()
+    jd(data)

🤖 Prompt for AI Agents
In cratedb_toolkit/info/cli.py around lines 89-99, the CLI reads scrub but
doesn't use it and calls TableTraffic.render() which prints/logs instead of
returning structured data; update the function to pass scrub into TableTraffic
(e.g., TableTraffic(adapter=dc.adapter, scrub=scrub)), call the TableTraffic
method that returns structured data (assign it to a variable), emit that data
via jd(data) like the other info commands, and return the data so the scrub
variable is actually used and the output is JSON-serializable.

Comment on lines +27 to +34
def foo(self):
    fj = [attr.asdict(j) for j in self.data]
    df = pl.from_records(fj)
    print(df)
    # grouped = df.group_by("tables_symbols").agg([pl.sum("tables_symbols"), pl.sum("op")])
    grouped = df.sql("SELECT tables_symbols, COUNT(op) FROM self GROUP BY tables_symbols")
    print(grouped)


🛠️ Refactor suggestion

Replace prints and fragile DataFrame SQL with a deterministic aggregation API.

  • print in library code is undesirable (Ruff T201).
  • Grouping by the list column tables_symbols counts per-statement arrays, not per-table usage.
  • df.sql("... FROM self ...") is brittle; prefer a stable Polars API.

Refactor to explode the list column, then group and count, and return a DataFrame.

Apply this diff:

-    def foo(self):
-        fj = [attr.asdict(j) for j in self.data]
-        df = pl.from_records(fj)
-        print(df)
-        #grouped = df.group_by("tables_symbols").agg([pl.sum("tables_symbols"), pl.sum("op")])
-        grouped = df.sql("SELECT tables_symbols, COUNT(op) FROM self GROUP BY tables_symbols")
-        print(grouped)
+    def summarize_by_table(self) -> pl.DataFrame:
+        fj = [attr.asdict(j) for j in self.data]
+        if not fj:
+            return pl.DataFrame({"tables_symbols": [], "count": []})
+        df = pl.from_records(fj)
+        # Count statements per table (explode list column first, then group).
+        df = df.explode("tables_symbols")
+        return (
+            df.group_by("tables_symbols")
+              .agg(pl.len().alias("count"))
+              .sort("count", descending=True)
+        )

Comment on lines +36 to +40
class TableTraffic:

    def __init__(self, adapter: DatabaseAdapter):
        self.adapter = adapter


🛠️ Refactor suggestion

Pass scrub via constructor for consistent PII handling.

Expose scrub on TableTraffic to allow the CLI to control redaction.

Apply this diff:

-class TableTraffic:
-
-    def __init__(self, adapter: DatabaseAdapter):
-        self.adapter = adapter
+class TableTraffic:
+
+    def __init__(self, adapter: DatabaseAdapter, scrub: bool = False):
+        self.adapter = adapter
+        self.scrub = scrub
🤖 Prompt for AI Agents
In cratedb_toolkit/info/job.py around lines 36 to 40, TableTraffic's constructor
must accept and store a scrub parameter so the CLI can control PII redaction;
add a scrub parameter (typed appropriately, e.g., Callable[[str], str] or a
Scrubber type) to __init__, assign it to self.scrub, and update any code that
constructs TableTraffic to pass the desired scrub function (for example the
adapter's scrub or a CLI-provided scrub). Ensure the attribute is used wherever
redaction is performed and update type hints/imports as needed.

Comment on lines +58 to +64
def read_jobs(self, jobs) -> List[Operation]:
    result = []
    for job in jobs:
        sql = job["stmt"]
        result.append(self.parse_expression(sql))
    return result


🛠️ Refactor suggestion

Redact raw SQL when scrubbing and prepare data for aggregation.

Use scrub to avoid leaking query text and deduplicate tables per statement to count per-table usage once per statement.

Apply this diff:

     def read_jobs(self, jobs) -> List[Operation]:
         result = []
         for job in jobs:
             sql = job["stmt"]
-            result.append(self.parse_expression(sql))
+            op = self.parse_expression(sql)
+            if self.scrub:
+                op.stmt = None  # redact raw SQL text
+            result.append(op)
         return result
🤖 Prompt for AI Agents
In cratedb_toolkit/info/job.py around lines 58 to 64, the code returns raw SQL
and may double-count tables; call scrub(sql) before parsing to avoid leaking
query text (e.g., scrubbed = scrub(sql)) and pass the scrubbed SQL to
parse_expression, then deduplicate any per-operation table list so each table is
counted once per statement (e.g., replace op.tables with a unique list/set)
before appending to the result; ensure the returned Operation objects contain
scrubbed SQL and deduplicated tables for aggregation.

Comment on lines +65 to +78
@staticmethod
def parse_expression(sql: str) -> Operation:
    logger.debug(f"Analyzing SQL: {sql}")
    classifier = SqlStatementClassifier(expression=sql)
    if not classifier.operation:
        logger.warning(f"Unable to determine operation: {sql}")
    if not classifier.table_names:
        logger.warning(f"Unable to determine table names: {sql}")
    return Operation(
        op=classifier.operation,
        stmt=sql,
        tables_symbols=classifier.table_names,
    )


⚠️ Potential issue

Harden parsing: default unknown op, deduplicate table names.

Avoid None for op, and ensure tables per statement are unique to prevent overcounting.

Apply this diff:

     @staticmethod
     def parse_expression(sql: str) -> Operation:
         logger.debug(f"Analyzing SQL: {sql}")
         classifier = SqlStatementClassifier(expression=sql)
         if not classifier.operation:
             logger.warning(f"Unable to determine operation: {sql}")
         if not classifier.table_names:
             logger.warning(f"Unable to determine table names: {sql}")
-        return Operation(
-            op=classifier.operation,
+        table_names = list(dict.fromkeys(classifier.table_names))
+        return Operation(
+            op=classifier.operation or "UNKNOWN",
             stmt=sql,
-            tables_symbols=classifier.table_names,
+            tables_symbols=table_names,
         )
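The dict.fromkeys idiom used in the diff deduplicates while preserving first-seen order, keeping output stable across runs:

```python
# Duplicate table references within one statement, e.g. from a self-join.
names = ["doc.foo", "doc.bar", "doc.foo"]

# dict keys are unique and (since Python 3.7) insertion-ordered.
unique = list(dict.fromkeys(names))
print(unique)  # ['doc.foo', 'doc.bar']
```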
🤖 Prompt for AI Agents
In cratedb_toolkit/info/job.py around lines 65 to 78, the parse_expression
function currently can return None for the operation and duplicate table names;
ensure you set a default op value (e.g., "UNKNOWN" or Operation.UNKNOWN) when
classifier.operation is falsy, and replace the tables_symbols assignment with a
deduplicated list preserving order (e.g., use dict.fromkeys or an OrderedDict)
so repeated table names are collapsed before constructing the Operation; keep
the existing logging but still return the safe default op and the unique tables
list.

Comment on lines +79 to +88
def analyze_jobs(self, ops: Operations):
    ops.foo()

def render(self):
    jobs = self.read_jobs_database()
    logger.info(f"Analyzing {len(jobs)} jobs")
    ops = Operations(self.read_jobs(jobs))
    jobsa = self.analyze_jobs(ops)
    logger.info(f"Result: {jobsa}")


🛠️ Refactor suggestion

Make analyze/render return structured results; avoid side-effect-only logging.

Currently analyze_jobs returns None and render() logs that None. Return the computed summary and let the CLI serialize it.

Apply this diff:

-    def analyze_jobs(self, ops: Operations):
-        ops.foo()
+    def analyze_jobs(self, ops: Operations) -> pl.DataFrame:
+        return ops.summarize_by_table()
 
-    def render(self):
+    def render(self):
         jobs = self.read_jobs_database()
         logger.info(f"Analyzing {len(jobs)} jobs")
-        ops = Operations(self.read_jobs(jobs))
-        jobsa = self.analyze_jobs(ops)
-        logger.info(f"Result: {jobsa}")
+        ops = Operations(self.read_jobs(jobs))
+        summary_df = self.analyze_jobs(ops)
+        return summary_df.to_dicts()
🤖 Prompt for AI Agents
In cratedb_toolkit/info/job.py around lines 79 to 88, analyze_jobs currently
calls ops.foo() and returns None, and render() logs that None; change
analyze_jobs to compute and return a structured summary (e.g., dict or
dataclass) of the operations instead of returning None, and update render() to
call result = self.analyze_jobs(ops), stop printing/logging the raw None, and
return that structured result so the CLI can serialize it; ensure no
side-effect-only logging remains — keep logging minimal or move it to the
caller.

Comment on lines +143 to +149
def operation(self) -> str:
    """
    The SQL operation: SELECT, INSERT, UPDATE, DELETE, CREATE, etc.
    """
    parsed = self.parse_sqlparse()
    return parsed[0].get_type().upper()


⚠️ Potential issue

Guard against empty parses in operation to prevent IndexError.

sqlparse.parse("") returns an empty list; parsed[0] would crash.

Apply this diff:

     @property
     def operation(self) -> str:
         """
         The SQL operation: SELECT, INSERT, UPDATE, DELETE, CREATE, etc.
         """
-        parsed = self.parse_sqlparse()
-        return parsed[0].get_type().upper()
+        parsed = self.parse_sqlparse()
+        if not parsed:
+            return ""
+        return (parsed[0].get_type() or "").upper()
🤖 Prompt for AI Agents
In cratedb_toolkit/info/job.py around lines 143-149, the operation() method
currently assumes sqlparse.parse() returns at least one statement and accessing
parsed[0] can raise IndexError for empty SQL; update the method to check whether
parsed is non-empty before accessing parsed[0] and return an empty string (or a
sensible default) if it is empty, otherwise call get_type().upper() on the first
parsed statement; ensure the guard covers both an empty list and any unexpected
None so no IndexError occurs.

Comment on lines +165 to +177
def is_select_into(self) -> bool:
    """
    Use traditional `sqlparse` for catching `SELECT ... INTO ...` statements.

    Examples:
        SELECT * INTO foobar FROM bazqux
        SELECT * FROM bazqux INTO foobar
    """
    # Flatten all tokens (including nested ones) and match on type+value.
    statement = self.parse_sqlparse()[0]
    return any(
        token.ttype is Keyword and token.value.upper() == "INTO"
        for token in statement.flatten()
    )

🛠️ Refactor suggestion

Guard is_select_into against empty parses.

Prevent potential IndexError when no statements are parsed.

Apply this diff:

         # Flatten all tokens (including nested ones) and match on type+value.
-        statement = self.parse_sqlparse()[0]
+        parsed = self.parse_sqlparse()
+        if not parsed:
+            return False
+        statement = parsed[0]
         return any(
             token.ttype is Keyword and token.value.upper() == "INTO"
             for token in statement.flatten()
         )
🤖 Prompt for AI Agents
In cratedb_toolkit/info/job.py around lines 165 to 177, the is_select_into
method assumes parse_sqlparse() always returns at least one statement and can
raise IndexError; add a guard to handle empty or falsy parses by checking the
result (e.g., assign statements = self.parse_sqlparse(); if not statements:
return False) before accessing statements[0], then proceed with the existing
flatten-and-check logic so the method safely returns False when no statements
are parsed.
