Conversation

@amotl (Member) commented Aug 22, 2025

About

Find out about table usage frequency.

We want to know which tables are being accessed, and by which queries.

References


coderabbitai bot commented Aug 22, 2025

Walkthrough

Adds a new CLI command table-traffic that invokes a new job analysis pipeline. Introduces cratedb_toolkit/info/job.py with SQL classification, job log reading, operation modeling, and rendering. Adjusts get_table_names to return fully qualified table names. The CLI wires context to a DatabaseCluster and TableTraffic to render results.

Changes

Cohort / File(s) Summary
CLI command addition
cratedb_toolkit/info/cli.py
Adds table_traffic(ctx) command. Builds DatabaseCluster from ctx.meta["address"], initializes TableTraffic with adapter, and calls render(). Imports TableTraffic. Does not pass scrub to render.
Job analysis framework
cratedb_toolkit/info/job.py
New module implementing: Operation and Operations data models; SqlStatementClassifier for parsing/classification via sqlparse; TableTraffic orchestrator with methods to read sys.jobs_log, parse statements to operations, analyze via Polars, and render output. Includes heuristics for DQL/select, INTO, evasive queries, and table extraction.
Database utility tweak
cratedb_toolkit/util/database.py
Changes get_table_names to collect table.fqn instead of table.name, returning fully qualified identifiers; signature and control flow unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as CLI (table-traffic)
  participant DC as DatabaseCluster
  participant TA as TableTraffic
  participant DB as CrateDB
  participant Class as SqlStatementClassifier
  participant Polars as Polars/DataFrame

  User->>CLI: crate info table-traffic
  CLI->>DC: init(address from ctx.meta)
  CLI->>TA: init(adapter=DC.adapter)
  CLI->>TA: render()
  activate TA
  TA->>DB: SELECT ... FROM sys.jobs_log WHERE ... (time window, non-system)
  DB-->>TA: jobs rows (stmt, timestamps)
  TA->>TA: read_jobs(map rows -> parse_expression)
  loop per stmt
    TA->>Class: classify(stmt)
    Class-->>TA: operation, table_names
  end
  TA->>Polars: build/analyze Operations
  Polars-->>TA: aggregated results
  TA-->>CLI: output rendered
  deactivate TA
  CLI-->>User: display table usage/traffic

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

A rabbit taps at logs so neat,
Sniffs SQL crumbs from every seat.
Tables named with schema flair,
Traffic mapped with careful care.
Hop, parse, render—job complete!
Little paws, big data beat.
(thump-thump) 🐇📊

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

🧹 Nitpick comments (7)
cratedb_toolkit/info/job.py (7)

41-56: Parameterize SQL to avoid SQL injection vectors and improve readability.

Even though begin/end are computed locally, binding parameters is safer and consistent with the rest of the codebase.

Apply this diff:

     def read_jobs_database(self, begin: int = 0, end: int = 0):
         logger.info("Reading sys.jobs_log")
         now = int(time.time() * 1000)
         end = end or now
         begin = begin or now - 600 * 60 * 1000
-        stmt = (
-            f"SELECT "
-            f"started, ended, classification, stmt, username, node "
-            f"FROM sys.jobs_log "
-            f"WHERE "
-            f"stmt NOT LIKE '%sys.%' AND "
-            f"stmt NOT LIKE '%information_schema.%' "
-            f"AND ended BETWEEN {begin} AND {end} "
-            f"ORDER BY ended ASC"
-        )
-        return self.adapter.run_sql(stmt, records=True)
+        stmt = (
+            "SELECT started, ended, classification, stmt, username, node "
+            "FROM sys.jobs_log "
+            "WHERE stmt NOT LIKE :sys AND stmt NOT LIKE :info "
+            "AND ended BETWEEN :begin AND :end "
+            "ORDER BY ended ASC"
+        )
+        params = {"sys": "%sys.%", "info": "%information_schema.%", "begin": begin, "end": end}
+        return self.adapter.run_sql(stmt, parameters=params, records=True)

1-12: Pick one data model library (attrs or dataclasses) for consistency.

This module mixes @attr.define and @dataclasses.dataclass. Prefer one approach across the module for maintainability.

If the project favors attrs, convert SqlStatementClassifier to @attr.define. Otherwise, convert Operation/Operations to @dataclasses.dataclass. I can provide a patch if you confirm the preferred style.


47-54: Time window readability.

600 * 60 * 1000 is “10 hours” but non-obvious. Consider 10 * 60 * 60 * 1000 or use datetime.timedelta(hours=10) to compute milliseconds for clarity.

I can send a small patch using timedelta if you prefer.
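For illustration, the same window expressed with timedelta (the constant and helper names are made up for this sketch; the ten-hour default matches the diff above):

```python
from datetime import timedelta
import time

# Ten-hour look-back window in milliseconds (equivalent to 600 * 60 * 1000).
WINDOW_MS = int(timedelta(hours=10).total_seconds() * 1000)

def default_bounds(now_ms: int = 0) -> "tuple[int, int]":
    # Mirror read_jobs_database: default `end` to now, `begin` to now minus 10h.
    end = now_ms or int(time.time() * 1000)
    return end - WINDOW_MS, end
```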


27-34: Unit-test coverage for aggregation and rendering paths.

Add tests that:

  • verify a single statement with two tables yields two counts after explode;
  • verify duplicates within the same statement are counted once (via your new dedup logic);
  • ensure render() returns a list of dicts suitable for jd().

I can scaffold tests/info/test_table_traffic.py with fixtures for sys.jobs_log rows and mocked adapter responses.

Also applies to: 83-88


30-30: Remove prints per Ruff T201.

All output should go through the structured return and CLI serialization.

Also applies to: 33-33


21-21: Remove commented-out code per Ruff ERA001.

Delete the commented fields/lines to keep the module clean.

Also applies to: 31-31


151-156: Leverage FQN change upstream; consider normalization.

flatten(get_table_names(self.expression)) now yields FQNs. If you anticipate comparing or joining across sources, consider normalizing (e.g., lowercasing schema/table where appropriate) to avoid case/quote mismatches, especially when tables are referenced with quotes.

I can provide a helper to canonicalize identifiers if needed.
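A minimal canonicalization helper could look like this (the function name is illustrative, and it assumes identifiers do not contain embedded dots inside quotes):

```python
def canonicalize_fqn(fqn: str) -> str:
    """
    Normalize a possibly-quoted `schema.table` identifier (sketch).

    Unquoted parts fold to lowercase, matching CrateDB's case folding for
    unquoted identifiers; quoted parts keep their case but drop the quotes.
    """
    parts = []
    for part in fqn.split("."):
        if len(part) >= 2 and part.startswith('"') and part.endswith('"'):
            parts.append(part[1:-1])  # quoted: preserve case
        else:
            parts.append(part.lower())  # unquoted: fold to lowercase
    return ".".join(parts)
```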

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3816479 and fa299d6.

📒 Files selected for processing (3)
  • cratedb_toolkit/info/cli.py (2 hunks)
  • cratedb_toolkit/info/job.py (1 hunks)
  • cratedb_toolkit/util/database.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
cratedb_toolkit/info/cli.py (3)
cratedb_toolkit/cluster/core.py (5)
  • info (208-211)
  • DatabaseCluster (623-676)
  • from_options (636-640)
  • adapter (399-405)
  • adapter (484-488)
cratedb_toolkit/info/job.py (2)
  • TableTraffic (36-87)
  • render (82-87)
cratedb_toolkit/util/app.py (1)
  • cli (21-50)
cratedb_toolkit/info/job.py (1)
cratedb_toolkit/util/database.py (4)
  • DatabaseAdapter (43-421)
  • get_table_names (477-488)
  • run_sql (35-36)
  • run_sql (109-137)
🪛 Ruff (0.12.2)
cratedb_toolkit/info/cli.py

95-95: Local variable scrub is assigned to but never used

Remove assignment to unused variable scrub

(F841)

cratedb_toolkit/info/job.py

21-21: Found commented-out code

Remove commented-out code

(ERA001)


30-30: print found

Remove print

(T201)


31-31: Found commented-out code

Remove commented-out code

(ERA001)


33-33: print found

Remove print

(T201)


47-54: Possible SQL injection vector through string-based query construction

(S608)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-and-publish
  • GitHub Check: build-and-test
  • GitHub Check: Generic: Python 3.9 on OS ubuntu-latest
  • GitHub Check: Generic: Python 3.13 on OS ubuntu-latest
  • GitHub Check: Cloud: Python 3.13 on OS ubuntu-latest
  • GitHub Check: CFR for OS ubuntu-latest
  • GitHub Check: CFR for OS windows-latest
  • GitHub Check: CFR for OS macos-13
  • GitHub Check: CFR for OS macos-latest
🔇 Additional comments (3)
cratedb_toolkit/util/database.py (1)

478-489: Ensure safe fallback to bare table names when fully-qualified names might break downstream

The change from table.name to table.fqn will now always surface schema‐qualified names (including surrounding quotes), which may not be expected by consumers like the Rockset adapter or MLflow/LangChain integrations. To avoid unintended breakage:

• Update the docstring in cratedb_toolkit/util/database.py to reflect that the function returns fully-qualified names when available, but will fall back to the plain table name.
• Modify the loop at lines 480–488 in cratedb_toolkit/util/database.py to prefer table.fqn yet safely fall back to table.name.
• Manually review all callsites of get_table_names(sql), notably in cratedb_toolkit/info/job.py (populating tables_symbols) and cratedb_toolkit/adapter/rockset/server/api/query.py (setting "collections"), to confirm that passing FQNs (or bare names) behaves as intended in each context.

Proposed diff in cratedb_toolkit/util/database.py:

 def get_table_names(sql: str) -> t.List[t.List[str]]:
-    """
-    Decode table names from SQL statements.
-    """
+    """
+    Decode table names from SQL statements.
+    Returns fully-qualified names (schema.table) when available;
+    falls back to the plain table name otherwise.
+    """
     names = []
     statements = sqlparse_cratedb(sql)
     for statement in statements:
         local_names = []
         for table in statement.metadata.tables:
-            local_names.append(table.fqn)
+            # Prefer FQN (with schema); fall back to the bare table name.
+            local_names.append(getattr(table, "fqn", None) or table.name)
         names.append(local_names)
     return names

Please verify this change against each downstream consumer to ensure compatibility.

cratedb_toolkit/info/cli.py (1)

9-9: Import looks fine.

No issues with the new import. Keeping imports top-level matches existing style in this module.

cratedb_toolkit/info/job.py (1)

118-134: Semantics of DQL detection look good.

The is_dql logic combining is_select and not is_camouflage is a sensible baseline.

Comment on lines +89 to +99
@make_command(cli, "table-traffic", "Display information about table use.")
@click.pass_context
def table_traffic(ctx: click.Context):
    """
    Display ad hoc job information.
    """
    scrub = ctx.meta.get("scrub", False)
    dc = DatabaseCluster.from_options(ctx.meta["address"])
    traffic = TableTraffic(adapter=dc.adapter)
    traffic.render()


🛠️ Refactor suggestion

Wire scrub, return structured data, and emit JSON like other info commands.

  • scrub is read but unused (Ruff F841).
  • TableTraffic.render() currently logs and prints in its implementation; the CLI should output structured JSON like the other commands.

Refactor to pass scrub, return data, and call jd().

Apply this diff:

-@make_command(cli, "table-traffic", "Display information about table use.")
+@make_command(cli, "table-traffic", "Display information about table usage.")
 @click.pass_context
 def table_traffic(ctx: click.Context):
     """
-    Display ad hoc job information.
+    Display aggregate table usage derived from sys.jobs_log.
     """
     scrub = ctx.meta.get("scrub", False)
     dc = DatabaseCluster.from_options(ctx.meta["address"])
-    traffic = TableTraffic(adapter=dc.adapter)
-    traffic.render()
+    traffic = TableTraffic(adapter=dc.adapter, scrub=scrub)
+    data = traffic.render()
+    jd(data)

🤖 Prompt for AI Agents
In cratedb_toolkit/info/cli.py around lines 89-99, the CLI reads scrub but
doesn't use it and calls TableTraffic.render() which prints/logs instead of
returning structured data; update the function to pass scrub into TableTraffic
(e.g., TableTraffic(adapter=dc.adapter, scrub=scrub)), call the TableTraffic
method that returns structured data (assign it to a variable), emit that data
via jd(data) like the other info commands, and return the data so the scrub
variable is actually used and the output is JSON-serializable.

Comment on lines +27 to +34
def foo(self):
    fj = [attr.asdict(j) for j in self.data]
    df = pl.from_records(fj)
    print(df)
    # grouped = df.group_by("tables_symbols").agg([pl.sum("tables_symbols"), pl.sum("op")])
    grouped = df.sql("SELECT tables_symbols, COUNT(op) FROM self GROUP BY tables_symbols")
    print(grouped)


🛠️ Refactor suggestion

Replace prints and fragile DataFrame SQL with a deterministic aggregation API.

  • print in library code is undesirable (Ruff T201).
  • Grouping by the list column tables_symbols counts per-statement arrays, not per-table usage.
  • df.sql("... FROM self ...") is brittle; prefer a stable Polars API.

Refactor to explode the list column, then group and count, and return a DataFrame.

Apply this diff:

-    def foo(self):
-        fj = [attr.asdict(j) for j in self.data]
-        df = pl.from_records(fj)
-        print(df)
-        #grouped = df.group_by("tables_symbols").agg([pl.sum("tables_symbols"), pl.sum("op")])
-        grouped = df.sql("SELECT tables_symbols, COUNT(op) FROM self GROUP BY tables_symbols")
-        print(grouped)
+    def summarize_by_table(self) -> pl.DataFrame:
+        fj = [attr.asdict(j) for j in self.data]
+        if not fj:
+            return pl.DataFrame({"tables_symbols": [], "count": []})
+        df = pl.from_records(fj)
+        # Count statements per table (explode list column first, then group).
+        df = df.explode("tables_symbols")
+        return (
+            df.group_by("tables_symbols")
+              .agg(pl.len().alias("count"))
+              .sort("count", descending=True)
+        )

Comment on lines +36 to +40
class TableTraffic:

    def __init__(self, adapter: DatabaseAdapter):
        self.adapter = adapter


🛠️ Refactor suggestion

Pass scrub via constructor for consistent PII handling.

Expose scrub on TableTraffic to allow the CLI to control redaction.

Apply this diff:

-class TableTraffic:
-
-    def __init__(self, adapter: DatabaseAdapter):
-        self.adapter = adapter
+class TableTraffic:
+
+    def __init__(self, adapter: DatabaseAdapter, scrub: bool = False):
+        self.adapter = adapter
+        self.scrub = scrub
🤖 Prompt for AI Agents
In cratedb_toolkit/info/job.py around lines 36 to 40, TableTraffic's constructor
must accept and store a scrub parameter so the CLI can control PII redaction;
add a scrub parameter (typed appropriately, e.g., Callable[[str], str] or a
Scrubber type) to __init__, assign it to self.scrub, and update any code that
constructs TableTraffic to pass the desired scrub function (for example the
adapter's scrub or a CLI-provided scrub). Ensure the attribute is used wherever
redaction is performed and update type hints/imports as needed.

Comment on lines +58 to +64
def read_jobs(self, jobs) -> List[Operation]:
    result = []
    for job in jobs:
        sql = job["stmt"]
        result.append(self.parse_expression(sql))
    return result


🛠️ Refactor suggestion

Redact raw SQL when scrubbing and prepare data for aggregation.

Use scrub to avoid leaking query text and deduplicate tables per statement to count per-table usage once per statement.

Apply this diff:

     def read_jobs(self, jobs) -> List[Operation]:
         result = []
         for job in jobs:
             sql = job["stmt"]
-            result.append(self.parse_expression(sql))
+            op = self.parse_expression(sql)
+            if self.scrub:
+                op.stmt = None  # redact raw SQL text
+            result.append(op)
         return result
🤖 Prompt for AI Agents
In cratedb_toolkit/info/job.py around lines 58 to 64, the code returns raw SQL
and may double-count tables; call scrub(sql) before parsing to avoid leaking
query text (e.g., scrubbed = scrub(sql)) and pass the scrubbed SQL to
parse_expression, then deduplicate any per-operation table list so each table is
counted once per statement (e.g., replace op.tables with a unique list/set)
before appending to the result; ensure the returned Operation objects contain
scrubbed SQL and deduplicated tables for aggregation.

Comment on lines +65 to +78
@staticmethod
def parse_expression(sql: str) -> Operation:
    logger.debug(f"Analyzing SQL: {sql}")
    classifier = SqlStatementClassifier(expression=sql)
    if not classifier.operation:
        logger.warning(f"Unable to determine operation: {sql}")
    if not classifier.table_names:
        logger.warning(f"Unable to determine table names: {sql}")
    return Operation(
        op=classifier.operation,
        stmt=sql,
        tables_symbols=classifier.table_names,
    )


⚠️ Potential issue

Harden parsing: default unknown op, deduplicate table names.

Avoid None for op, and ensure tables per statement are unique to prevent overcounting.

Apply this diff:

     @staticmethod
     def parse_expression(sql: str) -> Operation:
         logger.debug(f"Analyzing SQL: {sql}")
         classifier = SqlStatementClassifier(expression=sql)
         if not classifier.operation:
             logger.warning(f"Unable to determine operation: {sql}")
         if not classifier.table_names:
             logger.warning(f"Unable to determine table names: {sql}")
-        return Operation(
-            op=classifier.operation,
+        table_names = list(dict.fromkeys(classifier.table_names))
+        return Operation(
+            op=classifier.operation or "UNKNOWN",
             stmt=sql,
-            tables_symbols=classifier.table_names,
+            tables_symbols=table_names,
         )
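The dict.fromkeys idiom used in the diff deduplicates while preserving first-seen order, keeping output stable across runs:

```python
# Duplicate table references within one statement, e.g. from a self-join.
names = ["doc.foo", "doc.bar", "doc.foo"]

# dict keys are unique and (since Python 3.7) insertion-ordered.
unique = list(dict.fromkeys(names))
print(unique)  # ['doc.foo', 'doc.bar']
```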
🤖 Prompt for AI Agents
In cratedb_toolkit/info/job.py around lines 65 to 78, the parse_expression
function currently can return None for the operation and duplicate table names;
ensure you set a default op value (e.g., "UNKNOWN" or Operation.UNKNOWN) when
classifier.operation is falsy, and replace the tables_symbols assignment with a
deduplicated list preserving order (e.g., use dict.fromkeys or an OrderedDict)
so repeated table names are collapsed before constructing the Operation; keep
the existing logging but still return the safe default op and the unique tables
list.

Comment on lines +79 to +88
def analyze_jobs(self, ops: Operations):
    ops.foo()

def render(self):
    jobs = self.read_jobs_database()
    logger.info(f"Analyzing {len(jobs)} jobs")
    ops = Operations(self.read_jobs(jobs))
    jobsa = self.analyze_jobs(ops)
    logger.info(f"Result: {jobsa}")


🛠️ Refactor suggestion

Make analyze/render return structured results; avoid side-effect-only logging.

Currently analyze_jobs returns None and render() logs that None. Return the computed summary and let the CLI serialize it.

Apply this diff:

-    def analyze_jobs(self, ops: Operations):
-        ops.foo()
+    def analyze_jobs(self, ops: Operations) -> pl.DataFrame:
+        return ops.summarize_by_table()
 
-    def render(self):
+    def render(self):
         jobs = self.read_jobs_database()
         logger.info(f"Analyzing {len(jobs)} jobs")
-        ops = Operations(self.read_jobs(jobs))
-        jobsa = self.analyze_jobs(ops)
-        logger.info(f"Result: {jobsa}")
+        ops = Operations(self.read_jobs(jobs))
+        summary_df = self.analyze_jobs(ops)
+        return summary_df.to_dicts()
🤖 Prompt for AI Agents
In cratedb_toolkit/info/job.py around lines 79 to 88, analyze_jobs currently
calls ops.foo() and returns None, and render() logs that None; change
analyze_jobs to compute and return a structured summary (e.g., dict or
dataclass) of the operations instead of returning None, and update render() to
call result = self.analyze_jobs(ops), stop printing/logging the raw None, and
return that structured result so the CLI can serialize it; ensure no
side-effect-only logging remains — keep logging minimal or move it to the
caller.

Comment on lines +143 to +149
def operation(self) -> str:
    """
    The SQL operation: SELECT, INSERT, UPDATE, DELETE, CREATE, etc.
    """
    parsed = self.parse_sqlparse()
    return parsed[0].get_type().upper()


⚠️ Potential issue

Guard against empty parses in operation to prevent IndexError.

sqlparse.parse("") returns an empty list; parsed[0] would crash.

Apply this diff:

     @property
     def operation(self) -> str:
         """
         The SQL operation: SELECT, INSERT, UPDATE, DELETE, CREATE, etc.
         """
-        parsed = self.parse_sqlparse()
-        return parsed[0].get_type().upper()
+        parsed = self.parse_sqlparse()
+        if not parsed:
+            return ""
+        return (parsed[0].get_type() or "").upper()
🤖 Prompt for AI Agents
In cratedb_toolkit/info/job.py around lines 143-149, the operation() method
currently assumes sqlparse.parse() returns at least one statement and accessing
parsed[0] can raise IndexError for empty SQL; update the method to check whether
parsed is non-empty before accessing parsed[0] and return an empty string (or a
sensible default) if it is empty, otherwise call get_type().upper() on the first
parsed statement; ensure the guard covers both an empty list and any unexpected
None so no IndexError occurs.

Comment on lines +165 to +177
def is_select_into(self) -> bool:
    """
    Use traditional `sqlparse` for catching `SELECT ... INTO ...` statements.

    Examples:
        SELECT * INTO foobar FROM bazqux
        SELECT * FROM bazqux INTO foobar
    """
    # Flatten all tokens (including nested ones) and match on type+value.
    statement = self.parse_sqlparse()[0]
    return any(
        token.ttype is Keyword and token.value.upper() == "INTO"
        for token in statement.flatten()
    )

🛠️ Refactor suggestion

Guard is_select_into against empty parses.

Prevent potential IndexError when no statements are parsed.

Apply this diff:

         # Flatten all tokens (including nested ones) and match on type+value.
-        statement = self.parse_sqlparse()[0]
+        parsed = self.parse_sqlparse()
+        if not parsed:
+            return False
+        statement = parsed[0]
         return any(
             token.ttype is Keyword and token.value.upper() == "INTO"
             for token in statement.flatten()
         )
🤖 Prompt for AI Agents
In cratedb_toolkit/info/job.py around lines 165 to 177, the is_select_into
method assumes parse_sqlparse() always returns at least one statement and can
raise IndexError; add a guard to handle empty or falsy parses by checking the
result (e.g., assign statements = self.parse_sqlparse(); if not statements:
return False) before accessing statements[0], then proceed with the existing
flatten-and-check logic so the method safely returns False when no statements
are parsed.
