-
Notifications
You must be signed in to change notification settings - Fork 4
Admin/XMover: Problematic translog detection and cancellation #533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: xmover
Are you sure you want to change the base?
Conversation
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughIntroduces shard-size analytics and reporting in ShardAnalyzer, adds TranslogReporter and a new CLI command problematic-translogs, extends RecoveryMonitor with problematic shard visibility, incorporates partition-aware table displays, updates data model and database queries to include partition_values, and adds documentation and tests covering new behaviors. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor U as User
participant CLI as xmover CLI
participant TR as TranslogReporter
participant DB as CrateDBClient
U->>CLI: xmover problematic-translogs --size-mb N [--cancel]
CLI->>TR: problematic_translogs(size_mb=N)
TR->>DB: query replicas with uncommitted translog >= N MB
DB-->>TR: rows (table, partition_values, shard_id, node, ...)
TR-->>CLI: generated ALTER ... REROUTE CANCEL SHARD commands
alt --cancel flag and commands exist
CLI->>U: Prompt for confirmation
U-->>CLI: Confirm/Decline
opt If confirmed
loop For each command
CLI->>DB: execute_query(command)
Note right of CLI: 1s delay between commands
end
end
end
CLI-->>U: Summary (counts, successes/failures)
sequenceDiagram
autonumber
participant SA as ShardAnalyzer
participant DB as CrateDBClient
participant UI as ShardReporter UI
SA->>DB: fetch shard stats and nodes/zones
DB-->>SA: shards (sizes, states), node zones
SA-->>UI: distribution_stats (buckets, totals, warnings)
UI-->>UI: render size distribution + optional large/small details
sequenceDiagram
autonumber
participant RM as RecoveryMonitor
participant DB as CrateDBClient
participant T as Terminal
RM->>DB: get_recoveries()
DB-->>RM: active/completed recoveries with partition_values
RM->>DB: get_problematic_shards()
DB-->>RM: shards not STARTED (with partition_values)
RM-->>T: status lines, summaries, examples (partition-aware)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
373efef
to
143ffbf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WalBeh: Unfortunately, the software tests are failing here. I guess CrateDB has to be properly provisioned before.
pytest --no-cov tests/admin/test_problematic_translogs.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've fixed the software tests per 13d59cc.
dac2a6d
to
cb05b4d
Compare
143ffbf
to
a48d9fc
Compare
21dd237
to
3b3942f
Compare
d6a1753
to
0c12e4f
Compare
@WalBeh:
Can you add software tests here or to another branch which will bring up software test code coverage again, with a specific focus on your new code? EDIT: All right, coverage is still >50%. However, the addition still decreased it, so it would not hurt to still add more test cases when applicable.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (11)
cratedb_toolkit/admin/xmover/model.py (1)
61-61
: Standardize “no partition” sentinel.Tests use the string "NULL". Prefer None to represent “no partition” at the data layer to avoid leaking sentinel strings into UIs. If "NULL" must be retained for compatibility, document it on the field.
doc/admin/xmover/handbook.md (1)
271-287
: Add language to fenced code block to satisfy markdownlint.Label the sample output block as text.
Apply:
-**Sample Output:** -``` +**Sample Output:** +```text Found 3 shards with problematic translogs: ... Total: 3 ALTER commands generated -``` +```cratedb_toolkit/admin/xmover/util/format.py (1)
46-58
: Fix typing: avoid implicit Optional (RUF013).Annotate
partition_values
as Optional and import it.Apply:
+from typing import Optional @@ -def format_table_display_with_partition(schema_name: str, table_name: str, partition_values: str = None) -> str: +def format_table_display_with_partition( + schema_name: str, table_name: str, partition_values: Optional[str] = None +) -> str:Optional (defensive): treat common sentinels as “no partition” to keep displays clean:
- if partition_values: + if partition_values and partition_values.strip() not in {"NULL", "-", "none", ""}: return f"{base_display} {partition_values}"cratedb_toolkit/admin/xmover/cli.py (2)
480-493
: Catching broad Exception — either narrow or document.If
execute_query
raises a known exception type, catch that. Otherwise, suppress the Ruff BLE001 warning explicitly.Apply one of:
- except Exception as e: + # Broad except is intentional for CLI UX + except Exception as e: # noqa: BLE001or, if available:
- except Exception as e: + except (click.ClickException, RuntimeError) as e:Confirm the concrete exception type thrown by
CrateDBClient.execute_query
and replaceException
accordingly.
506-510
: Same: broad except for analysis block.Mirror the decision above for consistency.
Apply:
- except Exception as e: + except Exception as e: # noqa: BLE001tests/admin/test_problematic_translogs.py (2)
156-159
: Patch time.sleep at the module where it’s used to avoid cross-test bleed.Patch the symbol in the CLI module (or the module that calls sleep), not the global
time.sleep
.-), patch("time.sleep"): # Mock sleep to speed up test +), patch("cratedb_toolkit.admin.xmover.cli.time.sleep"): # Mock sleep where usedIf sleep is called from another module, target that module path instead.
128-131
: Make the query assertion resilient to quoting of "primary".The SQL should quote the column as
sh."primary" = FALSE
. Update the assertion to allow both.+import re ... -assert "primary = FALSE" in query +assert re.search(r'"?primary"?\s*=\s*FALSE', query), querycratedb_toolkit/admin/xmover/operational/monitor.py (1)
140-147
: Use the shared formatter for table+partition display for consistency.Align
_format_recovery_table
with the rest of the file.-# Format table name with partition values if available -table_display = f"{recovery.schema_name}.{recovery.table_name}" -if recovery.partition_values: - table_display = f"{table_display} {recovery.partition_values}" +table_display = format_table_display_with_partition( + recovery.schema_name, recovery.table_name, recovery.partition_values +)cratedb_toolkit/admin/xmover/util/database.py (1)
516-573
: Docstring vs filter mismatch: exclude actively recovering shards or adjust wording.
get_problematic_shards
says “aren't actively recovering” but currently includes allstate != 'STARTED'
, which can includeRECOVERING
. Either update the filter or the docstring.Option A (exclude recovering):
- where_conditions = ["s.state != 'STARTED'"] + where_conditions = ["s.state NOT IN ('STARTED','RECOVERING')"]Option B: Clarify the docstring to say it returns “non-started shards (may include recovering),” and rely on caller-side filtering (as done in monitor).
cratedb_toolkit/admin/xmover/analysis/shard.py (2)
1069-1127
: Narrow the try/except and useelse:
for Ruff TRY300; annotate broad catch if intentional.Limit the try to the query, put the rendering in
else:
, and annotate the broad catch (BLE001) since this is a CLI UX path.- try: - result = self.client.execute_query(query, [size_mb]) - rows = result.get("rows", []) - - if not rows: - console.print(f"[green]✓ No replica shards found with translog uncommitted size > {size_mb}MB[/green]") - return [] - - console.print(f"Found {len(rows)} shards with problematic translogs:") - console.print() - ... - return alter_commands - except Exception as e: - console.print(f"[red]Error analyzing problematic translogs: {e}[/red]") + try: + result = self.client.execute_query(query, [size_mb]) + except Exception as e: # noqa: BLE001 + console.print(f"[red]Error analyzing problematic translogs: {e}[/red]") import traceback - console.print(f"[dim]{traceback.format_exc()}[/dim]") return [] + else: + rows = result.get("rows", []) + if not rows: + console.print(f"[green]✓ No replica shards found with translog uncommitted size > {size_mb}MB[/green]") + return [] + console.print(f"Found {len(rows)} shards with problematic translogs:") + console.print() + ... + return alter_commands
771-773
: Remove unused variable or use it.
zone
is computed but unused.-# FIXME: `zone` does not get used. -zone = next((node.zone for node in self.nodes if node.id == node_id), "unknown") # noqa: F841 +# zone lookup omitted intentionally; remove variable to avoid noiseOr use it in emitted stats if valuable.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
cratedb_toolkit/admin/xmover/analysis/shard.py
(4 hunks)cratedb_toolkit/admin/xmover/cli.py
(4 hunks)cratedb_toolkit/admin/xmover/model.py
(2 hunks)cratedb_toolkit/admin/xmover/operational/monitor.py
(11 hunks)cratedb_toolkit/admin/xmover/util/database.py
(5 hunks)cratedb_toolkit/admin/xmover/util/format.py
(1 hunks)doc/admin/xmover/handbook.md
(1 hunks)tests/admin/test_cli.py
(1 hunks)tests/admin/test_problematic_translogs.py
(1 hunks)tests/admin/test_recovery_monitor.py
(5 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-18T17:32:10.537Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: cratedb_toolkit/admin/xmover/analyze/shard.py:0-0
Timestamp: 2025-09-18T17:32:10.537Z
Learning: XMover (cratedb_toolkit.admin.xmover) intentionally uses print statements throughout its codebase instead of structured logging, as it's designed as a CLI tool where direct console output is the intended behavior.
Applied to files:
cratedb_toolkit/admin/xmover/cli.py
cratedb_toolkit/admin/xmover/operational/monitor.py
🪛 markdownlint-cli2 (0.18.1)
doc/admin/xmover/handbook.md
272-272: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.13.0)
cratedb_toolkit/admin/xmover/util/format.py
46-46: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
cratedb_toolkit/admin/xmover/cli.py
490-490: Do not catch blind exception: Exception
(BLE001)
506-506: Do not catch blind exception: Exception
(BLE001)
cratedb_toolkit/admin/xmover/analysis/shard.py
1126-1126: Consider moving this statement to an else
block
(TRY300)
1128-1128: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build-and-publish
🔇 Additional comments (5)
cratedb_toolkit/admin/xmover/model.py (1)
2-5
: Type alias and imports look good.The
TableStatsType
alias and expanded typing imports are appropriate and improve clarity.tests/admin/test_cli.py (1)
10-19
: CLI surface coverage updated — looks good.Including
active-shards
,problematic-translogs
, andshard-distribution
here keeps the smoke test aligned with the CLI.tests/admin/test_recovery_monitor.py (1)
74-74
: Partition awareness wired into tests — OK.Passing
partition_values
aligns tests with the new dataclass signature.If "no partition" is expected, consider using
None
rather than the string "NULL" to avoid UI artifacts. Confirm downstream formatting handlesNone
correctly.Also applies to: 138-138, 156-156, 224-224, 241-241
cratedb_toolkit/admin/xmover/cli.py (1)
44-56
: Don't require a DB connection for offline/diagnostic commands — lazy-init the CrateDBClientEager connection blocks
explain-error
and other offline flows; initialize the client only for subcommands that need DB access.File: cratedb_toolkit/admin/xmover/cli.py (lines 44–56)
-# Test connection on startup. -client = CrateDBClient() -if not client.test_connection(): - console.print("[red]Error: Failed connecting to CrateDB[/red]") - console.print( - "Please check your database connection string, " - "i.e. the CRATE_CONNECTION_STRING environment variable, " - "possibly stored within an .env file" - ) - raise click.Abort() - -# Propagate the client handle. -ctx.obj["client"] = client +# Only initialize the DB client for commands that need it +needs_db = { + "analyze", + "find-candidates", + "recommend", + "check-balance", + "shard-distribution", + "active-shards", + "monitor-recovery", + "problematic-translogs", + "zone-analysis", + "validate-move", + "test-connection", +} +sub = ctx.invoked_subcommand +if sub in needs_db: + client = CrateDBClient() + if not client.test_connection(): + console.print("[red]Error: Failed connecting to CrateDB[/red]") + console.print( + "Please check your database connection string, " + "i.e. the CRATE_CONNECTION_STRING environment variable, " + "possibly stored within an .env file" + ) + raise click.Abort() + ctx.obj["client"] = clientVerification: attempted the provided smoke-check but the import failed here with "ModuleNotFoundError: No module named 'croud'"; unable to run the smoke test in this environment. Re-run the smoke-check locally or in CI (with project dependencies, including croud, installed) after applying the patch to confirm offline commands work as expected.
tests/admin/test_problematic_translogs.py (1)
13-214
: Solid, comprehensive test coverage for the new CLI path.Scenarios are well covered (empty, partitioned/non‑partitioned, mixed, cancel flow, failures, defaults). Nice use of CliRunner and mocking.
WHERE | ||
sh.state = 'STARTED' | ||
AND sh.translog_stats['uncommitted_size'] \ | ||
> ? * 1024^2 | ||
AND primary = FALSE | ||
ORDER BY | ||
6 DESC \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quote and qualify the "primary" column in SQL.
Unquoted primary
can break on some engines; qualify with the alias for clarity.
- AND primary = FALSE
+ AND sh."primary" = FALSE
Note: adjust the related test to tolerate quoted primary
(see test update suggestion).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
WHERE | |
sh.state = 'STARTED' | |
AND sh.translog_stats['uncommitted_size'] \ | |
> ? * 1024^2 | |
AND primary = FALSE | |
ORDER BY | |
6 DESC \ | |
WHERE | |
sh.state = 'STARTED' | |
AND sh.translog_stats['uncommitted_size'] \ | |
> ? * 1024^2 | |
AND sh."primary" = FALSE | |
ORDER BY | |
6 DESC \ |
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/analysis/shard.py around lines 1060 to 1066 the
SQL uses an unquoted column name primary which can fail on some SQL engines;
qualify it with the table alias (e.g., sh."primary" or sh.primary depending on
quoting rules used elsewhere) in the WHERE clause and in any related
ORDER/GROUP/SELECT references, and update the related test to accept the
quoted/qualified form of primary.
|
||
schema_name: str | ||
table_name: str | ||
partition_values: Optional[str] # Partition values for partitioned tables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
New required field may be a breaking change; verify all call sites.
partition_values: Optional[str]
is non‑default and now required at construction. Ensure all creators of RecoveryInfo
across the repo (and downstream users) were updated accordingly.
Run to verify:
🏁 Script executed:
#!/bin/bash
# Find all RecoveryInfo constructions and show arguments
rg -nPU --type=py '(?s)RecoveryInfo\s*\(' -C2
Length of output: 2141
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="cratedb_toolkit/admin/xmover/model.py"
if [[ ! -f "$file" ]]; then
echo "MISSING:$file"
exit 0
fi
nl -ba "$file" | sed -n '1,240p'
Length of output: 59
Make partition_values optional (or update all callers).
Multiple RecoveryInfo(...) constructions don’t pass partition_values; either change the model to add a default (e.g. partition_values: Optional[str] = None in cratedb_toolkit/admin/xmover/model.py at ~line 61) or update all call sites to pass the new field. Affected call sites found: tests/admin/test_recovery_monitor.py:71,135,152,219,237 and cratedb_toolkit/admin/xmover/util/database.py:448.
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/model.py around line 61, the RecoveryInfo field
partition_values is declared as Optional[str] but has no default causing callers
that omit it to fail; update the model to make it truly optional by giving it a
default (e.g. partition_values: Optional[str] = None) so existing callers
continue to work, or alternatively update all call sites (tests and
cratedb_toolkit/admin/xmover/util/database.py) to pass a value for
partition_values.
is_recovering = any( | ||
r.shard_id == shard["shard_id"] | ||
and r.table_name == shard["table_name"] | ||
and r.schema_name == shard["schema_name"] | ||
for r in recoveries | ||
) | ||
if not is_recovering: | ||
non_recovering_shards.append(shard) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Include partition in “is_recovering” checks to avoid false matches across partitions.
table_name + shard_id
is not unique for partitioned tables; this can hide problematic shards or misclassify them.
- is_recovering = any(
- r.shard_id == shard["shard_id"]
- and r.table_name == shard["table_name"]
- and r.schema_name == shard["schema_name"]
- for r in recoveries
- )
+ is_recovering = any(
+ r.shard_id == shard["shard_id"]
+ and r.table_name == shard["table_name"]
+ and r.schema_name == shard["schema_name"]
+ and (r.partition_values or None) == (shard.get("partition_values") or None)
+ for r in recoveries
+ )
Apply the same change at Lines 395-401 and 463-470.
Also applies to: 395-401, 463-470
🤖 Prompt for AI Agents
In cratedb_toolkit/admin/xmover/operational/monitor.py around lines 311-319, the
is_recovering check only compares shard_id, table_name and schema_name which can
produce false matches for partitioned tables; update the predicate to also
compare partition (or partition_value/partition_id field used in shard dict and
Recovery objects) so the match includes partition and prevents
misclassification; make the identical change in the same logic at lines 395-401
and 463-470 so all three checks include partition in the equality conditions.
- More or less just line-length fixes. - Only a single type adjustment was needed on the return value of the `analyze_distribution` method. - Ruff recommended to use set comprehensions, so here we go. - At a single spot where an exception has been `pass`ed, we added error output. Is it bad?
The new command can find and cancel replica shards with large uncommitted translog sizes. It provides options to set size thresholds and optionally execute cancel commands after user confirmation.
…porting This commit adds comprehensive shard size analysis capabilities, including: - Size distribution bucketing and reporting - Early warning detection for oversized and undersized shards - Partition-aware shard status display - Integration with recovery monitoring - Improved problematic shard detection - Enhanced CLI output formatting The changes help identify shard sizing issues that could impact cluster performance and stability.
- Add max shard size tracking to size buckets - Add small shards analysis view to analyzer command
0c12e4f
to
bc2a66b
Compare
About
The new subcommand
xmover problematic-translogs
can find and cancel replica shards with large uncommitted translog sizes. It provides options to set size thresholds and optionally execute cancel commands after user confirmation.Install
uv pip install --upgrade 'cratedb-toolkit @ git+https://github.com/crate/cratedb-toolkit.git@xmover-problematic-translogs'
Documentation
References