Admin/XMover: Problematic translog detection and cancellation #533

amotl · 2025-09-10T19:39:19Z

About

The new subcommand xmover problematic-translogs can find and cancel replica shards with large uncommitted translog sizes. It provides options to set size thresholds and optionally execute cancel commands after user confirmation.

Note: This PR is stacked upon Admin/XMover: Add CrateDB shard analyzer and movement tool #523.

Install

uv pip install --upgrade 'cratedb-toolkit @ git+https://github.com/crate/cratedb-toolkit.git@xmover-problematic-translogs'

Documentation

https://cratedb-toolkit--533.org.readthedocs.build/admin/xmover/handbook.html#problematic-translogs

References

Looking-glass utilities for exploring what's going on: CrateDB shard analysis and beyond #513

coderabbitai · 2025-09-10T19:39:26Z

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Introduces shard-size analytics and reporting in ShardAnalyzer, adds TranslogReporter and a new CLI command problematic-translogs, extends RecoveryMonitor with problematic shard visibility, incorporates partition-aware table displays, updates data model and database queries to include partition_values, and adds documentation and tests covering new behaviors.

Changes

Cohort / File(s)	Summary of Changes
Shard size analytics and translog reporting `cratedb_toolkit/admin/xmover/analysis/shard.py`	Added shard size overview, large/small shard detail methods, and TranslogReporter for analyzing uncommitted translogs and generating ALTER commands; integrated distribution stats and zone mapping into reporting.
CLI integration `cratedb_toolkit/admin/xmover/cli.py`	Updated help/init flow; added problematic_translogs command with threshold option, optional execution of generated ALTER statements, confirmation, pacing, and error handling.
Model and typing updates `cratedb_toolkit/admin/xmover/model.py`	Added TableStatsType alias; extended RecoveryInfo with optional partition_values field.
Database client extensions `cratedb_toolkit/admin/xmover/util/database.py`	get_recovery_details now joins partition metadata and returns partition_values; added get_problematic_shards API with optional filters; adapted RecoveryInfo parsing to include partition_values.
Formatting helper `cratedb_toolkit/admin/xmover/util/format.py`	Introduced format_table_display_with_partition for partition-aware table labeling.
Recovery monitoring UX `cratedb_toolkit/admin/xmover/operational/monitor.py`	Added get_problematic_shards flow; partition-aware displays; revised summaries and watch-mode status, including non-recovering shard reporting.
Documentation `doc/admin/xmover/handbook.md`	Documented problematic-translogs command, options, examples, and behavior.
Tests: CLI and features `tests/admin/test_cli.py`, `tests/admin/test_problematic_translogs.py`	Expanded CLI coverage; added comprehensive tests for problematic-translogs including generation, cancellation, errors, and defaults.
Tests: Recovery monitor `tests/admin/test_recovery_monitor.py`	Updated RecoveryInfo usages to include partition_values parameter.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant CLI as xmover CLI
  participant TR as TranslogReporter
  participant DB as CrateDBClient

  U->>CLI: xmover problematic-translogs --size-mb N [--cancel]
  CLI->>TR: problematic_translogs(size_mb=N)
  TR->>DB: query replicas with uncommitted translog >= N MB
  DB-->>TR: rows (table, partition_values, shard_id, node, ...)
  TR-->>CLI: generated ALTER ... REROUTE CANCEL SHARD commands
  alt --cancel flag and commands exist
    CLI->>U: Prompt for confirmation
    U-->>CLI: Confirm/Decline
    opt If confirmed
      loop For each command
        CLI->>DB: execute_query(command)
        Note right of CLI: 1s delay between commands
      end
    end
  end
  CLI-->>U: Summary (counts, successes/failures)

sequenceDiagram
  autonumber
  participant SA as ShardAnalyzer
  participant DB as CrateDBClient
  participant UI as ShardReporter UI

  SA->>DB: fetch shard stats and nodes/zones
  DB-->>SA: shards (sizes, states), node zones
  SA-->>UI: distribution_stats (buckets, totals, warnings)
  UI-->>UI: render size distribution + optional large/small details

sequenceDiagram
  autonumber
  participant RM as RecoveryMonitor
  participant DB as CrateDBClient
  participant T as Terminal

  RM->>DB: get_recoveries()
  DB-->>RM: active/completed recoveries with partition_values
  RM->>DB: get_problematic_shards()
  DB-->>RM: shards not STARTED (with partition_values)
  RM-->>T: status lines, summaries, examples (partition-aware)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

WalBeh
surister

Poem

A rabbit taps keys with a whiskered grin,
Buckets of shards, where to begin?
Translogs tamed, with cautious hops,
Cancel reroutes in gentle stops.
Partitions named, the metrics sing—
Balance restored: thump-thump, spring! 🐇📊

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 90.24% which is sufficient. The required threshold is 80.00%.
Title Check	✅ Passed	The title is concise and accurately summarizes the primary change: adding problematic translog detection and cancellation to the Admin/XMover tool; it reflects the new CLI command and related code/docs in the diff and is clear for teammates scanning history.
Description Check	✅ Passed	The PR description clearly documents the new xmover problematic-translogs subcommand, its options (--size‑mb and --cancel), intended behavior (find and optionally cancel replica shards with large uncommitted translogs), and points to installation and docs; this directly matches the CLI, TranslogReporter, DB query, and handbook changes in the diff. The author also notes stacking on GH-523 and provides useful context and references, so the description is related and appropriate for the changeset.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

amotl · 2025-09-10T19:41:59Z

tests/admin/test_problematic_translogs.py

@WalBeh: Unfortunately, the software tests are failing here. I guess CrateDB has to be properly provisioned before.

pytest --no-cov tests/admin/test_problematic_translogs.py

We've fixed the software tests per 13d59cc.

amotl · 2025-09-19T21:16:24Z

@WalBeh: ~~Checks are failing here because code coverage dropped below 50%.~~

~~codecov/project — 46.49% (target 50.00%)~~

Can you add software tests here or to another branch which will bring up software test code coverage again, with a specific focus on your new code?

EDIT: All right, coverage is still >50%. However, the addition still decreased it, so it would not hurt to still add more test cases when applicable.

codecov/project Successful in 1s — 56.52% (target 50.00%)

amotl · 2025-09-19T21:23:04Z

@coderabbitai review

coderabbitai · 2025-09-19T21:23:11Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (11)

cratedb_toolkit/admin/xmover/model.py (1)

61-61: Standardize “no partition” sentinel.

Tests use the string "NULL". Prefer None to represent “no partition” at the data layer to avoid leaking sentinel strings into UIs. If "NULL" must be retained for compatibility, document it on the field.
doc/admin/xmover/handbook.md (1)
271-287: Add language to fenced code block to satisfy markdownlint.

Label the sample output block as text.

Apply:
-**Sample Output:**
-```
+**Sample Output:**
+```text
 Found 3 shards with problematic translogs:
 ...
 Total: 3 ALTER commands generated
-```
+```
cratedb_toolkit/admin/xmover/util/format.py (1)
46-58: Fix typing: avoid implicit Optional (RUF013).

Annotate partition_values as Optional and import it.

Apply:
+from typing import Optional
@@
-def format_table_display_with_partition(schema_name: str, table_name: str, partition_values: str = None) -> str:
+def format_table_display_with_partition(
+    schema_name: str, table_name: str, partition_values: Optional[str] = None
+) -> str:
Optional (defensive): treat common sentinels as “no partition” to keep displays clean:
-    if partition_values:
+    if partition_values and partition_values.strip() not in {"NULL", "-", "none", ""}:
         return f"{base_display} {partition_values}"
cratedb_toolkit/admin/xmover/cli.py (2)
480-493: Catching broad Exception — either narrow or document.

If execute_query raises a known exception type, catch that. Otherwise, suppress the Ruff BLE001 warning explicitly.

Apply one of:
-                    except Exception as e:
+                    # Broad except is intentional for CLI UX
+                    except Exception as e:  # noqa: BLE001
or, if available:
-                    except Exception as e:
+                    except (click.ClickException, RuntimeError) as e:
Confirm the concrete exception type thrown by CrateDBClient.execute_query and replace Exception accordingly.

506-510: Same: broad except for analysis block.

Mirror the decision above for consistency.

Apply:
-    except Exception as e:
+    except Exception as e:  # noqa: BLE001
tests/admin/test_problematic_translogs.py (2)
156-159: Patch time.sleep at the module where it’s used to avoid cross-test bleed.

Patch the symbol in the CLI module (or the module that calls sleep), not the global time.sleep.
-), patch("time.sleep"):  # Mock sleep to speed up test
+), patch("cratedb_toolkit.admin.xmover.cli.time.sleep"):  # Mock sleep where used
If sleep is called from another module, target that module path instead.

128-131: Make the query assertion resilient to quoting of "primary".

The SQL should quote the column as sh."primary" = FALSE. Update the assertion to allow both.
+import re
 ...
-assert "primary = FALSE" in query
+assert re.search(r'"?primary"?\s*=\s*FALSE', query), query
cratedb_toolkit/admin/xmover/operational/monitor.py (1)
140-147: Use the shared formatter for table+partition display for consistency.

Align _format_recovery_table with the rest of the file.
-# Format table name with partition values if available
-table_display = f"{recovery.schema_name}.{recovery.table_name}"
-if recovery.partition_values:
-    table_display = f"{table_display} {recovery.partition_values}"
+table_display = format_table_display_with_partition(
+    recovery.schema_name, recovery.table_name, recovery.partition_values
+)
cratedb_toolkit/admin/xmover/util/database.py (1)
516-573: Docstring vs filter mismatch: exclude actively recovering shards or adjust wording.

get_problematic_shards says “aren't actively recovering” but currently includes all state != 'STARTED', which can include RECOVERING. Either update the filter or the docstring.

Option A (exclude recovering):
- where_conditions = ["s.state != 'STARTED'"]
+ where_conditions = ["s.state NOT IN ('STARTED','RECOVERING')"]
Option B: Clarify the docstring to say it returns “non-started shards (may include recovering),” and rely on caller-side filtering (as done in monitor).
cratedb_toolkit/admin/xmover/analysis/shard.py (2)
1069-1127: Narrow the try/except and use else: for Ruff TRY300; annotate broad catch if intentional.

Limit the try to the query, put the rendering in else:, and annotate the broad catch (BLE001) since this is a CLI UX path.
-        try:
-            result = self.client.execute_query(query, [size_mb])
-            rows = result.get("rows", [])
-
-            if not rows:
-                console.print(f"[green]✓ No replica shards found with translog uncommitted size > {size_mb}MB[/green]")
-                return []
-
-            console.print(f"Found {len(rows)} shards with problematic translogs:")
-            console.print()
-            ...
-            return alter_commands
-        except Exception as e:
-            console.print(f"[red]Error analyzing problematic translogs: {e}[/red]")
+        try:
+            result = self.client.execute_query(query, [size_mb])
+        except Exception as e:  # noqa: BLE001
+            console.print(f"[red]Error analyzing problematic translogs: {e}[/red]")
             import traceback
-
             console.print(f"[dim]{traceback.format_exc()}[/dim]")
             return []
+        else:
+            rows = result.get("rows", [])
+            if not rows:
+                console.print(f"[green]✓ No replica shards found with translog uncommitted size > {size_mb}MB[/green]")
+                return []
+            console.print(f"Found {len(rows)} shards with problematic translogs:")
+            console.print()
+            ...
+            return alter_commands
771-773: Remove unused variable or use it.

zone is computed but unused.
-# FIXME: `zone` does not get used.
-zone = next((node.zone for node in self.nodes if node.id == node_id), "unknown")  # noqa: F841
+# zone lookup omitted intentionally; remove variable to avoid noise
Or use it in emitted stats if valuable.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b3942f and 0c12e4f.

📒 Files selected for processing (10)

cratedb_toolkit/admin/xmover/analysis/shard.py (4 hunks)
cratedb_toolkit/admin/xmover/cli.py (4 hunks)
cratedb_toolkit/admin/xmover/model.py (2 hunks)
cratedb_toolkit/admin/xmover/operational/monitor.py (11 hunks)
cratedb_toolkit/admin/xmover/util/database.py (5 hunks)
cratedb_toolkit/admin/xmover/util/format.py (1 hunks)
doc/admin/xmover/handbook.md (1 hunks)
tests/admin/test_cli.py (1 hunks)
tests/admin/test_problematic_translogs.py (1 hunks)
tests/admin/test_recovery_monitor.py (5 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-09-18T17:32:10.537Z

Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: cratedb_toolkit/admin/xmover/analyze/shard.py:0-0
Timestamp: 2025-09-18T17:32:10.537Z
Learning: XMover (cratedb_toolkit.admin.xmover) intentionally uses print statements throughout its codebase instead of structured logging, as it's designed as a CLI tool where direct console output is the intended behavior.

Applied to files:

cratedb_toolkit/admin/xmover/cli.py
cratedb_toolkit/admin/xmover/operational/monitor.py

🪛 markdownlint-cli2 (0.18.1)

doc/admin/xmover/handbook.md

272-272: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.13.0)

cratedb_toolkit/admin/xmover/util/format.py

46-46: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)

cratedb_toolkit/admin/xmover/cli.py

490-490: Do not catch blind exception: Exception

(BLE001)

506-506: Do not catch blind exception: Exception

(BLE001)

cratedb_toolkit/admin/xmover/analysis/shard.py

1126-1126: Consider moving this statement to an else block

(TRY300)

1128-1128: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build-and-publish

🔇 Additional comments (5)

cratedb_toolkit/admin/xmover/model.py (1)

2-5: Type alias and imports look good.

The TableStatsType alias and expanded typing imports are appropriate and improve clarity.

tests/admin/test_cli.py (1)

10-19: CLI surface coverage updated — looks good.

Including active-shards, problematic-translogs, and shard-distribution here keeps the smoke test aligned with the CLI.

tests/admin/test_recovery_monitor.py (1)

74-74: Partition awareness wired into tests — OK.

Passing partition_values aligns tests with the new dataclass signature.

If "no partition" is expected, consider using None rather than the string "NULL" to avoid UI artifacts. Confirm downstream formatting handles None correctly.

Also applies to: 138-138, 156-156, 224-224, 241-241
cratedb_toolkit/admin/xmover/cli.py (1)
44-56: Don't require a DB connection for offline/diagnostic commands — lazy-init the CrateDBClient

Eager connection blocks explain-error and other offline flows; initialize the client only for subcommands that need DB access.

File: cratedb_toolkit/admin/xmover/cli.py (lines 44–56)
-# Test connection on startup.
-client = CrateDBClient()
-if not client.test_connection():
-    console.print("[red]Error: Failed connecting to CrateDB[/red]")
-    console.print(
-        "Please check your database connection string, "
-        "i.e. the CRATE_CONNECTION_STRING environment variable, "
-        "possibly stored within an .env file"
-    )
-    raise click.Abort()
-
-# Propagate the client handle.
-ctx.obj["client"] = client
+# Only initialize the DB client for commands that need it
+needs_db = {
+    "analyze",
+    "find-candidates",
+    "recommend",
+    "check-balance",
+    "shard-distribution",
+    "active-shards",
+    "monitor-recovery",
+    "problematic-translogs",
+    "zone-analysis",
+    "validate-move",
+    "test-connection",
+}
+sub = ctx.invoked_subcommand
+if sub in needs_db:
+    client = CrateDBClient()
+    if not client.test_connection():
+        console.print("[red]Error: Failed connecting to CrateDB[/red]")
+        console.print(
+            "Please check your database connection string, "
+            "i.e. the CRATE_CONNECTION_STRING environment variable, "
+            "possibly stored within an .env file"
+        )
+        raise click.Abort()
+    ctx.obj["client"] = client
Verification: attempted the provided smoke-check but the import failed here with "ModuleNotFoundError: No module named 'croud'"; unable to run the smoke test in this environment. Re-run the smoke-check locally or in CI (with project dependencies, including croud, installed) after applying the patch to confirm offline commands work as expected.
tests/admin/test_problematic_translogs.py (1)

13-214: Solid, comprehensive test coverage for the new CLI path.

Scenarios are well covered (empty, partitioned/non‑partitioned, mixed, cancel flow, failures, defaults). Nice use of CliRunner and mocking.

coderabbitai · 2025-09-19T21:32:16Z

cratedb_toolkit/admin/xmover/analysis/shard.py

+                WHERE
+                    sh.state = 'STARTED'
+                  AND sh.translog_stats['uncommitted_size'] \
+                    > ? * 1024^2
+                  AND primary = FALSE
+                ORDER BY
+                    6 DESC \


⚠️ Potential issue

Quote and qualify the "primary" column in SQL.

Unquoted primary can break on some engines; qualify with the alias for clarity.

- AND primary = FALSE + AND sh."primary" = FALSE

Note: adjust the related test to tolerate quoted primary (see test update suggestion).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

WHERE

sh.state = 'STARTED'

AND sh.translog_stats['uncommitted_size'] \

> ? * 1024^2

AND primary = FALSE

ORDER BY

6 DESC \

WHERE

sh.state = 'STARTED'

AND sh.translog_stats['uncommitted_size'] \

> ? * 1024^2

AND sh."primary" = FALSE

ORDER BY

6 DESC \

🤖 Prompt for AI Agents

In cratedb_toolkit/admin/xmover/analysis/shard.py around lines 1060 to 1066 the SQL uses an unquoted column name primary which can fail on some SQL engines; qualify it with the table alias (e.g., sh."primary" or sh.primary depending on quoting rules used elsewhere) in the WHERE clause and in any related ORDER/GROUP/SELECT references, and update the related test to accept the quoted/qualified form of primary.

coderabbitai · 2025-09-19T21:32:16Z

cratedb_toolkit/admin/xmover/model.py


    schema_name: str
    table_name: str
+    partition_values: Optional[str]  # Partition values for partitioned tables


⚠️ Potential issue

🧩 Analysis chain

New required field may be a breaking change; verify all call sites.

partition_values: Optional[str] is non‑default and now required at construction. Ensure all creators of RecoveryInfo across the repo (and downstream users) were updated accordingly.

Run to verify:

🏁 Script executed:

#!/bin/bash # Find all RecoveryInfo constructions and show arguments rg -nPU --type=py '(?s)RecoveryInfo\s*\(' -C2

Length of output: 2141

🏁 Script executed:

#!/bin/bash set -euo pipefail file="cratedb_toolkit/admin/xmover/model.py" if [[ ! -f "$file" ]]; then echo "MISSING:$file" exit 0 fi nl -ba "$file" | sed -n '1,240p'

Length of output: 59

Make partition_values optional (or update all callers).

Multiple RecoveryInfo(...) constructions don’t pass partition_values; either change the model to add a default (e.g. partition_values: Optional[str] = None in cratedb_toolkit/admin/xmover/model.py at ~line 61) or update all call sites to pass the new field. Affected call sites found: tests/admin/test_recovery_monitor.py:71,135,152,219,237 and cratedb_toolkit/admin/xmover/util/database.py:448.

🤖 Prompt for AI Agents

In cratedb_toolkit/admin/xmover/model.py around line 61, the RecoveryInfo field partition_values is declared as Optional[str] but has no default causing callers that omit it to fail; update the model to make it truly optional by giving it a default (e.g. partition_values: Optional[str] = None) so existing callers continue to work, or alternatively update all call sites (tests and cratedb_toolkit/admin/xmover/util/database.py) to pass a value for partition_values.

coderabbitai · 2025-09-19T21:32:16Z

cratedb_toolkit/admin/xmover/operational/monitor.py

+                                is_recovering = any(
+                                    r.shard_id == shard["shard_id"]
+                                    and r.table_name == shard["table_name"]
+                                    and r.schema_name == shard["schema_name"]
+                                    for r in recoveries
+                                )
+                                if not is_recovering:
+                                    non_recovering_shards.append(shard)
+


⚠️ Potential issue

Include partition in “is_recovering” checks to avoid false matches across partitions.

table_name + shard_id is not unique for partitioned tables; this can hide problematic shards or misclassify them.

- is_recovering = any( - r.shard_id == shard["shard_id"] - and r.table_name == shard["table_name"] - and r.schema_name == shard["schema_name"] - for r in recoveries - ) + is_recovering = any( + r.shard_id == shard["shard_id"] + and r.table_name == shard["table_name"] + and r.schema_name == shard["schema_name"] + and (r.partition_values or None) == (shard.get("partition_values") or None) + for r in recoveries + )

Apply the same change at Lines 395-401 and 463-470.

Also applies to: 395-401, 463-470

🤖 Prompt for AI Agents

In cratedb_toolkit/admin/xmover/operational/monitor.py around lines 311-319, the is_recovering check only compares shard_id, table_name and schema_name which can produce false matches for partitioned tables; update the predicate to also compare partition (or partition_value/partition_id field used in shard dict and Recovery objects) so the match includes partition and prevents misclassification; make the identical change in the same logic at lines 395-401 and 463-470 so all three checks include partition in the equality conditions.

- More or less just line-length fixes. - Only a single type adjustment was needed on the return value of the `analyze_distribution` method. - Ruff recommended to use set comprehensions, so here we go. - At a single spot where an exception has been `pass`ed, we added error output. Is it bad?

The new command can find and cancel replica shards with large uncommitted translog sizes. It provides options to set size thresholds and optionally execute cancel commands after user confirmation.

…porting This commit adds comprehensive shard size analysis capabilities, including: - Size distribution bucketing and reporting - Early warning detection for oversized and undersized shards - Partition-aware shard status display - Integration with recovery monitoring - Improved problematic shard detection - Enhanced CLI output formatting The changes help identify shard sizing issues that could impact cluster performance and stability.

- Add max shard size tracking to size buckets - Add small shards analysis view to analyzer command

… shards query

amotl force-pushed the xmover-problematic-translogs branch from 373efef to 143ffbf Compare September 10, 2025 19:41

amotl commented Sep 10, 2025

View reviewed changes

amotl force-pushed the xmover branch 2 times, most recently from dac2a6d to cb05b4d Compare September 10, 2025 23:21

amotl force-pushed the xmover-problematic-translogs branch from 143ffbf to a48d9fc Compare September 17, 2025 16:49

amotl force-pushed the xmover branch 4 times, most recently from 21dd237 to 3b3942f Compare September 19, 2025 18:26

amotl force-pushed the xmover-problematic-translogs branch from d6a1753 to 0c12e4f Compare September 19, 2025 21:05

amotl marked this pull request as ready for review September 19, 2025 21:22

coderabbitai bot reviewed Sep 19, 2025

View reviewed changes

amotl marked this pull request as draft September 23, 2025 21:19

amotl force-pushed the xmover branch from 3b3942f to 18c7134 Compare September 23, 2025 21:24

WalBeh and others added 12 commits September 23, 2025 23:35

Admin/XMover: Add shard distribution analysis for (large) tables

c0ee048

Admin/XMover: Add module for active shard monitoring

a4f244a

Admin/XMover: Fix software tests

72042a4

Admin/XMover: Implement suggestions by CodeRabbit

4b21c79

Admin/XMover: Add problematic translog detection and cancellation

489dda5

The new command can find and cancel replica shards with large uncommitted translog sizes. It provides options to set size thresholds and optionally execute cancel commands after user confirmation.

Admin/XMover: Forward process output to user when CLI tests fail

5b427dd

Admin/XMover: Improve output for database connectivity test

9abf7df

Admin/XMover: Add missing subcommands to CLI software tests

4a55aa8

XMover/problematic-translogs: Fix software tests

135aa10

XMover/problematic-translogs: Large and small shards analysis/tracking

1a92c3f

- Add max shard size tracking to size buckets - Add small shards analysis view to analyzer command

XMover/problematic-translogs: Remove allocation info from problematic…

bc2a66b

… shards query

amotl force-pushed the xmover-problematic-translogs branch from 0c12e4f to bc2a66b Compare September 23, 2025 21:35

amotl changed the title ~~Admin/XMover: Add problematic translog detection and cancellation~~ Admin/XMover: Problematic translog detection and cancellation Sep 23, 2025

amotl mentioned this pull request Sep 23, 2025

Looking-glass utilities for exploring what's going on: CrateDB shard analysis and beyond #513

Open

amotl requested review from seut and WalBeh September 23, 2025 21:48

amotl marked this pull request as ready for review September 23, 2025 21:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Admin/XMover: Problematic translog detection and cancellation #533

Admin/XMover: Problematic translog detection and cancellation #533

Uh oh!

amotl commented Sep 10, 2025 •

edited

Loading

Uh oh!

coderabbitai bot commented Sep 10, 2025 •

edited

Loading

Review skipped

Uh oh!

amotl Sep 10, 2025 •

edited

Loading

Uh oh!

amotl Sep 19, 2025

Uh oh!

amotl commented Sep 19, 2025 •

edited

Loading

Uh oh!

amotl commented Sep 19, 2025

Uh oh!

coderabbitai bot commented Sep 19, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Sep 19, 2025

Uh oh!

coderabbitai bot Sep 19, 2025

Uh oh!

coderabbitai bot Sep 19, 2025

Uh oh!

Uh oh!

Admin/XMover: Problematic translog detection and cancellation #533

Are you sure you want to change the base?

Admin/XMover: Problematic translog detection and cancellation #533

Uh oh!

Conversation

amotl commented Sep 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

About

Install

Documentation

References

Uh oh!

coderabbitai bot commented Sep 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

amotl Sep 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

amotl Sep 19, 2025

Choose a reason for hiding this comment

Uh oh!

amotl commented Sep 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amotl commented Sep 19, 2025

Uh oh!

coderabbitai bot commented Sep 19, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 19, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 19, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 19, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

amotl commented Sep 10, 2025 •

edited

Loading

coderabbitai bot commented Sep 10, 2025 •

edited

Loading

amotl Sep 10, 2025 •

edited

Loading

amotl commented Sep 19, 2025 •

edited

Loading