Skip to content

Conversation

amotl
Copy link
Member

@amotl amotl commented Aug 19, 2025

About

@WalBeh contributed a powerful utility that uses CrateDB's system tables to find out about cluster imbalances related to shard number and shard size distribution across the whole cluster. Thanks!

After analysing the situation, the program presents solutions in form of SQL commands to bring the cluster into a more balanced state again.

Install

uv pip install --upgrade 'cratedb-toolkit @ git+https://github.com/crate/cratedb-toolkit.git@xmover'

Documentation

https://cratedb-toolkit--523.org.readthedocs.build/admin/xmover/

References

Copy link

coderabbitai bot commented Aug 19, 2025

Walkthrough

Adds XMover: a new administrative toolkit for CrateDB that analyzes shard distribution, recommends and optionally executes safe shard relocations, validates moves, monitors recoveries, exposes a Click CLI and script entrypoint, provides models/utilities/analysis/operational modules, extensive docs, and unit tests.

Changes

Cohort / File(s) Summary
Package & CLI Integration
CHANGES.md, pyproject.toml, cratedb_toolkit/cli.py, cratedb_toolkit/admin/xmover/__init__.py
Announce XMover in changelog; add rich dependency; add script entry point scripts.xmover = "cratedb_toolkit.admin.xmover.cli:main"; register top-level xmover CLI subcommand; add xmover package metadata.
Top-level CLI Implementation
cratedb_toolkit/admin/xmover/cli.py
New Click-based CLI wiring that initializes CrateDBClient, tests connectivity, and exposes subcommands (analyze, find-candidates, recommend, validate-move, test-connection, check-balance, shard-distribution, active-shards, zone-analysis, explain-error, monitor-recovery) with Rich output and interactive prompts.
Core Models
cratedb_toolkit/admin/xmover/model.py
New dataclasses and typed containers: NodeInfo, ShardInfo, RecoveryInfo, ActiveShardSnapshot, ActiveShardActivity, ShardRelocationRequest/Response (includes to_sql and safety_score), DistributionStats, SizeCriteria, ShardRelocationConstraints.
Database Client
cratedb_toolkit/admin/xmover/util/database.py
New CrateDBClient for HTTP /_sql access and helpers: execute_query, get_nodes_info, get_shards_info, get_shard_distribution_summary, test_connection, get_cluster_watermarks, get_active_recoveries, get_recovery_details, get_all_recovering_shards, get_active_shards_snapshot, and internal parsing helpers.
Formatting & Error Helpers
cratedb_toolkit/admin/xmover/util/format.py, cratedb_toolkit/admin/xmover/util/error.py
Add size/percentage/translog formatting helpers and explain_cratedb_error CLI helper (pattern-matching diagnostics) with a Rich Console instance.
Shard Analysis & Reporting
cratedb_toolkit/admin/xmover/analysis/shard.py, .../analysis/table.py, .../analysis/zone.py
Implement ShardAnalyzer, ShardReporter, DistributionAnalyzer, and ZoneReport: collection of cluster/table/zone distribution analysis, imbalance/anomaly detectors, move validation, decommission planning, and Rich rendering.
Operational Modules
cratedb_toolkit/admin/xmover/operational/candidates.py, .../operational/recommend.py, .../operational/monitor.py
CandidateFinder, ShardRelocationRecommender, and RecoveryMonitor: find candidate shards, generate/validate/optionally execute relocation plans with recovery-aware sequencing, and monitor ongoing recoveries (watch mode).
Attic / Skeleton
cratedb_toolkit/admin/xmover/attic.py
Commented skeleton for a decommission command (non-executable example/skeleton).
Documentation
doc/admin/index.md, doc/admin/xmover/handbook.md, doc/admin/xmover/index.md, doc/admin/xmover/queries.md, doc/admin/xmover/troubleshooting.md, doc/index.md
Add admin docs section and comprehensive XMover docs (handbook, index, query gallery, troubleshooting) and update docs toctree.
Tests
tests/admin/test_cli.py, tests/admin/test_active_shard_monitor.py, tests/admin/test_distribution_analyzer.py, tests/admin/test_recovery_monitor.py
Add unit/CLI tests for the xmover CLI and analysis/monitoring components, covering CLI command invocation, active-shard snapshots/activity, distribution analyzer behavior, and recovery monitoring parsing/formatting.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as xmover CLI
  participant Client as CrateDBClient
  participant Analyzer as ShardAnalyzer
  participant Recommender as ShardRelocationRecommender
  participant DB as CrateDB

  User->>CLI: xmover recommend [options]
  CLI->>Client: init + test_connection()
  CLI->>Analyzer: init(client)
  CLI->>Recommender: execute(constraints, auto_execute, validate, dry_run)
  Recommender->>Analyzer: generate_rebalancing_recommendations(constraints)
  Analyzer->>Client: get_nodes_info()/get_shards_info()
  Client-->>Analyzer: nodes, shards
  Analyzer-->>Recommender: recommendations
  alt auto_execute and not dry_run
    Recommender->>Client: execute_query(ALTER TABLE ... REROUTE MOVE SHARD ...)
    Client->>DB: POST /_sql
    DB-->>Client: result
    Client-->>Recommender: success/failure
    Recommender->>Recommender: _wait_for_recovery_capacity()
  else dry_run
    Recommender-->>CLI: render recommendations (no execution)
  end
  CLI-->>User: rendered output / SQL / status
Loading
sequenceDiagram
  autonumber
  actor User
  participant CLI as xmover monitor-recovery
  participant Client as CrateDBClient
  participant Monitor as RecoveryMonitor
  participant DB as CrateDB

  User->>CLI: xmover monitor-recovery [--watch]
  CLI->>Monitor: start(watch)
  loop refresh_interval (watch mode)
    Monitor->>Client: get_all_recovering_shards(filters)
    Client->>DB: query sys.allocations/sys.shards
    DB-->>Client: rows
    Client-->>Monitor: RecoveryInfo[]
    Monitor-->>CLI: formatted table + deltas
  end
  CLI-->>User: summary
Loading
sequenceDiagram
  autonumber
  actor User
  participant CLI as xmover validate-move
  participant Analyzer as ShardAnalyzer
  participant Client as CrateDBClient

  User->>CLI: xmover validate-move <schema.table> <shard> <from> <to>
  CLI->>Analyzer: init(client)
  CLI->>Analyzer: validate_move_safety(recommendation, max_disk_usage)
  Analyzer->>Client: lookups (nodes/shards)
  Client-->>Analyzer: details
  Analyzer-->>CLI: (is_safe, reason)
  CLI-->>User: verdict + SQL command
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • WalBeh
  • surister
  • hammerhead

Poem

I nudge the shards with twitchy care,
I count their hops and map each lair.
I wait for recoveries, then gently shove,
No data lost — just balance and love.
Thump, thump, I nudge them home — then eat my carrot. 🥕

✨ Finishing touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch xmover

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@amotl amotl force-pushed the xmover branch 2 times, most recently from 6a6b8f0 to 5068671 Compare August 21, 2025 12:15
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (14)
doc/admin/xmover/queries.md (5)

58-70: Align example output headers with the SELECT (zone alias and available_gb).

The example table shows attributes['zone'] and a different expression than the query (which aliases zone and available_gb). Update headers for consistency.

-```text
-+------------+--------------------+-----------------------------------------------+
-| name       | attributes['zone'] | (fs[1]['disks']['available'] / 1.073741824E9) |
-+------------+--------------------+-----------------------------------------------+
+```text
++------------+--------------------+---------------+
+| name       | zone               | available_gb |
++------------+--------------------+---------------+
 | data-hot-5 | us-west-2a         |                            142.3342628479004  |
 ...
-+------------+--------------------+-----------------------------------------------+
++------------+--------------------+---------------+

---

`103-129`: **CTE filters on incomplete recoveries; use fully recovered shards.**

Filter to completed shards and keep unit conversion as-is. Consider guarding division by zero.



```diff
 WITH shard_summary AS (
     SELECT
         node['name'] AS node_name,
         table_name,
         schema_name,
         CASE
             WHEN "primary" = true THEN 'PRIMARY'
             ELSE 'REPLICA'
         END AS shard_type,
         COUNT(*) AS shard_count,
-        SUM(size) / 1024^3 AS total_size_gb
+        SUM(size) / 1024^3 AS total_size_gb
     FROM sys.shards
-    WHERE table_name = 'orderffD'
+    WHERE table_name = 'orderffD'
         AND routing_state = 'STARTED'
-        AND recovery['files']['percent'] = 0
+        AND coalesce(recovery['files']['percent'], 100.0) >= 100.0
     GROUP BY node['name'], table_name, schema_name, "primary"
 )
 SELECT
     node_name,
     table_name,
     schema_name,
     shard_type,
     shard_count,
     ROUND(total_size_gb, 2) AS total_size_gb,
-    ROUND(total_size_gb / shard_count, 2) AS avg_shard_size_gb
+    ROUND(total_size_gb / NULLIF(shard_count, 0), 2) AS avg_shard_size_gb
 FROM shard_summary
 ORDER BY node_name, shard_type DESC, total_size_gb DESC;

132-157: Prefer healthy shards in the comprehensive distribution query.

Keep the caret (^) as exponentiation in CrateDB, but filter for fully recovered shards.

 WHERE s.table_name = 'your_table_name'  -- Replace with your specific table name
     AND s.routing_state = 'STARTED'
-    AND s.recovery['files']['percent'] = 0
+    AND coalesce(s.recovery['files']['percent'], 100.0) >= 100.0

158-177: Summary query should also filter to fully recovered shards.

Use the same recovery predicate here to avoid skew from in-flight data.

-    ROUND(SUM(s.size) / 1024^3, 2) AS total_size_gb,
-    ROUND(AVG(s.size) / 1024^3, 3) AS avg_shard_size_gb,
+    ROUND(SUM(s.size) / 1024^3, 2) AS total_size_gb,
+    ROUND(AVG(s.size) / 1024^3, 3) AS avg_shard_size_gb,
 ...
 WHERE s.table_name = 'orderffD'  -- Replace with your specific table name
     AND s.routing_state = 'STARTED'
-    AND s.recovery['files']['percent'] = 0
+    AND coalesce(s.recovery['files']['percent'], 100.0) >= 100.0

72-79: Fix heading grammar and recovery predicate (selects only healthy shards).

  • Heading: singular “node”, drop shouting “SHARDS”.
  • Predicate: percent = 0 excludes fully recovered shards; use 100 (or >= 100 with COALESCE).
-## List biggest SHARDS on a particular Nodes
+## List biggest shards on a particular node

 ```sql
-select node['name'], table_name, schema_name, id,  sum(size) / 1024^3 from sys.shards
+select node['name'], table_name, schema_name, id, sum(size) / 1024^3 as size_gb
+from sys.shards
     where node['name'] = 'data-hot-2'
     AND routing_state = 'STARTED'
-    AND recovery['files']['percent'] = 0
+    AND coalesce(recovery['files']['percent'], 100.0) >= 100.0
     group by 1,2,3,4  order by 5  desc limit 8;

</blockquote></details>
<details>
<summary>doc/admin/xmover/handbook.md (2)</summary><blockquote>

`475-482`: **Keep threshold consistent and align argument placeholders.**

- Threshold header says 85% while defaults are 90%.
- Use the same placeholders as elsewhere: FROM_NODE/TO_NODE.



```diff
-**Disk Usage Threshold (default: 85%)**
+**Disk Usage Threshold (default: 90%)**
 ...
-# For urgent space relief
-xmover validate-move <SCHEMA.TABLE> <SHARD_ID> <FROM> <TO> --max-disk-usage 95
+# For urgent space relief
+xmover validate-move <SCHEMA.TABLE> <SHARD_ID> <FROM_NODE> <TO_NODE> --max-disk-usage 95

131-136: Docs drift: --max-disk-usage default and missing --auto-execute.

The model default is 90.0% (see ShardRelocationConstraints.max_disk_usage), not 85%. Also document the --auto-execute flag used to actually execute the generated SQL in execution mode.

 - `--max-disk-usage`: Maximum disk usage percentage for target nodes (default: 85)
+ - `--max-disk-usage`: Maximum disk usage percentage for target nodes (default: 90)
  - `--validate/--no-validate`: Validate move safety (default: True)
  - `--prioritize-space/--prioritize-zones`: Prioritize available space over zone balancing (default: False)
- - `--dry-run/--execute`: Show what would be done without generating SQL commands (default: True)
+ - `--dry-run/--execute`: Show what would be done without generating SQL commands (default: True)
+ - `--auto-execute`: Automatically execute the SQL commands (requires `--execute`, asks for confirmation) (default: False)
  - `--node`: Only recommend moves from this specific source node (e.g., data-hot-4)

Run to verify the defaults in code:

#!/bin/bash
# Confirm the default for max_disk_usage and presence of auto-execute in CLI or model.
rg -nP 'class\s+ShardRelocationConstraints\b.*?max_disk_usage\s*:\s*float\s*=\s*([0-9.]+)' -U cratedb_toolkit/admin/xmover/model.py
rg -n 'auto[-_]execute' -n cratedb_toolkit/admin/xmover/cli.py || true
cratedb_toolkit/admin/xmover/operational/recommend.py (4)

326-334: Case-insensitive match for “zone conflict”.

This check is case-sensitive and can miss other capitalizations. Normalize to lower().

-                        if "Zone conflict" in safety_msg:
+                        if "zone conflict" in safety_msg.lower():

126-128: Interpolate variables in the “Check with” hint; include schema for safety.

Placeholders are printed literally. Use an f-string with schema_name/table_name and the actual shard id.

-            console.print(
-                "[dim]# Check with: SELECT * FROM sys.shards "
-                "WHERE table_name = '{table_name}' AND id = {shard_id};[/dim]"
-            )
+            console.print(
+                f"[dim]# Check with: SELECT * FROM sys.shards "
+                f"WHERE schema_name = '{schema_name}' AND table_name = '{table_name}' AND id = {request.shard_id};[/dim]"
+            )

341-344: Honor user-configured max disk usage during auto-execution; avoid hard-coded 95%.

auto-exec re-validates with 95% regardless of constraints, which can run moves the user intended to disallow.

-            if auto_execute:
-                self._execute_recommendations_safely(recommendations, validate)
+            if auto_execute:
+                self._execute_recommendations_safely(
+                    recommendations,
+                    validate,
+                    max_disk_usage_percent=constraints.max_disk_usage,
+                )
@@
-    def _execute_recommendations_safely(self, recommendations, validate: bool):
+    def _execute_recommendations_safely(self, recommendations, validate: bool, max_disk_usage_percent: float):
         """Execute recommendations with extensive safety measures"""
@@
-        if validate:
+        if validate:
             for rec in recommendations:
-                is_safe, safety_msg = self.analyzer.validate_move_safety(rec, max_disk_usage_percent=95.0)
+                is_safe, safety_msg = self.analyzer.validate_move_safety(
+                    rec, max_disk_usage_percent=max_disk_usage_percent
+                )

Also applies to: 355-367


426-435: Treat CrateDB “error” payloads as failures, not successes.

CrateDB can return HTTP 200 with an "error" field. Checking only rowcount can misreport failures.

-                if result.get("rowcount", 0) >= 0:  # Success indicator for ALTER statements
+                if isinstance(result, dict) and result.get("error"):
+                    console.print(f"    [red]❌ FAILED[/red] - {result['error'].get('message', result['error'])}")
+                    failed_moves += 1
+                elif result.get("rowcount", 0) >= 0:  # Success indicator for ALTER statements
                     console.print("    [green]✅ SUCCESS[/green] - Move initiated")
                     successful_moves += 1
                 else:
                     console.print(f"    [red]❌ FAILED[/red] - Unexpected result: {result}")
                     failed_moves += 1
cratedb_toolkit/admin/xmover/util/database.py (3)

147-151: Note: caret (^) is exponentiation in CrateDB; no change needed.

The use of 1024.0^3 is correct in CrateDB SQL (PostgreSQL semantics). Keeping as-is improves consistency with the rest of the docs.


21-41: Use a persistent requests.Session and surface CrateDB “error” fields.

Pooling improves performance, and CrateDB may return 200 with an error payload. Detect and raise on "error".

 class CrateDBClient:
@@
     def __init__(self, connection_string: Optional[str] = None):
         load_dotenv()
@@
+        # Reuse HTTP connections
+        self.session = requests.Session()
@@
-        try:
-            response = requests.post(
-                self.connection_string, json=payload, auth=auth, verify=self.ssl_verify, timeout=30
-            )
+        try:
+            response = self.session.post(
+                self.connection_string, json=payload, auth=auth, verify=self.ssl_verify, timeout=30
+            )
             response.raise_for_status()
-            return response.json()
+            data = response.json()
+            if isinstance(data, dict) and data.get("error"):
+                err = data["error"]
+                msg = err.get("message") if isinstance(err, dict) else str(err)
+                raise Exception(f"CrateDB error: {msg}")
+            return data

Also applies to: 53-60


320-345: Include schema in recovery-details filter to disambiguate shards.

Filtering only by table_name and id can select the wrong shard when names are duplicated across schemas.

         FROM sys.shards s
-        WHERE s.table_name = ? AND s.id = ?
+        WHERE s.schema_name = ? AND s.table_name = ? AND s.id = ?
@@
-        result = self.execute_query(query, [table_name, shard_id])
+        result = self.execute_query(query, [schema_name, table_name, shard_id])
🧹 Nitpick comments (4)
doc/admin/xmover/index.md (1)

14-14: Grammar: add missing article (“the largest tables”).

“across largest tables” reads awkwardly; add “the” for clarity.

- - **Shard Distribution Analysis**: Detect and rank distribution anomalies across largest tables
+ - **Shard Distribution Analysis**: Detect and rank distribution anomalies across the largest tables
doc/admin/xmover/queries.md (1)

6-21: Split SQL and sample output into separate code fences; add column aliases.

Mixing SQL and tabular output in one fenced block reduces readability and breaks syntax highlighting. Also, alias computed columns so headers match the output.

 ```sql
-select node['name'], sum(size) / 1024^3, count(id)  from sys.shards  group by 1  order by 1 asc;
-+--------------+-----------------------------+-----------+
-| node['name'] | (sum(size) / 1.073741824E9) | count(id) |
-+--------------+-----------------------------+-----------+
+select
+  node['name'],
+  sum(size) / 1024^3 as total_size_gb,
+  count(id) as shard_count
+from sys.shards
+group by 1
+order by 1 asc;
+```
+```text
++--------------+---------------+-------------+
+| node['name'] | total_size_gb | shard_count |
++--------------+---------------+-------------+
 | data-hot-0   |          1862.5866614403203 |       680 |
 | data-hot-1   |          1866.0331328986213 |       684 |
 | data-hot-2   |          1856.6581886671484 |      1043 |
 | data-hot-3   |          1208.932889252901  |       477 |
 | data-hot-4   |          1861.7727940855548 |       674 |
 | data-hot-5   |          1863.4315695902333 |       744 |
 | data-hot-6   |          1851.3522544233128 |       948 |
 | NULL         |             0.0             |        35 |
-+--------------+-----------------------------+-----------+
++--------------+---------------+-------------+
 SELECT 8 rows in set (0.061 sec)

</blockquote></details>
<details>
<summary>doc/admin/xmover/handbook.md (1)</summary><blockquote>

`59-61`: **Grammar: “dedicating” → “dedicated to”.**

Small wording fix for readability.


```diff
-This view is dedicating a specific focus on large tables.
+This view is dedicated to a specific focus on large tables.
cratedb_toolkit/admin/xmover/analysis/table.py (1)

22-33: Enhance precision for storage size edge cases

The function correctly formats sizes, but for sub-byte values (< 0.001 GB), it displays "0 B" which may mask very small but non-zero values.

Consider preserving precision for very small values:

 def format_storage_size(size_gb: float) -> str:
     """Format storage size with appropriate units and spacing"""
     if size_gb < 0.001:
-        return "0 B"
+        size_bytes = size_gb * 1024 * 1024 * 1024
+        if size_bytes < 1:
+            return f"{size_bytes:.2e} B"
+        else:
+            return f"{size_bytes:.0f} B"
     elif size_gb < 1.0:
         size_mb = size_gb * 1024
         return f"{size_mb:.0f} MB"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 108577c and 5068671.

📒 Files selected for processing (9)
  • cratedb_toolkit/admin/xmover/analysis/table.py (1 hunks)
  • cratedb_toolkit/admin/xmover/cli.py (1 hunks)
  • cratedb_toolkit/admin/xmover/operational/monitor.py (1 hunks)
  • cratedb_toolkit/admin/xmover/operational/recommend.py (1 hunks)
  • cratedb_toolkit/admin/xmover/util/database.py (1 hunks)
  • doc/admin/xmover/handbook.md (1 hunks)
  • doc/admin/xmover/index.md (1 hunks)
  • doc/admin/xmover/queries.md (1 hunks)
  • tests/admin/test_cli.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • cratedb_toolkit/admin/xmover/operational/monitor.py
  • tests/admin/test_cli.py
  • cratedb_toolkit/admin/xmover/cli.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-20T09:20:33.795Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/handbook.md:462-469
Timestamp: 2025-08-20T09:20:33.795Z
Learning: In large documentation files with many existing headings, be less strict about flagging emphasis-as-heading (MD036) issues, as emphasis can serve valid purposes for minor subsections and clarifications rather than requiring full heading treatment.

Applied to files:

  • doc/admin/xmover/handbook.md
📚 Learning: 2025-08-20T09:22:09.511Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/queries.md:6-21
Timestamp: 2025-08-20T09:22:09.511Z
Learning: In CrateDB SQL, the caret operator (^) performs exponentiation, following PostgreSQL conventions. Expressions like 1024^3 are valid and correct for converting bytes to GiB in CrateDB queries. Do not flag this as incorrect.

Applied to files:

  • doc/admin/xmover/queries.md
📚 Learning: 2025-08-20T09:22:09.511Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/queries.md:6-21
Timestamp: 2025-08-20T09:22:09.511Z
Learning: In CrateDB SQL, the caret operator (^) performs exponentiation, similar to PostgreSQL. Expressions like 1024^3 are valid and correct for converting bytes to GiB in CrateDB queries.

Applied to files:

  • doc/admin/xmover/queries.md
🧬 Code graph analysis (3)
cratedb_toolkit/admin/xmover/util/database.py (2)
cratedb_toolkit/admin/xmover/model.py (5)
  • NodeInfo (7-29)
  • RecoveryInfo (55-102)
  • ShardInfo (33-51)
  • size_gb (81-83)
  • overall_progress (76-78)
cratedb_toolkit/admin/xmover/cli.py (1)
  • test_connection (153-175)
cratedb_toolkit/admin/xmover/operational/recommend.py (5)
cratedb_toolkit/admin/xmover/analysis/shard.py (2)
  • validate_move_safety (373-401)
  • generate_rebalancing_recommendations (191-371)
cratedb_toolkit/admin/xmover/model.py (8)
  • ShardRelocationConstraints (177-186)
  • ShardRelocationRequest (106-113)
  • ShardRelocationResponse (117-153)
  • shard_type (50-51)
  • shard_type (86-87)
  • size_gb (81-83)
  • to_sql (131-137)
  • available_space_gb (28-29)
cratedb_toolkit/admin/xmover/operational/monitor.py (3)
  • RecoveryMonitor (24-384)
  • RecoveryOptions (16-21)
  • get_cluster_recovery_status (31-43)
cratedb_toolkit/admin/xmover/util/database.py (2)
  • CrateDBClient (18-498)
  • execute_query (42-60)
cratedb_toolkit/admin/xmover/util/format.py (1)
  • format_size (1-8)
cratedb_toolkit/admin/xmover/analysis/table.py (1)
cratedb_toolkit/admin/xmover/util/database.py (2)
  • execute_query (42-60)
  • get_nodes_info (62-96)
🪛 LanguageTool
doc/admin/xmover/index.md

[grammar] ~8-~8: There might be a mistake here.
Context: ...nd availability zones. It generates safe SQL commands for shard rebalancing and n...

(QB_NEW_EN)


[grammar] ~13-~13: There might be a mistake here.
Context: ...hard distribution across nodes and zones - Shard Distribution Analysis: Detect an...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ...ribution anomalies across largest tables - Shard Movement Recommendations: Intell...

(QB_NEW_EN)


[grammar] ~15-~15: There might be a mistake here.
Context: ...s for rebalancing with safety validation - Recovery Monitoring: Track ongoing sha...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...ecovery operations with progress details - Zone Conflict Detection: Prevents move...

(QB_NEW_EN)


[grammar] ~17-~17: There might be a mistake here.
Context: ...t would violate CrateDB's zone awareness - Node Decommissioning: Plan safe node r...

(QB_NEW_EN)


[grammar] ~18-~18: There might be a mistake here.
Context: ... removal with automated shard relocation - Dry Run Mode: Test recommendations wit...

(QB_NEW_EN)


[grammar] ~19-~19: There might be a mistake here.
Context: ...s without generating actual SQL commands - Safety Validation: Comprehensive check...

(QB_NEW_EN)

doc/admin/xmover/handbook.md

[grammar] ~1-~1: There might be a mistake here.
Context: (xmover-handbook)= # XMover Handbook ## Installation Instal...

(QB_NEW_EN)


[grammar] ~59-~59: There might be a mistake here.
Context: ...ace ``` ### Shard Distribution Analysis This view is dedicating a specific focus...

(QB_NEW_EN)


[grammar] ~95-~95: There might be a mistake here.
Context: ...tion across nodes and zones. Options: - --table, -t: Analyze specific table only **Example...

(QB_NEW_EN)


[grammar] ~106-~106: There might be a mistake here.
Context: ...on size and health criteria. Options: - --table, -t: Find candidates in specific table only...

(QB_NEW_EN)


[grammar] ~107-~107: There might be a mistake here.
Context: ...: Find candidates in specific table only - --min-size: Minimum shard size in GB (default: 40)...

(QB_NEW_EN)


[grammar] ~108-~108: There might be a mistake here.
Context: ...: Minimum shard size in GB (default: 40) - --max-size: Maximum shard size in GB (default: 60)...

(QB_NEW_EN)


[grammar] ~109-~109: There might be a mistake here.
Context: ...: Maximum shard size in GB (default: 60) - --node: Only show candidates from this specifi...

(QB_NEW_EN)


[grammar] ~124-~124: There might be a mistake here.
Context: ...ons for cluster rebalancing. Options: - --table, -t: Generate recommendations for specific ...

(QB_NEW_EN)


[grammar] ~125-~125: There might be a mistake here.
Context: ... recommendations for specific table only - --min-size: Minimum shard size in GB (default: 40)...

(QB_NEW_EN)


[grammar] ~126-~126: There might be a mistake here.
Context: ...: Minimum shard size in GB (default: 40) - --max-size: Maximum shard size in GB (default: 60)...

(QB_NEW_EN)


[grammar] ~127-~127: There might be a mistake here.
Context: ...: Maximum shard size in GB (default: 60) - --zone-tolerance: Zone balance tolerance percentage (def...

(QB_NEW_EN)


[grammar] ~128-~128: There might be a mistake here.
Context: ...lance tolerance percentage (default: 10) - --min-free-space: Minimum free space required on target ...

(QB_NEW_EN)


[grammar] ~129-~129: There might be a mistake here.
Context: ...red on target nodes in GB (default: 100) - --max-moves: Maximum number of move recommendations...

(QB_NEW_EN)


[grammar] ~130-~130: There might be a mistake here.
Context: ...er of move recommendations (default: 10) - --max-disk-usage: Maximum disk usage percentage for targ...

(QB_NEW_EN)


[grammar] ~131-~131: There might be a mistake here.
Context: ...ercentage for target nodes (default: 85) - --validate/--no-validate: Validate move safety (default: True) -...

(QB_NEW_EN)


[grammar] ~132-~132: There might be a mistake here.
Context: ...e: Validate move safety (default: True) - --prioritize-space/--prioritize-zones`: Prioritize available space over zone b...

(QB_NEW_EN)


[grammar] ~133-~133: There might be a mistake here.
Context: ...ace over zone balancing (default: False) - --dry-run/--execute: Show what would be done without genera...

(QB_NEW_EN)


[grammar] ~134-~134: There might be a mistake here.
Context: ... generating SQL commands (default: True) - --node: Only recommend moves from this specifi...

(QB_NEW_EN)


[grammar] ~158-~158: There might be a mistake here.
Context: ...ion and potential conflicts. Options: - --table, -t: Analyze zones for specific table only ...

(QB_NEW_EN)


[grammar] ~159-~159: There might be a mistake here.
Context: ...: Analyze zones for specific table only - --show-shards/--no-show-shards`: Show individual shard details (default...

(QB_NEW_EN)


[grammar] ~170-~170: There might be a mistake here.
Context: ...with configurable tolerance. Options: - --table, -t: Check balance for specific table only ...

(QB_NEW_EN)


[grammar] ~171-~171: There might be a mistake here.
Context: ...: Check balance for specific table only - --tolerance`: Zone balance tolerance percentage (def...

(QB_NEW_EN)


[grammar] ~184-~184: There might be a mistake here.
Context: ...ecution to prevent errors. Arguments: - SCHEMA_TABLE: Schema and table name (format: schema....

(QB_NEW_EN)


[grammar] ~185-~185: There might be a mistake here.
Context: ...ma and table name (format: schema.table) - SHARD_ID: Shard ID to move - FROM_NODE: Source...

(QB_NEW_EN)


[grammar] ~186-~186: There might be a mistake here.
Context: ...ma.table) - SHARD_ID: Shard ID to move - FROM_NODE: Source node name - TO_NODE: Target n...

(QB_NEW_EN)


[grammar] ~187-~187: There might be a mistake here.
Context: ... to move - FROM_NODE: Source node name - TO_NODE: Target node name Examples: ```bas...

(QB_NEW_EN)


[grammar] ~202-~202: There might be a mistake here.
Context: ... troubleshooting guidance. Arguments: - ERROR_MESSAGE: The CrateDB error message to analyze (...

(QB_NEW_EN)


[grammar] ~217-~217: There might be a mistake here.
Context: ...y operations on the cluster. Options: - --table, -t: Monitor recovery for specific table on...

(QB_NEW_EN)


[grammar] ~218-~218: There might be a mistake here.
Context: ...Monitor recovery for specific table only - --node, -n: Monitor recovery on specific node only...

(QB_NEW_EN)


[grammar] ~219-~219: There might be a mistake here.
Context: ...: Monitor recovery on specific node only - --watch, -w: Continuously monitor (refresh every 10...

(QB_NEW_EN)


[grammar] ~220-~220: There might be a mistake here.
Context: ...Continuously monitor (refresh every 10s) - --refresh-interval: Refresh interval for watch mode in sec...

(QB_NEW_EN)


[grammar] ~221-~221: There might be a mistake here.
Context: ... for watch mode in seconds (default: 10) - --recovery-type: Filter by recovery type - PEER, DISK, ...

(QB_NEW_EN)


[grammar] ~222-~222: There might be a mistake here.
Context: ...type - PEER, DISK, or all (default: all) - --include-transitioning: Include recently completed recoveries ...

(QB_NEW_EN)


[grammar] ~243-~243: There might be a mistake here.
Context: ...ude-transitioning ``` Recovery Types: - PEER: Copying shard data from another ...

(QB_NEW_EN)


[grammar] ~244-~244: There might be a mistake here.
Context: ...om another node (replication/relocation) - DISK: Rebuilding shard from local data...

(QB_NEW_EN)


[grammar] ~377-~377: There might be a mistake here.
Context: ...TRING: CrateDB HTTP endpoint (required) - CRATE_USERNAME`: Username for authentication (optional)...

(QB_NEW_EN)


[grammar] ~378-~378: There might be a mistake here.
Context: ...: Username for authentication (optional) - CRATE_PASSWORD: Password for authentication (optional)...

(QB_NEW_EN)


[grammar] ~379-~379: There might be a mistake here.
Context: ...: Password for authentication (optional) - CRATE_SSL_VERIFY: Enable SSL certificate verification (d...

(QB_NEW_EN)


[grammar] ~434-~434: There might be a mistake here.
Context: ...zone awareness prevents too many copies in same zone - Solution: Move shard...

(QB_NEW_EN)


[grammar] ~435-~435: There might be a mistake here.
Context: ... copies in same zone - Solution: Move shard to a different availability zone ...

(QB_NEW_EN)


[grammar] ~443-~443: There might be a mistake here.
Context: ...ufficient free space - Solution: Choose node with more capacity or free up spac...

(QB_NEW_EN)


[grammar] ~454-~454: There might be a mistake here.
Context: ...pace` 5. No Recommendations Generated - Cause: Cluster may already be well bal...

(QB_NEW_EN)


[grammar] ~484-~484: There might be a mistake here.
Context: ...age 95 ``` When to Adjust Thresholds: - Emergency situations: Increase to 90-9...

(QB_NEW_EN)


[style] ~485-~485: ‘Emergency situations’ might be wordy. Consider a shorter alternative.
Context: ...``` When to Adjust Thresholds: - Emergency situations: Increase to 90-95% for critical spac...

(EN_WORDINESS_PREMIUM_EMERGENCY_SITUATIONS)


[grammar] ~487-~487: There might be a mistake here.
Context: ...onments**: Can be more aggressive (90%+) - Production: Keep conservative (80-85%)...

(QB_NEW_EN)


[grammar] ~492-~492: There might be a mistake here.
Context: ...f}xmover-troubleshooting which covers: - Step-by-step diagnostic procedures - Eme...

(QB_NEW_EN)


[grammar] ~493-~493: There might be a mistake here.
Context: ...rs: - Step-by-step diagnostic procedures - Emergency recovery procedures - Best pra...

(QB_NEW_EN)


[grammar] ~494-~494: There might be a mistake here.
Context: ...ocedures - Emergency recovery procedures - Best practices for safe operations - Com...

(QB_NEW_EN)


[grammar] ~495-~495: There might be a mistake here.
Context: ...res - Best practices for safe operations - Complete error reference guide ### Debu...

(QB_NEW_EN)

doc/admin/xmover/queries.md

[grammar] ~1-~1: There might be a mistake here.
Context: (xmover-queries)= # XMover Query Gallery ## Shard Distribut...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
doc/admin/xmover/handbook.md

220-220: Trailing punctuation in heading
Punctuation: '.'

(MD026, no-trailing-punctuation)


249-249: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build-and-publish
  • GitHub Check: build-and-publish
  • GitHub Check: Cloud: Python 3.13 on OS ubuntu-latest
🔇 Additional comments (6)
cratedb_toolkit/admin/xmover/analysis/table.py (6)

121-140: Well-structured CrateDB query with proper aggregation

The query correctly aggregates shard data and performs the necessary calculations for distribution analysis.


249-251: Robust coefficient of variation calculations

Good use of the helper method for calculating CV values for multiple metrics.


432-445: Solid statistical calculations with proper error handling

The coefficient of variation calculation properly handles edge cases including zero mean and insufficient data points.


654-676: Clean separation of concerns in anomaly detection

Excellent design pattern using separate detector methods for different anomaly types, making the code modular and maintainable.


637-638: Potential division by zero in document imbalance detection

When min_docs is 0, the condition max_docs > min_docs * 3 would not work as intended.

Add protection against zero document counts:

-        if max_docs > min_docs * 3:  # If difference is > 3x
+        if min_docs > 0 and max_docs > min_docs * 3:  # If difference is > 3x
             recommendations.append(f"Significant document skew detected ({min_docs:,} to {max_docs:,} docs per node)")
+        elif min_docs == 0 and max_docs > 0:
+            recommendations.append(f"Some nodes have no documents while others have {max_docs:,} - check shard routing")

Likely an incorrect or invalid review comment.


532-546: Potential division by zero in storage imbalance detection

When min_size is 0 (which can happen for nodes with no shards), the comparison max_size > min_size * 2 would evaluate incorrectly.

Add a check to prevent issues with zero minimum size:

         if max_size > min_size * 2:  # If difference is > 2x
+            # Skip if min_size is effectively zero
+            if min_size < 0.001:
+                recommendations.append(
+                    f"Some nodes have no storage - consider distributing shards more evenly"
+                )
+            else:
                 overloaded_node = None
                 underloaded_node = None
 
                 for node, data in table.node_distributions.items():
                     if data["total_size_gb"] == max_size:
                         overloaded_node = node
                     elif data["total_size_gb"] == min_size:
                         underloaded_node = node
 
                 if overloaded_node and underloaded_node:
                     recommendations.append(
                         f"Rebalance storage from {overloaded_node} ({format_storage_size(max_size)}) "
                         f"to {underloaded_node} ({format_storage_size(min_size)})"
                     )

Likely an incorrect or invalid review comment.

coderabbitai[bot]

This comment was marked as resolved.

WalBeh and others added 4 commits September 11, 2025 01:21
- More or less just line-length fixes.
- Only a single type adjustment was needed on the return value of the
  `analyze_distribution` method.
- Ruff recommended to use set comprehensions, so here we go.
- At a single spot where an exception has been `pass`ed, we added
  error output. Is it bad?
@amotl
Copy link
Member Author

amotl commented Sep 19, 2025

Thank you, @WalBeh!

This comment was marked as spam.

This comment was marked as off-topic.

@amotl amotl requested review from seut and WalBeh September 19, 2025 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants