-
Notifications
You must be signed in to change notification settings - Fork 4
Admin/XMover: Add CrateDB shard analyzer and movement tool #523
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughAdds XMover: a new administrative toolkit for CrateDB that analyzes shard distribution, recommends and optionally executes safe shard relocations, validates moves, monitors recoveries, exposes a Click CLI and script entrypoint, provides models/utilities/analysis/operational modules, extensive docs, and unit tests. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User
participant CLI as xmover CLI
participant Client as CrateDBClient
participant Analyzer as ShardAnalyzer
participant Recommender as ShardRelocationRecommender
participant DB as CrateDB
User->>CLI: xmover recommend [options]
CLI->>Client: init + test_connection()
CLI->>Analyzer: init(client)
CLI->>Recommender: execute(constraints, auto_execute, validate, dry_run)
Recommender->>Analyzer: generate_rebalancing_recommendations(constraints)
Analyzer->>Client: get_nodes_info()/get_shards_info()
Client-->>Analyzer: nodes, shards
Analyzer-->>Recommender: recommendations
alt auto_execute and not dry_run
Recommender->>Client: execute_query(ALTER TABLE ... REROUTE MOVE SHARD ...)
Client->>DB: POST /_sql
DB-->>Client: result
Client-->>Recommender: success/failure
Recommender->>Recommender: _wait_for_recovery_capacity()
else dry_run
Recommender-->>CLI: render recommendations (no execution)
end
CLI-->>User: rendered output / SQL / status
sequenceDiagram
autonumber
actor User
participant CLI as xmover monitor-recovery
participant Client as CrateDBClient
participant Monitor as RecoveryMonitor
participant DB as CrateDB
User->>CLI: xmover monitor-recovery [--watch]
CLI->>Monitor: start(watch)
loop refresh_interval (watch mode)
Monitor->>Client: get_all_recovering_shards(filters)
Client->>DB: query sys.allocations/sys.shards
DB-->>Client: rows
Client-->>Monitor: RecoveryInfo[]
Monitor-->>CLI: formatted table + deltas
end
CLI-->>User: summary
sequenceDiagram
autonumber
actor User
participant CLI as xmover validate-move
participant Analyzer as ShardAnalyzer
participant Client as CrateDBClient
User->>CLI: xmover validate-move <schema.table> <shard> <from> <to>
CLI->>Analyzer: init(client)
CLI->>Analyzer: validate_move_safety(recommendation, max_disk_usage)
Analyzer->>Client: lookups (nodes/shards)
Client-->>Analyzer: details
Analyzer-->>CLI: (is_safe, reason)
CLI-->>User: verdict + SQL command
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
Poem
✨ Finishing touches🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
6a6b8f0
to
5068671
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
♻️ Duplicate comments (14)
doc/admin/xmover/queries.md (5)
58-70
: Align example output headers with the SELECT (zone alias and available_gb).The example table shows attributes['zone'] and a different expression than the query (which aliases zone and available_gb). Update headers for consistency.
-```text -+------------+--------------------+-----------------------------------------------+ -| name | attributes['zone'] | (fs[1]['disks']['available'] / 1.073741824E9) | -+------------+--------------------+-----------------------------------------------+ +```text ++------------+--------------------+---------------+ +| name | zone | available_gb | ++------------+--------------------+---------------+ | data-hot-5 | us-west-2a | 142.3342628479004 | ... -+------------+--------------------+-----------------------------------------------+ ++------------+--------------------+---------------+--- `103-129`: **CTE filters on incomplete recoveries; use fully recovered shards.** Filter to completed shards and keep unit conversion as-is. Consider guarding division by zero. ```diff WITH shard_summary AS ( SELECT node['name'] AS node_name, table_name, schema_name, CASE WHEN "primary" = true THEN 'PRIMARY' ELSE 'REPLICA' END AS shard_type, COUNT(*) AS shard_count, - SUM(size) / 1024^3 AS total_size_gb + SUM(size) / 1024^3 AS total_size_gb FROM sys.shards - WHERE table_name = 'orderffD' + WHERE table_name = 'orderffD' AND routing_state = 'STARTED' - AND recovery['files']['percent'] = 0 + AND coalesce(recovery['files']['percent'], 100.0) >= 100.0 GROUP BY node['name'], table_name, schema_name, "primary" ) SELECT node_name, table_name, schema_name, shard_type, shard_count, ROUND(total_size_gb, 2) AS total_size_gb, - ROUND(total_size_gb / shard_count, 2) AS avg_shard_size_gb + ROUND(total_size_gb / NULLIF(shard_count, 0), 2) AS avg_shard_size_gb FROM shard_summary ORDER BY node_name, shard_type DESC, total_size_gb DESC;
132-157
: Prefer healthy shards in the comprehensive distribution query.Keep the caret (^) as exponentiation in CrateDB, but filter for fully recovered shards.
WHERE s.table_name = 'your_table_name' -- Replace with your specific table name AND s.routing_state = 'STARTED' - AND s.recovery['files']['percent'] = 0 + AND coalesce(s.recovery['files']['percent'], 100.0) >= 100.0
158-177
: Summary query should also filter to fully recovered shards.Use the same recovery predicate here to avoid skew from in-flight data.
- ROUND(SUM(s.size) / 1024^3, 2) AS total_size_gb, - ROUND(AVG(s.size) / 1024^3, 3) AS avg_shard_size_gb, + ROUND(SUM(s.size) / 1024^3, 2) AS total_size_gb, + ROUND(AVG(s.size) / 1024^3, 3) AS avg_shard_size_gb, ... WHERE s.table_name = 'orderffD' -- Replace with your specific table name AND s.routing_state = 'STARTED' - AND s.recovery['files']['percent'] = 0 + AND coalesce(s.recovery['files']['percent'], 100.0) >= 100.0
72-79
: Fix heading grammar and recovery predicate (selects only healthy shards).
- Heading: singular “node”, drop shouting “SHARDS”.
- Predicate: percent = 0 excludes fully recovered shards; use 100 (or >= 100 with COALESCE).
-## List biggest SHARDS on a particular Nodes +## List biggest shards on a particular node ```sql -select node['name'], table_name, schema_name, id, sum(size) / 1024^3 from sys.shards +select node['name'], table_name, schema_name, id, sum(size) / 1024^3 as size_gb +from sys.shards where node['name'] = 'data-hot-2' AND routing_state = 'STARTED' - AND recovery['files']['percent'] = 0 + AND coalesce(recovery['files']['percent'], 100.0) >= 100.0 group by 1,2,3,4 order by 5 desc limit 8;</blockquote></details> <details> <summary>doc/admin/xmover/handbook.md (2)</summary><blockquote> `475-482`: **Keep threshold consistent and align argument placeholders.** - Threshold header says 85% while defaults are 90%. - Use the same placeholders as elsewhere: FROM_NODE/TO_NODE. ```diff -**Disk Usage Threshold (default: 85%)** +**Disk Usage Threshold (default: 90%)** ... -# For urgent space relief -xmover validate-move <SCHEMA.TABLE> <SHARD_ID> <FROM> <TO> --max-disk-usage 95 +# For urgent space relief +xmover validate-move <SCHEMA.TABLE> <SHARD_ID> <FROM_NODE> <TO_NODE> --max-disk-usage 95
131-136
: Docs drift: --max-disk-usage default and missing --auto-execute.The model default is 90.0% (see ShardRelocationConstraints.max_disk_usage), not 85%. Also document the --auto-execute flag used to actually execute the generated SQL in execution mode.
- `--max-disk-usage`: Maximum disk usage percentage for target nodes (default: 85) + - `--max-disk-usage`: Maximum disk usage percentage for target nodes (default: 90) - `--validate/--no-validate`: Validate move safety (default: True) - `--prioritize-space/--prioritize-zones`: Prioritize available space over zone balancing (default: False) - - `--dry-run/--execute`: Show what would be done without generating SQL commands (default: True) + - `--dry-run/--execute`: Show what would be done without generating SQL commands (default: True) + - `--auto-execute`: Automatically execute the SQL commands (requires `--execute`, asks for confirmation) (default: False) - `--node`: Only recommend moves from this specific source node (e.g., data-hot-4)Run to verify the defaults in code:
#!/bin/bash # Confirm the default for max_disk_usage and presence of auto-execute in CLI or model. rg -nP 'class\s+ShardRelocationConstraints\b.*?max_disk_usage\s*:\s*float\s*=\s*([0-9.]+)' -U cratedb_toolkit/admin/xmover/model.py rg -n 'auto[-_]execute' -n cratedb_toolkit/admin/xmover/cli.py || truecratedb_toolkit/admin/xmover/operational/recommend.py (4)
326-334
: Case-insensitive match for “zone conflict”.This check is case-sensitive and can miss other capitalizations. Normalize to lower().
- if "Zone conflict" in safety_msg: + if "zone conflict" in safety_msg.lower():
126-128
: Interpolate variables in the “Check with” hint; include schema for safety.Placeholders are printed literally. Use an f-string with schema_name/table_name and the actual shard id.
- console.print( - "[dim]# Check with: SELECT * FROM sys.shards " - "WHERE table_name = '{table_name}' AND id = {shard_id};[/dim]" - ) + console.print( + f"[dim]# Check with: SELECT * FROM sys.shards " + f"WHERE schema_name = '{schema_name}' AND table_name = '{table_name}' AND id = {request.shard_id};[/dim]" + )
341-344
: Honor user-configured max disk usage during auto-execution; avoid hard-coded 95%.auto-exec re-validates with 95% regardless of constraints, which can run moves the user intended to disallow.
- if auto_execute: - self._execute_recommendations_safely(recommendations, validate) + if auto_execute: + self._execute_recommendations_safely( + recommendations, + validate, + max_disk_usage_percent=constraints.max_disk_usage, + ) @@ - def _execute_recommendations_safely(self, recommendations, validate: bool): + def _execute_recommendations_safely(self, recommendations, validate: bool, max_disk_usage_percent: float): """Execute recommendations with extensive safety measures""" @@ - if validate: + if validate: for rec in recommendations: - is_safe, safety_msg = self.analyzer.validate_move_safety(rec, max_disk_usage_percent=95.0) + is_safe, safety_msg = self.analyzer.validate_move_safety( + rec, max_disk_usage_percent=max_disk_usage_percent + )Also applies to: 355-367
426-435
: Treat CrateDB “error” payloads as failures, not successes.CrateDB can return HTTP 200 with an "error" field. Checking only rowcount can misreport failures.
- if result.get("rowcount", 0) >= 0: # Success indicator for ALTER statements + if isinstance(result, dict) and result.get("error"): + console.print(f" [red]❌ FAILED[/red] - {result['error'].get('message', result['error'])}") + failed_moves += 1 + elif result.get("rowcount", 0) >= 0: # Success indicator for ALTER statements console.print(" [green]✅ SUCCESS[/green] - Move initiated") successful_moves += 1 else: console.print(f" [red]❌ FAILED[/red] - Unexpected result: {result}") failed_moves += 1cratedb_toolkit/admin/xmover/util/database.py (3)
147-151
: Note: caret (^) is exponentiation in CrateDB; no change needed.The use of
1024.0^3
is correct in CrateDB SQL (PostgreSQL semantics). Keeping as-is improves consistency with the rest of the docs.
21-41
: Use a persistent requests.Session and surface CrateDB “error” fields.Pooling improves performance, and CrateDB may return 200 with an error payload. Detect and raise on
"error"
.class CrateDBClient: @@ def __init__(self, connection_string: Optional[str] = None): load_dotenv() @@ + # Reuse HTTP connections + self.session = requests.Session() @@ - try: - response = requests.post( - self.connection_string, json=payload, auth=auth, verify=self.ssl_verify, timeout=30 - ) + try: + response = self.session.post( + self.connection_string, json=payload, auth=auth, verify=self.ssl_verify, timeout=30 + ) response.raise_for_status() - return response.json() + data = response.json() + if isinstance(data, dict) and data.get("error"): + err = data["error"] + msg = err.get("message") if isinstance(err, dict) else str(err) + raise Exception(f"CrateDB error: {msg}") + return dataAlso applies to: 53-60
320-345
: Include schema in recovery-details filter to disambiguate shards.Filtering only by table_name and id can select the wrong shard when names are duplicated across schemas.
FROM sys.shards s - WHERE s.table_name = ? AND s.id = ? + WHERE s.schema_name = ? AND s.table_name = ? AND s.id = ? @@ - result = self.execute_query(query, [table_name, shard_id]) + result = self.execute_query(query, [schema_name, table_name, shard_id])
🧹 Nitpick comments (4)
doc/admin/xmover/index.md (1)
14-14
: Grammar: add missing article (“the largest tables”).“across largest tables” reads awkwardly; add “the” for clarity.
- - **Shard Distribution Analysis**: Detect and rank distribution anomalies across largest tables + - **Shard Distribution Analysis**: Detect and rank distribution anomalies across the largest tablesdoc/admin/xmover/queries.md (1)
6-21
: Split SQL and sample output into separate code fences; add column aliases.Mixing SQL and tabular output in one fenced block reduces readability and breaks syntax highlighting. Also, alias computed columns so headers match the output.
```sql -select node['name'], sum(size) / 1024^3, count(id) from sys.shards group by 1 order by 1 asc; -+--------------+-----------------------------+-----------+ -| node['name'] | (sum(size) / 1.073741824E9) | count(id) | -+--------------+-----------------------------+-----------+ +select + node['name'], + sum(size) / 1024^3 as total_size_gb, + count(id) as shard_count +from sys.shards +group by 1 +order by 1 asc; +``` +```text ++--------------+---------------+-------------+ +| node['name'] | total_size_gb | shard_count | ++--------------+---------------+-------------+ | data-hot-0 | 1862.5866614403203 | 680 | | data-hot-1 | 1866.0331328986213 | 684 | | data-hot-2 | 1856.6581886671484 | 1043 | | data-hot-3 | 1208.932889252901 | 477 | | data-hot-4 | 1861.7727940855548 | 674 | | data-hot-5 | 1863.4315695902333 | 744 | | data-hot-6 | 1851.3522544233128 | 948 | | NULL | 0.0 | 35 | -+--------------+-----------------------------+-----------+ ++--------------+---------------+-------------+ SELECT 8 rows in set (0.061 sec)</blockquote></details> <details> <summary>doc/admin/xmover/handbook.md (1)</summary><blockquote> `59-61`: **Grammar: “dedicating” → “dedicated to”.** Small wording fix for readability. ```diff -This view is dedicating a specific focus on large tables. +This view is dedicated to a specific focus on large tables.
cratedb_toolkit/admin/xmover/analysis/table.py (1)
22-33
: Enhance precision for storage size edge casesThe function correctly formats sizes, but for sub-byte values (< 0.001 GB), it displays "0 B" which may mask very small but non-zero values.
Consider preserving precision for very small values:
def format_storage_size(size_gb: float) -> str: """Format storage size with appropriate units and spacing""" if size_gb < 0.001: - return "0 B" + size_bytes = size_gb * 1024 * 1024 * 1024 + if size_bytes < 1: + return f"{size_bytes:.2e} B" + else: + return f"{size_bytes:.0f} B" elif size_gb < 1.0: size_mb = size_gb * 1024 return f"{size_mb:.0f} MB"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (9)
cratedb_toolkit/admin/xmover/analysis/table.py
(1 hunks)cratedb_toolkit/admin/xmover/cli.py
(1 hunks)cratedb_toolkit/admin/xmover/operational/monitor.py
(1 hunks)cratedb_toolkit/admin/xmover/operational/recommend.py
(1 hunks)cratedb_toolkit/admin/xmover/util/database.py
(1 hunks)doc/admin/xmover/handbook.md
(1 hunks)doc/admin/xmover/index.md
(1 hunks)doc/admin/xmover/queries.md
(1 hunks)tests/admin/test_cli.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- cratedb_toolkit/admin/xmover/operational/monitor.py
- tests/admin/test_cli.py
- cratedb_toolkit/admin/xmover/cli.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-20T09:20:33.795Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/handbook.md:462-469
Timestamp: 2025-08-20T09:20:33.795Z
Learning: In large documentation files with many existing headings, be less strict about flagging emphasis-as-heading (MD036) issues, as emphasis can serve valid purposes for minor subsections and clarifications rather than requiring full heading treatment.
Applied to files:
doc/admin/xmover/handbook.md
📚 Learning: 2025-08-20T09:22:09.511Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/queries.md:6-21
Timestamp: 2025-08-20T09:22:09.511Z
Learning: In CrateDB SQL, the caret operator (^) performs exponentiation, following PostgreSQL conventions. Expressions like 1024^3 are valid and correct for converting bytes to GiB in CrateDB queries. Do not flag this as incorrect.
Applied to files:
doc/admin/xmover/queries.md
📚 Learning: 2025-08-20T09:22:09.511Z
Learnt from: amotl
PR: crate/cratedb-toolkit#523
File: doc/admin/xmover/queries.md:6-21
Timestamp: 2025-08-20T09:22:09.511Z
Learning: In CrateDB SQL, the caret operator (^) performs exponentiation, similar to PostgreSQL. Expressions like 1024^3 are valid and correct for converting bytes to GiB in CrateDB queries.
Applied to files:
doc/admin/xmover/queries.md
🧬 Code graph analysis (3)
cratedb_toolkit/admin/xmover/util/database.py (2)
cratedb_toolkit/admin/xmover/model.py (5)
NodeInfo
(7-29)RecoveryInfo
(55-102)ShardInfo
(33-51)size_gb
(81-83)overall_progress
(76-78)cratedb_toolkit/admin/xmover/cli.py (1)
test_connection
(153-175)
cratedb_toolkit/admin/xmover/operational/recommend.py (5)
cratedb_toolkit/admin/xmover/analysis/shard.py (2)
validate_move_safety
(373-401)generate_rebalancing_recommendations
(191-371)cratedb_toolkit/admin/xmover/model.py (8)
ShardRelocationConstraints
(177-186)ShardRelocationRequest
(106-113)ShardRelocationResponse
(117-153)shard_type
(50-51)shard_type
(86-87)size_gb
(81-83)to_sql
(131-137)available_space_gb
(28-29)cratedb_toolkit/admin/xmover/operational/monitor.py (3)
RecoveryMonitor
(24-384)RecoveryOptions
(16-21)get_cluster_recovery_status
(31-43)cratedb_toolkit/admin/xmover/util/database.py (2)
CrateDBClient
(18-498)execute_query
(42-60)cratedb_toolkit/admin/xmover/util/format.py (1)
format_size
(1-8)
cratedb_toolkit/admin/xmover/analysis/table.py (1)
cratedb_toolkit/admin/xmover/util/database.py (2)
execute_query
(42-60)get_nodes_info
(62-96)
🪛 LanguageTool
doc/admin/xmover/index.md
[grammar] ~8-~8: There might be a mistake here.
Context: ...nd availability zones. It generates safe SQL commands for shard rebalancing and n...
(QB_NEW_EN)
[grammar] ~13-~13: There might be a mistake here.
Context: ...hard distribution across nodes and zones - Shard Distribution Analysis: Detect an...
(QB_NEW_EN)
[grammar] ~14-~14: There might be a mistake here.
Context: ...ribution anomalies across largest tables - Shard Movement Recommendations: Intell...
(QB_NEW_EN)
[grammar] ~15-~15: There might be a mistake here.
Context: ...s for rebalancing with safety validation - Recovery Monitoring: Track ongoing sha...
(QB_NEW_EN)
[grammar] ~16-~16: There might be a mistake here.
Context: ...ecovery operations with progress details - Zone Conflict Detection: Prevents move...
(QB_NEW_EN)
[grammar] ~17-~17: There might be a mistake here.
Context: ...t would violate CrateDB's zone awareness - Node Decommissioning: Plan safe node r...
(QB_NEW_EN)
[grammar] ~18-~18: There might be a mistake here.
Context: ... removal with automated shard relocation - Dry Run Mode: Test recommendations wit...
(QB_NEW_EN)
[grammar] ~19-~19: There might be a mistake here.
Context: ...s without generating actual SQL commands - Safety Validation: Comprehensive check...
(QB_NEW_EN)
doc/admin/xmover/handbook.md
[grammar] ~1-~1: There might be a mistake here.
Context: (xmover-handbook)= # XMover Handbook ## Installation Instal...
(QB_NEW_EN)
[grammar] ~59-~59: There might be a mistake here.
Context: ...ace ``` ### Shard Distribution Analysis This view is dedicating a specific focus...
(QB_NEW_EN)
[grammar] ~95-~95: There might be a mistake here.
Context: ...tion across nodes and zones. Options: - --table, -t
: Analyze specific table only **Example...
(QB_NEW_EN)
[grammar] ~106-~106: There might be a mistake here.
Context: ...on size and health criteria. Options: - --table, -t
: Find candidates in specific table only...
(QB_NEW_EN)
[grammar] ~107-~107: There might be a mistake here.
Context: ...: Find candidates in specific table only - --min-size
: Minimum shard size in GB (default: 40)...
(QB_NEW_EN)
[grammar] ~108-~108: There might be a mistake here.
Context: ...: Minimum shard size in GB (default: 40) - --max-size
: Maximum shard size in GB (default: 60)...
(QB_NEW_EN)
[grammar] ~109-~109: There might be a mistake here.
Context: ...: Maximum shard size in GB (default: 60) - --node
: Only show candidates from this specifi...
(QB_NEW_EN)
[grammar] ~124-~124: There might be a mistake here.
Context: ...ons for cluster rebalancing. Options: - --table, -t
: Generate recommendations for specific ...
(QB_NEW_EN)
[grammar] ~125-~125: There might be a mistake here.
Context: ... recommendations for specific table only - --min-size
: Minimum shard size in GB (default: 40)...
(QB_NEW_EN)
[grammar] ~126-~126: There might be a mistake here.
Context: ...: Minimum shard size in GB (default: 40) - --max-size
: Maximum shard size in GB (default: 60)...
(QB_NEW_EN)
[grammar] ~127-~127: There might be a mistake here.
Context: ...: Maximum shard size in GB (default: 60) - --zone-tolerance
: Zone balance tolerance percentage (def...
(QB_NEW_EN)
[grammar] ~128-~128: There might be a mistake here.
Context: ...lance tolerance percentage (default: 10) - --min-free-space
: Minimum free space required on target ...
(QB_NEW_EN)
[grammar] ~129-~129: There might be a mistake here.
Context: ...red on target nodes in GB (default: 100) - --max-moves
: Maximum number of move recommendations...
(QB_NEW_EN)
[grammar] ~130-~130: There might be a mistake here.
Context: ...er of move recommendations (default: 10) - --max-disk-usage
: Maximum disk usage percentage for targ...
(QB_NEW_EN)
[grammar] ~131-~131: There might be a mistake here.
Context: ...ercentage for target nodes (default: 85) - --validate/--no-validate
: Validate move safety (default: True) -...
(QB_NEW_EN)
[grammar] ~132-~132: There might be a mistake here.
Context: ...e: Validate move safety (default: True) -
--prioritize-space/--prioritize-zones`: Prioritize available space over zone b...
(QB_NEW_EN)
[grammar] ~133-~133: There might be a mistake here.
Context: ...ace over zone balancing (default: False) - --dry-run/--execute
: Show what would be done without genera...
(QB_NEW_EN)
[grammar] ~134-~134: There might be a mistake here.
Context: ... generating SQL commands (default: True) - --node
: Only recommend moves from this specifi...
(QB_NEW_EN)
[grammar] ~158-~158: There might be a mistake here.
Context: ...ion and potential conflicts. Options: - --table, -t
: Analyze zones for specific table only ...
(QB_NEW_EN)
[grammar] ~159-~159: There might be a mistake here.
Context: ...: Analyze zones for specific table only -
--show-shards/--no-show-shards`: Show individual shard details (default...
(QB_NEW_EN)
[grammar] ~170-~170: There might be a mistake here.
Context: ...with configurable tolerance. Options: - --table, -t
: Check balance for specific table only ...
(QB_NEW_EN)
[grammar] ~171-~171: There might be a mistake here.
Context: ...: Check balance for specific table only -
--tolerance`: Zone balance tolerance percentage (def...
(QB_NEW_EN)
[grammar] ~184-~184: There might be a mistake here.
Context: ...ecution to prevent errors. Arguments: - SCHEMA_TABLE
: Schema and table name (format: schema....
(QB_NEW_EN)
[grammar] ~185-~185: There might be a mistake here.
Context: ...ma and table name (format: schema.table) - SHARD_ID
: Shard ID to move - FROM_NODE
: Source...
(QB_NEW_EN)
[grammar] ~186-~186: There might be a mistake here.
Context: ...ma.table) - SHARD_ID
: Shard ID to move - FROM_NODE
: Source node name - TO_NODE
: Target n...
(QB_NEW_EN)
[grammar] ~187-~187: There might be a mistake here.
Context: ... to move - FROM_NODE
: Source node name - TO_NODE
: Target node name Examples: ```bas...
(QB_NEW_EN)
[grammar] ~202-~202: There might be a mistake here.
Context: ... troubleshooting guidance. Arguments: - ERROR_MESSAGE
: The CrateDB error message to analyze (...
(QB_NEW_EN)
[grammar] ~217-~217: There might be a mistake here.
Context: ...y operations on the cluster. Options: - --table, -t
: Monitor recovery for specific table on...
(QB_NEW_EN)
[grammar] ~218-~218: There might be a mistake here.
Context: ...Monitor recovery for specific table only - --node, -n
: Monitor recovery on specific node only...
(QB_NEW_EN)
[grammar] ~219-~219: There might be a mistake here.
Context: ...: Monitor recovery on specific node only - --watch, -w
: Continuously monitor (refresh every 10...
(QB_NEW_EN)
[grammar] ~220-~220: There might be a mistake here.
Context: ...Continuously monitor (refresh every 10s) - --refresh-interval
: Refresh interval for watch mode in sec...
(QB_NEW_EN)
[grammar] ~221-~221: There might be a mistake here.
Context: ... for watch mode in seconds (default: 10) - --recovery-type
: Filter by recovery type - PEER, DISK, ...
(QB_NEW_EN)
[grammar] ~222-~222: There might be a mistake here.
Context: ...type - PEER, DISK, or all (default: all) - --include-transitioning
: Include recently completed recoveries ...
(QB_NEW_EN)
[grammar] ~243-~243: There might be a mistake here.
Context: ...ude-transitioning ``` Recovery Types: - PEER: Copying shard data from another ...
(QB_NEW_EN)
[grammar] ~244-~244: There might be a mistake here.
Context: ...om another node (replication/relocation) - DISK: Rebuilding shard from local data...
(QB_NEW_EN)
[grammar] ~377-~377: There might be a mistake here.
Context: ...TRING: CrateDB HTTP endpoint (required) -
CRATE_USERNAME`: Username for authentication (optional)...
(QB_NEW_EN)
[grammar] ~378-~378: There might be a mistake here.
Context: ...: Username for authentication (optional) - CRATE_PASSWORD
: Password for authentication (optional)...
(QB_NEW_EN)
[grammar] ~379-~379: There might be a mistake here.
Context: ...: Password for authentication (optional) - CRATE_SSL_VERIFY
: Enable SSL certificate verification (d...
(QB_NEW_EN)
[grammar] ~434-~434: There might be a mistake here.
Context: ...zone awareness prevents too many copies in same zone - Solution: Move shard...
(QB_NEW_EN)
[grammar] ~435-~435: There might be a mistake here.
Context: ... copies in same zone - Solution: Move shard to a different availability zone ...
(QB_NEW_EN)
[grammar] ~443-~443: There might be a mistake here.
Context: ...ufficient free space - Solution: Choose node with more capacity or free up spac...
(QB_NEW_EN)
[grammar] ~454-~454: There might be a mistake here.
Context: ...pace` 5. No Recommendations Generated - Cause: Cluster may already be well bal...
(QB_NEW_EN)
[grammar] ~484-~484: There might be a mistake here.
Context: ...age 95 ``` When to Adjust Thresholds: - Emergency situations: Increase to 90-9...
(QB_NEW_EN)
[style] ~485-~485: ‘Emergency situations’ might be wordy. Consider a shorter alternative.
Context: ...``` When to Adjust Thresholds: - Emergency situations: Increase to 90-95% for critical spac...
(EN_WORDINESS_PREMIUM_EMERGENCY_SITUATIONS)
[grammar] ~487-~487: There might be a mistake here.
Context: ...onments**: Can be more aggressive (90%+) - Production: Keep conservative (80-85%)...
(QB_NEW_EN)
[grammar] ~492-~492: There might be a mistake here.
Context: ...f}xmover-troubleshooting
which covers: - Step-by-step diagnostic procedures - Eme...
(QB_NEW_EN)
[grammar] ~493-~493: There might be a mistake here.
Context: ...rs: - Step-by-step diagnostic procedures - Emergency recovery procedures - Best pra...
(QB_NEW_EN)
[grammar] ~494-~494: There might be a mistake here.
Context: ...ocedures - Emergency recovery procedures - Best practices for safe operations - Com...
(QB_NEW_EN)
[grammar] ~495-~495: There might be a mistake here.
Context: ...res - Best practices for safe operations - Complete error reference guide ### Debu...
(QB_NEW_EN)
doc/admin/xmover/queries.md
[grammar] ~1-~1: There might be a mistake here.
Context: (xmover-queries)= # XMover Query Gallery ## Shard Distribut...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
doc/admin/xmover/handbook.md
220-220: Trailing punctuation in heading
Punctuation: '.'
(MD026, no-trailing-punctuation)
249-249: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-and-publish
- GitHub Check: build-and-publish
- GitHub Check: Cloud: Python 3.13 on OS ubuntu-latest
🔇 Additional comments (6)
cratedb_toolkit/admin/xmover/analysis/table.py (6)
121-140
: Well-structured CrateDB query with proper aggregationThe query correctly aggregates shard data and performs the necessary calculations for distribution analysis.
249-251
: Robust coefficient of variation calculationsGood use of the helper method for calculating CV values for multiple metrics.
432-445
: Solid statistical calculations with proper error handlingThe coefficient of variation calculation properly handles edge cases including zero mean and insufficient data points.
654-676
: Clean separation of concerns in anomaly detectionExcellent design pattern using separate detector methods for different anomaly types, making the code modular and maintainable.
637-638
: Potential division by zero in document imbalance detectionWhen
min_docs
is 0, the conditionmax_docs > min_docs * 3
would not work as intended.Add protection against zero document counts:
- if max_docs > min_docs * 3: # If difference is > 3x + if min_docs > 0 and max_docs > min_docs * 3: # If difference is > 3x recommendations.append(f"Significant document skew detected ({min_docs:,} to {max_docs:,} docs per node)") + elif min_docs == 0 and max_docs > 0: + recommendations.append(f"Some nodes have no documents while others have {max_docs:,} - check shard routing")Likely an incorrect or invalid review comment.
532-546
: Potential division by zero in storage imbalance detectionWhen
min_size
is 0 (which can happen for nodes with no shards), the comparisonmax_size > min_size * 2
would evaluate incorrectly.Add a check to prevent issues with zero minimum size:
if max_size > min_size * 2: # If difference is > 2x + # Skip if min_size is effectively zero + if min_size < 0.001: + recommendations.append( + f"Some nodes have no storage - consider distributing shards more evenly" + ) + else: overloaded_node = None underloaded_node = None for node, data in table.node_distributions.items(): if data["total_size_gb"] == max_size: overloaded_node = node elif data["total_size_gb"] == min_size: underloaded_node = node if overloaded_node and underloaded_node: recommendations.append( f"Rebalance storage from {overloaded_node} ({format_storage_size(max_size)}) " f"to {underloaded_node} ({format_storage_size(min_size)})" )Likely an incorrect or invalid review comment.
- More or less just line-length fixes. - Only a single type adjustment was needed on the return value of the `analyze_distribution` method. - Ruff recommended to use set comprehensions, so here we go. - At a single spot where an exception has been `pass`ed, we added error output. Is it bad?
2c6ef7f
to
21dd237
Compare
Thank you, @WalBeh! |
About
@WalBeh contributed a powerful utility that uses CrateDB's system tables to find out about cluster imbalances related to shard number and shard size distribution across the whole cluster. Thanks!
After analysing the situation, the program presents solutions in form of SQL commands to bring the cluster into a more balanced state again.
Install
uv pip install --upgrade 'cratedb-toolkit @ git+https://github.com/crate/cratedb-toolkit.git@xmover'
Documentation
https://cratedb-toolkit--523.org.readthedocs.build/admin/xmover/
References
ctk info summary
: Exploring shard size distribution #513