Commit bea91a2

clickhouse: prevent replicated tables from starting in read-only mode. (#9183)
On start, ClickHouse compares the local state of each replicated table to its distributed state. If it finds a discrepancy, it starts the table in read-only mode. When this happens, oximeter can't write new records to the affected table(s).

In the past, we've worked around this by manually instructing ClickHouse using the `force_restore_data` sentinel file, but this requires manual detection and intervention each time a table starts up in read-only mode. This patch sets the `replicated_max_ratio_of_wrong_parts` setting to 1.0 so that ClickHouse always accepts local state and never starts tables in read-only mode.

As described in ClickHouse/ClickHouse#66527, this appears to be a bug, or at least an ergonomic flaw, in ClickHouse. One replica of a table can routinely fall behind the others, e.g. due to a restart or network partition, and shouldn't require manual intervention to start back up.

Part of #8595.

Note: I'm not sure how best to test this. It sounds like we have reasonably high confidence that the fix will work, so we could just merge and deploy to dogfood, and revert if necessary. Or is there a ClickHouse cluster running on another rack that we can test on?
1 parent 8d75b60 commit bea91a2
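
To make the effect of the threshold concrete, here is a deliberately simplified model of the startup check, written as a sketch rather than a description of ClickHouse's actual implementation. It assumes the check reduces to comparing the fraction of local parts that disagree with the shared state against `replicated_max_ratio_of_wrong_parts`. The function name and its parameters are hypothetical.

```rust
/// Simplified model (an assumption, not ClickHouse's code): a table starts
/// read-only when the fraction of suspect local parts exceeds `max_ratio`.
fn starts_read_only(suspect_parts: u64, total_parts: u64, max_ratio: f64) -> bool {
    // With no local parts there is nothing that can disagree.
    if total_parts == 0 {
        return false;
    }
    (suspect_parts as f64) / (total_parts as f64) > max_ratio
}
```

Under this model, `starts_read_only(6, 10, 0.5)` is true, while any call with `max_ratio` set to 1.0 is false, because a fraction of the table's own parts can never exceed 1.0. That is the property the patch relies on.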

2 files changed: 10 additions and 2 deletions
clickhouse-admin/types/src/config.rs

Lines changed: 5 additions & 1 deletion
@@ -201,9 +201,13 @@ impl ReplicaConfig {
 <max_tasks_in_queue>1000</max_tasks_in_queue>
 </distributed_ddl>
 
-<!-- Disable sparse column serialization, which we expect to not need -->
 <merge_tree>
+<!-- Disable sparse column serialization, which we expect to not need -->
 <ratio_of_defaults_for_sparse_serialization>1.0</ratio_of_defaults_for_sparse_serialization>
+
+<!-- Prevent ClickHouse from setting distributed tables to read-only. -->
+<!-- See https://github.com/oxidecomputer/omicron/issues/8595 for details. -->
+<replicated_max_ratio_of_wrong_parts>1.0</replicated_max_ratio_of_wrong_parts>
 </merge_tree>
 {macros}
 {remote_servers}
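
Since the commit message leaves open how to test the change, one cheap offline check is to assert that the rendered replica config carries the new setting inside `<merge_tree>`. The sketch below uses only the standard library and a naive string search. The helper names `element_text` and `accepts_all_local_parts` are hypothetical and are not part of the omicron codebase. It does not verify ClickHouse's runtime behavior, only the generated XML.

```rust
/// Returns the text of the first `<tag>...</tag>` element in `xml`, if present.
/// Naive string search: ignores attributes, comments, and same-tag nesting.
fn element_text<'a>(xml: &'a str, tag: &str) -> Option<&'a str> {
    let open = format!("<{tag}>");
    let close = format!("</{tag}>");
    let start = xml.find(&open)? + open.len();
    let end = start + xml[start..].find(&close)?;
    Some(&xml[start..end])
}

/// True when `replicated_max_ratio_of_wrong_parts` appears inside
/// `<merge_tree>` and parses to a value of at least 1.0.
fn accepts_all_local_parts(config_xml: &str) -> bool {
    element_text(config_xml, "merge_tree")
        .and_then(|body| element_text(body, "replicated_max_ratio_of_wrong_parts"))
        .and_then(|value| value.trim().parse::<f64>().ok())
        .map_or(false, |value| value >= 1.0)
}
```

A test could pass the string produced by the config template (or the contents of the `replica-server-config.xml` fixture) to `accepts_all_local_parts` and assert that it returns true.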

clickhouse-admin/types/testutils/replica-server-config.xml

Lines changed: 5 additions & 1 deletion
@@ -104,9 +104,13 @@
 <max_tasks_in_queue>1000</max_tasks_in_queue>
 </distributed_ddl>
 
-<!-- Disable sparse column serialization, which we expect to not need -->
 <merge_tree>
+<!-- Disable sparse column serialization, which we expect to not need -->
 <ratio_of_defaults_for_sparse_serialization>1.0</ratio_of_defaults_for_sparse_serialization>
+
+<!-- Prevent ClickHouse from setting distributed tables to read-only. -->
+<!-- See https://github.com/oxidecomputer/omicron/issues/8595 for details. -->
+<replicated_max_ratio_of_wrong_parts>1.0</replicated_max_ratio_of_wrong_parts>
 </merge_tree>
 
 <macros>
