7 changes: 7 additions & 0 deletions ch_backup/logic/table.py
@@ -235,6 +235,13 @@ def _load_create_statement_from_disk(table: Table) -> Optional[str]:
         return None
     try:
         return Path(table.metadata_path).read_text("utf-8")
+    except UnicodeDecodeError:
+        logging.warning(
+            'Table "{}"."{}": metadata contains non-UTF-8 bytes, using latin-1 fallback',
+            table.database,
+            table.name,
+        )
+        return Path(table.metadata_path).read_text("latin-1")
     except OSError as e:
         logging.debug(
             'Cannot load a create statement of the table "{}"."{}": {}',
9 changes: 5 additions & 4 deletions ch_backup/storage/loader.py
@@ -126,9 +126,7 @@ def upload_files_tarball(
         )
         return remote_path

-    def download_data(
-        self, remote_path, is_async=False, encryption=False, encoding="utf-8"
-    ):
+    def download_data(self, remote_path, is_async=False, encryption=False):
Contributor:

Let's just replace encoding="latin-1"
Contributor:

Can there be a problem (corrupted text)? We have

    data=query.encode("utf-8"),

so what if we read data as latin-1 but then send it to ClickHouse UTF-8-encoded?
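The corruption the reviewer asks about is easy to reproduce in isolation (a standalone stdlib-only sketch; `raw` is a stand-in for non-UTF-8 metadata bytes, not code from the PR):

```python
# Bytes that are valid cp1251 but NOT valid UTF-8.
raw = b"\xcf\xf0\xe8\xe2\xe5\xf2"

text = raw.decode("latin-1")       # the fallback read: latin-1 never fails
reencoded = text.encode("utf-8")   # what query.encode("utf-8") would then send

# The bytes sent over the wire differ from the bytes read from disk:
assert reencoded != raw
print(raw)        # b'\xcf\xf0\xe8\xe2\xe5\xf2'
print(reencoded)  # b'\xc3\x8f\xc3\xb0\xc3\xa8\xc3\xa2\xc3\xa5\xc3\xb2'
```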

Author (@dimbo4ka), Feb 3, 2026:

> Let's just replace encoding="latin-1"

This change breaks tests for valid UTF-8 data:
      ASSERT FAILED: 
      Expected: <{('test_db', 'table_ascii'): "CREATE TABLE test_db.table_ascii (`id` Int32, `name_ascii` String COMMENT 'ascii') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_chinese'): "CREATE TABLE test_db.table_chinese (`id` Int32, `name_试` String COMMENT '试') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_cyrillic'): "CREATE TABLE test_db.table_cyrillic (`id` Int32, `name_абвгд` String COMMENT 'абвгд') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_emoji'): "CREATE TABLE test_db.table_emoji (`id` Int32, `name_😈` String COMMENT '😈') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192"}>
           but: was <{('test_db', 'table_ascii'): "CREATE TABLE test_db.table_ascii (`id` Int32, `name_ascii` String COMMENT 'ascii') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_chinese'): "CREATE TABLE test_db.table_chinese (`id` Int32, `name_è¯\x95` String COMMENT 'è¯\x95') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_cyrillic'): "CREATE TABLE test_db.table_cyrillic (`id` Int32, `name_абвгд` String COMMENT 'абвгд') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_emoji'): "CREATE TABLE test_db.table_emoji (`id` Int32, `name_ð\x9f\x98\x88` String COMMENT 'ð\x9f\x98\x88') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192"}>

The issue is that bytes.decode() returns a str (a Unicode string), not bytes. When we decode UTF-8 bytes as latin-1, each byte becomes a separate character.
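The per-byte splitting described above can be seen directly (a standalone illustration matching the `table_chinese` failure):

```python
utf8_bytes = "试".encode("utf-8")        # b'\xe8\xaf\x95': one character, three bytes
mojibake = utf8_bytes.decode("latin-1")  # each byte becomes its own character

assert len(mojibake) == 3                # 'è', '¯', '\x95' — as in the failed assert
# latin-1 is lossless, so the original bytes are still recoverable:
assert mojibake.encode("latin-1") == utf8_bytes
```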

"""
Download file from storage and return its content.

Expand All @@ -139,7 +137,10 @@ def download_data(
data = self._ploader.download_data(
remote_path, is_async=is_async, encryption=encryption
)
return data.decode(encoding) if encoding else data
try:
return data.decode("utf-8")
except UnicodeDecodeError:
return data.decode("latin-1")

     def download_file(
         self,
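The except branch of the fallback above can never itself raise: latin-1 maps every byte value 0–255 to the code point of the same number, so any byte string decodes. A quick standalone check:

```python
# Every possible byte value decodes under latin-1, and the round trip is lossless.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")   # never raises

assert len(decoded) == 256
assert decoded.encode("latin-1") == all_bytes

# The same input is rejected by UTF-8, which is why the fallback is needed:
try:
    all_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("UTF-8 rejects these bytes; latin-1 does not")
```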
63 changes: 63 additions & 0 deletions tests/integration/features/schema_encoding_compatibility.feature
@@ -0,0 +1,63 @@
Feature: Non-UTF-8 schema encoding support
Contributor:

Do these tests fail without the changes?

Author:

Yes.

    Background:
        Given default configuration
        And a working s3
        And a working zookeeper on zookeeper01
        And a working clickhouse on clickhouse01
        And a working clickhouse on clickhouse02

    Scenario: Backup and restore multiple tables with correct utf-8 encodings
        Given we have executed queries on clickhouse01
        """
        CREATE DATABASE test_db;

        CREATE TABLE test_db.table_ascii (
            id Int32,
            name_ascii String COMMENT 'ascii'
        ) ENGINE = MergeTree() ORDER BY id;

        CREATE TABLE test_db.table_emoji (
            id Int32,
            `name_😈` String COMMENT '😈'
        ) ENGINE = MergeTree() ORDER BY id;

        CREATE TABLE test_db.table_cyrillic (
            id Int32,
            `name_абвгд` String COMMENT 'абвгд'
        ) ENGINE = MergeTree() ORDER BY id;

        CREATE TABLE test_db.table_chinese (
            id Int32,
            `name_试` String COMMENT '试'
        ) ENGINE = MergeTree() ORDER BY id;

        INSERT INTO test_db.table_ascii VALUES (1, 'test1');
        INSERT INTO test_db.table_emoji VALUES (2, 'test2');
        INSERT INTO test_db.table_cyrillic VALUES (3, 'test3');
        INSERT INTO test_db.table_chinese VALUES (4, 'test4');
        """
        When we create clickhouse01 clickhouse backup
        Then we got the following backups on clickhouse01
          | num | state   | data_count | link_count |
          | 0   | created | 4          | 0          |
        When we restore clickhouse backup #0 to clickhouse02
        Then clickhouse02 has same schema as clickhouse01
        And we got same clickhouse data at clickhouse01 clickhouse02

    Scenario: Table with invalid utf-8 characters
        Given we have created non-UTF-8 test table on clickhouse01
        When we create clickhouse01 clickhouse backup
        Then we got the following backups on clickhouse01
          | num | state   | data_count | link_count |
          | 0   | created | 1          | 0          |
        When we restore clickhouse backup #0 to clickhouse02
        When we execute query on clickhouse02
        """
        EXISTS TABLE test_db.table_rus
        """
        Then we get response
        """
        1
        """

7 changes: 5 additions & 2 deletions tests/integration/modules/clickhouse.py
@@ -67,11 +67,14 @@ def ping(self) -> None:
"""
self._query("GET", url="ping")

def execute(self, query: str) -> None:
def execute(self, query: Union[str, bytes]) -> None:
"""
Execute arbitrary query.
"""
self._query("POST", query=query)
if isinstance(query, str):
self._query("POST", query=query)
return
self._query("POST", data=query)

     def get_response(self, query: str) -> str:
         """
23 changes: 23 additions & 0 deletions tests/integration/steps/clickhouse.py
@@ -349,3 +349,26 @@ def step_create_multiple_tables(context, table_count, node):
     for i in range(table_count):
         table_schema = schema_template.format(table_number=i)
         ch_client.execute(table_schema)
+
+
+@given("we have created non-UTF-8 test table on {node}")
Contributor:

Let's make a "we execute query on {node:w} with encoding {encoding:w}" step instead of this. It seems more versatile.

Author:

We can't parameterize the encoding in the step, since Gherkin doesn't support non-UTF-8 text.

Contributor (@aalexfvk), Feb 3, 2026:

How about escaping like this ("Привет" in cp1251)? Most likely the steps and/or the client will need to be refined:

    """
    CREATE TABLE test_db_01.table_rus (
        EventDate DateTime,
        CounterID UInt32,
        `\xcf\xf0\xe8\xe2\xe5\xf2` UInt32
    )
    ENGINE = MergeTree()
    PARTITION BY CounterID % 10
    ORDER BY (CounterID, EventDate)
    """

+def create_non_utf8_table(context, node):
+    """
+    Create table with invalid utf-8 for testing latin-1 fallback
+    """
+    ch_client = ClickhouseClient(context, node)
+    ch_client.execute("CREATE DATABASE IF NOT EXISTS test_db")
+
+    query = b"""
+    CREATE TABLE test_db.table_rus (
+        EventDate DateTime,
+        CounterID UInt32,
+        `\xcf\xf0\xe8\xe2\xe5\xf2` UInt32
+    )
+    ENGINE = MergeTree()
+    PARTITION BY CounterID % 10
+    ORDER BY (CounterID, EventDate)
+    """
+
+    ch_client.execute(query)
+    ch_client.execute("INSERT INTO test_db.table_rus VALUES (toDateTime('17.01.2006 10:03:00'), 2, 3)")