-
Notifications
You must be signed in to change notification settings - Fork 10
Fix: non utf8 schema support #288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer's GuideAdds latin-1 fallback support for non-UTF-8 ClickHouse table schemas during backup/restore and introduces integration tests (including a raw-bytes ClickHouse client method) to verify behavior with both valid UTF-8 and invalid-UTF-8 metadata. Sequence diagram for fallback decoding when loading table metadatasequenceDiagram
participant BackupManager
participant TableLogic as TableLogic_table_module
participant FileSystem
BackupManager->>TableLogic: _load_create_statement_from_disk(table)
TableLogic->>FileSystem: read_text(utf-8)
alt utf-8 readable
FileSystem-->>TableLogic: create_statement (utf-8)
TableLogic-->>BackupManager: create_statement
else UnicodeDecodeError
TableLogic->>TableLogic: log warning (non UTF-8 bytes)
TableLogic->>FileSystem: read_text(latin-1)
FileSystem-->>TableLogic: create_statement (latin-1)
TableLogic-->>BackupManager: create_statement
end
Sequence diagram for fallback decoding in storage loader download_datasequenceDiagram
participant RestoreProcess
participant StorageLoader
participant PhysicalLoader as PhysicalLoader_impl
RestoreProcess->>StorageLoader: download_data(remote_path, encoding)
StorageLoader->>PhysicalLoader: download_data(remote_path, is_async, encryption)
PhysicalLoader-->>StorageLoader: data (bytes)
alt encoding is falsy
StorageLoader-->>RestoreProcess: data (bytes)
else encoding is truthy
StorageLoader->>StorageLoader: data.decode(utf-8)
alt utf-8 succeeds
StorageLoader-->>RestoreProcess: text (utf-8)
else UnicodeDecodeError
StorageLoader->>StorageLoader: data.decode(latin-1)
StorageLoader-->>RestoreProcess: text (latin-1)
end
end
Updated class diagram for storage loader and ClickHouse test clientclassDiagram
class StorageLoader {
+download_data(remote_path, encoding, is_async, encryption) any
+download_file(remote_path, local_path, is_async, encryption) void
}
class PhysicalLoader {
+download_data(remote_path, is_async, encryption) bytes
}
StorageLoader --> PhysicalLoader : uses
class TableModule {
+_load_create_statement_from_disk(table) str
}
class Table {
+database str
+name str
+metadata_path str
}
TableModule --> Table : reads_metadata
class ClickHouseTestClient {
+query(sql) str
+query_raw_bytes(sql) bytes
}
ClickHouseTestClient ..> StorageLoader : used_in_integration_tests
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey - I've found 1 issue, and left some high level feedback:
- In
StorageLoader.download_data, theencodingargument is now ignored and UTF-8/latin-1 is hardcoded; either honor the passedencoding(including supporting non-UTF-8 encodings) or remove the parameter to avoid a misleading API. - Consider adding a warning log when
download_datafalls back from UTF-8 to latin-1 decoding, similar to_load_create_statement_from_disk, to make non-UTF-8 data easier to detect in production.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `StorageLoader.download_data`, the `encoding` argument is now ignored and UTF-8/latin-1 is hardcoded; either honor the passed `encoding` (including supporting non-UTF-8 encodings) or remove the parameter to avoid a misleading API.
- Consider adding a warning log when `download_data` falls back from UTF-8 to latin-1 decoding, similar to `_load_create_statement_from_disk`, to make non-UTF-8 data easier to detect in production.
## Individual Comments
### Comment 1
<location> `ch_backup/storage/loader.py:142-147` </location>
<code_context>
remote_path, is_async=is_async, encryption=encryption
)
- return data.decode(encoding) if encoding else data
+ if not encoding:
+ return data
+ try:
+ return data.decode("utf-8")
+ except UnicodeDecodeError:
+ return data.decode("latin-1")
def download_file(
</code_context>
<issue_to_address>
**issue (bug_risk):** The `encoding` parameter is now ignored when truthy, which is a behavior change and likely a bug.
With the new logic, callers that pass `encoding="ascii"`, `"utf-16"`, or rely on strict decoding will now always get a UTF-8→latin-1 fallback instead, which can change behavior or hide errors.
If you want a UTF-8→latin-1 fallback while still respecting the caller’s choice, you could do something like:
```python
if not encoding:
return data
try:
return data.decode(encoding)
except UnicodeDecodeError:
if encoding.lower() == "utf-8":
return data.decode("latin-1")
raise
```
so that the fallback only applies when UTF-8 was actually requested.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of separate execute_raw
def execute(self, query: Union[str, bytes]) -> None:
if isinstance(query, bytes):
...
| @@ -0,0 +1,63 @@ | |||
| Feature: Non-UTF-8 schema encoding support | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do this tests fail without changes ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
| ch_client.execute(table_schema) | ||
|
|
||
|
|
||
| @given("we have created non-UTF-8 test table on {node}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make we execute query on {node:w} with encoding {encoding:w} step instead of this.
It seems more versatile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't parameterize the encoding in the step since Gherkin doesn't support non-UTF-8 text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about escaping like this ("Привет" в cp1251). Most likely the steps and/or client will need to be refined
"""
CREATE TABLE test_db_01.table_rus (
EventDate DateTime,
CounterID UInt32,
`\xcf\xf0\xe8\xe2\xe5\xf2` UInt32
)
ENGINE = MergeTree()
PARTITION BY CounterID % 10
ORDER BY (CounterID, EventDate)
"""
| def download_data( | ||
| self, remote_path, is_async=False, encryption=False, encoding="utf-8" | ||
| ): | ||
| def download_data(self, remote_path, is_async=False, encryption=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just replace encoding="latin-1"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can there be a problem (corrupted text) ?
ch-backup/ch_backup/clickhouse/client.py
Line 85 in 8802592
| data=query.encode("utf-8"), |
If we read data as latin-1 but send it to ClickHouse as utf-8 encoded ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just replace
encoding="latin-1"
this change breaks tests for valid UTF-8 data:
ASSERT FAILED:
Expected: <{('test_db', 'table_ascii'): "CREATE TABLE test_db.table_ascii (`id` Int32, `name_ascii` String COMMENT 'ascii') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_chinese'): "CREATE TABLE test_db.table_chinese (`id` Int32, `name_试` String COMMENT '试') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_cyrillic'): "CREATE TABLE test_db.table_cyrillic (`id` Int32, `name_абвгд` String COMMENT 'абвгд') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_emoji'): "CREATE TABLE test_db.table_emoji (`id` Int32, `name_😈` String COMMENT '😈') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192"}>
but: was <{('test_db', 'table_ascii'): "CREATE TABLE test_db.table_ascii (`id` Int32, `name_ascii` String COMMENT 'ascii') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_chinese'): "CREATE TABLE test_db.table_chinese (`id` Int32, `name_è¯\x95` String COMMENT 'è¯\x95') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_cyrillic'): "CREATE TABLE test_db.table_cyrillic (`id` Int32, `name_абвгд` String COMMENT 'абвгд') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_emoji'): "CREATE TABLE test_db.table_emoji (`id` Int32, `name_ð\x9f\x98\x88` String COMMENT 'ð\x9f\x98\x88') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192"}>
the issue is that bytes.decode() returns a str (unicode string), not bytes. When we decode utf-8 bytes as latin-1, each byte becomes a separate character
Summary by Sourcery
Add support for handling non-UTF-8 ClickHouse schemas and validate compatibility via integration tests.
Bug Fixes:
Enhancements:
Tests: