Fix: non utf8 schema support #288
Changes from all commits
75a4b6e
57f4f67
ebe55b8
81f5f12
ef17309
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,63 @@ | ||
| Feature: Non-UTF-8 schema encoding support | ||
|
Contributor
Do these tests fail without the changes?
Author
Yes. |
||
|
|
||
| Background: | ||
| Given default configuration | ||
| And a working s3 | ||
| And a working zookeeper on zookeeper01 | ||
| And a working clickhouse on clickhouse01 | ||
| And a working clickhouse on clickhouse02 | ||
|
|
||
| Scenario: Backup and restore multiple tables with correct utf-8 encodings | ||
| Given we have executed queries on clickhouse01 | ||
| """ | ||
| CREATE DATABASE test_db; | ||
|
|
||
| CREATE TABLE test_db.table_ascii ( | ||
| id Int32, | ||
| name_ascii String COMMENT 'ascii' | ||
| ) ENGINE = MergeTree() ORDER BY id; | ||
|
|
||
| CREATE TABLE test_db.table_emoji ( | ||
| id Int32, | ||
| `name_😈` String COMMENT '😈' | ||
| ) ENGINE = MergeTree() ORDER BY id; | ||
|
|
||
| CREATE TABLE test_db.table_cyrillic ( | ||
| id Int32, | ||
| `name_абвгд` String COMMENT 'абвгд' | ||
| ) ENGINE = MergeTree() ORDER BY id; | ||
|
|
||
| CREATE TABLE test_db.table_chinese ( | ||
| id Int32, | ||
| `name_试` String COMMENT '试' | ||
| ) ENGINE = MergeTree() ORDER BY id; | ||
|
|
||
| INSERT INTO test_db.table_ascii VALUES (1, 'test1'); | ||
| INSERT INTO test_db.table_emoji VALUES (2, 'test2'); | ||
| INSERT INTO test_db.table_cyrillic VALUES (3, 'test3'); | ||
| INSERT INTO test_db.table_chinese VALUES (4, 'test4'); | ||
| """ | ||
| When we create clickhouse01 clickhouse backup | ||
| Then we got the following backups on clickhouse01 | ||
| | num | state | data_count | link_count | | ||
| | 0 | created | 4 | 0 | | ||
| When we restore clickhouse backup #0 to clickhouse02 | ||
| Then clickhouse02 has same schema as clickhouse01 | ||
| And we got same clickhouse data at clickhouse01 clickhouse02 | ||
|
|
||
| Scenario: Table with invalid utf-8 characters | ||
| Given we have created non-UTF-8 test table on clickhouse01 | ||
| When we create clickhouse01 clickhouse backup | ||
| Then we got the following backups on clickhouse01 | ||
| | num | state | data_count | link_count | | ||
| | 0 | created | 1 | 0 | | ||
| When we restore clickhouse backup #0 to clickhouse02 | ||
| When we execute query on clickhouse02 | ||
| """ | ||
| EXISTS TABLE test_db.table_rus | ||
| """ | ||
| Then we get response | ||
| """ | ||
| 1 | ||
| """ | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -349,3 +349,26 @@ def step_create_multiple_tables(context, table_count, node): | |
| for i in range(table_count): | ||
| table_schema = schema_template.format(table_number=i) | ||
| ch_client.execute(table_schema) | ||
|
|
||
|
|
||
| @given("we have created non-UTF-8 test table on {node}") | ||
|
Contributor
Let's make
Author
We can't parameterize the encoding in the step, since Gherkin doesn't support non-UTF-8 text.
Contributor
How about escaping like this ("Привет" in cp1251)? Most likely the steps and/or the client will need to be refined. |
||
| def create_non_utf8_table(context, node): | ||
| """ | ||
| Create table with invalid utf-8 for testing latin-1 fallback | ||
| """ | ||
| ch_client = ClickhouseClient(context, node) | ||
| ch_client.execute("CREATE DATABASE IF NOT EXISTS test_db") | ||
|
|
||
| query = b""" | ||
| CREATE TABLE test_db.table_rus ( | ||
| EventDate DateTime, | ||
| CounterID UInt32, | ||
| `\xcf\xf0\xe8\xe2\xe5\xf2` UInt32 | ||
| ) | ||
| ENGINE = MergeTree() | ||
| PARTITION BY CounterID % 10 | ||
| ORDER BY (CounterID, EventDate) | ||
| """ | ||
|
|
||
| ch_client.execute(query) | ||
| ch_client.execute("INSERT INTO test_db.table_rus VALUES (toDateTime('17.01.2006 10:03:00'), 2, 3)") | ||
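The byte sequence used as the column name in the query above can be checked in isolation. The following is a minimal sketch, not part of the PR: it assumes the bytes are meant to be cp1251 (as the review comment suggests), and `is_valid_utf8` is a hypothetical helper introduced only for illustration.

```python
# Raw bytes used as the column name in the test query.
column_name = b"\xcf\xf0\xe8\xe2\xe5\xf2"

# Assumption: the intended encoding is cp1251, per the review comment.
decoded = column_name.decode("cp1251")
print(decoded)  # Привет


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: report whether the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The cp1251 bytes are not valid UTF-8, which is what the scenario relies on.
print(is_valid_utf8(column_name))  # False

# The same word encoded as UTF-8 is, of course, valid.
print(is_valid_utf8("Привет".encode("utf-8")))  # True
```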
Let's just replace
encoding="latin-1"
Could this cause a problem (corrupted text)?
ch-backup/ch_backup/clickhouse/client.py
Line 85 in 8802592
What happens if we read the data as latin-1 but send it to ClickHouse encoded as UTF-8?
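The concern can be reproduced with plain Python, independent of the client code. This is only a sketch of the byte-level behaviour; it does not show what the ch-backup client actually does on line 85.

```python
# Bytes that are not valid UTF-8 (the column name from the test step).
raw = b"\xcf\xf0\xe8\xe2\xe5\xf2"

# latin-1 maps every byte 0x00-0xFF to the code point with the same value,
# so decoding never fails.
text = raw.decode("latin-1")

# Encoding back with latin-1 restores the original bytes exactly.
as_latin1 = text.encode("latin-1")
print(as_latin1 == raw)  # True

# Encoding with UTF-8 instead produces different bytes: every code point
# in the range U+0080-U+00FF becomes a two-byte sequence.
as_utf8 = text.encode("utf-8")
print(as_utf8 == raw)  # False
print(len(raw), len(as_utf8))  # 6 12
```

So the round trip is lossless only if the same codec is used in both directions.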
This change breaks the tests for valid UTF-8 data.
The issue is that bytes.decode() returns a str (Unicode string), not bytes. When we decode UTF-8 bytes as latin-1, each byte becomes a separate character.
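A short illustration of that effect, together with one possible way around it, is sketched here. `decode_schema` is a hypothetical helper, not the function from this PR; it assumes a "try UTF-8 first, fall back to latin-1" strategy like the one the test docstring mentions.

```python
from typing import Tuple

# A valid UTF-8 identifier from the first scenario.
valid = "name_абвгд".encode("utf-8")

# Decoding as latin-1 succeeds but yields one character per byte,
# so the Cyrillic letters turn into pairs of unrelated characters.
mojibake = valid.decode("latin-1")
print(len(valid.decode("utf-8")), len(mojibake))  # 10 15


def decode_schema(data: bytes) -> Tuple[str, str]:
    """Hypothetical helper: decode as UTF-8, falling back to latin-1.

    Returns the text and the codec used, so the caller can encode the
    text back with the same codec and recover the original bytes.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode("latin-1"), "latin-1"


text, codec = decode_schema(valid)
print(codec)  # utf-8

text, codec = decode_schema(b"\xcf\xf0\xe8\xe2\xe5\xf2")
print(codec)  # latin-1
```

Returning the codec alongside the text is a design choice of this sketch: it keeps valid UTF-8 schemas untouched while still letting non-UTF-8 bytes survive a round trip.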