Draft distributed plan #1

mdianjun · 2022-02-07T15:54:24Z

Feature list

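ClickHouse is redesigned on a distributed architecture with three roles: meta-service, compute (the compute layer), and store (the storage layer).
- meta-service stores cluster metadata (adapted from clickhouse-keeper):
  - Access control information (/clickhouse/access): supported natively by ClickHouse; the configuration parameter is <user_directories>.
  - Data model information (/clickhouse/metadata): provided by the Replicated database engine.
  - Cluster node topology (/clickhouse/clusters): developed in-house.
  - When a ck-server instance is added during scale-out or is restarted, it loads metadata from meta-service. Loading databases is developed in-house; loading tables is provided by the Replicated database engine.
  - Session sharing across nodes (/clickhouse/sessions): (developed in-house) session data is stored on meta-service, keyed by session id. The compute node that initially receives a query creates the session and syncs it to meta-service; other nodes load the session data from meta-service when they process that query. The shared content includes the expiration time, settings, and so on. Within a single query, the initial node pushes the session to the other compute and store nodes. Across multiple queries in the same session, only the initial node loads the session from meta-service.
- The compute layer receives user queries, generates distributed execution plans, and carries most of the compute load.
- The store layer reads and writes data, carries a small part of the compute load (such as filtering after data is read), and processes data blocks in the background (merging, cleanup, and so on).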
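The session-sharing flow can be sketched as follows. This is a minimal illustration, not the PR's implementation: `MetaService`, `Session`, `create_session`, and `load_session` are hypothetical names, and meta-service is modeled as a plain in-memory map. Only the `/clickhouse/sessions` path and the shared fields (expiration time, settings) come from the description.

```python
import time
from dataclasses import dataclass, field

SESSIONS_ROOT = "/clickhouse/sessions"  # path named in the description


@dataclass
class Session:
    session_id: str
    expires_at: float                 # absolute expiration time (seconds)
    settings: dict = field(default_factory=dict)


class MetaService:
    """Hypothetical in-memory stand-in for meta-service (path -> value)."""

    def __init__(self):
        self._nodes = {}

    def put(self, path, value):
        self._nodes[path] = value

    def get(self, path):
        return self._nodes.get(path)


def create_session(meta, session_id, ttl_seconds, settings, now=None):
    """Called on the initial node: create the session and sync it to meta-service."""
    now = time.time() if now is None else now
    session = Session(session_id, now + ttl_seconds, dict(settings))
    meta.put(f"{SESSIONS_ROOT}/{session_id}", session)
    return session


def load_session(meta, session_id, now=None):
    """Load a session by id; return None if it is missing or expired."""
    now = time.time() if now is None else now
    session = meta.get(f"{SESSIONS_ROOT}/{session_id}")
    if session is None or session.expires_at <= now:
        return None
    return session
```

Keying the entry by session id means any compute node that later acts as the initial node for the same session can recover the settings without talking to the node that created it.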
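meta-service optimizations:
- API optimization: recursive watch. Every compute-layer node and store-layer node needs to watch.
- Storage optimization: move meta-service data storage from memory to disk, implemented with rocksdb (TODO: performance testing).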
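To illustrate what a recursive watch means here, the sketch below notifies a watcher about changes to the watched path and to any path beneath it. `WatchRegistry` and its methods are hypothetical names for illustration only; they are not the clickhouse-keeper API or the interface added in this PR.

```python
class WatchRegistry:
    """Hypothetical registry: a watch on a path also covers its descendants."""

    def __init__(self):
        self._watchers = []  # list of (prefix, callback) pairs

    def watch_recursive(self, prefix, callback):
        # Normalize so "/a/b/" and "/a/b" register the same prefix.
        self._watchers.append((prefix.rstrip("/"), callback))

    def notify(self, path):
        """Deliver a change on `path` to every watcher whose prefix covers it."""
        for prefix, callback in self._watchers:
            # Match the node itself or a true descendant; "/a/bc" is not under "/a/b".
            if path == prefix or path.startswith(prefix + "/"):
                callback(path)


events = []
registry = WatchRegistry()
registry.watch_recursive("/clickhouse/clusters", events.append)
registry.notify("/clickhouse/clusters/store/node1")  # delivered
registry.notify("/clickhouse/metadata/db1")          # ignored
```

With a single recursive watch per subtree, a node does not need to re-register a watch on every child it discovers, which matters when every compute and store node is watching.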
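Distributed DDL: a DDL statement executed on one compute node is automatically synchronized to all other nodes. CREATE/DROP/ALTER and similar operations are supported on database metadata, table metadata, and access control metadata.
- Table-level DDL: provided by the Replicated database engine.
- Database-level DDL: (developed in-house) uses the default database to derive other databases; the implementation is similar to the logic of "executing table DDL in a Replicated database".
Distributed admin queries: (developed in-house) statements such as "SYSTEM" and "OPTIMIZE" are changed to execute in a distributed manner; the implementation is similar to "DDL in a Replicated database".
Distributed DML: a DML statement executed on one compute node is automatically synchronized to all store nodes. UPDATE/DELETE/INSERT on table data are supported (centered on non-replicated MergeTree tables):
- UPDATE/DELETE: provided by the Replicated database engine.
- INSERT: (developed in-house) following the write path of Distributed tables, implements distributed writes for MergeTree tables. Data can be written to any store node in either "random shard" or "specified shard" mode.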
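The two INSERT placement modes can be sketched like this. The function name `choose_store_node`, its parameters, and the node names are assumptions made for illustration; the PR does not describe how a shard is specified.

```python
import random


def choose_store_node(store_nodes, shard=None, rng=random):
    """Pick the store node that receives an INSERT.

    shard=None  -> "random shard": any store node may be chosen.
    shard=<int> -> "specified shard": use the node at that index.
    """
    if not store_nodes:
        raise ValueError("no store nodes available")
    if shard is None:
        return rng.choice(store_nodes)
    if not 0 <= shard < len(store_nodes):
        raise IndexError(f"shard {shard} out of range for {len(store_nodes)} nodes")
    return store_nodes[shard]


nodes = ["store-1", "store-2", "store-3"]   # hypothetical node names
target = choose_store_node(nodes, shard=1)  # always "store-2"
anywhere = choose_store_node(nodes)         # one of the three nodes
```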
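Distributed DQL: when a compute node receives a SELECT query, it first generates the distributed execution plan, then sends the parameters needed to build the plan fragments to the other compute and store nodes. Those nodes build their plan fragment from the original statement and execute it. Query types and operators supported so far: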

select ... from
insert into ... select
create ... as select 
aggregate + group by
distinct
order by
limit
union
in
join: broadcast join
subquery: select+subquery, in+subquery, join+subquery
materialized view
with totals/rollup/cube
extremes
input function: insert into ... select ... from input()
cancel query

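External data sources: external data sources are accessed through specific table engines or table functions. Queries and writes against external data sources run on compute nodes; this will support the product's data import/export features. External data sources supported so far: kafka, mysql, hdfs, s3, input.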

Key points and difficulties in the implementation

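Module coupling:
- Coupling between compute logic and storage logic
- Coupling between the logical execution plan and the physical execution plan
The best solutions are not feasible for now, so compromises are used:
- Every node builds the execution plan from start to finish; a better approach would be to send plan fragments.
- Some features are implemented on top of the zookeeperclient interface, which is fairly complex; a better approach would be to develop them on the server side, encapsulating the complexity in the server and avoiding inconsistency in distributed scenarios.
Once the distributed execution plan framework was in place, it could not be refined further in a design-driven way; instead, gaps are patched case by case in a test-driven way.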
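How the distributed execution plan is split into stages:
- union: between it and its child nodes
- full join: between it and its two child nodes; both the left and right tables are gathered onto one node
- left join: between it and its right child; the right table is broadcast
- right join: between it and its left child; the left table is broadcast
- aggregate: between partial aggregate and final aggregate
- sort: between partial sort and final sort
- distinct: between partial distinct and final distinct
- limit: between partial limit and final limit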
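The stage-splitting rules above can be sketched over a toy plan tree. This is an illustration under stated assumptions, not the PR's code: `PlanNode`, the `kind` strings (`union`, `full_join`, `left_join`, `right_join`, `final_*`, `partial_*`), and both functions are hypothetical, and joins are assumed to have the left input as child 0 and the right input as child 1.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    kind: str
    children: list = field(default_factory=list)


def boundary_children(node):
    """Indices of children that sit on the far side of a stage boundary."""
    if node.kind in ("union", "full_join"):
        return list(range(len(node.children)))  # cut above every child
    if node.kind == "left_join":
        return [1]                              # right table is broadcast
    if node.kind == "right_join":
        return [0]                              # left table is broadcast
    if node.kind.startswith("final_"):
        # Cut between final_X and its matching partial_X child.
        partial = "partial_" + node.kind[len("final_"):]
        return [i for i, c in enumerate(node.children) if c.kind == partial]
    return []


def split_stages(root):
    """Return stages as lists of node kinds; the root's stage comes first."""
    stages = []

    def visit(node, stage):
        stage.append(node.kind)
        cut = boundary_children(node)
        for i, child in enumerate(node.children):
            if i in cut:
                new_stage = []
                stages.append(new_stage)
                visit(child, new_stage)
            else:
                visit(child, stage)

    first = []
    stages.append(first)
    visit(root, first)
    return stages


plan = PlanNode("final_aggregate", [
    PlanNode("partial_aggregate", [
        PlanNode("left_join", [PlanNode("scan_a"), PlanNode("scan_b")]),
    ]),
])
stages = split_stages(plan)
```

For this plan the result is three stages: the final aggregate, the partial aggregate together with the join and its left scan, and the right-side scan that is broadcast.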
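Subqueries: there are three forms: select from subquery, join subquery, and in subquery. The subquery's result set is computed first and then sent to all nodes that run the parent query.
Materialized views: the materialized views associated with a table are looked up from the table, and a pipeline is built to write each materialized view. That pipeline reads the same Block data that the insert carries.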

Pending optimizations for the distributed execution plan

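Serialize the physical execution plan.
Avoid serialization and deserialization overhead when converting between in-memory data and network data.
Refactor the code that generates the distributed execution plan and expose an extensible interface in the form of "transformation rules".
Some logical operators are added while the distributed execution plan is generated; move that code into the generation of the single-node execution plan.
Shuffle operator.
Use data distribution information to schedule the distributed execution plan at a finer granularity.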

Core code flowcharts

1. Distributed execution plan

2. Distributed DDL

Authors: madianjun <madianjun@qq.com> Chao Ma <machaojms@126.com> caspian <caspian.wang@bytejumps.com> Allen Zhang <allen.zhang@mail.bytejumps.com>


mdianjun and others added 30 commits December 6, 2021 17:09

Add distributed plan, distributed source operator, grpc communication.

6252748

Authors: madianjun <madianjun@qq.com> Chao Ma <machaojms@126.com> caspian <caspian.wang@bytejumps.com> Allen Zhang <allen.zhang@mail.bytejumps.com>

Optimize building stages and plan fragment.

0704d1d

Support distributed ALTER UPDATE/DELETE statement.

ea7d64b

Add insert distributed for replicate merge tree

20bb20e

1.Optimize the execution of plan fragment.

89468c4

2.Fix "insert into ... select" to distributed plan.

add cancel request support on new distributed plan

147cc1e

add kill on all support

9bb71d7

Add distributed sort and limit.

5e8773e

Change stage to use multiple parent stages.

576de4a

Fix quering system tables on single worker.

a3ef01e

Fix GRPCClient log and exception message

c34c5ce

Fix limit with offset

dfe2a48

Refactor code

f261717

Rollback distributed DDL

9b3087f

Support automatical execution on all servers of CREATE/DROP replicated database engine

7bdc8e7

Add distributed aggregate

9d4a896

Add distributed broadcast join

705c8d5

Add distributed union

20551ac

Add exception processing when default database is not Replicated engine

e35e317

Fix bug that doesn't set current database

decd825

Add distributed insert for table engine mergetree

c2e77fc

Add distributed materialized view

e2d10bb

Read data of external tables on initial node

05c69f2

Skip distributed data which insert into system database

989d1e9

Fix to initialize ClustersWatcher before all servers start

de9962b

Support input function in distributed mode

73290cb

Only optimize_trivial_count on store workers

ab7b706

Select one result from system.merge_tree_settings

fdb9c89

Improve CREATE/DROP database executed distributedly

13b449a

Set database_replicated_always_detach_permanently default to true

ef74935

mdianjun and others added 30 commits January 18, 2022 12:51

Fix the tow use cases of StorageValues, one of them doesn't have view source

f2009c8

Execute "optimize table" on all nodes

2afec23

Revert empty_result_for_aggregation_by_empty_set to false

81452fd

Improve subquery, including many bugs fixed.

709d153

Fix friend class

9aad861

Change private member functions of QueryPlan to public

5c01196

Allow ATTACH/DROP PARTITION; disallow FETCH PARTITION; disallow any ALTER PART

d897c07

Fix hang when alter table where with materialized column

76a464a

Fix create as select multi results and no table error

03e0c20

Add distributed SYSTEM DDL

60f09f7

Schedule intermediate stages on local compute node

7c7f46a

Fix group by const error

6589478

Build same original query plan when trivial count is optimized, and skip executeWhere

556228e

Clean grpc exception periodically thrown by producer; let consumer catch the root cause exception

75f22e6

Support DISTINCT

e2e5697

Add distributed read/write lock to control the access to database and table

9fe470f

Remove table system.processes from distributed query plan

c1f3350

Fix grpc client read buffer size limit from 4MB to unlimit

945ac87

Clear garbage node

e3dbde8

Fix reading some storage(e.g. numbers, memory) from local instead of remote

3127542

Eliminate unnecessary stages for sort and limit when shuffle has been created at aggregate

052d008

Add default database name for SYSTEM

1b88b4d

Fix ddl timeout when other ck should fail fast

8b0fc4c

Change parent stages order to sequential in order that iterating stages is from left branch to right branch

fdbae38

Set current query id if empty

5ad6f43

Fix RENAME query when using distributed lock

e0ee52c

Fix create or replace table to insert occur no table error

0315aef

Fix distributed subquery max_query_size limitation inconsistency

12c000d

Fix JOIN and WITH TOTALS

3d0871d

Change stage building in right join and full join

9af4f34
