
Commit cf4612e

Chore/issue 368 architecture (#372)
1 parent 6aad3f1 commit cf4612e

File tree

2 files changed (+67, -16 lines)


docs/architecture.md

Lines changed: 66 additions & 15 deletions
@@ -2,35 +2,86 @@

## High Level Architecture

-The following diagram depicts the high level architecture of Timeplus SQL engine, starting from a single node deployment.
+The following diagram depicts the high-level components of the Timeplus core engine.

![Architecture](/img/proton-high-level-arch.gif)

-All of the components / functionalities are built into one single binary.
+### The Flow of Data

-## Data Storage
+#### Ingest

-Users can create a stream by using `CREATE STREAM ...` [DDL SQL](/sql-create-stream). Every stream has 2 parts at storage layer by default:
+When data is ingested into Timeplus, it first lands in the NativeLog. As soon as the log commit completes, the data becomes immediately available for streaming queries.

-1. the real-time streaming data part, backed by Timeplus NativeLog
-2. the historical data part, backed by ClickHouse historical data store.
+In the background, dedicated threads continuously tail new entries from the NativeLog and flush them to the Historical Store in larger, optimized batches.
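A minimal sketch of this ingest path in Timeplus SQL, using a hypothetical `device_metrics` stream (names and types are illustrative):

```sql
-- A stream is backed by the NativeLog (streaming part) and the Historical Store.
CREATE STREAM device_metrics (
    device_id string,
    temperature float64
);

-- Each INSERT is first committed to the NativeLog.
INSERT INTO device_metrics (device_id, temperature) VALUES ('sensor-1', 21.5);

-- An unbounded streaming query sees the row as soon as the log commit completes;
-- background threads later flush the same data to the Historical Store.
SELECT device_id, temperature FROM device_metrics;
```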

-Fundamentally, a stream in Proton is a regular database table with a replicated Write-Ahead-Log (WAL) in front but is streaming queryable.
+#### Query

-## Data Ingestion
+Timeplus supports three query modes: **historical**, **streaming**, and **hybrid (streaming + historical)**; a sketch of each appears after this list.

-When users `INSERT INTO ...` data to Proton, the data always first lands in NativeLog which is immediately queryable. Since NativeLog is in essence a replicated Write-Ahead-Log (WAL) and is append-only, it can support high frequent, low latency and large concurrent data ingestion work loads.
+- **Historical Query (a.k.a. Table Query)**

-In background, there is a separate thread tailing the delta data from NativeLog and commits the data in bigger batch to the historical data store. Since Proton leverages ClickHouse for the historical part, its historical query processing is blazing fast as well.
+  Works like a traditional database query. Data is fetched directly from the **Historical Store**, and standard database optimizations such as the following apply, accelerating large-scale scans and point lookups:
+  - Primary index
+  - Skipping index
+  - Secondary index
+  - Bloom filter
+  - Partition pruning

-## External Stream
+- **Streaming Query**

-In quite lots of scenarios, data is already in Kafka / Redpanda or other streaming data hubs, users can create [external streams](/external-stream) to point to the streaming data hub and do streaming query processing directly and then either materialize them in Proton or send the query results to external systems.
+  Operates on the **NativeLog**, which stores records in sequence. Queries run incrementally, enabling real-time processing patterns such as **incremental ETL**, **joins**, and **aggregations**.

+- **Hybrid Query**

+  Streaming queries can automatically **backfill** from the Historical Store when:
+  1. Data no longer exists in the NativeLog (due to retention policies).
+  2. Pulling from the Historical Store is faster than rewinding the NativeLog to replay old events.

-## Learn More
+  This allows seamless handling of scenarios such as **fast backfill** and **mixed real-time + historical analysis** without breaking query continuity, and without needing yet another external batch system to load historical data, which usually introduces extra latency, inconsistency, and cost.
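As a rough illustration of the three modes, again assuming the hypothetical `device_metrics` stream (whether backfill is triggered depends on retention settings and engine version):

```sql
-- Historical (table) query: scans the Historical Store only and then terminates.
SELECT count() FROM table(device_metrics);

-- Streaming query: tails the NativeLog and emits incremental aggregation updates.
SELECT device_id, avg(temperature)
FROM device_metrics
GROUP BY device_id;

-- Hybrid query: a time filter reaching into the past lets the engine backfill
-- older events from the Historical Store before continuing with live data.
SELECT device_id, count()
FROM device_metrics
WHERE _tp_time > now() - 1h
GROUP BY device_id;
```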

-Interested users can refer [How Timeplus Unifies Streaming and Historical Data Processing](https://www.timeplus.com/post/unify-streaming-and-historical-data-processing) blog for more details regarding its academic foundation and latest industry developments. You can also check the video below from [Kris Jenkins's Developer Voices podcast](https://www.youtube.com/watch?v=TBcWABm8Cro). Jove shared our key decision choices, how Timeplus manages data and state, and how Timeplus achieves high performance with single binary.
+### The Dual Storage

-<iframe width="560" height="315" src="https://www.youtube.com/embed/QZ0le2WiJiY?si=eF45uwlXvFBpMR14" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
+#### NativeLog
+
+The **Timeplus NativeLog** is the system’s write-ahead log (WAL) or journal: an append-only, high-throughput store optimized for low-latency, highly concurrent data ingestion. In a cluster deployment, it is replicated using **Multi-Raft** for fault tolerance. By enforcing a strict ordering of records, NativeLog forms the backbone of stream processing in **Timeplus Core**.
+
+NativeLog uses its own record format, consisting of two high-level types:
+
+- **Control records** (a.k.a. meta records) – store metadata and operational information.
+- **Data records** – columnar-encoded for fast serialization/deserialization and efficient vectorized streaming execution.
+
+Each record is assigned a monotonically increasing sequence number—similar to a Kafka offset—which guarantees ordering.
+
+Lightweight indexes are maintained to support rapid rewind and replay operations by **timestamp** or **sequence number** in streaming queries.
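For example, a streaming query can be rewound to an earlier position before tailing new records; the sketch below uses the `_tp_time` event-time column on the hypothetical `device_metrics` stream (sequence-number based replay relies on the same log indexes, with syntax depending on the engine version):

```sql
-- Replay by timestamp: start from events of the last 15 minutes,
-- then keep tailing the NativeLog as new records are appended.
SELECT device_id, temperature
FROM device_metrics
WHERE _tp_time >= now() - 15m;
```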
+
+#### Historical Store
+
+The **Historical Store** in Timeplus stores data **derived** from the **NativeLog**. It powers use cases such as:
+
+- **Historical queries** (a.k.a. *table queries* in Timeplus)
+- **Fast backfill** into streaming queries
+- Acting as a **serving layer** for downstream applications
+
+Timeplus supports two storage encodings for the Historical Store: **columnar** and **row**.
+
+##### 1. Columnar Encoding (*Append Stream*)
+Optimized for **append-mostly workloads** with minimal data mutation, such as telemetry or event logs. Benefits include:
+
+- High data compression ratios
+- Blazing-fast scans for analytical workloads
+- Backed by the **ClickHouse MergeTree** engine
+
+This format is ideal when the dataset is largely immutable and query speed over large volumes is a priority.
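A plain `CREATE STREAM` yields an append stream whose historical part uses the columnar encoding; the example below is illustrative (stream and column names are hypothetical):

```sql
-- Append stream: columnar Historical Store, well suited to immutable event data.
CREATE STREAM web_events (
    user_id string,
    url string,
    status_code uint16
);

-- Analytical scan over the accumulated history via a table query.
SELECT status_code, count()
FROM table(web_events)
GROUP BY status_code;
```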
+
+##### 2. Row Encoding (*Mutable Stream*)
+Designed for **frequently updated datasets** where `UPSERT` and `DELETE` operations are common. Features include:
+
+- Per-row **primary indexes**
+- **Secondary indexes** for flexible lookups
+- Faster and more efficient **point queries** compared to columnar storage
+
+Row encoding is the better choice when low-latency, high-frequency updates are required.
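A hedged sketch of a mutable stream, assuming the `CREATE MUTABLE STREAM ... PRIMARY KEY` DDL available in Timeplus Enterprise (check the SQL reference for the exact syntax in your version; names are hypothetical):

```sql
-- Mutable stream: rows that share a primary key are upserted in place.
CREATE MUTABLE STREAM device_state (
    device_id string,
    temperature float64,
    updated_at datetime64(3)
)
PRIMARY KEY (device_id);

-- A later insert with the same key overwrites the previous value;
-- point lookups by key are served by the per-row primary index.
INSERT INTO device_state (device_id, temperature, updated_at)
VALUES ('sensor-1', 22.8, now64(3));

SELECT * FROM table(device_state) WHERE device_id = 'sensor-1';
```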
+
+## References
+
+[How Timeplus Unifies Streaming and Historical Data Processing](https://www.timeplus.com/post/unify-streaming-and-historical-data-processing)

docs/why-timeplus.md

Lines changed: 1 addition & 1 deletion
@@ -75,4 +75,4 @@ SQL-based rules can be used to trigger or resolve alerts in systems such as Page

### Scalability and Elasticity

-Timeplus supports both MPP architecture for pure on-prem deployments—ideal when ultra-low latency is critical and storage/compute separation for elastic, cloud-native setups. In the latter mode, S3 is used for the NativeLog, Historical Store, and Query State Checkpoints. Combined with Kubernetes HPA or AWS Auto Scaling Groups, this enables highly concurrent continuous queries on clusters that scale automatically with demand.
+Timeplus supports three deployment models: **MPP (shared-nothing)** for on-premises setups where ultra-low latency is critical, **storage/compute separation** for elastic, cloud-native environments that use S3 (or similar object storage) to hold the NativeLog, Historical Store, and Query State Checkpoints with zero replication overhead, and a **hybrid mode** that combines both approaches. In storage/compute separation deployments, clusters integrate seamlessly with Kubernetes HPA or AWS Auto Scaling Groups, enabling highly concurrent continuous queries while scaling automatically with demand. Please refer to [Timeplus Architecture](/architecture) for more details.
