Commit 987e177

Storage: Add information about segment merges and table refreshes
1 parent 45663cb commit 987e177

1 file changed: +27 -0 lines changed

docs/feature/storage/index.md

Lines changed: 27 additions & 0 deletions
@@ -82,6 +82,10 @@ a column store for fast sorting and aggregations.

 How CrateDB stores data using Lucene.

+TL;DR: CrateDB never needs explicit VACUUMs, manual compactions, or
+reindexing. The system maintains itself dynamically, which is a key advantage
+for always-on analytics environments where data never stops flowing in.
+
 :Append-only segments:

 A Lucene index is composed of one or more sub-indexes. A sub-index is called a segment,
@@ -93,11 +97,34 @@ How CrateDB stores data using Lucene.
 some segments and discard the freed ones. This way, adding a new document does not require
 rebuilding the whole index structure completely.

+:Segment merges:
+
+When new data is inserted into CrateDB, it is written into small, immutable
+segments on disk. Over time, these segments are merged into larger ones by
+background tasks, balancing I/O load with query performance.
+
+This process, known as segment merging, provides three key benefits:
+- Space compaction: Merging removes deleted or superseded records, freeing disk
+space automatically.
+- Faster queries: Larger segments reduce index overhead and improve cache efficiency.
+- No downtime: Merging occurs transparently, allowing continuous ingestion and querying.
+
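
The space-compaction benefit listed above can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not Lucene's or CrateDB's implementation: the `Segment` class, its `deleted` set, and the `merge` function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Toy stand-in for an immutable segment (hypothetical, not a Lucene type)."""
    docs: dict                        # doc id -> stored value
    deleted: frozenset = frozenset()  # ids marked as deleted or superseded

    def live_docs(self):
        """Return only the documents that are not marked as deleted."""
        return {k: v for k, v in self.docs.items() if k not in self.deleted}


def merge(segments):
    """Combine segments (oldest first) into one new segment with live docs only."""
    merged = {}
    for seg in segments:
        merged.update(seg.live_docs())
    return Segment(docs=merged)


# Doc 2 was updated later, so the older copy is marked as superseded.
older = Segment(docs={1: "a", 2: "b"}, deleted=frozenset({2}))
# Doc 3 was deleted after being written.
newer = Segment(docs={2: "b2", 3: "c"}, deleted=frozenset({3}))

compacted = merge([older, newer])
print(compacted.docs)  # {1: 'a', 2: 'b2'}
```

The two input segments store four documents between them, while the merged segment stores two. The input segments are never modified, which mirrors the append-only design described earlier.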
 CrateDB uses Lucene's default TieredMergePolicy. It merges segments of roughly equal size
 and controls the number of segments per "tier" to balance search performance with merge
 overhead. Lucene's [TieredMergePolicy] documentation explains in detail how CrateDB's
 underlying merge policy decides when to combine segments.
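
The idea of merging similarly sized segments while bounding how many accumulate per tier can be sketched with a toy model. This is not Lucene's TieredMergePolicy algorithm, which is considerably more involved. The names `tier_of`, `merge_once`, `segments_per_tier`, and `factor` are hypothetical and chosen only for this illustration, and segment sizes are plain integers.

```python
def tier_of(size, factor=10):
    """Return the tier of a segment: sizes within the same power of `factor` share a tier."""
    tier = 0
    while size >= factor:
        size //= factor
        tier += 1
    return tier


def merge_once(sizes, segments_per_tier=3, factor=10):
    """Perform at most one merge and return the resulting segment sizes, sorted.

    If a tier holds more than `segments_per_tier` segments, all segments of the
    lowest such tier are replaced by one segment with their combined size.
    """
    tiers = {}
    for size in sizes:
        tiers.setdefault(tier_of(size, factor), []).append(size)

    for tier in sorted(tiers):
        group = tiers[tier]
        if len(group) > segments_per_tier:
            untouched = [s for t, g in tiers.items() if t != tier for s in g]
            return sorted(untouched + [sum(group)])
    return sorted(sizes)


sizes = [1, 2, 3, 4, 50, 60]
after = merge_once(sizes)
print(after)  # [10, 50, 60]
```

In this toy run, the four small segments exceed the per-tier budget and are combined, while the two larger segments are left alone because their tier is still within budget.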

+:Table refreshes:
+
+CrateDB's refresh mechanism controls how often newly ingested data becomes visible
+for querying. Instead of making every write visible immediately, which would degrade
+throughput, CrateDB batches writes in memory and periodically refreshes data
+segments, once per second by default.
+
+This approach strikes a balance between low-latency visibility and high ingestion
+performance, allowing users to query the most recent data almost instantly while
+maintaining efficient bulk ingestion without overwhelming the storage layer
+or exhausting other cluster resources.
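
The batching behavior described in this section can be sketched as a small Python model. It is a simplified illustration, not CrateDB's implementation: `ToyTable` and its methods are hypothetical, and the clock is passed in explicitly as `now` (in seconds) to stand in for a background timer and keep the example deterministic.

```python
class ToyTable:
    """Toy model of refresh-based visibility (hypothetical, not a CrateDB API)."""

    def __init__(self, refresh_interval=1.0):
        self.refresh_interval = refresh_interval  # seconds between refreshes
        self._buffer = []       # rows written but not yet visible to queries
        self._visible = []      # rows visible to queries
        self._last_refresh = 0.0

    def _maybe_refresh(self, now):
        """Refresh if the interval has elapsed; stands in for a background timer."""
        if now - self._last_refresh >= self.refresh_interval:
            self.refresh(now)

    def refresh(self, now):
        """Make all buffered rows visible in one batch."""
        self._visible.extend(self._buffer)
        self._buffer.clear()
        self._last_refresh = now

    def insert(self, row, now):
        """Buffer a row; it becomes visible only after the next refresh."""
        self._buffer.append(row)
        self._maybe_refresh(now)

    def query(self, now):
        """Return a copy of the rows that are currently visible."""
        self._maybe_refresh(now)
        return list(self._visible)


table = ToyTable(refresh_interval=1.0)
table.insert("a", now=0.1)
table.insert("b", now=0.5)
before = table.query(now=0.6)  # interval not yet elapsed
after = table.query(now=1.2)   # refresh is due, buffered rows become visible
print(before, after)  # [] ['a', 'b']
```

Both rows become visible in a single step once the interval elapses, which is the trade-off the text describes: a short visibility delay in exchange for cheaper writes. CrateDB's reference documentation describes a `refresh_interval` table setting and a `REFRESH TABLE` statement for controlling this behavior; the sketch models neither of them directly.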

 ::::{todo}
 Enable after merging [GH-434: Indexing and storage](https://github.com/crate/cratedb-guide/pull/434).
