@@ -82,6 +82,10 @@ a column store for fast sorting and aggregations.
8282
8383How CrateDB stores data using Lucene.
8484
85+ TL;DR: CrateDB never needs explicit VACUUMs, manual compactions, or
86+ reindexing. The system maintains itself dynamically, which is a key advantage
87+ for always-on analytics environments where data never stops flowing in.
88+
8589: Append-only segments:
8690
8791 A Lucene index is composed of one or more sub-indexes. A sub-index is called a segment,
@@ -93,11 +97,34 @@ How CrateDB stores data using Lucene.
9397 some segments and discard the freed ones. This way, adding a new document does not require
9498 rebuilding the whole index structure completely.
9599
100+ : Segment merges:
101+
102+ When new data is inserted into CrateDB, it is written into small, immutable
103+ segments on disk. Over time, these segments are merged into larger ones by
104+ background tasks, balancing I/O load with query performance.
105+
106+ This process, known as segment merging, has three main benefits:
107+ - Space compaction: Merging removes deleted or superseded records, freeing disk
108+ space automatically.
109+ - Faster queries: Larger segments reduce index overhead and improve cache efficiency.
110+ - No downtime: Merging occurs transparently, allowing continuous ingestion and querying.
111+
96112 CrateDB uses Lucene's default TieredMergePolicy. It merges segments of roughly equal size
97113 and controls the number of segments per "tier" to balance search performance with merge
98114 overhead. Lucene's [TieredMergePolicy] documentation explains in detail how CrateDB's
99115 underlying merge policy decides when to combine segments.
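
The interplay of append-only segments, deletes, and merges described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: `ToyIndex`, `Segment`, and the `segments_per_tier` knob are hypothetical names, and the merge rule (combine the smallest segments once there are too many) is a deliberate simplification, not Lucene's actual TieredMergePolicy algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A batch of documents whose contents never change after creation.

    Deletes are tracked separately as tombstones (doc ids), so the stored
    documents themselves are never rewritten in place.
    """
    docs: dict                                  # doc_id -> value
    deleted: set = field(default_factory=set)   # tombstoned doc ids

    def live_docs(self):
        return {k: v for k, v in self.docs.items() if k not in self.deleted}


class ToyIndex:
    """Hypothetical, heavily simplified stand-in for a Lucene index."""

    def __init__(self, segments_per_tier=3):
        self.segments = []
        self.segments_per_tier = segments_per_tier  # illustrative knob only

    def flush(self, docs):
        # New data always goes into a new segment. Older versions of the
        # same ids are only marked as deleted, not physically removed.
        for seg in self.segments:
            seg.deleted.update(docs.keys() & seg.docs.keys())
        self.segments.append(Segment(dict(docs)))

    def delete(self, doc_id):
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def maybe_merge(self):
        """Merge the smallest segments if there are more than the limit."""
        if len(self.segments) <= self.segments_per_tier:
            return False
        by_size = sorted(self.segments, key=lambda s: len(s.docs))
        victims = by_size[:self.segments_per_tier]
        rest = by_size[self.segments_per_tier:]
        merged = {}
        for seg in victims:
            merged.update(seg.live_docs())  # tombstoned docs are dropped here
        self.segments = rest + [Segment(merged)]
        return True

    def search(self):
        """Return every live document across all segments."""
        result = {}
        for seg in self.segments:
            result.update(seg.live_docs())
        return result

    def stored_docs(self):
        """Count physically stored documents, including tombstoned ones."""
        return sum(len(seg.docs) for seg in self.segments)
```

In this model, a merge reduces the number of segments and the number of physically stored documents, while `search()` returns the same result before and after. That mirrors the space compaction and transparency properties listed above.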
100116
117+ : Table refreshes:
118+
119+ CrateDB's refresh mechanism controls how often newly ingested data becomes visible
120+ for querying. Instead of committing every write immediately, which would degrade
121+ throughput, CrateDB batches writes in memory and periodically refreshes data
122+ segments, once per second by default.
123+
124+ This approach strikes a balance between low-latency visibility and high ingestion
125+ performance, allowing users to query the most recent data almost instantly while
126+ maintaining efficient bulk ingestion without overwhelming the storage layer
127+ or exhausting other cluster resources.
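
The trade-off between visibility and throughput can be sketched with a toy model that buffers writes and only exposes them to readers after a refresh. `ToyTable` and its methods are hypothetical and exist purely for illustration. The clock is passed in explicitly so the example runs without sleeping. In CrateDB itself, the corresponding controls are the `refresh_interval` table setting and the `REFRESH TABLE` statement.

```python
class ToyTable:
    """Hypothetical model of refresh-based visibility (not CrateDB code)."""

    def __init__(self, refresh_interval=1.0):
        self.refresh_interval = refresh_interval  # seconds between refreshes
        self._buffer = []       # written, but not yet visible to queries
        self._searchable = []   # visible to queries
        self._last_refresh = 0.0

    def insert(self, row, now):
        # Writes are cheap: they only append to the in-memory buffer.
        self._buffer.append(row)
        self._maybe_refresh(now)

    def refresh(self, now):
        # Explicit refresh: make all buffered rows visible immediately.
        self._searchable.extend(self._buffer)
        self._buffer.clear()
        self._last_refresh = now

    def _maybe_refresh(self, now):
        # Stand-in for the periodic background refresh.
        if now - self._last_refresh >= self.refresh_interval:
            self.refresh(now)

    def select_all(self, now):
        self._maybe_refresh(now)
        return list(self._searchable)
```

With a one-second interval, rows inserted at `now=0.1` and `now=0.2` are not returned by `select_all(now=0.5)`, but they are returned once the simulated clock reaches `1.0`. Calling `refresh()` makes buffered rows visible right away, at the cost of doing the refresh work more often.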
101128
102129::::{todo}
103130Enable after merging [GH-434: Indexing and storage](https://github.com/crate/cratedb-guide/pull/434).