 The CrateDB storage layer is based on Lucene.
 :::
 
+ Lucene offers scalable and high-performance indexing, which enables efficient search and
+ aggregations over documents and rapid updates to existing documents. Solr and
+ Elasticsearch build upon the same technologies.
 This page enumerates important concepts and implementations of Lucene used by CrateDB.
 
- ## Lucene
+ ## Data structures
 
- Lucene offers scalable and high-performance indexing which enables efficient search and
- aggregations over documents and rapid updates to the existing documents. Solr and
- Elasticsearch are building upon the same technologies.
+ A single record in Lucene is called a "document".
 
- - **Documents**
+ :Document:
 
- A single record in Lucene is called "document", which is a unit of information for search
+ A document is a unit of information for search
 and indexing that contains a set of fields, where each field has a name and value. A Lucene
 index can store an arbitrary number of documents, with an arbitrary number of different fields.
+ By default, all fields are indexed, nested or not, but the indexing can be turned
+ off selectively.
 
- - **Append-only segments**
-
- A Lucene index is composed of one or more sub-indexes. A sub-index is called a segment,
- it is immutable, and built from a set of documents. When new documents are added to the
- existing index, they are added to the next segment, while previous segments are never
- modified. If the number of segments becomes too large, the system may decide to merge
- some segments and discard the freed ones. This way, adding a new document does not require
- rebuilding the whole index structure completely.
-
- CrateDB uses Lucene's default TieredMergePolicy. It merges segments of roughly equal size
- and controls the number of segments per "tier" to balance search performance with merge
- overhead. Lucene's [TieredMergePolicy] documentation explains in detail how CrateDB's
- underlying merge policy decides when to combine segments.
-
- - **Column store**
-
- For text values, other than storing the row data as-is (and indexing each value by default),
- each value term is stored into a [column-based store] by default, which offers performance
- improvements for global aggregations and groupings, and enables efficient ordering, because
- the data for one column is packed at one place.
-
- In CrateDB, the column store is enabled by default and can be disabled only for text fields,
- not for other primitive types. Furthermore, CrateDB does not support storing values for
- container and geospatial types in the column store.
-
- ## Data structures
-
- CrateDB uses three main data structures of Lucene:
- Inverted indexes for text values, BKD trees for numeric values, and DocValues.
-
- By default, all fields are indexed, nested or not, but the indexing can be turned
- off selectively.
+ CrateDB uses three main data structures of Lucene: inverted indexes for text values,
+ BKD trees for numeric values, and doc values. On top of doc values, CrateDB implements
+ a column store for fast sorting and aggregations.
 
- - **Inverted index**
+ :Inverted index:
 
 The Lucene indexing strategy for text fields relies on a data structure called inverted
 index, which is defined as a "data structure storing a mapping from content, such as
@@ -69,7 +43,7 @@ off selectively.
 
 The inverted index enables a very efficient search over textual data.
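As an illustration of the idea only, the following minimal Python sketch builds a mapping from terms to the IDs of the documents that contain them and answers a multi-term query by intersecting those ID sets. The function names and sample documents are hypothetical, tokenization is a plain whitespace split, and none of this reflects Lucene's actual on-disk format or analyzers.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # .get() avoids creating empty entries for unknown terms.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "CrateDB stores data in Lucene",
    2: "Lucene uses an inverted index",
    3: "An inverted index maps terms to documents",
}
index = build_inverted_index(docs)
print(sorted(search(index, "inverted index")))
print(sorted(search(index, "lucene")))
```

A lookup only touches the entries for the query terms, so documents that do not contain them are never scanned.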
 
- - **BKD tree**
+ :BKD tree:
 
 To optimize numeric range queries, Lucene uses an implementation of the Block KD (BKD)
 tree data structure. The BKD tree index structure is suitable for indexing large
@@ -82,7 +56,7 @@ off selectively.
 including fields defined as `TIMESTAMP` types, supporting performant date range
 queries.
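The block-based approach to range queries can be sketched in one dimension as follows. This is a deliberate simplification, not Lucene's BKD implementation: it assumes a flat list of sorted leaf blocks in place of the tree, and the function names, block size, and sample values are hypothetical. It shows only how per-block minimum and maximum values let a range query skip or accept whole blocks.

```python
def build_blocks(points, block_size=4):
    """Sort (value, doc_id) pairs and pack them into fixed-size leaf blocks."""
    ordered = sorted(points)
    blocks = []
    for start in range(0, len(ordered), block_size):
        chunk = ordered[start:start + block_size]
        blocks.append({"min": chunk[0][0], "max": chunk[-1][0], "points": chunk})
    return blocks


def range_query(blocks, low, high):
    """Return sorted doc IDs whose value lies in the closed range [low, high]."""
    matches = []
    for block in blocks:
        if block["max"] < low or block["min"] > high:
            continue  # the whole block lies outside the range
        if low <= block["min"] and block["max"] <= high:
            # The whole block lies inside the range: no per-value check needed.
            matches.extend(doc_id for _, doc_id in block["points"])
        else:
            # The block straddles a range boundary: filter value by value.
            matches.extend(d for v, d in block["points"] if low <= v <= high)
    return sorted(matches)


# Hypothetical (value, doc_id) pairs, for example integer timestamps.
points = [(15, 1), (3, 2), (27, 3), (8, 4), (42, 5), (19, 6), (31, 7), (5, 8), (23, 9)]
blocks = build_blocks(points)
print(range_query(blocks, 10, 30))
```

A real BKD tree additionally organizes the blocks hierarchically and supports multiple dimensions, so it does not need to visit every block summary.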
 
- - **DocValues**
+ :Doc values:
 
 Because Lucene's inverted index data structure implementation is not optimal for
 finding field values by given document identifier, and for performing column-oriented
@@ -92,12 +66,45 @@ off selectively.
 all field values that are not analyzed as strings in a compact column, making it more
 effective for sorting and aggregations.
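To make the column-oriented layout concrete, here is a small Python sketch that turns row-oriented documents into one value list per field, indexed by document ID. The data and the `build_columns` helper are hypothetical, and the sketch assumes every document has every field. It mirrors the concept only, not the encoding Lucene uses for doc values.

```python
# Hypothetical documents; the list position plays the role of the document ID.
documents = [
    {"name": "alpha", "price": 30},
    {"name": "bravo", "price": 10},
    {"name": "charlie", "price": 20},
]


def build_columns(docs):
    """Turn row-oriented documents into one value list per field."""
    columns = {}
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            columns.setdefault(field, [None] * len(docs))[doc_id] = value
    return columns


columns = build_columns(documents)
prices = columns["price"]

# Looking up a value by document ID is a direct array access.
price_of_doc_1 = prices[1]
# An aggregation reads a single column instead of every document.
total = sum(prices)
# Sorting document IDs by a field only touches that field's column.
ids_by_price = sorted(range(len(prices)), key=prices.__getitem__)
print(price_of_doc_1, total, ids_by_price)
```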
 
+ :Column store:
+
+ CrateDB implements a {ref}`column store <crate-reference:ddl-storage-columnstore>`
+ based on doc values in Lucene.
+ For text values, in addition to storing the row data as-is (and indexing each value by default),
+ each value term is also stored in a column-based store by default.
+
+ This storage layout improves the performance of sorting, grouping, and aggregations
+ by keeping field data for one column packed in one place rather than scattered across documents.
+ The column store is enabled by default in CrateDB and can be disabled only for text fields.
+ It does not support container or geographic data types.
+
+ ## Storage process
+
+ This section describes how CrateDB stores data using Lucene.
+
+ :Append-only segments:
+
+ A Lucene index is composed of one or more sub-indexes. Each sub-index, called a segment,
+ is immutable and built from a set of documents.
+
+ When new documents are added to the
+ existing index, they are added to the next segment, while previous segments are never
+ modified. If the number of segments becomes too large, the system may decide to merge
+ some segments and discard the freed ones. This way, adding a new document does not require
+ rebuilding the whole index structure.
+
+ CrateDB uses Lucene's default TieredMergePolicy. It merges segments of roughly equal size
+ and controls the number of segments per "tier" to balance search performance with merge
+ overhead. Lucene's [TieredMergePolicy] documentation explains in detail how CrateDB's
+ underlying merge policy decides when to combine segments.
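The append-only behaviour can be sketched with a toy Python class. Everything here is hypothetical: the class name, the `max_segments` threshold, and especially the merge rule, which merges the two smallest segments as a stand-in chosen for brevity. It is not the TieredMergePolicy algorithm, and it ignores deletes and updates. The point it illustrates is that new documents always go into a fresh segment and that merging writes a new segment instead of modifying existing ones.

```python
class SegmentedIndex:
    """Toy append-only index: segments are immutable tuples of documents."""

    def __init__(self, max_segments=3):
        self.max_segments = max_segments  # hypothetical merge threshold
        self.segments = []

    def add_documents(self, docs):
        """Write new documents as a fresh segment; existing segments stay untouched."""
        self.segments.append(tuple(docs))
        if len(self.segments) > self.max_segments:
            self._merge_smallest()

    def _merge_smallest(self):
        """Merge the two smallest segments into a new one and discard the originals."""
        self.segments.sort(key=len)
        first, second, *rest = self.segments
        self.segments = rest + [first + second]

    def all_documents(self):
        """Return every document across all segments."""
        return [doc for segment in self.segments for doc in segment]


index = SegmentedIndex(max_segments=3)
index.add_documents(["a", "b", "c", "d"])
index.add_documents(["e"])
index.add_documents(["f", "g"])
index.add_documents(["h"])  # a fourth segment exceeds the threshold and triggers a merge
print([len(segment) for segment in index.segments])
```

Because tuples are immutable, a merge can only produce a new segment, which matches the rule that previously written segments are never modified.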
+
+
 ::::{todo}
 Enable after merging [GH-434: Indexing and storage](https://github.com/crate/cratedb-guide/pull/434).
 ```md
 ## Related sections
 
- {ref}`indexing-and-storage` explores the internal workings and data structures
+ {ref}`indexing-and-storage` illustrates the internal workings and data structures
 of CrateDB's storage layer in more detail.
 
 :::{toctree}
@@ -108,5 +115,4 @@ indexing-and-storage
 ::::
 
 
- [column-based store]: https://cratedb.com/docs/crate/reference/en/latest/general/ddl/storage.html
 [TieredMergePolicy]: https://lucene.apache.org/core/9_12_1/core/org/apache/lucene/index/TieredMergePolicy.html