Commit 45663cb

Storage: Reorganize sections
1 parent f7410d0 commit 45663cb

1 file changed: +49 -43 lines changed

docs/feature/storage/index.md

Lines changed: 49 additions & 43 deletions
@@ -8,54 +8,28 @@
 The CrateDB storage layer is based on Lucene.
 :::

+Lucene offers scalable and high-performance indexing, which enables efficient search and
+aggregations over documents and rapid updates to the existing documents. Solr and
+Elasticsearch are building upon the same technologies.
 This page enumerates important concepts and implementations of Lucene used by CrateDB.

-## Lucene
+## Data structures

-Lucene offers scalable and high-performance indexing which enables efficient search and
-aggregations over documents and rapid updates to the existing documents. Solr and
-Elasticsearch are building upon the same technologies.
+A single record in Lucene is called "document".

-- **Documents**
+:Document:

-A single record in Lucene is called "document", which is a unit of information for search
+A document is a unit of information for search
 and indexing that contains a set of fields, where each field has a name and value. A Lucene
 index can store an arbitrary number of documents, with an arbitrary number of different fields.
+By default, all fields are indexed, nested or not, but the indexing can be turned
+off selectively.

-- **Append-only segments**
-
-A Lucene index is composed of one or more sub-indexes. A sub-index is called a segment,
-it is immutable, and built from a set of documents. When new documents are added to the
-existing index, they are added to the next segment, while previous segments are never
-modified. If the number of segments becomes too large, the system may decide to merge
-some segments and discard the freed ones. This way, adding a new document does not require
-rebuilding the whole index structure completely.
-
-CrateDB uses Lucene's default TieredMergePolicy. It merges segments of roughly equal size
-and controls the number of segments per "tier" to balance search performance with merge
-overhead. Lucene's [TieredMergePolicy] documentation explains in detail how CrateDB's
-underlying merge policy decides when to combine segments.
-
-- **Column store**
-
-For text values, other than storing the row data as-is (and indexing each value by default),
-each value term is stored into a [column-based store] by default, which offers performance
-improvements for global aggregations and groupings, and enables efficient ordering, because
-the data for one column is packed at one place.
-
-In CrateDB, the column store is enabled by default and can be disabled only for text fields,
-not for other primitive types. Furthermore, CrateDB does not support storing values for
-container and geospatial types in the column store.
-
-## Data structures
-
-CrateDB uses three main data structures of Lucene:
-Inverted indexes for text values, BKD trees for numeric values, and DocValues.
-
-By default, all fields are indexed, nested or not, but the indexing can be turned
-off selectively.
+CrateDB uses three main data structures of Lucene: Inverted indexes for text values,
+BKD trees for numeric values, and doc values. On top of doc values, CrateDB implements
+a column store for fast sorting and aggregations.

-- **Inverted index**
+:Inverted index:

 The Lucene indexing strategy for text fields relies on a data structure called inverted
 index, which is defined as a "data structure storing a mapping from content, such as
@@ -69,7 +43,7 @@ off selectively.

 The inverted index enables a very efficient search over textual data.

-- **BKD tree**
+:BKD tree:

 To optimize numeric range queries, Lucene uses an implementation of the Block KD (BKD)
 tree data structure. The BKD tree index structure is suitable for indexing large
@@ -82,7 +56,7 @@ off selectively.
 including fields defined as `TIMESTAMP` types, supporting performant date range
 queries.

-- **DocValues**
+:Doc values:

 Because Lucene's inverted index data structure implementation is not optimal for
 finding field values by given document identifier, and for performing column-oriented
@@ -92,12 +66,45 @@ off selectively.
 all field values that are not analyzed as strings in a compact column, making it more
 effective for sorting and aggregations.

+:Column store:
+
+CrateDB implements a {ref}`column store <crate-reference:ddl-storage-columnstore>`
+based on doc values in Lucene.
+For text values, other than storing the row data as-is (and indexing each value by default),
+each value term is stored into a column-based store by default.
+
+This storage layout improves the performance of sorting, grouping, and aggregations,
+by keeping field data for one column packed at one place rather than scattered across documents.
+The column store is enabled by default in CrateDB and can be disabled only for text fields.
+It does not support container or geographic data types.
+
+## Storage process
+
+How CrateDB stores data using Lucene.
+
+:Append-only segments:
+
+A Lucene index is composed of one or more sub-indexes. A sub-index is called a segment,
+it is immutable, and built from a set of documents.
+
+When new documents are added to the
+existing index, they are added to the next segment, while previous segments are never
+modified. If the number of segments becomes too large, the system may decide to merge
+some segments and discard the freed ones. This way, adding a new document does not require
+rebuilding the whole index structure completely.
+
+CrateDB uses Lucene's default TieredMergePolicy. It merges segments of roughly equal size
+and controls the number of segments per "tier" to balance search performance with merge
+overhead. Lucene's [TieredMergePolicy] documentation explains in detail how CrateDB's
+underlying merge policy decides when to combine segments.
+
+
 ::::{todo}
 Enable after merging [GH-434: Indexing and storage](https://github.com/crate/cratedb-guide/pull/434).
 ```md
 ## Related sections

-{ref}`indexing-and-storage` explores the internal workings and data structures
+{ref}`indexing-and-storage` illustrates the internal workings and data structures
 of CrateDB's storage layer in more detail.

 :::{toctree}
@@ -108,5 +115,4 @@ indexing-and-storage
 ::::


-[column-based store]: https://cratedb.com/docs/crate/reference/en/latest/general/ddl/storage.html
 [TieredMergePolicy]: https://lucene.apache.org/core/9_12_1/core/org/apache/lucene/index/TieredMergePolicy.html
