2 changes: 1 addition & 1 deletion docs/start/query/index.md
@@ -48,7 +48,7 @@ CrateDB is not just a real-time analytics database, it’s a powerful platform t

aggregations
ad-hoc
search
search/index
ai-integration
Performance <performance>
```
153 changes: 153 additions & 0 deletions docs/start/query/search/fulltext.md
@@ -0,0 +1,153 @@
(start-fulltext)=
# Full-text search

:::{div} sd-text-muted
CrateDB enables real-time full-text search at scale.
:::

Unlike exact-match filters, **full-text search** allows **fuzzy, linguistic matching** on human language text. It tokenizes input, analyzes language, and searches for **tokens, stems, synonyms**, etc.
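
To make the idea of tokens and stems concrete, here is a deliberately tiny Python sketch of an analysis step. It is not CrateDB's or Lucene's analyzer: the stop-word list, the suffix rules, and the helper names `analyze` and `shared_terms` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list and suffix rules.
# Real analyzers use far more careful, language-specific logic.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "was"}
SUFFIXES = ("ing", "ed", "es", "s")


def analyze(text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, and strip common suffixes."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stems = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        for suffix in SUFFIXES:
            # Only strip a suffix if at least three characters remain.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        stems.append(token)
    return stems


def shared_terms(document: str, query: str) -> set[str]:
    """Return the analyzed terms that the document and the query have in common."""
    return set(analyze(document)) & set(analyze(query))
```

With these toy rules, a query for "failing attempt" shares terms with a document containing "attempts failed", even though neither word matches exactly.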

CrateDB provides full-text search through the `FULLTEXT` index and the `MATCH()` SQL predicate. This lets you **combine unstructured search with structured filtering and aggregations** in a single query, without an external search system such as Elasticsearch.

This applies whether you are working with log messages, customer feedback, machine-generated data, or IoT event streams.

## Why CrateDB for Full-text Search?

| Feature | Benefit |
| --------------------- | ------------------------------------------------- |
| Full-text indexing | Tokenized, language-aware search on any text |
| SQL + search | Combine structured filters with keyword queries |
| JSON support | Search within nested object fields |
| Real-time ingestion | Search new data immediately—no sync delay |
| Scalable architecture | Built to handle high-ingest, high-query workloads |

## Common Query Patterns

### Basic Keyword Search

```sql
SELECT id, message
FROM logs
WHERE MATCH(message, 'authentication failed');
```

### Combine with Structured Filters

```sql
SELECT id, message
FROM logs
WHERE service = 'auth'
AND MATCH(message, 'token expired');
```

### Search Nested JSON

```sql
SELECT id, payload['comment']
FROM feedback
WHERE MATCH(payload['comment'], 'battery life');
```

### Aggregate Search Results

```sql
SELECT COUNT(*)
FROM tickets
WHERE MATCH(description, 'login')
AND priority = 'high';
```

## Real-World Examples

### Log and Event Search

Search logs for error messages across microservices:

```sql
SELECT timestamp, service, message
FROM logs
WHERE MATCH(message, 'connection reset')
ORDER BY timestamp DESC
LIMIT 100;
```

### Customer Feedback Analysis

Count feedback messages that mention slow performance, grouped by their stored sentiment label:

```sql
SELECT payload['sentiment'], COUNT(*)
FROM feedback
WHERE MATCH(payload['message'], 'slow performance')
GROUP BY payload['sentiment'];
```

### Anomaly Investigation

Search telemetry events for a specific error term:

```sql
SELECT *
FROM device_events
WHERE MATCH(payload['error_message'], 'overheat');
```

## Language Support and Analyzers

CrateDB supports language-specific analyzers, enabling more accurate matching across different natural languages. You can specify analyzers during table creation or at query time.

```sql
CREATE TABLE docs ( id INTEGER, text TEXT INDEX USING FULLTEXT WITH (analyzer = 'english') );
```

To use a specific analyzer in a query:

```sql
SELECT * FROM docs WHERE MATCH(text, 'power outage') USING 'english';
```
Comment on lines +99 to +107
@coderabbitai coderabbitai bot Aug 23, 2025

🛠️ Refactor suggestion

Fix MATCH analyzer syntax; use USING <match_type> WITH (analyzer=...).

CrateDB specifies the analyzer at query time via WITH (analyzer='...') paired with a match type after USING. The current example USING 'english' is invalid. Update the example to one of the supported forms (e.g., use the default best_fields or a phrase query).

Apply this diff:

-```sql
-SELECT * FROM docs WHERE MATCH(text, 'power outage') USING 'english';
-```
+```sql
+-- Use a specific analyzer at query time
+-- (best_fields is the default match type)
+SELECT *
+FROM docs
+WHERE MATCH(text, 'power outage')
+  USING best_fields WITH (analyzer = 'english');
+```

Reference: MATCH usage and per‑query analyzer options. (cratedb.com)

🤖 Prompt for AI Agents
In docs/start/query/search/fulltext.md around lines 98 to 106, the example uses
invalid MATCH analyzer syntax ("USING 'english'"); update it to specify the
match type and pass the analyzer with WITH (...) — e.g., replace the single-line
query with a multi-line SQL that uses a match type (such as best_fields) and
appends WITH (analyzer = 'english') to the USING clause so the query becomes:
SELECT ... WHERE MATCH(...) USING best_fields WITH (analyzer = 'english');

Member Author

@amotl amotl Oct 15, 2025

@coderabbitai: Please re-evaluate your admonition now that the patch was updated.

🧩 Analysis chain

🌐 Web query:

CrateDB MATCH predicate syntax with analyzer at query time

💡 Result:

Short answer: use MATCH(..., 'query') USING <match_type> WITH (analyzer='name', ...). The analyzer option is passed in the WITH clause, e.g.:

SELECT * FROM my_table
WHERE MATCH(my_text_ft, 'some query')
USING best_fields
WITH (analyzer = 'english', fuzziness = 1)
ORDER BY _score DESC;

Notes:

  • The MATCH predicate syntax and options are documented in CrateDB’s Fulltext Search reference. [1]
  • The analyzer you specify at query time must be compatible with the analyzer that was used to create the fulltext index on the column (CrateDB only accepts the analyzer used for the indexed column). [1]
  • Example usage with analyzer and other options is shown in CrateDB examples. [2]

Sources:
[1] CrateDB — Fulltext search / MATCH predicate reference. (cratedb.com)
[2] CrateDB examples showing USING / WITH options (analyzer, fuzziness, etc.). (cratedb.com)
[3] CrateDB SQLAlchemy dialect MATCH usage (shows analyzer option in examples). (cratedb.com)

🏁 Script executed:

#!/bin/bash
# Search for other MATCH examples in the codebase
rg -n -C3 "MATCH\s*\(" --type=md

Length of output: 17812


@amotl, I've re-evaluated my assessment and it remains correct. The syntax USING 'english' is invalid because:

  1. According to CrateDB documentation (source), the analyzer must be specified in the WITH clause after a match type:

    MATCH(...) USING <match_type> WITH (analyzer='name')
  2. From the codebase examples:

    • USING best_fields WITH (fuzziness = 2) (Line 103, docs/start/modelling/fulltext.md)
    • USING PHRASE (Line 106, docs/integrate/rsyslog/usage.md)
    • ❌ No examples of USING 'analyzer_name'
  3. The issue: USING 'english' treats 'english' as a match type (like best_fields or PHRASE), not as an analyzer option.

The correct syntax should be:

SELECT * FROM docs 
WHERE MATCH(text, 'power outage') 
  USING best_fields WITH (analyzer = 'english');

or if using the default match type:

SELECT * FROM docs 
WHERE MATCH(text, 'power outage') 
  WITH (analyzer = 'english');

Would you like me to clarify a specific aspect of the syntax?

Member Author

Indeed. The page still includes an offending SQL command.

SELECT * FROM docs WHERE MATCH(text, 'power outage') USING 'english';
SQLParseException[line 1:60: no viable alternative at input 'SELECT * FROM docs WHERE MATCH(text, 'power outage') USING 'english'']


## Indexing and Performance Tips

| Tip | Why It Helps |
| -------------------------------- | ----------------------------------------- |
| Use `TEXT` with `FULLTEXT` index | Enables tokenized search |
| Index only needed fields | Reduce indexing overhead |
| Pick appropriate analyzer | Match the language and context |
| Use `MATCH()` not `LIKE` | Full-text is more performant and relevant |
| Combine with filters | Boost performance using `WHERE` clauses |

## Further reading

:::::{grid} 1 3 3 3
:margin: 4 4 0 0
:padding: 0
:gutter: 2

::::{grid-item-card} {material-outlined}`article;1.5em` Reference
:columns: 3
- {ref}`crate-reference:sql_dql_fulltext_search`
- {ref}`crate-reference:fulltext-indices`
- {ref}`crate-reference:predicates_match`
- {ref}`crate-reference:ref-create-analyzer`
::::

::::{grid-item-card} {material-outlined}`link;1.5em` Related
:columns: 3
- {ref}`start-geospatial`
- {ref}`start-vector`
- {ref}`start-hybrid`
::::

::::{grid-item-card} {material-outlined}`read_more;1.5em` Read more
:columns: 6
- [How CrateDB differs from Elasticsearch]
- [Tutorial: Full-text search on logs]
- {ref}`FTS feature details <fulltext-search>`
- {ref}`Data modeling with FTS <model-fulltext>`
::::

:::::


[How CrateDB differs from Elasticsearch]: https://archive.fosdem.org/2018/schedule/event/cratedb/
[Tutorial: Full-text search on logs]: https://community.cratedb.com/t/storing-server-logs-on-cratedb-for-fast-search-and-aggregations/1562
104 changes: 104 additions & 0 deletions docs/start/query/search/geo.md
@@ -0,0 +1,104 @@
(start-geospatial)=
# Geospatial search

:::{div} sd-text-muted
Query geospatial data through SQL, combining ease of use with advanced capabilities.
:::

CrateDB enables geospatial search using **Lucene’s prefix tree** and **BKD tree** indexing structures. With CrateDB, you can:

* Store and index geographic **points** and **shapes**
* Perform spatial queries using **bounding boxes**, **circles**, **donut shapes**, and more
* Filter, sort, or boost results by **distance**, **area**, or **spatial relationship**

See the {ref}`data-modelling` section for details of data types and how to insert data.

## Querying Geospatial Data

CrateDB supports several SQL functions and predicates to work with geospatial data:

| Function | Description |
| -------------------------------------- | -------------------------------------------------------------------------------- |
| `distance(p1, p2)` | Computes the distance (in meters) between two points using the Haversine formula |
| `within(shape, region)` | Checks if a shape is fully within another shape |
| `intersects(shape1, shape2)` | Checks if two shapes intersect |
| `area(shape)` | Returns the area of a given shape in square degrees using geodetic awareness |
| `latitude(point)` / `longitude(point)` | Extracts lat/lon from a `GEO_POINT` |
| `geohash(point)` | Returns a 12-character geohash representation of a point |
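
The table above says `distance` uses the Haversine formula. As a rough illustration of that formula, and not CrateDB's actual implementation, the following Python sketch computes a great-circle distance in meters. The Earth radius constant and the `(longitude, latitude)` argument order are assumptions of this sketch, and `haversine_m` is a hypothetical helper name.

```python
import math

# Assumed mean Earth radius in meters; the constant CrateDB uses may differ.
EARTH_RADIUS_M = 6_371_008.8


def haversine_m(p1, p2):
    """Great-circle distance in meters between two (longitude, latitude) pairs in degrees."""
    lon1, lat1 = map(math.radians, p1)
    lon2, lat2 = map(math.radians, p2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine term: squared half-chord length between the two points.
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    # Central angle times radius gives the arc length.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```

Because the formula treats the Earth as a sphere, results are approximations. Two points a quarter of the way around the equator come out at one quarter of the assumed circumference.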

### MATCH Predicate

CrateDB provides a `MATCH` predicate for geospatial relationships:

```sql
sqlCopierModifier-- Find parks that intersect with a given region
SELECT name
FROM parks
WHERE MATCH(area) AGAINST('INTERSECTS POLYGON ((...))');
```
Comment on lines +34 to +38
@coderabbitai coderabbitai bot Aug 23, 2025

⚠️ Potential issue

Remove stray token and fix MATCH syntax for geospatial queries.

  • Delete the artifact sqlCopierModifier--.
  • CrateDB’s geospatial MATCH doesn’t use AGAINST(...) (that’s MySQL). Use match(column, query_term) [USING intersects|disjoint|within].

Apply this diff:

-```sql
-sqlCopierModifier-- Find parks that intersect with a given region
-SELECT name
-FROM parks
-WHERE MATCH(area) AGAINST('INTERSECTS POLYGON ((...))');
-```
+```sql
+-- Find parks that intersect with a given region
+SELECT name
+FROM parks
+WHERE match(area, 'POLYGON ((...))') USING intersects;
+```

References: Geo MATCH predicate syntax and match types (intersects/disjoint/within). (cratedb.com)

🤖 Prompt for AI Agents
In docs/start/query/search/geo.md around lines 74-78, remove the stray token
"sqlCopierModifier--" and replace the MySQL-style MATCH ... AGAINST(...) usage
with CrateDB's geospatial MATCH predicate: change the snippet to use a proper
fenced code block and a SQL comment, and update the WHERE clause to "WHERE
match(area, 'POLYGON ((...))') USING intersects;" so it uses match(column,
query_term) USING intersects syntax instead of AGAINST(...).

Member Author

@amotl amotl Oct 15, 2025

@coderabbitai: Please re-evaluate your admonition now that the patch was updated.

🧩 Analysis chain

🌐 Web query:

CrateDB MATCH predicate syntax for geospatial queries geo_shape

💡 Result:

Summary — MATCH predicate for GEO_SHAPE in CrateDB

  • Syntax: MATCH(column_ident, query_term) [ USING match_type ] where column_ident is a geo_shape indexed column and query_term is a shape (WKT or GeoJSON). [1][2]

  • Supported match_types: intersects (default), disjoint, within. Example relations:

    • intersects — return rows where shapes share any points/area.
    • disjoint — shapes share no points.
    • within — indexed shape is completely inside the query shape. [1]
  • Usage notes:

    • Only allowed in WHERE on user tables.
    • Fast (uses geo index). For exact results use scalar functions (intersects(), within(), distance()) which are slower. [1][2]
  • Example queries:

    • Intersects (default):
      SELECT name FROM countries WHERE match(geo, 'LINESTRING (13.38 52.52, 11.18 51.55)'); [1]
    • Disjoint:
      SELECT name FROM countries WHERE match(geo, 'LINESTRING (...)') USING disjoint; [1]
    • Within:
      SELECT id FROM regions WHERE match(geo, 'POLYGON ((...))') USING within; [1]

Sources:
[1] CrateDB Geo search / MATCH predicate documentation.
[2] CrateDB Geospatial Search guide.


🌐 Web query:

CrateDB AGAINST keyword full-text search MATCH predicate

💡 Result:

Short answer: CrateDB does not use MySQL's MATCH...AGAINST syntax — it exposes full‑text search via the MATCH predicate. You create FULLTEXT indexes on text columns and query them with MATCH(column_or_index, 'query') plus optional USING <match_type> and WITH(...) options (fuzziness, slop, boosts, phrase/phrase_prefix, etc.). Results expose a relevance _score you can ORDER BY. [1][2][3]

Minimal example:
CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT, INDEX body_ft USING FULLTEXT(body) WITH (analyzer='english'));
SELECT id, _score FROM docs WHERE MATCH(body_ft, 'quick brown') ORDER BY _score DESC;

References: CrateDB MATCH predicate and full‑text guide. [1][2].


🏁 Script executed:

#!/bin/bash
# Check if CrateDB documentation mentions AGAINST in the context of MATCH predicate
rg -n -C3 --type=md "AGAINST" .

# Also search for geospatial MATCH examples in the docs
rg -n -C3 --type=md "match.*geo.*USING" .

# Search for the specific INTERSECTS usage pattern
rg -n -C3 --type=md "INTERSECTS.*POLYGON" .

Length of output: 919


🏁 Script executed:

#!/bin/bash
# Fetch the official CrateDB geo search documentation to verify syntax
curl -s "https://cratedb.com/docs/crate/reference/en/latest/general/dql/geo.html" | grep -A 10 -B 10 "MATCH" | head -50

Length of output: 5464


@amotl, I've re-evaluated my admonition by checking multiple sources:

Evidence gathered:

  1. Official CrateDB documentation confirms the geospatial MATCH syntax is:

    MATCH(column_ident, query_term) [ USING match_type ]

    Examples from the docs:

    SELECT name FROM countries WHERE match(geo, 'LINESTRING (...)');
    SELECT id FROM regions WHERE match(geo, 'POLYGON ((...))') USING within;
  2. Web search confirms CrateDB does NOT use MySQL's MATCH...AGAINST syntax for either full-text or geospatial queries.

  3. Repository search shows AGAINST appears only in this new file (line 37) and nowhere else in the codebase.

Conclusion:

My original admonition stands. The code should be corrected to:

-- Find parks that intersect with a given region
SELECT name
FROM parks
WHERE match(area, 'POLYGON ((...))') USING intersects;

The AGAINST keyword is not valid CrateDB syntax. If you believe there's a scenario where AGAINST is valid in CrateDB, please share the documentation reference, and I'll gladly review it.


Supported relations: `INTERSECTS`, `DISJOINT`, `WITHIN`.

## Example: Finding Nearby Cities

The following query finds the 10 closest capital cities to the current location of the International Space Station:

```sql
SELECT
  city AS "City Name",
  country AS "Country",
  DISTANCE(i.position, c.location)::LONG / 1000 AS "Distance [km]"
FROM demo.iss i
CROSS JOIN demo.world_cities c
WHERE capital = 'primary'
  AND ts = (SELECT MAX(ts) FROM demo.iss)
ORDER BY 3 ASC
LIMIT 10;
```

## Indexing Strategies

CrateDB supports multiple indexing strategies for `GEO_SHAPE` columns:

| Index Type | Description |
| ------------------- | ------------------------------------------------------------ |
| `geohash` (default) | Hash-based prefix tree for point-based queries |
| `quadtree` | Space-partitioning using recursive quadrant splits |
| `bkdtree` | Lucene BKD tree for efficient bounding box and range queries |

You can choose and configure the indexing method when defining your table schema.

### Performance Note

While CrateDB can perform **exact computations** on complex geometries (e.g. large polygons, geometry collections), these can be computationally expensive. Choose your index strategy carefully based on your query patterns.

For full details, refer to the Geo Shape column definition section in the reference documentation.

## Further reading

:::::{grid} 1 3 3 3
:margin: 4 4 0 0
:padding: 0
:gutter: 2

::::{grid-item-card} {material-outlined}`article;1.5em` Reference
:columns: 3
- {ref}`crate-reference:data-types-geo-point`
- {ref}`crate-reference:data-types-geo-shape`
- {ref}`crate-reference:sql_dql_geo_search`
::::

::::{grid-item-card} {material-outlined}`link;1.5em` Related
:columns: 3
- {ref}`start-fulltext`
- {ref}`start-vector`
- {ref}`start-hybrid`
::::

::::{grid-item-card} {material-outlined}`read_more;1.5em` Read more
:columns: 6
- {ref}`Geospatial feature details <geospatial-search>`
- {ref}`Data modeling with geospatial data <model-geospatial>`
::::

:::::
120 changes: 120 additions & 0 deletions docs/start/query/search/hybrid.md
@@ -0,0 +1,120 @@
(start-hybrid)=
# Hybrid search

:::{div} sd-text-muted
Combine vector similarity (kNN) and term-based full-text (BM25)
searches in a single SQL query.
:::

While **vector search** provides powerful semantic retrieval based on machine learning models, it's not always optimal, especially when models are not fine-tuned for a specific domain. On the other hand, **traditional full-text search** (e.g., BM25 scoring) offers high precision on exact or keyword-based queries, with strong performance out of the box. **Hybrid search** blends these approaches, combining semantic understanding with keyword relevance to deliver more accurate, robust, and context-aware search results.

Hybrid search is particularly effective for **knowledge bases, product or document search, multilingual content search, FAQ bots and semantic assistants**, and **AI-powered search experiences**. It allows applications to go beyond keyword matching, incorporating vector similarity while still respecting domain-specific terms.

CrateDB lets you implement hybrid search natively in SQL. A single query combines **vector similarity search** (kNN) and **term-based full-text search** (BM25) by means of **common table expressions (CTEs)** and **score fusion techniques**, such as:

* **Convex combination** (weighted sum of scores)
* **Reciprocal rank fusion (RRF)**

## Supported Search Capabilities in CrateDB

| Search Type | Function | Description |
| --------------------- | ------------- |------------------------------------------------|
| **Vector search** | `KNN_MATCH()` | Finds vectors closest to a given vector |
| **Full-text search** | `MATCH()` | Uses Lucene's BM25 scoring |
| **Geospatial search** | `MATCH()` | For shapes and points (see: Geospatial search) |

CrateDB enables all three through **pure SQL**, allowing flexible combinations and advanced analytics.

## Example: Hybrid Search in SQL

The following query sketches the structure of a hybrid search that combines BM25 and vector results using CTEs:

```sql
WITH
  vector_results AS (
    SELECT id, title, content,
           _score AS vector_score
    FROM documents
    WHERE KNN_MATCH(embedding, [0.2, 0.1, ..., 0.3], 10)
  ),
  bm25_results AS (
    SELECT id, title, content,
           _score AS bm25_score
    FROM documents
    WHERE MATCH(content, 'knn search')
  )

SELECT
  v.id,
  v.title,
  bm25_score,
  vector_score,
  0.5 * bm25_score + 0.5 * vector_score AS hybrid_score
FROM
  bm25_results b
JOIN
  vector_results v ON v.id = b.id
ORDER BY
  hybrid_score DESC
LIMIT 10;
```

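You can adjust the weights (`0.5` each in this example) to balance keyword precision against semantic similarity. Because the two CTEs are combined with an inner join, only documents returned by both searches appear in the result.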

## Sample Results

### Hybrid Scoring (Convex Combination)

| hybrid\_score | bm25\_score | vector\_score | title |
| ------------- | ----------- | ------------- | --------------------------------------------- |
| 0.7440 | 1.0000 | 0.5734 | knn\_match(float\_vector, float\_vector, int) |
| 0.4868 | 0.5512 | 0.4439 | Searching On Multiple Columns |
| 0.4716 | 0.5694 | 0.4064 | array\_position(...) |
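
As a worked check of the convex combination, the Python sketch below recomputes the `hybrid_score` column from the two score columns in the table above. The helper name `convex_combination` and the `alpha` parameter are hypothetical. The weights are inferred from the table values rather than stated in the source: the three rows are reproduced by a BM25 weight of 0.4 and a vector weight of 0.6, not by the 0.5 / 0.5 split used in the example query.

```python
def convex_combination(bm25_score, vector_score, alpha):
    """Weighted sum of two scores; alpha is the BM25 weight in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * bm25_score + (1.0 - alpha) * vector_score


# (bm25_score, vector_score) pairs copied from the sample table.
rows = [(1.0000, 0.5734), (0.5512, 0.4439), (0.5694, 0.4064)]

# alpha = 0.4 is inferred from the table, not documented in the source.
hybrid = [round(convex_combination(b, v, alpha=0.4), 4) for b, v in rows]
```

A weighted sum only makes sense when both inputs are on a comparable scale. The sample scores all lie between 0 and 1, which suggests they were normalized before being combined.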

### Reciprocal Rank Fusion (RRF)

| final\_rank | bm25\_rank | vector\_rank | title |
| ----------- | ---------- | ------------ | --------------------------------------------- |
| 0.03278 | 1 | 1 | knn\_match(float\_vector, float\_vector, int) |
| 0.03105 | 7 | 2 | Searching On Multiple Columns |
| 0.03057 | 8 | 3 | Usage |

:::{note}
RRF rewards documents that rank highly across multiple methods,
regardless of exact score values.
:::
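
Reciprocal rank fusion can be sketched in a few lines of Python. Each document receives the sum of `1 / (k + rank)` over the individual rankings. The function name `rrf_score` is hypothetical, and the smoothing constant `k = 60` is an assumption: it is the value conventionally used for RRF and it reproduces the `final_rank` column of the table above to within truncation, but the source does not state it.

```python
def rrf_score(ranks, k=60):
    """Reciprocal rank fusion: sum of 1 / (k + rank) over all rankings.

    `ranks` holds the 1-based rank of one document in each result list.
    k = 60 is an assumed, conventional smoothing constant.
    """
    return sum(1.0 / (k + rank) for rank in ranks)


# (bm25_rank, vector_rank) pairs copied from the sample table.
sample_ranks = [(1, 1), (7, 2), (8, 3)]
fused = [rrf_score(pair) for pair in sample_ranks]
```

Only rank positions enter the calculation, so the BM25 and vector scores do not need to be normalized to a common scale first.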

## Further reading

:::::{grid} 1 3 3 3
:margin: 4 4 0 0
:padding: 0
:gutter: 2

::::{grid-item-card} {material-outlined}`article;1.5em` Reference
:columns: 3
- {ref}`crate-reference:sql_dql_fulltext_search`
- {ref}`crate-reference:fulltext-indices`
- {ref}`crate-reference:predicates_match`
- {ref}`crate-reference:scalar_knn_match`
- {ref}`crate-reference:scalar_vector_similarity`
- {ref}`crate-reference:type-float_vector`
::::

::::{grid-item-card} {material-outlined}`link;1.5em` Related
:columns: 3
- {ref}`start-fulltext`
- {ref}`start-geospatial`
- {ref}`start-vector`
::::

::::{grid-item-card} {material-outlined}`read_more;1.5em` Read more
:columns: 6
- [Doing Hybrid Search in CrateDB]
- {ref}`Hybrid search feature details <hybrid-search>`
::::

:::::


[Doing Hybrid Search in CrateDB]: https://cratedb.com/blog/hybrid-search-explained