Getting started / Search: Add new section (GenAI, edited) #264
base: main
Changes from all commits
8dc6523
763c61e
b2b45c6
bbb6c7e
efb75e9
5bcf3c7
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,153 @@ | ||
| (start-fulltext)= | ||
| # Full-text search | ||
|
|
||
| :::{div} sd-text-muted | ||
| CrateDB enables real-time full-text search at scale. | ||
| ::: | ||
|
|
||
| Unlike exact-match filters, **full-text search** allows **fuzzy, linguistic matching** on human language text. It tokenizes input, analyzes language, and searches for **tokens, stems, synonyms**, etc. | ||
|
|
||
| CrateDB supports powerful full-text search capabilities directly via the `FULLTEXT` index and the `MATCH()` SQL predicate. This allows you to **combine unstructured search with structured filtering and aggregations**—all in one query, with no need for external search systems like Elasticsearch. | ||
|
|
||
| This applies whether you are working with log messages, customer feedback, machine-generated data, or IoT event streams. | ||
|
|
||
| ## Why CrateDB for Full-text Search? | ||
|
|
||
| | Feature | Benefit | | ||
| | --------------------- | ------------------------------------------------- | | ||
| | Full-text indexing | Tokenized, language-aware search on any text | | ||
| | SQL + search | Combine structured filters with keyword queries | | ||
| | JSON support | Search within nested object fields | | ||
| | Real-time ingestion | Search new data immediately—no sync delay | | ||
| | Scalable architecture | Built to handle high-ingest, high-query workloads | | ||
|
|
||
| ## Common Query Patterns | ||
|
|
||
| ### Basic Keyword Search | ||
|
|
||
| ```sql | ||
| SELECT id, message | ||
| FROM logs | ||
| WHERE MATCH(message, 'authentication failed'); | ||
| ``` | ||
|
|
||
| ### Combine with Structured Filters | ||
|
|
||
| ```sql | ||
| SELECT id, message | ||
| FROM logs | ||
| WHERE service = 'auth' | ||
| AND MATCH(message, 'token expired'); | ||
| ``` | ||
|
|
||
| ### Search Nested JSON | ||
|
|
||
| ```sql | ||
| SELECT id, payload['comment'] | ||
| FROM feedback | ||
| WHERE MATCH(payload['comment'], 'battery life'); | ||
| ``` | ||
|
|
||
| ### Aggregate Search Results | ||
|
|
||
| ```sql | ||
| SELECT COUNT(*) | ||
| FROM tickets | ||
| WHERE MATCH(description, 'login') | ||
| AND priority = 'high'; | ||
| ``` | ||
|
|
||
| ## Real-World Examples | ||
|
|
||
| ### Log and Event Search | ||
|
|
||
| Search logs for error messages across microservices: | ||
|
|
||
| ```sql | ||
| SELECT timestamp, service, message | ||
| FROM logs | ||
| WHERE MATCH(message, 'connection reset') | ||
| ORDER BY timestamp DESC | ||
| LIMIT 100; | ||
| ``` | ||
|
|
||
| ### Customer Feedback Analysis | ||
|
|
||
| Count support messages that mention slow performance, grouped by their recorded sentiment: | ||
|
|
||
| ```sql | ||
| SELECT payload['sentiment'], COUNT(*) | ||
| FROM feedback | ||
| WHERE MATCH(payload['message'], 'slow performance') | ||
| GROUP BY payload['sentiment']; | ||
| ``` | ||
|
|
||
| ### Anomaly Investigation | ||
|
|
||
| Search across telemetry events for unexpected patterns: | ||
|
|
||
| ```sql | ||
| SELECT * | ||
| FROM device_events | ||
| WHERE MATCH(payload['error_message'], 'overheat'); | ||
| ``` | ||
|
|
||
| ## Language Support and Analyzers | ||
|
|
||
| CrateDB supports language-specific analyzers, enabling more accurate matching across different natural languages. You can specify analyzers during table creation or at query time. | ||
|
|
||
| ```sql | ||
| CREATE TABLE docs ( id INTEGER, text TEXT INDEX USING FULLTEXT WITH (analyzer = 'english') ); | ||
| ``` | ||
|
|
||
| To use a specific analyzer in a query: | ||
|
|
||
| ```sql | ||
| SELECT * FROM docs WHERE MATCH(text, 'power outage') USING 'english'; | ||
| ``` | ||
|
|
||
| ## Indexing and Performance Tips | ||
|
|
||
| | Tip | Why It Helps | | ||
| | -------------------------------- | ----------------------------------------- | | ||
| | Use `TEXT` with `FULLTEXT` index | Enables tokenized search | | ||
| | Index only needed fields | Reduce indexing overhead | | ||
| | Pick appropriate analyzer | Match the language and context | | ||
| | Use `MATCH()` not `LIKE` | Full-text is more performant and relevant | | ||
| | Combine with filters | Boost performance using `WHERE` clauses | | ||
|
|
||
| ## Further reading | ||
|
|
||
| :::::{grid} 1 3 3 3 | ||
| :margin: 4 4 0 0 | ||
| :padding: 0 | ||
| :gutter: 2 | ||
|
|
||
| ::::{grid-item-card} {material-outlined}`article;1.5em` Reference | ||
| :columns: 3 | ||
| - {ref}`crate-reference:sql_dql_fulltext_search` | ||
| - {ref}`crate-reference:fulltext-indices` | ||
| - {ref}`crate-reference:predicates_match` | ||
| - {ref}`crate-reference:ref-create-analyzer` | ||
| :::: | ||
|
|
||
| ::::{grid-item-card} {material-outlined}`link;1.5em` Related | ||
| :columns: 3 | ||
| - {ref}`start-geospatial` | ||
| - {ref}`start-vector` | ||
| - {ref}`start-hybrid` | ||
| :::: | ||
|
|
||
| ::::{grid-item-card} {material-outlined}`read_more;1.5em` Read more | ||
| :columns: 6 | ||
| - [How CrateDB differs from Elasticsearch] | ||
| - [Tutorial: Full-text search on logs] | ||
| - {ref}`FTS feature details <fulltext-search>` | ||
| - {ref}`Data modeling with FTS <model-fulltext>` | ||
| :::: | ||
|
|
||
| ::::: | ||
|
|
||
|
|
||
| [How CrateDB differs from Elasticsearch]: https://archive.fosdem.org/2018/schedule/event/cratedb/ | ||
| [Tutorial: Full-text search on logs]: https://community.cratedb.com/t/storing-server-logs-on-cratedb-for-fast-search-and-aggregations/1562 | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,104 @@ | ||
| (start-geospatial)= | ||
| # Geospatial search | ||
|
|
||
| :::{div} sd-text-muted | ||
| Query geospatial data through SQL, combining ease of use with advanced capabilities. | ||
| ::: | ||
|
|
||
| CrateDB enables geospatial search using **Lucene’s prefix tree** and **BKD tree** indexing structures. With CrateDB, you can: | ||
|
|
||
| * Store and index geographic **points** and **shapes** | ||
| * Perform spatial queries using **bounding boxes**, **circles**, **donut shapes**, and more | ||
| * Filter, sort, or boost results by **distance**, **area**, or **spatial relationship** | ||
|
|
||
| See the {ref}`data-modelling` section for details of data types and how to insert data. | ||
|
|
||
| ## Querying Geospatial Data | ||
|
|
||
| CrateDB supports several SQL functions and predicates to work with geospatial data: | ||
|
|
||
| | Function | Description | | ||
| | -------------------------------------- | -------------------------------------------------------------------------------- | | ||
| | `distance(p1, p2)` | Computes the distance (in meters) between two points using the Haversine formula | | ||
| | `within(shape, region)` | Checks if a shape is fully within another shape | | ||
| | `intersects(shape1, shape2)` | Checks if two shapes intersect | | ||
| | `area(shape)` | Returns the area of a given shape in square degrees using geodetic awareness | | ||
| | `latitude(point)` / `longitude(point)` | Extracts lat/lon from a `GEO_POINT` | | ||
| | `geohash(point)` | Returns a 12-character geohash representation of a point | | ||
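The `distance` row above names the Haversine formula. As a rough illustration of that calculation outside the database, here is a minimal Python sketch. It assumes a spherical Earth with a mean radius of 6,371 km; the constant CrateDB uses internally is not stated in this page, so results may differ slightly. The helper name `haversine_m` is hypothetical.

```python
import math

# Assumption: mean Earth radius in meters; not taken from CrateDB.
EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# One degree of longitude along the equator.
print(haversine_m(0.0, 0.0, 0.0, 1.0))
```

Dividing the result by 1000 gives kilometers, which is the same unit conversion the "Finding Nearby Cities" query applies to `DISTANCE(...)`.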
|
|
||
| ### MATCH Predicate | ||
|
|
||
| CrateDB provides a `MATCH` predicate for geospatial relationships: | ||
|
|
||
| ```sql | ||
| sqlCopierModifier-- Find parks that intersect with a given region | ||
| SELECT name | ||
| FROM parks | ||
| WHERE MATCH(area) AGAINST('INTERSECTS POLYGON ((...))'); | ||
| ``` | ||
|
Comment on lines +34 to +38
Remove stray token and fix MATCH syntax for geospatial queries.
Apply this diff:
-```sql
-sqlCopierModifier-- Find parks that intersect with a given region
-SELECT name
-FROM parks
-WHERE MATCH(area) AGAINST('INTERSECTS POLYGON ((...))');
-```
+```sql
+-- Find parks that intersect with a given region
+SELECT name
+FROM parks
+WHERE match(area, 'POLYGON ((...))') USING intersects;
+```
References: Geo MATCH predicate syntax and match types (intersects/disjoint/within). (cratedb.com)

@coderabbitai: Please re-evaluate your admonition now that the patch was updated.

Analysis chain

Web query result: Summary of the MATCH predicate for GEO_SHAPE in CrateDB.

Web query result: Short answer: CrateDB does not use MySQL's MATCH...AGAINST syntax; it exposes full-text search via the MATCH predicate. You create FULLTEXT indexes on text columns and query them with MATCH(column_or_index, 'query') plus optional USING <match_type> and WITH(...) options (fuzziness, slop, boosts, phrase/phrase_prefix, etc.). Results expose a relevance _score you can ORDER BY. [1][2][3]

References: CrateDB MATCH predicate and full-text guide. [1][2]

@amotl, I've re-evaluated my admonition by checking multiple sources.

Conclusion: My original admonition stands. The code should be corrected to:

-- Find parks that intersect with a given region
SELECT name
FROM parks
WHERE match(area, 'POLYGON ((...))') USING intersects;
|
||
|
|
||
| Supported relations: `INTERSECTS`, `DISJOINT`, `WITHIN`. | ||
|
|
||
| ## Example: Finding Nearby Cities | ||
|
|
||
| The following query finds the 10 closest capital cities to the current location of the International Space Station: | ||
|
|
||
| ```sql | ||
| SELECT | ||
| city AS "City Name", | ||
| country AS "Country", | ||
| DISTANCE(i.position, c.location)::LONG / 1000 AS "Distance [km]" | ||
| FROM demo.iss i | ||
| CROSS JOIN demo.world_cities c | ||
| WHERE capital = 'primary' | ||
| AND ts = (SELECT MAX(ts) FROM demo.iss) | ||
| ORDER BY 3 ASC | ||
| LIMIT 10; | ||
| ``` | ||
|
|
||
| ## Indexing Strategies | ||
|
|
||
| CrateDB supports multiple indexing strategies for `GEO_SHAPE` columns: | ||
|
|
||
| | Index Type | Description | | ||
| | ------------------- | ------------------------------------------------------------ | | ||
| | `geohash` (default) | Hash-based prefix tree for point-based queries | | ||
| | `quadtree` | Space-partitioning using recursive quadrant splits | | ||
| | `bkdtree` | Lucene BKD tree for efficient bounding box and range queries | | ||
|
|
||
| You can choose and configure the indexing method when defining your table schema. | ||
|
|
||
| ### Performance Note | ||
|
|
||
| While CrateDB can perform **exact computations** on complex geometries (e.g. large polygons, geometry collections), these can be computationally expensive. Choose your index strategy carefully based on your query patterns. | ||
|
|
||
| For full details, refer to the Geo Shape column definition section in the reference documentation. | ||
|
|
||
| ## Further reading | ||
|
|
||
| :::::{grid} 1 3 3 3 | ||
| :margin: 4 4 0 0 | ||
| :padding: 0 | ||
| :gutter: 2 | ||
|
|
||
| ::::{grid-item-card} {material-outlined}`article;1.5em` Reference | ||
| :columns: 3 | ||
| - {ref}`crate-reference:data-types-geo-point` | ||
| - {ref}`crate-reference:data-types-geo-shape` | ||
| - {ref}`crate-reference:sql_dql_geo_search` | ||
| :::: | ||
|
|
||
| ::::{grid-item-card} {material-outlined}`link;1.5em` Related | ||
| :columns: 3 | ||
| - {ref}`start-fulltext` | ||
| - {ref}`start-vector` | ||
| - {ref}`start-hybrid` | ||
| :::: | ||
|
|
||
| ::::{grid-item-card} {material-outlined}`read_more;1.5em` Read more | ||
| :columns: 6 | ||
| - {ref}`Geospatial feature details <geospatial-search>` | ||
| - {ref}`Data modeling with geospatial data <model-geospatial>` | ||
| :::: | ||
|
|
||
| ::::: | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,120 @@ | ||
| (start-hybrid)= | ||
| # Hybrid search | ||
|
|
||
| :::{div} sd-text-muted | ||
| Combine vector similarity (kNN) and term-based full-text (BM25) | ||
| searches in a single SQL query. | ||
| ::: | ||
|
|
||
| While **vector search** provides powerful semantic retrieval based on machine learning models, it's not always optimal, especially when models are not fine-tuned for a specific domain. On the other hand, **traditional full-text search** (e.g., BM25 scoring) offers high precision on exact or keyword-based queries, with strong performance out of the box. **Hybrid search** blends these approaches, combining semantic understanding with keyword relevance to deliver more accurate, robust, and context-aware search results. | ||
|
|
||
| Hybrid search is particularly effective for **knowledge bases, product or document search, multilingual content search, FAQ bots and semantic assistants**, and **AI-powered search experiences.** It allows applications to go beyond keyword matching, incorporating vector similarity while still respecting domain-specific terms. | ||
|
|
||
| CrateDB supports **hybrid search** by combining **vector similarity search** (kNN) and **term-based full-text search** (BM25) in a single SQL query. CrateDB lets you implement hybrid search natively in SQL using **common table expressions (CTEs)** and **scoring fusion techniques**, such as: | ||
|
|
||
| * **Convex combination** (weighted sum of scores) | ||
| * **Reciprocal rank fusion (RRF)** | ||
|
|
||
| ## Supported Search Capabilities in CrateDB | ||
|
|
||
| | Search Type | Function | Description | | ||
| | --------------------- | ------------- |------------------------------------------------| | ||
| | **Vector search** | `KNN_MATCH()` | Finds vectors closest to a given vector | | ||
| | **Full-text search** | `MATCH()` | Uses Lucene's BM25 scoring | | ||
| | **Geospatial search** | `MATCH()` | For shapes and points (see: Geospatial search) | | ||
|
|
||
| CrateDB enables all three through **pure SQL**, allowing flexible combinations and advanced analytics. | ||
|
|
||
| ## Example: Hybrid Search in SQL | ||
|
|
||
| Here is the basic structure of a hybrid search query that combines BM25 and vector results using CTEs: | ||
|
|
||
| ```sql | ||
| WITH | ||
| vector_results AS ( | ||
| SELECT id, title, content, | ||
| _score AS vector_score | ||
| FROM documents | ||
| WHERE KNN_MATCH(embedding, [0.2, 0.1, ..., 0.3], 10) | ||
| ), | ||
| bm25_results AS ( | ||
| SELECT id, title, content, | ||
| _score AS bm25_score | ||
| FROM documents | ||
| WHERE MATCH(content, 'knn search') | ||
| ) | ||
|
|
||
| SELECT | ||
| v.id, | ||
| v.title, | ||
| bm25_score, | ||
| vector_score, | ||
| 0.5 * bm25_score + 0.5 * vector_score AS hybrid_score | ||
| FROM | ||
| bm25_results b | ||
| JOIN | ||
| vector_results v ON v.id = b.id | ||
| ORDER BY | ||
| hybrid_score DESC | ||
| LIMIT 10; | ||
| ``` | ||
|
|
||
| You can adjust the two weights (both `0.5` here) to balance keyword precision against semantic similarity. For a convex combination, the weights should sum to 1. | ||
|
|
||
| ## Sample Results | ||
|
|
||
| ### Hybrid Scoring (Convex Combination) | ||
|
|
||
| | hybrid\_score | bm25\_score | vector\_score | title | | ||
| | ------------- | ----------- | ------------- | --------------------------------------------- | | ||
| | 0.7440 | 1.0000 | 0.5734 | knn\_match(float\_vector, float\_vector, int) | | ||
| | 0.4868 | 0.5512 | 0.4439 | Searching On Multiple Columns | | ||
| | 0.4716 | 0.5694 | 0.4064 | array\_position(...) | | ||
|
|
||
| ### Reciprocal Rank Fusion (RRF) | ||
|
|
||
| | final\_rank | bm25\_rank | vector\_rank | title | | ||
| | ----------- | ---------- | ------------ | --------------------------------------------- | | ||
| | 0.03278 | 1 | 1 | knn\_match(float\_vector, float\_vector, int) | | ||
| | 0.03105 | 7 | 2 | Searching On Multiple Columns | | ||
| | 0.03057 | 8 | 3 | Usage | | ||
|
|
||
| :::{note} | ||
| RRF rewards documents that rank highly across multiple methods, | ||
| regardless of exact score values. | ||
| ::: | ||
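As a sketch of how RRF arrives at a fused value, the Python snippet below sums `1 / (k + rank)` over the rank a document received from each method. The constant `k = 60` is an assumption: it is a commonly used default, and it reproduces the `final_rank` values in the table above to the displayed precision (the table values appear to be truncated rather than rounded). The helper name `rrf_score` is hypothetical.

```python
def rrf_score(ranks, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over each method's 1-based rank."""
    return sum(1.0 / (k + rank) for rank in ranks)


# (bm25_rank, vector_rank) pairs copied from the sample table above.
sample_ranks = [(1, 1), (7, 2), (8, 3)]
fused = [rrf_score(pair) for pair in sample_ranks]

for pair, score in zip(sample_ranks, fused):
    print(pair, f"{score:.6f}")
```

Because only ranks enter the formula, the BM25 and vector scores do not need to be normalized to a common scale first.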
|
|
||
| ## Further reading | ||
|
|
||
| :::::{grid} 1 3 3 3 | ||
| :margin: 4 4 0 0 | ||
| :padding: 0 | ||
| :gutter: 2 | ||
|
|
||
| ::::{grid-item-card} {material-outlined}`article;1.5em` Reference | ||
| :columns: 3 | ||
| - {ref}`crate-reference:sql_dql_fulltext_search` | ||
| - {ref}`crate-reference:fulltext-indices` | ||
| - {ref}`crate-reference:predicates_match` | ||
| - {ref}`crate-reference:scalar_knn_match` | ||
| - {ref}`crate-reference:scalar_vector_similarity` | ||
| - {ref}`crate-reference:type-float_vector` | ||
| :::: | ||
|
|
||
| ::::{grid-item-card} {material-outlined}`link;1.5em` Related | ||
| :columns: 3 | ||
| - {ref}`start-fulltext` | ||
| - {ref}`start-geospatial` | ||
| - {ref}`start-vector` | ||
| :::: | ||
|
|
||
| ::::{grid-item-card} {material-outlined}`read_more;1.5em` Read more | ||
| :columns: 6 | ||
| - [Doing Hybrid Search in CrateDB] | ||
| - {ref}`Hybrid search feature details <hybrid-search>` | ||
| :::: | ||
|
|
||
| ::::: | ||
|
|
||
|
|
||
| [Doing Hybrid Search in CrateDB]: https://cratedb.com/blog/hybrid-search-explained |
🛠️ Refactor suggestion
Fix MATCH analyzer syntax; use `USING <match_type> WITH (analyzer=...)`.
CrateDB specifies the analyzer at query time via `WITH (analyzer='...')` paired with a match type after `USING`. The current example `USING 'english'` is invalid. Update the example to one of the supported forms (e.g., use the default `best_fields` or a phrase query).
Apply this diff:
Reference: MATCH usage and per‑query analyzer options. (cratedb.com)
@coderabbitai: Please re-evaluate your admonition now that the patch was updated.
🧩 Analysis chain
🌐 Web query:
💡 Result:
Short answer: use MATCH(..., 'query') USING <match_type> WITH (analyzer='name', ...). The analyzer option is passed in the WITH clause, e.g.:
SELECT * FROM my_table
WHERE MATCH(my_text_ft, 'some query')
USING best_fields
WITH (analyzer = 'english', fuzziness = 1)
ORDER BY _score DESC;
Notes:
Sources:
[1] CrateDB — Fulltext search / MATCH predicate reference. (cratedb.com)
[2] CrateDB examples showing USING / WITH options (analyzer, fuzziness, etc.). (cratedb.com)
[3] CrateDB SQLAlchemy dialect MATCH usage (shows analyzer option in examples). (cratedb.com)
@amotl, I've re-evaluated my assessment and it remains correct. The syntax `USING 'english'` is invalid because:

According to CrateDB documentation (source), the analyzer must be specified in the `WITH` clause after a match type:

From the codebase examples:
- `USING best_fields WITH (fuzziness = 2)` (Line 103, `docs/start/modelling/fulltext.md`)
- `USING PHRASE` (Line 106, `docs/integrate/rsyslog/usage.md`)
- `USING 'analyzer_name'`

The issue: `USING 'english'` treats `'english'` as a match type (like `best_fields` or `PHRASE`), not as an analyzer option.

The correct syntax should be:
or if using the default match type:
Would you like me to clarify a specific aspect of the syntax?
Indeed. The page still includes an offending SQL command.