From 8dc65232d5a3d7571a91ed39c04a2d52f63b291d Mon Sep 17 00:00:00 2001 From: Stephane Castellani Date: Sat, 23 Aug 2025 17:54:05 +0200 Subject: [PATCH 1/6] Getting started / Search: Add new section --- docs/start/query/index.md | 2 +- docs/start/query/search/fulltext.md | 147 ++++++++++++++++++++++++++++ docs/start/query/search/geo.md | 126 ++++++++++++++++++++++++ docs/start/query/search/hybrid.md | 114 +++++++++++++++++++++ docs/start/query/search/index.md | 11 +++ docs/start/query/search/vector.md | 134 +++++++++++++++++++++++++ 6 files changed, 533 insertions(+), 1 deletion(-) create mode 100644 docs/start/query/search/fulltext.md create mode 100644 docs/start/query/search/geo.md create mode 100644 docs/start/query/search/hybrid.md create mode 100644 docs/start/query/search/index.md create mode 100644 docs/start/query/search/vector.md diff --git a/docs/start/query/index.md b/docs/start/query/index.md index 051f7b38..54c09244 100644 --- a/docs/start/query/index.md +++ b/docs/start/query/index.md @@ -48,7 +48,7 @@ CrateDB is not just a real-time analytics database, it’s a powerful platform t aggregations ad-hoc -search +search/index ai-integration Performance ``` diff --git a/docs/start/query/search/fulltext.md b/docs/start/query/search/fulltext.md new file mode 100644 index 00000000..09189095 --- /dev/null +++ b/docs/start/query/search/fulltext.md @@ -0,0 +1,147 @@ +# Full-text search + +CrateDB supports powerful **full-text search** capabilities directly within its distributed SQL engine. This allows you to **combine unstructured search with structured filtering and aggregations**—all in one query, with no need for external search systems like Elasticsearch. + +Whether you're working with log messages, customer feedback, machine-generated data, or IoT event streams, CrateDB enables **real-time full-text search at scale**. + +## What Is Full-text Search? + +Unlike exact-match filters, full-text search allows **fuzzy, linguistic matching** on human language text. 
It tokenizes input, analyzes language, and searches for **tokens, stems, synonyms**, etc. + +CrateDB enables this via the `FULLTEXT` index and the `MATCH()` SQL predicate. + +## Why CrateDB for Full-text Search? + +| Feature | Benefit | +| --------------------- | ------------------------------------------------- | +| Full-text indexing | Tokenized, language-aware search on any text | +| SQL + search | Combine structured filters with keyword queries | +| JSON support | Search within nested object fields | +| Real-time ingestion | Search new data immediately—no sync delay | +| Scalable architecture | Built to handle high-ingest, high-query workloads | + +## Common Query Patterns + +### Basic Keyword Search + +```sql +SELECT id, message +FROM logs +WHERE MATCH(message, 'authentication failed'); +``` + +### Combine with Structured Filters + +```sql +SELECT id, message +FROM logs +WHERE service = 'auth' + AND MATCH(message, 'token expired'); +``` + +### Search Nested JSON + +```sql +SELECT id, payload['comment'] +FROM feedback +WHERE MATCH(payload['comment'], 'battery life'); +``` + +### Aggregate Search Results + +```sql +SELECT COUNT(*) +FROM tickets +WHERE MATCH(description, 'login') + AND priority = 'high'; +``` + +## Real-World Examples + +### Log and Event Search + +Search logs for error messages across microservices: + +```sql +SELECT timestamp, service, message +FROM logs +WHERE MATCH(message, 'connection reset') +ORDER BY timestamp DESC +LIMIT 100; +``` + +### Customer Feedback Analysis + +Extract customer sentiment from support messages: + +```sql +SELECT payload['sentiment'], COUNT(*) +FROM feedback +WHERE MATCH(payload['message'], 'slow performance') +GROUP BY payload['sentiment']; +``` + +### Anomaly Investigation + +Search across telemetry events for unexpected patterns: + +```sql +SELECT * +FROM device_events +WHERE MATCH(payload['error_message'], 'overheat'); +``` + +## Language Support and Analyzers + +CrateDB supports language-specific analyzers, 
enabling more accurate matching across different natural languages. You can specify analyzers during table creation or at query time. + +```sql +CREATE TABLE docs ( id INTEGER, text TEXT INDEX USING FULLTEXT WITH (analyzer = 'english') ); +``` + +To use a specific analyzer in a query: + +```sql +SELECT * FROM docs WHERE MATCH(text, 'power outage') USING best_fields WITH (analyzer = 'english'); +``` + +## Indexing and Performance Tips + +| Tip | Why It Helps | +| -------------------------------- | ----------------------------------------- | +| Use `TEXT` with `FULLTEXT` index | Enables tokenized search | +| Index only needed fields | Reduce indexing overhead | +| Pick appropriate analyzer | Match the language and context | +| Use `MATCH()` not `LIKE` | Full-text is more performant and relevant | +| Combine with filters | Boost performance using `WHERE` clauses | + +## When to Use CrateDB for Full-text Search + +CrateDB is ideal when you need to: + +* Search human-generated data (logs, comments, messages) +* Perform search + filtering + aggregation in a **single SQL query** +* Handle **real-time ingestion** and **search immediately** +* Avoid managing a separate search engine or ETL pipeline +* Search **text within structured or semi-structured data** + +## Related Features + +| Feature | Description | +| --------------------- | ---------------------------------------------------- | +| Language analyzers | Built-in support for many languages | +| JSON object support | Index and search nested fields | +| SQL + full-text | Unified queries for structured and unstructured data | +| Distributed execution | Fast, scalable search across nodes | +| Aggregations | Group and analyze search results at scale | + +## Learn More + +* Full-text Search Data Model +* MATCH Clause Documentation +* How CrateDB Differs from Elasticsearch +* Tutorial: Full-text Search on Logs + +## Summary + +CrateDB delivers fast, scalable, and **SQL-native full-text search**—perfect for modern applications that need to search and 
analyze semi-structured or human-generated text in real time. By merging search and analytics into a **single operational database**, CrateDB simplifies your stack while unlocking rich insight. diff --git a/docs/start/query/search/geo.md b/docs/start/query/search/geo.md new file mode 100644 index 00000000..66e91276 --- /dev/null +++ b/docs/start/query/search/geo.md @@ -0,0 +1,126 @@ +# Geo search + +CrateDB natively supports geospatial data types and spatial queries, allowing you to store, index, and efficiently query geographic data using SQL. Built on Apache Lucene, CrateDB offers powerful location-based search capabilities at scale. + +## Overview + +CrateDB enables geospatial search using **Lucene’s Prefix Tree** and **BKD Tree** indexing structures. With CrateDB, you can: + +* Store and index geographic **points** and **shapes** +* Perform spatial queries using **bounding boxes**, **circles**, **donut shapes**, and more +* Filter, sort, or boost results by **distance**, **area**, or **spatial relationship** + +You interact with geospatial data through SQL, combining ease of use with advanced capabilities. + +## Geospatial Data Types + +CrateDB supports two primary geospatial types: + +### `GEO_POINT` + +* Represents a single point using latitude and longitude. +* Can be inserted as: + * An array: `[longitude, latitude]` + * A WKT string (e.g. `'POINT (13.4050 52.5200)'`) + +### `GEO_SHAPE` + +* Represents more complex 2D shapes defined via GeoJSON or WKT formats. +* Supported geometry types: + * `Point`, `MultiPoint` + * `LineString`, `MultiLineString` + * `Polygon`, `MultiPolygon` + * `GeometryCollection` +* Insertable using: + * A GeoJSON object + * A WKT string + +## Inserting Spatial Data + +You can insert geospatial values using either **GeoJSON** or **WKT** formats. 
+ +**Examples**: + +```sql +-- Inserting a point +INSERT INTO locations (name, coordinates) +VALUES ('Berlin', [13.4050, 52.5200]); + +-- Inserting a shape (WKT format) +INSERT INTO parks (name, area) +VALUES ('Central Park', 'POLYGON ((...))'); +``` + +## Querying Geospatial Data + +CrateDB supports several SQL functions and predicates to work with geospatial data: + +### Common Functions + +| Function | Description | +| -------------------------------------- | ---------------------------------------------------- | +| `distance(p1, p2)` | Computes the distance (in meters) between two points | +| `within(shape, region)` | Checks if a shape is fully within another shape | +| `intersects(shape1, shape2)` | Checks if two shapes intersect | +| `area(shape)` | Returns the area of a given shape | +| `latitude(point)` / `longitude(point)` | Extracts lat/lon from a `GEO_POINT` | +| `geohash(point)` | Returns the geohash representation of a point | + +### MATCH Predicate + +CrateDB provides a `MATCH` predicate for geospatial relationships: + +```sql +-- Find parks that intersect with a given region +SELECT name +FROM parks +WHERE MATCH(area, 'POLYGON ((...))') USING intersects; +``` + +Supported relations: `INTERSECTS`, `DISJOINT`, `WITHIN`. 
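The `distance(p1, p2)` function in the table above returns a value in meters. To give a feel for what such a great-circle distance looks like, here is a minimal Python sketch of the haversine formula, a common way to compute it. The function name `haversine_m`, the Earth radius constant, and the second sample point are assumptions made for this illustration; CrateDB's own calculation and precision may differ.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumption for this sketch)


def haversine_m(p1, p2):
    """Great-circle distance in meters between two [longitude, latitude] points."""
    lon1, lat1 = map(math.radians, p1)
    lon2, lat2 = map(math.radians, p2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine term: combines the latitude and longitude differences on a sphere.
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


berlin = [13.4050, 52.5200]  # same [longitude, latitude] order as the INSERT example
vienna = [16.3738, 48.2082]  # hypothetical second point for illustration
print(round(haversine_m(berlin, vienna) / 1000), "km")
```

Note the `[longitude, latitude]` order: it mirrors the array form used when inserting a `GEO_POINT`, which is easy to get backwards.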
+ +## Example: Finding Nearby Cities + +The following query finds the 10 closest capital cities to the current location of the International Space Station: + +```sql +SELECT + city AS "City Name", + country AS "Country", + DISTANCE(i.position, c.location)::LONG / 1000 AS "Distance [km]" +FROM demo.iss i +CROSS JOIN demo.world_cities c +WHERE capital = 'primary' + AND ts = (SELECT MAX(ts) FROM demo.iss) +ORDER BY 3 ASC +LIMIT 10; +``` + +## Indexing Strategies + +CrateDB supports multiple indexing strategies for `GEO_SHAPE` columns: + +| Index Type | Description | +| ------------------- | ------------------------------------------------------------ | +| `geohash` (default) | Hash-based prefix tree for point-based queries | +| `quadtree` | Space-partitioning using recursive quadrant splits | +| `bkdtree` | Lucene BKD tree for efficient bounding box and range queries | + +You can choose and configure the indexing method when defining your table schema. + +### Performance Note + +While CrateDB can perform **exact computations** on complex geometries (e.g. large polygons, geometry collections), these can be computationally expensive. Choose your index strategy carefully based on your query patterns. + +## Defining a Geospatial Column + +Here’s how to define a `GEO_SHAPE` column with a specific index: + +```sql +CREATE TABLE regions ( + name TEXT, + area GEO_SHAPE INDEX USING quadtree +); +``` + +For full details, refer to the Geo Shape Column Definition section in the reference. diff --git a/docs/start/query/search/hybrid.md b/docs/start/query/search/hybrid.md new file mode 100644 index 00000000..b76821ca --- /dev/null +++ b/docs/start/query/search/hybrid.md @@ -0,0 +1,114 @@ +# Hybrid search + +CrateDB supports **hybrid search** by combining **vector similarity search** (kNN) and **term-based full-text search** (BM25) in a single SQL query. It’s fully powered by Apache Lucene and accessible through standard SQL—no external services or DSLs required. 
+ +:::{note} +CrateDB is all you need: run hybrid, vector, full-text, and geospatial search with SQL at scale. +::: + +## Overview + +While **vector search** provides powerful semantic retrieval based on machine learning models, it's not always optimal, especially when models are not fine-tuned for a specific domain. + +On the other hand, **traditional full-text search** (e.g., BM25 scoring) offers high precision on exact or keyword-based queries, with strong performance out of the box. + +**Hybrid search** blends these approaches, combining semantic understanding with keyword relevance to deliver more accurate, robust, and context-aware search results. + +## What Is Hybrid Search? + +Hybrid search enhances relevancy by combining the scores or rankings from multiple search algorithms, typically: + +* **BM25** for keyword relevance +* **kNN** for semantic proximity in vector space + +CrateDB lets you implement hybrid search natively in SQL using **Common Table Expressions (CTEs)** and **scoring fusion techniques**, such as: + +* **Convex combination** (weighted sum of scores) +* **Reciprocal Rank Fusion (RRF)** + +## Supported Search Capabilities in CrateDB + +| Search Type | Function | Description | +| --------------------- | ------------- | ---------------------------------------------- | +| **Vector search** | `KNN_MATCH()` | Finds vectors closest to a given vector | +| **Full-text search** | `MATCH()` | Uses Lucene's BM25 scoring | +| **Geospatial search** | `MATCH()` | For shapes and points (see: Geospatial Search) | + +CrateDB enables all three through **pure SQL**, allowing flexible combinations and advanced analytics. 
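The two fusion techniques named above, convex combination and Reciprocal Rank Fusion, come down to simple arithmetic. The following Python sketch shows one plausible form of each. The function names, the `bm25_weight` parameter, and the rank constant `k=60` are assumptions for illustration; the weight `0.4` and `k=60` were picked only because they are consistent with the figures in the sample result tables on this page.

```python
def convex_combination(bm25_score, vector_score, bm25_weight=0.5):
    """Weighted sum of two scores; the two weights add up to 1."""
    return bm25_weight * bm25_score + (1.0 - bm25_weight) * vector_score


def reciprocal_rank_fusion(ranks, k=60):
    """Sum of 1 / (k + rank) over the rank a document received in each result list."""
    return sum(1.0 / (k + rank) for rank in ranks)


# Scores and ranks taken from the first row of each sample table on this page.
hybrid_score = convex_combination(1.0000, 0.5734, bm25_weight=0.4)
final_rank = reciprocal_rank_fusion([1, 1])
print(hybrid_score, final_rank)
```

A convex combination depends on the score values themselves, so both inputs should be on a comparable scale. RRF only looks at rank positions, which makes it less sensitive to how each method scales its scores.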
+ +## Example: Hybrid Search in SQL + +Here’s the basic structure of a hybrid search query combining BM25 and vector results using a CTE: + +```sql +WITH + vector_results AS ( + SELECT id, title, content, + _score AS vector_score + FROM documents + WHERE KNN_MATCH(embedding, [0.2, 0.1, ..., 0.3], 10) + ), + bm25_results AS ( + SELECT id, title, content, + _score AS bm25_score + FROM documents + WHERE MATCH(content, 'knn search') + ) + +SELECT + v.id, + v.title, + bm25_score, + vector_score, + 0.5 * bm25_score + 0.5 * vector_score AS hybrid_score +FROM + bm25_results b +JOIN + vector_results v ON v.id = b.id +ORDER BY + hybrid_score DESC +LIMIT 10; +``` + +You can adjust the weighting (`0.5`) depending on your desired balance between keyword precision and semantic similarity. Note that the sample results below are consistent with weights of `0.4` for `bm25_score` and `0.6` for `vector_score` rather than the equal weights in the query above: for the first row, 0.4 × 1.0000 + 0.6 × 0.5734 ≈ 0.7440. + +## Sample Results + +### Hybrid Scoring (Convex Combination) + +| hybrid\_score | bm25\_score | vector\_score | title | +| ------------- | ----------- | ------------- | --------------------------------------------- | +| 0.7440 | 1.0000 | 0.5734 | knn\_match(float\_vector, float\_vector, int) | +| 0.4868 | 0.5512 | 0.4439 | Searching On Multiple Columns | +| 0.4716 | 0.5694 | 0.4064 | array\_position(...) | + +### Reciprocal Rank Fusion (RRF) + +| final\_rank | bm25\_rank | vector\_rank | title | +| ----------- | ---------- | ------------ | --------------------------------------------- | +| 0.03278 | 1 | 1 | knn\_match(float\_vector, float\_vector, int) | +| 0.03105 | 7 | 2 | Searching On Multiple Columns | +| 0.03057 | 8 | 3 | Usage | + +> RRF rewards documents that rank highly across multiple methods, regardless of exact score values. The `final_rank` values shown are consistent with a rank constant of 60: for the first row, 1/(60 + 1) + 1/(60 + 1) ≈ 0.0328. + +## Why Use Hybrid Search? 
+ +| Benefit | Description | +| ------------------------- | ----------------------------------------------------------------- | +| 🔍 **Improved relevance** | Combines semantic and keyword-based matches | +| ⚙️ **Pure SQL** | No DSLs or external services—runs directly in CrateDB | +| ⚡ **High performance** | Built on Apache Lucene with CrateDB’s distributed SQL engine | +| 🔄 **Flexible ranking** | Use scoring functions (convex, RRF, etc.) based on use case needs | + +## Usage in Applications + +Hybrid search is particularly effective for: + +* **Knowledge bases** +* **Product or document search** +* **Multilingual content search** +* **FAQ bots and semantic assistants** +* **AI-powered search experiences** + +It allows applications to go beyond keyword matching, incorporating vector similarity while still respecting domain-specific terms. diff --git a/docs/start/query/search/index.md b/docs/start/query/search/index.md new file mode 100644 index 00000000..cc8de0ca --- /dev/null +++ b/docs/start/query/search/index.md @@ -0,0 +1,11 @@ +(start-search)= +# Search + +```{toctree} +:maxdepth: 1 + +fulltext +geo +vector +hybrid +``` diff --git a/docs/start/query/search/vector.md b/docs/start/query/search/vector.md new file mode 100644 index 00000000..f87a6e7e --- /dev/null +++ b/docs/start/query/search/vector.md @@ -0,0 +1,134 @@ +# Vector search + +CrateDB supports **native vector search**, enabling you to perform **similarity-based retrieval** directly in SQL, without needing a separate vector database or search engine. + +Whether you're powering **semantic search**, **recommendation engines**, **anomaly detection**, or **AI-enhanced applications**, CrateDB lets you store, manage, and search vector embeddings at scale **right alongside your structured, JSON, and full-text data.** + +## What Is Vector Search? + +Vector search retrieves the most semantically similar items to a query vector using **Approximate Nearest Neighbor (ANN)** algorithms (e.g., HNSW via Lucene). 
CrateDB provides unified SQL support for this via `KNN_MATCH`. + +## Why CrateDB for Vector Search? + +| FLOAT\_VECTOR | Store embeddings up to 2048 dimensions | +| ------------------- | ------------------------------------------------------------ | +| KNN\_MATCH | SQL-native k-nearest neighbor function with `_score` support | +| VECTOR\_SIMILARITY | Compute similarity scores between vectors in queries | +| Real-time indexing | Fresh vectors are immediately searchable | +| Hybrid queries | Combine vector search with filters, full-text, and JSON | + +## Common Query Patterns + +### K-Nearest Neighbors (KNN) Search + +```sql +SELECT text, _score +FROM word_embeddings +WHERE KNN_MATCH(embedding, [0.3, 0.6, 0.0, 0.9], 3) +ORDER BY _score DESC; +``` + +Returns top 3 most similar embeddings. + +### Combine with Filters + +```sql +SELECT product_name, _score +FROM products +WHERE category = 'shoes' + AND KNN_MATCH(features, [0.2, 0.1, 0.3], 5) +ORDER BY _score DESC; +``` + +### Compute Similarity Score + +```sql +SELECT id, VECTOR_SIMILARITY(emb, [q_vector]) AS score +FROM items +WHERE KNN_MATCH(emb, [q_vector], 10) +ORDER BY score DESC; +``` + +Useful if combining scoring logic manually. 
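To make the idea behind the queries above concrete, here is a minimal Python sketch of an exact (brute-force) k-nearest-neighbor search. It assumes a Euclidean-based score of the form 1 / (1 + d²), which is how Lucene scores its Euclidean similarity; whether CrateDB's `_score` and `VECTOR_SIMILARITY` match this exactly should be checked against the reference. The function names and the sample rows are hypothetical. Unlike `KNN_MATCH`, which relies on an approximate index, this sketch compares the query against every row.

```python
def euclidean_score(a, b):
    """Map squared Euclidean distance into (0, 1]; identical vectors score 1.0."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared_distance)


def knn(query, rows, k):
    """Exact k-nearest neighbors over (text, embedding) rows, best score first."""
    scored = [(text, euclidean_score(query, emb)) for text, emb in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical rows standing in for a table with a 4-dimensional embedding column.
rows = [
    ("exact", [0.3, 0.6, 0.0, 0.9]),
    ("near", [0.3, 0.5, 0.0, 0.9]),
    ("far", [0.9, 0.0, 0.9, 0.0]),
]
query = [0.3, 0.6, 0.0, 0.9]
top_two = knn(query, rows, 2)
print(top_two)
```

The length check mirrors the requirement that every embedding has the same number of dimensions as the column definition.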
+ +## Real-World Examples + +### Semantic Document Search + +```sql +SELECT id, title +FROM documents +WHERE KNN_MATCH(embedding, [query_emb], 5) +ORDER BY _score DESC; +``` + +### E-commerce Recommendations + +```sql +SELECT id, name +FROM product_vecs +WHERE in_stock + AND KNN_MATCH(feature_vec, [user_emb], 4) +ORDER BY _score DESC; +``` + +### Chat Memory Recall + +```sql +SELECT message +FROM chat_history +WHERE KNN_MATCH(vec, [query_emb], 3) +ORDER BY _score DESC; +``` + +Anomaly Detection + +```sql +SELECT * +FROM events +WHERE type = 'sensor' + AND KNN_MATCH(vector_repr, [normal_pattern_emb], 1) +ORDER BY _score ASC +LIMIT 1; +``` + +## Performance & Indexing Tips + +| Tip | Benefit | +| ---------------------------------- | ------------------------------------------------------- | +| Use `FLOAT_VECTOR` | Efficiency with fixed-size arrays up to 2048 dimensions | +| Create HNSW index when supported | Enables fast ANN queries via Lucene | +| Consistent vector length | All embeddings must match column definition | +| Pre-filter with structured filters | Reduces scanning overhead | +| Tune `KNN_MATCH` | Adjust neighbor count per shard or globally | + +## When to Use CrateDB for Vector Search + +Use CrateDB when you need to: + +* Execute **semantic search** within your primary database +* Avoid managing an external vector database +* Build **hybrid queries** combining embeddings and metadata +* Scale to real-time pipelines with **millions of vectors** +* Keep everything accessible through SQL + +## Related Features + +| Feature | Description | +| ------------------ | ----------------------------------------------- | +| FLOAT\_VECTOR | Native support for high-dimensional arrays | +| KNN\_MATCH | Core SQL predicate for vector similarity search | +| VECTOR\_SIMILARITY | Compute proximity scores in SQL | +| Lucene HNSW ANN | Efficient graph-based search engine | +| Hybrid search | Combine ANN search with full-text, geo, JSON | + +## Learn More + +* [Vector Search 
Guide](https://cratedb.com/docs/guide/feature/search/vector/index.html) +* `KNN_MATCH` & `VECTOR_SIMILARITY` reference +* [Intro Blog: Vector support & KNN search in CrateDB](https://cratedb.com/blog/unlocking-the-power-of-vector-support-and-knn-search-in-cratedb) +* [LangChain & Vector Store integration](https://cratedb.com/docs/guide/domain/ml/index.html) + +## Summary + +CrateDB’s **vector search** empowers developers to build **AI-driven applications** without the complexity of separate infrastructure. Use `KNN_MATCH` for fast retrieval, combine it with filters, metadata, or textual logic, and stay entirely within SQL while scaling across your cluster seamlessly. From 763c61e2db5c4098f15713b237cc2fb29bc218b9 Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Wed, 15 Oct 2025 17:54:59 +0200 Subject: [PATCH 2/6] Start/Search: Relocate index page --- docs/start/query/search.md | 34 ------------------------------- docs/start/query/search/index.md | 35 ++++++++++++++++++++++++++++++++ 2 files changed, 35 insertions(+), 34 deletions(-) delete mode 100644 docs/start/query/search.md diff --git a/docs/start/query/search.md b/docs/start/query/search.md deleted file mode 100644 index e10d07d3..00000000 --- a/docs/start/query/search.md +++ /dev/null @@ -1,34 +0,0 @@ -(start-search)= -# Search - -:::{rubric} Features -::: - -CrateDB offers robust search capabilities by combining native full-text, -geospatial, vector, and hybrid search—all accessible through standard SQL -queries. At its core, CrateDB leverages Apache Lucene and the BM25 ranking -algorithm for high-performance full-text search, making it well-suited for -large-scale, complex information retrieval tasks. - -Geospatial and vector search are also natively supported, enabling use cases -ranging from text analytics to AI/ML and location-based queries, all within -the same unified platform. 
- -:::{rubric} Hybrid search -::: - -Hybrid search in CrateDB allows you to combine multiple search methods—such -as term-based, vector, and geospatial—within a single query for powerful -information discovery. This versatility, together with horizontal -scalability and SQL compatibility, makes CrateDB an exceptional choice for -organisations wanting to run advanced search and analytics on diverse data -types, including structured, semi-structured, and unstructured content. - -:::{rubric} Next step -::: - -:::{card} All search features of CrateDB at a glance -:link: search-overview -:link-type: ref -CrateDB provides full-text, geospatial, and vector search natively. -::: diff --git a/docs/start/query/search/index.md b/docs/start/query/search/index.md index cc8de0ca..155f351f 100644 --- a/docs/start/query/search/index.md +++ b/docs/start/query/search/index.md @@ -1,6 +1,32 @@ (start-search)= # Search +:::{rubric} Features +::: + +CrateDB offers robust search capabilities by combining native full-text, +geospatial, vector, and hybrid search—all accessible through standard SQL +queries. At its core, CrateDB leverages Apache Lucene and the BM25 ranking +algorithm for high-performance full-text search, making it well-suited for +large-scale, complex information retrieval tasks. + +Geospatial and vector search are also natively supported, enabling use cases +ranging from text analytics to AI/ML and location-based queries, all within +the same unified platform. + +:::{rubric} Hybrid search +::: + +Hybrid search in CrateDB allows you to combine multiple search methods—such +as term-based, vector, and geospatial—within a single query for powerful +information discovery. This versatility, together with horizontal +scalability and SQL compatibility, makes CrateDB an exceptional choice for +organisations wanting to run advanced search and analytics on diverse data +types, including structured, semi-structured, and unstructured content. 
+ +:::{rubric} Introduction +::: + ```{toctree} :maxdepth: 1 @@ -9,3 +35,12 @@ geo vector hybrid ``` + +:::{rubric} Next step +::: + +:::{card} More details about all search features of CrateDB at a glance +:link: search-overview +:link-type: ref +CrateDB provides full-text, geospatial, and vector search natively. +::: From b2b45c63b5c10cd0fe7196d2a51032b553e4fece Mon Sep 17 00:00:00 2001 From: surister Date: Wed, 15 Oct 2025 17:58:39 +0200 Subject: [PATCH 3/6] Start/Search: Remove AI slop from pages about FTS-, Vector-, and Hybrid-Search --- docs/start/query/search/fulltext.md | 36 ++---------------- docs/start/query/search/hybrid.md | 44 ++-------------------- docs/start/query/search/vector.md | 58 ++++++----------------------- 3 files changed, 18 insertions(+), 120 deletions(-) diff --git a/docs/start/query/search/fulltext.md b/docs/start/query/search/fulltext.md index 09189095..07580aff 100644 --- a/docs/start/query/search/fulltext.md +++ b/docs/start/query/search/fulltext.md @@ -1,14 +1,10 @@ # Full-text search -CrateDB supports powerful **full-text search** capabilities directly within its distributed SQL engine. This allows you to **combine unstructured search with structured filtering and aggregations**—all in one query, with no need for external search systems like Elasticsearch. - -Whether you're working with log messages, customer feedback, machine-generated data, or IoT event streams, CrateDB enables **real-time full-text search at scale**. - -## What Is Full-text Search? - Unlike exact-match filters, full-text search allows **fuzzy, linguistic matching** on human language text. It tokenizes input, analyzes language, and searches for **tokens, stems, synonyms**, etc. -CrateDB enables this via the `FULLTEXT` index and the `MATCH()` SQL predicate. +CrateDB supports powerful **full-text search** capabilities directly via the `FULLTEXT` index and the `MATCH()` SQL predicate. 
This allows you to **combine unstructured search with structured filtering and aggregations**—all in one query, with no need for external search systems like Elasticsearch. + +Whether you're working with log messages, customer feedback, machine-generated data, or IoT event streams, CrateDB enables **real-time full-text search at scale**. ## Why CrateDB for Full-text Search? @@ -115,33 +111,9 @@ SELECT * FROM docs WHERE MATCH(text, 'power outage') USING 'english'; | Use `MATCH()` not `LIKE` | Full-text is more performant and relevant | | Combine with filters | Boost performance using `WHERE` clauses | -## When to Use CrateDB for Full-text Search - -CrateDB is ideal when you need to: - -* Search human-generated data (logs, comments, messages) -* Perform search + filtering + aggregation in a **single SQL query** -* Handle **real-time ingestion** and **search immediately** -* Avoid managing a separate search engine or ETL pipeline -* Search **text within structured or semi-structured data** - -## Related Features - -| Feature | Description | -| --------------------- | ---------------------------------------------------- | -| Language analyzers | Built-in support for many languages | -| JSON object support | Index and search nested fields | -| SQL + full-text | Unified queries for structured and unstructured data | -| Distributed execution | Fast, scalable search across nodes | -| Aggregations | Group and analyze search results at scale | - -## Learn More +## Further Learning & Resources * Full-text Search Data Model * MATCH Clause Documentation * How CrateDB Differs from Elasticsearch * Tutorial: Full-text Search on Logs - -## Summary - -CrateDB delivers fast, scalable, and **SQL-native full-text search**—perfect for modern applications that need to search and analyze semi-structured or human-generated text in real time. By merging search and analytics into a **single operational database**, CrateDB simplifies your stack while unlocking rich insight. 
diff --git a/docs/start/query/search/hybrid.md b/docs/start/query/search/hybrid.md index b76821ca..fc54312c 100644 --- a/docs/start/query/search/hybrid.md +++ b/docs/start/query/search/hybrid.md @@ -1,27 +1,10 @@ # Hybrid search -CrateDB supports **hybrid search** by combining **vector similarity search** (kNN) and **term-based full-text search** (BM25) in a single SQL query. It’s fully powered by Apache Lucene and accessible through standard SQL—no external services or DSLs required. +While **vector search** provides powerful semantic retrieval based on machine learning models, it's not always optimal, especially when models are not fine-tuned for a specific domain. On the other hand, **traditional full-text search** (e.g., BM25 scoring) offers high precision on exact or keyword-based queries, with strong performance out of the box. **Hybrid search** blends these approaches, combining semantic understanding with keyword relevance to deliver more accurate, robust, and context-aware search results. -:::{note} -CrateDB is all you need: run hybrid, vector, full-text, and geospatial search with SQL at scale. -::: +Hybrid search is particularly effective for **Knowledge bases, Product or document search, Multilingual content search, FAQ bots and semantic assistants**, and **AI-powered search experiences.** It allows applications to go beyond keyword matching, incorporating vector similarity while still respecting domain-specific terms. -## Overview - -While **vector search** provides powerful semantic retrieval based on machine learning models, it's not always optimal, especially when models are not fine-tuned for a specific domain. - -On the other hand, **traditional full-text search** (e.g., BM25 scoring) offers high precision on exact or keyword-based queries, with strong performance out of the box. - -**Hybrid search** blends these approaches, combining semantic understanding with keyword relevance to deliver more accurate, robust, and context-aware search results. 
- -## What Is Hybrid Search? - -Hybrid search enhances relevancy by combining the scores or rankings from multiple search algorithms, typically: - -* **BM25** for keyword relevance -* **kNN** for semantic proximity in vector space - -CrateDB lets you implement hybrid search natively in SQL using **Common Table Expressions (CTEs)** and **scoring fusion techniques**, such as: +CrateDB supports **hybrid search** by combining **vector similarity search** (kNN) and **term-based full-text search** (BM25) in a single SQL query. CrateDB lets you implement hybrid search natively in SQL using **Common Table Expressions (CTEs)** and **scoring fusion techniques**, such as: * **Convex combination** (weighted sum of scores) * **Reciprocal Rank Fusion (RRF)** @@ -91,24 +74,3 @@ You can adjust the weighting (`0.5`) depending on your desired balance between k | 0.03057 | 8 | 3 | Usage | > RRF rewards documents that rank highly across multiple methods, regardless of exact score values. - -## Why Use Hybrid Search? - -| Benefit | Description | -| ------------------------- | ----------------------------------------------------------------- | -| 🔍 **Improved relevance** | Combines semantic and keyword-based matches | -| ⚙️ **Pure SQL** | No DSLs or external services—runs directly in CrateDB | -| ⚡ **High performance** | Built on Apache Lucene with CrateDB’s distributed SQL engine | -| 🔄 **Flexible ranking** | Use scoring functions (convex, RRF, etc.) based on use case needs | - -## Usage in Applications - -Hybrid search is particularly effective for: - -* **Knowledge bases** -* **Product or document search** -* **Multilingual content search** -* **FAQ bots and semantic assistants** -* **AI-powered search experiences** - -It allows applications to go beyond keyword matching, incorporating vector similarity while still respecting domain-specific terms. 
diff --git a/docs/start/query/search/vector.md b/docs/start/query/search/vector.md index f87a6e7e..c738e4c1 100644 --- a/docs/start/query/search/vector.md +++ b/docs/start/query/search/vector.md @@ -1,21 +1,19 @@ # Vector search +Vector search retrieves the most semantically similar items to a query vector using **Approximate Nearest Neighbor (ANN)** algorithms (e.g., HNSW via Lucene). + CrateDB supports **native vector search**, enabling you to perform **similarity-based retrieval** directly in SQL, without needing a separate vector database or search engine. Whether you're powering **semantic search**, **recommendation engines**, **anomaly detection**, or **AI-enhanced applications**, CrateDB lets you store, manage, and search vector embeddings at scale **right alongside your structured, JSON, and full-text data.** -## What Is Vector Search? - -Vector search retrieves the most semantically similar items to a query vector using **Approximate Nearest Neighbor (ANN)** algorithms (e.g., HNSW via Lucene). CrateDB provides unified SQL support for this via `KNN_MATCH`. - ## Why CrateDB for Vector Search? 
-| FLOAT\_VECTOR | Store embeddings up to 2048 dimensions |
-| ------------------- | ------------------------------------------------------------ |
-| KNN\_MATCH | SQL-native k-nearest neighbor function with `_score` support |
-| VECTOR\_SIMILARITY | Compute similarity scores between vectors in queries |
-| Real-time indexing | Fresh vectors are immediately searchable |
-| Hybrid queries | Combine vector search with filters, full-text, and JSON |
+| FLOAT\_VECTOR | Store embeddings up to 2048 dimensions |
+| ------------------ | ------------------------------------------------------------ |
+| KNN\_MATCH | SQL-native k-nearest neighbor function with `_score` support |
+| VECTOR\_SIMILARITY | Compute similarity scores between vectors in queries |
+| Real-time indexing | Fresh vectors are immediately searchable |
+| Hybrid queries | Combine vector search with filters, full-text, and JSON |

 ## Common Query Patterns

@@ -36,7 +34,7 @@ Returns top 3 most similar embeddings.
 SELECT product_name, _score
 FROM products
 WHERE category = 'shoes'
-  AND KNN_MATCH(features, [0.2, 0.1, 0.3], 5)
+  AND KNN_MATCH(features, [0.2, 0.1, …], 5)
 ORDER BY _score DESC;
 ```

@@ -81,7 +79,7 @@ WHERE KNN_MATCH(vec, [query_emb], 3)
 ORDER BY _score DESC;
 ```

-Anomaly Detection
+### Anomaly Detection

 ```sql
 SELECT *
@@ -92,43 +90,9 @@ ORDER BY _score ASC
 LIMIT 1;
 ```

-## Performance & Indexing Tips
-
-| Tip | Benefit |
-| ---------------------------------- | ------------------------------------------------------- |
-| Use `FLOAT_VECTOR` | Efficiency with fixed-size arrays up to 2048 dimensions |
-| Create HNSW index when supported | Enables fast ANN queries via Lucene |
-| Consistent vector length | All embeddings must match column definition |
-| Pre-filter with structured filters | Reduces scanning overhead |
-| Tune `KNN_MATCH` | Adjust neighbor count per shard or globally |
-
-## When to Use CrateDB for Vector Search
-
-Use CrateDB when you need to:
-
-* Execute **semantic search** within your primary database
-* Avoid managing an external vector database
-* Build **hybrid queries** combining embeddings and metadata
-* Scale to real-time pipelines with **millions of vectors**
-* Keep everything accessible through SQL
-
-## Related Features
-
-| Feature | Description |
-| ------------------ | ----------------------------------------------- |
-| FLOAT\_VECTOR | Native support for high-dimensional arrays |
-| KNN\_MATCH | Core SQL predicate for vector similarity search |
-| VECTOR\_SIMILARITY | Compute proximity scores in SQL |
-| Lucene HNSW ANN | Efficient graph-based search engine |
-| Hybrid search | Combine ANN search with full-text, geo, JSON |
-
-## Learn More
+## Further Learning & Resources

 * [Vector Search Guide](https://cratedb.com/docs/guide/feature/search/vector/index.html)
 * `KNN_MATCH` & `VECTOR_SIMILARITY` reference
 * [Intro Blog: Vector support & KNN search in CrateDB](https://cratedb.com/blog/unlocking-the-power-of-vector-support-and-knn-search-in-cratedb)
 * [LangChain & Vector Store integration](https://cratedb.com/docs/guide/domain/ml/index.html)
-
-## Summary
-
-CrateDB’s **vector search** empowers developers to build **AI-driven applications** without the complexity of separate infrastructure. Use `KNN_MATCH` for fast retrieval, combine it with filters, metadata, or textual logic, and stay entirely within SQL while scaling across your cluster seamlessly.
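
Note for readers of this patch: the filtered `KNN_MATCH` query in the `@@ -36,7 +34,7 @@` hunk combines a structured predicate with nearest-neighbor retrieval. The Python below is a brute-force sketch of that idea only. It does not describe CrateDB's implementation, which the page says relies on approximate search (HNSW via Lucene). The helper names, the sample rows, and the score mapping `1 / (1 + squared distance)` are assumptions made for illustration.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Map squared Euclidean distance to (0, 1]; identical vectors score 1.0.

    The 1 / (1 + d^2) mapping is an assumption of this sketch.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared)


def filtered_top_k(
    rows: List[Dict], query: Sequence[float], k: int, category: str
) -> List[Tuple[float, str]]:
    """Exact top-k by similarity among rows that pass the structured filter."""
    scored = [
        (similarity(row["features"], query), row["product_name"])
        for row in rows
        if row["category"] == category
    ]
    # Highest score first, mirroring ORDER BY _score DESC in the SQL query.
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    # Hypothetical sample data; names and values are made up for the sketch.
    products = [
        {"product_name": "runner", "category": "shoes", "features": [0.2, 0.1, 0.3]},
        {"product_name": "boot", "category": "shoes", "features": [0.9, 0.8, 0.7]},
        {"product_name": "cap", "category": "hats", "features": [0.2, 0.1, 0.3]},
    ]
    for score, name in filtered_top_k(products, [0.2, 0.1, 0.3], 5, "shoes"):
        print(name, round(score, 3))
```

The sketch scans every row, which is fine for a handful of vectors. An ANN index trades some exactness for speed on large collections.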

From bbb6c7efaa416e024a5badd3f0aa8f88c622dbf3 Mon Sep 17 00:00:00 2001
From: Kenneth Geisshirt
Date: Wed, 15 Oct 2025 17:59:20 +0200
Subject: [PATCH 4/6] Start/Search: Remove AI slop from pages about Geospatial search

---
 docs/start/query/search/geo.md | 76 +++++-----------------------------
 1 file changed, 11 insertions(+), 65 deletions(-)

diff --git a/docs/start/query/search/geo.md b/docs/start/query/search/geo.md
index 66e91276..a5ae185a 100644
--- a/docs/start/query/search/geo.md
+++ b/docs/start/query/search/geo.md
@@ -1,8 +1,4 @@
-# Geo search
-
-CrateDB natively supports geospatial data types and spatial queries, allowing you to store, index, and efficiently query geographic data using SQL. Built on Apache Lucene, CrateDB offers powerful location-based search capabilities at scale.
-
-## Overview
+# Geospatial search

 CrateDB enables geospatial search using **Lucene’s Prefix Tree** and **BKD Tree** indexing structures. With CrateDB, you can:

@@ -12,59 +8,20 @@ CrateDB enables geospatial search using **Lucene’s Prefix Tree** and **BKD Tre

 You interact with geospatial data through SQL, combining ease of use with advanced capabilities.

-## Geospatial Data Types
-
-CrateDB supports two primary geospatial types:
-
-### `GEO_POINT`
-
-* Represents a single point using latitude and longitude.
-* Can be inserted as:
-  * An array: `[longitude, latitude]`
-  * A WKT string (e.g. `'POINT (13.4050 52.5200)'`)
-
-### `GEO_SHAPE`
-
-* Represents more complex 2D shapes defined via GeoJSON or WKT formats.
-* Supported geometry types:
-  * `Point`, `MultiPoint`
-  * `LineString`, `MultiLineString`
-  * `Polygon`, `MultiPolygon`
-  * `GeometryCollection`
-* Insertable using:
-  * A GeoJSON object
-  * A WKT string
-
-## Inserting Spatial Data
-
-You can insert geospatial values using either **GeoJSON** or **WKT** formats.
-
-**Examples**:
-
-```sql
--- Inserting a point
-INSERT INTO locations (name, coordinates)
-VALUES ('Berlin', [13.4050, 52.5200]);
-
--- Inserting a shape (WKT format)
-INSERT INTO parks (name, area)
-VALUES ('Central Park', 'POLYGON ((...))');
-```
+See the Data Modelling (!!! add link) section for details of Data Types and how to insert data.

 ## Querying Geospatial Data

 CrateDB supports several SQL functions and predicates to work with geospatial data:

-### Common Functions
-
-| Function | Description |
-| -------------------------------------- | ---------------------------------------------------- |
-| `distance(p1, p2)` | Computes the distance (in meters) between two points |
-| `within(shape, region)` | Checks if a shape is fully within another shape |
-| `intersects(shape1, shape2)` | Checks if two shapes intersect |
-| `area(shape)` | Returns the area of a given shape |
-| `latitude(point)` / `longitude(point)` | Extracts lat/lon from a `GEO_POINT` |
-| `geohash(point)` | Returns the geohash representation of a point |
+| Function | Description |
+| -------------------------------------- | -------------------------------------------------------------------------------- |
+| `distance(p1, p2)` | Computes the distance (in meters) between two points using the Haversine formula |
+| `within(shape, region)` | Checks if a shape is fully within another shape |
+| `intersects(shape1, shape2)` | Checks if two shapes intersect |
+| `area(shape)` | Returns the area of a given shape in square degrees using geodetic awareness |
+| `latitude(point)` / `longitude(point)` | Extracts lat/lon from a `GEO_POINT` |
+| `geohash(point)` | Returns a 12-character geohash representation of a point |

 ### MATCH Predicate

@@ -112,15 +69,4 @@ You can choose and configure the indexing method when defining your table schema

 While CrateDB can perform **exact computations** on complex geometries (e.g. large polygons, geometry collections), these can be computationally expensive. Choose your index strategy carefully based on your query patterns.

-## Defining a Geospatial Column
-
-Here’s how to define a `GEO_SHAPE` column with a specific index:
-
-```sql
-CREATE TABLE regions (
-  name TEXT,
-  area GEO_SHAPE INDEX USING 'quadtree'
-);
-```
-
-For full details, refer to the Geo Shape Column Definition section in the reference.
+For full details, refer to the Geo Shape Column Definition section (!!! add link) in the reference.

From efb75e9f637d080583cffb159aad29548922c70d Mon Sep 17 00:00:00 2001
From: Andreas Motl
Date: Wed, 15 Oct 2025 18:27:58 +0200
Subject: [PATCH 5/6] Start/Search: Copy-edit the "Further reading" sections at page footers

---
 docs/start/query/search/fulltext.md | 42 ++++++++++++++++++++++++-----
 docs/start/query/search/geo.md | 30 +++++++++++++++++++++
 docs/start/query/search/hybrid.md | 31 +++++++++++++++++++++
 docs/start/query/search/vector.md | 39 +++++++++++++++++++++++----
 4 files changed, 131 insertions(+), 11 deletions(-)

diff --git a/docs/start/query/search/fulltext.md b/docs/start/query/search/fulltext.md
index 07580aff..3d0a1fb7 100644
--- a/docs/start/query/search/fulltext.md
+++ b/docs/start/query/search/fulltext.md
@@ -1,3 +1,4 @@
+(start-fulltext)=
 # Full-text search

 Unlike exact-match filters, full-text search allows **fuzzy, linguistic matching** on human language text. It tokenizes input, analyzes language, and searches for **tokens, stems, synonyms**, etc.
@@ -111,9 +112,38 @@ SELECT * FROM docs WHERE MATCH(text, 'power outage') USING 'english';
 | Use `MATCH()` not `LIKE` | Full-text is more performant and relevant |
 | Combine with filters | Boost performance using `WHERE` clauses |

-## Further Learning & Resources
-
-* Full-text Search Data Model
-* MATCH Clause Documentation
-* How CrateDB Differs from Elasticsearch
-* Tutorial: Full-text Search on Logs
+## Further reading
+
+:::::{grid} 1 3 3 3
+:margin: 4 4 0 0
+:padding: 0
+:gutter: 2
+
+::::{grid-item-card} {material-outlined}`article;1.5em` Reference
+:columns: 3
+- {ref}`crate-reference:sql_dql_fulltext_search`
+- {ref}`crate-reference:fulltext-indices`
+- {ref}`crate-reference:predicates_match`
+- {ref}`crate-reference:ref-create-analyzer`
+::::
+
+::::{grid-item-card} {material-outlined}`link;1.5em` Related
+:columns: 3
+- {ref}`start-geospatial`
+- {ref}`start-vector`
+- {ref}`start-hybrid`
+::::
+
+::::{grid-item-card} {material-outlined}`read_more;1.5em` Read more
+:columns: 6
+- [How CrateDB differs from Elasticsearch]
+- [Tutorial: Full-text search on logs]
+- {ref}`FTS feature details `
+- {ref}`Data modeling with FTS `
+::::
+
+:::::
+
+
+[How CrateDB differs from Elasticsearch]: https://archive.fosdem.org/2018/schedule/event/cratedb/
+[Tutorial: Full-text search on logs]: https://community.cratedb.com/t/storing-server-logs-on-cratedb-for-fast-search-and-aggregations/1562
diff --git a/docs/start/query/search/geo.md b/docs/start/query/search/geo.md
index a5ae185a..ce7ef021 100644
--- a/docs/start/query/search/geo.md
+++ b/docs/start/query/search/geo.md
@@ -1,3 +1,4 @@
+(start-geospatial)=
 # Geospatial search

 CrateDB enables geospatial search using **Lucene’s Prefix Tree** and **BKD Tree** indexing structures. With CrateDB, you can:
@@ -70,3 +71,32 @@ You can choose and configure the indexing method when defining your table schema
 While CrateDB can perform **exact computations** on complex geometries (e.g. large polygons, geometry collections), these can be computationally expensive. Choose your index strategy carefully based on your query patterns.

 For full details, refer to the Geo Shape Column Definition section (!!! add link) in the reference.
+
+## Further reading
+
+:::::{grid} 1 3 3 3
+:margin: 4 4 0 0
+:padding: 0
+:gutter: 2
+
+::::{grid-item-card} {material-outlined}`article;1.5em` Reference
+:columns: 3
+- {ref}`crate-reference:data-types-geo-point`
+- {ref}`crate-reference:data-types-geo-shape`
+- {ref}`crate-reference:sql_dql_geo_search`
+::::
+
+::::{grid-item-card} {material-outlined}`link;1.5em` Related
+:columns: 3
+- {ref}`start-fulltext`
+- {ref}`start-vector`
+- {ref}`start-hybrid`
+::::
+
+::::{grid-item-card} {material-outlined}`read_more;1.5em` Read more
+:columns: 6
+- {ref}`Geospatial feature details `
+- {ref}`Data modeling with geospatial data `
+::::
+
+:::::
diff --git a/docs/start/query/search/hybrid.md b/docs/start/query/search/hybrid.md
index fc54312c..0517b085 100644
--- a/docs/start/query/search/hybrid.md
+++ b/docs/start/query/search/hybrid.md
@@ -1,3 +1,4 @@
+(start-hybrid)=
 # Hybrid search

 While **vector search** provides powerful semantic retrieval based on machine learning models, it's not always optimal, especially when models are not fine-tuned for a specific domain. On the other hand, **traditional full-text search** (e.g., BM25 scoring) offers high precision on exact or keyword-based queries, with strong performance out of the box. **Hybrid search** blends these approaches, combining semantic understanding with keyword relevance to deliver more accurate, robust, and context-aware search results.
@@ -74,3 +75,33 @@ You can adjust the weighting (`0.5`) depending on your desired balance between k
 | 0.03057 | 8 | 3 | Usage |

 > RRF rewards documents that rank highly across multiple methods, regardless of exact score values.
+## Further reading
+
+:::::{grid} 1 3 3 3
+:margin: 4 4 0 0
+:padding: 0
+:gutter: 2
+
+::::{grid-item-card} {material-outlined}`article;1.5em` Reference
+:columns: 3
+- {ref}`crate-reference:sql_dql_fulltext_search`
+- {ref}`crate-reference:fulltext-indices`
+- {ref}`crate-reference:predicates_match`
+- {ref}`crate-reference:scalar_knn_match`
+- {ref}`crate-reference:scalar_vector_similarity`
+- {ref}`crate-reference:type-float_vector`
+::::
+
+::::{grid-item-card} {material-outlined}`link;1.5em` Related
+:columns: 3
+- {ref}`start-fulltext`
+- {ref}`start-geospatial`
+- {ref}`start-vector`
+::::
+
+::::{grid-item-card} {material-outlined}`read_more;1.5em` Read more
+:columns: 6
+- {ref}`Hybrid search feature details `
+::::
+
+:::::
diff --git a/docs/start/query/search/vector.md b/docs/start/query/search/vector.md
index c738e4c1..5b9a9f56 100644
--- a/docs/start/query/search/vector.md
+++ b/docs/start/query/search/vector.md
@@ -1,3 +1,4 @@
+(start-vector)=
 # Vector search

 Vector search retrieves the most semantically similar items to a query vector using **Approximate Nearest Neighbor (ANN)** algorithms (e.g., HNSW via Lucene).
@@ -90,9 +91,37 @@ ORDER BY _score ASC
 LIMIT 1;
 ```

-## Further Learning & Resources
+## Further reading

-* [Vector Search Guide](https://cratedb.com/docs/guide/feature/search/vector/index.html)
-* `KNN_MATCH` & `VECTOR_SIMILARITY` reference
-* [Intro Blog: Vector support & KNN search in CrateDB](https://cratedb.com/blog/unlocking-the-power-of-vector-support-and-knn-search-in-cratedb)
-* [LangChain & Vector Store integration](https://cratedb.com/docs/guide/domain/ml/index.html)
+:::::{grid} 1 3 3 3
+:margin: 4 4 0 0
+:padding: 0
+:gutter: 2
+
+::::{grid-item-card} {material-outlined}`article;1.5em` Reference
+:columns: 3
+- {ref}`crate-reference:type-float_vector`
+- {ref}`crate-reference:scalar_knn_match`
+- {ref}`crate-reference:scalar_vector_similarity`
+::::
+
+::::{grid-item-card} {material-outlined}`link;1.5em` Related
+:columns: 3
+- {ref}`start-fulltext`
+- {ref}`start-geospatial`
+- {ref}`start-hybrid`
+::::
+
+::::{grid-item-card} {material-outlined}`read_more;1.5em` Read more
+:columns: 6
+- [Intro Blog: Vector support & KNN search in CrateDB]
+- {ref}`Vector search feature details `
+- {ref}`Data modeling with vector data `
+- {ref}`machine-learning`
+- {ref}`Integration with LangChain `
+::::
+
+:::::
+
+
+[Intro Blog: Vector support & KNN search in CrateDB]: https://cratedb.com/blog/unlocking-the-power-of-vector-support-and-knn-search-in-cratedb

From 5bcf3c731a407c891a5c16ee17bba3691c92bf8c Mon Sep 17 00:00:00 2001
From: Andreas Motl
Date: Wed, 15 Oct 2025 18:45:06 +0200
Subject: [PATCH 6/6] Start/Search: More copy-editing

- Add links where some were missing
- Less bold
- Native notes instead of blockquotes
- Fix bogus SQL queries
- More cross-linking
- Muted teaser texts at top of pages
- Less mixed case
---
 docs/start/query/search/fulltext.md | 10 +++++++---
 docs/start/query/search/geo.md | 12 +++++++-----
 docs/start/query/search/hybrid.md | 25 +++++++++++++++++++------
 docs/start/query/search/vector.md | 19 ++++++++++++-------
 4 files changed, 45 insertions(+), 21 deletions(-)

diff --git a/docs/start/query/search/fulltext.md b/docs/start/query/search/fulltext.md
index 3d0a1fb7..df5c4be9 100644
--- a/docs/start/query/search/fulltext.md
+++ b/docs/start/query/search/fulltext.md
@@ -1,11 +1,15 @@
 (start-fulltext)=
 # Full-text search

-Unlike exact-match filters, full-text search allows **fuzzy, linguistic matching** on human language text. It tokenizes input, analyzes language, and searches for **tokens, stems, synonyms**, etc.
+:::{div} sd-text-muted
+CrateDB enables real-time full-text search at scale.
+:::

-CrateDB supports powerful **full-text search** capabilities directly via the `FULLTEXT` index and the `MATCH()` SQL predicate. This allows you to **combine unstructured search with structured filtering and aggregations**—all in one query, with no need for external search systems like Elasticsearch.
+Unlike exact-match filters, **full-text search** allows **fuzzy, linguistic matching** on human language text. It tokenizes input, analyzes language, and searches for **tokens, stems, synonyms**, etc.

-Whether you're working with log messages, customer feedback, machine-generated data, or IoT event streams, CrateDB enables **real-time full-text search at scale**.
+CrateDB supports powerful full-text search capabilities directly via the `FULLTEXT` index and the `MATCH()` SQL predicate. This allows you to **combine unstructured search with structured filtering and aggregations**—all in one query, with no need for external search systems like Elasticsearch.
+
+CrateDB supports you whether you are working with log messages, customer feedback, machine-generated data, or IoT event streams.

 ## Why CrateDB for Full-text Search?
diff --git a/docs/start/query/search/geo.md b/docs/start/query/search/geo.md
index ce7ef021..477af93e 100644
--- a/docs/start/query/search/geo.md
+++ b/docs/start/query/search/geo.md
@@ -1,15 +1,17 @@
 (start-geospatial)=
 # Geospatial search

-CrateDB enables geospatial search using **Lucene’s Prefix Tree** and **BKD Tree** indexing structures. With CrateDB, you can:
+:::{div} sd-text-muted
+Query geospatial data through SQL, combining ease of use with advanced capabilities.
+:::
+
+CrateDB enables geospatial search using **Lucene’s prefix tree** and **BKD tree** indexing structures. With CrateDB, you can:

 * Store and index geographic **points** and **shapes**
 * Perform spatial queries using **bounding boxes**, **circles**, **donut shapes**, and more
 * Filter, sort, or boost results by **distance**, **area**, or **spatial relationship**

-You interact with geospatial data through SQL, combining ease of use with advanced capabilities.
-
-See the Data Modelling (!!! add link) section for details of Data Types and how to insert data.
+See the {ref}`data-modelling` section for details of data types and how to insert data.

 ## Querying Geospatial Data

@@ -70,7 +72,7 @@ You can choose and configure the indexing method when defining your table schema

 While CrateDB can perform **exact computations** on complex geometries (e.g. large polygons, geometry collections), these can be computationally expensive. Choose your index strategy carefully based on your query patterns.

-For full details, refer to the Geo Shape Column Definition section (!!! add link) in the reference.
+For full details, refer to the Geo Shape column definition section in the reference documentation.

 ## Further reading

diff --git a/docs/start/query/search/hybrid.md b/docs/start/query/search/hybrid.md
index 0517b085..94e7d3bc 100644
--- a/docs/start/query/search/hybrid.md
+++ b/docs/start/query/search/hybrid.md
@@ -1,22 +1,27 @@
 (start-hybrid)=
 # Hybrid search

+:::{div} sd-text-muted
+Combine vector similarity (kNN) and term-based full-text (BM25)
+searches in a single SQL query.
+:::
+
 While **vector search** provides powerful semantic retrieval based on machine learning models, it's not always optimal, especially when models are not fine-tuned for a specific domain. On the other hand, **traditional full-text search** (e.g., BM25 scoring) offers high precision on exact or keyword-based queries, with strong performance out of the box. **Hybrid search** blends these approaches, combining semantic understanding with keyword relevance to deliver more accurate, robust, and context-aware search results.

-Hybrid search is particularly effective for **Knowledge bases, Product or document search, Multilingual content search, FAQ bots and semantic assistants**, and **AI-powered search experiences.** It allows applications to go beyond keyword matching, incorporating vector similarity while still respecting domain-specific terms.
+Hybrid search is particularly effective for **knowledge bases, product or document search, multilingual content search, FAQ bots and semantic assistants**, and **AI-powered search experiences.** It allows applications to go beyond keyword matching, incorporating vector similarity while still respecting domain-specific terms.

-CrateDB supports **hybrid search** by combining **vector similarity search** (kNN) and **term-based full-text search** (BM25) in a single SQL query. CrateDB lets you implement hybrid search natively in SQL using **Common Table Expressions (CTEs)** and **scoring fusion techniques**, such as:
+CrateDB supports **hybrid search** by combining **vector similarity search** (kNN) and **term-based full-text search** (BM25) in a single SQL query. CrateDB lets you implement hybrid search natively in SQL using **common table expressions (CTEs)** and **scoring fusion techniques**, such as:

 * **Convex combination** (weighted sum of scores)
-* **Reciprocal Rank Fusion (RRF)**
+* **Reciprocal rank fusion (RRF)**

 ## Supported Search Capabilities in CrateDB

 | Search Type | Function | Description |
-| --------------------- | ------------- | ---------------------------------------------- |
+| --------------------- | ------------- |------------------------------------------------|
 | **Vector search** | `KNN_MATCH()` | Finds vectors closest to a given vector |
 | **Full-text search** | `MATCH()` | Uses Lucene's BM25 scoring |
-| **Geospatial search** | `MATCH()` | For shapes and points (see: Geospatial Search) |
+| **Geospatial search** | `MATCH()` | For shapes and points (see: Geospatial search) |

 CrateDB enables all three through **pure SQL**, allowing flexible combinations and advanced analytics.

@@ -74,7 +79,11 @@ You can adjust the weighting (`0.5`) depending on your desired balance between k
 | 0.03105 | 7 | 2 | Searching On Multiple Columns |
 | 0.03057 | 8 | 3 | Usage |

-> RRF rewards documents that rank highly across multiple methods, regardless of exact score values.
+:::{note}
+RRF rewards documents that rank highly across multiple methods,
+regardless of exact score values.
+:::
+
 ## Further reading

 :::::{grid} 1 3 3 3
@@ -101,7 +110,11 @@ You can adjust the weighting (`0.5`) depending on your desired balance between k

 ::::{grid-item-card} {material-outlined}`read_more;1.5em` Read more
 :columns: 6
+- [Doing Hybrid Search in CrateDB]
 - {ref}`Hybrid search feature details `
 ::::

 :::::
+
+
+[Doing Hybrid Search in CrateDB]: https://cratedb.com/blog/hybrid-search-explained
diff --git a/docs/start/query/search/vector.md b/docs/start/query/search/vector.md
index 5b9a9f56..b0403241 100644
--- a/docs/start/query/search/vector.md
+++ b/docs/start/query/search/vector.md
@@ -1,18 +1,23 @@
 (start-vector)=
 # Vector search

-Vector search retrieves the most semantically similar items to a query vector using **Approximate Nearest Neighbor (ANN)** algorithms (e.g., HNSW via Lucene).
+:::{div} sd-text-muted
+Store, manage, and search vector embeddings at scale.
+:::
+
+Vector search retrieves the most semantically similar items to a query vector using **approximate nearest neighbor (ANN)** algorithms (e.g., HNSW via Lucene).

 CrateDB supports **native vector search**, enabling you to perform **similarity-based retrieval** directly in SQL, without needing a separate vector database or search engine.

-Whether you're powering **semantic search**, **recommendation engines**, **anomaly detection**, or **AI-enhanced applications**, CrateDB lets you store, manage, and search vector embeddings at scale **right alongside your structured, JSON, and full-text data.**
+Whether you're powering **semantic search**, **recommendation engines**, **anomaly detection**, or **AI-enhanced applications**, CrateDB lets you manage vector data **right alongside your structured, JSON, and full-text data.**

 ## Why CrateDB for Vector Search?
-| FLOAT\_VECTOR | Store embeddings up to 2048 dimensions |
-| ------------------ | ------------------------------------------------------------ |
-| KNN\_MATCH | SQL-native k-nearest neighbor function with `_score` support |
-| VECTOR\_SIMILARITY | Compute similarity scores between vectors in queries |
+| Feature | Benefit |
+|--------------------|--------------------------------------------------------------|
+| FLOAT_VECTOR | Store embeddings up to 2048 dimensions |
+| KNN_MATCH | SQL-native k-nearest neighbor function with `_score` support |
+| VECTOR_SIMILARITY | Compute similarity scores between vectors in queries |
 | Real-time indexing | Fresh vectors are immediately searchable |
 | Hybrid queries | Combine vector search with filters, full-text, and JSON |

 ## Common Query Patterns
@@ -35,7 +40,7 @@ Returns top 3 most similar embeddings.
 SELECT product_name, _score
 FROM products
 WHERE category = 'shoes'
-  AND KNN_MATCH(features, [0.2, 0.1, …], 5)
+  AND KNN_MATCH(features, [0.2, 0.1, 0.3], 5)
 ORDER BY _score DESC;
 ```