From dd62b9b730ad144699a6a3a2462d7c45c05a7b33 Mon Sep 17 00:00:00 2001 From: Stephane Castellani Date: Fri, 8 Aug 2025 01:42:43 +0200 Subject: [PATCH 01/11] Data modelling: Add new section --- docs/start/index.md | 1 + docs/start/modelling/fulltext.md | 150 ++++++++++++++++++ docs/start/modelling/geospatial.md | 101 +++++++++++++ docs/start/modelling/index.md | 21 +++ docs/start/modelling/json.md | 227 ++++++++++++++++++++++++++++ docs/start/modelling/primary-key.md | 174 +++++++++++++++++++++ docs/start/modelling/relational.md | 178 ++++++++++++++++++++++ docs/start/modelling/timeseries.md | 137 +++++++++++++++++ docs/start/modelling/vector.md | 151 ++++++++++++++++++ 9 files changed, 1140 insertions(+) create mode 100644 docs/start/modelling/fulltext.md create mode 100644 docs/start/modelling/geospatial.md create mode 100644 docs/start/modelling/index.md create mode 100644 docs/start/modelling/json.md create mode 100644 docs/start/modelling/primary-key.md create mode 100644 docs/start/modelling/relational.md create mode 100644 docs/start/modelling/timeseries.md create mode 100644 docs/start/modelling/vector.md diff --git a/docs/start/index.md b/docs/start/index.md index 896d9fbf..520f45e0 100644 --- a/docs/start/index.md +++ b/docs/start/index.md @@ -110,6 +110,7 @@ and explore key features. first-steps connect query/index +modelling/index Ingesting data <../ingest/index> application/index going-further diff --git a/docs/start/modelling/fulltext.md b/docs/start/modelling/fulltext.md new file mode 100644 index 00000000..41342afc --- /dev/null +++ b/docs/start/modelling/fulltext.md @@ -0,0 +1,150 @@ +# Full-text data + +CrateDB features **native full‑text search** powered by **Apache Lucene** and Okapi BM25 ranking, fully accessible via SQL. You can blend this seamlessly with other data types—JSON, time‑series, geospatial, vectors and more—all in a single SQL query platform. + +## 1. 
Data Types & Indexing Strategy + +* By default, all text columns are indexed as `plain` (raw, unanalyzed)—efficient for equality search but not suitable for full‑text queries +* To enable full‑text search, you must define a **FULLTEXT index** with an optional language **analyzer**, e.g.: + +```sql +CREATE TABLE documents ( + title TEXT, + body TEXT, + INDEX ft_body USING FULLTEXT(body) WITH (analyzer = 'english') +); +``` + +* You may also define **composite full-text indices**, indexing multiple columns at once: + +```sql +INDEX ft_all USING FULLTEXT(title, body) WITH (analyzer = 'english'); +``` + +## 2. Index Design & Custom Analyzers + +| Component | Purpose | +| ----------------- | ---------------------------------------------------------------------------- | +| **Analyzer** | Tokenizer + token filters + char filters; splits text into searchable terms. | +| **Tokenizer** | Splits on whitespace/characters. | +| **Token Filters** | e.g. lowercase, stemming, stop‑word removal. | +| **Char Filters** | Pre-processing (e.g. stripping HTML). | + +CrateDB offers **built-in analyzers** for many languages (e.g. English, German, French). You can also **create custom analyzers**: + +```sql +CREATE ANALYZER myanalyzer ( + TOKENIZER whitespace, + TOKEN_FILTERS (lowercase, kstem), + CHAR_FILTERS (html_strip) +); +``` + +Or **extend** a built-in analyzer: + +```sql +CREATE ANALYZER german_snowball + EXTENDS snowball + WITH (language = 'german'); +``` + +## 3. Querying: MATCH Predicate & Scoring + +CrateDB uses the SQL `MATCH` predicate to run full‑text queries against full‑text indices. It optionally returns a relevance score `_score`, ranked via BM25. 
+ +**Basic usage:** + +```sql +SELECT title, _score +FROM documents +WHERE MATCH(ft_body, 'search term') +ORDER BY _score DESC; +``` + +**Searching multiple indices with weighted ranking:** + +```sql +MATCH((ft_title boost 2.0, ft_body), 'keyword') +``` + +**You can configure match options like:** + +* `using best_fields` (default) +* `fuzziness = 1` (tolerate minor typos) +* `operator = 'AND'` or `OR` +* `slop = N` for phrase proximity + +**Example: Fuzzy Search** + +```sql +SELECT firstname, lastname, _score +FROM person +WHERE MATCH(lastname_ft, 'bronw') USING best_fields WITH (fuzziness = 2) +ORDER BY _score DESC; +``` + +This matches similar names like ‘brown’ or ‘browne’. + +**Example: Multi‑language Composite Search** + +```sql +CREATE TABLE documents ( + name STRING PRIMARY KEY, + description TEXT, + INDEX ft_en USING FULLTEXT(description) WITH (analyzer = 'english'), + INDEX ft_de USING FULLTEXT(description) WITH (analyzer = 'german') +); +SELECT name, _score +FROM documents +WHERE MATCH((ft_en, ft_de), 'jupm OR verwrlost') USING best_fields WITH (fuzziness = 1) +ORDER BY _score DESC; +``` + +## 4. Use Cases & Integration + +CrateDB is ideal for searching **semi-structured large text data**—product catalogs, article archives, user-generated content, descriptions and logs. + +Because full-text indices are updated in real-time, search results reflect newly ingested data almost instantly. This tight integration avoids the complexity of maintaining separate search infrastructure. + +You can **combine full-text search with other data domains**, for example: + +```sql +SELECT * +FROM listings +WHERE + MATCH(ft_desc, 'garden deck') AND + price < 500000 AND + within(location, :polygon); +``` + +This blend lets you query by text relevance, numeric filters, and spatial constraints, all in one. + +## 5. Architectural Strengths + +* **Built on Lucene inverted index + BM25**, offering relevance ranking comparable to search engines. 
* **Scale horizontally across clusters**, while maintaining fast indexing and search even on high-volume datasets.
* **Integrated SQL interface**: eliminates the need for separate search services like Elasticsearch or Solr.

## 6. Best Practices Checklist

| Topic | Recommendation |
| -------------------- | ---------------------------------------------------------------------------------- |
| Schema & Indexing | Define full-text indices at table creation; plain indices are insufficient. |
| Language Support | Pick the built-in analyzer matching your content language. |
| Composite Search | Use multi-column indices to search across title/body/fields. |
| Query Tuning | Configure fuzziness, operator, boost, and slop options. |
| Scoring & Ranking | Use `_score` and ordering to sort by relevance. |
| Real-time Updates | Full-text indices update automatically on INSERT/UPDATE. |
| Multi-model Queries | Combine full-text search with geo, JSON, and numerical filters. |
| Analyzer Limitations | Understand phrase\_prefix caveats at scale; tune analyzer/tokenizer appropriately. |

## 7. Further Learning & Resources

* **CrateDB Full‑Text Search Guide**: details index creation, analyzers, MATCH usage.
* **FTS Options & Advanced Features**: fuzziness, synonyms, multi-language idioms.
* **Hands‑On Academy Course**: explore FTS on real datasets (e.g. Chicago neighborhoods).
* **CrateDB Community Insights**: real‑world advice and experiences from users.

## 8. Summary

CrateDB combines powerful Lucene‑based full‑text search capabilities with SQL, making it easy to model and query textual data at scale. It supports fuzzy matching, multi-language analysis, composite indexing, and integrates fully with other data types for rich, multi-model queries.
Whether you're building document search, catalog lookup, or content analytics—CrateDB offers a flexible and scalable foundation.
diff --git a/docs/start/modelling/geospatial.md b/docs/start/modelling/geospatial.md new file mode 100644 index 00000000..7ddbc982 --- /dev/null +++ b/docs/start/modelling/geospatial.md @@ -0,0 +1,101 @@
# Geospatial data

CrateDB supports **real-time geospatial analytics at scale**, enabling you to store, query, and analyze location-based data using standard SQL over two dedicated types: **GEO\_POINT** and **GEO\_SHAPE**. You can seamlessly combine spatial data with full-text, vector, JSON, or time-series in the same SQL queries.

## 1. Geospatial Data Types

### **GEO\_POINT**

* Stores a single location via latitude/longitude.
* Insert using either a coordinate array `[lon, lat]` or WKT string `'POINT (lon lat)'`.
* Must be declared explicitly; dynamic schema inference will not detect the geo\_point type.

### **GEO\_SHAPE**

* Supports complex geometries (Point, LineString, Polygon, MultiPolygon, GeometryCollection) via GeoJSON or WKT.
* Indexed using geohash, quadtree, or BKD-tree, with configurable precision (e.g. `50m`) and error threshold.

## 2. Table Schema Example
```sql
CREATE TABLE parcel_zones (
    zone_id INTEGER PRIMARY KEY,
    name VARCHAR,
    area GEO_SHAPE,
    centroid GEO_POINT
)
WITH (column_policy = 'dynamic');
```
+ +* Use `GEO_SHAPE` to define zones or service areas. +* `GEO_POINT` allows for simple referencing (e.g. store approximate center of zone). + +## 3. Core Geospatial Functions + +CrateDB provides key scalar functions for spatial operations: + +* **`distance(geo_point1, geo_point2)`** – returns meters using the Haversine formula (e.g. compute distance between two points) +* **`within(shape1, shape2)`** – true if one geo object is fully contained within another +* **`intersects(shape1, shape2)`** – true if shapes overlap or touch anywhere +* **`latitude(geo_point)` / `longitude(geo_point)`** – extract individual coordinates +* **`geohash(geo_point)`** – compute a 12‑character geohash for the point +* **`area(geo_shape)`** – returns approximate area in square degrees; uses geodetic awareness + +Note: More precise relational operations on shapes may bypass indexes and can be slower. + +## 4. Spatial Queries & Indexing + +CrateDB supports Lucene-based spatial indexing (Prefix Tree and BKD-tree structures) for efficient geospatial search. Use the `MATCH` predicate to leverage indices when filtering spatial data by bounding boxes, circles, polygons, etc. + +**Example: Find nearby assets** + +```sql +SELECT asset_id, DISTANCE(center_point, asset_location) AS dist +FROM assets +WHERE center_point = 'POINT(-1.234 51.050)'::GEO_POINT +ORDER BY dist +LIMIT 10; +``` + +**Example: Count incidents within service area** + +```sql +SELECT area_id, count(*) AS incident_count +FROM incidents +WHERE within(incidents.location, service_areas.area) +GROUP BY area_id; +``` + +**Example: Which zones intersect a flight path** + +```sql +SELECT zone_id, name +FROM flight_paths f +JOIN service_zones z +ON intersects(f.path_geom, z.area); +``` + +## 5. Real-World Examples: Chicago Use Cases + +* **311 calls**: Each record includes `location` as `GEO_POINT`. Queries use `within()` to find calls near a polygon around O’Hare airport. 
* **Community areas**: Polygon boundaries stored in `GEO_SHAPE`. Queries for intersections with arbitrary lines or polygons using `intersects()` return overlapping zones.
* **Taxi rides**: Pickup/drop-off locations stored as geo points. Use a `distance()` filter to compute trip distances and aggregate.

## 6. Architectural Strengths & Suitability

* Designed for **real-time geospatial tracking and analytics** (e.g. fleet tracking, mapping, location-layered apps).
* **Unified SQL platform**: spatial data can be combined with full-text search, JSON, vectors, time-series — in the same table or query.
* **High ingest and query throughput**, suitable for large-scale location-based workloads.

## 7. Best Practices Checklist
| Topic | Recommendation |
| ----------------------- | -------------------------------------------------------------------- |
| Data types | Declare `GEO_POINT`/`GEO_SHAPE` explicitly |
| Geometric formats | Use WKT or GeoJSON for insertions |
| Index tuning | Choose geohash/quadtree/BKD tree & adjust precision |
| Queries | Prefer MATCH for indexed filtering; use functions for precise checks |
| Joins & spatial filters | Use within/intersects to correlate spatial entities |
| Scale & performance | Index shapes; use distance/within filters early |
| Mixed-model integration | Combine spatial with JSON, full-text, vector, time-series |
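To illustrate the "prefer MATCH for indexed filtering" recommendation, here is a sketch that filters the `service_zones` table from the earlier join example through the spatial index rather than the exact scalar function. The polygon coordinates are placeholders, not taken from the dataset:

```sql
-- Index-backed containment check: fast, but only as precise as the index.
SELECT zone_id, name
FROM service_zones
WHERE MATCH (area, 'POLYGON ((-87.95 41.95, -87.85 41.95, -87.85 42.00, -87.95 42.00, -87.95 41.95))')
      USING within;
```

When exact geometry matters, re-check the reduced candidate set with the scalar `within()` function afterwards.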
## 8. Further Learning & Resources

* Official **Geospatial Search Guide** in CrateDB docs, detailing geospatial types, indexing, and MATCH predicate usage.
* CrateDB Academy **Hands-on: Geospatial Data** modules, with sample datasets (Chicago 311 calls, taxi rides, community zones) and example queries.
* CrateDB Blog: **Geospatial Queries with CrateDB** – outlines capabilities, limitations, and practical use cases (available since version 0.40).

## 9. Summary

CrateDB provides robust support for geospatial modeling through clearly defined data types (`GEO_POINT`, `GEO_SHAPE`), powerful scalar functions (`distance`, `within`, `intersects`, `area`), and Lucene‑based indexing for fast queries. It excels in high‑volume, real‑time spatial analytics and integrates smoothly with multi-model use cases. Whether storing vehicle positions, mapping regions, or enabling spatial joins—CrateDB’s geospatial layer makes it easy, scalable, and extensible.
diff --git a/docs/start/modelling/index.md b/docs/start/modelling/index.md new file mode 100644 index 00000000..a198efe1 --- /dev/null +++ b/docs/start/modelling/index.md @@ -0,0 +1,21 @@
# Data modelling

CrateDB provides a unified storage engine that supports different data types.
```{toctree}
:maxdepth: 1

relational
json
timeseries
geospatial
fulltext
vector
```

Because CrateDB is a distributed OLAP database designed to store large volumes
of data, it needs a few special considerations on certain details.
```{toctree}
:maxdepth: 1

primary-key
```
diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md new file mode 100644 index 00000000..78583efa --- /dev/null +++ b/docs/start/modelling/json.md @@ -0,0 +1,227 @@
# JSON data

CrateDB combines the flexibility of NoSQL document stores with the power of SQL.
It enables you to store, query, and index **semi-structured JSON data** using **standard SQL**, making it an excellent choice for applications that handle diverse or evolving schemas. + +CrateDB’s support for dynamic objects, nested structures, and dot-notation querying brings the best of both relational and document-based data modeling—without leaving the SQL world. + +## 1. Object (JSON) Columns + +CrateDB allows you to define **object columns** that can store JSON-style data structures. + +```sql +CREATE TABLE events ( + id UUID PRIMARY KEY, + timestamp TIMESTAMP, + payload OBJECT(DYNAMIC) +); +``` + +This allows inserting flexible, nested JSON data into `payload`: + +```json +{ + "user": { + "id": 42, + "name": "Alice" + }, + "action": "login", + "device": { + "type": "mobile", + "os": "iOS" + } +} +``` + +## 2. Column Policy: Strict vs Dynamic + +You can control how CrateDB handles unexpected fields in an object column: + +| Column Policy | Behavior | +| ------------- | ----------------------------------------------------------- | +| `DYNAMIC` | New fields are automatically added to the schema at runtime | +| `STRICT` | Only explicitly defined fields are allowed | +| `IGNORED` | Extra fields are stored but not indexed or queryable | + +Example with explicitly defined fields: + +```sql +CREATE TABLE sensor_data ( + id UUID PRIMARY KEY, + attributes OBJECT(STRICT) AS ( + temperature DOUBLE, + humidity DOUBLE + ) +); +``` + +## 3. Querying JSON Fields + +Use **dot notation** to access nested fields: + +```sql +SELECT payload['user']['name'], payload['device']['os'] +FROM events +WHERE payload['action'] = 'login'; +``` + +CrateDB also supports **filtering, sorting, and aggregations** on nested values: + +```sql +SELECT COUNT(*) +FROM events +WHERE payload['device']['os'] = 'Android'; +``` + +:::{note} +Dot-notation works for both explicitly and dynamically added fields. +::: + +## 4. 
Querying DYNAMIC OBJECTs + +To support querying DYNAMIC OBJECTs using SQL, where keys may not exist within an OBJECT, CrateDB provides the [error\_on\_unknown\_object\_key](https://cratedb.com/docs/crate/reference/en/latest/config/session.html#conf-session-error-on-unknown-object-key) session setting. It controls the behaviour when querying unknown object keys to dynamic objects. + +By default, CrateDB will raise an error if any of the queried object keys are unknown. When adjusting this setting to `false`, it will return `NULL` as the value of the corresponding key. + +```sql +cr> CREATE TABLE testdrive (item OBJECT(DYNAMIC)); +CREATE OK, 1 row affected (0.563 sec) + +cr> SELECT item['unknown'] FROM testdrive; +ColumnUnknownException[Column item['unknown'] unknown] + +cr> SET error_on_unknown_object_key = false; +SET OK, 0 rows affected (0.001 sec) + +cr> SELECT item['unknown'] FROM testdrive; ++-----------------+ +| item['unknown'] | ++-----------------+ ++-----------------+ +SELECT 0 rows in set (0.051 sec) +``` + +## 5. Arrays of Objects + +Store arrays of objects for multi-valued nested data: + +```sql +CREATE TABLE products ( + id UUID PRIMARY KEY, + name TEXT, + tags ARRAY(TEXT), + specs ARRAY(OBJECT AS ( + name TEXT, + value TEXT + )) +); +``` + +Query nested arrays with filters: + +```sql +SELECT * +FROM products +WHERE 'outdoor' = ANY(tags); +``` + +You can also filter by object array fields: + +```sql +SELECT * +FROM products +WHERE specs['name'] = 'battery' AND specs['value'] = 'AA'; +``` + +## 6. Combining Structured & Semi-Structured Data + +CrateDB supports **hybrid schemas**, mixing standard columns with JSON fields: + +```sql +CREATE TABLE logs ( + id UUID PRIMARY KEY, + service TEXT, + log_level TEXT, + metadata OBJECT(DYNAMIC), + created_at TIMESTAMP +); +``` + +This allows you to: + +* Query by fixed attributes (`log_level`) +* Flexibly store structured or unstructured metadata +* Add new fields on the fly without migrations + +## 7. 
Indexing Behavior + +CrateDB **automatically indexes** object fields if: + +* Column policy is `DYNAMIC` +* Field type can be inferred at insert time + +You can also explicitly define and index object fields: + +```sql +CREATE TABLE metrics ( + id UUID PRIMARY KEY, + data OBJECT(DYNAMIC) AS ( + cpu DOUBLE INDEX USING FULLTEXT, + memory DOUBLE + ) +); +``` + +To exclude fields from indexing, set: + +```sql +data['some_field'] INDEX OFF +``` + +:::{note} +Too many dynamic fields can lead to schema explosion. Use `STRICT` or `IGNORED` if needed. +::: + +## 8. Aggregating JSON Fields + +CrateDB allows full SQL-style aggregations on nested fields: + +```sql +SELECT AVG(payload['temperature']) AS avg_temp +FROM sensor_readings +WHERE payload['location'] = 'room1'; +``` + +CrateDB also supports **`GROUP BY`**, **`HAVING`**, and **window functions** on object fields. + +## 9. Use Cases for JSON Modeling + +| Use Case | Description | +| ------------------ | -------------------------------------------- | +| Logs & Traces | Unstructured payloads with flexible metadata | +| Sensor & IoT Data | Variable field schemas, nested measurements | +| Product Catalogs | Specs, tags, reviews in varying formats | +| User Profiles | Custom settings, device info, preferences | +| Telemetry / Events | Event streams with evolving structure | + +## 10. Best Practices + +| Area | Recommendation | +| ---------------- | -------------------------------------------------------------------- | +| Schema Evolution | Use `DYNAMIC` for flexibility, `STRICT` for control | +| Index Management | Avoid over-indexing rarely used fields | +| Nested Depth | Prefer flat structures or shallow nesting for performance | +| Column Mixing | Combine structured columns with JSON for hybrid models | +| Observability | Monitor number of dynamic columns using `information_schema.columns` | + +## 11. 
Further Learning & Resources + +* CrateDB Docs – Object Columns +* Working with JSON in CrateDB +* CrateDB Academy – Modeling with JSON +* Understanding Column Policies + +## 12. Summary + +CrateDB makes it easy to model **semi-structured JSON data** with full SQL support. Whether you're building a telemetry pipeline, an event store, or a product catalog, CrateDB offers the flexibility of a document store—while preserving the structure, indexing, and power of a relational engine. + +You don’t need to choose between JSON and SQL—**CrateDB gives you both.** diff --git a/docs/start/modelling/primary-key.md b/docs/start/modelling/primary-key.md new file mode 100644 index 00000000..a181e930 --- /dev/null +++ b/docs/start/modelling/primary-key.md @@ -0,0 +1,174 @@ +# Primary key strategies + +CrateDB is built for horizontal scalability and high ingestion throughput. To achieve this, operations must complete independently on each node—without central coordination. This design choice means CrateDB does **not** support traditional auto-incrementing primary key types like `SERIAL` in PostgreSQL or MySQL by default. + +This page explains why that is and walks you through **five common alternatives** to generate unique primary key values in CrateDB, including a recipe to implement your own auto-incrementing sequence mechanism when needed. + +## Why Auto-Increment Doesn't Exist in CrateDB + +In traditional RDBMS systems, auto-increment fields rely on a central counter. In a distributed system like CrateDB, this would create a **global coordination bottleneck**, limiting insert throughput and reducing scalability. + +Instead, CrateDB provides **flexibility**: you can choose a primary key strategy tailored to your use case, whether for strict uniqueness, time ordering, or external system integration. + +## Primary Key Strategies in CrateDB + +### 1. 
Use a Timestamp as a Primary Key

```sql
BIGINT DEFAULT now() PRIMARY KEY
```

**Pros**

* Auto-generated, always-increasing value
* Useful when records are timestamped anyway

**Cons**

* Can result in gaps
* Collisions possible if multiple records are created in the same millisecond

### 2. Use UUIDs (v4)

```sql
TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY
```

**Pros**

* Universally unique
* No conflicts when merging from multiple environments or sources

**Cons**

* Not ordered
* Harder to read/debug
* No efficient range queries

### 3. Use UUIDv7 for Time-Ordered IDs

UUIDv7 is a newer format that preserves **temporal ordering**, making it better suited for distributed inserts and range queries.

You can use UUIDv7 in CrateDB via a **User-Defined Function (UDF)**, based on your preferred language.

**Pros**

* Globally unique and **almost sequential**
* Range queries possible

**Cons**

* Not human-friendly
* Slight overhead due to UDF use

### 4. Use External System IDs

If you're ingesting data from a source system that **already generates unique IDs**, you can reuse those:

* No need for CrateDB to generate anything
* Ensures consistency across systems

> See Replicating data from other databases to CrateDB with Debezium and Kafka for an example.

### 5. Implement a Custom Sequence Table

If you **must** have an auto-incrementing numeric ID (e.g., for compatibility or legacy reasons), you can implement a simple sequence generator using a dedicated table and client-side logic.
+ +**Step 1: Create a sequence tracking table** + +```sql +CREATE TABLE sequences ( + name TEXT PRIMARY KEY, + last_value BIGINT +) CLUSTERED INTO 1 SHARDS; +``` + +**Step 2: Initialize your sequence** + +```sql +INSERT INTO sequences (name, last_value) +VALUES ('mysequence', 0); +``` + +**Step 3: Create a target table** + +```sql +CREATE TABLE mytable ( + id BIGINT PRIMARY KEY, + field1 TEXT +); +``` + +**Step 4: Generate and use sequence values in Python** + +Use optimistic concurrency control to generate unique, incrementing values even in parallel ingestion scenarios: + +```python +# Requires: records, sqlalchemy-cratedb +import time +import records + +db = records.Database("crate://") +sequence_name = "mysequence" + +max_retries = 5 +base_delay = 0.1 # 100 milliseconds + +for attempt in range(max_retries): + select_query = """ + SELECT last_value, _seq_no, _primary_term + FROM sequences + WHERE name = :sequence_name; + """ + row = db.query(select_query, sequence_name=sequence_name).first() + new_value = row.last_value + 1 + + update_query = """ + UPDATE sequences + SET last_value = :new_value + WHERE name = :sequence_name + AND _seq_no = :seq_no + AND _primary_term = :primary_term + RETURNING last_value; + """ + result = db.query( + update_query, + new_value=new_value, + sequence_name=sequence_name, + seq_no=row._seq_no, + primary_term=row._primary_term + ).all() + + if result: + break + + delay = base_delay * (2**attempt) + print(f"Attempt {attempt + 1} failed. Retrying in {delay:.1f} seconds...") + time.sleep(delay) +else: + raise Exception("Failed to acquire sequence after multiple retries.") + +insert_query = "INSERT INTO mytable (id, field1) VALUES (:id, :field1)" +db.query(insert_query, id=new_value, field1="abc") +db.close() +``` + +**Pros** + +* Fully customizable (you can add prefixes, adjust increment size, etc.) 
+* Sequential IDs possible + +**Cons** + +* More complex client logic required +* The sequence table may become a bottleneck at very high ingestion rates + +## Summary + +| Strategy | Ordered | Unique | Scalable | Human-Friendly | Range Queries | Notes | +| ------------------- | ------- | ------ | -------- | -------------- | ------------- | -------------------- | +| Timestamp | ✅ | ⚠️ | ✅ | ✅ | ✅ | Potential collisions | +| UUID (v4) | ❌ | ✅ | ✅ | ❌ | ❌ | Default UUIDs | +| UUIDv7 | ✅ | ✅ | ✅ | ❌ | ✅ | Requires UDF | +| External System IDs | ✅/❌ | ✅ | ✅ | ✅ | ✅ | Depends on source | +| Sequence Table | ✅ | ✅ | ⚠️ | ✅ | ✅ | Manual retry logic | diff --git a/docs/start/modelling/relational.md b/docs/start/modelling/relational.md new file mode 100644 index 00000000..bdaa4245 --- /dev/null +++ b/docs/start/modelling/relational.md @@ -0,0 +1,178 @@ +# Relational data + +CrateDB is a **distributed SQL database** that offers full **relational data modeling** with the flexibility of dynamic schemas and the scalability of NoSQL systems. It supports **primary and foreign keys**, **joins**, **aggregations**, and **subqueries**, just like traditional RDBMS systems—while also enabling hybrid use cases with time-series, geospatial, full-text, vector, and semi-structured data. + +Use CrateDB when you need to scale relational workloads horizontally while keeping the simplicity of **ANSI SQL**. + +## 1. Table Definitions + +CrateDB supports strongly typed relational schemas using familiar SQL syntax: + +```sql +CREATE TABLE customers ( + id UUID PRIMARY KEY, + name TEXT, + email TEXT, + created_at TIMESTAMP +); + +CREATE TABLE orders ( + order_id UUID PRIMARY KEY, + customer_id UUID, + total_amount DOUBLE, + created_at TIMESTAMP +); +``` + +**Key Features:** + +* Supports scalar types (`TEXT`, `INTEGER`, `DOUBLE`, `BOOLEAN`, `TIMESTAMP`, etc.) 
+* `UUID` recommended for primary keys in distributed environments +* Default **replication**, **sharding**, and **partitioning** options are built-in for scale + +:::{note} +CrateDB supports `column_policy = 'dynamic'` if you want to mix relational and semi-structured models (like JSON) in the same table. +::: + +## 2. Joins & Relationships + +CrateDB supports **inner joins**, **left/right joins**, **cross joins**, and even **self joins**. + +**Example: Join Customers and Orders** + +```sql +SELECT c.name, o.order_id, o.total_amount +FROM customers c +JOIN orders o ON c.id = o.customer_id +WHERE o.created_at >= CURRENT_DATE - INTERVAL '30 days'; +``` + +Joins are executed efficiently across shards in a **distributed query planner** that parallelizes execution. + +## 3. Normalization vs. Embedding + +CrateDB supports both **normalized** (relational) and **denormalized** (embedded JSON) approaches. + +* For strict referential integrity and modularity: use normalized tables with joins. +* For performance in high-ingest or read-optimized workloads: embed reference data as nested JSON. + +Example: Embedded products inside an `orders` table: + +```sql +CREATE TABLE orders ( + order_id UUID PRIMARY KEY, + customer_id UUID, + items ARRAY(OBJECT ( + name TEXT, + quantity INTEGER, + price DOUBLE + )), + created_at TIMESTAMP +); +``` + +:::{note} +CrateDB lets you **query nested fields** directly using dot notation: `items['name']`, `items['price']`, etc. +::: + +## 4. Aggregations & Grouping + +Use familiar SQL aggregation functions (`SUM`, `AVG`, `COUNT`, `MIN`, `MAX`) with `GROUP BY`, `HAVING`, `WINDOW FUNCTIONS`, and even `FILTER`. + +```sql +SELECT customer_id, COUNT(*) AS num_orders, SUM(total_amount) AS revenue +FROM orders +GROUP BY customer_id +HAVING revenue > 1000; +``` + +:::{note} +CrateDB's **columnar storage** optimizes performance for aggregations—even on large datasets. +::: + +## 5. 
Constraints & Indexing + +CrateDB supports: + +* **Primary Keys** – enforced for uniqueness and data distribution +* **Unique Constraints** – optional, enforced locally +* **Check Constraints** – for value validation +* **Indexes** – automatic for primary keys and full-text fields; manual for others + +```sql +CREATE TABLE products ( + id UUID PRIMARY KEY, + name TEXT, + price DOUBLE CHECK (price >= 0) +); +``` + +:::{note} +Foreign key constraints are not strictly enforced at write time but can be modeled at the application or query layer. +::: + +## 6. Views & Subqueries + +CrateDB supports **views**, **CTEs**, and **nested subqueries**. + +**Example: Reusable View** + +```sql +CREATE VIEW recent_orders AS +SELECT * FROM orders +WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'; +``` + +**Example: Correlated Subquery** + +```sql +SELECT name, + (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) AS order_count +FROM customers c; +``` + +## 7. Use Cases for Relational Modeling + +| Use Case | Description | +| -------------------- | ------------------------------------------------ | +| Customer & Orders | Classic normalized setup with joins and filters | +| Inventory Management | Products, stock levels, locations | +| Financial Systems | Transactions, balances, audit logs | +| User Profiles | Users, preferences, activity logs | +| Multi-tenant Systems | Use schemas or partitioning for tenant isolation | + +## 8. Scalability & Distribution + +CrateDB automatically shards tables across nodes, distributing both **data and query processing**. + +* Tables can be **sharded and replicated** for fault tolerance +* Use **partitioning** for time-series or tenant-based scaling +* SQL queries are transparently **parallelized across the cluster** + +:::{note} +Use `CLUSTERED BY` and `PARTITIONED BY` in `CREATE TABLE` to control distribution patterns. +::: + +## 9. 
Best Practices

| Area | Recommendation |
| ------------- | ------------------------------------------------------------ |
| Keys & IDs | Use UUIDs or consistent IDs for primary keys |
| Sharding | Let CrateDB auto-shard unless you have advanced requirements |
| Join Strategy | Minimize joins over large, high-cardinality columns |
| Nested Fields | Use `column_policy = 'dynamic'` if schema needs flexibility |
| Aggregations | Favor columnar tables for analytical workloads |
| Co-location | Consider denormalization for write-heavy workloads |

## 10. Further Learning & Resources

* CrateDB Docs – Data Modeling
* CrateDB Academy – Relational Modeling
* Working with Joins in CrateDB
* Schema Design Guide

## 11. Summary

CrateDB offers a familiar, powerful **relational model with full SQL** and built-in support for scale, performance, and hybrid data. You can model clean, normalized data structures and join them across millions of records—without sacrificing the flexibility to embed, index, and evolve schema dynamically.

CrateDB is the modern SQL engine for building relational, real-time, and hybrid apps in a distributed world.
diff --git a/docs/start/modelling/timeseries.md b/docs/start/modelling/timeseries.md new file mode 100644 index 00000000..a8993b7c --- /dev/null +++ b/docs/start/modelling/timeseries.md @@ -0,0 +1,137 @@
# Time series data

CrateDB employs a relational representation for time‑series data, enabling you to work with timestamped data using standard SQL, while also seamlessly combining it with document and context data.

## 1. Why CrateDB for Time Series?

* **Distributed architecture and columnar storage** enable very high ingest throughput with fast aggregations and near‑real‑time analytical queries.
* Handles **high cardinality** and **mixed data types**, including nested JSON, geospatial and vector data—all queryable via the same SQL statements.
* **PostgreSQL wire‑protocol compatible**, so it integrates easily with existing tools and drivers.

## 2. Data Model Template

A typical time‑series schema looks like this:
```sql
CREATE TABLE IF NOT EXISTS weather_data (
    ts TIMESTAMP,
    location VARCHAR,
    temperature DOUBLE,
    humidity DOUBLE CHECK (humidity >= 0),
    PRIMARY KEY (ts, location)
)
WITH (column_policy = 'dynamic');
```
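Because the table is created with `column_policy = 'dynamic'`, an `INSERT` that carries a previously unseen field creates the matching column on the fly. A minimal sketch of that behavior (`wind_speed` is a hypothetical new attribute, not part of the schema above):

```sql
-- wind_speed does not exist yet; CrateDB adds the column automatically.
INSERT INTO weather_data (ts, location, temperature, humidity, wind_speed)
VALUES ('2025-01-01T00:00:00Z', 'Vienna', 21.5, 63.0, 12.7);

-- The auto-created column is immediately queryable like any other.
SELECT location, wind_speed
FROM weather_data
WHERE wind_speed > 10;
```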
+ +Key points: + +* `ts`: append‑only timestamp column +* Composite primary key `(ts, location)` ensures uniqueness and efficient sort/group by time +* `column_policy = 'dynamic'` allows schema evolution: inserting a new field auto‑creates the column. + +## 3. Ingesting and Querying + +### **Data Ingestion** + +* Use SQL `INSERT` or bulk import techniques like `COPY FROM` with JSON or CSV files. +* Schema inference can often happen automatically during import. + +### **Aggregation and Transformations** + +CrateDB offers built‑in SQL functions tailor‑made for time‑series analyses: + +* **`DATE_BIN(interval, timestamp, origin)`** for bucketed aggregations (down‑sampling). +* **Window functions** like `LAG()` and `LEAD()` to detect trends or gaps. +* **`MAX_BY()`** returns the value from one column matching the min/max value of another column in a group. + +**Example**: compute hourly average battery levels and join with metadata: + +```postgresql +WITH avg_metrics AS ( + SELECT device_id, + DATE_BIN('1 hour', time, 0) AS period, + AVG(battery_level) AS avg_battery + FROM devices.readings + GROUP BY device_id, period +) +SELECT period, t.device_id, i.manufacturer, avg_battery +FROM avg_metrics t +JOIN devices.info i USING (device_id) +WHERE i.model = 'mustang'; +``` + +**Example**: gap detection interpolation: + +```text +WITH all_hours AS ( + SELECT generate_series(ts_start, ts_end, ‘30 second’) AS expected_time +), +raw AS ( + SELECT time, battery_level FROM devices.readings +) +SELECT expected_time, r.battery_level +FROM all_hours +LEFT JOIN raw r ON expected_time = r.time +ORDER BY expected_time; +``` + +## 4. Downsampling & Interpolation + +To reduce volume while preserving trends, use `DATE_BIN`.\ +Missing data can be handled using `LAG()`/`LEAD()` or other interpolation logic within SQL. + +## 5. Schema Evolution & Contextual Data + +With `column_policy = 'dynamic'`, ingest JSON payloads containing extra attributes—new columns are auto‑created and indexed. 
Perfect for capturing evolving sensor metadata. + +You can also store: + +* **Geospatial** (`GEO_POINT`, `GEO_SHAPE`) +* **Vectors** (up to 2048 dims via HNSW indexing) +* **BLOBs** for binary data (e.g. images, logs) + +All types are supported within the same table or joined together. + +## 6. Storage Optimization + +* **Partitioning and sharding**: data can be partitioned by time (e.g. daily/monthly) and sharded across a cluster. +* Supports long‑term retention with performant historic storage. +* Columnar layout reduces storage footprint and accelerates aggregation queries. + +## 7. Advanced Use Cases + +* **Exploratory data analysis** (EDA), decomposition, and forecasting via CrateDB’s SQL or by exporting to Pandas/Plotly. +* **Machine learning workflows**: time‑series features and anomaly detection pipelines can be built using CrateDB + external tools + +## 8. Sample Workflow (Chicago Weather Dataset) + +CrateDB’s sample data set captures hourly temperature, humidity, pressure, wind at three Chicago stations (150,000+ records). + +Typical operations: + +* Table creation and ingestion +* Average per station +* Using `MAX_BY()` to find highest temperature timestamps +* Downsampling using `DATE_BIN` into 4‑week buckets + +This workflow illustrates how CrateDB scales and simplifies time series modeling. + +## 9. 
Best Practices Checklist + +| Topic | Recommendation | +| ----------------------------- | ------------------------------------------------------------------- | +| Schema design | Use composite primary key (timestamp + series key), dynamic columns | +| Ingestion | Use bulk import (COPY) and JSON ingestion | +| Aggregations | Use DATE\_BIN, window functions, GROUP BY | +| Interpolation / gap analysis | Employ LAG(), LEAD(), generate\_series, joins | +| Schema evolution | Dynamic columns allow adding fields on the fly | +| Mixed data types | Combine time series, JSON, geo, full‑text in one dataset | +| Partitioning & shard strategy | Partition by time, shard across nodes for scale | +| Downsampling | Use DATE\_BIN for aggregating resolution | +| Integration with analytics/ML | Export to pandas/Plotly or train ML models inside CrateDB pipeline | + +## 10. Further Learning + +* Video: **Time Series Data Modeling** – covers relational & time series, document, geospatial, vector, and full-text in one tutorial. +* Official CrateDB Guide: **Time Series Fundamentals**, **Advanced Time Series Analysis**, **Sharding & Partitioning**. +* CrateDB Academy: free courses including an **Advanced Time Series Modeling** module. + diff --git a/docs/start/modelling/vector.md b/docs/start/modelling/vector.md new file mode 100644 index 00000000..6c537428 --- /dev/null +++ b/docs/start/modelling/vector.md @@ -0,0 +1,151 @@ +# Vector data + +CrateDB natively supports **vector embeddings** for efficient **similarity search** using **approximate nearest neighbor (ANN)** algorithms. This makes it a powerful engine for building AI-powered applications involving semantic search, recommendations, anomaly detection, and multimodal analytics—all in the simplicity of SQL. 
+ +Whether you’re working with text, images, sensor data, or any domain represented as high-dimensional embeddings, CrateDB enables **real-time vector search at scale**, in combination with other data types like full-text, geospatial, and time-series.\ + + +## 1. Data Type: VECTOR + +CrateDB introduces a native `VECTOR` type with the following key characteristics: + +* Fixed-length float arrays (e.g. 768, 1024, 2048 dimensions) +* Supports **HNSW (Hierarchical Navigable Small World)** indexing for fast approximate search +* Optimized for cosine, Euclidean, and dot-product similarity + +**Example: Define a Table with Vector Embeddings** + +```sql +CREATE TABLE documents ( + id UUID PRIMARY KEY, + title TEXT, + content TEXT, + embedding VECTOR(FLOAT[768]) +); +``` + +* `VECTOR(FLOAT[768])` declares a fixed-size vector column. +* You can ingest vectors directly or compute them externally and store them via SQL + +## 2. Indexing: Enabling Vector Search + +To use fast similarity search, define an **HNSW index** on the vector column: + +```sql +CREATE INDEX embedding_hnsw +ON documents (embedding) +USING HNSW +WITH ( + m = 16, + ef_construction = 128, + ef_search = 64, + similarity = 'cosine' +); +``` + +**Parameters:** + +* `m`: controls the number of bi-directional links per node (default: 16) +* `ef_construction`: affects index build accuracy/speed (default: 128) +* `ef_search`: controls recall/latency trade-off at query time +* `similarity`: choose from `'cosine'`, `'l2'` (Euclidean), `'dot_product'` + +> CrateDB automatically builds the ANN index in the background, allowing for real-time updates. + +## 3. Querying Vectors with SQL + +Use the `nearest_neighbors` predicate to perform similarity search: + +```sql +SELECT id, title, content +FROM documents +ORDER BY embedding <-> [0.12, 0.73, ..., 0.01] +LIMIT 5; +``` + +This ranks results by **vector similarity** using the index. 
+ +Or, filter and rank by proximity: + +```sql +SELECT id, title, content, embedding <-> [0.12, ..., 0.01] AS score +FROM documents +WHERE MATCH(content_ft, 'machine learning') AND author = 'Alice' +ORDER BY score +LIMIT 10; +``` + +:::{note} +Combine vector similarity with full-text, metadata, or geospatial filters! +::: + +## 4. Ingestion: Working with Embeddings + +You can ingest vectors in several ways: + +* **Precomputed embeddings** from models like OpenAI, HuggingFace, or SentenceTransformers: + + ```sql + INSERT INTO documents (id, title, embedding) + VALUES ('uuid-123', 'AI and Databases', [0.12, 0.34, ..., 0.01]); + ``` +* **Batched imports** via `COPY FROM` using JSON or CSV +* CrateDB doesn't currently compute embeddings internally—you bring your own model or use pipelines that call CrateDB. + +## 5. Use Cases + +| Use Case | Description | +| ----------------------- | ------------------------------------------------------------------ | +| Semantic Search | Rank documents by meaning instead of keywords | +| Recommendation Systems | Find similar products, users, or behaviors | +| Image / Audio Retrieval | Store and compare embeddings of images/audio | +| Fraud Detection | Match behavioral patterns via vectors | +| Hybrid Search | Combine vector similarity with full-text, geo, or temporal filters | + +Example: Hybrid semantic product search + +```sql +SELECT id, title, price, description +FROM products +WHERE MATCH(description_ft, 'running shoes') AND brand = 'Nike' +ORDER BY features <-> [vector] ASC +LIMIT 10; +``` + +## 6. Performance & Scaling + +* Vector search uses **HNSW**: state-of-the-art ANN algorithm with logarithmic search complexity. +* CrateDB parallelizes ANN search across shards/nodes. +* Ideal for 100K to tens of millions of vectors; supports real-time ingestion and queries. + +:::{note} +Note: vector dimensionality must be consistent for each column. +::: + +## 7. 
Best Practices + +| Area | Recommendation | +| -------------- | ----------------------------------------------------------------------- | +| Vector length | Use standard embedding sizes (e.g. 384, 512, 768, 1024) | +| Similarity | Cosine for semantic/textual data; dot-product for ranking models | +| Index tuning | Tune `ef_search` for latency/recall trade-offs | +| Hybrid queries | Combine vector similarity with metadata filters (e.g. category, region) | +| Updates | Re-inserting or updating vectors is fully supported | +| Data pipelines | Use external tools for vector generation; push to CrateDB via REST/SQL | + +## 8. Integrations + +* **Python / pandas / LangChain**: CrateDB has native drivers and REST interface +* **Embedding models**: Use OpenAI, HuggingFace, Cohere, or in-house models +* **RAG architecture**: CrateDB stores vector + metadata + raw text in a unified store + +## 9. Further Learning & Resources + +* CrateDB Docs – Vector Search +* Blog: Using CrateDB for Hybrid Search (Vector + Full-Text) +* CrateDB Academy – Vector Data +* [Sample notebooks on GitHub](https://github.com/crate/cratedb-examples) + +## 10. Summary + +CrateDB gives you the power of **vector similarity search** with the **flexibility of SQL** and the **scalability of a distributed database**. It lets you unify structured, unstructured, and semantic data—enabling modern applications in AI, search, and recommendation without additional vector databases or pipelines. 
From 25490061320ac0a20c7aef1f568665d61ce7d8b2 Mon Sep 17 00:00:00 2001 From: surister Date: Sat, 23 Aug 2025 12:32:05 +0200 Subject: [PATCH 02/11] Data modelling: Fix page about "relational data" --- docs/start/modelling/relational.md | 100 ++++++++++++++++------------- 1 file changed, 56 insertions(+), 44 deletions(-) diff --git a/docs/start/modelling/relational.md b/docs/start/modelling/relational.md index bdaa4245..bd982ae8 100644 --- a/docs/start/modelling/relational.md +++ b/docs/start/modelling/relational.md @@ -1,42 +1,35 @@ # Relational data -CrateDB is a **distributed SQL database** that offers full **relational data modeling** with the flexibility of dynamic schemas and the scalability of NoSQL systems. It supports **primary and foreign keys**, **joins**, **aggregations**, and **subqueries**, just like traditional RDBMS systems—while also enabling hybrid use cases with time-series, geospatial, full-text, vector, and semi-structured data. +CrateDB is a **distributed SQL database** that offers rich **relational data modeling** with the flexibility of dynamic schemas and the scalability of NoSQL systems. It supports **primary keys,** **joins**, **aggregations**, and **subqueries**, just like traditional RDBMS systems—while also enabling hybrid use cases with time-series, geospatial, full-text, vector search, and semi-structured data. -Use CrateDB when you need to scale relational workloads horizontally while keeping the simplicity of **ANSI SQL**. +Use CrateDB when you need to scale relational workloads horizontally while keeping the simplicity of **SQL**. -## 1. 
Table Definitions +## Table Definitions CrateDB supports strongly typed relational schemas using familiar SQL syntax: ```sql CREATE TABLE customers ( - id UUID PRIMARY KEY, + id TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY, name TEXT, email TEXT, - created_at TIMESTAMP -); - -CREATE TABLE orders ( - order_id UUID PRIMARY KEY, - customer_id UUID, - total_amount DOUBLE, - created_at TIMESTAMP + created_at TIMESTAMP DEFAULT now() ); ``` **Key Features:** * Supports scalar types (`TEXT`, `INTEGER`, `DOUBLE`, `BOOLEAN`, `TIMESTAMP`, etc.) -* `UUID` recommended for primary keys in distributed environments +* `gen_random_text_uuid()`, `now()` or `current_timestamp()` recommended for primary keys in distributed environments * Default **replication**, **sharding**, and **partitioning** options are built-in for scale :::{note} CrateDB supports `column_policy = 'dynamic'` if you want to mix relational and semi-structured models (like JSON) in the same table. ::: -## 2. Joins & Relationships +## Joins & Relationships -CrateDB supports **inner joins**, **left/right joins**, **cross joins**, and even **self joins**. +CrateDB supports **inner joins**, **left/right joins**, **cross joins**, **outer joins**, and even **self joins**. **Example: Join Customers and Orders** @@ -49,7 +42,7 @@ WHERE o.created_at >= CURRENT_DATE - INTERVAL '30 days'; Joins are executed efficiently across shards in a **distributed query planner** that parallelizes execution. -## 3. Normalization vs. Embedding +## Normalization vs. Embedding CrateDB supports both **normalized** (relational) and **denormalized** (embedded JSON) approaches. 
@@ -60,24 +53,25 @@ Example: Embedded products inside an `orders` table: ```sql CREATE TABLE orders ( - order_id UUID PRIMARY KEY, - customer_id UUID, - items ARRAY(OBJECT ( - name TEXT, - quantity INTEGER, - price DOUBLE - )), + order_id TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY, + items ARRAY( + OBJECT(DYNAMIC) AS ( + name TEXT, + quantity INTEGER, + price DOUBLE + ) + ), created_at TIMESTAMP ); ``` :::{note} -CrateDB lets you **query nested fields** directly using dot notation: `items['name']`, `items['price']`, etc. +CrateDB lets you **query nested fields** directly using bracket notation: `items['name']`, `items['price']`, etc. ::: -## 4. Aggregations & Grouping +## Aggregations & Grouping -Use familiar SQL aggregation functions (`SUM`, `AVG`, `COUNT`, `MIN`, `MAX`) with `GROUP BY`, `HAVING`, `WINDOW FUNCTIONS`, and even `FILTER`. +Use familiar SQL aggregation functions (`SUM`, `AVG`, `COUNT`, `MIN`, `MAX`) with `GROUP BY`, `HAVING`, `WINDOW FUNCTIONS` ... etc. ```sql SELECT customer_id, COUNT(*) AS num_orders, SUM(total_amount) AS revenue @@ -90,28 +84,28 @@ HAVING revenue > 1000; CrateDB's **columnar storage** optimizes performance for aggregations—even on large datasets. ::: -## 5. Constraints & Indexing +## Constraints & Indexing CrateDB supports: * **Primary Keys** – enforced for uniqueness and data distribution -* **Unique Constraints** – optional, enforced locally -* **Check Constraints** – for value validation -* **Indexes** – automatic for primary keys and full-text fields; manual for others +* **Check -** enforces custom value validation +* **Indexes** – automatic index for all columns +* **Full-text indexes -** manually defined, supports many tokenizers, analyzers and filters + +In CrateDB every column is indexed by default, depending on the datatype a different index is used, indexing is controlled and maintained by the database, there is no need to `vacuum` or `re-index` like in other systems. Indexing can be manually turned off. 
```sql CREATE TABLE products ( - id UUID PRIMARY KEY, + id TEXT PRIMARY KEY, name TEXT, - price DOUBLE CHECK (price >= 0) + price DOUBLE CHECK (price >= 0), + tag TEXT INDEX OFF, + description TEXT INDEX using fulltext ); ``` -:::{note} -Foreign key constraints are not strictly enforced at write time but can be modeled at the application or query layer. -::: - -## 6. Views & Subqueries +## Views & Subqueries CrateDB supports **views**, **CTEs**, and **nested subqueries**. @@ -120,7 +114,7 @@ CrateDB supports **views**, **CTEs**, and **nested subqueries**. ```sql CREATE VIEW recent_orders AS SELECT * FROM orders -WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'; +WHERE created_at >= CURRENT_DATE::TIMESTAMP - INTERVAL '7 days'; ``` **Example: Correlated Subquery** @@ -131,7 +125,25 @@ SELECT name, FROM customers c; ``` -## 7. Use Cases for Relational Modeling +**Example: Common table expression** + +```sql +WITH order_counts AS ( + SELECT + o.customer_id, + COUNT(*) AS order_count + FROM orders o + GROUP BY o.customer_id +) +SELECT + c.name, + COALESCE(oc.order_count, 0) AS order_count +FROM customers c +LEFT JOIN order_counts oc + ON c.id = oc.customer_id; +``` + +## Use Cases for Relational Modeling | Use Case | Description | | -------------------- | ------------------------------------------------ | @@ -141,7 +153,7 @@ FROM customers c; | User Profiles | Users, preferences, activity logs | | Multi-tenant Systems | Use schemas or partitioning for tenant isolation | -## 8. Scalability & Distribution +## Scalability & Distribution CrateDB automatically shards tables across nodes, distributing both **data and query processing**. @@ -153,7 +165,7 @@ CrateDB automatically shards tables across nodes, distributing both **data and q Use `CLUSTERED BY` and `PARTITIONED BY` in `CREATE TABLE` to control distribution patterns. ::: -## 9. 
Best Practices +## Best Practices | Area | Recommendation | | ------------- | ------------------------------------------------------------ | @@ -164,15 +176,15 @@ Use `CLUSTERED BY` and `PARTITIONED BY` in `CREATE TABLE` to control distributio | Aggregations | Favor columnar tables for analytical workloads | | Co-location | Consider denormalization for write-heavy workloads | -## 10. Further Learning & Resources +## Further Learning & Resources * CrateDB Docs – Data Modeling * CrateDB Academy – Relational Modeling * Working with Joins in CrateDB * Schema Design Guide -## 11. Summary +## Summary -CrateDB offers a familiar, powerful **relational model with full SQL** and built-in support for scale, performance, and hybrid data. You can model clean, normalized data structures and join them across millions of records—without sacrificing the flexibility to embed, index, and evolve schema dynamically. +CrateDB offers a familiar, powerful **relational model with full SQL** and built-in support for scale, performance, and hybrid data. You can model clean, normalized data structures and join them across millions of records, without sacrificing the flexibility to embed, index, and evolve schema dynamically. CrateDB is the modern SQL engine for building relational, real-time, and hybrid apps in a distributed world. From 73ce05705220d011de19aea3754dd8694ee767ad Mon Sep 17 00:00:00 2001 From: Daryl Dudey Date: Sat, 23 Aug 2025 12:36:31 +0200 Subject: [PATCH 03/11] Data modelling: Fix page about "json data" --- docs/start/modelling/json.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index 78583efa..de71eb27 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -4,7 +4,7 @@ CrateDB combines the flexibility of NoSQL document stores with the power of SQL. 
CrateDB’s support for dynamic objects, nested structures, and dot-notation querying brings the best of both relational and document-based data modeling—without leaving the SQL world. -## 1. Object (JSON) Columns +## Object (JSON) Columns CrateDB allows you to define **object columns** that can store JSON-style data structures. @@ -32,7 +32,7 @@ This allows inserting flexible, nested JSON data into `payload`: } ``` -## 2. Column Policy: Strict vs Dynamic +## Column Policy: Strict vs Dynamic You can control how CrateDB handles unexpected fields in an object column: @@ -54,9 +54,9 @@ CREATE TABLE sensor_data ( ); ``` -## 3. Querying JSON Fields +## Querying JSON Fields -Use **dot notation** to access nested fields: +Use **bracket notation** to access nested fields: ```sql SELECT payload['user']['name'], payload['device']['os'] @@ -76,7 +76,7 @@ WHERE payload['device']['os'] = 'Android'; Dot-notation works for both explicitly and dynamically added fields. ::: -## 4. Querying DYNAMIC OBJECTs +## Querying DYNAMIC OBJECTs To support querying DYNAMIC OBJECTs using SQL, where keys may not exist within an OBJECT, CrateDB provides the [error\_on\_unknown\_object\_key](https://cratedb.com/docs/crate/reference/en/latest/config/session.html#conf-session-error-on-unknown-object-key) session setting. It controls the behaviour when querying unknown object keys to dynamic objects. @@ -100,7 +100,7 @@ cr> SELECT item['unknown'] FROM testdrive; SELECT 0 rows in set (0.051 sec) ``` -## 5. Arrays of Objects +## Arrays of OBJECTs Store arrays of objects for multi-valued nested data: @@ -132,7 +132,7 @@ FROM products WHERE specs['name'] = 'battery' AND specs['value'] = 'AA'; ``` -## 6. 
Combining Structured & Semi-Structured Data +## Combining Structured & Semi-Structured Data CrateDB supports **hybrid schemas**, mixing standard columns with JSON fields: @@ -152,7 +152,7 @@ This allows you to: * Flexibly store structured or unstructured metadata * Add new fields on the fly without migrations -## 7. Indexing Behavior +## Indexing Behavior CrateDB **automatically indexes** object fields if: @@ -181,7 +181,7 @@ data['some_field'] INDEX OFF Too many dynamic fields can lead to schema explosion. Use `STRICT` or `IGNORED` if needed. ::: -## 8. Aggregating JSON Fields +## Aggregating JSON Fields CrateDB allows full SQL-style aggregations on nested fields: @@ -193,7 +193,7 @@ WHERE payload['location'] = 'room1'; CrateDB also supports **`GROUP BY`**, **`HAVING`**, and **window functions** on object fields. -## 9. Use Cases for JSON Modeling +## Use Cases for JSON Modeling | Use Case | Description | | ------------------ | -------------------------------------------- | @@ -203,7 +203,7 @@ CrateDB also supports **`GROUP BY`**, **`HAVING`**, and **window functions** on | User Profiles | Custom settings, device info, preferences | | Telemetry / Events | Event streams with evolving structure | -## 10. Best Practices +## Best Practices | Area | Recommendation | | ---------------- | -------------------------------------------------------------------- | @@ -213,14 +213,14 @@ CrateDB also supports **`GROUP BY`**, **`HAVING`**, and **window functions** on | Column Mixing | Combine structured columns with JSON for hybrid models | | Observability | Monitor number of dynamic columns using `information_schema.columns` | -## 11. Further Learning & Resources +## Further Learning & Resources * CrateDB Docs – Object Columns * Working with JSON in CrateDB * CrateDB Academy – Modeling with JSON * Understanding Column Policies -## 12. Summary +## Summary CrateDB makes it easy to model **semi-structured JSON data** with full SQL support. 
Whether you're building a telemetry pipeline, an event store, or a product catalog, CrateDB offers the flexibility of a document store—while preserving the structure, indexing, and power of a relational engine. From 75899f6a86d11f43e0d2258f063d905be4874862 Mon Sep 17 00:00:00 2001 From: karynzv Date: Sat, 23 Aug 2025 12:38:32 +0200 Subject: [PATCH 04/11] Data modelling: Fix page about "timeseries data" --- docs/start/modelling/timeseries.md | 130 +++++++++++++++++------------ 1 file changed, 76 insertions(+), 54 deletions(-) diff --git a/docs/start/modelling/timeseries.md b/docs/start/modelling/timeseries.md index a8993b7c..f9703ca6 100644 --- a/docs/start/modelling/timeseries.md +++ b/docs/start/modelling/timeseries.md @@ -2,33 +2,45 @@ CrateDB employs a relational representation for time‑series, enabling you to work with timestamped data using standard SQL, while also seamlessly combining with document and context data. -## 1. Why CrateDB for Time Series? +## Why CrateDB for Time Series? -* **Distributed architecture and columnar storage** enable very high ingest throughput with fast aggregations and near‑real‑time analytical queries. -* Handles **high cardin­ality** and **mixed data types**, including nested JSON, geospatial and vector data—all queryable via the same SQL statements. +* While maintaining a high ingest rate, its **columnar storage** and **automatic indexing** let you access and analyze the data immediately with **fast aggregations** and **near-real-time queries**. +* Handles **high cardin­ality** and **a variety of data types**, including nested JSON, geospatial and vector data—all queryable via the same SQL statements. * **PostgreSQL wire‑protocol compatible**, so it integrates easily with existing tools and drivers. -## 2. Data Model Template +## Data Model Template A typical time‑series schema looks like this: -
+```sql +CREATE TABLE IF NOT EXISTS devices_readings ( + ts TIMESTAMP WITH TIME ZONE, + device_id TEXT, + battery OBJECT(DYNAMIC) AS ( + level BIGINT, + status TEXT, + temperature DOUBLE PRECISION + ), + cpu OBJECT(DYNAMIC) AS ( + avg_1min DOUBLE PRECISION, + avg_5min DOUBLE PRECISION, + avg_15min DOUBLE PRECISION + ), + memory OBJECT(DYNAMIC) AS ( + free BIGINT, + used BIGINT + ), + month timestamp with time zone GENERATED ALWAYS AS date_trunc('month', ts) +) PARTITIONED BY (month); +``` Key points: -* `ts`: append‑only timestamp column -* Composite primary key `(ts, location)` ensures uniqueness and efficient sort/group by time -* `column_policy = 'dynamic'` allows schema evolution: inserting a new field auto‑creates the column. +* `month` is the partitioning key, optimizing data storage and retrieval. +* Every column is stored in the column store by default for fast aggregations. +* Using **OBJECT columns** in the `devices_readings` table provides a structured and efficient way to organize complex nested data in CrateDB, enhancing both data integrity and flexibility. -## 3. 
Ingesting and Querying +## Ingesting and Querying ### **Data Ingestion** @@ -45,43 +57,56 @@ CrateDB offers built‑in SQL functions tailor‑made for time‑series analyses **Example**: compute hourly average battery levels and join with metadata: -```postgresql +```sql WITH avg_metrics AS ( SELECT device_id, - DATE_BIN('1 hour', time, 0) AS period, - AVG(battery_level) AS avg_battery - FROM devices.readings + DATE_BIN('1 hour'::interval, ts, 0) AS period, + AVG(battery['level']) AS avg_battery + FROM devices_readings GROUP BY device_id, period ) SELECT period, t.device_id, i.manufacturer, avg_battery FROM avg_metrics t -JOIN devices.info i USING (device_id) +JOIN devices_info i USING (device_id) WHERE i.model = 'mustang'; ``` **Example**: gap detection interpolation: -```text +```sql WITH all_hours AS ( - SELECT generate_series(ts_start, ts_end, ‘30 second’) AS expected_time + SELECT + generate_series( + '2025-01-01', + '2025-01-02', + '30 second' :: interval + ) AS expected_time ), raw AS ( - SELECT time, battery_level FROM devices.readings + SELECT + ts, + battery ['level'] + FROM + devices_readings ) -SELECT expected_time, r.battery_level -FROM all_hours -LEFT JOIN raw r ON expected_time = r.time -ORDER BY expected_time; +SELECT + expected_time, + r.battery ['level'] +FROM + all_hours + LEFT JOIN raw r ON expected_time = r.ts +ORDER BY + expected_time; ``` -## 4. Downsampling & Interpolation +## Down-sampling & Interpolation To reduce volume while preserving trends, use `DATE_BIN`.\ Missing data can be handled using `LAG()`/`LEAD()` or other interpolation logic within SQL. -## 5. Schema Evolution & Contextual Data +## Schema Evolution & Contextual Data -With `column_policy = 'dynamic'`, ingest JSON payloads containing extra attributes—new columns are auto‑created and indexed. Perfect for capturing evolving sensor metadata. +With `column_policy = 'dynamic'`, ingest JSON payloads containing extra attributes—new columns are auto‑created and indexed. 
Perfect for capturing evolving sensor metadata. For column-level control, use `OBJECT(DYNAMIC)` to auto-create (and, by default, index) subcolumns, or `OBJECT(IGNORED)`to accept unknown keys without creating or indexing subcolumns. You can also store: @@ -91,47 +116,44 @@ You can also store: All types are supported within the same table or joined together. -## 6. Storage Optimization +## Storage Optimization * **Partitioning and sharding**: data can be partitioned by time (e.g. daily/monthly) and sharded across a cluster. * Supports long‑term retention with performant historic storage. * Columnar layout reduces storage footprint and accelerates aggregation queries. -## 7. Advanced Use Cases +## Advanced Use Cases * **Exploratory data analysis** (EDA), decomposition, and forecasting via CrateDB’s SQL or by exporting to Pandas/Plotly. * **Machine learning workflows**: time‑series features and anomaly detection pipelines can be built using CrateDB + external tools -## 8. Sample Workflow (Chicago Weather Dataset) +## Sample Workflow (Chicago Weather Dataset) -CrateDB’s sample data set captures hourly temperature, humidity, pressure, wind at three Chicago stations (150,000+ records). +In [this lesson of the CrateDB Academy](https://cratedb.com/academy/fundamentals/data-modelling-with-cratedb/hands-on-time-series-data) introducing Time Series data, we provide a sample data set that captures hourly temperature, humidity, pressure, wind at three Chicago stations (150,000+ records). Typical operations: * Table creation and ingestion * Average per station * Using `MAX_BY()` to find highest temperature timestamps -* Downsampling using `DATE_BIN` into 4‑week buckets +* Down-sampling using `DATE_BIN` into 4‑week buckets This workflow illustrates how CrateDB scales and simplifies time series modeling. -## 9. 
Best Practices Checklist - -| Topic | Recommendation | -| ----------------------------- | ------------------------------------------------------------------- | -| Schema design | Use composite primary key (timestamp + series key), dynamic columns | -| Ingestion | Use bulk import (COPY) and JSON ingestion | -| Aggregations | Use DATE\_BIN, window functions, GROUP BY | -| Interpolation / gap analysis | Employ LAG(), LEAD(), generate\_series, joins | -| Schema evolution | Dynamic columns allow adding fields on the fly | -| Mixed data types | Combine time series, JSON, geo, full‑text in one dataset | -| Partitioning & shard strategy | Partition by time, shard across nodes for scale | -| Downsampling | Use DATE\_BIN for aggregating resolution | -| Integration with analytics/ML | Export to pandas/Plotly or train ML models inside CrateDB pipeline | +## Best Practices Checklist -## 10. Further Learning +| Topic | Recommendation | +| ----------------------------- | ---------------------------------------------------------------------------------- | +| Schema design and evolution | Dynamic columns add fields as needed; diverse data types ensure proper typing | +| Ingestion | Use bulk import (COPY) and JSON ingestion | +| Aggregations | Use DATE\_BIN, window functions, GROUP BY | +| Interpolation / gap analysis | Employ LAG(), LEAD(), generate\_series, joins | +| Mixed data types | Combine time series, JSON, geo, full‑text in one dataset | +| Partitioning & shard strategy | Partition by time, shard across nodes for scale | +| Down-sampling | Use DATE\_BIN for aggregating resolution or implement your own strategy using UDFs | +| Integration with analytics/ML | Export to pandas/Plotly to train your ML models | -* Video: **Time Series Data Modeling** – covers relational & time series, document, geospatial, vector, and full-text in one tutorial. -* Official CrateDB Guide: **Time Series Fundamentals**, **Advanced Time Series Analysis**, **Sharding & Partitioning**. 
-* CrateDB Academy: free courses including an **Advanced Time Series Modeling** module. +## Further Learning +* **Video:** [Time Series Data Modeling](https://cratedb.com/resources/videos/time-series-data-modeling) – covers relational & time series, document, geospatial, vector, and full-text in one tutorial. +* **CrateDB Academy:** [Advanced Time Series Modeling course](https://cratedb.com/academy/time-series/getting-started/introduction-to-time-series-data). From f6d3bbac6a61cbb50320b5a0539223d237ef2a0c Mon Sep 17 00:00:00 2001 From: Kenneth Geisshirt Date: Sat, 23 Aug 2025 12:41:12 +0200 Subject: [PATCH 05/11] Data modelling: Fix page about "geospatial data" --- docs/start/modelling/geospatial.md | 118 +++++++++++++++++------------ 1 file changed, 68 insertions(+), 50 deletions(-) diff --git a/docs/start/modelling/geospatial.md b/docs/start/modelling/geospatial.md index 7ddbc982..576b3da0 100644 --- a/docs/start/modelling/geospatial.md +++ b/docs/start/modelling/geospatial.md @@ -1,35 +1,74 @@ # Geospatial data -CrateDB supports **real-time geospatial analytics at scale**, enabling you to store, query, and analyze location-based data using standard SQL over two dedicated types: **GEO\_POINT** and **GEO\_SHAPE**. You can seamlessly combine spatial data with full-text, vector, JSON, or time-series in the same SQL queries. +CrateDB supports **real-time geospatial analytics at scale**, enabling you to store, query, and analyze 2D location-based data using standard SQL over two dedicated types: **GEO\_POINT** and **GEO\_SHAPE**. You can seamlessly combine spatial data with full-text, vector, JSON, or time-series in the same SQL queries. -## 1. Geospatial Data Types +## Geospatial Data Types ### **GEO\_POINT** * Stores a single location via latitude/longitude. -* Insert using either a coordinate array `[lon, lat]` or WKT string `'POINT (lon lat)'`. -* Must be declared explicitly; dynamic schema inference will not detect geo\_point type. 
+* Insert using either a coordinate array `[lon, lat]` or Well-Known Text (WKT) string `'POINT (lon lat)'`. +* Must be declared explicitly; dynamic schema inference will not detect `geo_point` type. ### **GEO\_SHAPE** -* Supports complex geometries (Point, LineString, Polygon, MultiPolygon, GeometryCollection) via GeoJSON or WKT. -* Indexed using geohash, quadtree, or BKD-tree, with configurable precision (e.g. `50m`) and error threshold +* Supports complex geometries (Point, MultiPoint, LineString, MultiLineString, Polygon, MultiPolygon, GeometryCollection) via GeoJSON or WKT. +* Indexed using geohash, quadtree, or BKD-tree, with configurable precision (e.g. `50m`) and error threshold. The indexes are described in the [reference manual](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/data-types.html#type-geo-shape-index). -## 2. Table Schema Example +## Table Schema Example -
CREATE TABLE parcel_zones (
-    zone_id INTEGER PRIMARY KEY,
-    name VARCHAR,
-    area GEO_SHAPE,
-    centroid GEO_POINT
-)
-WITH (column_policy = 'dynamic');
-
+Let's define a table with country boarders and capital: -* Use `GEO_SHAPE` to define zones or service areas. -* `GEO_POINT` allows for simple referencing (e.g. store approximate center of zone). +```sql +CREATE TABLE country ( + name text, + country_code text primary key, + shape geo_shape INDEX USING "geohash" WITH (precision='100m'), + capital text, + capital_location geo_point +) +``` + +* Use `GEO_SHAPE` to define the border. +* `GEO_POINT` to define the location of the capital. -## 3. Core Geospatial Functions +## Insert rows + +We can populate the table with Austria: + +```sql +INSERT INTO country (name, country_code, shape, capital, capital_location) +VALUES ( + 'Austria', + 'at', + + ##{type='Polygon', coordinates=[ + [[16.979667, 48.123497], [16.903754, 47.714866], + [16.340584, 47.712902], [16.534268, 47.496171], + [16.202298, 46.852386], [16.011664, 46.683611], + [15.137092, 46.658703], [14.632472, 46.431817], + [13.806475, 46.509306], [12.376485, 46.767559], + [12.153088, 47.115393], [11.164828, 46.941579], + [11.048556, 46.751359], [10.442701, 46.893546], + [9.932448, 46.920728], [9.47997, 47.10281], + [9.632932, 47.347601], [9.594226, 47.525058], + [9.896068, 47.580197], [10.402084, 47.302488], + [10.544504, 47.566399], [11.426414, 47.523766], + [12.141357, 47.703083], [12.62076, 47.672388], + [12.932627, 47.467646], [13.025851, 47.637584], + [12.884103, 48.289146], [13.243357, 48.416115], + [13.595946, 48.877172], [14.338898, 48.555305], + [14.901447, 48.964402], [15.253416, 49.039074], + [16.029647, 48.733899], [16.499283, 48.785808], + [16.960288, 48.596982], [16.879983, 48.470013], + [16.979667, 48.123497]] + ]}, + 'Vienna', + [16.372778, 48.209206] +); +``` + +## Core Geospatial Functions CrateDB provides key scalar functions for spatial operations: @@ -40,62 +79,41 @@ CrateDB provides key scalar functions for spatial operations: * **`geohash(geo_point)`** – compute a 12‑character geohash for the point * **`area(geo_shape)`** – returns 
approximate area in square degrees; uses geodetic awareness -Note: More precise relational operations on shapes may bypass indexes and can be slower. - -## 4. Spatial Queries & Indexing - -CrateDB supports Lucene-based spatial indexing (Prefix Tree and BKD-tree structures) for efficient geospatial search. Use the `MATCH` predicate to leverage indices when filtering spatial data by bounding boxes, circles, polygons, etc. - -**Example: Find nearby assets** +Furthermore, it is possible to use the **match** predicate with geospatial data in queries. -```sql -SELECT asset_id, DISTANCE(center_point, asset_location) AS dist -FROM assets -WHERE center_point = 'POINT(-1.234 51.050)'::GEO_POINT -ORDER BY dist -LIMIT 10; -``` - -**Example: Count incidents within service area** +Note: More precise relational operations on shapes may bypass indexes and can be slower. -```sql -SELECT area_id, count(*) AS incident_count -FROM incidents -WHERE within(incidents.location, service_areas.area) -GROUP BY area_id; -``` +## An example query -**Example: Which zones intersect a flight path** +It is possible to find the distance to the capital of each country in the table: ```sql -SELECT zone_id, name -FROM flight_paths f -JOIN service_zones z -ON intersects(f.path_geom, z.area); +SELECT distance(capital_location, [9.74, 47.41])/1000 +FROM country; ``` -## 5. Real-World Examples: Chicago Use Cases +## Real-World Examples: Chicago Use Cases * **311 calls**: Each record includes `location` as `GEO_POINT`. Queries use `within()` to find calls near a polygon around O’Hare airport. * **Community areas**: Polygon boundaries stored in `GEO_SHAPE`. Queries for intersections with arbitrary lines or polygons using `intersects()` return overlapping zones. * **Taxi rides**: Pickup/drop off locations stored as geo points. Use `distance()` filter to compute trip distances and aggregate. -## 6. 
Architectural Strengths & Suitability +## Architectural Strengths & Suitability * Designed for **real-time geospatial tracking and analytics** (e.g. fleet tracking, mapping, location-layered apps). * **Unified SQL platform**: spatial data can be combined with full-text search, JSON, vectors, time-series — in the same table or query. * **High ingest and query throughput**, suitable for large-scale location-based workloads -## 7. Best Practices Checklist +## Best Practices Checklist
| Topic | Recommendation |
| ----------------------- | ------------------------------------------------------------------- |
| Data types | Declare GEO_POINT/GEO_SHAPE explicitly |
| Geometric formats | Use WKT or GeoJSON for insertions |
| Index tuning | Choose geohash/quadtree/BKD tree & adjust precision |
| Queries | Prefer MATCH for indexed filtering; use functions for precise checks |
| Joins & spatial filters | Use within/intersects to correlate spatial entities |
| Scale & performance | Index shapes, use distance/within filters early |
| Mixed-model integration | Combine spatial with JSON, full-text, vector, time-series |
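
For instance, the indexed-filtering recommendation can be tried against the `country` table defined above. This is a sketch; the query point is an arbitrary coordinate of our choosing, and `intersects` is the default match type:

```sql
-- Find countries whose border polygon contains the given point.
-- The MATCH predicate uses the geo index instead of evaluating
-- the scalar within()/intersects() functions row by row.
SELECT name
FROM country
WHERE MATCH (shape, 'POINT (16.37 48.21)') USING intersects;
```

As noted above, scalar shape operations may bypass the index, so prefer `MATCH` for selective filters on large tables and reserve the scalar functions for precise final checks.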

-## 8. Further Learning & Resources
+## Further Learning & Resources

* Official **Geospatial Search Guide** in CrateDB docs, detailing geospatial types, indexing, and MATCH predicate usage.
* CrateDB Academy **Hands-on: Geospatial Data** modules, with sample datasets (Chicago 311 calls, taxi rides, community zones) and example queries.
* CrateDB Blog: **Geospatial Queries with CrateDB** – outlines capabilities, limitations, and practical use cases (available since version 0.40).

-## 9. Summary
+## Summary

CrateDB provides robust support for geospatial modeling through clearly defined data types (`GEO_POINT`, `GEO_SHAPE`), powerful scalar functions (`distance`, `within`, `intersects`, `area`), and Lucene‑based indexing for fast queries. It excels in high‑volume, real‑time spatial analytics and integrates smoothly with multi-model use cases. Whether storing vehicle positions, mapping regions, or enabling spatial joins—CrateDB’s geospatial layer makes it easy, scalable, and extensible.

From d60feef0030ce01fc611de74e6abec199d9d4013 Mon Sep 17 00:00:00 2001
From: Andreas Motl
Date: Sat, 23 Aug 2025 12:46:29 +0200
Subject: [PATCH 06/11] Data modelling: Fix SQL in page about "geospatial data"

---
 docs/start/modelling/geospatial.md | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/docs/start/modelling/geospatial.md b/docs/start/modelling/geospatial.md
index 576b3da0..7cc8c012 100644
--- a/docs/start/modelling/geospatial.md
+++ b/docs/start/modelling/geospatial.md
@@ -26,7 +26,7 @@ CREATE TABLE country (
     shape geo_shape INDEX USING "geohash" WITH (precision='100m'),
     capital text,
     capital_location geo_point
-)
+);
 ```

* Use `GEO_SHAPE` to define the border.
@@ -34,15 +34,14 @@ CREATE TABLE country ( ## Insert rows -We can populate the table with Austria: +We can populate the table with the coordinate shape of Vienna/Austria: -```sql +```psql INSERT INTO country (name, country_code, shape, capital, capital_location) VALUES ( 'Austria', 'at', - - ##{type='Polygon', coordinates=[ + {type='Polygon', coordinates=[ [[16.979667, 48.123497], [16.903754, 47.714866], [16.340584, 47.712902], [16.534268, 47.496171], [16.202298, 46.852386], [16.011664, 46.683611], From 52c3008f37459e93d42b5b4ead767dca62e7312e Mon Sep 17 00:00:00 2001 From: Brian Munkholm Date: Sat, 23 Aug 2025 12:48:51 +0200 Subject: [PATCH 07/11] Data modelling: Fix page about "full-text data" --- docs/start/modelling/fulltext.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/start/modelling/fulltext.md b/docs/start/modelling/fulltext.md index 41342afc..43f754b9 100644 --- a/docs/start/modelling/fulltext.md +++ b/docs/start/modelling/fulltext.md @@ -2,7 +2,7 @@ CrateDB features **native full‑text search** powered by **Apache Lucene** and Okapi BM25 ranking, fully accessible via SQL. You can blend this seamlessly with other data types—JSON, time‑series, geospatial, vectors and more—all in a single SQL query platform. -## 1. Data Types & Indexing Strategy +## Data Types & Indexing Strategy * By default, all text columns are indexed as `plain` (raw, unanalyzed)—efficient for equality search but not suitable for full‑text queries * To enable full‑text search, you must define a **FULLTEXT index** with an optional language **analyzer**, e.g.: @@ -21,7 +21,7 @@ CREATE TABLE documents ( INDEX ft_all USING FULLTEXT(title, body) WITH (analyzer = 'english'); ``` -## 2. 
Index Design & Custom Analyzers +## Index Design & Custom Analyzers | Component | Purpose | | ----------------- | ---------------------------------------------------------------------------- | @@ -48,7 +48,7 @@ CREATE ANALYZER german_snowball WITH (language = 'german'); ``` -## 3. Querying: MATCH Predicate & Scoring +## Querying: MATCH Predicate & Scoring CrateDB uses the SQL `MATCH` predicate to run full‑text queries against full‑text indices. It optionally returns a relevance score `_score`, ranked via BM25. @@ -100,7 +100,7 @@ WHERE MATCH((ft_en, ft_de), 'jupm OR verwrlost') USING best_fields WITH (fuzzine ORDER BY _score DESC; ``` -## 4. Use Cases & Integration +## Use Cases & Integration CrateDB is ideal for searching **semi-structured large text data**—product catalogs, article archives, user-generated content, descriptions and logs. @@ -119,13 +119,13 @@ WHERE This blend lets you query by text relevance, numeric filters, and spatial constraints, all in one. -## 5. Architectural Strengths +## Architectural Strengths * **Built on Lucene inverted index + BM25**, offering relevance ranking comparable to search engines. * **Scale horizontally across clusters**, while maintaining fast indexing and search even on high volume datasets. * **Integrated SQL interface**: eliminates need for separate search services like Elasticsearch or Solr. -## 6. Best Practices Checklist +## Best Practices Checklist | Topic | Recommendation | | ------------------- | ---------------------------------------------------------------------------------- | @@ -138,13 +138,13 @@ This blend lets you query by text relevance, numeric filters, and spatial constr | Multi-model Queries | Combine full-text search with geo, JSON, numerical filters. | | Analyze Limitations | Understand phrase\_prefix caveats at scale; tune analyzer/tokenizer appropriately. | -## 7. 
Further Learning & Resources +## Further Learning & Resources * **CrateDB Full‑Text Search Guide**: details index creation, analyzers, MATCH usage. * **FTS Options & Advanced Features**: fuzziness, synonyms, multi-language idioms. * **Hands‑On Academy Course**: explore FTS on real datasets (e.g. Chicago neighborhoods). * **CrateDB Community Insights**: real‑world advice and experiences from users. -## **8. Summary** +## **Summary** CrateDB combines powerful Lucene‑based full‑text search capabilities with SQL, making it easy to model and query textual data at scale. It supports fuzzy matching, multi-language analysis, composite indexing, and integrates fully with other data types for rich, multi-model queries. Whether you're building document search, catalog lookup, or content analytics—CrateDB offers a flexible and scalable foundation.\ From 736652c1e52c78f3bc0e7a8c18c9cf0503b881f9 Mon Sep 17 00:00:00 2001 From: Juan Pardo Date: Sat, 23 Aug 2025 13:01:21 +0200 Subject: [PATCH 08/11] Data modelling: Fix page about "vector data" --- docs/start/modelling/vector.md | 45 ++++++++-------------------------- 1 file changed, 10 insertions(+), 35 deletions(-) diff --git a/docs/start/modelling/vector.md b/docs/start/modelling/vector.md index 6c537428..5ac191a9 100644 --- a/docs/start/modelling/vector.md +++ b/docs/start/modelling/vector.md @@ -5,7 +5,7 @@ CrateDB natively supports **vector embeddings** for efficient **similarity searc Whether you’re working with text, images, sensor data, or any domain represented as high-dimensional embeddings, CrateDB enables **real-time vector search at scale**, in combination with other data types like full-text, geospatial, and time-series.\ -## 1. Data Type: VECTOR +## Data Type: VECTOR CrateDB introduces a native `VECTOR` type with the following key characteristics: @@ -27,32 +27,7 @@ CREATE TABLE documents ( * `VECTOR(FLOAT[768])` declares a fixed-size vector column. 
* You can ingest vectors directly or compute them externally and store them via SQL -## 2. Indexing: Enabling Vector Search - -To use fast similarity search, define an **HNSW index** on the vector column: - -```sql -CREATE INDEX embedding_hnsw -ON documents (embedding) -USING HNSW -WITH ( - m = 16, - ef_construction = 128, - ef_search = 64, - similarity = 'cosine' -); -``` - -**Parameters:** - -* `m`: controls the number of bi-directional links per node (default: 16) -* `ef_construction`: affects index build accuracy/speed (default: 128) -* `ef_search`: controls recall/latency trade-off at query time -* `similarity`: choose from `'cosine'`, `'l2'` (Euclidean), `'dot_product'` - -> CrateDB automatically builds the ANN index in the background, allowing for real-time updates. - -## 3. Querying Vectors with SQL +## Querying Vectors with SQL Use the `nearest_neighbors` predicate to perform similarity search: @@ -79,7 +54,7 @@ LIMIT 10; Combine vector similarity with full-text, metadata, or geospatial filters! ::: -## 4. Ingestion: Working with Embeddings +## Ingestion: Working with Embeddings You can ingest vectors in several ways: @@ -92,7 +67,7 @@ You can ingest vectors in several ways: * **Batched imports** via `COPY FROM` using JSON or CSV * CrateDB doesn't currently compute embeddings internally—you bring your own model or use pipelines that call CrateDB. -## 5. Use Cases +## Use Cases | Use Case | Description | | ----------------------- | ------------------------------------------------------------------ | @@ -112,17 +87,17 @@ ORDER BY features <-> [vector] ASC LIMIT 10; ``` -## 6. Performance & Scaling +## Performance & Scaling * Vector search uses **HNSW**: state-of-the-art ANN algorithm with logarithmic search complexity. * CrateDB parallelizes ANN search across shards/nodes. * Ideal for 100K to tens of millions of vectors; supports real-time ingestion and queries. :::{note} -Note: vector dimensionality must be consistent for each column. 
+vector dimensionality must be consistent for each column. ::: -## 7. Best Practices +## Best Practices | Area | Recommendation | | -------------- | ----------------------------------------------------------------------- | @@ -133,19 +108,19 @@ Note: vector dimensionality must be consistent for each column. | Updates | Re-inserting or updating vectors is fully supported | | Data pipelines | Use external tools for vector generation; push to CrateDB via REST/SQL | -## 8. Integrations +## Integrations * **Python / pandas / LangChain**: CrateDB has native drivers and REST interface * **Embedding models**: Use OpenAI, HuggingFace, Cohere, or in-house models * **RAG architecture**: CrateDB stores vector + metadata + raw text in a unified store -## 9. Further Learning & Resources +## Further Learning & Resources * CrateDB Docs – Vector Search * Blog: Using CrateDB for Hybrid Search (Vector + Full-Text) * CrateDB Academy – Vector Data * [Sample notebooks on GitHub](https://github.com/crate/cratedb-examples) -## 10. Summary +## Summary CrateDB gives you the power of **vector similarity search** with the **flexibility of SQL** and the **scalability of a distributed database**. It lets you unify structured, unstructured, and semantic data—enabling modern applications in AI, search, and recommendation without additional vector databases or pipelines. From 33ecc9424fbc19e94b90535663c805be4605c39f Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Sat, 23 Aug 2025 23:29:24 +0200 Subject: [PATCH 09/11] Layout: Improve responsiveness on pages using cards heavily --- docs/index.md | 10 +++++----- docs/start/index.md | 6 ++++++ 2 files changed, 11 insertions(+), 5 deletions(-) diff --git a/docs/index.md b/docs/index.md index 596306ce..8aa577a0 100644 --- a/docs/index.md +++ b/docs/index.md @@ -9,7 +9,7 @@ Guides and tutorials about how to use CrateDB and CrateDB Cloud in practice. 
-::::{grid} 1 2 2 2 +::::{grid} 4 :padding: 0 @@ -17,7 +17,7 @@ Guides and tutorials about how to use CrateDB and CrateDB Cloud in practice. :link: getting-started :link-type: ref :link-alt: Getting started with CrateDB -:padding: 3 +:padding: 1 :text-align: center :class-card: sd-pt-3 :class-body: sd-fs-1 @@ -31,7 +31,7 @@ Guides and tutorials about how to use CrateDB and CrateDB Cloud in practice. :link: install :link-type: ref :link-alt: Installing CrateDB -:padding: 3 +:padding: 1 :text-align: center :class-card: sd-pt-3 :class-body: sd-fs-1 @@ -45,7 +45,7 @@ Guides and tutorials about how to use CrateDB and CrateDB Cloud in practice. :link: administration :link-type: ref :link-alt: CrateDB Administration -:padding: 3 +:padding: 1 :text-align: center :class-card: sd-pt-3 :class-body: sd-fs-1 @@ -59,7 +59,7 @@ Guides and tutorials about how to use CrateDB and CrateDB Cloud in practice. :link: performance :link-type: ref :link-alt: CrateDB Performance Guides -:padding: 3 +:padding: 1 :text-align: center :class-card: sd-pt-3 :class-body: sd-fs-1 diff --git a/docs/start/index.md b/docs/start/index.md index 520f45e0..628bfe64 100644 --- a/docs/start/index.md +++ b/docs/start/index.md @@ -18,6 +18,7 @@ and explore key features. :link: first-steps :link-type: ref :link-alt: First steps with CrateDB +:columns: 6 3 3 3 :padding: 3 :text-align: center :class-card: sd-pt-3 @@ -31,6 +32,7 @@ and explore key features. :link: connect :link-type: ref :link-alt: Connect to CrateDB +:columns: 6 3 3 3 :padding: 3 :text-align: center :class-card: sd-pt-3 @@ -44,6 +46,7 @@ and explore key features. :link: query-capabilities :link-type: ref :link-alt: Query Capabilities +:columns: 6 3 3 3 :padding: 3 :text-align: center :class-card: sd-pt-3 @@ -57,6 +60,7 @@ and explore key features. :link: ingest :link-type: ref :link-alt: Ingesting Data +:columns: 6 3 3 3 :padding: 3 :text-align: center :class-card: sd-pt-3 @@ -78,6 +82,7 @@ and explore key features. 
:link: example-applications :link-type: ref :link-alt: Sample Applications +:columns: 6 3 3 3 :padding: 3 :text-align: center :class-card: sd-pt-3 @@ -91,6 +96,7 @@ and explore key features. :link: start-going-further :link-type: ref :link-alt: Going Further +:columns: 6 3 3 3 :padding: 3 :text-align: center :class-card: sd-pt-3 From b5ad8b4f8ae99503af27d321415094c3f29625e8 Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Sat, 23 Aug 2025 23:29:49 +0200 Subject: [PATCH 10/11] Data modelling: Populate index page --- docs/start/modelling/fulltext.md | 1 + docs/start/modelling/geospatial.md | 1 + docs/start/modelling/index.md | 113 +++++++++++++++++++++++++++- docs/start/modelling/json.md | 1 + docs/start/modelling/primary-key.md | 1 + docs/start/modelling/relational.md | 1 + docs/start/modelling/timeseries.md | 1 + docs/start/modelling/vector.md | 1 + 8 files changed, 118 insertions(+), 2 deletions(-) diff --git a/docs/start/modelling/fulltext.md b/docs/start/modelling/fulltext.md index 43f754b9..d94c3921 100644 --- a/docs/start/modelling/fulltext.md +++ b/docs/start/modelling/fulltext.md @@ -1,3 +1,4 @@ +(model-fulltext)= # Full-text data CrateDB features **native full‑text search** powered by **Apache Lucene** and Okapi BM25 ranking, fully accessible via SQL. You can blend this seamlessly with other data types—JSON, time‑series, geospatial, vectors and more—all in a single SQL query platform. diff --git a/docs/start/modelling/geospatial.md b/docs/start/modelling/geospatial.md index 7cc8c012..e40ef3e4 100644 --- a/docs/start/modelling/geospatial.md +++ b/docs/start/modelling/geospatial.md @@ -1,3 +1,4 @@ +(model-geospatial)= # Geospatial data CrateDB supports **real-time geospatial analytics at scale**, enabling you to store, query, and analyze 2D location-based data using standard SQL over two dedicated types: **GEO\_POINT** and **GEO\_SHAPE**. You can seamlessly combine spatial data with full-text, vector, JSON, or time-series in the same SQL queries. 
diff --git a/docs/start/modelling/index.md b/docs/start/modelling/index.md index a198efe1..58bce3b8 100644 --- a/docs/start/modelling/index.md +++ b/docs/start/modelling/index.md @@ -1,8 +1,99 @@ +(modelling)= +(data-modelling)= # Data modelling +:::{div} sd-text-muted CrateDB provides a unified storage engine that supports different data types. +::: + +:::::{grid} 2 3 3 3 +:padding: 0 +:class-container: installation-grid + +::::{grid-item-card} Relational data +:link: model-relational +:link-type: ref +:link-alt: Relational data +:padding: 3 +:text-align: center +:class-card: sd-pt-3 +:class-body: sd-fs-1 +:class-title: sd-fs-6 + +{fas}`table-list` +:::: + +::::{grid-item-card} JSON data +:link: model-json +:link-type: ref +:link-alt: JSON data +:padding: 3 +:text-align: center +:class-card: sd-pt-3 +:class-body: sd-fs-1 +:class-title: sd-fs-6 + +{fas}`file-lines` +:::: + +::::{grid-item-card} Timeseries data +:link: model-timeseries +:link-type: ref +:link-alt: Timeseries data +:padding: 3 +:text-align: center +:class-card: sd-pt-3 +:class-body: sd-fs-1 +:class-title: sd-fs-6 + +{fas}`timeline` +:::: + +::::{grid-item-card} Geospatial data +:link: model-geospatial +:link-type: ref +:link-alt: Geospatial data +:padding: 3 +:text-align: center +:class-card: sd-pt-3 +:class-body: sd-fs-1 +:class-title: sd-fs-6 + +{fas}`globe` +:::: + +::::{grid-item-card} Fulltext data +:link: model-fulltext +:link-type: ref +:link-alt: Fulltext data +:padding: 3 +:text-align: center +:class-card: sd-pt-3 +:class-body: sd-fs-1 +:class-title: sd-fs-6 + +{fas}`font` +:::: + +::::{grid-item-card} Vector data +:link: model-vector +:link-type: ref +:link-alt: Vector data +:padding: 3 +:text-align: center +:class-card: sd-pt-3 +:class-body: sd-fs-1 +:class-title: sd-fs-6 + +{fas}`lightbulb` +:::: + +::::: + + ```{toctree} :maxdepth: 1 +:hidden: relational json @@ -12,10 +103,28 @@ fulltext vector ``` -Because CrateDB is a distributed OLAP database designed store large volumes -of data, it 
needs a few special considerations on certain details. +:::{rubric} Implementation notes +::: + +Because CrateDB is a distributed analytical database (OLAP) designed to store +large volumes of data, users need to consider certain details compared to +traditional RDBMS. + + +:::{card} Primary key strategies +:link: model-primary-key +:link-type: ref +CrateDB is built for horizontal scalability and high ingestion throughput. ++++ +To achieve this, operations must complete independently on each node—without +central coordination. This design choice means CrateDB does not support +traditional auto-incrementing primary key types like `SERIAL` in PostgreSQL +or MySQL by default. +::: + ```{toctree} :maxdepth: 1 +:hidden: primary-key ``` diff --git a/docs/start/modelling/json.md b/docs/start/modelling/json.md index de71eb27..fc0fda8b 100644 --- a/docs/start/modelling/json.md +++ b/docs/start/modelling/json.md @@ -1,3 +1,4 @@ +(model-json)= # JSON data CrateDB combines the flexibility of NoSQL document stores with the power of SQL. It enables you to store, query, and index **semi-structured JSON data** using **standard SQL**, making it an excellent choice for applications that handle diverse or evolving schemas. diff --git a/docs/start/modelling/primary-key.md b/docs/start/modelling/primary-key.md index a181e930..fbc8e756 100644 --- a/docs/start/modelling/primary-key.md +++ b/docs/start/modelling/primary-key.md @@ -1,3 +1,4 @@ +(model-primary-key)= # Primary key strategies CrateDB is built for horizontal scalability and high ingestion throughput. To achieve this, operations must complete independently on each node—without central coordination. This design choice means CrateDB does **not** support traditional auto-incrementing primary key types like `SERIAL` in PostgreSQL or MySQL by default. 
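
As a minimal sketch of one alternative (the table and column names here are ours, for illustration), a server-generated UUID can act as the primary key without any cross-node coordination:

```sql
CREATE TABLE events (
    -- gen_random_text_uuid() produces a globally unique text identifier
    -- on the ingesting node itself, so no coordination is required.
    id TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY,
    payload TEXT
);
```

Other strategies covered by the documentation include timestamp-based keys (`BIGINT DEFAULT now()`), identifiers governed by an external source system, and client-managed sequence tables.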
diff --git a/docs/start/modelling/relational.md b/docs/start/modelling/relational.md index bd982ae8..8f9e90eb 100644 --- a/docs/start/modelling/relational.md +++ b/docs/start/modelling/relational.md @@ -1,3 +1,4 @@ +(model-relational)= # Relational data CrateDB is a **distributed SQL database** that offers rich **relational data modeling** with the flexibility of dynamic schemas and the scalability of NoSQL systems. It supports **primary keys,** **joins**, **aggregations**, and **subqueries**, just like traditional RDBMS systems—while also enabling hybrid use cases with time-series, geospatial, full-text, vector search, and semi-structured data. diff --git a/docs/start/modelling/timeseries.md b/docs/start/modelling/timeseries.md index f9703ca6..71be6067 100644 --- a/docs/start/modelling/timeseries.md +++ b/docs/start/modelling/timeseries.md @@ -1,3 +1,4 @@ +(model-timeseries)= # Time series data CrateDB employs a relational representation for time‑series, enabling you to work with timestamped data using standard SQL, while also seamlessly combining with document and context data. diff --git a/docs/start/modelling/vector.md b/docs/start/modelling/vector.md index 5ac191a9..083e1284 100644 --- a/docs/start/modelling/vector.md +++ b/docs/start/modelling/vector.md @@ -1,3 +1,4 @@ +(model-vector)= # Vector data CrateDB natively supports **vector embeddings** for efficient **similarity search** using **approximate nearest neighbor (ANN)** algorithms. This makes it a powerful engine for building AI-powered applications involving semantic search, recommendations, anomaly detection, and multimodal analytics—all in the simplicity of SQL. 
From 036ebcf3057a2edc2697ba57834506458ec3cd5a Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Sun, 24 Aug 2025 01:59:38 +0200 Subject: [PATCH 11/11] Data modelling: Relocate original page about primary keys and sequences --- docs/performance/inserts/index.rst | 1 - docs/performance/inserts/sequences.rst | 205 -------------------- docs/start/modelling/primary-key.md | 249 +++++++++++++++---------- 3 files changed, 154 insertions(+), 301 deletions(-) delete mode 100644 docs/performance/inserts/sequences.rst diff --git a/docs/performance/inserts/index.rst b/docs/performance/inserts/index.rst index 6934363e..e11462ac 100644 --- a/docs/performance/inserts/index.rst +++ b/docs/performance/inserts/index.rst @@ -30,6 +30,5 @@ This section of the guide will show you how. parallel tuning testing - sequences .. _Abstract Syntax Tree: https://en.wikipedia.org/wiki/Abstract_syntax_tree diff --git a/docs/performance/inserts/sequences.rst b/docs/performance/inserts/sequences.rst deleted file mode 100644 index d381d931..00000000 --- a/docs/performance/inserts/sequences.rst +++ /dev/null @@ -1,205 +0,0 @@ -.. _autogenerated_sequences_performance: - -########################################################### - Autogenerated sequences and PRIMARY KEY values in CrateDB -########################################################### - -As you begin working with CrateDB, you might be puzzled why CrateDB does not -have a built-in, auto-incrementing "serial" data type as PostgreSQL or MySQL. - -As a distributed database, designed to scale horizontally, CrateDB needs as many -operations as possible to complete independently on each node without any -coordination between nodes. - -Maintaining a global auto-increment value requires that a node checks with other -nodes before allocating a new value. This bottleneck would be hindering our -ability to achieve `extremely fast ingestion speeds`_. 
- -That said, there are many alternatives available and we can also implement true -consistent/synchronized sequences if we want to. - -************************************ - Using a timestamp as a primary key -************************************ - -This option involves declaring a column as follows: - -.. code:: psql - - BIGINT DEFAULT now() PRIMARY KEY - -:Pros: - Always increasing number - ideal if we need to timestamp records creation - anyway - -:Cons: - gaps between the numbers, not suitable if we may have more than one record on - the same millisecond - -************* - Using UUIDs -************* - -This option involves declaring a column as follows: - -.. code:: psql - - TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY - -:Pros: - Globally unique, no risk of conflicts if merging things from different - tables/environments - -:Cons: - No order guarantee. Not as human-friendly as numbers. String format may not - be applicable to cover all scenarios. Range queries are not possible. - -************************ - Use UUIDv7 identifiers -************************ - -`Version 7 UUIDs`_ are a relatively new kind of UUIDs which feature a -time-ordered value. We can use these in CrateDB with an UDF_ with the code from -`UUIDv7 in N languages`_. - -:Pros: - Same as `gen_random_text_uuid` above but almost sequential, which enables - range queries. - -:Cons: - not as human-friendly as numbers and slight performance impact from UDF use - -********************************* - Use IDs from an external system -********************************* - -In cases where data is imported into CrateDB from external systems that employ -identifier governance, CrateDB does not need to generate any identifier values -and primary key values can be inserted as-is from the source system. - -See `Replicating data from other databases to CrateDB with Debezium and Kafka`_ -for an example. 
- -********************* - Implement sequences -********************* - -This approach involves a table to keep the latest values that have been consumed -and client side code to keep it up-to-date in a way that guarantees unique -values even when many ingestion processes run in parallel. - -:Pros: - Can have any arbitrary type of sequences, (we may for instance want to - increment values by 10 instead of 1 - prefix values with a year number - - combine numbers and letters - etc) - -:Cons: - Need logic for the optimistic update implemented client-side, the sequences - table becomes a bottleneck so not suitable for high-velocity ingestion - scenarios - -We will first create a table to keep the latest values for our sequences: - -.. code:: psql - - CREATE TABLE sequences ( - name TEXT PRIMARY KEY, - last_value BIGINT - ) CLUSTERED INTO 1 SHARDS; - -We will then initialize it with one new sequence at 0: - -.. code:: psql - - INSERT INTO sequences (name,last_value) - VALUES ('mysequence',0); - -And we are going to do an example with a new table defined as follows: - -.. code:: psql - - CREATE TABLE mytable ( - id BIGINT PRIMARY KEY, - field1 TEXT - ); - -The Python code below reads the last value used from the sequences table, and -then attempts an `optimistic UPDATE`_ with a ``RETURNING`` clause, if a -contending process already consumed the identity nothing will be returned so our -process will retry until a value is returned, then it uses that value as the new -ID for the record we are inserting into the ``mytable`` table. - -.. 
code:: python - - # /// script - # requires-python = ">=3.8" - # dependencies = [ - # "records", - # "sqlalchemy-cratedb", - # ] - # /// - - import time - - import records - - db = records.Database("crate://") - sequence_name = "mysequence" - - max_retries = 5 - base_delay = 0.1 # 100 milliseconds - - for attempt in range(max_retries): - select_query = """ - SELECT last_value, - _seq_no, - _primary_term - FROM sequences - WHERE name = :sequence_name; - """ - row = db.query(select_query, sequence_name=sequence_name).first() - new_value = row.last_value + 1 - - update_query = """ - UPDATE sequences - SET last_value = :new_value - WHERE name = :sequence_name - AND _seq_no = :seq_no - AND _primary_term = :primary_term - RETURNING last_value; - """ - if ( - str( - db.query( - update_query, - new_value=new_value, - sequence_name=sequence_name, - seq_no=row._seq_no, - primary_term=row._primary_term, - ).all() - ) - != "[]" - ): - break - - delay = base_delay * (2**attempt) - print(f"Attempt {attempt + 1} failed. Retrying in {delay:.1f} seconds...") - time.sleep(delay) - else: - raise Exception(f"Failed after {max_retries} retries with exponential backoff") - - insert_query = "INSERT INTO mytable (id, field1) VALUES (:id, :field1)" - db.query(insert_query, id=new_value, field1="abc") - db.close() - -.. _extremely fast ingestion speeds: https://cratedb.com/blog/how-we-scaled-ingestion-to-one-million-rows-per-second - -.. _optimistic update: https://cratedb.com/docs/crate/reference/en/latest/general/occ.html#optimistic-update - -.. _replicating data from other databases to cratedb with debezium and kafka: https://cratedb.com/blog/replicating-data-from-other-databases-to-cratedb-with-debezium-and-kafka - -.. _udf: https://cratedb.com/docs/crate/reference/en/latest/general/user-defined-functions.html - -.. _uuidv7 in n languages: https://github.com/nalgeon/uuidv7/blob/main/src/uuidv7.cratedb - -.. 
_version 7 uuids: https://datatracker.ietf.org/doc/html/rfc9562#name-uuid-version-7 diff --git a/docs/start/modelling/primary-key.md b/docs/start/modelling/primary-key.md index fbc8e756..744224a7 100644 --- a/docs/start/modelling/primary-key.md +++ b/docs/start/modelling/primary-key.md @@ -1,112 +1,168 @@ (model-primary-key)=
-# Primary key strategies
-
-CrateDB is built for horizontal scalability and high ingestion throughput. To achieve this, operations must complete independently on each node—without central coordination. This design choice means CrateDB does **not** support traditional auto-incrementing primary key types like `SERIAL` in PostgreSQL or MySQL by default.
-
-This page explains why that is and walks you through **five common alternatives** to generate unique primary key values in CrateDB, including a recipe to implement your own auto-incrementing sequence mechanism when needed.
-
-## Why Auto-Increment Doesn't Exist in CrateDB
-
-In traditional RDBMS systems, auto-increment fields rely on a central counter. In a distributed system like CrateDB, this would create a **global coordination bottleneck**, limiting insert throughput and reducing scalability.
-
-Instead, CrateDB provides **flexibility**: you can choose a primary key strategy tailored to your use case, whether for strict uniqueness, time ordering, or external system integration.
-
-## Primary Key Strategies in CrateDB
-
-### 1. Use a Timestamp as a Primary Key
-
-```sql
+(autogenerated-sequences)=
+# Primary key strategies and autogenerated sequences
+
+:::{rubric} Introduction
+:::
+
+As you begin working with CrateDB, you might wonder why CrateDB does not
+have a built-in, auto-incrementing "serial" data type like the ones in
+PostgreSQL or MySQL.
+
+This page explains why that is and walks you through **five common alternatives**
+to generate unique primary key values in CrateDB, including a recipe to implement
+your own auto-incrementing sequence mechanism when needed.
+
+:::{rubric} Why auto-increment sequences don't exist in CrateDB
+:::
+In a traditional RDBMS, auto-increment fields rely on a central counter.
+In a distributed system like CrateDB, maintaining a global auto-increment value
+would require a node to check with other nodes before allocating a new value.
+This would create a **global coordination bottleneck**, limit insert throughput,
+and reduce scalability.
+
+CrateDB is designed for horizontal scalability and [high ingestion throughput].
+To achieve this, operations must complete independently on each node—without
+central coordination. This design choice means CrateDB does **not** support
+traditional auto-incrementing primary key types like `SERIAL` in PostgreSQL
+or MySQL by default.
+
+:::{rubric} Solutions
+:::
+CrateDB provides flexibility: you can choose a primary key strategy
+tailored to your use case, whether for strict uniqueness, time ordering, or
+external system integration. You can also implement true consistent/synchronized
+sequences if you want to.
+
+## Using a timestamp as a primary key
+
+This option involves declaring a column using `DEFAULT now()`.
+```psql
 BIGINT DEFAULT now() PRIMARY KEY
 ```

-**Pros**
-
-* Auto-generated, always-increasing value
-* Useful when records are timestamped anyway
+:Pros:
+  - Auto-generated, always-increasing value
+  - Useful when records are timestamped anyway

-**Cons**
+:Cons:
+  - Can result in gaps
+  - Collisions possible if multiple records are created in the same millisecond

-* Can result in gaps
-* Collisions possible if multiple records are created in the same millisecond
+## Using UUIDv4 identifiers

-### 2. Use UUIDs (v4)
-
-```sql
+This option involves declaring a column using `DEFAULT gen_random_text_uuid()`.
+```psql
 TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY
 ```

-**Pros**
-
-* Universally unique
-* No conflicts when merging from multiple environments or sources
+:Pros:
+  - Universally unique
+  - No conflicts when merging from multiple environments or sources

-**Cons**
+:Cons:
+  - Not ordered
+  - Harder to read/debug
+  - No efficient range queries

-* Not ordered
-* Harder to read/debug
-* No efficient range queries
+## Using UUIDv7 identifiers

-### Use UUIDv7 for Time-Ordered IDs
+[UUIDv7] is a new format that preserves **temporal ordering**, making UUIDs
+better suited for inserts and range queries in distributed databases.

-UUIDv7 is a new format that preserves **temporal ordering**, making them better suited for distributed inserts and range queries.

-You can use UUIDv7 in CrateDB via a **User-Defined Function (UDF)**, based on your preferred language.
+You can use [UUIDv7 for CrateDB] via a {ref}`User-Defined Function (UDF) `
+in JavaScript, or generate UUIDv7 values in your preferred programming language
+by using one of the available UUIDv7 libraries.

-**Pros**
+:Pros:
+  - Globally unique and **almost sequential**
+  - Efficient range queries possible

-**Cons**
-* Globally unique and **almost sequential**
-* Range queries possible
+:Cons:
+  - Not as human-friendly as integer numbers
+  - Slight overhead due to UDF use

-* Not human-friendly
-* Slight overhead due to UDF use
-### 4. Use External System IDs
+## Using IDs from external systems

-If you're ingesting data from a source system that **already generates unique IDs**, you can reuse those:
+If you are importing data from a source system that **already generates unique
+IDs**, you can reuse those by inserting primary key values as-is from the
+source system.

-* No need for CrateDB to generate anything
-* Ensures consistency across systems
+In this case, CrateDB does not need to generate any identifier values,
+and consistency is ensured across systems.
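Client-generated identifiers work the same way: when the application computes the key itself, CrateDB just stores it. As an illustration only (this is not a CrateDB utility), the following standard-library Python sketch builds time-ordered UUIDv7 values following the RFC 9562 bit layout; such values can be inserted as-is into a `TEXT PRIMARY KEY` column, just like identifiers imported from an external system:

```python
import os
import time
import uuid


def uuid7() -> uuid.UUID:
    """Build a UUIDv7: 48-bit Unix-millisecond timestamp, then random bits,
    with the version and variant fields set per RFC 9562."""
    ts_ms = time.time_ns() // 1_000_000
    value = (ts_ms & 0xFFFF_FFFF_FFFF) << 80        # bits 127..80: timestamp
    value |= int.from_bytes(os.urandom(10), "big")  # bits 79..0: random
    value &= ~(0xF << 76)
    value |= 0x7 << 76                              # version = 7
    value &= ~(0x3 << 62)
    value |= 0x2 << 62                              # variant = RFC 9562
    return uuid.UUID(int=value)


a = uuid7()
time.sleep(0.002)
b = uuid7()
assert a.version == 7
assert a.int < b.int  # the later key sorts after the earlier key
```

Because the leading bits encode the creation time, the sort order of the hex representation matches insertion order, which is what makes range queries on such keys efficient.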
-If you're ingesting data from a source system that **already generates unique IDs**, you can reuse those: +:::{seealso} +An example for that is [Replicating data from other databases to CrateDB with Debezium and Kafka]. +::: -* No need for CrateDB to generate anything -* Ensures consistency across systems +## Implementing a custom sequence table -> See Replicating data from other databases to CrateDB with Debezium and Kafka for an example. +If you **must** have an auto-incrementing numeric ID (e.g., for compatibility +or legacy reasons), you can implement a simple sequence generator using a +dedicated table and client-side logic. -### 5. Implement a Custom Sequence Table +This approach involves a table to keep the latest values that have been consumed +and client side code to keep it up-to-date in a way that guarantees unique +values even when many ingestion processes run in parallel. -If you **must** have an auto-incrementing numeric ID (e.g., for compatibility or legacy reasons), you can implement a simple sequence generator using a dedicated table and client-side logic. +:Pros: + - Fully customizable (you can add prefixes, adjust increment size, etc.) + - Sequential IDs possible -**Step 1: Create a sequence tracking table** +:Cons: + - Additional client logic about optimistic updates is required for writing + - The sequence table may become a bottleneck at very high ingestion rates -```sql +### Step 1: Create a sequence tracking table +Create a table to keep the latest values for the sequences. +```psql CREATE TABLE sequences ( - name TEXT PRIMARY KEY, - last_value BIGINT + name TEXT PRIMARY KEY, + last_value BIGINT ) CLUSTERED INTO 1 SHARDS; ``` -**Step 2: Initialize your sequence** - -```sql -INSERT INTO sequences (name, last_value) -VALUES ('mysequence', 0); +### Step 2: Initialize your sequence +Initialize the table with one new sequence at 0. 
+```psql
+INSERT INTO sequences (name, last_value)
+VALUES ('mysequence', 0);
 ```

-**Step 3: Create a target table**
-
-```sql
+### Step 3: Create a target table
+Create the example target table.
+```psql
 CREATE TABLE mytable (
-    id BIGINT PRIMARY KEY,
-    field1 TEXT
+  id BIGINT PRIMARY KEY,
+  field1 TEXT
 );
 ```

-**Step 4: Generate and use sequence values in Python**
+### Step 4: Generate and use sequence values in Python
+
+Use optimistic concurrency control to generate unique, incrementing values
+even in parallel ingestion scenarios.

-Use optimistic concurrency control to generate unique, incrementing values even in parallel ingestion scenarios:
+The Python code below reads the last value used from the sequences table and
+then attempts an [optimistic UPDATE] with a `RETURNING` clause. If a contending
+process has already consumed the value, nothing is returned and the process
+retries until a value comes back. That value is then used as the ID for the
+record inserted into the `mytable` table.
```python
 # Requires: records, sqlalchemy-cratedb
+#
+# /// script
+# requires-python = ">=3.8"
+# dependencies = [
+#   "records",
+#   "sqlalchemy-cratedb",
+# ]
+# ///
+
 import time
+
 import records

 db = records.Database("crate://")
@@ -117,7 +173,9 @@ base_delay = 0.1  # 100 milliseconds

 for attempt in range(max_retries):
     select_query = """
-        SELECT last_value, _seq_no, _primary_term
+        SELECT last_value,
+               _seq_no,
+               _primary_term
        FROM sequences
        WHERE name = :sequence_name;
    """
@@ -128,48 +186,49 @@ for attempt in range(max_retries):
        UPDATE sequences
        SET last_value = :new_value
        WHERE name = :sequence_name
-        AND _seq_no = :seq_no
-        AND _primary_term = :primary_term
+          AND _seq_no = :seq_no
+          AND _primary_term = :primary_term
        RETURNING last_value;
    """
-    result = db.query(
-        update_query,
-        new_value=new_value,
-        sequence_name=sequence_name,
-        seq_no=row._seq_no,
-        primary_term=row._primary_term
-    ).all()
-
-    if result:
+    result = db.query(
+        update_query,
+        new_value=new_value,
+        sequence_name=sequence_name,
+        seq_no=row._seq_no,
+        primary_term=row._primary_term,
+    ).all()
+    if result:
        break

    delay = base_delay * (2**attempt)
    print(f"Attempt {attempt + 1} failed. Retrying in {delay:.1f} seconds...")
    time.sleep(delay)
 else:
-    raise Exception("Failed to acquire sequence after multiple retries.")
+    raise Exception(f"Failed after {max_retries} retries with exponential backoff")

 insert_query = "INSERT INTO mytable (id, field1) VALUES (:id, :field1)"
 db.query(insert_query, id=new_value, field1="abc")
 db.close()
 ```

-**Pros**
-
-* Fully customizable (you can add prefixes, adjust increment size, etc.)
-* Sequential IDs possible
-
-**Cons**
-
-* More complex client logic required
-* The sequence table may become a bottleneck at very high ingestion rates
-
 ## Summary

-| Strategy | Ordered | Unique | Scalable | Human-Friendly | Range Queries | Notes |
-| ------------------- | ------- | ------ | -------- | -------------- | ------------- | -------------------- |
+| Strategy            | Ordered | Unique | Scalable | Human-friendly | Range queries | Notes                |
+|---------------------|---------|--------|----------|----------------|---------------|----------------------|
 | Timestamp | ✅ | ⚠️ | ✅ | ✅ | ✅ | Potential collisions |
-| UUID (v4) | ❌ | ✅ | ✅ | ❌ | ❌ | Default UUIDs |
+| UUIDv4 | ❌ | ✅ | ✅ | ❌ | ❌ | Default UUIDs |
 | UUIDv7 | ✅ | ✅ | ✅ | ❌ | ✅ | Requires UDF |
-| External System IDs | ✅/❌ | ✅ | ✅ | ✅ | ✅ | Depends on source |
-| Sequence Table | ✅ | ✅ | ⚠️ | ✅ | ✅ | Manual retry logic |
+| External system IDs | ✅/❌ | ✅ | ✅ | ✅ | ✅ | Depends on source |
+| Sequence table | ✅ | ✅ | ⚠️ | ✅ | ✅ | Manual retry logic |
+
+
+[high ingestion throughput]: https://cratedb.com/blog/how-we-scaled-ingestion-to-one-million-rows-per-second
+[optimistic update]: https://cratedb.com/docs/crate/reference/en/latest/general/occ.html#optimistic-update
+[replicating data from other databases to cratedb with debezium and kafka]: https://cratedb.com/blog/replicating-data-from-other-databases-to-cratedb-with-debezium-and-kafka
+[udf]: https://cratedb.com/docs/crate/reference/en/latest/general/user-defined-functions.html
+[UUIDv7]: https://datatracker.ietf.org/doc/html/rfc9562#name-uuid-version-7
+[UUIDv7 for CrateDB]: https://github.com/nalgeon/uuidv7/blob/main/src/uuidv7.cratedb
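The retry loop in Step 4 is easier to reason about in isolation. Below is a database-free sketch of the same optimistic compare-and-set pattern; `SequenceStore` is a hypothetical in-memory stand-in for the `sequences` table and CrateDB's `_seq_no` versioning, not a CrateDB API:

```python
class SequenceStore:
    """In-memory stand-in for the sequences table: an update only
    succeeds if the caller presents the version it originally read."""

    def __init__(self) -> None:
        self.last_value = 0
        self.seq_no = 0  # plays the role of CrateDB's _seq_no

    def read(self):
        return self.last_value, self.seq_no

    def compare_and_set(self, new_value: int, expected_seq_no: int) -> bool:
        if self.seq_no != expected_seq_no:
            return False  # a contending writer updated the row first
        self.last_value = new_value
        self.seq_no += 1
        return True


def next_id(store: SequenceStore, max_retries: int = 5) -> int:
    """Read, increment, and conditionally write back, retrying on conflict."""
    for _ in range(max_retries):
        last, seq = store.read()
        if store.compare_and_set(last + 1, seq):
            return last + 1
    raise RuntimeError(f"failed after {max_retries} retries")


store = SequenceStore()
assert [next_id(store) for _ in range(5)] == [1, 2, 3, 4, 5]

# Simulated contention: two writers read the same version; only one wins,
# and the loser observes the failed compare-and-set and must retry.
last, seq = store.read()
assert store.compare_and_set(last + 1, seq) is True
assert store.compare_and_set(last + 2, seq) is False
```

The real implementation simply swaps `read` and `compare_and_set` for the `SELECT` and the conditional `UPDATE ... RETURNING` shown in Step 4.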