Commit 36bf2ac (parent 914fd22)

Integrate/dlt: Pull README and usage guide from upstream repository

Includes:
- Overview about supported features
- Usage guide based on `dlt init`

2 files changed (+216 −11 lines)

docs/integrate/dlt/index.md

Lines changed: 119 additions & 11 deletions
@@ -13,7 +13,9 @@
 [dlt] (data load tool)—think ELT as Python code—is a popular,
 production-ready Python library for moving data. It loads data from
 various and often messy data sources into well-structured, live datasets.
-dlt is used by {ref}`ingestr`.
+
+dlt supports [30+ databases supported by SQLAlchemy],
+and is also the workhorse behind the {ref}`ingestr` toolkit.
 
 ::::{grid}
 
@@ -36,12 +38,13 @@ dlt is used by {ref}`ingestr`.
 Prerequisites:
 Install dlt and the CrateDB destination adapter:
 ```shell
-pip install dlt dlt-cratedb
+pip install --upgrade dlt-cratedb
 ```
 
 Load data from cloud storage or files into CrateDB.
 ```python
 import dlt
+import dlt_cratedb
 from dlt.sources.filesystem import filesystem
 
 resource = filesystem(
@@ -60,6 +63,7 @@ pipeline.run(resource)
 
 Load data from SQL databases into CrateDB.
 ```python
+import dlt_cratedb
 from dlt.sources.sql_database import sql_database
 
 source = sql_database(
@@ -75,32 +79,136 @@ pipeline = dlt.pipeline(
 pipeline.run(source)
 ```
 
-## Learn
+## Supported features
+
+### Data loading
+
+Data is loaded into CrateDB using the most efficient method, depending on the data source.
+
+- For local files, the `psycopg2` library is used to load files directly into
+  CrateDB tables using the `INSERT` command.
+- For files in remote storage like S3 or Azure Blob Storage,
+  CrateDB data loading functions are used to read the files and insert the data into tables.
+
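The `INSERT`-based path mentioned above can be illustrated with a small stdlib-only sketch. The `build_insert` helper is hypothetical and for illustration only; dlt itself issues such statements through `psycopg2`, which also handles escaping and parameter binding.

```python
# Hypothetical helper, for illustration only: dlt's actual implementation
# sends multi-row INSERT statements via psycopg2 instead of string formatting.
def build_insert(table: str, rows: list) -> str:
    columns = list(rows[0])
    values = ", ".join(
        "(" + ", ".join(repr(row[col]) for col in columns) + ")"
        for row in rows
    )
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {values};"

statement = build_insert("doc.players", [{"id": 1, "name": "magnus"}])
# -> INSERT INTO doc.players (id, name) VALUES (1, 'magnus');
```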
+### Datasets
+
+Use `dataset_name="doc"` to address CrateDB's default schema `doc`.
+When addressing other schemas, make sure they contain at least one table. [^create-schema]
+
+### File formats
+
+- The [SQL INSERT file format] is the preferred format for both direct loading and staging.
+
+### Column types
+
+The `cratedb` destination has a few specific deviations from the default SQL destinations.
+
+- CrateDB does not support the `time` datatype. Time will be loaded into a `text` column.
+- CrateDB does not support the `binary` datatype. Binary will be loaded into a `text` column.
+- CrateDB can produce rounding errors under certain conditions when using the `float/double` datatypes.
+  Make sure to use the `decimal` datatype if you can't afford rounding errors.
+
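The rounding caveat above is easy to demonstrate in plain Python, using nothing beyond the standard library:

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 exactly, so sums drift slightly.
float_sum = 0.1 + 0.2          # 0.30000000000000004
assert float_sum != 0.3

# Decimal arithmetic stays exact for decimal string literals.
decimal_sum = Decimal("0.1") + Decimal("0.2")
assert decimal_sum == Decimal("0.3")
```

The same drift occurs inside any database that stores `float/double` values, which is why `decimal` is recommended for exact amounts.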
+### Column hints
+
+CrateDB supports the following [column hints].
+
+- `primary_key`: marks the column as part of the primary key. Multiple columns can have this hint to create a composite primary key.
+
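The effect of a composite primary key can be sketched without dlt: rows sharing the same key pair collide, and with merge-style semantics the last row per key wins. The records and field names below are purely illustrative, not dlt code.

```python
# Illustrative records keyed by a hypothetical composite
# primary key (player_id, game_id).
records = [
    {"player_id": 1, "game_id": "a", "result": "win"},
    {"player_id": 1, "game_id": "b", "result": "loss"},
    {"player_id": 1, "game_id": "a", "result": "draw"},  # same key as row one
]

# With a composite primary key and merge semantics, the last record
# seen for each (player_id, game_id) pair replaces earlier ones.
merged = {}
for record in records:
    merged[(record["player_id"], record["game_id"])] = record

assert len(merged) == 2
```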
+### File staging
+
+CrateDB supports Amazon S3, Google Cloud Storage, and Azure Blob Storage as file staging destinations.
+
+`dlt` will upload CSV or JSONL files to the staging location and use CrateDB data loading functions
+to load the data directly from the staged files.
+
+Please refer to the filesystem documentation to learn how to configure credentials for the staging destinations.
+
+- [AWS S3]
+- [Azure Blob Storage]
+- [Google Storage]
+
+Invoke a pipeline with staging enabled.
+
+```python
+pipeline = dlt.pipeline(
+    pipeline_name='chess_pipeline',
+    destination='cratedb',
+    staging='filesystem',  # add this to activate staging
+    dataset_name='chess_data'
+)
+```
+
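For the S3 case, the staging credentials usually live in `.dlt/secrets.toml`. The following is a sketch only; the bucket name is a placeholder and the keys are left elided, so please consult the linked filesystem documentation for the authoritative layout.

```toml
# Sketch: filesystem staging on S3; bucket name is a placeholder.
[destination.filesystem]
bucket_url = "s3://your-staging-bucket"

[destination.filesystem.credentials]
aws_access_key_id = "..."
aws_secret_access_key = "..."
```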
+### dbt support
+
+Integration with [dbt] is generally supported via [dbt-cratedb2] but not tested by us.
+
+### dlt state sync
+
+The CrateDB destination fully supports [dlt state sync].
+
+
+## See also
+
+:::{rubric} Examples
+:::
 
 ::::{grid}
 
+:::{grid-item-card} Usage guide: Load API data with dlt
+:link: dlt-usage
+:link-type: ref
+Exercise a canonical `dlt init` example with CrateDB.
+:::
+
 :::{grid-item-card} Examples: Use dlt with CrateDB
 :link: https://github.com/crate/cratedb-examples/tree/main/framework/dlt
 :link-type: url
-Executable code examples that demonstrate how to use dlt with CrateDB.
+Executable code examples on GitHub that demonstrate how to use dlt with CrateDB.
+:::
+
+::::
+
+:::{rubric} Resources
 :::
 
-:::{grid-item-card} Adapter: The dlt destination adapter for CrateDB
-:link: https://github.com/crate/dlt-cratedb
+::::{grid}
+
+:::{grid-item-card} Package: `dlt-cratedb`
+:link: https://pypi.org/project/dlt-cratedb/
 :link-type: url
-Based on the dlt PostgreSQL adapter, the package enables you to work
-with dlt and CrateDB.
+The dlt destination adapter for CrateDB is
+based on the dlt PostgreSQL adapter.
 :::
 
-:::{grid-item-card} See also: ingestr
+:::{grid-item-card} Related: `ingestr`
 :link: ingestr
 :link-type: ref
-The ingestr data import/export application uses dlt.
+The ingestr data import/export application uses dlt as a workhorse.
 :::
 
 ::::
 
 
+:::{toctree}
+:maxdepth: 1
+:hidden:
+Usage <usage>
+:::
+
+
+[^create-schema]: CrateDB does not support `CREATE SCHEMA` yet, see [CRATEDB-14601].
+    This means by default, unless any table exists within a schema, the schema appears
+    not to exist at all. However, it also can't be created explicitly. Schemas are
+    currently implicitly created when tables exist in them.
 
-[databases supported by SQLAlchemy]: https://docs.sqlalchemy.org/en/20/dialects/
+[30+ databases supported by SQLAlchemy]: https://dlthub.com/docs/dlt-ecosystem/destinations/sqlalchemy
+[AWS S3]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#aws-s3
+[Azure Blob Storage]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#azure-blob-storage
+[column hints]: https://dlthub.com/docs/general-usage/schema#column-hint-rules
+[CRATEDB-14601]: https://github.com/crate/crate/issues/14601
+[dbt]: https://dlthub.com/docs/hub/features/transformations/dbt-transformations
+[dbt-cratedb2]: https://pypi.org/project/dbt-cratedb2/
 [dlt]: https://dlthub.com/
+[dlt state sync]: https://dlthub.com/docs/general-usage/state#syncing-state-with-destination
+[Google Storage]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#google-storage
+[SQL INSERT file format]: https://dlthub.com/docs/dlt-ecosystem/file-formats/insert-format

docs/integrate/dlt/usage.md

Lines changed: 97 additions & 0 deletions

@@ -0,0 +1,97 @@
+---
+title: CrateDB
+description: CrateDB `dlt` destination
+keywords: [ cratedb, destination, data warehouse ]
+---
+
+(dlt-usage)=
+# Load API data with dlt
+
+:::{div} sd-text-muted
+Exercise a canonical `dlt init` example with CrateDB.
+:::
+
+## Install the package
+
+Install the dlt destination adapter for CrateDB.
+```shell
+pip install dlt-cratedb
+```
+
+## Initialize the dlt project
+
+Start by initializing a new example `dlt` project.
+
+```shell
+export DESTINATION__CRATEDB__DESTINATION_TYPE=postgres
+dlt init chess cratedb
+```
+
+The `dlt init` command will initialize your pipeline with `chess` [^chess-source]
+as the source, and `cratedb` as the destination. It generates several files and directories.
+
+## Edit the pipeline definition
+
+The pipeline definition is stored in the Python file `chess_pipeline.py`.
+
+- Because the dlt adapter currently only supports writing to the default `doc` schema
+  of CrateDB [^create-schema], please replace `dataset_name="chess_players_games_data"`
+  with `dataset_name="doc"` in the generated `chess_pipeline.py` file.
+
+- To initialize the CrateDB destination adapter, insert the `import dlt_cratedb`
+  statement at the top of the file. Otherwise, the destination will not be found,
+  and you will receive a corresponding error [^not-initialized-error].
+
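Taken together, the two edits above amount to a small patch against the generated file. This is a sketch only; the exact surrounding lines produced by `dlt init` may differ.

```diff
--- chess_pipeline.py
+++ chess_pipeline.py
@@
+import dlt_cratedb  # registers the "cratedb" destination
 import dlt
@@
     pipeline = dlt.pipeline(
         pipeline_name="chess_pipeline",
         destination="cratedb",
-        dataset_name="chess_players_games_data",
+        dataset_name="doc",
     )
```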
+## Configure credentials
+
+Next, set up the CrateDB credentials in the `.dlt/secrets.toml` file as shown below.
+CrateDB is compatible with PostgreSQL and uses the `psycopg2` driver, like the
+`postgres` destination.
+
+```toml
+[destination.cratedb.credentials]
+host = "localhost"     # CrateDB server host.
+port = 5432            # CrateDB PostgreSQL TCP protocol port, default is 5432.
+username = "crate"     # CrateDB username, default is usually "crate".
+password = "crate"     # CrateDB password, if any.
+database = "crate"     # CrateDB only knows a single database called `crate`.
+connect_timeout = 15
+```
+
+Alternatively, you can pass a database connection string as shown below.
+```toml
+destination.cratedb.credentials="postgres://crate:crate@localhost:5432/"
+```
+Keep it at the top of your TOML file, before any section starts.
+Because CrateDB uses `psycopg2`, the `postgres://` scheme is the right choice.
+
+## Start CrateDB
+
+Use Docker or Podman to run an instance of CrateDB for evaluation purposes.
+```shell
+docker run --rm --name=cratedb --publish=4200:4200 --publish=5432:5432 crate:latest '-Cdiscovery.type=single-node'
+```
+
+## Run pipeline
+
+```shell
+python chess_pipeline.py
+```
+
+## Explore data
+
+```shell
+crash -c 'SELECT * FROM players_profiles LIMIT 10;'
+crash -c 'SELECT * FROM players_online_status LIMIT 10;'
+```
+
+
+[^chess-source]: The `chess` dlt source pulls publicly available data from
+    the [Chess.com Published-Data API].
+[^create-schema]: CrateDB does not support `CREATE SCHEMA` yet, see [CRATEDB-14601].
+    This means by default, unless any table exists within a schema, the schema appears
+    not to exist at all. However, it also can't be created explicitly. Schemas are
+    currently implicitly created when tables exist in them.
+[^not-initialized-error]: `UnknownDestinationModule: Destination "cratedb" is not one of the standard dlt destinations`
+
+[Chess.com Published-Data API]: https://www.chess.com/news/view/published-data-api
+[CRATEDB-14601]: https://github.com/crate/crate/issues/14601
