Merged
130 changes: 119 additions & 11 deletions docs/integrate/dlt/index.md
@@ -13,7 +13,9 @@
[dlt] (data load tool)—think ELT as Python code—is a popular,
production-ready Python library for moving data. It loads data from
various and often messy data sources into well-structured, live datasets.

dlt supports [30+ databases supported by SQLAlchemy],
and is also the workhorse behind the {ref}`ingestr` toolkit.

::::{grid}

@@ -36,12 +38,13 @@
Prerequisites:
Install dlt and the CrateDB destination adapter:
```shell
pip install --upgrade dlt-cratedb
```

Load data from cloud storage or files into CrateDB.
```python
import dlt
import dlt_cratedb  # registers the "cratedb" destination with dlt
from dlt.sources.filesystem import filesystem

resource = filesystem(
@@ -60,6 +63,7 @@

Load data from SQL databases into CrateDB.
```python
import dlt_cratedb  # registers the "cratedb" destination with dlt
from dlt.sources.sql_database import sql_database

source = sql_database(
@@ -75,32 +79,136 @@
pipeline.run(source)
```

## Supported features

### Data loading

Data is loaded into CrateDB using the most efficient method depending on the data source.

- For local files, the `psycopg2` library is used to directly load files into
CrateDB tables using the `INSERT` command.
- For files in remote storage like S3 or Azure Blob Storage,
CrateDB data loading functions are used to read the files and insert the data into tables.
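
For illustration only, the choice between the two paths can be pictured as a tiny
dispatch helper. This is hypothetical code, not part of the adapter; the URL
prefixes are assumptions about how remote sources are identified:

```python
def loading_method(source: str) -> str:
    """Pick a loading strategy based on where the data lives (sketch)."""
    remote_prefixes = ("s3://", "az://", "gs://", "abfss://")
    if source.startswith(remote_prefixes):
        # CrateDB reads staged files itself via its data loading functions.
        return "server-side load from staged files"
    # Local files are inserted through psycopg2 INSERT statements.
    return "client-side INSERT via psycopg2"

print(loading_method("s3://bucket/data.jsonl"))
print(loading_method("/tmp/data.csv"))
```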

### Datasets

Use `dataset_name="doc"` to address CrateDB's default schema `doc`.
When addressing other schemas, make sure they contain at least one table. [^create-schema]

### File formats

- The [SQL INSERT file format] is the preferred format for both direct loading and staging.

### Column types

The `cratedb` destination has a few specific deviations from the default SQL destinations.

- CrateDB does not support the `time` datatype. `time` values are loaded into a `text` column.
- CrateDB does not support the `binary` datatype. Binary values are loaded into a `text` column.
- CrateDB can produce rounding errors under certain conditions when using the `float`/`double` datatypes.
  Use the `decimal` datatype if you cannot tolerate rounding errors.
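
The `float`/`double` caveat is ordinary binary floating-point behavior, and the
`time`-to-text mapping boils down to serializing the value as a string. A quick
stdlib sketch (how exactly the adapter serializes values may differ):

```python
import datetime
from decimal import Decimal

# A `time` value ends up as text, e.g. its ISO representation.
assert str(datetime.time(12, 30, 45)) == "12:30:45"

# Binary floats accumulate rounding errors ...
assert 0.1 + 0.2 != 0.3

# ... while Decimal arithmetic stays exact.
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
```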

### Column hints

CrateDB supports the following [column hints].

- `primary_key` - marks the column as part of the primary key. Multiple columns can have this hint to create a composite primary key.
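
Conceptually, a composite primary key identifies each row by the combination of
the hinted columns. A plain-Python sketch of that idea, using hypothetical data
rather than adapter code:

```python
rows = [
    {"country": "AT", "city": "Vienna", "population": 1_900_000},
    {"country": "DE", "city": "Berlin", "population": 3_600_000},
    {"country": "AT", "city": "Vienna", "population": 2_000_000},  # same composite key
]

# Index rows by the composite key (country, city): a later row with the same
# key replaces the earlier one, since a primary key permits one row per key.
table = {(row["country"], row["city"]): row for row in rows}

print(len(table))
```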

### File staging

CrateDB supports Amazon S3, Google Cloud Storage, and Azure Blob Storage as file staging destinations.

`dlt` will upload CSV or JSONL files to the staging location and use CrateDB data loading functions
to load the data directly from the staged files.

Please refer to the filesystem documentation to learn how to configure credentials for the staging destinations.

- [AWS S3]
- [Azure Blob Storage]
- [Google Storage]

Invoke a pipeline with staging enabled.

```python
pipeline = dlt.pipeline(
pipeline_name='chess_pipeline',
destination='cratedb',
staging='filesystem', # add this to activate staging
dataset_name='chess_data'
)
```

### dbt support

Integration with [dbt] is generally supported via [dbt-cratedb2], but has not been tested by us.

### dlt state sync

The CrateDB destination fully supports [dlt state sync].


## See also

:::{rubric} Examples
:::

::::{grid}

:::{grid-item-card} Usage guide: Load API data with dlt
:link: dlt-usage
:link-type: ref
Exercise a canonical `dlt init` example with CrateDB.
:::

:::{grid-item-card} Examples: Use dlt with CrateDB
:link: https://github.com/crate/cratedb-examples/tree/main/framework/dlt
:link-type: url
Executable code examples on GitHub that demonstrate how to use dlt with CrateDB.
:::

::::

:::{rubric} Resources
:::

::::{grid}

:::{grid-item-card} Package: `dlt-cratedb`
:link: https://pypi.org/project/dlt-cratedb/
:link-type: url
The dlt destination adapter for CrateDB is
based on the dlt PostgreSQL adapter.
:::

:::{grid-item-card} Related: `ingestr`
:link: ingestr
:link-type: ref
The ingestr data import/export application uses dlt as a workhorse.
:::

::::


:::{toctree}
:maxdepth: 1
:hidden:
Usage <usage>
:::


[^create-schema]: CrateDB does not support `CREATE SCHEMA` yet, see [CRATEDB-14601].
This means by default, unless any table exists within a schema, the schema appears
not to exist at all. However, it also can't be created explicitly. Schemas are
currently implicitly created when tables exist in them.

[30+ databases supported by SQLAlchemy]: https://dlthub.com/docs/dlt-ecosystem/destinations/sqlalchemy
[AWS S3]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#aws-s3
[Azure Blob Storage]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#azure-blob-storage
[column hints]: https://dlthub.com/docs/general-usage/schema#column-hint-rules
[CRATEDB-14601]: https://github.com/crate/crate/issues/14601
[dbt]: https://dlthub.com/docs/hub/features/transformations/dbt-transformations
[dbt-cratedb2]: https://pypi.org/project/dbt-cratedb2/
[dlt]: https://dlthub.com/
[dlt state sync]: https://dlthub.com/docs/general-usage/state#syncing-state-with-destination
[Google Storage]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#google-storage
[SQL INSERT file format]: https://dlthub.com/docs/dlt-ecosystem/file-formats/insert-format
97 changes: 97 additions & 0 deletions docs/integrate/dlt/usage.md
@@ -0,0 +1,97 @@
---
title: CrateDB
description: CrateDB `dlt` destination
keywords: [ cratedb, destination, data warehouse ]
---

(dlt-usage)=
# Load API data with dlt

:::{div} sd-text-muted
Exercise a canonical `dlt init` example with CrateDB.
:::

## Install the package

Install the dlt destination adapter for CrateDB.
```shell
pip install dlt-cratedb
```

## Initialize the dlt project

Start by initializing a new example `dlt` project.

```shell
export DESTINATION__CRATEDB__DESTINATION_TYPE=postgres
dlt init chess cratedb
```

The `dlt init` command will initialize your pipeline with `chess` [^chess-source]
as the source, and `cratedb` as the destination. It generates several files and directories.

## Edit the pipeline definition

The pipeline definition is stored in the Python file `chess_pipeline.py`.

- Because the dlt adapter currently only supports writing to CrateDB's default
  schema `doc` [^create-schema], replace `dataset_name="chess_players_games_data"`
  with `dataset_name="doc"` in the generated `chess_pipeline.py` file.

- To initialize the CrateDB destination adapter, insert the `import dlt_cratedb`
statement at the top of the file. Otherwise, the destination will not be found,
and you will receive a corresponding error [^not-initialized-error].

## Configure credentials

Next, set up the CrateDB credentials in the `.dlt/secrets.toml` file as shown below.
CrateDB is compatible with PostgreSQL and uses the `psycopg2` driver, like the
`postgres` destination.

```toml
[destination.cratedb.credentials]
host = "localhost" # CrateDB server host.
port = 5432 # CrateDB PostgreSQL TCP protocol port, default is 5432.
username = "crate" # CrateDB username, default is usually "crate".
password = "crate" # CrateDB password, if any.
database = "crate" # CrateDB only knows a single database called `crate`.
connect_timeout = 15
```

Alternatively, you can pass a database connection string as shown below.
```toml
destination.cratedb.credentials="postgres://crate:crate@localhost:5432/"
```
Keep this line at the top of your TOML file, before any section starts.
Because CrateDB uses the `psycopg2` driver, the `postgres://` scheme is the right choice.
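
Both configuration styles carry the same information; the URL simply packs the
individual credential fields into one string, as this stdlib sketch shows:

```python
from urllib.parse import urlsplit

# The connection string decomposes into the same fields as the
# [destination.cratedb.credentials] section above.
url = urlsplit("postgres://crate:crate@localhost:5432/")

print(url.username, url.password, url.hostname, url.port)
```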

## Start CrateDB

Use Docker or Podman to run an instance of CrateDB for evaluation purposes.
```shell
docker run --rm --name=cratedb --publish=4200:4200 --publish=5432:5432 crate:latest '-Cdiscovery.type=single-node'
```

## Run pipeline

```shell
python chess_pipeline.py
```

## Explore data
```shell
crash -c 'SELECT * FROM players_profiles LIMIT 10;'
crash -c 'SELECT * FROM players_online_status LIMIT 10;'
```


[^chess-source]: The `chess` dlt source pulls publicly available data from
the [Chess.com Published-Data API].
[^create-schema]: CrateDB does not support `CREATE SCHEMA` yet, see [CRATEDB-14601].
This means by default, unless any table exists within a schema, the schema appears
    not to exist at all. However, it also can't be created explicitly. Schemas are
    currently implicitly created when tables exist in them.
[^not-initialized-error]: `UnknownDestinationModule: Destination "cratedb" is not one of the standard dlt destinations`

[Chess.com Published-Data API]: https://www.chess.com/news/view/published-data-api
[CRATEDB-14601]: https://github.com/crate/crate/issues/14601