Commit 36bf2ac (parent 914fd22)

Integrate/dlt: Pull README and usage guide from upstream repository

Includes:
- Overview about supported features
- Usage guide based on `dlt init`

2 files changed (+216 −11 lines)

docs/integrate/dlt/index.md

Lines changed: 119 additions & 11 deletions
@@ -13,7 +13,9 @@
 [dlt] (data load tool)—think ELT as Python code—is a popular,
 production-ready Python library for moving data. It loads data from
 various and often messy data sources into well-structured, live datasets.
-dlt is used by {ref}`ingestr`.
+
+dlt supports [30+ databases supported by SQLAlchemy],
+and is also the workhorse behind the {ref}`ingestr` toolkit.
 
 ::::{grid}
 
@@ -36,12 +38,13 @@ dlt is used by {ref}`ingestr`.
 Prerequisites:
 Install dlt and the CrateDB destination adapter:
 ```shell
-pip install dlt dlt-cratedb
+pip install --upgrade dlt-cratedb
 ```
 
 Load data from cloud storage or files into CrateDB.
 ```python
 import dlt
+import dlt_cratedb
 from dlt.sources.filesystem import filesystem
 
 resource = filesystem(
@@ -60,6 +63,7 @@ pipeline.run(resource)
 
 Load data from SQL databases into CrateDB.
 ```python
+import dlt_cratedb
 from dlt.sources.sql_database import sql_database
 
 source = sql_database(
@@ -75,32 +79,136 @@ pipeline = dlt.pipeline(
 pipeline.run(source)
 ```
 
-## Learn
+## Supported features
+
+### Data loading
+
+Data is loaded into CrateDB using the most efficient method, depending on the data source.
+
+- For local files, the `psycopg2` library is used to load files directly into
+  CrateDB tables using the `INSERT` command.
+- For files in remote storage like S3 or Azure Blob Storage,
+  CrateDB data loading functions are used to read the files and insert the data into tables.
+
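The `INSERT`-based path mentioned above can be illustrated with a small stdlib-only sketch. The `build_insert` helper is hypothetical and for illustration only; dlt itself issues such statements through `psycopg2`, which also handles escaping and parameter binding.

```python
# Hypothetical helper, for illustration only: dlt's actual implementation
# sends multi-row INSERT statements via psycopg2 instead of string formatting.
def build_insert(table: str, rows: list) -> str:
    columns = list(rows[0])
    values = ", ".join(
        "(" + ", ".join(repr(row[col]) for col in columns) + ")"
        for row in rows
    )
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {values};"

statement = build_insert("doc.players", [{"id": 1, "name": "magnus"}])
# -> INSERT INTO doc.players (id, name) VALUES (1, 'magnus');
```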
+### Datasets
+
+Use `dataset_name="doc"` to address CrateDB's default schema `doc`.
+When addressing other schemas, make sure they contain at least one table. [^create-schema]
+
+### File formats
+
+- The [SQL INSERT file format] is the preferred format for both direct loading and staging.
+
+### Column types
+
+The `cratedb` destination has a few specific deviations from the default SQL destinations.
+
+- CrateDB does not support the `time` datatype. Time will be loaded into a `text` column.
+- CrateDB does not support the `binary` datatype. Binary will be loaded into a `text` column.
+- CrateDB can produce rounding errors under certain conditions when using the `float/double` datatypes.
+  Make sure to use the `decimal` datatype if you can't afford rounding errors.
+
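The rounding caveat above is easy to demonstrate in plain Python, using nothing beyond the standard library:

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 exactly, so sums drift slightly.
float_sum = 0.1 + 0.2          # 0.30000000000000004
assert float_sum != 0.3

# Decimal arithmetic stays exact for decimal string literals.
decimal_sum = Decimal("0.1") + Decimal("0.2")
assert decimal_sum == Decimal("0.3")
```

The same drift occurs inside any database that stores `float/double` values, which is why `decimal` is recommended for exact amounts.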
+### Column hints
+
+CrateDB supports the following [column hints].
+
+- `primary_key`: marks the column as part of the primary key. Multiple columns can have this hint to create a composite primary key.
+
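The effect of a composite primary key can be sketched without dlt: rows sharing the same key pair collide, and with merge-style semantics the last row per key wins. The records and field names below are purely illustrative, not dlt code.

```python
# Illustrative records keyed by a hypothetical composite
# primary key (player_id, game_id).
records = [
    {"player_id": 1, "game_id": "a", "result": "win"},
    {"player_id": 1, "game_id": "b", "result": "loss"},
    {"player_id": 1, "game_id": "a", "result": "draw"},  # same key as row one
]

# With a composite primary key and merge semantics, the last record
# seen for each (player_id, game_id) pair replaces earlier ones.
merged = {}
for record in records:
    merged[(record["player_id"], record["game_id"])] = record

assert len(merged) == 2
```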
+### File staging
+
+CrateDB supports Amazon S3, Google Cloud Storage, and Azure Blob Storage as file staging destinations.
+
+`dlt` will upload CSV or JSONL files to the staging location and use CrateDB data loading functions
+to load the data directly from the staged files.
+
+Please refer to the filesystem documentation to learn how to configure credentials for the staging destinations.
+
+- [AWS S3]
+- [Azure Blob Storage]
+- [Google Storage]
+
+Invoke a pipeline with staging enabled.
+
+```python
+pipeline = dlt.pipeline(
+    pipeline_name='chess_pipeline',
+    destination='cratedb',
+    staging='filesystem',  # add this to activate staging
+    dataset_name='chess_data'
+)
+```
+
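For the S3 case, the staging credentials usually live in `.dlt/secrets.toml`. The following is a sketch only; the bucket name is a placeholder and the keys are left elided, so please consult the linked filesystem documentation for the authoritative layout.

```toml
# Sketch: filesystem staging on S3; bucket name is a placeholder.
[destination.filesystem]
bucket_url = "s3://your-staging-bucket"

[destination.filesystem.credentials]
aws_access_key_id = "..."
aws_secret_access_key = "..."
```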
+### dbt support
+
+Integration with [dbt] is generally supported via [dbt-cratedb2] but not tested by us.
+
+### dlt state sync
+
+The CrateDB destination fully supports [dlt state sync].
+
+
+## See also
+
+:::{rubric} Examples
+:::
 
 ::::{grid}
 
+:::{grid-item-card} Usage guide: Load API data with dlt
+:link: dlt-usage
+:link-type: ref
+Exercise a canonical `dlt init` example with CrateDB.
+:::
+
 :::{grid-item-card} Examples: Use dlt with CrateDB
 :link: https://github.com/crate/cratedb-examples/tree/main/framework/dlt
 :link-type: url
-Executable code examples that demonstrate how to use dlt with CrateDB.
+Executable code examples on GitHub that demonstrate how to use dlt with CrateDB.
+:::
+
+::::
+
+:::{rubric} Resources
 :::
 
-:::{grid-item-card} Adapter: The dlt destination adapter for CrateDB
-:link: https://github.com/crate/dlt-cratedb
+::::{grid}
+
+:::{grid-item-card} Package: `dlt-cratedb`
+:link: https://pypi.org/project/dlt-cratedb/
 :link-type: url
-Based on the dlt PostgreSQL adapter, the package enables you to work
-with dlt and CrateDB.
+The dlt destination adapter for CrateDB is
+based on the dlt PostgreSQL adapter.
 :::
 
-:::{grid-item-card} See also: ingestr
+:::{grid-item-card} Related: `ingestr`
 :link: ingestr
 :link-type: ref
-The ingestr data import/export application uses dlt.
+The ingestr data import/export application uses dlt as a workhorse.
 :::
 
 ::::
 
 
+:::{toctree}
+:maxdepth: 1
+:hidden:
+Usage <usage>
+:::
+
+
+[^create-schema]: CrateDB does not support `CREATE SCHEMA` yet, see [CRATEDB-14601].
+    This means by default, unless any table exists within a schema, the schema appears
+    not to exist at all. However, it also can't be created explicitly. Schemas are
+    currently implicitly created when tables exist in them.
 
-[databases supported by SQLAlchemy]: https://docs.sqlalchemy.org/en/20/dialects/
+[30+ databases supported by SQLAlchemy]: https://dlthub.com/docs/dlt-ecosystem/destinations/sqlalchemy
+[AWS S3]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#aws-s3
+[Azure Blob Storage]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#azure-blob-storage
+[column hints]: https://dlthub.com/docs/general-usage/schema#column-hint-rules
+[CRATEDB-14601]: https://github.com/crate/crate/issues/14601
+[dbt]: https://dlthub.com/docs/hub/features/transformations/dbt-transformations
+[dbt-cratedb2]: https://pypi.org/project/dbt-cratedb2/
 [dlt]: https://dlthub.com/
+[dlt state sync]: https://dlthub.com/docs/general-usage/state#syncing-state-with-destination
+[Google Storage]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#google-storage
+[SQL INSERT file format]: https://dlthub.com/docs/dlt-ecosystem/file-formats/insert-format

docs/integrate/dlt/usage.md

Lines changed: 97 additions & 0 deletions

@@ -0,0 +1,97 @@
+---
+title: CrateDB
+description: CrateDB `dlt` destination
+keywords: [ cratedb, destination, data warehouse ]
+---
+
+(dlt-usage)=
+# Load API data with dlt
+
+:::{div} sd-text-muted
+Exercise a canonical `dlt init` example with CrateDB.
+:::
+
+## Install the package
+
+Install the dlt destination adapter for CrateDB.
+```shell
+pip install dlt-cratedb
+```
+
+## Initialize the dlt project
+
+Start by initializing a new example `dlt` project.
+
+```shell
+export DESTINATION__CRATEDB__DESTINATION_TYPE=postgres
+dlt init chess cratedb
+```
+
+The `dlt init` command will initialize your pipeline with `chess` [^chess-source]
+as the source, and `cratedb` as the destination. It generates several files and directories.
+
+## Edit the pipeline definition
+
+The pipeline definition is stored in the Python file `chess_pipeline.py`.
+
+- Because the dlt adapter currently only supports writing to the default `doc` schema
+  of CrateDB [^create-schema], please replace `dataset_name="chess_players_games_data"`
+  with `dataset_name="doc"` in the generated `chess_pipeline.py` file.
+
+- To initialize the CrateDB destination adapter, insert the `import dlt_cratedb`
+  statement at the top of the file. Otherwise, the destination will not be found,
+  and you will receive a corresponding error [^not-initialized-error].
+
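Taken together, the two edits above amount to a small patch against the generated file. This is a sketch only; the exact surrounding lines produced by `dlt init` may differ.

```diff
--- chess_pipeline.py
+++ chess_pipeline.py
@@
+import dlt_cratedb  # registers the "cratedb" destination
 import dlt
@@
     pipeline = dlt.pipeline(
         pipeline_name="chess_pipeline",
         destination="cratedb",
-        dataset_name="chess_players_games_data",
+        dataset_name="doc",
     )
```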
+## Configure credentials
+
+Next, set up the CrateDB credentials in the `.dlt/secrets.toml` file as shown below.
+CrateDB is compatible with PostgreSQL and uses the `psycopg2` driver, like the
+`postgres` destination.
+
+```toml
+[destination.cratedb.credentials]
+host = "localhost"     # CrateDB server host.
+port = 5432            # CrateDB PostgreSQL TCP protocol port, default is 5432.
+username = "crate"     # CrateDB username, default is usually "crate".
+password = "crate"     # CrateDB password, if any.
+database = "crate"     # CrateDB only knows a single database called `crate`.
+connect_timeout = 15
+```
+
+Alternatively, you can pass a database connection string as shown below.
+```toml
+destination.cratedb.credentials="postgres://crate:crate@localhost:5432/"
+```
+Keep it at the top of your TOML file, before any section starts.
+Because CrateDB uses `psycopg2`, the `postgres://` scheme is the right choice.
+
+## Start CrateDB
+
+Use Docker or Podman to run an instance of CrateDB for evaluation purposes.
+```shell
+docker run --rm --name=cratedb --publish=4200:4200 --publish=5432:5432 crate:latest '-Cdiscovery.type=single-node'
+```
+
+## Run pipeline
+
+```shell
+python chess_pipeline.py
+```
+
+## Explore data
+
+```shell
+crash -c 'SELECT * FROM players_profiles LIMIT 10;'
+crash -c 'SELECT * FROM players_online_status LIMIT 10;'
+```
+
+
+[^chess-source]: The `chess` dlt source pulls publicly available data from
+    the [Chess.com Published-Data API].
+[^create-schema]: CrateDB does not support `CREATE SCHEMA` yet, see [CRATEDB-14601].
+    This means by default, unless any table exists within a schema, the schema appears
+    not to exist at all. However, it also can't be created explicitly. Schemas are
+    currently implicitly created when tables exist in them.
+[^not-initialized-error]: `UnknownDestinationModule: Destination "cratedb" is not one of the standard dlt destinations`
+
+[Chess.com Published-Data API]: https://www.chess.com/news/view/published-data-api
+[CRATEDB-14601]: https://github.com/crate/crate/issues/14601
