diff --git a/docs/integrate/dlt/index.md b/docs/integrate/dlt/index.md
index 68055756..568fb14e 100644
--- a/docs/integrate/dlt/index.md
+++ b/docs/integrate/dlt/index.md
@@ -13,7 +13,9 @@
 [dlt] (data load tool)—think ELT as Python code—is a popular,
 production-ready Python library for moving data. It loads data from
 various and often messy data sources into well-structured, live datasets.
-dlt is used by {ref}`ingestr`.
+
+dlt can load data into the [30+ databases supported by SQLAlchemy],
+and is also the workhorse behind the {ref}`ingestr` toolkit.
 
 ::::{grid}
 
@@ -36,12 +38,13 @@ dlt is used by {ref}`ingestr`.
 Prerequisites: Install dlt and the CrateDB destination adapter:
 ```shell
-pip install dlt dlt-cratedb
+pip install --upgrade dlt-cratedb
 ```
 
 Load data from cloud storage or files into CrateDB.
 
 ```python
 import dlt
+import dlt_cratedb
 from dlt.sources.filesystem import filesystem
 
 resource = filesystem(
@@ -60,6 +63,7 @@ pipeline.run(resource)
 Load data from SQL databases into CrateDB.
 
 ```python
+import dlt_cratedb
 from dlt.sources.sql_database import sql_database
 
 source = sql_database(
@@ -75,32 +79,136 @@ pipeline = dlt.pipeline(
 pipeline.run(source)
 ```
 
-## Learn
+## Supported features
+
+### Data loading
+
+Data is loaded into CrateDB using the most efficient method for the given data source.
+
+- For local files, the `psycopg2` library loads the data directly into
+  CrateDB tables using the `INSERT` command.
+- For files in remote storage such as S3 or Azure Blob Storage,
+  CrateDB data loading functions read the files and insert the data into tables.
+
+### Datasets
+
+Use `dataset_name="doc"` to address CrateDB's default schema `doc`.
+When addressing other schemas, make sure they contain at least one table. [^create-schema]
+
+### File formats
+
+- The [SQL INSERT file format] is the preferred format for both direct loading and staging.
+
+### Column types
+
+The `cratedb` destination has a few specific deviations from the default SQL destinations.
+
+- CrateDB does not support the `time` datatype. Time values are loaded into a `text` column.
+- CrateDB does not support the `binary` datatype. Binary values are loaded into a `text` column.
+- CrateDB can produce rounding errors under certain conditions when using the `float/double` datatype.
+  Use the `decimal` datatype if you cannot afford rounding errors.
+
+### Column hints
+
+CrateDB supports the following [column hints].
+
+- `primary_key` - marks the column as part of the primary key. Apply this hint
+  to multiple columns to create a composite primary key.
+
+### File staging
+
+CrateDB supports Amazon S3, Google Cloud Storage, and Azure Blob Storage as file staging destinations.
+
+`dlt` uploads CSV or JSONL files to the staging location and uses CrateDB data loading functions
+to load the data directly from the staged files.
+
+Please refer to the filesystem documentation to learn how to configure credentials for the staging destinations.
+
+- [AWS S3]
+- [Azure Blob Storage]
+- [Google Storage]
+
+Invoke a pipeline with staging enabled.
+
+```python
+pipeline = dlt.pipeline(
+    pipeline_name='chess_pipeline',
+    destination='cratedb',
+    staging='filesystem',  # add this to activate staging
+    dataset_name='chess_data'
+)
+```
+
+### dbt support
+
+Integration with [dbt] is generally supported via [dbt-cratedb2], but is not tested by us.
+
+### dlt state sync
+
+The CrateDB destination fully supports [dlt state sync].
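+
+### Example: column hints and numeric types
+
+The following minimal sketch ties a few of the notes above together: it declares
+a composite `primary_key` column hint, prefers `decimal` over `float` to avoid
+rounding errors, and loads into CrateDB's default `doc` schema. The resource
+name and its fields are illustrative only.
+
+```python
+from decimal import Decimal
+
+import dlt
+import dlt_cratedb  # registers the CrateDB destination
+
+
+@dlt.resource(
+    primary_key=["player_id", "game_id"],          # composite primary key hint
+    columns={"rating": {"data_type": "decimal"}},  # avoid float rounding errors
+)
+def games():
+    # Any iterable of dicts works; this one is just an in-memory example.
+    yield {"player_id": 1, "game_id": 42, "rating": Decimal("2847.5")}
+
+
+pipeline = dlt.pipeline(
+    pipeline_name="column_hints_example",
+    destination="cratedb",
+    dataset_name="doc",  # CrateDB's default schema
+)
+pipeline.run(games())
+```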
+
+
+## See also
+
+:::{rubric} Examples
+:::
 ::::{grid}
 
+:::{grid-item-card} Usage guide: Load API data with dlt
+:link: dlt-usage
+:link-type: ref
+Exercise a canonical `dlt init` example with CrateDB.
+:::
+
 :::{grid-item-card} Examples: Use dlt with CrateDB
 :link: https://github.com/crate/cratedb-examples/tree/main/framework/dlt
 :link-type: url
-Executable code examples that demonstrate how to use dlt with CrateDB.
+Executable code examples on GitHub that demonstrate how to use dlt with CrateDB.
+:::
+
+::::
+
+:::{rubric} Resources
 :::
 
-:::{grid-item-card} Adapter: The dlt destination adapter for CrateDB
-:link: https://github.com/crate/dlt-cratedb
+::::{grid}
+
+:::{grid-item-card} Package: `dlt-cratedb`
+:link: https://pypi.org/project/dlt-cratedb/
 :link-type: url
-Based on the dlt PostgreSQL adapter, the package enables you to work
-with dlt and CrateDB.
+The dlt destination adapter for CrateDB is
+based on the dlt PostgreSQL adapter.
 :::
 
-:::{grid-item-card} See also: ingestr
+:::{grid-item-card} Related: `ingestr`
 :link: ingestr
 :link-type: ref
-The ingestr data import/export application uses dlt.
+The ingestr data import/export application uses dlt as its workhorse.
 :::
 
 ::::
 
+:::{toctree}
+:maxdepth: 1
+:hidden:
+Usage
+:::
+
+
+[^create-schema]: CrateDB does not support `CREATE SCHEMA` yet, see [CRATEDB-14601].
+    This means a schema only becomes visible once it contains at least one table;
+    it cannot be created explicitly, but is created implicitly when a table is
+    created within it.
 
-[databases supported by SQLAlchemy]: https://docs.sqlalchemy.org/en/20/dialects/
+[30+ databases supported by SQLAlchemy]: https://dlthub.com/docs/dlt-ecosystem/destinations/sqlalchemy
+[AWS S3]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#aws-s3
+[Azure Blob Storage]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#azure-blob-storage
+[column hints]: https://dlthub.com/docs/general-usage/schema#column-hint-rules
+[CRATEDB-14601]: https://github.com/crate/crate/issues/14601
+[dbt]: https://dlthub.com/docs/hub/features/transformations/dbt-transformations
+[dbt-cratedb2]: https://pypi.org/project/dbt-cratedb2/
 [dlt]: https://dlthub.com/
+[dlt state sync]: https://dlthub.com/docs/general-usage/state#syncing-state-with-destination
+[Google Storage]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#google-storage
+[SQL INSERT file format]: https://dlthub.com/docs/dlt-ecosystem/file-formats/insert-format
diff --git a/docs/integrate/dlt/usage.md b/docs/integrate/dlt/usage.md
new file mode 100644
index 00000000..4604d071
--- /dev/null
+++ b/docs/integrate/dlt/usage.md
@@ -0,0 +1,97 @@
+---
+title: CrateDB
+description: CrateDB `dlt` destination
+keywords: [ cratedb, destination, data warehouse ]
+---
+
+(dlt-usage)=
+# Load API data with dlt
+
+:::{div} sd-text-muted
+Exercise a canonical `dlt init` example with CrateDB.
+:::
+
+## Install the package
+
+Install the dlt destination adapter for CrateDB.
+```shell
+pip install dlt-cratedb
+```
+
+## Initialize the dlt project
+
+Start by initializing a new example `dlt` project.
+
+```shell
+export DESTINATION__CRATEDB__DESTINATION_TYPE=postgres
+dlt init chess cratedb
+```
+
+The `dlt init` command initializes your pipeline with `chess` [^chess-source]
+as the source and `cratedb` as the destination. It generates several files and directories.
+
+## Edit the pipeline definition
+
+The pipeline definition is stored in the Python file `chess_pipeline.py`.
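+
+After applying the two adjustments listed below, the relevant part of the file
+might look roughly like the following trimmed sketch. The code generated by
+`dlt init` is more elaborate, and names such as `source` or the example player
+handles may differ in your generated file.
+
+```python
+import dlt
+import dlt_cratedb  # initializes the CrateDB destination adapter (see below)
+
+from chess import source  # the `chess` source scaffolded by `dlt init`
+
+pipeline = dlt.pipeline(
+    pipeline_name="chess_pipeline",
+    destination="cratedb",
+    dataset_name="doc",  # CrateDB's default schema (see below)
+)
+
+# Load profiles and games for a couple of example players.
+info = pipeline.run(source(players=["magnuscarlsen", "hikaru"]))
+print(info)
+```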
+
+- Because the dlt adapter currently only supports writing to the default `doc` schema
+  of CrateDB [^create-schema], replace `dataset_name="chess_players_games_data"`
+  with `dataset_name="doc"` within the generated `chess_pipeline.py` file.
+
+- To initialize the CrateDB destination adapter, insert the `import dlt_cratedb`
+  statement at the top of the file. Otherwise, the destination will not be found
+  and you will receive a corresponding error [^not-initialized-error].
+
+## Configure credentials
+
+Next, set up the CrateDB credentials in the `.dlt/secrets.toml` file as shown below.
+CrateDB is compatible with PostgreSQL and, like the `postgres` destination, uses
+the `psycopg2` driver.
+
+```toml
+[destination.cratedb.credentials]
+host = "localhost"     # CrateDB server host.
+port = 5432            # CrateDB PostgreSQL TCP protocol port, default is 5432.
+username = "crate"     # CrateDB username, default is usually "crate".
+password = "crate"     # CrateDB password, if any.
+database = "crate"     # CrateDB only knows a single database called `crate`.
+connect_timeout = 15
+```
+
+Alternatively, you can pass a database connection string as shown below.
+```toml
+destination.cratedb.credentials="postgres://crate:crate@localhost:5432/"
+```
+Keep it at the top of your TOML file, before any section starts.
+Because CrateDB uses the `psycopg2` driver, the `postgres://` scheme is the right choice.
+
+## Start CrateDB
+
+Use Docker or Podman to run an instance of CrateDB for evaluation purposes.
+```shell
+docker run --rm --name=cratedb --publish=4200:4200 --publish=5432:5432 crate:latest '-Cdiscovery.type=single-node'
+```
+
+## Run the pipeline
+
+Run the pipeline script generated by `dlt init`.
+```shell
+python chess_pipeline.py
+```
+
+## Explore the data
+
+Inspect the loaded tables using `crash`, the CrateDB shell.
+```shell
+crash -c 'SELECT * FROM players_profiles LIMIT 10;'
+crash -c 'SELECT * FROM players_online_status LIMIT 10;'
+```
+
+
+[^chess-source]: The `chess` dlt source pulls publicly available data from
+    the [Chess.com Published-Data API].
+[^create-schema]: CrateDB does not support `CREATE SCHEMA` yet, see [CRATEDB-14601].
+    This means a schema only becomes visible once it contains at least one table;
+    it cannot be created explicitly, but is created implicitly when a table is
+    created within it.
+[^not-initialized-error]: `UnknownDestinationModule: Destination "cratedb" is not one of the standard dlt destinations`
+
+[Chess.com Published-Data API]: https://www.chess.com/news/view/published-data-api
+[CRATEDB-14601]: https://github.com/crate/crate/issues/14601