From 3a4d5a66630380b55c2db27a17bf69d1255d684f Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Mon, 25 Aug 2025 02:01:43 +0200 Subject: [PATCH 1/3] Integrate/dlt: Add section and category item --- docs/ingest/etl/index.md | 4 ++ docs/integrate/dlt/index.md | 100 ++++++++++++++++++++++++++++++++++++ docs/integrate/index.md | 1 + 3 files changed, 105 insertions(+) create mode 100644 docs/integrate/dlt/index.md diff --git a/docs/ingest/etl/index.md b/docs/ingest/etl/index.md index 0d3ff371..b5d9ae3d 100644 --- a/docs/ingest/etl/index.md +++ b/docs/ingest/etl/index.md @@ -38,6 +38,10 @@ outlines how to use them effectively. Additionally, see support for {ref}`cdc` s dbt is an SQL-first platform for transforming data in data warehouses using Python and SQL. The data abstraction layer provided by dbt-core allows the decoupling of the models on which reports and dashboards rely from the source data. +- {ref}`dlt` + + dlt is a popular production-ready Python library for moving data: + Think ELT as Python code. - {ref}`flink` diff --git a/docs/integrate/dlt/index.md b/docs/integrate/dlt/index.md new file mode 100644 index 00000000..06c49aad --- /dev/null +++ b/docs/integrate/dlt/index.md @@ -0,0 +1,100 @@ +(dlt)= +# dlt + +```{div} .float-right .text-right +![dlt logo](https://cdn.sanity.io/images/nsq559ov/production/7f85e56e715b847c5519848b7198db73f793448d-82x25.svg?w=2000&auto=format){loading=lazy}[dlt] +

+ + CI status: dlt +``` +```{div} .clearfix +``` + +[dlt] (data load tool)--think ELT as Python code--is the most popular +production-ready Python library for moving data. It loads data from +various and often messy data sources into well-structured, live datasets. +dlt is used by {ref}`ingestr`. + +::::{grid} + +:::{grid-item} +- **Just code**: no need to use any backends or containers. + +- **Platform agnostic**: Does not replace your data platform, deployments, or security + models. Simply import dlt in your favorite AI code editor, or add it to your Jupyter + Notebook. + +- **Versatile**: You can load data from any source that produces Python data structures, + including APIs, files, databases, and more. +::: + +:::: + + +## Synopsis + +Load data from cloud storage or files into CrateDB. +```python +import dlt +from dlt.sources.filesystem import filesystem + +resource = filesystem( + bucket_url="s3://example-bucket", + file_glob="*.csv" +) + +pipeline = dlt.pipeline( + pipeline_name="filesystem_example", + destination=dlt.destinations.cratedb("postgresql://crate:crate@localhost:5432/"), + dataset_name="doc", +) + +pipeline.run(resource) +``` + +Load data from SQL databases into CrateDB. +```python +from dlt.sources.sql_database import sql_database + +source = sql_database( + "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam" +) + +pipeline = dlt.pipeline( + pipeline_name="sql_database_example", + destination=dlt.destinations.cratedb("postgresql://crate:crate@localhost:5432/"), + dataset_name="doc", +) + +pipeline.run(source) +``` + +## Learn + +::::{grid} + +:::{grid-item-card} Examples: Use dlt with CrateDB +:link: https://github.com/crate/cratedb-examples/tree/main/framework/dlt +:link-type: url +Executable code examples that demonstrate how to use dlt with CrateDB. 
+::: + +:::{grid-item-card} Adapter: The dlt destination adapter for CrateDB +:link: https://github.com/crate/dlt-cratedb +:link-type: url +Based on the dlt PostgreSQL adapter, the package enables you to work +with dlt and CrateDB. +::: + +:::{grid-item-card} See also: ingestr +:link: ingestr +:link-type: ref +The ingestr data import/export application uses dlt. +::: + +:::: + + + +[databases supported by SQLAlchemy]: https://docs.sqlalchemy.org/en/20/dialects/ +[dlt]: https://dlthub.com/ diff --git a/docs/integrate/index.md b/docs/integrate/index.md index 5baddec1..73987885 100644 --- a/docs/integrate/index.md +++ b/docs/integrate/index.md @@ -26,6 +26,7 @@ dbeaver/index dbt/index debezium/index django/index +dlt/index dms/index dynamodb/index estuary/index From 234d0040cb11abc2ee0a178f30e0732219ab059c Mon Sep 17 00:00:00 2001 From: Andreas Motl Date: Mon, 25 Aug 2025 02:01:55 +0200 Subject: [PATCH 2/3] Integrate/ingestr: Add section and category item --- docs/ingest/etl/index.md | 5 ++ docs/integrate/index.md | 1 + docs/integrate/ingestr/index.md | 128 ++++++++++++++++++++++++++++++++ 3 files changed, 134 insertions(+) create mode 100644 docs/integrate/ingestr/index.md diff --git a/docs/ingest/etl/index.md b/docs/ingest/etl/index.md index b5d9ae3d..cf664c32 100644 --- a/docs/ingest/etl/index.md +++ b/docs/ingest/etl/index.md @@ -48,6 +48,11 @@ outlines how to use them effectively. Additionally, see support for {ref}`cdc` s Apache Flink is a programming framework and distributed processing engine for stateful computations over unbounded and bounded data streams, written in Java. +- {ref}`ingestr` + + ingestr is a command-line application that allows copying data from any + source into any destination database. 
+ - {ref}`kestra` Kestra is an open-source workflow automation and orchestration toolkit with a rich diff --git a/docs/integrate/index.md b/docs/integrate/index.md index 73987885..96152972 100644 --- a/docs/integrate/index.md +++ b/docs/integrate/index.md @@ -37,6 +37,7 @@ grafana/index hop/index iceberg/index influxdb/index +ingestr/index kafka/index kestra/index kinesis/index diff --git a/docs/integrate/ingestr/index.md b/docs/integrate/ingestr/index.md new file mode 100644 index 00000000..4ff280c7 --- /dev/null +++ b/docs/integrate/ingestr/index.md @@ -0,0 +1,128 @@ +(ingestr)= +# ingestr + +```{div} .float-right .text-right + + CI status: ingestr +``` +```{div} .clearfix +``` + +[ingestr] is a command-line application that allows copying data from any +source into any destination database. It supports CrateDB on the source +and the destination side. ingestr uses {ref}`dlt`. + +::::{grid} + +:::{grid-item} +- **Single command**: ingestr allows copying & ingesting data from any source + to any destination with a single command. + +- **Many sources & destinations**: ingestr supports all common source and + destination databases. + +- **Incremental Loading**: ingestr supports both full-refresh and + incremental loading modes. +::: + +:::{grid-item} +![ingestr in a nutshell](https://github.com/bruin-data/ingestr/blob/main/resources/demo.gif?raw=true){loading=lazy} +::: + +:::: + + +## Synopsis + +Invoke ingestr for exporting data from CrateDB. +```shell +ingestr ingest \ + --source-uri 'crate://crate@localhost:4200/' \ + --source-table 'sys.summits' \ + --dest-uri 'duckdb:///cratedb.duckdb' \ + --dest-table 'dest.summits' +``` + +Invoke ingestr for loading data into CrateDB. +```shell +ingestr ingest \ + --source-uri 'csv://input.csv' \ + --source-table 'sample' \ + --dest-uri 'cratedb://crate:@localhost:5432/?sslmode=disable' \ + --dest-table 'doc.sample' +``` + +:::{note} +Please note there a subtle differences in the CrateDB source vs. target URL. 
+While `--source-uri=crate://...` addresses CrateDB's SQLAlchemy dialect, +`--dest-uri=cratedb://...` is effectively a PostgreSQL connection URL +with a protocol schema designating CrateDB. The source adapter uses +CrateDB's HTTP protocol, while the destination adapter uses CrateDB's +PostgreSQL interface. +::: + + +## Coverage + +ingestr supports migration from 20-plus databases, data platforms, analytics +engines, including all [databases supported by SQLAlchemy]. + +:::{rubric} Databases +::: +Actian Data Platform, Vector, Actian X, Ingres, Amazon Athena, Amazon Redshift, +Amazon S3, Apache Drill, Apache Druid, Apache Hive and Presto, Apache Solr, +Clickhouse, CockroachDB, CrateDB, Databend, Databricks, Denodo, DuckDB, EXASOL DB, +Elasticsearch, Firebird, Firebolt, Google BigQuery, Google Sheets, Greenplum, +HyperSQL (hsqldb), IBM DB2 and Informix, IBM Netezza Performance Server, Impala, InfluxDB, +Kinetica, Microsoft Access, Microsoft SQL Server, MonetDB, MongoDB, MySQL and MariaDB, +OpenGauss, OpenSearch, Oracle, PostgreSQL, Rockset, SAP ASE, SAP HANA, +SAP Sybase SQL Anywhere, Snowflake, SQLite, Teradata Vantage, TiDB, YDB, YugabyteDB. + +:::{rubric} Brokers +::: +Amazon Kinesis, Apache Kafka (Amazon MSK, Confluent Kafka, Redpanda, RobustMQ) + +:::{rubric} File formats +::: +CSV, JSONL/NDJSON, Parquet + +:::{rubric} Object stores +::: +Amazon S3, Google Cloud Storage + +:::{rubric} Services +::: +Airtable, Asana, GitHub, Google Ads, Google Analytics, Google Sheets, HubSpot, +Notion, Personio, Salesforce, Slack, Stripe, Zendesk, etc. + + +## Learn + +::::{grid} + +:::{grid-item-card} Documentation: ingestr CrateDB source +:link: https://bruin-data.github.io/ingestr/supported-sources/cratedb.html#source +:link-type: url +Documentation about the CrateDB source adapter for ingestr. 
+:::
+
+:::{grid-item-card} Documentation: ingestr CrateDB destination
+:link: https://bruin-data.github.io/ingestr/supported-sources/cratedb.html#destination
+:link-type: url
+Documentation about the CrateDB destination adapter for ingestr.
+:::
+
+:::{grid-item-card} Examples: Use ingestr with CrateDB
+:link: https://github.com/crate/cratedb-examples/tree/main/application/ingestr
+:link-type: url
+Executable code examples that demonstrate how to use ingestr to
+load data from Kafka to CrateDB.
+:::
+
+::::
+
+
+[databases supported by SQLAlchemy]: https://docs.sqlalchemy.org/en/20/dialects/
+[ingestr]: https://bruin-data.github.io/ingestr/
+[sources supported by ingestr]: https://bruin-data.github.io/ingestr/supported-sources/

From 8c0a2aeaf4c19fc4c5e756b65623e65a540b4bd2 Mon Sep 17 00:00:00 2001
From: Andreas Motl
Date: Tue, 26 Aug 2025 23:24:57 +0200
Subject: [PATCH 3/3] Integrate/dlt+ingestr: Implement suggestions by CodeRabbit

---
 docs/ingest/etl/index.md        |  2 ++
 docs/integrate/dlt/index.md     | 10 +++++--
 docs/integrate/ingestr/index.md | 46 +++++++++++++++++++--------------
 3 files changed, 36 insertions(+), 22 deletions(-)

diff --git a/docs/ingest/etl/index.md b/docs/ingest/etl/index.md
index cf664c32..049b071d 100644
--- a/docs/ingest/etl/index.md
+++ b/docs/ingest/etl/index.md
@@ -239,6 +239,7 @@ Load data from datasets and open table formats.
 - {ref}`aws-lambda`
 - {ref}`azure-functions`
 - {ref}`dbt`
+- {ref}`dlt`
 - {ref}`dms`
 - {ref}`dynamodb`
 - {ref}`estuary`
- {ref}`hop` - {ref}`iceberg` - {ref}`influxdb` +- {ref}`ingestr` - {ref}`kafka` - {ref}`kestra` - {ref}`kinesis` diff --git a/docs/integrate/dlt/index.md b/docs/integrate/dlt/index.md index 06c49aad..68055756 100644 --- a/docs/integrate/dlt/index.md +++ b/docs/integrate/dlt/index.md @@ -10,7 +10,7 @@ ```{div} .clearfix ``` -[dlt] (data load tool)--think ELT as Python code--is the most popular +[dlt] (data load tool)—think ELT as Python code—is a popular, production-ready Python library for moving data. It loads data from various and often messy data sources into well-structured, live datasets. dlt is used by {ref}`ingestr`. @@ -21,7 +21,7 @@ dlt is used by {ref}`ingestr`. - **Just code**: no need to use any backends or containers. - **Platform agnostic**: Does not replace your data platform, deployments, or security - models. Simply import dlt in your favorite AI code editor, or add it to your Jupyter + models. Simply import dlt in your favorite code editor, or add it to your Jupyter Notebook. - **Versatile**: You can load data from any source that produces Python data structures, @@ -33,6 +33,12 @@ dlt is used by {ref}`ingestr`. ## Synopsis +Prerequisites: +Install dlt and the CrateDB destination adapter: +```shell +pip install dlt dlt-cratedb +``` + Load data from cloud storage or files into CrateDB. ```python import dlt diff --git a/docs/integrate/ingestr/index.md b/docs/integrate/ingestr/index.md index 4ff280c7..9d319c4b 100644 --- a/docs/integrate/ingestr/index.md +++ b/docs/integrate/ingestr/index.md @@ -8,9 +8,9 @@ ```{div} .clearfix ``` -[ingestr] is a command-line application that allows copying data from any -source into any destination database. It supports CrateDB on the source -and the destination side. ingestr uses {ref}`dlt`. +[ingestr] is a command-line application for copying data from any source +to any destination database. It supports CrateDB on both the source and +destination sides. ingestr builds on {ref}`dlt`. 
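+
+For example, because CrateDB is supported on both sides, a single
+command can copy a table directly between two CrateDB clusters. The
+URLs follow the source and destination schemes explained below;
+hostnames and table names are illustrative:
+
+```shell
+ingestr ingest \
+  --source-uri 'crate://crate@cluster-a.example.org:4200/' \
+  --source-table 'doc.readings' \
+  --dest-uri 'cratedb://crate:@cluster-b.example.org:5432/?sslmode=disable' \
+  --dest-table 'doc.readings'
+```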
::::{grid} @@ -53,7 +53,7 @@ ingestr ingest \ ``` :::{note} -Please note there a subtle differences in the CrateDB source vs. target URL. +Please note there are subtle differences between the CrateDB source and target URLs. While `--source-uri=crate://...` addresses CrateDB's SQLAlchemy dialect, `--dest-uri=cratedb://...` is effectively a PostgreSQL connection URL with a protocol schema designating CrateDB. The source adapter uses @@ -64,33 +64,40 @@ PostgreSQL interface. ## Coverage -ingestr supports migration from 20-plus databases, data platforms, analytics +ingestr supports migration from 20-plus databases, data platforms, and analytics engines, including all [databases supported by SQLAlchemy]. -:::{rubric} Databases +:::{rubric} Traditional Databases ::: -Actian Data Platform, Vector, Actian X, Ingres, Amazon Athena, Amazon Redshift, -Amazon S3, Apache Drill, Apache Druid, Apache Hive and Presto, Apache Solr, -Clickhouse, CockroachDB, CrateDB, Databend, Databricks, Denodo, DuckDB, EXASOL DB, -Elasticsearch, Firebird, Firebolt, Google BigQuery, Google Sheets, Greenplum, -HyperSQL (hsqldb), IBM DB2 and Informix, IBM Netezza Performance Server, Impala, InfluxDB, -Kinetica, Microsoft Access, Microsoft SQL Server, MonetDB, MongoDB, MySQL and MariaDB, -OpenGauss, OpenSearch, Oracle, PostgreSQL, Rockset, SAP ASE, SAP HANA, -SAP Sybase SQL Anywhere, Snowflake, SQLite, Teradata Vantage, TiDB, YDB, YugabyteDB. 
- -:::{rubric} Brokers +CockroachDB, CrateDB, Firebird, HyperSQL (hsqldb), IBM DB2 and Informix, +Microsoft Access, Microsoft SQL Server, MonetDB, MySQL and MariaDB, +OpenGauss, Oracle, PostgreSQL, SAP ASE, SAP HANA, SAP Sybase SQL Anywhere, +SQLite, TiDB, YDB, YugabyteDB + +:::{rubric} Cloud Data Warehouses & Analytics +::: +Amazon Athena, Amazon Redshift, Databend, Databricks, Denodo, DuckDB, +EXASOL DB, Firebolt, Google BigQuery, Greenplum, IBM Netezza Performance Server, +Impala, Kinetica, Rockset, Snowflake, Teradata Vantage + +:::{rubric} Specialized Data Stores +::: +Apache Drill, Apache Druid, Apache Hive and Presto, Clickhouse, Elasticsearch, +InfluxDB, MongoDB, OpenSearch + +:::{rubric} Message Brokers ::: Amazon Kinesis, Apache Kafka (Amazon MSK, Confluent Kafka, Redpanda, RobustMQ) -:::{rubric} File formats +:::{rubric} File Formats ::: CSV, JSONL/NDJSON, Parquet -:::{rubric} Object stores +:::{rubric} Object Stores ::: Amazon S3, Google Cloud Storage -:::{rubric} Services +:::{rubric} SaaS Platforms & Services ::: Airtable, Asana, GitHub, Google Ads, Google Analytics, Google Sheets, HubSpot, Notion, Personio, Salesforce, Slack, Stripe, Zendesk, etc. @@ -125,4 +132,3 @@ load data from Kafka to CrateDB. [databases supported by SQLAlchemy]: https://docs.sqlalchemy.org/en/20/dialects/ [ingestr]: https://bruin-data.github.io/ingestr/ -[sources supported by ingestr]: https://bruin-data.github.io/ingestr/supported-sources/