(df)=
(dataframes)=
# CrateDB and DataFrame libraries

Data frame libraries and frameworks that can be used together with CrateDB.


:::::{grid} 1 2 2 2
:margin: 4 4 0 0
:padding: 0
:gutter: 2

::::{grid-item-card} {material-outlined}`lightbulb;2em` Tutorials
:link: guide:dataframes
:link-type: ref
Learn how to use CrateDB together with popular open-source data frame
libraries, with hands-on tutorials and code examples.
+++
{tag-info}`Dask` {tag-info}`pandas` {tag-info}`Polars`
::::

::::{grid-item-card} {material-outlined}`read_more;2em` SQLAlchemy
CrateDB's SQLAlchemy dialect implementation provides the fundamental
infrastructure for the integrations with Dask, pandas, and Polars.
+++
[ORM Guides](inv:guide#orm) •
{ref}`ORM Catalog <orm>`
::::

:::::


(dask)=
## Dask

[Dask] is a parallel computing library for analytics with task scheduling.
It is built on top of the Python programming language, making it easy to scale
the Python libraries that you know and love, like NumPy, pandas, and scikit-learn.

```{div}
:style: "float: right"
[{w=180px}](https://www.dask.org/)
```

- [Dask DataFrames] help you process large tabular data by parallelizing pandas,
  either on your laptop for larger-than-memory computing, or on a distributed
  cluster of computers.

- [Dask Futures], implementing a real-time task framework, allow you to scale
  generic Python workflows across a Dask cluster with minimal code changes,
  by extending Python's `concurrent.futures` interface.

```{div}
:style: "clear: both"
```


(pandas)=
## pandas

```{div}
:style: "float: right"
[{w=180px}](https://pandas.pydata.org/)
```

[pandas] is a fast, powerful, flexible, and easy-to-use open-source data analysis
and manipulation tool, built on top of the Python programming language. In
particular, it offers data structures and operations for manipulating numerical
tables and time series.

:::{rubric} Data Model
:::
- pandas is built around two data structures, Series and DataFrame. Data for these
  collections can be imported from various file formats such as comma-separated values,
  JSON, Parquet, SQL database tables or queries, and Microsoft Excel.
- A Series is a one-dimensional data structure built on top of NumPy's array.
- pandas includes support for time series, such as the ability to interpolate values
  and filter using a range of timestamps.
- By default, a pandas index is a series of integers ascending from 0, similar to the
  indices of Python lists. However, indices can use any NumPy data type, including
  floating point, timestamps, or strings.
- pandas supports hierarchical indices with multiple values per data point. An index
  with this structure, called a "MultiIndex", allows a single DataFrame to represent
  multiple dimensions, similar to a pivot table in Microsoft Excel. Each level of a
  MultiIndex can be given a unique name.

```{div}
:style: "clear: both"
```
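Two of the features listed above, time series interpolation and hierarchical
indexing, can be sketched in a few lines. The data and index names are
illustrative only.

```python
import pandas as pd

# Time series support: a Series indexed by timestamps, with a gap
# filled in by linear interpolation.
idx = pd.date_range("2024-01-01", periods=4, freq="h")
s = pd.Series([1.0, None, 3.0, 4.0], index=idx)
filled = s.interpolate()
print(filled.iloc[1])  # 2.0

# Hierarchical indexing: a MultiIndex lets a single DataFrame represent
# multiple dimensions, similar to a pivot table.
df = pd.DataFrame(
    {"value": [10, 20, 30, 40]},
    index=pd.MultiIndex.from_product(
        [["plant-1", "plant-2"], ["temp", "humidity"]],
        names=["site", "metric"],
    ),
)
print(df.loc[("plant-1", "temp"), "value"])  # 10
```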


(polars)=
## Polars

```{div}
:style: "float: right; margin-left: 0.5em"
[{w=180px}](https://pola.rs/)
```

[Polars] is a blazingly fast DataFrame library with language bindings for
Rust, Python, Node.js, R, and SQL. It is open source, written in Rust, and
powered by a multithreaded, vectorized query engine.

- **Fast:** Written from scratch in Rust with performance in mind,
  designed close to the machine, and without external dependencies.

- **I/O:** First-class support for all common data storage layers: local,
  cloud storage, and databases.

- **Intuitive API:** Write your queries the way they were intended. Polars
  internally determines the most efficient way to execute them using its query
  optimizer. Polars' expressions are intuitive and empower you to write
  readable and performant code at the same time.

- **Out of Core:** The streaming API allows you to process your results without
  requiring all your data to be in memory at the same time.

- **Parallel:** Polars' multithreaded query engine utilizes the power of your
  machine by dividing the workload among the available CPU cores without any
  additional configuration.

- **Vectorized Query Engine:** Uses [Apache Arrow], a columnar data format, to
  process your queries in a vectorized manner, and SIMD to optimize CPU usage.
  This enables cache-coherent algorithms and high performance on modern processors.

- **Open Source:** Polars is and always will be open source. It is driven by an
  active community of developers, everyone is encouraged to contribute, and it is
  free to use under the MIT license.

:::{rubric} Data formats
:::

Polars supports reading from and writing to many common data formats,
which allows you to easily integrate it into your existing data stack.

- Text: CSV & JSON
- Binary: Parquet, Delta Lake, Avro & Excel
- IPC: Feather, Arrow
- Databases: MySQL, Postgres, SQL Server, SQLite, Redshift & Oracle
- Cloud storage: S3, Azure Blob & Azure File

```{div}
:style: "clear: both"
```


## Examples

How to use CrateDB together with popular open-source data frame libraries.

### Dask
- [Guide to efficient data ingestion to CrateDB with pandas and Dask]
- [Efficient batch/bulk INSERT operations with pandas, Dask, and SQLAlchemy]
- [Import weather data using Dask]
- [Dask code examples]

### pandas
- [Guide to efficient data ingestion to CrateDB with pandas]
- [Importing Parquet files into CrateDB using Apache Arrow and SQLAlchemy]
- [pandas code examples]

### Polars
- [Polars code examples]



[Apache Arrow]: https://arrow.apache.org/
[Dask]: https://www.dask.org/
[Dask DataFrames]: https://docs.dask.org/en/latest/dataframe.html
[Dask Futures]: https://docs.dask.org/en/latest/futures.html
[pandas]: https://pandas.pydata.org/
[Polars]: https://pola.rs/

[Dask code examples]: https://github.com/crate/cratedb-examples/tree/main/by-dataframe/dask
[Efficient batch/bulk INSERT operations with pandas, Dask, and SQLAlchemy]: https://cratedb.com/docs/python/en/latest/by-example/sqlalchemy/dataframe.html
[Guide to efficient data ingestion to CrateDB with pandas]: https://community.cratedb.com/t/guide-to-efficient-data-ingestion-to-cratedb-with-pandas/1541
[Guide to efficient data ingestion to CrateDB with pandas and Dask]: https://community.cratedb.com/t/guide-to-efficient-data-ingestion-to-cratedb-with-pandas-and-dask/1482
[Import weather data using Dask]: https://github.com/crate/cratedb-examples/blob/main/topic/timeseries/dask-weather-data-import.ipynb
[Importing Parquet files into CrateDB using Apache Arrow and SQLAlchemy]: https://community.cratedb.com/t/importing-parquet-files-into-cratedb-using-apache-arrow-and-sqlalchemy/1161
[pandas code examples]: https://github.com/crate/cratedb-examples/tree/main/by-dataframe/pandas
[Polars code examples]: https://github.com/crate/cratedb-examples/tree/main/by-dataframe/polars