
Commit ea8fac1

marijaselakovic authored and amotl committed
Prefect: Index page and starter tutorial
1 parent 427d701 commit ea8fac1

File tree

4 files changed: +133 −0 lines changed


docs/ingest/etl/index.md

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ Load data from streaming platforms.
- {ref}`kestra`
- {ref}`meltano`
- {ref}`nifi`
- {ref}`prefect`
+++
Use data pipeline programming frameworks and platforms.
::::

docs/integrate/index.md

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ oracle/index
plotly/index
postgresql/index
Power BI <powerbi/index>
prefect/index
prometheus/index
pyviz/index
queryzen/index

docs/integrate/prefect/index.md

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
(prefect)=
# Prefect

```{div} .float-right
[![Prefect logo](https://i.logos-download.com/112205/28342-og-60c4845cec1905f22b2878f23a9c29ad.png/Prefect_Technologies_Logo_og.png){w=180px}][Prefect]
```
```{div} .clearfix
```

:::{div} sd-text-muted
Modern Workflow Orchestration.
:::

:::{rubric} About
:::

[Prefect] is a workflow orchestration framework for building resilient data
pipelines in Python.

Give your team the power to build reliable workflows without sacrificing
development speed. Prefect Core combines the freedom of pure Python
development with production-grade resilience, putting you in control of
your data operations. Transform your code into scalable workflows that
deliver consistent results.

:::{rubric} Learn
:::

::::{grid}

:::{grid-item-card} Tutorial: Combining Prefect and CrateDB
:link: prefect-tutorial
:link-type: ref
Building Seamless Data Pipelines Made Easy: Combining Prefect and CrateDB.
:::

::::


:::{toctree}
:maxdepth: 1
:hidden:
Tutorial <tutorial>
:::


[Prefect]: https://www.prefect.io/opensource

docs/integrate/prefect/tutorial.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
(prefect-tutorial)=
# Building Seamless Data Pipelines Made Easy: Combining Prefect and CrateDB

## Introduction

[Prefect](https://www.prefect.io/opensource/) is an open-source workflow automation and orchestration tool for data engineering, machine learning, and other data-related tasks. It allows you to define, schedule, and execute complex data workflows in a straightforward manner.

Prefect workflows are defined in *Python code*. Each step in a workflow is represented as a "task", and tasks are connected to form a directed acyclic graph (DAG). The flow defines the sequence of task execution and can include conditional logic and branching. Furthermore, Prefect provides built-in scheduling features that set up cron-like schedules for a flow. You can also parameterize your flow, allowing the same flow to run with different input values.
10+
This tutorial will explore how CrateDB and Prefect come together to streamline data ingestion, transformation, and loading (ETL) processes with a few lines of Python code.
11+
12+
## Prerequisites
13+
14+
Before we begin, ensure you have the following prerequisites installed on your system:
15+
16+
* **Python 3.x**: Prefect is a Python-based workflow management system, so you'll need Python installed on your machine.
17+
* **CrateDB**: To work with CrateDB, create a new cluster in [CrateDB Cloud](https://console.cratedb.cloud/). You can choose the CRFEE tier cluster that does not require any payment information.
18+
* **Prefect**: Install Prefect using pip by running the following command in your terminal or command prompt: `pip install -U prefect`
19+
20+
## Getting started with Perfect
21+
22+
1. To get started with Prefect, you need to connect to Prefect’s API: the easiest way is to sign up for a free forever Cloud account at [https://app.prefect.cloud/](https://app.prefect.cloud/?deviceId=cfc80edd-a234-4911-a25e-ff0d6bb2c32a&deviceId=cfc80edd-a234-4911-a25e-ff0d6bb2c32a).
23+
2. Once you create a new account, create a new workspace with a name of your choice.
24+
3. Run `prefect cloud login` to [log into Prefect Cloud](https://docs.prefect.io/cloud/users/api-keys) from the local environment.
25+
26+
Now you are ready to build your first data workflows!
## Run your first ETL workflow with CrateDB

We'll dive into the basics of Prefect by creating a simple workflow with tasks that fetch data from a source, perform basic transformations, and load it into CrateDB. For this example, we will use [the yellow taxi trip data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz), which includes pickup time, geo-coordinates, number of passengers, and several other variables. The goal is to create a workflow that performs a basic transformation on this data and inserts it into a CrateDB table named `trip_data`:

```python
import pandas as pd
import sqlalchemy as sa
from prefect import flow, task

CSV_URL = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz"
# CrateDB SQLAlchemy URI; the crate:// dialect talks HTTP on port 4200.
# Adjust credentials and host to match your CrateDB Cloud cluster.
URI = "crate://admin:password@host:4200"

@task()
def extract_data(url: str):
    # Read the gzip-compressed CSV from the URL into a DataFrame.
    df = pd.read_csv(url, compression="gzip")
    return df

@task()
def transform_data(df):
    # Keep only trips that carried at least one passenger.
    df = df[df["passenger_count"] != 0]
    return df

@task()
def load_data(table_name, df):
    # pandas to_sql() expects an SQLAlchemy engine, not a bare URI string.
    engine = sa.create_engine(URI)
    df.to_sql(table_name, engine, if_exists="replace", index=False)

@flow(name="ETL workflow", log_prints=True)
def main_flow():
    raw_data = extract_data(CSV_URL)
    data = transform_data(raw_data)
    load_data("trip_data", data)

if __name__ == "__main__":
    main_flow()
```

1. We start defining the flow by importing the necessary modules: `prefect` for working with workflows, `pandas` for data manipulation, and `sqlalchemy` (with the CrateDB dialect from the `crate[sqlalchemy]` package) for interacting with CrateDB.
2. Next, we specify the connection URI for CrateDB and the URL of a file containing the dataset. You should modify these values according to your CrateDB Cloud setup.
3. We define three tasks using the `@task` decorator. Each task represents a unit of work in the workflow:
    1. The `extract_data(url)` task loads the data from the CSV file into a `pandas` data frame.
    2. The `transform_data(df)` task takes the data frame and returns only the entries where the `passenger_count` value is different from 0.
    3. The `load_data(table_name, df)` task connects to CrateDB and loads the data into the `trip_data` table.
4. We define the flow, name it "ETL workflow", and specify the sequence of tasks: `extract_data()`, `transform_data()`, and `load_data()`.
5. Finally, we execute the flow by calling `main_flow()`. This runs the workflow, and each task is executed in the order defined.
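The filtering step from the workflow above can be tried in isolation with plain pandas, using a small hand-made frame as a stand-in for the taxi dataset:

```python
import pandas as pd

# A tiny stand-in for the yellow taxi trip data.
df = pd.DataFrame({
    "passenger_count": [0, 1, 2, 0, 3],
    "trip_distance": [0.0, 1.2, 3.4, 0.0, 5.6],
})

# Same predicate as the transform task: drop zero-passenger trips.
filtered = df[df["passenger_count"] != 0]
print(filtered["passenger_count"].tolist())  # [1, 2, 3]
```

Boolean indexing with `df["passenger_count"] != 0` returns a new frame containing only the matching rows, leaving the original untouched.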
When you run this Python script, the workflow reads the trip data from a CSV file, transforms it, and loads it into the CrateDB table. You can see the state of the flow run in the *Flow Runs* tab of the Prefect UI:

![Flow run overview in the Prefect UI](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/ecd02359cf23b5048e084faa785c7ad795bb5e57.png)

You can enrich the ETL pipeline with many advanced features available in Prefect, such as parameterization, error handling, retries, and more. Finally, after the successful execution of the workflow, you can query the data in CrateDB:

![Querying the trip_data table in CrateDB](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/5582fcd2a677f78f8f7c6a1aa4b8e14f25dda2d1.png)
## Wrap up

Throughout this tutorial, you built a simple Prefect workflow, defined tasks, and orchestrated data transformations and loading into CrateDB. Both tools offer extensive feature sets that you can use to optimize and scale your data workflows further.

As you continue exploring, don’t forget to check out the {ref}`reference documentation <crate-reference:index>`. If you have further questions or would like to learn more about updates, features, and integrations, join the [CrateDB community](https://community.cratedb.com/). Happy data wrangling!