docs/integrate/airflow/data-retention-hot-cold.md (1 addition & 1 deletion)
@@ -139,7 +139,7 @@ Assume a basic Astronomer/Airflow setup is in place, as described in the {ref}`f
The CrateDB cluster will then automatically initiate the relocation of the affected partition to a node that fulfills the requirement (`cratedb03` in our case).
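For context, a minimal sketch of the kind of statement the reallocation task issues to change a partition's allocation requirement (table name, partition value, and the `storage` node attribute are illustrative; see the linked DAG below for the actual implementation):

```python
# Hedged sketch: re-route one expired partition to nodes tagged as cold storage.
# Table, partition column/value, and the "storage" node attribute are illustrative.
REALLOCATE_SQL = """
ALTER TABLE doc.raw_metrics PARTITION (part = '2022-03-01')
SET ("routing.allocation.require.storage" = 'cold');
"""
```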
- The full implementation is available as [data_retention_reallocate_dag.py](https://github.com/crate/crate-airflow-tutorial/blob/main/dags/data_retention_reallocate_dag.py) on GitHub.
+ The full implementation is available as [data_retention_reallocate_dag.py](https://github.com/crate/cratedb-airflow-tutorial/blob/main/dags/data_retention_reallocate_dag.py) on GitHub.
To validate our implementation, we trigger the DAG once manually via the Airflow UI at `http://localhost:8081/`. Once executed, log messages of the `reallocate_partitions` task confirm the reallocation was triggered for the partition with the sample data set up earlier:
- To automate the process of deleting expired data we use [Apache Airflow](https://airflow.apache.org/). Our workflow implementation does the following: _once a day, fetch policies from the database, and delete all data for which the retention period expired._
+ Use [Apache Airflow](https://airflow.apache.org/) to automate deletions. Once a day, fetch policies from the database and delete data whose retention period expired.
### Retrieving Retention Policies
The first step consists of a task that queries partitions affected by retention policies. We do this by joining the `retention_policies` and `information_schema.table_partitions` tables and selecting partitions whose retention period has expired. In CrateDB, `information_schema.table_partitions` [{ref}`documentation <crate-reference:is_table_partitions>`] contains information about all partitioned tables, including the name of the table, the schema, the partition column, and the values of the partition.
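As a hedged sketch of such a query (the `retention_policies` schema and column names are assumptions for illustration; the tutorial's DAG ships the authoritative SQL), parameterized with the logical date of the DAG run:

```python
# Illustrative only: join policies against the partition catalog and keep
# partitions older than the configured retention period.
SQL_EXPIRED_PARTITIONS = """
SELECT QUOTE_IDENT(p.table_schema) || '.' || QUOTE_IDENT(p.table_name) AS table_fqn,
       QUOTE_IDENT(r.partition_column) AS partition_column,
       p.values[r.partition_column] AS partition_value
FROM information_schema.table_partitions p
JOIN doc.retention_policies r
  ON p.table_schema = r.table_schema AND p.table_name = r.table_name
WHERE p.values[r.partition_column] < %(day)s::TIMESTAMP - r.retention_period * INTERVAL '1 day';
"""
```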
@@ -79,7 +80,7 @@ The first step is to create the function `get_policies` that takes as a paramete
### Cross-Communication Between Tasks
Before we continue with the implementation of the next task in Apache Airflow, we would like to give a brief overview of how data is communicated between different tasks in a DAG. For this purpose, Airflow provides the [XCom](https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html) system. Simply speaking, `XCom` can be seen as a small storage object that lets a task `push` data which a different task in the DAG can later consume.
- The key thing here is that it allows the exchange of a **small** amount of data between tasks. From Airflow 2.0, the return value of a Python method used as a task will be automatically stored in `XCom`. For our example, this means that the `get_policies` return value is available from the next task after the `get_policies` operator executes. To access the data from another task, a reference to the previous task can be passed to the next task when defining dependencies between tasks.
+ XCom exchanges a small amount of data between tasks. Since Airflow 2.0, a Python task’s return value is stored in XCom. In our case, `get_policies` returns the partitions; the next task reads them via a reference to `get_policies` when defining dependencies.
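As a generic illustration of that mechanism (not the tutorial's code; assumes Airflow 2.4+ for the `schedule` argument), the return value of one `@task` is pushed to XCom and consumed by the next task simply by passing it as an argument:

```python
import pendulum
from airflow.decorators import dag, task


@dag(start_date=pendulum.datetime(2023, 1, 1), schedule="@daily", catchup=False)
def xcom_demo():
    @task
    def produce():
        # The return value is automatically pushed to XCom.
        return ["partition-2021-11", "partition-2021-12"]

    @task
    def consume(items):
        # Passing the upstream result pulls it from XCom and sets the dependency.
        print(f"received {len(items)} expired partitions")

    consume(produce())


xcom_demo()
```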
### Applying Retention Policies
Now that we retrieved the policies and Airflow automatically saved them via `XCom`, we need to create another task that will go through each element in the list and delete expired data.
@@ -147,7 +148,7 @@ data_retention_delete()
On the `SQLExecuteQueryOperator`, a certain set of attributes are passed via `partial` instead of `expand`. These are static values that are the same for each `DELETE` statement, like the connection and task ID.
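A condensed, hedged sketch of this pattern (connection ID, task IDs, and the policy payload are illustrative; the linked DAG below is the real implementation): static arguments go into `partial()`, and the per-partition `DELETE` statements are mapped via `expand()`.

```python
import pendulum
from airflow.decorators import dag, task
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator


@dag(start_date=pendulum.datetime(2023, 1, 1), schedule="@daily", catchup=False)
def delete_expired_partitions():
    @task
    def get_policies():
        # Illustrative static payload; the real task queries CrateDB for expired partitions.
        return [("doc.raw_metrics", "ts_day", 1638316800000)]

    @task
    def build_deletes(policies):
        # One DELETE statement per affected partition.
        return [f"DELETE FROM {t} WHERE {col} = {val};" for t, col, val in policies]

    SQLExecuteQueryOperator.partial(
        task_id="delete_partition",
        conn_id="cratedb_connection",  # static: identical for every mapped task
    ).expand(sql=build_deletes(get_policies()))  # dynamic: one mapped task per statement


delete_expired_partitions()
```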
- The full DAG implementation of the data retention policy can be found in our [GitHub repository](https://github.com/crate/crate-airflow-tutorial/blob/main/dags/data_retention_delete_dag.py). To run the workflow, we rely on Astronomer infrastructure with the same setup as shown in the {ref}`getting started <airflow-getting-started>` section.
+ The full DAG implementation of the data retention policy can be found in our [GitHub repository](https://github.com/crate/cratedb-airflow-tutorial/blob/main/dags/data_retention_delete_dag.py). To run the workflow, we rely on Astronomer infrastructure with the same setup as shown in the {ref}`getting started <airflow-getting-started>` section.
## Summary
This tutorial shows how to delete data with expired retention policies. The first part covers how to design retention policies in CrateDB; the second shows how to use Apache Airflow to automate the deletion. The DAG implementation is fairly simple: the first task extracts the relevant policies, while the second task deletes the affected partitions. In the following tutorial, we will focus on another real-world example that can be automated with Apache Airflow and CrateDB.
docs/integrate/airflow/export-s3.md (2 additions & 2 deletions)
@@ -49,7 +49,7 @@ TABLES = [
```
The DAG itself is specified as a Python file in `astro-project/dags`. It loads the above-defined `TABLES` list and iterates over it. For each entry, a corresponding `SQLExecuteQueryOperator` is instantiated, which will perform the actual export during execution. If the `TABLES` list contains more than one element, Airflow can process the corresponding exports in parallel, as there are no dependencies between them.
- The resulting DAG code is as follows (see the [GitHub repository](https://github.com/crate/crate-airflow-tutorial) for the complete project):
+ The resulting DAG code is as follows (see the [GitHub repository](https://github.com/crate/cratedb-airflow-tutorial) for the complete project):
```python
import os
import pendulum
@@ -115,4 +115,4 @@ To find more details about running DAGs, go to `Browse/DAG runs` which opens a n
After a successful DAG execution, the data will be stored on the remote filesystem.
## Summary
- This article covered a simple use case: periodic data export to a remote filesystem. In the following articles, we will cover more complex use cases composed of several tasks based on real-world scenarios. If you want to try our examples with Apache Airflow and Astronomer, you are free to check out the code on the public [GitHub repository](https://github.com/crate/crate-airflow-tutorial).
+ This article covered a simple use case: periodic data export to a remote filesystem. In the following articles, we will cover more complex use cases composed of several tasks based on real-world scenarios. If you want to try our examples with Apache Airflow and Astronomer, you are free to check out the code on the public [GitHub repository](https://github.com/crate/cratedb-airflow-tutorial).
The file path above corresponds to the data from March 2022. To retrieve a specific file, the task takes the run date and formats it to compose the name of that file. It is important to mention that the data is released with a two-month delay, which has to be taken into consideration.
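A small, hypothetical sketch of that formatting step using pendulum (the helper name is made up; the file-name pattern follows the NYC TLC convention):

```python
import pendulum


def build_file_name(logical_date: pendulum.DateTime) -> str:
    # Data is published with a two-month delay, so subtract two months
    # from the run's logical date to get the month of the file to fetch.
    data_month = logical_date.subtract(months=2)
    return f"yellow_tripdata_{data_month.format('YYYY-MM')}.parquet"


print(build_file_name(pendulum.datetime(2022, 5, 1)))  # -> yellow_tripdata_2022-03.parquet
```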
- * **process_parquet:** afterward, use the name to download the file to local storage and convert it from Parquet to CSV using `parquet-tools` (Apache Parquet CLI; see [Apache Arrow]).
+ * **process_parquet:** Use the formatted name to download the file to local storage and convert it from Parquet to CSV with `parquet-tools` (Apache Parquet CLI; see [Apache Arrow]).
- * **copy_csv_to_s3:** Once the newly transformed file is available, it gets uploaded to an S3 Bucket to then, be used in the {ref}`crate-reference:sql-copy-from` SQL statement.
- * **copy_csv_staging:** copy the CSV file stored in S3 to the staging table described previously.
- * **copy_staging_to_trips:** finally, copy the data from the staging table to the trips table, casting the columns that are not in the right type yet.
- * **delete_staging:** after it is all processed, clean up the staging table by deleting all rows, and preparing for the next file.
- * **delete_local_parquet_csv:** delete the files (Parquet and CSV) from the storage.
+ * **copy_csv_to_s3:** Upload the transformed file to an S3 bucket and reference it in the {ref}`crate-reference:sql-copy-from` statement.
+ * **copy_csv_staging:** Copy the CSV file stored in S3 to the staging table described previously.
+ * **copy_staging_to_trips:** Copy data from the staging table to the trips table, casting columns to their final types (see the sketch after this list).
+ * **delete_staging:** After processing, delete all rows from the staging table to prepare for the next file.
+ * **delete_local_parquet_csv:** Delete the local Parquet and CSV files.
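A hedged sketch of the two copy steps listed above (bucket, credentials, schema, and column names are placeholders, not the tutorial's exact schema): `COPY FROM` loads the CSV into the staging table, and an `INSERT INTO ... SELECT` casts columns on the way into the final table.

```python
# Illustrative statements for copy_csv_staging and copy_staging_to_trips.
COPY_CSV_STAGING = """
COPY nyc_taxi.load_trips_staging
FROM 's3://<access-key>:<secret-key>@<bucket>/yellow_tripdata_2022-03.csv'
WITH (format = 'csv');
"""

COPY_STAGING_TO_TRIPS = """
INSERT INTO nyc_taxi.trips (pickup_datetime, dropoff_datetime, passenger_count, trip_distance)
SELECT tpep_pickup_datetime::TIMESTAMP,
       tpep_dropoff_datetime::TIMESTAMP,
       passenger_count::INTEGER,
       trip_distance::DOUBLE PRECISION
FROM nyc_taxi.load_trips_staging;
"""
```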
The DAG was configured based on the characteristics of the data in use. In this case, there are two crucial pieces of information about the data provider:
@@ -113,19 +113,19 @@ The DAG was configured based on the characteristics of the data in use. In this
The NYC TLC publishes trip data monthly with a two‑month delay. Set the DAG to
run monthly with a start date of March 2009. The first run (logical date March
2009) downloads the file for January 2009 (logical date minus two months),
- 2010)which is the first available dataset.
+ which is the first available dataset.
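A sketch of the corresponding DAG arguments (decorator style; assumes Airflow 2.4+ for the `schedule` argument, other arguments omitted):

```python
import pendulum
from airflow.decorators import dag


@dag(
    start_date=pendulum.datetime(2009, 3, 1),  # first logical date: March 2009
    schedule="@monthly",                       # the provider publishes one file per month
    catchup=True,                              # backfill every month since the start date
)
def nyc_taxi_import():
    ...  # tasks omitted


nyc_taxi_import()
```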
You may find the full code for the DAG described above available in our