Commit 3e7d99a

Airflow: Implement suggestions by CodeRabbit, part 5
1 parent 028d2d2 commit 3e7d99a

6 files changed, +25 -27 lines

docs/integrate/airflow/data-retention-hot-cold.md

Lines changed: 1 addition & 1 deletion
@@ -139,7 +139,7 @@ Assume a basic Astronomer/Airflow setup is in place, as described in the {ref}`f
 
 The CrateDB cluster will then automatically initiate the relocation of the affected partition to a node that fulfills the requirement (`cratedb03` in our case).
 
-The full implementation is available as [data_retention_reallocate_dag.py](https://github.com/crate/crate-airflow-tutorial/blob/main/dags/data_retention_reallocate_dag.py) on GitHub.
+The full implementation is available as [data_retention_reallocate_dag.py](https://github.com/crate/cratedb-airflow-tutorial/blob/main/dags/data_retention_reallocate_dag.py) on GitHub.
 
 To validate our implementation, we trigger the DAG once manually via the Airflow UI at `http://localhost:8081/`. Once executed, log messages of the `reallocate_partitions` task confirm the reallocation was triggered for the partition with the sample data set up earlier:
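For orientation, the reallocation that this DAG triggers boils down to changing the allocation requirement of a single partition. The following is only a hypothetical sketch of such a task, assuming nodes are tagged with a `storage` attribute and an Airflow connection named `cratedb_connection` exists; the table, partition column, and value are placeholders:

```python
# Hypothetical sketch: move one partition to nodes tagged with storage=cold.
# Table name, partition column/value, and the connection ID are placeholders.
import pendulum
from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="data_retention_reallocate_sketch",
    start_date=pendulum.datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
):
    SQLExecuteQueryOperator(
        task_id="reallocate_partitions",
        conn_id="cratedb_connection",
        sql="""
            ALTER TABLE doc.raw_metrics PARTITION (part = '2023-01-01')
            SET ("routing.allocation.require.storage" = 'cold')
        """,
    )
```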
docs/integrate/airflow/data-retention-policy.md

Lines changed: 4 additions & 3 deletions
@@ -42,7 +42,8 @@ INSERT INTO retention_policies (table_schema, table_name, partition_column, rete
 ```
 
 ## Implementation in Apache Airflow
-To automate the process of deleting expired data we use [Apache Airflow](https://airflow.apache.org/). Our workflow implementation does the following: _once a day, fetch policies from the database, and delete all data for which the retention period expired._
+
+Use [Apache Airflow](https://airflow.apache.org/) to automate deletions. Once a day, fetch policies from the database and delete data whose retention period expired.
 
 ### Retrieving Retention Policies
 The first step consists of a task that queries partitions affected by retention policies. We do this by joining `retention_policies` and `information_schema.table_partitions` tables and selecting values with expired retention periods. In CrateDB, `information_schema.table_partitions` [{ref}`documentation <crate-reference:is_table_partitions>`] contains information about all partitioned tables including the name of the table, schema, partition column, and the values of the partition.
@@ -79,7 +80,7 @@ The first step is to create the function `get_policies` that takes as a paramete
 ### Cross-Communication Between Tasks
 Before we continue into the implementation of the next task in Apache Airflow, we would like to give a brief overview of how the data is communicated between different tasks in a DAG. For this purpose, Airflow introduces the [XCom](https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html) system. Simply speaking `XCom` can be seen as a small object with storage that allows tasks to `push` data into that storage that can be later used by a different task in the DAG.
 
-The key thing here is that it allows the exchange of a **small** amount of data between tasks. From Airflow 2.0, the return value of a Python method used as a task will be automatically stored in `XCom`. For our example, this means that the `get_policies` return value is available from the next task after the `get_policies` operator executes. To access the data from another task, a reference to the previous task can be passed to the next task when defining dependencies between tasks.
+XCom exchanges a small amount of data between tasks. Since Airflow 2.0, a Python task’s return value is stored in XCom. In our case, `get_policies` returns the partitions; the next task reads them via a reference to `get_policies` when defining dependencies.
 
 ### Applying Retention Policies
 Now that we retrieved the policies and Airflow automatically saved them via `XCom`, we need to create another task that will go through each element in the list and delete expired data.
@@ -147,7 +148,7 @@ data_retention_delete()
 
 On the `SQLExecuteQueryOperator`, a certain set of attributes are passed via `partial` instead of `expand`. These are static values that are the same for each `DELETE` statement, like the connection and task ID.
 
-The full DAG implementation of the data retention policy can be found in our [GitHub repository](https://github.com/crate/crate-airflow-tutorial/blob/main/dags/data_retention_delete_dag.py). To run the workflow, we rely on Astronomer infrastructure with the same setup as shown in the {ref}`getting started <airflow-getting-started>` section.
+The full DAG implementation of the data retention policy can be found in our [GitHub repository](https://github.com/crate/cratedb-airflow-tutorial/blob/main/dags/data_retention_delete_dag.py). To run the workflow, we rely on Astronomer infrastructure with the same setup as shown in the {ref}`getting started <airflow-getting-started>` section.
 
 ## Summary
 This tutorial gives a guide on how to delete data with expired retention policies. The first part shows how to design policies in CrateDB and then, how to use Apache Airflow to automate data deletion. The DAG implementation is fairly simple: the first task performs the extraction of relevant policies, while the second task makes sure that affected partitions are deleted. In the following tutorial, we will focus on another real-world example that can be automated with Apache Airflow and CrateDB.
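The hunks above describe a two-task pattern: `get_policies` returns the affected partitions via XCom, and `SQLExecuteQueryOperator` fans out one `DELETE` per partition through `partial`/`expand`. As a rough, hypothetical sketch of that pattern (not the repository's DAG; the connection ID, table, and the hard-coded policy row are placeholders):

```python
# Minimal sketch of the two-task pattern described above (not the repository's
# exact DAG). Connection ID, table, and the hard-coded policy row are placeholders.
import pendulum
from airflow.decorators import dag, task
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator


@task
def get_policies():
    # The real DAG queries retention_policies joined with
    # information_schema.table_partitions; here we return a static sample row.
    return [
        {"schema": "doc", "table": "raw_metrics", "column": "ts", "value": 1672531200000},
    ]


@task
def map_policy(policy):
    # Build one DELETE statement per expired partition.
    return (
        f'DELETE FROM "{policy["schema"]}"."{policy["table"]}" '
        f'WHERE "{policy["column"]}" = {policy["value"]}'
    )


@dag(start_date=pendulum.datetime(2023, 1, 1), schedule="@daily", catchup=False)
def data_retention_delete_sketch():
    sql_statements = map_policy.expand(policy=get_policies())
    SQLExecuteQueryOperator.partial(
        task_id="delete_partition",
        conn_id="cratedb_connection",  # assumed Airflow connection ID
    ).expand(sql=sql_statements)


data_retention_delete_sketch()
```

Static arguments such as `task_id` and `conn_id` go through `partial`, while the per-partition `sql` goes through `expand`, matching the split described in the diff above.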

docs/integrate/airflow/export-s3.md

Lines changed: 2 additions & 2 deletions
@@ -49,7 +49,7 @@ TABLES = [
 ```
 The DAG itself is specified as a Python file `astro-project/dags`. It loads the above-defined `TABLES` list and iterates over it. For each entry, a corresponding `SQLExecuteQueryOperator` is instantiated, which will perform the actual export during execution. If the `TABLES` list contains more than one element, Airflow will be able to process the corresponding exports in parallel, as there are no dependencies between them.
 
-The resulting DAG code is as follows (see the [GitHub repository](https://github.com/crate/crate-airflow-tutorial) for the complete project):
+The resulting DAG code is as follows (see the [GitHub repository](https://github.com/crate/cratedb-airflow-tutorial) for the complete project):
 ```python
 import os
 import pendulum
@@ -115,4 +115,4 @@ To find more details about running DAGs, go to `Browse/DAG runs` which opens a n
 After a successful DAG execution, the data will be stored on the remote filesystem.
 
 ## Summary
-This article covered a simple use case: periodic data export to a remote filesystem. In the following articles, we will cover more complex use cases composed of several tasks based on real-world scenarios. If you want to try our examples with Apache Airflow and Astronomer, you are free to check out the code on the public [GitHub repository](https://github.com/crate/crate-airflow-tutorial).
+This article covered a simple use case: periodic data export to a remote filesystem. In the following articles, we will cover more complex use cases composed of several tasks based on real-world scenarios. If you want to try our examples with Apache Airflow and Astronomer, you are free to check out the code on the public [GitHub repository](https://github.com/crate/cratedb-airflow-tutorial).
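The first hunk above describes the export pattern: iterate over the `TABLES` list and instantiate one `SQLExecuteQueryOperator` per table. The complete DAG lives in the linked repository; below is only a hypothetical sketch of that loop, with the connection ID, table names, and S3 target invented for illustration:

```python
# Hypothetical sketch of the per-table export loop; credentials, scheduling,
# and the COPY TO target are placeholders, not the repository's DAG.
import pendulum
from airflow.decorators import dag
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

TABLES = ["doc.metrics", "doc.events"]  # placeholder table list


@dag(start_date=pendulum.datetime(2023, 1, 1), schedule="@daily", catchup=False)
def table_export_sketch():
    for table in TABLES:
        # One independent export task per table, so Airflow can run them in parallel.
        SQLExecuteQueryOperator(
            task_id=f"export_{table.replace('.', '_')}",
            conn_id="cratedb_connection",  # assumed Airflow connection ID
            sql=f"COPY {table} TO DIRECTORY 's3://my-bucket/exports/{table}'",
        )


table_export_sketch()
```

Because the export tasks have no dependencies on each other, Airflow can schedule them in parallel, as noted in the diff above.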

docs/integrate/airflow/getting-started.md

Lines changed: 1 addition & 1 deletion
@@ -87,7 +87,7 @@ The astronomer project consists of four Docker containers:
 - Triggerer (running an event loop for deferrable tasks)
 
 The PostgreSQL server listens on port 5432. The web server listens on port 8080
-and is available at <http://localhost:8080/> with `admin`/`admin`.
+and is available at `http://localhost:8080/` with `admin`/`admin`.
 
 If these ports are already in use, change them in `.astro/config.yaml`. For
 example, set the webserver to 8081 and PostgreSQL to 5435:

docs/integrate/airflow/import-parquet.md

Lines changed: 13 additions & 13 deletions
@@ -14,7 +14,7 @@ For an alternative Parquet ingestion approach, see {ref}`arrow-import-parquet`.
 
 Before you start, have Airflow and CrateDB running. The SQL shown below also
 resides in the setup folder of the
-[GitHub repository](https://github.com/crate/crate-airflow-tutorial).
+[GitHub repository](https://github.com/crate/cratedb-airflow-tutorial).
 
 Create two tables in CrateDB: a temporary staging table
 (`nyc_taxi.load_trips_staging`) and the final table (`nyc_taxi.trips`).
@@ -78,8 +78,8 @@ CREATE TABLE IF NOT EXISTS "nyc_taxi"."trips" (
 )
 PARTITIONED BY ("pickup_year");
 ```
-To better understand how Airflow works and its applications, you can check other
-tutorials related to that topic {ref}`here <airflow-tutorials>`.
+To explore more Airflow use cases, see the related tutorials
+{ref}`here <airflow-tutorials>`.
 
 With the tools set up and tables created, proceed to the DAG.
 
@@ -93,17 +93,17 @@ The Airflow DAG used in this tutorial contains 7 tasks:
 https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-03.parquet
 ```
 The file path above corresponds to the data from March 2022. So, to retrieve a specific file, the task gets the date and formats it to compose the name of the specific file. Important to mention that the data is released with 2 months of delay, so it had to be taken into consideration.
-* **process_parquet:** afterward, use the name to download the file to local storage and convert it from Parquet to CSV using `parquet-tools` (Apache Parquet CLI; see [Apache Arrow]).
+* **process_parquet:** Use the formatted name to download the file to local storage and convert it from Parquet to CSV with `parquet-tools` (Apache Parquet CLI; see [Apache Arrow]).
 
 * `curl -o "<LOCAL-PARQUET-FILE-PATH>" "<REMOTE-PARQUET-FILE>"`
 * `parquet-tools csv <LOCAL-PARQUET-FILE-PATH> > <CSV-FILE-PATH>`
 
 Both commands run within one `BashOperator`.
-* **copy_csv_to_s3:** Once the newly transformed file is available, it gets uploaded to an S3 Bucket to then, be used in the {ref}`crate-reference:sql-copy-from` SQL statement.
-* **copy_csv_staging:** copy the CSV file stored in S3 to the staging table described previously.
-* **copy_staging_to_trips:** finally, copy the data from the staging table to the trips table, casting the columns that are not in the right type yet.
-* **delete_staging:** after it is all processed, clean up the staging table by deleting all rows, and preparing for the next file.
-* **delete_local_parquet_csv:** delete the files (Parquet and CSV) from the storage.
+* **copy_csv_to_s3:** Upload the transformed file to an S3 bucket and reference it in the {ref}`crate-reference:sql-copy-from` statement.
+* **copy_csv_staging:** Copy the CSV file stored in S3 to the staging table described previously.
+* **copy_staging_to_trips:** Copy data from the staging table to the trips table, casting columns to their final types.
+* **delete_staging:** After processing, delete all rows from the staging table to prepare for the next file.
+* **delete_local_parquet_csv:** Delete the local Parquet and CSV files.
 
 The DAG was configured based on the characteristics of the data in use. In this case, there are two crucial pieces of information about the data provider:
 
@@ -113,19 +113,19 @@ The DAG was configured based on the characteristics of the data in use. In this
 The NYC TLC publishes trip data monthly with a two‑month delay. Set the DAG to
 run monthly with a start date of March 2009. The first run (logical date March
 2009) downloads the file for January 2009 (logical date minus two months),
-2010) which is the first available dataset.
+which is the first available dataset.
 
 You may find the full code for the DAG described above available in our
-[GitHub repository](https://github.com/crate/crate-airflow-tutorial/blob/main/dags/nyc_taxi_dag.py).
+[GitHub repository](https://github.com/crate/cratedb-airflow-tutorial/blob/main/dags/nyc_taxi_dag.py).
 
 ## Wrap up
 
 The workflow represented in this tutorial is a simple way to import Parquet files
 to CrateDB by transforming them into a CSV file. As previously mentioned, there
 are other approaches out there, we encourage you to try them out.
 
-If you want to continue to explore how CrateDB can be used with Airflow, you can
-check other tutorials related to that topic {ref}`here <airflow-tutorials>`.
+To continue exploring CrateDB with Airflow, browse the related tutorials
+{ref}`here <airflow-tutorials>`.
 
 
 [Apache Arrow]: https://github.com/apache/arrow
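The `process_parquet` step described above chains `curl` and `parquet-tools` inside a single `BashOperator`. Below is a hypothetical sketch of just that step; the fixed month and local paths are placeholders, and the real DAG derives the file name from the logical date and adds the remaining tasks:

```python
# Hypothetical sketch of the download-and-convert step only; the repository's
# DAG computes the file name from the logical date and chains further tasks.
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nyc_taxi_process_parquet_sketch",
    start_date=pendulum.datetime(2022, 1, 1),
    schedule=None,
    catchup=False,
):
    # Download one Parquet file and convert it to CSV with parquet-tools.
    BashOperator(
        task_id="process_parquet",
        bash_command=(
            "curl -o /tmp/yellow_tripdata_2022-03.parquet "
            "'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-03.parquet' "
            "&& parquet-tools csv /tmp/yellow_tripdata_2022-03.parquet "
            "> /tmp/yellow_tripdata_2022-03.csv"
        ),
    )
```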

docs/integrate/airflow/index.md

Lines changed: 4 additions & 7 deletions
@@ -26,7 +26,7 @@ Airflow has a modular architecture and uses a message queue to orchestrate an
 arbitrary number of workers. Pipelines are defined in Python, allowing for
 dynamic pipeline generation and on-demand, code-driven pipeline invocation.
 
-Pipeline parameterization is using the powerful Jinja templating engine.
+Airflow parameterizes pipelines with the Jinja templating engine.
 To extend the system, you can define your own operators and extend libraries
 to fit the level of abstraction that suits your environment.
 :::
@@ -38,19 +38,16 @@ to fit the level of abstraction that suits your environment.
 [![Astronomer logo](https://logowik.com/content/uploads/images/astronomer2824.jpg){w=180px}](https://www.astronomer.io/)
 ```
 
-[Astro][Astronomer] is the best managed service in the market for teams on any step of their data
-journey. Spend time where it counts.
+[Astro][Astronomer] is a managed Airflow service.
 
 - Astro runs on the cloud of your choice. Astro manages Airflow and gives you all the
   features you need to focus on what really matters – your data. All while connecting
   securely to any service in your network.
-- Create Airflow environments with a click of a button.
+- Create Airflow environments quickly.
 - Protect production DAGs with easy Airflow upgrades and custom high-availability configs.
 - Get visibility into what’s running with analytics views and easy interfaces for logs
   and alerts. Across environments.
-- Take down tech-debt and learn how to drive Airflow best practices from the experts
-  behind the project. Get world-class support, fast-tracked bug fixes, and same-day
-  access to new Airflow versions.
+- Adopt Airflow best practices with support and timely upgrades.
 
 ```{div} .clearfix
 ```
