This repository contains the DAG definitions for the UnIC datalake project (Univers Informationnel du CHU Sainte-Justine). These DAGs orchestrate ETL jobs defined in the unic-etl repository.
- 🐍 Python 3.9

```
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

DAGs can be defined in two ways:
- JSON configuration files located in `dags/config`.
- Python DAGs directly in the `dags` folder for advanced use cases.
Each JSON file in `dags/config` defines one DAG to orchestrate a single resource. The file must follow this naming and placement convention:

- The filename must start with the desired dag_id and end with `_config.json`. The convention is to use the *resource_code* as the dag_id.
- The file must be placed under the folder for its starting zone: `red` or `yellow`.
- And inside the corresponding starting subzone folder:
  - `red` → `ingestion`, `curated`, `enriched`
  - `yellow` → `anonymized`, `enriched`, `warehouse`

The generated DAG ID will follow this pattern: `<subzone>_<resource_code>`

Example: A file named `dags/config/red/curated/opera_config.json` will create a DAG named `curated_opera`.
The root fields of the file are:

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `concurrency` | int | ❌ | 1 | Max number of parallel tasks. |
| `start_date` | string | ❌ | "2021-01-01" | Initial DAG run date (format: "YYYY-MM-DD"). |
| `schedule` | string | ✅ | - | CRON expression for scheduling. |
| `catchup` | bool | ❌ | false | Whether to backfill missed DAG runs. |
| `timeout_hours` | int | ❌ | 4 | Max execution time in hours before timeout. |
| `steps` | array | ✅ | - | List of execution steps. See below. |
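For orientation, the root of a config file could therefore be sketched as follows (the schedule and other values are illustrative, not taken from a real resource):

```json
{
  "concurrency": 1,
  "start_date": "2021-01-01",
  "schedule": "0 4 * * *",
  "catchup": false,
  "timeout_hours": 4,
  "steps": []
}
```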
Each step can contain:

| Step Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `destination_zone` | string | ✅ | - | Target zone (e.g., "red", "yellow", "green"). |
| `destination_subzone` | string | ✅ | - | Subzone within the zone (e.g., "curated", "released"). |
| `main_class` | string | ✅ | - | Main class to run in unic-etl. Optional for published tasks. |
| `multiple_main_methods` | bool | ❌ | false | Whether multiple entrypoints exist in the main class. |
| `pre_tests` | array | ❌ | [] | QA tests before step execution. See below. |
| `datasets` | array | ✅ | - | List of datasets to process in the step. One task per dataset will be created. See below. |
| `post_tests` | array | ❌ | [] | QA tests after step execution. See below. |
| `optimize` | array | ❌ | [] | List of dataset IDs to optimize (i.e., Delta compaction). |
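A step could be sketched like this; the `main_class` value is a placeholder rather than an actual unic-etl class, and the nested arrays are detailed below:

```json
{
  "destination_zone": "red",
  "destination_subzone": "curated",
  "main_class": "com.example.etl.curated.Main",
  "multiple_main_methods": false,
  "pre_tests": [],
  "datasets": [],
  "post_tests": [],
  "optimize": []
}
```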
Each dataset can contain:

| Dataset Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `dataset_id` | string | ✅ | - | Dataset ID. Supports wildcards. |
| `cluster_type` | string | ✅ | - | Cluster size ("xsmall", "small", "medium" or "large"). Optional for published tasks. |
| `run_type` | string | ✅ | - | Execution type ("default" or "initial" to reset data). Optional for published tasks. |
| `pass_date` | bool | ✅ | - | Whether to pass the execution date as a `--date` parameter (for enriched tasks) or a `--version` parameter (for released and published tasks). |
| `dependencies` | array | ✅ | - | List of dataset IDs to run upstream. Set to `[]` if there are no upstream dependencies. |
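For illustration, a dataset entry might look like the following sketch (the dataset IDs are placeholder names):

```json
{
  "dataset_id": "curated_opera_patient",
  "cluster_type": "small",
  "run_type": "default",
  "pass_date": false,
  "dependencies": ["raw_opera_patient"]
}
```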
Each test (pre or post) must contain:

| Test Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `name` | string | ✅ | - | Name of the QA test in unic-etl. |
| `destinations` | string | ✅ | - | List of dataset IDs to test. Supports wildcards. |
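A hedged sketch of a test entry, assuming `destinations` accepts a list of dataset IDs as its description suggests (the test name is a placeholder, not a known unic-etl test):

```json
{
  "name": "example_not_empty_test",
  "destinations": ["curated_opera_*"]
}
```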
Each published step can contain:

| Step Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `destination_subzone` | string | ✅ | - | Use "published" here to create a published step. This will create a task that triggers the unic_publish_project DAG with the necessary config. |
| `resource_code` | string | ❌ | The filename before `_config.json` | Overrides the filename-derived value when the filename doesn't correspond to the resource_code. |
| `pass_date` | bool | ❌ | false | Whether to pass the execution date as the `--version` parameter to the unic_publish_project DAG. If false, "latest" will be passed. |
| `include_dictionary` | bool | ❌ | false | Whether to include the dictionary as an Excel file in the published bucket. |
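A published step could then be sketched as follows (values are illustrative):

```json
{
  "destination_subzone": "published",
  "resource_code": "opera",
  "pass_date": true,
  "include_dictionary": true
}
```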
Project-DAGs (i.e., starting in the enriched subzone) can be divided into two types:

- 1️⃣ One-time projects: define the enriched and released steps in their config file. When you are satisfied with a release candidate, you can manually trigger the unic_publish_project DAG.
- 🔄 Recurring projects: define the enriched, released, and published steps in their config file, so as to automatically trigger the unic_publish_project DAG (see the sketch after this list).
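For example, the steps array of a recurring project would end with a published step, whereas a one-time project would stop at released (all other step fields are omitted here for brevity):

```json
{
  "steps": [
    { "destination_subzone": "enriched" },
    { "destination_subzone": "released" },
    { "destination_subzone": "published" }
  ]
}
```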
When manually triggering a DAG in the Airflow UI, you'll see two default input parameters:
| Parameter | Default | Description |
|---|---|---|
| `branch` | master | Selects the JAR file to run, corresponding to the branch used to deploy the JAR (e.g., unic-etl-master.jar). |
| `version` | latest | For enriched DAGs: set to a date (YYYY-MM-dd) to publish a specific version. |
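For instance, a manual trigger that pins a specific version for an enriched DAG might use a run configuration like this (the date is illustrative):

```json
{
  "branch": "master",
  "version": "2024-01-01"
}
```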
```
cp .env.sample .env
docker-compose up
```

- URL: http://localhost:50080
- Username: airflow
- Password: airflow
Navigate to Admin → Variables and add:

- dags_path: `/opt/airflow/dags`
- base_url (optional): `http://localhost:50080`

To speed up setup, upload the `variables.json` file directly from the UI.
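The exact contents of `variables.json` are not reproduced here, but based on the variables described in this README it presumably maps each variable name to its value, along these lines (values illustrative):

```json
{
  "dags_path": "/opt/airflow/dags",
  "base_url": "http://localhost:50080"
}
```

The other variables mentioned below (s3_conn_id, pg_conn_id, slack_hook_url) can be added the same way.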
Run a task manually from the Airflow UI or use the CLI:
```
docker-compose exec airflow-scheduler airflow tasks test <dag> <task> 2022-01-01
```

- URL: http://localhost:59001
- Username: minioadmin
- Password: minioadmin
Navigate to Admin → Variables and add:

- s3_conn_id: `minio`
Navigate to Admin → Connections and add:

- Connection ID: `minio`
- Connection Type: Amazon S3
- Extra:

```
{
  "host": "http://minio:9000",
  "aws_access_key_id": "minioadmin",
  "aws_secret_access_key": "minioadmin"
}
```

- URL: http://localhost:5050
- Username: pgadmin@pgadmin.com
- Password: pgadmin
Navigate to Admin → Variables and add:

- pg_conn_id: `postgres` (set to the corresponding Postgres connection ID, e.g., `unic_qa_postgresql_vlan2_rw`)
Navigate to Admin → Connections and add:

- Connection ID: `postgres` (set to the corresponding Postgres connection ID, e.g., `unic_qa_postgresql_vlan2_rw`)
- Connection Type: Postgres
- Host: postgres-unic
- Schema: unic
- Password: pgadmin
- Port: 5432
Navigate to Admin → Variables and add:

- slack_hook_url: `https://hooks.slack.com/services/...`