Project replicating the stored procedure ABCS.AON_AROAS_DASH.create_optimization_report_test_ip_during_campaign. It uses a modular architecture to handle dependencies separately, interactively creating documentation of the assets used while replicating the original pipeline workflow.
**General architecture (abbreviated)**
```mermaid
graph TD
    A[Start] --> B(Load Config)
    B --> C(Establish Connection)
    C --> D(Run Executor)
    D --> E[End]
```

Components:
- The entry point is in charge of calling `connector.py` to establish the connection to Snowflake, `loader.py` to load the configuration file, and `executor.py` to run the pipeline.
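A minimal sketch of how that wiring might look; the function names (`load_profile`, `establish_connection`, `run_executor`) are assumptions, and the connection step is stubbed out rather than opening a real Snowflake session:

```python
# Hypothetical sketch of the entry-point wiring; function names are
# assumptions, and the Snowflake connection is stubbed.
import json
from pathlib import Path


def load_profile(path: str) -> dict:
    """loader.py role: read a JSON profile and return it as a dictionary."""
    return json.loads(Path(path).read_text())


def establish_connection(config: dict):
    """connector.py role: would open a Snowflake session; stubbed here."""
    return ("session-stub", "cursor-stub")


def run_executor(config: dict, session, cursor) -> dict:
    """executor.py role: would loop over config['tasks']; stubbed here."""
    return {"tasks_run": len(config.get("tasks", []))}


if __name__ == "__main__":
    cfg = {"tasks": ["get_campaign", "smooth_values"]}
    session, cursor = establish_connection(cfg)
    print(run_executor(cfg, session, cursor))
```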
- `profiles/` stores the different configurations in JSON format; each configuration contains the tasks in execution order and the attributes to be set to run the pipeline. `loader.py` loads these files and returns the respective dictionary.
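A sketch of what a `profiles/` entry might contain and how it could be parsed into a dictionary; the key names (`tasks`, `attributes`) and values are assumptions, not the project's actual schema:

```python
# Hypothetical profile contents; key names are assumptions.
import json

example_profile = """
{
    "tasks": ["get_campaign", "create_base_tables", "smooth_values"],
    "attributes": {"project_id": "ABCS", "tag_campaign_id": "12345"}
}
"""


def load_config(raw: str) -> dict:
    """Parse a profile: tasks in execution order plus run attributes."""
    config = json.loads(raw)
    assert isinstance(config["tasks"], list)
    assert isinstance(config["attributes"], dict)
    return config


config = load_config(example_profile)
print(config["tasks"][0])  # first task to execute
```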
- `connector.py` establishes the connection to Snowflake, providing the cursor and the session.
- This directory contains several components whose main task is to execute the pipeline and log the intermediate results.
- `stage_registry.py` is a dictionary mapping each task name to its respective function.
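Such a registry can be sketched as a plain dictionary of callables; the task names and stage signatures below are assumptions for illustration:

```python
# Hypothetical stage registry; task names and signatures are assumptions.
def get_campaign(config: dict, context: dict) -> dict:
    context["campaign"] = config["attributes"]["tag_campaign_id"]
    return context


def smooth_values(config: dict, context: dict) -> dict:
    context["smoothed"] = True
    return context


STAGE_REGISTRY = {
    "get_campaign": get_campaign,
    "smooth_values": smooth_values,
}

# The executor resolves each task name to its function through the registry.
fn = STAGE_REGISTRY["get_campaign"]
```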
- `database_utils.py` contains auxiliary functions that update the config, execute queries, and log each completed task with its timestamp, the tables created, and the tables read from.
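A sketch of a `database_utils.py`-style log helper; the exact field layout and separators are assumptions based on the log format described below:

```python
# Hypothetical log helper; field layout is an assumption.
from datetime import datetime, timezone


def log_task(task_name: str, query: str, created: list, read_from: list) -> str:
    """Build one log line: timestamp, task name, query prefix, tables."""
    ts = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return (f"{ts} | {task_name} | {query[:70]} | "
            f"created={created} | reads={read_from}")


line = log_task(
    "create_base_tables",
    "CREATE TABLE ABCS_12345_base AS SELECT * FROM raw_events",
    created=["ABCS_12345_base (persistent)"],
    read_from=["raw_events"],
)
print(line)
```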
- `executor.py` loops through the tasks in the config dictionary, passing each task its config and the context that is updated along the pipeline.
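The executor loop can be sketched as follows, threading a context dictionary through the stages; function and key names are assumptions:

```python
# Hypothetical executor loop; names are assumptions.
def run(config: dict, registry: dict) -> dict:
    context: dict = {}
    for task_name in config["tasks"]:
        stage_fn = registry[task_name]
        # Each stage receives the config and the context accumulated so far,
        # and returns the updated context.
        context = stage_fn(config, context)
    return context


registry = {
    "a": lambda cfg, ctx: {**ctx, "a_done": True},
    "b": lambda cfg, ctx: {**ctx, "b_done": True},
}
result = run({"tasks": ["a", "b"]}, registry)
print(result)  # both stages contributed to the context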
- `stages/` contains the stage functions, which are of two types: functions in `procedures.py` that execute processes, and query-returning functions, in `create.py` to create tables and in `get.py` to fetch information.
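The two query-returning styles can be sketched as simple string builders; the table, column, and function names here are assumptions, not the project's actual queries:

```python
# Hypothetical query builders; names and SQL are assumptions.
def create_base_table(table_name: str, source: str) -> str:
    """create.py style: return a query that creates a table."""
    return f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM {source}"


def get_campaign_info(campaign_id: str) -> str:
    """get.py style: return a query that fetches information."""
    return f"SELECT * FROM campaigns WHERE tag_campaign_id = '{campaign_id}'"


print(create_base_table("ABCS_12345_base", "raw_events"))
print(get_campaign_info("12345"))
```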
The task-log format includes the timestamp of the task, then the task name, the first 70 characters of the query, the created tables (specifying which are persistent and which temporary), and the tables the query uses in its execution. This format is applied to each task, with the exception of the first task (get campaign) and the last task (smooth values).
The final line of the log file prints the `ran_id` identifier; it can also be extracted from the last six characters of each logged persistent table name.
- All persistent table names are identified with the project id + tag_campaign id. Additionally, so that the project allows parallel, non-conflicting simultaneous runs of the same project and campaign, a unique identifier is appended to each persistent table's name. This identifier is printed at the end of the log file so that the respective tables can be located.
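The naming scheme above can be sketched as follows; the suffix generation via `uuid` and the underscore-separated concatenation are assumptions, chosen only so that the run identifier (the `ran_id` in the log) is recoverable from the last six characters of a table name:

```python
# Hypothetical naming scheme; suffix generation and separators are assumptions.
import uuid


def make_run_id() -> str:
    """Unique identifier allowing parallel runs of the same project/campaign."""
    return uuid.uuid4().hex[-6:]


def persistent_table_name(project_id: str, tag_campaign_id: str, run_id: str) -> str:
    """project id + tag_campaign id + unique run suffix."""
    return f"{project_id}_{tag_campaign_id}_{run_id}"


def run_id_from_table(table_name: str) -> str:
    """Recover the run identifier from the last six characters of the name."""
    return table_name[-6:]


run_id = make_run_id()
table = persistent_table_name("ABCS", "12345", run_id)
assert run_id_from_table(table) == run_id
```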
Maintaining a dictionary of data assets that keeps a hierarchy inherited from the degree of manipulation a table contains.
Modularizing functions/query generation by different tasks: ** tables