Skip to content

Refactor Operator Logic to Avoid Package Dependency Conflicts #9

@soltanianalytics

Description

@soltanianalytics

Currently, all operators are based on an EWAHBaseOperator that contains all necessary functionality for loading data into the DWH and individual operators contain the logic of extracting data in the execute() function. This will inevitably lead to dependency conflicts. To avoid this in the future, I see two options:

  • refactor all operators as children of the kubernetes pod operator, and make sure that each image either contains only the required dependencies or that they are installed after the pod is spun up, or
  • refactor all operators as children of the python virtualenv operator, and make that they contain their dependencies to be installed within the virtualenv at execution

Initially I preferred the kubernetes pod operator options, but this would limit the usecases of EWAH by excluding all users who are unwilling or unable to use kubernetes; the alternatively would have been some ugly hybrid which I didn't exactly fancy either.

Though requiring more refactoring, the second option now seems preferrable to me and comes with a few unexpected upsides. What follows is a high-level description of how I imagine EWAH operators to work in the future.

Recap of general EWAH Extract and Load logic

  • Each data source has one DAG (full refresh) or set of DAGs (incremental load) that loads data into one raw data schema in the DWH
  • Each full-refresh EL DAG has three components: two schema tasks (kickoff and final) and one type extract and load tasks, which is instantiated once per table that is loaded from the data source
  • Incremental EL DAGs have an additional sensor task to make sure they are executed in order and backfill properly
  • on a table level, loading can occur with chunking of requests by either a timestamp or serial (integer) field
  • incremental models need a timestamp field to incrementally load by
  • also planned for the future: hybrid model which incrementally loads immutable data based on a serial

How an EWAHBaseOpator based on the python virtualenv operator might be constructed logically

  • Components of the base operator
    • init()
      • save all operator kwargs related to the base operator
      • test values of all operator kwargs for validity and compatibility
      • set the callable kwarg to a function of the class itself which is overwritten by child operators
      • call super()
    • execute()
      • contain logic for full-refresh, incremental loading, and various chunking mechanisms
      • calls super() for each chunk, returning the result of the callable
    • operator_callable()
      • function returning None which is overwritten by the child class and in the base operator returns None
  • Components of the child operator
    • init()
      • ....

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions