26thSep2_Notes.txt
KubernetesPodOperator (K8sPodOperator)
What it is:
An Airflow operator that creates a Kubernetes Pod directly in your cluster.
You can run any container (Python, Bash, Spark, etc.), as long as it’s packaged into a
Docker image.
Very flexible — you control Pod spec (CPU, memory, volumes, environment
variables).
Where to use it:
When you want fine-grained control of how pods are created.
For non-Spark workloads (ETL in Python, shell scripts, ML inference, etc.).
When your Spark job is already containerized (e.g., using Spark-on-K8s operator,
Helm, or a prebuilt Spark Docker image).
If you need IRSA, secrets, config maps mounting into the pod.
Good for:
Running general-purpose jobs in Kubernetes.
Triggering Spark jobs via a pod that runs spark-submit (custom container).
Jobs that need custom dependencies pre-installed in the image.
Caution:
You are responsible for writing the full container logic.
Monitoring/logging may be less “native” compared to SparkSubmitOperator.
SparkSubmitOperator
What it is:
Airflow operator built for submitting Spark jobs.
Runs spark-submit against a Spark cluster (standalone, YARN, Mesos, or Kubernetes).
Assumes Spark is already installed or available in the environment.
Where to use it:
When you already have a running Spark cluster (on YARN, EMR, Dataproc, or K8s).
If you want tight integration with Spark (spark-submit arguments, configs, jars, etc.).
When you don’t want to manage Pod specs manually — the Spark cluster handles
scheduling/executors.
Good for:
Directly running Spark JARs, PySpark, SQL queries.
Classic Spark deployments where spark-submit is the standard.
Quick Spark job orchestration with Airflow.
Caution:
Limited to Spark workloads.
Requires spark-submit (the Spark binaries) to be accessible on Airflow workers, plus a
configured Spark connection pointing at the cluster.
Less control over the Kubernetes pod spec compared to KubernetesPodOperator.
When to choose what?
Use KubernetesPodOperator if:
o You run Spark-on-K8s with Helm or Operator (containerized Spark job).
o You need more Kubernetes control (volumes, IRSA, secrets).
o You also run non-Spark jobs in the same DAG.
Use SparkSubmitOperator if:
o You already have a Spark cluster (EMR, Dataproc, or standalone).
o You want direct spark-submit integration.
o You’re mostly running Spark JARs/scripts without needing Kubernetes
specifics.