26thSep1_Notes.txt (377 lines, 270 loc, 14.4 KB)
external.conf
You said:
It's in S3, but your script isn't running in S3; it runs in a pod. If you want to get that file from S3 into another pod, what is actually reading it? What reads it from S3, puts the values into memory, and passes them along?
A file existing in S3 doesn't mean you can just name the file and expect something to load it for you.
If you want a configuration file, you're either going to have to pass a full JSON payload right there, or you're going to have to mount an external source in your driver pod or your executor pod, and I don't think you want to do that. So that's going to be down at the infrastructure level.
What that means is we would have to actually mount a virtual drive in the pods pointing at your S3 location.
That's one legitimate way to get an S3 file somewhere you can address it as ./external.conf, and it's the only way what you wrote would actually work.
You're going to have to mount that configuration file as an external mount at the infrastructure layer, and for the driver and executor we're going to have to figure out how to mount an S3 location as a directory inside our containers.
ChatGPT said:
You’re right: nothing in Kubernetes “auto-loads” a file from S3. You have a few solid patterns—pick the one that fits your ops constraints.
0) Quick decision guide
Just a small config file? Put it in a ConfigMap/Secret and mount it. Easiest, no S3 needed.
Must source directly from S3 but only at start-up? Use an initContainer to aws s3 cp into an emptyDir, then mount that path to your driver/executor. (Grant S3 read with IRSA.)
App truly needs a POSIX-ish path to a bucket (e.g., /mnt/config/external.conf) at runtime? Use the Mountpoint for Amazon S3 CSI driver (EKS add-on) to mount the bucket/prefix into the pod. Best “S3-as-filesystem” option today; understand POSIX caveats.
For Spark specifically, another option is spark-submit --files s3a://bucket/path/external.conf and then reference it by filename in your app. That avoids mounts entirely if your image has hadoop-aws and credentials set.
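As a toy illustration (not anything Spark or Kubernetes provides), the decision guide above can be sketched as a small helper; the pattern labels are just shorthand for the four options:

```python
def pick_config_pattern(source_of_truth_in_s3: bool,
                        needs_posix_path: bool,
                        spark_job: bool) -> str:
    """Toy decision helper mirroring the guide above (labels are illustrative)."""
    if not source_of_truth_in_s3:
        return "configmap-or-secret"   # small config, no S3 needed
    if needs_posix_path:
        return "mountpoint-s3-csi"     # S3-as-filesystem at runtime
    if spark_job:
        return "spark --files"         # Spark distributes the file itself
    return "init-container s3 cp"      # fetch once at start-up

print(pick_config_pattern(True, False, True))   # spark --files
```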
1) ConfigMap/Secret (simplest)
If external.conf is non-secret:
apiVersion: v1
kind: ConfigMap
metadata:
  name: external-conf
data:
  external.conf: |
    key1=val1
    key2=val2
---
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          image: yourimage
          volumeMounts:
            - name: conf
              mountPath: /app/config
      volumes:
        - name: conf
          configMap:
            name: external-conf
            items:
              - key: external.conf
                path: external.conf
Now your code reads /app/config/external.conf. (Use a Secret instead if the file is sensitive.)
2) Init container pulls from S3 (no special CSI driver)
Good when you want the truth to live in S3 but don’t need a live mount.
Prereqs
IRSA role with s3:GetObject for s3://your-bucket/path/external.conf.
Image with awscli in the init container.
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      serviceAccountName: spark   # bound to an IAM role via IRSA
      initContainers:
        - name: fetch-conf
          image: amazon/aws-cli
          command: ["sh", "-c"]
          args:
            # Copy into the shared emptyDir root so the main container
            # sees it at /app/config/external.conf after the mount below.
            - aws s3 cp s3://your-bucket/path/external.conf /work/external.conf
          volumeMounts:
            - name: work
              mountPath: /work
      containers:
        - name: app
          image: yourimage
          volumeMounts:
            - name: work
              mountPath: /app/config
      volumes:
        - name: work
          emptyDir: {}
Your app reads /app/config/external.conf. (This pattern is common and avoids the POSIX caveats of S3 filesystems.) For Spark on EKS, make sure you use IRSA for the driver/executors.
3) Mount an S3 prefix into the pod (CSI driver)
If you must treat S3 like a local directory, use Mountpoint for Amazon S3 CSI (the EKS add-on).
Install (once per cluster): follow the EKS docs or add-on in the console.
Define a PV/PVC to your bucket/prefix (read-only for configs):
apiVersion: v1
kind: PersistentVolume
metadata:
name: s3-conf-pv
spec:
capacity:
storage: 5Gi
accessModes: ["ReadOnlyMany"]
csi:
driver: s3.csi.aws.com
volumeHandle: s3-conf-handle
volumeAttributes:
bucketName: your-bucket
# Optional prefix inside the bucket:
prefix: path/
mounter: mountpoint # uses Mountpoint for S3
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: s3-conf-pvc
spec:
accessModes: ["ReadOnlyMany"]
resources:
requests:
storage: 5Gi
volumeName: s3-conf-pv
Mount it in your pod:
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      serviceAccountName: spark   # IRSA with s3:GetObject
      containers:
        - name: app
          image: yourimage
          volumeMounts:
            - name: s3conf
              mountPath: /app/config
              readOnly: true
      volumes:
        - name: s3conf
          persistentVolumeClaim:
            claimName: s3-conf-pvc
Now /app/config/external.conf is just there. Be aware: S3 is not POSIX; some filesystem semantics differ. This driver targets high throughput and is the AWS-supported path; older FUSE options (s3fs/goofys) exist but come with performance/compatibility tradeoffs.
4) Spark-native alternative (no mounts at all)
If your use case is Spark (driver/executors on EKS), you can push the config file with --files:
spark-submit \
  --master k8s://https://$K8S_APISERVER \
  --deploy-mode cluster \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --files s3a://your-bucket/path/external.conf \
  local:///opt/spark/app.jar \
  --your-args ...
Inside your code, read it by filename only (external.conf) from the working dir; Spark distributes it to driver and executors. Ensure your image includes hadoop-aws and credentials via IRSA or Spark conf.
Which one should you use?
For a single external.conf that changes occasionally, I’d start with (2) initContainer pull (clean, predictable, easy to cache/version).
If you need the file to look like a regular file path without modifying the app, and possibly many files, use (3) S3 CSI (Mountpoint).
If it’s Spark-only, consider (4) --files—it’s built for exactly this.
If you want, tell me your exact runner (plain K8s vs Spark on K8s via MWAA KPO), and I’ll drop in the Airflow KubernetesPodOperator snippet with the appropriate volumes/volume_mounts or an init_containers block wired for IRSA.
You said:
Yeah, I just wanted to understand how my JAR, which is stored in an S3 location, is being read by the Kubernetes cluster's EKS pods. In the arguments I'm passing flags like --jars with the JAR location, and the JAR gets picked up without mounting anything, while the external configuration file can't be read. How does that work?
ChatGPT said:
Great question — the “why does my JAR work from S3 but my config file doesn’t?” puzzle is super common on Spark-on-K8s.
What’s happening under the hood
Your application JAR at s3a://.../app.jar works because Spark knows how to fetch remote URIs (Hadoop-compatible schemes like s3a://, hdfs://, wasbs://, etc.) for application code and dependencies.
When you pass --jars s3a://.../lib1.jar or the app itself as local:///... or s3a://.../app.jar, Spark’s launcher downloads those artifacts to the driver and propagates them to executors automatically (distributed cache).
This requires your image to include hadoop-aws + AWS SDK and credentials (ideally via IRSA).
Your external config file doesn’t work because Spark only auto-fetches items it knows about: app JAR, --jars, --packages, --files, --archives, --py-files.
A plain code path like new File("/app/config/external.conf") expects a local file in the container filesystem. Spark won’t pull it unless you explicitly tell it to (e.g., --files) or you mount/copy it into the container.
Three clean ways to make the config available (no Kubernetes volume mounts needed)
Use Spark’s distributed files
Add the config with --files (or spark.files=...), then read it via SparkFiles:
spark-submit \
  --master k8s://https://$K8S_APISERVER \
  --deploy-mode cluster \
  --class com.your.App \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --files s3a://your-bucket/path/external.conf \
  local:///opt/spark/app.jar
In Scala/Java:
import org.apache.spark.SparkFiles
val confPath = SparkFiles.get("external.conf") // now a local path on driver & executors
(You can keep your code reading a relative filename if you prefer; Spark copies it into the working dir.)
Read it directly from S3 using Hadoop FS (no local file at all)
If your code can accept an S3A URL, skip local files:
val spark = SparkSession.builder.getOrCreate()
val confUri = "s3a://your-bucket/path/external.conf"
val lines = spark.sparkContext.textFile(confUri).collect()
// or use Hadoop FS API:
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
val in = fs.open(new org.apache.hadoop.fs.Path(confUri))
This keeps the “source of truth” in S3 and avoids extra flags, at the cost of tweaking your code to accept S3 URIs.
Bake or fetch at container start (still no K8s volume mount)
Bake it into the image (simple but less dynamic), or
Init-container pull: have an init container aws s3 cp ... /work/external.conf into an emptyDir, then your app reads /work/external.conf. (This is infra-lite and works for any JVM app, not just Spark.)
Minimal Airflow KPO example (using Spark’s --files)
If you’re launching via KubernetesPodOperator:
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

submit = KubernetesPodOperator(
    task_id="spark-job",
    name="spark-job",
    namespace="default",
    image="your-spark-image:latest",   # must include hadoop-aws + aws sdk
    service_account_name="spark",      # IRSA-enabled SA
    cmds=["/opt/spark/bin/spark-submit"],
    arguments=[
        "--master", f"k8s://https://{K8S_APISERVER}",
        "--deploy-mode", "cluster",
        "--class", "com.your.App",
        "--conf", "spark.kubernetes.container.image=your-spark-image:latest",
        "--conf", "spark.kubernetes.authenticate.driver.serviceAccountName=spark",
        "--files", "s3a://your-bucket/path/external.conf",
        "--jars", "s3a://your-bucket/jars/dep1.jar,s3a://your-bucket/jars/dep2.jar",
        "local:///opt/spark/app.jar",
        # app args...
    ],
)
Sanity checklist (applies to all approaches)
Image contents: hadoop-aws jar + matching AWS SDK version for your Spark/Hadoop build.
Auth: IRSA role with s3:GetObject (and kms:Decrypt if objects are KMS-encrypted).
Spark conf: (often optional) spark.hadoop.fs.s3a.aws.credentials.provider if you’re not using IRSA defaults; ensure fs.s3a settings match your environment.
Don’t use s3://; use s3a:// for Hadoop/Spark.
TL;DR
JARs “just work from S3” because Spark knows to fetch them as code dependencies.
Your config is not a dependency unless you declare it. Use --files (or read via s3a:// in code), or pre-stage it (init-container or baked image).
No mount is required unless you specifically want a POSIX path exposed by the OS.
You said:
JARs “just work from S3” because Spark knows to fetch them as code dependencies.
Your config is not a dependency unless you declare it. please elaborate
ChatGPT said:
Got it 👍 — let’s dive deeper into that difference.
1. Why JARs “just work”
When you run spark-submit and pass your application JAR or extra dependencies with --jars s3a://bucket/app.jar:
Spark knows these are code artifacts.
The launcher uses the Hadoop FileSystem layer (fs.s3a) to download those JARs into the driver container.
Spark then ships them to executors automatically (through its distributed cache mechanism).
That means your app code is present on every executor without you needing to manually mount or copy anything.
👉 JARs are in a special category: Spark treats them as “application dependencies” and has built-in logic to fetch them from remote stores (S3, HDFS, WASB, GCS, etc.).
2. Why configs don’t work by default
A file like external.conf is not a code artifact in Spark’s eyes. Unless you tell Spark otherwise:
Your app tries new File("/path/to/external.conf") → the container’s filesystem is checked.
There’s no auto-fetch step, because Spark doesn’t assume every arbitrary file path might be in S3.
Result: file not found unless you manually mount, bake, or download it.
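A minimal sketch of the failure mode, assuming nothing has staged the file locally: a plain file read only consults the container's filesystem, so an unstaged path (the path below is hypothetical) simply isn't there:

```python
import os

def load_conf(path: str) -> str:
    # Plain file I/O only sees the local filesystem; nothing here
    # resolves s3a:// URIs or fetches from a bucket on your behalf.
    if not os.path.exists(path):
        raise FileNotFoundError(f"{path} is not on the local filesystem")
    with open(path) as f:
        return f.read()

try:
    load_conf("/app/config/external.conf")   # hypothetical unstaged path
except FileNotFoundError as e:
    print(e)
```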
3. How to make config behave like a JAR
You have to explicitly promote your config into one of Spark’s distributable resources:
--files flag → tells Spark “this is a small side file to distribute.”
spark-submit \
  --files s3a://bucket/external.conf \
  --class com.example.App \
  s3a://bucket/app.jar
Then inside your code:
import org.apache.spark.SparkFiles
val confPath = SparkFiles.get("external.conf")
Spark ensures external.conf is downloaded into the driver and executors’ working directories.
Alternative: change your code to read directly from S3 (s3a://bucket/external.conf). That bypasses Spark’s distribution logic but ties your code to S3.
4. Quick analogy
JARs = Spark treats them like libraries → auto-fetch, auto-distribute.
Configs = Spark treats them like data → you must explicitly tell Spark how to fetch/distribute (--files, --archives, or read via s3a://).
✅ Bottom line:
JARs are “first-class citizens” in Spark’s submission process. Config files are not. If you want your config file to be handled the same way, you must either use --files (so Spark distributes it just like JARs) or mount/copy it yourself into the pod.
Do you want me to show you a KubernetesPodOperator YAML/JSON snippet where I add --files s3a://.../external.conf so the config ends up mounted in the driver and executors automatically?