Commit ed23cc3
[SPARK-54344][PYTHON] Kill the worker if flush fails in daemon.py
### What changes were proposed in this pull request?
Kills the worker if flush fails in `daemon.py`.
- Spark conf: `spark.python.daemon.killWorkerOnFlushFailure` (default `true`)
- SQL conf: `spark.sql.execution.pyspark.udf.daemonKillWorkerOnFlushFailure` (fallback to the above)
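The fallback between the two confs can be sketched as a simple lookup order. This is only an illustration, not Spark's actual configuration machinery: the helper names `_to_bool` and `kill_worker_on_flush_failure` and the plain `dict` of string settings are assumptions made for this example. Only the two conf keys and the default of `true` come from the description above.

```python
# Conf keys taken from the PR description.
SPARK_CONF = "spark.python.daemon.killWorkerOnFlushFailure"
SQL_CONF = "spark.sql.execution.pyspark.udf.daemonKillWorkerOnFlushFailure"


def _to_bool(value: str) -> bool:
    """Hypothetical helper: parse a conf string such as "true" or "false"."""
    return value.strip().lower() == "true"


def kill_worker_on_flush_failure(confs: dict) -> bool:
    """Hypothetical helper: resolve the effective setting.

    The SQL conf wins when it is set explicitly; otherwise the Spark conf
    is used, and if neither is set the default is true.
    """
    if SQL_CONF in confs:
        return _to_bool(confs[SQL_CONF])
    return _to_bool(confs.get(SPARK_CONF, "true"))
```

For example, `kill_worker_on_flush_failure({SPARK_CONF: "false"})` resolves to `False`, while additionally setting the SQL conf to `"true"` overrides it.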
Before the worker dies, it reuses the `faulthandler` feature to record a thread dump, which appears in the error message if `faulthandler` is enabled.
```
WARN TaskSetManager: Lost task 3.0 in stage 1.0 (TID 8) (127.0.0.1 executor 1): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Current thread 0x00000001f0796140 (most recent call first):
File "/.../python/pyspark/daemon.py", line 95 in worker
File "/.../python/pyspark/daemon.py", line 228 in manager
File "/.../python/pyspark/daemon.py", line 253 in <module>
File "<frozen runpy>", line 88 in _run_code
File "<frozen runpy>", line 198 in _run_module_as_main
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:679)
...
```
Even when `faulthandler` is not enabled, the error will appear in the executor's `stderr` file.
```
Traceback (most recent call last):
File "/.../python/pyspark/daemon.py", line 228, in manager
code = worker(sock, authenticated)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.../python/pyspark/daemon.py", line 88, in worker
raise Exception("test")
Exception: test
```
When the conf is disabled, the behavior is the same as before, except that the failure is now logged.
### Why are the changes needed?
Currently, an exception caused by an `outfile.flush()` failure in `daemon.py` is ignored. If the last command in `worker_main` has not been flushed yet, this can leave a UDF stuck on the Java side, waiting for a response from the Python worker.
The worker should just die and let Spark retry the task.
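The idea can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the code in `daemon.py`: the function `flush_or_die` and its parameters `kill_on_failure`, `dump_file`, and `die` are hypothetical names introduced here, and the injectable `die` callback exists only so the sketch can be exercised without terminating the process.

```python
import faulthandler
import os
import sys


def flush_or_die(outfile, kill_on_failure=True, dump_file=None, die=None):
    """Hypothetical helper: flush `outfile`, terminating on failure.

    Returns True if the flush succeeded. On failure, either logs and
    returns False (the old "ignore" behavior plus a log line), or records
    a thread dump and terminates the process.
    """
    if die is None:
        # Assumption for this sketch: exit immediately with a non-zero
        # code so the parent notices that the worker crashed.
        def die():
            os._exit(1)

    try:
        outfile.flush()
        return True
    except Exception as exc:
        if not kill_on_failure:
            print(f"flush failed, ignoring: {exc!r}", file=sys.stderr)
            return False
        if dump_file is not None:
            # dump_traceback needs a real file (one with a file descriptor).
            faulthandler.dump_traceback(file=dump_file, all_threads=True)
        die()
        return False
```

Dying here is a deliberate choice: a crashed worker is something the JVM side can detect and retry, whereas a silently swallowed flush failure leaves it waiting for bytes that never arrive.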
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually.
<details>
<summary>Test with the patch to emulate the case</summary>
```patch
% git diff
diff --git a/python/pyspark/daemon.py b/python/pyspark/daemon.py
index 54c9507..e107216d769 100644
--- a/python/pyspark/daemon.py
+++ b/python/pyspark/daemon.py
@@ -84,6 +84,8 @@ def worker(sock, authenticated):
exit_code = compute_real_exit_code(exc.code)
finally:
try:
+ if worker_main.__globals__.get("TEST", False):
+ raise Exception("test")
outfile.flush()
except Exception:
faulthandler_log_path = os.environ.get("PYTHON_FAULTHANDLER_DIR", None)
diff --git a/python/pyspark/worker.py b/python/pyspark/worker.py
index 6e34b04..ff210f4fd97 100644
--- a/python/pyspark/worker.py
+++ b/python/pyspark/worker.py
@@ -3413,7 +3413,14 @@ def main(infile, outfile):
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
- write_int(SpecialLengths.END_OF_STREAM, outfile)
+ import random
+
+ if random.random() < 0.1:
+ # emulate the last command is not flushed yet
+ global TEST
+ TEST = True
+ else:
+ write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
# write a different value to tell JVM to not reuse this worker
write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
@@ -3423,6 +3430,9 @@ def main(infile, outfile):
faulthandler.cancel_dump_traceback_later()
+TEST = False
+
+
if __name__ == "__main__":
# Read information about how to connect back to the JVM from the environment.
conn_info = os.environ.get(
```
</details>
With just `pass` (the behavior before this change), the task gets stuck; with this change, Spark retries the task.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #53055 from ueshin/issues/SPARK-54344/daemon_flush.
Lead-authored-by: Takuya Ueshin <ueshin@databricks.com>
Co-authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
1 parent 2115023 commit ed23cc3
File tree
11 files changed (+60, -2 lines)
- core/src/main/scala/org/apache/spark
- api/python
- internal/config
- python/pyspark
- sql
- catalyst/src/main/scala/org/apache/spark/sql/internal
- core/src/main/scala/org/apache/spark/sql/execution/python
- streaming