I had a couple of jobs get "lost" (stuck in the running state) this way. In the case of 66751002, the job was submitted and ran, but a network error occurred during postprocessing and no state files were left behind in the {manager}-*-jobs dirs (#354 is relevant here as well). The last message logged for this job was:
2025-04-15 15:23:23,518 INFO [pulsar.client.staging.down][[manager=vgp_jetstream2]-[action=postprocess]-[job=66751002]] collecting output database.dmnd with action FileAction[path=/corral4/main/objects/6/4/0/dataset_640edda5-f2c0-4209-bb9c-3f801701a638.dat,action_type=rem>
In the case of 66751004, the job did not finish preprocessing and there was a {manager}-preprocessing-jobs file for the job. This was the last message:
2025-04-15 15:22:51,993 DEBUG [pulsar.managers.staging.pre][[manager=vgp_jetstream2]-[action=preprocess]-[job=66751004]] Staging input 'dataset_98dadf00-5182-4fbb-a07e-9f9ca6210985.dat' via FileAction[path=/corral4/main/objects/9/8/d/dataset_98dadf00-5182-4fbb-a07e-9f9ca62>
Although nothing else was logged and no exceptions were raised for either of these jobs, clear network issues were recorded for other jobs within the same few minutes:
2025-04-15 15:26:17,569 INFO [pulsar.managers.util.retry][[manager=vgp_jetstream2]-[action=postprocess]-[job=66783218]] Failed to execute staging out file /jetstream2/scratch/main/jobs-vgp/66783218/outputs/dataset_94265413-e9a8-4ff1-b12a-776d0948c293.dat via FileAction[pa>
Traceback (most recent call last):
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 537, in _make_request
response = conn.getresponse()
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connection.py", line 466, in getresponse
httplib_response = super().getresponse()
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/sentry_sdk/integrations/stdlib.py", line 128, in getresponse
rv = real_getresponse(self, *args, **kwargs)
File "/usr/lib64/python3.9/http/client.py", line 1377, in getresponse
response.begin()
File "/usr/lib64/python3.9/http/client.py", line 320, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python3.9/http/client.py", line 289, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 847, in urlopen
retries = retries.increment(
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/util/retry.py", line 470, in increment
raise reraise(type(error), error, _stacktrace)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/util/util.py", line 38, in reraise
raise value.with_traceback(tb)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 537, in _make_request
response = conn.getresponse()
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connection.py", line 466, in getresponse
httplib_response = super().getresponse()
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/sentry_sdk/integrations/stdlib.py", line 128, in getresponse
rv = real_getresponse(self, *args, **kwargs)
File "/usr/lib64/python3.9/http/client.py", line 1377, in getresponse
response.begin()
File "/usr/lib64/python3.9/http/client.py", line 320, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python3.9/http/client.py", line 289, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/request.py", line 82, in perform
resp = requests.patch(
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/api.py", line 145, in patch
return request("patch", url, data=data, **kwargs)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/adapters.py", line 501, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/managers/util/retry.py", line 93, in _retry_over_time
return fun(*args, **kwargs)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/managers/staging/post.py", line 82, in <lambda>
self.action_executor.execute(lambda: action.write_from_path(pulsar_path), description)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/client/action_mapper.py", line 513, in write_from_path
tus_upload_file(self.url, pulsar_path)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/client/transport/tus.py", line 32, in tus_upload_file
uploader.upload()
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/uploader/uploader.py", line 45, in upload
self.upload_chunk()
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/uploader/uploader.py", line 59, in upload_chunk
self._do_request()
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/uploader/uploader.py", line 88, in _do_request
self._retry_or_cry(error)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/uploader/uploader.py", line 102, in _retry_or_cry
raise error
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/uploader/uploader.py", line 85, in _do_request
self.request.perform()
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/request.py", line 92, in perform
raise TusUploadFailed(error)
tusclient.exceptions.TusUploadFailed: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
And a publisher error:
2025-04-15 15:27:19,837 DEBUG [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] Acknowledging UUID 1ec6cd9c-1a0e-11f0-99d0-005056bc743e on queue setup_ack
2025-04-15 15:27:19,841 DEBUG [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] [publish:1ee11152-1a0e-11f0-a95b-fa163ed650e8] Begin publishing to key pulsar_vgp_jetstream2__setup_ack
2025-04-15 15:27:19,842 DEBUG [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] [publish:1ee11152-1a0e-11f0-a95b-fa163ed650e8] Have producer for publishing to key pulsar_vgp_jetstream2__setup_ack
2025-04-15 15:27:19,844 ERROR [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] [publish:1ee11152-1a0e-11f0-a95b-fa163ed650e8] Connection error while publishing: TimeoutError(110, 'Connection timed out')
Traceback (most recent call last):
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/kombu/connection.py", line 556, in _ensured
return fun(*args, **kwargs)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/kombu/messaging.py", line 208, in _publish
return channel.basic_publish(
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/amqp/channel.py", line 1791, in _basic_publish
self.connection.drain_events(timeout=0)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/amqp/connection.py", line 526, in drain_events
while not self.blocking_read(timeout):
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/amqp/connection.py", line 531, in blocking_read
frame = self.transport.read_frame()
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/amqp/transport.py", line 294, in read_frame
frame_header = read(7, True)
File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/amqp/transport.py", line 574, in _read
s = recv(n - len(rbuf)) # see note above
File "/usr/lib64/python3.9/ssl.py", line 1135, in read
return self._sslobj.read(len)
TimeoutError: [Errno 110] Connection timed out
2025-04-15 15:27:19,856 INFO [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] [publish:1ee11152-1a0e-11f0-a95b-fa163ed650e8] Retrying in 0 seconds
2025-04-15 15:27:20,150 DEBUG [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] [publish:1ee11152-1a0e-11f0-a95b-fa163ed650e8] Published to key pulsar_vgp_jetstream2__setup_ack
I have to assume errors like these are the cause of the loss here.
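To illustrate the behavior I would expect in these cases, here is a rough sketch (not Pulsar's actual retry code) of a wrapper that logs every failed attempt and, once retries are exhausted, calls a hook and re-raises, so a job cannot silently stay in the running state. The names `run_with_retries` and `on_give_up` are hypothetical and only for illustration. The hook stands in for whatever would record a failed state for the job.

```python
import logging
import time

log = logging.getLogger(__name__)


def run_with_retries(action, description, max_retries=3, delay=0.0, on_give_up=None):
    """Call action() up to max_retries times, logging every failure.

    Hypothetical helper for illustration only. If every attempt fails,
    on_give_up(exc) is called (for example, to record a failed state)
    and the last exception is re-raised so it is never swallowed.
    """
    last_exc = None
    for attempt in range(1, max_retries + 1):
        try:
            return action()
        except Exception as exc:
            last_exc = exc
            log.warning(
                "Failed to execute %s (attempt %d/%d): %s",
                description, attempt, max_retries, exc,
            )
            # Wait before the next attempt, but not after the final one.
            if attempt < max_retries:
                time.sleep(delay)
    log.error("Giving up on %s after %d attempts", description, max_retries)
    if on_give_up is not None:
        on_give_up(last_exc)
    raise last_exc
```

With something like this around each staging action, a job whose staging fails for good would at least leave a final log line and a recorded failure behind, instead of nothing at all.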