Losing jobs due to network interruptions #393

@natefoo

I had a couple of jobs get "lost" (stuck in the running state) this way. In the case of 66751002, the job was submitted and ran, but a network error occurred during postprocessing and no state files were left behind in the {manager}-*-jobs dirs (#354 is relevant here as well). The last message logged for this job was:

2025-04-15 15:23:23,518 INFO  [pulsar.client.staging.down][[manager=vgp_jetstream2]-[action=postprocess]-[job=66751002]] collecting output database.dmnd with action FileAction[path=/corral4/main/objects/6/4/0/dataset_640edda5-f2c0-4209-bb9c-3f801701a638.dat,action_type=rem>

In the case of 66751004, the job did not finish preprocessing and there was a {manager}-preprocessing-jobs file for the job. This was the last message:

2025-04-15 15:22:51,993 DEBUG [pulsar.managers.staging.pre][[manager=vgp_jetstream2]-[action=preprocess]-[job=66751004]] Staging input 'dataset_98dadf00-5182-4fbb-a07e-9f9ca6210985.dat' via FileAction[path=/corral4/main/objects/9/8/d/dataset_98dadf00-5182-4fbb-a07e-9f9ca62>

So although nothing else was logged and no exceptions were raised for either of these jobs, there are clear network issues recorded for other jobs:

2025-04-15 15:26:17,569 INFO  [pulsar.managers.util.retry][[manager=vgp_jetstream2]-[action=postprocess]-[job=66783218]] Failed to execute staging out file /jetstream2/scratch/main/jobs-vgp/66783218/outputs/dataset_94265413-e9a8-4ff1-b12a-776d0948c293.dat via FileAction[pa>
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/sentry_sdk/integrations/stdlib.py", line 128, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/lib64/python3.9/http/client.py", line 1377, in getresponse
    response.begin()
  File "/usr/lib64/python3.9/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.9/http/client.py", line 289, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/sentry_sdk/integrations/stdlib.py", line 128, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/lib64/python3.9/http/client.py", line 1377, in getresponse
    response.begin()
  File "/usr/lib64/python3.9/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.9/http/client.py", line 289, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/request.py", line 82, in perform
    resp = requests.patch(
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/api.py", line 145, in patch
    return request("patch", url, data=data, **kwargs)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/managers/util/retry.py", line 93, in _retry_over_time
    return fun(*args, **kwargs)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/managers/staging/post.py", line 82, in <lambda>
    self.action_executor.execute(lambda: action.write_from_path(pulsar_path), description)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/client/action_mapper.py", line 513, in write_from_path
    tus_upload_file(self.url, pulsar_path)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/client/transport/tus.py", line 32, in tus_upload_file
    uploader.upload()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/uploader/uploader.py", line 45, in upload
    self.upload_chunk()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/uploader/uploader.py", line 59, in upload_chunk
    self._do_request()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/uploader/uploader.py", line 88, in _do_request
    self._retry_or_cry(error)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/uploader/uploader.py", line 102, in _retry_or_cry
    raise error
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/uploader/uploader.py", line 85, in _do_request
    self.request.perform()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/request.py", line 92, in perform
    raise TusUploadFailed(error)
tusclient.exceptions.TusUploadFailed: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

And a publisher error:

2025-04-15 15:27:19,837 DEBUG [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] Acknowledging UUID 1ec6cd9c-1a0e-11f0-99d0-005056bc743e on queue setup_ack
2025-04-15 15:27:19,841 DEBUG [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] [publish:1ee11152-1a0e-11f0-a95b-fa163ed650e8] Begin publishing to key pulsar_vgp_jetstream2__setup_ack
2025-04-15 15:27:19,842 DEBUG [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] [publish:1ee11152-1a0e-11f0-a95b-fa163ed650e8] Have producer for publishing to key pulsar_vgp_jetstream2__setup_ack
2025-04-15 15:27:19,844 ERROR [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] [publish:1ee11152-1a0e-11f0-a95b-fa163ed650e8] Connection error while publishing: TimeoutError(110, 'Connection timed out')
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/kombu/connection.py", line 556, in _ensured
    return fun(*args, **kwargs)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/kombu/messaging.py", line 208, in _publish
    return channel.basic_publish(
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/amqp/channel.py", line 1791, in _basic_publish
    self.connection.drain_events(timeout=0)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/amqp/connection.py", line 526, in drain_events
    while not self.blocking_read(timeout):
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/amqp/connection.py", line 531, in blocking_read
    frame = self.transport.read_frame()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/amqp/transport.py", line 294, in read_frame
    frame_header = read(7, True)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/amqp/transport.py", line 574, in _read
    s = recv(n - len(rbuf))  # see note above
  File "/usr/lib64/python3.9/ssl.py", line 1135, in read
    return self._sslobj.read(len)
TimeoutError: [Errno 110] Connection timed out
2025-04-15 15:27:19,856 INFO  [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] [publish:1ee11152-1a0e-11f0-a95b-fa163ed650e8] Retrying in 0 seconds
2025-04-15 15:27:20,150 DEBUG [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] [publish:1ee11152-1a0e-11f0-a95b-fa163ed650e8] Published to key pulsar_vgp_jetstream2__setup_ack

I have to assume these are what caused the two jobs to be lost.
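
For anyone trying to find similarly affected jobs, here is a rough sketch of how the log could be scanned for jobs that simply went quiet. It relies only on the `[[manager=...]-[action=...]-[job=...]]` tag visible in the log lines above. The function names (`last_seen_per_job`, `quiet_jobs`) and the idea of a "quiet" threshold are my own assumptions for illustration, not anything Pulsar provides, and a long-running job may legitimately log nothing for a while, so treat the result as a list of candidates to check by hand.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp and the manager/action/job context tag used in the
# Pulsar log lines quoted above. Traceback lines and the amqp_exchange lines
# carry no job tag and are skipped.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+\w+\s+"
    r"\[[^\]]+\]\[\[manager=(?P<manager>[^\]]+)\]-"
    r"\[action=(?P<action>[^\]]+)\]-\[job=(?P<job>\d+)\]\]"
)


def last_seen_per_job(lines):
    """Map each job id to (timestamp, action) of the last log line tagged with it."""
    last = {}
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
            last[m["job"]] = (ts, m["action"])
    return last


def quiet_jobs(lines, now, quiet_for):
    """Return job ids (hypothetical heuristic) with no tagged log line for `quiet_for`."""
    seen = last_seen_per_job(lines)
    return sorted(job for job, (ts, _action) in seen.items() if now - ts >= quiet_for)
```

Called as `quiet_jobs(open(path), datetime.now(), timedelta(hours=1))`, with `path` pointing at the Pulsar log, this lists jobs whose last tagged message is at least an hour old. Those can then be compared against the {manager}-*-jobs state files.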
