This repository was archived by the owner on Feb 22, 2022. It is now read-only.

Handling batch submit/status failure for many tasks #210

@knagaitsev


There is currently some inconsistency in how /batch_status and /submit handle errors when a request involves many tasks.

In the case of /batch_status, there is no real error handling. This route returns a list of statuses for the tasks requested via the task_ids parameter. If a task is not found, it is added to this list with 'status': 'Failed'; if the lookup succeeds, it is added with the desired data for that task. The route always responds with a success response, even if all the tasks are marked as 'Failed'.
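To make the current behavior concrete, here is a minimal sketch of a handler that behaves as described above. The function name, the `store` dictionary, and the `task_id` key are hypothetical stand-ins for illustration, not the actual service code.

```python
from __future__ import annotations

from typing import Any


def batch_status(
    task_ids: list[str], store: dict[str, dict[str, Any]]
) -> tuple[list[dict[str, Any]], int]:
    """Return (statuses, http_code) for the requested tasks.

    `store` is a hypothetical in-memory stand-in for wherever task
    state is kept; the real service differs.
    """
    statuses = []
    for task_id in task_ids:
        task = store.get(task_id)
        if task is None:
            # A missing task is reported inline, not as an HTTP error.
            statuses.append({"task_id": task_id, "status": "Failed"})
        else:
            statuses.append({"task_id": task_id, **task})
    # The code is 200 regardless of how many entries are 'Failed'.
    return statuses, 200
```

With this shape, a caller cannot tell from the status code alone whether any lookup worked; it has to inspect every entry.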

In the case of /submit, there is some error handling: the route responds with a 4xx/5xx if any of the submitted tasks fails during submission. This means that even if some tasks were submitted successfully during the request, a failure is sent back with no information about which tasks succeeded.

This is particularly tricky for the /submit endpoint. When a user submits a single task, the ideal is a clear, readable error. But since a single submission is internally just a batch submit, it also needs to stay consistent with what happens when many tasks are submitted and some fail while others succeed.

My proposal: If an internal error prevents the request from being processed at all, or if status/submit fails for all of the tasks, a 4xx/5xx is sent back with an error. If status/submit succeeds for at least one of the tasks, send back a 200 with a list of successes/failures for each individual task. For both status and submit, even if a task fails, proceed with the other tasks in the batch until they have all been tried.
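A rough sketch of this proposal, assuming a hypothetical `submit_one` callable that returns a task UUID or raises on failure. The choice of 500 for the all-failed case is a placeholder; the proposal only says "4xx/5xx".

```python
from __future__ import annotations

from typing import Any, Callable


def submit_batch(
    tasks: list[dict[str, Any]],
    submit_one: Callable[[dict[str, Any]], str],
) -> tuple[list[dict[str, Any]], int]:
    """Try every task, then pick one status code for the whole request."""
    results = []
    for task in tasks:
        try:
            task_uuid = submit_one(task)  # hypothetical per-task submit
        except Exception as exc:
            # Record the failure and keep going with the rest of the batch.
            results.append({"status": "Failed", "reason": str(exc)})
        else:
            results.append({"status": "Success", "task_uuid": task_uuid})

    any_succeeded = any(r["status"] == "Success" for r in results)
    # 200 if at least one task succeeded; otherwise an error code
    # (500 is an assumed placeholder for "4xx/5xx").
    return results, (200 if any_succeeded else 500)
```

A single-task submission falls out of the same path: one failing task means every task failed, so the caller gets a plain error response.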

The pro of this approach is readability for simple, single-task submissions. The con is that it could be confusing that a 4xx/5xx is sent back if all tasks fail, but a 200 is sent back if only some tasks fail.

Proposed Changes

The /submit route response would change from

{
  'status': 'Success',
  'task_uuids': ['a', 'b'],
  'task_uuid': ""
}

(response code usually 200 unless some or all task launches fail)

to

{
  'response': 'batch',
  'results': [
    {
      'status': 'Success',
      'task_uuid': 'a',
      'http_status_code': 200
    },
    {
      'status': 'Failed',
      'code': 1,
      'task_uuid': 'b',
      'reason': 'human readable reason',
      'http_status_code': 4XX/5XX,
      ...
    },
    ...
  ]
}

(The response code here would be a 207 Multi-Status, since there were some successes and some failures. If everything was a success, it would be a 200. If an internal error occurred that made everything fail, it would be some 5XX.)
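The status-code choice for the batch response could be sketched as follows. The function name is hypothetical, and using 500 when every task fails is an assumption, since the text only says "some 5XX".

```python
from __future__ import annotations

from http import HTTPStatus
from typing import Any


def build_batch_response(
    results: list[dict[str, Any]],
) -> tuple[dict[str, Any], int]:
    """Wrap per-task results and choose the overall HTTP status code."""
    succeeded = sum(1 for r in results if r["status"] == "Success")
    if succeeded == len(results):
        code = HTTPStatus.OK  # everything succeeded
    elif succeeded > 0:
        code = HTTPStatus.MULTI_STATUS  # 207: mixed successes and failures
    else:
        # Assumed placeholder for "some 5XX" when nothing succeeded.
        code = HTTPStatus.INTERNAL_SERVER_ERROR
    return {"response": "batch", "results": results}, int(code)
```

Each entry in `results` keeps its own `http_status_code`, so a client that only looks at the outer code still learns whether it needs to inspect individual tasks.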

When the funcx sdk receives such a batch response, it would store all the failed task submissions in its local table, to be retrieved later with get_result or get_batch_result. These failures would not be saved on the service side.
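On the client side, this could look roughly like the sketch below. The helper name and the plain dictionary used as the "local table" are hypothetical; they are not the funcx sdk's actual internals.

```python
from __future__ import annotations

from typing import Any


def record_failed_submits(
    batch_response: dict[str, Any],
    local_table: dict[str, dict[str, Any]],
) -> list[str]:
    """Store failed entries from a batch response in a local table.

    Returns the task UUIDs that were recorded as failed.
    """
    failed_uuids = []
    for entry in batch_response.get("results", []):
        if entry.get("status") == "Failed":
            # Keep the whole entry so 'reason' and 'code' stay available
            # when the task is looked up later.
            local_table[entry["task_uuid"]] = entry
            failed_uuids.append(entry["task_uuid"])
    return failed_uuids
```

A later lookup for a failed task could then be answered from `local_table` without contacting the service.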

I think similar changes to the /batch_status route would be fitting, though they wouldn't need to be as drastic. Each status response object in the list would be an "http response object" of its own like above.
