
Conversation

@jduust (Contributor) commented Feb 12, 2025

Using the below logic, I've implemented an option to retry per queue element instead of having only a global retry. You can define both max_retries and per-queue-element attempts, so you can customize the process further depending on what the queue type requires.

attempts = 1  # Number of attempts per queue_element (1 is no retry, 2 is 2 total attempts, and so on)

for attempt in range(1, attempts + 1):
    try:
        print(f"Run {attempt}...")
        # process the queue element here (the work that may raise)
        break  # success: stop retrying
    except Exception as e:
        print(f"Attempt {attempt} failed: {e}")
        if attempt < attempts:
            print("Retrying...")
        else:
            print(f"Operation failed after {attempt} attempts.")
            raise

And I've added the following lines to the config (and changed the default email domain to aarhus.dk to avoid the Outlook warning):

# Number of attempts per queue_element (1 is no retry, 2 is 2 total attempts and so on)
QUEUE_ATTEMPTS = 1

@jduust (Contributor, Author) commented Feb 12, 2025

This can be added as a separate function to avoid linting errors, and reset.reset(orchestrator_connection) can be added under if attempt < attempts:.
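A minimal sketch of that separate function, under assumed names: process_fn, queue_element, and the flaky demo handler are hypothetical stand-ins, not identifiers from this repository.

```python
def process_with_retries(process_fn, queue_element, attempts=1):
    """Run process_fn on queue_element, retrying up to `attempts` times.

    attempts=1 means no retry, attempts=2 means two total attempts, etc.
    Re-raises the last exception once attempts are exhausted.
    """
    for attempt in range(1, attempts + 1):
        try:
            print(f"Run {attempt}...")
            return process_fn(queue_element)
        except Exception as e:
            print(f"Attempt {attempt} failed: {e}")
            if attempt < attempts:
                print("Retrying...")
                # reset.reset(orchestrator_connection) would be called here
            else:
                print(f"Operation failed after {attempt} attempts.")
                raise

# demo (hypothetical): a flaky operation that succeeds on its third attempt
calls = {"n": 0}

def flaky(element):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return f"processed {element}"

result = process_with_retries(flaky, "element-1", attempts=3)
```

Keeping the loop in one function also keeps the caller's queue loop flat, which is what avoids the linting complaints about nesting.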

@ghbm-itk (Member) commented Mar 6, 2025

This implementation breaks the existing retry loop because the queue element try block catches all exceptions.
This leads to a queue element being retried without the entire loop being reset.

It also doesn't work if a queue element needs to be retried between runs since the retry counter is in the process instead of in Orchestrator.

@jduust (Contributor, Author) commented Jul 17, 2025

> This implementation breaks the existing retry loop because the queue element try block catches all exceptions. This leads to a queue element being retried without the entire loop being reset.
>
> It also doesn't work if a queue element needs to be retried between runs since the retry counter is in the process instead of in Orchestrator.

That's the idea: when using unstable systems (like Opus), a queue element sometimes just needs a few retries before an error is thrown outside the try block, triggering the existing retry loop. This basically allows the user to have multiple points of failure before hard failing. If queue retry is left at its default of no retries, the old logic still works exactly as before, since the exception is re-raised after the try block once the attempts exceed the set retry counter.

On certain processes we want to retry queue elements rather than have one fail after the first attempt. If an element fails more than the defined queue retry count, the regular error handling is triggered. This gives the user a lot of customization: if one queue element fails, retry it 3 times, then throw an error and move on to the next. That counts as only one error in the process, so one queue element failing 3 times doesn't count as the entire process hard-failing 3 times, since the failure might be specific to that queue element.

Then if, say, 3 queue elements each fail 3 times, the hard fail is triggered and the trigger is disabled (given the user set the max retry count to 3 and fail on too many errors). So we can retry queue elements while still disabling the trigger when something is clearly wrong with the system, rather than with just one or two queue elements.

We've been using this logic since January and it has worked perfectly for us. You don't have to implement it, but I think it's very valuable for people who use queues a lot with unstable systems. Feel free to improve it if you dislike the way it's set up.
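The two-level design described above can be sketched like this. It is only an illustration under assumed names: `run_queue`, `handle`, and `MAX_RETRIES` are hypothetical, and `QUEUE_ATTEMPTS` is shown with a non-default value; each element gets `QUEUE_ATTEMPTS` tries, an element that exhausts them counts as ONE process-level error, and the process only hard-fails once `MAX_RETRIES` errors accumulate.

```python
QUEUE_ATTEMPTS = 3  # attempts per queue element (assumed config value)
MAX_RETRIES = 3     # process-level errors before the trigger is disabled

def run_queue(elements, handle):
    errors = 0
    for element in elements:
        for attempt in range(1, QUEUE_ATTEMPTS + 1):
            try:
                handle(element)
                break  # element succeeded, move on to the next one
            except Exception:
                if attempt < QUEUE_ATTEMPTS:
                    continue  # soft retry of this element only
                errors += 1  # attempts exhausted: ONE process-level error
        if errors >= MAX_RETRIES:
            # system-wide problem, not just one or two elements
            raise RuntimeError(f"Process failed after {errors} failed elements")

# demo (hypothetical handler): elements prefixed "bad" always fail
processed, hard_failed = [], False

def handle(element):
    if element.startswith("bad"):
        raise ValueError("unstable system")
    processed.append(element)

try:
    run_queue(["ok-1", "bad-1", "ok-2", "bad-2", "bad-3", "ok-3"], handle)
except RuntimeError:
    hard_failed = True
```

In this run the first two bad elements each burn their three attempts but only add one error apiece; the third pushes the error count to the cap, so the process hard-fails before "ok-3" is processed.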
