
blackillzone

Description

Add a second EIP when we decide to use an EIP for the manager. This way, when we update the ASG, a free EIP is available for the aws-ec2-assign-elastic-ip tool to pick up.

This fixes the rolling update and allows for a smoother transition, instead of trying to work around the removal of the EIP association and breaking the availability of the runner.

The caveat is that we require two EIPs for this use case, but I think that's acceptable as long as the user is aware of it.
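
For illustration, picking a free EIP at boot boils down to something like this minimal AWS CLI sketch (the Name=runner-eip tag filter is a placeholder, not the tool's actual configuration):

#!/usr/bin/env bash
set -euo pipefail

# Fetch this instance's ID via IMDSv2.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/instance-id")

# Pick the first EIP in the pool that is not associated with any instance.
FREE_ALLOCATION_ID=$(aws ec2 describe-addresses \
  --filters "Name=tag:Name,Values=runner-eip" \
  --query 'Addresses[?AssociationId==null].AllocationId | [0]' \
  --output text)

# With two EIPs in the pool, one is always free during a rolling update.
aws ec2 associate-address --allocation-id "$FREE_ALLOCATION_ID" \
  --instance-id "$INSTANCE_ID"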

Migrations required

No

Verification

Run a deployment with a runner instance using an EIP, then apply a configuration change to it (like changing the instance type, or an AMI update).

runner_instance = {
  use_eip = true
}

The second runner starts correctly and gets an EIP associated.

Alternative

To avoid using two EIP resources, we could add more logic in template/eip.tftpl to check the EIP's availability and do a switchover if it is already associated with an existing running instance. The issue is that the runner would still be unavailable on the GitLab side in the meantime, because it would still be running the rest of the cloud-init script and unable to keep communicating.
I'm open to discussion about this, but I think having two EIPs is more reliable.
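
The switchover would boil down to something like this (a sketch only; the allocation ID is a placeholder):

#!/usr/bin/env bash
set -euo pipefail

ALLOCATION_ID="eipalloc-0123456789abcdef0"  # placeholder pool EIP
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/instance-id")

# --allow-reassociation steals the EIP from the old instance if it is still
# associated. The old runner loses its address while the new one is still
# working through cloud-init, which is exactly the availability gap
# described above.
aws ec2 associate-address --allocation-id "$ALLOCATION_ID" \
  --instance-id "$INSTANCE_ID" --allow-reassociation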

Contributor

github-actions bot commented Aug 7, 2025

Hey @blackillzone! 👋

Thank you for your contribution to the project. Please refer to the contribution rules for a quick overview of the process.

Make sure that this PR clearly explains:

  • the problem being solved
  • the best way a reviewer and you can test your changes

By submitting this PR you confirm that you hold the rights to the code added and agree that it will be published under this LICENSE.

The following ChatOps commands are supported:

  • /help: notifies a maintainer to help you out

Simply add a comment with the command in the first line. If you need to pass more information, separate it from the command with a blank line.

This message was generated automatically. You are welcome to improve it.

@kayman-mk
Collaborator

Zero downtime should be the top priority and this is definitely the easiest solution, but it adds $3.60 per IP and machine to the monthly bill.

What about removing the EIP from Terraform, allocating it during startup, and releasing it during shutdown? To be safe, you would always have to check for dangling EIPs (left over from failures) and release them too.

@blackillzone
Author

blackillzone commented Aug 14, 2025

Seems possible. If I understand correctly, the workflow would be:

  1. We create an EIP and assign it to the instance directly in the user_data script (in template/eip.tftpl).
  2. In the monitor script (/opt/monitor_runner.sh), we add a new function to detach and release the EIP, and call it just before send_complete_lifecycle_action in the monitor_process() function?

This way, each agent manages its own EIP lifecycle, we avoid race conditions, and no EIP sits deployed for nothing. I imagine the two helpers looking roughly like the sketch below.
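
(A sketch only; the function names and the gitlab-runner-eip tag are placeholders, not actual template code.)

#!/usr/bin/env bash
set -euo pipefail

imds() {  # IMDSv2 helper
  local token
  token=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
  curl -s -H "X-aws-ec2-metadata-token: $token" \
    "http://169.254.169.254/latest/meta-data/$1"
}

# Startup (template/eip.tftpl): allocate a tagged EIP and attach it.
allocate_and_associate_eip() {
  local instance_id allocation_id
  instance_id=$(imds instance-id)
  allocation_id=$(aws ec2 allocate-address --domain vpc \
    --tag-specifications \
      'ResourceType=elastic-ip,Tags=[{Key=Name,Value=gitlab-runner-eip}]' \
    --query 'AllocationId' --output text)
  aws ec2 associate-address --allocation-id "$allocation_id" \
    --instance-id "$instance_id"
}

# Shutdown (/opt/monitor_runner.sh): detach and release our own EIP,
# called just before send_complete_lifecycle_action.
release_eip() {
  local instance_id association_id allocation_id
  instance_id=$(imds instance-id)
  read -r association_id allocation_id < <(aws ec2 describe-addresses \
    --filters "Name=instance-id,Values=$instance_id" \
    --query 'Addresses[0].[AssociationId,AllocationId]' --output text)
  aws ec2 disassociate-address --association-id "$association_id"
  aws ec2 release-address --allocation-id "$allocation_id"
}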

For the dangling EIPs, I don't know where we should manage this to do it properly. Maybe in the Lambda function, as it's already used to clean up what's left from a runner agent shutdown?

@blackillzone
Author

I've updated the PR and tested all of this in docker-autoscaler mode. Currently, the runner agent is able to:

At startup phase:

  • Create a new EIP, if not over the limit (default: 10 per region per account)
  • Associate the EIP with itself

At terminate phase:

  • Disassociate its own EIP
  • Release the EIP

Lambda fallback:

  • In case there were any issues in the script at the terminate phase (a network issue, for example), the Lambda will clean up any remaining free EIPs related to the runner agent (based on tag filtering); see the sketch below
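
For illustration, the fallback is equivalent to the following CLI logic (a sketch only; the actual Lambda is not shell, and the gitlab-runner-eip tag filter is the same placeholder used above):

#!/usr/bin/env bash
set -euo pipefail

# Find EIPs tagged for the runner agent that are no longer associated with
# any instance, i.e. left dangling by a failed terminate script...
DANGLING=$(aws ec2 describe-addresses \
  --filters "Name=tag:Name,Values=gitlab-runner-eip" \
  --query 'Addresses[?AssociationId==null].AllocationId' \
  --output text)

# ...and release each of them.
for allocation_id in $DANGLING; do
  aws ec2 release-address --allocation-id "$allocation_id"
done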

As I was testing in autoscaler mode, I've also included a fix for this: #1316

It's ready for review; let me know if you think we can enhance anything in the current workflow for the EIP.

Note: I haven't worked on backward compatibility with the old behavior, but the transition should be easy.
