
blackillzone

Description

Add a second EIP when we decide to use an EIP for the manager. This way, when we update the ASG, a free EIP is available for the aws-ec2-assign-elastic-ip tool to pick up.

This fixes the rolling update and allows for a smoother transition, instead of trying to work around the removal of the EIP association and breaking the availability of the runner.

The caveat is that we require two EIPs for this use case, but I think that's acceptable as long as the user is aware of it.
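
For illustration, picking a free EIP at boot boils down to something like this minimal AWS CLI sketch (the Name=runner-eip tag filter is a placeholder, not the tool's actual configuration):

#!/usr/bin/env bash
set -euo pipefail

# Fetch this instance's ID via IMDSv2.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/instance-id")

# Pick the first EIP in the pool that is not associated with any instance.
FREE_ALLOCATION_ID=$(aws ec2 describe-addresses \
  --filters "Name=tag:Name,Values=runner-eip" \
  --query 'Addresses[?AssociationId==null].AllocationId | [0]' \
  --output text)

# With two EIPs in the pool, one is always free during a rolling update.
aws ec2 associate-address --allocation-id "$FREE_ALLOCATION_ID" \
  --instance-id "$INSTANCE_ID"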

Migrations required

No

Verification

Run a deployment with a runner instance using an EIP, then apply a configuration change to it (like changing the instance type, or an AMI update).

runner_instance = {
  use_eip = true
}

The second runner starts correctly and gets an EIP associated.

Alternative

To avoid using two EIP resources, we could add more logic in template/eip.tftpl to check the EIP's availability and do a switchover if it is already associated with an existing running instance. The issue is that the runner would still be unavailable on the GitLab side in the meantime, because it would still be running the rest of the cloud-init script and unable to keep communicating.
I'm open to discussion about this, but I think having two EIPs is more reliable.
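
The switchover would boil down to something like this (a sketch only; the allocation ID is a placeholder):

#!/usr/bin/env bash
set -euo pipefail

ALLOCATION_ID="eipalloc-0123456789abcdef0"  # placeholder pool EIP
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/instance-id")

# --allow-reassociation steals the EIP from the old instance if it is still
# associated. The old runner loses its address while the new one is still
# working through cloud-init, which is exactly the availability gap
# described above.
aws ec2 associate-address --allocation-id "$ALLOCATION_ID" \
  --instance-id "$INSTANCE_ID" --allow-reassociation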

Contributor

github-actions bot commented Aug 7, 2025

Hey @blackillzone! 👋

Thank you for your contribution to the project. Please refer to the contribution rules for a quick overview of the process.

Make sure that this PR clearly explains:

  • the problem being solved
  • the best way a reviewer and you can test your changes

By submitting this PR you confirm that you hold the rights to the code added and agree that it will be published under this LICENSE.

The following ChatOps commands are supported:

  • /help: notifies a maintainer to help you out

Simply add a comment with the command in the first line. If you need to pass more information, separate it from the command with a blank line.

This message was generated automatically. You are welcome to improve it.

@kayman-mk
Collaborator

Zero downtime should be the top priority and this is definitely the easiest solution, but it adds $3.60 per IP and machine to the monthly bill.

What about removing the EIP from Terraform, allocating it during startup, and releasing it during shutdown? To be safe, you would always have to check for dangling EIPs (left over from failures) and release them too.

@blackillzone
Author

blackillzone commented Aug 14, 2025

Seems possible. If I understand correctly, the workflow would be:

  1. We create an EIP and assign it to the instance directly in the user_data script (in template/eip.tftpl).
  2. In the monitor script (/opt/monitor_runner.sh), we add a new function to detach and release the EIP, and call it just before send_complete_lifecycle_action in the monitor_process() function?

This way, each agent manages its own EIP lifecycle, we avoid race conditions, and no EIP sits deployed for nothing. I imagine the two helpers looking roughly like the sketch below.
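
(A sketch only; the function names and the gitlab-runner-eip tag are placeholders, not actual template code.)

#!/usr/bin/env bash
set -euo pipefail

imds() {  # IMDSv2 helper
  local token
  token=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
  curl -s -H "X-aws-ec2-metadata-token: $token" \
    "http://169.254.169.254/latest/meta-data/$1"
}

# Startup (template/eip.tftpl): allocate a tagged EIP and attach it.
allocate_and_associate_eip() {
  local instance_id allocation_id
  instance_id=$(imds instance-id)
  allocation_id=$(aws ec2 allocate-address --domain vpc \
    --tag-specifications \
      'ResourceType=elastic-ip,Tags=[{Key=Name,Value=gitlab-runner-eip}]' \
    --query 'AllocationId' --output text)
  aws ec2 associate-address --allocation-id "$allocation_id" \
    --instance-id "$instance_id"
}

# Shutdown (/opt/monitor_runner.sh): detach and release our own EIP,
# called just before send_complete_lifecycle_action.
release_eip() {
  local instance_id association_id allocation_id
  instance_id=$(imds instance-id)
  read -r association_id allocation_id < <(aws ec2 describe-addresses \
    --filters "Name=instance-id,Values=$instance_id" \
    --query 'Addresses[0].[AssociationId,AllocationId]' --output text)
  aws ec2 disassociate-address --association-id "$association_id"
  aws ec2 release-address --allocation-id "$allocation_id"
}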

For the dangling EIPs, I don't know where we should manage this to do it properly. Maybe in the Lambda function, as it's already used to clean up what's left from a runner agent shutdown?

@blackillzone
Author

I've updated the PR and tested all of this in docker-autoscaler mode. Currently, the runner agent is able to:

At startup phase:

  • Create a new EIP, if not over the limit (default: 10 per region per account)
  • Associate the EIP with itself

At terminate phase:

  • Disassociate its own EIP
  • Release the EIP

Lambda fallback:

  • In case there were any issues in the script at the terminate phase (a network issue, for example), the Lambda will clean up any remaining free EIPs related to the runner agent (based on tag filtering); see the sketch below
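
For illustration, the fallback is equivalent to the following CLI logic (a sketch only; the actual Lambda is not shell, and the gitlab-runner-eip tag filter is the same placeholder used above):

#!/usr/bin/env bash
set -euo pipefail

# Find EIPs tagged for the runner agent that are no longer associated with
# any instance, i.e. left dangling by a failed terminate script...
DANGLING=$(aws ec2 describe-addresses \
  --filters "Name=tag:Name,Values=gitlab-runner-eip" \
  --query 'Addresses[?AssociationId==null].AllocationId' \
  --output text)

# ...and release each of them.
for allocation_id in $DANGLING; do
  aws ec2 release-address --allocation-id "$allocation_id"
done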

As I was testing in autoscaler mode, I've also included a fix for this: #1316

It's ready for review; let me know if you think we can enhance anything in the current workflow for the EIP.

Note: I haven't worked on backward compatibility with the old behavior, but the transition should be easy.
