Description
Describe the bug
When we stop a runner agent deployed with docker-autoscaler by updating the agent configuration, the instance is slow to be released and stopped: it always waits until the lifecycle hook timeout is reached.
Looking at the logs, the issue seems to be that the instance role does not have the autoscaling:DescribeLifecycleHooks permission in its IAM policy.
Aug 11 19:52:40 ip-172-21-252-42 monitor_runner.sh[31313]: An error occurred (AccessDenied) when calling the DescribeLifecycleHooks operation: User: arn:aws:sts::[ACCOUN_ID]:assumed-role/runners-fleet-test-instance/i-00a91317465c7587a is not authorized to perform: autoscaling:DescribeLifecycleHooks because no identity-based policy allows the autoscaling:DescribeLifecycleHooks action
To Reproduce
Deploy a runner agent in docker-autoscaler mode, then update the agent configuration to trigger a deployment on the ASG. Check the logs in CloudWatch, or connect to the agent while it is terminating and check the logs of the monitor-runner service.
Expected behavior
The role associated with the instance deployed by the ASG should have permission to check and update the lifecycle hook status on the ASG.
Additional context
From what I see, in docker+machine mode we have these IAM rights configured here:
"autoscaling:CompleteLifecycleAction",
But we don't have them here, for autoscaler mode: https://github.com/cattle-ops/terraform-aws-gitlab-runner/blob/main/policies/instance-docker-autoscaler-policy.json
If this seems correct, I can open an MR to fix it, but I would first like to test this assumption on my side to validate the fix.
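As a rough, untested sketch of what the fix might look like, the autoscaler policy could carry a statement along these lines. The two action names come from the error log and from the docker+machine policy mentioned above; the Sid value is hypothetical, the "Resource": "*" value is a placeholder assumption, and the real policy file may be structured differently.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AssumedLifecycleHookAccess",
      "Effect": "Allow",
      "Action": [
        "autoscaling:CompleteLifecycleAction",
        "autoscaling:DescribeLifecycleHooks"
      ],
      "Resource": "*"
    }
  ]
}
```

If this is the right direction, the scope for autoscaling:CompleteLifecycleAction could probably be narrowed to the runner ASG instead of a wildcard, but that would need to be checked against how the module builds its policies.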