Nomad server version
We have confirmed this behaviour on both of these Nomad Server versions
Operating system and Environment details
Linux 5.14.0-570.81.1.el9_6.x86_64
Issue
Nomad kills healthy allocations when a job is scaled down while a Vault outage is occurring.
Reproduction steps
- Deploy a job named apigw with count = 5. The job has the following relevant stanzas:
group "apigw" {
count = 5
# (...)
update {
max_parallel = 1
auto_revert = false
auto_promote = false
canary = 1
min_healthy_time = "30s"
healthy_deadline = "2m"
progress_deadline = "10m"
}
# (...)
restart {
attempts = 3
interval = "5m"
delay = "15s"
mode = "delay"
}
# (...)
task "apigw" {
# (...)
template {
data = <<EOH
{{with secret "test" }}{{.Data.data.test}}{{end}}
EOH
destination = "${NOMAD_SECRETS_DIR}/apigw.pem"
change_mode = "noop"
}
# (...)
}
# (...)
}
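For completeness, the deploy step (a sketch, assuming the full spec above is saved as apigw.nomad.hcl and the job lives in the admin namespace shown later) would look something like:
nomad job run -namespace admin apigw.nomad.hcl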
- A Vault outage occurs, making it inaccessible to Nomad
❯ vault operator raft list-peers
Error reading the raft cluster configuration: Error making API request.
URL: GET https://vault.test.server/v1/sys/storage/raft/configuration
Code: 500. Errors:
* local node not active but active cluster node not found
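For reference, the outage can also be confirmed via Vault's health endpoint (a sketch, assuming VAULT_ADDR points at the affected cluster); any status other than 200 means the queried node is not an unsealed active node:
curl -s -o /dev/null -w "%{http_code}\n" "$VAULT_ADDR/v1/sys/health"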
- The apigw job is scaled up (e.g., via Nomad Autoscaler) from 5 to 7, and then to 8 allocations. This behavior can be reproduced manually by sending scaling events to the API
cat > scale.json << EOF
{
"Count": 8,
"Message": "manual up",
"Target": {
"Group": "apigw"
}
}
EOF
curl -X POST -d @scale.json $NOMAD_ADDR/v1/job/apigw/scale
- New allocations fail to start because Vault is inaccessible and the task template can't fetch the secret "test"
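At this point the job has 5 running and 3 failed allocations. One way to confirm the allocation states (a sketch, assuming jq is available and the namespace is passed as a query parameter) is to count client statuses via the HTTP API:
curl -s "$NOMAD_ADDR/v1/job/apigw/allocations?namespace=admin" | jq '[.[].ClientStatus] | group_by(.) | map({(.[0]): length}) | add'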
- The job is then scaled down (e.g., via Nomad Autoscaler) from 8 to 6 allocations
cat > scale.json << EOF
{
"Count": 6,
"Message": "manual down",
"Target": {
"Group": "apigw"
}
}
EOF
curl -X POST -d @scale.json $NOMAD_ADDR/v1/job/apigw/scale
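To observe which allocations the scheduler stops during the scale-down, the job's allocation list can be inspected afterwards (a sketch, using the namespace from the status output below):
nomad job allocs -namespace admin apigw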
Expected Result
In this scenario, we have 5 healthy allocations and 3 failed ones. The expected behavior during a scale-down from 8 to 6 is for Nomad to terminate 2 of the 3 failed allocations, resulting in a deployment with 5 out of 6 allocations running healthy.
Actual Result
Instead, Nomad incorrectly prioritizes the allocations to terminate and kills 2 of the 5 healthy ones. This results in a deployment running with only 3 out of 6 healthy allocations. If this behavior is triggered repeatedly by the autoscaler over a few hours, the job eventually reaches 0 healthy allocations.
More context
There is also a discrepancy in the nomad job status output: the overall job summary reports 6 running allocations, but the latest deployment status reports a desired count of only 3. This appears to be incorrect and may be related to the scale-down issue.
❯ nomad job status -namespace admin apigw
ID = apigw
Name = apigw
Submit Date = 2026-05-06T11:31:20+02:00
Type = service
Priority = 90
Datacenters = *
Namespace = admin
Node Pool = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
apigw       0       0         6        113     35        1     0
Latest Deployment
ID = 588d3cf1
Status = successful
Description = Deployment completed successfully
Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
apigw       true         3        3       3        0          2026-05-06T09:36:50Z
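The same numbers can be pulled from the HTTP API; a sketch (assuming jq is available) comparing the job summary's running count with the latest deployment's desired total:
curl -s "$NOMAD_ADDR/v1/job/apigw/summary?namespace=admin" | jq '.Summary.apigw.Running'
curl -s "$NOMAD_ADDR/v1/job/apigw/deployment?namespace=admin" | jq '.TaskGroups.apigw.DesiredTotal'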