
Nomad kills healthy allocations when a job is scaled down during a Vault outage #27906

@GBT55

Description


Nomad server version

We have confirmed this behaviour on both of these Nomad server versions:

  • v1.11.3
  • v1.9.7

Operating system and Environment details

Linux 5.14.0-570.81.1.el9_6.x86_64

Issue

Nomad kills healthy allocations when a job is scaled down while a Vault outage is occurring.

Reproduction steps

  1. Deploy a job named apigw with count = 5. This job has the following relevant stanzas:
group "apigw" {
  count = 5

  # (...)

  update {
    max_parallel      = 1
    auto_revert       = false
    auto_promote      = false
    canary            = 1
    min_healthy_time  = "30s"
    healthy_deadline  = "2m"
    progress_deadline = "10m"
  }

  # (...)

  restart {
    attempts = 3
    interval = "5m"
    delay    = "15s"
    mode     = "delay"
  }

  # (...)

  task "apigw" {
    # (...)
    
    template {
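      # Renders a Vault secret into the task's secrets directory. While
      # Vault is unreachable this render fails, so a newly placed
      # allocation cannot start (see step 4 below).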
      data = <<EOH
        {{with secret "test" }}{{.Data.data.test}}{{end}}
      EOH

      destination = "${NOMAD_SECRETS_DIR}/apigw.pem"
      change_mode = "noop"
    }
    
    # (...)
  }
  
  # (...)
}
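For reference, a minimal way to deploy this job, assuming the spec above is saved as apigw.nomad.hcl and the job lives in the admin namespace shown later in this report:

nomad job run -namespace admin apigw.nomad.hcl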
  2. A Vault outage occurs, making it inaccessible to Nomad:
❯ vault operator raft list-peers
Error reading the raft cluster configuration: Error making API request.

URL: GET https://vault.test.server/v1/sys/storage/raft/configuration
Code: 500. Errors:

* local node not active but active cluster node not found
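One way to reproduce this state in a test environment is to stop the Vault service on enough nodes to break Raft quorum, which yields the error above; this sketch assumes a systemd-managed Vault:

# run on a majority of the Vault server nodes
sudo systemctl stop vault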
  3. The apigw job is scaled up (e.g., via Nomad Autoscaler) from 5 to 7, and then to 8 allocations. This behavior can be reproduced manually by sending scaling events to the API:
cat > scale.json << EOF
{
  "Count": 8,
  "Message": "manual up",
  "Target": {
    "Group": "apigw"
  }
}
EOF

curl -X POST -d @scale.json $NOMAD_ADDR/v1/job/apigw/scale
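The resulting placements can be observed with the usual status command:

nomad job status -namespace admin apigw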
  4. New allocations fail to start because Vault is inaccessible and the task cannot fetch the secret test
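The template failure is visible in the task events of each new allocation (alloc-id here is a placeholder taken from the job status output):

nomad alloc status <alloc-id>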
  5. The job is then scaled down (e.g., via Nomad Autoscaler) from 8 to 6 allocations:
cat > scale.json << EOF
{
  "Count": 6,
  "Message": "manual down",
  "Target": {
    "Group": "apigw"
  }
}
EOF

curl -X POST -d @scale.json $NOMAD_ADDR/v1/job/apigw/scale
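Which allocations Nomad actually chose to stop can then be confirmed through the allocations endpoint (jq is used here only for readability and is an assumption):

curl $NOMAD_ADDR/v1/job/apigw/allocations | jq '.[] | {ID, ClientStatus, DesiredStatus}'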

Expected Result

In this scenario, we have 5 healthy allocations and 3 failed ones. The expected behavior during a scale-down from 8 to 6 is for Nomad to terminate 2 of the 3 failed allocations, resulting in a deployment with 5 out of 6 allocations running healthy.

Actual Result

Instead, Nomad incorrectly prioritizes the allocations to terminate and kills 2 of the 5 healthy ones. This results in a deployment running with only 3 out of 6 healthy allocations. If this behavior is triggered repeatedly by the autoscaler over a few hours, the job eventually reaches 0 healthy allocations.


More context

There is also a discrepancy in the nomad job status output. The overall job summary reports 6 running allocations, but the latest deployment status reports a desired count of only 3. This appears to be incorrect and may be related to the scale-down issue.

❯ nomad job status -namespace admin apigw
ID            = apigw
Name          = apigw
Submit Date   = 2026-05-06T11:31:20+02:00
Type          = service
Priority      = 90
Datacenters   = *
Namespace     = admin
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
apigw       0       0         6        113     35        1     0

Latest Deployment
ID          = 588d3cf1
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
apigw       true         3        3       3        0          2026-05-06T09:36:50Z
