
[BUG] LoginNodes NLB scheme incorrectly set to internal on public subnet #7173

@almightychang

Required Info

  • AWS ParallelCluster version: 3.14.0
  • Cluster name: pcluster-prod
  • Arn of the cluster CloudFormation main stack: arn:aws:cloudformation:us-east-2:XXXXXXXXXXXX:stack/pcluster-prod/77c91230-cfe2-11f0-920e-0699cccac491

Full cluster configuration (sanitized)

Region: us-east-2
Imds:
  ImdsSupport: v2.0
Image:
  Os: ubuntu2204
HeadNode:
  InstanceType: m5.xlarge
  Networking:
    SubnetId: subnet-XXXXXXXXX  # Public subnet (uses Main route table with IGW)
    AdditionalSecurityGroups:
      - sg-XXXXXXXXX
  Ssh:
    KeyName: my-key

LoginNodes:
  Pools:
    - Name: login
      Count: 2
      InstanceType: m5.xlarge
      GracetimePeriod: 120
      Networking:
        SubnetIds:
          - subnet-XXXXXXXXX  # Same public subnet as HeadNode
        AdditionalSecurityGroups:
          - sg-XXXXXXXXX

Scheduling:
  Scheduler: slurm
  # ... (queues omitted for brevity)

Output of pcluster describe-cluster

{
  "version": "3.14.0",
  "clusterName": "pcluster-prod",
  "clusterStatus": "UPDATE_COMPLETE",
  "region": "us-east-2",
  "loginNodes": [
    {
      "status": "active",
      "poolName": "login",
      "address": "pclust-pclus-XXXXXXXXX.elb.us-east-2.amazonaws.com",
      "scheme": "internal",   // <-- BUG: Should be "internet-facing"
      "healthyNodes": 2,
      "unhealthyNodes": 0
    }
  ]
}

Bug description and how to reproduce

Description

When configuring LoginNodes with a public subnet, the Network Load Balancer (NLB) is incorrectly created with scheme: internal instead of scheme: internet-facing.

According to the documentation:

Login nodes are provisioned with a single connection address to the network load balancer configured for the pool of login nodes. The connectivity settings of the address are based on the type of subnet specified in the Login nodes Pool configuration.

If the subnet is public, the address will be public

Root Cause

The bug is in cli/src/pcluster/aws/ec2.py, in the is_subnet_public() function:

def is_subnet_public(self, subnet_id):
    route_tables = self.describe_route_tables(filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}])

    if not route_tables:
        # Falls back to VPC route tables
        subnets = self.describe_subnets([subnet_id])
        vpc_id = subnets[0].get("VpcId")
        route_tables = self.describe_route_tables(filters=[{"Name": "vpc-id", "Values": [vpc_id]}])

    # BUG: Only checks route_tables[0], not the Main route table!
    for route in route_tables[0].get("Routes", []):
        if "GatewayId" in route and route["GatewayId"].startswith("igw-"):
            return True

    return False

The Problem:

When a subnet has no explicit route table association (uses VPC's Main route table):

  1. Code correctly falls back to fetching all VPC route tables
  2. Bug: Uses route_tables[0] which may NOT be the Main route table
  3. If route_tables[0] is a private route table (no IGW), the function incorrectly returns False (see the diagnostic sketch below)
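
To make the ordering issue concrete, here is a small boto3 diagnostic sketch (not part of ParallelCluster; the VPC ID is a placeholder) that lists a VPC's route tables the same way the fallback query does and flags which one is the Main table and which ones carry an IGW route:

# Sketch: enumerate a VPC's route tables the way the fallback query does.
# DescribeRouteTables gives no ordering guarantee, so route_tables[0] can be
# a private table even when the Main table has an IGW route.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")
vpc_id = "vpc-XXXXXXXXX"  # placeholder

route_tables = ec2.describe_route_tables(
    Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
)["RouteTables"]

for rt in route_tables:
    is_main = any(assoc.get("Main") for assoc in rt.get("Associations", []))
    has_igw = any(route.get("GatewayId", "").startswith("igw-") for route in rt.get("Routes", []))
    print(rt["RouteTableId"], "Main" if is_main else "non-Main", "IGW" if has_igw else "no IGW")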

Steps to Reproduce

  1. Create a VPC with multiple route tables:

    • Main route table with IGW route (0.0.0.0/0 → igw-xxx)
    • Other route tables (some with IGW, some without)
  2. Create a subnet without explicit route table association

    • This subnet will use the Main route table (AWS default behavior)
    • Subnet is therefore public (has IGW route via Main route table)
  3. Configure ParallelCluster LoginNodes with this subnet

  4. Deploy cluster: pcluster update-cluster ...

  5. Check LoginNodes: pcluster describe-cluster ...

  6. Observe: "scheme": "internal" instead of "scheme": "internet-facing" (a verification sketch follows)
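
One way to confirm steps 5-6 programmatically is the boto3 sketch below (not an official check; the subnet ID is a placeholder and the NLB is located only by its type):

# Sketch: check that the subnet has no explicit route table association and
# inspect the scheme of the network load balancers in the region.
import boto3

region = "us-east-2"
subnet_id = "subnet-XXXXXXXXX"  # placeholder

ec2 = boto3.client("ec2", region_name=region)
explicit = ec2.describe_route_tables(
    Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
)["RouteTables"]
print("explicit association" if explicit else "implicit (uses the Main route table)")

elbv2 = boto3.client("elbv2", region_name=region)
for lb in elbv2.describe_load_balancers()["LoadBalancers"]:
    if lb["Type"] == "network":
        print(lb["LoadBalancerName"], lb["Scheme"])  # "internal" when the bug triggers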

VPC Configuration in My Case

| Route Table | Main | Has IGW | Explicit Subnet Association |
| --- | --- | --- | --- |
| rtb-019824176f569d6f8 | No | No | subnet-0d2cbac69b733afcb |
| rtb-04df41e2c13860723 | No | No (NAT Gateway) | |
| rtb-0325b0f7a7505ff6c | Yes | Yes | None (default for unassociated subnets) |

The LoginNodes subnet (subnet-07a011e0c2c87c666) has no explicit association, so it uses the Main route table which has IGW → public subnet.

But is_subnet_public() checks route_tables[0] which happens to be a non-Main route table without IGW → incorrectly returns False → NLB created as internal.

Proposed Fix

def is_subnet_public(self, subnet_id):
    route_tables = self.describe_route_tables(filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}])

    if not route_tables:
        subnets = self.describe_subnets([subnet_id])
        if not subnets:
            raise Exception(f"No subnet found with ID {subnet_id}")
        vpc_id = subnets[0].get("VpcId")

        route_tables = self.describe_route_tables(filters=[{"Name": "vpc-id", "Values": [vpc_id]}])
        if not route_tables:
            raise Exception("No route tables found.")

        # FIX: Find the Main route table instead of using route_tables[0]
        main_route_table = next(
            (rt for rt in route_tables
             if any(assoc.get("Main") for assoc in rt.get("Associations", []))),
            route_tables[0]  # fallback
        )
        route_tables = [main_route_table]

    for route in route_tables[0].get("Routes", []):
        if "GatewayId" in route and route["GatewayId"].startswith("igw-"):
            return True

    return False
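
A self-contained sketch of the core of the fix, exercised against canned route-table data, shows why selecting the Main table changes the outcome (the IDs and helper names here are illustrative, not from the ParallelCluster code base):

# Sketch: the Main-table selection from the proposed fix, run against a VPC
# whose first listed route table is private while the Main table is public.
def pick_route_table(route_tables):
    """Prefer the Main route table; fall back to the first one."""
    return next(
        (rt for rt in route_tables if any(a.get("Main") for a in rt.get("Associations", []))),
        route_tables[0],
    )

def has_igw_route(route_table):
    return any(r.get("GatewayId", "").startswith("igw-") for r in route_table.get("Routes", []))

vpc_route_tables = [
    {"RouteTableId": "rtb-private", "Associations": [], "Routes": [{"GatewayId": "local"}]},
    {"RouteTableId": "rtb-main", "Associations": [{"Main": True}], "Routes": [{"GatewayId": "igw-0abc"}]},
]

assert has_igw_route(pick_route_table(vpc_route_tables))  # fixed logic: subnet detected as public
assert not has_igw_route(vpc_route_tables[0])             # current logic: route_tables[0] looks private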

Workaround

Explicitly associate the subnet with a route table that has an IGW:

# rtb-XXXXXXXXX below is the route table with the IGW route
aws ec2 associate-route-table \
  --subnet-id subnet-XXXXXXXXX \
  --route-table-id rtb-XXXXXXXXX \
  --region us-east-2

Then redeploy LoginNodes.
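
If the workaround needs to be scripted, a boto3 equivalent of the CLI call above looks like this (a sketch; both IDs are placeholders):

# Sketch: explicitly associate the LoginNodes subnet with the IGW route table.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")
response = ec2.associate_route_table(
    SubnetId="subnet-XXXXXXXXX",    # LoginNodes subnet
    RouteTableId="rtb-XXXXXXXXX",   # route table with the IGW route
)
print(response["AssociationId"])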

Workaround Verified (2025-12-23)

aws ec2 associate-route-table \
  --subnet-id subnet-07a011e0c2c87c666 \
  --route-table-id rtb-0325b0f7a7505ff6c \
  --region us-east-2
# AssociationId: rtbassoc-0b2a1b8c06a6c00ff

Result: "scheme": "internet-facing"

Impact

  • Users cannot SSH to the LoginNodes from outside the VPC
  • Defeats the purpose of having public-facing LoginNodes
  • Forces the workaround of using the HeadNode as a bastion (SSH ProxyJump)

Additional context

  • The HeadNode in the same subnet gets a public IP and is accessible from the internet
  • Only the LoginNodes NLB is affected by this bug
  • The subnet has MapPublicIpOnLaunch: true (a quick check is sketched below)
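
Both points can be confirmed quickly with boto3 (a sketch; the subnet ID is a placeholder). MapPublicIpOnLaunch only governs public IPs for instances launched in the subnet, while the NLB scheme follows the route-table check described above:

# Sketch: show that the subnet auto-assigns public IPs (why the HeadNode is
# reachable) even though the route-table lookup drives the NLB scheme.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")
subnet = ec2.describe_subnets(SubnetIds=["subnet-XXXXXXXXX"])["Subnets"][0]
print("MapPublicIpOnLaunch:", subnet["MapPublicIpOnLaunch"])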
