Description
Required Info
- AWS ParallelCluster version: 3.14.0
- Cluster name: pcluster-prod
- Arn of the cluster CloudFormation main stack: arn:aws:cloudformation:us-east-2:XXXXXXXXXXXX:stack/pcluster-prod/77c91230-cfe2-11f0-920e-0699cccac491
Full cluster configuration (sanitized)
```yaml
Region: us-east-2
Imds:
  ImdsSupport: v2.0
Image:
  Os: ubuntu2204
HeadNode:
  InstanceType: m5.xlarge
  Networking:
    SubnetId: subnet-XXXXXXXXX  # Public subnet (uses Main route table with IGW)
    AdditionalSecurityGroups:
      - sg-XXXXXXXXX
  Ssh:
    KeyName: my-key
LoginNodes:
  Pools:
    - Name: login
      Count: 2
      InstanceType: m5.xlarge
      GracetimePeriod: 120
      Networking:
        SubnetIds:
          - subnet-XXXXXXXXX  # Same public subnet as HeadNode
        AdditionalSecurityGroups:
          - sg-XXXXXXXXX
Scheduling:
  Scheduler: slurm
  # ... (queues omitted for brevity)
```

Output of pcluster describe-cluster
```json
{
  "version": "3.14.0",
  "clusterName": "pcluster-prod",
  "clusterStatus": "UPDATE_COMPLETE",
  "region": "us-east-2",
  "loginNodes": [
    {
      "status": "active",
      "poolName": "login",
      "address": "pclust-pclus-XXXXXXXXX.elb.us-east-2.amazonaws.com",
      "scheme": "internal",   // <-- BUG: Should be "internet-facing"
      "healthyNodes": 2,
      "unhealthyNodes": 0
    }
  ]
}
```

Bug description and how to reproduce
Description
When configuring LoginNodes with a public subnet, the Network Load Balancer (NLB) is incorrectly created with scheme: internal instead of scheme: internet-facing.
According to the documentation:
> Login nodes are provisioned with a single connection address to the network load balancer configured for the pool of login nodes. The connectivity settings of the address are based on the type of subnet specified in the Login nodes Pool configuration. If the subnet is public, the address will be public.
Root Cause
The bug is in the `is_subnet_public()` function in cli/src/pcluster/aws/ec2.py:

```python
def is_subnet_public(self, subnet_id):
    route_tables = self.describe_route_tables(filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}])
    if not route_tables:
        # Falls back to the VPC's route tables when the subnet has no explicit association
        subnets = self.describe_subnets([subnet_id])
        vpc_id = subnets[0].get("VpcId")
        route_tables = self.describe_route_tables(filters=[{"Name": "vpc-id", "Values": [vpc_id]}])
    # BUG: Only checks route_tables[0], which is not necessarily the Main route table!
    for route in route_tables[0].get("Routes", []):
        if "GatewayId" in route and route["GatewayId"].startswith("igw-"):
            return True
    return False
```

The Problem:
When a subnet has no explicit route table association (i.e. it uses the VPC's Main route table):
- The code correctly falls back to fetching all route tables of the VPC
- Bug: it then inspects `route_tables[0]`, which may NOT be the Main route table
- If `route_tables[0]` is a private route table (no IGW route), the function incorrectly returns `False`
Steps to Reproduce
1. Create a VPC with multiple route tables (a boto3 sketch of this layout follows the list):
   - A Main route table with an IGW route (`0.0.0.0/0 → igw-xxx`)
   - Other route tables (some with an IGW route, some without)
2. Create a subnet without an explicit route table association
   - The subnet will use the Main route table (AWS default behavior)
   - The subnet is therefore public (it has an IGW route via the Main route table)
3. Configure ParallelCluster LoginNodes with this subnet
4. Deploy the cluster: `pcluster update-cluster ...`
5. Check the LoginNodes: `pcluster describe-cluster ...`
6. Observe: `"scheme": "internal"` instead of `"scheme": "internet-facing"`
VPC Configuration in My Case
| Route Table | Main | Has IGW | Explicit Subnet Association |
|---|---|---|---|
| rtb-019824176f569d6f8 | No | ❌ | subnet-0d2cbac69b733afcb |
| rtb-04df41e2c13860723 | No | ✅ | (NAT Gateway) |
| rtb-0325b0f7a7505ff6c | Yes | ✅ | None (default for unassociated subnets) |
The LoginNodes subnet (subnet-07a011e0c2c87c666) has no explicit association, so it uses the Main route table, which has an IGW route → public subnet.
But `is_subnet_public()` checks `route_tables[0]`, which happens to be a non-Main route table without an IGW route → it incorrectly returns `False` → the NLB is created as `internal`.
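To double-check which route table actually applies to a subnet (an explicit association if present, otherwise the VPC's Main route table) and whether it has an IGW route, a small standalone boto3 script along these lines can be used. This is a sketch of the intended lookup, not pcluster code:

```python
import boto3

def effective_route_table(ec2, subnet_id):
    """Return the route table that actually applies to subnet_id."""
    # An explicit association wins if present.
    explicit = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )["RouteTables"]
    if explicit:
        return explicit[0]
    # Otherwise the subnet implicitly uses the VPC's Main route table.
    vpc_id = ec2.describe_subnets(SubnetIds=[subnet_id])["Subnets"][0]["VpcId"]
    return ec2.describe_route_tables(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            {"Name": "association.main", "Values": ["true"]},
        ]
    )["RouteTables"][0]

ec2 = boto3.client("ec2", region_name="us-east-2")
rt = effective_route_table(ec2, "subnet-07a011e0c2c87c666")
has_igw = any(r.get("GatewayId", "").startswith("igw-") for r in rt.get("Routes", []))
print(rt["RouteTableId"], "public" if has_igw else "private")
```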
Proposed Fix
```python
def is_subnet_public(self, subnet_id):
    route_tables = self.describe_route_tables(filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}])
    if not route_tables:
        subnets = self.describe_subnets([subnet_id])
        if not subnets:
            raise Exception(f"No subnet found with ID {subnet_id}")
        vpc_id = subnets[0].get("VpcId")
        route_tables = self.describe_route_tables(filters=[{"Name": "vpc-id", "Values": [vpc_id]}])
        if not route_tables:
            raise Exception("No route tables found.")
        # FIX: Find the Main route table instead of blindly using route_tables[0]
        main_route_table = next(
            (rt for rt in route_tables if any(assoc.get("Main") for assoc in rt.get("Associations", []))),
            route_tables[0],  # fallback if no route table is flagged as Main
        )
        route_tables = [main_route_table]
    for route in route_tables[0].get("Routes", []):
        if "GatewayId" in route and route["GatewayId"].startswith("igw-"):
            return True
    return False
```
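As a quick sanity check of the selection logic, the `next(...)` expression picks the Main route table even when a private route table happens to come first in the API response, and falls back to `route_tables[0]` if no table is flagged as Main. This is a standalone sketch with fabricated route table dicts, not part of pcluster's test suite:

```python
# Fabricated DescribeRouteTables-style entries, illustrative only.
route_tables = [
    {"RouteTableId": "rtb-private", "Associations": [{"Main": False}],
     "Routes": [{"NatGatewayId": "nat-123"}]},
    {"RouteTableId": "rtb-main", "Associations": [{"Main": True}],
     "Routes": [{"GatewayId": "igw-123", "DestinationCidrBlock": "0.0.0.0/0"}]},
]

main_route_table = next(
    (rt for rt in route_tables if any(assoc.get("Main") for assoc in rt.get("Associations", []))),
    route_tables[0],
)
assert main_route_table["RouteTableId"] == "rtb-main"

# The IGW check now sees the Main route table's routes and classifies the subnet as public.
assert any(r.get("GatewayId", "").startswith("igw-") for r in main_route_table["Routes"])
```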
Workaround
Explicitly associate the subnet with a route table that has an IGW route:

```bash
# rtb-XXXXXXXXX must be a route table with an IGW route
aws ec2 associate-route-table \
  --subnet-id subnet-XXXXXXXXX \
  --route-table-id rtb-XXXXXXXXX \
  --region us-east-2
```

Then redeploy the LoginNodes.
Workaround Verified (2025-12-23)
```bash
aws ec2 associate-route-table \
  --subnet-id subnet-07a011e0c2c87c666 \
  --route-table-id rtb-0325b0f7a7505ff6c \
  --region us-east-2
# AssociationId: rtbassoc-0b2a1b8c06a6c00ff
```

Result: `"scheme": "internet-facing"` ✅
Impact
- Users cannot SSH to the LoginNodes from outside the VPC
- Defeats the purpose of having public-facing LoginNodes
- Forces a workaround of using the HeadNode as a bastion (ProxyJump)
Additional context
- The HeadNode in the same subnet gets a public IP and is accessible from the internet
- Only the LoginNodes NLB is affected by this bug
- The subnet has `MapPublicIpOnLaunch: true` (a quick check is sketched below)
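For completeness, a one-off boto3 check of the subnet attribute mentioned above (a sketch; the subnet ID is the one from this report):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")
subnet = ec2.describe_subnets(SubnetIds=["subnet-07a011e0c2c87c666"])["Subnets"][0]
print(subnet["MapPublicIpOnLaunch"])  # expected: True
```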