Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,9 @@ This file is used to list changes made in each version of the AWS ParallelCluste
3.15.0
------

**CHANGES**
- Replace cfn-hup in compute nodes with systemd timers to support in place updates.
Copy link
Contributor

@gmarciani gmarciani Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must surface the value of this change for the user, which is providing better performance.
Also we should surface that the new mechanism relies on shared storage sync between head node and compute fleet.


3.14.1
------

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
[Unit]
Description=Check file modification time every minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description must be more specific: it must be more clear that this is a timer configured by pcluster (add this info in the description and in the file name) and the goal of the timer (checks file modifications, which files, why?)


[Timer]
AccuracySec=1s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The accuracy has an impact on CPU wake-ups.
The finer the accuracy, the higher the chances to generate load on the CPU.
Since we are using a 60sec activation, what about using a 20s accuracy?

See https://www.freedesktop.org/software/systemd/man/latest/systemd.timer.html

OnActiveSec=60sec
OnUnitActiveSec=60sec
Unit=check-update.service

[Install]
WantedBy=timers.target
Original file line number Diff line number Diff line change
Expand Up @@ -12,3 +12,7 @@
# limitations under the License.

include_recipe 'aws-parallelcluster-computefleet::fleet_status'

if ['ComputeFleet'].include?(node['cluster']['node_type'])
include_recipe 'aws-parallelcluster-computefleet::config_check_update_systemd_service'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

end
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# frozen_string_literal: true

#
# Cookbook:: aws-parallelcluster-slurm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong cookbook. Curreently, the recipe i sin cookbook aws-parallelcluster-computefleet.

Also, what about moving this recipe to cookbook aws-parallelcluster-platform so that we are already in the direction of applying the same approach to login nodes?

# Recipe:: config_compute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix recipe name

#
# Copyright:: 2013-2021 Amazon.com, Inc. or its affiliates. All Rights Reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and in other files: fix copyright year with Copyright:: 2025 Amazon.com

#
# Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the
# License. A copy of the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
# limitations under the License.

template '/etc/systemd/system/check-update.service' do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, above and below: the name of the service and the corresponding timer must be more talkative, expressing that they are pcluster services related to the cluster updates

source 'check_update/check-update.service.erb'
owner 'root'
group 'root'
mode '0644'
end

cookbook_file '/etc/systemd/system/check-update.timer' do
source 'check_update/check-update.timer'
owner 'root'
group 'root'
mode '0644'
action :create
end

file node['cluster']['shared_update_path'] do
content ''
owner 'root'
group 'root'
mode '0644'
action :create_if_missing
end

file node['cluster']['update_checkpoint'] do
content ''
owner 'root'
group 'root'
mode '0644'
action :create_if_missing
end

service 'check-update.timer' do
action [:enable, :start]
end
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
require 'spec_helper'

describe 'aws-parallelcluster-computefleet::config_check_update_systemd_service' do
for_all_oses do |platform, version|
context "on #{platform}#{version}" do
cached(:chef_run) do
runner = runner(platform: platform, version: version) do |node|
node.override['cluster']['node_type'] = 'ComputeFleet'
end
runner.converge(described_recipe)
end
cached(:node) { chef_run.node }

it 'creates the check-update.service template' do
is_expected.to create_template('/etc/systemd/system/check-update.service')
.with(source: 'check_update/check-update.service.erb')
.with(owner: 'root')
.with(group: 'root')
.with(mode: '0644')
end

it 'creates the check-update.timer file' do
is_expected.to create_cookbook_file('/etc/systemd/system/check-update.timer')
.with(source: 'check_update/check-update.timer')
.with(owner: 'root')
.with(group: 'root')
.with(mode: '0644')
end

it 'creates the shared update path file if missing' do
is_expected.to create_file_if_missing(node['cluster']['shared_update_path'])
.with(content: '')
.with(owner: 'root')
.with(group: 'root')
.with(mode: '0644')
end

it 'creates the local update checkpoint file if missing' do
is_expected.to create_file_if_missing(node['cluster']['update_checkpoint'])
.with(content: '')
.with(owner: 'root')
.with(group: 'root')
.with(mode: '0644')
end

it 'enables and starts the check-update.timer service' do
is_expected.to enable_service('check-update.timer')
is_expected.to start_service('check-update.timer')
end

describe 'check-update.service template content' do
it 'has Type=oneshot to prevent concurrent executions' do
is_expected.to render_file('/etc/systemd/system/check-update.service')
.with_content('Type=oneshot')
end

it 'has TimeoutStartSec=30 to handle NFS hangs' do
is_expected.to render_file('/etc/systemd/system/check-update.service')
.with_content('TimeoutStartSec=30')
end

it 'exits gracefully if shared file does not exist' do
is_expected.to render_file('/etc/systemd/system/check-update.service')
.with_content('[ ! -f "$SHARED_FILE" ] && exit 0')
end

it 'exits gracefully if cat fails' do
is_expected.to render_file('/etc/systemd/system/check-update.service')
.with_content('CURRENT_UPDATE=$(cat "$SHARED_FILE") || exit 0')
end

it 'writes checkpoint before running update action' do
is_expected.to render_file('/etc/systemd/system/check-update.service')
.with_content('echo "$CURRENT_UPDATE" > "$LOCAL_CHECKPOINT" && ')
end

it 'references the correct shared update path' do
is_expected.to render_file('/etc/systemd/system/check-update.service')
.with_content("SHARED_FILE=\"#{node['cluster']['shared_update_path']}\"")
end

it 'references the correct local checkpoint path' do
is_expected.to render_file('/etc/systemd/system/check-update.service')
.with_content("LOCAL_CHECKPOINT=\"#{node['cluster']['update_checkpoint']}\"")
end

it 'calls cfn-hup-update-action.sh when update is needed' do
is_expected.to render_file('/etc/systemd/system/check-update.service')
.with_content("#{node['cluster']['scripts_dir']}/cfn-hup-update-action.sh")
end
end
end
end
end
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
[Unit]
Description=Check for recent file modifications
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


[Service]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are missing logs for the logic executed by this service.
I suggest to set the property StandardOutput to log to a local log file. We must then push such file to cloudwatch.

Also, the logic should log more information:

  1. when it starts
  2. if the file exists
  3. what is the content of the checkpoint file
  4. what is the content of the trigger file
  5. the decision taken
  6. when it completes

All log lines must be timestamped with millis resolution.

Type=oneshot
TimeoutStartSec=30
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this timeout is here to prevent the service being stuck in case of file system unresponsiveness.
This is a good strategy as it is important to put a timeout when dealing with remote resources.
However, I think TimeoutStartSec may not be the right parameter to achieve this.

According to the official documentation for TimeoutStartSec:

If a daemon service does not signal start-up completion within the configured time, the service will be considered failed and will be shut down again.

So the timeout covers the whole start logic. The start logic here includes the execution of cfn-hup-update-action.sh, which can legitimately last longer than 30 seconds. So with the current approach we have the risk of interrupting legitimate updates.

If you agree with this risk, I suggest to configure the timeout logic differently:

  1. define a timeout of 20 seconds to read the shared files (20s is enough to capture file system failures with a simple read operation on a single file)
  2. define a timeout for the update action, which must be set through the node_bootstrap_timeout parameter, as it currently is. See https://github.com/gmarciani/aws-parallelcluster-cookbook/blob/wip/mgiacomo/3141/fix-build-ubhu-1218-1/cookbooks/aws-parallelcluster-environment/templates/cfn_hup_configuration/cfn-hook-update.conf.erb#L9-L9

ExecStart=/bin/bash -c '\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To improve maintainability we must move this logic into a dedicated script and have the unit file invoke it

SHARED_FILE="<%= node['cluster']['shared_update_path'] %>"; \
LOCAL_CHECKPOINT="<%= node['cluster']['update_checkpoint'] %>"; \
\
[ ! -f "$SHARED_FILE" ] && exit 0; \
\
CURRENT_UPDATE=$(cat "$SHARED_FILE") || exit 0; \
LAST_APPLIED=$([ -f "$LOCAL_CHECKPOINT" ] && cat "$LOCAL_CHECKPOINT" || echo ""); \
\
if [ "$CURRENT_UPDATE" != "$LAST_APPLIED" ]; then \
echo "$CURRENT_UPDATE" > "$LOCAL_CHECKPOINT" && <%= node['cluster']['scripts_dir'] %>/cfn-hup-update-action.sh; \
fi'

[Install]
WantedBy=multi-user.target
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This service is triggered by a timer. What is the point of having WantedBy=multi-user.target?
I think it could be removed as it is actually unused.

Original file line number Diff line number Diff line change
Expand Up @@ -3,9 +3,6 @@ triggers=post.update
<% case node['cluster']['node_type'] -%>
<% when 'HeadNode', 'LoginNode' -%>
path=Resources.<%= @launch_template_resource_id %>.Metadata.AWS::CloudFormation::Init
action=PATH=/usr/local/bin:/bin:/usr/bin:/opt/aws/bin; . /etc/parallelcluster/pcluster_cookbook_environment.sh; $CFN_BOOTSTRAP_VIRTUALENV_PATH/cfn-init -v --stack <%= @stack_id %> --resource <%= @launch_template_resource_id %> --configsets update --region <%= @region %> --url <%= @cloudformation_url %> --role <%= @cfn_init_role %>
<% when 'ComputeFleet' -%>
path=Resources.<%= @launch_template_resource_id %>
action=timeout <%= @node_bootstrap_timeout %> <%= @update_hook_script_dir %>/cfn-hup-update-action.sh
action=PATH=/usr/local/bin:/bin:/usr/bin:/opt/aws/bin;. /etc/parallelcluster/pcluster_cookbook_environment.sh; $CFN_BOOTSTRAP_VIRTUALENV_PATH/cfn-init -v --stack <%= @stack_id %> --resource <%= @launch_template_resource_id %> --configsets update --region <%= @region %> --url <%= @cloudformation_url %> --role <%= @cfn_init_role %>
<% end %>
runas=root
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,6 @@
dcv_port: node['cluster']['dcv_port'],
dcv_auth_certificate: node['cluster']['dcv']['authenticator']['certificate'],
dcv_auth_private_key: node['cluster']['dcv']['authenticator']['private_key'],
dcv_auth_user: node['cluster']['dcv']['authenticator']['user'],
cfnhup_enabled: cfnhup_enabled?
dcv_auth_user: node['cluster']['dcv']['authenticator']['user']
)
end
Original file line number Diff line number Diff line change
Expand Up @@ -57,28 +57,6 @@
end
end

context "when head node and cfn-hup disabled on fleet" do
cached(:chef_run) do
runner = runner(platform: platform, version: version) do |node|
node.override['cluster']['node_type'] = 'HeadNode'
node.override['cluster']['dcv_enabled'] = 'head_node'
node.override['cluster']['in_place_update_on_fleet_enabled'] = 'false'
allow_any_instance_of(Object).to receive(:dcv_installed?).and_return(true)
end
runner.converge(described_recipe)
end
cached(:node) { chef_run.node }

it 'has the correct content' do
is_expected.to render_file('/etc/parallelcluster/parallelcluster_supervisord.conf')
.with_content("[program:cfn-hup]")
.with_content("[program:clustermgtd]")
.with_content("[program:clusterstatusmgtd]")
.with_content("[program:pcluster_dcv_authenticator]")
.with_content("--port 8444")
end
end

context "when compute fleet" do
cached(:chef_run) do
runner = runner(platform: platform, version: version) do |node|
Expand All @@ -92,32 +70,16 @@

it 'has the correct content' do
is_expected.to render_file('/etc/parallelcluster/parallelcluster_supervisord.conf')
.with_content("[program:cfn-hup]")
.with_content("[program:computemgtd]")

is_expected.not_to render_file('/etc/parallelcluster/parallelcluster_supervisord.conf')
.with_content("[program:pcluster_dcv_authenticator]")
end
end

context "when compute fleet with cfn-hup disabled on fleet" do
cached(:chef_run) do
runner = runner(platform: platform, version: version) do |node|
node.override['cluster']['node_type'] = 'ComputeFleet'
node.override['cluster']['in_place_update_on_fleet_enabled'] = 'false'
end
runner.converge(described_recipe)
end
cached(:node) { chef_run.node }

it 'has the correct content' do
is_expected.to render_file('/etc/parallelcluster/parallelcluster_supervisord.conf')
.with_content("[program:computemgtd]")

is_expected.not_to render_file('/etc/parallelcluster/parallelcluster_supervisord.conf')
.with_content("[program:cfn-hup]")
end
end

context "when login node and dcv configured" do
cached(:chef_run) do
runner = runner(platform: platform, version: version) do |node|
Expand Down Expand Up @@ -157,25 +119,6 @@
.with_content("[program:pcluster_dcv_authenticator]")
end
end

context "when login node with cfn-hup disabled on fleet" do
cached(:chef_run) do
runner = runner(platform: platform, version: version) do |node|
node.override['cluster']['node_type'] = 'LoginNode'
node.override['cluster']['in_place_update_on_fleet_enabled'] = 'false'
end
runner.converge(described_recipe)
end
cached(:node) { chef_run.node }

it 'has the correct content' do
is_expected.to render_file('/etc/parallelcluster/parallelcluster_supervisord.conf')
.with_content("[program:loginmgtd]")

is_expected.not_to render_file('/etc/parallelcluster/parallelcluster_supervisord.conf')
.with_content("[program:cfn-hup]")
end
end
end
end
end
Expand Down
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
# Generated by Chef for AWS ParallelCluster <%= node['cluster']['node_type'] -%>
# Local modifications could be overwritten.
<%# HeadNode, ComputeFleet, LoginNode -%>
<% if @cfnhup_enabled -%>
<% case node['cluster']['node_type'] -%>
<% when 'HeadNode', 'LoginNode' -%>
[program:cfn-hup]
command = <%= node['cluster']['scripts_dir']%>/cfn-hup-runner.sh
autorestart = true
Expand Down
7 changes: 4 additions & 3 deletions cookbooks/aws-parallelcluster-shared/attributes/cluster.rb
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,10 @@
default['cluster']['log_base_dir'] = '/var/log/parallelcluster'
default['cluster']['etc_dir'] = '/etc/parallelcluster'

# Shared file used to manage inplace updates
default['cluster']['shared_update_path'] = "#{node['cluster']['shared_dir']}/check_update"
default['cluster']['update_checkpoint'] = "#{node['cluster']['scripts_dir']}/update_checkpoint"
Comment on lines +13 to +14
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to use more talkative names, such as

default['cluster']['update']['trigger_file'] = "#{node['cluster']['shared_dir']}/update_trigger"
default['cluster']['update']['checkpoint_file'] = "#{node['cluster']['scripts_dir']}/update_checkpoint"

With such names for the attributes we better convey that:

  1. they are files (when an attribute contains a directory path, it has dir as suffix)
  2. one is used as a trigger


# Slurm_plugin_dir is used by slurm cookbook and custom_actions recipe
default['cluster']['slurm_plugin_dir'] = "#{node['cluster']['etc_dir']}/slurm_plugin"

Expand All @@ -34,6 +38,3 @@

# Default NFS mount options
default['cluster']['nfs']['hard_mount_options'] = 'hard,_netdev,noatime'

# Cluster Updates
default['cluster']['in_place_update_on_fleet_enabled'] = 'true'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 changes: 0 additions & 11 deletions cookbooks/aws-parallelcluster-shared/libraries/helpers.rb
Original file line number Diff line number Diff line change
Expand Up @@ -106,14 +106,3 @@ def wait_sync_file(path)
timeout 5
end
end

def cfnhup_enabled?
# cfn-hup is always enabled on the head node, as it is required to perform cluster updates.
# cfn-hup can be disabled on compute nodes and login nodes, limiting the cluster update in the sense that
# live updates on compute and login nodes are not possible.
node['cluster']['node_type'] == 'HeadNode' || node['cluster']['in_place_update_on_fleet_enabled'] == 'true'
end

def cluster_readiness_check_on_update_enabled?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as https://github.com/aws/aws-parallelcluster-cookbook/pull/3070/changes#r2641117741

I think it is useful to keep a way to skip cluster readiness check through chef attributes.
The current chef attribute name must be changed as in_place_update_on_fleet_enabled will not be valid anymore. We could expose something like cluster/update/cluster_readiness_check_enabled

node['cluster']['in_place_update_on_fleet_enabled'] == 'true'
end

This file was deleted.

Loading
Loading