OpenHPC Installation Guide (via recipe.sh)
Note: This OpenHPC installation guide is under active testing. Content may change.
1. Introduction
We are going to install OpenHPC using the recipe.sh script. To make the installation easier to follow and to catch errors step by step, we divide the script into 15 individual sections, executing and verifying each one separately.
2. Initial Environment Validation and Setup
#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
# Example Installation Script Template
# This convenience script encapsulates command-line instructions highlighted in
# an OpenHPC Install Guide that can be used as a starting point to perform a local
# cluster install beginning with bare-metal. Necessary inputs that describe local
# hardware characteristics, desired network settings, and other customizations
# are controlled via a companion input file that is used to initialize variables
# within this script.
# Please see the OpenHPC Install Guide(s) for more information regarding the
# procedure. Note that the section numbering included in this script refers to
# corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
Specify the exact path where input.local is located, and make sure there is no space between the :- and the /.
if [ ! -e ${inputFile} ];then
echo "Error: Unable to access local input file -> ${inputFile}"
exit 1
else
. ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
Explanation:
- If OHPC_INPUT_LOCAL is defined, then inputFile will take its value.
- Otherwise, it will default to /input.local.
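As a quick illustration of that default-value expansion (the override path below is hypothetical):

```shell
#!/usr/bin/bash
# ${VAR:-default}: use $VAR if it is set and non-empty, otherwise the default.
unset OHPC_INPUT_LOCAL
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
echo "$inputFile"    # prints /input.local (variable unset)

OHPC_INPUT_LOCAL=/root/my-input.local   # hypothetical override
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
echo "$inputFile"    # prints /root/my-input.local
```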
# ---------------------------- Begin OpenHPC Recipe ---------------------------------------
# Commands below are extracted from an OpenHPC install guide recipe and are intended for
# execution on the master SMS host.
# -----------------------------------------------------------------------------------------
# Verify OpenHPC repository has been enabled before proceeding
dnf repolist | grep -q OpenHPC
if [ $? -ne 0 ];then
echo "Error: OpenHPC repository must be enabled locally"
exit 1
fi
It checks whether the OpenHPC repository is enabled using dnf repolist.
If it's not enabled, an error message is displayed and the installation is stopped.
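The exit-status pattern this check relies on ($? after grep -q) can be tried in isolation with any input string:

```shell
#!/usr/bin/bash
# grep -q exits 0 on a match and 1 otherwise; $? holds that status.
echo "OpenHPC-3 - Base" | grep -q OpenHPC
if [ $? -ne 0 ]; then
    echo "Error: OpenHPC repository must be enabled locally"
else
    echo "repository check passed"
fi
```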
# Disable firewall
systemctl disable --now firewalld
This immediately stops the firewalld service and prevents it from starting automatically at boot.
Running this section of recipe.sh should produce no output.
If any output appears, it indicates an error.
3. Deployment of Core OpenHPC Packages and Initial Time Synchronization Setup
#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
# Example Installation Script Template
# This convenience script encapsulates command-line instructions highlighted in
# an OpenHPC Install Guide that can be used as a starting point to perform a local
# cluster install beginning with bare-metal. Necessary inputs that describe local
# hardware characteristics, desired network settings, and other customizations
# are controlled via a companion input file that is used to initialize variables
# within this script.
# Please see the OpenHPC Install Guide(s) for more information regarding the
# procedure. Note that the section numbering included in this script refers to
# corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
if [ ! -e ${inputFile} ];then
echo "Error: Unable to access local input file -> ${inputFile}"
exit 1
else
. ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# ------------------------------------------------------------
# Add baseline OpenHPC and provisioning services (Section 3.3)
# ------------------------------------------------------------
dnf -y install ohpc-base warewulf-ohpc hwloc-ohpc
Purpose: install the essential OpenHPC packages. The packages required for setting up and managing an HPC cluster are installed with dnf.
# Enable NTP services on SMS host
systemctl enable chronyd.service
Enable the NTP service: the time synchronization service chronyd is enabled so that the SMS server stays synchronized with an NTP server.
echo "local stratum 10" >> /etc/chrony.conf
echo "server ${ntp_server}" >> /etc/chrony.conf
echo "allow all" >> /etc/chrony.conf
Configure NTP settings: the chrony.conf configuration file is modified to define a local stratum, specify an external NTP server, and allow all hosts to synchronize with this server.
systemctl restart chronyd
Restart the chronyd service: the service is restarted to apply the configuration changes.
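As a sketch, the same three appends can be rehearsed against a scratch file instead of the live /etc/chrony.conf; pool.ntp.org stands in for the ${ntp_server} value normally defined in input.local:

```shell
#!/usr/bin/bash
tmp=$(mktemp)                # scratch stand-in for /etc/chrony.conf
ntp_server=pool.ntp.org      # hypothetical value; normally sourced from input.local
echo "local stratum 10" >> "$tmp"
echo "server ${ntp_server}" >> "$tmp"
echo "allow all" >> "$tmp"
cat "$tmp"                   # prints the three lines just added
rm -f "$tmp"
```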
Running this script may produce the following error (at least, that was the case for me):
Rocky Linux 9 - BaseOS 0.0 B/s | 0 B 00:01
Errors during downloading metadata for repository 'baseos':
- Curl error (60): SSL peer certificate or SSH remote key was not OK for https://mirrors.rockylinux.org/mirrorlist?arch=x86_64&repo=BaseOS-9 [SSL certificate problem: certificate is not yet valid]
Error: Failed to download metadata for repo 'baseos': Cannot prepare internal mirrorlist: Curl error (60): SSL peer certificate or SSH remote key was not OK for https://mirrors.rockylinux.org/mirrorlist?arch=x86_64&repo=BaseOS-9 [SSL certificate problem: certificate is not yet valid]
Solution: An SSL certificate error may occur if the system date and time are incorrect. If the system clock is significantly ahead or behind, SSL certificates may be considered invalid.
[root@master-ohpc /]# date
Wed Feb 5 08:33:12 AM EST 2025
[root@master-ohpc /]# date -s "2025-02-05 14:37:00"
Wed Feb 5 02:37:00 PM EST 2025
[root@master-ohpc /]# date
Wed Feb 5 02:37:05 PM EST 2025
[root@master-ohpc /]# ./recipe2.sh
Once this configuration is applied, the script generates the following output:
OpenHPC-3 - Base 698 B/s | 1.5 kB 00:02
OpenHPC-3 - Updates 8.2 kB/s | 3.0 kB 00:00
Extra Packages for Enterprise Linux 9 - x86_64 4.9 kB/s | 79 kB 00:16
Rocky Linux 9 - BaseOS 12 kB/s | 4.1 kB 00:00
Rocky Linux 9 - AppStream 15 kB/s | 4.5 kB 00:00
Rocky Linux 9 - Extras 2.0 kB/s | 2.9 kB 00:01
Dependencies resolved.
==============================================================================================================================================================================================================
Package Architecture Version Repository Size
==============================================================================================================================================================================================================
Installing:
hwloc-ohpc x86_64 2.11.1-320.ohpc.1.1 OpenHPC-updates 2.4 M
ohpc-base x86_64 3.2-320.ohpc.1.1 OpenHPC-updates 7.2 k
warewulf-ohpc x86_64 4.5.5-320.ohpc.3.1 OpenHPC-updates 24 M
Upgrading:
4. Installing and Configuring Slurm Resource Manager on Master Node
#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
# Example Installation Script Template
# This convenience script encapsulates command-line instructions highlighted in
# an OpenHPC Install Guide that can be used as a starting point to perform a local
# cluster install beginning with bare-metal. Necessary inputs that describe local
# hardware characteristics, desired network settings, and other customizations
# are controlled via a companion input file that is used to initialize variables
# within this script.
# Please see the OpenHPC Install Guide(s) for more information regarding the
# procedure. Note that the section numbering included in this script refers to
# corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
if [ ! -e ${inputFile} ];then
echo "Error: Unable to access local input file -> ${inputFile}"
exit 1
else
. ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# -------------------------------------------------------------
# Add resource management services on master node (Section 3.4)
# -------------------------------------------------------------
dnf -y install ohpc-slurm-server
Installs the Slurm server components (Slurm is a workload manager for HPC jobs), using dnf -y install to automate the process without requiring confirmation.
cp /etc/slurm/slurm.conf.ohpc /etc/slurm/slurm.conf
cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
Copies a default Slurm configuration file (slurm.conf.ohpc -> slurm.conf) and a sample cgroup configuration file (cgroup.conf.example -> cgroup.conf), which is used to limit and isolate CPU and memory resources via cgroups.
perl -pi -e "s/SlurmctldHost=\S+/SlurmctldHost=${sms_name}/" /etc/slurm/slurm.conf
Replaces the SlurmctldHost=... line in /etc/slurm/slurm.conf with SlurmctldHost=${sms_name}, where ${sms_name} is defined in the input.local configuration file. This variable identifies the master node of the cluster.
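A minimal sketch of this substitution against a scratch file, using placeholder values (oldhost, master-ohpc), shows the effect without touching the live slurm.conf:

```shell
#!/usr/bin/bash
tmp=$(mktemp)                        # scratch stand-in for /etc/slurm/slurm.conf
echo "SlurmctldHost=oldhost" > "$tmp"
sms_name=master-ohpc                 # hypothetical; normally sourced from input.local
perl -pi -e "s/SlurmctldHost=\S+/SlurmctldHost=${sms_name}/" "$tmp"
grep SlurmctldHost "$tmp"            # prints SlurmctldHost=master-ohpc
rm -f "$tmp"
```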
Once this configuration is applied, the script generates the following output:
Last metadata expiration check: 0:55:27 ago on Fri 07 Feb 2025 04:11:23 AM EST.
Dependencies resolved.
==============================================================================================================================================================================================================
Package Architecture Version Repository Size
==============================================================================================================================================================================================================
Installing:
ohpc-slurm-server x86_64 3.2-320.ohpc.1.1 OpenHPC-updates 7.0 k
Installing dependencies:
Verification: confirm that the SlurmctldHost=... line in /etc/slurm/slurm.conf has been correctly replaced with SlurmctldHost=${sms_name}.
cd /etc/slurm/
nano slurm.conf
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
ClusterName=cluster
SlurmctldHost=master-ohpc
#SlurmctldHost=
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
5. Updating Slurm Node Configuration in slurm.conf
#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
# Example Installation Script Template
# This convenience script encapsulates command-line instructions highlighted in
# an OpenHPC Install Guide that can be used as a starting point to perform a local
# cluster install beginning with bare-metal. Necessary inputs that describe local
# hardware characteristics, desired network settings, and other customizations
# are controlled via a companion input file that is used to initialize variables
# within this script.
# Please see the OpenHPC Install Guide(s) for more information regarding the
# procedure. Note that the section numbering included in this script refers to
# corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
if [ ! -e ${inputFile} ];then
echo "Error: Unable to access local input file -> ${inputFile}"
exit 1
else
. ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# ----------------------------------------
# Update node configuration for slurm.conf
# ----------------------------------------
if [[ ${update_slurm_nodeconfig} -eq 1 ]];then
Check whether the variable update_slurm_nodeconfig is set to 1: this indicates that the node configuration from the input.local file should be applied.
perl -pi -e "s/^NodeName=.+$/#/" /etc/slurm/slurm.conf
Comment out every line starting with NodeName= (each is replaced by a bare #).
perl -pi -e "s/ Nodes=c\S+ / Nodes=${compute_prefix}[1-${num_computes}] /" /etc/slurm/slurm.conf
Rewrite the partition's node list to match the prefix defined in compute_prefix and the node count in num_computes.
echo -e ${slurm_node_config} >> /etc/slurm/slurm.conf
fi
Append the value of the slurm_node_config variable to the bottom of the slurm.conf configuration file.
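The substitutions in this block can be rehearsed on a scratch file; the sample slurm.conf lines and the compute_prefix/num_computes values below are illustrative:

```shell
#!/usr/bin/bash
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
NodeName=linux[1-32] CPUs=1 State=UNKNOWN
PartitionName=normal Nodes=c[1-4] Default=YES MaxTime=24:00:00 State=UP
EOF
compute_prefix=compute   # hypothetical values; normally from input.local
num_computes=2
perl -pi -e "s/^NodeName=.+$/#/" "$tmp"
perl -pi -e "s/ Nodes=c\S+ / Nodes=${compute_prefix}[1-${num_computes}] /" "$tmp"
cat "$tmp"
# line 1 becomes "#"; line 2 now reads Nodes=compute[1-2]
rm -f "$tmp"
```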
The script should run silently without generating any output, but it's important to verify that the modifications were properly applied to the file.
Verifications
cd /etc/slurm/
nano slurm.conf
To verify this: echo -e ${slurm_node_config} >> /etc/slurm/slurm.conf
# Enable configless option
SlurmctldParameters=enable_configless
# Setup interactive jobs for salloc
LaunchParameters=use_interactive_step
compute[1-2] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2
To verify this: perl -pi -e "s/ Nodes=c\S+ / Nodes=${compute_prefix}[1-${num_computes}] /" /etc/slurm/slurm.conf
PartitionName=normal Nodes=compute[1-2] Default=YES MaxTime=24:00:00 State=UP OverSubscribe=EXCLUSIVE
# Enable configless option
SlurmctldParameters=enable_configless
To verify this: perl -pi -e "s/^NodeName=.+$/#/" /etc/slurm/slurm.conf
# COMPUTE NODES
#NodeName=linux[1-32] CPUs=1 State=UNKNOWN
#PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
6. Enabling InfiniBand and Omni-Path Support Services on the Master Node
#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
# Example Installation Script Template
# This convenience script encapsulates command-line instructions highlighted in
# an OpenHPC Install Guide that can be used as a starting point to perform a local
# cluster install beginning with bare-metal. Necessary inputs that describe local
# hardware characteristics, desired network settings, and other customizations
# are controlled via a companion input file that is used to initialize variables
# within this script.
# Please see the OpenHPC Install Guide(s) for more information regarding the
# procedure. Note that the section numbering included in this script refers to
# corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
if [ ! -e ${inputFile} ];then
echo "Error: Unable to access local input file -> ${inputFile}"
exit 1
else
. ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# -----------------------------------------------------------------------
# Optionally add InfiniBand support services on master node (Section 3.5)
# -----------------------------------------------------------------------
if [[ ${enable_ib} -eq 1 ]];then
dnf -y groupinstall "InfiniBand Support"
udevadm trigger --type=devices --action=add
systemctl restart rdma-load-modules@infiniband.service
fi
# Optionally enable opensm subnet manager
if [[ ${enable_opensm} -eq 1 ]];then
dnf -y install opensm
systemctl enable opensm
systemctl start opensm
fi
# Optionally enable IPoIB interface on SMS
if [[ ${enable_ipoib} -eq 1 ]];then
# Enable ib0
cp /opt/ohpc/pub/examples/network/centos/ifcfg-ib0 /etc/sysconfig/network-scripts
perl -pi -e "s/master_ipoib/${sms_ipoib}/" /etc/sysconfig/network-scripts/ifcfg-ib0
perl -pi -e "s/ipoib_netmask/${ipoib_netmask}/" /etc/sysconfig/network-scripts/ifcfg-ib0
echo "[main]" > /etc/NetworkManager/conf.d/90-dns-none.conf
echo "dns=none" >> /etc/NetworkManager/conf.d/90-dns-none.conf
systemctl start NetworkManager
fi
# ----------------------------------------------------------------------
# Optionally add Omni-Path support services on master node (Section 3.6)
# ----------------------------------------------------------------------
if [[ ${enable_opa} -eq 1 ]];then
dnf -y install opa-basic-tools
fi
# Optionally enable OPA fabric manager
if [[ ${enable_opafm} -eq 1 ]];then
dnf -y install opa-fm
systemctl enable opafm
systemctl start opafm
fi
This section conditionally enables support for high-performance networking on the master node, including InfiniBand, IP over InfiniBand (IPoIB), Omni-Path Architecture (OPA), and the associated subnet/fabric managers, based on configuration variables.
[root@master-ohpc /]# chmod +x ./recipe5.sh
[root@master-ohpc /]# ./recipe5.sh
The script recipe5.sh is made executable with chmod +x, then executed. It produces no output, which indicates that it likely ran successfully and silently.
7. Completing Warewulf Master Node Configuration for Cluster Provisioning
#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
# Example Installation Script Template
# This convenience script encapsulates command-line instructions highlighted in
# an OpenHPC Install Guide that can be used as a starting point to perform a local
# cluster install beginning with bare-metal. Necessary inputs that describe local
# hardware characteristics, desired network settings, and other customizations
# are controlled via a companion input file that is used to initialize variables
# within this script.
# Please see the OpenHPC Install Guide(s) for more information regarding the
# procedure. Note that the section numbering included in this script refers to
# corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
if [ ! -e ${inputFile} ];then
echo "Error: Unable to access local input file -> ${inputFile}"
exit 1
else
. ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# -----------------------------------------------------------
# Complete basic Warewulf setup for master node (Section 3.7)
# -----------------------------------------------------------
ip link set dev ${sms_eth_internal} up
ip address add ${sms_ip}/${internal_netmask} broadcast + dev ${sms_eth_internal}
perl -pi -e "s/ipaddr:.*/ipaddr: ${sms_ip}/" /etc/warewulf/warewulf.conf
perl -pi -e "s/netmask:.*/netmask: ${internal_netmask}/" /etc/warewulf/warewulf.conf
perl -pi -e "s/network:.*/network: ${internal_network}/" /etc/warewulf/warewulf.conf
perl -pi -e 's/template:.*/template: static/' /etc/warewulf/warewulf.conf
perl -pi -e "s/range start:.*/range start: ${c_ip[0]}/" /etc/warewulf/warewulf.conf
perl -pi -e "s/range end:.*/range end: ${c_ip[$((num_computes-1))]}/" /etc/warewulf/warewulf.conf
perl -pi -e "s/mount: false/mount: true/" /etc/warewulf/warewulf.conf
wwctl profile set -y default --netmask=${internal_netmask}
wwctl profile set -y default --gateway=${ipv4_gateway}
wwctl profile set -y default --netdev=default --nettagadd=DNS=${dns_servers}
perl -pi -e "s/warewulf/${sms_name}/" /srv/warewulf/overlays/host/rootfs/etc/hosts.ww
perl -pi -e "s/warewulf/${sms_name}/" /srv/warewulf/overlays/generic/rootfs/etc/hosts.ww
echo "next-server ${sms_ip};" >> /srv/warewulf/overlays/host/rootfs/etc/dhcpd.conf.ww
systemctl enable --now warewulfd
wwctl configure --all
bash /etc/profile.d/ssh_setup.sh
# Update /etc/hosts template to have ${hostname}.localdomain as the first host entry
sed -e 's_\({{$node.Id.Get}}{{end}}\)_{{$node.Id.Get}}.localdomain \1_g' -i /srv/warewulf/overlays/host/rootfs/etc/hosts.ww
This script sets up the internal network interface, updates the Warewulf configuration to reflect the cluster topology and networking, and prepares services for provisioning compute nodes.
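The effect of that final sed edit can be previewed on a one-line scratch copy; the sample line imitates the host entry in the hosts.ww template:

```shell
#!/usr/bin/bash
tmp=$(mktemp)   # scratch stand-in for the hosts.ww overlay template
echo '{{$netdevs.Ipaddr.Get}} {{$node.Id.Get}}{{end}}' > "$tmp"
# insert "<node>.localdomain" ahead of the short hostname
sed -e 's_\({{$node.Id.Get}}{{end}}\)_{{$node.Id.Get}}.localdomain \1_g' -i "$tmp"
cat "$tmp"
# prints: {{$netdevs.Ipaddr.Get}} {{$node.Id.Get}}.localdomain {{$node.Id.Get}}{{end}}
rm -f "$tmp"
```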
Step-by-step summary of the script
Configures the internal network interface of the master server (sms_eth_internal), assigning it an IP address, netmask, and bringing it up.
Modifies the warewulf.conf file:
Sets the IP address, netmask, network, and static network template
Configures the IP range for compute nodes (range start / end)
Enables file system mounting
Configures the default Warewulf profile with:
The netmask
The gateway address
The default network device and DNS servers
Customizes the hosts.ww and dhcpd.conf.ww files to reflect the master serverβs name (sms_name) and add the next-server directive in the DHCP config.
Enables and configures the warewulfd service:
Starts and enables the Warewulf daemon
Applies configuration with wwctl configure --all
Runs the default SSH configuration script (ssh_setup.sh)
Updates the /etc/hosts template for the nodes to include hostname.localdomain as the first host entry.
Verifications
To verify that the configuration was applied correctly, open the file /etc/warewulf/warewulf.conf and check that the values match the expected settings:
WW_INTERNAL: 45
ipaddr: 192.168.70.41
netmask: 255.255.255.0
gateway: 192.168.70.1
nameserv:
port: 9873
secure: false
update_interval: 60
autobuild_overlays: true
host_overlay: true
base: static
datastore: /usr/share
grubboot: false
dhcp:
enabled: true
template: static
range_start: 192.168.70.51
range_end: 192.168.70.52
systemd_name: dhcpd
tftp:
enabled: true
tftproot: /srv/tftpboot
systemd_name: tftp
ipxe:
"00:00": undionly.kpxe
"00:07": ipxe-snponly-x86_64.efi
"00:09": ipxe-snponly-x86_64.efi
"00:0B": arm64-efi/snponly.efi
nfs:
enabled: true
export_paths:
- path: /home
export_options: rw,sync
mount_options: defaults
mount: true
- path: /opt
export_options: ro,sync,no_root_squash
mount_options: defaults
mount: true
systemd_name: nfs-server
ssh:
key_types:
- rsa
- dsa
- ecdsa
- ed25519
container_mounts:
- source: /etc/resolv.conf
- source: /etc/localtime
readonly: true
paths:
bindir: /usr/bin
sysconfdir: /etc
To verify that the DHCP configuration has been applied correctly, open the dhcpd.conf.ww file and check the relevant settings:
# Pure BIOS clients will get iPXE configuration
filename "http://${s.ipaddr}:${s.Warewulf.Port}/ipxe/${mac:hexhyp}";
# EFI clients will get shim and grub instead
filename "warewulf/shim.efi";
} elsif substring (option vendor-class-identifier, 0, 10) = "HTTPClient" {
filename "http://${s.ipaddr}:${s.Warewulf.Port}/efiboot.img";
} else {
# iPXE vendor-class and option 175 = "iPXE" {
filename "http://${s.ipaddr}:${s.Warewulf.Port}/ipxe/${mac:hexhyp}?assetkey=${asset}&u
} else {
{{range $type,$name := $.Tftp.IpxeBinaries }}
if option architecture-type = {{ $type }} {
filename "/warewulf/{{ basename $name }}";
}
{{end}}{{/* range IpxeBinaries */}}
{{end}}{{/* BootMethod */}}
subnet {{$.Network}} netmask {{$.Netmask}} {
max-lease-time 120;
{{- if ne .Dhcp.Template "static" }}
range {{$.Dhcp.RangeStart}} {{$.Dhcp.RangeEnd}};
next-server {{.Ipaddr}};
{{end}}
}
{{- if eq .Dhcp.Template "static" }}
{{- range $nodes := $.AllNodes}}
{{- range $devs := $.netDevs }}
host {{$nodes.Id.Get}}-{{$netname}}
{{- if $netdevs.Hwaddr.Defined}}
hardware ethernet {{$netdevs.Hwaddr.Get}};
{{- end}}
{{- if $netdevs.Ipaddr.Defined}}
fixed-address {{$netdevs.Ipaddr.Get}};
{{- end }}
{{- if $netdevs.Primary.GetB}}
option host-name "{{$nodes.Id.Get}}";
{{- end }}
}
{{end }}{{/* range NetDevs */}}
{{end }}{{/* range AllNodes */}}
{{end }}{{/* if static */}}
}
{{abort}}
}
{{- end}}{{/* dhcp enabled */}}
{{- end}}{{/* primary */}}
next-server 192.168.70.41;Review the file /srv/warewulf/overlays/generic/rootfs/etc/hosts.ww to ensure that hostnames and IP mappings have been properly configured for the compute nodes:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
# Warewulf Server
{{$.Ipaddr}} {{$.BuildHost}} master-ohpc
{{- range $node := $.AllNodes}} {{/* for each node */}}
{{- range $netname,$netdevs := $node.NetDevs}} {{/* for each network device on the node */}}
{{- if $netdevs.OnThisNetwork $.Network}} {{/* only if this device has an IP address on this network */}}
{{$netdevs.Ipaddr.Get}} {{$node.Id.Get}}-{{$netname}} # {{$node.Comment.Print}} if this is the primary */}}
8. Creating the Compute Node Image for Warewulf
#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
# Example Installation Script Template
# This convenience script encapsulates command-line instructions highlighted in
# an OpenHPC Install Guide that can be used as a starting point to perform a local
# cluster install beginning with bare-metal. Necessary inputs that describe local
# hardware characteristics, desired network settings, and other customizations
# are controlled via a companion input file that is used to initialize variables
# within this script.
# Please see the OpenHPC Install Guide(s) for more information regarding the
# procedure. Note that the section numbering included in this script refers to
# corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
if [ ! -e ${inputFile} ];then
echo "Error: Unable to access local input file -> ${inputFile}"
exit 1
else
. ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# -------------------------------------------------
# Create compute image for Warewulf (Section 3.8.1)
# -------------------------------------------------
This section explains how to create a compute image for Warewulf, a cluster management tool used to install and manage compute nodes. This image is used to configure the compute nodes within the cluster.
wwctl container import docker://ghcr.io/warewulf/warewulf-rockylinux:9 rocky-9.4 --syncuser
This command uses wwctl (Warewulf's command-line tool) to import a preconfigured Rocky Linux 9 container image from the ghcr.io/warewulf registry, naming it rocky-9.4. The --syncuser flag ensures that users and groups in the image are synchronized with those on the host system.
wwctl container exec rocky-9.4 /bin/bash <<- EOF
dnf -y install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm
dnf -y update
EOF
The wwctl container exec command runs commands inside the Rocky Linux 9.4 container that we previously imported.
The commands executed inside the container are:
- Installation of ohpc-release: This installs the ohpc-release package from the OpenHPC repository for Rocky Linux 9.4. It configures the system to enable the installation of OpenHPC-specific software and resources.
- System update: The dnf -y update command updates all packages in the container to their latest available versions.
export CHROOT=/srv/warewulf/chroots/rocky-9.4/rootfs
This line defines an environment variable CHROOT that points to the directory containing the filesystem of the Rocky Linux 9.4 compute image we just created. This directory is essential for integrating the compute image into the Warewulf system, as it represents the environment in which the compute nodes will be deployed.
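As an optional sanity check, a small helper can confirm that ${CHROOT} looks like a Linux root filesystem; check_chroot is a hypothetical function, demonstrated here on a throwaway directory:

```shell
#!/usr/bin/bash
# check_chroot: hypothetical helper verifying that a directory has the
# top-level entries expected of a Linux root filesystem.
check_chroot() {
    local root=$1 missing=0
    for d in bin etc usr var; do
        [ -e "${root}/${d}" ] || { echo "missing: ${root}/${d}"; missing=1; }
    done
    [ "${missing}" -eq 0 ] && echo "chroot layout looks sane: ${root}"
}

# Demonstrated on a throwaway directory rather than the real chroot
demo=$(mktemp -d)
mkdir -p "${demo}/bin" "${demo}/etc" "${demo}/usr" "${demo}/var"
check_chroot "${demo}"    # prints: chroot layout looks sane: <tmpdir>
rm -rf "${demo}"
```

On the real master node you would call check_chroot "$CHROOT" instead of the demo directory.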
After running the script, the output is displayed as follows:
[root@master-ohpc /]# ./recipe7.sh
Copying blob 4f4fb700ef54 done
Copying blob 0046cb37027b [==============================>---------] 436.2MiB / 611.4MiB | 20.6 MiB/s
Copying blob cc311bfc628a done
Copying blob 30e5d205dca1 done
Copying blob 3442e16c7069 done
Verifications
Container Import Verification
Ensure that the rocky-9.4 container has been successfully imported. You can verify this by checking whether it appears in the list using the appropriate command. If the rocky-9.4 container is listed, it means the import was successful.
Verify the installation of the ohpc-release package: After running the script, the ohpc-release package should be installed inside the container. To confirm this, access the container and check the package status.
[root@master-ohpc /]# wwctl container list
ERROR : lstat /srv/warewulf/chroots/rocky-9.4/rootfs/proc/3089049: no such file or directory
CONTAINER NAME NODES KERNEL VERSION CREATION TIME MODIFICATION TIME SIZE
rocky-9.4 0 5.14.0-503.19.1.el9_5.x86_64 10 Feb 25 10:27 EST 10 Feb 25 08:17 EST 1.7 GiB
[root@master-ohpc /]# nano recipe7.sh
[root@master-ohpc /]# wwctl container exec rocky-9.4 /bin/bash
[rocky-9.4] warewulf# rpm -qa | grep ohpc-release
ohpc-release-3.1-1.el9.x86_64
[rocky-9.4] warewulf#
- Once inside, verify that the package is properly installed. (If the package is present, it confirms that the initial dnf command was executed successfully.)
[rocky-9.4] Warewulf> rpm -qa | grep ohpc-release
ohpc-release-3-1.el9.x86_64
- Check for system updates: after running dnf -y update, ensure that the system has been properly updated by checking for any remaining available updates:
[rocky-9.4] Warewulf> dnf check-update
OpenHPC-3 - Base 321 kB/s | 3.6 MB 00:11
OpenHPC-3 - Updates 860 kB/s | 5.0 MB 00:05
Extra Packages for Enterprise Linux 9 - x86_64 0.6 kB/s | 2.3 kB 00:04
Extra Packages for Enterprise Linux 9 openh264 (From Ci 2.0 kB/s | 2.5 kB 00:01
Rocky Linux 9 - BaseOS 5.2 MB/s | 2.0 MB 00:00
Rocky Linux 9 - AppStream 5.0 MB/s | 8.7 MB 00:01
Rocky Linux 9 - Extras 30 kB/s | 18 kB 00:00
[rocky-9.4] Warewulf>
- Verifying the chroot directory: the script does not directly modify the chroot directory, but you should check that the path exists and contains the necessary files. Make sure the directory has a structure similar to a typical Linux filesystem. If it is empty or incomplete, the container may not have been initialized correctly.
[root@master-ohpc /]# ls -l /srv/warewulf/chroots/rocky-9.4/rootfs/
total 16
lrwxrwxrwx. 1 root root 7 Nov 2 21:29 bin -> usr/bin
dr-xr-xr-x. 2 root root 4096 Feb 10 08:17 boot
drwxr-xr-x. 2 root root 18 Feb 10 01:01 dev
drwxrwxrwx. 63 root root 4096 Feb 10 08:17 etc
drwxr-xr-x. 2 root root 6 Nov 2 21:29 home
lrwxrwxrwx. 1 root root 7 Nov 2 21:29 lib -> usr/lib
lrwxrwxrwx. 1 root root 9 Nov 2 21:29 lib64 -> usr/lib64
drwxr-xr-x. 2 root root 6 Nov 2 21:29 media
drwxr-xr-x. 2 root root 6 Nov 2 21:29 mnt
drwxr-xr-x. 2 root root 6 Nov 2 21:29 opt
drwxr-xr-x. 2 root root 6 Jan 8 14:47 proc
dr-xr-x---. 3 root root 124 Feb 10 10:05 root
drwxr-xr-x. 14 root root 188 Feb 10 08:17 run
lrwxrwxrwx. 1 root root 8 Nov 2 21:29 sbin -> usr/sbin
drwxr-xr-x. 2 root root 6 Nov 2 21:29 srv
drwxr-xr-x. 2 root root 6 Nov 18 14:46 sys
drwxrwxrwt. 2 root root 144 Nov 18 14:47 tmp
drwxr-xr-x. 12 root root 4096 Nov 18 14:47 usr
drwxr-xr-x. 18 root root 238 Feb 11 03:29 var
- Verifying the overall integrity of the container: you can also enter the container directly to ensure everything is working properly (for example, by running a simple command like uname -a to check the system state). If you see the kernel and system information, the container is functioning normally.
[root@master-ohpc /]# wwctl container exec rocky-9.4 /bin/bash
[rocky-9.4] Warewulf> uname -a
Linux rocky-9.4 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15 12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
9. Configuring the Compute Image with OpenHPC Base, Slurm Client, and Essential Services
#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
# Example Installation Script Template
# This convenience script encapsulates command-line instructions highlighted in
# an OpenHPC Install Guide that can be used as a starting point to perform a local
# cluster install beginning with bare-metal. Necessary inputs that describe local
# hardware characteristics, desired network settings, and other customizations
# are controlled via a companion input file that is used to initialize variables
# within this script.
# Please see the OpenHPC Install Guide(s) for more information regarding the
# procedure. Note that the section numbering included in this script refers to
# corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
if [ ! -e ${inputFile} ];then
echo "Error: Unable to access local input file -> ${inputFile}"
exit 1
else
. ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# ------------------------------------------------------------
# Add OpenHPC base components to compute image (Section 3.8.2)
# ------------------------------------------------------------
wwctl container exec rocky-9.4 /bin/bash <<- EOF
dnf -y install ohpc-base-compute
EOF
# Add SLURM and other components to compute instance
wwctl container exec rocky-9.4 /bin/bash <<- EOF
# Add Slurm client support meta-package and enable munge and slurmd
dnf -y install ohpc-slurm-client
systemctl enable munge
systemctl enable slurmd
# Add Network Time Protocol (NTP) support
dnf -y install chrony
# Include modules user environment
dnf -y install lmod-ohpc
EOF
if [[ ${enable_intel_packages} -eq 1 ]];then
mkdir /opt/intel
echo "/opt/intel *(ro,no_subtree_check,fsid=12)" >> /etc/exports
echo "${sms_ip}:/opt/intel /opt/intel nfs nfsvers=4,nodev 0 0" >> $CHROOT/etc/fstab
fi
# Update basic slurm configuration if additional computes defined
if [ ${num_computes} -gt 4 ];then
perl -pi -e "s/^NodeName=(\S+)/NodeName=${compute_prefix}[1-${num_computes}]/" /etc/slurm/slurm.conf
perl -pi -e "s/^PartitionName=normal Nodes=(\S+)/PartitionName=normal Nodes=${compute_prefix}[1-${num_computes}]/" /etc/slurm/slurm.conf
fi

Analysis of the script output
Objective of the script:
The main goal of this script is to automate the installation and configuration of the necessary components for an HPC cluster, including OpenHPC tools, Slurm, and other required dependencies.
Installation Details:
- Repository Setup:
The script begins by updating and importing the following repositories:
- OpenHPC-3 – Base
- OpenHPC-3 – Updates
- EPEL (Extra Packages for Enterprise Linux 9)
- Rocky Linux 9 – BaseOS, AppStream, and Extras
These repositories are essential for retrieving the latest versions of required packages.
- Installation of OpenHPC Components:
The following packages are installed by the script:
- ohpc-base-compute: Core OpenHPC components
- ohpc-slurm-client: Slurm client for job management
- chrony: Time synchronization tool
- lmod-ohpc: Environment module system for user environments
- Service Configuration
After the installation, the script enables and configures several key services:
- munge (systemctl enable munge): Authentication service used by Slurm
- slurmd (systemctl enable slurmd): Slurm compute node daemon
- chronyd: NTP time synchronization service (provided by the chrony package installed above; note the script does not explicitly run systemctl enable chronyd)
- Installation of Additional Dependencies
The script also installs various additional libraries and tools, including:
- Graphics libraries: cairo, harfbuzz, freetype
- Development tools: gcc, perl, libxml2
- Compression and file system utilities: brotli, squashfs, LZO
- Additional HPC modules: libibverbs, librdmacm (for Infiniband support)
Summary of Installed Packages
In total, the script installed 154 packages, including essential libraries and cluster management tools. Key installed packages include:
- ohpc-base-compute
- ohpc-slurm-client
- chrony
- lmod-ohpc
- singularity-ce
- perl-libs, perl-IO, perl-Net-SSLeay
- python3.11
- libX11, libXext, libXrender
- libselinux-devel, libsepol-devel
Conclusion
The execution of recipe8.sh completed successfully. The OpenHPC environment is now properly configured with all required components to run HPC workloads.
The output
[root@master-ohpc /]# ./recipe8.sh
OpenHPC-3 - Base 1.1 MB/s | 3.6 MB 00:03
OpenHPC-3 - Updates 2.0 MB/s | 5.0 MB 00:02
Extra Packages for Enterprise Linux 9 - x86_64 10 MB/s | 23 MB 00:02
Extra Packages for Enterprise Linux 9 openh264 (From Cisco) - x86_64 153 B/s | 2.5 kB 00:16
Rocky Linux 9 - BaseOS 2.6 MB/s | 2.3 MB 00:00
Rocky Linux 9 - AppStream 16 MB/s | 8.7 MB 00:00
Rocky Linux 9 - Extras 47 kB/s | 16 kB 00:00
Dependencies resolved.
===========================================================================================================================================================================================================================================
Package Architecture Version Repository Size
===========================================================================================================================================================================================================================================
Installing:
ohpc-base-compute x86_64 3.2-320.ohpc.1.1 OpenHPC-updates 7.2 k
Installing dependencies:
10. Enhancing the Compute Image with Networking Drivers, Resource Limits, and Optional Filesystem Clients
#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
# Example Installation Script Template
# This convenience script encapsulates command-line instructions highlighted in
# an OpenHPC Install Guide that can be used as a starting point to perform a local
# cluster install beginning with bare-metal. Necessary inputs that describe local
# hardware characteristics, desired network settings, and other customizations
# are controlled via a companion input file that is used to initialize variables
# within this script.
# Please see the OpenHPC Install Guide(s) for more information regarding the
# procedure. Note that the section numbering included in this script refers to
# corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
if [ ! -e ${inputFile} ];then
echo "Error: Unable to access local input file -> ${inputFile}"
exit 1
else
. ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# -------------------------------------------------------
# Additional customizations (Section 3.8.4)
# -------------------------------------------------------
# Add IB drivers to compute image
if [[ ${enable_ib} -eq 1 ]];then
dnf -y --installroot=$CHROOT groupinstall "InfiniBand Support"
fi
# Add Omni-Path drivers to compute image
if [[ ${enable_opa} -eq 1 ]];then
dnf -y --installroot=$CHROOT install opa-basic-tools
dnf -y --installroot=$CHROOT install libpsm2
fi
# Update memlock settings
perl -pi -e 's/# End of file/\* soft memlock unlimited\n$&/s' /etc/security/limits.conf
perl -pi -e 's/# End of file/\* hard memlock unlimited\n$&/s' /etc/security/limits.conf
perl -pi -e 's/# End of file/\* soft memlock unlimited\n$&/s' $CHROOT/etc/security/limits.conf
perl -pi -e 's/# End of file/\* hard memlock unlimited\n$&/s' $CHROOT/etc/security/limits.conf
# Enable slurm pam module
echo "account required pam_slurm.so" >> $CHROOT/etc/pam.d/sshd
if [[ ${enable_beegfs_client} -eq 1 ]];then
wget -P /etc/yum.repos.d https://www.beegfs.io/release/beegfs_7.4.5/dists/beegfs-rhel9.repo
dnf -y install kernel-devel gcc elfutils-libelf-devel
dnf -y install beegfs-client beegfs-helperd beegfs-utils
perl -pi -e "s/^buildArgs=-j8/buildArgs=-j8 BEEGFS_OPENTK_IBVERBS=1/" /etc/beegfs/beegfs-client-autobuild.conf
/opt/beegfs/sbin/beegfs-setup-client -m ${sysmgmtd_host}
systemctl start beegfs-helperd
systemctl start beegfs-client
wget -P $CHROOT/etc/yum.repos.d https://www.beegfs.io/release/beegfs_7.4.5/dists/beegfs-rhel9.repo
dnf -y --installroot=$CHROOT install beegfs-client beegfs-helperd beegfs-utils
perl -pi -e "s/^buildEnabled=true/buildEnabled=false/" $CHROOT/etc/beegfs/beegfs-client-autobuild.conf
rm -f $CHROOT/var/lib/beegfs/client/force-auto-build
chroot $CHROOT systemctl enable beegfs-helperd beegfs-client
cp /etc/beegfs/beegfs-client.conf $CHROOT/etc/beegfs/beegfs-client.conf
echo "drivers += beegfs" >> /etc/warewulf/bootstrap.conf
fi
# Enable Optional packages
if [[ ${enable_lustre_client} -eq 1 ]];then
# Install Lustre client on master
dnf -y install lustre-client-ohpc
# Enable lustre in WW compute image
dnf -y --installroot=$CHROOT install lustre-client-ohpc
mkdir $CHROOT/mnt/lustre
echo "${mgs_fs_name} /mnt/lustre lustre defaults,localflock,noauto,x-systemd.automount 0 0" >> $CHROOT/etc/fstab
# Enable o2ib for Lustre
echo "options lnet networks=o2ib(ib0)" >> /etc/modprobe.d/lustre.conf
echo "options lnet networks=o2ib(ib0)" >> $CHROOT/etc/modprobe.d/lustre.conf
# mount Lustre client on master
mkdir /mnt/lustre
mount -t lustre -o localflock ${mgs_fs_name} /mnt/lustre
fi

Clarifications
Installation of InfiniBand and Omni-Path drivers
- Adding InfiniBand (IB) drivers
if [[ ${enable_ib} -eq 1 ]]; then
dnf -y --installroot=$CHROOT groupinstall "InfiniBand Support"
fi

If enable_ib is set to 1, the script installs InfiniBand drivers, which are essential for high-performance interconnects in HPC environments.
- Adding Omni-Path (OPA) drivers
if [[ ${enable_opa} -eq 1 ]]; then
dnf -y --installroot=$CHROOT install opa-basic-tools
dnf -y --installroot=$CHROOT install libpsm2
fi

If enable_opa is set to 1, the script installs Omni-Path tools, an alternative to InfiniBand that provides low-latency communication in HPC clusters.
Configuring memory limits (memlock)
perl -pi -e 's/# End of file/\* soft memlock unlimited\n$&/s' /etc/security/limits.conf
perl -pi -e 's/# End of file/\* hard memlock unlimited\n$&/s' /etc/security/limits.conf
perl -pi -e 's/# End of file/\* soft memlock unlimited\n$&/s' $CHROOT/etc/security/limits.conf
perl -pi -e 's/# End of file/\* hard memlock unlimited\n$&/s' $CHROOT/etc/security/limits.conf

Purpose: raise the memory lock limits so that HPC processes (in particular those pinning memory for RDMA) are not constrained by the default memlock limit.
- The perl -pi -e commands update the limits.conf files by appending memlock unlimited rules.
- Changes are applied both on the host system and inside the Warewulf chroot environment.
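To see exactly what these perl one-liners do, here is a self-contained sketch that applies the same substitutions to a scratch copy of limits.conf (the `# End of file` marker mirrors the stock EL9 file; nothing on the real system is touched):

```shell
# Work on a throwaway copy so the real limits.conf is untouched
tmpconf=$(mktemp)
printf '%s\n' '# /etc/security/limits.conf' '# End of file' > "$tmpconf"

# Same substitutions as the recipe: insert the memlock rules
# just above the "# End of file" marker
perl -pi -e 's/# End of file/\* soft memlock unlimited\n$&/s' "$tmpconf"
perl -pi -e 's/# End of file/\* hard memlock unlimited\n$&/s' "$tmpconf"

cat "$tmpconf"
```

On a running node, `ulimit -l` reports the effective memlock limit at the next login.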
Enabling the PAM module for Slurm
echo "account required pam_slurm.so" >> $CHROOT/etc/pam.d/sshd

This activates pam_slurm.so, a PAM module that restricts SSH access on compute nodes to users who have an active Slurm job there.
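Note that a bare `>>` appends a duplicate line every time the recipe is re-run. A slightly more defensive variant (a sketch on a scratch file standing in for $CHROOT/etc/pam.d/sshd) guards the append:

```shell
# Scratch stand-in for $CHROOT/etc/pam.d/sshd
pamfile=$(mktemp)
echo 'account required pam_sss.so' > "$pamfile"

# Append the Slurm PAM rule only if it is not already present
grep -qF 'pam_slurm.so' "$pamfile" || \
  echo 'account required pam_slurm.so' >> "$pamfile"

# Running the guard a second time leaves the file unchanged
grep -qF 'pam_slurm.so' "$pamfile" || \
  echo 'account required pam_slurm.so' >> "$pamfile"
```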
Installing and configuring BeeGFS (parallel file system)
if [[ ${enable_beegfs_client} -eq 1 ]]; then
wget -P /etc/yum.repos.d https://www.beegfs.io/release/beegfs_7.4.5/dists/beegfs-rhel9.repo
dnf -y install kernel-devel gcc elfutils-libelf-devel
dnf -y install beegfs-client beegfs-helperd beegfs-utils
perl -pi -e "s/^buildArgs=-j8/buildArgs=-j8 BEEGFS_OPENTK_IBVERBS=1/" /etc/beegfs/beegfs-client-autobuild.conf
/opt/beegfs/sbin/beegfs-setup-client -m ${sysmgmtd_host}
systemctl start beegfs-helperd
systemctl start beegfs-client
wget -P $CHROOT/etc/yum.repos.d https://www.beegfs.io/release/beegfs_7.4.5/dists/beegfs-rhel9.repo
dnf -y --installroot=$CHROOT install beegfs-client beegfs-helperd beegfs-utils
perl -pi -e "s/^buildEnabled=true/buildEnabled=false/" $CHROOT/etc/beegfs/beegfs-client-autobuild.conf
rm -f $CHROOT/var/lib/beegfs/client/force-auto-build
chroot $CHROOT systemctl enable beegfs-helperd beegfs-client
cp /etc/beegfs/beegfs-client.conf $CHROOT/etc/beegfs/beegfs-client.conf
echo "drivers += beegfs" >> /etc/warewulf/bootstrap.conf
fi

If enable_beegfs_client=1, the script installs and configures BeeGFS, a high-performance parallel file system for HPC. Main actions include:
- Add the BeeGFS repository to yum.repos.d
- Install BeeGFS client, helper daemon, and utilities
- Enable InfiniBand support (BEEGFS_OPENTK_IBVERBS=1)
- Set up the client to connect to the management server (sysmgmtd_host)
- Start beegfs-helperd and beegfs-client services
- Configure the Warewulf chroot environment with BeeGFS
Installing and configuring Lustre (HPC parallel file system)
if [[ ${enable_lustre_client} -eq 1 ]]; then
dnf -y install lustre-client-ohpc
dnf -y --installroot=$CHROOT install lustre-client-ohpc
mkdir $CHROOT/mnt/lustre
echo "${mgs_fs_name} /mnt/lustre lustre defaults,localflock,noauto,x-systemd.automount 0 0" >> $CHROOT/etc/fstab
echo "options lnet networks=o2ib(ib0)" >> /etc/modprobe.d/lustre.conf
echo "options lnet networks=o2ib(ib0)" >> $CHROOT/etc/modprobe.d/lustre.conf
mkdir /mnt/lustre
mount -t lustre -o localflock ${mgs_fs_name} /mnt/lustre
fi

If enable_lustre_client=1, the script installs and configures Lustre, another widely used high-performance parallel file system in HPC.
Key steps include:
- Install the Lustre client on the management server
- Install Lustre in the Warewulf chroot environment
- Add a mount entry to /etc/fstab for /mnt/lustre
- Enable o2ib (over InfiniBand) network support for Lustre
- Mount the Lustre file system on the management server
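The fstab entry the recipe generates can be previewed with a sample value; the MGS specification below is a hypothetical placeholder, substitute the mgs_fs_name from your own input.local:

```shell
# Hypothetical MGS NID and filesystem name, for illustration only
mgs_fs_name="192.168.70.100@o2ib:/lustre1"

# The same line the recipe appends to $CHROOT/etc/fstab
fstab_line="${mgs_fs_name} /mnt/lustre lustre defaults,localflock,noauto,x-systemd.automount 0 0"
echo "$fstab_line"
```

The noauto,x-systemd.automount options mean the mount is triggered on first access rather than at boot, which avoids hanging nodes when the Lustre servers are down.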
Variable Explanation
| Variable | Description | Set to 1 if… | Set to 0 if… |
|---|---|---|---|
| enable_ib | Enables the installation of InfiniBand drivers | Your cluster uses an InfiniBand interconnect | Your cluster uses Ethernet |
| enable_opa | Enables Omni-Path drivers | Your nodes are connected using Omni-Path (OPA) | You do not use Omni-Path |
| enable_beegfs_client | Installs the BeeGFS client (parallel file system) | Your cluster uses BeeGFS for storage | You do not use BeeGFS |
| enable_lustre_client | Installs the Lustre client (another HPC file system) | You use Lustre for parallel storage | You do not use Lustre |
11. Centralized Logging, Cluster Management Tools, and Node Health Checks (Section 10 of recipe.sh)
#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
# Example Installation Script Template
# This convenience script encapsulates command-line instructions highlighted in
# an OpenHPC Install Guide that can be used as a starting point to perform a local
# cluster install beginning with bare-metal. Necessary inputs that describe local
# hardware characteristics, desired network settings, and other customizations
# are controlled via a companion input file that is used to initialize variables
# within this script.
# Please see the OpenHPC Install Guide(s) for more information regarding the
# procedure. Note that the section numbering included in this script refers to
# corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
if [ ! -e ${inputFile} ];then
echo "Error: Unable to access local input file -> ${inputFile}"
exit 1
else
. ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# -------------------------------------------------------
# Configure rsyslog on SMS and computes (Section 3.8.4.7)
# -------------------------------------------------------
echo 'module(load="imudp")' >> /etc/rsyslog.d/ohpc.conf
echo 'input(type="imudp" port="514")' >> /etc/rsyslog.d/ohpc.conf
systemctl restart rsyslog
echo "*.* action(type=\"omfwd\" Target=\"${sms_ip}\" Port=\"514\" Protocol=\"udp\")" >> $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^\*\.info/\\#\*\.info/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^authpriv/\\#authpriv/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^mail/\\#mail/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^cron/\\#cron/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^uucp/\\#uucp/" $CHROOT/etc/rsyslog.conf
if [[ ${enable_clustershell} -eq 1 ]];then
# Install clustershell
dnf -y install clustershell
cd /etc/clustershell/groups.d
mv local.cfg local.cfg.orig
echo "adm: ${sms_name}" > local.cfg
echo "compute: ${compute_prefix}[1-${num_computes}]" >> local.cfg
echo "all: @adm,@compute" >> local.cfg
fi
if [[ ${enable_genders} -eq 1 ]];then
# Install genders
dnf -y install genders-ohpc
echo -e "${sms_name}\tsms" > /etc/genders
for ((i=0; i<$num_computes; i++)) ; do
echo -e "${c_name[$i]}\tcompute,bmc=${c_bmc[$i]}"
done >> /etc/genders
fi
if [[ ${enable_magpie} -eq 1 ]];then
# Install magpie
dnf -y install magpie-ohpc
fi
# Optionally, enable conman and configure
if [[ ${enable_ipmisol} -eq 1 ]];then
dnf -y install conman-ohpc
for ((i=0; i<$num_computes; i++)) ; do
echo -n 'CONSOLE name="'${c_name[$i]}'" dev="ipmi:'${c_bmc[$i]}'" '
echo 'ipmiopts="'U:${bmc_username},P:${IPMI_PASSWORD:-undefined},W:solpayloadsize'"'
done >> /etc/conman.conf
systemctl enable conman
systemctl start conman
fi
# Optionally, enable nhc and configure
dnf -y install nhc-ohpc
dnf -y --installroot=$CHROOT install nhc-ohpc
echo "HealthCheckProgram=/usr/sbin/nhc" >> /etc/slurm/slurm.conf
echo "HealthCheckInterval=300" >> /etc/slurm/slurm.conf # execute every five minutes
# Optionally, update compute image to support geopm
if [[ ${enable_geopm} -eq 1 ]];then
export kargs="${kargs} intel_pstate=disable"
fi
if [[ ${enable_geopm} -eq 1 ]];then
dnf -y --installroot=$CHROOT install kmod-msr-safe-ohpc
dnf -y --installroot=$CHROOT install msr-safe-ohpc
dnf -y --installroot=$CHROOT install msr-safe-slurm-ohpc
fi

Clarifications
1. Configuration of rsyslog for log management
echo 'module(load="imudp")' >> /etc/rsyslog.d/ohpc.conf
echo 'input(type="imudp" port="514")' >> /etc/rsyslog.d/ohpc.conf
systemctl restart rsyslog

This enables rsyslog to listen on UDP port 514, facilitating centralized log collection on the management server (SMS).
echo "*.* action(type=\"omfwd\" Target=\"${sms_ip}\" Port=\"514\" Protocol=\"udp\")" >> $CHROOT/etc/rsyslog.conf

This configures the compute nodes to forward their logs to the management server (sms_ip).
perl -pi -e "s/^\*\.info/\\#\*\.info/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^authpriv/\\#authpriv/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^mail/\\#mail/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^cron/\\#cron/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^uucp/\\#uucp/" $CHROOT/etc/rsyslog.conf

These substitutions comment out local logging of several categories (info, authpriv, mail, cron, uucp) on the compute nodes, avoiding duplicate local writes now that everything is forwarded to the SMS.
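The quoting in the forwarding echo is easy to misread, so here is a sketch that expands it with a sample SMS address (192.168.70.41 is the example value used elsewhere in this guide) to show the exact line the compute nodes receive:

```shell
sms_ip="192.168.70.41"   # example SMS address

# Expands to the single rsyslog action line appended to rsyslog.conf
fwd="*.* action(type=\"omfwd\" Target=\"${sms_ip}\" Port=\"514\" Protocol=\"udp\")"
echo "$fwd"
```

On a booted node, `logger "hello from $(hostname)"` should then produce a matching entry in the SMS logs.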
2. Installation of Clustershell (centralized management of commands on nodes)
if [[ ${enable_clustershell} -eq 1 ]]; then
dnf -y install clustershell
cd /etc/clustershell/groups.d
mv local.cfg local.cfg.orig
echo "adm: ${sms_name}" > local.cfg
echo "compute: ${compute_prefix}[1-${num_computes}]" >> local.cfg
echo "all: @adm,@compute" >> local.cfg
fi

- If enable_clustershell=1, install ClusterShell, a tool that executes commands simultaneously across multiple nodes.
- Configure the local.cfg file to define node groups: adm for the SMS server and compute for the compute nodes.
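The generated local.cfg can be previewed with sample values standing in for the input.local variables (the names below are illustrative):

```shell
# Sample values standing in for the input.local variables
sms_name="master-ohpc"
compute_prefix="c"
num_computes=4

# Same three lines the recipe writes to /etc/clustershell/groups.d/local.cfg
cfg=$(mktemp)
echo "adm: ${sms_name}" > "$cfg"
echo "compute: ${compute_prefix}[1-${num_computes}]" >> "$cfg"
echo "all: @adm,@compute" >> "$cfg"
cat "$cfg"
```

With ClusterShell installed, a command such as `clush -g compute uptime` would then fan out to every node in the compute group.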
3. Installation of Genders, a tool for managing node groups within a cluster.
if [[ ${enable_genders} -eq 1 ]]; then
dnf -y install genders-ohpc
echo -e "${sms_name}\tsms" > /etc/genders
for ((i=0; i<$num_computes; i++)) ; do
echo -e "${c_name[$i]}\tcompute,bmc=${c_bmc[$i]}"
done >> /etc/genders
fi

- If enable_genders=1, install Genders, a tool used to classify nodes based on their roles.
- Populate the /etc/genders file by associating each compute node with its BMC address, which is used for out-of-band management via IPMI.
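The loop above is easy to dry-run with hypothetical node and BMC arrays (the names and addresses below are examples only, as input.local would define them):

```shell
# Example node and BMC arrays, as input.local would define them
sms_name="master-ohpc"
c_name=(c1 c2)
c_bmc=(192.168.70.51 192.168.70.52)
num_computes=2

# Same generation logic as the recipe, into a scratch file
genders=$(mktemp)
echo -e "${sms_name}\tsms" > "$genders"
for ((i=0; i<$num_computes; i++)) ; do
  echo -e "${c_name[$i]}\tcompute,bmc=${c_bmc[$i]}"
done >> "$genders"
cat "$genders"
```

Once /etc/genders is populated, `nodeattr -c compute` lists the compute nodes as a comma-separated set.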
4. Installation of Magpie, a collection of scripts for running big data frameworks in HPC environments.
if [[ ${enable_magpie} -eq 1 ]]; then
dnf -y install magpie-ohpc
fi

If enable_magpie=1, install Magpie, which provides scripts for running big data software (such as Hadoop and Spark) under HPC schedulers like Slurm.
5. Configuration of IPMI (power management and serial console control of nodes)
if [[ ${enable_ipmisol} -eq 1 ]]; then
dnf -y install conman-ohpc
for ((i=0; i<$num_computes; i++)) ; do
echo -n 'CONSOLE name="'${c_name[$i]}'" dev="ipmi:'${c_bmc[$i]}'" '
echo 'ipmiopts="'U:${bmc_username},P:${IPMI_PASSWORD:-undefined},W:solpayloadsize'"'
done >> /etc/conman.conf
systemctl enable conman
systemctl start conman
fi

If enable_ipmisol=1, install and configure ConMan, a tool that provides access to node serial consoles via IPMI serial-over-LAN.
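The conman.conf generation can also be previewed with sample values; the node names, BMC addresses, and password below are hypothetical, and in a real run IPMI_PASSWORD would be exported before invoking the script:

```shell
# Hypothetical values for illustration
c_name=(c1 c2)
c_bmc=(192.168.70.51 192.168.70.52)
num_computes=2
bmc_username="admin"
IPMI_PASSWORD="secret"

# Same CONSOLE-line generation as the recipe, into a scratch file
conman=$(mktemp)
for ((i=0; i<$num_computes; i++)) ; do
  echo -n 'CONSOLE name="'${c_name[$i]}'" dev="ipmi:'${c_bmc[$i]}'" '
  echo 'ipmiopts="'U:${bmc_username},P:${IPMI_PASSWORD:-undefined},W:solpayloadsize'"'
done >> "$conman"
cat "$conman"
```

Note the `${IPMI_PASSWORD:-undefined}` expansion: if the variable is not exported, the literal string undefined lands in conman.conf, which is a common reason console access silently fails.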
6. Installation and configuration of NHC (Node Health Check)
dnf -y install nhc-ohpc
dnf -y --installroot=$CHROOT install nhc-ohpc
echo "HealthCheckProgram=/usr/sbin/nhc" >> /etc/slurm/slurm.conf
echo "HealthCheckInterval=300" >> /etc/slurm/slurm.conf # execute every five minutes

- Install nhc-ohpc, a tool dedicated to monitoring the health status of compute nodes.
- Configure Slurm to perform node health checks every 5 minutes.
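NHC reads its checks from /etc/nhc/nhc.conf, which the recipe does not populate, so the checks themselves still need to be defined. A minimal illustrative fragment follows; the check names come from the standard NHC distribution, and the targets and thresholds are placeholders to adapt:

```
# Illustrative /etc/nhc/nhc.conf fragment -- adjust targets and thresholds
* || check_fs_mount_rw -f /tmp
* || check_ps_service -u root -S sshd
* || check_hw_physmem_free 1mb
```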
7. Activation of GEOPM, an energy management tool for HPC environments.
if [[ ${enable_geopm} -eq 1 ]]; then
export kargs="${kargs} intel_pstate=disable"
fi

If enable_geopm=1, disable intel_pstate, the kernel's CPU frequency driver for Intel processors, so that GEOPM can take control of power management.
if [[ ${enable_geopm} -eq 1 ]]; then
dnf -y --installroot=$CHROOT install kmod-msr-safe-ohpc
dnf -y --installroot=$CHROOT install msr-safe-ohpc
dnf -y --installroot=$CHROOT install msr-safe-slurm-ohpc
fi

Install the kernel module and tooling required by GEOPM (kmod-msr-safe-ohpc, msr-safe-ohpc, msr-safe-slurm-ohpc) into the compute image.
This script is essential to finalize the configuration of an OpenHPC cluster by automating log management, node access, and monitoring.
It is not mandatory to enable all these variables. Their activation depends on your needs and the architecture of your cluster.
Guide to Decide Which Variables to Enable (1) or Disable (0)
| Variable | Description | Enable (set to 1) if… | Disable (set to 0) if… |
|---|---|---|---|
| enable_clustershell | Installs ClusterShell (execute commands on multiple nodes simultaneously) | You want to run commands simultaneously on all nodes | You prefer to manage nodes individually |
| enable_genders | Installs Genders (node classification) | You want to organize nodes by role (e.g., compute, storage) | You don't need advanced node management |
| enable_magpie | Installs Magpie (big data frameworks on HPC) | You want to run big data workloads (e.g., Hadoop, Spark) on the cluster | You don't run big data workloads |
| enable_ipmisol | Installs ConMan (serial console management via IPMI) | Your nodes support IPMI and you want remote console access | Your nodes don't support IPMI |
| enable_geopm | Installs GEOPM (energy management for HPC) | You want to optimize Intel CPU power consumption | You don't manage advanced power consumption |
Recommendations Based on Your Usage
If your cluster is small and you want a minimal installation:
enable_clustershell=1 # Useful for managing multiple nodes
enable_genders=0 # Not necessarily needed if few nodes
enable_magpie=0 # Only if you run big data workloads (Hadoop/Spark)
enable_ipmisol=0 # Only if your nodes have IPMI
enable_geopm=0 # Only for energy optimization

If your cluster is large and complex:
enable_clustershell=1
enable_genders=1
enable_magpie=1
enable_ipmisol=1
enable_geopm=1

Conclusion: enable only what is necessary for your environment. If in doubt, start with 0 and activate options progressively as needed.
Script Execution
Last metadata expiration check: 2:41:57 ago on Wed Feb 10 01:17:22 2025.
Dependencies resolved.
=================================================================================
Package Architecture Version Repository Size
=================================================================================
Installing:
clustershell noarch 1.9.2-1.el9 epel 159 k
Installing dependencies:
python3-clustershell noarch 1.9.2-1.el9 epel 268 k
Transaction Summary
=================================================================================
Install 2 Packages
Total download size: 427 k
Installed size: 1.9 M
[1/2]: python3-clustershell-1.9.2-1.el9.noarch.rpm 2.1 MB/s | 206 kB 00:00
[2/2]: clustershell-1.9.2-1.el9.noarch.rpm 2.1 MB/s | 158 kB 00:00
--------------------------------------------------------------------------------
Total 389 kB/s | 364 kB 00:01
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Preparing : 1/1
Installing : python3-clustershell-1.9.2-1.el9.noarch 1/2
Running scriptlet: clustershell-1.9.2-1.el9.noarch 1/2
Installing : clustershell-1.9.2-1.el9.noarch 2/2
Running scriptlet: python3-clustershell-1.9.2-1.el9.noarch 2/2
Verifying : python3-clustershell-1.9.2-1.el9.noarch 1/2
Verifying : clustershell-1.9.2-1.el9.noarch 2/2
Installed:
clustershell-1.9.2-1.el9.noarch python3-clustershell-1.9.2-1.el9.noarch
Last metadata expiration check: 2:44:06 ago on Wed Feb 10 01:17:22 2025.
Dependencies resolved.
=================================================================================
Package Architecture Version Repository Size
=================================================================================
Installing:
nhc-ohpc noarch 1.4.3-300.ohpc.1.2 OpenHPC 64 k
Transaction Summary
=================================================================================
Install 1 Package
Total download size: 64 k
Installed size: 179 k
Downloading Packages:
nhc-ohpc-1.4.3-300.ohpc.1.2.noarch.rpm 58 kB/s | 64 kB 00:01
--------------------------------------------------------------------------------
Total 58 kB/s | 64 kB 00:01
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Preparing : 1/1
Installing : nhc-ohpc-1.4.3-300.ohpc.1.2.noarch 1/1
Running scriptlet: nhc-ohpc-1.4.3-300.ohpc.1.2.noarch 1/1
Verifying : nhc-ohpc-1.4.3-300.ohpc.1.2.noarch 1/1
Installed:
nhc-ohpc-1.4.3-300.ohpc.1.2.noarch
Complete!
config error: error parsing '': given path '' is not absolute.

Analysis of this output:
1. Installation of ClusterShell
- Installed packages:
- clustershell-1.9.2-1.el9.noarch
- python3-clustershell-1.9.2-1.el9.noarch
- Download details:
- Total download size: 427 kB
- Installation completed successfully without errors
2. Installation of NHC (Node Health Check)
- Installed package:
- nhc-ohpc-1.4.3-300.ohpc.1.2.noarch
- Download details:
- Total download size: 64 kB
- Installation completed successfully without errors
3. Problem detected at the end:
config error: error parsing '': given path '' is not absolute.
- Interpretation:
- The error message indicates a configuration problem related to an empty path ('').
- This may be an issue with NHC (nhc-ohpc), as it uses a configuration file to define the verification script paths.
12. Importing Configuration Files into Warewulf Overlays (Section 11 of recipe.sh)
#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
# Example Installation Script Template
# This convenience script encapsulates command-line instructions highlighted in
# an OpenHPC Install Guide that can be used as a starting point to perform a local
# cluster install beginning with bare-metal. Necessary inputs that describe local
# hardware characteristics, desired network settings, and other customizations
# are controlled via a companion input file that is used to initialize variables
# within this script.
# Please see the OpenHPC Install Guide(s) for more information regarding the
# procedure. Note that the section numbering included in this script refers to
# corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
if [ ! -e ${inputFile} ];then
echo "Error: Unable to access local input file -> ${inputFile}"
exit 1
else
. ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# ----------------------------
# Import files (Section 3.8.5)
# ----------------------------
wwctl overlay import generic /etc/subuid
wwctl overlay import generic /etc/subgid
echo "server ${sms_ip} iburst" | wwctl overlay import generic <(cat) /etc/chrony.conf
wwctl overlay mkdir generic /etc/sysconfig/
wwctl overlay import generic <(echo SLURMD_OPTIONS="--conf-server ${sms_ip}") /etc/sysconfig/slurmd
wwctl overlay mkdir generic --mode 0700 /etc/munge
wwctl overlay import generic /etc/munge/munge.key
wwctl overlay chown generic /etc/munge/munge.key $(id -u munge) $(id -g munge)
wwctl overlay chown generic /etc/munge $(id -u munge) $(id -g munge)
if [[ ${enable_ipoib} -eq 1 ]];then
wwctl overlay mkdir generic /etc/sysconfig/network-scripts/
wwctl overlay import generic /opt/ohpc/pub/examples/network/centos/ifcfg-ib0.ww /etc/sysconfig/network-scripts/ifcfg-ib0.ww
fi

This Bash script is part of an automated installation of OpenHPC, a software stack designed for managing High Performance Computing (HPC) clusters.
It is intended to serve as an installation template, based on the official OpenHPC guide.
This script handles the configuration of several critical components of the cluster:
- Importing system files into the overlay (UID management, Chrony configuration)
- Setting up Slurm and the Munge authentication service
- (Optional) Network setup for InfiniBand if enabled
1. Importing files into overlays (Section 3.8.5)
wwctl overlay import generic /etc/subuid
wwctl overlay import generic /etc/subgid
echo "server ${sms_ip} iburst" | wwctl overlay import generic <(cat) /etc/chrony.conf

wwctl overlay import copies specific files into an overlay (a file layer distributed to cluster nodes).
/etc/subuid and /etc/subgid: Define UID/GID ranges for users, commonly used with containers.
/etc/chrony.conf: Configures Chrony for time synchronization. ${sms_ip} is the master serverβs IP.
Purpose: Provide compute nodes with necessary system files for consistent configuration.
Verification
[root@master-ohpc /]# wwctl overlay cat generic /etc/chrony.conf
server 192.168.70.41 iburst

2. Creating and configuring other system files
wwctl overlay mkdir generic /etc/sysconfig/
wwctl overlay import generic <(echo SLURMD_OPTIONS="--conf-server ${sms_ip}") /etc/sysconfig/slurmd

- wwctl overlay mkdir creates the /etc/sysconfig/ directory in the overlay.
- wwctl overlay import adds a file with the Slurm option: SLURMD_OPTIONS="--conf-server ${sms_ip}"
Purpose: Allow compute nodes to get their Slurm configuration from the master server, and prepare the Slurm configuration, the workload and job scheduling system used in OpenHPC.
Verification
[root@master-ohpc /]# wwctl overlay cat generic /etc/sysconfig/slurmd
SLURMD_OPTIONS=--conf-server 192.168.70.41

3. Munge Configuration (Slurm authentication)
wwctl overlay mkdir generic --mode 0700 /etc/munge
wwctl overlay import generic /etc/munge/munge.key
wwctl overlay chown generic /etc/munge/munge.key $(id -u munge) $(id -g munge)
wwctl overlay chown generic /etc/munge $(id -u munge) $(id -g munge)

Munge is the authentication service used by Slurm.
These commands:
Create /etc/munge with restricted permissions (0700)
Import the shared key munge.key
Set the file and directory ownership to the munge user for security
Purpose:
Secure authentication for communication between server and nodes.
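The `--mode 0700` option mirrors a plain `mkdir -m 0700`; a quick local sketch shows the resulting permission bits (using GNU stat, as on EL9):

```shell
# Local dry-run of the restricted directory creation
mungedir=$(mktemp -d)/munge
mkdir -m 0700 "$mungedir"

# %a prints the octal permission bits
stat -c '%a' "$mungedir"
```

munged refuses to start if munge.key or /etc/munge is readable by other users, which is why the mode and ownership steps matter.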
4. Network setup using InfiniBand (optional)
if [[ ${enable_ipoib} -eq 1 ]]; then
wwctl overlay mkdir generic /etc/sysconfig/network-scripts/
wwctl overlay import generic /opt/ohpc/pub/examples/network/centos/ifcfg-ib0.ww /etc/sysconfig/network-scripts/ifcfg-ib0.ww
fi

InfiniBand is a high-performance network technology used in HPC.
If enable_ipoib = 1, the script:
Creates the configuration directory
Imports the file ifcfg-ib0.ww to enable InfiniBand on interface ib0
Purpose:
Allow compute nodes to communicate via InfiniBand with minimal latency and maximum throughput.
13. Assembling the Bootstrap Image and Adding Compute Nodes (Section 12 of recipe.sh)
#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
# Example Installation Script Template
# This convenience script encapsulates command-line instructions highlighted in
# an OpenHPC Install Guide that can be used as a starting point to perform a local
# cluster install beginning with bare-metal. Necessary inputs that describe local
# hardware characteristics, desired network settings, and other customizations
# are controlled via a companion input file that is used to initialize variables
# within this script.
# Please see the OpenHPC Install Guide(s) for more information regarding the
# procedure. Note that the section numbering included in this script refers to
# corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
if [ ! -e ${inputFile} ];then
echo "Error: Unable to access local input file -> ${inputFile}"
exit 1
else
. ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# --------------------------------------
# Assemble bootstrap image (Section 3.9)
# --------------------------------------
wwctl container build rocky-9.4
wwctl overlay build
# Add hosts to cluster
for ((i=0; i<$num_computes; i++)) ; do
wwctl node add --container=rocky-9.4 \
--ipaddr=${c_ip[$i]} --hwaddr=${c_mac[$i]} ${c_name[i]}
done
wwctl overlay build
wwctl configure --all
# Enable and start munge and slurmctld (Cont.)
systemctl enable --now munge
systemctl enable --now slurmctld
# Optionally, add arguments to bootstrap kernel
if [[ ${enable_kargs} -eq 1 ]]; then
wwctl node set --yes --kernelargs="${kargs}" "${compute_regex}"
fi
This script is designed to automate the installation and configuration of an OpenHPC cluster.
It begins by preparing the environment using a local configuration file, then builds the container image, adds compute nodes to the cluster, applies the required configuration, and starts essential services. Finally, it offers the option to add kernel arguments if needed.
1. Creating the Boot Image and Overlays
- Line 14: Uses the command wwctl container build rocky-9.4 to build a container image based on Rocky Linux 9.4, a Linux distribution commonly used for clusters.
- Line 15: Builds the overlays (additional layers of configuration files) using wwctl overlay build.
2. Adding Hosts to the Cluster
- Lines 17–20: This loop adds compute nodes to the cluster.
For each compute node (the count is defined by the variable num_computes), it assigns:
- an IP address ${c_ip[$i]},
- a MAC address ${c_mac[$i]}, and
- a node name ${c_name[$i]}.
The wwctl node add command registers each node in the cluster with this configuration.
3. Rebuilding the Overlay and Applying Configuration
- Line 21: Rebuilds the overlays using wwctl overlay build after all nodes have been added.
- Line 22: Applies the configuration cluster-wide using wwctl configure --all.
4. Starting and Enabling Services
- Lines 25–26: The munge and slurmctld services are enabled and started:
- munge is an authentication service used to secure communication within HPC clusters.
- slurmctld is the central controller for SLURM, the resource manager and job scheduler.
5. (Optional) Adding Kernel Arguments
- Lines 29–31: If the variable enable_kargs is set to 1, the script adds custom kernel arguments to the compute nodes using wwctl node set.
This can be used to pass advanced boot parameters to the Linux kernel.
Note:
This script does not reinstall the Rocky Linux OS on your nodes.
Instead, it adds specific cluster-related configurations (such as munge, slurm, and networking).
If your nodes are already properly set up with IP addresses and base services, this script will integrate them into the cluster without altering their existing installation.
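To see how the loop in Section 12 consumes the companion input file, here is a minimal dry run. The variable names mirror the recipe's conventions, but the addresses and MAC values below are made-up illustrative values, not a template to copy:

```shell
# Hypothetical input.local fragment (illustrative values only): these are
# the variables the Section 12 node-add loop consumes.
num_computes=2
c_ip=(192.168.70.51 192.168.70.52)
c_mac=(aa:bb:cc:00:00:51 aa:bb:cc:00:00:52)
c_name=(compute1 compute2)

# Dry run: echo the wwctl command the loop would execute for each node,
# a safe way to review the values before touching Warewulf.
for ((i=0; i<num_computes; i++)); do
  echo "wwctl node add --container=rocky-9.4 --ipaddr=${c_ip[$i]} --hwaddr=${c_mac[$i]} ${c_name[$i]}"
done
```

Reviewing the echoed commands before running the real loop catches off-by-one IP or MAC assignments early.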
14. Section 13 of the recipe.sh script
#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
# Example Installation Script Template
# This convenience script encapsulates command-line instructions highlighted in
# an OpenHPC Install Guide that can be used as a starting point to perform a local
# cluster install beginning with bare-metal. Necessary inputs that describe local
# hardware characteristics, desired network settings, and other customizations
# are controlled via a companion input file that is used to initialize variables
# within this script.
# Please see the OpenHPC Install Guide(s) for more information regarding the
# procedure. Note that the section numbering included in this script refers to
# corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
if [ ! -e ${inputFile} ];then
echo "Error: Unable to access local input file -> ${inputFile}"
exit 1
else
. ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# ---------------------------------
# Boot compute nodes (Section 3.10)
# ---------------------------------
for ((i=0; i<${num_computes}; i++)) ; do
ipmitool -I lanplus -H ${c_bmc[$i]} -U ${bmc_username} -P ${bmc_password} chassis power reset
done
This script is part of the installation process for an HPC cluster using OpenHPC. It reboots the compute nodes via IPMI so that they boot into the newly provisioned image.
The error encountered when executing this script is as follows:
[root@master-ohpc /]# ./recipe13.sh
Unable to read password from environment
Chassis Power Control: Reset
Unable to read password from environment
Solution:
In your script, you are using:
ipmitool -E -I lanplus -H ${c_bmc[$i]} -U ${bmc_username} -P ${bmc_password} chassis power cycle
The -E option tells ipmitool to read the password from the IPMI_PASSWORD environment variable, which is not set here.
But you are also passing -P ${bmc_password}, which supplies the password directly, so the two options conflict.
If your script already loads the bmc_password variable from /input.local, modify the line as follows:
ipmitool -I lanplus -H ${c_bmc[$i]} -U ${bmc_username} -P ${bmc_password} chassis power cycle
This will force the use of the password defined in bmc_password.
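Alternatively, if you prefer to keep the password off the command line, you can keep -E and export the environment variable it reads (IPMI_PASSWORD, per the ipmitool man page). A sketch with a placeholder password:

```shell
# Keep -E and supply the password via the environment instead of -P.
# "calvin" is only a placeholder value for this sketch.
bmc_password="calvin"
export IPMI_PASSWORD="${bmc_password}"
# With IPMI_PASSWORD set, the -E form no longer fails with
# "Unable to read password from environment":
# ipmitool -E -I lanplus -H 192.168.201.51 -U root chassis power status
echo "IPMI_PASSWORD exported for ipmitool -E"
```

This keeps the credential out of process listings, at the cost of it being visible in the environment of child processes.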
Result
[root@master-ohpc /]# ./recipe13.sh
Chassis Power Control: Reset
Chassis Power Control: Reset
[root@master-ohpc /]# nano recipe13.sh
[root@master-ohpc /]# ipmitool -I lanplus -H 192.168.201.51 -U root -P calvin chassis power status
Chassis Power is on
[root@master-ohpc /]# ipmitool -I lanplus -H 192.168.201.52 -U root -P calvin chassis power status
Chassis Power is on
[root@master-ohpc /]# ping 192.168.201.51
PING 192.168.201.51 (192.168.201.51) 56(84) bytes of data.
64 bytes from 192.168.201.51: icmp_seq=1 ttl=63 time=0.284 ms
64 bytes from 192.168.201.51: icmp_seq=2 ttl=63 time=0.302 ms
--- 192.168.201.51 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1037ms
rtt min/avg/max/mdev = 0.284/0.293/0.302/0.009 ms
[root@master-ohpc /]# ping 192.168.70.51
PING 192.168.70.51 (192.168.70.51) 56(84) bytes of data.
64 bytes from 192.168.70.51: icmp_seq=1 ttl=64 time=0.451 ms
64 bytes from 192.168.70.51: icmp_seq=2 ttl=64 time=0.390 ms
64 bytes from 192.168.70.51: icmp_seq=3 ttl=64 time=0.399 ms
64 bytes from 192.168.70.51: icmp_seq=4 ttl=64 time=0.392 ms
64 bytes from 192.168.70.51: icmp_seq=5 ttl=64 time=0.339 ms
--- 192.168.70.51 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 5118ms
rtt min/avg/max/mdev = 0.339/0.394/0.451/0.032 ms
[root@master-ohpc /]# ping 192.168.70.52
PING 192.168.70.52 (192.168.70.52) 56(84) bytes of data.
64 bytes from 192.168.70.52: icmp_seq=1 ttl=64 time=0.337 ms
64 bytes from 192.168.70.52: icmp_seq=2 ttl=64 time=0.398 ms
64 bytes from 192.168.70.52: icmp_seq=3 ttl=64 time=0.398 ms
64 bytes from 192.168.70.52: icmp_seq=4 ttl=64 time=0.381 ms
15. Section 14 of the recipe.sh script
#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
# Example Installation Script Template
# This convenience script encapsulates command-line instructions highlighted in
# an OpenHPC Install Guide that can be used as a starting point to perform a local
# cluster install beginning with bare-metal. Necessary inputs that describe local
# hardware characteristics, desired network settings, and other customizations
# are controlled via a companion input file that is used to initialize variables
# within this script.
# Please see the OpenHPC Install Guide(s) for more information regarding the
# procedure. Note that the section numbering included in this script refers to
# corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
if [ ! -e ${inputFile} ];then
echo "Error: Unable to access local input file -> ${inputFile}"
exit 1
else
. ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# ---------------------------------------
# Install Development Tools (Section 4.1)
# ---------------------------------------
dnf -y install ohpc-autotools
dnf -y install EasyBuild-ohpc
dnf -y install hwloc-ohpc
dnf -y install spack-ohpc
dnf -y install valgrind-ohpc
# -------------------------------
# Install Compilers (Section 4.2)
# -------------------------------
dnf -y install gnu14-compilers-ohpc
# --------------------------------
# Install MPI Stacks (Section 4.3)
# --------------------------------
if [[ ${enable_mpi_defaults} -eq 1 ]];then
dnf -y install openmpi5-pmix-gnu14-ohpc mpich-ofi-gnu14-ohpc
fi
if [[ ${enable_ib} -eq 1 ]];then
dnf -y install mvapich2-gnu14-ohpc
fi
if [[ ${enable_opa} -eq 1 ]];then
dnf -y install mvapich2-psm2-gnu14-ohpc
fi
# ---------------------------------------
# Install Performance Tools (Section 4.4)
# ---------------------------------------
dnf -y install ohpc-gnu14-perf-tools
if [[ ${enable_geopm} -eq 1 ]];then
dnf -y install ohpc-gnu14-geopm
fi
dnf -y install lmod-defaults-gnu14-openmpi5-ohpc
# ---------------------------------------------------
# Install 3rd Party Libraries and Tools (Section 4.6)
# ---------------------------------------------------
dnf -y install ohpc-gnu14-serial-libs
dnf -y install ohpc-gnu14-io-libs
dnf -y install ohpc-gnu14-python-libs
dnf -y install ohpc-gnu14-runtimes
if [[ ${enable_mpi_defaults} -eq 1 ]];then
dnf -y install ohpc-gnu14-mpich-parallel-libs
dnf -y install ohpc-gnu14-openmpi5-parallel-libs
fi
if [[ ${enable_ib} -eq 1 ]];then
dnf -y install ohpc-gnu14-mvapich2-parallel-libs
fi
if [[ ${enable_opa} -eq 1 ]];then
dnf -y install ohpc-gnu14-mvapich2-parallel-libs
fi
# ----------------------------------------
# Install Intel oneAPI tools (Section 4.7)
# ----------------------------------------
if [[ ${enable_intel_packages} -eq 1 ]];then
dnf -y install intel-oneapi-toolkit-release-ohpc
rpm --import https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
dnf -y install intel-compilers-devel-ohpc
dnf -y install intel-mpi-devel-ohpc
if [[ ${enable_opa} -eq 1 ]];then
dnf -y install mvapich2-psm2-intel-ohpc
fi
dnf -y install openmpi5-pmix-intel-ohpc
dnf -y install ohpc-intel-serial-libs
dnf -y install ohpc-intel-geopm
dnf -y install ohpc-intel-io-libs
dnf -y install ohpc-intel-perf-tools
dnf -y install ohpc-intel-python3-libs
dnf -y install ohpc-intel-mpich-parallel-libs
dnf -y install ohpc-intel-mvapich2-parallel-libs
dnf -y install ohpc-intel-openmpi5-parallel-libs
dnf -y install ohpc-intel-impi-parallel-libs
fi
# -------------------------------------------------------------
# Allow for optional sleep to wait for provisioning to complete
# -------------------------------------------------------------
sleep ${provision_wait}
This script runs successfully without generating any errors. The resulting output is:
Dependencies resolved.
==============================================================================================================================================================================================================
Package Architecture Version Repository Size
==============================================================================================================================================================================================================
Installing:
ohpc-autotools x86_64 3.2-320.ohpc.1.1 OpenHPC-updates 6.9 k
Installing dependencies:
autoconf-ohpc x86_64 2.71-300.ohpc.2.6 OpenHPC 953 k
automake-ohpc x86_64 1.16.5-300.ohpc.2.5 OpenHPC 806 k
libtool-ohpc x86_64 2.4.6-300.ohpc.1.5 OpenHPC 680 k
m4 x86_64 1.4.19-1.el9 appstream 294 k
perl-Thread-Queue noarch 3.14-460.el9 appstream 21 k
perl-threads x86_64 1:2.25-460.el9 appstream 57 k
perl-threads-shared x86_64 1.61-460.el9.0.1 appstream 44 k
Transaction Summary
==============================================================================================================================================================================================================
Install 8 Packages
Total download size: 2.8 M
Installed size: 12 M
Downloading Packages:
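Stepping back from the transcript: the install section above repeats a single gating pattern, where each enable_* toggle from input.local is tested with [[ ... -eq 1 ]] before the matching dnf group is installed. A dry-run sketch of that gating (echoing instead of installing, with made-up flag values):

```shell
# Toggle values are illustrative; in practice they come from input.local.
enable_mpi_defaults=1
enable_ib=0

# Echo the dnf command instead of running it, so the gating is visible.
if [[ ${enable_mpi_defaults} -eq 1 ]]; then
  echo "dnf -y install openmpi5-pmix-gnu14-ohpc mpich-ofi-gnu14-ohpc"
fi
if [[ ${enable_ib} -eq 1 ]]; then
  echo "dnf -y install mvapich2-gnu14-ohpc"
fi
```

With these values, only the OpenMPI/MPICH line is printed; flipping enable_ib to 1 would also select the InfiniBand-specific MVAPICH2 stack.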
16. Section 15 of the recipe.sh script
#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
# Example Installation Script Template
# This convenience script encapsulates command-line instructions highlighted in
# an OpenHPC Install Guide that can be used as a starting point to perform a local
# cluster install beginning with bare-metal. Necessary inputs that describe local
# hardware characteristics, desired network settings, and other customizations
# are controlled via a companion input file that is used to initialize variables
# within this script.
# Please see the OpenHPC Install Guide(s) for more information regarding the
# procedure. Note that the section numbering included in this script refers to
# corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------
inputFile=${OHPC_INPUT_LOCAL:-/input.local}
if [ ! -e ${inputFile} ];then
echo "Error: Unable to access local input file -> ${inputFile}"
exit 1
else
. ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# ------------------------------------
# Resource Manager Startup (Section 5)
# ------------------------------------
systemctl enable munge
systemctl enable slurmctld
systemctl start munge
systemctl start slurmctld
pdsh -w ${compute_prefix}[1-${num_computes}] systemctl start munge
pdsh -w ${compute_prefix}[1-${num_computes}] systemctl start slurmd
# Optionally, generate nhc config
pdsh -w c1 "/usr/sbin/nhc-genconf -H '*' -c -" | dshbak -c
useradd -m test
wwctl overlay build
sleep 90
The script execution output shows errors, as you can see below:
compute1: Warning: Permanently added 'compute1' (ED25519) to the list of known hosts.
compute2: Warning: Permanently added 'compute2' (ED25519) to the list of known hosts.
compute1: Permission denied, please try again.
compute2: Permission denied, please try again.
compute1: Permission denied, please try again.
compute2: Permission denied, please try again.
compute1: root@compute1: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
pdsh@master-ohpc: compute1: ssh exited with exit code 255
compute2: root@compute2: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
pdsh@master-ohpc: compute2: ssh exited with exit code 255
compute1: Permission denied, please try again.
compute2: Permission denied, please try again.
compute1: Permission denied, please try again.
compute1: root@compute1: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
pdsh@master-ohpc: compute1: ssh exited with exit code 255
compute2: Permission denied, please try again.
compute2: root@compute2: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
pdsh@master-ohpc: compute2: ssh exited with exit code 255
c1: ssh: Could not resolve hostname c1: Name or service not known
pdsh@master-ohpc: c1: ssh exited with exit code 255
Building system overlays for compute1: [wwinit]
Created image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img
Compressed image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img.gz
Building runtime overlays for compute1: [generic]
Created image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img
Compressed image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img.g
Building system overlays for compute2: [wwinit]
Created image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img
Compressed image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img.gz
Building runtime overlays for compute2: [generic]
Created image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img
Compressed image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img.g
We will proceed to resolve the errors identified :
[root@master-ohpc /]# ls -l /root/.ssh/id_rsa.pub
-rw-r--r-- 1 root root 554 Feb 10 05:38 /root/.ssh/id_rsa.pub
[root@master-ohpc /]# ssh-copy-id root@compute1
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already in
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the key(s)
root@compute1's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'root@compute1'"
and check to make sure that only the key(s) you wanted were added.
[root@master-ohpc /]# ssh-copy-id root@compute2
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already in
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the key(s)
root@compute2's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'root@compute2'"
and check to make sure that only the key(s) you wanted were added.
The user copies their SSH public key (id_rsa.pub) from the master node (master-ohpc) to the two remote compute nodes (compute1 and compute2) using the ssh-copy-id command. This sets up passwordless SSH login from the master to the compute nodes for the root user. After this setup, the user can log in via SSH without entering a password.
[root@master-ohpc /]# ssh 192.168.70.51
root@192.168.70.51's password:
Last failed login: Tue Feb 25 16:37:14 CET 2025 from 192.168.70.41 on ssh:notty
There were 4 failed login attempts since the last successful login.
Last login: Tue Feb 25 16:02:39 2025 from 192.168.70.41
[root@compute1 ~]# cat /root/.ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA4BqobqbyAfYhDZ2avy1ALtyCxt9xURo3mh2hZj/FeUgfasyb8FERML1KMRveu9FxKz/w4Pkw
PRc9WbN6t4uB4b+4dDd3bY+GsA3tL6d8ysEkb+y8HsH4hzAe+2cpE1fxEmkgOvJo0t5zCDAbqEmsJ1Nsit3U1k9CK2ZZM3t9Gac/PRkwu
kPskAl0W2Po+C1kdoA98FrbAbh3byr9QsVaMEvLR2djHgZu0ukBeAv3t4K9Qoys1tLFSL0c0h7r4dd30sJv8NGdwqK+c2b0bf4LvkB3J
[root@compute1 ~]# exit
logout
Connection to 192.168.70.51 closed.
[root@master-ohpc /]# ssh 192.168.70.52
root@192.168.70.52's password:
Last failed login: Tue Feb 25 15:37:14 +00 2025 from 192.168.70.41 on ssh:notty
There were 4 failed login attempts since the last successful login.
Last login: Tue Feb 4 15:17:38 2025 from 192.168.70.41
[root@compute2 ~]# cat /root/.ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA4BqobqbyAfYhDZ2avy1ALtyCxt9xURo3mh2hZj/FeUgfasyb8FERML1KMRveu9FxKz/w4Pkw
PRc9WbN6t4uB4b+4dDd3bY+GsA3tL6d8ysEkb+y8HsH4hzAe+2cpE1fxEmkgOvJo0t5zCDAbqEmsJ1Nsit3U1k9CK2ZZM3t9Gac/PRkwu
kPskAl0W2Po+C1kdoA98FrbAbh3byr9QsVaMEvLR2djHgZu0ukBeAv3t4K9Qoys1tLFSL0c0h7r4dd30sJv8NGdwqK+c2b0bf4LvkB3J
[root@compute2 ~]# exit
logout
Connection to 192.168.70.52 closed.
In this step, the user verifies that the SSH public key has been successfully copied to the authorized_keys file on both compute nodes (192.168.70.51 and 192.168.70.52).
They connect to each node using SSH as root, check the contents of /root/.ssh/authorized_keys, and confirm that the correct public key has been added.
This ensures that passwordless SSH access is now functional from the master node to the compute nodes.
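A quick non-interactive way to confirm key-based access (typing a password at the prompt can mask a key problem) is ssh with BatchMode, which fails instead of prompting. It is shown here only as a command string, since it has to be run on the master against a live node:

```shell
# BatchMode=yes forbids password prompts: if this succeeds, the key is
# actually being used; if it fails, key auth is broken.
check_cmd="ssh -o BatchMode=yes root@compute1 true && echo key-auth-ok"
echo "$check_cmd"
```

Running the printed command on the master should emit key-auth-ok once the keys are correctly installed.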
We then make changes to the /etc/ssh/sshd_config file to adjust the SSH server settings:
[root@compute2 ~]# vi /etc/ssh/sshd_config
[root@compute2 ~]# systemctl restart sshd
The following changes were made to the /etc/ssh/sshd_config file:
#LoginGraceTime 2m
PermitRootLogin yes
#StrictModes yes
#MaxAuthTries 6
# To disable tunneled clear text passwords, change to no here!
PasswordAuthentication yes
#PermitEmptyPasswords no
PubkeyAuthentication yes
# but this is overridden so installations will only check .ssh/authorized_keys
AuthorizedKeysFile .ssh/authorized_keys
Even after setting PubkeyAuthentication to yes, pdsh still failed with permission errors:
[root@master-ohpc /]# pdsh -w compute1 uptime
compute1: Permission denied, please try again.
compute1: Permission denied, please try again.
compute1: root@compute1: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
pdsh@master-ohpc: compute1: ssh exited with exit code 255
[root@master-ohpc /]#
COMPUTE1: The content of ~/.ssh/authorized_keys is:
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDb0AYq/Hv9ZDavylAIU7cXt9xuRa03m2hW2j/FeUgFasv9bEFERmL1KMReuyFxJKz/w48wSI76QlM2RPRMA0yYHgYnRpGDzSqt3wTCG6ouvqK0SlGvqZk9BowlJBltOa4nwAoty7I2hnUDTwjmMLtegTbnDvqAmh+G/Wi3RHRr0cWUO1BtbQNlx0R3oXYbAI3Q3xrl6dg3byelcky+B+sHdWaZ1e+2CpfElkm6qUw9OHlDPTZ3CCbAq0emU51nSti3UU2KC2zzMi9acFPkuwJSqeajRMTaJKYQZIIowLiMk/RyED2HinJfcjyECMIH/mIiP+1ekWVr6BRfqLL4cE+G7OTi4yQTcG/0BM/4p0KfpJl4IdbpWuYYkPSla3WO2oP+CL4dOA94FEbWAd3Ybg+qYsaVaMEGWlR2djHgAzUOUkbeAv3t4JKQ0ys1tLFISLo0h7r4dd3O5u/8NGdWqK+2cDb8f4lxKb3JI0PjHvtJg0AUAiQNk9az3rMt22PkS00=
Master (SMS): The output of cat /root/.ssh/cluster.pub is:
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDLvl1s4VJ0A0SQHyGkymiARFuvSNi+tUqEJHVUGMMu1utjNmT803Q89RwzPM5B+//eLjH97Rz62tQo9PtgOmdVdk/kCgCj13AtBVaDK+jkFnXDzRW7fQXjHNPp3/CpNhuWPGMSwiVdGa6+g2NJ+HpJdsnPP/FSrRrFjyHUudUFQ9H8LgxMGQSEId4s5hLPZLFNoU3cI7uTa+yPmSLRtYJB0+W50r2n/4JIQpobX4mX+ubCsUPzlePAOVhXcZ9jXpK7Vz37zlQ7aT3nhXqbBIQexVf5SLiXeBdzxhtcM9gPSGJ+1Dxt+ppmwuvVS4Wyr5skQOIXqb2ea6Ff6SXzzZAz+zOIuz1fL280250Qy/pD7FLub3ZOq6HmpKLDycVS3if6XOHwCP/emgPBdUm2os8pSpUOtXI5xd/GP+EjGBqf4YPx59lrfArlKXcxuCQiCBmxr58zQQehQ+Y1rsPnfozi1trEq4YeSJ8/FyYhsAHevxaECOsAsmXPGte0nv1COFk= Warewulf Cluster key
Note: This is the Warewulf cluster public key.
Problem
The two public keys are not identical! This means that the compute node does not have the correct public key from the master node, which can cause authentication failures.
Solution
On the master node, run the following command to correctly copy the right key to compute1:
ssh-copy-id -i /root/.ssh/cluster.pub root@compute1
This will install the proper cluster public key on compute1.
Don't forget to remove the old key from ~/.ssh/authorized_keys on compute1 to avoid conflicts or unauthorized access.
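Comparing long base64 key blobs by eye is error-prone; ssh-keygen -lf prints short fingerprints and accepts both public key files and authorized_keys files. Below is a self-contained demo with a throwaway key; in practice the comparison is between /root/.ssh/cluster.pub on the master and ~/.ssh/authorized_keys on the node:

```shell
# Generate a throwaway key pair in a temp dir and compare fingerprints of
# the public key file and an authorized_keys file that contains it.
d=$(mktemp -d)
ssh-keygen -q -t ed25519 -N "" -f "$d/key"
cp "$d/key.pub" "$d/authorized_keys"
fp_pub=$(ssh-keygen -lf "$d/key.pub" | awk '{print $2}')
fp_auth=$(ssh-keygen -lf "$d/authorized_keys" | awk '{print $2}')
if [ "$fp_pub" = "$fp_auth" ]; then echo "fingerprints match"; fi
rm -rf "$d"
```

If the two fingerprints differ on your cluster, the node is holding a stale key and ssh-copy-id (or a manual edit of authorized_keys) is needed.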
As a result of these modifications, the output now appears as follows:
[root@master-ohpc /]# ./recipe15.sh
compute2: Failed to start munge.service: Unit munge.service not found.
pdsh@master-ohpc: compute2: ssh exited with exit code 5
compute1: Failed to start munge.service: Unit munge.service not found.
pdsh@master-ohpc: compute1: ssh exited with exit code 5
compute1: Failed to start slurmd.service: Unit slurmd.service not found.
pdsh@master-ohpc: compute1: ssh exited with exit code 5
compute2: Failed to start slurmd.service: Unit slurmd.service not found.
pdsh@master-ohpc: compute2: ssh exited with exit code 5
c1: ssh: Could not resolve hostname c1: Name or service not known
pdsh@master-ohpc: c1: ssh exited with exit code 255
useradd: user 'test' already exists
Building system overlays for compute1: [wwinit]
Created image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img
Compressed image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img.gz
Building runtime overlays for compute1: [generic]
Created image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img
Compressed image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img.gz
Building system overlays for compute2: [wwinit]
Created image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img
Compressed image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img.gz
Building runtime overlays for compute2: [generic]
Created image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img
Compressed image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img.gz
Problem
Connection issue from the compute nodes: Network unreachable
Solution
[root@compute2 /]# ip route
169.254.1.0/24 dev idrac proto kernel scope link src 169.254.1.2 metric 101
192.168.70.0/24 dev eth2 proto kernel scope link src 192.168.70.52 metric 100
[root@compute2 /]# nmtui
[root@compute2 /]# ip route add default via 192.168.70.1
[root@compute2 /]# ip route
default via 192.168.70.1 dev eth2
169.254.1.0/24 dev idrac proto kernel scope link src 169.254.1.2 metric 101
192.168.70.0/24 dev eth2 proto kernel scope link src 192.168.70.52 metric 100
[root@compute2 /]# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=116 time=17.2 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=116 time=17.0 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=116 time=16.4 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=116 time=16.4 ms
64 bytes from 8.8.8.8: icmp_seq=6 ttl=116 time=18.9 ms
MUNGE Installation and Keyfile Error Resolution on Compute Nodes
On each node (compute1, compute2), run the following commands:
dnf install -y munge munge-libs munge-devel
systemctl enable --now munge
Then, verify that MUNGE is working properly:
[root@compute2 ~]# systemctl status munge.service
× munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled)
Active: failed (Result: exit-code) since Wed 2025-02-26 14:02:06 +00; 1min 47s ago
Docs: man:munged(8)
Process: 29177 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)
CPU: 4ms
Feb 26 14:02:05 compute2 systemd[1]: Starting MUNGE authentication service...
Feb 26 14:02:06 compute2 munged[29177]: munged: Error: Failed to check keyfile "/etc/munge/munge.key": No s>
Feb 26 14:02:06 compute2 systemd[1]: munge.service: Control process exited, code=exited, status=1/FAILURE
Feb 26 14:02:06 compute2 systemd[1]: munge.service: Failed with result 'exit-code'.
Feb 26 14:02:06 compute2 systemd[1]: Failed to start MUNGE authentication service.
Solution
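One caveat before generating keys locally: MUNGE authenticates with a shared secret, so /etc/munge/munge.key must be byte-identical on the SMS host and every compute node. A key created independently on each node lets munged start, but Slurm will later reject credentials exchanged between nodes; copying the master's key to each node (e.g. with scp) and then fixing ownership and permissions is the safer route. A byte-for-byte comparison check is sketched here with throwaway files standing in for the two keyfiles:

```shell
# Simulate comparing the SMS munge.key with a node's copy using cmp.
master_key=$(mktemp); node_key=$(mktemp)
head -c 128 /dev/urandom > "$master_key"
cp "$master_key" "$node_key"          # as if copied with scp from the SMS
if cmp -s "$master_key" "$node_key"; then keys_match=yes; else keys_match=no; fi
echo "munge keys match: $keys_match"
rm -f "$master_key" "$node_key"
```

On a real cluster the same cmp (or a checksum comparison) between the master's and a node's /etc/munge/munge.key quickly rules out key mismatch as the cause of credential errors.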
[root@compute2 ~]# cd /etc/munge/
[root@compute2 munge]# ls
[root@compute2 munge]# systemctl stop munge
[root@compute2 munge]# create-munge-key
Generating a pseudo-random key using /dev/urandom completed.
[root@compute2 munge]# systemctl start munge
[root@compute2 munge]# chown munge:munge /etc/munge/munge.key
[root@compute2 munge]# chmod 0600 /etc/munge/munge.key
[root@compute2 munge]# systemctl restart munge
[root@compute2 munge]# systemctl status munge
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled)
Active: active (running) since Wed 2025-02-26 14:12:30 +00; 7s ago
Docs: man:munged(8)
Process: 29320 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
Main PID: 29322 (munged)
Tasks: 4 (limit: 202700)
Memory: 1.6M
CPU: 5ms
Once MUNGE is successfully activated on both compute nodes, the resulting output is:
[root@master-ohpc /]# ./recipe15.sh
compute1: Failed to start slurmd.service: Unit slurmd.service not found.
pdsh@master-ohpc: compute1: ssh exited with exit code 5
compute2: Failed to start slurmd.service: Unit slurmd.service not found.
pdsh@master-ohpc: compute2: ssh exited with exit code 5
c1: ssh: Could not resolve hostname c1: Name or service not known
pdsh@master-ohpc: c1: ssh exited with exit code 255
useradd: user 'test' already exists
Building system overlays for compute1: [wwinit]
Created image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img
Compressed image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img.gz
Building runtime overlays for compute1: [generic]
Created image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img
Compressed image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img.gz
Building system overlays for compute2: [wwinit]
Created image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img
Compressed image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img.gz
Building runtime overlays for compute2: [generic]
Created image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img
Compressed image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img.gz
Installing and Enabling slurmd on Each Compute Node
To resolve the SLURM error, run the following commands on each compute node:
dnf install -y slurm slurm-slurmd
systemctl enable --now slurmd
systemctl status slurmd
This installs and starts the SLURM daemon (slurmd), which is required for each compute node to communicate with the SLURM controller and accept jobs.
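Once slurmd is up on the computes, running sinfo on the master should list the nodes in an idle state. As a rough illustration of what a healthy line looks like (the sample values below are hypothetical, not captured from this cluster), its whitespace-separated fields can be picked apart like this:

```shell
# Hypothetical sinfo line: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
set -f                       # keep "*" and "[1-2]" literal during splitting
sample="normal* up 1-00:00:00 2 idle compute[1-2]"
set -- $sample
partition=$1; nodes=$4; state=$5; nodelist=$6
set +f
echo "partition=$partition nodes=$nodes state=$state nodelist=$nodelist"
```

A state of idle means the node registered with slurmctld and is ready for jobs; down or unk* states point back at slurmd, munge, or network problems.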
Once slurmd is enabled and running on the compute nodes, the resulting output is:
[root@master-ohpc /]# ./recipe15.sh
c1: ssh: Could not resolve hostname c1: Name or service not known
pdsh@master-ohpc: c1: ssh exited with exit code 255
useradd: user 'test' already exists
Building system overlays for compute1: [wwinit]
Created image for overlay compute1[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img
Compressed image for overlay compute1[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img.gz
Building runtime overlays for compute1: [generic]
Created image for overlay compute1[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img
Compressed image for overlay compute1[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img.gz
Building system overlays for compute2: [wwinit]
Created image for overlay compute2[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img
Compressed image for overlay compute2[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img.gz
Building runtime overlays for compute2: [generic]
Created image for overlay compute2[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img
Compressed image for overlay compute2[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img.gz
Troubleshooting SSH Connection and Hostname Resolution Issues
- To resolve the hostname resolution error, make sure the following line is present in the /etc/hosts file on the master node:
# /etc/hosts
192.168.70.41 master-ohpc master-ohpc.cluster
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
# Do not edit after this line
# This block is autogenerated by master-ohpc
# Hosts: master-ohpc.cluster
# Time: 02-25-2025 07:33:30 EST
# Source:
# Warewulf Server
192.168.70.41 master-ohpc.cluster master-ohpc
# Entry for compute1
192.168.70.51 compute1.localdomain compute1.localdomain compute1 c1 compute1-default compute1-default
# Entry for compute2
192.168.70.52 compute2.localdomain compute2.localdomain compute2 compute2-default compute2-default
- Installing the OpenHPC repository and verifying enabled repos:
[root@compute1 srv]# dnf install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm
[root@compute1 ~]# dnf repolist
repo id repo name
OpenHPC OpenHPC-3 - Base
OpenHPC-updates OpenHPC-3 - Updates
appstream Rocky Linux 9 - AppStream
baseos Rocky Linux 9 - BaseOS
epel Extra Packages for Enterprise Linux 9 - x86_64
epel-cisco-openh264 Extra Packages for Enterprise Linux 9 openh264 (From Cisco) - x86_64
extras Rocky Linux 9 - Extras
- Verify that the following directory contains nhc-genconf and other components. Then, proceed to build and install Node Health Check (NHC) from source:
[root@compute1 src]# cd /usr/local/src/nhc
[root@compute1 nhc]# ls
COPYING Makefile.am autogen.sh contrib nhc nhc-wrapper scripts
ChangeLog README.md bench helpers nhc-genconf nhc.conf test
LICENSE RELEASE_NOTES.txt configure.ac lbnl-nhc.spec.in nhc-test.conf nhc.logrotate
[root@compute1 nhc]# ./autogen.sh
[root@compute1 nhc]# ./configure
[root@compute1 nhc]# make
[root@compute1 nhc]# make install
- Fix the path to nhc-genconf in recipe15.sh
Original line (incorrect path):
pdsh -w c1 "/usr/sbin/nhc-genconf -H '*' -c -" | dshbak -c
Replace it with the corrected line:
pdsh -w c1 "/usr/local/src/nhc/nhc-genconf -H '*' -c -" | dshbak -c
This ensures pdsh correctly calls the version of nhc-genconf located in the source directory, not the default system path.
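An alternative to patching the script is to install NHC into the system paths so that nhc-genconf lands in /usr/sbin, where the recipe expects it. The configure flags below follow the LBNL NHC project's documented build convention; verify them against the README of the NHC version you checked out before relying on them:

```shell
# Configure line (sketch) that installs NHC under /usr instead of the
# /usr/local defaults; run from the NHC source directory before make.
nhc_configure="./configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/libexec"
echo "$nhc_configure"
```

With NHC installed this way, the original pdsh line calling /usr/sbin/nhc-genconf works unmodified.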
After correcting the path to nhc-genconf in the script (/usr/local/src/nhc/nhc-genconf), running ./recipe15.sh now produces the following output:
c1: /usr/local/src/nhc/nhc-genconf: line 342: nhc_common_unparse_size: command not found
c1: /usr/local/src/nhc/nhc-genconf: line 346: nhc_common_unparse_size: command not found
----------------
c1
----------------
# NHC Configuration File
#
# Lines are in the form "<hostmask>||<check>"
# Hostmask is a glob, /regexp/, or {noderange}
# Comments begin with '#'
#
# This file was automatically generated by nhc-genconf
# Fri Feb 28 13:13:50 CET 2025
#
#######################################################################
###
### NHC Configuration Variables
###
# * || export MARK_OFFLINE=1 NHC_CHECK_ALL=0
#######################################################################
###
### Hardware checks
###
* || check_hw_cpuinfo
* || check_hw_physmem 3%
* || check_hw_swap 3%
#######################################################################
###
### nVidia GPU checks
###
* || check_nv_healthmon
useradd: user 'test' already exists
Building system overlays for compute1: [wwinit]
Created image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img
Compressed image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img.gz
Building runtime overlays for compute1: [generic]
Created image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img
Compressed image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img.gz
Building system overlays for compute2: [wwinit]
Created image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img
Compressed image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img.gz
Building runtime overlays for compute2: [generic]
Created image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img
Compressed image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img.gz
Resolving the error encountered during the execution of recipe15.sh:
c1: /usr/local/src/nhc/nhc-genconf: line 342: nhc_common_unparse_size: command not found c1: /usr/local/src/nhc/nhc-genconf: line 346: nhc_common_unparse_size: command not found
[root@compute1 nhc]# grep -r "nhc_common_unparse_size" /usr/local/src/nhc
/usr/local/src/nhc/scripts/common.nhc:function nhc_common_unparse_size() {
/usr/local/src/nhc/scripts/lbnl_fs.nhc: nhc_common_unparse_size $FS_SIZE FS_SIZE
/usr/local/src/nhc/scripts/lbnl_fs.nhc: nhc_common_unparse_size $MIN_SIZE MIN_SIZE
/usr/local/src/nhc/scripts/lbnl_fs.nhc: nhc_common_unparse_size $FS_SIZE FS_SIZE
/usr/local/src/nhc/scripts/lbnl_fs.nhc: nhc_common_unparse_size $MAX_SIZE MAX_SIZE
/usr/local/src/nhc/scripts/lbnl_fs.nhc: nhc_common_unparse_size $FS_FREE FS_FREE
/usr/local/src/nhc/scripts/lbnl_fs.nhc: nhc_common_unparse_size $MIN_FREE MIN_FREE
/usr/local/src/nhc/scripts/lbnl_fs.nhc: nhc_common_unparse_size $FS_FREE FS_FREE
/usr/local/src/nhc/scripts/lbnl_fs.nhc: nhc_common_unparse_size $FS_USED FS_USED
/usr/local/src/nhc/scripts/lbnl_fs.nhc: nhc_common_unparse_size $MAX_USED MAX_USED
/usr/local/src/nhc/scripts/lbnl_fs.nhc: nhc_common_unparse_size $FS_USED FS_USED
/usr/local/src/nhc/scripts/lbnl_ps.nhc: nhc_common_unparse_size ${PS_RSS[$THIS_PID]} NUM
/usr/local/src/nhc/scripts/lbnl_ps.nhc: nhc_common_unparse_size $THRESHOLD LIM
/usr/local/src/nhc/scripts/lbnl_ps.nhc: nhc_common_unparse_size ${PS_VSZ[$THIS_PID]} NUM
/usr/local/src/nhc/scripts/lbnl_ps.nhc: nhc_common_unparse_size $THRESHOLD LIM
/usr/local/src/nhc/test/test_common.nhc: is "`type -t nhc_common_unparse_size 2>&1`" 'function' 'nhc_common_unparse_size() loaded properly'
/usr/local/src/nhc/test/test_common.nhc: nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc: is "$NSIZE" "1024EB" "nhc_common_unparse_size(): $OSIZE -> 1024EB"
/usr/local/src/nhc/test/test_common.nhc: nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc: is "$NSIZE" "1EB" "nhc_common_unparse_size(): $OSIZE -> 1EB"
/usr/local/src/nhc/test/test_common.nhc: nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc: is "$NSIZE" "1023PB" "nhc_common_unparse_size(): $OSIZE -> 1023PB"
/usr/local/src/nhc/test/test_common.nhc: nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc: is "$NSIZE" "64TB" "nhc_common_unparse_size(): $OSIZE -> 64TB"
/usr/local/src/nhc/test/test_common.nhc: nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc: is "$NSIZE" "4GB" "nhc_common_unparse_size(): $OSIZE -> 4GB"
/usr/local/src/nhc/test/test_common.nhc: nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc: is "$NSIZE" "1023MB" "nhc_common_unparse_size(): $OSIZE -> 1023MB"
/usr/local/src/nhc/test/test_common.nhc: nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc: is "$NSIZE" "1MB" "nhc_common_unparse_size(): $OSIZE -> 1MB"
/usr/local/src/nhc/test/test_common.nhc: nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc: is "$NSIZE" "1000kB" "nhc_common_unparse_size(): $OSIZE -> 1000kB"
/usr/local/src/nhc/test/test_common.nhc: nhc_common_unparse_size $OSIZE NSIZE 1024 ERR
/usr/local/src/nhc/test/test_common.nhc: is "$NSIZE" "1GB" "nhc_common_unparse_size(): $OSIZE -> 1GB with 51MB error (size)"
/usr/local/src/nhc/test/test_common.nhc: is "$ERR" "51" "nhc_common_unparse_size(): $OSIZE -> 1GB with 51MB error (error)"
/usr/local/src/nhc/test/test_common.nhc: nhc_common_unparse_size $OSIZE NSIZE 1024 ERR
/usr/local/src/nhc/test/test_common.nhc: is "$NSIZE" "1177GB" "nhc_common_unparse_size(): $OSIZE -> 1177GB (1.15TB) with 0GB error (size)"
/usr/local/src/nhc/test/test_common.nhc: is "$ERR" "0" "nhc_common_unparse_size(): $OSIZE -> 1177GB (1.15TB) with 0GB error (error)"
/usr/local/src/nhc/test/test_common.nhc: nhc_common_unparse_size $OSIZE NSIZE 1024 ERR
/usr/local/src/nhc/test/test_common.nhc: is "$NSIZE" "1536GB" "nhc_common_unparse_size(): $OSIZE -> 1536GB (1.5TB) with 0GB error (size)"
/usr/local/src/nhc/test/test_common.nhc: is "$ERR" "0" "nhc_common_unparse_size(): $OSIZE -> 1536GB (1.5TB) with 0GB error (error)"
/usr/local/src/nhc/test/test_common.nhc: nhc_common_unparse_size $OSIZE NSIZE 1024 ERR
/usr/local/src/nhc/test/test_common.nhc: is "$NSIZE" "1792kB" "nhc_common_unparse_size(): $OSIZE -> 1792kB (1.75MB) with 0kB error (size)"
/usr/local/src/nhc/test/test_common.nhc: is "$ERR" "0" "nhc_common_unparse_size(): $OSIZE -> 1792kB (1.75MB) with 0kB error (error)"
/usr/local/src/nhc/test/test_common.nhc: nhc_common_unparse_size $OSIZE NSIZE 1024 ERR
/usr/local/src/nhc/test/test_common.nhc: is "$NSIZE" "2PB" "nhc_common_unparse_size(): $OSIZE -> 2PB (1.99PB) with 11TB error (size)"
/usr/local/src/nhc/test/test_common.nhc: is "$ERR" "11" "nhc_common_unparse_size(): $OSIZE -> 2PB (1.99PB) with 11TB error (error)"
/usr/local/src/nhc/nhc-genconf: nhc_common_unparse_size $HW_RAM_TOTAL HW_RAM_TOTAL 1024 ERR
/usr/local/src/nhc/nhc-genconf: nhc_common_unparse_size $HW_SWAP_TOTAL HW_SWAP_TOTAL 1024 ERR
[root@compute1 nhc]# source /usr/local/src/nhc/scripts/common.nhc
[root@compute1 nhc]# cd /usr/local/src/nhc/scripts/
[root@compute1 scripts]# ls
common.nhc lbnl_cmd.nhc lbnl_file.nhc lbnl_hw.nhc lbnl_moab.nhc lbnl_nv.nhc
csc_nvidia_smi.nhc lbnl_dmi.nhc lbnl_fs.nhc lbnl_job.nhc lbnl_net.nhc lbnl_ps.nhc
[root@compute1 scripts]# nhc_common_unparse_size 1024 RAM_SIZE
The output after executing recipe15.sh is now:
pdsh@master-ohpc: /rc/recipets.sh
c1: /usr/local/src/nhc/scripts/common.nhc: line 528: die: command not found
pdsh@master-ohpc: c1: ssh exited with exit code 1
c1
################################################################################
# NHC Configuration File
#
# Lines are in the form "<hostmask>||<check>"
# Hostmask is a glob, /regexp/, or {noderange}
# Comments begin with "#"
#
# This file was automatically generated by nhc-genconf
# Fri Feb 28 13:36:30 CET 2025
################################################################################
## NHC Configuration Variables
# * || export MARK_OFFLINE=1 NHC_CHECK_ALL=0
################################################################################
# Hardware checks
################################################################################
* || check_hw_cpuinfo
useradd: user 'test' already exists
Building system overlays for compute1: [wwinit]
Created image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img
Compressed image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img.gz
Created image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img
Compressed image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img.gz
Building system overlays for compute2: [wwinit]
Created image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img
Compressed image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img.gz
Created image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img
Compressed image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img.gz
################################################################################
Resolving the error encountered during the execution of recipe15.sh:
The error shown in the output c1: /usr/local/src/nhc/scripts/common.nhc: line 528: die: command not found means that the die function used by the common.nhc script is not defined or accessible at runtime.
The die function is commonly used in shell scripts to stop execution and display an error message; if it is not defined, this kind of error occurs.
The die function should be defined somewhere in the script itself or in another sourced file. You should search for its definition in common.nhc or other related files.
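That search can be sketched as follows. The temporary directory stands in for /usr/local/src/nhc on the node, and the stand-in file is hypothetical, so the snippet runs anywhere; on the compute node, point srcdir at the real source tree instead.

```shell
# Temp tree standing in for /usr/local/src/nhc (set srcdir to the real path on the node).
srcdir=$(mktemp -d)
echo 'log "some message"' > "$srcdir/common.nhc"   # hypothetical file with no die() in it

# Look for any die() definition anywhere in the tree; report if none is found.
grep -rn 'die *()' "$srcdir" || echo "die() not defined anywhere under $srcdir"
```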
If the function is missing (which is our case), you can add it at the end of the common.nhc file as follows, to terminate execution with an error message:
die() {
echo "$1"
exit 1
}
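A quick way to see what the added function does is to call it in a subshell, so its exit 1 does not terminate your login shell. This is a local sketch, independent of the NHC files themselves.

```shell
# Same definition as the one added to common.nhc.
die() {
echo "$1"
exit 1
}

# Calling die in a subshell: the message is printed and the subshell exits
# with status 1, while the outer shell keeps running.
( die "fatal: example error" )
echo "subshell exit status: $?"   # prints: subshell exit status: 1
```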
[root@compute1 ~]# cd /usr/local/src/nhc/scripts/
[root@compute1 scripts]# vi common.nhc
# Find system definition for UID range
function nhc_common_get_max_sys_uid() {
local LINE UID_MIN SYS_UID_MAX
MAX_SYS_UID=${MAX_SYS_UID:-99}
if [[ -e "$LOGIN_DEFS_SRC" ]]; then
while read LINE ; do
if [[ "${LINE#UID_MIN}" != "$LINE" ]]; then
UID_MIN=${LINE//[!0-9]/}
elif [[ "${LINE#UID_MAX}" != "$LINE" ]]; then
SYS_UID_MAX=${LINE//[!0-9]/}
break
fi
done < "$LOGIN_DEFS_SRC"
if [[ -n "$SYS_UID_MAX" ]]; then
MAX_SYS_UID=$((SYS_UID_MAX+0))
fi
if [[ -n "$UID_MIN" ]]; then
UID_MIN=$((UID_MIN-1))
fi
if (( MAX_SYS_UID <= 0 )); then
MAX_SYS_UID=99
fi
return 0
else
return 1
fi
}
die() {
echo "$1"
exit 1
}
The output after executing recipe15.sh is now:
pdsh@master-ohpc: c1: ssh exited with exit code 1
----------------
c1
----------------
# NHC Configuration File
#
# Lines are in the form "<hostmask>||<check>"
# Hostmask is a glob, /regexp/, or {noderange}
# Comments begin with '#'
#
# This file was automatically generated by nhc-genconf
# Fri Feb 28 13:40:37 CET 2025
#
#######################################################################
###
### NHC Configuration Variables
###
# * || export MARK_OFFLINE=1 NHC_CHECK_ALL=0
#######################################################################
###
### Hardware checks
###
* || check_hw_cpuinfo
1
useradd: user 'test' already exists
Building system overlays for compute1: [wwinit]
Created image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img
Compressed image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img.gz
Building runtime overlays for compute1: [generic]
Created image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img
Compressed image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img.gz
Building system overlays for compute2: [wwinit]
Created image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img
Compressed image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img.gz
Building runtime overlays for compute2: [generic]
Created image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img
Compressed image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img.gz
Checking that the script runs correctly
Enabling and Starting Services (munge, slurmctld, and slurmd)
The script uses systemctl to enable and start the necessary services (munge and slurmctld) on the master node and on the compute nodes via pdsh.
Verification:
Ensure that the services are successfully started on the master node and compute nodes.
Use systemctl status to verify that the services are running:
systemctl status munge
[root@master-ohpc ~]# systemctl status munge
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled)
Active: active (running) since Wed 2025-02-12 05:13:09 EST; 2 weeks 4 days ago
Docs: man:munged(8)
Main PID: 3279213 (munged)
Tasks: 1 (limit: 48899)
Memory: 4.1M
CPU: 2.229s
CGroup: /system.slice/munge.service
└─3279213 /usr/sbin/munged
Feb 28 07:36:29 master-ohpc.cluster systemd[1]: /usr/lib/systemd/system/munge.service:10: PIDFile= reference...
Feb 28 07:36:29 master-ohpc.cluster systemd[1]: /usr/lib/systemd/system/munge.service:10: PIDFile= reference...
Feb 28 08:09:22 master-ohpc.cluster systemd[1]: /usr/lib/systemd/system/munge.service:10: PIDFile= reference...
Feb 28 08:09:22 master-ohpc.cluster systemd[1]: /usr/lib/systemd/system/munge.service:10: PIDFile= reference...
Feb 28 08:09:22 master-ohpc.cluster systemd[1]: /usr/lib/systemd/system/munge.service:10: PIDFile= reference...
systemctl status slurmctld
[root@master-ohpc ~]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; preset: disabled)
Active: active (running) since Wed 2025-02-12 05:29:09 EST; 2 weeks 4 days ago
Main PID: 2305503 (slurmctld)
Tasks: 8 (limit: 48899)
Memory: 23.1M
CPU: 2min 21.800s
CGroup: /system.slice/slurmctld.service
└─2305503 /usr/sbin/slurmctld -D
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: error: slurm_set_addr: Unable to resolve "compute2"
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: error: slurm_set_addr: Address family not supported
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: error: get_node_addrs: Address to compute2
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: error: slurm_set_addr: Unable to resolve "compute3"
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: error: slurm_set_addr: Address family not supported
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: error: get_node_addrs: Address to compute3
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: Recovered state of 0 reservations
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: slurmctld: backfill scheduling
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: select/cons_tres: prof_cnt = 0
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: Running as primary controller
Fixing the reported problems
- Node Address Configuration Missing in slurm.conf
The configuration in slurm.conf defines the nodes compute1 and compute2, but it lacks explicit IP address declarations (NodeAddr). Since slurmctld is reporting a resolution issue for compute2, it is likely that it cannot resolve its address.
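Before editing slurm.conf, the resolution failure can be confirmed directly; getent queries the same name services slurmctld relies on. The hostname compute2 is taken from the log above; run this on the master node.

```shell
# Either prints the resolved address, or reports the failure; exits 0 in both cases.
getent hosts compute2 || echo "compute2 does not resolve on this host"
```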
Manually add the IP address of each node in /etc/slurm/slurm.conf on master-ohpc:
NodeName=compute1 NodeAddr=192.168.70.51 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
NodeName=compute2 NodeAddr=192.168.70.52 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
Then reload the configuration and restart Slurm:
systemctl restart slurmctld
systemctl restart slurmd
- Create the Log File Manually
If the log file /var/log/slurmctld.log is missing:
touch /var/log/slurmctld.log
chown slurm:slurm /var/log/slurmctld.log
chmod 644 /var/log/slurmctld.log
Check if the /var/log/slurm/ directory exists:
ls -ld /var/log/slurm
If it doesn't exist, create it:
mkdir -p /var/log/slurm
chown slurm:slurm /var/log/slurm
chmod 755 /var/log/slurm
Restart slurmctld to test:
systemctl restart slurmctld
systemctl status slurmctld
- Reset Node States via scontrol
Use the following commands to reset the states of compute nodes:
scontrol update NodeName=compute1 State=DRAIN Reason="Manual reset"
scontrol update NodeName=compute2 State=DRAIN Reason="Manual reset"
scontrol update NodeName=compute1 State=DOWN Reason="Reset"
scontrol update NodeName=compute2 State=DOWN Reason="Reset"
scontrol update NodeName=compute1 State=RESUME
scontrol update NodeName=compute2 State=RESUME
- Add MemoryEnforce=YES in slurm.conf
To enforce memory limits, add the following line to /etc/slurm/slurm.conf on master-ohpc:
MemoryEnforce=YES
The output of systemctl status slurmctld is now as follows:
[root@master-ohpc ~]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; preset: disabled)
Active: active (running) since Mon 2025-03-03 04:50:39 EST; 7min ago
Main PID: 3871952 (slurmctld)
Tasks: 5
Memory: 5.4M
CPU: 95ms
CGroup: /system.slice/slurmctld.service
└─3871952 /usr/sbin/slurmctld -D
Mar 03 04:57:15 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute1 reason set to: Manual reset
Mar 03 04:57:15 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute1 state set to DRAINED*
Mar 03 04:57:15 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute2 reason set to: Manual reset
Mar 03 04:57:15 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute2 state set to DRAINED*
Mar 03 04:57:18 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute1 state set to DOWN*
Mar 03 04:57:18 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute2 state set to DOWN*
Mar 03 04:57:21 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute1 state set to IDLE
Mar 03 04:57:21 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute2 state set to IDLE
Verification of the pdsh command
Let's check this with the following commands:
[root@master-ohpc /]# pdsh -w 192.168.70.51 systemctl status munge
192.168.70.51: ● munge.service - MUNGE authentication service
192.168.70.51: Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled)
192.168.70.51: Active: active (running) since Tue 2025-03-04 12:02:13 CET; 7min ago
192.168.70.51: Docs: man:munged(8)
192.168.70.51: Process: 140959 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
192.168.70.51: Main PID: 140961 (munged)
192.168.70.51: Tasks: 4 (limit: 202700)
192.168.70.51: Memory: 1.4M
192.168.70.51: CPU: 76ms
192.168.70.51: CGroup: /system.slice/munge.service
192.168.70.51: └─140961 /usr/sbin/munged
192.168.70.51: Mar 04 12:02:13 compute1 systemd[1]: Starting MUNGE authentication service...
192.168.70.51: Mar 04 12:02:13 compute1 systemd[1]: Started MUNGE authentication service.
[root@master-ohpc /]# pdsh -w c1 systemctl status slurmd
c1: ● slurmd.service - Slurm node daemon
c1: Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; preset: disabled)
c1: Active: active (running) since Tue 2025-03-04 12:03:08 CET; 7min ago
c1: Main PID: 141050 (slurmd)
c1: Tasks: 2
c1: Memory: 4.2M
c1: CPU: 78ms
c1: CGroup: /system.slice/slurmd.service
c1: └─141050 /usr/sbin/slurmd -D -s
c1: Mar 04 12:03:08 compute1 systemd[1]: Started Slurm node daemon.
c1: Mar 04 12:03:08 compute1 slurmd[141050]: slurmd: slurmd version 22.05.0 started
c1: Mar 04 12:03:08 compute1 slurmd[141050]: slurmd: Started on Tue, 04 Mar 2025 12:03:08
c1: Mar 04 12:03:08 compute1 slurmd[141050]: slurmd: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=3172 TmpDisk=17616 Uptime=504933 CPUSpecList=(null)
[root@master-ohpc /]#
However, this command fails when executed from the nodes:
[root@compute1 ~]# pdsh -w c2 systemctl status munge
pdsh@compute1: c2: connect: Connection refused
[root@compute1 ~]# pdsh -w 192.168.70.52 systemctl status munge
pdsh@compute1: 192.168.70.52: connect: Connection refused
Solution:
[root@compute1 ~]# echo $PDSH_RCMD_TYPE
[root@compute1 ~]# export PDSH_RCMD_TYPE=ssh
[root@compute1 ~]# echo $PDSH_RCMD_TYPE
ssh
[root@compute1 ~]# echo "export PDSH_RCMD_TYPE=ssh" >> ~/.bashrc
source ~/.bashrc
[root@compute1 ~]# pdsh -w 192.168.70.52 systemctl status munge
No such rcmd module "ssh"
[root@compute1 ~]# exit
logout
Connection to c1 closed.
[root@master-ohpc /]# ssh c2
Last login: Tue Mar 4 11:03:20 2025 from 192.168.70.41
[root@compute2 ~]# export PDSH_RCMD_TYPE=ssh
[root@compute2 ~]# echo "export PDSH_RCMD_TYPE=ssh" >> ~/.bashrc
source ~/.bashrc
[root@compute2 ~]# pdsh -w c1 systemctl status munge
No such rcmd module "ssh"
[root@compute2 ~]# exit
logout
Connection to c2 closed.
[root@master-ohpc /]# ssh c1
Last login: Tue Mar 4 12:14:40 2025 from 192.168.70.41
[root@compute1 ~]# pdsh -w 192.168.70.52 systemctl status munge
No such rcmd module "ssh"
[root@compute1 ~]# pdsh -L
2 modules loaded:
Module: rcmd/exec
Author: Mark Grondona <mgrondona@llnl.gov>
Descr: arbitrary command rcmd connect method
Active: yes
Module: rcmd/rsh
Author: Jim Garlick <garlick@llnl.gov>
Descr: BSD rcmd connect method
Active: yes
[root@compute1 ~]# yum install pdsh-rcmd-ssh -y
Last metadata expiration check: 2:45:17 ago on Tue Mar 4 09:35:01 2025.
Dependencies resolved.
===============================================================================================
Package Architecture Version Repository Size
===============================================================================================
Installing:
pdsh-rcmd-ssh x86_64 2.34-7.el9 epel 13 k
Transaction Summary
===============================================================================================
Install 1 Package
Total download size: 13 k
Installed size: 15 k
Downloading Packages:
pdsh-rcmd-ssh-2.34-7.el9.x86_64.rpm 140 kB/s | 13 kB 00:00
-----------------------------------------------------------------------------------------------
Total 6.6 kB/s | 13 kB 00:01
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : pdsh-rcmd-ssh-2.34-7.el9.x86_64 1/1
Running scriptlet: pdsh-rcmd-ssh-2.34-7.el9.x86_64 1/1
Verifying : pdsh-rcmd-ssh-2.34-7.el9.x86_64 1/1
Installed:
pdsh-rcmd-ssh-2.34-7.el9.x86_64
Complete!
[root@compute1 ~]# pdsh -L
3 modules loaded:
Module: rcmd/exec
Author: Mark Grondona <mgrondona@llnl.gov>
Descr: arbitrary command rcmd connect method
Active: yes
Module: rcmd/rsh
Author: Jim Garlick <garlick@llnl.gov>
Descr: BSD rcmd connect method
Active: yes
Module: rcmd/ssh
Author: Jim Garlick <garlick@llnl.gov>
Descr: ssh based rcmd connect method
Active: yes
[root@compute1 ~]# exit
logout
Connection to c1 closed.
[root@master-ohpc /]# ssh c2
Last login: Tue Mar 4 11:17:47 2025 from 192.168.70.41
[root@compute2 ~]# yum install pdsh-rcmd-ssh -y
Last metadata expiration check: 0:13:40 ago on Tue Mar 4 11:07:02 2025.
Dependencies resolved.
===============================================================================================
Package Architecture Version Repository Size
===============================================================================================
Installing:
pdsh-rcmd-ssh x86_64 2.34-7.el9 epel 13 k
Transaction Summary
===============================================================================================
Install 1 Package
Total download size: 13 k
Installed size: 15 k
Downloading Packages:
pdsh-rcmd-ssh-2.34-7.el9.x86_64.rpm 1.3 MB/s | 13 kB 00:00
-----------------------------------------------------------------------------------------------
Total 1.1 kB/s | 13 kB 00:12
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : pdsh-rcmd-ssh-2.34-7.el9.x86_64 1/1
Running scriptlet: pdsh-rcmd-ssh-2.34-7.el9.x86_64 1/1
Verifying : pdsh-rcmd-ssh-2.34-7.el9.x86_64 1/1
Installed:
pdsh-rcmd-ssh-2.34-7.el9.x86_64
Complete!
[root@compute2 ~]# pdsh -L
3 modules loaded:
Module: rcmd/exec
Author: Mark Grondona <mgrondona@llnl.gov>
Descr: arbitrary command rcmd connect method
Active: yes
Module: rcmd/rsh
Author: Jim Garlick <garlick@llnl.gov>
Descr: BSD rcmd connect method
Active: yes
Module: rcmd/ssh
Author: Jim Garlick <garlick@llnl.gov>
Descr: ssh based rcmd connect method
Active: yes
After that, we got this:
[root@compute2 ~]# pdsh -w c1 systemctl status munge
c1: Host key verification failed.
pdsh@compute2: c1: ssh exited with exit code 255
Solution:
The "Host key verification failed" message means ssh could not validate the remote host's key (a first interactive ssh to the node records it in ~/.ssh/known_hosts). In addition, if the private key is missing, you can generate a new SSH key pair on compute1 (do the same on compute2).
[root@compute1 ~]# ssh-keygen -t rsa -b 2048 -f /root/.ssh/id_rsa
This command will generate a new private key (id_rsa) and a public key (id_rsa.pub) in the /root/.ssh/ directory.
Enter passphrase (empty for no passphrase):
When you're prompted for a passphrase, you can leave it empty if you don't want a password protecting the private key.
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:0oTKoULuoyFHKGqBGY5zkrdfu7q8h4bNcZHGOHJrOHQ root@compute1
The key's randomart image is:
+---[RSA 2048]----+
| |
| . |
|.. .o... |
|*=ooEo=o |
|@===o+..S |
|+B+.+ .. |
|++o* +. |
|+ooo=... |
|. .*=o. |
+----[SHA256]-----+
[root@compute1 ~]# cat /root/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDHsEf13pu9VL1pPi7RVyL3R2PujKG76Fr2wdy5B92aw9gfh0FYknnoyCr58U3wkUmcCatT+PRdIj02q2UELpjnnwTJLFXNZG2FSmg14cgW8wC3CI0Hrb/EAuTSJYk/vkAiYFzVNS6UHaclA30o1NaJ/8D9iSECdtbEOuRJP+dSnZ3VJG0Now7S+NBtsCRMW491Sj3qxsyUFl8tZNxNMrdlFdwkPK9gPUynwq+a5fpm0ZUYRdjioRbcTvyVoQLn2j37NZfUafbMn5uv/IHAmoTVph+WwZ3GsYVyYzoYV1RXmUPjnSved4NU7RW7lAltk5F4S1Y4UiN3WLLR/eocr+Lh root@compute1
After generating the new key pair, copy the public key (/root/.ssh/id_rsa.pub) into ~/.ssh/authorized_keys on compute2 under the root user.
[root@compute1 ~]# exit
logout
Connection to c1 closed.
Copy this key, then connect to compute2 and add the public key to the ~/.ssh/authorized_keys file.
[root@compute2 ~]# vi ~/.ssh/authorized_keys
[root@compute2 ~]# exit
logout
Connection to c2 closed.
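Instead of pasting the key by hand, ssh-copy-id root@c2 run from compute1 would append it and fix the permissions in one step. What it does can be sketched locally as follows; the temporary directory stands in for /root/.ssh on the target node, and the key material is a placeholder, not a real key.

```shell
# Temp dir standing in for /root/.ssh on the target node.
sshdir=$(mktemp -d)
pubkey='ssh-rsa AAAAB3Nza...example root@compute1'   # placeholder key material

# Append the key and apply the permissions sshd requires.
printf '%s\n' "$pubkey" >> "$sshdir/authorized_keys"
chmod 700 "$sshdir"
chmod 600 "$sshdir/authorized_keys"
stat -c '%a %n' "$sshdir" "$sshdir/authorized_keys"
```

sshd silently ignores authorized_keys when the directory or file is writable by others, so the chmod calls matter as much as the key copy.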
[root@master-ohpc /]# ssh c1
Last login: Tue Mar 4 13:46:51 2025 from 192.168.70.52
[root@compute1 ~]# ssh c2
Last login: Tue Mar 4 12:46:10 2025 from 192.168.70.41
[root@compute2 ~]# ssh c1
Last login: Tue Mar 4 13:49:31 2025 from 192.168.70.41
[root@compute1 ~]# pdsh -w c2 systemctl status munge
c2: ● munge.service - MUNGE authentication service
c2: Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled)
c2: Active: active (running) since Tue 2025-03-04 11:02:24 +00; 1h 47min ago
c2: Docs: man:munged(8)
c2: Process: 124785 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
c2: Main PID: 124787 (munged)
c2: Tasks: 4 (limit: 202700)
c2: Memory: 1.7M
c2: CPU: 28ms
c2: CGroup: /system.slice/munge.service
c2: └─124787 /usr/sbin/munged
c2:
c2: Mar 04 11:02:24 compute2 systemd[1]: Starting MUNGE authentication service...
c2: Mar 04 11:02:24 compute2 systemd[1]: Started MUNGE authentication service.
[root@master-ohpc /]# pdsh -w c1 systemctl status munge
c1: ● munge.service - MUNGE authentication service
c1: Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled)
c1: Active: active (running) since Tue 2025-03-04 12:02:13 CET; 7min ago
c1: Docs: man:munged(8)
c1: Process: 140959 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
c1: Main PID: 140961 (munged)
c1: Tasks: 4 (limit: 202700)
c1: Memory: 1.4M
c1: CPU: 76ms
c1: CGroup: /system.slice/munge.service
c1: └─140961 /usr/sbin/munged
c1: Mar 04 12:02:13 compute1 systemd[1]: Starting MUNGE authentication service...
c1: Mar 04 12:02:13 compute1 systemd[1]: Started MUNGE authentication service.
[root@master-ohpc /]# pdsh -w c1 systemctl status slurmd
c1: ● slurmd.service - Slurm node daemon
c1: Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; preset: disabled)
c1: Active: active (running) since Tue 2025-03-04 12:03:08 CET; 7min ago
c1: Main PID: 141050 (slurmd)
c1: Tasks: 2
c1: Memory: 4.2M
c1: CPU: 78ms
c1: CGroup: /system.slice/slurmd.service
c1: └─141050 /usr/sbin/slurmd -D -s
c1: Mar 04 12:03:08 compute1 systemd[1]: Started Slurm node daemon.
c1: Mar 04 12:03:08 compute1 slurmd[141050]: slurmd: slurmd version 22.05.0 started
c1: Mar 04 12:03:08 compute1 slurmd[141050]: slurmd: Started on Tue, 04 Mar 2025 12:03:08
c1: Mar 04 12:03:08 compute1 slurmd[141050]: slurmd: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=3172 TmpDisk=17616 Uptime=504933 CPUSpecList=(null)
[root@master-ohpc /]#
Verification: pdsh -w c1 "/usr/local/src/nhc/nhc-genconf -H '*' -c -"
Make sure the nhc-genconf command runs without errors on c1.
[root@master-ohpc ~]# pdsh -w c1 "/usr/local/src/nhc/nhc-genconf -H '*' -c -"
c1: # NHC Configuration File
c1: #
c1: # Lines are in the form "<hostmask>||<check>"
c1: # Hostmask is a glob, /regexp/, or {noderange}
c1: # Comments begin with '#'
c1: #
c1: # This file was automatically generated by nhc-genconf
c1: # Wed Mar 05 13:44:42 CET 2025
c1: #
c1:
c1: ################################################################################
c1: ###
c1: ### NHC Configuration Variables
c1: ###
c1: #* || export MARK_OFFLINE=1 NHC_CHECK_ALL=0
c1:
c1: ################################################################################
c1: ###
c1: ### Hardware checks
c1: ###
c1: * || check_hw_cpuinfo
c1:
pdsh@master-ohpc: c1: ssh exited with exit code 1
The generated configuration stops right after check_hw_cpuinfo, which may explain the previous errors: the local copy of nhc-genconf on compute1 appears to be incomplete, and a working copy is essential for the NHC hardware checks to function properly. Download a fresh nhc-genconf from GitHub (https://github.com/mej/nhc/blob/master/nhc-genconf).
Use the wget or curl command to download the file directly from the command line. wget https://raw.githubusercontent.com/mej/nhc/master/nhc-genconf -O /usr/local/src/nhc/nhc-genconf
[root@compute1 src]# wget https://raw.githubusercontent.com/mej/nhc/master/nhc-genconf -O /usr/local/src/nhc/nhc-genconf
--2025-03-05 15:19:22-- https://raw.githubusercontent.com/mej/nhc/master/nhc-genconf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16594 (16K) [text/plain]
Saving to: '/usr/local/src/nhc/nhc-genconf'
/usr/local/src/nhc/nhc- 100%[==============================>] 16.21K --.-KB/s in 0.004s
2025-03-05 15:19:22 (3.64 MB/s) - '/usr/local/src/nhc/nhc-genconf' saved [16594/16594]
[root@compute1 src]# cd /usr/local/src/nhc/
[root@compute1 nhc]# ls
COPYING Makefile.in automate.cache configure.ac lbnl-nhc.spec.in nhc-wrapper
README helpers branch contrib missing nhc.conf
README.md LICENSE.txt config.log install.sh nhc nhc.cron
aclocal.m4 automate.sh configure scripts nhc-genconf nhc.logrotate
Makefile autogen.sh configure.deps install.sh nhc-wrapper.conf test.nhc

Re-run nhc-genconf for both nodes:
[root@master-ohpc ~]# pdsh -w c1 "/usr/local/src/nhc/nhc-genconf -H '*' -c -"
c1: # NHC Configuration File
c1: #
c1: # Lines are in the form "<hostmask>||<check>"
c1: # Hostmask is a glob, /regexp/, or {noderange}
c1: # Comments begin with '#'
c1: #
c1: # This file was automatically generated by nhc-genconf
c1: # Wed Mar 5 15:21:09 CET 2025
c1: #
c1:
c1: #######################################################################
c1: ###
c1: ### NHC Configuration Variables
c1: ###
c1: # * || export MARK_OFFLINE=1 NHC_CHECK_ALL=0
c1:
c1:
c1: #######################################################################
c1: ###
c1: ### DMI Checks
c1: ###
c1: # * || check_dmi_data_match -h 0x0000 -t 0 "BIOS Information: Version: 1.5.1"
c1: # * || check_dmi_data_match -h 0x0100 -t 1 "System Information: Version: Not Specified"
c1: # * || check_dmi_data_match -h 0x0200 -t 2 "Base Board Information: Version: A02"
c1: # * || check_dmi_data_match -h 0x0300 -t 3 "Chassis Information: Version: Not Specified"
c1: # * || check_dmi_data_match -h 0x0400 -t 4 "Processor Information: Version: Intel(R) Xeon(R) E-2356G CPU @ 3.20GHz"
c1: # * || check_dmi_data_match -h 0x0400 -t 4 "Processor Information: Max Speed: 4000 MHz"
c1: # * || check_dmi_data_match -h 0x0400 -t 4 "Processor Information: Current Speed: 3200 MHz"
c1: # * || check_dmi_data_match -h 0x0700 -t 7 "Cache Information: Speed: Unknown"
c1: # * || check_dmi_data_match -h 0x0701 -t 7 "Cache Information: Speed: Unknown"
c1: # * || check_dmi_data_match -h 0x0702 -t 7 "Cache Information: Speed: Unknown"
c1: # * || check_dmi_data_match -h 0x1100 -t 17 "Memory Device: Speed: 3200 MT/s"
c1: # * || check_dmi_data_match -h 0x1100 -t 17 "Memory Device: Configured Memory Speed: 3200 MT/s"
c1: # * || check_dmi_data_match -h 0x1100 -t 17 "Memory Device: Firmware Version: Not Specified"
c1: # * || check_dmi_data_match -h 0x1101 -t 17 "Memory Device: Speed: 3200 MT/s"
c1: # * || check_dmi_data_match -h 0x1101 -t 17 "Memory Device: Configured Memory Speed: 3200 MT/s"
c1: # * || check_dmi_data_match -h 0x1101 -t 17 "Memory Device: Firmware Version: Not Specified"
c1: # * || check_dmi_data_match -h 0x2600 -t 38 "IPMI Device Information: Specification Version: 2.0"
c1: # * || check_dmi_data_match -h 0x0001 -t 43 "TPM Device: Specification Version: 2.0"
c1: # * || check_dmi_data_match -h 0x0001 -t 43 "TPM Device: Description: TPM 2.0, ManufacturerID: NTC , Firmware Version: 0x00070002.0x0"
c1:
c1:
c1: #######################################################################
c1: ###
c1: ### Filesystem checks
c1: ###
c1: * || check_fs_mount_rw -t "proc" -s "proc" -f "/proc"
c1: * || check_fs_mount_rw -t "sysfs" -s "sysfs" -f "/sys"
c1: * || check_fs_mount_rw -t "devtmpfs" -s "devtmpfs" -f "/dev"
c1: * || check_fs_mount_rw -t "securityfs" -s "securityfs" -f "/sys/kernel/security"
c1: * || check_fs_mount_rw -t "tmpfs" -s "tmpfs" -f "/dev/shm"
c1: * || check_fs_mount_rw -t "devpts" -s "devpts" -f "/dev/pts"
c1: * || check_fs_mount_rw -t "tmpfs" -s "tmpfs" -f "/run"
c1: * || check_fs_mount_rw -t "pstore" -s "pstore" -f "/sys/fs/pstore"
c1: * || check_fs_mount_rw -t "efivarfs" -s "efivarfs" -f "/sys/firmware/efi/efivars"
c1: * || check_fs_mount_rw -t "bpf" -s "bpf" -f "/sys/fs/bpf"
c1: * || check_fs_mount_rw -t "xfs" -s "/dev/mapper/rl-root" -f "/"
c1: * || check_fs_mount_rw -t "selinuxfs" -s "selinuxfs" -f "/sys/fs/selinux"
c1: * || check_fs_mount_rw -t "hugetlbfs" -s "hugetlbfs" -f "/dev/hugepages"
c1: * || check_fs_mount_rw -t "mqueue" -s "mqueue" -f "/dev/mqueue"
c1: * || check_fs_mount_rw -t "debugfs" -s "debugfs" -f "/sys/kernel/debug"
c1: * || check_fs_mount_rw -t "tracefs" -s "tracefs" -f "/sys/kernel/tracing"
c1: * || check_fs_mount_rw -t "fusectl" -s "fusectl" -f "/sys/fs/fuse/connections"
c1: * || check_fs_mount_rw -t "configfs" -s "configfs" -f "/sys/kernel/config"
c1: * || check_fs_mount_ro -t "ramfs" -s "none" -f "/run/credentials/systemd-sysctl.service"
c1: * || check_fs_mount_ro -t "ramfs" -s "none" -f "/run/credentials/systemd-tmpfiles-setup-dev.service"
c1: * || check_fs_mount_rw -t "xfs" -s "/dev/sda2" -f "/boot"
c1: * || check_fs_mount_rw -t "vfat" -s "/dev/sda1" -f "/boot/efi"
c1: * || check_fs_mount_rw -t "xfs" -s "/dev/mapper/rl-home" -f "/home"
c1: * || check_fs_mount_ro -t "ramfs" -s "none" -f "/run/credentials/systemd-tmpfiles-setup.service"
c1: * || check_fs_mount_rw -t "tracefs" -s "tracefs" -f "/sys/kernel/debug/tracing"
c1: * || check_fs_mount_rw -t "tmpfs" -s "tmpfs" -f "/run/user/0"
c1: * || check_fs_used /dev 90%
c1: * || check_fs_used /sys/firmware/efi/efivars 90%
c1: * || check_fs_used / 90%
c1: * || check_fs_free /boot 40MB
c1: * || check_fs_used /boot/efi 90%
c1: * || check_fs_used /home 90%
c1: * || check_fs_iused /dev 100%
c1: * || check_fs_iused / 100%
c1: * || check_fs_iused /boot 100%
c1: * || check_fs_iused /home 98%
c1:
c1:
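The configuration captured above still carries pdsh's `c1: ` prefixes, so it cannot be used as-is. A minimal sketch of turning such output into an installable file; the `/tmp` paths and sample lines are illustrative, and on a real system the cleaned result would typically go to `/etc/nhc/nhc.conf`:

```shell
#!/usr/bin/env bash
# Illustrative sample mimicking a few lines of the pdsh-captured output.
cat > /tmp/nhc-raw.txt <<'EOF'
c1: # NHC Configuration File
c1: # * || check_dmi_data_match -h 0x0000 -t 0 "BIOS Information"
c1: * || check_fs_mount_rw -t "proc" -s "proc" -f "/proc"
c1: * || check_fs_used / 90%
EOF

# Strip the "c1: " prefix, then drop comment and blank lines so only
# the active checks remain.
sed 's/^c1: //' /tmp/nhc-raw.txt | grep -Ev '^[[:space:]]*(#|$)' > /tmp/nhc.conf
cat /tmp/nhc.conf
```

With the sample input above, only the two uncommented `check_fs_*` lines survive into `/tmp/nhc.conf`.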
c1: #######################################################################

Creation of the user test55 on compute1 from the master node:
[root@master-ohpc /]# pdsh -w c1 "useradd -m test55"
[root@master-ohpc /]# ssh c1
Last login: Wed Mar 5 14:51:13 2025 from 192.168.70.41
[root@compute1 ~]# id test55
uid=1002(test55) gid=1002(test55) groups=1002(test55)
[root@compute1 ~]#
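For shared home directories to behave correctly, the UID of an account must match between the master and every compute node. A minimal sketch of that comparison; the `check_uid_match` helper is hypothetical, and on the cluster the second value would come from `ssh c1 id -u test55`:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only when two UID strings are non-empty
# and identical.
check_uid_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Cluster usage (not run here):
#   check_uid_match "$(id -u test55)" "$(ssh c1 id -u test55)"
# Self-contained demo with root, whose UID is 0 on any Linux system:
check_uid_match "$(id -u root)" "0" && echo "UID match"
```

If the UIDs diverge (for example because useradd was run independently on each node), files written on one node will appear owned by the wrong account on the other; creating users on the master and propagating them, as done above with pdsh, avoids this.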