OpenHPC Installation Guide (via recipe.sh)

Author

Abdelhadi Belkziz

Published

July 22, 2025

🚧 This OpenHPC installation guide is under active testing. Content may change.




1. Introduction

We are going to install OpenHPC using the recipe.sh script. To make the installation process easier to understand and to check for errors step by step, we will divide this script into 15 individual sections, executing and verifying each one separately.

2. Initial Environment Validation and Setup

#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
#  Example Installation Script Template
#  This convenience script encapsulates command-line instructions highlighted in
#  an OpenHPC Install Guide that can be used as a starting point to perform a local
#  cluster install beginning with bare-metal. Necessary inputs that describe local
#  hardware characteristics, desired network settings, and other customizations
#  are controlled via a companion input file that is used to initialize variables
#  within this script.
#  Please see the OpenHPC Install Guide(s) for more information regarding the
#  procedure. Note that the section numbering included in this script refers to
#  corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------

inputFile=${OHPC_INPUT_LOCAL:-/input.local}

Specify the exact path where input.local is located. Make sure there is no space between the :- and the / in the default expansion.

if [ ! -e ${inputFile} ];then
   echo "Error: Unable to access local input file -> ${inputFile}"
   exit 1
else
   . ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi

Explanation:

  • If OHPC_INPUT_LOCAL is defined, then inputFile will take its value.
  • Otherwise, it will default to /input.local.
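This fallback uses bash's standard ${VAR:-default} parameter expansion; a minimal standalone illustration (the override path is just a placeholder):

```shell
#!/usr/bin/bash
# ${VAR:-default}: use VAR if it is set and non-empty, otherwise the default.

unset OHPC_INPUT_LOCAL
echo "${OHPC_INPUT_LOCAL:-/input.local}"   # prints /input.local

OHPC_INPUT_LOCAL=/root/my-input.local      # placeholder override
echo "${OHPC_INPUT_LOCAL:-/input.local}"   # prints /root/my-input.local
```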
# ---------------------------- Begin OpenHPC Recipe ---------------------------------------
# Commands below are extracted from an OpenHPC install guide recipe and are intended for
# execution on the master SMS host.
# -----------------------------------------------------------------------------------------

# Verify OpenHPC repository has been enabled before proceeding

dnf repolist | grep -q OpenHPC
if [ $? -ne 0 ];then
   echo "Error: OpenHPC repository must be enabled locally"
   exit 1
fi

It checks whether the OpenHPC repository is enabled using dnf repolist.
If it’s not enabled, an error message is displayed and the installation is stopped.
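The grep -q / $? pattern can be exercised on its own, with made-up input, to see how the exit status drives the check:

```shell
#!/usr/bin/bash
# grep -q prints nothing; it only sets the exit status, which the recipe tests via $?.

printf 'OpenHPC-3 - Base\nRocky Linux 9 - BaseOS\n' | grep -q OpenHPC
echo $?   # 0 -> pattern found, installation may proceed

printf 'Rocky Linux 9 - BaseOS\n' | grep -q OpenHPC
echo $?   # 1 -> pattern missing, the recipe would abort
```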

# Disable firewall
systemctl disable --now firewalld

It immediately disables the firewalld service and prevents it from starting automatically.

Running this section of recipe.sh should produce no output.
If any output appears, it indicates an error.


3. Deployment of Core OpenHPC Packages and Initial Time Synchronization Setup

#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
#  Example Installation Script Template
#  This convenience script encapsulates command-line instructions highlighted in
#  an OpenHPC Install Guide that can be used as a starting point to perform a local
#  cluster install beginning with bare-metal. Necessary inputs that describe local
#  hardware characteristics, desired network settings, and other customizations
#  are controlled via a companion input file that is used to initialize variables
#  within this script.
#  Please see the OpenHPC Install Guide(s) for more information regarding the
#  procedure. Note that the section numbering included in this script refers to
#  corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------

inputFile=${OHPC_INPUT_LOCAL:-/input.local}

if [ ! -e ${inputFile} ];then
   echo "Error: Unable to access local input file -> ${inputFile}"
   exit 1
else
   . ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi

# ------------------------------------------------------------
# Add baseline OpenHPC and provisioning services (Section 3.3)
# ------------------------------------------------------------
dnf -y install ohpc-base warewulf-ohpc hwloc-ohpc

Purpose: installs the essential OpenHPC packages. dnf pulls in the base packages required to set up and manage an HPC cluster.

# Enable NTP services on SMS host
systemctl enable chronyd.service

Enable the NTP service: the chronyd time synchronization service is enabled so that the SMS server stays synchronized with an NTP server.

echo "local stratum 10" >> /etc/chrony.conf
echo "server ${ntp_server}" >> /etc/chrony.conf
echo "allow all" >> /etc/chrony.conf

Configure NTP settings: The chrony.conf configuration file is modified to define a local server, specify an external NTP server, and allow all hosts to synchronize with this server.
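After these three commands, the tail of /etc/chrony.conf should contain lines like the following (the server value comes from ntp_server in input.local; the pool address here is only illustrative):

```
local stratum 10
server 0.pool.ntp.org
allow all
```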

systemctl restart chronyd

Restart the chronyd service: the service is restarted to apply the configuration changes.

Running this script may produce the following error (at least that was the case for me)

 Rocky Linux 9 - BaseOS                                                                                                                                                        0.0  B/s |   0  B     00:01
Errors during downloading metadata for repository 'baseos':
  - Curl error (60): SSL peer certificate or SSH remote key was not OK for https://mirrors.rockylinux.org/mirrorlist?arch=x86_64&repo=BaseOS-9 [SSL certificate problem: certificate is not yet valid]
Error: Failed to download metadata for repo 'baseos': Cannot prepare internal mirrorlist: Curl error (60): SSL peer certificate or SSH remote key was not OK for https://mirrors.rockylinux.org/mirrorlist?arch=x86_64&repo=BaseOS-9 [SSL certificate problem: certificate is not yet valid

Solution: An SSL certificate error may occur if the system date and time are incorrect. If the system clock is significantly ahead or behind, SSL certificates may be considered invalid.

 [root@master-ohpc /]# date
 Wed Feb 5 08:33:12 AM EST 2025

 [root@master-ohpc /]# date -s "2025-02-05 14:37:00"
 Wed Feb 5 02:37:00 PM EST 2025

 [root@master-ohpc /]# date
 Wed Feb 5 02:37:05 PM EST 2025

 [root@master-ohpc /]# ./recipe2.sh

Once this configuration is applied, the script generates the following output:


OpenHPC-3 - Base                                                                                                                                                              698  B/s | 1.5 kB     00:02
OpenHPC-3 - Updates                                                                                                                                                           8.2 kB/s | 3.0 kB     00:00
Extra Packages for Enterprise Linux 9 - x86_64                                                                                                                                4.9 kB/s |  79 kB     00:16
Rocky Linux 9 - BaseOS                                                                                                                                                         12 kB/s | 4.1 kB     00:00
Rocky Linux 9 - AppStream                                                                                                                                                      15 kB/s | 4.5 kB     00:00
Rocky Linux 9 - Extras                                                                                                                                                        2.0 kB/s | 2.9 kB     00:01
Dependencies resolved.
==============================================================================================================================================================================================================
 Package                                               Architecture                          Version                                                     Repository                                      Size
==============================================================================================================================================================================================================
Installing:
 hwloc-ohpc                                            x86_64                                2.11.1-320.ohpc.1.1                                         OpenHPC-updates                                2.4 M
 ohpc-base                                             x86_64                                3.2-320.ohpc.1.1                                            OpenHPC-updates                                7.2 k
 warewulf-ohpc                                         x86_64                                4.5.5-320.ohpc.3.1                                          OpenHPC-updates                                 24 M
Upgrading:

(output truncated)

4. Installing and Configuring Slurm Resource Manager on Master Node

#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
#  Example Installation Script Template
#  This convenience script encapsulates command-line instructions highlighted in
#  an OpenHPC Install Guide that can be used as a starting point to perform a local
#  cluster install beginning with bare-metal. Necessary inputs that describe local
#  hardware characteristics, desired network settings, and other customizations
#  are controlled via a companion input file that is used to initialize variables
#  within this script.
#  Please see the OpenHPC Install Guide(s) for more information regarding the
#  procedure. Note that the section numbering included in this script refers to
#  corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------

inputFile=${OHPC_INPUT_LOCAL:-/input.local}

if [ ! -e ${inputFile} ];then
   echo "Error: Unable to access local input file -> ${inputFile}"
   exit 1
else
   . ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi

# -------------------------------------------------------------
# Add resource management services on master node (Section 3.4)
# -------------------------------------------------------------
dnf -y install ohpc-slurm-server

Installs Slurm, a workload manager for HPC jobs, using dnf -y install to automate the process without requiring confirmation.

cp /etc/slurm/slurm.conf.ohpc /etc/slurm/slurm.conf
cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf

Copies a default Slurm configuration file (slurm.conf.ohpc β†’ slurm.conf) and a sample cgroup configuration file (cgroup.conf.example β†’ cgroup.conf), which is used to limit and isolate CPU and memory resources via cgroups.

perl -pi -e "s/SlurmctldHost=\S+/SlurmctldHost=${sms_name}/" /etc/slurm/slurm.conf

Replaces the SlurmctldHost=… line in /etc/slurm/slurm.conf with SlurmctldHost=${sms_name}, where ${sms_name} is defined in the input.local configuration file. This variable represents the master node of the cluster.

Once this configuration is applied, the script generates the following output:


Last metadata expiration check: 0:55:27 ago on Fri 07 Feb 2025 04:11:23 AM EST.
Dependencies resolved.
==============================================================================================================================================================================================================
 Package                                                   Architecture                          Version                                                 Repository                                      Size
==============================================================================================================================================================================================================
Installing:
 ohpc-slurm-server                                         x86_64                                3.2-320.ohpc.1.1                                        OpenHPC-updates                                7.0 k
Installing dependencies:

(output truncated)

Verification: confirm that the SlurmctldHost=… line in /etc/slurm/slurm.conf has been correctly replaced with SlurmctldHost=${sms_name}.

cd /etc/slurm/
nano slurm.conf
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.

ClusterName=cluster
SlurmctldHost=master-ohpc
#SlurmctldHost=

#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1

5. Updating Slurm Node Configuration in slurm.conf

#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
#  Example Installation Script Template
#  This convenience script encapsulates command-line instructions highlighted in
#  an OpenHPC Install Guide that can be used as a starting point to perform a local
#  cluster install beginning with bare-metal. Necessary inputs that describe local
#  hardware characteristics, desired network settings, and other customizations
#  are controlled via a companion input file that is used to initialize variables
#  within this script.
#  Please see the OpenHPC Install Guide(s) for more information regarding the
#  procedure. Note that the section numbering included in this script refers to
#  corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------

inputFile=${OHPC_INPUT_LOCAL:-/input.local}

if [ ! -e ${inputFile} ];then
   echo "Error: Unable to access local input file -> ${inputFile}"
   exit 1
else
   . ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# ----------------------------------------
# Update node configuration for slurm.conf
# ----------------------------------------
if [[ ${update_slurm_nodeconfig} -eq 1 ]];then

Check whether the variable update_slurm_nodeconfig is set to 1 → this indicates that the node configuration in slurm.conf should be regenerated from the values defined in input.local.

perl -pi -e "s/^NodeName=.+$/#/" /etc/slurm/slurm.conf

Replace all lines starting with NodeName= with a comment (#).

perl -pi -e "s/ Nodes=c\S+ / Nodes=${compute_prefix}[1-${num_computes}] /" /etc/slurm/slurm.conf      

Modify the Slurm node configuration to match the prefixes defined in compute_prefix.

echo -e ${slurm_node_config} >> /etc/slurm/slurm.conf
fi

Add the value of the slurm_node_config variable to the bottom of the slurm.conf configuration file.
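The -e option matters because slurm_node_config typically carries several configuration lines in one variable, separated by \n escapes; a standalone illustration with made-up node values:

```shell
#!/usr/bin/bash
# echo -e turns the embedded \n escapes into real newlines before appending.

slurm_node_config='NodeName=compute[1-2] Sockets=2 CoresPerSocket=12\nPartitionName=normal Nodes=compute[1-2] Default=YES'
echo -e "${slurm_node_config}"
# NodeName=compute[1-2] Sockets=2 CoresPerSocket=12
# PartitionName=normal Nodes=compute[1-2] Default=YES
```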

The script should run silently without generating any output, but it’s important to verify that the modifications have been properly applied to the relevant file.

Verifications

cd /etc/slurm/
nano slurm.conf

The following lines confirm the effect of: echo -e ${slurm_node_config} >> /etc/slurm/slurm.conf

# Enable configless option
SlurmctldParameters=enable_configless

# Setup interactive jobs for salloc
LaunchParameters=use_interactive_step

compute[1-2] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2

The following line confirms the effect of: perl -pi -e "s/ Nodes=c\S+ / Nodes=${compute_prefix}[1-${num_computes}] /" /etc/slurm/slurm.conf

PartitionName=normal Nodes=compute[1-2] Default=YES MaxTime=24:00:00 State=UP OverSubscribe=EXCLUSIVE

# Enable configless option
SlurmctldParameters=enable_configless

The following lines confirm the effect of: perl -pi -e "s/^NodeName=.+$/#/" /etc/slurm/slurm.conf

# COMPUTE NODES
#NodeName=linux[1-32] CPUs=1 State=UNKNOWN
#PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

6. Enabling InfiniBand and Omni-Path Support Services on the Master Node

#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
#  Example Installation Script Template
#  This convenience script encapsulates command-line instructions highlighted in
#  an OpenHPC Install Guide that can be used as a starting point to perform a local
#  cluster install beginning with bare-metal. Necessary inputs that describe local
#  hardware characteristics, desired network settings, and other customizations
#  are controlled via a companion input file that is used to initialize variables
#  within this script.
#  Please see the OpenHPC Install Guide(s) for more information regarding the
#  procedure. Note that the section numbering included in this script refers to
#  corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------

inputFile=${OHPC_INPUT_LOCAL:-/input.local}

if [ ! -e ${inputFile} ];then
   echo "Error: Unable to access local input file -> ${inputFile}"
   exit 1
else
   . ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi

# -----------------------------------------------------------------------
# Optionally add InfiniBand support services on master node (Section 3.5)
# -----------------------------------------------------------------------
if [[ ${enable_ib} -eq 1 ]];then
     dnf -y groupinstall "InfiniBand Support"
     udevadm trigger --type=devices --action=add
     systemctl restart rdma-load-modules@infiniband.service
fi

# Optionally enable opensm subnet manager
if [[ ${enable_opensm} -eq 1 ]];then
     dnf -y install opensm
     systemctl enable opensm
     systemctl start opensm
fi

# Optionally enable IPoIB interface on SMS
if [[ ${enable_ipoib} -eq 1 ]];then
     # Enable ib0
     cp /opt/ohpc/pub/examples/network/centos/ifcfg-ib0 /etc/sysconfig/network-scripts
     perl -pi -e "s/master_ipoib/${sms_ipoib}/" /etc/sysconfig/network-scripts/ifcfg-ib0
     perl -pi -e "s/ipoib_netmask/${ipoib_netmask}/" /etc/sysconfig/network-scripts/ifcfg-ib0
     echo "[main]"   >  /etc/NetworkManager/conf.d/90-dns-none.conf
     echo "dns=none" >> /etc/NetworkManager/conf.d/90-dns-none.conf
     systemctl start NetworkManager
fi

# ----------------------------------------------------------------------
# Optionally add Omni-Path support services on master node (Section 3.6)
# ----------------------------------------------------------------------
if [[ ${enable_opa} -eq 1 ]];then
     dnf -y install opa-basic-tools
fi

# Optionally enable OPA fabric manager
if [[ ${enable_opafm} -eq 1 ]];then
     dnf -y install opa-fm
     systemctl enable opafm
     systemctl start opafm
fi

This section conditionally enables support for high-performance networking on the master node, including InfiniBand, IP over InfiniBand (IPoIB), Omni-Path Architecture (OPA), and the associated subnet/fabric managers, based on configuration variables.
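Each branch is gated by a flag sourced from input.local; on a cluster without InfiniBand or Omni-Path hardware they can all remain at 0 (the values below are illustrative, not the file's required contents):

```
enable_ib=0        # InfiniBand support services
enable_opensm=0    # opensm subnet manager
enable_ipoib=0     # IPoIB interface on the SMS
enable_opa=0       # Omni-Path basic tools
enable_opafm=0     # OPA fabric manager
```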

[root@master-ohpc /]# chmod +x ./recipe5.sh
[root@master-ohpc /]# ./recipe5.sh

The script recipe5.sh is made executable with chmod +x, then executed without producing any output, indicating that it likely ran successfully and silently.

7. Completing Warewulf Master Node Configuration for Cluster Provisioning

#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
#  Example Installation Script Template
#  This convenience script encapsulates command-line instructions highlighted in
#  an OpenHPC Install Guide that can be used as a starting point to perform a local
#  cluster install beginning with bare-metal. Necessary inputs that describe local
#  hardware characteristics, desired network settings, and other customizations
#  are controlled via a companion input file that is used to initialize variables
#  within this script.
#  Please see the OpenHPC Install Guide(s) for more information regarding the
#  procedure. Note that the section numbering included in this script refers to
#  corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------

inputFile=${OHPC_INPUT_LOCAL:-/input.local}

if [ ! -e ${inputFile} ];then
   echo "Error: Unable to access local input file -> ${inputFile}"
   exit 1
else
   . ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi

# -----------------------------------------------------------
# Complete basic Warewulf setup for master node (Section 3.7)
# -----------------------------------------------------------
ip link set dev ${sms_eth_internal} up
ip address add ${sms_ip}/${internal_netmask} broadcast + dev ${sms_eth_internal}
perl -pi -e "s/ipaddr:.*/ipaddr: ${sms_ip}/" /etc/warewulf/warewulf.conf
perl -pi -e "s/netmask:.*/netmask: ${internal_netmask}/" /etc/warewulf/warewulf.conf
perl -pi -e "s/network:.*/network: ${internal_network}/" /etc/warewulf/warewulf.conf
perl -pi -e 's/template:.*/template: static/' /etc/warewulf/warewulf.conf
perl -pi -e "s/range start:.*/range start: ${c_ip[0]}/" /etc/warewulf/warewulf.conf
perl -pi -e "s/range end:.*/range end: ${c_ip[$((num_computes-1))]}/" /etc/warewulf/warewulf.conf
perl -pi -e "s/mount: false/mount: true/" /etc/warewulf/warewulf.conf
wwctl profile set -y default --netmask=${internal_netmask}
wwctl profile set -y default --gateway=${ipv4_gateway}
wwctl profile set -y default --netdev=default --nettagadd=DNS=${dns_servers}
perl -pi -e "s/warewulf/${sms_name}/" /srv/warewulf/overlays/host/rootfs/etc/hosts.ww
perl -pi -e "s/warewulf/${sms_name}/" /srv/warewulf/overlays/generic/rootfs/etc/hosts.ww
echo "next-server ${sms_ip};" >> /srv/warewulf/overlays/host/rootfs/etc/dhcpd.conf.ww
systemctl enable --now warewulfd
wwctl configure --all
bash /etc/profile.d/ssh_setup.sh

# Update /etc/hosts template to have ${hostname}.localdomain as the first host entry
sed -e 's_\({{$node.Id.Get}}{{end}}\)_{{$node.Id.Get}}.localdomain \1_g' -i /srv/warewulf/overlays/host/rootfs/etc/hosts.ww

This script sets up the internal network interface, updates the Warewulf configuration to reflect the cluster topology and networking, and prepares the services that provision the compute nodes.

Step-by-step summary of the script
  • Configures the internal network interface of the master server (sms_eth_internal), assigning it an IP address, netmask, and bringing it up.

  • Modifies the warewulf.conf file:

    • Sets the IP address, netmask, network, and static network template

    • Configures the IP range for compute nodes (range start / end)

    • Enables file system mounting

  • Configures the default Warewulf profile with:

    • The netmask

    • The gateway address

    • The default network device and DNS servers

  • Customizes the hosts.ww and dhcpd.conf.ww files to reflect the master server’s name (sms_name) and add the next-server directive in the DHCP config.

  • Enables and configures the warewulfd service:

    • Starts and enables the Warewulf daemon

    • Applies the configuration with wwctl configure --all

  • Runs the default SSH configuration script (ssh_setup.sh)

  • Updates the /etc/hosts template for the nodes to include hostname.localdomain as the first host entry.

Verifications

To verify that the configuration was applied correctly, open the file /etc/warewulf/warewulf.conf and check that the values match the expected settings:

WW_INTERNAL: 45
ipaddr: 192.168.70.41
netmask: 255.255.255.0
gateway: 192.168.70.1
nameserv:
  port: 9873
  secure: false
  update_interval: 60
  autobuild_overlays: true
  host_overlay: true
  base: static
  datastore: /usr/share
  grubboot: false
dhcp:
  enabled: true
  template: static
  range_start: 192.168.70.51
  range_end: 192.168.70.52
  systemd_name: dhcpd
tftp:
  enabled: true
  tftproot: /srv/tftpboot
  systemd_name: tftp
  ipxe:
    "00:00": undionly.kpxe
    "00:07": ipxe-snponly-x86_64.efi
    "00:09": ipxe-snponly-x86_64.efi
    "00:0B": arm64-efi/snponly.efi
nfs:
  enabled: true
  export_paths:
    - path: /home
      export_options: rw,sync
      mount_options: defaults
      mount: true
    - path: /opt
      export_options: ro,sync,no_root_squash
      mount_options: defaults
      mount: true
  systemd_name: nfs-server
ssh:
  key_types:
    - rsa
    - dsa
    - ecdsa
    - ed25519
  container_mounts:
    - source: /etc/resolv.conf
    - source: /etc/localtime
      readonly: true
paths:
  bindir: /usr/bin
  sysconfdir: /etc

To verify that the DHCP configuration has been applied correctly, open the dhcpd.conf.ww file and check the relevant settings:

# Pure BIOS clients will get iPXE configuration
filename "http://${s.ipaddr}:${s.Warewulf.Port}/ipxe/${mac:hexhyp}";

# EFI clients will get shim and grub instead
filename "warewulf/shim.efi";

} elsif substring (option vendor-class-identifier, 0, 10) = "HTTPClient" {
filename "http://${s.ipaddr}:${s.Warewulf.Port}/efiboot.img";
} else {
# iPXE vendor-class and option 175 = "iPXE" {
filename "http://${s.ipaddr}:${s.Warewulf.Port}/ipxe/${mac:hexhyp}?assetkey=${asset}&u
} else {
{{range $type,$name := $.Tftp.IpxeBinaries }}
if option architecture-type = {{ $type }} {
filename "/warewulf/{{ basename $name }}";
}
{{end}}{{/* range IpxeBinaries */}}

{{end}}{{/* BootMethod */}}

subnet {{$.Network}} netmask {{$.Netmask}} {
max-lease-time 120;
{{- if ne .Dhcp.Template "static" }}
range {{$.Dhcp.RangeStart}} {{$.Dhcp.RangeEnd}};
next-server {{.Ipaddr}};
{{end}}
}

{{- if eq .Dhcp.Template "static" }}
{{- range $nodes := $.AllNodes}}
{{- range $devs := $.netDevs }}
host {{$nodes.Id.Get}}-{{$netname}}
{{- if $netdevs.Hwaddr.Defined}}
hardware ethernet {{$netdevs.Hwaddr.Get}};
{{- end}}
{{- if $netdevs.Ipaddr.Defined}}
fixed-address {{$netdevs.Ipaddr.Get}};
{{- end }}
{{- if $netdevs.Primary.GetB}}
option host-name "{{$nodes.Id.Get}}";
{{- end }}
}

{{end }}{{/* range NetDevs */}}
{{end }}{{/* range AllNodes */}}
{{end }}{{/* if static */}}
}
{{abort}}
}
{{- end}}{{/* dhcp enabled */}}
{{- end}}{{/* primary */}}
next-server 192.168.70.41;

Review the file /srv/warewulf/overlays/generic/rootfs/etc/hosts.ww to ensure that hostnames and IP mappings have been properly configured for the compute nodes:

127.0.0.1    localhost localhost.localdomain localhost4 localhost4.localdomain4
::1          localhost localhost.localdomain localhost6 localhost6.localdomain6

# Warewulf Server
{{$.Ipaddr}} {{$.BuildHost}} master-ohpc

{{- range $node := $.AllNodes}}                    {{/* for each node */}}
{{- range $netname,$netdevs := $node.NetDevs}} {{/* for each network device on the node */}}
{{- if $netdevs.OnThisNetwork $.Network}}       {{/* only if this device has an IP address on this network */}}
{{$netdevs.Ipaddr.Get}} {{$node.Id.Get}}-{{$netname}} # {{$node.Comment.Print}} if this is the primary */}}

8. Creating the Compute Node Image for Warewulf

#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
#  Example Installation Script Template
#  This convenience script encapsulates command-line instructions highlighted in
#  an OpenHPC Install Guide that can be used as a starting point to perform a local
#  cluster install beginning with bare-metal. Necessary inputs that describe local
#  hardware characteristics, desired network settings, and other customizations
#  are controlled via a companion input file that is used to initialize variables
#  within this script.
#  Please see the OpenHPC Install Guide(s) for more information regarding the
#  procedure. Note that the section numbering included in this script refers to
#  corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------

inputFile=${OHPC_INPUT_LOCAL:-/input.local}

if [ ! -e ${inputFile} ];then
   echo "Error: Unable to access local input file -> ${inputFile}"
   exit 1
else
   . ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi

# -------------------------------------------------
# Create compute image for Warewulf (Section 3.8.1)
# -------------------------------------------------

This section explains how to create a compute image for Warewulf, a cluster management tool used to install and manage compute nodes. This image is used to configure the compute nodes within the cluster.

wwctl container import docker://ghcr.io/warewulf/warewulf-rockylinux:9 rocky-9.4 --syncuser

This command uses wwctl (Warewulf's command-line tool) to import a preconfigured Rocky Linux 9 container image from the ghcr.io/warewulf registry and names it rocky-9.4. The --syncuser flag ensures that users and groups in the image are synchronized with those on the host system.

wwctl container exec rocky-9.4 /bin/bash <<- EOF
dnf -y install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm
dnf -y update
EOF
  • The wwctl container exec command runs a command inside the Rocky Linux 9.4 Docker container that we previously imported.

  • The commands executed inside the container are:

    • Installation of ohpc-release: This installs the ohpc-release package from the OpenHPC repository for Rocky Linux 9.4. It configures the system to enable the installation of OpenHPC-specific software and resources.
    • System update: The dnf -y update command updates all packages in the container to their latest available versions.
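The <<- EOF construct above is standard shell heredoc syntax: everything up to the EOF marker is fed to the container's /bin/bash on stdin, and the dash lets the body be tab-indented. A minimal standalone illustration using a local bash instead of the container:

```shell
#!/usr/bin/bash
# <<- strips leading tabs from the heredoc body; the lines run in the child shell.
bash <<- 'EOF'
	echo "runs inside a child bash, like the dnf commands run inside the container"
EOF
```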
export CHROOT=/srv/warewulf/chroots/rocky-9.4/rootfs

This line defines an environment variable CHROOT that points to the directory containing the filesystem of the Rocky Linux 9.4 compute image we just created. This directory is essential for integrating the compute image into the Warewulf system, as it represents the environment in which the compute nodes will be deployed.

After running the script, the output is displayed as follows:

[root@master-ohpc /]# ./recipe7.sh
Copying blob 4f4fb700ef54 done
Copying blob 0046cb37027b [==============================>---------] 436.2MiB / 611.4MiB | 20.6 MiB/s
Copying blob cc311bfc628a done
Copying blob 30e5d205dca1 done
Copying blob 3442e16c7069 done
Verifications

Container Import Verification

  • Ensure that the rocky-9.4 container has been successfully imported. You can verify this by checking whether it appears in the output of wwctl container list. If the rocky-9.4 container is listed, the import was successful.

  • Verify the installation of the ohpc-release package: After running the script, the ohpc-release package should be installed inside the container. To confirm this, access the container and check the package status.

[root@master-ohpc /]# wwctl container list
ERROR : lstat /srv/warewulf/chroots/rocky-9.4/rootfs/proc/3089049: no such file or directory

CONTAINER NAME   NODES   KERNEL VERSION                      CREATION TIME           MODIFICATION TIME   SIZE
rocky-9.4        0       5.14.0-503.19.1.el9_5.x86_64         10 Feb 25 10:27 EST     10 Feb 25 08:17 EST  1.7 GiB

[root@master-ohpc /]# nano recipe7.sh

[root@master-ohpc /]# wwctl container exec rocky-9.4 /bin/bash
[rocky-9.4] warewulf# rpm -qa | grep ohpc-release
ohpc-release-3.1-1.el9.x86_64
[rocky-9.4] warewulf#
  • Once inside, verify that the package is properly installed. (If the package is present, it confirms that the initial dnf command was executed successfully.)
[rocky-9.4] Warewulf> rpm -qa | grep ohpc-release
ohpc-release-3-1.el9.x86_64
  • Check for system updates: After running dnf -y update, you need to ensure that the system has been properly updated. To do this, check for any remaining available updates:
[rocky-9.4] Warewulf> dnf check-update
OpenHPC-3 - Base                                321 kB/s | 3.6 MB     00:11
OpenHPC-3 - Updates                             860 kB/s | 5.0 MB     00:05
Extra Packages for Enterprise Linux 9 - x86_64   0.6 kB/s | 2.3 kB     00:04
Extra Packages for Enterprise Linux 9 openh264 (From Ci  2.0 kB/s | 2.5 kB     00:01
Rocky Linux 9 - BaseOS                          5.2 MB/s | 2.0 MB     00:00
Rocky Linux 9 - AppStream                       5.0 MB/s | 8.7 MB     00:01
Rocky Linux 9 - Extras                            30 kB/s |  18 kB     00:00
[rocky-9.4] Warewulf>
  • Verifying the chroot directory: the script does not directly modify the chroot directory, but you should check that the path to the chroot exists and contains the necessary files. Make sure the directory has a structure similar to a typical Linux filesystem. If the directory is empty or incomplete, it may indicate that the container was not initialized correctly.
[root@master-ohpc /]# ls -l /srv/warewulf/chroots/rocky-9.4/rootfs/
total 16
lrwxrwxrwx.   1 root root     7 Nov  2 21:29 bin -> usr/bin
dr-xr-xr-x.   2 root root  4096 Feb 10 08:17 boot
drwxr-xr-x.   2 root root    18 Feb 10 01:01 dev
drwxrwxrwx.  63 root root  4096 Feb 10 08:17 etc
drwxr-xr-x.   2 root root     6 Nov  2 21:29 home
lrwxrwxrwx.   1 root root     7 Nov  2 21:29 lib -> usr/lib
lrwxrwxrwx.   1 root root     9 Nov  2 21:29 lib64 -> usr/lib64
drwxr-xr-x.   2 root root     6 Nov  2 21:29 media
drwxr-xr-x.   2 root root     6 Nov  2 21:29 mnt
drwxr-xr-x.   2 root root     6 Nov  2 21:29 opt
drwxr-xr-x.   2 root root     6 Jan  8 14:47 proc
dr-xr-x---.   3 root root   124 Feb 10 10:05 root
drwxr-xr-x.  14 root root   188 Feb 10 08:17 run
lrwxrwxrwx.   1 root root     8 Nov  2 21:29 sbin -> usr/sbin
drwxr-xr-x.   2 root root     6 Nov  2 21:29 srv
drwxr-xr-x.   2 root root     6 Nov 18 14:46 sys
drwxrwxrwt.   2 root root   144 Nov 18 14:47 tmp
drwxr-xr-x.  12 root root  4096 Nov 18 14:47 usr
drwxr-xr-x.  18 root root   238 Feb 11 03:29 var
  • Verifying the overall integrity of the container: you can also enter the container directly to ensure everything is working properly (for example, by running a simple command like uname -a to check the system state). If you see the kernel output and system information, it confirms that the container is functioning normally.
[root@master-ohpc /]# wwctl container exec rocky-9.4 /bin/bash
[rocky-9.4] Warewulf> uname -a
Linux rocky-9.4 5.14.0-503.14.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Nov 15 12:04:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

9. Configuring the Compute Image with OpenHPC Base, Slurm Client, and Essential Services

#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
#  Example Installation Script Template
#  This convenience script encapsulates command-line instructions highlighted in
#  an OpenHPC Install Guide that can be used as a starting point to perform a local
#  cluster install beginning with bare-metal. Necessary inputs that describe local
#  hardware characteristics, desired network settings, and other customizations
#  are controlled via a companion input file that is used to initialize variables
#  within this script.
#  Please see the OpenHPC Install Guide(s) for more information regarding the
#  procedure. Note that the section numbering included in this script refers to
#  corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------

inputFile=${OHPC_INPUT_LOCAL:-/input.local}

if [ ! -e ${inputFile} ];then
   echo "Error: Unable to access local input file -> ${inputFile}"
   exit 1
else
   . ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi

# ------------------------------------------------------------
# Add OpenHPC base components to compute image (Section 3.8.2)
# ------------------------------------------------------------
wwctl container exec rocky-9.4 /bin/bash <<- EOF
dnf -y install ohpc-base-compute
EOF
# Add SLURM and other components to compute instance
wwctl container exec rocky-9.4 /bin/bash <<- EOF
# Add Slurm client support meta-package and enable munge and slurmd
dnf -y install ohpc-slurm-client
systemctl enable munge
systemctl enable slurmd

# Add Network Time Protocol (NTP) support
dnf -y install chrony

# Include modules user environment
dnf -y install lmod-ohpc
EOF
if [[ ${enable_intel_packages} -eq 1 ]];then
     mkdir /opt/intel
     echo "/opt/intel *(ro,no_subtree_check,fsid=12)" >> /etc/exports
     echo "${sms_ip}:/opt/intel /opt/intel nfs nfsvers=4,nodev 0 0" >> $CHROOT/etc/fstab
fi

# Update basic slurm configuration if additional computes defined
if [ ${num_computes} -gt 4 ];then
   perl -pi -e "s/^NodeName=(\S+)/NodeName=${compute_prefix}[1-${num_computes}]/" /etc/slurm/slurm.conf
   perl -pi -e "s/^PartitionName=normal Nodes=(\S+)/PartitionName=normal Nodes=${compute_prefix}[1-${num_computes}]/" /etc/slurm/slurm.conf
fi 
Analysis of the script output

Objective of the script:

The main goal of this script is to automate the installation and configuration of the necessary components for an HPC cluster, including OpenHPC tools, Slurm, and other required dependencies.

Installation Details:

  • Repository Setup:

The script begins by updating and importing the following repositories:

 - OpenHPC-3 – Base
 - OpenHPC-3 – Updates
 - EPEL (Extra Packages for Enterprise Linux 9)
 - Rocky Linux 9 – BaseOS, AppStream, and Extras

These repositories are essential for retrieving the latest versions of required packages.

  • Installation of OpenHPC Components:

The following packages are installed by the script:

 - ohpc-base-compute: Core OpenHPC components
 - ohpc-slurm-client: Slurm client for job management
 - chrony: Time synchronization tool
 - lmod-ohpc: Environment module system for user environments
  • Service Configuration

After the installation, the script enables and configures several key services:

 - munge (systemctl enable munge): Authentication service used by Slurm
 - slurmd (systemctl enable slurmd): Slurm compute node daemon
 - chrony: NTP time synchronization service (installed by the script; note that this snippet does not explicitly run systemctl enable chronyd)
  • Installation of Additional Dependencies

The script also installs various additional libraries and tools, including:

 - Graphics libraries: cairo, harfbuzz, freetype
 - Development tools: gcc, perl, libxml2
 - Compression and file system utilities: brotli, squashfs, LZO
 - Additional HPC modules: libibverbs, librdmacm (for Infiniband support)

Summary of Installed Packages

In total, the script installed 154 packages, including essential libraries and cluster management tools. Key installed packages include:

 - ohpc-base-compute
 - ohpc-slurm-client
 - chrony
 - lmod-ohpc
 - singularity-ce
 - perl-libs, perl-IO, perl-Net-SSLeay
 - python3.11
 - libX11, libXext, libXrender
 - libselinux-devel, libsepol-devel

Conclusion

The execution of recipe8.sh completed successfully. The OpenHPC environment is now properly configured with all required components to run HPC workloads.

The output

[root@master-ohpc /]# ./recipe8.sh
OpenHPC-3 - Base                                                                                                                                                                                           1.1 MB/s | 3.6 MB     00:03
OpenHPC-3 - Updates                                                                                                                                                                                        2.0 MB/s | 5.0 MB     00:02
Extra Packages for Enterprise Linux 9 - x86_64                                                                                                                                                              10 MB/s |  23 MB     00:02
Extra Packages for Enterprise Linux 9 openh264 (From Cisco) - x86_64                                                                                                                                       153  B/s | 2.5 kB     00:16
Rocky Linux 9 - BaseOS                                                                                                                                                                                     2.6 MB/s | 2.3 MB     00:00
Rocky Linux 9 - AppStream                                                                                                                                                                                   16 MB/s | 8.7 MB     00:00
Rocky Linux 9 - Extras                                                                                                                                                                                      47 kB/s |  16 kB     00:00
Dependencies resolved.
===========================================================================================================================================================================================================================================
 Package                                                            Architecture                                  Version                                                     Repository                                              Size
===========================================================================================================================================================================================================================================
Installing:
 ohpc-base-compute                                                  x86_64                                        3.2-320.ohpc.1.1                                            OpenHPC-updates                                        7.2 k
Installing dependencies:

(full package list truncated)

10. Enhancing the Compute Image with Networking Drivers, Resource Limits, and Optional Filesystem Clients

#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
#  Example Installation Script Template
#  This convenience script encapsulates command-line instructions highlighted in
#  an OpenHPC Install Guide that can be used as a starting point to perform a local
#  cluster install beginning with bare-metal. Necessary inputs that describe local
#  hardware characteristics, desired network settings, and other customizations
#  are controlled via a companion input file that is used to initialize variables
#  within this script.
#  Please see the OpenHPC Install Guide(s) for more information regarding the
#  procedure. Note that the section numbering included in this script refers to
#  corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------

inputFile=${OHPC_INPUT_LOCAL:-/input.local}

if [ ! -e ${inputFile} ];then
   echo "Error: Unable to access local input file -> ${inputFile}"
   exit 1
else
   . ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi


# -------------------------------------------------------
# Additional customizations (Section 3.8.4)
# -------------------------------------------------------

# Add IB drivers to compute image
if [[ ${enable_ib} -eq 1 ]];then
     dnf -y --installroot=$CHROOT groupinstall "InfiniBand Support"
fi
# Add Omni-Path drivers to compute image
if [[ ${enable_opa} -eq 1 ]];then
     dnf -y --installroot=$CHROOT install opa-basic-tools
     dnf -y --installroot=$CHROOT install libpsm2
fi

# Update memlock settings
perl -pi -e 's/# End of file/\* soft memlock unlimited\n$&/s' /etc/security/limits.conf
perl -pi -e 's/# End of file/\* hard memlock unlimited\n$&/s' /etc/security/limits.conf
perl -pi -e 's/# End of file/\* soft memlock unlimited\n$&/s' $CHROOT/etc/security/limits.conf
perl -pi -e 's/# End of file/\* hard memlock unlimited\n$&/s' $CHROOT/etc/security/limits.conf

# Enable slurm pam module
echo "account    required     pam_slurm.so" >> $CHROOT/etc/pam.d/sshd

if [[ ${enable_beegfs_client} -eq 1 ]];then
     wget -P /etc/yum.repos.d https://www.beegfs.io/release/beegfs_7.4.5/dists/beegfs-rhel9.repo
     dnf -y install kernel-devel gcc elfutils-libelf-devel
     dnf -y install beegfs-client beegfs-helperd beegfs-utils
     perl -pi -e "s/^buildArgs=-j8/buildArgs=-j8 BEEGFS_OPENTK_IBVERBS=1/"  /etc/beegfs/beegfs-client-autobuild.conf
     /opt/beegfs/sbin/beegfs-setup-client -m ${sysmgmtd_host}
     systemctl start beegfs-helperd
     systemctl start beegfs-client
     wget -P $CHROOT/etc/yum.repos.d https://www.beegfs.io/release/beegfs_7.4.5/dists/beegfs-rhel9.repo
     dnf -y --installroot=$CHROOT install beegfs-client beegfs-helperd beegfs-utils
     perl -pi -e "s/^buildEnabled=true/buildEnabled=false/" $CHROOT/etc/beegfs/beegfs-client-autobuild.conf
     rm -f $CHROOT/var/lib/beegfs/client/force-auto-build
     chroot $CHROOT systemctl enable beegfs-helperd beegfs-client
     cp /etc/beegfs/beegfs-client.conf $CHROOT/etc/beegfs/beegfs-client.conf
     echo "drivers += beegfs" >> /etc/warewulf/bootstrap.conf
fi

# Enable Optional packages

if [[ ${enable_lustre_client} -eq 1 ]];then
     # Install Lustre client on master
     dnf -y install lustre-client-ohpc
     # Enable lustre in WW compute image
     dnf -y --installroot=$CHROOT install lustre-client-ohpc
     mkdir $CHROOT/mnt/lustre
     echo "${mgs_fs_name} /mnt/lustre lustre defaults,localflock,noauto,x-systemd.automount 0 0" >> $CHROOT/etc/fstab
     # Enable o2ib for Lustre
     echo "options lnet networks=o2ib(ib0)" >> /etc/modprobe.d/lustre.conf
     echo "options lnet networks=o2ib(ib0)" >> $CHROOT/etc/modprobe.d/lustre.conf
     # mount Lustre client on master
     mkdir /mnt/lustre
     mount -t lustre -o localflock ${mgs_fs_name} /mnt/lustre
fi
Clarifications

Installation of InfiniBand and Omni-Path drivers

  • Adding InfiniBand (IB) drivers
if [[ ${enable_ib} -eq 1 ]]; then
     dnf -y --installroot=$CHROOT groupinstall "InfiniBand Support"
fi

If enable_ib is set to 1, the script installs InfiniBand drivers, which are essential for high-performance interconnects in HPC environments.

  • Adding Omni-Path (OPA) drivers
if [[ ${enable_opa} -eq 1 ]]; then
     dnf -y --installroot=$CHROOT install opa-basic-tools
     dnf -y --installroot=$CHROOT install libpsm2
fi

If enable_opa is set to 1, the script installs Omni-Path tools, an alternative to InfiniBand that enables low-latency communication in HPC clusters.

Configuring memory limits (memlock)

perl -pi -e 's/# End of file/\* soft memlock unlimited\n$&/s' /etc/security/limits.conf
perl -pi -e 's/# End of file/\* hard memlock unlimited\n$&/s' /etc/security/limits.conf
perl -pi -e 's/# End of file/\* soft memlock unlimited\n$&/s' $CHROOT/etc/security/limits.conf
perl -pi -e 's/# End of file/\* hard memlock unlimited\n$&/s' $CHROOT/etc/security/limits.conf

Purpose: Increase memory lock limits to prevent HPC processes from being constrained in memory allocation.

  • The perl -pi -e commands update the limits.conf files by appending memlock unlimited rules.
  • Changes are applied both on the host system and inside the Warewulf chroot environment.

Enabling the PAM module for Slurm

echo "account    required     pam_slurm.so" >> $CHROOT/etc/pam.d/sshd

This activates pam_slurm.so, a PAM module that restricts SSH access to compute nodes to users with active Slurm jobs only.

Installing and configuring BeeGFS (parallel file system)

if [[ ${enable_beegfs_client} -eq 1 ]]; then
     wget -P /etc/yum.repos.d https://www.beegfs.io/release/beegfs_7.4.5/dists/beegfs-rhel9.repo
     dnf -y install kernel-devel gcc elfutils-libelf-devel
     dnf -y install beegfs-client beegfs-helperd beegfs-utils
     perl -pi -e "s/^buildArgs=-j8/buildArgs=-j8 BEEGFS_OPENTK_IBVERBS=1/"  /etc/beegfs/beegfs-client-autobuild.conf
     /opt/beegfs/sbin/beegfs-setup-client -m ${sysmgmtd_host}
     systemctl start beegfs-helperd
     systemctl start beegfs-client
     wget -P $CHROOT/etc/yum.repos.d https://www.beegfs.io/release/beegfs_7.4.5/dists/beegfs-rhel9.repo
     dnf -y --installroot=$CHROOT install beegfs-client beegfs-helperd beegfs-utils
     perl -pi -e "s/^buildEnabled=true/buildEnabled=false/" $CHROOT/etc/beegfs/beegfs-client-autobuild.conf
     rm -f $CHROOT/var/lib/beegfs/client/force-auto-build
     chroot $CHROOT systemctl enable beegfs-helperd beegfs-client
     cp /etc/beegfs/beegfs-client.conf $CHROOT/etc/beegfs/beegfs-client.conf
     echo "drivers += beegfs" >> /etc/warewulf/bootstrap.conf
fi

If enable_beegfs_client=1, the script installs and configures BeeGFS, a high-performance parallel file system for HPC. Main actions include:

  • Add the BeeGFS repository to yum.repos.d
  • Install BeeGFS client, helper daemon, and utilities
  • Enable InfiniBand support (BEEGFS_OPENTK_IBVERBS=1)
  • Set up the client to connect to the management server (sysmgmtd_host)
  • Start beegfs-helperd and beegfs-client services
  • Configure the Warewulf chroot environment with BeeGFS

Installing and configuring Lustre (HPC parallel file system)

if [[ ${enable_lustre_client} -eq 1 ]]; then
     dnf -y install lustre-client-ohpc
     dnf -y --installroot=$CHROOT install lustre-client-ohpc
     mkdir $CHROOT/mnt/lustre
     echo "${mgs_fs_name} /mnt/lustre lustre defaults,localflock,noauto,x-systemd.automount 0 0" >> $CHROOT/etc/fstab
     echo "options lnet networks=o2ib(ib0)" >> /etc/modprobe.d/lustre.conf
     echo "options lnet networks=o2ib(ib0)" >> $CHROOT/etc/modprobe.d/lustre.conf
     mkdir /mnt/lustre
     mount -t lustre -o localflock ${mgs_fs_name} /mnt/lustre
fi

If enable_lustre_client=1, the script installs and configures Lustre, another widely-used high-performance file system in HPC.

Key steps include:

  • Install the Lustre client on the management server
  • Install Lustre in the Warewulf chroot environment
  • Add a mount entry to /etc/fstab for /mnt/lustre
  • Enable o2ib (over InfiniBand) network support for Lustre
  • Mount the Lustre file system on the management server
Variable Explanation

  • enable_ib — installs InfiniBand drivers. Set it to 1 if your cluster uses an InfiniBand interconnect; set it to 0 if your cluster uses Ethernet.
  • enable_opa — installs Omni-Path drivers. Set it to 1 if your nodes are connected using Omni-Path (OPA); set it to 0 if you do not use Omni-Path.
  • enable_beegfs_client — installs the BeeGFS client (parallel file system). Set it to 1 if your cluster uses BeeGFS for storage; set it to 0 if you do not use BeeGFS.
  • enable_lustre_client — installs the Lustre client (another HPC file system). Set it to 1 if you use Lustre for parallel storage; set it to 0 if you do not use Lustre.
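As a concrete illustration, an input.local fragment for an Ethernet-only cluster with no parallel filesystem client could look like this (the values are examples for that scenario, not project defaults):

```shell
# Fragment of input.local: Ethernet interconnect, no BeeGFS/Lustre client
enable_ib=0
enable_opa=0
enable_beegfs_client=0
enable_lustre_client=0
```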

11. Configuring System Logging, Cluster Management Tools, Console Access, and Node Health Checks

#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
#  Example Installation Script Template
#  This convenience script encapsulates command-line instructions highlighted in
#  an OpenHPC Install Guide that can be used as a starting point to perform a local
#  cluster install beginning with bare-metal. Necessary inputs that describe local
#  hardware characteristics, desired network settings, and other customizations
#  are controlled via a companion input file that is used to initialize variables
#  within this script.
#  Please see the OpenHPC Install Guide(s) for more information regarding the
#  procedure. Note that the section numbering included in this script refers to
#  corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------

inputFile=${OHPC_INPUT_LOCAL:-/input.local}

if [ ! -e ${inputFile} ];then
   echo "Error: Unable to access local input file -> ${inputFile}"
   exit 1
else
   . ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi



# -------------------------------------------------------
# Configure rsyslog on SMS and computes (Section 3.8.4.7)
# -------------------------------------------------------
echo 'module(load="imudp")' >> /etc/rsyslog.d/ohpc.conf
echo 'input(type="imudp" port="514")' >> /etc/rsyslog.d/ohpc.conf
systemctl restart rsyslog
echo "*.* action(type=\"omfwd\" Target=\"${sms_ip}\" Port=\"514\" " "Protocol=\"udp\")">> $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^\*\.info/\\#\*\.info/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^authpriv/\\#authpriv/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^mail/\\#mail/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^cron/\\#cron/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^uucp/\\#uucp/" $CHROOT/etc/rsyslog.conf

if [[ ${enable_clustershell} -eq 1 ]];then
     # Install clustershell
     dnf -y install clustershell
     cd /etc/clustershell/groups.d
     mv local.cfg local.cfg.orig
     echo "adm: ${sms_name}" > local.cfg
     echo "compute: ${compute_prefix}[1-${num_computes}]" >> local.cfg
     echo "all: @adm,@compute" >> local.cfg
fi

if [[ ${enable_genders} -eq 1 ]];then
     # Install genders
     dnf -y install genders-ohpc
     echo -e "${sms_name}\tsms" > /etc/genders
     for ((i=0; i<$num_computes; i++)) ; do
        echo -e "${c_name[$i]}\tcompute,bmc=${c_bmc[$i]}"
     done >> /etc/genders
fi

if [[ ${enable_magpie} -eq 1 ]];then
     # Install magpie
     dnf -y install magpie-ohpc
fi

# Optionally, enable conman and configure
if [[ ${enable_ipmisol} -eq 1 ]];then
     dnf -y install conman-ohpc
     for ((i=0; i<$num_computes; i++)) ; do
        echo -n 'CONSOLE name="'${c_name[$i]}'" dev="ipmi:'${c_bmc[$i]}'" '
        echo 'ipmiopts="'U:${bmc_username},P:${IPMI_PASSWORD:-undefined},W:solpayloadsize'"'
     done >> /etc/conman.conf
     systemctl enable conman
     systemctl start conman
fi

# Optionally, enable nhc and configure
dnf -y install nhc-ohpc
dnf -y --installroot=$CHROOT install nhc-ohpc

echo "HealthCheckProgram=/usr/sbin/nhc" >> /etc/slurm/slurm.conf
echo "HealthCheckInterval=300" >> /etc/slurm/slurm.conf  # execute every five minutes

# Optionally, update compute image to support geopm
if [[ ${enable_geopm} -eq 1 ]];then
     export kargs="${kargs} intel_pstate=disable"
fi

if [[ ${enable_geopm} -eq 1 ]];then
     dnf -y --installroot=$CHROOT install kmod-msr-safe-ohpc
     dnf -y --installroot=$CHROOT install msr-safe-ohpc
     dnf -y --installroot=$CHROOT install msr-safe-slurm-ohpc
fi
Clarifications

1. Configuration of rsyslog for log management

echo 'module(load="imudp")' >> /etc/rsyslog.d/ohpc.conf
echo 'input(type="imudp" port="514")' >> /etc/rsyslog.d/ohpc.conf
systemctl restart rsyslog

This enables rsyslog to listen on UDP port 514, thereby facilitating centralized log collection on the management server (SMS).

echo "*.* action(type=\"omfwd\" Target=\"${sms_ip}\" Port=\"514\" " "Protocol=\"udp\")">> $CHROOT/etc/rsyslog.conf

This configures the compute nodes to forward their logs to the management server (sms_ip).

perl -pi -e "s/^\*\.info/\\#\*\.info/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^authpriv/\\#authpriv/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^mail/\\#mail/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^cron/\\#cron/" $CHROOT/etc/rsyslog.conf
perl -pi -e "s/^uucp/\\#uucp/" $CHROOT/etc/rsyslog.conf

This comments out the default local logging rules (*.info, authpriv, mail, cron, uucp) in the compute image: since these messages are now forwarded to the SMS, disabling local logging avoids duplication and unnecessary overhead on the nodes.
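Once the shell expands ${sms_ip}, the appended forwarding rule is a single rsyslog omfwd action. A standalone sketch (sms_ip=192.168.1.10 is an example value, and the rule is written to /tmp here instead of the chroot):

```shell
sms_ip=192.168.1.10   # example management-server address

# Same rule the recipe appends to $CHROOT/etc/rsyslog.conf
echo "*.* action(type=\"omfwd\" Target=\"${sms_ip}\" Port=\"514\" Protocol=\"udp\")" > /tmp/rsyslog-forward.demo

# Produces: *.* action(type="omfwd" Target="192.168.1.10" Port="514" Protocol="udp")
cat /tmp/rsyslog-forward.demo
```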

2. Installation of Clustershell (centralized management of commands on nodes)

if [[ ${enable_clustershell} -eq 1 ]]; then
     dnf -y install clustershell
     cd /etc/clustershell/groups.d
     mv local.cfg local.cfg.orig
     echo "adm: ${sms_name}" > local.cfg
     echo "compute: ${compute_prefix}[1-${num_computes}]" >> local.cfg
     echo "all: @adm,@compute" >> local.cfg
fi
  • If enable_clustershell=1, install Clustershell, a tool that allows commands to be executed simultaneously across multiple nodes.
  • Configure the local.cfg file to define node groups: adm for the SMS server and compute for the compute nodes.
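The resulting group file can be previewed with sample values (sms_name=master-ohpc, compute_prefix=c, num_computes=4 are illustrative; /tmp is used here instead of /etc/clustershell/groups.d):

```shell
# Illustrative sample values
sms_name=master-ohpc
compute_prefix=c
num_computes=4
demo=/tmp/local.cfg.demo   # stands in for /etc/clustershell/groups.d/local.cfg

# Same three group definitions the recipe writes
echo "adm: ${sms_name}" > $demo
echo "compute: ${compute_prefix}[1-${num_computes}]" >> $demo
echo "all: @adm,@compute" >> $demo
cat $demo
```

With this file in place, a command such as `clush -g compute uptime` would target c1 through c4 at once.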

3. Installation of Genders, a tool for managing node groups within a cluster.

if [[ ${enable_genders} -eq 1 ]]; then
     dnf -y install genders-ohpc
     echo -e "${sms_name}\tsms" > /etc/genders
     for ((i=0; i<$num_computes; i++)) ; do
        echo -e "${c_name[$i]}\tcompute,bmc=${c_bmc[$i]}"
     done >> /etc/genders
fi
  • If enable_genders=1, install Genders, a tool used to classify nodes based on their roles.
  • Populate the /etc/genders file by associating each compute node with its BMC address, which is used for out-of-band management via IPMI.
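The generation loop can be tried standalone with sample node and BMC arrays (all values below are illustrative, and the file is written to /tmp instead of /etc/genders):

```shell
# Illustrative sample values
sms_name=master-ohpc
num_computes=2
c_name=(c1 c2)
c_bmc=(192.168.2.1 192.168.2.2)

# Same loop the recipe uses, redirected to a demo file
echo -e "${sms_name}\tsms" > /tmp/genders.demo
for ((i=0; i<$num_computes; i++)) ; do
   echo -e "${c_name[$i]}\tcompute,bmc=${c_bmc[$i]}"
done >> /tmp/genders.demo

# One line per node: name, tab, attribute list
cat /tmp/genders.demo
```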

4. Installation of Magpie, a collection of scripts for running big data frameworks on HPC resources.

if [[ ${enable_magpie} -eq 1 ]]; then
     dnf -y install magpie-ohpc
fi

If enable_magpie=1, install Magpie, which provides scripts for running big data frameworks (such as Hadoop and Spark) on traditional HPC resources under the scheduler.

5. Configuration of IPMI (power management and serial console control of nodes)

if [[ ${enable_ipmisol} -eq 1 ]]; then
     dnf -y install conman-ohpc
     for ((i=0; i<$num_computes; i++)) ; do
        echo -n 'CONSOLE name="'${c_name[$i]}'" dev="ipmi:'${c_bmc[$i]}'" '
        echo 'ipmiopts="'U:${bmc_username},P:${IPMI_PASSWORD:-undefined},W:solpayloadsize'"'
     done >> /etc/conman.conf
     systemctl enable conman
     systemctl start conman
fi

If enable_ipmisol=1, install and configure Conman, a tool that provides access to node serial consoles via IPMI.
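The loop builds one CONSOLE line per node; note the ${IPMI_PASSWORD:-undefined} fallback when no password is exported. A standalone sketch with sample values (node name, BMC address, and username are illustrative; output goes to /tmp instead of /etc/conman.conf):

```shell
# Illustrative sample values
num_computes=1
c_name=(c1)
c_bmc=(192.168.2.1)
bmc_username=admin
unset IPMI_PASSWORD   # deliberately unset to show the "undefined" fallback

# Same loop the recipe uses, redirected to a demo file
for ((i=0; i<$num_computes; i++)) ; do
   echo -n 'CONSOLE name="'${c_name[$i]}'" dev="ipmi:'${c_bmc[$i]}'" '
   echo 'ipmiopts="'U:${bmc_username},P:${IPMI_PASSWORD:-undefined},W:solpayloadsize'"'
done > /tmp/conman.conf.demo

# CONSOLE name="c1" dev="ipmi:192.168.2.1" ipmiopts="U:admin,P:undefined,W:solpayloadsize"
cat /tmp/conman.conf.demo
```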

6. Installation and configuration of NHC (Node Health Check)

dnf -y install nhc-ohpc
dnf -y --installroot=$CHROOT install nhc-ohpc
echo "HealthCheckProgram=/usr/sbin/nhc" >> /etc/slurm/slurm.conf
echo "HealthCheckInterval=300" >> /etc/slurm/slurm.conf  # execute every five minutes
  • Install nhc-ohpc, a tool dedicated to monitoring the health status of compute nodes.
  • Configure Slurm to perform node health checks every 5 minutes.

7. Activation of GEOPM, an energy management tool for HPC environments.

if [[ ${enable_geopm} -eq 1 ]]; then
     export kargs="${kargs} intel_pstate=disable"
fi

If enable_geopm=1, disable intel_pstate, a CPU power management feature, to allow GEOPM to take control.

if [[ ${enable_geopm} -eq 1 ]]; then
     dnf -y --installroot=$CHROOT install kmod-msr-safe-ohpc
     dnf -y --installroot=$CHROOT install msr-safe-ohpc
     dnf -y --installroot=$CHROOT install msr-safe-slurm-ohpc
fi

Install the kernel module and supporting packages required by GEOPM (kmod-msr-safe-ohpc, msr-safe-ohpc, msr-safe-slurm-ohpc).

This script is essential to finalize the configuration of an OpenHPC cluster by automating log management, node access, and monitoring.

It is not mandatory to enable all these variables. Their activation depends on your needs and the architecture of your cluster.

Guide to Decide Which Variables to Enable (1) or Disable (0)

  • enable_clustershell — installs ClusterShell (run commands on multiple nodes simultaneously). Enable it if you want to run commands on all nodes at once; disable it if you prefer to manage nodes individually.
  • enable_genders — installs Genders (node classification). Enable it if you want to organize nodes by role (e.g., compute, storage); disable it if you do not need advanced node grouping.
  • enable_magpie — installs Magpie (scripts for running big data frameworks on HPC resources). Enable it if you want to run Hadoop/Spark-style workloads under the scheduler; disable it otherwise.
  • enable_ipmisol — installs ConMan (serial console management via IPMI). Enable it if your nodes support IPMI and you want remote console access; disable it if they do not.
  • enable_geopm — installs GEOPM (power management for HPC). Enable it if you want to manage Intel CPU power consumption; disable it if you do not need advanced power management.

Recommendations Based on Your Usage

If your cluster is small and you want a minimal installation:

enable_clustershell=1   # Useful for managing multiple nodes  
enable_genders=0        # Not necessarily needed if few nodes  
enable_magpie=0         # Only if you run big data workloads (e.g., Hadoop/Spark)  
enable_ipmisol=0        # Only if your nodes have IPMI  
enable_geopm=0          # Only for energy optimization

If your cluster is large and complex:

enable_clustershell=1  
enable_genders=1  
enable_magpie=1  
enable_ipmisol=1  
enable_geopm=1

Conclusion: Enable only what is necessary for your environment. If in doubt, start with 0 and activate options progressively as needed.

Script Execution
Last metadata expiration check: 2:41:57 ago on Wed Feb 10 01:17:22 2025.
Dependencies resolved.
=================================================================================
 Package                    Architecture      Version           Repository    Size
=================================================================================
Installing:
 clustershell               noarch           1.9.2-1.el9       epel          159 k
Installing dependencies:
 python3-clustershell       noarch           1.9.2-1.el9       epel          268 k

Transaction Summary
=================================================================================
Install  2 Packages

Total download size: 427 k
Installed size: 1.9 M
[1/2]: python3-clustershell-1.9.2-1.el9.noarch.rpm       2.1 MB/s | 206 kB     00:00
[2/2]: clustershell-1.9.2-1.el9.noarch.rpm               2.1 MB/s | 158 kB     00:00
--------------------------------------------------------------------------------
Total                                                     389 kB/s | 364 kB     00:01

Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Preparing        :                                                          1/1
  Installing       : python3-clustershell-1.9.2-1.el9.noarch                1/2
  Running scriptlet: clustershell-1.9.2-1.el9.noarch                        1/2
  Installing       : clustershell-1.9.2-1.el9.noarch                        2/2
  Running scriptlet: python3-clustershell-1.9.2-1.el9.noarch                2/2
  Verifying        : python3-clustershell-1.9.2-1.el9.noarch                1/2
  Verifying        : clustershell-1.9.2-1.el9.noarch                        2/2

Installed:
  clustershell-1.9.2-1.el9.noarch        python3-clustershell-1.9.2-1.el9.noarch

Last metadata expiration check: 2:44:06 ago on Wed Feb 10 01:17:22 2025.
Dependencies resolved.
=================================================================================
 Package                    Architecture      Version           Repository    Size
=================================================================================
Installing:
 nhc-ohpc                   noarch           1.4.3-300.ohpc.1.2    OpenHPC       64 k

Transaction Summary
=================================================================================
Install  1 Package

Total download size: 64 k
Installed size: 179 k
Downloading Packages:
nhc-ohpc-1.4.3-300.ohpc.1.2.noarch.rpm            58 kB/s |  64 kB     00:01
--------------------------------------------------------------------------------
Total                                               58 kB/s |  64 kB     00:01

Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Preparing        :                                                          1/1
  Installing       : nhc-ohpc-1.4.3-300.ohpc.1.2.noarch                    1/1
  Running scriptlet: nhc-ohpc-1.4.3-300.ohpc.1.2.noarch                    1/1
  Verifying        : nhc-ohpc-1.4.3-300.ohpc.1.2.noarch                    1/1

Installed:
  nhc-ohpc-1.4.3-300.ohpc.1.2.noarch

Complete!
config error: error parsing '': given path '' is not absolute.
Analysis of this output:

1. Installation of ClusterShell

  • Installed packages:
    • clustershell-1.9.2-1.el9.noarch
    • python3-clustershell-1.9.2-1.el9.noarch
  • Download details:
    • Total download size: 364 kB
    • Installation completed successfully without errors

2. Installation of NHC (Node Health Check)

  • Installed package:
    • nhc-ohpc-1.4.3-300.ohpc.1.2.noarch
  • Download details:
    • Total download size: 64 kB
    • Installation completed successfully without errors

3. Problem detected at the end:

config error: error parsing '': given path '' is not absolute.

  • Interpretation:
    • The error message indicates a configuration problem related to an empty path ('').
    • This may be an issue with NHC (nhc-ohpc), as it uses a configuration file to define the verification script paths.

12. Importing Configuration Files into the Warewulf Overlay

#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
#  Example Installation Script Template
#  This convenience script encapsulates command-line instructions highlighted in
#  an OpenHPC Install Guide that can be used as a starting point to perform a local
#  cluster install beginning with bare-metal. Necessary inputs that describe local
#  hardware characteristics, desired network settings, and other customizations
#  are controlled via a companion input file that is used to initialize variables
#  within this script.
#  Please see the OpenHPC Install Guide(s) for more information regarding the
#  procedure. Note that the section numbering included in this script refers to
#  corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------

inputFile=${OHPC_INPUT_LOCAL:-/input.local}

if [ ! -e ${inputFile} ];then
   echo "Error: Unable to access local input file -> ${inputFile}"
   exit 1
else
   . ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi




# ----------------------------
# Import files (Section 3.8.5)
# ----------------------------
wwctl overlay import generic /etc/subuid
wwctl overlay import generic /etc/subgid
echo "server ${sms_ip} iburst" | wwctl overlay import generic <(cat) /etc/chrony.conf
wwctl overlay mkdir generic /etc/sysconfig/
wwctl overlay import generic <(echo SLURMD_OPTIONS="--conf-server ${sms_ip}") /etc/sysconfig/slurmd
wwctl overlay mkdir generic --mode 0700 /etc/munge
wwctl overlay import generic /etc/munge/munge.key
wwctl overlay chown generic /etc/munge/munge.key $(id -u munge) $(id -g munge)
wwctl overlay chown generic /etc/munge $(id -u munge) $(id -g munge)

if [[ ${enable_ipoib} -eq 1 ]];then
     wwctl overlay mkdir generic /etc/sysconfig/network-scripts/
     wwctl overlay import generic /opt/ohpc/pub/examples/network/centos/ifcfg-ib0.ww /etc/sysconfig/network-scripts/ifcfg-ib0.ww
fi

This Bash script is part of an automated installation of OpenHPC, a software stack for managing High Performance Computing (HPC) clusters.
It is intended to serve as an installation template, based on the official OpenHPC guide.

This script handles the configuration of several critical components of the cluster:

  • Importing system files into the overlay (UID management, Chrony configuration)
  • Setting up Slurm and the Munge authentication service
  • (Optional) Network setup for InfiniBand if enabled

1. Importing files into overlays (Section 3.8.5)
wwctl overlay import generic /etc/subuid
wwctl overlay import generic /etc/subgid
echo "server ${sms_ip} iburst" | wwctl overlay import generic <(cat) /etc/chrony.conf
  • wwctl overlay import copies specific files into an overlay (a file layer used by cluster nodes).

  • /etc/subuid and /etc/subgid: Define UID/GID ranges for users, commonly used with containers.

  • /etc/chrony.conf: Configures Chrony for time synchronization. ${sms_ip} is the master server’s IP.

Purpose: Provide compute nodes with necessary system files for consistent configuration.

Verification

[root@master-ohpc /]# wwctl overlay cat generic /etc/chrony.conf
server 192.168.70.41 iburst 
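A side note on the `<(cat)` idiom used in the chrony import above: process substitution turns a command's output into a file path, which lets `wwctl overlay import` (which expects a file argument) read content piped in on stdin. A minimal standalone demo, independent of wwctl (`show_path` is just an illustrative stand-in):

```shell
# show_path stands in for any command that wants a file path;
# <(cat) gives it a /dev/fd/N path fed by the pipeline's stdin.
show_path() { echo "reading from: $1"; cat "$1"; }
echo "server 192.168.70.41 iburst" | show_path <(cat)
```

This is why the recipe can pipe a generated line straight into the overlay without writing a temporary file first.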
2. Creating and configuring other system files
wwctl overlay mkdir generic /etc/sysconfig/
wwctl overlay import generic <(echo SLURMD_OPTIONS="--conf-server ${sms_ip}") /etc/sysconfig/slurmd
  • wwctl overlay mkdir creates the /etc/sysconfig/ directory in the overlay.
  • wwctl overlay import adds a file with Slurm options: SLURMD_OPTIONS="--conf-server ${sms_ip}"

Purpose: Let compute nodes fetch their configuration from the master server (Slurm's configless mode). Slurm is the workload manager and job scheduler used in OpenHPC.

Verification

[root@master-ohpc /]# wwctl overlay cat generic /etc/sysconfig/slurmd
SLURMD_OPTIONS=--conf-server  192.168.70.41
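A side note on this verification output: the double quotes written in the recipe never reach the file, because the shell strips them before echo sees its argument. slurmd still parses the value correctly. A quick demonstration:

```shell
# The inner double quotes are consumed by the shell before echo runs,
# so the generated file has no quotes around the option value:
sms_ip=192.168.70.41
echo SLURMD_OPTIONS="--conf-server ${sms_ip}"      # no quotes in output
echo "SLURMD_OPTIONS=\"--conf-server ${sms_ip}\""  # same line, with literal quotes
```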
3. Munge Configuration (Slurm authentication)
wwctl overlay mkdir generic --mode 0700 /etc/munge
wwctl overlay import generic /etc/munge/munge.key
wwctl overlay chown generic /etc/munge/munge.key $(id -u munge) $(id -g munge)
wwctl overlay chown generic /etc/munge $(id -u munge) $(id -g munge)
  • Munge is the authentication service used by Slurm.

  • These commands:

    • Create /etc/munge with restricted permissions (0700)

    • Import the shared key munge.key

    • Set the file and directory ownership to the munge user for security

Purpose:

Secure authentication for communication between server and nodes.
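The permission layout these commands produce on the nodes can be sketched locally, in a scratch directory rather than a real /etc/munge (ownership is skipped here since the demo runs unprivileged):

```shell
# Recreate the expected layout: a 0700 directory holding a 0600 key.
mkdir -p -m 0700 demo_munge
head -c 1024 /dev/urandom > demo_munge/munge.key
chmod 0600 demo_munge/munge.key
stat -c '%a %n' demo_munge demo_munge/munge.key
```

These strict modes matter: munged refuses to start if the key is readable by anyone other than its owner.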

4. Network setup using InfiniBand (optional)
if [[ ${enable_ipoib} -eq 1 ]]; then
    wwctl overlay mkdir generic /etc/sysconfig/network-scripts/
    wwctl overlay import generic /opt/ohpc/pub/examples/network/centos/ifcfg-ib0.ww /etc/sysconfig/network-scripts/ifcfg-ib0.ww
fi
  • InfiniBand is a high-performance network protocol used in HPC.

  • If enable_ipoib = 1, the script:

    • Creates the configuration directory

    • Imports the file ifcfg-ib0.ww to enable InfiniBand on interface ib0

Purpose:

Allow compute nodes to communicate via InfiniBand with minimal latency and maximum throughput.

13. Section 12 of the recipe.sh script

#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
#  Example Installation Script Template
#  This convenience script encapsulates command-line instructions highlighted in
#  an OpenHPC Install Guide that can be used as a starting point to perform a local
#  cluster install beginning with bare-metal. Necessary inputs that describe local
#  hardware characteristics, desired network settings, and other customizations
#  are controlled via a companion input file that is used to initialize variables
#  within this script.
#  Please see the OpenHPC Install Guide(s) for more information regarding the
#  procedure. Note that the section numbering included in this script refers to
#  corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------

inputFile=${OHPC_INPUT_LOCAL:-/input.local}

if [ ! -e ${inputFile} ];then
   echo "Error: Unable to access local input file -> ${inputFile}"
   exit 1
else
   . ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi





# --------------------------------------
# Assemble bootstrap image (Section 3.9)
# --------------------------------------
wwctl container build rocky-9.4
wwctl overlay build
# Add hosts to cluster
for ((i=0; i<$num_computes; i++)) ; do
   wwctl node add --container=rocky-9.4 \
   --ipaddr=${c_ip[$i]} --hwaddr=${c_mac[$i]} ${c_name[$i]}
done
wwctl overlay build
wwctl configure --all
# Enable and start munge and slurmctld (Cont.)
systemctl enable --now munge
systemctl enable --now slurmctld

# Optionally, add arguments to bootstrap kernel
if [[ ${enable_kargs} -eq 1 ]]; then
wwctl node set --yes --kernelargs="${kargs}" "${compute_regex}"
fi

This script is designed to automate the installation and configuration of an OpenHPC cluster.
It begins by preparing the environment using a local configuration file, then builds the container image, adds compute nodes to the cluster, applies the required configuration, and starts essential services. Finally, it offers the option to add kernel arguments if needed.


1. Creating the Boot Image and Overlays
  • wwctl container build rocky-9.4 builds the bootable node image from the container based on Rocky Linux 9.4, a Linux distribution commonly used for clusters.
  • wwctl overlay build compiles the overlays (additional configuration file layers) for the nodes.

2. Adding Hosts to the Cluster
  • The for loop adds the compute nodes to the cluster.
    For each compute node (defined by the variable num_computes), it assigns:
    • an IP address ${c_ip[$i]},
    • a MAC address ${c_mac[$i]}, and
    • a node name ${c_name[$i]}.
    The wwctl node add command is used to register each node in the cluster with the appropriate configuration.
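The loop can be dry-run standalone with sample values to see exactly which commands it would issue (the real values for `c_ip`, `c_mac`, `c_name`, and `num_computes` come from input.local; the addresses below are placeholders):

```shell
# Dry-run sketch: echo the wwctl commands instead of executing them.
num_computes=2
c_ip=(192.168.70.51 192.168.70.52)
c_mac=(aa:bb:cc:dd:ee:01 aa:bb:cc:dd:ee:02)
c_name=(compute1 compute2)
for ((i=0; i<num_computes; i++)); do
  echo "wwctl node add --container=rocky-9.4 --ipaddr=${c_ip[$i]} --hwaddr=${c_mac[$i]} ${c_name[$i]}"
done
```

Echoing the commands first is a cheap way to verify your input.local arrays line up before touching the Warewulf database.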

3. Rebuilding the Overlay and Applying Configuration
  • The second wwctl overlay build rebuilds the overlays after all nodes have been added.
  • wwctl configure --all applies the configuration cluster-wide.

4. Starting and Enabling Services
  • The munge and slurmctld services are enabled and started on the master:
    • munge is an authentication service used to secure communication within HPC clusters.
    • slurmctld is the central controller for SLURM, the resource and job scheduling system.

5. (Optional) Adding Kernel Arguments
  • If the variable enable_kargs is set to 1, the script adds custom kernel arguments to the compute nodes using wwctl node set.
    This can be used to pass advanced parameters to the Linux kernel.

Note:
This script does not reinstall the Rocky Linux OS on your nodes.
Instead, it adds specific cluster-related configurations (such as munge, slurm, and networking).
If your nodes are already properly set up with IP addresses and base services, this script will integrate them into the cluster without altering their existing installation.

14. Section 13 of the recipe.sh script

#!/usr/bin/bash
# ----------------------------------------------------------------------------------->
#  Example Installation Script Template
#  This convenience script encapsulates command-line instructions highlighted in
#  an OpenHPC Install Guide that can be used as a starting point to perform a local
#  cluster install beginning with bare-metal. Necessary inputs that describe local
#  hardware characteristics, desired network settings, and other customizations
#  are controlled via a companion input file that is used to initialize variables
#  within this script.
#  Please see the OpenHPC Install Guide(s) for more information regarding the
#  procedure. Note that the section numbering included in this script refers to
#  corresponding sections from the companion install guide.
# ----------------------------------------------------------------------------------->

inputFile=${OHPC_INPUT_LOCAL:-/input.local}

if [ ! -e ${inputFile} ];then
   echo "Error: Unable to access local input file -> ${inputFile}"
   exit 1
else
   . ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi
# ---------------------------------
# Boot compute nodes (Section 3.10)
# ---------------------------------
for ((i=0; i<${num_computes}; i++)) ; do
   ipmitool -E -I lanplus -H ${c_bmc[$i]} -U ${bmc_username} -P ${bmc_password} chassis power reset
done

This script is part of an installation process for an HPC cluster using OpenHPC. It focuses on rebooting the compute nodes via IPMI.

The error encountered when executing this script is as follows:

[root@master-ohpc /]# ./recipe13.sh
Unable to read password from environment
Chassis Power Control: Reset
Unable to read password from environment

Solution:

In your script, you are using:

ipmitool -E -I lanplus -H ${c_bmc[$i]} -U ${bmc_username} -P ${bmc_password} chassis power reset

  • The -E option tells ipmitool to read the password from the IPMI_PASSWORD environment variable.

  • But you’re also using -P ${bmc_password}, which is supposed to pass the password directly.

If your script already loads the bmc_password variable from /input.local, remove the -E option and keep only -P:

ipmitool -I lanplus -H ${c_bmc[$i]} -U ${bmc_username} -P ${bmc_password} chassis power reset

This will force the use of the password defined in bmc_password.
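The opposite fix also works: keep -E, drop -P, and export the password via the environment variable ipmitool reads. A dry-run sketch with placeholder BMC addresses and credentials:

```shell
# ipmitool's -E flag reads the password from IPMI_PASSWORD, so export it
# once instead of passing -P on every invocation (placeholder values).
export IPMI_PASSWORD=calvin
for bmc in 192.168.201.51 192.168.201.52; do
  echo "ipmitool -E -I lanplus -H ${bmc} -U root chassis power reset"
done
```

This variant keeps the password out of the process list, which `ps` would otherwise expose when -P is used.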

Result

[root@master-ohpc /]# ./recipe13.sh
Chassis Power Control: Reset
Chassis Power Control: Reset

[root@master-ohpc /]# nano recipe13.sh
[root@master-ohpc /]# ipmitool -I lanplus -H 192.168.201.51 -U root -P calvin chassis power status
Chassis Power is on
[root@master-ohpc /]# ipmitool -I lanplus -H 192.168.201.52 -U root -P calvin chassis power status
Chassis Power is on
[root@master-ohpc /]# ping 192.168.201.51
PING 192.168.201.51 (192.168.201.51) 56(84) bytes of data.
64 bytes from 192.168.201.51: icmp_seq=1 ttl=63 time=0.284 ms
64 bytes from 192.168.201.51: icmp_seq=2 ttl=63 time=0.302 ms

--- 192.168.201.51 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1037ms
rtt min/avg/max/mdev = 0.284/0.293/0.302/0.009 ms
[root@master-ohpc /]# ping 192.168.70.51
PING 192.168.70.51 (192.168.70.51) 56(84) bytes of data.
64 bytes from 192.168.70.51: icmp_seq=1 ttl=64 time=0.451 ms
64 bytes from 192.168.70.51: icmp_seq=2 ttl=64 time=0.390 ms
64 bytes from 192.168.70.51: icmp_seq=3 ttl=64 time=0.399 ms
64 bytes from 192.168.70.51: icmp_seq=4 ttl=64 time=0.392 ms
64 bytes from 192.168.70.51: icmp_seq=5 ttl=64 time=0.339 ms

--- 192.168.70.51 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 5118ms
rtt min/avg/max/mdev = 0.339/0.394/0.451/0.032 ms
[root@master-ohpc /]# ping 192.168.70.52
PING 192.168.70.52 (192.168.70.52) 56(84) bytes of data.
64 bytes from 192.168.70.52: icmp_seq=1 ttl=64 time=0.337 ms
64 bytes from 192.168.70.52: icmp_seq=2 ttl=64 time=0.398 ms
64 bytes from 192.168.70.52: icmp_seq=3 ttl=64 time=0.398 ms
64 bytes from 192.168.70.52: icmp_seq=4 ttl=64 time=0.381 ms

15. Section 14 of the recipe.sh script

#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
#  Example Installation Script Template
#  This convenience script encapsulates command-line instructions highlighted in
#  an OpenHPC Install Guide that can be used as a starting point to perform a local
#  cluster install beginning with bare-metal. Necessary inputs that describe local
#  hardware characteristics, desired network settings, and other customizations
#  are controlled via a companion input file that is used to initialize variables
#  within this script.
#  Please see the OpenHPC Install Guide(s) for more information regarding the
#  procedure. Note that the section numbering included in this script refers to
#  corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------

inputFile=${OHPC_INPUT_LOCAL:-/input.local}

if [ ! -e ${inputFile} ];then
   echo "Error: Unable to access local input file -> ${inputFile}"
   exit 1
else
   . ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi



# ---------------------------------------
# Install Development Tools (Section 4.1)
# ---------------------------------------
dnf -y install ohpc-autotools
dnf -y install EasyBuild-ohpc
dnf -y install hwloc-ohpc
dnf -y install spack-ohpc
dnf -y install valgrind-ohpc

# -------------------------------
# Install Compilers (Section 4.2)
# -------------------------------
dnf -y install gnu14-compilers-ohpc

# --------------------------------
# Install MPI Stacks (Section 4.3)
# --------------------------------
if [[ ${enable_mpi_defaults} -eq 1 ]];then
     dnf -y install openmpi5-pmix-gnu14-ohpc mpich-ofi-gnu14-ohpc
fi

if [[ ${enable_ib} -eq 1 ]];then
     dnf -y install mvapich2-gnu14-ohpc
fi
if [[ ${enable_opa} -eq 1 ]];then
     dnf -y install mvapich2-psm2-gnu14-ohpc
fi

# ---------------------------------------
# Install Performance Tools (Section 4.4)
# ---------------------------------------
dnf -y install ohpc-gnu14-perf-tools

if [[ ${enable_geopm} -eq 1 ]];then
     dnf -y install ohpc-gnu14-geopm
fi
dnf -y install lmod-defaults-gnu14-openmpi5-ohpc

# ---------------------------------------------------
# Install 3rd Party Libraries and Tools (Section 4.6)
# ---------------------------------------------------
dnf -y install ohpc-gnu14-serial-libs
dnf -y install ohpc-gnu14-io-libs
dnf -y install ohpc-gnu14-python-libs
dnf -y install ohpc-gnu14-runtimes
if [[ ${enable_mpi_defaults} -eq 1 ]];then
     dnf -y install ohpc-gnu14-mpich-parallel-libs
     dnf -y install ohpc-gnu14-openmpi5-parallel-libs
fi
if [[ ${enable_ib} -eq 1 ]];then
     dnf -y install ohpc-gnu14-mvapich2-parallel-libs
fi
if [[ ${enable_opa} -eq 1 ]];then
     dnf -y install ohpc-gnu14-mvapich2-parallel-libs
fi

# ----------------------------------------
# Install Intel oneAPI tools (Section 4.7)
# ----------------------------------------
if [[ ${enable_intel_packages} -eq 1 ]];then
     dnf -y install intel-oneapi-toolkit-release-ohpc
     rpm --import https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
     dnf -y install intel-compilers-devel-ohpc
     dnf -y install intel-mpi-devel-ohpc
     if [[ ${enable_opa} -eq 1 ]];then
          dnf -y install mvapich2-psm2-intel-ohpc
     fi
     dnf -y install openmpi5-pmix-intel-ohpc
     dnf -y install ohpc-intel-serial-libs
     dnf -y install ohpc-intel-geopm
     dnf -y install ohpc-intel-io-libs
     dnf -y install ohpc-intel-perf-tools
     dnf -y install ohpc-intel-python3-libs
     dnf -y install ohpc-intel-mpich-parallel-libs
     dnf -y install ohpc-intel-mvapich2-parallel-libs
     dnf -y install ohpc-intel-openmpi5-parallel-libs
     dnf -y install ohpc-intel-impi-parallel-libs
fi

# -------------------------------------------------------------
# Allow for optional sleep to wait for provisioning to complete
# -------------------------------------------------------------
sleep ${provision_wait}

This script runs successfully without generating any errors. The resulting output is:



Dependencies resolved.
==============================================================================================================================================================================================================
 Package                                               Architecture                             Version                                               Repository                                         Size
==============================================================================================================================================================================================================
Installing:
 ohpc-autotools                                        x86_64                                   3.2-320.ohpc.1.1                                      OpenHPC-updates                                   6.9 k
Installing dependencies:
 autoconf-ohpc                                         x86_64                                   2.71-300.ohpc.2.6                                     OpenHPC                                           953 k
 automake-ohpc                                         x86_64                                   1.16.5-300.ohpc.2.5                                   OpenHPC                                           806 k
 libtool-ohpc                                          x86_64                                   2.4.6-300.ohpc.1.5                                    OpenHPC                                           680 k
 m4                                                    x86_64                                   1.4.19-1.el9                                          appstream                                         294 k
 perl-Thread-Queue                                     noarch                                   3.14-460.el9                                          appstream                                          21 k
 perl-threads                                          x86_64                                   1:2.25-460.el9                                        appstream                                          57 k
 perl-threads-shared                                   x86_64                                   1.61-460.el9.0.1                                      appstream                                          44 k

Transaction Summary
==============================================================================================================================================================================================================
Install  8 Packages

Total download size: 2.8 M
Installed size: 12 M
Downloading Packages:

View full output

16. Section 15 of the recipe.sh script

#!/usr/bin/bash
# -----------------------------------------------------------------------------------------
#  Example Installation Script Template
#  This convenience script encapsulates command-line instructions highlighted in
#  an OpenHPC Install Guide that can be used as a starting point to perform a local
#  cluster install beginning with bare-metal. Necessary inputs that describe local
#  hardware characteristics, desired network settings, and other customizations
#  are controlled via a companion input file that is used to initialize variables
#  within this script.
#  Please see the OpenHPC Install Guide(s) for more information regarding the
#  procedure. Note that the section numbering included in this script refers to
#  corresponding sections from the companion install guide.
# -----------------------------------------------------------------------------------------

inputFile=${OHPC_INPUT_LOCAL:-/input.local}

if [ ! -e ${inputFile} ];then
   echo "Error: Unable to access local input file -> ${inputFile}"
   exit 1
else
   . ${inputFile} || { echo "Error sourcing ${inputFile}"; exit 1; }
fi


# ------------------------------------
# Resource Manager Startup (Section 5)
# ------------------------------------
systemctl enable munge
systemctl enable slurmctld
systemctl start munge
systemctl start slurmctld
pdsh -w ${compute_prefix}[1-${num_computes}] systemctl start munge
pdsh -w ${compute_prefix}[1-${num_computes}] systemctl start slurmd

# Optionally, generate nhc config
pdsh -w c1 "/usr/sbin/nhc-genconf -H '*' -c -" | dshbak -c
useradd -m test
wwctl overlay build
sleep 90

The script execution output shows errors, as seen below:



compute1: Warning: Permanently added 'compute1' (ED25519) to the list of known hosts.
compute2: Warning: Permanently added 'compute2' (ED25519) to the list of known hosts.
compute1: Permission denied, please try again.
compute2: Permission denied, please try again.
compute1: Permission denied, please try again.
compute2: Permission denied, please try again.
compute1: root@compute1: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
pdsh@master-ohpc: compute1: ssh exited with exit code 255
compute2: root@compute2: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
pdsh@master-ohpc: compute2: ssh exited with exit code 255
compute1: Permission denied, please try again.
compute2: Permission denied, please try again.
compute1: Permission denied, please try again.
compute1: root@compute1: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
pdsh@master-ohpc: compute1: ssh exited with exit code 255
compute2: Permission denied, please try again.
compute2: root@compute2: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
pdsh@master-ohpc: compute2: ssh exited with exit code 255
c1: ssh: Could not resolve hostname c1: Name or service not known
pdsh@master-ohpc: c1: ssh exited with exit code 255
Building system overlays for compute1: [wwinit]
Created image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img
Compressed image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img.gz
Building runtime overlays for compute1: [generic]
Created image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img
Compressed image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img.g
Building system overlays for compute2: [wwinit]
Created image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img
Compressed image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img.gz
Building runtime overlays for compute2: [generic]
Created image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img
Compressed image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img.g

We will proceed to resolve the errors identified.

Note that the c1 failure is expected: the nhc-genconf line hardcodes the hostname c1, while our nodes are named compute1 and compute2. Replace c1 with ${compute_prefix}1 (here, compute1) for that step to work.

[root@master-ohpc /]# ls -l /root/.ssh/id_rsa.pub
-rw-r--r-- 1 root root 554 Feb 10 05:38 /root/.ssh/id_rsa.pub

[root@master-ohpc /]# ssh-copy-id root@compute1
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already in
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the key(s)
root@compute1's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'root@compute1'"
and check to make sure that only the key(s) you wanted were added.

[root@master-ohpc /]# ssh-copy-id root@compute2
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already in
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the key(s)
root@compute2's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'root@compute2'"
and check to make sure that only the key(s) you wanted were added.

The user is copying their SSH public key (id_rsa.pub) from the master node (master-ohpc) to two remote compute nodes (compute1 and compute2) using the ssh-copy-id command. This sets up passwordless SSH login from the master to the compute nodes for the root user. After this setup, the user can log in via SSH without needing to enter the password.
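What ssh-copy-id does can be reproduced by hand: it appends the public key to the target's ~/.ssh/authorized_keys with the right permissions. A self-contained sketch using a throwaway key in a temporary directory (no remote host involved):

```shell
# Generate a throwaway key pair and "install" its public half the way
# ssh-copy-id would on the remote node (all paths are temporary).
tmp=$(mktemp -d)
ssh-keygen -q -t rsa -N '' -f "$tmp/id_rsa"
mkdir -p -m 0700 "$tmp/dot_ssh"
cat "$tmp/id_rsa.pub" >> "$tmp/dot_ssh/authorized_keys"
chmod 0600 "$tmp/dot_ssh/authorized_keys"
grep -c '^ssh-rsa' "$tmp/dot_ssh/authorized_keys"
```

Knowing this makes it easy to install a key manually when ssh-copy-id is unavailable, or to bake the key into a Warewulf overlay instead of pushing it per node.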

[root@master-ohpc /]# ssh 192.168.70.51
root@192.168.70.51's password: 
Last failed login: Tue Feb 25 16:37:14 CET 2025 from 192.168.70.41 on ssh:notty
There were 4 failed login attempts since the last successful login.
Last login: Tue Feb 25 16:02:39 2025 from 192.168.70.41
[root@compute1 ~]# cat /root/.ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA4BqobqbyAfYhDZ2avy1ALtyCxt9xURo3mh2hZj/FeUgfasyb8FERML1KMRveu9FxKz/w4Pkw
PRc9WbN6t4uB4b+4dDd3bY+GsA3tL6d8ysEkb+y8HsH4hzAe+2cpE1fxEmkgOvJo0t5zCDAbqEmsJ1Nsit3U1k9CK2ZZM3t9Gac/PRkwu
kPskAl0W2Po+C1kdoA98FrbAbh3byr9QsVaMEvLR2djHgZu0ukBeAv3t4K9Qoys1tLFSL0c0h7r4dd30sJv8NGdwqK+c2b0bf4LvkB3J
[root@compute1 ~]# exit
logout
Connection to 192.168.70.51 closed.

[root@master-ohpc /]# ssh 192.168.70.52
root@192.168.70.52's password: 
Last failed login: Tue Feb 25 15:37:14 +00 2025 from 192.168.70.41 on ssh:notty
There were 4 failed login attempts since the last successful login.
Last login: Tue Feb 4 15:17:38 2025 from 192.168.70.41
[root@compute2 ~]# cat /root/.ssh/authorized_keys
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA4BqobqbyAfYhDZ2avy1ALtyCxt9xURo3mh2hZj/FeUgfasyb8FERML1KMRveu9FxKz/w4Pkw
PRc9WbN6t4uB4b+4dDd3bY+GsA3tL6d8ysEkb+y8HsH4hzAe+2cpE1fxEmkgOvJo0t5zCDAbqEmsJ1Nsit3U1k9CK2ZZM3t9Gac/PRkwu
kPskAl0W2Po+C1kdoA98FrbAbh3byr9QsVaMEvLR2djHgZu0ukBeAv3t4K9Qoys1tLFSL0c0h7r4dd30sJv8NGdwqK+c2b0bf4LvkB3J
[root@compute2 ~]# exit
logout
Connection to 192.168.70.52 closed.

In this step, the user verifies that the SSH public key has been successfully copied to the authorized_keys file on both compute nodes (192.168.70.51 and 192.168.70.52).

They connect to each node using SSH as root, check the contents of /root/.ssh/authorized_keys, and confirm that the correct public key has been added.

This ensures that passwordless SSH access is now functional from the master node to the compute nodes.

We then make changes to the /etc/ssh/sshd_config file to adjust the SSH server settings:

[root@compute2 ~]# vi /etc/ssh/sshd_config
[root@compute2 ~]#  systemctl restart sshd

The following changes were made to the /etc/ssh/sshd_config file:

#LoginGraceTime 2m
PermitRootLogin yes
#StrictModes yes 
#MaxAuthTries 6
# To disable tunneled clear text passwords, change to no here!
PasswordAuthentication yes
#PermitEmptyPasswords no
PubkeyAuthentication yes
# but this is overridden so installations will only check .ssh/authorized_keys
AuthorizedKeysFile     .ssh/authorized_keys

However, even after setting PubkeyAuthentication to yes, pdsh from the master still failed with permission errors:

[root@master-ohpc /]# pdsh -w compute1 uptime
compute1: Permission denied, please try again.
compute1: Permission denied, please try again.
compute1: root@compute1: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
pdsh@master-ohpc: compute1: ssh exited with exit code 255
[root@master-ohpc /]#

COMPUTE1: The content of ~/.ssh/authorized_keys is: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDb0AYq/Hv9ZDavylAIU7cXt9xuRa03m2hW2j/FeUgFasv9bEFERmL1KMReuyFxJKz/w48wSI76QlM2RPRMA0yYHgYnRpGDzSqt3wTCG6ouvqK0SlGvqZk9BowlJBltOa4nwAoty7I2hnUDTwjmMLtegTbnDvqAmh+G/Wi3RHRr0cWUO1BtbQNlx0R3oXYbAI3Q3xrl6dg3byelcky+B+sHdWaZ1e+2CpfElkm6qUw9OHlDPTZ3CCbAq0emU51nSti3UU2KC2zzMi9acFPkuwJSqeajRMTaJKYQZIIowLiMk/RyED2HinJfcjyECMIH/mIiP+1ekWVr6BRfqLL4cE+G7OTi4yQTcG/0BM/4p0KfpJl4IdbpWuYYkPSla3WO2oP+CL4dOA94FEbWAd3Ybg+qYsaVaMEGWlR2djHgAzUOUkbeAv3t4JKQ0ys1tLFISLo0h7r4dd3O5u/8NGdWqK+2cDb8f4lxKb3JI0PjHvtJg0AUAiQNk9az3rMt22PkS00=

Master (SMS): The output of cat /root/.ssh/cluster.pub is: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDLvl1s4VJ0A0SQHyGkymiARFuvSNi+tUqEJHVUGMMu1utjNmT803Q89RwzPM5B+//eLjH97Rz62tQo9PtgOmdVdk/kCgCj13AtBVaDK+jkFnXDzRW7fQXjHNPp3/CpNhuWPGMSwiVdGa6+g2NJ+HpJdsnPP/FSrRrFjyHUudUFQ9H8LgxMGQSEId4s5hLPZLFNoU3cI7uTa+yPmSLRtYJB0+W50r2n/4JIQpobX4mX+ubCsUPzlePAOVhXcZ9jXpK7Vz37zlQ7aT3nhXqbBIQexVf5SLiXeBdzxhtcM9gPSGJ+1Dxt+ppmwuvVS4Wyr5skQOIXqb2ea6Ff6SXzzZAz+zOIuz1fL280250Qy/pD7FLub3ZOq6HmpKLDycVS3if6XOHwCP/emgPBdUm2os8pSpUOtXI5xd/GP+EjGBqf4YPx59lrfArlKXcxuCQiCBmxr58zQQehQ+Y1rsPnfozi1trEq4YeSJ8/FyYhsAHevxaECOsAsmXPGte0nv1COFk= Warewulf Cluster key

Note: This is the Warewulf cluster public key.

Problem

The two public keys are not identical! This means that the compute node does not have the correct public key from the master node, which can cause authentication failures.

Solution

On the master node, run the following command to correctly copy the right key to compute1:

ssh-copy-id -i /root/.ssh/cluster.pub root@compute1

This will install the proper cluster public key on compute1.

Don’t forget to remove the old key from ~/.ssh/authorized_keys on compute1 to avoid conflicts or unauthorized access.
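A fast way to catch this kind of mismatch is to compare key fingerprints instead of eyeballing base64 blobs (on the real cluster: `ssh-keygen -lf /root/.ssh/cluster.pub` on the master versus `ssh-keygen -lf ~/.ssh/authorized_keys` on the node). A self-contained illustration with two throwaway keys:

```shell
# Two distinct keys produce distinct fingerprints; matching fingerprints
# would confirm the node holds the master's key.
tmp=$(mktemp -d)
ssh-keygen -q -t rsa -N '' -f "$tmp/cluster"
ssh-keygen -q -t rsa -N '' -f "$tmp/node"
fp_cluster=$(ssh-keygen -lf "$tmp/cluster.pub" | awk '{print $2}')
fp_node=$(ssh-keygen -lf "$tmp/node.pub" | awk '{print $2}')
if [ "$fp_cluster" = "$fp_node" ]; then echo "keys match"; else echo "keys differ"; fi
```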

As a result of these modifications, the output now appears as follows:

[root@master-ohpc /]# ./recipe15.sh
compute2: Failed to start munge.service: Unit munge.service not found.
pdsh@master-ohpc: compute2: ssh exited with exit code 5
compute1: Failed to start munge.service: Unit munge.service not found.
pdsh@master-ohpc: compute1: ssh exited with exit code 5
compute1: Failed to start slurmd.service: Unit slurmd.service not found.
pdsh@master-ohpc: compute1: ssh exited with exit code 5
compute2: Failed to start slurmd.service: Unit slurmd.service not found.
pdsh@master-ohpc: compute2: ssh exited with exit code 5
c1: ssh: Could not resolve hostname c1: Name or service not known
pdsh@master-ohpc: c1: ssh exited with exit code 255
useradd: user 'test' already exists
Building system overlays for compute1: [wwinit]
Created image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img
Compressed image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img.gz
Building runtime overlays for compute1: [generic]
Created image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img
Compressed image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img.gz
Building system overlays for compute2: [wwinit]
Created image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img
Compressed image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img.gz
Building runtime overlays for compute2: [generic]
Created image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img
Compressed image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img.gz

Problem

Connection issue from the compute nodes: Network unreachable

Solution

[root@compute2 /]# ip route
169.254.1.0/24 dev idrac proto kernel scope link src 169.254.1.2 metric 101
192.168.70.0/24 dev eth2 proto kernel scope link src 192.168.70.52 metric 100
[root@compute2 /]# nmtui
[root@compute2 /]# ip route add default via 192.168.70.1
[root@compute2 /]# ip route
default via 192.168.70.1 dev eth2
169.254.1.0/24 dev idrac proto kernel scope link src 169.254.1.2 metric 101
192.168.70.0/24 dev eth2 proto kernel scope link src 192.168.70.52 metric 100
[root@compute2 /]# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=116 time=17.2 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=116 time=17.0 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=116 time=16.4 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=116 time=16.4 ms
64 bytes from 8.8.8.8: icmp_seq=6 ttl=116 time=18.9 ms
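Note that a route added with ip route add does not survive a reboot. Under NetworkManager, one way to persist the gateway is to set it on the connection profile with nmcli. The sketch below assumes the profile is named after the interface (eth2) and uses the gateway 192.168.70.1 from the routing table above; the echo prefix makes it a dry run.

```shell
# Persist the default gateway on the NetworkManager profile instead of
# re-adding it by hand after every boot. Drop the 'echo' prefix to apply.
IFACE=eth2                 # connection profile name (assumed = interface name)
GATEWAY=192.168.70.1       # gateway used earlier in this guide

echo nmcli connection modify "$IFACE" ipv4.gateway "$GATEWAY"
echo nmcli connection up "$IFACE"
```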
MUNGE Installation and Keyfile Error Resolution on Compute Nodes

On each node (compute1, compute2), run the following commands:

dnf install -y munge munge-libs munge-devel
systemctl enable --now munge 

Then, verify that MUNGE is working properly:

[root@compute2 ~]# systemctl status munge.service
× munge.service - MUNGE authentication service
     Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Wed 2025-02-26 14:02:06 +00; 1min 47s ago
       Docs: man:munged(8)
    Process: 29177 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)
        CPU: 4ms

Feb 26 14:02:05 compute2 systemd[1]: Starting MUNGE authentication service...
Feb 26 14:02:06 compute2 munged[29177]: munged: Error: Failed to check keyfile "/etc/munge/munge.key": No s>
Feb 26 14:02:06 compute2 systemd[1]: munge.service: Control process exited, code=exited, status=1/FAILURE
Feb 26 14:02:06 compute2 systemd[1]: munge.service: Failed with result 'exit-code'.
Feb 26 14:02:06 compute2 systemd[1]: Failed to start MUNGE authentication service.

Solution

[root@compute2 ~]# cd /etc/munge/
[root@compute2 munge]# ls
[root@compute2 munge]# systemctl stop munge
[root@compute2 munge]# create-munge-key
Generating a pseudo-random key using /dev/urandom completed.
[root@compute2 munge]# systemctl start munge
[root@compute2 munge]# chown munge:munge /etc/munge/munge.key
[root@compute2 munge]# chmod 0600 /etc/munge/munge.key
[root@compute2 munge]# systemctl restart munge
[root@compute2 munge]# systemctl status munge
● munge.service - MUNGE authentication service
     Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled)
     Active: active (running) since Wed 2025-02-26 14:12:30 +00; 7s ago
       Docs: man:munged(8)
    Process: 29320 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
   Main PID: 29322 (munged)
      Tasks: 4 (limit: 202700)
     Memory: 1.6M
        CPU: 5ms
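One caveat with generating a key per node: MUNGE authentication only succeeds when every host in the cluster shares an identical /etc/munge/munge.key. If the master already has a key, the more robust fix is to copy it out to the compute nodes. A sketch, run from the master and using the hostnames from this guide:

```shell
# Distribute the master's munge.key so all nodes authenticate against the
# same secret, then fix ownership/permissions and restart munged remotely.
for node in compute1 compute2; do
    scp -p /etc/munge/munge.key root@"$node":/etc/munge/munge.key
    ssh root@"$node" 'chown munge:munge /etc/munge/munge.key &&
                      chmod 0600 /etc/munge/munge.key &&
                      systemctl restart munge'
done
```

A quick cross-node check afterwards is `munge -n | ssh compute1 unmunge`, which should report STATUS: Success.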

Once MUNGE is successfully activated on both compute nodes, the resulting output is:

[root@master-ohpc /]# ./recipe15.sh
compute1: Failed to start slurmd.service: Unit slurmd.service not found.
pdsh@master-ohpc: compute1: ssh exited with exit code 5
compute2: Failed to start slurmd.service: Unit slurmd.service not found.
pdsh@master-ohpc: compute2: ssh exited with exit code 5
c1: ssh: Could not resolve hostname c1: Name or service not known
pdsh@master-ohpc: c1: ssh exited with exit code 255
useradd: user 'test' already exists
Building system overlays for compute1: [wwinit]
Created image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img
Compressed image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img.gz
Building runtime overlays for compute1: [generic]
Created image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img
Compressed image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img.gz
Building system overlays for compute2: [wwinit]
Created image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img
Compressed image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img.gz
Building runtime overlays for compute2: [generic]
Created image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img
Compressed image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img.gz
Installing and Enabling slurmd on Each Compute Node

To resolve the SLURM error, run the following commands on each compute node:

dnf install -y slurm slurm-slurmd
systemctl enable --now slurmd
systemctl status slurmd

This installs and starts the SLURM daemon (slurmd), which is required for each compute node to communicate with the SLURM controller and accept jobs.

Once slurmd is enabled and running on the compute nodes, the resulting output is:

[root@master-ohpc /]# ./recipe15.sh
c1: ssh: Could not resolve hostname c1: Name or service not known
pdsh@master-ohpc: c1: ssh exited with exit code 255
useradd: user 'test' already exists
Building system overlays for compute1: [wwinit]
Created image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img
Compressed image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img.gz
Building runtime overlays for compute1: [generic]
Created image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img
Compressed image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img.gz
Building system overlays for compute2: [wwinit]
Created image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img
Compressed image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img.gz
Building runtime overlays for compute2: [generic]
Created image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img
Compressed image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img.gz
Troubleshooting SSH Connection and Hostname Resolution Issues
  • To resolve the hostname resolution error, make sure the following line is present in the /etc/hosts file on the master node:
# /etc/hosts

192.168.70.41 master-ohpc master-ohpc.cluster
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

# Do not edit after this line
# This block is autogenerated by master-ohpc
# Hosts: master-ohpc.cluster
# Time: 02-25-2025 07:33:30 EST
# Source:

# Warewulf Server
192.168.70.41 master-ohpc.cluster master-ohpc

# Entry for compute1
192.168.70.51 compute1.localdomain compute1.localdomain compute1 c1 compute1-default compute1-default

# Entry for compute2
192.168.70.52 compute2.localdomain compute2.localdomain compute2 compute2-default compute2-default
  • Installing the OpenHPC repository and verifying the enabled repos:
[root@compute1 srv]# dnf install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm
[root@compute1 ~]# dnf repolist
repo id                       repo name
OpenHPC                       OpenHPC-3 - Base
OpenHPC-updates               OpenHPC-3 - Updates
appstream                     Rocky Linux 9 - AppStream
baseos                        Rocky Linux 9 - BaseOS
epel                          Extra Packages for Enterprise Linux 9 - x86_64
epel-cisco-openh264           Extra Packages for Enterprise Linux 9 openh264 (From Cisco) - x86_64
extras                        Rocky Linux 9 - Extras
  • Verify that the following directory contains nhc-genconf and other components. Then, proceed to build and install Node Health Check (NHC) from source:

[root@compute1 src]# cd /usr/local/src/nhc
[root@compute1 nhc]# ls
COPYING    Makefile.am        autogen.sh    contrib           nhc            nhc-wrapper    scripts
ChangeLog  README.md          bench         helpers           nhc-genconf    nhc.conf       test
LICENSE    RELEASE_NOTES.txt  configure.ac  lbnl-nhc.spec.in  nhc-test.conf  nhc.logrotate

[root@compute1 nhc]# ./autogen.sh
[root@compute1 nhc]#  ./configure
[root@compute1 nhc]#  make
[root@compute1 nhc]# make install
  • Fix the path to nhc-genconf in recipe15.sh

Original line (incorrect path):

pdsh -w c1 "/usr/sbin/nhc-genconf -H '*' -c -" | dshbak -c

Replace it with the corrected line:

pdsh -w c1 "/usr/local/src/nhc/nhc-genconf -H '*' -c -" | dshbak -c

This ensures pdsh correctly calls the version of nhc-genconf located in the source directory, not the default system path.
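After updating /etc/hosts and building NHC, a couple of quick sanity checks confirm the prerequisites before rerunning the recipe (hostname alias and path are the ones used in this guide):

```shell
# c1 must resolve for 'pdsh -w c1' to work, and nhc-genconf must exist
# at the path the corrected script calls.
getent hosts c1 || echo "c1 does not resolve - check /etc/hosts"
test -x /usr/local/src/nhc/nhc-genconf || echo "nhc-genconf missing or not executable"
```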

After correcting the path to nhc-genconf in the script (/usr/local/src/nhc/nhc-genconf), running ./recipe15.sh now produces the following output:

c1: /usr/local/src/nhc/nhc-genconf: line 342: nhc_common_unparse_size: command not found
c1: /usr/local/src/nhc/nhc-genconf: line 346: nhc_common_unparse_size: command not found
----------------
c1
----------------
# NHC Configuration File
#
# Lines are in the form "<hostmask>||<check>"
# Hostmask is a glob, /regexp/, or {noderange}
# Comments begin with '#'
#
# This file was automatically generated by nhc-genconf
# Fri Feb 28 13:13:50 CET 2025
#

#######################################################################
###
### NHC Configuration Variables
###
# * || export MARK_OFFLINE=1 NHC_CHECK_ALL=0


#######################################################################
###
### Hardware checks
###
 * || check_hw_cpuinfo
 * || check_hw_physmem   3%
 * || check_hw_swap   3%


#######################################################################
###
### nVidia GPU checks
###
 * || check_nv_healthmon
useradd: user 'test' already exists
Building system overlays for compute1: [wwinit]
Created image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img
Compressed image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img.gz
Building runtime overlays for compute1: [generic]
Created image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img
Compressed image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img.gz
Building system overlays for compute2: [wwinit]
Created image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img
Compressed image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img.gz
Building runtime overlays for compute2: [generic]
Created image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img
Compressed image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img.gz

Resolving the error encountered during the execution of recipe15.sh:

c1: /usr/local/src/nhc/nhc-genconf: line 342: nhc_common_unparse_size: command not found
c1: /usr/local/src/nhc/nhc-genconf: line 346: nhc_common_unparse_size: command not found

[root@compute1 nhc]# grep -r "nhc_common_unparse_size" /usr/local/src/nhc

/usr/local/src/nhc/scripts/common.nhc:function nhc_common_unparse_size() {
/usr/local/src/nhc/scripts/lbnl_fs.nhc:            nhc_common_unparse_size $FS_SIZE FS_SIZE
/usr/local/src/nhc/scripts/lbnl_fs.nhc:            nhc_common_unparse_size $MIN_SIZE MIN_SIZE
/usr/local/src/nhc/scripts/lbnl_fs.nhc:            nhc_common_unparse_size $FS_SIZE FS_SIZE
/usr/local/src/nhc/scripts/lbnl_fs.nhc:            nhc_common_unparse_size $MAX_SIZE MAX_SIZE
/usr/local/src/nhc/scripts/lbnl_fs.nhc:                    nhc_common_unparse_size $FS_FREE FS_FREE
/usr/local/src/nhc/scripts/lbnl_fs.nhc:                    nhc_common_unparse_size $MIN_FREE MIN_FREE
/usr/local/src/nhc/scripts/lbnl_fs.nhc:                    nhc_common_unparse_size $FS_FREE FS_FREE
/usr/local/src/nhc/scripts/lbnl_fs.nhc:                    nhc_common_unparse_size $FS_USED FS_USED
/usr/local/src/nhc/scripts/lbnl_fs.nhc:                    nhc_common_unparse_size $MAX_USED MAX_USED
/usr/local/src/nhc/scripts/lbnl_fs.nhc:                    nhc_common_unparse_size $FS_USED FS_USED
/usr/local/src/nhc/scripts/lbnl_ps.nhc:                nhc_common_unparse_size ${PS_RSS[$THIS_PID]} NUM
/usr/local/src/nhc/scripts/lbnl_ps.nhc:                nhc_common_unparse_size $THRESHOLD LIM
/usr/local/src/nhc/scripts/lbnl_ps.nhc:                nhc_common_unparse_size ${PS_VSZ[$THIS_PID]} NUM
/usr/local/src/nhc/scripts/lbnl_ps.nhc:                nhc_common_unparse_size $THRESHOLD LIM
/usr/local/src/nhc/test/test_common.nhc:    is "`type -t nhc_common_unparse_size 2>&1`" 'function' 'nhc_common_unparse_size() loaded properly'
/usr/local/src/nhc/test/test_common.nhc:    nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc:    is "$NSIZE" "1024EB" "nhc_common_unparse_size():  $OSIZE -> 1024EB"
/usr/local/src/nhc/test/test_common.nhc:    nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc:    is "$NSIZE" "1EB" "nhc_common_unparse_size():  $OSIZE -> 1EB"
/usr/local/src/nhc/test/test_common.nhc:    nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc:    is "$NSIZE" "1023PB" "nhc_common_unparse_size():  $OSIZE -> 1023PB"
/usr/local/src/nhc/test/test_common.nhc:    nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc:    is "$NSIZE" "64TB" "nhc_common_unparse_size():  $OSIZE -> 64TB"
/usr/local/src/nhc/test/test_common.nhc:    nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc:    is "$NSIZE" "4GB" "nhc_common_unparse_size():  $OSIZE -> 4GB"
/usr/local/src/nhc/test/test_common.nhc:    nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc:    is "$NSIZE" "1023MB" "nhc_common_unparse_size():  $OSIZE -> 1023MB"
/usr/local/src/nhc/test/test_common.nhc:    nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc:    is "$NSIZE" "1MB" "nhc_common_unparse_size():  $OSIZE -> 1MB"
/usr/local/src/nhc/test/test_common.nhc:    nhc_common_unparse_size $OSIZE NSIZE
/usr/local/src/nhc/test/test_common.nhc:    is "$NSIZE" "1000kB" "nhc_common_unparse_size():  $OSIZE -> 1000kB"
/usr/local/src/nhc/test/test_common.nhc:    nhc_common_unparse_size $OSIZE NSIZE 1024 ERR
/usr/local/src/nhc/test/test_common.nhc:    is "$NSIZE" "1GB" "nhc_common_unparse_size():  $OSIZE -> 1GB with 51MB error (size)"
/usr/local/src/nhc/test/test_common.nhc:    is "$ERR" "51" "nhc_common_unparse_size():  $OSIZE -> 1GB with 51MB error (error)"
/usr/local/src/nhc/test/test_common.nhc:    nhc_common_unparse_size $OSIZE NSIZE 1024 ERR
/usr/local/src/nhc/test/test_common.nhc:    is "$NSIZE" "1177GB" "nhc_common_unparse_size():  $OSIZE -> 1177GB (1.15TB) with 0GB error (size)"
/usr/local/src/nhc/test/test_common.nhc:    is "$ERR" "0" "nhc_common_unparse_size():  $OSIZE -> 1177GB (1.15TB) with 0GB error (error)"
/usr/local/src/nhc/test/test_common.nhc:    nhc_common_unparse_size $OSIZE NSIZE 1024 ERR
/usr/local/src/nhc/test/test_common.nhc:    is "$NSIZE" "1536GB" "nhc_common_unparse_size():  $OSIZE -> 1536GB (1.5TB) with 0GB error (size)"
/usr/local/src/nhc/test/test_common.nhc:    is "$ERR" "0" "nhc_common_unparse_size():  $OSIZE -> 1536GB (1.5TB) with 0GB error (error)"
/usr/local/src/nhc/test/test_common.nhc:    nhc_common_unparse_size $OSIZE NSIZE 1024 ERR
/usr/local/src/nhc/test/test_common.nhc:    is "$NSIZE" "1792kB" "nhc_common_unparse_size():  $OSIZE -> 1792kB (1.75MB) with 0kB error (size)"
/usr/local/src/nhc/test/test_common.nhc:    is "$ERR" "0" "nhc_common_unparse_size():  $OSIZE -> 1792kB (1.75MB) with 0kB error (error)"
/usr/local/src/nhc/test/test_common.nhc:    nhc_common_unparse_size $OSIZE NSIZE 1024 ERR
/usr/local/src/nhc/test/test_common.nhc:    is "$NSIZE" "2PB" "nhc_common_unparse_size():  $OSIZE -> 2PB (1.99PB) with 11TB error (size)"
/usr/local/src/nhc/test/test_common.nhc:    is "$ERR" "11" "nhc_common_unparse_size():  $OSIZE -> 2PB (1.99PB) with 11TB error (error)"
/usr/local/src/nhc/nhc-genconf:    nhc_common_unparse_size $HW_RAM_TOTAL HW_RAM_TOTAL 1024 ERR
/usr/local/src/nhc/nhc-genconf:    nhc_common_unparse_size $HW_SWAP_TOTAL HW_SWAP_TOTAL 1024 ERR

[root@compute1 nhc]# source /usr/local/src/nhc/scripts/common.nhc
[root@compute1 nhc]# cd /usr/local/src/nhc/scripts/
[root@compute1 scripts]# ls
common.nhc          lbnl_cmd.nhc  lbnl_file.nhc  lbnl_hw.nhc   lbnl_moab.nhc  lbnl_nv.nhc
csc_nvidia_smi.nhc  lbnl_dmi.nhc  lbnl_fs.nhc    lbnl_job.nhc  lbnl_net.nhc   lbnl_ps.nhc
[root@compute1 scripts]# nhc_common_unparse_size 1024 RAM_SIZE

The output after executing recipe15.sh is now:

[root@master-ohpc /]# ./recipe15.sh
c1: /usr/local/src/nhc/scripts/common.nhc: line 528: die: command not found  
pdsh@master-ohpc: c1: ssh exited with exit code 1  

c1  
################################################################################  
# NHC Configuration File  
#  
# Lines are in the form "<hostmask>||<check>"
# Hostmask is a glob, /regexp/, or {noderange}
# Comments begin with "#"  
#  
# This file was automatically generated by nhc-genconf  
# Fri Feb 28 13:36:30 CET 2025  
################################################################################  

## NHC Configuration Variables  
# * || export MARK_OFFLINE=1 NHC_CHECK_ALL=0

################################################################################  
# Hardware checks  
################################################################################  

* || check_hw_cpuinfo  
useradd: user 'test' already exists  
Building system overlays for compute1: [wwinit]
Created image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img
Compressed image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img.gz
Building runtime overlays for compute1: [generic]
Created image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img
Compressed image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img.gz

Building system overlays for compute2: [wwinit]
Created image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img
Compressed image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img.gz
Building runtime overlays for compute2: [generic]
Created image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img
Compressed image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img.gz

################################################################################  

Resolving the error encountered during the execution of recipe15.sh:

The error shown in the output c1: /usr/local/src/nhc/scripts/common.nhc: line 528: die: command not found means that the die function or command used in the common.nhc script is not defined or accessible at runtime.

The die function is commonly used in scripts to stop execution and display an error message. If it’s not defined, this kind of error may occur.

The die function should be defined somewhere in the script itself or in another sourced file. You should search for its definition in common.nhc or other related files.

If the function is missing (which is our case), you can add it at the end of the common.nhc file as follows, to terminate execution with an error message:

die() {
    echo "$1"
    exit 1
}

[root@compute1 ~]# cd /usr/local/src/nhc/scripts/
[root@compute1 scripts]# vi common.nhc
# Find system definition for UID range
function nhc_common_get_max_sys_uid() {
    local LINE UID_MIN SYS_UID_MAX

    MAX_SYS_UID=${MAX_SYS_UID:-99}
    if [[ -e "$LOGIN_DEFS_SRC" ]]; then
        while read LINE ; do
            if [[ "${LINE#UID_MIN}" != "$LINE" ]]; then
                UID_MIN=${LINE//[!0-9]/}
            elif [[ "${LINE#UID_MAX}" != "$LINE" ]]; then
                SYS_UID_MAX=${LINE//[!0-9]/}
                break
            fi
        done < "$LOGIN_DEFS_SRC"
        if [[ -n "$SYS_UID_MAX" ]]; then
            MAX_SYS_UID=$((SYS_UID_MAX+0))
        fi
        if [[ -n "$UID_MIN" ]]; then
            UID_MIN=$((UID_MIN-1))
        fi
        if (( MAX_SYS_UID <= 0 )); then
            MAX_SYS_UID=99
        fi
        return 0
    else
        return 1
    fi
}

die() {
    echo "$1"
    exit 1
}
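Before rerunning the recipe, the shim can be exercised in a subshell to confirm it prints its message and exits non-zero without killing the login shell (a quick check, not part of the original guide):

```shell
# Minimal die() shim, as added to common.nhc above.
die() {
    echo "$1"
    exit 1
}

# Run in a subshell so 'exit 1' does not terminate the current shell.
( die "test message" )
echo "exit code: $?"    # prints: exit code: 1
```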

The output after executing recipe15.sh is now:

pdsh@master-ohpc: c1: ssh exited with exit code 1
----------------
c1
----------------
# NHC Configuration File
#
# Lines are in the form "<hostmask>||<check>"
# Hostmask is a glob, /regexp/, or {noderange}
# Comments begin with '#'
#
# This file was automatically generated by nhc-genconf
# Fri Feb 28 13:40:37 CET 2025
#

#######################################################################
###
### NHC Configuration Variables
###
# * || export MARK_OFFLINE=1 NHC_CHECK_ALL=0


#######################################################################
###
### Hardware checks
###
 * || check_hw_cpuinfo
1
useradd: user 'test' already exists
Building system overlays for compute1: [wwinit]
Created image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img
Compressed image for overlay compute1/[wwinit]: /srv/warewulf/provision/overlays/compute1/__SYSTEM__.img.gz
Building runtime overlays for compute1: [generic]
Created image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img
Compressed image for overlay compute1/[generic]: /srv/warewulf/provision/overlays/compute1/__RUNTIME__.img.gz
Building system overlays for compute2: [wwinit]
Created image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img
Compressed image for overlay compute2/[wwinit]: /srv/warewulf/provision/overlays/compute2/__SYSTEM__.img.gz
Building runtime overlays for compute2: [generic]
Created image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img
Compressed image for overlay compute2/[generic]: /srv/warewulf/provision/overlays/compute2/__RUNTIME__.img.gz

Checking that the script runs correctly

Enabling and Starting Services (munge, slurmctld, and slurmd)

The script uses systemctl to enable and start the necessary services: munge and slurmctld on the master node, and munge and slurmd on the compute nodes via pdsh.

Verification:

  • Ensure that the services are successfully started on the master node and compute nodes.

  • Use systemctl status to verify that the services are running:

systemctl status munge

[root@master-ohpc ~]# systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled)
   Active: active (running) since Wed 2025-02-12 05:13:09 EST; 2 weeks 4 days ago
     Docs: man:munged(8)
 Main PID: 3279213 (munged)
   Tasks: 1 (limit: 48899)
   Memory: 4.1M
   CPU: 2.229s
   CGroup: /system.slice/munge.service
           └─3279213 /usr/sbin/munged

Feb 28 07:36:29 master-ohpc.cluster systemd[1]: /usr/lib/systemd/system/munge.service:10: PIDFile= reference...
Feb 28 07:36:29 master-ohpc.cluster systemd[1]: /usr/lib/systemd/system/munge.service:10: PIDFile= reference...
Feb 28 08:09:22 master-ohpc.cluster systemd[1]: /usr/lib/systemd/system/munge.service:10: PIDFile= reference...
Feb 28 08:09:22 master-ohpc.cluster systemd[1]: /usr/lib/systemd/system/munge.service:10: PIDFile= reference...
Feb 28 08:09:22 master-ohpc.cluster systemd[1]: /usr/lib/systemd/system/munge.service:10: PIDFile= reference...

systemctl status slurmctld

[root@master-ohpc ~]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; preset: disabled)
   Active: active (running) since Wed 2025-02-12 05:29:09 EST; 2 weeks 4 days ago
 Main PID: 2305503 (slurmctld)
   Tasks: 8 (limit: 48899)
   Memory: 23.1M
   CPU: 2min 21.800s
   CGroup: /system.slice/slurmctld.service
           └─2305503 /usr/sbin/slurmctld -D

Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: error: slurm_set_addr: Unable to resolve "compute2"
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: error: slurm_set_addr: Address family not supported
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: error: get_node_addrs: Address to compute2
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: error: slurm_set_addr: Unable to resolve "compute3"
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: error: slurm_set_addr: Address family not supported
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: error: get_node_addrs: Address to compute3
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: Recovered state of 0 reservations
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: slurmctld: backfill scheduling
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: select/cons_tres: prof_cnt = 0
Feb 12 05:29:09 master-ohpc.cluster slurmctld[2305503]: slurmctld: Running as primary controller

Fixing the reported problems

  1. Node Address Configuration Missing in slurm.conf

The configuration in slurm.conf defines the nodes compute1 and compute2, but it lacks explicit IP address declarations (NodeAddr). Since slurmctld is reporting a resolution issue for compute2, it is likely that it cannot resolve its address.

Manually add the IP address of each node in /etc/slurm/slurm.conf on master-ohpc:

NodeName=compute1 NodeAddr=192.168.70.51 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN  
NodeName=compute2 NodeAddr=192.168.70.52 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN

Then reload the configuration and restart the Slurm daemons (slurmctld on the master node, slurmd on each compute node):

systemctl restart slurmctld  
pdsh -w compute1,compute2 systemctl restart slurmd
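A quick way to confirm that no NodeName line was missed (the slurm.conf path is the one used in this guide; the awk check itself is generic):

```shell
# Flag any NodeName definition in slurm.conf that still lacks a NodeAddr.
awk '/^NodeName=/ && !/NodeAddr=/ { print "missing NodeAddr:", $0; bad = 1 }
     END { exit bad }' /etc/slurm/slurm.conf \
  && echo "all NodeName lines carry a NodeAddr"
```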
  2. Create the Log File Manually

If the log file /var/log/slurmctld.log is missing:

touch /var/log/slurmctld.log  
chown slurm:slurm /var/log/slurmctld.log  
chmod 644 /var/log/slurmctld.log

Check if the /var/log/slurm/ directory exists:

ls -ld /var/log/slurm

If it doesn’t exist, create it:

mkdir -p /var/log/slurm  
chown slurm:slurm /var/log/slurm  
chmod 755 /var/log/slurm
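As an aside, install(1) can collapse the create/chown/chmod sequences above into a single command each (same paths as above; the ownership change requires root):

```shell
# Create the log directory and an empty log file with the desired
# owner and mode in one step each.
install -d -o slurm -g slurm -m 755 /var/log/slurm
install    -o slurm -g slurm -m 644 /dev/null /var/log/slurmctld.log
```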

Restart slurmctld to test:

systemctl restart slurmctld  
systemctl status slurmctld
  3. Reset Node States via scontrol

Use the following commands to reset the states of compute nodes:


scontrol update NodeName=compute1 State=DRAIN Reason="Manual reset"  
scontrol update NodeName=compute2 State=DRAIN Reason="Manual reset"

scontrol update NodeName=compute1 State=DOWN Reason="Reset"  
scontrol update NodeName=compute2 State=DOWN Reason="Reset"

scontrol update NodeName=compute1 State=RESUME  
scontrol update NodeName=compute2 State=RESUME
  4. Add MemoryEnforce=YES in slurm.conf

To enforce memory limits, the recipe adds the following line to /etc/slurm/slurm.conf on master-ohpc:

MemoryEnforce=YES

Note that current Slurm releases do not document a MemoryEnforce option in slurm.conf; memory enforcement is normally handled through cgroups (ConstrainRAMSpace=yes in cgroup.conf). Verify this parameter against the slurm.conf(5) man page for your Slurm version before applying it.

The output of systemctl status slurmctld is now as follows:

[root@master-ohpc ~]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; preset: disabled)
   Active: active (running) since Mon 2025-03-03 04:50:39 EST; 7min ago
 Main PID: 3871952 (slurmctld)
   Tasks: 5
   Memory: 5.4M
   CPU: 95ms
   CGroup: /system.slice/slurmctld.service
           └─3871952 /usr/sbin/slurmctld -D

Mar 03 04:57:15 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute1 reason set to: Manual reset
Mar 03 04:57:15 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute1 state set to DRAINED*
Mar 03 04:57:15 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute2 reason set to: Manual reset
Mar 03 04:57:15 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute2 state set to DRAINED*
Mar 03 04:57:18 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute1 state set to DOWN*
Mar 03 04:57:18 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute2 state set to DOWN*
Mar 03 04:57:21 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute1 state set to IDLE
Mar 03 04:57:21 master-ohpc.cluster slurmctld[3871952]: slurmctld: update_node: node compute2 state set to IDLE

Verification of the pdsh command

Let’s check this with the following commands:

[root@master-ohpc /]# pdsh -w 192.168.70.51 systemctl status munge
192.168.70.51: ● munge.service - MUNGE authentication service
192.168.70.51:     Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled)
192.168.70.51:     Active: active (running) since Tue 2025-03-04 12:02:13 CET; 7min ago
192.168.70.51:       Docs: man:munged(8)
192.168.70.51:    Process: 140959 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
192.168.70.51:   Main PID: 140961 (munged)
192.168.70.51:      Tasks: 4 (limit: 202700)
192.168.70.51:     Memory: 1.4M
192.168.70.51:      CPU: 76ms
192.168.70.51:     CGroup: /system.slice/munge.service
192.168.70.51:             └─140961 /usr/sbin/munged

192.168.70.51: Mar 04 12:02:13 compute1 systemd[1]: Starting MUNGE authentication service...
192.168.70.51: Mar 04 12:02:13 compute1 systemd[1]: Started MUNGE authentication service.
[root@master-ohpc /]# pdsh -w c1 systemctl status slurmd
c1:              ● slurmd.service - Slurm node daemon
c1:                 Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; preset: disabled)
c1:                 Active: active (running) since Tue 2025-03-04 12:03:08 CET; 7min ago
c1:               Main PID: 141050 (slurmd)
c1:                  Tasks: 2
c1:                 Memory: 4.2M
c1:                  CPU: 78ms
c1:                 CGroup: /system.slice/slurmd.service
c1:                         └─141050 /usr/sbin/slurmd -D -s

c1: Mar 04 12:03:08 compute1 systemd[1]: Started Slurm node daemon.
c1: Mar 04 12:03:08 compute1 slurmd[141050]: slurmd: slurmd version 22.05.0 started
c1: Mar 04 12:03:08 compute1 slurmd[141050]: slurmd: Started on Tue, 04 Mar 2025 12:03:08
c1: Mar 04 12:03:08 compute1 slurmd[141050]: slurmd: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=3172s TmpDisk=17616 Uptime=504933 CPUSpecList=CPU Procs=1 
c1: yes=(null)
[root@master-ohpc /]#

However, this command fails when executed from the compute nodes:

[root@compute1 ~]# pdsh -w c2 systemctl status munge
pdsh@compute1: c2: connect: Connection refused
[root@compute1 ~]# pdsh -w 192.168.70.52 systemctl status munge
pdsh@compute1: 192.168.70.52: connect: Connection refused

Solution

[root@compute1 ~]# echo $PDSH_RCMD_TYPE

[root@compute1 ~]# export PDSH_RCMD_TYPE=ssh
[root@compute1 ~]# echo $PDSH_RCMD_TYPE
ssh
[root@compute1 ~]# echo "export PDSH_RCMD_TYPE=ssh" >> ~/.bashrc
source ~/.bashrc
[root@compute1 ~]# pdsh -w 192.168.70.52 systemctl status munge
No such rcmd module "ssh"
[root@compute1 ~]# exit
logout
Connection to c1 closed.
[root@master-ohpc /]# ssh c2
Last login: Tue Mar  4 11:03:20 2025 from 192.168.70.41
[root@compute2 ~]# export PDSH_RCMD_TYPE=ssh
[root@compute2 ~]# echo "export PDSH_RCMD_TYPE=ssh" >> ~/.bashrc
source ~/.bashrc
[root@compute2 ~]# pdsh -w c1 systemctl status munge
No such rcmd module "ssh"
[root@compute2 ~]# exit
logout
Connection to c2 closed.
[root@master-ohpc /]# ssh c1
Last login: Tue Mar  4 12:14:40 2025 from 192.168.70.41
[root@compute1 ~]# pdsh -w 192.168.70.52 systemctl status munge
No such rcmd module "ssh"
[root@compute1 ~]# pdsh -L
2 modules loaded:

Module: rcmd/exec
Author: Mark Grondona <mgrondona@llnl.gov>
Descr:  arbitrary command rcmd connect method
Active: yes

Module: rcmd/rsh
Author: Jim Garlick <garlick@llnl.gov>
Descr:  BSD rcmd connect method
Active: yes

[root@compute1 ~]# yum install pdsh-rcmd-ssh -y
Last metadata expiration check: 2:45:17 ago on Tue Mar  4 09:35:01 2025.
Dependencies resolved.
===============================================================================================
 Package                   Architecture       Version                   Repository        Size
===============================================================================================
Installing:
 pdsh-rcmd-ssh             x86_64             2.34-7.el9                epel              13 k

Transaction Summary
===============================================================================================
Install  1 Package

Total download size: 13 k
Installed size: 15 k
Downloading Packages:
pdsh-rcmd-ssh-2.34-7.el9.x86_64.rpm                            140 kB/s |  13 kB     00:00
-----------------------------------------------------------------------------------------------
Total                                                          6.6 kB/s |  13 kB     00:01
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                       1/1
  Installing       : pdsh-rcmd-ssh-2.34-7.el9.x86_64                                       1/1
  Running scriptlet: pdsh-rcmd-ssh-2.34-7.el9.x86_64                                       1/1
  Verifying        : pdsh-rcmd-ssh-2.34-7.el9.x86_64                                       1/1

Installed:
  pdsh-rcmd-ssh-2.34-7.el9.x86_64

Complete!
[root@compute1 ~]# pdsh -L
3 modules loaded:

Module: rcmd/exec
Author: Mark Grondona <mgrondona@llnl.gov>
Descr:  arbitrary command rcmd connect method
Active: yes

Module: rcmd/rsh
Author: Jim Garlick <garlick@llnl.gov>
Descr:  BSD rcmd connect method
Active: yes

Module: rcmd/ssh
Author: Jim Garlick <garlick@llnl.gov>
Descr:  ssh based rcmd connect method
Active: yes

[root@compute1 ~]# exit
logout
Connection to c1 closed.
[root@master-ohpc /]# ssh c2
Last login: Tue Mar  4 11:17:47 2025 from 192.168.70.41
[root@compute2 ~]# yum install pdsh-rcmd-ssh -y
Last metadata expiration check: 0:13:40 ago on Tue Mar  4 11:07:02 2025.
Dependencies resolved.
===============================================================================================
 Package                   Architecture       Version                   Repository        Size
===============================================================================================
Installing:
 pdsh-rcmd-ssh             x86_64             2.34-7.el9                epel              13 k

Transaction Summary
===============================================================================================
Install  1 Package

Total download size: 13 k
Installed size: 15 k
Downloading Packages:
pdsh-rcmd-ssh-2.34-7.el9.x86_64.rpm                            1.3 MB/s |  13 kB     00:00
-----------------------------------------------------------------------------------------------
Total                                                          1.1 kB/s |  13 kB     00:12
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                       1/1
  Installing       : pdsh-rcmd-ssh-2.34-7.el9.x86_64                                       1/1
  Running scriptlet: pdsh-rcmd-ssh-2.34-7.el9.x86_64                                       1/1
  Verifying        : pdsh-rcmd-ssh-2.34-7.el9.x86_64                                       1/1

Installed:
  pdsh-rcmd-ssh-2.34-7.el9.x86_64

Complete!
[root@compute2 ~]# pdsh -L
3 modules loaded:

Module: rcmd/exec
Author: Mark Grondona <mgrondona@llnl.gov>
Descr:  arbitrary command rcmd connect method
Active: yes

Module: rcmd/rsh
Author: Jim Garlick <garlick@llnl.gov>
Descr:  BSD rcmd connect method
Active: yes

Module: rcmd/ssh
Author: Jim Garlick <garlick@llnl.gov>
Descr:  ssh based rcmd connect method
Active: yes
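
Note that the plain `echo "export PDSH_RCMD_TYPE=ssh" >> ~/.bashrc` used above appends a duplicate line every time it is re-run. A guarded append keeps exactly one copy; a minimal sketch, run here against a temporary file standing in for `~/.bashrc` so it is safe to try anywhere:

```shell
# Guarded append: only add the export line if it is not already present.
rc=$(mktemp)                              # stands in for /root/.bashrc
line='export PDSH_RCMD_TYPE=ssh'
grep -qxF "$line" "$rc" || echo "$line" >> "$rc"
grep -qxF "$line" "$rc" || echo "$line" >> "$rc"   # a second run is a no-op
count=$(grep -cxF "$line" "$rc")
echo "$count"
```

Re-running the guarded append any number of times leaves a single export line in the file.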

With the rcmd/ssh module installed on both nodes, pdsh can now reach the hosts, but a new error appears:

[root@compute2 ~]# pdsh -w c1 systemctl status munge
c1: Host key verification failed.
pdsh@compute2: c1: ssh exited with exit code 255

Solution: pdsh runs ssh non-interactively, so it cannot answer the first-connection host-key prompt, and passwordless root login between the compute nodes is not yet configured. Generate a key pair on each node, exchange the public keys, and log in manually once in each direction so the host keys are recorded in ~/.ssh/known_hosts.

If the private key is missing, you can generate a new SSH key pair on compute1 (do the same on compute2).

[root@compute1 ~]# ssh-keygen -t rsa -b 2048 -f /root/.ssh/id_rsa   

This command will generate a new private key (id_rsa) and a public key (id_rsa.pub) in the /root/.ssh/ directory.

Enter passphrase (empty for no passphrase): 

When you’re prompted to enter a passphrase, you can choose to leave it empty if you don’t want a password for the private key.

Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:0oTKoULuoyFHKGqBGY5zkrdfu7q8h4bNcZHGOHJrOHQ root@compute1
The key's randomart image is:
+---[RSA 2048]----+
|                 |
|       .         |
|..  .o...        |
|*=ooEo=o         |
|@===o+..S        |
|+B+.+ ..         |
|++o* +.          |
|+ooo=...         |
|.  .*=o.         |
+----[SHA256]-----+
[root@compute1 ~]# cat /root/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDHsEf13pu9VL1pPi7RVyL3R2PujKG76Fr2wdy5B92aw9gfh0FYknnoyCr58U3wkUmcCatT+PRdIj02q2UELpjnnwTJLFXNZG2FSmg14cgW8wC3CI0Hrb/EAuTSJYk/vkAiYFzVNS6UHaclA30o1NaJ/8D9iSECdtbEOuRJP+dSnZ3VJG0Now7S+NBtsCRMW491Sj3qxsyUFl8tZNxNMrdlFdwkPK9gPUynwq+a5fpm0ZUYRdjioRbcTvyVoQLn2j37NZfUafbMn5uv/IHAmoTVph+WwZ3GsYVyYzoYV1RXmUPjnSved4NU7RW7lAltk5F4S1Y4UiN3WLLR/eocr+Lh root@compute1
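
For scripted setups, the same key generation can be done non-interactively by passing an empty passphrase with `-N ''`; a sketch, writing to a temporary directory rather than /root/.ssh so it is safe to run anywhere:

```shell
# Non-interactive key generation, assuming OpenSSH's ssh-keygen is installed.
keydir=$(mktemp -d)                       # stands in for /root/.ssh
ssh-keygen -q -t rsa -b 2048 -N '' -f "$keydir/id_rsa"
ls "$keydir"                              # both id_rsa and id_rsa.pub are created
```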

After generating the new key pair, copy compute1's public key (/root/.ssh/id_rsa.pub) into ~/.ssh/authorized_keys on compute2 under the root user (and compute2's public key into compute1's authorized_keys).

[root@compute1 ~]# exit
logout
Connection to c1 closed.
[root@compute2 ~]# vi  ~/.ssh/authorized_keys  

Copy this key, then connect to compute2 and add the public key to the ~/.ssh/authorized_keys file.
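
Instead of relying on a first manual login to record each host key, OpenSSH 7.6 and newer can accept and record unknown host keys automatically. A hedged sketch of a client-side config fragment (the `Host c*` pattern assumes this guide's node aliases c1 and c2):

```
# /root/.ssh/config -- sketch; 'c*' matches the node aliases c1, c2
Host c*
    StrictHostKeyChecking accept-new
```

With `accept-new`, ssh still refuses connections when a known host key has changed, so the protection against key substitution is kept.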

[root@compute2 ~]# exit
logout
Connection to c2 closed.
[root@master-ohpc /]# ssh c1
Last login: Tue Mar  4 13:46:51 2025 from 192.168.70.52
[root@compute1 ~]# ssh c2
Last login: Tue Mar  4 12:46:10 2025 from 192.168.70.41
[root@compute2 ~]# ssh c1
Last login: Tue Mar  4 13:49:31 2025 from 192.168.70.41
[root@compute1 ~]# pdsh -w c2 systemctl status munge
c2: ● munge.service - MUNGE authentication service
c2:      Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled)
c2:      Active: active (running) since Tue 2025-03-04 11:02:24 +00; 1h 47min ago
c2:        Docs: man:munged(8)
c2:     Process: 124785 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
c2:    Main PID: 124787 (munged)
c2:       Tasks: 4 (limit: 202700)
c2:      Memory: 1.7M
c2:         CPU: 28ms
c2:      CGroup: /system.slice/munge.service
c2:              └─124787 /usr/sbin/munged
c2:
c2: Mar 04 11:02:24 compute2 systemd[1]: Starting MUNGE authentication service...
c2: Mar 04 11:02:24 compute2 systemd[1]: Started MUNGE authentication service.
[root@master-ohpc /]# pdsh -w c1 systemctl status munge
c1: ● munge.service - MUNGE authentication service
c1:    Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; preset: disabled)
c1:    Active: active (running) since Tue 2025-03-04 12:02:13 CET; 7min ago
c1:      Docs: man:munged(8)
c1:   Process: 140959 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
c1:  Main PID: 140961 (munged)
c1:     Tasks: 4 (limit: 202700)
c1:    Memory: 1.4M
c1:     CPU: 76ms
c1:    CGroup: /system.slice/munge.service
c1:            └─140961 /usr/sbin/munged

c1: Mar 04 12:02:13 compute1 systemd[1]: Starting MUNGE authentication service...
c1: Mar 04 12:02:13 compute1 systemd[1]: Started MUNGE authentication service.
[root@master-ohpc /]# pdsh -w c1 systemctl status slurmd
c1: ● slurmd.service - Slurm node daemon
c1:    Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; preset: disabled)
c1:    Active: active (running) since Tue 2025-03-04 12:03:08 CET; 7min ago
c1:  Main PID: 141050 (slurmd)
c1:     Tasks: 2
c1:    Memory: 4.2M
c1:     CPU: 78ms
c1:    CGroup: /system.slice/slurmd.service
c1:            └─141050 /usr/sbin/slurmd -D -s

c1: Mar 04 12:03:08 compute1 systemd[1]: Started Slurm node daemon.
c1: Mar 04 12:03:08 compute1 slurmd[141050]: slurmd: slurmd version 22.05.0 started
c1: Mar 04 12:03:08 compute1 slurmd[141050]: slurmd: Started on Tue, 04 Mar 2025 12:03:08
c1: Mar 04 12:03:08 compute1 slurmd[141050]: slurmd: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=3172s TmpDisk=17616 Uptime=504933 CPUSpecList=CPU Procs=1
c1: yes=(null)
[root@master-ohpc /]#

**Verification: `pdsh -w c1 "/usr/local/src/nhc/nhc-genconf -H '*' -c -"`**

Make sure the nhc-genconf command runs without errors on c1.

[root@master-ohpc ~]# pdsh -w c1 "/usr/local/src/nhc/nhc-genconf -H '*' -c -"
c1: # NHC Configuration File
c1: #
c1: # Lines are read in order; the first matching line is used.
c1: # Comments are ignored. Use '#' for comments.
c1: # Hostmask is a glob, /regexp/, or (noderange)
c1: # Comments begin with '#'
c1: #
c1: # This file was automatically generated by nhc-genconf
c1: # Wed Mar 05 13:44:42 CET 2025
c1: #
c1:
c1: ################################################################################
c1: ###
c1: ### NHC Configuration Variables
c1: ###
c1: #* || export MARK_OFFLINE=1 NHC_CHECK_ALL=0
c1:
c1: ################################################################################
c1: ###
c1: ### Hardware checks
c1: ###
c1: * || check_hw_cpuinfo
c1:
pdsh@master-ohpc: c1: ssh exited with exit code 1

The run aborts with exit code 1 right after emitting the check_hw_cpuinfo line, which suggests the nhc-genconf script itself on compute1 is incomplete or corrupted. Replace it with a fresh copy from GitHub (https://github.com/mej/nhc/blob/master/nhc-genconf), downloading it directly from the command line with wget or curl:

[root@compute1 src]# wget https://raw.githubusercontent.com/mej/nhc/master/nhc-genconf -O /usr/local/src/nhc/nhc-genconf
--2025-03-05 15:19:22--  https://raw.githubusercontent.com/mej/nhc/master/nhc-genconf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16594 (16K) [text/plain]
Saving to: '/usr/local/src/nhc/nhc-genconf'

/usr/local/src/nhc/nhc- 100%[==============================>] 16.21K  --.-KB/s    in 0.004s

2025-03-05 15:19:22 (3.64 MB/s) - '/usr/local/src/nhc/nhc-genconf' saved [16594/16594]
[root@compute1 src]# cd /usr/local/src/nhc/
[root@compute1 nhc]# ls
COPYING      Makefile.in     automate.cache  configure.ac   lbnl-nhc.spec.in  nhc-wrapper
README       helpers         branch          contrib        missing            nhc.conf
README.md    LICENSE.txt     config.log      install.sh     nhc                nhc.cron
aclocal.m4   automate.sh     configure       scripts        nhc-genconf        nhc.logrotate
Makefile     autogen.sh      configure.deps  install.sh     nhc-wrapper.conf   test.nhc
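
Before relying on a re-downloaded script, a quick sanity check (non-empty, valid bash syntax) is cheap insurance. A sketch of the pattern, using a local stand-in file since the real path /usr/local/src/nhc/nhc-genconf only exists on the nodes:

```shell
# Download-then-verify pattern: check the fetched script parses before using it.
f=$(mktemp)                                # stands in for /usr/local/src/nhc/nhc-genconf
printf '#!/bin/bash\necho ok\n' > "$f"     # stands in for the wget download
chmod +x "$f"
test -s "$f" && bash -n "$f" && echo "syntax OK"
```

`bash -n` parses the script without executing it, so a truncated download is caught before NHC ever runs it.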

After repeating the download on both nodes, the verification from the master completes:

[root@master-ohpc ~]# pdsh -w c1 "/usr/local/src/nhc/nhc-genconf -H '*' -c -"
c1: # NHC Configuration File
c1: #
c1: # Lines are in the form "<hostmask>||<check>"
c1: # Hostmask is a glob, /regexp/, or {noderange}
c1: # Comments begin with '#'
c1: #
c1: # This file was automatically generated by nhc-genconf
c1: # Wed Mar 5 15:21:09 CET 2025
c1: #
c1:
c1: #######################################################################
c1: ###
c1: ### NHC Configuration Variables
c1: ###
c1: # * || export MARK_OFFLINE=1 NHC_CHECK_ALL=0
c1:
c1:
c1: #######################################################################
c1: ###
c1: ### DMI Checks
c1: ###
c1: # * || check_dmi_data_match -h 0x0000 -t 0 "BIOS Information: Version: 1.5.1"
c1: # * || check_dmi_data_match -h 0x0100 -t 1 "System Information: Version: Not Specified"
c1: # * || check_dmi_data_match -h 0x0200 -t 2 "Base Board Information: Version: A02"
c1: # * || check_dmi_data_match -h 0x0300 -t 3 "Chassis Information: Version: Not Specified"
c1: # * || check_dmi_data_match -h 0x0400 -t 4 "Processor Information: Version: Intel(R) Xeon(R) E-2356G CPU @ 3.20GHz"
c1: # * || check_dmi_data_match -h 0x0400 -t 4 "Processor Information: Max Speed: 4000 MHz"
c1: # * || check_dmi_data_match -h 0x0400 -t 4 "Processor Information: Current Speed: 3200 MHz"
c1: # * || check_dmi_data_match -h 0x0700 -t 7 "Cache Information: Speed: Unknown"
c1: # * || check_dmi_data_match -h 0x0701 -t 7 "Cache Information: Speed: Unknown"
c1: # * || check_dmi_data_match -h 0x0702 -t 7 "Cache Information: Speed: Unknown"
c1: # * || check_dmi_data_match -h 0x1100 -t 17 "Memory Device: Speed: 3200 MT/s"
c1: # * || check_dmi_data_match -h 0x1100 -t 17 "Memory Device: Configured Memory Speed: 3200 MT/s"
c1: # * || check_dmi_data_match -h 0x1100 -t 17 "Memory Device: Firmware Version: Not Specified"
c1: # * || check_dmi_data_match -h 0x1101 -t 17 "Memory Device: Speed: 3200 MT/s"
c1: # * || check_dmi_data_match -h 0x1101 -t 17 "Memory Device: Configured Memory Speed: 3200 MT/s"
c1: # * || check_dmi_data_match -h 0x1101 -t 17 "Memory Device: Firmware Version: Not Specified"
c1: # * || check_dmi_data_match -h 0x2600 -t 38 "IPMI Device Information: Specification Version: 2.0"
c1: # * || check_dmi_data_match -h 0x0001 -t 43 "TPM Device: Specification Version: 2.0"
c1: # * || check_dmi_data_match -h 0x0001 -t 43 "TPM Device: Description: TPM 2.0, ManufacturerID: NTC , Firmware Version: 0x00070002.0x0"
c1:
c1:
c1: #######################################################################
c1: ###
c1: ### Filesystem checks
c1: ###
c1:  * || check_fs_mount_rw -t "proc" -s "proc" -f "/proc"
c1:  * || check_fs_mount_rw -t "sysfs" -s "sysfs" -f "/sys"
c1:  * || check_fs_mount_rw -t "devtmpfs" -s "devtmpfs" -f "/dev"
c1:  * || check_fs_mount_rw -t "securityfs" -s "securityfs" -f "/sys/kernel/security"
c1:  * || check_fs_mount_rw -t "tmpfs" -s "tmpfs" -f "/dev/shm"
c1:  * || check_fs_mount_rw -t "devpts" -s "devpts" -f "/dev/pts"
c1:  * || check_fs_mount_rw -t "tmpfs" -s "tmpfs" -f "/run"
c1:  * || check_fs_mount_rw -t "pstore" -s "pstore" -f "/sys/fs/pstore"
c1:  * || check_fs_mount_rw -t "efivarfs" -s "efivarfs" -f "/sys/firmware/efi/efivars"
c1:  * || check_fs_mount_rw -t "bpf" -s "bpf" -f "/sys/fs/bpf"
c1:  * || check_fs_mount_rw -t "xfs" -s "/dev/mapper/rl-root" -f "/"
c1:  * || check_fs_mount_rw -t "selinuxfs" -s "selinuxfs" -f "/sys/fs/selinux"
c1:  * || check_fs_mount_rw -t "hugetlbfs" -s "hugetlbfs" -f "/dev/hugepages"
c1:  * || check_fs_mount_rw -t "mqueue" -s "mqueue" -f "/dev/mqueue"
c1:  * || check_fs_mount_rw -t "debugfs" -s "debugfs" -f "/sys/kernel/debug"
c1:  * || check_fs_mount_rw -t "tracefs" -s "tracefs" -f "/sys/kernel/tracing"
c1:  * || check_fs_mount_rw -t "fusectl" -s "fusectl" -f "/sys/fs/fuse/connections"
c1:  * || check_fs_mount_rw -t "configfs" -s "configfs" -f "/sys/kernel/config"
c1:  * || check_fs_mount_ro -t "ramfs" -s "none" -f "/run/credentials/systemd-sysctl.service"
c1:  * || check_fs_mount_ro -t "ramfs" -s "none" -f "/run/credentials/systemd-tmpfiles-setup-dev.service"
c1:  * || check_fs_mount_rw -t "xfs" -s "/dev/sda2" -f "/boot"
c1:  * || check_fs_mount_rw -t "vfat" -s "/dev/sda1" -f "/boot/efi"
c1:  * || check_fs_mount_rw -t "xfs" -s "/dev/mapper/rl-home" -f "/home"
c1:  * || check_fs_mount_ro -t "ramfs" -s "none" -f "/run/credentials/systemd-tmpfiles-setup.service"
c1:  * || check_fs_mount_rw -t "tracefs" -s "tracefs" -f "/sys/kernel/debug/tracing"
c1:  * || check_fs_mount_rw -t "tmpfs" -s "tmpfs" -f "/run/user/0"
c1:  * || check_fs_used /dev 90%
c1:  * || check_fs_used /sys/firmware/efi/efivars 90%
c1:  * || check_fs_used / 90%
c1:  * || check_fs_free /boot 40MB
c1:  * || check_fs_used /boot/efi 90%
c1:  * || check_fs_used /home 90%
c1:  * || check_fs_iused /dev 100%
c1:  * || check_fs_iused / 100%
c1:  * || check_fs_iused /boot 100%
c1:  * || check_fs_iused /home 98%
c1:
c1:
c1: #######################################################################

Finally, create the test55 user on compute1 from the master node and verify it:

[root@master-ohpc /]# pdsh -w c1 "useradd -m test55"
[root@master-ohpc /]# ssh c1
Last login: Wed Mar  5 14:51:13 2025 from 192.168.70.41
[root@compute1 ~]# id test55
uid=1002(test55) gid=1002(test55) groups=1002(test55)
[root@compute1 ~]#