Cluster Platform Knowledgebase¶
This site documents the considerations and guidelines for designing and developing a HPC platform for cluster computing. The documentation describes general practices and considerations when designing a HPC platform as well as recommendations and guides used by Alces Software to configure HPC platforms.
Acknowledgements¶
We recognise and respect the trademarks of all third-party providers referenced in this documentation. Please see the respective EULAs for software packages used in configuring your own environment based on this knowledgebase.
License¶
This documentation is released under the Creative-Commons: Attribution-ShareAlike 4.0 International license.
HPC Cluster Platform¶
This knowledgebase document is broken down into the following sections:
Introduction¶
The purpose of this documentation is to provide a list of considerations and guidelines for the development of a High Performance Computing (HPC) environment. This documentation should be followed in order to properly understand the structure of the environment and to ensure that key considerations are not missed along the way.
To generalise the entire process, it goes as follows:
Hardware Architecture Design -> Hardware Build -> Software Build -> Platform Delivery
Ensuring that a suitable hardware and network architecture is designed before the build process begins will allow you to create a stable base for your HPC platform.
Performing the hardware build before doing any software configuration guarantees that the network and hardware are properly setup. A partially built network during software setup can lead to unforeseen issues with communication and configuration.
Once the infrastructure has been physically built the software build can proceed. Usually the central servers will be configured first before client and compute nodes are configured.
Finally, platform delivery includes a range of connectivity, performance and quality tests which ensure that the completed environment is stable, manageable and consistent.
Note
It is recommended to read through all of the documentation before starting to design the HPC platform to understand the scope and considerations.
Overviews¶
A successful HPC cluster environment is composed of a number of different packages. When designing your system, a range of different components can be chosen to enable different functionality, features and capabilities to be implemented. Your choice of packages will depend on your specific requirements - the diagram below shows an example breakdown of the different sections reviewed in this document which can be used to build a complete HPC system:


Considerations for Network and Hardware Design¶
In general, the things to consider when designing the hardware and network solution for a HPC platform are:
- The hardware environment
- The types of nodes required in the network
- The different networks to be used by the platform
- The level of resilience desired
- The hostname and domain naming convention
These topics are covered in more detail below.
Hardware Environment¶
The hardware environment will generally be one of two setups, metal or cloud.
- Metal - Metal environments are those which are composed of on-site systems in a datacenter which are usually running 24/7.
- Cloud - Cloud environments are systems hosted in a third-party datacenter and are usually ephemeral systems that are being created and destroyed on demand.
- Metal/Cloud Hybrid - A hybrid environment usually consists of a core metal configuration that uses cloud as an overflow for additional capacity at times of high utilisation.
A hardware environment is mainly focussed on the location, capacity and permanence of the HPC platform and does not directly determine the hardware that will be used in the various systems.
Node Types¶
A complete HPC platform will be comprised of systems that serve different purposes within the network. Example node types, along with the services and purposes of those nodes, can be seen below.
- Login Node - A login node will usually provide access to the HPC platform and will be the central system that users access to run applications. How users will access the system should be considered; usually this will be SSH and some graphical login service, such as VNC.
- Master Node - A master node will usually run services for the HPC platform, such as the master process for a job scheduler, monitoring software and user management services.
- Compute Node - Compute nodes are usually used for running HPC applications that are queued through a job scheduler. Additionally, these can be used for VM deployments (via software like OpenStack) or other computational uses. Compute nodes usually have large numbers of cores and large amounts of memory as well as high bandwidth interconnect (like Infiniband).
- Special-purpose Node - Some compute nodes may feature a particular specification to be used for a particular job, or stage in your workflow. Examples may include nodes with more memory, larger amounts of local scratch storage, or GPU/FPGA devices installed.
- Storage Node - The storage node will serve network storage solutions to systems on the network. It would have some sort of storage array connected to it which would provide large and resilient storage.
The above types are not strict. Services can be mixed, matched and moved around to create the desired balance and distribution of services and functions for the platform.
Different Networks¶
The network in the system will most likely be broken up (physically or virtually with VLANs) into separate networks to serve different usages and isolate traffic. Potential networks that may be in the HPC platform are:
- Primary Network - The main network that all systems are connected to.
- Out-of-Band Network - A separate network for management traffic. This could contain on-board BMCs, switch management ports and disk array management ports. Typically this network would only be accessible by system administrators from within the HPC network.
- High Performance Network - Usually built on an Infiniband fabric, the high performance network would be used by the compute nodes for running large parallel jobs over MPI. This network can also be used for storage servers to provide performance improvements to data access.
- External Networks - The network outside of the HPC environment that nodes may need to access. For example, the Master Node could be connected to an Active Directory server on the external network and behave as a slave to relay user information to the rest of the HPC environment.
- Build Network - This network can host a DHCP server for deploying operating systems via PXE boot kickstart installations. It allows for systems that require a new build or rebuild to be flipped over and provisioned without disturbing the rest of the network.
- DMZ - A demilitarised zone would contain any externally-facing services; this could be set up in conjunction with the external network access depending on the services and traffic passing through.
The above networks could be physically or virtually separated from one another. In a physical separation scenario there will be a separate network switch for each network, preventing any sort of cross-communication. In a virtually separated network there will be multiple bridged switches that separate traffic by dedicating ports (or tagging traffic) to different VLANs. The benefit of the VLAN solution is that the bridged switches (along with bonded network interfaces) provide additional network redundancy.
Note
If a cloud environment is being used then it is most likely that all systems will reside on the primary network and no others. This is due to the network configuration from the cloud providers.
Resilience¶
How well a system can cope with failures is crucial when delivering a HPC platform. Adequate resilience can allow for maximum system availability with a minimal chance of failures disrupting the user. System resilience can be improved with many hardware and software solutions, such as:
- RAID Arrays - A RAID array is a collection of disks configured in such a way that they become a single storage device. There are different RAID levels which improve data redundancy or storage performance (and maybe even both). Depending on the RAID level used, a disk in the array can fail without disrupting the access to data and can be hot swapped to rebuild the array back to full functionality. [1]
- Service Redundancy - Many software services have the option to configure a slave/failover server that can take over the service management should the master process be unreachable. Having a secondary server that mirrors critical network services would provide suitable resilience to master node failure.
- Failover Hardware - For many types of hardware there is the possibility of setting up failover devices. For example, in the event of a power failure (either on the circuit or in a power supply itself) a redundant power supply will continue to provide power to the server without any downtime occurring.
There are many more options than the examples above for improving the resilience of the HPC platform; it is worth exploring and considering the available solutions during design. A brief software RAID sketch is shown below.
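As an illustration of the RAID option above, the following is a minimal sketch of creating a two-disk mirror (RAID 1) with mdadm; the device names and mount point are assumptions, and hardware RAID controllers would instead be configured through their own management tooling.
# Assumed example: two spare disks (/dev/sdb and /dev/sdc) on a storage node
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc   # mirror the two disks
mkfs.xfs /dev/md0                                                      # format the new array
mdadm --detail --scan >> /etc/mdadm.conf                               # persist the array configuration
mount /dev/md0 /data                                                   # mount the array for use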
Note
Cloud providers are most likely to implement all of the above resilience procedures and more to ensure that their service is available 99.99% of the time.
Hostname and Domain Names¶
Using proper domain naming conventions during design of the HPC platform is best practice for ensuring a clear, logical and manageable network. Take the below fully qualified domain name:
node01.pri.cluster1.compute.estate
Which can be broken down as follows:
- node01 - The hostname of the system
- pri - The network that the interface of the system is sat on (in this case, pri = primary)
- cluster1 - The cluster that node01 is a part of
- compute - The subdomain of the greater network that cluster1 is a part of
- estate - The top-level domain
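As a brief example of applying this convention on a systemd-based system (the FQDN shown is the illustrative one above, not a required name):
hostnamectl set-hostname node01.pri.cluster1.compute.estate   # set the system's fully qualified hostname
hostname -f                                                   # verify the FQDN is reported as expected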
Security¶
Network security is key for both the internal and external connections of the HPC environment. Without proper security controls the system configuration and data are at risk of attack or of destruction through user error. Some tips for improving network security are below:
- Restrict external access points where possible. This will reduce the quantity of points of entry, minimising the attack surface from external sources.
- Limit areas that users have access to. In general, there are certain systems that users would never (and should never) have access to so preventing them from reaching these places will circumvent any potential user error risks.
- Implement firewalls to limit the types of traffic allowed in/out of systems.
It is also worth considering the performance and usability impacts of security measures.
Much like with resilience, a Cloud provider will most likely implement the above security features - it is worth knowing what security features and limitations are in place when selecting a cloud environment.
Note
Non-Ethernet networks usually cannot be secured to the same level as Ethernet, so be aware of the security drawbacks of the chosen network technology.
Additional Considerations and Questions¶
The below questions should be considered when designing the network and hardware solution for the HPC platform.
- How much power will the systems draw?
- Think about the power draw of the selected hardware; it may draw a large amount of current, so sufficient power sources must be available.
- How many users are going to be accessing the system?
- For a small number of users, a complex, distributed service network would most likely be overkill and a centralised login/master node would be more appropriate.
- What network interconnect will be used?
- It’s most likely that different network technologies will be used for Different Networks. For example, the high performance network could benefit from using Infiniband as the interconnect.
- How could the hardware be optimised?
- BIOS settings could be tweaked on the motherboard to give additional performance and stability improvements.
- Network switch configurations could be optimised for different types of traffic
- What types of nodes will be in the system?
- What applications are going to be run on the system?
- Are they memory intensive?
- Is interconnect heavily relied upon for computations?
[1] For more information on RAID arrays see https://en.wikipedia.org/wiki/RAID
Recommendations for Network and Hardware Design¶
At Alces Software, the recommended network design differs slightly depending on the number of users and quantity of systems within the HPC platform.
Hardware Recommendations¶
Recommendations for blades, network switches and storage technologies can be found here - https://github.com/alces-software/knowledgebase/wiki
Cluster Architectures¶
With the Network and Hardware Design Considerations in mind, diagrams of different architectures are below. They increase in complexity and redundancy as the list goes on.
Example 1 - stand-alone¶

The above architecture consists of master, login and compute nodes. The services provided by the master & login nodes can be seen to the right of each node type. This architecture only separates the services for users and admins.
Example 2 - high-availability¶

This architecture provides additional redundancy to the services running on the master node. For example, the disk array is connected to both master nodes which use multipath to ensure the higher availability of the storage device.
Example 3 - HA VMs¶

This architecture puts services inside of VMs to improve the ability to migrate and modify services with little impact to the other services and systems on the architecture. Virtual machines can be moved between VM hosts live without service disruption allowing for hardware replacements to take place on servers.
Network Designs¶
The above architectures can be implemented with any of the below network designs.
Example 1- simple¶

The above design contains the minimum recommended internal networks. A primary network (for general logins and navigating the system), a management network (for BMC management of nodes and switches) and a high performance Infiniband network (connected to the nodes). The master and login nodes have access to the external network for user and admin access to the HPC network.
Note
The master node could additionally be connected to the high performance network so that compute nodes have a faster network connection to storage.
Example 2 - VLANs¶

The above network design has a few additions to the first example. The main change is the inclusion of VLANs for the primary, management and build networks (with the build network being a new addition to this design). The build network allows for systems to be toggled over to a DHCP system that uses PXE booting to kickstart an OS installation.
Other Recommendations¶
BIOS Settings¶
It’s recommended to ensure that the BIOS settings are reset to default and the latest BIOS version is installed before optimising the settings. This ensures that any issues present in the existing configuration are removed before proceeding.
When it comes to optimising the BIOS settings on a system in the network, the following changes are recommended:
- Setting the power management to maximum performance
- Disabling CPU CStates
- Disabling Hyperthreading
- Enabling turbo mode
- Disabling quiet boot
- Setting BMC to use the dedicated port for BMC traffic
- Setting the node to stay off when power is restored after AC power loss
Note
The wording of the settings above may differ depending on the hardware that is being used. Look for similar settings that can be configured to achieve the same result.
For hardware-specific BIOS configuration settings see https://github.com/alces-software/knowledgebase/wiki#servers
Considerations for Infrastructure Design¶
Infrastructure design largely relates to the considerations made for the Cluster Architectures. Depending on the design being used, some of the infrastructure decisions may have already been made.
Infrastructure Service Availability¶
There are typically 3 possible service availability options to choose from, these are:
- All-in-one
- VM Platform
- High Availability VM Platform
Note
If using a Cloud Platform then the service availability will be handled by the cloud provider. The only additional consideration is how services will be provided: as natively running services or containerised.
These are covered in more detail below.
All-in-one¶
This is the most common solution: an all-in-one approach loads services onto a single machine which serves the network. It is the simplest solution as a single OS install is required and no additional configuration of virtual machine services is needed.
This solution, while quick and relatively easy to implement, is not a recommended approach. Due to the lack of redundancy options and the lack of service isolation there is a higher risk that an issue affecting one service (or the machine itself) will also affect other services.
VM Platform¶
A VM platform provides an additional layer of isolation between services. This can allow for services to be configured, migrated and modified without potentially affecting other services.
There are a number of solutions for hosting virtual machines, including:
- Open-source solutions (e.g. VirtualBox, KVM, Xen)
- Commercial solutions (e.g. VMware)
The above software solutions provide similar functionality and can all be used as a valid virtualisation platform. Further investigation into the ease of use, flexibility and features of the software is recommended to identify the ideal solution for your HPC platform.
High Availability VM Platform¶
For further redundancy, the virtualisation platform can utilise a resource pool. The service will be spread across multiple machines which allows for VMs to migrate between the hosts whilst still active. This live migration can allow for one of the hosts to be taken off of the network for maintenance without impacting the availability of the service VMs.
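As a sketch of the live-migration capability described above (the guest and host names are assumptions, and the exact command depends on the virtualisation platform chosen), a KVM/libvirt guest in a resource pool could be moved between hosts with:
# Migrate the running VM "services1" to vmhost2 over SSH without shutting the guest down
# (shared storage for the VM disk image is assumed)
virsh migrate --live --persistent services1 qemu+ssh://vmhost2/system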
Node Network Configuration¶
In addition to the availability of services, the network configuration on the node can provide better performance and redundancy. Some of the network configuration options that can improve the infrastructure are:
- Channel Bonding - Bonding interfaces allows for traffic to be shared between 2 network interfaces. If the bonded interfaces are connected to separate network switches then this solution also provides redundancy should one of the switches fail (a configuration sketch follows this list).
- Interface Bridging - Network bridges are used by interfaces on virtual machines to connect to the rest of the network. A bridge can sit on top of a channel bond such that the VM service network connection is constantly available.
- VLAN Interface Tagging - VLAN management can be performed both on a managed switch and on the node. The node is able to create subdivisions of network interfaces to add VLAN tags to packets. This will create separate interfaces that can be seen by the operating system (e.g. eth0.1 and eth0.2) which can individually have IP addresses set.
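The sketch below shows how these options might look as CentOS 7 ifcfg files; the interface names, addresses and VLAN ID are assumptions, and the bridge reuses the pri naming convention used elsewhere in this document. A bond is built from two physical ports, a bridge for VM traffic sits on top of the bond, and a tagged VLAN interface is layered on the same bond:
# /etc/sysconfig/network-scripts/ifcfg-em1  (first physical slave; repeat for em2)
DEVICE=em1
ONBOOT=yes
BOOTPROTO=none
MASTER=bond0
SLAVE=yes

# /etc/sysconfig/network-scripts/ifcfg-bond0  (bonded interface attached to the bridge)
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=active-backup miimon=100"
BRIDGE=pri

# /etc/sysconfig/network-scripts/ifcfg-pri  (bridge used by VM interfaces)
DEVICE=pri
TYPE=Bridge
ONBOOT=yes
BOOTPROTO=static
IPADDR=10.10.0.11
NETMASK=255.255.0.0

# /etc/sysconfig/network-scripts/ifcfg-bond0.2  (tagged interface for VLAN 2 on the bond)
DEVICE=bond0.2
ONBOOT=yes
BOOTPROTO=none
VLAN=yes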
Additional Considerations and Questions¶
Could containerised solutions be used instead of VMs?
- Docker or Singularity containerised services can provide similar levels of isolation between services as VMs without the additional performance overhead.
Recommendations for Infrastructure Design¶
The example configurations here combine elements of the Recommendations for Network and Hardware Design as well as the different infrastructure solutions from Considerations for Infrastructure Design. These focus on the internal configuration of the master node but these examples can be extrapolated for configuring login, storage, compute or any other nodes that are part of the HPC environment.
Simple Infrastructure¶

The simplest infrastructure configuration uses the all-in-one approach where services are configured on the master node’s operating system.
Virtual Machine Infrastructure¶

This solution separates the services into VMs running on the master node. In order for these VMs to be able to connect to the primary network a network bridge is created that allows the VM interfaces to send traffic over the eth0 interface.
Channel Bonded Infrastructure¶

This example adds a layer of redundancy over the VM Infrastructure design by bonding the eth0 and eth3 interfaces. These interfaces are connected to separate network switches (the switches will be bridged together as well) which provides redundancy should a switch or network interface fail. Bonding of the two interfaces creates a new bond interface that the bridge for the virtual machines connects to.
VLAN Infrastructure¶

The above solution implements the channel bonded infrastructure in a network with VLANs. The VLANs have bond and bridge interfaces created for them. This allows some additional flexibility for VM bridging as virtual interfaces can be bridged onto specific VLANs whilst maintaining the redundancy provided by the bond. This adds additional security to the network as the master node can be left without an IP on certain VLAN bond interfaces which prevents that network from accessing the master node whilst VMs on the master node are able to reach that VLAN.
Base System Overview¶
About¶
The base system is comprised of the integral services required for a deployment environment.
It is recommended that periodic updates are run in the future with the source tree for the minor OS version. The systems would require careful testing after any updates have been applied to ensure system functionality has persisted.
Note
If a local repository has been setup then the local repo mirrors will need to be resynced before deploying updates.
The controller node also provides IP Masquerading on its external interface. All slave nodes are configured to default route out via the controller node’s external interface.
A tftp service, dhcpd service and webserver are installed on the controller node; these enable slave systems to be booted and pick up a series of automated deployment scripts that will result in a blank system being deployed and joining the environment.
Components¶
This package will set up and configure:
Up-to-date OS installation (CentOS 7.3 with full upstream ‘update’ tree applied)
Firewall rules for network interfaces
Metalware cluster management software, providing:
Custom yum repository for providing additional packages to nodes
- The directory /opt/alces/repo/custom/Packages can be used to store RPMs that are then served to client nodes, allowing for custom, additional or non-supported packages to be installed.
DHCP and TFTP server configuration for network booting
- DHCP provides host identity management, such as serving IPs and hostnames to client systems based on the hardware MAC address of the client. This information is used during installation to configure the node uniquely.
- TFTP provides the boot configuration of the system in order to provide the build or boot environment of client systems.
NTP for keeping the cluster clocks in sync
Name resolution services, either:
DNSmasq using /etc/hosts
- Alongside the server providing lookup responses, the client systems will also have a fully populated /etc/hosts file for local queries.
or Named from the bind packages
- Named creates forward and reverse search zones on the controller node that can be queried by all clients. Unlike DNSmasq, the client systems have an empty /etc/hosts as named serves all of the additional host information.
Management tools built around ipmitool, pdsh and libgenders
- These management tools allow for running commands on multiple systems defined in groups, improving the ease and flexibility of environment management.
Combined logging for clients with rsyslog server
- All clients built from the master write out build progress and syslog messages to /var/log/slave/<NODENAME>/messages.log
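To illustrate the DHCP host identity management described above, a host entry in one of the /etc/dhcp/dhcpd.* files looks something like the following; the hostname, MAC address and IP are purely illustrative, and Metalware generates these entries itself:
host node01 {
    hardware ethernet 52:54:00:aa:bb:cc;   # client identified by its hardware MAC address
    fixed-address 10.10.1.1;               # IP address served to this client
    option host-name "node01";             # hostname passed to the client
}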
Key Files¶
/etc/hosts
/etc/dhcp/dhcpd.*
/etc/dnsmasq.conf
or /etc/named/metalware.conf
/opt/metalware/*
/var/lib/metalware/*
/var/lib/tftpboot/*
/etc/ntp.conf
/etc/rsyslog.d/metalware.conf
Licensing¶
The CentOS Linux distribution is released under a number of Open-source software licenses, including GPL. A copy of the relevant license terms is included with the CentOS software installed on your cluster. The CentOS Linux distribution does not require any per-server license registration.
Additionally, the applications and services installed have similar open-source licensing which can be viewed either online or through the manual pages for the specific package.
Considerations for Software and Application Deployment¶
Before considering how the OS and applications will be deployed it is worth making a few decisions regarding the OS that will be used:
- Will the same OS be used on all systems? (it’s strongly recommended to do so)
- What software will be used? (and therefore will need to be supported by the OS)
- How stable is the OS? (bleeding edge OSes may have bugs and instabilities that could negatively impact the HPC environment)
- If you are using bare-metal hardware, is the OS supported by your hardware vendor? (running an unsupported OS can lead to issues when attempting to obtain hardware support)
Deployment¶
The deployment of the HPC platform can be summarised in two main sections, these being:
- Operating System Deployment
- Software Package Repository Management
Operating System Deployment¶
When it comes to performing many operating system installations across nodes in the network it can be tricky to find a flexible, manageable, automated solution. Performing manual installations of operating systems may be the ideal solution if there are only a few compute nodes; however, there are many other ways of improving the speed of OS deployment:
- Disk Cloning - A somewhat inelegant solution, disk cloning involves building the operating system once and creating a compressed copy on a hard-drive that can be restored to blank hard-drives.
- Kickstart - A kickstart file is a template for automating OS installations, the configuration file can be served over the network such that clients can PXE boot the installation. This can allow for easy, distributed deployment over a local network.
- Image Deployment - Cloud service providers usually deploy systems from template images that set hostnames and other unique system information at boot time. Customised templates can also be created for streamlining the deployment and customisation procedure.
It is worth considering manual, cloning and kickstart solutions for your OS deployment; any one of them could be the ideal solution depending on the number of machines that are being deployed. A brief sketch of the disk-cloning approach is shown below.
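As a minimal sketch of the disk-cloning approach (the device names, hostnames and image path are assumptions, and a dedicated imaging tool such as Clonezilla or partclone would usually be preferred for larger fleets), an image of a built node's system disk can be captured and later restored over the network:
# On the built "golden" node: stream a compressed image of the system disk to an image server
dd if=/dev/sda bs=4M | gzip | ssh admin@imageserver 'cat > /srv/images/golden-node.img.gz'

# On a blank node (booted from rescue media): restore the image onto the local disk
ssh admin@imageserver 'cat /srv/images/golden-node.img.gz' | gunzip | dd of=/dev/sda bs=4M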
Repository Management¶
It is worth considering how packages, libraries and applications will be installed onto individual nodes and the network as a whole. Operating systems usually have their own package management system installed that uses public repositories for pulling down packages for installation. It is likely that not all of the packages required for a system are in the public repositories, so it’s worth considering where additional packages will come from (e.g. a 3rd party repository, downloaded directly from the package maintainer or manually compiled).
Further to managing packages on the local system, the entire network may require applications to be installed; there are a couple of options for achieving this:
- Server Management Tools - Management tools such as puppet, chef or pdsh can execute commands across multiple systems in parallel. This saves time compared to individually logging in and running commands on each node in the system.
- Network Package Managers - Software such as Alces Gridware can install an application in a centralised storage location, allowing users simply to modify their PATH and LD_LIBRARY_PATH on a node in order to start using the application.
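As a sketch of how a user consumes such a centrally installed application (the install prefix and application name are assumptions; tools such as Alces Gridware typically automate these steps through environment modules):
# Add the centrally installed application to the user's environment on a node
export PATH=/opt/apps/myapp/1.0/bin:$PATH
export LD_LIBRARY_PATH=/opt/apps/myapp/1.0/lib:$LD_LIBRARY_PATH
myapp --version   # the application can now be run without a local installation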
For more information regarding network package managers and application deployment, see Application Deployment
Additional Considerations and Questions¶
- How will applications outside of the repositories be installed?
- Will the application need to be useable by all nodes in the HPC network? (e.g. an NFS export for apps may solve the issue of multiple installations)
- How will new package versions be installed when the HPC environment is being maintained in the future?
- How will you create and maintain a consistent software environment on all nodes over time?
Master Node Setup¶
Manual Setup¶
Run a minimal CentOS installation on the system; this can be performed manually or via an automated install service if you have one already setup
It is recommended to update the packages on the system for any bug fixes and patches that may have been introduced to core packages:
yum -y update
Note
If using kickstart OS installation on many nodes it is worth considering a local mirror repository for the OS image and packages so that all nodes receive an equivalent software installation, and aren’t trying to connect to the internet at the same time.
Disable and stop NetworkManager:
systemctl disable NetworkManager
systemctl stop NetworkManager
Set the hostname:
echo 'master.cluster.local' > /etc/hostname
Configure bridge interface for primary network (/etc/sysconfig/network-scripts/ifcfg-pri):
DEVICE=pri
ONBOOT=yes
TYPE=Bridge
stp=no
BOOTPROTO=static
IPADDR=10.10.0.11
NETMASK=255.255.0.0
ZONE=trusted
PEERDNS=no
Note
Replace DEVICE, IPADDR and NETMASK with the appropriate values for your system
- Create bridge interfaces for all other networks (e.g. management [mgt] and external [ext])
Note
The external interface may require getting its network settings over DHCP; if so then set BOOTPROTO to dhcp instead of static and remove the IPADDR lines.
Setup config file for network interfaces (do this for all interfaces sitting on the bridges configured above):
TYPE=Ethernet
BOOTPROTO=none
NAME=p1p1
DEVICE=p1p1
ONBOOT="yes"
BRIDGE=pri
Note
In the above example, the interface p1p1 is connected to the primary network but instead of giving it an IP address it is set to use the pri bridge.
Enable and start firewalld to allow masquerading client machines via the external interface and to improve general network security:
systemctl enable firewalld
systemctl start firewalld
Add the ext bridge to the external zone; the external zone is a zone configured as part of firewalld:
firewall-cmd --add-interface ext --zone external --permanent
Add all the other network interfaces to the trusted zone; replace pri with the other network names, excluding ext:
firewall-cmd --add-interface pri --zone trusted --permanent
Reboot the system
Install components for VM service:
yum groupinstall -y virtualization-platform virtualization-tools
yum install -y virt-viewer virt-install
Enable and start the virtualisation service:
systemctl start libvirtd
systemctl enable libvirtd
Create VM pool:
mkdir /opt/vm
virsh pool-define-as local dir - - - - "/opt/vm/"
virsh pool-build local
virsh pool-start local
virsh pool-autostart local
Automated Setup¶
If using metalware, a controller can be used to deploy its own master. By setting up the controller on a separate machine to the master, the master can then be defined and hunted (see Deployment Example for hunting instructions). The following will add the build config and scripts to configure a functional master (much like the above).
Note
In the following guide the group is called masters and the master node is master1
Configure certificate authority for libvirt from the controller as described in VM Deployment from the Controller
Create a deployment file for the master node (/var/lib/metalware/repo/config/master1.yaml) containing the following (the network setup configures network bridges and the external interface):
files:
  setup:
    - /opt/alces/install/scripts/10-vm_master.sh
  core:
    - /opt/alces/ca_setup/master1-key.pem
    - /opt/alces/ca_setup/master1-cert.pem
networks:
  pri:
    interface: pri
    type: Bridge
    slave_interfaces: em1
  mgt:
    interface: mgt
    type: Bridge
    slave_interfaces: em2
  ext:
    defined: true
    domain: ext
    ip: 10.101.100.99
    network: 10.101.0.0
    netmask: 255.255.0.0
    gateway: 10.101.0.1
    short_hostname: <%= node.name %>.<%= config.networks.ext.domain %>
    interface: ext
    firewallpolicy: external
    slave_interfaces: p1p4
Note
If additional scripts are defined in the domain level setup and core lists then be sure to include them in the master1 file
Additionally, download the VM master script to /opt/alces/install/scripts/10-vm_master.sh:
mkdir -p /opt/alces/install/scripts/
cd /opt/alces/install/scripts/
wget -O 10-vm_master.sh https://raw.githubusercontent.com/alces-software/knowledgebase/master/epel/7/libvirt/vm_master.sh
Controller VM Setup¶
On Master Node¶
Create /opt/vm/controller.xml for provisioning a VM called controller (Available here)
- This template creates 3 interfaces on the VM (on the primary, management and external networks)
Create base qcow2 image controller.qcow2:
qemu-img create -f qcow2 controller.qcow2 80G
Create the VM:
virsh define controller.xml
Start the VM:
virsh start controller
Connect a VNC-like window to the VM to watch it booting and interact with the terminal:
virt-viewer controller
Note
Much like the host system, a minimal installation of CentOS 7 is recommended (as is ensuring that the system is up-to-date with yum -y update)
On Controller VM¶
OS Setup¶
Set the hostname of the system (the fully-qualified domain name for this system additionally includes the cluster name):
echo 'controller.testcluster.cluster.local' > /etc/hostname
Setup the network interfaces (if setting a static IP then ensure to set IPADDR, NETMASK and NETWORK for the interface):
- Eth0 is bridged onto the primary network - set a static IP for that network in /etc/sysconfig/network-scripts/ifcfg-eth0
- Eth1 is bridged onto the management network - set a static IP for that network in /etc/sysconfig/network-scripts/ifcfg-eth1
- Eth2 is bridged onto the external network - this will most likely use DHCP to obtain an IP address in /etc/sysconfig/network-scripts/ifcfg-eth2
Note
Add ZONE=trusted to eth0 & eth1, and ZONE=external to eth2 to ensure the correct firewall zones are used by the interfaces.
Enable and start firewalld:
systemctl enable firewalld
systemctl start firewalld
Add the interfaces to the relevant firewall zones:
firewall-cmd --add-interface eth0 --zone trusted --permanent
firewall-cmd --add-interface eth1 --zone trusted --permanent
firewall-cmd --add-interface eth2 --zone external --permanent
Disable network manager:
systemctl disable NetworkManager
Reboot the VM
Once the VM is back up it should be able to ping both the primary and management interfaces on the master node. If the ping returns properly then metalware can be configured to enable deployment capabilities on the VM.
Metalware Install¶
Install metalware (to install a different branch, append alces_SOURCE_BRANCH=develop before /bin/bash in order to install the develop branch):
curl -sL http://git.io/metalware-installer | sudo alces_OS=el7 /bin/bash
Set metalware to use default repository:
metal repo use https://github.com/alces-software/metalware-default.git
Domain Configuration¶
Configure the domain settings (this will prompt for various details regarding the domain setup, such as root password, SSH RSA key [which can be created with ssh-keygen] and default network parameters):
metal configure domain
Uncomment the PASSWORD= line in /opt/metalware/etc/ipmi.conf and replace password with the IPMI password from the domain configuration
Sync the configuration changes:
metal sync
Controller Build¶
Configure the controller node:
metal configure local
Sync the configuration changes:
metal sync
Template the scripts and sync them into place:
metal template local
metal sync
Build the controller:
metal build local
Note
If you wish to install an OS other than CentOS 7 then see the Configure Alternative Kickstart Profile instructions.
Platform Scripts¶
Deploying on different hardware and platforms may require additional stages to be run on systems when being deployed. This is handled by an additional scripts key, platform:, in /var/lib/metalware/repo/config/domain.yaml.
There is currently a script for configuring the AWS EL7 platform available on github which can be downloaded to the scripts area:
mkdir -p /opt/alces/install/scripts/
cd /opt/alces/install/scripts/
wget https://raw.githubusercontent.com/alces-software/knowledgebase/master/epel/7/platform/aws.sh
Deployment Example¶
Client Deployment Example¶
Configure a node group (this example creates a nodes group for compute nodes):
metal configure group nodes
Start the controller VM listening for PXE requests:
metal hunter
Boot up the client node
The controller VM will print a line when the node has connected; when this happens, enter the hostname for the system (this should be a hostname that exists in the group configured earlier)
Once the hostname has been added the previous metal command can be cancelled (with ctrl-c)
Generate DHCP entry for the node:
metal dhcp
Start the controller VM serving installation files to the node (replace slave01 with the hostname of the client node):
metal build slave01
Note
If building multiple systems the genders group can be specified instead of the node hostname. For example, all compute nodes can be built with metal build -g nodes.
- The client node can be rebooted and it will begin an automatic installation of CentOS 7
- The metal build will automatically exit when the client installation has completed
- Passwordless SSH should now work to the client node
Configuring Alternative Kickstart Profile¶
In this example, a CentOS 6 kickstart profile is configured. This method should be transferable to other operating systems with little modification to the general practice.
Download the boot files to the PXE_BOOT directory:
PXE_BOOT=/var/lib/tftpboot/boot/
curl http://mirror.ox.ac.uk/sites/mirror.centos.org/6/os/x86_64/images/pxeboot/initrd.img > "$PXE_BOOT/centos6-initrd.img"
curl http://mirror.ox.ac.uk/sites/mirror.centos.org/6/os/x86_64/images/pxeboot/vmlinuz > "$PXE_BOOT/centos6-kernel"
Create /var/lib/metalware/repo/pxelinux/centos6 template PXE boot file for the OS:
DEFAULT menu
PROMPT 0
MENU TITLE PXE Menu
TIMEOUT 5
TOTALTIMEOUT 5
<%= alces.firstboot ? "ONTIMEOUT INSTALL" : "ONTIMEOUT local"%>

LABEL INSTALL
    KERNEL boot/centos6-kernel
    APPEND initrd=boot/centos6-initrd.img ksdevice=<%= config.networks.pri.interface %> ks=<%= node.kickstart_url %> network ks.sendmac _ALCES_BASE_HOSTNAME=<%= node.name %> <%= config.kernelappendoptions %>
    IPAPPEND 2

LABEL local
    MENU LABEL (local)
    MENU DEFAULT
    LOCALBOOT 0
Create /var/lib/metalware/repo/kickstart/centos6 template file for kickstart installations of the OS:
#!/bin/bash
##(c)2017 Alces Software Ltd. HPC Consulting Build Suite
## vim: set filetype=kickstart :

network --onboot yes --device <%= config.networks.pri.interface %> --bootproto dhcp --noipv6

#MISC
text
reboot
skipx
install

#SECURITY
firewall --enabled
firstboot --disable
selinux --disabled

#AUTH
auth --useshadow --enablemd5
rootpw --iscrypted <%= config.encrypted_root_password %>

#LOCALIZATION
keyboard uk
lang en_GB
timezone Europe/London

#REPOS
url --url=<%= config.yumrepo.build_url %>

#DISK
%include /tmp/disk.part

#PRESCRIPT
%pre
set -x -v
exec 1>/tmp/ks-pre.log 2>&1

DISKFILE=/tmp/disk.part
bootloaderappend="console=tty0 console=ttyS1,115200n8"
cat > $DISKFILE << EOF
<%= config.disksetup %>
EOF

#PACKAGES
%packages --ignoremissing
vim
emacs
xauth
xhost
xdpyinfo
xterm
xclock
tigervnc-server
ntpdate
vconfig
bridge-utils
patch
tcl-devel
gettext

#POSTSCRIPTS
%post --nochroot
set -x -v
exec 1>/mnt/sysimage/root/ks-post-nochroot.log 2>&1

ntpdate 0.centos.pool.ntp.org

%post
set -x -v
exec 1>/root/ks-post.log 2>&1

# Example of using rendered Metalware file; this file itself also uses other
# rendered files.
curl <%= node.files.main.first.url %> | /bin/bash | tee /tmp/metalware-default-output

curl '<%= node.build_complete_url %>&event=complete&msg=Build%20is%20complete'
When building nodes, use the new template files by specifying them as arguments to the metal build command:
metal build -k centos6 -p centos6 slave01
Configuring UEFI Boot¶
UEFI network booting is an alternative to PXE booting and is usually the standard on newer hardware; support for building nodes with UEFI booting can be configured as follows.
Create additional TFTP directory and download EFI boot loader:
mkdir -p /var/lib/tftpboot/efi/
cd /var/lib/tftpboot/efi/
wget https://github.com/alces-software/knowledgebase/raw/master/epel/7/grub-efi/grubx64.efi
chmod +x grubx64.efi
For UEFI clients, add the following line to the client config file:
build_method: uefi
Additionally, a /boot/efi partition will be required for UEFI clients; an example of this partition as part of the disk setup (in the client config) is below:
disksetup: |
zerombr
bootloader --location=mbr --driveorder=sda --append="$bootloaderappend"
clearpart --all --initlabel
#Disk partitioning information
part /boot --fstype ext4 --size=4096 --asprimary --ondisk sda
part /boot/efi --fstype=efi --size=200 --asprimary --ondisk sda
part pv.01 --size=1 --grow --asprimary --ondisk sda
volgroup system pv.01
logvol / --fstype ext4 --vgname=system --size=16384 --name=root
logvol /var --fstype ext4 --vgname=system --size=16384 --name=var
logvol /tmp --fstype ext4 --vgname=system --size=1 --grow --name=tmp
logvol swap --fstype swap --vgname=system --size=8096 --name=swap1
Repository Overview¶
About¶
Upstream repositories for CentOS and EPEL will be mirrored locally to a virtual machine which can provide the packages to the rest of the nodes in the cluster. The local repository will be used for deployment installations and package updates.
Components¶
Upstream distribution primary repos and EPEL will be imported to /opt/alces/repo/ with reposync; any upstream repo groups will also be imported to allow node redeployment without internet access (and a known working disaster recovery configuration).
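A minimal sketch of mirroring an upstream repository in this way is shown below; the repo id and paths are assumptions, and the repos.sh script referenced later performs the equivalent steps automatically:
# Mirror the enabled "base" repository, including group (comps) data, into the local repo area
reposync -g -l -m --repoid=base --newest-only --download_path=/opt/alces/repo/
# Generate repository metadata so clients can consume the mirror with yum
createrepo -g comps.xml /opt/alces/repo/base/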
Key Files¶
/etc/yum.repos.d/*
/opt/alces/repo/*
Repository Mirror Server¶
On Master Node¶
Create /opt/vm/repo.xml for deploying the repo VM (Available here)
Create disk image for the repo VM:
qemu-img create -f qcow2 repo.qcow2 150G
Define the VM:
virsh define repo.xml
On Controller VM¶
Create a group for the repo VM (add at least repo1 as a node in the group; setting additional groups of services,cluster,domain allows for more diverse group management):
metal configure group repo
Customise the repo1 node configuration (set the primary IP address to 10.10.0.2):
metal configure node repo1
Create a deployment file specifically for repo1 at /var/lib/metalware/repo/config/repo1.yaml with the following content:
yumrepo:
  is_server: true
Add the following to /var/lib/metalware/repo/config/domain.yaml (build_url is the URL for client kickstart builds to use, source_repos should be a comma-separated list of source files that repoman will use to generate client configurations, clientrepofile will need to be a URL to a repo config file for the client to curl):
yumrepo:
  # Repository URL for kickstart builds
  build_url: http://mirror.ox.ac.uk/sites/mirror.centos.org/7/os/x86_64/
  # If true, this server will host a client config file for the network
  is_server: false
  # Repoman source files for repository mirror server to use (comma separated)
  source_repos: base.upstream
  # The file for clients to curl containing repository information [OPTIONAL]
  # clientrepofile: http://myrepo.com/repo/client.repo
  clientrepofile: false
Note
See the repoman project page for the included repository template files. To add customised repositories, create them in /var/lib/repoman/templates/centos/7/ on the repository server.
Additionally, add the following to the setup: namespace list in /var/lib/metalware/repo/config/domain.yaml:
- /opt/alces/install/scripts/00-repos.sh
Download the repos.sh script to the above location:
mkdir -p /opt/alces/install/scripts/
cd /opt/alces/install/scripts/
wget -O 00-repos.sh https://raw.githubusercontent.com/alces-software/knowledgebase/master/epel/7/repo/repos.sh
Note
The script is renamed to 00-repos.sh
to guarantee that it is run before any other setup scripts.
- Follow Client Deployment Example to setup the repo node
- The repo VM will now be up, can be logged into with passwordless SSH from the controller VM, and will have a clone of the CentOS upstream repositories locally. Modify build_url in /var/lib/metalware/repo/config/domain.yaml to be the URL of the newly built repo server so that new client builds will use that repository.
Custom Repository Setup¶
As well as using different sources for the upstream repositories, it is beneficial to have a local repository that can be used to serve additional packages which are not part of upstream repos to clients. This will be known as the custom repository; details on setting it up are below. The purpose of this repository is to provide packages to the network that aren't available in upstream repositories or that require a higher installation priority than other available packages (e.g. a newer kernel package).
Manual¶
Install package dependencies:
yum -y install createrepo httpd yum-plugin-priorities yum-utils
Create custom repository directory:
mkdir -p /opt/alces/repo/custom/
Define the repository:
cd /opt/alces/repo/
createrepo custom
Create a repo source file to be served to clients at /var/lib/repoman/templates/centos/7/custom.local:
[custom]
name=custom
baseurl=http://myrepo.com/repo/custom/
description=Custom repository local to the cluster
enabled=1
skip_if_unavailable=1
gpgcheck=0
priority=1
Add the custom repository to the source repos in /var/lib/metalware/repo/config/domain.yaml:
yumrepo:
  # Repository URL for kickstart builds
  build_url: http://mirror.ox.ac.uk/sites/mirror.centos.org/7/os/x86_64/
  # If true, this server will host a client config file for the network
  is_server: false
  # Repoman source files for repository mirror server to use (comma separated)
  source_repos: base.upstream,custom.local
  # The file for clients to curl containing repository information [OPTIONAL]
  # clientrepofile: http://myrepo.com/repo/client.repo
  clientrepofile: false
User Management Overview¶
About¶
This package contains the services required to configure a central user management server for the HPC environment. This relieves the need to manage /etc/passwd locally on every system within the HPC environment and provides further authentication management of different services.
Components¶
For user management, one of the following software solutions will be implemented:
NIS (Network Information Service)
- The Network Information Service (NIS) is a directory service that enables the sharing of user and host information across a network.
or
IPA (Identity, Policy, Audit)
- FreeIPA provides all the information that NIS does as well as providing application and service information to the network. Additionally, FreeIPA uses directory structure such that information can be logically stored in a tree-like structure. It also comes with a web interface for managing the solution.
Considerations for User Management¶
User Authentication¶
User authentication is usually performed in a server/client setup inside the HPC environment due to the unnecessary overhead of manually maintaining /etc/passwd on a network of nodes. A few options for network user management are:
- NIS - The Network Information Service (NIS) is a directory service that enables the sharing of user and host information across a network.
- FreeIPA - FreeIPA provides all the information that NIS does as well as providing application and service information to the network. Additionally, FreeIPA uses directory structure such that information can be logically stored in a tree-like structure. It also comes with a web interface for managing the solution.
- Connecting to an externally-managed user-authentication service (e.g. LDAP, Active Directory). This option is not recommended, as a large HPC cluster can put considerable load on external services. Using external user-authentication also creates a dependency for your cluster on another service, complicating troubleshooting and potentially impacting service availability.
Note
If the user accounts need to be consistent with accounts on the external network then the master node should have a slave service to the external network's account management system. This will allow the account information to be forwarded to the HPC network without creating a hard dependency on an external authentication service.
User Access¶
It is also worth considering how users will be accessing the system. A few ways that users can be accessing and interacting with the HPC environment are:
- SSH - This is the most common form of access for both users and admins. SSH will provide terminal-based access and X forwarding capabilities to the user.
- VNC - The VNC service creates a desktop session that can be remotely connected to by a user, allowing them to run graphical applications inside the HPC network.
- VPN - A VPN will provide remote network access to the HPC environment. This can be especially useful when access to the network is required from outside of the external network. Once connected to the VPN service, SSH or VNC can be used as it usually would be.
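As a brief sketch of the SSH and VNC access methods above (the username, hostname and display number are illustrative only), a user session might look like:
# Log in to the login node with X forwarding enabled, then launch a graphical application
ssh -X user01@login1.pri.cluster1.compute.estate
xclock   # the X application now displays on the user's local machine

# Alternatively, start a VNC desktop session on the login node and note the display number
vncserver :1
# ...then connect a VNC client to login1:5901 from the user's workstation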
Note
If running firewall services within the environment (recommended) then be sure to allow access from the ports used by the selected user access protocols.
Additional Considerations and Questions¶
- What information will need to be shared between the systems?
- How will users want to access the system?
Recommendations for User Management¶
Below are guidelines for setting up both a NIS and an IPA server, only one of these should be setup to prevent conflicts and inconsistencies in user management.
NIS Server Setup¶
On Master Node¶
Create /opt/vm/nis.xml for deploying the nis VM (Available here)
Create disk image for the nis VM:
qemu-img create -f qcow2 nis.qcow2 80G
Define the VM:
virsh define nis.xml
On Controller VM¶
Create a group for the nis VM (add at least nis1 as a node in the group; setting additional groups of services,cluster,domain allows for more diverse group management):
metal configure group nis
Customise the nis1 node configuration (set the primary IP address to 10.10.0.4):
metal configure node nis1
Create a deployment file specifically for nis1 at /var/lib/metalware/repo/config/nis1.yaml with the following content:
nisconfig:
  is_server: true
Add the following to /var/lib/metalware/repo/config/domain.yaml (the nisserver IP should match the one specified for nis1):
nisconfig:
  nisserver: 10.10.0.4
  nisdomain: nis.<%= config.domain %>
  is_server: false
  # specify non-standard user directory [optional]
  users_dir: /users
Additionally, add the following to the setup: namespace list in /var/lib/metalware/repo/config/domain.yaml:
- /opt/alces/install/scripts/02-nis.sh
Download the nis.sh script to the above location:
mkdir -p /opt/alces/install/scripts/
cd /opt/alces/install/scripts/
wget -O 02-nis.sh https://raw.githubusercontent.com/alces-software/knowledgebase/master/epel/7/nis/nis.sh
Follow Client Deployment Example to setup the compute nodes
IPA Server Setup¶
On Master Node¶
Create /opt/vm/ipa.xml for deploying the IPA VM (Available here)
Create disk image for the IPA VM:
qemu-img create -f qcow2 IPA.qcow2 80G
Define the VM:
virsh define IPA.xml
On Controller VM¶
Create a group for the IPA VM (add at least ipa1 as a node in the group; setting additional groups of services,cluster,domain allows for more diverse group management):
metal configure group ipa
Customise the ipa1 node configuration (set the primary IP address to 10.10.0.4):
metal configure node ipa1
Add the following to /var/lib/metalware/repo/config/domain.yaml (the ipaserver IP should match the one specified for ipa1):
ipaconfig:
  serverip: 10.10.0.4
  servername: ipa1
  insecurepassword: abcdef123
  userdir: /users
Additionally, add the following to the scripts: namespace list in /var/lib/metalware/repo/config/domain.yaml (this script runs the client-side configuration of IPA):
- /opt/alces/install/scripts/02-ipa.sh
Download the ipa.sh script to the above location:
mkdir -p /opt/alces/install/scripts/
cd /opt/alces/install/scripts/
wget -O 02-ipa.sh https://raw.githubusercontent.com/alces-software/knowledgebase/master/epel/7/ipa/ipa.sh
Follow Client Deployment Example to setup the IPA node and continue to the next section to configure the IPA server with a script
Setup IPA Server¶
Download the server configuration script to the controller:
cd /opt/alces/install/scripts/
wget http://raw.githubusercontent.com/alces-software/knowledgebase/master/epel/7/ipa/ipa_server.sh
Render the script for the IPA server:
metal render /opt/alces/install/scripts/ipa_server.sh ipa1 > /tmp/ipa_server.sh
Copy the script to the IPA server:
scp /tmp/ipa_server.sh ipa1:/root/
Note
Before launching the script it is currently necessary to prevent named on the controller from serving the primary forward and reverse domains so that the IPA installation will work. This can be re-enabled once the IPA script has finished running.
Launch the script on the IPA server (following any on-screen prompts):
ssh ipa1 "/root/ipa_server.sh"
IPA Replica Server Setup¶
For this example, the servers are as follows:
- infra01-dom0 - The IPA server
- infra01-domX - The server to replicate IPA
Configuring Replica Host (on infra01-domX
)¶
Install IPA tools:
yum -y install ipa-server bind bind-dyndb-ldap ipa-server-dns
Configure DNS to use infra01-dom0 as the nameserver in /etc/resolv.conf
Preparing Server (on infra01-dom0
)¶
Add infra01-domX as a host:
ipa host-add infra01-domX --password="MyOneTimePassword" --ip-address=10.10.2.51
Add infra01-domX to the ipaservers group:
ipa hostgroup-add-member ipaservers --hosts=infra01-domX
Connecting Replica Host (on infra01-domX
)¶
Run the replica installation:
ipa-replica-install --realm="PRI.CLUSTER.ALCES.NETWORK" --server="infra01-dom0.pri.cluster.alces.network" --domain="pri.cluster.alces.network" --password="MyOneTimePassword"
Storage Overview¶
About¶
This package configures an NFS master server to provide user and data filesystems to all slave nodes.
Components¶
The storage solution is comprised of the following:
NFS server
- The NFS server serves redundant network storage (depending on hardware and network configuration) to the client systems in the environment. This allows for distributed access to project and user data.
Filesystem formatting
- Ext4
- or XFS
Exported filesystems
- /export/users to be mounted at /users on clients
- /export/data to be mounted at /data on clients
Key Files¶
/etc/sysconfig/nfs
/etc/exports
/etc/fstab
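A minimal sketch of what these key files might contain is shown below; the export network, mount options and server hostname are assumptions chosen to match the example networks used in this document:
# /etc/exports on the storage server - export the filesystems to the primary network
/export/users  10.10.0.0/255.255.0.0(rw,no_root_squash,sync)
/export/data   10.10.0.0/255.255.0.0(rw,no_root_squash,sync)

# /etc/fstab entry on a client node - mount the exported users filesystem at /users
storage1:/export/users  /users  nfs  defaults,_netdev  0 0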
Considerations for Storage Solution¶
Storage Hardware¶
When selecting the storage solution it is worth considering the size, performance and resilience of the desired storage solution. Usually some sort of storage array will be used; e.g. a collection of disks (otherwise known as JBOD - Just a Bunch Of Disks) in the form of an internal or external RAID array.
Type of filesystem¶
For many requirements, a simple NFS solution can provide sufficient performance and resiliency for data. Application, library and source-code files are often small enough to be stored on an appropriately sized NFS solution; such storage systems can even be grown over time using technologies like LVM and XFS as requirements increase.
For data-sets that require high capacity (>100TB) or high-performance (>1GB/sec) access, a parallel filesystem may be suitable to store data. Particularly well suited to larger files (e.g. at least 4MB per storage server), a parallel filesystem can provide additional features such as byte-range locking (the ability for multiple nodes to update sections of a large file simultaneously), and MPI-IO (the ability to control data read/written using MPI calls). Parallel filesystems work by aggregating performance and capacity from multiple servers and allowing clients (your cluster compute and login nodes) to mount the filesystem as a single mount-point. Common examples include:
- Lustre; an open-source kernel-based parallel filesystem for Linux
- GPFS (General Parallel File System); a proprietary kernel-based parallel filesystem for Linux
- BeeGFS; an open-source user-space parallel filesystem for Linux
Your choice of filesystem will depend on the features you require, and your general familiarity with the technologies involved. As parallel filesystems do not perform well for smaller files, it is very common to deploy a parallel filesystem alongside a simple NFS-based storage solution.
If your data-set contains a large number of files which need to be kept and searchable for a long time (> 12-months) then an object storage system can also be considered. Accessed using client-agnostic protocols such as HTTPS, an object storage system is ideal for creating data archives which can include extended metadata to assist users to locate and organise their data. Most object storage systems include data redundancy options (e.g. multiple copies, object versioning and tiering), making them an excellent choice for long-term data storage. Examples of object-storage systems include:
- AWS Simple Storage Service (S3); a cloud-hosted service with a range of data persistence options
- Swift-stack; available as both on-premise and cloud-hosted services, the Swift protocol is compatible with a wide range of software and services
- Ceph; an on-premise object storage system with a range of interfaces (block, object, S3, POSIX file)
Network Storage Solutions¶
Single server with NFS¶

In this example, a single server is connected to a RAID 6 storage array which it is serving over NFS to the systems on the network. While simple in design and implementation, this design only provides redundancy at the RAID level.
Multiple Servers with NFS¶

In addition to the previous example, this setup features multiple storage servers which balance the load of serving the disk over NFS.
Multiple Servers with Parallel filesystem¶

This setup features multiple RAID sets which are installed externally to the storage servers and are connected to both of them using multipath - this allows for multiple paths to the storage devices to be utilised. Using this storage, a Lustre volume has been configured which consists of a combination of all the external disks. Authorisation of access to the storage volume is managed by the metadata node, which also has dedicated storage.
Additional Considerations and Questions¶
- What data will need to be centrally stored?
- Where will data be coming from?
- Are source files created within the HPC network or do they exist in the external network?
- Will compute nodes be writing out logs/results from running jobs?
- Where else might data be coming from?
- Is scratch space needed?
- What level of redundancy/stability is required for the data?
- How will the data be backed up?
- Will there be off-site backups?
- Should a separate storage medium be used?
- Does all the data need backing up or only certain files?
- For how long will we keep the data, and any backups created?
- What are my disaster recovery and business continuity plans?
- If the storage service fails, how will my users continue working?
- How can I recreate the storage service if it fails?
- How long will it take to restore any data from backups?
Recommendations for Storage Solution¶
Storage Hardware¶
For recommended storage hardware technologies, see https://github.com/alces-software/knowledgebase/wiki#storage
Changing Disk Formatting for a Node/Group¶
In the metalware domain configuration files, the disksetup namespace configures the kickstart commands for disk formatting. A couple of example configurations are shown below.
Default disk configuration:
disksetup: |
  zerombr
  bootloader --location=mbr --driveorder=sda --append="$bootloaderappend"
  clearpart --all --initlabel
  #Disk partitioning information
  part /boot --fstype ext4 --size=4096 --asprimary --ondisk sda
  part pv.01 --size=1 --grow --asprimary --ondisk sda
  volgroup system pv.01
  logvol / --fstype ext4 --vgname=system --size=16384 --name=root
  logvol /var --fstype ext4 --vgname=system --size=16384 --name=var
  logvol /tmp --fstype ext4 --vgname=system --size=1 --grow --name=tmp
  logvol swap --fstype swap --vgname=system --size=8096 --name=swap1
Software RAID configuration:
disksetup: |
  zerombr
  bootloader --location=mbr --driveorder=sda --append="$bootloaderappend"
  clearpart --all --initlabel
  #Disk partitioning information
  part /boot --fstype ext4 --size=1024 --asprimary --ondisk sda
  part /boot2 --fstype ext4 --size=1024 --asprimary --ondisk sdb
  part raid.01 --size 60000 --ondisk sda --asprimary
  part raid.02 --size 60000 --ondisk sdb --asprimary
  raid pv.01 --level=1 --fstype=ext4 --device=md0 raid.01 raid.02
  volgroup system pv.01
  logvol / --fstype ext4 --vgname=system --size=1 --name=root --grow
  logvol /var --fstype ext4 --vgname=system --size=16384 --name=var
  logvol swap --fstype swap --vgname=system --size=16384 --name=swap1
  part raid.03 --size=1 --ondisk sda --asprimary --grow
  part raid.04 --size=1 --ondisk sdb --asprimary --grow
  raid /tmp --fstype=ext4 --device=md1 --level=0 raid.03 raid.04
To override the default disk configuration, create a config file named after the node or group in /var/lib/metalware/repo/config/ and add the new disksetup: key to it.
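For example, a hypothetical gpunodes group could be given its own layout (the partition sizes here are arbitrary) by creating /var/lib/metalware/repo/config/gpunodes.yaml containing:
disksetup: |
  zerombr
  bootloader --location=mbr --driveorder=sda --append="$bootloaderappend"
  clearpart --all --initlabel
  #Disk partitioning information
  part /boot --fstype ext4 --size=4096 --asprimary --ondisk sda
  part pv.01 --size=1 --grow --asprimary --ondisk sda
  volgroup system pv.01
  logvol / --fstype ext4 --vgname=system --size=1 --grow --name=root
  logvol swap --fstype swap --vgname=system --size=16384 --name=swap1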
NFS Server Setup¶
On Master Node¶
Create /opt/vm/storage.xml for deploying the storage VM (Available here)
Create a disk image for the storage VM:
qemu-img create -f qcow2 storage.qcow2 80G
Define the VM:
virsh define storage.xml
On Controller VM¶
Create a group for the storage VM (add at least storage1 as a node in the group; setting additional groups of services,cluster,domain allows for more diverse group management):
metal configure group storage
Customise the storage1 node configuration (set the primary IP address to 10.10.0.3):
metal configure node storage1
Create /var/lib/metalware/repo/config/storage1.yaml with the NFS server and export definition:
nfsconfig:
  is_server: true
nfsexports:
  /export/users:
  /export/data:
    # Modify the export options [optional]
    #options: <%= config.networks.pri.network %>/<%= config.networks.pri.netmask %>(ro,no_root_squash,async)
Note
The options: namespace is optional; if not specified then the default export options will be used (<%= config.networks.pri.network %>/<%= config.networks.pri.netmask %>(rw,no_root_squash,sync))
Add the following to /var/lib/metalware/repo/config/domain.yaml (toggle defined to false to prevent a client from creating an fstab entry for that mount on a node):
nfsconfig:
  is_server: false
nfsmounts:
  /users:
    defined: true
    server: 10.10.0.3
    export: /export/users
  /data:
    defined: true
    server: 10.10.0.3
    export: /export/data
    options: intr,sync,rsize=32768,wsize=32768,_netdev
Note
Add any NFS exports to be created as keys underneath the nfsmounts: namespace. The options: key is only needed to override the default mount options (intr,rsize=32768,wsize=32768,_netdev)
Additionally, add the following to the setup: namespace list in /var/lib/metalware/repo/config/domain.yaml:
- /opt/alces/install/scripts/01-nfs.sh
Download the nfs.sh script to the above location:
mkdir -p /opt/alces/install/scripts/
cd /opt/alces/install/scripts/
wget -O 01-nfs.sh https://raw.githubusercontent.com/alces-software/knowledgebase/master/epel/7/nfs/nfs.sh
Follow Client Deployment Example to set up the compute nodes
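Once a compute node has been deployed with this configuration, its /etc/fstab should contain NFS entries along these lines (a sketch based on the example values above; /users uses the default mount options and /data the overridden ones):
10.10.0.3:/export/users /users nfs intr,rsize=32768,wsize=32768,_netdev 0 0
10.10.0.3:/export/data /data nfs intr,sync,rsize=32768,wsize=32768,_netdev 0 0
The mounts can be checked on a node with mount | grep nfs or df -h /users /data.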
Lustre Server Setup¶
On Master Node¶
Create /opt/vm/lustre-mds.xml for deploying the lustre metadata server VM (Available here)
Create a disk image for the lustre metadata server VM:
qemu-img create -f qcow2 lustre-mds.qcow2 80G
Define the VM:
virsh define lustre-mds.xml
On Controller VM¶
Create a group for the lustre VM (add at least lustre-mds1 as a node in the group; setting additional groups of lustre,services,cluster,domain allows for more diverse group management):
metal configure group lustre-mds
Customise the lustre-mds1 node configuration (set the primary IP address to 10.10.0.10):
metal configure node lustre-mds1
Create a deployment file specifically for lustre-mds1 at /var/lib/metalware/repo/config/lustre-mds1.yaml with the following content:
lustreconfig:
  type: server
  networks: tcp0(<%= config.networks.pri.interface %>)
  mountentry: "10.10.0.10:/lustre /mnt/lustre lustre default,_netdev 0 0"
Note
If the server has an Infiniband interface that can be used for storage access then set networks to a list of modules which includes Infiniband, e.g. o2ib(<%= config.networks.ib.interface %>),tcp0(<%= config.networks.pri.interface %>)
Add the following to /var/lib/metalware/repo/config/domain.yaml:
lustreconfig:
  type: none
  networks: tcp0(<%= config.networks.pri.interface %>)
  mountentry: "10.10.0.10:/lustre /mnt/lustre lustre default,_netdev 0 0"
Note
For lustre clients, replicate the above entry into the group or node config file and change type: none to type: client, also ensuring that networks reflects the available modules and interfaces on the system
Additionally, add the following to the setup: namespace list in /var/lib/metalware/repo/config/domain.yaml:
- /opt/alces/install/scripts/08-lustre.sh
Download the lustre.sh script to the above location:
mkdir -p /opt/alces/install/scripts/
cd /opt/alces/install/scripts/
wget -O 08-lustre.sh https://raw.githubusercontent.com/alces-software/knowledgebase/master/epel/7/lustre/lustre.sh
Follow Client Deployment Example to set up the lustre node
Once this has completed, the lustre-mds node will have the necessary configuration to host a lustre metadata target or storage configuration. To configure the metadata disk or storage configuration, see the section below.
Lustre Storage Setup¶
A lustre storage configuration usually consists of a metadata server (that is used to authorise mount, read and write requests to the lustre storage volume) and multiple storage servers (with disk arrays attached to them). The above configuration shows how a metadata server can be configured as part of the network but with some naming tweaks the lustre storage servers can also be added to the environment.
Metadata Storage Target
To format a metadata storage disk from the metadata server, run a command similar to the following (replacing lustre with the desired name of the lustre filesystem and /dev/sda with the path to the disk for storing metadata):
mkfs.lustre --index=0 --mgs --mdt --fsname=lustre --servicenode=10.10.0.10 --reformat /dev/sda
To activate the storage, mount it somewhere on the metadata server:
mount -t lustre /dev/sda /mnt/lustre/mdt
Lustre Storage Target
These commands should be performed from different systems connected to the same storage backends across the storage configuration (depending on the network configuration) to ensure that the device management is distributed.
A storage target for the lustre filesystem can be formatted as follows (replacing lustre with the name of the filesystem from the mdt configuration; repeat --servicenode=IP-OF-OSSX for each storage system that is also connected to the same storage backend, and replace /dev/mapper/ostX with the path to the storage device):
mkfs.lustre --ost --index=0 --fsname=lustre --mgsnode=IP-OF-MDS-NODE --mkfsoptions="-E stride=32,stripe_width=256" --servicenode=IP-OF-OSSX /dev/mapper/ostX
The device can then be mounted:
mount -t lustre /dev/mapper/ostX /mnt/lustre/ostX
Client Mount
The following command will mount the example lustre volume created from the above steps:
mount -t lustre 10.10.0.10:/lustre /mnt/lustre
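If the mount succeeds, the lustre client utilities can be used to confirm that the metadata target and all object storage targets are visible and reporting capacity (a quick sanity check; the output depends on how many OSTs have been formatted and mounted):
lfs df -h /mnt/lustre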
Monitoring Overview¶
About¶
This package will configure a monitoring master system with metric collection services and a web front-end. Slave nodes will have the client monitoring service setup to send metrics to the master system.
Components¶
The monitoring system will provide the following applications:
Ganglia, a passive monitoring system with metric graphs
- Clients send metrics to the Ganglia host system which are plotted on graphs for viewing data trends for the environment. The data is available through both a command-line utility and a web interface.
Nagios, an active monitoring system with notifications
- Nagios clients are configured on the server and are not required to run client software unless additional metrics are needed. The system actively monitors metrics and sends out notifications if the values exceed a predefined, customisable threshold.
Key Files¶
/etc/ganglia/gmetad.conf
/etc/ganglia/gmond.conf
/etc/httpd/conf.d/ganglia
/var/lib/ganglia/*
/etc/nagios/*
/etc/httpd/conf.d/nagios.conf
/usr/share/nagios/*
Considerations for Monitoring the HPC Platform¶
Types of Monitoring¶
There are 2 types of monitoring that can be implemented into a network; these are:
- Passive - Passive monitoring tools collect and store data from systems. Usually this data will be displayed in graphs and is accessible either through command-line or web interfaces. This sort of monitoring is useful for historical metrics and live monitoring of systems.
- Active - Active monitoring collects and checks metrics; it will then send out notifications if certain thresholds or conditions are met. This form of monitoring is beneficial for ensuring the health of systems; for example, email notifications can be sent out when systems start overheating or if a system is no longer responsive.
Both forms of monitoring are usually necessary in order to ensure that your HPC cluster is running properly, and in full working order.
Metrics¶
It is worth considering what metrics for the system will be monitored; a few common ones are listed here:
- CPU
- Load average
- Idle percentage
- Temperature
- Memory
- Used
- Free
- Cached
- Disk
- Free space
- Used space
- Swap (free/used/total)
- Quotas (if configured)
- Network
- Packets in
- Packets out
Note
Cloud service providers usually have both passive and active monitoring services available through their cloud management front-end.
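If a metric you need is not collected by default, most monitoring tools allow custom values to be injected. As a minimal sketch using Ganglia's gmetric utility (the metric name and the /scratch path are assumptions), a cron job on each node could publish free scratch space:
gmetric --name=scratch_free_gb --value=$(df -BG --output=avail /scratch | tail -n1 | tr -dc '0-9') --type=uint32 --units=GB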
Additional Considerations and Questions¶
- What metrics should be monitored?
- How frequently should metrics be checked?
- What level of notification is required?
- Escalation upon repeated errors?
- Acknowledgement of long-running outages?
- How do we avoid over-saturation of notifications during major outages?
- What tests will we run to ensure that notifications are working properly?
Recommendations for Monitoring the HPC Platform¶
Setting Up Monitor Server (Ganglia & Nagios)¶
On Master Node¶
Create /opt/vm/monitor.xml for deploying the monitor VM (Available here)
Create a disk image for the monitor VM:
qemu-img create -f qcow2 monitor.qcow2 80G
Define the VM:
virsh define monitor.xml
On Controller VM¶
Create a group for the monitor VM (add at least monitor1 as a node in the group; setting additional groups of services,cluster,domain allows for more diverse group management):
metal configure group monitor
Customise the monitor1 node configuration (set the primary IP address to 10.10.0.5):
metal configure node monitor1
Create /var/lib/metalware/repo/config/monitor1.yaml with the following server definition:
ganglia:
  is_server: true
nagios:
  is_server: true
Add the following to /var/lib/metalware/repo/config/domain.yaml:
ganglia:
  server: 10.10.0.5
  is_server: false
nagios:
  is_server: false
Additionally, add the following to the setup: namespace list in /var/lib/metalware/repo/config/domain.yaml:
- /opt/alces/install/scripts/03-ganglia.sh
- /opt/alces/install/scripts/04-nagios.sh
Download the ganglia.sh and nagios.sh scripts to the above location:
mkdir -p /opt/alces/install/scripts/
cd /opt/alces/install/scripts/
wget -O 03-ganglia.sh https://raw.githubusercontent.com/alces-software/knowledgebase/master/epel/7/ganglia/ganglia.sh
wget -O 04-nagios.sh https://raw.githubusercontent.com/alces-software/knowledgebase/master/epel/7/nagios/nagios.sh
Follow Client Deployment Example to set up the compute nodes
This will set up minimal installations of both Ganglia and Nagios. All nodes within the domain will be built to connect to these services such that they can be monitored. It is possible to expand upon the metrics monitored and the notification preferences.
Once deployed, both Ganglia and Nagios services can be further configured and customised to deliver the metrics and notifications that you require. Review the documentation for each of the projects for assistance in settings up these tools.
Hardware Drivers Overview¶
About¶
Some clusters may require custom hardware drivers for your chosen operating system. This can be the case for both bare-metal hardware and cloud-based resources for devices such as Infiniband or other accelerated network technologies, GPU and FPGA cards or RAID controllers. This package will create an environment that allows nodes to run installers during first-boot to build the driver against the up-to-date operating system packages.
Components¶
The components in this package are:
First boot environment for automatically executing installers at boot time
- This provides a system service that can be used to run installation scripts when a system is turned on. For example, the Nvidia graphics driver can be set to compile after the initial build of the system such that it is compiled against the correct kernel version.
Infiniband driver first-boot script
- A script that can install the Mellanox Infiniband driver either from upstream repositories or through local compilation from the source files
Nvidia graphics driver first-boot script
Key Files¶
/etc/systemd/system/firstrun.service
/var/lib/firstrun/*
/var/log/firstrun/*
/opt/alces/installers/*
Note
In order to use first-boot, the system must be compatible with the base operating system. Review instructions for your chosen operating system if you need to use special drivers in order to allow your nodes to install the base OS.
First Boot Script Environment¶
The first boot environment is a service that allows for scripts to be executed on a node at startup, occurring only on the first boot after system build.
Creating a First Boot Script¶
A first boot script is made up of two main components:
- Setup script
- First run script
Setup Script¶
This script will run as part of the node build procedure and will be used to put the first run script into the correct location to be executed at boot time.
Create a script like the following example (replace myfile.bash with the name of the program, and the content between cat and EOF with the installation commands):
cat << EOF > /var/lib/firstrun/scripts/myfile.bash
curl http://www.system-driver.com/downloads/installer.sh > /tmp/installer.sh
sh /tmp/installer.sh --quiet
EOF
The above script can then be saved somewhere under /opt/alces/install/scripts/ on the deployment VM
In /var/lib/metalware/repo/config/domain.yaml (or a group/node specific config file), add the path to the script to the setup: namespace
First Run Script¶
The example setup script above creates a file called /var/lib/firstrun/scripts/myfile.bash, which is the first run script. Any files ending with .bash in /var/lib/firstrun/scripts will be executed on the first boot of the node.
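For example, expanding on the setup script shown earlier, the generated first run script could capture its output under /var/log/firstrun/ so that any driver build problems can be diagnosed later (a sketch; the file names follow the earlier example and the log path is an assumption based on the key files listed above):
#!/bin/bash
# /var/lib/firstrun/scripts/myfile.bash - runs once, on the first boot after the node is built
curl http://www.system-driver.com/downloads/installer.sh > /tmp/installer.sh
sh /tmp/installer.sh --quiet >> /var/log/firstrun/myfile.log 2>&1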
Compute Node Setup¶
Infiniband Setup¶
Create a configuration file specifically for the nodes group at /var/lib/metalware/repo/config/nodes.yaml with the ib network setup:
networks:
  ib:
    defined: true
    ib_use_installer: false
    mellanoxinstaller: http://route/to/MLNX_OFED_LINUX-x86_64.tgz
    ip:
Note
If you want to install the Mellanox driver (and not use the IB drivers from the CentOS repositories), set ib_use_installer to true and set mellanoxinstaller to the location of the Mellanox OFED installer.
Download the infiniband.sh script from the knowledgebase:
mkdir -p /opt/alces/install/scripts/
cd /opt/alces/install/scripts/
wget -O 06-infiniband.sh https://raw.githubusercontent.com/alces-software/knowledgebase/master/epel/7/infiniband/infiniband.sh
Add the script to the scripts: namespace list in /var/lib/metalware/repo/config/domain.yaml:
- /opt/alces/install/scripts/06-infiniband.sh
Follow Client Deployment Example to set up the compute nodes
Nvidia Driver Setup¶
Download the Nvidia installer to /opt/alces/installers/ on the controller VM as nvidia.run
Download the nvidia.sh script from the knowledgebase:
mkdir -p /opt/alces/install/scripts/
cd /opt/alces/install/scripts/
wget -O 07-nvidia.sh https://raw.githubusercontent.com/alces-software/knowledgebase/master/epel/7/nvidia/nvidia.sh
Add the script to the scripts: namespace list in /var/lib/metalware/repo/config/domain.yaml:
- /opt/alces/install/scripts/07-nvidia.sh
To run the installer on all nodes in a group (for example, gpunodes), add the following line to the group’s config file (in this example, /var/lib/metalware/config/gpunodes.yaml):
nvidia: true
HPC Environment Overview¶
About¶
This package provides tools for queuing jobs and running applications across the cluster.
Components¶
The HPC environment will be comprised of:
Queuing system for optimised resource utilisation
SLURM
- The SLURM job scheduler is a centrally managed job scheduler that can constrain resources based on grid utilisation, user/group assignment and job resource requirements.
or
Open Grid Scheduler
- Much like SLURM, OGS provides a centrally managed job scheduler with similar resource management possibilities.
Application deployment solution
Environment modules
- Environment modules allow applications to be loaded dynamically in shell sessions. With paths updated on-the-fly, applications can be installed to a network storage location, minimising installation time and improving the ease of application use (both in interactive and non-interactive sessions).
and/or
Alces Gridware
- Gridware contains an implementation of environment modules and also provides a useful CLI tool for installing and managing applications from a large repository of HPC applications.
Key Files¶
/etc/slurm/*
or /opt/gridscheduler/*
/opt/apps/*
or /opt/gridware/
Considerations for HPC Environment Design¶
Job Scheduling¶
In a HPC environment there are large, distributed, multiple processor jobs that the users wish to run. While these jobs can be run manually by simply executing job scripts along with a hostfile containing the compute nodes to execute on, you will soon run into problems with multiple users, job queuing and priorities. These features are provided by a job scheduler, delivering a centralised server that manages the distribution, prioritisation and execution of job scripts from multiple users in the HPC network.
Popular job schedulers for HPC clusters include:
- Open Grid Scheduler (SGE)
- PBS / Torque-Maui / Torque-Moab / PBSPro
- SLURM
- Load-sharing facility (LSF) / OpenLava
All job schedulers provide a similar level of functionality and customisations so it is worth investigating the features of the available solutions to find the one best suited for your environment.
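Whichever scheduler is chosen, users will interact with it through small job scripts. As an illustration only (SLURM syntax is shown and the resource requests are arbitrary), a job script might look like the following and would be submitted with sbatch followed by the script name:
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00
# Print the hostname of every allocated node
srun hostname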
Application Deployment¶
General management of applications and libraries is covered in Repository Management; this section focuses on installing applications for the entire HPC environment rather than individually on each node.
A few things to consider when designing/implementing an application deployment system are:
- How will applications be stored? (central network storage location?)
- What parts of the application need to be seen by the nodes? (application data? program files? libraries?)
- How will multiple versions of the same application be installed, and how will users choose between them?
- How will dependencies be managed? (more on this below)
An application deployment system can be built yourself, or Alces Gridware can be used, which provides tools and an index of HPC applications for HPC platform installations.
Dependencies¶
When it comes to managing dependencies for applications it can either be done with local installations of libraries/packages or by storing these in a centralised location (as suggested with the applications themselves). Dependency control is one of the main reasons that using the same OS for all systems is recommended as it eliminates the risk of applications only working on some systems within the HPC environment.
Dependencies must be managed across all nodes of the cluster, and over time as the system is managed. For example, an application that requires a particular C++ library that is available from your Linux distribution may not work properly after you install distribution updates on your compute nodes. Dependencies for applications that utilise dynamic libraries (i.e. loaded at runtime, rather than compile-time) must be particularly carefully managed over time.
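A quick way to spot this class of problem is to compare the dynamic libraries an application resolves on each node; if the output differs between nodes, or a library shows as "not found", dependencies have drifted. For example (the application path and group name are hypothetical):
pdsh -g nodes 'ldd /opt/apps/myapp/bin/myapp | grep "not found"'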
Reproducibility¶
It is important that your users receive a consistent, long-term service from your HPC cluster to allow them to rely on results from applications run at different points in your cluster's lifecycle. Consider the following questions when designing your application management system:
- How can I install new applications quickly and easily for users?
- What test plans have I created to ensure that applications run in the same way across all cluster nodes?
- How can I ensure that applications run normally as nodes are re-installed, or new nodes are added to the cluster?
- How can I test that applications are working properly after an operating system upgrade or update?
- How will I prepare for moving to a new HPC cluster created on fresh hardware, or using cloud resources?
- What are the disaster recovery plans for my software applications?
Recommendations for HPC Environment Design¶
SLURM Setup (From Controller VM)¶
The instructions below provide guidelines for installing the SLURM job-scheduler on your HPC cluster, and may be modified as required for alternative job-schedulers such as SGE, LSF, PBS or Torque/Maui.
Create a group for the slurm VM (add at least slurm1 as a node in the group; setting additional groups of services,cluster,domain allows for more diverse group management):
metal configure group slurm
Customise the slurm1 node configuration (set the primary IP address to 10.10.0.6):
metal configure node slurm1
Create /var/lib/metalware/repo/config/slurm1.yaml with the following network and server definition:
slurm:
  is_server: true
Add the following to /var/lib/metalware/repo/config/domain.yaml (set server to the hostname of the SLURM VM):
slurm:
  server: slurm1
  is_server: false
  mungekey: ff9a5f673699ba8928bbe009fb3fe3dead3c860c
Additionally, add the following to the setup: namespace list in /var/lib/metalware/repo/config/domain.yaml:
- /opt/alces/install/scripts/06-slurm.sh
Download the slurm.sh script to the above location:
mkdir -p /opt/alces/install/scripts/
cd /opt/alces/install/scripts/
wget -O 06-slurm.sh https://raw.githubusercontent.com/alces-software/knowledgebase/master/epel/7/slurm/slurm.sh
Build the SLURM RPMs in the custom repo (/opt/alces/repo/custom/Packages); a guide to building the SLURM RPMs can be found in the SLURM documentation. Once the packages have been moved to the previously mentioned custom repo directory, rebuild the repo with createrepo custom
Follow Client Deployment Example to set up the SLURM node
Note
All systems that are built will have SLURM installed and the SLURM daemon running, which will allow that node to submit and run jobs. Should this not be desired, the service can be permanently stopped and disabled with systemctl disable slurmd && systemctl stop slurmd on the node that should no longer run SLURM.
Once your job-scheduler is installed and running, follow the documentation specific to your chosen package for instructions on how to configure the software for your particular requirements.
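For SLURM, a quick functional check from a login or compute node might look like the following (assuming the slurm and munge services are running; partition and node names will differ):
sinfo                # list partitions and node states
srun -N2 hostname    # run a trivial job across two nodes
squeue               # confirm the queue drains afterwards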
Modules Setup (From Deployment VM)¶
The environment modules software allows users to control their environment variables for a particular login session. This enables flexibility to use a library of installed application software whilst maintaining the correct dependencies for each package. Modules can also control environment variables that deliver user assistance (e.g. manual pages, help-files, usage examples), curate software package usage terms and manage license availability for commercial packages.
The instructions below provide a simple modules installation for your HPC environment:
Create a group for the modules VM (add at least apps1 as a node in the group; setting additional groups of services,cluster,domain allows for more diverse group management):
metal configure group apps
Customise the apps1 node configuration (set the primary IP address to 10.10.0.7):
metal configure node apps1
Create /var/lib/metalware/repo/config/apps1.yaml with the following network and server definition:
modules:
  is_server: true
Add the following to /var/lib/metalware/repo/config/domain.yaml (set server to the IP of the apps VM):
modules:
  server: 10.10.0.7
  directory: /opt/apps
  is_server: false
Additionally, add the following to the setup: namespace list in /var/lib/metalware/repo/config/domain.yaml:
- /opt/alces/install/scripts/07-modules.sh
Download the modules.sh script to the above location:
mkdir -p /opt/alces/install/scripts/
cd /opt/alces/install/scripts/
wget -O 07-modules.sh https://raw.githubusercontent.com/alces-software/knowledgebase/master/epel/7/modules/modules.sh
Follow Client Deployment Example to set up the apps node
Note
The apps directory can be set up on the storage node if one was created; this allows all NFS exports to come from a centralised server.
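Once the apps node is serving the modules directory, a user session on any client can load software along these lines (a sketch; the modulefiles path and application name are assumptions that depend on how the directory under /opt/apps is laid out):
module use /opt/apps/modulefiles
module avail
module load apps/myapp/1.0
module list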
Considerations for HPC Platform Verification¶
Before putting the system into a production environment it is worth verifying that the hardware and software is functioning as expected. The 2 key types of verification are:
- Configuration - Verifying that the system is functioning properly as per the server setup.
- Performance - Verifying that the performance of the system is as expected for the hardware, network and applications.
Verifying the configuration¶
Simple configuration tests for the previous stages of the HPC platform creation will need to be performed to verify that it will perform to user expectations. For example, the following could be tested:
- Passwordless SSH between nodes performs as expected
- Running applications on different nodes within the network
- Pinging and logging into systems on separate networks
Best practice is to test the configuration at regular intervals whilst it is being set up, to confirm that functionality is still as expected. In combination with written documentation, a well-practised preventative maintenance schedule is essential to ensuring a high-quality, long-term stable platform for your users.
Testing System¶
There are multiple parts of the hardware configuration that can be tested on the systems. The main few areas are CPU, memory and interconnect (but may also include GPUs, disk drives and any other hardware that will be heavily utilised). Many applications are available for testing, including:
- Memtester
- HPL/Linpack/HPC-Challenge (HPCC)
- IOZone
- IMB
- GPUBurn
Additionally, benchmarking can be performed using whichever applications the HPC platform is being designed to run, to give more representative results for the use case.
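As an indication of the sort of commands involved (illustrative only; memory sizes, MPI rank counts and binary locations depend on your build and hardware):
memtester 1024M 1    # test 1GiB of RAM for one pass
mpirun -np 16 ./xhpl # run HPL with 16 MPI ranks from its build directory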
Additional Considerations and Questions¶
- How will you know that a compute node is performing at the expected level?
- Gflops theoretical vs actual performance efficiency
- Network performance (bandwidth, latency, ping interval)
- How can you test nodes regularly to ensure that performance has not changed / degraded?
Recommendations for HPC Platform Verification¶
Checking System Configuration¶
For the system configuration (depending on which previous sections have been configured), the cluster administrator should verify the following settings:
- Clocks - The current date/time is correct on all machines; clients are syncing with the controller
- User Login - Users from the chosen verification method can login to:
- Infrastructure node(s) (should be possible for admin users only)
- Login node(s)
- Compute node(s)
- Storage system mounts - Mounted with correct mount options and permissions
- Job-scheduler - Jobs can be submitted and are run; nodes are all present and available on the queue
- Ganglia - All nodes present, metrics are logging
- Nagios - All nodes present, services are in positive state
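Many of these checks can be scripted and re-run after any maintenance. A few illustrative one-liners (group names, usernames, service IPs and scheduler commands will vary with your configuration):
pdsh -g cluster date          # clocks roughly in sync across nodes
pdsh -g cluster 'id myuser'   # central user accounts resolve everywhere
showmount -e 10.10.0.3        # NFS exports visible from the storage server
sinfo                         # job scheduler sees all nodes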
Checking Hardware Configuration¶
Benchmarking Software¶
For general notes on running memtester, IOZone, IMB and HPL see - https://github.com/alces-software/knowledgebase/wiki/Burn-in-Testing
Further details can be found at:
Hardware Information¶
Check CPU type:
pdsh -g groupname 'grep -m 1 name /proc/cpuinfo'
Check CPU count:
pdsh -g groupname 'grep processor /proc/cpuinfo |wc -l'
Check RAID active:
pdsh -g groupname 'cat /proc/mdstat | grep md[0-1]'
Check Infiniband up/active:
pdsh -g groupname 'ibstatus |grep phys'
Check free memory:
pdsh -g groupname 'free -m |grep ^Mem'
Check GPU type and count:
pdsh -g groupname 'nvidia-smi'
Grab serial numbers:
pdsh -g groupname 'dmidecode -t baseboard |grep -i serial'
Setting Up AWS Environment for HPC Platform¶
HPC platforms can be deployed in the cloud instead of on local hardware. While there are many cloud providers out there, this guide focuses on setting up login and compute nodes in the cloud on the AWS platform.

General Network Setup¶
AWS provides a command line tool that can be used to create and manage resources. The commands below will need to be run from a Linux/Mac machine with the AWS CLI installed.
Create a VPC for the network:
aws ec2 create-vpc --cidr-block 10.75.0.0/16
Note
Optionally, a name tag can be created for the VPC (which can make it easier to locate the VPC through the AWS web console) with aws ec2 create-tags --resources my_vpc_id --tags Key=Name,Value=Name-For-My-VPC
Create a security group (replacing my_vpc_id with the VpcId from the above command output):
aws ec2 create-security-group --description my-sg1 --group-name my-sg1 --vpc-id my_vpc_id
Create a file sg-permissions.json in the current directory with the following content:
[
  {
    "IpProtocol": "-1",
    "IpRanges": [ { "CidrIp": "10.75.0.0/16" } ]
  },
  {
    "IpProtocol": "tcp",
    "FromPort": 22,
    "ToPort": 22,
    "IpRanges": [ { "CidrIp": "0.0.0.0/0" } ]
  },
  {
    "IpProtocol": "tcp",
    "FromPort": 443,
    "ToPort": 443,
    "IpRanges": [ { "CidrIp": "0.0.0.0/0" } ]
  },
  {
    "IpProtocol": "tcp",
    "FromPort": 80,
    "ToPort": 80,
    "IpRanges": [ { "CidrIp": "0.0.0.0/0" } ]
  }
]
Add rules to the security group (replacing my_sg_id with the GroupId from the above command output):
aws ec2 authorize-security-group-ingress --group-id my_sg_id --ip-permissions file://sg-permissions.json
Define a subnet for the VPC (replacing my_vpc_id with the VpcId from earlier):
aws ec2 create-subnet --vpc-id my_vpc_id --cidr-block 10.75.0.0/16
Create an Internet gateway:
aws ec2 create-internet-gateway
Attach the Internet gateway to the VPC (replacing my_igw_id with the InternetGatewayId from the above command output):
aws ec2 attach-internet-gateway --internet-gateway-id my_igw_id --vpc-id my_vpc_id
Locate route table for the VPC:
aws ec2 describe-route-tables --filters Name=vpc-id,Values=my_vpc_id
Create a route within the table (replacing my_rtb_id with the RouteTableId from the above command output):
aws ec2 create-route --route-table-id my_rtb_id --destination-cidr-block 0.0.0.0/0 --gateway-id my_igw_id
Create a file ec2-role-trust-policy.json in the current directory with the following content:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
Create autoscaling role:
aws iam create-role --role-name autoscaling --assume-role-policy-document file://ec2-role-trust-policy.json
Create a file ec2-role-access-policy.json in the current directory with the following content:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:UpdateAutoScalingGroup",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeTags"
      ],
      "Resource": [ "*" ]
    }
  ]
}
Set role policy for above role:
aws iam put-role-policy --role-name my-autoscaling-role --policy-name My-Autoscaling-Permissions --policy-document file://ec2-role-access-policy.json
Create instance profile for autoscaling:
aws iam create-instance-profile --instance-profile-name autoscaling
Join the role and instance profile:
aws iam add-role-to-instance-profile --instance-profile-name autoscaling --role-name autoscaling
Create a file mapping.json in the current directory with the following content:
[
  {
    "DeviceName": "/dev/sda1",
    "Ebs": {
      "DeleteOnTermination": true,
      "SnapshotId": "snap-00f18f3f6413c7879",
      "VolumeSize": 20,
      "VolumeType": "gp2"
    }
  }
]
Autoscaling Group Configuration¶
Set up the autoscaling launch configuration (replace compute_node_template_ami_id with the AMI ID to boot compute nodes from; ami-061b1560 is the ID for the official CentOS 7 minimal installation):
aws autoscaling create-launch-configuration --launch-configuration-name my-compute-group1 --image-id compute_node_template_ami_id --key-name my_key_pair --security-groups my_sg_id --associate-public-ip-address --iam-instance-profile my-autoscaling-profile --instance-type c4.large --spot-price 0.113
Create autoscaling group which can scale from 0 to 8 nodes and initially starts with 1:
aws autoscaling create-auto-scaling-group --auto-scaling-group-name my-compute-group1 --launch-configuration-name my-compute-group1 --vpc-zone-identifier my_subnet_id --min-size 0 --max-size 8 --desired-capacity 1
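The group can later be resized without recreating it, for example to grow to four compute nodes ahead of a busy period:
aws autoscaling set-desired-capacity --auto-scaling-group-name my-compute-group1 --desired-capacity 4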
Node Creation Example¶
Create a node (ami-061b1560 is the ID for the official CentOS 7 minimal installation; replace my_key_pair, my_sg_id and my_subnet_id with the related values from earlier commands):
aws ec2 run-instances --image-id ami-061b1560 --key-name my_key_pair --instance-type r4.2xlarge --associate-public-ip-address --security-group-ids my_sg_id --block-device-mappings file://mapping.json --subnet-id my_subnet_id --iam-instance-profile Name=my-autoscaling-profile
Wait for the node to launch (instance_id being the ID from the above command):
aws ec2 describe-instances --instance-id instance_id | jq '.Reservations[0].Instances[0].State.Name'
Identify the public IP for the node to log in to (instance_id being the ID from the above command):
aws ec2 describe-instances --instance-id instance_id | jq '.Reservations[0].Instances[0].PublicIpAddress'
Controller Node Setup¶
- Follow Node Creation Example
- Follow Metalware Install
Repository Node Setup¶
- Follow Node Creation Example
- Follow Repo Configuration
Storage Node Setup¶
- Follow Node Creation Example
- Follow Storage Configuration
User Management Setup¶
- Follow Node Creation Example
- Follow User Management Setup
Monitor Node Setup¶
- Follow Node Creation Example
- Follow Monitor System Setup