HPC on Microsoft Azure: A Practical Guide
Deploying critical workloads like genomic sequencing, oil and gas, or electronic design automation requires speed, resilience and high-availability. High Performance Computing (HPC) offerings from Azure enable you to leverage cloud HPC, which can significantly reduce your costs while providing workload distribution and scalability.
Why Run HPC in the Cloud?
You can run HPC workloads in all major public or private clouds. Some cloud providers, like Azure, offer ready-to-use HPC clusters for machine learning, big data analytics, which you can leverage for any industry.
Benefits of HPC in the cloud include:
Availability—the high-availability of the cloud can reduce interruptions in HPC workload operations. High-availability also provides data protection since it is usually achieved through mirrored data centers.
Scalability—cloud controls can help you avoid paying for unused resources by increasing or decreasing storage, memory, and processing power of your HPC workloads as needed.
Cost—cloud infrastructure eliminates the need to purchase, maintain, and host servers. Outsourcing resources instead of buying reduces upfront costs and potentially removes technical debt.
Workload distribution—using containerized microservices in the cloud enables you to orchestrate distributed HPC workloads. Containers can also help you move existing workloads and tools to the cloud.
Running HPC in the cloud requires components like low latency, batch scheduling features, high-throughput storage, bare-metal support, userspace communication functionality, high-speed interconnects, and clustered instances.
How Do HPC Deployments Work in Microsoft Azure?
You can deploy HPC in Azure as an extension of on-premise deployments or entirely in the cloud. You can also run workloads simultaneously on-premise and in the cloud for hybrid deployments. Another option is to use Azure deployments only as burst capacity for larger workloads.
In Azure, you can distribute low-latency networking between your HPC nodes by using Remote Direct Memory Access (RDMA) or InfiniBand. To set up your nodes in Azure, you can use GPU-enabled Virtual Machines (VMs) with NVIDIA GPUs or customized compute-intensive VMs. The type of node you choose will depend on your budget and specific compute needs.
HPC in Azure components include:
HPC head node—a VM that operates as the central managing server for your HPC cluster. You can use it to schedule workloads and jobs to your worker nodes.
VM Scale Sets service—enables you to create multiple virtual machines with features for autoscaling, load balancing, and deployment across availability zones. You can use scale sets to run Hadoop, Cassandra, Cloudera, MongoDB, and Mesos.
Virtual network—a network that links your head node to your storage and compute nodes via IP. You can use this network to control the infrastructure and the traffic between subnets. Virtual Network creates an isolated environment for your deployment with secure connections via ExpressRoute OR IPsec VPN.
Storage—Azure storage options include disk, file, and blob storage. You can also use hybrid storage solutions or data lakes. You can add the Avere vFXT service to improve storage performance. Avere vFXT enables you to deploy file system workloads with low latency through an advanced cache feature.
Azure Resource Manager—a deployment and management service that enables you to use script files or templates to deploy HPC applications. It is the standard resource management tool in Azure.
Managing HPC Deployments in Azure
Azure’s HPC deployment management services are native offerings. You can integrate them with other Azure services and use Azure support when needed.
Azure Batch is a compute management and job scheduling service for HPC applications. To use this service, you need to configure your VM pool and upload your workloads. Azure Batch then provisions VMs, assigns and runs tasks, and monitors progress. You can use Batch to set policies for job scheduling and autoscale your deployment.
Microsoft HPC Pack
HPC Pack is a centralized management and deployment interface that enables you to manage your VM cluster schedule jobs, and monitor processes. You can use HPC Pack to provision and manage network infrastructure and VMs. HPC Pack is helpful for migrating existing, on-premise HPC workloads to the cloud. You can either migrate your workloads completely to the cloud or use HPC Pack to manage a hybrid deployment.
Azure CycleCloud is an enterprise-grade tool for managing and orchestrating HPC clusters on Azure. You can use CycleCloud to provision HPC infrastructure, and run and schedule jobs at any scale. CycleCloud enables you to create different file systems types in Azure and mount them to the compute cluster nodes to support HPC workloads.
Best Practices for Using HPC on Azure
The following guidelines can help you create and use large Azure deployments with your HPC cluster.
Split deployments to multiple services
Split large HPC deployments into several smaller-sized deployments by using multiple Azure services. You should use less than 500 to 700 virtual machine instances in each service. Larger deployments can lead to deployment timeouts, loss of live instances, and issues with virtual IP address swapping.
When dividing deployments, you get the following benefits:
Flexibility—in stopping and starting groups of nodes.
Shut down unused instances—if a job is no longer running.
Find available nodes in Azure clusters—when you use extra large instances.
Multiple Azure data centers—for business continuity and disaster recovery scenarios.
Use several Azure storage accounts for different node deployments
For applications that are constrained by I/O, use several Azure storage accounts to deploy large HPC nodes simultaneously. However, you should use these storage accounts only for node provisioning. For instance, you need to configure separate Azure storage accounts to move job and task data to and from the head node.
Change the number of proxy node instances
Proxy nodes are Azure worker role instances that enable communication between Azure nodes and on-premises head nodes. Proxy nodes are automatically added to each deployment of Azure node from an HPC cluster. The demand for resources on the proxy nodes depends on the number of deployed nodes in Azure and the running jobs on those nodes.
You should increase the number of proxy nodes in a large Azure deployment. You can scale up or down the number of proxy nodes by using the Azure Management Portal. The advisable number of proxy nodes for a single deployment is ten.
Use HPC Client Utilities to connect remotely to the head node
The performance of the head node can be negatively impacted when operating under a heavy load, or when many users are connected with remote desktop connections. Instead of using Remote Desktop Services (RDS) to connect users to the head node, users should install HPC Pack Client Utilities on their workstations. These utilities enable users to access the cluster by using remote tools.
Organizations need the ability to process data in real time for purposes like live sporting event streaming, machine learning applications, or big data analytics. HPC in the cloud provides a fast, and highly reliable IT infrastructure to process, store, and analyze massive amounts of data. To get the most out of your HPC workload in Azure, you need to split deployments to multiple services, use multiple Azure storage accounts for node deployments, and change the number of proxy node instances.a