Best Practices for HPC Monitoring on AWS

Running HPC workloads on AWS requires precise monitoring to handle unique challenges like large-scale auto-scaling, strict security standards, and specialized hardware metrics. Here's what you need to know:

Key Benefits: Proper monitoring can reduce job failures by 40% and cut costs by 25%.
Core Tools: Use Amazon CloudWatch, AWS CloudTrail, and AWS Config for infrastructure monitoring, security compliance, and workload-specific metrics.
Specialized Metrics: Track GPU usage, Lustre FS performance, and Elastic Fabric Adapter (EFA) network health.
Advanced Tools: Integrate Amazon Managed Grafana for real-time dashboards and NICE DCV for visualization performance.

This guide covers everything from setting up basic monitoring to advanced tools, helping you optimize performance and ensure compliance for your HPC workloads.

Setting Up Basic Monitoring

To monitor HPC workloads effectively, you'll need to configure AWS's core observability tools to track compute resources, API activities, and network traffic.

AWS CloudWatch Setup for HPC

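Set up CloudWatch with a 1-minute monitoring interval for compute nodes using the CloudWatch agent that AWS ParallelCluster installs on cluster nodes ^[5]. For the built-in EC2 metrics, 1-minute data requires detailed monitoring; basic monitoring reports at 5-minute intervals.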

Key metrics to configure:

Metric Category	Metrics to Track	Collection Interval
Compute	CPU/GPU Utilization, Instance States	1 minute
Storage	EBS Volume Read/Write Operations	1 minute
Network	NetworkIn/NetworkOut, NetworkPacketsIn/NetworkPacketsOut (packet counts), EFA rx_drops counter published as a custom metric (packet drops)	1 minute
Custom	MPI Job Completion Rates	1 minute

For GPU metrics, use NVIDIA DCGM and a custom namespace (e.g., ClusterName/GPU) with 5-second granularity during peak usage ^[5].
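As a rough illustration of the custom-namespace approach, the sketch below builds a high-resolution datum for CloudWatch's PutMetricData call. The metric name `GPUUtilization`, the `GpuIndex` dimension, and the helper names are assumptions made for this example, and the utilization value is assumed to have been read from DCGM already.

```python
from datetime import datetime, timezone


def build_gpu_metric(instance_id, gpu_index, utilization_pct):
    """Build one high-resolution MetricDatum for CloudWatch PutMetricData."""
    return {
        "MetricName": "GPUUtilization",  # hypothetical metric name
        "Dimensions": [
            {"Name": "InstanceId", "Value": instance_id},
            {"Name": "GpuIndex", "Value": str(gpu_index)},  # hypothetical dimension
        ],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(utilization_pct),
        "Unit": "Percent",
        # StorageResolution=1 marks the datum as high-resolution, which is
        # what allows sub-minute (for example 5-second) granularity.
        "StorageResolution": 1,
    }


def publish(cluster_name, data):
    """Send the datums to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(Namespace=f"{cluster_name}/GPU", MetricData=data)
```

A collector would call `build_gpu_metric` every 5 seconds during peak usage and pass the batch to `publish`. High-resolution metrics are billed like other custom metrics, so it may make sense to fall back to 60-second publishing outside peak windows.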

AWS CloudTrail API Monitoring

Set up CloudTrail to log and track HPC API activities:

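Log File Validation: Enable log file integrity validation, which uses SHA-256 hashes and RSA-signed digest files to detect modified or deleted log files ^[2].
Event Selection: Monitor critical events such as RunInstances (Amazon EC2) and SubmitJob (AWS Batch) ^[4].
Real-time Alerts: Deliver the trail to CloudWatch Logs and attach metric filters and alarms to get near-real-time notifications of unauthorized actions ^[4].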
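One way to turn CloudTrail records into alerts is a CloudWatch Logs metric filter that counts denied API calls. The sketch below only builds the arguments for `put_metric_filter`; the filter name, metric name, and namespace are placeholders chosen for this example, and it assumes the trail already delivers to the given log group.

```python
def unauthorized_api_filter(log_group_name):
    """Arguments for logs.put_metric_filter that count denied API calls."""
    return {
        "logGroupName": log_group_name,
        "filterName": "hpc-unauthorized-api-calls",  # placeholder name
        # Matches CloudTrail records whose errorCode signals a denied call.
        "filterPattern": (
            '{ ($.errorCode = "*UnauthorizedOperation") '
            '|| ($.errorCode = "AccessDenied*") }'
        ),
        "metricTransformations": [
            {
                "metricName": "UnauthorizedApiCalls",  # placeholder metric
                "metricNamespace": "HPC/Security",     # placeholder namespace
                "metricValue": "1",
            }
        ],
    }


def apply_filter(kwargs):
    """Create the filter (needs boto3 and AWS credentials)."""
    import boto3

    boto3.client("logs").put_metric_filter(**kwargs)
```

An alarm on the resulting metric, with an SNS topic as its action, would then deliver the notification.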

Multi-Zone Monitoring Setup

Enable VPC flow logs for the subnets in every Availability Zone (AZ) your cluster uses, and focus on records with the REJECT action, which show traffic blocked by security groups or network ACLs ^[4].
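To show what filtering on rejected traffic can look like, here is a minimal sketch that parses flow log records in the default (version 2) space-separated format and counts REJECT records per destination port. The sample records are made up for illustration, and a custom log format would need a different field list.

```python
from collections import Counter

# Field order of the default (version 2) VPC flow log format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_flow_record(line):
    """Return a dict for one record, or None if the line is malformed."""
    values = line.split()
    if len(values) != len(FIELDS):
        return None
    return dict(zip(FIELDS, values))


def rejected_by_destination_port(lines):
    """Count REJECT records per destination port."""
    counts = Counter()
    for line in lines:
        record = parse_flow_record(line)
        if record and record["action"] == "REJECT":
            counts[record["dstport"]] += 1
    return counts


# Made-up sample records for illustration.
sample = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 49152 22 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 49153 2049 6 10 840 1700000000 1700000060 ACCEPT OK",
]
print(rejected_by_destination_port(sample))
```

A spike of rejects on a port that MPI or the file system depends on usually points to a security group that was not updated for a new subnet or AZ.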

For cross-account monitoring, follow these steps:

Set up the service-linked role AWSServiceRoleForCloudWatchCrossAccount in the monitoring account ^[1].
Define resource-based policies to allow cross-account role assumption ^[1].
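The role-assumption step can be sketched as an IAM trust policy attached to a role in each workload account. The account ID below is a placeholder, and trusting the account root is an assumption made for brevity; a production policy would usually name a specific role and add conditions.

```python
import json


def monitoring_trust_policy(monitoring_account_id):
    """Trust policy letting the monitoring account assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumption: the whole monitoring account is trusted.
                "Principal": {"AWS": f"arn:aws:iam::{monitoring_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# 111122223333 is a placeholder account ID.
print(json.dumps(monitoring_trust_policy("111122223333"), indent=2))
```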

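Use CloudWatch cross-account observability to centralize monitoring of HPC workloads ^[1]. Pay special attention to Elastic Fabric Adapter (EFA) performance ^[1]. NetworkPacketsOut reports how many packets an instance sends, not how many are dropped, so pair it with the EFA driver's own counters (for example rx_drops) published as custom metrics.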

These configurations lay the groundwork for monitoring HPC performance, which will be expanded upon in the next section.

HPC Performance Monitoring

After setting up the initial CloudWatch configurations, it's time to refine monitoring for specific workloads.

AWS ParallelCluster Metrics

Create CloudWatch dashboards to keep an eye on cluster-wide performance. Use these metrics and thresholds to ensure smooth operations:

Metric	Threshold	Notes
ClusterUtilization	80%	Monitors percentage usage
ActiveJobs	Queue depth > nodes × 1.5	Sustained for over 10 minutes
NetworkThroughput	>500K packets/sec	Important for CFD workloads
EFA NetworkLatency	>50μs	Key for molecular dynamics tasks
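As an example of encoding one row of this table, the sketch below builds the arguments for a CloudWatch `put_metric_alarm` call that fires when queue depth stays above nodes × 1.5 for 10 minutes. It assumes `ActiveJobs` is a custom metric you publish yourself; the namespace, dimension, and alarm name are placeholders.

```python
def queue_depth_alarm(cluster_name, node_count):
    """Arguments for cloudwatch.put_metric_alarm on sustained queue depth."""
    return {
        "AlarmName": f"{cluster_name}-queue-depth",  # placeholder name
        "Namespace": "HPC/Cluster",                  # placeholder namespace
        "MetricName": "ActiveJobs",
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Statistic": "Average",
        "Period": 60,             # one-minute datapoints
        "EvaluationPeriods": 10,  # 10 x 60 s = 10 minutes
        "DatapointsToAlarm": 10,  # every datapoint must breach
        "Threshold": node_count * 1.5,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def create_alarm(kwargs):
    """Create the alarm (needs boto3 and AWS credentials)."""
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```

Requiring all 10 datapoints to breach keeps short submission bursts from paging anyone, which matches the "sustained" wording in the table.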

Instance-Specific Monitoring

Different instance types come with their own monitoring needs. For example, Graviton instances require close tracking of core usage. For C7g instances, which run on Graviton3 processors, watch per-core utilization and alert when it stays above 85% ^[6].

Workload Metrics Configuration

Tailor your monitoring setup to match the demands of specific applications. Expand on the basic EFA monitoring to capture patterns unique to each workload:

CFD Workloads: Set alerts for NetworkPpsRx exceeding 500K packets/sec and FSx Lustre BurstCreditBalance dropping below 20% ^[6].
Molecular Dynamics (GROMACS):
- Monitor MPI wait times (over 5ms) and EFA credits (below 10%).
- Use Grafana to track CollectiveOpLatency ^[3].
Weather Modeling (WRF):
- Trigger alerts for sustained DiskQueueDepth values greater than 5.
- Ensure compliance with NIST SP 800-223 by linking performance alerts to CloudTrail events ^[2].

For visualization tasks using NICE DCV, set up alerts for frame rates dropping below 24 FPS and client latency exceeding 100ms ^[3]. These thresholds help maintain a seamless user experience.
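The per-workload thresholds above can be kept in one small rule table so that alerting logic stays in a single place. This is only a sketch: the dictionary keys are hypothetical names for however your collectors label these values, and the limits are the ones quoted in this section.

```python
import operator

# (sample key, comparison, limit); key names are hypothetical labels.
WORKLOAD_RULES = {
    "cfd": [
        ("NetworkPpsRx", operator.gt, 500_000),
        ("BurstCreditBalancePct", operator.lt, 20),
    ],
    "gromacs": [
        ("MpiWaitMs", operator.gt, 5),
        ("EfaCreditsPct", operator.lt, 10),
    ],
    "wrf": [
        ("DiskQueueDepth", operator.gt, 5),
    ],
}


def breached(workload, sample):
    """Return the names of rules that the sample violates."""
    return [
        name
        for name, compare, limit in WORKLOAD_RULES[workload]
        if name in sample and compare(sample[name], limit)
    ]


print(breached("cfd", {"NetworkPpsRx": 620_000, "BurstCreditBalancePct": 35}))
```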

Security and Compliance Monitoring

While performance monitoring focuses on optimizing workloads, security monitoring ensures your systems adhere to compliance standards. Tools like AWS Config, KMS encryption, and compliance alerts are key to building a secure monitoring framework.

AWS Config Compliance Rules

AWS Config continuously evaluates the security of your HPC environment. It automates compliance checks across your infrastructure, addressing both scale and security challenges:

Rule Name	Purpose	Compliance Control
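cloudwatch-log-group-encrypted	Checks that CloudWatch log groups holding monitoring data are encrypted	NIST SP 800-53 AU-9
restricted-ssh	Flags security groups that allow unrestricted inbound SSH to the cluster	NIST SP 800-53 SC-7
ec2-volume-inuse-check	Checks that EBS volumes are attached to EC2 instances	-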

You can also create custom rules using Lambda functions to check ParallelCluster configurations against specific NIST SP 800-223 standards. These rules can monitor changes to instances, network setups, and access patterns.
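A custom rule of this kind is a Lambda function that receives a configuration item and reports a compliance result back to AWS Config. The sketch below checks EC2 instances for required tags; the tag list is a hypothetical stand-in for whatever control you map from NIST SP 800-223, not something the standard prescribes.

```python
import json

# Hypothetical policy: every cluster instance must carry these tags.
REQUIRED_TAGS = ("parallelcluster:cluster-name", "DataClassification")


def evaluate(configuration_item):
    """Return the AWS Config compliance type for one configuration item."""
    if configuration_item.get("resourceType") != "AWS::EC2::Instance":
        return "NOT_APPLICABLE"
    tags = configuration_item.get("tags") or {}
    missing = [tag for tag in REQUIRED_TAGS if tag not in tags]
    return "NON_COMPLIANT" if missing else "COMPLIANT"


def lambda_handler(event, context):
    """Entry point for a change-triggered custom Config rule."""
    import boto3  # lazy import keeps evaluate() testable offline

    item = json.loads(event["invokingEvent"])["configurationItem"]
    boto3.client("config").put_evaluations(
        Evaluations=[
            {
                "ComplianceResourceType": item["resourceType"],
                "ComplianceResourceId": item["resourceId"],
                "ComplianceType": evaluate(item),
                "OrderingTimestamp": item["configurationItemCaptureTime"],
            }
        ],
        ResultToken=event["resultToken"],
    )
```

Keeping the decision in a pure `evaluate` function makes the rule easy to unit test before it is deployed.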

Log Encryption with AWS KMS

All monitoring data should be encrypted using AWS KMS. Set up a dedicated KMS key specifically for HPC monitoring and define service access policies accordingly. For CloudWatch log groups, enable encryption and configure retention periods based on data type:

Operational metrics: Retain for 90 days
Security logs: Retain for 1 year
Compliance data: Retain for 7 years (use S3 Glacier for long-term storage)

If you're monitoring across multiple accounts, ensure the KMS key policies allow the necessary roles to decrypt data.
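The retention tiers and encryption step can be sketched with two CloudWatch Logs calls, `put_retention_policy` and `associate_kms_key`. The tier names are labels invented for this example, and 2557 days is used for the 7-year tier because CloudWatch Logs only accepts specific retention values. The key policy is assumed to already allow the CloudWatch Logs service to use the key.

```python
# Retention per data type, in days (2557 days is roughly 7 years).
RETENTION_DAYS = {"operational": 90, "security": 365, "compliance": 2557}


def log_group_settings(log_group_name, data_type, kms_key_arn):
    """Build arguments for the retention and encryption API calls."""
    return {
        "retention": {
            "logGroupName": log_group_name,
            "retentionInDays": RETENTION_DAYS[data_type],
        },
        "encryption": {
            "logGroupName": log_group_name,
            "kmsKeyId": kms_key_arn,
        },
    }


def apply_settings(settings):
    """Apply both settings (needs boto3 and AWS credentials)."""
    import boto3

    logs = boto3.client("logs")
    logs.put_retention_policy(**settings["retention"])
    logs.associate_kms_key(**settings["encryption"])
```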

Setting Up Compliance Alerts

Set up CloudWatch alarms to monitor for issues like unauthorized API calls (via CloudTrail), Config rule violations, KMS access problems, and unauthorized instance changes.

Use EventBridge rules paired with Lambda functions to automate responses. For example, you can enable encryption on newly created log groups or quarantine non-compliant instances using AWS Systems Manager.
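The log group example could look roughly like the sketch below: an EventBridge event pattern that matches CreateLogGroup calls recorded by CloudTrail, and a Lambda handler that attaches the KMS key. The environment variable name `HPC_LOGS_KMS_KEY_ARN` is an assumption for this example, and the pattern only fires if CloudTrail is logging management events in the account.

```python
import os

# Event pattern for the EventBridge rule.
CREATE_LOG_GROUP_PATTERN = {
    "source": ["aws.logs"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["logs.amazonaws.com"],
        "eventName": ["CreateLogGroup"],
    },
}


def log_group_from_event(event):
    """Extract the new log group's name, or None if it is absent."""
    params = (event.get("detail") or {}).get("requestParameters") or {}
    return params.get("logGroupName")


def lambda_handler(event, context):
    """Encrypt a newly created log group with the monitoring key."""
    import boto3  # lazy import keeps the helper testable offline

    name = log_group_from_event(event)
    if name:
        boto3.client("logs").associate_kms_key(
            logGroupName=name,
            kmsKeyId=os.environ["HPC_LOGS_KMS_KEY_ARN"],  # hypothetical variable
        )
```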

These measures not only strengthen security but also provide detailed audit trails that enhance both security and operational visibility.

Additional Monitoring Tools

These tools are designed to work with AWS services, tackling common challenges in monitoring HPC environments:

Amazon Managed Grafana Setup

Amazon Managed Grafana offers a centralized way to visualize HPC metrics across clusters. To set it up effectively, integrate it with CloudWatch and use the following recommended settings:

Dashboard Panel	Metrics Source	Refresh Rate
Cluster Heatmap	AWS/EC2 CPUUtilization	30 seconds
Node Distribution	ClusterName dimension	1 minute
Resource Allocation	ParallelCluster logs	5 minutes

The Cluster Heatmap is especially useful for managing auto-scaling fleets. To get the most out of your dashboard, separate operational metrics from analytical ones for better performance.

NICE DCV Performance Tracking

For workloads that rely heavily on visualization, monitor these key metrics:

Metric	Target	Alert Condition
Frame Rate	≥30 FPS (4K)	Below 25 FPS for 5 minutes
Network Latency	<100ms	Above 150ms for 2 minutes
GPU Utilization	<85%	Above 90% for 10 minutes
Session Bandwidth	<50 Mbps/session	Above 60 Mbps for 15 minutes

Tracking these thresholds ensures smooth performance for visualization-heavy tasks.
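Each alert condition in the table combines a limit with a duration, so a single bad sample should not fire. A minimal sketch of that check, assuming one sample per minute and made-up readings:

```python
def sustained_breach(samples, is_breach, window):
    """True if the most recent `window` samples all breach the limit."""
    if len(samples) < window:
        return False
    return all(is_breach(value) for value in samples[-window:])


# One sample per minute (assumption); values are illustrative.
frame_rates = [31, 24, 23, 22, 24, 21]
latencies_ms = [90, 160]

# Frame rate below 25 FPS for 5 minutes.
fps_alert = sustained_breach(frame_rates, lambda fps: fps < 25, window=5)
# Network latency above 150 ms for 2 minutes.
latency_alert = sustained_breach(latencies_ms, lambda ms: ms > 150, window=2)
print(fps_alert, latency_alert)
```

In CloudWatch itself the same idea is expressed through an alarm's period and evaluation period count.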

EnginFrame Dashboard Configuration

NICE EnginFrame simplifies HPC job monitoring by integrating Lustre FS and scheduler metrics. Key features include:

Job queue status with color-coded priorities for quick assessment.
User resource analytics powered by CloudWatch for better insights.
Direct scheduler metrics from systems like Slurm or PBS for detailed tracking.

These features make it easier to manage and optimize HPC workloads effectively.

Summary

Monitoring HPC environments on AWS effectively involves combining AWS-native services with specialized tools, focusing on three main goals: visibility across accounts, automated compliance checks, and metrics tailored to workloads.

Monitoring Layer	Key Tools	Focus Areas
Infrastructure	CloudWatch, CloudTrail	Instance health, VPC flow logs
Application	ParallelCluster, EnginFrame	Job queue times, cluster utilization
Security	AWS Config, Security Hub	Compliance scores, encryption status

A CloudWatch-based monitoring setup offers broad visibility while keeping operational complexity low. By utilizing AWS's built-in integrations, organizations can monitor key performance indicators across their entire HPC setup.

This layered approach, which combines insights into infrastructure, workload-specific metrics, and automated security measures, helps ensure that HPC systems run efficiently while meeting performance and compliance standards.
