Running HPC workloads on AWS requires precise monitoring to handle unique challenges like large-scale auto-scaling, strict security standards, and specialized hardware metrics. Here's what you need to know:
- Key Benefits: Proper monitoring can reduce job failures by 40% and cut costs by 25%.
- Core Tools: Use AWS CloudWatch, CloudTrail, and Config for infrastructure monitoring, security compliance, and workload-specific metrics.
- Specialized Metrics: Track GPU usage, Lustre FS performance, and Elastic Fabric Adapter (EFA) network health.
- Advanced Tools: Integrate Amazon Managed Grafana for real-time dashboards and NICE DCV for visualization performance.
This guide covers everything from setting up basic monitoring to advanced tools, helping you optimize performance and ensure compliance for your HPC workloads.
Setting Up Basic Monitoring
To monitor HPC workloads effectively, you'll need to configure AWS's core observability tools to track compute resources, API activities, and network traffic.
AWS CloudWatch Setup for HPC
Set up CloudWatch with a 1-minute monitoring interval for compute nodes using AWS ParallelCluster's built-in agent [5].
Key metrics to configure:
Metric Category | Metrics to Track | Collection Interval |
---|---|---|
Compute | CPU/GPU Utilization, Instance States | 1 minute |
Storage | EBS Volume Read/Write Operations | 1 minute |
Network | NetworkIn/NetworkOut, NetworkPacketsOut (EFA packet drops) | 1 minute |
Custom | MPI Job Completion Rates | 1 minute |
For GPU metrics, use NVIDIA DCGM and a custom namespace (e.g., ClusterName/GPU) with 5-second granularity during peak usage [5].
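As a rough illustration of the custom-namespace approach, the sketch below builds the arguments for a CloudWatch PutMetricData call that stores one GPU sample as a high-resolution metric. The metric name GPUUtilization, the GpuIndex dimension, and the helper names are assumptions made for this example; reading the value from DCGM is outside its scope, and the publish step assumes boto3 is installed and credentials are configured.

```python
from datetime import datetime, timezone


def build_gpu_metric_request(cluster_name, instance_id, gpu_index, utilization_pct):
    """Build keyword arguments for CloudWatch PutMetricData for one GPU sample."""
    return {
        # Custom namespace in the ClusterName/GPU form described above.
        "Namespace": f"{cluster_name}/GPU",
        "MetricData": [
            {
                "MetricName": "GPUUtilization",  # assumed metric name
                "Dimensions": [
                    {"Name": "InstanceId", "Value": instance_id},
                    {"Name": "GpuIndex", "Value": str(gpu_index)},  # assumed dimension
                ],
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(utilization_pct),
                "Unit": "Percent",
                # StorageResolution=1 marks the metric as high resolution,
                # which is required for sub-minute (e.g. 5-second) samples.
                "StorageResolution": 1,
            }
        ],
    }


def publish(request):
    """Send the request to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3

    boto3.client("cloudwatch").put_metric_data(**request)
```

Calling the builder every 5 seconds from a small collector loop, and passing the result to publish, would give the granularity described above. High-resolution custom metrics are billed like other custom metrics, so it may be worth enabling this only during peak periods.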
AWS CloudTrail API Monitoring
Set up CloudTrail to log and track HPC API activities:
- Log File Validation: Enable SHA-256 hashing to ensure log integrity [2].
- Event Selection: Monitor critical events such as RunInstances and SubmitJob (AWS Batch) [4].
- Real-time Alerts: Send CloudTrail events to CloudWatch Logs and use metric filters with alarms to get near-real-time notifications of unauthorized actions [4].
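One way to wire up the alerting step is a CloudWatch Logs metric filter on the log group that receives CloudTrail events. The sketch below builds the arguments for the PutMetricFilter call and includes a small local check that mirrors the filter pattern. The filter name, metric name, and HPC/Security namespace are placeholders chosen for this example.

```python
# Filter pattern matching CloudTrail events that failed authorization.
UNAUTHORIZED_PATTERN = (
    '{ ($.errorCode = "*UnauthorizedOperation") || ($.errorCode = "AccessDenied*") }'
)


def build_metric_filter_request(log_group_name):
    """Build keyword arguments for CloudWatch Logs PutMetricFilter."""
    return {
        "logGroupName": log_group_name,
        "filterName": "hpc-unauthorized-api-calls",  # placeholder name
        "filterPattern": UNAUTHORIZED_PATTERN,
        "metricTransformations": [
            {
                "metricName": "UnauthorizedApiCalls",  # placeholder name
                "metricNamespace": "HPC/Security",  # placeholder namespace
                "metricValue": "1",  # count one per matching event
            }
        ],
    }


def is_unauthorized(event):
    """Local equivalent of the filter pattern, useful for testing sample events."""
    code = event.get("errorCode", "")
    return code.endswith("UnauthorizedOperation") or code.startswith("AccessDenied")
```

The resulting dictionary can be passed to the put_metric_filter method of a boto3 logs client, and a CloudWatch alarm on the emitted metric then provides the notification.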
Multi-Zone Monitoring Setup
Enable VPC Flow Logs for the subnets in every Availability Zone (AZ) your cluster uses, and filter on records with the REJECT action to surface traffic blocked by security groups or network ACLs [4].
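To show what filtering on rejected traffic can look like once the logs are delivered, here is a minimal sketch that parses flow log records and counts REJECT entries per destination port. It assumes the default (version 2) record format with 14 space-separated fields; custom formats would need a different field list.

```python
from collections import Counter

# Field order of the default VPC flow log record format (version 2).
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_flow_record(line):
    """Split one default-format flow log line into a field dictionary."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def rejected_by_destination_port(lines):
    """Count REJECT records per destination port."""
    counts = Counter()
    for line in lines:
        record = parse_flow_record(line)
        if record["action"] == "REJECT":
            counts[record["dstport"]] += 1
    return counts


sample = [
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
]
print(rejected_by_destination_port(sample))
```

A spike of rejects on a port that MPI or the scheduler relies on usually points at a security group that is too restrictive between compute nodes.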
For cross-account monitoring, follow these steps:
- Set up the service-linked role AWSServiceRoleForCloudWatchCrossAccount in the monitoring account [1].
- Define resource-based policies to allow cross-account role assumption [1].
Leverage CloudWatch cross-account observability for centralized monitoring of HPC workloads [1]. Pay special attention to Elastic Fabric Adapter (EFA) performance by tracking the NetworkPacketsOut metric [1].
These configurations lay the groundwork for monitoring HPC performance, which will be expanded upon in the next section.
HPC Performance Monitoring
After setting up the initial CloudWatch configurations, it's time to refine monitoring for specific workloads.
AWS ParallelCluster Metrics
Create CloudWatch dashboards to keep an eye on cluster-wide performance. Use these metrics and thresholds to ensure smooth operations:
Metric | Threshold | Alert Condition |
---|---|---|
ClusterUtilization | 80% | Monitors percentage usage |
ActiveJobs | Queue depth > nodes × 1.5 | Sustained for over 10 minutes |
NetworkThroughput | >500K packets/sec | Important for CFD workloads |
EFA NetworkLatency | >50μs | Key for molecular dynamics tasks |
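The ActiveJobs row combines a scaled threshold with a duration, which is easy to get wrong. The sketch below shows one possible reading of it: the condition fires only when every one of the last ten one-minute samples exceeds nodes × 1.5. The function name and the one-sample-per-minute assumption are illustrative, not taken from ParallelCluster.

```python
def queue_backlog_sustained(samples, node_count, factor=1.5, sustain_minutes=10):
    """Return True if queue depth stayed above node_count * factor.

    samples: per-minute queue depth readings, oldest first (assumed cadence).
    """
    threshold = node_count * factor
    if len(samples) < sustain_minutes:
        return False  # not enough history to call the backlog sustained
    recent = samples[-sustain_minutes:]
    return all(depth > threshold for depth in recent)


# 8 nodes give a threshold of 12 queued jobs.
print(queue_backlog_sustained([13] * 10, node_count=8))
print(queue_backlog_sustained([13] * 9 + [12], node_count=8))
```

Requiring every sample to breach avoids alerting on short bursts that the scheduler clears on its own.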
Instance-Specific Monitoring
Different instance types come with their own monitoring needs. For example, Graviton instances require close tracking of core usage. Specifically, for C7g (Graviton3) instances, keep an eye on Arm core utilization, ensuring it doesn't exceed an 85% sustained load threshold [6].
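A sustained-load threshold like this maps naturally onto a CloudWatch alarm that requires several consecutive breaching datapoints. The sketch below builds the arguments for PutMetricAlarm using the instance-wide AWS/EC2 CPUUtilization metric. The 15-minute window and the alarm name are assumptions for this example, and per-core figures would need the CloudWatch agent rather than this metric.

```python
def build_sustained_cpu_alarm(instance_id, threshold_pct=85.0, minutes=15):
    """Build keyword arguments for CloudWatch PutMetricAlarm."""
    return {
        "AlarmName": f"hpc-cpu-sustained-{instance_id}",  # placeholder name
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,  # 1-minute datapoints need detailed monitoring enabled
        # Every datapoint in the window must breach, which is what makes
        # the alarm react to sustained load rather than short spikes.
        "EvaluationPeriods": minutes,
        "DatapointsToAlarm": minutes,
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
    }


alarm = build_sustained_cpu_alarm("i-0123456789abcdef0")
```

The dictionary can be passed to the put_metric_alarm method of a boto3 cloudwatch client.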
Workload Metrics Configuration
Tailor your monitoring setup to match the demands of specific applications. Expand on the basic EFA monitoring to capture patterns unique to each workload:
- CFD Workloads: Set alerts for NetworkPpsRx exceeding 500K packets/sec and FSx Lustre BurstCreditBalance dropping below 20% [6].
- Molecular Dynamics (GROMACS):
  - Monitor MPI wait times (over 5ms) and EFA credits (below 10%).
  - Use Grafana to track CollectiveOpLatency [3].
- Weather Modeling (WRF):
  - Trigger alerts for sustained DiskQueueDepth values greater than 5.
  - Ensure compliance with NIST SP 800-223 by linking performance alerts to CloudTrail events [2].
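The per-workload thresholds above can be kept in one table and evaluated uniformly, which makes it harder for the alert definitions to drift apart. This is a minimal sketch under stated assumptions: NetworkPpsRx, BurstCreditBalance, and DiskQueueDepth come from the list above, while MpiWaitMs and EfaCreditsPct are hypothetical names standing in for the MPI wait time and EFA credit figures.

```python
# (metric name, direction, limit) per workload type.
WORKLOAD_RULES = {
    "cfd": [
        ("NetworkPpsRx", "above", 500_000),
        ("BurstCreditBalance", "below", 20),
    ],
    "gromacs": [
        ("MpiWaitMs", "above", 5),  # hypothetical metric name
        ("EfaCreditsPct", "below", 10),  # hypothetical metric name
    ],
    "wrf": [
        ("DiskQueueDepth", "above", 5),
    ],
}


def breached_rules(workload, metrics):
    """Return the names of metrics that violate the workload's thresholds."""
    breaches = []
    for name, direction, limit in WORKLOAD_RULES[workload]:
        if name not in metrics:
            continue  # metric not reported in this sample
        value = metrics[name]
        if direction == "above" and value > limit:
            breaches.append(name)
        elif direction == "below" and value < limit:
            breaches.append(name)
    return breaches


print(breached_rules("cfd", {"NetworkPpsRx": 620_000, "BurstCreditBalance": 35}))
```

The same table could also be used to generate CloudWatch alarm definitions, so the thresholds live in a single place.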
For visualization tasks using NICE DCV, set up alerts for frame rates dropping below 24 FPS and client latency exceeding 100ms [3]. These thresholds help maintain a seamless user experience.
Security and Compliance Monitoring
While performance monitoring focuses on optimizing workloads, security monitoring ensures your systems adhere to compliance standards. Tools like AWS Config, KMS encryption, and compliance alerts are key to building a secure monitoring framework.
AWS Config Compliance Rules
AWS Config continuously evaluates the security of your HPC environment. It automates compliance checks across your infrastructure, addressing both scale and security challenges:
Rule Name | Purpose | Compliance Control |
---|---|---|
cloudwatch-log-group-encrypted | Ensures monitoring data is encrypted | NIST AU-9 |
restricted-ssh | Flags security groups that allow unrestricted inbound SSH | NIST SC-7 |
ec2-volume-inuse-check | Checks that EBS volumes are attached to EC2 instances | - |
You can also create custom rules using Lambda functions to check ParallelCluster configurations against specific NIST SP 800-223 standards. These rules can monitor changes to instances, network setups, and access patterns.
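A custom rule of this kind is a Lambda function that receives a configuration item, decides on a compliance state, and reports it back with PutEvaluations. The sketch below checks only that an EC2 instance carries a cluster tag. The tag key is an assumption for illustration, and a real rule would encode whichever NIST SP 800-223 controls apply. The Config client is injectable so the logic can be exercised without AWS access.

```python
import json

REQUIRED_TAG = "parallelcluster:cluster-name"  # assumed tag key for this example


def evaluate_item(item):
    """Return the compliance state for one configuration item."""
    if item.get("resourceType") != "AWS::EC2::Instance":
        return "NOT_APPLICABLE"
    tags = item.get("tags") or {}
    return "COMPLIANT" if REQUIRED_TAG in tags else "NON_COMPLIANT"


def build_evaluation(item):
    """Build one entry for the Evaluations list of PutEvaluations."""
    return {
        "ComplianceResourceType": item["resourceType"],
        "ComplianceResourceId": item["resourceId"],
        "ComplianceType": evaluate_item(item),
        "OrderingTimestamp": item["configurationItemCaptureTime"],
    }


def lambda_handler(event, context, config_client=None):
    """Entry point for a change-triggered custom Config rule."""
    invoking_event = json.loads(event["invokingEvent"])
    evaluation = build_evaluation(invoking_event["configurationItem"])
    if config_client is None:
        import boto3  # imported lazily so the logic is testable offline

        config_client = boto3.client("config")
    config_client.put_evaluations(
        Evaluations=[evaluation], ResultToken=event["resultToken"]
    )
    return evaluation
```

Keeping the decision in evaluate_item, separate from the AWS calls, makes it straightforward to add further checks on network setup or access patterns.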
Log Encryption with AWS KMS
All monitoring data should be encrypted using AWS KMS. Set up a dedicated KMS key specifically for HPC monitoring and define service access policies accordingly. For CloudWatch log groups, enable encryption and configure retention periods based on data type:
- Operational metrics: Retain for 90 days
- Security logs: Retain for 1 year
- Compliance data: Retain for 7 years (use S3 Glacier for long-term storage)
If you're monitoring across multiple accounts, ensure the KMS key policies allow the necessary roles to decrypt data.
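The retention tiers and the KMS association can be applied together per log group. The following sketch maps each data class to a retention value that CloudWatch Logs accepts (2557 days is the allowed value closest to 7 years) and builds the arguments for PutRetentionPolicy and AssociateKmsKey. The data class labels and helper names are assumptions for this example, and archiving to S3 Glacier is not covered here.

```python
# Retention values accepted by CloudWatch Logs PutRetentionPolicy.
ALLOWED_RETENTION_DAYS = {
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
}

# Data classes from the list above mapped to retention in days.
RETENTION_DAYS = {
    "operational": 90,
    "security": 365,
    "compliance": 2557,  # closest allowed value to 7 years
}


def build_log_group_settings(log_group_name, data_class, kms_key_arn):
    """Build arguments for PutRetentionPolicy and AssociateKmsKey."""
    days = RETENTION_DAYS[data_class]
    if days not in ALLOWED_RETENTION_DAYS:
        raise ValueError(f"{days} is not a valid retention period")
    return {
        "retention": {"logGroupName": log_group_name, "retentionInDays": days},
        "encryption": {"logGroupName": log_group_name, "kmsKeyId": kms_key_arn},
    }


def apply_settings(settings, logs_client):
    """Apply both settings using a boto3 CloudWatch Logs client."""
    logs_client.put_retention_policy(**settings["retention"])
    logs_client.associate_kms_key(**settings["encryption"])
```

The KMS key policy must allow the CloudWatch Logs service principal to use the key, otherwise the association call fails.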
Setting Up Compliance Alerts
Set up CloudWatch alarms to monitor for issues like unauthorized API calls (via CloudTrail), Config rule violations, KMS access problems, and unauthorized instance changes.
Use EventBridge rules paired with Lambda functions to automate responses. For example, you can enable encryption on newly created log groups or quarantine non-compliant instances using AWS Systems Manager.
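For the log group encryption case, the automation can be sketched as an EventBridge event pattern that matches CreateLogGroup calls recorded by CloudTrail, plus a handler that associates a KMS key with the new group. The handler name and the way the key ARN is passed in are assumptions for this example; the Logs client is a parameter so the logic can be tested with a stub.

```python
# EventBridge pattern for CreateLogGroup calls delivered via CloudTrail.
EVENT_PATTERN = {
    "source": ["aws.logs"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["logs.amazonaws.com"],
        "eventName": ["CreateLogGroup"],
    },
}


def handle_log_group_created(event, kms_key_arn, logs_client):
    """Encrypt a newly created log group unless it already has a key.

    Returns the log group name if a key was associated, else None.
    """
    detail = event.get("detail", {})
    if detail.get("eventName") != "CreateLogGroup":
        return None
    params = detail.get("requestParameters") or {}
    if params.get("kmsKeyId"):
        return None  # the group was created with encryption already
    name = params["logGroupName"]
    logs_client.associate_kms_key(logGroupName=name, kmsKeyId=kms_key_arn)
    return name
```

In a Lambda function, the key ARN would typically come from an environment variable and the client from boto3.client("logs").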
These measures not only strengthen security but also provide detailed audit trails that enhance both security and operational visibility.
Additional Monitoring Tools
These tools are designed to work with AWS services, tackling common challenges in monitoring HPC environments:
Amazon Managed Grafana Setup
Amazon Managed Grafana offers a centralized way to visualize HPC metrics across clusters. To set it up effectively, integrate it with CloudWatch and use the following recommended settings:
Dashboard Panel | Metrics Source | Refresh Rate |
---|---|---|
Cluster Heatmap | AWS/EC2 CPUUtilization | 30 seconds |
Node Distribution | ClusterName dimension | 1 minute |
Resource Allocation | ParallelCluster logs | 5 minutes |
The Cluster Heatmap is especially useful for managing auto-scaling fleets. To get the most out of your dashboard, separate operational metrics from analytical ones for better performance.
NICE DCV Performance Tracking
For workloads that rely heavily on visualization, monitor these key metrics:
Metric | Target | Alert Condition |
---|---|---|
Frame Rate | ≥30 FPS (4K) | Below 25 FPS for 5 minutes |
Network Latency | <100ms | Above 150ms for 2 minutes |
GPU Utilization | <85% | Above 90% for 10 minutes |
Session Bandwidth | <50 Mbps/session | Above 60 Mbps for 15 minutes |
Tracking these thresholds ensures smooth performance for visualization-heavy tasks.
EnginFrame Dashboard Configuration
EnginFrame simplifies HPC job monitoring by integrating Lustre FS and scheduler metrics. Key features include:
- Job queue status with color-coded priorities for quick assessment.
- User resource analytics powered by CloudWatch for better insights.
- Direct scheduler metrics from systems like Slurm or PBS for detailed tracking.
These features make it easier to manage and optimize HPC workloads effectively.
Summary
Monitoring HPC environments on AWS effectively involves combining AWS-native services with specialized tools, focusing on three main goals: visibility across accounts, automated compliance checks, and metrics tailored to workloads.
Monitoring Layer | Key Tools | Focus Areas |
---|---|---|
Infrastructure | CloudWatch, CloudTrail | Instance health, VPC flow logs |
Application | ParallelCluster, EnginFrame | Job queue times, cluster utilization |
Security | AWS Config, Security Hub | Compliance scores, encryption status |
A CloudWatch-based monitoring setup offers broad visibility while keeping operational complexity low. By utilizing AWS's built-in integrations, organizations can monitor key performance indicators across their entire HPC setup.
This layered approach - combining insights into infrastructure, workload-specific metrics, and automated security measures - helps ensure that HPC systems run efficiently while meeting performance and compliance standards.