AWS Fault Injection Service: Simulate Outages, Test Resilience

published on 09 July 2024

AWS Fault Injection Service (FIS) helps you test your AWS systems' resilience by simulating controlled disruptions. Here's what you need to know:

  • Purpose: Find and fix weak spots in your applications before real outages occur
  • Key features:
    • Simulate resource disruptions, API issues, network problems, and resource stress
    • Target specific parts of your system
    • Work with other AWS tools like CloudWatch and IAM

To get started:

  1. Set up your AWS account and IAM permissions
  2. Create test resources (VPC, EC2 instances, RDS databases)
  3. Make an experiment template
  4. Run your first test
  5. Analyze results and improve your system

| Test Scenario | What It Does | Why It's Useful |
|---|---|---|
| EC2 instance failure | Stops or removes an EC2 instance | Shows how your app handles sudden server loss |
| Network issues | Slows down or cuts off network connections | Checks if your app works with poor internet |
| CPU overload | Makes the CPU very busy | Tests if your app can handle high demand |
| Storage problems | Simulates disk errors or data access issues | Checks how your app deals with data storage problems |

Remember to start small, test regularly, and use the results to make your applications stronger and more reliable.

2. Before you start

2.1 Setting up your AWS account

To use AWS Fault Injection Service (AWS FIS), you need:

  • An active AWS account
  • The right permissions to use AWS FIS

2.2 Required IAM permissions

To use AWS FIS, you need specific IAM permissions:

| Permission Type | Description |
|---|---|
| IAM role | Grants AWS FIS permission to run experiments |
| IAM policy | Allows modification of resources specified in your experiment template |
| Service-linked role | Named AWSServiceRoleForFIS, manages monitoring and resource selection |

For more details on multi-account experiment permissions, check the AWS documentation.

2.3 Basic AWS knowledge needed

Before using AWS FIS, you should know:

  • Basic AWS services like EC2, ECS, EKS, RDS, and SSM
  • How IAM roles and permissions work

If you're new to AWS, start with the basics before using AWS FIS.

3. AWS Fault Injection Service basics

3.1 Key terms and concepts

AWS Fault Injection Service (AWS FIS) lets you test how your AWS systems handle problems. It's based on chaos engineering, which means creating controlled disruptions to see how your system responds. This helps you find weak spots and fix them.

3.2 Types of fault injections

AWS FIS offers several ways to test your system:

| Fault Type | Description |
|---|---|
| Resource disruption | Stopping or terminating EC2 instances or RDS databases |
| API issues | Forcing failovers or slowing down API calls |
| Network problems | Adding delays or dropping packets in network traffic |
| Resource stress | Putting pressure on CPU or memory |

You can target these tests at specific parts of your system, like certain EC2 instances, RDS databases, or entire Availability Zones.

3.3 Why use AWS FIS?

AWS FIS helps you:

  • Test your system's ability to handle problems
  • Find weak spots before they cause real issues
  • Make your applications more reliable
  • Work with other AWS tools like CloudWatch and IAM for better testing

4. Preparing your AWS environment

4.1 Creating test resources

Before testing with AWS FIS, set up:

| Resource | Purpose |
|---|---|
| AWS account | Main access point |
| VPC | Network for your resources |
| EC2 instances or RDS databases | Targets for your tests |

Follow the AWS guide to create a default VPC and EC2 instances.

4.2 Setting up IAM roles

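AWS FIS relies on two roles. You create an experiment role that trusts fis.amazonaws.com and carries the permissions for the actions in your template. AWS FIS creates the service-linked role for you the first time you start an experiment:

| Role Name | Trusted Service | Policy |
|---|---|---|
| AWSServiceRoleForFIS | fis.amazonaws.com | AmazonFISServiceRolePolicy |

The experiment role lets AWS FIS run tests on your resources. The service-linked role covers monitoring and resource selection.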
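For a role that AWS FIS assumes during an experiment, the trust relationship has to name the fis.amazonaws.com service principal. A minimal sketch of that trust policy document follows. The function name is illustrative, and the permissions policy you attach to the role still depends on the actions in your template.

```python
import json


def fis_trust_policy() -> dict:
    """Return a trust policy that lets the AWS FIS service assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "fis.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Print the document so it can be pasted into the IAM console or saved to a file.
    print(json.dumps(fis_trust_policy(), indent=2))
```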

4.3 Creating CloudWatch alarms

Set up CloudWatch alarms to watch your resources during tests:

| Metric to Monitor | Why It's Important |
|---|---|
| CPU use | Shows how busy your systems are |
| Memory use | Indicates if your systems have enough memory |
| Network traffic | Helps spot unusual activity |

These alarms help you see how your system responds to the tests.
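As a rough sketch, a CPU alarm for one EC2 instance could be defined as below. The alarm name, the 80% threshold, and the two-minute evaluation window are assumptions chosen for illustration. EC2 publishes CPUUtilization by default, while memory metrics require the CloudWatch agent. The call to CloudWatch uses boto3, a third-party package that is imported only when the alarm is actually created.

```python
def cpu_alarm_params(instance_id: str, threshold_percent: float = 80.0) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm (illustrative values)."""
    return {
        "AlarmName": f"fis-high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,            # seconds per datapoint
        "EvaluationPeriods": 2,  # alarm after two consecutive breaching datapoints
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_alarm(params: dict) -> None:
    """Create the alarm; needs boto3 and AWS credentials."""
    import boto3  # third-party, imported lazily so the builder works without it

    boto3.client("cloudwatch").put_metric_alarm(**params)
```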

5. Making an experiment template

5.1 Opening the AWS FIS console

To create an experiment template:

  1. Go to the AWS FIS console: https://console.aws.amazon.com/fis/
  2. Click on Experiment templates in the menu

5.2 Setting up experiment actions

Actions are the tests AWS FIS runs on your resources. To add actions:

  1. Click Add action
  2. Name your action
  3. Pick the action type
  4. Set the action details

| Action Example | Duration | Purpose |
|---|---|---|
| Network disruption | 2 minutes | Test system response to connection loss |
| EC2 instance stop | 5 minutes | Check recovery from sudden instance failure |
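In a template, the two example actions above might look roughly like this. The sketch assumes the aws:ec2:stop-instances and aws:network:disrupt-connectivity action IDs and their parameter names as I understand them from the AWS FIS actions reference, so check them there before relying on this. The helper names are hypothetical. Durations use the ISO 8601 form, such as PT5M.

```python
def iso_minutes(minutes: int) -> str:
    """Format a whole number of minutes as an ISO 8601 duration, e.g. PT5M."""
    if minutes < 1:
        raise ValueError("duration must be at least one minute")
    return f"PT{minutes}M"


def stop_instances_action(target_name: str, restart_after_minutes: int) -> dict:
    """Stop EC2 instances, then start them again after the given delay."""
    return {
        "actionId": "aws:ec2:stop-instances",
        "parameters": {"startInstancesAfterDuration": iso_minutes(restart_after_minutes)},
        "targets": {"Instances": target_name},
    }


def disrupt_connectivity_action(target_name: str, minutes: int) -> dict:
    """Block network traffic for the targeted subnets for the given time."""
    return {
        "actionId": "aws:network:disrupt-connectivity",
        "parameters": {"duration": iso_minutes(minutes), "scope": "all"},
        "targets": {"Subnets": target_name},
    }
```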

5.3 Choosing targets

Targets are the resources you want to test. To set targets:

  1. Click Edit on the auto-created target
  2. Pick the resource type (e.g., EC2, RDS)
  3. Choose how to select the target (e.g., by tag, by ID)

| Target Type | Selection Method | Example |
|---|---|---|
| EC2 instance | By tag | All instances tagged "Test" |
| RDS database | By ID | Specific database "prod-db-1" |
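The two selection methods correspond to two shapes of target definition in a template. The sketch below assumes the resourceTags and resourceArns fields of the experiment template format. The tag key, region, and account ID in the usage example are placeholders, not values from this guide.

```python
def target_by_tag(resource_type: str, key: str, value: str,
                  selection_mode: str = "ALL") -> dict:
    """Select every resource of the given type that carries the tag."""
    return {
        "resourceType": resource_type,
        "resourceTags": {key: value},
        "selectionMode": selection_mode,
    }


def target_by_arn(resource_type: str, arns: list) -> dict:
    """Select specific resources by their ARNs."""
    return {
        "resourceType": resource_type,
        "resourceArns": list(arns),
        "selectionMode": "ALL",
    }


# Example usage with placeholder values.
test_instances = target_by_tag("aws:ec2:instance", "Environment", "Test")
prod_db = target_by_arn(
    "aws:rds:db", ["arn:aws:rds:us-east-1:111122223333:db:prod-db-1"]
)
```

A selection mode such as COUNT(1) limits the action to one randomly chosen resource from the matching set, which helps keep early experiments small.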

5.4 Adding stop conditions

Stop conditions end the test if something goes wrong. To add a stop condition:

  1. Click Add stop condition
  2. Pick a CloudWatch alarm you made earlier
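In the template itself, a stop condition is a small entry that points at the alarm's ARN. A minimal sketch, assuming the aws:cloudwatch:alarm source name used by the template format. The ARN check is a simple sanity test, not full validation, and the ARN in the test data is a placeholder.

```python
# Entry to use when an experiment deliberately has no stop condition.
NO_STOP_CONDITION = {"source": "none"}


def alarm_stop_condition(alarm_arn: str) -> dict:
    """Build a stop condition that halts the experiment when the alarm fires."""
    if ":cloudwatch:" not in alarm_arn or ":alarm:" not in alarm_arn:
        raise ValueError(f"not a CloudWatch alarm ARN: {alarm_arn}")
    return {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
```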

5.5 Linking IAM roles

Link an IAM role to let AWS FIS run the test:

  1. Choose Use an existing IAM role
  2. Pick the IAM role you made for AWS FIS
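The console steps in this section produce a single template document. As a sketch of how the pieces fit together outside the console, the code below assembles one and submits it with boto3's create_experiment_template. The function names are hypothetical, and boto3 is a third-party package imported only when no client is passed in.

```python
def build_template(description: str, role_arn: str, targets: dict,
                   actions: dict, stop_conditions: list = None) -> dict:
    """Combine targets, actions, stop conditions, and role into one template."""
    return {
        "description": description,
        "roleArn": role_arn,
        "targets": targets,
        "actions": actions,
        # Fall back to an explicit "none" entry when no alarm is supplied.
        "stopConditions": stop_conditions or [{"source": "none"}],
    }


def create_template(template: dict, client=None) -> str:
    """Create the template in AWS FIS and return its ID."""
    if client is None:
        import boto3  # third-party, needs AWS credentials

        client = boto3.client("fis")
    response = client.create_experiment_template(**template)
    return response["experimentTemplate"]["id"]
```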

6. Running your first test

6.1 Starting the experiment

To run your first test:

  1. Go to the AWS FIS console
  2. Select your experiment template
  3. Click Start experiment
  4. Enter a unique client token

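The client token is an idempotency token: if the same start request is retried with the same token, AWS FIS does not start a second experiment. This prevents accidental duplicate runs.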
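When starting an experiment from code, the client token is passed along with the template ID. A minimal sketch using boto3's start_experiment call, with a random UUID as the token. The function names are hypothetical, and boto3 is imported only if no client is supplied.

```python
import uuid


def start_experiment_request(template_id: str, client_token: str = None) -> dict:
    """Build the request, generating a unique client token if none is given."""
    return {
        "experimentTemplateId": template_id,
        "clientToken": client_token or str(uuid.uuid4()),
    }


def start_experiment(template_id: str, client=None) -> str:
    """Start an experiment from a template and return the experiment ID."""
    if client is None:
        import boto3  # third-party, needs AWS credentials

        client = boto3.client("fis")
    response = client.start_experiment(**start_experiment_request(template_id))
    return response["experiment"]["id"]
```

To retry a failed request safely, reuse the same token rather than generating a new one.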

6.2 Watching the test progress

During the test:

  • Use the AWS FIS console to track progress
  • See which actions are happening
  • Check which targets are being affected

6.3 Understanding the results

After the test ends:

| Step | Action |
|---|---|
| 1 | Look at the experiment report |
| 2 | Check which actions were done |
| 3 | See which targets were affected |
| 4 | Note any errors that happened |

Use CloudWatch metrics and logs to get more details about how your system behaved during the test.
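The same progress information is available through the API. The sketch below polls get_experiment until the experiment reaches a final status. The set of final statuses is my assumption and should be checked against the API reference. The client argument is expected to be a boto3 FIS client or anything with the same get_experiment method.

```python
import time

# Statuses treated as final in this sketch (an assumption, not an exhaustive list).
TERMINAL_STATUSES = {"completed", "stopped", "failed"}


def wait_for_experiment(client, experiment_id: str,
                        poll_seconds: float = 15.0, max_polls: int = 240) -> dict:
    """Poll until the experiment finishes and return its final state dict."""
    for _ in range(max_polls):
        state = client.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] in TERMINAL_STATUSES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"experiment {experiment_id} did not finish in time")
```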

7. Common test scenarios

Here are some basic test scenarios you can use with AWS Fault Injection Service to check how well your applications handle problems:

7.1 EC2 instance failure

Test what happens when an EC2 instance stops working by turning it off or removing it. This helps you see how your application deals with sudden instance problems.

7.2 Network issues

Check how your application handles network problems like slow connections or no connection at all. This test shows if your application can work when the network isn't perfect.

7.3 CPU overload

See how your application performs when the CPU is very busy. This test helps you understand if your application can handle lots of work or many users at once.
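One way to produce CPU load is the aws:ssm:send-command action together with the AWSFIS-Run-CPU-Stress SSM document. The sketch below builds such an action. It assumes the instances run the SSM Agent, and the parameter names reflect my reading of the AWS FIS documentation, so verify them before use. The function name is hypothetical.

```python
import json


def cpu_stress_action(target_name: str, region: str, minutes: int) -> dict:
    """Build an action that runs the CPU stress SSM document on the targets."""
    if minutes < 1:
        raise ValueError("duration must be at least one minute")
    return {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            "documentArn": f"arn:aws:ssm:{region}::document/AWSFIS-Run-CPU-Stress",
            # Document parameters are passed as a JSON string.
            "documentParameters": json.dumps({"DurationSeconds": str(minutes * 60)}),
            "duration": f"PT{minutes}M",
        },
        "targets": {"Instances": target_name},
    }
```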

7.4 Storage problems

Test how your application reacts when storage doesn't work right. This could be disk errors or not being able to read or write data.

| Test Scenario | What It Does | Why It's Useful |
|---|---|---|
| EC2 instance failure | Stops or removes an EC2 instance | Shows how your app handles sudden server loss |
| Network issues | Slows down or cuts off network connections | Checks if your app works with poor internet |
| CPU overload | Makes the CPU very busy | Tests if your app can handle high demand |
| Storage problems | Simulates disk errors or data access issues | Checks how your app deals with data storage problems |

These tests help you find weak spots in your application before they cause real problems for users.

8. Tips for effective testing

8.1 Creating useful experiments

When making tests with AWS Fault Injection Service:

  1. Model experiments on real-world failures
  2. Set clear goals for your system
  3. Form a hypothesis about how your system will react before you run the test

| Test Example | What It Does | Why It's Useful |
|---|---|---|
| EC2 instance stops | Turns off a server | Shows how your app handles server loss |
| Network slows down | Degrades the network connection | Checks if your app works over a poor network |

8.2 Keeping tests safe

To run safe tests:

  • Use a test environment, not your live system
  • Have a plan to undo changes if needed
  • Start small and grow your tests slowly

8.3 Regular testing and updates

To get the most from AWS Fault Injection Service:

| Action | Frequency | Purpose |
|---|---|---|
| Run tests | Every new release | Find problems early |
| Update test plans | When system changes | Keep tests useful |
| Review results | After each test | Learn and improve |

9. Automating your tests

9.1 Using AWS CLI for experiments

You can use the AWS Command Line Interface (CLI) to run tests with AWS Fault Injection Service (FIS). This helps you add testing to your development process.

To create a test template with AWS CLI:

  1. Make a JSON file with your test details
  2. Use the aws fis create-experiment-template command
  3. Start the test with aws fis start-experiment

Here's an example of a JSON file for a test template:

{
  "actions": {
    "stop-instance": {
      "actionId": "aws:ec2:stop-instances",
      "description": "Stop an EC2 instance",
      "targets": {
        "Instances": "EC2InstancesTarget"
      },
      "parameters": {}
    }
  },
  "description": "Test EC2 instance failure",
  "roleArn": "arn:aws:iam::<ACCOUNT_ID>:role/<ROLE_NAME>",
  "stopConditions": [
    {
      "source": "none"
    }
  ],
  "targets": {
    "EC2InstancesTarget": {
      "resourceType": "aws:ec2:instance",
      "selectionMode": "COUNT(1)",
      "resourceTags": {
        "your-key": "your-value"
      }
    }
  }
}

Replace <ACCOUNT_ID>, <ROLE_NAME>, your-key, and your-value with your own details.
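Before submitting the file, it can help to check that the placeholders were replaced and that every action points at a target the template defines. The sketch below is a hypothetical pre-flight check, not part of AWS FIS. It inspects only the fields shown in the example above.

```python
import json

# Placeholder strings from the example template that must not survive.
PLACEHOLDERS = ("<ACCOUNT_ID>", "<ROLE_NAME>", "your-key", "your-value")


def template_problems(template_json: str) -> list:
    """Return a list of human-readable problems found in a template file."""
    problems = [f"placeholder {p} was not replaced"
                for p in PLACEHOLDERS if p in template_json]
    template = json.loads(template_json)
    defined_targets = set(template.get("targets", {}))
    for name, action in template.get("actions", {}).items():
        for target_name in action.get("targets", {}).values():
            if target_name not in defined_targets:
                problems.append(
                    f"action {name} references undefined target {target_name}"
                )
    return problems
```

An empty list means the check found nothing. If the template is saved as template.json, the AWS CLI can read it with aws fis create-experiment-template --cli-input-json file://template.json.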

9.2 Adding tests to CI/CD pipelines

You can add FIS tests to your CI/CD pipeline. This runs tests every time you update your code.

For example, use AWS CodePipeline to:

  1. Detect code changes
  2. Run your normal tests
  3. Run FIS tests
  4. Deploy if all tests pass

This helps catch problems early.
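A pipeline stage needs a pass or fail signal from the experiment. One simple convention, sketched below as an assumption rather than a CodePipeline feature, is to turn the experiment's final state into a process exit code. Only a completed experiment passes. A stopped experiment usually means a stop condition fired, so it is treated as a failure.

```python
def gate_exit_code(final_state: dict) -> int:
    """Map an experiment's final state dict to a CI exit code (0 = pass)."""
    status = final_state.get("status", "unknown")
    if status == "completed":
        return 0
    # Surface the reason, if any, so the pipeline log explains the failure.
    reason = final_state.get("reason", "no reason given")
    print(f"FIS experiment ended with status {status}: {reason}")
    return 1
```

A build script would pass the result to sys.exit so that a failed experiment blocks the deployment step.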

9.3 Scheduling regular tests

Running tests often helps keep your system strong. You can set up FIS to run tests on a schedule.

| Scheduling Option | How to Set It Up | Benefits |
|---|---|---|
| Daily tests | Use an Amazon EventBridge schedule (EventBridge was formerly CloudWatch Events) | Catch daily issues |
| Weekly tests | Use an Amazon EventBridge schedule | Find less common problems |
| After major changes | Add to your deployment process | Test new code right away |

Regular testing helps you find and fix issues before they affect users.
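A daily or weekly schedule can be expressed as a rate expression on a scheduled rule. The sketch below builds the arguments for the put_rule call of the boto3 events client. The rule name is a placeholder, and the rule still needs a target that starts the experiment, such as a Lambda function you write yourself, which is not shown here.

```python
# Rate expressions for the two cadences in the table above.
SCHEDULES = {"daily": "rate(1 day)", "weekly": "rate(7 days)"}


def schedule_rule_params(name: str, cadence: str) -> dict:
    """Build keyword arguments for put_rule on the boto3 events client."""
    if cadence not in SCHEDULES:
        raise ValueError(f"unknown cadence: {cadence}")
    return {
        "Name": name,
        "ScheduleExpression": SCHEDULES[cadence],
        "State": "ENABLED",
    }


def create_rule(params: dict) -> str:
    """Create the rule and return its ARN; needs boto3 and AWS credentials."""
    import boto3  # third-party, imported lazily

    return boto3.client("events").put_rule(**params)["RuleArn"]
```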

10. Improving system reliability

10.1 Reading test results

After running a test with AWS FIS, you'll get a report showing how your system handled the simulated outage. To understand these results:

  1. Look at key numbers like error rates and response times
  2. Check how your system used resources during the test
  3. Note any parts that didn't work well or failed
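Error rate and response-time percentiles are simple to compute once you have exported request data for the test window. A small sketch under two assumptions: HTTP status codes of 500 and above count as errors, and percentiles use the nearest-rank method. The sample numbers in the usage lines are made up.

```python
import math


def error_rate(status_codes: list) -> float:
    """Fraction of requests that returned a server error (status >= 500)."""
    if not status_codes:
        raise ValueError("no requests recorded")
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95 latency."""
    if not values:
        raise ValueError("no samples recorded")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


# Example usage with made-up samples from a test window.
rate = error_rate([200] * 8 + [500, 503])
p95_ms = percentile([120, 80, 100, 400, 90], 95)
```

Comparing these numbers with a baseline taken before the experiment shows how much the fault degraded the system.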

10.2 Finding weak spots

By looking closely at your test results, you can spot areas in your system that need work. Use this table to help identify problems:

| What to Look For | Why It Matters |
|---|---|
| Parts that didn't recover | These could cause long outages |
| Overloaded resources | May lead to slow performance or crashes |
| High error rates | Could mean poor user experience |
| Slow response times | Might frustrate users or cause timeouts |

10.3 Making system improvements

Once you know where the problems are, you can fix them. Here's how to make your system stronger:

| Improvement | How It Helps |
|---|---|
| Add backup systems | Keeps things running if one part fails |
| Spread out the workload | Stops any one part from getting too busy |
| Update your design | Makes your system better at handling problems |
| Test often | Helps you catch and fix issues early |

11. Fixing common problems

11.1 When experiments fail

Sometimes, AWS FIS experiments don't work as planned. Here's how to fix common issues:

  1. Check the AWS FIS console for error messages
  2. Look over your experiment template for mistakes
  3. Make sure you have the right permissions

11.2 Unexpected system reactions

Your system might behave unexpectedly during a test. To limit the impact:

| Action | Purpose |
|---|---|
| Watch system performance | Spot problems early |
| Use safety measures such as stop conditions | Stop small issues from getting bigger |
| Test how your system handles failures | Find weak spots |

11.3 Permission issues

Not having the right permissions is a common problem. To fix this:

  • Give the right permissions to IAM users and roles
  • Let AWS FIS run tests for you
  • Use service-linked roles to make managing permissions easier

| Permission Type | What It Does |
|---|---|
| Identity-based policies | Control what users and roles can do |
| AWS FIS permissions | Allow AWS FIS to run tests |
| Service-linked roles | Make it easier to manage permissions |
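As an illustration of an identity-based policy, the sketch below builds a policy document that lets a user or role run and inspect experiments. The choice of actions and the wildcard resource are assumptions for a test account. Narrow the resource list and add only the permissions your team needs.

```python
def fis_operator_policy() -> dict:
    """Identity-based policy for running and inspecting AWS FIS experiments."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RunAndInspectExperiments",
                "Effect": "Allow",
                "Action": [
                    "fis:StartExperiment",
                    "fis:StopExperiment",
                    "fis:GetExperiment",
                    "fis:ListExperiments",
                    "fis:GetExperimentTemplate",
                    "fis:ListExperimentTemplates",
                ],
                "Resource": "*",  # broad on purpose for a sandbox; tighten for real use
            }
        ],
    }
```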

12. Wrap-up

12.1 Key points to remember

This guide showed you how to use AWS Fault Injection Service (FIS) to test your applications for outages. Here's what to keep in mind:

| Key Point | Description |
|---|---|
| Start small | Begin with simple tests and slowly make them harder |
| Watch closely | Look at how your system acts during and after tests |
| Find weak spots | Use test results to see where your system needs work |
| Test often | Set up automatic tests to check your system regularly |

12.2 Next steps

Now that you know how to use AWS FIS, it's time to put it to work:

  1. Set up your first test using the steps in this guide
  2. Run the test and look at the results
  3. Fix any problems you find
  4. Make your tests harder over time
  5. Keep testing and fixing to make your applications stronger
