Scalability, cost-effectiveness, and resilience are essential for contemporary cloud-native applications. Although up to 90% less expensive than On-Demand instances, AWS Spot Instances pose a risk to workload availability due to their transient nature. This is where resilience testing with AWS Fault Injection Simulator (FIS) and clever automation with EC2 Auto Scaling come into play.
In this article, we’ll look at how to use EC2 Auto Scaling to automate Spot Instance utilization and how to use AWS FIS to test your system’s fault tolerance and replicate real-world failures.
Why Spot Instances?
Spot Instances allow you to take advantage of unused Amazon EC2 capacity at reduced prices. However, if AWS needs the capacity back, it can interrupt your instances with only a two-minute warning.
Ideal Use Cases for Spot Instances:
- Batch processing jobs
- CI/CD workloads
- Containerized workloads (e.g., Kubernetes, ECS)
- Fault-tolerant microservices
To effectively leverage Spot Instances in production, you need to automate their provisioning and ensure graceful fallback to On-Demand capacity during interruptions.
Automating Spot Instances with EC2 Auto Scaling
Amazon EC2 Auto Scaling automatically adjusts the number of instances in your application’s fleet based on demand, health checks, or schedules. By configuring a mixed instance policy, you can blend Spot and On-Demand capacity to optimize cost and availability.
Step 1: Define a Launch Template
Create an EC2 Launch Template that includes:
- Instance type(s)
- AMI ID
- Key pair
- Security group
- User data (for bootstrapping)
This template forms the blueprint for launching EC2 instances in your Auto Scaling Group (ASG).
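The fields listed above map onto the LaunchTemplateData structure accepted by the EC2 API. The following is a minimal sketch in Python, assuming boto3 is available for the actual API call; the function names, AMI ID, key pair name, and security group ID are illustrative placeholders rather than values from a real account.

```python
import base64


def build_launch_template_data(ami_id, instance_type, key_name,
                               security_group_ids, user_data_script):
    """Return a LaunchTemplateData dict for ec2.create_launch_template."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "SecurityGroupIds": list(security_group_ids),
        # The launch template API expects user data as base64-encoded text.
        "UserData": base64.b64encode(
            user_data_script.encode("utf-8")).decode("ascii"),
    }


def create_launch_template(name, template_data):
    """Create the template in AWS; needs boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above works without it

    ec2 = boto3.client("ec2")
    return ec2.create_launch_template(
        LaunchTemplateName=name,
        LaunchTemplateData=template_data,
    )


# Placeholder values for illustration only.
template_data = build_launch_template_data(
    ami_id="ami-0123456789abcdef0",
    instance_type="m5.large",
    key_name="example-key",
    security_group_ids=["sg-0123456789abcdef0"],
    user_data_script="#!/bin/bash\necho bootstrapping\n",
)
```

Keeping the builder separate from the API call makes the template contents easy to review or unit test before anything is created in an account.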
Step 2: Create an Auto Scaling Group with Mixed Instances
In your ASG configuration, select a Mixed Instances Policy. This allows you to set preferences like:
- Spot allocation strategy: e.g., capacity-optimized or lowest-price
- On-Demand base capacity: Minimum number of On-Demand instances to always have
- Percentage split: Define how much of your fleet should be Spot vs On-Demand
- Instance pools: Provide flexibility across multiple instance types and availability zones
Example:
"MixedInstancesPolicy": {
  "LaunchTemplate": {
    "LaunchTemplateSpecification": {
      "LaunchTemplateId": "lt-0abcd1234",
      "Version": "$Latest"
    }
  },
  "InstancesDistribution": {
    "OnDemandPercentageAboveBaseCapacity": 30,
    "SpotAllocationStrategy": "capacity-optimized"
  }
}
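To make the distribution settings concrete, here is a rough Python sketch that estimates the On-Demand/Spot split for a given desired capacity and assembles the arguments for boto3's create_auto_scaling_group call. The group name, subnet IDs, and instance types are hypothetical, and rounding the On-Demand share up is an assumption; check the Auto Scaling documentation for the exact rounding behavior.

```python
import math


def estimate_split(desired_capacity, on_demand_base, on_demand_pct_above_base):
    """Estimate (on_demand, spot) instance counts for a mixed policy.

    Assumption: the On-Demand share above the base is rounded up.
    """
    base = min(on_demand_base, desired_capacity)
    above_base = desired_capacity - base
    extra_on_demand = math.ceil(above_base * on_demand_pct_above_base / 100)
    on_demand = base + extra_on_demand
    return on_demand, desired_capacity - on_demand


def build_asg_request(name, launch_template_id, instance_types, subnet_ids,
                      min_size, max_size, desired_capacity,
                      on_demand_base=0, on_demand_pct_above_base=30):
    """Build keyword arguments for autoscaling.create_auto_scaling_group."""
    return {
        "AutoScalingGroupName": name,
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": desired_capacity,
        # Comma-separated subnet IDs spread the group across AZs.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": launch_template_id,
                    "Version": "$Latest",
                },
                # Each override adds an instance type, i.e. more Spot pools.
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                "OnDemandBaseCapacity": on_demand_base,
                "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
                "SpotAllocationStrategy": "capacity-optimized",
            },
        },
    }


# Hypothetical names and IDs, for illustration only.
request = build_asg_request(
    name="example-asg",
    launch_template_id="lt-0abcd1234",
    instance_types=["m5.large", "m5a.large", "c5.large"],
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],
    min_size=2, max_size=20, desired_capacity=12,
    on_demand_base=2,
)
on_demand, spot = estimate_split(12, 2, 30)
print(on_demand, spot)  # 5 7
```

With a base of 2 and 30% On-Demand above the base, a desired capacity of 12 works out to 5 On-Demand and 7 Spot instances. The request dict could then be passed as boto3.client("autoscaling").create_auto_scaling_group(**request).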
Step 3: Attach Scaling Policies
To enable elasticity, attach:
- Target tracking policies (e.g., CPU utilisation)
- Scheduled actions (scale at specific times)
- Step scaling policies (adjust based on thresholds)
This lets the whole group, Spot and On-Demand capacity alike, scale in and out automatically in response to demand.
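As a sketch of what two of these policies could look like in code, the Python helpers below build the arguments for boto3's put_scaling_policy (target tracking on average CPU) and put_scheduled_update_group_action. The group name, policy names, target value, and schedule are assumptions chosen for illustration.

```python
def target_tracking_policy(asg_name, target_cpu_percent=50.0):
    """Arguments for autoscaling.put_scaling_policy (CPU target tracking)."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-cpu-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": float(target_cpu_percent),
        },
    }


def scheduled_action(asg_name, action_name, cron_expression, desired_capacity):
    """Arguments for autoscaling.put_scheduled_update_group_action."""
    return {
        "AutoScalingGroupName": asg_name,
        "ScheduledActionName": action_name,
        # Cron expression, evaluated in UTC unless a time zone is set.
        "Recurrence": cron_expression,
        "DesiredCapacity": desired_capacity,
    }


# Hypothetical group name and schedule.
cpu_policy = target_tracking_policy("example-asg", target_cpu_percent=60)
morning_scale_out = scheduled_action(
    "example-asg", "weekday-morning", "0 8 * * 1-5", desired_capacity=12)
```

Each dict can be unpacked into the matching call on boto3.client("autoscaling"), for example put_scaling_policy(**cpu_policy).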
Enhancing Fault Tolerance with AWS Fault Injection Simulator
Automation is only as effective as its resilience under stress. AWS Fault Injection Simulator (FIS) is a fully managed service for running controlled chaos engineering experiments on workloads hosted on AWS.
Why Use FIS?
FIS helps answer critical questions:
- What happens when a Spot instance is interrupted?
- Does the Auto Scaling Group replace lost capacity?
- Is there failover to On-Demand instances?
- Are application metrics and alerts triggered correctly?
By simulating failures, FIS helps you verify that your automation logic behaves predictably and recovers quickly.
Common Chaos Scenarios for EC2 Spot Instances:
- Terminate EC2 Spot Instances: Approximates AWS reclaiming Spot capacity. FIS also offers a dedicated aws:ec2:send-spot-instance-interruptions action, which delivers the interruption notice before the instance is reclaimed.
- Simulate Network Latency or Packet Loss: Helps identify how dependent services handle degraded performance.
- Inject CPU or Memory Stress: Validates that scaling policies are triggered as expected.
Step 1: Set Up IAM Roles
FIS requires a role with permissions to perform actions like:
- ec2:TerminateInstances
- autoscaling:UpdateAutoScalingGroup
- cloudwatch:GetMetricData
- Logging to CloudWatch
Attach the FIS role to your experiment templates.
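For FIS to assume the role, the role's trust policy has to name the FIS service principal (fis.amazonaws.com). The Python sketch below builds a trust policy and a permissions policy covering only the actions listed above; the function names are illustrative, and the wildcard resource is an assumption made for brevity that should be narrowed in a real account.

```python
import json


def fis_trust_policy():
    """Trust policy letting the FIS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "fis.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def fis_permissions_policy():
    """Permissions for the actions named in this step."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "ec2:TerminateInstances",
                "autoscaling:UpdateAutoScalingGroup",
                "cloudwatch:GetMetricData",
            ],
            # Assumption: wildcard used for brevity; scope down in practice.
            "Resource": "*",
        }],
    }


# IAM APIs take policy documents as JSON strings.
trust_document = json.dumps(fis_trust_policy(), indent=2)
permissions_document = json.dumps(fis_permissions_policy(), indent=2)
```

The trust document is what you would pass as AssumeRolePolicyDocument when creating the role, for example with boto3's iam.create_role. Your experiments may need further permissions depending on the actions and logging you configure.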
Step 2: Define an Experiment Template
An FIS experiment template contains:
- Targets: e.g., EC2 instances in a specific Auto Scaling Group
- Actions: e.g., terminate a Spot instance
- Stop conditions: CloudWatch alarms that halt the experiment if thresholds are breached
Example:
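targets:
  spotInstances:
    resourceType: aws:ec2:instance
    # FIS targets need resource ARNs or tags; this tag is a hypothetical placeholder.
    resourceTags:
      ChaosReady: "true"
    selectionMode: COUNT(1)
    filters:
      - path: InstanceLifecycle
        values: ["spot"]
actions:
  terminateSpot:
    actionId: aws:ec2:terminate-instances
    # Actions reference a target by name rather than taking instance IDs as a parameter.
    targets:
      Instances: spotInstances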
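The same template structure can be expressed as arguments to boto3's fis.create_experiment_template. This is a hedged sketch rather than a complete setup: the role ARN, alarm ARN, and the ChaosReady tag are hypothetical placeholders. It links the action to its target through the action's targets map, which is how the FIS API associates the two.

```python
import uuid


def build_experiment_template(role_arn, target_tags, stop_alarm_arn=None):
    """Build keyword arguments for fis.create_experiment_template."""
    if stop_alarm_arn:
        # Halt the experiment if this CloudWatch alarm goes into ALARM.
        stop_conditions = [{"source": "aws:cloudwatch:alarm",
                            "value": stop_alarm_arn}]
    else:
        stop_conditions = [{"source": "none"}]
    return {
        "clientToken": str(uuid.uuid4()),
        "description": "Terminate one Spot instance",
        "roleArn": role_arn,
        "stopConditions": stop_conditions,
        "targets": {
            "spotInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": dict(target_tags),
                # Narrow the tagged instances down to Spot capacity only.
                "filters": [{"path": "InstanceLifecycle",
                             "values": ["spot"]}],
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "terminateSpot": {
                "actionId": "aws:ec2:terminate-instances",
                "targets": {"Instances": "spotInstances"},
            }
        },
    }


# Hypothetical ARNs and tag, for illustration only.
template = build_experiment_template(
    role_arn="arn:aws:iam::111122223333:role/example-fis-role",
    target_tags={"ChaosReady": "true"},
    stop_alarm_arn=(
        "arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-5xx"),
)
```

Passing the dict as boto3.client("fis").create_experiment_template(**template) would register the template; the experiment itself is started separately with start_experiment.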
Step 3: Run Experiments and Analyse
Execute the experiment and verify that:
- The Auto Scaling group replaces the terminated Spot instance
- The replacement respects your instance type preferences
- CloudWatch alarms fire and logs are recorded
- Application availability is unaffected
This proactive testing hardens your system against real-world issues.
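One way to check the first two observations programmatically is to inspect the response of boto3's autoscaling.describe_auto_scaling_groups after the experiment. The helper below is a minimal sketch that works on the response dict; the function name and the sample response are hypothetical, with the sample trimmed to the fields the helper reads.

```python
def recovery_report(describe_response, asg_name):
    """Summarize whether an Auto Scaling group is back at desired capacity.

    describe_response is the dict returned by
    autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[...]).
    """
    groups = [g for g in describe_response["AutoScalingGroups"]
              if g["AutoScalingGroupName"] == asg_name]
    if not groups:
        raise ValueError(f"group not found: {asg_name}")
    group = groups[0]
    in_service = [i for i in group["Instances"]
                  if i["LifecycleState"] == "InService"
                  and i["HealthStatus"] == "Healthy"]
    return {
        "desired": group["DesiredCapacity"],
        "in_service": len(in_service),
        "instance_types": sorted({i["InstanceType"] for i in in_service}),
        "recovered": len(in_service) >= group["DesiredCapacity"],
    }


# Hand-written sample: one replacement instance is still launching.
sample_response = {
    "AutoScalingGroups": [{
        "AutoScalingGroupName": "example-asg",
        "DesiredCapacity": 2,
        "Instances": [
            {"InstanceId": "i-0aaa", "InstanceType": "m5.large",
             "LifecycleState": "InService", "HealthStatus": "Healthy"},
            {"InstanceId": "i-0bbb", "InstanceType": "c5.large",
             "LifecycleState": "Pending", "HealthStatus": "Healthy"},
        ],
    }]
}
report = recovery_report(sample_response, "example-asg")
print(report["recovered"])  # False
```

Polling this report until recovered becomes true gives a simple measure of how long the group takes to replace lost capacity.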
Best Practices for Spot + Auto Scaling + FIS
- Diversify instance types and AZs: Drawing on more Spot pools lowers the chance that a single capacity shortage interrupts a large share of your fleet.
- Use capacity-optimized allocation: It prioritizes the Spot pools with the most available capacity, which tend to be more stable.
- Maintain an On-Demand base capacity: It provides a minimum level of availability that Spot interruptions cannot remove.
- Monitor interruption notices: Use the two-minute Spot interruption notice to shut applications down gracefully.
- Automate recovery logic: Use AWS Systems Manager, EventBridge, or AWS Lambda to handle events triggered by FIS experiments.
- Run chaos experiments regularly: Include FIS scenarios in your resilience tests or CI/CD pipeline.
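To illustrate the interruption-notice practice: when an interruption is scheduled, the instance metadata path spot/instance-action returns a small JSON document with the action and its time, and returns 404 otherwise. The Python sketch below parses that document and computes how much time is left for a graceful shutdown. The function name is illustrative, and the HTTP request itself (including the IMDSv2 session token) is left out as an assumption about your environment.

```python
import json
from datetime import datetime, timezone

# Metadata path that returns 404 until an interruption is scheduled.
INSTANCE_ACTION_URL = (
    "http://169.254.169.254/latest/meta-data/spot/instance-action")


def seconds_until_interruption(instance_action_body, now=None):
    """Return (action, seconds_remaining) from an instance-action document."""
    notice = json.loads(instance_action_body)
    deadline = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return notice["action"], (deadline - now).total_seconds()


# Sample document in the format served by the metadata endpoint.
sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
action, remaining = seconds_until_interruption(
    sample, now=datetime(2017, 9, 18, 8, 20, tzinfo=timezone.utc))
print(action, remaining)  # terminate 120.0
```

A shutdown hook could poll the endpoint every few seconds and begin draining connections or checkpointing work as soon as a document appears.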
Conclusion
Spot Instances offer substantial cost advantages, but their unpredictable availability can be risky without proper automation and resilience strategies. By combining EC2 Auto Scaling’s mixed instance policies with AWS Fault Injection Simulator’s controlled chaos, you can build an infrastructure that’s both cost-efficient and highly reliable. Whether you’re running stateless microservices, containerised workloads, or batch jobs, this combination lets you adopt Spot Instances with confidence while staying prepared for the unexpected.

