System Design and Documentation for a Production-Grade API Infrastructure Deployed on AWS

DevOps and Cloud Engineer
Focused on optimizing the software development lifecycle through seamless integration of development and operations, specializing in designing, implementing, and managing scalable cloud infrastructure with a strong emphasis on automation and collaboration.
Key Skills:
Terraform: Skilled in Infrastructure as Code (IaC) for automating infrastructure deployment and management.
Ansible: Proficient in automation tasks, configuration management, and application deployment.
AWS: Extensive experience with AWS services like EC2, S3, RDS, and Lambda, designing scalable and cost-effective solutions.
Kubernetes: Expert in container orchestration, deploying, scaling, and managing containerized applications.
Docker: Proficient in containerization for consistent development, testing, and deployment.
Google Cloud Platform: Familiar with GCP services for compute, storage, and machine learning.
Part 0: CI/CD Pipeline Documentation - Bird API
This document describes the complete CI/CD pipeline for the Bird API project. The pipeline automates building Go applications, creating Docker images, and pushing them to Docker Hub for deployment on AWS EKS with Kubernetes.
Project Structure
resend/
├── bird/ # Bird API service
│ ├── main.go # Bird API code
│ ├── Dockerfile # Bird container config
│ ├── Makefile # Build automation
│ └── go.mod # Go dependencies
├── birdImage/ # Bird Image API service
│ ├── main.go # BirdImage API code
│ ├── Dockerfile # BirdImage container config
│ ├── Makefile # Build automation
│ └── go.mod # Go dependencies
├── frontend/ # Bird Frontend service (NEW)
│ ├── main.go # Frontend code
│ └── Dockerfile # Frontend container config
├── bird-api-k8s-manifests/ # Kubernetes manifests
│ ├── bird-api-deployment.yaml # Bird API K8s deployment
│ └── bird-image-deployment.yaml # Bird Image API K8s deployment
├── bird-chart/ # Helm chart for deployments
│ ├── Chart.yaml # Chart metadata
│ ├── values.yaml # Helm values
│ └── templates/ # K8s templates
├── infrastructure/ # Terraform infrastructure
│ ├── eks-cluster.tf # EKS cluster configuration
│ ├── eks-node-group.tf # EKS node groups
│ ├── kubernetes-provider.tf # K8s provider settings
│ ├── variables.tf # Terraform variables
│ └── ... # Other infrastructure files
└── README.md # Project documentation
Architecture
Services
bird - Main API service (port 4201)
Fetches bird data from local database
Calls birdImage service for images
Returns bird information with images
birdImage - Image service (port 4200)
Fetches bird images from Unsplash API
Returns image URLs based on bird name
frontend - Frontend web service (port 3000) (NEW)
Displays bird images fullscreen
Fetches data from Bird API
Responsive web interface
Technology Stack
Language: Go
Containerization: Docker
Container Registry: Docker Hub
Orchestration: Kubernetes (AWS EKS)
Infrastructure as Code: Terraform
CI/CD: GitHub Actions
Package Manager: Helm
Build Process
1. Local Build (Development)
Build the Go Binary
cd bird
make bird
# or
go build -o bird main.go
Output: Executable binary named bird
Using Makefile
The Makefile in each directory provides convenient build targets:
cd bird
make bird # Build bird API
make clean # Clean build artifacts
cd ../birdImage
make birdImage # Build bird image API
cd ../frontend
make frontend # Build frontend
2. Docker Image Build
Build Docker Images Locally
# From the root directory
docker build -t bruno74t/bird-api:v.1.0.5.7 ./bird
docker build -t bruno74t/bird-image-api:v.1.0.5.7 ./birdImage
docker build -t bruno74t/bird-frontend:v.1.0.5.7 ./frontend
What happens:
Reads Dockerfile from each service directory
Compiles the Go code inside the container
Creates a lightweight Alpine Linux image with the binary
Tags the image with the service name and version
Dockerfile Structure (Multi-stage build)
bird/Dockerfile, birdImage/Dockerfile, and frontend/Dockerfile follow the same pattern. The bird version is shown here; the other services expose their own ports (4200 and 3000):
# syntax=docker/dockerfile:1
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY go.mod ./
RUN go mod download
COPY . .
RUN go build -o bird main.go
FROM alpine:latest
WORKDIR /root/
COPY --from=builder /app/bird .
CMD ["./bird"]
EXPOSE 4201
Benefits:
Multi-stage build keeps final image small
Builder stage has all dependencies
Final image only contains the executable
Push Process
GitHub Actions Automated Push
Trigger
Any push to main branch
Any pull request to main branch
GitHub Workflow File Location
.github/workflows/docker-ci.yml
Workflow Steps
Step 1: Checkout Code
- name: Checkout Code
  uses: actions/checkout@v4
Pulls the latest code from the repository
Step 2: Set up Docker Buildx
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
Enables advanced Docker building features
Step 3: Authenticate with Docker Hub
- name: Log into Docker Hub
  run: echo "${{ secrets.DOCKER_PASSWORD_SYMBOLS_ALLOWED }}" | docker login --username "${{ secrets.DOCKER_USERNAME }}" --password-stdin
Uses GitHub Secrets to securely log into Docker Hub without exposing credentials
Step 4: Build and Push Bird API
- name: Build and Push Bird API Docker Image
  run: |
    docker build -t ${{ secrets.DOCKER_USERNAME }}/bird-api:v.1.0.5.7 ./bird
    docker push ${{ secrets.DOCKER_USERNAME }}/bird-api:v.1.0.5.7
Builds the bird API image from bird/Dockerfile
Pushes to Docker Hub repository
Step 5: Build and Push BirdImage API
- name: Build and Push BirdImage API Docker Image
  run: |
    docker build -t ${{ secrets.DOCKER_USERNAME }}/bird-image-api:v.1.0.5.7 ./birdImage
    docker push ${{ secrets.DOCKER_USERNAME }}/bird-image-api:v.1.0.5.7
Builds the birdImage API image from birdImage/Dockerfile
Pushes to Docker Hub repository
Step 6: Build and Push Bird Frontend (NEW)
- name: Build and Push Bird Frontend Docker Image
  run: |
    docker build -t ${{ secrets.DOCKER_USERNAME }}/bird-frontend:v.1.0.5.7 ./frontend
    docker push ${{ secrets.DOCKER_USERNAME }}/bird-frontend:v.1.0.5.7
Builds the frontend image from frontend/Dockerfile
Pushes to Docker Hub repository
GitHub Secrets Configuration
Required Secrets
Set these in GitHub: Settings → Secrets and variables → Actions
DOCKER_USERNAME
Value: Your Docker Hub username
Example: bruno74t
DOCKER_PASSWORD_SYMBOLS_ALLOWED
Value: Docker Hub access token or password
Recommended: Use access token instead of password
How to Create a Docker Hub Access Token
Go to Docker Hub (https://hub.docker.com)
Click Account Settings → Security
Click New Access Token
Name: github-actions
Permissions: Read, Write, Delete
Copy token and paste in GitHub Secrets
Container Image Management via Terraform
Instead of hardcoding image URIs in Kubernetes manifests, we use Terraform variables to manage container versions. This allows easy updates without modifying resource definitions.
Terraform variables.tf Configuration
File: infrastructure/variables.tf
# Container Images
variable "bird_api_image" {
description = "Docker image URI for bird-api"
type = string
default = "bruno74t/bird-api:v.1.0.5.7"
}
variable "bird_api_port" {
description = "Port for bird-api container"
type = number
default = 4201
}
variable "bird_image_api_image" {
description = "Docker image URI for bird-image-api"
type = string
default = "bruno74t/bird-image-api:v.1.0.5.7"
}
variable "bird_image_api_port" {
description = "Port for bird-image-api container"
type = number
default = 4200
}
variable "bird_frontend_image" {
description = "Docker image URI for bird-frontend"
type = string
default = "bruno74t/bird-frontend:v.1.0.5.7"
}
variable "bird_frontend_port" {
description = "Port for bird-frontend container"
type = number
default = 3000
}
Usage in kubernetes-provider.tf
Reference the variables in your Kubernetes resource definitions:
resource "kubernetes_deployment" "bird_frontend" {
  metadata {
    name = "bird-frontend"
  }
  spec {
    replicas = 2
    # The selector must match the pod template labels below
    selector {
      match_labels = {
        app = "bird-frontend"
      }
    }
    template {
      metadata {
        labels = {
          app = "bird-frontend" # Also matched by the Service selector
        }
      }
      spec {
        container {
          name  = "bird-frontend"
          image = var.bird_frontend_image
          port {
            container_port = var.bird_frontend_port
          }
        }
      }
    }
  }
}
resource "kubernetes_service" "bird_frontend_service" {
  metadata {
    name = "bird-frontend-service"
  }
  spec {
    type = "LoadBalancer"
    selector = {
      app = "bird-frontend"
    }
    port {
      port        = 80
      target_port = var.bird_frontend_port
    }
  }
}
How to Update Container Images
Option 1: Update variables.tf (Recommended)
Simply change the default value in infrastructure/variables.tf:
variable "bird_frontend_image" {
description = "Docker image URI for bird-frontend"
type = string
default = "bruno74t/bird-frontend:v.1.0.5.8" # Update version here
}
Option 2: Override at Deploy Time
terraform apply \
-var="bird_api_image=bruno74t/bird-api:v.1.0.5.8" \
-var="bird_image_api_image=bruno74t/bird-image-api:v.1.0.5.8" \
-var="bird_frontend_image=bruno74t/bird-frontend:v.1.0.5.8"
Option 3: Use terraform.tfvars
Create or update infrastructure/terraform.tfvars:
bird_api_image = "bruno74t/bird-api:v.1.0.5.8"
bird_image_api_image = "bruno74t/bird-image-api:v.1.0.5.8"
bird_frontend_image = "bruno74t/bird-frontend:v.1.0.5.8"
bird_api_port = 4201
bird_image_api_port = 4200
bird_frontend_port = 3000
Then deploy:
cd infrastructure
terraform plan
terraform apply
Deploy with Updated Images
cd infrastructure
# Review changes
terraform plan
# Apply changes (pulls latest image and restarts containers)
terraform apply
Complete Workflow
Development → Production Flow
1. Developer makes changes to Go code
   └─ Edit bird/main.go, birdImage/main.go, or frontend/main.go
2. Developer commits and pushes to GitHub
   ├─ git commit -m "Update bird data"
   └─ git push origin main
3. GitHub Actions triggered automatically
   ├─ Checkout code
   ├─ Build Docker images (all 3 services)
   ├─ Push to Docker Hub
   └─ Creates new image tags (v.1.0.5.8, etc.)
4. Docker images available on Docker Hub
   ├─ bruno74t/bird-api:v.1.0.5.8
   ├─ bruno74t/bird-image-api:v.1.0.5.8
   └─ bruno74t/bird-frontend:v.1.0.5.8
5. Update Terraform variables.tf with new version
   └─ Update default image tags
6. Deploy infrastructure via Terraform
   ├─ terraform plan
   ├─ terraform apply
   └─ EKS pulls new images and restarts pods
7. Kubernetes manages the deployment
   ├─ Rolling update strategy
   ├─ Old pods gradually replaced with new ones
   └─ Services remain available during update
Version Management
Tagging Convention
Format: v. followed by dot-separated version numbers
Examples: v.1.0.0, v.1.0.1, v.1.0.5.7
Note: Use dot before major version (v.1.0.5.7 not v1.0.5.7)
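As a small illustration of this convention, the sketch below validates a tag and bumps its last component. The helper names `parse_tag` and `bump_tag` are hypothetical and not part of the project, and the rule of three or four numeric components is an assumption inferred from the examples above.

```python
import re

# Assumed pattern: "v." followed by three or four dot-separated numbers,
# inferred from the examples v.1.0.0 and v.1.0.5.7.
TAG_PATTERN = re.compile(r"v\.(\d+(?:\.\d+){2,3})")


def parse_tag(tag: str) -> tuple:
    """Return the numeric components of a tag such as 'v.1.0.5.7'."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"tag {tag!r} does not follow the v.<numbers> convention")
    return tuple(int(part) for part in match.group(1).split("."))


def bump_tag(tag: str) -> str:
    """Increment the last component and return the new tag."""
    *head, last = parse_tag(tag)
    return "v." + ".".join(str(n) for n in [*head, last + 1])


if __name__ == "__main__":
    print(bump_tag("v.1.0.5.7"))  # v.1.0.5.8
```

A check like this could run before a release so that a tag written as v1.0.5.7 is rejected early rather than pushed to the registry.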
How to Release a New Version
- Make code changes
cd bird
# Edit main.go
git add main.go
- Update CI/CD workflow (optional, if changing version tags)
# Update .github/workflows/docker-ci.yml if needed
git add .github/workflows/docker-ci.yml
- Update Terraform variables
# Update infrastructure/variables.tf with new version
git add infrastructure/variables.tf
- Commit and push
git commit -m "Release v.1.0.5.8 - Description of changes"
git push origin main
- GitHub Actions automatically builds and pushes the new version
- Deploy with Terraform
cd infrastructure
terraform plan
terraform apply
Troubleshooting
Go Build Issues
# Clean cache
go clean -cache
# Rebuild
go build -o bird main.go
GitHub Actions Not Triggering
Check: .github/workflows/docker-ci.yml exists
Check: Secrets are set (DOCKER_USERNAME, DOCKER_PASSWORD_SYMBOLS_ALLOWED)
Check: Workflow syntax is valid
Docker Push Fails
Verify: Docker Hub credentials in GitHub Secrets
Verify: Repository exists on Docker Hub
Verify: Access token hasn't expired
Image Not Updating
Check Docker Hub tags: bruno74t/bird-api repository
Verify workflow ran successfully in GitHub Actions
Check: Terraform is pulling latest image tag
Verifying Deployed Container Images (Kubernetes)
Check which image version is currently deployed:
kubectl get deployment bird-api -o yaml | grep image:
kubectl get deployment bird-image-api -o yaml | grep image:
kubectl get deployment bird-frontend -o yaml | grep image:
Check pod status and readiness:
kubectl get pods -n default | grep bird
View container logs to verify app is running correctly:
kubectl logs -n default deployment/bird-api
kubectl logs -n default deployment/bird-image-api
kubectl logs -n default deployment/bird-frontend
Watch real-time logs as pods start:
kubectl logs -f -n default deployment/bird-api
kubectl logs -f -n default deployment/bird-frontend
Check pod details and events:
kubectl describe pod <pod-name> -n default
Verify rollout status after deployment:
kubectl rollout status deployment/bird-api -n default
kubectl rollout status deployment/bird-image-api -n default
kubectl rollout status deployment/bird-frontend -n default
Force restart deployment with new image:
kubectl rollout restart deployment/bird-api -n default
kubectl rollout restart deployment/bird-image-api -n default
kubectl rollout restart deployment/bird-frontend -n default
Key Files
resend/
├── .github/
│ └── workflows/
│ └── docker-ci.yml # CI/CD Pipeline definition
├── bird/
│ ├── main.go # Bird API code
│ ├── Dockerfile # Bird container config
│ ├── Makefile # Build automation
│ └── go.mod # Dependencies
├── birdImage/
│ ├── main.go # BirdImage API code
│ ├── Dockerfile # BirdImage container config
│ ├── Makefile # Build automation
│ └── go.mod # Dependencies
├── frontend/
│ ├── main.go # Frontend code
│ └── Dockerfile # Frontend container config
├── bird-api-k8s-manifests/ # Kubernetes manifests
│ ├── bird-api-deployment.yaml
│ └── bird-image-deployment.yaml
├── bird-chart/ # Helm charts
│ ├── Chart.yaml
│ ├── values.yaml
│ └── templates/
├── infrastructure/ # Infrastructure as Code
│ ├── kubernetes-provider.tf # K8s provider settings
│ ├── variables.tf # Container image versions managed here
│ ├── eks-cluster.tf # EKS cluster
│ └── ... # Other infrastructure files
└── README.md
variables.tf Role
The infrastructure/variables.tf file is the single source of truth for container image versions:
Define which Docker image tag to deploy
Update versions without touching resource definitions
Centralized management for all services
Easy rollback by changing image tags in one file
Summary
This CI/CD pipeline:
Automatically builds Go applications using Makefiles
Creates Docker images for each service using multi-stage builds (bird, birdImage, frontend)
Pushes images to Docker Hub with version tags
Enables infrastructure teams to deploy latest versions via Terraform using variables.tf
Centralizes image version management in variables.tf (single source of truth)
Eliminates manual build and push steps
Maintains secure credentials using GitHub Secrets
Allows easy rollback by changing image tags in one file
Uses AWS EKS for Kubernetes orchestration
Supports rolling updates with zero downtime
Part 1: Architecture Overview & Technology Selection
Full Detailed Architecture Design
When deploying APIs in the cloud, every technology choice involves trade-offs. In this comprehensive guide, I'll walk you through the complete system design of a highly available, scalable API infrastructure on AWS. I'll explain not just what I implemented, but why I made each architectural decision and what alternatives we considered.
This is the infrastructure behind the Bird API, a real-world project deployed to AWS EKS with auto-scaling, CloudFront CDN, and comprehensive CloudWatch monitoring.
Executive Summary
Building scalable infrastructure isn't about using the fanciest tools—it's about making careful choices that balance:
Cost - Stay within budget (this costs around $120/month)
Complexity - Keep it simple enough to understand
Capability - Meet all functional requirements
Future-proofing - Allow for scaling without rewrites
This infrastructure shows these principles through real-world decisions, with a trade-off analysis for each major component.
Design Goals Achieved
| Goal | Target | Achieved |
|---|---|---|
| Availability | 99.9% uptime | ✓ Multi-AZ + self-healing |
| Scalability | 2-10 pods, 2-5 nodes | ✓ HPA + Cluster Autoscaler |
| Latency | <100ms response | ✓ CloudFront CDN at edges |
| Cost | <$150/month | ✓ ~$120/month |
| Recovery | <15s pod failure | ✓ Verified in tests |
| Monitoring | Real-time alerts | ✓ CloudWatch + 4 alarms |
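To make the availability target concrete, here is a rough sketch of the downtime budget it implies. The function name is hypothetical and the 30-day month is an assumption; the 99.9% figure comes from the table above.

```python
def downtime_budget_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of allowed downtime over `days` for a given availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # 99.9% over an assumed 30-day month leaves roughly 43 minutes of downtime.
    print(round(downtime_budget_minutes(99.9), 1))
```

With a budget of about 43 minutes per month, a pod recovery time under 15 seconds leaves room for many individual pod failures before the target is at risk.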
The Journey: Why These Technologies?
Every decision in this architecture answers a fundamental question: What's the simplest solution that meets our requirements and can still be continuously improved?
This philosophy guided me toward AWS managed services (EKS, CloudFront, CloudWatch) over self-hosted alternatives, even when the self-hosted versions offered more power. I chose a system that:
Works reliably with minimal operational overhead
Scales automatically without intervention
Alerts us when something goes wrong
Costs predictably without surprises
Can be deployed in 20-25 minutes
System Architecture Overview
Before diving into individual components, here's the complete picture:
Container Orchestration: Why EKS?
Let me start with the biggest decision: Kubernetes on AWS (EKS) vs alternatives.
The Alternatives We Considered
Option 1: Self-Managed Kubernetes
Setup Time: 2-3 days
Operational Burden: Very high (time-consuming)
Cost: You pay for the EC2 instances yourself
Monthly Cost: 3 EC2 instances for HA = ~$200-250
Why NOT: For a single API, the operational overhead isn't justified. You're maintaining the control plane (patching, updates, security), managing etcd backups, and handling everything that AWS does automatically with EKS.
Option 2: Docker Swarm
Setup Time: 1-2 days
Operational Burden: Low
Scalability: Limited
Community: Declining
Why NOT: Swarm has limited auto-scaling capabilities, no multi-region support, and the community is shrinking. It's great for small-scale deployments but won't grow with you.
Option 3: AWS ECS (Fargate)
Setup Time: 1 day
Operational Burden: Very Low
Kubernetes Features: Not available
Cost: Pay per pod (~$40-50/month for our workload)
Why I almost chose it: ECS Fargate is genuinely simpler than Kubernetes. No nodes to manage. But we chose EKS because:
Kubernetes is industry standard
Terraform support is better for Kubernetes
EKS cost is in the same range ($73 control plane + $68 nodes = $141 vs ~$40-50 Fargate)
Actually, Fargate wins on cost—we could optimize here later
Decision: AWS EKS
Here's why EKS won:
| Factor | EKS | Self-Managed K8s | Docker Swarm | ECS Fargate |
|---|---|---|---|---|
| AWS Integration | Native | Manual plugins | Poor | Native |
| Multi-AZ | Built-in | Manual | Manual | Built-in |
| Auto-Scaling | Native | Custom setup | Limited | Native |
| Industry Standard | Yes | Yes | Declining | No |
| Control Plane Mgmt | AWS | You | Built-in | AWS |
| Cost (Control Plane) | $73/month | $0 | $0 | $0 |
| Monthly Total | $141 | $200-250 | $100 | $50-70 |
| Operational Burden | Low | Very High | Low | Very Low |
| Future Flexibility | Excellent | Excellent | Limited | Limited |
The surprising winner for cost: ECS Fargate ($50-70/month)
The surprising winner for simplicity: ECS Fargate (no nodes to manage)
Our choice: EKS ($141/month)
Why not the cheapest option?
Three reasons:
Industry Adoption - Kubernetes skills transfer to any cloud. ECS is AWS-only.
The cost difference is modest - paying roughly $70-90/month more than Fargate is acceptable for portability
Future flexibility - Kubernetes runs anywhere. ECS doesn't. If we ever need multi-cloud, K8s is the only option.
The lesson here: Sometimes paying a little more for portability is worth it.
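The arithmetic behind this comparison can be sketched as follows. The dollar figures are the ones quoted in the comparison table above; the constant and function names are hypothetical.

```python
# Monthly figures quoted in the comparison above (USD).
EKS_CONTROL_PLANE = 73
EKS_NODES = 68
FARGATE_RANGE = (50, 70)  # low and high monthly estimate from the table


def eks_total() -> int:
    """Total monthly EKS cost: control plane plus worker nodes."""
    return EKS_CONTROL_PLANE + EKS_NODES


def eks_premium_over_fargate() -> tuple:
    """(smallest, largest) extra monthly cost of EKS compared with Fargate."""
    low, high = FARGATE_RANGE
    return eks_total() - high, eks_total() - low


if __name__ == "__main__":
    print(eks_total())                  # 141
    print(eks_premium_over_fargate())   # (71, 91)
```

Keeping the inputs as named constants makes it easy to rerun the comparison if node counts or Fargate usage change.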
Load Balancing Strategy
Now that we've chosen Kubernetes, we need a load balancer to route traffic to our pods.
Load Balancer Options
Layer 4 (Transport): TCP/UDP
├── Network Load Balancer (NLB)
│ └─ Latency: <100µs, Cost: $0.006/hr
│
Layer 7 (Application): HTTP/HTTPS
├── Application Load Balancer (ALB)
│ └─ Latency: ~400µs, Cost: $0.0225/hr
│
Legacy:
└── Classic Load Balancer
└─ Deprecated (don't use)
Decision: Network Load Balancer (NLB)
Here's the analysis:
| Requirement | NLB | ALB | Why NLB |
|---|---|---|---|
| Latency < 100ms? | ✓ Ultra-low | ✓ Low | NLB is ~4x faster (<100µs vs ~400µs) |
| Simple API routing? | ✓ Yes | ✓ Yes | Both work |
| Need URL-based routing? | ✗ No | ✓ Yes | N/A for us |
| Cost for 2 services? | $44/month | $120/month | NLB 2.7x cheaper |
| Future consolidation? | Hard | Easy | But we don't need it yet |
Trade-off we made: Each service gets its own load balancer (more expensive) instead of consolidating on one ALB (would need URL-based routing).
Cost of this trade-off: Extra $76/month vs ALB
Benefit: Simpler setup, no routing rules to manage
Break-even point: When you have 10+ microservices, switch to ALB
Our situation: 2 microservices, NLB makes sense
The lesson: Don't optimize for scale you don't have. Use NLB now, migrate to ALB later if needed.
Key Takeaways
From these architecture decisions, here are the principles we followed:
Choose Managed Services Over Self-Hosted
EKS > Self-Managed K8s (let AWS handle control plane)
CloudWatch > Prometheus (no additional infrastructure)
CloudFront > self-hosted CDN (global by default)
Optimize for Your Current Scale, Not Future Scale
2 NLBs > 1 ALB (simpler now, migrate later)
2 nodes > 10 nodes (scale as needed)
EKS > ECS Fargate (slight cost premium for flexibility)
Prioritize Operational Simplicity
"Simple" beats "powerful"
"Boring" beats "cutting-edge"
"Built-in" beats "third-party"
Document Trade-offs Explicitly
Every decision has a cost and benefit
Your future self will thank you
Enables future optimization
What's Next?
In the next part, we'll dive into:
Horizontal Pod Autoscaling (HPA) - Why 70% CPU? Why 10 pods max?
Cluster Autoscaling - When to add/remove nodes
Failure Recovery Design - How the system recovers in <15 seconds
System Design Series
Part 1: Architecture Overview & Technology Selection (this post)
Part 2: Autoscaling & Failure Recovery Design
Part 3: Monitoring & CloudWatch Strategy
Part 4: Cost Optimization for Production Workloads
Part 5: Multi-Region & Disaster Recovery
Part 1.1 Terraform Files Explained
Bird API - Terraform Infrastructure Documentation
Overview
This documentation covers the complete Terraform infrastructure setup for the Bird API, a containerized application deployed on AWS EKS (Elastic Kubernetes Service). The infrastructure is designed for high availability, auto-scaling, and comprehensive monitoring.
Architecture Type: Microservices on Kubernetes
Primary Services: Bird API, Bird Image API, Bird Frontend
Cloud Provider: AWS
Container Orchestration: Kubernetes (EKS)
Infrastructure as Code Tool: Terraform
Architecture Overview
High-Level Flow
The Bird API infrastructure follows a multi-tier architecture:
Edge Layer: CloudFront CDN distributes content globally
Load Balancing Layer: Network Load Balancers route traffic to services
Kubernetes Layer: EKS cluster manages containerized workloads
Persistence Layer: S3 for logs and state management
Monitoring Layer: CloudWatch for metrics, alarms, and dashboards
Key Services Deployed
File Structure & Organization
Core Infrastructure Files
vpc.tf - Virtual Private Cloud Setup
Purpose: Creates isolated network environment with public/private subnets
Key Resources:
Public Subnets: 2 subnets (10.0.1.0/24, 10.0.2.0/24) across availability zones
Private Subnets: 2 subnets (10.0.101.0/24, 10.0.102.0/24) for worker nodes
Internet Gateway: Provides internet access for public subnets
NAT Gateways: Enable outbound internet access from private subnets (2 for HA)
Route Tables: Separate routing for public and private traffic
Public Subnets (with IGW)
├── bird-public-subnet-1 (10.0.1.0/24)
└── bird-public-subnet-2 (10.0.2.0/24)
Private Subnets (with NAT)
├── bird-private-subnet-1 (10.0.101.0/24)
└── bird-private-subnet-2 (10.0.102.0/24)
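A quick way to sanity-check a subnet plan like this is Python's standard `ipaddress` module. The sketch below assumes the VPC CIDR is 10.0.0.0/16, which is not stated in this section; the subnet names and CIDRs are taken from the layout above, and `check_layout` is a hypothetical helper.

```python
import ipaddress
from itertools import combinations

# Assumption: the VPC uses 10.0.0.0/16 (not stated in the section above).
VPC = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "bird-public-subnet-1": "10.0.1.0/24",
    "bird-public-subnet-2": "10.0.2.0/24",
    "bird-private-subnet-1": "10.0.101.0/24",
    "bird-private-subnet-2": "10.0.102.0/24",
}
# AWS reserves the first four addresses and the last one in every subnet.
AWS_RESERVED_PER_SUBNET = 5


def check_layout(vpc, subnets):
    """Validate the plan and return usable address counts per subnet."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            raise ValueError(f"{name} ({net}) is outside the VPC range {vpc}")
    for (name_a, net_a), (name_b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            raise ValueError(f"{name_a} overlaps {name_b}")
    return {name: net.num_addresses - AWS_RESERVED_PER_SUBNET
            for name, net in nets.items()}


if __name__ == "__main__":
    for name, usable in check_layout(VPC, SUBNETS).items():
        print(f"{name}: {usable} usable addresses")
```

Each /24 leaves 251 usable addresses, which matters on EKS because pods draw their IPs from the worker node subnets.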
eks-cluster.tf - Kubernetes Control Plane
Purpose: Creates and configures EKS cluster
Key Resources:
CloudWatch Logs: Enables cluster logging for api, audit, authenticator, controllerManager, scheduler
OIDC Provider: Enables IAM Roles for Service Accounts (IRSA)
Private and public endpoint access enabled for flexible connectivity
IRSA capability for fine-grained IAM permissions per service account
eks-node-group.tf - Worker Nodes
Purpose: Manages EC2 worker nodes that run containers
Configuration:
Instance Type: t2.medium (configurable via node_instance_types)
IAM Permissions: Worker node, CNI, ECR access, SSM, CloudWatch
33% max unavailable during updates (rolling update strategy)
Lifecycle policy: create_before_destroy for zero-downtime updates
autoscaling.tf - Auto-Scaling Configuration
Purpose: Enables automatic scaling of cluster and pods
Components:
High Traffic → CPU Increases → HPA triggers scale-up → More pods created
Low Traffic → CPU Decreases → HPA triggers scale-down → Pods removed
Node Capacity Full → Cluster Autoscaler → Adds EC2 node
kubernetes-provider.tf - Kubernetes Workloads
Purpose: Defines all application deployments, services, and policies
Deployments Created:
- bird-frontend
Image: bruno74t/bird-frontend:v.1.0.5.9
Port: 3000
Service Type: LoadBalancer (creates AWS Network Load Balancer)
cloudfront.tf - Content Distribution Network
Purpose: Global content caching and routing via CloudFront
Origins:
monitoring.tf - CloudWatch Monitoring & Alerts
Purpose: Real-time visibility and alerting for infrastructure health
CloudWatch Alarms (Low Thresholds for Testing):
Log Insights Queries (Sample):
Error Logs: fields @timestamp, @message | filter @message like /ERROR/
Response Times: fields @timestamp, response_time | stats avg, max, p95
HTTP Status: fields status_code | stats count by status_code
Pod Metrics: stats avg(cpu), avg(memory) by pod_name
Log Retention: 7 days (configurable)
bucket.tf - State Management & Locking
Purpose: Secure remote Terraform state storage
terraform {
  backend "s3" {
    bucket         = "bird-api-terraform-state-123456789"
    key            = "bird-api/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "bird-api-terraform-locks" # State locking
  }
}
variables.tf - Input Variables
Purpose: Configurable parameters for different environments
node_instance_types: EC2 instance types (default: t2.medium)
enable_nat_gateway: Outbound internet access (default: true)
enable_metrics_server: Metrics collection (default: true)
terraform.tfvars - Variable Values
Purpose: Provides actual values for variables
aws_region = "us-east-1"
environment = "dev"
project_name = "bird-api"
cluster_name = "bird-api-cluster"
cluster_version = "1.29"
node_group_desired_size = 2
alert_email = "brunogatete77@gmail.com"
enable_cloudwatch_monitoring = true
locals.tf - Local Values
Purpose: Derived values used across multiple resources
data.tf - Data Sources
Purpose: References existing AWS resources
EKS Optimized AMI for worker nodes
output.tf - Outputs
Purpose: Display important values after deployment
backend.tf - Terraform Backend Configuration
Purpose: Remote state storage configuration
Current Setup: Local state (for initial setup)
Production: Should migrate to S3 backend
kubernetes-provider.tf - Terraform Kubernetes Provider
Purpose: Authenticates Terraform with EKS cluster
Configuration & Variables
Environment-Specific Values
# Development
environment = "dev"
node_group_desired_size = 2
node_group_max_size = 3
bird_api_replicas = 2
enable_cloudwatch_monitoring = true
alert_email = "devops@example.com"
# Production
environment = "prod"
node_group_desired_size = 3
node_group_max_size = 10
bird_api_replicas = 3
enable_cloudwatch_monitoring = true
alert_email = "ops-team@example.com"
log_retention_in_days = 30
Important Constraints
High Availability Requirements:
Scaling Limits:
Node scaling: 2-5 (configurable)
Pod scaling: 2-10 (configurable)
Deployment Instructions
Prerequisites
# Install required tools
- AWS CLI configured with appropriate credentials
- Terraform >= 1.0
- kubectl >= 1.29
- helm (for manual chart installations)
Step-by-Step Deployment
1. Initialize Terraform:
terraform init
2. Create terraform.tfvars with your values:
cat > terraform.tfvars << EOF
aws_region = "us-east-1"
environment = "dev"
alert_email = "your-email@example.com"
EOF
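3. Review the execution plan:
terraform plan
4. Apply the configuration:
terraform apply
5. Configure kubectl for the new cluster:
aws eks update-kubeconfig \
--name bird-api-cluster \
--region us-east-1
6. Verify the cluster and workloads:
kubectl get nodes
kubectl get pods -n default
kubectl get svc -n default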
7. (Optional) Install Helm charts manually:
# Cluster Autoscaler
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
--set autoDiscovery.clusterName=bird-api-cluster \
--set awsRegion=us-east-1
# Metrics Server
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm install metrics-server metrics-server/metrics-server
Post-Deployment Verification
# Check EKS cluster
aws eks describe-cluster --name bird-api-cluster
# Check node status
kubectl get nodes -o wide
# Check deployments
kubectl get deployments
# Check services
kubectl get svc
# Check HPA status
kubectl get hpa
# View CloudWatch dashboard
aws cloudwatch list-dashboards | grep bird-api
# Test API access
curl http://{CLOUDFRONT_DOMAIN}/api/...
Monitoring & Alerting
CloudWatch Alarms
Pod CPU Utilization > 20%
Pod Memory Utilization > 30%
Pod Restarts > 3 in 2 minutes
Node Not Ready
CloudWatch Dashboard
Access via AWS Console: CloudWatch → Dashboards → bird-api-overview
Cluster memory utilization
Pod restart counts
Pod resource utilization
Log Insights Queries
Use CloudWatch Logs Insights to analyze:
# Find errors
fields @timestamp, @message | filter @message like /ERROR/ | stats count()
# Analyze response times
fields @timestamp, response_time | stats avg(response_time), max(response_time)
# HTTP status distribution
fields status_code | stats count() as requests by status_code
# Pod resource usage
fields @timestamp, kubernetes.pod_name, container_cpu_utilization
| stats avg(container_cpu_utilization) by kubernetes.pod_name
Scaling & Performance
Horizontal Pod Autoscaling (HPA)
Scales down when demand decreases (with cooldown period)
Monitor HPA:
kubectl get hpa
kubectl describe hpa bird-api-hpa
kubectl top pods -l app=bird-api
Cluster Autoscaling
Current Configuration:
Max Nodes: 5
Triggers when pods cannot be scheduled
Behavior:
kubectl logs -n kube-system -l app=cluster-autoscaler -f
kubectl describe node <node-name>
Resource Requests & Limits
Cost Considerations
Resource Costs
Cost Optimization Tips
Use Spot Instances: Add t2.medium spot instances to node group
Adjust Max Nodes: Reduce max_size if maximum scale never reached
NAT Gateway: Consider NAT instance for dev (cheaper but less HA)
Failure Simulation & Testing
Included Script: failure-simulation.sh
chmod +x failure-simulation.sh
./failure-simulation.sh
Troubleshooting Common Issues
Pods in Pending State
kubectl describe pod <pod-name>
# Check if node capacity issue
kubectl get nodes
# May need cluster autoscaler or more nodes
Service LoadBalancer No External IP
# Check AWS service annotations
kubectl get svc -o yaml
# Verify cluster has AWS Load Balancer controller installed
helm list -A | grep aws-load-balancer
HPA Not Scaling
# Verify Metrics Server is installed
kubectl get deployment metrics-server -n kube-system
# Check HPA status
kubectl describe hpa bird-api-hpa
# View metrics
kubectl top pods
CloudWatch Alarms Not Triggering
# Verify SNS subscription is confirmed
aws sns list-subscriptions-by-topic --topic-arn <topic-arn>
# Check alarm history
aws cloudwatch describe-alarms --alarm-names bird-api-pod-cpu-high
EKS Cluster Access Issues
# Update kubeconfig
aws eks update-kubeconfig --name bird-api-cluster
# Test connectivity
kubectl cluster-info
kubectl auth can-i get pods
Additional Resources
Maintenance & Updates
Regular Tasks
Upgrade Procedure
# Update cluster version
terraform apply -var="cluster_version=1.30"
# Update node group (blue-green strategy)
# 1. Update Terraform variable
# 2. terraform apply (creates new nodes before removing old)
# 3. Pods automatically migrate
Document Version: 1.0
Last Updated: 2024
Author: DevOps Team
Part 2 Preview (Autoscaling & Resilience)
The Autoscaling Problem
Imagine you run an API and suddenly you get 10x traffic. What happens?
The difference? Two lines of configuration.
But here's the catch: autoscaling is easy to implement, hard to configure correctly.
The question we had to answer: What are the right thresholds?
Horizontal Pod Autoscaling (HPA)
What is HPA?
HPA is a Kubernetes controller that automatically scales the number of pods up or down based on metrics (usually CPU utilization).
Here's how it works:
1. Metrics Server collects CPU/Memory from pods (every 15 seconds)
2. HPA controller reads metrics
3. Calculates: desiredReplicas = ceil[currentReplicas × (currentCPU / targetCPU)]
4. Updates Deployment with new replica count
5. Kubernetes scheduler creates/destroys pods
6. Repeat every 15 seconds
The Formula (Explained)
desiredReplicas = ceil[currentReplicas × (currentCPU / targetCPU)]
Let me break this down with a real example:
Scenario: Normal operation, traffic doubles
Before:
Current replicas: 2
Current CPU per pod: 50m (50 millicores)
Target CPU: 70m (our threshold)
desiredReplicas = ceil[2 × (50 / 70)] = ceil[1.43] = 2 pods
Action: No change (we're fine)
After traffic doubles:
Current replicas: 2
Current CPU per pod: 140m (doubled!)
Target CPU: 70m (same threshold)
desiredReplicas = ceil[2 × (140 / 70)] = ceil[4] = 4 pods
Action: Scale from 2 → 4 pods
What happens next:
2 new pods start up (takes ~5-10 seconds)
Load balancer starts sending requests to them
CPU per pod drops from 140m to 70m (load spreads)
System stabilizes
This is the magic. The system rebalances itself.
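The worked example above can be reproduced in a few lines. This is a minimal sketch, not the HPA controller's code: the function name `desired_replicas` and its parameters are illustrative, the min/max defaults are the 2 and 10 used in this document, and the controller's default 10% tolerance band (which suppresses small changes) is deliberately left out.

```python
def desired_replicas(current_replicas: int, current_cpu_m: int, target_cpu_m: int,
                     min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Return ceil(current_replicas * current_cpu_m / target_cpu_m), clamped to [min, max].

    CPU values are per-pod averages in millicores.
    Illustrative helper, not a Kubernetes API.
    """
    if target_cpu_m <= 0:
        raise ValueError("target_cpu_m must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    raw = (current_replicas * current_cpu_m + target_cpu_m - 1) // target_cpu_m
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(2, 50, 70))    # normal operation: stays at the current count
print(desired_replicas(2, 140, 70))   # traffic doubles: replica count doubles
```

Clamping to the min/max bounds is what keeps a runaway metric from requesting an unbounded number of pods.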
Why 70% CPU Target? (Not 50%, not 90%)
This was one of our most debated decisions. Let me show you the analysis:
Option 1: 50% CPU Target
Threshold: Scale when average CPU across pods exceeds 50% of the requested CPU
Pros:
✓ Lots of headroom
✓ Never hits limits
✓ Very responsive to traffic spikes
Cons:
✗ Wastes resources (30-50% idle pods)
✗ Higher cost (~$200/month instead of $141)
✗ Over-provisioned for typical workloads
✗ Money thrown away
Use case: Mission-critical systems (99.99% SLA)
Suitable for us? No
Option 2: 70% CPU Target (CHOSEN)
Threshold: Scale when average CPU across pods exceeds 70% of the requested CPU
Pros:
✓ Good utilization (70% is industry standard)
✓ Proven for API workloads
✓ Cost-effective (~$141/month)
✓ Still provides safety margin (30% headroom)
✓ Minimal waste
Cons:
⚠ Brief spikes above 100% possible (mitigated by fast scaling)
⚠ Slightly less headroom than 50%
Use case: Production APIs with good SLA (99.9%)
Suitable for us? Yes, perfect fit
Scale-up timing: pods are added in the next 15-30 seconds (total: 30-45 seconds)
Option 3: 90% CPU Target
Threshold: Scale when average CPU across pods exceeds 90% of the requested CPU
Pros:
✓ Minimal waste
✓ Cheapest ($120/month)
✓ Maximum utilization
Cons:
✗ Frequent scaling (yo-yo effect)
✗ Insufficient headroom
✗ Risk of cascade failures
✗ Requests can timeout during scaling
✗ Poor user experience
Use case: Non-critical batch jobs
Suitable for us? No, too risky
CPU Utilization vs Scaling Quality
50% CPU →→→→→→→→→→ (Too conservative)
60% CPU →→→→→→→→→→ (Better)
70% CPU →→→→→→→→→→ (OPTIMAL) ✓
80% CPU →→→→→→→→→→ (Risky)
90% CPU →→→→→→→→→→ (Too aggressive)
Sweet spot: 70%
- Good utilization
- Safety margin
- Cost-effective
- Proven in production
Min & Max Replicas Decision
Option 1: 1 replica (minimum for cost)
├─ Cost: Lowest
├─ Availability: Zero HA
│ └─ Pod crashes → service down
│ └─ Pod update → downtime
└─ Risk: Unacceptable for "production-grade"
Option 2: 2 replicas (CHOSEN)
├─ Cost: Minimal ($7/month more)
├─ Availability: High (N-1 redundancy)
│ └─ 1 pod can fail, service continues
│ └─ 1 pod can update, other handles traffic
├─ Load distribution: Spreads requests
└─ Industry standard: Most production APIs
Option 3: 3 replicas (overkill for dev)
├─ Cost: $14/month more
├─ Availability: Very high (N-2 redundancy)
├─ Use case: Mission-critical systems
└─ For us: Unnecessary complexity
Capacity Analysis:
- t2.medium node: budgeted here as 1 core = 1000m CPU (conservative; a t2.medium provides 2 vCPUs)
- Pod request: 100m CPU
- Pods per node: 1000m ÷ 100m = 10 pods
With 5 nodes max:
- Total capacity: 5 nodes × 10 pods = 50 pods max
- But we limit to 10 per deployment
- Reason: Prevents runaway scaling
Scenario: Traffic spike hits hard
Current replicas: 2
New replicas desired: 10 (max limit)
Cost: 10 pods × $X per pod
What if it keeps going? Limited by max=10
Cost ceiling:
10 pods × 100m CPU = 1 vCPU × $0.05/hr = $0.05/hr ≈ $36/month per deployment
Total for 2 deployments: $72/month (vs $68 for the 2-node baseline)
Acceptable cost increase for 10x capacity
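The capacity and cost-ceiling arithmetic above can be checked with a short script. This is a rough sketch using only the figures quoted in this section (1000m budgeted per node, 100m per pod, $0.05 per vCPU-hour, 730 hours per month); the function names are illustrative.

```python
HOURS_PER_MONTH = 730  # month length used for the $/month figures in this document


def pods_per_node(node_cpu_m: int, pod_request_m: int) -> int:
    """How many pods fit on a node when only the CPU request is considered."""
    return node_cpu_m // pod_request_m


def monthly_cpu_cost(pods: int, pod_request_m: int, usd_per_vcpu_hour: float) -> float:
    """Monthly cost of the CPU requested by `pods` pods at a flat per-vCPU rate."""
    vcpus = pods * pod_request_m / 1000
    return vcpus * usd_per_vcpu_hour * HOURS_PER_MONTH


per_node = pods_per_node(1000, 100)
print("pods per node:", per_node)
print("cluster ceiling (5 nodes):", 5 * per_node)
print("cost ceiling per deployment: $%.2f/month" % monthly_cpu_cost(10, 100, 0.05))
```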
Scaling Behavior: Up Fast, Down Slow
Scale Up (Aggressive):
├─ Trigger: CPU > 70% for 3 minutes
├─ Response: Immediate (adds pod every 15s if needed)
├─ Reason: Traffic is likely sustained
└─ Worst case: Add up to 10 pods in 2 minutes
Scale Down (Conservative):
├─ Trigger: CPU < 70% for 5 minutes
├─ Response: Remove 1 pod every 3 minutes
├─ Reason: Prevent yo-yo (scale up/down/up/down)
└─ Benefit: Saves cost gradually, not abruptly
Why asymmetric scaling (fast up, slow down)?
Real-world traffic pattern:
8:00 AM ▁▁▁▁▂▃▅▇█▇▅▃▂▁▁▁▁ (morning surge)
Scale up: 2→4→6→8→10 (good, users happy)
2:00 PM ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ (afternoon lull)
Scale down slowly: 10→9→8→7→6→5→4→3→2
Why slow down?
- Traffic often returns soon (yo-yo prevention)
- Cost of waiting: 5 minutes × $0.05/hr ≈ $0.004 (negligible)
- Stability: More important than saving pennies
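The "up fast, down slow" behavior can be sketched as a stabilization window: scale-up recommendations apply immediately, while a scale-down only goes as low as the highest recommendation seen during the window. This is a simplified model under stated assumptions (a 300-second window, a hypothetical class name `ScaleDownStabilizer`); it does not model the per-step pod removal rate described above.

```python
from collections import deque


class ScaleDownStabilizer:
    """Hold replica count high until low recommendations persist for the whole window."""

    def __init__(self, window_seconds: int = 300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp_seconds, recommended_replicas)

    def stabilize(self, now: float, current: int, recommended: int) -> int:
        self._history.append((now, recommended))
        # Forget recommendations that are older than the window.
        while self._history and self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        if recommended >= current:
            return recommended  # scale up (or hold) immediately
        # Scale down only to the highest recommendation seen inside the window.
        return min(current, max(r for _, r in self._history))


s = ScaleDownStabilizer(window_seconds=300)
print(s.stabilize(now=0, current=10, recommended=10))    # busy: stay at 10
print(s.stabilize(now=60, current=10, recommended=4))    # lull just started: hold
print(s.stabilize(now=301, current=10, recommended=4))   # lull persisted: shrink
```

A brief dip in traffic therefore never removes pods, which is the yo-yo prevention described above.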
Cluster Autoscaling
While HPA scales pods, Cluster Autoscaler scales nodes.
HPA: "We need more pods" → Adds pods
Cluster Autoscaler: "We're out of space for pods" → Adds nodes
How Cluster Autoscaler Works
Detection Loop (every 10 seconds):
1. Check for "Pending" pods (can't be scheduled)
2. Calculate if new node would help
3. Check if adding node violates constraints
4. If safe, launch new EC2 instance
5. Node joins cluster (~2-3 minutes)
6. Pending pods get scheduled
Example scenario:
- 2 nodes running, fully utilized
- 3rd pod tries to schedule
- No space available (Pending state)
- Cluster Autoscaler detects this
- Launches new t2.medium EC2 instance
- Instance joins cluster
- 3rd pod gets scheduled
- Total: 2-3 minutes
Configuration Decisions
Min Nodes: 2 (why 2 and not 1?)
1 Node Setup:
├─ Cost: Cheapest
├─ HA: Zero (1 node failure = total downtime)
├─ Updates: Kill service during node updates
└─ Risk: Unacceptable
2 Nodes Setup:
├─ Cost: Acceptable
├─ HA: N-1 redundancy (1 node can fail)
├─ Updates: Can update 1 node while other runs service
└─ Industry standard for production
3+ Nodes:
├─ Cost: Higher baseline
├─ HA: N-2+ redundancy
├─ Use case: Mission-critical systems
└─ For us: Overkill
Max Nodes: 5
Capacity ceiling:
- 5 nodes × 10 pods per node = 50 pods maximum
- 50 pods × 100m CPU = 5 full cores
- Cost: 5 × t2.medium at ~$35/month each ≈ $172/month (vs $68 for 2)
- Prevents: Runaway cloud costs
Cost protection:
- If somehow pods try to scale to 100
- Node limit prevents creating 10+ nodes
- Caps node cost at 2.5x baseline (5 nodes vs 2)
- Alert triggers at 80% capacity
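The node ceiling can be expressed as a small calculation. This is a sketch under the assumptions used in this section (10 pods per node, 2 to 5 nodes); `nodes_needed` and `unschedulable_pods` are illustrative names, not Cluster Autoscaler functions.

```python
def nodes_needed(total_pods: int, pods_per_node: int = 10,
                 min_nodes: int = 2, max_nodes: int = 5) -> int:
    """Nodes required for `total_pods`, clamped to the node group's min/max size."""
    wanted = (total_pods + pods_per_node - 1) // pods_per_node  # ceiling division
    return max(min_nodes, min(max_nodes, wanted))


def unschedulable_pods(total_pods: int, pods_per_node: int = 10, max_nodes: int = 5) -> int:
    """Pods left Pending once the node ceiling is reached."""
    return max(0, total_pods - pods_per_node * max_nodes)


for pods in (4, 25, 100):
    print(pods, "pods ->", nodes_needed(pods), "nodes,",
          unschedulable_pods(pods), "left pending")
```

With a runaway request for 100 pods, the node count stops at 5 and the surplus pods stay Pending instead of adding cost.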
Scale Down Strategy
Conservative approach:
├─ Node must be < 50% utilized for 10 minutes
├─ Respects PodDisruptionBudgets (won't violate HA)
├─ Drains pods gracefully (30-second window)
├─ Waits for pods to move to other nodes
└─ Only then terminates instance
Why 10 minutes (not 5)?
- Prevents rapid scale down/up cycles
- Traffic often returns (lunch rush, meetings ending)
- Cost: ~$0.008 for a node idling 10 minutes ($0.047/hr × 1/6 hr)
- Worth the stability
Why respect PDBs?
- Ensures minimum availability during scale down
- If 2 pods on node and minAvailable=1
- Only 1 pod can be evicted
- Other stays, preventing node termination
- Protects against "accidental downtime"
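The PDB rule above boils down to one subtraction. A minimal sketch, assuming `minAvailable` is given as an absolute number (the helper names are illustrative):

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of pods that may be evicted right now without violating the budget."""
    return max(0, healthy_pods - min_available)


def can_evict(healthy_pods: int, min_available: int) -> bool:
    """True if at least one more eviction is permitted."""
    return allowed_disruptions(healthy_pods, min_available) > 0


healthy = 2
print(can_evict(healthy, min_available=1))      # first eviction is allowed
healthy -= 1                                    # one pod evicted, replacement not Ready yet
print(can_evict(healthy, min_available=1))      # second eviction must wait
```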
Failure Recovery Design
Here's where autoscaling meets resilience. What happens when pods or nodes actually fail?
Pod Failure Recovery
Scenario: Container crash (app exception, memory leak, etc)
Timeline:
T+0s - Application crashes
- Container exits with code 1
T+1s - Kubelet detects exit
- Restart policy: Always
- Attempts restart
T+5s - Container restarts, starts up
- Health checks begin
T+15s - Readiness probe succeeds
- Load balancer adds back to rotation
T+30s - Pod fully operational
- Traffic flowing again
Recovery time: ~30 seconds
Service impact: 1 pod down, other pod handles traffic
User impact: None (transparent failover)
Protecting against cascading failures:
The key innovation here is the health check configuration:
livenessProbe:
  httpGet:
    path: /health
    port: <container-port>
  initialDelaySeconds: 30 # Wait 30s for app to start
  periodSeconds: 10 # Check every 10s
  failureThreshold: 3 # 3 failures = kill pod
readinessProbe:
  httpGet:
    path: /health
    port: <container-port>
  initialDelaySeconds: 10 # Fast feedback on readiness
  periodSeconds: 5 # Check every 5s
  failureThreshold: 3 # 3 failures = remove from LB
Liveness Probe:
├─ Purpose: Detect hung processes
├─ Action: Kill and restart pod
├─ Latency: Up to 30 seconds (acceptable)
└─ Why slow: Avoids killing healthy pods that are busy
Readiness Probe:
├─ Purpose: Detect when app is ready to serve
├─ Action: Remove from load balancer (don't kill)
├─ Latency: Fast (up to 15 seconds)
└─ Why fast: Users should never hit unready pods
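The latency figures for the two probes follow directly from `periodSeconds × failureThreshold`. A small sketch of that arithmetic (it ignores per-request timeouts, so treat the result as an approximation):

```python
def worst_case_detection_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time for a probe to react once the app stops answering."""
    return period_seconds * failure_threshold


liveness = worst_case_detection_seconds(period_seconds=10, failure_threshold=3)
readiness = worst_case_detection_seconds(period_seconds=5, failure_threshold=3)
print("liveness reacts within ~%ds, readiness within ~%ds" % (liveness, readiness))
```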
Node Failure Recovery
Scenario: Node hardware fails or gets disconnected
Timeline:
T+0s - Network partition or node crash
- Kubernetes doesn't know yet
T+40s - Node heartbeat missing past the grace period
- Node marked "NotReady"
T+5m - Default 5-minute eviction toleration expires
- Controller marks pods for eviction
T+5m30s - Graceful termination begins
- Pods get 30-second shutdown window
- New pods scheduled on healthy nodes
T+6m - Pods forcibly terminated
- If graceful termination didn't work
T+6m30s - Cluster Autoscaler detects
- Unschedulable pods
- Launches new EC2 instance
T+8m - New node fully ready
- Joins cluster
T+8m30s - Pods scheduled and ready
- Load balanced to new node
Total recovery time: 8-10 minutes
Service impact: Temporary 50% capacity loss (if 2 nodes)
User impact: Requests go to remaining node (slower)
Data impact: None (stateless design)
Why 8-10 minutes is acceptable:
RTO vs Cost Trade-off:
Faster recovery (< 5 min):
├─ Requires: Reserved capacity (empty node always ready)
├─ Cost: +$35/month (extra node sitting idle)
├─ Benefit: Faster failover
├─ For us: Not worth it (dev environment)
Current approach (8-10 min):
├─ Cost: Optimal (only pay for what we use)
├─ Recovery: Slow but acceptable
├─ For us: Perfect balance
SLA implications:
├─ 99.9% uptime allows: ~8.76 hours downtime/year
├─ 1 node failure every 10 min: Highly unlikely
├─ Multiple failures/year needed to violate SLA
└─ Current approach sufficient
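The SLA reasoning above can be made concrete. This sketch treats every node failure as a full 10-minute outage, which is a pessimistic assumption because the second node keeps serving traffic; the function names are illustrative.

```python
HOURS_PER_YEAR = 8760


def downtime_budget_hours(sla_percent: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return (1 - sla_percent / 100) * HOURS_PER_YEAR


def incidents_within_budget(sla_percent: float, minutes_per_incident: float) -> int:
    """How many incidents of a given length fit inside the yearly budget."""
    budget_minutes = downtime_budget_hours(sla_percent) * 60
    return int(budget_minutes // minutes_per_incident)


print("99.9%% budget: %.2f hours/year" % downtime_budget_hours(99.9))
print("10-minute recoveries that fit:", incidents_within_budget(99.9, 10))
```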
Availability Zone Failure
Scenario: Entire us-east-1a goes down
Setup:
Node 1 in us-east-1a (2 pods)
Node 2 in us-east-1b (2 pods)
Failure:
T+0s - All us-east-1a infrastructure fails
T+40s - Node 1 marked NotReady
T+6m - Pods on Node 1 evicted
T+6m30s- Cluster Autoscaler launches replacement
T+8m - New node in us-east-1b ready
T+8m30s- Pods scheduled
Impact:
├─ Temporary capacity: 2 pods (down from 4)
├─ Service: Still available (reduced performance)
├─ Traffic: Load concentrated on Node 2
├─ Duration: 8-10 minutes
└─ No data loss (stateless design)
Why we use 2 AZs (not 3):
Cost-benefit analysis:
2 AZs (CHOSEN):
├─ Cost: $141/month
├─ Capacity loss: 50% during AZ failure
├─ Recovery: 8-10 minutes
├─ Suitable for: Development, normal production
└─ Trade-off: Acceptable
3 AZs:
├─ Cost: +$13/month (~11% increase)
├─ Capacity loss: 33% during AZ failure
├─ Recovery: Same 8-10 minutes
├─ Suitable for: Mission-critical, 99.99% SLA
└─ For us: Overkill
Decision: 2 AZs is optimal
Could upgrade to 3 AZs later if needed
Pod Disruption Budgets (PDB)
The final piece of resilience: PDB prevents accidental downtime
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: bird-api-pdb
spec:
  minAvailable: 1 # Always keep at least 1 pod
  selector:
    matchLabels:
      app: bird-api
Scenario: Admin drains a node for maintenance (kubectl drain <node-name>) while both bird-api pods run on it
Without PDB:
├─ Both pods evicted at once
├─ Service completely down
├─ 30-second outage while replacements start
With PDB (minAvailable: 1):
├─ First pod evicted
├─ Second pod protected by PDB
├─ Eviction blocked until a replacement pod is Ready
├─ Service continues
├─ Note: PDBs guard evictions (drain, Cluster Autoscaler); a direct kubectl delete pod bypasses them
Testing Resilience
How do we know this actually works?
We tested it. Here's what we did:
Test 1: Kill a Pod
# Get a pod name
POD_NAME=$(kubectl get pods -l app=bird-api -o jsonpath='{.items[0].metadata.name}')
# Kill it
kubectl delete pod $POD_NAME
# Watch replacement
watch kubectl get pods -n default -l app=bird-api
Test 2: Scale Down and Watch HPA Scale Back Up
# Scale down to 1 pod
kubectl scale deployment bird-api --replicas=1
# Wait 15 seconds
sleep 15
# Check status
kubectl get deployment bird-api
kubectl get hpa -n default
Test 3: Simulate High Load
# Run load generator in pod
kubectl run -i --tty load-gen --image=busybox --restart=Never -- /bin/sh
# Inside pod
while true; do wget -q -O- http://bird-api-service/; done
Test Results Summary
Test Status Recovery Time User Impact
────────────────────────────────────────────────────────────────────
Pod crash ✓ Pass ~10 seconds None
Manual pod deletion ✓ Pass ~10 seconds None
HPA scale up under load ✓ Pass ~30 seconds Minimal
HPA scale down (idle) ✓ Pass ~5 minutes None
Service stays available ✓ Pass N/A Zero downtime
RTO Achievement: < 15 seconds (pod recovery)
RPO Achievement: 0 (stateless design)
Service Availability: 100% during tests
Key Takeaways
1. Autoscaling Requires Right Thresholds
70% CPU is industry standard for APIs
Faster scale-up, slower scale-down
Min/max limits prevent extremes
2. Two-Level Autoscaling Is Essential
3. Resilience Comes from Multiple Layers
4. Recovery Is Automatic
5. Test Your Resilience
6. Cost Follows Traffic
Light traffic: 2 pods, 2 nodes (~$0.19/hr)
Heavy traffic: 10 pods, 5 nodes (~$0.34/hr)
Automatic cost optimization
Configuration Reference
HPA Configuration:
targetCPUUtilizationPercentage: 70
minReplicas: 2
maxReplicas: 10
scaleDownStabilizationWindow: 300s
scaleUpPeriod: 0s
Cluster Autoscaler:
minNodes: 2
maxNodes: 5
scaleDownEnabled: true
scaleDownUtilizationThreshold: 0.5
scaleDownDelayAfterAdd: 10m
Health Checks:
livenessProbe:
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 3
readinessProbe:
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 3
Part 3: Monitoring with CloudWatch - Knowing What to Watch
In Parts 1 and 2, we built a self-healing infrastructure that scales automatically. Now comes the critical question: How do we know when something goes wrong?
This is where monitoring separates "hope-driven development" from production systems.
But here's the paradox: More metrics aren't better. Too many alarms lead to alert fatigue.
This part covers our monitoring strategy: what to measure, why, and how to alert without drowning in notifications.
The Monitoring Problem
Imagine this: You wake up at 2 AM. Your phone is vibrating.
Alert: "Pod memory at 82%"
You check. Service is fine. False alarm. Go back to sleep.
2 more alerts come that night.
By morning, you've snoozed 10 alerts. All false.
Now the real problem happens: Pod memory hits 85% → OOMKill → service down → you don't notice because you've stopped paying attention to alerts.
This is alert fatigue, and it kills production systems.
The solution? Alert only on things that matter.
What Metrics Should We Track?
Not all metrics are equal. Some predict problems, others just report them.
Leading Indicators (predict problems)
CPU utilization (scaling will happen)
Memory usage trend (heading toward OOMKill)
Pod restart count (app is crashing)
Lagging Indicators (report existing problems)
Request latency
Error rate
Node status
We focus on leading indicators because we can act before users are affected.
The Five Metrics We Track
1. Pod CPU Utilization
What it measures: Percentage of CPU core being used
Source: Metrics Server (cluster add-on; must be installed on EKS)
Update frequency: Every 15 seconds
Why it matters: Drives HPA scaling decision
Interpretation:
< 30% → Underutilized, pod is idle
30-70% → Healthy, normal operation
70-80% → Getting hot, HPA may trigger
80-90% → Warning, investigate
> 90% → Critical, possible resource leak
Action if consistently > 80%:
- Increase pod limits? (more CPU per pod)
- Optimize code? (use less CPU)
- Scale more? (lower threshold)
2. Pod Memory Utilization
What it measures: Percentage of RAM being used
Source: Metrics Server
Update frequency: Every 15 seconds
Why it matters: Early warning for OOMKill
Interpretation:
< 30% → Plenty of headroom
30-70% → Normal
70-85% → Getting tight, monitor
> 85% → Warning, action needed
> 100% → OOMKilled (dead)
Why 85% threshold (not 90%)?
Memory is NOT compressible
If you go over limit, pod dies immediately
No graceful degradation like CPU
85% gives ~77Mi buffer before 512Mi limit
Action if consistently > 85%:
- Increase memory limit?
- Check for memory leak?
- Profile application?
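The headroom left by the memory threshold is simple to compute from the 512Mi limit and the 85% threshold stated above. A short sketch (function name is illustrative):

```python
def memory_headroom_mib(limit_mib: float, threshold_fraction: float) -> tuple:
    """Return (usage at which the alarm threshold is crossed, buffer left before the limit)."""
    threshold_mib = limit_mib * threshold_fraction
    return threshold_mib, limit_mib - threshold_mib


threshold, buffer_left = memory_headroom_mib(512, 0.85)
print("alarm at %.1fMi, %.1fMi left before OOMKill" % (threshold, buffer_left))
```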
3. Pod Restart Count
What it measures: How many times pod has restarted
Source: Kubelet
Update frequency: Real-time
Why it matters: Indicates application crashes
Interpretation:
0 restarts → Healthy, never crashed
1-2 times → Occasional issues, investigate
> 5 times → Serious problem, fix immediately
Why > 5 in 5 minutes is critical:
More than 1 restart per minute = crash loop
Application is fundamentally broken
Needs immediate attention
Possible causes:
- OOMKilled (memory leak)
- Application bug (exception)
- Configuration error (bad env var)
- Dependency unavailable (can't reach database)
Investigation steps:
kubectl logs <pod-name> --previous
kubectl describe pod <pod-name>
4. Node Status
What it measures: Is node Ready or NotReady?
Source: Kubernetes API
Detection delay: ~40 seconds (node heartbeat grace period)
Why it matters: Node failure is worst-case scenario
Status values:
Ready → Node healthy, accepting pods
NotReady → Node unhealthy, evicting pods
Unknown → Kubernetes lost contact
When NotReady occurs:
1. Node network failure
2. Kubelet process crashed
3. Disk full on node
4. Node running out of memory
5. EBS volume detached
Recovery:
Cluster Autoscaler detects NotReady
Launches replacement node
Evicts pods to healthy nodes
Terminates failed node
Timeline: 8-10 minutes total recovery
5. Error Rate (Optional, in logs)
What it measures: Percentage of requests that failed
Source: Application logs (parsed)
Update frequency: Every 5 minutes
Why it matters: Indicates service health
Healthy:
Error rate < 0.1% (1 error per 1000 requests)
Warning:
Error rate 0.1% - 1%
Investigate why errors increasing
Critical:
Error rate > 1%
Service is in trouble
Note: We don't have HTTP-level logging in this setup
Could add it later with X-Ray or APM tools
CloudWatch vs Alternatives
Let me show you why we chose CloudWatch over competitors.
Monitoring Options Comparison
CloudWatch │ Native AWS │ $5/month
Prometheus │ Self-hosted │ Ops effort
Datadog │ SaaS │ $15-30/host
New Relic │ SaaS │ $40+/month
Splunk │ Self-hosted/SaaS │ $$$$
Detailed Comparison: CloudWatch vs Prometheus
Feature             CloudWatch      Prometheus
─────────────────────────────────────────────────
Setup time          5 minutes       2 hours
Infrastructure      Managed         Self-hosted
Query language      CloudWatch      PromQL
Dashboard quality   Good            Excellent
Alerting            Built-in        AlertManager
Cost                $5/month        $20-40/month
                    (baseline)      (EC2 + monitoring)
Best for:
CloudWatch ↓
- AWS-native setup
- Minimal ops burden
- Quick deployment
- Good enough queries
Prometheus ↓
- Multi-cloud setup
- Powerful queries (PromQL)
- Advanced dashboards
- Complete control
Why We Chose CloudWatch
Decision Matrix:
1. Already on AWS
→ CloudWatch integrates natively
→ No additional infrastructure
→ Cost: Included baseline
2. Small-scale operation (2 pods)
→ CloudWatch sufficient
→ PromQL power not needed
→ Log Insights queries adequate
3. Cost-conscious
→ CloudWatch: ~$5/month
→ Prometheus + Grafana: ~$25/month
→ Savings: $20/month (14% total cost)
4. Minimal ops overhead
→ CloudWatch: Set it and forget
→ Prometheus: Manage Prometheus server, storage, Alertmanager, etc
→ No operational burden preferred
Decision: CloudWatch for this scale
Future migration path: Easy to add Prometheus if needed
The Four Alarms
We configured exactly 4 CloudWatch alarms. No more, no less.
Why 4? Because each alerts on something that requires action.
Alarm 1: Pod CPU High
Configuration:
Metric: Pod CPU Utilization (average)
Threshold: > 80%
Duration: 2 periods × 5 min = 10 minutes total
Action: Send SNS notification (→ Slack, email)
Trigger scenario:
T+0 min: Pod CPU hits 81%
T+5 min: Still at 85% (first period met)
T+10 min: Still at 82% (second period met)
→ ALARM FIRES
Why 80% (not 70%)?
70% is HPA trigger (automatic scaling)
80% means HPA is in progress but not enough yet
Need to investigate if:
- Resource limits too low?
- Code has efficiency issue?
- Legitimate traffic spike?
Why 2 periods (10 minutes)?
Prevents false alarms from brief spikes
Gives HPA time to scale
Only alert if sustained high CPU
Investigation steps:
1. Check HPA status (kubectl get hpa)
2. Check pod count (kubectl get pods)
3. Check application logs (kubectl logs)
4. Check recent code changes
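The "2 periods × 5 minutes" rule used by this alarm can be sketched as a check over the most recent period averages. This is a simplified model, not CloudWatch's evaluation engine; `alarm_fires` and its defaults (80% threshold, 2 evaluation periods) mirror the configuration above.

```python
def alarm_fires(period_averages, threshold: float = 80.0, evaluation_periods: int = 2) -> bool:
    """True when the last `evaluation_periods` 5-minute averages all exceed the threshold."""
    if len(period_averages) < evaluation_periods:
        return False  # not enough data yet
    recent = period_averages[-evaluation_periods:]
    return all(value > threshold for value in recent)


print(alarm_fires([60, 85, 82]))   # two consecutive hot periods
print(alarm_fires([85, 60, 82]))   # a brief spike, then recovery, then one hot period
```

Requiring consecutive breaches is what filters out short spikes that the HPA absorbs on its own.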
Alarm 2: Pod Memory High
Configuration:
Metric: Pod Memory Utilization (average)
Threshold: > 85%
Duration: 2 periods × 5 min = 10 minutes
Action: Send SNS notification
Trigger scenario:
Pod allocated: 512Mi
Pod using: 435Mi (85% of limit)
After 10 minutes at this level
→ ALARM FIRES
Why 85% (not 90%)?
Memory can't be compressed
If you hit 100%, pod gets OOMKilled immediately
85% gives ~77Mi buffer before disaster
More conservative than CPU threshold
Why NOT auto-scale memory?
Our HPA is configured to scale on CPU only (adding pods does not fix a leak)
VPA (Vertical Pod Autoscaler) can resize memory
But VPA typically restarts pods to apply new sizes (risking disruption)
Better to fix the leak than auto-resize
Investigation steps:
1. Check memory trend (increasing or stable?)
2. Look for memory leak (kubectl logs)
3. Check database connections (each ≈ 1-2Mi)
4. Profile application memory usage
5. If leak found: deploy fixed version
Alarm 3: Node Not Ready
Configuration:
Metric: Node Status
Threshold: status != "Ready"
Duration: 1 period (no delay)
Action: URGENT - Send SNS, may trigger on-call
Trigger scenario:
Node heartbeat stops
Kubernetes waits out the ~40-second grace period
Marks node "NotReady"
→ ALARM FIRES IMMEDIATELY
Why no delay?
Node failure is urgent
Every second matters
Need immediate attention
Cluster Autoscaler should handle, but alert anyway
What this means:
Hardware failed
Network disconnected
Kubernetes components crashed
Something serious
Immediate action:
1. Check node status: kubectl describe node <name>
2. Check AWS console (is instance running?)
3. Check Cluster Autoscaler logs
4. Manual intervention if CA doesn't fix it
Expected resolution:
Cluster Autoscaler detects and replaces
Takes 8-10 minutes
Manual check confirms recovery
Alarm 4: Pod Restarts High
Configuration:
Metric: Pod Restart Count
Threshold: > 5 in 5 minutes
Duration: 1 period (immediate)
Action: Send SNS notification
Trigger scenario:
8:00:00 - Pod restart #1
8:00:50 - Pod restart #2
8:01:40 - Pod restart #3
8:02:30 - Pod restart #4
8:03:20 - Pod restart #5
8:04:10 - Pod restart #6
→ ALARM FIRES (> 5 in 5 minutes)
What this indicates:
Application crash loop
The pod is crashing faster than it can recover
Something is fundamentally wrong
Not a temporary issue
Root causes (in order of likelihood):
1. Out of memory (OOMKill) - 40%
2. Application bug (unhandled exception) - 30%
3. Dependency unavailable (can't reach DB) - 20%
4. Configuration error (bad env var) - 10%
Investigation:
# Check logs from crashed pod
kubectl logs <pod-name> --previous
# Check events
kubectl describe pod <pod-name>
# Is it OOMKilled?
kubectl get pod <pod-name> -o yaml | grep -i "OOMKilled"
# Check recent deployments
kubectl rollout history deployment/bird-api
Resolution options:
1. Increase memory limit (if OOMKilled)
2. Rollback to previous version (if app bug)
3. Fix configuration (if env var wrong)
4. Restart dependent service (if dependency issue)
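The "more than 5 restarts in 5 minutes" condition is a sliding-window count. A minimal sketch, assuming restart times are available as seconds since some reference point (the function names are illustrative, not a CloudWatch or Kubernetes API):

```python
def restarts_in_window(restart_times, now: float, window_seconds: int = 300) -> int:
    """Count restarts that happened in the last `window_seconds`, up to and including `now`."""
    return sum(1 for t in restart_times if now - window_seconds < t <= now)


def crash_loop_alarm(restart_times, now: float, threshold: int = 5,
                     window_seconds: int = 300) -> bool:
    """True when the restart count in the window is strictly greater than the threshold."""
    return restarts_in_window(restart_times, now, window_seconds) > threshold


six_fast_restarts = [0, 50, 100, 150, 200, 250]   # one restart every 50 seconds
five_slow_restarts = [0, 60, 120, 180, 240]       # one restart per minute
print(crash_loop_alarm(six_fast_restarts, now=250))
print(crash_loop_alarm(five_slow_restarts, now=240))
```

Because the comparison is strict, exactly five restarts in the window does not fire; the sixth one does.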
Alert Fatigue Prevention
Here's what we intentionally don't alert on:
NOT alerting on these (prevents noise):
❌ CPU < 70% (normal, why alert?)
❌ Memory < 85% (healthy)
❌ Pod restarts = 0 (good thing!)
❌ Node count = 2-3 (normal range)
❌ Request latency < 500ms (acceptable)
❌ Error rate < 0.1% (normal)
❌ Pod not ready for 30 seconds (might be starting)
Why not?
These are all NORMAL states
Alerting on normal causes "noise"
Teams stop paying attention
Real problems get missed
The goal: Alert only when action is needed
Log Insights Queries
CloudWatch Logs are useless without ways to query them. Here are the queries we actually use:
Query 1: Recent Errors
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as error_count by @message
| sort error_count desc
What it does: Finds all errors in logs, groups by message, sorts by frequency
Use case: Something goes wrong, want to know what
Query 2: Pod Restart Timeline
fields @timestamp, kubernetes.pod_name, @message
| filter @message like /restarted/
| stats count() as restart_count by kubernetes.pod_name, @timestamp
| sort @timestamp desc
What it does: Shows when pods restarted and how many times
Use case: Debugging the "pod_restarts_high" alarm
Query 3: Performance Analysis
fields response_time
| stats avg(response_time) as avg_rt,
pct(response_time, 95) as p95_rt,
pct(response_time, 99) as p99_rt
What it does: Calculates average, 95th percentile, 99th percentile latency
Use case: Trending performance degradation
Query 4: Traffic Volume
fields @timestamp
| stats count() as request_count by bin(5m)
| sort request_count desc
What it does: Shows requests per time period
Use case: Correlate with alerts (was there traffic spike?)
Cost of Monitoring
How much does CloudWatch actually cost?
CloudWatch Pricing:
├─ Logs ingested: $0.50 per GB
│ ├─ Our logs: ~100MB/day = 3GB/month
│ ├─ Cost: 3GB × $0.50 = $1.50/month
│ └─ With retention: 7 days
│
├─ Metrics: $0.30 per custom metric per month
│ ├─ We use 5 metrics (built-in)
│ ├─ Cost: $0 (built-in metrics free)
│ └─ Only pay for custom metrics we add
│
├─ Dashboards: Free
│ └─ Cost: $0
│
├─ Alarms: $0.10 per alarm per month
│ ├─ 4 alarms configured
│ ├─ Cost: 4 × $0.10 = $0.40/month
│ └─ Cheap insurance
│
└─ Total: ~$2-5/month
Compare to alternatives:
Prometheus: $25-40/month (self-hosted EC2)
Datadog: $40-100/month (SaaS)
New Relic: $100+/month
CloudWatch: $5/month (best deal for AWS workloads)
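The CloudWatch bill above can be reproduced from its three inputs. A rough sketch using the unit prices quoted in this section (actual AWS pricing varies by region and changes over time, so treat the constants as assumptions):

```python
LOG_INGEST_USD_PER_GB = 0.50          # figure quoted in this section
ALARM_USD_PER_MONTH = 0.10            # figure quoted in this section
CUSTOM_METRIC_USD_PER_MONTH = 0.30    # figure quoted in this section


def cloudwatch_monthly_cost(log_gb: float, alarms: int, custom_metrics: int = 0) -> float:
    """Estimated monthly CloudWatch cost for logs, alarms, and custom metrics."""
    return (log_gb * LOG_INGEST_USD_PER_GB
            + alarms * ALARM_USD_PER_MONTH
            + custom_metrics * CUSTOM_METRIC_USD_PER_MONTH)


print("estimated CloudWatch cost: $%.2f/month" % cloudwatch_monthly_cost(log_gb=3, alarms=4))
```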
Key Takeaways
1. Measure the Right Things
Focus on leading indicators (CPU, memory, restarts)
Avoid lagging indicators (latency, error rate) until at scale
Track only what you'll act on
2. Alert on Signals, Not Noise
4 alarms, not 40
Only alert when action is needed
Prevent alert fatigue
3. CloudWatch is Good Enough
Native AWS integration
Cheap ($5/month)
Sufficient for development
Scale to Prometheus when needed
4. Logs Are Searchable
Log Insights queries enable debugging
Store logs 7-14 days
Delete old logs to control costs
5. Test Your Alarms
Don't assume they work
Actually trigger alert conditions
Make sure notifications work (Slack/email)
Document what each alarm means
Quick Reference: Our 4 Alarms
Alarms:
pod_cpu_high:
Threshold: > 80%
Duration: 10 minutes
Action: Investigate why HPA didn't prevent
pod_memory_high:
Threshold: > 85%
Duration: 10 minutes
Action: Increase memory or fix leak
node_not_ready:
Threshold: status != Ready
Duration: Immediate
Action: Check if Cluster Autoscaler fixes it
pod_restarts_high:
Threshold: > 5 in 5 minutes
Duration: Immediate
Action: Check logs for OOM or exceptions
Part 4: Cost Optimization - Running Kubernetes on AWS Without Breaking the Bank
You've built a production-grade infrastructure. It scales automatically. It monitors itself. It recovers from failures.
Now the hard question: Is it costing too much?
Our current setup runs ~$155 per month. For a development environment, that's reasonable. But for a startup, every dollar matters.
This part covers cost optimization strategies: from quick wins ($50/month savings) to major architecture changes ($100/month savings).
Current Cost Breakdown
Let's see where every dollar goes:
Monthly Cost Analysis (30 days)
EKS Control Plane: $73.00/month
EC2 Worker Nodes: $68.62/month (2 × t2.medium)
Network Load Balancers: $8.76/month (2 × NLB)
CloudFront CDN: ~$0.85/month (~10GB/month)
S3 Storage: ~$0.50/month (state + logs)
DynamoDB: ~$0.50/month (state locking)
CloudWatch: ~$1.90/month (logs + alarms)
Data Transfer: ~$0.23/month (NAT Gateway)
─────────────────────────────────────────
TOTAL MONTHLY COST: ~$154
─────────────────────────────────────────
Breakdown by component:
EC2 Nodes: 44% ($68.62)
EKS Control: 47% ($73.00)
Load Balancers: 6% ($8.76)
Everything else: 3% (~$4.00)
Where to optimize?
EC2 Nodes (44%): Biggest opportunity
EKS Control (47%): Can't change (managed)
Load Balancers (6%): Can consolidate later
Everything else (3%): Already cheap
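Summing the line items listed above gives the total and each component's share. A small sketch that uses only the dollar amounts from the monthly cost analysis (the dictionary keys are shortened labels):

```python
monthly_costs = {
    "EKS control plane": 73.00,
    "EC2 worker nodes": 68.62,
    "Network load balancers": 8.76,
    "CloudFront": 0.85,
    "S3": 0.50,
    "DynamoDB": 0.50,
    "CloudWatch": 1.90,
    "Data transfer": 0.23,
}


def cost_shares(costs: dict) -> dict:
    """Map each line item to its percentage of the total."""
    total = sum(costs.values())
    return {name: 100 * amount / total for name, amount in costs.items()}


total = sum(monthly_costs.values())
print("total: $%.2f/month" % total)
for name, share in sorted(cost_shares(monthly_costs).items(), key=lambda kv: -kv[1]):
    print("%-24s %5.1f%%" % (name, share))
```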
Cost Optimization Strategies
Now let's look at concrete ways to reduce costs.
Strategy 1: Switch to ECS Fargate (Saves ~$20-30/month)
Current Setup:
2 t2.medium nodes: $68.62/month
Each pod: 100m CPU, 128Mi memory
Fargate Alternative:
Pay per pod: $0.05 per CPU per hour
2 pods average: 100m CPU = 0.1 CPU
Cost: 0.1 × $0.05 × 730 = $3.65/month
Memory: ~0.5GB × $0.005/GB-hour × 730 ≈ $1.83/month
Total for 2 pods: ~$5.50/month
With overhead (extra pods during scaling): ~$40-50/month
Savings: $20-30/month (20% reduction)
Trade-offs:
Fargate Pros:
✓ No node management
✓ No Cluster Autoscaler needed
✓ Cheaper for small workloads
✓ Automatic scaling
✓ Simpler operations
Fargate Cons:
✗ Can't SSH into nodes
✗ Limited customization
✗ Slightly higher per-pod cost at scale
✗ Not all EKS features available
When to switch:
- If you have < 20 pods
- If you need zero node management
- If cost > operational complexity
Decision: Could implement now for 20% savings
Strategy 2: Use Spot Instances (Saves ~$48/month)
Current Setup:
2 t2.medium on-demand: $0.0470/hour each = $68.62/month
Fully available 24/7
Spot Alternative:
2 t2.medium spot: ~$0.0140/hour each = $20.44/month
Discount: 70% cheaper
Risk: Can be reclaimed by AWS with only a 2-minute notice
Savings: $48/month (70% reduction)
Trade-offs:
Spot Pros:
✓ 70% cheaper than on-demand
✓ Designed for fault-tolerant workloads
✓ Perfect for Kubernetes (auto-healing)
✓ Auto Scaling Group handles replacement
✓ Transparent to application
Spot Cons:
✗ 2-minute interruption notice
✗ Brief traffic spike to remaining nodes
✗ Not suitable for stateful workloads
✗ Need to handle rapid restarts
Kubernetes fit:
Spot PERFECT for Kubernetes
- Auto Scaling Group replaces instantly
- Pods rescheduled immediately
- 99.9% uptime achievable
- Users might not notice 30s disruption
Risk analysis:
Spot interruption frequency: ~2-3/month/instance
Each disruption: 30-60 seconds
Impact: Temporary load spike, no data loss
For our 2-node setup:
- Expected interruptions: 4-6/month total (2-3 per instance)
- Each adds 30s latency to requests
- Acceptable for development
- Unacceptable for mission-critical
Decision: Implement Spot + On-Demand mix (1 spot, 1 on-demand) = $24/month savings
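The Spot figures can be checked with the hourly rates quoted in this strategy. A minimal sketch, assuming 730 hours per month and treating the Spot rate as constant (in practice it fluctuates):

```python
HOURS_PER_MONTH = 730
ON_DEMAND_USD_PER_HOUR = 0.0470   # t2.medium rate quoted above
SPOT_USD_PER_HOUR = 0.0140        # approximate Spot rate quoted above


def monthly_node_cost(on_demand_nodes: int, spot_nodes: int) -> float:
    """Monthly EC2 cost for a mix of on-demand and Spot nodes."""
    hourly = on_demand_nodes * ON_DEMAND_USD_PER_HOUR + spot_nodes * SPOT_USD_PER_HOUR
    return hourly * HOURS_PER_MONTH


baseline = monthly_node_cost(on_demand_nodes=2, spot_nodes=0)
for od, spot in ((2, 0), (1, 1), (0, 2)):
    cost = monthly_node_cost(od, spot)
    print("%d on-demand + %d spot: $%.2f/month (saves $%.2f)" % (od, spot, cost, baseline - cost))
```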
Strategy 3: Consolidate Load Balancers (Saves ~$15/month)
Current Setup:
- 2 Network Load Balancers: ~$8.76/month (but with LCU charges: ~$50-60/month total)
ALB Alternative:
- 1 Application Load Balancer: ~$35-40/month (with LCU charges)
Savings: ~$15-20/month (but requires redesign)
Trade-offs:
Consolidation Pros:
✓ Single load balancer (simpler)
✓ URL-based routing (future-proof)
✓ Saves $15-20/month
Consolidation Cons:
✗ Requires Ingress controller setup
✗ Requires rewriting service definitions
✗ More complexity for 2 services
✗ Worth it only with 5+ services
Decision: Keep 2 NLBs for now
Re-evaluate when adding 3rd service
Strategy 4: Reduce CloudWatch Retention (Saves ~$0.80/month)
Current Setup:
7-day retention: ~$1.50/month
Covers debugging window
Reduced Retention:
3-day retention: ~$0.70/month
Still covers most issues
Savings: ~$0.80/month
Trade-offs:
Less savings than you'd think:
- 7 days: $1.50/month
- 3 days: $0.70/month
- Savings: Only $0.80/month
Debugging benefit:
- 7 days: Cover entire week of issues
- 3 days: Miss issues found late in week
Decision: Keep 7 days
Cost savings too minimal to matter
Strategy 5: Reserved Instances (Saves ~$40/month, 1-year commitment)
Current Setup:
- 2 t2.medium on-demand: $68.62/month
Reserved Instance Alternative:
1-year commitment: ~$350 upfront per instance
2 instances: ~$700 upfront
Per-hour cost: ~$0.04 (33% discount)
Monthly: ~$29.20
Savings: $39.42/month (but requires $700 upfront)
Break-even:
$700 ÷ $39.42/month = 17.8 months
But you save from month 1
Typical payback: 18 months
Trade-offs:
RI Pros:
✓ 33% discount
✓ Lock in price (no increase)
✓ Good for stable workloads
RI Cons:
✗ $700 upfront cost
✗ 1-year commitment
✗ Can't change instance type
✗ May have unused capacity
When to use:
- Production workloads (stable)
- Confident infrastructure won't change
- Budget available for upfront cost
For us:
- Not yet (still optimizing setup)
- Revisit in 6 months when stable
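The break-even arithmetic can be reproduced with a short sketch. All dollar figures are the ones quoted in this section; the function name is hypothetical.

```python
def ri_break_even_months(upfront_cost, monthly_savings):
    """Months needed for monthly savings to repay the upfront payment."""
    if monthly_savings <= 0:
        raise ValueError("monthly_savings must be positive")
    return upfront_cost / monthly_savings


on_demand_monthly = 68.62   # 2 t2.medium on-demand, from the text
reserved_monthly = 29.20    # reserved monthly figure, from the text
upfront = 700.00            # 2 instances x ~$350

savings = on_demand_monthly - reserved_monthly
months = ri_break_even_months(upfront, savings)
print(f"savings ${savings:.2f}/month, break-even after {months:.1f} months")
print("pays back within 12-month term:", months <= 12)
```

Comparing the result against the commitment length is a useful sanity check before buying: a payback period longer than the term means the quoted prices should be re-verified.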
Break-Even Analysis
Which optimizations are worth doing?
Optimization Strategy   Savings     Effort   ROI
────────────────────────────────────────────────────
Spot instances          $48/mo      Medium   High ✓
ECS Fargate             $20/mo      High     Medium
Consolidate LBs         $15/mo      High     Low
Reduce logs             $0.80/mo    Low      Very Low
Reserved instances      $40/mo      None     High* (with commitment)
Scheduled scaling       $15/mo      High     Low
* Requires upfront commitment
My Recommendation:
Implement now (Quick wins):
Switch 1 node to Spot: -$24/month (low effort)
Total savings: $24/month
Implement in 3 months (If stable):
Both nodes Spot: -$48/month total
Total savings: $48/month (30% reduction)
Implement in 6 months (If proven stable):
1-year Reserved Instances: -$40/month
Combination: -$88/month total (55% reduction)
Not worth doing:
Reduce CloudWatch logs (savings too small)
Complex scheduling (effort > benefit)
Consolidate LBs (wait until 5+ services)
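The percentage reductions quoted in the recommendation can be checked with a few lines. This sketch assumes the baseline is the upper end of the $140-160 monthly range, which is the value that reproduces the 30% and 55% figures.

```python
def percent_reduction(monthly_savings, baseline_monthly):
    """Savings expressed as a percentage of the baseline bill."""
    return 100.0 * monthly_savings / baseline_monthly


baseline = 160.0  # assumption: upper end of the $140-160 range
for label, saved in [("1 Spot node", 24), ("2 Spot nodes", 48), ("Spot + RI", 88)]:
    print(f"{label}: ${saved}/month = {percent_reduction(saved, baseline):.0f}%")
```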
When to Optimize
Here's my philosophy: Don't prematurely optimize.
Development phase (current):
Goal: Get it working
Cost: Secondary concern
Strategy: Use on-demand
Monthly: $140-160
Early production (6 months in):
Goal: Prove business model
Cost: Important but not critical
Strategy: Add Spot instances
Monthly: $90-110
Scaling production (12+ months):
Goal: Maximize profitability
Cost: Primary concern
Strategy: Reserved instances, Fargate, optimization
Monthly: $40-60
Hypergrowth (2+ years in):
Goal: Minimize per-customer cost
Cost: Extreme optimization needed
Strategy: Multi-region, custom infra, internal tools
Monthly: $0.05-0.10 per customer
Our current phase: Development
Optimize when reaching next phase
Key Takeaways
1. Know Your Costs
EC2 is the largest cost you can optimize (~$69 of ~$140/month)
EKS control plane + LBs = fixed costs
Monitoring/storage = negligible
2. Spot Instances are Gold for Kubernetes
70% cheaper than on-demand
Perfect for auto-healing architecture
Minimal operational complexity
Highest ROI optimization
3. Fargate is Great, But Check the Math
Better for <20 pods
More expensive at scale (above 100 pods)
Better for zero-ops teams
Lower cost per pod when small
4. Reserved Instances = 33% Discount
Requires commitment
Requires stable workload
Worth it after 1 year of operation
Not worth it while optimizing
5. Don't Optimize for Scale You Don't Have
Consolidating 2 LBs saves $15/month
But requires architecture change
Wait until 5+ services to consolidate
6. Monitoring and Storage are Cheap
CloudWatch: $5/month
S3: $0.50/month
Not worth extreme cost optimization
Focus on compute costs (EC2)
Cost Optimization Roadmap
Month 0 (Current):
Cost: ~$140/month
Setup: 2 on-demand nodes
Status: Baseline
Month 3:
Action: Add 1 Spot node (mixed)
Cost: ~$110/month
Savings: $30/month
Month 6:
Action: Make 2nd node Spot
Cost: ~$90/month
Savings: $50/month
Month 12:
Action: Reserved instances for 1 node
Action: Keep 1 Spot for flexibility
Cost: ~$50/month
Savings: $90/month
At this point, re-evaluate:
- Consider Fargate if < 20 pods
- Consider Karpenter for better scaling
- Consider multi-region if global
- Consider custom infrastructure if >>$10k/month
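To see what the roadmap is worth over the first year, the following sketch sums the monthly cost at each step. The step costs are the approximate figures from the roadmap; the helper names are illustrative.

```python
# (start_month, monthly_cost) steps taken from the roadmap above
ROADMAP = [(0, 140), (3, 110), (6, 90), (12, 50)]


def cost_in_month(month, roadmap=ROADMAP):
    """Monthly cost in effect during the given month (0-based)."""
    cost = roadmap[0][1]
    for start, monthly in roadmap:
        if month >= start:
            cost = monthly
    return cost


def total_cost(months, roadmap=ROADMAP):
    """Total spend over the first `months` months of the roadmap."""
    return sum(cost_in_month(m, roadmap) for m in range(months))


first_year = total_cost(12)
baseline = 12 * 140
print(f"first year: ${first_year} vs ${baseline} baseline, saved ${baseline - first_year}")
```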
Quick Reference: Optimization Options
Spot Instances:
Cost: $24-48/month savings (20-30% reduction)
Risk: Low (Kubernetes handles it)
Effort: Medium (test first)
Payoff: 1 month
ECS Fargate:
Cost: $20-30/month savings
Risk: Medium (architectural change)
Effort: High (full redesign)
Payoff: 3-6 months
Reserved Instances:
Cost: $40/month savings
Risk: None (committed cost)
Effort: None
Payoff: 17 months (with $700 upfront)
Consolidate LBs:
Cost: $15/month savings
Risk: Low (add Ingress controller)
Effort: Medium (redesign services)
Payoff: 6+ months (not worth it now)
Part 5: Multi-Region & Disaster Recovery - Preparing for the Worst
The infrastructure we've built scales automatically, recovers from failures, and costs ~$140/month.
But what if AWS us-east-1 (the entire region) goes down?
This part covers preparing for the worst-case scenario: a complete region failure.
The Disaster Recovery Problem
Let's be honest: Complete regional outages are rare.
In AWS history:
Major us-east-1 disruptions: rare but recurring (for example 2012, 2017, 2021)
Minor outages: 2-3 per year per region
Frequency: ~0.05-0.1% of the time
But when they happen, they're catastrophic.
If your entire infrastructure goes down:
Downtime: 2-6 hours (typical regional recovery time)
Data loss: Maybe 0 (depending on strategy)
Revenue loss: Potentially 100%
The question: Is multi-region worth the complexity and cost?
Answer: It depends.
Scenario 1: Startup (<$100k revenue/year)
├─ Multi-region cost: +$140/month = $1,680/year
├─ Regional outage impact: ~$1000 revenue lost
├─ Break-even: ~2 outages per year (unlikely)
└─ Recommendation: Single region (now), multi-region (later)
Scenario 2: Growth stage ($1M+ revenue/year)
├─ Multi-region cost: +$140/month = ~0.17% of revenue
├─ Regional outage impact: ~$5000+ revenue lost
├─ Break-even: ~1 outage every 3 years
└─ Recommendation: Implement multi-region NOW
Scenario 3: Enterprise (>$10M revenue/year)
├─ Multi-region cost: 0.17% of revenue
├─ Regional outage impact: $50k+ lost
├─ Break-even: Days
├─ Competitors: Already multi-region
└─ Recommendation: Multi-region + multi-cloud required
For this project (learning/development):
├─ Single region optimal
├─ Multi-region design patterns important (knowledge transfer)
└─ Worth understanding even if not implementing now
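The break-even reasoning in these scenarios reduces to one division: annual multi-region cost over revenue lost per outage. The sketch below uses the per-outage losses from the scenarios and counts infrastructure cost only; the function name is hypothetical.

```python
def outages_to_break_even(annual_cost, loss_per_outage):
    """Regional outages per year at which multi-region pays for itself."""
    return annual_cost / loss_per_outage


annual_cost = 140 * 12  # second region, infrastructure only
for name, loss in [("Startup", 1_000), ("Growth", 5_000), ("Enterprise", 50_000)]:
    n = outages_to_break_even(annual_cost, loss)
    print(f"{name}: break-even at {n:.2f} outages/year")
```

A value above 1 means several regional outages per year would be needed to justify the spend; a value well below 1 means a single outage every few years already does.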
Multi-Region Architecture
What Does Multi-Region Mean?
Single Region:
Everything in us-east-1
├─ EKS cluster in us-east-1
├─ Database in us-east-1
├─ S3 buckets in us-east-1
└─ If us-east-1 down: SERVICE DOWN
Multi-Region:
Clusters in multiple regions
├─ Primary: us-east-1
├─ Secondary: us-west-2
├─ Database: Replicated across regions
├─ S3: Cross-region replication
└─ If us-east-1 down: Failover to us-west-2
Basic Multi-Region Setup
   ┌──────────────────────────────────┐
   │      Route 53 (Global DNS)       │
   │   Geo-routing + Health checks    │
   └────────────────┬─────────────────┘
                    │
         ┌──────────┴───────────┐
         │                      │
┌────────▼────────┐    ┌────────▼────────┐
│ us-east-1       │    │ us-west-2       │
│                 │    │                 │
│ • EKS Cluster   │    │ • EKS Cluster   │
│ • 2 Nodes       │    │ • 2 Nodes       │
│ • LB 1 & 2      │    │ • LB 1 & 2      │
│ • CloudFront    │    │ • CloudFront    │
│   (distributed) │    │   (distributed) │
│ • Database      │    │ • Database      │
│   Primary       │    │   Replica       │
└─────────────────┘    └─────────────────┘
     (Primary)             (Secondary)
         │                      │
         └──────────────────────┘
           Replicated data sync
Database Replication Strategy
This is the hard part.
Option 1: Active-Passive (Primary-Replica)
├─ Primary: us-east-1 (read/write)
├─ Replica: us-west-2 (read-only, data sync delay)
├─ Failover: Manual or automated promotion of the replica to read/write
├─ RTO: 5-30 minutes
├─ RPO: 0-5 minutes (depends on sync frequency)
Option 2: Active-Active (Multi-Master)
├─ us-east-1: Read/write
├─ us-west-2: Read/write
├─ Replication: Bi-directional
├─ Conflict resolution: Application logic
├─ RTO: 0 minutes (already serving users)
├─ RPO: Near 0 (cross-region replication is usually asynchronous, so in-flight writes can be lost)
├─ Complexity: High (conflict handling)
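Option 3: Managed (AWS Multi-AZ)
├─ Database service: RDS, DynamoDB, etc
├─ Scope: Protects against an Availability Zone failure within one region, not a full regional outage
├─ Synchronous replication
├─ Automatic failover
├─ RTO: 30-60 seconds
├─ RPO: 0 (no data loss)
├─ Cost: Usually 1.5-2x a single-instance deployment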
For stateless APIs (our case):
├─ Database is only state
├─ If you use DynamoDB: Global Tables = Active-Active
├─ If you use RDS: Read replicas + failover
├─ Best: Managed multi-region database
Active-Active vs Active-Passive
The biggest decision: Should both regions serve traffic simultaneously?
Active-Passive (Simpler, Cheaper)
Normal operation:
us-east-1: Serving 100% of traffic
us-west-2: Idle (replicating data)
Failure (us-east-1 down):
1. Health check fails
2. Route 53 detects failure
3. Failover to us-west-2
4. Users redirected to us-west-2
Timeline:
T+0s - Failure occurs
T+30s - Health check timeout
T+60s - Route 53 fails over
T+120s - Users can connect to us-west-2
Recovery time (RTO): 2 minutes
Data loss (RPO): 0-5 minutes (depending on sync)
Cost: 2x (second region idle most of the time)
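The timeline above can be approximated from three settings: how often the health check runs, how many consecutive failures mark the endpoint unhealthy, and the DNS record TTL. The sketch below is a rough estimate under those assumptions; real failover time also depends on resolver caching and client retry behaviour, and the function name is illustrative.

```python
def estimate_failover_seconds(check_interval, failure_threshold, dns_ttl):
    """Rough estimate: detection time plus one DNS TTL for cached answers to expire."""
    detection = check_interval * failure_threshold
    return detection + dns_ttl


# Assumed settings: 10-second checks, 3 consecutive failures, 60-second TTL.
fast = estimate_failover_seconds(10, 3, 60)
# Slower settings: 30-second checks with the same threshold and TTL.
standard = estimate_failover_seconds(30, 3, 60)
print(f"fast checks: ~{fast}s, standard checks: ~{standard}s")
```

Lowering the TTL or the check interval shortens the estimate, at the cost of more DNS queries and health check traffic.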
Trade-offs:
Active-Passive Pros:
✓ Simple failover logic
✓ No cross-region replication complexity
✓ Cheaper than active-active
✓ Easy to test (just test failover)
Active-Passive Cons:
✗ Idle capacity (wasted money)
✗ Failover delay (2-5 minutes)
✗ User experience during failover
✗ Data loss possible (depends on sync)
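For active-passive routing, Route 53 failover records come in a PRIMARY/SECONDARY pair. The sketch below only builds the change batch as a plain dictionary; the hostnames and health check ID are placeholders, not values from this project. With boto3, a dictionary of this shape is what the Route 53 client's change_resource_record_sets call takes as its ChangeBatch argument.

```python
import json


def failover_record(name, target, role, health_check_id=None, ttl=60):
    """Build one Route 53 failover record (role is 'PRIMARY' or 'SECONDARY')."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# Hypothetical hostnames and health check ID, for illustration only.
change_batch = {
    "Changes": [
        failover_record("api.example.com", "nlb-east.example.com", "PRIMARY", "hc-primary-id"),
        failover_record("api.example.com", "nlb-west.example.com", "SECONDARY"),
    ]
}
print(json.dumps(change_batch, indent=2))
```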
Active-Active (More Complex, Better)
Normal operation:
us-east-1: Serving 50% of traffic
us-west-2: Serving 50% of traffic
Failure (us-east-1 down):
1. Health check fails
2. Route 53 redirects traffic
3. us-west-2 absorbs all traffic
Timeline:
T+0s - Failure occurs
T+30s - Health check timeout
T+60s - Route 53 rebalances
T+90s - us-west-2 at 100% capacity
Recovery time (RTO): 1-2 minutes
Data loss (RPO): 0 (bi-directional replication)
Cost: 2x (but both regions utilized)
Trade-offs:
Active-Active Pros:
✓ No idle capacity (both regions used)
✓ Faster failover
✓ Zero data loss (bi-directional sync)
✓ Better load distribution
✓ Natural load balancing
Active-Active Cons:
✗ Complex multi-master replication
✗ Conflict resolution needed
✗ Higher operational complexity
✗ Harder to test
✗ Not all databases support multi-region writes (DynamoDB Global Tables do; standard RDS does not)
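Conflict resolution is the part of active-active that usually needs code. One common policy is last-writer-wins, sketched below with a hypothetical record type. It is shown as an illustration of the idea, not as the policy this project uses, and it assumes region clocks are reasonably synchronized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    updated_at: float  # seconds since epoch, set by the writing region
    region: str


def resolve(a, b):
    """Last-writer-wins: newest timestamp wins, region name breaks ties."""
    return max(a, b, key=lambda v: (v.updated_at, v.region))


east = Versioned("sparrow", 1000.0, "us-east-1")
west = Versioned("robin", 1000.5, "us-west-2")
print(resolve(east, west).value)
```

The tie-breaker matters: without it, two regions could each keep their own value for writes with identical timestamps and never converge.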
RTO/RPO Metrics
These are the most important DR metrics.
Recovery Time Objective (RTO)
Definition: How long can the system be down before it's unacceptable?
Example: RTO = 1 hour
If system fails at 2:00 PM, it MUST be back by 3:00 PM
If recovery takes until 3:30 PM, SLA is violated
RTO levels:
< 5 min: Critical systems (hospitals, airlines)
5-15 min: Production systems (e-commerce, banking)
15-60 min: Important business (sales platforms)
1-4 hours: Can tolerate some downtime
> 4 hours: Non-critical systems
Our infrastructure:
Single region: RTO = Region recovery time = 2-6 hours
With failover: RTO = 1-2 minutes (active-active)
With manual failover: RTO = 5-30 minutes (active-passive)
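The RTO example above (failure at 2:00 PM, RTO of 1 hour) can be expressed as a simple check. The date is arbitrary and the function name is illustrative.

```python
from datetime import datetime, timedelta


def rto_met(failed_at, recovered_at, rto):
    """True if the outage was resolved within the recovery time objective."""
    return recovered_at - failed_at <= rto


failed = datetime(2024, 1, 1, 14, 0)  # 2:00 PM, arbitrary date
rto = timedelta(hours=1)
print(rto_met(failed, datetime(2024, 1, 1, 15, 0), rto))   # back by 3:00 PM
print(rto_met(failed, datetime(2024, 1, 1, 15, 30), rto))  # back at 3:30 PM
```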
Recovery Point Objective (RPO)
Definition: How much data can we afford to lose?
Example: RPO = 5 minutes
If system fails, it's acceptable to lose 5 minutes of data
If RPO is violated, we lost more than 5 minutes of changes
RPO levels:
0 minutes (zero data loss): Critical (banking, medical)
0-5 minutes: Production (most systems)
5-60 minutes: Business systems
> 1 hour: Non-critical
Our infrastructure:
Single region: RPO = 0 (stateless, no data loss)
Active-passive: RPO = 0-5 min (depends on replication frequency)
Active-active: RPO = 0 (bi-directional replication)
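RPO can be measured the same way: the data at risk is whatever was written between the last successful replication and the failure. A minimal sketch, with made-up timestamps:

```python
from datetime import datetime, timedelta


def data_loss_window(last_replicated_at, failed_at):
    """Writes made in this window never reached the replica."""
    return max(failed_at - last_replicated_at, timedelta(0))


def rpo_met(last_replicated_at, failed_at, rpo):
    """True if the unreplicated window fits within the recovery point objective."""
    return data_loss_window(last_replicated_at, failed_at) <= rpo


failed = datetime(2024, 1, 1, 14, 0, 0)
last_sync = datetime(2024, 1, 1, 13, 57, 0)  # replica is 3 minutes behind
print(data_loss_window(last_sync, failed))
print(rpo_met(last_sync, failed, timedelta(minutes=5)))
```

This is why monitoring replication lag matters: the lag at the moment of failure is the actual RPO.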
Testing Disaster Recovery
You can't trust a DR plan you haven't tested.
DR Testing Strategy
Level 1: Documentation Review
├─ Read and verify DR procedures exist
├─ Effort: 1 hour
├─ Cost: $0
├─ Confidence: 10%
Level 2: Tabletop Exercise
├─ Simulate failure scenario (on whiteboard)
├─ Walk through recovery steps
├─ Effort: 4 hours
├─ Cost: $0
├─ Confidence: 30%
Level 3: Failover Test
├─ Actually failover to secondary region
├─ Verify everything works
├─ Failback to primary
├─ Effort: 8 hours
├─ Cost: $200 (might get billed for both regions)
├─ Confidence: 85%
Level 4: Full DR Drill
├─ Simulate real regional failure
├─ Run for 2-4 hours
├─ Measure actual RTO/RPO
├─ Effort: 16 hours + follow-up
├─ Cost: Potential service issues
├─ Confidence: 99%
Recommendation:
├─ Do Level 1: Before going live
├─ Do Level 2: Every 6 months
├─ Do Level 3: Every 12 months
├─ Do Level 4: Every 2 years (or quarterly for critical)
Example: Level 3 Failover Test
Step 1: Preparation (1 hour)
├─ Document current state (RTO = 0 right now)
├─ Backup database
├─ Notify team
├─ Set 2-hour testing window
Step 2: Initiate Failover (30 minutes)
├─ Failover DNS to us-west-2 (Route 53 change)
├─ Wait for DNS propagation
├─ Verify us-west-2 is serving traffic
└─ Note: T = failover start time
Step 3: Verify (15 minutes)
├─ Test APIs from us-west-2
├─ Check application functionality
├─ Verify data is consistent
├─ Document any issues
Step 4: Measure RTO (calculated)
├─ Actual RTO = (Time traffic fully on us-west-2) - (Failover start)
├─ Goal: < 2 minutes
├─ Actual: Usually 45-90 seconds (DNS propagation)
Step 5: Failback (15 minutes)
├─ Restore DNS to primary (us-east-1)
├─ Verify us-east-1 ready
├─ Failback traffic
└─ Monitor for issues
Step 6: Post-Test (30 minutes)
├─ Document findings
├─ Update procedures if needed
├─ Review what went wrong
├─ Schedule next test
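Step 4 can be automated by polling until traffic resolves to the secondary region and recording the elapsed time. In the sketch below, resolve is any zero-argument callable supplied by the tester (for example a DNS lookup or an HTTP call that reports the serving region); that callable and the target value are assumptions, not part of the original procedure.

```python
import time


def measure_rto(resolve, expected_target, timeout=300.0, interval=5.0,
                clock=time.monotonic, sleep=time.sleep):
    """Poll until resolve() returns expected_target; return elapsed seconds.

    Returns None if the timeout passes first.
    """
    start = clock()
    while clock() - start <= timeout:
        if resolve() == expected_target:
            return clock() - start
        sleep(interval)
    return None
```

Start the measurement at the moment the Route 53 change is submitted. The clock and sleep parameters exist so the function can be exercised with fakes before the real drill.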
Implementation Timeline
Phase 1: Planning (Month 1)
Cost: $0 (planning only)
├─ Document current infrastructure
├─ Define RTO/RPO requirements
├─ Choose replication strategy
├─ Design failover procedures
└─ Get stakeholder approval
Phase 2: Infrastructure (Month 2-3)
Cost: +$140/month (duplicate infrastructure)
├─ Set up secondary region (us-west-2)
├─ Replicate EKS cluster
├─ Set up database replication
├─ Configure Route 53
├─ Set up monitoring across regions
Phase 3: Testing (Month 4)
Cost: +$140/month (ongoing), +$500 (test)
├─ Level 1: Documentation review
├─ Level 2: Tabletop exercise
├─ Level 3: Failover test (live)
├─ Fix issues found during testing
├─ Document procedures
Phase 4: Operations (Month 5+)
Cost: +$140/month (ongoing)
├─ Regular testing (Level 2 every 6 months)
├─ Monitor replication lag
├─ Update procedures as needed
├─ Handle actual failures (if any)
└─ Continuously improve RTO/RPO
Cost-Benefit Analysis
Multi-Region Costs (Annual):
├─ Additional infrastructure: $140 × 12 = $1,680/year
├─ Operational overhead: ~$5,000/year (extra work)
└─ Total: ~$6,680/year
Potential Losses (per outage):
├─ Critical system ($10M revenue): $50,000-100,000
├─ Production system ($1M revenue): $5,000-10,000
├─ Startup ($100k revenue): $500-1,000
└─ Development environment: $0 (acceptable downtime)
Break-even analysis:
├─ If 1 outage/year: $6,680 cost vs $50,000+ loss = Worth it
├─ If 1 outage/2 years: Cost accumulates, harder to justify
├─ If 1 outage/5 years: Probably not worth it
Decision matrix:
Revenue < $500k/year: Skip multi-region for now
Revenue $500k-$5M: Implement multi-region when stable
Revenue > $5M: Implement multi-region immediately
Critical systems: Always implement multi-region
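The break-even analysis can be written as an expected-loss comparison: loss per outage times outages per year, against the yearly cost of running two regions. The sketch below uses the cost figures from this section; the outage frequencies are the illustrative ones from the break-even list.

```python
ANNUAL_MULTI_REGION_COST = 140 * 12 + 5_000  # infrastructure + estimated ops overhead


def worth_multi_region(loss_per_outage, outages_per_year,
                       annual_cost=ANNUAL_MULTI_REGION_COST):
    """Compare expected yearly outage loss against the yearly multi-region cost."""
    expected_loss = loss_per_outage * outages_per_year
    return expected_loss > annual_cost, expected_loss


print(worth_multi_region(50_000, 1.0))  # critical system, 1 outage/year
print(worth_multi_region(5_000, 0.5))   # production system, 1 outage/2 years
print(worth_multi_region(1_000, 0.2))   # startup, 1 outage/5 years
```

This only counts direct revenue loss. Reputation damage and SLA penalties would push the threshold lower, which is one reason critical systems go multi-region regardless.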
Key Takeaways
1. RTO/RPO Must Be Defined Upfront
Different for different businesses
Drives architecture decisions
Not afterthought, core requirement
2. Multi-Region is Expensive
Doubles infrastructure cost
Adds operational complexity
Only worth it if revenue justifies it
3. Test Your DR Plan
Plans that haven't been tested fail
Failover tests catch hidden issues
Regular testing keeps team sharp
4. Choose Active-Active for Critical Systems
Zero downtime
Zero data loss
But requires multi-master database
5. Choose Active-Passive for Cost Optimization
Standby capacity sits idle, but replication and failover stay simple
Acceptable for non-critical systems
Simpler to manage
6. Stateless Design Simplifies DR
No special data handling needed
Replication just needs database
Failover is mostly DNS change
Quick Reference: Multi-Region Options
Active-Passive:
RTO: 2-5 minutes (DNS propagation delay)
RPO: 0-5 minutes (replication lag)
Cost: 2x infrastructure + low ops
Complexity: Low
Best for: Cost-sensitive, acceptable downtime
Active-Active:
RTO: 1-2 minutes (health check + DNS)
RPO: 0 (bi-directional replication)
Cost: 2x infrastructure + high ops
Complexity: High (conflict resolution)
Best for: Critical systems, zero downtime required
Managed (DynamoDB Global Tables):
RTO: Immediate (both regions active)
RPO: Near 0 (managed replication, typically within seconds)
Cost: 1.5-2x database cost
Complexity: Low (AWS handles it)
Best for: Greenfield projects, DynamoDB users
GitHub Actions Notifications on Slack with CI Pipeline Visibility
When building production APIs, monitoring your CI/CD pipeline is critical. Integrate GitHub Actions with Slack to get real-time notifications of build successes, failures, and deployments. This includes visibility into:
Build status for each commit/PR
Deployment stages (test, staging, production)
Failed workflow runs with error logs
Passing test suites
Branch protection rule validations
Implementation details: Use the official GitHub Slack app or the slackapi/slack-github-action action to post workflow results directly to a dedicated Slack channel. Include commit details, authors, and direct links to failing tests.
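If neither the Slack app nor a marketplace action fits, a workflow step can post to a Slack incoming webhook directly. The sketch below is one way to do that with the Python standard library; the message layout and function names are assumptions, and the webhook URL would come from a repository secret.

```python
import json
import urllib.request


def build_message(repo, workflow, status, commit_sha, author, run_url):
    """Format a CI result as a Slack incoming-webhook payload."""
    icon = ":white_check_mark:" if status == "success" else ":x:"
    text = (f"{icon} {workflow} {status} in {repo}\n"
            f"Commit {commit_sha[:7]} by {author}\n"
            f"<{run_url}|View workflow run>")
    return {"text": text}


def post_to_slack(webhook_url, payload):
    """POST the payload as JSON; webhook_url comes from a repository secret."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping message construction separate from the HTTP call makes the formatting testable without a real webhook.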
CloudWatch Notifications via Slack and Email
AWS CloudWatch monitors application metrics and logs. Setting up notifications ensures your team is immediately aware of issues:
Performance degradation (high latency, error rates)
Resource utilization alerts (CPU, memory, database connections)
Custom metric thresholds
Log pattern matching (e.g., exceptions, failed requests)
Implementation details: Create CloudWatch alarms that trigger SNS (Simple Notification Service) topics, which then notify both Slack (via a Lambda function or webhook) and email subscriptions. This dual-channel approach ensures critical alerts aren't missed.
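The Lambda in the middle of that chain mostly reshapes data: SNS delivers the CloudWatch alarm as a JSON string inside each record, and the function turns it into a chat message. A minimal sketch of that step, with the actual Slack delivery left out:

```python
import json


def format_alarm(sns_message):
    """Turn a CloudWatch alarm notification (JSON string) into Slack text."""
    alarm = json.loads(sns_message)
    return (f"[{alarm['NewStateValue']}] {alarm['AlarmName']}: "
            f"{alarm['NewStateReason']}")


def handler(event, context):
    """Lambda entry point for SNS events; returns the messages to forward."""
    messages = [format_alarm(r["Sns"]["Message"]) for r in event["Records"]]
    # Forwarding to a Slack webhook would happen here (omitted in this sketch).
    return messages
```

The email side needs no code: an email subscription on the same SNS topic receives the alarm directly.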
CloudWatch Dashboard
A dedicated CloudWatch dashboard provides a single pane of glass for monitoring your API's health. Include:
Request count and latency metrics
Error rate trends
Database query performance
API endpoint-specific metrics
System resource utilization graphs
Custom business metrics (if applicable)
Implementation details: Create a customized dashboard with widgets for each metric, set appropriate time ranges (last hour/day/week), and enable auto-refresh. Share this dashboard with your team or embed it in internal monitoring pages.
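Dashboards are defined by a JSON body listing widgets, so they can be generated rather than clicked together. The sketch below builds a single-widget body; the Auto Scaling group name is a placeholder, not taken from the project. With boto3, the resulting string is what the CloudWatch client's put_dashboard call takes as DashboardBody.

```python
import json


def metric_widget(title, namespace, metric, dimension, value, region, x, y):
    """One CloudWatch dashboard graph widget for a single metric."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "metrics": [[namespace, metric, dimension, value]],
            "stat": "Average",
            "period": 300,
            "region": region,
        },
    }


# The Auto Scaling group name is a placeholder, not taken from the project.
dashboard_body = json.dumps({
    "widgets": [
        metric_widget("Node CPU", "AWS/EC2", "CPUUtilization",
                      "AutoScalingGroupName", "bird-api-nodes", "us-east-1", 0, 0),
    ]
})
print(dashboard_body)
```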
S3 Bucket for Logs with Retention Policy
Centralized log storage is essential for compliance, debugging, and auditing. Store logs from both CloudWatch and GitHub Actions:
CloudWatch Logs exported to S3 for long-term archival
GitHub Actions workflow logs and artifacts
Application error logs and request traces
Audit logs for access and changes
Implementation details: Create an S3 bucket with:
Versioning enabled (optional, for audit trails)
Server-side encryption (SSE-S3 or KMS)
Lifecycle policies for automatic transitions to cheaper storage (e.g., Glacier after 90 days)
Set retention periods (e.g., delete logs after 2 years or per compliance requirements)
Enable access logging to monitor who accesses the bucket
Block public access by default
This helps meet compliance requirements (GDPR, HIPAA, SOC2) while managing storage costs.
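The lifecycle and retention items above map onto a single lifecycle rule. The sketch below builds that rule as a dictionary using the example values from the list (Glacier after 90 days, delete after 2 years); the rule ID and function name are made up. With boto3, a dictionary of this shape is what the S3 client's put_bucket_lifecycle_configuration call takes as LifecycleConfiguration.

```python
def log_lifecycle_rules(glacier_after_days=90, expire_after_days=730):
    """Lifecycle configuration: archive to Glacier, then delete."""
    if expire_after_days <= glacier_after_days:
        raise ValueError("expiration must come after the Glacier transition")
    return {
        "Rules": [{
            "ID": "archive-then-expire-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix applies to every object
            "Transitions": [{"Days": glacier_after_days, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": expire_after_days},
        }]
    }


config = log_lifecycle_rules()
print(config["Rules"][0]["Transitions"], config["Rules"][0]["Expiration"])
```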
Acquiring a Domain
A custom domain provides a professional identity for your API and is often required for production deployments:
Use the domain for your API endpoints (e.g., api.yourdomain.com)
SSL/TLS certificates for HTTPS (using AWS Certificate Manager or Let's Encrypt)
DNS records pointing to your API (Route 53 for AWS-hosted APIs)
Email records (MX records) for transactional emails if needed
Subdomain routing to different services (API, docs, dashboard)
Implementation details: Register a domain via Route 53, GoDaddy, Namecheap, or similar. Set up DNS records to point to your API Gateway/load balancer. Obtain an SSL certificate and enforce HTTPS. Consider using a friendly URL for API documentation (e.g., docs.yourdomain.com).
Conclusion
We've covered the complete system design of a production-grade API infrastructure:
Part 1: Why we chose EKS, NLBs, and CloudWatch
Part 2: How autoscaling and resilience work
Part 3: Monitoring strategy to prevent alert fatigue
Part 4: Cost optimizations to reduce expenses
Part 5: Multi-region DR for business continuity
The key principle throughout: Choose simple solutions that meet your needs, then optimize later as the business grows.
This infrastructure could handle 10x traffic automatically, recover from failures in seconds, and cost only $140/month. That's excellent value.
But it's built on understanding the trade-offs: Why 70% CPU threshold? Why 2 nodes minimum? Why CloudWatch instead of Prometheus? Every decision was deliberate and documented.



