System Design and Documentation for a Production-Grade API Infrastructure Deployed on AWS

DevOps and Cloud Engineer

Focused on optimizing the software development lifecycle through seamless integration of development and operations, specializing in designing, implementing, and managing scalable cloud infrastructure with a strong emphasis on automation and collaboration.

Key Skills:

  • Terraform: Skilled in Infrastructure as Code (IaC) for automating infrastructure deployment and management.

  • Ansible: Proficient in automation tasks, configuration management, and application deployment.

  • AWS: Extensive experience with AWS services like EC2, S3, RDS, and Lambda, designing scalable and cost-effective solutions.

  • Kubernetes: Expert in container orchestration, deploying, scaling, and managing containerized applications.

  • Docker: Proficient in containerization for consistent development, testing, and deployment.

  • Google Cloud Platform: Familiar with GCP services for compute, storage, and machine learning.

Part 0: CI/CD Pipeline Documentation - Bird API

This document describes the complete CI/CD pipeline for the Bird API project. The pipeline automates building Go applications, creating Docker images, and pushing them to Docker Hub for deployment on AWS EKS with Kubernetes.

Project Structure

resend/
├── bird/                          # Bird API service
│   ├── main.go                    # Bird API code
│   ├── Dockerfile                 # Bird container config
│   ├── Makefile                   # Build automation
│   └── go.mod                     # Go dependencies
├── birdImage/                     # Bird Image API service
│   ├── main.go                    # BirdImage API code
│   ├── Dockerfile                 # BirdImage container config
│   ├── Makefile                   # Build automation
│   └── go.mod                     # Go dependencies
├── frontend/                      # Bird Frontend service (NEW)
│   ├── main.go                    # Frontend code
│   └── Dockerfile                 # Frontend container config
├── bird-api-k8s-manifests/        # Kubernetes manifests
│   ├── bird-api-deployment.yaml   # Bird API K8s deployment
│   └── bird-image-deployment.yaml # Bird Image API K8s deployment
├── bird-chart/                    # Helm chart for deployments
│   ├── Chart.yaml                 # Chart metadata
│   ├── values.yaml                # Helm values
│   └── templates/                 # K8s templates
├── infrastructure/                # Terraform infrastructure
│   ├── eks-cluster.tf            # EKS cluster configuration
│   ├── eks-node-group.tf         # EKS node groups
│   ├── kubernetes-provider.tf    # K8s provider settings
│   ├── variables.tf              # Terraform variables
│   └── ...                        # Other infrastructure files
└── README.md                      # Project documentation

Architecture

Services

  • bird - Main API service (port 4201)

    • Fetches bird data from local database

    • Calls birdImage service for images

    • Returns bird information with images

  • birdImage - Image service (port 4200)

    • Fetches bird images from Unsplash API

    • Returns image URLs based on bird name

  • frontend - Frontend web service (port 3000) (NEW)

    • Displays bird images fullscreen

    • Fetches data from Bird API

    • Responsive web interface
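
The call from bird to birdImage described above can be sketched in Go. This is a minimal illustration, not the repository's code: the `birdName` query parameter, the helper names `imageRequestURL` and `fetchImageURL`, and the stub response are all assumptions made for the example. A local test server stands in for the real birdImage service on port 4200.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// imageRequestURL builds the URL the bird service would call on birdImage.
// Assumption: birdImage takes the bird name in a "birdName" query parameter.
func imageRequestURL(baseURL, birdName string) string {
	return baseURL + "/?birdName=" + url.QueryEscape(birdName)
}

// fetchImageURL asks the image service for an image URL and returns the
// response body as a string.
func fetchImageURL(baseURL, birdName string) (string, error) {
	resp, err := http.Get(imageRequestURL(baseURL, birdName))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("birdImage returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	// Stand-in for the birdImage service (hypothetical response format).
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "https://images.example.com/%s.jpg", r.URL.Query().Get("birdName"))
	}))
	defer stub.Close()

	img, err := fetchImageURL(stub.URL, "blue jay")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(img)
}
```

Inside the cluster, the base URL would be the birdImage Service address rather than a local stub.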

Technology Stack

  • Language: Go

  • Containerization: Docker

  • Container Registry: Docker Hub

  • Orchestration: Kubernetes (AWS EKS)

  • Infrastructure as Code: Terraform

  • CI/CD: GitHub Actions

  • Package Manager: Helm

Build Process

1. Local Build (Development)

Build the Go Binary
cd bird
make bird
# or
go build -o bird main.go

Output: Executable binary named bird

Using Makefile

The Makefile in each directory provides convenient build targets:

cd bird
make bird     # Build bird API
make clean    # Clean build artifacts

cd ../birdImage
make birdImage # Build bird image API

cd ../frontend
# No Makefile is listed for frontend/ in the project tree, so build directly
go build -o frontend main.go

2. Docker Image Build

Build Docker Images Locally
# From the root directory
docker build -t bruno74t/bird-api:v.1.0.5.7 ./bird
docker build -t bruno74t/bird-image-api:v.1.0.5.7 ./birdImage
docker build -t bruno74t/bird-frontend:v.1.0.5.7 ./frontend

What happens:

  • Reads Dockerfile from each service directory

  • Compiles the Go code inside the container

  • Creates a lightweight Alpine Linux image with the binary

  • Tags the image with the service name and version

Dockerfile Structure (Multi-stage build)

bird/Dockerfile, birdImage/Dockerfile, and frontend/Dockerfile follow the same pattern:

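# syntax=docker/dockerfile:1

# Stage 1: compile the Go binary with the full toolchain
FROM golang:1.21-alpine AS builder
WORKDIR /app
# Copy the module file first so the dependency layer is cached between builds
COPY go.mod ./
RUN go mod download
COPY . .
RUN go build -o bird main.go

# Stage 2: copy only the binary into a small runtime image
FROM alpine:latest
WORKDIR /root/
COPY --from=builder /app/bird .
# Document the listening port (4201 for bird; change per service)
EXPOSE 4201
CMD ["./bird"]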

Benefits:

  • Multi-stage build keeps final image small

  • Builder stage has all dependencies

  • Final image only contains the executable

Push Process

GitHub Actions Automated Push

Trigger
  • Any push to main branch

  • Any pull request to main branch

GitHub Workflow File Location

.github/workflows/docker-ci.yml

Workflow Steps

Step 1: Checkout Code

- name: Checkout Code
  uses: actions/checkout@v4

Pulls the latest code from the repository

Step 2: Set up Docker Buildx

- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3

Enables advanced Docker building features

Step 3: Authenticate with Docker Hub

- name: Log into Docker Hub
  run: echo "${{ secrets.DOCKER_PASSWORD_SYMBOLS_ALLOWED }}" | docker login --username "${{ secrets.DOCKER_USERNAME }}" --password-stdin

Uses GitHub Secrets to securely log into Docker Hub without exposing credentials

Step 4: Build and Push Bird API

- name: Build and Push Bird API Docker Image
  run: |
    docker build -t ${{ secrets.DOCKER_USERNAME }}/bird-api:v.1.0.5.7 ./bird
    docker push ${{ secrets.DOCKER_USERNAME }}/bird-api:v.1.0.5.7

Builds the bird API image from bird/Dockerfile and pushes it to the Docker Hub repository.

Step 5: Build and Push BirdImage API

- name: Build and Push BirdImage API Docker Image
  run: |
    docker build -t ${{ secrets.DOCKER_USERNAME }}/bird-image-api:v.1.0.5.7 ./birdImage
    docker push ${{ secrets.DOCKER_USERNAME }}/bird-image-api:v.1.0.5.7

Builds the birdImage API image from birdImage/Dockerfile and pushes it to the Docker Hub repository.

Step 6: Build and Push Bird Frontend (NEW)

- name: Build and Push Bird Frontend Docker Image
  run: |
    docker build -t ${{ secrets.DOCKER_USERNAME }}/bird-frontend:v.1.0.5.7 ./frontend
    docker push ${{ secrets.DOCKER_USERNAME }}/bird-frontend:v.1.0.5.7

Builds the frontend image from frontend/Dockerfile and pushes it to the Docker Hub repository.

GitHub Secrets Configuration

Required Secrets

Set these in GitHub: Settings → Secrets and variables → Actions

  1. DOCKER_USERNAME

    • Value: Your Docker Hub username

    • Example: bruno74t

  2. DOCKER_PASSWORD_SYMBOLS_ALLOWED

    • Value: Docker Hub access token or password

    • Recommended: Use access token instead of password

How to Create a Docker Hub Access Token
  • Go to Docker Hub (https://hub.docker.com)

  • Click Account Settings → Security

  • Click New Access Token

  • Name: github-actions

  • Permissions: Read, Write, Delete

  • Copy token and paste in GitHub Secrets

Container Image Management via Terraform

Instead of hardcoding image URIs in Kubernetes manifests, we use Terraform variables to manage container versions. This allows easy updates without modifying resource definitions.

Terraform variables.tf Configuration

File: infrastructure/variables.tf

# Container Images
variable "bird_api_image" {
  description = "Docker image URI for bird-api"
  type        = string
  default     = "bruno74t/bird-api:v.1.0.5.7"
}

variable "bird_api_port" {
  description = "Port for bird-api container"
  type        = number
  default     = 4201
}

variable "bird_image_api_image" {
  description = "Docker image URI for bird-image-api"
  type        = string
  default     = "bruno74t/bird-image-api:v.1.0.5.7"
}

variable "bird_image_api_port" {
  description = "Port for bird-image-api container"
  type        = number
  default     = 4200
}

variable "bird_frontend_image" {
  description = "Docker image URI for bird-frontend"
  type        = string
  default     = "bruno74t/bird-frontend:v.1.0.5.7"
}

variable "bird_frontend_port" {
  description = "Port for bird-frontend container"
  type        = number
  default     = 3000
}

Usage in kubernetes-provider.tf

Reference the variables in your Kubernetes resource definitions:

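resource "kubernetes_deployment" "bird_frontend" {
  metadata {
    name = "bird-frontend"
  }

  spec {
    replicas = 2

    # Required by the provider: must match the pod template labels below
    selector {
      match_labels = {
        app = "bird-frontend"
      }
    }

    template {
      metadata {
        # Also matched by the Service selector (app = "bird-frontend")
        labels = {
          app = "bird-frontend"
        }
      }

      spec {
        container {
          name  = "bird-frontend"
          image = var.bird_frontend_image

          port {
            container_port = var.bird_frontend_port
          }
        }
      }
    }
  }
}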

resource "kubernetes_service" "bird_frontend_service" {
  metadata {
    name = "bird-frontend-service"
  }

  spec {
    type = "LoadBalancer"

    selector = {
      app = "bird-frontend"
    }

    port {
      port        = 80
      target_port = var.bird_frontend_port
    }
  }
}

How to Update Container Images

Option 1: Edit the Default Value

Change the default value in infrastructure/variables.tf:

variable "bird_frontend_image" {
  description = "Docker image URI for bird-frontend"
  type        = string
  default     = "bruno74t/bird-frontend:v.1.0.5.8"  # Update version here
}

Option 2: Override at Deploy Time

terraform apply \
  -var="bird_api_image=bruno74t/bird-api:v.1.0.5.8" \
  -var="bird_image_api_image=bruno74t/bird-image-api:v.1.0.5.8" \
  -var="bird_frontend_image=bruno74t/bird-frontend:v.1.0.5.8"

Option 3: Use terraform.tfvars

Create or update infrastructure/terraform.tfvars:

bird_api_image        = "bruno74t/bird-api:v.1.0.5.8"
bird_image_api_image  = "bruno74t/bird-image-api:v.1.0.5.8"
bird_frontend_image   = "bruno74t/bird-frontend:v.1.0.5.8"
bird_api_port         = 4201
bird_image_api_port   = 4200
bird_frontend_port    = 3000

Then deploy:

cd infrastructure
terraform plan
terraform apply

Deploy with Updated Images

cd infrastructure

# Review changes
terraform plan

# Apply changes (updates the Deployment image; Kubernetes then rolls out new pods)
terraform apply

Complete Workflow

Development → Production Flow

  1. Developer makes changes to Go code
     └─ Edit bird/main.go, birdImage/main.go, or frontend/main.go

  2. Developer commits and pushes to GitHub
     ├─ git commit -m "Update bird data"
     └─ git push origin main

  3. GitHub Actions triggered automatically
     ├─ Checkout code
     ├─ Build Docker images (all 3 services)
     ├─ Push to Docker Hub
     └─ Creates new image tags (v.1.0.5.8, etc.)

  4. Docker images available on Docker Hub
     ├─ bruno74t/bird-api:v.1.0.5.8
     ├─ bruno74t/bird-image-api:v.1.0.5.8
     └─ bruno74t/bird-frontend:v.1.0.5.8

  5. Update Terraform variables.tf with new version
     └─ Update default image tags

  6. Deploy infrastructure via Terraform
     ├─ terraform plan
     ├─ terraform apply
     └─ EKS pulls new images and restarts pods

  7. Kubernetes manages the deployment
     ├─ Rolling update strategy
     ├─ Old pods gradually replaced with new ones
     └─ Services remain available during update
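
To see why services stay available during a rolling update, the pod-count bounds can be worked out directly. The sketch below assumes the Deployments use the Kubernetes default strategy of 25% maxUnavailable and 25% maxSurge, which the manifests shown in this document do not state; `rolloutBounds` is a hypothetical helper for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// rolloutBounds returns the minimum number of available pods and the maximum
// total number of pods during a rolling update. Kubernetes rounds a
// percentage maxUnavailable down and a percentage maxSurge up.
func rolloutBounds(replicas int, maxUnavailablePct, maxSurgePct float64) (minAvailable, maxTotal int) {
	unavailable := int(math.Floor(float64(replicas) * maxUnavailablePct / 100))
	surge := int(math.Ceil(float64(replicas) * maxSurgePct / 100))
	return replicas - unavailable, replicas + surge
}

func main() {
	// Assumption: default strategy (25% / 25%) and the 2 replicas used in this document.
	minAvailable, maxTotal := rolloutBounds(2, 25, 25)
	fmt.Printf("at least %d pods available, at most %d pods in total\n", minAvailable, maxTotal)
}
```

With 2 replicas and those defaults, no pod is taken down until a replacement is ready, so the update proceeds one surge pod at a time.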

Version Management

Tagging Convention

Format: v.<major>.<minor>.<patch>, with an optional fourth component

Examples: v.1.0.0, v.1.0.1, v.1.0.5.7

Note: Use dot before major version (v.1.0.5.7 not v1.0.5.7)
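
The convention can be validated and bumped mechanically. The Go sketch below is illustrative only: `parseTag` and `bumpLast` are hypothetical helpers that are not part of the repository, and the rule that a tag has three or four numeric components is an assumption drawn from the examples above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTag splits a tag such as "v.1.0.5.7" into its numeric components.
// Assumption: tags carry three or four components, as in the examples.
func parseTag(tag string) ([]int, error) {
	if !strings.HasPrefix(tag, "v.") {
		return nil, fmt.Errorf("tag %q must start with \"v.\"", tag)
	}
	fields := strings.Split(strings.TrimPrefix(tag, "v."), ".")
	if len(fields) < 3 || len(fields) > 4 {
		return nil, fmt.Errorf("tag %q must have 3 or 4 components", tag)
	}
	parts := make([]int, len(fields))
	for i, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil || n < 0 {
			return nil, fmt.Errorf("tag %q: component %q is not a non-negative integer", tag, f)
		}
		parts[i] = n
	}
	return parts, nil
}

// bumpLast increments the last component of a tag and re-renders it.
func bumpLast(tag string) (string, error) {
	parts, err := parseTag(tag)
	if err != nil {
		return "", err
	}
	parts[len(parts)-1]++
	strs := make([]string, len(parts))
	for i, p := range parts {
		strs[i] = strconv.Itoa(p)
	}
	return "v." + strings.Join(strs, "."), nil
}

func main() {
	next, err := bumpLast("v.1.0.5.7")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(next)
}
```

A check like this could run before tagging an image, so that a tag written as v1.0.5.7 (without the dot) is rejected early.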

How to Release a New Version

  1. Make code changes (paths are relative to the repository root)

     # Edit bird/main.go
     git add bird/main.go

  2. Update CI/CD workflow (optional, if changing version tags)

     # Update .github/workflows/docker-ci.yml if needed
     git add .github/workflows/docker-ci.yml

  3. Update Terraform variables

     # Update infrastructure/variables.tf with new version
     git add infrastructure/variables.tf

  4. Commit and push

     git commit -m "Release v.1.0.5.8 - Description of changes"
     git push origin main

  5. GitHub Actions automatically builds and pushes new version

  6. Deploy with Terraform

     cd infrastructure
     terraform plan
     terraform apply

Troubleshooting

Go Build Issues

# Clean cache
go clean -cache

# Rebuild
go build -o bird main.go

GitHub Actions Not Triggering

  • Check: .github/workflows/docker-ci.yml exists

  • Check: Secrets are set (DOCKER_USERNAME, DOCKER_PASSWORD_SYMBOLS_ALLOWED)

  • Check: Workflow syntax is valid

Docker Push Fails

  • Verify: Docker Hub credentials in GitHub Secrets

  • Verify: Repository exists on Docker Hub

  • Verify: Access token hasn't expired

Image Not Updating

  • Check Docker Hub tags: bruno74t/bird-api repository

  • Verify workflow ran successfully in GitHub Actions

  • Check: the image variables in infrastructure/variables.tf (or terraform.tfvars) point at the new tag and terraform apply has been run

Verifying Deployed Container Images (Kubernetes)

Check which image version is currently deployed:

kubectl get deployment bird-api -o yaml | grep image:
kubectl get deployment bird-image-api -o yaml | grep image:
kubectl get deployment bird-frontend -o yaml | grep image:

Check pod status and readiness:

kubectl get pods -n default | grep bird

View container logs to verify app is running correctly:

kubectl logs -n default deployment/bird-api
kubectl logs -n default deployment/bird-image-api
kubectl logs -n default deployment/bird-frontend

Watch real-time logs as pods start:

kubectl logs -f -n default deployment/bird-api
kubectl logs -f -n default deployment/bird-frontend

Check pod details and events:

kubectl describe pod <pod-name> -n default

Verify rollout status after deployment:

kubectl rollout status deployment/bird-api -n default
kubectl rollout status deployment/bird-image-api -n default
kubectl rollout status deployment/bird-frontend -n default

Force restart deployment with new image:

kubectl rollout restart deployment/bird-api -n default
kubectl rollout restart deployment/bird-image-api -n default
kubectl rollout restart deployment/bird-frontend -n default

Key Files

resend/
├── .github/
│   └── workflows/
│       └── docker-ci.yml          # CI/CD Pipeline definition
├── bird/
│   ├── main.go                    # Bird API code
│   ├── Dockerfile                 # Bird container config
│   ├── Makefile                   # Build automation
│   └── go.mod                     # Dependencies
├── birdImage/
│   ├── main.go                    # BirdImage API code
│   ├── Dockerfile                 # BirdImage container config
│   ├── Makefile                   # Build automation
│   └── go.mod                     # Dependencies
├── frontend/
│   ├── main.go                    # Frontend code
│   └── Dockerfile                 # Frontend container config
├── bird-api-k8s-manifests/        # Kubernetes manifests
│   ├── bird-api-deployment.yaml
│   └── bird-image-deployment.yaml
├── bird-chart/                    # Helm charts
│   ├── Chart.yaml
│   ├── values.yaml
│   └── templates/
├── infrastructure/                # Infrastructure as Code
│   ├── kubernetes-provider.tf    # K8s provider settings
│   ├── variables.tf              # Container image versions managed here
│   ├── eks-cluster.tf            # EKS cluster
│   └── ...                       # Other infrastructure files
└── README.md

variables.tf Role

The infrastructure/variables.tf file is the single source of truth for container image versions:

  • Define which Docker image tag to deploy

  • Update versions without touching resource definitions

  • Centralized management for all services

  • Easy rollback by changing image tags in one file

Summary

This CI/CD pipeline:

  • Builds the Go applications (Makefiles for local builds, multi-stage Docker builds in CI)

  • Creates Docker images for each service using multi-stage builds (bird, birdImage, frontend)

  • Pushes images to Docker Hub with version tags

  • Enables infrastructure teams to deploy latest versions via Terraform using variables.tf

  • Centralizes image version management in variables.tf (single source of truth)

  • Eliminates manual build and push steps

  • Maintains secure credentials using GitHub Secrets

  • Allows easy rollback by changing image tags in one file

  • Uses AWS EKS for Kubernetes orchestration

  • Supports rolling updates with zero downtime

Part 1: Architecture Overview & Technology Selection

Full Detailed Architecture Design

Source Code on Github

When deploying APIs in the cloud, every technology choice involves trade-offs. In this comprehensive guide, I'll walk you through the complete system design of a highly available, scalable API infrastructure on AWS. I'll explain not just what I implemented, but why I made each architectural decision and what alternatives we considered.

This is the infrastructure behind the Bird API, a real-world project deployed to AWS EKS with auto-scaling, CloudFront CDN, and comprehensive CloudWatch monitoring.

Executive Summary

Building scalable infrastructure isn't about using the fanciest tools—it's about making careful choices that balance:

  • Cost - Stay within budget (this costs around $120/month)

  • Complexity - Keep it simple enough to understand

  • Capability - Meet all functional requirements

  • Future-proofing - Allow for scaling without rewrites

This infrastructure shows these principles through real-world decisions, with a trade-off analysis for each major component.

Design Goals Achieved

Goal         | Target               | Achieved
Availability | 99.9% uptime         | ✓ Multi-AZ + self-healing
Scalability  | 2-10 pods, 2-5 nodes | ✓ HPA + Cluster Autoscaler
Latency      | <100ms response      | ✓ CloudFront CDN at edges
Cost         | <$150/month          | ✓ ~$120/month
Recovery     | <15s pod failure     | ✓ Verified in tests
Monitoring   | Real-time alerts     | ✓ CloudWatch + 4 alarms
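
To make the availability target concrete, the arithmetic below converts an uptime percentage into a downtime budget. It is a sketch only: the 30-day month is an assumption, and `downtimeBudget` is a hypothetical helper, not something from the project.

```go
package main

import (
	"fmt"
	"time"
)

// downtimeBudget returns how long a service may be unavailable within
// period while still meeting the given availability percentage.
func downtimeBudget(availabilityPct float64, period time.Duration) time.Duration {
	fraction := 1 - availabilityPct/100
	return time.Duration(fraction * float64(period))
}

func main() {
	month := 30 * 24 * time.Hour // assumption: a 30-day month
	budget := downtimeBudget(99.9, month)
	fmt.Printf("99.9%% over 30 days allows about %.1f minutes of downtime\n", budget.Minutes())
}
```

At 99.9% the budget is roughly 43 minutes per month, which is the margin the multi-AZ layout and the sub-15-second pod recovery are meant to protect.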

The Journey: Why These Technologies?

Every decision in this architecture answers a fundamental question: What's the simplest solution that meets our requirements and can still be continuously improved?

This philosophy guided me toward AWS managed services (EKS, CloudFront, CloudWatch) over self-hosted alternatives, even when the self-hosted versions offered more power. I chose a system that:

  1. Works reliably with minimal operational overhead

  2. Scales automatically without intervention

  3. Alerts us when something goes wrong

  4. Costs predictably without surprises

  5. Can be deployed in 20-25 minutes

System Architecture Overview

Before diving into individual components, here's the complete picture:

[Figure: System Design for Bird API]

Container Orchestration: Why EKS?

Let me start with the biggest decision: Kubernetes on AWS (EKS) vs alternatives.

The Alternatives We Considered

Option 1: Self-Managed Kubernetes

  • Setup Time: 2-3 days

  • Operational Burden: Very high (time-consuming)

  • Cost: EC2 instances only, with no managed control plane fee

  • Monthly Cost: 3 EC2 instances for HA = ~$200-250

Why not: For a single API, the operational overhead isn't justified. You're maintaining the control plane (patching, updates, security), managing etcd backups, and handling everything that AWS does automatically with EKS.

Option 2: Docker Swarm

  • Setup Time: 1-2 days

  • Operational Burden: Low

  • Scalability: Limited

  • Community: Declining

Why NOT: Swarm has limited auto-scaling capabilities, no multi-region support, and the community is shrinking. It's great for small-scale deployments but won't grow with you.

Option 3: AWS ECS (Fargate)

  • Setup Time: 1 day

  • Operational Burden: Very Low

  • Kubernetes Features: Not available

  • Cost: Pay per pod (~$40-50/month for our workload)

Why I almost chose it: ECS Fargate is genuinely simpler than Kubernetes, with no nodes to manage. We still chose EKS because:

  • Kubernetes is the industry standard

  • Terraform support is better for Kubernetes

  • EKS cost is in the same order of magnitude ($73 control plane + $68 nodes = $141/month vs ~$40-50/month for Fargate)

  • Fargate does win on cost, so we could optimize here later

Decision: AWS EKS

Here's why EKS won:

Factor               | EKS       | Self-Managed K8s | Docker Swarm | ECS Fargate
AWS Integration      | Native    | Manual plugins   | Poor         | Native
Multi-AZ             | Built-in  | Manual           | Manual       | Built-in
Auto-Scaling         | Native    | Custom setup     | Limited      | Native
Industry Standard    | Yes       | Yes              | Declining    | No
Control Plane Mgmt   | AWS       | You              | Built-in     | AWS
Cost (Control Plane) | $73/month | $0               | $0           | $0
Monthly Total        | $141      | $200-250         | $100         | $50-70
Operational Burden   | Low       | Very High        | Low          | Very Low
Future Flexibility   | Excellent | Excellent        | Limited      | Limited

The surprising winner for cost: ECS Fargate ($50-70/month)
The surprising winner for simplicity: ECS Fargate (no nodes to manage)
Our choice: EKS ($141/month)

Why not the cheapest option?

Three reasons:

  1. Industry Adoption - Kubernetes skills transfer to any cloud. ECS is AWS-only.

  2. The cost difference is small - $70/month difference is acceptable for portability

  3. Future flexibility - Kubernetes runs anywhere. ECS doesn't. If we ever need multi-cloud, K8s is the only option.

The lesson here: Sometimes paying a little more for portability is worth it.
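
The size of that premium follows from the figures quoted in this section. The sketch below only restates the article's own numbers ($73 control plane, $68 nodes, $50-70 for Fargate); `monthlyPremium` is a hypothetical helper, and actual AWS bills will differ.

```go
package main

import "fmt"

// monthlyPremium returns the low and high ends of the extra monthly cost of
// option A over an option B whose cost is quoted as a range.
func monthlyPremium(a, bLow, bHigh int) (low, high int) {
	return a - bHigh, a - bLow
}

func main() {
	eks := 73 + 68                           // control plane + worker nodes, USD/month
	low, high := monthlyPremium(eks, 50, 70) // Fargate range from the comparison table
	fmt.Printf("EKS: $%d/month, premium over Fargate: $%d-%d/month\n", eks, low, high)
}
```

The lower end of that range is the roughly $70/month difference accepted above in exchange for portability.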

Load Balancing Strategy

Now that we've chosen Kubernetes, we need a load balancer to route traffic to our pods.

Load Balancer Options

Layer 4 (Transport): TCP/UDP
├── Network Load Balancer (NLB)
│   └─ Latency: <100µs, Cost: $0.006/hr
│
Layer 7 (Application): HTTP/HTTPS
├── Application Load Balancer (ALB)
│   └─ Latency: ~400µs, Cost: $0.0225/hr
│
Legacy:
└── Classic Load Balancer
    └─ Deprecated (don't use)

Decision: Network Load Balancer (NLB)

Here's the analysis:

Requirement             | NLB         | ALB        | Why NLB
Latency < 100ms?        | ✓ Ultra-low | ✓ Low      | NLB is about 4x faster (<100µs vs ~400µs)
Simple API routing?     | ✓ Yes       | ✓ Yes      | Both work
Need URL-based routing? | ✗ No        | ✓ Yes      | N/A for us
Cost for 2 services?    | $44/month   | $120/month | NLB 2.7x cheaper
Future consolidation?   | Hard        | Easy       | But we don't need it yet

Trade-off we made: Each service gets its own load balancer instead of consolidating on one ALB (which would need URL-based routing).

Cost of this trade-off: One load balancer per service. By the table above, the two NLBs still come out $76/month cheaper than the ALB setup ($44 vs $120)
Benefit: Simpler setup, no routing rules to manage
Break-even point: When you have 10+ microservices, switch to ALB
Our situation: 2 microservices, NLB makes sense

The lesson: Don't optimize for scale you don't have. Use NLB now, migrate to ALB later if needed.


Key Takeaways

From these architecture decisions, here are the principles we followed:

  1. Choose Managed Services Over Self-Hosted

    • EKS > Self-Managed K8s (let AWS handle control plane)

    • CloudWatch > Prometheus (no additional infrastructure)

    • CloudFront > self-hosted CDN (global by default)

  2. Optimize for Your Current Scale, Not Future Scale

    • 2 NLBs > 1 ALB (simpler now, migrate later)

    • 2 nodes > 10 nodes (scale as needed)

    • EKS > ECS Fargate (slight cost premium for flexibility)

  3. Prioritize Operational Simplicity

    • "Simple" beats "powerful"

    • "Boring" beats "cutting-edge"

    • "Built-in" beats "third-party"

  4. Document Trade-offs Explicitly

    • Every decision has a cost and benefit

    • Your future self will thank you

    • Enables future optimization


What's Next?

In the next part, we'll dive into:

  • Horizontal Pod Autoscaling (HPA) - Why 70% CPU? Why 10 pods max?

  • Cluster Autoscaling - When to add/remove nodes

  • Failure Recovery Design - How the system recovers in <15 seconds

System Design Series

  1. Part 1: Architecture Overview & Technology Selection (this post)

  2. Part 2: Autoscaling & Failure Recovery Design

  3. Part 3: Monitoring & CloudWatch Strategy

  4. Part 4: Cost Optimization for Production Workloads

  5. Part 5: Multi-Region & Disaster Recovery

Part 1.1 Terraform Files Explained

Bird API - Terraform Infrastructure Documentation

Overview

This documentation covers the complete Terraform infrastructure setup for the Bird API, a containerized application deployed on AWS EKS (Elastic Kubernetes Service). The infrastructure is designed for high availability, auto-scaling, and comprehensive monitoring.

Architecture Type: Microservices on Kubernetes
Primary Services: Bird API, Bird Image API, Bird Frontend
Cloud Provider: AWS
Container Orchestration: Kubernetes (EKS)
Infrastructure as Code Tool: Terraform


Table of Contents

  1. Architecture Overview

  2. File Structure & Organization

  3. Core Components

  4. Configuration & Variables

  5. Deployment Instructions

  6. Monitoring & Alerting

  7. Scaling & Performance

  8. Cost Considerations


Architecture Overview

High-Level Flow

The Bird API infrastructure follows a multi-tier architecture:

  1. Edge Layer: CloudFront CDN distributes content globally

  2. Load Balancing Layer: Network Load Balancers route traffic to services

  3. Kubernetes Layer: EKS cluster manages containerized workloads

  4. Persistence Layer: S3 for logs and state management

  5. Monitoring Layer: CloudWatch for metrics, alarms, and dashboards

Key Services Deployed

Each service runs with:


File Structure & Organization

Core Infrastructure Files

vpc.tf - Virtual Private Cloud Setup

Purpose: Creates an isolated network environment with public and private subnets
Key Resources:

Network Design:

Public Subnets (with IGW)
├── bird-public-subnet-1 (10.0.1.0/24)
└── bird-public-subnet-2 (10.0.2.0/24)

Private Subnets (with NAT)
├── bird-private-subnet-1 (10.0.101.0/24)
└── bird-private-subnet-2 (10.0.102.0/24)
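
A layout like this can be sanity-checked before it goes into vpc.tf. The Go sketch below verifies that each subnet sits inside the VPC range and that none overlap. The VPC CIDR 10.0.0.0/16 is an assumption, since the document does not state it, and `checkSubnets` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/netip"
)

// checkSubnets verifies that every subnet lies inside the VPC range and
// that no two subnets overlap.
func checkSubnets(vpc string, subnets []string) error {
	vpcPrefix, err := netip.ParsePrefix(vpc)
	if err != nil {
		return err
	}
	vpcPrefix = vpcPrefix.Masked()

	parsed := make([]netip.Prefix, 0, len(subnets))
	for _, s := range subnets {
		p, err := netip.ParsePrefix(s)
		if err != nil {
			return err
		}
		p = p.Masked()
		// A subnet is inside the VPC if it is at least as specific as the
		// VPC prefix and its first address falls within the VPC range.
		if p.Bits() < vpcPrefix.Bits() || !vpcPrefix.Contains(p.Addr()) {
			return fmt.Errorf("%s is outside %s", s, vpc)
		}
		for _, q := range parsed {
			if p.Overlaps(q) {
				return fmt.Errorf("%s overlaps %s", s, q)
			}
		}
		parsed = append(parsed, p)
	}
	return nil
}

func main() {
	subnets := []string{
		"10.0.1.0/24", "10.0.2.0/24", // public
		"10.0.101.0/24", "10.0.102.0/24", // private
	}
	// Assumption: the VPC uses 10.0.0.0/16.
	if err := checkSubnets("10.0.0.0/16", subnets); err != nil {
		fmt.Println("invalid layout:", err)
		return
	}
	fmt.Println("subnet layout is consistent")
}
```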

eks-cluster.tf - Kubernetes Control Plane

Purpose: Creates and configures the EKS cluster
Key Resources:

Features:

eks-node-group.tf - Worker Nodes

Purpose: Manages EC2 worker nodes that run containers
Configuration:

Key Features:

IAM Policies Attached:

autoscaling.tf - Auto-Scaling Configuration

Purpose: Enables automatic scaling of cluster and pods
Components:

  1. Cluster Autoscaler:

  2. Metrics Server: (Commented out)

  3. Horizontal Pod Autoscaler (HPA):

Workflow:

High Traffic → CPU Increases → HPA triggers scale-up → More pods created
Low Traffic → CPU Decreases → HPA triggers scale-down → Pods removed
Node Capacity Full → Cluster Autoscaler → Adds EC2 node
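
The scale-up and scale-down steps follow the replica formula from the Kubernetes Horizontal Pod Autoscaler documentation. The Go sketch below applies that formula with the 70% CPU target and 2-10 pod range quoted elsewhere in this series; `desiredReplicas` is a hypothetical helper, and the real controller also applies a tolerance band and stabilization windows that are left out here.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the documented HPA rule
//
//	desired = ceil(current * currentUtilization / targetUtilization)
//
// and clamps the result to [minReplicas, maxReplicas].
func desiredReplicas(current int, currentUtil, targetUtil float64, minReplicas, maxReplicas int) int {
	desired := int(math.Ceil(float64(current) * currentUtil / targetUtil))
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	// Assumed settings: 70% CPU target, 2-10 pods.
	fmt.Println(desiredReplicas(2, 95, 70, 2, 10))  // high traffic: scale up
	fmt.Println(desiredReplicas(6, 20, 70, 2, 10))  // low traffic: scale down to the minimum
	fmt.Println(desiredReplicas(8, 140, 70, 2, 10)) // demand above the cap: held at the maximum
}
```

When the pods requested by the HPA no longer fit on the existing nodes, they stay Pending, and that is the signal the Cluster Autoscaler uses to add an EC2 node.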

kubernetes-provider.tf - Kubernetes Workloads

Purpose: Defines all application deployments, services, and policies
Deployments Created:

  1. Bird API Deployment

  2. Bird Image API Deployment

  3. Bird Frontend Deployment

Service Type: LoadBalancer (creates AWS Network Load Balancer)

Pod Disruption Budgets (PDB):

cloudfront.tf - Content Distribution Network

Purpose: Global content caching and routing via CloudFront
Origins:

  1. Bird Frontend NLB (default, serves UI)

  2. Bird API NLB (path: /api/*)

  3. Bird Image API NLB (path: implicit routing)

Caching Strategy:

S3 Logging Bucket:

Outputs:

monitoring.tf - CloudWatch Monitoring & Alerts

Purpose: Real-time visibility and alerting for infrastructure health

SNS Topic & Subscriptions:

CloudWatch Alarms (Low Thresholds for Testing):

  1. Pod CPU High: > 20% utilization

  2. Pod Memory High: > 30% utilization

  3. Pod Restarts High: 3+ restarts in 2 minutes

  4. Node Not Ready: Any node in NotReady state

CloudWatch Dashboard:

Log Insights Queries (Sample):

Error Logs: fields @timestamp, @message | filter @message like /ERROR/
Response Times: fields @timestamp, response_time | stats avg(response_time), max(response_time), pct(response_time, 95)
HTTP Status: fields status_code | stats count by status_code
Pod Metrics: stats avg(cpu), avg(memory) by pod_name

Log Retention: 7 days (configurable)

bucket.tf - State Management & Locking

Purpose: Secure remote Terraform state storage

S3 Bucket:

DynamoDB Table:

Backend Configuration:

terraform {
  # Remote state in S3, with the DynamoDB table providing state locking
  backend "s3" {
    bucket         = "bird-api-terraform-state-123456789"
    key            = "bird-api/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "bird-api-terraform-locks"
  }
}

variables.tf - Input Variables

Purpose: Configurable parameters for different environments

Major Variables:

terraform.tfvars - Variable Values

Purpose: Provides actual values for variables

Typical Values:

aws_region           = "us-east-1"
environment          = "dev"
project_name         = "bird-api"
cluster_name         = "bird-api-cluster"
cluster_version      = "1.29"
node_group_desired_size = 2
alert_email          = "brunogatete77@gmail.com"
enable_cloudwatch_monitoring = true

locals.tf - Local Values

Purpose: Derived values used across multiple resources

Contents:

data.tf - Data Sources

Purpose: References existing AWS resources

Data Sources:

output.tf - Outputs

Purpose: Display important values after deployment

Typical Outputs:

backend.tf - Terraform Backend Configuration

Purpose: Remote state storage configuration

Current Setup: Local state (for initial setup)
Production: Should migrate to S3 backend

kubernetes-provider.tf - Terraform Kubernetes Provider

Purpose: Authenticates Terraform with EKS cluster

Configuration:


Configuration & Variables

Environment-Specific Values

Development:

environment                   = "dev"
node_group_desired_size       = 2
node_group_max_size           = 3
bird_api_replicas             = 2
enable_cloudwatch_monitoring  = true
alert_email                   = "devops@example.com"

Production:

environment                   = "prod"
node_group_desired_size       = 3
node_group_max_size           = 10
bird_api_replicas             = 3
enable_cloudwatch_monitoring  = true
alert_email                   = "ops-team@example.com"
log_retention_in_days         = 30

Important Constraints

High Availability Requirements:

Scaling Limits:


Deployment Instructions

Prerequisites

Install the required tools:

  • AWS CLI, configured with appropriate credentials

  • Terraform >= 1.0

  • kubectl >= 1.29

  • helm (for manual chart installations)

Step-by-Step Deployment

1. Initialize Terraform:

terraform init

2. Create terraform.tfvars with your values:

cat > terraform.tfvars << EOF
aws_region                   = "us-east-1"
environment                  = "dev"
alert_email                  = "your-email@example.com"
EOF

3. Review planned changes:

terraform plan

4. Apply infrastructure:

terraform apply

5. Configure kubectl:

aws eks update-kubeconfig \
  --name bird-api-cluster \
  --region us-east-1

6. Verify cluster:

kubectl get nodes
kubectl get pods -n default
kubectl get svc -n default

7. (Optional) Install Helm charts manually:

# Cluster Autoscaler (installed into kube-system, where the commands later in this guide look for it)
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=bird-api-cluster \
  --set awsRegion=us-east-1

# Metrics Server (also in kube-system)
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm install metrics-server metrics-server/metrics-server \
  --namespace kube-system

Post-Deployment Verification

# Check EKS cluster
aws eks describe-cluster --name bird-api-cluster

# Check node status
kubectl get nodes -o wide

# Check deployments
kubectl get deployments

# Check services
kubectl get svc

# Check HPA status
kubectl get hpa

# View CloudWatch dashboard
aws cloudwatch list-dashboards | grep bird-api

# Test API access
curl http://{CLOUDFRONT_DOMAIN}/api/...

Monitoring & Alerting

CloudWatch Alarms

Active Alarms:

Alert Destination:

CloudWatch Dashboard

Access via AWS Console: CloudWatch → Dashboards → bird-api-overview

Metrics Displayed:

Log Insights Queries

Use CloudWatch Logs Insights to analyze:

# Find errors
fields @timestamp, @message | filter @message like /ERROR/ | stats count()

# Analyze response times
fields @timestamp, response_time | stats avg(response_time), max(response_time)

# HTTP status distribution
fields status_code | stats count() as requests by status_code

# Pod resource usage
fields @timestamp, kubernetes.pod_name, container_cpu_utilization 
| stats avg(container_cpu_utilization) by kubernetes.pod_name

Scaling & Performance

Horizontal Pod Autoscaling (HPA)

Current Configuration:

Behavior:

Monitor HPA:

kubectl get hpa
kubectl describe hpa bird-api-hpa
kubectl top pods -l app=bird-api

Cluster Autoscaling

Current Configuration:

Behavior:

Monitor Cluster Autoscaler:

kubectl logs -n kube-system -l app=cluster-autoscaler -f
kubectl describe node <node-name>

Resource Requests & Limits

Per-Container Settings:

Rationale:


Cost Considerations

Resource Costs

Compute (Largest Cost):

Network:

Storage:

Monitoring:

Cost Optimization Tips

  1. Use Spot Instances: Add t2.medium spot instances to node group

  2. Reserved Instances: For production, pre-purchase capacity

  3. Downsize Dev: Use t2.small for development

  4. Adjust Max Nodes: Reduce max_size if maximum scale never reached

  5. Log Retention: Consider reducing retention_in_days for dev

  6. NAT Gateway: Consider NAT instance for dev (cheaper but less HA)


Failure Simulation & Testing

Included Script: failure-simulation.sh

Tests Simulated:

  1. CPU Spike (load generation)

  2. Memory Pressure

  3. Node Failure (cordon/uncordon)

  4. Pod Crash Loop (force restarts)

Running Tests:

chmod +x failure-simulation.sh
./failure-simulation.sh

Expected Alerts:


Troubleshooting Common Issues

Pods in Pending State

kubectl describe pod <pod-name>
# Check if node capacity issue
kubectl get nodes
# May need cluster autoscaler or more nodes

Service LoadBalancer No External IP

# Check AWS service annotations
kubectl get svc -o yaml

# Verify cluster has AWS Load Balancer controller installed
helm list -A | grep aws-load-balancer

HPA Not Scaling

# Verify Metrics Server is installed
kubectl get deployment metrics-server -n kube-system

# Check HPA status
kubectl describe hpa bird-api-hpa

# View metrics
kubectl top pods

CloudWatch Alarms Not Triggering

# Verify SNS subscription is confirmed
aws sns list-subscriptions-by-topic --topic-arn <topic-arn>

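# Check current alarm state
aws cloudwatch describe-alarms --alarm-names bird-api-pod-cpu-high

# Check alarm history
aws cloudwatch describe-alarm-history --alarm-name bird-api-pod-cpu-high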

EKS Cluster Access Issues

# Update kubeconfig
aws eks update-kubeconfig --name bird-api-cluster

# Test connectivity
kubectl cluster-info
kubectl auth can-i get pods

Additional Resources


Maintenance & Updates

Regular Tasks

Weekly:

Monthly:

Quarterly:

Upgrade Procedure

# Update cluster version
terraform apply -var="cluster_version=1.30"

# Update node group (blue-green strategy)
# 1. Update Terraform variable
# 2. terraform apply (creates new nodes before removing old)
# 3. Pods automatically migrate

Document Version: 1.0
Last Updated: 2024
Author: DevOps Team

Part 2 Preview (Autoscaling & Resilience)

The Autoscaling Problem

Imagine you run an API and suddenly you get 10x traffic. What happens?

Without Autoscaling:

With Autoscaling:

The difference? Two lines of configuration.

But here's the catch: autoscaling is easy to implement, hard to configure correctly.

The question we had to answer: What are the right thresholds?

Horizontal Pod Autoscaling (HPA)

What is HPA?

HPA is a Kubernetes controller that automatically scales the number of pods up or down based on metrics (usually CPU utilization).

Here's how it works:

1. Metrics Server collects CPU/Memory from pods (every 15 seconds)
2. HPA controller reads metrics
3. Calculates: desiredReplicas = currentReplicas × (currentCPU / targetCPU)
4. Updates Deployment with new replica count
5. Kubernetes scheduler creates/destroys pods
6. Repeat every 15 seconds

The Formula (Explained)

desiredReplicas = ceil[currentReplicas × (currentCPU / targetCPU)]

Let me break this down with a real example:

Scenario: Normal operation, traffic doubles

Before:
  Current replicas: 2
  Current CPU per pod: 70m (70 millicores)
  Target CPU: 70m (our threshold)
  desiredReplicas = ceil[2 × (70 / 70)] = ceil[1.0] = 2 pods
  Action: No change (we're at target)

After traffic doubles:
  Current replicas: 2
  Current CPU per pod: 140m (doubled!)
  Target CPU: 70m (same threshold)
  desiredReplicas = ceil[2 × (140 / 70)] = ceil[4] = 4 pods
  Action: Scale from 2 → 4 pods

What happens next:
  New 2 pods start up (takes ~5-10 seconds)
  Load balancer starts sending requests to them
  CPU per pod drops from 140m to 70m (load spreads)
  System stabilizes

This is the magic. The system rebalances itself.
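
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the controller's actual code: the function name and arguments are hypothetical, the min/max clamp uses the 2 and 10 replica limits from this article, and the 10% tolerance band is the Kubernetes default (an assumption about this cluster, since we never override it in the text).

```python
import math


def desired_replicas(current_replicas, current_cpu_m, target_cpu_m,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Return the replica count the HPA formula asks for (illustrative)."""
    ratio = current_cpu_m / target_cpu_m
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured minReplicas / maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, 140, 70))   # traffic spike: 2 pods at 140m each
print(desired_replicas(10, 140, 70))  # already at the ceiling
```

The second call shows why the max limit matters: the raw formula would ask for 20 pods, but the clamp holds the deployment at 10.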

Why 70% CPU Target? (Not 50%, not 90%)

This was one of our most debated decisions. Let me show you the analysis:

Option 1: 50% CPU Target

Threshold: Scale when average CPU across pods exceeds 50% of requested CPU

Pros:
✓ Lots of headroom
✓ Never hits limits
✓ Very responsive to traffic spikes

Cons:
✗ Wastes resources (30-50% idle pods)
✗ Higher cost (~$200/month instead of $141)
✗ Over-provisioned for typical workloads
✗ Money thrown away

Use case: Mission-critical systems (99.99% SLA)
Suitable for us? No

Option 2: 70% CPU Target (CHOSEN)

Threshold: Scale when average CPU across pods exceeds 70% of requested CPU

Pros:
✓ Good utilization (70% is industry standard)
✓ Proven for API workloads
✓ Cost-effective (~$141/month)
✓ Still provides safety margin (30% headroom)
✓ Minimal waste

Cons:
⚠ Brief spikes above 100% possible (mitigated by fast scaling)
⚠ Slightly less headroom than 50%

Use case: Production APIs with good SLA (99.9%)
Suitable for us? Yes, perfect fit

Real-world behavior at 70%:

Option 3: 90% CPU Target

Threshold: Scale when average CPU across pods exceeds 90% of requested CPU

Pros:
✓ Minimal waste
✓ Cheapest ($120/month)
✓ Maximum utilization

Cons:
✗ Frequent scaling (yo-yo effect)
✗ Insufficient headroom
✗ Risk of cascade failures
✗ Requests can timeout during scaling
✗ Poor user experience

Use case: Non-critical batch jobs
Suitable for us? No, too risky

Why 70% wins:

CPU Utilization vs Scaling Quality
50% CPU  →→→→→→→→→→  (Too conservative)
60% CPU  →→→→→→→→→→  (Better)
70% CPU  →→→→→→→→→→  (OPTIMAL) ✓
80% CPU  →→→→→→→→→→  (Risky)
90% CPU  →→→→→→→→→→  (Too aggressive)

Sweet spot: 70%
- Good utilization
- Safety margin
- Cost-effective
- Proven in production

Min & Max Replicas Decision

Why Min 2 replicas?

Option 1: 1 replica (minimum for cost)
├─ Cost: Lowest
├─ Availability: Zero HA
│   └─ Pod crashes → service down
│   └─ Pod update → downtime
└─ Risk: Unacceptable for "production-grade"

Option 2: 2 replicas (CHOSEN)
├─ Cost: Minimal ($7/month more)
├─ Availability: High (N-1 redundancy)
│   └─ 1 pod can fail, service continues
│   └─ 1 pod can update, other handles traffic
├─ Load distribution: Spreads requests
└─ Industry standard: Most production APIs

Option 3: 3 replicas (overkill for dev)
├─ Cost: $14/month more
├─ Availability: Very high (N-2 redundancy)
├─ Use case: Mission-critical systems
└─ For us: Unnecessary complexity

Decision: Min 2 replicas

Why Max 10 replicas?

Capacity Analysis:
- t2.medium node: 2 vCPUs = 2000m CPU
- Pod request: 100m CPU
- Pods per node (planning figure): 10, using 1000m and leaving the rest for system pods and headroom

With 5 nodes max:
- Total capacity: 5 nodes × 10 pods = 50 pods max
- But we limit to 10 per deployment
- Reason: Prevents runaway scaling

Scenario: Traffic spike hits hard
Current replicas: 2
New replicas desired: 10 (max limit)
Cost: 10 pods × $X per pod
What if it keeps going? Limited by max=10

Cost ceiling:
10 pods × 100m CPU × $0.05/hr ≈ $0.05/hr = $36/month per deployment
Total for 2 deployments: $72/month (vs $68 for 2 nodes baseline)
Acceptable cost increase for 10x capacity

Decision: Max 10 replicas
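
To make the cost ceiling easy to re-run with different limits, here is a small sketch of the same arithmetic. The helper name is hypothetical; the $0.05 per CPU-hour rate and the 730 hours per month are the planning figures used in this article, not quoted AWS prices.

```python
HOURS_PER_MONTH = 730  # planning figure used throughout this article


def monthly_cpu_cost(pods, cpu_cores_per_pod, price_per_core_hour):
    """Monthly cost of the CPU requested by a deployment at full scale."""
    return pods * cpu_cores_per_pod * price_per_core_hour * HOURS_PER_MONTH


per_deployment = monthly_cpu_cost(pods=10, cpu_cores_per_pod=0.1,
                                  price_per_core_hour=0.05)
print(f"per deployment: ${per_deployment:.2f}/month")
print(f"two deployments: ${2 * per_deployment:.2f}/month")
```

The unrounded result is slightly above the rounded per-deployment figure in the text; the conclusion is the same either way.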

Scaling Behavior: Up Fast, Down Slow

Scale Up (Aggressive):
├─ Trigger: CPU > 70% for 3 minutes
├─ Response: Immediate (adds pod every 15s if needed)
├─ Reason: Traffic is likely sustained
└─ Worst case: Add up to 10 pods in 2 minutes

Scale Down (Conservative):
├─ Trigger: CPU < 70% for 5 minutes
├─ Response: Remove 1 pod every 3 minutes
├─ Reason: Prevent yo-yo (scale up/down/up/down)
└─ Benefit: Saves cost gradually, not abruptly

Why asymmetric scaling (fast up, slow down)?

Real-world traffic pattern:
8:00 AM   ▁▁▁▁▂▃▅▇█▇▅▃▂▁▁▁▁  (morning surge)
         Scale up: 2→4→6→8→10 (good, users happy)

2:00 PM   ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁  (afternoon lull)
         Scale down slowly: 10→9→8→7→6→5→4→3→2

Why slow down?
- Traffic often returns soon (yo-yo prevention)
- Cost savings: 5 minutes × $0.05/hr = ~$0.004 (negligible)
- Stability: More important than saving pennies
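
The "slow down" half of this behavior comes from a stabilization window: the autoscaler remembers its recent recommendations and only scales down to the highest one seen inside the window. The sketch below is a simplified model of that idea, assuming the 300-second window from our configuration; the class and method names are hypothetical and the real controller has more rules (rate policies, scale-up stabilization).

```python
from collections import deque


class ScaleDownStabilizer:
    """Simplified model of an HPA scale-down stabilization window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp_seconds, recommended_replicas)

    def recommend(self, now, raw_recommendation, current_replicas):
        self.history.append((now, raw_recommendation))
        # Forget recommendations that have fallen out of the window.
        while self.history and self.history[0][0] < now - self.window_seconds:
            self.history.popleft()
        if raw_recommendation >= current_replicas:
            return raw_recommendation  # scale-up is applied immediately
        # Scale-down: honor the highest recommendation still in the window.
        return min(current_replicas, max(r for _, r in self.history))


stabilizer = ScaleDownStabilizer(window_seconds=300)
print(stabilizer.recommend(now=0, raw_recommendation=10, current_replicas=2))
print(stabilizer.recommend(now=60, raw_recommendation=4, current_replicas=10))
print(stabilizer.recommend(now=301, raw_recommendation=4, current_replicas=10))
```

A single quiet minute does not shrink the deployment; the lower count is only accepted once the spike has aged out of the window.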

Cluster Autoscaling

While HPA scales pods, Cluster Autoscaler scales nodes.

Think of it this way:

  • HPA: "We need more pods" → Adds pods

  • Cluster Autoscaler: "We're out of space for pods" → Adds nodes

How Cluster Autoscaler Works

Detection Loop (every 10 seconds):
1. Check for "Pending" pods (can't be scheduled)
2. Calculate if new node would help
3. Check if adding node violates constraints
4. If safe, launch new EC2 instance
5. Node joins cluster (~2-3 minutes)
6. Pending pods get scheduled

Example scenario:
- 2 nodes running, fully utilized
- 3rd pod tries to schedule
- No space available (Pending state)
- Cluster Autoscaler detects this
- Launches new t2.medium EC2 instance
- Instance joins cluster
- 3rd pod gets scheduled
- Total: 2-3 minutes
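
The sizing decision inside that loop boils down to a ceiling division with limits. Here is a rough sketch, assuming the article's planning figure of 10 pods per node and the 2 to 5 node range; the function names are hypothetical and the real Cluster Autoscaler simulates scheduling per pod rather than dividing totals.

```python
import math


def nodes_required(total_pods, pods_per_node=10, min_nodes=2, max_nodes=5):
    """Nodes needed to fit total_pods, clamped to the node group limits."""
    needed = math.ceil(total_pods / pods_per_node)
    return max(min_nodes, min(max_nodes, needed))


def unschedulable_pods(total_pods, pods_per_node=10, max_nodes=5):
    """Pods that stay Pending even with the node group at max size."""
    return max(0, total_pods - pods_per_node * max_nodes)


print(nodes_required(21))       # one pod over two full nodes
print(nodes_required(100))      # capped by max_nodes
print(unschedulable_pods(100))  # these stay Pending at the cap
```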

Configuration Decisions

Min Nodes: 2

Why 2 and not 1?

1 Node Setup:
├─ Cost: Cheapest
├─ HA: Zero (1 node failure = total downtime)
├─ Updates: Kill service during node updates
└─ Risk: Unacceptable

2 Nodes Setup:
├─ Cost: Acceptable
├─ HA: N-1 redundancy (1 node can fail)
├─ Updates: Can update 1 node while other runs service
└─ Industry standard for production

3+ Nodes:
├─ Cost: Higher baseline
├─ HA: N-2+ redundancy
├─ Use case: Mission-critical systems
└─ For us: Overkill

Max Nodes: 5

Capacity ceiling:
- 5 nodes × 10 pods per node = 50 pods maximum
- 50 pods × 100m CPU = 5 full cores
- Cost: 5 × t2.medium at ~$34/month each ≈ $172/month (vs $68 for 2)
- Prevents: Runaway cloud costs

Cost protection:
- If somehow pods try to scale to 100
- Node limit prevents creating 10+ nodes
- Caps node cost at 2.5x the 2-node baseline
- Alert triggers at 80% capacity

Scale Down Strategy

Conservative approach:
├─ Node must be < 50% utilized for 10 minutes
├─ Respects PodDisruptionBudgets (won't violate HA)
├─ Drains pods gracefully (30-second window)
├─ Waits for pods to move to other nodes
└─ Only then terminates instance

Why 10 minutes (not 5)?
- Prevents rapid scale down/up cycles
- Traffic often returns (lunch rush, meetings ending)
- Cost: ~$0.01 for 10 minutes of an idle node
- Worth the stability

Why respect PDBs?
- Ensures minimum availability during scale down
- If 2 pods on node and minAvailable=1
- Only 1 pod can be evicted at a time
- The other stays until the first is Ready elsewhere, delaying node termination
- Protects against "accidental downtime"
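
The "under 50% for 10 minutes" rule can be sketched as a check over recent utilization samples. This is an illustration only: the function name is hypothetical, the 10-second sample interval is borrowed from the detection loop described earlier, and it ignores the PodDisruptionBudget and drain steps that happen afterwards.

```python
def node_removable(utilization_samples, threshold=0.5,
                   required_seconds=600, sample_interval=10):
    """True if every sample in the last required_seconds is below threshold.

    utilization_samples: oldest-first list of fractions (0.0 to 1.0),
    one sample per sample_interval seconds.
    """
    needed = required_seconds // sample_interval
    if len(utilization_samples) < needed:
        return False  # not enough history yet
    return all(u < threshold for u in utilization_samples[-needed:])


quiet_for_ten_minutes = [0.3] * 60
print(node_removable(quiet_for_ten_minutes))
print(node_removable(quiet_for_ten_minutes[:-1] + [0.6]))  # one busy sample
```

One busy sample resets the clock, which is the behavior that prevents rapid scale down/up cycles.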

Failure Recovery Design

Here's where autoscaling meets resilience. What happens when pods or nodes actually fail?

Pod Failure Recovery

Scenario: Container crash (app exception, memory leak, etc)

Timeline:
T+0s    - Application crashes
        - Container exits with code 1
        
T+1s    - Kubelet detects exit
        - Restart policy: Always
        - Attempts restart

T+5s    - Container restarts, starts up
        - Health checks begin

T+15s   - Readiness probe succeeds
        - Load balancer adds back to rotation

T+30s   - Pod fully operational
        - Traffic flowing again

Recovery time: ~30 seconds
Service impact: 1 pod down, other pod handles traffic
User impact: None (transparent failover)

Protecting against cascading failures:

The key innovation here is the health check configuration:

livenessProbe:
  httpGet:
    path: /health
    port: 8080             # Assumption: replace with the container's actual port
  initialDelaySeconds: 30  # Wait 30s for app to start
  periodSeconds: 10        # Check every 10s
  failureThreshold: 3      # 3 failures = kill pod

readinessProbe:
  httpGet:
    path: /health
    port: 8080             # Assumption: replace with the container's actual port
  initialDelaySeconds: 10  # Fast feedback on readiness
  periodSeconds: 5         # Check every 5s
  failureThreshold: 3      # 3 failures = remove from LB

Why two probes?

Liveness Probe:
├─ Purpose: Detect hung processes
├─ Action: Kill and restart pod
├─ Latency: Up to 30 seconds (acceptable)
└─ Why slow: Avoids killing healthy pods that are busy

Readiness Probe:
├─ Purpose: Detect when app is ready to serve
├─ Action: Remove from load balancer (don't kill)
├─ Latency: Fast (up to 15 seconds)
└─ Why fast: Users should never hit unready pods
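
The two latency figures follow directly from the probe settings: detection takes roughly periodSeconds × failureThreshold. A small sketch, with a hypothetical Probe class and ignoring the per-request probe timeout (so real worst cases are slightly longer):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int

    def detection_seconds(self):
        """Approximate time to act on a failure once the pod is running."""
        return self.period_seconds * self.failure_threshold


liveness = Probe(initial_delay_seconds=30, period_seconds=10, failure_threshold=3)
readiness = Probe(initial_delay_seconds=10, period_seconds=5, failure_threshold=3)

print("liveness:", liveness.detection_seconds(), "seconds")
print("readiness:", readiness.detection_seconds(), "seconds")
```

Tightening either number is a trade-off: a shorter liveness window restarts hung pods sooner but also risks killing pods that are merely busy.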

Node Failure Recovery

Scenario: Node hardware fails or gets disconnected

Timeline:
T+0s    - Network partition or node crash
        - Kubernetes doesn't know yet

T+40s   - Node heartbeats missed past the grace period
        - Node marked "NotReady"

T+5m    - Default 5-minute eviction toleration expires
        - Controller marks pods for eviction

T+5m30s - Graceful termination begins
        - Pods get 30-second shutdown window
        - New pods scheduled on healthy nodes

T+6m    - Pods forcibly terminated
        - If graceful termination didn't work

T+6m30s - Cluster Autoscaler detects
        - Unschedulable pods
        - Launches new EC2 instance

T+8m    - New node fully ready
        - Joins cluster

T+8m30s - Pods scheduled and ready
        - Load balanced to new node

Total recovery time: 8-10 minutes
Service impact: Temporary 50% capacity loss (if 2 nodes)
User impact: Requests go to remaining node (slower)
Data impact: None (stateless design)

Why 8-10 minutes is acceptable:

RTO vs Cost Trade-off:

Faster recovery (< 5 min):
├─ Requires: Reserved capacity (empty node always ready)
├─ Cost: +$35/month (extra node sitting idle)
├─ Benefit: Faster failover
├─ For us: Not worth it (dev environment)

Current approach (8-10 min):
├─ Cost: Optimal (only pay for what we use)
├─ Recovery: Slow but acceptable
├─ For us: Perfect balance

SLA implications:
├─ 99.9% uptime allows: ~8 hours downtime/year
├─ 1 node failure every 10 min: Highly unlikely
├─ Multiple failures/year needed to violate SLA
└─ Current approach sufficient
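
The SLA budget is simple arithmetic and worth checking for other targets. The sketch below uses hypothetical helper names and a deliberately pessimistic assumption: it counts every node failure as a full 10-minute outage, even though the surviving node keeps serving traffic in our setup.

```python
def downtime_budget_hours(availability, hours_per_year=365 * 24):
    """Hours of downtime per year allowed by an availability target."""
    return (1 - availability) * hours_per_year


def outages_within_budget(availability, outage_minutes):
    """How many outages of a given length fit inside the yearly budget."""
    budget_minutes = downtime_budget_hours(availability) * 60
    return int(budget_minutes // outage_minutes)


print(round(downtime_budget_hours(0.999), 2), "hours/year at 99.9%")
print(outages_within_budget(0.999, outage_minutes=10), "ten-minute outages at 99.9%")
print(outages_within_budget(0.9999, outage_minutes=10), "ten-minute outages at 99.99%")
```

The jump from 99.9% to 99.99% shrinks the budget tenfold, which is why the faster (and more expensive) recovery options are reserved for mission-critical systems.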

Availability Zone Failure

Scenario: Entire us-east-1a goes down

Setup:
Node 1 in us-east-1a (2 pods)
Node 2 in us-east-1b (2 pods)

Failure:
T+0s   - All us-east-1a infrastructure fails
T+40s  - Node 1 marked NotReady
T+6m   - Pods on Node 1 evicted
T+6m30s- Cluster Autoscaler launches replacement
T+8m   - New node in us-east-1b ready
T+8m30s- Pods scheduled

Impact:
├─ Temporary capacity: 2 pods (down from 4)
├─ Service: Still available (reduced performance)
├─ Traffic: Load concentrated on Node 2
├─ Duration: 8-10 minutes
└─ No data loss (stateless design)

Why we use 2 AZs (not 3):

Cost-benefit analysis:

2 AZs (CHOSEN):
├─ Cost: $141/month
├─ Capacity loss: 50% during AZ failure
├─ Recovery: 8-10 minutes
├─ Suitable for: Development, normal production
└─ Trade-off: Acceptable

3 AZs:
├─ Cost: +$13/month (~11% increase)
├─ Capacity loss: 33% during AZ failure
├─ Recovery: Same 8-10 minutes
├─ Suitable for: Mission-critical, 99.99% SLA
└─ For us: Overkill

Decision: 2 AZs is optimal
Could upgrade to 3 AZs later if needed

Pod Disruption Budgets (PDB)

The final piece of resilience: PDB prevents accidental downtime

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: bird-api-pdb
spec:
  minAvailable: 1  # Always keep at least 1 pod
  selector:
    matchLabels:
      app: bird-api

What this does:

Scenario: Admin accidentally drains both nodes at once (kubectl drain)

Without PDB:
├─ All pods evicted immediately
├─ Service completely down
├─ 30-second outage

With PDB (minAvailable: 1):
├─ First pod evicted
├─ Second pod protected by PDB
├─ Eviction API refuses the second eviction until a replacement is Ready
├─ Service continues
├─ Admin must explicitly override
├─ Caveat: a direct kubectl delete pod bypasses the Eviction API, so a PDB does not block it

When PDB prevents disruption:

  1. Cluster upgrades - K8s respects PDB

  2. Node drains - Won't violate PDB

  3. Admin mistakes - Accidental kubectl drain of too many nodes

  4. Voluntary disruptions - Node maintenance

When PDB doesn't help:

  1. Pod out of memory - OOMKilled regardless

  2. Node crash - Involuntary (kills immediately)

  3. Direct pod deletion - kubectl delete pod bypasses the Eviction API (with or without --force)
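
The bookkeeping behind a minAvailable budget is small enough to sketch. The function names here are hypothetical; the logic mirrors how a budget with minAvailable translates into a number of evictions that may proceed right now, and it only applies to voluntary evictions, not to crashes.

```python
def allowed_disruptions(healthy_pods, min_available):
    """Number of pods that may be voluntarily evicted right now."""
    return max(0, healthy_pods - min_available)


def can_evict(healthy_pods, min_available):
    """True if one more voluntary eviction would keep the budget intact."""
    return allowed_disruptions(healthy_pods, min_available) > 0


# Two healthy bird-api pods, minAvailable: 1
print(can_evict(healthy_pods=2, min_available=1))  # first eviction
print(can_evict(healthy_pods=1, min_available=1))  # second eviction
```

Once the evicted pod is Ready again on another node, the healthy count returns to 2 and the next eviction is allowed.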

Testing Resilience

How do we know this actually works?

We tested it. Here's what we did:

Test 1: Kill a Pod

# Get a pod name
POD_NAME=$(kubectl get pods -l app=bird-api -o jsonpath='{.items[0].metadata.name}')

# Kill it
kubectl delete pod $POD_NAME

# Watch replacement
watch kubectl get pods -n default -l app=bird-api

Results:

Test 2: Scale Down and Watch HPA Scale Back Up

# Scale down to 1 pod
kubectl scale deployment bird-api --replicas=1

# Wait 15 seconds
sleep 15

# Check status
kubectl get deployment bird-api
kubectl get hpa -n default

Results:

Test 3: Simulate High Load

# Run load generator in pod
kubectl run -i --tty load-gen --image=busybox --restart=Never -- /bin/sh

# Inside pod
while true; do wget -q -O- http://bird-api-service/; done

Results:

Test Results Summary

Test                          Status    Recovery Time    User Impact
────────────────────────────────────────────────────────────────────
Pod crash                     ✓ Pass    ~10 seconds      None
Manual pod deletion           ✓ Pass    ~10 seconds      None
HPA scale up under load       ✓ Pass    ~30 seconds      Minimal
HPA scale down (idle)         ✓ Pass    ~5 minutes       None
Service stays available       ✓ Pass    N/A              Zero downtime

RTO Achievement: < 15 seconds (pod recovery)
RPO Achievement: 0 (stateless design)
Service Availability: 100% during tests

Key Takeaways

1. Autoscaling Requires Right Thresholds

2. Two-Level Autoscaling Is Essential

3. Resilience Comes from Multiple Layers

4. Recovery Is Automatic

5. Test Your Resilience

6. Cost Follows Traffic

  • Light traffic: 2 pods, 2 nodes (~$0.19/hr)

  • Heavy traffic: 10 pods, 5 nodes (~$0.40/hr)

  • Automatic cost optimization

Configuration Reference

HPA Configuration:
  targetCPUUtilizationPercentage: 70
  minReplicas: 2
  maxReplicas: 10
  scaleDownStabilizationWindow: 300s
  scaleUpPeriod: 0s

Cluster Autoscaler:
  minNodes: 2
  maxNodes: 5
  scaleDownEnabled: true
  scaleDownUtilizationThreshold: 0.5
  scaleDownDelayAfterAdd: 10m

Health Checks:
  livenessProbe:
    initialDelaySeconds: 30
    periodSeconds: 10
    failureThreshold: 3
    
  readinessProbe:
    initialDelaySeconds: 10
    periodSeconds: 5
    failureThreshold: 3

Part 3: Monitoring with CloudWatch - Knowing What to Watch

In Parts 1 and 2, we built a self-healing infrastructure that scales automatically. Now comes the critical question: How do we know when something goes wrong?

This is where monitoring separates "hope-driven development" from production systems.

But here's the paradox: More metrics isn't better. Too many alarms lead to alert fatigue.

This part covers our monitoring strategy: what to measure, why, and how to alert without drowning in notifications.

The Monitoring Problem

Imagine this: You wake up at 2 AM. Your phone is vibrating.

Alert: "Pod memory at 82%"

You check. Service is fine. False alarm. Go back to sleep.

2 more alerts come that night.

By morning, you've snoozed 10 alerts. All false.

Now the real problem happens: Pod memory climbs past 85% and hits its limit → OOMKill → service down → you don't notice because you've stopped paying attention to alerts.

This is alert fatigue, and it kills production systems.

The solution? Alert only on things that matter.

What Metrics Should We Track?

Not all metrics are equal. Some predict problems, others just report them.

Leading Indicators (predict problems)

  • CPU utilization (scaling will happen)

  • Memory usage trend (heading toward OOMKill)

  • Pod restart count (app is crashing)

Lagging Indicators (report existing problems)

  • Request latency

  • Error rate

  • Node status

We focus on leading indicators because we can act before users are affected.

The Five Metrics We Track

1. Pod CPU Utilization

What it measures: Percentage of CPU core being used
Source: Metrics Server (cluster add-on, not built into Kubernetes)
Update frequency: Every 15 seconds
Why it matters: Drives HPA scaling decision

Interpretation:
  < 30%   → Underutilized, pod is idle
  30-70%  → Healthy, normal operation
  70-80%  → Getting hot, HPA may trigger
  80-90%  → Warning, investigate
  > 90%   → Critical, possible resource leak

Action if consistently > 80%:
  - Increase pod limits? (more CPU per pod)
  - Optimize code? (use less CPU)
  - Scale more? (lower threshold)

2. Pod Memory Utilization

What it measures: Percentage of RAM being used
Source: Metrics Server
Update frequency: Every 15 seconds
Why it matters: Early warning for OOMKill

Interpretation:
  < 30%   → Plenty of headroom
  30-70%  → Normal
  70-85%  → Getting tight, monitor
  > 85%   → Warning, action needed
  > 100%  → OOMKilled (dead)

Why 85% threshold (not 90%)?
  Memory is NOT compressible
  If you go over limit, pod dies immediately
  No graceful degradation like CPU
  85% gives ~77Mi buffer before 512Mi limit
  
Action if consistently > 85%:
  - Increase memory limit?
  - Check for memory leak?
  - Profile application?

3. Pod Restart Count

What it measures: How many times pod has restarted
Source: Kubelet
Update frequency: Real-time
Why it matters: Indicates application crashes

Interpretation:
  0 restarts → Healthy, never crashed
  1-2 times  → Occasional issues, investigate
  > 5 times  → Serious problem, fix immediately
  
Why > 5 in 5 minutes is critical:
  More than 1 restart per minute = crash loop
  Application is fundamentally broken
  Needs immediate attention

Possible causes:
  - OOMKilled (memory leak)
  - Application bug (exception)
  - Configuration error (bad env var)
  - Dependency unavailable (can't reach database)

Investigation steps:
  kubectl logs pod-name --previous
  kubectl describe pod pod-name

4. Node Status

What it measures: Is node Ready or NotReady?
Source: Kubernetes API
Update frequency: Every 40 seconds
Why it matters: Node failure is worst-case scenario

Status values:
  Ready    → Node healthy, accepting pods
  NotReady → Node unhealthy, evicting pods
  Unknown  → Kubernetes lost contact
  
When NotReady occurs:
  1. Node network failure
  2. Kubelet process crashed
  3. Disk full on node
  4. Node running out of memory
  5. EBS volume detached

Recovery:
  Cluster Autoscaler detects NotReady
  Launches replacement node
  Evicts pods to healthy nodes
  Terminates failed node
  
Timeline: 8-10 minutes total recovery

5. Error Rate (Optional, in logs)

What it measures: Percentage of requests that failed
Source: Application logs (parsed)
Update frequency: Every 5 minutes
Why it matters: Indicates service health

Healthy:
  Error rate < 0.1% (1 error per 1000 requests)
  
Warning:
  Error rate 0.1% - 1%
  Investigate why errors increasing
  
Critical:
  Error rate > 1%
  Service is in trouble
  
Note: We don't have HTTP-level logging in this setup
Could add it later with X-Ray or APM tools

CloudWatch vs Alternatives

Let me show you why we chose CloudWatch over competitors.

Monitoring Options Comparison

CloudWatch      │ Native AWS        │ $5/month
Prometheus      │ Self-hosted       │ Ops effort
Datadog         │ SaaS              │ $15-30/host
New Relic       │ SaaS              │ $40+/month
Splunk          │ Self-hosted/SaaS  │ $$$$

Detailed Comparison: CloudWatch vs Prometheus

Feature              CloudWatch      Prometheus
─────────────────────────────────────────────────
Setup time           5 minutes        2 hours
Infrastructure       Managed          Self-hosted
Query language       CloudWatch       PromQL
Dashboard quality    Good             Excellent
Alerting             Built-in         AlertManager
Cost                 $5/month         $20-40/month
                     (baseline)       (EC2 + monitoring)

Best for:
CloudWatch          ↓
- AWS-native setup
- Minimal ops burden
- Quick deployment
- Good enough queries

Prometheus          ↓
- Multi-cloud setup
- Powerful queries (PromQL)
- Advanced dashboards
- Complete control

Why We Chose CloudWatch

Decision Matrix:

1. Already on AWS
   → CloudWatch integrates natively
   → No additional infrastructure
   → Cost: Included baseline

2. Small-scale operation (2 pods)
   → CloudWatch sufficient
   → PromQL power not needed
   → Log Insights queries adequate

3. Cost-conscious
   → CloudWatch: ~$5/month
   → Prometheus + Grafana: ~$25/month
   → Savings: $20/month (14% total cost)

4. Minimal ops overhead
   → CloudWatch: Set it and forget
   → Prometheus: Manage Prometheus server, storage, Alertmanager, etc
   → No operational burden preferred

Decision: CloudWatch for this scale
Future migration path: Easy to add Prometheus if needed

The Four Alarms

We configured exactly 4 CloudWatch alarms. No more, no less.

Why 4? Because each alerts on something that requires action.

Alarm 1: Pod CPU High

Configuration:
  Metric: Pod CPU Utilization (average)
  Threshold: > 80%
  Duration: 2 periods × 5 min = 10 minutes total
  Action: Send SNS notification (→ Slack, email)

Trigger scenario:
  T+0 min:  Pod CPU hits 81%
  T+5 min:  Still at 85% (first period met)
  T+10 min: Still at 82% (second period met)
  → ALARM FIRES

Why 80% (not 70%)?
  70% is HPA trigger (automatic scaling)
  80% means HPA is in progress but not enough yet
  Need to investigate if:
    - Resource limits too low?
    - Code has efficiency issue?
    - Legitimate traffic spike?

Why 2 periods (10 minutes)?
  Prevents false alarms from brief spikes
  Gives HPA time to scale
  Only alert if sustained high CPU

Investigation steps:
  1. Check HPA status (kubectl get hpa)
  2. Check pod count (kubectl get pods)
  3. Check application logs (kubectl logs)
  4. Check recent code changes
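
The "2 periods × 5 min" rule means the alarm only fires when every one of the most recent evaluation periods breaches the threshold. Here is a simplified sketch of that evaluation; the function name is hypothetical, and it ignores CloudWatch details such as missing-data handling and "M out of N" datapoint settings.

```python
def alarm_fires(period_averages, threshold=80.0, evaluation_periods=2):
    """True if the last evaluation_periods averages all exceed threshold.

    period_averages: oldest-first list, one average per 5-minute period.
    """
    if len(period_averages) < evaluation_periods:
        return False  # not enough data to evaluate yet
    recent = period_averages[-evaluation_periods:]
    return all(value > threshold for value in recent)


print(alarm_fires([85, 82]))  # sustained high CPU, as in the scenario above
print(alarm_fires([85, 75]))  # HPA caught up during the second period
print(alarm_fires([95]))      # a single hot period is not enough
```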

Alarm 2: Pod Memory High

Configuration:
  Metric: Pod Memory Utilization (average)
  Threshold: > 85%
  Duration: 2 periods × 5 min = 10 minutes
  Action: Send SNS notification

Trigger scenario:
  Pod allocated: 512Mi
  Pod using: 435Mi (85% of limit)
  After 10 minutes at this level
  → ALARM FIRES

Why 85% (not 90%)?
  Memory can't be compressed
  If you hit 100%, pod gets OOMKilled immediately
  85% gives ~77Mi buffer before disaster
  More conservative than CPU threshold

Why NOT auto-scale memory?
  Our HPA is configured to scale on CPU only
  VPA (Vertical Pod Autoscaler) can resize memory
  But VPA requires pod restart (causes downtime)
  Better to fix the leak than auto-resize

Investigation steps:
  1. Check memory trend (increasing or stable?)
  2. Look for memory leak (kubectl logs)
  3. Check database connections (each =~1-2Mi)
  4. Profile application memory usage
  5. If leak found: deploy fixed version

Alarm 3: Node Not Ready

Configuration:
  Metric: Node Status
  Threshold: status != "Ready"
  Duration: 1 period (no delay)
  Action: URGENT - Send SNS, may trigger on-call

Trigger scenario:
  Node heartbeat stops
  Kubernetes waits ~40 seconds (heartbeat grace period)
  Marks node "NotReady"
  → ALARM FIRES IMMEDIATELY

Why no delay?
  Node failure is urgent
  Every second matters
  Need immediate attention
  Cluster Autoscaler should handle, but alert anyway

What this means:
  Hardware failed
  Network disconnected
  Kubernetes components crashed
  Something serious

Immediate action:
  1. Check node status: kubectl describe node <name>
  2. Check AWS console (is instance running?)
  3. Check Cluster Autoscaler logs
  4. Manual intervention if CA doesn't fix it

Expected resolution:
  Cluster Autoscaler detects and replaces
  Takes 8-10 minutes
  Manual check confirms recovery

Alarm 4: Pod Restarts High

Configuration:
  Metric: Pod Restart Count
  Threshold: > 5 in 5 minutes
  Duration: 1 period (immediate)
  Action: Send SNS notification

Trigger scenario:
  8:00:00 - Pod restart #1
  8:01:00 - Pod restart #2
  8:02:00 - Pod restart #3
  8:03:00 - Pod restart #4
  8:04:00 - Pod restart #5
  → ALARM FIRES on the next restart inside the window (> 5 in 5 minutes)

What this indicates:
  Application crash loop
  Restarts are failing faster than recovering
  Something is fundamentally wrong
  Not a temporary issue

Root causes (in order of likelihood):
  1. Out of memory (OOMKill) - 40%
  2. Application bug (unhandled exception) - 30%
  3. Dependency unavailable (can't reach DB) - 20%
  4. Configuration error (bad env var) - 10%

Investigation:
  # Check logs from crashed pod
  kubectl logs <pod-name> --previous
  
  # Check events
  kubectl describe pod <pod-name>
  
  # Is it OOMKilled?
  kubectl get pod <pod-name> -o yaml | grep -i "OOMKilled"
  
  # Check recent deployments
  kubectl rollout history deployment/bird-api

Resolution options:
  1. Increase memory limit (if OOMKilled)
  2. Rollback to previous version (if app bug)
  3. Fix configuration (if env var wrong)
  4. Restart dependent service (if dependency issue)
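
A "more than 5 restarts in 5 minutes" rule is a sliding-window count. The sketch below models it with a hypothetical RestartAlarm class; timestamps are plain seconds, and with a strict greater-than comparison it is the sixth restart inside the window that trips the alarm.

```python
from collections import deque


class RestartAlarm:
    """Fires when more than `threshold` restarts occur within the window."""

    def __init__(self, threshold=5, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.restarts = deque()  # restart timestamps in seconds

    def record_restart(self, timestamp):
        self.restarts.append(timestamp)
        # Drop restarts that are a full window old or older.
        while self.restarts and self.restarts[0] <= timestamp - self.window_seconds:
            self.restarts.popleft()
        return len(self.restarts) > self.threshold


alarm = RestartAlarm()
for t in (0, 60, 120, 180, 240):
    print(t, alarm.record_restart(t))   # five restarts: still quiet
print(290, alarm.record_restart(290))   # sixth restart inside 5 minutes
```

A pod that restarts once every couple of minutes never reaches the threshold, which keeps occasional crashes from paging anyone.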

Alert Fatigue Prevention

Here's what we intentionally don't alert on:

NOT alerting on these (prevents noise):

❌ CPU < 70% (normal, why alert?)
❌ Memory < 85% (healthy)
❌ Pod restarts = 0 (good thing!)
❌ Node count = 2-3 (normal range)
❌ Request latency < 500ms (acceptable)
❌ Error rate < 0.1% (normal)
❌ Pod not ready for 30 seconds (might be starting)

Why not?
  These are all NORMAL states
  Alerting on normal causes "noise"
  Teams stop paying attention
  Real problems get missed
  
The goal: Alert only when action is needed

Log Insights Queries

CloudWatch Logs are useless without ways to query them. Here are the queries we actually use:

Query 1: Recent Errors

fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as error_count by @message
| sort error_count desc

What it does: Finds all errors in logs, groups by message, sorts by frequency

Use case: Something goes wrong, want to know what

Query 2: Pod Restart Timeline

fields @timestamp, kubernetes.pod_name, @message
| filter @message like /restarted/
| stats count() as restart_count by kubernetes.pod_name, @timestamp
| sort @timestamp desc

What it does: Shows when pods restarted and how many times

Use case: Debugging the "pod_restarts_high" alarm

Query 3: Performance Analysis

fields response_time
| stats avg(response_time) as avg_rt, 
        pct(response_time, 95) as p95_rt,
        pct(response_time, 99) as p99_rt

What it does: Calculates average, 95th percentile, 99th percentile latency

Use case: Trending performance degradation

Query 4: Traffic Volume

fields @timestamp
| stats count() as request_count by @timestamp
| sort @timestamp desc

What it does: Shows requests per time period

Use case: Correlate with alerts (was there traffic spike?)

Cost of Monitoring

How much does CloudWatch actually cost?

CloudWatch Pricing:
├─ Logs ingested: $0.50 per GB
│  ├─ Our logs: ~100MB/day = 3GB/month
│  ├─ Cost: 3GB × $0.50 = $1.50/month
│  └─ With retention: 7 days
│
├─ Metrics: $0.30 per custom metric per month
│  ├─ We use 5 metrics (built-in)
│  ├─ Cost: $0 (built-in metrics free)
│  └─ Only pay for custom metrics we add
│
├─ Dashboards: Free
│  └─ Cost: $0
│
├─ Alarms: $0.10 per alarm per month
│  ├─ 4 alarms configured
│  ├─ Cost: 4 × $0.10 = $0.40/month
│  └─ Cheap insurance
│
└─ Total: ~$2-5/month

Compare to alternatives:
  Prometheus: $25-40/month (self-hosted EC2)
  Datadog: $40-100/month (SaaS)
  New Relic: $100+/month
  
CloudWatch: $5/month (best deal for AWS workloads)
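
To see how the bill moves if we add alarms or custom metrics, the breakdown above can be turned into a tiny estimator. The function name is hypothetical and the unit prices are the ones quoted in this article, so treat the output as a planning estimate rather than a quote from AWS.

```python
def cloudwatch_monthly_cost(log_gb, alarms, custom_metrics=0,
                            log_price_per_gb=0.50, alarm_price=0.10,
                            metric_price=0.30):
    """Estimate monthly CloudWatch spend from the article's unit prices."""
    return (log_gb * log_price_per_gb
            + alarms * alarm_price
            + custom_metrics * metric_price)


current = cloudwatch_monthly_cost(log_gb=3, alarms=4)
with_custom = cloudwatch_monthly_cost(log_gb=3, alarms=4, custom_metrics=5)
print(f"current setup: ${current:.2f}/month")
print(f"with 5 custom metrics: ${with_custom:.2f}/month")
```

Log volume is the dominant term, so shortening retention or trimming noisy log lines has more effect than removing an alarm.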

Key Takeaways

1. Measure the Right Things

  • Focus on leading indicators (CPU, memory, restarts)

  • Avoid lagging indicators (latency, error rate) until at scale

  • Track only what you'll act on

2. Alert on Signals, Not Noise

  • 4 alarms, not 40

  • Only alert when action is needed

  • Prevent alert fatigue

3. CloudWatch is Good Enough

  • Native AWS integration

  • Cheap ($5/month)

  • Sufficient for development

  • Scale to Prometheus when needed

4. Logs Are Searchable

  • Log Insights queries enable debugging

  • Store logs 7-14 days

  • Delete old logs to control costs

5. Test Your Alarms

  • Don't assume they work

  • Actually trigger alert conditions

  • Make sure notifications work (Slack/email)

  • Document what each alarm means

Quick Reference: Our 4 Alarms

Alarms:
  pod_cpu_high:
    Threshold: > 80%
    Duration: 10 minutes
    Action: Investigate why HPA didn't prevent
    
  pod_memory_high:
    Threshold: > 85%
    Duration: 10 minutes
    Action: Increase memory or fix leak
    
  node_not_ready:
    Threshold: status != Ready
    Duration: Immediate
    Action: Check if Cluster Autoscaler fixes it
    
  pod_restarts_high:
    Threshold: > 5 in 5 minutes
    Duration: Immediate
    Action: Check logs for OOM or exceptions

Part 4: Cost Optimization - Running Kubernetes on AWS Without Breaking the Bank

You've built a production-grade infrastructure. It scales automatically. It monitors itself. It recovers from failures.

Now the hard question: Is it costing too much?

Our current setup runs roughly $140-160 per month. For a development environment, that's reasonable. But for a startup, every dollar matters.

This part covers cost optimization strategies: from quick wins ($50/month savings) to major architecture changes ($100/month savings).

Current Cost Breakdown

Let's see where every dollar goes:

Monthly Cost Analysis (30 days)

EKS Control Plane:     $73.00/month
EC2 Worker Nodes:      $68.62/month  (2 × t2.medium)
Network Load Balancers: $8.76/month  (2 × NLB)
CloudFront CDN:        ~$0.85/month  (~10GB/month)
S3 Storage:            ~$0.50/month  (state + logs)
DynamoDB:              ~$0.50/month  (state locking)
CloudWatch:            ~$1.90/month  (logs + alarms)
Data Transfer:         ~$0.23/month  (NAT Gateway)

─────────────────────────────────────────
TOTAL MONTHLY COST: ~$154
─────────────────────────────────────────

Breakdown by component:
  EC2 Nodes:    44% ($68.62)
  EKS Control:  47% ($73.00)
  Load Balancers: 6% ($8.76)
  Everything else: 3% (~$3.98)

Where to optimize?
  EC2 Nodes (44%): Biggest opportunity
  EKS Control (47%): Can't change (managed)
  Load Balancers (6%): Can consolidate later
  Everything else (3%): Already cheap
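
Summing the line items is a quick sanity check on the total and on each component's share. A minimal sketch, using only the figures listed in the monthly cost analysis above:

```python
# Line items copied from the monthly cost analysis above (USD per month).
LINE_ITEMS = {
    "EKS control plane": 73.00,
    "EC2 worker nodes": 68.62,
    "Network Load Balancers": 8.76,
    "CloudFront": 0.85,
    "S3": 0.50,
    "DynamoDB": 0.50,
    "CloudWatch": 1.90,
    "Data transfer": 0.23,
}


def shares(items: dict[str, float]) -> dict[str, float]:
    """Return each item's share of the total, in percent."""
    total = sum(items.values())
    return {name: 100 * cost / total for name, cost in items.items()}


total = sum(LINE_ITEMS.values())
print(f"Total: ${total:.2f}/month")
for name, pct in sorted(shares(LINE_ITEMS).items(), key=lambda kv: -kv[1]):
    print(f"  {name:<24} {pct:5.1f}%")
```

Re-running this whenever a line item changes keeps the percentages honest.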

Cost Optimization Strategies

Now let's look at concrete ways to reduce costs.

Strategy 1: Switch to ECS Fargate (Saves ~$20-30/month)

Current Setup:

  • 2 t2.medium nodes: $68.62/month

  • Each pod: 100m CPU, 128Mi memory

Fargate Alternative:

  • Pay per pod: ~$0.05 per vCPU per hour, plus ~$0.005 per GB of memory per hour

  • 2 pods on average, each requesting 100m CPU = 0.1 vCPU

  • CPU per pod: 0.1 × $0.05 × 730 hours = $3.65/month

  • Memory per pod: ~0.5GB × $0.005 × 730 hours ≈ $1.83/month

  • Total for 2 pods: ~$11/month

  • With overhead (extra pods during scaling): ~$40-50/month

Savings: $20-30/month (20% reduction)

Trade-offs:

Fargate Pros:
✓ No node management
✓ No Cluster Autoscaler needed
✓ Cheaper for small workloads
✓ Automatic scaling
✓ Simpler operations

Fargate Cons:
✗ Can't SSH into nodes
✗ Limited customization
✗ Slightly higher per-pod cost at scale
✗ Not all EKS features available

When to switch:
  - If you have < 20 pods
  - If you need zero node management
  - If cost > operational complexity

Decision: Could implement now for 20% savings
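
To see how the Fargate estimate reacts to pod size and pod count, here is a small calculator. It is only a sketch: the two rates are the rounded figures quoted above, not current AWS prices, and `fargate_monthly` is a hypothetical helper name.

```python
HOURS_PER_MONTH = 730

# Rounded rates quoted in this section (assumptions; real prices vary by region).
VCPU_RATE = 0.05   # USD per vCPU-hour
GB_RATE = 0.005    # USD per GB-hour


def fargate_monthly(vcpu: float, memory_gb: float, pods: int = 1) -> float:
    """Monthly cost of `pods` always-on pods with the given per-pod size."""
    hourly = vcpu * VCPU_RATE + memory_gb * GB_RATE
    return round(hourly * HOURS_PER_MONTH * pods, 2)


cpu_only = fargate_monthly(vcpu=0.1, memory_gb=0.0)   # CPU share for one pod
per_pod = fargate_monthly(vcpu=0.1, memory_gb=0.5)
two_pods = fargate_monthly(vcpu=0.1, memory_gb=0.5, pods=2)
print(f"CPU only: ${cpu_only}, per pod: ${per_pod}, two pods: ${two_pods}")
```

Keep in mind that Fargate bills for the task size it allocates rather than the raw request, so very small pods are likely to be rounded up to the smallest supported size. That is one reason the realistic estimate above is much higher than the raw per-pod arithmetic.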

Strategy 2: Use Spot Instances (Saves ~$48/month)

Current Setup:

  • 2 t2.medium on-demand: $0.0470/hour each = $68.62/month

  • Fully available 24/7

Spot Alternative:

  • 2 t2.medium spot: ~$0.0140/hour each = $20.44/month

  • Discount: 70% cheaper

  • Risk: AWS can reclaim the instance at any time with only a 2-minute notice

Savings: $48/month (70% reduction)

Trade-offs:

Spot Pros:
✓ 70% cheaper than on-demand
✓ Designed for fault-tolerant workloads
✓ Perfect for Kubernetes (auto-healing)
✓ Auto Scaling Group handles replacement
✓ Transparent to application

Spot Cons:
✗ 2-minute interruption notice
✗ Brief traffic spike to remaining nodes
✗ Not suitable for stateful workloads
✗ Need to handle rapid restarts

Kubernetes fit:
  Spot is a strong fit for Kubernetes
  - Auto Scaling Group launches a replacement node automatically
  - Pods are rescheduled onto the remaining nodes
  - 99.9% uptime achievable
  - Users might not notice a 30s disruption

Risk analysis:
  Spot interruption frequency: ~2-3/month/instance
  Each disruption: 30-60 seconds
  Impact: Temporary load spike, no data loss
  
  For our 2-node setup:
  - Expected interruptions: 2-3/month with one Spot node, 4-6/month with both on Spot
  - Each degrades latency for 30-60 seconds while pods reschedule
  - Acceptable for development
  - Unacceptable for mission-critical

Decision: Implement Spot + On-Demand mix (1 spot, 1 on-demand) = ~$24/month savings (node cost drops from $68.62 to ~$45/month)
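
The three node mixes can be compared with a few lines of arithmetic. This sketch uses the hourly rates quoted in this section; the Spot rate in particular is an approximation, since Spot prices fluctuate.

```python
HOURS_PER_MONTH = 730
ON_DEMAND_HOURLY = 0.0470  # t2.medium on-demand rate used in this article
SPOT_HOURLY = 0.0140       # approximate Spot rate used in this article


def node_cost(on_demand: int, spot: int) -> float:
    """Monthly compute cost for a mix of on-demand and Spot nodes."""
    hourly = on_demand * ON_DEMAND_HOURLY + spot * SPOT_HOURLY
    return round(hourly * HOURS_PER_MONTH, 2)


baseline = node_cost(on_demand=2, spot=0)
for od, sp in [(2, 0), (1, 1), (0, 2)]:
    cost = node_cost(od, sp)
    print(f"{od} on-demand + {sp} spot: ${cost:.2f}/month, "
          f"saves ${baseline - cost:.2f}")
```

The mixed option keeps one node that cannot be interrupted while still capturing about half of the available savings.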

Strategy 3: Consolidate Load Balancers (Saves ~$15/month)

Current Setup:

  • 2 Network Load Balancers: ~$8.76/month (but with LCU charges: ~$50-60/month total)

ALB Alternative:

  • 1 Application Load Balancer: ~$35-40/month (with LCU charges)

Savings: ~$15-20/month (but requires redesign)

Trade-offs:

Consolidation Pros:
✓ Single load balancer (simpler)
✓ URL-based routing (future-proof)
✓ Saves $15-20/month

Consolidation Cons:
✗ Requires Ingress controller setup
✗ Requires rewriting service definitions
✗ More complexity for 2 services
✗ Worth it only with 5+ services

Decision: Keep 2 NLBs for now
Re-evaluate when adding 3rd service

Strategy 4: Reduce CloudWatch Retention (Saves ~$0.80/month)

Current Setup:

  • 7-day retention: ~$1.50/month

  • Covers debugging window

Reduced Retention:

  • 3-day retention: ~$0.70/month

  • Still covers most issues

Savings: ~$0.80/month

Trade-offs:

Less savings than you'd think:
- 7 days: $1.50/month
- 3 days: $0.70/month
- Savings: Only $0.80/month

Debugging benefit:
- 7 days: Cover entire week of issues
- 3 days: Miss issues found late in week

Decision: Keep 7 days
Cost savings too minimal to matter

Strategy 5: Reserved Instances (Saves ~$40/month, 1-year commitment)

Current Setup:

  • 2 t2.medium on-demand: $68.62/month

Reserved Instance Alternative:

  • 1-year commitment: ~$350 upfront per instance

  • 2 instances: ~$700 upfront

  • Per-hour cost: ~$0.04 (33% discount)

  • Monthly: ~$29.20

Savings: $39.42/month (but requires $700 upfront)

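Break-even:

  • $700 ÷ $39.42/month = 17.8 months

  • The monthly bill drops from month 1, but the upfront payment takes ~18 months to recover

  • That is longer than the 1-year term, so with these estimates the commitment only pays off if actual RI pricing is better than assumed; check current rates (or a no-upfront option) before buying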

Trade-offs:

RI Pros:
✓ 33% discount
✓ Lock in price (no increase)
✓ Good for stable workloads

RI Cons:
✗ $700 upfront cost
✗ 1-year commitment
✗ Can't change instance type
✗ May have unused capacity

When to use:
  - Production workloads (stable)
  - Confident infrastructure won't change
  - Budget available for upfront cost
  
For us:
  - Not yet (still optimizing setup)
  - Revisit in 6 months when stable
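
Before committing to a Reserved Instance, it helps to check whether the upfront payment is recovered inside the commitment term. A minimal sketch using the estimates from this section (the function name and return shape are illustrative):

```python
def ri_summary(upfront: float, monthly_saving: float,
               term_months: int = 12) -> dict:
    """Compare an upfront commitment against the monthly saving it buys."""
    break_even = upfront / monthly_saving
    return {
        "break_even_months": round(break_even, 1),
        "net_over_term": round(monthly_saving * term_months - upfront, 2),
        "pays_back_within_term": break_even <= term_months,
    }


# Figures from this section: $700 upfront, $39.42/month saved, 1-year term.
summary = ri_summary(upfront=700, monthly_saving=39.42)
print(summary)
```

If `pays_back_within_term` is false, the purchase costs more than staying on-demand for that term, which is a good prompt to re-check the quoted prices.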

Break-Even Analysis

Which optimizations are worth doing?

Optimization Strategy      Savings    Effort    ROI
────────────────────────────────────────────────────
Spot instances             $48/mo     Medium    High ✓
ECS Fargate                $20/mo     High      Medium
Consolidate LBs            $15/mo     High      Low
Reduce logs                $0.80/mo   Low       Very Low
Reserved instances         $40/mo     None      High* (with commitment)
Scheduled scaling          $15/mo     High      Low

* Requires upfront commitment

My Recommendation:

  1. Implement now (Quick wins):

    • Switch 1 node to Spot: -$24/month (low effort)

    • Total savings: $24/month

  2. Implement in 3 months (If stable):

    • Both nodes Spot: -$48/month total

    • Total savings: $48/month (30% reduction)

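  3. Implement in 6 months (If proven stable):

    • 1-year Reserved Instance for one node, Spot for the other: about -$44/month on compute (plus ~$350 upfront)

    • Spot and RI discounts apply to different instances, so they don't stack on the same node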

  4. Not worth doing:

    • Reduce CloudWatch logs (savings too small)

    • Complex scheduling (effort > benefit)

    • Consolidate LBs (wait until 5+ services)

When to Optimize

Here's my philosophy: Don't prematurely optimize.

Development phase (current):
  Goal: Get it working
  Cost: Secondary concern
  Strategy: Use on-demand
  Monthly: $140-160

Early production (6 months in):
  Goal: Prove business model
  Cost: Important but not critical
  Strategy: Add Spot instances
  Monthly: $90-110

Scaling production (12+ months):
  Goal: Maximize profitability
  Cost: Primary concern
  Strategy: Reserved instances, Fargate, optimization
  Monthly: $40-60

Hypergrowth (2+ years in):
  Goal: Minimize per-customer cost
  Cost: Extreme optimization needed
  Strategy: Multi-region, custom infra, internal tools
  Monthly: $0.05-0.10 per customer

Our current phase: Development
Optimize when reaching next phase

Key Takeaways

1. Know Your Costs

  • EC2 nodes and the EKS control plane together are ~90% of your cost

  • EKS control plane + LBs = fixed costs

  • Monitoring/storage = negligible

2. Spot Instances are Gold for Kubernetes

  • 70% cheaper than on-demand

  • Perfect for auto-healing architecture

  • Minimal operational complexity

  • Highest ROI optimization

3. Fargate is Great, But Check the Math

  • Better for <20 pods

  • More expensive at scale (above 100 pods)

  • Better for zero-ops teams

  • Lower cost per pod when small

4. Reserved Instances = 33% Discount

  • Requires commitment

  • Requires stable workload

  • Worth it after 1 year of operation

  • Not worth it while optimizing

5. Don't Optimize for Scale You Don't Have

  • Consolidating 2 LBs saves $15/month

  • But requires architecture change

  • Wait until 5+ services to consolidate

6. Monitoring and Storage are Cheap

  • CloudWatch: $5/month

  • S3: $0.50/month

  • Not worth extreme cost optimization

  • Focus on compute costs (EC2)

Cost Optimization Roadmap

Month 0 (Current):
  Cost: ~$140/month
  Setup: 2 on-demand nodes
  Status: Baseline

Month 3:
  Action: Add 1 Spot node (mixed)
  Cost: ~$110/month
  Savings: $30/month

Month 6:
  Action: Make 2nd node Spot
  Cost: ~$90/month
  Savings: $50/month

Month 12:
  Action: Reserved instances for 1 node
  Action: Keep 1 Spot for flexibility
  Cost: ~$95/month
  Savings: ~$45/month (plus ~$350 upfront for the RI)

At this point, re-evaluate:
  - Consider Fargate if < 20 pods
  - Consider Karpenter for better scaling
  - Consider multi-region if global
  - Consider custom infrastructure if >>$10k/month

Quick Reference: Optimization Options

Spot Instances:
  Cost: $24-48/month savings (20-30% reduction)
  Risk: Low (Kubernetes handles it)
  Effort: Medium (test first)
  Payoff: 1 month
  
ECS Fargate:
  Cost: $20-30/month savings
  Risk: Medium (architectural change)
  Effort: High (full redesign)
  Payoff: 3-6 months
  
Reserved Instances:
  Cost: $40/month savings
  Risk: None (committed cost)
  Effort: None
  Payoff: ~18 months (with $700 upfront)
  
Consolidate LBs:
  Cost: $15/month savings
  Risk: Low (add Ingress controller)
  Effort: Medium (redesign services)
  Payoff: 6+ months (not worth it now)

Part 5: Multi-Region & Disaster Recovery - Preparing for the Worst

The infrastructure we've built scales automatically, recovers from failures, and costs ~$140/month.

But what if AWS us-east-1 (the entire region) goes down?

This part covers preparing for the worst-case scenario: a complete region failure.

The Disaster Recovery Problem

Let's be honest: Complete regional outages are rare.

In AWS history:

  • Major regional outages: rare but real (e.g., us-east-1 in 2012, 2017, and 2021)

  • Minor outages: 2-3 per year per region

  • Frequency: ~0.05-0.1% of the time

But when they happen, they're catastrophic.

If your entire infrastructure goes down:

  • Downtime: 2-6 hours (typical regional recovery time)

  • Data loss: Maybe 0 (depending on strategy)

  • Revenue loss: Potentially 100%

The question: Is multi-region worth the complexity and cost?

Answer: It depends.

Scenario 1: Startup (<$100k revenue/year)
├─ Multi-region cost: +$140/month = $1,680/year
├─ Regional outage impact: ~$1000 revenue lost
├─ Break-even: ~2 regional outages per year (unlikely)
└─ Recommendation: Single region (now), multi-region (later)

Scenario 2: Growth stage ($1M+ revenue/year)
├─ Multi-region cost: +$140/month = ~0.17% of revenue
├─ Regional outage impact: ~$5000+ revenue lost
├─ Break-even: ~1 regional outage every 3 years
└─ Recommendation: Implement multi-region NOW

Scenario 3: Enterprise (>$10M revenue/year)
├─ Multi-region cost: <0.02% of revenue
├─ Regional outage impact: $50k+ lost
├─ Break-even: One outage costs more than ~30 years of multi-region infrastructure
├─ Competitors: Already multi-region
└─ Recommendation: Multi-region + multi-cloud required

For this project (learning/development):
├─ Single region optimal
├─ Multi-region design patterns important (knowledge transfer)
└─ Worth understanding even if not implementing now

Multi-Region Architecture

What Does Multi-Region Mean?

Single Region:
  Everything in us-east-1
  ├─ EKS cluster in us-east-1
  ├─ Database in us-east-1
  ├─ S3 buckets in us-east-1
  └─ If us-east-1 down: SERVICE DOWN

Multi-Region:
  Clusters in multiple regions
  ├─ Primary: us-east-1
  ├─ Secondary: us-west-2
  ├─ Database: Replicated across regions
  ├─ S3: Cross-region replication
  └─ If us-east-1 down: Failover to us-west-2

Basic Multi-Region Setup

┌──────────────────────────────────┐
│     Route 53 (Global DNS)        │
│  Geo-routing + Health checks     │
└─────────────┬────────────────────┘
              │
    ┌─────────┴─────────┐
    │                   │
┌───▼─────────────┐  ┌──▼─────────────┐
│  us-east-1      │  │  us-west-2     │
│                 │  │                │
│ • EKS Cluster   │  │ • EKS Cluster  │
│ • 2 Nodes       │  │ • 2 Nodes      │
│ • LB 1 & 2      │  │ • LB 1 & 2     │
│ • CloudFront    │  │ • CloudFront   │
│   (distributed) │  │   (distributed)│
│ • Database      │  │ • Database     │
│   Primary       │  │   Replica      │
└─────────────────┘  └────────────────┘
    (Primary)          (Secondary)
        │                   │
        └───────────────────┘
        Replicated data sync

Database Replication Strategy

This is the hard part.

Option 1: Active-Passive (Primary-Replica)
├─ Primary: us-east-1 (read/write)
├─ Replica: us-west-2 (read-only, data sync delay)
├─ Failover: Promote the replica to read/write (manual or automated)
├─ RTO: 5-30 minutes
├─ RPO: 0-5 minutes (depends on sync frequency)

Option 2: Active-Active (Multi-Master)
├─ us-east-1: Read/write
├─ us-west-2: Read/write
├─ Replication: Bi-directional
├─ Conflict resolution: Application logic
├─ RTO: ~0-2 minutes (the other region is already serving users)
├─ RPO: Near 0 (replication is usually asynchronous, so the last moments of writes can be lost)
├─ Complexity: High (conflict handling)

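Option 3: Managed replication (RDS Multi-AZ, Aurora Global Database, DynamoDB Global Tables)
├─ Database service: RDS, Aurora, DynamoDB, etc
├─ Multi-AZ: Synchronous replication with automatic failover, but only within one region
├─ Cross-region: Aurora Global Database or DynamoDB Global Tables (asynchronous)
├─ RTO: 30-60 seconds (Multi-AZ failover)
├─ RPO: 0 within a region; near 0 across regions
├─ Cost: Usually 1.5-2x single region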

For stateless APIs (our case):
├─ The database is the only state
├─ If you use DynamoDB: Global Tables = Active-Active
├─ If you use RDS: Cross-region read replicas + promotion on failover
├─ Best: Managed multi-region database

Active-Active vs Active-Passive

The biggest decision: Should both regions serve traffic simultaneously?

Active-Passive (Simpler, Cheaper)

Normal operation:
  us-east-1: Serving 100% of traffic
  us-west-2: Idle (replicating data)
  
Failure (us-east-1 down):
  1. Health check fails
  2. Route 53 detects failure
  3. Failover to us-west-2
  4. Users redirected to us-west-2
  
Timeline:
  T+0s   - Failure occurs
  T+30s  - Health check timeout
  T+60s  - Route 53 fails over
  T+120s - Users can connect to us-west-2
  
Recovery time (RTO): 2 minutes
Data loss (RPO): 0-5 minutes (depending on sync)
Cost: 2x (second region idle most of the time)

Trade-offs:

Active-Passive Pros:
✓ Simple failover logic
✓ No multi-master conflict handling (replication is one-way)
✓ Cheaper than active-active
✓ Easy to test (just test failover)

Active-Passive Cons:
✗ Idle capacity (wasted money)
✗ Failover delay (2-5 minutes)
✗ User experience during failover
✗ Data loss possible (depends on sync)
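
The active-passive failover described above maps onto a pair of Route 53 failover records. The sketch below only builds the change batch that `change_resource_record_sets` accepts; the domain, load balancer hostnames, and health check ID are placeholders, not values from this project.

```python
import json
from typing import Optional


def failover_record(name: str, target: str, role: str,
                    health_check_id: Optional[str] = None,
                    ttl: int = 60) -> dict:
    """One Route 53 failover record set (role is PRIMARY or SECONDARY)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError(f"unknown failover role: {role}")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": role.lower(),
        "Failover": role,
        "TTL": ttl,  # a short TTL limits how long clients cache the old answer
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record


# Placeholder names: substitute your own domain, load balancer DNS names,
# and the ID of a health check that probes the primary region.
change_batch = {
    "Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": failover_record(
            "api.example.com.", "primary-nlb.us-east-1.example.com",
            "PRIMARY",
            health_check_id="00000000-0000-0000-0000-000000000000")},
        {"Action": "UPSERT", "ResourceRecordSet": failover_record(
            "api.example.com.", "secondary-nlb.us-west-2.example.com",
            "SECONDARY")},
    ]
}
print(json.dumps(change_batch, indent=2))
```

With boto3 this would be submitted as `client.change_resource_record_sets(HostedZoneId=zone_id, ChangeBatch=change_batch)`. Route 53 answers with the secondary record only while the primary's health check is failing.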

Active-Active (More Complex, Better)

Normal operation:
  us-east-1: Serving 50% of traffic
  us-west-2: Serving 50% of traffic
  
Failure (us-east-1 down):
  1. Health check fails
  2. Route 53 redirects traffic
  3. us-west-2 absorbs all traffic
  
Timeline:
  T+0s   - Failure occurs
  T+30s  - Health check timeout
  T+60s  - Route 53 rebalances
  T+90s  - us-west-2 at 100% capacity
  
Recovery time (RTO): 1-2 minutes
Data loss (RPO): Near 0 (bi-directional replication, usually asynchronous)
Cost: 2x (but both regions utilized)

Trade-offs:

Active-Active Pros:
✓ No idle capacity (both regions used)
✓ Faster failover
✓ Near-zero data loss (bi-directional sync)
✓ Better load distribution
✓ Natural load balancing

Active-Active Cons:
✗ Complex multi-master replication
✗ Conflict resolution needed
✗ Higher operational complexity
✗ Harder to test
✗ Not all databases support it (DynamoDB Global Tables do, standard RDS does not)

RTO/RPO Metrics

These are the most important DR metrics.

Recovery Time Objective (RTO)

Definition: How long can the system be down before it's unacceptable?

Example: RTO = 1 hour
If system fails at 2:00 PM, it MUST be back by 3:00 PM
If recovery takes until 3:30 PM, SLA is violated

RTO levels:
  < 5 min:  Critical systems (hospitals, airlines)
  5-15 min: Production systems (e-commerce, banking)
  15-60 min: Important business (sales platforms)
  1-4 hours: Can tolerate some downtime
  > 4 hours: Non-critical systems

Our infrastructure:
  Single region: RTO = Region recovery time = 2-6 hours
  With failover: RTO = 1-2 minutes (active-active)
  With manual failover: RTO = 5-30 minutes (active-passive)
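
A DNS-based failover RTO can be estimated from three settings: how often the health check runs, how many consecutive failures mark the endpoint unhealthy, and the DNS TTL. This is a simplified worst-case model, not a guarantee; the defaults (30-second interval, threshold of 3, 60-second TTL) are assumptions you should replace with your own configuration.

```python
def estimated_failover_seconds(check_interval_s: int = 30,
                               failure_threshold: int = 3,
                               dns_ttl_s: int = 60) -> int:
    """Rough worst-case time until clients reach the healthy region.

    Detection takes up to interval x threshold seconds; clients may then keep
    using the cached DNS answer for up to one TTL.
    """
    detection = check_interval_s * failure_threshold
    return detection + dns_ttl_s


standard = estimated_failover_seconds()                  # 30 * 3 + 60 = 150
fast = estimated_failover_seconds(check_interval_s=10)   # 10 * 3 + 60 = 90
print(f"standard checks: {standard}s, fast checks: {fast}s")
```

Faster checks and shorter TTLs bring the estimate down, at the cost of more health check traffic and more DNS queries.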

Recovery Point Objective (RPO)

Definition: How much data can we afford to lose?

Example: RPO = 5 minutes
If system fails, it's acceptable to lose 5 minutes of data
If RPO is violated, we lost more than 5 minutes of changes

RPO levels:
  0 minutes (zero data loss):  Critical (banking, medical)
  0-5 minutes:  Production (most systems)
  5-60 minutes: Business systems
  > 1 hour:     Non-critical

Our infrastructure:
  Single region: RPO = 0 (stateless, no data loss)
  Active-passive: RPO = 0-5 min (depends on replication frequency)
  Active-active: RPO = near 0 (bi-directional replication)

Testing Disaster Recovery

You can't trust a DR plan you haven't tested.

DR Testing Strategy

Level 1: Documentation Review
├─ Read and verify DR procedures exist
├─ Effort: 1 hour
├─ Cost: $0
├─ Confidence: 10%

Level 2: Tabletop Exercise
├─ Simulate failure scenario (on whiteboard)
├─ Walk through recovery steps
├─ Effort: 4 hours
├─ Cost: $0
├─ Confidence: 30%

Level 3: Failover Test
├─ Actually failover to secondary region
├─ Verify everything works
├─ Failback to primary
├─ Effort: 8 hours
├─ Cost: $200 (might get billed for both regions)
├─ Confidence: 85%

Level 4: Full DR Drill
├─ Simulate real regional failure
├─ Run for 2-4 hours
├─ Measure actual RTO/RPO
├─ Effort: 16 hours + follow-up
├─ Cost: Potential service issues
├─ Confidence: 99%

Recommendation:
├─ Do Level 1: Before going live
├─ Do Level 2: Every 6 months
├─ Do Level 3: Every 12 months
├─ Do Level 4: Every 2 years (or quarterly for critical)

Example: Level 3 Failover Test

Step 1: Preparation (1 hour)
├─ Document current state (baseline: all services healthy, no downtime)
├─ Backup database
├─ Notify team
├─ Set 2-hour testing window

Step 2: Initiate Failover (30 minutes)
├─ Failover DNS to us-west-2 (Route 53 change)
├─ Wait for DNS propagation
├─ Verify us-west-2 is serving traffic
└─ Note: T = failover start time

Step 3: Verify (15 minutes)
├─ Test APIs from us-west-2
├─ Check application functionality
├─ Verify data is consistent
├─ Document any issues

Step 4: Measure RTO (calculated)
├─ Actual RTO = (Time traffic fully on us-west-2) - (Failover start)
├─ Goal: < 2 minutes
├─ Actual: Usually 45-90 seconds (DNS propagation)

Step 5: Failback (15 minutes)
├─ Restore DNS to primary (us-east-1)
├─ Verify us-east-1 ready
├─ Failback traffic
└─ Monitor for issues

Step 6: Post-Test (30 minutes)
├─ Document findings
├─ Update procedures if needed
├─ Review what went wrong
├─ Schedule next test
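
Step 4 is easier to report consistently if the RTO is computed from recorded timestamps rather than estimated by eye. A small sketch, with made-up timestamps and the 2-minute goal from the test plan:

```python
from datetime import datetime, timedelta

TARGET_RTO = timedelta(minutes=2)  # goal from Step 4


def measured_rto(failover_start: datetime,
                 traffic_on_secondary: datetime) -> timedelta:
    """Actual RTO = time traffic is fully on us-west-2 minus failover start."""
    if traffic_on_secondary < failover_start:
        raise ValueError("traffic cannot shift before the failover starts")
    return traffic_on_secondary - failover_start


# Example timestamps (illustrative only).
start = datetime(2025, 1, 15, 14, 0, 0)
shifted = datetime(2025, 1, 15, 14, 1, 15)

rto = measured_rto(start, shifted)
print(f"Measured RTO: {rto.total_seconds():.0f}s, "
      f"target met: {rto <= TARGET_RTO}")
```

Logging both timestamps during each drill also gives you a trend to compare across tests.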

Implementation Timeline

Phase 1: Planning (Month 1)

Cost: $0 (planning only)
├─ Document current infrastructure
├─ Define RTO/RPO requirements
├─ Choose replication strategy
├─ Design failover procedures
└─ Get stakeholder approval

Phase 2: Infrastructure (Month 2-3)

Cost: +$140/month (duplicate infrastructure)
├─ Set up secondary region (us-west-2)
├─ Replicate EKS cluster
├─ Set up database replication
├─ Configure Route 53
├─ Set up monitoring across regions

Phase 3: Testing (Month 4)

Cost: +$140/month (ongoing), +$500 (test)
├─ Level 1: Documentation review
├─ Level 2: Tabletop exercise
├─ Level 3: Failover test (live)
├─ Fix issues found during testing
├─ Document procedures

Phase 4: Operations (Month 5+)

Cost: +$140/month (ongoing)
├─ Regular testing (Level 2 every 6 months)
├─ Monitor replication lag
├─ Update procedures as needed
├─ Handle actual failures (if any)
└─ Continuously improve RTO/RPO

Cost-Benefit Analysis

Multi-Region Costs (Annual):
├─ Additional infrastructure: $140 × 12 = $1,680/year
├─ Operational overhead: ~$5,000/year (extra work)
└─ Total: ~$6,680/year

Potential Losses (per outage):
├─ Critical system ($10M revenue): $50,000-100,000
├─ Production system ($1M revenue): $5,000-10,000
├─ Startup ($100k revenue): $500-1,000
└─ Development environment: $0 (acceptable downtime)

Break-even analysis:
├─ If 1 outage/year: $6,680 cost vs $50,000+ loss = Worth it
├─ If 1 outage/2 years: Cost accumulates, harder to justify
├─ If 1 outage/5 years: Probably not worth it

Decision matrix:
  Revenue < $500k/year:  Skip multi-region for now
  Revenue $500k-$5M:    Implement multi-region when stable
  Revenue > $5M:         Implement multi-region immediately
  Critical systems:       Always implement multi-region
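
One way to read the decision matrix is to ask how many regional outages per year it takes for multi-region to pay for itself. The sketch below uses the annual cost from the analysis above and the upper end of each loss range; the function name is illustrative.

```python
# Annual multi-region cost from the analysis above:
# $140/month infrastructure plus ~$5,000/year operational overhead.
ANNUAL_COST = 140 * 12 + 5000


def break_even_outages_per_year(loss_per_outage: float,
                                annual_cost: float = ANNUAL_COST) -> float:
    """Outages per year at which avoided losses equal the multi-region cost."""
    return annual_cost / loss_per_outage


# Upper end of each loss range from the analysis above.
for label, loss in [("Startup", 1_000), ("Production", 10_000),
                    ("Critical", 100_000)]:
    rate = break_even_outages_per_year(loss)
    print(f"{label:<10} breaks even at {rate:.2f} outages/year")
```

A value well above 1 means outages would have to be implausibly frequent to justify the spend; a value far below 1 means a single outage every few years already covers it.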

Key Takeaways

1. RTO/RPO Must Be Defined Upfront

  • Different for different businesses

  • Drives architecture decisions

  • Not afterthought, core requirement

2. Multi-Region is Expensive

  • Doubles infrastructure cost

  • Adds operational complexity

  • Only worth it if revenue justifies it

3. Test Your DR Plan

  • Plans that haven't been tested fail

  • Failover tests catch hidden issues

  • Regular testing keeps team sharp

4. Choose Active-Active for Critical Systems

  • Near-zero downtime

  • Near-zero data loss

  • But requires multi-master database

5. Choose Active-Passive for Cost Optimization

  • Idle capacity is wasted money

  • Acceptable for non-critical systems

  • Simpler to manage

6. Stateless Design Simplifies DR

  • No special data handling needed

  • Replication just needs database

  • Failover is mostly DNS change

Quick Reference: Multi-Region Options

Active-Passive:
  RTO: 2-5 minutes (DNS propagation delay)
  RPO: 0-5 minutes (replication lag)
  Cost: 2x infrastructure + low ops
  Complexity: Low
  Best for: Cost-sensitive, acceptable downtime
  
Active-Active:
  RTO: 1-2 minutes (health check + DNS)
  RPO: Near 0 (bi-directional replication)
  Cost: 2x infrastructure + high ops
  Complexity: High (conflict resolution)
  Best for: Critical systems, zero downtime required
  
Managed (DynamoDB Global Tables):
  RTO: Immediate (both regions active)
  RPO: Near 0 (managed, asynchronous replication)
  Cost: 1.5-2x database cost
  Complexity: Low (AWS handles it)
  Best for: Greenfield projects, DynamoDB users

GitHub Actions Notifications on Slack with CI Pipeline Visibility

When building production APIs, monitoring your CI/CD pipeline is critical. Integrate GitHub Actions with Slack to get real-time notifications of build successes, failures, and deployments. This includes visibility into:

  • Build status for each commit/PR

  • Deployment stages (test, staging, production)

  • Failed workflow runs with error logs

  • Passing test suites

  • Branch protection rule validations

Implementation details: Use the official GitHub Slack app or the slackapi/slack-github-action action to post workflow results directly to a dedicated Slack channel. Include commit details, authors, and direct links to failing tests.

CloudWatch Notifications via Slack and Email

AWS CloudWatch monitors application metrics and logs. Setting up notifications ensures your team is immediately aware of issues:

  • Performance degradation (high latency, error rates)

  • Resource utilization alerts (CPU, memory, database connections)

  • Custom metric thresholds

  • Log pattern matching (e.g., exceptions, failed requests)

Implementation details: Create CloudWatch alarms that trigger SNS (Simple Notification Service) topics, which then notify both Slack (via a Lambda function or webhook) and email subscriptions. This dual-channel approach ensures critical alerts aren't missed.
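
The Lambda step in that chain can be quite small. The sketch below assumes the function is subscribed to the SNS topic and that the Slack incoming-webhook URL is stored in an environment variable named SLACK_WEBHOOK_URL (a hypothetical name); the message layout is only one possible format.

```python
import json
import os
import urllib.request


def slack_payload(sns_event: dict) -> dict:
    """Turn a CloudWatch alarm delivered through SNS into a Slack message."""
    # SNS wraps the alarm as a JSON string inside the event record.
    alarm = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    state = alarm["NewStateValue"]
    icon = ":red_circle:" if state == "ALARM" else ":large_green_circle:"
    return {
        "text": f'{icon} {alarm["AlarmName"]} is {state}\n'
                f'{alarm["NewStateReason"]}'
    }


def handler(event, context):
    """Lambda entry point: post the formatted alarm to a Slack webhook."""
    body = json.dumps(slack_payload(event)).encode("utf-8")
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],  # assumed environment variable
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return {"status": response.status}
```

Email needs no code at all: add an email subscription to the same SNS topic and both channels receive every alarm.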

CloudWatch Dashboard

A dedicated CloudWatch dashboard provides a single pane of glass for monitoring your API's health. Include:

  • Request count and latency metrics

  • Error rate trends

  • Database query performance

  • API endpoint-specific metrics

  • System resource utilization graphs

  • Custom business metrics (if applicable)

Implementation details: Create a customized dashboard with widgets for each metric, set appropriate time ranges (last hour/day/week), and enable auto-refresh. Share this dashboard with your team or embed it in internal monitoring pages.

S3 Bucket for Logs with Retention Policy

Centralized log storage is essential for compliance, debugging, and auditing. Store logs from both CloudWatch and GitHub Actions:

  • CloudWatch Logs exported to S3 for long-term archival

  • GitHub Actions workflow logs and artifacts

  • Application error logs and request traces

  • Audit logs for access and changes

Implementation details: Create an S3 bucket with:

  • Versioning enabled (optional, for audit trails)

  • Server-side encryption (SSE-S3 or KMS)

  • Lifecycle policies for automatic transitions to cheaper storage (e.g., Glacier after 90 days)

  • Set retention periods (e.g., delete logs after 2 years or per compliance requirements)

  • Enable access logging to monitor who accesses the bucket

  • Block public access by default

This helps meet retention and audit requirements (for example under GDPR, HIPAA, or SOC 2) while keeping storage costs under control.
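
The lifecycle part of that setup can be sketched as the configuration document S3 expects. The 90-day and 2-year values mirror the examples above; the rule ID and the helper name are made up for illustration.

```python
import json


def log_lifecycle(prefix: str = "", glacier_after_days: int = 90,
                  expire_after_days: int = 730) -> dict:
    """Lifecycle rules: move logs to Glacier, then delete them."""
    if expire_after_days <= glacier_after_days:
        raise ValueError("logs must expire after they are archived")
    return {
        "Rules": [{
            "ID": "archive-then-expire-logs",  # illustrative rule name
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},      # empty prefix = whole bucket
            "Transitions": [
                {"Days": glacier_after_days, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": expire_after_days},
        }]
    }


config = log_lifecycle()
print(json.dumps(config, indent=2))
```

With boto3, this dictionary would be applied via `client.put_bucket_lifecycle_configuration(Bucket=bucket_name, LifecycleConfiguration=config)`. Adjust the expiration to whatever your retention policy actually requires.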

Acquiring a Domain

A custom domain provides a professional identity for your API and is often required for production deployments:

  • Use the domain for your API endpoints (e.g., api.yourdomain.com)

  • SSL/TLS certificates for HTTPS (using AWS Certificate Manager or Let's Encrypt)

  • DNS records pointing to your API (Route 53 for AWS-hosted APIs)

  • Email records (MX records) for transactional emails if needed

  • Subdomain routing to different services (API, docs, dashboard)

Implementation details: Register a domain via Route 53, GoDaddy, Namecheap, or similar. Set up DNS records to point to your API Gateway/load balancer. Obtain an SSL certificate and enforce HTTPS. Consider using a friendly URL for API documentation (e.g., docs.yourdomain.com).

Conclusion

We've covered the complete system design of a production-grade API infrastructure:

  1. Part 1: Why we chose EKS, NLBs, and CloudWatch

  2. Part 2: How autoscaling and resilience work

  3. Part 3: Monitoring strategy to prevent alert fatigue

  4. Part 4: Cost optimizations to reduce expenses

  5. Part 5: Multi-region DR for business continuity

The key principle throughout: Choose simple solutions that meet your needs, then optimize later as the business grows.

This infrastructure could handle 10x traffic automatically, recover from failures in seconds, and cost only $140/month. That's excellent value.

But it's built on understanding the trade-offs: Why 70% CPU threshold? Why 2 nodes minimum? Why CloudWatch instead of Prometheus? Every decision was deliberate and documented.