GSLB with Cloudflare

Building a Production-Grade Global Server Load Balancer (GSLB) with Cloudflare.
This article documents my hands-on implementation and comparative analysis of Cloudflare's GSLB solution versus traditional hardware load balancers (A10 Networks).
The Problem: Geographic Distribution at Scale
Our e-government platform operates from three data centers with 45+ physical servers handling millions of daily requests. Without intelligent traffic distribution, we observed:
High latency for remote users: Citizens in one region experiencing 300ms+ response times when routed to distant data centers
Poor failover: Manual DNS changes taking 5-10 minutes during datacenter outages
Uneven load distribution: Primary datacenter handling 70% of traffic despite having only 40% of capacity
Limited visibility: Lack of real-time health monitoring across geographically distributed endpoints
What is GSLB, Really?
GSLB extends traditional load balancing beyond a single datacenter by:
Geographic traffic steering: Routing users to the nearest healthy datacenter based on latency or proximity
Intelligent health monitoring: Continuously probing endpoints and removing unhealthy ones from rotation
Automatic failover: Seamlessly redirecting traffic when entire datacenters go offline
Capacity-aware routing: Distributing load based on datacenter capacity and current utilization
Think of it as "load balancing for datacenters" rather than servers.
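The four behaviours above can be condensed into one small selection routine. The following is a minimal Python sketch, not the algorithm Cloudflare or A10 actually uses; the `Datacenter` type, its field names, and the sample RTT and capacity values are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Datacenter:
    name: str
    healthy: bool    # outcome of the most recent health probes
    rtt_ms: float    # round-trip time seen from the user's region
    capacity: int    # max concurrent requests (hypothetical unit)
    active: int      # requests currently in flight


def pick_datacenter(dcs: list[Datacenter]) -> Optional[Datacenter]:
    """Return the lowest-latency datacenter that is healthy and has headroom."""
    candidates = [d for d in dcs if d.healthy and d.active < d.capacity]
    if not candidates:
        return None  # nothing can serve traffic right now
    return min(candidates, key=lambda d: d.rtt_ms)


dcs = [
    Datacenter("kigali", healthy=False, rtt_ms=15, capacity=400, active=120),
    Datacenter("frankfurt", healthy=True, rtt_ms=140, capacity=300, active=90),
    Datacenter("singapore", healthy=True, rtt_ms=210, capacity=300, active=40),
]
choice = pick_datacenter(dcs)
print(choice.name if choice else "no datacenter available")
```

Here Kigali has the best RTT but is marked unhealthy, so the routine falls through to the next-best healthy site, which is the automatic failover behaviour described above.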
Architecture: Traditional vs Cloudflare Approach
Traditional GSLB (What We Had with A10)
Components:
Hardware load balancers in each datacenter (A10 Thunder)
DNS-based traffic steering
Centralized management console
Health checks via ICMP, TCP, or HTTP probes
Flow:
User queries DNS for services.gov.rw
A10 GSLB checks:
User's geographic location (via DNS resolver IP)
Datacenter health status
Current load metrics
Returns IP of "best" datacenter
User connects directly to that datacenter's load balancer
Limitations I discovered:
Expensive hardware + licensing ($50K+ per datacenter)
Complex BGP configuration for anycast DNS
Single point of failure without redundant controllers
Health check latency (20-30 second intervals)
Limited global reach (our DNS servers only in 3 locations)
Cloudflare GSLB Architecture
Components:
Cloudflare's 330+ edge locations (global anycast network)
DNS + Layer 7 load balancing combined
Health monitors running from multiple regions
Optional Cloudflare Tunnel for private endpoint connectivity
Flow:
User queries DNS for services.gov.rw
Anycast routes query to nearest Cloudflare datacenter (sub-50ms for 95% of users)
Cloudflare checks:
Endpoint health data (updated every 60s from multiple regions)
Dynamic steering policies (proximity, latency, geo-steering)
Pool weights and priorities
Returns optimal datacenter IP OR proxies request through edge
Key advantages:
No hardware to buy/maintain
Global anycast network included
Sub-second rerouting once an endpoint is marked unhealthy (in proxied mode)
Built-in DDoS protection
Combined GTM + local load balancing in single config
Implementation: The Proof of Concept
Scope & Objectives
Goals:
Compare failover speed: Cloudflare vs A10
Measure latency improvements with geo-steering
Test health monitoring accuracy
Validate production readiness for public-facing services
Test Environment:
3 origin servers (Kigali, Frankfurt, Singapore)
Each running identical Nginx instances
Cloudflare Load Balancer in front
A10 Thunder GSLB as baseline
Configuration
1. Health Monitors
Created HTTPS health monitors with strict validation:
# Cloudflare Health Monitor Config
type: https
port: 443
path: /health
interval: 60 # seconds
timeout: 5
retries: 2
expected_codes: 200
expected_body: '{"status":"healthy"}'
headers:
  Host: "services.gov.rw"
Why this matters: The expected_body check ensures we're not just getting 200 responses from a misconfigured server, but actual valid responses from our application.
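To make the difference between a status-only check and a body-validated check concrete, here is a minimal Python sketch. It is an illustration of the idea, not Cloudflare's matching logic (which may differ, for example by matching a substring rather than parsing JSON); the function name `probe_is_healthy` and its parameters are hypothetical.

```python
import json


def probe_is_healthy(status_code: int, body: str,
                     expected_code: int = 200,
                     expected_status: str = "healthy") -> bool:
    """Pass only if the status code matches AND the body reports a healthy app."""
    if status_code != expected_code:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # e.g. a default HTML page served with a 200
    return isinstance(payload, dict) and payload.get("status") == expected_status


# A misconfigured server answering 200 with a placeholder page fails the check.
print(probe_is_healthy(200, "<html>Welcome to nginx!</html>"))
print(probe_is_healthy(200, '{"status":"healthy"}'))
```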
I also configured health checks from multiple Cloudflare regions:
Western Europe (for Frankfurt origin)
Eastern Africa (for Kigali origin)
Southeast Asia (for Singapore origin)
This provides geographic redundancy—if networking fails between one Cloudflare region and an origin, other regions can still validate health.
2. Endpoint Pools
Configured three geographic pools with weighted distribution:
# Pool 1: Africa Region
name: "africa-pool"
endpoints:
  - address: "origin-kigali.internal"
    weight: 0.4
    enabled: true
origins: 4 servers
health_threshold: 2 # Need 2/4 healthy to serve traffic
# Pool 2: Europe Region
name: "europe-pool"
endpoints:
  - address: "origin-frankfurt.internal"
    weight: 0.3
    enabled: true
origins: 3 servers
health_threshold: 2
# Pool 3: Asia-Pacific Region
name: "apac-pool"
endpoints:
  - address: "origin-singapore.internal"
    weight: 0.3
    enabled: true
origins: 3 servers
health_threshold: 2
Weight calculation: With unequal capacity, I set weights proportional to total pool capacity:
Africa: 4 servers = 0.4 weight
Europe: 3 servers = 0.3 weight
APAC: 3 servers = 0.3 weight
Total = 1.0, giving Africa 40% of overflow traffic when all pools are healthy.
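The weight calculation is plain proportional arithmetic. A minimal Python sketch, assuming every server contributes equal capacity (the helper name `capacity_weights` is hypothetical):

```python
def capacity_weights(servers_per_pool: dict[str, int]) -> dict[str, float]:
    """Weight each pool by its share of the total server count."""
    total = sum(servers_per_pool.values())
    if total == 0:
        raise ValueError("at least one server is required")
    return {pool: count / total for pool, count in servers_per_pool.items()}


weights = capacity_weights({"africa-pool": 4, "europe-pool": 3, "apac-pool": 3})
for pool, weight in weights.items():
    print(f"{pool}: {weight:.1f}")
```

If servers differ in size, the same function works with any capacity unit (cores, benchmarked requests per second) in place of the server count.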
3. Traffic Steering
Tested three steering methods:
A. Dynamic Steering (Latency-Based)
Measures round-trip time (RTT) to each pool
Automatically selects lowest latency pool
Builds RTT profile over time
B. Proximity Steering (Geographic)
Routes based on physical distance
Used GPS coordinates for each datacenter
Falls back to geographic region matching
C. Geo Steering (Regional Policies)
Explicit rules: "African users → africa-pool"
Compliance-friendly for data residency
Failover to next-closest region if primary unavailable
For production, I chose Dynamic Steering because:
More accurate than pure geographic distance
Adapts to network conditions automatically
Accounts for backbone peering differences
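The gap between proximity and dynamic steering can be shown with a short Python sketch. The coordinates are approximate city locations and the RTT figures are invented for illustration, not measurements from this PoC; the function names are hypothetical, and neither function is Cloudflare's implementation.

```python
import math

# Approximate (latitude, longitude) per pool, for illustration only.
POOLS = {
    "africa-pool": (-1.95, 30.06),   # Kigali
    "europe-pool": (50.11, 8.68),    # Frankfurt
    "apac-pool": (1.35, 103.82),     # Singapore
}


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def proximity_choice(user):
    """Proximity steering: pick the physically closest pool."""
    return min(POOLS, key=lambda pool: haversine_km(user, POOLS[pool]))


def dynamic_choice(rtt_ms):
    """Dynamic steering: pick the pool with the lowest measured RTT."""
    return min(rtt_ms, key=rtt_ms.get)


lagos = (6.52, 3.38)
made_up_rtt = {"africa-pool": 180, "europe-pool": 110, "apac-pool": 320}
print(proximity_choice(lagos), dynamic_choice(made_up_rtt))
```

With these made-up numbers a Lagos user is geographically closest to Kigali, yet the lower RTT is to Frankfurt, which is the kind of peering effect that distance alone cannot capture.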
Testing Methodology
Test 1: Normal Operation Performance
Metrics measured:
DNS resolution time
Time to first byte (TTFB)
End-to-end request latency
Test locations:
Kigali, Rwanda (local)
Lagos, Nigeria (regional)
London, UK (international)
Mumbai, India (international)
Method:
# DNS timing
dig services.gov.rw @1.1.1.1 | grep "Query time"
# Full request timing
curl -w "@curl-format.txt" -o /dev/null -s https://services.gov.rw/api/v1/test
# curl-format.txt contents:
# time_namelookup: %{time_namelookup}
# time_connect: %{time_connect}
# time_starttransfer: %{time_starttransfer}
# time_total: %{time_total}
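A single `dig` run is noisy, so repeated samples are more useful than one reading. The sketch below is one possible way to summarize them in Python, assuming the `dig` output has been captured line by line (for example from a loop redirected to a file); the helper name is hypothetical.

```python
import re
import statistics

# dig prints a line such as ";; Query time: 12 msec"
QUERY_TIME = re.compile(r"Query time:\s*(\d+)\s*msec")


def median_query_time_ms(dig_output_lines):
    """Extract every 'Query time' value and return the median in milliseconds."""
    samples = [int(m.group(1)) for line in dig_output_lines
               if (m := QUERY_TIME.search(line))]
    if not samples:
        raise ValueError("no 'Query time' lines found")
    return statistics.median(samples)


captured = [
    ";; Query time: 12 msec",
    ";; SERVER: 1.1.1.1#53(1.1.1.1)",
    ";; Query time: 30 msec",
    ";; Query time: 14 msec",
]
print(median_query_time_ms(captured))
```

The median is used rather than the mean so that one slow, uncached lookup does not dominate the result.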
Test 2: Failover Scenarios
Scenario A: Single Server Failure
Stopped Nginx on 1 server in africa-pool
Measured: Time until Cloudflare marks it unhealthy
Measured: Impact on user requests
Scenario B: Entire Datacenter Failure
Simulated by dropping all traffic to Kigali datacenter
Measured: Failover time to Frankfurt/Singapore
Monitored: User experience during transition
Scenario C: Datacenter Recovery
Brought Kigali back online
Measured: Time until traffic returns
Checked: Load distribution after recovery
Test 3: A10 Comparison
Configured identical setup on A10 Thunder GSLB:
Same 3 datacenters
Same health check intervals (60s)
Same geo-steering policies
Tested same failure scenarios and measured differences.
Results: The Numbers Don't Lie
Performance (Normal Operations)
| Location | Cloudflare DNS | A10 DNS | Cloudflare TTFB | A10 TTFB |
|---|---|---|---|---|
| Kigali | 12ms | 45ms | 67ms | 85ms |
| Lagos | 18ms | 52ms | 94ms | 145ms |
| London | 8ms | 38ms | 89ms | 127ms |
| Mumbai | 22ms | 68ms | 178ms | 245ms |
Key findings:
Cloudflare DNS 65-79% faster than A10 in the table above (anycast advantage)
TTFB improvements of 21-35% across all four test locations
Dynamic steering outperformed static geo-steering by 12-18ms on average
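The percentage improvements can be recomputed directly from the table. A small Python sketch, using only the figures listed above:

```python
# (cloudflare_dns, a10_dns, cloudflare_ttfb, a10_ttfb) in ms, from the table
RESULTS = {
    "Kigali": (12, 45, 67, 85),
    "Lagos": (18, 52, 94, 145),
    "London": (8, 38, 89, 127),
    "Mumbai": (22, 68, 178, 245),
}


def improvement_pct(new_ms, old_ms):
    """Relative reduction of new_ms versus old_ms, in percent."""
    return (old_ms - new_ms) / old_ms * 100


for city, (cf_dns, a10_dns, cf_ttfb, a10_ttfb) in RESULTS.items():
    print(f"{city}: DNS {improvement_pct(cf_dns, a10_dns):.0f}% faster, "
          f"TTFB {improvement_pct(cf_ttfb, a10_ttfb):.0f}% faster")
```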
Failover Performance
| Scenario | Cloudflare | A10 | Difference |
|---|---|---|---|
| Single server failure detected | 65s | 90s | 28% faster |
| Traffic rerouted | <1s | 8-12s | ~10x faster |
| Full datacenter failover | 75s | 120s | 37% faster |
| User-facing downtime | <5s | 30-45s | 83% reduction |
Critical insight: Cloudflare's distributed health checking meant multiple edge locations detected the failure simultaneously. With A10, only our primary DNS servers noticed, creating a bottleneck.
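Detection time is bounded by the monitor settings. The following Python sketch is a deliberately simple model: it assumes a single probe source, an origin that silently drops packets (so every attempt runs to the full timeout), and retries that fire immediately after each timeout. Real detection also depends on how many regions must agree and on propagation, which this model ignores.

```python
def worst_case_detection_s(interval_s: float, timeout_s: float, retries: int) -> float:
    """Failure happens right after a probe: wait a full interval, then time out."""
    return interval_s + timeout_s * (1 + retries)


def best_case_detection_s(timeout_s: float, retries: int) -> float:
    """Failure happens just before a scheduled probe."""
    return timeout_s * (1 + retries)


# Settings from the health monitor config above: 60 s interval, 5 s timeout, 2 retries.
print(best_case_detection_s(5, 2), worst_case_detection_s(60, 5, 2))
```

Under these assumptions the window runs from 15 s to 75 s, the same range as the detection times reported in the table above, though that agreement may be coincidental.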
Load Distribution Accuracy
With Dynamic Steering enabled:
Before (Manual DNS):
Kigali: 72% of traffic (overloaded)
Frankfurt: 18%
Singapore: 10%
With Cloudflare Dynamic Steering:
Kigali: 42% (closer to 40% capacity)
Frankfurt: 31% (closer to 30% capacity)
Singapore: 27% (closer to 30% capacity)
Variance: ±2-3% during normal operations, which suggests the resulting distribution tracks pool capacity closely.
Real-World Findings & Gotchas
1. Health Check Paths Matter
Mistake I made: Initially used / as health check path.
Problem: Nginx returned 200 even when application backends were down (cached responses).
Solution: Created dedicated /health endpoint that:
Checks database connectivity
Validates backend service status
Returns JSON with detailed health data
Lesson: Always use application-aware health checks, not just "is the web server responding?"
2. TTL vs Failover Speed
Challenge: DNS caching delays failover.
Cloudflare advantage: When using proxied (orange cloud) mode, DNS TTL doesn't matter—Cloudflare handles backend failover without DNS changes.
For DNS-only mode: Set TTL to 30-60 seconds (Cloudflare minimum: 30s). Lower TTL = faster failover but more DNS queries.
Our choice: Proxied mode for critical services, DNS-only for internal tools.
3. Session Affinity Complexity
Requirement: Users need to stick to same datacenter for session persistence.
Solution: Enabled cookie-based session affinity with 24-hour TTL.
Gotcha: Failover breaks sessions. Users get new cookies after datacenter failure.
Mitigation:
Implemented session replication between datacenters
Added graceful endpoint draining (30 min TTL) for maintenance
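The failover gotcha is easy to see in code. This is a minimal Python sketch of cookie-based affinity, not Cloudflare's implementation; the function name, the pool names in the example, and the fallback order are assumptions.

```python
def route_with_affinity(cookie_pool, healthy_pools, preferred_order):
    """Return (pool, affinity_kept).

    cookie_pool is the pool named in the user's affinity cookie, or None.
    """
    if cookie_pool in healthy_pools:
        return cookie_pool, True
    for pool in preferred_order:
        if pool in healthy_pools:
            # The pinned pool is gone (or there was no cookie): a new cookie
            # is issued, and any session state local to the old pool is lost.
            return pool, False
    raise RuntimeError("no healthy pool available")


order = ["africa-pool", "europe-pool", "apac-pool"]
print(route_with_affinity("africa-pool", {"africa-pool", "europe-pool"}, order))
print(route_with_affinity("africa-pool", {"europe-pool", "apac-pool"}, order))
```

The second call is the failover case: the request still succeeds, but only session replication between datacenters lets the user keep their session.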
4. Cost Comparison
A10 Hardware GSLB (3 years):
Hardware: $150K (3x $50K appliances)
Annual support: $45K/year
Total 3-year TCO: $285K
Cloudflare Load Balancing:
Load Balancing subscription: $50/month base
Per-endpoint fee: $5/endpoint/month (10 endpoints)
Health checks: Included
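Total 3-year TCO: $9,800 (the two line items above come to $100/month, or $3,600 over 36 months; the remainder is not itemized here)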
Savings: ~$275K over 3 years (96% reduction)
Plus: No hardware refresh cycles, no datacenter rack space, no power/cooling costs.
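The cost comparison can be checked with a few lines of Python. The sketch uses only the figures listed above; note that the two itemized Cloudflare fees alone sum to less than the quoted three-year total, and the difference is not broken down in this article.

```python
MONTHS = 36

a10_tco = 3 * 50_000 + 3 * 45_000       # three appliances + three years of support
cf_itemized = (50 + 5 * 10) * MONTHS    # base fee + 10 endpoints at $5, per month
cf_stated = 9_800                       # three-year total quoted above


def reduction_pct(new_cost, old_cost):
    """Cost reduction of new_cost relative to old_cost, in percent."""
    return (old_cost - new_cost) / old_cost * 100


print(f"A10 3-year TCO: ${a10_tco:,}")
print(f"Cloudflare itemized fees: ${cf_itemized:,}")
print(f"Reduction vs quoted total: {reduction_pct(cf_stated, a10_tco):.1f}%")
print(f"Reduction vs itemized fees: {reduction_pct(cf_itemized, a10_tco):.1f}%")
```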
Production Deployment: What We Did
After the PoC, we deployed Cloudflare for our public-facing government services portal:
Phase 1: DNS Migration
Moved NS records to Cloudflare
Configured load balancer for services.gov.rw
Set up failover pools
Result: DNS query time dropped from 45ms to 12ms
Phase 2: Health Monitoring
Created endpoints for all public services
Configured HTTPS health checks every 60 seconds
Set up email/Slack alerts for unhealthy origins
Result: Detected and resolved 3 issues before users noticed
Phase 3: Dynamic Steering
Enabled latency-based steering
Monitored for 2 weeks
Validated traffic distribution matched capacity
Result: Load balanced within 3% of target distribution
Phase 4: Failover Testing
Scheduled maintenance window
Performed controlled datacenter failover
Monitored user experience
Result: Zero user-reported issues during planned outage
Key Learnings
What Worked Well
Anycast is a game-changer: Having DNS responses come from 330+ locations vs 3 = massive latency win
Unified platform: Combining DNS, load balancing, DDoS protection, and CDN in one platform simplified operations
Health check frequency: 60-second checks vs 5-minute checks (our A10 config) caught issues 5x faster
No hardware maintenance: Eliminating hardware refresh cycles freed up our team
What I'd Do Differently
Start with DNS-only mode: We jumped to proxied mode immediately. DNS-only would've been a safer first step.
More granular pools: Rather than "africa-pool", I'd create per-datacenter pools for finer control
Custom alerting earlier: Waited too long to set up PagerDuty integration for health check failures
Load testing: Should've done higher-load tests before production cutover
Advanced Configurations Worth Exploring
1. Least Outstanding Requests (LORS) Steering
For endpoints with varying request processing times:
steering_policy: least_outstanding_requests
# Routes to endpoint with fewest active connections
# Useful for: Background job processors, video encoding, ML inference
When to use: When some requests take 10x longer than others (video processing, report generation).
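The idea behind least-outstanding-requests selection fits in a few lines. The class below is a minimal Python sketch of the technique under simplifying assumptions (a single balancer process, exact in-flight counts); it is not Cloudflare's implementation, and the class and method names are hypothetical.

```python
class LeastOutstandingBalancer:
    """Send each new request to the endpoint with the fewest requests in flight."""

    def __init__(self, endpoints):
        self.in_flight = {name: 0 for name in endpoints}

    def acquire(self) -> str:
        # Ties are broken by name so the choice is deterministic.
        name = min(self.in_flight, key=lambda n: (self.in_flight[n], n))
        self.in_flight[name] += 1
        return name

    def release(self, name: str) -> None:
        """Call when the request finishes, however long it took."""
        self.in_flight[name] -= 1


lb = LeastOutstandingBalancer(["encoder-a", "encoder-b"])
slow = lb.acquire()    # long-running job lands on encoder-a
fast = lb.acquire()    # next request goes to encoder-b
lb.release(fast)       # the short request finishes first
print(lb.acquire())    # encoder-b again, because encoder-a is still busy
```

Unlike round robin, the endpoint stuck on a slow request stops receiving new work until it catches up.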
2. Cloudflare Tunnel Integration
Connect private endpoints without exposing them to the internet:
# Install cloudflared on origin server
cloudflared tunnel create gov-rw-backend
# Route traffic through tunnel
cloudflared tunnel route dns gov-rw-backend backend.internal.gov.rw
Benefit: Zero inbound firewall rules. All connectivity outbound from origin.
3. Session Affinity with Header-Based Routing
For API services needing consistent routing:
session_affinity: header
affinity_ttl: 3600
header: X-User-ID
# Routes same user to same origin for 1 hour
4. Custom Rules for Advanced Steering
Route specific traffic types to specific pools:
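// Pseudocode showing the intent only; actual Cloudflare rules pair a match
// expression with pool overrides rather than imperative assignments.
// POST requests to dedicated write pool
if (http.request.method == "POST") {
  cf.load_balancing.pool = "write-pool";
}
// Large file uploads to high-bandwidth pool
if (http.request.uri.path.startsWith("/upload")) {
  cf.load_balancing.pool = "upload-pool";
}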
Monitoring & Observability
Metrics We Track
Load Balancer Health:
Requests per second per pool
Error rate by origin
Average response time by geographic region
Failover events (count & duration)
Health Check Status:
Probe success rate
Time to detect failures
Time to recover
Business Impact:
Service availability (by region)
User-facing latency (P50, P95, P99)
Geographic distribution of users
Dashboards & Alerts
Grafana Dashboard:
Real-time health check status
Request distribution across pools
Latency heatmap by region
Historical failover events
PagerDuty Alerts:
Critical: Entire pool unhealthy
Warning: <50% of endpoints healthy
Info: Single endpoint failure
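The three alert tiers map onto a small classification function. This Python sketch is one possible reading of the tiers above: treating every partial failure with at least half the endpoints still healthy as "info" is an assumption, and the function name is hypothetical.

```python
from typing import Optional


def alert_severity(healthy: int, total: int) -> Optional[str]:
    """Classify a pool's state into the alert tiers; None means no alert."""
    if total <= 0 or not 0 <= healthy <= total:
        raise ValueError("expected 0 <= healthy <= total and total > 0")
    if healthy == total:
        return None
    if healthy == 0:
        return "critical"   # entire pool unhealthy
    if healthy / total < 0.5:
        return "warning"    # fewer than half the endpoints are healthy
    return "info"           # isolated endpoint failure


for healthy in (4, 3, 1, 0):
    print(healthy, alert_severity(healthy, 4))
```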
Recommendations for Your GSLB Implementation
Start Simple
Begin with 2 datacenters: Don't optimize for 10 locations on day 1
DNS-only mode first: Get comfortable before proxying traffic
Simple health checks: ICMP or TCP before HTTP/HTTPS
Geographic steering: Easier to reason about than dynamic
Graduate to Advanced
Add dynamic steering: Once you trust health checks
Enable session affinity: When you understand traffic patterns
Custom rules: For specific use cases
LORS steering: When you have heterogeneous workloads
Production Readiness Checklist
[ ] Health checks from multiple regions configured
[ ] Alerting setup for unhealthy endpoints
[ ] Tested manual failover procedures
[ ] Session affinity configured if needed
[ ] Documented runbooks for common scenarios
[ ] Load tested each pool individually
[ ] Validated geographic traffic distribution
[ ] Established baseline metrics
Conclusion
GSLB isn't just "DNS with extras"—it's a fundamental shift in how we think about availability and performance for distributed systems. The combination of:
Anycast networking (sub-50ms DNS for 95% of users)
Distributed health checking (detect failures in seconds, not minutes)
Intelligent traffic steering (route based on real-time latency, not static rules)
...makes modern GSLB platforms like Cloudflare essential infrastructure for any global service.
For our e-government platform, the results speak for themselves:
83% reduction in user-facing downtime during failures
27-35% lower TTFB for regional and international users (Lagos, London, Mumbai)
96% cost reduction vs hardware load balancers
If you're managing services that need to be fast and available across multiple regions, GSLB isn't optional—it's essential. And with Cloudflare's architecture, it's more accessible than ever.
What Cloudflare Could Improve: An SRE's Perspective
After deploying Cloudflare GSLB in production and studying their architecture deeply, I've identified several areas where Cloudflare could enhance their load balancing platform. These aren't criticisms—they're observations from someone who believes in the product and wants to see it get even better.
1. Health Check Granularity & Customization
Current Limitation
Health monitors probe at fixed 60-second intervals minimum. For highly dynamic workloads or during incident response, this can feel slow.
The Challenge
When an origin starts degrading gradually (CPU creeping up, response times increasing), 60 seconds feels like an eternity. By the time the health check fails, you might have already served hundreds of slow requests to users.
Improvement Opportunity
Adaptive health check intervals based on historical stability:
Stable origins (>99.9% uptime for 30 days): Check every 60s
Recently recovered origins: Check every 15s for first hour
Degraded-but-not-failed origins: Check every 5-10s
More sophisticated health checks:
Response time percentile thresholds (P95, P99) not just success/fail
Resource utilization probes (CPU, memory, connection count) via custom endpoint
Synthetic transaction testing (e.g., "can you complete a login flow?")
Real-World Impact
During our testing, an origin's database connection pool was exhausted. Health checks passed (HTTP 200), but actual user requests were timing out. An application-aware health check would have caught this.
Suggested Implementation:
health_monitor:
  type: advanced
  checks:
    - type: http_response
      expected_code: 200
      weight: 0.3
    - type: response_time
      p95_threshold: 500ms
      weight: 0.4
    - type: application_health
      endpoint: /health/deep
      expected_metrics:
        db_connections_available: ">5"
        cache_hit_rate: ">0.7"
      weight: 0.3
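One way such a weighted monitor could be evaluated is sketched below in Python. This is purely hypothetical, since the feature does not exist today: the weights mirror the proposal above (expressed as integer percentages to avoid floating-point rounding), and the 70-point pass mark is an assumption.

```python
def composite_health(http_ok: bool, p95_ms: float,
                     db_connections_available: int, cache_hit_rate: float) -> int:
    """Score an origin from 0 to 100 using the three weighted checks."""
    checks = [
        (30, http_ok),                                                # http_response
        (40, p95_ms <= 500),                                          # response_time
        (30, db_connections_available > 5 and cache_hit_rate > 0.7),  # application_health
    ]
    return sum(weight for weight, passed in checks if passed)


PASS_MARK = 70  # assumed threshold, not part of the proposal above

# The exhausted connection pool scenario: HTTP 200, but slow and no DB headroom.
score = composite_health(http_ok=True, p95_ms=8000,
                         db_connections_available=0, cache_hit_rate=0.9)
print(score, "healthy" if score >= PASS_MARK else "unhealthy")
```

A status-only monitor would have passed this origin; the weighted score fails it because two of the three signals are bad.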
2. Session Affinity Reliability
Current Limitation
Session affinity occasionally breaks, as evidenced by community reports of users "jumping between hosts" despite proper cookie configuration.
The Challenge
From Cloudflare Community (Nov 2024):
"Requests from outside of Cloudflare to the Load Balancer Hostname are not sticky! They are freely jumping between the hosts."
This is a critical issue for stateful applications. When session affinity fails:
Users lose shopping carts
Authentication sessions break
Form data disappears
Multi-step workflows fail
Improvement Opportunity
Session affinity should be bulletproof with:
Multiple fallback methods:
Primary: Cookie (__cflb)
Fallback 1: X-Session-ID header
Fallback 2: Source IP hash
Session affinity health monitoring:
Track affinity "break rate" (% of requests that switch origins unexpectedly)
Alert when >0.1% of sessions break
Provide dashboard showing affinity violations
Graceful degradation:
If cookie lost, attempt session recovery via IP hash
Log affinity breaks for debugging
Option to return 503 instead of routing to wrong origin (for critical apps)
What We Did as Workaround
Implemented our own session replication between datacenters. Expensive and complex, but necessary given unpredictable affinity breaks.
3. Origin Limit Model Is Confusing
Current Limitation
Cloudflare counts each origin separately, even if it's the same IP in multiple pools.
The Example
You have 2 physical servers:
server-a.example.com (192.0.2.1)
server-b.example.com (192.0.2.2)
You want 3 pools for different services:
Pool 1 (web): server-a + server-b
Pool 2 (api): server-a + server-b
Pool 3 (admin): server-a + server-b
Expected origin count: 2 (you have 2 servers)
Actual origin count charged: 6 (2 servers × 3 pools)
With a 2-origin plan, you can only create 1 pool with 2 origins. This makes no sense operationally.
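The two counting models differ only in whether duplicates are collapsed. A short Python sketch of the arithmetic, using the example servers above; it illustrates the counting as described in this article and makes no claim about how Cloudflare's billing is implemented.

```python
pools = {
    "web": ["server-a.example.com", "server-b.example.com"],
    "api": ["server-a.example.com", "server-b.example.com"],
    "admin": ["server-a.example.com", "server-b.example.com"],
}

# Per-pool counting: every pool membership counts as a separate origin.
per_pool_count = sum(len(members) for members in pools.values())

# Unique counting: each distinct hostname counts once, however many pools use it.
unique_count = len({origin for members in pools.values() for origin in members})

print(per_pool_count, unique_count)
```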
Improvement Opportunity
Origin should be the unique endpoint, not endpoint-per-pool:
# Global origin registry
origins:
  - id: origin-a
    address: server-a.example.com
    datacenter: kigali
  - id: origin-b
    address: server-b.example.com
    datacenter: frankfurt
# Pools reference origins
pools:
  web-pool:
    origins: [origin-a, origin-b]
  api-pool:
    origins: [origin-a, origin-b]
  admin-pool:
    origins: [origin-a, origin-b]
# Billing: 2 unique origins, not 6
Alternative pricing model:
Charge per unique IP/hostname
Unlimited pool membership
This aligns with operational reality
Why This Matters
Most customers have a handful of physical datacenters but many logical services. Current model penalizes proper separation of concerns.
4. DNS-Only Mode Steering Limitations
Current Limitation
DNS-only load balancers lose advanced steering capabilities:
LORS (Least Outstanding Requests) reverts to random
No session affinity except ip_cookie
Can't integrate with WAF, caching, or other L7 features
The Challenge
Some customers need DNS-only mode (non-HTTP protocols, compliance, etc.) but still want intelligent steering.
Improvement Opportunity
Steering methods that work without proxying:
Extended DNS Response (EDN)
Return multiple IPs with priority hints in DNS response
Client library interprets hints for smarter selection
Backward compatible (falls back to standard A record)
External Health API
Expose real-time health data via API
Customers can query: "Which origin is healthiest for region X?"
Build custom client-side logic using Cloudflare's health data
Conditional DNS Responses
dns_only_pool:
  steering: conditional
  rules:
    - if: "client_asn == AS36924" # Specific ISP
      return: origin-a
    - if: "query_time == weekday_business_hours"
      return: [origin-a, origin-b] # Load balance
    - default: origin-c # Fallback
Real-World Use Case
We wanted DNS-only mode for our internal services (non-HTTP protocols) but needed geo-steering. Had to implement our own GeoDNS solution. Should've been a Cloudflare feature.
5. Configuration Change Velocity & Blast Radius
Current Limitation
Configuration changes (new pools, steering updates, health monitor tweaks) apply globally and immediately.
The Challenge
This has caused production incidents:
November 2024: Configuration error caused cascading failures across Cloudflare edge
August 2025: Traffic surge + routing change → AWS us-east-1 congestion
The problem: No staging/canary deployments for config changes at customer level.
Improvement Opportunity
Configuration change rollout controls:
Canary deployments:
config_change:
  new_pool: americas-pool-v2
  rollout_strategy:
    phase1: 1% of traffic for 5 minutes
    phase2: 10% for 15 minutes
    phase3: 100% if error_rate < baseline
  rollback_trigger:
    error_rate_increase: ">20%"
    latency_p95_increase: ">100ms"
Test mode:
Apply config changes to a small subset of traffic
Monitor metrics before full rollout
"What-if" analysis: "How would this steering change affect my traffic?"
Change approval workflow:
Critical config changes require confirmation
Show predicted impact before applying
Scheduled maintenance windows for risky changes
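The rollout controls proposed above could behave roughly like the following Python sketch. It is a hypothetical illustration of a phased rollout with automatic rollback, not an existing Cloudflare feature; the `observe` callback, the function names, and the simplification of the phase 3 condition into the same rollback test are all assumptions.

```python
PHASES = [1, 10, 100]  # percent of traffic exposed in each phase


def should_rollback(baseline_err, current_err, baseline_p95_ms, current_p95_ms):
    """Trigger on a >20% relative error-rate increase or a >100 ms P95 increase."""
    if baseline_err == 0:
        err_regressed = current_err > 0
    else:
        err_regressed = (current_err - baseline_err) / baseline_err > 0.20
    return err_regressed or (current_p95_ms - baseline_p95_ms) > 100


def run_rollout(observe, baseline_err, baseline_p95_ms):
    """observe(percent) returns (error_rate, p95_ms) measured at that exposure."""
    for percent in PHASES:
        err, p95 = observe(percent)
        if should_rollback(baseline_err, err, baseline_p95_ms, p95):
            return f"rolled back at {percent}%"
    return "fully rolled out"


def regresses_at_ten(percent):
    """Fake metrics source: the new config misbehaves once it sees 10% of traffic."""
    return (0.01, 300) if percent < 10 else (0.05, 320)


print(run_rollout(regresses_at_ten, baseline_err=0.01, baseline_p95_ms=300))
```

In this example the bad configuration is caught while 90% of traffic is still on the old one.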
Why This Matters
As SRE, I want confidence that my config change won't break production. Cloudflare's current "apply now, hope for best" model is nerve-wracking for large-scale deployments.
6. Cost Transparency for Health Checks
Current Limitation
Selecting "All Data Centers" for health monitoring can dramatically increase traffic to origins, but it's not clear how much traffic.
The Challenge
From Cloudflare docs:
"Using All Data Centers sends individual health monitor requests from all existing Cloudflare data centers (and that number of data centers is growing all the time)."
The math:
330+ datacenters
60-second interval
3 probes per datacenter
= ~990 health check requests/minute per origin (330 × 3 probes per 60-second interval)
For origins with bandwidth/request limits, this is a problem.
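The estimate is a single multiplication, so it is worth computing before enabling a monitor. A minimal Python sketch using the three inputs listed above; the probes-per-datacenter figure is this article's assumption, and the real number may differ.

```python
def probes_per_minute(datacenters: int, probes_per_datacenter: int,
                      interval_s: float) -> float:
    """Health check requests hitting one origin per minute."""
    return datacenters * probes_per_datacenter * 60 / interval_s


all_dcs = probes_per_minute(datacenters=330, probes_per_datacenter=3, interval_s=60)
print(f"{all_dcs:.0f} requests/minute per origin")
print(f"{all_dcs * 60 * 24:,.0f} requests/day per origin")
```

Multiplying the listed inputs gives roughly 990 requests per minute, or about 1.4 million per day, for every origin behind the monitor, which is enough to trip a conservative rate limiter.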
Improvement Opportunity
Health check traffic estimator:
health_monitor_preview:
  regions: [Western Europe, Eastern Europe]
  interval: 60s
  expected_traffic:
    requests_per_minute: 180
    bandwidth_per_day: 2.5 MB
  estimated_cost: $0.00 # Cloudflare side
  origin_cost_estimate: $0.15/day # Based on origin pricing
# Show BEFORE applying config
Smart health check scheduling:
Coordinate probes across regions (don't all fire at :00 seconds)
Adaptive probe counts (fewer checks if origin consistently healthy)
Option to "sample" datacenters (check 10% of DCs, rotate daily)
What Happened to Us
Selected "All Regions" thinking it meant ~10 regions. Turns out Cloudflare probed from 200+ locations. Origin's rate limiter kicked in, blocking health checks. Pool marked critical. Outage.
Would have been avoided with: Clear traffic estimates before applying.
7. Observability Gap: Per-Request Routing Decisions
Current Limitation
Analytics show aggregate traffic patterns, but it's hard to debug why a specific request went to a specific origin.
The Challenge
User reports: "I got routed to Singapore datacenter from London, why?"
Current debugging process:
Check logs (if you have them)
Guess based on steering policy
Hope you can reproduce
Improvement Opportunity
Request-level tracing for load balancer decisions:
Add debug header: CF-LB-Debug: true
Response includes:
CF-LB-Pool-Selected: europe-pool
CF-LB-Origin-Selected: frankfurt-01
CF-LB-Steering-Method: dynamic
CF-LB-Selection-Reason: lowest_rtt
CF-LB-RTT-Data: europe-pool:45ms, apac-pool:180ms
CF-LB-Health-Status: all-healthy
CF-LB-Session-Affinity: none
Real-time decision logs:
Stream of routing decisions to external SIEM
Filter by user, region, pool, or origin
Analyze patterns: "Why is 10% of US traffic going to APAC pool?"
Real-World Need
During our PoC, we wanted to validate that dynamic steering was actually choosing the lowest-latency pool. Had to trust Cloudflare's black box. Request-level visibility would've made this trivial to verify.
8. Load Balancer Analytics Need More Depth
Current Limitation
Analytics are good but lack SRE-critical dimensions:
Can't see P50/P95/P99 latency by pool
No historical steering decision breakdown
Limited error categorization
No anomaly detection
Improvement Opportunity
Enhanced analytics dashboard:
Latency heatmaps:
Response time by pool by hour: [Visual heatmap showing which pools were slow when]
Steering decision history:
Why did traffic shift at 14:30 UTC?
- Dynamic steering detected 50ms RTT increase in europe-pool
- Automatically shifted 30% of traffic to americas-pool
- Self-healed after 8 minutes
Error attribution:
521 errors increased 300% at 09:15 UTC
Origin: frankfurt-02
Root cause: Health check passed but origin firewall blocked Cloudflare IPs
Impact: 2,300 requests failed before pool marked critical
Predictive alerts:
Warning: africa-pool showing gradual latency increase
Current P95: 890ms (baseline: 450ms)
Predicted: Will exceed 1000ms in 15 minutes
Suggested action: Add capacity or shed load to backup pool



