Building a Production-Grade Global Server Load Balancer (GSLB) with Cloudflare.

This article documents my hands-on implementation and comparative analysis of Cloudflare's GSLB solution versus traditional hardware load balancers (A10 Networks).

The Problem: Geographic Distribution at Scale

Our e-government platform operates from three data centers with 45+ physical servers handling millions of daily requests. Without intelligent traffic distribution, we observed:

High latency for remote users: Citizens in one region experiencing 300ms+ response times when routed to distant data centers
Poor failover: Manual DNS changes taking 5-10 minutes during datacenter outages
Uneven load distribution: Primary datacenter handling 70% of traffic despite having only 40% of capacity
Limited visibility: Lack of real-time health monitoring across geographically distributed endpoints

What is GSLB, Really?

GSLB extends traditional load balancing beyond a single datacenter by:

Geographic traffic steering: Routing users to the nearest healthy datacenter based on latency or proximity
Intelligent health monitoring: Continuously probing endpoints and removing unhealthy ones from rotation
Automatic failover: Seamlessly redirecting traffic when entire datacenters go offline
Capacity-aware routing: Distributing load based on datacenter capacity and current utilization

Think of it as "load balancing for datacenters" rather than servers.

Architecture: Traditional vs Cloudflare Approach

Traditional GSLB (What We Had with A10)

Components:

Hardware load balancers in each datacenter (A10 Thunder)
DNS-based traffic steering
Centralized management console
Health checks via ICMP, TCP, or HTTP probes

Flow:

User queries DNS for services.gov.rw
A10 GSLB checks:
- User's geographic location (via DNS resolver IP)
- Datacenter health status
- Current load metrics
Returns IP of "best" datacenter
User connects directly to that datacenter's load balancer

Limitations I discovered:

Expensive hardware + licensing ($50K+ per datacenter)
Complex BGP configuration for anycast DNS
Single point of failure without redundant controllers
Health check latency (20-30 second intervals)
Limited global reach (our DNS servers only in 3 locations)

Cloudflare GSLB Architecture

Components:

Cloudflare's 330+ edge locations (global anycast network)
DNS + Layer 7 load balancing combined
Health monitors running from multiple regions
Optional Cloudflare Tunnel for private endpoint connectivity

Flow:

User queries DNS for services.gov.rw
Anycast routes query to nearest Cloudflare datacenter (sub-50ms for 95% of users)
Cloudflare checks:
- Endpoint health data (updated every 60s from multiple regions)
- Dynamic steering policies (proximity, latency, geo-steering)
- Pool weights and priorities
Returns optimal datacenter IP OR proxies request through edge

Key advantages:

No hardware to buy/maintain
Global anycast network included
Multi-region health checks, with sub-second rerouting once an endpoint is marked unhealthy
Built-in DDoS protection
Combined GTM + local load balancing in single config

Implementation: The Proof of Concept

Scope & Objectives

Goals:

Compare failover speed: Cloudflare vs A10
Measure latency improvements with geo-steering
Test health monitoring accuracy
Validate production readiness for public-facing services

Test Environment:

3 origin servers (Kigali, Frankfurt, Singapore)
Each running identical Nginx instances
Cloudflare Load Balancer in front
A10 Thunder GSLB as baseline

Configuration

1. Health Monitors

Created HTTPS health monitors with strict validation:

# Cloudflare Health Monitor Config
type: https
port: 443
path: /health
interval: 60  # seconds
timeout: 5
retries: 2
expected_codes: 200
expected_body: '{"status":"healthy"}'
headers:
  Host: "services.gov.rw"

Why this matters: The expected_body check ensures we're not just getting 200 responses from a misconfigured server, but actual valid responses from our application.
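To make the two-part check concrete, here is a minimal Python sketch of how a single probe result could be judged. It assumes the body check is a case-insensitive substring match; the function name `probe_is_healthy` and its parameters are illustrative and not part of Cloudflare's API.

```python
def probe_is_healthy(status_code: int, body: str,
                     expected_code: int = 200,
                     expected_body: str = '{"status":"healthy"}') -> bool:
    """Pass only when both the status code and the response body look right."""
    if status_code != expected_code:
        return False
    # Assumption: the body check is a substring match that ignores case.
    return expected_body.lower() in body.lower()


# A default Nginx page answers 200 but fails the body check.
print(probe_is_healthy(200, "<html>Welcome to nginx!</html>"))  # False
print(probe_is_healthy(200, '{"status":"healthy"}'))            # True
```

A misconfigured server that answers 200 with the wrong content is rejected by the second condition, which is the case the status code alone would miss.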

I also configured health checks from multiple Cloudflare regions:

Western Europe (for Frankfurt origin)
Eastern Africa (for Kigali origin)
Southeast Asia (for Singapore origin)

This provides geographic redundancy—if networking fails between one Cloudflare region and an origin, other regions can still validate health.

2. Endpoint Pools

Configured three geographic pools with weighted distribution:

# Pool 1: Africa Region
name: "africa-pool"
endpoints:
  - address: "origin-kigali.internal"
    weight: 0.4
    enabled: true
origins: 4 servers
health_threshold: 2  # Need 2/4 healthy to serve traffic

# Pool 2: Europe Region  
name: "europe-pool"
endpoints:
  - address: "origin-frankfurt.internal"
    weight: 0.3
    enabled: true
origins: 3 servers
health_threshold: 2

# Pool 3: Asia-Pacific Region
name: "apac-pool"
endpoints:
  - address: "origin-singapore.internal"
    weight: 0.3
    enabled: true
origins: 3 servers
health_threshold: 2

Weight calculation: With unequal capacity, I set weights proportional to total pool capacity:

Africa: 4 servers = 0.4 weight
Europe: 3 servers = 0.3 weight
APAC: 3 servers = 0.3 weight

Total = 1.0, giving Africa 40% of overflow traffic when all pools are healthy.
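The weight calculation is a plain normalization of server counts. A short Python sketch, where the helper name `capacity_weights` is hypothetical:

```python
def capacity_weights(servers_per_pool: dict[str, int]) -> dict[str, float]:
    """Turn per-pool server counts into weights that sum to 1.0."""
    total = sum(servers_per_pool.values())
    if total == 0:
        raise ValueError("no servers configured")
    return {pool: count / total for pool, count in servers_per_pool.items()}


weights = capacity_weights({"africa-pool": 4, "europe-pool": 3, "apac-pool": 3})
print(weights)  # {'africa-pool': 0.4, 'europe-pool': 0.3, 'apac-pool': 0.3}
```

Recomputing the weights this way whenever a server is added or retired keeps them tied to capacity rather than to hand-edited numbers.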

3. Traffic Steering

Tested three steering methods:

A. Dynamic Steering (Latency-Based)

Measures round-trip time (RTT) to each pool
Automatically selects lowest latency pool
Builds RTT profile over time

B. Proximity Steering (Geographic)

Routes based on physical distance
Used GPS coordinates for each datacenter
Falls back to geographic region matching

C. Geo Steering (Regional Policies)

Explicit rules: "African users → africa-pool"
Compliance-friendly for data residency
Failover to next-closest region if primary unavailable

For production, I chose Dynamic Steering because:

More accurate than pure geographic distance
Adapts to network conditions automatically
Accounts for backbone peering differences
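The idea of "building an RTT profile over time" can be sketched as a smoothed average per pool. This is only an illustration: the exponentially weighted average, the `alpha` value, and the class name `RttProfile` are my assumptions, not Cloudflare's documented algorithm.

```python
from typing import Optional


class RttProfile:
    """Keeps a smoothed RTT estimate per pool and picks the fastest healthy one."""

    def __init__(self, alpha: float = 0.2) -> None:
        self.alpha = alpha  # weight given to the newest sample (assumed value)
        self.rtt_ms: dict[str, float] = {}

    def record(self, pool: str, sample_ms: float) -> None:
        previous = self.rtt_ms.get(pool)
        if previous is None:
            self.rtt_ms[pool] = sample_ms
        else:
            self.rtt_ms[pool] = self.alpha * sample_ms + (1 - self.alpha) * previous

    def best_pool(self, healthy: set[str]) -> Optional[str]:
        candidates = {p: rtt for p, rtt in self.rtt_ms.items() if p in healthy}
        if not candidates:
            return None
        return min(candidates, key=candidates.get)


profile = RttProfile()
profile.record("europe-pool", 45.0)
profile.record("apac-pool", 180.0)
print(profile.best_pool({"europe-pool", "apac-pool"}))  # europe-pool
print(profile.best_pool({"apac-pool"}))                 # apac-pool
```

Smoothing means one slow sample does not flip the routing decision, while a sustained change in network conditions gradually does.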

Testing Methodology

Test 1: Normal Operation Performance

Metrics measured:

DNS resolution time
Time to first byte (TTFB)
End-to-end request latency

Test locations:

Kigali, Rwanda (local)
Lagos, Nigeria (regional)
London, UK (international)
Mumbai, India (international)

Method:

# DNS timing
dig services.gov.rw @1.1.1.1 | grep "Query time"

# Full request timing  
curl -w "@curl-format.txt" -o /dev/null -s https://services.gov.rw/api/v1/test

# curl-format.txt contents:
#   time_namelookup:  %{time_namelookup}
#   time_connect:     %{time_connect}
#   time_starttransfer: %{time_starttransfer}
#   time_total:       %{time_total}
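For repeated runs it helps to parse the curl output instead of reading it by eye. A minimal Python sketch, assuming the four-line format above; the sample values are illustrative, not measurements from the tests.

```python
def parse_curl_timings(output: str) -> dict[str, float]:
    """Parse 'name: seconds' lines produced by curl -w into a dict."""
    timings: dict[str, float] = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip():
            timings[name.strip()] = float(value)
    return timings


# Illustrative output only (seconds, as curl reports them).
sample = """\
  time_namelookup:  0.012
  time_connect:     0.031
  time_starttransfer: 0.067
  time_total:       0.069
"""

t = parse_curl_timings(sample)
dns_ms = t["time_namelookup"] * 1000
ttfb_ms = t["time_starttransfer"] * 1000
# Time between TCP connect and first byte (TLS handshake plus server wait).
after_connect_ms = (t["time_starttransfer"] - t["time_connect"]) * 1000
print(round(dns_ms), round(ttfb_ms), round(after_connect_ms))
```

Splitting TTFB into its DNS, connect, and post-connect parts shows whether a gain comes from name resolution or from the path to the origin.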

Test 2: Failover Scenarios

Scenario A: Single Server Failure

Stopped Nginx on 1 server in africa-pool
Measured: Time until Cloudflare marks it unhealthy
Measured: Impact on user requests

Scenario B: Entire Datacenter Failure

Simulated by dropping all traffic to Kigali datacenter
Measured: Failover time to Frankfurt/Singapore
Monitored: User experience during transition

Scenario C: Datacenter Recovery

Brought Kigali back online
Measured: Time until traffic returns
Checked: Load distribution after recovery

Test 3: A10 Comparison

Configured identical setup on A10 Thunder GSLB:

Same 3 datacenters
Same health check intervals (60s)
Same geo-steering policies

Tested same failure scenarios and measured differences.

Results: The Numbers Don't Lie

Performance (Normal Operations)

Location	Cloudflare DNS	A10 DNS	Cloudflare TTFB	A10 TTFB
Kigali	12ms	45ms	67ms	85ms
Lagos	18ms	52ms	94ms	145ms
London	8ms	38ms	89ms	127ms
Mumbai	22ms	68ms	178ms	245ms

Key findings:

Cloudflare DNS 65-79% faster than A10 (anycast advantage)
TTFB improvements of 21-35% across all regions
Dynamic steering outperformed static geo-steering by 12-18ms on average
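The relative improvements can be recomputed directly from the table. A short Python sketch using only the measured values above:

```python
# (Cloudflare ms, A10 ms) pairs taken from the performance table.
results = {
    "Kigali": {"dns": (12, 45), "ttfb": (67, 85)},
    "Lagos":  {"dns": (18, 52), "ttfb": (94, 145)},
    "London": {"dns": (8, 38),  "ttfb": (89, 127)},
    "Mumbai": {"dns": (22, 68), "ttfb": (178, 245)},
}


def improvement_pct(cloudflare_ms: float, a10_ms: float) -> int:
    """Percentage reduction relative to the A10 baseline, rounded."""
    return round(100 * (a10_ms - cloudflare_ms) / a10_ms)


dns_gain = {city: improvement_pct(*r["dns"]) for city, r in results.items()}
ttfb_gain = {city: improvement_pct(*r["ttfb"]) for city, r in results.items()}
print(dns_gain)   # {'Kigali': 73, 'Lagos': 65, 'London': 79, 'Mumbai': 68}
print(ttfb_gain)  # {'Kigali': 21, 'Lagos': 35, 'London': 30, 'Mumbai': 27}
```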

Failover Performance

Scenario	Cloudflare	A10	Difference
Single server failure detected	65s	90s	28% faster
Traffic rerouted	<1s	8-12s	~10x faster
Full datacenter failover	75s	120s	37% faster
User-facing downtime	<5s	30-45s	83% reduction

Critical insight: Cloudflare's distributed health checking meant multiple edge locations detected the failure simultaneously. With A10, only our primary DNS servers noticed, creating a bottleneck.

Load Distribution Accuracy

With Dynamic Steering enabled:

Before (Manual DNS):

Kigali: 72% of traffic (overloaded)
Frankfurt: 18%
Singapore: 10%

With Cloudflare Dynamic Steering:

Kigali: 42% (closer to 40% capacity)
Frankfurt: 31% (closer to 30% capacity)
Singapore: 27% (closer to 30% capacity)

Variance: ±2-3% during normal operations, so the observed split stayed close to each pool's share of capacity.

Real-World Findings & Gotchas

1. Health Check Paths Matter

Mistake I made: Initially used / as health check path.

Problem: Nginx returned 200 even when application backends were down (cached responses).

Solution: Created dedicated /health endpoint that:

Checks database connectivity
Validates backend service status
Returns JSON with detailed health data

Lesson: Always use application-aware health checks, not just "is the web server responding?"

2. TTL vs Failover Speed

Challenge: DNS caching delays failover.

Cloudflare advantage: When using proxied (orange cloud) mode, DNS TTL doesn't matter—Cloudflare handles backend failover without DNS changes.

For DNS-only mode: Set TTL to 30-60 seconds (Cloudflare minimum: 30s). Lower TTL = faster failover but more DNS queries.

Our choice: Proxied mode for critical services, DNS-only for internal tools.

3. Session Affinity Complexity

Requirement: Users need to stick to same datacenter for session persistence.

Solution: Enabled cookie-based session affinity with 24-hour TTL.

Gotcha: Failover breaks sessions. Users get new cookies after datacenter failure.

Mitigation:

Implemented session replication between datacenters
Added graceful endpoint draining (30 min TTL) for maintenance

4. Cost Comparison

A10 Hardware GSLB (3 years):

Hardware: $150K (3x $50K appliances)
Annual support: $45K/year
Total 3-year TCO: $285K

Cloudflare Load Balancing:

Load Balancing subscription: $50/month base
Per-endpoint fee: $5/endpoint/month (10 endpoints)
Health checks: Included
Total 3-year TCO: $9,800 (the two subscription lines above account for $3,600 of this: ($50 + 10 × $5) × 36 months)

Savings: ~$275K over 3 years (96% reduction)

Plus: No hardware refresh cycles, no datacenter rack space, no power/cooling costs.
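The comparison reduces to a few lines of arithmetic. In this Python sketch the A10 figures are rebuilt from their line items, while the Cloudflare total is taken as stated rather than recomputed:

```python
A10_APPLIANCES = 3
A10_APPLIANCE_COST = 50_000    # USD per appliance
A10_SUPPORT_PER_YEAR = 45_000  # USD
YEARS = 3
CLOUDFLARE_TCO = 9_800         # USD, stated three-year total

a10_tco = A10_APPLIANCES * A10_APPLIANCE_COST + A10_SUPPORT_PER_YEAR * YEARS
savings = a10_tco - CLOUDFLARE_TCO
reduction_pct = 100 * savings / a10_tco
print(a10_tco, savings, f"{reduction_pct:.1f}%")  # 285000 275200 96.6%
```

The result is about 96.6%, which the figure above rounds down to 96%.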

Production Deployment: What We Did

After the PoC, we deployed Cloudflare for our public-facing government services portal:

Phase 1: DNS Migration

Moved NS records to Cloudflare
Configured load balancer for services.gov.rw
Set up failover pools
Result: DNS query time dropped from 45ms to 12ms

Phase 2: Health Monitoring

Created endpoints for all public services
Configured HTTPS health checks every 60 seconds
Set up email/Slack alerts for unhealthy origins
Result: Detected and resolved 3 issues before users noticed

Phase 3: Dynamic Steering

Enabled latency-based steering
Monitored for 2 weeks
Validated traffic distribution matched capacity
Result: Load balanced within 3% of target distribution

Phase 4: Failover Testing

Scheduled maintenance window
Performed controlled datacenter failover
Monitored user experience
Result: Zero user-reported issues during planned outage

Key Learnings

What Worked Well

Anycast is a game-changer: Having DNS responses come from 330+ locations vs 3 = massive latency win
Unified platform: Combining DNS, load balancing, DDoS protection, and CDN in one platform simplified operations
Health check frequency: 60-second checks vs 5-minute checks (our A10 config) caught issues 5x faster
No hardware maintenance: Eliminating hardware refresh cycles freed up our team

What I'd Do Differently

Start with DNS-only mode: We jumped to proxied mode immediately. DNS-only would've been a safer first step.
More granular pools: Rather than "africa-pool", I'd create per-datacenter pools for finer control
Custom alerting earlier: Waited too long to set up PagerDuty integration for health check failures
Load testing: Should've done higher-load tests before production cutover

Advanced Configurations Worth Exploring

1. Least Outstanding Requests (LORS) Steering

For endpoints with varying request processing times:

steering_policy: least_outstanding_requests
# Routes to the endpoint with the fewest outstanding (in-flight) requests
# Useful for: Background job processors, video encoding, ML inference

When to use: When some requests take 10x longer than others (video processing, report generation).
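The selection rule itself is simple. A minimal Python sketch, assuming the balancer tracks a count of in-flight requests per endpoint; the endpoint names and the tie-break by name are hypothetical.

```python
def pick_least_outstanding(pending: dict[str, int]) -> str:
    """Return the endpoint with the fewest in-flight requests."""
    if not pending:
        raise ValueError("no endpoints available")
    # Ties are broken by name so the choice is deterministic.
    return min(pending, key=lambda name: (pending[name], name))


in_flight = {"encoder-a": 7, "encoder-b": 2, "encoder-c": 2}
print(pick_least_outstanding(in_flight))  # encoder-b
```

An endpoint stuck on a long-running job keeps a high count and stops receiving new work until it drains, which is why this policy suits uneven request durations.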

2. Cloudflare Tunnel Integration

Connect private endpoints without exposing them to the internet:

# Create a named tunnel (run on the origin server, after installing cloudflared)
cloudflared tunnel create gov-rw-backend

# Route traffic through tunnel
cloudflared tunnel route dns gov-rw-backend backend.internal.gov.rw

Benefit: Zero inbound firewall rules. All connectivity outbound from origin.

3. Session Affinity with Header-Based Routing

For API services needing consistent routing:

session_affinity: header
affinity_ttl: 3600
header: X-User-ID
# Routes same user to same origin for 1 hour

4. Custom Rules for Advanced Steering

Route specific traffic types to specific pools:

// Pseudocode showing the routing intent, not Cloudflare's literal rule syntax
// POST requests to dedicated write pool
if (http.request.method == "POST") {
  cf.load_balancing.pool = "write-pool";
}

// Large file uploads to high-bandwidth pool  
if (http.request.uri.path.startsWith("/upload")) {
  cf.load_balancing.pool = "upload-pool";
}

Monitoring & Observability

Metrics We Track

Load Balancer Health:

Requests per second per pool
Error rate by origin
Average response time by geographic region
Failover events (count & duration)

Health Check Status:

Probe success rate
Time to detect failures
Time to recover

Business Impact:

Service availability (by region)
User-facing latency (P50, P95, P99)
Geographic distribution of users

Dashboards & Alerts

Grafana Dashboard:

Real-time health check status
Request distribution across pools
Latency heatmap by region
Historical failover events

PagerDuty Alerts:

Critical: Entire pool unhealthy
Warning: <50% of endpoints healthy
Info: Single endpoint failure
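The three alert levels can be expressed as one function of a pool's healthy-endpoint count. A Python sketch under one assumption of mine: any partial failure that leaves at least half the endpoints healthy is treated as "info".

```python
from typing import Optional


def alert_severity(healthy: int, total: int) -> Optional[str]:
    """Map a pool's healthy-endpoint count to an alert level (None = no alert)."""
    if total <= 0 or not 0 <= healthy <= total:
        raise ValueError("invalid endpoint counts")
    if healthy == 0:
        return "critical"      # entire pool unhealthy
    if healthy / total < 0.5:
        return "warning"       # fewer than 50% of endpoints healthy
    if healthy < total:
        return "info"          # assumption: covers single-endpoint failures
    return None


print(alert_severity(0, 4), alert_severity(1, 4), alert_severity(3, 4))
# critical warning info
```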

Recommendations for Your GSLB Implementation

Start Simple

Begin with 2 datacenters: Don't optimize for 10 locations on day 1
DNS-only mode first: Get comfortable before proxying traffic
Simple health checks: ICMP or TCP before HTTP/HTTPS
Geographic steering: Easier to reason about than dynamic

Graduate to Advanced

Add dynamic steering: Once you trust health checks
Enable session affinity: When you understand traffic patterns
Custom rules: For specific use cases
LORS steering: When you have heterogeneous workloads

Production Readiness Checklist

[ ] Health checks from multiple regions configured
[ ] Alerting setup for unhealthy endpoints
[ ] Tested manual failover procedures
[ ] Session affinity configured if needed
[ ] Documented runbooks for common scenarios
[ ] Load tested each pool individually
[ ] Validated geographic traffic distribution
[ ] Established baseline metrics

Conclusion

GSLB isn't just "DNS with extras"—it's a fundamental shift in how we think about availability and performance for distributed systems. The combination of:

Anycast networking (sub-50ms DNS for 95% of users)
Distributed health checking (detect failures in seconds, not minutes)
Intelligent traffic steering (route based on real-time latency, not static rules)

...makes modern GSLB platforms like Cloudflare essential infrastructure for any global service.

For our e-government platform, the results speak for themselves:

83% reduction in user-facing downtime during failures
Up to 35% lower TTFB for users outside Rwanda (27-35% across Lagos, London, and Mumbai)
96% cost reduction vs hardware load balancers

If you're managing services that need to be fast and available across multiple regions, GSLB isn't optional—it's essential. And with Cloudflare's architecture, it's more accessible than ever.

What Cloudflare Could Improve: An SRE's Perspective

After deploying Cloudflare GSLB in production and studying their architecture deeply, I've identified several areas where Cloudflare could enhance their load balancing platform. These aren't criticisms—they're observations from someone who believes in the product and wants to see it get even better.

1. Health Check Granularity & Customization

Current Limitation

Health monitors probe at fixed 60-second intervals minimum. For highly dynamic workloads or during incident response, this can feel slow.

The Challenge

When an origin starts degrading gradually (CPU creeping up, response times increasing), 60 seconds feels like an eternity. By the time the health check fails, you might have already served hundreds of slow requests to users.

Improvement Opportunity

Adaptive health check intervals based on historical stability:

Stable origins (>99.9% uptime for 30 days): Check every 60s
Recently recovered origins: Check every 15s for first hour
Degraded-but-not-failed origins: Check every 5-10s

More sophisticated health checks:

Response time percentile thresholds (P95, P99) not just success/fail
Resource utilization probes (CPU, memory, connection count) via custom endpoint
Synthetic transaction testing (e.g., "can you complete a login flow?")

Real-World Impact

During our testing, an origin's database connection pool exhausted. Health checks passed (HTTP 200) but actual user requests were timing out. An application-aware health check would've caught this.

Suggested Implementation:

health_monitor:
  type: advanced
  checks:
    - type: http_response
      expected_code: 200
      weight: 0.3
    - type: response_time
      p95_threshold: 500ms
      weight: 0.4
    - type: application_health
      endpoint: /health/deep
      expected_metrics:
        db_connections_available: ">5"
        cache_hit_rate: ">0.7"
      weight: 0.3
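One way to read the weights in this proposal is as a composite score. The Python sketch below is purely hypothetical, like the config it illustrates: the 0.7 pass threshold and the function names are my assumptions.

```python
def health_score(checks: dict[str, tuple[bool, float]]) -> float:
    """Sum the weights of the checks that passed."""
    return sum(weight for passed, weight in checks.values() if passed)


def is_healthy(checks: dict[str, tuple[bool, float]], threshold: float = 0.7) -> bool:
    # Assumption: an origin counts as healthy at or above the threshold.
    return health_score(checks) >= threshold


# HTTP 200 and deep health pass, but P95 latency is over its limit.
degraded = {
    "http_response": (True, 0.3),
    "response_time": (False, 0.4),
    "application_health": (True, 0.3),
}
print(is_healthy(degraded))  # False
```

With this scoring, an origin that still answers 200 but has slow responses drops below the threshold, which is the gradual-degradation case a pass/fail probe misses.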

2. Session Affinity Reliability

Current Limitation

Session affinity occasionally breaks, as evidenced by community reports of users "jumping between hosts" despite proper cookie configuration.

The Challenge

From Cloudflare Community (Nov 2024):

"Requests from outside of Cloudflare to the Load Balancer Hostname are not sticky! They are freely jumping between the hosts."

This is a critical issue for stateful applications. When session affinity fails:

Users lose shopping carts
Authentication sessions break
Form data disappears
Multi-step workflows fail

Improvement Opportunity

Session affinity should be bulletproof with:

Multiple fallback methods:

Primary: Cookie (__cflb)
Fallback 1: X-Session-ID header
Fallback 2: Source IP hash

Session affinity health monitoring:

Track affinity "break rate" (% of requests that switch origins unexpectedly)
Alert when >0.1% of sessions break
Provide dashboard showing affinity violations

Graceful degradation:

If cookie lost, attempt session recovery via IP hash
Log affinity breaks for debugging
Option to return 503 instead of routing to wrong origin (for critical apps)
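The fallback order proposed above can be sketched as a small key-selection function in Python. This describes a feature request, not current Cloudflare behavior; the function name is hypothetical, and header lookup is assumed to use the exact name `X-Session-ID`.

```python
def affinity_key(cookies: dict[str, str], headers: dict[str, str],
                 client_ip: str) -> tuple[str, str]:
    """Return (method, key) using the first affinity signal that is present."""
    if "__cflb" in cookies:
        return ("cookie", cookies["__cflb"])
    if "X-Session-ID" in headers:
        return ("header", headers["X-Session-ID"])
    # Last resort: the source IP, which the balancer would then hash.
    return ("ip", client_ip)


print(affinity_key({}, {"X-Session-ID": "s-42"}, "203.0.113.7"))
# ('header', 's-42')
```

Returning the method alongside the key makes it easy to log which fallback was used, which is the raw data a "break rate" metric would need.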

What We Did as Workaround

Implemented our own session replication between datacenters. Expensive and complex, but necessary given unpredictable affinity breaks.

3. Origin Limit Model Is Confusing

Current Limitation

Cloudflare counts each origin separately, even if it's the same IP in multiple pools.

The Example

You have 2 physical servers:

server-a.example.com (192.0.2.1)
server-b.example.com (192.0.2.2)

You want 3 pools for different services:

Pool 1 (web): server-a + server-b
Pool 2 (api): server-a + server-b
Pool 3 (admin): server-a + server-b

Expected origin count: 2 (you have 2 servers)
Actual origin count charged: 6 (2 servers × 3 pools)

With a 2-origin plan, you can only create 1 pool with 2 origins. This makes no sense operationally.
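The gap between the two counting methods is easy to reproduce. A short Python sketch using the example pools above:

```python
pools = {
    "web":   ["server-a.example.com", "server-b.example.com"],
    "api":   ["server-a.example.com", "server-b.example.com"],
    "admin": ["server-a.example.com", "server-b.example.com"],
}

# Counting each distinct address once.
unique_origins = len({origin for members in pools.values() for origin in members})
# Counting every pool membership separately.
per_pool_origins = sum(len(members) for members in pools.values())

print(unique_origins, per_pool_origins)  # 2 6
```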

Improvement Opportunity

Origin should be the unique endpoint, not endpoint-per-pool:

# Global origin registry
origins:
  - id: origin-a
    address: server-a.example.com
    datacenter: kigali
  - id: origin-b
    address: server-b.example.com
    datacenter: frankfurt

# Pools reference origins
pools:
  web-pool:
    origins: [origin-a, origin-b]
  api-pool:
    origins: [origin-a, origin-b]
  admin-pool:
    origins: [origin-a, origin-b]

# Billing: 2 unique origins, not 6

Alternative pricing model:

Charge per unique IP/hostname
Unlimited pool membership
This aligns with operational reality

Why This Matters

Most customers have a handful of physical datacenters but many logical services. Current model penalizes proper separation of concerns.

4. DNS-Only Mode Steering Limitations

Current Limitation

DNS-only load balancers lose advanced steering capabilities:

LORS (Least Outstanding Requests) reverts to random
No session affinity except ip_cookie
Can't integrate with WAF, caching, or other L7 features

The Challenge

Some customers need DNS-only mode (non-HTTP protocols, compliance, etc.) but still want intelligent steering.

Improvement Opportunity

Steering methods that work without proxying:

Extended DNS Response (EDN):

Return multiple IPs with priority hints in DNS response
Client library interprets hints for smarter selection
Backward compatible (falls back to standard A record)

External Health API:

Expose real-time health data via API
Customers can query: "Which origin is healthiest for region X?"
Build custom client-side logic using Cloudflare's health data

Conditional DNS Responses:

 dns_only_pool:
   steering: conditional
   rules:
     - if: "client_asn == AS36924"  # Specific ISP
       return: origin-a
     - if: "query_time == weekday_business_hours"
       return: [origin-a, origin-b]  # Load balance
     - default: origin-c  # Fallback

Real-World Use Case

We wanted DNS-only mode for our internal services (non-HTTP protocols) but needed geo-steering. Had to implement our own GeoDNS solution. Should've been a Cloudflare feature.

5. Configuration Change Velocity & Blast Radius

Current Limitation

Configuration changes (new pools, steering updates, health monitor tweaks) apply globally and immediately.

The Challenge

This has caused production incidents:

November 2024: Configuration error caused cascading failures across Cloudflare edge
August 2025: Traffic surge + routing change → AWS us-east-1 congestion

The problem: No staging/canary deployments for config changes at customer level.

Improvement Opportunity

Configuration change rollout controls:

Canary deployments:

 config_change:
   new_pool: americas-pool-v2
   rollout_strategy:
     phase1: 1% of traffic for 5 minutes
     phase2: 10% for 15 minutes
     phase3: 100% if error_rate < baseline
   rollback_trigger:
     error_rate_increase: ">20%"
     latency_p95_increase: ">100ms"
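The rollback trigger in this proposal amounts to a comparison against baseline metrics. A minimal Python sketch of that check, with the thresholds taken from the config above and a hypothetical function name:

```python
def should_roll_back(baseline_error_rate: float, canary_error_rate: float,
                     baseline_p95_ms: float, canary_p95_ms: float) -> bool:
    """True when the canary breaches either rollback threshold."""
    if baseline_error_rate > 0:
        error_increase = (canary_error_rate - baseline_error_rate) / baseline_error_rate
    else:
        # No baseline errors: treat any canary error as an unbounded increase.
        error_increase = float("inf") if canary_error_rate > 0 else 0.0
    latency_increase_ms = canary_p95_ms - baseline_p95_ms
    return error_increase > 0.20 or latency_increase_ms > 100


print(should_roll_back(0.010, 0.013, 450, 460))  # True: errors up 30%
print(should_roll_back(0.010, 0.011, 450, 500))  # False: within both limits
```

Evaluating this at the end of each phase is what would let the rollout stop at 1% or 10% of traffic instead of reaching everyone.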

Test mode:

Apply config changes to a small subset of traffic
Monitor metrics before full rollout
"What-if" analysis: "How would this steering change affect my traffic?"

Change approval workflow:

Critical config changes require confirmation
Show predicted impact before applying
Scheduled maintenance windows for risky changes

Why This Matters

As an SRE, I want confidence that my config change won't break production. Cloudflare's current "apply now, hope for the best" model is nerve-wracking for large-scale deployments.

6. Cost Transparency for Health Checks

Current Limitation

Selecting "All Data Centers" for health monitoring can dramatically increase traffic to origins, but it's not clear how much traffic.

The Challenge

From Cloudflare docs:

"Using All Data Centers sends individual health monitor requests from all existing Cloudflare data centers (and that number of data centers is growing all the time)."

The math:

330+ datacenters
60-second interval
3 probes per datacenter

= ~990 health check requests/minute per origin (330 × 3 probes, one round every 60 seconds)

For origins with bandwidth/request limits, this is a problem.
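The probe rate follows from the three inputs listed (data centers, probes per data center, interval). A small Python sketch for trying other values before picking a region set; the function name is hypothetical.

```python
def probes_per_minute(data_centers: int, probes_per_dc: int, interval_s: float) -> float:
    """Health check requests per minute arriving at a single origin."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return data_centers * probes_per_dc * 60 / interval_s


print(probes_per_minute(330, 3, 60))  # 990.0
```

Comparing the result with the origin's rate limit before enabling a wide region selection is a cheap guard against the limiter blocking the probes themselves.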

Improvement Opportunity

Health check traffic estimator:

health_monitor_preview:
  regions: [Western Europe, Eastern Europe]
  interval: 60s
  expected_traffic:
    requests_per_minute: 180
    bandwidth_per_day: 2.5 MB
    estimated_cost: $0.00  # Cloudflare side
    origin_cost_estimate: $0.15/day  # Based on origin pricing

# Show BEFORE applying config

Smart health check scheduling:

Coordinate probes across regions (don't all fire at :00 seconds)
Adaptive probe counts (fewer checks if origin consistently healthy)
Option to "sample" datacenters (check 10% of DCs, rotate daily)

What Happened to Us

Selected "All Regions" thinking it meant ~10 regions. Turns out Cloudflare probed from 200+ locations. Origin's rate limiter kicked in, blocking health checks. Pool marked critical. Outage.

Would have been avoided with: Clear traffic estimates before applying.

7. Observability Gap: Per-Request Routing Decisions

Current Limitation

Analytics show aggregate traffic patterns, but it's hard to debug why a specific request went to a specific origin.

The Challenge

User reports: "I got routed to Singapore datacenter from London, why?"

Current debugging process:

Check logs (if you have them)
Guess based on steering policy
Hope you can reproduce

Improvement Opportunity

Request-level tracing for load balancer decisions:

Add debug header: CF-LB-Debug: true

Response includes:

CF-LB-Pool-Selected: europe-pool
CF-LB-Origin-Selected: frankfurt-01
CF-LB-Steering-Method: dynamic
CF-LB-Selection-Reason: lowest_rtt
CF-LB-RTT-Data: europe-pool:45ms, apac-pool:180ms
CF-LB-Health-Status: all-healthy
CF-LB-Session-Affinity: none

Real-time decision logs:

Stream of routing decisions to external SIEM
Filter by user, region, pool, or origin
Analyze patterns: "Why is 10% of US traffic going to APAC pool?"

Real-World Need

During our PoC, we wanted to validate that dynamic steering was actually choosing the lowest-latency pool. Had to trust Cloudflare's black box. Request-level visibility would've made this trivial to verify.

8. Load Balancer Analytics Need More Depth

Current Limitation

Analytics are good but lack SRE-critical dimensions:

Can't see P50/P95/P99 latency by pool
No historical steering decision breakdown
Limited error categorization
No anomaly detection

Improvement Opportunity

Enhanced analytics dashboard:

Latency heatmaps:

 Response time by pool by hour:
 [Visual heatmap showing which pools were slow when]

Steering decision history:

 Why did traffic shift at 14:30 UTC?
 - Dynamic steering detected 50ms RTT increase in europe-pool
 - Automatically shifted 30% of traffic to americas-pool
 - Self-healed after 8 minutes

Error attribution:

 521 errors increased 300% at 09:15 UTC
 Origin: frankfurt-02
 Root cause: Health check passed but origin firewall blocked Cloudflare IPs
 Impact: 2,300 requests failed before pool marked critical

Predictive alerts:

 Warning: africa-pool showing gradual latency increase
 Current P95: 890ms (baseline: 450ms)
 Predicted: Will exceed 1000ms in 15 minutes
 Suggested action: Add capacity or shed load to backup pool
