GSLB with Cloudflare



Building a Production-Grade Global Server Load Balancer (GSLB) with Cloudflare.

This article documents my hands-on implementation and comparative analysis of Cloudflare's GSLB solution versus traditional hardware load balancers (A10 Networks).

The Problem: Geographic Distribution at Scale

Our e-government platform operates from three data centers with 45+ physical servers handling millions of daily requests. Without intelligent traffic distribution, we observed:

  • High latency for remote users: Citizens in one region experiencing 300ms+ response times when routed to distant data centers

  • Poor failover: Manual DNS changes taking 5-10 minutes during datacenter outages

  • Uneven load distribution: Primary datacenter handling 70% of traffic despite having only 40% of capacity

  • Limited visibility: Lack of real-time health monitoring across geographically distributed endpoints

What is GSLB, Really?

GSLB extends traditional load balancing beyond a single datacenter by:

  1. Geographic traffic steering: Routing users to the nearest healthy datacenter based on latency or proximity

  2. Intelligent health monitoring: Continuously probing endpoints and removing unhealthy ones from rotation

  3. Automatic failover: Seamlessly redirecting traffic when entire datacenters go offline

  4. Capacity-aware routing: Distributing load based on datacenter capacity and current utilization

Think of it as "load balancing for datacenters" rather than servers.

Architecture: Traditional vs Cloudflare Approach

Traditional GSLB (What We Had with A10)

Components:

  • Hardware load balancers in each datacenter (A10 Thunder)

  • DNS-based traffic steering

  • Centralized management console

  • Health checks via ICMP, TCP, or HTTP probes

Flow:

  1. User queries DNS for services.gov.rw

  2. A10 GSLB checks:

    • User's geographic location (via DNS resolver IP)

    • Datacenter health status

    • Current load metrics

  3. Returns IP of "best" datacenter

  4. User connects directly to that datacenter's load balancer

Limitations I discovered:

  • Expensive hardware + licensing ($50K+ per datacenter)

  • Complex BGP configuration for anycast DNS

  • Single point of failure without redundant controllers

  • Health check latency (20-30 second intervals)

  • Limited global reach (our DNS servers only in 3 locations)

Cloudflare GSLB Architecture

Components:

  • Cloudflare's 330+ edge locations (global anycast network)

  • DNS + Layer 7 load balancing combined

  • Health monitors running from multiple regions

  • Optional Cloudflare Tunnel for private endpoint connectivity

Flow:

  1. User queries DNS for services.gov.rw

  2. Anycast routes query to nearest Cloudflare datacenter (sub-50ms for 95% of users)

  3. Cloudflare checks:

    • Endpoint health data (updated every 60s from multiple regions)

    • Dynamic steering policies (proximity, latency, geo-steering)

    • Pool weights and priorities

  4. Returns optimal datacenter IP OR proxies request through edge

Key advantages:

  • No hardware to buy/maintain

  • Global anycast network included

  • Health checks run from multiple regions (every 60s in our configuration)

  • Built-in DDoS protection

  • Combined global traffic management (GTM) + local load balancing in a single config

Implementation: The Proof of Concept

Scope & Objectives

Goals:

  1. Compare failover speed: Cloudflare vs A10

  2. Measure latency improvements with geo-steering

  3. Test health monitoring accuracy

  4. Validate production readiness for public-facing services

Test Environment:

  • 3 origin servers (Kigali, Frankfurt, Singapore)

  • Each running identical Nginx instances

  • Cloudflare Load Balancer in front

  • A10 Thunder GSLB as baseline

Configuration

1. Health Monitors

Created HTTPS health monitors with strict validation:

# Cloudflare Health Monitor Config
type: https
port: 443
path: /health
interval: 60  # seconds
timeout: 5
retries: 2
expected_codes: 200
expected_body: '{"status":"healthy"}'
headers:
  Host: "services.gov.rw"

Why this matters: The expected_body check ensures we're not just getting 200 responses from a misconfigured server, but actual valid responses from our application.
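A minimal sketch of the pass/fail logic this monitor implies, assuming a probe passes only when the status code is in the expected set and the body contains the expected string. The function name and the substring-match behaviour are illustrative assumptions, not Cloudflare's implementation.

```python
def probe_passes(status_code: int, body: str,
                 expected_codes: set, expected_body: str) -> bool:
    """Return True only when both the status code and the body match."""
    if status_code not in expected_codes:
        return False
    # Substring match (an assumption): extra JSON fields do not fail the probe.
    return expected_body in body


EXPECTED = '{"status":"healthy"}'

# A default Nginx page answers 200 but fails the body check.
print(probe_passes(200, "<html>Welcome to nginx!</html>", {200}, EXPECTED))  # False
print(probe_passes(200, '{"status":"healthy"}', {200}, EXPECTED))            # True
```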

I also configured health checks from multiple Cloudflare regions:

  • Western Europe (for Frankfurt origin)

  • Eastern Africa (for Kigali origin)

  • Southeast Asia (for Singapore origin)

This provides geographic redundancy—if networking fails between one Cloudflare region and an origin, other regions can still validate health.

2. Endpoint Pools

Configured three geographic pools with weighted distribution:

# Pool 1: Africa Region
name: "africa-pool"
endpoints:
  - address: "origin-kigali.internal"
    weight: 0.4
    enabled: true
origins: 4 servers
health_threshold: 2  # Need 2/4 healthy to serve traffic

# Pool 2: Europe Region  
name: "europe-pool"
endpoints:
  - address: "origin-frankfurt.internal"
    weight: 0.3
    enabled: true
origins: 3 servers
health_threshold: 2

# Pool 3: Asia-Pacific Region
name: "apac-pool"
endpoints:
  - address: "origin-singapore.internal"
    weight: 0.3
    enabled: true
origins: 3 servers
health_threshold: 2

Weight calculation: With unequal capacity, I set weights proportional to total pool capacity:

  • Africa: 4 servers = 0.4 weight

  • Europe: 3 servers = 0.3 weight

  • APAC: 3 servers = 0.3 weight

Total = 1.0, giving Africa 40% of overflow traffic when all pools are healthy.
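The weight calculation can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; `capacity_weights` is a hypothetical helper, and it assumes every server has equal capacity.

```python
def capacity_weights(servers_per_pool: dict) -> dict:
    """Return each pool's share of total server count as its weight."""
    total = sum(servers_per_pool.values())
    if total <= 0:
        raise ValueError("at least one server is required")
    return {name: count / total for name, count in servers_per_pool.items()}


weights = capacity_weights({"africa-pool": 4, "europe-pool": 3, "apac-pool": 3})
print(weights)  # {'africa-pool': 0.4, 'europe-pool': 0.3, 'apac-pool': 0.3}
```

If servers differ in size, a capacity figure such as CPU cores or measured requests per second could be substituted for the server count.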

3. Traffic Steering

Tested three steering methods:

A. Dynamic Steering (Latency-Based)

  • Measures round-trip time (RTT) to each pool

  • Automatically selects lowest latency pool

  • Builds RTT profile over time

B. Proximity Steering (Geographic)

  • Routes based on physical distance

  • Used GPS coordinates for each datacenter

  • Falls back to geographic region matching

C. Geo Steering (Regional Policies)

  • Explicit rules: "African users → africa-pool"

  • Compliance-friendly for data residency

  • Failover to next-closest region if primary unavailable

For production, I chose Dynamic Steering because:

  1. More accurate than pure geographic distance

  2. Adapts to network conditions automatically

  3. Accounts for backbone peering differences
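To make the idea concrete, here is a rough sketch of latency-based selection: keep a smoothed RTT per pool and pick the healthy pool with the lowest value. The `PoolState` class, the exponential smoothing, and the `alpha` value are assumptions chosen for illustration; Cloudflare does not document its RTT profiling at this level of detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolState:
    name: str
    healthy: bool
    rtt_ms: Optional[float] = None  # smoothed RTT; None until a sample arrives

    def record_rtt(self, sample_ms: float, alpha: float = 0.2) -> None:
        """Fold a new RTT sample into the running profile (EWMA)."""
        if self.rtt_ms is None:
            self.rtt_ms = sample_ms
        else:
            self.rtt_ms = alpha * sample_ms + (1 - alpha) * self.rtt_ms


def pick_pool(pools: list) -> Optional[PoolState]:
    """Pick the healthy pool with the lowest smoothed RTT, if any."""
    candidates = [p for p in pools if p.healthy and p.rtt_ms is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.rtt_ms)


pools = [
    PoolState("africa-pool", healthy=False, rtt_ms=20.0),  # closest, but down
    PoolState("europe-pool", healthy=True, rtt_ms=95.0),
    PoolState("apac-pool", healthy=True, rtt_ms=180.0),
]
print(pick_pool(pools).name)  # europe-pool
```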

Testing Methodology

Test 1: Normal Operation Performance

Metrics measured:

  • DNS resolution time

  • Time to first byte (TTFB)

  • End-to-end request latency

Test locations:

  • Kigali, Rwanda (local)

  • Lagos, Nigeria (regional)

  • London, UK (international)

  • Mumbai, India (international)

Method:

# DNS timing
dig services.gov.rw @1.1.1.1 | grep "Query time"

# Full request timing  
curl -w "@curl-format.txt" -o /dev/null -s https://services.gov.rw/api/v1/test

# curl-format.txt contents:
#   time_namelookup:  %{time_namelookup}
#   time_connect:     %{time_connect}
#   time_starttransfer: %{time_starttransfer}
#   time_total:       %{time_total}

Test 2: Failover Scenarios

Scenario A: Single Server Failure

  • Stopped Nginx on 1 server in africa-pool

  • Measured: Time until Cloudflare marks it unhealthy

  • Measured: Impact on user requests

Scenario B: Entire Datacenter Failure

  • Simulated by dropping all traffic to Kigali datacenter

  • Measured: Failover time to Frankfurt/Singapore

  • Monitored: User experience during transition

Scenario C: Datacenter Recovery

  • Brought Kigali back online

  • Measured: Time until traffic returns

  • Checked: Load distribution after recovery

Test 3: A10 Comparison

Configured identical setup on A10 Thunder GSLB:

  • Same 3 datacenters

  • Same health check intervals (60s)

  • Same geo-steering policies

Tested same failure scenarios and measured differences.

Results: The Numbers Don't Lie

Performance (Normal Operations)

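| Location | Cloudflare DNS | A10 DNS | Cloudflare TTFB | A10 TTFB |
|----------|----------------|---------|-----------------|----------|
| Kigali   | 12ms           | 45ms    | 67ms            | 85ms     |
| Lagos    | 18ms           | 52ms    | 94ms            | 145ms    |
| London   | 8ms            | 38ms    | 89ms            | 127ms    |
| Mumbai   | 22ms           | 68ms    | 178ms           | 245ms    |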

Key findings:

  • Cloudflare DNS 65-79% faster than A10 (anycast advantage)

  • TTFB improvements of 21-35% across all regions

  • Dynamic steering outperformed static geo-steering by 12-18ms on average

Failover Performance

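| Scenario                       | Cloudflare | A10    | Difference    |
|--------------------------------|------------|--------|---------------|
| Single server failure detected | 65s        | 90s    | 28% faster    |
| Traffic rerouted               | <1s        | 8-12s  | ~10x faster   |
| Full datacenter failover       | 75s        | 120s   | 37% faster    |
| User-facing downtime           | <5s        | 30-45s | 83% reduction |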

Critical insight: Cloudflare's distributed health checking meant multiple edge locations detected the failure simultaneously. With A10, only our primary DNS servers noticed, creating a bottleneck.

Load Distribution Accuracy

With Dynamic Steering enabled:

Before (Manual DNS):

  • Kigali: 72% of traffic (overloaded)

  • Frankfurt: 18%

  • Singapore: 10%

With Cloudflare Dynamic Steering:

  • Kigali: 42% (closer to 40% capacity)

  • Frankfurt: 31% (closer to 30% capacity)

  • Singapore: 27% (closer to 30% capacity)

Variance: ±2-3% during normal operations, which indicates the traffic split tracked pool capacity closely under production load.
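The gap between observed and target shares is easy to check from the figures above. A small sketch, where `share_deviation` is a hypothetical helper and the targets are the 40/30/30 capacity split:

```python
def share_deviation(observed_pct: dict, target_pct: dict) -> dict:
    """Percentage-point difference between observed and target traffic share."""
    return {name: observed_pct[name] - target_pct[name] for name in target_pct}


observed = {"Kigali": 42, "Frankfurt": 31, "Singapore": 27}
target = {"Kigali": 40, "Frankfurt": 30, "Singapore": 30}

deviation = share_deviation(observed, target)
print(deviation)                                 # {'Kigali': 2, 'Frankfurt': 1, 'Singapore': -3}
print(max(abs(d) for d in deviation.values()))   # 3
```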

Real-World Findings & Gotchas

1. Health Check Paths Matter

Mistake I made: Initially used / as health check path.

Problem: Nginx returned 200 even when application backends were down (cached responses).

Solution: Created dedicated /health endpoint that:

  • Checks database connectivity

  • Validates backend service status

  • Returns JSON with detailed health data

Lesson: Always use application-aware health checks, not just "is the web server responding?"
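As an illustration of what such an endpoint might compute, the sketch below runs a set of dependency checks and returns a status code plus a JSON body. The function name and the check names are hypothetical, and the real checks (database ping, backend probe) are stubbed with lambdas. Note that a body carrying extra detail only matches a monitor's expected_body if that setting is a substring of what is emitted here, for example `"status":"healthy"`.

```python
import json
from typing import Callable, Dict, Tuple


def build_health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run each dependency check and build the (status code, JSON body) pair."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            results[name] = False
    healthy = all(results.values())
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    # Compact separators keep the output free of spaces after ':' and ','.
    return (200 if healthy else 503), json.dumps(body, separators=(",", ":"))


# Stubbed checks; real ones would ping the database and backend services.
code, body = build_health_response({"database": lambda: True, "backend": lambda: False})
print(code, body)
```

Returning 503 when any dependency fails means a monitor that only inspects the status code still takes the origin out of rotation.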

2. TTL vs Failover Speed

Challenge: DNS caching delays failover.

Cloudflare advantage: When using proxied (orange cloud) mode, DNS TTL doesn't matter—Cloudflare handles backend failover without DNS changes.

For DNS-only mode: Set TTL to 30-60 seconds (Cloudflare minimum: 30s). Lower TTL = faster failover but more DNS queries.

Our choice: Proxied mode for critical services, DNS-only for internal tools.
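For DNS-only mode, a rough upper bound on how long clients can keep reaching a failed origin is the detection time plus the record TTL. The sketch below assumes that simple model (it ignores resolvers that disregard TTLs), and the function name is hypothetical.

```python
def worst_case_stale_seconds(detection_s: float, dns_ttl_s: float) -> float:
    """Upper bound on how long a client can keep hitting a dead origin."""
    # A resolver may cache the old answer right before it is withdrawn,
    # so the full TTL is added on top of the detection time.
    return detection_s + dns_ttl_s


# 65s is the single-server detection time measured in the failover tests.
for ttl in (30, 60, 300):
    print(f"TTL {ttl}s -> up to {worst_case_stale_seconds(65, ttl)}s of stale answers")
```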

3. Session Affinity Complexity

Requirement: Users need to stick to same datacenter for session persistence.

Solution: Enabled cookie-based session affinity with 24-hour TTL.

Gotcha: Failover breaks sessions. Users get new cookies after datacenter failure.

Mitigation:

  • Implemented session replication between datacenters

  • Added graceful endpoint draining (30 min TTL) for maintenance

4. Cost Comparison

A10 Hardware GSLB (3 years):

  • Hardware: $150K (3x $50K appliances)

  • Annual support: $45K/year

  • Total 3-year TCO: $285K

Cloudflare Load Balancing:

  • Load Balancing subscription: $50/month base

  • Per-endpoint fee: $5/endpoint/month (10 endpoints = $50/month)

  • Health checks: Included

  • Total 3-year TCO: $3,600 ($100/month × 36 months)

Savings: more than $275K over 3 years (a reduction of over 96%)

Plus: No hardware refresh cycles, no datacenter rack space, no power/cooling costs.

Production Deployment: What We Did

After the PoC, we deployed Cloudflare for our public-facing government services portal:

Phase 1: DNS Migration

  • Moved NS records to Cloudflare

  • Configured load balancer for services.gov.rw

  • Set up failover pools

  • Result: DNS query time dropped from 45ms to 12ms

Phase 2: Health Monitoring

  • Created endpoints for all public services

  • Configured HTTPS health checks every 60 seconds

  • Set up email/Slack alerts for unhealthy origins

  • Result: Detected and resolved 3 issues before users noticed

Phase 3: Dynamic Steering

  • Enabled latency-based steering

  • Monitored for 2 weeks

  • Validated traffic distribution matched capacity

  • Result: Load balanced within 3% of target distribution

Phase 4: Failover Testing

  • Scheduled maintenance window

  • Performed controlled datacenter failover

  • Monitored user experience

  • Result: Zero user-reported issues during planned outage

Key Learnings

What Worked Well

  1. Anycast is a game-changer: Having DNS responses come from 330+ locations vs 3 = massive latency win

  2. Unified platform: Combining DNS, load balancing, DDoS protection, and CDN in one platform simplified operations

  3. Health check frequency: 60-second checks vs 5-minute checks (our A10 config) caught issues 5x faster

  4. No hardware maintenance: Eliminating hardware refresh cycles freed up our team

What I'd Do Differently

  1. Start with DNS-only mode: We jumped to proxied mode immediately. DNS-only would've been a safer first step.

  2. More granular pools: Rather than "africa-pool", I'd create per-datacenter pools for finer control

  3. Custom alerting earlier: Waited too long to set up PagerDuty integration for health check failures

  4. Load testing: Should've done higher-load tests before production cutover

Advanced Configurations Worth Exploring

1. Least Outstanding Requests (LORS) Steering

For endpoints with varying request processing times:

steering_policy: least_outstanding_requests
# Routes to the endpoint with the fewest outstanding (in-flight) requests
# Useful for: Background job processors, video encoding, ML inference

When to use: When some requests take 10x longer than others (video processing, report generation).

2. Cloudflare Tunnel Integration

Connect private endpoints without exposing them to the internet:

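# Run on the origin server after installing cloudflared
# Authenticate cloudflared against your Cloudflare account
cloudflared tunnel login

# Create a named tunnel
cloudflared tunnel create gov-rw-backend

# Point a DNS hostname at the tunnel
cloudflared tunnel route dns gov-rw-backend backend.internal.gov.rw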

Benefit: Zero inbound firewall rules. All connectivity outbound from origin.

3. Session Affinity with Header-Based Routing

For API services needing consistent routing:

session_affinity: header
affinity_ttl: 3600
header: X-User-ID
# Routes same user to same origin for 1 hour

4. Custom Rules for Advanced Steering

Route specific traffic types to specific pools:

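// Pseudocode to show intent only; this is not literal Cloudflare rule syntax.
// In practice each rule pairs a match expression with a pool override.

// POST requests to dedicated write pool
if (http.request.method == "POST") {
  cf.load_balancing.pool = "write-pool";
}

// Large file uploads to high-bandwidth pool
if (http.request.uri.path.startsWith("/upload")) {
  cf.load_balancing.pool = "upload-pool";
}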

Monitoring & Observability

Metrics We Track

Load Balancer Health:

  • Requests per second per pool

  • Error rate by origin

  • Average response time by geographic region

  • Failover events (count & duration)

Health Check Status:

  • Probe success rate

  • Time to detect failures

  • Time to recover

Business Impact:

  • Service availability (by region)

  • User-facing latency (P50, P95, P99)

  • Geographic distribution of users

Dashboards & Alerts

Grafana Dashboard:

  • Real-time health check status

  • Request distribution across pools

  • Latency heatmap by region

  • Historical failover events

PagerDuty Alerts:

  • Critical: Entire pool unhealthy

  • Warning: <50% of endpoints healthy

  • Info: Single endpoint failure
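The three alert tiers map onto a simple classification of each pool's healthy-endpoint count. A minimal sketch, assuming "Info" covers any partial failure that leaves at least half the endpoints healthy; the function name is hypothetical and this is not the PagerDuty integration itself.

```python
from typing import Optional


def alert_severity(healthy: int, total: int) -> Optional[str]:
    """Map a pool's healthy-endpoint count to an alert tier (None = no alert)."""
    if total <= 0 or not 0 <= healthy <= total:
        raise ValueError("expected 0 <= healthy <= total and total > 0")
    if healthy == 0:
        return "critical"   # entire pool unhealthy
    if healthy / total < 0.5:
        return "warning"    # fewer than half the endpoints healthy
    if healthy < total:
        return "info"       # some endpoints down, at least half still healthy
    return None


for healthy in range(5):
    print(healthy, alert_severity(healthy, 4))
```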

Recommendations for Your GSLB Implementation

Start Simple

  1. Begin with 2 datacenters: Don't optimize for 10 locations on day 1

  2. DNS-only mode first: Get comfortable before proxying traffic

  3. Simple health checks: ICMP or TCP before HTTP/HTTPS

  4. Geographic steering: Easier to reason about than dynamic

Graduate to Advanced

  1. Add dynamic steering: Once you trust health checks

  2. Enable session affinity: When you understand traffic patterns

  3. Custom rules: For specific use cases

  4. LORS steering: When you have heterogeneous workloads

Production Readiness Checklist

  • [ ] Health checks from multiple regions configured

  • [ ] Alerting set up for unhealthy endpoints

  • [ ] Tested manual failover procedures

  • [ ] Session affinity configured if needed

  • [ ] Documented runbooks for common scenarios

  • [ ] Load tested each pool individually

  • [ ] Validated geographic traffic distribution

  • [ ] Established baseline metrics

Conclusion

GSLB isn't just "DNS with extras"—it's a fundamental shift in how we think about availability and performance for distributed systems. The combination of:

  • Anycast networking (sub-50ms DNS for 95% of users)

  • Distributed health checking (failures detected in about a minute, traffic rerouted in under a second)

  • Intelligent traffic steering (route based on real-time latency, not static rules)

...makes modern GSLB platforms like Cloudflare essential infrastructure for any global service.

For our e-government platform, the results speak for themselves:

  • 83% reduction in user-facing downtime during failures

  • Roughly 30% improvement in TTFB for users outside Kigali (27-35% in our tests)

  • 96% cost reduction vs hardware load balancers

If you're managing services that need to be fast and available across multiple regions, GSLB isn't optional—it's essential. And with Cloudflare's architecture, it's more accessible than ever.

What Cloudflare Could Improve: An SRE's Perspective

After deploying Cloudflare GSLB in production and studying their architecture deeply, I've identified several areas where Cloudflare could enhance their load balancing platform. These aren't criticisms—they're observations from someone who believes in the product and wants to see it get even better.

1. Health Check Granularity & Customization

Current Limitation

Health monitors probe at a fixed interval, with a 60-second minimum on our plan. For highly dynamic workloads or during incident response, this can feel slow.

The Challenge

When an origin starts degrading gradually (CPU creeping up, response times increasing), 60 seconds feels like an eternity. By the time the health check fails, you might have already served hundreds of slow requests to users.

Improvement Opportunity

Adaptive health check intervals based on historical stability:

  • Stable origins (>99.9% uptime for 30 days): Check every 60s

  • Recently recovered origins: Check every 15s for first hour

  • Degraded-but-not-failed origins: Check every 5-10s
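The proposal above boils down to a small decision function. This is a sketch of the suggested behaviour, not an existing Cloudflare feature; the function name, the parameters, and the choice of 5s from the 5-10s range are assumptions.

```python
from typing import Optional


def probe_interval_s(degraded: bool, hours_since_recovery: Optional[float]) -> int:
    """Choose a health check interval from the origin's recent history."""
    if degraded:
        return 5    # degraded but not failed: probe aggressively
    if hours_since_recovery is not None and hours_since_recovery < 1:
        return 15   # recently recovered: watch closely for the first hour
    return 60       # stable origins keep the default interval


print(probe_interval_s(degraded=False, hours_since_recovery=None))  # 60
print(probe_interval_s(degraded=False, hours_since_recovery=0.25))  # 15
print(probe_interval_s(degraded=True, hours_since_recovery=None))   # 5
```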

More sophisticated health checks:

  • Response time percentile thresholds (P95, P99) not just success/fail

  • Resource utilization probes (CPU, memory, connection count) via custom endpoint

  • Synthetic transaction testing (e.g., "can you complete a login flow?")

Real-World Impact

During our testing, an origin's database connection pool was exhausted. Health checks passed (HTTP 200) but actual user requests were timing out. An application-aware health check would've caught this.

Suggested Implementation:

health_monitor:
  type: advanced
  checks:
    - type: http_response
      expected_code: 200
      weight: 0.3
    - type: response_time
      p95_threshold: 500ms
      weight: 0.4
    - type: application_health
      endpoint: /health/deep
      expected_metrics:
        db_connections_available: ">5"
        cache_hit_rate: ">0.7"
      weight: 0.3
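One way to read the weighted checks in this suggested config is as a single score: sum the weights of the checks that passed and compare against a threshold. The sketch below uses the weights from the config; the 0.7 threshold and the function names are assumptions made for illustration.

```python
def health_score(results: dict, weights: dict) -> float:
    """Sum the weights of the checks that passed."""
    return sum(weights[name] for name, passed in results.items() if passed)


def is_healthy(results: dict, weights: dict, threshold: float = 0.7) -> bool:
    """Treat the origin as healthy when the weighted score reaches the threshold."""
    return health_score(results, weights) >= threshold


WEIGHTS = {"http_response": 0.3, "response_time": 0.4, "application_health": 0.3}

# HTTP 200 and the deep check pass, but P95 latency is over its threshold.
slow_origin = {"http_response": True, "response_time": False, "application_health": True}
print(is_healthy(slow_origin, WEIGHTS))  # False
```

With these weights an origin that answers 200 but misses its latency target scores below the threshold, which is the failure mode a plain success/fail probe cannot express.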

2. Session Affinity Reliability

Current Limitation

Session affinity occasionally breaks, as evidenced by community reports of users "jumping between hosts" despite proper cookie configuration.

The Challenge

From Cloudflare Community (Nov 2024):

"Requests from outside of Cloudflare to the Load Balancer Hostname are not sticky! They are freely jumping between the hosts."

This is a critical issue for stateful applications. When session affinity fails:

  • Users lose shopping carts

  • Authentication sessions break

  • Form data disappears

  • Multi-step workflows fail

Improvement Opportunity

Session affinity should be bulletproof with:

  1. Multiple fallback methods:

     Primary: Cookie (_cflb)
     Fallback 1: X-Session-ID header
     Fallback 2: Source IP hash
    
  2. Session affinity health monitoring:

    • Track affinity "break rate" (% of requests that switch origins unexpectedly)

    • Alert when >0.1% of sessions break

    • Provide dashboard showing affinity violations

  3. Graceful degradation:

    • If cookie lost, attempt session recovery via IP hash

    • Log affinity breaks for debugging

    • Option to return 503 instead of routing to wrong origin (for critical apps)
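The fallback chain could look roughly like the sketch below: honour the cookie if it names a live origin, otherwise hash the session header, otherwise hash the client IP. Everything here is hypothetical, including the idea that the cookie decodes directly to an origin name; it illustrates the proposal rather than how `_cflb` works today.

```python
import hashlib
from typing import List, Optional


def pick_origin(origins: List[str], cookie_origin: Optional[str],
                session_id: Optional[str], client_ip: str) -> str:
    """Resolve the origin via cookie, then session header, then source IP hash."""
    if not origins:
        raise ValueError("no origins available")
    # Primary: the affinity cookie, if it still points at a live origin.
    if cookie_origin in origins:
        return cookie_origin
    # Fallbacks: hash a stable key so repeat requests land on the same origin.
    key = session_id if session_id else client_ip
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(origins)
    return origins[index]


origins = ["kigali-01", "frankfurt-01", "singapore-01"]
print(pick_origin(origins, "frankfurt-01", None, "203.0.113.7"))  # frankfurt-01
print(pick_origin(origins, None, None, "203.0.113.7"))            # stable per IP
```

Plain modulo hashing remaps most keys when the origin list changes, so a production version would more likely use consistent hashing.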

What We Did as Workaround

Implemented our own session replication between datacenters. Expensive and complex, but necessary given unpredictable affinity breaks.

3. Origin Limit Model Is Confusing

Current Limitation

Cloudflare counts each origin separately, even if it's the same IP in multiple pools.

The Example

You have 2 physical servers:

  • server-a.example.com (192.0.2.1)

  • server-b.example.com (192.0.2.2)

You want 3 pools for different services:

  • Pool 1 (web): server-a + server-b

  • Pool 2 (api): server-a + server-b

  • Pool 3 (admin): server-a + server-b

Expected origin count: 2 (you have 2 servers)

Actual origin count charged: 6 (2 servers × 3 pools)

With a 2-origin plan, you can only create 1 pool with 2 origins. This makes no sense operationally.
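The difference between the two counting models is easy to see in code. A small sketch using the example pools above:

```python
pools = {
    "web": ["server-a.example.com", "server-b.example.com"],
    "api": ["server-a.example.com", "server-b.example.com"],
    "admin": ["server-a.example.com", "server-b.example.com"],
}

# Counting unique endpoints: what an operator thinks of as "origins".
unique_origins = {origin for members in pools.values() for origin in members}

# Counting every pool membership: the per-pool model described above.
per_pool_count = sum(len(members) for members in pools.values())

print(len(unique_origins), per_pool_count)  # 2 6
```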

Improvement Opportunity

Origin should be the unique endpoint, not endpoint-per-pool:

# Global origin registry
origins:
  - id: origin-a
    address: server-a.example.com
    datacenter: kigali
  - id: origin-b
    address: server-b.example.com
    datacenter: frankfurt

# Pools reference origins
pools:
  web-pool:
    origins: [origin-a, origin-b]
  api-pool:
    origins: [origin-a, origin-b]
  admin-pool:
    origins: [origin-a, origin-b]

# Billing: 2 unique origins, not 6

Alternative pricing model:

  • Charge per unique IP/hostname

  • Unlimited pool membership

  • This aligns with operational reality

Why This Matters

Most customers have a handful of physical datacenters but many logical services. Current model penalizes proper separation of concerns.

4. DNS-Only Mode Steering Limitations

Current Limitation

DNS-only load balancers lose advanced steering capabilities:

  • LORS (Least Outstanding Requests) reverts to random

  • No session affinity except ip_cookie

  • Can't integrate with WAF, caching, or other L7 features

The Challenge

Some customers need DNS-only mode (non-HTTP protocols, compliance, etc.) but still want intelligent steering.

Improvement Opportunity

Steering methods that work without proxying:

  1. Extended DNS Response (EDN)

    • Return multiple IPs with priority hints in DNS response

    • Client library interprets hints for smarter selection

    • Backward compatible (falls back to standard A record)

  2. External Health API

    • Expose real-time health data via API

    • Customers can query: "Which origin is healthiest for region X?"

    • Build custom client-side logic using Cloudflare's health data

  3. Conditional DNS Responses

     dns_only_pool:
       steering: conditional
       rules:
         - if: "client_asn == AS36924"  # Specific ISP
           return: origin-a
         - if: "query_time == weekday_business_hours"
           return: [origin-a, origin-b]  # Load balance
         - default: origin-c  # Fallback
    

Real-World Use Case

We wanted DNS-only mode for our internal services (non-HTTP protocols) but needed geo-steering. Had to implement our own GeoDNS solution. Should've been a Cloudflare feature.

5. Configuration Change Velocity & Blast Radius

Current Limitation

Configuration changes (new pools, steering updates, health monitor tweaks) apply globally and immediately.

The Challenge

This has caused production incidents:

  • November 2024: Configuration error caused cascading failures across Cloudflare edge

  • August 2025: Traffic surge + routing change → AWS us-east-1 congestion

The problem: No staging/canary deployments for config changes at customer level.

Improvement Opportunity

Configuration change rollout controls:

  1. Canary deployments:

     config_change:
       new_pool: americas-pool-v2
       rollout_strategy:
         phase1: 1% of traffic for 5 minutes
         phase2: 10% for 15 minutes
         phase3: 100% if error_rate < baseline
       rollback_trigger:
         error_rate_increase: >20%
         latency_p95_increase: >100ms
    
  2. Test mode:

    • Apply config changes to a small subset of traffic

    • Monitor metrics before full rollout

    • "What-if" analysis: "How would this steering change affect my traffic?"

  3. Change approval workflow:

    • Critical config changes require confirmation

    • Show predicted impact before applying

    • Scheduled maintenance windows for risky changes
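The rollback trigger in the canary proposal can be expressed as a small predicate. This sketch reads the 20% error threshold as a relative increase over baseline, which is an assumption, and the function name and parameters are hypothetical.

```python
def should_rollback(baseline_error_rate: float, canary_error_rate: float,
                    baseline_p95_ms: float, canary_p95_ms: float,
                    max_error_increase: float = 0.20,
                    max_p95_increase_ms: float = 100.0) -> bool:
    """Return True when the canary breaches either rollback threshold."""
    # Relative threshold: with a zero baseline, any canary error triggers rollback.
    error_limit = baseline_error_rate * (1 + max_error_increase)
    if canary_error_rate > error_limit:
        return True
    return (canary_p95_ms - baseline_p95_ms) > max_p95_increase_ms


# Error rate rose from 1.0% to 1.25%, a 25% relative increase.
print(should_rollback(0.010, 0.0125, 450, 470))  # True
# Errors within tolerance and P95 only 50ms slower.
print(should_rollback(0.010, 0.011, 450, 500))   # False
```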

Why This Matters

As an SRE, I want confidence that my config change won't break production. Cloudflare's current "apply now, hope for best" model is nerve-wracking for large-scale deployments.

6. Cost Transparency for Health Checks

Current Limitation

Selecting "All Data Centers" for health monitoring can dramatically increase traffic to origins, but it's not clear how much traffic.

The Challenge

From Cloudflare docs:

"Using All Data Centers sends individual health monitor requests from all existing Cloudflare data centers (and that number of data centers is growing all the time)."

The math:

  • 330+ datacenters

  • 60-second interval

  • 3 probes per datacenter

= 330 × 3 = ~990 health check requests/minute per origin

For origins with bandwidth/request limits, this is a problem.

Improvement Opportunity

Health check traffic estimator:

health_monitor_preview:
  regions: [Western Europe, Eastern Europe]
  interval: 60s
  expected_traffic:
    requests_per_minute: 180
    bandwidth_per_day: 2.5 MB
    estimated_cost: $0.00  # Cloudflare side
    origin_cost_estimate: $0.15/day  # Based on origin pricing

# Show BEFORE applying config

Smart health check scheduling:

  • Coordinate probes across regions (don't all fire at :00 seconds)

  • Adaptive probe counts (fewer checks if origin consistently healthy)

  • Option to "sample" datacenters (check 10% of DCs, rotate daily)
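Until such an estimator exists, the request volume can be approximated by hand. The sketch below assumes every probing location sends a fixed number of probes per interval; the function name and the example figures are hypothetical, not Cloudflare's published behaviour.

```python
def probes_per_minute(locations: int, probes_per_location: int, interval_s: float) -> float:
    """Estimate health check requests per minute hitting a single origin."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return locations * probes_per_location * 60.0 / interval_s


# Hypothetical setup: 10 probing locations, 2 probes each, every 30 seconds.
rate = probes_per_minute(locations=10, probes_per_location=2, interval_s=30)
print(rate)            # 40.0
print(rate * 60 * 24)  # 57600.0 requests per day
```

Comparing the result against the origin's rate limit before enabling a wider set of probe locations would catch the kind of self-inflicted block described below.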

What Happened to Us

Selected "All Regions" thinking it meant ~10 regions. Turns out Cloudflare probed from 200+ locations. Origin's rate limiter kicked in, blocking health checks. Pool marked critical. Outage.

Would have been avoided with: Clear traffic estimates before applying.

7. Observability Gap: Per-Request Routing Decisions

Current Limitation

Analytics show aggregate traffic patterns, but it's hard to debug why a specific request went to a specific origin.

The Challenge

User reports: "I got routed to Singapore datacenter from London, why?"

Current debugging process:

  1. Check logs (if you have them)

  2. Guess based on steering policy

  3. Hope you can reproduce

Improvement Opportunity

Request-level tracing for load balancer decisions:

Add debug header: CF-LB-Debug: true

Response includes:

CF-LB-Pool-Selected: europe-pool
CF-LB-Origin-Selected: frankfurt-01
CF-LB-Steering-Method: dynamic
CF-LB-Selection-Reason: lowest_rtt
CF-LB-RTT-Data: europe-pool:45ms, apac-pool:180ms
CF-LB-Health-Status: all-healthy
CF-LB-Session-Affinity: none

Real-time decision logs:

  • Stream of routing decisions to external SIEM

  • Filter by user, region, pool, or origin

  • Analyze patterns: "Why is 10% of US traffic going to APAC pool?"

Real-World Need

During our PoC, we wanted to validate that dynamic steering was actually choosing the lowest-latency pool. Had to trust Cloudflare's black box. Request-level visibility would've made this trivial to verify.

8. Load Balancer Analytics Need More Depth

Current Limitation

Analytics are good but lack SRE-critical dimensions:

  • Can't see P50/P95/P99 latency by pool

  • No historical steering decision breakdown

  • Limited error categorization

  • No anomaly detection

Improvement Opportunity

Enhanced analytics dashboard:

  1. Latency heatmaps:

     Response time by pool by hour:
     [Visual heatmap showing which pools were slow when]
    
  2. Steering decision history:

     Why did traffic shift at 14:30 UTC?
     - Dynamic steering detected 50ms RTT increase in europe-pool
     - Automatically shifted 30% of traffic to americas-pool
     - Self-healed after 8 minutes
    
  3. Error attribution:

     521 errors increased 300% at 09:15 UTC
     Origin: frankfurt-02
     Root cause: Health check passed but origin firewall blocked Cloudflare IPs
     Impact: 2,300 requests failed before pool marked critical
    
  4. Predictive alerts:

     Warning: africa-pool showing gradual latency increase
     Current P95: 890ms (baseline: 450ms)
     Predicted: Will exceed 1000ms in 15 minutes
     Suggested action: Add capacity or shed load to backup pool