Kubernetes at the Edge: Orchestrating 847 Clusters Without Losing Your Mind
Running Kubernetes on a single cluster is a solved problem. Running Kubernetes on 847 clusters simultaneously, across 148 countries, with different hardware profiles, network topologies, and regulatory requirements, is a different problem entirely.
This is the operational reality of Vertex’s edge infrastructure. Here is what three years of iterating on it has taught us.
The Core Problem: Configuration Drift
The first thing that breaks at fleet scale is configuration consistency. With one cluster, you update a ConfigMap and it takes effect. With 847 clusters, you update a ConfigMap and you have to answer: did it apply to all of them? Which ones failed? Why?
Configuration drift — the state where different clusters have different configurations — is the root cause of approximately 60% of the incidents we have investigated. The solution is treating cluster configuration with the same rigor as application code: version control, continuous reconciliation, automated drift detection.
We use a hub-and-spoke model with GitOps:
config-repo/
├── base/                      # Applies to all clusters
│   ├── namespaces/
│   ├── rbac/
│   └── network-policies/
└── overlays/
    ├── region-europe/         # Applies to all EU clusters
    │   └── gdpr-compliance/
    ├── region-apac/
    ├── cluster-type-edge/     # Applies to edge PoP clusters
    └── cluster-type-transit/  # Applies to transit clusters
Every cluster runs an ArgoCD instance that watches this repository. Drift from the declared state triggers an automatic reconciliation within 60 seconds. If reconciliation fails — due to a conflicting manual change, for instance — it pages on-call.
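As a concrete illustration, a per-cluster Argo CD Application with automated sync and self-healing might look like the sketch below. The repo URL, project name, and overlay path are hypothetical; `selfHeal` is the setting that reverts manual changes back to the declared state.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-baseline
  namespace: argocd
spec:
  project: fleet-config                          # hypothetical project name
  source:
    repoURL: https://git.example.com/vertex/config-repo  # illustrative URL
    targetRevision: main
    path: overlays/cluster-type-edge             # overlay for this cluster's type
  destination:
    server: https://kubernetes.default.svc       # the local cluster
    namespace: default
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual drift back to the declared state
    retry:
      limit: 3         # after repeated failures, surface the error instead of looping
```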
Upgrade Strategy: The Slow Roll
Kubernetes upgrades across a fleet of 847 clusters require a completely different strategy than single-cluster upgrades. The naive approach — upgrade all clusters in sequence — means a bug in the new Kubernetes version can cascade to your entire fleet before you detect it.
We use a cohort-based slow roll:
Cohort 0 (1 cluster): An isolated “canary” cluster with no production traffic. Upgrade and run comprehensive integration tests.
Cohort 1 (10 clusters): Low-traffic clusters in non-critical regions. Upgrade and observe for 48 hours.
Cohort 2 (100 clusters): Medium-traffic clusters. Upgrade and observe for 72 hours.
Cohort 3 (736 clusters): Full fleet rollout, executed in parallel batches of 50.
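The cohorts above can be encoded as data rather than a runbook. A hypothetical rollout plan, in an illustrative schema rather than any real tool's format:

```yaml
# Hypothetical fleet-rollout plan; the schema is illustrative.
targetVersion: "<new Kubernetes version>"
cohorts:
- name: canary
  clusters: 1
  gate: integration-tests      # advance only after tests pass
- name: low-traffic
  clusters: 10
  soak: 48h                    # observe before advancing
- name: medium-traffic
  clusters: 100
  soak: 72h
- name: full-fleet
  clusters: 736
  batchSize: 50                # 736 clusters -> 15 parallel batches
```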
The total time from “start upgrade” to “entire fleet upgraded” is approximately 10 days for minor version bumps, 21 days for major versions. This feels slow. It is intentional. A fleet-wide incident caused by a bad upgrade costs more than the 10 days of rolling upgrade time.
Multi-Cluster Service Mesh
With 847 clusters, cross-cluster service communication becomes a first-class problem. A request might originate at a Paris PoP and need to hit a service that only runs at a regional transit cluster in Frankfurt.
We run Istio with multi-cluster mesh federation. Each cluster is a mesh participant, and service endpoints are automatically federated across cluster boundaries.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: remote-payment-service
spec:
  hosts:
  - payment-service.vertex-internal.svc.cluster.global
  location: MESH_INTERNAL      # part of the mesh, so mTLS applies across clusters
  ports:
  - name: grpc
    number: 50051
    protocol: GRPC
  resolution: DNS
  endpoints:
  - address: payment-service.fra01.vertex-internal
    ports:
      grpc: 50051
    locality: eu-central-1/fra01
    labels:
      cluster: fra01
Cross-cluster mTLS is automatic. Locality-aware load balancing preferentially routes to endpoints in the same region, falling back to adjacent regions only when local endpoints are unavailable or degraded.
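The locality preference is declared per service through a DestinationRule. A sketch of what this might look like for the payment service above (the failover target region and the outlier-detection thresholds are illustrative; Istio requires outlier detection to be configured for locality failover to engage):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service-locality
spec:
  host: payment-service.vertex-internal.svc.cluster.global
  trafficPolicy:
    outlierDetection:             # required for locality failover to take effect
      consecutive5xxErrors: 5     # eject an endpoint after 5 consecutive 5xx errors
      interval: 30s
      baseEjectionTime: 60s
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
        - from: eu-central-1      # prefer endpoints in the local region...
          to: eu-west-1           # ...fail over to an adjacent region when degraded
```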
Failure Isolation: The Blast Radius Principle
At fleet scale, you cannot prevent all failures. You can design your system so that any individual failure affects the smallest possible blast radius.
Our isolation model:
PoP-level isolation: A failure in a single PoP’s Kubernetes control plane never affects another PoP. Control planes are independent. This is a deliberate tradeoff: more operational overhead, dramatically smaller blast radius.
Namespace-level isolation: Tenant workloads run in separate namespaces with strict NetworkPolicy enforcement. A misbehaving tenant cannot reach other tenants’ workloads over the network, and the quotas below keep it from consuming their CPU time.
Resource quotas: Every namespace has hard limits on CPU, memory, and pod count. A runaway deployment cannot consume cluster resources — it hits its quota and fails loudly, rather than degrading service for all tenants silently.
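For a hypothetical tenant namespace `tenant-a`, the two mechanisms look like this (the quota numbers are illustrative):

```yaml
# Default-deny ingress: only traffic from pods in the same namespace is allowed.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: tenant-a             # hypothetical tenant namespace
spec:
  podSelector: {}                 # applies to every pod in the namespace
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}             # permit same-namespace traffic only
---
# Hard caps so a runaway deployment hits its quota and fails loudly.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-limits
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "16"            # illustrative limits
    requests.memory: 64Gi
    limits.cpu: "32"
    limits.memory: 128Gi
    pods: "200"
```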
The Operational Reality
Running Kubernetes at this scale requires accepting several uncomfortable truths:
Things will break. 847 clusters means statistical near-certainty of continuous low-level failures. Design for recovery, not prevention.
Automation is not optional. A human cannot operate 847 clusters. Every toil-generating operation must be automated, or it will not be performed consistently at scale.
Observability is the product. Without comprehensive metrics, logs, and traces across the fleet, you are operating blind. Observability is not a DevOps concern — it is the foundational capability that makes everything else possible.
The reward for getting this right is a globally distributed, self-healing infrastructure platform that simply continues to operate while you sleep. That is the goal. It is achievable. It requires patience, rigor, and a willingness to rebuild assumptions from first principles.