What Does Scale Mean for Cloud Network Infrastructure?

September 2, 2020 | Rochan Sankar

In data center infrastructure, “at scale” is commonly invoked as a critical design objective. Bigger is better – simple enough, right? But what does it really mean to build infrastructure that delivers scalability for the growth of modern applications?

Think of the inner workings of a popular consumer or business app with millions of daily active users. Behind the scenes in the infrastructure, massive databases and graphs are constructed, distributed caches are created, continuous AI and analytics algorithms are run, and content gets served and exchanged at global scale to and from end users. The underlying machinery of the application works across regional availability zones, multi-cloud and hybrid-cloud topologies, edge data centers, POPs, and now 5G access networks. End-to-end network infrastructure binds everything together, letting the workloads execute on a seemingly singular “cloud-scale computer system” sitting behind your thin-client phone or tablet.

It’s not hard to see why designing for infrastructure scalability matters before an application or service has to contend with exponential growth in users, data, compute capacity, storage requirements, or network traffic. In the context of network infrastructure, we can propose a definition of scalability as follows:

The ability of the underlying network architecture and its technology building blocks to support an arbitrarily large number of users/clients/endpoints over a proportionally large geographical radius, with no loss in application performance or resiliency, and at roughly linear operating cost.
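Put a bit more formally (a rough sketch; the symbols below are introduced here purely for illustration, not from any standard):

```latex
\text{scalable} \;\iff\; \forall n:\quad
\mathrm{Perf}(n) \ge P_{\min}, \qquad
\mathrm{Resil}(n) \ge R_{\min}, \qquad
\mathrm{Cost}(n) = O(n)
```

where n is the number of users/clients/endpoints served: performance and resiliency must hold at any n, while operating cost may grow at most linearly in n.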

Architecture choices matter when designing for scalability. The network control plane and data plane should be designed to support the capacity, performance, and observability required of a fully scaled-out network. We can examine how certain architecture choices affect each of these dimensions of “scale”.

Capacity at Scale. As distributed workloads grow, more infrastructure resources must be provisioned, orchestrated, and managed in concert. That responsibility falls substantially to the network control plane. When workloads are native to small clusters of servers and storage, a switching-centric control plane is typically sufficient to build the required connectivity. But as applications spread beyond a single cluster – into multiple clusters, availability zones, and regions, adaptively between core and edge locations, and across on-premises and multi-cloud installations – a routing-centric approach to control plane design becomes essential. Routing-centric isn’t just about RFC feature support at Layer 3; it encompasses how the control plane stack is built from top to bottom. A route-first approach is essential to create extensible capacity, so that the network can keep adding nodes, VMs, containers, paths, overlays, connections, flows, and policies without breaking. It is equally essential that the same control plane architecture be reusable across an expanding end-to-end network topology – leaf-spine arrays, DCI layers, border gateways, POPs/CDNs, edge routers, and other network roles – as the application delivery machine expands to that scale. To do so, the control plane stack must have routing “protocol chops”: supporting highly scalable networking over EVPN overlays, native IPv6 underlays, and emerging protocols such as SRv6, which is essential for 5G.
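To see how quickly “that scale” arrives, here is a toy sizing sketch for the leaf-spine fabrics mentioned above (the radix and oversubscription figures are assumptions chosen for illustration):

```python
def leaf_spine_capacity(radix: int, oversubscription: float = 1.0) -> dict:
    """Toy sizing for a two-tier leaf-spine (Clos) fabric.

    radix: ports per switch (assumed the same for leaf and spine).
    oversubscription: downlink-to-uplink ratio at the leaf (1.0 = non-blocking).
    """
    uplinks = int(radix / (1 + oversubscription))   # leaf ports toward spines
    downlinks = radix - uplinks                     # leaf ports toward servers
    spines = uplinks                                # one uplink per spine
    max_leaves = radix                              # each spine port feeds one leaf
    return {
        "spines": spines,
        "leaves": max_leaves,
        "server_ports": max_leaves * downlinks,
    }

# A 64-port switch yields room for 2,048 non-blocking server ports; doubling
# the radix to 128 quadruples that to 8,192 -- capacity scales with radix^2.
print(leaf_spine_capacity(64))   # {'spines': 32, 'leaves': 64, 'server_ports': 2048}
print(leaf_spine_capacity(128))  # {'spines': 64, 'leaves': 128, 'server_ports': 8192}
```

Every one of those server ports brings hosts, VMs, containers, and overlay endpoints whose reachability the control plane must carry as routes – which is why a route-first design has to scale ahead of the fabric itself.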

Performance at Scale. As the application or service grows, the distributed system inside the “cloud-scale computer” expands – and performance bottlenecks shift primarily toward the I/O within the system. Take, for instance, how personalized ads are generated via recommendation engines: a deluge of user- and machine-generated data is collected, mapped to big data graphs and reduced, run through multiple analytics frameworks, fed into distributed deep learning and inference algorithms, and post-processed before ad serving decisions are executed. These operations can be carried out over tens of thousands of infrastructure nodes (compute, storage, and network), both physical and virtual. Because of parallelization, very little of the performance-critical path – measured in job completion time – is compute-bound. Instead, I/O bottlenecks through system software stacks, memory, storage, and network elements dominate performance.
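A toy simulation makes the point (the latency distribution below is synthetic, chosen only to illustrate the effect, not measured data): once work fans out across many nodes, the job finishes only when the slowest task does, so rare I/O stalls come to dominate job completion time.

```python
import random

def job_completion_time(num_tasks: int, trials: int = 1000) -> float:
    """Median completion time of a job fanned out over num_tasks parallel workers.

    Each task usually finishes in ~10 ms, but 1% of the time an I/O stall
    (slow disk, queued packet, GC pause) stretches it to ~1 s. The job is
    done only when the slowest task is done, so tails compound with fan-out.
    """
    samples = []
    for _ in range(trials):
        tasks = [1.0 if random.random() < 0.01 else 0.010
                 for _ in range(num_tasks)]
        samples.append(max(tasks))          # JCT = slowest parallel task
    samples.sort()
    return samples[trials // 2]             # median JCT

for n in (1, 10, 100, 1000):
    print(f"{n:>5} tasks -> median JCT ~ {job_completion_time(n)*1000:.0f} ms")
# With 1 task the median is ~10 ms; by 100 tasks the 1% stall is almost
# certain to appear somewhere, so the median JCT jumps to ~1000 ms.
```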

For its part, the network data plane can deliver performance at scale by increasing bandwidth and by using topologies and techniques that improve resiliency while minimizing packet loss and queuing delays. For the network control plane, it is the rate at which network state can be updated across the vastly distributed “cloud-scale computer system” that determines performance at scale – in steady state as well as during highly dynamic periods. This “update latency” can make or break the performance of the application, for instance when different application components (e.g., microservices) that must talk to each other are spun up in different clusters or regions. Separating the control plane from the routing data planes and centralizing it can streamline the update process by reducing protocol chatter and unifying management operations – but only if it is built on a resilient, scalable performance architecture. That depends on how well the control plane stack is designed to work across CPU cores and threads, including where it runs (kernel vs. userspace), how databases are updated in system software and on the routing devices, and how control plane processes are organized and built for resiliency. Control plane architecture can be the difference that enables converging and distributing millions of routes across a network in mere seconds – which can be the determining factor for performance at scale.
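As a minimal sketch of why that multi-threading matters (the node count and the per-node programming cost here are invented for illustration):

```python
import concurrent.futures
import time

def push_routes(device: str) -> str:
    """Hypothetical stand-in for programming one node's routing tables; a real
    control plane would stream batched updates over gRPC, netlink, etc."""
    time.sleep(0.010)            # assume ~10 ms to program one node
    return device

def converge(devices: list, workers: int) -> float:
    """Fan the route updates out to every device using `workers` threads."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(push_routes, devices))
    return time.perf_counter() - start

devices = [f"node-{i}" for i in range(128)]
for workers in (1, 16, 64):
    print(f"{workers:>2} threads -> {converge(devices, workers):.2f}s to converge")
# Single-threaded: ~1.3 s (128 x 10 ms in series). With 64 threads the same
# update storm converges in ~0.02 s, bounded by the slowest node instead of
# the sum of all nodes -- the win a multi-threaded control plane is after.
```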

Observability at Scale. One practical guarantee of network infrastructure at scale is that something will go wrong, all the time. This drives the need for high availability of network components and fast convergence of network state across all of those components. The burden here falls again on the network control plane, particularly the operating system, since that is where network state is stored and where the slowest performance path exists. Furthermore, as the distributed “cloud-scale computer system” grows, it becomes increasingly difficult to maintain and troubleshoot for high availability if it is not well instrumented within the control plane. Gone are the days when SNMP could be considered sufficient. A control and management plane built for scale should implement observability at every potential failure point within its domain, down to the individual control plane process. When that domain extends across physical data center boundaries, across edge/access networks and WANs, and between on-prem and cloud installations, it is even more powerful if the same real-time telemetry and analytics frameworks are uniformly available on every network node type. Performance matters critically here: a control plane architecture built for scale can provide operational insights with extremely fast response times and feed the automation engine to maximize network uptime.
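A sketch of what that looks like from the consuming side (the stream, process and metric names, and threshold below are all hypothetical; a real deployment would subscribe via something like gNMI/OpenConfig streaming telemetry rather than a random generator):

```python
import itertools
import random
from dataclasses import dataclass

@dataclass
class Sample:
    node: str       # any role: leaf, spine, border gateway, edge router...
    process: str    # down to the individual control plane process
    metric: str
    value: float

def subscribe(nodes):
    """Hypothetical telemetry stream: one sample per yield, forever."""
    while True:
        yield Sample(random.choice(nodes), "rib-manager",
                     "update_queue_depth", random.expovariate(1 / 50))

THRESHOLD = 200  # assumed alerting threshold, purely for illustration

nodes = [f"leaf-{i}" for i in range(8)] + ["border-gw-1", "edge-rtr-1"]
for sample in itertools.islice(subscribe(nodes), 500):
    if sample.value > THRESHOLD:
        # Feed the automation engine rather than paging a human first.
        print(f"{sample.node}/{sample.process}: "
              f"{sample.metric}={sample.value:.0f} -> trigger remediation")
```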

Designing for scale in network infrastructure starts with architecture first principles. In the data plane, those are principles like high-bandwidth Ethernet, leaf-spine Clos fabrics, and IP-ECMP multipathing. In the control plane, the impact of architecture choices on scalability hasn’t been as obvious – because it isn’t often talked about. A routing-centric, microservices-based control plane design that optimizes for capacity, multi-threaded performance, and observability across all network shapes and sizes should emerge as the architecture of choice to enable true scalability for modern cloud applications.
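As a closing illustration of one of those data plane first principles, here is roughly what IP-ECMP multipathing reduces to (a sketch using a generic CRC32 hash; real switch ASICs use their own hardware hash functions):

```python
import zlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    """Pick one of several equal-cost paths from the flow's 5-tuple.

    Hashing (rather than round-robin) keeps every packet of a flow on one
    path, preserving ordering while spreading distinct flows across the fabric.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]

spines = ["spine-1", "spine-2", "spine-3", "spine-4"]
print(ecmp_next_hop("10.0.1.7", "10.0.9.3", 49152, 443, "tcp", spines))
# The same 5-tuple always hashes to the same spine; different flows spread out.
```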