ArcOS: Enabling the Open Integration Era for Routing

August 12, 2019 | Keyur Patel

In the age of digital transformation, advancements in 5G, AI, machine learning, IoT, and network automation require smarter, scalable, and secure networks that lower the total cost of ownership. In a bid to achieve these goals, network infrastructure is increasingly becoming routing-centric.

As networks move to routing-centric architectures, the scale of routing itself grows in multiple dimensions. From a software perspective, scale is typically measured by the number of routes; the number of paths over which those routes are received; forwarding-related information such as MPLS labels, tunnel parameters, ACLs, and QoS; protocol-specific announcements such as Link State Advertisements (LSAs) or Link State Packets (LSPs); and the number of protocol peers. From a hardware perspective, routing scale is measured by the number of routes, the forwarding-related information, the packet buffers, and the number of ports and associated densities that are supported.

To deliver a seamless digital network transformation solution that meets and exceeds requirements well into the foreseeable future, scalable hardware and software architectures that leverage one another are both needed. The hardware components must support more routes, more paths (ECMP), more ACL entries, more QoS capability, and greater port scale in terms of both speed and density. They also need to provide higher density and deeper packet buffers to meet QoS requirements efficiently. Advances in merchant silicon, such as Broadcom’s StrataDNX™ Jericho2, a 10 Tbps switch-router SoC, allow Arrcus to expand the concept of open integration to routing.

Modern routing software needs to support a process-based architecture, efficient IPC mechanisms to interconnect processes, and the separation of workloads within each process with minimal or no locking. This creates the opportunity for a massively scalable solution in which workloads are distributed across multiple CPU cores and executed in a distributed manner with minimal interdependencies, while fully leveraging ever-increasing amounts of memory: a truly scale-out architecture.
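The idea of separating workloads so that they need no locks can be sketched in a few lines. This is only an illustration under stated assumptions, not how ArcOS is implemented: the function names (`shard_for`, `partition`, `build_table`, `run_parallel`) and the choice of hashing by prefix are hypothetical. Each prefix is hashed to exactly one worker, so every worker owns its share of the routing table outright and never contends with another worker for it.

```python
import zlib
from concurrent.futures import ProcessPoolExecutor
from typing import Dict, List, Tuple

Route = Tuple[str, str]  # (prefix, next_hop); hypothetical minimal route record


def shard_for(prefix: str, num_workers: int) -> int:
    """Map a prefix to one worker. crc32 is stable across processes and runs."""
    return zlib.crc32(prefix.encode("utf-8")) % num_workers


def partition(routes: List[Route], num_workers: int) -> List[List[Route]]:
    """Split routes into disjoint per-worker shards, keyed by prefix."""
    shards: List[List[Route]] = [[] for _ in range(num_workers)]
    for prefix, next_hop in routes:
        shards[shard_for(prefix, num_workers)].append((prefix, next_hop))
    return shards


def build_table(shard: List[Route]) -> Dict[str, List[str]]:
    """Runs inside one worker. The shard is private, so no lock is needed."""
    table: Dict[str, List[str]] = {}
    for prefix, next_hop in shard:
        table.setdefault(prefix, []).append(next_hop)  # collect multiple paths
    return table


def run_parallel(routes: List[Route], num_workers: int) -> List[Dict[str, List[str]]]:
    """Process each shard in its own OS process, one per CPU core if available."""
    shards = partition(routes, num_workers)
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(build_table, shards))
```

Because all paths for a given prefix land in the same shard, multipath information stays together, and adding workers only changes the modulus used for sharding. When run as a script, `run_parallel` should be called from under an `if __name__ == "__main__":` guard so that worker processes can be started safely on every platform.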

ArcOS: Built from First Principles

To achieve all of this efficiently, one must construct a solution from first principles. ArcOS was built with routing and scale as key areas of focus. It is a 64-bit, internet-scale network operating system that enables independent scheduling of processes, superior performance, rapid convergence, low latency, and the ability to scale out in terms of threads per process. The control plane can thus scale to a large number of high-speed ports, receive routing and forwarding information, and rapidly program forwarding information to multiple chips across multiple devices.

As a result, the ArcOS architecture has the ability to scale out to support the complete router platform spectrum: high-density fixed, modular chassis, and even open aggregated router solutions (with a scale-out control plane that can reside off-device).

After delivering best-in-class protocols with optimized performance and scale, we also focused on innovating in areas that would enable seamless operationalization of an ArcOS-based solution. The key areas are: high availability, manageability, and debuggability.

High Availability through Built-in Resiliency

Any scale-out software solution requires upgrade mechanisms that are both easy and efficient, with minimal downtime. This need is even more acute for routers: because they act as transit forwarders, any downtime on a router impacts reachability for the networks behind it. By designing in the separation of the software control plane from the underlying kernel, we reduce the downtime caused by software upgrades. The ArcOS components are packaged as multiple Linux packages that can be upgraded independently, including even the Linux kernel itself. As a result, ArcOS allows simple package installs at a component level. Each package can have one or more executables installed natively or as a pre-built container image with multiple sets of ArcOS processes.

ArcOS also supports a fast system cold reboot by performing as much work as possible up front, including image download and image verification, before taking the entire control plane offline. This allows traffic forwarding to continue until the new software is installed. Only when the time comes to switch from the older to the newer software version is the forwarding path ASIC re-initialized and the control plane software state downloaded.
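The ordering described above, in which everything that can fail cheaply is done before anything disruptive, can be sketched as follows. This is a hypothetical outline rather than the ArcOS upgrade code: the function names, the step strings, and the use of a SHA-256 checksum for image verification are assumptions made for illustration.

```python
import hashlib
import hmac
from typing import List


def verify_image(image: bytes, expected_sha256: str) -> bool:
    """Check a downloaded image against its published SHA-256 digest."""
    digest = hashlib.sha256(image).hexdigest()
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(digest, expected_sha256.lower())


def upgrade_steps(image: bytes, expected_sha256: str) -> List[str]:
    """Return the ordered steps of an upgrade attempt (illustrative only)."""
    steps = ["download image (forwarding continues)"]
    if not verify_image(image, expected_sha256):
        # Failure is detected before any disruptive step has been taken.
        steps.append("verification failed: abort, old software keeps running")
        return steps
    steps.append("verify image (forwarding continues)")
    steps.append("take control plane offline")
    steps.append("re-initialize forwarding path ASIC")
    steps.append("download control plane software state")
    return steps
```

The point of the sketch is the ordering: a corrupt or truncated image is rejected while the old software is still forwarding traffic, so the disruptive window covers only the final three steps.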

Easy Manageability and Intuitive Debuggability

ArcOS supports standards-based debugging and logging insights to monitor both the software and hardware components. Furthermore, ArcOS supports a lightweight, open source tracing framework based on LTTng and designed for high-performance routing environments. This efficient framework enables tracing of performance-sensitive code paths where traditional logging would add an intolerable delay. The benefit of such an open source framework is that it can always be “on” to trace and catch anomalies inside the code. As an example, network operators can use open source trace-viewing tools like “babeltrace” to analyze these traces. This approach gives the operator complete control over which trace points are enabled in an application. More information on LTTng is available on the LTTng project website.

ArcOS can stream component-specific state information using real-time telemetry. The data, in JSON format, is collected and maintained in a distributed database known as the ArcOS DataStore. The ArcOS streaming agent is based on either Kafka or gNMI and is used to securely stream system-wide data to ingesting servers in the cloud (public or private), where the data can be gathered and stored for longer intervals. The streaming agent is designed to be flexible enough to cover other modes of transport and encoding as and when needed. With this, ArcOS enables the streaming of the complete range of system data: control plane, data plane, physical and environmental data, and, more importantly, the vital statistics of the system.
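A streaming agent that is independent of its transport can be sketched as below. The sketch is an assumption-laden illustration, not the ArcOS agent: the record fields (`component`, `timestamp`, `state`), the `telemetry/<component>` topic naming, and every class and function name are hypothetical. An in-memory transport stands in for Kafka or gNMI so that the example runs without any external service.

```python
import json
import time
from typing import Any, Callable, Dict, List, Optional, Tuple

# A transport is any callable taking (topic, payload). A Kafka producer or a
# gNMI publisher could be wrapped to fit this shape; neither is used here.
Transport = Callable[[str, bytes], None]


def encode_record(component: str, state: Dict[str, Any],
                  timestamp: Optional[float] = None) -> bytes:
    """Encode one component's state as a JSON telemetry record."""
    record = {
        "component": component,
        "timestamp": time.time() if timestamp is None else timestamp,
        "state": state,
    }
    return json.dumps(record, sort_keys=True).encode("utf-8")


class StreamingAgent:
    """Publishes JSON records through whichever transport it is given."""

    def __init__(self, transport: Transport) -> None:
        self._send = transport

    def publish(self, component: str, state: Dict[str, Any],
                timestamp: Optional[float] = None) -> None:
        self._send(f"telemetry/{component}",
                   encode_record(component, state, timestamp))


def make_memory_transport(sink: List[Tuple[str, bytes]]) -> Transport:
    """Stand-in transport that appends to a list instead of using a network."""
    def send(topic: str, payload: bytes) -> None:
        sink.append((topic, payload))
    return send
```

Keeping the encoding and the transport behind separate seams is what lets an agent of this shape take on another transport or encoding later without touching the components that produce the state.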

Finally, the ArcOS DataStore is used to connect and communicate state information across software components. As a by-product, the DataStore “stores” state that can be retrieved by an ArcOS utility and used to analyze and debug state information at runtime. ArcOS also has ArcAPI™, a collection of utility scripts that gather system-specific information in user-readable form, giving operators quick access to vital information across their system.
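The dual role described here, a store that both passes state between components and retains it for later inspection, can be illustrated with a toy example. This is a single-process, in-memory sketch and says nothing about how the distributed ArcOS DataStore is actually built; the class, its methods, and the path-style keys are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Callback = Callable[[str, Any], None]


class DataStore:
    """Toy state store: writers put state, subscribers are notified, and the
    retained state can be dumped at any time for debugging."""

    def __init__(self) -> None:
        self._state: Dict[str, Any] = {}
        self._subscribers: List[Tuple[str, Callback]] = []

    def subscribe(self, prefix: str, callback: Callback) -> None:
        """Ask to be notified of every write under the given path prefix."""
        self._subscribers.append((prefix, callback))

    def put(self, path: str, value: Any) -> None:
        """Record state, then notify the components subscribed to that path."""
        self._state[path] = value
        for prefix, callback in self._subscribers:
            if path.startswith(prefix):
                callback(path, value)

    def dump(self, prefix: str = "") -> Dict[str, Any]:
        """Return retained state under a prefix, sorted by path, for debugging."""
        return {p: v for p, v in sorted(self._state.items())
                if p.startswith(prefix)}
```

In this sketch a debugging utility needs no cooperation from the component that wrote the state: calling `dump("/bgp/")` reads whatever was last communicated under that prefix.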

In summary, Arrcus has developed an independent network operating system that addresses the many challenges of today’s networking infrastructure by providing solutions that enable your network transformation in a simple, scalable, and secure environment.

If you would like more information, please go to Network Different with Arrcus!

Keyur Patel
Founder and CTO at Arrcus
Former Distinguished Engineer at Cisco Systems, Inc.