Self-Healing SD-WAN Enables High Availability Networking

22 Oct 2018

SD-WAN enables enterprises to use low-cost volume Internet access for business-critical traffic, including VoIP and other real-time services. However, while SD-WANs can manage Internet issues, providing a High Availability (HA) environment requires more. Especially for real-time IP-based services, this level of HA assures that the services do not have the extended outages that are often associated with non-redundant networks.

I recently spent some time reviewing Cato Networks’ announcement about increasing the SD-WAN uptime of its Cato Cloud. Cato Cloud is a so-called over-the-top (OTT) SD-WAN service. Cato provides a global, managed network backbone that includes a complete network security stack. Enterprises connect their sites, mobile users, and cloud resources into Cato Cloud by establishing encrypted tunnels across their Internet connections. Sites can also connect through MPLS. To increase availability and reduce the impact of the outages that enterprise networks invariably face, Cato has introduced two key features:

  • A new data center-oriented SD-WAN device, the X1700
  • Follow-the-network security rules

New SD-WAN Device for High Availability

Cato refers to its edge devices as Sockets. The new data center-oriented X1700 Socket is a 1U, rack-mountable SD-WAN device with redundant power supplies and redundant, hot-swappable hard drives. Previously, Cato only offered the X1500, which lacked component redundancy. (Companies can also connect to Cato Cloud through an IPsec tunnel from third-party devices, but with more limited HA capabilities.)

Both Sockets can be deployed in redundant, active-active or active-passive configurations. In either case, the dual Socket configuration assures that the site connected will stay up regardless of local failures. For primary data centers, the configuration of dual X1700s, each with dual independent Internet access services, assures a very high level of availability for major sites that can impact large user populations. For branch offices, the dual X1500 Sockets approach also assures continued operations. Finally, a cold shelf stand-by can be easily installed as it is auto-provisioned with the necessary configurations for operation.
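To see why doubling up Sockets and access lines matters, a rough availability estimate helps. The sketch below assumes that failures are independent and uses a made-up 99% figure for a single Internet line. Neither assumption comes from Cato, and lines that share a conduit or carrier will fail together more often than this model suggests.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant units when any one of them suffices.

    Assumes independent failures, which is an idealization.
    """
    return 1.0 - (1.0 - single) ** copies


# Hypothetical figure for illustration only: one Internet line at 99%.
one_line = 0.99
two_lines = parallel_availability(one_line, 2)

print(f"one line:  {one_line:.4%}")
print(f"two lines: {two_lines:.4%}")
```

Under these assumptions the second line turns roughly 1% of downtime into roughly 0.01%, which is the basic argument for dual independent access services at major sites.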

Follow-the-Network Security Rules

As part of the overall cloud SD-WAN solution, Cato Cloud includes network security. Part of the solution is a firewall capability converged into the SD-WAN to secure all Internet and WAN traffic. A key part of the Cato architectural strategy is a thin edge. With a thin edge, security and network services are implemented in the cloud, not as customer premises equipment (CPE). Eliminating the additional edge appliance hardware makes HA design much easier. With a self-healing network layer, it is very important that these capabilities are re-established correctly when a site moves to a different PoP or a server running security fails. Otherwise, in a failover scenario, you might find that your users have connectivity to a location but not access to it, because the security policies have not been updated. As part of the self-healing solution, Cato dynamically establishes these functions wherever the data flows go.
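One way to picture follow-the-network rules is to keep policies keyed by site in a central store, so that whichever PoP a site lands on can load them. The sketch below is only an illustration of that idea. The `EnforcementPoint` class, the `fail_over` helper, and the rule strings are all hypothetical and say nothing about how Cato actually implements it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EnforcementPoint:
    """Hypothetical stand-in for a PoP that enforces security policy."""
    name: str
    active: Dict[str, List[str]] = field(default_factory=dict)

    def attach(self, site: str, store: Dict[str, List[str]]) -> None:
        # Load the site's rules from the central store when it connects.
        self.active[site] = list(store[site])

    def detach(self, site: str) -> None:
        self.active.pop(site, None)


def fail_over(site: str, old: EnforcementPoint, new: EnforcementPoint,
              store: Dict[str, List[str]]) -> None:
    # Install policy at the new PoP first, then drop it from the old one.
    new.attach(site, store)
    old.detach(site)


# Hypothetical site name and rules, for illustration only.
policy_store = {"dc-1": ["allow voip", "deny inbound telnet"]}
pop_a = EnforcementPoint("pop-a")
pop_b = EnforcementPoint("pop-b")

pop_a.attach("dc-1", policy_store)
fail_over("dc-1", pop_a, pop_b, policy_store)
print(pop_b.active["dc-1"])
```

Because the rules belong to the site rather than to a PoP, the failover step never has to copy state out of the failed location.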

Multiple Tiers of Redundancy

Together with the existing architecture, Cato is delivering self-healing at all tiers of the network. This is incredibly important as organizations look to eliminate their dependency on MPLS. To even be considered, an affordable MPLS alternative must deliver the uptime of MPLS. This can be done with SD-WAN services, but redundancy and automatic failover are critical.

With an OTT service, edge devices connect to a Point of Presence (PoP). Traffic is then forwarded between PoPs as required to reach the endpoint at the far end of the connection. As part of the focus on HA, Cato has built redundancy and failover into many phases of its connection management.

With the X1700 series, Cato Cloud now automatically recovers from component failure in one SD-WAN device. HA configuration gives protection against device-level failure. Using active/active last mile configuration protects against failure of any one access line.

At the PoP level, each PoP consists of multiple compute nodes. Should the compute node handling your sessions fail, the sessions fail over to another compute node in the PoP. Should the PoP become unreachable, whether due to lost network connectivity, data center issues, server failures in the PoP, or any other reason, the connections automatically re-establish to the next closest PoP. This process continues until a reachable PoP is found. Should the entire Cato Cloud fail, Cato Sockets will find one another and form an ad-hoc SD-WAN. For real-time traffic, this automated dynamic PoP connectivity ensures that even a short outage is not apparent to users of real-time services.
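The "next closest PoP" behavior can be sketched as a simple selection over health-checked candidates. This is a minimal illustration, assuming that closeness is measured as round-trip latency and that a probe has already marked each PoP reachable or not. The `PoP` class, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PoP:
    name: str           # hypothetical label
    distance_ms: float  # assumed measure of "Internet distance" (round-trip latency)
    reachable: bool     # assumed result of a health probe


def select_pop(pops: List[PoP]) -> Optional[PoP]:
    """Return the closest reachable PoP, or None if no PoP responds."""
    candidates = [p for p in pops if p.reachable]
    return min(candidates, key=lambda p: p.distance_ms, default=None)


pops = [PoP("pop-a", 12.0, True), PoP("pop-b", 25.0, True), PoP("pop-c", 40.0, True)]

primary = select_pop(pops)
pops[0].reachable = False  # simulate the closest PoP becoming unreachable
fallback = select_pop(pops)

print(primary.name, "->", fallback.name)
```

A `None` result corresponds to the case where no PoP can be reached at all, which is where the article says the Sockets fall back to finding one another directly.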

Redundancy between PoPs is also implicit in the architecture. The PoPs form an overlay across multiple tier-1 carrier networks and monitor those networks for real-time performance. Cato’s policy-based routing (PBR) algorithms constantly calculate the optimum path for each packet. Should there be a brownout or blackout on any one carrier network, Cato Cloud calculates an alternate route around the failure.
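As a rough illustration of routing around a brownout, the sketch below scores each carrier path from measured latency and packet loss and skips any path whose loss exceeds a threshold. The cost formula, the 2% threshold, and the carrier names are assumptions made for this example. Cato's actual PBR algorithms are not described in the announcement.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CarrierPath:
    carrier: str       # hypothetical carrier label
    latency_ms: float  # measured latency between two PoPs
    loss_pct: float    # measured packet loss; 100.0 would be a blackout


MAX_LOSS_PCT = 2.0  # assumed brownout threshold


def path_cost(path: CarrierPath) -> float:
    # Assumed weighting: each 1% of loss counts like 50 ms of extra latency.
    return path.latency_ms + 50.0 * path.loss_pct


def best_path(paths: List[CarrierPath]) -> Optional[CarrierPath]:
    """Pick the cheapest path among those below the loss threshold."""
    usable = [p for p in paths if p.loss_pct < MAX_LOSS_PCT]
    return min(usable, key=path_cost, default=None)


paths = [CarrierPath("carrier-1", 20.0, 0.1), CarrierPath("carrier-2", 30.0, 0.0)]
before = best_path(paths)

paths[0].loss_pct = 5.0  # simulate a brownout on carrier-1
after = best_path(paths)

print(before.carrier, "->", after.carrier)
```

Re-running the selection as fresh measurements arrive is what moves traffic off a degraded carrier without anyone touching the configuration.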

What It All Means

The result is an SD-WAN network layer that includes HA and self-healing. As an example, assume a network with two major data centers and 50 remote sites providing video and voice as well as both business and Internet data access. Assume that the primary site is connected through an HA solution. For performance reasons, the PoP selected is generally the one closest in Internet distance. This customer is running Cato firewall security as well as QoS policies and management for the real-time services. Now suppose the data center hosting that PoP suffers a major issue (a hurricane, an earthquake, a power failure, etc.) and is no longer available. In this case, the SD-WAN will automatically open new connection paths to the next closest PoP, and it will also instantiate all the policies and services, like the firewall, in the new PoP. The result is a virtually uninterrupted real-time service, combined with policy and QoS to assure it operates effectively.

It’s been clear that SD-WANs solve a fundamental problem with the cost of connectivity. However, while the traffic-handling methods in SD-WAN minimize the effects of core Internet congestion and issues, they do not by themselves protect against network outages or slow-downs end-to-end.

With these enhancements, Cato is addressing the key issues in availability and uptime for its SD-WAN. Five nines uptime on the private interconnect network connecting the PoPs and the addition of HA features will enable companies to deliver service reliability that meets and even exceeds that of more expensive options.
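For readers who want to translate "five nines" into time, the arithmetic is short. The sketch below converts an availability figure into permitted downtime per year, ignoring leap years. The function name is just for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of downtime per year allowed by a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, value in [("three nines", 0.999),
                     ("four nines", 0.9999),
                     ("five nines", 0.99999)]:
    print(f"{label}: {downtime_minutes_per_year(value):.2f} minutes per year")
```

Five nines works out to a little over five minutes of downtime per year, compared with more than eight hours at three nines.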

