CloudFlare Outage Blamed on Faulty Juniper Router

11 Mar 2013

Website security and performance firm CloudFlare suffered a major outage last week, which is being blamed on routers from Juniper Networks. An investigation is underway.

The outage affected all of the firm's services. CloudFlare blamed its edge routers for a "systemwide" failure that took the company's 23 data centers offline. The company stated: "When a router goes down, the routes to the network that sits behind the router are withdrawn from the rest of the Internet."

CloudFlare added: "We have already reached out to Juniper to see if this is a known bug or something unique to our setup and the kind of traffic we were seeing at the time."

CloudFlare is offering service credits to accounts covered by service level agreements (SLAs). The company had also detected a distributed denial-of-service attack targeting one of its customers and, in response, pushed a filtering rule to its routers. The company said: "What should have happened is that no packet should have matched that rule because no packet was actually that large. What happened instead is that the routers encountered the rule and then proceeded to consume all their RAM until they crashed."
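The behavior CloudFlare expected from the rule can be illustrated with a minimal Python sketch. It is not CloudFlare's or Juniper's code: the `LengthRule` class, the `filter_packets` helper, and the rule bounds are hypothetical and chosen only for illustration. The one fixed fact is that an IPv4 packet cannot exceed 65,535 bytes, because the header's Total Length field is 16 bits wide.

```python
from dataclasses import dataclass

# Largest possible IPv4 packet: the Total Length header field is 16 bits.
MAX_IPV4_PACKET_BYTES = 65_535


@dataclass(frozen=True)
class LengthRule:
    """Hypothetical drop rule: match packets with min_len <= length <= max_len."""
    min_len: int
    max_len: int

    def matches(self, packet_len: int) -> bool:
        return self.min_len <= packet_len <= self.max_len

    def can_ever_match(self) -> bool:
        # A rule whose lower bound exceeds the protocol maximum is unsatisfiable.
        return self.min_len <= min(self.max_len, MAX_IPV4_PACKET_BYTES)


def filter_packets(packet_lengths, rule):
    """Return the packet lengths that survive the drop rule."""
    return [n for n in packet_lengths if not rule.matches(n)]


# Placeholder bounds, deliberately larger than any valid IPv4 packet.
rule = LengthRule(min_len=99_971, max_len=99_985)
traffic = [64, 576, 1_500, 9_000, 65_535]

# Every packet survives, because no valid packet can be "that large".
surviving = filter_packets(traffic, rule)
```

In this sketch an unsatisfiable rule is simply a no-op, which is the outcome CloudFlare said should have occurred. A check like `can_ever_match()` is one way such a rule could be flagged before it is pushed to every router.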

The reliability of cloud systems has been heavily questioned in the past, with critics suggesting that outages pose a significant risk to data stored in the cloud or handled through cloud-based services. Cloud-hosting companies reported regular outages last year, and security experts have noted that service contracts need to address the issue. They also recommend a continuity plan to contain any interruption to business.

According to CloudFlare, some Juniper routers failed to reboot automatically, and the routers' management ports were not accessible to the company. This caused a delay in getting back online. The company commented: "Even though some data centers came back online initially, they fell back over again because all the traffic across our entire network hit them and overloaded their resources."

CloudFlare said its team began to restore the network within 30 minutes, with full restoration in about an hour. The company said: "We will be doing more extensive testing of Flowspec-provisioned filters and evaluating whether there are ways we can isolate the application of the rules to only those data centers that need to be updated, rather than applying the rules networkwide."

A Juniper spokesperson confirmed that the company is aware of the situation at CloudFlare and has begun an investigation to determine the cause of the outage. Juniper said it is not aware of any other customers having similar problems.

The spokesperson said: "While we have not completed our investigation, we believe this incident was triggered by a product issue that Juniper identified last October, when a patch was also made available. Our customer support team is actively supporting CloudFlare in its efforts to resolve the issue."

There have been security issues at CloudFlare in the past; in June last year a number of problems caused a data-security breach in its network and an attack on one of its customers. (CY)