
The Facebook Outage Wasn’t a DDoS Attack, but It Shines a Light on Digital Resilience Planning

Yesterday’s far-from-subtle multi-property outage at Facebook, Instagram and WhatsApp, and its cascading effect on other properties and plug-ins, reminded us that digital resilience is core to a successful online presence. It is also a reminder to every organization, including the service provider networks and the colocation, data center and hosting providers that increasingly house critical applications and infrastructure.

The outage lasted around six hours on Monday, 4 October, and was also reported to have affected internal systems that depended on the same infrastructure, including access to Facebook’s physical properties. Adam Mosseri, the head of Instagram, likened the situation to a “snow day” for Facebook employees, who effectively couldn’t work.

The business impact was clear and illustrates the cost of a lack of resilience. Reported impacts included:

  • Morningstar reported that Facebook stock fell 4.9 percent, wiping out over $40 billion in market cap
  • The same article translated the downtime into roughly $164,000 a minute in lost revenue
  • Loss of brand equity and user confidence (and potential gains for competitors such as Twitter)
  • DownDetector logged over 14 million problem reports due to the cascading effects
  • The broader economic impact beyond Facebook itself was obviously far greater

Outages can be driven by many things. One of the initial conversations within A10, and a question raised by external parties as well, was whether this was a DDoS attack. The sites were down with no response from the servers, not even a failure page, so the suspicion was reasonable. However, the A10 Security Research team saw no unusual activity from our honeypots or other monitoring systems, though it did note the DNS and BGP issues. That pointed to core infrastructure problems as the cause, which Facebook confirmed late yesterday:

“Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.”

If you want to read more, ThousandEyes provided a technical article on the outage, covering the DNS and BGP details, while KrebsOnSecurity also offered a detailed summary.
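One detail that helped rule out a DDoS attack was where the failure sat: Facebook’s names stopped resolving at all once the routes to its authoritative DNS servers were withdrawn, which looks different from a flooded site that still resolves but won’t answer. As a rough illustration only (this is not A10’s tooling; it is a standard-library Python sketch with example hostnames), an external check can separate those two failure modes:

```python
# Illustrative triage sketch: distinguish "the name won't resolve" from
# "the name resolves, but the servers don't answer". Hostnames and ports
# are examples only; uses just the Python standard library.
import socket

def triage(hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    # Step 1: can we resolve the name at all?
    try:
        addresses = {info[4][0] for info in socket.getaddrinfo(hostname, port)}
    except socket.gaierror:
        return f"{hostname}: DNS resolution failed (no answer from name servers)"

    # Step 2: the name resolves -- do any of the addresses accept a connection?
    for address in addresses:
        try:
            with socket.create_connection((address, port), timeout=timeout):
                return f"{hostname}: resolves to {address} and accepts connections"
        except OSError:
            continue
    return f"{hostname}: resolves ({', '.join(sorted(addresses))}) but no server is answering"

if __name__ == "__main__":
    for name in ("facebook.com", "www.instagram.com"):
        print(triage(name))
```

During the outage, a check along these lines would have failed at the resolution step for Facebook’s domains, pointing toward DNS and routing problems rather than servers overwhelmed by attack traffic.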

Outages will happen, no matter how much we plan; they are a fact of life IT professionals will always have to deal with. The challenge is how to mitigate that risk as much as possible and how to respond in times of crisis. While not all of these are specific to the Facebook outage, some best practices include:

  • Make deliberate choices, and plans, to mitigate the biggest risks
  • Know what cyber security services, such as DDoS protection, your data center provider offers
  • Have an internal plan for how and whom to involve and notify, as well as an external communications plan
  • Develop fail-over plans for all infrastructure and eliminate single points of failure, for example, with global server load balancing and other techniques
  • Ensure security systems are in place, both to monitor for anomalies and to mitigate nefarious activity
  • Eliminate chances for human error through automation and process-oriented checks and balances (see the sketch after this list)
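
On the last point, Facebook’s own statement attributes the outage to a configuration change on its backbone routers, exactly the kind of human-driven change that automated pre-deployment checks are meant to catch. As a purely illustrative sketch (the change format, field names and thresholds below are invented for this post, not Facebook’s or A10’s process), a validation gate might refuse any change that would withdraw all advertised routes or leave no authoritative name servers reachable:

```python
# Hypothetical pre-change validation gate: block a proposed change that would
# leave a site with no advertised routes or no reachable name servers.
# The "change" structure and thresholds are invented for illustration.
from dataclasses import dataclass

@dataclass
class ProposedChange:
    site: str
    routes_before: int      # prefixes advertised before the change
    routes_after: int       # prefixes advertised after the change
    nameservers_after: int  # authoritative name servers reachable after the change

def validate(change: ProposedChange, max_route_drop: float = 0.5) -> list[str]:
    """Return a list of blocking errors; an empty list means the change may proceed."""
    errors = []
    if change.routes_after == 0:
        errors.append(f"{change.site}: change withdraws ALL advertised routes")
    elif change.routes_before and (
        1 - change.routes_after / change.routes_before
    ) > max_route_drop:
        errors.append(f"{change.site}: change drops more than {max_route_drop:.0%} of routes")
    if change.nameservers_after == 0:
        errors.append(f"{change.site}: no authoritative name servers would remain reachable")
    return errors

if __name__ == "__main__":
    risky = ProposedChange(site="dc-backbone", routes_before=120, routes_after=0, nameservers_after=0)
    for problem in validate(risky):
        print("BLOCKED:", problem)
```

The point is not these specific rules, but the principle: the riskiest changes should have to pass an automated sanity check before anyone can push them to production.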

The emphasis on digital resilience, in both technology and planning, is becoming more important, and examples like the Facebook outage only amplify it. They serve as a visceral reminder of the impact of downtime.

I am sure the internal Facebook team tasked with fixing this outage was not having a “snow day” yesterday.