Facebook is down: What can it teach us?

The Facebook outage exposed the vulnerabilities of centralized IT structures and the cost of human error in configuration changes. The incident is a reminder that it is crucial to select the best cloud storage for development and to monitor the solution’s infrastructure regularly for vulnerabilities.
What happened to Facebook on October 4, 2021?
On October 4, 2021, at least 3.5 billion people worldwide were unable to use Facebook for communication or for anything else. Facebook and its core platforms, including WhatsApp and Instagram, went offline and effectively disappeared from the Internet. The outage lasted from 11:39 am ET until 6:00 pm ET, when the company managed to partially restore its services. Full functionality, however, remained disrupted until late that Monday.
Reasons behind the global outage
Facebook’s VP of Infrastructure, Santosh Janardhan, apologized on behalf of the company in an official statement. He explained that the company’s engineering teams had found that configuration changes on the backbone routers coordinating network traffic between its data centers had interrupted that communication. The disruption had a cascading effect on the way the data centers communicate, including the Border Gateway Protocol (BGP) routing system.
An unnamed source, cited as a “Facebook employee” by NBC News, stated that the problem stemmed from the Domain Name System (DNS), which connects users to websites. The issues affected both internal services and third-party tools. Another source, a WhatsApp employee, claimed that only calendars and emails remained accessible. Since booking the company’s conference rooms required tablets with an active internet connection, these rooms were unavailable as well.
We outlined the following hypotheses:
- First, Facebook relied on withdrawals of BGP routes to the respective authoritative name servers located outside the facebook.com domain.
- Second, the company used a short Time To Live (TTL) for its DNS records, so cached entries expired quickly and the loss of reachability to its name servers took effect almost immediately.
- Third, the disappearance of Facebook’s domain name and the related names took down the internal command-and-control tools. This outcome could stem from either the original BGP route withdrawal or the DNS problem. Facebook’s data centers were unable to exchange traffic, which made the issue worse.
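The effect of the TTL in the second hypothesis can be illustrated with a toy simulation. This is a minimal sketch in Python, not a real DNS implementation and not a model of Facebook's actual setup: the `ResolverCache` class, the `seconds_until_failure` helper, the name `example.test`, and the address `192.0.2.10` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    address: str
    expires_at: float  # simulated time (in seconds) when the entry expires


class ResolverCache:
    """Toy caching resolver (hypothetical, for illustration only)."""

    def __init__(self):
        self.authoritative_reachable = True
        self._cache = {}

    def resolve(self, name, now, ttl):
        record = self._cache.get(name)
        if record is not None and now < record.expires_at:
            return record.address  # answer served from the cache
        if not self.authoritative_reachable:
            self._cache.pop(name, None)
            return None  # a real resolver would report a failure here
        # Cache miss or expired entry: "ask" the authoritative server again.
        record = CachedRecord(address="192.0.2.10", expires_at=now + ttl)
        self._cache[name] = record
        return record.address


def seconds_until_failure(ttl, outage_at, step=10, horizon=7200):
    """Return how long after the outage the name stops resolving.

    Returns None if the name still resolves at the end of the horizon.
    """
    cache = ResolverCache()
    for now in range(0, horizon + 1, step):
        if now >= outage_at:
            cache.authoritative_reachable = False
        if cache.resolve("example.test", now, ttl) is None:
            return now - outage_at
    return None
```

In this simplified model, a 60-second TTL makes the name fail within a minute of the name servers becoming unreachable, while a one-hour TTL keeps cached answers alive much longer. The trade-off is that long TTLs also slow down the propagation of legitimate changes.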
Lessons learned: preventative actions and tools
The Facebook outage pointed to larger-scale problems, which the following lessons learned help to address:
- Redundancy is king for networks and data storage
A look at the issues and recent developments in cloud computing and storage security shows the value of redundancy. In essence, redundancy means duplicating data and/or equipment so that an individual failure doesn’t disrupt the whole infrastructure. In particular, the cloud supports dynamic allocation of network resources in real time.
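As an illustration of redundancy at the application level, the sketch below reads from several replicas in turn and returns the first answer it gets. It is a minimal example under stated assumptions: `fetch_with_failover` and the replica names used with it are hypothetical, and each replica is modeled as a plain Python function that either returns a value or raises `ConnectionError`.

```python
def fetch_with_failover(replicas, key):
    """Try each (name, read_function) pair in order.

    Returns a tuple (replica_name, value) from the first replica that
    answers. Raises ConnectionError if every replica fails.
    """
    errors = []
    for name, read in replicas:
        try:
            return name, read(key)
        except ConnectionError as exc:
            # Remember the failure and fall through to the next replica.
            errors.append(f"{name}: {exc}")
    raise ConnectionError("all replicas failed: " + "; ".join(errors))
```

For example, calling it with a list such as `[("primary", read_primary), ("secondary", read_secondary)]` would return the secondary's answer whenever the primary raises `ConnectionError`. Redundancy only helps if the replicas do not share the same point of failure, such as a single network or DNS setup.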
- Planning and rehearsing configuration changes matter
This lesson acknowledges that human and technical errors are possible and need to be addressed early. Facebook stated that a single command, issued by its engineers to assess the availability of global backbone capacity, took down the connections in the backbone network and disconnected its data centers on a global scale.
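One way to rehearse a change is a dry run that checks its effect before anything is applied. The sketch below is a simplified illustration, not a description of Facebook's tooling: the function names `is_connected` and `change_is_safe` and the site names in the usage note are hypothetical. It models the backbone as a graph and rejects a change that would leave any site cut off.

```python
from collections import deque


def is_connected(sites, links):
    """Return True if every site can reach every other site over the links."""
    sites = set(sites)
    if not sites:
        return True
    adjacency = {site: set() for site in sites}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Breadth-first search from an arbitrary starting site.
    start = next(iter(sites))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == sites


def change_is_safe(sites, current_links, links_to_remove):
    """Dry-run a link removal and report whether the backbone stays connected."""
    removed = {frozenset(link) for link in links_to_remove}
    remaining = [link for link in current_links if frozenset(link) not in removed]
    return is_connected(sites, remaining)
```

With three sites `dc-a`, `dc-b`, and `dc-c` linked in a triangle, removing one link passes the check, while removing all three links fails it. A real pre-flight check would need to cover far more than connectivity, but the principle of validating the outcome before applying the command is the same.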
Since such errors are possible at large corporations like Facebook and Amazon, they may also happen during the development of a custom app. For a custom solution that handles large volumes of data, such as a booking or delivery app, it is essential to implement sufficient circuit breakers.
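A circuit breaker stops calling a failing dependency for a while instead of retrying it endlessly. The following is a minimal sketch of the pattern in Python, assuming a single-threaded caller. The class names `CircuitBreaker` and `CircuitOpenError` and the default thresholds are hypothetical choices for illustration, not taken from any particular library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None            # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Waiting period is over: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A booking app could, for instance, wrap calls to a payment or inventory service in `breaker.call(...)` and show a fallback message when `CircuitOpenError` is raised. The `clock` parameter exists only to make the waiting period easy to test.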
- Importance of decentralized IT architectures
Centralized IT processes increase the risk of major service disruptions. Decentralization would keep the platform running in certain regions even if it failed in others. At the same time, various solutions, such as cloud Android development tools, may be highly effective in offering access to DNS servers through a third-party provider.
Future outlook
Large corporations and small companies that need applications developed will benefit from decentralized solutions that help minimize risks. The TRIARE team has been developing customized solutions across various industries and knows how to minimize the risks and prevent errors that might lead to app crashes.
The lessons learned from Facebook’s outage highlight the value of redundancy in network security and data storage, the need to practice configuration changes, and the importance of decentralization.