CTO at Triare
The Facebook outage exposed the vulnerabilities of centralized IT infrastructure and the cost of human error in configuration changes. The incident is a reminder that it is crucial to choose reliable cloud storage for development and to monitor the solution’s infrastructure regularly for vulnerabilities.
On October 4, 2021, at least 3.5 billion people worldwide were unable to use Facebook for communication or anything else. Facebook and its core platforms, including WhatsApp and Instagram, went offline and effectively disappeared from the Internet. The outage began at about 11:39 am ET and lasted until about 6:00 pm ET, when the company managed to partially restore its services. Full functionality, however, remained disrupted until late Monday, October 4.
Facebook’s VP of Infrastructure, Santosh Janardhan, apologized on behalf of the company in the official statement. He explained that the company’s engineering teams had found that configuration changes on the backbone routers that coordinate network traffic between its data centers had interrupted that communication. The network disruption had a cascading effect on the way the data centers communicate, including the Border Gateway Protocol (BGP) routing system.
An unnamed source, cited as a “Facebook employee” by NBC News, stated that the problem stemmed from the Domain Name System (DNS), which connects users to websites. The issues affected both internal services and third-party tools. Another source, a WhatsApp employee, claimed that only calendars and email remained accessible. Because conference rooms at the company are accessed through tablets that need an active internet connection, those rooms were also unavailable.
We outlined the following hypotheses:
The Facebook outage pointed to larger-scale problems that need to be addressed. The following lessons can be drawn from it:
A look at the issues and recent developments in cloud computing and storage security shows the value of redundancy. At its core, redundancy means duplicating data and/or equipment so that an individual failure doesn’t disrupt the whole infrastructure. The cloud helps here in particular, because it supports dynamic allocation of network resources in real time.
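To make the idea of redundancy concrete, here is a minimal Python sketch of a read that falls back to a second copy of the data when the first one is unreachable. The function and replica names (`read_with_failover`, `primary`, `secondary`) are hypothetical and invented for illustration; a real system would also need health checks, timeouts, and consistency guarantees between copies.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def read_with_failover(key: str, replicas: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each replica in order and return the first successful read."""
    errors = []
    for fetch in replicas:
        try:
            return fetch(key)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed for key {key!r}")


# Hypothetical replicas: the primary is down, the secondary still answers.
def primary(key: str) -> bytes:
    raise ConnectionError("primary data center unreachable")


def secondary(key: str) -> bytes:
    return b"value-for-" + key.encode()


result = read_with_failover("user:42", [primary, secondary])
print(result)
```

Because the second copy lives behind an independent call, the failure of the primary does not reach the caller. The read only fails if every replica fails.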
Another lesson is that human and technical errors are always possible and need to be caught early. Facebook stated that a single command, issued by its engineers to assess the availability of global backbone capacity, effectively took down the connections in the backbone network and disconnected its data centers on a global scale.
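One way to catch such a mistake early is a pre-flight check that refuses any change that would leave too few links active. The sketch below is an illustration only and does not describe Facebook's actual tooling; the names `check_change`, `UnsafeChange`, and the link labels are assumptions made up for this example.

```python
from typing import Iterable, Set


class UnsafeChange(Exception):
    """Raised when a proposed change would leave too few links active."""


def check_change(active_links: Set[str],
                 links_to_disable: Iterable[str],
                 min_remaining: int = 1) -> Set[str]:
    """Return the links that stay active, or refuse the change outright."""
    remaining = active_links - set(links_to_disable)
    if len(remaining) < min_remaining:
        raise UnsafeChange(
            f"change would leave {len(remaining)} link(s); "
            f"at least {min_remaining} required"
        )
    return remaining


# Hypothetical backbone with three links between data centers.
backbone = {"dc1-dc2", "dc2-dc3", "dc1-dc3"}

# Disabling one link is allowed.
still_up = check_change(backbone, ["dc1-dc2"])
print(sorted(still_up))

# Disabling every link at once is rejected before anything is executed.
try:
    check_change(backbone, backbone)
except UnsafeChange as exc:
    print("rejected:", exc)
```

The check runs before the command does, so an overly broad change is stopped while the network is still intact.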
If such errors can happen at large corporations like Facebook and Amazon, they can also happen during the development of a custom app. For a custom solution that handles large volumes of data, such as a booking or delivery app, it is essential to build in enough circuit breakers, that is, safeguards that stop calling a failing component so that one failure does not cascade through the rest of the system.
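A circuit breaker in this sense can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a production implementation: the class name `CircuitBreaker`, its parameters `max_failures` and `reset_after`, and the injected `clock` are all hypothetical choices made for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after      # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open; call skipped")
            # Cooling-off period is over: allow one trial call (half-open).
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open the circuit
            raise
        self._failures = 0  # any success closes the circuit again
        return result
```

A booking app, for example, could wrap each call to a payment service in `breaker.call(...)`. After repeated failures the breaker rejects calls immediately instead of letting requests pile up against a service that is already down.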
Centralization of IT processes increases the risk of major service disruptions. Decentralization would keep the platform running in some regions even if it failed in others. In addition, solutions such as cloud Android development tools can help by providing access to DNS servers through a third-party provider.
Large corporations and small companies that need applications developed will both benefit from decentralized solutions that help minimize risk. The TRIARE team has been developing customized solutions across various industries and knows how to minimize the risks and prevent the errors that can lead to app crashes.
The lessons learned from Facebook’s outage highlight the value of redundancy in network security and data storage, the need to rehearse and verify configuration changes, and the importance of decentralization.