Microsoft has officially recovered after a "networking outage" took down its top cloud platform Azure, messaging service Teams, and popular email client Outlook, used by millions around the globe.
Josh Stephens, CTO of BackBox, shared his thoughts on the recent outage, why it might have happened, and what organizations can learn from the incident, along with actions they can take to prevent similar network outages: "According to Microsoft, a change made to the Microsoft Wide Area Network (WAN) made Microsoft services inaccessible to users around the globe.
It's incredible that the simplest configuration change, or even a typo, can sometimes cause a ripple effect that brings down a network or disrupts a supposedly fault-tolerant business service. Even tech giants like Microsoft aren't immune.
In many cases, the outage does not occur immediately after the configuration change is made, so it can be difficult to correlate the change with the outage during root cause analysis.
While many news reports have keyed in on the fact that a configuration change caused such a widespread outage, the real headline is that it took Microsoft four hours to restore service.
While four hours sounds exorbitant, I don't have the technical details about the cause of the outage or, more specifically, the extenuating circumstances that extended the time it took to restore service. So rather than pass judgment, I will just honestly say: I've been there.
I do have some thoughts on how other network teams at organizations can be proactive now to avoid a similar disaster:
Maintain solid documentation and up-to-date network maps to accelerate the resolution of difficult technical problems
Continuous, automated configuration auditing and remediation to ensure that all network devices are up-to-date and compliant with operational policies and industry standards
Automated network configuration backups that allow you to instantly restore a known-good configuration, along with automated weekly or more frequent OS updates and patches
At a minimum, your automation platform should create backups daily, before and after changes, and store a long history of backups within an autoscaling, fault-tolerant data store. Furthermore, it should be able to reliably conduct upgrades at scale while employing at least mildly complex workflows.
No single tool or approach can guarantee business continuity, but there are ways to be better prepared if the worst does happen."
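The backup cadence Stephens describes, timestamped configuration snapshots kept as a long history, with the ability to instantly restore the latest known-good copy, can be sketched in a few lines of Python. This is a minimal illustration, not a real product's API: the device name and the `fetch_config` callable are hypothetical stand-ins for however a given platform actually pulls a device's running configuration.

```python
# Minimal sketch of automated, timestamped config backups with history.
# `fetch_config` is a hypothetical callable standing in for whatever
# mechanism (SSH, API, etc.) actually retrieves a device's configuration.
from datetime import datetime, timezone
from pathlib import Path


def backup_config(device: str, fetch_config, backup_root: Path) -> Path:
    """Store a timestamped copy of a device's configuration and return its path."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S.%fZ")
    dest_dir = backup_root / device
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{stamp}.cfg"
    dest.write_text(fetch_config(device))
    return dest


def restore_latest(device: str, backup_root: Path) -> str:
    """Return the most recent stored configuration for a device."""
    backups = sorted((backup_root / device).glob("*.cfg"))
    if not backups:
        raise FileNotFoundError(f"no backups found for {device}")
    return backups[-1].read_text()
```

In practice this would run on a schedule and be triggered before and after every change window, so that root cause analysis can diff the configurations immediately surrounding an incident.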