Last month, a Microsoft Azure outage took thousands of websites (including MSN.com) offline without notice. At the time, the error was blamed on undetected storage blobs.
The Real Story
Today, Microsoft told the public that the Azure outage was actually caused by human error. The intermittent connection issues in a variety of regions around the world were in fact due to a bug that was exposed after a breach of employee protocol. The company’s VP for Azure, Jason Zander, admitted:
“There was a gap in the deployment tooling that relied on human decisions and protocol”
After a complete analysis of the outage, Microsoft stated that “faulty-flighting” led to the problem. The engineer administering the upgrade changed the configuration and, in doing so, exposed a bug. This bug caused the Azure Blob storage Front-Ends to go into an infinite loop, making it impossible for service requests to be resolved. Zander went on to explain:
“There are two types of Azure Storage deployments: software deployments (i.e. publishing code) and configuration deployments (i.e. change settings). Both software and configuration deployments require multiple stages of validation and are incrementally deployed to the Azure infrastructure in small batches. This progressive deployment approach is called ‘flighting.’”
“When flights are in progress, we closely monitor health checks. As continued usage and testing demonstrates successful results, we will deploy the change to additional slices across the Azure Storage infrastructure.”
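To make the idea of flighting concrete, here is a minimal sketch of a batch-by-batch rollout gated on health checks. It is an illustration only, not Microsoft's deployment tooling: the function `flight_change`, the callbacks `apply_change` and `is_healthy`, and the slice names are all hypothetical.

```python
from typing import Callable, List, Sequence


def flight_change(
    slices: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_size: int = 1,
) -> List[str]:
    """Roll a change out in small batches; stop after the first unhealthy batch.

    Returns the slices that received the change before the rollout ended.
    All names here are hypothetical and exist only for illustration.
    """
    deployed: List[str] = []
    for start in range(0, len(slices), batch_size):
        batch = slices[start:start + batch_size]
        for name in batch:
            apply_change(name)
            deployed.append(name)
        # Gate: every slice in this batch must pass its health check
        # before the next batch is touched.
        if not all(is_healthy(name) for name in batch):
            break
    return deployed


if __name__ == "__main__":
    slices = ["test", "region-a", "region-b", "region-c"]
    unhealthy = {"region-a"}  # pretend the change breaks this slice
    reached = flight_change(
        slices,
        apply_change=lambda name: None,              # stand-in for a real deployment step
        is_healthy=lambda name: name not in unhealthy,
    )
    print(reached)  # ['test', 'region-a']
```

In this sketch, a failure in an early slice stops the change before it reaches the remaining regions. Skipping the batching step and pushing a change everywhere at once removes that safety net.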
From Now On
Microsoft is already taking precautions to ensure an outage like this never happens again. The company has updated the deployment system for Azure and now enforces its policies on updates, coding and configuration more strictly. Did you experience this outage? If so, how long did it last, and what type of error message did you receive?