Azure DevOps Code Upgrade Leads to 10 Hours of Downtime in South Brazil Region
Microsoft recently experienced a 10-hour downtime in the South Brazil Azure region due to a typo in a code upgrade for a scale unit of Azure DevOps. The typo caused the deletion of 17 production databases, leading to an extended recovery period.
Background
Azure DevOps is a suite of application lifecycle services that administrators use to take regular snapshots of production databases. These snapshots are used to investigate reported customer issues or test improvements. A system in the background takes care of the daily creation and deletion of these snapshots after a certain time.
Typo in Pull Request
During a recent code upgrade, Microsoft replaced outdated Microsoft.Azure.Managment packages with supported Azure.ResourceManager-NuGet packages. Unfortunately, a typo crept into the major pull request, which caused the delete job to delete the entire scale unit Azure SQL server for the South Brazil region with 17 databases. As a result, customers could no longer use the server in question.
10 Hours of Downtime
The recovery process took a whopping 10 hours. This was because customers cannot repair Azure SQL servers themselves, but have to leave them to Azure engineers from Microsoft. In addition, the deleted databases had different backup configurations, which took even more recovery time.
After all databases were back online, the entire scale unit remained unreachable for customers due to complex problems with the tech giant’s web servers. When the customer traffic came online, these servers had to process an overload of traffic, so that they still went offline.
Microsoft’s Response
Microsoft has apologized to its customers and made several fixes to prevent a recurrence. The tech giant is also taking steps to ensure that a similar incident does not occur in the future.