Asleep at the wheel: Why did it take 5 HOURS for Microsoft to acknowledge an Azure DevOps TITSUP*?
We’ll have to wait until the US wakes up before we can answer that one
In an impressively frank postmortem, Microsoft has admitted that at least part of its organisation was asleep at the wheel in a very real sense while its European DevOps tooling tottered.
The travails of Azure during the current surge in usage are well-documented but, as well as showing the limits of cloudy tech, the pandemic-induced capacity constraints have also shown up some all too human failings at Microsoft.
Microsoft’s Azure DevOps hosted pools were no exception to those constraints, and from 24 to 26 March customers in Europe and the UK experienced substantial delays in their pipelines.
It was bad. By Microsoft’s own reckoning, during normal working hours over the three days, customers experienced an average delay of 21 minutes. The worst delay was nine hours.
The problem was that for each Azure Pipelines job, a fresh virtual machine is needed and, well, there was no room at the inn as Azure reacted to the COVID-19 surge.
Oh, and the Primary Incident Manager (PIM) was asleep, but more on that later.
As the failures mounted, Azure Pipelines kept trying, and the queues kept getting longer. The gang had noted the potential for problems prior to the incident and had already been working on an update to deal with it (ephemeral OS disks for Linux agents and chunkier Azure VMs with nested virtualisation for Windows). However, the change was a big one and took a while to roll out. After all, making things worse would be less than ideal. Fair enough.
However, what was not fair enough was the piss-poor communication from the Windows giant as users clamoured for answers. With refreshing frankness, Chad Kimes, director of engineering, said: “On the first day, when the impact was most severe, we didn’t acknowledge the incident for approximately five hours, which is substantially worse than our target of 10 minutes.”
Yes, Chad, it is.
Kimes proceeded to give an insight into how Microsoft handles this type of problem. Automated tooling detects customer request failures and performance wobbles. It then loops in both a Designated Responsible Individual (DRI) and the PIM. The PIM is the one who does the external communications.
However, pipeline delays are picked up by a different process, and in this case the PIM wasn't informed.
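For the curious, the routing gap Kimes described boils down to something like the following sketch. To be clear, this is our own illustrative pseudologic, not Microsoft's actual tooling; the function and alert-type names are made up for the example.

```python
# Hypothetical sketch of the alert-routing gap described in the postmortem.
# Alert-type names and roles are illustrative, not Microsoft's real system.

def who_gets_paged(alert_type: str) -> list[str]:
    """Return which on-call roles are looped in for a given alert type."""
    # Customer request failures and performance wobbles page both the
    # Designated Responsible Individual (DRI) and the Primary Incident
    # Manager (PIM), who handles external communications.
    if alert_type in ("request_failure", "performance_degradation"):
        return ["DRI", "PIM"]
    # Pipeline delays, however, flowed through a separate detection
    # process that only looped in the DRI -- so while the DRI debugged,
    # nobody was awake to talk to customers.
    if alert_type == "pipeline_delay":
        return ["DRI"]
    return []

print(who_gets_paged("pipeline_delay"))  # no PIM, hence no comms for five hours
```

The mitigation Microsoft describes amounts to routing `pipeline_delay` through the same path as the first branch, so the PIM is paged on the same schedule as for other incident types.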
So, in this instance, while the DRI was frantically trying to work out why builds were borked in the UK and Europe, the PIM enjoyed the rest of the righteous, doubtless dreaming of gambolling through fields of purest Git.
It wasn’t until the PIM awoke and signed into the incident bridge at the start of business hours in the Eastern US that – oops – the borkage was finally acknowledged, five hours after things had gone TITSUP*.
Anyone who has had to ask the Windows giant the simplest of questions will sympathise with having to wait for someone in the US to wake up before an answer can be dispensed. We humble hacks at El Reg’s London office would therefore like to welcome those developers using the company’s Azure DevOps hosted pools to our own little world of pain.
Still, kudos to Microsoft for laying out what happened so bluntly. As well as offering profuse apologies to developers and an array of technical mitigations, the company also said: “We are improving our live-site processes to ensure that initial communication of pipeline delay incidents happens on the same schedule as other incident types.”