Off-Prem

Microsoft admits slim staff and broken automation contributed to Azure outage

Just three people were on duty in Australia when 'power sag' struck and software failures left them blind

Mon 4 Sep 2023 // 06:57 UTC

Microsoft's preliminary analysis of an incident that took out its Australia East cloud region last week – and which appears also to have caused trouble for Oracle – attributes the outage in part to insufficient staff numbers on site, which slowed recovery efforts.

The software colossus has blamed the incident on "a utility power sag [that] tripped a subset of the cooling units offline in one datacenter, within one of the Availability Zones."

Microsoft is known to operate some cloud infrastructure in parts of Sydney, Australia, that experienced power outages after an electrical storm last week. The "power sag" explanation is therefore consistent with wider events.

The analysis document explains that the two data halls impacted by the sag had seven chillers – five in operation and two on standby. Once the sag struck, Microsoft's staff executed Emergency Operational Procedures (EOPs) to bring the tripped chillers back online. But that didn't work "because the corresponding pumps did not get the run signal from the chillers."

That's not what is supposed to happen. Microsoft is talking to its suppliers about why it did.

Backup chillers didn't completely live up to their name.

"We had two chillers that were in standby which attempted to restart automatically – one managed to restart and came back online, the other restarted but was tripped offline again within minutes," Microsoft's report states.

With just one chiller working in data halls that need five, "thermal loads had to be reduced by shutting down servers."

Which is when bits of Azure and other Microsoft cloud services started to evaporate.

The software colossus's report offers a very detailed timeline of events that shows how its on-site team made it onto the datacenter's roof to inspect chillers exactly an hour after the power sag, and that the chillers' manufacturer had boots on the ground two hours and 39 minutes after the incident commenced.

But the document also notes that Microsoft had just three of its own people on site on the night of the outage, and admits that was too few.

"Due to the size of the datacenter campus, the staffing of the team at night was insufficient to restart the chillers in a timely manner," the report states. "We have temporarily increased the team size from three to seven, until the underlying issues are better understood, and appropriate mitigations can be put in place."

The analysis also suggests the prepared emergency procedures did not include provisions for an incident of this sort.

"Moving forward, we are evaluating ways to ensure that the load profiles of the various chiller subsets can be prioritized so that chiller restarts will be performed for the highest load profiles first," the document states.
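The prioritization Microsoft describes amounts to ordering restarts by load. The report, as quoted here, does not say how that would be implemented, so what follows is only a minimal Python sketch of the idea. The `ChillerSubset` type, the subset names, and the load figures are all hypothetical and are not taken from Microsoft's document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChillerSubset:
    """Hypothetical record for a group of chillers serving one area."""
    name: str       # illustrative label, not from the report
    load_kw: float  # illustrative thermal load served, in kilowatts


def restart_order(subsets):
    """Return the subsets sorted so the highest load is restarted first."""
    return sorted(subsets, key=lambda s: s.load_kw, reverse=True)


# Made-up example data, purely to show the ordering.
subsets = [
    ChillerSubset("hall-A", 420.0),
    ChillerSubset("hall-B", 910.0),
    ChillerSubset("hall-C", 615.0),
]

order = [s.name for s in restart_order(subsets)]
print(order)
```

With a small night shift, a ranked list like this would tell staff which chillers to walk to first, rather than leaving the order to whoever reaches the roof.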

Manual resets

Microsoft also had trouble understanding why its storage infrastructure didn't come back online.

Storage hardware damaged by the data hall temperatures "required extensive troubleshooting" but Microsoft's diagnostic tools could not find relevant data because the storage servers were down.

"As a result, our onsite datacenter team needed to remove components manually, and re-seat them one by one to identify which particular component(s) were preventing each node from booting," the report states.

Some kit needed to be replaced, while some components needed to be installed in different servers.

Microsoft also admitted "our automation was incorrectly approving stale requests, and marking some healthy nodes as unhealthy, which slowed storage recovery efforts."
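Microsoft's quoted statement does not explain how its automation decides what to approve. As a generic illustration only, one common guard against acting on stale requests is to reject any request that is too old or that was made against an outdated view of node health. Every name below (`RecoveryRequest`, `should_approve`, the 300-second limit, the health version counter) is an assumption made for this sketch, not a detail of Azure's systems.

```python
from dataclasses import dataclass

# Hypothetical freshness limit; a real system would tune this.
MAX_REQUEST_AGE_S = 300.0


@dataclass(frozen=True)
class RecoveryRequest:
    """Hypothetical request to take a recovery action on a storage node."""
    node_id: str
    issued_at: float              # seconds, on the same clock as `now`
    observed_health_version: int  # health snapshot the requester saw


def should_approve(request, now, current_health_version,
                   max_age_s=MAX_REQUEST_AGE_S):
    """Approve only requests that are fresh and based on current health data."""
    if now - request.issued_at > max_age_s:
        return False  # too old: conditions may have changed since it was issued
    if request.observed_health_version != current_health_version:
        return False  # health was re-evaluated after the request was made
    return True


req = RecoveryRequest("node-17", issued_at=1000.0, observed_health_version=4)
print(should_approve(req, now=1100.0, current_health_version=4))
print(should_approve(req, now=1400.0, current_health_version=4))
```

The version check matters as much as the age check: a request can be only seconds old and still describe a node whose health has since been reassessed.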

And that's just the stuff the tech giant was able to discover in its immediate post-incident review, compiled within three days of the incident. The Beast of Redmond publishes full assessments of outages within fourteen days, and The Register awaits that document with interest – as, we imagine, will Azure customers. ®
