Azure SQL Database caused some annoyance over the weekend with admins on the US east coast unable to connect to the service following a network infrastructure power failure.
The disruption started at 07:33 UTC on September 16 and wasn't fully mitigated until some 14 hours later at 21:38, Microsoft confirmed on its status page.
Microsoft said: “Some customers using Azure SQL Database in East US may have experienced issues when trying to connect to databases. Connections to databases hosted in the East US region may have resulted in persistent errors or timeouts.”
As a customer noted on the platform formerly known as Twitter: “Sql DB connection lost in production. The failover doesn’t kick in, causing impact in daily activities. Thanks god it (sic) Saturday.”
The exact cause of the wobbly sessions has yet to be determined. Microsoft said: “We identified that during a brief period of time an underlying network infrastructure experienced a power issue. This caused compute nodes to become unhealthy, resulting in failures and timeouts for SQL Database.”
As is often the case, the domino effect kicked in, and downstream services that rely on SQL Database struggled to run as normal.
“We were notified of this issue through our internal monitoring systems, prompting us to initiate a thorough investigation,” Microsoft said. “In order to mitigate the initial impact, we rebooted the affected compute nodes, thereby restoring functionality to the majority of databases. Subsequently, the remaining SQL DB instances were brought back online, restoring services to full functionality.”
The power issue at the root of the degraded services is still under investigation by Microsoft engineers as they try to “establish a workstream to prevent future occurrence.”
So not exactly the mother of all outages that caused customers to gnash their teeth or pull their hair out, but also not a great look for Microsoft SQL DB on Azure. Still, it wasn't as embarrassing as the incident reported a week ago that wiped out Azure services in the Australia East cloud region.
In that incident, “a utility power sag tripped a subset of the cooling units offline in one datacenter, within one of the Availability Zones.” Microsoft concluded that understaffing and automation contributed to the problem.
Source: The Register