Microsoft Azure OpenAI outage

Incident Report for insightsoftware

Resolved

https://app.azure.com/h/4TSS-3VG/6711e7

What Happened
Between 01:33 UTC and 02:38 UTC on 16 May 2025, a platform-level issue within the Azure OpenAI Service led to service interruptions affecting downstream integrations. During this period, our services that rely on Azure OpenAI experienced intermittent errors and degraded performance, including HTTP 500 responses.

Root Cause
Microsoft has identified that the underlying cause was an increase in total request size on their backend, which triggered container restarts due to memory limits being exceeded. This resulted in one of their backend resources becoming unhealthy and unable to process requests reliably.

Timeline of Response

01:33 UTC – Azure platform issue began impacting dependent services

01:49 UTC – Issue was automatically detected by Microsoft’s monitoring systems; Azure engineers were engaged

01:58 UTC – Root cause was isolated to memory-related container restarts within Azure OpenAI infrastructure

02:12 UTC – Microsoft engineers initiated mitigation by scaling out the affected backend systems

02:38 UTC – Full service restoration confirmed by Microsoft

Impact on Our Services
While our infrastructure and application code remained stable, availability of AI-powered features was temporarily degraded due to upstream dependency on Azure OpenAI. No data loss or permanent service degradation occurred within our environment.
Posted May 16, 2025 - 04:33 UTC
This incident affected: AI Services - EU (Azure OpenAI).