What Happened
Between 01:33 UTC and 02:38 UTC on 16 May 2025, a platform-level issue within the Azure OpenAI Service led to service interruptions affecting downstream integrations. During this period, our services that rely on Azure OpenAI experienced intermittent errors and degraded performance, including HTTP 500 responses.
Root Cause
Microsoft has identified the underlying cause as an increase in total request size on its backend, which pushed containers past their memory limits and triggered restarts. As a result, one of its backend resources became unhealthy and could not process requests reliably.
Timeline of Response
01:33 UTC – Azure platform issue began impacting dependent services
01:49 UTC – Issue was automatically detected by Microsoft’s monitoring systems; Azure engineers were engaged
01:58 UTC – Root cause was isolated to memory-related container restarts within Azure OpenAI infrastructure
02:12 UTC – Microsoft engineers initiated mitigation by scaling out the affected backend systems
02:38 UTC – Full service restoration confirmed by Microsoft
Impact on Our Services
Our infrastructure and application code remained stable, but availability of AI-powered features was temporarily degraded because of their upstream dependency on Azure OpenAI. No data loss or permanent service degradation occurred within our environment.
May 16, 04:33 UTC