Last week was a total nightmare for Microsoft cloud services. Multiple system failures caused by Azure AD problems brought Microsoft 365 to its knees, and load-balancing issues hit Exchange Online a few days later. Together, the incidents showed how fragile the Microsoft cloud-based collaboration environment can be in the face of IT complexity.
As Microsoft reported on the Azure status page a few hours after the downtime started, “… at approximately 19:15 UTC on 15 Mar 2021, a subset of customers may experience issues authenticating into Microsoft services, including Microsoft Teams, Office and/or Dynamics, Xbox Live, and the Azure Portal.”
Then, a few days later, a second incident hit the Microsoft cloud infrastructure, impacting Exchange Online because of load-balancing issues.
These were NOT the first outages for Microsoft’s cloud environment, and they will NOT be the last. What can organizations do to improve their response when the next one hits?
Lessons Learned – be prepared for a rainy day
Microsoft’s typical response to a service outage is to send announcements via Twitter and other communication methods providing generic information about the problem. There is rarely a hint of the impact on enterprise organizations hosted in different geographies. Instead, companies are asked to check back later and are left fumbling around for answers while the outage is underway.
To its credit, Microsoft provides good ongoing status information during the outage, but the details are still generic, leaving customers to fend for themselves.
So, what if? What if you had an early warning system? What if your IT group had alerts that provided the details of the outage, including WHAT workloads were currently impacted and WHICH regions were currently affected? With this information you could take action: notify helpdesk support groups and tell your employees to reschedule their Teams meetings.
During the recent M365 outages, our customers using OfficeExpert TrueDEM EPM had that capability. They had the early warning ahead of Microsoft’s announcement and were able to use the actionable insights to their advantage.
OfficeExpert TrueDEM EPM and the Regional Outage Differences
While most services were down in North America, our customers found that Exchange Online was not (see screenshots below). And since Exchange was still up and running, their IT support groups prompted their employees to reschedule their Teams meetings for the following day and avoid any confusion with business partners and customers.
Overall, the impact on M365 service availability differed depending on geographic region. OfficeExpert TrueDEM EPM identified those differences and provided our customers with the details 30 minutes before Microsoft sent out its initial announcements.
North America perspective
The following four screenshots are from our North America customers running OfficeExpert TrueDEM EPM. You can see that the outage had no impact on the Exchange Online (EXO) service, but other services such as Teams, OneDrive and SharePoint were totally out of commission for hours.
European perspective
Our customers in Europe had a different experience: they lost Exchange (EXO) access as well as the other main workloads. Microsoft Teams availability was impacted first, followed by OneDrive and SharePoint, and then about an hour later EXO was down. The business impact was smaller than in North America because these outages occurred during “off-hours.” Based on the data shown in the graphical charts below, you can see that M365 services were all up and running again around 3:00 AM, with some minor availability in between.
Note: The timestamps above are in Central European Time (CET).
Invest in an Early Warning System
Outages for M365 will occur again; that much has been proven over time. How your IT support groups react during these unpredictable instances is up to you. For some enterprise organizations, this type of downtime is critical to their business. They need detailed information so they can make the best decisions to direct their employees and partners. Knowing which M365 services are affected allows them to work proactively, notifying end-users and applying contingency plans before being inundated with helpdesk calls:
- Alert your helpdesk to prepare detailed instructions
- Switch to different modes of communication during the outage
- Reschedule meetings to the next day
- Inform business partners and customers to expect delayed communications
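To make the idea of a region-aware alert more concrete, here is a minimal Python sketch of how availability checks could be rolled up into a "WHAT workloads, WHICH regions" summary. It is an illustration only and does not show how OfficeExpert TrueDEM EPM works internally. The probe tuples, the sample values, and the function names `summarize_outage` and `format_alert` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical probe results: (region, workload, is_available).
# These values are illustrative only; they are not measured data.
SAMPLE_PROBES = [
    ("North America", "Teams", False),
    ("North America", "OneDrive", False),
    ("North America", "SharePoint", False),
    ("North America", "Exchange Online", True),
    ("Europe", "Teams", False),
    ("Europe", "OneDrive", False),
    ("Europe", "SharePoint", False),
    ("Europe", "Exchange Online", False),
]


def summarize_outage(probes):
    """Map each region to the sorted list of workloads with a failed probe."""
    impacted = defaultdict(set)
    for region, workload, is_available in probes:
        if not is_available:
            impacted[region].add(workload)
    return {region: sorted(workloads) for region, workloads in impacted.items()}


def format_alert(summary):
    """Turn the per-region summary into a short message for the helpdesk."""
    if not summary:
        return "All monitored workloads are available."
    lines = [
        f"{region}: {', '.join(workloads)} unavailable"
        for region, workloads in sorted(summary.items())
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_alert(summarize_outage(SAMPLE_PROBES)))
```

With a summary like this, a helpdesk could see at a glance that email is still a usable channel in one region but not in another, and tailor the instructions it sends out accordingly.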
Find Out More…
If you are interested in learning more about our OfficeExpert TrueDEM EPM data analytics solution and how it can help you monitor service availability and maintain business continuity for your employees, please visit our overview page at https://www.panagenda.com/products/officeexpert/.