Last week several organizations across the world were impacted by an Office 365 outage. The Exchange Online service was not fully available for several hours. Some users couldn’t access their mailboxes, and for others the mail delivery performance (sending/receiving) was poor.
Whoomp! There it is…
The consequences are obvious. Loss of productivity, bad end-user experience, amplified end-user frustration, loss of business speed and loss of trust. And that’s just naming a few of the many possible business-critical impacts.
Interestingly enough, the case under which this incident was logged (EX172491) has been removed by Microsoft in the meantime.
Indeed, a fundamental question for many end users, administrators and businesses who rely on stable, high-performance cloud offerings on a daily basis.
Take a Walk on the Safe Side
Monitoring your Office 365 installation is a critical first step in getting the information you need on your enterprise applications in real time. You can’t effectively manage a vitally important part of your application infrastructure unless you know how it’s performing. Early insights on availability will help you prepare for outages.
Knowing who is impacted is an important element for steering the issue (e.g. notifying your end-users): whether it is only a group of people, a subset of users (in case Multi-Geo capabilities of Office 365 are used) or the entire organization using the cloud tenant.
With OfficeExpert we offer a solution that helps you to identify the magnitude of the possible impact.
Furthermore, by using the Mail Flow Simulation Sensor by OfficeExpert, organizations could have seen that the system was only partially restored: accessing the mailbox worked again, but the underlying service of sending and receiving mail was still impaired by the incident. The following screenshot shows that there was a steady increase in the mail delivery time between January 23rd and 26th.
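To illustrate the idea behind watching mail delivery times, here is a minimal Python sketch. It is not how OfficeExpert works internally; it simply assumes you already collect round-trip delivery times (in seconds) from some test mailbox and want to flag samples that are much slower than a baseline. The function name `degraded_samples`, the baseline window of five samples and the factor of 2.0 are all hypothetical choices for illustration.

```python
from statistics import median
from typing import List, Sequence


def degraded_samples(
    delivery_seconds: Sequence[float],
    baseline_window: int = 5,   # assumption: first N samples represent "normal"
    factor: float = 2.0,        # assumption: 2x the baseline counts as degraded
) -> List[int]:
    """Return the indexes of samples slower than factor x the baseline median."""
    if len(delivery_seconds) <= baseline_window:
        return []  # not enough data to compare against a baseline
    baseline = median(delivery_seconds[:baseline_window])
    return [
        i
        for i, value in enumerate(delivery_seconds)
        if i >= baseline_window and value > factor * baseline
    ]
```

Called with a list of measured delivery times, the function returns the positions of the slow samples, which could then feed a notification. A median baseline is used here so that a single slow mail during the "normal" period does not distort the threshold.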
Ensure Solid Business Continuity for Your End-Users
This transparency helps you know that a particular service is not fully restored. It also helps you understand how you can plan and communicate accordingly. At the end of the day, this naturally benefits the end user too.
Monitoring notifications ensure that you are the first to find out that an issue exists, even before Microsoft tweets about it hours later. Knowing which services are affected allows you to work proactively by notifying your users and applying contingency plans before being inundated with user tickets.
UPDATE: Further Outage on January 29th!
Another major outage happened on January 29th, 2019, when users were unable to authenticate and access Office 365 services. Azure was also affected by this incident. The root cause communicated by Microsoft was a DNS issue with CenturyLink as an internal DNS provider.
The following screenshot shows how OfficeExpert saw and measured this outage. The Skype for Business service had a downtime of almost 3 hours. Other services such as Exchange Online were impacted for around 1 hour. The failure indicator (error message in the screenshot) states that a certain fully qualified domain name could not be resolved. This matches the root cause statement by Microsoft exactly.
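A failure of this kind, where a fully qualified domain name cannot be resolved, can be detected with a very simple probe. The following Python sketch is only an illustration under our own assumptions, not the OfficeExpert implementation: `probe_dns` and `ProbeResult` are hypothetical names, and the probe uses the standard library's `socket.getaddrinfo` against port 443. The resolver is passed in as a parameter so the probe can be exercised without network access.

```python
import socket
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    fqdn: str
    resolved: bool
    elapsed_ms: float
    error: Optional[str] = None


def probe_dns(fqdn: str, resolver: Callable = socket.getaddrinfo) -> ProbeResult:
    """Try to resolve fqdn once and report success, latency and any error."""
    start = time.monotonic()
    try:
        resolver(fqdn, 443)  # raises socket.gaierror if the name cannot be resolved
        error = None
    except socket.gaierror as exc:
        error = f"could not resolve {fqdn}: {exc}"
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return ProbeResult(fqdn, error is None, elapsed_ms, error)
```

Run on a schedule against the host names your tenant depends on, a probe like this would report `resolved=False` together with the resolver's error message as soon as name resolution breaks, independent of any vendor status page.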
UPDATE: O365 Outage on May 2nd!
On May 2nd at 10:10pm CEST (1:10pm PDT) Microsoft sent out the following message: “We’re aware of and investigating an issue affecting access to SharePoint and OneDrive. Further details can be found in the admin center under SP178746 and OD178975.”
At first Microsoft was unable to get any information out to its community. Users worldwide were forced to turn to social media rumor mills to find out why they were having problems. The affected core services, whose loss hurt productivity, included Azure, multiple Microsoft 365 services, Dynamics, and DevOps.
In the screenshot below, it can be seen that OfficeExpert identified an outage at 9:50pm CEST. This was a full 20 minutes before the first Microsoft communication was sent.*
We’re very pleased with the positive feedback we received from our customers using OfficeExpert. They were able to identify the global outage of related sub-services for themselves before it was made public.
This was similar to the global Azure outage in January, when it took over 1 hour for Office 365 services to be restored. It again raises the question of how best to minimize the impact of cloud outages on your business.
* according to publicly available sources