On October 28th, 2020, our clients experienced a website outage starting at 3pm (EDT). It affected both our Multi-Site and Private Instance networks on our legacy (AWS) platform, which hosts the majority of our websites.
The outage lasted for a total of approximately one hour and 34 minutes, although our websites began coming back online at 4:21pm. The root cause was discovered at 5:13pm.
The root cause was a combination of human error and the planned retirement of one of our EC2 instances by our hosting provider, AWS, due to hardware degradation. The instance to be retired hosted our Network File System (NFS) and stored PHP code crucial to the operation of our website platforms. AWS notified us of the retirement schedule in advance; however, due to human error, the notification was not actioned. As a result, the instance was retired while it was still being used to operate our website platforms, which made the NFS unavailable and caused our application servers to fail. The issue was resolved by restarting the affected NFS instance and restarting the application servers.
In response to this outage, we've reviewed our notification processes and created a new internal notification system for e-mails from AWS. We'll also be reviewing our usage of the NFS, potentially moving to Amazon's Elastic File System (EFS), and reviewing our alerting system so that we can respond more quickly to issues like this in the future.
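For readers who run their own EC2 instances: scheduled retirements are also exposed through the EC2 API, so they can be checked programmatically rather than relying on e-mail alone. The following is a minimal illustrative sketch in Python, not a description of our internal notification system. It assumes the boto3 library and configured AWS credentials; the function names `pending_retirements` and `fetch_instance_statuses` are hypothetical.

```python
def pending_retirements(instance_statuses):
    """Return (instance_id, not_before) pairs for open retirement events.

    `instance_statuses` is the "InstanceStatuses" list returned by the
    EC2 DescribeInstanceStatus API.
    """
    found = []
    for status in instance_statuses:
        for event in status.get("Events", []):
            if event.get("Code") != "instance-retirement":
                continue
            # EC2 prefixes the description of finished events with "[Completed]".
            if event.get("Description", "").startswith("[Completed]"):
                continue
            found.append((status["InstanceId"], event.get("NotBefore")))
    return found


def fetch_instance_statuses(region_name):
    """Collect status records for all instances in a region (needs AWS credentials)."""
    import boto3  # imported here so the filter above can be used without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    paginator = ec2.get_paginator("describe_instance_status")
    statuses = []
    # IncludeAllInstances=True also returns instances that are not running.
    for page in paginator.paginate(IncludeAllInstances=True):
        statuses.extend(page["InstanceStatuses"])
    return statuses
```

A scheduled job could call `pending_retirements(fetch_instance_statuses("us-east-1"))` (the region here is only an example) and raise an alert whenever the result is not empty.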
We sincerely apologise for the downtime this caused. We understand how important it is that your website is available and performs as expected 100% of the time, and we will continue to aim for this. You can always monitor our platform status by visiting our Status page. If you do notice any problems with your website, you can use our live chat service ("Chat with us!" at the bottom right of this page) to alert our Support team immediately, and we will investigate the problem.