Outage Postmortem - August 28th and 30th, 2019

Overview

On August 28th, 2019 and August 30th, 2019, clients may have experienced website outages and HTTP 502 errors during the following times (all times EDT):

  • Wednesday, August 28th - 5:20am until 5:38am
  • Friday, August 30th - 11:15pm until 11:48pm

The root cause of the downtime on both days was a configuration issue on a newly deployed Memcached server, which left the server unable to handle more than a certain number of connections. On Wednesday, the issue was identified after an update to a plug-in and was resolved by restarting the database and the Memcached server. However, the connection limit was reached again on Friday, at which point we discovered the root cause and began taking steps both to resolve the immediate issue and to prevent it from recurring.
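For readers who run their own Memcached servers, the kind of connection limit described above can be spotted in the server's own statistics. The following is a minimal sketch, not the tooling used during this incident: it parses the text reply to Memcached's `stats` command and flags a server that is close to its connection limit. The sample reply, its numbers, and the 80% warning threshold are illustrative assumptions.

```python
def parse_stats(raw: str) -> dict[str, str]:
    """Parse the text reply to Memcached's `stats` command into a dict."""
    stats = {}
    for line in raw.splitlines():
        parts = line.strip().split(" ", 2)
        # Data lines look like "STAT <name> <value>"; the reply ends with "END".
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def connection_report(stats: dict[str, str], warn_ratio: float = 0.8) -> dict:
    """Summarise connection usage. warn_ratio is an assumed threshold."""
    current = int(stats["curr_connections"])
    limit = int(stats["max_connections"])
    # listen_disabled_num counts how often the server stopped accepting
    # new connections because the limit had been reached.
    refused = int(stats.get("listen_disabled_num", 0))
    usage = current / limit
    return {
        "current": current,
        "limit": limit,
        "usage": usage,
        "at_risk": usage >= warn_ratio or refused > 0,
    }


# Illustrative reply only; these values are not taken from the incident.
SAMPLE_REPLY = (
    "STAT curr_connections 1010\r\n"
    "STAT max_connections 1024\r\n"
    "STAT listen_disabled_num 3\r\n"
    "END\r\n"
)

if __name__ == "__main__":
    print(connection_report(parse_stats(SAMPLE_REPLY)))
```

In practice the reply text would come from the server itself, for example by sending `stats` over a TCP connection to the Memcached port.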

Our immediate resolution was to move to Amazon ElastiCache, which we will use for the foreseeable future instead of the Memcached server.

Follow Up

As the root cause was a configuration issue on the Memcached server, we will implement changes to ensure that any new servers are correctly configured before they are used in our Production environment. We are confident that the root cause in this case has been addressed by moving to Amazon ElastiCache; however, we are continuing to monitor this to ensure that no further issues arise.
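As an illustration of what a pre-Production configuration check could look like, the sketch below compares a server's reported settings against a set of required minimums and lists anything that falls short. It is a hypothetical example rather than a description of our actual process: the `REQUIRED_MINIMUMS` table and its threshold are assumptions, and `maxconns` is the setting name Memcached reports in its `stats settings` reply.

```python
# Hypothetical minimums a server must meet before promotion to Production.
REQUIRED_MINIMUMS = {"maxconns": 4096}


def check_settings(settings: dict[str, int], minimums: dict[str, int]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    for name, minimum in minimums.items():
        actual = settings.get(name)
        if actual is None:
            problems.append(f"{name}: setting not reported")
        elif actual < minimum:
            problems.append(f"{name}: {actual} is below the required minimum {minimum}")
    return problems


if __name__ == "__main__":
    # Illustrative values only, not the configuration from the incident.
    candidate = {"maxconns": 1024}
    for problem in check_settings(candidate, REQUIRED_MINIMUMS):
        print("FAIL", problem)
```

A check like this can run as a gate in a deployment pipeline, so that a server with an empty problem list is the only kind that receives Production traffic.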

We sincerely apologise for the downtime this caused. We understand how important it is that your website is available and performs as expected 100% of the time, and we will continue to aim for this. You can monitor our platform status at any time by visiting our Status page. If you notice any problems with your website, you can use our live chat service ("Chat with us!" at the bottom right of this page) to alert our Support team immediately, and we will investigate the problem.
