Downtime Report Tuesday May 17

Derek
posted this on May 17, 2011 08:26 pm

It’s our policy to always be open and transparent with any issues that impact your website. Here's our report on the May 17 downtime.

What Happened

All websites on Pronto servers were unavailable during the period shown in the timeline below. Underlying system issues ultimately led to the failure, but a lack of sufficient monitoring, including monitors on the monitors, turned what would have been a rectifiable system issue in our new infrastructure into a system failure.

The outage itself seems to stem from a misconfiguration of parts of the system, but we're investigating further to see whether there was a possible external cause (more on that below). The outage wasn't detected before it caused downtime because we did not have proper monitoring or adequate alerting in place in our new infrastructure.

The initial failure seems to be the result of first the database server, then the app server, starting to swap memory to disk. This ultimately made the system unresponsive until the DB finally stopped accepting connections at all, resulting in a complete outage. Oddly enough, traffic had peaked a little after 00:00 UTC and was gradually declining. Unfortunately, at 0:42 UTC we lost monitoring of the HTTP service, so we cannot tell right away whether traffic remained stable or whether a sudden burst, perhaps caused by spam bots (a possible external cause), brought the system down. We are reviewing the logs of the reverse proxy and HTTP servers to see if any unusual traffic appeared at that time.

 

Timeline

Tuesday May 17 (May 16 for the US)

0:42 UTC - DB Server & App Server start swapping memory to disk and lose responsiveness - system running at greatly reduced capacity

1:49 UTC - Last connection to DB server before it ceases responding completely - system offline

2:47 UTC - Database reset restores service - system online
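
For reference, the durations implied by the timeline above can be worked out with a few lines of Python. This is only an arithmetic sketch: the three timestamps come from the timeline, while the variable and function names are made up for illustration.

```python
from datetime import datetime

# Timestamps taken from the timeline above (all UTC, May 17, 2011).
DEGRADED = datetime(2011, 5, 17, 0, 42)   # swapping begins, reduced capacity
OFFLINE = datetime(2011, 5, 17, 1, 49)    # last DB connection, system offline
RESTORED = datetime(2011, 5, 17, 2, 47)   # database reset, system online


def minutes_between(start, end):
    """Whole minutes elapsed between two datetimes."""
    return int((end - start).total_seconds() // 60)


degraded_minutes = minutes_between(DEGRADED, OFFLINE)
offline_minutes = minutes_between(OFFLINE, RESTORED)
total_minutes = minutes_between(DEGRADED, RESTORED)

print("Reduced capacity: %d min" % degraded_minutes)
print("Fully offline:    %d min" % offline_minutes)
print("Total impact:     %d min" % total_minutes)
```

That works out to 67 minutes at reduced capacity followed by 58 minutes fully offline, for 125 minutes of impact in total.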

 

Root Cause of Downtime

  • Ultimately we were not monitoring or alerting for SWAP usage.
  • We did not have alerts for when monitoring is disconnected; we were not monitoring our monitoring.
  • Had either of these been in place, we would likely have been able to catch the problem before there was a full service failure.
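
To illustrate the two checks described above, here is a minimal sketch of a swap-usage check and a "monitor the monitor" heartbeat check. It is not our actual monitoring setup: the function names, the 10% swap threshold, and the 300-second silence limit are assumptions chosen for the example. The swap check parses text in the format of Linux's /proc/meminfo (the SwapTotal and SwapFree fields).

```python
def swap_used_percent(meminfo_text):
    """Return swap usage as a percentage, given /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])  # first field is the value in kB
    total = values.get("SwapTotal", 0)
    free = values.get("SwapFree", 0)
    if total == 0:
        return 0.0  # no swap configured
    return 100.0 * (total - free) / total


def monitor_is_stale(last_heartbeat, now, max_silence_seconds=300):
    """True if the monitor has not reported within the allowed window."""
    return (now - last_heartbeat) > max_silence_seconds


def collect_alerts(meminfo_text, last_heartbeat, now, swap_threshold_percent=10.0):
    """Return a list of alert messages (thresholds are illustrative)."""
    alerts = []
    used = swap_used_percent(meminfo_text)
    if used > swap_threshold_percent:
        alerts.append("swap usage at %.1f%%" % used)
    if monitor_is_stale(last_heartbeat, now):
        alerts.append("monitoring heartbeat missing")
    return alerts


# Example with made-up numbers: 75% of swap in use, monitor silent for 1000 s.
sample = "MemTotal: 4000000 kB\nSwapTotal: 2000 kB\nSwapFree: 500 kB\n"
print(collect_alerts(sample, last_heartbeat=0, now=1000))
```

On a real host the text would come from reading /proc/meminfo and the timestamps from time.time(). The functions take both as arguments so the logic can be tried out without touching a live server.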

Other factors that may have reduced capacity and contributed to the failure include:

  • For some reason the Apache processes were only utilizing one of the two CPU cores on the app server. This seems to be corrected now but we are monitoring.
  • The database system defaults were to have query caching turned off. We have now turned this on.

 

Next Steps

  • Deploying a live database peer to handle read-only traffic, which should nearly double database capacity and improve performance.
  • Setting up additional centralized monitoring of the reverse proxy (RP) on the old server, which is now being used for the redirects. This doesn't appear to have had any influence on the failure we just experienced, but our review of the situation revealed that we had failed to monitor this correctly as well.
  • Setting up proper monitoring for SWAP space on all devices.
  • Setting up more resilient and active alerting for any out-of-spec conditions.
  • Installing an NGINX reverse proxy on the app server which will take the load off Apache for static files. We were going to deal with this with a CDN but feel this is a quick solution that will increase responsiveness & capacity right away. CDN will come later.
  • Installing a memcache WP plugin that will cache settings/options lookups as a stop-gap measure. It turns out our system is performing ~400 DB hits per page request because of how some of the PHP is written. Rewriting these parts of the code will be one of our first post-migration development efforts, since doing so will significantly decrease DB hits and make the system more responsive.
  • We are investigating other changes to our database infrastructure that would provide better monitoring and make the system more resilient.
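
To show why caching options lookups helps, here is a small Python sketch of the cache-aside idea behind the memcache plugin step above. It is not the plugin's code: the class names are hypothetical, an in-process dict stands in for memcached, and the figure of 20 distinct option names per page is an assumption made purely for the example. Only the ~400 lookups per page request comes from our measurements.

```python
class CountingOptionsStore:
    """Stand-in for the database; counts how many lookups reach it."""

    def __init__(self, options):
        self._options = dict(options)
        self.db_hits = 0

    def fetch(self, name):
        self.db_hits += 1
        return self._options.get(name)


class CachedOptions:
    """Cache-aside wrapper: check the cache first, fall back to the store."""

    def __init__(self, store):
        self._store = store
        self._cache = {}  # a plain dict standing in for memcached

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self._store.fetch(name)
        return self._cache[name]


def render_page(get_option, lookups=400, distinct=20):
    """Simulate one page request doing `lookups` option reads."""
    for i in range(lookups):
        get_option("option_%d" % (i % distinct))


options = {"option_%d" % i: i for i in range(20)}

uncached = CountingOptionsStore(options)
render_page(uncached.fetch)

backing = CountingOptionsStore(options)
cached = CachedOptions(backing)
render_page(cached.get)   # first request fills the cache
render_page(cached.get)   # second request is served from the cache

print("DB hits without cache (1 page):", uncached.db_hits)
print("DB hits with cache (2 pages):  ", backing.db_hits)
```

Under these assumptions the uncached page costs 400 DB hits, while two cached page requests together cost only 20. A real cache also needs invalidation when an option changes, which this sketch leaves out.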

 

Timeline to Implement Changes

  • Items 1-4 (numbered in the order listed under Next Steps) were completed on May 17.
  • Items 5-7 will be completed within the next 1-3 days and will increase capacity and improve responsiveness significantly.
  • We can do all but #7 without any downtime, and #7 may be a zero-downtime option as well, or at least require only a very short outage. We will provide alerts if there is scheduled downtime.
 

Comments

Paul Andersen
Computer Solutions Sales and Service

Thanks for the update and explanation

May 18, 2011 02:01 am
Eric Peterson
Peterson Computer Consulting

I greatly appreciate the in-depth explanation of what happened and what is being done to make sure it doesn't happen again. Most vendors do not provide this kind of service.

May 18, 2011 09:21 pm
Cory
Pronto Marketing

Thanks for the positive feedback everybody. Since this post we've made great strides in the right direction for improved performance and stability. Later this week we will be implementing more caching configurations and a global CDN.

May 29, 2011 01:10 pm