Derek
posted this on May 17, 2011 08:26 pm
It’s our policy to always be open and transparent with any issues that impact your website. Here's our report on the May 17 downtime.
What Happened
All websites on Pronto servers were unavailable for the time period below. System issues ultimately led to the failure; however, a lack of sufficient monitoring, including monitors on the monitors, turned what would have been a rectifiable system issue in our new infrastructure into a system failure.
The outage itself seems to stem from a misconfiguration of parts of the system, but we're investigating further to see if there was a possible external cause (more on that later). The outage wasn't detected before there was downtime because we did not have proper monitoring or adequate alerting in place in our new infrastructure.
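To illustrate what "monitors on the monitors" can mean in practice, here is a minimal sketch of a heartbeat check: each monitor records when it last reported, and a separate watchdog flags any monitor that has gone quiet. The function name, the monitor names, the five-minute limit, and the db-monitor timestamp are all assumptions made for the example; this is not a description of our actual tooling.

```python
from datetime import datetime, timedelta, timezone

def stale_monitors(last_heartbeats, now, max_age=timedelta(minutes=5)):
    """Return the names of monitors whose last heartbeat is older than max_age."""
    return sorted(
        name for name, seen in last_heartbeats.items() if now - seen > max_age
    )

# Illustrative data only: the monitor names and the db-monitor time are hypothetical.
now = datetime(2011, 5, 17, 1, 0, tzinfo=timezone.utc)
heartbeats = {
    "http-monitor": datetime(2011, 5, 17, 0, 42, tzinfo=timezone.utc),
    "db-monitor": datetime(2011, 5, 17, 0, 58, tzinfo=timezone.utc),
}
print(stale_monitors(heartbeats, now))  # prints ['http-monitor']
```

The point of the design is that the watchdog alerts on silence rather than on a reported error, so a monitor that stops reporting is itself treated as an incident.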
The initial failure seems to be the result of first the database server, then the app server, starting to swap memory to disk. This made the system unresponsive until the DB finally stopped accepting connections at all, resulting in a complete outage. Oddly enough, traffic had peaked a little after 00:00 UTC and was gradually declining. Unfortunately, at 0:42 UTC we lost monitoring of the HTTP service, so we cannot tell right away whether traffic remained stable or whether a sudden burst, perhaps caused by spam bots (a possible external cause), brought the system down. We are reviewing the logs of the reverse proxy and HTTP servers to see if any unusual traffic appeared at that time.
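Since swapping was the first visible symptom, a swap-usage check is the kind of early warning that could catch this sooner. The sketch below assumes Linux servers exposing /proc/meminfo (the post does not say which operating system is in use), and the 25% threshold and function names are hypothetical choices for illustration.

```python
SWAP_ALERT_THRESHOLD = 0.25  # hypothetical: alert when 25% of swap is in use

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values

def swap_used_fraction(meminfo):
    """Return the fraction of swap in use, or 0.0 if no swap is configured."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - meminfo.get("SwapFree", 0)) / total

def swap_alert(text, threshold=SWAP_ALERT_THRESHOLD):
    """Return True when swap usage is at or above the threshold."""
    return swap_used_fraction(parse_meminfo(text)) >= threshold

# Made-up sample values, not figures from the incident.
sample = "MemTotal: 8000000 kB\nSwapTotal: 2000000 kB\nSwapFree: 1000000 kB\n"
print(swap_used_fraction(parse_meminfo(sample)))  # prints 0.5
```

On a live Linux host the same functions could be fed the contents of /proc/meminfo, for example `swap_alert(open("/proc/meminfo").read())`, run on a schedule by whatever alerting system is in place.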
Timeline
Tuesday May 17 (May 16 for the US)
0:42 UTC - DB Server & App Server start swapping memory to disk and lose responsiveness - system running at greatly reduced capacity
1:49 UTC - Last connection to DB server before it ceases responding completely - system offline
2:47 UTC - Database reset restores service - system online
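For anyone who wants the durations implied by the timeline above, the arithmetic can be sketched as follows. The three timestamps come straight from the timeline; the helper and variable names are just for illustration.

```python
from datetime import datetime, timezone

def utc(hour, minute):
    """Build a UTC timestamp on May 17, 2011 (the date of the incident)."""
    return datetime(2011, 5, 17, hour, minute, tzinfo=timezone.utc)

def minutes(delta):
    """Convert a timedelta to whole minutes."""
    return int(delta.total_seconds() // 60)

degraded_start = utc(0, 42)  # servers start swapping, reduced capacity
offline_start = utc(1, 49)   # last DB connection, system offline
restored = utc(2, 47)        # database reset restores service

degraded_minutes = minutes(offline_start - degraded_start)
offline_minutes = minutes(restored - offline_start)
total_minutes = minutes(restored - degraded_start)

print(degraded_minutes, offline_minutes, total_minutes)  # prints 67 58 125
```

In other words, the system ran degraded for 67 minutes and was fully offline for 58 minutes, for 125 minutes of impact in total.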
Root Cause of Downtime
Other factors that may have reduced capacity and contributed to the failure include:
Next Steps
Timeline to Implement Changes
Comments
Thanks for the update and explanation
I greatly appreciate the in-depth explanation of what happened and what is being done to make sure it doesn't happen again. Most vendors do not provide this kind of service.
Thanks for the positive feedback everybody. Since this post we've made great strides in the right direction for improved performance and stability. Later this week we will be implementing more caching configurations and a global CDN.