Pronto Update: Key Lessons Learned and Next Steps

Cory Brown
posted this on May 24 11:29

We’ve been up and stable for several days now and don’t expect any more downtime. We are still looking into the root cause, most likely an FTP compromise. The initial steps to clean things up, stabilize the server and put tighter security in place were completed a few days ago. And today in Bangkok life is quickly returning to normal; transit is working, businesses and schools are opening, and for the first time in a week 100% of the team is here in the office. Best Monday morning I’ve had in a long, long time.

We did lose a week of onsite production time and, needless to say, are a bit behind with all these events. We’ll bear down over the next few days and catch up – much of the editorial and design work continued; we just have a backlog of updates to do and requests to respond to. A little patience would be appreciated – but don’t hesitate to let us know if it’s urgent.

I do want to share some quick thoughts on Key Lessons Learned and the Next Steps we’ll be taking to reduce the likelihood of this happening again and to respond more effectively should it happen.

Communication and emergency planning

  • Analysis: Our “war room” approach was lacking. As a small company with all of us sitting in one large room, you wouldn’t think communication would be a problem; however, at some crucial junctures it was. It reminded me of the analysis of flight data recorders and the role that small mistakes in communication with and within the cockpit crew can play in the safety of a flight.
    • This was exacerbated by the situation in Bangkok – we all think we can multitask, but the evidence is that most of us don’t do it as well as we think. It was unfortunate timing that had us in the middle of a problem with staff working at home or leaving early for curfew, plus the overall distraction of events around us.
  • Next Steps: We’re going to develop an emergency check-list so that, should something like this happen again, we have a more structured approach to assessing the situation, choosing a solution, assigning roles and setting response time frames. We’ll think this through clearly now so that next time we’re in the heat of battle, we’re more focused and systematic. If there are external events, it will help ensure we stay on plan.

Pay attention to warning signs

  • Analysis: We had a few days’ advance warning of this: a few pages on one site triggered the Google warning, and that should have been an alarm. At that time it was on a static page of client-uploaded content, not via WordPress. We reacted too slowly while we “looked into it”.
    • We did determine that while this was the first manifestation, the source was not this user’s upload.
  • Next Steps: The emergency check-list above will be invoked as soon as there’s any sign of a vulnerability on the sites. We’ll get our security consultant involved and make sure we react quickly.

Support infrastructure

  • Analysis: When we needed 24x7 phone support we didn’t have it. We got it in place, but too late. Our clients need a number to call for critical issues such as a server being down. The monitoring we have didn’t perform as expected and isn’t at the level that ultimately gets someone to address a critical issue when needed. We’re an around-the-clock business with clients in six different countries and need the support to deal with this.
  • Next Steps: We have a few ideas and will implement something over the next few weeks that provides 24x7 emergency response. If you have ideas based on how you do this in your own small company, we’d love to hear from you.
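
To make the monitoring gap concrete, here is a minimal sketch of an escalation rule: the longer a site keeps failing its checks, the more people get alerted, so that a critical issue eventually reaches someone who can act on it. This is an illustration only, not what we run; the thresholds, the contact labels and the function names are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_failures: int  # consecutive failed checks before this step fires
    contact: str         # hypothetical contact label, not a real address


# Hypothetical policy: thresholds and contacts are illustrative assumptions.
POLICY = [
    EscalationStep(after_failures=1, contact="on-call email"),
    EscalationStep(after_failures=3, contact="on-call phone"),
    EscalationStep(after_failures=6, contact="backup phone"),
]


def count_trailing_failures(results: list[bool]) -> int:
    """Count the failed checks (False) at the end of the check history."""
    count = 0
    for ok in reversed(results):
        if ok:
            break
        count += 1
    return count


def contacts_to_alert(consecutive_failures: int, policy=POLICY) -> list[str]:
    """Return every contact whose failure threshold has been reached."""
    return [
        step.contact
        for step in policy
        if consecutive_failures >= step.after_failures
    ]
```

With a check history of `[True, False, False, False]`, `count_trailing_failures` gives 3, and `contacts_to_alert(3)` returns the email and phone contacts but not yet the backup. The point of the design is that an unanswered alert is never the last step.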

Having the right people with the right skills

  • Analysis: We’ve used a few different people at different times for server administration tasks beyond our capabilities, but we’ve never had the full set of skills we need on a consistent basis through a retainer or managed services contract. The server is under a managed approach by Rackspace, and they were outstanding within the context of that, but there is a large set of responsibilities that falls outside their scope. This is a gap that is painful to fill in an emergency – I’m sure you’ve all gotten the panicked calls and know all too well.
  • Next Steps: We now have an experienced Linux server administrator on contract. He’s going to continue to work on the server over the next week to harden our security and ensure that our business and production processes are as secure as possible. We’ll also engage him in the emergency check-list development above.
    • In addition, he’s going to help with performance tuning of the server. There are a number of things we know can be done that have been beyond our capabilities and that we can now start working on.

We apologize for the events of the past week and are dedicated to seeing that they don’t happen again. If you want to add any suggestions or additions to the list above, please let me know. Thanks to all of you for your understanding, patience and the many supportive comments.

Regards,
Derek