Every major software or web company I’ve ever been involved in has had a catastrophic outage of some sort. I view it as a rite of passage – if it happens when your company is young and no one notices, it gives you a chance to get better. But eventually you’ll have one when you are big enough for people to notice. How you handle it and what you learn from it speaks volumes about your future.
Last week, two companies that we are investors in had shitty experiences. SendGrid’s was short – it only lasted a few hours – and was quickly diagnosed. BigDoor’s was longer and took several days to repair and get things back to a stable state. Both companies handled their problems with grace and transparency – announcing that all was back to normal with a blog post describing in detail what happened.
- BigDoor: Recovery Retrospect
- SendGrid: Postmortem: Aug. 25 2010 Delivery Delays
While you never ever want something like this to happen, it’s inevitable. I’m very proud of how both BigDoor and SendGrid handled their respective outages and know that they’ve each learned a lot – both in how to communicate about what happened as well as in ensuring that this particular type of outage won’t happen again.
In both cases, they ended up with 100% system recovery. In addition, each company took responsibility for the problem and didn’t shift the blame to a particular person. I’m especially impressed with how my friends at BigDoor processed this, since the root cause of the problem was a mistake by a new employee. They explain this in detail in their post and end with the following:
“Yes, this employee is still with us, and here’s why: when exceptions like this occur, what’s important is how we react to the crisis, accountability, and how hard we drive to quickly resolve things in the best way possible for our customers. I’m incredibly impressed with how this individual reacted throughout, and my theory is that they’ll become one of our legendary stars in years to come.”
I still remember the first time I was ever involved in a catastrophic data loss. I was 17 and working at Petcom, my first real programming job. It was late on a Friday night and I got a call from a Petcom customer. I was the only person around so I answered the phone. The person was panicked – her hard drive had lost all of its data (it was an Apple III ProFile hard drive – probably 5 MB). She was the accounting manager and was trying to run some process but couldn’t get anything to work. I remember figuring out that the hard drive itself seemed fine but that she had deleted all of her data. Fortunately, Petcom was obsessive about backups and made all of their clients buy a tape drive – in this case, one from Tallgrass (I vaguely remember that they were in Overland Park, KS – I can’t figure out why I remember that.)
After determining that the tape drive software was working and available, I started walking her through restoring her data. She was talking out loud as she brought up the tape drive menu and started clicking on keys before I had a chance to say anything, at which point she pressed the key to format the tape that was in the drive. I sat in shock for a second and asked her if she had another backup tape. She told me that she didn’t – this was the only one she ever used. I asked her what it said on the screen. She said something like “formatting tape.” I asked again if there was another backup tape. Nope. I told her that I thought she had just overwritten her only backup. Now, in addition to having deleted all of her data, she had wiped out her backup. We spent a little more time trying to figure this out, at which point she started crying. I doubt she realized she was talking to a 17 year old. She eventually calmed down but neither of us knew what to do next. Eventually the call ended and I went into the bathroom and threw up.
I eventually got in touch with the owner of Petcom (Chris) at his house, who told me to go home and not worry about it – they’d figure it out over the weekend. I can’t remember the resolution, but I think Chris had a backup for the client from the previous month, so they only lost a month or so worth of data. But that evening made an incredible impression on me. Yes, I finished the evening with at least one illegal drink (since the drinking age at the time in Texas was 18.)
It’s 28 years later and computers still crash, backups are still not 100% failsafe, and the stress of massive system failure still causes people to go into the bathroom and throw up. It’s just part of how this works. So, before you end up in pain, I encourage you to think hard about your existing backup, failover, and disaster recovery approaches. And, when the unexpected, unanticipated, unaccounted-for thing happens, make sure you communicate continually and clearly about what is going on, no matter how painful it might be.
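If you want a concrete place to start, here’s a minimal sketch – in Python, with made-up paths (DATA_DIR and BACKUP_PATH are hypothetical, purely for illustration) – of the discipline that would have saved that accounting manager: never count a backup as done until you’ve restored it somewhere disposable and checked it against the original.

```python
#!/usr/bin/env python3
"""Minimal backup-and-verify sketch: a backup you haven't test-restored is a hope, not a backup."""

import hashlib
import tarfile
import tempfile
from pathlib import Path

DATA_DIR = Path("/var/data/accounting")               # hypothetical directory to protect
BACKUP_PATH = Path("/mnt/backups/accounting.tar.gz")  # hypothetical backup target


def checksum(path: Path) -> str:
    """SHA-256 of a single file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def make_backup() -> None:
    """Write a compressed archive of DATA_DIR to BACKUP_PATH."""
    with tarfile.open(BACKUP_PATH, "w:gz") as tar:
        tar.add(DATA_DIR, arcname=DATA_DIR.name)


def verify_backup() -> bool:
    """Restore the archive into a scratch directory and compare every file's checksum."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(BACKUP_PATH, "r:gz") as tar:
            tar.extractall(scratch)
        restored_root = Path(scratch) / DATA_DIR.name
        for original in DATA_DIR.rglob("*"):
            if original.is_file():
                restored = restored_root / original.relative_to(DATA_DIR)
                if not restored.exists() or checksum(original) != checksum(restored):
                    return False
    return True


if __name__ == "__main__":
    make_backup()
    if verify_backup():
        print("Backup written and verified against a test restore.")
    else:
        raise SystemExit("Backup verification FAILED - do not trust this archive.")
```

The point isn’t this particular script – it’s that the restore-and-verify step runs every single time, automatically, so the first real test of your backup isn’t the night everything is on fire.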