An increasing number of companies we are investors in are focused on DevOps. A year or so ago I read an early draft of a new book titled The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win. I really enjoyed it and asked Gene Kim, one of the authors, to write a guest post on DevOps. He wrote it a while ago and it has sat in my draft queue waiting for the perfect moment to emerge. That moment is now. Following is a guest post on DevOps by Gene Kim, Multiple Award-Winning CTO, Researcher, Visible Ops Co-Author, Entrepreneur & Founder of Tripwire.
Since 1999, my passion has been studying high performing IT organizations. On this journey, we benchmarked 1,500 IT organizations to understand what differentiated the highest performing organizations and allowed them to do what the others only dreamed of. Our findings went into a book that we published in 2004 called The Visible Ops Handbook, which described how these organizations made their “good to great” transformation.
Since then, this journey has taken me straight into the heart of the DevOps movement. Although I initially dismissed DevOps as just another marketing fad, my friend John Willis corrected me, in the way that only true friends can do, saying, “Don’t be dense. DevOps finally proves how IT can be a strategic advantage that allows a business to beat the pants off the competition. This is the moment we’ve all been waiting for.”
In that moment, I saw the light. Over the years, I’ve come to believe with moral certainty that everyone needs DevOps now, especially software startups, where the successful execution of Development and IT Operations preordains success or failure.
Today, we can see how DevOps patterns enable organizations like Etsy, Netflix, Facebook, Amazon, Twitter and Google to achieve levels of performance that were unthinkable even five years ago. They are doing tens, hundreds or even thousands of code deploys per day, while delivering world-class stability, reliability and security.
DevOps refers to the emerging professional movement that advocates a collaborative working relationship between Development and IT Operations, resulting in the fast flow of planned work (i.e., high deploy rates), while simultaneously increasing the reliability, stability, and resilience of the production environment.
The culture and practices that enable DevOps to happen cannot be delegated away. In a growing startup where teams start to specialize and multiply, the chaos of daily work often starts to slow down the smooth flow of work between Development and IT Operations, sometimes even resulting in outright tribal warfare.
In this blog post, I’ll describe what this downward spiral looks like, and what everyone in the company must do to break this destructive pattern and ensure that Development and IT Operations work together in a way that creates such a competitive advantage that it may almost seem unfair.
There is currently a core, chronic conflict that exists in almost every IT organization. It is so powerful that it practically pre-ordains horrible outcomes, if not abject failure. It happens in both large and small organizations, for-profit and non-profit, and across every type of industry.
In fact, this destructive pattern is the root cause of one of the biggest problems we face as an industry. But, if we can beat it, we’ll have the potential to generate more economic value than anything we’ve seen in the previous 30 years.
I’m going to share with you what this destructive pattern is in three acts that will surely be familiar to you. (You can get the whole story in my book, The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win.)
Act I begins with IT Operations, where we’re supporting a large, complex revenue generating application. The problem is that everyone knows that the application and supporting infrastructure are… fragile.
How do we know? Because every time anyone touches it, it breaks horrifically, causing an epic amount of unplanned work for everyone.
The shameful part is how we find out about the outage: Instead of through an internal monitoring tool, it’s a salesperson calling, saying, “Hey, Gene, something strange is happening. Our revenue pipeline stopped for two hours.” Or, “the banner ads in my market are being served upside down and in Spanish.”
There are so many moving parts that it takes way too long to figure out what caused the problem du jour, which means we’re spending more and more time on unplanned work and increasingly unable to get our planned work done.
Eventually, our ability to support the most important applications and business initiatives goes down. When this happens, the organization suddenly finds itself unable to achieve the promises and commitments made to the outside world, whether it’s investors, customers, analysts or Wall Street.
Promised features aren’t delivered on time, market share isn’t going up, average order sizes are going down, specific revenue goals are being missed…And that’s when something really terrible happens.
In Act II, everyone’s life gets worse when the business starts making even bigger promises to the people we let down, to compensate for the promises we previously broke. Often, the entire organization starts dreaming up bigger, bolder features that are sure to dazzle the marketplace, but without a firm grasp on what technology can and can’t do, or a full understanding of what caused us to miss our commitments in the first place.
Enter the Developers. They start seeing more and more urgent date-driven projects put in the queue, often requiring things that the organization has never done before. Because the date can’t be moved (because of all those external promises made), everyone has to start cutting corners.
Development must focus on getting the features done, so the corners that get cut are all the non-functional requirements (e.g., manageability, scalability, reliability, security, and so forth). This means that technical debt starts to increase. And that means increasingly fragile infrastructure in production.
It is called “technical debt” for a reason—because technical debt, like financial debt, compounds.
When technical debt begins to accumulate, something very insidious starts happening. Our deployments start taking longer. What used to take an hour now takes three hours, then a day, then two days—which is okay, because we can still get it done in a weekend. But then it takes three days, and then a week, then two weeks!
Our deployments become so expensive and so difficult that the business says that we have to lengthen the deployment intervals, which goes against all our instincts and training. We know that we need to shrink the batch sizes, not make them bigger, because large changes make for larger failures.
The flow of features slows to a trickle, the deployments take even longer, more things go wrong, and because of all the moving pieces, issues take even longer to diagnose. Our best Dev and Ops people are spending all their time firefighting, and blaming each other when things go wrong.
I’m guessing that most of you can relate to at least some portion of this story. As I said, this happens in large enterprises and growing startups alike. In my fifteen years of research in this area, I’ve found almost all IT professionals have experienced this cycle.
We know that there must be a better way, right? DevOps is the proof that it’s possible to break the core, chronic conflict, so we can deliver a fast flow of features without causing chaos and disruption to the production environment.
When John Allspaw and Paul Hammond gave their seminal “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr” presentation at the 2009 Velocity Conference, people were shocked and amazed, if not outright fainting in the aisles at the audaciousness of their achievement.
It wasn’t a fluke. Other organizations such as Facebook, Amazon, Netflix and the ever-growing DevOps community have replicated their performance, doing hundreds, and even thousands, of deployments per day. DevOps is not only for large, established companies. It’s for any company where the achievement of business goals relies upon both Development and IT Operations. These days, that means almost every company.
We all need to be putting DevOps-like practices into place. This is why Kevin Behr, George Spafford, and I wrote The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win.
A novel, you might ask? How is a novel going to solve my problems?
As a friend once told me, “Before you can solve a complex problem, you must first have empathy for the other stakeholders. And storytelling is the most effective means of creating a shared understanding of the problem.”
Dr. Eliyahu Goldratt demonstrated the power of a novel as a teaching tool through his book, The Goal: A Process of Ongoing Improvement. It’s a novel written in the 1980s about a plant manager who has 90 days to fix his cost and due date issues or his plant will be shut down. When I read this book nearly 15 years ago, I knew that this story was important, and that there was much I needed to learn, even though I never managed or worked in a manufacturing plant.
It isn’t an overstatement to say that The Goal and Dr. Goldratt’s Theory of Constraints changed my life—in fact, it probably was one of the biggest influences on my professional thinking. For eight years, my co-authors and I wanted to write The Phoenix Project, because we all believed that IT is not a department, but a strategic capability that every business must have.
As you can imagine, I was incredibly honored and thrilled when Jez Humble, author of the award-winning book Continuous Delivery, recently told me, “This book is a gripping tale that captures brilliantly the dilemmas that face companies which depend on IT. The Phoenix Project will have a profound effect on IT, just as The Goal did for manufacturing.”
For those of you who are looking for some places to start your DevOps journey, here are my three favorite DevOps patterns:
If you enjoyed this taste of DevOps and believe it can help achieve your goals, “The Phoenix Project” is available now, or you can download a free 170-page excerpt of the book. And of course, you can always find the latest writings on DevOps at the IT Revolution blog, where you can get our free whitepaper “The Top 11 Things You Need To Know About DevOps.”
Long live DevOps!
My friends at Standing Cloud have closed another $3 million financing from us (Foundry Group) and Avalon Ventures. They’ve also added a long time friend, co-investor, and amazing entrepreneur Will Herman to the board.
Standing Cloud is a great example of how one of our funding strategies plays out. We are the seed investors and have been working closely with the company since inception. They’ve built an incredibly deep product around a very specific aspect of the broad Cloud computing ecosystem. When they started, much of Cloud computing was total noise and marketing baloney. While there’s still plenty of that in the system, many of the products and services are maturing, and the particular segment Standing Cloud has gone after has suddenly become incredibly important to a large number of hosting, managed services, and cloud providers (often the same thing) not named Amazon. Specifically:
Standing Cloud provides a seamless application layer for cloud providers, making application deployment and management fast, simple and hassle-free for their customers. Standing Cloud’s standard application catalog includes 100 open-source and commercial applications; its Platform-as-a-Service (PaaS) capabilities support multiple programming languages, including Rails, PHP, Java and Python, and a wide range of cloud service providers and orchestration software systems.
If you are a hosting, managed service provider, or building a cloud service (public or private), you have three choices. The first is to ignore this stuff (dumb). The second is to try to build it all yourself and keep pace with Amazon (good luck). The third is to use Standing Cloud.
If you want an intro, just email me and ask.
We made a seed investment at the end of last year in a company called Standing Cloud. They are working on a set of application development and deployment tools for cloud computing that build on some of the ideas in the posts What Happened To The 4GL? and Clouditude. They’ve spent the first quarter of 2009 testing out a series of hypotheses we had concerning issues with Cloud Computing, as well as building a first release of their cloud deployment tool.
Things in cloud computing land are much worse than our hypotheses suggested, and those hypotheses were already suitably cloudy. Following is a simple example of a real issue that even a non-techie will be able to understand.
When starting cloud server instances via an API, it’s important to realize that the instance may become "available" from the perspective of the cloud provider system before all of its standard services are running. In particular, if your automated process will connect to the instance via ssh or http, it will be at least a few seconds after the instance appears before the applicable service is available. This problem generally does not arise if you start servers manually, because by the time you see that it is running, copy the IP address or domain name, and type the command or browser URL to connect, the services are usually ready. Possible solutions include:
- Wait a safe, predetermined amount of time. This is the simplest, but obviously is not robust.
- Attempt to open a socket on the applicable port (e.g., 22 or 80), and do so in a loop, with a brief sleep between attempts. These attempts will fail until the service starts, and you should have the loop time out after some period in case there is a deeper problem with the instance.
- In a similar loop, directly attempt to connect to the service. Depending on the SSH API you are using, this may have additional overhead or abstraction that is better avoided, but it is robust and likely to work.
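The port-polling approach in the second bullet can be sketched in a few lines of Python. This is a minimal illustration, not Standing Cloud's actual code; the function and parameter names are mine.

```python
# Poll a TCP port in a loop, sleeping between attempts, and give up
# after a deadline in case there is a deeper problem with the instance.
import socket
import time

def wait_for_port(host, port, timeout=120.0, interval=2.0):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True  # the service is accepting connections
        except OSError:
            time.sleep(interval)  # not up yet; back off briefly and retry
    return False
```

An automated deployment script would call something like `wait_for_port(instance_ip, 22)` after the cloud API reports the instance as running, and only then attempt the ssh connection.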
A related, but more insidious issue is that the sshd authentication services are not necessarily available as soon as sshd is available on the port. Thus it is possible to connect to the service and have authentication fail, even though everything should be in order. A sufficient wait period or a loop is once again the solution. However, if the loop wait period is not sufficient, you may attempt too many failed authentications and lock yourself out of the system. Thus this approach has no fully robust solution aside from an empirically safe wait period either prior to authentication or in the loop wait.
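One way to keep the authentication loop from locking you out is to cap the number of attempts well below the server's lockout threshold. The sketch below assumes a hypothetical `authenticate` callable standing in for whatever your SSH library's login call is, raising an exception while sshd's auth subsystem is still starting up.

```python
# Bounded authentication retries: a small attempt cap plus a generous
# wait between tries keeps us under typical lockout thresholds.
import time

class AuthFailure(Exception):
    """Raised by the (hypothetical) authenticate call when login fails."""

def auth_with_retries(authenticate, max_attempts=3, wait=10):
    """Call authenticate() up to max_attempts times, sleeping `wait`
    seconds between failures; re-raise after the final failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return authenticate()
        except AuthFailure:
            if attempt == max_attempts:
                raise  # give up rather than risk locking ourselves out
            time.sleep(wait)
```

As the post notes, the attempt cap and wait period are empirical: they have to be tuned against the target system's actual lockout policy, which is why this approach is never fully robust.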
Clearly these problems are tricky to diagnose if you are not aware of their idiosyncrasies.
A more robust but also more complex overall solution would be to incorporate a service on-board the instance that starts up at boot and checks the status of sshd from the inside, then makes an otherwise unused port available when the system is fully ready for connection and authentication. Of course, this requires that you can boot from a pre-configured image rather than a stock image, and it also requires that another port be open on the firewall.
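The on-board sentinel could look roughly like the following sketch: a script baked into the pre-configured image that polls sshd from inside the instance and only then opens an otherwise-unused “ready” port for outside callers to probe. The check and the port number are illustrative assumptions, not a prescribed design.

```python
# Readiness sentinel, run at boot from inside the instance: wait until
# sshd is genuinely accepting connections, then open a sentinel port
# that external orchestration can probe to know the instance is ready.
import socket
import time

def sshd_ready(port=22):
    """Check from inside the instance that sshd accepts connections."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=1):
            return True
    except OSError:
        return False

def run_sentinel(check=sshd_ready, ready_port=7070, poll=2):
    """Block until check() passes, then listen on ready_port; returns
    the listening socket (a real boot script would accept() forever)."""
    while not check():
        time.sleep(poll)
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", ready_port))
    server.listen(5)
    return server
```

The deployment script then polls the sentinel port instead of port 22, which sidesteps both the service-startup race and the authentication race, at the cost of a custom image and an extra firewall opening.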
The set of things like this is endless. In addition, each cloud computing environment has different nuances, and each stack configuration adds more complexity to the mix. There are so many fun puns that apply that I get giddy just thinking about all the storms ahead and the need for umbrellas.