My post on How to Fix Obamacare generated plenty of feedback – some public and some via email. One of the emails reinforced the challenge of “traditional software development” vs. the new generation of “Agile software development.” I started experiencing, and understanding, agile in 2004 when I made an investment in Rally Software. At the time it was an idea in Ryan Martens’ brain; today it is a public company valued around $600 million, employing around 400 people, and pacing the world of agile software development.
The email I received described the challenge of a large organization when confronted with the kind of legacy systems – and traditional software development processes – that Obamacare is saddled with. The solution – an agile one – just reinforces the power of “throw it away and start over” as an approach in these situations. Enjoy the story and contemplate whether it applies to your organization.
I just read your post on Fixing the Obamacare site.
It reminds me of my current project at my day job: the backend infrastructure that handles all the Internet connectivity and services for a worldwide distributed technology, built by a team of 150 engineers overseas. The infrastructure is extremely unreliable and since there’s no good auditability of the services, no one can say for sure, but estimates vary from a 5% to 25% failure rate of all jobs through the system. For three years management has been trying to fix the problem, and the fix is always “just around the corner”. It’s broken at every level, from the week-long deployment process and the 50% failure rate for deploys to the inability to scale the service.
I’ve been arguing for years to rebuild it from scratch using modern processes (agile), modern architecture (decoupled web services), and modern technology (Rails), and everyone has said “it’s impossible and it’ll cost too much.”
I finally convinced my manager to give me and one other engineer two months to work on a rearchitecture effort in secret, even though our group has nothing to do with the actual web services.
Starting from basic use cases, we architected a new, decoupled system from scratch, and chose one component to implement from scratch. It corresponds roughly to 1/6 of the existing system.
In two months we were able to build a new service that:
- scales to 3x the load with 1/4 the servers
- operates at seven 9s reliability
- deploys in 30 seconds
- was implemented by 2 engineers, compared to an estimated 25 for the old system
Suddenly the impossible is not just possible, it’s the best path forward. We have management buy-in, and they want to do the same for the rest of the services.
But no amount of talking would have convinced them after three years of being entrenched in the same old ways of doing things. We just had to go build it to prove our point.
Now that our federal government is back at work and the short-term debt ceiling thing is resolved, it should be no surprise that the news cycle is now obsessed with Obamacare and its flawed implementation. Over the weekend I must have seen a dozen articles about this online and in the NY Times, and then I woke up this morning to a bunch of new things about the Healthcare.gov site’s underlying tech, how screwed up it is, and how the Health and Human Services agency plans to fix it.
The punch line – a tech surge.
To ensure that we make swift progress, and that the consumer experience continues to improve, our team has called in additional help to solve some of the more complex technical issues we are encountering.
Our team is bringing in some of the best and brightest from both inside and outside government to scrub in with the team and help improve HealthCare.gov. We’re also putting in place tools and processes to aggressively monitor and identify parts of HealthCare.gov where individuals are encountering errors or having difficulty using the site, so we can prioritize and fix them. We are also defining new test processes to prevent new issues from cropping up as we improve the overall service and deploying fixes to the site during off-peak hours on a regular basis.
From my perspective, this is exactly the wrong thing to do. Many years ago I read Frederick Brooks’ iconic book on software engineering – The Mythical Man-Month. One of his key messages is that adding software engineers to an already late project will just delay it more. I like to take a different approach – if a project is late, take people off the project, shrink the scope, and ship it faster.
I think rather than a tech surge, we should have a “tech retreat and reset.” There are four easy steps.
1. Shut down everything including taking all the existing sites offline.
2. Set a new launch date of July 14, 2014.
3. Fire all of the contractors.
4. Hire Harper Reed as CTO of Healthcare.gov, give him the ball and 100% of the budget, and let him run with it.
If Harper isn’t available, ask him for three names of people he’d put in charge of this. But put one person – a CTO – in charge. And let them hire a team – using all the budget for individual hires, not government contractors or consulting firms.
Hopefully the government owns all the software even though Healthcare.gov apparently violates open source licenses. Given that, the new CTO and his team can quickly triage what is useful and what isn’t. By taking the whole thing offline for nine months, you aren’t in the hell of trying to fix something while it’s completely broken. It’s still a fire drill, but you are no longer inside the building that is burning to the ground.
It’s 2013. We know a lot more about building complex software than we did in 1980. So we should stop using approaches from the 1980s, admit failure when it happens, and hit reset. Doing a “tech surge” will only end in more tears.