Summer is usually a quiet time. Many people are on holiday, customer projects slow down, and not much new gets started. However, there are also problems typical for this time of year: something breaks, the people responsible are on holiday, and it becomes apparent how brittle some processes are that normally run smoothly.
This happened to us last week.
Wrapping things up before their holiday, one backend developer committed a change right before leaving. The CI/CD pipeline went through, our internal development systems were updated successfully, and the developer went on holiday.
But: When the frontend team started work the next day, the creation of CDEs (Cloud Development Environments) failed. The backend was stuck in a crash loop.
Unfortunately, our backend team consists of only two developers, and both of them are on holiday right now. So the frontend team (also two developers 🙂) tried to troubleshoot this themselves.
I became aware of this at the standup at 11 a.m., by which time both frontend developers had spent their entire morning unsuccessfully trying to debug CDE creation.
The three of us spent an hour debugging together, after which we concluded that they would simply have to work “blindly and slowly” – working on the code without being able to do a local build. They would still be able to check their changes in the development instance, but it takes the CI/CD pipeline about 15 minutes to finish, which is far too long for a frontend developer to wait for feedback on their work. At least it is for our team, who have gotten used to instant hot reloads on their CDEs.
Previously, all developers built and ran Cloudomation locally, but since we started using CDEs, this knowledge has gathered dust and newer developers have never even tried. We concluded that getting a local deployment working would probably take them longer than just waiting for the backend devs to return from their holiday.
With some hints from one backend dev, generously provided via Slack from their holiday, I was able to hack together a workaround in the afternoon. CDE creation worked again at 4 p.m. So we lost most of a day for two frontend developers, and half a day of my own time.
In total, the damage was limited. Nevertheless, it showed me a few things:
- We have become very dependent on CDEs very quickly. This shows how useful they are, but it has also made our CDE deployment automation a single point of failure.
- We have a few ground rules that generally work well, but were problematic in this instance:
  - Move forward. If necessary, we do rollbacks for customer systems, but not for our internal systems, and we generally avoid going back to old versions / commit hashes or reverting problematic changes, preferring instead to fix root causes right away. This is why we have no automated way of going back to an older version of some components in our CDEs (yet).
  - It’s okay to break internal development systems. That’s what they’re there for. They are intended as safe places that don’t run any production processes, they are not expected to be up, and everybody can break them without blocking other people’s work. I learned only now that this is only true for changes that happen during the day. Anything that is committed at the end of the day is pulled into the CDE template snapshots during the nightly build and is propagated to all CDEs created in the morning. During the day, developers can simply choose not to pull changes from other teams’ repositories if something is broken, and can continue to work in their isolated CDEs. But if the template is broken, it affects everybody.
  - Not everybody should have to know how deployment works. That’s one of the main points of using CDEs: removing the complexity of deployment from the already much too full plates of software developers. It’s great that we’re in a place where this is true – but it requires our CDE deployment automation to work.
- We have no self-service workarounds in place that developers can use to unblock themselves without having to debug issues introduced by other teams. For example, keeping the CDE template from the previous day and being able to switch to it would have been an easy fix – but we throw away the old template when we create a new one (see the sketch after this list). The templates are all supposed to be ephemeral anyway, and we can rebuild them from scratch at any time – but that is only true if no major component is broken while the people who can fix it are all on holiday.
- It’s extremely valuable to have all-rounders (or hackers) in the team who can jump in and debug (i.e. build sloppy workarounds) to unblock other people. My experience is that such people are often not the best software engineers (that’s why I’m CEO :P) but can be invaluable in a crisis.
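To make the “keep yesterday’s template” idea concrete, here is a minimal sketch of what such a fallback could look like. This is not Cloudomation’s actual API – the storage location, symlink layout and function names are all made up for illustration – but it shows the core idea: the nightly build repoints a “latest” pointer and preserves the old template as “previous”, so developers can opt into the fallback when creating a CDE.

```python
"""Sketch of a nightly CDE template publish step that keeps a fallback.

All paths and names here are hypothetical, not the real Cloudomation API.
"""
import shutil
from datetime import date
from pathlib import Path

TEMPLATE_ROOT = Path("/var/lib/cde-templates")  # hypothetical storage location


def publish_nightly_template(build_dir: Path) -> Path:
    """Publish tonight's template, but keep yesterday's as 'previous'."""
    TEMPLATE_ROOT.mkdir(parents=True, exist_ok=True)
    target = TEMPLATE_ROOT / f"template-{date.today().isoformat()}"
    shutil.copytree(build_dir, target)

    latest = TEMPLATE_ROOT / "latest"
    previous = TEMPLATE_ROOT / "previous"

    # Before repointing 'latest', preserve what it pointed to as 'previous',
    # so a broken nightly build never leaves developers without a working template.
    if latest.is_symlink():
        old_target = latest.resolve()
        if previous.is_symlink() or previous.exists():
            previous.unlink()
        previous.symlink_to(old_target)
        latest.unlink()
    latest.symlink_to(target)
    return target


def template_for_new_cde(use_fallback: bool = False) -> Path:
    """Developers pass use_fallback=True to unblock themselves without debugging."""
    name = "previous" if use_fallback else "latest"
    link = TEMPLATE_ROOT / name
    if not link.exists():
        raise FileNotFoundError(f"No '{name}' CDE template published yet")
    return link.resolve()
```

In practice, this would hang off whatever the CDE creation flow already offers (a CLI flag or a checkbox in the UI), so that switching to yesterday’s template is a one-click action rather than a debugging session.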
We’ve been using CDEs for a bit more than six months now. That’s long enough to have lost a lot of knowledge about how to work without them, but not long enough to have ironed out all the issues that come with using them. Looking at it this way, it is surprising that this was the first time two developers were blocked for most of a day by CDE unavailability.
In the end, I concluded that it is good that the backend team was on holiday. If they’d been here, they’d have fixed it quickly and I would probably not even have become aware. Now I am, and will make sure that we remove this dependence on the backend team by providing “easy fixes” that any developer can apply themselves.