How to work with shared dev clusters (and why) - Part II: Why backwards compatibility is overrated

In part I of this series, I wrote about the challenges that arise when developers who work on Kubernetes-based applications try to run all services locally: connecting local to remote services, cluster sharing, multi-tenant capabilities of remote services, versioning, and managing compatibility of services.

You also read about factorial complexities and why managing version compatibility is tedious when several services introduce breaking changes or when you need to support different versions of your software. With each dimension of variability, the number of possible constellations in which your software can exist grows very quickly. Each factor that makes one service incompatible with another means that this service has to exist in two (or more) different configurations.

Each factor that prevents one service from being accessed by several other services means that it has to exist once for each possible constellation of other services.

In part II, let’s first look at an example to understand what this means. Then I explain why you almost certainly already have this kind of complexity, and what the real costs of running everything locally are.

An example

Let’s consider an example to show how quickly the numbers can grow. For the sake of simplicity, I will consider only whether or not services are shareable, and how many versions need to be supported. I assume that none of the services are backwards compatible, so every supported version needs its own instances. Relaxing this assumption wouldn’t make much difference: even if one or two services were backwards compatible, it would only subtract a few instances from the final number.

Number of services	Shareable	Not shareable	Supported versions	Number of devs (or customers)	Minimum no of remote services*
6	6	0	1	100	6
6	6	0	2	100	12
6	6	0	5	100	30
6	6	0	10	100	60
6	5	1	1	100	105
6	3	3	1	100	303
6	0	6	1 (irrelevant)	100	600
6	3	3	2	100	306
6	3	3	5	100	315
6	3	3	10	100	330
6	0	6	10 (irrelevant)	100	600

* Depending on the performance of each service, several instances of each service may need to exist in order to service 100 developers, but they can scale down to this minimum number when there is low load.
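
The last column follows a simple pattern, sketched below in Python. The formula is an assumption inferred from the rows of the table, not something stated explicitly: each shareable service needs one instance per supported version, and each service that cannot be shared needs one instance per developer. The function and parameter names are made up for illustration.

```python
def min_remote_services(shareable: int, not_shareable: int,
                        versions: int, devs: int) -> int:
    """Minimum number of remote service instances (assumed model).

    - A shareable service needs one instance per supported version.
    - A non-shareable service needs one instance per developer,
      so the number of versions no longer matters for it.
    """
    return shareable * versions + not_shareable * devs


# A few rows from the table: (shareable, not shareable, versions, devs)
rows = [
    (6, 0, 1, 100),
    (6, 0, 5, 100),
    (5, 1, 1, 100),
    (3, 3, 5, 100),
    (0, 6, 10, 100),
]

for shareable, not_shareable, versions, devs in rows:
    total = min_remote_services(shareable, not_shareable, versions, devs)
    print(f"{shareable} shareable, {not_shareable} not shareable, "
          f"{versions} version(s): {total} services")
```

For the rows above, this prints 6, 30, 105, 315 and 600 services, matching the table.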

Don’t be afraid of breaking changes

What this example neatly shows is that version compatibility is really the lesser of the two problems. Each additional version you need to support adds only one more set of the shareable services (6 in our example). Supporting five different versions would require a still manageable minimum of 30 services. Even if half of the services were backwards compatible and could be shared between all versions, this would only reduce the final number to 18 (3 shared instances plus 3 × 5 version-specific ones, down from 30), showing that the value of backwards compatibility really isn’t that great.

The same principle applies to other factors besides versions that affect compatibility of services.

On the other hand, each unshareable service adds one instance per developer (or customer), and the number of developers is typically much larger than the number of services. Having just one service that is not shareable increases the minimum number of required services in our example by roughly 100 (from 6 to 105)!
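
To make the comparison concrete, here is a small sketch that measures the effect of one more supported version against the effect of one service becoming unshareable. It assumes the same counting model as the table (shareable services × versions + unshareable services × developers), which is inferred from the numbers rather than stated in the text. The function names are hypothetical.

```python
def min_remote_services(shareable, not_shareable, versions, devs):
    # Assumed model: one instance per version for shareable services,
    # one instance per developer for services that cannot be shared.
    return shareable * versions + not_shareable * devs


def extra_cost_of_one_more_version(shareable, not_shareable, versions, devs):
    """Additional instances needed when one more version is supported."""
    before = min_remote_services(shareable, not_shareable, versions, devs)
    after = min_remote_services(shareable, not_shareable, versions + 1, devs)
    return after - before


def extra_cost_of_one_unshareable_service(shareable, not_shareable,
                                          versions, devs):
    """Additional instances needed when one shareable service stops
    being shareable (it moves from one group to the other)."""
    before = min_remote_services(shareable, not_shareable, versions, devs)
    after = min_remote_services(shareable - 1, not_shareable + 1,
                                versions, devs)
    return after - before


# Baseline from the first table row: 6 shareable services, 1 version, 100 devs.
baseline = dict(shareable=6, not_shareable=0, versions=1, devs=100)
print(extra_cost_of_one_more_version(**baseline))         # 6
print(extra_cost_of_one_unshareable_service(**baseline))  # 99
```

Under this model, the cost of another version is bounded by the number of services, while the cost of an unshareable service scales with the size of the team.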

Considering the huge cost that backwards compatibility introduces into development, it would be a lot more sensible to let go of backwards compatibility and focus on introducing multi-tenant capability to a larger number of services.

Side note: This only looks at the complexities in development. Introducing breaking changes can have other consequences in production, such as customers needing to change their processes and integrations if a breaking change is introduced to the API. However, here I am talking about breaking changes to inter-service communication, not to external APIs (the two may or may not be the same).

You are not adding complexity, you are just moving it

You’d be forgiven for thinking that you don’t want this kind of complexity in your setup. The problem is: you most likely already have it. It is just that, right now, it is managed by individual developers. Each of them needs to figure out how to run their application locally while considering version compatibility and dealing with limited resources on their laptops.

When every developer is expected to run everything locally, this by default results in a full setup per developer. Nothing can be shared. In our example, this would be the maximum number of 600 services – each developer would have to run the full 6 on their laptop.

At the level of the individual developer, this kind of pain is often less noticeable because it is not a separate cost center. It is, however, a real cost that many companies pay daily: in time lost and frustration gained for their developers, and often in decreased software quality and production outages that are the consequences of not equipping developers with the tools they need to do their job well.

The real cost of running everything locally

Talking to a lot of developers, I heard about three scenarios for what this looks like in reality:

Scenario 1: everything works well
Each developer knows how to run all services locally. This is the case for a minuscule minority of developer teams. I heard it only from teams that:

Consist of a maximum of 5 developers
Have only highly skilled, senior-level developers
Have no more than 3 services in their software
Have invested a lot in documenting and automating the local setup and build process

Scenario 2: it works, but is painful
This is the most common scenario I found with small software companies. They typically:

Have fewer than 100 developers (usually 10-30)
Have a handful of highly skilled senior developers who spend a lot of their time
- building tools (i.e. writing scripts) for local deployment and local build and
- supporting more junior developers with their local deployment and local build
Have few services (3-6)

Scenario 3: it doesn’t work
This is the case with the vast majority of medium to large software companies that I talked to. They typically:

Have more than 100 developers
Have more than 5 services (often several dozen or hundreds)
Are not able to run the full application locally at all because their laptops simply don’t have enough computing power

This means developers work blind. They write code and push it directly to staging, without being able to validate and test locally what they are doing (beyond unit tests).
This often results in:

Many bugs in the software, which leads to
High investment in testing
High investment in “troubleshooting capabilities” such as automated rollbacks
Production outages, unhappy customers, penalties …

What does this mean?

I can only extrapolate from the sample of developers I talked to. On that basis, I estimate that no more than 1% of companies are really successful with a setup where every developer is expected to run all services locally, i.e. scenario 1.

It seems that, as companies grow and age, their software becomes more complex. This leads to a rapid increase in factors that can vary in local deployment – factorial complexity.

This leads to scenario 2, where local deployment eats up developers’ time. In all teams that I talked to, there were several people who simply were not able to handle local deployment at all. Across all seniority levels, the amount of time spent either managing their own local deployment or supporting colleagues with theirs varied between 10% and 25% of developers’ time. The largest amount of time was spent by the most junior and the most senior developers: juniors needed a lot of help, and seniors provided a lot of help. Becoming a senior meant becoming an expert in deployment, such as writing Helm charts or troubleshooting minikube clusters.

But the real bottleneck is the computing resources available locally. In scenario 3, this was the most commonly cited issue. Even after investing a lot of time, developers were simply not able to run the full application locally. In this (worryingly common) case, where a majority of developers cannot run the application they work on locally at all, developers are stuck working blind. As mentioned earlier, this leads to quality issues down the road that are incredibly difficult and expensive to fix.

What is the solution? This is the topic for part III of this series. I’ll explain what is working for many developers and how to get there.