The 3 Biggest Challenges in Platform Engineering, and How to Overcome Them

Let’s not mince words: Platform engineering is really difficult. Platform engineering teams by definition only exist if a software company’s tech stack is too complex for individual software engineers to manage on their own. The very purpose of platform engineering is to reduce complexity for software engineers – meaning that the platform engineering team is the place where this complexity is moved to (riffing on the “law of conservation of complexity”: Complexity cannot be reduced, it can only be moved). Read more about it in this post: Why Platform Engineering?

So there should be no expectation that platform engineering can be made easy. There are no magic tools out there that suddenly make a platform engineer’s life a breeze, allowing every junior system administrator to suddenly become a platform engineer.

But there are decisions you, as a platform engineer, can make that will make your life harder, or easier. Let’s take a look.

Challenge 1: Too many tools

TL;DR: Yes. Deal with it. That’s what you’re hired to do. To make it better long-term, add new tools which are generic, multi-purpose and flexible. Also, it is possible to get rid of tools.

“Too many tools” is the very core of what makes platform engineering difficult. It’s not just individual tools, it’s cloud providers, internal components and their dependencies, scripts your own team wrote, and everything else that makes up this highly interconnected thing that is the stack and infrastructure you’re supposed to make simpler for others to use. (Related: Platform Engineering Without Kubernetes)

The advice I read in various blog posts out there comes with helpful concepts like “portfolio management strategies” and “vendor ecosystems”. Most of it comes down to a belief that central management will help with the mess platform engineers deal with on a daily basis. I think this idea is inefficient, and removed from reality. It assumes that it is possible to devise a set of requirements that any tool a company buys has to fulfil, which would help procurement. In my experience, it typically doesn’t – on the contrary, it makes finding something that works a lot harder. In most cases, it means that only products from large vendors have a shot: They check the boxes, but often don’t provide the best feature set, and are certainly not the best choice when it comes to value for money.

In my experience, the most successful teams are the ones that are allowed to come up with their own requirements, pre-select tools that they’d like, and then hand them off to other teams like security or compliance, which can audit these tools and veto them if they find a significant blocker. I’ve found this approach to be much more effective and faster than asking security, compliance, legal, and other teams to come up with generic lists of requirements. That is pretty much an invitation to spend a lot of time on a list that has to be bullet-proof and cover every edge case, and that as a result is pretty much impossible for any one tool to fulfil. It leads to the kind of decision making enterprises are notorious for, where tools are bought that check the boxes of the legal and security teams, but can’t do what the original requirement called for.

However, it might not be in your power to decide on the process of choosing tools in your company. But what you most likely can do is lobby for tools that are:

  • Open, in the sense that they can be customised / extended without relying on the vendor,
  • Simple to integrate, i.e. they have a well-documented REST API and ideally other integration features such as webhooks,
  • As general-purpose, broad and flexible as possible. This might mean a bit more time on your end to get the tool to do exactly what you want, but in the end, it will allow you to get exactly what you want – and helps a lot to reduce the overall number of tools you’ll need.

Side note: There is a strong incentive in the software market to build very specialised tools for very specific use cases. From the perspective of a software vendor: If your tool can do one thing only, it’s likely that you can explain very quickly what it does, and it’s very likely that it will do that one thing very well. This makes it easier to sell. However, it also means that the market is littered with very niche solutions that focus on very specific use case(s).
If you fit the product’s intended user group perfectly, then go ahead. Most likely, however, you will end up with a tool that is very good at very little, and very bad at everything else. Tools that are general purpose and broad in their feature set generally provide a much better value for money. It takes longer to get to know them, but once you do, they are a lot more powerful and will provide you with a lot more value than a dozen very specific tools that you each have to learn and maintain individually, and then integrate somehow.

Another piece of common-sense advice that seems to rarely make it into blog posts: Getting rid of tools is possible. In 90% of cases, it will be true that once you’ve started to use a tool and have integrated it into your ecosystem, you’ll end up using it forever. But for the other 10%, it is perfectly feasible to just kick them out when they turn out to be more work than they are worth. When it comes to getting rid of tools, the same applies as for everything else you do: Pick your battles. If it is possible to get rid of a tool, it’s generally a good idea to do so. However, it might not be the most important thing you have to do right now, so be sure to propose this only when you have a very good reason to.

Challenge 2: Developers don’t use the platform

TL;DR: Whenever you start working on a feature or service, ask yourself the question: Who needs this feature (or service)? Then go and ask them why, and write down at least one, better three, use cases for your feature or service. Then build the feature / service to make those possible.

It’s a sin as old as software: Engineers go and build something truly beautiful. Then they give it to other people who are supposed to use it, who stubbornly refuse to do so. Why? Most likely because they simply don’t need it.

In my experience, getting people to describe very specifically how they want to use a feature and to what end (i.e. getting them to describe their use case(s)) is the easiest and most direct way to build a feature or service that is useful.

There is nothing else to say about this. If you like, you can call it “cultivating a product mindset”, “user centricity” or “prioritising features based on user value”. I prefer to just call it “talking to people”. (You can also just assign them a ticket, if you prefer not to talk.)

Challenge 3: Legacy tools and technical debt

TL;DR: It’s not going to change. The sooner you accept that, the sooner you can find ways to make things work despite it. For example: Add tools to your stack that can integrate with legacy tech, and/or agree on a standard multiplier for padding your estimates to allow for issues with technical debt in your code base.

When talking to software engineers, return on investment often feels like a dirty word. Engineers complain about their manager’s lack of interest in cleaning up technical debt, and relish opportunities to say “I told you so” when some legacy tool or some chunk of technical debt in the code yet again shows up as the most significant time sink in the estimate to implement a new feature or service. (If you do estimates. I think it’s a waste of time.)

In the end, it often really doesn’t pay to clean up technical debt. In other cases, it is just impossible to know what the return on investment for doing so would really be.

In all cases, it is worth it for you to sit down and think about what it would really mean to clean up this one bit of technical debt, or to get rid of this one piece of software that is making everything so difficult. Honestly consider whether it is possible to do so and what would be required in order to do it, then multiply the effort you imagine by at least two. Then ask yourself again whether this is really (a) what you want to be spending your time on and (b) worth it. If it is, go ahead and share your thoughts on cost and benefit. Who knows, if you argue like that you might even get the go-ahead.

But don’t count on it.

Getting rid of annoyances most often is just too far away from making money.
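
If you want to make that back-of-the-envelope check explicit before you argue for a cleanup, it could look something like the sketch below. All names and parameters here (`cleanup_pays_off`, `days_saved_per_month`, `horizon_months`) are hypothetical and only illustrate the reasoning; the only number taken from the text above is the factor of two applied to your imagined effort.

```python
def cleanup_pays_off(
    imagined_effort_days: float,
    days_saved_per_month: float,
    horizon_months: int,
    safety_factor: float = 2.0,
) -> bool:
    """Rough check: does cleaning up a piece of technical debt pay for itself?

    Assumption: the effort you imagine is too optimistic, so it is
    multiplied by `safety_factor` (at least two, as suggested above).
    The savings figures are your own guesses, not measured values.
    """
    realistic_effort = imagined_effort_days * safety_factor
    expected_savings = days_saved_per_month * horizon_months
    return expected_savings > realistic_effort


# 10 imagined days become 20 realistic days of effort.
# Saving 1.5 days per month over a year yields 18 days: not worth it.
print(cleanup_pays_off(10, 1.5, 12))  # False
# Saving 2 days per month over a year yields 24 days: worth arguing for.
print(cleanup_pays_off(10, 2, 12))    # True
```

The inputs are guesses either way, which is the point made above: often you simply cannot know the real return on investment.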

What is left within your sphere of influence is to find ways to make it easier for you to work with the technical debt you have. That may mean explicitly buying solutions that allow you to monitor and integrate legacy tools, or figuring out a standard multiplier for your estimates to allow for any additional effort caused by technical debt in your code base.
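
A standard multiplier can be as simple as the sketch below. The function name `padded_estimate` and the default value of 1.5 are assumptions for illustration only; the right multiplier is whatever your team agrees on based on how much technical debt usually slows you down.

```python
def padded_estimate(base_days: float, debt_multiplier: float = 1.5) -> float:
    """Pad a raw estimate with a team-wide technical-debt multiplier.

    The default of 1.5 is a placeholder, not a recommendation.
    """
    if base_days < 0:
        raise ValueError("base_days must be non-negative")
    if debt_multiplier < 1.0:
        raise ValueError("debt_multiplier must be at least 1.0")
    return base_days * debt_multiplier


# A feature estimated at 4 days is planned as 6 days.
print(padded_estimate(4))                        # 6.0
# A part of the code base known to be messy gets a higher multiplier.
print(padded_estimate(10, debt_multiplier=2.0))  # 20.0
```

The value of agreeing on one number is that the padding stops being a negotiation on every ticket.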

Bottom line

Platform engineering is difficult. It’s difficult because it exists to deal with complexity, so that software engineers don’t have to. It’s also difficult because software engineers are a legendarily difficult user group: They want to know and understand everything, without having to deal with the details of it all when they don’t need to. So it’s difficult to build a platform, and difficult to get software engineers to use it.

To make your life a bit easier, try the following:

  • Pick tools for your stack that can handle complexity: Tools that are customisable, easy to integrate, and general-purpose (or at least multi-purpose).
  • Get software engineers to describe specific use cases: How they want to use a feature or service and why. This will help you cut through nice-to-have wishlists and whittle down a service or feature to the core that is relevant. Software engineers will happily use a bare-bones service if it solves a relevant problem for them.
  • Accept that you will have to deal with legacy tools and technical debt. Don’t build your platform to fit an ideal you’ll never reach, build it for the existing reality. This means picking tools that can work with legacy software, and allowing time to deal with technical debt when building new services or platform features.

Advertising

We’ve built a tool that is designed to deal with exactly this kind of complexity. Cloudomation Engine is a pure Python framework for platform engineering. We built it before platform engineering was a thing, and are now thrilled to see that a growing community of engineers is waking up to the challenges and possibilities of managing tech and infrastructure complexity well.

Having worked as software and devops engineers before starting Cloudomation, we knew exactly what the pain points are (and always have been): Too many tools, stubborn software engineers as users, and seemingly pointless additional complexity caused by technical debt and legacy tech. So we built a framework to handle this. In Python, because we like Python.

If you’re curious to see what it can do, check out our videos on YouTube, our documentation, or book a demo (no strings attached – showing off our tool to an appreciative audience is fun in itself).

Margot Mückstein

CEO & co-founder of Cloudomation