I have been reading and thinking about (internal) developer platforms for a while now. I quickly understood the purpose and benefit of developer platforms: Their purpose is to make the lives of software engineers easier, and the benefit of that is more time for software engineers to do actual software engineering – i.e. an increase in productivity and happiness.
I also often read that developer platforms are a layer of abstraction. Software engineers should no longer need to know about the underlying infrastructure in order to use it (e.g. to deploy their software to it).
So far so good.
I also read that developer platforms
- should be treated as products, and focus on providing useful services to their core users: Software engineers.
- can really be anything: Documentation, templates, APIs, or portals. Anything an organisation identifies as a place where developers can get information, support and resources to handle infrastructure, pipelines and deployments.
- are unique to each organisation. What a specific developer platform looks like, which components it has, and which features it provides depend a lot on what is already there in an organisation.
So far so great.
Individually, all of these points make sense. But I struggled to put them all together. In my head, the idea of “developer platforms” became a large blob that was hard to describe – and even harder to apply.
Then I came across this video by Viktor Farcic. It was the first resource that helped me break that mess down into tangible, applicable categories through which I understood how to go about building an IDP: Where to start and what’s important.
In the video, Viktor Farcic describes how to build a developer platform in five steps: API, state management, one-shot actions / workflows, RBAC & policies, and user interfaces (which, funnily, he calls optional). I found these steps useful, not so much as steps, but because they describe the core capabilities a developer platform should have in order to be useful. They don’t try to say what a platform should look like by listing product categories, but instead what a platform should be able to do. How you achieve those capabilities is totally up to you.
This post is not a summary of the video. It is my own take on the core capabilities of IDPs, inspired by the video. I formulate them slightly differently and add another one:
- API
- User Interfaces (not optional)
- Automation: One-shot Actions and State Management
- Constraints (policies, RBAC etc.)
- Documentation and Discoverability
Let’s take a look at each of those capabilities, what they mean, and why they are important.
#1 API
An API is at the very core of the value proposition of a developer platform. It is the core layer of abstraction that exposes services to platform users in a simple and secure way:
- Simple:
  - API calls should require only a small, simple set of parameters, so that a resource can be requested as easily as possible, e.g. by only requesting “a DB” and leaving all further configuration to defaults defined in templates. Additional parameters for customisation should be available, but not required.
  - API calls should follow a simple, standardised schema that is always the same, independent of the specific request.
  - API calls should be independent of the underlying infrastructure and tech stack: The API should not require the user to specify, for example, whether the DB should be deployed to AWS or Google Cloud, and neither the request structure nor the parameters should depend on the underlying infrastructure.
  - The API should describe itself, i.e. return its own schema, so that applications built on top of it or users interacting with it directly can easily discover what is available.
- Secure:
  - The API is the main place where constraints are applied and enforced. This means that the API should be able to apply policies, and only accept requests that pass authentication and validation.
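As a minimal sketch of the "simple" properties listed above, the following Python models a resource request with defaults, optional overrides and a self-describing schema. The names `SCHEMAS`, `describe` and `request_resource`, as well as the resource kinds and parameters, are assumptions made up for illustration; they are not taken from any specific platform.

```python
# Hypothetical schema registry: one entry per resource kind the platform offers.
# Every parameter has a default, so only the kind is required in a request.
SCHEMAS = {
    "db": {
        "engine": {"type": str, "default": "postgres"},
        "size_gb": {"type": int, "default": 10},
    },
    "cache": {
        "memory_mb": {"type": int, "default": 256},
    },
}


def describe():
    """Return the API's own schema so clients can discover what is available."""
    return {
        kind: {
            name: {"type": spec["type"].__name__, "default": spec["default"]}
            for name, spec in params.items()
        }
        for kind, params in SCHEMAS.items()
    }


def request_resource(kind, **overrides):
    """Build a resource specification from defaults plus optional overrides.

    The request says nothing about where the resource will run; mapping the
    specification to a concrete cloud or cluster is left to the platform.
    """
    if kind not in SCHEMAS:
        raise ValueError(f"unknown resource kind {kind!r}; available: {sorted(SCHEMAS)}")
    schema = SCHEMAS[kind]
    spec = {name: param["default"] for name, param in schema.items()}
    for name, value in overrides.items():
        if name not in schema:
            raise ValueError(f"unknown parameter {name!r} for kind {kind!r}")
        if not isinstance(value, schema[name]["type"]):
            expected = schema[name]["type"].__name__
            raise ValueError(f"parameter {name!r} must be of type {expected}")
        spec[name] = value
    return {"kind": kind, "spec": spec}
```

With this sketch, `request_resource("db")` is a complete request, while `request_resource("db", size_gb=50)` customises a single parameter and leaves the rest to the defaults.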
The API is the core of the developer platform. If you try to build a developer platform without a central API that provides a unified layer of abstraction across services, you will end up with a mess that is impossible to maintain.
If you have a central component that allows you to define a custom API – for example, if your developer platform is based on Kubernetes – then it’s great to use that to define the core API for your platform. However, if your developer platform should support deployment into different clouds or infrastructure environments, and/or provide services via the API that are unrelated to infrastructure deployment, then you should look for a dedicated API management tool, or build a custom API layer for your developer platform yourself.
#2 User Interfaces
User Interfaces are built on top of the API and provide ways for platform users to interact with the platform in ways that are simple and fit their preferences. User interfaces can, for example, be:
- Portals / web GUIs like Backstage or Port. For example, these can let platform users request resources by filling in a form which is based on the API schema,
- CLIs (command-line interfaces) which let platform users use platform features via a terminal,
- IDE plugins, which make platform features available directly from an IDE,
- Integrations with other existing GUIs,
- Custom web UIs,
- Or anything else that can access an API.
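To illustrate how a user interface can be built on top of the API schema, here is a small sketch that generates a command-line interface from a schema using Python's standard `argparse` module. The schema `DB_SCHEMA`, the functions `build_parser` and `parse_request`, and the program name `platform` are hypothetical.

```python
import argparse

# Hypothetical schema as a platform API might return it for a "db" request.
DB_SCHEMA = {
    "engine": {"type": str, "default": "postgres", "help": "database engine"},
    "size_gb": {"type": int, "default": 10, "help": "storage size in GB"},
}


def build_parser(kind, schema):
    """Generate a command-line parser from an API schema.

    Every schema parameter becomes an optional flag with the schema default,
    so the CLI follows the API without hand-written options.
    """
    parser = argparse.ArgumentParser(prog=f"platform request {kind}")
    for name, spec in schema.items():
        parser.add_argument(
            f"--{name.replace('_', '-')}",
            dest=name,
            type=spec["type"],
            default=spec["default"],
            help=f"{spec['help']} (default: {spec['default']})",
        )
    return parser


def parse_request(kind, schema, argv):
    """Turn command-line arguments into the request body sent to the API."""
    args = build_parser(kind, schema).parse_args(argv)
    return {"kind": kind, "spec": vars(args)}
```

A call such as `parse_request("db", DB_SCHEMA, ["--size-gb", "50"])` yields a request body with the overridden size and the default engine. A web form could be generated from the same schema in a similar way.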
Considering that the main value proposition of a developer platform is to make the lives of software engineers easier, user interfaces are an essential platform capability. They should be chosen and designed with utmost care and in close collaboration with their users to make sure that platform features are usable in a way that makes sense to software engineers, and that fits their existing workflows. Getting user interfaces wrong can destroy a lot of the potential value a developer platform can provide.
#3 Automation: One-Shot Actions and State Management
One-shot actions or workflows are probably the part you are most familiar with already. This is process automation, scripting, pipelines, and anything else that allows you (or your platform users) to do something automatically – once.
One-shot actions are things like creating a snapshot, building a container image, or running a set of tests on a specific deployment: anything that you need to do automatically and often, but typically only once for a specific constellation or configuration. For example, you typically build a container image only once; when its definition changes, you build it once again for the new definition.
State management means being able to define how something should be – a state – and having a tool that ensures that this state is maintained. For example, state management is a core feature of Kubernetes. If a deployment is configured with three replicas, then Kubernetes will make sure that there are always three pods. If there are two, it will add a third. If there are four, it will kill one.
Tools for state management allow a user to define the state without having to worry about the one-shot actions that allow the system to get from one state to another (for example, the one-shot action of starting or killing a Kubernetes pod). As such, state management tools are already a layer of abstraction that contains logic for one-shot actions within their domain, and allow the user to use them in a conceptually different and simpler way: By only describing the desired state.
State management tools are therefore always domain specific. They contain logic around state management within one domain – such as a Kubernetes cluster, or infrastructure management in a wider sense in the case of e.g. Ansible.
In order to work, a state management tool has to “know” how to transform a system from any possible state into any other possible state. Depending on the domain, that can mean quite a lot of automation logic to provide state management that feels simple to a user.
It is absolutely possible to manage state with any system that is capable of one-shot actions. It is just more difficult to set up and manage than with a dedicated state management system.
To summarise:
State management requires:
- A way to define a state
- Atomic one-shot actions that can be combined flexibly, so that the system can go from any state allowed in its definition, to any other state allowed in its definition
- State awareness, i.e. built-in capabilities to know what the current state is
State management transforms any type of input (current states) into the same output (the desired state). As such, state management is deterministic: You know exactly which output you will get for each input. The system is state aware and will continue to try to reach the desired state independently of which one-shot actions have or have not been performed before.
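The three requirements can be sketched in a few lines of Python. This is a toy model under loud assumptions: the `Cluster` class is an in-memory stand-in for a real system, and `reconcile` is a hypothetical name. It is not how Kubernetes or any other tool is implemented; it only shows a desired state, atomic actions and state awareness working together.

```python
import itertools


class Cluster:
    """Toy stand-in for a real system: it only tracks a list of pod names."""

    def __init__(self, pods=()):
        self.pods = list(pods)
        self._ids = itertools.count(1)

    # Atomic one-shot actions
    def start_pod(self):
        self.pods.append(f"new-{next(self._ids)}")

    def stop_pod(self):
        self.pods.pop()

    # State awareness: the system can report its current state
    def running(self):
        return len(self.pods)


def reconcile(cluster, desired_pods):
    """Drive the cluster to the desired state, whatever its current state is.

    Returns the list of one-shot actions that were needed.
    """
    actions = []
    while cluster.running() < desired_pods:
        cluster.start_pod()
        actions.append("start")
    while cluster.running() > desired_pods:
        cluster.stop_pod()
        actions.append("stop")
    return actions
```

Starting from two pods, `reconcile(cluster, 3)` performs one start; starting from four, it performs one stop; and running it again on a cluster that already has three pods does nothing. The user only ever states the number three.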
One-shot actions, on the other hand, typically focus more heavily on:
- Process logic and
- Configuration management, i.e. the inputs that a one-shot action receives which influence the process logic.
A one-shot action transforms a specific set of inputs into a specific set of outputs. Because one-shot actions are typically not state aware (at least not fully), their result is not determined by their inputs alone: The same input can lead to a different outcome, depending on the initial state of the system.
For example, the one-shot action of creating a Kubernetes node can lead to one, ten, or one hundred Kubernetes nodes existing afterwards – the one-shot action doesn’t know or consider how many nodes exist before it is started.
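The node example can be made concrete with a short sketch. Both functions are hypothetical and operate on a plain list of node names instead of a real cluster; the point is only to contrast an action that ignores the current state with one that inspects it.

```python
def create_node(nodes):
    """One-shot action: always adds exactly one node, ignoring what exists."""
    return nodes + [f"node-{len(nodes) + 1}"]


def ensure_nodes(nodes, desired):
    """State-aware counterpart: inspects the current state before acting."""
    nodes = list(nodes)
    while len(nodes) < desired:
        nodes = create_node(nodes)
    return nodes[:desired]


if __name__ == "__main__":
    for existing in (0, 9, 99):
        before = [f"node-{i + 1}" for i in range(existing)]
        # Same action, same input, different end result each time.
        print(existing, "->", len(create_node(before)))
        # The state-aware version always ends with the desired number.
        print(existing, "->", len(ensure_nodes(before, 3)))
```

Called on clusters with 0, 9 and 99 existing nodes, `create_node` leaves 1, 10 and 100 nodes behind, while `ensure_nodes(..., 3)` leaves three in every case.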
You most likely already have a lot of one-shot automation in place. When building a developer platform, it is tempting to re-use (or abuse) existing automation solutions to build platform services. Depending on which automation solutions those are, that can be a good idea, or a very bad idea. Existing, battle-tested pipelines and other automation that is already in production use should obviously be re-used as much as possible. Building platform-specific services with existing automation solutions can work well if those solutions support building modular, maintainable automations – and, ideally, if they provide state management functionality.
#4 Constraints
The platform API provides a set of services. At any given point, not all of these services will be available for every user. The rules that determine which services are available or unavailable to whom, when, and why are what I call constraints.
A platform has to support a way to define, or integrate, and enforce such constraints. Ideally, this is a feature of the API, which should allow the definition of policies that restrict which requests can be made by which user group.
In existing infrastructure setups, and probably in most if not all developer platforms, constraints are implemented across a patchwork of different tools, each with a very different focus. Role-based access control (RBAC) is one such focus of constraint modelling: Its purpose is to define roles and to associate each role with permissions that detail what a user holding that role is allowed to do.
In the context of a developer platform, this can, for example, mean that software engineers with one role are allowed to request resources with GPUs, while others are not.
Resource quotas are another example of a policy or constraint.
The API should be aware of all these constraints and validate the requests it receives. If a request violates a constraint, it should be rejected with a helpful error.
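A minimal sketch of such a check, assuming a role rule for GPUs and a per-team CPU quota, could look like this. The role name, team names, quota values and the function `check_constraints` are all invented for illustration.

```python
class ConstraintViolation(Exception):
    """Raised when a request breaks a platform rule; the message is user-facing."""


# Hypothetical rules: which roles may request GPUs, and CPU quotas per team.
GPU_ROLES = {"ml-engineer"}
CPU_QUOTA = {"team-a": 16, "team-b": 64}


def check_constraints(user, request, cpus_in_use):
    """Reject a request that violates a role or quota rule, else return it."""
    if request.get("gpus", 0) > 0 and not GPU_ROLES & set(user["roles"]):
        raise ConstraintViolation(
            f"GPUs require one of the roles {sorted(GPU_ROLES)}; "
            f"you have {sorted(user['roles'])}."
        )
    quota = CPU_QUOTA.get(user["team"], 0)
    total = cpus_in_use + request.get("cpus", 1)
    if total > quota:
        raise ConstraintViolation(
            f"Team {user['team']!r} would use {total} CPUs, but its quota is {quota}."
        )
    return request
```

The error messages name the rule that was violated and the values involved, so that a rejected user knows what to change or whom to ask.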
Side note: Schema validation is functionally closely related. The API validates each request it receives against a schema that defines which parameters are required and which format they must have. If a required parameter is missing or a parameter has the wrong format, the request is rejected with a helpful error.
Constraints can be modelled in the same way: To be valid, a request has to carry the correct set of permissions and other parameters that fit within the constraints framework.
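One way to picture this is a list of validators that all share the same interface, where the permission check is just one more validator next to the schema check. The schema, the permission naming scheme `request:<kind>` and the function names below are assumptions for the sake of the sketch.

```python
# Hypothetical request schema: parameter name -> (expected type, required?)
SCHEMA = {"kind": (str, True), "size_gb": (int, False)}


def validate_schema(request):
    """Check that required parameters are present and all types are correct."""
    errors = []
    for name, (expected, required) in SCHEMA.items():
        if name not in request["params"]:
            if required:
                errors.append(f"missing required parameter '{name}'")
        elif not isinstance(request["params"][name], expected):
            errors.append(f"parameter '{name}' must be of type {expected.__name__}")
    return errors


def validate_permissions(request):
    """Treat permissions as just another part of the request to validate."""
    needed = f"request:{request['params'].get('kind')}"
    if needed not in request["permissions"]:
        return [f"missing permission '{needed}'"]
    return []


VALIDATORS = [validate_schema, validate_permissions]


def validate(request):
    """Run every validator and collect all errors; an empty list means valid."""
    return [error for check in VALIDATORS for error in check(request)]
```

Collecting all errors instead of stopping at the first one lets the API return a single, complete error message per rejected request.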
#5 Documentation and Discoverability
The API provides services based on state management and one-shot-actions. Constraints determine which of those services in what configuration are available to a specific user at a specific point in time. User Interfaces allow users to use these services. But in order for any of that to make sense, users need to know which services are available, and they need to understand how they can use them. Without discoverability, all other platform capabilities become a lot less valuable, because they will be used less.
Ideally, discoverability is provided by each platform component being self-describing: The API should have an OpenAPI specification describing the available calls. Constraints, wherever they are modelled, should be transparent, at least to the degree that users know when their request failed due to insufficient permissions. User interfaces should inform users about available services (help in the CLI, for example, or a well-structured web app with a search feature).
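As a rough sketch of a self-describing API, the following generates a minimal OpenAPI 3.0 document from a registry of services. The registry `SERVICES`, the path layout `/resources/<kind>` and the function `openapi_document` are hypothetical; a real platform would more likely rely on its API framework to produce the specification.

```python
import json

# Hypothetical registry of platform services and their parameter types.
SERVICES = {
    "db": {"engine": "string", "size_gb": "integer"},
    "cache": {"memory_mb": "integer"},
}


def openapi_document(services, title="Example platform API", version="0.1.0"):
    """Build a minimal OpenAPI 3.0 description with one POST path per service."""
    paths = {}
    for kind, params in services.items():
        paths[f"/resources/{kind}"] = {
            "post": {
                "summary": f"Request a {kind}",
                "requestBody": {
                    "content": {
                        "application/json": {
                            "schema": {
                                "type": "object",
                                "properties": {
                                    name: {"type": json_type}
                                    for name, json_type in params.items()
                                },
                            }
                        }
                    }
                },
                "responses": {
                    "201": {"description": "Resource request accepted"},
                    "403": {"description": "Insufficient permissions"},
                },
            }
        }
    return {
        "openapi": "3.0.3",
        "info": {"title": title, "version": version},
        "paths": paths,
    }


if __name__ == "__main__":
    print(json.dumps(openapi_document(SERVICES), indent=2))
```

Because the document is derived from the same registry that defines the services, it cannot drift out of date the way hand-written API documentation can.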
But discoverability also extends to a topic central to many IDPs: Documentation. Discoverability of features, services, but also of infrastructure, intellectual property (software components, repositories etc.) and other things relevant to a software engineer’s work, is a core value that IDPs can provide. In fact, many IDPs start out as documentation hubs such as software catalogs, or local deployment guides with links to configuration templates etc.
When building an IDP, discoverability should be considered early on. Manually maintained documentation is great, and often indispensable for self-service offerings that users have to understand in order to use. However, manually curated documentation is also very resource intensive to maintain. Any component of the IDP that is self-descriptive and provides built-in features that support discoverability is a big boon that will make long-term maintenance of your IDP a lot easier.
How does this help me build a platform?
If you think about the problems you have had recently as a platform or DevOps engineer, you will probably be able to map them to one of the described capabilities.
Some examples:
- If you have trouble with downtimes of instances, or accidental high cost due to cloud resources being left lying around, you have insufficient state management capabilities.
- If you have issues with software engineers not having access to infrastructure they need, or with junior engineers messing with deployments they should not mess with, you are not managing constraints well.
- If your developers are unable to spin up a feature branch system on their own, or to trigger a build and test run for a specific commit, then you need to take a look at your one-shot-action automation capabilities.
- If your software engineers are simply not using the APIs and services you provide, you should probably take a look at the user interfaces you provide or check if your services are discoverable / documented.
If you start thinking about the problems you have – or, ideally, the services you would like to be able to provide to your software engineers – in these categories, it will help you identify what you need to do, and how to go about doing it in a way that will make your life sustainably easier.
Instead of abusing your one-shot automation tool for state management, hacking together workflows with lots of ifs and polling what is there before doing anything, you should probably consider finding a tool that is built for state management. It will make your life easier.
If you have policies flying around everywhere and no real idea which constraints apply where, or whether they make sense, you should look for a tool that makes constraint modelling simpler for you. The same applies if everyone simply has full rights on everything and you know that this is not how it should be, but you don’t know how to sustainably manage the principle of least privilege across your entire infrastructure and user base. And so on.
The purpose and value of these core capabilities is to help you understand why you keep failing with some topics, and how you can start fixing things in a way that lasts and is sustainably manageable.
Final note on our own behalf
I’m not just thinking about developer platforms; I’m building a framework for creating developer platforms that are useful and sustainable. Cloudomation is a pure Python framework for building developer platforms. We launched Cloudomation in 2019 as a general-purpose (one-shot action) automation framework, and have kept tinkering with it ever since. I’m proud to say that, by now, we have a broad, flexible and powerful framework in place that supports the core capabilities described in this post:
- API: Cloudomation has a built-in API manager that allows you to define custom APIs
- User Interfaces: Cloudomation supports
  - Custom web-apps based on custom APIs
  - Integration with portal solutions like Backstage
  - A basic CLI
  - Schema-based single-form web UIs (not suitable as full platform solutions, but suitable to provide individual services to specific user groups, e.g. to enable a consultant to deploy a demo environment)
  - Any custom-built user interface based on its APIs
- One-shot actions: Cloudomation was conceived as an automation platform and excels at handling complex workflows. This is a core strength of the platform. Both integration with existing pipelines, scripts, IaC and other tools, as well as native automation written in Python are supported by a broad, battle-tested feature set.
- State management: Cloudomation is the only framework we are aware of that allows users to (easily) build custom state management for their own domain. This is powered by our unique object-oriented automation approach, which allows users to model objects that describe the desired state, and associate them with lifecycle hooks that manage state transitions.
- Constraints: Cloudomation supports role-based access control (RBAC) which can be modelled directly on the platform, as well as integration with LDAP for authentication. In addition, constraints modelling can be done using the core automation features available on the platform.
- Documentation and discoverability: Our custom APIs can self-describe, and our LLM-based assistant, Cloudomator, can provide information and write documentation on custom-generated automations, services, and APIs in Cloudomation. When it comes to hosting and serving documentation to users, Cloudomation is not the right tool, but it can seamlessly integrate with specialised documentation solutions or developer portals (like Backstage) and continuously update automatically generated documentation there (e.g. for API schemata).