Conscious decoupling

Consolidating or fragmenting code is a decision best revisited

What I call “lumping” is the act of bringing two or more logically-distinct systems closer together, by managing them in the same codebase, repository, computer, or network. “Splitting” is the act of dividing a system along some seam, separating it out into different modules, repositories, services, or networks. A system that’s lumpy is tightly-coupled and easier to update in lockstep; a system that’s splitty is loosely-coupled and easier to change independently. Examples of highly lumpy designs include monorepos or monoliths; splitty ideologies include microservices or service-oriented architectures.

I am not going to say much about whether you should lump or split individual lines of code or functions, because there’s some objective truth to how modularized a codebase should be. Most developers would agree that you should not have ten functions that all do the same thing, or one function that does everything, and it’s hard to get us to agree on anything.

Similarly, I think we have consensus that at the level of discrete tooling, the highly composable Unix-style “small pieces, loosely joined” philosophy is a better foundation than an all-in-one Swiss army knife approach, if for no other reason than that the Unix approach doesn’t preclude building a Swiss army knife on top of it, whereas a Swiss army knife can’t easily be broken back down into small composable pieces.
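To make that asymmetry concrete, here is a minimal Python sketch. The names (grep, head, pipeline, find_first_errors) are my own stand-ins for illustration, not any real library's API. Each small piece does one thing, and the "Swiss army knife" at the bottom is just a composition of them.

```python
import itertools
from typing import Callable, Iterable, Iterator

# A stage takes a stream of lines and returns a (possibly smaller) stream.
Stage = Callable[[Iterable[str]], Iterator[str]]


def grep(needle: str) -> Stage:
    """Small piece: keep only the lines that contain `needle`."""
    def stage(lines: Iterable[str]) -> Iterator[str]:
        return (line for line in lines if needle in line)
    return stage


def head(count: int) -> Stage:
    """Small piece: keep only the first `count` lines."""
    def stage(lines: Iterable[str]) -> Iterator[str]:
        return itertools.islice(lines, count)
    return stage


def pipeline(*stages: Stage) -> Stage:
    """Join small pieces loosely: feed each stage's output to the next."""
    def run(lines: Iterable[str]) -> Iterator[str]:
        for stage in stages:
            lines = stage(lines)
        return iter(lines)
    return run


# The all-in-one tool is built on top of the small pieces.
# The pieces stay usable on their own; nothing is locked inside it.
find_first_errors = pipeline(grep("ERROR"), head(2))

log = ["ERROR disk full", "ok", "ERROR timeout", "ERROR again"]
print(list(find_first_errors(log)))  # ['ERROR disk full', 'ERROR timeout']
```

Going the other way, from a single find_first_errors function back to a reusable grep and head, would mean rewriting it.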

I do think there are interesting things to say about lumping or splitting in the context of architectures and also about teams. I’m going to call both of these “systems” even though one concerns how services appear in an infrastructure diagram and the other is about the org chart.

When splitting is definitely better

The gold standard for preferring two things to one thing is when you need those things to be totally isolated from each other. For example, I probably don’t need to explain why you should host https://status.example.com on a different computer from https://example.com if the purpose of the status site is to tell you when the main site has gone down.

However, it’s interesting to note that the more useful a status site is, the more information it needs about the system it’s monitoring, which can tend to make it more coupled. This in turn makes it more vulnerable to unexpected kinds of disruptions. For example, if a monitoring system directly pings the site it’s monitoring, a naively implemented monitor could itself end up unavailable while it waits for a response from a very slow site. In response to an event like this, the team would likely make an effort to decouple the two more strongly. Even a system that clearly wants to be independently operated may tend towards lumpiness until it reaches a desirable equilibrium.
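One way a team might decouple the monitor from a slow target is to put a hard deadline on every probe. The following is a rough Python sketch under my own assumptions: the function names, the three status strings, and the two-second default are all hypothetical, and a real monitor would need more care (retries, alerting, cleanup of abandoned probes).

```python
import concurrent.futures
import time
import urllib.request

# Probes run on worker threads so the monitor's own thread never blocks on them.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def http_probe(url: str) -> int:
    """Default probe: fetch `url` and return its HTTP status code."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.status


def check(url: str, deadline_seconds: float = 2.0, probe=http_probe) -> str:
    """Classify a site as 'up', 'down', or 'slow' without waiting past the deadline."""
    future = _POOL.submit(probe, url)
    try:
        status = future.result(timeout=deadline_seconds)
    except concurrent.futures.TimeoutError:
        # No answer within the deadline: report it and move on.
        # The abandoned probe finishes (or fails) in the background.
        return "slow"
    except Exception:
        # Connection refused, DNS failure, HTTP error status, and so on.
        return "down"
    return "up" if 200 <= status < 400 else "down"


def slow_site(url: str) -> int:
    """Stand-in probe that simulates a site taking 0.3 seconds to answer."""
    time.sleep(0.3)
    return 200


# No network needed for the demo: the simulated site misses a 0.1 second deadline.
print(check("https://example.com", deadline_seconds=0.1, probe=slow_site))  # slow
```

The monitor still depends on the site in order to say anything useful about it, but a slow site can now cost it at most the deadline per check.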

Microservices: a case study in splitting

Microservices, an approach in which a single large codebase (often a web site or service) is refactored into many small web services, rose to prominence in the mid-2010s and remains a popular approach that many teams execute successfully. As a design philosophy, it asserts that there are advantages to making systems highly decoupled, and it makes a good case study because enough time has elapsed since it emerged for teams of various sizes to have applied it and measured their results.

An oft-stated reason for a team to move to a microservices architecture is that they feel their velocity is constrained. They don’t deploy changes to their monolithic site very often because the monolith is fragile, which means that when they finally deploy, there may be loads of unrelated changes bundled together, which introduces more fragility. Splitting out a part of the infrastructure allows that particular part to be iterated on much more quickly, resulting in higher velocity and developer satisfaction.

The advantages of splitting in situations like this are usually very obvious, but the costs tend to be underestimated. If you have a monolith and decide to split out your first microservice, you are doubling the number of systems you are logging, monitoring, and deploying. These tools for observing the health of your system are critical, but they’re less interesting than building new software, so they tend to be an afterthought for smaller teams. Logging barely even gets a mention in the first edition of Building Microservices, but after years of evaluating outcomes, the 2021 revision of the same book includes log aggregation in Chapter One as “a prerequisite for adopting a microservice architecture.” (Ironically, log aggregation is a form of lumping that gets added to an architecture to deal with a problem introduced by splitting.) If you don’t have good logging, you probably don’t have good monitoring, and if you aren’t monitoring your system then you don’t know when it’s broken. Unmonitored systems tend to be broken, just quietly enough that nobody’s complained yet.
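As a rough illustration of what log aggregation buys you, here is a small Python sketch that lumps two per-service log streams back into one timeline. The LogLine shape, the service names, and the messages are all invented for the example, and it assumes each service's stream is already in time order. Real aggregation tooling does far more than this.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class LogLine(NamedTuple):
    timestamp: float  # seconds since some shared reference point
    service: str      # which service emitted the line
    message: str


def aggregate(*streams: Iterable[LogLine]) -> Iterator[LogLine]:
    """Merge time-ordered per-service streams into a single ordered timeline."""
    return heapq.merge(*streams, key=lambda line: line.timestamp)


# After the split, each service only sees half of the story.
monolith = [
    LogLine(1.0, "monolith", "checkout started"),
    LogLine(4.0, "monolith", "checkout failed"),
]
billing = [
    LogLine(2.0, "billing", "charge requested"),
    LogLine(3.0, "billing", "card declined"),
]

# Merged, the cause of the failed checkout sits right before it.
for line in aggregate(monolith, billing):
    print(f"{line.timestamp:>4} {line.service:<8} {line.message}")
```

Read separately, neither log explains the failure. That is the extra work a split system has to do just to get back the visibility a monolith had by default.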

Split later, or never

My issue with applying microservices to the problem of slow release cadence is that it’s a form of solutioning—arriving at an implementation before fully studying the problem. If a team has a problem with deploy velocity, I think the first five tools to reach for are all the word “why”:

“Why don’t we deploy more frequently?”

Because changes need to be tested carefully by a skilled person.

“Why are so many changes being tested manually?”

Because we don’t have good test coverage.

“Why don’t we have good test coverage?”

Because nobody likes writing tests in this codebase.

“Why don’t people like writing tests in our codebase?”

Because the test framework runs too slowly.

“Why does the test framework run slowly?”

Because no one has prioritized speeding it up.

If you make the test framework run more quickly then developers will be happier because developers are impatient. If fewer changes require manual review then human reviewers can use their brains for complex thinking work and they will be happier too. Solving first for the root cause of the velocity problem means you will have increased net happiness without increasing net complexity, and avoided doubling the number of things to log and monitor.

My argument here is that if you have a system that you think needs splitting, first try everything you know you’d eventually want to do anyway. Those improvements pay off immediately and can be available sooner than a new microservice or any other method of decoupling a complex system.

Unsplitting

There are many reasons a team might decide to re-lump a split system. Existing teams tend to want to split their current codebase—it increases their autonomy and lets them walk away from annoying problems. New teams tend to want to pull systems away from extremes and reduce complexity, and as a rule it’s easier to decide to consolidate a distributed system than to fragment it—unsplitting shrinks the surface area of the system you’re responsible for. If there’s one kind of refactoring engineers love doing more than decoupling something they wrote, it’s DRYing out something they didn’t.

Another scenario where relumping is beneficial is when the system is nearing the end of its lifecycle and a highly decoupled system that was once optimized for speed of experimentation just isn’t being iterated on anymore. If a system is mature, as long as it’s still being operated it needs security patches and version bumps for deprecated libraries. A lot of small tech organizations are faced with upgrade backlogs for systems that are otherwise “complete” and don’t need a lot of new features, but are so decoupled that the same upgrades need to be applied to dozens of services. The salient point is that the “right” architecture changed over time—it may well have been desirable to have a microservices design at the start, but software has the annoying habit of being both disposable and also somehow extremely hard to get rid of.

Teams too

Software teams might argue at the start of a project whether to implement microservices or not, but at least at some point the argument ends and people start writing code. Unfortunately, how to structure a team involves lots of opinions and the argument over it will literally never end. Whether a team involves two people or ten people, or whether each team works on one thing or many things, or whether a team is single- or cross-functional—you can expect to revisit this topic again and again no matter what your role is. (If you are some kind of manager you will talk about this to the exclusion of almost everything else.)

Two teams might operate in perfect synchrony for years because one member is very good at staying in constant communication with the other team. When she leaves and is replaced by an equally competent person with different priorities, it can take a while to notice that the linchpin holding it all together was removed, at which point it might seem reasonable to lump the teams together even though the less invasive solution would be to improve communication.

Sometimes two teams are two teams for lousy reasons. Maybe one person was a jerk and half the team defected just to get some work done in peace. Maybe there’s someone with a fancy title who wants to have a vanity skunkworks team instead of doing the hard work of improving the whole org. Maybe it just grew that way and there’s really no principled reason at all.

As with systems design, I think it’s generally better to err on the side of lumpiness with team size—keeping communication low-latency, reducing silos, and providing an opportunity for collaboration are all easier with a larger team up to a point. But people also hate being involuntarily reorganized so I also think it’s better to stick with the status quo rather than attempt some new team structure unless there’s no viable alternative. What can be useful, though, is probing deeply into why a given structure seems wrong—did it work before? If so, what changed? If it never worked well, how did it evolve that way in the first place?

Taking the long view

Structures that are optimized for change velocity can be valuable in the early stages of development, and they especially make sense for high-growth organizations that will look completely different from year to year. Most of my experience has been with teams of moderate size and on products that are expected to operate for 5-10 years, with only a third of that time spent on active feature development and the rest on long-term maintenance and support. Just as code is read more than it’s written, successful software is maintained far longer than the time that was spent to build it.

A lot of conditions have to be true for a highly decoupled system to really pay off. It needs to be easy for you to spin up and tear down all the ancillary bits of a system as your development evolves. You need to commit to retaining certain kinds of specialists, especially devops or site reliability engineers, or risk depending on complex architectures that no one understands. (Ironically, one failure state of highly distributed systems after a time is that they can become too fragile to modify, mimicking the problems of the monolithic system they were meant to supplant.)

Be mindful about the choices your team makes regarding its architecture or organization. Re-evaluate them periodically because business and personnel changes happen all the time. It should be hard to decide to switch a system towards more-lumpy or more-splitty because switching is disruptive. Solve for the proximal frictions first before reaching for the big levers.

This article is licensed under a Creative Commons Attribution 4.0 International License.

Originally published June 30, 2022