This is related to the second of the things requested by Jesper, which was encapsulation. Encapsulation is a tool to use when designing software. It’s a bit abstract, and I don’t think people always agree on what it means. To me, encapsulation is part of the bigger term modularisation, which doesn’t immediately help because it’s another abstract design tool that people don’t always agree on. I think that understanding modularisation first will help with understanding encapsulation, so I’ll get to encapsulation in a later article.
I think that modularisation is chopping a big lump of code into smaller parts or modules, and encapsulation is a property that the parts might or might not have once you’ve created them.
Introducing coupling and cohesion
It’s important to get the boundaries between modules in the right place, even though this is often hard and also a controversial area because people can disagree on how far to go and the approach to take. One approach is volatility-based decomposition, which ends up with code split into things like managers, engines, resource access layers etc. At a lower level there is the Roles, Responsibilities and Collaborators approach.
In this article I won’t go into how to draw boundaries in the right place, but instead describe a tool that will help you to check if boundaries are in the right place once you’ve drawn them. The tool is the pair of terms coupling and cohesion. Coupling is about the strength or closeness of the relationship between one module and other modules, and cohesion is about the strength or closeness of the relationships between the things inside one module.
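One rough way to put numbers on these two terms is to count links: links that stay inside a module count towards its cohesion, and links that cross a boundary count towards coupling. The sketch below is only an illustration under that assumption. The unit names, the module names and the list of "uses" links are all made up, and real measures of coupling and cohesion can be a lot more subtle than counting.

```python
from collections import Counter

# Hypothetical example data: which module each unit of code lives in.
MODULE_OF = {
    "add_item": "cart",
    "remove_item": "cart",
    "cart_total": "cart",
    "validate_order": "orders",
    "charge_card": "orders",
    "send_receipt": "orders",
}

# Hypothetical "uses" links between units, as (user, used) pairs.
USES = [
    ("add_item", "cart_total"),
    ("remove_item", "cart_total"),
    ("validate_order", "charge_card"),
    ("charge_card", "send_receipt"),
    ("validate_order", "cart_total"),  # the only link that crosses a boundary
]


def cohesion_and_coupling(module_of, uses):
    """Count links inside each module and links between pairs of modules."""
    internal = Counter()  # module -> links that stay inside it (cohesion)
    external = Counter()  # (module, module) -> links that cross (coupling)
    for source, target in uses:
        a, b = module_of[source], module_of[target]
        if a == b:
            internal[a] += 1
        else:
            # Sort the pair so (cart, orders) and (orders, cart) count together.
            external[tuple(sorted((a, b)))] += 1
    return internal, external


internal, external = cohesion_and_coupling(MODULE_OF, USES)
print(dict(internal))  # {'cart': 2, 'orders': 2}
print(dict(external))  # {('cart', 'orders'): 1}
```

In this made-up system most links stay inside a module and only one crosses the boundary, which is the shape you are hoping for.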
Ideally you want high cohesion within modules and low or loose coupling between modules. If a module has low cohesion, that suggests it is really two or more modules that would each have more cohesion, or that it’s just a Miscellaneous module where stuff has been dumped that doesn’t seem to belong anywhere else. If two modules have high coupling, that suggests that they’re really one module.
Note that neither of these statements is always true – I used the word “suggests” on purpose both times. Software is a branch of engineering in that it tries to reconcile competing constraints, and the constraints and the precise balance between them will vary from one system to the next. So I’m wary of Grand Pronouncements about How Things Must Always Be Done. Instead, I find it more useful to build an assorted collection of tools and know which to pick for which job.
The benefits of low coupling and high cohesion
You might, understandably, say: So what? Why do I want one number (coupling) to be small and another number (cohesion) to be big? Why does it matter if the boundaries between modules are drawn in the wrong place?
The first, and to me most important, reason is understandability, i.e. a benefit to the humans who work on the system rather than a benefit directly to the system’s behaviour. I don’t store lots of lines of code in my head – my brain is simply too small to hold more than the few lines I’ve recently seen. Nor do I store lots of individual requirements in my brain – again, there are too many of them. Instead, I store concepts and modules, because they’re important and there are fewer of them.
What if someone asks me a question about how we deal with orders? I know that’s related to the shopping cart module (that creates orders), the order processing module, the account history module (that lists old orders), etc. Depending on the question, I can go to the relevant module and look at its code, its tests and maybe some designs. I don’t expect all the code that touches on a concept to be in one place – it’s OK that there are separate shopping cart, order processing etc. modules. But there are few enough modules, and each has a strong enough identity (via high cohesion), that I can easily tie them to concepts. That gives me a decent chance of navigating the system while only having space in my brain for a small index of concepts and modules.
Also, when I’m designing or coding a change, there is a risk that I will introduce bugs. If all the code to deal with a concept is in the same module then, in my experience, there’s a much lower risk that I will miss or forget things. If order processing is covered by modules A, B and C and I want to change order processing in some way, I might change module B and forget to even check A and C, let alone change them.
Another way to reduce the risk of introducing bugs is to reduce the blast radius of a change. If two modules have high coupling, it’s likely that if I change one of them I will have to change the other one in sync. With a more arms-length or low coupling relationship between the modules, it’s more likely that a change will be absorbed by only one of them.
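To make blast radius a bit more concrete, here is a minimal sketch that treats "module X depends on module Y" as a stand-in for coupling, and works out which modules might need checking when one module changes. The module names and the dependency map are hypothetical, and the result is a worst case: with a properly arms-length relationship many of these modules would not need to change at all.

```python
from collections import deque

# Hypothetical dependencies: each key depends on the modules in its list.
DEPENDS_ON = {
    "shopping_cart": ["pricing"],
    "order_processing": ["shopping_cart", "pricing"],
    "account_history": ["order_processing"],
    "pricing": [],
    "newsletter": [],
}


def blast_radius(depends_on, changed):
    """Return the modules that depend on `changed`, directly or indirectly."""
    # Invert the map: module -> modules that depend on it.
    dependents = {name: [] for name in depends_on}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(module)

    # Breadth-first walk outwards from the changed module.
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - {changed}


print(sorted(blast_radius(DEPENDS_ON, "pricing")))
# ['account_history', 'order_processing', 'shopping_cart']
print(sorted(blast_radius(DEPENDS_ON, "account_history")))
# []
```

In this made-up system a change to pricing could ripple a long way, while a change to account history stays put because nothing else depends on it.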
A dumping ground or low cohesion module can be a prime candidate for churn. By that I mean lots of code changes by lots of people, often at the same time. This churn can introduce its own risks or inconveniences, such as merge conflicts in version control systems like Git. However, a dumping ground can help avoid two other problems. One problem is a blizzard of tiny modules that each have hardly any contents. The other is modules that would have high cohesion if it weren’t for a few outliers dragging it down. You need to balance the costs and benefits of these for your context.
The relationship between code structure and organisation structure is another important area that has been thought about for a long time – see e.g. Conway’s law. Even though it’s important, I think it’s outside the scope of this article.
I realise that I’m now about to try to explain a pair of abstract concepts via some hand-wavy analogies using a different field, but I hope that it’s nonetheless helpful.
The analogy I’ll use is clustering points on a 2d graph. For this you will have to imagine that each unit of code, e.g. each method, is represented by a dot in 2d space. It doesn’t really matter what these dimensions represent; they’re more to show how related two methods are. The closer two dots are in 2d space, the more closely related they are. I’m not drawing lines between dots because I think they would make it much less clear, but will instead draw boxes around dots to show cluster / module boundaries and also will colour points in a way that I hope is helpful.
If you’re interested in how you might get code to do this clustering for dots that represent real data, where the dots might be in multi-dimensional space rather than just 2d space, please refer to my article on k-means clustering. I won’t go into the details of k-means clustering here, but a useful aspect of it for this article is that you end up with a figure for each cluster (module in this case) that relates to the average distance between members of that cluster, which is the analogue of cohesion. In case they’re useful, there are diagrams in another article that illustrate don’t repeat yourself (DRY) and the single responsibility principle (SRP).
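As a small illustration of that figure, the sketch below works out the average distance between the dots in each cluster, plus the distance between the centres of two clusters as a stand-in for how loosely coupled they are. It is not k-means itself: k-means works with squared distances from each dot to its cluster's centre, and the mean pairwise distance here is a simpler substitute that I've picked for the analogy. The coordinates are made up.

```python
from itertools import combinations
from math import dist
from statistics import mean


def mean_pairwise_distance(points):
    """Average distance between every pair of dots; smaller = more cohesive."""
    pairs = list(combinations(points, 2))
    return mean(dist(p, q) for p, q in pairs) if pairs else 0.0


def centroid(points):
    """The middle of a cluster: the mean of each coordinate."""
    return tuple(mean(axis) for axis in zip(*points))


# Hypothetical dots: two tight clusters with white space between them.
modules = {
    "A": [(0, 0), (0, 1), (1, 0), (1, 1)],
    "B": [(10, 10), (10, 11), (11, 10), (11, 11)],
}

spread = {name: mean_pairwise_distance(dots) for name, dots in modules.items()}
gap = dist(centroid(modules["A"]), centroid(modules["B"]))

for name, value in spread.items():
    print(f"module {name}: mean distance between members = {value:.2f}")
print(f"distance between module centres = {gap:.2f}")
```

Each module's members are about 1.14 apart on average, while the module centres are about 14.14 apart, so in the terms of the analogy these are two cohesive modules with plenty of white space between them.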
A good state
I’ll start with a system with high cohesion and low coupling.
In this example, the dots inside each module are close to each other; this represents high cohesion. Each member of each module belongs there, because it has lots to do with the other members of its module. There is white space between each module and its neighbours; this represents loose coupling. There are no dots outside a module that have a lot to do with dots inside the module.
High coupling
Here there are two modules that are strongly related to each other – please imagine there are other modules that are all OK, i.e. with loose coupling and high cohesion, that I’ve not bothered to draw. A change to one of these two modules is likely to need a change to the other, and it might be hard to reliably remember which module a given bit of code is in.
This is a candidate for fixing, but exactly what the fix is depends on the circumstances. One approach is to merge them into one module. This has the benefit of reducing the coupling between modules, but there is now a large module and that could introduce its own problems. If you need to slim down the module then it might be possible to split it using extra dimensions. For instance, this diagram could be using its two dimensions to represent functional area in some way, but be missing a third dimension that represents something like level of abstraction. In that case you could split the big module into a high-level module and a lower-level module.
Low cohesion – two sub-modules
Here is one module that covers quite a lot of the space, but it’s actually two smaller modules welded together – think about how far apart the two furthest points are in this module. If you split the module into its two components, each component will have higher cohesion than the single big module had.
It’s easy to present this idealised example as obviously wrong, but it could have evolved over time from something better. It could have started with only the code where the two halves are closest to each other, which could have had fairly good cohesion. As new requirements and code were added, it developed into something that had two separate identities. There’s no quick and easy way to avoid this, but it is something to keep an eye on as a codebase evolves over time.
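To put rough numbers on the "two furthest points" idea, here is a small sketch that measures the largest distance between any two dots in a module, before and after splitting it. The coordinates are made up, and using the largest distance (the diameter) as the measure is my assumption for this illustration, not the only way to judge cohesion.

```python
from itertools import combinations
from math import dist


def diameter(points):
    """Distance between the two furthest-apart dots in a module."""
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)


# Hypothetical dots: two tight groups that have ended up in one module.
left = [(0, 0), (1, 0), (0, 1)]
right = [(8, 0), (9, 0), (9, 1)]
welded = left + right

print(f"one big module: {diameter(welded):.2f}")  # 9.06
print(f"left half:      {diameter(left):.2f}")    # 1.41
print(f"right half:     {diameter(right):.2f}")   # 1.41
```

Each half is far more compact than the welded-together module, which is the analogy's way of saying that each has higher cohesion.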
Low cohesion – miscellaneous module
This is a miscellaneous, utility or dumping ground module. Little in this module is particularly close to anything else in the module. One approach is to chop it up arbitrarily into smaller bits, but I don’t recommend it as it will be hard to remember what lives where.
What I suggest is that you accept it as it is, but keep an eye on it. If it develops such that there are some more closely-related members, they can be broken out into a meaningful module with decent cohesion, leaving a smaller module for the miscellaneous stuff. (If you were being poetic, it would be like a star or planet accreting from a cloud of matter.)
Modularisation is a tool that helps with software development. It’s important but can be hard to do, and people can disagree on which method to use. There are, however, a couple of ways to see how healthy a given set of module boundaries is. Are the contents of each module related to each other (high cohesion) and is each module at arm’s length from other modules (loose coupling)?