Going a bit deeper with Don’t Repeat Yourself (DRY)

DRY, or Don’t Repeat Yourself, is a principle of software engineering.  It makes code quicker and easier to understand and to change.  For instance, instead of having the same chunk of code typed out twice or more, you carve it out into a method and then call it each time you need it.  However, I think that there’s a bit more substance to it than the simple DRY slogan, and I will try to look into this a bit in this article.

I would also be wrong to miss an opportunity to refer to Radio 4’s Just A Minute, hosted by Nicholas Parsons (below) since 1967.  Contestants have to talk on a random subject for 60 seconds without repetition, deviation or hesitation.  Fortunately we usually have time to think before coding, and also have a chance to go back over our code and tidy it up afterwards.

Nicholas Parsons, host of Just A Minute

Reducing rather than removing repetition

For a start, even when you apply DRY you do still repeat yourself, but in ways that are more acceptable.  In each of these pairs, the repetition on the left-hand side is replaced by the more acceptable repetition on the right-hand side (read < as “is worse than”):

  1. Repeated identical chunks of code < Repeated identical calls to the same method*
  2. Repeated similar chunks of code < Repeated calls to the same method, passing it different parameters
  3. Repeated number or text literals with the same meaning < Repeated use of the same named constant / read-only variable / member of an enumeration

*possibly assigning the result of the method to different variables
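
As a minimal sketch of the three replacements above, here is a small Python example.  The names `VAT_RATE` and `price_with_tax`, and the prices, are invented for illustration and don’t come from any real system.

```python
# Pre-DRY: the same calculation typed out twice, with a repeated magic number.
book_total = round(12.00 * 1.20, 2)
lamp_total = round(35.50 * 1.20, 2)

# Post-DRY: one named constant and one function, called with different parameters.
VAT_RATE = 0.20  # hypothetical tax rate, defined once


def price_with_tax(net_price: float, rate: float = VAT_RATE) -> float:
    """Return the price including tax, rounded to two decimal places."""
    return round(net_price * (1 + rate), 2)


# The calls are still repeated, but each one is short and obviously the same operation.
dry_book_total = price_with_tax(12.00)
dry_lamp_total = price_with_tax(35.50)
```

The repetition hasn’t gone away: the function is called twice and its results go to different variables.  But if the tax rate changes, there is now one place to change it.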

My friend Miles made a more advanced point when I discussed this with him, which builds on item 3 above.  In big systems, it’s better to generate your enumerations from a central source such as a Master Data Management (MDM) system.  If you have several independent sub-systems (with independent code bases) that need to talk to each other, e.g. by using the same database, then it’s possible for the enumerations in sub-system A to drift away from those in sub-system B.

For instance, the people working on sub-system A need to introduce a new member in an enumeration and so update their code.  However, they forget to inform the owners of sub-system B (or sub-system C, D etc.) and then some form of communication breakdown occurs because A’s view of the enumeration is different from everyone else’s.  It would be better to have the list of valid values defined once, in an MDM system, and then have something like local code generators that read from the MDM to produce local enumerations.

Note that this is simply higher-order DRY.  Not only are you not repeating 1, 2, 3 etc. in the code of a given sub-system by using an enumeration, you are now also avoiding repeating the definition of enumerations across sub-systems by defining them once in the MDM and copying them locally.
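
Here is a rough sketch of what such a local generator could look like in Python.  I’m assuming the MDM can export its lists of valid values as JSON; the export format, `MDM_EXPORT` and `build_enums` are all hypothetical.

```python
import json
from enum import Enum

# Stand-in for an export from the central MDM system (hypothetical format).
MDM_EXPORT = '{"OrderStatus": {"NEW": 1, "PAID": 2, "SHIPPED": 3}}'


def build_enums(mdm_json: str) -> dict[str, type[Enum]]:
    """Create one Enum class per enumeration defined in the MDM export."""
    definitions = json.loads(mdm_json)
    return {name: Enum(name, members) for name, members in definitions.items()}


# Every sub-system builds its local enumeration from the same central definition.
OrderStatus = build_enums(MDM_EXPORT)["OrderStatus"]
```

In practice a generator would more likely write out source files at build time, so that the IDE can see the member names, but the principle is the same: the list of values is written down once.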

Why is some repetition better than others?

Code that has been tidied up in the way recommended by DRY is better in several ways.

It’s easier to change all instances of the repeated thing (method etc.) if it is defined in one place and then used in many places.  However, you still need to check that all places where the thing is used want the changed behaviour.

If you have to worry about internationalisation, then having each text string defined once with a meaningful name will help a lot.  (It won’t be enough to solve all the problems associated with internationalisation, but it does help.)
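
A minimal sketch of the idea, assuming a simple in-memory message catalogue (the `MESSAGES` table, the `ORDER_SHIPPED` key and the `message` function are made up for this example; real projects would normally use a proper localisation library):

```python
# Hypothetical catalogue: each string is defined once per language, under a meaningful key.
MESSAGES = {
    "en": {"ORDER_SHIPPED": "Your order has shipped."},
    "fr": {"ORDER_SHIPPED": "Votre commande a été expédiée."},
}


def message(key: str, locale: str = "en") -> str:
    """Look up a message by key, falling back to English for unknown locales."""
    return MESSAGES.get(locale, {}).get(key, MESSAGES["en"][key])


# The calling code repeats the key, never the text itself.
english_text = message("ORDER_SHIPPED")
french_text = message("ORDER_SHIPPED", "fr")
```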

Often code gets shorter after applying DRY, without reducing the understandability of the code that remains.  Therefore it’s quicker to understand, so it’s quicker to get on with your task, such as debugging the code or adding to it.

Often, the post-DRY code is actually more understandable than the pre-DRY code.

Easier to understand

In the post-DRY version, the relationship between bits of code is clearer.  If you compare two bits of code A and B:

  1. If they call the same method, passing it the same parameters, you should expect the same behaviour* – A and B are the same;
  2. If they call the same method, passing it different parameters, you should expect similar but probably different behaviour;
  3. Else A and B are different.

* this assumes a similar-enough starting state for A and B.

Also, you don’t need to check every character in an 8-line block of code that appears to be repeated – the repeated method call is usually much shorter than the method’s definition, and so there is less text to check.

Understandability is helped by abstraction and encapsulation.  Instead of having 8 lines of code in two places, you call a method that hopefully has a name that sums up those 8 lines.  So the meaning is collected together from 8 lines and concentrated into a few characters.  If you need to know the details you can refer to the method’s definition, otherwise you can just get on with understanding the rest of the code that calls the method.

Similarly, named constants are often clearer than e.g. 1 or 2.  It’s not that 1 or 2 are hard to understand, but they only have a superficial meaning.  Why is it 1 rather than 2?  Do I mean this kind of 1 or that kind of 1?  E.g. if in some parts of my system 1 means Monday and in other parts 1 means Female, you can easily get confused by appearing to compare days of the week with genders.  If instead you used a days-of-the-week enumeration (where Monday happens to have the value 1) and a separate gender enumeration (where Female happens to also have the value 1) then things are much clearer.
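
The Monday / Female mix-up can be sketched with Python’s standard `enum` module.  The two enumerations below are illustrative, with just enough members to make the point.

```python
from enum import Enum


class Weekday(Enum):
    MONDAY = 1
    TUESDAY = 2


class Gender(Enum):
    FEMALE = 1
    MALE = 2


# With bare integers, the two unrelated meanings of 1 are indistinguishable.
raw_day, raw_gender = 1, 1
raw_comparison = raw_day == raw_gender  # True, although the comparison is meaningless

# Members of different Enum classes never compare equal, even with the same value.
enum_comparison = Weekday.MONDAY == Gender.FEMALE  # False
```

Note that this relies on plain `Enum`.  Python’s `IntEnum` members behave like integers, so they would compare equal again.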

Detour into entropy

While these reasons are all good and true, there’s something else I want to get into but first I need to lay some groundwork.

There are at least two definitions of the term entropy:

  1. In physics it is a measure of disorder – how many ways can a system be rearranged and still be the same thing?
  2. In information theory it is a measure of surprise or lack of predictability.

(They’re actually the same thing – perfectly ordered things, e.g. all zeroes, have least / no information.)

I am neither a physicist nor an information theorist, so I will probably mess these up.  I also won’t go into the information theory version here, but if you’re interested there’s a good brief introduction to the information theory version of entropy on Khan Academy.  There’s also a nice brief introduction to the physics version of entropy by Prof Brian Cox.

My understanding of the physics version is based on rearrangements and duck typing.  Duck typing is a way of deciding what type of thing something is – if it walks like a duck and talks like a duck, then it’s a duck.

You rearrange the elements of a thing (like the grains of sand in a sand dune) and see if it still passes the duck typing test for a sand dune.  If you can rearrange the elements many different ways that all still pass the test, that system (e.g. a sand dune) has high entropy.  If you can rearrange the elements in only a few ways that still pass the test, that system (e.g. a sand castle) has low entropy.
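
To make the counting concrete, here is a toy model in Python.  It is my own invention and certainly not real physics: the “sand” is four columns of different heights, a rearrangement is any ordering of the columns, and the two test functions are hypothetical duck typing tests.

```python
from itertools import permutations

heights = (1, 2, 3, 4)  # toy model: four columns of sand of different heights


def looks_like_dune(columns: tuple[int, ...]) -> bool:
    """A dune is just a heap: any arrangement of the same sand passes."""
    return sum(columns) == sum(heights)


def looks_like_castle(columns: tuple[int, ...]) -> bool:
    """A castle needs a tower at each end, taller than every column in between."""
    return min(columns[0], columns[-1]) > max(columns[1:-1])


# Try every rearrangement and count how many pass each test.
arrangements = list(permutations(heights))
dune_count = sum(looks_like_dune(a) for a in arrangements)
castle_count = sum(looks_like_castle(a) for a in arrangements)
```

All 24 orderings pass the dune test, but only 4 pass the castle test, so in this toy model the castle is the lower-entropy system.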

So, in my mind, there is a notional machine that performs the duck typing test for each kind of thing (sand dune, sand castle etc.).  I expect that people who actually know what they’re talking about will be cringing by now, so I hope you will enlighten me in the comments below.

Prof Brian Cox with a sand dune
Several low entropy systems: a former keyboard player for D:Ream, a sand castle, a bucket and spade, and a house.  Image credit.

IDE as duck typing entropy machine

We can now come back from our detour.  The compiler and code completion thing (like IntelliSense) in your IDE will be much more helpful to you with post-DRY code than with pre-DRY code.

Imagine you have 8 lines of code that together do task T.  You want to do T in two places.  In pre-DRY code you have to copy and paste all 8 lines or (even worse) type it all out twice.  You could easily make a mistake, for instance selecting only 7 of the 8 lines (or 9, by accidentally including the next line after T’s code).  It’s possible that this incorrect code will still compile.

In post-DRY code, to have T a second time you call its method a second time and pass the correct parameters.  You can still make a mistake, for instance by passing in the wrong parameters.  However, the code completer and compiler will give you a lot of help.

There are fewer characters to type.  If you get any of them wrong in the method’s name, it’s highly unlikely that you will accidentally type the name of a different method, so the mistake will give you code that won’t compile rather than code that quietly does the wrong thing.  If you add documentation comments to the method so that the code completer shows them to you as you type, then the chances of mistakes are even smaller.

There is an even greater difference with constants.  If you mean to repeat the string “Monday” but instead type “Moonday” the compiler will still accept it, and the code completer will probably not help.  There is a long list of ways of mis-typing a given text string, and they will all compile (unless you accidentally mess with quotes etc.).  However, if the text string is the value of a named constant, then a mis-typed name won’t compile unless it happens to be the name of another constant or variable, and the code completer will probably steer you to the correct constant.
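
A small Python sketch of the difference.  Python has no separate compile step, so here the unknown name is reported when the line runs; a linter or IDE would normally flag it before that.  The constant `MONDAY` and the function `is_monday` are illustrative names.

```python
MONDAY = "Monday"  # the string is defined once, under a name the tooling knows about


def is_monday(day_name: str) -> bool:
    """Compare against the named constant rather than a retyped literal."""
    return day_name == MONDAY


# A typo inside a string literal is still a valid string, so nothing complains.
typo_in_literal = "Moonday" == "Monday"  # silently False

# A typo in the constant's name is an unknown name, which Python reports as an error.
try:
    MOONDAY  # deliberate typo for illustration
except NameError as error:
    typo_in_name = str(error)
```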

Similarly, if you mean to compare an integer to 1 but instead type 11 or 2, that will still compile.  However, if you compare the integer to members of e.g. the OrderStatus enumeration, the only mistakes that will still compile will be those where you pick the wrong member of the enumeration, where the enumeration member’s name will hopefully help you to spot this yourself.  (Or an automated unit test will spot it.)
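
The same point for numbers, again as a hedged Python sketch.  The members of `OrderStatus` are hypothetical, and in Python the bad member name is caught when the line runs (or earlier by an IDE or type checker) rather than by a compiler.

```python
from enum import Enum


class OrderStatus(Enum):  # hypothetical members, for illustration only
    NEW = 1
    PAID = 2
    CANCELLED = 11


def is_new(status: OrderStatus) -> bool:
    """Check the status against a named member instead of a bare number."""
    return status is OrderStatus.NEW


# Mistyping the literal 1 as 11 still gives a legal integer comparison.
raw_status = 1
mistyped_check = raw_status == 11  # runs without complaint, gives the wrong answer

# Mistyping a member name fails loudly instead.
try:
    OrderStatus.NWE  # deliberate typo for illustration
except AttributeError:
    typo_detected = True
```

Picking the wrong but valid member, such as `OrderStatus.PAID` instead of `OrderStatus.NEW`, still runs, which is where the member’s name or a unit test has to catch it.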

So, in a very hand-waving kind of way, the post-DRY code has lower entropy and the IDE acts as the duck typing machine.  There are fewer ways of rearranging the characters in your code (by making typos and other mistakes) such that it is still similar enough to pass the duck typing machine.  It can’t tell which strings are OK and which aren’t for your program, but it can tell e.g. which method names are OK and which aren’t because you give it a list of all valid methods (by defining them or including them via libraries).

Conclusion

DRY is a very useful principle in software engineering.  It can take effort to abide by as code changes over time, but I can’t remember an occasion where it wasn’t worth it.  The resulting code is usually shorter and easier to understand, and shifted in the direction where the IDE has more to go on and so can be more helpful.

One thought on “Going a bit deeper with Don’t Repeat Yourself (DRY)”

  1. Thanks for the name-check, Bob.

    And MDM (or more specifically a Reference Data Management System) becomes even more important when you are assembling big systems from multiple, diverse coding teams (or even commercial packages), since it is unlikely that in a typical business domain/problem domain there will be any common agreement on reference data values in attributes of the data model (unless there is some industry standard), let alone on a canonical data model/data schema, and you don’t want to be customising the end points to make them match.

    Reference Data management then really comes into its own when building integrations between diverse components to create a maintainable, flexible mapping between the enumerated values in component A and component B (potentially via a canonical data model which introduces its own, hopefully more comprehensive, enumerations).

    In an API architecture, it’s an interesting question as to which code should be responsible for the mapping (caller, called, or some gateway/brokering code in the middle).
