Wednesday 14 March 2012

DRY and False Negatives

Dear Junior

In earlier letters we have talked about the DRY-principle, how looking for duplicated code can help finding violations of DRY, but also about the false positives - the rare locations where unrelated concepts by coincidence happen to be represented by similar code. Using Domain-Driven Design terminology it is where two different concepts are modelled, but where the models happens to be represented the same way.

But the hardest part of DRY is the false negatives.

A "false positive" is when some kind of test raises a signal but does so wrongly. E g a medical lab test might say that you have a disease which you does not have. If you get a false positive you have to do further investigations before you realise that it was just a false alarm. In our case the false positives are where code that was alarmed as duplication, but actually was jus a coincident and no violation of DRY.

A "false negative" on the other hand, is when the test does not raise a signal but does so wrongly. E g a medical lab test says you are healthy, but you actually have the disease. Now, false negatives are a lot harder to handle as the few mistakes (wrong classification by the test) hide within an enormous pile of "perfectly normal cases" (i e the true negatives).

This is the reason why medical tests most often are designed to avoid false negatives even if that design means more false positives. In the medical field this is natural. The cost in money and suffering in a missed AIDS-diagnosis totally overshadows the costs and anxiety of having to run a few extra tests in case the initial test wrongly showed "infected".

Talking about to DRY and code: what are our false negatives? Well, it is those DRY-violations that does not show up as code duplicates. Using DDD terminology we have exactly the same concept being modelled, but it is represented in different places in the model, and the representations differ from each other. Code-wise, we have pieces of code that fulfills the same purpose in the solution, but are written in a different way.

We return to our online sales system as an example to make that clear.

In the customer management there might be a method that validate that zip-codes are five-digit strings. Some programmer wrote the regexp as part of an if-condition and at some later point of time some other programmer extracted it to a metod.

class CustomerManagement {
    private boolean validateZip(String zip) {
        return zip.matches("[1-9][0-9]{4}");
    }
}

Totally unaware of the private method in CustomerManager, some third programmer was working on the shipping logic. This programmer wanted to avoid printing invalid zip codes on packages so he looked for some feasible method in the standard APIs for a minute or two before giving up and whipping together his own helper method. Also, this programmer was not so well versed in regexps so he found another way of validating.

class Shipping {
    public static boolean isValidZipCode(String zipcode) {
        try {
            int i = Integer.parseInt(zipcode);
            return Integer.toString(i).length() == 5;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}

(BTW, these examples are inspired be real code-bases)

This is where we suddenly have a tricky instance of DRY-violation. We express the exact same idea - the criteria for how a valid zip code is formatted. But we do so in completely different way - so different that no code-duplicate detector will find them.

It would be far from tasteful to compare the direct human suffering of failing medical tests with the effects of DRY-violation. Nevertheless, in medical tests the false negatives are far more troublesome than the false positives. And the same holds for violations of DRY. The false positives we can find and discuss, but the false negatives will just silently create implicit couplings between different parts of the code - couplings that are hard to find and might break when the concept change but only a few of the representations are updated. Also, it leads to our code getting bloated.

In summary I dare to make an analogy to medicine. When looking for the DRY-disease, we will get a lot of help from testing for code-duplication. There will be some false positives that we need to keep our eyes open for so that we do not try to cure where there is no illness. However it is the false negatives that the easy test does not find that will cause us the most trouble in the long run.

Yours

Dan

Ps My colleague Anders Helgesson phrased this as "Man ska inte upprepa sig och säga samma sak, om och om igen, flera gånger" wich in translation is "You should not repeat yourself and say the same thing, over and over again, many times" - a wonderful example of a DRY violation where searching for duplicated text gives a false negative.


Wednesday 7 March 2012

DRY and Duplicated Code

Dear Junior

In my last letter I wrote about how the DRY-principle (Don't Repeat Yourself) talks primarily talks about the number of times an idea is expressed in the code-base, not about text-repetition as such.

If nothing else, the intention of DRY becomes clear when you go back the source that caused the DRY-meme to spread within the programming community: the Pragmatic Programmer by Andrew Hunt and David Thomas, published in 1999.

"Every piece of knowledge must have a single, unambiguous authoritative representation within a system" 

(Needless to say, when reading this book I found its core ideas to be extremely well aligned with my own ideas about programming - and the rest of the book have had profound impact on my view of the craftsmanship aspects of programming)

Now, to put my last letter in perspective it must be pointed out that the absolutely easiest way to find violations of the DRY principle is to search your codebase for duplicated text. It is very likely to find multiple places where the same idea is expressed over an over again using the same technical construct.

If for example you have an order system, there might be an attribute "payment" on the order telling when the order was paid.

class Order {
   Date payment = null;
   Date payment() { return this.payment; }
}

Most probably orders are not allowed to be sent for shipping unless they are paid. So, very probably there will be code somewhere checking this condition.

if(order.payment() != null) // do shipping
else // do not allow shipping

It is also probable that the same check is done in a lot of other places throughout the codebase. What we have ended up with is that the same technical construction "order.payment() != null" is used to express the same idea "order is payed" in a lot of places - a clear violation of DRY. Most probably we will also find a lot of places checking the negation "order.payment == null".

Of course the codebase would benefit from encapsulating the interpretation of a nulled payment-field and putting that encapsulated interpretation in one place - preferably the Order class.

class Order {
   Date payment = null;
   Date payment() { return this.payment; }
   boolean isPaid { return order.payment() != null; }
}

The benefits are several. To start with it is now possible to change the technical representation of an unpaid order without having to run around the code-base. We actually want all those places in the code to change at the same time - because they conceptually related things (actually they are conceptually the same thing).

An even more important benefit is that the API of the Order class becomes more expressive, making the client code more lucent.

if(order.isPaid()) // do shipping
else // do not allow shipping

Obviously, searching for duplicated text is an excellent way of finding candidate violations of DRY. If the code looks exactly the same, then those places will need to change at the same time - i e we have implicit couplings in the codebase that we risk breaking when we change.

There are excellent tools out there to detect duplicated code - some bundled in code-quality tools like Sonar, some available directly in your IDE. Most of these even have support for "fuzzying" where they detect that code is duplicated "apart from different naming of variables" for example.

Using these tools on an "average codebase of the street" will probably yield hundreds of places with duplicated code. Of these candidates, the vast majority of them will be violations of the DRY principle - especially the short code-snippets like "order.payment() == null" which are so easy to "just write" instead of looking for a method that does the job.

However, be aware that within that huge pile of DRY-violations there might also be a few "false positives". Those are the code-places that by accident happen to look the same way, but actually represent different and unrelated ideas. To "reduce the common code" in that case would only cause a irrelevant coupling between unrelated parts of the solution.

In summary, looking for duplicated code is an excellent "screening test" to find loads of candidates where the DRY-principle has been violated. However, the DRY principle is not about duplication of text, but about duplication of ideas. So, the two should be kept apart.


Even if duplicate code most often is a sign of DRY-violation, there are valid reasons to have the same code-text in two different places. 


Jérémie Chassai phrased it pretty well on Twitter: "You should repeat yourself if you say the same thing for two different reasons".

Get those tools, start searching for duplicates, and good DRY-hunting

Yours

Dan

Thursday 1 March 2012

DRY is about Ideas, not Text

Dear Junior

Ever since the Pragmatic Programmer, the principle of DRY (Don't Repeat Yourself) has been one of the central pieces of advice for good programming. From a private perspective, it has many a time helped me to write code that is better than had I not had that meme in the back of my head. However, I have also seen it misused and (in my understanding) misunderstood several times - sometimes with disastrous result.

What I think is central about DRY is that it refers to ideas. Slightly rephrased DRY can be expressed that "each idea about your software solution should be expressed once and only once in your code". With "idea" I mean stuff like "an order number is a six-digit string not beginning with 0 or 1", or "an order that is cancelled after shipping will still be debited the costs that we cannot recover".

Unfortunately, a common misuse is to let DRY to talk about code as text. Let me make it a little bit more clear though an example. 

Say that we have some kind of order system where order numbers are five-digit numbers. Somewhere we might find this code that checks whether a string is a valid order number or not.

       public static boolean isValidOrderNumber(String orderNumber) {
            return orderNumber.matches("[1-9][0-9]{4}");
        }

The code might be in the Order class, or even inside an OrderNumber value object, and possibly being called by the constructor.

As it is an order system, there will also be a part that talks about shipping, to get the goods to some specified address. The address consists of a lot of things, among those a zip code that in this country is a digit string of length five (initial digit must be non-zero). So, somewhere we can expect to find the code doing this validation.

       public static boolean isValidZipCode(String zip) {
            return zip.matches("[1-9][0-9]{4}");
        }

This code might be in the Address class, or inside the ZipCode value object - possibly called by the constructor.

Now, this is where some DRY-zealots shout "duplication, duplication" because they have pattern-matched the string matching in isValidOrderNumber with the string matching in isValidZipCode and noticed that they consist of exactly the same text (barring a rename of a variable).

I think this is a mistake, and those zealots focus on the wrong thing. 

In the eyes of the zealots, the textual duplication is bad. Instead one method should call the other. This obviously becomes bizarre code if you try it. For example it would create a completely irrelevant coupling between the zip code and the order number - two concepts that are not related at all.

Alternatively, they claim, the common code should be broken out to a separate method called from both places. Now, that might make more sense, but you still introduce a coupling between the un-related concepts order number and zip code: they will now covariant - change one and the other will change.

For example, should this company decide to use letters in their order numbers, then they need to regression test all their use of addresses as well.  

Do this kind of coupling on a massive scale, and you will end up with a system where every change is full of surprises - conceptually totally unrelated concepts start behaving different just because they had some code-snippet in common.

What to do instead?

We should not primarily look at the code as a mass of text. Instead we should look at it as representation of a set of ideas. This is of course the perspective of Domain Driven Design, where we see the code as the encoding of a distilled domain model - a model that captures our selected way of looking at the problem domain.

Now, seen from that perspective the "repeated code text" in the two validation methods is completely unproblematic. It is just two separate concepts that happens to be represented using the same technical mechanism. They have nothing to do with each other - the duplication is a pure coincidence and the two concepts should be able to change totally uncoupled from each other.

The principle of Don't Repeat Yourself is about ideas, it does not apply to text.

Yours

   Dan