Wednesday, 14 March 2012

DRY and False Negatives

Dear Junior

In earlier letters we have talked about the DRY-principle, how looking for duplicated code can help finding violations of DRY, but also about the false positives - the rare locations where unrelated concepts by coincidence happen to be represented by similar code. Using Domain-Driven Design terminology it is where two different concepts are modelled, but where the models happens to be represented the same way.

But the hardest part of DRY is the false negatives.

A "false positive" is when some kind of test raises a signal but does so wrongly. E g a medical lab test might say that you have a disease which you does not have. If you get a false positive you have to do further investigations before you realise that it was just a false alarm. In our case the false positives are where code that was alarmed as duplication, but actually was jus a coincident and no violation of DRY.

A "false negative" on the other hand, is when the test does not raise a signal but does so wrongly. E g a medical lab test says you are healthy, but you actually have the disease. Now, false negatives are a lot harder to handle as the few mistakes (wrong classification by the test) hide within an enormous pile of "perfectly normal cases" (i e the true negatives).

This is the reason why medical tests most often are designed to avoid false negatives even if that design means more false positives. In the medical field this is natural. The cost in money and suffering in a missed AIDS-diagnosis totally overshadows the costs and anxiety of having to run a few extra tests in case the initial test wrongly showed "infected".

Talking about to DRY and code: what are our false negatives? Well, it is those DRY-violations that does not show up as code duplicates. Using DDD terminology we have exactly the same concept being modelled, but it is represented in different places in the model, and the representations differ from each other. Code-wise, we have pieces of code that fulfills the same purpose in the solution, but are written in a different way.

We return to our online sales system as an example to make that clear.

In the customer management there might be a method that validate that zip-codes are five-digit strings. Some programmer wrote the regexp as part of an if-condition and at some later point of time some other programmer extracted it to a metod.

class CustomerManagement {
    private boolean validateZip(String zip) {
        return zip.matches("[1-9][0-9]{4}");
    }
}

Totally unaware of the private method in CustomerManager, some third programmer was working on the shipping logic. This programmer wanted to avoid printing invalid zip codes on packages so he looked for some feasible method in the standard APIs for a minute or two before giving up and whipping together his own helper method. Also, this programmer was not so well versed in regexps so he found another way of validating.

class Shipping {
    public static boolean isValidZipCode(String zipcode) {
        try {
            int i = Integer.parseInt(zipcode);
            return Integer.toString(i).length() == 5;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}

(BTW, these examples are inspired be real code-bases)

This is where we suddenly have a tricky instance of DRY-violation. We express the exact same idea - the criteria for how a valid zip code is formatted. But we do so in completely different way - so different that no code-duplicate detector will find them.

It would be far from tasteful to compare the direct human suffering of failing medical tests with the effects of DRY-violation. Nevertheless, in medical tests the false negatives are far more troublesome than the false positives. And the same holds for violations of DRY. The false positives we can find and discuss, but the false negatives will just silently create implicit couplings between different parts of the code - couplings that are hard to find and might break when the concept change but only a few of the representations are updated. Also, it leads to our code getting bloated.

In summary I dare to make an analogy to medicine. When looking for the DRY-disease, we will get a lot of help from testing for code-duplication. There will be some false positives that we need to keep our eyes open for so that we do not try to cure where there is no illness. However it is the false negatives that the easy test does not find that will cause us the most trouble in the long run.

Yours

Dan

Ps My colleague Anders Helgesson phrased this as "Man ska inte upprepa sig och säga samma sak, om och om igen, flera gånger" wich in translation is "You should not repeat yourself and say the same thing, over and over again, many times" - a wonderful example of a DRY violation where searching for duplicated text gives a false negative.