Friday, 17 July 2009

Canonical Data Format for Domain Model Data

Dear Junior

When accepting data from a user, it can come in many variations, and the same data can come in different forms. E.g. a phone number can be sent as "0709-158843", "0709 158 843", or "+46(0)709158843", all denoting the same thing - the number of my mobile phone. This can really be a hassle - say that one user stores the phone number as "0709 158 843" and another searches for people with phone number "0709-158843"; then there is a risk that there will be no match, even though it is actually the same phone number.

This situation can get even subtler: non-7-bit-ASCII characters like the Swedish letters å, ä, and ö can be represented as bytes in several different ways. Thus, there is always a risk that the search form does not use the same encoding as the storage in the database. The result might again be "no match" even though the two users (one entering a name like "Öberg", another searching for it) punched exactly the same keys in the same sequence, perhaps even using the same physical keyboard.
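To make this concrete, here is a small sketch in Python (my choice of language for illustration only). It assumes the two encodings involved are UTF-8 and ISO-8859-1; in a real system it could be any other pair.

```python
# The same name, typed with exactly the same keys.
name = "Öberg"

# Two common byte representations of that one string.
utf8_bytes = name.encode("utf-8")
latin1_bytes = name.encode("iso-8859-1")

print(utf8_bytes)    # b'\xc3\x96berg'
print(latin1_bytes)  # b'\xd6berg'

# A byte-wise comparison between the search form and the storage fails,
# although both sides mean the same name.
print(utf8_bytes == latin1_bytes)  # False

# Decoding with the wrong encoding does not give the name back either.
print(utf8_bytes.decode("iso-8859-1") == name)  # False
```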

To get around this, I usually decide on a canonical data format to be used in the model, which naturally will also be used in the code that implements the model. And now I do not mean technical issues like using Unicode character encoding; those are up to the programmers to decide. I mean what a phone number looks like "in its own nature", so to say, "beneath its different representations".

What a phone number should look like is a domain modelling issue, and should be decided together with the domain experts (user representative, product owner, whoever ...). It is often pretty efficient to bring up the problem explicitly: "We need to set a standard for what phone numbers should look like, otherwise the application will look like a mess, searches will fail, and integration with other systems will be a nightmare. I have a few suggestions for possible alternatives: is there any one you like better than the others, or is there another format you would suggest?" Probably you will come out with something like "0709-158843", which can be abstracted to the regexp "0ddd-dddddd[d]*" (where d stands for a digit) or similar. This is your canonical form, and will in Domain Driven Design terminology be part of your domain model.
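Written out as a real regular expression, the pattern above could look like the following sketch. The names CANONICAL_PHONE and is_canonical are just my own for illustration, and I assume that "0ddd-dddddd[d]*" means a zero, three digits, a hyphen, and then six or more digits.

```python
import re

# Hypothetical name; the pattern spells out "0ddd-dddddd[d]*" with
# explicit ASCII digit classes.
CANONICAL_PHONE = re.compile(r"0[0-9]{3}-[0-9]{6,}")


def is_canonical(text: str) -> bool:
    """Return True if the whole text is on the canonical form."""
    return CANONICAL_PHONE.fullmatch(text) is not None


print(is_canonical("0709-158843"))      # True
print(is_canonical("0709 158 843"))     # False
print(is_canonical("+46(0)709158843"))  # False
```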

When choosing your canonical form and how to represent and store it, you must obviously take a look at the system as a whole. Probably your choice will be guided both by the functionality you want to provide and by system qualities (NFRs) such as capacity, response time, and security. E.g. if you are managing a forum site with discussions that are mainly held in English, you might restrict comments to contain only A-Z, a-z, some whitespace and some punctuation. If you have several languages, but mostly west-European ones, you might not be able to restrict the ranges, but can store the text using UTF-8 (for storage capacity); if there are a lot of other languages you might use UTF-16. Finally, if the content is solely for publishing on the web and it can contain characters like '<', you might HTML-encode it for security reasons.
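A small sketch of the trade-offs might help here. The sample strings are my own, and I use Python's standard html.escape as one possible way of HTML-encoding; your platform will have its own.

```python
import html

western = "Öberg says hello"  # mostly ASCII, one Swedish letter
japanese = "こんにちは"          # no ASCII at all

# "utf-16-le" is used to measure the pure payload; plain "utf-16"
# would add a two-byte byte order mark.
print(len(western.encode("utf-8")), len(western.encode("utf-16-le")))    # 17 32
print(len(japanese.encode("utf-8")), len(japanese.encode("utf-16-le")))  # 15 10

# HTML-encoding content that is only ever published on the web.
comment = "<b>hi</b>"
print(html.escape(comment))  # &lt;b&gt;hi&lt;/b&gt;
```

For the mostly west-European text UTF-8 is about half the size, while for the Japanese sample UTF-16 comes out smaller.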

After deciding on the canonical form and its representation, it is the responsibility of every input handler to validate incoming data and convert it to the canonical form. The logic for doing this is preferably put into a value object class (PhoneNumber) so that it is easily found and used. Then the rest of the application can safely use this type (in field declarations, variables, arguments and return values), making the rest of the code more precise and expressive.
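Such a value object could be sketched like this. It is a minimal sketch under my own assumptions: the canonical form is "0709-158843", only Swedish numbers are handled, and the method name parse and the rule for "+46" and "(0)" are something I made up for illustration, not a complete phone number parser.

```python
import re
from dataclasses import dataclass

# Assumed canonical form: zero, three digits, hyphen, six or more digits.
_CANONICAL = re.compile(r"0[0-9]{3}-[0-9]{6,}")


@dataclass(frozen=True)
class PhoneNumber:
    """Immutable value object that always holds the canonical form."""

    value: str

    def __post_init__(self) -> None:
        # Validation: nothing but the canonical form gets into the model.
        if not _CANONICAL.fullmatch(self.value):
            raise ValueError(f"not a canonical phone number: {self.value!r}")

    @classmethod
    def parse(cls, raw: str) -> "PhoneNumber":
        """Convert user input in one of the known variations."""
        text = raw.strip()
        # Assumption: "+46" (Sweden) with an optional "(0)" replaces the
        # leading zero of the national number.
        if text.startswith("+46"):
            text = "0" + text[3:].replace("(0)", "", 1)
        # Drop the separators users tend to type.
        digits = re.sub(r"[\s\-()]", "", text)
        if not re.fullmatch(r"[0-9]+", digits):
            raise ValueError(f"not a phone number: {raw!r}")
        return cls(f"{digits[:4]}-{digits[4:]}")


stored = PhoneNumber.parse("0709 158 843")
searched = PhoneNumber.parse("+46(0)709158843")
print(stored.value)        # 0709-158843
print(stored == searched)  # True
```

Since the class is frozen and compares by value, the stored and the searched number match no matter how each of them was typed.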

Also, by default all data presentations (i.e. output) will be in this format as well. If there are presentations that need another form, the responsibility falls on them to convert to the format needed. E.g. some listing might want all phone numbers to be structured as "0709 [tab] 15 88 43"; then it is up to that listing to convert the phone numbers to that format. In the same way, if some presentation needs a specific encoding (UTF-16, Base64 or HTML-encoded), it is up to that presentation tier to do the conversion.
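The listing in that example could do its conversion with something like the sketch below. The function name format_for_listing is hypothetical, and it assumes that it is handed a number already on the canonical form "0709-158843".

```python
def format_for_listing(canonical: str) -> str:
    """Turn '0709-158843' into '0709<tab>15 88 43' for one specific listing."""
    prefix, subscriber = canonical.split("-", 1)
    # Group the digits after the hyphen two by two.
    pairs = [subscriber[i:i + 2] for i in range(0, len(subscriber), 2)]
    return prefix + "\t" + " ".join(pairs)


print(format_for_listing("0709-158843") == "0709\t15 88 43")  # True
```

The point is that this code lives with the listing, not in the model.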

In this way, life within the model becomes simple, and searching and matching can be done without trouble. At the same time, we can support all the input and output formats we need by pushing encodings and conversions towards the system border.