Monday 31 August 2009

Domain Models, Data Models, and How They Resemble Each Other


Dear Junior

In what ways domain models and data models should resemble each other is a really interesting topic. It is not uncommon for me to ask what a “Foo” and a “Bar” really are and what their relationship is, only to have the architect answer with a smiling face by showing me the database schema, complete with join tables and everything. That was not what I had in mind.

So what is the difference between a database data model and a DDD domain model? Why do I care?

For me, it basically boils down to the purpose of the respective model. The purpose of the database data model is to provide the system with efficient storage for persisting data, and for searching that same data. The purpose of the domain model (à la Domain Driven Design) is to work as the focal point and a vehicle for communication about the system. In the database model you might have redundancy and pre-aggregated data for efficiency. In the domain model you try to make each concept and each relation as clear and unambiguous as possible.
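To make the contrast concrete, here is a minimal Java sketch – the class and column names are hypothetical examples of my own, not from any particular system. The domain model states each concept and relation as plainly as it can; the data model behind it is free to diverge for efficiency:

    import java.math.BigDecimal;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical domain classes; the names are illustrative only.
    public class Order {
        private final String placedBy;                 // one clear relation: who placed the order
        private final List<OrderLine> lines = new ArrayList<>();

        public Order(String placedBy) { this.placedBy = placedBy; }

        public void add(OrderLine line) { lines.add(line); }

        // Computed on demand in the domain model. The database data model
        // might instead persist a pre-aggregated "order_total" column (plus
        // a join table for the order-line relation) purely for query speed.
        public BigDecimal total() {
            return lines.stream()
                        .map(OrderLine::price)
                        .reduce(BigDecimal.ZERO, BigDecimal::add);
        }
    }

    record OrderLine(String product, BigDecimal price) {}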

True, in some places the database model is used as the domain model, and even the business-side domain experts have learned to read the schemas. My experience is that even though this might work, for example by using ER diagrams instead of the actual tables, it has its limitations. First, people tend to forget the distinction between the concepts per se and their realisation in the database. So, if you want to re-conceptualise (which is good to try from time to time) you might be met with counter-arguments such as “but that will violate the constraints in the triggers”. Secondly, a database data model must by necessity cover every single piece of persistent data – however marginal its importance. Strict modelling of a concept is a quite brain-expensive operation, so it is simply a waste of precious resources to do it on every minute detail. To get leverage out of your modelling efforts it usually suffices to concentrate on the most central parts. Thirdly, a database model will leave out data that is not persistent, or that is extracted or imported from other sources – and those might well be very central to the conceptual understanding of the problem at hand.

For these reasons, I simply advise doing a separate domain model, even if you have a complete database model in place. The two facilitate different kinds of discussions.

So, how much should these models resemble each other? Well, let us say that we have established that the domain model is the focal point of discussion. Furthermore, the domain model has grown out of a discussion where several types of input and considerations have been taken into account: how the customers of the product (e.g. end users) think, standards within the domain, and the input of GUI experts and database experts. All those inputs are needed to give a model that fulfils its purpose – to enable us to build a good system. In that way the domain model is the result of a negotiation between several parties, with the goal of meeting each party’s need as well as possible. So, as the database expert has his or her say in that discussion, there should be some level of resemblance.

But how much should they resemble each other? Well, if the domain model is the common and central model, then each party will need to skew it slightly to realise it in their own context. The programmers might need to translate concepts to a different language to make working code. The interface designers might need to lay out components in a slightly different way. And the database designer might need to add an aggregation of data for efficiency, or add a join table to make ends meet. However, any unmotivated difference should be considered a technical debt and a target for a refactoring effort.

For the motivated adaptations, the burden of maintaining the respective mapping falls back on each party. If we need to reconceptualise and change the domain model, then each party will need to make both that change and the changes to their adaptations. So, in order to survive, each party should keep the adaptations to the minimal motivated level.

Domain models à la Domain Driven Design and data models for database persistence should, in a sound system, resemble each other. But they are not the same, as they fulfil different purposes, and they should not be confused with each other.

Yours
   Dan

PS When trying to keep down the semantic difference between the domain model and the data model, you probably want to refactor the database continuously – something that gets a lot easier if you version your database.
PPS A related question is to what extent technical details are allowed to be part of the model.
PPPS Domain Driven Design is in general a large and interesting field in and of itself.

Friday 21 August 2009

Heapsort, the Binary Mafia, and Product Backlog Priorities

Dear Junior

I recently picked up my university book on algorithms and by accident came across the section on heapsort and its data structures. When reading it, I was struck by a strong association with product backlogs and how product owners keep them sorted, often spending substantial time on that task.

Common advice to product owners is to give each user story a priority by attributing it an “importance number”. A high importance number means an important story, so if you want to send a story to the top, you can always change its number to be higher than that of the currently most important one. This is a neat trick that eliminates a common problem when things are given priority numbers (where “priority one” is most important) – because what do you do when something even more important arrives? Give it “prio zero”?

In the same breath, product owners are usually advised that all stories should have unique importance numbers, thus distinguishing them. For those stories that are to be developed in the near future (what I call the “Backlog Shortlist”), it is important to sort out their relative priorities, and I agree that assigning unique numbers to those stories is a feasible way to do it.

However, for the rest of the backlog, setting unique numbers is unnecessary work. Following the advice forces us to determine the exact order of even the bottom ten, most unimportant, stories – and to do this at an early stage. Guess what: those priorities might, and will, change many times before those stories are ripe for development. Chances are high that those stories will not even be developed, at least not in their current shape. So basically we have wasted precious product-owner time establishing details that were not needed yet – what I call “premature exactness”.

Going back to the algorithms and data structures, this reminds me of how a list is sorted using insert-sort, where a new item “scans” through the list to find its place in the sorted list. Doing this one element at a time with all the elements you want to sort gives you a sorted list with the most important element at the beginning. If you extract the most important element one at a time from that list you obviously get them in priority order. Also, if you let new elements enter the list during the process, you have what computer scientists call a “priority queue”. The connection to the backlog should be obvious.

However, it turns out that insert-sort is a pretty bad way of implementing a priority queue. This is because inserting a new element requires you to scan half the list on average – in other words, inserting a new element is linear in the number of elements in the list, or O(n) using computer science terminology. So, the complete work of processing n original items through the priority queue will be O(n*n).
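As a sketch of my own (not code from any real backlog tool), here is that insert-sort approach as a deliberately naive priority queue in Java, with a plain int standing in for a story’s importance:

    import java.util.ArrayList;
    import java.util.List;

    // A priority queue backed by a sorted list, highest importance first.
    public class SortedListQueue {
        private final List<Integer> items = new ArrayList<>();

        // Insert by scanning for the right slot: O(n) per element,
        // so pushing n stories through costs O(n * n) overall.
        public void insert(int importance) {
            int i = 0;
            while (i < items.size() && items.get(i) >= importance) {
                i++;                      // scan past everything at least as important
            }
            items.add(i, importance);     // keep the list fully sorted
        }

        // The most important item always sits at the front.
        public int extractMax() {
            return items.remove(0);
        }
    }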

Heapsort on the other hand uses another data structure called the “heap”. This heap is a binary tree with some extra restrictions on how nodes can sit relative to each other – not to be confused with the garbage-collected memory area used in runtime environments such as Java or LISP.

What makes the binary tree into a heap is an extra property: each node has a value that is larger than its immediate children’s. Thus, the largest value will be at the top. However, in the heap a node does not care about whether the right or the left child is larger; nor does it care about the values in the nodes two steps down. And it is exactly this ignorance of irrelevant details that makes the heap an excellent structure for implementing a priority queue.

I think of a heap as a Mafia organisation with branches. In each cell (tree node), the toughest guy or gal is the cell boss, with two subordinates – each being the boss of one sub-branch. The other way around, the tough boss is part of another cell (the tree node above) where the boss of that cell is yet tougher. And of course, at the top is the head boss (root node), who is the toughest of them all.

So, as there are two sub-bosses directly under the head boss, are those two the second and third toughest in the org? Well, one of them is the second toughest, no doubt – but the other one might not be the third toughest. He or she might just be lucky that the third toughest is not in his or her part of the org. The third toughest might be one of the underdogs of the other sub-boss. In the same way, it is not necessarily the second-best team that plays the final in a cup. The winner of the final should be the best, but the second-best team could have lost against the best in some early round.

Back to the Mafia: it is easy to know who’s the toughest; it’s the boss right at the root of the heap. But what happens when the head boss gets shot ("extracted from the heap")? Well, the two sub-bosses will challenge each other, and the tougher will step up to take the empty place. Now there is a vacancy in the promoted sub-boss’s old position. So, the tougher of the two in that cell will step up to fill that vacancy, leaving a vacancy another step down in the hierarchy – rattling down through one “leg” of the tree. At each level there will be a challenge between the peers on who is to be the new boss, and the number of challenges in total will be the height of the tree, which is roughly the logarithm of the size of the org. So, appointing a new head boss when the old one gets shot is an O(lg n) operation.

So, what happens when a newbie enters the org? Well, the newbie will enter as a leaf node in one of the cells, reporting to some lowest-level local cell boss. Now, it may well be that the newbie is tougher than the local cell boss, in which case the latter will be challenged, and they will switch places, the newbie becoming the new local cell boss. Of course, the newbie might be tougher than the next-level boss as well, whereupon there will be a new challenge and another switch. At worst, the newbie will be tougher than everybody in the org, and will challenge his or her way all the way to the top, becoming the new head boss. This will take as many challenges as there are levels in the org – i.e., again, the tree height. So, accepting a new member is an O(lg n) operation.
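To tie the metaphor down, here is a minimal array-backed max-heap in Java – again a sketch of my own, with a plain int standing in for “toughness”. The boss of the node at index i sits at (i - 1) / 2, and its two subordinates at 2i + 1 and 2i + 2:

    import java.util.ArrayList;
    import java.util.List;

    public class MafiaHeap {
        private final List<Integer> org = new ArrayList<>();

        // A newbie enters as a leaf and challenges upward: O(lg n).
        public void insert(int toughness) {
            org.add(toughness);
            int i = org.size() - 1;
            while (i > 0 && org.get(i) > org.get((i - 1) / 2)) {
                swap(i, (i - 1) / 2);   // challenge the local boss and switch places
                i = (i - 1) / 2;
            }
        }

        // The head boss is "extracted"; the vacancy rattles down one leg: O(lg n).
        public int extractMax() {
            int boss = org.get(0);
            int last = org.remove(org.size() - 1);
            if (!org.isEmpty()) {
                org.set(0, last);
                int i = 0;
                while (true) {
                    int left = 2 * i + 1, right = 2 * i + 2, toughest = i;
                    if (left < org.size() && org.get(left) > org.get(toughest)) toughest = left;
                    if (right < org.size() && org.get(right) > org.get(toughest)) toughest = right;
                    if (toughest == i) break;
                    swap(i, toughest);  // the tougher sub-boss steps up
                    i = toughest;
                }
            }
            return boss;
        }

        private void swap(int a, int b) {
            int tmp = org.get(a);
            org.set(a, org.get(b));
            org.set(b, tmp);
        }
    }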

If inserting a new element is O(lg n) and extracting the highest-priority element is O(lg n), then processing all elements of a list takes O(n lg n). As product backlogs can often hold some thirty to a hundred stories (if not more), the difference between n*n and n lg n is substantial – especially if we are talking about precious product-owner time. So, heapsort is a much better role model than insert-sort when grooming our product backlog.
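To put a rough number on it: for a backlog of a hundred stories, n*n is on the order of 10 000 comparisons, while n lg n is roughly 100 × 7 ≈ 700 – more than an order of magnitude less sorting work, and the gap only widens as the backlog grows.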

How come the heap performs so well? Its efficiency stems from not keeping the entire queue perfectly sorted. Instead it focuses on the most important thing – keeping the “toughest” element at the top, and keeping the rest of the queue roughly sorted, where “tough” elements tend to hang around near the top, and less tough ones somewhere lower down. Thus it avoids premature exactness, i.e. decisions that are not needed yet. I also think of it as a kind of "fuzzy late evaluation" – if you get what I mean.

Applying this back to our product backlog, we might go so far as to structure the backlog as a heap. That would be cool; I have never done it, but I am definitely looking for the opportunity. Taking some of the wisdom back, we have learned that it suffices to keep the lower part of the backlog “roughly sorted”. One way of applying that would be to relax the requirement on unique importance for each story. We should still require unique numbers for the shortlist, but for the rest it suffices that there are strata of stories that are roughly as important and that may share the same importance number.
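If you ever want to try it, a heap-backed backlog is only a few lines away – a hypothetical sketch with names of my own invention; note that java.util.PriorityQueue is itself implemented as a binary heap, and it happily accepts stories that share an importance number:

    import java.util.Comparator;
    import java.util.PriorityQueue;

    record Story(String title, int importance) {}

    public class Backlog {
        // Highest importance first; ties (strata) are perfectly fine.
        private final PriorityQueue<Story> stories =
                new PriorityQueue<>(Comparator.comparingInt(Story::importance).reversed());

        public void add(Story story) { stories.add(story); }   // O(lg n)
        public Story next()          { return stories.poll(); } // O(lg n)

        public static void main(String[] args) {
            Backlog backlog = new Backlog();
            backlog.add(new Story("Login page", 90));
            backlog.add(new Story("Audit log", 30));   // shares a stratum...
            backlog.add(new Story("CSV export", 30));  // ...with this one
            System.out.println(backlog.next().title()); // prints "Login page"
        }
    }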

Yours

Dan