Thursday 8 May 2008

Version of Data in Database

Dear Junior
I recently had a database versioning scheme go into production. By database versioning I do not mean knowing the patch level of the database management system (in this case IBM DB2), but knowing the version of the business data itself.
If you never or seldom change the database definition, you can easily keep track of changes. However, with agile-style development we constantly change our system, either by adding functionality or by refactoring the structure, and then we need to change the database as well.
Sooner or later, we run into the situation where we want to test some functionality, get a backup from production, and then wonder: "ok, so what structure changes do I need to make to get this usable?" It might be easy to see what tables or columns are missing and should be added. However, we then rely on a good night's sleep and good morning coffee, and still we might miss more subtle changes. And, trust me, this gets even worse if it is not a production copy, but some test database that has been lying around and where disparate groups have performed random experimental changes "just to see what it would look like" - changes that later may have made it to production, or not. In that case you should expect a turmoil of inconsistencies.
The best way I have found to get out of this dilemma is to assign each version of the database structure a number. Of course you need some place to start, and at this last occasion we selected what was in production on 2008-01-01 and announced that as structure zero. The idea is that every change of the database must declare what version it is a change to. So, instead of saying "add column FOO to table BAR", you must say "if the database is in version 32 you can get it to version 33 by adding column FOO to table BAR". So, this would be the SQL-code for "patch 33".
The information on database structure and the patch scripts are of course in our Subversion repository, in a directory of its own, "database", with subdirectories "baseline", "patch001", "patch002" etc.
The baseline in our case is nothing more than the DDL for recreating the database structure as it looked in production on 2008-01-01, plus scripts for inserting "base data" that has to be there and is not managed by Canatala (which is the name of the application).
In this case the application connected a bank to several insurance providers; the base data it needed was information about the insurance companies. (In fact, I have substituted the application and the domain for non-disclosure reasons, but the problem is the same.) Anyway, this data was not managed by the application, but imported as a batch on first deploy, and then enhanced by inserts at later deployments.
To get really down to earth: the baseline directory contains twelve files.
  • baseline-sql01-create-structure-ddl.sql
  • baseline-sql02-insert-basedata-insurance-providers.sql
  • baseline-sql03-insert-basedata-xyz.sql // some other base data
  • baseline-sql04-insert-basedata-uvw.sql // yet more base data
  • baseline-sql05-sp-PROVIDERSYNC.sql // a stored procedure
  • baseline-sql06-sp-XYZ.sql // another stored procedure
  • ...
  • baseline-sql12-sp-UVW.sql // last stored procedure
Note that the files are named in a way that enables a script to automatically apply them in the right order, using the filename up to the second '-'. At the same time, the file names describe to the human eye what the SQL is about, without disturbing the script. We were kind of proud of that balance.
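To make the naming rule concrete, here is a minimal sketch in Python of how a script could derive the ordering. The function names `order_key` and `in_apply_order` are made up for this illustration and are not taken from our actual script; the only thing carried over is the rule that everything up to the second '-' decides the order.

```python
def order_key(filename):
    """Return the part of the name up to the second '-', e.g. 'baseline-sql02'."""
    parts = filename.split("-", 2)
    if len(parts) < 3:
        raise ValueError("unexpected script name: " + filename)
    return "-".join(parts[:2])


def in_apply_order(filenames):
    """Sort script names by their ordering prefix; the descriptive tail is ignored."""
    return sorted(filenames, key=order_key)


files = [
    "baseline-sql05-sp-PROVIDERSYNC.sql",
    "baseline-sql02-insert-basedata-insurance-providers.sql",
    "baseline-sql01-create-structure-ddl.sql",
]
for name in in_apply_order(files):
    print(name)
```

Plain string sorting is enough here only because the sequence numbers are zero-padded to the same width (sql01, sql02, ... sql12).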
The "patch" directories contain all the SQL-files needed to take a database from one schema version to the next. As an example, we shortly thereafter came across a new concept "insurance delegate", which we also needed to store in the database - thus a new table INSURANCE_DELEGATES.
This table also needed to be filled with some data to match the existing insurance providers. All this became patch 002, so the "patch002" directory contains the files needed to take a version 001 database and upgrade it to version 002.
  • patch002-sql01-create-provider-delegate-table.sql // creating new table
  • patch002-sql02-insert-provider-delegates.sql // filling the new table
  • ... // another script excluded, you will soon see what it is about
I think you have got the hang of it by now. If you need to upgrade a database, just run all the scripts in all the patch directories of a "higher" version than what you have. If you need to create a database from scratch, use the baseline and then run all the patch scripts.
But how do you know which patch scripts to run, i.e. how do you know what patch level you are at?
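The selection of patch directories can be sketched in a few lines of Python. This is an illustration under assumptions, not our actual tooling: the function name `patches_to_apply` is hypothetical, and "patch003" in the example is an invented directory that only serves to show the ordering.

```python
import re

# Matches directory names such as "patch002"; "baseline" does not match.
PATCH_DIR = re.compile(r"^patch(\d+)$")


def patches_to_apply(directory_names, current_version):
    """Return the patch directories above current_version, lowest first."""
    numbered = []
    for name in directory_names:
        match = PATCH_DIR.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for number, name in sorted(numbered) if number > current_version]


dirs = ["patch003", "baseline", "patch000", "patch002", "patch001"]
print(patches_to_apply(dirs, 1))   # ['patch002', 'patch003']
print(patches_to_apply(dirs, -1))  # every patch, for a database freshly built from the baseline
```

Comparing the parsed numbers rather than the directory names keeps the order right even if the numbering one day outgrows three digits.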
The really good part is that this version information should be *part of the data*. In this last case we added a new table defined by:
--- Table on database version (current and history), when patch was
--- deployed, and what it consisted of
CREATE TABLE CANATALA_DB_VERSION(
CANATALA_DB_VERSION BIGINT,
PATCHDEPLOYTIME TIMESTAMP,
DESCRIPTION VARCHAR(256)
);
And then, whenever you apply a patch, you change the data in this table to reflect the change of patch level.
Actually, there are two ways of using a table like this. Either you just save the information on the current version, or you save the entire history. I have done both, and settled for saving the history, because it really helps when you "find" some database lying around and wonder about its life hitherto. Next time I will probably also extend it with some kind of environment information, e.g. an extra column storing the IP address of where it was running when it was patched, to distinguish whether it was patched in production or upgraded after being copied from production.
As you might have guessed, this table was not in the database when we started, so the CREATE is not part of the baseline-sql01-create script. We had to introduce it.
Actually, introducing the version table was our first patch, patch000. So directory patch000 contained two files, one introducing the structure, and the other inserting the data that "this database is now of patch level 0".
  • patch000-sql00-create-version-table.sql
  • patch000-sql01-insert-dbversioninfo.sql
The first file simply contains the CREATE TABLE I mentioned earlier. The second file contains the INSERT needed to insert "version 0" into the table:
INSERT INTO CANATALA_DB_VERSION
(CANATALA_DB_VERSION,PATCHDEPLOYTIME,DESCRIPTION)
VALUES
(0,CURRENT TIMESTAMP,'Starting versioning of database');
I guess this makes it pretty obvious what is in the file I left out in patch002 - it is the db-version insert for that patch.
  • patch002-sql03-insert-dbversioninfo.sql
INSERT INTO CANATALA_DB_VERSION
(CANATALA_DB_VERSION,PATCHDEPLOYTIME,DESCRIPTION)
VALUES
(2,CURRENT TIMESTAMP,'Adding support for insurance delegates');
If we follow this scheme, we can take a look in any database and see its patch-history and decide what patches we need to bring it to a usable patch-level.
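Taking that look can be as small as two queries. The sketch below uses Python with an in-memory SQLite database purely as a stand-in for DB2, so that it runs anywhere; note that SQLite spells the timestamp CURRENT_TIMESTAMP where DB2 accepts CURRENT TIMESTAMP. The helper names and the description text for patch 1 are placeholders I made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real DB2 database
conn.execute(
    "CREATE TABLE CANATALA_DB_VERSION("
    " CANATALA_DB_VERSION BIGINT,"
    " PATCHDEPLOYTIME TIMESTAMP,"
    " DESCRIPTION VARCHAR(256))"
)
for version, description in [
    (0, "Starting versioning of database"),
    (1, "(placeholder description for patch 001)"),
    (2, "Adding support for insurance delegates"),
]:
    conn.execute(
        "INSERT INTO CANATALA_DB_VERSION"
        " (CANATALA_DB_VERSION, PATCHDEPLOYTIME, DESCRIPTION)"
        " VALUES (?, CURRENT_TIMESTAMP, ?)",
        (version, description),
    )


def current_patch_level(connection):
    """Highest version recorded in the table, or None if the table is empty."""
    row = connection.execute(
        "SELECT MAX(CANATALA_DB_VERSION) FROM CANATALA_DB_VERSION"
    ).fetchone()
    return row[0]


def patch_history(connection):
    """All recorded patches, oldest version first."""
    return connection.execute(
        "SELECT CANATALA_DB_VERSION, PATCHDEPLOYTIME, DESCRIPTION"
        " FROM CANATALA_DB_VERSION ORDER BY CANATALA_DB_VERSION"
    ).fetchall()


print(current_patch_level(conn))
for version, deployed, description in patch_history(conn):
    print(version, deployed, description)
```

A database from before patch000 has no version table at all, so against such a database these queries fail with an error rather than returning a level.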
Of course, this model is not perfect - there are several ways to improve it. For a start, there is no guarantee that each patch will actually contain a dbversion insert. One way to fix that could be to have the patches applied by some kind of special tool. That tool could check that the patch directory actually contains such a file, and otherwise refuse to apply the patch. Another drawback is that it is possible to apply a patch to a database of the wrong level, e.g. applying patch 34 to a database of version 31. You might even accidentally apply the same patch twice. This is also something a tool could check.
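We never built such a tool, so take the following Python as a rough sketch of what its checks might look like rather than as something that exists. The function name `check_patch` and the file names in the example are hypothetical; the only convention it relies on is that the version insert lives in a file ending in "-insert-dbversioninfo.sql".

```python
def check_patch(patch_number, patch_files, current_version):
    """Return the reasons to refuse a patch; an empty list means it may be applied."""
    problems = []
    # Every patch must record its own version in the version table.
    if not any(name.endswith("-insert-dbversioninfo.sql") for name in patch_files):
        problems.append("patch contains no dbversioninfo insert")
    if patch_number <= current_version:
        # Covers applying the same patch twice.
        problems.append(
            "patch %d is already applied, database is at version %d"
            % (patch_number, current_version)
        )
    elif patch_number != current_version + 1:
        # Covers skipping levels, such as patch 34 on a version 31 database.
        problems.append(
            "patch %d expects version %d, database is at version %d"
            % (patch_number, patch_number - 1, current_version)
        )
    return problems


good_files = [
    "patch034-sql01-some-change.sql",  # hypothetical file name
    "patch034-sql02-insert-dbversioninfo.sql",
]
print(check_patch(34, good_files, 33))  # []
print(check_patch(34, good_files, 31))
print(check_patch(34, good_files[:1], 33))
```

Returning all the problems at once, instead of stopping at the first, gives whoever runs the tool the full picture in one go.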
Even if there are plenty of ways to improve this model, it is a really good way to start.
But the best part has not even started. Once you have got versioning of the database under control, you suddenly have the possibility to start working with the database in a new way - such as refactoring the database structure. And we can get agile in the database tier as well.
Once you have gotten your toes wet, there are lots of places to dig deeper. I would like to point out two. K. Scott Allen has written five blog posts on database versioning that are well worth reading [see Jeff Atwood's blog at http://www.codinghorror.com/blog/archives/001050.html]. After that, I really recommend "Refactoring Databases" by Scott Ambler and Pramod Sadalage, where they show the power of the approach.
Yours
Dan
PS A great thing about having a versioned database is that you can start refactoring it to ease the mapping between the domain model and the database design.