Wednesday 27 May 2009

Why Release Planning Works

Dear Junior
A very efficient way of planning releases is to extrapolate the team's work within the limits of a "high velocity" and a "low velocity". Mike Cohn (one of the leading figures of Scrum) does a good job of showing this in a blog post entitled "Why There Should Not Be a 'Release Backlog'", where a team with velocity 33±4 can reliably be trusted to complete 165±20 story points within the upcoming five sprints.
People who have tried this technique report that it works remarkably well – in fact it seldom fails. However, you should not use it to predict what you can finish within the next sprint or two; for a release four-to-five or more sprints ahead, though, the method seems to work fine.
This catches my attention. If something works remarkably well, it would be nice to have an explanation; and that explanation should also explain why it does not work in the short term. What is so magical about four-to-five sprints? If something works fine for seven sprints, it should work equally fine for three, should it not?
Actually, applying some not-too-advanced statistics and probability theory can explain both why the method works so well for five-sprint planning and why it is risky for two-sprint planning. Better yet, it can even tighten up the estimates for longer horizons. E g for the "next quarter", which roughly equals six two-week sprints, the estimate becomes 20% tighter.
Before getting started we have to note that there is an inherent difference between guessing velocity and measuring velocity, especially if you want to do estimates based on statistics.
If you guess that the velocity is in the range 29-37, you do not gain any extra information by guessing twice. Making a careful analysis of the team composition and their environment is of course better than guessing out of thin air, but from a statistical perspective it is equivalent to guessing – doing the same analysis twice gains no extra information. This is not said to talk down guessing or estimating a velocity (before the first sprint it is the best we can do), it is said to point out the limitations.
So, if you guess (or calculate) a velocity of 33±4, all you can say about the completion within five sprints is that it will be in the range 145-185 story points.
Graphically, your view of the future will look like this:
A key difference when you measure velocity is that a second measurement gives extra information, even if it yields the same result. This is why we can do better through statistical analysis. Note that I will take the liberty of being a little sloppy with terminology; I just want to get the gut feeling across, not write a thesis.
To start with, we have to model the sprint velocity as a probabilistic event. Looking at Mike Cohn's example (used with permission) we get the data points 36, 28, 36, 38, 24, 35, 32, 35.
Looking at the data, it seems fair to model it as a normal distribution. We could of course be wrong; it could be some other distribution. However, we will see later that even if we guess the wrong distribution, it will not matter in the end.
Pulling the data through the standard formulas gives us an average of 33 and a standard deviation of 4.75. Using mathematical notation, we denote the sprint velocity 'X'. The long-term expected average is denoted E[X], and the standard deviation σ[X].
So, in statistical terms we are observing the outcome of the process X (giving one number after each sprint) where
E[X] = 33
σ[X] = 4.75
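The numbers above can be reproduced with a few lines of Python's standard library – a quick sketch, where `statistics.stdev` gives the sample standard deviation:

```python
# Recompute E[X] and σ[X] from the eight observed sprint velocities
# in Mike Cohn's example.
import statistics

velocities = [36, 28, 36, 38, 24, 35, 32, 35]

avg = statistics.mean(velocities)   # E[X]
sd = statistics.stdev(velocities)   # σ[X], sample standard deviation

print(avg)            # 33
print(round(sd, 2))   # 4.75
```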
Let us pause for a moment and compare this standard deviation with the numbers Mike Cohn got when calculating "high velocity" and "low velocity". He took the mean of the "best three" and of the "worst three", giving a range of 28-to-37, which corresponds to roughly ±4.5, pretty close to the 4.75 we got.
This is not a coincidence. Taking the averages of the three best and the three worst out of eight is a rough estimate of the standard deviation.
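We can sketch that shortcut in Python and compare the half-range it produces with the sample standard deviation – the two land in the same ballpark:

```python
# The "high/low velocity" shortcut: average the best three and the
# worst three observations and compare the half-range with σ[X].
import statistics

velocities = [36, 28, 36, 38, 24, 35, 32, 35]
ordered = sorted(velocities)

low = statistics.mean(ordered[:3])    # worst three: 24, 28, 32 -> 28
high = statistics.mean(ordered[-3:])  # best three: 36, 36, 38 -> ~36.7

half_range = (high - low) / 2
print(round(half_range, 2))                    # ~4.33
print(round(statistics.stdev(velocities), 2))  # 4.75
```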
This actually explains why the Mike Cohn trick does not work very well for predicting the outcome of the next single sprint (note that nobody claimed it would).
A normal distribution only gives 68% of its outcomes within avg±1σ. So if we guess that the next sprint will have a velocity between 28.25 and 37.75, we will be wrong one-third of the time, which is not reliable. If we want e g 95% confidence (being right 19 times out of 20), we have to widen the interval to two standard deviations, in our example 33±9.5, i e 23-43. However, an estimate where the higher bound is roughly double the lower bound is not impressive – you would probably not leave the management board meeting alive.
So how come the trick works for longer sequences of sprints? What we basically want to know is the average and standard deviation of the sum
X1+X2+...+Xn
where each of X1...Xn is the outcome velocity of a sprint.
This is what the Central Limit Theorem talks about. To no surprise, the average
E[X1+X2+...+Xn] = E[X1]+E[X2]+...+E[Xn] = n*E[X]
so the expected long-term average of n sprints is just n times the average of each sprint. In our example, for five sprints we get
E[X1+X2+...+Xn] = 5*33 = 165
and no-one is surprised.
However, presenting the expected outcome to stakeholders is dangerous, as 'expected' only means 'close to', not 'likely'. For example, throwing three six-sided dice has the expected value 10.5, but you will never ever see an outcome of 10.5; it might be 10, it might be 11, but never 10.5.
Adhering to the phrase "rather roughly right than precisely wrong", this is why I prefer to present intervals. Smaller intervals are more precise but less likely to be correct; larger intervals are more likely to be correct but less precise, and thus less useful. The trick is to find the balance.
I usually settle for an interval with 95% confidence, i e I take the risk of being wrong one time in twenty. For a process that follows a normal distribution (note: no-one has said that sprint outcomes do follow that distribution) the 95% interval is avg±2σ. We know the average (n*E[X]), so we need the standard deviation of the sum X1+...+Xn. Here the Central Limit Theorem comes with a nice surprise. The standard deviation of a sum does not grow linearly with the number of terms. Instead
σ[X1+X2+...+Xn] = √n*σ[X]
The intuition behind this is that over many sprints, good and bad sprints tend to even out; it takes consistently good luck to get high sums, and consistently bad luck to get low sums.
Put another way, if you think of each sprint as throwing a die, then five sprints will be throwing five dice and counting the sum. The risk of five dice simultaneously showing low numbers is of course lower than the risk of one single die showing a low number. Roughly: for one six-sided die the risk of the lowest value (1) is one out of six; for two dice the risk of the lowest sum (1+1) is one out of thirty-six (36 = 6*6); for three dice the risk of the lowest sum is one out of two hundred and sixteen (216 = 6*6*6). The extremes becoming relatively rarer and rarer accounts for the standard deviation growing non-linearly. If you do the math, it turns out to follow the square root.
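The dice intuition can be checked exactly by enumerating every outcome – a small sketch showing that the probability of the lowest possible sum shrinks as 1/6^n while the standard deviation of the sum grows only as √n:

```python
# Enumerate all 6^n outcomes of throwing n dice and check:
#  - the probability of the lowest sum (all ones) is 1/6^n
#  - the standard deviation of the sum is √n times that of one die
import itertools
import math
import statistics

one_die_sd = statistics.pstdev(range(1, 7))   # ≈ 1.71

for n in (1, 2, 3):
    sums = [sum(roll) for roll in itertools.product(range(1, 7), repeat=n)]
    p_lowest = sums.count(n) / len(sums)      # probability that all dice show 1
    sd_of_sum = statistics.pstdev(sums)
    print(n, p_lowest, round(sd_of_sum, 2), round(math.sqrt(n) * one_die_sd, 2))
```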
So, the 95%-confidence interval will be given by
n*E[X] ± 2*√n*σ[X]
which we can calculate by putting in our observed average 33 and standard deviation 4.75.
Hold the horses! The 95%-confidence interval assumes that the random process follows a normal distribution, and there is little evidence that the sprint outcomes do. Do we not risk fooling ourselves?
Here the Central Limit Theorem comes to our rescue. It roughly states that if you add a sum of independent outcomes (X1+X2+...+Xn), then the sum will follow a normal distribution whatever the distribution of X. In practice this convergence is really fast. On the Wikipedia page there is a wonderful graphical example of how a really weird distribution becomes very close to a normal distribution already after adding three terms. So, for any practical purpose we can view the velocity over a few sprints as following a normal distribution.
Returning to our calculations we can now compute the 95% confidence interval for running five sprints.
n*E[X]±2*√n*σ[X] = 5*33 ± 2*√5*4.75 = 165 ± 21.2
or, phrased another way (desperately trying to avoid mentioning the expected average), as the interval 144-186, which might even be acceptable.
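The arithmetic above is a one-liner to verify – a sketch plugging the observed average and standard deviation into n*E[X] ± 2*√n*σ[X]:

```python
# 95% interval for five sprints with E[X] = 33 and σ[X] = 4.75.
import math

n, avg, sd = 5, 33, 4.75

center = n * avg                  # 165
half = 2 * math.sqrt(n) * sd      # ~21.2

print(round(center - half), round(center + half))   # 144 186
```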
Interestingly enough, this result (144-186) is very close to, and only slightly better than, the 140-185 that the Mike Cohn trick gave for a five-sprint prediction. Let us check why; the reason is simple.
The Mike Cohn trick was the number of sprints times an estimated standard deviation.
The statistical analysis was twice (to get 95%) the standard deviation of the sum, which is the standard deviation of an individual sprint times the square root of the number of sprints.
As √5 is approx 2.24, the latter gives
2*σ*2.24 = 4.48*σ
which is a slightly tighter interval than 5*σ.
Actually, for four sprints the methods coincide. Mike's method gives
4*σ
and the statistical analysis gives
2*σ*√4 = 2*σ*2 = 4*σ.
Graphically, the two methods look like this:

We now see clearly both why the extrapolation method works for at least four-to-five sprints, and why it does not work (fails often) for fewer sprints. Note also how the lines intersect at x=4.
Now we can easily calculate a prediction for "next quarter". Following Mike Cohn's advice on running a one-week catch-your-breath sprint after six two-week sprints, 6*2+1 makes up the 13 weeks of a quarter, which gives you six "production sprints" each quarter. With a 33-point average, a quarter's worth of work will be around 200 points, and the 95%-confidence interval will be 2*√6*σ = 4.9*σ, which is 20% tighter than the "linear" 6*σ. With our standard deviation it gives 175-221 instead of 170-226; enough to make a difference of a few small stories, but far from the size of a sprint.
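A quick sketch of the quarter calculation, comparing the √n interval with the linear one for six production sprints:

```python
# Six production sprints per quarter, E[X] = 33, σ[X] = 4.75:
# 95% interval via 2*√n*σ versus the "linear" n*σ interval.
import math

n, avg, sd = 6, 33, 4.75
center = n * avg                      # 198

half_sqrt = 2 * math.sqrt(n) * sd     # ~23.3  (4.9σ)
half_linear = n * sd                  # 28.5   (6σ)

print(round(center - half_sqrt), round(center + half_sqrt))      # 175 221
print(round(center - half_linear), round(center + half_linear))  # 170 226
```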
Of course, for larger numbers of sprints the difference will be higher – nine sprints give a 50% better interval (6*σ instead of 9*σ). However, as it is hard to keep circumstances constant for more than a quarter, the analysis does not become meaningful for longer horizons if you do two-week sprints.
Just as a thought experiment, we can run the analysis on the 13 one-week sprints of a project. Imagine that you have observed the following velocities during the last months:
Then standard formulas give
E[X] = 7.6
σ[X] = 2.6
and a next-quarter-planning of 13 one-week sprints give
n*E[X]±2*√n*σ[X] = 13*7.6 ± 2*√13*2.6 = 99 ± 19 = 80-118
which seems reasonable. Using "linear estimation" we get
n*E[X]±n*σ[X] = 13*7.6 ± 13*2.6 = 99 ± 34 = 65-133
which would basically be useless.
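The one-week-sprint comparison can be sketched the same way, plugging in the observed E[X] = 7.6 and σ[X] = 2.6:

```python
# Thirteen one-week sprints: the √n interval versus the linear one.
import math

n, avg, sd = 13, 7.6, 2.6
center = n * avg                      # ~99

half_sqrt = 2 * math.sqrt(n) * sd     # ~19
half_linear = n * sd                  # ~34

print(round(center - half_sqrt), round(center + half_sqrt))      # 80 118
print(round(center - half_linear), round(center + half_linear))  # 65 133
```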
If we move in the other direction and imagine that we only have two sprints to a release, then we know that Mike's trick does not work, and the statistical analysis gives an interval that is quite broad.
n*E[X]±2*√n*σ[X] = 2*33 ± 2*√2*4.75 = 66±13 = 53-79
Well, even if the interval varies by 50%, you still have a lower limit that is not bad. You can still safely plan for 50 story points, and send the boxes for printing. The risk you run is less than 5% – actually just half of that (2.5% risk of underestimating, 2.5% risk of overestimating) – and depending on circumstances and personality you might find that risk level acceptable.
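Verifying the two-sprint numbers with the same sketch (the arithmetic gives a lower bound of about 53, comfortably above the 50 points we planned for):

```python
# Two sprints: broad interval, but the lower bound still supports
# planning roughly 50 points with about 2.5% risk of overrunning.
import math

n, avg, sd = 2, 33, 4.75
center = n * avg                   # 66
half = 2 * math.sqrt(n) * sd       # ~13.4

low, high = center - half, center + half
print(round(low), round(high))     # 53 79
```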
If you want to reduce the risk further, or plan more than those 50 points, there are always non-statistical methods, like spending some effort breaking down the remaining stories into yet smaller substories. Or the team can do some look-ahead tentative designs to gain confidence in their estimates, perhaps revising them. From an inspect-adapt point of view this is of course guesswork, but it can be useful nevertheless. If you want to honour inspect-adapt, you had better do this a few sprints beforehand, so you can observe a lower variation in sprint outcomes and act upon that observation.
What you do when breaking down stories into smaller substories is actually reducing the standard deviation of X. This can be analysed using similar statistical methods, but that is a topic in its own right, and will have to wait.
In conclusion: Mike's trick works because his "low" and "high" velocities are roughly a quick way of calculating the standard deviation, and for sequences of at least four sprints the trick yields an interval that encompasses the 95% confidence interval. And it is much simpler than doing the statistical calculations in full.
So, now we can go back to our daily life and continue to use Mike Cohn's trick, but at least we know why it works and why it has the limitations it has.
I like to know why things work.
ps Of course it is constant work keeping the backlog in shape to be able to do this kind of analysis. I try to spend just enough time on the lower-priority items to keep them going, and focus on what will probably be up for development in the next few sprints. To do this I keep the backlog separated into the "usual part" and a short-list, for which I have some harder rules.