Wednesday 27 May 2009

Why Release Planning Works

Dear Junior
A very efficient way of planning releases is to extrapolate the team's work within the limits of a "high velocity" and a "low velocity". Mike Cohn (one of the front figures of Scrum) does a good job of showing this in a blog post entitled "Why There Should Not Be a 'Release Backlog'", where a team with velocity 33±4 can reliably be trusted to complete 165±20 story points within the upcoming five sprints.
People who have tried this technique report that it works remarkably well – in fact it seldom fails. However, you should not use it to predict what you can finish within the next sprint or two; for a release four-to-five or more sprints ahead, though, the method seems to work fine.
This catches my attention. If something works remarkably well, it would be nice to have an explanation; and that explanation should also explain why it does not work for the short term. What is so magical about four-to-five sprints? If something works fine for seven sprints, it should work equally fine for three, should it not?
Actually, applying some not-too-advanced statistics and probability theory can explain both why the method works so well for five-sprint planning and why it is risky for two-sprint planning. Better still, it can even tighten up the estimates for longer horizons. E g for the "next quarter", which roughly equals six two-week sprints, the estimate comes out 20% tighter.
Before getting started we have to note that there is an inherent difference between guessing velocity and measuring velocity, especially if you want to do estimates based on statistics.
If you guess that velocity is in the range 29-37, you do not gain any extra information by guessing twice. Making a careful analysis of team composition and their environment is of course better than guessing out of thin air, but from a statistical perspective it is equivalent to guessing – doing the same analysis twice gains no extra information. This is not said to talk down guessing or estimating a velocity; before the first sprint it is the best we can do. It is said to point out the limitations.
So, if you guess (or calculate) a velocity of 33±4, all you can say about the completion within five sprints is that it will be in the range 145-185 story points.
Graphically, your view of the future will look like this:
A key difference when you measure velocity is that a second measurement gives extra information, even if it yields the same result. And this is why we can do better through statistical analysis. Note that I will take the liberty of being a little bit sloppy with terminology; I just want to get the gut feeling across, not write a thesis.
To start with, we have to model the sprint velocity as a probabilistic event. Looking at Mike Cohn's example (used with permission) we get the data-points 36, 28, 36, 38, 24, 35, 32, 35.
Looking at the data it seems fair to model it as a normal distribution. Actually, we could be wrong; it could be some other distribution. However, we will later see that even if we guess the wrong distribution, it will not matter in the end.
Pulling the data through standard formulas gives us an average of 33 and a standard deviation of 4.75. Using mathematical notation, we denote the sprint velocity with 'X'. The long-term expected average is denoted E[X], and the standard deviation is σ[X].
So, in statistical terms we are observing the outcome of the process X (giving one number after each sprint) where
E[X] = 33
σ[X] = 4.75
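If you want to check these numbers yourself, a few lines of Python reproduce them from the data points above. This is just an illustrative sketch of mine, not anything from Mike Cohn's post; note that I use the sample standard deviation, which is what gives 4.75.

import statistics

# Mike Cohn's observed sprint velocities
velocities = [36, 28, 36, 38, 24, 35, 32, 35]

avg = statistics.mean(velocities)    # 33.0
sd = statistics.stdev(velocities)    # about 4.75 (sample standard deviation)

print(avg, round(sd, 2))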
Let us pause for a moment and compare this standard deviation with the numbers that Mike Cohn got when calculating "high velocity" and "low velocity". He took the mean of the "best three" and of the "worst three", giving a range of 28-to-37, which corresponds to ±4.5, pretty close to the 4.75 we got.
This is not a coincidence. Taking the averages of the three best and the three worst out of eight is, in effect, an estimate of the standard deviation – half the distance between them comes out close to one σ.
This actually explains why the Mike-Cohn trick does not work very well for predicting the outcome of the next sprint (note that nobody claimed it would).
A normal distribution only gives 68% of its outcomes within avg±1σ. So if we guess that the next sprint will have a velocity between 28.25 and 37.75, we will be wrong one third of the time, which is not reliable. If we want e g 95% confidence (being right 19 times out of 20) we have to increase the interval to two standard deviations, in our example 33±9.5, i e 23-43. However, an estimate where the higher bound is roughly double the lower bound is not impressive – you would probably not leave the management board meeting alive.
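If you do not want to take the 68% and 95% figures on faith, they follow directly from the normal distribution's cumulative function. Here is a small sketch of my own, using only the standard library:

from math import erf, sqrt

# Fraction of a normal distribution that falls within k standard
# deviations of the mean is erf(k / sqrt(2)).
for k in (1, 2):
    print(f"within ±{k} standard deviations: {erf(k / sqrt(2)):.1%}")

# within ±1 standard deviations: 68.3%  (wrong one time in three)
# within ±2 standard deviations: 95.4%  (wrong about one time in twenty)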
So how come the trick works for longer sequences of sprints? What we basically want to know is the average and standard deviation of
X1+X2+...+Xn
where each of X1...Xn is the outcome velocity of a sprint.
This is what the Central Limit Theorem talks about. To no one's surprise, the average is
E[X1+X2+...+Xn] = E[X1]+E[X2]+...+E[Xn] = n*E[X]
so the expected long-term average of n sprints is just n times the average of each sprint. In our example, for five sprints we get
E[X1+X2+...+Xn] = 5*33 = 165
and no-one is surprised.
However, presenting the expected outcome to stakeholders is dangerous, as 'expected' only means 'close to', not 'likely'. For example, throwing three six-sided dice has an expected value of 10.5, but you will never ever see an outcome that is 10.5; it might be 10, it might be 11, but never 10.5.
Adhering to the phrase "rather roughly right than precisely wrong", this is why I prefer to present intervals. Smaller intervals are more precise, but less probable to be correct; larger intervals are more probable to be correct, but less precise, and thus less useful. The trick is to find the balance.
I usually settle for an interval with 95% confidence, i e I take the risk of being wrong one time in twenty. For a process that follows a normal distribution (note: no-one said that sprint outcomes do follow that distribution) the 95% interval is avg±2σ. We know the average (n*E[X]), and thus we need to know the standard deviation of the sum X1+...+Xn. Here the Central Limit Theorem comes with a nice surprise. The standard deviation of a sum does not grow linearly with the number of terms. Instead
σ[X1+X2+...+Xn] = √n*σ[X]
The intuition behind this is that if there are many sprints, then good and bad sprints tend to even out, but it takes consistently good luck to get high sums, and consistently bad luck to get low sums.
Put another way, if you think of each sprint as throwing a die, then five sprints will be throwing five dice and counting the sum. The risk of five dice simultaneously showing low numbers is of course lower than the risk of one single die showing a low number. Roughly: for throwing one six-sided die the risk of the lowest value (1) is one out of six; for throwing two dice the risk of the lowest sum (1+1) is one out of thirty-six (6*6); for throwing three dice the risk of the lowest sum is one out of two hundred and sixteen (216 = 6*6*6). The extremes becoming relatively less and less common accounts for the standard deviation growing non-linearly. If you do the math, it turns out it follows the square root.
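You do not have to trust my dice arithmetic either; a quick simulation (my own sketch, with made-up names) shows that the spread of the sum grows roughly with the square root of the number of dice, not with the number itself:

import random
import statistics

def sum_spread(n_dice, trials=100_000):
    # Empirical standard deviation of the sum of n_dice six-sided dice.
    sums = [sum(random.randint(1, 6) for _ in range(n_dice))
            for _ in range(trials)]
    return statistics.pstdev(sums)

one_die = sum_spread(1)                       # about 1.7 for a single die
for n in (1, 2, 5):
    print(n, round(sum_spread(n), 2), round(one_die * n ** 0.5, 2))
# The measured spread tracks sqrt(n) times the single-die spread,
# not n times it.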
So, the 95%-confidence interval will be given by
n*E[X]±2*√n*σ[X]
which we can calculate by putting in our observed average 33 and standard deviation 4.75.
Hold your horses! The 95%-confidence interval is based on the assumption that the random process follows a normal distribution, and there is little evidence that the sprint outcomes do. Do we not risk fooling ourselves?
Here the Central Limit Theorem comes to our rescue. It roughly states that if you add up a sum of independent outcomes (X1+X2+...+Xn), then the sum will approach a normal distribution whatever the distribution of X. In practice this convergence is really fast. On the Wikipedia page there is a wonderful graphical example of how a really weird distribution becomes very nearly a normal distribution already after adding three terms. So, for any practical purpose we can view the velocity over a few sprints as following a normal distribution.
Returning to our calculations we can now compute the 95% confidence interval for running five sprints.
n*E[X]±2*√n*σ[X] = 5*33 ± 2*√5*4.75 = 165 ± 21.2
or, phrased another way (desperately trying to avoid mentioning the expected average), as the interval 144-186, which might even be acceptable.
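Wrapped up as a tiny function (again just a sketch of mine, the names are made up), the whole five-sprint calculation looks like this:

from math import sqrt

def release_interval(n_sprints, avg, stdev):
    # Roughly 95% confidence interval for the story points
    # completed over n_sprints sprints.
    mid = n_sprints * avg
    half = 2 * sqrt(n_sprints) * stdev
    return mid - half, mid + half

low, high = release_interval(5, 33, 4.75)
print(f"{low:.0f}-{high:.0f}")   # about 144-186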
Interestingly enough, this result (144-186) is very close to, and only slightly better than, the 140-185 that the Mike-Cohn trick gave for a five-sprint prediction. Let us check why; the reason is simple.
The Mike-Cohn trick was the number of sprints times an estimated standard deviation.
5*σ
The statistical analysis was twice (to get 95%) the standard deviation of the sum, which was the standard deviation of an individual sprint times the square root of the number of sprints.
2*σ*√5
As √5 is approx 2.24, the latter gives
2*σ*2.24 = 4.48*σ
which is a slightly tighter interval than 5*σ.
Actually, for four sprints, the methods coincide. Mike's method gives
4*σ
and statistical analysis gives
2*σ*√4 = 2*σ*2 = 4*σ.
Graphically, the two methods look like this:

We now see clearly both why the extrapolation method works for at least four-to-five sprints, and why it does not work (fails often) for fewer sprints. Note also how the lines intersect at x=4.
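A few lines comparing the two interval half-widths (again, a sketch of my own) show the same crossover numerically:

from math import sqrt

sigma = 4.75
for n in range(1, 10):
    linear = n * sigma                 # Mike Cohn's low/high extrapolation
    statistical = 2 * sqrt(n) * sigma  # 95% half-width from the statistics
    if linear < statistical:
        note = "linear too tight (fails often)"
    elif linear > statistical:
        note = "linear wider than needed"
    else:
        note = "the two coincide"
    print(n, round(linear, 1), round(statistical, 1), note)
# At n = 4 both give 4*sigma; below that the linear interval is too
# narrow, above it the statistical one is tighter.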
Now we can easily calculate a prediction for "next quarter". Following Mike Cohn's advice on running a one-week catch-your-breath sprint after six two-week sprints, 6*2+1 makes up the 13 weeks of a quarter, which gives you six "production sprints" in each quarter. With a 33-point average, a quarter's worth of work will be around 200 points, and the 95%-confidence interval will be 2*√6*σ = 4.9*σ, which is 20% tighter than the "linear" 6*σ. With our standard deviation it gives 175-221 instead of 170-226; enough to make the difference of a few small stories, but far from the size of a sprint.
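For completeness, here is the same arithmetic for the six production sprints of a quarter, sketched with the numbers from above:

from math import sqrt

avg, sd, n = 33, 4.75, 6   # six two-week "production sprints" per quarter

print(f"statistical: {n*avg - 2*sqrt(n)*sd:.0f}-{n*avg + 2*sqrt(n)*sd:.0f}")
print(f"linear:      {n*avg - n*sd:.0f}-{n*avg + n*sd:.0f}")
# roughly 175-221 versus 170-226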
Of course, for a larger number of sprints the difference will be higher – for nine sprints the linear interval is 50% wider than the statistical one. However, it is hard to keep circumstances constant for more than a quarter, so with two-week sprints the analysis does not reach that far in a meaningful way.
Just as a thought experiment, we can run the analysis on the 13 one-week sprints of a project. Imagine that you have observed the following velocities during the last months:
7,8,3,10,9,5,11,8
Then standard formulas give
E[X] = 7.6
σ[X] = 2.6
and a next-quarter planning of 13 one-week sprints gives
n*E[X]±2*√n*σ[X] = 13*7.6 ± 2*√13*2.6 = 99 ± 19 = 80-118
which seems reasonable. Using "linear estimation" we get
n*E[X]±n*σ[X] = 13*7.6 ± 13*2.6 = 99 ± 34 = 65-133
which would basically be useless.
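The same few lines of Python (a sketch of mine, nothing more) reproduce the whole thirteen-sprint calculation from the raw velocities:

import statistics
from math import sqrt

velocities = [7, 8, 3, 10, 9, 5, 11, 8]
avg = statistics.mean(velocities)   # about 7.6
sd = statistics.stdev(velocities)   # about 2.6

n = 13   # one-week sprints in a quarter
print(f"statistical: {n*avg - 2*sqrt(n)*sd:.0f}-{n*avg + 2*sqrt(n)*sd:.0f}")
print(f"linear:      {n*avg - n*sd:.0f}-{n*avg + n*sd:.0f}")
# roughly 80-118 versus 65-133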
If we move in the other direction and imagine that we have only two sprints left to a release, then we know that Mike's trick does not work, and the statistical analysis gives an interval that is quite broad.
n*E[X]±2*√n*σ[X] = 2*33 ± 2*√2*4.75 = 66±13 = 53-79
Well, even if the interval varies by some 50%, you still have a lower limit that is not bad. You can still safely plan for 50 story points and send the boxes for printing. The risk you run is less than 5%, actually just half of that (2.5% risk of underestimating, 2.5% risk of overestimating), and depending on circumstances and personality you might find that risk level acceptable.
If you want to reduce the risk further, or plan more than those 50 points, there are always non-statistical methods, like spending some effort breaking down the remaining stories into yet smaller substories. Or the team can do some look-ahead tentative designs to gain confidence in their estimates, perhaps revising them. From an inspect-adapt point of view this is of course guesswork, but it can be useful nevertheless. If you want to honour inspect-adapt, you had better do this a few sprints beforehand, so you can observe a lower variation in sprint outcomes and act upon that observation.
What you do when breaking down stories into smaller substories is actually reducing the standard deviation of X, and it can be analysed using similar statistical methods, but that is a topic in its own right, and it will have to wait.
In conclusion: Mike's trick works because his "low" and "high" velocities are in effect a quick method for estimating the standard deviation, and for sequences of at least four sprints the trick yields an interval that encompasses the 95% confidence interval. And it is much simpler than doing the statistical calculations in full.
So, now we can go back to our daily life and continue to use Mike Cohn's trick, but at least we know why it works and why it has the limitations it has.
I like to know why things work.
Yours
Dan
ps Of course it is constant work to keep the backlog in shape to be able to do this kind of analysis. I try to spend just enough time on the lower-priority items, just to keep them going, and focus on what will probably be up for development in the next few sprints. To do this I keep the backlog separated into the "usual part" and a short-list, for which I have some harder rules.

Monday 4 May 2009

Scrum Coaches Ought to Use Scrum

Dear Junior

Throughout the years I have been in contact with several initiatives to introduce Scrum, either to a single team or to an organisation as a whole, as well as playing the role of consultant/coach/mentor myself. Most of the coaches that have driven these initiatives have seemed thoroughly convinced of the advantages of iterative work, time-boxing, limited sprint commitments, demonstrable business value etc. Also, several of them have spoken about how Scrum is not a methodology limited to software development, but a general project framework. Yet, very few of these initiatives, which are projects of their own, have been planned or executed in any way that resembles Scrum, with backlog items, sprints, demos, definition of done etc. Rather, most of them have either been waterfall (“first interviews, then determining sprint length, then …”), or they have been the process equivalent of “happy hacking” (cruising around the organisation, going to the meetings that seem interesting at the moment, coaching someone here, throwing a workshop there, improvising from day to day).

If they really believe in Scrum, why don’t they use it themselves?

A project “introducing Scrum” ought to be run as a Scrum project. In this project the product owner (“Scrum-product-owner”) is the one “owning” the process problem, probably the CTO, CIO or Head of Development; they are the ones who want to see visible improvements. The role of the scrum master (“Scrum-scrum-master”) of this project most probably falls to the coach/mentor. Finally, the team (“Scrum-team”) consists of the key people who will do the work of the transformation: apart from the coach, there will most probably be the scrum master to-be, perhaps someone responsible for processes, perhaps a CM.

The backlog items might be

“As Head of Development

I want an estimated and planned backlog

so that the business side develops trust that IT works to generate business value”

DoD: Of six named stakeholders, at least four should grade the statement “IT delivers according to priorities and plan” with “agree” or better.

or

“As Team Member

I want peace-of-work for at least two weeks to finish what I have committed to

so that I will be able to finish stuff”

DoD: Team has performed one sprint where they delivered what they committed to at sprint planning

Each of these should give a concrete organisational impact with an obvious value, and that value should be possible to demonstrate.

The team (coach and staff) should estimate how hard it might be to make that organisational change, and for the next month (or whatever sprint length they choose) focus on that change and commit to making it happen. E g creating peace-of-work might be a full-time job of shielding the team from “executive requests”, and will then be estimated high. In that case the Scrum-team should perhaps commit to just that story. Or it might be easy, in which case it will have a low estimate, and the Scrum-team could take on something more in parallel. Of course, the Scrum-product-owner (CTO/HoDev) will judge what will give the most benefit for the organisation given the cost (estimates) and select the stories that give the most bang per buck. During the sprint they meet daily to sync their work, and the Scrum-scrum-master (i e the coach) facilitates the Scrum-team working in an efficient fashion. After the Scrum-sprint is over, they all meet, demonstrate what has been achieved, have a retrospective to gather and share experiences; and off they go again.

So, why are not more “introducing/improving Scrum” projects structured this way? For someone promoting Scrum, using it for his/her own work should be a no-brainer.

Or, do they not think Scrum is suitable?

Or, is it awkward to muster the self-discipline needed to actually follow a process?

Or, is it just too hard to come up with a backlog?

Or, don’t they want to give up control to the Scrum-product-owner/CTO?

Or, have they simply not thought about it?

I have no idea. At least, this is the way I prefer to run Scrum-improvement projects, and I am surprised it is not more common than it is.

I think eating your own dog-food has some merit.

Yours

Dan