Bugs Model

It is well known that bugs, feature usage and the like in products follow the '80-20' rule: fixing the 20% most common problems provides 80% of the total value available from fixing everything. I'm now going to demonstrate that you can get good results without even identifying what is actually in that 20%. It leads to a shockingly simple rule of thumb that I'm surprised isn't commonly cited in the literature.

The Pareto distribution is the mathematical distribution that implies the 80-20 rule. Let's create 100 missing features ('bugs') for our product. Each one is assigned a value. We can treat this as 'the number of people who would like this', 'the value provided by implementing this feature' or 'the rate of production crashes that this bug will cause'.

In [3]:
import numpy as np
# Draw 100 issue values from a Pareto distribution with shape parameter 1
a = np.random.pareto(1, 100)

The total value available to us by fixing everything is thus:

In [4]:
value_total = a.sum()
print("The total value available is %.f" % value_total)
The total value available is 621

However, this costs us 100 units of work. We can get a much better return by finding the most valuable 20% and doing just those:

In [7]:
# Sort them into descending order and pick the first 20
a_sorted = np.sort(a)[::-1]
value_perfect = a_sorted[:20].sum()
print("Doing the most valuable 20 items is worth %.f, or %.f%% of the total" %
      (value_perfect , value_perfect/a.sum() * 100))
Doing the most valuable 20 items is worth 542, or 87% of the total

This is good. We did a fifth of the work and got nearly all the benefit. However, it requires finding out, via some other method, the probability of each bug occurring (or the number of customers who want each feature). This information is not directly available to us, so we have to approximate it somehow, for example by surveying a large number of customers or by tallying every crash in production.
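To illustrate the tallying approach, here is a minimal sketch (my own, not from the original cells) that treats each crash report as a random draw weighted by the issue's true value, then checks how well a simple tally recovers the real top 20. The report count of 1000 and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.pareto(1, 100)   # true (hidden) value of each issue
p = a / a.sum()          # true probability that a report names each issue

# Simulate tallying 1000 crash reports from production
reports = rng.choice(100, size=1000, p=p)
tally = np.bincount(reports, minlength=100)

# Rank issues by observed frequency and compare with the true top 20
estimated_top20 = set(np.argsort(tally)[-20:])
true_top20 = set(np.argsort(a)[-20:])
overlap = len(estimated_top20 & true_top20)
print("Overlap between estimated and true top 20: %d/20" % overlap)
```

Even a crude tally like this tends to recover most of the genuinely common issues, because the common ones dominate the report stream.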

However, if we do know loss information of the form 'a customer wanted this feature and didn't buy because of it' or 'it crashed', then we can consider a very simple approach:

  • Take the last sales loss / crash
  • Invest in a well-engineered, general fix to that problem

Compared to the big fancy spreadsheet above, this rule is so simple that it is hard to consider it 'management' at all. On the other hand, it is easy to execute, delivers results immediately, requires limited information and appears quite robust to data errors. The areas we work on will not always be the absolute most common, but they will be heavily weighted toward the most common ones because, well, they are the most common.

How much worse does this very simple approach perform compared to the 'perfect' approach above (which we already know to be impossible to achieve in practice)?

In [8]:
# Probability of picking an issue is proportional
# to the number of customers who hit it
p = a / a.sum()
# Sample 20 issues
value_sample = np.random.choice(a, 20, replace=False, p=p).sum()
print("Doing 20 items that got hit is worth %.f, or %.f%% of the total from doing them all" % 
      (value_sample , value_sample/a.sum() * 100))
loss = (value_perfect - value_sample) / a.sum()
print("The sampling approach loses %.f%% in performance vs one that used perfect knowledge" % (loss * 100))
Doing 20 items that got hit is worth 492, or 79% of the total from doing them all
The sampling approach loses 8% in performance vs one that used perfect knowledge
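A single draw like the one above is noisy, so as a sanity check (this cell is my addition, not part of the original) we can repeat the experiment many times and look at the average gap. The trial count and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
losses = []
for _ in range(1000):
    a = rng.pareto(1, 100)
    p = a / a.sum()
    # Perfect knowledge: take the 20 most valuable issues
    value_perfect = np.sort(a)[-20:].sum()
    # Sampling: take 20 issues in proportion to how often they are hit
    value_sample = rng.choice(a, 20, replace=False, p=p).sum()
    losses.append((value_perfect - value_sample) / a.sum())

print("Mean loss vs perfect knowledge: %.1f%%" % (100 * np.mean(losses)))
```

The average loss stays modest across runs, which suggests the single-draw result above is not a fluke.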

So all that extra effort, all those spreadsheets, data collection and opportunities to go wrong, buys us at the very most a 10% improvement in outcomes.

In practice, the data gathering and prioritization process is not complete, not zero cost and not perfect. That is going to very rapidly shrink that <10% advantage and make the more complex approach worse in practice than a simple rule that (as we have just shown) is 90% as good as you are ever going to get.

This model doesn't apply to all software development effort, for example where customer desires are unclear or where customers don't know what they want. However, an awful lot of software development does fit this model. Here's the process repeated for clarity:

Input:

  • A stream of examples where the product or service didn't work as desired. Examples: crash reports, missing features reported from sales calls.

Goal:

  • Have less of those bad things happen.

Process:

  • Grab the next example (crash report, etc.) from the firehose
  • Build a well-engineered fix for that and all nearby issues. The goal is not to fix the issue; rather, it is to use the issue as an archetype for the most important thing to work on at the moment.
  • Repeat
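The loop above can be sketched directly. This simulation (my own addition, with an arbitrary seed and 20 iterations) draws each incoming report in proportion to how often the underlying issue is hit, fixes that issue, removes it from the pool, and repeats:

```python
import numpy as np

rng = np.random.default_rng(7)
a = rng.pareto(1, 100)          # hidden value of each open issue
open_issues = np.arange(100)    # indices of issues not yet fixed
fixed_value = 0.0

for _ in range(20):
    weights = a[open_issues]
    # Grab the next report from the firehose: reports arrive in
    # proportion to how often each open issue is hit
    hit = rng.choice(open_issues, p=weights / weights.sum())
    # Build a fix, which removes the issue from the pool
    fixed_value += a[hit]
    open_issues = open_issues[open_issues != hit]

print("20 iterations captured %.f%% of the total value" %
      (100 * fixed_value / a.sum()))
```

Unlike the one-shot `np.random.choice` cell earlier, this version is sequential, which matches the rule as stated: each fix changes which reports can arrive next.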