Outcomes: The Forgotten Test Case


Outcome versus Output

My very simple definition of the difference between output and outcome is:

  • Output is delivering volume
  • Outcome is delivering value

Theoretically, they could be the same but in practice this is seldom the case.

I can create a lot of ‘stuff’ (i.e. output) but realise very little, or even negative, value for all the time and effort spent producing said ‘stuff’. This article provides two practical examples of what normally happens (we focus only on the output), why this is a terrible practice and the problems associated with ignoring outcomes.

The “Password Reset” Example

I was involved with a very large transactional banking system with a user base across 20 different African countries. Over 95% of the call centre issues were for password reset requests (approximately 3,500 calls a month). Therefore, it made sense to offer self-service password reset functionality. At the same time, we were looking to move from traditional to agile ways of work and “Password Reset” was chosen as the agile pilot feature.

I was managing the business and systems analysis team on this program and, wanting to get a pre-agile baseline, had tracked the cycle times of a sample of features. The fastest delivery cycle time I could find was for a very small feature that took one year to make it from analysis to production – and most features would take between two and three years from concept to production.

Password Reset was delivered in six weeks – a new record. This was fantastic! Success was declared – “This agile thing really works!” We had a party with pizza and beer and a couple of people from the project team were recognised at the annual corporate level rewards ceremony.

Fast forward 18 months. I was presenting at Agile Africa and wanted to use Password Reset as an example in my talk. Therefore, I got hold of the Product Manager and asked if I could get the current call centre stats. They are in the graph below.

Surprise, surprise – call centre volumes were still sitting at 3,500 per month, almost all of which were customers who were still phoning the call centre to get their passwords reset! We had delivered an output – a feature called “Password Reset”. We’d declared success, enjoyed the party and wallowed in the glory of our fine achievement before quickly moving onto the next pressing deliverable on the backlog. Unfortunately, we had delivered zero value.

I still used Password Reset in my Agile Africa talk – but in a completely different context than originally planned. Password Reset became my prime example of the importance of focussing on outcomes rather than output and the use of feature hypotheses.

Watch: The Power of Feature Hypotheses

No one (including myself) had questioned the success of Password Reset. Clearly, we should have considered that the success of the feature was not in building working functionality but rather in achieving a specific result (in this case reduced call centre volumes).

No one seemed concerned that the new feature was not being used by customers. As far as I know I was the first person to ask for the stats since the original motivation for Password Reset functionality was presented. On reflection, I realised that measuring the value of features post-implementation is surprisingly rare.
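To make the distinction concrete, a feature hypothesis can be written down as data with a measurable target, so that "success" is something you check against post-implementation stats rather than declare at release. Below is a minimal sketch in Python; the `FeatureHypothesis` class and the target of halving call volumes are my own invention for illustration, not anything the original team defined:

```python
from dataclasses import dataclass

@dataclass
class FeatureHypothesis:
    feature: str
    baseline: float  # metric value before release
    target: float    # metric value that would count as success
    metric: str

    def succeeded(self, observed: float) -> bool:
        # Success means the observed metric moved past the target,
        # not merely that the feature shipped.
        return observed <= self.target

# Figures from the article; the target is an invented example
h = FeatureHypothesis(
    feature="Password Reset",
    baseline=3500,  # password-reset calls per month before release
    target=1750,    # e.g. success = halving call centre volumes
    metric="call centre password-reset calls / month",
)
print(h.succeeded(3500))  # call volumes unchanged, so the hypothesis failed
```

Had the team framed Password Reset this way, "18 months later the metric is still at baseline" would have been an unmissable signal rather than a chance discovery.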

I bumped into the Product Manager responsible for Password Reset a few months later. We had a corridor chat where he excitedly told me that “Password Reset is finally being used!” with the caveat that “Although as soon as our customers started using it, we realised that all the mobile phone numbers in the system were incorrect so we couldn’t use the SMS texting option”.

The “One Time Pin” Example

Here’s another easily accessible example. I attended a presentation highlighting some changes that a special, dedicated usability improvement team had made to an online banking application. A similar “before and after” example to the one below was included in one of the slides.

The basic idea was to reduce the number of incorrect and timed out One Time Pins (OTPs) by moving the OTP to the front of the message, making it more prominent, and removing the highlighting and distraction of other numbers in the message.

I thought that this would be another good example to use for my Agile Africa presentation and tried to hunt down some data. I am pretty good at hassling people but still came up short. Despite my best efforts, I couldn’t get any before or after stats on incorrect and timed out OTPs.

All logic would suggest that these sensible changes to the formatting of the OTP message would result in a higher success rate of transactions. Way back in the 18th century, Voltaire said “Common sense is not so common.” Likewise, as I explained in the article Kohavi’s Law & Harry Potter explain why your experience and intuition suck, logic derived from expert opinion and intuition turns out to be right only about one third of the time when it comes to the expected benefits of new features.

In the aforementioned article, I propose Kohavi’s Law, which predicts that if we measured the results after the new OTP message was implemented, there would be a 66.6% chance of either no detectable benefit or a negative result (i.e. the error rate would actually increase).

How could this be?

Perhaps we’ve unwittingly made the message more confusing for users or perhaps there is no change because we’re solving the wrong problem. I recently had a situation (using another banking platform) where the OTP messages only arrived on my phone over 12 hours later, presumably because of an issue on my cellular provider’s network.

Perhaps 99% of our OTP errors are caused when there are problems on the user’s cellular network and OTPs are not received by the end user in time. In this scenario, to reduce the number of failed OTPs we’d need to provide an alternative method of sending OTPs to the user. Of course, the cost of an additional method may not be worth the reduction in error rate but (assuming we have the data on hand) that can be evaluated, and an objective business decision made.
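That evaluation is just arithmetic once the data exists. The sketch below shows the shape of it; every number here is invented purely for illustration:

```python
# Back-of-the-envelope check (all numbers invented for illustration):
# would an additional OTP delivery channel pay for itself?
failed_otps_per_month = 20_000
expected_reduction = 0.99        # assume ~99% of failures are network-related
value_per_recovered_txn = 1.00   # value of each recovered transaction, in currency units
channel_cost_per_month = 25_000  # running cost of the extra channel

monthly_benefit = failed_otps_per_month * expected_reduction * value_per_recovered_txn
worth_it = monthly_benefit > channel_cost_per_month
print(worth_it)  # at these invented numbers the extra channel does not pay for itself
```

The point is not the specific figures but that, with real data, the decision becomes an objective comparison rather than a matter of opinion.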

Another possibility is that the majority of incorrect OTPs are due to fraudulent transaction attempts. In this case, there is definitely a problem we want to resolve but the root cause is not “users entering the OTP incorrectly”.

So far as this feature went, the software team delivered the change without understanding the why – i.e. the real problem they were trying to solve, or the true business need being addressed. The new OTP message was implemented but no one actually knows whether the time spent developing the new functionality had a positive, negative or zero impact.

Did we actually reduce the number of OTP errors? Everyone involved believed that this was a successful change but no one actually knows.

Aside: Challengers versus Champions

I am not covering A/B/n testing explicitly in this article, but if you want an excellent one-hour introduction to the topic, check out the YouTube recording of the keynote Ron Kohavi gave at the 2015 Conference on Knowledge Discovery and Data Mining.

The ideal situation for an organisation would be to have the means to run as many quick, meaningful experiments as possible and get statistically relevant results.

Consider the four options below (the existing ‘Champion’ and three ‘Challenger’ OTP message formats). Which one will reduce OTP errors the most?

If this were presented in a meeting, it would likely stir plenty of robust discussion amongst the attendees. There’ll be strong opinions, debate and justifications as to ‘why I’m right’ – all very time-consuming and frustrating.

The correct answer is no one knows.

Running experiments on each of the options is the only way to be certain that you pick the correct option (which may even be that none of the challengers have any detectable impact on error rates). 

(Don’t) Show Me The Money

Sadly, the Password Reset and One Time Pin examples are the norm in most organisations. Most teams are in the habit of delivering outputs rather than outcomes. Looking back over a 20-year career in software development, it’s incredibly rare that anyone (including myself) checked whether the stated benefits were achieved after going live.

Evaluating the outcomes happens so seldom that one particular attempt stands out in my memory. A special “small changes” team was delivering minor enhancements on a legacy banking system. Business stakeholders from different countries had to submit “positional papers” (which were mini-business cases) to a central business team for approval and prioritisation.

Every positional paper required a “show me the money” Rand value to have any chance of being approved. The stated benefits of delivered positional papers over a two-year period were over R236 million ($13 million). The “head office” business stakeholders tried to get the regional business stakeholders who’d submitted the positional papers to quantify the actual benefits. The proven benefits were zero (or “unknown” is perhaps a fairer explanation). Not one of the regional business stakeholders could or would quantify the actual benefits in hard cash*.

* In fact, it is very difficult (and seldom preferable) to try to define feature value in purely financial terms because it is usually impossible to prove afterwards (a topic I’ll explore in the next article in this series).

The Forgotten Test

There are a myriad of other good reasons why outcomes are neglected including long cycle times, massive functional batch sizes, “fairy tale” business cases, the ‘no time / delivery pressure’ argument and decision makers not wanting to have their expertise questioned (or their decisions put to the test).

No project or feature gets approved without a strong justification for an expected result. Huge amounts are invested in comprehensive testing coverage for software systems. However, all these tests are completely worthless if we release software that has zero or negative value.

Earning our Medals

As a compulsive marathon runner, I see an obvious parallel: in software development, we reward ourselves for getting to the start line. No marathon runner would dream of claiming credit for a race medal before crossing the finish line. Most marathon runners consider it bad luck (and terrible etiquette) to wear a race shirt before the race (and would never wear the shirt of a race they did not complete). It’s time that we earned our medals and t-shirts in software development.

Are you rewarding yourself for getting to the start or crossing the finish line? (Photo courtesy Comrades Marathon Organisation)

The true test of our software changes is not that they are defect-free and work according to spec but rather that they achieve the expected benefit. Up until 8 November 2014, Microsoft followed a ‘Ship it’ mentality and would have shipping celebrations and awards.

I doubt that there is a computer user in the world who has avoided being negatively impacted by poorly shipped Microsoft products and features. One of the big contributors to Microsoft’s move from an output- to an outcomes-based approach is Ron Kohavi and the “Experimentation Platform” his team set up. Kohavi elaborates, “Shipping is not the goal, shipping something useful to the customer is the goal.” The only way you know whether you’ve moved the dial is to evaluate the outcome of every single change (otherwise you run the risk that ‘ship happens’).

The ability to ‘ship it’ involves plenty of hard work and dedication but it’s like getting to the start line of a marathon. Once the production starting gun fires, the only thing that matters is how well your features perform in a competitive environment. It’s time that we stopped awarding ourselves participation trophies and started checking the race results.

This is the second in a series of articles on “Outcomes over Outputs”.

Read Part 1: Kohavi’s Law & Harry Potter explain why your experience and intuition suck
