I have a fascination with Statistics. To be honest, it tends to be a fascination with its misuse, but it is a fascination none the less. I was reminded of this over the weekend twice – once on Sunday morning, before coffee, when I retweeted a statistic:
It may, or may not, be true – I have no idea – but because it sounded good, I retweeted it anyway (it doesn’t actually harm my business case either). A few minutes later, halfway down the first coffee of the day, it occurred to me that this wasn’t quite right, and I tweeted the following in penance:
“Nothing like retweeting an unsubstantiated statistic first thing on a Sunday morning. 95% of people agree ;-)”
The second thing was courtesy of my son, who forwarded me the following – genuine and true – statistic:
The statistics of advertising fascinate me too – the variable, and selective, sample sizes that return just the right percentage of “dogs that prefer” muttfood™. (That involves finding the 8 dogs who have no sense of smell and 2 who do – to give a believable 80% …)
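To see how easy such a figure is to manufacture, here’s a toy sketch (the numbers are mine, purely for illustration – no real taste test involved). Even if dogs have no preference at all, a panel of 10 will hit “8 out of 10” by pure chance fairly often – and if you quietly re-run the panel until it works, the headline becomes near-inevitable:

```python
from math import comb

# Assumption (mine): dogs actually have no preference, so each dog
# "prefers muttfood" with probability 0.5, and the panel size is 10.
p_single = sum(comb(10, k) for k in range(8, 11)) / 2**10
print(f"P(>= 8 of 10 by chance in one panel): {p_single:.3f}")

# If the advertiser quietly repeats the taste test until it succeeds,
# the chance of getting the headline at least once climbs quickly.
for panels in (1, 5, 10, 20):
    p_headline = 1 - (1 - p_single) ** panels
    print(f"after {panels:2d} panels of 10 dogs: {p_headline:.2f}")
```

A single honest panel gives the headline only about 5% of the time; twenty discarded panels push it past two-thirds. Nothing in the advert tells you how many panels were thrown away.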
The point is that, as someone said (generally attributed to Disraeli – but apparently not his):
“There are three kinds of lies: lies, damned lies, and statistics.”
When selling things, statistics are warped and presented in such a way as to scare us, to emphasise our need for the product, or even to show us that our peers are using it – so why aren’t we? Obviously there is a huge overlap between psychology and statistics here, but none the less, the point stands.
When we know about statistics, though, we can turn them to our advantage. Not only are we in a position to treat what we are told with more care, but we can start to ask questions that might actually shed some light on the reality. Let’s go back to our first example:
“93% of companies that lost their data center for 10 days or more due to a disaster, filed for bankruptcy within one year.”
What can we ask about this data? Well, let’s start with asking where it came from. Who has admitted that their data centre was down for 10 days? What happens to those whose data centre was only down for 9 days? 5 days? 2 days? Is there a direct linear correlation between data centre downtime and probability of bankruptcy? Were all the companies in good financial shape beforehand? Were they skimping on data centre maintenance because of poor cash flow? Was their main stock warehouse in the same building as the data centre when it burnt down?
A key thing to remember is that correlation isn’t causation. This is important, and it is why the placebo effect is an issue in medical trials. If A rises and B rises, is A the cause of B – or is there an unseen, or more to the point unmeasured, C that is causing the rise of B?
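A toy simulation makes this concrete. In the sketch below (all numbers invented for illustration), A and B never influence each other at all – both are driven by an unmeasured C – yet measuring only A and B shows a strong correlation between them:

```python
import random

# Hedged sketch: an unmeasured factor C drives both A and B.
# Neither A nor B causes the other, yet they correlate strongly.
random.seed(42)
C = [random.gauss(0, 1) for _ in range(10_000)]
A = [c + random.gauss(0, 0.5) for c in C]  # A depends only on C (plus noise)
B = [c + random.gauss(0, 0.5) for c in C]  # B depends only on C (plus noise)

def pearson(x, y):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

print(f"correlation(A, B) = {pearson(A, B):.2f}")  # strong, despite no A->B link
```

If you only ever measured A and B, the correlation of roughly 0.8 would look like compelling evidence of a link – which is exactly why the unmeasured C has to be hunted down before any causal claim is made.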
However, we can bring forward even more questions. OK, so (more or less) 1 in 10 companies that have a 10 day outage survive – what are they doing right that we can emulate? Did they have a better business continuity plan? (Almost certainly – but I don’t have the data to back that statement up.) Is there a commonality amongst the companies that survived? (Are they all in the same industry – all consultancies, for example? Does this mean that my business is at less risk?)
I hope you see my point: too little information about the data is, whilst not a bad thing per se, not exactly conducive to sensible decision making.
So where does that leave us within our own organisations? Well, it leaves us with a necessity to collect the right data. That’s easier said than done, to be honest, because we’re back up against the correlation/causation barrier again – we need to be sure that we are gathering data that actually relates to what we are seeking to study. Ensuring that A is related to B involves verifying that C has nothing to do with it – nothing acts in isolation, so excluding C can save a wild goose chase and the money wasted pursuing the wrong track.
Much as it may seem unscientific, I really do recommend getting together a few people and brainstorming possible data sources and other connections between the possible influencing factors. Everyone has a perspective, and often it is the perspectives of others that add the most value!
Don’t forget the human factor in this – it could be that there are fewer viruses during the summer not only because of your new AV product, but because the staff are away, surfing the net less and bringing less into the network. In fact, your trial data could be useless because the product is actually worse – it simply had less to find, and thus looks more effective … Good statistics enable more effective decision making, and effective decision making saves money.
This is where historical data has value – don’t discard old reports and metrics; use them to show year-on-year growth and annual, monthly, weekly, daily and hourly trends. You’ll be able to make more sense of any new data in light of this information. You can also spot anomalies in the data, and, if you get to the stage of doing this in real time, you can find problems and security incidents as they happen – and that is the holy grail of information systems and security management.
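As a sketch of what that anomaly spotting might look like in its very simplest form – the figures below are invented, and a production system would need far more nuance – you can flag any day whose metric sits several standard deviations outside its own recent history:

```python
from statistics import mean, stdev

# Invented daily metric (e.g. login counts); day 9 hides an incident.
daily_logins = [100, 103, 98, 101, 99, 102, 97, 100, 104, 250, 101, 99]
WINDOW, THRESHOLD = 7, 3.0  # compare each day to the previous 7; flag at 3 sigma

flagged = []
for i in range(WINDOW, len(daily_logins)):
    history = daily_logins[i - WINDOW:i]          # the trailing baseline
    mu, sigma = mean(history), stdev(history)
    z = (daily_logins[i] - mu) / sigma            # how unusual is today?
    if abs(z) > THRESHOLD:
        flagged.append(i)
        print(f"day {i}: {daily_logins[i]} logins, z = {z:.1f} -> investigate")
```

Only the spike on day 9 gets flagged; the quiet days around it pass unremarked. Real traffic is noisier and seasonal – which is precisely why the years of historical trends mentioned above matter: they tell you what “normal” looks like for a Tuesday in August.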
In part two, next week, we’ll start to decompose some basic things to collect, potential sources and analysis of the data.
Why don’t you subscribe, either to my Twitter feed (@si_biles) or to the blog, and you’ll be notified of that post and other things of interest as time goes on?