We talk a lot about big data on this blog, but we usually address it at an abstract level. I think that’s important because it’s useful to communicate the conceptual underpinnings of what can drive a sea change in the way businesses think about data. But it’s also helpful to look at specific examples of how big data is helping companies do better. In this post I’d like to take a look at a nice example, from the data backup service Backblaze, of the difference big data can make.
Backblaze stores around 100 Petabytes of data (that’s 100,000,000 Gigabytes) on 40,000 hard drives. When you have that many hard drives, you can guarantee that a few are going to fail every day. You could wait until they fail and then replace them, but it’s better to spot when a hard drive is likely to fail and replace it first. Predicting hard drive failure is a difficult task. A few months ago Backblaze released aggregate figures of hard drive failure rates, which showed that failures follow a predictable pattern: most drives either fail early or fail after a couple of years.
But those aggregate figures aren’t especially useful for determining when any particular hard drive will fail. For that you need different data. All modern hard drives contain a monitoring system called S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) that returns over 70 different statistics about the state of a hard drive.
All that data sounds great, but in fact most of it is useless for determining whether a hard drive is likely to fail. The difficult thing is to know which of the 70 statistics correlate with hard drive failure, and for that you need a mass of data across many different hard drives. To be statistically significant, any correlation between reported S.M.A.R.T. statistics and hard drive failures has to be observed across hundreds or thousands of hard drives over a significant period of time. And that’s just what Backblaze is in a position to do.
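To make the idea of screening statistics against failures a little more concrete, here is a minimal Python sketch. It is not Backblaze’s actual method: it simply computes a Pearson correlation between each reported statistic and a failed/not-failed flag, then ranks the statistics by the strength of that correlation. The function names, the record layout, the attribute names `attr_a` and `attr_b`, and the four toy drive records are all made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a statistic that never varies cannot predict anything
    return cov / sqrt(var_x * var_y)


def rank_attributes(drives):
    """Rank statistics by absolute correlation with the failure flag.

    `drives` is a hypothetical record layout: a list of dicts, each with a
    boolean 'failed' and a 'smart' dict mapping statistic name -> value.
    """
    failed = [1.0 if d["failed"] else 0.0 for d in drives]
    names = sorted({name for d in drives for name in d["smart"]})
    scores = {}
    for name in names:
        values = [float(d["smart"].get(name, 0)) for d in drives]
        scores[name] = pearson(values, failed)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy, invented data: attr_a tracks failure exactly, attr_b never changes.
toy_drives = [
    {"failed": True, "smart": {"attr_a": 1, "attr_b": 30}},
    {"failed": False, "smart": {"attr_a": 0, "attr_b": 30}},
    {"failed": True, "smart": {"attr_a": 1, "attr_b": 30}},
    {"failed": False, "smart": {"attr_a": 0, "attr_b": 30}},
]

ranking = rank_attributes(toy_drives)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

With only four drives a ranking like this means nothing, which is exactly the point made above: the numbers only become trustworthy when they are computed over hundreds or thousands of drives observed for a long time.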
You can look at the Backblaze blog article for full details, but, in short, most of the S.M.A.R.T. statistics turned out to be useless for predicting hard drive failure, including some that one would intuitively have expected to be useful and some that showed superficial correlations with failure. By monitoring the S.M.A.R.T. statistics for thousands of drives, the company was able to whittle the relevant statistics down to just five that usefully predict failure.
The Backblaze post is a neat example of how a company is leveraging the data available to it to improve the efficiency of its business operations. Now, not all companies have or need tens of thousands of hard drives, but most have equivalent data flows relevant to their operations that, if leveraged properly, are capable of improving the way their business works.