Big Data – Optimal Storage Infrastructure
by David Merrill on Feb 23, 2012
There is plenty of talk in the press today about big data, analytics and the next new wave for IT. I would like to present 2-3 blogs on a small but important subset of the big data world: storage infrastructure (and, more importantly, optimal storage architectures). I will use our storage economics approach for the definition of “optimal”, though you can also address optimized storage from other dimensions (resiliency, scale, performance, etc.) as you develop big data strategies.
There was a period, 2-3 years ago, when I was measuring and costing large Hadoop and Azure environments. I became very excited about these new distributed architectures, but at the time I was not able to dedicate the effort and resources for further research into their cost behaviors. Now that these systems have a new moniker (big data), the demand for big-data-economics conversations is here again. Good thing I have the models and methodology all sorted out.
My observation with these large (5-8PB) Hadoop and Azure systems (S3 would be no different, in my opinion) was that local JBOD or rack-mount DAS disks were common for the deployments. You can imagine the rack space, power, cooling and floor space needed for these large systems. The Hadoop file system was not very efficient (by design) in terms of written-to disk utilization, about 8-12%, so these large distributed file systems needed 6-10x the written capacity in raw capacity (and that is without any RAID overhead). I would encourage big data infrastructure architects to apply best practices and measurement systems to ensure that optimal designs are brought to big data projects. Even with the large revenue impact of data scientists and analytics, they are not “above the law” when it comes to being good stewards of limited IT budget and capital.
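To make that raw-versus-written relationship concrete, here is a minimal back-of-the-envelope sketch (my own illustration, not from the original analysis); the written-data amount and the utilization figures are only examples, taken from the 8-12% range noted above:

# Rough capacity-sizing sketch: estimate the raw disk capacity a distributed
# file system needs when only a small fraction of each disk ends up holding
# application-written data. Figures below are illustrative assumptions.

def raw_capacity_needed(written_tb: float, utilization: float) -> float:
    """Raw TB required when only `utilization` of the raw capacity is written to."""
    return written_tb / utilization

if __name__ == "__main__":
    written_tb = 500.0  # hypothetical amount of data actually written, in TB
    for utilization in (0.08, 0.10, 0.12):
        raw_tb = raw_capacity_needed(written_tb, utilization)
        print(f"{utilization:.0%} utilization -> {raw_tb:,.0f} TB raw "
              f"({raw_tb / written_tb:.1f}x the written capacity)")

Run against these assumed numbers, the multiplier lands in the high single digits to low double digits, which is why rack space, power and cooling balloon so quickly in these deployments.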
In my next few blogs, I will present distinct total cost case studies from these big data environments. One was a large online retailer, another a video surveillance system, and the third a large gaming and social (cloud) infrastructure. In this early big data economics work, we were able to model and demonstrate total unit costs with different (non-traditional) metrics and lay down a plan for new storage infrastructures that reduced the cost of these big data environments. You might be surprised at the non-cost measurements and metrics that fundamentally changed management’s mind about the design (hint: it had to do with carbon emissions).
These specialized analytic architectures can provide a new opportunity for company revenue, but that opportunity should not overshadow practical cost-versus-benefit analysis and IT optimization. Look for more on this topic in upcoming entries.
Comments
Looking forward to your series of posts on this topic
Would you be interested in chatting about how a specialized database can dramatically alter the economics of a Hadoop cluster?
You can tweet me @ramonchen. I’d love to get your thoughts.