Big Data Volume Requirements
by Hu Yoshida on May 2, 2012
Referring back to my last post, I am continuing my series on big data where we are looking at the dimensions of big data: volume, velocity, variety and value, and what we need to do to address them. The first dimension has to do with the “big” in big data – volume.
“Big” is defined by most dictionaries as “considerable size, number, quantity, magnitude or extent”. Big is a relative term. For instance, big may be 2TB when considered for an “in memory database” like SAP HANA, or it could be exabytes for search engines like Google. Big is also a rapidly moving target. When Hitachi announced the USP storage virtualization platform in 2004, with the capability of managing 32PB of internally and externally attached storage, most people thought it was over the top. Today most enterprise customers have over a petabyte of storage, and they can install it in less space than they required 5 or 10 years ago. We can install 3PB in Hitachi Unified Storage (HUS) with 3TB disk drives in the width of two data center floor tiles, and VSP, which we announced in 2010, can manage 255PB of internal and external storage. 255PB is about a quarter of an exabyte!
An exabyte used to be considered a futuristic capacity, but search engine companies already have exabytes of storage. Some cloud companies are looking at file sharing or backup services for home data. Since many homes are storing a TB of data, it only takes a million subscribers to have an exabyte of data.
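The arithmetic behind these figures is easy to check. The short sketch below assumes decimal (SI) units, where 1 TB is 10^12 bytes and 1 EB is 10^18 bytes; the function name `subscribers_for` is only illustrative.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption;
# binary units such as TiB/EiB would give slightly different figures).
TB = 10**12
PB = 10**15
EB = 10**18


def subscribers_for(target_bytes, bytes_per_home):
    """Homes needed to reach target_bytes, rounded up to a whole home."""
    return -(-target_bytes // bytes_per_home)  # ceiling division on integers


# How many homes storing 1 TB each add up to one exabyte?
homes = subscribers_for(EB, 1 * TB)

# What fraction of an exabyte is the 255PB that VSP can manage?
vsp_fraction = 255 * PB / EB

print(f"{homes:,} homes at 1 TB each make an exabyte")
print(f"255 PB is {vsp_fraction:.1%} of an exabyte")
```

With these units the result is one million homes, and 255PB comes out at roughly a quarter of an exabyte.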
So the first requirement for big data volumes is scalable capacity, well beyond what is available today, since the demand for data storage is accelerating. Where terabytes used to be the norm and petabytes seemed beyond the horizon just a few short years ago, we are now in the world of petabytes with exabytes around the corner. Many companies have five-year planning cycles. In the past, they would plan for a doubling of capacity. Now they need to plan for an increase of an order of magnitude, and plan it so that they can grow nondisruptively. This requires storage virtualization.
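To see what the shift from doubling to an order of magnitude means for a five-year plan, it helps to convert each target into a compound annual growth rate. This is a minimal sketch that assumes smooth compound growth, which real demand rarely follows; the function names and the 1PB starting point are illustrative.

```python
def annual_growth_rate(multiple, years):
    """Compound annual growth rate needed to grow by `multiple` in `years`."""
    return multiple ** (1 / years) - 1


def projected_capacity(start_pb, multiple, years):
    """Year-by-year capacity in PB, from year 0 through the final year."""
    rate = annual_growth_rate(multiple, years)
    return [start_pb * (1 + rate) ** year for year in range(years + 1)]


for label, multiple in (("doubling", 2), ("order of magnitude", 10)):
    rate = annual_growth_rate(multiple, 5)
    plan = ", ".join(f"{pb:.1f}" for pb in projected_capacity(1.0, multiple, 5))
    print(f"{label}: {rate:.1%} per year; PB by year: {plan}")
```

Doubling over five years works out to about 15 percent growth per year, while a tenfold increase needs about 58 percent per year, every year.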
VSP and HUS can scale capacity in the petabytes, and with virtualization in VSP we can create a pool of storage capacity approaching a quarter of an exabyte. But what about file and content data? How will they be able to scale since more and more of the growth will come from unstructured data?
The difference with unstructured data is that it is accessed through internet protocols and stored in file or content platforms. Most file and content systems are limited to terabytes, whereas they need to scale to petabytes today and to exabytes tomorrow. Since this data is unstructured, it must be searched and accessed as files or objects. Traditional file systems from UNIX or Linux store information about a file, directory or other file system object in an inode. The inode is not the data itself but the metadata that describes the data in terms of ownership, access mode, file size, timestamps, file pointers and file type, for example. When a traditional file system is created, there is a finite upper limit on the total number of inodes, which then limits the maximum number of files, directories or other objects the file system can hold. HNAS and HCP use object based file systems, which enable them to scale to petabytes and billions of files or objects. HNAS and HCP are gateways that sit on top of VSP or HUS, so they can leverage the scalability of the block storage while enjoying the benefits of a common management platform, Hitachi Command Suite. HNAS and HCP are architected for big data for file and content.
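The inode metadata and the inode limit described above can be inspected on any UNIX or Linux host. The sketch below uses only the Python standard library (`os.stat` and `os.statvfs`, the latter available on UNIX-like systems only); the helper names `describe_inode` and `inode_budget` are illustrative. It says nothing about how HNAS or HCP work internally.

```python
import os
import stat
import tempfile


def describe_inode(path):
    """Return the inode metadata for path; none of this is the file's data."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,                        # inode number
        "owner_uid": st.st_uid,                    # ownership
        "mode": stat.filemode(st.st_mode),         # access mode and file type
        "size_bytes": st.st_size,                  # file size
        "modified": st.st_mtime,                   # one of the timestamps
        "is_directory": stat.S_ISDIR(st.st_mode),  # file type check
    }


def inode_budget(path):
    """Return total and free inode counts for the file system holding path."""
    vfs = os.statvfs(path)  # UNIX-like systems only
    return {"total_inodes": vfs.f_files, "free_inodes": vfs.f_ffree}


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as handle:
        handle.write(b"hello")
        handle.flush()
        print(describe_inode(handle.name))
        print(inode_budget(os.path.dirname(handle.name)))
```

On a traditional file system, once `free_inodes` reaches zero no new files can be created, even if free blocks remain. Some newer file systems allocate inodes dynamically and may report zero for both counts.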
In addition to scalability, big data volumes must be able to scale nondisruptively and migrate across technology generations. Movement of data must be kept to a minimum and done in the background. Big data should be copied only once for availability, and changes should be versioned instead of backing up the whole big data volume with every change. Big data is too big to be backed up.
Across the Hitachi family we can move and tier data in the background. We can add capacity to VSP or HUS block pools, HNAS file systems or HCP tenants, and automatically rebalance the data across the new capacity. Older file systems and block storage devices do not allow for dynamic expansion. In order to use new capacity, data in these older systems has to be unloaded from the old block or file system and reloaded onto the new capacity. This is totally impractical with the volumes associated with big data today.
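A toy model shows why rebalancing in place moves far less data than unloading and reloading. This is only an illustrative sketch, not the algorithm used in VSP, HUS, HNAS or HCP: it assumes a pool is a set of devices holding equal-sized extents, and the function `rebalance` and the device names are hypothetical.

```python
def rebalance(pool, new_devices):
    """Add empty devices to pool, then even out the extent counts.

    pool maps a device name to the list of extent ids it holds.
    Extents move one at a time from the fullest device to the emptiest
    until the counts differ by at most one. Returns the list of moves.
    """
    for device in new_devices:
        pool.setdefault(device, [])
    moves = []
    while True:
        fullest = max(pool, key=lambda d: len(pool[d]))
        emptiest = min(pool, key=lambda d: len(pool[d]))
        if len(pool[fullest]) - len(pool[emptiest]) <= 1:
            return moves
        extent = pool[fullest].pop()
        pool[emptiest].append(extent)
        moves.append((extent, fullest, emptiest))


# Two full devices holding 12 extents, plus one newly added empty device.
pool = {"dev_a": list(range(6)), "dev_b": list(range(6, 12))}
moves = rebalance(pool, ["dev_c"])
total = sum(len(extents) for extents in pool.values())
print(f"moved {len(moves)} of {total} extents")
```

In this example only 4 of the 12 extents move to the new device. An unload and reload would touch all 12.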
Big data volumes also have to be resilient. There can’t be any single point of failure, which would require the rebuilding of a big data volume. With block systems we have redundancies built throughout VSP and HUS. We also need to have the same resiliency for HNAS and HCP nodes. These nodes have to be stateless so they can be easily replaced if any node fails.
The volume of big data is not just about the amount of capacity but also includes technologies to eliminate the traditional methods of storing files, moving volumes, backup and replication. Big data also requires the integration of file, block and content under a common Hitachi Command Suite management platform.
Comments (6)
Great to see the industry finally adopting the “3V”s of big data over 11 years after Gartner first published them. For future reference, and a copy of the original article I wrote in 2001, see: http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiming-gartners-volume-velocity-variety-construct-for-big-data/. –Doug Laney, VP Research, Gartner, @doug_laney
Thanks for the reference. I also referenced your contribution in my previous post.
It is not often that we can identify the starting point of a major new idea after the vendor marketing teams get hold of it. But in the case of Big Data it was clearly the research paper that you wrote for Meta Group. Congratulations on your thought leadership. I look forward to continuing to hear your thoughts on Big Data.
Hu, do you have one document that combines your thoughts on Big Data? This is great information…
Hi, Robert. No, I do not have one document today, but after I finish my series on big data, I intend to consolidate it into a white paper. I have one or two more posts to do over the next few weeks. In the meantime you can review these past posts starting with Big Data Origins, then this one and followed by:
Very interesting article. I think I should request your permission for the Russian translation.