Unstructured Data Demands New Storage Architectures
by Hu Yoshida on Jun 4, 2007
Industry analysts like IDC say that data storage is growing at about 60% per year. While that is challenging enough, we expect a new wave of data requirements that will dwarf today's capacity requirements. That growth will be driven by unstructured data.
In Tony Assaro’s recent blog post, “In Search of CAS,” he says: “The simple fact is that most companies and organizations create more unstructured data, including files, presentations, spreadsheets, images, graphics, etc., than any other data type. Additionally, it is becoming commonplace to create audio and video files, which consume massive amounts of capacity, even in mainstream companies. The majority of stored data is unstructured, and the ongoing creation, storage, access and use of this data will drive the CAS market going forward.”
In addition, I believe that a whole new wave of unstructured data will be generated by sensors: RFID tags, smart cards, surveillance cameras, and monitoring devices of all kinds. The new Boeing 787 airplane will generate terabytes of monitoring data on each flight. We will also see the repurposing of old sensor data, such as seismic data from 50 years ago, to determine whether we can squeeze more oil out of old oil deposits or whether we can pump CO2 back into the ground.
While structured data in databases and semi-structured data like email are not projected to grow as fast as unstructured data, they will also drive the need for a content archive system to store less active data. The 2 GB email mailbox is right around the corner, and mail servers will drown unless less active emails are archived off of production systems. When this structured and semi-structured data is stored in an active archive, it may also need to be stored as objects, so that a record or an email associated with a particular individual or event can be retrieved and/or erased for compliance or other business purposes.
Since this data is not organized, it will require more intelligent storage systems that can ingest different types of data objects, along with their metadata and associated policies, preserve the integrity of the data, and provide common search and retrieval across large amounts of data.
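To make that ingest model concrete, here is a minimal sketch of the idea in Python. The `ArchiveObject` and `ContentArchive` names, the SHA-256 integrity digest, and the in-memory store are all my own illustrative assumptions, not the actual Hitachi Content Archive Platform API; the point is only to show an object ingested together with its metadata and a retention policy, verified for integrity, and found again through metadata search.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class ArchiveObject:
    """Hypothetical archive entry: the data plus its metadata and policy."""
    object_id: str
    content: bytes
    metadata: dict = field(default_factory=dict)  # e.g. {"type": "email", "owner": "alice"}
    retention_days: int = 0                       # simple retention policy
    digest: str = ""                              # fixed at ingest to preserve integrity

class ContentArchive:
    """Toy in-memory archive supporting ingest, integrity check, and metadata search."""
    def __init__(self):
        self._store = {}

    def ingest(self, obj: ArchiveObject) -> None:
        # Fingerprint the content at ingest time so later tampering is detectable.
        obj.digest = hashlib.sha256(obj.content).hexdigest()
        self._store[obj.object_id] = obj

    def verify(self, object_id: str) -> bool:
        # Re-hash the stored content and compare against the ingest-time digest.
        obj = self._store[object_id]
        return hashlib.sha256(obj.content).hexdigest() == obj.digest

    def search(self, **criteria) -> list:
        # Common retrieval across all objects, keyed on metadata rather than location.
        return [o for o in self._store.values()
                if all(o.metadata.get(k) == v for k, v in criteria.items())]

archive = ContentArchive()
archive.ingest(ArchiveObject("msg-001", b"quarterly numbers",
                             metadata={"type": "email", "owner": "alice"},
                             retention_days=2555))
print(archive.verify("msg-001"))                              # True
print([o.object_id for o in archive.search(owner="alice")])   # ['msg-001']
```

The metadata-keyed search is what allows, say, every email associated with one individual to be retrieved or erased for compliance, regardless of which application created it.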
In order to meet this growing demand, we must change the way we store data. Storage architectures that were designed 20 years ago with static cache configurations cannot meet the scalability requirements. IT needs a storage controller that can scale to hundreds of petabytes, tens of thousands of host connections, and millions of IOPS. This cannot be done with clusters of appliances sitting in front of static storage systems. The new storage system must provide end-to-end storage virtualization in order to transition to this new storage trajectory: virtualization of port connectivity, of volumes and files, and of capacity within those volumes and files. We cannot rip and replace what we have today. Virtualization is needed to transition to the new storage architectures without disrupting our current operations or throwing out our capital investments.
We cannot have a number of different standalone storage systems, one for each content management system. We must be able to leverage all our content no matter which application generated it, and we must be able to manage it centrally. This is what we plan to achieve with our Hitachi Content Archive Platform in combination with our Universal Storage Platform V.
Rather than introducing yet another island of storage for content archives, with another set of software tools and management interfaces, we are delivering content archive services within a common management framework that can scale to 20 petabytes and support up to 32 billion objects, taking advantage of the heterogeneous storage virtualization services already available from Hitachi. Utilizing enterprise Hitachi storage functionality such as RAID in a storage area network (SAN) plus array of independent nodes (SAIN) architecture, the Hitachi Content Archive Platform is the first solution in the industry to let customers scale archive server nodes and storage capacity independently in order to meet the current and future demands of unstructured data.
Comments (2)
Hu, in my opinion the CIO challenge with unstructured data growth is that it presents an enormous liability for corporations. For the first time in history, a company’s storage behavior can land a CEO in jail and/or cost the company billions.
The ‘storage islands’ problem and lack of scalability pale in comparison to a CIO’s real challenge: data classification. Without auto-classification of relevant business metadata at the point of data creation or use, this exposure will not diminish. Here is further context on the issue if you like:
No one vendor will solve this problem but firms like Hitachi can clearly play a role in the ecosystem and would be well-served to participate in the discussion.
Several days ago I submitted a post (which I guess was rejected) about data classification as a barrier to managing unstructured storage growth. But I’ll try one more time.
In my discussions with CIOs, this is a key challenge, probably the one they cite as most significant in managing unstructured information: auto-classification at the point of data set creation or use, dynamically generating classification metadata. The problem is that without automation, managing unstructured data and reducing corporate liabilities becomes virtually impossible. And using only time-based classification criteria doesn’t give business context.
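As a rough illustration of what classification at the point of creation could look like, here is a minimal sketch. The rule table, function name, and labels are all hypothetical, not any vendor's product; the point is that business metadata is generated from the content itself when the data set is created, with the timestamp kept as just one attribute rather than the sole classification criterion.

```python
import time

# Hypothetical rule table: map simple name/content signals to business classes.
RULES = [
    (lambda name, text: "invoice" in text.lower(), "finance/invoice"),
    (lambda name, text: name.endswith((".ppt", ".pptx")), "presentation"),
    (lambda name, text: "patient" in text.lower(), "regulated/health"),
]

def classify_at_ingest(name: str, text: str) -> dict:
    """Generate classification metadata when the data set is created,
    rather than relying only on time-based criteria later."""
    labels = [label for rule, label in RULES if rule(name, text)]
    return {
        "filename": name,
        "classes": labels or ["unclassified"],
        "created": time.time(),  # time is recorded, but is not the only criterion
    }

meta = classify_at_ingest("q2.docx", "Invoice #4411 attached")
print(meta["classes"])  # ['finance/invoice']
```

A real system would use far richer classifiers than keyword rules, but even this shape shows why automation matters: the business context is captured once, at creation, instead of being reconstructed across billions of objects later.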
What, if anything, can Hitachi do here to address this issue? Or does Hitachi not see this as critical?