Replicas outpace unstructured growth
by Hu Yoshida on Mar 3, 2009
If you have seen an IDC presentation on storage in the past few years you would have seen their chart on “Changing Enterprise Data Profile”. Here is a link to a public PDF copy that IDC presented at SNW in 2007. While many analysts have similar projections, this IDC chart is different in that it has a separate projection of disk capacity growth rate for replicated data. Other analysts include copies and replicas as part of the structured data capacity.
This breakout is helpful in pointing out that replicated data has different storage requirements and is growing faster than structured data. While IDC’s most recent updates include growth projections for content depot as well as unstructured data, the projected growth for replicated data remains at 43% CAGR and structured data growth remains at 32.3% CAGR. What are the differences between structured data and replicated data, why is replicated data growing faster, and what can we do to service this demand?
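To get a feel for what the gap between those two growth rates means over time, here is a minimal sketch of the compounding arithmetic. The two CAGR figures are the IDC numbers quoted above; the 100 TB starting capacity and the five-year horizon are assumptions chosen purely for illustration, not figures from the IDC report.

```python
def project(capacity, cagr, years):
    """Capacity after compounding at the annual rate `cagr` for `years` years."""
    return capacity * (1 + cagr) ** years


REPLICATED_CAGR = 0.43   # IDC projection quoted in the post
STRUCTURED_CAGR = 0.323  # IDC projection quoted in the post

# Hypothetical starting point: both categories begin at the same capacity.
start_tb = 100.0

for year in range(6):
    structured = project(start_tb, STRUCTURED_CAGR, year)
    replicated = project(start_tb, REPLICATED_CAGR, year)
    print(
        f"year {year}: structured {structured:7.1f} TB, "
        f"replicated {replicated:7.1f} TB, "
        f"ratio {replicated / structured:.2f}"
    )
```

Starting from equal capacity, five years of growth at these rates leaves the replicated capacity at a little under 1.5 times the structured capacity.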
For most data centers today, structured data and replicated data are still the major concern for the production environment. Structured data is the primary online production data, while replicated data is used for offline processing such as backup cycles, business continuity, data mining, development test, and extract/transform/load. In a 7×24 world, the production data cannot be interrupted or suffer any degradation in performance due to functions like backup or data mining, so these functions run against a clone copy or point in time copy to avoid impacting the production server. The fact that replicated data is growing faster than structured data indicates an increase in offline processing for information mining and information sharing. Where data mining might have been done once a day, today it might be done several times a day to increase business agility. If we include the movement toward disk to disk backup, this projection might be even higher.
If replication of data is increasing faster than the primary data, we need to look at ways to increase the efficiency and utilization of replicated data capacity. There are four basic considerations for doing this in a storage array.
- Dynamic tiers of storage: Replicated data usually does not need the level of performance and availability of the primary data. Costs can be lowered by replicating to a denser disk in a lower cost storage array through the use of storage virtualization. Storage virtualization can also be used to promote the replicated data to a higher performance tier of storage if the offline processing requires performance.
- Thin provisioning: Instead of replicating the entire volume with all its allocated unused capacity, thin provisioning can be used to replicate only the portion that contains data. This reduces the need for storage capacity and reduces the operational time to move the data bytes.
- Non disruptive point in time replication: In order for a storage array to take a consistent point in time copy, there must be a way to capture all of the data up to a given point in time. This is difficult due to buffering outside of the storage array. Some applications provide a checkpoint or an interface to “flush” the buffers to ensure time or transaction consistency for the replicated data. This requires a brief interruption in production processing. Hitachi can avoid this disruption with a feature that was thought up by Claus Mikkelsen called ATTIME. This enables the USP storage controller to set a future time in the storage system so that only data that is written up to that time is written to the replicated storage, thus ensuring a consistent point in time without interruption to the application.
- Shredding the replicated data: The life of replicated data capacity may be short, and if privacy is a concern, the storage array should have the capability to shred the data using overwrites prescribed by the Department of Defense before the capacity is made available again to the storage pool. Also, if the replica is moved to another tier of storage, the previous tier may need to be shredded.
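Three of the considerations above can be illustrated with a toy model. This is only a sketch under stated assumptions and not how the USP or ATTIME is actually implemented: the `ThinVolume` class, the `replicate_at_time` and `shred` functions, the timestamped write log, and the overwrite pass list are all hypothetical names and choices made for illustration. It copies only blocks that were actually written (thin provisioning), leaves out writes stamped after a chosen split time (point in time consistency), and overwrites the replica's blocks before releasing them (shredding).

```python
import os
from dataclasses import dataclass, field


@dataclass
class ThinVolume:
    """Toy volume with a nominal size that stores only blocks that were written."""
    nominal_blocks: int
    blocks: dict = field(default_factory=dict)  # block index -> bytearray

    def allocated(self):
        """Number of blocks that actually hold data."""
        return len(self.blocks)


def replicate_at_time(nominal_blocks, write_log, split_time):
    """Build a replica from writes stamped at or before split_time.

    write_log is a list of (timestamp, block_index, data) tuples.
    Later writes are left out, and never-written blocks are not copied.
    """
    replica = ThinVolume(nominal_blocks)
    for timestamp, index, data in sorted(write_log, key=lambda w: w[0]):
        if timestamp <= split_time:
            replica.blocks[index] = bytearray(data)
    return replica


def shred(volume, passes=(b"\x00", b"\xff", None)):
    """Overwrite every allocated block once per pass, then release the blocks.

    A pass of None means random bytes. The default pass list is an
    illustrative assumption, not a statement of any official procedure.
    """
    for pattern in passes:
        for block in volume.blocks.values():
            if pattern is None:
                block[:] = os.urandom(len(block))
            else:
                block[:] = pattern * len(block)
    volume.blocks.clear()


# Hypothetical write log: (timestamp, block index, data).
log = [
    (1, 0, b"AAAA"),
    (2, 7, b"BBBB"),
    (5, 0, b"CCCC"),  # stamped after the split time, so excluded
]

replica = replicate_at_time(nominal_blocks=1000, write_log=log, split_time=3)
print(replica.allocated(), "of", replica.nominal_blocks, "blocks copied")

shred(replica)
print(replica.allocated(), "blocks remain after shredding")
```

In this model the replica of a 1000 block volume holds only the two blocks that were written before the split time, which is the capacity saving thin provisioning is after, and the later write to block 0 does not leak into the point in time image.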
Data replication can increase business agility by providing real time replicas for offline processing. The characteristics and requirements of replicated storage capacity are different from production data, and require storage features like dynamic tiering, thin provisioning, non disruptive point in time copies, and shredding for privacy. For more information on replicas and copies see Claus’ blog.
For IDC’s latest report, see Richard Villars’ “Enterprise Disk Storage Consumption Model: Aug 2008 – Doc # 214066”. Please click here to order this report.




