Pre, Post, and Inline Deduplication Ratios

by Hu Yoshida on Mar 2, 2008

Last month, I spent some time with David Russell, a Gartner Research Vice President for Storage Technologies and Strategy. One of the storage technologies that David is very bullish on is de-duplication. David feels that de-duplication answers a pressing need for managing backup and recovery and can deliver immediate cost benefits by reducing the amount of storage we need for backup. He is predicting that this technology will be accepted faster than other technologies because its benefit, the reduction of duplicate bytes of data by orders of magnitude, is much easier to recognize.

The basic principle of de-duplication involves comparing bit streams of data and, where the bit streams are identical, discarding the duplicate and leaving a reference to the original in its place. This ensures that only unique bit streams of data are stored. De-duplication works best with backup data, which contains a lot of repetition between backup cycles.

There are basically three implementations of de-duplication. In pre de-duplication, the comparison and de-duplication are done in the backup server before the data is written to storage. In post de-duplication, the backup data stream is chopped up into chunks, each chunk is processed with a hashing algorithm, and the hashes are compared to find duplicate chunks. The last is an inline implementation, where a large data space, such as a petabyte, has been pre-indexed and reduced to a 4GB, memory-resident index. This essentially enables the de-duplication to occur on the fly without the need to chunk the data.
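To make the chunk-and-hash approach concrete, here is a minimal Python sketch of the idea, not any particular vendor's implementation. It assumes fixed 8KB chunks and SHA-256 fingerprints; real products typically use variable-size chunking and purpose-built indexes.

    import hashlib, os

    CHUNK_SIZE = 8 * 1024   # assumed fixed-size chunks; real products often use variable-size chunking

    def dedupe(stream, store):
        """Split a backup stream into chunks, keep only unique chunks in `store`,
        and return the list of chunk hashes (the "recipe") that describes the stream."""
        recipe = []
        for offset in range(0, len(stream), CHUNK_SIZE):
            chunk = stream[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in store:        # hash comparison finds duplicate chunks
                store[digest] = chunk      # only unique chunks consume capacity
            recipe.append(digest)          # a reference stands in for the duplicate data
        return recipe

    store = {}
    day1 = os.urandom(64 * CHUNK_SIZE)                           # first full backup
    day2 = day1[:62 * CHUNK_SIZE] + os.urandom(2 * CHUNK_SIZE)   # second full backup, 2 chunks changed
    dedupe(day1, store)
    dedupe(day2, store)
    print(f"backed up {2 * 64 * CHUNK_SIZE} bytes, stored {len(store) * CHUNK_SIZE} bytes of unique chunks")

The second backup adds only the two changed chunks to the store; everything else is represented by references to chunks that are already there.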

The post and inline de-duplication methods have similar de-duplication ratios, on the order of 25:1, when the backup object is small enough for the hash index to be resident in memory. When the object gets larger and the number of chunks increases, the hash index spills out to disk and performance is degraded. The inline approach has the advantage of becoming more efficient as it sees more data. However, the pre de-duplication vendors have been claiming de-duplication ratios that are an order of magnitude higher.
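A rough back-of-the-envelope calculation shows why the index outgrows memory. Assuming, purely for illustration, an 8KB average chunk size and roughly 64 bytes per index entry (neither number comes from a specific product):

    def index_size_bytes(backup_object_bytes, avg_chunk=8 * 1024, entry_bytes=64):
        """Estimate the in-memory hash index needed to track every chunk."""
        return (backup_object_bytes // avg_chunk) * entry_bytes

    for tb in (1, 10, 100):
        size = index_size_bytes(tb * 2**40)
        print(f"{tb:>3} TB backup object -> ~{size / 2**30:.0f} GB of hash index")
    # 1 TB -> ~8 GB, 10 TB -> ~80 GB, 100 TB -> ~800 GB: past some object size
    # the index no longer fits in memory and lookups start hitting disk.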

I asked Dave how they could justify such high de-duplication ratios. He explained that it was all in the math and that most de-duplication ratios are basically the same for a given backup data object. He explained it with the following example.

A full backup of a 1TB database tablespace is taken on day one. The next day another full backup is taken, and only 2GB of that backup has any changes.

Using traditional full backup approaches, after two nights the backup capacity required is 2 x 1TB = 2TB.

One method of calculating de-duplication ratios could yield a low ratio: 

    • Total de-duplicated backup capacity used = 1TB + 2GB = 1.002TB 
    • If the de-duplication ratio compares the amount of total physical storage used to the total amount that would have been used by traditional backup methods, the ratio = 2TB / 1.002TB = approximately 2:1 

Another method of calculating de-duplication ratios could yield a high ratio: 

    • Total de-duplicated backup capacity used still = 1.002TB 
    • If the de-duplication ratio compares the amount a traditional full backup of the most recent (second) backup would have used to the amount of new data actually stored for it, the ratio = 1TB / 2GB = 1000GB / 2GB = 500:1 

In the above example, the net effect to the customer, that is, the amount of data that would be stored and then replicated from the branch office to the primary data center, is the same, even though the ratios are very different.
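For readers who want to reproduce the arithmetic, the two ratios from Dave's example can be computed directly (decimal units, 1TB = 1000GB, as in the example):

    TB = 10**12           # the example uses decimal units (1TB = 1000GB)
    GB = 10**9

    traditional = 2 * TB              # two nights of full backups, no de-duplication
    deduped     = 1 * TB + 2 * GB     # first full backup plus 2GB of changed data

    # Method 1: total traditional capacity vs. total physical capacity actually used
    print(f"cumulative ratio: {traditional / deduped:.2f}:1")    # ~2.00:1

    # Method 2: size of the most recent full backup vs. new data stored for it
    print(f"per-backup ratio: {(1 * TB) / (2 * GB):.0f}:1")      # 500:1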

For information on Hitachi Data Systems' latest VTL and de-duplication products for medium, large, and enterprise customers, see our February 11, 2008 announcement letter.


Comments (5)

Mike Dutch on 03 Mar 2008 at 9:02 pm

Does HDS intend to join the SNIA de-dupe group (DMF DPI DDSR SIG)?

joseph martins on 04 Mar 2008 at 1:54 pm

Hu,

Dave’s dedupe math is not universally true (i.e., it isn’t true for all dedupe products). We are quite familiar with the technologies, and we spent 13 months test-driving one dedupe-based product here at our office on our own live data set.

Let’s use your initial 1TB seed as an example. An ordinary initial backup would consume 1TB, but that is not the case with some of the dedupe-based products. For some products, dedupe first occurs within the initial backup. Depending on the data set, you might find that the initial deduped backup consumes a mere fraction of a terabyte – perhaps as little as a few gigabytes.

After a second full backup, traditional methods would consume 2TB, however, two full deduped backups would consume substantially less than 1TB. So on and so forth. In some cases, it may take dozens or hundreds of full backups to reach the 1TB consumed by the original non-dedupe seed.

I should also point out that the inline approach is not the only method that becomes more efficient as it sees more data.

Here are a few figures for you to consider:

    • We began with ~193GB of data consisting of office files, databases, multimedia content, and email spread across a combination of 5 PCs and Macs.
    • We performed 13 months of full daily backups.
    • Unique bit streams averaged a few percent per month.
    • After 13 months we had not exceeded 1TB of backups, let alone the 1.8TB capacity of the backup server.
    • Traditional methods (without compression) would have required some 76TB to achieve the same data retention using full daily backups. Weekly fulls with daily incrementals would have required somewhere in the neighborhood of 18TB.
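As a rough sanity check on those figures (treating 13 months as about 390 days, which is only an approximation), the arithmetic works out to roughly the same totals:

    GB, TB = 1, 1000                # work in decimal GB and TB
    seed_gb = 193 * GB              # the ~193GB starting data set
    days = 13 * 30                  # treat 13 months as ~390 daily backups (an approximation)

    daily_fulls_gb = seed_gb * days
    print(f"daily fulls, no dedupe: ~{daily_fulls_gb / TB:.0f} TB")   # ~75 TB, close to the ~76TB quoted above

    weekly_fulls_gb = seed_gb * (days // 7)
    print(f"weekly fulls alone:     ~{weekly_fulls_gb / TB:.1f} TB")  # ~10.6 TB; the daily incrementals
                                                                      # account for the rest of the ~18TB estimate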

I do not personally know Dave, but I share his bullishness. I would go one step further and claim that the benefits extend well beyond the cost benefits of the reduction of duplicate data. As companies store more, for less, for longer periods of time, they’ll begin to tap these environments for more value. Analytics and proactive discovery are just the tip of that iceberg.

[...] In response the storage industry has developed technologies like compression, single instance store, and de-duplication which are rapidly being deployed this year. You can find solutions that are deployed in the application server, the SAN, the storage array, or backup/archive appliance. Careful consideration must be given to where you deploy these tools, since the where, will impact performance and effectiveness as I pointed out in my previous post. [...]

Jon Toigo on 26 Apr 2008 at 8:18 am

Interesting analysis, Hu. I have been trying to get the de-dupe vendors to clarify a few issues I encounter in every customer visit I do regarding the meaning and value and risk of de-dupe. I have posted a questionnaire comprised of 10 queries drawn from actual conversations with IT folk on the subject. Some questions contain inaccuracies or misapprehensions of the customer to be sure, but that goes to the huge information gap that exists in this space. The pity is that the marketing folks are creating the erroneous views, but few vendor technical folks are working to dispel the marketecture. Thanks for this thread, which has been more illuminating than 20 PowerPoint presentations I have seen in the past year or two on the subject.

If HDS wants to respond to the post, it is here: http://www.drunkendata.com/?p=1692.

JWT
