Data Deduplication Claims
by Hu Yoshida on Nov 18, 2006
Data deduplication was one of the hot topics at SNW last month. Deduplication is a commonality factoring approach that eliminates duplicate data in a data stream. It is particularly effective for backup data because so many copies are made of the same data. It is typical to take a weekly full backup with daily incrementals and keep them for six months or more, even though only 10% of the data actually changes.
On Monday of SNW week, EMC announced that they were acquiring Avamar, which has a backup product that does data deduplication. What caught my attention in the press release was the claim that “…customers can achieve an industry leading 300:1 daily data reduction in real-world applications.” That claim is staggering! We partner with Diligent and use their HyperFactor technology for data deduplication, for which we claim an average reduction of 25:1.
We believe that the commonality factoring in HyperFactor is much more efficient than the Avamar approach, since HyperFactor is applied to the continuous data stream and not to chunks as is done with Avamar. Neville Yates, the CTO of Diligent, explains that HyperFactor works in three steps: it scans the incoming data for similarities; it compares the incoming data with the most similar data in the repository; then, when new data is found, it compresses the new data and stores it in the repository while updating the index with the knowledge of the new data. Typically new users see about a 10:1 factoring, which continues to increase as HyperFactor sees more data. 25:1 is a typical average, but some customers see 40:1 to 50:1 depending on their type of data.
Avamar can only dedupe to the chunk level, and their literature gives the average chunk size as 24KB. That means that on a daily basis, there are 300 copies of every 24KB chunk that they see. In 24KB there are 6,144,000 different combinations of bits.
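To make the chunk-level idea concrete, here is a minimal sketch of chunk deduplication in Python. It is not Avamar's implementation: it assumes fixed-size 24KB chunks and SHA-256 fingerprints purely for illustration, and the function name `dedupe_ratio` is hypothetical.

```python
import hashlib


def dedupe_ratio(stream: bytes, chunk_size: int = 24 * 1024) -> float:
    """Return logical bytes divided by bytes stored after chunk dedupe.

    Assumption: fixed-size chunks and SHA-256 fingerprints, chosen only
    to keep the illustration short.
    """
    seen = set()   # fingerprints of chunks already stored
    stored = 0     # bytes that would actually be written
    for offset in range(0, len(stream), chunk_size):
        chunk = stream[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        if digest not in seen:
            seen.add(digest)
            stored += len(chunk)
    return len(stream) / stored if stored else 0.0


# 300 identical copies of one 24KB chunk reduce to a single stored chunk.
chunk = bytes(range(256)) * 96       # 256 * 96 = 24,576 bytes
print(dedupe_ratio(chunk * 300))     # 300.0
```

In this toy model a 300:1 ratio only appears when every chunk in the stream has been seen 300 times, which is the point of the 300 copies remark above.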
On the Avamar website they have an ESG Lab Validation report in which Avamar claims a 90% reduction of data stored as backups. It goes on to describe the backup of 1TB of data with weekly fulls and daily incrementals, with a retention of 10 weekly tapes. Avamar would only require 1TB of storage to hold what would otherwise take 10TB. I would call that a 10:1 data reduction over 10 weeks, which is a far cry from the 300:1 daily data reduction that was called out in the press release.
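The arithmetic behind that comparison can be sketched in a few lines of Python. The function names are hypothetical, and the only inputs are the figures quoted in this post (10TB stored in 1TB, 25:1, and 300:1).

```python
def reduction_ratio(logical_tb: float, stored_tb: float) -> float:
    """N in an N:1 ratio: data backed up divided by data actually stored."""
    return logical_tb / stored_tb


def reduction_percent(ratio: float) -> float:
    """Percentage of data eliminated at a given N:1 reduction ratio."""
    return (1 - 1 / ratio) * 100


print(reduction_ratio(10, 1))             # 10.0  (the ESG example)
print(round(reduction_percent(10), 2))    # 90.0
print(round(reduction_percent(25), 2))    # 96.0
print(round(reduction_percent(300), 2))   # 99.67
```

A 90% reduction and a 10:1 ratio are the same claim, while 300:1 would mean eliminating roughly 99.67% of the data.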
I would like to understand this claim of 300:1. It may be that we have been miscalculating the efficiency of our HyperFactor and it should be greater than 25:1.
Comments (5)
“In 24KB there are 6,144,000 different combinations of bits”
Actually, it’s a bit more than that. 24 x 1024 = 24,576 bytes = 196,608 bits, which makes for 2^196,608 different combinations. Since 2^10 is about 10^3, that number of combinations is more than 10^58,982 (the number consisting of a 1 followed by 58,982 zeros).
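For anyone who wants to check the figures, this short Python sketch reproduces the bit count and estimates how many decimal digits 2^196,608 has. It uses a logarithm rather than printing the number itself; the variable names are illustrative only.

```python
import math

# Bits in a 24KB chunk, taking 1KB = 1,024 bytes as in the comment above.
bits = 24 * 1024 * 8
print(bits)      # 196608

# Decimal digits in 2**bits, via log10 instead of building the full number.
digits = math.floor(bits * math.log10(2)) + 1
print(digits)    # 59185
```

The exact figure is a little larger than the 2^10 ≈ 10^3 approximation suggests, because 2^10 is 1,024 rather than 1,000.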
Otherwise, I wholly agree. Compression ratios are very susceptible to marketing spin. Somebody should create a public benchmark data set.
I think this 300:1 de-dupe ratio for Avamar may be a press release typo. I believe Avamar has previously put it at 30:1.
Hello Ernst and Chris. Thanks for the comments. Ernst, I stand corrected; I was never good at math. I agree that we need a public benchmark to measure not only the deduping ratio, but also the speed of deduping and its impact on backup and restore speeds.
The 300:1 has appeared in several documents so I suspect that it is not a press release typo.
Ideally, data dedupe should not be managed at the target. By then much of the “damage” has already been done. Allow me to explain.
First, such an environment can only dedupe data to which it is exposed as part of, say, a backup process. While the resulting backup environment may be capacity-efficient, this does nothing to eliminate enterprise-wide “production” data redundancies, that is, redundancies in data that is not sitting on backup servers or in archives.
To fully leverage data deduplication, it must be done at the source. Once you’ve begun to eliminate duplicate data in a production environment, backups will naturally be faster and more capacity-efficient.
Continuous, transparent deduplication at the source ensures that the least amount of data is stored in the production environment, the archive, and on backups. This is why Avamar is able to make substantially higher data reduction claims – they’re able to dedupe more than just a backup or archive stream, so 300:1 is not unrealistic. And their particular approach leads to other interesting benefits (e.g. more efficient, lower risk data destruction).
If you’re still not convinced, there is another important consideration – bandwidth consumption. Target deduplication does nothing to conserve production or backup bandwidth. Let’s use your example of 300 copies. Which would you prefer? Eliminate production dupes on an ongoing basis, and send a single copy to backup? Or, send 300 copies to backup and dedupe on-the-fly thereby consuming 300 times the backup-stream bandwidth, 300 times the processing, and, sadly, still 300 times the capacity of the more expensive production environment?
Disclaimer: I ran an older release of Axion in my home, yes, my home, for more than two months last year to test the product as part of my CODiE Award storage software judging responsibilities. I easily achieved a 90%+ reduction in my “production” environment and consequently an equivalent reduction for my backup. I know of no other individual who has performed a test of similar breadth, depth, or length.
As an aside, Diligent’s technology may be equally capable of reduction well beyond the 25-50:1 you’ve validated if the company can figure out how to apply its hyperfactoring to a production environment. The ball is in their court…let’s see what they do with it.
I realized shortly after crafting my response that I’ve mashed together two equally important aspects of deduplication in a way that might confuse or mislead your readers. I would be happy to rewrite and resubmit my response and discuss both aspects separately before you publish the comment, if you wish.
The net of it is this: I tested Axion in its intended role, but I also managed to push the envelope and use Axion in ways that Avamar does not yet publicly endorse. It’s worth your time to check it out for yourself. Though I do wonder how EMC would feel about that.