I Agree with Chuck on Data Dedupe
by Hu Yoshida on Oct 5, 2009
Chuck Hollis had an interesting observation on deduplication of primary data and I/O density. He points out that while deduplication is great for backup, archive, and large file repositories, it might not be as great for primary data. His reasoning is that dedupe can increase I/O density, which may hurt the performance of primary data and negate the space savings that dedupe could bring. He goes on to say that the current workarounds for the I/O density problem, more disks and/or faster disks, defeat the premise of using dedupe to save disks and reduce costs.
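To put rough numbers on the I/O density argument, here is a minimal sketch. It assumes I/O density means IOPS per GB of physical capacity and that the workload's IOPS stays the same after dedupe. The function names and the figures are hypothetical, and the sketch ignores caching effects.

```python
def io_density(iops: float, capacity_gb: float) -> float:
    """Return I/O density as IOPS per GB of physical capacity."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return iops / capacity_gb


def density_after_dedupe(iops: float, logical_gb: float, dedupe_ratio: float) -> float:
    """Return I/O density once the data is stored on less physical capacity.

    Assumption: the application still issues the same IOPS, while the
    physical footprint shrinks to logical_gb / dedupe_ratio.
    """
    if dedupe_ratio <= 0:
        raise ValueError("dedupe_ratio must be positive")
    physical_gb = logical_gb / dedupe_ratio
    return io_density(iops, physical_gb)


# Hypothetical workload: 5,000 IOPS against 10,000 GB of primary data.
before = io_density(5000, 10000)
# The same workload after a hypothetical 4:1 dedupe ratio.
after = density_after_dedupe(5000, 10000, 4)
print(f"before: {before} IOPS/GB, after: {after} IOPS/GB")
```

With these made-up figures the same I/O lands on a quarter of the capacity, so the density the remaining disks must sustain is four times higher.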
I agree with Chuck and think his post was very well written. I would like to offer some additional views on dedupe.
Dedupe is most effective for backup, where there are a lot of repetitive bit strings. But dedupe for backup addresses the symptom and not the cause. The cause is all the stale data that we back up and dedupe over and over again. Most data that is over 60 days old will rarely be referenced again. This stale data also does not need to be restored as long as it is stubbed out to a content repository like HCAP, which is replicated for data recovery.
Use an active archive solution like HCAP to reduce the working set and eliminate stale data from your backups, dedupe, and restores. This not only reduces storage capacity requirements but also reduces operational costs.
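As a rough way to see how much of a file tree would count as stale under the 60-day rule of thumb, here is a minimal sketch. It assumes last access time is a usable proxy for "referenced", which fails on file systems mounted with access-time updates disabled. The function name and threshold constant are hypothetical, and this is not how HCAP or any archive product selects files.

```python
import os
import time

STALE_AFTER_DAYS = 60  # threshold taken from the 60-day rule of thumb
SECONDS_PER_DAY = 86400


def find_stale_files(root, days=STALE_AFTER_DAYS, now=None):
    """Return sorted paths under root whose last access is older than `days`."""
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                last_access = os.stat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if last_access < cutoff:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    for path in find_stale_files("."):
        print(path)
```

Files on the resulting list are the candidates to stub out to an archive tier so they stop showing up in every backup cycle.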
While some people use dedupe for archiving, if that archive is for compliance, it may be better to use single instance store. That avoids the discussion as to whether dedupe and redupe constitute a modification, and the concern over the probability of a hash collision, which could lose data. The HDS HCAP platform does single instance store and does not do dedupe for content archive. Hashing is done for immutability, to ensure that the content has not changed between ingestion and retrieval.
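The idea of hashing for immutability can be sketched as follows: record a fingerprint when an object is ingested and recompute it on retrieval. This is only an illustration, not HCAP's implementation. The `ArchiveStore` class, the in-memory dictionary, and the choice of SHA-256 are all assumptions.

```python
import hashlib


class ArchiveStore:
    """Toy in-memory archive that verifies content on every retrieval."""

    def __init__(self):
        # name -> (content bytes, hex digest recorded at ingestion)
        self._objects = {}

    def ingest(self, name: str, content: bytes) -> str:
        """Store the content together with the fingerprint taken at ingestion."""
        digest = hashlib.sha256(content).hexdigest()
        self._objects[name] = (content, digest)
        return digest

    def retrieve(self, name: str) -> bytes:
        """Return the content only if it still matches its ingestion fingerprint."""
        content, recorded = self._objects[name]
        if hashlib.sha256(content).hexdigest() != recorded:
            raise ValueError(f"content of {name!r} changed since ingestion")
        return content


store = ArchiveStore()
store.ingest("contract.txt", b"signed on 2009-10-05")
print(store.retrieve("contract.txt"))
```

Here the hash is used only to detect change, never to decide that two objects are the same, so a hash collision cannot cause one object to be silently replaced by another.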
Chuck alludes to the concept of tiering as a workaround for I/O density, where primary data can be moved to high performance tiers when it needs performance. Today that requires the movement of volumes or files. Moving volumes up and down tiers of storage in response to performance requirements is practical only if the volumes are virtualized so that data movement does not disrupt the application. Moving volumes is a heavy task, so it is best done sparingly, where the requirements for tiering are infrequent or the periods of high performance are predictable and the data can be staged ahead of time. On the other hand, moving files is more practical with Hitachi NAS, which does file virtualization and can move files non-disruptively between tiers of storage and also stub out a file to HCAP based on policies. With the Hitachi Data Discovery Suite, HDDS, files can be indexed and moved between Hitachi NAS tiers and HCAP based on content awareness.
Dynamic provisioning also helps to reduce the workload of tiering and dedupe by eliminating the overhead of “zero” pages.
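The "zero" page idea can be illustrated with a minimal sketch: only pages that contain non-zero data get physical capacity, so there is less to tier or dedupe downstream. The 4 KB page size and the function names are hypothetical and do not describe how Hitachi Dynamic Provisioning is implemented.

```python
PAGE_SIZE = 4096  # hypothetical page size for illustration only


def split_pages(volume: bytes, page_size: int = PAGE_SIZE):
    """Split a volume image into fixed-size pages (the last one may be short)."""
    return [volume[i:i + page_size] for i in range(0, len(volume), page_size)]


def is_zero_page(page: bytes) -> bool:
    """Return True when every byte in the page is zero."""
    return not any(page)


def allocated_pages(volume: bytes, page_size: int = PAGE_SIZE) -> dict:
    """Map page index -> page data for pages that actually need capacity."""
    return {
        index: page
        for index, page in enumerate(split_pages(volume, page_size))
        if not is_zero_page(page)
    }


# A three-page volume in which only the middle page has been written.
volume = bytes(PAGE_SIZE) + b"x" * PAGE_SIZE + bytes(PAGE_SIZE)
print(sorted(allocated_pages(volume)))
```

In this example only one of the three pages consumes capacity; the two zero pages never have to be stored, moved between tiers, or hashed for dedupe.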
So I agree with Chuck: “Dedupe is a useful tool for taming information (data) growth, but it is not a panacea.” I would also add that dedupe is great at addressing the symptom of bloated storage growth, but other solutions like Dynamic Provisioning and File and Content services are more effective in addressing the cause, which is stale data and over-provisioning.
Comments (10)
Hu, it occurs to me that you are saying customers should be looking at any capacity savings features, and HDP is a great example. So perhaps you and Chuck are on to something else: dedupe is not the end state, but capacity optimization is!
I’ve chosen to stay out of the little tempest in a teapot that’s going on over at Chuck’s blog, but I suggest looking through the comments. In between the vitriol, some good points are made about why Chuck is wrong. For example, if deduplication is increasing access to common blocks, it means that you’ll be seeing much better cache efficiency, which will offset additional load on the drives. The “boot storms” he talks about with many virtual machines hosted on the same storage are actually less likely to occur with deduplication than without!
Deduplication has seen its first successes in the D2D backup space, where it’s easy to get a lot of deduplication due to the data patterns and traditional backup schedule. Applying deduplication beyond backup is hard, because the opportunities for deduplication are fewer and further between, and so these D2D backup devices have never been able to address archive or primary storage effectively. That doesn’t mean dedupe is bad for primary, it just means that it’s harder to do.
At Permabit, we consider dedupe for backup to be Dedupe 1.0, and the future for dedupe innovation is in Dedupe 2.0, which includes dedupe for primary and cloud storage. We host a forum over at http://www.dedupe2.com/ to discuss this further, and recently released our Permabit Cloud Storage product to address new customer needs.
Dedupe for primary is a huge win for the storage consumer, but it’s taken us nearly a decade of extensive technology and patent development to solve the scalability and speed challenges needed for that market. I think it’s no coincidence that the two voices denouncing primary dedupe the most, HDS and EMC, have no products to offer with a feature that will soon become a customer requirement.
Michael, the point that I am trying to make is similar to what you are doing on The Storage muse:
Optimizing capacity may be a byproduct, but what I would really like to do is optimize TCO by reducing operational costs with tools like Dynamic Provisioning and an integrated file, content, and search capability.
Jered, thanks for your comment. I am not against primary dedupe or dedupe of any kind. What I am saying is that there are other tools that could more effectively address some of the causes of data bloat, like empty allocations and stale data. Tom Cook, your CEO, was recently quoted as saying essentially what I said in this blog:
“If you really think about where deduplication technology can be applied and how it can become more valuable to an enterprise, start with primary storage and focus on moving static / persistent information off of it to a value tier so you are not backing it up repeatedly. When you do those things, you can reduce your cost structure by, in many cases, 10-20x of what it was before. You decrease the importance for optimization at the D2D backup level and improve your overall organizational efficiency tremendously.”
[...] This post was mentioned on Twitter by Avnet’s StoragePath and Amer Chebaro. Amer Chebaro said: RT @HDScorp: New post: I Agree with Chuck on Data Dedupe http://bit.ly/12pmHQ *WOW!! That’s a first!!!* http://myloc.me/Ucqd [...]
Interesting debate. I understand both arguments conceptually, but think you might both be right in the proper light. Fact is that we’ve seen really important workloads go faster (databases, e-mail, etc.), primarily because the reduction in the size of a working set makes more of the relevant data fit into higher performing regions of the storage, be it loaded into cache or simply requiring fewer physical seeks to get a transfer completed. Faster: not “just as fast” but faster.
If you reduced all of the capacity from 100 different databases, that previously existed on 100 different spindles, and shoved them all into one spindle – unless you offset those newly created issues, Chuck is right. In reality, you wouldn’t do that, you’d buy 2TB of Flash and all 100 databases would now scream. And, it would cost you a pile less money to boot.
I think it will get even more interesting in the coming year when we start combining data reduction with properly balanced high performance capabilities. It’s not a zero sum game, in my opinion.
Hu, I completely agree that virtualization, tiering and dynamic provisioning are critical in an environment with costly top-tier primary storage. In such an environment, dedupe for that top tier is a huge win as well, but Chuck seems to disagree. I interpreted your agreement with his post as being counter to primary dedupe. I’ve expanded slightly on what I said earlier over at my blog: http://blog.permabit.com/?p=514
[...] response to the post, Hu Yoshida at HDS put in his view, which is that he essentially agrees with Chuck on this question. His main point is that dedupe for primary isn’t a panacea. True [...]
[...] Hu Yoshida has weighed in on Chuck’s side, but then laid out the much more reasonable view that virtualization, tiering and dynamic [...]
A nice post with lots of food for thought.