I Don’t Agree with Chuck on Everything
by Hu Yoshida on Oct 10, 2009
In my previous post, titled “I Agree With Chuck on Data Dedupe,” I received a fair number of comments. Some were from Jered Floyd of Permabit, and there was even one from Steve Duplessie. My post was intended to point out that while dedupe is an excellent tool for reducing storage bloat, it addresses the symptom and not the cause, which is stale data and over-allocation at the source.
Unfortunately this post was interpreted as supporting Chuck Hollis’ view on dedupe of primary data.
There are many ways to optimize capacity and reduce operational costs for primary data, including thin provisioning and archive. The opportunities for dedupe of primary data, as Jered acknowledges in his blog, are harder to find. One other consideration about primary data is that it may be copied many times over for additional offline processing, like data mining, extract/transform/load, development test, etc. Some primary data volumes may be copied and moved over 20 times. You would probably have to redupe (restore the full, non-deduped data) before you make those copies, and the copies may be moved to storage that does not have the dedupe capability. A lot of data older than 60 days may never be referenced again. If you archive out the stale data and thin provision the source, these efficiencies carry forward with the copies and moves. In the case of files, when they become inactive, stub the file out of the file share to an active archive like HCAP, and you are only backing up the stub with the file share.
So my suggestion was that archiving stale data at the primary source volume or file will reduce the working set of active data, and with it the work and capacity requirements for everything that follows, including dedupe.
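To put rough numbers on that argument, here is a back-of-the-envelope sketch in Python. The function name and the sample figures (a 10 TB volume with 6 TB of stale data) are hypothetical and chosen only for illustration; the 20 copies echo the figure mentioned above. It assumes every copy is a full copy, ignores the small size of stubs, and does not model dedupe at all.

```python
def footprint_tb(primary_tb, stale_tb, copies, archive_first):
    """Total capacity (TB) for a primary volume plus its downstream copies.

    Hypothetical model: each copy is a full copy of whatever is still on
    the primary volume. If stale data is archived first, it is stored once
    in the archive and no longer travels with the copies.
    """
    if archive_first:
        working_set_tb = primary_tb - stale_tb  # only active data remains
        archived_tb = stale_tb                  # kept once, in the archive
    else:
        working_set_tb = primary_tb             # stale data rides along
        archived_tb = 0
    # One primary instance plus every copy, plus the single archived instance.
    return working_set_tb * (1 + copies) + archived_tb


if __name__ == "__main__":
    # Illustrative inputs only: 10 TB volume, 6 TB of it stale, 20 copies.
    before = footprint_tb(10, 6, 20, archive_first=False)
    after = footprint_tb(10, 6, 20, archive_first=True)
    print(f"without archiving: {before} TB, with archiving: {after} TB")
```

With these made-up inputs the footprint falls from 210 TB to 90 TB. The point is that the saving is multiplied by every downstream copy, wherever that copy ends up, which is why trimming the working set at the source helps everything that follows.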
Comments (2)
Lots of good dedupe sessions at SNW next week.
See you there!
Hu, I see what you’re getting at here, but I also think that addressing the “symptom” (multiple copies of data) is sometimes the right thing to do. Consider the problem of dozens of revisions of a PowerPoint document. We could beg Microsoft to include revision control in PowerPoint (I’d actually like this very much), or we can let the storage solve the problem. Sometimes that’s the right solution.
Finding and eliminating stale data is a great thing to do, but once you’ve hit the easy targets the cost grows significantly. You can be limited by technological restrictions like my example above. Having a platonically ideal solution (eliminate the data redundancy at its source) doesn’t mean that another method (have the storage identify and eliminate redundancy) might not work better in many cases. Just because I change my sheets once a week doesn’t mean there’s no sense in making the bed in the morning.
So, I think we are in agreement. Different technologies can solve different problems (dedupe doesn’t help when moving between different storage systems), but different technologies can also solve the same problem in a cooperative way.
Are you going to be here at SNW? I have a talk on Tuesday afternoon where I will be touching on primary storage optimization. It’d be great if you could stop by and share your thoughts.