The other side of Single Instancing – Re-Instancing?
by Ken Wood on Jul 15, 2009
As an occasional contributor to Michael’s blog, my focus will be on the various and well-known space reclamation techniques of de-duplication and single instancing. By posting, I aim to share my knowledge on these techniques and also have a conversation with other professionals (users and vendors alike) who might have similar or differing opinions.
Now down to business. The way I see it, there are two fairly distinct levels at which storage space can be reclaimed when using data reduction techniques: the block level, and the file (or object) level.
Obviously, combinations of these techniques can exist to provide additional space savings, but what about single instancing a compressed file? Or block-level de-duplication of a RAID-1 or n-way mirrored device? Local and remote replication of files or block volumes? More importantly, what if I want multiple instances of my data for performance reasons, or multiple instances across my infrastructure for protection? Stated another way, what do I call the storage infrastructure-wide orchestration of getting rid of unnecessary copies while making copies in a controlled fashion when I need them? The ability to single instance or de-duplicate data for space savings, combined with the ability to re-instantiate or replicate data for performance or protection, is what I call Controlled Instancing.
The concept of Controlled Instancing is the infrastructure-wide orchestration of policies that places data on the appropriate platform to save storage space, protects data, and/or re-instantiates data to meet the performance demands of files, objects, or block storage devices, or the appropriate combination of all of these. Controlled Instancing isn't a point solution; it is the controlled movement and placement of data across many storage solutions integrated together.
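The two halves of this idea, collapsing copies for space savings and fanning them back out on demand, can be sketched in a few lines. The following toy Python class is purely illustrative (all names are hypothetical; real products do this at the file-system or block layer, not in memory):

```python
import hashlib

class ControlledInstanceStore:
    """Toy single-instance store: one physical copy per unique content,
    plus the ability to re-instantiate extra copies on demand."""

    def __init__(self):
        self.chunks = {}      # content hash -> bytes (the single instance)
        self.instances = {}   # logical name -> list of hash references

    def write(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(digest, data)            # dedupe: store once
        self.instances.setdefault(name, []).append(digest)

    def deduplicate(self, name):
        """Space-saving policy: collapse a file down to a single instance."""
        self.instances[name] = self.instances.get(name, [])[:1]

    def re_instantiate(self, name, copies):
        """Performance/protection policy: fan back out to `copies` instances."""
        self.instances[name] = self.instances[name][:1] * copies

    def copy_count(self, name):
        return len(self.instances[name])

store = ControlledInstanceStore()
for _ in range(5):
    store.write("report.doc", b"quarterly numbers")
store.deduplicate("report.doc")        # 5 logical copies -> 1
store.re_instantiate("report.doc", 3)  # peak demand: back out to 3
print(store.copy_count("report.doc"))  # 3
print(len(store.chunks))               # 1 unique piece of content stored
```

The point of the sketch is the pairing: the same policy engine that removes copies is also authorized to create them, which is what distinguishes Controlled Instancing from one-way de-duplication.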
The Case for Controlled Instancing for File Data
Many individual data management technologies (old, new, and emerging) are slowly coming together, accidentally, haphazardly, and, more importantly, without coordination, to achieve better storage efficiency. Individually, these are primarily point products that address one facet of storage efficiency. Although block- and file-level space reduction can be combined to increase storage efficiency even further, Controlled Instancing for file data meets protection, space savings, and performance requirements within the same domain.
Granted, there is a lot of value in removing multiple copies of data to reduce the amount of storage required to hold it, but Controlled Instancing for Data Files delivers those space savings and adds the capability to bring copies back during peak demand, and to replicate the remaining file for disaster protection or recovery, both locally and remotely. This combination, orchestrated through policies across several integrated products, forms the foundation of a new approach to managing data that goes beyond space savings alone. It is also known as putting data in the right place, at the right time, for the right reason.
These diagrams are similar to the one Michael posted a couple of months ago when he described file storage tiering. I usually describe the Hitachi Data Discovery Suite (HDDS) as having two features. The first, and the one it is most commonly associated with, is indexing and search; the second, which I like to talk about, is its migration control and migration policy engine. HDDS drives the data migration engine in HNAS, either through its GUI or scheduled as a policy. Files on HNAS can be migrated to HCAP, for example, where internal policies can eliminate duplicate files and compress the remaining file. Referencing a file on HNAS (not just its metadata) will pull the compressed file back through the HCAP link and re-instantiate the HNAS stub as a full file. Each file that was reduced to a single compressed instance on HCAP is re-instantiated on HNAS, all copies if that's how the workload presents itself, automatically.
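The stub-and-recall behavior described above can be sketched as a toy. This is only an illustrative Python example using two temporary directories to stand in for the primary and archive tiers; it is not how HNAS or HCAP actually implement stubs, and all names here are hypothetical:

```python
import os, gzip, tempfile

# Stand-ins for the two tiers: "nas" plays the primary (HNAS-like) role,
# "archive" the compressing archive (HCAP-like) role.
nas = tempfile.mkdtemp()
archive = tempfile.mkdtemp()

STUB_MAGIC = b"STUB:"  # marker distinguishing a stub from real file data

def migrate(filename):
    """Move a file to the archive tier (compressed), leaving a stub behind."""
    src = os.path.join(nas, filename)
    dst = os.path.join(archive, filename + ".gz")
    with open(src, "rb") as f, gzip.open(dst, "wb") as g:
        g.write(f.read())                   # compressed copy on the archive
    with open(src, "wb") as f:
        f.write(STUB_MAGIC + dst.encode())  # stub points at the archived copy

def read(filename):
    """Reading a stub forces the file back: re-instantiation on access."""
    path = os.path.join(nas, filename)
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(STUB_MAGIC):         # stub hit: recall from archive
        with gzip.open(data[len(STUB_MAGIC):].decode(), "rb") as g:
            data = g.read()
        with open(path, "wb") as f:         # re-instantiate the full file
            f.write(data)
    return data

with open(os.path.join(nas, "a.txt"), "wb") as f:
    f.write(b"hello" * 100)
migrate("a.txt")
print(read("a.txt") == b"hello" * 100)  # True: recalled transparently
```

The key property the sketch shows is that the application simply reads the file; the recall and re-instantiation happen underneath, driven by the stub, exactly the "automatic" behavior described above.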
HDS continues to evolve and develop many of the components of Controlled Instancing for Data Files to further enhance this concept. Expect me to add to it as we make more features and components available to enable fully automated management of the file services infrastructure to meet user requirements.
Anyway, to wrap this up for now: I'm essentially a strategist for HDS who researches data and storage technologies, covering a wide range of technology subjects both present and future. I will be covering topics that include server and storage virtualization, storage performance, storage architectures, and a specialized area I like to call "eXtreme Storage". I especially enjoy looking at what's currently going on in the world and rolling it back up into how it impacts storage, what else storage could be doing, and how storage technologies could make life easier for everyone.
Comments (2)
I tend to look at Dedupe and all related efforts from a very simplistic point of view.
1. 1TB of SATA disk costs around $1,000 USD.
2. 10TB of dedupe-capable disk costs around $120,000 USD. With a 10:1 ratio this boils down to 100TB usable.
I guess the below is probably a question for David Merrill.
Why should I invest $20,000 more when I can dump everything on a dumb SATA array without dedupe capabilities? At what compression/dedupe ratio does the investment make sense? I guess you could argue about rack space, power consumption, etc., but with a good archival engine I could set a watermark on the SATA array, move everything older than a year to tape, and limit the SATA growth to 60TB.
What I’m trying to say is that I see a lot of talk about dedupe/compression/single instancing but very few articles about the economic sense that this makes to a company.
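The break-even question above can be answered directly from the figures quoted. A quick back-of-the-envelope in Python, treating the commenter's prices ($1,000 per TB of SATA; $120,000 for a 10TB dedupe-capable array) as assumptions:

```python
# Prices as quoted in the comment above (assumptions, not vendor figures).
sata_cost_per_tb = 1_000      # $ per usable TB of plain SATA
dedupe_array_cost = 120_000   # $ for the dedupe-capable array
dedupe_raw_tb = 10            # raw TB in that array

def dedupe_cost_per_usable_tb(ratio):
    """Effective $ per usable TB at a given dedupe ratio (e.g. 10 for 10:1)."""
    return dedupe_array_cost / (dedupe_raw_tb * ratio)

# At 10:1 the dedupe box delivers 100 usable TB at $1,200/TB,
# i.e. $20,000 more in total than 100TB of plain SATA.
print(dedupe_cost_per_usable_tb(10))  # 1200.0

# Break-even ratio: where the dedupe $/usable TB matches SATA's $1,000/TB.
break_even = dedupe_array_cost / (dedupe_raw_tb * sata_cost_per_tb)
print(break_even)  # 12.0 -> the dedupe box needs better than 12:1 to win
```

On these numbers alone, the dedupe array only wins above a 12:1 reduction ratio, which is exactly why the rack-space, power, and archival considerations mentioned above end up deciding the question in practice.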
Vinod, thanks for your comment and your challenge. However, I think you are reinforcing my Controlled Instancing concept. A box of disks is a perfectly valid target for content to be migrated to. While I focused on migration targets that manage the data themselves, a low-cost option with little intelligence can also be included in a Controlled Instancing infrastructure. In fact, with the creation and enforcement of certain policies, data types can be segregated onto multiple classes of storage. A box of disks could be a best-effort storage pool for non-business-related content that may find its way onto corporate file servers.
My statement still holds: "… known as putting data in the right place, at the right time, for the right reason." Taking content from one platform to another "straight up" lets me invent a new term, one-for-one instancing, though this is really just down-tiering, and it too is covered by Controlled Instancing. As for the economics of space reclamation technologies versus deeper, cheaper storage that keeps everything one-for-one: I do know that the latter is still not an acceptable option for many environments, at least for now and the near future. Hopefully David Merrill can chime in on the economics here.
I do support your opinion on tape as a tier in a Controlled Instancing infrastructure. The idea of SSD-to-disk-to-tape and back, all orchestrated automatically, should be of interest to many IT professionals looking to control costs while maintaining service levels. This also implies that not all data is treated or stored equally; some content is destined to land on a box of disks with fingers crossed.