Single Point Compression is Not a Black Hole
by Ken Wood on Jul 10, 2011
This blog is the first in a series of short articles I plan to write over the next several weeks describing capacity optimization techniques and designs. The series will cover the standard data manipulation algorithms for reducing the amount of storage data consumes, such as single instancing, compression and de-duplication.
The amount of data being generated today is staggering. The amount of data being stored is just as amazing. Those of us who have been in this industry for a while (my career started in the late ‘70s) can attest to the time when storage was measured in kilobytes on real rusty platters. This is not a reminiscence piece, but more a look at how far we’ve come and what we’re doing about it. Back then there wasn’t as much digital data being generated or kept, and the majority of that data was human generated, versus today’s machine-generated data.
The technique of storing more data in less space, or of creating the perception that more data is stored than actually is, is not new. Data compression utilities built into operating systems and available as add-on applications have been around for decades, and single instancing has been available in Unix OSes for just as long through symbolic and hard links. In fact, I remember examining scripts and programs that were actually the exact same file under different file names. Which file name you invoked determined the behavior of the program. In reality, the different file names were hard linked to the same file in the file system. Examining the argv[0] argument (the name under which the program was invoked) would cause a different set of instructions to be executed within the program. This was a way of saving time maintaining code, but it was also a very early form of data de-duplication: single instancing. The same technique was used in scripts by hard linking them and branching on the $0 argument.
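To make that old trick concrete, here is a minimal sketch in Python. The names "shrink" and "grow" are purely hypothetical; the point is that one file exists on disk, a hard link gives it a second name, and the program branches on the name it was invoked under.

```python
import os
import sys

def main():
    # One copy on disk, two (hypothetical) names created with a hard link,
    # e.g. os.link("shrink", "grow") or "ln shrink grow" from a shell.
    invoked_as = os.path.basename(sys.argv[0])
    if invoked_as == "shrink":
        print("running the shrink behavior")
    elif invoked_as == "grow":
        print("running the grow behavior")
    else:
        print("unknown invocation name: " + invoked_as)

if __name__ == "__main__":
    main()
```

The second name costs only a directory entry, not a second copy of the program, which is exactly the single-instancing effect described above.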
So today, there is file-based single instancing in appliances and software products. Basically, this is the management and orchestrated pointing of file names or pathnames to files that are exactly the same, usually through symbolic links and/or stub files, along with the ongoing management of those links and stubs. This assumes that a significant number of the files stored are identical but carry different file names or possibly different pathnames. In many common environments, a high percentage of files are the same except for their names or paths, so file-based single instancing can provide a very good level of capacity optimization for network file servers.
Hitachi Content Platform (HCP) uses this technique to optimize its storage capacity. Removing a file is an extremely sensitive operation, so a two-step integrity check is performed to ensure that the file being “single instanced” is an exact copy. The integrity check first does a quick match of the hash signatures associated with each file to create a candidate list of potential file copies. Then a binary compare of the files in the candidate list is performed, and only when an exact match is confirmed does the file get single instanced, that is, replaced with a pointer to the already stored file. HCP also leverages file compression to further optimize its storage capacity.
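This is not HCP’s actual code, of course, but the general two-step pattern can be sketched in a few lines of Python: group files by hash signature to build a candidate list, binary compare each candidate against the retained copy, and only then replace the duplicate with a link to the single stored file. The function names and the use of hard links here are my own illustrative choices.

```python
import filecmp
import hashlib
import os

def file_hash(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def single_instance(paths):
    """Replace byte-identical duplicates with links to one stored copy."""
    candidates = {}  # hash signature -> paths sharing that signature
    for path in paths:
        candidates.setdefault(file_hash(path), []).append(path)

    for group in candidates.values():
        original = group[0]
        for duplicate in group[1:]:
            # Step two: a full binary compare, because a matching hash only
            # makes a file a candidate, not a proven copy.
            if filecmp.cmp(original, duplicate, shallow=False):
                os.remove(duplicate)
                os.link(original, duplicate)  # name now points at the single copy
```

A production system would also handle permissions, concurrent writers and link management, but the hash-then-verify-then-point sequence is the essence of the integrity check described above.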
File-level compression can work on every file. Granted, some files are already in a compressed format, such as MP3s, JPEGs and MPEGs. In these cases, compression doesn’t create a significant capacity savings and, if not managed correctly, can actually increase the size of the original file. However, for the majority of the other file types that exist in an enterprise, many files can be reduced significantly through compression. In higher-end systems, the files appear to be stored as is, but the underlying bits, bytes and blocks of the file are compressed by the file system. The telltale sign is the slight performance impact at read time as the file(s) are uncompressed for use.
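A quick Python experiment illustrates both points. Repetitive, text-like data shrinks dramatically, while already-dense data (approximated here with random bytes, which behave like an already-compressed JPEG or MP3) barely shrinks at all and can even grow slightly once the compression framing overhead is added.

```python
import os
import zlib

# Text-like data compresses well; dense, already-compressed-style data does not.
text_like = b"the quick brown fox jumps over the lazy dog\n" * 10_000
already_dense = os.urandom(len(text_like))

for label, data in (("text-like", text_like), ("random / pre-compressed", already_dense)):
    compressed = zlib.compress(data, 9)
    print(f"{label}: {len(data)} bytes -> {len(compressed)} bytes "
          f"({len(compressed) / len(data):.0%} of original)")
```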
Of course, the combination of file-based single instancing and compression can yield even more capacity optimization. With this technique, the one file that is referenced by many file links via single instancing is itself compressed. This compounds the effect of either single instancing alone or compression alone.
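As a purely illustrative example with made-up numbers: if 100 copies of the same 10 MB presentation are scattered across home directories, single instancing keeps one 10 MB copy plus lightweight links, and compressing that one copy at 2:1 leaves roughly 5 MB on disk instead of 1,000 MB, a combined reduction of around 200:1 for that set of files.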
So this is a quick article on two capacity optimization techniques that have been available in the industry for quite some time. In my next blog, I will discuss one of the more modern and advanced techniques for capacity optimization, data de-duplication, which you will find is a modern twist on single instancing.
[...] In my last blog, I described some past techniques and the current method of file-level single instancing and its capacity optimization companion, file compression. In this post, I’d like to step this up with the more modern approach to capacity optimization, data de-duplication, and I’m going to show you how it’s done. [...]