May I Please Have Some More Capacity Optimization, Sir?
by Ken Wood on Aug 16, 2011
So, this is the third and final installment of my blog series on capacity optimization techniques. The first article was on file-level single instancing and file-level compression, which also included a combination of the two. The second article described how data de-duplication works, which I demonstrated by using Linux commands.
In this post, I’ll show you how file-level de-duplication and compression can save even more capacity. Plus, data de-duplication implies that sub-file single instancing is at work under the covers. Since I enjoyed doing the demonstration-style article showing some of the ways that this technology works under the covers, I’m going to employ this technique again.
In my previous post, “To De-dupe, or Not to De-dupe, That is De-data”, I demonstrated that the “split”, “md5sum”, “rm” and “cat” native Linux commands (at least on my CentOS 5.4 system) can be combined to de-duplicate a large file into a smaller file footprint (several smaller sub-files) at almost a 3:1 capacity savings ratio. What I plan to demonstrate here is combining a form of file-level single-instancing and file-level compression, AND file-level data de-duplication to reduce the capacity footprint further. In this case, I’ll show how a sub-file is a sub-file is a sub-file. Just because a sub-file was “extracted” from one original file doesn’t mean it can’t be used somewhere else. In fact, these “multiple references” to a sub-file are the most powerful feature in reducing the storage capacity footprint in de-duplication systems.
Note: This demonstration is to illustrate the core functions of file-level de-duplication and capacity optimization techniques. It should not be used to de-duplicate your production data as a capacity savings technique!!!
In the screenshot below, I have created two files, “blogtest.dog” and “kentest.bup”. They are approximately 4.4 MB in size each for a combined total of about 8.9 MB. The first thing I do is find out what the MD5 hash fingerprint is for both files using the “md5sum” command. The fingerprints differ, so the two files are different and I can’t file-level single instance them.
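For readers following along without the screenshot, here is a minimal sketch of that first check. The file names and contents are stand-ins I made up for illustration (they are not the files from the post), and everything happens in a scratch directory.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)   # scratch directory so nothing real is touched
cd "$work"

# Two hypothetical sample files (names and contents invented for this sketch)
{ yes "alpha" | head -c 262144; head -c 1048576 /dev/zero; } > sample-one.bin
{ head -c 1048576 /dev/zero; yes "beta" | head -c 262144; } > sample-two.bin

# md5sum prints "<hash>  <filename>"; keep only the hash
sum_a=$(md5sum sample-one.bin | cut -d' ' -f1)
sum_b=$(md5sum sample-two.bin | cut -d' ' -f1)
echo "sample-one.bin $sum_a"
echo "sample-two.bin $sum_b"

if [ "$sum_a" = "$sum_b" ]; then
    verdict="identical"   # file-level single instancing would apply
else
    verdict="different"   # have to look inside the files instead
fi
echo "verdict: $verdict"
```

If the two hashes had matched, keeping one copy and a pointer would have been enough; because they differ, the only savings left are inside the files.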
Next, I’ll use the “split” command to split the files into fixed 256KB blocks (sub-files), then I “ls -l” the directory to show the resulting sub-files and their sizes. I prefixed the output files with “1st-” and “2nd-” (for “blogtest.dog” and “kentest.bup”, respectively). You should be able to see that filenames “1st-aa” through “1st-aq” correspond to file “blogtest.dog” and that filenames “2nd-aa” through “2nd-aq” correspond to file “kentest.bup”. Both sub-file sequences end with sub-filename “*aq”; however, the sizes of the last sub-files reflect the size differences of the originals. Counting the number of sub-files generated (using the “wc” command) shows that 17 sub-files were created for each original file, or 34 total sub-files.
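The chunking step looks roughly like the sketch below. The input file "demo.bin" is a made-up example of 1 MiB plus 100 bytes, chosen so the last sub-file is visibly shorter than the rest; only the "1st-" prefix is borrowed from the post.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical input: 4 full 256KB blocks plus a 100-byte remainder
head -c 1048676 /dev/zero > demo.bin

# Fixed-block split into 256KB sub-files named 1st-aa, 1st-ab, ...
split -b 256k demo.bin 1st-

ls -l 1st-*

# Count the sub-files (tr strips padding some wc versions add)
count=$(ls 1st-* | wc -l | tr -d ' ')
last_size=$(wc -c < 1st-ae | tr -d ' ')
echo "sub-files: $count, last sub-file: $last_size bytes"
```

Every sub-file except the last comes out at exactly 262144 bytes, which is what makes identical blocks hash to identical fingerprints in the next step.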
Similar to before, I will now calculate the MD5 hash fingerprint of each sub-file using the “md5sum” command on both sub-file sequences. You should be able to see that the hash fingerprint “ec87a838931d4d5d2e94a04644788a55” is present in both sets of sub-files from both original files. This means that each of the 2 original files contains a set of 256KB sub-files that are identical to one another. This also means that both original files can share this sub-file between them, so I can delete all sub-files that calculate to this fingerprint except for the first one. So, I keep the first sub-file “1st-ab” with the hash signature “ec87a838931d4d5d2e94a04644788a55” and execute the “rm” command on the remaining sub-files with the same fingerprint.
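I did this by eye and by hand, but the hash-and-toss step can be sketched with a one-line "awk" filter. This is an illustration under my own assumptions: the inputs "a.bin" and "b.bin" and the helper file "duplicates.txt" are invented for the sketch, and the rule is simply "keep the first sub-file seen for each fingerprint".

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical inputs that share a zero-filled 256KB block
{ yes "alpha" | head -c 262144; head -c 786432 /dev/zero; } > a.bin
{ head -c 524288 /dev/zero; yes "beta" | head -c 262144; } > b.bin
split -b 256k a.bin 1st-
split -b 256k b.bin 2nd-
before=$(ls 1st-* 2nd-* | wc -l | tr -d ' ')

# md5sum prints "<hash>  <name>"; awk prints the name of every
# sub-file whose hash has already been seen on an earlier line
md5sum 1st-* 2nd-* | awk 'seen[$1]++ { print $2 }' > duplicates.txt

# Toss the later copies, keep the first one
while read -r name; do
    rm -- "$name"
done < duplicates.txt

after=$(ls 1st-* 2nd-* | wc -l | tr -d ' ')
echo "sub-files: $before -> $after"
```

Note that the surviving shared block belongs to the first file's sequence, so the second file ends up with no copy of it at all, just as in the walkthrough above.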
As you can see, I have “de-duplicated” the total number of sub-files from 34 down to 10 sub-files, 4 for the original file “blogtest.dog” and 6 for the original file “kentest.bup”, and there are no sub-file instances containing the “ec87a838931d4d5d2e94a04644788a55” hash fingerprint for the original file “kentest.bup”. Basically, each sub-file is now a unique piece of data.
The amount of capacity occupied by these two original files has now been reduced from approximately 8.9 MB to 2.5 MB, assuming we actually deleted the original two files. This is approximately a 3.5:1 reduction ratio.
But wait! There’s more.
Now let’s compress the remaining sub-files to see how much additional capacity savings we can achieve. By using the “gzip” command, I compress the 10 sub-files individually and replace the original sub-file with the compressed sub-file and append the “.gz” label after the filename. There are still 10 sub-files, but now the amount of capacity occupied has been dramatically reduced further. The combined total capacity of the two original files is now approximately 196 KB! So from 8.9 MB to 196 KB, this is about a 45:1 reduction ratio.
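Here is a small sketch of the compression step together with the ratio arithmetic. The two sub-files are made-up samples that compress extremely well by construction, so the ratio it prints says nothing about real data; only the "gzip" behavior (replace the file, append ".gz") matches what I did above.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Two hypothetical 256KB sub-files with very repetitive content
head -c 262144 /dev/zero > 1st-ab
yes "some repetitive text" | head -c 262144 > 2nd-af

before=$(cat 1st-ab 2nd-af | wc -c | tr -d ' ')

# gzip replaces each sub-file with a compressed <name>.gz
gzip 1st-ab 2nd-af

after=$(cat 1st-ab.gz 2nd-af.gz | wc -c | tr -d ' ')

# Reduction ratio = bytes before / bytes after
awk -v b="$before" -v a="$after" \
    'BEGIN { printf "%d bytes -> %d bytes, about %.0f:1\n", b, a, b / a }'
```

The same division gives the figures in this post: 8.9 MB over 2.5 MB for de-duplication alone, and 8.9 MB over 196 KB once compression is added.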
Of course, this is a dramatization and a demonstration. Your actual de-duplication ratios will vary considerably, or as they say, “your mileage may vary”. It really depends on the type of data you have to store.
So, now we have to “rehydrate” the two original files to their fully bloated original state by reversing this process. As you recall from the previous post, the high-level order in which the data de-duplication functions happen is:
- Chunk it
- Hash it
- Toss it or keep it
For this extra level of capacity optimization, there are a couple of additional steps.
- Chunk it
- Hash it
- Toss it or keep it
- Compress it
- Reference it
Technically speaking, the Reference it part is done even without the compression step, so even the three-step version has a Reference it step. However, I’m highlighting this in this blog post because we did two extra steps to achieve the extraordinary capacity optimization results: sub-file Single Instancing and sub-file Compression. The sub-file Single Instancing comes from the one common sub-file between two completely separate original files. This can be illustrated in the diagram below. In fact, this is going to serve as the mapping we will use to rehydrate these compressed sub-files back to the original files.
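The five steps can be strung together in a short loop. To be clear, this sketch is not how I did it above: it assumes a hypothetical layout where each unique sub-file is stored once under its hash in a "store" directory, and a made-up "demo.manifest" file records the references in order. It is one simple way to make the Reference it step explicit.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"
mkdir store

# Hypothetical input: one unique block followed by two identical blocks
{ yes "alpha" | head -c 262144; head -c 524288 /dev/zero; } > demo.bin

split -b 256k demo.bin chunk-                    # Chunk it
: > demo.manifest
for c in chunk-*; do
    h=$(md5sum "$c" | cut -d' ' -f1)             # Hash it
    if [ -e "store/$h.gz" ]; then
        rm "$c"                                  # Toss it (already stored)
    else
        gzip -c "$c" > "store/$h.gz"             # Keep it and compress it
        rm "$c"
    fi
    echo "$h" >> demo.manifest                   # Reference it
done

stored=$(ls store | wc -l | tr -d ' ')
refs=$(wc -l < demo.manifest | tr -d ' ')
echo "unique sub-files stored: $stored, references recorded: $refs"
```

The manifest plays the role of the mapping: one line per original block, several of which may point at the same stored sub-file.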
Again, instead of deleting the original files, I’ve renamed them so that I can do a binary comparison of everything in the end. Then, using the “gunzip” command, I uncompress the sub-files back to their original 256KB chunk size, except of course for the last sub-files, which hold the remainder left over from the chunking process. Now we need to assemble the files back together. I use the “cat” command to concatenate the sub-files together in the proper order. I use the sub-file “1st-ab” as a replacement for sub-files “1st-ac”, “1st-ad”, “1st-ae”, “1st-af”, “1st-ag”, “1st-ah”, “1st-ai”, “1st-aj”, “1st-al”, “1st-am”, “1st-an”, “1st-ao”, “1st-ap”, “2nd-ac”, “2nd-ad”, “2nd-ae”, “2nd-ag”, “2nd-ah”, “2nd-ai”, “2nd-al”, “2nd-am”, “2nd-an”, “2nd-ao” and “2nd-ap”, which were all deleted earlier. This is to re-create the original files “blogtest.dog” and “kentest.bup”.
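A scaled-down sketch of that rehydration is below. The four-block "original.bin" is a made-up stand-in; the point is only the pattern of naming the surviving sub-file twice on the "cat" command line where its deleted twin used to be.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical original: unique block, two identical blocks, short tail
{ yes "alpha" | head -c 262144; head -c 524288 /dev/zero; printf 'tail'; } > original.bin
split -b 256k original.bin 1st-

# De-duplicate and compress: 1st-ac is identical to 1st-ab, so toss it
rm 1st-ac
gzip 1st-aa 1st-ab 1st-ad

# Rename rather than delete, so there is something to compare against
mv original.bin original.keep

# Rehydrate: uncompress, then concatenate in the original order,
# using 1st-ab a second time in place of the deleted 1st-ac
gunzip 1st-*.gz
cat 1st-aa 1st-ab 1st-ab 1st-ad > original.bin

ls -l original.bin original.keep
```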
Initially, you can see that the files rehydrate back to their original sizes. To ensure that everything went back together properly, I run the “md5sum” command against the newly rehydrated files and compare them to the original renamed files, then I perform a full binary comparison with the “cmp” command to make sure everything is 100% perfect.
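The verification can be sketched the same way. The file names here are invented, and I have added a deliberately scrambled copy (my own addition, not part of the demonstration above) to show what a wrong assembly order looks like to “cmp”.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical original with non-repeating content (a run of numbers)
seq 1 60000 > renamed-original.bin
split -b 256k renamed-original.bin part-

# Correct order
cat part-aa part-ab > rehydrated.bin
md5sum renamed-original.bin rehydrated.bin
if cmp renamed-original.bin rehydrated.bin; then
    status="match"
else
    status="mismatch"
fi

# Wrong order: same pieces, same total size, different file
cat part-ab part-aa > scrambled.bin
if cmp -s renamed-original.bin scrambled.bin; then
    scrambled="match"
else
    scrambled="mismatch"
fi

echo "rehydrated: $status, scrambled: $scrambled"
```

A size check alone would pass the scrambled copy, which is why the hash and the binary comparison are both worth running.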
Trust me when I say that if any of these pieces don’t go back together in the right order, then the hashes will not match up correctly. Then you know you have a problem. As I have shown, this could get to be a laborious task by hand. Scripting could be an option to automate several aspects of this process. However, the best way is to let an appliance with reliable code and a hardened database do this for you; it makes all of these steps invisible. I’ve gone through these steps for you to show a little bit of what’s under the covers of this technology, or maybe what’s not under the covers. The combination of several of these techniques also has the potential of saving large amounts of capacity beyond any one method alone.