Lies and Virtualization – Capacity Optimization is an Altered Reality
by Ken Wood on August 26, 2011
OK, so I lied about my last blog being the final of a three part series (1, 2 & 3). But aren’t we used to being lied to these days? I classify virtualization into three categories of prose: one-to-many, many-to-one, and this-to-that. (Turns out, I thought I blogged about this some time ago, but it was a paper I started and never finished. I’ll extract this and post it as a future blog soon.) 
Virtualization, as we use it in our IT lives, is a lie. Virtualization technologies, whether it is in storage (which includes RAID, capacity optimization, Dynamic Provisioning, Dynamic Tiering, etc.), servers, CPU, file systems, or desktops hides the real truth and sometimes administrative tasks from the rest of the world. Virtualization is an abstraction of a complex physical reality, or the aggregation of many simple elements, to a more understandable, usable, acceptable and simplistic view, on behalf of an application or user.
Virtualization lies to us for our own good. When used with capacity optimization technologies, as I’ve described in several of my previous posts, you will be presented with a false view, but you will like what you see. The amount of available space for storing your data will appear vast and voluminous (and in some cases “bottomless”) but under the covers lies a complex system of algorithms and coding used to squeeze, chop-up and delete your data into a smaller and smaller footprint. These layers, in fact, can continue down to another layer of abstraction, such as a dynamically provisioned volume of storage. This, in turn, records data on an array of disk drives protected by another layer of abstraction called RAID-6. Even the disk drives themselves are accessed via a logical block address, which hides and simplifies the realities of addressing data as cylinders, tracks and sectors, or steering you from the bad block replacement area.
The abstraction layer above capacity optimization technologies hides the truth from the user and application in order to focus on doing what they do best, and not manage capacity, or so it seems. More times than not, this just moves the administrative action line further out in time. Storage is not “bottomless,” but it can seem very deep. This layer presents, for example, a 5TB file system to the user or application, which can write data to this space. Depending on the architecture and the type of data being written, the file system’s utilization percentage will act strange to people who are used to chasing the 90% threshold for most modern file systems. It may seem like no data is being written, or like data was written, then disappeared, but when checking the file system, data (or at least the file names) exists.
What’s happening is storage is being optimized against the data that already exists. The more data that is stored, the more data there is to compare to, the more chances of duplicate data detected, the more duplicate data is removed. The task at hand is the management and metadata tracking of keeping references. The mapping of the chunks that are used and shared across many files can tax a management system by creating a large database with heavy comparison tables.
This mapping of chunks—you can replace chunks for blocks, pages, sectors, sub-files, etc.—IS the lie. That is, virtualization is hiding this from you.
Or, as David Black of EMC used to be fond of saying during our time together in SNIA Technical Council meetings, “…most problems can be solved in computer science with yet another layer of indirection…”



