A Note on IBM’s XIV
by Claus Mikkelsen on Jan 12, 2009
One thing I don’t like getting involved in is competitive “bashing” (how many of you are laughing now?), but I also have a “rep” for some strong opinions on occasion, especially when it comes to a subject I’m quite familiar with: storage architectures. So correct me if I’m wrong, but I recently read an interesting post by Joerg Hallbauer on IBM’s XIV storage device that’s hard to ignore.
For those of you unaware (or living under rocks for the last year), XIV is the Israeli storage company started by Moshe Yanai of EMC Symmetrix fame; IBM purchased them a year ago. It’s an interesting box that tries to employ an architecture I’m actually quite fond of: data dispersion. I say “tries” because it just ain’t there. Nice try, though…
We all know that RAID-5, as an architecture, is being stretched. I mean, we’re already in the world of 1-2 TB drives with 20-50 TB drives on the horizon. Anyone want to do the math on how many months it will take to rebuild a 50TB drive? It’s not pretty, and RAID-6 is only a partial solution.
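For anyone who wants to do that math, here is a back-of-the-envelope sketch. The 50 MB/s sustained rebuild rate is an illustrative assumption, not a measured figure:

```python
# Rough single-drive rebuild time for a traditional RAID rebuild,
# assuming a sustained rate of 50 MB/s (hypothetical figure).
def rebuild_hours(capacity_tb, rate_mb_s=50):
    """Hours to write back an entire drive's worth of data at the given rate."""
    total_mb = capacity_tb * 1_000_000  # decimal TB -> MB, as drive vendors count
    return total_mb / rate_mb_s / 3600

for tb in (1, 2, 50):
    print(f"{tb:>3} TB drive: ~{rebuild_hours(tb):.1f} hours")
# A 50 TB drive works out to roughly 278 hours, i.e. over 11 days.
```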
To that end, data dispersion has some interesting qualities: instead of RAIDing (is that a word? It is now!), data is dispersed somewhat randomly across potentially every drive in the array. So that Oracle database you have could actually reside on all 180 disks within the XIV box. The good thing is that every block is, in fact, duplicated, so if a drive fails, its contents can be recreated quickly from the duplicate blocks on all of the remaining drives. Sound like mirroring? Think again. Here’s the rub…
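The idea can be shown with a toy model (an illustrative sketch, not IBM’s actual placement algorithm): each logical block gets two replicas on two distinct, pseudo-randomly chosen drives, so any one volume ends up spread across essentially the whole array.

```python
import random

# Toy sketch of mirrored data dispersion: each block's two replicas are
# placed on two distinct drives chosen pseudo-randomly across the array.
def place_blocks(num_blocks, num_drives=180, seed=42):
    rng = random.Random(seed)
    placement = {}
    for block in range(num_blocks):
        primary, secondary = rng.sample(range(num_drives), 2)  # distinct drives
        placement[block] = (primary, secondary)
    return placement

placement = place_blocks(10_000)
# A single volume's blocks end up touching essentially every drive:
drives_used = {d for pair in placement.values() for d in pair}
print(len(drives_used))  # with 10,000 blocks, this is all 180 drives
```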
In mirroring, as Joerg points out, if you lose both drives of a mirrored pair before reconstruction completes, you’ve lost the contents of that pair. In RAID-5 you lose the contents of the RAID group if you lose 2 drives, and in RAID-6 if you lose 3. Basic elementary RAID architecture.
But Joerg claims that you will lose the entire contents of the XIV array if any 2 drives in the array fail before reconstruction completes, and he is most certainly right. And with 180 drives in the array, the potential is higher than most people think.
Since data is duplicated and “dispersed”, a loss of 2 drives means you’ve lost both copies of some blocks, leaving nothing to reconstruct from. I liken it to punching a large hole through a book: you’ve probably lost little of the content (words), but the book becomes worthless. Now imagine punching a large hole through an Oracle database, losing every 10th row, for example. That gets pretty ugly pretty fast.
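A rough way to see why any two failures hit shared data: under random mirrored placement, the expected number of blocks whose two copies sit on one particular pair of drives is simply the total block count divided by the number of drive pairs. The figure of 80 million 1 MB chunks below is an illustrative assumption, not an XIV spec:

```python
from math import comb

# Expected number of blocks whose two replicas land exactly on a given
# pair of drives, under uniform random placement (simplified model).
def expected_shared_blocks(total_blocks, num_drives=180):
    pairs = comb(num_drives, 2)  # 16,110 distinct drive pairs
    return total_blocks / pairs

# Assume ~80 TB of mirrored data in 1 MB chunks, i.e. ~80 million blocks:
blocks = 80_000_000
print(f"{expected_shared_blocks(blocks):,.0f}")  # ~5,000 blocks per drive pair
```

In other words, with enough blocks in the box, essentially every pair of drives mirrors some data for each other, which is why a double failure punches that hole.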
My personal opinion is that data dispersion is very promising as a storage architecture, but we need to look at alternative ways of protecting data. Unfortunately, in the XIV case (or in any data dispersion implementation), the stripe factor (number of times every block is replicated) MUST be greater than 2. Three is better, 4 is better yet. But 2 is just bad. I would argue it’s even worse than JBOD because at least with JBOD, you only lose the contents of the failed drives. With XIV, the entire array content is exposed.
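The effect of the stripe factor can be made concrete with a simple uniform-placement model (a sketch, not a claim about any vendor’s implementation): with each block replicated r times on r distinct drives out of n, the expected fraction of blocks lost when f drives fail at once is C(f, r) / C(n, r).

```python
from math import comb

# Expected fraction of blocks lost when f drives fail simultaneously,
# with each block replicated r times on r distinct drives (uniform model).
def lost_fraction(n_drives, f_failed, r_replicas):
    if f_failed < r_replicas:
        return 0.0  # not enough failures to cover every replica of any block
    return comb(f_failed, r_replicas) / comb(n_drives, r_replicas)

for r in (2, 3, 4):
    print(f"replicas={r}: fraction lost after 2 failures = "
          f"{lost_fraction(180, 2, r):.2e}")
# With r=2, any double failure loses some data; with r>=3 it loses none.
```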
As a result of XIV’s architectural limitation, Joerg relegates XIV storage to Tier 3 at best, and I would have to agree with him.
Comments (5)
[...] That was then, this is now. Today, nobody seems shy about attacking competitors. Take this EMC blog post under the headline “Frankenstorage Defined” that starts off by talking about Hitachi, IBM and NetApp. Here is an HDS blog post on IBM’s XIV storage system and its “architectural limitations.” Here is a NetApp blog taking a shot at an EMC executive in one of the posts. Look at just about any blog written by an executive at a storage vendor and in many of them you’ll find them saying bad things about their rivals. [...]
I just wanted to clarify some points in this article if I may. The article is fine, but some points aren’t correct as they relate to the design of the XIV/2810. While yes, it is true that two drive failures at the same time in Modules 1–3 or 10–15 would cause the array to go down, two DDMs failing at the same time in Modules 4–9 would not cause any outage! Consider that on most RAID systems you have complex controllers doing all the work. With 15 years of supporting storage products, it makes perfect sense that in those systems almost 99% of the multiple-DDM failures are caused by these RAID controllers. In the XIV this is not the case; we can actually lose an entire module and the system will tolerate it. In the world of massive storage farms, XIV provides one of the first mid-range storage solutions to actually offer an alternative to the 136-hour “RESTORE”.
Keith Maddox, Remote IBM Storage Support
This article on the XIV, while true in some respects, misses a few important points. First of all, if you lose two drives, you may lose the entire array contents, but you may not. It depends on which drives you lose. For example, in any given tray there is no duplicate data, so you can lose an entire tray and you are fine. There are other rules as to which drives you must lose to lose the whole array, but certainly it is not any two drives. The other fact not mentioned is that the 1 TB drives currently employed in the XIV have a rebuild time of at most 34 minutes if the drive is pretty much full. This means that within 34 minutes you no longer have the risk of a second drive failure. How long does it take your array to rebuild a 500 GB FC drive? Much longer, I am guessing. What about a 1 TB SATA drive? Is 24 hours enough time? With predictive failure, even with the data in only two locations, the chances of total data loss drop considerably. With replication to another array you can further protect your data.
Just some additional food for thought.
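The commenter’s 34-minute figure becomes plausible once you notice that a dispersed rebuild is a many-to-many copy: every surviving drive contributes a share of the reads and writes, instead of one spare absorbing the entire write load. A rough sketch with illustrative numbers (the 10 MB/s per-drive rebuild budget is an assumption, not an XIV spec):

```python
# Rough dispersed-rebuild time: the lost replicas are re-created in
# parallel across all surviving drives, so the aggregate rate scales
# with the drive count. All rates here are illustrative assumptions.
def dispersed_rebuild_minutes(drive_tb, surviving_drives, per_drive_mb_s):
    data_mb = drive_tb * 1_000_000           # data to re-replicate
    aggregate_rate = surviving_drives * per_drive_mb_s
    return data_mb / aggregate_rate / 60

# 1 TB of lost replicas, 179 surviving drives, ~10 MB/s rebuild I/O each:
print(f"~{dispersed_rebuild_minutes(1, 179, 10):.0f} minutes")
```

Even with these made-up rates the result lands in the tens of minutes rather than the many hours a single-spare rebuild of a 1 TB SATA drive takes, which is the comparison the commenter is drawing.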
In fact, I think you missed the point: the strength of XIV is data dispersion rather than disk redundancy alone. You have block redundancy in addition to node and disk redundancy.
The key piece of information missing here relates to aggressive pre-failing of drives. Almost all drives will show some symptoms before failing, and XIV aggressively weeds them out as soon as it detects heat or speed inconsistencies. If it’s fixing one pre-failed drive-load of data and it finds another, it will finish the first one before fixing the second. If the second one is a sudden hard fail (can happen, but much less likely), XIV can press the original half-fixed pre-failed drive back into service for 20 minutes while it fixes the dead one, then go back and fix the payload of the pre-failed drive. So XIV has a lot of detailed drive management that you don’t see in other systems, which allows it to manage its way through double disk failure. When you look at a new system you have to look at the architecture in its entirety; you can’t take one aspect of it and then apply thinking that comes from knowledge of other, older architectures.
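The prioritization described above can be sketched as a simple policy function. This is purely illustrative (not IBM’s actual logic), and the drive names are made up:

```python
# Sketch of the drive-management policy described in the comment:
# pre-failed drives are drained one at a time, but a sudden hard
# failure preempts, temporarily reusing the half-drained pre-failed
# drive as an extra read source.
def next_action(rebuilding, event):
    """rebuilding: drive currently being drained, or None.
    event: ('prefail', drive) or ('hardfail', drive)."""
    kind, drive = event
    if rebuilding is None:
        return f"start draining {drive}"
    if kind == "prefail":
        return f"queue {drive}; finish draining {rebuilding} first"
    # Hard failure: rebuild it now, pressing the half-drained drive
    # back into service as a data source, then resume the drain.
    return (f"rebuild {drive} now, reading surviving copies "
            f"(including partially drained {rebuilding}); resume {rebuilding} after")

print(next_action("disk-17", ("prefail", "disk-42")))
```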