20 year old architectures
October 13th, 2006
StorageMojo’s comment on "The Capacity Illusion" points out that RAID was developed 20 years ago when "capacity was expensive and I/Os were relatively cheap…. Now the world is different and capacity is cheap and I/Os are expensive". He submits that "if Patterson et al. were designing a fast, very big, very reliable drive today, it would look very different." This generated many thoughts that I decided to capture in a post rather than in a comment to my previous post.
First, just to set the record straight, I’d like to point out that Ken Ouchi, a fellow Nisei whom I knew at IBM, was issued the first US patent for RAID, 4,092,732, in 1978, so the concept of RAID, in particular RAID 5, has been around for nearly 30 years. RAID was formally defined by researchers at UC Berkeley, Patterson, Gibson, and Katz, in the paper titled "A Case for Redundant Arrays of Inexpensive Disks (RAID)" in 198x. We began to see RAID storage controllers being shipped by IBM and other storage vendors in the early 1990s. By the time RAID became commercialized, the acronym had been changed to "Redundant Array of Independent Disks", since the disks that were used with RAID systems were not that inexpensive. It turned out that if you really used inexpensive disks, you spent a lot of time doing parity rebuilds, which hurt performance and made the systems more expensive than if you used reliable IBM disks, which were the standard in those days.
Secondly, I totally agree with StorageMojo that 20 year old architectures like RAID need to be updated. Today we have 500GB SATA disks with a Terabyte disk on the horizon. Larger capacity disks not only take longer to rebuild, but many more applications are affected, since more LUNs are striped across those larger disks.
Last week I had lunch with a customer in Singapore whose biggest complaint about storage was disk failures, particularly with today’s large capacity SATA disks. While he hasn’t lost any data yet, the rebuilds and sparing were killing him. During a rebuild on his modular storage system, his system slowed to a crawl for a long time as it processed the parity rebuild for a very large capacity SATA disk drive. His performance was hit again when the disk was replaced and the system went immediately into a copy back from the spare. This was occurring more frequently with the SATA disks. This wasn’t Hitachi storage.
While we have not found a replacement for RAID, we have taken steps to mitigate the effects of RAID rebuild and sparing. First, we start with the most reliable disks we can find to reduce the incidence of failures. There still is a difference between RAID as in Inexpensive and RAID as in Independent. In Hitachi TagmaStore storage systems, we provide an option for RAID 6, with two parity drives. This allows you to defer the rebuild to a later time, and protects you in the event of a second failure. When we do a rebuild to a spare, we also leave the data on the spare until you decide to copy it back to the original array group. This at least allows you to schedule the rebuild and spare recovery for off-peak hours.
Our systems also use proactive soft fail sparing with SMART technology to do a fast copy to a spare before we have to do a costly parity reconstruction. On SATA disks we add enhanced data protection features like idle seek, periodic full sweep, and head unload to extend their life.
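To see why a proactive copy is cheaper than a parity reconstruction, here is a minimal Python sketch of single-parity (RAID 5 style) XOR reconstruction. It is an illustration only, not Hitachi's controller logic; the 3+1 stripe, the block contents, and the function names are assumptions made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 3+1 stripe: three data blocks plus one parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# The disk holding data[1] fails hard. To rebuild its block we must read
# every surviving member of the stripe (two data blocks and the parity).
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

# Reads needed per rebuilt block in each case.
reads_for_parity_rebuild = len(survivors)  # one read from each surviving disk
reads_for_proactive_copy = 1               # copy straight off the ailing disk

print(rebuilt == data[1], reads_for_parity_rebuild, reads_for_proactive_copy)
```

With a wider stripe the gap grows: a parity rebuild touches every surviving disk in the array group, while a copy from a drive that is still readable touches only that one drive and the spare.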
Our controller based virtualization helps to isolate the impact of the rebuild and spare recovery by offloading that to externally attached storage systems. Lower capacity, high performance, high reliability FC disks could be used internal to the USP as tier 1 capacity, and the larger capacity FC or SATA disks could be attached as lower tier storage on external storage systems, where the longer RAID rebuild and sparing workload would be isolated to that external storage system. Intermixing slower, less reliable SATA or FATA disks in tier 1 storage systems will impact that system’s performance and availability with more frequent drive failures and longer rebuild times, which consume internal bandwidth, cache, and controller cycles. For these reasons we do not offer FATA disks in our USP/NSC, preferring to attach them through external disk systems.
Over the years there have been several companies that have tried to solve this problem. STK developed "Iceberg" with log structured files about 15 years ago, which had some degree of success before it faded away. Many of you may remember Zambeel of a few years ago with their "organic" storage concept. They created a storage array with drawers full of ATA disks, in which data would be mirrored in multiple drawers. If a drive failed, you would use a copy and create another mirror of it on a spare drive in another drawer. You left the failed drive in place until you had a drawer full of failed drives, then you would "prune" the storage system of that disk drawer rather than replace each individual drive at the time of failure. They compared it to pruning the dead branches off of a tree.
I agree with StorageMojo that it is time to replace 20 year old RAID architectures with something that does not impact I/O as much as it does today with our larger capacity disks. This is a challenge for our developers and researchers in Hitachi.
While we are talking about 20 year old architectures, we need to look beyond RAID. Customers should look at their storage vendor’s controller architecture. While vendors have kept up with faster disks, processors, buses, and cache memory, most have not changed their basic storage architecture in 20 years or more. Twenty years ago there was no concept of switched fabrics, where many more hosts can connect to common storage resources. Hitachi has been constantly evolving its storage architecture over the years, with the ability to dynamically reconfigure cache, internal cross bar switches to switch around hot spots and failures, virtualizing physical storage ports into 1024 virtual ports with separate address space for connectivity and safe multi tenancy, logical partitioning for QoS, and virtualization for common management of a heterogeneous pool of storage.
A lot has changed in 20 years. Has your storage system kept up to date?


Hi HU,
Enjoy your blogs, and yes, disks have changed for the better over the years. I do not worry about head crashes. That said, we became very good at backups and restores, and I just wonder now: if we needed to, could we get all our data back from tape?
Good question. It’s hard to know what you have when it has been sitting idle for some time. With spinning disks we keep monitoring them to see if they are operating properly. Plus we have RAID for reconstruction.
Hi Hu,
I’m curious what your thoughts are on the benefits a back-end switching architecture might bring to speeding up RAID rebuild times - is the shared bus/loop architecture seen on the back end of many subsystems a factor in slowing down parity rebuilds?
I don’t understand the fail-spare-replace-copyback approach of storage system disk management. With global spares, the copyback should not be required. Why do we still do things this way? Why doesn’t the old spare become the new drive, and the old drive, once replaced, the new spare? Perhaps it is because of a physical disk layout (i.e., “all spare disks will occupy slot 14”). Or it could be optimization of the paths for performance and availability.
So it seems the ideal would be to create a true back-end fabric (as opposed to loop) architecture so there is no performance or availability reason to copyback. Then just specify a number or percentage of disks in the spare pool. If a disk fails, replace that physical disk. Anything else is needless complexity which should have been automated out of storage arrays a decade ago.
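The no-copyback policy described here can be sketched in a few lines of Python. This is only a toy model of the bookkeeping, under the assumption that any spare can stand in for any drive with no path penalty; the class name, method names, and disk labels are all hypothetical.

```python
class SparePoolArray:
    """Toy model: a spare that takes over stays in service permanently,
    and the replacement drive simply becomes a new spare."""

    def __init__(self, active, spares):
        self.active = set(active)   # drives currently holding data
        self.spares = list(spares)  # global spare pool
        self.failed = set()         # drives awaiting physical replacement

    def fail(self, disk):
        """A drive fails: rebuild onto any global spare, which becomes active."""
        self.active.remove(disk)
        self.failed.add(disk)
        replacement = self.spares.pop(0)
        self.active.add(replacement)
        return replacement

    def replace(self, failed_disk, new_disk):
        """The failed drive is swapped out; the new drive joins the spare pool.
        No copyback to the original slot takes place."""
        self.failed.remove(failed_disk)
        self.spares.append(new_disk)

array = SparePoolArray(active=["d0", "d1", "d2"], spares=["s0"])
took_over = array.fail("d1")
array.replace("d1", "d1-new")
print(took_over, sorted(array.active), array.spares)
```

The point of the sketch is that only one data movement (the rebuild onto the spare) ever happens per failure; whether a real array can do this depends on the back-end path and layout constraints raised above.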
Next, regarding the capacity vs. I/O cost tradeoff, if capacity is cheap, but I/Os are expensive, a return to RAID 1 (or 1+0) semantics may be in order. At least narrower stripes may be in order, especially for SATA. Some customers are doubling their stripe width with RAID6 compared to RAID5, so as to keep their capacity costs constant.
Maybe it is time to consider a 4-disk wide RAID 6 stripe: The same capacity as mirroring, but you still have a RAID 5 level of protection with a single disk failure. However, many fewer I/Os are required to rebuild a disk. Imagine a system which immediately block copied the entire affected LUN to a new LUN, rather than a spare-in of disk. This would allow full-stripe writes to the new volume instead of Read Modify Writes to the spare. Spare volumes rather than spare disks are certainly possible if capacity is cheap.
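The capacity and rebuild arithmetic behind these stripe-width comparisons can be checked with a short Python sketch. The layouts and the function name are chosen for illustration, and the read count rests on a simplifying assumption (stated in the code) about how a single failed member is rebuilt.

```python
def stripe_stats(data_disks, redundant_disks):
    """Basic figures for a stripe of data_disks + redundant_disks members."""
    total = data_disks + redundant_disks
    return {
        "usable_fraction": data_disks / total,
        "failures_tolerated": redundant_disks,
        # Assumption: one failed member is rebuilt by reading one block from
        # each of `data_disks` surviving members of the same stripe.
        "reads_per_rebuilt_stripe": data_disks,
    }

layouts = {
    "RAID 1 (1+1)": stripe_stats(1, 1),
    "RAID 6 (2+2)": stripe_stats(2, 2),
    "RAID 5 (7+1)": stripe_stats(7, 1),
    "RAID 6 (14+2)": stripe_stats(14, 2),
}

for name, stats in layouts.items():
    print(f"{name:14s} usable={stats['usable_fraction']:.3f} "
          f"tolerates={stats['failures_tolerated']} "
          f"reads/stripe={stats['reads_per_rebuilt_stripe']}")
```

Under these assumptions, a 2+2 RAID 6 stripe has the same usable fraction as a mirror while tolerating two failures, and a 14+2 stripe matches the capacity cost of 7+1 RAID 5 but reads from seven times as many disks per rebuilt stripe as the 2+2 layout.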
Or maybe the answer is in RAID 1 with an automated ability to access a DR or HSM copy if a second drive fails. The performance of a long-wave accessed copy, or a SATA copy after a FC drive failure, may be similar to accessing a degraded RAID5 or RAID6 volume. This would quite literally be “thinking outside of the box”, with respect to the storage array.
It’s time we start using storage media which doesn’t depend on a spinning disc.
The use of semiconductors for storage would result in reduction of power and heat.
Hi Hu,
Our HDS team *casually* mentioned that 750GB ATA drive support for the USP-V will be announced Monday October 10th and will be available on November 14th. 1TB ATA soon to follow. I almost fell out of my chair!
With this in mind, how would you qualify your above statement that “Intermixing slower, less reliable SATA or FATA disks in tier 1 storage systems will impact that system’s performance and availability with more frequent drive failures and longer rebuild times, which consume internal bandwidth, cache, and controller cycles. For these reasons we do not offer FATA disks in our USP/NSC, preferring to attach them through external disk systems.”?
“I’m curious what your thoughts are on the benefits a back-end switching architecture might bring to speeding up RAID rebuild times - is the shared bus/loop architecture seen on the back end of many subsystems a factor in slowing down parity rebuilds?”
Back-end switching won’t have any effect on rebuild times, as you will still be reconstructing to a single drive. That means that reconstruction cannot be faster than the write throughput of a single disk.
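That lower bound is easy to put numbers on. The Python sketch below divides drive capacity by sustained write throughput; the 50 MB/s figure is an assumed value for illustration, not a measured number for any particular drive, and capacities use decimal units (1 GB = 1000 MB).

```python
def min_rebuild_hours(capacity_gb, write_mb_per_s):
    """Best-case rebuild time: every block of the target drive must be
    written once, so the time is at least capacity / write throughput."""
    seconds = capacity_gb * 1000 / write_mb_per_s
    return seconds / 3600

# Assumed sustained write rate of the rebuild target (hypothetical value).
ASSUMED_WRITE_MB_PER_S = 50

hours_500gb = min_rebuild_hours(500, ASSUMED_WRITE_MB_PER_S)
hours_1tb = min_rebuild_hours(1000, ASSUMED_WRITE_MB_PER_S)

print(f"500 GB: at least {hours_500gb:.1f} h, 1 TB: at least {hours_1tb:.1f} h")
```

This is a floor, not an estimate: a real rebuild competes with host I/O and must also read the surviving members, so it will take longer. The floor still doubles each time drive capacity doubles at the same write rate.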