Putting HDD Product Trends Into Perspective: A Subsystem View
by Claus Mikkelsen on Sep 30, 2011
On a few occasions I’ve blogged about the drive industry, drive performance, and the effects they have on the storage array business in general, but below is a guest blog from my friend and colleague Ian Vogelesang on disk drive trends. He originally posted this on our internal HDS website, but it’s too good to keep wrapped up, so I’m sharing it here for general consumption. Make sure you have some time on your hands, as the post is quite lengthy (but worth it!)
Ian is one of the smartest guys in this area (well, smart in general), and this offers amazing insight into what many of us think of as “just a disk drive”. It’s long, it won’t be for everyone, but it’s definitely worth reading by anyone in the storage “biz”.
Ian’s quick bio: Assistant GM of HDD development at Hitachi in Odawara, Japan, then VP Operations of Hitachi Data Systems in Santa Clara, then VP Marketing and then VP Strategy & Product Planning at Hitachi GST (during assimilation of IBM Storage Technology Division), before returning to Hitachi Data Systems.
So let’s take it away…!!!
By: Ian Vogelesang
This detailed blog posting targets a highly technical audience, exploring how HDD product trends will impact subsystem performance and economics over the next year or two and beyond.
It’s hard to anticipate what the reader will know and what the reader will not know, so I’ll leave the Reader’s Digest version for others.
Trends by category
- 7200 RPM LFF
- Today’s capacity point is 2 TB. This is available in both SATA and SAS versions on the AMS2000 family, and in a SATA version on the USP V and VSP.
- This platform is actively under development: 3 TB SATA models are already available in retail stores from multiple vendors, and SAS 3 TB models are expected soon.
- 7200 RPM SFF
- Both the VSP and AMS2000 product lines support 7200 LFF drives which have twice the capacity, so for now we are offering only the LFF 7200 RPM models.
- 10K LFF
- Long dead.
- 10K SFF SAS
- Today’s capacity points are 300 GB and 600 GB.
- This form factor is currently the mainstream platform in enterprise HDDs, so we expect higher capacity 10K SFF models over time.
- 15K LFF
- The 15K 600 LFF drive was the end of the line, and no higher capacity 15K LFF drives are expected.
- 15K SFF SAS
- Today’s capacity point is 146 GB.
- Seagate has a 300 GB drive but we are not carrying this product because it is twice as expensive as a 10K RPM 300 GB SFF drive, but has much less than twice the performance.
- Hitachi GST says they plan no further 15K SFF drives at this time.
3 TB 7200 RPM LFF
Let’s start with the 3 TB SATA drive which is expected any time now, as it has been available in retail for a while. It’s safe to assume 3 TB models must currently be in qualification for subsystem applications.
This 3 TB 7200 RPM platform will also be available in a SAS-interface version at a higher price. Even larger models are expected over time. Generally speaking the SAS version of new models based on the 7200 RPM LFF platform will be available a few months after the corresponding SATA version ships.
Seagate had implemented a SAS-interface version of their 2 TB 7200 RPM LFF platform, and this is the SAS 7.2K 2 TB drive that is currently available on the AMS2000 series.
Both Seagate and Hitachi GST have announced SAS versions of their 3 TB 7200 RPM LFF drives, and thus going forward we will have multiple suppliers for SAS 7200 RPM LFF models.
Tests of the SAS-interface version of the existing 2 TB 7200 RPM Seagate HDD on the AMS2000 series showed that, when configured in the AMS2000, the SAS version of the drive offers in most cases over 2x the throughput of the SATA version of the same drive in the same enclosure.
In other words, in a subsystem application spending extra money to use the electronics from a SAS drive instead of the SATA electronics on the same basic drive with the same platters, heads, spindle motor, and actuator gives you twice the performance.
Some people say that performance isn’t the point with SATA drives, which are all about capacity. To those people I ask: do you also think that poor people don’t care about money?
Prediction – SAS will largely displace SATA for 7200 RPM subsystem applications, even as SATA will remain the interface in PCs
The first part of this blog posting is going to explain why I expect the SAS 7.2K drive to largely displace its SATA-interface twin within 2 years.
The gigantic capacity of a 7200 RPM LFF drive is achieved as a tradeoff against other factors. In order to have the highest recording capacity, we need to use the largest diameter platters.
The platters in a 7200 RPM LFF drive are actually a bit bigger in diameter than 3.5 inches. When the dimensions of the 3.5-inch external form factor were originally fixed, drives actually had 3.5-inch diameter platters that fit between the sides of the base casting that naturally had to have clearance inside the casting walls for those 3.5-inch diameter platters. Nowadays if you take a drive apart you will see that the base casting in the area where the edge of the platter approaches the walls is slightly machined out, enabling the diameter of the platter to be slightly bigger than 3.5-inches. (I forget right now what the actual diameter is, but it’s a value in millimeters, not inches.)
OK, so given that you are going to use the largest diameter platters there are, it turns out that you can only rotate such big platters at 7200 RPM. Because of the wide diameter of the platters, if you tried to spin them faster you would astronomically increase the power consumed creating air turbulence, and you wouldn’t be able to cool the drive effectively.
So the first strike against SATA drives is that the HDD rotates relatively slowly.
The second strike against SATA drives is the slow seek speed.
Part of the slow seek speed comes from the simple fact that, mechanically, seek speed is inversely proportional to the length of the access arm, and a drive with 3.5-inch platters inside will have the longest arm and thus a proportionately slower seek for how hard you push.
The other part of the slower seek speed comes from the financial and space budget for the actuator, and more specifically, for the rare earth permanent magnet that the actuator pushes against. If you are making the cheapest drive, you can’t afford a bigger rare earth magnet to let the actuator push harder, and with those huge 3.5-inch platters inside there’s no room for a bigger actuator motor anyway.
So strike one was the slow RPM, strike two was the slow seek speed, and now strike three is that the ATA specification that has progressed into SATA was purely conceived for the purpose of direct attach to a host (a PC). The original authors of SATA never considered subsystem applications as evidenced by the fact that they fixed the sector size in SATA at 512 bytes.
Using SATA drives in subsystem applications
If you want to provide the usual assurance that you can detect any subsystem virtualization mechanism failures and any data corruption “end to end” within the subsystem, you are going to have to further compromise the performance of a SATA drive.
The problem with a fixed 512-byte sector size at the HDD level happens with the mechanisms that you need to employ in the architecture of a subsystem.
The same kinds of challenges face subsystem designers as face the architects of disk drives themselves – how can you really be sure every time that you are presenting the right data at the right time?
In a disk drive, you are emulating a logical disk drive with LBA addresses that are statically mapped (except during defect assignment) from the LBA to a physical location on disk.
So you have a virtualization mechanism that translates a logical address into a physical address on hardware. How can you know if your virtualization mechanism is working as designed, that is, that every time you read from a logical address, you get the data that was most recently written to that logical address? In other words, how can you detect if you accidentally wrote it in the wrong place or accidently tried to retrieve it from the wrong place?
How do disk drives do it?
In disk drives, to provide these assurances, we used to store an ID field as a separate physical field alongside each host sector, where the ID written with the data is the (logical) ID used by the host to specify where to put the data.
Suppose the drive originally wrote some data in the wrong place. If at any time later the host tried to read the LBA that really did belong in that physical location, the drive would retrieve the erroneously written data, but the contents of the ID field on disk wouldn’t match the requested ID field, and thus this ID field mechanism detects and fails I/O operations that would otherwise give the wrong data to the host.
Similarly, if you were to read from the wrong location, the ID field at that wrong location wouldn’t match the ID you were requesting and the I/O would fail.
So storing a logical address field with the data (or to make the field smaller, storing a cryptographic hash of the logical address with the data as is done in subsystems) can assure that virtualization mechanisms are working as designed.
Then IBM invented no-ID formatting in the late 1990s. The idea here has to do with how you compute the ECC bytes that are used to detect and even repair minor corruptions sprinkled here and there within a sector.
The idea is to compute these ECC bytes logically as if the sector were longer than 512 bytes by the size of the LBA address, computing the ECC as if the LBA were logically prepended to the data. This meant that we no longer needed a separate ID field, but without increasing the size of the ECC data we could still derive a fingerprint that detected whether the data stored under that LBA really originally came from that LBA. Thus we can ensure that every time we read data from disk, after all the complex logical-to-physical things there are in a disk drive (zoned recording, serpentine LBA layout, skip-slip defect assignment, grown-defect assignment, etc.), we will still detect if there’s any corruption of the virtualization mechanism.
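To make the mechanism concrete, here is a minimal sketch of the general idea in Python (a toy model, not drive firmware, and using a CRC as a stand-in for the real ECC): the check value is computed as if the LBA were prepended to the sector data, so a misdirected write or read produces a mismatch.

```python
import zlib

def check_bytes(lba: int, data: bytes) -> int:
    # Compute the check value as if the 8-byte LBA were logically
    # prepended to the 512-byte sector payload (stand-in for real ECC).
    return zlib.crc32(lba.to_bytes(8, "little") + data)

medium = {}  # "physical" medium: physical location -> (data, stored check value)

def write_sector(physical_loc: int, lba: int, data: bytes) -> None:
    medium[physical_loc] = (data, check_bytes(lba, data))

def read_sector(physical_loc: int, expected_lba: int) -> bytes:
    data, stored = medium[physical_loc]
    if check_bytes(expected_lba, data) != stored:
        # Either the data is corrupted, or this location holds a sector
        # that was written for (or fetched from) the wrong LBA.
        raise IOError("check mismatch: misdirected or corrupted sector")
    return data

# Misdirected write: data for LBA 1000 lands where LBA 2000 is supposed to live.
write_sector(physical_loc=7, lba=1000, data=b"\x00" * 512)
try:
    read_sector(physical_loc=7, expected_lba=2000)
except IOError as e:
    print(e)  # the error is detected instead of returning the wrong data
```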
Now let’s talk about subsystem applications of HDDs.
Subsystems also have virtualization mechanisms, and you need to provide the same two assurances that you provide as a disk drive designer, namely that the data is intact, and that it physically came from the same place originally used to put the data when it came from the host.
Therefore you need a checksum of the data, and you need a fingerprint of the LUN & LBA, and these need to be captured at the point of entry into the subsystem at the host port, because we need to offer “end to end” protection. These check bytes then accompany the host sector on its journey through the data paths in the subsystem, in and out of cache and onto disk and back from disk.
These requirements to store a few check bytes with each 512-byte host sector in subsystem applications were well understood at the time of the first SCSI drives, and even back before then.
The SCSI spec as originally conceived provided for the user to perform what is called a low-level format, which (re)creates all the sectors on the drive. This low-level format can be performed with a range of sector sizes from 512 bytes (always used for direct host attach) in increments of 4 or 8 bytes up to as much as 528 bytes in some models. These larger sector sizes allowed subsystem architects to store a check byte field along with every 512-byte host sector. Hitachi subsystems use HDDs low-level formatted with 520-byte sectors in drives that use the SCSI command set, namely SAS and Fibre Channel HDDs.
In this way, the subsystem maker can provide a checking mechanism to assure the accuracy of the virtualization layer, as well as to ensure that the sector is not corrupted along its journeys through subsystem data paths and cache memory.
The problem with SATA is that there is no provision in the spec for sector sizes other than 512 bytes. And the reality is that for every possible bit pattern in a sector, the host has to be able to write that 512-byte sector and then read it back again. So we need to use all 512 bytes to store the information contained in the host sector, because if we were able to condense 512-bytes worth of data into less than 512 bytes of space to make some room to store some additional information, then there would have to be at least two host bit patterns that would result in the same smaller-than-512-byte encoding. So there’s no room to encode more information in 512 bytes of space on top of what the host is storing in that bit pattern.
So you can’t provide virtualization mechanism protection and subsystem data path corruption detection assurance and still map each host 512-byte sector on a SATA drive to one 512-byte sector on disk.
Thus you can either decide to fly blind, trusting that there are no design flaws and that the hardware works perfectly, storing each host sector as one 512-byte SATA sector on disk, or you can decide to provide the usual assurance mechanisms, albeit with a performance penalty.
Hitachi protects against subsystem virtualization errors even with SATA
What Hitachi does is to expand each sector as it is written by the host as usual into a 520-byte sector that contains the 8 check bytes guarding against virtualization and corruption errors, and then at the point of writing these 520-byte sectors on disk, we write 64 of them in a “clump” (my term) of 65 physical 512-byte sectors on disk. Doing it this way means that the customer is protected as usual against any subsystem architectural or algorithmic flaws.
For read operations, it doesn’t matter much, because you just read the whole clump (it’s only 32K out of a track size of about 1 MB anyway) even if all you want is a 4K piece within the clump.
But for writing there will be extra I/O operations required with SATA drives that are not necessary with SAS or FC drives.
This is because a 4K write from a host will be for a set of sectors that, once they are mapped to 520-byte logical sectors, will always require updating part but not all of at least one physical sector on disk. (Every 520-byte logical sector gets mapped to a range that is bigger than one physical sector but shorter than two physical sectors.) And any write smaller than 32K will always need a “pre-read” of the old contents of the clump so that you can merge in the piece newly written from the host before writing it back. If you think about it, this only applies to RAID-1, because in RAID-5 and RAID-6 random writes you read the old data before you write the new data, and while you are at it reading the old data you just read the whole clump.
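The arithmetic behind the clump layout and the extra write I/O is easy to check. A back-of-the-envelope sketch of the description above (illustration only, not Hitachi microcode):

```python
HOST_SECTOR   = 512   # bytes per sector as seen by the host
CHECKED       = 520   # host sector plus 8 check bytes
SATA_SECTOR   = 512   # the only physical sector size a SATA drive offers
CLUMP_LOGICAL = 64    # 520-byte sectors stored per clump

clump_bytes = CLUMP_LOGICAL * CHECKED        # 33,280 bytes
assert clump_bytes == 65 * SATA_SECTOR       # exactly 65 physical sectors

# A 4K host write covers 8 host sectors = 8 * 520 = 4,160 bytes of the clump,
# which is not a whole number of 512-byte physical sectors, so part of at
# least one physical sector must be preserved: read the clump, merge, rewrite.
four_k_write = 8 * CHECKED
print(four_k_write % SATA_SECTOR)            # 64 -> not aligned to physical sectors
print(clump_bytes // SATA_SECTOR)            # 65 physical sectors per clump
```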
SATA-W/V vs. SATA-E on enterprise subsystems
But this isn’t the only performance issue with SATA. Our design engineering department is very concerned about the potential for lower reliability with SATA drives. Therefore on the Enterprise subsystem we offer the user two choices. The first is “SATA-W/V” or “write and verify”, where every write to disk, once destaged to the disk surface, is followed by a read from disk to assure the write didn’t encounter any “silent write failure” condition. There is a known failure mechanism, common to all disk drives regardless of host interface type, that can produce silent write failures for a short period of time before the host (or the drive) figures out that nothing changes as a result of writes any more. Doing a read-verify operation after every write allows us to detect if this rare but nevertheless possible failure mode is occurring.
The second option offered to Enterprise customers is the “SATA-E” mechanism. With SATA-E, we also protect against silent write failures, but in such a way that we don’t need a read-verify operation after every write. Instead, what we do is randomize the mapping from the 64 host 520-byte sectors to the 65 physical sectors within a clump on every write. That way if you write something and there was a silent write failure, on the I/O that failed, the physical location of that LBA on disk in the clump would have changed, and if you try to read the data back after a silent write failure, you will read from the new location, not the old location, and therefore the data at the physical place you are looking would not match the LBA and the I/O would fail.
So SATA-E actually is faster for pure random writes than SATA-W/V on Hitachi enterprise subsystems, and it detects silent write failures.
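A toy illustration of why re-randomizing the logical-to-physical permutation on every write catches silent write failures (hypothetical Python, not the real VMA format; in the real implementation the check bytes stored with each sector do the detecting):

```python
import random

CLUMP, PHYS = 64, 65   # 64 logical 520-byte sectors live in 65 physical sectors

def random_mapping():
    # Which physical slot each logical sector occupies; re-drawn on every destage.
    return random.sample(range(PHYS), CLUMP)

vma_mapping = random_mapping()                               # what the subsystem believes
disk_slot_owner = {s: l for l, s in enumerate(vma_mapping)}  # what is actually on disk

def destage(write_succeeds: bool):
    global vma_mapping
    vma_mapping = random_mapping()             # the VMA always records the new layout
    if write_succeeds:                         # ...but the platters only change if
        disk_slot_owner.clear()                #    the write actually took effect
        disk_slot_owner.update({s: l for l, s in enumerate(vma_mapping)})

destage(write_succeeds=False)                  # simulate a silent write failure

# On a later read, the data found at the expected slot belongs (with high
# probability) to some other logical sector, so its check bytes will not
# match and the I/O fails instead of silently returning stale data.
detected = sum(1 for l in range(CLUMP) if disk_slot_owner.get(vma_mapping[l]) != l)
print(f"{detected} of {CLUMP} logical sectors would fail the check")
```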
But the problems with SATA-E are twofold. Firstly, there are so many clumps on a SATA drive that the mapping information recording, for each clump, the permutation of the logical 520-byte sectors (called the Volume Management Area or VMA) is too big to fit entirely into Shared Memory. This means that when you read from a SATA-E parity group, there is a chance that you will need to do a pre-read, an additional I/O operation, to fetch the relevant section of the VMA from disk before you can satisfy the host read request. Thus random reads are slower on SATA-E.
There is some chance you will have to write to the VMA with every host data destage operation as well.
The second problem with SATA-E is that the extra computation required to essentially add another virtualization layer substantially increases microprocessor utilization. In the USP V, the guidance from engineering was that SATA-E would increase BED utilization by 70%. For this reason alone, we generally don’t recommend SATA-E, because it disproportionately consumes MP resource, and this MP resource is what limits the overall IOPS throughput of a subsystem.
OK, so if you have borne with me this far, there is light approaching at the end of the tunnel. We don’t know whether it’s the end or whether Ian’s just going to move to the next phase of the explanation.
How big is too big?
We’ve seen that performance on a SATA drive is much worse than performance with SAS or FC drives, because SATA drives rotate slower and seek slower. And because the mechanisms that assure the integrity of subsystem virtualization layers and detect data corruption “end to end” require extra I/O operations to be issued to SATA drives that don’t have to be issued to SAS and FC drives, the performance of SATA drives in subsystem applications is further impaired.
The good thing about SATA drives is the huge capacity. The bad thing about SATA drives is roughly 1/3 the IOPS capability, which has to handle not only host I/O but also the so-called “SATA supplemental” I/O operations that aren’t needed with SAS or FC.
Is this a big problem? Let’s run a couple of numbers to see where we are in the ballpark.
A very large dot com customer with an instantly recognizable name has told us that the overall average access density to their data over their entire shop is about 600 host I/O operations for every TB of host data. This corresponds exactly with some data that IBM published a while back, so this is a reasonable average amount of I/O activity per TB of data.
Let’s compute very roughly what a 10K 300 GB SFF drive and a 2 TB HDD look like in terms of host IOPS per TB of data.
First let’s look at the drive itself. A 2 TB drive can do about 100 IOPS and the capacity of the drive is 2 TB, so the SATA drive can do about 50 IOPS per TB. If the AVERAGE activity of the customer’s data is 600 IOPS per TB, that would mean that if the drive were directly attached to the host (without RAID), on average you could only use less than 1/10th of the capacity of the HDD before you ran out of IOPS. Oh, and then the 3 TB drive is coming, and then even bigger drives after that, all with the same IOPS capability!
What about our 10K 300 GB SFF drive? It can do about 300 IOPS, so the direct attach access density capability if you fill the drive is 300 IOPS per 0.3 TB, or about 1,000 IOPS per TB of data. So for direct attach, you should be able to fill 300 GB 10K drives with data of average activity.
You could even almost fill 600 GB drives which have the same IOPS but twice the capacity, so they can do 500 IOPS per TB of data.
But that’s only for direct host attach.
The picture gets worse in a subsystem
In a subsystem you have some sort of “RAID penalty” for writes that depends on the RAID level. Just for the purposes of illustration let’s look at the number of HDD-level I/O operations needed to support one host read and one host write. For the host read, you need to do one HDD I/O. For the host write, you would need to do 4 HDD I/Os (read old data, read old parity, write new data, and write new parity) and more than that if you need SATA supplemental I/Os as well.
So for this workload, you would need 1+4 = 5 HDD I/Os, or 5/2 = 2.5 HDD I/Os for each host I/O. This ratio gets as bad as 1:4 for pure random writes in RAID-5, and worse still for SATA drives. So let’s use 2.5 HDD I/Os per host I/O for our ballpark estimation.
Our SATA drive that we thought could do 50 IOPS / TB now looks like it really can only do 20 host IOPS per HDD TB, before we even TALK about the SATA supplemental I/Os. So for ordinary data with an average access density of 600 IOPS per TB of data, we can only fill the drive to less than 20/600 = 3% of its capacity. (Yes, I’m sure the sharp-eyed reader will have noticed that this doesn’t account for the fraction of the potential drive’s capacity that is used for parity, but with the SATA supplemental I/O, thinking that you will hit the drive IOPS limit at about 2% or 3% full from a capacity point of view is about right.)
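Here is the same ballpark estimate as a few lines of Python (using the 600 IOPS/TB average and the 2.5x RAID penalty from the 50/50 example above, and ignoring parity capacity overhead, TCQ, and SATA supplemental I/Os, just as the rough numbers above do):

```python
ACCESS_DENSITY = 600    # average host IOPS per TB of host data
RAID_PENALTY   = 2.5    # back-end HDD I/Os per host I/O (50/50 read/write example)

def usable_fraction(drive_iops: float, drive_tb: float) -> float:
    # Fraction of the drive you can fill with average-activity data before
    # the drive runs out of random IOPS.
    host_iops_per_tb = (drive_iops / RAID_PENALTY) / drive_tb
    return min(1.0, host_iops_per_tb / ACCESS_DENSITY)

print(f"SATA 7.2K 2 TB : {usable_fraction(100, 2.0):.1%}")   # ~3%
print(f"10K SFF 300 GB : {usable_fraction(300, 0.3):.1%}")   # ~67%
print(f"10K SFF 600 GB : {usable_fraction(300, 0.6):.1%}")   # ~33%
```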
If you buy SATA drives and then try to use them for average-activity data, they will be hideously expensive if you buy enough drives to handle the host IOPS, since you can only fill them a couple of percent full with average-activity data, and therefore you will need a LOT of HDDs. If you do fill the drives more than a couple of percent full of average data, the drives won’t be able to keep up with the IOPS, you will have horrible response times, and Write Pending will instantly fill to the limit with data waiting to be destaged to disk. (That’s why many people put SATA parity groups in their own CLPR, addressing the symptom rather than the root cause.) Performance problems with SATA are sad in my opinion, since for normal applications you either spend much more money buying a lot of SATA drives than you would have spent on SAS enterprise drives, or else you buy an insufficient number of SATA drives, trying to use the capacity of the drives, and then you can’t cope with host workloads and have serious problems.
So ANYTHING you can do to improve the IOPS capability of SATA drives will proportionately increase the amount of data that can fit on the drive before the drive reaches its max physical throughput.
Putting a SAS interface on a 7200 RPM LFF platform
What does the SAS 7.2K 2 TB drive offer us compared to the SATA 7.2K 2 TB drive? There are two main differences from a performance point of view.
The first is that SAS offers native 520-byte sector formatting, and thus there are no SATA supplemental I/Os on a SAS 7.2K drive. (The same sharp-eyed reader will have raised an eyebrow as to why the SAS version of the same drive can be trusted not to have silent write failures, but it’s always a judgment call in the end when it comes to what is “sufficient” protection, and Hitachi engineering is very conservative on protecting customer data.)
The second performance advantage of putting the SAS electronics on the 2 TB 7200 RPM LFF drive is that the Tagged Command Queuing or TCQ acceleration capability is much higher on the SAS electronics card.
SATA drives are optimized for the absolute lowest cost, and there just aren’t the microprocessor cycles nor the number of logic gates in an ASIC that you get with the more expensive electronics in the SAS electronics card.
The TCQ feature is what allows the host to independently issue a bunch of different I/O operations to the drive (called queuing I/Os in the drive), where each I/O operation is identified by a “tag” number, a small integer between, say, 0 and 31 typically for a SAS drive. The drive has the luxury of browsing the queued I/O requests from the host and deciding what order to perform the I/O operations in, regardless of the time sequence received from the host. Of course no I/O operation can be indefinitely delayed, but within such a constraint the drive has the freedom to re-order the I/O operations so as to be able to visit the associated physical locations in a sequence that minimizes seek time and rotation time.
TCQ can accelerate SATA I/O by about 30% (when the workload is multi-threaded, of course) compared to the drive’s single-threaded throughput. SAS drives can accelerate IOPS to about 60% higher than single-threaded throughput.
So where we computed the throughput capability of SATA and SAS drives above, I’ll leave it as an exercise to the reader to calculate the access density capability of SAS and SATA drives in subsystem applications.
The bottom line is that SATA drives are desperately slow. They are so slow that you can only fill them to a couple of percent of their capacity before they run out of IOPS. The SAS version of the same drive, without the burden of having to perform SATA supplemental I/O operations, and being able to perform the IOPS faster (1.6 / 1.3, or about 23% faster), delivers a combined doubling of the access density capability of the drive.
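As a rough sketch of that combined effect (the 1.3x and 1.6x TCQ factors are from the text above; the supplemental I/O overhead of roughly 60% is an illustrative assumption chosen to match the “about 2x” observed result, not a published figure):

```python
BASE_IOPS = 100                     # single-threaded random IOPS of the 7.2K mechanism
TCQ_SATA, TCQ_SAS = 1.3, 1.6        # queued-workload acceleration factors

sata_backend = BASE_IOPS * TCQ_SATA # 130 back-end IOPS on the SATA version
sas_backend  = BASE_IOPS * TCQ_SAS  # 160 back-end IOPS on the SAS version

# Assume SATA supplemental I/Os (clump pre-reads, verify reads, VMA accesses)
# inflate the back-end I/O count by roughly 60% for the same host work.
SUPPLEMENTAL_OVERHEAD = 1.6         # assumption, for illustration only
sata_useful = sata_backend / SUPPLEMENTAL_OVERHEAD   # ~81 IOPS of host-useful work

print(round(sas_backend / sata_useful, 2))           # ~2.0, i.e. roughly double
```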
This is a BIG DEAL, and it’s why I expect all customers that learn this stuff to switch pretty much from SATA to SAS 7.2K except for those rare cases where the data truly is below cryogenically cold in terms of its activity level, or where the customer is fixated on SATA being cheaper per inaccessible GB.
For everybody that gets this, even a modest bump in drive cost with the SAS card will pay enormous returns in the form of more than doubled throughput.
This trend will only be driven harder by a shift to 3 TB and then even larger drives over time.
Other trends in disk drives:
The 10K SFF platform is the mainstream product for Enterprise HDDs. This means that in addition to the current 300 GB and 600 GB models, we should expect even higher capacity 10K SFF SAS models over time.
I’ll do the homework for you. If we have a drive with 600 GB that can do 300 physical IOPS, further accelerated by 60% using Tagged Command Queuing, and where the RAID mechanism causes back-end IOPS to be 2.5x the host IOPS, that drive can accommodate ((300 IOPS * 1.6) / ((7/8) * 600 GB)) / (2.5 back-end IOPS per host IOPS), or about 365 host IOPS per TB. Given that the global average host access density is 600 IOPS per TB, this means that the 600 GB 10K SFF drive is a “fat” drive that can only be completely filled with data that has about half the activity of average data. In other words, the majority of your data is going to be too active to store on a 600 GB 10K SFF drive, if you plan on using all 600 GB and filling the drive with data. Today the “sweet spot” for average data is somewhere between a 10K 300 GB and a 10K 600 GB drive.
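The same homework in Python form (numbers exactly as given above):

```python
drive_iops   = 300         # physical random IOPS of a 10K SFF drive
tcq_factor   = 1.6         # queued-workload acceleration
capacity_tb  = 0.600       # 600 GB
parity_share = 7 / 8       # usable fraction of capacity in a 7+1 RAID-5 group
raid_penalty = 2.5         # back-end I/Os per host I/O

host_iops_per_tb = (drive_iops * tcq_factor) / (parity_share * capacity_tb) / raid_penalty
print(int(host_iops_per_tb))   # ~365, versus a global average of about 600
```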
Future 10K SFF drives that have a capacity of even higher than 600 GB will still be capable of essentially the same IOPS as current drives, and thus these future 10K SFF drives with higher capacity will be even “fatter”, meaning that you can only use the entire capacity of the drive for low activity data.
This is a great lead-in to talk about what happened to 15K RPM.
The end of 15K, even in SFF?
Why are the HDD vendors saying that with SFF they can’t make any money selling 15K RPM drives and they are going to drop them? Hitachi certainly has been saying that, and although Seagate did launch a 15K RPM 300 GB SFF drive, it may not be viable for Seagate going forward, at least short term. This is one of those things that could go either way: if you look at it in the short term (the next year or two) there is one trend, and when you look at it longer term (e.g. 3-5 years) there could be quite a different trend.
To understand what happened, we should observe that disk drive platter diameters (and their associated external form factors) have been getting smaller and smaller when you look at them over a 50-year time span. The original IBM RAMAC drive had 24-inch diameter platters. The very next generation of drives used 14-inch platters, and at least for enterprise drives, that’s where it stayed for a long time. But over time we saw 14-inch platters give way to 8-inch platters, to 5¼-inch platters, then 3.5-inch platters, and now we are transitioning to the 2.5-inch SFF. (OK, Hitachi buffs will note that Hitachi enterprise drives introduced 9.5-inch platters with the Hitachi “Q2” drive that was compatible with the IBM 3380-K 14-inch platters, and then went to 6.5-inch platters that had this cool “reactionless” linear actuator design before finally switching over to standard OEM type 3.5-inch drives.)
The factor that drives an industry shift to a smaller form factor has to do with drive access density capability.
We talked about this earlier. Basically once you have fixed the platter diameter and drive RPM, and these are the factors that characterize a “form factor”, then all drives of that form factor basically are capable of the same random IOPS regardless of drive capacity, because the random IOPS is determined by the mechanical rotation time and the mechanical seek time. And you can’t increase the RPM or improve seek time in any big way without moving to smaller platters.
Some of you may remember 9 GB drives going to 18 GB going to 36 GB going to 73 GB going to 146 GB etc. So you can well imagine what happens if you build a 10,000 RPM drive that only has 9 GB of capacity (actually, 10K RPM probably didn’t come along until later, but bear with me for the sake of argument; I just don’t remember off the top of my head what the capacity of the first 10K 3.5-inch model was). Imagine if you will a 10K RPM drive with only 9 GB of storage capacity. You can immediately grasp that the ratio of IOPS to GB would be very high, so high in fact that you would almost always run out of GB before you would run out of IOPS.
So if you run out of GB before you run out of IOPS, then higher RPM drives look horribly expensive.
The reason for this comes from the relationship between drive RPM, platter diameter, and drive capacity.
If you double the RPM of a drive, you will need a little more than 4x the power to turn the platters, because the power required to create air turbulence goes up as a little more than the square of the speed. For example, to make a car go twice as fast, you need more than 4x the horsepower.
So we can’t double the RPM of a disk drive and keep the same platter diameter, because the drive would burn too much power and it would be difficult to cool.
But if we keep the RPM of the drive the same, and double the diameter of the platters, you increase the power required to spin the platters by over 16x. How can this be? Well, each unit of surface area at the edge of the platters is going twice as fast if the platter diameter is doubled. Therefore each unit of surface area needs a bit more than 4x the power. At the same time, if you double the diameter of the platter, it has 4x the surface area, and we just noted that each unit of surface area needs 4x the power, and thus the total power required goes up by over 16x.
In other words, we can increase the RPM and keep the power consumption the same if we decrease the size of the platters only moderately. All within the 3.5-inch LFF form factor, 7200 RPM drives have (roughly) 3.5-inch platters, 10,000 RPM LFF drives have 3.0-inch platters, and 15K LFF drives have 2.5-inch platters.
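A rough way to sanity-check these scaling claims (treating windage power as simply proportional to the square of the edge speed times the platter area, which slightly understates the “a bit more than the square” effect):

```python
def relative_windage_power(rpm_ratio: float, diameter_ratio: float) -> float:
    # Edge speed scales as RPM * diameter; power per unit of platter area
    # scales roughly as the square of that speed; platter area as diameter squared.
    speed = rpm_ratio * diameter_ratio
    return speed**2 * diameter_ratio**2

print(relative_windage_power(2.0, 1.0))          # double the RPM, same platters -> ~4x
print(relative_windage_power(1.0, 2.0))          # same RPM, double the diameter -> ~16x
print(relative_windage_power(15/10, 2.5/3.0))    # 10K/3.0-inch -> 15K/2.5-inch  -> ~1.1x
```

The last line is why the 15K LFF drive with its 2.5-inch platters lands in roughly the same power envelope as the 10K LFF drive with 3.0-inch platters.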
The problem early on in the life of a form factor is that you run out of GB before you run out of IOPS.
It turns out that a 15K RPM drive has platters with a bit more than ½ the area of the platters in a 10K RPM drive. Thus most of the capacity difference comes from the area of these smaller-outer-diameter platters. There is also a smaller factor associated with the higher linear velocity of the heads, which in an LFF drive at the outer edge is about 100 mph (160 km/h) for 15K LFF vs. 60 mph (100 km/h) for 10K LFF. With the higher flying speed in 15K comes more flow-induced vibration of the head due to turbulence and more difficulty flying close to the surface without hitting it. This means that 15K RPM drives run at a somewhat lower recording density than 10K RPM drives.
So the problem is that if you are going for the biggest capacity point that you can achieve in the form factor, using all the platters that fit within the form factor (in other words, if the GB are the limiting factor, not the IOPS), then 15K RPM drives are twice as expensive per GB as 10K RPM drives, because the 15K RPM drive with the same number of platters and heads has only ½ of the storage capacity, and it actually costs a bit more to make with the bigger actuator and its magnet. (The cost of a platter is about the same regardless of diameter; cost is by number of platters.)
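Putting some rough numbers on that (the 2.5-inch and 3.0-inch LFF platter diameters come from above; the recording-density handicap for 15K is an assumed figure, for illustration only):

```python
area_ratio    = (2.5 / 3.0) ** 2   # 15K platter area vs. 10K platter area (LFF)
density_ratio = 0.75               # assumed: 15K runs somewhat lower areal density

capacity_ratio = area_ratio * density_ratio
print(round(area_ratio, 2))        # ~0.69
print(round(capacity_ratio, 2))    # ~0.52 -> roughly half the GB per platter

# With platter and head counts (and therefore cost) roughly equal, the 15K
# drive of the same generation ends up at roughly double the cost per GB.
print(round(1 / capacity_ratio, 1))   # ~1.9x the $/GB
```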
OK so at the beginning of the life of a form factor, where GB are more important and you are trying to make the biggest drive that will fit within the form factor, then 15K RPM drives cost twice as much per GB as 10K RPM drives.
But later on as we keep doubling the capacity of the drives over and over as new drive generations come out, each with twice the recording capacity (for enterprise drives) of the previous generation, there comes a point where there’s no point to increasing the capacity any more, because you run out of IOPS before you run out of GB.
This is the point where the mainstream of the market switches to the higher RPM: when the recording density gets so high that you might as well use advances in recording density to make the platters smaller, since you can’t use more capacity at the original platter diameter anyway.
Making the platters smaller in diameter means you can rotate the drive faster, and voila, this is the point where the 15K RPM drive displaces the 10K RPM drive as the mainstream product. This happened above the 300 GB point on LFF. In fact, Hitachi decided not to build a 10K 600 GB drive in LFF when it became possible to do so. Instead we made new generation drives that also had 300 GB but had fewer heads and media. Well, actually, we did do a 400 GB 10K LFF drive as a kind of last-kick at 10K LFF, but if you look at HDD vendors’ web sites today you will see that there are no more 10K LFF drives for sale at all.
From an economic point of view, we have found that customers are willing to pay up to 25% ~ 30% more for a 15K RPM drive than a 10K RPM drive, but they generally will buy very few 15K RPM drives if they are twice the price of 10K RPM drives.
And that’s where we are right now with SFF. We are making the absolutely biggest drive that we can, with the most platters and heads that fit within the form factor. And even in SFF, 15K platters are smaller than 10K SFF platters. The capacity of a 15K RPM SFF drive is one half of the capacity of a 10K SFF drive of the same generation.
And that’s the economic factor that’s squeezing out the 15K RPM drive right now. You don’t get a 100% improvement in IOPS for a 100% increase in cost going from a 10K SFF drive to a 15K SFF drive, and this makes 15K RPM drives look very expensive in SFF.
Hitachi did build a 15K 146 GB SFF drive in the same generation as the 300 GB 10K SFF drive, but decided against making a 300 GB 15K SFF drive in the generation of the 10K 600 GB SFF drive. Seagate did come up with a 300 GB 15K SFF drive, but again, it has ½ the capacity of a 10K RPM SFF drive at a bit higher cost, and thus doesn’t look very attractive financially.
That’s where we are at right now. 15K SFF is just not very attractive financially. 15K $/IOPS is actually worse than 10K $/IOPS in SFF because 15K doesn’t yield 2x the IOPS of 10K, but it is 2x the price. So 15K SFF never really makes sense in terms of cost effectiveness to achieve the necessary IOPS.
The 15K sale only makes sense where the customer’s business can increase revenue or profit with faster HDD response time. This means that 15K is very much a niche product in SFF at this time. (With SSDs, the situation is the same, that you can only justify them where there is business value to having faster response time, because SSDs not only have worse cost per GB, but they also have worse cost per IOPS.)
Where we are at the moment is that we are still early in the life of the SFF form factor. Seagate has decided to go for it and make a 15K 300 SFF drive, but Hitachi GST couldn’t see how it could sell in enough volume to make any money building one.
But if you think about it, in the disk drive business sometimes it’s like “déjà vu” all over again.
The point at which we basically started to transition from 10K to 15K as a mainstream product in 3.5-inch LFF was “above 300 GB”.
Yes, SFF drives have smaller platters and thus shorter arms that seek faster and thus we can make 10K SFF drives bigger in GB than 10K LFF drives because the SFF drives are capable of higher IOPS.
Ian thinks 15K SFF will come back in the next few years
My own personal view is that since 10K SFF is mainstream and we expect 10K SFF drives with more than 600 GB in future, then we can’t be all that far off from the point where 15K starts to look more attractive again. This will happen when we start using increases in areal density to decrease the numbers of platters and heads instead of increasing the capacity.
Wait, there’s another light appearing at the end of the tunnel …
This brings us to the topic of increasing areal density. A transition to 15K as the main product in SFF would be driven by increases in areal density.
Can the researchers keep performing magic?
The big news here is that although over 50 years of HDD evolution we have come to expect that our brilliant scientists will keep solving problems and inventing new technologies to keep doubling the capacity of the drives over and over and over, at an average rate over those 50 years of something like a 40% compound annual growth rate, we may actually be reaching the “areal density end point” where we hit fundamental physical limits.
Just for your amusement, I remember when Dr. Jun Naruse, who later became HDS’ CEO, was head of HDD R&D at Hitachi. He told me that the theoretical maximum recording density that could ever be achieved was about 65 megabits per square inch. At the time, we were shipping about 5 megabits per square inch using particulate oxide media (basically rust particles in epoxy resin) and inductive read/write heads. Today’s products are shipping at about 500 gigabits per square inch, or about 10,000 times higher density than we previously believed possible.
But we appear now to really be getting close to the ultimate physical limit.
There are some technologies that are being worked on, most notably Bit Patterned Media and Thermally Assisted Recording (called Heat Assisted Magnetic Recording by Seagate), but they haven’t quite hit the market as fast as originally targeted. In fact, at least at last year’s Diskcon HDD industry convention, no vendor would publicly speculate on what year either BPM or TAR/HAMR will appear in production products.
So what can we do to keep increasing disk drive recording capacities? Well, one thing that is now very publicly being talked about by multiple HDD vendors is the possible introduction of Shingled Magnetic Recording or SMR HDDs. The basic idea here is that without any advances in read/write technology, just by reconfiguring the write head so that you give up on random writes and write relatively wide tracks that overlap like shingles on a roof, you can still easily read back each track from the strip that is left exposed. Using this technique with the same head technology, you can generate much stronger magnetic fields for writing, and thus you can use higher coercivity media that are harder to write on but let you make the bits smaller. Higher recording density means higher capacity drives with the same read/write head technology.
What about 7200 RPM SFF?
Why don’t we offer a 7200 RPM SFF drive? These drives are available from some HDD vendors. The issue here is that because 7200 RPM SFF drives use 2.5-inch platters, and 7200 RPM LFF drives use 3.5-inch platters, these SFF and LFF drives would cost about the same to make, but the LFF drive would have twice the capacity. Since both the VSP and the AMS2000 family support both SFF and LFF drives, we are offering the LFF 7200 RPM drive because it has twice the capacity at about the same price.
The last prediction, anticipating humans will become rational
If people were ever to really think about the economics, realizing that if you try to put normal computer data on a SATA drive you could only fill it to a couple of percent of its storage capacity, then who cares if you can get 2 TB when you couldn’t possibly even use 1 TB? To me the 7200 RPM SFF drive looks like a solid price performer that hasn’t been given sufficient consideration. So I think that if people ever figure this out, we’ll see the majority of the requirement for “fat” drives land on 7200 RPM SFF drives, with only the truly cryogenically cold data going on 7200 RPM LFF drives. For a set-top box that records and plays video, 100 IOPS is plenty for handling a few HD video streams, so keep on using 7200 RPM LFF and bring on all the TB you can! And for archival applications with almost no read activity, 7200 RPM LFF will always offer the best price per GB.
But for anything that resembles normal computer data, 7200 RPM LFF is “too big to make sense”, and 7200 RPM SFF would be plenty big enough to get into trouble running out of IOPS before running out of GB. Of course, we would want the SAS-interface 7200 RPM drive, not the SATA version.
And then there’s the fact that SFF drives use a fraction of the power that LFF drives use.
I hope this gives some perspective on how the HDD roadmap will impact subsystem performance and economics over the next year or two and beyond.
Comments
Phew, this took a while to read, but some great content contained therein! I have a few questions:
1. Of the 600IOPS, how many are read/write? Is this the standard 70/30 split? If so, surely an array would mitigate the problem with cache.
2. Following on from 1, what about hybrid drives? What’s your opinion and how could they help mitigate issues, especially in heavy read environments?
3. What consideration is there in your calculations for the random versus sequential mix? We’re always told to use SATA for sequential workloads, for example.
4. What’s your view on SSD? Many of the issues discussed here could be mitigated well with SSD and block-level dynamic tiering. Would you agree?
Once again, great post.
Thanks for your questions. Let me address these individually:
> 1. Of the 600IOPS, how many are read/write? Is this the standard 70/30 split? If so, surely an array would mitigate the problem with cache.
The average host IOPS per TB of data (called “access density”) at a very large dot com that we all know, across all their data, is 600 IOPS per TB. I didn’t ask what their average read:write ratio is (alternatively expressed as “% reads”). In my example in the blog entry just to get a ball-park figure on whether we can fill the drives, I used a 50-50 mix of reads vs. writes, just to make the math easy to explain. You can re-do the calculation for 70% reads and 30% writes, and you will be able to fill the drives a little more, but as a ballpark figure you will get a similar result.
It’s a common misconception that when cache absorbs host writes, it relieves pressure on disk drives. In actual fact, you CANNOT reduce the back-end I/O using cache. (Well, at most maybe 25%.) Yes, the subsystem handles a host write at electronic speed, accepting data into cache to complete the host write. So host writes are very fast as long as there is room in cache to put stuff. But the data has to destage from cache to disk at some point, and here’s the problem. In a RAID parity group, each host LBA has a fixed location on disk where that data gets stored. So to destage the data from cache to disk, the disk drive access arm and read/write head has to visit each physical location on disk to write the host data to disk.
You can’t keep all the data in cache, because the size of cache is approximately 1/1000th the size of disk.
So the data has to destage at some point. Yes, cache absorbs writes in terms of accepting the data from the host at electronic speed without involving disk drive I/O for the host write I/O. But asynchronously later on the data does have to go to disk within a few 10s of seconds.
So you can’t eliminate writes to disk, you can only delay doing them to a time of your own convenience that doesn’t involve the host.
The reason I said “well, at most 25%” is that there is an algorithm called “gathering write” or “write coalesce”. If the host writes are not random, but have some degree of clumping, then you can organize the destage writes in such a way as to reduce the number of I/Os required for the destage, but again, only if the physical locations are not completely random. For completely random writes, you can’t do anything to optimize. For typical real-life customer I/O, you can optimize out approximately up to 1/4 of the destages.
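A toy illustration of the gathering-write idea (illustration only, not the actual destage algorithm): destage I/Os only shrink when the dirty sectors happen to sit physically close together.

```python
def destage_ops(dirty_lbas, max_gap=0):
    # Coalesce dirty sectors whose physical locations are adjacent (within
    # max_gap sectors of each other) into a single destage I/O.
    ops, prev = 0, None
    for lba in sorted(set(dirty_lbas)):
        if prev is None or lba - prev > max_gap + 1:
            ops += 1                 # start a new destage I/O
        prev = lba
    return ops

print(destage_ops([5, 900, 42, 7731, 123456]))           # 5 -> pure random: no savings
print(destage_ops([100, 101, 102, 103, 500, 501, 502]))  # 2 -> clumped writes coalesce
```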
> 2. Following on from 1, what about hybrid drives? What’s your opinion and how could they help mitigate issues, especially in heavy read environments?
Hybrid drives have some flash memory in addition to the space on disk. But it too is a small fraction of the size of the entire drive. So for truly random reads, they don’t help at all. However, if your I/O activity is clumped together or if you have I/O activity that is only going to a very small file, then hybrid drives can be much faster.
But hybrid drives are more expensive as well.
Hybrid drives internally do something very similar to what Hitachi Dynamic Tiering does at the subsystem level, so this is something to keep an eye on.
> 3. What consideration is there in your calculations for the random versus sequential mix? We’re always told to use SATA for sequential workloads, for example.
SATA doesn’t have as big of a disadvantage for large block sequential writes. For large block I/O, there is less seek and rotational delay as a percentage of the overall disk drive busy time, so this dilutes the effect of the slower mechanical positioning.
But I wouldn’t like to see an overall statement that SATA drives are best for sequential I/O. HDS systems engineers have access to a tool that can show what the estimated throughput is for different workload types for any subsystem configuration.
SATA drives are rarely more cost-effective, except for cryogenically cold data.
> 4. What’s your view on SSD? Many of the issues discussed here could be mitigated well with SSD and block-level dynamic tiering. Would you agree?
Yes, I do agree. As disk drives get bigger, the problem is that if we just fill them with data, too many I/Os come with the data and the drive gets overloaded. With SSDs and Hitachi Dynamic Tiering, we can “skim off” a substantial portion of the total I/O count with only a small fraction of the capacity on SSDs. This works because HDT tracks I/O activity on a per page basis, and then only the hot pages go on SSD. So SSDs are what let you exploit larger disk drives without overloading their IOPS capability.
Before the VSP and HDT, logical volumes had to be placed in their entirety on SSD. This meant that using SSDs was more difficult.
Now with HDT, the subsystem automatically places only the hot pages on SSD, which really makes using SSDs much more practical.
Really fascinating post.
There was one comment in parentheses that explains why you omitted SSDs from the discussion, but I think it really needs to be drawn out more or revised:
“The 15K sale only makes sense where the customer’s business can increase revenue or profit with faster HDD response time. This means that 15K is very much a niche product in SFF at this time. (With SSDs, the situation is the same, that you can only justify them where there is business value to having faster response time, because SSDs not only have worse cost per GB, but they also have worse cost per IOPS.)”
Claiming SSDs have a higher cost per IOPS is a very bold statement. If it’s accurate, why bother tiering to SSDs on an I/O-count basis? If SSDs do have a better $/IOPS, then they should be included in this analysis.
A super article. Have tweeted a link as it’s a vital read for anyone buying drives for enterprise use.
I think this analysis is excellent, and Ian has opened our eyes on the frailty of SATA. One question for Ian or Claus, however. You state that you think the SATA interface will remain dominant in the consumer space. Certainly for the near term that is true, but do you think that with Intel integrating SAS in the Patsburg chip set things could start to change? I would think that the small extra cost to Seagate or WD for the SAS electronics would be paid back by not having to manufacture and stock the supply chain with two different-interface drives that are otherwise the same. If the host side of the SAS/SATA question is cost neutral, why wouldn’t even the desktops and laptops finally drive a stake through the evil heart of the SATA drive?