
Hu Yoshida's Blog - Vice President | Chief Technology Officer

Does Page Size Matter – Redux

by Hu Yoshida on Jun 28, 2011

Automated tiering is one of the hot technologies for 2011, with all the major storage vendors providing it in one form or another. Automated tiering generally refers to the ability to move parts of a LUN or volume to different cost tiers of storage based on the I/O activity against each part of the LUN. This is more efficient than volume-level tiering, which requires moving the entire LUN from one tier of storage to another and requires capacity for the entire LUN in each tier.
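As a toy illustration of the capacity argument (the 1 TB LUN and the 5% hot fraction below are invented numbers for illustration, not measurements from any product), sub-LUN tiering only needs fast-tier capacity for the hot pages, while volume-level tiering must stage the whole LUN:

```python
# A toy illustration (not any vendor's implementation) of why sub-LUN
# tiering saves fast-tier capacity: only the hot slice of a LUN needs
# fast storage, while volume-level tiering must promote the whole LUN.

LUN_GB = 1000                 # hypothetical 1 TB LUN
hot_fraction = 0.05           # assume only 5% of the LUN is I/O-hot

volume_level_ssd = LUN_GB                # whole LUN promoted to SSD
sub_lun_ssd = LUN_GB * hot_fraction      # only the hot pages promoted

print(volume_level_ssd)  # 1000
print(sub_lun_ssd)       # 50.0
```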

Carol Sliwa published a SearchStorage blog post on 5 key questions to consider when evaluating sub-LUN tiering. One of the questions was, “Does the size of the data chunks being moved matter?” Her conclusion, supported by analysts from Enterprise Strategy Group, Gartner, and Wikibon, was that the chunk size matters less than how much money you are going to save.

This has been Hitachi Data Systems’ position from the beginning. We use a 42 MB page size, which has often been criticized as too large and inefficient in its use of storage space. EMC uses a 768KB chunk, which they claim is 54 times more efficient than our 42 MB page size. While from a capacity standpoint that could be true (depending on whether 768KB is really a chunk or a chunklet of a much larger 370 MB chunk), the overheads can be much higher, making this chunk size more expensive than our 42 MB page.

The Impact of Data Chunk Size

These pages or chunks have two major effects on the storage subsystem. First, each page requires memory to store its metadata: the smaller the chunks, the more metadata. In EMC’s case, that is 54 times the memory required for Hitachi’s page-level tiering. Since EMC’s VNX and VMAX architectures do not have a separate control store like the Hitachi VSP, the memory for the metadata has to be subtracted from the data cache. If the metadata has to be stored in the cache memory of each of the VMAX engines, that means 8 x 54 times more memory. If it is stored in one engine’s cache, then there is overhead in accessing the metadata across the RapidIO switch.
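As a back-of-the-envelope sketch of this metadata argument (the 8-bytes-per-entry figure and the 100 TB pool are illustrative assumptions, not vendor-published numbers), the metadata footprint scales inversely with chunk size:

```python
# Hypothetical back-of-the-envelope comparison of tiering metadata overhead.
# The 8-bytes-per-entry figure and the pool size are illustrative
# assumptions, not vendor-published numbers.

def metadata_bytes(pool_bytes: int, chunk_bytes: int, entry_bytes: int = 8) -> int:
    """Metadata needed to track every chunk in a pool."""
    n_chunks = -(-pool_bytes // chunk_bytes)  # ceiling division
    return n_chunks * entry_bytes

POOL = 100 * 2**40          # a hypothetical 100 TB pool
page_42mb = 42 * 2**20      # Hitachi-style 42 MB page
chunk_768kb = 768 * 2**10   # EMC-style 768 KB chunk

hds = metadata_bytes(POOL, page_42mb)
emc = metadata_bytes(POOL, chunk_768kb)
print(f"42 MB pages  : {hds / 2**20:.1f} MiB of metadata")
print(f"768 KB chunks: {emc / 2**20:.1f} MiB of metadata")
print(f"ratio: {emc / hds:.0f}x")  # 42 MB / 768 KB = 56x more entries to track
```

Whether that extra metadata lives in a dedicated control store or is carved out of the data cache is the architectural question the paragraph above raises.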

The second impact is on the processing power required to track I/O activity against each page and to move chunks across the tiers. Here again there will be 54 times more calculations and 54 times more data movement. In the Hitachi VSP we have a separate pool of global Intel quad-core processors that handles the page-level tiering so that it does not impact the performance of the front-end or back-end data accelerators. EMC VMAX and VNX use the same processors to do the FED/BED and FAST 2 processing, as well as other processing such as replication and VAAI offload.

The net effect is better cost performance of the total system when page-level tiering is implemented using 42 MB pages with a separate control store and separate global processors. To compete on a cost basis, other systems would have to increase performance through higher-performance, higher-cost disks, or limit the amount of capacity that can be used with sub-LUN tiering.

Of course, I am not an expert on how EMC has implemented FAST 2, so if any end users have had experience, please send me a comment on how it is really done.


Comments (12)

the storage anarchist on 01 Jul 2011 at 9:31 am

You admit you are not an expert on VMAX or FAST, but still you proclaim theoretical deficiencies that you have clearly not actually seen or measured.
You have overlooked the fact that checking metadata is far, far faster than moving 54 times as much data.
Worse, the memory wasted by moving 42MB when only 768KB is being used by the application means that you waste FAR more memory than the metadata required to track at a smaller increment. For example, if your metadata is just 8 bytes per extent, tracking 42MB at a granularity of 768KB would require only 448 bytes of metadata, while HDT unnecessarily moves AND WASTES 42,240KB of expensive DRAM.
Smaller really IS better.
Head-to-head workload comparisons I have seen demonstrate that your theoretical computational overhead is virtually zero. And in fact, VMAX routinely delivers faster response times than VSP (and the predecessor USP+USP-V) for any workload and I/O size. With or Without FAST VP.
My *opinion* is that the VSP’s extremely poor latency is due to the fact that 100% of data cache AND control store references must always traverse TWO PCIe Gen 1 buses and the notorious crossbar switches inherent to the architecture. Since PCIe is significantly slower than the memory bus, response times suffer significantly.
Enginuity processes, on the other hand, have direct memory access over the separate memory bus found in modern Intel processors, without having to traverse PCIe or the RapidIO fabric for any data that resides within the local engine. And guess what – VMAX keeps all the FAST and VP metadata for the physical drives connected to an engine IN THAT ENGINE’S LOCAL MEMORY. In addition, the data cache slots for those connected drives are ALSO biased to be local as well, so the only time the data traverses the fabric is when the requesting I/O port is on another engine.
These very real optimizations (and others) more than nullify your theoretical overheads. Which is probably why VMAX keeps gaining market share…
Which, in turn, is probably why you keep trying to stir up FUD. Fortunately, facts and truths belie your theories.


Hu Yoshida on 06 Jul 2011 at 9:22 am

Welcome to the club, Barry. Using cache within the processors is a common practice with multi-core processors. We use the processor memory in the front and back port processors as well as in the VSD processors. Where do you back that data up? You do acknowledge that you have to traverse the external switch when the requesting I/O port is in another engine. How often does that occur, or do you have to restrict your configurations to avoid that overhead?

You seem to think that we move pages every time an 8-byte record is written to a page. We move pages based upon the activity against that page during an interval that can be set by the user. Not all pages will be moved: we perform responsive promotions of active pages and demotions of long-term less-active pages. It takes less overhead and less housekeeping to move one page that maps efficiently into cache slots and track tables than to move a number of smaller chunks that may not be aligned to those boundaries.
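The interval-based scheme described above can be sketched roughly as follows. This is an illustrative outline only, not HDS code; the thresholds, page IDs, and the simple per-cycle counting are all invented for the example:

```python
# Illustrative sketch (not HDS code) of interval-based page tiering:
# count I/Os per page during one monitoring cycle, then promote the
# hottest pages and demote the coldest, rather than relocating on
# every individual write.

from collections import Counter

def plan_relocations(io_log, promote_threshold, demote_threshold):
    """io_log: iterable of page IDs touched during one cycle."""
    activity = Counter(io_log)
    promote = [p for p, n in activity.items() if n >= promote_threshold]
    demote = [p for p, n in activity.items() if n <= demote_threshold]
    return promote, demote

# One cycle's worth of (hypothetical) I/O activity:
promote, demote = plan_relocations(
    ["p1", "p1", "p1", "p2", "p3", "p1"],
    promote_threshold=3, demote_threshold=1)
print(promote)  # ['p1']
print(demote)   # ['p2', 'p3']
```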

By the way, how do you map your chunks? It sounds like you have a 368 MB chunk which you divide into approximately 480 x 768KB chunklets. When you create your 768KB chunklet, do you have to reserve the whole 368 MB? Some vendors do this to cut down on the addressing overhead by addressing a large chunk and then indexing into the smaller chunklet. With 42 MB pages we address the pages directly and do not need to reserve space for a large chunk in order to create a smaller chunklet.

You keep referring to some performance test that is supposed to show the superiority of the VMAX. Would you like to share the configurations and workloads that you tested so that we can validate your numbers? As you know, one can easily skew the results by choosing different reset intervals.

the storage anarchist on 06 Jul 2011 at 11:02 am

I’m not talking about using processor cache, but the (up to) 128GB of DDR3 DRAM that is adjacent to the processors within each VMAX Engine. The VSP only has 2GB of such directly-accessible DDR3 – everything else (as you point out) is separately managed on the other side of 2 PCIe Gen1 buses as Data and Control store. EVERY time the VSP needs to access user data OR control data, it has to traverse those PCIe buses, whereas VMAX enjoys the majority of memory references directly over the local memory channels, and only has to “reach out” over the low-latency RapidIO fabrics for a small fraction of total memory operations.
I never said you moved a page for every access – I gave a theoretical example for metadata overhead of 8 bytes, and demonstrated that the relative “cost” of relocation was 1/6th as much on VMAX as on VSP.
Everyone gets confused by FAST VP extents and metadata. Although I have explained it before, and there are several good white papers on the subject, I’ll explain again.
You started from the wrong end, though. As you know, Symmetrix is a track-oriented system, and even to this day we maintain metadata at the track level (such as whether any of the bytes within each track have ever been written to by the host). And the building block for FAST VP is Symmetrix Virtual Provisioning. So…
* VMAX tracks are 64KB in size.
* VMAX Virtual Provisioning allocates data in 12-track VP extents – 768KB, because that optimizes how we stripe data across RAID sets. Zero space reclaim on VMAX can release unneeded capacity in increments of 768KB, while the smallest amount of capacity USP/VSP can unmap is 42MB. While the LBA range within a VP extent is contiguous, VP extents need not be consecutive in the pools (which enables FAST VP).
* FAST VP tracks utilization in *logical* FAST VP sub-extents of 10 VP extents – 7.5MB (because that is the optimal size we found after monitoring the access patterns of over 3500 production applications across a broad range of platforms). These FAST VP sub-extents can be “sparse” – only the allocated VP extents are counted, and only those tracks within each VP extent that have actually been written to by the host are ever moved.
* FAST VP also records “aging” metadata at a granularity of 480 FAST VP sub-extents, or (approximately) 360MB. When FAST VP needs to free up space in a tier, it can quickly identify the least-busy 360MB and schedule it for demotion – and no, if any of that 360MB hasn’t been actually written to, it won’t be consuming memory (at a granularity of a track), nor will it actually get copied.
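The granularity hierarchy described in the bullets above can be checked arithmetically. Note that 480 x 768 KB VP extents equals exactly 360 MB, so the "480" figure only lines up if it counts VP extents rather than 7.5 MB sub-extents; these sizes are the commenter's stated numbers, not official specifications:

```python
# A sketch of the FAST VP granularity hierarchy as described in the
# comment. The sizes are the commenter's numbers, not official specs;
# 480 x 768 KB = 360 MB, so the aging unit is computed from VP extents.

KB = 1024
TRACK = 64 * KB               # VMAX track
VP_EXTENT = 12 * TRACK        # allocation unit: 768 KB
SUB_EXTENT = 10 * VP_EXTENT   # FAST VP monitoring unit: 7.5 MB
AGING_UNIT = 480 * VP_EXTENT  # demotion scheduling unit: 360 MB

print(VP_EXTENT // KB)         # 768
print(SUB_EXTENT / (KB * KB))  # 7.5
print(AGING_UNIT // (KB * KB)) # 360
```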
Finally, as I am sure you can appreciate, it is difficult to create a repeatable workload that demonstrates the agility of a particular auto-tiering implementation while also being representative of the real world. EMC’s performance engineers have done just that, and we routinely compare our products to competitors’ actual products in our own labs using scientific methods, strict controls, and engineers well-versed in the theoretical and experiential best practices for the platforms being tested. I posted the results of some of those comparisons in earlier blog postings on FAST VP (4.001 compares VMAX FAST VP to VSP HDT, for example). We also validate our results with customers who have tested both products, and so far there has been no disagreement that FAST VP is more efficient, requires less SSD capacity, delivers better response times and reacts more quickly to workload changes with less impact on response times.
In the end, I guess I have an advantage over you: I don’t have to make up my competitive assertions…see my upcoming blog post for more background on why smaller is actually, much, MUCH better…

Hu Yoshida on 06 Jul 2011 at 6:08 pm

So how much metadata do you store for each track, each FAST VP extent, and each sub-extent? What if the tracks in an extent or sub-extent were in different VMAX engines? I checked the blog post you referenced, but it’s lacking configuration details, which makes it hard to know what is really being compared. From the chart, it is apparent that you are comparing a VMAX that started relocation at +8 hours to a VSP that started relocation at +24 hours, which essentially gives VMAX a 16-hour head start. The comparison should have been done using the same relocation start time; without the configuration details, this chart is meaningless.

the storage anarchist on 07 Jul 2011 at 7:44 am

The incremental metadata for FAST VP (and the algorithms used to create and manipulate it) are trade secret property of EMC, but generally in the neighborhood of the example given. That is, while 42MB of FAST VP extents may use more metadata than does 42MB of HDT-managed data, that metadata allows FAST VP to perform much smaller data relocations, avoiding the overhead of moving excess data and wasting precious DRAM cache unnecessarily.
The referenced chart compares default settings for FAST VP and HDT, using the only HDT setting that allowed for continuous monitoring and relocation at the time (24 hour). Both products started at the exact same time, but FAST VP started moving sooner because it was designed to do so. Apparently HDT (and IBM EasyTier, for that matter) first collects data for 24 hours before starting any relocations. I assure you that there was no funny business in monkeying up the comparison – we even used the SAME (not identical, the SAME) servers, network switches and workloads in comparing the two machines.
And no, the test should not start with the same relocation start time – the test period starts the moment that FAST VP/HDT is enabled on the LUNs under test. As demonstrated by the workload change inserted, the long gap of time between workload change and HDT reaction means that customers run with bad response times for longer than they will with FAST VP.
The important part of those charts is to look at the response times: VMAX response times start out faster than VSP (running only on 15K rpm drives), remain lower while relocating extents, and end up faster after things have been optimized. The two systems were identically configured with the same total amount of DRAM, Flash, FC/SAS and SATA drives, and the exact same workload was run on both (an OLTP-type workload with mixed large/small block, sequential/random I/O).
FYI – FAST VP can be tuned (dynamically, under load, without application downtime) to react to workload changes in as little as 20 minutes. And the FAST VP change rate can be accelerated (rebalance quicker, with added I/O overhead) or decelerated (reduced overhead but a longer rebalance).
What is the SHORTEST possible gap between workload change and reaction time that HDT can be configured for? Can you tune HDT’s relocation rate dynamically? Does HDT support priorities to ensure the most important applications get serviced more quickly and get priority access to the Flash tier? Can different applications (groups of volumes) be assigned different policies (or percentages of Flash, FC/SAS and SATA/Fat SAS)? Can you change policies on the fly?
In the end, I believe I have shown that FAST VP has LESS CPU overhead, wastes LESS DRAM, and imposes LESS of an I/O overhead than does HDT for the same workloads under the same conditions. Further, customer feedback supports these claims.
Your theoretical assertions to the contrary are inaccurate, hypothetical and baseless, and thus are nothing more than misleading misrepresentations intended to create Fear, Uncertainty and Doubt among prospective customers seeking to compare the two products in an objective manner.

Hu Yoshida on 10 Jul 2011 at 10:14 pm

Barry –

Again it sounds as if we agree to disagree. Let the customers be the final judge.

- Hu

the storage anarchist on 12 Jul 2011 at 7:03 am

I guess I was unduly optimistic that you might acknowledge that your theories were incorrect. To your credit, you have not attempted to deny that HDT by definition moves more data, nor that this frequently will result in up to 6 times as much expensive Flash being consumed by HDT as by FAST VP.
But I bet our customers would still like to know the answers to my questions about your product.
Surely you have no reason not to explain how HDT handles different workloads with different needs and priorities. Or the quickest elapsed time before HDT recognizes and reacts to changes in workload. Or whether the policies can be changed per application, on the fly. And if the copy/relocation rate is configurable.


Hu Yoshida on 13 Jul 2011 at 12:38 pm

I still do not understand enough about your data movement of different ‘extent’ sizes in FAST 2 to make a comparison. A lot is said about the 768KB 12-track VP extent, yet when you mention access patterns you readily combine 10 VP extents, calling them FAST VP sub-extents. It’s curious how the larger of the two is called a ‘sub’. And when it comes to demotion and promotion you aren’t looking at one VP extent or even one FAST VP sub-extent but at 480 of them – 360 MB. It is obvious that when VSP moves data, it does it in 42 MB pages. VSP keeps the mapping, moving, I/O tracking, balancing, and reclamation very straightforward. We do not have to defragment or recombine 4,800 VP extents to optimize storage. VSP does not need to move more data than FAST 2 in any given period. FAST looks like it is interested in moving 360 MB at a time, although granted it maps LBA addresses in groups of 768KB.

We believe that automation should mean fewer knobs and levers for an operations person to worry about, so we keep it very simple. First, we believe we should make the most use of expensive SSDs, so we load the SSD tier first. We continuously monitor the pages and compute a weighted average of page activity over several cycles. We use this weighted average to demote pages, in order to avoid demoting high-activity pages that may have a temporary lull in activity. We promote pages that show increasing activity at the beginning of a cycle. In this way we do fast promotion and slow demotion, with the goal of keeping the more active data in the higher-performance tiers despite temporary periods of inactivity. This avoids thrashing.
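One common way to implement "fast promotion, slow demotion" with a weighted average is an exponentially weighted moving score. The sketch below is purely illustrative: Hitachi does not publish HDT's actual algorithm, and the alpha value and starting score are invented:

```python
# A minimal sketch of "fast promotion, slow demotion" using an
# exponentially weighted moving average of per-page activity.
# The weights and values are illustrative, not HDT's actual algorithm.

def update_score(old_score: float, cycle_ios: int, alpha: float = 0.3) -> float:
    """Blend this cycle's I/O count into a long-run activity score.

    A small alpha makes the score decay slowly, so a historically busy
    page that goes quiet for a cycle or two is not immediately demoted,
    which avoids thrashing pages between tiers.
    """
    return (1 - alpha) * old_score + alpha * cycle_ios

score = 100.0          # a historically hot page
for _ in range(3):     # three quiet cycles with zero I/O
    score = update_score(score, 0)
print(round(score, 1)) # 34.3: still warm after a temporary lull
```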

Some high performance data should not reside in an HDT pool. It might be better served in an HDP pool of high performance storage where we get the benefits of thin provisioning and wide striping, and the ability to non-disruptively move all the data to a lower performance HDP pool when there is no longer any need for the high performance.

udubplate on 14 Jul 2011 at 10:58 pm

When is EMC going to provide a highly available method for management and configuration of FAST VP? The single point of failure that exists today is unacceptable to many customers. If the service processor is unavailable, the algorithms will still run as I understand it (evaluating extents and promoting/demoting between tiers), but there is no way to manage them while the service processor is offline. What if some application goes nuts and needs to be excluded for a critical process? It seems archaic.

Hu Yoshida on 15 Jul 2011 at 12:38 pm

Thanks for the comment.

