A global pool of processors: a second look
by Hu Yoshida on Dec 15, 2010
I wrote about the need for a global pool of processors in November, and focused on the new requirements for different architectures. This topic is a big one, and deserves another look here.
The need for processors in storage systems
A storage system uses processors to handle the I/O from host server ports, the movement of data in and out of cache, the generation of RAID parity, and the I/O to the back-end media. Over time, additional processing requirements have been placed on these processors to support functions like snapshots, clones, copy on write, concatenation, tiering, replication, and migration. More recently, new functions like thin provisioning, wide striping, dynamic provisioning, storage virtualization, and virtual tiering at a sub-LUN or page level have placed even more demands on storage processors.
Processors in Modular Storage Systems
In modular storage systems, there are two processors with separate caches. Each processor supports up to eight host ports, and the LUNs accessed through those ports are assigned to the cache supported by that processor. Ownership of each LUN must be assigned to one processor cache or the other in order to avoid thrashing between the caches. Writes to the LUNs are mirrored to the other processor's cache for availability in case the active processor cache fails. These processors handle generating the RAID parity and access to the disks that are mapped to the LUNs, in addition to all the other functions like clones and replication.
Because of this limitation in the number of processors, modular systems do not scale to support a large number of applications. When one processor fails, the data in its cache is protected in the other processor's cache, but the surviving processor then becomes a single point of failure, and the storage system should be stopped so the failed processor can be repaired before continuing.
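The two-controller constraints above can be sketched in a few lines of Python. This is a simplified model for illustration only (the class and field names are mine, not any vendor's implementation): each LUN is owned by exactly one controller, I/O goes through the owner's cache, and every write is mirrored to the peer's cache so an owner failure loses no data.

```python
class Controller:
    def __init__(self, name):
        self.name = name
        self.cache = {}    # (lun, block) -> data this controller owns
        self.mirror = {}   # copies of the peer's writes, kept for failover
        self.failed = False

class ModularArray:
    """Two controllers; every LUN is owned by exactly one of them."""
    def __init__(self):
        self.ctl = (Controller("CTL0"), Controller("CTL1"))
        self.owner = {}    # lun -> 0 or 1 (fixed, to avoid cache thrashing)

    def write(self, lun, block, data):
        idx = self.owner[lun]
        owner, peer = self.ctl[idx], self.ctl[1 - idx]
        if not owner.failed:
            owner.cache[(lun, block)] = data
        if not peer.failed:
            peer.mirror[(lun, block)] = data   # mirrored write for availability

    def read(self, lun, block):
        idx = self.owner[lun]
        owner, peer = self.ctl[idx], self.ctl[1 - idx]
        if owner.failed:
            # Served from the mirror -- the peer is now a single point of failure.
            return peer.mirror.get((lun, block))
        return owner.cache.get((lun, block))

array = ModularArray()
array.owner["LUN1"] = 0
array.write("LUN1", 7, b"payload")
array.ctl[0].failed = True           # owning controller fails...
print(array.read("LUN1", 7))         # ...the data survives in the peer's mirror
```

The failover read also shows why the system should be stopped for repair: once the owner is down, only the mirror copy remains.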
Processors in Enterprise Storage Systems
In enterprise storage systems, there can be more than two processors that share a global pool of cache. A global pool of cache provides a single cache image of a LUN that can be shared by multiple processors. Control data in the cache or in a separate control store (in the case of Hitachi enterprise storage systems) enables the processors to know where their LUN resides in the cache and synchronizes the sharing of that LUN image with multiple processors.
The USP V can support up to 128 processors sharing up to 512 GB of global cache. Unlike modular storage systems, where each processor has to perform all of the functions, including RAID for the back-end storage, enterprise storage systems such as the USP V can use separate processors for RAID and the back-end storage while the front-end processors handle all the other functions. The use of a global cache and multiple processors enables an enterprise storage system to scale and provide enterprise availability even when one or more processors are stopped.
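The "single cache image" idea can be illustrated with a small sketch, assuming a shared directory that stands in for the control store described above (all names here are illustrative, not Hitachi's): the directory maps each LUN block to one slot in a global cache, so every front-end processor resolves the same copy rather than keeping a private one.

```python
class GlobalCache:
    def __init__(self):
        self.slots = {}        # slot id -> data
        self.directory = {}    # (lun, block) -> slot id (the control data)
        self.next_slot = 0

    def lookup(self, lun, block):
        slot = self.directory.get((lun, block))
        return None if slot is None else self.slots[slot]

    def store(self, lun, block, data):
        # All processors resolve (lun, block) to the same slot.
        slot = self.directory.setdefault((lun, block), self.next_slot)
        if slot == self.next_slot:
            self.next_slot += 1
        self.slots[slot] = data

class FrontEndProcessor:
    """Any number of these share the same cache image."""
    def __init__(self, cache):
        self.cache = cache

    def write(self, lun, block, data):
        self.cache.store(lun, block, data)

    def read(self, lun, block):
        return self.cache.lookup(lun, block)

cache = GlobalCache()
fep_a, fep_b = FrontEndProcessor(cache), FrontEndProcessor(cache)
fep_a.write("LUN9", 0, b"hello")
print(fep_b.read("LUN9", 0))   # one cache image, visible to both processors
```

A real control store also synchronizes concurrent access to that shared image; the sketch omits locking for brevity.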
The demand for more storage processing power
However, as more and more workload is added to storage systems, these processors are beginning to struggle. Virtual servers are increasing the I/O workload with every additional virtual machine. The front-end FC bandwidth is increasing to 8 and 16 Gbps. The back-end connections are converting to higher speed, point-to-point 6 Gbps Serial Attached SCSI or 10 Gbps FCoE, and the back-end workload is increasing with the use of wide striping and Flash drives.
Dynamic tiering and virtual tiering are demanding more processing power to track the activity and movement of pages or sub-LUN increments. Applications are offloading their software bottlenecks to the storage through APIs like VMware's VAAI, which enables the storage to do formatting, cloning, and cluster locking of virtual machines. All these higher speeds and additional functions can impact the primary functions and performance of the processors in a storage system.
The VSP introduces a global pool of processors
Hitachi took this into consideration when we built Virtual Storage Platform (VSP). The main architectural change was the creation of a separate pool of Virtual Storage Director (VSD) processors that can be shared across a switch matrix with all the front- and back-end processors. The VSD processors provide most of the functions other than the processing of I/O on the front end and the RAID generation and processing on the back-end.
The front- and back-end processors are dual core processors optimized for I/O functions. Since their instruction sets are optimized for I/O, they can perform that task better than general purpose quad core processors. The global pool of VSD processors can scale out from two quad core processors to eight, and is shared across the storage system to support general functions like tiering, replication, etc. There are other storage systems that use quad core processors, but those processors are locked into storage nodes and cannot share their processing power with other nodes. You can think of a global pool of processors like a global pool of cache: it can scale incrementally and be shared dynamically across the front-end and back-end processors.
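The utilization argument here is easy to see with a toy calculation (an assumed model, not VSP internals): when processors are dedicated to nodes, work on a busy node queues up while processors on an idle node sit unused, whereas a shared pool lets any processor take any task.

```python
def dedicated_completion_time(node_tasks, procs_per_node):
    """Each node's tasks may only run on that node's own processors."""
    return max(len(tasks) / procs_per_node for tasks in node_tasks)

def pooled_completion_time(node_tasks, total_procs):
    """Any processor in the shared pool may run any node's task."""
    total = sum(len(tasks) for tasks in node_tasks)
    return total / total_procs

# Two nodes, two processors each; node 0 is busy, node 1 is idle.
tasks = [list(range(8)), []]
print(dedicated_completion_time(tasks, 2))  # 4.0 -- node 1's processors sit idle
print(pooled_completion_time(tasks, 4))     # 2.0 -- all four share the work
```

The same total processing power finishes in half the time simply because it is pooled rather than partitioned.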
Since VSP also scales deep to support attached external storage, this global pool of processors can support external storage with functions like virtual tiering and VAAI integration. So if your storage system has trouble scaling, is lacking functionality, or lacks performance when it tries to add functionality, attach it to the VSP and realize the benefits of a global pool of processors.
Comments (9)
[...] think it was Nigel Poulton who mentioned this post by Hu Yoshida regarding Hitachi’s “global pool of processors” approach with the VSP. Based on what I read there, it sounds like the VSP and the VMAX share some architectural [...]
Is HDS preparing for the next generation many-core architecture or does HDS have any comments on this tech? Thanks
Thank you. This is a great question.
Hitachi is already using multi-core technology. The FED and BED that you see in the diagram above are dual core processors. The FEDs provide I/O processing for host server connections and for external storage systems. The BEDs provide I/O processing and RAID generation for internal storage media. The instruction set for these dual core processors is optimized for I/O processing, which makes them comparable in performance to general purpose quad core processors. The VSDs (Virtual Storage Directors) are quad core Intel processors, which are used to manage the general tasks of tiering, paging, copies, replication, and cache management.
Other storage vendors are moving toward the use of multi-core processors in their controllers, but they are using them in the same old way, substituting multi-core in place of single core controllers and maintaining the same storage architecture, where one controller does all the work of I/O processing, replication, tiering, copies, etc., in dedicated silos. They have not taken advantage of the multi-core technology by creating a shared pool of resources. Multi-core technology is game changing, but they have not changed their game to take advantage of it.
So yes, Hitachi is using multi-core processor architectures within the VSP storage system, and in addition, Hitachi has extended the storage architecture to optimize the use of multi-core technology by creating pools of quad core (VSD) and dual core (FED and BED) that share a global cache across a switch matrix.
With this multi-core processor and storage architecture the VSP can optimize the utilization of storage, offload more work from the applications via APIs, simplify operations, and provide multi-dimensional scaling for increasing server workloads.
[...] my last post on the global pool of processors in the Hitachi Virtual Storage Platform (VSP), there has been a some discussion of this by Nigel [...]
Thank you for your nice explanation! Four more questions I hope you can shed some light on, appreciated!
Does VSD/FED/BED share the same memory space or each has its own memory?
Does the 8 VSDs share the same memory space via crossbar switch?
How does VSD/FED/BED communicate with each other, through which kind of external protocol or share memory?
Can we say that the VSD/FED/BED modules compose a NUMA matrix, while each director itself is an SMP matrix?
Hello Mellon Head. Sorry for the delay as I have been travelling around Asia the past two weeks. Here are my responses to your questions.
1. Does VSD/FED/BED share the same memory space or each has its own memory?
Both. Each board has local memory and each has access to cache via switches. Memory on the VSD also provides “shared memory” functionality, with each VSD owning all the pertinent information for a set of LDEVs. This information is backed up in cache to allow fast loading to an alternate VSD in the event of a board failure.
2. Does the 8 VSDs share the same memory space via crossbar switch?
Yes, in the sense that global cache is a memory space.
3. How does VSD/FED/BED communicate with each other, through which kind of external protocol or share memory?
PCI Express, through the switches.
4. Can we say that the VSD/FED/BED modules compose a NUMA matrix, while each director itself is an SMP matrix?
NUMA is associated with server architectures. If there are multiple processor/RAM boards in a modular server (like an IBM pSeries) that are interconnected via a cross-bar switch, and each processor can access either the fast local RAM or the higher-latency remote RAM – that is NUMA. This RAM is used for a single purpose (O/S, applications, data).
The VSP is not a NUMA architecture, although it resembles one: each board (FED, VSD, BED) has local RAM, and two of these (FEDs and BEDs) also access remote RAM (shared cache striped across the DCA boards). The local RAM is for the “O/S”, applications, and control tables, while the remote RAM is used for data and metadata.
Note that user data is moved through the FED or BED boards to cache (DCA boards), but not kept in the FED or BED local RAM. Data never goes to a VSD board, but a VSD controls operations on data in cache using the FED or BED processors. The separation of control from user data ensures security for remote maintenance and removes contention for data cache.
Thank you very much for your great and clear explanation! You mentioned that the VSD doesn’t access the DCA memory space; how then can the VSD handle snapshot/replication and other upper-layer functions?
The VSD cannot touch any data. It is blocked from all data cache regions. The FED/BED “Data Accelerator” chips move the data once given an address and command by the VSD.
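This control/data separation can be sketched as follows. It is a hedged illustration of the idea described in the answer above, with names I have invented for the sketch: the control plane (standing in for a VSD) owns only metadata and issues an operation plus cache addresses, while a mover (standing in for the FED/BED "Data Accelerator") is the only component that touches user data.

```python
CACHE = {}   # stands in for data cache on the DCA boards

class DataAccelerator:
    """Data path: the only component that reads or writes user data."""
    def execute(self, op, src, dst):
        if op == "COPY":
            CACHE[dst] = CACHE[src]

class ControlProcessor:
    """Control plane: owns metadata, issues commands, never sees data."""
    def __init__(self, mover):
        self.mover = mover
        self.ldev_map = {}   # ldev -> cache address (control-store metadata)

    def snapshot(self, src_ldev, snap_ldev):
        src = self.ldev_map[src_ldev]
        dst = "slot:" + snap_ldev
        self.ldev_map[snap_ldev] = dst
        # Only an address and a command cross over to the data path:
        self.mover.execute("COPY", src, dst)

CACHE["slot:LDEV1"] = b"user data"
vsd = ControlProcessor(DataAccelerator())
vsd.ldev_map["LDEV1"] = "slot:LDEV1"
vsd.snapshot("LDEV1", "SNAP1")
print(CACHE["slot:SNAP1"])   # copied by the mover; the control plane never held it
```

Nothing in `ControlProcessor` ever holds user bytes, which mirrors the point that separating control from data removes cache contention and keeps remote maintenance away from customer data.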
[...] VSP can scale incrementally by adding pairs of central processor blades, I/O port processor blades, back end processor blades, cache modules, and disk modules as required. These storage resources are tightly coupled through an internal switch matrix to provide monolithic scale up. Multiple applications also can be supported with modular increments from a common pool of storage resources. VSP can also scale deep through virtualization of external storage systems, which can be integrated into this common pool of resources. [...]