The Use of Switches in Storage Systems
by Hu Yoshida on Feb 8, 2010
Hitachi Data Systems was the first vendor to deliver a switch-based storage architecture, over ten years ago. Recently, other storage vendors have started to deliver storage systems that include a switch in their architecture. However, the new switch architectures are designed for loose coupling of modular storage nodes, while the Hitachi architecture is designed for tight coupling of storage resources.
In 2000, Hitachi Data Systems introduced the Lightning 9900 storage subsystem with an internal switch that tightly coupled Front End (FE) and Back End (BE) port processors through a global cache. This enabled any-to-any connection between the FE storage ports and the BE disk controllers. If an application needed more FE port processing power, it could simply connect more FE ports through alternate paths that could be switched to the cache image of its data. The internal switch provided the ability to scale up to meet increasing server demands. (See my previous post on scale up versus scale out.) Besides scaling up, the switch enables the storage system to dynamically scale out by partitioning the storage resources for different applications at different times, based on policies that are triggered by time or events. Availability is also improved through its ability to switch around hotspots and failures.
Supporting this internal switch is a separate control memory, which contains control information about port assignments, cache slots, track tables, and now, with the USP V, paging information for Dynamic (thin) Provisioning. By changing bits in the control memory, the configuration can be changed dynamically and data can be moved between tiers of storage without interruption to the application. This also makes it possible to attach external storage systems, which are virtualized through the USP V’s global cache. The separation of control from data eliminates the contention that would occur between the two if they resided in the same cache memory. A separate control memory also provides security for call-home maintenance, since the remote service rep can monitor and diagnose problems without exposing the data cache.
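To make the idea of a separate control memory concrete, here is a minimal sketch of a metadata-only page table for thin provisioning and tier migration. It is an illustration under stated assumptions, not the USP V's actual data structures: the `ControlMemory` class, the tier names, and the page numbers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ControlMemory:
    """Hypothetical control store: holds metadata only, never user data."""
    # (volume, virtual page) -> (tier, physical page)
    page_table: dict = field(default_factory=dict)
    # tier name -> list of free physical page numbers
    free_pages: dict = field(default_factory=dict)

    def resolve(self, volume, vpage):
        """Return the backing page, allocating from tier1 on first touch."""
        key = (volume, vpage)
        if key not in self.page_table:
            tier = "tier1"  # assumed default tier for new allocations
            self.page_table[key] = (tier, self.free_pages[tier].pop(0))
        return self.page_table[key]

    def migrate(self, volume, vpage, new_tier):
        """Repoint a virtual page at another tier; the host address is unchanged."""
        old_tier, old_page = self.page_table[(volume, vpage)]
        new_page = self.free_pages[new_tier].pop(0)
        # A real system would copy the data before updating the mapping.
        self.page_table[(volume, vpage)] = (new_tier, new_page)
        self.free_pages[old_tier].append(old_page)
        return new_page


cm = ControlMemory(free_pages={"tier1": [0, 1, 2], "tier2": [100, 101]})
first = cm.resolve("vol0", 7)   # allocated on first touch
again = cm.resolve("vol0", 7)   # same mapping, no new allocation
cm.migrate("vol0", 7, "tier2")
moved = cm.resolve("vol0", 7)   # now backed by tier2
```

The host keeps addressing `("vol0", 7)` throughout; only the mapping in the control store changes, which is the sense in which a configuration change is "changing bits in the control memory."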
Other storage vendors have introduced storage systems that include switches for loose coupling of modular storage systems. This enables storage systems to scale out to large capacities by modular increments. But it does not enable dynamic scale out across the resources of multiple modules to meet different application needs, and it does not scale up to meet increasing server needs. In a loosely coupled configuration, the switch resides outside the modular storage systems and has some capability to transfer workload between modular systems, but the maximum storage resources that can be applied to a workload are those of one modular system’s FE and BE processors, cache, and attached disks. It cannot scale up to combine the use of resources across multiple modular storage nodes. Since control data is not separate, the metadata needed to manage the transfer of workload between modular nodes and synchronize the consistency of the separate caches can degrade performance.
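The contrast drawn above can be put in a few lines of arithmetic. This sketch models only the claim as the post frames it: in a tightly coupled system a workload can draw on the sum of all resources, while in a loosely coupled one it is capped at a single module. The `Module` type and the figures are hypothetical and do not describe any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    fe_processors: int
    be_processors: int
    cache_gb: int


def tightly_coupled_ceiling(modules):
    """All resources sit behind one internal switch and a global cache."""
    return Module(
        sum(m.fe_processors for m in modules),
        sum(m.be_processors for m in modules),
        sum(m.cache_gb for m in modules),
    )


def loosely_coupled_ceiling(modules):
    """A workload lives on one module, so the ceiling is the largest module."""
    return max(
        modules,
        key=lambda m: (m.fe_processors, m.be_processors, m.cache_gb),
    )


# Four identical, made-up building blocks.
modules = [Module(fe_processors=8, be_processors=8, cache_gb=64)] * 4
tight = tightly_coupled_ceiling(modules)
loose = loosely_coupled_ceiling(modules)
```

With these made-up numbers, `tight` is four times `loose` on every axis. Whether a given product really behaves like the loosely coupled case is exactly what the comments on this post argue about.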
Attached is a simplified drawing that illustrates the differences in how switches are used to tightly couple or loosely couple storage systems.
Comments (7)
You left out the 3PAR model, which uses tightly clustered nodes that communicate in a mesh topology.
Sigh – more misleading information and twisted assertions.
FUD masquerading as insight is still FUD.
You assert that scale-out cannot meet increasing server demands – patently false.
You assert that scale-out cannot leverage any resources on other nodes to service “increased server demands” – patently false.
You assert that only by separating data cache from control store can you eliminate memory/resource contention – patently false.
You assert that managing memory as a single global resource across a scale-out architecture “can degrade performance” – while linguistically true (almost anything “can” happen), in most current scale-out architecture implementations, that statement is also FALSE.
You continue to use these uninformed (and fictitious) assessments to defend your aging USP-V architecture, and to cast doubt on your competitors’ products.
Please explain to me how an application server can load balance across alternate paths to separate VMAX nodes or 3PAR nodes and apply the processing power of both those nodes to the same I/O workload.
I’ll explain for Symmetrix – I cannot attest it is the same for 3PAR.
First, with any Symmetrix, the drive slices that make up a LUN, whether mirrored, RAID 5 or RAID 6, can be on ANY drive in the system, off of ANY of the back-end channel pairs, connected to any of the DAs (DMX and earlier) or Engines (V-Max). In fact, this is the best practice, as it engages as many CPUs as possible to drive subsets of the device/LUN (or CKD volume, for that matter). (And indeed the RAID calculations require that the CPUs share data, which they do across the Direct Matrix in DMX and across the Virtual Matrix in V-Max.) In this fashion, all of the back-end CPUs are supporting the application I/O demands for each LUN.
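As a rough illustration of the point about spreading a LUN's slices over the back end, the sketch below places slices round-robin across a set of directors and counts how many slices each one serves. The round-robin policy and the director names are assumptions made for the example; the comment does not say how Symmetrix actually places slices.

```python
from collections import Counter


def place_slices(num_slices, directors):
    """Assign slice i to a director in rotation (hypothetical placement policy)."""
    return [directors[i % len(directors)] for i in range(num_slices)]


# Made-up director names; a real system would report its own.
directors = ["DA-1", "DA-2", "DA-3", "DA-4"]
layout = place_slices(8, directors)
load = Counter(layout)  # slices served per director
```

Every director ends up with the same share of the LUN, so I/O to that LUN engages all of the back-end processors rather than one.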
On the front-end, these LUNs/volumes can be mapped to every front-end port, but usually the best practice is to map them to as many different Engines (FAs on DMX) as required to meet the application/server I/O requirements. Any I/O for a given LUN can thus be directed to any port, and thus every CPU can be engaged to service I/O requests.
Every application/host can load balance across these ports in whatever logical manner they choose. Many path management products will only round-robin across the available paths, and this is indeed one form of load balancing. EMC’s PowerPath provides an even more intelligent approach by monitoring the response time of each path, and then sending requests down the fastest path with the shortest service queue of outstanding requests. This approach is truly dynamic, and will automatically adapt to changes in load on the fabric (perhaps from another server/application demanding more resources). It will also adapt to an FA/Engine port that is responding slower – again, perhaps due to changes in the port/CPU workload.
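The difference between the two path-selection styles described above can be sketched as follows. This is not PowerPath's algorithm, which is not described here in any detail: the `Path` type, the path names, and the scoring heuristic (response time multiplied by queue depth plus one) are hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Path:
    name: str
    avg_response_ms: float  # recent response time, assumed to be measured elsewhere
    queue_depth: int        # outstanding requests on this path


def round_robin(paths):
    """Return an iterator that rotates through the paths, ignoring load."""
    return cycle(paths)


def pick_adaptive(paths):
    """Pick the path with the lowest estimated wait (hypothetical heuristic)."""
    return min(paths, key=lambda p: p.avg_response_ms * (p.queue_depth + 1))


paths = [
    Path("fa-1a", avg_response_ms=2.0, queue_depth=6),
    Path("fa-2a", avg_response_ms=3.0, queue_depth=1),
    Path("fa-3a", avg_response_ms=2.5, queue_depth=4),
]

rr = round_robin(paths)
rr_order = [next(rr).name for _ in range(4)]

before = pick_adaptive(paths).name
paths[1].queue_depth = 9          # another host starts loading fa-2a
after = pick_adaptive(paths).name
```

Round-robin keeps visiting every path in turn regardless of load, while the adaptive picker moves away from `fa-2a` as soon as its queue grows.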
By the way, this is nothing new, nor a special configuration – this is fundamentally how Symmetrix has worked for generations, with each hardware change merely providing a different approach to interconnecting the cooperating processes that service I/O. This approach to multiprocessing that engages all available resources in a symmetrical fashion is in fact the root of the name Symmetrix. And still, customers can choose to segment the resources of their Symmetrix by limiting their LUNs and host connections to using only a subset of the available paths. But even with such physical segmentation, all of the global memory and even some processing from every Engine will be utilized (cache can be logically partitioned to limit the amount used by a group of devices).
Any more questions?
Barry, an old English idiom says “People who live in glass houses should not throw stones”.
It is true that IBM invented FUD in the ’70s, but EMC has excelled at it in the last two decades. Your blog and your comments contain more FUD than any other industry blog.
As Robert Weilheim pointed out, the 3PAR architecture is a mesh. Data on disk is distributed across multiple nodes and all controllers conduct I/O transfers with all other controllers in the system. As I/Os enter a 3PAR system they are “fanned out” to other controllers to complete the task. That means that even single-path I/Os in a 3PAR system can use the processing power of multiple nodes simultaneously.
Barry, thanks for the reply. I have no doubt that you can have an I/O request come in from one VMax node and have it routed to another VMax node for access to the cache and back end disk. The question was how is that done?
I think I understand how EMC does it in the DMX. You statically map the global cache to the Front Adapters and Back Adapters over a Direct Matrix with a BIN file. Once that mapping is done you can process I/Os from one or more FAs to a BA for disk access.
On the VMax it looks like you have a cluster of nodes where each node has a pair of controllers, each with its own local memory and integrated FA and BA ports. I understand that you still use BIN files to do the configuration mapping. But now, instead of going directly from one or more FAs to a global cache to a BA, you seem to take a couple of extra hops. It looks like you have to involve two controller nodes to do what used to be done with one in the DMX.