Primary Storage Deduplication without Compromise
by Hu Yoshida on Apr 3, 2013
Deduplication (dedupe) has been available for some time; however, its performance impact has made it impractical to use during primetime file serving. Data has to remain undeduplicated until the task can be scheduled during off-peak hours. This imposes a capacity consumption “tax,” works against the primary goal of data deduplication, and increases complexity due to the need to monitor and schedule files for deduplication.
Hitachi Data Systems Delivers Primary Storage Dedupe
Hitachi Data Systems has introduced primary storage dedupe for our Hitachi NAS Platform (HNAS) and our Hitachi Unified Storage (HUS) family that does not compromise on performance. Our “Deduplication without compromise”:
- Is automated for less administration.
- Implements data-in-place deduplication.
- Will throttle back automatically when workload thresholds are reached.
- Leverages an advanced cryptographic hashing algorithm to ensure data integrity.
- Scales to dedupe the entire usable capacity of a filer.
Deduplication without compromise results from our unique NAS architecture, which includes an object-based File system Offload Engine (FOE) powered by FPGAs. Essentially, hashing and chunking (the hard part of deduplication) become an attribute of persisting a file, which sets our offering apart from other NAS appliances: hashing and chunking are accelerated in hardware, not in software alone. A base hashing/chunking engine license is included free of charge, and depending on user performance requirements, up to three additional hashing/chunking engines can be licensed. (Note: Additional engines work in parallel, so dedupe performance increases nearly fourfold.)
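To make the hashing/chunking step concrete, here is a minimal Python sketch of the logical operation: splitting a file's bytes into fixed-size chunks and fingerprinting each with a cryptographic hash. The chunk size and use of SHA-256 are illustrative assumptions; HNAS performs this work in FPGA hardware and its internal parameters are not public.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size for illustration only


def chunk_and_hash(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a byte stream into fixed-size chunks and fingerprint each one.

    Returns a list of (offset, hex digest) pairs. A cryptographic hash
    (SHA-256 here) makes accidental fingerprint collisions astronomically
    unlikely, which is what protects data integrity during dedupe.
    """
    index = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        index.append((offset, hashlib.sha256(chunk).hexdigest()))
    return index


# Two files that share a chunk produce the same fingerprint for that chunk,
# which is what marks the chunk as a deduplication candidate.
a = chunk_and_hash(b"A" * 4096 + b"B" * 4096)
b = chunk_and_hash(b"A" * 4096 + b"C" * 4096)
assert a[0][1] == b[0][1]  # identical first chunks -> identical hash
assert a[1][1] != b[1][1]  # different second chunks -> different hash
```

Because identical content always yields an identical fingerprint, duplicate chunks can be detected by comparing small digests instead of comparing full data blocks.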
We started shipping the latest version of our HNAS OS with deduplication in early January 2013 and the adoption numbers have far exceeded expectations. The majority of our shipments have been the base version, which meets the needs of many of our customers. We have also shipped a number of premium licenses to customers requiring higher performance.
An example of customer usage follows:
A Fortune 500 global semiconductor and electronics company is an early customer who evaluated our dedupe capabilities against products from other vendors. They found that the HNAS dedupe method was better than the other vendors’ and were impressed with the speed of a single hashing/chunking engine, which can dedupe 1.2 million files in 16 minutes! They were further impressed by the minimal impact on primetime file serving activity. As a result, this company has decided to deploy HNAS deduplication in their environment.
Automation Eliminates The Impact On File Serving Performance
The minimal impact on file serving activity is due to an intelligent deduplication process that knows when new data is added and automatically starts up the deduplication engine(s) as long as the system is not busy. When the file serving workload reaches preset thresholds, the deduplication engines throttle back, preventing any impact to file serving performance, then automatically throttle up again when the system is less busy.
Minimal Overhead for Deduplication
Data is stored as it normally would be in the file system, with the capacity reclamation phase of the deduplication process operating outside of the data path. Instead, the reclamation phase combs through data already in the file system, eliminating redundancies. To do this, a database of hashes is used to identify candidate chunks that can be deduplicated, yet due to our unique architecture the database is very capacity efficient. We believe that a highly efficient database of hashes and our differentiated approach to storing the database lengthen the distance between us and other offerings in the market. (Note: In a future post, my colleague Michael Hay will explore how our unique architecture benefits the deduplication process.)
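The out-of-path reclamation pass can be illustrated with a small Python sketch: data is already written in place, and a later pass consults a hash index to remap duplicate blocks to a single canonical copy. This is a conceptual model only, not HNAS's on-disk format or database layout.

```python
import hashlib


def reclaim(blocks: dict):
    """Post-process (data-in-place) dedupe sketch.

    `blocks` maps block addresses to their stored contents. Returns a mapping
    from each address to the canonical address holding that content, plus the
    number of blocks whose space could be reclaimed.
    """
    seen = {}        # fingerprint -> first (canonical) address
    remap = {}       # address -> canonical address
    reclaimed = 0
    for addr, data in blocks.items():
        fingerprint = hashlib.sha256(data).digest()
        if fingerprint in seen:
            remap[addr] = seen[fingerprint]  # point duplicate at canonical copy
            reclaimed += 1
        else:
            seen[fingerprint] = addr
            remap[addr] = addr
    return remap, reclaimed


# Three stored blocks, two of which hold identical content:
remap, freed = reclaim({0: b"alpha", 1: b"beta", 2: b"alpha"})
assert remap == {0: 0, 1: 1, 2: 0}
assert freed == 1  # one block's capacity can be reclaimed
```

Note that the index stores only small fingerprints, not the data itself, which is why a hash database can stay compact relative to the capacity it deduplicates.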
The maximum size of an HNAS file system is 256 TB, and data from the entire file system is a target for deduplication. Interestingly, we are hearing that other products on the market have artificial boundaries, like 100 TB fences, forcing customers into a higher capacity “tax” bracket. Furthermore, our multi-petabyte HNAS global namespace can virtualize multiple HNAS file systems (even deduplicated ones) under a common directory address space, so enterprise scalability is assured.
The efficiency of dedupe depends on the dataset and file system block sizes. Dedupe of virtual server and VDI environments is extremely efficient. The efficiency will be comparable to other dedupe algorithms; the major differences will be performance, scalability and ease of use.
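A quick worked example shows why VDI datasets dedupe so well. The figures below are hypothetical assumptions for illustration, not HNAS measurements.

```python
def dedupe_ratio(logical_gb: float, physical_gb: float) -> float:
    """Space-savings ratio: logical data presented per physical capacity used."""
    return logical_gb / physical_gb


# Hypothetical VDI estate (assumed numbers): 100 desktop images of 20 GB each,
# where 90% of every image's blocks match a shared 18 GB base image.
logical = 100 * 20          # 2000 GB presented to clients
physical = 18 + 100 * 2     # one shared base (18 GB) + 2 GB unique per image
ratio = dedupe_ratio(logical, physical)
print(f"dedupe ratio \u2248 {ratio:.1f}:1")  # roughly 9:1 under these assumptions
```

With mostly unique data (e.g., compressed media), the same arithmetic yields a ratio near 1:1, which is why results always depend on the dataset.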
The Net Benefit is Lower TCO and Consistent Capacity Efficiencies
The net benefit of HNAS dedupe is lower total cost of ownership. This is achieved through increased capacity efficiency without compromising performance or scalability, and less manual intervention because no special scheduling, configuration, tuning or monitoring is required.
Primary storage deduplication without compromise on HNAS and HUS is available now, and we have customers already in production achieving better results than they could realize from competitive products.
For more information see here.
Great blog post Hu. The one big concern IT has with implementing any primary storage deduplication solution is performance, and the biggest value here is the ability to manage I/O performance automatically so it’s not an issue.