To Delete or Not Delete – Is that the question?
by Hu Yoshida on Aug 18, 2011
I came across this interesting post by Floyd Christofferson at the Active Archive blog.
Floyd cites a 2008 study by the University of California, Santa Cruz, which analyzed network file system workloads on a 22TB active storage pool used by 1,500 employees. The study showed that 95% of files are reopened fewer than five times, and that over 60% of the reopens occur within a minute of the first open. It also found that over 76% of the files are never opened by more than one client, and that most files are not reopened once they are closed.
The majority of the files in this 22TB pool are not going to be reopened or changed. Christofferson concludes that business users have a difficult time determining which files to delete or remove from active storage, and as a result the data center disk storage continues to grow at an astronomical rate. The problem is actually even worse than that, considering these file systems are also replicated for backup cycles and distribution.
It should also be noted that even if the users were conscientious about deleting files, it doesn’t mean that the storage capacity for those files is available for reuse. Deleting files does not recover or recycle space unless the file system does it. The storage system cannot recover the space for a deleted file unless the file system tells the storage system that the extents for that file are available for reuse, and the storage system has the capability to recover that space through page-level thin provisioning. So, in some cases, the deletion of a file may not make any difference at all. Look for file systems and storage systems that can recover file space through the SCSI WRITE SAME command or the newer UNMAP command.
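To make the point concrete, here is a minimal sketch in Python of a toy thin-provisioned pool. It is not how any real array or file system is implemented, and every name in it (`ThinPool`, `FileSystem`, `PAGE_SIZE`, `issue_unmap`) is made up for illustration. It only shows that a delete frees physical pages when the file system passes the freed extents down and the pool can release whole pages.

```python
PAGE_SIZE = 4  # blocks per pool page; an arbitrary value for this toy model


class ThinPool:
    """Toy thin-provisioned volume: a page is allocated on first write."""

    def __init__(self):
        self.allocated = set()  # page numbers currently backed by capacity

    def write(self, first_block, count):
        for block in range(first_block, first_block + count):
            self.allocated.add(block // PAGE_SIZE)

    def unmap(self, first_block, count):
        # Only pages that lie entirely inside the unmapped range are released.
        start_page = -(-first_block // PAGE_SIZE)       # round up
        end_page = (first_block + count) // PAGE_SIZE   # exclusive, round down
        for page in range(start_page, end_page):
            self.allocated.discard(page)


class FileSystem:
    """Toy file system that places files in consecutive blocks."""

    def __init__(self, pool, issue_unmap):
        self.pool = pool
        self.issue_unmap = issue_unmap  # does a delete tell the storage?
        self.files = {}                 # name -> (first_block, count)
        self.next_block = 0

    def create(self, name, count):
        self.files[name] = (self.next_block, count)
        self.pool.write(self.next_block, count)
        self.next_block += count

    def delete(self, name):
        first_block, count = self.files.pop(name)
        if self.issue_unmap:
            self.pool.unmap(first_block, count)


def pages_in_use_after_delete(issue_unmap):
    fs = FileSystem(ThinPool(), issue_unmap)
    fs.create("report.doc", 8)  # occupies two full pages
    fs.delete("report.doc")
    return len(fs.pool.allocated)
```

In this model, `pages_in_use_after_delete(False)` leaves both pages allocated even though the file is gone, while `pages_in_use_after_delete(True)` returns them to the pool.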
Another way to answer this question is through Hitachi Data Ingestor, which looks like an NFS or CIFS filer to the user. It replicates files automatically over REST interfaces to a Hitachi Content Platform, where each file is ultimately stored as a single instance in an object format.
Since the file is replicated, it does not need to be backed up.
Hitachi Data Ingestor acts like a bottomless filer: when the local filer reaches 90% of its capacity, it deletes the files in excess of the threshold and replaces each one with a 4KB stub that links to the copy of the file held in the content platform.
This way, a user does not need to worry about deleting files. Let Hitachi Data Ingestor handle that.
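The threshold behavior described above can be sketched roughly as follows. This is an illustration only, not Hitachi Data Ingestor's actual logic: the 90% threshold and 4KB stub size come from the description above, but the least-recently-accessed ordering, the rule that only already-replicated files are stubbed, and all names (`LocalFile`, `stub_until_below_threshold`) are assumptions.

```python
from dataclasses import dataclass

STUB_BYTES = 4 * 1024  # size of the stub left on the local filer
THRESHOLD = 0.90       # fraction of local capacity that triggers stubbing


@dataclass
class LocalFile:
    name: str
    size: int            # bytes
    last_access: float   # timestamp; smaller means older
    replicated: bool     # already copied to the content platform?
    stubbed: bool = False

    @property
    def local_bytes(self):
        return STUB_BYTES if self.stubbed else self.size


def used_bytes(files):
    return sum(f.local_bytes for f in files)


def stub_until_below_threshold(files, capacity_bytes):
    """Stub files until local usage is at or below the threshold."""
    limit = THRESHOLD * capacity_bytes
    stubbed = []
    # Assumption: the oldest-accessed files are stubbed first.
    for f in sorted(files, key=lambda f: f.last_access):
        if used_bytes(files) <= limit:
            break
        # Assumption: never stub a file that has no replica yet.
        if f.replicated and not f.stubbed and f.size > STUB_BYTES:
            f.stubbed = True
            stubbed.append(f.name)
    return stubbed
```

A user reading a stubbed file would be served from the content platform through the link in the stub; that recall path is left out of the sketch.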
Comments (4)
Regarding your comment “Since the file is replicated, it does not need to be backed up,” this really comes down to a matter of trust and how many replicas you have or can afford to have.
At what point do static point-in-time backups become redundant (pardon the pun)? Are we to take on faith that all issues with data corruption or destruction are solved and no longer require mitigation?
This is exactly the fire we are fighting today. Although we are not solving it with HDI, we are working towards a solution that allows us to get the stale data out of our live data streams (be it live projects or just the daily/weekly backup firehose).
Yes, replicas are a matter of trust and affordability. A replica that is stored at the same location is exposed to a location failure. If you can afford it, you can have a replica stored off site or on multiple sites, which adds cost but provides greater assurance.
There are some things that we need to take on faith, but protecting data in a content platform like the Hitachi Content Platform eliminates or addresses a lot of those issues. We take a hash of the data as it is ingested, which lets us check the hash from time to time, and again when we retrieve the data, to ensure immutability. We encrypt the data to ensure privacy, create partitions to ensure safe multi-tenancy with no escalation of management privileges, set policies for data shredding at end of life, and set legal holds when required. We do single instance store, but we do not do deduplication, on the rare chance that different data could have the same hash. We also have the ability to replicate and version the data x number of times.
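The hash check described above can be illustrated with a minimal Python sketch. It is not the Hitachi Content Platform's implementation: the hash algorithm is not named in this discussion, so SHA-256 is an assumption, and the in-memory `ObjectStore` class and its method names are hypothetical.

```python
import hashlib


class ObjectStore:
    """Toy in-memory store that records a digest for each object at ingest."""

    def __init__(self):
        self._objects = {}  # object_id -> (data, hex digest taken at ingest)

    def ingest(self, object_id, data):
        # Assumption: SHA-256; the real platform's algorithm is not stated.
        digest = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = (bytes(data), digest)
        return digest

    def verify(self, object_id):
        """Recompute the hash and compare it with the one taken at ingest."""
        data, digest = self._objects[object_id]
        return hashlib.sha256(data).hexdigest() == digest

    def retrieve(self, object_id):
        # Check on every read, in addition to any periodic verify() sweeps.
        if not self.verify(object_id):
            raise ValueError(f"object {object_id!r} failed its integrity check")
        return self._objects[object_id][0]
```

If the stored bytes change after ingest, `verify` returns `False` and `retrieve` refuses to hand back the data; a real system would then fall back to another replica.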
Thanks for the comment Steven. Knowing what to delete is a big problem.
Another problem is deciding what to keep when you are decommissioning applications. Is this usually an IT decision or a Records Management decision?