Hadoop and Storage Growth
by David Merrill on April 2, 2010
Yesterday (April Fools' Day) got off to a fun start with a clever view of the storage blogosphere. Thanks, Devang, for some comic relief.
I have been spending some time and effort recently (mostly as a student) studying new server/storage cloud infrastructures and architectures. My grasp of this subject is limited but growing, with most of my cloud-storage-econ theories emerging from the MS Azure architecture. Yesterday was my first on-site view and ‘pulling-apart’ of a Hadoop cloud infrastructure. I can see the obvious benefits of these low-cost, open source architectures for a specific type of search and distributed processing workload. Having worked for HDS for some 14 years now, I tend to see my customers as large block and file, enterprise-class data center customers. These new distributed systems may be a niche player now (like client/server was a decade ago), but I can see how these architectures could morph into and influence core IT infrastructures in the future.
My interest in Hadoop, Azure, S3 and others is in how the storage economic factors and principles behave. The customer I visited yesterday started with a middle-class JBOD storage approach some 12-18 months ago, but has since moved more of the workload onto a highly available NAS infrastructure. Many of these cloud filesystems (such as HDFS) can sit on top of older file systems, which enables them to work on enterprise-class arrays etc. I am sampling, learning and talking with various clients this month and next to explore the following questions:
- Do traditional storage econ principles apply to cloud storage economics?
- Are there new types of costs that could emerge from cloud storage architectures (adding to the 33 we have defined already)?
- How do scale-up (capacity) and scale-out (IO) differ from other storage architectures, and how do the upgrade costs compare?
- Where are the inflection points for TCO: at scale? At performance?
- Where does the trade-off between RAID and server/OS sprawl fall? It must come at some point, and it likely depends on the overall workload and capacity.
- Do these new architectures revert to some older problems of reliability that will force a more focused view on the cost of risk (outage, re-build times, management)?
On the surface, it appeared to me that we are going back to the past by taking out RAID and intelligent controllers in favor of local/cheap disk etc. I am getting a better understanding of how the economics would benefit these types of cloud architecture at small configs, but like everything else there will come a point where sustainability and expansion get more expensive. This is the elusive cross-over point that helps define the economic sweet spot for recommending a move from JBOD to DAS to possibly even SAN/NAS storage architectures.
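To make the cross-over idea concrete, here is a minimal sketch in Python. It assumes the simplest possible model, where each architecture's TCO is a fixed cost plus an all-in cost per TB, so the cross-over is where the two lines intersect. The `CostModel` class, the `crossover_capacity` helper and every number below are hypothetical and purely illustrative; they are not measured figures from any customer, and real cost curves are unlikely to be this linear.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CostModel:
    """Hypothetical linear TCO model: a fixed cost plus a per-TB cost."""
    name: str
    fixed_cost: float    # up-front cost, arbitrary currency units (assumed)
    cost_per_tb: float   # all-in cost per TB: hardware, labor, risk (assumed)

    def tco(self, capacity_tb: float) -> float:
        """Total cost of ownership at the given capacity."""
        return self.fixed_cost + self.cost_per_tb * capacity_tb


def crossover_capacity(cheap_start: CostModel,
                       cheap_scale: CostModel) -> Optional[float]:
    """Capacity (TB) where the two TCO lines intersect.

    Returns None when there is no cross-over at a non-negative capacity,
    i.e. when one option is cheaper at every size.
    """
    slope_gap = cheap_start.cost_per_tb - cheap_scale.cost_per_tb
    fixed_gap = cheap_scale.fixed_cost - cheap_start.fixed_cost
    if slope_gap <= 0 or fixed_gap < 0:
        return None
    return fixed_gap / slope_gap


# Illustrative numbers only: JBOD is cheap to start but costlier per TB once
# management and risk are counted; NAS costs more up front but less per TB.
jbod = CostModel("JBOD", fixed_cost=10_000, cost_per_tb=900)
nas = CostModel("NAS", fixed_cost=100_000, cost_per_tb=600)

crossover_tb = crossover_capacity(jbod, nas)
print(f"Cross-over at about {crossover_tb:.0f} TB")
```

Below the cross-over capacity the JBOD line is cheaper, and above it the NAS line wins. The interesting work is in finding realistic values for the per-TB term, since that is where sprawl, re-build times and the cost of risk would show up.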
If you have experience with scale and performance on cloud storage pools, please send a note or post to this blog.



