Why are Hadoop and MapReduce Eating the World? (Part 1 of a Series on Hadoop Architecture)
by Ken Wood on Jul 2, 2012
As my guest blogger, Matt is starting to provide more insight into interesting subjects. Here, read part 1 of a multi-part series comparing the Hadoop and MapReduce architectures to traditional programming approaches. Why are Hadoop and MapReduce eating the world? What makes Hadoop so popular, and how does Hadoop process large volumes of data – dare I say it – BIG DATA – easily and cheaply? This is a nice contrast between past approaches to solving this problem and the modern techniques being employed today. Enjoy!
Hadoop and MapReduce – A Brief Introduction
Google’s pioneering MapReduce parallel programming framework, which was later cloned and extended by the open source community via Hadoop and its ecosystem (including HBase, Hive, Pig, and other software), today drives much of the web’s infrastructure. Search engine index creation and search operations, ad-targeting analytics, and social media graph processing are among the core workloads driving web operations today. Most of these infrastructure applications are built on Google’s GFS file system or the open source Hadoop Distributed File System (HDFS), together with the related MapReduce programming techniques. The technique is characterized by a series of transformations between sets of key-value pairs: an initial pass of local transformations known as the map phase, followed by a sort-partition-and-combine process known as the reduce phase. More complex applications generally require multiple map-reduce rounds to complete a calculation.
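To make that key-value flow concrete, here is a minimal word-count sketch that simulates the map, shuffle/sort, and reduce stages locally in plain Python. The function names and sample data are purely illustrative; a real Hadoop job would run each stage in parallel across many machines and disks.

```python
# A minimal, self-contained simulation of the MapReduce data flow (word count).
# Illustrative only: a real Hadoop job distributes each stage across many nodes.
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: turn one input record into intermediate (key, value) pairs."""
    for word in text.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle/sort: group all intermediate values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for a key into a final (key, result) pair."""
    return (key, sum(values))

documents = {1: "the quick brown fox", 2: "the lazy dog jumps over the fox"}
intermediate = [pair for doc_id, text in documents.items()
                for pair in map_phase(doc_id, text)]
results = [reduce_phase(key, values) for key, values in shuffle(intermediate).items()]
print(sorted(results))  # e.g. ('fox', 2), ('the', 3), and the rest with count 1
```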
Why has MapReduce become so popular? There are really three parts to that story, but it’s also important to remember that MapReduce was not invented in a vacuum. In the early 2000s, Google’s engineering teams had a very specific problem to solve: scaling operations by a factor of 100x (or so they thought; it later turned out that they needed to scale even further, by a factor of 10,000x and more). Existing distributed and parallel file systems couldn’t come close, either in terms of storage capacity and performance (100s to 1000s of petabytes in a single system built from commodity servers and disk storage), or in the ability to tolerate faulty software and hardware in extreme-scale environments with millions of components (where 2% to 5% of the system may be inoperable at any given time).
Let’s look at three reasons MapReduce has become popular.
Part 1: Simplified Parallel Programming
Parallel programming is hard. Really hard. In fact, it’s so hard that the most productive way to program for parallelism is to never do it directly. Instead, find some way to let the programmer describe the work to be performed such that (a) they can express the computation efficiently, succinctly, and easily; and (b) the programming model is expressive enough to meet requirement (a), yet restrictive enough that the parallelism inherent in the calculation is not hidden or lost in the act of writing the program, so that automated software tools can execute the program efficiently. MapReduce provides both (a) and (b), while scaling from small to extremely large datasets within the same programming and system framework.
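To see how little the programmer actually has to write, consider word count expressed in the style of Hadoop Streaming, one common way to run MapReduce jobs with ordinary scripts. The two small scripts below are sketches under that assumption; the framework, not the programmer, handles splitting the input, scheduling mappers and reducers across the cluster, partitioning and sorting the intermediate key-value pairs, and retrying failed tasks.

```python
#!/usr/bin/env python
# mapper.py -- Hadoop Streaming feeds input splits to this script on stdin.
# It emits one tab-separated "word<TAB>1" pair per word on stdout.
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print("%s\t%d" % (word, 1))
```

The matching reducer relies on the framework delivering its input grouped and sorted by key, so counting is just a running sum:

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming delivers mapper output sorted by key on stdin.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))   # flush the previous key
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))
```

These two stdin-to-stdout filters are essentially the whole program; everything else (data distribution, parallel execution, sorting, fault handling) is supplied by the framework when the job is submitted with the hadoop-streaming jar (exact invocation and paths depend on the installation).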
Part 2: The File System is the Computer (and the Database Too!)
Historically, large-scale computer systems have separated memory and compute resources from disk (and tape) storage. System (InfiniBand) or storage (Fibre Channel) networks are then used to connect the two. The problem is that these networks are expensive and generally lack web- or supercomputer-scale connectivity. Supercomputers still use this compute-and-memory-separated-from-storage paradigm, but require custom networks to make it work. The more serious problem with this approach (beyond cost and scalability limitations), however, is that it cannot exploit the benefits of locality that arise when computations are co-located with the file data they need from the storage system.
MapReduce is designed to co-locate computations with data, and in fact goes beyond this most excellent idea by leveraging the system-wide replication already required for data resilience to co-locate them flexibly. Furthermore, it can exploit the co-location to aid load balancing: it avoids placing computation on an already busy server, and it has alternatives because the data is replicated to other, potentially less busy, servers. With MapReduce, the file system is essentially embedded into the parallel computer, with significant performance benefits, both from load balancing and from exploiting locality between computations and data.
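As a simplified, hypothetical sketch of that scheduling idea (not Hadoop’s actual scheduler code): because each block has several replicas, the framework can usually find a copy of the needed data on a lightly loaded node and run the map task there, falling back to a remote read only when every node holding a replica is busy.

```python
# Illustrative sketch only -- not Hadoop's real scheduler. It shows how replication
# lets the framework trade data locality against load balance when placing a task.
def place_task(block_id, replica_locations, node_load, busy_threshold=0.8):
    """Pick a node for a map task, preferring nodes that already hold the data."""
    local_nodes = sorted(replica_locations[block_id], key=lambda n: node_load[n])
    if local_nodes and node_load[local_nodes[0]] < busy_threshold:
        return local_nodes[0]             # data-local execution on the least-busy replica
    # Otherwise fall back to the least-loaded node overall and read the block remotely.
    return min(node_load, key=node_load.get)

replicas = {"block-42": ["node-a", "node-b", "node-c"]}   # N=3 replication
load = {"node-a": 0.95, "node-b": 0.30, "node-c": 0.55, "node-d": 0.10}
print(place_task("block-42", replicas, load))             # -> 'node-b': holds the data, not busy
```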
Part 3: Simplified Operations via Fault-Tolerance in Extreme-Scale Environments
Extreme-scale web infrastructure, with millions of components and 100s or 1000s of petabytes of storage, absolutely requires system designs that transparently and automatically tolerate and recover from failures of all kinds, including data, storage, networking, server, and software faults. This fault tolerance, required for scalability, yields two additional benefits:
(a) Building on the simple MapReduce programming model for parallelism, the framework also tolerates and transparently recovers from a variety of system faults and load-imbalance conditions, with no additional effort from the programmer. The fault tolerance is built right into the framework. This contrasts with other popular parallel programming models such as MPI or OpenMP, and with pre-MapReduce programming frameworks at Google, which require the programmer to write custom, application-specific code to handle faults. That code adds significantly to program complexity, and the complexity reduces both performance and programmer productivity. MapReduce allows the programmer to avoid this complexity entirely (a toy sketch of the retry idea appears after item (b) below).
(b) What’s the difference between a server failure and a system administrator simply shutting down a server? From the MapReduce framework’s perspective, absolutely nothing! Hence, routine system maintenance does not generally require downtime: administrators can replace servers, disks, and network switches while maintaining high system performance and throughput. Failed components (e.g., disks, servers, file data, etc.) can be replaced when convenient, without disrupting operations, while the system automatically re-replicates data from copies on other servers to maintain redundancy at prescribed (generally N=3) levels.
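As a toy illustration of the point made in (a), the sketch below isolates the framework-side retry idea: a task that fails on one node is simply re-run on another, which is safe because map and reduce tasks are deterministic functions of their inputs. The node names, failure probability, and retry limit are invented for the example; real Hadoop behavior is configurable and considerably more sophisticated (speculative execution, blacklisting, and so on).

```python
# Toy illustration of framework-level fault tolerance -- not Hadoop's real code.
# The user's map/reduce functions never see the failure; the framework retries.
import random

def run_task(task, node):
    """Pretend to run a task; randomly 'fail' to simulate a crashed node or bad disk."""
    if random.random() < 0.3:
        raise RuntimeError("node %s failed while running %s" % (node, task))
    return "%s output" % task

def run_with_retries(task, candidate_nodes):
    """Framework-side logic: re-run the task on other nodes until it succeeds."""
    for node in candidate_nodes:
        try:
            return run_task(task, node)
        except RuntimeError:
            continue          # reschedule on the next candidate; user code is untouched
    raise RuntimeError("task %s failed on all candidate nodes" % task)

print(run_with_retries("map-007", ["node-a", "node-b", "node-c", "node-d"]))
```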
MapReduce integrates a widely applicable programming model that is implicitly parallel and scalable, tolerates system faults and load imbalance, and builds file storage directly into the computer. The difference between MapReduce and existing parallel programming and data storage models reminds me of the difference between aircraft carriers and battleships. During the first part of the 20th century, the battleship reigned supreme, and naval battles were primarily fought between fleets of battleships, which also provided limited coastal bombardment and amphibious operation support. Aircraft carriers were originally used strictly for reconnaissance, but faster ship speeds, more efficient operational practices, more powerful aircraft, and better tactics led to their use for both reconnaissance and attack over ranges of 100s or 1000s of miles, instead of the 10s of miles to which battleship fleets were restricted. By integrating long-range operations (think big data) with reconnaissance and attack (computation and storage), aircraft carriers revolutionized naval warfare to a similar degree that MapReduce is now revolutionizing how computing systems process huge data sets.