A Series on Hadoop Architecture
by Ken Wood on Jul 13, 2012
This is the second installment in my guest blogger Matt's series on Hadoop and MapReduce. Specifically, he explores the viability of the Hadoop and MapReduce framework for Computational Science and Engineering. Traditionally, CSE environments have focused on inter-processor and parallel communications to form huge compute clusters. Hadoop and MapReduce take a very different approach from these complex systems, but can they handle CSE workloads and eventually change the game? Read on to find out…
MapReduce’s Potential in Computational Science and Engineering (CSE) (part 2)
Hadoop and MapReduce — A Brief Introduction
As we mentioned in our previous post, Google's pioneering MapReduce parallel programming framework, later cloned and extended by the open source community via Hadoop and its ecosystem (including HBase, Hive, Pig, and other software), today drives much of the web's infrastructure. Search engine index creation, search operations, ad-targeting analytics, operations management and optimization, and social media graph processing are some of the core workloads driving web operations today.
The simplicity and regularity of MapReduce's semantics allow the underlying system software (including the Hadoop Distributed File System, HDFS, and the Google File System, GFS) to be optimized for streaming data flow while providing automated mechanisms for recovering from server, application, storage, or networking failures. The automated failure recovery made possible by the MapReduce framework brings two significant benefits: (1) application codes can rely on these automated recovery mechanisms, reducing development complexity and avoiding the need to re-implement failure recovery in a custom way in each application; (2) operating and maintaining a MapReduce cluster is simpler, since many failure modes are recovered from automatically.
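To make that restricted programming model concrete, here is a minimal word-count example written in the Hadoop Streaming style (a mapper and a reducer as plain scripts reading stdin and writing tab-separated key/value pairs). The script names are illustrative; this is a sketch, not production code:

```python
#!/usr/bin/env python
# mapper.py -- emits one (word, 1) pair per word. Hadoop Streaming
# feeds input splits on stdin and collects tab-separated pairs from stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming delivers mapper output sorted by key,
# so each word's counts arrive contiguously and can be summed in one pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))
```

Because the framework owns everything between these two phases (partitioning, shuffling, sorting, retrying), it can re-run a failed map or reduce task on another node without any help from the application code.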
Programming for Computational Science and Engineering
One area where MapReduce is rarely used today is computational science and engineering (CSE) applications. These applications typically model the underlying physics of the science being studied (e.g., ocean circulation, atmospheric weather, gas dynamics in star formation) or the engineering device or process being developed. These codes have been developed and tuned over years, sometimes decades, using traditional programming languages like Fortran, C, or C++ and libraries like MPI and OpenMP to support parallel operations. They are written as a series of fine-grain mathematical transformations on input data and intermediate data in memory, and the overhead of MapReduce's framework is likely too high to support this kind of fine-grain parallelism. Completely rewriting these applications to use MapReduce would throw away the prior software development investment, a very expensive and probably impractical strategy.
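For contrast, here is a sketch of the fine-grain pattern such codes depend on, written with the mpi4py binding purely for illustration (the 1-D domain, array sizes, and 3-point stencil are assumptions invented for this example, not any particular production code):

```python
# halo_exchange.py -- illustrative 1-D halo exchange with mpi4py.
# Each rank owns a slab of the domain and swaps one-cell-wide boundary
# "halos" with its neighbors on every time step -- the fine-grain,
# tightly synchronized communication pattern that MapReduce's
# map/shuffle/reduce semantics cannot express cheaply.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 1024                        # local cells per rank (illustrative)
u = np.zeros(n + 2)             # +2 ghost cells to hold the halos
u[1:-1] = rank                  # dummy initial condition

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for step in range(100):         # every step requires neighbor traffic
    # swap halos with both neighbors (no-ops at the domain edges)
    comm.Sendrecv(u[1:2], dest=left, recvbuf=u[-1:], source=right)
    comm.Sendrecv(u[-2:-1], dest=right, recvbuf=u[0:1], source=left)
    # simple 3-point stencil update (e.g., heat-equation relaxation)
    u[1:-1] = 0.5 * u[1:-1] + 0.25 * (u[:-2] + u[2:])
```

The point is the loop body: communication happens on every iteration, at the granularity of a few cells, which is exactly what MapReduce's batch-oriented phases were never designed for.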
Limits of Exploitable Parallelism in CSE
In addition, most CSE applications run on a few hundred processors at most (the median per run is probably no more than 8 to 16 cores). This hasn't changed much since the 1990s, when massively parallel processors (MPPs) first appeared, even though large supercomputers and MapReduce clusters today often exceed 10,000 cores. For applications parallelized with threading libraries like Apple's Grand Central Dispatch or Linux's pthreads, the exploitable parallelism is most likely on the order of 4 to 8.
Even in applications where plenty of parallelism exists and the codes are written in parallel form, there is often not enough data to justify running across more than a few dozen processors. On the other hand, a few dozen cores are becoming the commodity server sweet spot these days. Since the 1990s, we've gone from 1- to 2-processor machines to multi-core servers with 16 to 24 cores, with 32- and 64-core machines on the horizon. Today it is possible to run parallel calculations on 16-way multi-core single-box servers that would have been considered exotic parallel machines in the 1990s (when your humble correspondent was very active in parallel application development). Given Amdahl's Law, which dictates that the serial fraction of a computation, along with small amounts of load imbalance and inefficiency, can greatly limit the speedup achievable via parallelism, it's still difficult to run most parallel scientific applications efficiently on more than 16 to 32 cores.
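A quick back-of-the-envelope calculation shows why. Amdahl's Law gives speedup = 1 / (s + (1 − s)/p) for a serial fraction s on p processors; the 5% serial fraction below is an assumed figure chosen for illustration:

```python
# amdahl.py -- Amdahl's Law: speedup = 1 / (s + (1 - s) / p),
# where s is the non-parallelizable (serial) fraction of the run
# and p is the processor count.
def speedup(s, p):
    return 1.0 / (s + (1.0 - s) / p)

# Even a 5% serial fraction caps parallel efficiency hard:
for p in (8, 16, 32, 128):
    sp = speedup(0.05, p)
    print("p=%4d  speedup=%5.2f  efficiency=%3.0f%%" % (p, sp, 100 * sp / p))
# p=   8  speedup= 5.93  efficiency= 74%
# p=  16  speedup= 9.14  efficiency= 57%
# p=  32  speedup=12.55  efficiency= 39%
# p= 128  speedup=17.41  efficiency= 14%
```

At 32 cores, more than 60% of the machine is already being wasted; throwing more processors at a single calculation gives rapidly diminishing returns.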
MapReduce Cluster Nodes Have Optimal Parallelism for CSE Applications
On the other hand, typical MapReduce clusters have hundreds of large-memory, high-core-count (at least 4 or 8 cores today, moving toward 16 and 32 cores in the near future) multi-core servers with multiple terabytes of non-RAIDed local disk storage. Even so, using MapReduce directly to parallelize typical scientific or engineering codes does not appear to be the best approach. As mentioned earlier, these codes require fine-grain data sharing and regular synchronization, while MapReduce's semantics involve localized computation on independent pieces of data, followed by a shuffle and reduce phase to aggregate results. Its semantics are probably not general enough to support most codes built for computational science and engineering. The whole point of MapReduce is to simplify the programming model so that large amounts of data can be processed within a highly scalable compute and storage framework. So the question is: if we look at the problem differently, can we still exploit the MapReduce framework to get real computational science and engineering work done?
Why Not Use MapReduce to Run CSE Ensembles?
CSE software uses input data from external measuring devices or input models to drive its execution. It can be very useful to vary the input model or perturb the measured input data to create N input data sets, and then run N calculations on these varied/perturbed inputs and observe the differences between runs. These ensemble calculations are useful in many ways, including:
- Determining how sensitive the output results are to the inputs; systems whose outputs are highly sensitive to their inputs are less predictable, which is extremely useful information when analyzing the results
- Determining how sensitive the output results are to changes in one or a few input variables, which helps pinpoint the effects of specific inputs on the overall CSE software model
- Modeling various input configurations to determine which engineering design is the most efficient, the highest performing, or best optimizes some other desirable feature relative to the alternatives
As a concrete example, ensembles are used in numerical weather forecasting to determine how predictable the atmospheric state is at a particular point in time. Unstable atmospheric conditions that are less predictable can be tagged as such when providing forecast results.
The scalability and automation of MapReduce has the potential to accelerate the computation and analysis of ensembles. Each MapReduce node can be given one parallel calculation in the ensemble to perform during the map phase, while the reduce phase can be used to combine results, providing comparisons between ensemble members as well as aggregate statistics on the overall ensemble. The resulting output data can also remain in situ in the MapReduce cluster, available for further analysis, reprocessing, and long-term archiving. There is no need to provide an external shared storage file system (e.g., Lustre or a scale-out NFS server), since the data can stay resident within the MapReduce cluster. The goal is to "exploit the restricted programming model (of MapReduce) to parallelize the user program automatically and to provide transparent fault-tolerance" (quoting the original MapReduce paper by Dean and Ghemawat) for CSE ensemble programs.
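One plausible shape for such an ensemble job, sketched here with Hadoop Streaming: each map task receives one perturbed input descriptor, shells out to the existing (unmodified) simulation binary, and emits a summary metric, while the reducer computes ensemble statistics. The binary name (./run_model), the input-line format, and the scalar metric are illustrative assumptions, not a real system:

```python
#!/usr/bin/env python
# ensemble_mapper.py -- Hadoop Streaming mapper for an ensemble run.
# Each input line names one perturbed input deck; the existing,
# unmodified simulation binary does the real work. The binary name
# ("./run_model"), the line format, and the single scalar metric it
# prints are all illustrative assumptions.
import subprocess
import sys

for line in sys.stdin:
    member_id, input_deck = line.split()      # e.g. "m017 deck_017.nc"
    out = subprocess.check_output(["./run_model", input_deck])
    metric = float(out.decode().strip())      # e.g. max surface temperature
    print("ensemble\t%s:%f" % (member_id, metric))
```

```python
#!/usr/bin/env python
# ensemble_reducer.py -- aggregates per-member metrics into the kind of
# ensemble statistics described above (here just mean and spread).
import sys

values = []
for line in sys.stdin:
    _, payload = line.rstrip("\n").split("\t")
    _member, metric = payload.split(":")
    values.append(float(metric))

mean = sum(values) / len(values)
var = sum((v - mean) ** 2 for v in values) / len(values)
print("members\t%d" % len(values))
print("mean\t%f" % mean)
print("stddev\t%f" % (var ** 0.5))
```

Since the heavy lifting stays inside the untouched simulation executable, the MapReduce layer only schedules runs and aggregates results, which is exactly the coarse-grain parallelism the framework handles well.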
Using MapReduce for ensembles could provide:
- Simplified management of ensemble computing and results
- Higher performance and efficiency by leveraging the moderate parallelism in each MapReduce node, rather than trying to make a single calculation run efficiently across a huge number of processors
- Automated load balancing and simplified cluster management
- Fault-tolerance in commodity server clusters allowing the ensemble jobs to complete, even in the presence of system failures
Leveraging Hadoop MapReduce clusters for ensemble computing could preserve most of the existing CSE software development investment (which doesn't use MapReduce programming) while making ensemble computing easier to program and execute, and allowing the ensemble outputs to be analyzed and stored conveniently and at scale. So, although Hadoop and MapReduce have not been used much for computational science and engineering, ensemble computing may be the killer application for them in this domain.