Software Scalability with MapReduce

Craig Henderson

First published online April 2010

The architecture of a software system is the structure and design of the components that work together to achieve the goal of the system as a whole. On large-scale systems, the architecture can describe independent components and their interactions: interface specifications, communication mechanisms and coupling. Components can be internal to the software system; they can be multiple software components on a single server or multiple components executing across a network of servers.

Reusability has appealed to computer scientists since the first stored-program computer, EDSAC (Electronic Delay Storage Automatic Calculator), at the University of Cambridge in the United Kingdom in 1949, for which the first subroutine library was proposed. Reuse can occur in many ways in a computer system: common code factored into subroutines and called multiple times from within the application software, or common modules linked together in different configurations. There are now widespread examples of software component reuse in commercial and open-source systems. The software industry has embraced the concept of reuse to such an extent that it is now ubiquitous and seldom referred to as reuse anymore.

Established examples of software reuse are component libraries that are developed and maintained for the sole purpose of being integrated into a software application – i.e. being reused by a number of applications. These libraries are widely available in the open-source community and from commercial software houses. Application frameworks are an extension of component libraries that provide a set of subroutines fundamental to the structure and execution of an application. Rather than providing functional elements that are incorporated into an application, the library provides a rich architectural environment into which the developers add bespoke functionality. Application Programming Interfaces (APIs) are a popular mechanism for software vendors to expose functionality from their applications to third-party developers in a controlled and secure way without compromising their intellectual property. This enables vendors to improve their market penetration by making their application functionality available through other applications, while maintaining their core business assets. In the internet era, the idea of APIs has been instrumental in the adoption and development of the Web as a platform for modern computing and in reaching a new audience.

Traditional application-bound APIs have been re-engineered into lightweight and loosely coupled interfaces for exchanging data over relatively slow networks, and new technologies and concepts such as Service Oriented Architecture, AJAX and Adobe Flash have appeared. Web technologies have matured, and application frameworks are now available that provide a rich set of application development tools; Google Gears, Dojo and Adobe Flex are examples of these. With the advent of Web 2.0, a new reusable concept has been developed, called a mashup. A mashup is a web page or web application that combines data or functionality from two or more external sources to create a new service. The external sources are integrated into the application by use of their published APIs.

The concept of software component reuse is readily extendable to the architectural design of computer systems. Architecture frameworks provide an execution environment within which a computer system runs. The .Net Framework from Microsoft and the Java Runtime Environment from Sun Microsystems are two very popular frameworks.

MapReduce is an architecture framework designed for processing and generating large volumes of data using clusters of computers. The concepts behind MapReduce have a long history in the functional programming community, in languages such as Lisp. However, it was not until 2004, when Google presented a paper at the Sixth Symposium on Operating System Design and Implementation [OSDI'04], that the idea of applying paired map and reduce functions to large data processing problems became a source of widespread interest in mainstream academia, open-source projects and commercial organizations. Google's paper, entitled MapReduce: Simplified Data Processing on Large Clusters [Dean1], is now often cited as the introduction of MapReduce to the problem of data processing.

The MapReduce programming model and distributed execution platform were originally described as a way of easily scaling systems to run efficiently across thousands of commodity machines on a dedicated high-speed network. The model has gained significant momentum in commercial, research and open-source projects since the original paper was published and has been successfully applied to a number of problems, demonstrating its versatility for real-world data processing solutions. An important development from Google's original paper is the application of the MapReduce concept to parallel processing in multi-core environments such as multi-core/multi-processor machines and graphics processors. The scalability achieved using MapReduce to implement data processing across a large number of CPUs with low implementation cost, whether on a single server or multiple machines, is an attractive proposition. In 2008 the original authors updated and republished their seminal paper [Dean2] in response to its growing popularity.

Data Partitioning

As we have seen, to achieve good scalability, a computer system needs to be able to run many simultaneous threads of execution in isolation from each other, or at least with minimal dependency between executions.

To have true independence in the execution, a system also needs to have independence in the supply of input data and the production of output data. If either the input to the processing or the output from the processing needs synchronization between the simultaneous executions, then the benefits are severely diminished. The impact on the performance of a computer system caused by the synchronization of simultaneous executions attempting to access a shared resource is called contention.

A means of overcoming contention on shared input data is to employ a technique known as Data Partitioning. This involves isolating the data that is being processed by each execution into groups - partitions - which can be read and processed independently of other data partitions. Data partitioning is a fundamental principle in enabling scalability.

Figure 1-1 Data Partitioning

Figure 1-1 illustrates data partitioning. On the left, a data source is supplied to a processing unit which is capable of processing multiple streams simultaneously. Data feeds in through a pipeline into a synchronization barrier which allocates data to each of the processing execution streams. It is this barrier that causes contention. While data is being allocated to execution Y, execution X may be starved of data and become idle. If this is repeated regularly, then the contention on input data causes a lot of starvation and the overall runtime increases. The alternative scenario shows the input data has been partitioned and each of the execution streams has its own data source supplying input data. There is no requirement for synchronization and therefore no contention, no starvation of data and no idle processing time.

Data Partitioning Methods

The rules by which data is apportioned to a partition can be a very simple record-counting method or more complex key-based algorithms. A record-counting method, for example, may divide the dataset into partitions based on a principle that a pre-defined number of records belong to each partition. Starting at the beginning of the dataset, the first n records belong to the first partition; the second n records belong to the second partition, etc. until all records have been allocated to a partition. A key-based algorithm may define a key to be the first k characters of a record and for each record in the dataset the partition assignment is calculated based on the value of the key. A common mechanism is to apply a hash function such as Secure Hash Algorithm (SHA), Message-Digest Algorithm 5 (MD5) or Cyclic Redundancy Check (CRC) to the key character string to generate a hash value, and then modulo m is calculated to give the partition number, where m is the number of partitions required.

partition = md5(key) modulo m
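As a concrete sketch of these two approaches, the fragment below shows how they might be written in Python. The function names, the use of MD5 for the key-based variant and the assumption that the key has already been extracted from each record are illustrative only, not part of any particular framework.

    import hashlib

    def partition_for_key(key: str, m: int) -> int:
        # Key-based method: hash the key and reduce the 128-bit MD5 digest
        # modulo the number of partitions required.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % m

    def partition_by_count(records, n: int):
        # Record-counting method: the first n records form partition 0,
        # the next n records form partition 1, and so on.
        return [(index // n, record) for index, record in enumerate(records)]

    # Example: assign two keys to one of 10 partitions.
    print(partition_for_key("customer-0001", 10))
    print(partition_for_key("customer-0002", 10))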

An important consideration when selecting a partitioning schema is the total number of partitions that will be created. In the first example, the record-counting method, the number of partitions is logically unlimited and will be determined purely by the size of the input dataset. Undefined limits such as these can be dangerous in a computer system. Even the largest and most scalable systems have a limit, so they must be prepared for the situation where the physical limit is exceeded because of the logically unlimited parameter.

The key-based partitioning mechanism pre-defines the maximum number of partitions that will be created. It is important to select a good hash function that will generate each potential hash value with roughly equal probability. This helps ensure that each partition is broadly equal in size, so that each data processing execution has the same amount of work to do.

To illustrate the selection of hash functions, consider an experiment performed using four hash functions: a 32-bit word implementation of the Secure Hash Algorithm, SHA256; MD5, producing a 128-bit hash value; and two versions of Cyclic Redundancy Check, 16-bit and 32-bit wide implementations, CRC16 and CRC32 respectively.

Figure 1-2 Sample Partition allocations per hash function

Figure 1-2 shows an example of partition allocations in a test where a sample dataset was passed through a number of hash functions and the partition number in each case was calculated as the hash value modulo 10. For each partition number along the x-axis, the frequency of keys allocated to that partition by each hash function is plotted on the y-axis. The y-axis scale is expanded to show the variance clearly, as the range is relatively small: between 9.64% (SHA256's allocation to partition 7) and 10.24% (MD5's allocation to partition 0). The optimum hash function is the one plotted as the flattest line, which for this sample dataset is CRC16 (9.94% - 10.07%); the most unbalanced partition set was generated by SHA256 (9.64% - 10.23%).

Note that these results are from a sample test dataset. Hash function performance is very dependent on the data itself, so it is impossible to select a single hash function that will produce an optimum distribution of data partitions for all input datasets.
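An experiment of this kind can be reproduced with a short script that counts how many keys each hash function assigns to each of the ten partitions. The synthetic keys below are an assumption for illustration, and CRC16 is omitted from the sketch for brevity; only hash functions taken directly from the Python standard library are used.

    import hashlib
    import zlib
    from collections import Counter

    M = 10  # number of partitions, as in the experiment above

    HASHES = {
        "MD5":    lambda k: int.from_bytes(hashlib.md5(k).digest(), "big"),
        "SHA256": lambda k: int.from_bytes(hashlib.sha256(k).digest(), "big"),
        "CRC32":  lambda k: zlib.crc32(k),
    }

    def distribution(keys, hash_fn, m=M):
        # Percentage of keys allocated to each partition by this hash function.
        counts = Counter(hash_fn(key) % m for key in keys)
        return [100.0 * counts[p] / len(keys) for p in range(m)]

    keys = [("record-%08d" % i).encode("utf-8") for i in range(100000)]
    for name, fn in HASHES.items():
        percentages = distribution(keys, fn)
        print(name, "min %.2f%%" % min(percentages), "max %.2f%%" % max(percentages))

The flatter the spread between the minimum and maximum percentages, the more evenly the hash function balances the partitions for that particular dataset.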

Hash functions themselves have a cost in execution time, so there is a balance to consider. Some hash functions are very expensive to calculate, and the overhead of using one hash function over another may not be justified by the saving in processing time gained from a slightly more balanced dataset.

Figure 1-3 Hash function runtimes

There is little benefit in using a complex and computationally expensive hash algorithm to calculate a perfectly distributed hash value if it is then used simply to calculate the partition assignment by taking modulo m of the value, particularly if m is small. Figure 1-3 shows the relative runtimes of each hash function from the experiment above. In this example, CRC32 performed the quickest, followed by CRC16, so the CRC algorithms produce the best distribution and execute the fastest of the four test hash functions, although there remains a trade-off between distribution and performance when choosing between CRC16 and CRC32.
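The relative cost of the hash step itself can be measured in isolation with a similarly small script; the keys are synthetic, CRC16 is again left out of the sketch, and the absolute timings will vary by platform - only the comparison between functions is of interest.

    import hashlib
    import timeit
    import zlib

    keys = [("record-%08d" % i).encode("utf-8") for i in range(10000)]

    candidates = {
        "MD5":    lambda: [hashlib.md5(k).digest() for k in keys],
        "SHA256": lambda: [hashlib.sha256(k).digest() for k in keys],
        "CRC32":  lambda: [zlib.crc32(k) for k in keys],
    }

    for name, fn in candidates.items():
        seconds = timeit.timeit(fn, number=10)
        print("%s: %.3fs for 10 passes over %d keys" % (name, seconds, len(keys)))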

Balancing the dataset sizes in this way is important to optimize the data processing and reduce the overall running time. Each partition of data is processed independently and simultaneously, and - assuming there is enough processing resource available to process every partition simultaneously - the end-to-end time of processing the entire set of data will be determined by the longest running process. It follows that the longest running process will be the one handling the largest partition, and therefore making all partitions the same size gives the best chance of finishing in the quickest time.

Figure 1-4 Processing time of datasets

Figure 1-4 illustrates the relative execution time for processing the same ten partitions in this experiment. The overall time-to-completion is dictated by the longest partition execution time, and an unbalanced set of data partitions therefore skews the overall runtime performance. Here, the MD5 hash function created a partition 0 disproportionately larger than partition 6, so processing completes on partition 6 in about half the time of partition 0. Conversely, the CRC16 hash function created values with a more even distribution and resulted in more evenly sized partitions. These partitions were processed in parallel and all finished within a short time of each other, thus completing the overall processing in less time than with the other hash functions.

Data Storage Strategies

Using data partitioning, contention is moved away from the processing and into the supply of the data. If every partition of data is stored on a single disk volume, which can read only one piece of data at a time, then contention is seen here instead. Modern hard disk drives and their software drivers are very efficient at servicing multiple simultaneous read/write requests; nonetheless, reading data from a hard disk drive is relatively slow in any computer system. A means of overcoming contention on the hard disk is to use multiple drives. Storing independent data partitions on separate volumes avoids disk contention and can improve performance significantly.

Fault Tolerance

As a computer system scales, it consists of more hardware and more software, and that means there is more to go wrong. On the positive side, it also means that there is more opportunity to automatically overcome a single point of failure.

If a system is running entirely on a single machine, there is no communication with other machines to go wrong, and the only points of failure are local; the machine components may fail, or the software on the machine may exhibit a fault. In either case, there is often very little that can be done to automatically recover from these failures. A hardware failure will typically cause the system to halt, in which case there is no opportunity for any automatic recovery action. A software fault may be more recoverable, but without any remedial action being taken, a restart of the software will most likely see a repeat of the fault in the near future.

Conversely, consider a computer system that is distributed across multiple machines connected together by a local area network, such as Ethernet. Each of the machines processes data and produces a subset of results, working in conjunction with the others towards the system's end goal. Passing data and messages between the machines is critical to the successful and efficient functioning of the system.

Where are the potential points of failure in this computer system?

  1. Each machine has many hardware components that may fail, causing the machine itself to fail

  2. Each machine runs software that will contain programming errors that may cause the machine to fail

  3. Each machine is connected to every other machine via a network. Networks are inherently unreliable and a failure in the network can prevent machines from communicating

The first two points are the same as in a single-machine system. However, the difference is that the failure of an individual machine does not need to result in failure of the entire system. As the volume of hardware and software used to construct a computer system increases, so does the probability of failures occurring. Consequently, automatic reaction to failures within the system, and an ability to reconfigure the working environment so that the system succeeds in the event of individual failures, become increasingly important to the scalability of a system. This resilience to failure is known as fault tolerance.

The most fundamental mechanism for software fault tolerance is built into every good software application without even being called fault tolerance; rather, it is understood to be good practice in software engineering. Error handling - checking error codes after calls to an API function, or exception handling, for example - should be built into the core of a software system so that appropriate action is taken if an error occurs. Typically, expected errors are those that can occur during the normal running of a program and are treated simply as a conditional path of execution. Exceptions are errors that are known to be possible, but not expected during the normal course of operation of the system. An expected error is therefore recoverable, whereas an exception can often result in the system aborting with no appropriate recovery action available.
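The distinction between an expected error and an exception can be sketched in a few lines of Python; the record store and the JSON record format here are purely illustrative.

    import json

    def load_record(store, record_id):
        raw = store.get(record_id)
        if raw is None:
            # Expected error: a missing record is a normal, recoverable
            # condition, handled as a conditional path of execution.
            return None
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            # Exception: a record that exists but cannot be parsed is known
            # to be possible, but not expected in normal operation, and there
            # may be no sensible recovery available at this point.
            raise RuntimeError("corrupt record %r" % record_id) from err

    store = {"a": '{"value": 1}', "b": "not-json"}
    print(load_record(store, "a"))        # {'value': 1}
    print(load_record(store, "missing"))  # None - expected error, handled
    # load_record(store, "b") would raise RuntimeError - exceptional condition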

Complex safety critical systems can employ a technique called diversity to increase the reliability of the system. This technique involves multiple implementations of one specification, which enables the system to identify and recover from errors in one of the implementations. Each alternative implementation is independently designed to produce the same result. The module determines the correct result based upon the individual results obtained from the alternative implementations. Ideally, all implementations produce the same result; otherwise a choice is made based upon select criteria [Inacio].

Handling hardware failures can be more challenging than handling software failures because the remedial action must be taken from outside the failing component. Software can intrinsically detect error conditions and react appropriately; while individual hardware components can do the same, handling the failure of a piece of hardware must be done externally.

To guarantee recovery from all hardware failures, the overall system must always maintain a state in which the absence of any individual piece of hardware has no effect on the system's ability to function. Two techniques can be used in conjunction with each other to satisfy this requirement: hardware replication and hardware redundancy. Replication can be applied to data and processing; data is stored on multiple storage devices so that it is available from a different device if one fails, and processing is executed several times in case one execution fails to complete. Redundancy is implemented by duplicating key components within the system so that if one of the components fails, another can be used in its place. A common example of redundancy is a RAID device (redundant array of inexpensive disks), which combines multiple hard disk drives and duplicates data written to the device across the multiple disks so that if a single disk fails, the data is still accessible.

The MapReduce framework provides for fault tolerance in its execution. This inherent support is a critical benefit of the framework.

Scale and Context of Software Systems

It is only quite recently that mainstream engineers and architects have begun to think about software systems that can be larger than a single machine. The size and capacity of the machine and its peripheral devices have previously been the constraint on the size and complexity of the software and the volume of data that can be processed by a single system.

The introduction of multi-core processors to laptops, desktops and servers has done little to advance the architecture of software. Instead, the industry has embraced these processors as a way for machines to run more software rather than an opportunity to improve software to run faster and more efficiently.

Writing good, fast and efficient multitasking software is difficult with most current programming languages because the languages were designed before the advent of multitasking capabilities, and therefore lack primitives to support synchronization of execution threads, cache management and other considerations for high-performance multitasking software.

Scalability is therefore an important consideration at every degree of granularity in a software system where a new dimension of execution is introduced.

  • Can the software scale efficiently from a single thread of execution to running multiple concurrent threads?

  • Can the system scale efficiently from a single machine to a cluster of interconnected machines on a local area network?

Each of these step changes in the dimension of software execution has traditionally been tackled as a separate, distinct problem with its own complex solution, which itself has shortcomings. Combining these solutions has highlighted the complexity and difficulty of developing truly efficient, scalable software systems.

While MapReduce is no silver bullet, it does provide a framework and programming discipline that can be successfully applied to the problem of scaling software applications across multi-core processors and multiple-machine cluster infrastructures. The framework abstracts the complex parallelization, synchronization and communication mechanisms and protects the programmer from having to consider these aspects of scalability in each system implementation. A common architecture, albeit with a necessarily different implementation for each granularity of scale, also enables a degree of reusability that has not been seen before. A system can be redesigned from a current single-threaded implementation to a MapReduce architecture, and implemented to scale to a multithreaded single-machine environment as well as to a multiple-machine cluster of interconnected computers.

MapReduce Execution

Figure 1-5 MapReduce Job Execution

At the core of MapReduce are the two phases of execution that give the framework its name. A Map Phase processes the system's input data and produces intermediate results, which are sorted and collated by the infrastructure and passed to the Reduce Phase, which further processes the data in groups of identical keys and produces the final results. An overview of a MapReduce Job is shown in Figure 1-5.

MapReduce references data using pairs of data items – a key item and a value item – called a key/value pair. A MapReduce job is executed by first applying a mapping function to a set of partitioned input data of key/value pairs. Multiple mapping functions are executed simultaneously, each processing a distinct partition of data. The number of functions executed simultaneously depends on the capacity and workload of the environment as well as user-specified recommendations. Other data partitions are queued by the framework until capacity is available to execute them, either by more resources being made available to the system or by other map functions completing. Output from the map phase is an intermediate sequence of key/value pairs, which the framework collates and sorts by key. A reducing function is then applied atomically to each group of values with identical keys. A number of reducing functions run concurrently to produce a result set of key/value pairs, which forms the output of the job.
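To make the flow concrete, the following single-process Python sketch runs a word-count job with exactly this structure - map over partitioned input, collate and sort the intermediate pairs by key, then reduce each group of identical keys. It deliberately omits the concurrent scheduling, distribution and fault tolerance that a real MapReduce framework provides.

    from collections import defaultdict
    from itertools import chain

    def map_fn(_key, text):
        # Map phase: emit an intermediate (word, 1) pair for every word.
        return [(word, 1) for word in text.split()]

    def reduce_fn(word, counts):
        # Reduce phase: combine all values that share the same key.
        return (word, sum(counts))

    def run_job(partitions, map_fn, reduce_fn):
        # Map each input partition independently (a framework would run these
        # map calls concurrently across the available capacity).
        intermediate = chain.from_iterable(map_fn(k, v) for k, v in partitions)

        # Collate and sort the intermediate pairs by key (the framework's
        # shuffle and sort between the two phases).
        groups = defaultdict(list)
        for key, value in intermediate:
            groups[key].append(value)

        # Reduce each group of identical keys (also concurrent in a framework).
        return [reduce_fn(key, values) for key, values in sorted(groups.items())]

    partitions = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
    print(run_job(partitions, map_fn, reduce_fn))
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]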

The algorithmic pattern of MapReduce is not itself new, but what Google have done is to apply the theory and create an infrastructure for using the pattern on large datasets across many machines on a local network. In doing this, they have popularized an already available but largely unexposed technique and made it available to the computer science community, documenting sufficient details of the implementation of Google's own framework that others have been able to re-create a version of it and further develop the MapReduce concept. There are a number of MapReduce implementations available across a variety of hardware platforms; the most widely known and popular of these is Hadoop. I have developed an open-source C++ implementation of a MapReduce framework for a single-machine multitasking environment.

Summary

MapReduce will not solve all the world's scalability and data processing problems. It is very rare for one-size-fits-all to apply in computing, and the complexity of scalable software is no exception. MapReduce is a very useful framework that enables massive scalability of appropriate computing problems. The important point here is appropriate computing problems; the best advice, as ever, is to use the right tool for the job and not to try to force a problem into fitting a solution. Let the definition of the problem and the constraints of the system determine the selection of an appropriate solution. Be aware of architectures that are well defined and proven to work, and take time to understand the pitfalls and drawbacks as well as the more obvious advantages of each. Finally, be prepared to use a collection of tools where necessary to create a hybrid architecture that defines a good solution to the problem at hand.
