Context-Aware Task Assignment for MapReduce in Heterogeneous Clouds

The MapReduce programming model is designed to process large data sets through parallel computing among multiple computer nodes (CNs). As data sizes grow considerably (data are collected from sensors in most cases), the optimization problem of task assignment becomes important for improving the performance of MapReduce. Unfortunately, this problem is even more difficult in heterogeneous clouds, in which the CNs have different capabilities and available resources. In this paper, the context-aware task assignment (CATA) approach is proposed to improve the performance of MapReduce in a twofold manner. First, CATA takes the resource demands of different types of jobs into account. Second, CATA can assign tasks to CNs according to their capabilities and available resources in a resource-proportional manner. The experimental results show that CATA can reduce the job execution time by 10 to 40%.


Introduction
Cloud computing is a computing paradigm intended to provide resources as services on demand. (1) For most applications of the Internet of Things (IoT), such a paradigm can provide processing of massive data streams from a large number of sensor nodes in cyber-physical environments. A cloud data center consists of a large number of commodity computer nodes (CNs). With parallel processing on these CNs, MapReduce is considered a suitable model for large-scale data processing. In MapReduce, input data collected from sensor nodes are first divided into a large number of data splits. Then, these data splits are processed by tasks, called mappers and reducers, assigned to CNs in a parallel manner. MapReduce has a built-in fault-tolerance mechanism (2) to ease the development of parallel applications. Thus, a programmer can focus on applications without considering the synchronization among CNs. As a result, many cloud service providers, such as Google, Yahoo, and Facebook, have utilized MapReduce for data analysis.
Apache Hadoop and Spark are the most famous open-source frameworks implementing MapReduce. In these frameworks, the default task assignment (DTA) is based on the number of CPU cores on each CN: in practice, the number of tasks assigned to a CN is 1 to 2 times its number of CPU cores. Unfortunately, this simple task assignment becomes inefficient in heterogeneous clouds. In the MapReduce execution flow, the job execution time consists of (1) the map processing time, (2) the shuffle communication time, and (3) the reduce processing time. The job execution time depends on several parameters, such as the number of CNs executing mappers in the map phase, the number of CNs executing reducers in the reduce phase, and the communication load in the shuffle phase. Thus, the performance of MapReduce can be improved by carefully selecting these parameters. (11)

Performance improvement of MapReduce
Several approaches have been proposed to improve the performance of MapReduce. Kambatla et al. first showed that the best practices of task assignment differ for different types of jobs. (12) According to their experimental results, Hadoop parameters such as the number of CNs, the number of mappers, and the number of reducers can affect the performance of MapReduce. Moreover, a signature-based approach was proposed that determines the type of a job on the basis of its signature and then indicates a predefined task assignment strategy. Unfortunately, this approach requires time to create the signature database and cannot adapt to fine-grained job diversity using a predefined signature database. (13,14) Although Tian et al. proposed a fine-grained job classifier, (15) there is still the problem of selecting the appropriate numbers of CNs, mappers, and reducers in the task assignment strategies for different job types. On a multicore tiled platform, Chen et al. proposed tiled-MapReduce, which is at least 19% faster than the original MapReduce; (16) however, a specific hardware platform is required. Moreover, Ahmad et al. observed that communication-intensive jobs, which tend to output as much data as they input, incur a considerable performance overhead (e.g., 30-40%). The reason is that a high data volume must be transferred over affordable disk and network bandwidths in the shuffle phase. Thus, overlapping the shuffle delay with mapper and reducer computations is performed to reduce the job execution time. (17) In summary, the above approaches reduce the job execution time by finding the best practices for task assignment according to the type of job. Unfortunately, most of them assume that the CNs are homogeneous and do not take cloud heterogeneity and network dynamics into account, even though heterogeneous clouds are becoming increasingly common.
Thus, a new task assignment approach, which is able to adapt to the capabilities and available resources of CNs, is required.

Problem Statement
Prior to the problem statement, Table 1 shows the notations and their descriptions used in this paper. Consider the task assignment in heterogeneous clouds shown in Fig. 2(a): assume that the processor speed of CN A (denoted by P_A) and that of CN B (denoted by P_B) are 10 and 5 million instructions per second (MIPS), respectively. In addition, assume that a job requires one instruction to process one bit (IPb) of input data. If this job needs to process 300 Mb of data that are equally assigned to A and B, then both A and B need to process 150 Mb of data. Thus, A and B need 15 and 30 s to finish their tasks, respectively. Although A can finish its tasks in 15 s, the map phase can be accomplished only after both A and B finish their tasks. Thus, the time required to accomplish the map phase will be 30 s. However, as in the task assignment shown in Fig. 2(b), if 200 and 100 Mb of the same job are assigned to A and B, respectively, according to their processor speeds, then A and B will both spend 20 s on their tasks. Thus, the map phase can be accomplished in 20 s.
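This arithmetic generalizes to any number of CNs: assign each CN a share of the input proportional to its processor speed, and the map phase finishes when the slowest CN finishes. A minimal sketch (the function names are illustrative, not from the paper):

```python
def proportional_split(total_bits, speeds):
    """Split total_bits across CNs in proportion to their processor speeds."""
    total_speed = sum(speeds)
    return [total_bits * p / total_speed for p in speeds]

def map_phase_time(loads, speeds, ipb=1.0):
    """The map phase finishes when the slowest CN finishes (ipb = instructions per bit)."""
    return max(l * ipb / p for l, p in zip(loads, speeds))

speeds = [10e6, 5e6]                      # CN A: 10 MIPS, CN B: 5 MIPS
total = 300e6                             # 300 Mb of input data

equal = [total / 2, total / 2]            # naive equal split
prop = proportional_split(total, speeds)  # 200 Mb and 100 Mb

print(map_phase_time(equal, speeds))  # 30.0 s, as in Fig. 2(a)
print(map_phase_time(prop, speeds))   # 20.0 s, as in Fig. 2(b)
```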
As we can see from these two cases, the job execution time is prolonged if the loads among CNs are unbalanced. Thus, load balancing among CNs can significantly improve the performance of MapReduce. Unfortunately, load balancing is difficult to optimize in heterogeneous clouds, where CNs have a wide variety of computation and communication capabilities and available resources, such as CPU time, memory, and storage. In this paper, CATA is proposed to solve the above problem by balancing the loads of CNs in heterogeneous clouds. The main concept of CATA is to adaptively assign mappers and reducers to CNs according to the type of job and the capabilities and available resources of CNs. In the following paragraphs, we will describe the challenges and opportunities to improve the performance of MapReduce in heterogeneous clouds.

Computation speedup in heterogeneous clouds
In MapReduce, a job is typically executed in a parallel manner for computation speedup. Intuitively, the computation speedup is higher if more CNs are involved. Unfortunately, this is not always true in heterogeneous clouds. For example, in Fig. 3(a), the map phase can be finished in 15 s with two CNs. However, when a third CN C is added to execute this job as shown in Fig. 3(b), the processor speed of C is so low that it prolongs the map processing time. Therefore, if the loads are not balanced among CNs, the computation speedup cannot be guaranteed even though more CNs are selected to execute a job. In this paper, the criterion of computation speedup in heterogeneous clouds is defined and described as follows.

Definition 1 (Task Assignment).
Assume that there is a job with a data size of L bits, which will be executed by a selected set of CNs denoted by N. A task assignment for this job can be defined as (l_1, l_2, ..., l_i, ..., l_|N|), where l_i is the size of the data to be processed by the tasks assigned to CN i. (18) Different task assignments will have different task processing times, defined as follows.

Definition 2 (Task Processing Time).
Since the map (or reduce) phase can be accomplished only after all the CNs have finished their tasks, (19) the task processing time is defined as

PT_N = max_{i ∈ N} (C_T · l_i / P_i), (2)

where P_i is the processor speed of CN i ∈ N and C_T is the complexity of the task (e.g., mapper or reducer) of the current job. The objective of a task assignment is to reduce the task processing time while considering the load balance among CNs. In this paper, the perfect load balance (PLB) is defined as follows.

Definition 3 (PLB). A task assignment among all CNs in N is PLB if and only if

l_i / P_i = l_j / P_j for all i, j ∈ N. (3)
In the following paragraphs, we will prove that the task processing time is shortest if the task assignment is PLB.

Theorem 1. If the task assignment among all CNs in N is PLB, then the task processing time is shortest.

Proof. By Definitions 2 and 3, if the task assignment among all CNs in N is PLB, then the task processing time is

PT_N = C_T · l_i / P_i for any i ∈ N. (4)

If CN i ∈ N yields x bits to CN j ∈ N, where x > 0 and i ≠ j, then this task assignment will no longer be PLB by Definition 3. The new sizes of data to be processed by the tasks assigned to CN i and CN j are l_i′ = l_i − x and l_j′ = l_j + x, respectively. Thus, the new task processing time is

PT_N′ = max_{k ∈ N} (C_T · l_k′ / P_k) ≥ C_T · (l_j + x) / P_j. (5)

Since x > 0, the following inequality holds:

C_T · (l_j + x) / P_j > C_T · l_j / P_j. (6)

According to Eqs. (3)-(5), it is easily proven that PT_N′ > PT_N. ∎

Based on Theorem 1, we can prove that the task processing time becomes shorter by adding more CNs if the task assignment is still PLB.

Theorem 2. Assume that the task assignments among two sets of CNs denoted by N and N′, where |N| = n, |N′| = m, and m > n, for a job are both PLB. Then PT_N > PT_N′.

Proof. According to Eq. (3), if the task assignment is PLB, then the relationship between l_i and l_j can be obtained as l_j = l_i · P_j / P_i. Since Σ_{j ∈ N} l_j = L, the following equation holds:

Σ_{j ∈ N} l_i · P_j / P_i = L. (7)

According to Eq. (7), l_i is then given by

l_i = L · P_i / Σ_{j ∈ N} P_j. (8)

Without loss of generality, we may assume that P_i ≤ P_j when i < j. Assume that a new CN k is added to N, where P_i ≤ P_k ≤ P_{i+1}, and the new set of CNs is denoted by N′. To preserve PLB, each l_i must be recomputed by Eq. (8) over N′. Since Σ_{j ∈ N} P_j < Σ_{j ∈ N′} P_j, it is obvious that

L / Σ_{j ∈ N} P_j > L / Σ_{j ∈ N′} P_j. (9)

Combining Eqs. (4) and (8),

PT_N = C_T · L / Σ_{j ∈ N} P_j > C_T · L / Σ_{j ∈ N′} P_j = PT_N′. ∎

According to Theorem 2, the computation speedup of parallel computing on more CNs can be guaranteed if the task assignment is PLB.

Communication load in heterogeneous clouds
In addition to the task processing time, the shuffle communication time also influences the performance of MapReduce. Since the communication among CNs is based on networking, insufficient bandwidths among these CNs can incur a substantial performance reduction. In a heterogeneous cloud, the bandwidths among CNs may vary greatly, so the task assignment becomes more complex. For example, as shown in Fig. 4, assume that this is a communication-intensive job where the size of the output data is equal to that of the input data. In addition, assume that the bandwidth between A and C is 1 Mbps, and the bandwidth between B and C is 10 Mbps. Because the load is balanced between A and B in Fig. 4(b), the map processing time (i.e., 20 s) is shorter than that in Fig. 4(a) (i.e., 30 s). However, owing to the communication bottleneck between A and C, the shuffle communication time in Fig. 4(b) (i.e., 210 s) is much longer than that in Fig. 4(a) (i.e., 165 s). Summing the times required to complete the map and shuffle phases, the task assignment in Fig. 4(a) (i.e., 195 s) is better than that in Fig. 4(b) (i.e., 230 s).
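The Fig. 4 trade-off can be reproduced in a few lines. One modeling assumption is needed to match the figure's numbers: the transfers from the mapper CNs to the single reducer CN C are serialized (their times add up), which is our reading of the example rather than something stated explicitly:

```python
def total_time(loads, speeds, bandwidths, ipb=1.0):
    """Map time (slowest CN) plus shuffle time, with the transfers to the
    single reducer CN modeled as sequential over a shared link.
    For this communication-intensive job, output size == input size."""
    map_t = max(l * ipb / p for l, p in zip(loads, speeds))
    shuffle_t = sum(l / b for l, b in zip(loads, bandwidths))
    return map_t + shuffle_t

speeds = [10, 5]  # effective Mb/s of CN A and CN B at 1 IPb
bw = [1, 10]      # Mbps from A and B to the reducer CN C

print(total_time([150, 150], speeds, bw))  # 195.0 s, Fig. 4(a)
print(total_time([200, 100], speeds, bw))  # 230.0 s, Fig. 4(b)
```

The balanced map assignment loses overall because it pushes more data through the slow A-C link.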
In conclusion, the optimization problem of task assignment must consider not only the computation speedup but also the communication load. Unfortunately, it is impractical to find the optimal numbers of CNs, mappers, and reducers while considering the type of job and the capabilities and resources of CNs. Thus, CATA is proposed to find an approximate solution in a reasonable time by exploiting the above challenges and opportunities of task assignment in heterogeneous clouds.

CATA
CATA has two adaptation levels: (1) node-level adaptation (NLA) and (2) task-level adaptation (TLA). In the first level (CATA-NLA), the appropriate number of CNs is determined according to the type of current job. Then, the determined number of CNs is selected to execute the current job using a resource-proportional approach. In CATA-NLA, the number of tasks assigned to a CN only depends on the number of CPU cores. In the second level (CATA-TLA), these tasks are reassigned to the selected CNs according to the processor speeds of CNs for approximating PLB. The flow chart of CATA is shown in Fig. 5 and described as follows.

CATA-NLA
CATA-NLA improves DTA by selecting the appropriate CNs while taking (1) the type of job and (2) the capabilities of CNs into account. CATA-NLA has four steps as described in detail below.

Job type classification
Typically, a processing-intensive job demands more computation resources. In contrast, a communication-intensive job demands more communication resources. If the task assignment does not take the type of job into account, it can prolong the job execution time. Thus, the first step in CATA-NLA is to determine the type of current job.
In CATA-NLA, the job type classification is simplified by determining the communication load

ω = S_MOD / S_MID,

where S_MID and S_MOD are the sizes of the mapper input data (MID) and mapper output data (MOD), respectively. (15) A job (e.g., max) with a low communication load (i.e., ω ≈ 0) is more likely to be processing-intensive, because few data need to be transmitted from mappers to reducers among CNs. In contrast, a job (e.g., sort) with a high communication load (i.e., ω ≈ 1) is more likely to be communication-intensive.
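A sketch of this classification step; the decision threshold below is an assumed value, since the paper does not specify one:

```python
def communication_load(s_mid_bytes, s_mod_bytes):
    """omega = S_MOD / S_MID: ratio of mapper output size to mapper input size."""
    return s_mod_bytes / s_mid_bytes

def classify_job(omega, threshold=0.5):
    """threshold is an assumed cut-off between the two job types."""
    return "communication-intensive" if omega >= threshold else "processing-intensive"

# sort-like job: output roughly equals input (omega close to 1)
print(classify_job(communication_load(1_000_000, 1_000_000)))  # communication-intensive
# grep/max-like job: tiny output (omega close to 0)
print(classify_job(communication_load(1_000_000, 1_000)))      # processing-intensive
```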

Node number determination
More CNs can be selected to execute a processing-intensive job for computation speedup. In contrast, a communication-intensive job can be executed by a few CNs to decrease the communication among CNs. Thus, CATA-NLA determines the appropriate number of CNs to execute the current job according to its type. Because the shuffle phase is typically the bottleneck in MapReduce, (20,21) we consider only the case in which a single CN is selected to execute the reducers, and the reduce processing time is ignored.
The problem is to select the appropriate number of CNs, denoted by n, among all CNs in the data center with the objective of minimizing the job execution time. Assume that the current job is executed by x CNs in the map phase; the expected job execution time is

EET_x = MPT_x + RMT_x, (16)

where EET_x, MPT_x, and RMT_x are the expected job execution time, map processing time, and shuffle communication time, respectively. According to the finding of Chowdhury et al., (19) the map processing time decreases with the number of CNs, but the shuffle communication time increases owing to data transfer among CNs. Thus, in this paper, the expected map processing time is modeled as inversely proportional to the number of CNs as

MPT_x = C_T · L / (x · P), (17)

where P is the average processor speed of all CNs in the data center. Moreover, the expected shuffle communication time is modeled as increasing with the number of CNs, since the fraction of mapper output that must be transferred across the network grows with x, as

RMT_x = ω · L · (x − 1) / (x · R), (18)

where R is the average bandwidth among all CNs in the data center. Finally, node number determination is the optimization problem of finding n such that

n = argmin_x EET_x. (19)
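Node number determination can then be implemented as a one-dimensional search over x. The cost model below is a sketch under assumptions: map time inversely proportional to x, and shuffle time growing with x as the remote fraction of mapper output increases. The specific functional forms and parameter values are illustrative:

```python
def expected_execution_time(x, L, p_avg, r_avg, omega, c_t=1.0):
    """EET_x = MPT_x + RMT_x under a simple assumed cost model."""
    mpt = c_t * L / (x * p_avg)              # map time shrinks as x grows
    rmt = omega * L * (x - 1) / (x * r_avg)  # remote shuffle share grows with x
    return mpt + rmt

def node_number(L, p_avg, r_avg, omega, n_max):
    """Pick the x in [1, n_max] that minimizes the expected execution time."""
    return min(range(1, n_max + 1),
               key=lambda x: expected_execution_time(x, L, p_avg, r_avg, omega))

# A low-omega (processing-intensive) job favors many CNs;
# a high-omega (communication-intensive) job favors few.
print(node_number(1e9, 1e8, 5e7, 0.01, 8))  # 8
print(node_number(1e9, 1e8, 5e7, 1.0, 8))   # 1
```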

Execution node selection
Although the appropriate number of CNs, denoted by n, has been determined in Step 2, there is still the problem of how to select these n CNs from all CNs in the data center. (22) Since the available resources of CNs differ, the n CNs must be carefully selected according to the type of the current job and the available resources of the CNs. Basically, CATA-NLA selects CNs with more computation resources for a processing-intensive job. In contrast, CNs with more communication resources are selected to execute a communication-intensive job.

CATA-NLA adopts the resource-proportional node selection strategy. (23) Assume that there are K kinds of resources. First, CATA-NLA calculates the selection priority SP_{i,j} of CN i, where i = 1, 2, ..., n, for job j by

SP_{i,j} = Σ_{k=1}^{K} ω_{j,k} · RP_{i,k},

where ω_{j,k} and RP_{i,k} are the weight on resource k for job j and the resource proportion of resource k on CN i, respectively. Since different types of jobs have various resource demands, the node selection strategy suitable for the type of job j can be configured as the specific demand weights on resources, (ω_{j,1}, ω_{j,2}, ..., ω_{j,K}), where Σ_{k=1}^{K} ω_{j,k} = 1. Moreover, the resource proportion of resource k on CN i is

RP_{i,k} = r_{i,k} / R_{i,k},

where r_{i,k} and R_{i,k} are the available amount and maximum amount of resource k on CN i, respectively. Then, CATA-NLA can simply sort these CNs by selection priority and select the first n CNs to execute the job.
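The resource-proportional selection can be sketched as follows; the two resource kinds (CPU, network) and the node data are hypothetical:

```python
def selection_priority(weights, resource_proportions):
    """SP_{i,j} = sum_k omega_{j,k} * RP_{i,k}, with the weights summing to 1."""
    return sum(w * rp for w, rp in zip(weights, resource_proportions))

def select_nodes(n, weights, nodes):
    """nodes maps a CN id to [(available_k, max_k), ...] per resource kind;
    the n highest-priority CNs are selected."""
    ranked = sorted(
        nodes,
        key=lambda cn: selection_priority(weights,
                                          [r / cap for r, cap in nodes[cn]]),
        reverse=True)
    return ranked[:n]

# Hypothetical resource kinds: (CPU cores, network Mbps), as (available, maximum).
nodes = {"A": [(6, 8), (10, 100)],
         "B": [(2, 8), (90, 100)],
         "C": [(4, 8), (50, 100)]}

# A processing-intensive job weights CPU heavily: (0.8, 0.2).
print(select_nodes(2, [0.8, 0.2], nodes))  # ['A', 'C']
```

A communication-intensive job would instead weight the network resource heavily, pulling CN B to the front of the ranking.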

Uniform task assignment
In the last step of CATA-NLA, the number of mappers assigned to CN i, denoted by M_i, is determined. Although each mapper handles one data split at a time, a higher M_i does not necessarily yield higher performance; the outcome depends on the capabilities of CN i. On the basis of our experimental results on homogeneous clouds, as shown in Fig. 6, the job execution time is shortest if the ratio of the number of CPU cores to the number of mappers is about 1:2, regardless of the data size and the number of CNs. Thus, in CATA-NLA, M_i is defined as

M_i = 2 · c_i, (20)

where c_i is the number of CPU cores of CN i.

CATA-TLA
CATA-NLA can improve the performance of MapReduce by selecting the appropriate number of CNs to execute the current job while considering the computation speedup and communication load at the same time. However, the task assignment in CATA-NLA assumes that the cloud is homogeneous. Thus, according to Theorem 1, CATA-NLA can be further improved by reassigning mappers to CNs according to the processor speeds of CNs for approximating PLB.
Thus, the objective of CATA-TLA is to achieve PLB by performing task reassignment. In CATA-NLA, the total number of mappers in each task execution round is the summation of the numbers of mappers on CN i, where i = 1, ..., n, and is defined as

M = Σ_{i=1}^{n} M_i.

To achieve PLB, the number of mappers on CN i must be reassigned according to the processor speed of CN i in CATA-TLA. According to Theorem 1, the number of mappers reassigned to CN i, where i = 1, ..., n, is defined as

M_i′ = M · P_i / Σ_{j=1}^{n} P_j.
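Both steps can be sketched together. The largest-remainder rounding used below is an assumption to keep the total number of mappers unchanged; the paper does not specify how fractional mapper counts are rounded:

```python
def nla_mappers(cores):
    """CATA-NLA: M_i = 2 * c_i (the 1:2 cores-to-mappers ratio)."""
    return [2 * c for c in cores]

def tla_reassign(total_mappers, speeds):
    """CATA-TLA: redistribute mappers in proportion to processor speeds,
    rounding to integers while preserving the total (largest remainder)."""
    s = sum(speeds)
    exact = [total_mappers * p / s for p in speeds]
    base = [int(e) for e in exact]
    # hand the leftover mappers to the largest fractional remainders
    for i in sorted(range(len(exact)), key=lambda i: exact[i] - base[i],
                    reverse=True)[: total_mappers - sum(base)]:
        base[i] += 1
    return base

cores = [1, 1, 1, 1]            # four single-core CNs
speeds = [4, 2, 1, 1]           # heterogeneous processor speeds
m = sum(nla_mappers(cores))     # 8 mappers per round in total
print(tla_reassign(m, speeds))  # [4, 2, 1, 1]
```

The fastest CN now runs four mappers per round while the slowest runs one, approximating PLB.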

System Evaluation
These experiments are conducted to evaluate the benefits of using CATA. The performances of CATA-NLA, CATA-TLA, and Hadoop DTA are compared in terms of different types of jobs, data sizes, and the numbers of CNs, mappers, and reducers.

Experimental setup
The jobs selected to represent processing- and communication-intensive jobs are grep and sort, respectively. The experimental environments are built on two physical computers, each with a 2.66 GHz quad-core processor, 8 GB of main memory, and a 1 TB hard disk. The virtualization software Xen is employed to emulate a homogeneous cloud and a heterogeneous cloud with 8 and 4 CNs, as shown in Tables 2 and 3, respectively.

Experimental results
In all experimental results, the notation NnMm indicates that the MapReduce configuration for the experiment consists of m mappers running on n CNs. Each experimental result is the average of 10 runs with the same configuration.

Experiments in homogeneous cloud
First, CATA-NLA is compared with DTA in the homogeneous cloud as described in Table 2.
Since the size of MOD is far less than that of MID in grep, a higher computation speedup can thus be expected if more CNs are selected. In contrast, since the size of MOD is equal to that of MID in sort, a higher communication load can thus be expected if more CNs are selected.
When the data size is 10 GB, the numbers of CNs determined by CATA-NLA for grep and sort are 8 and 4, respectively. Moreover, since each CN has a single CPU core in this homogeneous cloud, each CN is assigned 2 mappers by Eq. (20) in each task execution round. As shown in Fig. 7, since the size of MOD for grep is small, CATA-NLA executes grep on more CNs for a high computation speedup. In contrast, as shown in Fig. 8, CATA-NLA executes sort on fewer CNs for a low communication load. According to the experimental results, the job execution time under the numbers of CNs and mappers determined by CATA-NLA is the shortest among all tested combinations.
For different data sizes of 10, 20, 40, and 80 GB, the experimental results for grep and sort are shown in Figs. 9 and 10, respectively. The experimental results show that the advantage of CATA-NLA becomes more obvious with increasing data size.

Experiments in heterogeneous clouds
In this section, CATA-TLA is compared with DTA and CATA-NLA in the heterogeneous cloud described in Table 3. In this experiment, the numbers of CNs selected for grep and sort by CATA-TLA are both 4, since the heterogeneous cloud contains only 4 CNs; thus, we only show the experimental results with different numbers of mappers. Because the CNs in heterogeneous clouds have different processor speeds, CATA-TLA reassigns the numbers of mappers to CNs to approximate PLB. As shown in Figs. 11 and 12, CATA-TLA further improves on CATA-NLA by considering the processor speeds of these CNs.

Conclusions
For most IoT applications, one of the challenges is to efficiently process massive data streams from a large number of sensor nodes. However, the default task assignment approach in the MapReduce-based cloud framework cannot efficiently cope with the enormously increasing size of sensor data and the heterogeneity of CNs in the data center. In this paper, the CATA approach is proposed to improve MapReduce performance. CATA assigns tasks while considering the different resource demands of different types of jobs. Moreover, CATA also takes the capabilities and available resources of CNs into account while performing task assignment. There are two contributions of CATA. First, CATA can adaptively assign tasks to CNs according to the capabilities and available resources of CNs with the objective of achieving PLB. Thus, the job execution time is efficiently reduced because of the balanced load among CNs. Second, CATA does not require any specific hardware platform. The experimental results show that CATA can reduce the job execution time by 10-40% whether the job is processing- or communication-intensive.
Currently, CATA is based on the Hadoop first-in first-out (FIFO) scheduler, so task assignment decisions are made for only a single job at a time. Thus, starvation may still occur if high-priority, long-running jobs exhaust cloud resources. In the future, we will adapt CATA to other efficient schedulers, such as the fair, capacity, and Hadoop on Demand schedulers. (24)