From: Qingxi Li [cs.qingxi.li SpamElide]
Sent: Thursday, February 17, 2011 12:37 PM
To: Gupta, Indranil
Subject: 525 review 02/17 LATE

Improving MapReduce Performance in Heterogeneous Environments

In MapReduce, some nodes called stragglers are available but perform poorly. To keep a straggler from slowing down the whole job, Hadoop, the open-source implementation of MapReduce, runs a speculative copy of its task. This mechanism prevents a few misbehaving nodes from slowing down the whole job. Hadoop assumes that all the machines are homogeneous. In fact, machines are not homogeneous, both in private data centers, where the hardware comes from different generations, and in virtualized data centers, where the virtualized resources vary in ways the user cannot control. Hadoop scores Map and Reduce tasks the same way, with a maximum score of 1. A Reduce task is divided into three phases (copy, sort, and reduce), each worth 1/3 of the score, and within each phase the score depends on the fraction of data processed. When a task's score is more than 0.2 below the average and the task has already run for at least 1 minute, it is considered a straggler. The problems with this scheme are as follows. 1. All stragglers are treated equally, but in fact running a speculative copy of some of them saves more time than others. 2. It assumes all tasks start at the same time. 3. The score is not reasonable, since different phases need different amounts of time.

LATE always speculatively executes the task with the longest estimated time to finish, provided the task's progress rate (the progress score divided by the running time) is lower than a threshold, namely the 25th percentile of task progress rates. A speculative copy is launched only if the number of speculative executions in the whole cluster is below SpeculativeCap and the node requesting the speculative execution is a fast node, one whose total progress score is above the 25th percentile. SpeculativeCap prevents speculative tasks from consuming too many resources. The definition of a slow node is based on the number of finished tasks, which seems unreasonable if the nodes do not all start at the same time, which is the common situation in a data center. Besides this, as the authors admit, the algorithm assumes that nodes run at consistent speeds, which also seems unlikely since some phases are faster and some are slower. It should also be mentioned that even though the authors complain at length about Hadoop's progress score, they still use it in LATE and leave finding a better score as future work.

Quincy: Fair Scheduling for Distributed Computing Clusters

Quincy addresses the problem of scheduling jobs onto the computers that are close to their data. For each job, the cluster scheduler finds a node on which to assign the root task. The root then submits a list of worker tasks to the scheduler; these workers have no dependency relationships among themselves. For each worker, the root computes a preference list of computers and racks, namely those that hold more than a certain fraction of the worker's data. Several scheduling algorithms are considered:

1. Each task is put into the queues of the computers and racks on its preference list, as well as into a cluster-wide queue. If a node is idle, meaning its queue is empty, it takes a task from its rack's queue; if the rack queue is also empty, it takes one from the cluster queue. When a task in a queue is picked up by a node, all copies of that task in the other queues are deleted.
This algorithm keeps all the nodes busy, but workers without a preference can wait for a long time. 2. Each job has a limit of min{# of computers / # of jobs, # of its workers} on the total workers it may run. If there are still idle nodes, they are divided among the jobs that have more workers. Workers already running in excess of the limit are not killed. This algorithm can need a long time to become fair if there are many long-lived jobs. 3. The same as 2, but workers exceeding the limit are killed, which wastes all the time and resources already spent on them. 4. Quincy builds a graph based on the preference lists, where the cost of each edge is the cost of the worker running on that computer, including both the cost of killing an existing worker and the cost of fetching data. A min-cost network-flow algorithm is then used to find the best assignment. This algorithm is more complex and needs much more time to find the optimal solution.

From: w SpamElide on behalf of Will Dietz [wdietz2 SpamElide]
Sent: Thursday, February 17, 2011 12:30 PM
To: Gupta, Indranil
Subject: 525 review 02/17

Will Dietz
cs525
2/17/11

The first paper I read was the Berkeley OSDI paper "Improving MapReduce Performance in Heterogeneous Environments" by Zaharia, et al. In this paper they introduce "LATE", their new scheduler for MapReduce/Hadoop. The primary contribution is the way they handle speculative tasks. The authors did a very good job of describing the existing scheduling techniques for speculative execution, and the assumptions that were made in those systems to make those schedules appropriate. However, they claim those assumptions aren't generally true--especially that of a homogeneous system (both in node capability and in task complexity). In the end they claim to improve response times by a factor of 2, which is a solid claim. They do this by limiting the total number of speculative tasks, and by picking tasks to speculate on based on which is likely to finish LAST, so as to benefit the most from the speculative copy. They did a good job presenting it and discussing the heuristics they used--both motivating why they work, informally and through evaluation, and showing where they break down. This made for a better read and would be important if I were designing and implementing such a scheduler myself. After an extensive evaluation, they leave the reader with a few important lessons that drive the rest of their work.

The second paper I read was the "Quincy" paper from Microsoft. In essence they took the standard scheduling algorithms used for MapReduce-like systems (in their case, Dryad) and replaced them with their new one. One important difference between their system and its goals versus a more "traditional" Dryad system is that their nodes have large local storage, which motivates something they discuss a lot in their paper: data locality. The short version is that they designed a new scheduler that in essence maps the system to a flow network, taking advantage of costs and capacities to model the scheduling task itself--but also as a means to implement an arbitrary set of cost functions expressing things like data locality, the cost to kill a task, etc. The big idea is that in doing so they can make decisions using fine-grained information and use that to decide in the /global/ interest.
They then solve it as you normally would, and they show through their evaluation how this helps, especially if you configure the various weight-factor knobs appropriately for your system. What I liked best about this paper is that they reduced the scheduling task rather directly to a well-known problem, while taking into account much more detailed information about the system. This allows it to be done efficiently (they report the average scheduling time on their system being 7ms, and argue that even on rather large systems an incremental solver would probably still scale well) while taking the entire system into consideration (data transfer costs, the benefit of data being on the same server or rack, etc.), allowing for the possibility of a much more idealized schedule of tasks. This is illustrated in the effectiveness of their results, as demonstrated in the evaluation. They report a reduction of up to 3.9x in the volume of data transferred across the system, as well as a throughput increase of up to 40%.

~Will

From: trowerm SpamElide on behalf of Matt Trower [mtrower2 SpamElide]
Sent: Thursday, February 17, 2011 12:05 PM
To: indy SpamElide
Subject: 525 review 02/17

Quincy

Quincy is a scheduling system used to distribute jobs amongst machines in cloud-like infrastructure. The original work is designed for DryadLinq, but the concepts learned can be applied to other common systems such as Google's MapReduce. Existing solutions to cloud scheduling are based upon queues. In order to try to keep data local to the computation, queues are based on the hierarchical nature of cloud networks: queues exist for machines, racks, and the whole system. This scheme works well when data locality is even and job lengths are approximately equal. Quincy expands upon previous work by using a weighted graph to compute globally optimal solutions to scheduling. Despite having an element of preemption in their system, they still guarantee nice properties like starvation-freedom. The paper acknowledges that they have other related areas to investigate, such as using run-time predictions to improve scheduling. I would have liked to have seen work to reduce the number of root nodes required during computation. Furthermore, the network seems to have a large effect on scheduling penalties when data is moved across it. I would have liked to see experiments run on non-hierarchical network configurations.

LATE

This paper presents a new scheduling algorithm for the Hadoop MapReduce framework aimed at reducing job latency by detecting and removing execution outliers. Hadoop currently searches for stragglers in the system and reschedules those jobs at other nodes to try to speed up "slow" tasks. Experiments showed that up to 80% of reduce jobs are currently speculated with the Hadoop system. This approach does not work well when the systems in the cloud are heterogeneous, which is the case with most large deployments. In order to remedy this shortcoming, the authors use a new selection algorithm. LATE computes progress towards finishing a task and then estimates a finish time for it. The task with the latest estimated finish time is then speculatively executed on a fast node elsewhere. The authors also added hysteresis-type measures to reduce the number of falsely speculated jobs. This work made an important contribution by changing the focus from progress rates to finish times when speculating jobs. The group still assumes steady progress towards completion in a job.
For data-intensive tasks, things such as hard-disk failures might cause temporary drops in progress speed that will not be resolved by rescheduling the task elsewhere. The algorithm seems to definitely help with heterogeneous systems, but it was not addressed whether this could be deployed in homogeneous systems as well. A performance comparison of previous work and LATE in homogeneous systems would have made a stronger case for LATE. Speculative execution is also only applicable when throughput can be sacrificed for job latency, limiting the applicability of LATE.

From: anjalis2006 SpamElide on behalf of Anjali Sridhar [sridhar3 SpamElide]
Sent: Thursday, February 17, 2011 11:45 AM
To: Gupta, Indranil
Subject: 525 review 2/17

Improving Map-Reduce Performance in Heterogeneous Environments, M. Zaharia et al, OSDI 2008

The paper attempts to address some of the basic assumptions made by Map-Reduce, a popular programming paradigm for large-scale data processing. Map-Reduce compares the performance of each node with the average performance, and anything below it is assumed to be a straggler. This might lead to decreased performance, since heterogeneous nodes can be misidentified as stragglers. The LATE (Longest Approximate Time to End) algorithm proposed in this paper can improve the response time of Map-Reduce jobs by a factor of 2. Heterogeneity causes a decrease in performance because slow jobs are wrongly identified as stragglers and speculative tasks are started for them. The main assumption used to identify stragglers is that all jobs progress at the same rate. The 1/3 of the progress score allocated to each reduce phase is an example of not having a weighted system for calculating job progress. In one case this assumption leads to 80% of the jobs being speculated unnecessarily. Not all map jobs start at the same time, and the waves of map jobs might have a significant start-time difference. This can lead to speculatively starting new jobs when comparing against the average of the older jobs. LATE uses the remaining time rather than the progress rate so that it can speculate early on. It also runs the speculative task on fast nodes, selected based on a threshold, thus breaking the assumption that all nodes are equal. Given that the nodes are going to be running different generations of hardware, it is important to consider the heterogeneity of the environment in which you are running your jobs.

LATE evaluates speculative execution and variable node performance. LATE is based on deciding which task will take the longest time to finish in the future. The slow tasks are run on fast nodes, i.e., nodes whose performance is better than a threshold. The number of speculative tasks running at once is also capped. LATE speculates on the tasks that hurt the response time of the entire job. The applied heuristic of calculating the remaining time will fail in the case where a task that is launched later finishes before an earlier task. A way to avoid this would be to calculate the heuristic separately for each phase to give a better estimate of the remaining time of the currently running jobs. The LATE equation assumes that all tasks progress at the same rate, which might not be true. This is aimed at typical Hadoop jobs, which are very specific. Data locality is not taken into account when launching speculative map tasks in LATE. By taking into account the state of the network and the future location of reduce jobs when duplicating a slow job at a fast node, LATE might improve performance.
It seems that network hotspots might be avoided if jobs were started closer to the input data that they need to read. All jobs are assumed to be running late because of contention, which might not be the problem at all. There is no per-phase solution considered because of this earlier assumption. It would be good to see some results on how often jobs are restarted unnecessarily.

************

Reining in the Outliers in Map-Reduce Clusters using Mantri, G. Ananthanarayanan, OSDI 2010

Map-Reduce jobs have outliers that slow down job completion. These outliers might be present due to contention for memory, I/O bandwidth, or other resource constraints, or due to failures. By detecting these outliers early with the help of Mantri, the resources that might be hogged by these laggards are freed up and job completion is sped up. The paper attempts to study and classify outlier behavior in order to better design a system that will handle it. Instead of focusing only on the progress rate or job completion time to restart a process, Mantri studies the behavior of outliers and implements targeted solutions. The distinguishing factor here is that Mantri oversees the resources currently in the system as well as the possible reason for the outlier, and takes action accordingly. Mantri restarts a task only when there is a high probability that the restart will complete before the already running copy. If there are two copies, the slower one is killed after a span of time to conserve resources. For network-aware placement, Mantri estimates the uplink and downlink times of the racks where the map data is present and starts the reduce task where the network latency is at a minimum. When duplicating tasks, they are also placed in network locations where they can finish faster than the old task, by estimating the network links around the data that has to be moved. Mantri keeps track of tradeoffs in order to better estimate when to restart a task versus when to let it continue. In order to conserve resources, it makes smart decisions about when to replicate output that might be lost. By looking at the time taken by prior phases, the unreliability of a machine is determined and data is subsequently moved. As mentioned in the paper, Mantri does perform costly replication of outputs that have a high probability of becoming unavailable. This provides a safety net against data loss but increases job completion time. Mantri also assumes a moving average of the time remaining when progress reports are lost. Perhaps by looking at the amount of data read, Mantri could better estimate whether the job is still in the input-reading phase or might have finished. When network placement is considered, two reduce tasks might be run simultaneously in close proximity due to previous positive information about the low-latency links. I was also curious whether there was any overhead involved, since the job scheduler in Mantri keeps track of resources through the progress reports sent by each task. It will be interesting to see how often the progress reports are lost or delayed and the impact on the decision making.

From: kevin larson [klarson5 SpamElide]
Sent: Thursday, February 17, 2011 11:42 AM
To: Gupta, Indranil
Subject: 525 Review 02/17

LATE is a modified MapReduce scheduling algorithm applied to Hadoop.
Hadoop's performance is closely tied to its scheduling, and LATE attempts to improve that performance by better accounting for the heterogeneity of machines in a cluster. Heterogeneity results from a variety of different things, ranging from imbalanced loads on machines due to virtualized environments, to varieties of hardware, to hardware/software problems on a node. The result is slow and straggler nodes. Hadoop uses speculation to attempt to alleviate this problem, but doesn't fully account for the variance in machines. Hadoop uses a progress score to determine where to speculate. LATE builds on this and combines the score with progress rate to create an estimated completion time. Using estimated completion time to determine where to speculate, LATE offers significant improvements over Hadoop. The authors of LATE did a very good job of investigating the assumptions made in Hadoop's scheduler. By re-evaluating those assumptions and better taking heterogeneity into account, they were able to find significant improvement over the standard scheduler. They also have a strong evaluation, demonstrating their algorithm with a variety of thresholds in order to better understand the effects of homogeneity. LATE was tested on a variety of testbeds, and unfortunately new innovations and discoveries were made at points at which certain testbeds were no longer available. It would have been preferable to see the (expected) improvements across larger clusters of machines.

Quincy introduces the concept of fairness to scheduling in a cluster. The authors' goal is for a job to approach completion in Jt, where t is the time taken given exclusive access and J is the number of concurrent jobs on the cluster. The closer these times are to Jt, the fairer the scheduling. The largest obstruction to achieving fair scheduling is data locality. Across a variety of tasks, all of the necessary data is rarely on the same computer, let alone the same rack. As tasks move data around in order to collect the data required for execution, the network bandwidth becomes a bottleneck, delaying the execution and, as a result, the completion of tasks. The authors make a graph-based model of network flow to assist with their algorithm. They evaluate the algorithm under a variety of scheduling policies, taking into account various combinations of fairness weighting, preemption, and levels of queueing (computer-wide, rack-wide, and cluster-wide). They demonstrate significant improvements in terms of fairness, performance, and data transfer across the variety of scheduling policies. The QFPX algorithm, which denotes queue-based scheduling with cluster-wide queues, seemed to shine through as the best intersection of their metrics. Quincy identifies a new metric with which to evaluate clusters. Not only are the authors able to create significant improvements by that metric, but they were also able to discover significant improvements in well-established metrics as well. The incorporation of fairness into scheduling had secondary effects which actually helped overall performance. In comparison to LATE, the authors of Quincy did not experiment on as wide a variety of testbeds. LATE showed improvements in a large variety of environments. Quincy only detailed operation with different datasets and partition layouts and does not even show numbers for these differences.
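Several of the reviews above walk through Hadoop's progress score and LATE's estimated-time-left heuristic. As a concrete illustration, here is a minimal sketch of that arithmetic in Python; the Task record and function names are placeholders, not Hadoop's or LATE's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Task:
    kind: str                          # "map" or "reduce"
    start_time: float                  # seconds since job start
    fraction_input_read: float = 0.0   # maps: fraction of input processed
    phases_finished: int = 0           # reduces: completed phases (copy/sort/reduce)
    fraction_of_current_phase: float = 0.0

def progress_score(t: Task) -> float:
    """Hadoop-style score: a map's score is the fraction of input read; a reduce
    gives 1/3 weight to each of its copy, sort, and reduce phases."""
    if t.kind == "map":
        return t.fraction_input_read
    return (t.phases_finished + t.fraction_of_current_phase) / 3.0

def estimated_time_left(t: Task, now: float) -> float:
    """LATE's estimate: assume the task keeps its observed progress rate."""
    score = progress_score(t)
    rate = score / max(now - t.start_time, 1e-9)     # progress per second
    return (1.0 - score) / max(rate, 1e-9)

# Example: a reduce that finished its copy phase and is halfway through sorting
# after 120 s has score 0.5, so roughly 120 s are estimated to remain.
print(estimated_time_left(Task("reduce", 0.0, phases_finished=1,
                               fraction_of_current_phase=0.5), now=120.0))
```

Hadoop's native rule flags a task whose score falls more than 0.2 below the average (after at least a minute of running); LATE instead ranks candidates by this estimated time left and backs up the one expected to finish last.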
From: wzhou10 SpamElide
Sent: Thursday, February 17, 2011 11:31 AM
To: Gupta, Indranil
Subject: CS525 Review 02/17

CS525 Review 02/17
Wenxuan Zhou

Review on "Improving MapReduce Performance in Heterogeneous Environments"

Core idea: The authors argue that the implicit assumption of Hadoop, that nodes are homogeneous, is increasingly turning out to be wrong as environments like Amazon's EC2 become more and more popular. What's worse, Hadoop's task scheduler is designed based on this assumption, and thus performance is affected negatively, especially in terms of response time. To solve this problem, a new scheduling algorithm, Longest Approximate Time to End (LATE), is presented in this paper, which improves Hadoop response time by a factor of 2. LATE uses estimated finishing time instead of progress rate to prioritize tasks to speculate. The tasks farthest from finishing are then executed on idle, fast machines.

Pros: This paper breaks several impractical assumptions of Hadoop's scheduler and improves Hadoop's scheduling algorithm to make it suitable for heterogeneous environments. The design principles (prioritizing tasks to speculate, selecting fast nodes, and capping speculative tasks) are simple, intuitive, and powerful. LATE focuses only on the estimated time left, and thus can detect slow tasks at a very early phase.

Cons:
1. The thresholds SlowNodeThreshold, SlowTaskThreshold, and SpeculativeCap are all percentile values, of node progress, task progress, and available task slots, respectively, which seems not quite reasonable. SpeculativeCap is introduced to avoid overloading the network, so it might be better for it to be a value related to the network capacity. SlowTaskThreshold and SlowNodeThreshold are for determining whether a task or node is slow enough. Maybe all the tasks are fast, and the slowest 25% of tasks are actually fast enough, so it is a waste of resources to run backups for them.
2. The way they present the evaluation results in section 5 is not clear to me. The average data may demonstrate that LATE works best among the three scheduling methods, but the worst and best data don't seem to tell much, since there may be extreme cases. It might be better if they plotted CDFs of the results to show their distribution.

Suggestions:
1. When deciding to assign a node a new task, the scheduler could take locality into account, to achieve lower latency and to avoid overloading the network.
2. The parameters of the machines being used are available, so maybe when assigning tasks and estimating the time left, these parameters could be used.

Question: In section 5.2, I don't understand why each experiment was performed for 5-7 runs instead of an equal number of runs.

Review on "Quincy: Fair Scheduling for Distributed Computing Clusters"

Core idea: This Microsoft work designed a new framework to schedule concurrent distributed jobs with fine-grained resource sharing, and built an implementation of this framework called Quincy. For clusters where nodes handle both application data storage and computation, closeness between storage and computation is crucial to performance. Besides locality, fairness is another important factor considered, since it strongly affects user experience. This work addresses the fairness and locality problem by mapping the scheduling problem to min-cost flow, a graph problem, which outputs the globally optimal schedule subject to the locality and fairness constraints.

Pros:
1. I appreciate the novel idea of mapping the scheduling problem to a min-cost flow problem.
In this way, the mature results of graph theory can be adopted. 2. As the authors argue, scheduling based on both fairness and locality concerns is very important in the environment they studied, and there is no previous study on this subject, which makes this work more valuable. 3. From the evaluation, Quincy shows good fairness for a workload consisting of many short jobs and a few long jobs.

Cons:
1. The approach, a global optimization, seems hard to scale, and the central control is prone to a single point of failure.
2. The heterogeneity argument of "Improving MapReduce Performance in Heterogeneous Environments" is quite convincing to me, so I doubt the homogeneity assumption in this paper, even if it holds in their experimental setting. This assumption makes their work less likely to be applied in other environments.
3. There are some questions about their definition of fairness. They only balance the number of running tasks, and don't take into account other metrics like completion time or data throughput of the jobs.

Thanks.
Wenxuan

From: Simon Krueger [skruege2 SpamElide]
Sent: Thursday, February 17, 2011 11:15 AM
To: Gupta, Indranil
Subject: 525 review 02/17

Improving MapReduce Performance in Heterogeneous Environments

The core idea of the paper is to change the way stragglers are scheduled in heterogeneous environments for MapReduce programs. Stragglers are a few slow or long-running jobs that cause MapReduce programs to take a long time to finish. Stragglers are especially common in heterogeneous environments, since each machine's CPU, memory, disk, and network utilization will vary. An example of a common heterogeneous environment the paper mentions is Amazon's Elastic Compute Cloud (EC2): because EC2 uses virtual machines (VMs), the resource utilization on a physical machine will vary depending on how many VMs are running on top of it and what tasks those VMs are performing. Traditionally, to combat the straggler problem, MapReduce frameworks schedule a speculative copy on a different machine, but as the paper points out this is not free, since the network is affected when speculative execution needs to transfer the data to be processed to a different machine. In order to fix this problem the authors designed a new scheduling algorithm called Longest Approximate Time to End (LATE), where they speculatively execute the jobs that they predict will finish farthest into the future.

Pros of the approach:
- They are able to handle heterogeneous computing environments effectively.
- They cap the number of speculative jobs that are run, to prevent job flooding.
- They choose fast nodes to run speculative jobs.
- They realize that progress rate is not an accurate measure, especially since Hadoop's normal progress meter treats map, copy, and reduce as 1/3 of the job each, while they are often not equal. Instead they use estimated time remaining to launch speculative jobs.

I think it was interesting how they could adjust the scheduling of speculative tasks to handle stragglers better, but it seemed like this was not a big focus in the original design and development of MapReduce, because the way it handled the problem was very poorly managed: it did poor measurement of job completion, and speculative re-execution was very sloppy. I also think that this paper does not provide the best solution, since they do not try to prevent the straggler problem from happening or figure out why it is happening; they just continue the approach of punting on the problem by relaunching the tasks.
Additionally, I think this title is very misleading because it is too broad and does not mention that they are only fixing a particular problem in heterogeneous MapReduce environments.

Reining in the Outliers in MapReduce Clusters using Mantri

Core idea of the paper: The core idea of this paper is to closely monitor jobs in a MapReduce system to prevent stragglers (or what they call outliers). The system they designed to do this is called Mantri, which performs speculative execution, measures network utilization in the system, and analyzes the cost-benefit of protecting the output of tasks. Overall, the difference between this and the Improving MapReduce Performance in Heterogeneous Environments paper is that they analyze the causes of stragglers, which happen because of workload imbalance, poor placement, resource contention, or problematic machines, and then try to use this information to schedule tasks.

Pros of the approach:
- Mantri tries to figure out what causes the stragglers and tries to effectively remedy them by being aware of the resources of the system and placing tasks in the appropriate locations to give them the best chance of finishing.
- Mantri experimentally starts speculative jobs, and if it realizes that one is not beneficial it kills that job.

Other questions/thoughts/criticism: Does the measuring that Mantri performs increase the network utilization and make problems worse? I think this paper does a better job of combating stragglers than the Improving MapReduce Performance in Heterogeneous Environments paper, since they try to figure out what is causing stragglers and use the resource utilization of the system to speculatively execute tasks effectively. But maybe the straggler problem points out that the MapReduce paradigm has its flaws and other paradigms will need to be considered in the future.

Simon

From: harshitha.menon SpamElide on behalf of Harshitha Menon [gplkrsh2 SpamElide]
Sent: Thursday, February 17, 2011 11:11 AM
To: Gupta, Indranil
Subject: 525 review 02/17

Improving MapReduce performance in heterogeneous environments

Hadoop's performance is tied to its task scheduler, which assumes a homogeneous cluster. The performance of a MapReduce job is bounded by its slowest task, and speculative copies of straggler tasks are scheduled based on the assumption that nodes are homogeneous. This assumption leads to incorrect and excessive speculative execution. LATE is a simple algorithm for speculative execution with three principles:
- Prioritize tasks: schedule the tasks that will finish farthest into the future.
- Select fast nodes to run on: to beat the original task, the speculative task should be launched only on fast nodes.
- Cap speculative tasks: a cap on the number of speculative tasks that can be running at once.

Pros:
- The results shown after these were incorporated into Hadoop show enhanced speedups.
- This uses smarter ways to determine stragglers and prioritize them. This would definitely improve performance, as stragglers are prioritized.
- This takes node heterogeneity into account when deciding where to run tasks, and would select fast nodes over straggler nodes.
- Caps are used to prevent overloading the system.

Cons and Improvements:
- In this algorithm, speculative tasks are given priority over non-running tasks. The straggler task might finish by the time all the tasks in the system finish, but in that case the speculative copies would occupy slots meant for non-running tasks. On the other hand, re-executing stragglers first will free up resources as well.
So it seems like it is necessary to identify the reason, consider its impact on the running time of the job, and carry out an action accordingly.
- Scheduling the straggler tasks on a fast node again would affect non-running tasks, and it could also potentially turn this fast node into a straggler. Random assignment of tasks might be better. Also, if it is towards the end of the job, we could use only the fast nodes.
- Speculative execution of a task is based on which task will finish farthest into the future. Ideally, only those tasks should be re-executed whose finish time seems to extend beyond the whole job's runtime.
- The actual reasons for the stragglers are not taken into account before re-executing them. It could just be the nature of the task.

Quincy

Quincy tries to solve the problem of scheduling jobs on clusters with consideration for locality and fairness. The scheduling problem is mapped to a graph data structure where edges and capacities encode the demands of fairness and locality. A job's workflow is represented by a directed acyclic graph of workers, where edges represent dependencies and a root process monitors the tasks. The techniques used by Quincy include running a job's tasks in sub-optimal locations and killing a running task of one job to free resources for another, without sacrificing data locality. When a job starts, a computer is assigned for its root task. Each root task submits a list of worker tasks to the scheduler. Whenever a worker task is ready (the data required for it is available), the root task computes, for each computer, the amount of data that would have to be read across the network. This information is used in the flow-based scheduling. The scheduling problem is encoded as a flow network, and solvers are used to solve it. The scheduler updates the graph whenever an event occurs and on a regular interval. This might lead to the starting or killing of tasks.

Pros:
- This scheduling approach takes into account both locality and fairness.
- This can be extended to various network models. There are various parameters that can be considered while constructing the graph.
- This can be extended to incorporate different queuing policies, e.g., with priorities.
- The problem is elegantly mapped to flow-based scheduling, for which there are solvers.

Cons and Improvements:
- In this system, the scheduler seems to be the bottleneck. It also has to reconstruct the graph after every event. As the system scales to the level of a datacenter, this would be a bottleneck.
- Killing the over-quota tasks, even if they are the most recent ones, wastes their computation. An addition could be to store their state and resume from it when re-executed. At least providing such an option could let users make use of it.
- Each root task has to compute, for each computer, the amount of data that would have to be read across the network. This would not be scalable. A way to overcome this would be to compute this at the rack level first, then pick the top few racks and compute the data overhead for the computers in those racks.

Thank you
-Harshitha (gplkrsh2)

From: Jason Croft [croft1 SpamElide]
Sent: Thursday, February 17, 2011 10:45 AM
To: Gupta, Indranil
Subject: 525 review 02/17

Hi Professor Gupta,

Below is my review for the 2/17 papers on improving MapReduce.
Thanks,
Jason Croft

----------------------------------------------------

Improving MapReduce Performance in Heterogeneous Environments

Zaharia's heuristic, Longest Approximate Time to End (LATE), attempts to improve job completion time in MapReduce by accounting for the heterogeneity of co-located VMs in data centers, such as Amazon EC2. In addition, it builds off the insight that the completion time of any MapReduce job is no faster than its slowest task. MapReduce's standard speculative execution method assumes the hardware resources of all nodes are homogeneous, which is not the case in data centers that may have multiple generations of hardware, or multiple VMs running on the same hardware causing contention in disk or network bandwidth. Unlike the standard speculative execution in Hadoop, which only considers the progress score of a task, the LATE heuristic chooses tasks that will finish farthest into the future. In addition, LATE improves on Hadoop by choosing fast nodes to rerun speculative tasks, since rerunning a task on a machine that tends to produce stragglers will not improve job completion time. It also limits the number of speculated tasks, since these tasks are not "free". The authors' evaluations show improvement in Hadoop response times by a factor of two. On a heterogeneous cluster, LATE jobs finish 27% faster than with Hadoop's native scheduler, and 58% faster when stragglers were introduced by running background processes to create contention for resources. However, these evaluations were run on a separate cluster within EC2, and contention had to be simulated with these background processes. Therefore, the results may be optimistic, since the contention created by other co-located VMs may not have as much impact as that in the authors' simulations. One possible weakness of LATE is the lack of data locality in speculating tasks. That is, when a task is rerun because of a straggler, the locality of the data required for the task is not taken into account, and this itself may further slow the job completion time. Further delay can be introduced by disk read time or, even worse, transferring over the network, which in the case of data centers can be highly congested due to over-subscription. In addition, while LATE tries to detect stragglers early and speculate them, this is not necessarily the case compared to Mantri and, as the Mantri authors note, LATE does not take into account the true cause of the straggler.

Reining in the Outliers in Map-Reduce Clusters using Mantri

Ananthanarayanan et al. examine the causes of outliers in MapReduce jobs in order to detect outliers early in a job's execution. The causes are examined in more detail than simply the heterogeneity of the environment. The authors identify three categories of causes for outliers: machine characteristics, network characteristics, and imbalance. Machine characteristics are similar to those examined by Zaharia, and result from hardware reliability or resource contention. However, the authors also discover outliers can result from network congestion, since data centers tend to be over-subscribed. These characteristics are used in Mantri, which restarts an outlying task by considering resource constraints, work imbalance, and network placement. Furthermore, the authors note that certain tasks can have a large effect on a job's completion time at barriers in the workflow. That is, until a certain task completes, the job cannot continue.
By studying a large cluster, over an order of magnitude larger than those in previous work, Ananthanarayanan gains some insight into the prevalence of outliers. In 25% of phases, for example, more than 15% of tasks are outliers. As this is a significant portion of tasks, Mantri considers the cost of recomputing or duplicating tasks. The authors discover that in 40% of phases, duplicating tasks with high runtimes would have no effect on completion time and would merely waste resources, due to the amount of data they process or move onto the network. Evaluations show Mantri's performance to be better than the LATE heuristic and Hadoop's native scheduler. This is not surprising, as Mantri considers many more factors that may cause stragglers or outliers than LATE does. Perhaps its biggest strength over LATE is accounting for delay that may be incurred by network transfer. An additional strength of this design is that it is currently deployed and in use in Cosmos clusters.

From: mdford2 SpamElide
Sent: Thursday, February 17, 2011 10:02 AM
To: Gupta, Indranil
Subject: 525 review 02/17

Reining in the Outliers in Map-Reduce Clusters using Mantri

Mantri is a task manager for Microsoft's clusters running DryadLinq. Specifically, Mantri focuses on handling tasks that take unexpectedly long to compute. Since the structure of MapReduce requires a barrier between the map and reduce phases, and before another iteration of MapReduce, any stragglers have a large impact on job completion times. Microsoft observed three root causes of outliers in their DryadLinq systems: imbalance in workload, network characteristics, and machine characteristics. Network characteristics, for instance, may include information about data locality. Monitoring these factors helps in determining where to schedule a possible replacement task, but does not indicate when replacement tasks should be scheduled. The authors indicate that Mantri starts over 50% of its task copies before the original task has completed 42% of its work. This is in contrast to 77% in an earlier version of the task management system. The question arises: are Mantri's performance gains only due to this aggressive rescheduling? The paper does not show a false-positive rate, where original tasks actually complete before the replacement. While there are technical achievements that move the restart time from 77% to 42% completion, one could imagine using this as a parameter to answer the question: is restarting work earlier always better? Perhaps we should also note that resource usage plays an important role in large-scale clusters, and the paper does show the effects of Mantri on resource usage, but it does not show the interplay between resource usage and completion time.

Quincy: Fair Scheduling for Distributed Computing Clusters

Schedulers have been studied extensively, and optimized for everything including batch throughput, real-time systems, user interaction, and mobile devices. Quincy is the result of studying task scheduling in a new environment, that of MapReduce, Hadoop, and Dryad. Though scheduling was studied in the context of grid computing, the new systems have large local disks, allowing for some data storage. Moreover, the network is not uniform: machines are connected in a hierarchical way, so transferring data between different pairs of machines does not require the same resources. In this new setting, scheduling tasks on the machine, or in the rack, where their data resides becomes increasingly attractive.
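To make the locality argument concrete, here is a rough sketch of the kind of data-transfer cost a scheduler could attach to a candidate placement; the weights and the dictionary layout are invented for illustration, not taken from Quincy.

```python
def placement_cost(task_bytes_on, machine, rack_of, rack_penalty=1, core_penalty=5):
    """Rough cost of placing a task on `machine`: bytes local to that machine are
    free, bytes elsewhere in the same rack pay rack_penalty per byte, and bytes
    that must cross the core of the network pay core_penalty per byte."""
    cost = 0
    for holder, nbytes in task_bytes_on.items():
        if holder == machine:
            continue
        same_rack = rack_of[holder] == rack_of[machine]
        cost += (rack_penalty if same_rack else core_penalty) * nbytes
    return cost

# Input split across two machines; running on m1 only pays for m2's bytes.
print(placement_cost({"m1": 100, "m2": 50}, "m1",
                     rack_of={"m1": "rack0", "m2": "rack1"}))   # -> 250
```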
The basis for Quincy is a min-cost-flow algorithm that prioritizes data locality. Fairness is addressed by attempting to keep the number of machines computing on the various jobs equal. This does not take other resources, or even the flow cost of data on which Quincy is based, into account.

From: david.m.lundgren SpamElide on behalf of David Lundgren [lundgre4 SpamElide]
Sent: Thursday, February 17, 2011 2:38 AM
To: Gupta, Indranil
Subject: 525 review 02/17

IMPROVING MAP-REDUCE PERFORMANCE IN HETEROGENEOUS ENVIRONMENTS

In Improving MapReduce Performance in Heterogeneous Environments, Zaharia et al. ask: what happens when the homogeneity, linear progress, and other assumptions of MapReduce task schedulers do not hold? LATE considers node heterogeneity when assigning speculative tasks to nodes (it prefers to assign tasks to faster machines) and speculates only on a limited set of the slowest tasks. The authors measure the amount of heterogeneity in EC2 and also the performance across different workloads and varied schedulers on EC2. Hadoop's default speculative execution algorithm is improved upon using the authors' Longest Approximate Time to End (LATE) scheduling algorithm. LATE's "simple heuristic" for estimating time to completion, which assumes a constant progress rate, seems overly simplistic. The authors mention that they plan to pursue other heuristics for time-remaining estimation, but given that they emphasize the ease with which these time functions can be swapped, I would have liked to have seen a comparison of multiple estimates. Also, I feel their purposeful ignorance of data locality when calculating the fastest node to send a speculative task to is oversimplistic. I found it interesting that the best and average-case performance of Hadoop's scheduler vs. a no-speculation scheduler exhibited similar running times in a heterogeneous cluster. I thought their measurements of parameter sensitivity were thorough.

----------------------------------------------------------------------

REINING IN THE OUTLIERS IN MAP-REDUCE CLUSTERS USING MANTRI

Ananthanarayanan et al. present Mantri, an outlier detection and context-sensitive rectification system for Map-Reduce clusters. The authors analyze the pervasiveness of high-runtime outliers (defined as Map-Reduce tasks that take at least 1.5x the median task duration) and the various causes of outliers. Ananthanarayanan et al. then identify a group of outliers that, when properly classified, are improvable. These outlier classes include: straggling tasks "due to contention for resources," tasks waiting for recomputed input, and work imbalance. Mantri solves these problems through resource-aware restarts, network-aware placement, avoiding recomputation, and data-aware task ordering. Mantri is evaluated on Bing's production clusters and shows improvements over existing schedulers (including LATE) for all classes of outliers. I thought this paper was more thorough and theoretically motivated than its immediate competitor, LATE. The authors' analysis of the different classes of outliers was more detailed. This finer-grained classification of stragglers led to better results and a more powerful system. Like LATE, Mantri may use too simple a function to calculate the remaining time required to complete a task. I think these papers both imply a future direction of research in improving the accuracy of task completion time prediction.
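The flow-network encoding that several of the Quincy reviews above describe can be sketched in a few lines with networkx; the node layout and edge costs below are toy values chosen for illustration, and the sketch omits Quincy's fairness and preemption machinery.

```python
import networkx as nx

def build_toy_quincy_graph(tasks, computers, rack_of, preferred, unsched_cost=100):
    """Each task pushes one unit of flow toward a sink; the route it takes (a
    preferred computer, its rack, anywhere in the cluster, or 'unscheduled')
    is what the min-cost flow solver decides."""
    g = nx.DiGraph()
    n = len(tasks)
    for c in computers:
        g.add_edge(rack_of[c], c, capacity=n, weight=0)
        g.add_edge(c, "sink", capacity=1, weight=0)          # one task per computer
    for r in set(rack_of.values()):
        g.add_edge("cluster", r, capacity=n, weight=0)
    g.add_edge("unscheduled", "sink", capacity=n, weight=0)
    for t in tasks:
        g.add_edge("source", t, capacity=1, weight=0)
        g.add_edge(t, "cluster", capacity=1, weight=10)       # run anywhere: expensive
        g.add_edge(t, "unscheduled", capacity=1, weight=unsched_cost)
        for c in preferred[t]:
            g.add_edge(t, c, capacity=1, weight=1)            # data-local: cheap
            g.add_edge(t, rack_of[c], capacity=1, weight=5)   # rack-local: middling
    return g

g = build_toy_quincy_graph(tasks=["t1", "t2"], computers=["m1", "m2"],
                           rack_of={"m1": "rack0", "m2": "rack0"},
                           preferred={"t1": ["m1"], "t2": ["m2"]})
assignment = nx.max_flow_min_cost(g, "source", "sink")   # read placements off the flow
```

In the real system the edge costs would come from data-transfer estimates (like the placement cost sketched earlier) plus the cost of preempting running work, and the graph is re-solved as tasks arrive and finish.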
From: Long Kai [longkai1 SpamElide]
Sent: Thursday, February 17, 2011 1:11 AM
To: Gupta, Indranil
Subject: 525 review 02/17

Long Kai (longkai1)

Improving MapReduce Performance in Heterogeneous Environments

Summary: This paper presents a new way to schedule speculative tasks, which are re-executed copies of straggler tasks. The original scheduling algorithm implemented in Hadoop MapReduce is simple but naive. The assumption of homogeneous machines can easily be broken in real practice. Also, the algorithm used to identify stragglers is not accurate enough. Another drawback of the original scheduling algorithm is that the speculative tasks may be run on slow machines, which does not help solve the straggler problem. The LATE scheduling algorithm addresses the problems mentioned above by running speculative tasks from the very beginning instead of only in the last "wave", by giving a new algorithm to compute the remaining time used to identify stragglers, and by running the speculative tasks on fast machines.

Pros:
> Addresses the problem of running Map-Reduce in heterogeneous environments.
> Improves the overall speed of Map-Reduce using the methods mentioned in the summary above.
> Easy to implement.

Cons and future work:
> When using Map-Reduce for long jobs, efficiency may be worse, since speculative tasks are needlessly run from the very beginning and meanwhile consume computation and I/O resources.
> If a node asks for a new task and there are fewer than SpeculativeCap speculative tasks running, according to the paper, the node will run a speculative task. The paper does not make a comparison between straggler tasks and new tasks. If the new tasks require more time to finish, then the best choice is not to run a speculative task. This issue is not addressed in this paper.
> The algorithm to compute the remaining time is more accurate than the naive algorithm in Hadoop, but in some cases it is still not accurate enough. The paper mentions this potential problem. One simple way to improve the accuracy may be to do numerical interpolation based on previous progress reports when computing the remaining time.
> Ignoring the locality of speculative tasks may cause efficiency problems.

Other thoughts:
> Why, in principle, can we not run several copies of a speculative task?
> If the straggler is very slow and fast machines are available, can we kill the straggler to save resources?

Reining in the Outliers in Map-Reduce Clusters using Mantri

Summary: Mantri is a system that addresses the problem of outliers (a.k.a. stragglers) by identifying the causes of the outliers and using different methods to handle the problem accordingly. The causes of outliers can be congestion of the network, contention for other resources, and imbalance in the size of the input data. Mantri uses different strategies, including restarting the outlier task and taking locality into account, which reduces congestion of the network. Another feature is that Mantri also duplicates the intermediate results that are relatively likely to be lost and whose computation time is relatively high. This reduces the expected time to recompute lost data.

Pros:
> Addresses the different causes of outliers and solves them accordingly.
> Further increases the efficiency of Map-Reduce through scheduling.
> The model is general and can be used for other distributed systems.

Cons:
> The system is more complicated to implement.
> The actual algorithm realizing this design is hard to make both general and accurate.
Thus, when different job conditions arise, the efficiency of this system may vary.

Regards,
-- Long

From: lewis.tseng.taiwan.uiuc SpamElide on behalf of Lewis Tseng [ltseng3 SpamElide]
Sent: Wednesday, February 16, 2011 11:47 PM
To: indy SpamElide
Subject: 525 review 02/17

CS 525 Review 02/17: Cloud Scheduling

Improving MapReduce Performance in Heterogeneous Environments, M. Zaharia et al, OSDI 2008

The paper makes several contributions. First, it identifies the emerging use of the MapReduce framework in heterogeneous environments, especially virtualized data centers like Amazon's EC2. Due to potentially inconsistent performance across different virtual machines, which might be caused by different amounts of allocated resources or by contention over resources shared between virtual machines, many assumptions made by Hadoop are violated. Therefore, Hadoop's speculative execution plan does not increase performance; worse, such a plan sometimes severely hurts performance. Second, based on more realistic assumptions about the computing environment, the paper proposes a heuristic, greedy scheduler called "Longest Approximate Time to End (LATE)." This algorithm estimates the finish time of each task after it has run for 1 minute, then speculatively executes the tasks that will take the longest to finish on fast nodes (not straggler nodes). The paper argues that the LATE algorithm not only is robust to heterogeneity across nodes, but also improves job response time (delay) whenever a speculative job is executed. The paper backs up these claims with several evaluations done on EC2 and on a more controlled local cluster.

The LATE algorithm is simple, seems effective, and is backed up by solid evaluations on both commercial machines and a local testbed. Moreover, the assumptions about heterogeneous environments are more practical than the original Hadoop's. Therefore, the LATE algorithm should be practical. On the other hand, there are some things to ponder. First, the paper does not consider faulty nodes in the algorithm. It will be interesting to see whether the percentage of faulty nodes in the cluster affects the performance of the LATE algorithm. Second, the paper only focuses on short jobs to shorten delays. However, it might be useful to use the same estimation of finish time to measure the progress of long jobs and force a slow node running a long job to abort and execute a short job instead. Third, the paper argues that a phase-aware heuristic version of LATE might not be useful, because jobs with non-constant progress rates are rare and, by the time the estimation finishes, one wave of such jobs is already done. However, this kind of job depends on the nature of the data and how those data are provided. For example, if the data is provided in a streaming fashion and has strong dependencies, then it might be possible to predict and categorize data according to the different progress rates of each phase. In this way, such a phase-aware algorithm might be helpful.

Quincy: Fair Scheduling for Distributed Computing Clusters, M. Isard et al, SOSP 2009

Quincy is a centralized scheduler running on Dryad that considers both fairness and locality. One main contribution of the paper is to identify the key differences between scheduling in a distributed cluster and scheduling in the context of multi-core operating systems, virtual machines, or grid clusters.
In a distributed computing cluster, such as a Dryad or Hadoop framework, policies are more fine-grained, and there are few or no correlated constraints and little cross-cluster communication between nodes, since in each map and reduce step the tasks are completely distributed and parallel (except for reading files from remote storage). Due to these special features, the paper proposes using a graph-based approach rather than the traditional queue-based scheme to tackle the scheduling problem efficiently. The idea is to first use a flow network to represent information about policies, processing and transmission costs, cluster structure, and locality of data, and then to use a min-cost flow solver to find the minimum-cost flow on the given network and thus the desired schedule. In the end, the paper compares the performance of four different Quincy policies (with or without fairness and preemption) and common queue-based and greedy scheduling approaches on a medium-sized cluster, and backs up the claim that Quincy outperforms the traditional approaches.

One main contribution is the novel idea of using a flow network to specify the scheduling problem with different constraints. This mapping between two different subjects is important, because the min-cost flow problem is well studied and can be solved efficiently. The contribution will be even more significant if the idea can be generalized smoothly to different contexts, like multi-core operating systems or grid clusters, as the paper claims.

Comments/Questions/Future work: Flow-based scheduling seems to work only with a centralized scheduler; what if, in the future, the system/cluster becomes so large that a distributed scheduler is the only choice? Can the flow-based approach be adapted to this situation? In this paper, the migration of data between racks is not considered; would such a scheme help performance? Especially when they want to generalize fairness into a broader notion, i.e., take capacity of I/O, CPU, and memory into account, migration of data might help free some resources around hot spots and achieve fairer scheduling. The assumptions that nodes are homogeneous and that one node only takes one job at a time might not be very practical. Can the graph-based approach be generalized to relax these two assumptions by adding more nodes or adjusting the weights of a node's incoming edges in the flow network?

From: Agarwal, Rachit [rachit.ee SpamElide]
Sent: Wednesday, February 16, 2011 11:18 PM
To: Gupta, Indranil
Subject: 525 review 02/17

-----

Quincy: Fair Scheduling for Distributed Computing Clusters

The paper addresses the problem of fair scheduling in Map-Reduce. It proposes a flow-based algorithm with carefully chosen parameters that allows it to achieve fairness as defined in the paper. In general, the paper proposes a new problem and gives an initial solution. The formulation of the scheduling problem in Map-Reduce as a min-cost flow problem is interesting. That said, in my opinion, the paper has a lot of "catches":

1. The definition of fairness is tricky. Suppose I have a job that takes L time and starts at time t = 0; hence, it is supposed to finish at time t = L. Assume that at time t = L - \varepsilon, N >> 1 jobs are put into the cluster, each of which takes l << L time. According to their definition of fairness, it is fair for the first job to finish at time (N+1)L, despite the fact that it was about to finish. Seems unfair!
2. The authors mention that they only aim for guarantees when there is no contention in the network; however, the case with network contention *actually* seems more interesting.

3. The complexity of computing the min-cost flow is pretty large. This seems like the real killer for two reasons: (a) the authors propose to run the algorithm every time a new job is submitted; and (b) as far as my understanding goes, most Map-Reduce jobs are pretty small -- running in several minutes and sometimes in tens of minutes. The scheduling algorithm may become a bottleneck in such cases.

4. Also, it is not clear why the cost of each job should be independent of all other jobs in the system. In fact, I would assume the job completion times to be somewhat correlated. Are the authors assuming no network bottlenecks?

-----

Reining in the Outliers in Map-Reduce Clusters using Mantri

The paper proposes a scheduling algorithm to handle outliers in Map-Reduce. It also proposes several optimizations like network-aware placement of tasks and protecting outputs of tasks based on a cost-benefit analysis. The paper starts on a very interesting note -- getting into the roots of the outlier problem and understanding the problem using evaluation results. They also demonstrate the impact of outliers and characterize the main factors that, if handled, would improve system performance significantly. I liked the first-principles approach to this -- they take an evaluation strategy rather than an analytical strategy, which in the case of Map-Reduce seems the right thing to do. The paper, however, could have been improved on several points. I list some of them below:

1. The improvements provided by Mantri depend on how well it can estimate the values of t_{rem} and t_{new}. Their approach leads to a chicken-and-egg problem -- they show that the running time of a task may have little correlation with the fraction of the task that has been completed; on the other hand, they use linearity in running times to estimate the remaining times. This, in general, seems like a hard problem to me, given that the execution time of a task may be non-linear in the "amount of data" processed. There is actually some evidence (Randy Katz's other paper on heterogeneity in Map-Reduce) that this is hard.

2. Mantri estimates the probability of losing an output for each machine independently over long periods. I am wondering if, over a long period of time, the variation across the set of machines will be small.

3. The technique they use to estimate the time to recompute the output is also confusing. In particular, I believe that this time will depend on which other machines in the system fail, which other jobs need to be assigned, and many such conditions. In fact, this is what makes "fault handling" difficult in Map-Reduce. This is exactly the straggler problem.

4. In general, I believe there are a lot of issues with the way they use bandwidth in the system. In particular, it's not clear why their strategy would be any better (for the cases of interest) than just random placement of jobs. Or maybe I was so excited to see the evaluation results in Sections 3 and 4 that I completely missed the interesting ideas in later sections :)

-----

Best Regards,
Rachit

--
Siebel Center for Computer Science,
University of Illinois at Urbana-Champaign,
Web: http://agarwa16.wikidot.com
----
Sing as if No one is Listening, Dance as if No one is Watching, Dream as if You'll Live Forever, Live as if You'll Die Today !!
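The two cost-benefit comparisons discussed in the Mantri reviews above (when to duplicate an outlier, and when to protect a task's output) can be sketched roughly as follows; the estimators and the factor-of-two threshold are illustrative assumptions, not Mantri's exact rules.

```python
def should_duplicate(t_rem, t_new, slots_are_idle):
    """Start a copy of an outlier only if it is likely to save work overall:
    here, when the estimated time remaining clearly exceeds the cost of running
    the task again from scratch (relaxed when the cluster has idle slots)."""
    factor = 1.0 if slots_are_idle else 2.0
    return t_rem > factor * t_new

def should_replicate_output(p_loss, t_recompute, t_replicate):
    """Eagerly copy a task's output when the expected cost of recomputing it
    (probability of loss times recomputation time) exceeds the copying cost."""
    return p_loss * t_recompute > t_replicate

# Example: a task with 400 s left vs. 150 s for a fresh copy is worth duplicating.
print(should_duplicate(t_rem=400, t_new=150, slots_are_idle=False))   # -> True
```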
From: nicholas.nj.jordan SpamElide on behalf of Nicholas Jordan [njordan3 SpamElide] Sent: Wednesday, February 16, 2011 11:16 PM To: Gupta, Indranil Subject: 525 review 02/17 Reining in the Outliers in Map-Reduce Clusters using Mantri This group set out to decrease the overall running time of MapReduce jobs by minimizing the time that outliers stretch the computation at each phase. They identified the basic reasons for a task to take an unreasonable amount of execution time compared to the other tasks in its phase: the interplay between storage, network, and the structure of Map-Reduce jobs were the main culprits. I feel that many of the optimizations are a natural progression. For instance, they only allow a small number of duplicate copies of a task, using 3 as the maximum. Also, towards the end of a job, when there are idle machines, Mantri schedules duplicates more aggressively. They claim that their experiments are the first study focused on outliers, but I find it hard to believe that Google, Facebook, and Yahoo haven't had their own private solutions to this. I wasn't expecting a 33% reduction in overall time; this is impressive considering the scale and the money these resources consume. They define a task as an outlier if its computation time is more than 1.5 times the median. I wish they had shown how performance changes if 1.3 or 1.7 is used as the threshold, so that other people who wish to implement Mantri would have more data. Improving MapReduce Performance in Heterogeneous Environments The paper defines speculative tasks as tasks that are running slowly, and speculative execution as a way to improve the overall run time of a MapReduce job by addressing these slow tasks, either by restarting or by duplicating the task on another machine. Their approach is LATE, for Longest Approximate Time to End. They improve on the Hadoop heuristic, which is based only on progress; LATE decides which task to duplicate based on which is farthest from completion. Their model takes progress into account but uses past progress to guess when a task will finish, which is a more accurate model. They say their work is similar to scheduling in multiprocessor systems but different, in the respect that in a multiprocessor environment the processor speeds are known beforehand, so one can determine in advance which tasks should be speculated. I question that assumption, because you cannot determine beforehand how long a program will run without doing some of the computation. I did research over the summer where we were presented with this problem. I would like to see them deploy LATE in high-performance computing; I feel that the techniques can carry over. From: Shen LI [geminialex007 SpamElide] Sent: Wednesday, February 16, 2011 11:07 PM To: Gupta, Indranil Subject: 525 review 02/17 Name: Shen Li Improving MapReduce Performance in Heterogeneous Environments This paper proposes to improve MapReduce performance in heterogeneous environments by launching speculative executions more wisely according to several heuristics. Firstly, different from original Hadoop, LATE uses an estimate of the time left rather than the progress rate to decide which task should be backed up on a faster machine. Secondly, they apply a SpeculativeCap to curb the number of speculative tasks. Thirdly, they abandon the locality consideration when assigning speculative tasks; instead, they take the speed of the machine into account, i.e., speculative tasks are only executed on fast machines. (A small Python sketch of the time-left selection rule follows.)
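The following is a minimal Python sketch of the time-left heuristic summarized above, for readers who want to see the selection rule concretely; the record fields, threshold names, and surrounding bookkeeping are my own illustrative assumptions, not the authors' implementation.

def pick_task_to_speculate(running_tasks, now, slow_task_quantile=0.25):
    # running_tasks: list of dicts with 'progress' in [0, 1] and 'start_time'.
    # LATE only evaluates tasks that have already run for a while (one minute),
    # so we assume every task here has made some measurable progress.
    if not running_tasks:
        return None
    scored = []
    for t in running_tasks:
        elapsed = now - t["start_time"]
        rate = max(t["progress"], 1e-6) / elapsed      # progress per second
        time_left = (1.0 - t["progress"]) / rate       # estimated time to end
        scored.append((time_left, rate, t))

    # Only tasks whose progress rate falls in the slowest quartile are eligible.
    rates = sorted(r for _, r, _ in scored)
    cutoff = rates[int(slow_task_quantile * (len(rates) - 1))]
    eligible = [(time_left, t) for time_left, r, t in scored if r <= cutoff]

    # Among the slow tasks, back up the one expected to finish last.
    return max(eligible, key=lambda e: e[0])[1] if eligible else None

A real scheduler would additionally check, before launching the backup copy, that the requesting node is itself fast and that the cluster-wide number of speculative tasks stays below the SpeculativeCap, as described above.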
Cons: As the cluster is heterogeneous, can we assign different workloads to different machines, such as more slots on faster machines? For example, if we adopt the default setting (2 map slots and 2 reduce slots for each machine) while a fast machine has 32 cores and neither disk nor network resources are in short supply, I think launching more tasks on that machine is better. They wait until a task has run for 1 minute before evaluating it for speculation. Will that be too long for short jobs? Can we do it adaptively? In Section 4, they argue that "network utilization during the map phase is low", so launching speculative tasks on fast nodes will not do much harm. However, there are techniques that allow the reduce phase to start before the end of the map phase; in that case, if some reduce tasks are in their copy step, network resources may not be that abundant. As they mention in Section 3.2, "these waves get more spread out over time due to variance adding up"; will their estimate of the time left for a task still be reasonable? Reining in the Outliers in Map-Reduce Clusters using Mantri This paper presents Mantri, which can reduce the prevalence of outliers in Map-Reduce. They first study their logs and identify several major causes of outliers: 1) work is not divided evenly; 2) the shuffle between the map and reduce phases consumes a lot of network resources and leads to severe network contention, which is not considered when assigning tasks; 3) recomputations are localized, i.e., many recomputations occur on a small set of machines. Based on these observations, they propose several countermeasures. Tasks with more work to do are started earlier, to deal with the uneven-division problem. Network-aware placement is used to handle contention. They use replication and pre-computation to alleviate the effect of unavailable inputs. Their experiments show that Mantri improves job completion times by 32%. Pros: The designers of MapReduce mainly focused on how to make parallel computing easy to deploy, and many detailed performance issues were left behind. This paper is one of those that "patch" the MapReduce framework, and according to their evaluation they did a pretty good job. 1. Many users have found that MapReduce occasionally behaves oddly, and unexpected execution delay is a common phenomenon. This paper presents a thorough analysis of the causes of outliers, which helps us better understand what is going on inside the cluster. 2. To the best of my knowledge, this is the first paper to point out that we should take network conditions into consideration when placing tasks. Cons: 1. When running a MapReduce job with hundreds of machines, the master node is a potential bottleneck in the system, and their design relies heavily on a smarter master node, such as one that orders tasks according to their input data, which just makes the master node busier. 2. They replicate early the outputs whose cost to recompute exceeds the cost to replicate; however, the cost of replicating can itself be significant. From: cyhong1128 SpamElide on behalf of Chi-Yao Hong [hong78 SpamElide] Sent: Wednesday, February 16, 2011 10:42 PM To: Gupta, Indranil Subject: 525 review 02/17 ---- Quincy: Fair Scheduling for Distributed Computing Clusters, SOSP'09 ---- Quincy is a scheduling algorithm that provides both high efficiency and fairness for distributed jobs on clusters. According to recent measurements, the majority of jobs in clusters are short in terms of running time – more than half of the jobs take less than 30 minutes.
On the other hand, the distribution of job durations is long-tailed – some jobs can take up to 800 minutes to finish. With this great variance in job running time, fairness becomes increasingly important to avoid a large job monopolizing the whole cluster. Quincy is a flow-based scheduler that achieves good fairness while considering data locality at the same time. Moreover, a variety of policies can be imposed on Quincy. The results show that Quincy can reduce the communication overhead considerably, by a factor of about 4. Pros: 1) The problem modeling is new. Quincy formulates the scheduling problem for cluster jobs as a graph-based flow-minimization problem. 2) Leveraging the min-cost flow algorithm, Quincy achieves much better performance than other greedy scheduling methods that only look at local information. Cons: 1) It is unfair to compare the performance of the flow-based approach directly with that of the queue-based approach. A queue-based scheduler is much easier to implement in a distributed fashion; it does not require global knowledge such as link capacities and the global layout of the cluster. Also, all the queue-based approaches under investigation are fairly greedy; one could come up with a more sophisticated approach as a baseline scheme. 2) When clusters grow incrementally, heterogeneity shows up. Quincy does not address this potential issue. 3) To eliminate starvation, Quincy imposes an additional cost on edges from a running task to other nodes. This clearly increases the completion time of tasks – the intuition is that long tasks will be interrupted multiple times. The tradeoff between fairness (or weak starvation) and completion time is not addressed in the paper. ---- Reining in the Outliers in Map-Reduce Clusters using Mantri, OSDI'10 ---- Mantri is a system monitor that collects Map-Reduce task profiles. It detects outliers and improves job completion time by restarting tasks. Mantri will restart tasks that suffer from reading data over a low-bandwidth path. Mantri assigns tasks based on data location and network link utilization. It also protects outputs based on a cost-benefit analysis: in particular, when the recomputation cost is relatively high, Mantri will replicate the results. Pros: 1) The measurement analysis looks sound and successfully motivates the design. For example, since high running times of tasks do not necessarily indicate slow execution, Mantri does not blindly restart every task with a high running time. 2) The problem motivation is strong. Simply by removing the outliers, job completion time could be improved by 34.7%. Cons: 1) It is unclear why the restart algorithm can accurately detect outliers. A study could help the reader better understand the sensitivity of the restart algorithm. 2) The evaluation workload is somewhat limited. It is unclear whether the selected applications are representative. Thanks, Chi-Yao Hong From: mark overholt [overholt.mark SpamElide] Sent: Wednesday, February 16, 2011 8:30 PM To: Gupta, Indranil Subject: 525 review 02/17 Mark Overholt CS525 02/17/2011 Improving MapReduce Performance in Heterogeneous Environments Summary: This paper aims to improve the scheduling module of Hadoop, the open-source Map/Reduce implementation. The authors claim that Hadoop assumes a number of things about its system, some of which break down in certain, likely situations. The main assumption is that the underlying infrastructure is homogeneous.
They also assume that each phase of a reduce task completes in the same amount of time, which does not hold true. The issue these failed assumptions caused was that too many tasks were marked as stragglers, which led to many tasks being re-run and slowed down the system. The authors propose a new scheduling algorithm that tries to cure some of these issues and that they claim is 2 times faster. Their solution is called LATE, which stands for Longest Approximate Time to End. The main idea is that it estimates how long a task will take to finish, whereas the normal Hadoop scheduler measures the percentage of progress made. On a homogeneous system the latter might work well, since it can be assumed that tasks complete at the same rate on all computation nodes, but in a heterogeneous system this is not the case. A nice example they give: a machine that is 5x slower than average but at 90% progress will have a shorter time to finish than a machine that is only 2x slower with just 10% finished. Even though the 5x machine is much slower, its task will finish before you could spin up a copy on a new, faster machine. The LATE algorithm works on just a few rules. If the number of speculative tasks is below a threshold (a value set by LATE), an idle node is given a speculative task to run, chosen by the following rules: 1. Don't assign a new task to that node if it is running slowly, i.e., it is below the SlowNodeThreshold, which is a calculated value. 2. Rank all non-speculated tasks by estimated time left. 3. Launch a copy of the highest-ranked task whose progress rate is below the SlowTaskThreshold. Discussion: Pros: Simple concept with simple metrics; seems like it would be easy to put into use. The scheduler is not too sensitive to parameter settings. The results were verified on many different systems (production cloud, VMs, etc.). Cons: It explicitly ignores data locality when scheduling speculative map tasks. As the authors admit, the algorithm breaks if a task has multiple phases and the earlier phases are faster than the later phases; this skews the remaining-time calculation. Quincy: Fair Scheduling for Distributed Computing Clusters Summary: Many current distributed systems consist of large clusters of inexpensive commodity computers. In such models of distributed computation, it is common for application data to be stored on the computing nodes. This paper introduces a graph-based framework for scheduling concurrent jobs in such systems. The idea is to map the scheduling problem to a min-cost flow problem on a directed graph; an optimal scheduling policy can then be obtained by performing a global search for a min-cost flow solution. It is demonstrated with experimental data that, compared to existing queue-based scheduling algorithms, Quincy provides better fairness and improved data locality. The main advantage of the graph-based scheme proposed here is that it seems general enough to encode other, more complicated policies that could use additional information available about the jobs. The only complaint here is that the "fairness" terminology can be somewhat misleading. Example: if predicted job running times are available, the shortest-job-first policy is known to be very good at minimizing average latency, even though it is unfair to the more resource-intensive jobs in some sense; yet the graph-based scheme is perfectly capable of encoding such a policy. Discussion: Pros: Works well for data mining and network traffic analysis.
A simple and highly applicable framework gives the ability to apply the algorithm to different classes of high-performance applications. Cons: Quincy verifies its results only on the Dryad platform. Can these results also be verified on other cluster platforms? A primary concern is the scalability of this approach: it has been evaluated on only a 243-node cluster. How would the performance be affected if there were 1,000 nodes, or 10,000 nodes? From: Tengfei Mu [tengfeimu SpamElide] Sent: Wednesday, February 16, 2011 5:29 PM To: indy SpamElide Cc: Gupta, Indranil Subject: 525 review 02/17 Tengfei Mu (mu3 SpamElide) 1. Quincy: Fair Scheduling for Distributed Computing Clusters The paper proposes an approach for scheduling concurrent jobs on clusters with fine-grained resource sharing. The main challenge is considering both fairness and locality, which often conflict with each other. They construct a mapping from the fair-scheduling problem to a min-cost flow graph structure, then use a global cost model to optimize the scheduling decision. Compared with a queue-based scheduler, Quincy achieves better fairness and data locality. Pro: 1. Improving both data locality and fairness is a very useful property of a scheduler. 2. Using a graph structure to represent the cluster and perform optimizations is a novel approach that could be reused in future cloud-computing problems; it could even be extended to encode other useful information in the graph when making decisions. Con: 1. For cloud computing, scalability is a critical issue; however, they did not measure the scheduling overhead at scale. 2. Improving MapReduce Performance in Heterogeneous Environments This paper proposes a scheduling algorithm, LATE, to perform speculation robustly and maximize performance. The MapReduce programming model is really popular nowadays, but there are flaws in the threshold-based scheduling algorithm in Hadoop and in progress-rate-based algorithms in general. LATE uses estimated finish times to speculatively execute the tasks that hurt the response time the most, breaking Hadoop's homogeneous-environment assumption. Compared with Hadoop's native scheduling algorithm, LATE improved Hadoop response times by a factor of 2. Pro: 1. LATE extends Hadoop scheduling to heterogeneous environments and improves response times. 2. The LATE scheduler is not very sensitive to its parameter settings. Con: 1. The paper did not evaluate the overhead of running on long jobs. 2. LATE does not take data locality into consideration. 3. It assumes that all tasks proceed at a constant rate. From: Tony Huang [tonyh1986 SpamElide] Sent: Wednesday, February 16, 2011 2:40 PM To: Gupta, Indranil Cc: Tony Huang Subject: 525 review 02/17 Zu C Huang SID: zuhuang1 Quincy: Fair Scheduling for Distributed Computing Clusters ---------------------------------------------------------- Summary: In this paper, the authors introduce a new framework for scheduling concurrent distributed jobs. Job placement and scheduling are modeled as a flow problem on a directed graph and solved as a min-cost flow on that graph. Each job is modeled as a source in the graph; rack aggregators, computers, and "unscheduled" nodes are modeled as intermediate nodes; and a sink node is where all flow exits the graph. A min-cost flow on this graph is claimed to also be the optimal scheduling on the actual machines (a toy version of this construction is sketched below).
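To make the construction in the summary above concrete, here is a toy Python sketch (assuming the networkx library, and with made-up node names, costs, and capacities rather than Quincy's actual cost model) that maps one scheduling round of a single job onto a min-cost flow problem.

import networkx as nx

G = nx.DiGraph()

tasks = ["t0", "t1", "t2"]      # worker tasks of one job
computers = ["c0", "c1"]        # machines in a single rack "r0"

# Each task supplies one unit of flow; the sink absorbs all of it.
for t in tasks:
    G.add_node(t, demand=-1)
G.add_node("sink", demand=len(tasks))

for t in tasks:
    # A cheap edge to the machine holding most of the task's input data, a
    # pricier edge to the rack aggregator, and an expensive "unscheduled"
    # edge so the problem always stays feasible.
    G.add_edge(t, "c0", weight=1, capacity=1)     # data-local placement
    G.add_edge(t, "r0", weight=5, capacity=1)     # rack-local placement
    G.add_edge(t, "unsched", weight=50, capacity=1)

# The rack aggregator can forward tasks to any machine in its rack.
G.add_edge("r0", "c0", weight=0, capacity=len(tasks))
G.add_edge("r0", "c1", weight=0, capacity=len(tasks))

# One free slot per machine this round; unscheduled flow goes straight to the sink.
for c in computers:
    G.add_edge(c, "sink", weight=0, capacity=1)
G.add_edge("unsched", "sink", weight=0, capacity=len(tasks))

flow = nx.min_cost_flow(G)
for t in tasks:
    first_hop = [dst for dst, f in flow[t].items() if f > 0]
    print(t, "->", first_hop)   # a machine, the rack aggregator, or "unsched"

In the real system there is one such sub-structure per job and the edge costs encode data-transfer and preemption penalties; the parameters controlling those costs are discussed next.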
Parameters are introduced to control Quincy's behavior and its aggressiveness in optimizing for data locality. Prominent parameters include the cost of transferring data across the core switch or a rack switch, as well as the cost of waiting in the queue. Quincy uses global information to determine the optimal execution sequence during one cycle. When a prominent event occurs, for example a job finishing, Quincy computes the next optimal scheduling scheme and makes the corresponding scheduling decisions. Existing tasks can be preempted according to the scheme chosen. Pros: * Min-cost flow is a mature and proven technique. Quincy's mapping from job scheduling to a directed graph is intuitive and detailed enough to include most of the prominent factors that should be considered in scheduling. * The framework is general enough that, if other prominent factors are identified, they can also be incorporated into the graphical model by adjusting the penalty functions. * Using global information can generate near-optimal allocations. * It provides a general fair-scheduling scheme for large clusters. Cons: * The paper does not discuss what happens if a job fails (say, due to machine failure), how to adjust to changes in network topology, and other such issues. * Since Quincy is a centralized placement model, it also becomes a bottleneck and a single point of failure. What happens if the machine running Quincy suffers an outage? Opinion: * Quincy assumes a homogeneous environment, though I think a heterogeneous environment could easily be incorporated by assigning different penalty functions to different nodes. However, this may lead to an explosion in the parameter space and make fine-tuning the system difficult. * Sensitivity to parameters: the authors did not address how sensitive the system is to different parameters; the effects that different combinations of parameters have on the system should be explored or explained more clearly in the paper. Reining in the Outliers in Map-Reduce Clusters using Mantri ------------------------------------------------------------ Summary: In this paper, the authors introduce a system, Mantri, that monitors and reduces outliers in a cluster. Outliers are map-reduce tasks that take an exorbitant time to finish, thus delaying the finish time of the whole map-reduce job. Mantri also tries to prevent recomputation due to data loss. Mantri attacks the problem by restarting outlier tasks in a resource- and network-aware way, and it protects task outputs by replicating them. Mantri detects outliers as tasks whose execution time far exceeds the median, and it restarts a task only if the estimated benefit of restarting exceeds the cost. The trade-off between wasteful restarts and larger speed-ups is controlled by system parameters. Network-aware placement of tasks is achieved by a local approximation algorithm. Pros: * It is a more comprehensive system than the system introduced in the other paper (LATE). * It at least takes network congestion into consideration by introducing a local optimization algorithm. Cons: * How accurate the local optimization algorithm is, is neither analytically nor experimentally explored. * The estimation of completion time still uses a time-averaged approach, so it does not account for sudden changes like a hard-drive malfunction or a network outage. * Replication should be implemented behind the scenes by the underlying cloud file system, like GFS, instead of being done by the outlier-detection system. Opinion: Mantri is a much more polished and well-developed system than the ad hoc system in LATE.
However, it still builds on estimation and heuristics, and it may still suffer from estimation error and suboptimal local decisions. That said, this is a classic trade-off that designers have to make for large-scale distributed systems. -- Regards -- Tony