From: Rini Kaushik [rinikaushik@yahoo.com]
Sent: Thursday, February 18, 2010 12:34 PM
To: Gupta, Indranil
Subject: CS525 review 02/18

CA-NFS
------
The authors propose CA-NFS, which is holistic in that it manages all resources, including network bandwidth, server I/O, server CPU, and client and server memory utilization. It accelerates, defers, or cancels asynchronous requests in order to improve application-perceived performance directly. It uses congestion pricing via online auctions to coordinate the use of system resources by the file system clients, so that they can detect shortages and adapt their resource usage. As the server prices increase, the clients that are not resource constrained will defer asynchronous operations to a later time and, thus, reduce their presented load.

Pros:
1) The underlying pricing algorithm, based on resource utilization, provides a log-k competitive solution to resource pricing when compared with an offline algorithm that "knows" all future requests. In contrast to heuristic methods for moving thresholds, this approach is system and workload independent. (A small sketch of this pricing idea follows this review.)
2) In contrast to regular NFS, CA-NFS clients adapt their asynchronous write behavior by either deferring or accelerating a write.

Cons:
1) Write acceleration (which forces syncs to the disk) may not be conducive for file systems that have a small-write penalty. A lot of file systems, including WAFL used in NetApp servers, cache small writes for several reasons: (a) several writes may get overwritten or deleted, and hence there is no need to write to the disk; (b) coalescing writes leads to better data placement on the disk. So, I don't agree that "write acceleration" has almost no negative effect on system performance, which is one of the key assumptions in the paper.
2) There are several assumptions in the paper, such as utilization tracking and block checking, to determine the pricing. However, the evaluation doesn't discuss the overhead of these checks, their accuracy, a sensitivity analysis of the thresholds, etc. All of these would be important considerations.
3) The paper states that asynchronous reads are only read-ahead in nature. I don't agree with this, as an application may use asynchronous reads simply because of their non-blocking nature.
4) The paper assumes that CA-NFS users do not need to set the value of k explicitly, as it is precomputed for most existing device types. I don't agree with this. For example, the characteristics (bandwidth, latency) differ substantially between SCSI and SATA disks.
5) The authors claim that the prices create a single view of system load in a resource-independent manner. I would argue that some applications may have diverse needs from resources and hence may not like/want a single view of the system resources.
6) The paper mentions that if the value of k differs significantly from its optimal (as calculated) setting, performance degradation is notable. Low values of k lead to underutilization of the system resources, while high values make the system less adaptive, as prices increase very rapidly. The authors don't show how an optimal selection can be made in a workload-agnostic manner.
7) If some clients understand the pricing model and some don't, what happens then?
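(Editor's sketch: the log-k competitive pricing referenced in Pro 1 is easiest to see concretely. This is a minimal illustration, assuming the exponential price form P(u) = k^u from the Awerbuch et al. line of work that CA-NFS builds on; the paper's exact normalization and resource list may differ.)

```python
# Minimal sketch (assumed form, not the paper's exact code): utilization u
# is normalized to [0, 1] and a resource's price grows exponentially, so
# remaining capacity becomes rapidly more expensive near saturation.

def resource_price(utilization, k):
    u = min(max(utilization, 0.0), 1.0)
    return k ** u

def operation_cost(utilizations, ks):
    # CA-NFS sets an operation's cumulative cost to the price of its most
    # expensive (bottleneck) resource.
    return max(resource_price(utilizations[r], ks[r]) for r in utilizations)

# Example: a nearly saturated disk dominates the cost of an async write.
print(operation_cost({"net": 0.30, "disk": 0.95}, {"net": 64, "disk": 64}))
# ~52.0 (= 64**0.95), versus ~3.5 (= 64**0.30) if only the network mattered.
```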
Map-reduce in heterogeneous environments
----------------------------------------
This paper identifies the problems of Hadoop's task scheduler and proposes a simple new algorithm called Longest Approximate Time to End (LATE) to improve its performance in a heterogeneous cluster. Hadoop's task scheduler assumes a homogeneous cluster where each node runs at more or less the same speed. In such an environment, the task progress rate may be used to identify straggler tasks. Such a task is assigned to any node that has an idle task slot, and the data needed for that task is local to that idle node. Although this simple heuristic works in a homogeneous environment, in heterogeneous clusters where nodes have different speeds, this policy might cause a slow idle node to run a speculative execution; thus, unnecessary speculative executions will be run. In order to resolve this problem, the authors propose LATE, where the remaining time to complete a task is used to identify a straggler task. This speculation is done only after the task has run for at least 1 minute. In order to limit the number of speculative executions, LATE uses a threshold-based scheme.

Pros:
1) Unlike the standard Hadoop scheduler, LATE considers node heterogeneity and relaunches only the tasks which have the potential to compromise the response time, and only a small number of tasks. This should considerably reduce false positives, whereby a task is deemed slow and unnecessarily relaunched.
2) LATE also chooses the most conducive nodes for running the speculative tasks instead of blindly picking a node, as done in the bare-bones Hadoop scheduler.

Cons:
1) The heuristics assume a constant rate of progress, which is not necessarily true. There can be several transient situations which could cause a change in the progress rate. Does the utilization of the systems remain constant over time? A node may have performed well in the past and may qualify as a speculative execution node; however, the same node may behave like a straggler if its utilization becomes so high that the service time starts to get impacted. Such transient behaviors also need to be accounted for in the calculations.
2) What if a task was lagging behind because it was waiting on I/O, with the disk either bottlenecked or erroring out? Moving the task to another machine will also not help in this case until the underlying disk problem is solved.
3) It would be interesting to compare the effectiveness of LATE's heuristic thresholds with different kinds of workloads -- disk-bound/CPU-bound/varied distributions and characteristics.
4) The paper assumes that the progress rate reporting is accurate. What if there are lags there which could create false negatives?

From: Giang Nguyen [nguyen59@illinois.edu]
Sent: Thursday, February 18, 2010 12:26 PM
To: Gupta, Indranil
Subject: 525 review 02/18

nguyen59 02/18/2010

CA-NFS: A Congestion-Aware Network File System

The paper says that network file systems need to be aware of the resources of the whole system. Currently, all client requests are given the same importance, and clients do not explicitly know about the congestion the server is experiencing or the existence of other clients. So the paper describes a system in which the server monitors its resources, such as CPU, memory, and network, and holds a reverse auction when processing clients' requests. When such a resource gets near capacity, its price goes higher.
So, a client's request that consumes that resource (e.g., an asynchronous write consumes the server's memory to buffer the write) will have to pay the higher price. If the client can delay its request (e.g., it has a lot of free memory to hold the write and delay it), then it can wait to pay a lower price once the server's resource becomes free. The 20% improvement in execution time is a good improvement. At a higher level, I wonder, however, whether this idea can be applied to other network file systems, or whether it is more specific to the NFS protocol. By the way, I think the paper should have given a quick overview of the NFS protocol for unfamiliar readers.

--------------------------------------------------

Improving MapReduce Performance in Heterogeneous Environments

Yahoo's Hadoop MapReduce system's task scheduler does have speculative execution: automatically starting a slow-running task on a different machine. However, Hadoop's native formula/algorithm for determining "slow running" assumes a homogeneous environment of cluster machines. Furthermore, with no cap on speculative execution, the authors observe that 80% of the reduce jobs were speculated. These and other assumptions/reasons prompt the authors to improve Hadoop's task scheduler. The key point is that, to improve the response time of map-reduce jobs, we should speculatively execute the task that we think will finish farthest into the future. Also, we should select "fast" nodes to run the speculative tasks on (not just any available node), and also put a cap on the speculation to prevent thrashing. The authors observe a 2x improvement over Hadoop's native scheduler.

The observation that 80% of reduce jobs are speculated is "spectacular." That is a clear indication of something wrong. I am curious as to why the Yahoo researchers did not foresee the shortcoming of their implementation. On the other hand, the heuristics presented in the paper are still "simple" in the authors' words. Probably someone will work on improving those heuristics, and I am wondering if it's possible to show a theoretical "best" limit that such speculative execution can achieve.

From: Shivaram V [shivaram.smtp@gmail.com] on behalf of Shivaram Venkataraman [venkata4@illinois.edu]
Sent: Thursday, February 18, 2010 12:22 PM
To: Gupta, Indranil
Subject: CS 525 review 02/18

Shivaram Venkataraman - 02/18

Improving MapReduce Performance in Heterogeneous Environments

This paper presents LATE (Longest Approximate Time to End), a scheduler for MapReduce jobs that helps improve the response time in heterogeneous environments. MapReduce uses a technique called 'speculative execution' and schedules additional copies of tasks to overcome poorly performing 'straggler' nodes. Hadoop, an open-source implementation of MapReduce, calculates the progress of a task and schedules speculative tasks if the progress is less than a particular threshold. This technique is found to be inefficient, as the rate of progress may vary among different machines in a heterogeneous environment due to network and I/O contention. Additionally, Hadoop assumes that tasks progress at a constant rate and that tasks in the same category require the same amount of work. On the other hand, LATE picks the task that will finish farthest into the future for speculative execution. To do so, LATE calculates the progress rate of each task and estimates the time it will take to complete.
If a task's progress rate is below the 'SlowTaskThreshold', a copy of the task is scheduled; requests from nodes whose total progress is below the 'SlowNodeThreshold' are ignored. The progress of the node is measured to ensure that speculative tasks are only scheduled on fast nodes. To guard against overloading the system, LATE also keeps the total number of speculative tasks below a threshold known as the 'SpeculativeCap'. In practice, a good choice for the SpeculativeCap was found to be 10% of available task slots, and the 25th percentile of node and task progress rates was found to be a good choice for the SlowNodeThreshold and SlowTaskThreshold, respectively. The major lessons from this work are that scheduling decisions should be made early and that finishing times are better indicators than progress rates for data-intensive computations.

Pros
- Exhaustive evaluation on Amazon EC2 and a test cluster for determining how sensitive the scheduler is to each variable.
- LATE halves the response time compared to the native scheduler.

Cons
- The progress rate heuristic could be wrong for certain types of tasks.
- The scheduling optimization is only applicable to MapReduce jobs with low response time.

Interesting points
- The I/O and network throughput is heavily dependent on whether Hadoop instances are co-located in EC2.

CA-NFS: Congestion Aware Network File System

This paper presents a framework for scheduling asynchronous requests in a distributed file system, using an online auction model based on the resources available on the client and server machines. Distributed file systems often have the option of deferring a write or reading ahead to improve overall performance. CA-NFS uses the insight that if the server is congested, clients can buffer writes in memory and postpone asynchronous read-ahead operations. On the other hand, when resource utilization is low, clients can accelerate a write by forcing the server to sync the data to stable storage, and can aggressively perform read-ahead operations to increase the number of cache hits. The auction framework used in CA-NFS is modeled on the algorithm of Awerbuch et al., originally proposed for bandwidth sharing in networks. The algorithm offers a performance degradation of log(k) when compared to an offline optimal algorithm, where k is the ratio between the maximum and minimum benefit realized by the online algorithm over all inputs. The pricing function used in the algorithm depends on the performance degradation observed by the end user as the resource becomes congested. CA-NFS picks the right thresholds for different resource types through experiments, and the cumulative cost of an operation is estimated by the highest-cost resource. The resources which CA-NFS uses in the framework include server CPU usage, client and server network bandwidth, server disk write queue length, read cache hit rates, and client read-ahead effectiveness. This model is implemented on the NFS protocol by modifying the NFS server, the client, and the OS daemon which flushes dirty pages to disk.

Pros
- The scheduler is holistic and takes into account a heterogeneous set of resources including CPU, memory, disk, and network.
- Implemented with minimal change to the NFS protocol.

Cons
- Clients cannot specify a lower priority for tasks like garbage collection, whose synchronous operations could be halted as they are not time-critical.
- It would be interesting to see how non-cooperative clients are scheduled, as ensuring fairness between clients is mentioned as a future direction.
Interesting points
- High-speed networks that are faster than disk transfer rates could result in a cascading throughput crash.

From: pooja.agarwal.mit@gmail.com on behalf of pooja agarwal [pagarwl@illinois.edu]
Sent: Thursday, February 18, 2010 12:19 PM
To: Indranil Gupta
Subject: 525 review 02/18

DS REVIEW 02/18
By: Pooja Agarwal

Paper - CA-NFS: A Congestion-Aware Network File System
Authors - A Batsakis, R Burns, A Kanevsky, J Lentini, T Talpey
Conference – USENIX FAST 2009

Main Idea:
This paper presents CA-NFS, a performance management framework built on top of NFS. It provides priority scheduling of asynchronous tasks relative to synchronous tasks by either deferring or accelerating them. The scheduling is based on the current system load and the state of different resources. To improve performance during congestion, it defers asynchronous writes, while when ample resources are available, it performs accelerated asynchronous writes and reads. It takes a holistic view of all the critical resources belonging to the complete distributed system, like CPU cycles, memory, network bandwidth, and disk I/O. The key idea of the paper is the pricing mechanism, in which the cost of each operation is calculated dynamically based on the current use and availability of different resources on both the server and the clients. Based on these prices, CA-NFS uses a reverse auction model to decide whether an operation needs to be accelerated or deferred. The pricing mechanism unifies the congestion information across resources to make them comparable. Their evaluation shows about a 20% improvement in execution time over NFS.

Pros:
1) The system has been evaluated against several benchmarks, ranging from microbenchmarks to macrobenchmarks. They also point out the hazards arising from a high-speed network combined with slow disks.
2) The pricing for each operation takes into account dynamically changing resource usage, which provides more appropriate prices at a given time.
3) The results show that CA-NFS achieves a 12%-23% improvement over NFS on different benchmarks, which is fairly high.

Cons:
1) The scheme does not take into account the priority and sequence of operations. It is possible that a given asynchronous operation has higher priority than the synchronous operations, or that the asynchronous operation needs to be finished before other synchronous operations due to read-write dependencies. Deferring/accelerating these operations can lead to poor performance for asynchronous operations in the first case, and can lead to long delays or incorrect results for dependent synchronous operations.
2) The pricing calculated at each client is prone to selfish behavior. A client can fake higher prices for operations on its side (in this system the client does not even need to fake prices, as the server does not verify them), forcing the server to perform operations for it. This becomes unfair to other clients. The server needs to perform resource allocation for each user based on that user's usage dynamics.

Paper – Improving MapReduce Performance in Heterogeneous Environments
Authors – M Zaharia, A Konwinski, A Joseph, R Katz, I Stoica
Conference – USENIX OSDI 2008

Main Idea:
The paper presents a new scheduling algorithm, LATE (Longest Approximate Time to End), which aims to improve Hadoop's task scheduler's performance in heterogeneous environments. Hadoop is based on several homogeneity and other scheduling assumptions. The authors discuss the scenarios in which these assumptions are invalidated.
For example, based on the homogeneity assumption, Hadoop spawns a speculative copy for straggler tasks that are progressing slower than the mean progress. The idea of mean progress is invalidated in heterogeneous environments like clouds, or when different generations of hardware are used. Many of the assumptions do not even hold for homogeneous environments. To increase the performance of Hadoop in heterogeneous environments, the authors propose the LATE scheduling algorithm, which comprises the following key points: 1) Speculatively execute the task which is likely to finish farthest in the future, giving the speculative copy the opportunity to catch up with it if needed. 2) Use the progress rate to estimate the end time of a task. 3) Use fast nodes to launch the speculative copy. 4) A speculative cap on the number of copies that can be running in the system. 5) Data-locality-independent task scheduling. 6) Prioritizing slow jobs based on their effect on response time. The scheme is evaluated on EC2 and on a localized setup employing mixed nodes running different numbers of VMs per machine. On average, LATE improved the response time by a factor of 2 as compared to Hadoop.

Pros:
1) Presents a novel scheduling algorithm, LATE, that takes into account heterogeneous environments, which are emerging with clouds and large private infrastructures.
2) The decrease in execution times is impressive, ranging from 27% to 56% faster than Hadoop on different applications. Lower execution time results in fewer hours used by clients in cloud computing and hence saves both money and time.
3) The heterogeneity produced in the experiments by mixing nodes running different numbers of VMs provides a more credible evaluation strategy.
4) The idea of using the execution rate provides greater performance in both heterogeneous and homogeneous environments.

Cons:
1) LATE is based on the assumption that tasks make progress at a constant rate, which, as they themselves point out through an example, is not a good assumption.
2) The evaluation is not done for homogeneous environments; it is not clear if it at least gives comparable performance to Hadoop in such cases.
3) The EC2 environment used only co-located VMs, which might have skewed the results in LATE's favor.
4) The highly sensitive parameters lead to different parameter settings for different applications. It is an overhead to calculate these parameters before using any new application.

From: Vivek [vivek112@gmail.com]
Sent: Thursday, February 18, 2010 12:16 PM
To: Gupta, Indranil
Subject: 525 review 02/18

Improving MapReduce Performance in Heterogeneous Environments

Core Idea: The design of a new scheduling algorithm known as LATE is used to improve the performance of MapReduce on a heterogeneous set of nodes. The work describes how to implement and improve speculative execution in Hadoop to improve response time (they claim that their methods can improve response time by a factor of 2). They show that the algorithm as a whole works better than the Hadoop version of speculative execution, and reduces overall contention on the interconnect network.

Pros:
- Puts a limit on the number of speculative tasks to avoid resource contention. This allows for moderation in resource usage, and a reduction in network contention.
- Reduces the number of stragglers in the system, so that "slow nodes" are not consistently allowed to do only small amounts of work.
- Identifying tasks that are slow is done well in advance, so that resources are used as effectively as possible, all the time.
Cons:
- Can this type of infrastructure address the needs of both static performance optimization and dynamic performance optimization? Perhaps it can, but more support for scientific applications that have regular communication patterns would be useful.
- The general experimentation done does not seem to cover different sets of applications. Is this true? It might be very useful to illuminate how the scheduler behaves for workloads that are both computation- and data-intensive, rather than just data-intensive.

Quincy: Fair Scheduling for Distributed Computing Clusters

Core Idea: In this paper, the fundamental problem of scheduling concurrent jobs on clusters running data-intensive workloads is examined. A general framework is introduced, and the belief is that fine-grained resource sharing is a must for distributed computing clusters. The framework is essentially represented as a graph where edges represent the demands of data locality and fairness. A solver is used to compute the optimal online schedule according to a cumulative cost model. The framework implementation is named Quincy, and uses the Dryad execution environment.

Pros:
- Works especially well for applications such as data mining, machine learning, and network traffic analysis.
- Allows for fine-grained resource management, which is particularly important in systems like Hadoop and Dryad.
- The simple and highly applicable framework gives the ability to apply the algorithms to different classes of high-performance applications.
- The two competing forces of data locality and fairness are clearly defined in the performance cost model, and a systematic methodology for finding the right balance between the two is demonstrated.

Cons:
- If one is to significantly deviate from traditional resource allocation policies implemented by high-performance clusters, and suggest a policy that is generally applicable, then more applications should be tested. This paper claims that it works well only for certain applications that are currently in use on distributed computing clusters. But what about other scientific applications such as computational cosmology or physics simulations?
- Quincy is evaluated against a queue-based algorithm which was used by Dryad earlier on. They do the evaluation in a very theoretical manner, which is very good from a conceptual and academic point of view. But do they need to consider system features or details of the underlying hardware? Or perhaps the network topology? It would seem some more details on this might be useful.
- Quincy particularly targets the Dryad execution environment. As this is a Microsoft paper, this seems natural. However, the argument for the generality of their techniques is weakened by their focus on Dryad. What about scheduling used in other systems (even if they are less well-known)? This would seem to strengthen their argument more.

From: gildong2@gmail.com on behalf of Hyun Duk Kim [hkim277@illinois.edu]
Sent: Thursday, February 18, 2010 12:11 PM
To: Gupta, Indranil
Subject: 525 review 02/18

525 review 02/18
Hyun Duk Kim (hkim277)

* Improving MapReduce Performance in Heterogeneous Environments, M. Zaharia et al, OSDI 2008

As the need for large-scale data analysis increases, the use of the open-source program Hadoop is growing. Hadoop's current performance is closely tied to its task scheduler. One of the problems the scheduler has is that it assumes cluster nodes are homogeneous.
However, many of the latest virtualized data centers, such as Amazon's Elastic Compute Cloud, are heterogeneous environments. This paper suggests a new scheduling algorithm, the Longest Approximate Time to End (LATE) algorithm, for robust computation in heterogeneous environments. LATE mainly optimizes speculative task execution. Hadoop automatically re-executes stragglers to improve overall job speed. However, because the current speculative execution in Hadoop is inefficient for heterogeneous nodes, LATE introduces new concepts like the SlowNodeThreshold and SlowTaskThreshold, and new ways to estimate the remaining execution time. According to the experimental results, LATE shows better performance in heterogeneous environments and is robust to different parameter settings.

This paper does a very close and structured analysis of the various assumptions of the current Hadoop scheduler and explains why each may be bad in a heterogeneous environment. The various experiments also show well how each suggestion can help improve performance.

The LATE algorithm is mainly for improving speculative execution; it does not consider other possible scheduling improvements. Also, many of the experimental situations which showed good performance for the LATE algorithm are pretty special cases. Therefore, in a general problem setup, the LATE algorithm may not have a benefit over normal Hadoop operation. There is even a chance that normal Hadoop execution performs better. The current completion time estimation is based on a static progress rate. As the paper already mentions, this may cause a problem because it assumes the progress rate is constant. We might update the progress rate dynamically; for example, we can calculate the progress rate based on the last n minutes of execution. The LATE algorithm also does not consider data locality for launching speculative map tasks. We may be able to improve this by adding a simple heuristic: 'use data locality for node ranking'.

* Quincy: Fair Scheduling for Distributed Computing Clusters, M. Isard et al, SOSP 2009

This paper proposes a scheduling method, Quincy, which aims at both locality and fairness. The problem of scheduling concurrent jobs on a cluster is an important issue; however, current queue-based scheduling has shown limitations. Quincy maps schedules to a graph structure, where edge weights and capacities encode the competing demands of data locality, fairness, and starvation-freedom, and a standard solver computes the optimal online schedule according to a global cost model. Quincy enables users to schedule jobs with fine-grained resource sharing. According to the experimental results, Quincy showed better fairness than the existing queue-based algorithm and also substantially improved locality.

This paper carried out a variety of experiments. It compared the new algorithm with existing ones; the new algorithm's variants are also well separated by feature (with/without fairness, with/without preemption).

Quincy assumes a homogeneous computing environment. As we can see in the other paper of this session (Improving MapReduce Performance in Heterogeneous Environments, M. Zaharia et al, OSDI 2008), the need for heterogeneous cloud computing is increasing. Quincy could be improved for heterogeneous environments. For example, we might add more parameters which decide the weights in the scheduling graph. Also, the concept of fairness should be changed. Currently, fairness is just defined as an even distribution of jobs. However, because each node has a different ability to run jobs, it is not fair to assign the same number of jobs to different nodes.
We need to consider the capacity of nodes as well as job size. In addition, we may be able to adopt some ideas from the OSDI '08 paper. This algorithm still does not solve the starvation problem. The paper mentions 'starvation-freedom' several times; however, it ultimately does not give a clear answer to it. According to the experimental results, the preemption-based algorithm showed good performance. However, there is a possibility that consecutive short jobs may repeatedly cancel and starve a long job forever. Here, job fairness was measured by 'how evenly jobs are distributed'. We could also consider other numerical measures like 'average waiting time'.

--
Best Regards,
Hyun Duk Kim
Ph.D. Candidate
Computer Science
University of Illinois at Urbana-Champaign
http://gildong2.com

From: Sun Yu [sunyu9910@gmail.com]
Sent: Thursday, February 18, 2010 11:51 AM
To: Gupta, Indranil
Subject: 525 review 02/18

Sun Yu

1. Quincy: Fair scheduling for distributed computing clusters.

Many current distributed systems consist of large clusters of inexpensive commodity computers. In such models for distributed computation, it's common that application data is stored on the computing nodes. This paper introduces a graph-based framework for scheduling concurrent jobs in such systems. The idea is to map the scheduling problem to a min-cost flow problem in a directed graph. An optimal scheduling policy can then be obtained by performing a global search for a min-cost flow solution. It's demonstrated with experimental data that, compared to existing queue-based scheduling algorithms, Quincy provides better fairness and improved data locality. Comments and questions: it seems that this min-cost flow based scheduling method is robust to graph topology changes, i.e., it's not sensitive to events like a computing node breaking down or a new node joining. Even if the topology is constantly changing, as will probably happen in mobile environments (say, a vehicle platoon), is this still true?

2. Improving MapReduce performance in heterogeneous environments.

Hadoop assumes homogeneous nodes within a cluster, which is not realistic in some cases, for example Amazon's Elastic Compute Cloud (EC2). It's shown in this paper that Hadoop can suffer severe performance degradation in such environments, mainly because of Hadoop's simple decision mechanism for starting the speculative execution of tasks: it often leads to excessive speculative executions. The authors designed a new scheduling algorithm called LATE, which, despite its unlucky name, is quite robust to heterogeneity and delivers much better performance. The authors point out four important issues in scheduling heterogeneous clusters: make decisions early (avoid using means and variances, which leads to an "eventual" -- and late -- decision); use finishing times rather than progress rates to sort the tasks to speculate; run speculative tasks on fast nodes; and cap them to prevent overuse. Comments and questions: the estimation of finishing time is tricky. As pointed out by the authors, the current method they adopt (based on the progress score) is not satisfactory. But it seems this doesn't have too much negative impact?
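(Editor's sketch: several reviews in this digest quote LATE's estimator, so here is a minimal illustration of the finishing-time heuristic using the formulas cited here: rate = ProgressScore/T and time left = (1 - ProgressScore)/rate. The class and function names are illustrative, not Hadoop's actual API.)

```python
# Hedged sketch of LATE's straggler ranking; names are illustrative.
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    progress_score: float  # in [0, 1], as reported by the framework
    running_time: float    # seconds since the task started

def estimated_time_left(task):
    rate = task.progress_score / task.running_time  # assumes a constant rate
    return (1.0 - task.progress_score) / rate if rate > 0 else float("inf")

def pick_speculation_candidate(tasks, slow_task_threshold):
    # Rank non-speculated tasks by estimated time left; speculate on the one
    # expected to finish farthest in the future, provided it is also slow.
    for task in sorted(tasks, key=estimated_time_left, reverse=True):
        if task.progress_score / task.running_time < slow_task_threshold:
            return task
    return None  # nothing slow enough to be worth speculating on
```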
From: Kurchi Subhra Hazra [hazra1@illinois.edu]
Sent: Thursday, February 18, 2010 11:23 AM
To: Gupta, Indranil
Subject: 525 review 02/18

CA-NFS: A Congestion-Aware Network File System
----------------------------------------------------------------------------

Summary
--------------
This paper presents a modified Network File System that the authors term Congestion-Aware Network File System (CA-NFS). A holistic congestion pricing mechanism is used that takes into account all critical resources across server and client nodes, and helps dynamically assess system load and schedule asynchronous client operations. It accelerates or defers asynchronous requests in order to improve the performance of applications. A server receives bids, that is, requests and their associated prices, from the clients, and a bid is accepted only if the associated price is higher than the one the server has set for the particular request. Servers increase or decrease the price of asynchronous reads and writes depending on their resource constraints. Due to this, the clients that are not resource constrained, and hence have a lower price for asynchronous reads or writes, will defer their requests until the server price drops below their own price. This reduces congestion at the server. On the other hand, when the server's resource utilization is low, the server sets operations at lower prices, and the clients perform reads and writes to the server more aggressively. Similarly, when a client's buffer is almost full and it needs to write back to the server, it increases the price of its write to increase its chances of winning over the server price. A congested server will not welcome read-ahead requests, and will increase the price for them. A client with empty buffers and low network utilization will favour read-aheads and increase the price for them. However, preference always goes to blocking calls over non-blocking calls like read-ahead. (A small sketch of this bidding decision follows this review.) The pricing mechanism used is dynamic and identifies bottlenecks across all clients and the server in a collective fashion. All resources are priced separately, and the cumulative cost of all resources is set to the highest-cost resource, since this represents the system bottleneck. The authors consider the pricing of five resources in the paper, for example server CPU, client and server network, and so on. They also demonstrate the modifications they make to NFS and the resulting performance improvements via extensive experimental results.

Pros
----------
-- The idea introduces dynamism into the existing NFS. Since we are dealing with a distributed system, treating the entire system, clients and server together, as one unit and trying to optimize resources all across the unit is apt and a step in the right direction.
-- The pricing model strives to increase server load when its resource utilization is low and vice versa, allowing maximum resource utilization in the system as a whole.
-- The idea tends to improve the performance that individual applications running at the clients see, rather than facilitate the working of the server.

Cons
-----------
-- As pointed out in the paper, fairness over time is not guaranteed.
-- It is very easy for a malicious node to drive up the server resource prices and cause all requests to be rejected. Thus, the system fails as a whole.
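(Editor's sketch: the bidding behaviour summarized in this review reduces to a simple client-side comparison. A hedged illustration, with names that are mine rather than the paper's:)

```python
# Illustrative sketch of the client's defer/accelerate decision for an
# asynchronous write; the "prices" come from the auction described above.

def schedule_async_write(client_price, server_price):
    if client_price >= server_price:
        # The client's memory pressure outbids the server: issue the write
        # now (possibly forcing a sync) while the server is cheap.
        return "send now"
    # The server is the more congested party: keep the dirty page buffered
    # locally and retry when the advertised server price drops.
    return "defer"
```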
Improving MapReduce Performance in Heterogeneous Environments
-------------------------------------------------------------------------------------------------

Summary
-------------
In this paper, the authors present a new scheduling algorithm for Hadoop called Longest Approximate Time to End (LATE), which takes into account the heterogeneity of the nodes in a cluster and hence scores above the Hadoop task scheduler. The Hadoop scheduler implements speculative execution of tasks that are running slower than a threshold by re-executing them at another idle node. To identify such slow tasks or nodes, called stragglers, the Hadoop scheduler uses a simple function to determine a task's progress rate. Although this improves performance, there are certain implicit assumptions that the scheduler makes, like nodes performing work at the same rate, tasks progressing at the same rate throughout their lifetime, and speculation not costing anything. These assumptions, however, break down in a heterogeneous environment. To do away with this weakness, LATE identifies stragglers by using an estimate of the time left to finish a task. Also, to launch speculative tasks only on fast nodes, such tasks are not launched on nodes that fall below the SlowNodeThreshold of total work performed in the whole system. Besides, in order to limit the number of speculative tasks being run, since these tasks are not inexpensive, a cap called the SpeculativeCap is used, which bounds the total number of speculative tasks running in the system. The authors show via extensive experimental results that the LATE scheduler reduces the execution time of Hadoop tasks.

Pros
---------
-- By using the longest time to finish as a metric to evaluate slow tasks, LATE identifies stragglers depending on how much they hurt job response time. Thus the response time of the system as a whole improves.
-- LATE limits the number of speculative tasks to be executed in the system, reducing contention for shared resources.
-- LATE also does away with unnecessary launches of speculative tasks when the system is running fast enough.
-- The paper generalizes an existing problem of scheduling tasks on clusters using the MapReduce paradigm, and makes it robust to real-world use by introducing heterogeneity in the model.

Cons
---------
-- The algorithm breaks when tasks are submitted at different times to the nodes, since younger tasks seem faster.
-- LATE does not take into account data locality when mapping speculative tasks to nodes. The authors assume that most maps are data-local and that network utilization during the map phase is low. However, they also admit that this could be addressed as an extension to their work.

Thanks,
Kurchi Subhra Hazra
Graduate Student
Department of Computer Science
University of Illinois at Urbana-Champaign

From: Virajith Jalaparti [jalapar1@illinois.edu]
Sent: Thursday, February 18, 2010 10:11 AM
To: Gupta, Indranil
Subject: 525 review 02/18

Review of “CA-NFS: A Congestion-Aware Network File System”:

The paper provides a holistic framework for a network file system which tries to make efficient use of resources by taking into account the consumption of various resources such as network bandwidth, server CPU utilization, server disk utilization, and client and server memory utilization.
The main idea behind it is that certain operations can be performed while system utilization is low, and can be postponed while the system is being used near its peak, since under such conditions these "non-critical" operations might end up decreasing the performance of the system as a whole (interfering with the more essential operations by creating, for example, congestion in the network). CA-NFS modifies NFS by deferring writes (which delays the client's consumption of resources) or accelerating them (which utilizes resources while system utilization is low), and by determining whether performing a read-ahead operation would bring benefits as opposed to decreasing the efficiency of the whole system. Each host in the network and the server maintain a price for performing asynchronous writes and read-aheads. These prices are updated regularly at the server and the clients, and are a measure of how much the system as a whole would be affected by allowing or deferring a write/read. As the resource consumption at a client/server increases, the price increases. Clients perform asynchronous writes if the price of deferring a write is more than the price at the server, and perform read-aheads if the price paid for not performing one is more than the price (at the server) of performing it. The paper then goes on to provide utilization measures for various resources and presents experimental results which show that CA-NFS can better utilize the system resources, giving up to a 20% speedup in execution times.

Pros:
- CA-NFS provides a new system architecture which takes into account the utilization of the system as a whole and ensures that the system's efficiency is not decreased by tasks that can be deferred rather than performed immediately.
- It is adaptive and uses a pricing model which allows for the comparison of the utilization of various system resources, providing a holistic approach.

Cons/comments:
- The paper uses bandwidth consumption as the only measure of network utilization. However, this is not entirely correct. Another important factor that needs to be considered here is the queuing delays experienced at the various intermediate nodes in the network, which increase the latency of requests. The topology of the network can also affect this (clients on the same switch would interfere with each other more).
- The results presented are just for some benchmarks, and it is not clear whether such gains would apply to real-world workloads, or how this system would scale with the number of hosts using the file system.
- Each client takes into account the prices, for the various operations, at itself and the server only. If a client were able to know the prices of the others in the network, it might be possible to perform better, since the client would know the status of the others and would get a view of the network as a whole. It is not very clear whether this would lead to appreciable gains or not.
- It is not clear how the values of the k_i's have to be selected (see the worked example after this review). The paper shows that these values can drastically affect the performance of the system.
- The basic system is based on the NFS architecture. It is not clear why that should be the basic building block. It is entirely possible that a completely different architecture for the file system could result in much better gains compared to one based on NFS.
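(Editor's sketch: a quick worked example, with made-up numbers, of the k_i-selection concern above. Assuming a price of the form k^u, the choice of k controls how sharply prices separate load levels:)

```python
# Worked example (my numbers, not the paper's) of price sensitivity to k,
# assuming a price of the form k**u.
for k in (4, 64, 4096):
    print(k, [round(k ** u, 1) for u in (0.25, 0.5, 0.75, 0.95)])
# k=4    -> [1.4, 2.0, 2.8, 3.7]: barely distinguishes idle from saturated.
# k=4096 -> [8.0, 64.0, 512.0, 2702.4]: prices explode well before saturation.
```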
Review of “Improving MapReduce Performance in Heterogeneous Environments”:

This paper introduces a simple heuristic which can be used for speculatively executing stragglers in Hadoop jobs running in heterogeneous environments. The main motivation behind the paper is that virtualized environments like Amazon EC2 can lead to heterogeneity between the various VMs utilized for the jobs because of the co-location of different VMs on the same physical machine, thus violating the basic assumption of Hadoop's scheduler that the various nodes in the network are essentially homogeneous. This paper provides a scheduling algorithm (LATE) which tries to reduce the response time of a Hadoop job by ensuring that (a) the task which is expected to finish farthest into the future is speculatively executed; this is estimated by assuming that the task executes at a constant rate, using that rate along with the amount of processing left; (b) only the fast nodes (those which work at a rate greater than the average minus a threshold) are selected to speculatively execute tasks; (c) the total number of speculatively executed tasks is limited to a certain fraction of the total number of tasks, to ensure that these tasks don't cause much interference to the tasks already running; and (d) the task that is speculated upon is running slower than a particular progress rate. The paper then goes on to present results of experiments performed using the Amazon EC2 service to show that heterogeneity is actually a factor in such environments. It further shows that LATE can help in reducing the response time of various jobs compared to the native Hadoop scheduler by as much as 93% (20-30% on average in most cases).

Pros:
- The paper provides a scheduling algorithm which helps decrease the response time of Hadoop jobs in heterogeneous environments by ensuring that the slower stragglers are speculatively executed on the faster nodes.
- It takes a more general approach than Hadoop's native scheduler by taking into account not only the percentage of a task completed by a particular node but also the rate at which a task progresses.
- It clearly shows that virtualized environments can be far from homogeneous, as opposed to the common premise that VMs allow all jobs to run in a similar environment (if there is a sufficiently large number of them).

Cons/Comments:
- LATE assumes that tasks progress at a uniform rate on the various VMs. This is not generally true. It might very well be the case that other users' VMs interfere with the job being monitored, resulting in a varying progress rate. The paper also shows an example in which this is inherent in the job being described. A prediction which is very different from reality can severely affect the performance of the scheduler compared to Hadoop's native scheduler.
- The experiments presented in the paper show that LATE is only effective if various VMs are co-located. Further, the experiments which compare the performance of LATE against that of Hadoop's native scheduler were performed with only a few VMs in the production environment of EC2. It is not clear whether the advantages of LATE would be significant in real production environments.
- Although the paper presents results measuring the effects of modifying the various parameters of LATE, it doesn't show how one should select these parameters for a different workload/job running in a different environment, in which case the optimum values presented in the paper may not hold.
- A different approach to this problem might be to have the scheduler exploit the history of the runs of various jobs on the cluster and use that to make more accurate predictions.

From: Fatemeh Saremi [samaneh.saremi@gmail.com]
Sent: Thursday, February 18, 2010 8:49 AM
To: Gupta, Indranil
Subject: 525 review 02/18

Paper 1: Improving MapReduce Performance in Heterogeneous Environments

This paper considers heterogeneity in MapReduce-based environments and proposes a solution for the problem of speculative execution in these systems. Considering simple thresholds for identifying slow tasks, as Hadoop does, can fail in unexpected ways when the degree of heterogeneity is more than expected. Distribution-based approaches, though more reasonable, need some time to measure means and variances to be able to recognize slow tasks. To this end, the authors of this paper propose to identify the tasks that hurt the response time the most, instead of finding slow tasks. They present a simple approach, LATE (Longest Approximate Time to End), that speculatively executes the task that is thought to finish farthest into the future. Estimating completion times, the LATE algorithm works as follows. If a node asks for a new task and there are fewer than speculative-cap speculative tasks running: (1) ignore the request if the node's total progress is below the slow-node-threshold; (2) rank currently running tasks that are not currently being speculated by estimated time left; (3) launch a copy of the highest-ranked task whose progress rate is below the slow-task-threshold. LATE has been evaluated in clusters on EC2 and a local virtualized testbed, and a sensitivity analysis of the parameters is presented as well.

Identifying slow (harmful) tasks early enough is appreciable (though not completely efficient or optimal), as it decreases end-to-end delay. However, some improvements are required regarding the tasks that do not fall into the speeding-up-over-time category and estimating their finish times. Design decisions like the different caps are suitable and avoid overloading the system. This work considers only short jobs and performs appropriately on them; however, the performance of this approach for longer jobs is questionable, and it would be valuable to have the effect of job length included in the paper as well. It would also be useful to present how far the results of this approach are from optimal scheduling (a static, accurately designed schedule). Though designing such a schedule is a lot of work, this kind of comparison gives a good sense of the gap in between.

Paper 2: CA-NFS

CA-NFS is a holistic framework for adaptively scheduling asynchronous requests in distributed file systems. The framework is based on a congestion pricing mechanism that incorporates all critical resources among all clients and servers, from client caches to server disk subsystems. It accelerates, defers, or cancels asynchronous requests in order to improve application-perceived performance. Servers encode their resource constraints by increasing or decreasing the price of asynchronous reads and writes in the system. As the server prices increase, the clients that are not resource constrained will defer asynchronous operations to a later time. This reduces the load due to non-critical operations when the system is congested. The online strategy of the framework is provably logarithmically competitive with the optimal offline algorithm in the maximum usage of each resource.
Though the competitive ratio is weak, it is unprecedented in the storage systems literature. The framework has been evaluated; the experimental results confirm that CA-NFS outperforms NFS, with more than 20% improvement in application-perceived performance for a wide variety of workloads. The online pricing mechanism is lightweight in the sense that it assumes no correlation between past and future requests and is only aware of the current system state. However, in reality there is such a correlation, and exploiting it might lead to better results. It is appreciable that the theoretical model does not make any explicit assumption about the type of resources managed. However, tuning the parameter k (the parameter representing the performance degradation experienced by the end user as the resource becomes congested) is extremely crucial, as the performance and efficiency of the framework depend on it. This parameter depends on many different factors: not only is it resource dependent, but it also depends on the relative ordering, from the congestion point of view, of the different resources in the system. The other point is that the system resources are not independent; the correlation among them should be taken into account. The proposed framework, though able to considerably manage congestion, is not fair over time and needs improvement in this direction, albeit the authors have counted it as future work.

From: Nathan Dautenhahn [dautenh1@illinois.edu]
Sent: Thursday, February 18, 2010 8:23 AM
To: Gupta, Indranil
Subject: 525 review 02/18

Paper Review: Quincy and Improving MapReduce Performance in Heterogeneous Environments
Nathan Dautenhahn
February 18, 2010

1 Quincy: Fair Scheduling for Distributed Computing Clusters

1.1 Summary and Overview

The primary focus of this research paper is to solve the problem of fair job scheduling on clusters. The authors focus on the issue of developing a fine-grained scheduling approach. This is in contrast to a coarse-grained approach, where a given set of nodes is allocated to a job and left until relinquished by that job. The job that has access to the set of nodes has exclusive use of them, and reduces overall performance if it does not fully utilize the allocated nodes. The authors have implemented and compared both a queue-based scheduler and a flow-based scheduler. In terms of the theoretical CS realm, they are comparing a greedy algorithm versus a flow algorithm. They include the use of locality in their scheduler, which is novel.

• I like their focus on providing a description of the baseline scheduler that they chose. It was not only a good way to prove that their evaluation means something, but also a great job of organizing the paper in a simple-to-read manner. I also have one more comment about this point. They implement the same scheduler that Hadoop uses; therefore, they have not only shown how good their new scheduler is, but also how they improve on Yahoo! by implementing the flow-based scheduler.
• They have built into the scheduler a finer-grained quantification of cost than in queue-based scheduling. What this really does is give them additional dimensions to use in comparing one scheduling decision versus another. I really like the use of locality and its potential to allow for greater utilization of the cluster.
• Really cool, innovative use of the cost-flow graph.

1.2 Comments and Criticisms

The following are my primary criticisms of the paper:

• This problem seems like an old one that may already be solved.
When it comes down to it, this problem is an algorithmic issue: a greedy algorithm versus a min-cost flow algorithm. In general we know that greedy algorithms usually do well, but don't perform well versus optimized versions of other types of algorithms. In addition to this algorithms argument, this seems like a problem that operating systems research has studied very thoroughly. Scheduling is not a new problem, and as such I question the full novelty of their approach.
• A primary concern I have is with the scalability of this approach. They have implemented it on only a 243-node cluster. What would happen to their performance if there were 1000 nodes, or 10000 nodes, as there may be in some of the larger data warehouse companies?

2 Improving MapReduce Performance in Heterogeneous Environments

2.1 Summary and Overview

This paper discusses the development of a new scheduler for MapReduce jobs in Hadoop. The authors identify that the main problem with Hadoop's scheduler is that it uses a simplistic approach and assumes that the nodes it is executing on are homogeneous in terms of their performance. As such, the paper proposes a new scheduling algorithm to improve the scheduling. It is important to note that the scheduling under concern is not the initial scheduling of new jobs, but rather the concept of speculative scheduling. Speculative scheduling is focused on the problem of fault tolerance in dealing with MapReduce jobs. The primary contribution of the authors is a new scheduling algorithm that uses the estimated latest finish time to schedule the tasks that would otherwise hurt the overall progress of the job.

2.2 Comments and Criticisms

I have several concerns with the way this paper presents information, and with whether or not this work is novel. They are as follows:

• The authors' motivation for the problem seemed to start very slowly. I was two pages in before I was surely convinced that this is a problem worth investing time into. Some of their statements up front are contradictory to those later. For example, they claim at the beginning of the paper that Hadoop performs extremely well when processing shorter jobs, but then in section 2 they mention that they are focusing only on optimizing the shorter jobs. If Yahoo already performs fastest here, and better than Google, then why would they want to go after it? Obviously it is important to attack these issues, but they didn't do the best job of showing this. It is important to note that they put a few really amazing points in section 2 for motivation. These should be moved to the front so the reader really gets the need for this work.
• Their algorithm uses a lot of experimentation to find the values for its specific parameters. It would be nice to see a somewhat more theoretical approach to identifying a better scheduling algorithm. It appears as though even this algorithm has much room for improvement.

Common Themes

Both of these papers are concerned with the development of better scheduling algorithms for MapReduce-type jobs. They seek to provide better task scheduling for smaller jobs. Quincy is more in depth and attempts much more than Zaharia et al.

*** Nathan Dautenhahn ***
((( dautenh1@illinois.edu )))

From: mukherj4@illinois.edu
Sent: Thursday, February 18, 2010 5:28 AM
To: Gupta, Indranil
Subject: 525 Review 02/18 Cloud Scheduling

Improving MapReduce Performance in Heterogeneous Environments, by Zaharia et al.:
In the paper the authors describe an algorithm, Longest Approximate Time to End (LATE), as a remedy to the performance degradation of the Hadoop scheduler on heterogeneous systems. LATE can improve Hadoop response time by a factor of 2 in clusters of 200 virtual machines on EC2. In this work, the authors demonstrate how speculative execution can be done robustly to improve performance by minimizing the job response time. Speculation may be beneficial in preventing the prolonged life of many concurrent jobs all suffering from straggler tasks. The authors describe how and why heterogeneity breaks the basic assumptions made by the Hadoop scheduler. On a heterogeneous system, the nodes do not perform at nearly the same rate, so tasks also do not progress at a constant rate over time. Different VMs are co-located in cloud computing environments.

LATE is based on 3 principles:
- Prioritizing tasks to speculate
- Selecting fast nodes to run on
- Capping speculative tasks to prevent thrashing

LATE estimates the progress rate of each task as ProgressScore/T, where T is the amount of time the task has been running for. The time to completion is estimated as (1 - ProgressScore)/ProgressRate. Speculative tasks are launched only on faster nodes. The LATE algorithm works as follows. If a node asks for a new task and there are fewer than SpeculativeCap speculative tasks running:
- Ignore the request if the node's total progress is below the SlowNodeThreshold.
- Rank currently running tasks that are not currently being speculated by estimated time left.
- Launch a copy of the highest-ranked task with a progress rate below the SlowTaskThreshold.

Pros: The paper identifies the reason why the heuristic-based speculative execution adopted by Hadoop does not perform well on heterogeneous machines, and the authors propose LATE as an elegant solution.

Cons: LATE does not take into account data locality for launching speculative map tasks, although this may or may not be detrimental. There are situations in which the progress rate heuristic of the LATE algorithm can backfire; detecting slow tasks is not sufficient for a good response time. There is no sound mathematical model to prove that the parameters chosen (like the SlowTaskThreshold, SpeculativeCap, and 20% progress rule) are optimal; they are also based on heuristics.

Comments: The shortcoming in the form of backfiring, as described in Section 4.2, is really helpful as it gives more insight and provides motivation for further improvement. In Table 2, how they calculate the standard deviation and what it signifies as the number of VMs increases is not properly explained.

CA-NFS: A Congestion-Aware Network File System, Batsakis et al.:

In this work, the authors describe a framework to schedule asynchronous requests adaptively in a distributed file system. They describe it as "holistic" as it manages all resources, including network bandwidth, server I/O, server CPU, and client and server memory utilization. Key aspects:
- An extension of the Network File System (NFS)
- Congestion pricing via online auctions for better resource management
- Identifying the differing importance of different requests: synchronous operations are generally more urgent than asynchronous ones. Also, the system performs asynchronous operations more aggressively when resources are idle.
- Taking special care when system resources approach critical capacity
- Priority scheduling, preference of blocking over non-blocking requests, and priority inheritance

CA-NFS advertises cost information to the clients, which implement the scheduling logic.
They override the FSSTAT protocol operation to include pricing information about the resources. The underlying pricing algorithm provides a log-k competitive solution to resource pricing; the paper describes a practical pricing function that is competitive with the offline algorithm in the maximum usage of each resource. They measure disk utilization by sampling the length of the device's dispatch queue at regular, small time intervals.

Pros: The paper describes asynchronous reads and writes quite explicitly; most of the remaining details are implementation issues. The pricing mechanism described is competitive, as it is based on an auction model. CA-NFS allows the system to exchange memory consumption between the clients and the server.

Cons: The model has been built on top of NFS, so low-level programming could violate the security of the system; virtualization would be needed to provide security, which would certainly incur overhead. Basing the pricing on the model of bandwidth sharing in circuit-sharing networks does not seem to me a convincing approach. The proposed framework does not address the fairness issue over time.

Comment: The authors point out the bottleneck of their approach on high-speed networks, which is helpful. But the graphs showing the results are not very impressive, as for writes there is not much improvement, as shown in Figure 4. Also, how they measure the cache hit rate in a distributed system is not very clear; it may depend on the data-flow traffic at that instant. The only appealing part is the pricing model, which shows a potential opportunity in terms of cost cutting.

From: Ashish Vulimiri [vulimir1@illinois.edu]
Sent: Thursday, February 18, 2010 3:30 AM
To: Gupta, Indranil
Subject: 525 review 02/18

Quincy: Fair Scheduling for Distributed Computing Clusters, M. Isard et al, SOSP 2009

Quincy is a scheduling framework, originally designed for the Dryad infrastructure, that allows a centralized scheduler to make decisions optimizing the tradeoff between constraints such as fairness, locality and performance. These tradeoffs are encoded as weights on a graph representing the available jobs and resources, and the optimal policy is computed by solving the min-cost flow problem on this graph. The authors also define a specific notion of fairness, and describe several example policies that attempt to optimize for this fairness metric. Through an evaluation of the scheduler in a homogeneous, hierarchically organized cluster, they show that optimizing for their fairness metric provides better performance than a simple greedy allotment when the network is the bottleneck resource. The main advantage of the graph-based scheme proposed here is that it seems general enough to encode other, more complicated policies that could use additional information available about the jobs. The only complaint here is that the "fairness" terminology can be somewhat misleading. Example: if predicted job running times are available, the shortest-job-first policy is known to be very good at minimizing average latency, even if it is unfair to the more resource-intensive jobs in some sense. However, the graph-based scheme is perfectly capable of encoding such a policy.

Comments/questions:

* A lot of assumptions about the execution environment: homogeneous cluster, network traffic is the bottleneck, centralized scheduling is feasible, etc.
Again, except for the assumption about centralized scheduling, these do not detract from the graph-based scheme itself, but they do raise questions about how indicative the experimental evaluation is of the performance of the few policies they explore in this paper.

* The scheduler uses a task-level granularity -- the two alternatives presented here are that each individual task either runs to completion or is killed before it can succeed. Why is preemption (i.e. process suspension as done by traditional operating systems) not an available option?

* In the queue-based schedulers: why is it that root tasks are allotted exclusive access to one computer?

Possible future work:

* Would it help if the scheduler were allowed to actively redistribute the data on the machines (say during the time periods when they are not fully utilized), instead of just trying to passively take advantage of the data distribution already present?

Improving MapReduce Performance in Heterogeneous Environments, M. Zaharia et al, OSDI 2008

The authors discuss the assumptions behind the Hadoop task scheduler and describe how these assumptions break in the presence of node heterogeneity. They then describe an alternate scheduler which provides better response times for short jobs in the presence of heterogeneity by improving the way speculative execution (launching duplicate executions of tasks that seem to be running too slowly) is conducted. The new scheduler, which uses a strategy the authors call Longest Approximate Time to End (LATE), improves upon the old one by factoring in node performance via the following two heuristics: i) the simple task progress metric used by the Hadoop scheduler (which is a roughly linear function of the fraction of input data already processed by the task) is improved by factoring in the speed of execution at the node (by simply dividing the old metric by the total time taken); ii) ensuring speculative task execution is only done on fast nodes, defined as those nodes that have high task progress rates. LATE also uses thresholds to determine how many speculative tasks to run and to determine which tasks are to be marked as slow.

+ Solves the heterogeneity problem without increasing the complexity of the Hadoop scheduler by a lot.
- No attempt to handle data locality.
- The focus is on optimizing response time for short jobs. But how does this scheduler affect, for example, average throughput in a system with a lot of long-running batch jobs?
- The assumption that similar tasks run at similar rates might not always be reasonable.
- There is also an assumption that all tasks proceed at constant rates, although the authors argue that this is reasonable for MapReduce jobs.
- Only simple programs (viz. sort, grep and wordcount) were evaluated. The above two issues might have arisen with more complex algorithms.

From: liangliang.cao@gmail.com on behalf of Liangliang Cao [cao4@illinois.edu]
Sent: Thursday, February 18, 2010 1:54 AM
To: Gupta, Indranil
Subject: 525 review 02/18

Reviews by Liangliang Cao, cao4@illinois.edu, Feb 16, 2010

Paper 1: Improving MapReduce Performance in Heterogeneous Environments

This paper observes that Hadoop's heuristic scheduler is not optimal in heterogeneous environments, which arise when a non-virtualized data center runs mixed generations of hardware, or when a virtualized one co-locates multiple VMs on the same physical host. This paper argues that in such a scenario, nodes will not perform at roughly the same speed, and tasks will not progress at a constant rate.
The classical Hadoop scheduler will have difficulty reliably estimating the priority of different jobs. To handle this problem, this paper proposes the LATE (Longest Approximate Time to End) algorithm, which estimates the time remaining for each task to finish, and speculatively executes the task with the longest estimated time to end. Thorough experimental results validate the success of the LATE algorithm.

Pros:
• The experiments are extensive and convincing. It is very interesting to see that LATE can improve Hadoop response time significantly (by a factor of 2).
• The LATE algorithm is relatively simple, and it is not difficult to modify classical Hadoop schedulers to use LATE.

Cons and potential improvement:
• LATE is not guaranteed to outperform the classical Hadoop scheduler. As shown in Section 4.2, LATE is not efficient when a task's progress rate changes and is not linearly related to actual progress. Although the authors argue that such situations do not frequently arise in typical MapReduce jobs, the non-rigorous nature of the LATE heuristic prevents its general use for scheduling distributed systems.
• In a heterogeneous environment, it might be helpful to exploit performance history as a prior for scheduling tasks. Mining the performance logs might be helpful in designing a better speculative task monitor.

Paper 2: CA-NFS: A Congestion-Aware Network File System

This paper employs the ideas of pricing and online auctions to coordinate the use of system resources in a network file system. This idea leads to a holistic solution for scheduling I/O, CPU, memory, and network bandwidth. The "pricing" mechanism is clear and impressive, and the experimental results are very convincing.

Pros:
• The "pricing" and "auction" mechanisms are very intuitive for balancing system resources and task priority.
• CA-NFS follows NFS's consistency properties, so the user need not worry about the details of the file system and can focus on development.

Cons and potential improvement:
• There might be multiple ways to model the pricing or auction functions. Moreover, as a holistic algorithm, CA-NFS should discuss more about whether some resources are correlated; for example, server memory and CPU utilization are often correlated with each other.
• In a heterogeneous environment, a global pricing/auction model might not be optimal. More constraints should be taken into account.

From: Chia-Chi Lin [lin36@illinois.edu]
Sent: Thursday, February 18, 2010 12:17 AM
To: Gupta, Indranil
Subject: 525 review 02/18

Improving MapReduce Performance in Heterogeneous Environments

The paper introduces a novel MapReduce scheduler, Longest Approximate Time to End (LATE), to improve the performance of MapReduce in heterogeneous environments. The main goal of the paper is to devise a way to identify stragglers quickly and accurately in heterogeneous environments. In addition, the scheduler should run speculative copies (backup tasks) of these stragglers on faster machines and under reasonable resource usage. The problem with the current scheduler in the Hadoop MapReduce implementation is that it assumes all machines have the same power, that tasks progress at a constant rate, and that there is no cost to launching speculative copies. However, these assumptions fail to hold in heterogeneous environments, e.g., Amazon EC2. LATE deals with differences in machine power and in the progress speed of stages with a heuristic that estimates each task's time to completion (see the sketch below). Moreover, LATE takes the costs of launching speculative copies into account by constraining the number of active speculative copies.
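The heuristic itself is compact. Below is a minimal Python sketch following the formulas reported in these reviews (ProgressRate = ProgressScore/T, time left = (1 - ProgressScore)/ProgressRate); the task bookkeeping around it is an illustrative assumption, not the authors' actual Hadoop code.

    from collections import namedtuple

    # Minimal model of a running task: how complete it is (0..1) and how
    # long it has been running (elapsed > 0). Field names are illustrative.
    Task = namedtuple("Task", ["task_id", "progress_score", "elapsed"])

    def progress_rate(task):
        return task.progress_score / task.elapsed

    def time_left(task):
        if task.progress_score == 0:
            return float("inf")  # no progress yet: worst possible straggler
        return (1.0 - task.progress_score) / progress_rate(task)

    def pick_task_to_speculate(running_tasks, slow_task_threshold):
        # Consider only tasks progressing below SlowTaskThreshold, then pick
        # the one with the longest approximate time to end.
        candidates = [t for t in running_tasks
                      if progress_rate(t) < slow_task_threshold]
        return max(candidates, key=time_left, default=None)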
The work is evaluated with an implementation on Amazon EC2 as well as a local cloud, and the results show that LATE improves Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.

Pros:
- Simple heuristics that seem easy to implement (not evaluated, though) and can achieve good performance improvements.
- Evaluated with different workloads and in a production cloud.
- The scheduler is not too sensitive to parameter settings.

Cons:
- Although the authors mention (Section 2) that the scheduler could have an ill effect (wasted resources) on long jobs, they never provide an evaluation of that.
- The heuristic still assumes tasks progress at a constant rate (Section 4), which might not be true in some cases.
- Instead of capping the number of speculative copies, maybe it is more reasonable to cap the amount of resources used, or even to adjust the limit dynamically.
- Does not take locality into account.

CA-NFS: A Congestion-Aware Network File System

CA-NFS improves the execution times of NFS by 20% by scheduling asynchronous operations (asynchronous reads/writes) according to system resource utilization. The paper addresses one problem of NFS: NFS fails to prioritize asynchronous operations differently in different situations; e.g., when a client's memory consumption is high, asynchronous writes should have higher priority than when the client's memory consumption is low. CA-NFS deals with the problem by adopting an algorithm devised by Awerbuch et al. and coming up with a pricing function. Essentially, clients and servers provide pricing and bidding information, and only when a client has an appropriate bid does an asynchronous operation take place. In addition, several resource utilizations are considered: server CPU, client and server network, server disk, client and server memory, and client read-ahead effectiveness. Some of the resource utilizations require sophisticated methodology to assess. The system is implemented in Linux NFS and is assessed with different benchmarks. CA-NFS results in performance improvements in both micro- and macro-benchmarks.

Pros:
- Considers many resource utilizations and achieves good performance.
- Has a real implementation and is evaluated with different micro- and macro-benchmarks.
- The system is not too sensitive to parameter settings. (Not as good as LATE.)

Cons:
- Doesn't take fairness into account, and hence there might be starvation.
- For network utilization, it seems that the system only takes local information into account but not global information.
- The authors mention (Section 2) that the system might incur overheads, but they never evaluate that.
- Some resources are related, but the paper didn't exploit the possibilities, e.g., compressing the data to be transferred when CPU utilization is low and network utilization is high.
- Servers don't actively announce resource availability in the current implementation.

Chia-Chi Lin

From: ntkach2@illinois.edu
Sent: Wednesday, February 17, 2010 8:27 PM
To: Gupta, Indranil
Subject: 525 review 02/18

Nadia Tkach – ntkach2
CS 525 – paper review 3
Cloud Scheduling

Paper 1: CA-NFS: A Congestion-Aware Network File System

The authors suggest a new and improved framework (Congestion-Aware Network File System – CA-NFS), based upon the Network File System (NFS), which introduces a holistic approach to task scheduling with performance management via assessment and analysis of system load.
Based on the availability of all resources in the distributed network, the system manages these resources as a whole and schedules asynchronous operations accordingly. Essentially, the CA-NFS framework balances resource usage by deferring or accelerating write operations and scheduling read-ahead (asynchronous reads). For example, depending on whether server utilization is low or the client wants to maintain a high number of cache hits, the system can accelerate a write or defer it. CA-NFS promises to improve the performance of the distributed network as a whole, and the evaluation results showed a 20% improvement in execution times.

Pros:
• Improvement in performance (20%) and decrease in congestion
• No overhead associated with the calculation of server utilization, even at 1-second intervals

Cons:
• Performance is measured in execution and completion times, while there can be other factors to consider when evaluating performance, such as the accuracy of the information at any given time (consider the case when data gets accidentally overwritten by a different process, or is accessed by a different thread, while the most recent write request has not yet been committed from the cache)
• The client requests server utilization stats/pricing every 10 reads/writes or every 10 seconds. What overhead is associated with these constant ping operations? Consider different time intervals and the tradeoffs associated with them

Paper 2: Improving MapReduce Performance in Heterogeneous Environments

The paper describes an improved scheduling algorithm for speculative execution on heterogeneous systems. As the authors pinpoint, the open-source Hadoop implementation makes a critical assumption that the environment it runs on is homogeneous. While the straggler condition can be easily handled by Hadoop with speculative execution (the condition when one of the nodes performs poorly and part of its job is transferred to or shared with another node to speed up the process), this mechanism can have a significant negative effect on performance when applied to heterogeneous systems. The authors of the paper offer a so-called Longest Approximate Time to End (LATE) scheduling algorithm which is robust to heterogeneity and can improve performance by up to two times (in a cluster of 200 virtual machines). The algorithm works by evaluating the task progress statistics at each node, comparing them to the average value for the entire system, then prioritizing the slowest tasks and identifying the fastest nodes on which they can be re-launched. The main principle is that the tasks expected to finish latest in the future are of higher priority and are to be considered first.

Pros:
• Improved performance of the system and execution times
• Different methods can be used for estimating the time left for a task

Cons:
• Where are the performance statistics evaluated and analyzed, at a central remote server or at each individual node? What is the overhead associated with such calculations, and how often are they performed?

From: Shehla Saleem [shehla.saleem@gmail.com]
Sent: Wednesday, February 17, 2010 5:54 PM
To: Gupta, Indranil
Subject: 525 review 02/18

CA-NFS: A Congestion-Aware Network File System

This work is motivated by the authors' observation that the kinds of goals that current distributed file systems try to attain and optimize are not appropriate and sometimes even incorrect. This effect is especially pronounced when the behavior of network file servers is studied under congestion.
The servers try to optimize I/O throughput and latency and are oblivious to application-perceived performance. The authors try to come up with a design that prioritizes different file system operations. They propose CA-NFS as an extension to their earlier work. CA-NFS employs a form of online auction to manage resources such that application-perceived performance is optimized. They identify how some operations are urgent while others can be deferred. Servers then reflect their resource constraints in an increase or decrease of the prices of file system operations. Clients then adapt their own behavior by deferring their low-priority operations during high-price periods. The authors come up with a reasonable pricing function that provides feedback to the user about resource utilization. They also have a constraint to avoid deadlocks in case both the server and the client are overloaded. Their pricing model effectively incorporates a set of resources including server CPU, disk usage, memory and network utilization, and can therefore adapt to dynamic workloads and system configurations. I am not very clear about how fair this system would be to different clients. There should be a mechanism to make sure that no single client, malicious or otherwise, is able to throttle the server resources and starve other clients. Also, once a control knob is available for resource allocation, maybe the system can offer some more sophisticated options and differentiated services, e.g. guaranteeing a particular quality of service to clients that behave well.

Improving MapReduce Performance in Heterogeneous Environments

The paper presents LATE, the Longest Approximate Time to End, a new scheduling algorithm for speculative execution that is highly robust to heterogeneity. The authors motivate the work by identifying the shortcomings of MapReduce and Hadoop. MapReduce is simple, scalable and handles failures well. Hadoop is an open-source implementation of MapReduce. However, it suffers severely in heterogeneous environments where stragglers are not easily identifiable, and also in virtualized computing environments where there may be contention between virtual machines residing on the same physical machine. In such cases, Hadoop, which schedules tasks based simply on their progress, may degrade performance even below that achieved without speculative execution. LATE, on the other hand, speculates on the tasks that are expected to finish farthest into the future, because this can provide the highest gain in terms of reducing response time. LATE also minimizes excessive and wasteful speculation, and identifies stragglers correctly and quickly. It also avoids assigning speculative tasks to stragglers, thus improving performance, and works on the principle of responding early rather than waiting and allowing wasteful use of resources. Furthermore, LATE bounds the maximum number of concurrent speculative tasks to minimize waste of system resources and reduce overhead. To validate the claims, the paper shows results of experiments done on Amazon's Elastic Compute Cloud, where response times improved by a factor of 2. The fact that the results come from real-world experiments does significantly back the practical applicability of LATE. On the downside, however, the performance improvement offered by LATE depends on the accuracy of predicting the finish time. How accurate will this prediction always be? Can there be some concrete way of doing this prediction rather than a heuristic?
It seems that LATE might not work well if a task's underlying computations are such that it proceeds at a variable rate. What would happen if a task starts with fast progress but then has to proceed slowly? Task progress rates could also be highly variable, and may even fluctuate between high and low, in a virtual machine environment where tasks share physical resources with others. The simple finish-time prediction would be incorrect in this case, and the possibility of a dynamic predictor (sketched below) might be looked into. Data locality should also be given some weight when making decisions about speculative execution. Overall, the paper is simple and well written, and the current results do advocate the design's strength.
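As one possible shape for the dynamic predictor suggested above, here is a minimal sketch that smooths the observed progress rate with an exponentially weighted moving average instead of LATE's single linear extrapolation; the class, its field names, and the smoothing constant are illustrative assumptions, not anything from the paper.

    class DynamicProgressPredictor:
        # Estimate time left from an EWMA of recent progress rates, so a
        # task that slows down mid-run is re-evaluated instead of being
        # judged by its early (fast) progress alone.
        def __init__(self, alpha=0.3):            # alpha: illustrative choice
            self.alpha = alpha
            self.rate = None                      # smoothed progress rate
            self.last_score = 0.0
            self.last_time = 0.0

        def update(self, progress_score, now):
            dt = now - self.last_time
            if dt > 0:
                instant = (progress_score - self.last_score) / dt
                self.rate = instant if self.rate is None else (
                    self.alpha * instant + (1 - self.alpha) * self.rate)
            self.last_score, self.last_time = progress_score, now

        def time_left(self):
            if not self.rate or self.rate <= 0:
                return float("inf")               # stalled: worst straggler
            return (1.0 - self.last_score) / self.rate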