From: Igor Svecs
Sent: Thursday, April 21, 2011 12:49 PM
To: Gupta, Indranil
Subject: 525 review 04/21

Igors Svecs
CS 525
2011-04-21

Optimizing Cost and Performance in Online Service Provider Networks

SUMMARY
This paper introduces a traffic engineering strategy called Entact that optimizes the cost and performance (latency) of delivering online services (such as search, maps, and messaging) from an online service provider's network to its users. It is assumed that an OSP has multiple datacenters connected by leased channels, and that it has access to the BGP routing information advertised by its ISPs. Because BGP does not include performance information, the authors first estimate it using route injection: installing a route for ip/32 within a prefix of interest, which takes effect because of longest-prefix matching. Second, the authors combine multiple parameters – cost, performance, traffic, and link capacity – into a "joint optimization" problem to compute the trade-off relationship and find an optimal balance.

COMMENTS
The decision to exclude inter-datacenter traffic from cost estimation is debatable; although OSPs do have fixed-capacity leased lines, the capacity of these lines has to be chosen based on the traffic. That is, inter-DC traffic still imposes an indirect cost on the OSP. Route injection may cause performance problems in the core routers. Another interesting direction the authors could have pursued is source and multipath routing. While their study takes the existing environment and policies into account, it would be interesting to see the performance gains under different assumptions. Using latency (and not bandwidth) as the only performance metric reduces the number of online services to which this study is applicable (e.g., cloud storage is excluded).

Reducing costs of spot instances via checkpointing in the Amazon Elastic Compute Cloud

SUMMARY
This paper examines a new pricing strategy recently introduced by Amazon – spot instances – and develops checkpointing strategies to minimize cost and maximize the reliability of computation. Amazon's spot instances allow users to place a bid on each instance; while the user's bid is higher than the current spot price, the resource is allocated to the user, and when the spot price rises above the bid, the instance and all in-progress computation are terminated. Amazon charges by the hour, and the user does not need to pay for partial computation in the hour in which the instance is forcefully terminated. Whenever a process takes a checkpoint, it is able to resume from the saved state in case it is terminated. The authors examine several variations of checkpointing policies, such as hour-boundary checkpointing (periodic checkpoints are taken each hour), rising edge-driven checkpointing (event-driven checkpointing, where events can be available resources, bids from other users, and the number of bidders), and checkpointing with adaptive decision (deciding at each candidate point whether to take or skip a checkpoint). Simulations compare the different policies to an optimal policy that assumes future knowledge. Depending on instance type, the adaptive hour-boundary policy generally performs best.

COMMENTS
One clear criticism of this work is that it applies to one specific pricing strategy of one cloud provider; however, if this strategy becomes more popular with other providers, the paper may have more impact. It would also be interesting to see an experimental evaluation rather than a simulation.
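To make the route-injection technique from the first summary concrete, here is a minimal longest-prefix-match sketch in Python. The prefixes and next-hop labels are made up, and a real router's forwarding table is of course not a Python dictionary; the sketch only illustrates why a /32 host route redirects probe traffic while the rest of the prefix keeps its current path.

# Sketch of why injecting an ip/32 route redirects only probe traffic.
# All prefixes and next-hop labels are hypothetical.
from ipaddress import ip_address, ip_network

routing_table = {
    ip_network("0.0.0.0/0"): "default-ISP-link",
    ip_network("203.0.113.0/24"): "current-best-path",   # prefix of interest
}

def lookup(dst, table):
    # Longest-prefix match: pick the most specific matching route.
    dst = ip_address(dst)
    matches = [net for net in table if dst in net]
    return table[max(matches, key=lambda net: net.prefixlen)]

# Inject a host route for one live IP inside the prefix, so an alternative
# path can be measured without disturbing the rest of the prefix's traffic.
routing_table[ip_network("203.0.113.7/32")] = "alternative-path-under-test"

print(lookup("203.0.113.7", routing_table))   # alternative-path-under-test
print(lookup("203.0.113.8", routing_table))   # current-best-path (unaffected)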
From: Curtis Wang
Sent: Thursday, April 21, 2011 12:26 PM
To: Gupta, Indranil
Subject: 525 review 04/21

Curtis Wang (wang505)
4/21/11

Cloud Pricing

Market-Oriented Grids
This paper provides a summary of the technology of market-oriented grids. Traditional grid computing has encountered problems: specifically, there is no incentive for users to act honestly in a non-greedy fashion, and there is no compensation for resource providers for their contributions. In addition, the metrics that the grids are run by (maximizing throughput, minimizing waiting time, etc.) do not capture the quality of service for the users. Market-oriented grid computing seeks to remedy these issues by adding market-inspired resource management techniques. There are three actors in a market-oriented grid: users, service providers, and brokers. Brokers act as middlemen and manage access to resources that come from heterogeneous service providers, abstracting this burden from the user. Currency in the market can be either virtual or real, each of which has its own advantages and disadvantages. Virtual currency provides lower barriers to entry and is lower-risk, but it lacks liquidity and flexibility. Real currency is universally recognized and easy to transfer/exchange, but it suffers from the same problems as in the real world—the distribution of wealth can affect accessibility to the grid. The systems that are discussed implement a wide range of techniques to "run" the market—fixed-price or variable-price models, combinatorial auctions, aggregate utility, etc. Finally, the authors describe the idea of "Catallaxy"—a "free market" of self-interested participants where the economy is self-organizing.

Pros
- Provides a very in-depth discussion of the motivation and necessity of market-oriented grids and the state-of-the-art technology
- Discusses future directions with the idea of "Catallaxy"

Cons
- As a survey paper, it gets quite detailed and assumes a fair amount of theoretical knowledge from the reader, such as auctions.

Reducing Costs of Spot Instances in Amazon EC2
This paper discusses several checkpointing methods to cope with Amazon EC2's spot instances. Spot instances are unused computing resources that Amazon users can bid on. The prices of these spot instances vary dynamically based on supply and demand. If a user's bid exceeds the current spot instance price, the user has access to the requested resources; otherwise, access is terminated immediately without any notice. Since Amazon releases all of its previous spot pricing data, checkpointing seems like a natural way to reduce the total computation cost while improving the reliability of spot instances. The authors propose three different checkpointing schemes and combinations of these approaches: checkpointing on the hour, checkpointing on a rising edge (when the spot price rises, suggesting more of the resources are in use), and checkpointing with adaptive decision (deciding whether to skip or take each checkpoint). Based on their simulation, hour-boundary checkpointing can significantly reduce costs when there are failures, while rising edge-driven checkpointing performs better in certain scenarios.
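To illustrate the adaptive decision step, here is a minimal sketch; the expected-recovery-time estimates below are my own simplified stand-ins, not the paper's actual formulas. The idea is just to take a checkpoint only when the expected cost of skipping it exceeds the cost of taking it.

# Hedged sketch of an adaptive checkpoint decision. The two estimators are
# deliberately simplistic placeholders for the paper's expected-recovery-time
# expressions.

def expected_recovery_skip(p_fail, unsaved_work_minutes):
    # If we skip and an out-of-bid failure occurs, all work since the last
    # checkpoint has to be redone.
    return p_fail * unsaved_work_minutes

def expected_recovery_take(ckpt_overhead_minutes):
    # If we take the checkpoint, we always pay its overhead, but a later
    # failure loses (approximately) nothing done before this point.
    return ckpt_overhead_minutes

def should_checkpoint(p_fail, unsaved_work_minutes, ckpt_overhead_minutes):
    # Take the checkpoint only if the expected cost of skipping is higher.
    h_skip = expected_recovery_skip(p_fail, unsaved_work_minutes)
    h_take = expected_recovery_take(ckpt_overhead_minutes)
    return h_skip > h_take

# Example: 20% chance of an out-of-bid event before the next decision point,
# 50 minutes of unsaved work, 5 minutes to write a checkpoint: 10 > 5 -> take.
print(should_checkpoint(0.2, 50, 5))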
Pros
- A simple checkpointing scheme can save money and reduce computation time
- Interesting problem with lots of potential to save users money

Cons
- Simulations are based on 500-minute tasks—checkpointing seems like it would be much more effective at reducing the costs of longer tasks

Hour-boundary checkpointing seems to be a very natural idea considering the nature and pricing of the spot instances. Also, since it seems to be more "risk-averse" than the other two methods, it is not surprising that it performs better most of the time (since it seems very difficult to accurately predict the price shifts) while being outperformed occasionally by other methods.

From: anjalis2006 SpamElide on behalf of Anjali Sridhar
Sent: Thursday, April 21, 2011 12:26 PM
To: Gupta, Indranil
Subject: 525 review - 4/21

Optimizing Cost and Performance in Online Service Provider Networks, Z. Zhang et al, NSDI 2010

The paper presents a traffic engineering technique to optimize delivering data to end users from Online Service Providers (OSPs). The design presented is called Entact, which uses route injection to determine the quality of alternate paths through different ISPs. The cost of sending data through these links is calculated based on the volume, and the best path is eventually chosen. The goal is to deliver a curve with all possible operating points of cost and performance so that the OSP can make the best choice. The cost is calculated for external links connecting the data centers (DCs) to neighboring ISPs. The performance of links is calculated based on the RTT of probes sent along different paths. The number of alternate paths being probed is equal to the number of DCs times the number of ISPs connected to the OSP. The strategy employed by Entact also makes room for the OSP's own choices. For example, the OSP might have an additional parameter that it needs to optimize at the cost of latency. In that case the original curve cannot be used anymore, and a utility curve of the strategy is drawn. The paper aims to optimize both performance and cost so that the OSP is able to choose the best point on the curve at which it wants to operate.

OSPs might shift huge volumes of traffic from one ISP to another. How does the QoS negotiation take place if that happens? If the OSPs are paying the ISPs extra to carry their bandwidth, changing the volume of traffic might not be feasible. The sudden injection of traffic might cause congestion or throttling of bandwidth for other users. The impact of this kind of adaptation with respect to ISP policies needs to be looked at. If paths are constantly changing, caching of media cannot be carried out as before at intermediate points.

Reducing costs of spot instances via checkpointing in the Amazon Elastic Compute Cloud, S. Yi et al, IEEE Intl. Conf. on Cloud Computing, 2010

The paper addresses the tradeoff between cost and reliability when owning an EC2 spot instance. Spot instances at Amazon were introduced in order to use the spare capacity of their data centers. Users obtain an instance if their bid is greater than the current price of the instance; however, if the price rises above the bid, the instance is revoked. Hence there is a tradeoff between cost and reliability. The three basic types of checkpointing presented in this paper are:
1) Hour-boundary checkpointing – checkpoints are taken at regular hourly intervals.
2) Rising edge-driven checkpointing – checkpoints are taken at the point at which users begin to bid and raise the price of the spot instance.
3) Adaptive checkpointing – the user is given a choice to skip checkpointing at a particular point and risk failure.

The paper compares the different checkpointing schemes in order to reduce cost and task completion times. The results show that the hour-boundary checkpointing scheme reduces costs when there are failures. The paper presents simulation traces, but usage statistics from users would be helpful in determining whether this works in a highly dynamic price environment. It would also be helpful to find out how prices behave with respect to the number of users in the system. This would help in predicting a sudden rise in the price of the instance and hence the necessity of a checkpoint.

From: kevin larson
Sent: Thursday, April 21, 2011 12:20 PM
To: Gupta, Indranil
Subject: 525 review 04/21

Yi et al present the issue of effectively utilizing spot instances in Amazon EC2. These instances offer reduced costs at the expense of reliability. Additionally, the prices can fluctuate, sometimes to levels greater than a user is willing to pay, which further reduces the effective reliability of the system. The authors propose the use of checkpointing, the process of saving a recoverable state, in order to improve costs on EC2. In their implementation, they explored several different policies for checkpointing their system and evaluated them on a variety of instances. The two basic policies they use are edge-based and interval-based, with edge-based checkpointing (or deciding to checkpoint) at any price fluctuation, and interval-based checkpointing (or decisions) at fixed time intervals. They used a variety of combinations of these methods and compared them to executions with no checkpointing, and with optimal checkpointing, which involves checkpointing immediately before (and only before) failures. It was determined that all but the most frequent interval-based checkpointing performed well. The hybrid methods also performed comparably. Edge-based checkpointing varied wildly and in some cases performed worse than no checkpointing. The authors thoroughly evaluated the combinations of their policies and did a good job of characterizing and explaining their results. They mentioned some additional tweaks (delayed termination) to further improve performance on EC2. The conclusion and related work sections were interesting and motivated interesting future work. It would have been interesting to see how the duration of the jobs run on EC2 affects checkpointing, exploring at what durations checkpointing provides benefits over non-checkpointing implementations, and whether the optimal policies differ with respect to different durations. There was also room for additional characterization of the instances they ran on and explanation of how these affected the results.

Broberg et al explored various economic distributed scheduling models. The authors first explain various situations for both the users and the providers of distributed systems and motivate them in terms of costs and fairness. They explore a number of systems, which implement a variety of mechanisms to distribute and schedule tasks. All the listed systems had made design decisions around a pricing scheme, selecting either fixed or variable pricing, using different criteria, and sometimes imposing penalties in order to improve fairness. The systems made different decisions on how to use these pricings, implementing things such as auctions.
They evaluated the usage of both virtual and real currencies and explored the implications of both in a variety of environments. The paper does a good job of summarizing and explaining the trade-offs among the explored implementations. The authors did a good job of addressing both profit and fairness in their evaluation and applying them to public and private systems. It would have been interesting to see some sort of evaluation of the systems, even with naive or black-boxed approaches. While, obviously, not all of the systems may have been available, it would be interesting to see the effects of the design decisions explained throughout the paper.

From: Simon Krueger
Sent: Thursday, April 21, 2011 12:12 PM
To: Gupta, Indranil
Subject: 525 review 04/21

Optimizing Cost and Performance in Online Service Provider Networks, Z. Zhang et al, NSDI 2010

The core idea of this paper is an algorithm for delivering traffic in an online service provider network while keeping cost low and performance high. They measure the performance of paths by injecting a route, which is used to measure the path without affecting the current traffic on it. Finally, they allow the operator of the online service provider network to choose a strategy from a trade-off curve based on cost, performance, traffic volume, and link capacity. They achieve a 40% reduction in traffic cost in a Microsoft network. They allow the operator to choose strategies based on cost, performance, traffic volume, and link capacity, whereas before only cost and performance were considered and everything else was a fixed constant. They use trace-based simulations. Why did they use Google Maps in their paper instead of Microsoft's maps?

Reducing costs of spot instances via checkpointing in the Amazon Elastic Compute Cloud, S. Yi et al, IEEE Intl. Conf. on Cloud Computing, 2010

The core idea of this paper is to compare checkpointing schemes based on cost and job completion times for EC2's spot instances. Spot instances allow Amazon or other providers to sell their unused computational resources, which provides a trade-off between cost and reliability. They try to find the checkpointing scheme that minimizes cost and maximizes reliability. They use real Amazon spot price traces.

Market-oriented Grids and Utility Computing: The State-of-the-art and Future Directions, J. Broberg et al, Journ. Grid Computing

The core idea is to consider the problems with current resource management techniques in shared grids and other distributed systems (e.g., users do not judiciously and/or appropriately request the right amount of resources, and owners of shared resources have little incentive to contribute more resources). The paper looks at current resource management strategies and at future directions. The authors look at 'market-inspired' resource management schemes to allow fairness among users. They allow the user to specify job constraints such as deadline, importance, and satisfaction. They describe 'Catallaxy', a paradigm in which a Grid market changes to a decentralized free market. Does this only apply to grid computing? How does this work relate to clouds and peer-to-peer networks?

Simon Krueger

From: Michael Ford
Sent: Thursday, April 21, 2011 12:10 PM
To: Gupta, Indranil
Subject: 525 review 04/21

Optimizing Cost and Performance in Online Service Provider Networks

The authors present a technique for probing alternative path performance using route injection.
Combining this with cost, traffic, and link capacity information, they build an operations trade-off curve for an online service provider (OSP) interacting with multiple ISPs. While there has been considerable work in traffic engineering, the authors argue that the scale of OSPs and their geographical distribution change some of the assumptions of previous work. The cost functions are typically price * volume, where volume is often measured as the 95th percentile. Performance is measured as round-trip time. This RTT is then weighted to take traffic and link capacity into account. By plotting all ISPs on a cost vs. wRTT graph, one can find a set of ISPs such that all others have greater cost or wRTT. This gives an optimal strategy curve along which the OSP can select its desired strategy. The authors use data collected over a one-week period. Since they are using stale data and do not have access to the actual system, they simulate the effects of their traffic engineering solution. The authors then point out that the assumptions they must make are common in the field. They are able to reduce the cost by 40% while not increasing the wRTT.

Reducing Costs of Spot Instances via Checkpointing in the Amazon Elastic Compute Cloud

Amazon introduced spot instances in their Elastic Compute Cloud (EC2), which offer cheap computation with reduced reliability and variable pricing. The authors introduce an adaptive checkpointing scheme to counteract the decreased reliability and reduce job completion times, and therefore cost. The bidding scheme used in spot instances allows users to bid on resources for an hour, but if the price (which changes with supply and demand) rises above the user's bid, they are kicked off immediately. The authors compute the theoretical expected execution time for a process without checkpointing and for one with checkpointing. The 12 checkpointing techniques evaluated are combinations of three basic schemes, namely hour-boundary, rising edge-driven, and a probabilistically driven adaptive decision. The authors use Amazon's spot instance price history to drive a simulator which computes the total job completion time, total cost, and other metrics. Edge-driven checkpointing performs poorly, as do the probabilistic approaches. The most effective techniques use hour boundaries. These schemes come close to the optimal, or savant, checkpointer.

From: trowerm SpamElide on behalf of Matt Trower
Sent: Thursday, April 21, 2011 12:08 PM
To: indy SpamElide
Subject: 525 Review 04/21

Optimizing Costs in OSPs

The focus of this paper is reducing costs and improving performance in Online Service Provider networks using traffic engineering. As usage of different links goes up, the cost of using those links increases. Furthermore, a company such as MSN will purchase its own backhaul network to transport data; using any of these 'internal' links is cost-free. The authors propose a system which dynamically measures the performance and cost of links and redirects data based on these costs. This is possible in OSP networks because traffic can be served from any of the OSP's data centers. The authors validated their proposed solution on MSN Live Messenger traces. The proposed solution seems to put a good deal of stress on the network with the traffic switching between 'external' links. The authors addressed this in the concluding parts of the paper, but it was only addressed with respect to one OSP performing the switch.
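To make the optimal strategy curve discussed above concrete, here is a small sketch with made-up (cost, wRTT) numbers: it keeps only the operating points that no other point beats on both metrics, which is the curve from which the operator then picks a trade-off.

# Sketch of extracting an optimal strategy curve from candidate TE strategies.
# Each strategy is a (cost, wRTT in ms) pair; all numbers are hypothetical.

def pareto_frontier(strategies):
    # Keep strategies that no other strategy beats on both cost and wRTT.
    frontier = []
    for s in strategies:
        dominated = any(
            o != s and o[0] <= s[0] and o[1] <= s[1] and (o[0] < s[0] or o[1] < s[1])
            for o in strategies
        )
        if not dominated:
            frontier.append(s)
    return sorted(frontier)

candidates = [(100, 30), (60, 45), (60, 40), (20, 80), (0, 95)]
print(pareto_frontier(candidates))
# -> [(0, 95), (20, 80), (60, 40), (100, 30)]  (the operating points left to choose from)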
Reducing Costs of Spot Instances

Amazon recently started offering AWS users the opportunity to buy leftover computing power from its infrastructure using a market-based cost scheme. These instances are offered at a cheaper cost than the dedicated option, but they allow Amazon to terminate your instance at any time based on its needs. The authors of this paper propose several checkpointing strategies to leverage this cheaper infrastructure while maintaining reliability in computation. If a job is long-running, then getting kicked off without checkpointing will cause it to start over and thus negate any benefits from the cheaper service. The authors built a model to describe the costs and latency of completing jobs based on a probability density function (pdf) of getting kicked off at any time. This probability function is taken from traces made available by Amazon. The authors propose three times at which one might want to checkpoint: at each hour boundary, when an increase in the market price is seen, and based on the history of the market. The decision of which strategy is best depends heavily on the type of instance being used and thus on the pdf of getting evicted for that size of instance. This paper presents a nice idea for middleware which would provide cheaper computation and could be used at many datacenters. It seems this work would be beneficial to Amazon as well, which could offer free checkpointing to provide a more reliable "cheap" solution.

From: Jason Croft
Sent: Thursday, April 21, 2011 12:00 PM
To: Gupta, Indranil
Subject: 525 review 04/21

Reducing Costs of Spot Instances via Checkpointing in the Amazon Elastic Compute Cloud

This paper proposes the use of dynamic checkpointing strategies to reduce the cost of using Amazon's spot instances while improving reliability. Spot instances are spare capacity in Amazon's data centers that have a dynamic pricing model based on bids by users; if a user's bid price is above the spot instance price, resources are allocated to the user. Amazon provides 42 different spot instance types, differing by computing and memory capacity, OS type, and geographical location. When a user's bid price is less than or equal to the spot price, Amazon stops service without any notice, which the authors call an out-of-bid event or failure. Users are charged for the last partial hour if the user terminates the instance, but not if Amazon terminates it due to an out-of-bid event. The authors also define the term rising edge, which is the time at which the spot price for a given instance type increases. The authors define several different checkpointing schemes: hour-boundary, rising edge-driven, adaptive decision, and combinations of these. Hour-boundary checkpointing takes checkpoints at each hour; an hour is used because it is the lowest granularity of spot instance pricing. Rising edge-driven checkpointing takes checkpoints at rising edges, but will fail if no rising edge occurs during an availability period. Checkpointing with adaptive decision takes a checkpoint at an hour boundary if the expected recovery time without hour-boundary checkpointing (Hskip) is higher than the expected recovery time with hour-boundary checkpointing (Htake). Simulation results vary depending on instance type. For example, edge-driven checkpointing policies perform poorly on some Linux instances compared to hour-boundary. On some Windows instance types, edge-driven checkpointing performs better.
Compared to an optimal checkpointing scheme, these policies give results between 10% and 45% higher. However, the authors do not explain why their schemes vary so much between different instances, or why the price-time product varies greatly even without any checkpointing scheme. Knowing why this happens could be helpful in reducing cost, since in addition to checkpointing schemes, their design could take this into account and dynamically rent different instance types based on this price-time product.

Optimizing Cost and Performance in Online Service Provider Networks

In this paper, the authors describe Entact, a traffic engineering scheme for online service providers that can optimize a combination of cost and performance, unlike previous work, which optimizes only one metric. There are two challenges: performance is unknown for paths that can be used but are not currently used, since BGP does not include performance information, and a real-time traffic engineering strategy must consider cost, performance, traffic volume, and link capacity to find an optimization matching the OSP's goals. To measure real-time performance and cost over an alternative path, the authors develop a route injection technique to select alternative paths from the routing table and install these alternative routes in the routers. That is, it selects an IP address and installs the route to this IP as IP/32 in the routers within the OSP. Since the routers use a longest-prefix-match algorithm, only packets to that IP will be affected, and the remaining traffic will use the default route. RTT can then be measured on this route to gather performance data. A Live IP collector is also used to probe IP addresses from NetFlow data and collect IP addresses using random probing. This performance measurement technique requires core and edge routers to be configured to keep injected routes to themselves, to prevent route propagation inside or outside the OSP network. However, the authors do not explain how this is done; is there an attribute to set this, or does their design require modifying BGP? To evaluate their design, the authors run simulations on MSN, which has 2k external links from 11 data centers. To reduce overhead, only the top 30k prefixes are considered, since these account for 90% of the total traffic volume. Traffic volume is recorded at 5-minute intervals, as well as weighted RTT. Active probing and route injection are only done at 20-minute intervals; however, the authors find the difference from a finer time scale is small (between 1-5 ms in 78-92% of cases). Simulations show cost can be almost completely eliminated by using free peering links, but wRTT increases by 38 ms. Optimizing only on performance reduces default wRTT by 3 ms but increases default cost by 66%. Entact, however, can reduce cost by 40% without inflating wRTT.

From: Long Kai
Sent: Thursday, April 21, 2011 11:52 AM
To: Gupta, Indranil
Subject: CS 525 Reviews 04/21

CS 525 Reviews 04/21
Long Kai (longkai1)

Optimizing Cost and Performance in Online Service Provider Networks

Summary: This paper studies the problem of optimizing the cost and performance of carrying traffic for an OSP network. This problem is unique in that an OSP has the flexibility to source traffic from different data centers around the globe and has hundreds of connections to ISPs, many of which carry traffic to only parts of the Internet. Two key considerations for OSPs are the cost and the performance of delivering traffic to their users.
But traditional methods optimize only one factor, not both. Their method, called Entact, is based on two key techniques. First, it uses a novel route-injection mechanism to measure the performance of alternative paths that are not currently being used, without disturbing current traffic. To measure an unused path for a prefix, Entact selects an IP address ip within the prefix and installs a route for ip/32 to routers in the OSP network. Because of the longest-prefix-match rule, packets destined to ip will follow the installed route while the rest of the traffic will continue to use the current route. Second, based on the cost, performance, traffic, and link capacity information, it computes the optimal cost vs. performance curve for the OSP. Each point on the curve represents a potential operating point for the OSP such that no other operating point offers a simultaneous improvement in cost and performance. The OSP can then pick the operating point that represents the desired trade-off. This paper addresses the joint optimization of cost and performance. However, when it comes to multiple locations, the number of high-volume prefixes may exceed the number they consider, which can cause the result to be inaccurate.

Reducing Costs of Spot Instances via Checkpointing in the Amazon Elastic Compute Cloud

Summary: In 2009, Amazon released spot instances, which sell the spare capacity of their data centers. The dynamic pricing model is based on bids by users. If a user's bid price is above the current spot instance price, their resource request is allocated. If at any time the current price is above the bid price, the request is terminated. Amazon offers lower resource costs in exchange for reduced reliability. Mechanisms and tools that deal with the cost-reliability trade-offs under this scheme become valuable for users seeking to lessen their costs while maintaining high reliability. This paper studies how checkpointing can be used to minimize the cost and volatility of resource provisioning. Using real price traces of Amazon's spot instances, the paper studies various dynamic checkpointing strategies that can adapt to the current instance. Their results show that the dynamic checkpointing strategies significantly reduce the monetary cost while improving reliability. One possible improvement to their work would be to make predictions about the price traces; this could be done with many existing machine learning techniques.

-- Larry

From: w SpamElide on behalf of Will Dietz
Sent: Thursday, April 21, 2011 10:44 AM
To: Gupta, Indranil
Subject: CS525 Review 4/21

Will Dietz
4-21-2011
cs525

"Reducing Costs of Spot Instances via Checkpointing in the Amazon Elastic Compute Cloud"

So we all know Amazon runs this big cloud thing called EC^2. They have some rate they charge and you get the resources you request. In order to properly provide these resources, they need to have them already available. You can imagine that they don't have some poor sysadmin running around fulfilling requests by plugging in machines on demand :). Alright, so they play a game of trying to always have enough machines available to satisfy the kinds of requests they expect, and barring absurd requests (which I'm sure they handle specially) they probably do a decent job of it. This is all well and good, but when the instances users are running don't fully utilize the hardware, what should they do with this extra power?
In 2009 Amazon started offering "spot instances", where essentially they reuse the existing EC2 infrastructure with a flexible pricing model that allows all users to simply bid on the right to use this extra computing power. You issue a bid, and they run your instance when the current price is at or below yours, and instantly terminate instances whose bids are below the current price. Essentially they just say "everyone pay us the most you're willing to pay and we'll get the most we can out of this". For interruptible workloads this can be a pretty sweet deal, trading reliability for lowered costs (which often happen during off-peak hours, etc.).

What the authors of this paper set out to do was figure out "Okay, so how do we most effectively use this service?". Computation isn't very useful if you don't do something with the result (send it out, etc.), and the manner in which Amazon kills your instance (and doesn't charge for the last partial hour) isn't very friendly to saving your work. Their solution is checkpointing periodically, so that if your instance is killed you can keep chugging along. Checkpointing policies are complicated because you're trading off compute power ('wasted' time checkpointing) against your ability to recover if you were in fact terminated. They implemented a few ideas and took the cross product of them to produce 12 checkpointing algorithms. In short, the basic ideas were checkpointing every hour (since they only charge per hour, and don't charge for the last partial hour), a rising-edge reactive algorithm, and adaptive versions of each of these (that used thresholds to determine whether they should checkpoint). They also had adaptive periodic checkpointing algorithms on 10-minute and 30-minute intervals. They evaluated these on 10 days of data, using all 42 of the Amazon instance types (including geographic location). In short, they showed that many of the algorithms reliably did significantly better than no checkpointing (which is excellent!), but all were ~10-30% worse than the 'optimal' (which magically knew when to checkpoint and did so right before the instance was terminated).

Also, I'm a fan of them posting their data, simulator (and source!), as well as additional results, at a public URL on a host like SourceForge that's been up for a long time and probably will be for a good amount of time into the future. :) I wonder what would happen if data over, say, the past *year* (vs. the 10 days they used) was similarly analyzed? They claim Amazon makes the price information public, and since their simulator is publicly available... just a thought. Also, how many people have similar ideas (only potentially more complicated?) that are *unpublished*? I wonder if Amazon minds people gaming this pricing model? I'd imagine not, but I'm curious regardless. Anyway, very interesting work and something to consider if I'm using EC2's spot instances.

"Optimizing Cost and Performance in Online Service Provider Networks"

This paper tackles the problem of optimizing routing between datacenters of Online Service Providers (OSPs). Lots of companies that provide online services have distributed datacenters that need to talk to one another as well as to the end users. How can this all be done as cheaply as possible while maintaining the desired performance attributes?
Reducing cost is self-motivated, but ensuring performance matters because the user experience (and the corresponding monetary gain of the company providing the service) is highly related to the latency and other performance characteristics of the services offered. This work presents "Entact", which uses route injection to dynamically measure the performance of routes (which need to be evaluated relatively frequently to respond to changing network conditions and the like) by redirecting small network segments (/32 routes), and uses this to build the cost vs. performance curve for the OSP, so the OSP itself can pick its own performance vs. cost trade-off. By measuring this, they find that they can reduce costs by up to 40% for the OSP without increasing their performance metric (wRTT) significantly.

~Will Dietz

From: muntasir.raihan SpamElide on behalf of muntasir raihan rahman
Sent: Thursday, April 21, 2011 10:43 AM
To: Gupta, Indranil
Subject: 525 review 04/21

Optimizing Cost and Performance in Online Service Provider Networks

Summary: The paper proposes an optimization framework for jointly optimizing the cost and performance of traffic delivery from an online service provider (OSP) to its users. The system, Entact, is based on two novel ideas. First, it uses a passive route injection mechanism to evaluate the opportunity cost of alternate paths that are not being used. Second, it computes a Pareto-optimal cost-performance curve, so that the OSP can choose any point on the curve based on its current traffic conditions. From a graph-theoretic point of view, the traffic engineering problem can be formulated as finding the optimal data center (dc) and external link (l) for each destination prefix. For each value of the performance metric (wRTT), the authors formulate a cost-minimizing TE optimization problem. This results in a Pareto-optimal curve of cost vs. performance. However, since the optimization problem is formulated as a linear program, the traffic for each prefix can be split across multiple paths. The authors propose randomized heuristics for rounding the fractional solutions to integer solutions. The authors implemented a prototype of Entact by converting the theoretical ideas into practical solutions. Experimental results for a live OSP validate the claims of the system.

Pros:
(1) First automated traffic engineering solution for OSP networks.
(2) It proposes a novel route injection technique to passively monitor unused paths.
(3) The system jointly optimizes cost and performance for OSPs.
(4) The authors have developed a prototype of the system. The experimental results cover the trade-offs of all the system parameters, including the selection of prefixes, data center locations, and alternate routes, the effect of the TE window, and the computational overhead of the optimization problem.

Cons:
(1) The TE optimization formulation doesn't seem to use any traffic matrix data, which is prevalent in ISP traffic engineering.
(2) Is wRTT the best performance metric? Why not use traditional traffic engineering metrics like MLU?

Market-oriented Grids and Utility Computing: The State-of-the-art and Future Directions

Summary: This paper provides a comprehensive and up-to-date survey of market-based resource management techniques for grids and utility computing systems. The large scale of these systems and the presence of multiple self-interested parties naturally lead to market-based solutions. Usually the participants include clients, service providers, and brokers.
The authors note that in systems with high variability in resource demand, simple fixed-pricing schemes cannot be optimal. This leads to variable pricing schemes like auctions. Also, in commercial systems, the overall system-wide objective of social welfare maximization might not generate maximum revenue for resource providers. Recent technologies like para-virtualization enable fine-grained resource allocation and seamless VM migration, which can actually facilitate market-based resource allocation. The survey discusses and compares the pricing, scheduling, and management mechanisms of several systems including Nimrod-G, Bellagio, Mirage, Tycoon, Libra, and so on. The authors mention that variable pricing is much more effective compared to fixed, rigid pricing models. The dilemma of virtual vs. real grid currency is also touched upon. The authors especially highlight catallaxy-based market architectures and discuss the benefits of this new decentralized utility-maximizing framework for utility computing. The paper concludes with some pointers toward future research directions.

Discussion Points:
(1) Market-based solutions are obviously very attractive for managing large-scale systems. However, a lot of market-based systems are never commercialized. The authors could have pointed out the reasons for this. An example from operating systems is market-based lottery scheduling, which was never adopted in real OSs due to its high complexity. Market-based solutions might also turn out to be too complex compared to many ad-hoc simple policies. Companies like Google, Facebook, and Amazon seem to prefer quick and dirty solutions rather than elegant solutions like catallaxy.
(2) Recently there is a lot of focus on crowd-sourcing solutions like Amazon Mechanical Turk. Can this also be classified as a market-based solution?
(3) The paper does not discuss an important aspect, namely how to specify utility functions. Unlike human society, grids are engineered systems, so the utility function models will be different compared to those for human beings.

-- Muntasir

From: Nipun Sehrawat
Sent: Thursday, April 21, 2011 10:11 AM
To: Gupta, Indranil
Subject: 525 review 04/21

Market-oriented Grids and Utility Computing: The State-of-the-art and Future Directions

This survey analyzes the market-driven resource management techniques that can be applied in grid/utility computing scenarios. It has been argued that traditional resource management techniques (allocation, scheduling, admission, etc.) lack fairness and equity in access to resources in a multi-user grid computing scenario. Most such techniques focus on metrics such as minimizing turn-around time and maximizing system throughput, but this fails to capture the requirements specified by the users, such as the diminishing utility of a task over time. Thus, grids are now being considered as a 'market-place' where, as in real markets, the three main entities are:

1. Buyers: the users of computing resources, seeking to fulfil their requirements.
2. Sellers: the service providers, providing computing resources.
3. Brokers: the middlemen between buyers and sellers, performing jobs such as negotiation and integration of resources from multiple sellers.

The following key points were discussed in the survey:
- Prices for resources should vary with the cumulative demand for resources; this will help in addressing transient peak loads, as users will avoid scheduling their jobs when they have to pay more. Variable pricing thus acts as a negative feedback.
- Centralized resource management is not suitable for the scale at which huge grids operate.
- Forecasting resource requirements is difficult but would result in better resource management.
- A currency system is needed to make users and service providers do an honest evaluation of their needs and resources, respectively.

There is some discussion about bidding/auctions for resources:
- Bidding should adapt to supply-demand conditions: future bids are modified (increased/decreased) according to current market conditions.
- Users should define the utility (diminishing with execution time) of their jobs, that is, the amount gained by the service provider when a job finishes.
- Bidding and the subsequent allocation of resources should have a bounded running time and communication overhead.

Admission control (of user tasks) is an important aspect; a key factor is the time for which a job can be delayed before it becomes unprofitable to accept it (because of the diminishing utility defined by the user). This is known as the slack metric, and jobs with a high slack metric are preferred as they can be rescheduled to accommodate a more profitable job. Another interesting topic discussed in this survey is that of Catallaxy markets, which make realistic assumptions: participants are self-interested, they lack global knowledge, and the market is just a communication medium.

Pros:
- Provides an extensive survey of market-oriented approaches being used in grid/utility computing

Cons:
- Some approaches are not explained well enough, but maybe that is because of space limitations.

Discussion:
- How will these approaches map to clouds? Clouds are generally structured around a single service provider, which also behaves as the sole broker, and there are multiple users. Does bidding make sense in clouds? Or does it make more sense to go with a pre-defined resource usage cost, as done currently?

---

Optimizing Cost and Performance in Online Service Provider Networks

This paper proposes a method, called Entact, that enables Online Service Providers (OSPs) to choose an optimal balance between the cost and performance of delivering data to their users. The key motivation behind this work is that large OSPs have multiple geographically distributed data centers (DCs) which are connected to multiple ISPs, resulting in a large number of possible ways in which data can be routed to a user. Entact provides an automatic Traffic Engineering (TE) solution for OSPs. OSPs have to strike a balance between the cost of delivering data to their users (the amount paid to ISPs) and performance (the most suitable metric is the latency seen by users), because a drop in performance may lead to a permanent loss of customers. An OSP's profit thus depends on a complex trade-off between cost and performance. Entact's main focus is on minimizing the cost of handling outgoing traffic (OSP to users), as incoming traffic typically consists of small user requests (e.g., a request for a file) and outgoing traffic is the large data (the file itself). When a user sends a request, an OSP has the option of redirecting the request to any of its data centers, using a load balancing scheme. Also, the selected data center has the option of replying to the user over any of the ISPs it is connected to. Thus, the number of "paths" available for a request-response pair is the product of the number of data centers and the number of ISPs involved.
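As a toy illustration of this choice space (all data centers, ISP links, and cost/RTT numbers below are hypothetical), the following sketch enumerates the candidate (data center, ISP link) paths for one prefix and flags any candidate that beats the current path on both cost and latency:

# Sketch of the per-prefix path choice space: one candidate per (DC, ISP link)
# pair. The names and the per-path (cost, RTT) measurements are made up.
from itertools import product

data_centers = ["dc-us-west", "dc-eu"]
isp_links    = ["isp-A", "isp-B", "isp-C"]

# Hypothetical measurements for one destination prefix: (cost per Mbps, RTT in ms).
measurements = {
    ("dc-us-west", "isp-A"): (5.0, 80),
    ("dc-us-west", "isp-B"): (3.0, 70),   # cheaper AND faster than the current path
    ("dc-us-west", "isp-C"): (2.0, 120),
    ("dc-eu",      "isp-A"): (4.0, 95),
    ("dc-eu",      "isp-B"): (6.0, 60),
    ("dc-eu",      "isp-C"): (2.5, 110),
}

current_path = ("dc-us-west", "isp-A")
cur_cost, cur_rtt = measurements[current_path]

candidates = list(product(data_centers, isp_links))   # |DC| x |ISP links| paths
better = [p for p in candidates
          if measurements[p][0] < cur_cost and measurements[p][1] < cur_rtt]

print(len(candidates), "candidate paths;", "better options:", better)
# -> 6 candidate paths; better options: [('dc-us-west', 'isp-B')]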
Entact uses a route injection technique to measure the RTT performance of available alternate paths that are not currently in use; traffic is shifted from the currently used path to an alternate path if there exists an alternate path that is better in both latency (performance) and cost. Entact performs such measurements to calculate an optimal strategy curve, a curve joining points representing optimal strategies. A strategy is termed optimal if no other strategy has both a lower cost and a higher performance (lower RTT) than it. An OSP can then pick a point on this curve as its operating strategy. Entact measures the performance of alternate paths to a given IP prefix in the background, without actually routing responses to users through these alternate paths, as that may degrade the performance seen by some users. It has a component called the Live IP collector, which discovers IP addresses in a given prefix. A route injector then installs custom paths for a prefix, and probes are sent over these paths for performance measurements.

Pros:
- Provides a range of cost-performance options to OSPs, which an OSP can use to strike a good balance between cost and performance.
- Alternate route inspection is done 'offline' without affecting users.
- Automates the TE problem, which is generally handled by manual tuning.

Cons:
- Latency is the only performance metric considered. For websites such as YouTube, available bandwidth, etc., also become important.
- Measurements were carried out only on prefixes that had enough responsive IPs - can this affect the generality of the results?

From: david.m.lundgren SpamElide on behalf of David Lundgren
Sent: Thursday, April 21, 2011 3:25 AM
To: Gupta, Indranil
Subject: 525 review 04/21

Reducing Costs of Spot Instances via Checkpointing in the Amazon Elastic Compute Cloud

Yi et al. analyze existing checkpointing protocols for ensuring high reliability while decreasing costs within the Amazon Elastic Compute Cloud's (EC2) volatile spot instance pricing framework. Spot instances refer to EC2 instances that run only while the client's bid price is at or above EC2's current spot price. The spot price fluctuates based on customer demand and the availability of excess computational resources. The dynamics of the spot price are such that a client session may be terminated (i.e., fail) when the spot price rises above her bid price. Such volatility requires regular and (semi-)frequent checkpointing to prevent wasting computation. The authors propose two novel checkpointing algorithms, rising edge-driven checkpointing and adaptive decision checkpointing, to reduce the cost of wasted computation after failure. These schemes are compared to traditional hour-boundary checkpointing and various combinations of all the aforementioned algorithms. Simulations are run using historical spot price traces. It is observed that hour-boundary checkpointing has the greatest cost reduction, although this is dependent on instance type (e.g., Windows vs. *nix).

Pros, Cons, Comments, and Questions:
- The problem is interesting and non-intuitive, and the solutions are immediately practical. The authors are also very forthcoming with their empirical data and source code.
- Future work could focus on explaining why edge-driven checkpointing is only effective on Windows traces. In general, explaining why different checkpointing schemes performed distinctly on dissimilar machines is an interesting point that is somewhat glossed over in the paper.
- Another salient dimension that could have been examined is variable bid price. The paper was focused on minimizing time and/or cost given a constant bid. This artificially constrains the problem, and lower-cost or more efficient solutions could perhaps be found with dynamic bidding in addition to some of the checkpointing methods discussed.

-------------------------------------------------------------------------

Market-oriented Grids and Utility Computing: The state-of-the-art and Future Directions

Broberg et al. review the existing state of resource management for grid computing from the viewpoint of ``market-oriented'' methodologies that treat users more fairly and compensate grid providers more properly than traditional resource allocation, admission control, and scheduling protocols. Traditional techniques (defined as those that seek to maximize throughput and minimize mean wait time) subject providers to transient supply failures (often around major conference deadlines) and cannot incentivize users to be honest about their job deadlines or resource requirements, nor can these traditional protocols motivate providers to prioritize high-utility jobs (there is no concept of ``utility'' in such traditional systems). The authors review a variety of systems that incorporate utility (often in the form of some sort of ``currency'') into resource and scheduling markets where users, providers, and brokers interact and negotiate prices for jobs in a self-serving manner. The definition of utility in a myriad of state-of-the-art systems is examined according to price setting (e.g., fixed-price models vs. variable ones) and negotiation/acceptance criteria (e.g., threshold-based or minimum cost). Resource management is then reviewed in these systems. Resources are typically allocated via auction, and jobs are scheduled according to a multitude of mechanisms (e.g., proportional share, which schedules tasks proportionally to the utility of the task). Admission of new tasks in these systems is also touched upon. Economy management is raised as a multifaceted issue. Topics arising under management include: currency management and dispersal; virtual vs. real currency; proper pricing; and trust models and accounting. The authors close with a discussion of the state of affairs of the grid marketplace and what can be done to modify existing techniques in the face of the increasing relevance of ``Catallactics'' (decentralized markets where individuals seek to maximize their own utility with limited local knowledge and use the market as a communication medium).

From: wzhou10 SpamElide
Sent: Thursday, April 21, 2011 1:54 AM
To: Gupta, Indranil
Subject: CS525 Review 04/21

Review of "Reducing Costs of Spot Instances via Checkpointing in the Amazon Elastic Compute Cloud"

Core idea
This paper studies various checkpointing mechanisms that deal with the cost-reliability trade-offs of spot instances in Amazon EC2. These instances can be revoked abruptly due to price and demand fluctuations, so checkpoints are made in order to continue unfinished tasks. The two basic types of checkpointing are hour-boundary and rising edge-driven checkpointing, each of which has an adaptive version. Using the optimal case, which allows perfect prediction, as a baseline, this paper studies the execution price and task completion time of all four of these checkpointing types and the 8 combinations of any two of them.
Simulations based on historical traces show that proper checkpointing can significantly reduce both price and completion time.

Pros
1. The idea comes from practical needs; it is simple but useful. The paper gives a thorough analysis of a variety of checkpointing mechanisms.
2. The simulation is based on real traces, which is convincing.
3. It shows that finer-grained checkpointing doesn't necessarily have better performance, which is a bit surprising at first glance, but well explained.
4. The argument that a rising edge indicates less available resources is fair, so it's more useful in some sense, I guess.

Cons
1. All parameters are set as constant and identical across tasks and spot instances, which is not so reasonable.
2. There's no argument on the representativeness of the spot instances they picked.
3. The idea isn't fundamentally novel; it is more like a measurement survey. The future work sounds more interesting compared with the body of the paper.

Review of "Optimizing Cost and Performance in Online Service Provider Networks"

Core idea
This paper formulates the traffic engineering problem in online service provider networks. It also presents the design and implementation of Entact. There are two key techniques in Entact. The first is a route-injection-based measurement technique, which allows the performance of multiple alternative paths to be measured simultaneously. The second is the online TE optimization framework, which gives the OSP operator a choice of the desired cost-performance tradeoff. Extensive evaluation is performed on MSN, one of the largest OSPs on the Internet today. Results show that Entact can reduce the cost compared with the default TE strategy by roughly 40% without sacrificing average latency or incurring high overhead.

Pros
1. The route injector is a smart idea. In this way, we can measure performance on multiple paths simultaneously, with only minimal disturbance to current traffic.
2. Entact uses TCP ACK/RST instead of ICMP probes to measure RTT, because this is closer to the latency experienced by applications; ICMP packets may have lower priority while being forwarded.

Cons
1. The cost considered in the paper doesn't cover the energy cost of data centers. It might be nicer if this were part of the optimization objective.
2. Can OSPs do anything to reduce the latency of incoming user requests, besides that of the outgoing traffic?
3. The computational complexity is a little high, though the paper argues it's not.
4. They probe the same number of alternative paths to each prefix, no matter how many IPs are in that prefix. This may not be a fair way to implement Entact.

Thanks,
~Wenxuan

From: Agarwal, Rachit on behalf of Rachit Agarwal
Sent: Wednesday, April 20, 2011 11:33 PM
To: Gupta, Indranil
Subject: 525 review 04/21

Rachit Agarwal

----

Optimizing cost and performance in online service provider networks

The authors argue that the richness of OSP networks makes the traffic engineering problem significantly different and more complex compared to that in ISPs and multi-homed stub networks. They provide an optimization framework that jointly optimizes the performance perceived by the users and the cost incurred by the OSP in transferring the user data. This optimization framework gives various Pareto-optimal points on the curve and allows the network operator to choose a desired operating point.

Interesting paper. Almost no catches. The idea of measuring the performance of alternative paths is cute. Some minor comments:

1. It was not entirely clear why the authors chose to ignore the cost of internal DC links.
While this cost may be independent of the traffic volume, it may have a recurring non-zero value and may change the Pareto-optimal curve.
2. On the performance metric, the authors argue that many interesting applications that care about latency do care about RTT (there are other applications that care about bandwidth, etc., but that seems like a significantly harder problem). However, there are applications like e-mail and instant messaging that care about latency but not RTT values -- in fact, they care about the latency between the two end-users, which can be significantly different from the RTT for either one of the two users.
3. It wasn't clear how the authors managed to estimate the performance of alternative paths in a scalable fashion. In practice, there may be a large number of paths to any given prefix, and updating the routing table entries will take a significant amount of time. In order to avoid updating, they could store the routing table entries, but that may require a large amount of memory.

----

Reducing costs of spot instances via checkpointing in the amazon elastic compute cloud

The spot instances in Amazon EC2 provide users with a new market to compete for resources. On the one hand, users would like to reduce the cost of resources; on the other hand, a significantly low bid may leave the user's resource terminated by the cloud provider. Hence, there seems to be a natural dynamic market. This paper tries to understand the dynamics of the system based on historical observations and presents a couple of checkpointing schemes which may help a user observe the performance vs. cost trade-off. Some of my thoughts/comments on the paper:

1. While such an analysis definitely improves our understanding of the market, I am wondering how a user may decide on a particular checkpointing scheme in practice. There are two problems: first, the user may not be aware of the checkpointing cost of the running program (assumed to be known in the analysis) a priori. This may depend on the underlying OS, machine capacity and speed, and other such criteria. Second, even if the user can estimate the checkpointing cost, other users running VMs on the same machine may distort the checkpointing cost (in terms of time) dynamically.
2. One (rather surprising) observation is that checkpointing schemes that have a low total price typically also have low completion times. Is there a nice sweet spot? Or is it that, given an OS, machine capacity, and program complexity, I can decide on a checkpointing scheme ahead of time?
3. I found it interesting, in general, that the performance of checkpointing schemes depends on the underlying OS (Section III). What aspects of an OS actually lead to this?

----
--
Siebel Center for Computer Science, University of Illinois at Urbana-Champaign, Web: http://agarwa16.wikidot.com
----
Sing as if No one is Listening, Dance as if No one is Watching, Dream as if You'll Live Forever, Live as if You'll Die Today !!

From: mark overholt
Sent: Wednesday, April 20, 2011 11:16 PM
To: Gupta, Indranil
Subject: 525 review 04/21

Mark Overholt
CS525 Review
04/21/2011

Reducing Costs of Spot Instances via Checkpointing in the Amazon Elastic Compute Cloud

Summary:
This paper discusses the use of Amazon's EC2 Spot Instances and their costs. It then proposes different checkpointing policies to help curb the cost and improve the reliability of those instances. In late 2009, Amazon released a service called Spot Instances. The concept was quite simple.
At any given point, Amazon had a variety of resources in its data centers that weren't already spoken for. Amazon offered 42 different types of spot instances that differed by computing capacity, memory capacity, OS type, and geographical location, and it let users bid on the unused capacity. Essentially, users could set their own EC2 price for that capacity. The catch was that there was a "reserve price" (the spot price), which fluctuated based on how many resources were available and on demand. If this price rose above your bid, your computation was automatically discontinued. This meant that your task could stop at any point if the price changed and went above your bid. So while you had the opportunity to get EC2 computation at even cheaper cost, there were failure risks. One upside was that if Amazon did cut off your service due to a price change, they would not charge you for the partial hour you were currently in. But that was little consolation if your task was all-or-nothing and ran for more than one hour: you would lose all your results from previous hours but still have to pay for them. That is where checkpointing came in -- finding places in your execution to save state and minimize loss when an out-of-bid event happened (a toy illustration of this appears after this review).

Discussion:

Pros:
- Solid mathematical analysis showing recovery times, failure times, and fluctuating costs compared across different checkpointing schemes.
- Good analysis of all 42 Spot Instance types.

Cons:
- Finds that a dynamic method of determining checkpoints is needed, but doesn't really offer a way of doing that.

Optimizing Cost and Performance in Online Service Provider Networks

Summary:
Online service providers such as Google and Microsoft use many resources to get data to end users. Products such as search, maps, and IM generate up to a petabyte of data per day for any of these large organizations. The cost of transmitting that data and the speed at which it arrives are crucial to the bottom line of any online service provider. This paper presents a new solution called Entact, an automatic traffic engineering (TE) method that dynamically determines the best routes and cost metrics for a given data transfer. Many TE methods already exist for multihomed stub networks, but OSP networks lack an efficient TE method because of differences in how OSP networks are set up. Operators of these OSP networks have to manually configure a delicate balance between the cost and performance of network transfers to end users, and because of the complexity of OSP networks it is very difficult to achieve a good balance. Entact does two things to dynamically create an effective TE strategy. First, it uses a novel route-injection mechanism to measure the performance of alternative paths that aren't currently being used. Second, based on the cost, it computes the optimal cost-vs-performance curve for the OSP.

Discussion:

Pros:
- Given the large amounts of data that today's OSPs need to transfer, it would save companies a lot of money in excess costs.
- The experimental setup was good, using Microsoft's own datacenters and infrastructure to conduct extended tests.

Cons:
- It wasn't really clear what kind of extra overhead was incurred by using this system.
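The spot-instance billing dynamics described in the first summary above can be illustrated with a small, hedged sketch. This is not the paper's simulator: the price trace, bid, and job length below are made-up, and we assume the price is constant within each hour. It simply shows why hour-boundary checkpointing matters: completed hours are billed whether or not their results survive, so unsaved work lost at an out-of-bid event is money spent for nothing.

```python
# Illustrative sketch (not the paper's simulator): replay a made-up hourly
# spot-price trace against a fixed bid and compare "no checkpointing" with
# "hour-boundary checkpointing". Prices, bid, and job length are assumptions;
# an out-of-bid hour wipes any work not yet checkpointed.

spot_prices = [0.030, 0.032, 0.041, 0.029, 0.028, 0.045, 0.027, 0.026]  # $/hour
bid = 0.035        # user's bid in $/hour
work_needed = 5    # hours of useful compute required

def simulate(checkpoint_each_hour):
    saved = 0       # hours of work safely persisted by checkpoints
    in_flight = 0   # hours of work done since the last save point
    cost = 0.0
    for price in spot_prices:
        if saved + in_flight >= work_needed:
            break
        if price <= bid:
            in_flight += 1
            cost += price          # completed hours are billed either way
            if checkpoint_each_hour:
                saved += in_flight
                in_flight = 0
        else:
            in_flight = 0          # out-of-bid event: unsaved work is lost
    return saved + in_flight, cost

for policy in (False, True):
    done, cost = simulate(policy)
    print(f"hour-boundary checkpoints={policy}: {done}/{work_needed} h done, ${cost:.3f} paid")
```

On this toy trace only the checkpointed run finishes the job, and every dollar it spends buys durable progress; the paper's policies and real traces are richer, but the mechanism is the same.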
From: Qingxi Li Sent: Wednesday, April 20, 2011 10:38 PM To: Gupta, Indranil Subject: 525 review 04/21

Reducing Costs of Spot Instances via Checkpointing in the Amazon Elastic Compute Cloud

In Amazon EC2's spot instance pricing model, users can bid on unused resources in EC2. However, when the price of the instance exceeds the user's bid, the service is cut off, which means the service can be cut off at any time without notification. To avoid data loss from this kind of failure, this paper introduces multiple ways of taking checkpoints so that when a failure happens, the task can quickly be recovered from the last checkpoint. The paper considers four policies. The first is hour-boundary checkpointing, since one hour is the smallest billing unit. The second takes a checkpoint on a rising edge of the spot price, since a rising edge implies an increased probability that the service will be cut off. The other two are adaptive checkpointing policies, which decide whether to take a checkpoint by comparing the expected recovery time if we skip this checkpoint against the expected recovery time if we take it (a sketch of this comparison appears after this review). From the results, we find that hourly checkpointing can significantly reduce both the task's completion time and the total price. However, neither adaptive mechanism reduced completion time or total price much compared with its non-adaptive counterpart; in particular, adaptive rising-edge checkpointing sometimes needed even more time than plain rising-edge checkpointing. Moreover, rising-edge and adaptive rising-edge checkpointing did not significantly reduce completion time or total price compared with no checkpointing at all, which suggests these two mechanisms don't work.

Optimizing cost and performance in online service provider networks

This paper introduces Entact, which helps an online service provider (OSP) find the best way to choose among ISPs. Entact computes the tradeoff between minimum cost and performance. In this paper, performance is measured as RTT latency, and other metrics are left for future work. For cost, the paper considers the cost of bandwidth between the datacenter and the IP prefixes; the cost of bandwidth between data centers is not considered. I think it's reasonable not to consider the cost between data centers, but we should consider the latency caused by moving data between data centers, since we cannot put a copy of the data in all of the data centers. To calculate the cost and RTT for each prefix, we need to consider O(pn) combinations, where p is the number of ISPs and n is the number of data centers. As this complexity is too large, the paper uses an approximate algorithm to compute the result. The evaluation says the whole computation takes only a few seconds, but I still wonder about the time complexity of the whole algorithm, which I think will be large.
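The adaptive decision described in the checkpointing review above -- take a checkpoint only if doing so lowers the expected time lost to a failure -- can be sketched as a small rule. The failure probability, checkpoint cost, and time values below are illustrative assumptions, not the paper's actual model.

```python
# Hedged sketch of an adaptive take-or-skip checkpoint decision. The numbers
# and the simple expected-loss model are assumptions for illustration only.

def expected_loss_if_skip(p_failure, time_since_last_checkpoint):
    # If a failure occurs, everything since the last saved state is redone.
    return p_failure * time_since_last_checkpoint

def expected_loss_if_take(p_failure, checkpoint_cost):
    # Taking a checkpoint costs 'checkpoint_cost' now, and a failure
    # afterwards only loses (roughly) the checkpoint overhead itself.
    return checkpoint_cost + p_failure * checkpoint_cost

def should_checkpoint(p_failure, time_since_last_checkpoint, checkpoint_cost):
    """Take the checkpoint only if it reduces the expected loss."""
    return (expected_loss_if_take(p_failure, checkpoint_cost)
            < expected_loss_if_skip(p_failure, time_since_last_checkpoint))

# Example: a 20% chance of an out-of-bid event before the next decision point,
# 60 minutes of unsaved work, and a 5-minute checkpoint cost (all assumed).
print(should_checkpoint(p_failure=0.20,
                        time_since_last_checkpoint=60.0,
                        checkpoint_cost=5.0))
```

The paper's adaptive policies presumably derive the failure likelihood from the price history rather than using a hand-set constant as here; this sketch only shows the shape of the take-or-skip comparison.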
From: nicholas.nj.jordan SpamElide on behalf of Nicholas Jordan Sent: Wednesday, April 20, 2011 8:02 PM To: Gupta, Indranil Subject: 525 review 04/21

Nicholas Jordan
njordan3
cs525 Review 4/21

Optimizing Cost and Performance in Online Service Provider Networks
By: Zheng Zhang, Y. Charlie Hu, Ming Zhang, Ratul Mahajan, Albert Greenberg, Blaine Christian

This is a system that tries to optimize an online service provider's cost by exploring alternative routes via different datacenters (DCs) and the ISP links that are the external gateways from the OSP. In the optimization, for each user, they take into account all possible DCs and links that could serve the user. They weight the cost of using each link and its performance, measured by round-trip time (RTT). They perform a network flow optimization that first routes data through fractions of a link/IP prefix, then go back and assign integer capacities to the links; the result is still very close to the fractional optimum. I was skeptical of this, but their results back up the claim.

Pros:
- Lowered cost by almost 40%, while still not increasing RTT.

Cons:
- Results may vary with a different OSP.

Additional Comments:
Their algorithm only has to adjust 12% of the paths, with the majority being higher price/longer RTT and lower price/longer RTT. After the optimization, they get a utility curve with a couple of points that weigh cost more heavily and others that weigh performance more heavily; I wish they had talked in more detail about what RTT was considered unacceptable for each specific application. For the Microsoft OSP they found that you really only have to consider 2 alternate DCs and 4 alternate routes, because RTT only increases with more options. With this knowledge we can limit the possibilities for each optimization and avoid searching the whole space, reducing it essentially from an NP-hard problem to a pseudo-polynomial-time one.

Market-oriented Grids and Utility Computing: The State-of-the-art and Future Directions
By: James Broberg · Srikumar Venugopal · Rajkumar Buyya

This paper explores ways to allocate resources in a grid more effectively. Traditional resource management systems have lacked fairness in resource allocation, admission control, and scheduling, and they fail to take the user's Quality of Service (QoS) into account. Also, before conferences, "grids" like PlanetLab get swamped with tasks from researchers trying to finish their experiments. In situations like this, grids tend to distribute resources unfairly, and light tasks can be starved of resources. The authors note that there needs to be a more meaningful and tangible form of currency: with unlimited currency, or currency with no direct association with real monetary value, there is no incentive to consume resources wisely or to lend resources. Market-driven grids correct the failed scheme of FCFS scheduling and flat prices in traditional systems; as hinted before, FCFS can lead to starvation, and flat prices are not adaptive to the market or to the user's utility. For scheduling, Libra's proportional share is a good scheme for most jobs, but not for memory-intensive jobs because of expensive context switching. They then propose a "Catallaxy" paradigm, which is basically a free-market economy implemented in a grid system: participants work in their own interest and try to maximize their own utility, and no single entity, whether a user, provider, or broker, has any global knowledge. Increases and decreases in prices make the market dynamic, which encourages people to look for alternative sources if need be. Price auctions are the suggested scheme for obtaining these dynamic prices and responding to the condition of resources.

Additional comments:
In the conclusion, the paper mentions the growing popularity of on-demand services like Amazon's EC2 and argues that the Grid should move toward that scheme. It seems that grids like PlanetLab get overloaded before conferences, but if they were linked up with an unrelated grid, they could offload their workload onto that grid.
This is an interesting suggestion; it seems that it would eliminate some of the scheduling problems and starvation mentioned earlier in the paper. It's similar to what companies do when their resources get overloaded: they pass some of the workload to cloud providers.

--
Thanks,
Nick Jordan