From: harshitha.menon SpamElide on behalf of Harshitha Menon Sent: Thursday, April 14, 2011 1:30 PM To: Gupta, Indranil Subject: 525 review 04/14

A Comparison of Approaches to Large-Scale Data Analysis

This paper compares the paradigms of MapReduce and parallel SQL databases. The authors define a benchmark to compare the two. Below is a comparison of the two:

MapReduce:
-The programming model is simple.
-The program is injected into a distributed processing framework and executed.
-The input data is divided into chunks or partitions in a distributed file system.
-The scheduler schedules the processing of these chunks.
-It uses a pull model to fetch the data to be processed.
-It is easy to set up.

Parallel DBMS:
-Supports standard relational tables.
-Tables are partitioned over the nodes in the cluster.
-The data is structured.
-It uses a push model to transfer data.
-It was about 3.2 times faster than MR when run over 100 nodes.
-The setup was difficult.

On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing

This paper compares and contrasts P2P and Grid computing. P2P is defined as a class of applications that take advantage of resources available at the edge of the Internet. In general, Grid systems integrate resources that are more powerful, diverse, and better connected than P2P resources. There is considerable variation in the range and scope of Grid applications, whereas P2P systems tend to be vertically integrated solutions to specialized resource-sharing problems. Grid applications also tend to be data intensive. In Grid computing there are a moderate number of participants, and the handling of failures has not been well addressed, whereas P2P systems robustly self-manage large numbers of nodes. Both are concerned with the pooling and coordinated use of resources within distributed communities, both are constructed as overlay structures that operate independently, and both take the same general approach toward solving their problems.
From: kevin larson Sent: Thursday, April 14, 2011 12:25 PM To: Gupta, Indranil Subject: 525 review 04/14

Pavlo et al. wrote a paper comparing MapReduce (MR) and database management systems (DBMSs). They present MR as a newcomer and question its validity in comparison to the more established DBMS. They compare how map and reduce work against queries and joins in a DBMS, and compare how data is added, distributed, and processed in both systems. Fault tolerance, execution strategies, and flexibility are also compared. They then present the Hadoop MR implementation and the DBMSs DBMS-X and Vertica, and proceed to evaluate them. They evaluate these three systems on a grep task, an aggregation task, a join task, and a UDF aggregation task. MR outperforms the DBMSs significantly on the load-time tests, performs comparably on the UDF aggregation test, noticeably worse on the grep and aggregation tasks, and significantly worse on the selection and join tasks. The authors reflect upon the approaches in light of their evaluation and discuss the difficulties and challenges of setting up and using both systems. The authors provided a lot of interesting details and insights comparing and contrasting the implementations of MR and DBMS systems. They evaluated MR and DBMSs on a variety of tests, varying the number of nodes between 1 and 100, and reflected upon the results with respect to their earlier comparisons. The authors also gave interesting insight into the future of these systems and systems to come. Despite being described as a comparison paper, the authors seemed to have a significant negative bias against MR. Maybe they were attempting to give it something of a trial by fire, but things like “why not use a parallel DBMS instead” seemed inappropriate in the context of a fair comparison.
Additionally, although evaluations like the grep benchmark had been given as examples, only a single benchmark was described as “the task MR is believed to be commonly used for”. Also, the tests had limited variation in data-set size, which, as demonstrated in figures 7 and 8, MR seemed to scale much better with.

Foster and Iamnitchi present a comparison between P2P and grid computing systems. Both distributed in nature, they have an interesting set of similarities and differences. The objective of both systems is to utilize some large pool of resources, but their communities and foundations have resulted in significant differences. P2P systems, on one hand, have an unstable base of users who are largely independent. On the other hand, grid systems are usually an organized collaboration of professionals with common goals or ideas. As a result, P2P systems have focused on failure tolerance and fairness metrics, whereas grid systems have focused on infrastructure and optimizations. Grid systems tend to have higher-end hardware but smaller numbers of machines, while P2P systems have huge quantities of conventional hardware. Where a wide variety of P2P systems have little to no integration with each other, grid systems again prioritize infrastructure, with many systems sharing APIs. Recent and future systems in both P2P and grid have begun to converge: P2P systems have a lot to gain by adopting more infrastructure, and grid systems have become, and continue to become, more fault tolerant. The paper brought together two kinds of distributed systems which had not previously been strongly associated, most likely as a consequence of the variance in their applications and backgrounds. The insight into their similarities and the observation that their directions seemed to be converging was quite novel. It would have been interesting to back up the observations made in the paper with numbers.
There was nothing more than general information about the sizes of only a few P2P and grid systems. It would also have been interesting to see how rapidly users join and leave systems in both P2P and grid.

From: Igor Svecs Sent: Thursday, April 14, 2011 12:24 PM To: Gupta, Indranil Subject: 525 review 04/14

Igors Svecs CS 525 2011-04-14

On death, taxes and the convergence of peer-to-peer and grid computing

SUMMARY The authors of this paper argue that the convergence of peer-to-peer and grid computing is inevitable. The main difference between the two approaches is that grid systems focus on infrastructure, while peer-to-peer systems address failure. First, they examine other differences between grid and P2P computing. Grids are formed by established communities that assume some degree of trust and accountability, while peer-to-peer application communities consist of independent anonymous individuals that have little incentive to cooperate. Thus, peer-to-peer infrastructure cannot rely on trust and must enforce fair-contribution rules. Grid computing applications tend to consume more bandwidth than peer-to-peer applications (e.g., data analysis vs. SETI@home computing). Peer-to-peer applications tend to be much larger – millions of simultaneous nodes vs. thousands of computers in grid computing. Grid computing tends to provide more services, such as discovery, authentication, resource access, and others. Its infrastructure is more modular, as grids can be used for different purposes. On the other hand, peer-to-peer systems tend to be highly vertically integrated and specialized for a particular purpose. For example, file-sharing applications like Gnutella include their own search and network-maintenance protocols. COMMENTS I think the main argument here is that the hardware, cost, and algorithms relevant to distributed systems are all improving; therefore, all systems tend to converge to an optimal solution.
It is just a matter of how we define peer-to-peer and grid computing systems. This paper has the following pluses: first, it gives a good introduction to the typical characteristics of both grid computing and peer-to-peer systems (of which grid computing is probably less known to the average reader); second, it surveys a lot of systems, services, and applications, and provides plenty of examples. One criticism is that the whole argument is just a matter of how we define the terms “grid” and “peer-to-peer”.

A comparison of approaches to large-scale data analysis

SUMMARY This study compares two parallel data analysis paradigms – MapReduce (MR), introduced in 2004, and parallel DBMSs, which have existed since the 80s. The authors first give an overview of both the MR and parallel DBMS paradigms. One key difference between them is that parallel SQL processing requires data to conform to a rigid schema, while MR data can have no structure at all. However, they argue that for large-scale projects some model of the data must exist; otherwise, programmers would have to decipher MR programs manually. Indexes: DBMSs use B-trees, while MR frameworks do not provide indexing due to their simplicity. They evaluated the original MR task – grep – as well as analytical tasks related to HTML document processing, on three systems – Hadoop (MR), DBMS-X (a row-based DBMS), and Vertica (a column-based DBMS). In grep, Hadoop outperformed the DBMSs in data loading, while Vertica outperformed the other systems in task execution. In the analytical tasks, Vertica consistently outperformed DBMS-X by a small margin and Hadoop by a much larger margin. The conclusion is that parallel DBMSs are 3.1 to 6.5 times faster than Hadoop after some tuning, while Hadoop provides a simpler interface and generally reduces programming time and effort. COMMENTS (PROS) * Very comprehensive study that thoroughly and quantitatively compares the performance of MapReduce and parallel DBMSs.
* Used both the original grep task and more complex document-processing jobs. * The discussion section covers various other aspects: system management (parallel RDBMSs are more challenging to install and configure), task start-up (Hadoop has to spawn a new JVM process for each instance, while DBMSs start with the system), compression (DBMSs allow meaningful data compression to improve I/O), and others.

From: Anupam Das Sent: Thursday, April 14, 2011 12:12 PM To: Gupta, Indranil Subject: 525 review 04/14

i. On Death, Taxes and the Convergence of Peer-To-Peer and Grid Computing

This paper compares P2P systems with Grid computing. Both of these distributed systems have evolved a great deal, and they focus on different requirements even though their main objective is to pool and coordinate large sets of distributed resources. This paper reviews the two systems from five different aspects, namely: target community, resources, scalability, applications, and technologies used. Grid computing technologies were initially developed to address scientific collaboration, but commercial interest grew, which led to the establishment of devoted communities managing the required infrastructure. P2P, on the other hand, has been popularized by diverse anonymous individuals who have little incentive to act cooperatively. In terms of resources, grid systems integrate resources that are more powerful, diverse, and better connected than those of P2P systems. Since resources in grid systems are well managed and organized, they tend to have higher availability than P2P systems, where participation is intermittent and highly variable. Applications of grid technologies vary significantly in range and scope, depending on the target community, and they also tend to be far more data intensive than P2P applications. Currently deployed P2P systems share either files or compute cycles, whereas grid systems are used to perform complex real-time simulations and numerical analyses.
In terms of scalability and fault tolerance, P2P systems are more scalable and resilient to faults than grid systems. The main reason is that grid systems were initially developed for a moderate number of users/communities, and as a result scalability and self-management were not among their design priorities. P2P systems, on the other hand, have to handle intermittent behavior and were thus developed with the objective of being scalable and fault tolerant. Finally, while grid systems have expended much work on the technical and organizational issues of providing persistent and multipurpose infrastructure services, P2P systems have focused on integrating simple resources to provide specific (vertically integrated) functionality.

Pros: 1. The paper provides a good comparison between P2P and Grid computing systems. 2. Merging the attractive properties of P2P and Grid systems to build a new distributed system is both interesting and promising. In a sense, cloud computing systems merge some of the features of P2P and grid computing, as they provide a standardized service infrastructure which is both scalable and reliable.

Cons: 1. The paper only points out the possibility of merging the attractive features of both systems without giving any concrete ideas for doing so. 2. Since P2P and Grid systems have different requirements, it might not actually be beneficial to build a system that merges their attractive features. Some form of tradeoff analysis will have to be done before making such an attempt.

--------------------------------------------------------------------------------------------------------------------------------------

ii. A Comparison of Approaches to Large-scale Data Analysis

This paper provides a detailed comparison between the MapReduce (MR) framework and parallel database management systems (DBMSs). The authors evaluate both systems in terms of performance and development complexity.
They perform extensive benchmark experiments to highlight the strengths and weaknesses of both systems and finally argue that DBMSs are better. Among their findings, the following are quite interesting: a) Parallel DBMSs require data to fit into the relational paradigm of rows and columns. In contrast, the MR paradigm is quite flexible, since it does not require the data files to adhere to a particular schema. b) MR is better in terms of ease of use and deployment. c) MR is more adept than a DBMS at handling failures during execution. d) MR has a serious problem in handling the data transfer between map and reduce tasks: MR uses a “pull” strategy, whereas DBMSs use a “push” strategy to transfer data. e) MR is more flexible for writing the desired program, compared to the complex SQL programming paradigm. f) MR does not have a good indexing scheme compared to a DBMS.

Pros: 1. The paper provides helpful insights into the MR framework and parallel DBMSs. 2. The paper contains extensive real-life experiments. 3. Both the performance and software aspects are compared in this paper.

Cons: 1. The paper was unjust in claiming that DBMSs are better than MR. Sure, the performance of the DBMSs is better, as they were tuned for that, but MR is much more scalable and fault tolerant than a DBMS. 2. The paper does not discuss the cost of setting up a DBMS compared to a simple MR framework. 3. The paper only compares Hadoop with two DBMSs, but it would be interesting to see how they stack up against other map-reduce-based frameworks like Pig Latin (which are more tuned for particular tasks). 4. I think MR and DBMSs were built for different purposes, so doing strict comparisons between them actually goes against their initial objectives.
-----Anupam

From: w SpamElide on behalf of Will Dietz Sent: Thursday, April 14, 2011 12:08 PM To: Gupta, Indranil Subject: CS525 Review 4/14

Will Dietz cs525 4-14-2011

"On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing" Foster and Iamnitchi

The paper starts out with a _citation_ for "life holds but two certainties, death and taxes". The citation is their own previous work?! Haha, talk about an intense 'hook'. The somewhat playful explanation that follows is useful in that context as well. Anyway, the primary thrust of this paper is to compare P2P systems to Grid computing--with the suggestion that while presently they're different solutions to the same problem, they might just be steps towards a common shared solution later. The authors point out that both Grid and P2P are solutions to the same problem--"the organization of resource sharing within virtual communities"--and that they even solve it similarly (using overlay networks). One of the more interesting comments here was the claim "Grid computing addresses infrastructure but not yet failure, whereas P2P addresses failure but not yet infrastructure". They make the seemingly natural step of indicating that the best solution is probably going to be a combination of the two. The paper suggests the two are presently at very different points but, due to their similar goals and solution strategies, will probably converge down the road. I'm not as convinced. While the technologies are very similar, it does seem clear (and arguably rather useful!) to have distinct trade-offs that one can make depending on your application. Simply combining the two may give you neither benefit (infrastructure or failure-resistance--although of course we're grossly simplifying the comparison of the two), being semi-suitable for much but not really great at anything. I mean to say that's possibly the case, and I'm not sure their claim that you can combine the two in a benefit-preserving way is well motivated.
That said, the authors do propose a tantalizing idea--taking the autoconfiguration and failure-resistant nature of P2P networks and combining it with the structured nature of Grid computing (addressing issues such as reliability and trust). Here's hoping we see such a combination in the near future :)

"A Comparison of Approaches to Large-Scale Data Analysis" Pavlo, Paulson, Rasin, et al.

This paper asks a similar question--hey, there's this new hyped-up technology that _is_ great and new (as in different), but look, it's rather similar to this perhaps presently undervalued existing technology that's been around for ages. How do the two really compare? The objects of this query are the MapReduce (MR) paradigm and the similar structures in parallel database management systems (DBMSs), both being large-scale data analysis techniques.

MR Pros:
Simplicity
Light and flexible
Fault tolerance
Fast load times
MR Cons:
Might not scale as well
*Requires* you to parse your data
Slower execution

DBMS Pros:
Mature
Structured data (no need to write a parser)
Built-in indexes
Faster execution
DBMS Cons:
Rigid and complicated
Slower load times
Less failure-resistant

From: anjalis2006 SpamElide on behalf of Anjali Sridhar Sent: Thursday, April 14, 2011 12:05 PM To: Gupta, Indranil Subject: 525 review 4/14

A comparison of approaches to large-scale data analysis, A. Pavlo et al, ACM SIGMOD 20

The paper compares the performance and ease of use of two approaches to large-scale analysis – MapReduce and database management systems. MapReduce first stores input data as key/value pairs in the map phase; reduce then efficiently groups similar keys together. A DBMS involves executing a query, written in an SQL-like language, over a set of databases. A DBMS requires data to be stored based on a schema defined using the relational data model. MR programs do not require such a model.
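The map/reduce contract described above can be sketched in a few lines. The following is an illustrative Python sketch of the paper's grep-style task, not Hadoop's actual API; the shuffle step that a real framework performs between the two phases is simulated here with a sort-and-group.

```python
# Minimal sketch of the map/reduce contract: map emits key/value pairs,
# the framework groups pairs by key, and reduce aggregates each group.
# Illustrative only -- names and structure are not Hadoop's API.

from itertools import groupby
from operator import itemgetter

def map_grep(line, pattern="XYZ"):
    # Map phase: emit (key, value) pairs for lines containing the pattern.
    if pattern in line:
        yield (pattern, line)

def reduce_count(key, values):
    # Reduce phase: aggregate all values that share a key.
    return (key, len(list(values)))

def run(lines, pattern="XYZ"):
    # "Shuffle": sort intermediate pairs by key, then group them --
    # the framework normally does this between map and reduce.
    intermediate = sorted(
        pair for line in lines for pair in map_grep(line, pattern)
    )
    return [
        reduce_count(k, (v for _, v in group))
        for k, group in groupby(intermediate, key=itemgetter(0))
    ]

print(run(["aXYZb", "no match", "XYZ again"]))  # [('XYZ', 2)]
```

The same shape covers the paper's other tasks: only the bodies of the map and reduce functions change, which is why the programming model is considered so simple.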
However, in order to use an MR program to handle data in a format other than simple key/value pairs, the user needs to specify the scheme in which the input data is going to be stored. The lack of a common schema makes this inflexible for use by a large number of users. The paper also mentions that less data crosses the network when a DBMS is used, thanks to its parallel query optimizers. In terms of ease of use, the authors mention that MR is easy to get started with, but as the system grows it gets harder to extend and maintain, whereas SQL's schema model makes it more flexible to expand. Tests were conducted on three systems – Hadoop, DBMS-X, and Vertica – all deployed on a 100-node cluster. Each of the benchmark tasks was carried out three times (in one of the previous lectures it was definitely a point of critique to average a result over only 3 runs). The results indicate that the database systems performed better than MR due to the use of B-tree indices, column-based tables, operating on compressed data, and parallel algorithms on relational databases. The main reason for the rapid popularity of Hadoop was attributed to the ease of starting up the software and the limited number of system issues that users had to deal with. The paper does not look at performance when there are failures in the system and nodes are unavailable. MR handles failures better than a DBMS: if there is a failure, only that particular task is restarted in MR, as opposed to restarting the entire query in the case of a DBMS. Working with compressed data in MapReduce would decrease the network I/O, which is one of the disadvantages mentioned above; the increased CPU power required might be a tradeoff for the user to consider. The paper does not talk about some of the optimizations and tradeoffs that can be (or are already) readily applied to MapReduce.

On death, taxes and the convergence of peer-to-peer and grid computing, I.
Foster et al, IPTPS 2003

The paper aims to contrast Grid computing and P2P based on target users, resource sharing, and scalability. The authors believe that both of these large-scale distributed systems have to deal with the common challenges of failures and maintenance of infrastructure. Grid computing provides services to organizations limited in size. There is a standard protocol for managing this resource distribution to various users. A resource shared by different organizations under some common policy that defines the rules of resource sharing is called a virtual organization. In grid computing, the users are generally part of the same VO, and hence issues related to security and trust are much different than in P2P networks. In a P2P system, the users are anonymous and there is no incentive to cooperate. The users of a P2P system join and leave in an ad hoc fashion and hence cannot be asked to take part in building an infrastructure. The paper attempts to show that Grid computing and P2P have common challenges. The very characteristics that make a P2P system what it is (ad hoc joins and failures, lack of cooperation, self-organizing networks) are what ultimately prevent it from having a standardized infrastructure. Self-organizing P2P systems are constructed and torn down as and when the need arises. Similarly, in the case of Grid computing, scalability combined with self-organization poses a challenge. The middle area of research that might be explored is how scalable a P2P system can be while having a standardized infrastructure. Such a system would need to be persistent and able to support multipurpose applications. It seems almost as if the two systems are at two ends of a spectrum, and one gradually moves from one toward the other as the application and its requirements are specified.
From: trowerm SpamElide on behalf of Matt Trower Sent: Thursday, April 14, 2011 12:02 PM To: indy SpamElide Subject: 525 review 04/14

Death & Taxes

This paper presents an in-depth look at the different paths that P2P and grid computing have taken to get to their current positions and makes predictions about the convergence of the two technologies. The authors try to make comparisons between implementations rather than theoretical models as much as possible. The majority of the paper is spent discussing how the different users of grid and P2P have led the two technologies down different paths. Grid computing offers a mature selection of services present on all machines, whereas P2P has robust and scalable networking but very few services (file sharing, small compute problems). I think the most important difference between the two models has been the trust granted to nodes. This is also where I see the difference remaining, as P2P is a sharing model and grid computing a collaborative model; the economic functions motivating the technologies are different. I think including cloud computing in this comparison would make for an interesting conversation. In some ways cloud computing fits between these two models, and in other ways it does not.

Comparison of Approaches

This paper presents a comparison of modern MapReduce systems with the veteran parallel DBMSs. The authors try to give a complete comparison of the two technologies, from deployment all the way to performance on a suite of tests. The main difference between the two models is whether the data is stored based on some schema. Because MapReduce does not do this, its programming model is very simple and programs are easy to write. On the other hand, DBMSs can do many sorting tasks very quickly. It doesn't come as any surprise that MapReduce is significantly easier to program than DBMSs, as the general community has become much more involved in cloud computing than it ever did with DBMSs (reporters, etc.).
I thought the fact that the DBMSs were hard to set up was interesting given their age. I think the authors underplayed the importance of fault tolerance in their discussion. Their argument was that DBMSs can do equal computation with 10x fewer “fast” machines. The motivation for using cheap small machines has always been price, although new efforts have shown power savings as well (Intel Atom). This aspect of the problem is never discussed by the authors but is an obvious factor in real-world deployments.

From: Qingxi Li Sent: Thursday, April 14, 2011 11:33 AM To: Gupta, Indranil Subject: 525 review 04/14

On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing

This paper compares Grid computing and P2P computing. Both are distributed systems used to share resources between users. Grid computing started as a way to share computing resources for scientific computation tasks that would otherwise need supercomputers, while P2P was created for sharing files between users. The authors compare them on tasks, resources, scale, failures, and so on:

Task. Grid: sharing resources for scientific computation tasks. P2P: sharing files, like music.
Resources. Grid: powerful, diverse, and well connected. P2P: less powerful; sometimes only memory is used, for storing files, plus a little computation.
Trust. Grid: users can be trusted, through explicit administration, a higher cost of membership, or stronger community links within science. P2P: users cannot be trusted; there may be many malicious nodes.
Users. Grid: a small number (100s or 1000s) with incentives to act cooperatively. P2P: a large number of users with little incentive to cooperate.
Scalability. Grid: not scalable, with centralized components such as resource management; scalability and self-management have low priority. P2P: only the first generation had a central server; the second generation uses flooding and the third generation uses hash tables. Scalability is one of the most important problems for P2P applications.
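The "hash table" approach of third-generation P2P systems can be sketched with a tiny consistent-hashing ring: keys and nodes hash onto the same ring, and each key is owned by the first node clockwise from its hash. This is an illustrative sketch with made-up node names; real DHTs such as Chord add routing tables and failure handling on top of this idea.

```python
# Illustrative consistent-hashing lookup, the core idea behind the
# distributed hash tables used by third-generation P2P systems.
# Node names are hypothetical; real DHTs add routing and fault handling.

import hashlib
from bisect import bisect_left

def ring_hash(name):
    # Map a name to a point on a small ring (2^16 positions).
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** 16)

class DHTRing:
    def __init__(self, nodes):
        # Place every node on the ring, sorted by hash position.
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def lookup(self, key):
        # Owner is the first node at or after the key's hash, wrapping.
        hashes = [p for p, _ in self.points]
        i = bisect_left(hashes, ring_hash(key)) % len(self.points)
        return self.points[i][1]

ring = DHTRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("some-file.mp3")
print(owner)  # deterministically one of the three nodes
```

The appeal for P2P scalability is that no central server holds the mapping: any node that knows the ring membership (or a route toward it) can resolve a key.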
I think the differences between Grid and P2P mainly come down to whether the users can be trusted. Because users in a Grid can be trusted, a Grid can draw on diverse and powerful resources; but ensuring that users can be trusted requires explicit administration and limits the service to a small number of users, which is part of why scalability has low priority in grid computing. Conversely, because users cannot be trusted in P2P, each can share only a few resources, but there are a large number of them, and both security and scalability are among the most important problems in P2P.

A Comparison of Approaches to Large-Scale Data Analysis

This paper compares MapReduce (MR) and database management systems (DBMSs) on schema support, indexing, programming model, data distribution, and many other areas. It mentions that input error checking is done automatically in a DBMS but must be done by the programmer in MR. Besides this, indexing is built into a DBMS, while indexing in MR must be designed by the programmer; the same goes for balancing computational workloads and minimizing the amount of data transmitted. MR uses a “pull” strategy for data, while a DBMS uses “push”. The only thing the authors think MR does better than a DBMS is fault tolerance. In the 100-node evaluation, the DBMSs work at least 3.2 times faster than MR. In fact, I don’t think this paper compares the two objectively. First of all, MR is built for clusters of more than 1000 nodes, while DBMSs work on clusters of around 100 nodes; the authors’ use of a 100-node testbed will certainly make the DBMSs look better than MR. Even though they argue that not many MR users really need 1000 nodes, big companies such as Google and Amazon really do have more than 1000 nodes. Besides this, I think scalability is a very important problem for these data analysis systems. Additionally, there are advantages and disadvantages to doing things like indexing and input error checking manually versus automatically.
Doing things manually can be more flexible but more complex to program; however, the authors don’t discuss the flexibility of manual indexing and the like. As for the programming model, the authors seem to favor the declarative language model. However, when I use SQL, I find that either the logic of the query becomes very complex or you need many tables. The other thing I think the authors should mention is price: they should compare the cost of building and maintaining a system for the same tasks using MR versus a DBMS, as the two have different hardware requirements.

From: Shen LI Sent: Thursday, April 14, 2011 11:16 AM To: Gupta, Indranil Subject: 525 review 04/14

Name: Shen Li

On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing

This paper provides a high-level comparison between P2P networks and Grid computing. The two approaches try to solve the same problem, which is resource sharing, using similar technology (build overlay networks and provide services above them). Grid provides services to moderate-sized communities and cares more about quality of service, while P2P networks have more participants but more limited and specialized services. With guaranteed quality of service, Grid is able to host more diverse applications, such as the HotPage portal, the numerical solution of the "nug30" problem, the NEESgrid system, and so on. On the other side, P2P systems always focus on specialized resource-sharing problems. According to their argument, these two systems are complementary to each other in some aspects and will grow closer over time.

Pros: It is now clear that they made the right prediction about the future direction of the two systems. People no longer talk much about Grid; instead, Cloud Computing is quite popular, and Cloud Computing is very similar to Grid in many ways.
And in many large-scale systems, e.g., Amazon's Dynamo, Facebook's Cassandra, and LinkedIn's Voldemort, we see many features borrowed from P2P systems. They use distributed hash tables to allocate resources rather than a centralized allocation table, and they all tend to abandon the concept of master nodes, heading toward systems with all identical nodes.

A Comparison of Approaches to Large-Scale Data Analysis

This paper compares the MapReduce paradigm for large-scale data analysis with the parallel SQL used in database management systems, which has existed for more than 20 years. They share the same basic framework. The results show that although it takes longer to load data into a parallel DBMS and to tune its execution, it performs much better than MapReduce. Besides the overall performance, the authors also list several deliberate designs in DBMSs that MapReduce lacks. (1) They argue that the MapReduce programmer must write a custom parser in order to deliver the appropriate semantics for their input records, which is at least an equivalent amount of work compared to transforming the data into a row-column format. (2) MapReduce does not provide any indexing system; the user must design an indexing strategy to accelerate data processing. (3) There is a long-held belief that in large-scale data-oriented systems, computation should be sent to the data rather than the other way around; MapReduce breaks this principle in the transition from the map phase to the reduce phase. (4) In MapReduce, it is very common for multiple reducers to try to read the same file hosted on a mapper's local disk, which degrades disk performance.

Pros: As a networking person, I constantly hear about MapReduce and its advantages, but I knew nothing about DBMSs. This paper compares MapReduce with the DBMS, which is perhaps MR's root, and gives us a thorough view of the pros and cons of MapReduce.
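The indexing gap noted in point (2) above is easy to illustrate. In this sketch, a sorted key list probed with binary search stands in for a DBMS's B-tree index, while the MR-style selection must scan every record; the data and names are hypothetical.

```python
# Why built-in indexes matter (point 2): an MR-style selection scans
# every record (O(n)), while a DBMS can probe an index (O(log n)).
# A sorted key list with binary search stands in for a B-tree here.

from bisect import bisect_left

records = [(i, "payload-%d" % i) for i in range(100_000)]
index = [key for key, _ in records]  # kept sorted, like a B-tree's keys

def select_scan(key):
    # MR-style: full scan of the input.
    return [r for r in records if r[0] == key]

def select_indexed(key):
    # DBMS-style: probe the index, then read only the matching record.
    i = bisect_left(index, key)
    if i < len(index) and index[i] == key:
        return [records[i]]
    return []

assert select_scan(4242) == select_indexed(4242)
```

An MR programmer can of course build such an index by hand (e.g., as a pre-sorted input file), which is exactly the paper's point: in a DBMS it comes for free, in MR it is the user's job.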
From: Jason Croft Sent: Thursday, April 14, 2011 11:15 AM To: Gupta, Indranil Subject: 525 review 04/14 On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing Both peer-to-peer and grid computing have a similar objective: the pooling and coordinated use of large sets of distributed resources. The authors also argue: (a) both are concerned with the same general problem, (b) take the same general approach by constructing overlay structures, (c) each has made technical advances but has crucial limitations, and (d) the complementary nature of the strengths and weaknesses of each means the two communities will grow closer. Grid computing is defined as an environment where resources are owned by various administrative organizations, while P2P uses resources available at the edges of the Internet with unstable connectivity and unpredictable IP addresses. P2P tends to have a large number of participants and offer specialized services with few assumptions about trust. Grid computing typically has more powerful, more diverse, and better connected resources than P2P, but requires explicit administration and has a higher cost of membership. Grid and P2P also differ in their applications. P2P systems tend to be solutions to specialized resource-sharing problems, and diversification is in the scalability, anonymity, and availability the system offers. Applications used in Grid computing usually are much more data intensive, but the infrastructure relies on centralized components for shared data, centralized resource management, and centralized information directories. Furthermore, the Grid community focuses more on technical and organizational issues, such as authentication, authorization, discovery, resource access, and data movement. The P2P community, on the other hand, focuses on anonymity, censorship resistance, incentives for fair sharing, reputation management, and result checking. 
The authors point out many differences between Grid and P2P and how the two communities can build off each other as requirements change. The clear and explicit definitions of P2P and Grid computing are also helpful to understand what types of applications and systems the authors are comparing. However, there could have been more discussion on the future directions of Grid and P2P beyond the observation that many of the limitations of one have already been solved by the other. A Comparison of Approaches to Large-Scale Data Analysis This paper compares parallel DBMSs and MapReduce in five different tasks, claiming DBMSs are significantly faster than MapReduce. The authors argue any MapReduce task can be written as a set of database queries, but the data must conform to a well-defined schema (unlike MapReduce, which uses unstructured data). One benefit of this design with SQL is that the schema is separated from the application, whereas in MapReduce a programmer must implement custom parsers and ensure at runtime that the data does not violate any high-level constraints. MapReduce also does not provide built-in indexes; these must be implemented by the programmer if she wishes to speed up the program. MapReduce writes many files to disk to save intermediate results, which can introduce some contention for reads when multiple instances execute on the same hardware. However, this provides better fault tolerance than DBMSs, which must restart large transactions in the event of failures since little intermediate data is saved. The analysis of these two paradigms uses a collection of independent machines where data is partitioned, or allocated, to different nodes. Hadoop is compared to DBMS-X and Vertica--two parallel SQL DBMSs--deployed on 100-node clusters.
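The horizontal partitioning both paradigms rely on can be sketched as a simple hash placement rule: each record's key deterministically picks a node, so no central directory is needed. The node count and keys below are invented for illustration; real systems (Hadoop's partitioner, a parallel DBMS's hash partitioning) layer replication and rebalancing on top of this basic idea.

```python
import hashlib

NUM_NODES = 4  # assumed cluster size for the sketch

def node_for(key):
    """Deterministically map a record key to one of NUM_NODES nodes."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

# Distribute a handful of made-up keys across the cluster.
partitions = {n: [] for n in range(NUM_NODES)}
for key in ["url1", "url2", "url3", "url4", "url5", "url6"]:
    partitions[node_for(key)].append(key)
```

Because the placement is a pure function of the key, any node can compute where a record lives, which is exactly what lets both systems run the same operator on every partition in parallel.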
Five tasks are used for comparison: the original MapReduce grep example, a Selection task to find pageURLs in a Rankings table with a pageRank above a threshold, calculating the total adRevenue generated for each sourceIP, finding the sourceIPs that generate the most revenue within a particular date range while calculating the average pageRank of the pages visited during this interval, and computing the inlink count for each document. The evaluation compares the loading time of the data (where MapReduce is considerably faster than the DBMSs) and the execution time, which finds that MapReduce is slower than both DBMSs. In my opinion, this paper was biased towards DBMSs and overlooked many of the strengths of MapReduce. One reason MapReduce has become so popular is its simplicity, both in setup and in implementing tasks. The authors only briefly discuss this, describing the many difficulties they encountered in installing and configuring the DBMSs. The microbenchmarks (load time and execution time) somewhat reinforce their bias, as they only show DBMSs being slower for load times but faster for the five tasks they measure. Had the authors shown the total time to load the data and execute the tasks, the results would be vastly different. For a company like Google that is constantly crawling the web and collecting new data, DBMSs would be very ineffective, since load times are an order of magnitude slower than MapReduce--a point the authors seem to miss. Given the difficulty and time to set up a DBMS, and the increased load times, there are a number of other applications where DBMSs would not be optimal. Finally, the authors state DBMSs required less code for their evaluation, but there is little discussion of this point outside of the introduction and conclusion. How was code size measured for MapReduce (just non-empty lines, or semicolons)? How did the time to code each task compare between MapReduce and the DBMSs?
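To make the code-size question concrete, here is the Selection task above in both styles, with SQLite standing in for a parallel DBMS. The Rankings table and pageURL/pageRank columns follow the benchmark description; the threshold and sample rows are invented for the sketch.

```python
import sqlite3

rows = [("a.com", 15), ("b.com", 3), ("c.com", 42)]
THRESHOLD = 10  # assumed pageRank cutoff for the sketch

# DBMS style: the predicate is one declarative query over a schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Rankings (pageURL TEXT, pageRank INTEGER)")
db.executemany("INSERT INTO Rankings VALUES (?, ?)", rows)
sql_result = db.execute(
    "SELECT pageURL, pageRank FROM Rankings WHERE pageRank > ?",
    (THRESHOLD,),
).fetchall()

# MR style: the programmer writes the filter by hand in a map function;
# a pure selection needs no reduce phase.
def map_fn(record):
    url, rank = record
    if rank > THRESHOLD:
        yield url, rank

mr_result = [kv for rec in rows for kv in map_fn(rec)]

assert sorted(sql_result) == sorted(mr_result)
```

Both versions are short here, which is part of the reviewers' point: raw line counts say little about how long each took to write or to get running on a cluster.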
Fewer lines of code do not mean shorter implementation time (the authors even mention that describing tasks in a declarative language like SQL can be challenging). From: Michael Ford Sent: Thursday, April 14, 2011 10:41 AM To: Gupta, Indranil Subject: 525 review 04/14 A Comparison of Approaches to Large-Scale Data Analysis The authors present a comparison of techniques for parallelizing tasks, which include MapReduce and traditional parallel DBMSs. The authors promote their opinion that writing SQL-like queries is easier than writing a MapReduce implementation. They also note that data formats are more restrictive in a DBMS. One very interesting point the authors raise is who actually needs the increased scalability offered by MapReduce over a traditional parallel DBMS. A parallel DB cluster of 100 nodes scales to a few petabytes of data, as large as eBay's or Fox Interactive Media's warehouse. While MapReduce may be required for tech giants, even large companies can use alternatives. However, the MapReduce paradigm presents a new business model: on-the-fly computation. Small and medium companies can outsource their data and computations to the cloud. Curiously, the paper did not mention Amazon's S3 and EC2. Since databases are persistent in memory, sharing compute resources can be problematic. Instead, the paper focuses on the performance comparison, showing that DBMSs outperform MapReduce for the selected tasks. On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing The first paragraph's comparison of Death and Taxes to failure and upkeep maintenance is a compelling analogy. It serves as a foothold for the comparison of peer-to-peer systems and grid computing. Throughout the comparison, the authors make four arguments: both address resource sharing, both build overlay structures, each has limitations, and, due to the complementary nature of their strengths and weaknesses, they will grow closer over time.
One of their main arguments is that “Grid computing addresses infrastructure but not yet failure, whereas P2P addresses failure but not yet infrastructure”. At the time the paper was published, this certainly seemed to be the case. However, there was work on the P2P side to communicate efficiently by choosing active neighbors based on link latency, though these optimizations had certainly not yet found their way into production systems. Since publication, the two systems have indeed moved toward one another, and one could argue that the emergence of MapReduce fills the void between the two, but this argument is not without its own flaws. From: Andrew Harris Sent: Thursday, April 14, 2011 7:31 AM To: Gupta, Indranil Subject: 525 review 04/14 Review of “On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing”, Foster and Iamnitchi, and “A Comparison of Approaches to Large-Scale Data Analysis”, Pavlo et al Foster and Iamnitchi provide a comparison of Grid-based computing platforms with P2P communication platforms. They address questions of target communities and their motivations, of computing resource distribution and availability, of the varying applications across Grid vs P2P networks, and so on. A major difference highlighted within the paper is in the network composition and utilization of the two types of networks. Grids tend to be much more computationally intensive, relying on intermittent data transfer to receive computation packages and transmit completed computations. P2P systems, on the other hand, are decidedly transfer intensive, with the vast majority of activity being node-to-node transfer of (relatively) high-bitrate information. This underscores the raison d’être for each type of network: grids exist as a cost-effective alternative to high-performance computing clusters for research institutions, whereas P2P networks exist in large part to facilitate the search for and sharing of content among end users.
Curiously, the researchers concern themselves with “node death”, which both systems handle somewhat gracefully in implementation. Grid systems tend to rely on bursty, small, infrequent communications, so the loss of a node from the overall network is not a major concern. Nodes need only communicate home every day or so to ensure that work is being transmitted and received. P2P systems are death-hardened by design, in that most have some mechanism by which parts of a file may be transferred in parallel from multiple users. A node death in this model simply means that another node will pick up the transfer slack, and that the system will continue to look for replacement nodes (as it was likely doing anyway in finding more download sources). The researchers mention a need for some sort of persistent infrastructure for future networks, and also stress lessening the reliance on centralized coordination of grids and P2P networks alike to further harden them against node death. I am still left unconvinced, however, of the severity of the problem, given the natures of the two networks and how they inherently handle node death. They note that solving the death “problem” will allow computing resources to be requisitioned in configurable amounts from anywhere in the world, but this does not make sense with either type of network, due to both types being wildly heterogeneous in practice. Finally, on the applications running over both network types, grids are flexible in general, but typically users will focus on a small number of grid projects for their systems. Similarly, P2P users have only a single purpose in mind for their systems, and do not use them for anything more. Both are narrow in practice, so it is unclear how opening either system to generally requisitioned computation would be a Good Thing, let alone acceptable among users. The Pavlo group compares MapReduce functionality and use cases in Hadoop with two mature parallel database management systems.
Micro-benchmarks suggest structural differences between the two approaches to data analysis: Hadoop has amazing load times, due in part to its non-reliance on a formalized data system, but it has very poor Grep times across large datasets for roughly the same reason. The paper concludes by suggesting that programmers should weigh the benefits and drawbacks of a structured (DBMS) vs. unstructured (Hadoop) code model before engaging in a particular project involving either, as both have their strong and weak points depending on the task. Some of the differences in time to completion were shockingly large; for instance, the Join task taking less than a minute to complete in both DBMSs versus taking over twenty hours to complete in Hadoop. This is surely due in part to DBMSs having been tuned to handle such commands gracefully and quickly, but it is an important set of points to have in mind when designing a project. Furthermore, on Hadoop and MapReduce, the ability of a programmer to specify their own implementation of the map and reduce functions is remarkably powerful, in that MR can be tailored to fit almost any data distribution practically without configuration. You simply code and go. In a DBMS, this would take comparably much longer to complete, as an entirely new table would need to be created to hold the new range of values, all other things equal. The main takeaway from both of these articles, though, seems to be, “Choose what is right for your task.” The Pavlo group suggests as much for large dataset analysis, and Foster and Iamnitchi suggest as much for distributed computation and media sharing. From: david.m.lundgren SpamElide on behalf of David Lundgren Sent: Thursday, April 14, 2011 2:52 AM To: Gupta, Indranil Subject: 525 review 04/14 A Comparison of Approaches to Large-Scale Data Analysis Pavlo et al. compare and analyze the performance of two paradigms for large-scale data analysis: MapReduce (MR) and database management systems (DBMS).
Hadoop, Vertica, and DBMS-X are examined, and their respective programming models, data distribution models, indexing, schema support, flexibility, and fault tolerance are discussed. The systems are benchmarked across five tasks of varying complexity, from grep to join to selection and aggregation. MR is demonstrated to typically require orders of magnitude less time to load data, implying it is well-suited to run-once tasks. Across all other benchmarks, the DBMSs are shown to outperform MR in terms of execution time by a factor of 3.2 on average. System-level trade-offs such as compression, task start-up time, and execution strategies are discussed. Finally, the authors close with a discussion of developer usability. Pros, Cons, Comments, and Questions: - One of the key observations glossed over in the authors' closing discussion (and somewhat trivialized throughout the paper) is the massive, order(s)-of-magnitude difference in load time for DBMSs vs. MapReduce. For the grep task on 1TB/cluster data, the total time for Hadoop execution on 100 nodes is approximately 1500-2k seconds; compared with the DBMSs' 22k and 8.5k second performance, load time is an important metric for system performance. - Node failure seems to be a point of interest when considering massive systems. I believe that relaxing the ``zero node failure'' assumption of this paper could lead to significant performance gains for MR (due to the fact that DBMSs rely on atomic transactions that restart given any node failure). - I enjoyed the comparison of system-level and user-level aspects for the two paradigms. The ease of use of imperative programming and the schema-later-or-never approach of MR for programmers unaccustomed to relational database programming are non-trivial factors when considering system use. - An interesting study would be on characterizing the nature of large-scale data analysis programs.
The observation of DBMSs' superior execution speed is interesting, but relevant only if the majority of jobs do not rely on load-once data (which would give the DBMS jobs an unreasonably high time to completion). ------------------------------------------------------------------------- On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing Foster and Iamnitchi compare and contrast Grid computing (defined as a continuously available, standardized service infrastructure for resource sharing amongst virtual organizations) and peer-to-peer computing (defined as resource-sharing applications at the ``edge of the Internet,'' typically meaning an overlay across a transient network of commodity nodes). This definition of P2P systems eschews the traditional one to include systems such as Napster and SETI@home. Five verticals are examined: 1) target communities and incentives; 2) resources; 3) applications; 4) scale and failure characteristics; and 5) services and infrastructure. The initial audience for Grid computing is identified as scientific or professional communities interested in large-scale, computation-intensive tasks. P2P's clientele, on the other hand, is characterized as diverse, unrelated, and anonymous individuals. P2P systems are shown to be of a larger scale than Grids, and also to contain a larger amount of activity than Grids, although it is noted that this is not always the case. Both systems have similar objectives: effective resource sharing using overlays that cross institutional and national boundaries. The authors then speculate on the cross-pollination of Grid and P2P system research to produce more robust and complete systems. Pros, Cons, Comments, and Questions: - Foster and Iamnitchi predict (in 2003) an increased mixing of Grid and P2P research.
As the scale of Grid computing has grown, there have been various systems proposed that borrow ideas from P2P to address the trust, resilience, and decentralization issues that occur in these larger, more complex grids. - The authors mention little about existing Grid or P2P services that borrow techniques from the other. Were there any such systems at the time? - I think structuring their analysis around the fact that the divergent target communities and incentives of Grid and P2P led to different developments and deployments provides good insight into why and how the resources, scale, and applications of the two differ. From: Curtis Wang Sent: Thursday, April 14, 2011 2:08 AM To: Gupta, Indranil Subject: 525 review 04/14 Curtis Wang (wang505) 4/14/11 In Byzantium: The Byzantine Generals Problem The paper discusses reliability in computer systems when dealing with failures of their components. The problem is abstractly formulated in terms of Byzantine generals coordinating an attack on a city. Generals must communicate with each other by oral messages (which means the contents are under the control of the sender). However, some generals could be traitors, and their messages may not be what the commanding officer (who could also be a traitor) has reported. In this situation, the authors demonstrate that more than 2/3 of the generals must be loyal; otherwise they cannot reach a solution. Pros - Seminal work in the area with many applications - Formal proofs of results Cons - Limited applicability because of its assumptions. The paper makes several assumptions about how the messages are sent, but there could be additional issues such as message corruption or message interception. PeerReview The authors describe a system called PeerReview that is used to provide accountability in distributed systems. Namely, in the event of Byzantine faults, the system guarantees eventual detection of the faulty node while ensuring that a good node is not falsely accused.
The system accomplishes this by maintaining a history of logs for each node (inputs, outputs, and communications). The system also assumes that each node has deterministic behavior. Each node is assigned a set of “witnesses” that periodically monitor the node's log for misbehavior. These witnesses then collect the evidence and send it to other nodes to check. Pros - Guarantees eventual detection of a faulty node - PeerReview is practical and applicable to many types of distributed systems; the paper provides three examples. Cons - Only works for systems that are deterministic. - Costly to implement, and scale may be an issue. From: lewis.tseng.taiwan.uiuc SpamElide on behalf of Lewis Tseng Sent: Thursday, April 14, 2011 12:40 AM To: indy SpamElide Subject: 525 review 04/14 CS 525 - Review: 04/14 Old Wine: Stale or Vintage A comparison of approaches to large-scale data analysis, A. Pavlo et al, ACM SIGMOD 2009 The paper compared two common computing paradigms for large-scale data analysis, MapReduce (MR) and parallel SQL database management systems (DBMS). Though these two paradigms make substantially different design choices in some key areas, the paper argued that almost any parallel processing task can be written in either approach. Thus, the paper conducted experiments on a set of different tasks to explore the distinct usage of each approach. The results implied some key trade-offs, and the paper then identified the underlying causes and key differences that should be considered while designing future parallel computing architectures. The first contribution was to identify the different design choices (DBMS vs. MR). These are Schema Support, Indexing (hash or B-tree vs. User-defined), Programming Model (relational vs. Codasyl), Data Distribution (Automatic vs. Manual mechanism), Execution Strategy (Push vs. Pull + Materialize intermediate result), Flexibility (Insufficiently expressive vs. More general language), Fault Tolerance (Complete vs.
Partial restart of a failed query). The second contribution was the design and execution of five benchmarks on Hadoop, DBMS-X, and Vertica. The paper found that MR took less time to load data but usually spent much more time performing tasks. Moreover, MR was slower to start up and to reach full rate on every node. One other significant drawback of MR was the ineffectiveness of data compression. In a parallel DBMS, compression can lead to large space savings (a factor of 6-10) and thus much faster execution on I/O-bound tasks. However, the same techniques did not apply to MR; worse, some compression actually resulted in slower execution. Comments/questions/critiques: The benchmark execution was somewhat misleading due to the impractical assumptions of always-available nodes, correct software operation, and homogeneous nodes. In the real world, failure is the norm. Moreover, as mentioned throughout the class, large corporations want to use commercial off-the-shelf computers to build their infrastructure, so the system might be relatively flaky and heterogeneous. Since MR handles failures much more efficiently than a parallel DBMS, MR's performance might not be so bad in practical applications. I was quite surprised that the parallel DBMS performed much better. The only disadvantages mentioned in the paper are that such a system needs time to install and tune and that its language is (a little) harder to learn, but these should not be much of a problem for large corporations. Then why do Facebook, Google, Yahoo!, and Microsoft all embrace MR? Is it because the paper neglects the failure part? The cost-performance trade-off brought up in the paper is quite interesting. MR has poorer performance, so the number of nodes has to be larger. But as the number of nodes increases, the failure rate and thus the replacement cost might increase as well; on the other hand, a DBMS might need better infrastructure and higher maintenance costs.
Moreover, as technology grows so rapidly, what if the infrastructure becomes outdated easily? Therefore, I am wondering whether there is a rule of thumb to determine which approach to use. On death, taxes and the convergence of peer-to-peer and grid computing, I. Foster et al, IPTPS 2003 The paper addressed two orthogonal approaches to distributed computational systems, P2P and Grid computing. The paper argued that in spite of very different initial focuses, these two have essentially the same final goal: a highly scalable and robust autonomous system that supports the coordinated usage of distributed resources. Therefore, researchers should consider the merits of both systems. The paper's main contribution was to compare P2P and Grid computing along many aspects and to list some possible directions to consider for each approach. Some key differences were: Scale (large vs. medium), Infrastructure (yes vs. no), Trust (some vs. barely any), Tasks/services (nontrivial and more general vs. simple and specialized), Applications (data intensive vs. much less so). Some future directions identified in the paper were persistent and multipurpose infrastructure and standardized service definitions for P2P systems, and scalability, reliability, and self-configuration for Grid computing. Comments/questions/critiques: It seems to me that after almost eight years, P2P and Grid computing have become two totally different things. Though cloud computing (something like a next generation of grid computing) adopts many suggestions in the paper, such as utility computing, scalability, and fault tolerance, P2P and Grid have diverged from each other. In particular, P2P systems seem to focus on the other end of the spectrum. First, P2P systems still lack persistent infrastructure and standard service descriptions; at least, I did not notice that any widely used P2P software, such as PPLive, eMule, and BitTorrent, has these two components. Second, P2P systems' functionality is still very specialized.
Most of them do either content streaming or content sharing. Data analysis, computing, or storage is still rare in pure P2P systems. One reason for the discrepancy might be the business model: how can one generate profit from supplying a P2P system, other than through advertising? Without such an incentive, I think it is hard for any corporation to push standardized services or provide a workable infrastructure, and such a goal is hard to achieve by academia alone. From: mark overholt Sent: Thursday, April 14, 2011 12:35 AM To: Gupta, Indranil Subject: 525 review 04/14 Mark Overholt CS525 Review 04/14/2011 A comparison of approaches to large-scale data analysis Summary: This paper compares the performance and ease of use of the MapReduce framework and traditional RDBMSes for data analysis tasks, benchmarking Hadoop's MapReduce implementation against two RDBMSes -- Vertica, as well as another, a "parallel SQL database from a major relational database vendor" (Oracle?). They show that on their (limited) benchmarks both of the tested RDBMS implementations significantly outperform Hadoop on the same hardware when running relatively simple jobs, although with more complex jobs manual performance tuning by the user affects how well the job runs on any of the three systems (indeed, on their most complex, non-standard job, Hadoop actually outperforms one of the RDBMSes). They note, however, that Hadoop tends to always be better in terms of ease of use, deployment, and optimization. Discussion: Pros: The paper gave a very thorough experimental treatment of three different systems. They were each tested on the same machines, and great care was taken to maintain a standard, consistent benchmark for each of the three systems. The paper breaks down each component of MR and parallel databases and does a cross-comparison between the similar components. The comparison is not only useful to understand, but the simple act of associating the components of MR and parallel databases is also very useful.
Both the performance aspects and software aspects were compared in this paper. Their results showed that the setup for Hadoop is much easier than the setup for the parallel databases; DBMS-X required additional support phone calls in order to get it set up. Cons: The tests are done for only 1, 25, 50, and 100 nodes. It's not exactly clear why only these numbers of nodes were tested; the trends might be clearer if they had also run tests with 15, 20, 75, 85, and 90 nodes. On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing Summary: Cloud computing, without any doubt, is the current hype in the distributed systems area. However, only a few years ago, that hype was around grid computing and P2P. The objectives of grid computing and P2P are pretty much the same (the pooling and coordinated use of large sets of distributed resources). Keeping that in mind, in the paper titled "On Death, Taxes, and the Convergence of P2P and Grid Computing", the authors tried to find out the major differences between these two approaches. The basis of the comparison was target communities, resources, scale, applications, and technologies. Here are some major points that are essential to differentiate P2P and grid computing. Grid computing covers a significant number of complex applications and computation models, while P2P is mainly applicable to a few applications such as file sharing, content distribution, and so on. So, if we consider applicability, grid is surely the winner. On the contrary, grid only scales to tens of institutions and thousands of users, while P2P easily has millions of users; so, from a scalability perspective, P2P defeats Grid. Grid follows standard protocols and maintains a robust infrastructure, while P2P lacks both; so, in that case, Grid is again the winner. On the other hand, from a coolness-factor point of view, P2P is undoubtedly the winner by a high margin. So it is hard to say which is better, since it is largely context specific.
Considering the above comparisons, at the end, the authors came to some interesting conclusions: (1) both grid and P2P address the same problem; (2) both take the same general approach to address the problem; (3) both have made great technical advances, each with a crucial limitation (grid provides infrastructure but doesn't address failure, while P2P does just the opposite); and (4) the complementary nature of the strengths and weaknesses of these two approaches suggests that both communities are likely to grow closer in the future. Discussion: Pros: This paper nicely shows the similarities and dissimilarities between P2P and Grid Computing. Adapting mechanisms of P2P for grid computing seems promising for scalability and self-stabilization after failures, since grid computing was not designed for Internet scale. Cons: Is cloud computing the friend or foe of grid computing and P2P? Which concepts of grid computing can we apply in P2P? Which concepts of P2P can we apply in grid computing? Which concepts of grid and P2P can we apply in cloud computing? From: nicholas.nj.jordan SpamElide on behalf of Nicholas Jordan Sent: Wednesday, April 13, 2011 11:18 PM To: Gupta, Indranil Subject: 525 review 4/14 Nicholas Jordan njordan 4/14 On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing by: Ian Foster Adriana Iamnitchi The last paragraph best summarizes the whole paper: "This analysis suggests to us that the Grid and P2P communities have more in common than is perhaps generally recognized and that a broader recognition of key commonalities will tend to accelerate progress in both disciplines—which is why we wrote this article." They basically said that Grid computing and P2P networks are the same because both are concerned with resource sharing among all their participating members. I feel that this is a really big generalization.
For the simple fact that, in a grid environment, you most likely have dedicated, specialized computers, which gives you some lower bounds on computation and availability. The grid hardware environment is more stable and predictable than a P2P environment. In a P2P system, the hardware consists of nodes' personal commodity computers that are constantly leaving and joining the network. This unpredictability, together with anonymous users, leads to a different design of protocols to service the users. The key difference between grid computing and P2P is that in a grid the hardware is static, while in P2P the hardware is constantly changing in size and location. A Comparison of Approaches to Large-Scale Data Analysis by: a bunch of people In 2009, there is a great buzz around MapReduce as a great architecture for cluster computing. However, these researchers challenge MapReduce and basically say: so what? Their experiments on three common MapReduce tasks, such as the famous grep task, plus two other, more complex aggregate tasks, show that MapReduce's performance is significantly slower than that of parallel database management systems (DBMSs), which have been around for 20 years. They tested the tasks using only 100 computers, although MapReduce can run on thousands of nodes. The reasoning for this is that DBMSs in commercial use, such as Netflix's, only use around 74 machines while working with data sets on the scale of petabytes.
MR Pros:
* Loads data fast compared to DBMS transformations on data (optimizations that utilize in-memory storage)
* A failure of one node does not mean a whole-task restart
* Easy setup and programming
MR Cons:
* Start-up time for small tasks (< 300s) contributes a lot to computation time
* MR is a pull architecture (extra control messages) rather than the push architecture of DBMSs
* Data compression techniques slowed down execution
Additional comments: The researchers pointed out that MR is far from maturity and can learn a lot from DBMSs and the optimizations they use to lower execution time, such as pushing data to the necessary nodes, avoiding massive numbers of control messages over the network, and useful data compression that actually lowers execution time. It seems they hinted that MR is a brute-force approach to cluster computing, because a node has to query all nodes to find data. Their take-home message is that SQL is still a powerful tool for computation over large data sets. SQL could benefit from a high-level language interface like the one MR has. Both architectures can learn something from each other. -- Thanks, Nick Jordan From: wzhou10 SpamElide Sent: Wednesday, April 13, 2011 10:46 PM To: Gupta, Indranil Subject: CS525 Review 04/14 CS525 Review 04/14 Wenxuan Zhou Review of “On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing” Core idea: In this paper, the authors compare Grid computing and peer-to-peer (P2P) systems, two approaches to distributed computing. They observe some common properties between these two approaches: 1. both are concerned with the same general problem, the organization of resource sharing within virtual communities; 2. both take the same general approach to solving this problem, creating overlay structures that coexist with, but need not correspond in structure to, underlying organizational structures.
But they have different limitations: Grid computing handles infrastructure but not failure, while P2P does the opposite. Finally, they argue that the interests of the two communities are likely to converge. Pros: This paper uses an interesting metaphor, death and taxes, to analyze infrastructure and failure issues in distributed computing. It also provides a thorough analysis of the similarities and dissimilarities between Grid computing and P2P. It concludes with an attractive and novel point that the two technologies are converging. The merger of these two approaches seems able to handle both failure and infrastructure effectively. Cons: There are no concrete ideas in the paper yet on how to make P2P and Grid computing complementary to each other. The merging work seems quite challenging to me. For instance, making P2P users cooperate the way Grid computing does needs a careful design. Review of “A comparison of approaches to large-scale data analysis” Core idea: This paper compares and contrasts MapReduce (MR) and parallel SQL database management systems (DBMSs) in terms of both performance and development complexity. The authors defined a benchmark consisting of a collection of tasks, and tested one MR system and two parallel DBMSs on it. Their results showed a tradeoff between the two types of systems. Although parallel DBMSs take longer to set up (load data and tune execution), their performance is much better than that of MR. Pros: 1. The authors conducted a very thorough set of experiments. 2. They broke down the components of MR and parallel DBMSs for comparison. In this way, they are able to compare similar components in the two systems. 3. They observed that MR doesn’t have a good indexing scheme. Cons: 1. The paper claims that clusters of more than 100 nodes are not useful, which contradicts the cluster sizes of big companies like Google and Yahoo!. 2. The comparison is unfair to some extent. MR is a new thing, while DBMSs have been developing for decades.
They didn’t consider cost: MR only requires commodity servers, while a DBMS requires powerful servers. Also, MR and DBMSs are designed for different types of tasks. MR can be used for handling computation-intensive tasks, which a DBMS might not be good at. So the entire comparison is biased against MR. Best, Wenxuan From: Ankit Singla Sent: Wednesday, April 13, 2011 8:07 PM To: Gupta, Indranil Subject: 525 review 04/14 1. A Comparison of Approaches to Large-Scale Data Analysis ----------------------------------------------------------------------------------------- Summary: The paper claims that map-reduce isn't fundamentally different from the parallel database management systems of the 80s in many significant ways. It compares the two approaches along several parameters: performance, development complexity, etc. The overall claim is that while the DBMS took longer to initialize with the data, it offered better throughput in the end. They do note the ease of setting up Hadoop and its extensibility as advantages over the DBMS. The paper definitely does not come across as an objective comparison and has a pro-DBMS slant. Comments: I believe there's a series of papers on this argument, with points and counter-points made by both camps (map-reduce is fundamentally different, new, and nice; versus same old wine, doesn't even work as well). It seems the systems make at least one fundamentally different choice: DBMSs use well-organized data while MR can use very unorganized data. This is in line with the observation on performance: the DBMS takes time to set up and optimize its storage, but in the end might give higher throughput. Primarily, I feel that map-reduce provides the ability to perform a large variety of tasks on the same data organization with reasonable performance, while the DBMS can probably do many of these things better IF the data were pre-processed into a schema custom to each task.
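To make that tradeoff concrete: a per-key aggregation of the kind in the paper's benchmark is one declarative line in SQL, but a hand-written map/reduce pair in MR. A minimal single-process sketch follows; the table and column names are hypothetical:

```python
from collections import defaultdict

# In a parallel DBMS, the same aggregation is a single statement the query
# planner optimizes over a pre-loaded schema (hypothetical names):
#   SELECT source_ip, SUM(ad_revenue) FROM user_visits GROUP BY source_ip;
# In MR, the programmer spells out both phases by hand:

def map_phase(record):
    # Each record is a (source_ip, ad_revenue) pair; emit it as (key, value).
    source_ip, ad_revenue = record
    yield source_ip, ad_revenue

def reduce_phase(key, values):
    # Sum the revenue values grouped under one source_ip.
    yield key, sum(values)

def run(records):
    # Driver: group map output by key, then reduce each group.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(pair for key, values in groups.items()
                for pair in reduce_phase(key, values))
```

The SQL version presumes the load-and-tune step the paper charges against the DBMSs; the MR version runs over raw records immediately, which is exactly the flexibility-versus-throughput trade the reviewers describe.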
The industry probably makes the former choice because of how quickly map-reduce adapts to a change in algorithms, the addition of new tasks on the data, etc. It might also just be the sheer expense of deploying a licensed commercial database system versus an open-source implementation. Is the "shared-nothing" setting a good evaluation, given that many clusters today are moving to shared storage via network-attached storage? With cloud-on-a-chip or SeaMicro-style architectures, this will become more and more true. 2. Convergence of Peer-to-Peer and Grid Computing -------------------------------------------------------------------------- Summary: This paper compares grids to P2P systems. It points out that many of the differences in these systems stem from the target communities. For instance, grids, being developed by the research or professional community, rely on dedicated resources at several sites, but do not impose much control on these participants. They do, however, rely on these participants to provide resources in good faith and do not consider 'incentive' an important issue outside of the reciprocal use of resources. On the other hand, P2P communities need to manage incentives for users to stay and contribute to the system. Grid resources are usually clusters which are fairly well connected. In contrast, P2P resources are widely separated with great diversity of connectivity and performance. The usage is also different, with grid applications being very data intensive. The number of participants is wildly different: only a few tens for the grid versus millions for P2P. Comments: I wonder how certain things have changed with time. For instance, is it still true that work machines are 30% faster? Some of the differences have become more significant with time; for instance, trust relationships and the issues of censorship and network use have come into sharp focus for P2P systems, while these are non-issues in the grid scenario.
I do believe that they made a fair case for combining the best features of grids and P2P, but that just seems to produce a fairly general list of system desirables! Ankit -------------------------------------------------------------------- Ankit Singla Graduate Student, Computer Science University of Illinois at Urbana-Champaign (UIUC) http://www.cs.illinois.edu/homes/singla2/