From: w SpamElide on behalf of Will Dietz [wdietz2 SpamElide] Sent: Tuesday, February 15, 2011 12:31 PM To: Gupta, Indranil Subject: 525 review 02/15

Will Dietz 2-15-2011

The papers I read were Pig Latin, and the google "Large-scale incremental processing using distributed transactions and notifications" paper.

Pig Latin presents the system "Pig", which allows for processing large data sets with an 'ideal' language interface they call "Pig Latin". The appeal of Pig Latin is that it allows for some high-level semantics like SQL, but with the easy-to-understand-and-write procedural nature that many programmers expect. While Pig can use other backends, the current system is based on Hadoop, a Map-Reduce implementation. In essence, Pig Latin attempts to make using map-reduce easier, both through transparency in the code fragments involved and in terms of the way it uses types (in particular, nested types). This makes it appealing to developers, resulting in quick development times, and when coupled with their debugging environment it allows for quick iteration of powerful, highly scalable systems that can process very large data sets (note that this is used at Yahoo!). To discuss a few of the points I flew past in a tad more detail: they support nested data types -- so instead of having related tables that you have to join via rows (although you can do that too), you can have nested bags, allowing for a more natural flow and resulting program. One thing they take pride in is that their language explicitly only contains language constructs that are inherently parallelizable, and they simultaneously allow UDFs to support additional functionality (as well as the kind of application-specific logic that you don't want in your programming language anyway). So Pig Latin itself is a rather simple language with a handful of operators, which makes it extremely simple to understand, especially when combined with its imperative nature. Things I missed in this paper -- there's no evaluation section? I suppose since they claim it's used in practice at Yahoo! and whatnot it "works", but I expected to see experiments for things like the overhead of using their system vs straight-up map-reduce. Additionally they talk up this excellent debugger environment ("Pig Pen"), but claim the details of how it does what it does (automagically generating a "sandbox" data set for your program, and having it evolve with the changes to your program) are beyond the scope of the paper. Maybe they were hoping for another paper out of it, but it was a bit disappointing to not see how they managed that. So in the end they developed a useful system -- take advantage of the existing Map-Reduce systems, but hide some of their complexity and limitations behind an easy-to-use system suitable for rapid development but also not unsuited for long-term deployment. The big test to me of the quality of a system is the question "would you use this system [if you were doing something that remotely made sense to use such a thing]?", and this paper convinced me "yes", I would give it a go, so well done on that front.
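[To make the "handful of parallelizable operators" point concrete, here is a rough Python sketch, not Pig's actual code generation, of how a filter/group/aggregate dataflow of the kind the paper uses decomposes into one map stage and one reduce stage. The record layout and field names ("url", "category", "pagerank") are made up for the example.]

from collections import defaultdict

def map_stage(records):
    """FILTER by pagerank, then emit (group key, record) pairs."""
    for rec in records:
        if rec["pagerank"] > 0.2:              # FILTER urls BY pagerank > 0.2
            yield rec["category"], rec          # GROUP ... BY category

def reduce_stage(grouped):
    """FOREACH group GENERATE category, AVG(pagerank)."""
    for category, recs in grouped.items():
        yield category, sum(r["pagerank"] for r in recs) / len(recs)

def run(records):
    """Stand-in for the shuffle between the two stages."""
    groups = defaultdict(list)
    for key, rec in map_stage(records):
        groups[key].append(rec)
    return dict(reduce_stage(groups))

if __name__ == "__main__":
    data = [{"url": "a", "category": "news", "pagerank": 0.5},
            {"url": "b", "category": "news", "pagerank": 0.1},
            {"url": "c", "category": "sports", "pagerank": 0.9}]
    print(run(data))   # {'news': 0.5, 'sports': 0.9}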
The next paper I read was the google paper. My biggest takeaway here was "Here's the kind of crazy engineering issues we have at google due to our intense data sets and ambitious goals, and how we solved those issues". The paper seemed on a number of occasions to dive deep into miscellaneous problems they had, and while they explained them in detail it seemed more like a flurry of solutions (this is how we handle this one locking issue, this is how we handle this other potential locking issue, failure detection, etc.). That said, they did go into a lot of detail and had some clever solutions. Anyway, the big idea here is that they have way too much data and need to compromise on some things (some latencies, guaranteed system consistency on locks, etc.) in order to deal with the data in the way they want. MANY of their design decisions are 100% tailored towards their needs and accordingly probably less useful as a whole elsewhere (they have a paragraph describing exactly this). The big things they settled on were transaction semantics for the many little updaters, so that you could have pieces updating their own little pieces (or shared pieces) and maintain the invariants they were interested in. They then took the idea of DB triggers and reconstructed it into their notifications system, which allows for better batch processing and better scalability while dropping things like the atomicity of traditional triggers. Their evaluation was much more complete and I was pleased with that, and they had a number of examples that helped illustrate things. The paper was technically detailed, but I kind of feel like there's a beauty to being able to simplify the ideas, and they missed that boat. I understand some things are just inherently complicated, but it seemed to me like they didn't do a particularly good job focusing on the intuition and big picture and instead just dove into the engineering details. As for pros/cons, this really just seems to be the following: "Are you google?" Yes -> This is good for you, No -> You probably want something else. ~Will

From: Simon Krueger [skruege2 SpamElide] Sent: Tuesday, February 15, 2011 12:29 PM To: Gupta, Indranil Subject: 525 review 02/15

Fixed email subject.

---------- Forwarded message ---------- From: Simon Krueger Date: Tue, Feb 15, 2011 at 11:37 AM Subject: cs525 review 02/15 To: "Gupta, Indranil"

Pig latin: a not-so-foreign language for data processing, C. Olston et al, SIGMOD 2008

The core idea of this paper is to construct a declarative, SQL-like language for map-reduce jobs called Pig Latin. The idea is that programmers prefer procedural map-reduce programming, but this type of programming can be tedious to write and combine, especially when there are chains of map-reduce jobs. To combat this they took 'the spirit of SQL' and the map-reduce paradigm and rolled them into a declarative-style language. So now programmers can write their map-reduce jobs in Pig Latin rather than writing the individual map and reduce tasks and chaining them together. The advantage of the approach is that it takes less time to write a Pig Latin statement than a specifically tailored map-reduce job. This is because map-reduce jobs share many common operations, such as foreach, filter, cogroup, group, flatten, and join, which are all built-in operations in the language. Additionally, Pig Latin allows users to supply their own User Defined Functions (UDFs), which define specific operations on the data and can be plugged in anywhere in a Pig Latin query. This means that Pig Latin can be easily extended to fit any programmer's needs.
Additionally, Pig (Yahoo!'s implementation of Pig Latin running on top of Hadoop) allows users to use a tool Yahoo! developed called Pig Pen, which allows users to debug their Pig Latin queries with generated test sandbox data sets. Finally, the execution engine of Pig may be able to optimize the query and perform better than traditional map-reduce, or arguably, it may be easier to find optimizations for the map-reduce operations in a high-level query language than in procedural code. Also, Pig Latin provides the idea of nested data structures, an idea that is not present in relational query languages like SQL. While they do provide UDFs, I would argue that there are still certain things you couldn't do as efficiently as inside a normal map-reduce task. Additionally, as they mentioned in the paper, many programmers prefer procedural-style programming over declarative-style programming like SQL, so I am not certain if this will be the most popular choice amongst programmers. I can't think of any other ways that this paper could be further developed. I wonder if this over time will become as popular in the map-reduce world as SQL is in the relational database world. Why didn't they just try to use a more natural functional-style programming language that could be converted into distributed map-reduce running on top of Hadoop?

Large-scale Incremental Processing Using Distributed Transactions and Notifications, D. Peng et al, OSDI 2010

The core idea of this paper is to have a way to incrementally update Google's web index. Traditionally, Google's web index has been updated using batch-style methods like map-reduce, but if it could be updated incrementally then the latency of document updates would be reduced. To this end this paper talks about Percolator, an incremental updating system that runs on top of Google's Bigtable and GFS chunkservers. It performs well for small updates on a large repository of data. It is a simpler system with fewer parts than a large chain of map-reduce tasks, and it allows the system to scale as more machines are added. The workload of the system is more balanced over time compared to map-reduce, which has bursts in workload. Map-reduce performs better for large updates on a large repository of data because there is less random access. Users have to think about concurrency more than with map-reduce. I can't think of any other ways that this paper could be further developed. As they mentioned, is Percolator really just a wrapper around Bigtable?

Simon Krueger
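[A toy Python contrast of the incremental-versus-batch point made above. The data structures and the per-document processing are made up; this is not Percolator's API. The batch re-index does work proportional to the repository, while the incremental update does work proportional to the change.]

def build_index_entry(doc):
    # Placeholder for whatever per-document processing the indexer does.
    return sorted(set(doc.split()))

def batch_reindex(repository):
    """Work proportional to the size of the whole repository."""
    return {doc_id: build_index_entry(doc) for doc_id, doc in repository.items()}

def incremental_update(index, changed_docs):
    """Work proportional to the size of the update only."""
    for doc_id, doc in changed_docs.items():
        index[doc_id] = build_index_entry(doc)
    return index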
From: kevin larson [kevinlarson1 SpamElide] Sent: Tuesday, February 15, 2011 12:20 PM To: Gupta, Indranil Subject: 525 Review 02/15

The primary objective of Pig Latin (and the authors' implementation, Pig) is to provide a higher-level abstraction for use with Hadoop (map-reduce). The map and reduce functions, while extremely powerful in their ability to distribute computing, are not always the most logical abstraction for applying a set of transforms to a set of data. Pig Latin introduces a nested data model, as well as a variety of processing functions that allow for a wide variety of operations. These functions provide a more flexible and usable abstraction to the programmer, and are simply compiled into a series of map and reduce operations. Additionally, Pig Pen, a Pig debugging environment, is also presented. Pig Pen allows programmers to verify that their Pig code is working correctly. Map-reduce always seemed to be an obvious choice where it could be easily applied. For processing that could be easily mapped to the map-reduce paradigm, programmers found significant improvement. Pig Latin makes it easier to apply map-reduce to many additional processes and data sets. The simplicity of Pig statements is the strongest argument for its adoption. The ability to make SQL-esque queries which are decomposed into map-reduce functions is very impressive. The weakest point of the paper is the lack of evaluation. The overhead of Pig was mentioned, but no quantitative analysis was given. Also, no information was given on how well Pig was decomposed into map-reduce. The question of whether the generated map-reduce jobs are comparable in performance to (painfully) hand-crafted map-reduce jobs was never addressed.

Percolator is Google's newest incremental indexing system. Google's indexing systems have always had to cope with enormous data sets. They need to be designed to handle tens of petabytes of data across thousands of machines. Prior to Percolator, these solutions were batch based, using map-reduce to increase throughput and perform larger quantities of index processing. Percolator, on the other hand, had different goals. Although it processes approximately the same number of documents on a daily basis, it was instead designed to reduce latency, and it manages to significantly reduce the age of the documents presented in Google search results. The paper describes how the implementation of Percolator had to deal with a variety of problems. Although many of these problems are specific to Google search indexing, the complications and the processes to overcome them were quite interesting. Unlike the Pig Latin paper, Percolator had a quite detailed and intricate evaluation. The authors documented the movement away from map-reduce and thoroughly explained the reasons that Percolator was able to perform as it did. They compared the performance of Percolator to that of BigTable, and showed how well Percolator matched up in reads. References were made to various other systems and implementations, of which only BigTable seemed adequately detailed. Acronyms (such as ACID and DBMS) were never documented or cited. This paper was written after a short discussion with Will Dietz.

From: mdford2 SpamElide Sent: Tuesday, February 15, 2011 12:12 PM To: Gupta, Indranil Subject: 525 review 02/15

Cloud Programming

Large-scale Incremental Processing Using Distributed Transactions and Notifications

Based on the need for small, incremental updates in a world of batch processing and huge datasets, Daniel Peng and Frank Dabek developed a system of distributed transactions and notifications called Percolator. Previous incarnations of Google's PageRank algorithm required that the entire repository be processed in order to take an update into account. This results in latency that is proportional to the size of the repository as opposed to the size of the update. Percolator only provides a solution to large, incremental processing with strong consistency requirements. These restrictions narrow the problem scope significantly; within this window the results are fantastic. Traditional MapReduce achieves better throughput on non-ad-hoc processing, while a DBMS can be used for smaller-scale problems and reduces the overhead of distributed processing. Overall, the work is focused on solving an engineering problem. The authors make interesting use of Bigtable in their implementation, but other sections of the paper, for instance those on timestamps, failure handling and notifications, are lacking in novel contributions.
The value of this work is in accurately re-describing the problem space to create space for a new engineering solution.

Pig Latin: A Not-So-Foreign Language for Data Processing

Pig Latin fills a usability need for developers coding in large-scale, distributed environments. MapReduce and its open-source counterpart Hadoop both put stringent limitations on usage scenarios for efficient implementation. They require a "one-input, two-stage data flow". Pig Latin merges two programming styles: high-level querying and low-level procedural programming. Pig Latin has an extensive data model which includes atoms, tuples, and bags. Program structure typically follows the form of processing on a per-tuple basis, then filtering or grouping data. The language resembles SQL and exposes only a few functions while hiding the inner workings of the implementation. A program written in Pig Latin is eventually compiled into various MapReduce jobs. The authors discuss possible techniques for efficient distribution, such as automatic data replication. Pig Latin includes a debug environment that runs code locally on small input data sets, and there is support for data set generation. While Pig Latin does provide a simplified programming paradigm and therefore potentially reduced development time, the authors do not discuss the runtime implications of using Pig Latin as opposed to writing individual MapReduce programs. In a space that, by definition, works with extremely large datasets, it is impractical to disregard performance benchmarks of various implementations.
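[A quick illustration of the atoms/tuples/bags data model mentioned above, using plain Python stand-ins with made-up values rather than Pig's actual types: nesting lets related data stay together instead of being spread across joined tables.]

atom = "alice"                                  # an atom: a simple value
tup = ("alice", 7)                              # a tuple: an ordered set of fields
bag = [("lakers", 1), ("iPod", 3)]              # a bag: a collection of tuples

# A nested record: each tuple holds an atom and a bag of (query, count) tuples.
nested = [
    ("alice", [("lakers", 1), ("iPod", 3)]),
    ("bob",   [("lakers", 2)]),
]

# Per-tuple processing followed by aggregation, with no join needed:
totals = {user: sum(count for _, count in queries) for user, queries in nested}
print(totals)   # {'alice': 4, 'bob': 2}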
From: Shen LI [geminialex007 SpamElide] Sent: Tuesday, February 15, 2011 11:17 AM To: Gupta, Indranil Subject: 525 review 02/15

Name: Shen Li

Large-Scale Incremental Processing Using Distributed Transactions and Notifications

The authors state that there are some application scenarios that the MapReduce framework cannot handle efficiently, such as using the latest crawled data to update page-ranking information. Even though the new information is only a small fraction of the whole data, MapReduce has to recompute the ranking from the very beginning. Thus, they propose Percolator, which can incrementally process updates to a very large data set. The Percolator system consists of Percolator workers, Bigtable tablet servers, and GFS chunkservers. Data is stored in Bigtable. When an update arrives in Bigtable, the Percolator worker triggers cascading actions by invoking the corresponding observers. However, Percolator aims only at incremental processing, and cannot take the place of MapReduce.

Pro: The advantage of Percolator is obvious. It takes actions based only on the updated data, and does not require recomputation over the whole data set. It also provides mechanisms to handle failed clients and clean up their dirty locks and data.

Con: 1. Bigtable only provides lookup and update operations on each row. It can be very inefficient to update column-based information. 2. Percolator employs a threshold on the age of a lock to determine whether a worker is alive but stuck. This policy cannot guarantee correctness.
If the duration of transactions varies too much, this threshold will be either too short or too long, which will lead to many false positives or inefficient clean-ups, respectively. 3. The batching strategy they mention in section 2.3 is no cure for the overloaded timestamp oracle. It can only reduce the delay to some degree.

Questions: In section 2.2, they mention two kinds of conflicts. I do not quite understand the write-write one. If one write transaction does not see the lock of another one, how does it know there is another write transaction on the same cell?

Pig Latin: a Not-So-Foreign Language for Data Processing

This paper introduces the Pig processing environment and its associated programming language, named Pig Latin. Pig Latin is a procedural language with SQL-like declarative functions, and the grammar of Pig Latin makes it very easy to translate into a sequence of Map-Reduce jobs. With Pig Latin, people can, to some degree, get rid of the annoying process of writing mapper and reducer functions for Map-Reduce jobs. Given that each round of debugging and testing of Map-Reduce jobs can be time consuming, they also introduce a useful debugging tool which gives very fast feedback to users by using a sampled subset of the whole input data.

Pros: 1. Pig Latin eases the development cycle of Map-Reduce applications, which can save developers a lot of time. 2. I think the debugging tool provided in this paper is extremely useful. Developers do not have to generate data to test their programs, nor do they need to wait such a long time for results on the original data.

Cons: 1. The authors sell User Defined Functions (UDFs) as one of the flexibilities achieved by Pig Latin. However, I think the main contribution of Pig Latin is that it saves the energy spent on writing and organizing the mapper and reducer code for Map-Reduce jobs. So does the UDF actually go back to their starting point to some degree? 2. Pig Latin cannot support analyses with write operations inside. 3. When Pig Latin compiles a request into several MapReduce jobs, it has to materialize the output data of every step even though this data might only be used once, by the next job.

From: Anupam Das [anupam009 SpamElide] Sent: Tuesday, February 15, 2011 11:06 AM To: Gupta, Indranil Subject: 525 review 02/15

1. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

DryadLINQ is basically composed of two parts, Dryad and LINQ, where Dryad is a distributed execution framework used for running parallel data computations and LINQ (Language INtegrated Query) is a programming model which integrates database queries into an object-oriented programming model. So, DryadLINQ provides a distributed parallel computing platform using a high-level, user-friendly programming language. The basic idea of DryadLINQ is to provide a programming model which hides all the underlying management and fault-tolerance-related complexities from users. Users view the system as a single computing unit and write sequential programs using their familiar programming language and development tools, while DryadLINQ automatically generates the distributed execution plan. Basically, DryadLINQ takes the query from the LINQ expression and partitions it into sub-expressions which are then computed on different machines. For this purpose, DryadLINQ converts LINQ expressions into an execution plan graph (EPG), where each vertex represents a program (a sub-expression) and edges represent data channels (such as TCP, pipes, or files). DryadLINQ automatically generates code for the vertices during the creation of the EPG.
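[A loose Python analogy for this, with invented names rather than DryadLINQ's actual API: a LINQ-style query can be represented as an object that records operators instead of executing them, yielding a small plan whose vertices could each be shipped to a different machine.]

class Query:
    def __init__(self, source, plan=None):
        self.source = source
        self.plan = plan or []          # ordered list of (operator, function)

    def where(self, pred):
        return Query(self.source, self.plan + [("where", pred)])

    def select(self, proj):
        return Query(self.source, self.plan + [("select", proj)])

    def explain(self):
        """The 'execution plan graph': here just a linear chain of vertices."""
        return " -> ".join(op for op, _ in [("scan", None)] + self.plan)

    def run(self):
        """Local execution; a distributed system would run each vertex remotely."""
        data = iter(self.source)
        for op, fn in self.plan:
            data = filter(fn, data) if op == "where" else map(fn, data)
        return list(data)

q = Query(range(10)).where(lambda x: x % 2 == 0).select(lambda x: x * x)
print(q.explain())   # scan -> where -> select
print(q.run())       # [0, 4, 16, 36, 64]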
DryadLINQ also provides a map-reduce programming interface. Another advantage of DryadLINQ is that it can scale up and down from a single host to a large cluster. There are some issues that need to be discussed. The paper does not clearly specify how the LINQ expressions are decomposed into sub-expressions. This could be a potential bottleneck, as these sub-expressions are assigned to vertices which are eventually assigned to different machines, and if the machines fail this could lead to a large amount of re-computation. The mechanisms used for allocating tasks to nodes in the cluster and those regarding fault tolerance are not discussed in the paper. That is, what considerations are taken during task allocation and re-execution on faulty nodes? These questions are important for heterogeneous clusters containing multiple generations of hardware. It is also not clear what level of flexibility in terms of resource utilization or fault tolerance it provides to users. So, what I am trying to say is that though DryadLINQ provides a high level of abstraction to users, in doing so it also prevents them from making any kind of optimization decision that may vary for different requirements.

2. Pig Latin: a not-so-foreign language for data processing

Ad-hoc analysis of large-scale internet data has become extremely important to many internet-based companies, as these analyses often suggest user demand and current trends. For this, programmers prefer to use the procedural style of the map-reduce model over the declarative style of SQL. However, map-reduce is too rigid and as a result it is hard to maintain and reuse. This is where Pig Latin comes in, as it combines the flavor of the high-level declarative style of SQL with a traditional procedural language. Pig Latin basically generates an execution plan/data flow for the underlying Hadoop, which then executes the plan. Pig Latin has some attractive features: 1. It is flexible and supports a fully nested data model, which programmers feel comfortable with and which also conforms to how data is stored on disk. 2. Pig Latin allows user-defined functions (UDFs) to run in parallel alongside system-defined functions. 3. It also has the ability to operate over plain input files without any schema information or data import/export operations. 4. It provides a graphical interactive debugging interface which greatly simplifies the tedious task of debugging. 5. Pig Latin is open source and as a result it can be plugged into existing systems. However, I think there are some issues that need to be addressed. First of all, there is no performance comparison with other programming paradigms. After all, efficiency in most cases is the main concern, so even if Pig Latin does provide a user-friendly approach, we need to look into its performance, which is lacking in this paper. Some form of practical experimentation would have strengthened their claims. Secondly, every query in Pig Latin is eventually converted into a map-reduce job. This could have an adverse effect on performance, as map-reduce jobs are prone to node failures and stragglers. Thirdly, user-defined functions can only be written in Java, which is a limitation of Pig Latin.
Finally, Pig Latin is scan-centric, i.e., it only applies to read-only data, so if data are being updated while a scan is in progress, the scan will not reflect the most recent updates.

-------Anupam Das

From: muntasir.raihan SpamElide on behalf of muntasir raihan rahman [mrahman2 SpamElide] Sent: Tuesday, February 15, 2011 10:56 AM To: Gupta, Indranil Subject: 525 review 02/15

Pig Latin - A Not-So-Foreign Language for Data Processing

Summary: Pig Latin is a programming model that sits between declarative SQL and low-level map-reduce. Pig compiles Pig Latin into physical plans that are executed over Hadoop. By supporting high-level constructs like group, cogroup and filter, Pig Latin hides the low-level details of fault-tolerant distributed programming for map-reduce. Pig Latin programs are compiled into multiple map-reduce iterations. Since it works directly on data files, the overhead of data fetching in traditional database systems is eliminated. Pig Latin also supports UDFs, which can be utilized in all programming constructs. Pig comes with a debugger called Pig Pen to verify the correctness of a Pig Latin program incrementally before executing it on a workload. The system has already been deployed by Yahoo! and is being used effectively in a large number of data-intensive tasks.

Pros: (1) Provides common database operations like join, group, and filter which are difficult to implement in native map-reduce. (2) Open to non-Java programmers. (3) Substantial reduction in code size. (4) Debugging environment. (5) Support for a fully nested data model and user-defined functions. Cons: (1) Slowdown due to code conversion. (2) It's not completely equivalent to SQL; it requires additional effort to convert SQL to Pig. (3) Storage requirements for intermediate data.

Future Works: (1) Programming support for control structures such as loops and conditionals. (2) It would be interesting to investigate the optimization problem of minimizing the number of map-reduce iterations required for a given Pig Latin program. (3) As the paper mentions, it would be interesting to look into "safe optimizations" that are guaranteed to yield performance benefits.

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

Summary: DryadLINQ is a system that combines LINQ and Dryad to facilitate distributed computations that execute efficiently on large clusters. A user .NET application instantiates an object and hands it to DryadLINQ, which converts it into a distributed execution plan. The Dryad job manager creates an execution plan, which is a DAG. Each Dryad vertex executes jobs according to the execution plan. Finally, control is returned to the user as .NET objects. DryadLINQ applies both static and dynamic optimizations, making use of run-time statistics. The paper also provides DryadLINQ programs for TeraSort, PageRank and a few others to demonstrate the feasibility and performance gains of DryadLINQ.

Pros: (1) Debugging support. (2) Strongly typed high-level language. (3) It offers both declarative and imperative operations. (4) The application programmer is shielded from the underlying complexity of scheduling, distribution, and fault tolerance associated with Dryad. (5) Supports automatic optimizations. (6) Interaction with SQL. (7) Support for traditional constructs like function calls and libraries. Cons: (1) DryadLINQ requires storing intermediate data, which increases latency. (2) The centralized job manager introduces a single point of failure.
Future Works: (1) It would be interesting to investigate the trade-off between the performance benefits and the overhead of doing dynamic optimizations in DryadLINQ. (2) A comparative study with other high-level programming models of a similar nature would be beneficial.

- Muntasir.

From: Tony Huang [tonyh1986 SpamElide] Sent: Tuesday, February 15, 2011 2:04 AM To: Gupta, Indranil Subject: 525 review 02/15

Article: DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

Summary: DryadLINQ is a system and language extension that helps programmers write scalable parallel computation code using a traditional programming model. DryadLINQ supports traditional programming constructs. It develops a new programming paradigm, a hybrid of declarative and imperative programming for large-scale data-parallel computing, and it develops a system that can translate queries into optimized execution plans that run on clusters of huge scale. A LINQ job can be expressed either in SQL-like declarative-style code or in normal object-oriented code. Each DryadLINQ job is translated into an execution plan in the shape of a directed acyclic graph, optimized by static heuristics and dynamic data-set size estimations.

Pros: * Achieves optimized and scalable performance on large clusters without over-burdening programmers with manual optimization, increasing productivity. * Provides another level of abstraction to ease programming on large clusters. * The underlying optimizations can be constantly improved, providing opportunities for future performance enhancement without the need to extensively re-write existing code. * Resembles a compiler for large clusters, specially optimized for data-parallel computations. * Opens a new field of future research on developing optimization algorithms and plans for clusters. Cons: * I really don't see any cons to this approach. Suggestion: * More description of how decisions on placing jobs and optimizations are performed, though that may be a series of future work. * Provide a comparison and contrast of the optimizations done for DryadLINQ and for a traditional DBMS.

Article: Large-scale Incremental Processing Using Distributed Transactions and Notifications

Summary: In this paper, the authors introduce an incremental update capability to Google's web indexing while avoiding the penalty of running a large-scale batch-processing job to update the index. To do this, the authors create a transaction-based system that allows incremental updates to be written into existing Bigtable cells. Notifications, similar to triggers in databases, are implemented to allow these incremental updates to happen as a chain of notifications handled by user-registered observers.

Pros: * Introduces an incremental update capability to the map-reduce framework, making it more suitable for general-purpose data processing jobs. Cons: * Speed and performance. As the authors say, the number of RPC calls needed to implement one transaction is high, casting doubt on its performance compared to a normal DBMS. * System complexity and consistency model. Percolator provides a really weak consistency model. It is also lazy in detecting failures and rolling back side effects caused by failures. That may be acceptable for web indexing purposes, but it makes the system of limited use for other, more database-like applications.

Suggestion: The differences and trade-offs between the DBMS and MapReduce models have been a long-standing topic in distributed, large-scale systems.
In this paper, the authors implement some database-like capabilities on top of the Bigtable storage system. It closes the gap between the MapReduce framework and a DBMS in terms of performing incremental updates to an existing index. Under what situations this type of system would be most useful, and under what situations a more traditional DBMS would be, is a good question for exploration.

-- Regards -- Tony

From: Nipun Sehrawat [sehrawa2 SpamElide] Sent: Tuesday, February 15, 2011 1:31 AM To: Gupta, Indranil Subject: 525 review 02/15

Large-scale Incremental Processing Using Distributed Transactions and Notifications

This paper describes Percolator, a system suitable for data processing involving small and incremental updates to a large data-set. Such updates might involve the addition of new data as well as changes to existing data - both can potentially require a distributed-transaction model to maintain a set of invariants over the data-set. Percolator targets incremental updates to a huge data-set distributed across thousands of machines - a scenario which rules out the possibility of using traditional databases. On the other hand, using Map-Reduce for incremental updates requires processing the entire data-set again, resulting in a lot of duplicate work. Percolator provides ACID transactions and observers (similar to database triggers) as its two abstractions. In transaction support, Percolator leverages the relaxed latency requirements (as compared to a traditional DBMS) and takes a lazy approach to failure detection, aiming to improve scalability. As opposed to parallel DBMSs, where locking is integrated as a system component, Percolator needs to explicitly maintain locks due to the lack of a central traffic-interception point, as nodes are allowed to directly modify a Bigtable row. For supporting incremental computations, Percolator supports notifications, which are triggers that call a particular piece of code (called an observer) when a Bigtable row is changed. To avoid repetitive invocations of observers registered for hot rows, consecutive updates are collapsed and the observer might only be called periodically. For scalability, the modified rows are tracked separately so that the associated observers can be called. Thus, in essence, Percolator supports distributed transactions over Bigtable and allows data-processing applications to be split into notification-driven components.

Pros: - Adopts an incremental approach to data processing, thus avoiding duplicate work on unchanged data. - Work done is proportional to the new results added to the data-set, instead of the size of the entire data-set. - Batching, prefetching and caching are used to improve interaction with Bigtable, as scalability is a more important concern than the latency of update operations. - Reduced the average document processing latency in Google's web index by a factor of 100! Cons: - Around 50 separate Bigtable operations are needed to process one document. - As updates become a significant percentage of the total data, the high cost of processing one document might make incremental processing perform worse than batch processing.
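[A minimal single-process Python sketch of the notification-driven pattern described above, with made-up table and column names rather than Percolator's actual API: writes mark rows dirty, and a periodic worker scan collapses repeated updates into a single observer invocation.]

table = {}          # row -> {column: value}
dirty = set()       # rows with pending notifications
observers = []      # functions called as observer(row, table[row])

def write(row, column, value):
    table.setdefault(row, {})[column] = value
    dirty.add(row)                  # several writes collapse into one flag

def register(observer):
    observers.append(observer)

def worker_scan():
    """Periodic scan: notify observers for each dirty row, then clear it."""
    for row in list(dirty):
        dirty.discard(row)
        for obs in observers:
            obs(row, table[row])    # an observer may call write(), re-dirtying rows

def reindex(row, columns):
    print("re-processing", row, columns)

register(reindex)
write("doc1", "contents", "new text")
write("doc1", "language", "en")     # collapsed with the previous write
worker_scan()                        # prints "re-processing doc1 {...}" once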
Pig Latin: A Not-So-Foreign Language for Data Processing

Pig Latin is a language designed for data processing over huge data-sets. The authors argue that though SQL-like querying is useful for data processing, most programmers feel more comfortable with procedural models such as map-reduce. However, map-reduce is strict and the code might be difficult to reuse and maintain. Pig Latin comes with the flavor of both query languages like SQL and procedural languages. It provides a more intuitive programming interface and generates an optimized corresponding map-reduce schedule. Pig Latin focuses on parallelization and provides primitives that can be easily parallelized. User Defined Functions (UDFs) are supported to compensate for the lack of rich primitives and to support arbitrary data processing. Pig Latin works on top of the Pig platform, which can execute a program on top of any execution environment, such as Hadoop. A logical plan is built to represent a program - data processing is deferred until a STORE command (used to write output to a file) is encountered - giving room for optimizations such as in-memory pipelining. The central aim of Pig Latin is to provide a data-processing language that is procedural (not declarative) but high-level (abstracting low-level details such as those exposed in map-reduce).

Pros: - High-level language; programmers don't need to worry about the low-level details of the execution platform. - Can execute on varied execution platforms; parsing and logical plan construction are independent of the platform. - The debugging environment (Pig Pen) supports quick debugging of commands by generating representative side data sets.

From: Tengfei Mu [tengfeimu SpamElide] Sent: Tuesday, February 15, 2011 12:55 AM To: Gupta, Indranil Cc: indy SpamElide Subject: 525 review 02/15

Tengfei Mu (mu3)

1. Pig Latin: Pig Latin is a new language designed by Yahoo! researchers to analyze extremely large data sets. On one hand, writing declarative SQL queries is unnatural and overly restrictive for programmers analyzing the data. On the other hand, when using the map-reduce programming model, even common operations must be coded by hand, and they are difficult to maintain and extend. Based on this situation, Pig Latin aims to fit between the declarative style of SQL and the low-level procedural style of map-reduce. It offers high-level data manipulation primitives such as projection and join, but in a much less declarative style than SQL. Pig is the new data processing environment deployed at Yahoo!, and Pig Latin is its associated language. Pig compiles Pig Latin programs into physical plans that are executed over Hadoop, an open-source implementation of map-reduce. Also, a novel debugging environment, Pig Pen, is presented, which can freeze the program prefix for the user and enable the user to append further commands. This ability makes it easy and fast for users to construct and debug their programs in an incremental fashion. Pros: 1. A higher-level query language that is more expressive than low-level map-reduce; easier analysis and programming. 2. A novel debugging environment, which is very important for programming. Cons: 1. Pig is basically based on map-reduce, so it is not very flexible and will incur overhead. 2. The debugging tool is not suitable for all real-world large data sets.

2. DryadLINQ: DryadLINQ is a novel system that enables a new programming model for large-scale distributed computing. Specifically, the DryadLINQ system automatically and transparently translates the data-parallel parts of a LINQ program into a distributed execution on the Dryad execution platform. Programs can be written and debugged within the .NET development environment. Overall, their goal is to provide a programmer-friendly abstraction that hides the complexity of the underlying distributed system, while achieving high performance.
Pros: 1. The abstraction hides the system complexity and makes programming easier and program execution more efficient. 2. The Dryad system is mature, i.e., fully tested and used, so the system could potentially be used widely. Cons: 1. All of the low-level optimizations are performed automatically and transparently. To some extent, this is not good for certain programmers and applications. 2. DryadLINQ is not open source, which will limit its adoption across the IT community.

From: lewis.tseng.taiwan.uiuc SpamElide on behalf of Lewis Tseng [ltseng3 SpamElide] Sent: Tuesday, February 15, 2011 12:14 AM To: indy SpamElide Subject: 525 review 02/15

CS 525 Review 02/15: Cloud Programming

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, Yuan Yu et al, OSDI 2008

DryadLINQ combines the idea of LINQ (Language Integrated Query), relational queries integrated into C#, with the Dryad system, a parallel data-processing system. As a result, the paper showed that the DryadLINQ system achieves two main goals: automatic optimization and efficient execution on large clusters. Overall, DryadLINQ works as follows: given queries specified in a LINQ-enabled programming language, DryadLINQ generates a DAG (directed acyclic graph) that specifies the distributed data flow, passes this DAG to the Dryad system, and lets the Dryad platform handle the scheduling and fault-tolerance policies. One advantage of DryadLINQ is the language integration, which gives programmers the flexibility of using a familiar language and thus increases productivity. The second advantage is that its Dryad part adopts a deterministic-replay execution model, which makes re-execution easier due to isolation. Though this might degrade performance, the whole system is less complicated and more efficient when the code is relatively robust and contains only infrequent bugs. Third, the usage of a DAG to specify the flow of data seems to be more efficient in some cases than the original map-reduce framework. Finally, the paper ran through a series of convincing, real-world experiments to verify the system's reliability. Moreover, since Dryad has been through testing on thousands of computers, DryadLINQ is believed to be scalable. On the other hand, DryadLINQ does not consider the issue of latency, so some tasks that are very sensitive to latency are not suitable for DryadLINQ. The other thing that could be improved is to say more about dynamic optimization and the generated execution plans. These are two characteristic features of DryadLINQ, but the paper only mentioned that the resulting execution plan is similar to the manual one, without further explanation or any informal proof. I would be more convinced if more material related to these were presented. Finally, the centralized job manager seems to have a much heavier workload than the one in the original map-reduce design; it is unclear whether it would become a bottleneck as the size of the cluster grows.

Pig latin: a not-so-foreign language for data processing, C. Olston et al, SIGMOD 2008 (Yahoo!)

Pig Latin is a new language implemented by Yahoo! Research that is tailored for "ad-hoc, scan-centric" parallel data processing and analysis based on Hadoop. Unlike the map-reduce framework, Pig Latin adopts some unconventional features to support a less rigid data flow and a nested data model, and to provide the flexibility of using UDFs (user-defined functions) in every construct, along with common language primitives that are well suited to algebraic functions.
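[A small illustration of what "algebraic" means here, in generic Python rather than Pig's actual UDF interface: an aggregate such as AVG can be split into an initial step run on each partition of a group, an intermediate combining step, and a final step, which is what lets partial aggregation be pushed into a map-side combiner.]

def avg_initial(values):
    """Run on each partition of a group."""
    return (sum(values), len(values))          # partial (sum, count)

def avg_intermediate(partials):
    """Combine partial results from several partitions."""
    return (sum(s for s, _ in partials), sum(c for _, c in partials))

def avg_final(partial):
    s, c = partial
    return s / c if c else 0.0

# The same answer regardless of how the group was partitioned:
parts = [[1, 2, 3], [4, 5]]
print(avg_final(avg_intermediate([avg_initial(p) for p in parts])))   # 3.0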
One desired feature is its design as a so-called dataflow language. It requires the user to specify a step-by-step sequence of execution, where each step consists of a "single, high-level data transformation." In this approach, the programmer has more freedom in controlling a large set of data, which is preferred over a single-block approach like SQL. The other advantage is that Pig Latin adopts a flexible nested data model and proposes group primitives that are tailored to algebraic functions. A nested data model is more intuitive to think with. Combined with Pig Latin's group primitive, (CO)GROUP, tasks involving the nested data model can still be easily parallelized, and the overall throughput is not affected by much. The third advantage is Pig Latin's focus on UDFs. The freedom of passing non-atomic parameters to UDFs and outputting non-atomic values certainly makes coding easier. One final and important contribution is the proposal of a novel debugging environment, Pig Pen. It automatically generates a sandbox data set with three properties (Realism, Conciseness, and Completeness) that can be used to effectively test whether the primitive functions are used correctly and whether the proper UDFs are called, and to help the programmer understand the schema. Pig Pen makes debugging easier and thus increases the user's productivity.

On the other hand, there are some things that confused me or were not explained clearly in the paper. First, in the related work section, the paper mentioned that Pig is meant for "offline" workloads. What does offline mean here? Does it mean that, unlike DryadLINQ, Pig Latin does not support a stream of input data? Second, the flexibility of using UDFs allows users to write inefficient code due to the lack of parallelized task components. The paper said that users might become aware of this inefficiency by themselves, because these kinds of explicit primitives are not provided. This argument seems a little too optimistic. It would be better if they could develop some sort of task-parallelism check in Pig Pen, which would warn less experienced programmers about the existence of unparallelizable tasks. Finally, the paper mentioned that Pig Latin in principle can be implemented over any platform. Currently, it uses Hadoop as the backend. Since the paper mentioned that the map-reduce model is rigid and limited, it would be interesting to compare the results of Pig Latin over Hadoop and Pig Latin over Dryad, which supports more flexible execution plans.

From: harshitha.menon SpamElide on behalf of Harshitha Menon [gplkrsh2 SpamElide] Sent: Monday, February 14, 2011 11:34 PM To: Gupta, Indranil Subject: 525 review 02/15

Large scale incremental processing using distributed transactions and notifications

Percolator is a system for incrementally processing updates on a large data set. An example of its use is in creating Google's web search index. Updating an index of the web as documents are crawled requires continuously changing the repository of existing documents. Previously this was achieved by running MapReduce over the entire repository. Even if only a small portion changed, this still meant running MapReduce over the entire repository, thus discarding the work done in previous runs. Alternatively, a DBMS could be used to store the index, but existing DBMSs can't handle the sheer volume of data. Percolator provides the user with random access to the repository, which allows documents to be processed individually, avoiding global scans of the repository.
Percolator parallelizes this task, which means it has to ensure ACID-compliant transactions. Since Percolator is an incremental processing system, it needs to provide observers, which are invoked whenever a user-specified column changes. The Percolator system consists of workers, Bigtable tablet servers and GFS chunkservers. All observers are linked into the Percolator worker, and whenever a watched Bigtable column changes, the corresponding observer is invoked.

The transaction protocol is as follows. Percolator stores its locks in a column of the same Bigtable that stores the data. Transactions use a two-phase commit protocol. In the first phase, the cells are locked: the transaction checks for conflicting locks and, if there are none, writes to the lock column in Bigtable. If there are no conflicts, it proceeds to commit, and in the second phase it gets a commit timestamp from the timestamp oracle; then, at each cell, the client releases its lock and replaces it with a write record. To handle notification of observers, Percolator maintains a special "notify" Bigtable column, and whenever a transaction writes, it also sets the corresponding notify cell. The workers periodically scan that column and notify the observers.
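[A highly simplified, single-process Python sketch of the lock-column two-phase commit just described: made-up cell layout, no failure handling or snapshot reads, and not Percolator's actual code. The point is only that locks live next to the data and that the commit timestamp from the oracle is what makes the writes visible.]

import itertools
oracle = itertools.count(1)          # stand-in for the timestamp oracle

cells = {}                           # (row, col) -> {"data": {}, "lock": {}, "write": {}}

def cell(row, col):
    return cells.setdefault((row, col), {"data": {}, "lock": {}, "write": {}})

def commit(writes):                  # writes: {(row, col): value}
    start_ts = next(oracle)
    primary = next(iter(writes))     # one cell acts as the primary lock
    # Phase 1: prewrite - abort on any conflicting lock, else lock each cell.
    for key, value in writes.items():
        c = cell(*key)
        if c["lock"]:
            return False             # conflict: someone else holds a lock
        c["data"][start_ts] = value
        c["lock"][start_ts] = primary
    # Phase 2: commit - get a commit timestamp, replace locks with write records.
    commit_ts = next(oracle)
    for key in writes:
        c = cell(*key)
        del c["lock"][start_ts]
        c["write"][commit_ts] = start_ts   # points readers at the data version
    return True

print(commit({("doc1", "contents"): "new text", ("doc1", "lang"): "en"}))  # True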
Pros: - Since Percolator is an incremental processing system, it doesn't require the entire repository to be scanned to create the index; it can make small updates, which increases the freshness of the index. - Percolator doesn't have a central location for transaction management, which enables it to scale to thousands of machines. - Percolator provides cross-row, cross-table transactions. - Since Percolator is built upon systems like Bigtable and GFS, it is resilient to failures.

Cons: - Percolator cannot replace MapReduce for the initial stages of indexing, or when the new input is a large fraction of the repository. - Percolator takes a lazy approach to cleaning up locks left by failed clients. This potentially delays transaction commits. - Percolator sends many more RPCs per work unit than MapReduce. This is required to ensure the ACID properties of transactions. - Percolator depends on a timestamp oracle to provide a timestamp for each transaction. Even though the workers can batch timestamp requests to the oracle, it still is a single point of failure, and it seems like it would become a bottleneck. - It seems like if an observer writes to the column it observes, it gets triggered again and might end up in a loop unless handled correctly.

DryadLINQ

DryadLINQ is a system and a set of language extensions that enable a new programming model for large-scale distributed computing. The system is designed to provide flexible and efficient distributed computation. DryadLINQ compiles LINQ programs into distributed computations running on the Dryad cluster-computing infrastructure. A Dryad job is a directed acyclic graph where each vertex is a program and edges represent data channels. At runtime, vertices are processes communicating with each other through the channels. When it receives control, DryadLINQ starts by converting the raw LINQ expressions into an execution plan graph. There is a central job manager which is responsible for scheduling processes from the dataflow graph on cluster computers, and it manages a batch queue of jobs.

Pros: - Suitable for large-scale data-parallel computing using a rich object-oriented language; this supports coarse-grained parallelization. - It allows programmers to express complex computations over datasets while letting the runtime decide how those computations should be implemented. - The DryadLINQ optimizer produces good automatic execution plans for most programs. - There are libraries of common subroutines built on top of this for various domains. - It has support for debugging. - There are dynamic optimizations (at run time) based on the size of the input. Cons: - This doesn't seem to be generalized. - The job manager is a centralized component which can be a single point of failure. - The job manager constructs the execution plan graph, which doesn't seem to scale as the number of nodes in the graph explodes; there is bound to be latency due to that. - How does it handle performance bottlenecks like stragglers? They could schedule backups.

Thanks -Harshitha (gplkrsh2)

From: mark overholt [overholt.mark SpamElide] Sent: Monday, February 14, 2011 5:34 PM To: Gupta, Indranil Subject: 525 review 02/15

Pig Latin: A Not-So-Foreign Language for Data Processing

Summary: Pig Latin, and its corresponding implementation Pig, is a cloud programming paradigm set in the small niche between declarative queries like SQL and procedural programming like Map/Reduce. The idea is that data processing in SQL is not powerful enough, while data processing in Map/Reduce is too restrictive on the data. The developers wanted a system that could be easily reused and maintained across different types and sets of data, while allowing them to program in a more comfortable, procedural style. Pig Latin offers the declarative grouping of data, with the flexibility to process whatever data it encounters and the option for programmers to customize nearly every aspect of the execution of their program. From the order in which atomic operations are done on the data, to how the data is loaded into memory, the user can create functions to specify anything they want. Other important features of Pig Latin are "a flexible, fully nested data model, extensive support for user-defined functions, and the ability to operate over plain input files without any schema information." It also has a novel debugging environment. Because it is assumed that Pig will be used with large-scale web data, it requires parallelism. This allows for every part of a Pig program to be parallelized as much as possible. Even the construction of primitive types allows for easy parallelization. Pig is currently structured to be compiled and run on top of Hadoop, but the developers assert that it would be easy to migrate it to any platform.

Thoughts/Criticisms: Pros 1. The idea of integrating a declarative set of statements into a procedural language for large-scale data processing is a very cool idea. In terms of the time saved in developing programs, it would be a huge time saver. 2. The debugging environment is a very novel idea as well. With huge data sets such as web data, you could incur huge time costs if you messed up 1 or 2 lines of a program. The idea of dynamically using small sections of data that will test each aspect of your program in a timely fashion is a great step. In most cases, it seems like you could squash most bugs in a Pig program rather quickly before turning it loose on the full data set. 3. I also like the full integration of user functions into the statements. The paper mentions that in SQL, you can only use aggregates in the SELECT clause, etc. In Pig, you can use user functions anywhere in a statement. This allows you to customize anything that the program can do.
Cons 1. The paper does not go into detail about the debugging solution they presented. This to me was the most novel part of their system, and it would account for most of the time saved in deploying a web data analytics tool. I would like to see how the data is selected, and how exactly it is used to generate the side-by-side table they presented in the paper. 2. The lack of experimental data was disappointing. The authors can tell me it is quicker to deploy and easier to maintain, but until I see some hard examples with accompanying data, it makes me wonder exactly how much faster and more flexible Pig Latin really is. 3. Pig Latin offers more control and more freedom at the expense of requiring more programming skill. Experienced programmers can exploit the benefits of Pig Latin. One of the lures of Map/Reduce and SQL is that they are very simple to use, even for someone with minimal programming skill.

DryadLINQ

Summary: DryadLINQ is the combination of two processing interfaces, Dryad and LINQ. LINQ is the Language Integrated Query, an SQL-like adaptation for the .NET framework. Dryad is a distributed processing environment that allows for processing of large data sets on a cluster of machines. The idea behind DryadLINQ is to combine the two in a seamless fashion. A programmer can write a LINQ program in .NET and, using DryadLINQ, run it on a Dryad system without any knowledge of the parallelization that is taking place underneath the system. It allows the programmer to write a program as if it were on a uniprocessor system and have it run in the time of a parallel system. It differs from other data processing languages like Pig Latin and SQL in that it supports traditional programming structures like loops, functions and libraries.

Thoughts/Criticisms: Pros 1. Because it uses .NET as its framework, it can utilize everything already built into .NET, including its world-class debugging platform. 2. It is easy to program, thanks to its abstraction of the distributed nature of the execution, as well as its ability to use traditional loops, etc. Cons 1. DryadLINQ requires persisting intermediate data when moving data between stages of execution, causing increased latency. 2. The system uses a centralized job manager, which will clearly become a bottleneck for large clusters and inhibit system scalability.