From: Nathan Dautenhahn [dautenh1@illinois.edu]
Sent: Tuesday, February 16, 2010 12:32 PM
To: Gupta, Indranil
Subject: 525 review 02/16

Paper Review: Pig Latin and DryadLINQ
Nathan Dautenhahn
February 16, 2010

1 Pig Latin: A Not-So-Foreign Language for Data Processing

1.1 Summary and Overview

Pig Latin is a programming language that provides a middle ground between declarative querying and procedural programming such as map-reduce. The primary problems that Pig Latin attempts to solve are the lack of general programming concepts in the standard SQL programming paradigm and the lack of flexibility in the map-reduce paradigm. The primary contributions described in this paper include the following:
• The development of a data flow language that allows for the sequential specification of a set of high-level actions. This is in contrast to the declarative nature of standard SQL syntax.
• The ability for a Pig Latin program to execute actions out of order if no side effects exist from the action. This frees the programmer from focusing on the parallelization of his code and allows optimization to be performed.
• Flexible data import and export policies.
• A nested data structure programming model.
• The ability for user-defined functions to perform custom processing of queries.
• A novel and robust debugging environment.
• The concept of a dynamically constructed side data set, namely, the sandbox data set.

1.2 Comments and Criticisms

The following are my primary criticisms of the paper:
• The read-only nature of Pig.
• Forced parallelism.
• There is a lack of experimentation.
• Overall this paper felt more like a technical report about the development of a production product, and less like a full research paper. The authors appear to be less focused on the research portions of the presentation. This is in contrast to DryadLINQ, which is highly focused on well-formed prose and presentation.
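The dataflow style described in 1.1 (a sequence of named steps rather than a single declarative query) can be sketched in plain Python. This is only an illustration of the style, not Pig Latin syntax; the field names and records are invented for the example.

```python
# Hypothetical sketch of Pig Latin's step-by-step dataflow style
# (LOAD -> FILTER -> GROUP -> FOREACH/GENERATE) in plain Python.

records = [
    {"user": "alice", "url": "a.com", "time": 1},
    {"user": "bob",   "url": "b.com", "time": 2},
    {"user": "alice", "url": "c.com", "time": 3},
]

# FILTER: keep only events after time 1
filtered = [r for r in records if r["time"] > 1]

# GROUP BY user: produces nested "bags" of records per key,
# mirroring Pig Latin's nested data model
groups = {}
for r in filtered:
    groups.setdefault(r["user"], []).append(r)

# FOREACH ... GENERATE: compute a per-group aggregate
counts = {user: len(bag) for user, bag in groups.items()}
print(counts)  # {'bob': 1, 'alice': 1}
```

Each step names its result, so a later step (or an optimizer) can reorder or fuse steps that have no side effects, which is the property the second contribution bullet relies on.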
2 DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

2.1 Summary and Overview

This paper discusses the development and implementation of DryadLINQ, a programming system that executes sequences of LINQ expressions performing arbitrary dataset transformations. The primary goal is to provide a layer of abstraction that allows the programmer to express a high-level task and have that task be automatically parallelized and executed in Dryad. The primary problem they are attempting to solve is the poor programming interfaces of current parallel data processing systems such as Dryad and MapReduce. They provide abstractions that allow the programmer to interface with the data in a way that is similar to their natural programming styles, in contrast to the current SQL-style programming of large-scale dataset manipulation. The primary contributions and things I liked about the paper include:
• Iterative functionality in data operations.
• A hybrid of declarative and imperative programming languages.
• Automatic optimization of DryadLINQ programs.
• Improvements upon LINQ's support for high-level parallelization.
I really liked how their approach is not only to create a programming language/paradigm, but also to create a language that is easily included in other major object-oriented languages. For example, a program written in C# can perform all of the normal operations that an application needs to do, but then offload all massive dataset transformations to DryadLINQ without needing to understand the parallelization technology. One of my primary concerns with this project is the escalation of complexity being added to simple dataset operations. Adding more abstractions appears to be a great thing, but I think that eventually this will limit the flexibility and ease of use of these types of programming paradigms.
3 Common Themes

The primary theme here is the development of an abstraction layer that enables programmers to: not explicitly program parallelization of their work, use common programming constructs that are more easily used than SQL interfaces, and integrate automatic optimization into the dataset transformations. One question I have is: why is it important to denote that these programs are a way to perform sequential operations?

From: Rini Kaushik [rinikaushik@yahoo.com]
Sent: Tuesday, February 16, 2010 12:25 PM
To: Gupta, Indranil
Subject: 525 review 02/16
Attachments: review_0216.txt

Hi Indy, Please find attached my review for today's papers. Thanks, Rini

From: Shivaram V [shivaram.smtp@gmail.com] on behalf of Shivaram Venkataraman [venkata4@illinois.edu]
Sent: Tuesday, February 16, 2010 12:21 PM
To: Gupta, Indranil
Subject: 525 review 02/16

Shivaram Venkataraman
Feb 16 2010

1. Wave Computing In the Cloud

This paper proposes a new model, 'Wave', for expressing queries run in datacenters on periodically generated logs. Many queries share the same input or computation, and expressing them in an appropriate model would help in optimizing throughput and cluster usage. The data is considered as a stream in this model, and queries consist of different 'query series', each of which is a set of repeated computations. This model helps express the correlation between queries which share the same computation on the same input stream. The major issues identified by the authors from execution logs on a production cluster are:
- Redundancy of input and intermediate computation. They found that about 33% of total I/O is redundant among all query executions and that 30% of the queries share at least one step of computation.
- There exists a load imbalance on the cluster due to jobs not running over the weekend and monthly jobs running only at the end of a month. Calculating intermediate daily aggregates and re-using them would provide uniform resource usage on the cluster.
- As the window of input for a query increases in size, the probability that it fails increases considerably, and this further motivates the need to run queries on smaller inputs.
This model helps identify many research problems about handling queries efficiently. Query decomposition techniques would help uncover queries which share input, and they can be scheduled appropriately to balance the load on a cluster. Such scheduling schemes however need to take into account machine failures and also need to handle specific queries which may require immediate response. Similar to database systems, query plans could be designed to optimally execute the query taking into account the number and location of the machines available. Preliminary studies suggest that having a declarative high-level language which allows users to specify their queries in terms of predefined operators allows greater opportunity for optimization.

Pros:
- Observations based on production cluster data from Microsoft.
- Log analysis represents one of the most widely found computations in datacenters, and optimizing it would have a great impact on performance and resource usage.
- Presents many research directions for scheduling on datacenters, and these are more relevant as we move towards a shared-cloud model.

Cons:
- The model is restricted to a specific type of computation on a specific type of input.
- Discussion on how hardware developments like SSDs could affect such models would have been interesting.

2. DryadLINQ

DryadLINQ comprises language extensions that enable a user to express distributed, data-intensive computations in a high-level imperative language, and a system to efficiently convert such programs into Dryad computations. Based on LINQ (Language Integrated Query, a .NET construct), DryadLINQ provides a flexible language that can make use of existing .NET types and libraries.
It differs from other data processing languages like Pig Latin and SQL in that it supports traditional programming structures like loops, functions and libraries. When a user's .NET application runs, a DryadLINQ expression object is created, but its evaluation is deferred until the application requests the output of the execution. At this point DryadLINQ compiles the expression into a Dryad execution plan, generates the code that will run at each Dryad vertex, and submits these to the job manager. After the job completes, the job manager returns control to DryadLINQ and the output data is made available to the user. Two of the more powerful constructs in DryadLINQ are the Apply and Fork operators. These can be used by programmers when they wish to perform arbitrary computations over multiple streaming computations. As the system has no control over these computations, programmers need to use annotations which indicate to the compiler how the computation can be parallelized. The DryadLINQ compiler performs static optimizations like pipelining multiple operations into a single process, removing redundancy, and eager aggregation to reduce the amount of data transferred. It can also use the Dryad API to dynamically mutate the execution graph and increase the number of vertices based on the progress of the job. These optimizations result in an execution plan which is efficient for most jobs.

Pros:
- Great debugging support leveraging existing .NET tools.
- Having a strongly typed language helps catch many errors.
- The ability to use traditional constructs like for-loops makes programming easier.

Cons:
- Not much insight into the performance profile of a job. Mentioned as work in progress by the authors.
- DryadLINQ does not check or enforce the absence of side-effects due to any object shared during computation. This may catch some users by surprise.
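The deferred-evaluation behavior described above (the expression object records the query, and nothing executes until the output is requested) can be sketched in a few lines of Python. The class and method names here are hypothetical, not DryadLINQ's actual API; in the real system, requesting the output triggers compilation to a Dryad plan rather than local execution.

```python
# Minimal sketch of DryadLINQ-style deferred evaluation: building a
# query only constructs an expression object; the recorded plan is
# "compiled" (here, just run in one pass) when output is requested.

class Query:
    def __init__(self, source, ops=()):
        self.source, self.ops = source, ops  # ops recorded, not run

    def where(self, pred):
        return Query(self.source, self.ops + (("where", pred),))

    def select(self, fn):
        return Query(self.source, self.ops + (("select", fn),))

    def to_list(self):                       # evaluation happens here
        data = iter(self.source)
        for kind, f in self.ops:
            data = filter(f, data) if kind == "where" else map(f, data)
        return list(data)

q = Query(range(10)).where(lambda x: x % 2 == 0).select(lambda x: x * x)
# No work has happened yet; q is just a plan of two recorded operators.
print(q.to_list())  # [0, 4, 16, 36, 64]
```

Because the whole plan is visible before anything runs, an optimizer can pipeline or reorder operators, which is exactly the opening the static optimizations above exploit.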
Interesting points:
- A mapreduce program can be expressed within 10 lines of DryadLINQ code.
- As storage moves from spinning disks to solid state, the advantages of streaming systems like Dryad and MapReduce will diminish.

From: pooja agarwal [pooja.agarwal.mit@gmail.com]
Sent: Tuesday, February 16, 2010 12:12 PM
To: Indranil Gupta
Subject: 525 review 02/16

DS REVIEW 02/16
By: Pooja Agarwal

Paper - DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing using a High-Level Language
Authors - Y Yu, M Isard, D Fetterly, M Budiu, U Erlingsson, P K Gunda, J Currey
Conference - OSDI 2008

Main Idea:
This paper presents DryadLINQ, a set of language extensions that can be used to transparently compile programs into distributed parallel computations running on the Dryad system architecture. It is based on LINQ, a set of .NET constructs, and provides a hybrid of declarative and imperative programming. The main tasks of DryadLINQ comprise compiling LINQ expressions into a Dryad execution plan graph (EPG), decomposing the LINQ expression into sub-expressions which can be assigned to different Dryad nodes, performing static and dynamic optimizations on the constructed EPG, generating static data and code to be run on the Dryad nodes, and generating serialization code for data types. It also keeps track of the current jobs in the system by using a job manager. The authors have also extended the LINQ expressions by adding a few new operators, and evaluated the performance of the system against various applications like SkyServer, PageRank, and TeraSort.

Pros:
1) Using DryadLINQ, users are able to use more complex constructs like functions, loops, modules and libraries which are otherwise not supported in general query processing languages like SQL.
2) Provides an integrated programming environment by combining the power of both LINQ and Dryad.
3) Makes use of object-oriented programming languages, providing a cleaner interface and easy extensibility.
4) Provides features to allow reuse of common sub-expressions and avoid recomputation.
5) Provides dynamic optimizations for aggregation based on the topology to efficiently reduce the I/O constraints.

Cons:
1) A learning curve is required to start programming in DryadLINQ.
2) The optimizer does not provide other basic optimization techniques or parameters which would be helpful to users. Currently, the users need to implement them.
3) Performance debugging is not currently supported.
4) Requires side-effect free expressions and can lead to erroneous results if shared objects are modified.

Paper - Wave Computing in the Cloud
Authors - B He, M Yang, Z Guo, R Chen, W Lin, B Su, H Wang, L Zhou
Conference - HotOS 2009

Main Idea:
The paper describes a Wave model that makes use of the correlation among temporal and recurring computations to achieve better performance and resource utilization. It defines files as streams and recurrent queries as query series that operate on the streams. The correlation is based on the same data being required by different streams, or on the reuse of computations done on the same data which might be required at future times in the query series. Due to these correlations, different tasks can be scheduled to occur simultaneously, which can lead to reusing computations or sharing common resources. The recurring nature of some queries can also provide key insight into query execution behavior and data properties, which can be utilized for better prediction.

Pros:
1) Reduces redundancy of computations or I/O by exploiting correlations among query streams and data streams.
2) Provides load balancing by decomposing queries into smaller sub-queries.

Cons:
1) It does not provide any algorithm for the main ideas, such as how scheduling of queries is done based on the prediction information.
It's not sufficient to say that scheduling can be done based on the predictions, as in complex systems it could be hard to optimize scheduling based on the predictions.
2) Lacks evaluation against a variety of applications, such as applications which are already optimized by design to take care of the redundancy.
3) It lacks an evaluation of the tradeoff between query decomposition and query aggregation, as both of them have certain advantages but are orthogonal to each other.
4) The time and computational complexity of the scheme is not clear.

From: Giang Nguyen [nguyen59@illinois.edu]
Sent: Tuesday, February 16, 2010 11:58 AM
To: Gupta, Indranil
Subject: 525 review 02/16

Giang Nguyen

Wave Computing in the Cloud

The authors observe 20,000 successful data-intensive queries totaling 29 million machine hours on 140 data streams that are updated daily or monthly. There is redundancy in reading the input data streams ("143 streams accessed around 40 thousand times. The top ten accessed streams have around 75% of the total number of accesses"), load imbalance caused by the input data window and day of query submission, and an inverse relationship between query input window size and success rate. The authors propose a Wave computing model where the system collects statistics (input/output data size/distribution, complexity of the operation, and cluster execution environment such as network topology) about the execution of each query and stores these statistics to enable optimizations. The statistics will allow the system to read commonly accessed input streams fewer times, to perform better query planning and scheduling, etc. As the paper says, there appear to be great opportunities, most obviously with the shared scans of input data. However, the hard part is to convert the collected statistics into a model that can automatically optimize query planning and execution. As such, the paper does not have a proposal to solve that problem.
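One of the Wave observations above, re-using intermediate daily aggregates across recurring queries, can be sketched simply: compute each day's partial result once, cache it, and answer overlapping query windows by combining cached partials instead of rescanning the raw stream. All names and data below are invented for illustration; the real system would persist partials in the distributed file system rather than in a dictionary.

```python
# Sketch of the "reuse intermediate daily aggregates" idea: the first
# query over a day pays the scan; later queries over any window that
# includes that day reuse the cached partial sum.

daily_logs = {
    "2010-02-14": [3, 5],
    "2010-02-15": [2],
    "2010-02-16": [7, 1],
}

aggregate_cache = {}  # day -> partial sum, computed at most once

def daily_sum(day):
    if day not in aggregate_cache:           # first query pays the scan
        aggregate_cache[day] = sum(daily_logs[day])
    return aggregate_cache[day]              # later queries reuse it

# Two overlapping query windows share the cached work for Feb 14-15.
q1 = daily_sum("2010-02-14") + daily_sum("2010-02-15")   # 2-day window
q2 = sum(daily_sum(d) for d in daily_logs)               # 3-day window
print(q1, q2)  # 10 18
```

This is also what smooths the load imbalance: the per-day work is spread across the days the data arrives, rather than bunched into the day a large-window query runs.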
Pig Latin: A Not-So-Foreign Language for Data Processing

For large-scale data analysis, the authors say that programmers prefer the procedural style of the map-reduce model over the declarative style of SQL. However, the map-reduce model is too low-level and rigid, claim the authors, which leads to large amounts of custom code that is hard to maintain and reuse. Thus the authors propose a new language called Pig Latin that is procedural but high-level (in the spirit of SQL), with built-in filtering, grouping, and aggregating operators. Other important features of Pig Latin are "a flexible, fully nested data model, extensive support for user-defined functions, and the ability to operate over plain input files without any schema information." It also has a novel debugging environment. As the data analysis workloads are "read-only", Pig Latin doesn't need schema information and also doesn't need to curate the data. The high-level SQL-like operators also allow the Pig system to better optimize queries where possible. A Pig Latin program is compiled into map-reduce programs. The part that intrigues me most is the Pig Pen sandbox data set generator. I think the ability to automatically generate comprehensive data sets to test user commands is very valuable. However, the details of the algorithm are not included in the paper, so it's not clear how good of a job it does.
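The compilation step mentioned above (a Pig Latin program becomes map-reduce jobs) can be illustrated with a toy map-reduce runner. This is a simplified sketch, not Pig's actual compiler: it shows how a GROUP-then-COUNT step maps onto a single map-reduce job, where map emits (key, 1) pairs, a sort stands in for the shuffle, and reduce sums per key. The record format is invented.

```python
# Toy illustration of compiling a GROUP + COUNT step to map-reduce:
# map emits (key, 1), the "shuffle" groups pairs by key, and reduce
# sums the values for each key.

from itertools import groupby

def map_phase(record):
    yield (record["url"], 1)                 # emit key-value pairs

def reduce_phase(key, values):
    return (key, sum(values))                # aggregate per key

def run_mapreduce(records):
    pairs = [kv for r in records for kv in map_phase(r)]
    pairs.sort(key=lambda kv: kv[0])         # stand-in for the shuffle
    return [reduce_phase(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=lambda kv: kv[0])]

clicks = [{"url": "a.com"}, {"url": "b.com"}, {"url": "a.com"}]
print(run_mapreduce(clicks))  # [('a.com', 2), ('b.com', 1)]
```

A multi-step Pig Latin program compiles to a chain of such jobs, which is also where the materialization overhead between successive jobs (noted by other reviewers) comes from.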
From: Kurchi Subhra Hazra [hazra1@illinois.edu]
Sent: Tuesday, February 16, 2010 11:18 AM
To: Gupta, Indranil
Subject: 525 review 02/16

DryadLINQ: A System for General Purpose Distributed Data-Parallel Computing Using a High-Level Language
-------------------------------------------------------------------------------------------------------

Summary
--------------
This paper demonstrates DryadLINQ, a system that exploits LINQ and the Dryad execution platform to facilitate distributed computations that execute efficiently on large computing clusters. A user's .NET application creates a DryadLINQ expression object and hands it over to DryadLINQ, which converts this into a distributed Dryad execution plan. To do this, the Dryad job manager is invoked, which creates an Execution Plan Graph (EPG) for the current job. The EPG is a directed acyclic graph and the framework of the Dryad data-flow graph that will be executed, where each node is an operator and edges represent its inputs and outputs. Each Dryad vertex then executes its designated job as per the EPG. Control then goes back to the user, who can read the results as .NET objects. The system uses greedy heuristics for static optimizations, and also deploys dynamic optimizations by rewriting the EPG depending on run-time data statistics. The writers also demonstrate that the use of DryadLINQ in various applications, like Terasort, Skyserver and Pagerank, shows promising performance.

Pros--
--------
-- DryadLINQ automatically and transparently translates the data-parallel portions of a program into a distributed execution plan, which gives the programmer the illusion of writing for a single computer and has the system deal with the complexities that arise from scheduling, distribution and fault-tolerance.
-- It scores over languages like SQL, which are unsuitable for parallel data-intensive tasks like machine learning, and Map-Reduce, where no automatic optimizations take place, by combining the good points of both.
-- Dynamic optimization and virtualization employed in DryadLINQ allow it to run plans requiring many more steps than the instantaneously available computation resources would permit.
-- In order to reduce the latency introduced by network reads, every node compresses data before sending it out to a different node. However, network reads still remain a bottleneck.
-- The results shown via experiments are promising for an automatic system.
-- Dryad has been used in production clusters for several years now, hence the runtime system is tried and tested, and guarantees efficient and reliable execution.

Cons--
---------
-- DryadLINQ uses virtualization that allocates resources independent of the actual cluster used for execution, which requires intermediate results to be stored to persistent media, thus increasing latency.
-- The fact that DryadLINQ expressions must be side-effect free implies that shared objects cannot be modified, which can become a stringent restriction for many applications.
-- The system uses a centralized job manager, which will clearly become a bottleneck for large clusters and inhibit system scalability.
-- The system uses LINQ as the language platform. However, I am not very sure about the popularity of LINQ and how widely it is used. This might inhibit widespread use of the DryadLINQ system too.
-- Dryad and DryadLINQ are specialized for streaming computations and hence are inefficient for applications requiring random accesses. In fact, I feel that they use a similar framework for all applications. Certain parallel applications may not fit into this framework, and the system is not intelligent enough to modify the framework according to the needs of the application.
Wave Computing in the Cloud
---------------------------------------

Summary
--------------
In this paper, the writers introduce a new concept called Wave computing that exploits the temporal relationship among queries in data-intensive distributed computing. This model captures the key properties of log data mining. The writers, through a survey of a query trace obtained from a production cluster, demonstrate the common trends seen in such systems. For example, the computations performed across queries during different times have a redundancy of 30%. Load imbalances are common too, where machine time during weekdays is 50% higher than that during weekends. Besides, the queries with larger input time windows fail more often due to resource contention or exhaustion. In order to do away with these problems, the writers introduce the notion of streams and query series. Data is modelled as an append-only stream that is constantly updated and is distributed across various machines. The term query series is used to refer to recurrent computations on a stream, with each performed on one or more stream segments. The writers, in their survey, show that queries can be grouped into a number of query series, such that queries belonging to a query series have a lot of similarities. This can be utilized for predicting the behaviour of later queries that can be grouped into an existing query series. The writers also point out a number of improvements that can be introduced into the system as a whole in the form of cross-query optimizations, based on a history of queries being executed by a system.

Pros-
--------
-- The writers try to shift research directions from individual queries to system utilization in large clusters that compute a massive number of queries each day. With the growing popularity and use of clusters, this idea is simple, useful, and novel.
-- The proposed model is practically feasible since it can be built on top of existing systems by extending their present capabilities.
-- The paper introduces many open-ended interesting problems on distributed query optimization that can trigger a lot of good research.

Cons--
----------
-- This is mainly a theoretical paper that introduces one to a new model. The writers propose some new problems and their possible solutions using the wave model, but do not have any experimental results to back up their claims.
-- The concept is based on a survey of queries being executed in a production system. I am not sure how well the notion of query series will hold in other data-intensive distributed computations.

Thanks,
Kurchi Subhra Hazra
Graduate Student
Department of Computer Science
University of Illinois at Urbana-Champaign

From: Fatemeh Saremi [samaneh.saremi@gmail.com]
Sent: Tuesday, February 16, 2010 11:12 AM
To: Gupta, Indranil
Subject: 525 review 02/16

Paper 1: Pig Latin

Pig Latin is a new language designed for the analysis of extremely large data sets; it sits between the declarative style of SQL and the low-level, procedural style of map-reduce. The sheer size of these data sets leaves no other way except storing and processing them on highly parallel systems. While parallel database products utilizing simple SQL queries provide some solutions, using these products at web scale is extremely expensive. Besides that, programmers prefer writing scripts or code rather than writing unnaturally declarative queries in SQL, and that's why the more procedural map-reduce programming model has been successful. On the other hand, the map-reduce model has its own set of limitations: its one-input, two-stage data flow is extremely rigid and results in large portions of custom code which are difficult to reuse and maintain.
To this end, Pig Latin eliminates the most problematic aspects of these two extremes, combining high-level declarative querying (in the spirit of SQL) with low-level, procedural programming (in the spirit of map-reduce). Pig, the system developed around Pig Latin, compiles the language into physical plans that are executed over Hadoop, the open-source implementation of map-reduce. In addition, a novel debugging environment, Pig Pen, is presented for Pig, which has the ability to freeze the execution of a program prefix so the user can add further commands, and then continue executing the extended program without losing the progress made so far.

Pros:
- High-level declarative querying as well as low-level, procedural programming
- Easy to reuse and maintain
- Support for a flexible, fully nested data model which is closer to how programmers think and conformant to the way data is stored on disk
- Extensive support for user-defined functions
- Ability to operate over plain input files without any schema information
- Novel interactive debugging environment with facilities for writing a program in an incremental fashion
- Implemented open-source accompanying system which allows different systems to be plugged in
- Flexibility in the execution order of the operations
- Quick start and easy interoperability with other applications
- Appropriate selection of the language primitives (primitives that cannot be parallelized are excluded, though they can be defined and added by users)
- Programs written in Pig Latin are easier to optimize compared to SQL

Cons:
- More redundant in commands, compared to SQL
- Not easy to understand the functionality (semantics) of a query by a quick look at its syntax (due to redundancy)
- Questionable efficiency of implementation due to grouping operations which might result in gigantic tuples of nested bags that are bigger than main memory
- Considerable overhead while compiling Pig Latin into map-reduce jobs, due to the inflexibility
of map-reduce primitives, which forces data to be materialized and replicated on the distributed file system between successive map-reduce jobs

Paper 2: Wave Computing

This paper introduces a simple but valuable new model, Wave Computing, that lies between two different processing models: traditional batch processing and stream processing. This recently proposed model exposes the temporal relationships among the queries in data-intensive distributed computing and proposes to use these relations, which are of a recurrent nature, to improve performance and resource utilization of the system. It specifically has been investigated through the study of a query trace with queries written in SCOPE, on around 140 data streams obtained from a production cluster. The trace contains nearly 20 thousand successfully executed queries, taking a total of 29 million machine hours. While the data is considered as streams that are being updated, the updates are persisted and available, which results in the periodic processing on the stream being of the batch type. However, unlike batch processing, which looks at individual queries, the wave computing model defines series of correlated queries, called query series. A query series captures a sequence of the same computation on different sets of segments of the same stream and explicitly exposes the correlations among the queries in the query series in terms of both data and computation. Queries in different query series might share the same I/O to scan the input data and might even share common computation. Those queries could be scheduled to run together as a single combined query by removing redundancies. The wave computing model is particularly compatible with today's systems, in which the main portion of the workload has the mentioned property, and therefore the model can be enabled on top of existing systems.
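The "single combined query" idea just described (queries that share the same input scan are merged so the stream is read once) can be sketched concretely. The data and the two query definitions below are invented for illustration; a real system would derive the combination from the query plans.

```python
# Sketch of removing redundant I/O across correlated queries: two
# queries over the same stream segment are merged into one scan,
# each folding in its own aggregate as records go by.

stream = [("a", 4), ("b", 1), ("a", 2), ("c", 9)]

def combined_scan(segment):
    total = 0          # query 1: total volume over the segment
    per_key = {}       # query 2: per-key record counts
    for key, value in segment:   # single shared pass over the input
        total += value
        per_key[key] = per_key.get(key, 0) + 1
    return total, per_key

print(combined_scan(stream))  # (16, {'a': 2, 'b': 1, 'c': 1})
```

Run separately, the two queries would scan the segment twice; combined, the I/O cost is paid once, which is the redundancy the trace study quantified at roughly a third of total I/O.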
The model helps significantly improve performance and resource utilization, the load balance of the system (through query decomposition, though this is not applicable to all queries), and the success rate of queries (by reducing the size of each individual query). In addition, the wave model also improves fault tolerance by using stored previous data to predict results when a failure happens and the processing behavior (data and computation) is the same as before. This way, the wave model unlocks the full power of data-intensive distributed computing.

The paper is only an overview of the idea. Though the idea sounds workable and useful, what happens in practice is not fully predictable. There are a lot of issues and details that govern the effectiveness of the model, e.g., how accurate and efficient an oracle for this model might be. Designing and implementing modules like the query decomposer, query planner and query scheduler involves details that noticeably affect the efficiency of the idea. The other issue worth discussing in the paper is the accuracy of the prediction model, and under which conditions enabling predictions and operating based on the wave model is efficient and beneficial.

From: mukherj4@illinois.edu
Sent: Tuesday, February 16, 2010 11:03 AM
To: Gupta, Indranil
Subject: 525 Review 02/16 Cloud Programming

Pig Latin: A Not-So-Foreign Language for Data Processing: by Olston et al.

Pig Latin has been described as a sweet spot between the declarative style of SQL and the low-level procedural language style of Map-Reduce. It is executed over Hadoop, which is a free and open-source version of Map-Reduce. Pig Latin comes with a "novel" debugging environment, as claimed by the authors.

Features/Characteristics of Pig Latin:
As claimed by the authors, the features that differentiate Pig Latin from other programming paradigms are as follows:
1.
It is meant for cloud programming, i.e., for distributed systems, especially parallel database applications.
2. It was developed mostly in the Yahoo Research Group on top of Hadoop, although in the later part of the paper the authors express the view that, in principle, Pig Latin could be compiled into Dryad jobs.
3. [Advantage] Writing a program using Pig Latin is equivalent to specifying a query execution plan (i.e., a data flow graph), giving more understanding of and control over how the query will be executed. [Disadvantage] More control and freedom come at the expense of more programming skill; experienced programmers can best exploit the benefits of Pig Latin. The underlying assumption is that automatic query optimization techniques are not sufficient for distributed database systems.
4. [Advantage] It has a flexible, fully nested data model and allows complex, non-atomic data types. A nested data model is closer to how programmers think.
5. [Advantage] Pig Latin has extensive support for user-defined functions (UDFs).
6. [Disadvantage] Pig UDFs are written in Java only as of now, although it is mentioned that the authors are trying to develop support for UDFs written in other languages.
7. [Advantage] It supports out-of-order execution, i.e., the operations specified do not need to be performed in sequence. The example in Section 2.1 of determining spam URLs with high page rank depicts the benefit, but how the sequence of execution is chosen is not made very clear in the paper, i.e., when execution will be in-order and when it will be out-of-order.
8. Unlike conventional databases, transactional consistency and index-based look-ups are not required while programming using Pig.
9.
It comes with a novel interactive debugger which is capable of generating a tailored dataset (the sandbox data set) to facilitate testing, in order to minimize the time (and probably effort) needed to develop applications using Pig Latin. Many debuggers support interactive debugging, but on a distributed system, debugging a parallel application is a non-trivial job. More comments: In Section 2.5, the authors mention carefully chosen primitives, but not enough detail is provided. Also, few examples and test results (on benchmark problems or data-intensive applications) are provided in this paper to support the authors' arguments/claims. It seems to be a good option for analyzing data via ad-hoc queries. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language: by Yu et al.: DryadLINQ combines a distributed execution engine (Dryad) with the .NET Language Integrated Query (LINQ) extensions, enabling a new programming model for large-scale data-parallel applications running on large PC clusters. LINQ is a set of extensions to the .NET Framework that encompass language-integrated query, set, and transform operations. It extends C# and Visual Basic with native language syntax for queries and provides class libraries to take advantage of these capabilities. Dryad is a high-performance large-scale execution engine. This paper makes the following contributions: demonstrating a new hybrid of declarative and imperative programming suitable for large-scale data-parallel applications; demonstrating automatic optimization using DryadLINQ; and a small set of operations to improve LINQ support. Features: DryadLINQ exploits LINQ, converting the raw LINQ expression into an execution plan graph. [Advantage] It is easy to develop applications, as it gives the illusion of developing an application for a single computer, and does not require much expertise in order to exploit parallelism.
[Disadvantage] The programmer does not have control over the low-level primitives, so parallelization depends solely on how well DryadLINQ can exploit LINQ. It reduces the programmer's effort at the cost of less control and possibly poor application-specific optimization. [Advantage] DryadLINQ provides hints for programmers to optimize beyond the automatic optimizations by specifying annotations of various kinds. [Advantage] It produces good automatic execution plans for LINQ-based applications. Comments: Experiments are done on benchmark problems, hence acceptable. But the scaling results do not say much about weak or strong scaling for all the benchmarks. For sorting, they show that execution time does not grow much when sorting more numbers on more machines, whereas for SkyServer they report speedups obtained by keeping the problem size the same and varying the number of processors. So some results may not be impressive enough to publish. An analogy of DryadLINQ to OpenMP-style programming and Pig Latin to MPI can be justified: with OpenMP it is easy to parallelize, although not much speedup is possible since programmers have less control, whereas MPI requires more expertise and effort to develop applications but is suitable for large clusters as it gives better speedups. From: Vivek [vivek112@gmail.com] Sent: Tuesday, February 16, 2010 10:40 AM To: Gupta, Indranil Subject: 525 review 02/16 DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language This is a system including an execution environment and language support for large-scale distributed computation. It combines the best of declarative and imperative languages. Data-parallel portions of the code can be automatically made to run in a distributed cluster. DryadLINQ uses LINQ, which is a set of SQL-like programming constructs for programming with large sets of data. One key feature is that it uses virtualized "expression plans".
In the paper, DryadLINQ is demonstrated particularly on high-performance distributed computation applications (e.g., TeraSort, large-scale machine learning, etc.). Pros: - Can be used for clusters of very large scale - Generalizable to many different applications (unlike Facebook's Hive, for example, which might be specific to social-networking sites) - Allows for both static and dynamic optimizations - Provides a hybrid of declarative and imperative programming, and does not rely on the user's knowledge of pure SQL Cons: - The general issue with DryadLINQ is that there is too much automated optimization going on behind the scenes, and there seems to be very little parametrization and user-defined functionality for specific applications. - Random access seems to be a bigger issue than the way it is presented. While improvements can be made for random accesses, the underlying structure and design may need to be rethought for such support to be added. - Perhaps the most important downside is that DryadLINQ is not widely distributed and not open-source -- and this (in my opinion) has particularly slowed its general adoption among different user communities. Only a small community of users familiar with Microsoft's programming environment has been using it thus far. This doesn't seem to help accelerate development, particularly as compared to Pig Latin. Pig Latin: A Not-So-Foreign Language for Data Processing Core Idea: In this paper, a language called Pig Latin is introduced as a solution for the analysis of large data sets such as web crawls, click streams, etc. It is geared towards many internet companies such as Amazon, Yahoo!, and Google and is implemented on top of Hadoop. It combines the recently emerging map-reduce style of programming in cloud computing with the more traditional SQL-style programming.
Rather than simple standard query expressions written in SQL, Pig Latin allows one to explicitly define and control the sequence of expression execution. The argument is that allowing programmers to write such data-intensive applications as a sequence of steps is much more appealing than forcing the system to use a particular plan through optimization flags or hints. Pros: - Perhaps the most characteristic feature and advantage of Pig Latin is its approach to and support for user-defined functions (UDFs). This allows programmers to customize and fine-tune Pig Latin programs so that they work in specific domains. - Another key advantage of Pig Latin is that it is open-source. This allows it to be developed and updated very frequently based on programmer needs. Unlike DryadLINQ, Pig Latin has been steadily establishing a much larger community than just the high-performance distributed computation community. Cons: - One issue with Pig Latin is the need for a single programming language environment rather than multiple different ones. Pig Latin programs have to be enclosed in strings, and this makes syntax checking at the static level nearly impossible. For example, the development environment cannot tell the programmer whether a variable is misspelled or whether an operator is applied to a literal incorrectly. - Following up on the previous issue, Pig Latin seems to be primarily integrated with Java. But what about other, more performance-oriented languages such as C++?
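The step-by-step dataflow style this review describes can be pictured in plain Python; this is only a rough sketch with made-up data and a hypothetical `is_spam` UDF, not the paper's actual example or syntax:

```python
# Toy sketch of a Pig-Latin-style dataflow: each step names its result,
# mirroring LOAD -> FILTER -> GROUP -> FOREACH ... GENERATE.
# The url tuples and the is_spam() UDF are invented for illustration.

def is_spam(url):
    # Hypothetical user-defined function (UDF).
    return "spam" in url

# LOAD: (url, category, pagerank) tuples.
urls = [
    ("a.com/spam", "news", 0.9),
    ("b.com", "news", 0.4),
    ("c.com/spam", "sports", 0.7),
    ("d.com", "sports", 0.2),
]

# FILTER urls BY is_spam(url) AND pagerank > 0.5
filtered = [t for t in urls if is_spam(t[0]) and t[2] > 0.5]

# GROUP filtered BY category
groups = {}
for url, category, rank in filtered:
    groups.setdefault(category, []).append(rank)

# FOREACH groups GENERATE category, AVG(pagerank)
result = {cat: sum(ranks) / len(ranks) for cat, ranks in groups.items()}
```

Each intermediate result has a name, so later steps refer to it explicitly — the property the review contrasts with opaque SQL query plans.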
From: Virajith Jalaparti [jalapar1@illinois.edu] Sent: Tuesday, February 16, 2010 6:47 AM To: Gupta, Indranil Subject: 525 review 02/16 Review of “Pig Latin: A Not-So-Foreign Language for Data Processing”: The paper presents Pig Latin, a new language designed particularly to support ad-hoc analysis of large data sets, and Pig, a framework which fully implements the various constructs needed for programming in Pig Latin and translates programs written in it into Map-Reduce jobs. The paper claims that while SQL-type, query-based declarative languages provide a high-level framework which is unnatural for experienced programmers (who tend to think in terms of imperative code and scripts), map-reduce-type programming models require programmers to delve into several low-level issues not necessarily relevant to the problem at hand (and which can be done away with). Pig Latin tries to combine features from both these models by providing a procedural programming framework along with the high-level functionalities/abstractions provided by SQL. Pig Latin provides a programming model close to a programmer's way of thinking, offering sequences of instructions and a nested data model. It provides several basic constructs that are typically used in data-intensive applications, such as FOREACH, FILTER, (CO)GROUP, UNION, etc. The paper further goes on to provide details of the interactive debugging environment that Pig supports; apart from a graphical display of the execution of the program, it also provides an automated data set generator which helps to generate data that can serve as input to the program. Pig Latin provides high-level programming constructs which help users remain oblivious to the low-level details which must be dealt with when working with, for example, the map-reduce framework. Since programs in Pig Latin are translated into Map-Reduce jobs, it inherently helps exploit the parallelism present in the data analysis.
It provides a generic framework by allowing the use of user-defined functions along with the basic constructs it provides, such as those for handling input/output data. The use of various nested data types is natural and definitely increases the flexibility available to the programmer, as compared to using only a 1NF way of constructing data structures. Apart from these programming constructs, Pig also provides a graphical interactive debugger which greatly simplifies the tedious task of debugging. While one of the main reasons for creating Pig Latin is to make it easier for programmers to analyze data, the paper makes no comparison as to how the efficiency of the programs is affected as compared to programming in SQL-type languages. The authors make it explicit that Pig Latin is essentially a scan-centric language supporting read-only data analysis. It is not clear why such a constraint has been adopted, as it essentially limits its usage. Although the paper mentions that optimizations can be done in Pig Latin, it is not clear how Pig supports this. The authors do not discuss the various methods that could possibly be used in Pig Latin to make the optimizations automatic (e.g., by extending classic compiler optimizations). The same is true of parallelization: except for the parallelism achieved by using Map-Reduce, no details are presented as to how the semantics of the program can be exploited to achieve greater parallelism. Further, while Pig Pen is supposed to generate sandbox data sets, the paper does not provide any methods as to how such “complete” input data can be generated, which makes one suspect the validity of their claims (this is essentially a “hard” problem). Review of “Wave Computing in the Cloud”: This paper introduces a new Wave model of computing which essentially tries to capture the temporal relations between queries in data-intensive distributed computing.
The fundamental motivation for such a model is the presence of redundancy across the computation of various queries, which arises because complex queries are decomposed into similar simpler queries that can potentially be the same and work on the same input data. The authors present case studies which show the redundancies present in three query series and point out opportunities for optimizations which can exploit these redundancies to achieve better resource utilization and performance. In this model, the input is treated as append-only streams, and queries are broken down into query series which contain recurrent computations on a stream of data. The paper describes various optimizations that can be achieved by identifying such query series and redundancy, including predicting the computation requirements for the execution of a particular query, enabling shared I/O of data, query planning, and query scheduling. This paper exposes the opportunities to achieve better performance and resource utilization in the case of data-intensive computing. It shows that several benefits can be obtained by identifying common sub-query computations across various complex/bigger queries, along with removing the need to compute redundant queries. One of the major uses of the Wave model seems to be predictability: distributed system execution is often quite unpredictable; predictability would help in performing various complex tasks like load balancing and near-optimal scheduling, and in achieving near-optimal resource utilization. While the paper provides a novel idea to exploit inherent redundancy in data-intensive computing, it is not very clear to what extent the optimizations promised by this method can be achieved in practice. First of all, it is not trivial to calculate a query series; the paper provides no method to do so.
Even if the queries and data are known a priori, it might actually be computationally expensive to find such redundant queries whose presence can be exploited to achieve the advantages provided by the Wave model. It is not very clear to what extent such redundancies occur in regular applications; the paper provides a simple case in favor of its argument, but is the problem really so important/prevalent that we need a new model to capture it? This paper further seems to encourage a bad practice: designing architectures/frameworks in such a way that application-specific optimizations can be done within them. But shouldn't optimizations be taken care of at the algorithmic level? The algorithms for the applications being considered should be designed in such a way that they take care of such redundancies, and it should not be left to the programming model to detect such optimizations. Further, the paper briefly outlines the various potential advantages of using a wave model for computing but does not give any initial directions (even for a workshop paper) as to how such advantages can be realized in practice. -- Virajith Jalaparti PhD Student, Computer Science University of Illinois at Urbana-Champaign Web: http://www.cs.illinois.edu/homes/jalapar1/ From: liangliang.cao@gmail.com on behalf of Liangliang Cao [cao4@illinois.edu] Sent: Tuesday, February 16, 2010 5:39 AM To: Gupta, Indranil Subject: 525 review 02/16 Reviews by Liangliang Cao, cao4@illinois.edu, Feb 16, 2010 Paper 1: Pig Latin: A Not-So-Foreign Language for Data Processing This paper proposes a new interface to MapReduce which is useful for ad-hoc data analysis tasks. The basic idea of the language design is to introduce a Tuple-Bag data model and to allow UDFs to be used together with atomic variables. As a result, Pig Latin leads to procedural-style programming of SQL functions, together with a better interface for calling MapReduce.
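The Tuple-Bag data model mentioned in this summary can be loosely pictured with nested Python structures; this is only an analogy with an invented example record, not Pig's actual implementation:

```python
# Loose Python analogy of Pig's nested data model:
#   Atom -> str/int, Tuple -> tuple, Bag -> list of tuples, Map -> dict.
# The example record below is invented for illustration.

record = (
    "alice",                       # atom
    [("lakers", 1), ("iPod", 2)],  # bag of (query, count) tuples
    {"age": 20},                   # map
)

name, queries, info = record

# Nesting lets one field hold a whole collection, unlike flat 1NF rows.
total_queries = sum(count for _query, count in queries)
```

The point of the analogy: a single field can contain an entire bag, so a UDF can receive and return collections rather than being limited to atomic column values.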
Pros • The definitions of LOAD, FOREACH, FILTER, and COGROUP are neat. I guess the Yahoo! team made a lot of modifications during the process of developing and using the Pig system. • The introduction of UDFs, the nested data model, and the Tuple-Bag data model really simplifies the job compared to calling MapReduce directly. • I am most excited about the Pig Pen debug system, since I believe it is very promising for data-driven tasks. Cons • Although the paper is well organized and presented, the contribution of the main idea is not as significant as it appears. Fundamentally, Pig Latin is just an interface for combining MapReduce functions. • The authors should give a more thorough analysis of Pig Latin versus SQL programming, not only for cloud computing but also for general databases. • The interface is mainly designed from the developer's viewpoint. Some functions, such as COGROUP or JOIN, might not be as efficient as in the Map-Reduce-Merge framework (SIGMOD 2007). • The implementation details of Pig Pen are not discussed. I am more interested in the debug environment for OLAP analysis and for machine learning problems, but I cannot find such a discussion in the paper. Paper 2: Wave Computing in the Cloud, B. He et al, HotOS 2009 This paper introduces the “Wave” model to handle the temporal relationship among queries. A novel concept, the “query series,” is defined for periodically updated input streams. The idea is simple, but it effectively models the temporal correlations and redundancy, which dramatically improves system performance. Pros: • The paper uses a case study to convincingly show that there are redundancy and load imbalance in the query streams. • The technique of query decomposition seems very insightful for locating the primary computational bottleneck. • The idea is simple but effective. Cons and potential improvements • There should be more experimental results. Currently we are not sure how well the wave model performs on general problems.
• In practical settings such as search engines and web logs, not only the queries but also the data arrive as streams. How to generalize the current paper to streamed data might be interesting. • It might be more powerful to combine the wave model with machine learning techniques, such as feature selection or Kalman filters. From: Sun Yu [sunyu9910@gmail.com] Sent: Monday, February 15, 2010 11:54 PM To: Gupta, Indranil Subject: 525 review 02/16 Sun Yu Dear Indy: I was travelling and planned to be back today, but my flight was cancelled due to bad weather, so I'm afraid I will miss the class tomorrow. Here is the review for tomorrow's topics. I'll turn in a hard copy later. Thanks. Best, Sun, Y. 1. DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language Most data-processing systems, such as MapReduce and Dryad, don't provide satisfactory programming interfaces. This paper aims to address this issue. DryadLINQ, as its name suggests, is a hybrid of declarative and imperative programming: it compiles LINQ programs into distributed computations on the Dryad system. The goal is to provide a programmer-friendly interface that conceals all the complexity due to the distributed nature of the underlying system, yet achieves high performance. The paper describes the structure of DryadLINQ and demonstrates its performance on a variety of benchmarks. Such a transparent programming interface for distributed systems may have many limitations, many of which are discussed in Section 7. Another potential problem is the predictability of performance: users who write a piece of code may expect predictable performance with no, or at least small, variation. Performance debugging also depends on the Dryad job manager to collect information in a centralized manner; is there any scaling problem with this?
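What "concealing the distributed nature" means can be pictured crudely: the programmer writes one sequential-looking query, and a small runtime partitions the data and applies it per partition. Nothing below uses real Dryad or LINQ APIs; it is an invented illustration of the idea only:

```python
# Toy sketch of automatic data parallelism: the "query" is written once,
# and a tiny runtime partitions the input, applies the query to each
# partition (in a real system, on different machines), and merges results.

def query(partition):
    # Sequential-looking user code: squares of the even numbers.
    return [x * x for x in partition if x % 2 == 0]

def run_distributed(data, n_partitions=3):
    parts = [data[i::n_partitions] for i in range(n_partitions)]
    pieces = [query(p) for p in parts]  # conceptually parallel
    return sorted(x for piece in pieces for x in piece)  # merge step

result = run_distributed(list(range(10)))
```

The user-visible code is just `query`; the partition count and merge strategy are hidden, which is also where the review's worry about performance predictability comes from.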
2. Wave computing in the cloud The authors introduce a new wave model for better utilizing computation resources and enhancing performance in data-intensive distributed systems. In current systems, there is significant redundancy in operations across queries, for example, input scans and common sub-query computation. Other issues, like temporal workload imbalance, also limit system performance and result in under-utilization of resources. The basic idea here is to exploit the redundancy: recurrent computations on a stream are defined as a query series. With this notion of a "query series," or pattern, we have some predictability in the system, which enables scheduling and combining shared operations. On the other hand, a cost model can be introduced on a statistical basis, opening up the possibility of using query optimization techniques in current distributed systems. It is also claimed that this model has practical significance since it can be enabled on top of current systems. One question: the authors mention that "data distribution within a stream tends not to change when the stream grows over time"; how is this justified? Also, is it possible to use some learning scheme to find more underlying patterns (some hidden features, maybe) of queries that can be effectively utilized? An adaptive scheme could also be interesting. From: Ghazale Hosseinabadi [gh.hosseinabadi@gmail.com] Sent: Monday, February 15, 2010 10:48 PM To: Gupta, Indranil Subject: 525 review 02/16 Paper 1: Pig Latin: A Not-So-Foreign Language for Data Processing In this paper, a new data processing environment (Pig), its corresponding language (Pig Latin), and a new debugging environment (Pig Pen) are introduced. Pig Latin combines the high-level declarative querying style of SQL and the low-level, procedural style of map-reduce. Pig Latin is used by programmers at Yahoo! for data analysis. Pig Latin is a dataflow language which benefits from a quick start, interoperability, and a nested data model.
Pig Latin also supports user-defined functions (UDFs). Pig Latin has four data types: Atom, Tuple, Bag, and Map. Using the LOAD command, input data files are converted into Pig's data model. FOREACH implements per-tuple processing. FILTER discards unwanted data. COGROUP gathers related data together. Pig Latin also allows some commands to be nested. STORE saves the result of a Pig Latin expression in a file. Pros: The objective in the design of Pig Latin is clearly stated, and the designed language achieves its objectives. Cons: A comparison between the performance of Pig Latin and map-reduce is missing. The amount of improvement that Pig Latin achieves by being implemented over Hadoop is not presented. Paper 2: Wave Computing in the Cloud In this paper, a model (called the wave model) for exposing the temporal relationship among queries in distributed computing is introduced. In wave computing, data is considered a stream that is periodically updated. The authors studied a query trace from a cluster. They looked at redundancy in input data scans as well as common sub-query computation. They also studied the temporal distribution of the load on the cluster. The window size of a query is equal to the size of the time window of the query's input on the stream. They found that as the window size increases, the success rate decreases. The authors also analyzed the predictability obtained from the similarities among the executions of queries in the same query series. When a query is executed, its characteristics are saved. Stream processing is then optimized through shared scans and computation, query decomposition, query planning, and query scheduling. Pros: The idea of considering the data as a stream is simple but interesting. Considering different forms of correlation in query processing is important for achieving better performance. Cons: No theoretical analysis is presented in the paper. It is not made clear how large the impact of the wave model is.
The paper doesn't have any evaluation/simulation section, so the performance of the wave model is not practically investigated. No comparison with other solutions in the literature is provided. From: Shehla Saleem [shehla.saleem@gmail.com] Sent: Monday, February 15, 2010 6:55 PM To: Gupta, Indranil Subject: 525 review 02/16 Wave Computing in the Cloud This paper focuses on a challenge commonly faced by large-scale clusters these days: having to execute a large number of queries on large amounts of data. The authors' motivation comes from an analysis of a production computing cluster, where their study reveals a lot of redundancy and high levels of load imbalance. The authors propose Wave, which derives its key ideas from log mining with some appropriate modifications. They introduce the notion of a query series, which refers to recurrent computations. This concept is used to bring some predictability into the system. The authors identify certain areas that offer opportunities for improvement and then propose some optimizations, mostly exploiting query characteristics and correlations. This is a simple paper, more like a presentation of certain ideas. It identifies some potential opportunities for building better data-intensive applications and provides some intuition on how to exploit them. I would, however, have liked to see some more results to validate some of the proposals. Pig Latin: a not-so-foreign language for data processing This paper introduces Pig Latin, a high-level programming language aimed at distributed systems running huge data-intensive applications. It is designed to sit conceptually in the middle between the low-level, rigid style of Map-Reduce and the high-level, declarative language of SQL. From the programmer's perspective, it is far more flexible than MapReduce and masks the underlying map-reduce operations from the programmer.
This increases code reusability, and that, combined with Pig Latin's interactive debugging environment, makes it very attractive for programmers. Also, the flexible, fully nested data model, the support for user-defined functions, and the ability to work with plain input files without any schema information all add to the strengths of the design. Moreover, the examples in the paper illustrate Pig Latin's applicability to real-world scenarios, which adds further to its promise. Pig Latin hides the intricacies of multiple MapReduce operations from the programmer, but I was wondering whether any optimizations are possible in terms of handling the data generated in the intermediate stages. Also, iterative control structures like loops are missing, which might make it hard to use for many high-performance computing applications. From: ntkach2@illinois.edu Sent: Monday, February 15, 2010 4:58 PM To: Gupta, Indranil Subject: 525 review 02/16 Nadia Tkach – ntkach2 CS525 – paper review 2 Cloud Programming Paper 1: Pig Latin: A Not-So-Foreign Language for Data Processing The authors of this paper propose a new language called Pig Latin that helps users manipulate and analyze large blocks of data. It is based on a step-by-step programming technique while maintaining the look and feel of declarative SQL-like querying (such as filtering, grouping, and aggregation) and the map-reduce procedural programming model. Essentially, it includes the Pig programming and compiling system, which is built on top of the Hadoop map-reduce implementation. Pig is an open-source project and available for public use.
Pros: • Can be used on a large scale to handle terabytes of data • Free, open-source project • Supports ad-hoc data analysis, debugging, and parallelism • User-defined functions are written in Java, with possible future support for other programming languages • The system creates “logical plans” as it processes the code and doesn't carry out any actions until the STORE operation is invoked by the user; the system can also avoid materializing intermediate data until a certain operation is invoked (especially useful when processing large data sets) Cons: • The system's implementation and wide usage might require prior user training • Supports only read-only data analysis workloads Paper 2: Wave Computing in the Cloud The paper describes the new Wave computing model for data processing. This model performs batch processing on streaming, continuously updated input data. Wave catalogues and analyzes the query processing operations and, using this data, finds the query series that share common computation. Once such common computations are identified, the model is able to predict future data operations and perform them across several query series as applicable, reducing the amount of computation and I/O operations. Additionally, Wave analyzes performance and load patterns and spreads the computation over time. The Wave model can improve performance and resource optimization, and minimize underutilization and resource exhaustion.
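The shared-computation idea in this summary can be pictured with a small sketch: if two queries in a series need the same sub-result over the same input window, it can be computed once and reused, saving a redundant scan. The log records and queries below are invented, not from the paper:

```python
# Toy sketch of sharing a sub-computation across queries in a series.
# Both "queries" need per-user request counts over the same input window,
# so the shared scan runs only once. All data here is invented.

log = [("alice", 200), ("bob", 404), ("alice", 200)]
scans = 0

def per_user_counts(records):
    # Shared sub-query: a single scan over the input window.
    global scans
    scans += 1
    counts = {}
    for user, _status in records:
        counts[user] = counts.get(user, 0) + 1
    return counts

shared = per_user_counts(log)           # computed once...

top_user = max(shared, key=shared.get)  # ...reused by query 1
total_requests = sum(shared.values())   # ...and by query 2
```

With naive execution, each query would rescan the input; the wave idea is that recognizing the recurring sub-computation lets the system amortize that scan across the series.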
Pros: • Reduces the number of redundant computation operations • Reduces load imbalance • Reduces the size of each query and can potentially minimize query failures Cons: • Wave is implemented on top of an existing system, but requires certain capabilities to be enabled in order to operate, such as a language for query manipulation, data and statistics cataloguing, and query-rewriting functionality • The paper doesn't provide any information on the model's implementation, evaluation, or results to back up the theoretical analysis From: arod99@gmail.com on behalf of Wucherl Yoo [wyoo5@illinois.edu] Sent: Monday, February 15, 2010 11:57 AM To: Gupta, Indranil Subject: 525 Review 02/16 Cloud Programming Review, Wucherl Yoo (wyoo5) Wave Computing in the Cloud, B. He et al, HotOS 2009 Summary: The authors claim to have observed significant redundant I/O and computation across individual queries, as well as load imbalances, in a cluster environment. They found that log data mining was a dominant workload. This workload shows wave-like patterns, since the log can be considered a stream that is periodically updated. The authors define the query series to refer to redundant computations among queries; it exposes the correlations among queries in the series. This makes recurring execution explicit so that it can provide predictability about the execution behavior of queries and data characteristics. Thus, redundant computations among queries can be combined to reduce the waste of resources. In addition, it can provide load balancing with increased predictability about resource usage. Pros: 1. Interesting observation about redundancies among queries in a cluster environment 2. As the authors mention, the wave model can be applied to other cloud environments, since the characteristics of their data resemble an appended log Cons: 1. Evaluation is not sufficiently strong; reality can be more complex and noisy, with heterogeneous workloads and resources. 2.
Although finding redundancy among sub-queries may be easy, finding it in the high-level languages of a cloud environment may not be easy, due to implicit dependencies among computations and data at different system layers. Pig Latin: a not-so-foreign language for data processing, C. Olston et al, SIGMOD 2008 (Yahoo!) Summary: Although map-reduce provides a simple procedural programming model, its two-stage data flow may be too rigid and low-level. In addition, the SQL-like declarative model may be too restrictive for a cloud environment. The authors propose a new language, Pig Latin, to fit into a sweet spot between the two models: while Pig Latin provides high-level data manipulation primitives, it exposes explicit execution steps. This step-by-step execution also helps with debugging the program via the provided interactive debugging environment, which can automatically generate intermediate example data. The Pig system builds a logical plan from Pig Latin scripts and compiles it into map-reduce jobs that can be run on Hadoop. Pros: 1. A higher-level (thus more convenient) and more expressive programming model than MapReduce, with explicit procedural steps compared to declarative SQL 2. System-level optimization is possible, and it may be no worse than programmer-specified optimization in a low-level language like MapReduce Cons: 1. More complexity can cause more bugs and performance loss 2. The programming model is tightly coupled with MapReduce, so flexibility is not much improved. 3. Overhead is incurred from the inflexibility of MapReduce – intermediate data must be materialized and replicated between successive MapReduce jobs. -Wucherl From: Ashish Vulimiri [vulimir1@illinois.edu] Sent: Monday, February 15, 2010 12:22 AM To: Gupta, Indranil Subject: 525 review 02/16 Pig latin: a not-so-foreign language for data processing, C.
Olston et al, SIGMOD 2008 The authors describe Pig Latin, a procedural data processing language that improves upon MapReduce by adding both additional control primitives (inspired by declarative languages like SQL) and richer data structures (nested tuples and associative arrays). They also describe the implementation of a development environment for Pig Latin, called Pig, that includes two tools: i) a compiler that can translate Pig Latin programs into optimized sequences of MapReduce jobs, suitable for execution on Hadoop; ii) a debugging environment called Pig Pen that generates test data sets and simulates the program on them to enable the user to identify potential problems. Comments: + Richer query language than map-reduce. In particular, primitives like COGROUP are interesting because they allow operations on subsets of the available tuples, instead of just treating them individually (map) or all at once (reduce). + The existence of a debugging environment, even one as simple as Pig Pen, is a major plus. 0 The primary reason they can simplify the SQL model is that they allow only read-only analysis. Allowing writes would require bringing back all four ACID properties of traditional databases, as well as all the complexity that entails. - I'm not sure I buy the argument that programmers are inherently uncomfortable with declarative, SQL-like languages -- especially given the current craze for building web applications for everything (web apps almost certainly have to interact with a database layer). - Maps: keys should be primitive values for efficiency reasons. But shouldn't hashing the keys be sufficient to handle more complex structures? - The debugging environment is limited. Not all complications that can arise in real datasets can be demonstrated via a simple test dataset. - No profiler -- cannot identify performance bottlenecks.
- The authors are limited by the artificial constraint that all queries must, in the end, be reformulated in terms of an efficient chain of map and reduce jobs. Are there natural parallel primitives that do not fit this model? DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, Yuan Yu et al, OSDI 2008 DryadLINQ is an alternative data query processing framework that uses the Dryad parallel execution framework as a provider for the LINQ query language included with the .NET framework from v3.5 onwards. (The LINQ query language is distinct from the framework, called the "provider", on which the query is actually executed. Some examples of providers other than Dryad include traditional RDBMSes, queried via SQL, and simple XML stores that are queried by a local execution engine.) The authors briefly describe the DryadLINQ execution model and then demonstrate several example programs -- including a MapReduce emulator, PageRank, and two standard machine learning algorithms. + Real, strongly typed data structures, as opposed to Pig Latin's ad hoc nested tuples, although this comes at the cost of some complexity. + Lazy evaluation -- sub-queries are only evaluated if actually needed. + More efficient joins than Pig Latin. - Integrated into programming languages like C# that have intrinsically different semantics. This causes issues such as the one with non-side-effect-free statements in Section 3.2. - Very limited debugging support -- the debug environment only handles outright failures. - As with Pig Latin, no profiler. Thanks, Ashish
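The COGROUP point raised in the Pig Latin comments above (operating on grouped subsets, rather than one tuple at a time or all at once) can be sketched in Python; the relations and field values below are invented, and this is only an illustration of the semantics, not Pig's implementation:

```python
# Toy sketch of Pig's COGROUP: group tuples from two inputs by a shared
# key, keeping each input's matches in its own nested bag instead of
# flattening them as a JOIN would. The data below is invented.

results = [("lakers", "nba.com"), ("lakers", "espn.com")]
revenue = [("lakers", 0.5), ("kobe", 1.2)]

def cogroup(left, right):
    keys = {k for k, _ in left} | {k for k, _ in right}
    return {
        k: (
            [t for t in left if t[0] == k],   # bag of matching left tuples
            [t for t in right if t[0] == k],  # bag of matching right tuples
        )
        for k in keys
    }

grouped = cogroup(results, revenue)
```

A UDF can then process each key's pair of bags as a unit, which is the middle ground between per-tuple map and whole-input reduce that the reviewer highlights.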