From: Nathan Dautenhahn [dautenh1@illinois.edu]
Sent: Tuesday, February 16, 2010 12:32 PM
To: Gupta, Indranil
Subject: 525 review 02/16

Paper Review: Pig Latin and DryadLINQ
Nathan Dautenhahn
February 16, 2010

1 Pig Latin: A Not-So-Foreign Language for Data Processing

1.1 Summary and Overview

Pig Latin is a programming language that provides a middle ground between declarative querying and procedural programming such as map-reduce. The primary problems that Pig Latin attempts to solve are the lack of general programming concepts in the standard SQL programming paradigm and the lack of flexibility in the map-reduce paradigm. The primary contributions described in this paper include the following:
• The development of a data flow language that allows for the sequential specification of a set of high-level actions. This is in contrast to the declarative nature of standard SQL syntax.
• The ability for a Pig Latin program to execute actions out of order if no side effects exist from the action. This frees the programmer from focusing on the parallelization of his code and allows optimization to be performed.
• Flexible data import and export policies.
• A nested data structure programming model.
• The ability for user-defined functions to perform custom processing of queries.
• A novel and robust debugging environment.
• The concept of a dynamically constructed side data set, namely, the sandbox data set.

1.2 Comments and Criticisms

The following are my primary criticisms of the paper:
• The read-only nature of Pig.
• Forced parallelism.
• There is a lack of experimentation.
• Overall this paper felt more like a technical report about the development of a production product, and less like a full research paper. The authors appear to be less focused on the research portions of the presentation. This is in contrast to DryadLINQ, which is highly focused on well-formed prose and presentation.
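The dataflow style described in 1.1 (a sequence of named steps rather than a single declarative query) can be sketched in plain Python. This is only an illustration of the style, not Pig Latin syntax; the field names and records are invented for the example.

```python
# Hypothetical sketch of Pig Latin's step-by-step dataflow style
# (LOAD -> FILTER -> GROUP -> FOREACH/GENERATE) in plain Python.

records = [
    {"user": "alice", "url": "a.com", "time": 1},
    {"user": "bob",   "url": "b.com", "time": 2},
    {"user": "alice", "url": "c.com", "time": 3},
]

# FILTER: keep only events after time 1
filtered = [r for r in records if r["time"] > 1]

# GROUP BY user: produces nested "bags" of records per key,
# mirroring Pig Latin's nested data model
groups = {}
for r in filtered:
    groups.setdefault(r["user"], []).append(r)

# FOREACH ... GENERATE: compute a per-group aggregate
counts = {user: len(bag) for user, bag in groups.items()}
print(counts)  # {'bob': 1, 'alice': 1}
```

Each step names its result, so a later step (or an optimizer) can reorder or fuse steps that have no side effects, which is the property the second contribution bullet relies on.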
2 DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

2.1 Summary and Overview

This paper discusses the development and implementation of DryadLINQ, a programming system that executes sequences of LINQ expressions performing arbitrary dataset transformations. The primary goal is to provide a layer of abstraction that allows the programmer to express a high-level task and have that task be automatically parallelized and executed in Dryad. The primary problem they are attempting to solve is the poor programming interfaces of current parallel data processing systems such as Dryad and MapReduce. They provide abstractions that allow the programmer to interface with the data in a way that is similar to their natural programming styles, in contrast to the current SQL-style programming of large-scale dataset manipulation. The primary contributions and things I liked about the paper include:
• Iterative functionality in data operations.
• A hybrid of declarative and imperative programming languages.
• Automatic optimization of DryadLINQ programs.
• Improvements upon LINQ's support for high-level parallelization.
I really liked how their approach is not only to create a programming language/paradigm, but also to create a language that is easily included in other major object-oriented languages. For example, a program written in C# can perform all of the normal operations that an application needs to do, but then offload all massive dataset transformations to DryadLINQ without needing to understand the parallelization technology. One of my primary concerns with this project is the escalation of complexity being added to simple dataset operations. Adding more abstractions appears to be a great thing, but I think that eventually this will limit the flexibility and ease of use of these types of programming paradigms.
3 Common Themes

The primary theme here is the development of an abstraction layer that enables programmers to: not explicitly program parallelization of their work, use common programming constructs that are more easily used than SQL interfaces, and integrate automatic optimization into the dataset transformations. One question I have is: why is it important to denote that these programs are a way to perform sequential operations?

From: Rini Kaushik [rinikaushik@yahoo.com]
Sent: Tuesday, February 16, 2010 12:25 PM
To: Gupta, Indranil
Subject: 525 review 02/16
Attachments: review_0216.txt

Hi Indy, Please find attached my review for today's papers. Thanks, Rini

From: Shivaram V [shivaram.smtp@gmail.com] on behalf of Shivaram Venkataraman [venkata4@illinois.edu]
Sent: Tuesday, February 16, 2010 12:21 PM
To: Gupta, Indranil
Subject: 525 review 02/16

Shivaram Venkataraman
Feb 16 2010

1. Wave Computing In the Cloud

This paper proposes a new model, 'Wave', for expressing queries run in datacenters on periodically generated logs. Many queries share the same input or computation, and expressing them in an appropriate model would help in optimizing throughput and cluster usage. The data is considered as a stream in this model, and queries consist of different 'query series', each of which is a set of repeated computations. This model helps express the correlation between queries which share the same computation on the same input stream. The major issues identified by the authors from execution logs on a production cluster are:
- Redundancy of input and intermediate computation. They found that about 33% of total I/O is redundant among all query executions and that 30% of the queries share at least one step of computation.
- There exists a load imbalance on the cluster due to jobs not running over the weekend and monthly jobs running only at the end of a month. Calculating intermediate daily aggregates and re-using them would provide uniform resource usage on the cluster.
- As the window of input for a query increases in size, the probability that it fails increases considerably, and this further motivates the need to run queries on smaller inputs.
This model helps identify many research problems about handling queries efficiently. Query decomposition techniques would help uncover queries which share input, and they can be scheduled appropriately to balance the load on a cluster. Such scheduling schemes however need to take into account machine failures and also need to handle specific queries which may require immediate response. Similar to database systems, query plans could be designed to optimally execute the query taking into account the number and location of the machines available. Preliminary studies suggest that having a declarative high-level language which allows users to specify their queries in terms of predefined operators allows greater opportunity for optimization.

Pros:
- Observations based on production cluster data from Microsoft.
- Log analysis represents one of the most widely found computations in datacenters, and optimizing it would have a great impact on performance and resource usage.
- Presents many research directions for scheduling on datacenters, and these are more relevant as we move towards a shared-cloud model.

Cons:
- The model is restricted to a specific type of computation on a specific type of input.
- Discussion on how hardware developments like SSDs could affect such models would have been interesting.

2. DryadLINQ

DryadLINQ comprises language extensions that enable a user to express distributed, data-intensive computations in a high-level imperative language, and a system to efficiently convert such programs into Dryad computations. Based on LINQ (Language Integrated Query, a .NET construct), DryadLINQ provides a flexible language that can make use of existing .NET types and libraries.
It differs from other data processing languages like Pig Latin and SQL in that it supports traditional programming structures like loops, functions and libraries. When a user's .NET application runs, a DryadLINQ expression object is created, but its evaluation is deferred until the application requests the output of the execution. At this point DryadLINQ compiles the expression into a Dryad execution plan, generates the code that will run at each Dryad vertex, and submits these to the job manager. After the job completes, the job manager returns control to DryadLINQ and the output data is made available to the user. Two of the more powerful constructs in DryadLINQ are the Apply and Fork operators. These can be used by programmers when they wish to perform arbitrary computations over multiple streaming computations. As the system has no control over these computations, programmers need to use annotations which indicate to the compiler how the computation can be parallelized. The DryadLINQ compiler performs static optimizations like pipelining multiple operations into a single process, removing redundancy, and eager aggregation to reduce the amount of data transferred. It can also use the Dryad API to dynamically mutate the execution graph and increase the number of vertices based on the progress of the job. These optimizations result in an execution plan which is efficient for most jobs.

Pros:
- Great debugging support leveraging existing .NET tools.
- Having a strongly typed language helps catch many errors.
- The ability to use traditional constructs like for-loops makes programming easier.

Cons:
- Not much insight into the performance profile of a job. Mentioned as work in progress by the authors.
- DryadLINQ does not check or enforce the absence of side-effects due to any object shared during computation. This may catch some users by surprise.
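The deferred-evaluation behavior described above (the expression object records the query, and nothing executes until the output is requested) can be sketched in a few lines of Python. The class and method names here are hypothetical, not DryadLINQ's actual API; in the real system, requesting the output triggers compilation to a Dryad plan rather than local execution.

```python
# Minimal sketch of DryadLINQ-style deferred evaluation: building a
# query only constructs an expression object; the recorded plan is
# "compiled" (here, just run in one pass) when output is requested.

class Query:
    def __init__(self, source, ops=()):
        self.source, self.ops = source, ops  # ops recorded, not run

    def where(self, pred):
        return Query(self.source, self.ops + (("where", pred),))

    def select(self, fn):
        return Query(self.source, self.ops + (("select", fn),))

    def to_list(self):                       # evaluation happens here
        data = iter(self.source)
        for kind, f in self.ops:
            data = filter(f, data) if kind == "where" else map(f, data)
        return list(data)

q = Query(range(10)).where(lambda x: x % 2 == 0).select(lambda x: x * x)
# No work has happened yet; q is just a plan of two recorded operators.
print(q.to_list())  # [0, 4, 16, 36, 64]
```

Because the whole plan is visible before anything runs, an optimizer can pipeline or reorder operators, which is exactly the opening the static optimizations above exploit.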
Interesting points:
- A mapreduce program can be expressed within 10 lines of DryadLINQ code.
- As storage moves from spinning disks to solid state, the advantages of streaming systems like Dryad and MapReduce will diminish.

From: pooja agarwal [pooja.agarwal.mit@gmail.com]
Sent: Tuesday, February 16, 2010 12:12 PM
To: Indranil Gupta
Subject: 525 review 02/16

DS REVIEW 02/16
By: Pooja Agarwal

Paper - DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing using a High-Level Language
Authors - Y Yu, M Isard, D Fetterly, M Budiu, U Erlingsson, P K Gunda, J Currey
Conference - OSDI 2008

Main Idea:
This paper presents DryadLINQ, a set of language extensions that can be used to transparently compile programs into distributed parallel computations running on the Dryad system architecture. It is based on LINQ, a set of .NET constructs, and provides a hybrid of declarative and imperative programming. The main tasks of DryadLINQ comprise compiling LINQ expressions into a Dryad execution plan graph (EPG), decomposing the LINQ expression into sub-expressions which can be assigned to different Dryad nodes, performing static and dynamic optimizations on the constructed EPG, generating static data and code to be run on the Dryad nodes, and generating serialization code for data types. It also keeps track of the current jobs in the system by using a job manager. The authors have also extended the LINQ expressions by adding a few new operators, and evaluated the performance of the system against various applications like SkyServer, PageRank, and TeraSort.

Pros:
1) Using DryadLINQ, users are able to use more complex constructs like functions, loops, modules and libraries which are otherwise not supported in general query processing languages like SQL.
2) Provides an integrated programming environment by combining the power of both LINQ and Dryad.
3) Makes use of object-oriented programming languages, providing a cleaner interface and easy extensibility.
4) Provides features to allow reuse of common sub-expressions and avoid recomputation.
5) Provides dynamic optimizations for aggregation based on the topology to efficiently reduce the I/O constraints.

Cons:
1) A learning curve is required to start programming in DryadLINQ.
2) The optimizer does not provide other basic optimization techniques or parameters which would be helpful to users. Currently, the users need to implement them.
3) Performance debugging is not currently supported.
4) Requires side-effect free expressions and can lead to erroneous results if shared objects are modified.

Paper - Wave Computing in the Cloud
Authors - B He, M Yang, Z Guo, R Chen, W Lin, B Su, H Wang, L Zhou
Conference - HotOS 2009

Main Idea:
The paper describes a Wave model that makes use of the correlation among temporal and recurring computations to achieve better performance and resource utilization. It defines files as streams and recurrent queries as query series that operate on the streams. The correlation is based on the same data being required by different streams, or on the reuse of computations done on the same data which might be required at future times in the query series. Due to these correlations, different tasks can be scheduled to occur simultaneously, which can lead to reusing computations or sharing common resources. The recurring nature of some queries can also provide key insight into query execution behavior and data properties, which can be utilized for better prediction.

Pros:
1) Reduces redundancy of computations or I/O by exploiting correlations among query streams and data streams.
2) Provides load balancing by decomposing queries into smaller sub-queries.

Cons:
1) It does not provide any algorithm for the main ideas, such as how scheduling of queries is done based on the prediction information.
It's not sufficient to say that scheduling can be done based on the predictions, as in complex systems it could be hard to optimize scheduling based on the predictions.
2) Lacks evaluation against a variety of applications, such as applications which are already optimized by design to take care of the redundancy.
3) It lacks an evaluation of the tradeoff between query decomposition and query aggregation, as both of them have certain advantages but are orthogonal to each other.
4) The time and computational complexity of the scheme is not clear.

From: Giang Nguyen [nguyen59@illinois.edu]
Sent: Tuesday, February 16, 2010 11:58 AM
To: Gupta, Indranil
Subject: 525 review 02/16

Giang Nguyen

Wave Computing in the Cloud

The authors observe 20,000 successful data-intensive queries totaling 29 million machine hours on 140 data streams that are updated daily or monthly. There is redundancy in reading the input data streams ("143 streams accessed around 40 thousand times. The top ten accessed streams have around 75% of the total number of accesses"), load imbalance caused by the input data window and day of query submission, and an inverse relationship between query input window size and success rate. The authors propose a Wave computing model where the system collects statistics (input/output data size/distribution, complexity of the operation, and cluster execution environment such as network topology) about the execution of each query and stores these statistics to enable optimizations. The statistics will allow the system to read commonly accessed input streams fewer times, to perform better query planning and scheduling, etc. As the paper says, there appear to be great opportunities, most obviously with the shared scans of input data. However, the hard part is to convert the collected statistics into a model that can automatically optimize query planning and execution. As such, the paper does not have a proposal to solve that problem.
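One of the Wave observations above, re-using intermediate daily aggregates across recurring queries, can be sketched simply: compute each day's partial result once, cache it, and answer overlapping query windows by combining cached partials instead of rescanning the raw stream. All names and data below are invented for illustration; the real system would persist partials in the distributed file system rather than in a dictionary.

```python
# Sketch of the "reuse intermediate daily aggregates" idea: the first
# query over a day pays the scan; later queries over any window that
# includes that day reuse the cached partial sum.

daily_logs = {
    "2010-02-14": [3, 5],
    "2010-02-15": [2],
    "2010-02-16": [7, 1],
}

aggregate_cache = {}  # day -> partial sum, computed at most once

def daily_sum(day):
    if day not in aggregate_cache:           # first query pays the scan
        aggregate_cache[day] = sum(daily_logs[day])
    return aggregate_cache[day]              # later queries reuse it

# Two overlapping query windows share the cached work for Feb 14-15.
q1 = daily_sum("2010-02-14") + daily_sum("2010-02-15")   # 2-day window
q2 = sum(daily_sum(d) for d in daily_logs)               # 3-day window
print(q1, q2)  # 10 18
```

This is also what smooths the load imbalance: the per-day work is spread across the days the data arrives, rather than bunched into the day a large-window query runs.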
Pig Latin: A Not-So-Foreign Language for Data Processing

For large-scale data analysis, the authors say that programmers prefer the procedural style of the map-reduce model over the declarative style of SQL. However, the map-reduce model is too low-level and rigid, claim the authors, which leads to large amounts of custom code that is hard to maintain and reuse. Thus the authors propose a new language called Pig Latin that is procedural but high-level (in the spirit of SQL), with built-in filtering, grouping, and aggregating operators. Other important features of Pig Latin are "a flexible, fully nested data model, extensive support for user-defined functions, and the ability to operate over plain input files without any schema information." It also has a novel debugging environment. As the data analysis workloads are "read-only", Pig Latin doesn't need schema information and also doesn't need to curate the data. The high-level SQL-like operators also allow the Pig system to better optimize queries where possible. A Pig Latin program is compiled into map-reduce programs. The part that intrigues me most is the Pig Pen sandbox data set generator. I think the ability to automatically generate comprehensive data sets to test user commands is very valuable. However, the details of the algorithm are not included in the paper, so it's not clear how good of a job it does.
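The compilation step mentioned above (a Pig Latin program becomes map-reduce jobs) can be illustrated with a toy map-reduce runner. This is a simplified sketch, not Pig's actual compiler: it shows how a GROUP-then-COUNT step maps onto a single map-reduce job, where map emits (key, 1) pairs, a sort stands in for the shuffle, and reduce sums per key. The record format is invented.

```python
# Toy illustration of compiling a GROUP + COUNT step to map-reduce:
# map emits (key, 1), the "shuffle" groups pairs by key, and reduce
# sums the values for each key.

from itertools import groupby

def map_phase(record):
    yield (record["url"], 1)                 # emit key-value pairs

def reduce_phase(key, values):
    return (key, sum(values))                # aggregate per key

def run_mapreduce(records):
    pairs = [kv for r in records for kv in map_phase(r)]
    pairs.sort(key=lambda kv: kv[0])         # stand-in for the shuffle
    return [reduce_phase(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=lambda kv: kv[0])]

clicks = [{"url": "a.com"}, {"url": "b.com"}, {"url": "a.com"}]
print(run_mapreduce(clicks))  # [('a.com', 2), ('b.com', 1)]
```

A multi-step Pig Latin program compiles to a chain of such jobs, which is also where the materialization overhead between successive jobs (noted by other reviewers) comes from.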
From: Kurchi Subhra Hazra [hazra1@illinois.edu]
Sent: Tuesday, February 16, 2010 11:18 AM
To: Gupta, Indranil
Subject: 525 review 02/16

DryadLINQ: A System for General Purpose Distributed Data-Parallel Computing Using a High-Level Language
-------------------------------------------------------------------------------------------------------

Summary
--------------
This paper demonstrates DryadLINQ, a system that exploits LINQ and the Dryad execution platform to facilitate distributed computations that execute efficiently on large computing clusters. A user's .NET application creates a DryadLINQ expression object and hands it over to DryadLINQ, which converts this into a distributed Dryad execution plan. To do this, the Dryad job manager is invoked, which creates an Execution Plan Graph (EPG) for the current job. The EPG is a directed acyclic graph and the framework of the Dryad data-flow graph that will be executed, where each node is an operator and edges represent its inputs and outputs. Each Dryad vertex then executes its designated job as per the EPG. Control then goes back to the user, who can read the results as .NET objects. The system uses greedy heuristics for static optimizations, and also deploys dynamic optimizations by rewriting the EPG depending on run-time data statistics. The writers also demonstrate that the use of DryadLINQ in various applications, like Terasort, Skyserver and Pagerank, shows promising performance.

Pros--
--------
-- DryadLINQ automatically and transparently translates the data-parallel portions of a program into a distributed execution plan, which gives the programmer the illusion of writing for a single computer and has the system deal with the complexities that arise from scheduling, distribution and fault-tolerance.
-- It scores over languages like SQL, which are unsuitable for parallel data-intensive tasks like machine learning, and Map-Reduce, where no automatic optimizations take place, by combining the good points of both.
-- Dynamic optimization and virtualization employed in DryadLINQ allow it to run plans requiring many more steps than the instantaneously available computation resources would permit.
-- In order to reduce the latency introduced by network reads, every node compresses data before sending it out to a different node. However, network reads still remain a bottleneck.
-- The results shown via experiments are promising for an automatic system.
-- Dryad has been used in production clusters for several years now, hence the runtime system is tried and tested, and guarantees efficient and reliable execution.

Cons--
---------
-- DryadLINQ uses virtualization that allocates resources independent of the actual cluster used for execution, which requires intermediate results to be stored to persistent media, thus increasing latency.
-- The fact that DryadLINQ expressions must be side-effect free implies that shared objects cannot be modified, which can become a stringent restriction for many applications.
-- The system uses a centralized job manager, which will clearly become a bottleneck for large clusters and inhibit system scalability.
-- The system uses LINQ as the language platform. However, I am not very sure about the popularity of LINQ and how widely it is used. This might inhibit widespread use of the DryadLINQ system too.
-- Dryad and DryadLINQ are specialized for streaming computations and hence are inefficient for applications requiring random accesses. In fact, I feel that they use a similar framework for all applications. Certain parallel applications may not fit into this framework, and the system is not intelligent enough to modify the framework according to the needs of the application.
Wave Computing in the Cloud
---------------------------------------

Summary
--------------
In this paper, the writers introduce a new concept called Wave computing that exploits the temporal relationship among queries in data-intensive distributed computing. This model captures the key properties of log data mining. The writers, through a survey of a query trace obtained from a production cluster, demonstrate the common trends seen in such systems. For example, the computations performed across queries during different times have a redundancy of 30%. Load imbalances are common too, where machine time during weekdays is 50% higher than that during weekends. Besides, the queries with larger input time windows fail more often due to resource contention or exhaustion. In order to do away with these problems, the writers introduce the notion of streams and query series. Data is modelled as an append-only stream that is constantly updated and is distributed across various machines. The term query series is used to refer to recurrent computations on a stream, with each performed on one or more stream segments. The writers, in their survey, show that queries can be grouped into a number of query series, such that queries belonging to a query series have a lot of similarities. This can be utilized for predicting the behaviour of later queries that can be grouped into an existing query series. The writers also point out a number of improvements that can be introduced into the system as a whole in the form of cross-query optimizations, based on a history of queries being executed by a system.

Pros-
--------
-- The writers try to shift research directions from individual queries to system utilization in large clusters that compute a massive number of queries each day. With the growing popularity and use of clusters, this idea is simple, useful, and novel.
-- The proposed model is practically feasible since it can be built on top of existing systems by extending their present capabilities.
-- The paper introduces many open-ended interesting problems on distributed query optimization that can trigger a lot of good research.

Cons--
----------
-- This is mainly a theoretical paper that introduces one to a new model. The writers propose some new problems and their possible solutions using the wave model, but do not have any experimental results to back up their claims.
-- The concept is based on a survey of queries being executed in a production system. I am not sure how well the notion of query series will hold in other data-intensive distributed computations.

Thanks,
Kurchi Subhra Hazra
Graduate Student
Department of Computer Science
University of Illinois at Urbana-Champaign

From: Fatemeh Saremi [samaneh.saremi@gmail.com]
Sent: Tuesday, February 16, 2010 11:12 AM
To: Gupta, Indranil
Subject: 525 review 02/16

Paper 1: Pig Latin

Pig Latin is a new language designed for the analysis of extremely large data sets; it sits between the declarative style of SQL and the low-level, procedural style of map-reduce. The sheer size of these data sets leaves no other way except storing and processing them on highly parallel systems. While parallel database products utilizing simple SQL queries provide some solutions, using these products at web scale is extremely expensive. Besides that, programmers prefer writing scripts or code rather than writing unnaturally declarative queries in SQL, and that's why the more procedural map-reduce programming model has been successful. On the other hand, the map-reduce model has its own set of limitations: its one-input, two-stage data flow is extremely rigid and results in large portions of custom code which are difficult to reuse and maintain.
To this end, Pig Latin eliminates the most problematic aspects of these two extremes, combining high-level declarative querying (in the spirit of SQL) with low-level, procedural programming (in the spirit of map-reduce). Pig, the system developed around Pig Latin, compiles the language into physical plans that are executed over Hadoop, the open-source implementation of map-reduce. In addition, a novel debugging environment, Pig Pen, is presented for Pig, which has the ability to freeze the execution of a program prefix so the user can add further commands, and then continue executing the extended program without losing the progress made so far.

Pros:
- High-level declarative querying as well as low-level, procedural programming
- Easy to reuse and maintain
- Support for a flexible, fully nested data model which is closer to how programmers think and conformant to the way data is stored on disk
- Extensive support for user-defined functions
- Ability to operate over plain input files without any schema information
- Novel interactive debugging environment with facilities for writing a program in an incremental fashion
- Implemented open-source accompanying system which allows different systems to be plugged in
- Flexibility in the execution order of the operations
- Quick start and easy interoperability with other applications
- Appropriate selection of the language primitives (primitives that cannot be parallelized are excluded, though they can be defined and added by users)
- Programs written in Pig Latin are easier to optimize compared to SQL

Cons:
- More redundant in commands, compared to SQL
- Not easy to understand the functionality (semantics) of a query by a quick look at its syntax (due to redundancy)
- Questionable efficiency of implementation due to grouping operations which might result in gigantic tuples of nested bags that are bigger than main memory
- Considerable overhead while compiling Pig Latin into map-reduce jobs, due to the inflexibility
of map-reduce primitives, which forces data to be materialized and replicated on the distributed file system between successive map-reduce jobs

Paper 2: Wave Computing

This paper introduces a simple but valuable new model, Wave Computing, that lies between two different processing models: traditional batch processing and stream processing. This recently proposed model exposes the temporal relationships among the queries in data-intensive distributed computing and proposes to use these relations, which are of a recurrent nature, to improve performance and resource utilization of the system. It specifically has been investigated through the study of a query trace with queries written in SCOPE, on around 140 data streams obtained from a production cluster. The trace contains nearly 20 thousand successfully executed queries, taking a total of 29 million machine hours. While the data is considered as streams that are being updated, the updates are persisted and available, which results in the periodic processing on the stream being of the batch type. However, unlike batch processing, which looks at individual queries, the wave computing model defines series of correlated queries, called query series. A query series captures a sequence of the same computation on different sets of segments of the same stream and explicitly exposes the correlations among the queries in the query series in terms of both data and computation. Queries in different query series might share the same I/O to scan the input data and might even share common computation. Those queries could be scheduled to run together as a single combined query by removing redundancies. The wave computing model is particularly compatible with today's systems, in which the main portion of the workload has the mentioned property, and therefore the model can be enabled on top of existing systems.
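The "single combined query" idea just described (queries that share the same input scan are merged so the stream is read once) can be sketched concretely. The data and the two query definitions below are invented for illustration; a real system would derive the combination from the query plans.

```python
# Sketch of removing redundant I/O across correlated queries: two
# queries over the same stream segment are merged into one scan,
# each folding in its own aggregate as records go by.

stream = [("a", 4), ("b", 1), ("a", 2), ("c", 9)]

def combined_scan(segment):
    total = 0          # query 1: total volume over the segment
    per_key = {}       # query 2: per-key record counts
    for key, value in segment:   # single shared pass over the input
        total += value
        per_key[key] = per_key.get(key, 0) + 1
    return total, per_key

print(combined_scan(stream))  # (16, {'a': 2, 'b': 1, 'c': 1})
```

Run separately, the two queries would scan the segment twice; combined, the I/O cost is paid once, which is the redundancy the trace study quantified at roughly a third of total I/O.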
The model helps significantly improve performance and resource utilization, the load balance of the system (through query decomposition, though this is not applicable to all queries), and the success rate of queries (by reducing the size of each individual query). In addition, the wave model also improves fault tolerance by using stored previous data to predict results when a failure happens and the processing behavior (data and computation) is the same as before. This way, the wave model unlocks the full power of data-intensive distributed computing.

The paper is only an overview of the idea. Though the idea sounds workable and useful, what happens in practice is not fully predictable. There are a lot of issues and details that govern the effectiveness of the model, e.g., how accurate and efficient an oracle for this model might be. Designing and implementing modules like the query decomposer, query planner and query scheduler involves details that noticeably affect the efficiency of the idea. The other issue worth discussing in the paper is the accuracy of the prediction model, and under which conditions enabling predictions and operating based on the wave model is efficient and beneficial.

From: mukherj4@illinois.edu
Sent: Tuesday, February 16, 2010 11:03 AM
To: Gupta, Indranil
Subject: 525 Review 02/16 Cloud Programming

Pig Latin: A Not-So-Foreign Language for Data Processing: by Olston et al.

Pig Latin has been described as a sweet spot between the declarative style of SQL and the low-level procedural language style of Map-Reduce. It is executed over Hadoop, which is a free and open-source version of Map-Reduce. Pig Latin comes with a "novel" debugging environment, as claimed by the authors.

Features/Characteristics of Pig Latin:
As claimed by the authors, the features that differentiate Pig Latin from other programming paradigms are as follows:
1.
It is meant for cloud programming, i.e., for distributed systems, especially parallel database applications.
2. It was developed mostly in the Yahoo Research Group on top of Hadoop, although in the later part of the paper the authors express the view that, in principle, Pig Latin could be compiled into Dryad jobs.
3. [Advantage] Writing a program using Pig Latin is equivalent to specifying a query execution plan (i.e., a data flow graph), giving more understanding of and control over how the query will be executed. [Disadvantage] More control and freedom come at the expense of more programming skill; experienced programmers can best exploit the benefits of Pig Latin. The underlying assumption is that automatic query optimization techniques are not sufficient for distributed database systems.
4. [Advantage] It has a flexible, fully nested data model and allows complex, non-atomic data types. A nested data model is closer to how programmers think.
5. [Advantage] Pig Latin has extensive support for user-defined functions (UDFs).
6. [Disadvantage] Pig UDFs are written in Java only as of now, although it is mentioned that the authors are trying to develop support for UDFs written in other languages.
7. [Advantage] It supports out-of-order execution, i.e., the operations specified do not need to be performed in sequence. The example in Section 2.1 of determining spam URLs with high page rank depicts the benefit, but how the sequence of execution is chosen is not made very clear in the paper, i.e., when execution will be in-order and when it will be out-of-order.
8. Unlike conventional databases, transactional consistency and index-based look-ups are not required while programming using Pig.
9.
It comes with a novel interactive debugger which is capable of generating a tailored dataset (the sandbox data set) to facilitate testing, in order to minimize the time (and probably effort) needed to develop applications using Pig Latin. Many debuggers support interactive debugging, but on a distributed system, debugging a parallel application is a non-trivial job. More comments: In Section 2.5, the authors mention carefully chosen primitives, but not enough detail is provided. Also, few examples and test results (on benchmark problems or data-intensive applications) are provided in this paper to support the authors' arguments/claims. It seems to be a good option for analyzing data via ad-hoc queries. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language: by Yu et al.: DryadLINQ combines a distributed execution engine (Dryad) with the .NET Language Integrated Query (LINQ) extensions, enabling a new programming model for large-scale data-parallel applications running on large PC clusters. LINQ is a set of extensions to the .NET Framework that encompass language-integrated query, set, and transform operations. It extends C# and Visual Basic with native language syntax for queries and provides class libraries to take advantage of these capabilities. Dryad is a high-performance large-scale execution engine. This paper makes the following contributions: demonstrating a new hybrid of declarative and imperative programming suitable for large-scale data-parallel applications; demonstrating automatic optimization using DryadLINQ; and a small set of operations to improve LINQ support. Features: DryadLINQ exploits LINQ, converting the raw LINQ expression into an execution plan graph. [Advantage] It is easy to develop applications, as it gives the illusion of developing an application for a single computer, and does not require much expertise in order to exploit parallelism.
[Disadvantage] The programmer does not have control over the low-level primitives, so parallelization depends solely on how well DryadLINQ can exploit LINQ. It reduces the programmer's effort at the cost of less control and possibly poor application-specific optimization. [Advantage] DryadLINQ provides hints for programmers to optimize beyond the automatic optimizations by specifying annotations of various kinds. [Advantage] It produces good automatic execution plans for LINQ-based applications. Comments: Experiments are done on benchmark problems, hence acceptable. But the scaling results do not say much about weak or strong scaling for all the benchmarks. For sorting, they show that execution time does not grow much when sorting more numbers on more machines, whereas for SkyServer they report speedups obtained by keeping the problem size the same and varying the number of processors. So some results may not be impressive enough to publish. An analogy of DryadLINQ to OpenMP-style programming and Pig Latin to MPI can be justified: with OpenMP it is easy to parallelize, although not much speedup is possible since programmers have less control, whereas MPI requires more expertise and effort to develop applications but is suitable for large clusters as it gives better speedups. From: Vivek [vivek112@gmail.com] Sent: Tuesday, February 16, 2010 10:40 AM To: Gupta, Indranil Subject: 525 review 02/16 DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language This is a system including an execution environment and language support for large-scale distributed computation. It combines the best of declarative and imperative languages. Data-parallel portions of the code can be automatically made to run in a distributed cluster. DryadLINQ uses LINQ, which is a set of SQL-like programming constructs for programming with large sets of data. One key feature is that it uses virtualized "expression plans".
In the paper, DryadLINQ is demonstrated particularly on high-performance distributed computation applications (e.g., TeraSort, large-scale machine learning, etc.). Pros: - Can be used for clusters of very large scale - Generalizable to many different applications (unlike Facebook's Hive, for example, which might be specific to social-networking sites) - Allows for both static and dynamic optimizations - Provides a hybrid of declarative and imperative programming, and does not rely on the user's knowledge of pure SQL Cons: - The general issue with DryadLINQ is that there is too much automated optimization going on behind the scenes, and there seems to be very little parametrization and user-defined functionality for specific applications. - Random access seems to be a bigger issue than the way it is presented. While improvements can be made for random accesses, the underlying structure and design may need to be rethought for such support to be added. - Perhaps the most important downside is that DryadLINQ is not widely distributed and not open-source -- and this (in my opinion) has particularly slowed its general adoption among different user communities. Only a small community of users familiar with Microsoft's programming environment has been using it thus far. This doesn't seem to help accelerate development, particularly as compared to Pig Latin. Pig Latin: A Not-So-Foreign Language for Data Processing Core Idea: In this paper, a language called Pig Latin is introduced as a solution for the analysis of large data sets such as web crawls, click streams, etc. It is geared towards many internet companies such as Amazon, Yahoo!, and Google and is implemented on top of Hadoop. It combines the recently emerging map-reduce style of programming in cloud computing with the more traditional SQL-style programming.
Rather than simple standard query expressions written in SQL, Pig Latin allows one to explicitly define and control the sequence of expression execution. The argument is that allowing programmers to write such data-intensive applications as a sequence of steps is much more appealing than forcing the system to use a particular plan through optimization flags or hints. Pros: - Perhaps the most characteristic feature and advantage of Pig Latin is its approach to and support for user-defined functions (UDFs). This allows programmers to customize and fine-tune Pig Latin programs so that they work in specific domains. - Another key advantage of Pig Latin is that it is open-source. This allows it to be developed and updated very frequently based on programmer needs. Unlike DryadLINQ, Pig Latin has been steadily establishing a much larger community than just the high-performance distributed computation community. Cons: - One issue with Pig Latin is the need for a single programming language environment rather than multiple different ones. Pig Latin programs have to be enclosed in strings, and this makes syntax checking at the static level nearly impossible. For example, the development environment cannot tell the programmer whether a variable is misspelled or whether an operator is applied to a literal incorrectly. - Following up on the previous issue, Pig Latin seems to be primarily integrated with Java. But what about other, more performance-oriented languages such as C++?
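The step-by-step dataflow style this review describes can be pictured in plain Python; this is only a rough sketch with made-up data and a hypothetical `is_spam` UDF, not the paper's actual example or syntax:

```python
# Toy sketch of a Pig-Latin-style dataflow: each step names its result,
# mirroring LOAD -> FILTER -> GROUP -> FOREACH ... GENERATE.
# The url tuples and the is_spam() UDF are invented for illustration.

def is_spam(url):
    # Hypothetical user-defined function (UDF).
    return "spam" in url

# LOAD: (url, category, pagerank) tuples.
urls = [
    ("a.com/spam", "news", 0.9),
    ("b.com", "news", 0.4),
    ("c.com/spam", "sports", 0.7),
    ("d.com", "sports", 0.2),
]

# FILTER urls BY is_spam(url) AND pagerank > 0.5
filtered = [t for t in urls if is_spam(t[0]) and t[2] > 0.5]

# GROUP filtered BY category
groups = {}
for url, category, rank in filtered:
    groups.setdefault(category, []).append(rank)

# FOREACH groups GENERATE category, AVG(pagerank)
result = {cat: sum(ranks) / len(ranks) for cat, ranks in groups.items()}
```

Each intermediate result has a name, so later steps refer to it explicitly — the property the review contrasts with opaque SQL query plans.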
From: Virajith Jalaparti [jalapar1@illinois.edu] Sent: Tuesday, February 16, 2010 6:47 AM To: Gupta, Indranil Subject: 525 review 02/16 Review of “Pig Latin: A Not-So-Foreign Language for Data Processing”: The paper presents Pig Latin, a new language designed particularly to support ad-hoc analysis of large data sets, and Pig, a framework which fully implements the various constructs needed for programming in Pig Latin and translates programs written in it into Map-Reduce jobs. The paper claims that while SQL-type, query-based declarative languages provide a high-level framework which is unnatural for experienced programmers (who tend to think in terms of imperative code and scripts), map-reduce-type programming models require programmers to delve into several low-level issues not necessarily relevant to the problem at hand (and which can be done away with). Pig Latin tries to combine features from both these models by providing a procedural programming framework along with the high-level functionalities/abstractions provided by SQL. Pig Latin provides a programming model close to a programmer's way of thinking, offering sequences of instructions and a nested data model. It provides several basic constructs that are typically used in data-intensive applications, such as FOREACH, FILTER, (CO)GROUP, UNION, etc. The paper further goes on to provide details of the interactive debugging environment that Pig supports; apart from a graphical display of the execution of the program, it also provides an automated data set generator which helps to generate data that can serve as input to the program. Pig Latin provides high-level programming constructs which help users remain oblivious to the low-level details which must be dealt with when working with, for example, the map-reduce framework. Since programs in Pig Latin are translated into Map-Reduce jobs, it inherently helps exploit the parallelism present in the data analysis.
It provides a generic framework by allowing the use of user-defined functions along with the basic constructs it provides, such as those for handling input/output data. The use of various nested data types is natural and definitely increases the flexibility available to the programmer, as compared to using only a 1NF way of constructing data structures. Apart from these programming constructs, Pig also provides a graphical interactive debugger which greatly simplifies the tedious task of debugging. While one of the main reasons for creating Pig Latin is to make it easier for programmers to analyze data, the paper makes no comparison as to how the efficiency of the programs is affected as compared to programming in SQL-type languages. The authors make it explicit that Pig Latin is essentially a scan-centric language supporting read-only data analysis. It is not clear why such a constraint has been adopted, as it essentially limits its usage. Although the paper mentions that optimizations can be done in Pig Latin, it is not clear how Pig supports this. The authors do not discuss the various methods that could possibly be used in Pig Latin to make the optimizations automatic (e.g., by extending classic compiler optimizations). The same is true of parallelization: except for the parallelism achieved by using Map-Reduce, no details are presented as to how the semantics of the program can be exploited to achieve greater parallelism. Further, while Pig Pen is supposed to generate sandbox data sets, the paper does not provide any methods as to how such “complete” input data can be generated, which makes one suspect the validity of their claims (this is essentially a “hard” problem). Review of “Wave Computing in the Cloud”: This paper introduces a new Wave model of computing which essentially tries to capture the temporal relations between queries in data-intensive distributed computing.
The fundamental motivation for such a model is the presence of redundancy across the computation of various queries, which arises because complex queries are decomposed into similar simpler queries that can potentially be the same and work on the same input data. The authors present case studies which show the redundancies present in three query series and point out opportunities for optimizations which can exploit these redundancies to achieve better resource utilization and performance. In this model, the input is treated as append-only streams, and queries are broken down into query series which contain recurrent computations on a stream of data. The paper describes various optimizations that can be achieved by identifying such query series and redundancy, including predicting the computation requirements for the execution of a particular query, enabling shared I/O of data, query planning, and query scheduling. This paper exposes the opportunities to achieve better performance and resource utilization in the case of data-intensive computing. It shows that several benefits can be obtained by identifying common sub-query computations across various complex/bigger queries, along with removing the need to compute redundant queries. One of the major uses of the Wave model seems to be predictability: distributed system execution is often quite unpredictable; predictability would help in performing various complex tasks like load balancing and near-optimal scheduling, and in achieving near-optimal resource utilization. While the paper provides a novel idea to exploit inherent redundancy in data-intensive computing, it is not very clear to what extent the optimizations promised by this method can be achieved in practice. First of all, it is not trivial to calculate a query series; the paper provides no method to do so.
Even if the queries and data are known a priori, it might actually be computationally expensive to find such redundant queries whose presence can be exploited to achieve the advantages provided by the Wave model. It is not very clear to what extent such redundancies occur in regular applications; the paper provides a simple case in favor of its argument, but is the problem really so important/prevalent that we need a new model to capture it? This paper further seems to encourage a bad practice: designing architectures/frameworks in such a way that application-specific optimizations can be done within them. But shouldn't optimizations be taken care of at the algorithmic level? The algorithms for the applications being considered should be designed in such a way that they take care of such redundancies, and it should not be left to the programming model to detect such optimizations. Further, the paper briefly outlines the various potential advantages of using a wave model for computing but does not give any initial directions (even for a workshop paper) as to how such advantages can be realized in practice. -- Virajith Jalaparti PhD Student, Computer Science University of Illinois at Urbana-Champaign Web: http://www.cs.illinois.edu/homes/jalapar1/ From: liangliang.cao@gmail.com on behalf of Liangliang Cao [cao4@illinois.edu] Sent: Tuesday, February 16, 2010 5:39 AM To: Gupta, Indranil Subject: 525 review 02/16 Reviews by Liangliang Cao, cao4@illinois.edu, Feb 16, 2010 Paper 1: Pig Latin: A Not-So-Foreign Language for Data Processing This paper proposes a new interface to MapReduce which is useful for ad-hoc data analysis tasks. The basic idea of the language design is to introduce a Tuple-Bag data model and to allow UDFs to be used together with atomic variables. As a result, Pig Latin leads to procedural-style programming of SQL functions, together with a better interface for calling MapReduce.
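The Tuple-Bag data model mentioned in this summary can be loosely pictured with nested Python structures; this is only an analogy with an invented example record, not Pig's actual implementation:

```python
# Loose Python analogy of Pig's nested data model:
#   Atom -> str/int, Tuple -> tuple, Bag -> list of tuples, Map -> dict.
# The example record below is invented for illustration.

record = (
    "alice",                       # atom
    [("lakers", 1), ("iPod", 2)],  # bag of (query, count) tuples
    {"age": 20},                   # map
)

name, queries, info = record

# Nesting lets one field hold a whole collection, unlike flat 1NF rows.
total_queries = sum(count for _query, count in queries)
```

The point of the analogy: a single field can contain an entire bag, so a UDF can receive and return collections rather than being limited to atomic column values.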
Pros • The definitions of LOAD, FOREACH, FILTER, and COGROUP are neat. I guess the Yahoo! team made a lot of modifications during the process of developing and using the Pig system. • The introduction of UDFs, the nested data model, and the Tuple-Bag data model really simplifies the job compared to calling MapReduce directly. • I am most excited about the Pig Pen debug system, since I believe it is very promising for data-driven tasks. Cons • Although the paper is well organized and presented, the contribution of the main idea is not as significant as it appears. Fundamentally, Pig Latin is just an interface for combining MapReduce functions. • The authors should give a more thorough analysis of Pig Latin versus SQL programming, not only for cloud computing but also for general databases. • The interface is mainly designed from the developer's viewpoint. Some functions, such as COGROUP or JOIN, might not be as efficient as in the Map-Reduce-Merge framework (SIGMOD 2007). • The implementation details of Pig Pen are not discussed. I am more interested in the debug environment for OLAP analysis and for machine learning problems, but I cannot find such a discussion in the paper. Paper 2: Wave Computing in the Cloud, B. He et al, HotOS 2009 This paper introduces the “Wave” model to handle the temporal relationship among queries. A novel concept, the “query series,” is defined for periodically updated input streams. The idea is simple, but it effectively models the temporal correlations and redundancy, which dramatically improves system performance. Pros: • The paper uses a case study to convincingly show that there are redundancy and load imbalance in the query streams. • The technique of query decomposition seems very insightful for locating the primary computational bottleneck. • The idea is simple but effective. Cons and potential improvements • There should be more experimental results. Currently we are not sure how well the wave model performs on general problems.
• In practical settings such as search engines and web logs, not only the queries but also the data arrive as streams. How to generalize the current paper to streamed data might be interesting. • It might be more powerful to combine the wave model with machine learning techniques, such as feature selection or Kalman filters. From: Sun Yu [sunyu9910@gmail.com] Sent: Monday, February 15, 2010 11:54 PM To: Gupta, Indranil Subject: 525 review 02/16 Sun Yu Dear Indy: I was travelling and planned to be back today, but my flight was cancelled due to bad weather, so I'm afraid I will miss the class tomorrow. Here is the review for tomorrow's topics. I'll turn in a hard copy later. Thanks. Best, Sun, Y. 1. DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language Most data-processing systems, such as MapReduce and Dryad, don't provide satisfactory programming interfaces. This paper aims to address this issue. DryadLINQ, as its name suggests, is a hybrid of declarative and imperative programming: it compiles LINQ programs into distributed computations on the Dryad system. The goal is to provide a programmer-friendly interface that conceals all the complexity due to the distributed nature of the underlying system, yet achieves high performance. The paper describes the structure of DryadLINQ and demonstrates its performance on a variety of benchmarks. Such a transparent programming interface for distributed systems may have many limitations, many of which are discussed in Section 7. Another potential problem is the predictability of performance: users who write a piece of code may expect predictable performance with no, or at least small, variation. Performance debugging also depends on the Dryad job manager to collect information in a centralized manner; is there any scaling problem with this?
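What "concealing the distributed nature" means can be pictured crudely: the programmer writes one sequential-looking query, and a small runtime partitions the data and applies it per partition. Nothing below uses real Dryad or LINQ APIs; it is an invented illustration of the idea only:

```python
# Toy sketch of automatic data parallelism: the "query" is written once,
# and a tiny runtime partitions the input, applies the query to each
# partition (in a real system, on different machines), and merges results.

def query(partition):
    # Sequential-looking user code: squares of the even numbers.
    return [x * x for x in partition if x % 2 == 0]

def run_distributed(data, n_partitions=3):
    parts = [data[i::n_partitions] for i in range(n_partitions)]
    pieces = [query(p) for p in parts]  # conceptually parallel
    return sorted(x for piece in pieces for x in piece)  # merge step

result = run_distributed(list(range(10)))
```

The user-visible code is just `query`; the partition count and merge strategy are hidden, which is also where the review's worry about performance predictability comes from.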
2. Wave computing in the cloud The authors introduce a new wave model for better utilizing computation resources and enhancing performance in data-intensive distributed systems. In current systems, there is significant redundancy in operations across queries, for example, input scans and common sub-query computation. Other issues, like temporal workload imbalance, also limit system performance and result in under-utilization of resources. The basic idea here is to exploit the redundancy: recurrent computations on a stream are defined as a query series. With this notion of a "query series," or pattern, we have some predictability in the system, which enables scheduling and combining shared operations. On the other hand, a cost model can be introduced on a statistical basis, opening up the possibility of using query optimization techniques in current distributed systems. It is also claimed that this model has practical significance since it can be enabled on top of current systems. One question: the authors mention that "data distribution within a stream tends not to change when the stream grows over time"; how is this justified? Also, is it possible to use some learning scheme to find more underlying patterns (some hidden features, maybe) of queries that can be effectively utilized? An adaptive scheme could also be interesting. From: Ghazale Hosseinabadi [gh.hosseinabadi@gmail.com] Sent: Monday, February 15, 2010 10:48 PM To: Gupta, Indranil Subject: 525 review 02/16 Paper 1: Pig Latin: A Not-So-Foreign Language for Data Processing In this paper, a new data processing environment (Pig), its corresponding language (Pig Latin), and a new debugging environment (Pig Pen) are introduced. Pig Latin combines the high-level declarative querying style of SQL and the low-level, procedural style of map-reduce. Pig Latin is used by programmers at Yahoo! for data analysis. Pig Latin is a dataflow language which benefits from a quick start, interoperability, and a nested data model.
Pig Latin also supports user-defined functions (UDFs). Pig Latin has four data types: Atom, Tuple, Bag, and Map. Using the LOAD command, input data files are converted into Pig's data model. FOREACH implements per-tuple processing. FILTER discards unwanted data. COGROUP gathers related data together. Pig Latin also allows some commands to be nested. STORE saves the result of a Pig Latin expression in a file. Pros: The objective in the design of Pig Latin is clearly stated, and the designed language achieves its objectives. Cons: A comparison between the performance of Pig Latin and map-reduce is missing. The amount of improvement that Pig Latin achieves by being implemented over Hadoop is not presented. Paper 2: Wave Computing in the Cloud In this paper, a model (called the wave model) for exposing the temporal relationship among queries in distributed computing is introduced. In wave computing, data is considered a stream that is periodically updated. The authors studied a query trace from a cluster. They looked at redundancy in input data scans as well as common sub-query computation. They also studied the temporal distribution of the load on the cluster. The window size of a query is equal to the size of the time window of the query's input on the stream. They found that as the window size increases, the success rate decreases. The authors also analyzed the predictability obtained from the similarities among the executions of queries in the same query series. When a query is executed, its characteristics are saved. Stream processing is then optimized through shared scans and computation, query decomposition, query planning, and query scheduling. Pros: The idea of considering the data as a stream is simple but interesting. Considering different forms of correlation in query processing is important for achieving better performance. Cons: No theoretical analysis is presented in the paper. It is not made clear how large the impact of the wave model is.
The paper doesn't have any evaluation/simulation section, so the performance of the wave model is not practically investigated. No comparison with other solutions in the literature is provided. From: Shehla Saleem [shehla.saleem@gmail.com] Sent: Monday, February 15, 2010 6:55 PM To: Gupta, Indranil Subject: 525 review 02/16 Wave Computing in the Cloud This paper focuses on a challenge commonly faced by large-scale clusters these days: having to execute a large number of queries on large amounts of data. The authors' motivation comes from an analysis of a production computing cluster, where their study reveals a lot of redundancy and high levels of load imbalance. The authors propose Wave, which derives its key ideas from log mining with some appropriate modifications. They introduce the notion of a query series, which refers to recurrent computations. This concept is used to bring some predictability into the system. The authors identify certain areas that offer opportunities for improvement and then propose some optimizations, mostly exploiting query characteristics and correlations. This is a simple paper, more like a presentation of certain ideas. It identifies some potential opportunities for building better data-intensive applications and provides some intuition on how to exploit them. I would, however, have liked to see some more results to validate some of the proposals. Pig Latin: a not-so-foreign language for data processing This paper introduces Pig Latin, a high-level programming language aimed at distributed systems running huge data-intensive applications. It is designed to sit conceptually in the middle between the low-level, rigid style of Map-Reduce and the high-level, declarative language of SQL. From the programmer's perspective, it is far more flexible than MapReduce and masks the underlying map-reduce operations from the programmer.
This increases code reusability, and that, combined with Pig Latin's interactive debugging environment, makes it very attractive for programmers. Also, the flexible, fully nested data model, the support for user-defined functions, and the ability to work with plain input files without any schema information all add to the strengths of the design. Moreover, the examples in the paper illustrate Pig Latin's applicability to real-world scenarios, which adds further to its promise. Pig Latin hides the intricacies of multiple MapReduce operations from the programmer, but I was wondering whether any optimizations are possible in terms of handling the data generated in the intermediate stages. Also, iterative control structures like loops are missing, which might make it hard to use for many high-performance computing applications. From: ntkach2@illinois.edu Sent: Monday, February 15, 2010 4:58 PM To: Gupta, Indranil Subject: 525 review 02/16 Nadia Tkach – ntkach2 CS525 – paper review 2 Cloud Programming Paper 1: Pig Latin: A Not-So-Foreign Language for Data Processing The authors of this paper propose a new language called Pig Latin that helps users manipulate and analyze large blocks of data. It is based on a step-by-step programming technique while maintaining the look and feel of declarative SQL-like querying (such as filtering, grouping, and aggregation) and the map-reduce procedural programming model. Essentially, it includes the Pig programming and compiling system, which is built on top of the Hadoop map-reduce implementation. Pig is an open-source project and available for public use.
Pros: • Can be used on a large scale to handle terabytes of data • Free, open-source project • Supports ad-hoc data analysis, debugging, and parallelism • User-defined functions are written in Java, with possible future support for other programming languages • The system creates “logical plans” as it processes the code and doesn't carry out any actions until the STORE operation is invoked by the user; the system can also avoid materializing intermediate data until a certain operation is invoked (especially useful when processing large data sets) Cons: • The system's implementation and wide usage might require prior user training • Supports only read-only data analysis workloads Paper 2: Wave Computing in the Cloud The paper describes the new Wave computing model for data processing. This model performs batch processing on streaming, continuously updated input data. Wave catalogues and analyzes the query processing operations and, using this data, finds the query series that share common computation. Once such common computations are identified, the model is able to predict future data operations and perform them across several query series as applicable, reducing the amount of computation and I/O operations. Additionally, Wave analyzes performance and load patterns and spreads the computation over time. The Wave model can improve performance and resource optimization, and minimize underutilization and resource exhaustion.
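The shared-computation idea in this summary can be pictured with a small sketch: if two queries in a series need the same sub-result over the same input window, it can be computed once and reused, saving a redundant scan. The log records and queries below are invented, not from the paper:

```python
# Toy sketch of sharing a sub-computation across queries in a series.
# Both "queries" need per-user request counts over the same input window,
# so the shared scan runs only once. All data here is invented.

log = [("alice", 200), ("bob", 404), ("alice", 200)]
scans = 0

def per_user_counts(records):
    # Shared sub-query: a single scan over the input window.
    global scans
    scans += 1
    counts = {}
    for user, _status in records:
        counts[user] = counts.get(user, 0) + 1
    return counts

shared = per_user_counts(log)           # computed once...

top_user = max(shared, key=shared.get)  # ...reused by query 1
total_requests = sum(shared.values())   # ...and by query 2
```

With naive execution, each query would rescan the input; the wave idea is that recognizing the recurring sub-computation lets the system amortize that scan across the series.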
Pros: • Reduces the number of redundant computation operations • Reduces load imbalance • Reduces the size of each query and can potentially minimize query failures Cons: • Wave is implemented on top of an existing system, but requires certain capabilities to be enabled in order to operate, such as a language for query manipulation, data and statistics cataloguing, and query-rewriting functionality • The paper doesn't provide any information on the model's implementation, evaluation, or results to back up the theoretical analysis From: arod99@gmail.com on behalf of Wucherl Yoo [wyoo5@illinois.edu] Sent: Monday, February 15, 2010 11:57 AM To: Gupta, Indranil Subject: 525 Review 02/16 Cloud Programming Review, Wucherl Yoo (wyoo5) Wave Computing in the Cloud, B. He et al, HotOS 2009 Summary: The authors claim to have observed significant redundant I/O and computation across individual queries, as well as load imbalances, in a cluster environment. They found that log data mining was a dominant workload. This workload shows wave-like patterns, since the log can be considered a stream that is periodically updated. The authors define the query series to refer to redundant computations among queries; it exposes the correlations among queries in the series. This makes recurring execution explicit so that it can provide predictability about the execution behavior of queries and data characteristics. Thus, redundant computations among queries can be combined to reduce the waste of resources. In addition, it can provide load balancing with increased predictability about resource usage. Pros: 1. Interesting observation about redundancies among queries in a cluster environment 2. As the authors mention, the wave model can be applied to other cloud environments, since the characteristics of their data resemble an appended log Cons: 1. Evaluation is not sufficiently strong; reality can be more complex and noisy, with heterogeneous workloads and resources. 2.
Although finding redundancy among sub-queries may be easy, finding it in the high-level languages of a cloud environment may not be easy, due to implicit dependencies among computations and data at different system layers. Pig Latin: a not-so-foreign language for data processing, C. Olston et al, SIGMOD 2008 (Yahoo!) Summary: Although map-reduce provides a simple procedural programming model, its two-stage data flow may be too rigid and low-level. In addition, the SQL-like declarative model may be too restrictive for a cloud environment. The authors propose a new language, Pig Latin, to fit into a sweet spot between the two models: while Pig Latin provides high-level data manipulation primitives, it exposes explicit execution steps. This step-by-step execution also helps with debugging the program via the provided interactive debugging environment, which can automatically generate intermediate example data. The Pig system builds a logical plan from Pig Latin scripts and compiles it into map-reduce jobs that can be run on Hadoop. Pros: 1. A higher-level (thus more convenient) and more expressive programming model than MapReduce, with explicit procedural steps compared to declarative SQL 2. System-level optimization is possible, and it may be no worse than programmer-specified optimization in a low-level language like MapReduce Cons: 1. More complexity can cause more bugs and performance loss 2. The programming model is tightly coupled with MapReduce, so flexibility is not much improved. 3. Overhead is incurred from the inflexibility of MapReduce – intermediate data must be materialized and replicated between successive MapReduce jobs. -Wucherl From: Ashish Vulimiri [vulimir1@illinois.edu] Sent: Monday, February 15, 2010 12:22 AM To: Gupta, Indranil Subject: 525 review 02/16 Pig latin: a not-so-foreign language for data processing, C.
Olston et al, SIGMOD 2008 The authors describe Pig Latin, a procedural data processing language that improves upon MapReduce by adding both additional control primitives (inspired by declarative languages like SQL) and richer data structures (nested tuples and associative arrays). They also describe the implementation of a development environment for Pig Latin, called Pig, that includes two tools: i) a compiler that can translate Pig Latin programs into optimized sequences of MapReduce jobs, suitable for execution on Hadoop; ii) a debugging environment called Pig Pen that generates test data sets and simulates the program on them to enable the user to identify potential problems. Comments: + Richer query language than map-reduce. In particular, primitives like COGROUP are interesting because they allow operations on subsets of the available tuples, instead of just treating them individually (map) or all at once (reduce). + The existence of a debugging environment, even one as simple as Pig Pen, is a major plus. 0 The primary reason they can simplify the SQL model is that they allow only read-only analysis. Allowing writes would require bringing back all four ACID properties of traditional databases, as well as all the complexity that entails. - I'm not sure I buy the argument that programmers are inherently uncomfortable with declarative, SQL-like languages -- especially given the current craze for building web applications for everything (web apps almost certainly have to interact with a database layer). - Maps: keys should be primitive values for efficiency reasons. But shouldn't hashing the keys be sufficient to handle more complex structures? - The debugging environment is limited. Not all complications that can arise in real datasets can be demonstrated via a simple test dataset. - No profiler -- cannot identify performance bottlenecks.
- The authors are limited by the artificial constraint that all queries must, in the end, be reformulated in terms of an efficient chain of map and reduce jobs. Are there natural parallel primitives that do not fit this model? DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, Yuan Yu et al, OSDI 2008 DryadLINQ is an alternative data query processing framework that uses the Dryad parallel execution framework as a provider for the LINQ query language included with the .NET framework from v3.5 onwards. (The LINQ query language is distinct from the framework, called the "provider", on which the query is actually executed. Some examples of providers other than Dryad include traditional RDBMSes, queried via SQL, and simple XML stores that are queried by a local execution engine.) The authors briefly describe the DryadLINQ execution model and then demonstrate several example programs -- including a MapReduce emulator, PageRank, and two standard machine learning algorithms. + Real, strongly typed data structures, as opposed to Pig Latin's ad hoc nested tuples, although this comes at the cost of some complexity. + Lazy evaluation -- sub-queries are only evaluated if actually needed. + More efficient joins than Pig Latin. - Integrated into programming languages like C# that have intrinsically different semantics. This causes issues such as the one with non-side-effect-free statements in Section 3.2. - Very limited debugging support -- the debug environment only handles outright failures. - As with Pig Latin, no profiler. Thanks, Ashish
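The COGROUP point raised in the Pig Latin comments above (operating on grouped subsets, rather than one tuple at a time or all at once) can be sketched in Python; the relations and field values below are invented, and this is only an illustration of the semantics, not Pig's implementation:

```python
# Toy sketch of Pig's COGROUP: group tuples from two inputs by a shared
# key, keeping each input's matches in its own nested bag instead of
# flattening them as a JOIN would. The data below is invented.

results = [("lakers", "nba.com"), ("lakers", "espn.com")]
revenue = [("lakers", 0.5), ("kobe", 1.2)]

def cogroup(left, right):
    keys = {k for k, _ in left} | {k for k, _ in right}
    return {
        k: (
            [t for t in left if t[0] == k],   # bag of matching left tuples
            [t for t in right if t[0] == k],  # bag of matching right tuples
        )
        for k in keys
    }

grouped = cogroup(results, revenue)
```

A UDF can then process each key's pair of bags as a unit, which is the middle ground between per-tuple map and whole-input reduce that the reviewer highlights.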