From: wzhou10 SpamElide Sent: Tuesday, February 22, 2011 1:50 PM To: Gupta, Indranil Subject: CS525 Review 02/22

CS525 Review 02/22
Wenxuan Zhou

Review of "Dynamo: Amazon's Highly Available Key-value Store"

Core idea
This paper presents the design and implementation of Dynamo, Amazon's key-value storage system. One big difference between Dynamo and previous work is that Dynamo puts availability first, sacrificing some consistency of data among servers. In particular, conflict reconciliation at the application layer is used to address inconsistencies. The design goal of Dynamo is to provide an "always on" experience to all customers.

Pros
1. The work does a very good job of trading off availability against consistency, and makes that trade-off work in practice.
2. The paper gives insight into how a variety of classic distributed techniques (consistent hashing, vector clocks, quorums, gossip, etc.) can be combined to build a complex system.

Cons
1. Dynamo resolves conflicts during reads so that customers have a good experience while writing. But reads happen more frequently than writes, so this might hurt the customer experience too. Extreme cases, such as purchasing goods in very limited supply, should be considered.
2. The coordinator replicates the keys at the N-1 clockwise successor nodes in the ring. It seems they don't distinguish whether these N-1 nodes are physical nodes or virtual ones. If the majority of these replicas are hosted by virtual nodes on the same physical machine, then when that machine goes down it may be impossible to resolve the conflicts.
3. Dynamo uses application-layer conflict reconciliation to solve inconsistency problems, which might increase software development cost and risk exposing customers' credentials, such as credit card numbers.

Future suggestion
While Dynamo is designed for a single administrative domain where all nodes are assumed to be trusted, I wonder if it could be modified to apply to more general networks with malicious nodes.

Review of "Comet: An Active Distributed Key-Value Store"

Core idea
Motivated by the increasing importance and popularity of key-value stores and the challenge of building applications with diverse needs on top of them, the authors propose Comet, an extensible distributed key-value store. Comet allows applications to customize the store's functions, and to define new functions as well. Specifically, each Comet node stores a set of ASOs, whose handlers can be modified to take dynamic application-specific actions.

Pros
1. Comet makes the key-value store quite flexible; each application can customize it to its individual needs. Better still, this customization doesn't require much effort: tens of lines of code are enough. This is good news for both researchers and developers, because changes and deployments can be made more easily.
2. The implementation of Comet is lightweight, only 5K LOC.
3. By constraining ASOs' interactions (they are only allowed to talk to neighbors), traffic amplification and DDoS attacks are prevented. On the other hand, it is hard to evaluate changes before deployment.
4. Resource consumption is considered and bounded, by limiting per-handler bytecode instructions, memory consumption, and incoming/outgoing message rates.

Cons
1. There's no discussion of fairness among different applications sharing the same store.
2. Comet makes use of the high-level language Lua and does a good job of keeping things simple, but what about efficiency?

Thanks.
Wenxuan

From: Curtis Wang [cwang89 SpamElide] Sent: Tuesday, February 22, 2011 12:30 PM To: Gupta, Indranil Subject: 525 review 02/22

Key-Value Stores

Dynamo:
Dynamo is Amazon's key-value store that is optimized for high data availability (at the 99.9th percentile), reliability, and scalability. This comes at the cost of trade-offs in data consistency (Dynamo is eventually consistent). Instead of choosing an RDBMS like most other production systems to store the state of its services, Amazon chose a key-value store since most of its services only need read/write operations.

Pros of Dynamo's approach:
- Functions on commodity hardware and is built to be incrementally scalable on heterogeneous networks
- System optimized for the 99.9th percentile, not the average
- Always writeable (conflict resolution on reads), giving a better customer experience (when buying things)
- Symmetry: nodes have the same responsibilities as their peers. Favors decentralization.
- Quorum-style consistency protocol that is flexible (R, W, N)
- Data replicated across multiple data centers

Cons:
- ACID properties: Dynamo provides no isolation guarantees
- Permits only single-key updates
- Operating environment assumed to be non-hostile
- Eventually consistent services may have to deal with inconsistent data

As hinted at in the conclusion of the paper, the gossip-based membership protocol seems like it will not scale as well as the other parts of the system. It would have been interesting to hear whether Amazon had any alternative solutions and how they measured in practice against the gossip-based protocol.

Project Voldemort:
Voldemort is a key-value store project by LinkedIn. It borrows many of the same design elements as Dynamo. The paper seems to be more useful from an application developer's point of view, since it goes into details like the specific data types that are supported by the system.

Pros:
- Decentralized (data replicated across multiple servers)
- Uses a vector clock (like Dynamo) for consistency resolution
- Read-repair: like Dynamo, conflicts are resolved on read
- Quorum-style consistency

Cons:
- Read-repair requires more application logic

Though Project Voldemort was not published at a conference, it would have been nice if the team had elaborated more on their design decisions. For a social network, is read-repair the best solution? It seems like the always-write style of Dynamo may not be as applicable in this context: reads seem to be the most common operation.

From: anjalis2006 SpamElide on behalf of Anjali Sridhar [sridhar3 SpamElide] Sent: Tuesday, February 22, 2011 12:27 PM To: Gupta, Indranil Subject: 525 review- 02/22

Dynamo
Amazon's Dynamo system focuses on reliability against all failures, such that user requests for reads and writes are always carried out, e.g., the shopping carts of Amazon.com users. One of the main contributions of Dynamo is ensuring that a user's write is never rejected; any conflict resolution is carried out during reads by using vector clocks. Dynamo targets applications that consist of trustworthy nodes, do not need hierarchical namespaces, require very low latency (a zero-hop DHT with client-driven routing), and need an always-writeable datastore in order to service user requests in a fast and consistent manner. Dynamo uses virtual nodes for partitioning the data among many nodes in a ring, as sketched below.
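A minimal sketch of consistent hashing with virtual nodes as discussed in these reviews; the MD5-based ring positions, token count, and bisect lookup are illustrative assumptions, not details from the paper:

    import hashlib
    from bisect import bisect

    def ring_hash(key):
        # Position on the ring: first 8 bytes of MD5 (an assumed choice).
        return int(hashlib.md5(key.encode()).hexdigest()[:16], 16)

    class ConsistentHashRing:
        def __init__(self, tokens_per_node=4):
            self.tokens_per_node = tokens_per_node
            self.ring = []  # sorted list of (position, physical_node)

        def add_node(self, node):
            # Each physical node owns several virtual nodes (tokens).
            for i in range(self.tokens_per_node):
                self.ring.append((ring_hash("%s#%d" % (node, i)), node))
            self.ring.sort()

        def lookup(self, key):
            # Walk clockwise to the first token at or after the key.
            pos = ring_hash(key)
            idx = bisect(self.ring, (pos,)) % len(self.ring)
            return self.ring[idx][1]

    ring = ConsistentHashRing()
    for n in ("A", "B", "C"):
        ring.add_node(n)
    print(ring.lookup("shopping-cart:12345"))

Because each physical node holds several tokens, adding or removing a node shifts only the key ranges adjacent to its tokens, which is what makes the scheme incrementally scalable.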
Dynamo replicates data at the N-1 successor nodes, which together constitute the preference list of N nodes. Dynamo uses a sloppy-quorum approach for reading and writing data to the KV store to ensure the "always writeable" property. When dealing with permanent failures, Merkle trees are calculated with the leaves containing the hashes of the keys; the parent nodes are hashes of their children. Inconsistencies between replicas are detected by comparing the hash values of the roots of the trees constructed for each key range. Dynamo allows the storage system to be customized to the desired availability and performance by choosing the right N, R, W parameters for the sloppy quorum, facilitating always-available writes and reads. Since Dynamo has been deployed on Amazon's servers, we are able to see data taken from real-world interaction with the servers.

Dynamo presents an "always writeable" system which ensures that user-entered data is never lost. Dynamo is able to provide an efficient, reliable, and decentralized storage system. However, one issue that may hinder wide usage of this software is that Dynamo uses a gossip protocol to maintain the membership of every node and the range of keys it contains. Dynamo's gossip protocol might become the bottleneck when scaling the system: since every node knows about every other node and the range of keys it contains, churn will have an impact on the number of messages in the Dynamo system. Dynamo is also developed for a very specific setting and hence does not consider security in its framework. The number of applications that might be able to use this implementation is limited by this.

Project Voldemort
Voldemort is an open-source distributed key-value store. Its framework consists of layers that are each responsible for a specific task, like serialization, versioning, network connections, etc. This provides flexibility for different kinds of implementations in each of these layers. Data is distributed across multiple servers and replicated a configurable number of times. It supports serialization formats such as JSON, protobuf, Thrift, etc. It approaches the consistency problem by resolving conflicts at read time using the read-repair protocol. Versioning is carried out using vector clocks, and quorums are used for the read and write operations that are carried out.

Project Voldemort provides the flexibility of server-side routing, client-side routing, or both, to decrease query latency. It also provides pluggable serialization: a serializer can either be written anew or an existing one can be used. The read-repair protocol requires keeping track of multiple writes and resolving them at the next read, which may add some overhead at read time. Voldemort does not provide the "always writeable" property: if the quorum is not satisfied, the write is not carried out and the user data is lost. Since the project does not address temporary and permanent failure handling more extensively and depends only on quorums, it is possible that the performance of the KV store will not be the best.

From: mdford2 SpamElide Sent: Tuesday, February 22, 2011 12:24 PM To: Gupta, Indranil Subject: 525 review 02/22

Dynamo: Amazon's Highly Available Key-value Store
Any downtime in Amazon's business processes represents a significant loss of revenue to the company. They strive to solve the problem of high-reliability, low-latency data storage. Based on the unique needs of their business, Amazon is able to choose a specific set of solutions that best fit their restricted problem space. Amazon builds their system as an eventually consistent key-value store with application-variable replication and application-level read conflict resolution. This allows them to tailor their key-value store latencies in order to provide quality-of-service guarantees. They assign nodes to multiple regions to evenly balance load, and use gossip protocols to discover node failures and to manage membership. There is no security in the system, and all nodes are assumed to be trusted. One could imagine the damage this could lead to. Finding nodes based on a hash value is not discussed in depth; an upper bound on the number of hops would be nice. I wonder if there is a more efficient way to locate neighbors based on rack/cluster placement.

Comet: An active distributed key-value store
Comet implements a key-value store on top of the Vuze DHT. It allows for greater flexibility at the application level than other systems by implementing handler features. The handlers available to users are onGet, onPut, onUpdate, and onTimer. These handlers are invoked when underlying structures change. In order to limit the damage that one application's handlers can do to another application's data, Comet implements several security features. Handlers are developed in a sandbox environment, with only math, string, and table libraries available to link. Without network, system, file, or thread access, handlers are limited to acting within a limited scope, and applications are insulated from one another. It is not clear what the implications of building Comet on top of the Vuze DHT are; alternatives for this underlying base layer are not presented. Performance analysis of simple Comet handlers in comparison to Vuze performance would be interesting, to see the overhead that Comet incurs.

From: kevin larson [kevinlarson1 SpamElide] Sent: Tuesday, February 22, 2011 12:21 PM To: indy SpamElide Subject: 525 review 02/22

Kevin Larson 2/22/2011

The authors of Dynamo were faced with the need for a storage technology that would always be available. In creating Dynamo, reliability was key; however, performance, efficiency, and scalability couldn't be compromised. Attention was paid not only to the averages and maximums of these metrics, but to their distributions as well. The concept of "99.9% of something above/below a threshold" is strongly emphasized in the paper. Amazon has an enormous backbone of computers and an architecture with numerous services which run on Dynamo. They had a variety of problems with general techniques and documented the problems and how they overcame them. They used consistent hashing to allow for incremental scaling. The output range appears as a ring (smallest wraps to largest), and nodes are assigned random values which denote their positions. In order to have high availability, they keep multiple independent copies of data, organized with vector clocks (think Lamport clocks; a sketch follows below), created when the original is not available, which are merged when a read is necessary. The authors used a modified version of quorum to handle temporary failures, and Merkle trees to correct the system after permanent failures. When adding nodes, new nodes are given a number of tokens randomly distributed throughout the ring. When removing nodes, the tokens are reallocated in reverse.
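A minimal sketch of the vector-clock causality check referred to above; the dict-based clocks and function names are illustrative assumptions:

    def descends(a, b):
        # True if clock a has seen at least everything clock b has seen.
        return all(a.get(node, 0) >= count for node, count in b.items())

    def compare(a, b):
        if descends(a, b):
            return "a supersedes b"
        if descends(b, a):
            return "b supersedes a"
        return "concurrent"  # both versions kept until a read reconciles them

    # Two writes coordinated by different nodes after a common ancestor:
    v1 = {"Sx": 2, "Sy": 1}
    v2 = {"Sx": 2, "Sz": 1}
    print(compare(v1, v2))  # "concurrent" -> application must reconcile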
The client applications can control many of these thresholds to optimize for a specific metric by providing Dynamo with different values of N, R, and W. Requests are coordinated by the top N nodes in the preference list; R and W are the numbers of nodes that must take part in a successful read/write.

The attention to the 99.9% principle was very interesting. The developers seemed dedicated to attaining very strong results, and this was demonstrated in their evaluation. One especially interesting comparison was that of their three distribution strategies. While strategies 1 (random tokens per node and partition by token value) and 3 (Q/S tokens per node, equal partitions) were better than 2 (random tokens per node, equal partitions), they explained that strategy 2 was important as it was needed to transition from 1 to 3. At the same time, although the 99.9% principle was very interesting, it also left out some potentially interesting evaluation: numerous metrics could have been more thoroughly explored and documented.

Voldemort has an extraordinarily simple interface, which allows for additional performance and availability optimizations. It exposes only get, put, and delete commands for keys. As a result, queries are extremely efficient, predictable, and distributable. Additionally, there is a clean separation of logic and storage, and relational mismatches don't occur. On the other hand, complex queries are more difficult, and joins must occur in code. Voldemort partitions "hot" data into smaller chunks in an attempt to keep it entirely within memory. While this allows for an enormous performance gain, it prevents the servers from being interchangeable. In the event of a machine failure, Voldemort uses consistent hashing to distribute the load across the remaining machines. Consistent hashing also means that adding a new machine requires moving only 1/(number of servers + 1) of the values to the new machine. Voldemort also uses a read-repair method of reaching data consistency. Although this requires resolution on reads, read repair has strong availability guarantees and very high efficiency. Versioning is performed by storing a unique counter with every piece of data, and vector clocks are used as in Dynamo.

The authors of Voldemort were very good about explaining their design decisions and the other available options. The sacrifice of interchangeability in order to improve performance was very interesting. Although there were numerous interesting decisions and applications, Voldemort lacked any sort of evaluation to prove its claims. Numerous improvements and trade-offs were cited, but none were actually documented or evaluated.

From: Igor Svecs [isvecs SpamElide] Sent: Tuesday, February 22, 2011 12:13 PM To: Gupta, Indranil Subject: 525 review 02/22

Igors Svecs CS 525 2011-02-22

DYNAMO

SUMMARY
Dynamo is a key-value store platform that only supports primary-key access to data. Since many properties of distributed systems are fundamentally incompatible with each other (e.g., completeness vs. accuracy, availability vs. consistency), tradeoffs are unavoidable. Because of Amazon's SLAs, the authors decided to sacrifice consistency to gain high availability and improve performance. The main argument for this is that the best inconsistency-reconciliation strategy depends on the application (a toy example follows below). They present a ring-based partitioning and replication strategy that ensures high availability.
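A toy example of the application-specific reconciliation mentioned above, using the shopping-cart merge that the Dynamo paper describes; representing carts as item-to-quantity maps and merging by max is an illustrative assumption:

    def reconcile_carts(versions):
        # Merge divergent shopping-cart replicas by union: an added item
        # is never lost, though a deleted item can resurface (an anomaly
        # the Dynamo paper itself acknowledges).
        merged = {}
        for cart in versions:
            for item, qty in cart.items():
                merged[item] = max(merged.get(item, 0), qty)
        return merged

    print(reconcile_carts([{"book": 1, "pen": 2}, {"book": 1, "mug": 1}]))
    # {'book': 1, 'pen': 2, 'mug': 1}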
PROS
* The main contribution of the paper is that the authors show how a variety of classical distributed systems algorithms, such as Lamport vector timestamps, quorums, and gossiping, can be used together to build a complex system.
* Novel strategy for ring-based key-space partitioning in order to achieve uniform load distribution.
* Interesting approach to data versioning that can be applied to other types of uses.

CONS
* Provides only primary-key access to data; no advanced queries allowed.
* Assumes single administrative control; not suitable to be distributed across multiple entities that do not trust each other. Hence, security issues are not considered at all.
* Works well with atomic services, each of them having a single well-defined function. May not be scalable for larger composite services that cannot be so easily parallelized.
* Applications must explicitly handle multiple versions of the same data. This creates tighter coupling between applications and the platform and will make them less portable.

COMET

SUMMARY
Comet is a key-value store whose main distinction from related work is that it allows key-values to be accompanied by handlers: pieces of Lua code that execute on specific conditions (such as onTimer, onGet, onPut). The system consists of three components: a routing layer (Kademlia), a key-value store (presumably a hash table), and an active subsystem, which includes security policies and the Lua runtime. Individual nodes are assumed to be mutually untrusting, so the authors implemented measures to prevent attacks.

PROS
* Unlike Dynamo and Voldemort, this KV store is designed to be used by autonomous nodes; hence, much wider applicability.
* Active components of ASOs support basic tasks such as replication management and statistics gathering.
* The authors implemented restrictions to improve the security (system/network access, resource consumption) of ASOs.

CONS
* As the Lua runtime sandbox is severely restricted due to security considerations, the usefulness of handlers may be limited. Completely restricting system/network access appears too coarse-grained.
* While in theory the sandbox should prevent any unwanted interaction with the OS, it is not guaranteed that bugs will not be found that allow code to break out of the sandbox. In this case, the results can be catastrophic (instant code deployment to all nodes in the DHT).
* I disagree with the statement "the security concerns of DHTs with signed ASOs are roughly those of conventional DHTs without ASOs". The interests of the signer (DHT administrator) and the end user may not necessarily coincide, especially in the case of public applications such as Vuze.

NOTES
* This paper can potentially lead to very interesting future work. For example, an interesting idea is to create an active k-v store where users can specify the "restrictiveness" of the sandbox. This system would have a reputation system where users are encouraged to contribute more resources, and in return they receive a greater share of others' resources.

From: Anupam Das [anupam009 SpamElide] Sent: Tuesday, February 22, 2011 12:08 PM To: Gupta, Indranil Subject: 525 review 02/22

1. Dynamo: Amazon's Highly Available Key-Value Store

In this paper the authors propose a new key-value storage system that achieves a high degree of availability and performance in the face of server failures, data center failures, and network partitions. Dynamo sacrifices consistency for the sake of availability. Some of the design features of Dynamo include incremental scalability, symmetry, decentralization, and heterogeneity. Dynamo uses a number of classical distributed systems protocols, like vector clocks, gossiping, and quorums, to build a compact, goal-oriented system. Dynamo uses consistent hashing for distributing load among the peers. Virtual nodes (tokens) are used to make the key-distribution load balancing more fine-grained and more uniform; each physical node might be responsible for more than one virtual node. Data is replicated at N hosts given by a preference list: the list of nodes responsible for storing a particular key. Dynamo applies a sloppy quorum instead of a strict quorum to improve availability; it always allows write operations by weakening the quorum condition. Dynamo is an optimistic replication system which allows conflicting updates to exist in the system, with the hope that they will be resolved during a read operation. For this purpose Dynamo uses vector clocks and employs application-level/semantic conflict resolution. Another feature of Dynamo is that it is a zero-hop DHT, achieved through active gossiping: each node maintains the full key-coordinator mapping, which reduces lookup latency.

There are some issues that require careful attention. Since Dynamo trades consistency for availability, it is suitable only for specific (limited) applications, like the Amazon shopping cart, and would not be used for other applications, like banking, where a high level of consistency is a must. In trying to make vector clocks scalable, the authors suggest truncating the oldest entries, but this could lead to reconciliation problems; the authors, however, argue that such problems have not surfaced in production. Dynamo is developed for a non-hostile, single-administration environment, so it would be interesting to see what kinds of security issues would occur if it were used in a public infrastructure. In trying to achieve one-shot lookups, Dynamo requires active gossiping of the full routing table among nodes, which limits the overall scalability of the system. Dynamo tries to distribute load among the peers as equally as possible, but enforcing this precise load balancing may lead to wasted traffic if nodes join and leave very frequently. Dynamo considers heterogeneity one of its design principles, but the experiments were carried out on homogeneous machines. Some form of sensitivity and cost analysis of the values of N, R, W for large-scale distributed systems would have been nice.

2. Comet: An active distributed key-value store

This paper presents a new generation of distributed key-value storage system called Comet. Comet allows multiple applications to share a single object instance while at the same time enabling them to modify the object in a way that suits them best. Comet achieves this by collocating code with the key-value store: each key-value pair has handlers (code segments) associated with it. These handlers are stored with the key-value pairs and are executed when certain events occur. The key-value pair along with its handlers is called an Active Storage Object (ASO). These handlers introduce flexibility by allowing applications to customize and define new functions, like the number of replicas, replica lifetime, popularity tracking, and so on (a sketch follows below).
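Comet's handlers are written in Lua; the following Python-style sketch (names and structure are illustrative assumptions, not Comet's actual API) shows the flavor of an ASO whose onGet handler tracks popularity:

    class ASO:
        # Active storage object: a value plus handlers the store invokes.
        def __init__(self, value):
            self.value = value
            self.gets = 0  # handler-local state, e.g. popularity tracking

        def on_get(self):
            # Called by the hosting node on every lookup.
            self.gets += 1
            return self.value

        def on_put(self, new_value):
            # An application could veto or transform writes here.
            self.value = new_value

    store = {}
    store["tracker:abc"] = ASO(["peer1:6881", "peer2:6881"])
    print(store["tracker:abc"].on_get(), store["tracker:abc"].gets)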
The ASO API has a sandbox (a collection of Lua functions) that limits an ASO's knowledge of and access to other ASOs. It also restricts an ASO from performing unlimited communication and resource consumption, to prevent DDoS attacks. So, in summary, the key idea of Comet is that it allows clients to inject small fragments of code that control their data's behavior inside the storage service itself.

Some comments that I have: Since in Comet different replicas can be modified by different applications simultaneously, aggregation of these replicas would be required to get consistency; there is no hint as to how Comet performs this. Comet uses the Lua programming language. Could other programming languages be used instead of Lua, and if so, would the choice of language have any impact on the performance of Comet? I am also curious about the scalability of Comet. Comet's main goal is to allow multiple applications to use the same shared object. But if different applications want to perform different functionality, wouldn't that require introducing different code segments (and possibly new state/variables) inside the handlers? If so, would it scale with the number of applications using Comet? Another issue regarding the execution of handlers is that if there are simultaneous requests from different applications, which application gets access first? Is there any fairness/priority-based scheme for admission control? This could be an issue, as different applications may run different code segments and hence may consume different amounts of time and resources.

----Anupam

From: Long Kai [longkai1 SpamElide] Sent: Tuesday, February 22, 2011 12:03 PM To: Gupta, Indranil Subject: 525 review 02/22

Dynamo: Amazon's Highly Available Key-value Store

Summary: Dynamo is a distributed key-value storage system used at Amazon. Driven by Amazon's application needs, the most important feature of Dynamo is high availability for writes. To achieve this, unlike traditional systems in which reconciliation and consensus are usually done during the write operation, Dynamo uses vector clocks with reconciliation during reads, which provides high availability for users. To handle temporary failures, Dynamo adopts sloppy quorum and hinted handoff, again out of concern for high availability. The basic partitioning model for Dynamo is a ring of logical nodes. Amazon implements consistent hashing to achieve incremental scalability. Several logical nodes can be stored on one actual node, and N copies of data are stored in the successive logical nodes in the ring for fault tolerance. One fault-tolerance design feature is that the copies are stored in separate data centers: if one data center crashes entirely, data will not be lost. Other techniques include using Merkle trees to recover from permanent failures, and a gossip-based membership protocol with failure detection.

Pros:
> The system is highly available. The user can write to the database quickly even when only a few copies are available.
> The system is efficient. Sloppy quorum, hinted handoff, and the gossip-based membership protocol with failure detection allow the system to run more efficiently at the cost of immediate consistency.
> The system meets the customer's needs.

Cons:
> In several situations, like partitioning, consistency cannot be guaranteed.
> Dynamo also cannot guarantee consistency when system churn is high.
> Copies of data are stored in multiple data centers. Although this is good for tolerating the crash of a whole data center, it increases the burden of network transfer and the time for consistency to converge.

Comet: An active distributed key-value store

Summary: Comet is a general distributed key-value storage system that allows more flexible applications. The system implements active storage objects consisting of a key, values, and handlers. The user can define the handlers to meet specific application needs, and this is the core feature that makes Comet flexible. For security reasons, however, ASOs are restricted: they are kept isolated from each other and can only consume limited resources.

Pros: The system is flexible for different applications. Users do not need to make changes to the existing system to implement their own applications.

Cons: The restrictions on ASOs limit the flexibility of Comet. The restriction on communication between ASOs rules out structured ASOs for some applications. Also, it would be even more flexible if different ASOs could run different handlers to meet their special goals; this kind of implementation is not covered in the paper.

Best,
-- Larry

From: trowerm SpamElide on behalf of Matt Trower [mtrower2 SpamElide] Sent: Tuesday, February 22, 2011 11:55 AM To: indy SpamElide Subject: 525 review 02/22

Dynamo
Amazon describes in their paper their production-level key-value store, Dynamo. Dynamo is the product of several earlier works in the area, combined with the constraints and goals of an engineering system. Amazon requires the system to be highly available. This requirement is so strong that consistency is relaxed in the system. Dynamo is similar to other key-value stores in its methods. For partitioning the data amongst nodes it uses consistent hashing. An interesting difference from the earliest works in the area is the development of virtual nodes as a load-balancing technique. This is important to Amazon, which runs a heterogeneous system. Data is replicated amongst the next N physical nodes in the system; this does not always correlate with the next N virtual nodes, though. For consistency, a version of vector timestamps is used to compare data updates. If multiple stamps are incomparable, they can be resolved at the application level or in the system. For applications like a shopping cart, Amazon takes a union of all items in the cart. The paper is interesting from the standpoint of what actually works. Some aspects are missing due to the type of deployment, such as security issues. The system also fails under churn, which is common in any non-managed system.

Comet
Comet is a distributed key-value system aimed at extending the number of possible applications which can be built on top of KV systems. Comet is built on top of the BitTorrent-based system Vuze and allows for server-side extensions (handlers) written in Lua. The authors wanted the handlers to provide simple functionality which would expand the realms of how KV systems could be used. Comet supports several simple operations such as statistics, source tracking, and time/location awareness. In order to protect the system from the handlers' actions, they use a technique known as sandboxing, which gives the process a play area that won't affect the outside system. One of the interesting applications built on this is Vanish, a data system with self-destructing entries. Based on timeouts, data will either be available or unavailable (see the sketch below).
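A sketch of how an onTimer handler could implement Vanish-style self-destruction; the class, method names, and timing scheme are illustrative assumptions rather than the actual Vanish implementation:

    import time

    class VanishingASO:
        def __init__(self, value, lifetime_secs):
            self.value = value
            self.expires_at = time.time() + lifetime_secs

        def on_timer(self):
            # Periodically invoked by the hosting node; the object
            # destroys its own value once the lifetime has elapsed.
            if time.time() >= self.expires_at:
                self.value = None

        def on_get(self):
            return self.value  # None once the data has "vanished"

    aso = VanishingASO("secret", lifetime_secs=0)
    aso.on_timer()
    print(aso.on_get())  # None: the entry has self-destructed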
For many applications, information has this temporal property, which can be exposed with Vanish. What this system would look like in a trusted environment would be interesting: with no security worries, how much functionality could be placed into the handlers before they became too slow? It also seems the limited functionality of the handlers limits the application space considerably. I am interested in what the returns would look like if more functionality were offered.

From: Jason Croft [croft1 SpamElide] Sent: Tuesday, February 22, 2011 10:55 AM To: Gupta, Indranil Subject: 525 review 02/22

Hi Professor Gupta, Below is my review for the 2/22 papers on distributed key-value stores. Thanks, Jason Croft
----------------------------------------------------

Dynamo: Amazon's Highly Available Key-value Store

Dynamo is a key-value storage system designed for applications that need high availability over consistency. As the authors note, high availability and strong consistency cannot be achieved simultaneously, and in Dynamo availability is achieved at the cost of consistency. It provides a simple primary-key interface by building on several distributed systems protocols, including gossiping, quorums, and object versioning/vector clocks. It provides an "always writeable" data store; that is, writes are never rejected, and write conflicts are resolved by the application, since the application is more aware of the data and can resolve the conflict in the method best suited to the user experience. Furthermore, neither concurrent writes nor disk failures will cause a write to be rejected. Dynamo's design assumes all nodes within the domain are trusted, and targets latency-sensitive applications. It provides several knobs that can be customized for each application, including the number of hosts to which each data item is replicated, the minimum number of nodes that must participate in a successful read operation, and the number of nodes that must participate in a successful write operation. With this architecture, each application must run its own Dynamo instance, even if a single instance for multiple applications would be more efficient.

While Dynamo targets applications that require an "always writeable" data store, the trade-off of consistency for availability could create a less-than-ideal user experience. In the authors' example of a shopping cart, removals from or additions to a cart may not be instantly reflected, which in most cases would still provide a better experience than rejecting the update. However, this could be annoying for a customer in the final stages of a checkout process if the contents of the cart are not up to date. Nevertheless, Dynamo's design is to store "eventually" consistent data, which it does. For applications such as a shopping cart, this could be worth the trade-off of availability vs. consistency, but for some applications, like banking, Dynamo would not be suitable.

Comet: An Active Distributed Key-value Store

Comet is a distributed storage system that implements Active Storage Objects (ASOs), which consist of a key, a value, and a handler. Handlers are code that can operate on the state of a key and can execute on specific events, such as a put, a get, or the passing of some time interval. These handlers can be used to implement custom replication policies, control data access, or monitor application access. Comet's architecture consists of a routing substrate, a key-value store, and an active runtime system. The runtime enforces certain isolation and safety policies on handler execution. For example, handlers are sandboxed to prevent corruption of other nodes, and are allowed to manipulate only their respective values.

Like Dynamo, performance is an important goal in Comet's design; therefore, the performance of puts/gets without handlers should be comparable to non-active systems. However, the authors' evaluations show their active system considerably reduces throughput: peak throughput is 60% less than Vuze's. Latency and memory overhead are less significant, adding only a few milliseconds to a several-second lookup and 27% additional memory. Comet is also designed to be flexible and customizable, easily supporting statistics gathering, information tracking, replication, and access control. Unlike Dynamo, which requires a separate instance for each application, Comet attempts to improve on this design by allowing multiple applications to share a single instance. Furthermore, Comet assumes less trust between nodes in the system and therefore uses the active runtime system to sandbox ASO execution. Handlers also have restricted capabilities, such as no direct network access, system execution, thread creation, or file system access. A handler's resource consumption is also limited; however, the authors' description of this is somewhat vague. They claim the runtime limits both the number of bytecode instructions and the memory a handler consumes, but is this for each instance of a handler? If so, a malicious user may be able to perform a DoS on a node by implementing an onGet handler and then performing many concurrent gets on the key for that handler. Additionally, as the authors briefly mention, limiting handlers based on instruction counts may be problematic, as not all instructions are equal.

From: muntasir.raihan SpamElide on behalf of muntasir raihan rahman [mrahman2 SpamElide] Sent: Tuesday, February 22, 2011 10:48 AM To: Gupta, Indranil Subject: 525 review 02/22

Dynamo: Amazon's Highly Available Key-value Store

Summary: Dynamo is a success story at the interface of academia and industry. It shows how diverse distributed systems solutions can be combined to design a highly available and successful industrial system. In that sense, it does not offer any new solutions; rather, it shows that finding the right trade-offs in an industrial setting can also be tricky. Dynamo sacrifices consistency to maintain high levels of availability. The high-level goals at Amazon are reliability and scalability. Due to this large scale and the law of large numbers, failures are the norm rather than the exception. Amazon uses a decentralized, service-oriented, loosely coupled architecture. Dynamo partitions data and uses consistent hashing for replication. Consistency is facilitated by object versioning [Lamport timestamps]. Consistency among replicas is maintained using a quorum-like technique and a distributed synchronization protocol. It also uses gossip-style distributed failure detection and membership protocols. It proves that sacrificing a small amount of consistency can still lead to a production system. Dynamo implements eventual consistency, that is, all updates reach all replicas eventually. It uses sloppy quorum and hinted handoff to deal with temporary failures, whereas it utilizes Merkle-tree-based anti-entropy techniques to manage permanent failures.

Pros: (1) Simple key/value interface. (2) High availability with a clearly defined consistency window. (3) Resource efficient. (4) Ease of scaling. (5) Decentralized heterogeneous system.

Cons: (1) No isolation guarantee.
(2) Only single-key updates. (3) No protection against DoS or other security attacks. (4) The space requirement for vector timestamps may be prohibitive, and the proposed solution of compressing vector timestamps is not clear. (5) The space complexity of Merkle trees could be a problem. (6) It is not clear how Dynamo achieves 0-hop routing. (7) The simplicity of read operations is sacrificed for highly available writes. (8) Higher layers have to deal with conflict resolution, which might be an overhead for application-layer programmers.

Future work: (1) The authors mention that 99.9% is an optimal trade-off point. Is there an underlying theory behind that? (2) The optimizations shown here mainly work for Amazon services, but are they generally applicable to all systems? (3) Is it possible to add multiple nodes to the system simultaneously, instead of incremental scalability (adding one node at a time)?

Comments: Although the paper describes a very successful system, I am not sure it has enough novelty. It finds the sweet spot among many design choices and shows how to combine many systems techniques. But does it really offer academic researchers any new insights or future research directions?

Comet: An Active Distributed Key-Value Store

Summary: Comet addresses the issue of flexibility for distributed key-value stores. Traditional key-value stores are inflexible in the sense that they impose fixed application requirements, e.g., a fixed number of replicas or 99.9% availability. As a result, implementing applications with diverse requirements on top of a key-value store requires significant labor on the part of the application programmer. Comet alleviates this inflexibility by introducing an extensible key-value store where applications can tailor existing functionality to their needs and even add new functionality. To this end, Comet allows applications to store active objects (ASOs) instead of passive values. ASOs contain code segments that control their desired behavior in the DHT. Active-code security threats are mitigated using sandboxing techniques. Comet's design goals include flexibility, isolation, security, and low overhead. The authors built a Comet prototype and deployed it on PlanetLab. They also demonstrated a number of useful P2P applications (e.g., a torrent tracker, a self-monitoring DHT) that can be easily deployed with minimal adjustment using Comet.

Pros: (1) Flexible key-value stores. (2) Simple implementation of applications on top of Comet. (3) Useful in data center environments.

Cons: (1) Overhead of active object storage. (2) Security threats due to active objects.

-- Muntasir

From: cyhong1128 SpamElide on behalf of Chi-Yao Hong [hong78 SpamElide] Sent: Tuesday, February 22, 2011 10:38 AM To: Gupta, Indranil Subject: 525 review 02/22

---- Dynamo: Amazon's highly-available key-value store, DeCandia et al, SOSP 2007 ----

Amazon proposed Dynamo, a key-value storage system that provides very high availability. To achieve this goal, they downgrade some of the service guarantees (e.g., strong consistency) so that the system is always writeable. In particular, conflict resolution is performed during the read operation instead of the write operation. By sacrificing consistency, Dynamo provides a highly available service while preserving nice properties such as incremental scalability and heterogeneity. Dynamo adopts the ideas of consistent hashing and data versioning to achieve these properties. In order to reconcile objects with different versions, Dynamo uses vector clocks to check for conflicts and discover the causality between versions.

Pros: 1) Dynamo is available to use and is a real implementation in Amazon's storage platform; its availability is confirmed by the past two years of usage for Amazon's internal services. 2) Dynamo provides good performance. For example, the 99.9th-percentile latencies for both read and write operations are below 300 ms.

Cons: 1) Dynamo does not consider security issues. It is questionable whether Dynamo could provide high availability under attack. 2) The system design is complex, and this results in complicated run-time operations. Programs are known to have bugs, and therefore a highly complex system is usually not preferable, as it is hard to verify its correctness. 3) Dynamo adopts old wisdom like consistent hashing to provide properties like uniform load distribution. As previous studies had already shown that these nice properties hold, some of the experiments (e.g., Section 6.2) in this paper are redundant or not very informative.

---- Comet: An active distributed key-value store, OSDI 2010 ----

They proposed Comet, an extensible DHT whose properties administrators can customize. The customizations include assured deletion, storage-location awareness, popularity tracking, logging and forensics, node lifetimes, replication level, etc. This extensibility is likely to be useful in data center environments. They allow users to customize only the important features, i.e., they tried to avoid imposing unnecessary controls, which simplifies the design greatly. The paper is well written, and the evaluation is sound.

-- Chi-Yao Hong PhD Student, Computer Science University of Illinois at Urbana-Champaign http://www.cs.illinois.edu/homes/hong78/

From: harshitha.menon SpamElide on behalf of Harshitha Menon [gplkrsh2 SpamElide] Sent: Tuesday, February 22, 2011 8:51 AM To: Gupta, Indranil Subject: 525 review 02/22

Dynamo
Dynamo is Amazon's highly available key-value storage system. Dynamo is highly decentralized. Data is partitioned using consistent hashing and replicated. Consistency among replicas is achieved using quorums. Dynamo guarantees eventual consistency, which allows it to scale and reduce latency. Data is replicated amongst N replicas, and multiple versions of an object, each with a version number, may be present in the system. To maintain consistency among replicas, Dynamo uses a protocol similar to a quorum system, where W nodes have to agree on writes and R on reads, with W + R > N. Vector clocks are used to capture causality. To detect inconsistencies in the system, Dynamo uses Merkle trees. Usually it is the client which performs reconciliation in the presence of conflicting versions. Dynamo is a zero-hop distributed hash table, where each node maintains enough information locally to route a request to the appropriate nodes. Dynamo uses a variant of consistent hashing where, instead of mapping a node to a single point on the circle, each node gets assigned to multiple points on the circle. For load balancing, the hash space is divided into equal-sized partitions and each node is assigned an equal number of tokens. Client-driven request coordination involves the client contacting a node for membership information and routing the request to the set of nodes for the key.

Pros:
-Dynamo is a highly decentralized and very scalable distributed key-value store.
-The system is designed for eventual consistency, which leaves open the question of who resolves conflicts. They have gone with the approach of the client resolving them, which makes the system very customizable and reduces latency by allowing eventual consistency.
-Dynamo uses a variant of consistent hashing where a single node is mapped onto multiple virtual nodes. This leads to uniform data and load distribution.
-The Merkle tree approach to detecting inconsistencies minimizes the amount of data that needs to be transferred for synchronization.
-Client applications can tune the values of N, W, and R to achieve their desired levels of performance, availability, and durability.
-Client-driven request coordination avoids the overhead of a load balancer and an extra network hop.

Cons and Improvements:
-Since all the nodes in the system are aware of the data hosted by their peers, and they gossip amongst each other to transfer this information, the system is not very scalable: the routing table grows with system size, and so does the number of gossip messages. We could have a tree structure where nodes gossip amongst their siblings to exchange information.
-If the distribution of keys is not equal amongst partitions, there will be load imbalances. There is no dynamic load balancing, and the partitions are static. One possible improvement would be to start with static partitions and then split a partition when it grows beyond a certain load or a certain percentage of the average load; the new partition could be transferred to the least-loaded node.
-Applications have to be designed with the knowledge that multiple versions of the same object could exist.
-Vector clocks would not be scalable as the number of coordinating servers increases. They have suggested a clock-truncation scheme, but that would mean losing information about causality.
-The system sacrifices consistency to achieve high availability and scalability. This makes it difficult to write certain kinds of applications.
-The system is designed to be always writeable, which means that read latencies will be higher.
-In client-driven request coordination, a client could receive stale membership information.
-There is not much information on how this system performs under high churn rates.

Voldemort
This system provides a simple interface for accessing and storing key-value pairs, where values can be complex objects. It has a layered structure where each layer provides a simple storage interface. Voldemort uses consistent hashing to partition the data. The system tolerates inconsistencies and resolves them at read time. It uses vector timestamps to determine causality and conflicts. This system is similar to Dynamo.

Pros:
-Having multiple layers with a simple interface makes the system flexible; layers can be mixed and matched based on need.
-Data can be in any format.
-It supports different types of serialization.
-It is a simple system which is easy to distribute across a cluster.

Cons:
-Very limited query support.
-This is an always-writeable system where conflicts are resolved during reads, which will introduce read latency.
-It uses static load balancing.

Thank you
-Harshitha (gplkrsh2)

From: Andrew Harris [harris78 SpamElide] Sent: Tuesday, February 22, 2011 7:44 AM To: Gupta, Indranil Subject: 525 review 02/22

Comet, from the University of Washington, is a distributed storage system (DHT) built with strong abstraction and shared instancing for data.
It uses a special type of intermediate object, an "active storage object" or ASO, which acts similarly to a wrapper for Java primitives: for each storage object, a key and value are stored, and a set of common storage events (get, put, etc.) is exposed to the programmer. In other words, ASOs are a level of abstraction away from the DHT values, allowing greater flexibility for the programmer, greater data security and integrity through data isolation, and more specially optimized routines per implementation. ASOs themselves, while limited in their knowledge and power outside of themselves (per encapsulation guidelines), also contain information about themselves that an ordinary K-V record may lack, such as access statistics, value history, timers, location awareness, and others.

As a first Comet prototype, the group extended the Vuze DHT, an underlying component of the Vuze BitTorrent client, to support Comet's abstract methods. With simple configuration, their ASO-based implementation was able to demonstrate: nearby-node data replication; timeouts to handle exiting DHT nodes and other inconsistencies; notifications for updated K-V pairs; password-based access control for sensitive K-V pairs; DHT monitoring; and peer-list bandwidth and latency optimization per node. The group also discussed the potential for encryption and signed data objects as extensions to their current model. Their memory footprint averaged ~27% greater than the reference system (Vuze DHT) and the added CPU cost was seen to be negligible, though the Comet system managed a per-node throughput that was ~60% smaller than Vuze's.

Comet seems well suited to DHT implementations where maintaining consistency in the table is critical. For example, a distributed database system intended to handle a relatively high number of writes would benefit from such a system, as much of the transactional code could be abstracted away from the programmers once written. In general practice, however, the usefulness of Comet is questionable. In BitTorrent, DHTs are used merely for torrent file lookup; few BitTorrent users would actually need Comet features in their DHT, such as network monitoring, so long as queries were still executing successfully. In fact, in most BitTorrent clients, DHT is largely a set-and-forget feature that helps find peers outside of the main tracker. Though Comet can support features like geolocation, these would be supplanted by the peer-selection optimizations in the BitTorrent implementation, and thus largely left unused.

Voldemort, on the other hand, is a stripped-down DHT system which implements only three operations: get, put, and delete. Data within Voldemort is typed as one of four primitives, or an array thereof: numbers, strings, booleans, and "objects" or serialized blobs. These three operations and five datatypes are the extent to which clients are exposed to Voldemort; the rest is configured server-side. Implementations for both client and server are given in Java. As a point of reference, the group reports benchmarks using LinkedIn; access times are seen to be in fractions of milliseconds, with actual performance depending more on network latency than on their DHT system. Voldemort supports multiple routing mechanisms, so that it may be extended to a wider array of network arrangements. It uses consistently hashed data values across a hashing ring, as seen in other implementations, as a simple solution to data replication and replacement. This also enables offline key-value index computation for massive datasets, allowing initial data transfer to occur at line speed instead of waiting to hash each new item per insertion. In lieu of transactions, Voldemort uses a read-repair scheme to handle data conflicts: multiple writes, each timestamped, may be contributed to a single key's value, and conflict resolution is handled at the time of next access. Versioning is handled using a simple logical vector clock, and write failure is left to clients to handle.

While elegant in its simplicity, Voldemort makes assumptions about its implementing network that would not be appropriate for a general DHT (say, for BitTorrent). For one, in configuring its routing options, it is assumed that the developer has knowledge of the physical as well as the logical network architecture and will configure the Voldemort system accordingly. This is certainly not the case for BitTorrent networks, especially with highly unstable, geographically diverse nodes being the norm. Without some of the active network testing that Comet is able to perform to determine topologies, Voldemort could easily become a tangle of DHT connections. Additional data-privacy features also become implementation-dependent; to prevent a MITM attack, for instance, a Voldemort client would need authentication capabilities while communicating with a server. No such abilities exist in the reference implementation, requiring a code fork for sensitive DHT applications.

From: Nipun Sehrawat [sehrawa2 SpamElide] Sent: Tuesday, February 22, 2011 3:24 AM To: Gupta, Indranil Subject: 525 review 02/22

Dynamo: Amazon's Highly Available Key-value Store

Dynamo is an eventually consistent data store providing a primary-key-only access interface. The primary goals of Dynamo include high availability, reliability, and scalability (with the addition of commodity hardware), at the price of a weaker consistency guarantee, which can demand some extra logic at the application layer. Dynamo adopts consistent hashing to partition data among the nodes and introduces the concept of "virtual nodes" (multiple virtual nodes map to a single physical node), which helps in capability-aware distribution and convenient data redistribution at node arrival/departure. Each node maintains enough information to route a query to the destination node directly. Data is replicated among geographically distributed data centers for increased reliability. However, weak consistency requirements can lead to multiple versions of the same data in the system. Vector clocks are associated with every replica of an object to determine causal relationships between different versions. Conflicting versions are merged/overwritten if possible; otherwise, all the copies are made available to the application layer, where additional application-specific logic can present a consistent view of the data. Consistency among the replicas is maintained using a (sloppy) quorum-like protocol, in which reads and writes are deemed successful after the operation is applied to a configurable number of replicas. The performance of Dynamo's get() and put() functions can be tailored to an application's needs by setting the parameters (number of replicas, number of read/write responses required for a get/put to finish) of the above-mentioned consistency protocol, as sketched below. Trade-offs in availability vs. reliability, response time vs. consistency, etc. are left for applications to decide.
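A minimal sketch of the tunable (N, R, W) quorum logic described above; replica transport, failure handling, and version reconciliation are elided, and all names are illustrative:

    class Replica:
        def __init__(self):
            self.data = {}
        def write(self, key, value, version):
            self.data[key] = (value, version)
            return True
        def read(self, key):
            return self.data.get(key, (None, 0))

    class QuorumStore:
        def __init__(self, replicas, r, w):
            assert r + w > len(replicas)  # R + W > N: read/write sets overlap
            self.replicas, self.r, self.w = replicas, r, w

        def put(self, key, value, version):
            acks = sum(1 for rep in self.replicas
                       if rep.write(key, value, version))
            return acks >= self.w  # succeed once W replicas acknowledge

        def get(self, key):
            replies = [rep.read(key) for rep in self.replicas[: self.r]]
            # Serve the newest version seen; a real store would surface
            # all causally concurrent versions to the application instead.
            return max(replies, key=lambda kv: kv[1])

    store = QuorumStore([Replica() for _ in range(3)], r=2, w=2)
    store.put("k", "v1", version=1)
    print(store.get("k"))  # ('v1', 1)

Lowering W favors write availability at the cost of more read-time reconciliation; raising R does the reverse, which is exactly the knob the reviews describe.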
Node arrivals/departures in the Dynamo ring are handled elegantly thanks to the concept of "virtual nodes", and some well-known nodes ("seeds") are employed to prevent temporary logical partitioning of the ring. To address load balancing among the participating nodes and to ease archival of the entire data set, Dynamo presents two additional data-partitioning strategies, which build on the principle of decoupling the partitioning and placement schemes.

Pros:
- Achieves high scalability and availability with a slightly weaker consistency model that can generally be handled by the concerned application. Also, in an experiment, 99.94% of requests saw exactly one consistent version.
- Deviates from the strict ACID transaction model of RDBMSes (which chooses consistency over availability) to a system tailored for the more common usage pattern of Amazon's services (primary-key-only retrieval), thus improving availability and scalability.
- Dynamo is scalable, symmetric (in terms of nodes' responsibilities), decentralized, and heterogeneous, making it a fully distributed system.
- The concept of "virtual nodes" is helpful in tackling heterogeneity (in capability) of nodes, as well as node arrival/departure.
- Consistency decisions and availability/latency/reliability trade-off decisions are delegated to the applications.
- The usage of Merkle trees for replica synchronization seems to be a big improvement over sending entire replicas for matching.

Cons:
- The paper doesn't mention the space overhead required for storing the preference lists of all keys at every node to facilitate zero-hop routing.
- Network usage statistics for maintaining consistency and distributing data during node arrival/departure are not reported.

--- Project Voldemort

Voldemort is a distributed database restricted to key-value store operations (get(), put(), and delete()) to achieve predictable querying performance while distributing the data across a cluster. The key ideas of data partitioning, maintaining replica consistency, and object versioning are directly borrowed from Amazon's Dynamo. The logical system architecture contains multiple layers, all supporting put() and get() commands. This provides flexibility (at run time also), as new layers can be introduced and different implementations of a particular layer can be used, as long as the layers maintain a consistent put() and get() interface. Various physical architectures are possible, but reducing the intermediate network hops (thus decreasing latency and possibly improving bandwidth) requires the load-balancing logic to be pushed near or inside the client.

Like Dynamo, Voldemort uses consistent hashing for distributing data amongst the nodes. The access pattern is primary-key retrieval, but the associated value can be of a complex type such as a list or a map. Voldemort deals with data in binary format, so values might need to be serialized/deserialized for communication with a client. As in Dynamo, Voldemort adopts a quorum-like model for consistency, where conflicts are repaired at read time with the help of application logic and vector timestamps (a sketch follows below). Voldemort also deals with storing batch-computed data and the problems associated with pushing this data to live servers. Instead of building the index for the new data on a live server (which would degrade the performance of servicing live requests), it prebuilds the index offline and then pushes the updates to the live servers.
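A minimal sketch of read-repair as described here: divergent copies detected at read time are reconciled and written back to stale replicas before the client is answered. The single-value replicas and the max-based resolver are stand-ins for real vector-clock or application-level merge logic:

    class Replica:
        def __init__(self, value=None):
            self.value = value
        def read(self, key):
            return self.value
        def write(self, key, value):
            self.value = value

    def read_repair(replicas, key, resolve):
        # Gather every replica's copy, reconcile, then push the repaired
        # value back to stale replicas before answering the client.
        versions = [rep.read(key) for rep in replicas]
        repaired = resolve(versions)
        for rep in replicas:
            if rep.read(key) != repaired:
                rep.write(key, repaired)
        return repaired

    # Resolve by taking the greatest value: a toy stand-in for a real
    # vector-clock comparison or application-specific merge.
    reps = [Replica("v2"), Replica("v1"), Replica("v2")]
    print(read_repair(reps, "profile:42", resolve=max))  # "v2"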
Pros:
- Layered architecture with pluggable components.
- "Values" can be complex objects such as lists and maps, effectively providing a one-to-many mapping from a key to values.

Cons:
- Most of the distributed-systems machinery is borrowed directly from Dynamo.

From: nicholas.nj.jordan SpamElide on behalf of Nicholas Jordan [njordan3 SpamElide]
Sent: Tuesday, February 22, 2011 1:43 AM
To: Gupta, Indranil
Subject: 525 review 02/22

Nicholas Jordan njordan3 Feb 02/22

Project Voldemort, LinkedIn

I read the website, and detailed information was missing. I feel that they "talked" about Voldemort, but only in the most general DHT sense; you could really put any DHT's name at the top of the page and it would still be a valid page for it. I wish they had gone into detail about how exactly they handle consistency and versioning issues in relation to LinkedIn. They never say how they resolve the case where two different emails for the same person are present in the DHT. On a good note, I feel this is a good introduction to some components of a DHT: it touches on consistent hashing, versioning, and timestamps.

Comet:

I liked the Comet paper better. Basically, Comet is a DHT like many others, except that the user can control replication and the other aspects that differentiate one DHT from another, through mechanisms called handlers. Some of these handlers provide useful functionality for modifying the data. For example, "onTimer()" is code that runs every so often, and "onGet()" is called whenever a remote node gets the object. Once an object has been stored, these handlers can modify its data and state, which is what makes it an "active" DHT. I like this concept because it's a programmable DHT: you can alter its policies to suit a more highly available architecture with high replication, or maybe no replication with high security in place. A downside to Comet is that they restrict what an ASO and its handlers can know, so the functions/application are put in a sandbox and forced to play nice. There are restrictions on whom an ASO can communicate with and what system information it can access and modify. I buy the security aspect of the system; because all nodes in Comet have these restrictions, it would be hard to mount a DoS attack or other malicious acts. I just feel that more complex operations in the DHT will create a lot of messages or be near impossible. Right now, I can't think of any other drawbacks to this restriction, so it's a tradeoff for the improved security. They also addressed the centralized P2P tracker-server issue through Comet: participating nodes in a tracker would store their information under a single key, and inquiring nodes would just look up the key, whose value would be all the peers in the tracker.

-- Thanks, Nick Jordan

From: Tengfei Mu [tengfeimu SpamElide]
Sent: Tuesday, February 22, 2011 1:28 AM
To: indy SpamElide
Cc: Gupta, Indranil
Subject: 525 review 02/22

Tengfei Mu (mu3)

1. Dynamo: Amazon's highly available key-value store

This paper presents Dynamo, a highly available key-value storage system. Amazon is one of the largest e-commerce services in the world, so even a slight outage may cause financially disastrous consequences and damage customer trust. Dynamo aims to provide a highly available system, at the cost of some consistency, so that it can provide an always-on experience to users.
They use consistent hashing to solve the data-partitioning problem, and vector clocks to provide high availability for writes. Moreover, Dynamo handles various failure situations with extensive object versioning and application-assisted conflict resolution. As a result, Dynamo satisfies its performance requirements while also providing high availability.

Pro:
1. Dynamo is a great example of utilizing existing state-of-the-art techniques in a distributed storage system. And it is deployed in 'real' industry.
2. Incremental scalability is great for service owners, who can scale up and down based on their current request load.

Con:
1. They don't consider security issues for Dynamo, only claiming that it runs in a trusted environment.
2. Dynamo sacrifices consistency under some failure conditions.
3. It would be really interesting to compare Dynamo with BigTable, developed by Google.

2. Comet: An active distributed key-value store

This paper presents Comet, an extensible DHT that allows per-application customizations. Before Comet, distributed storage systems were already very popular due to their scalability, reliability, and availability. But they have a severe limitation: a one-size-fits-all key/value store cannot satisfy different applications with different needs. Thus Comet is proposed as an extensible key/value store that deals with diverse application needs. In Comet, applications store active objects instead of passive values. The active objects contain small code snippets that control their behavior and data in the DHT. The authors built Comet on top of Vuze and Lua, and they show several examples using Comet.

Pro:
1. Comet can support a wide variety of application customizations.
2. Comet considers isolation and safety issues.
3. Comet is a lightweight DHT, with low overhead for hosting nodes.

Con:
1. The experiments were done on PlanetLab instead of a realistic industry system.
2. The deployment could have been at a larger scale.

From: lewis.tseng.taiwan.uiuc SpamElide on behalf of Lewis Tseng [ltseng3 SpamElide]
Sent: Tuesday, February 22, 2011 12:50 AM
To: indy SpamElide
Subject: 525 review 02/22

CS 525 Review 02/22: Key-Value Stores

Dynamo: Amazon's highly-available key-value store, DeCandia et al, SOSP 2007

The paper introduces a key-value store system, Dynamo, which is tailored to Amazon's tasks. First, it supports highly latency-sensitive tasks: Amazon's service level agreements (SLAs) specify requirements for 99.9% of requests, so the challenge is much larger. Second, Dynamo is only used internally, so the environment is relatively reliable; there are no malicious nodes, and all failures are assumed to be benign and transient. Third, writes are more important due to their relevance to the user experience, and the consistency issues around these write operations can usually be resolved by "merging." Fourth, a huge number of operations occur at the same time, so the system must be highly scalable. Because of all these distinctive requirements, Amazon developed their own system to satisfy the following properties: high availability, incremental scalability, always-writeable operation, symmetry, support for heterogeneous nodes, and decentralization. Dynamo is essentially a collection of various techniques: consistent hashing for scalability, sloppy quorum and hinted handoff for handling failures while preserving availability and durability, and a gossip-based protocol for membership maintenance and symmetry preservation.
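Since consistent hashing with virtual nodes underpins nearly every review in this set, a minimal sketch may be useful. Names and parameters here are illustrative, not from the paper; the replica-selection loop also shows how a preference list can skip additional virtual nodes of a physical machine it has already chosen, addressing the replica-placement concern raised earlier in this digest:

    import bisect
    import hashlib

    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, vnodes_per_node=8):
            # Each physical node owns several points ("tokens") on the
            # ring; more capable machines can take more tokens.
            self.tokens = sorted(
                (_hash(f"{n}#{i}"), n)
                for n in nodes for i in range(vnodes_per_node))

        def preference_list(self, key, n_replicas=3):
            # Walk clockwise from the key's position, collecting the
            # first N *distinct physical* nodes.
            start = bisect.bisect(self.tokens, (_hash(key), ""))
            chosen, i = [], 0
            while len(chosen) < n_replicas and i < len(self.tokens):
                _, node = self.tokens[(start + i) % len(self.tokens)]
                if node not in chosen:
                    chosen.append(node)
                i += 1
            return chosen

Adding or removing a node only moves the keys adjacent to its tokens, which is the incremental-scalability property the reviews highlight.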
One contribution is definitely integrating these techniques into a whole system and rigorously testing them under highly demanding real-world workloads. The other novelty in the paper is the data versioning: to support always-writeable operation, Dynamo sacrifices some performance by explicitly treating each modification as a new, immutable version of the data; consistency is then resolved by comparing the associated vector clock values.

Comments/questions/critiques:

Dynamo focuses on a very specific subset of environments and tasks, especially in that the environment is highly reliable and secure, and the tasks are very sensitive to latency but live well with eventual consistency. Can Dynamo be smoothly transferred or modified to suit other contexts? I am not sure whether Amazon's infrastructure is geo-distributed. If not, what happens when a large bulk of servers goes down at the same time (say, a disaster occurs)? Can the current design ensure the safety and integrity of the data?

Comet: An Active Distributed Key-Value Store, R. Geambasu et al, OSDI 2010

This paper introduces Comet, a distributed key-value storage system that stores a collection of special objects called active storage objects (ASOs). Comet provides the usual storage operations, like get and put, along with highly dynamic application-specific actions handled by ASOs. Overall, Comet has four desired features for such a key-value storage system: availability, flexibility, reliability, and scalability. In addition to an overview of the Comet architecture and design choices, the paper mentions many applications that are not possible in traditional storage systems or DHTs, due to the lack of flexibility in handlers, but that become possible with ASOs in Comet. Finally, the paper prototypes Comet on top of the Vuze DHT (a P2P DHT using Kademlia routing), deploys it on PlanetLab, and runs experiments to verify Comet's features. One novelty in Comet is the use of a strong language-based (Lua-based) sandbox that enforces isolation and thereby provides safety. The sandbox provides a very restricted environment: limited knowledge about other objects and resources on the same physical device, limited access to the device, limited communication over the network, and limited computation and memory. These rigid constraints do not hurt performance, because the paper focuses only on simple functions over storage objects, such as statistics gathering or location awareness. Moreover, thanks to the use of Lua and its simple design, the sandbox is small and lightweight, and thus suitable for the purpose. With this strong sandbox and its functionality, the second advantage of Comet, ASOs, can be realized; otherwise, such flexibility would usually devastate the system, either by degrading performance through selfish and aggressive behavior, or through malicious exploitation. An ASO contains a key, a value, and a segment of code implementing a set of handlers that can dynamically do many things based on the current state. ASOs allow programmers to develop many customized applications on top of Comet; for example, the paper mentions a specialized replication scheme, a context-aware storage method, etc. However, this programmability introduces a new attack model (relative to traditional attacks on DHTs) via malicious ASOs. The paper argues that with its rigid sandbox, such attacks are very unlikely.
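Comet's actual handlers are small Lua routines running inside the sandbox; as a rough Python-flavored sketch of the statistics-gathering use case mentioned above (the class and method names are invented here), an ASO bundles a key, a value, private state, and handlers that fire on DHT events:

    import time

    class ActiveStorageObject:
        """Illustrative stand-in for a Comet ASO (real ones are Lua)."""

        def __init__(self, key, value):
            self.key, self.value = key, value
            self.state = {"gets": 0, "last_get": None}  # ASO-local only

        def on_get(self, requester):
            # Fired on each get(); sandbox rules mean it can touch only
            # this object's value/state: no files, threads, or raw sockets.
            self.state["gets"] += 1
            self.state["last_get"] = (requester, time.time())
            return self.value

        def on_timer(self):
            # Periodic hook; e.g., expire the object or adjust replication.
            if self.state["gets"] == 0:
                self.value = None  # one possible self-cleanup policy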
Comments/questions/critiques:

Such flexibility seems to be a very attractive feature, but it comes with a very strongly constrained sandbox. I wonder, as ASOs become more complex, how well and how efficiently the sandbox can support them. The paper mentions that in most cases Comet does not trust remote nodes. This seems to be an over-pessimistic scheme, which incurs a lot of overhead. Is it possible to integrate Comet with some sort of reputation scheme and reduce the overhead while sacrificing a bit of safety? Due to the sandbox's constraints, an ASO has only a limited ability to interact with its neighbors. What happens when the churn rate or failure rate is very high? In particular, an ASO can only communicate with a given node once. What if the node fails or the link breaks in the middle of communication? Can such a one-time capability (of communicating with neighbors) be restored?

From: Qingxi Li [cs.qingxi.li SpamElide]
Sent: Tuesday, February 22, 2011 12:34 AM
To: Gupta, Indranil
Subject: 525 review 02/22

Comet: An active distributed key-value store

This paper introduces a distributed key-value store built on the Vuze DHT. Compared with Vuze, Comet adds services that let the user make application-specific changes. With the help of Comet, the DHT "can easily support different storage lifetimes, access methods, access control schemes" and so on. Instead of the original key-value pairs, Comet stores Active Storage Objects (ASOs), which include a key, a value, and some handlers for the object. The value stored by Comet includes, besides the metadata, some state information associated with the object, such as logs, counts, and so on. As for the handlers, Comet lets users define their own get(), put(), timer(), and other functions. However, just as the authors mention in the paper, "the proper tradeoff between power and safety" is the main problem. To achieve the proper tradeoff, the authors place some restrictions on the handler functions.

1. Comet rate-limits the number of messages generated by an ASO. This limit prevents malicious nodes from misusing the network's bandwidth, but at the same time it constrains applications that need to communicate frequently with other nodes. Besides this, Comet limits the rate of generating messages but not the total size of the messages in a period, so this limit cannot prevent malicious nodes from sending large packets to waste network bandwidth (see the sketch after this list).

2. Comet places extreme restrictions on the handler functions: they have no direct network or file-system access, and no system-execution or thread-creation capabilities. Besides this, Comet also limits the amount of memory that can be used and the number of bytecode instructions that can be executed. These restrictions definitely make the handlers safe, but they also heavily limit the handlers' functionality. From the examples given, we can see that most of the specific applications are simple and small; some of them don't even need to communicate with other nodes.
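The gap noted in point 1 above, limiting message count but not bytes, is commonly closed with a byte-metered token bucket. The following is a generic sketch of that standard technique, not Comet's documented mechanism:

    import time

    class ByteBucket:
        """Generic byte-metered token bucket (not from the Comet paper)."""

        def __init__(self, rate_bytes_per_s, burst_bytes):
            self.rate, self.capacity = rate_bytes_per_s, burst_bytes
            self.tokens, self.stamp = burst_bytes, time.monotonic()

        def allow(self, msg_size):
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.stamp) * self.rate)
            self.stamp = now
            if msg_size <= self.tokens:
                self.tokens -= msg_size  # charge by size, not message count
                return True
            return False  # caller drops or delays the ASO's message

Charging by size means a handler emitting a few huge packets is throttled just like one emitting many small ones.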
Dynamo: Amazon's Highly Available Key-value Store

This paper introduces Dynamo, the distributed storage system used by Amazon. The system is based on consistent hashing and provides high availability across multiple data centers. To achieve this high availability, Dynamo sacrifices consistency. For Amazon's particular usage environment, Dynamo is required to be always writeable, which pushes conflict resolution to reads so that users can always put goods into their carts. Besides this, Dynamo also hands the task of conflict resolution to the application layer: when the data centers find versions of a cart without a causal ordering, the application merges the goods from the divergent versions into one version to resolve the conflict. For adding nodes, Dynamo uses a ring structure and ensures that when a node joins, only its neighbors are affected. To achieve load balance, it also uses virtual nodes, meaning that one physical node may appear as multiple nodes in the ring; as a result, when one physical node fails, multiple nodes may disappear from the ring. To detect node failures, Dynamo requires nodes to send messages to their neighbors frequently, which can waste a lot of bandwidth. The system is assumed to be used only by Amazon, so its trade-offs consider only Amazon's situation, and it cannot readily be adopted by other distributed systems that don't face exactly that situation. In particular, for most distributed systems, reads have higher priority than writes.

From: mark overholt [overholt.mark SpamElide]
Sent: Monday, February 21, 2011 6:56 PM
To: Gupta, Indranil
Subject: 525 review 02/22

Mark Overholt
CS525
02/22/2011

Dynamo: Amazon's Highly Available Key-value Store

Summary:
Dynamo is a system used internally at Amazon to provide a highly available key-value store for their services. While the designers knew there was plenty of other research on key-value stores already available, they needed a system with a few different requirements. Most key-value store systems focus on high consistency of the data between replicated nodes. Amazon's business needs did not require this as much as super-high availability of writes and reads, even if it meant the data was slightly stale. They achieved this by merging common ideas from other key-value stores into a system that is highly dynamic and highly available. They chose a KV store mainly for its simplicity, and it was all the storage that their main services needed; a full RDBMS would have been too much and would have used more resources than Amazon needed. They partition their data across machines using consistent hashing along a logical ring. They use vector clocks to ensure highly available writes, allowing subsequent reads to reconcile. They handle failures using sloppy quorum, hinted handoff, and Merkle trees. Lastly, they use a gossip-based protocol to handle membership and failure detection of nodes.

Discussion:

Pros:
- They do a good job of achieving high availability. Many of the scenarios they describe, where high availability is important, make a lot of sense: they would be quite annoying if the system weren't so highly available.
- By using eventual consistency, they get a good percentage of data that is backed up and consistent.
- The paper does a good job of going over the different settings to achieve better performance in a number of categories.

Cons:
- To achieve its full-membership model, Dynamo uses gossiping between neighbors to learn the data hosted at each node; this is not scalable to large systems, as the routing-table size increases linearly with the number of nodes.
- Dynamo sacrifices consistency under certain failure conditions. (But as previously discussed, they knew this, and it was a conscious decision.)
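Several reviews above mention Merkle trees for replica synchronization; a toy sketch of the comparison shows why in-sync replicas exchange only a root hash while divergence costs roughly O(log n) per differing key. In a real system the subtree hashes are cached and exchanged over the network rather than recomputed, and both replicas must agree on the key-range layout; the names here are invented:

    import hashlib

    def h(*parts):
        return hashlib.sha1("|".join(parts).encode()).hexdigest()

    def subtree_hash(leaves, lo, hi):
        # Hash summarizing the key range [lo, hi); `leaves` holds the
        # hashes of individual key/value pairs stored by one replica.
        if hi - lo == 1:
            return leaves[lo]
        mid = (lo + hi + 1) // 2
        return h(subtree_hash(leaves, lo, mid), subtree_hash(leaves, mid, hi))

    def diff(a, b, lo=0, hi=None, out=None):
        """Return leaf indexes where replicas `a` and `b` disagree,
        descending only into subtrees whose hashes differ."""
        hi = len(a) if hi is None else hi   # assumes len(a) == len(b)
        out = [] if out is None else out
        if subtree_hash(a, lo, hi) != subtree_hash(b, lo, hi):
            if hi - lo == 1:
                out.append(lo)              # only this key is transferred
            else:
                mid = (lo + hi + 1) // 2
                diff(a, b, lo, mid, out)
                diff(a, b, mid, hi, out)
        return out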
Comet

Summary:
Comet is a DHT system made specifically to extend current DHT systems. The goal is to make extending and changing DHT systems quicker and easier for the programmer. The motivation is that current DHT deployments are nearly impossible to change or extend beyond the simple GET/PUT constructs without significant code changes and redeployment. Their idea is to take the simple key/value pair and wrap it up into an extensible object. They call these objects Active Storage Objects (ASOs). They consist of a key/value pair, as well as multiple "handler" functions to manipulate the data. These handler functions have specific triggers that run code at certain times to extend the functionality of the system. The goal is to make the system extensible and also secure, so that developers can start extending their simple DHTs into more powerful services. Examples they give are dynamic replication policies using the onTimer handler, and an aggregate statistics log using the onPut handler. They list many possibilities enabled by only a handful of handlers on the ASOs. In order to facilitate security, each ASO has strict policies regarding its actions: it can only take up a limited fraction of resources; it is not allowed network communication unless it is contacting a node with a replica of the same ASO; and an ASO has no access to the data or handlers of any other ASO, local or not. This is done to prevent extended code from running wild and being used in distributed attacks. The prototype system was built on top of the Vuze DHT, and the extensions were written in a heavily modified version of Lua.

Discussion:

Pros:
- Extending the capability of a simple key-value store is a fantastic idea. Being able to make quick changes to functionality without having to re-roll the entire application would save a lot of time.
- Limiting the capabilities of the ASOs at the programming level to try to prevent malicious attacks.

Cons:
- Where does the code for the handlers reside? Is it on the client and pushed to the ASOs, or is it on the nodes and pushed out when it changes? Can new handlers be implemented, or is the developer stuck with the few that are given? Can different ASOs use different handlers? This wasn't really clear in the paper.

From: iitb.ankit SpamElide on behalf of Ankit Singla [singla2 SpamElide]
Sent: Monday, February 21, 2011 3:18 PM
To: Gupta, Indranil
Subject: 525 review 02/22

1. Dynamo:
-----------------

Summary: For me, this paper provides a good collection of systems techniques (summarized conveniently in a table) for reliability and performance. The paper starts out with the observation that because the way their system is used doesn't require the complex queries for which RDBMSes are built and optimized, they can simplify, providing a simple key-value interface only. This also allows easier scale-out than for an RDBMS. They make the conscious choice to depart from the ACID guarantees (settling for weaker consistency and dropping isolation) in order to gain availability. The overall structure is a fairly simple consistent-hashing-based key-value store. They do not, however, use DHT-like multi-hop routing, instead storing enough information in 'preference lists' about where to fetch each piece of data from. This is a requirement for them because of latency sensitivity.
They also apply standard quorum read-write techniques to ensure reliability.

Comments: Their use of consistent hashing is somewhat different. One of the motivations of consistent hashing is to ensure that data is moved only to the successor of a failing node and doesn't impact anything else. They consciously choose to run multiple virtual nodes in a way that will cause communication with multiple nodes on such a failure; it seems the load-balancing factor is more important to them. The use of Merkle hash trees to optimize the anti-entropy is a nice idea, but fairly obvious, and I doubt that it's novel. It is also surprising that manual intervention is the only means of changing Dynamo ring memberships! The modularity of the implementation is impressive in its tolerance of different storage engines. One question I wonder about is how an outage affects earnings: say on Day 1, Amazon is down. Do people make 1.2x average-daily-sale purchases on Day 2 (because some who came the previous day come back to make purchases), or do they make 0.8x average-daily-sale purchases (because people were unhappy with the service, word got around, etc.)? [Of course, this is more of a curiosity, and the paper is not exactly about this.]

2. Comet
--------------

Summary: This paper primarily adds the notion of event-triggered handlers to the key-value store. Thus they store key-value-handlers mappings, with a small set of handlers specified during the put operation. These handlers are triggered in response to the events for which they register. This naturally adds flexibility, but brings forth the challenges of isolation and safety: handlers for one service should not overwhelm the system with messaging, computation, or memory use. These assurances are built explicitly into the system by imposing restrictions on handlers, including language-based sandboxing. Based on this simple primitive, the paper investigates several applications that can benefit from it.

Comments: I'm curious how this compares with Google's BigTable. BigTable seems to allow storage of database-like rows of data, with processing rules (which handlers sound similar to) being allowed in these rows. There's also the notion of an 'action-complete' implementation, which would allow applications to do all kinds of things, but would also be a significant vulnerability. Comet seems to be a fairly conservative first-cut allowance to applications.

Ankit
--------------------------------------------------------------------
Ankit Singla
Graduate Student, Computer Science
University of Illinois at Urbana-Champaign (UIUC)
http://www.cs.illinois.edu/homes/singla2/