From: pooja.agarwal.mit@gmail.com on behalf of pooja agarwal [pagarwl@illinois.edu]
Sent: Tuesday, March 02, 2010 12:25 PM
To: Indranil Gupta
Subject: 525 review 03/02

DS REVIEW 03/02
By: Pooja Agarwal

Paper – Smoke and Mirrors: Reflecting Files at a Geographically Remote Location Without Loss of Performance

Main Idea:
This paper presents SMFS (Smoke and Mirrors File System), which backs up files at geographically distributed sites. It provides semi-synchronous mirroring between sites, striving for better performance at the cost of some risk of data loss. It limits the window of data loss by leveraging the high-speed wide-area network between the mirroring sites, and it uses a forward error correction code (Maelstrom) to enable recovery of data that might be lost or corrupted in transmission. It also allows batched updates to groups of files, which are reflected either completely or not at all. The failure model admits all kinds of failures in storage and networks, including rolling disasters. The system uses redundant network paths to survive network failures and assumes that the storage system is mostly recoverable. It is built around the reliability offered by the network and treats packets in flight as transient file storage. In network-sync, the host performs an operation at the local site, which is immediately written to local storage; the local site then hands the mirrored update to the egress router, which adds Maelstrom-based FEC packets and sends the stream to the remote mirror's ingress router over TCP. TCP's reliable delivery and the partial reliability offered by FEC are the backbone of this transient network storage. On receiving the acknowledgement from the egress router, the local site responds to the client. Performance therefore improves by relying on the reliability offered by the network. SMFS uses a distributed log-structured file system, which provides an append-only log format for adding new data. The system is evaluated against several levels of synchronization: local-sync, remote-sync, network-sync, local-sync+FEC, and remote-sync+FEC.

Pros:
1) It is one of the first distributed file systems to put forward the notion of using network buffers as fairly reliable storage for transient data.
2) It improves request latency and also shrinks the window of vulnerability by sending updates as soon as a request is received. Here, the reliability of a transaction depends on the stability and reliability of the network.
3) Network-sync is quite general, and many applications with different kinds of storage systems can use it.

Cons:
1) The scheme does not guarantee reliability, since it depends on network stability, which can be poor. FEC also cannot recover data when most of the data is lost or corrupted, and TCP retransmissions can fail if the source node fails.
2) It assumes high-bandwidth wide-area links between remote sites (10 to 40 Gbit/s), which only large organizations such as banks and large industries can afford. This makes it less usable by smaller organizations.
3) In the context of cloud storage, the massive amount of data in each file would take a long time to transfer over the wide-area network, widening the window of vulnerability, and applying FEC to such large datasets would be expensive as well.
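To make the network-sync write path described above concrete, here is a rough Python sketch. It only illustrates the ordering of events (apply the update locally, hand the mirrored update to the egress router, acknowledge the client once the router reports that data and FEC repair packets are in flight); names such as EgressRouter and handle_client_write are illustrative stand-ins, not SMFS's actual interfaces, and the (8,3) grouping is just one example configuration.

# Rough sketch of the network-sync write path; names are illustrative only.

class EgressRouter:
    """Stands in for the primary site's egress router running Maelstrom-style FEC."""
    def __init__(self, data_per_group=8, repair_per_group=3):
        self.r, self.c = data_per_group, repair_per_group   # e.g. an (8,3) configuration
        self.wide_area_link = []                            # packets "stored" in flight

    def send_with_fec(self, data_packets):
        # Proactively add c repair packets for every r data packets, then put
        # everything on the wide-area link toward the remote mirror's ingress router.
        for i in range(0, len(data_packets), self.r):
            group = data_packets[i:i + self.r]
            repairs = ["repair(%d,%d)" % (i // self.r, j) for j in range(self.c)]
            self.wide_area_link.extend(group + repairs)
        return True   # feedback callback: data + repair packets are now in flight


def handle_client_write(local_log, router, update):
    local_log.append(update)                                # 1. apply the update locally
    packets = ["%s/pkt%d" % (update, k) for k in range(4)]  # 2. packetize the mirror request
    in_flight = router.send_with_fec(packets)               # 3. router adds FEC, signals feedback
    return "ack" if in_flight else "wait"                   # 4. reply before the remote mirror acks


log, router = [], EgressRouter()
print(handle_client_write(log, router, "append-block-42"))  # -> ack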
Talk – Towards Robust Distributed Systems

Main Idea:
The talk presents the key principles behind Inktomi, a startup co-founded by the speaker that provided distributed global search and distributed caching. The talk outlines several fundamental issues with classic distributed systems: a focus on computation rather than data, ignoring location distinctions, poorly defined consistency/availability goals, and a poor understanding of boundaries. It presents the DQ principle, in which data per query * queries per second is roughly constant for a given system when the workload is evenly distributed; faults reduce this constant roughly linearly. It also presents the notions of harvest and yield, where yield is the fraction of queries answered and harvest is the fraction of the complete result returned. Their product tends toward a constant, and various balancing methods trade between availability (Q) and integrity (D). For graceful degradation, harvest times yield should decrease linearly under load, with some amount of admission control. Integrity can also be reduced dynamically to support more users at peak times. Inktomi also assigns queries to servers randomly and tries to make faults independent. They also suggested upgrading half of the machines at a time rather than running mixed versions of the software simultaneously.

Pros:
1) I very much appreciate the key problems with classic distributed systems mentioned in the slides. The slides were made in the year 2000 and the problems they mention still persist.
2) The harvest*yield = constant relation can be used to dynamically vary availability and integrity based on demand and faults.
3) The idea of using randomization provides better load balancing.

Cons:
1) More details of the distributed algorithms could have been provided.
2) It does not cover what policies are followed to dynamically trade off availability against integrity.

From: Giang Nguyen [nguyen59@illinois.edu]
Sent: Tuesday, March 02, 2010 12:23 PM
To: Gupta, Indranil
Subject: 525 review 03/02

CS 525 paper review 03/02
Giang Nguyen
nguyen59

Cumulus: Filesystem Backup to the Cloud

This paper presents a file backup solution that uses the public cloud. The cloud is "thin," meaning that it is only required to provide a minimal interface (get, put, delete, and list) that works at the granularity of whole files. Since the interface does not provide a "modify" operation, the way to modify a file is to delete it and store the new version. Cumulus revolves around the idea of snapshots. Each snapshot logically contains two parts: the metadata log that lists all the files backed up, and the file data itself. Both parts are broken into blocks, or "objects," which are numbered sequentially and packed together into "segments," which have unique names. Each object is uniquely named by its segment name and offset pair. Each file is described in the metadata log as a list of objects in the file data segments. Cumulus provides sharing between files, where multiple files can point to the same data object. Each snapshot is described by its snapshot descriptor, which is not stored in a segment. The snapshot descriptor contains, among other things, a timestamp and a pointer to its root object (inside a metadata log). When a file is changed, only the new or changed data and the pointers to it need to be uploaded; the file's other objects need not change. For segment cleaning, Cumulus does not remove a segment as long as there are snapshots that point to it.
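As an aside, the four-call thin-cloud interface and the segment/offset naming described above can be sketched in a few lines of Python. The dictionary-backed ThinCloud class below is only a stand-in for a real provider such as S3, and the (segment, offset, length) reference is an illustrative simplification of the paper's segment-name/offset pairs.

# Toy stand-in for the thin-cloud storage interface assumed by Cumulus.
class ThinCloud:
    def __init__(self):
        self._files = {}

    def put(self, name, data):        # store a whole "file" (write-once usage)
        self._files[name] = data

    def get(self, name):              # retrieve a whole "file"
        return self._files[name]

    def list(self):                   # names of stored "files"
        return sorted(self._files)

    def delete(self, name):           # remove a "file", reclaiming its space
        del self._files[name]


# Objects are packed into segments; an object is then referenced by its
# segment name and offset (a length is added here purely for slicing).
cloud = ThinCloud()
segment = b"object-A" + b"object-B"                   # two objects packed into one segment
cloud.put("segment-0001", segment)
object_ref = ("segment-0001", 0, len(b"object-A"))    # (segment name, offset, length)
seg_name, offset, length = object_ref
print(cloud.get(seg_name)[offset:offset + length])    # -> b'object-A'
print(cloud.list())                                   # -> ['segment-0001']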
When creating new snapshots, Cumulus avoids referring to inefficient old segments (those with little useful data) by copying their live data into new segments and referring to those instead. This way, the old segments have a better chance of becoming free and being reclaimed. Segment cleaning is triggered when a segment's utilization (the fraction of the segment's bytes used by a current snapshot) falls below a configured threshold. To restore a snapshot, Cumulus downloads the root segment of the snapshot and follows all the pointers in it to download all the files in the snapshot.

Pros:
- When using Amazon S3, it performs better in terms of cost than other existing commercial backup solutions.
- It only requires a minimal interface from the cloud, so it can easily use any cloud storage provider.

Cons:
- However, use of a simple storage cloud potentially limits performance.

From: Shivaram V [shivaram.smtp@gmail.com] on behalf of Shivaram Venkataraman [venkata4@illinois.edu]
Sent: Tuesday, March 02, 2010 12:16 PM
To: Gupta, Indranil
Subject: CS 525 review 03/02

Shivaram Venkataraman - 2 March 2010

Smoke and Mirrors: Reflecting Files at a Geographically Remote Location without Loss of Performance

This paper presents a technique to mirror files at remote locations without appreciable loss of performance and with increased tolerance to a catastrophic failure of the primary site. Traditionally, organizations have two choices for backing up data remotely:
a. Synchronous Mirroring - Applications block until the data has been written at the remote location and an acknowledgement is received. This entails performance degradation, which may be unacceptable for enterprises requiring low-latency operations.
b. Semi-Synchronous or Asynchronous Mirroring - The application replies to the client after the data has been written to local disk and a request has been sent to replicate it to the mirror site. This gives better performance, but if there is a widespread failure of the primary site, any data that has not yet been written to the remote site is lost.
The Smoke and Mirrors File System (SMFS) proposes a new synchronization mechanism called "network-sync," which has two additional properties:
a. Forward Error Correction (FEC) - SMFS uses an FEC protocol called Maelstrom, which is opaque to the application layer and sends additional error-correcting packets for the TCP/IP packets sent to the remote location.
b. Application Callback - Whenever the error-correcting packets have been transmitted by the egress routers of the primary site, a callback is triggered and the application replies to the user that the write has been committed. This is a very important feature: it exposes network-layer details to applications and ensures lower data loss.
SMFS exposes an NFS-like API to clients and consists of a file server, a storage server, and a metadata server. The file servers interact with the storage servers through a thin log interface consisting of create, read, append, and free calls. They chose a distributed log-structured file system because it is easier to combine the data and the order of operations present at both sites.

Pros
- Efficient techniques to get an intermediate solution between synchronous and semi-synchronous mirroring. This could be used for a large subset of the data transferred.

Cons
- It is not clear which of their two design choices gives a better reliability vs. performance tradeoff.
- It would have been interesting to observe numbers for just having a callback from the egress router and no FEC codes.
- The techniques used to determine the (8,3) configuration of Maelstrom are not clear.

Interesting points
- The Local-Sync+FEC configuration is as reliable as the network-sync configuration in the Cornell NLR-Ring test.
- The log-structured file system is assumed to have ample capacity, to avoid segment cleaning costs.

Towards robust distributed systems - Keynote at PODC 2000

This keynote, delivered by Dr. Eric A. Brewer at PODC 2000, provides insight into research areas that could help in building robust distributed systems. Three major issues are identified:
a. Where is the state? - It is important to distinguish between the various locations where data could be stored, based on the availability and fault characteristics of each site. Specifically, datacenters need to be used to store persistent state, while other locations like cellphones or desktops can only be used as caches. Datacenters also suffer downtime due to disk crashes, network outages, etc., and distributed systems need to be aware of these failure patterns.
b. Consistency vs. Availability - Database research focuses on stronger consistency guarantees (ACID) while industry solutions focus on availability. This is a fundamental trade-off between ACID and BASE (Basically Available, Soft-state, Eventual consistency). The trade-off leads to the Consistency-Availability-Partition tolerance (CAP) theorem, which states that an application can choose only two of the three properties. Examples of applications that forfeit tolerance to network partitions include single-site databases and the xFS file system; these systems typically employ 2-phase commit protocols and cache-validation techniques. Distributed databases and distributed locking protocols forfeit availability to tolerate network partitions, employing pessimistic locking and exposing the unavailability of a small number of partitions to the application. DNS and web caches forfeit consistency for availability and partition tolerance; traits of such applications include leases, conflict resolution, and optimistic operations. Overall this is an under-explored research area and a symptom that the database and systems communities are separate. (A toy sketch of the consistency/availability choice follows this summary.)
c. Understanding Boundaries - The interface between two modules can be designed and implemented in different ways. The most basic interface is the function call, but in that case the two sides share the same address space. For processes with different address spaces, it is often more efficient to use a pass-by-value paradigm rather than pass-by-reference. This is one of the reasons why protocols have been more successful than APIs. There are benefits to exposing failures transparently, and boundaries need to be clearly defined for distributed systems to scale.
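The toy Python sketch below, which is not from the keynote, illustrates the consistency/availability choice that item (b) describes for a single replicated value during a network partition: one write path refuses the request to stay consistent (forfeiting availability), the other accepts it locally and reconciles later (forfeiting consistency). All names are hypothetical.

# Toy sketch (not from the keynote) of the consistency vs. availability choice
# during a network partition, for one replicated value.

class Replica:
    def __init__(self):
        self.value = 0
        self.peer_reachable = True      # False models a network partition

    def write_forfeiting_availability(self, v):
        # Stay consistent: refuse writes that cannot be replicated right now.
        if not self.peer_reachable:
            raise RuntimeError("partitioned: write rejected to preserve consistency")
        self.value = v                  # (the peer would be updated here as well)

    def write_forfeiting_consistency(self, v):
        # Stay available: accept the write locally and reconcile later, in the
        # spirit of the lease/expiration and conflict-resolution traits noted above.
        self.value = v
        return "accepted; replicas may temporarily disagree"


r = Replica()
r.peer_reachable = False                # a partition occurs
print(r.write_forfeiting_consistency(42))
try:
    r.write_forfeiting_availability(43)
except RuntimeError as err:
    print(err)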
Pros
- Points out many interesting research problems in computer science by exposing the realities faced by industry implementations
- Highlights the importance of thinking probabilistically to account for random failures
- Graceful degradation under failure and the ability to scale are problems that need to be solved

From: Rini Kaushik [rinikaushik@yahoo.com]
Sent: Tuesday, March 02, 2010 11:55 AM
To: Gupta, Indranil
Subject: 525 review 03/02

Cumulus: Filesystem Backup to the Cloud
---------------------------------------
Cumulus is a client-side file system backup solution based on a thin-cloud assumption. It aggregates small files using LFS strategies and also supports incremental backup. It requires only a minimal interface from the underlying online storage and is therefore very portable across storage providers.

Pros:
1) Given the thin-cloud assumption, Cumulus can be used with any online storage service. Hence, there is no vendor lock-in and users are free to switch providers. This is different from integrated solutions such as Symantec's Protection Network. That said, I am not really sure about this motivation. Given the large size of the backups, people may not be interested in switching vendors in the first place, as that would require moving the data around.
2) It aggregates small files into larger units called segments. This results in metadata savings and better storage efficiency. It also allows for better segment compression, in which inter-file redundancies can be exploited to yield a higher compression ratio.
3) Cumulus does not modify files in place and hence provides failure guarantees against snapshot corruption.

Cons:
1) Since Cumulus is not an integrated solution, it loses out on performance and storage efficiency: it cannot do optimizations such as single-instancing (which can reduce space consumption) or better cross-client compression. Users may not like the lower storage efficiency, as they will need to pay higher storage costs. For backups, performance is less of an issue than storage costs and reliability.
2) Cumulus cannot take care of the reliability and integrity of the backup data from the front end.
3) Cumulus cannot do cost-saving optimizations such as energy savings, which would translate to further savings for customers.
4) The restore process is not optimized; several segments need to be read in order to retrieve the desired files.
5) Privacy concerns would be of utmost importance in a solution like this, and client-side encryption alone may not be secure enough.
6) Clients need to provision space for creating and storing the backups on their local disks before transferring the data to the storage service. It would be better to have a mechanism for on-the-fly backup that does not require disk space on the client.

Smoke and Mirrors: Reflecting Files at a Geographically Remote Location Without Loss of Performance
----------------------------------------------------------------------------------------------------
SMFS aims to offer a mirroring solution called network-sync with stronger data reliability guarantees than semi-synchronous and asynchronous solutions while retaining their performance.

Pros:
They use a forward error correction code for reliability. However, the redundancy overhead is not clear from the evaluation; I would assume the overhead is proportional to the data size.

Cons:
1) It is still not clear what the advantages of SMFS over semi-synchronous mirroring are.
In both cases, if the network partitions and the local disk fails, data loss can occur. SMFS assumes that data is safe once it is written to disk. However, file systems today do not actually write to disk right away; they cache data for a substantial amount of time to get better aggregate bandwidth. So if the cache is not NVRAM, there is a strong chance of data loss if the local system loses power.
2) What is the overhead of the redundant information sent on the network? What is the RTT of the feedback mechanism?
3) Isn't the feedback mechanism assuming a lot of intelligence at the client side? The local storage system would need a redundancy model aware of the network's loss rate, burst length, etc. It would also need logic to handle feedback per FEC packet.
4) There is duplication at the file system level on the client side. SMFS introduces another file system interface in addition to the storage/file system layer native to the client. This results in unnecessary layering and duplication of storage space needs at the client.

From: Kurchi Subhra Hazra [hazra1@illinois.edu]
Sent: Tuesday, March 02, 2010 11:18 AM
To: Gupta, Indranil
Subject: 525 review 03/02

Smoke and Mirrors: Reflecting Files at a Geographically Remote Location without Loss of Performance
-----------------------------------------------------------------------------------------------------------
Summary
-------------
The paper presents the Smoke and Mirrors File System (SMFS), which mirrors files at geographically remote locations using a syncing method called network-sync. Network-sync is close to the semi-synchronous style of mirroring. After the primary site performs an operation on its filesystem, an egress router forwards the IP packets, sends some error-correction packets, and sends feedback to notify the primary site's storage system that the packets are already in transit. The client at the primary location can then continue with its job. Thus, rather than waiting for an acknowledgement from the remote site, it waits only for feedback from an underlying communication layer acknowledging that data and repair packets have been placed on the external wide-area network. In cases where more stringent guarantees are required, the primary site can also wait for an acknowledgement from the remote site. In effect, the design views the wide-area network between the remote site and the primary site as a storage system that temporarily stores data. The file system considered here is log-structured, since this facilitates maintaining consistent states and enables easy garbage collection in certain failure scenarios. The error recovery scheme is based on FEC (Forward Error Correction), and the specific protocol used here is Maelstrom. Sending error-correction packets ensures that even if the primary site fails and some of the data packets are corrupted, those packets can be reconstructed at the remote site.

Pros
------
-- The network-sync operation increases the responsiveness of the system, as the client does not have to wait for a remote acknowledgement, which involves a link latency.
-- The system does not depend on reactive TCP-based error correction, but proactively sends error-correction packets. The remote site thus does not have to get back to the primary site when packets are corrupted. This not only increases fault tolerance but also reduces network traffic when a considerable number of data packets are corrupted.
Of course, this involves a tradeoff, since redundant information is sent with every packet.

Cons
-------
-- Although it handles some corner cases of failure that semi-synchronous systems cannot, a vast majority of failures remain unhandled. The frequency of these corner cases in a real system will determine how useful network-sync is, and simulations alone cannot justify its usefulness.
-- The use of a log-structured file system inhibits its use in applications that require continuous updates and modifications to existing files.
-- Group mirroring consistency is used, which implies that files in the same file system are updated as a group in a single all-or-nothing operation. However, this means that even if a single packet is corrupted or lost while transferring data, the entire group will not be committed.

Cumulus: Filesystem Backup to the Cloud
-----------------------------------------------------------
Summary
--------------
This paper presents Cumulus, a cloud-based filesystem backup service that depends on a thin-cloud assumption. As such, Cumulus is designed around a minimal interface consisting of i) put, which stores a complete file on the server, ii) get, which retrieves a file from the server, iii) list, which gives a list of files stored on the server, and iv) delete, which removes a given file from the server. Every client maintains information about recent backups to enable optimizations such as selectively updating the backup for files the client has modified. Cumulus backs up the file system by maintaining snapshots of the storage system at various points in time. Both metadata and files are broken into objects, which are aggregated into segments and stored on the server. The stored files can be rebuilt from a snapshot in case of a failure. Cumulus is optimized for a complete restore; a partial restore is inefficient and requires fetching a number of segments. Segments from old snapshots are cleaned when no newer segments point to or depend on them. In order to facilitate such cleaning, Cumulus may have old and new snapshots refer to different copies of the same data, thus storing redundant information. A working prototype is built using Amazon's S3 (Simple Storage Service). In implementing the prototype, the authors make a number of design decisions to balance the tradeoffs between efficiency and the resulting storage costs.

Pros
------
-- The authors explore the option of building a generic service that can utilize cloud computing resources, thus heavily decoupling service from infrastructure. Clients will have greater freedom to choose one or a combination of more affordable infrastructures to meet their needs if such a service can be built efficiently.
-- The backup service supports modification of backed-up files. Although this is done by storing redundant information, supporting such a feature enables the use of the service in a range of applications.

Cons
--------
-- There is no compression or encryption built into the system. There is heavy reliance on external tools, which will evidently not adapt to the requirements of a particular system. Also, a vulnerability in the external tool results in a vulnerability of the backup system too.
-- Segment maintenance is disorganized, and new segments may point to parts of old segments that could otherwise be cleaned up.
This introduces a lot of redundancy and wastes storage, a disadvantage that is not tolerable in a storage system where clients pay on the basis of their usage.
-- Cumulus is optimized for a complete restore. However, the need for a complete restore will be less frequent than the need for a partial restore, so the system becomes inefficient in the common case.
-- The support for incremental changes to files comes from the local client, i.e., the local client decides which files have changed and need to be backed up. Although this helps make Cumulus less dependent on the underlying cloud infrastructure, such an operation should ideally be migrated to the server side, since the overhead of this computation can be considerable for a large filesystem.

Thanks,
Kurchi Subhra Hazra
Graduate Student
Department of Computer Science
University of Illinois at Urbana-Champaign

From: gildong2@gmail.com on behalf of Hyun Duk Kim [hkim277@illinois.edu]
Sent: Tuesday, March 02, 2010 11:06 AM
To: Gupta, Indranil
Subject: 525 review 03/02

525 review 03/02
Hyun Duk Kim (hkim277)

* Towards robust distributed systems, Eric A. Brewer, Keynote, ACM PODC 2000

This talk discusses problems of current distributed systems, especially their robustness. Although distributed systems are widely used, classic distributed systems are fragile. There is a trade-off between consistency and availability, and classic systems sacrifice availability for consistency. Also, because boundaries are not considered carefully, systems become more fragile. The talk discusses possible principles and solutions for these problems, drawing on the speaker's experience with a large-scale web service. Although the talk is old (2000), it gives us insight into future distributed systems, and some of its ideas are used in current cloud systems. The talk points out the problems of existing distributed systems well and tries to suggest solutions. Some of the key open problems on slide 42 are addressed in current cluster systems like Hadoop. Hadoop uses a central master for control state and supports reliability under partial failure through duplicated data storage and execution. Because many tasks execute on separate machines in the cluster, they have separate, concurrent I/O. Moreover, as predicted, many of today's clusters are more 'data'-centric than 'computation'-centric. Although the speaker said a lot about availability, he did not suggest using duplication as a solution. When one of the systems fails, we may use duplicated backup systems for better availability. As discussed above, the data and execution duplication in Hadoop is a good example of using duplication for better availability. However, as already mentioned in the talk, duplication can decrease consistency, so it should be handled carefully.

* Smoke and Mirrors: Reflecting Files at a Geographically Remote Location Without Loss of Performance, H. Weatherspoon et al, FAST 2009

This paper proposes the Smoke and Mirrors File System (SMFS), which mirrors files at geographically remote data center locations with small performance degradation. Securing data against disasters is an important issue. Existing mirroring solutions trade off performance against reliability: synchronous mirroring is very safe but extremely sensitive to link latency, while asynchronous mirroring has better performance but loses any data that has not yet been synchronized.
Even semi-synchronous mirroring, which tries to address both, is still not satisfactory. This paper proposes a new mirroring technique called network-sync. It adds redundancy to transmitted data at the network level and provides feedback once the redundancy-added data is in flight; this redundancy-and-feedback mechanism is built on Maelstrom. Maelstrom generates error-correction packets based on the Forward Error Correction (FEC) technique and uses callbacks to increase performance. According to the experimental results, SMFS achieves good data safety without much loss of performance. This paper proposes a good solution for achieving both performance and safety. Using error-recovery packets is a smart choice to increase safety without much burden. The paper also has a fairly thorough comparison in the evaluation: it compares against synchronous and asynchronous solutions, and it also evaluates the FEC sub-component under various setups. As a simple and powerful alternative for data mirroring that guarantees both safety and performance, we could use an additional mirroring site at a nearby location. Although SMFS decreases data loss, it still has some loss. A synchronous solution is safe but does not work with a far-away data center, so we might use another mirroring site at a short distance between the host and the far data center. In that case, we can guarantee safety as well as performance. Of course, it can be costly to operate another mirroring site. To address this, we can make the 'intermediate' data center small, serving as temporary space for the far data center: the intermediate data center stores data and removes it once it receives an acknowledgment from the far data center. Because we can acknowledge the host as soon as data is stored at the intermediate center, performance is not harmed. Since the intermediate data center only needs to hold data for roughly the latency to the far data center, it does not need to be very big.

--
Best Regards,
Hyun Duk Kim
Ph.D. Candidate
Computer Science
University of Illinois at Urbana-Champaign
http://gildong2.com

From: mukherj4@illinois.edu
Sent: Tuesday, March 02, 2010 10:33 AM
To: Gupta, Indranil
Subject: 525 review 03/02

Storage-2
Jayanta Mukherjee
NetID: mukherj4

Smoke and Mirrors: Reflecting Files at a Geographically Remote Location Without Loss of Performance, by H. Weatherspoon et al

The Smoke and Mirrors File System (SMFS) uses wide-area links to mirror files at geographically remote datacenter locations. In this paper the authors elaborate a new mirroring option called network-sync, which potentially ensures more data reliability than semi-synchronous and asynchronous solutions while retaining their performance. Network-sync is designed around two principles:
1. It proactively adds redundancy at the network level to transmitted data.
2. It exposes the level of in-network redundancy added for any sent data via feedback notifications.
They used the TCP Reno congestion control algorithm for communication between mirrored storage systems in all experiments. To implement network-sync redundancy feedback, the Maelstrom kernel module tracks each TCP flow and sends an acknowledgment to the sender.

Pros:
1. The SMFS proposed and described here has good group-update performance.
2. Using a log-structured file architecture is beneficial for remote mirroring, as described in the paper.
3. SMFS maintains good synchronization and supports high update throughput, masking the wide-area latency between the primary site and the mirror.
4. The proposed network-sync is a remote mirroring option in which error-correction packets are proactively transmitted and link state is exposed through a callback interface.
5. SMFS minimizes jitter when files are updated within short periods of time.
6. The failure model considered in the study is quite comprehensive.
7. The method is an enhancement in reliability over the semi-synchronous style of mirroring while offering similar performance.

Cons:
1. Network-sync is not as safe as synchronous solutions. Also, such remote mirrored backups may not be suitable for large financial organizations.
2. The wide-area network has a long round-trip-time latency.
3. If the primary site fails, the system cannot recover the lost packets, and the data sent after that will be discarded under TCP.
4. Even with large TCP buffers, the remote data stream will experience an RTT hiccup each time a loss occurs: to deliver data in order, the receiver must await the missing data before subsequent packets can be delivered. Network-sync evades this RTT issue, but does not protect the application against every possible rolling disaster scenario.

Comments: Remote-location mirroring may not be suitable for mirroring every single transaction, but a similar strategy can improve reliability for bulk operations or for keeping backups.

Cumulus: Filesystem Backup to the Cloud, by M. Vrable et al.

Cumulus is a system for efficiently implementing filesystem backups over the Internet, designed under a thin-cloud assumption: the remote datacenter storing the backups does not provide any special backup services. The authors show that Cumulus can use any storage service and aggregates data from small files for remote storage. As described in the paper, Cumulus is competitive with integrated approaches and uses LFS-inspired segment cleaning to maintain storage efficiency. It can represent incremental changes, including edits to large files. The authors assume only a very narrow interface between a client generating a backup and a server responsible for storing it. The interface consists of four operations:
Get: Given a pathname, retrieve the contents of a file from the server.
Put: Store a complete file on the server with the given pathname.
List: Get the names of files stored on the server.
Delete: Remove the given file from the server, reclaiming its space.
The authors adopt a write-once storage model.

Pros:
1. Cumulus is designed around a minimal interface (put, get, delete, list), which makes it simple to use.
2. The authors show through simulation that, with careful design, it is possible to build efficient network backup on top of a generic storage service, competitive with integrated backup solutions, despite having no specific backup support in the underlying storage service.
3. They build a working prototype of the system using Amazon's Simple Storage Service (S3) and demonstrate its effectiveness on real end-user traces.
4. The authors describe how such systems can be tuned for cost instead of for bandwidth or storage, both using the Amazon pricing model and for a range of storage-to-network cost ratios.
5. Aggregation of data produces larger files for storage at the server, which is beneficial to: avoid inefficiencies associated with many small files, avoid costs in network protocols, take advantage of inter-file redundancy with segment compression, and provide additional privacy when encryption is used.

Cons:
1. It is only suitable when the application logic is implemented on the client side.
2. Both the fileserver and user workloads exhibit similar sensitivities to cleaning thresholds and segment sizes.
3. The user workload has higher overheads relative to optimal, due to smaller average files and more churn in the file data, but overall the overhead penalties remain low.
4. The storage cost of sub-file incrementals in Cumulus is an overhead (called the size overhead).
5. When only a small portion of a file changes each day, other systems (like rdiff) are more efficient than Cumulus in representing the changes; this is regarded as the delta overhead for Cumulus.

Comments: The paper is based on the thin-cloud concept and focuses on a simple application. The experiments done to prove the concepts are also not very impressive.

From: Nathan Dautenhahn [dautenh1@illinois.edu]
Sent: Tuesday, March 02, 2010 9:41 AM
To: Gupta, Indranil
Subject: 525 Review 03/02

Paper Reviews: March 2, 2010
Nathan Dautenhahn

1 Cumulus: Filesystem Backup to the Cloud
Authors: Michael Vrable, Stefan Savage, and Geoffrey M. Voelker

1.1 Summary and Overview of Contributions
This paper introduces Cumulus, a cloud file system backup solution. The primary motivation for Cumulus is that current backup mechanisms have specific formats for data storage and protocols for performing the backups. This requires a heavy cloud layer for the user, rather than a "thin cloud" layer, a term introduced by the paper. The idea is that end users do not want to be limited by the format of the cloud service, which necessitates a thin cloud layer, or more specifically a standard data storage protocol that works across multiple cloud vendors. Cumulus is a simplified storage system that provides four storage commands: get, put, list, and delete. These operations work on a write-once storage model to provide failure guarantees. The paper introduces the concept of storage segments, which aim to avoid the inefficiency of storing many small files and the network protocol costs of small-file storage, to exploit inter-file redundancy by aggregating files, and to provide increased privacy for encrypted storage. (A small walk-through of the resulting snapshot/segment layout is sketched below.) I really like the full experimentation that they provided. They really brought out the benefits of using their cleaning system. They provided a fair assessment of the problems associated with their approach, identifying key areas for improvement and isolating the true uses of their system.

1.2 Questions, Concerns, and Comments
Comments and concerns are as follows:
• The assumption that small-file backups incur too much overhead should be verified. Much of their work went into developing the segmentation component of Cumulus, and it should not be founded on potentially shaky foundations.
• I am somewhat confused as to how using segmentation will not force large increases in the need for network bandwidth. They do compare their solution against other options, but I'm not sure whether the cost savings of larger file transfers and storage will outweigh the increased cost of transferring this information over the Internet.
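For reference, here is a small Python walk-through of the snapshot/segment layout several of the Cumulus reviews describe: a snapshot descriptor holding a timestamp and a root pointer, a metadata log mapping file paths to object pointers, and a restore that simply follows the pointers. The structures and names below are simplified stand-ins, not Cumulus's actual on-disk format.

# Simplified stand-ins for a Cumulus-style snapshot; not the real format.

descriptor = {
    "timestamp": "2010-03-02T12:00:00",
    "root": ("seg-meta-07", 0),            # (segment name, offset) of the root object
}

metadata_log = {                            # root object lists files -> data object pointers
    ("seg-meta-07", 0): {
        "/home/a.txt": [("seg-data-01", 0)],
        "/home/b.txt": [("seg-data-01", 1), ("seg-data-02", 0)],
    },
}

data_objects = {                            # objects packed into data segments
    ("seg-data-01", 0): b"contents of a",
    ("seg-data-01", 1): b"first half of b",
    ("seg-data-02", 0): b" + second half of b",
}

def restore(descriptor):
    # Start at the root object named by the descriptor and follow every pointer.
    files = {}
    root = metadata_log[descriptor["root"]]
    for path, pointers in root.items():
        files[path] = b"".join(data_objects[p] for p in pointers)
    return files

for path, data in restore(descriptor).items():
    print(path, data)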
2 Smoke and Mirrors: Reflecting Files at a Geographically Remote Location Without Loss of Performance
Authors: Hakim Weatherspoon, Lakshmi Ganesh, Tudor Marian, Mahesh Balakrishnan, and Ken Birman

2.1 Summary and Overview of Contributions
This paper introduces a new remote file system, the Smoke and Mirrors File System (SMFS), that provides complete mirroring of data from one site to another. The primary motivation for SMFS is that critical company data must be duplicated in order to avoid data loss due to natural disasters. Such backup systems must decide how to perform the backup: do I need my data saved remotely before continuing work (synchronous), or can I use some method of asynchronous backup? This is an important issue because the synchronous option is the only one that provides the stronger guarantee, but it comes with high latency costs. SMFS chooses an intermediate option between these extremes, close to the semi-synchronous style. SMFS adds redundancy at the network level in the form of Maelstrom, a Forward Error Correction (FEC) protocol. SMFS also adds a feedback loop to allow end nodes to identify when errors have occurred, and uses this feedback to stop user activity when data loss has occurred.

2.2 Questions, Concerns, and Comments
My comments are as follows:
• Is this research incremental? It appears to add very little in the way of novel techniques; albeit interesting and useful, it still suffers from the same type of problem in that a large enough disaster will still cause data loss.
• Is the problem a big enough deal? Data loss is clearly extremely important, but does the industry really need this extra delivery guarantee? It appears as though they are willing to accept shorter geographic distances as a defense.
• The Maelstrom packets are sent over UDP; does this introduce any potential errors? The authors argue earlier that it is important for packets to be received in order. Does the fact that these are just redundant repair packets make it acceptable that they are not in order?

3 Common Themes
Each of these papers focuses on issues of remote storage. They diverge greatly in their lower-level focus and motivation, which makes them not very comparable.

From: liangliang.cao@gmail.com on behalf of Liangliang Cao [cao4@illinois.edu]
Sent: Tuesday, March 02, 2010 2:26 AM
To: Gupta, Indranil
Subject: 525 review 03/02

Paper reviewed by Liangliang Cao (cao4@illinois.edu) for CS525 class on March 2, 2010

Paper 1: Smoke and Mirrors: Reflecting Files at a Geographically Remote Location Without Loss of Performance, by H. Weatherspoon et al, FAST 2009

This paper discusses a new option for mirrored file backup that potentially offers stronger guarantees on data reliability than semi-synchronous and asynchronous solutions while being more efficient than a classical synchronous system.

Pros:
• The idea of adding redundancy at the network level is novel. It improves reliability without decreasing performance.
• By requiring feedback notification at the within-network level, it becomes possible for a file system to respond to clients as soon as enough recovery data has been transmitted.

Cons:
• The proposed network-sync method cannot 100% guarantee that data loss will not happen.
• Although the proposed framework is interesting, it is mainly designed from the point of view of the system provider.
Sometimes it might be helpful to distinguish between different types of backup requirements, for example long-term vs. short-term backups, or important vs. ordinary backups.

Paper 2: Cumulus: Filesystem Backup to the Cloud, M. Vrable et al, FAST 2009

This paper discusses a backup system called Cumulus, which requires only a thin-cloud service: put, get, delete, and list. The idea of Cumulus is to aggregate data from small files for remote storage and to use LFS-inspired segment cleaning to maintain storage efficiency. The success of Cumulus suggests it is worthwhile to explore the balance between the complexity and the expense of services.

Pros:
• The cost of Cumulus is apparently lower than Jungle Disk and Brackup.
• Cumulus can efficiently represent incremental changes, including edits to large files.

Cons:
• When choosing a pricing model for evaluating the cost of Cumulus, only Amazon S3 is considered. This is understandable given the limited choices available, but one might wonder whether such pricing results generalize to all cloud computing services.
• The design of the thin-cloud interface is definitely attractive; however, the paper could be further strengthened by providing some statistics or analysis of customer requirements. I guess most users are satisfied with a thin-cloud interface, but a small number of other users would rather pay more for advanced services.

From: Ashish Vulimiri [vulimir1@illinois.edu]
Sent: Tuesday, March 02, 2010 12:56 AM
To: Gupta, Indranil
Subject: 525 review 03/02

Smoke and Mirrors: Reflecting Files at a Geographically Remote Location Without Loss of Performance, H. Weatherspoon et al, FAST 2009

This paper describes the Smoke and Mirrors filesystem, a remote backup solution designed to allow data at a primary site to be replicated to a secondary backup location over a high-bandwidth, high-latency link. The system focuses on handling problems in the network link (it ignores problems at the primary and secondary sites), using forward error correction to do this -- the standard sliding-window mechanism used in TCP to handle errors is inappropriate for high-latency links.

Comments:
* Are dedicated links necessary? It would be interesting to see what reliability guarantees are attainable using just the Internet, or perhaps a RON-like overlay solution (this would also allow using other error-control mechanisms, e.g. sending multiple copies of packets over different paths simultaneously).
* The evaluation environment is somewhat weak -- the packet loss model in particular seems fairly arbitrary.
* I might be missing something, but I'm not sure I see what the contribution of this paper is. The novel part of the system is Maelstrom -- and the authors have already documented Maelstrom in a separate (NSDI) paper.

Towards robust distributed systems, Eric A. Brewer, Keynote, ACM PODC 2000

In this talk, Prof. Brewer discusses his work with Inktomi Inc. and the reasons why they chose to roll their own systems without relying on results from classical distributed systems research.
He lists three issues:
1) The problem of state: he argues that a lot of distributed computing is data-intensive in nature, while most classical DS research focuses on the computation instead of the data.
2) Consistency vs. availability: he discusses the tradeoff between consistency, availability, and tolerance to network partitions (and suggests that a "pick any two" rule applies in this context).
3) The boundary problem: what exactly should the interface between computational entities look like?

Comments:
* The problem of state: a major plus of this talk. It correctly identifies the importance of data-intensive computing.
* Consistency vs. availability: again, a plus. The solution to this problem is still not clear. While there has been a fair amount of work at the extreme ends of the spectrum -- full-fledged RDBMSes on one end and key-value stores on the other -- the space in between is not as well explored.
* The boundary: I'm not sure why he chose to focus on the RPC mechanism. Even at the time of this talk, better programming models for distributed computation (e.g. libraries based on the Actor model) were available and known. Now, of course, even more solutions exist due to the increased significance of the problem: languages based on the Actor model (such as Scala, which Twitter uses, and Erlang), and Java RMI and .NET Remoting as more mainstream examples.

From: arod99@gmail.com on behalf of Wucherl Yoo [wyoo5@illinois.edu]
Sent: Tuesday, March 02, 2010 12:31 AM
To: Gupta, Indranil
Subject: 525 Review 3/2

Storage-2, Wucherl Yoo (wyoo5)

Towards robust distributed systems, Eric A. Brewer, Keynote, ACM PODC 2000

Summary: The CAP theorem is the most important concept in this talk, although many others are introduced. The idea is that distributed systems with shared data can, in practice, have at most two of consistency, availability, and tolerance to network partitions. Which of the three is forfeited determines the character of the distributed system. Forfeiting partition tolerance provides transactional semantics with all nodes in contact; clustered databases and xFS are examples of this decision, and they tend to use 2-phase commit and cache-validation protocols. Forfeiting availability provides consistency even with temporary partitions but requires system-wide blocking; distributed databases and majority protocols are examples of this decision, and they typically use pessimistic locking and make minority partitions unavailable. Forfeiting consistency provides more room for scalable availability; DNS and web caches are examples of this optimistic decision, and such systems typically use expiration/lease mechanisms and conflict resolution.

Pros:
1. The CAP theorem has strongly impacted distributed system design.

Cons:
1. Partition tolerance may not be as important as the other two (consistency and availability).

Cumulus: Filesystem Backup to the Cloud, M. Vrable et al, FAST 2009

Summary: Cumulus is a file system backup system that backs up to the cloud. It has a simple storage server interface with a thin server layer, providing portability without tying the user to a specific cloud provider. A write-once storage model helps simplify maintenance operations and failure guarantees. The storage server interface has four operations: get, put, list, and delete, and these backup operations work at whole-file granularity. Cumulus snapshots consist of a metadata log and file data. Metadata and data are partitioned into objects, and the objects are packed into segments.
The metadata includes file properties, a cryptographic hash for file integrity checks, and a list of pointers to the objects containing the file data. Delete operations do not reclaim disk space instantly. Instead, segment cleaning reclaims the dead space from snapshots no longer in use, like garbage collection in LFS. A segment whose utilization (the fraction of bytes within the segment that are referenced by a current snapshot) falls below a threshold is cleaned. The threshold is optimal around 0.6, since a small value increases the wasted-space overhead and a large value increases the overhead of duplicate copies over time. To reduce interaction overhead with the server, the client maintains local state: a local copy of the metadata log and a database recording recent snapshots and all segments and objects stored in them. This information is used to detect file changes relative to previous snapshots.

Pros:
1. Cost-efficient backup on cloud storage.
2. Low storage overhead through aggregation of small files and sub-file incrementals.

Cons:
1. Lack of discussion about the optimal number of snapshots for various backup workloads with different failure probabilities.
2. Overhead increases for frequently modified data.
3. Lack of discussion about consistency issues among local storage, cloud storage, and the locally cached client state. Although Cumulus treats the cloud as persistent storage, it can become inconsistent due to multiple possible failures, and local storage can become inconsistent due to disk failures. A mechanism for resolving these inconsistencies may be required.

-Wucherl

From: Virajith Jalaparti [jalapar1@illinois.edu]
Sent: Monday, March 01, 2010 11:11 PM
To: Gupta, Indranil
Subject: 525 review 03/02

Reviews for March 2nd:

1. Review of "Smoke and Mirrors: Reflecting Files at a Geographically Remote Location Without Loss of Performance":

This paper presents SMFS, a file system suitable for mirroring files at remote locations via wide-area, high-bandwidth links while ensuring minimal performance degradation at the primary location. It provides a middle path between the reliability of synchronous mirroring and the performance of an asynchronous mirroring framework. SMFS uses a generic mirroring solution called network-sync, which runs between the egress router of the primary site and the ingress router of the remote site, serving as an in-network temporary backup store. Network-sync deals with data loss proactively: it creates and sends redundant packets along with the original data packets using FEC, which ensures that most losses can be handled by the remote site without requiring the primary site to retransmit the lost packets. This is an important optimization, since these networks have high one-way delays, leading to a non-negligible loss of throughput, and such losses can lead to data loss if the primary site fails while packets are lost. Using proactive error recovery, network-sync ensures that such a scenario does not occur. SMFS builds on network-sync and uses a log-structured file system, which turns out to perform well for remote mirroring. SMFS has a double-layered error recovery mechanism, as it operates on top of TCP, which in turn uses network-sync at the end routers.
The paper further evaluates SMFS against conventional synchronous and asynchronous mirroring, showing that SMFS provides 99.9% reliability while achieving performance close to that of asynchronous mirroring.

Pros:
- The paper introduces the idea of proactive error recovery, which turns out to be quite beneficial when mirroring files at remote locations over wide-area links.
- It provides the best of both mirroring approaches: along with the performance of an asynchronous mechanism, it provides the reliability and consistency obtained by a synchronous mechanism.
- Network-sync is a generic protocol and can be used along with any other mirroring protocol, as it operates only between the end routers and is transparent to the end hosts.

Cons:
- The protocols presented in this paper are very specific to high-bandwidth wide-area links and do not take into account the effect of congestion. In particular, network-sync sends redundant data even when there are no losses on the link. A more efficient approach could preserve reliability at every network-level hop; although this might require modification of the underlying network, it could provide the same reliability as SMFS without the additional redundant data.
- The evaluation section concentrates on scenarios in which a failure occurs. It does not measure the overhead imposed by SMFS under normal operating conditions. In general, the network utilization of SMFS can be much higher, with smaller goodput, compared to existing asynchronous operation.
- It might be possible to adapt the loss recovery mechanism dynamically. The edge routers could send periodic pings/heartbeats to each other and, if losses occur, increase the redundancy in the data being sent. This would ensure that under normal operation little bandwidth is wasted on redundant data.

2. Review of "Cumulus: Filesystem Backup to the Cloud":

This paper presents Cumulus, a system for backing up files over the Internet. It provides the minimal functionality sufficient for inter-operating with various storage providers: put, get, delete, and list. Cumulus operates under the thin-cloud assumption and a write-once storage model. It aggregates small files into larger segments and uses segments as the smallest unit stored in or transferred to the backup service. This ensures efficient utilization of the remote file system, with advantages such as smaller metadata and smaller network costs, and it avoids problems such as fragmentation. Backup is done by taking regular snapshots of the local file system, in which the metadata and the file data (packed and compressed using tar and gzip, or gpg if encryption is required) are organized as a tree structure. Cumulus supports two different cleaning mechanisms for garbage collection: (a) in-place cleaning, in which segments are rewritten to keep only the required data, and (b) a mechanism in which older segments are not modified and new copies of the older data are written if a recent snapshot's segments need it; this, however, leads to space inefficiency. Cumulus also efficiently supports sub-file incrementals, in which only a small part of a file is modified: the new snapshots can continue pointing to file objects from older snapshots, which re-uses existing data and uses storage efficiently.
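A minimal sketch of the utilization-driven cleaning rule just described, under the assumption of a simple {segment: (total bytes, live bytes)} bookkeeping table; the 0.6 default mirrors the threshold mentioned in another review above, and none of this is Cumulus's actual code.

# Illustrative utilization-threshold cleaning rule; simplified, not Cumulus code.

def segment_utilization(segment_size, live_bytes):
    """Fraction of a segment's bytes still referenced by a current snapshot."""
    return live_bytes / segment_size

def segments_to_clean(segments, threshold=0.6):
    # A segment whose utilization drops below the threshold is rewritten: its
    # live objects are copied into new segments so the old segment can be
    # deleted once no snapshot references it any more.
    return [name for name, (size, live) in segments.items()
            if segment_utilization(size, live) < threshold]

segments = {
    "seg-01": (4_000_000, 3_900_000),   # mostly live data: keep as-is
    "seg-02": (4_000_000, 1_200_000),   # mostly dead data: candidate for cleaning
}
print(segments_to_clean(segments))       # -> ['seg-02']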
The paper goes on to evaluate Cumulus by comparing it with an optimal/ideal scheme for performing remote backups of files.

Pros:
- The main advantage of Cumulus is that it works under the thin-cloud assumption. It assumes only the existence of some basic operations, ensures that backups are maintained in a consistent state, and supports various specific operations like compressing data, sub-file incrementals, garbage collection, and restore.
- Compression and encryption are an integral part of Cumulus, allowing a decrease in the amount of storage required.
- The experiments show that Cumulus performs very close to the optimal with the right value of the cleaning threshold. In terms of the cost of backing up data to Amazon S3, it performs better than existing tools, especially because of its ability to compress data.

Cons/Comments:
- Cumulus takes a very conservative approach in its assumptions about the interface/operations possible between the client and the remote storage system. Although this ensures that it is generic and can be used with various storage providers, its performance/efficiency can be considerably lower than that of a client customized for a particular server.
- The optimal mechanism used in the evaluation section is not completely optimal. It assumes that the optimal scheme will transfer the entire file if any part changes, but this is not correct. An optimal scheme should be able to identify exactly which parts have changed and transmit only those modified parts.

From: ntkach2@illinois.edu
Sent: Monday, March 01, 2010 9:25 PM
To: Gupta, Indranil
Subject: 525 review 03/02

Nadia Tkach – ntkach2
CS525 – paper review
Storage II

Paper 1: Smoke and Mirrors: Reflecting Files at a Geographically Remote Location Without Loss of Performance

The authors developed an innovative approach to enterprise-wide data replication. Unlike many other technologies that provide replication at the cost of system latency and application performance, the network-sync mirroring option is shown to be just as effective as semi-synchronous and asynchronous solutions without additional performance degradation. It is built on two principles: proactive redundancy at the network level, with latency and jitter independent of the length of the link, and feedback notifications to monitor the safety level reached during the replication process.

Pros:
• Improved performance compared to synchronous replication, and improved reliability compared to a semi-synchronous solution with roughly similar performance.

Cons:
• The safety guarantees of network-sync are not as good as those of a synchronous solution.
• The sender blocks its activity until it receives notification that a certain level of in-flight redundancy has been reached.
• It does not account for all possible failure scenarios.

Paper 2: Cumulus: Filesystem Backup to the Cloud

The authors propose a new filesystem backup implementation designed on a "thin cloud" assumption: the backup servers do not provide any application replication technologies or operations other than put, get, delete, and list. The innovation of this solution lies in its portability, simplicity, and efficiency. The implementation supports only write-once operations with no in-place modification, but permits deletion and re-use of storage space. The algorithm behind it is based on packing smaller files into segments of data objects and creating snapshots of the segments. Each snapshot descriptor contains a timestamp and a pointer to each segment it includes.
This is very useful for future garbage collection: a pointer to a specific segment can be deleted and the storage cleaned up later on.
Pros:
• This filesystem backup can be implemented on almost any online storage service
• Efficient due to packing smaller files into segments used as data blocks or units of storage
• Simplified garbage collection via the use of snapshot descriptors and pointers
Cons:
• While segments in a snapshot can be deleted independently, this may leave unclaimed storage space until garbage collection is performed and the space is overwritten with new information
• This solution provides filesystem backup of data and acts only as a storage system; it does not appear to support fail-over to the backup site

From: Fatemeh Saremi [samaneh.saremi@gmail.com] Sent: Monday, March 01, 2010 5:22 PM To: Gupta, Indranil Subject: 525 review 03/02

Paper 1: Smoke and Mirrors
The Smoke and Mirrors File System (SMFS) reflects files at a geographically remote location and explores a new mirroring option called network-sync. Network-sync provides stronger reliability than semi-synchronous and fully asynchronous mirroring solutions while retaining their performance. SMFS works as follows: the primary-site storage system simultaneously applies a request locally and forwards it towards the remote mirror. The network-sync layer routes the request and sends additional packets assisting error correction if needed; it then sends an acknowledgment to the primary storage system (which is local to it). At this point the primary storage system can move on to its next operation. When the remote-site storage system receives the request, it applies it to local storage and sends a storage-level acknowledgment to the primary site. When the primary storage system receives the response, it knows that the request has been mirrored successfully and can therefore invoke the garbage collector. The network-sync remote mirroring option has been evaluated on the authors' SMFS prototype and the Cornell NLR Rings testbed. SMFS proactively adds redundancy to transmitted data at the network level (network-sync). This results in better reliability, with latency and jitter independent of the link length. The latter, in turn, makes long-distance mirroring much more suitable for recovering from natural disasters. The appealing feature of SMFS is that it has stronger reliability than semi-synchronous mirroring options while not increasing the delay. Due to its log-structured file system, log operations can be pipelined, which increases system throughput. Combining data and the order of operations into one structure makes it natural to maintain identical structures at remote locations. However, SMFS does not eliminate the possibility of creating orphaned processes. Given the volume of data and requests the relevant applications deal with, I wonder whether a distributed log-structured file system is really the optimal choice: though it is appealing for its write performance, it is unclear how high the cleaning costs would be in these applications.

Paper 2: Cumulus
Cumulus is a cloud-based backup system designed around a minimal-interface “thin cloud”: only minimal functionality should be required from the cloud so that different backends can be implemented. The authors assume that any application logic is implemented solely by the client.
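As a rough illustration of how little the thin-cloud model demands from the storage provider, the client-side interface might look like the sketch below (a hypothetical Python outline; only the four operations named above come from the reviews, everything else is an assumption):

# Hypothetical sketch of a thin-cloud storage interface: whole-object
# put/get/delete/list only; all backup logic stays in the client.

from abc import ABC, abstractmethod

class ThinCloudStore(ABC):
    @abstractmethod
    def put(self, name, data):
        """Store a write-once object under 'name' (no in-place modify)."""

    @abstractmethod
    def get(self, name):
        """Return the full contents of the named object."""

    @abstractmethod
    def delete(self, name):
        """Remove the named object (used by client-side garbage collection)."""

    @abstractmethod
    def list(self):
        """Return the names of all stored objects."""

# "Modifying" an object is expressed by the client as delete + put of a new one.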
Through a simulation study they show that it is possible to build efficient network backup on top of a generic storage service. They demonstrate the efficiency of their proposed backup utility via a working prototype on Amazon's S3 and explain how the system can be tuned for cost instead of for bandwidth or storage. The authors show that, compared to thick clouds designed for a specific application, thin-cloud solutions can be a cost-effective alternative. The appealing feature of Cumulus is its interoperability: it is built on top of a simple interface and can easily be adapted to different storage systems. This gives the end user the freedom to switch between storage providers. Cumulus does not provide the ability to add content from multiple nodes at the same time. In Cumulus, only the version of a file at a given snapshot date can be accessed; it would be more appropriate to provide the capability of accessing every version of a file. The thin cloud is attractive; however, the question is whether it is only the application/client that should care about all security concerns, and how effective that would be.

From: gh.hosseinabadi@gmail.com on behalf of Ghazale Hosseinabadi [ghossei2@illinois.edu] Sent: Sunday, February 28, 2010 4:44 PM To: Gupta, Indranil Subject: 525 review 03/02

Smoke and Mirrors: Reflecting Files at a Geographically Remote Location Without Loss of Performance
In this paper, the Smoke and Mirrors File System (SMFS) is introduced. The main idea of this work is to mirror files at geographically remote locations. The mirroring is done in such a way that the effect on the primary file system's performance is negligible and efficiency degradation as a function of link delay is minimal. The main reason that necessitates mirroring a file system is the data loss that results from a site failure. In this work the main concern is that a site failure might happen after a natural disaster, so mirrors should be geographically far from the primary location. Saving data at remote locations introduces latency into the mirroring procedure, so there is a trade-off between safety and delay. The main contribution of this paper is the design of a new mirroring method called network-sync, in which error-correction packets are proactively transmitted and link state is determined by feedback. Data is transferred between remote locations via high-quality optical links. Network-sync uses a forward error correction (FEC) protocol to increase reliability. In FEC, for every r data packets, c error-correction packets are introduced into the stream, and FEC performance is independent of link length. The FEC protocol used in this work is called Maelstrom. When a host located at the primary site submits a write request to a local storage system, the following happens: the primary-site storage system simultaneously applies the request locally and forwards it to the remote mirror. The network-sync layer routes the request and sends additional error-correcting packets, and then sends an acknowledgment to the local storage system. When the remote mirror storage system receives the mirrored request, it applies the request to its local storage image, generates a storage-level acknowledgment, and sends a response. When the primary storage system receives the response, it knows with certainty that the request has been mirrored.
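To illustrate the "c error-correction packets per r data packets" idea in the simplest possible form (this is a toy XOR parity scheme, not Maelstrom's actual layered code), a sketch in Python:

# Toy FEC sketch: for every r data packets, send one XOR parity packet
# (i.e., c = 1). Any single lost packet in the group can be rebuilt from
# the r packets that did arrive. Maelstrom's real code is more elaborate.

def xor_packets(packets):
    size = max(len(p) for p in packets)
    parity = bytearray(size)
    for p in packets:
        for i, b in enumerate(p):
            parity[i] ^= b
    return bytes(parity)

def encode_group(data_packets):
    # Append the parity packet to the group before sending it over the link.
    return list(data_packets) + [xor_packets(data_packets)]

# Example: r = 4 data packets, one of which is lost in transit.
group = encode_group([b"pkt0", b"pkt1", b"pkt2", b"pkt3"])
arrived = group[:2] + group[3:]          # packet at index 2 never arrives
rebuilt = xor_packets(arrived)           # XOR of survivors recovers the loss
assert rebuilt == b"pkt2"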
Another important property of SMFS is mirroring consistency, which means that it preserves the order of operations in the structure of the file system itself, using a distributed log-structured file system. SMFS maintains group mirroring consistency, in which files in the same file system are updated as a group in a single operation, and the group of updates is reflected at the remote mirror site atomically. The performance of SMFS is evaluated through simulations by measuring system throughput and application throughput; the results show that the performance of SMFS is independent of link latency.
Pros: In this paper a new mirroring method called network-sync mirroring is introduced. The introduced method is more efficient than the well-known asynchronous mirroring method. Network-sync is implemented in gateway routers (which are controlled by site operators), and as a result there is no need to change wide-area Internet routers. The use of FEC puts an upper bound on the delivery delay.
Cons: It is not clear why they have chosen FEC as the error-correction method. Why are other error-correction methods, such as convolutional codes, not used? It is well known that convolutional codes are more efficient. Although remote sites might be far from each other, there might be correlation among them; for example, they might share network links. This correlation might affect the feedback they send, but correlation among remote sites is not considered in the design of SMFS.

Towards Robust Distributed Systems
In this work, a robust distributed system, Inktomi, is presented. Inktomi is designed based on scalable cluster and parallel computing technology; it does not use classic distributed-systems ideas. As discussed in this work, the main problems in the design of classic distributed systems are as follows: they focus on computation, not data; they ignore location distinctions; consistency and availability are not of much concern in their design; and finally, they do not consider issues related to boundaries (the interface between two modules). Inktomi can be used in global search engines as well as distributed web caches. One important design parameter in this work is that high availability should be provided. More precisely, service availability should be guaranteed in the face of crashes and disk failures, database upgrades, software upgrades, OS upgrades, power outages, network outages, and physical moves of equipment. According to the CAP theorem, any shared-data system can have at most two of these three properties: consistency, availability, and tolerance to network partitions; examples of systems that provide each pair of these properties can be found among existing systems. Another definition given in this work explains the DQ principle. This principle says that Data per query times Queries per second is roughly constant for a given node, application, and OS. In related terms, if we define yield as the fraction of answered queries and harvest as the fraction of the complete result, yield times harvest tends to be a constant. A fault reduces the capacity (Q), the completeness (D), or both. When a fault happens, harvest and/or yield should be decreased smoothly, in proportion to the faults, because DQ decreases linearly in the case of a fault. Inktomi spreads data and queries randomly to maximize symmetry and make faults independent. It accepts the fact that a 100% working system or 100% fault tolerance is not realistic.
Instead, we need to think probabilistically about everything.
Pros: This work explains the main problems in the design of classic distributed systems and tries to solve these issues mostly by bringing some notion of randomness into the design. Graceful degradation is another interesting idea presented in this work. The design principles explained in this work rest on more realistic assumptions about actual distributed systems.
Cons: We know that DQ drops linearly at best; it might decrease more rapidly on average (let us call this function g). So decreasing harvest/yield in proportion to faults might not be optimal. I think decreasing according to g might result in higher efficiency, because in this case we consider the average failure pattern instead of the best-case pattern. No performance evaluation is presented.
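To make the DQ and harvest/yield arithmetic discussed in this review concrete, here is a small numeric sketch; the numbers are invented purely for illustration:

# Made-up numbers illustrating the DQ principle and the harvest/yield tradeoff.
# DQ = (data per query) * (queries per second) is roughly constant per node.
nodes = 4
dq_per_node = 1000.0
total_dq = nodes * dq_per_node           # 4000 before any fault

# One node fails: total DQ drops linearly to 3000 (a 25% loss).
total_dq_after = (nodes - 1) * dq_per_node
remaining = total_dq_after / total_dq    # 0.75

# The system can spend the loss on yield (drop 25% of queries but answer
# the rest completely) or on harvest (answer every query with 75% of the
# data), or any mix; yield * harvest stays bounded by the remaining DQ.
option_drop_queries = (remaining, 1.0)   # (yield, harvest) = (0.75, 1.00)
option_partial_data = (1.0, remaining)   # (yield, harvest) = (1.00, 0.75)
print(option_drop_queries, option_partial_data)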