From: vivek112@gmail.com on behalf of Vivek [vivek@illinois.edu] Sent: Tuesday, February 23, 2010 12:35 PM To: Gupta, Indranil Subject: 525 review 02/23
RAID: Core Idea: The famous paper from 1987 introduces the novel idea of data redundancy for hard disks, in contrast to SLED, where only a single disk is used. It addresses the growing divide between processor and memory performance on one hand and hard disk performance on the other. It introduces one of the seminal ideas for increasing the data parallelism, performance, and reliability of hard disks, known as RAID (redundant array of inexpensive disks). Five different levels of RAID are discussed: mirrored disks, Hamming code for ECC, single check disk per group, independent reads/writes, and no single check disk.
Pros: - the theory behind RAID has been widely accepted and cited in many papers. Using this design in systems is difficult to argue against because of its wide acceptance. - The five different solutions provide the basis of many solutions used for architectures supporting data-intensive applications. Moreover, the assessment of performance and reliability is done through accepted metrics such as MTTF.
Cons: - Would this scale to 1000s of disks? At the time, this may not have been a consideration, but we are seeing today that scalability of systems is a very important issue. - Parity checks can reduce performance. Can one come to a middle ground between fault tolerance and performance? Perhaps RAID addresses this in theory, but how could this be achieved in practice?
FAWN: Core Idea: FAWN describes an energy-efficient cluster architecture for data-intensive computing, in which the speed of the processors is limited (making them perhaps "wimpy") so as to reduce power consumption. The technique it uses for looking up data resembles that of Chord, and the storage nodes use consistent hashing. The performance of the architecture is carefully evaluated and compared to similar systems such as BerkeleyDB.
Pros: - discusses the practical use of the system - considers both high-performance and low-cost operation - addresses I/O-bound workloads, which are a primary challenge in today's data-intensive applications (particularly web applications such as Facebook)
Cons: - practical applications: the examination of real-world applications did not seem strong. This may indicate that the idea has not matured and gained widespread acceptance. - the justification or demonstration of the practicality of installation in a real-world industrial environment seems to be lacking. How would it fare for, say, a company whose business is finance/trading (e.g. Goldman Sachs)?
From: Ghazale Hosseinabadi [gh.hosseinabadi@gmail.com] Sent: Tuesday, February 23, 2010 12:21 PM To: Gupta, Indranil Subject: 525 review 02/23
Paper 1: FAWN: A Fast Array of Wimpy Nodes In this paper, a cluster architecture is designed such that data computing is run with low power. The introduced method is called FAWN, and the corresponding system built on a FAWN cluster is called FAWN-KV. A lot of factors are considered in the design of FAWN. Flash is a storage medium with specific properties: 1) fast random reads, 2) efficient I/O, and 3) slow random writes. The FAWN data store (FAWN-DS) is a log-structured key-value store designed to perform well on flash. Mapping a key to a value, reconstruction, and semi-random writes are each designed in a specific way to achieve high performance.
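As an aside, the append-only-log-plus-in-memory-index idea described above can be pictured with a toy sketch (my own simplification, not the authors' implementation; all names are hypothetical):

```python
# Toy sketch of a FAWN-DS-style store: an append-only data log plus a small
# in-memory index keyed by the low-order bits of each key's hash.
# Hypothetical simplification, not the authors' implementation.
import hashlib

class TinyLogStore:
    def __init__(self, index_bits=16):
        self.log = []                      # stands in for the on-flash data log
        self.index = {}                    # key fragment -> list of log offsets
        self.mask = (1 << index_bits) - 1  # keep only the low-order bits

    def _fragment(self, key: bytes) -> int:
        return int.from_bytes(hashlib.sha1(key).digest(), "big") & self.mask

    def store(self, key: bytes, value: bytes) -> None:
        # Writes are sequential appends, which is what flash handles well.
        self.index.setdefault(self._fragment(key), []).append(len(self.log))
        self.log.append((key, value, False))           # (key, value, deleted?)

    def lookup(self, key: bytes):
        # A fragment can collide, so verify the full key against the log,
        # newest entry first ("multiple reads may be needed").
        for offset in reversed(self.index.get(self._fragment(key), [])):
            k, v, deleted = self.log[offset]
            if k == key:
                return None if deleted else v
        return None

    def delete(self, key: bytes) -> None:
        # Deletes are appends too; space is reclaimed later by compaction.
        self.index.setdefault(self._fragment(key), []).append(len(self.log))
        self.log.append((key, b"", True))
```

The store/lookup/delete operations in the sketch correspond to the operations the review turns to next; in the real system the log lives on flash and the index in DRAM.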
The functionality of operations like store, lookup, delete, split, merge and compact in FAWN is also explained in the paper. FAWN uses consistent hashing for mapping key ranges to nodes. Wimpy hot-spots are prevented by a two-level cache hierarchy. A configurable replication factor for fault tolerance is also considered. Joins and leaves of nodes and failure detection are also taken care of in the design of FAWN.
Pros: The main contribution of this paper is achieving fast and energy-efficient processing of read-intensive applications by having low-power embedded nodes with flash storage. Compared with previous related work, their design achieves multiple efficiency objectives simultaneously.
Cons: No analytical computation of FAWN's performance metrics is presented. A theoretical comparison with previous work could show the performance of FAWN more clearly.
Paper 2: A Case for Redundant Arrays of Inexpensive Disks (RAID) The processor and memory speeds of computers are increasing exponentially. The reduction in the size of disks built for personal computers makes the construction of disk arrays possible. In this paper the effect of an array of inexpensive disks is analyzed. The problem of disk unreliability is solved by redundant arrays of inexpensive disks. Their method breaks the arrays into reliability groups, with each group having extra check disks containing redundant information. The mean time to failure (MTTF) is computed under specific assumptions. The main RAID levels are as follows: mirrored disks, Hamming code for ECC, single check disk per group, and independent reads/writes.
Pros: This paper presents an efficient method for constructing redundant arrays of inexpensive disks in order to achieve good performance metrics.
Cons: No comparison with other (similar) approaches is presented.
From: pooja.agarwal.mit@gmail.com on behalf of pooja agarwal [pagarwl@illinois.edu] Sent: Tuesday, February 23, 2010 12:10 PM To: Indranil Gupta Subject: 525 review 02/23
DS REVIEW 02/23 By: Pooja Agarwal Paper – A Case for Redundant Arrays of Inexpensive Disks (RAID) Authors - D Patterson, G Gibson, R Katz Conference – SIGMOD, 1988
Main Idea: This paper presents RAID and five levels of RAID. The idea of five levels of RAID has been widely accepted in the database literature and marks the foundation of the RAID architecture for secondary storage devices. To increase disk I/O performance and reduce cost, the paper proposes to use arrays of many inexpensive disks, called RAID (redundant array of inexpensive disks), instead of one SLED (single large expensive disk). However, since the reliability of an array is inversely proportional to the number of disks used, an increased number of disks leads to a decrease in the reliability of the system. To offset this problem, the paper proposes to use extra disks to maintain redundant data to recover the original information if disk failures occur. A mathematical formula for the reliability of a RAID system is derived based on the number of disks and check disks used per group. They also propose five different levels of RAID, which are analyzed for performance on six metrics – read, write, and read-modify-write for both small and large data. In level 1, each data disk is replicated and every write to a disk has to be done on the check disk too, leading to an overhead cost of 100% and usable storage of 50%. In level 2, the data is bit-interleaved across different disks and enough check disks are added to correct single-bit errors.
Level 2 is not appropriate for small reads and transaction processing as it requires accessing several disks. Level 3 uses a single check disk per group, reducing the overhead to 4%-10%; however, its performance is the same as level 2. In level 4, the performance of small transfers is increased by interleaving data between disks at the sector level, allowing more than one read per group at a time. Transaction performance still did not improve much in level 4, hence level 5 was proposed, in which all disks are used as both data disks and check disks. Level 5 gives the freedom to do small read-modify-writes in parallel as well as large transfers at per-disk speeds.
Pros: 1) RAID needs less power and costs less compared to a SLED, and provides greater performance. 2) RAID can be used in several datacenters and clouds to provide redundancy for error correction and a reduction in MTTR, providing reliable services to users. 3) This paper marks the seminal work in the RAID architecture, which is widely accepted and used(?) in large infrastructures. 4) Users can use different levels depending upon the type of common operations needed.
Cons: 1) RAID requires regular consistency checks of the data disks. 2) If the redundant disks in RAID are spread geographically, the data transfer will be costly over the network. 3) RAID provides recovery against only single-bit errors or single disk failures. 4) With the current convergence of costs of all types of disks, employing more disks for RAID is equivalently expensive. 5) Increasing the number of disks used to store data requires increased effort to secure it.
Paper: Needle in a Haystack: Efficient Storage of Billions of Photos Speaker: Jason Sobel, Manager of the Facebook Infrastructure Group
Main Idea: Facebook is one of the most popular social networking sites and the biggest photo sharing website, with over 60 billion images and 25TB of new data uploaded per week. Due to the requirement of high availability of photos, Facebook uses four levels of caching: CDNs, Cachr, photo servers, and FHC (file handle cache) servers. Each image is stored in a separate file in an NFS file system; the metadata associated with this storage is enormous and hence a large amount of time is spent retrieving metadata. It is hard to even fit the metadata in caches, leading to 3 disk I/Os on average per read. This significantly reduces the performance of reads and writes. Two interim solutions were used: 1) using Cachr to cache profile pictures, 2) an NFS file handle cache to cache file handles. However, these optimizations only add extra tiers of caches and do not solve the actual problem of large metadata. To overcome these problems, the Facebook engineering team proposed Haystack, a new object store that keeps the images from each user in a single file called the haystack store file. Another file, called the index file, stores metadata about the store file, containing only minimal entries. A haystack object consists of a superblock and an array of needles. Each needle stores the metadata about an image and the actual image. When a user performs a read, an HTTP request containing the metadata is sent to the photo server (which caches the index files), which translates it into the appropriate object, loads the requested image from the object using XFS, and sends the reply to the user. During a write/modify, the photo server creates four images of different sizes, assigns unique keys to each, and appends them to the haystack object. The image with the largest offset is chosen if more than one image with the same key occurs in an object.
During deletion, only the offset is set to zero and no data is actually removed. In compaction, a new object is created and only images that are not deleted (or only the newest modifications) are copied. Hence, Haystack proposes a new mechanism for efficiently reducing the metadata associated with each image.
Pros: 1) The system reduces the metadata associated with each image, decreasing the storage and memory requirements per image. 2) Efficiently clusters the user's data into a single object, leading to faster retrieval of related data belonging to each user.
Cons: 1) It is mentioned that images are not deleted, which can lead to breaches of security and privacy. 2) The success of Facebook still relies on CDNs like Akamai and Limelight, which provide the first tier of caching based on the geographic locations of users. 3) Assuming that these objects are built on a per-user basis, the objects will be of varying sizes. It's not clear how these heterogeneous objects are prioritized for caching when cache space is limited. For example, different haystack sizes and visit frequencies will need some level of cache management.
From: Sun Yu [sunyu9910@gmail.com] Sent: Tuesday, February 23, 2010 11:58 AM To: Gupta, Indranil Subject: 525 review 02/23 Sun Yu
1. Needle in a haystack: efficient storage of billions of photos. In this note the author talks about the photo storage challenge that Facebook faced, and how they implemented an infrastructure called Haystack to provide improved performance over the old photo infrastructure. In the old photo storage system, the main reason for performance degradation is that a large amount of metadata is generated in the storage tier, which leads to multiple I/O operations per upload/read request. Haystack is a log-structured object store containing a store file stuffed with "needles" for the stored photos, and an index file for quickly locating needles. When the system starts, the index file is read into an in-memory index for photo lookup. The implementation of operations such as read/write/modify/delete is discussed. This Haystack storage infrastructure keeps the metadata overhead small, allowing retrieval of images with a minimal number of I/O operations. In this brief note, implementation details such as how to deal with server failures are not discussed. Also, they choose to keep 4 scaled versions of each image, which seems to be a limitation since users may want more flexibility. From the description of the haystack structure it seems we could replace photos with any object and the arguments would still apply. Considering that their main goal is an image (mostly of faces) storage system, some processing could possibly further reduce metadata? But it depends on how cheap the computational power is.
2. A case for redundant arrays of inexpensive disks (RAID) The increasing performance of memories and CPUs is not matched by the slow progress in making hard disks faster. This makes I/O a bottleneck of overall system performance: according to Amdahl's Law, the benefit we get from increased CPU power and memory speed is largely limited by slow I/O. Large arrays of inexpensive disks are proposed in place of expensive large disks; it is shown that with lower power consumption and cost, I/O bandwidth is greatly improved. The reliability issue of disk arrays is brought to the attention of researchers, and RAID is proposed in this paper.
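Aside: the reliability argument sketched above rests on a group MTTF estimate of roughly the following form (reconstructed from memory of the paper, assuming independent exponential failures and a repair window MTTR much shorter than a disk's MTTF):

```latex
% G data disks + C check disks per group, n_G groups, MTTR << MTTF_disk:
MTTF_{group} \approx \frac{MTTF_{disk}}{G+C} \cdot \frac{MTTF_{disk}}{(G+C-1)\, MTTR}
            = \frac{(MTTF_{disk})^{2}}{(G+C)(G+C-1)\, MTTR},
\qquad
MTTF_{RAID} \approx \frac{MTTF_{group}}{n_G}
```

Intuitively, a group is lost only if a second disk in the same group fails before the first failed disk has been repaired, which is why the repair window MTTR appears in the denominator.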
Five different organizations of disk arrays are described; the basic idea is to introduce some redundancy in exchange for reliability. They can be implemented either in software or in hardware, although the implementation may be very complex. The RAID 5 structure seems very distributed: both the data and the check information are spread over all disks. But how is it affected by disk failures?
From: Kurchi Subhra Hazra [hazra1@illinois.edu] Sent: Tuesday, February 23, 2010 11:23 AM To: Gupta, Indranil Subject: 525 review 02/23
A Case for Redundant Arrays of Inexpensive Disks ---------------------------------------------------------------- Summary -------------- This paper presents Redundant Arrays of Inexpensive Disks (RAID), an I/O system built as an array of inexpensive disks, either interleaved for large transfers as used by supercomputers or independent for small transactions as in transaction processing. The motivation for the proposal comes from the fact that, in contrast to CPU and primary memory technologies, there has not been considerable performance improvement in Single Large Expensive Disks (SLEDs). On the other hand, inexpensive magnetic disks meant for personal computers have lower cost, lower performance and lower capacity, but an array of such disks can be used in place of a SLED, leading to an improvement in most I/O performance metrics. However, the authors point out that without fault tolerance, large arrays of inexpensive disks are too unreliable to be useful. Hence, they propose five levels of RAID as a solution to this - the arrays are broken into reliability groups, with each group having extra check disks containing redundant information to achieve fault tolerance. After replacement of a failed disk, the redundant information on the check disks can be used to reconstruct the lost information on the new disk. RAID 1 simply proposes mirroring information on the check disks. RAID 2 bit-interleaves data across the disks of a group and then adds enough check disks to detect and correct a single error. RAID 3 proposes to use a single redundant parity disk per group. Information on the failed disk can be reconstructed by calculating the parity of the remaining good disks and then comparing it bit-by-bit to the originally calculated parity. If these parities agree, the lost information was 0, else it was 1. However, this involves reading and writing all disks in a group on every disk read/write access. RAID 4 interleaves data between disks at the sector level rather than at the bit level as in level 3 RAID. The new parity information can be calculated simply by reading the disk hosting the information and the check disk. Finally, RAID 5 distributes data and check information across all disks - including the check disks. This leads to support for multiple individual writes per group. Throughout the discussion, the authors compare the performance of the several levels on supercomputer applications and transaction processing. Level 5 RAID appears ideal for supercomputer applications, for transaction processing with limited storage capacity, or for performing both of these.
Pros --------- -- The idea presented here is seminal in the field of storage systems. Although the paper was published in 1988, storage systems today use similar technology. This is proof enough of the usefulness of the idea. -- Different levels of RAID are presented, each having unique advantages and disadvantages, giving system designers a variety of choices. -- The levels can be simulated both in hardware and in software.
This gives the flexibility of using off-the-shelf hard disks to simulate the RAID behaviour.
Cons --------- -- The cost estimates presented here are not valid in today's context, since the cost of hardware has dropped drastically. -- The kinds of applications run today are not limited to transaction processing or supercomputer applications. Specifically, the authors take two extreme cases to measure performance, one that needs a high I/O rate and one that needs a high data rate. Applications that need a good mix of both may show different performance characteristics on these disk arrangements. -- The methods described only handle one faulty disk. Even if failures are independent, in today's data storage infrastructures it is not highly unlikely that two or more disks fail at the same time. -- RAID 3 has become largely obsolete and is rarely used now.
Facebook's Needles in a Haystack ------------------------------------------ Summary ------------ This article presents the Haystack Photo Infrastructure used by Facebook. The old NFS photo infrastructure generated enough metadata for the millions of photos uploaded to and read back from Facebook to exceed the caching abilities of the NFS storage tier. A read request then resulted in multiple I/O operations and a corresponding slowdown. The Haystack Photo Infrastructure, on the other hand, tries to minimize this metadata. The filesystem used is XFS, which mitigates fragmentation by efficient file preallocation. The Haystack Object Store is log-structured and append-only, containing the actual Haystack Object store file and index files. The Object store file further consists of needles, with each needle representing a stored object. The index file stores the minimal amount of metadata required to quickly locate an object's needle in the Haystack store file. Overwrite operations result in duplicate needles and a delete operation results in an invalidated needle. The space occupied by these is reclaimed by compaction. The Photo Store Server, which accepts HTTP requests and translates them into operations on the underlying Haystack store, keeps an in-memory index of all photo offsets in the Haystack store file. The index fits into main memory since only the limited metadata required to locate the images is stored. The in-memory index enables retrieval of images with a minimal number of I/O requests and speeds up the entire system.
Pros ---------- -- The whole design is simple and efficient, and makes use of the fact that any main-memory operation is faster than the corresponding I/O operation. -- Facebook has millions of photos stored and retrieved every day with real-time performance. This is proof enough of the robustness and great performance of the system. -- Failure recovery and resilience have been well taken care of.
Cons ---------- -- The idea is not novel as such, but known facts have been efficiently used to build a simple but high-performing system. -- The system makes use of an abundant amount of main memory. However, this increases the cost of the system, which may not be affordable for smaller organisations. Besides, with Facebook's current explosive rate of growth, this may soon become a bottleneck too.
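Aside: the single-parity reconstruction described in the RAID summary above (levels 3 to 5) reduces to XOR; a toy illustration (my own, not from the paper):

```python
# Toy single-parity (RAID 3/4/5 style) reconstruction: the parity block is
# the XOR of the data blocks, so any one lost block is the XOR of the
# surviving blocks plus the parity block.
from functools import reduce

def parity(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"AAAA", b"BBBB", b"CCCC"]   # blocks on three data disks
check = parity(data)                 # block on the check disk

# Disk 1 fails; rebuild its block from the survivors plus the parity block.
rebuilt = parity([data[0], data[2], check])
assert rebuilt == data[1]
```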
Thanks, Kurchi Subhra Hazra Graduate Student Department of Computer Science University of Illinois at Urbana-Champaign
From: arod99@gmail.com on behalf of Wucherl Yoo [wyoo5@illinois.edu] Sent: Tuesday, February 23, 2010 9:31 AM To: Gupta, Indranil Subject: 525 Review 02/23
Storage – 1 Review, Wucherl Yoo (wyoo5) FAWN: A Fast Array of Wimpy Nodes, D. G. Andersen et al, SOSP 2009
Summary: FAWN is designed to provide fast and energy-efficient processing of random read-intensive workloads. It is also cost-effective since the hardware is based on low-power embedded processors and flash storage. The authors built FAWN-KV, a key-value store running on a cluster of FAWN nodes. FAWN-KV is a two-tier design: front-ends forward client requests to back-ends that hold log-structured data storage. Each front-end node manages DHT-like membership of back-ends by assigning them virtual IDs (VIDs). A single management node assigns a contiguous key space to each front-end node, as in a Chord ring with consistent hashing. Each back-end node is responsible for storing a range of key-value pairs in its log-structured flash storage. Read performance is optimized with an in-memory hashtable holding the offset in the data log corresponding to each key. Further optimization is done by the query cache at the front-ends and the buffer cache in the file system layer of the back-ends. The key space of a back-end node is determined from the ring space of the front-end node through its VIDs and keys. Multiple back-end nodes replicate data with a chain replication mechanism. Back-end nodes can join or leave the key ring space; a key space is split into two when a node joins or merged into one when a node fails. This key-space transition may increase response latency.
Pros: 1. An energy-efficient and cost-effective solution for small-size read-intensive workloads. Such workloads can be suitable for Twitter messages or thumbnail images on Facebook 2. The seamless merging of the DHT key space design with log-structured storage on flash memory
Cons: 1. Optimized only for constrained workloads; high latency for random writes and large reads/writes, and frequent delete operations may not be tolerable due to increased split/merge/compaction operations 2. Weak discussion of failures; assumes no communication failures 3. Frequent joins/leaves may not be tolerable due to the significant overhead of the pre-copy phase for key space merging and splitting 4. Delete-entry and compaction overhead; since flash memory has high latency for (random) write operations, write operations including deletes need to be grouped. 5. Doubtful about deployment with SSDs; an SSD with parallel flash memories has a separate data translation layer that is not controllable
Facebook's Needles in a Haystack, P. Vajgel et al Summary: Haystack is log-structured storage that stores hundreds or thousands of images in a single file to reduce metadata overhead. A relational DB or a usual file system may incur significant metadata overhead (3 I/Os per image) for their workloads. Each image is stored as a needle with an index and key. The index is a size-reduced form of the metadata that can locate the needle. Lookup is optimized with an in-memory data structure over the indices, with each image kept in 4 scaled sizes.
Pros: 1. Reasonable customization of the file system for lots of image files; a good application of a log-structured file system
Cons: 1. Constrained to workloads with frequent reads/writes and infrequent deletes.
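Aside: a toy sketch of the consistent-hashing-plus-chain-replication placement described in the FAWN summary above (my simplification; the real FAWN-KV assigns multiple VIDs per physical node and routes requests through front-ends):

```python
# Toy consistent-hashing ring with chain replication: a key is owned by the
# first virtual ID clockwise from its hash, and replicated on the next R-1
# distinct nodes (a simplification of the FAWN-KV scheme).
import bisect, hashlib

def h(x: str) -> int:
    return int.from_bytes(hashlib.sha1(x.encode()).digest(), "big")

class Ring:
    def __init__(self, nodes, vids_per_node=4, r=3):
        self.r = r
        self.ring = sorted((h(f"{n}-vid{i}"), n)
                           for n in nodes for i in range(vids_per_node))

    def chain(self, key: str):
        # Walk clockwise from the key's position, collecting R distinct nodes.
        start = bisect.bisect(self.ring, (h(key),))
        owners, i = [], start
        while len(owners) < self.r:
            node = self.ring[i % len(self.ring)][1]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners   # head of the chain takes writes, tail answers reads

ring = Ring(["node%d" % i for i in range(8)])
print(ring.chain("some-photo-key"))
```

Because only the VIDs adjacent to a joining or leaving node change owners, the amount of data that has to move is limited, which is the property the review highlights.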
-Wucherl
From: gildong2@gmail.com on behalf of Hyun Duk Kim [hkim277@illinois.edu] Sent: Tuesday, February 23, 2010 4:27 AM To: Gupta, Indranil Subject: 525 review 02/23
525 review 02/23 Hyun Duk Kim (hkim277)
* A Case for Redundant Arrays of Inexpensive Disks (RAID), D. Patterson et al, SIGMOD 1988 This paper introduces Redundant Arrays of Inexpensive Disks (RAID). Compared with the increasing performance of CPUs and memories, the performance of single large expensive magnetic disks (SLEDs) has improved at a modest rate. For improvements in performance, reliability, power consumption, and scalability, RAID can be a good replacement for SLED. This paper introduces five levels of RAID and analyzes the cost and performance of each level. Basically, RAID increases performance and reliability by using duplication. RAID 1 uses direct duplication. For error checking, RAID 2 introduces Hamming codes, and RAID 3 decreases the number of check disks for lower cost. By storing data in sector units, RAID 4 achieves improved performance at low cost. Finally, RAID 5 further improves performance by distributing data and check information across all the disks. According to the experimental results, compared to a traditional single disk, RAID offers significant improvements in performance, power consumption and reliability. This is one of the classic papers in computer science. It was a good chance to read the original paper on the RAID concept learned in basic computer science classes. This paper introduced a novel concept which could improve the performance of a single-disk system in various aspects. This paper also shows various solutions improved incrementally, along with detailed analysis using practical statistics. Because this is a rather old paper, it does not reflect the latest trends. These days there are more than five levels of RAID, studied after this paper. Moreover, the statistics, especially about price, and some intuitions about technology trends are different now. For example, because of the lowered price of a hard disk, it may be more efficient to use just one hard drive within a single machine, and the cost/performance benefit may be guaranteed only in multi-machine situations like cloud computing. RAID concepts show similarities with current cloud computing trends. Instead of using one powerful machine, both practices use many cheap, less powerful components. This paper is meaningful in the sense that it initiated the thinking of using multiple small components instead of one big component. Compared with cloud computing, RAID does not look very scalable. It works well for one computer; however, we need more consideration of the connections between many computers. Also, RAID is only for storage. Cloud computing allows users to exploit computing power as well as storage.
* FAWN: A Fast Array of Wimpy Nodes, D. G. Andersen et al, SOSP 2009 This paper introduces a new cluster architecture, FAWN: A Fast Array of Wimpy Nodes. Existing parallel data management systems do not serve small-object random-access workloads well and consume a lot of energy. FAWN couples low-power, efficient embedded CPUs with flash storage (FAWN-DS) to provide efficient, fast, and cost-effective access to large, random-access data. They also designed FAWN-KV, which provides storage functionality using a log-structured per-node datastore. According to the evaluation results, a FAWN cluster achieves two orders of magnitude more queries per Joule than traditional disk-based clusters.
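Aside: the "two orders of magnitude" claim is about queries per Joule, i.e. throughput divided by power; with purely illustrative numbers (not the paper's measurements):

```latex
% Illustrative numbers only, not the paper's measurements:
\frac{1000~\text{queries/s}}{4~\text{W}} = 250~\text{queries/J}
\qquad \text{vs.} \qquad
\frac{10000~\text{queries/s}}{250~\text{W}} = 40~\text{queries/J}
```

In other words, a wimpy node can lose on raw throughput and still win comfortably on energy efficiency.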
This paper introduces the idea of using flash memory in a cluster. FAWN adopted the log-structure concept and adapted it well to the goal of this paper. Because FAWN has low power consumption, it may also be useful for mobile devices. Power management is one of the most important issues for mobile devices. FAWN concepts may be adopted for mobile device clusters or even P2P structuring. However, FAWN is not good for really large-scale data analysis. Although the capacity of flash memory has increased, it is still much smaller than that of magnetic disks. Also, because flash is rather expensive, using flash to build a large-capacity cluster is not a good choice. The experimental results also show that disks are better for handling huge data sizes. Therefore, it will not be appropriate for the large-scale web data mentioned at the beginning of the paper. Naturally, no one system is best for all situations. While FAWN with flash memory showed improvements in random access, it may not be good for sequential data access. Using both, with a system that switches between them, may be one direction to solve this problem. For example, add one more module at the front end which analyzes the pattern of an input query and decides how to execute it. That is, if a sequential read query comes, the module uses disk storage, and it uses the flash store for random-access queries. After some amount of query processing, the classification of query types would become more accurate. Surely, there will be other challenges to solve, such as how to maintain the two different storage tiers (disk, flash) efficiently and consistently.
-- Best Regards, Hyun Duk Kim Ph.D. Candidate Computer Science University of Illinois at Urbana-Champaign http://gildong2.com
From: Giang Nguyen [nguyen59@illinois.edu] Sent: Tuesday, February 23, 2010 1:57 AM To: Gupta, Indranil Subject: 525 review 02/23 Giang Nguyen nguyen59
FAWN: A Fast Array of Wimpy Nodes On today's clusters, I/O-intensive workloads that consist of small (100B to several KB) objects do not perform well because they are massively parallel and require random accesses over large datasets, while random disk accesses are slow and DRAM is expensive and power-hungry. The paper proposes using a cluster of nodes that have slower CPUs and 4 GB of Flash storage to save power. Because Flash writes are slow, the FAWN-DS datastore is log-structured (append-only). The system consists of a small number of front-end nodes that send requests to the storage back-end nodes. The front-end nodes know about one another, and they implement a key-value DHT system, where each front end is responsible for a contiguous chunk of the key space. If a front end receives a query for a key that is outside of its range, it forwards the query directly (without DHT routing) to the appropriate peer front-end. The FAWN datastore back-ends expose a similar interface, with "store", "lookup", and "delete" operations that take the keys (and values). Each key is 160 bits, but as an optimization, the back-end node only uses a substring of the lowest-order bits of the key as an index into an in-memory hash table, with each entry containing the next 15 lowest-order bits of the key, a valid bit, and a 4-byte offset pointer into the Flash storage, for a total of 6 bytes per entry.
Pros: - Achieves two orders of magnitude more queries per Joule than a disk-based desktop system for their targeted small-object workloads. - Use of a DHT means automatic adjustments when nodes are added/removed or fail.
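Aside: the 6-byte entry format described above (a 15-bit key fragment, a valid bit, and a 4-byte offset) can be pictured as a small bit-packing exercise (hypothetical encoding; the paper's exact layout may differ):

```python
# Toy packing of a FAWN-DS-style hash entry: 15-bit key fragment + valid bit
# in 2 bytes, followed by a 4-byte offset into the data log = 6 bytes total.
# Hypothetical encoding; the actual in-DRAM layout may differ.
import struct

def pack_entry(key_fragment: int, valid: bool, offset: int) -> bytes:
    assert key_fragment < (1 << 15) and offset < (1 << 32)
    word = (key_fragment << 1) | int(valid)
    return struct.pack(">HI", word, offset)       # 2 + 4 = 6 bytes

def unpack_entry(entry: bytes):
    word, offset = struct.unpack(">HI", entry)
    return word >> 1, bool(word & 1), offset

e = pack_entry(0x5A5A, True, 123456)
assert len(e) == 6 and unpack_entry(e) == (0x5A5A, True, 123456)
```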
Cons: - The log-structured file system affects performance during log compaction. - A FAWN node has a maximum of 2 GB of DRAM, so for even larger datasets, a traditional system that can use more DRAM yields higher query rates with fewer required nodes. I wonder if the system would do just as well with larger objects such as photos, or whether it would require tweaks.
------------------------------------------------------------------------------
Facebook's Needles in a Haystack Facebook maintains more than 15 billion photos that users have uploaded. For each photo, Facebook generates and stores 4 versions of different sizes, for a total of more than 60 billion images. At peak, half a million photos are served per second. Because of the large number of image files, the NFS storage infrastructure couldn't cache all the metadata. Thus reading an image file to serve a request requires significant disk I/O to access the file's metadata (inode). Facebook then designed a new photo storage infrastructure, built on top of a general log-structured (append-only) object store called Haystack. Each Haystack contains two files: the Haystack store file that contains the actual objects (and their data), and the index file. - The store file contains Needles (the actual objects and their data), one after the other. Each Needle contains some identifying information (key), some flags (currently, only the "deleted" flag), the size of the data, the data itself, followed by a checksum. The entire Needle is a contiguous blob of data within the store file. The store file is log-structured (append-only). - The index file contains the Needles' corresponding index records, in the same order as the Needles. An index record contains the Needle's identifying information (key) and the offset within the store file at which to find the corresponding Needle. Index information (object keys and offsets) is kept in DRAM for fast access. Also, since the index file is not critical (it can be regenerated from the store file), it is written asynchronously for performance. Modification/addition of objects is done by appending Needles to the store file. A newly appended Needle might have the same key as some existing Needles; the application typically assumes the Needle with the highest offset is the valid one. The index file is updated accordingly. When an object is deleted, its Needle's "deleted" flag is simply set. However, the index records are not updated; a subsequent read operation will simply fail when it sees the "deleted" flag. The Photo storage application is built on top of the Haystack object store. Each uploaded photo generates 4 images of different sizes, and all of these images are added to the Haystack. The HTTP server simply converts client requests into reads of the Haystack for the corresponding photo.
Pros: - Avoids the extra disk overhead of reading a file's metadata in order to get to its data.
Cons: - A log-structured system requires compaction.
I think this is a very neat design. Surely it must perform better than the original system, but it would be nice to see some numbers. Questions: - Why does the index record need to store the size of its Needle? - Is each image's unique key stored in the user's profile?
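A toy sketch of the needle-plus-in-memory-index idea summarized above (hypothetical field layout; the real format also has cookies, checksums, alignment padding, and an on-disk index file):

```python
# Toy Haystack-style volume: photos are appended to one big store file as
# "needles", and a small in-memory index maps key -> offset so a read costs
# a single seek. Hypothetical simplification of the real format.
import os, struct

HEADER = struct.Struct(">QBI")   # key (8 bytes), flags (1 byte), data size (4 bytes)
DELETED = 0x1

class HaystackVolume:
    def __init__(self, path):
        open(path, "ab").close()     # make sure the store file exists
        self.f = open(path, "r+b")
        self.index = {}              # key -> needle offset; a real volume would
                                     # rebuild this from the index file at startup

    def write(self, key: int, data: bytes) -> int:
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(HEADER.pack(key, 0, len(data)) + data)
        self.f.flush()
        self.index[key] = offset     # the newest needle for a key wins
        return offset

    def read(self, key: int):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.f.seek(offset)
        _, flags, size = HEADER.unpack(self.f.read(HEADER.size))
        return None if flags & DELETED else self.f.read(size)

    def delete(self, key: int) -> None:
        # Just mark the needle; compaction reclaims the space later.
        offset = self.index.pop(key, None)
        if offset is not None:
            self.f.seek(offset)
            k, _, size = HEADER.unpack(self.f.read(HEADER.size))
            self.f.seek(offset)
            self.f.write(HEADER.pack(k, DELETED, size))
            self.f.flush()

v = HaystackVolume("photos.store")
v.write(12345, b"...jpeg bytes...")
assert v.read(12345) == b"...jpeg bytes..."
```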
From: Nathan Dautenhahn [dautenh1@illinois.edu] Sent: Tuesday, February 23, 2010 12:25 AM To: Gupta, Indranil Subject: 525 Review 02/23
FAWN: A Fast Array of Wimpy Nodes and A Case for Redundant Arrays of Inexpensive Disks (RAID) Nathan Dautenhahn February 23, 2010
1 FAWN: A Fast Array of Wimpy Nodes 1.1 Summary and Overview FAWN, a Fast Array of Wimpy Nodes, is a key-value distributed storage system built from low-power embedded CPUs with flash storage. In this paper, Andersen et al. address the issue of reducing the power consumption of current I/O-intensive distributed systems workloads while maintaining the capacity, availability, throughput, and latency of current systems. The problem with current systems is that they consume massive amounts of power, amounting to up to 50% of the data center budget. The authors submit FAWN as a prototype system to answer the problem. The authors built a key-value store cluster to test their design, which uses append-only data logs. I liked the way they separated out the performance of FAWN, and the direct comparison to other solutions focusing on the monetary components. This gives the perspective that the authors really want to push: that they provide a cost-efficient storage mechanism. They clearly showed that their work resulted in an approximately 10% drop in costs, which was their primary goal as set at the beginning of the paper.
1.2 Concerns and Questions: Cons are as follows: They evaluated FAWN using a read-intensive benchmark, but one of the primary limitations of FAWN is the latency of random writes. They should show their performance against a set of tests that also bring out their weaknesses. Along the same lines, it isn't clear what FAWN's performance is versus its competitors on write workloads. The paper does mention that its target workloads are reads, but the data must get onto the storage units in some way, which means they must include writes. As mentioned in the pros section I like the monetary comparison versus current technology, but what is missing is an analysis of the cost that a FAWN would incur in order to meet the performance characteristics of the other systems. What types of applications fit this problem statement? I feel as though they gave good theoretical motivation for their work, but they lack real examples that would benefit from their work.
2 A Case for Redundant Arrays of Inexpensive Disks (RAID) 2.1 Summary and Overview This paper introduces RAID, which is a solution to the problem faced in the late 1980s where CPU and RAM performance was increasing at the Moore's Law rate while, concurrently, Single Large Expensive Disks (SLEDs) were only improving in performance at approximately 7% per year. This increase was too slow to keep up with the pace of CPU and main memory. The problem with this is that I/O performance would limit the entire system's performance, thus requiring attention from the research community. Patterson et al. developed the idea of using many inexpensive disks in order to meet the performance and storage characteristics of disks such as the IBM 3380. The contributions of this paper are: a scalable new approach to I/O storage, several novel methods to store data and provide redundancy, and a path to improve both current performance and future growth. The cost and reliability gains for RAID vs. SLEDs are striking, and make it extremely easy to see the improvement.
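To put a rough number on the "I/O limits the entire system" point above (illustrative figures, not necessarily the paper's own example): by Amdahl's Law,

```latex
% If a fraction f of the work is I/O that does not speed up, a k-times
% faster CPU gives an overall speedup of at most
S = \frac{1}{f + (1-f)/k}
% e.g. f = 0.1, k = 100:  S = 1/(0.1 + 0.009) \approx 9.2
```

So even a 100x faster CPU buys less than a 10x faster system when a tenth of the work is stuck behind the disk.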
2.2 Likes This paper has several supporting arguments and is well written. The method of using cheaper and less reliable components to achieve better performance is a novel concept. Most would assume the use of higher-quality and costlier machinery is better. Other pros are as follows: Novel techniques for redundancy. Great use of reliability measurements. I'm not sure if this is one of the first studies to use this type of probabilistic analysis for comparing systems, but it is really interesting to see calculations used then that are still used today. They identified two metrics: throughput measurements and effective performance per disk.
2.3 Comments and Concerns The assumption that disk failures are independent.
3 Common Themes Both of these papers focus on providing solutions to reduce the costs of current systems. Rather than increasing pure performance or the usual characteristics, they view the problem from a monetary standpoint, and both produce great cost decreases in their proposed systems and solutions.
From: Virajith Jalaparti [jalapar1@illinois.edu] Sent: Tuesday, February 23, 2010 12:18 AM To: Gupta, Indranil Subject: 525 review 02/23
Review of “FAWN: A Fast Array of Wimpy Nodes”: This paper presents a new architecture for clusters, FAWN, which aims at lowering the energy consumption of data-intensive computing. It makes use of flash storage since it consumes less power, is cheaper than the equivalent amount of DRAM, and is faster than disk. FAWN uses low-power embedded CPUs so as to save power, reduce the I/O-induced idle cycles that waste energy, and execute more instructions per joule than traditional processors. The paper proposes FAWN-KV, a cluster-based key-value store, and FAWN-DS, a log-structured per-node datastore, which are practical systems built using the FAWN approach. FAWN-DS supports basic operations like Store, Lookup and Delete and is designed especially for flash storage. It maintains an append-only log of data indexed by a DRAM-based hash table. To reduce DRAM usage, only i bits of each key are used to map keys to hash buckets, and as a consequence multiple reads might be necessary to find the correct key-value pair. Each node can be associated with several unique virtual IDs (VIDs), and the data is distributed over these virtual nodes based on the keys associated with them. To support such an architecture, FAWN-DS provides the Split, Merge and Compact operations. FAWN-KV uses consistent hashing to map each of the VIDs onto a ring structure, which ensures that the amount of data movement on a node join or leave is small. It has several front-ends which maintain membership lists containing the mapping of keys to nodes and ensure that each query reaches the correct node responsible for the key. FAWN-KV replicates the data stored on one node on R-1 of its successors, and as a result requires measures to ensure that these copies stay consistent in the event of node leaves and joins.
Pros: - The paper shows that FAWN achieves the goal it set out to reach: reducing the power consumption of clusters intended for data-intensive computation. - FAWN-DS is optimized for puts, as a put just requires appending the data at the end of the log, and this can be done in parallel over multiple nodes. It provides strong consistency of data along with reliability, ensuring that maintenance operations cause low overhead. - Provides an efficient architecture for random queries involving small reads and writes.
Cons/Comments: - The authors present “micro-benchmarks” for their FAWN test bed consisting of 21 nodes. For data-center-type usage this would be a really small FAWN cluster, as its capacity is just ~80GB. The paper doesn't explore the details of the scalability of their scheme. Although the power savings might still hold, the performance of the system in terms of QPS might change. - Although FAWN gives more queries per joule, it has far lower QPS (nearly 6-7 times lower) compared to high-performance traditional machines (Table 4). This shows that the FAWN system may not be able to keep up with the requirements of current clusters/data-centers (but it looks like a power-aware choice for the future). The table also shows that FAWNs are much costlier. - The authors do not evaluate FAWN as a whole system but just present “micro-benchmarks” which don't show how the system would perform in practice and how it can support practical applications. - The paper uses the notion of virtual nodes in order to achieve load balancing across the various nodes (an idea taken from Chord). However, since the operator of the system already knows the number of nodes that are going to be used in the system, they can plausibly be assigned IDs such that the load is evenly distributed. Also, FAWN-DS uses R consecutive (virtual) nodes to store the replicated data. However, it might happen that all of these map to a single physical node, resulting in a loss of data when that node fails.
Review of “A Case for Redundant Arrays of Inexpensive Disks (RAID)”: The main premise of this paper is that the rate of increase of disk I/O bandwidth is not able to keep up with the rate of increase of CPU speed and thus, following Amdahl's Law, even a 100x improvement in CPU speed may not increase the speed of a computer by more than 10x, making disk I/O the bottleneck of the whole system. So the authors propose the use of an array of inexpensive disks (the expensive ones have just twice the I/O bandwidth of these), which can be used in parallel and hence increase the I/O bandwidth by nearly an order of magnitude as compared to the costly disks. However, this leads to a higher probability of disk failures. To overcome this and to ensure reliability even in the face of failure of several disks, the paper proposes RAID, which uses redundancy of data to ensure that data can be retrieved even if some of the disks get corrupted. The paper proposes multiple levels of RAID (1-5), each varying in the amount of redundancy and the performance it delivers. While level 1 RAID uses simple data mirroring, higher levels use Hamming codes and parity bits in order to make sure that corrupted data can be recovered: RAID 2 uses Hamming codes with bit-interleaved data, RAID 3 uses one check disk per group, RAID 4 spreads the data across the disks achieving parallelism, and RAID 5 spreads the check bits across the disks rather than dedicating a complete disk to them, leading to greater parallelism and thus increasing the speed of the various operations. The paper goes on to give a comparison of these various techniques in terms of efficiency and performance.
Pros: - The main advantage of RAID is obvious: it provides a storage system built from several inexpensive disks, which provides high I/O bandwidth and reliability using redundancy.
Cons/Comments: - The paper provides the performance/efficiency of the various RAID levels, but all of these are based solely on estimates and are not measured in a real system/implementation.
In reality, the performance of the system (in terms of I/O bandwidth) would be much worse because it requires coordination across multiple disk controllers. However, the estimates regarding lifetime and storage efficiency are quite valid. - Almost all the RAID levels are suited just for applications which perform large reads and writes: RAID 5 performs much better than the others, but the efficiency it achieves for small reads/writes is much lower than what is achieved when large data is written/read. This shows that while RAID is suited for supercomputer-type applications, it might not be as good for transaction-based systems. - While Hamming codes can help to recover from multiple errors, parity bits can be used to correct only one-bit errors. While one hard disk getting corrupted might be the typical case, it is not always true, and if multiple of them fail, there is no way to recover the data using just parity bits. - The paper assumes that disk failures are independent. However, some events, like a surge in electricity, would cause correlated failures, in which case there would remain no assurance of reliability.
-- Virajith Jalaparti PhD Student, Computer Science University of Illinois at Urbana-Champaign Web: http://www.cs.illinois.edu/homes/jalapar1/
From: Shehla Saleem [shehla.saleem@gmail.com] Sent: Monday, February 22, 2010 10:35 PM To: Gupta, Indranil Subject: 525 review 02/23
A Case for Redundant Arrays of Inexpensive Disks (RAID) There has been a tremendous increase in CPU and memory performance, but the I/O side has not been able to show the same trend of performance improvements. At this rate, there would come a stage when further increases in CPU performance would go to waste since I/O would just not catch up. The authors try to address this very pressing issue of a possibly impending I/O crisis by building a case for RAID and discussing some of the attractive features of using it. The idea is to add redundancy by having an array of disks so that in the instance of the failure of a disk, the redundancy may be exploited to reconstruct lost or corrupted information. They evaluate the reliability levels that can be achieved by this arrangement of disks. They also talk about the overheads and evaluate the performance in the cases of supercomputers and transaction processing. They use six performance metrics to study the relative efficiency with and without RAID. One of the many strengths of the paper is that it covers a large part of the spectrum of possibilities: starting from level 1, i.e. complete mirroring with one check disk for each data disk, and going to level 5, where data and check information are spread across all the disks. Levels 2 and 3 employ bit interleaving whereas level 4 employs sector interleaving. Each of them is suited to different scenarios depending upon the kind of application and the size of data being handled. Level 5 has the advantage of allowing multiple simultaneous writes due to the fact that check information is spread across multiple disks. This is a seminal, classic, almost revolutionary paper in the area. The idea is very well deliberated over and excellently presented and evaluated.
FAWN: A Fast Array of Wimpy Nodes This paper tries to answer the question of how to design a cluster that is power-aware and cost-effective for data-intensive computations. The paper identifies some of the issues faced by large-scale, data-intensive applications. The authors emphasize that different kinds of workloads need to be treated differently.
So they explain how providing high performance at low cost is particularly hard for small-object random-access workloads. The paper presents FAWN: a design which involves the use of many low-performance nodes, each with its own relatively small amount of flash memory. The basic intuition behind FAWN comes from the fact that flash memory is faster than conventional disk storage, and wimpy nodes have slower processors which consume less power. This architecture fills some of the gap between the performance of the CPU and that of storage. The main components of the FAWN system are FAWN-DS and FAWN-KV. FAWN-DS comprises the large number of slow nodes with flash memory. FAWN-KV is a consistent, replicated and highly available key-value layer that handles routing. Overall, the paper is very well written and the authors have apparently not overlooked any important aspects. The reduction in power requirements is particularly attractive considering the explosive growth in the power requirements of today's data centers and clusters. The authors also discuss and quantify the overheads incurred by their system. One of the downsides is that FAWN seems very specifically suited to a particular type of application, i.e. data-intensive applications. What about compute-intensive applications, like video processing or interactive gaming? These applications are becoming increasingly popular and should also be considered. Can FAWN be modified to suit these applications too? Also, the paper talks about (graceful) node joins and leaves. Can the results change if nodes can fail too?
From: Ashish Vulimiri [vulimir1@illinois.edu] Sent: Monday, February 22, 2010 9:55 PM To: Gupta, Indranil Subject: 525 review 02/23
FAWN: A Fast Array of Wimpy Nodes, D. G. Andersen et al, SOSP 2009 This paper presents the FAWN (Fast Array of Wimpy Nodes) architecture, which aims to support data-intensive computation using a cluster of low-power nodes equipped with energy-efficient flash storage. The authors implement a DHT on top of the FAWN architecture and show via an experimental evaluation that the FAWN+SSD architecture is cheaper than a traditional, high-performance server solution for most combinations of query rate and dataset size, although they demonstrate that replacing the SSDs with regular magnetic disks or with DRAM (while still using a FAWN-like architecture) would be preferable, respectively, for large-dataset/low-query-rate and small-dataset/high-query-rate combinations. The main question I have is about how well the FAWN architecture would perform with application layers other than key-value stores. The authors cite work on implementing databases and filesystems on log-structured flash storage, but these papers mostly deal with single-node implementations, and (with the exception of the ones dealing with sensor network applications) focus mostly on performance issues. It would be interesting to see an evaluation for other metrics such as power usage, operation cost (as was done for FAWN-KV), ease of programmability, etc. Also: the price of flash storage is still unstable and is expected to fall further in the near future. This should have a (positive) impact on the economic analysis of the FAWN architecture.
Facebook's Needles in a Haystack, P. Vajgel et al Haystack is a custom object store that Facebook uses to manage storage and retrieval of its users' images.
The Haystack infrastructure was constructed as a replacement for a direct filesystem (NFS) store that was performing poorly due to the high overhead involved in looking up the directory->inode, directory->filename and filename->inode metadata. The data itself is stored as an append-only sequence of image records (one image per record), and individual records are accessed using entries in a separate index file that only contains a (key, offset) pair for each record. The index file itself is not critical (it can be rebuilt from the data file if necessary), and is used primarily for efficiency reasons: in Facebook's implementation, the index file is less than 1% of the size of the data file, and is stored directly in memory. + Optimized for their usage requirements. A random-access filesystem is overkill for append-only data. - Why build a custom object store instead of just using, say, a hash table or a standard relational database? - Related to the above comment: it would've been nice to see some numbers on how well they do wrt some of the other available options (including the direct filesystem store). - Data isn't actually deleted: this leads to privacy issues. Facebook has had privacy worries in the past (including a lawsuit a few months ago).
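Aside: the compaction pass that several of the Haystack reviews mention (copy live needles into a fresh store, skip deleted and superseded ones) can be sketched as follows (a toy version operating on in-memory lists, not the real system's code):

```python
# Toy compaction for a log-structured store: needles live in an append-only
# list as (key, deleted_flag, data); compaction keeps only the newest,
# non-deleted needle per key and rebuilds the key -> position index.
def compact(needles):
    newest = {}                      # key -> index of its latest needle
    for i, (key, deleted, _) in enumerate(needles):
        newest[key] = i              # later appends supersede earlier ones
    live = [needles[i] for key, i in sorted(newest.items(), key=lambda kv: kv[1])
            if not needles[i][1]]
    index = {key: pos for pos, (key, _, _) in enumerate(live)}
    return live, index

old = [("a", False, b"v1"), ("b", False, b"x"),
       ("a", False, b"v2"), ("b", True,  b"")]   # "b" was deleted
new, idx = compact(old)
assert new == [("a", False, b"v2")] and idx == {"a": 0}
```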
Paper 2: Needle in a Haystack Haystack is the Facebook solution to the problem of a file-system not working terribly well for their high-volume blob storage needs. Unlike their old photo infrastructure, which had two different tiers (the photo serving tier and the storage tier) for serving photo operations, their Haystack Photo Infrastructure merges these tiers into one physical tier. It implements an HTTP-based photo server which stores photos in a generic object store called Haystack. The reason behind this merge was to have more efficient photo read operations; this way, each read I/O operation reads only actual data instead of file-system metadata as well. The Haystack Photo Infrastructure consists of five functional layers: HTTP Server, Photo Store, Haystack Object Store, File-system, and Storage. As the HTTP server, a simple multi-threaded server is used in which each thread serves a single HTTP request at a time (this is acceptable since the workload is mostly I/O bound). The second layer, the photo store server, handles HTTP requests from the upper layer and translates them to the corresponding haystack store operations. Compaction, provided by this layer, is an online operation to reclaim garbage space; it creates a new haystack by copying the in-use space and skipping the garbage parts. Then it swaps the files and in-memory structures. The Haystack object store, the third layer, is a simple log-structured object store that contains needles, which represent the stored objects. A haystack consists of two files: the actual haystack store file containing the needles, and an index file. A needle is uniquely identified by a tuple that includes its key and its offset in the haystack store. The index file contains a corresponding index record for each needle in the haystack store file, and the order of the index records must match the order of the associated needles in the haystack store file. The index file provides the minimal metadata required to locate a particular needle in the haystack store file. Write, read, and delete operations are provided for handling needles in the haystack store file. In-place modification is not allowed; instead, a new version of the needle must be appended under the same key. In addition, the delete operation only marks the needle as deleted, and the space of a deleted needle is not reclaimed in any way; the only way to reclaim space from deleted needles is to compact the haystack. Haystack object stores are implemented on top of files stored in a single file-system (the fourth layer) created on top of the 10TB volume (the lowest layer). The architecture sounds reasonable, but not being allowed to modify a needle and not having real deletion is questionable. The stated reasons (that the delete rate is very low, so the resources that would be recovered are not worth the work to recover them in Facebook's usage, and that Facebook is doing essentially the same thing a filesystem would) may not remain acceptable given the increasing rate of photo modifications and deletions on Facebook.
From: ntkach2@illinois.edu Sent: Sunday, February 21, 2010 4:55 PM To: Gupta, Indranil Subject: 525 review 02/23 Nadia Tkach – ntkach2 CS525 – paper review 4 Storage
Paper 1: A Case for Redundant Arrays of Inexpensive Disks (RAID) At the time the paper was written, the commonly used storage technology, the Single Large Expensive Disk (SLED), was slowly being pushed aside by inexpensive magnetic disks targeted mainly at the personal computer market, arranged into a Redundant Array of Inexpensive Disks (RAID).
From: ntkach2@illinois.edu
Sent: Sunday, February 21, 2010 4:55 PM
To: Gupta, Indranil
Subject: 525 review 02/23

Nadia Tkach – ntkach2
CS525 – paper review 4
Storage

Paper 1: A Case for Redundant Arrays of Inexpensive Disks (RAID)
At the time the paper was written, the commonly used storage technology of Single Large Expensive Disks (SLED) was slowly being pushed aside by inexpensive magnetic disks aimed mainly at the personal-computer market. Unlike SLEDs, these inexpensive disks were less reliable and were not useful without proper fault tolerance. This led to the use of disk arrays that store redundant information that can be used to recover data in case of a failure -- the Redundant Array of Inexpensive Disks (RAID). The paper describes five levels of RAID and analyzes their capabilities in terms of the cost/performance trade-off. Each level shares some characteristics with the previous one, and in sequence they are:
Level 1: Mirrored Disks (duplicates every data disk, thus requiring twice as many disks);
Level 2: Hamming Code for ECC (bit-interleaves data and uses check disks to detect and correct a single error, requiring fewer disks than Level 1);
Level 3: Single Check Disk per Group (requires only one check disk per group, since the disk controller can already identify which disk failed);
Level 4: Independent Reads/Writes (stores the data of any single transfer on one disk, allowing reads and writes to proceed in parallel on several disks in the group, though the single check disk becomes a bottleneck);
Level 5: No Single Check Disk (spreads the check information across all disks in the group, including the former "check" disk).
Pros:
• The authors present their metrics as the efficiency of the disk array relative to a single disk, which simplifies the comparison and uses normalized measures rather than absolute numbers.
Cons:
• The authors assume failures are exponentially distributed and independent. While this is a reasonable assumption, there may be cases where the failure of one disk triggers the failure of another, or of the whole array.
• The RAID levels are described and analyzed from a hardware perspective only, with no consideration of software optimization.

Paper 2: Facebook's Needles in a Haystack
The document describes the Haystack technology deployed by Facebook for faster and simpler photo retrieval. Haystack is a software layer that works over the photo storage system. It consists of superblocks and needles that provide all the information necessary to locate and retrieve a photo. Additionally, on top of the Haystack store file, an index of records is built that contains the minimal metadata required to find a photo in the store; this index allows the needle metadata to be retrieved without traversing the large Haystack store file. The Haystack application can be viewed as an abstraction layer for information retrieval from large data stores.
Pros:
• Needle indexing permits faster access to needle metadata and thus faster photo loading; additionally, the index can be rebuilt from the original Haystack store file at any time in case of inconsistencies.
• Disk caches are disabled in order to avoid inconsistencies in case of a failure.
Cons:
• Deleted needles are not physically erased; instead their offset in the index is updated to show the deleted status, and the record is left in the Haystack store file until a compaction operation is performed, which wastes storage space (see the sketch below).
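To make the compaction idea concrete, here is a minimal sketch (illustrative only, not Facebook's code) of reclaiming the space of deleted and superseded needles by copying just the live records, identified by an in-memory index of key -> (offset, size), into a fresh store file; it assumes the same toy record header used in the earlier sketch:

import struct

HEADER = struct.Struct(">QI")   # key (8 bytes), payload size (4 bytes), as assumed earlier

def compact(old_path, new_path, index):
    # Copy only live needles into the new file and return the rebuilt index.
    # Deleted and superseded needles are simply skipped, which is how their
    # space is finally reclaimed.
    new_index = {}
    with open(old_path, "rb") as old, open(new_path, "wb") as new:
        for key, (offset, size) in index.items():
            old.seek(offset)
            record = old.read(HEADER.size + size)
            new_index[key] = (new.tell(), size)
            new.write(record)
    return new_index

Consistent with the description in the reviews above, a deployment would run this copy online in the background and then swap the store file and the in-memory structures in one step.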
From: liangliang.cao@gmail.com on behalf of Liangliang Cao [cao4@illinois.edu]
Sent: Thursday, February 18, 2010 4:47 PM
To: Gupta, Indranil
Subject: 525 review 02/23

Paper reviewed by Liangliang Cao (cao4@illinois.edu) for CS525 class on Feb 23, 2010

Paper 1: A Case for Redundant Arrays of Inexpensive Disks (RAID), SIGMOD 1988
This is a seminal paper on building distributed storage systems using disk arrays. By arranging devices into arrays with redundancy, RAID (redundant array of inexpensive disks) enables computer users to achieve high levels of storage reliability from low-cost, less reliable PC-class disk-drive components (a short XOR-parity sketch illustrating this redundancy appears at the end of this review). In the decades since, the industry has witnessed a surge of RAID products, which has also encouraged the design and evolution of distributed servers and cloud computing.
Pros:
• The introduction is a classic; it draws much insight from the background and motivation of the RAID techniques.
• The five-level design covers many of the basic techniques used in subsequent technology.
Cons:
• The paper does not discuss in much depth how to handle the trade-off among protection against data loss, capacity, and speed.
• The design of RAID assumes that disk failures happen independently, so the redundancy is assigned in an evenly distributed manner. This may not be optimal in practice, since some disks are of the same age and some disks are more reliable than others.
• The paper does not give much consideration to caching, which can play an important role in boosting storage performance.

Paper 2: Facebook's Needles in a Haystack
As the biggest photo-sharing website, Facebook hosts 15 billion photos in total, occupying 1.5 PB of storage. As the community grows, uploads add roughly 220 million new photos per week (25 TB of additional storage), and photos are served at a peak rate of about 550,000 images per second. The classical NFS-based system stores each image in its own file and generates an enormous amount of metadata, which exceeds the caching abilities of the NFS storage tier and results in multiple I/O operations per photo read request. To handle this problem, Facebook developed a new data store named Haystack, which eliminates unnecessary metadata overhead for reads and other I/O operations. By using the needle file to store the cookie, key, flags, size, and other metadata, Haystack can keep each needle's location in an in-memory index and avoid unnecessary metadata I/O.
Pros:
• By aggregating hundreds of thousands of images in a single store file, Haystack manages to keep the index in memory.
• The needle file structure keeps the metadata overhead small.
• Haystack eliminates unnecessary metadata overhead, so an image's data can be retrieved in a minimal number of I/O operations.
Cons:
• The design of Facebook's Haystack is not general: it mainly addresses the specific requirements of the Facebook community. Other companies, such as Flickr or Wikipedia, would need a different architecture to accommodate their different requirements.
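To tie the RAID reviews together, here is the XOR-parity sketch referred to above (illustrative only; the block size, group size, and stripe values are assumptions, and real RAID logic lives in the controller or driver). It shows how the contents of a single failed disk in a parity group can be reconstructed from the surviving disks and the check block:

from functools import reduce

BLOCK_SIZE = 8  # illustrative block size in bytes

def xor_blocks(blocks):
    # Byte-wise XOR of equal-length blocks.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def parity(data_blocks):
    # The check block stored for one stripe of the group.
    return xor_blocks(data_blocks)

def reconstruct(surviving_blocks, parity_block):
    # XOR of the survivors and the parity recovers the single missing block.
    return xor_blocks(surviving_blocks + [parity_block])

# One stripe across a group of four data disks plus one check disk.
stripe = [bytes([i] * BLOCK_SIZE) for i in (1, 2, 3, 4)]
check = parity(stripe)

lost = stripe[2]                              # pretend the third data disk failed
survivors = stripe[:2] + stripe[3:]
assert reconstruct(survivors, check) == lost

The RAID levels discussed in the reviews differ mainly in where this check information lives: Levels 3 and 4 keep it on a dedicated check disk per group, while Level 5 spreads it across all the disks in the group.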