From: Igor Svecs [isvecs SpamElide]
Sent: Tuesday, March 08, 2011 12:37 PM
To: Gupta, Indranil
Subject: 525 review 03/08

Igors Svecs
CS525 paper review 2011-03-08

Bigtable

SUMMARY
Bigtable is a distributed storage system that is designed to be highly scalable both in terms of storage size and the number of machines. Data is indexed by a row key, a column key, and a timestamp, and is treated as an unstructured string. Every read/write under a single row key is atomic. Rows are grouped into tablets for distribution purposes. Columns are grouped into families. Timestamps enable multiple-version support. Bigtable uses the Chubby distributed lock service to control file locking and replication. The implementation consists of a client library, a master server that assigns tablets to tablet servers, and dynamically added tablet servers. Tablet locations are stored in a B+ tree-like structure. Performance is increased by storing recently committed updates in a memtable, which is committed to disk (GFS) when it grows too large.

COMMENTS
Bigtable treats data as uninterpreted strings, and I wonder what benefits (such as supporting a subset of relational operations) could be gained if Bigtable knew the structure. Bigtable is similar to other key-value stores we saw in the class, but it is more similar to databases. Its authorization mechanism – reading the list of permitted senders from a file – seems too basic and inflexible, although it may suffice within a single company. Another criticism is that they performed the evaluation in a single data center, while most highly reliable systems use multiple sites. A significant plus of Bigtable is that scans scale nearly linearly, even though random writes perform poorly. One great benefit of the paper is that, compared to many other papers that only talk about the final result, it thoroughly discusses lessons learned and describes the authors' thought and build process in detail, so we can see many issues that arise in building a practical system.

Haystack

SUMMARY
Haystack is Facebook's distributed object store that was designed to reduce the number of reads when accessing a file in a datastore. It was designed for storing pictures. Most importantly, it keeps file metadata in memory, so that a read operation takes only one disk operation to read the data, while traditional approaches (such as NAS over NFS) follow a multi-step process. Their data shows that "requests from the long tail account for a significant amount of [their] traffic", so CDNs alone don't work. Haystack's architecture consists of three components: Store, Cache, and Directory. The Store's capacity is organized in physical volumes, which are grouped into logical volumes. The Cache is a DHT that works similarly to a CDN and caches a photo when the request comes directly from a client's browser. The Directory maps logical volumes to physical volumes. The main feature of their architecture is that multiple pictures (which they call needles) are grouped together into a single file, and the offsets of individual objects within the file are kept in memory. In-memory metadata is periodically saved to disk as an index file to speed up recovery.

COMMENTS
This paper does not introduce any particularly interesting ideas and is not written in an informative way; its main contribution is that it presents an evaluation on an actively used dataset that is among the largest in the world. They encode the machine number in the URL, which may not stay the same forever, potentially rendering these URLs invalid in the long term. Since they already have the Directory, why not map logical ids to (machine, physical id)? (A sketch of this alternative appears at the end of this review.) It is also not explained in which cases requests from client browsers go to the CDN, and when they go directly to the Haystack Cache. The recovery / fault-tolerance process is particularly confusing. It appears that they deal with each failure case manually, which should be unacceptable in a modern distributed system. This system is heavily optimized for the type of content they are serving – immutable objects that are rarely deleted – and will not work for any other arrangement (such as mutable data). These two papers emphasize that design should be simple – in Bigtable's case they were spending too much time on corner cases, in Haystack's case it allowed Facebook to deploy the system faster.
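To make the URL concern above concrete, here is a rough sketch contrasting the paper's URL shape (http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>) with the Directory-based alternative suggested in the review. The structures and names are hypothetical, not Facebook's actual code.

    # Sketch only: machine id baked into the URL vs. resolved at request time.
    directory = {
        # logical volume id -> list of (machine id, physical volume id) replicas
        42: [("store-17", 7), ("store-29", 3)],
    }

    def url_with_machine_id(cdn, cache, machine_id, logical_volume, photo_id):
        # What the paper describes: if the photo later lives on a different
        # machine, an old URL of this form stops working.
        return f"http://{cdn}/{cache}/{machine_id}/{logical_volume},{photo_id}"

    def url_via_directory(cdn, cache, logical_volume, photo_id):
        # The alternative raised above: publish only logical ids and let the
        # Directory map them to (machine, physical volume) when the request arrives.
        return f"http://{cdn}/{cache}/{logical_volume},{photo_id}"

    def resolve(logical_volume):
        machine_id, physical_volume = directory[logical_volume][0]
        return machine_id, physical_volume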
From: anjalis2006 SpamElide on behalf of Anjali Sridhar [sridhar3 SpamElide]
Sent: Tuesday, March 08, 2011 12:27 PM
To: Gupta, Indranil
Subject: 525 review 03/08

Finding a Needle in Haystack: Facebook's Photo Storage, D. Beaver et al, OSDI 2010

Haystack is a storage system developed for the storage and retrieval of billions of photos uploaded by Facebook users. Recognizing that disk I/O is the bottleneck, Haystack attempts to reduce disk accesses when retrieving a photo from its servers. It reduces the amount of metadata stored for each photo and stores that metadata in main memory. While CDNs take care of the latest photos uploaded, requests for older photos still form a significant portion of requests. Haystack provides a solution that minimizes the latency of photo retrievals that may not be cached.

Haystack consists of three components: the Directory, the Store, and the Cache. It also deals with two types of metadata: application metadata and filesystem metadata. Application metadata is used by the browser to request the photo, and filesystem metadata is used by the host to find that photo on its machine. The Directory provides a mapping from logical to physical volumes, keeps track of the storage on each of the logical volumes (marking them as read-only when they fill up), and directs user traffic to the Cache or CDN. Haystack tries to maintain its independence from CDNs by having an internal cache to service user requests. The Cache keeps track of the write-enabled machines, since it is typical user behavior to view the most recently uploaded photos. The Store retrieves a photo based on the logical volume id and the offset where the photo is stored. Each Store machine also has an index file that checkpoints its in-memory mapping. Haystack has read, write, and delete operations. Each volume file on a Store machine consists of a superblock followed by needles that represent each photo. A request for a photo comes in from the Cache with the logical volume id, the photo id (called the key), and the alternate key specifying which size of the file is required. The corresponding offset and other relevant metadata are retrieved by the Store machine using the key.

Reducing metadata at the cost of permissions is mentioned in the paper. It seems that there might be repercussions regarding privacy. When dealing with failures, pitchfork marks an entire logical volume as faulty even when it may be a single Store machine that is the problem. Failure checking should be done at a store-by-store level to prevent unnecessary loss of machines. In the optimizations section Haystack talks about batch uploads; however, in the case of mobile uploads, photos are uploaded one by one asynchronously. Haystack is tailored to Facebook's needs, since the idea of newsfeeds driving a majority of the traffic is integral to its design.
The impact of constructing a storage system without newsfeeds driving the traffic is not seen. There is a lack of flexibility, with only four kinds of images being stored for each photo. Current image processing techniques may allow for more flexibility in terms of compression and resolution; image formats like JPEG 2000 provide high compression rates and flexibility in image sizes.

The Google File System, S. Ghemawat et al, SOSP 2003

The Google File System is a large, scalable, distributed file system for data-intensive applications. It consists of a single master and multiple chunkservers, which store fixed-size chunks identified by a 64-bit ID. The master provides information about the location of a chunk, and the client streams the data directly from the chunkserver. The client may batch chunk-information queries to the master as an optimization. GFS maintains the metadata in the main memory of the master in order to speed up operations; it has to store 64 bytes for each 64 MB chunk. GFS decouples the data and control flow in order to maximize usage of the network. After receiving information about the primary and replica chunkservers from the master, the client sends the data to the primary and replica servers. It then sends a write request to the primary server. The primary server orders the list of mutations to be performed on the data received, and this ordering is followed by all the replica servers in order to reach a consistent and defined state. GFS provides fault tolerance, data replication, and data integrity methods.

The assumption of high sustained bandwidth is possible for Google to make since it can afford to construct its own high-bandwidth links. However, realizing this design may not be possible for all organizations. To optimize for network bottlenecks and high latency, GFS attempts to estimate distances from IP addresses; this assumes that the organization's network topology makes such estimation possible. GFS seems tailored to the applications that Google runs internally and hence is optimized for its needs and available resources. The overhead of the changes any other organization would have to make to use this design has to be considered.

From: Curtis Wang [cwang89 SpamElide]
Sent: Tuesday, March 08, 2011 12:20 PM
To: Gupta, Indranil
Subject: 525 review 03/08

Curtis Wang (wang505)
3/8/11
Storage in Industry

GFS
The Google File System is designed for performance, scalability, reliability, and availability, but makes a few assumptions to specifically optimize for Google's applications. They assume frequent component failures (due to using commodity machines), large files, and frequent append operations. In addition to the standard operations (e.g., read, write), they provide fast operations to append data (record append) and to copy data (snapshot). Their architecture consists of a single master, which stores and maintains metadata, and several chunkservers. Clients talk to the master to locate their files, but they always read and write from the chunkservers.

Pros
- Built to run on commodity machines and, as a result, to be fault tolerant on these machines
- Built to scale and operate with large data chunks
- File system control (master) is separated from the data transfer (chunkservers), which allowed for higher throughput with many concurrent readers and writers
- Provides good fault tolerance through their monitoring system (heartbeat messages) and chunk replication methods (lost replicas are replaced as fast as possible, addressing more serious outages first)

Cons
- The master seems to represent a single point of failure, though the addition of shadow masters helps (clients could still receive stale data during recovery of the master, though)
- Optimized for record append operations. If random writes are common, this system may not perform as well as expected.

It would have been interesting to hear why they designed the system so that stale chunks are garbage collected instead of salvaged. With the chunk version number and the log of mutations, it seems possible to just update the chunk instead of wasting the time to re-replicate it. Is this because it would introduce too much complexity in the system?

Bigtable
Google's Bigtable is a distributed, multi-dimensional, sorted map. It is indexed by three keys: a row key, a column key, and a timestamp. Column keys are grouped into sets that are called column families. Under a column family, one can have unlimited columns, and each of these columns can store multiple copies of data (perhaps similar data at different timestamps). Under the hood, Bigtable uses GFS to store data and relies on Chubby, a distributed lock service, for access control. Chubby also stores metadata, such as the bootstrap location of Bigtable data. Bigtable's architecture consists of client libraries, one master server (for assigning tablets to tablet servers, load balancing, and garbage collection), and several tablet servers. Like GFS, actual data is not communicated through the master; client data only moves between the tablet servers and clients. Unlike GFS, the master does not contain tablet location information, so clients rarely communicate with the master. Bigtable optimizes its data storage by exploiting locality groups (grouping related data closer together) and by using compaction. SSTables are frequently compacted (minor compaction) and groups of SSTables are occasionally compacted together to form one SSTable (major compaction).

Pros
- Has high throughput, high scalability, and high availability
- More flexible than a traditional DBMS since it doesn't enforce schemas but still handles structured data

Cons
- Once again, the master seems to represent another single point of failure for the system
- Random reads scale the worst, so this could be bad depending on the application

Having the word "table" in the name is quite misleading since Bigtable is not implemented like an actual relational table at all. The only similarity comes with the names that are associated with the keys (rows, column families, etc.).

From: kevin larson [kevinlarson1 SpamElide]
Sent: Tuesday, March 08, 2011 12:15 PM
To: Gupta, Indranil
Subject: 525 review 03/08

Bigtable is a storage system designed by Google in order to manage structured data at a very large scale. Various Google projects need to store and access petabytes of data with real-time access. The data and logs used in Bigtable are stored on the Google File System (GFS). The primary structure used in Bigtable is the tablet, which is used hierarchically to store the many rows of data in Bigtable. It also uses Chubby, a distributed locking service, for a variety of tasks: guaranteeing there is only one master, discovering tablet servers, finalizing tablet server deaths, and storing access control lists.
Tablets are assigned to one server at a time, and the master is responsible for reassigning tablets in the case of a tablet server failure. Accesses are checked for authorization (in Chubby) and well-formedness before being served. The evaluation demonstrates Bigtable's performance and ability to scale, and adoption at Google demonstrates its usability and performance.

The authors present a strong argument for their data model, demonstrate how it applies to Google's projects, and proceed to show how their implementation is built around that model. It was also interesting to see the various refinements added to Bigtable as needed to make it perform adequately. While it was very interesting to see the refinements made to Bigtable, it would have been interesting to see how these various refinements affected the performance. The question of whether the additions of compression, locality groups, and Bloom filters were minor, or whether one or more of them profoundly impacted the throughput or latency of Bigtable, was left unanswered and would have been quite interesting.

Much like Bigtable, Haystack also needed to manage petabytes of data. Unlike Bigtable, there were a variety of different challenges and goals in mind when implementing Haystack. The authors repeatedly touch on the importance of low-latency access to images, and concede they cannot cache everything they wish. They determine that non-cached photos were largely bottlenecked by the numerous disk accesses required to fetch the images. As a result, they implemented Haystack, which uses small needle indexes in order to limit the vast majority of image accesses to single disk accesses (all but those spanning disks in RAID). These needles are designed to be as small as possible, in order to be stored entirely within memory, preventing unneeded disk accesses. Evaluation shows that Haystack can achieve nearly the same performance as random aligned IO accesses. Additionally, it shows that the interactions between reading and writing photos don't significantly hurt performance either.

The evaluation of Haystack was particularly interesting. In addition to clearly demonstrating the performance and latency relative to random aligned IO accesses (a high standard to set yourself against), they also demonstrated the effects of use (storage) on a server, and how the load changed as the server filled with images.

From: Long Kai [longkai1 SpamElide]
Sent: Tuesday, March 08, 2011 12:15 PM
To: Gupta, Indranil
Subject: 525 review 03/08

The Google File System

Summary: This paper describes the design and implementation of the Google File System, a distributed file system for data-intensive applications. One important feature is that the system runs on inexpensive hardware while providing fault tolerance. It demonstrates important techniques for supporting large-scale data processing workloads on commodity hardware. However, some design choices are specific to Google's unique needs and setting: component failures are the norm rather than the exception, files are huge by traditional standards, and most files are mutated by appending new data rather than overwriting existing data. But many of the basic ideas can also be applied to general large-scale data processing systems.

As for the architecture, a GFS cluster consists of a single master and multiple chunkservers. Data is divided into fixed-size chunks and stored on the chunkservers. The master maintains all file system metadata, including the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas. This allows the master to easily handle data placement, fault recovery, and clients' queries. However, clients interact with the master only for metadata operations, and all data-bearing communication goes directly to the chunkservers. Thus, the flow of data is decoupled from the flow of control, and in this way the master does not become the bottleneck for the system.
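The decoupling just described is easy to picture with a small sketch of a GFS-style read path: the client computes the chunk index from the byte offset, asks the master only for metadata, and then fetches the bytes directly from a chunkserver. The structures and names below are hypothetical, assuming 64 MB chunks, and are not the paper's actual API.

    # Sketch only: metadata from the master, data from chunkservers.
    CHUNK_SIZE = 64 * 1024 * 1024

    class Master:
        def __init__(self):
            # file name -> list of (chunk handle, [replica chunkserver names])
            self.chunks = {"/logs/web-00": [("handle-0", ["cs1", "cs2", "cs3"])]}

        def lookup(self, path, chunk_index):
            return self.chunks[path][chunk_index]      # metadata only, never file data

    class Chunkserver:
        def __init__(self, blocks):
            self.blocks = blocks                       # chunk handle -> bytes

        def read(self, handle, chunk_offset, length):
            return self.blocks[handle][chunk_offset:chunk_offset + length]

    def read(master, chunkservers, path, offset, length):
        chunk_index = offset // CHUNK_SIZE             # computed locally by the client
        handle, replicas = master.lookup(path, chunk_index)
        server = chunkservers[replicas[0]]             # any replica will do for reads
        return server.read(handle, offset % CHUNK_SIZE, length)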
This system achieves high availability through fast recovery from machine failures and chunk replication on multiple machines. The system provides fault tolerance through constant monitoring, replication of crucial data, and fast and automatic recovery. Chunk replication allows the system to tolerate chunkserver failures.

Cons: Multiple concurrent client operations on one chunk of data may cause efficiency issues that are not dealt with by GFS, because the hardware needs to go back and forth to access data at different offsets. Though the master is not the bottleneck for the throughput of the system, the master's failure still causes the whole system to stop running for a period of time, which is not convenient from a user's perspective.

Bigtable: A Distributed Storage System for Structured Data

Summary: Bigtable is a distributed storage system for storing structured data at Google. This system provides a flexible, high-performance solution for applications with different constraints and requirements. Each table is a multi-dimensional sparse map; the table consists of rows, columns, and timestamps as the three dimensions. In order to optimize for GFS, tables are split at row boundaries into tablets. Tablets are stored on the system as immutable SSTables plus a tail of logs. Each query goes through a three-level system: the client contacts the META0 (root) tablet to get the address of the appropriate META1 tablet, in which the location of the actual tablet is stored. Prefetching and caching are used to reduce the overhead on the META1 tablets. All tablets on one machine share a log, and when a machine failure is detected, the master distributes the chunks indicated in the log to other machines. Machines that should store the tablets will query the master for the location of the data and transfer the data from that location directly.

Future development: The next steps for Bigtable include expressive data manipulation, multi-row transaction support, and Bigtable as a service.

Best,
-- Larry

From: Anupam Das [anupam009 SpamElide]
Sent: Tuesday, March 08, 2011 12:10 PM
To: Gupta, Indranil
Subject: 525 review 03/08

I. Bigtable: A Distributed Storage System for Structured Data

Bigtable is a distributed storage system that supports large-scale structured data. In many ways Bigtable resembles a database, but it does not support a full relational data model. Bigtable can be thought of as a multidimensional sorted map which is indexed by a row key, a column key, and a timestamp. Row keys are arbitrary strings (e.g., URLs), and each read/write operation at the row level is atomic. Rows in Bigtable are lexicographically sorted, and row ranges (called tablets) are the basic blocks of data distribution. Column keys are grouped together into column families, which form the basis of access control. A column family can contain multiple columns, and the columns within a family are accessed through a qualifier. Each cell in Bigtable may contain multiple versions of the same data, distinguished by timestamps. Bigtable uses the Google File System (GFS) to store logs and data files. It also uses a distributed lock service called Chubby to ensure consistent replication. Bigtable's implementation has three major components: a master server, tablet servers, and a library linked into every client. The master manages the tablet servers in various respects: assignment of tablets, failure detection, and load balancing. Bigtable also provides various refinements to improve the performance of its service.
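To make the data model in the summary above concrete, here is a minimal sketch of the (row, column family:qualifier, timestamp) -> value map; the structure and names are hypothetical and only illustrate the model, not Bigtable's actual API. The example cells follow the webtable example from the paper.

    # Sketch of Bigtable's logical model: a sparse map keyed by
    # (row, "family:qualifier") with multiple timestamped versions per cell.
    import time

    class MiniTable:
        def __init__(self):
            self.cells = {}   # (row, column) -> {timestamp: value}

        def put(self, row, column, value, ts=None):
            ts = ts if ts is not None else time.time()
            self.cells.setdefault((row, column), {})[ts] = value

        def get(self, row, column):
            versions = self.cells.get((row, column), {})
            return versions[max(versions)] if versions else None   # newest version

        def scan(self, start_row, end_row):
            # Rows are kept in lexicographic order, so a short row-range scan
            # touches a contiguous run of keys (a few tablets in the real system).
            for row, column in sorted(self.cells):
                if start_row <= row < end_row:
                    yield row, column, self.get(row, column)

    t = MiniTable()
    t.put("com.cnn.www", "anchor:cnnsi.com", "CNN")
    t.put("com.cnn.www", "contents:", "<html>...</html>")
    print(list(t.scan("com.cnn", "com.cnn.wwx")))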
Pros:
1. Supports large-scale (Internet-level) structured data.
2. Unlike an RDBMS, it can be implemented using low-cost commodity hardware.

Cons:
1. The performance of Bigtable may be hampered if other underlying infrastructures are used instead of GFS and Chubby (all Google products), as it heavily depends on the optimizations made by these infrastructures. So it might not be reproducible using third-party infrastructure.
2. The paper does not discuss how load balancing is done, i.e., on what metric does it decide to rebalance?
3. In the experiment section they do not compare with other systems. It would have been interesting to see how a relational database performs under the same scenario.
4. Data integrity across different rows is hard to implement in Bigtable.
5. Many implementation details are not discussed in the paper. Questions like "How are Bloom filters updated after rows are deleted?" and "Are Bloom filters periodically rebuilt?" are not addressed.

---------------------------------------------------------------------------------------------------------------------

II. Finding a Needle in Haystack: Facebook's Photo Storage

This paper describes Haystack, an object storage system optimized for Facebook's photo application. Facebook came up with this new storage system because the traditional NFS setup failed to meet their strict timing requirements. Facebook stores billions of photos, and the main characteristics of these photos are: written once, read often, never modified, and rarely deleted. Saving each photo as a single file in NFS means that not all the metadata about these files can be kept in main memory, and as a result multiple disk accesses are required to fetch a single file. While NFS performs well in serving popular requests, as it caches the most recent/popular files, this is not the case with Facebook, where requests for less popular (older) files are frequently generated. So the main objective of Haystack is to minimize the number of input/output operations needed to retrieve a file. To do this, the Haystack system keeps all the metadata in main memory by drastically reducing the memory used for filesystem metadata. Haystack uses a simple and straightforward approach of merging millions of photos into a single file. Each photo stored in the file is called a needle, and the system keeps a record of the offset and size of each needle in the file.

Pros:
1. High throughput and low latency are achieved by keeping all metadata in main memory and minimizing the disk accesses per read and write.
2. Achieves fault tolerance by replicating each photo in geographically distinct locations.
3. Cost-effective compared to NFS in terms of cost per petabyte of usable storage.
4. The design technique is simple and straightforward, enabling them to develop and deploy the system within a few months.

Cons:
1. The design is suited to a particular application and is not generic.
2. The paper does not discuss in detail how replication management is done. How is replication managed during deletion or compaction? (See the sketch below.)
3. Are there multiple Haystack Directories in the system, or only one? What if the Haystack Directory crashes?
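Con 2 above asks how deletion interacts with replication and compaction. The compaction step itself is straightforward to picture: copy live needles into a fresh volume file, skipping deleted (and duplicate) ones, and rebuild the in-memory index with the new offsets. The sketch below shows only that local step, with hypothetical structures; how the replicas coordinate it is exactly what the review says is left unclear.

    # Sketch of Haystack-style compaction on one volume (hypothetical types).
    from dataclasses import dataclass

    @dataclass
    class Needle:
        key: int
        alt_key: int
        size: int
        deleted: bool = False
        data: bytes = b""

    def compact(needles):
        """Return (live needles, rebuilt index) with deleted entries dropped."""
        live, index, offset = [], {}, 0
        for n in needles:                     # needles in original append order
            if n.deleted:
                continue                      # space for deleted photos is reclaimed
            live.append(n)
            index[(n.key, n.alt_key)] = (offset, n.size)
            offset += n.size
        return live, index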
------Anupam

From: mdford2 SpamElide
Sent: Tuesday, March 08, 2011 12:02 PM
To: Gupta, Indranil
Subject: 525 review 03/08

Finding a needle in Haystack: Facebook's photo storage

Facebook previously used NAS behind a CDN to service picture requests from users. CDNs offer caching, but due to the "long-tail" usage pattern that is typical in social networks, the caching was insufficient, and looking up metadata required multiple disk accesses. Facebook built a custom system that packs multiple images into a single file, reducing the number of inode reads required. Moreover, the system minimizes the metadata required to perform the lookup, allowing all metadata to be stored in main memory. This fast indexing scheme allows roughly four times as many reads to be performed as on the NAS system.

This paper is fairly repetitive and can be summarized in two words: overhead reduction. The performance results are impressive, and are certainly a testament to the fact that off-the-shelf systems can be inefficient for specific workloads.

Bigtable: A Distributed Storage System for Structured Data

Google uses Bigtable for many solutions, with varying scalability, availability, and performance requirements. They achieve this by allowing the structure of the table, as well as some implementation parameters, to be customized. A Bigtable is indexed by a row key, a column key, and a timestamp. Writes to and reads from rows are performed atomically. Columns belong to specific groups called "families", which allow for data typing and access control. Timestamps can be generated by the table or at the application level. Bigtable is built on top of the Google File System, a cluster management system, the SSTable file format, and Chubby. Each table starts as a single tablet, and the number of tablets grows as necessary. Tablet locations are maintained in a three-level hierarchy. Individual tablet reads can be performed with a single disk access.

The choice of using locks at the row level is an interesting one. It seems as though this could hinder some applications that could work in parallel on various columns in a row. The authors indicate that most of the applications using Bigtable only require per-row accesses, but is this an effect of some candidate applications using other services?

From: nicholas.nj.jordan SpamElide on behalf of Nicholas Jordan [njordan3 SpamElide]
Sent: Tuesday, March 08, 2011 11:57 AM
To: Gupta, Indranil
Subject: 525 review 03/08

Nicholas Jordan
njordan3
3/8

Finding a needle in Haystack: Facebook's photo storage:

Facebook needed to design a custom file system because the NFS/NAS combination was not working for 1 billion page requests a second at peak time. The main disadvantages of the existing technology were that the metadata for files took up too much main memory and that it took multiple disk reads to serve a single file. Throwing $$ or more main memory at a bigger cache was not practical, so they created Haystack. They designed a log-structured, append-only architecture on top of the XFS file system.

Pros
- Uses caching for recently written photos, which lends itself well to their traffic.
- Metadata for a file went from 536B to 40B.

Cons
- N/A

Additional Comments: They store 4 different sizes (thumbnail, small, medium, large) of the same image, in 3 different locations. Storing it in 3 different places seems to eliminate the need to keep explicit redundant copies in the Haystack Store.
You can reconstruct the smaller sizes from the larger sizes. They say [Related Work: File system] that log-structured file systems have not reached their full potential; I'm curious in what way.

Bigtable: A Distributed Storage System for Structured Data:

A distributed storage system that uses in-memory indexes and GFS for storage. It's a NoSQL storage system. The main structural organization is string-encoded row and column keys. You can also put successive entries in the same (row, column); multiple copies are differentiated by a timestamp, assigned either by the application or by Bigtable. Also, all columns belong to a column family, which provides another layer of organization for the data. This paper explains the organization and implementation of Bigtable, presents test data, and then explains the real-world use of Bigtable in their products such as Google Earth and Analytics.

Pros
- The rows are sorted lexicographically. So "blog.cnn.com/index.html", stored as "com.cnn.blog/index.html", and "blog.cnn.com/about.html", stored as "com.cnn.blog/about.html", will be close by.

Cons
- To take advantage of the pro, you have to think about how to express the row string.

Additional comments: It's nice to know that KISS is still practical and necessary for both small and big projects. The use of Chubby and locks on a file to keep track of which tablet servers are alive seemed weird at first, but reading the "Lessons Learned" section answered my inquiry.

--
Thanks,
Nick Jordan

From: wzhou10 SpamElide
Sent: Tuesday, March 08, 2011 11:32 AM
To: Gupta, Indranil
Subject: CS525 Review 03/08

Review on "Finding a needle in Haystack: Facebook's photo storage"

Core idea
This paper presents Haystack, an object storage system specifically designed for Facebook's photo application. The motivation is the huge amount of photos stored at Facebook, plus the speed at which they keep arriving. Haystack provides a creative solution: a new photo infrastructure that merges the photo serving tier and storage tier (separate in the traditional NFS photo infrastructure) into one physical tier. Each stored photo has a certain amount of information (identity and location) associated with it, and this information is used to build an index file, one copy of which is kept in memory, speeding up the process of finding a photo and reducing the number of I/O operations to about 1/3 of what is typically required.

Pros
1. The idea is simple, and it works just for the photo application, but it works well. Lots of data is stored in a single file, reducing metadata overhead. Read operations work on the actual data directly, instead of going through filesystem metadata first. The number of I/O operations is just 1/3 of what is typically required, saving hardware resources and monetary cost as well.
2. Haystack makes Facebook less dependent on CDNs, which are costly.
3. Haystack adopts some important optimizations. For instance, it effectively saves memory: 40 bytes/photo compared with the 536 bytes/photo of xfs_inode_t (see the sketch below).
4. In the evaluation section, the Store machines are deployed on commodity storage blades. The typical hardware configuration of a 2U storage blade is: 2 x quad-core CPUs, 48 GB memory, a hardware RAID controller with 256MB – 512MB of NVRAM cache, and 12 x 1TB SATA drives.

Cons
This design is specialized for Facebook's photo application: uploads are frequent, and once uploaded, objects will probably not be modified. This makes Haystack, or the design idea behind Haystack, of limited use in other environments.
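The memory saving in pro 3 is easy to put in perspective with a quick back-of-the-envelope calculation. The per-photo figures below are the ones cited above; the photo count is an assumed, illustrative number, not a figure from the paper.

    # Rough arithmetic: per-photo metadata kept in RAM, inode-based vs Haystack.
    PHOTOS = 10_000_000_000             # assumption: 10 billion photos
    INODE_BYTES_PER_PHOTO = 536         # xfs_inode_t, one inode per photo
    HAYSTACK_BYTES_PER_PHOTO = 40       # in-memory needle metadata

    def gib(n):
        return n / 2**30

    print(f"inode-per-photo metadata: {gib(PHOTOS * INODE_BYTES_PER_PHOTO):,.0f} GiB")
    print(f"Haystack metadata:        {gib(PHOTOS * HAYSTACK_BYTES_PER_PHOTO):,.0f} GiB")
    # Roughly 4,992 GiB versus 373 GiB: only the latter fits comfortably in
    # main memory spread across a modest number of Store machines.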
Review on "A Distributed Storage System for Structured Data"

Core idea
Bigtable is a distributed storage system for managing structured data, designed to scale to petabytes of data and thousands of servers. Bigtable is able to support applications with various demands. A Bigtable is a sparse, distributed, persistent, multi-dimensional sorted map, indexed by a row key, a column key, and a timestamp. The row ranges are divided into partitions called tablets. Tablets are distributed across servers for distribution and load balancing. In this way, users get efficient I/O and the network traffic stays small, as long as a request for a group of rows falls in one or a very small number of tablets.

Pros
1. Bigtable chose a very simple data model, and gives users control over the detailed format of the data according to their unique demands.
2. The lightly loaded master server design is desirable. Clients' data doesn't move through the master, and most clients never communicate with the master, since Bigtable clients do not rely on the master for tablet location information.
3. Bigtable provides a good locality property: reads of short row ranges are efficient and typically require communication with only a small number of machines.

Cons
1. There's no detailed discussion of security issues, which I think are important for a public service.
2. Once an SSTable is written, it's never changed; if the data in the tablet changes, a new SSTable is generated. This is effective because SSTables are stored in GFS, which is not optimized for small writes. I wonder how much this would cost in a situation where small writes happen frequently or where a different storage system is in use.

Thanks.
~Wenxuan

From: Nipun Sehrawat [sehrawa2 SpamElide]
Sent: Tuesday, March 08, 2011 11:25 AM
To: Gupta, Indranil
Subject: 525 review 03/08

Finding a needle in Haystack: Facebook's photo storage

This paper presents Facebook's photo storage system, characterized by data being written once, rarely deleted, and read very often. It presents a rather unusual finding: disk I/O for reading file metadata proved to be the bottleneck in the photo store's read throughput. The paper first presents some of the drawbacks of the previously used NFS-based design, which involved storing each photo as a separate file, thus exploding the amount of metadata maintained and the disk I/O per single image retrieval, among other disadvantages. Haystack strives to achieve four goals: high throughput and low latency, fault tolerance, cost-effectiveness, and simplicity of design. Facebook depends on a CDN to serve 'hot' images and on Haystack to handle the long tail of less popular requests.

To reduce file metadata, Haystack stores a large number of photos together in a single file, and each photo is identified by an offset within this large file. Haystack comprises three components: the Haystack Directory, the Haystack Cache, and the Haystack Store. The Haystack Directory is responsible for providing the physical location of an image and for load-balancing both read and write traffic. The Haystack Cache is a DHT and uses photo ids as hash keys. One key Facebook-specific optimization made here is that a photo is cached only if it was recently uploaded. The Haystack Store builds on the key idea of accessing a photo using the volume's logical id and the offset of the photo, thus reducing the metadata per photo to 10 bytes, which makes it possible to store the metadata in memory, providing a single disk access per photograph. A physical volume in the Haystack Store corresponds to a single large file and contains a superblock followed by a sequence of needles, where each needle represents a photo.
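The write side of that layout is simple to sketch: a Store machine appends the new needle to the volume file, records its offset in the in-memory index, and periodically checkpoints that index to an index file so recovery does not have to scan the whole volume. The names and structures below are hypothetical; needle header/footer fields and replication across machines are omitted.

    # Sketch of a Haystack-style Store write (hypothetical, simplified).
    import json, os

    class Volume:
        def __init__(self, path):
            self.path = path
            self.index = {}                      # "key:alt_key" -> (offset, size)

        def append_photo(self, key, alt_key, data):
            with open(self.path, "ab") as f:
                offset = f.tell()
                f.write(data)
                f.flush()
                os.fsync(f.fileno())             # durable before acknowledging the upload
            self.index[f"{key}:{alt_key}"] = (offset, len(data))
            return offset

        def checkpoint_index(self):
            # Analogous to the paper's index file: written asynchronously, so it
            # may lag slightly behind the volume file after a crash.
            with open(self.path + ".index", "w") as f:
                json.dump(self.index, f)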
Pros:
- Very simple design of reducing file metadata, which was proving to be a bottleneck in the earlier NFS-based implementation.

Cons:
- Novel ideas seem to be present only in the Haystack Store's single big file containing multiple photographs. Most of the other optimizations/decisions are specific to Facebook's traffic; I wonder how well they would fit other photo-sharing sites.

---

Bigtable: A Distributed Storage System for Structured Data

Bigtable is a distributed storage system for structured data (here, a sorted map), built on top of GFS. It is designed with the primary goal of scaling out on commodity machines as the dataset grows. Bigtable doesn't provide a full relational data model; rather, it is indexed by row, column, and timestamp (thus making it a 3-D structure). Bigtable guarantees atomic reads and writes to a particular row; however, multi-row transaction support is not provided. Bigtable rows are partitioned into 'tablets', much along the same lines as database sharding. Bigtable supports garbage collection of the multiple timestamped versions accumulated under a (row, column) pair, with either "last n versions only" or "versions newer than time t only" semantics.

Bigtable is built on top of GFS and uses the SSTable file format to store the data. Data can be accessed in a single disk I/O by keeping the index in memory. Bigtable consists of three components: one master server, multiple tablet servers, and a client-side library. The master is lightly loaded and is mainly responsible for tablet-server allocation, load balancing, and garbage collection. Each tablet server stores multiple tablets and is responsible for splitting a tablet when it grows large. Clients communicate directly with tablet servers with the help of a client-side library, which caches tablet locations. The tablet to tablet-server mapping is determined by parsing a page-table-like hierarchical structure.

Pros:
- Simple design and 3-D storage well suited for various applications in Google, such as maintaining snapshot caches of webpages.
- Scales out on commodity hardware and is robust enough to handle many types of failures typically not addressed/assumed by many distributed protocols.

Cons:
- Support for multi-row transactions is missing, though most applications don't need it (Percolator does).

Haystack vs BigTable (for storing images):
- One major concern for Facebook was the enormous metadata associated with all the images collectively. Bigtable needs only 128MB of metadata to locate any tablet; the metadata maintained by the SSTable files is not indicated. I wonder if BigTable would be able to achieve a single I/O access to images, as Haystack does (though Haystack is highly specialized for storing images compared to BigTable, which is more generic and can cater to a wide range of applications).
This is remedied by carefully reducing per-photo metadata so that all metadata fits in main memory. Haystack was designed with four main goals in mind: high throughput and low latency, fault tolerance, cost-effectiveness, and simple design. For social networking sites, CDNs are effective only for popular photos; the long tail of less popular photos is usually missed by CDNs. As a result, serving the long tail is a problem unique to social network storage that is not dealt with by traditional file systems. Currently, Facebook uses CDNs to serve popular images and utilizes Haystack for the long tail. Haystack uses a very simple approach of storing multiple photos in a single file to reduce metadata.

The Haystack architecture consists of a Store, a Directory, and a Cache. The Store encapsulates the persistent storage. The Directory maintains logical-to-physical mappings, load-balances writes, decides between using the Cache or a CDN for a request, and identifies read-only logical volumes. The Cache saves a photo only if the request comes directly from a user and the photo is fetched from a write-enabled Store. The first rule comes from an experimental characterization of the ineffectiveness of post-CDN caching. The second comes from the fact that photos are accessed most frequently immediately after being uploaded, and the file system works better with only reads or only writes, but not both at the same time.

Pros: (1) Very simple design. (2) Incrementally scalable and fault-tolerant system. (3) Characterization of typical workloads at Facebook.

Cons: (1) Haystack disallows overwriting needles, which can result in duplicate storage overhead. (2) Haystack is heavily optimized for a specific workload. In the future, if something else takes over from photo storage as the most popular social networking activity, then Haystack may not be suited for that new workload pattern.

Future Work: (1) Can Haystack be adapted to work well with other multimedia objects, e.g., video storage? (2) A mathematical analysis of the optimal trade-off between using caches and CDNs would be interesting.

The Google File System

Summary: The astronomical scale of Google's operations (e.g., storing multiple copies of the entire web) demands a highly reliable, scalable, distributed file system. These requirements motivated the design of the Google File System (GFS). GFS was designed to optimize the performance of typical Google workloads and also future application needs. Some design decisions that GFS takes into account include: built-in fault tolerance and auto-recovery, re-examination of standard I/O assumptions, the prevalence of record appends, and co-design of applications and the file system.

In the GFS architecture, a master process manages metadata, and a set of chunkservers at a lower layer store data in units called chunks. Chunks are 64 MB in size and stored as files. They are replicated across multiple chunkservers for reliability. The master and the chunkservers regularly communicate to exchange state information. Clients only communicate with the master for metadata, whereas the actual reads/writes take place between clients and chunkservers. The GFS designers argue that a single master is not a bottleneck since its involvement in reads/writes is minimized. For reads, the client converts the byte range to a chunk request and sends it to the master along with the file name; the master then returns the chunk handle and replica locations. For writes, the master returns the primary and secondary replica locations. The write command is only sent to the primary, which determines the serial order applied by the secondary replicas. Record append is another important operation for Google, where a file is used as a producer-consumer queue; this operation is similarly optimized in GFS. Fault tolerance is achieved through chunk replication, checksums, and fast state recovery.
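The write ordering just described is worth a small sketch: the primary replica assigns a serial number to each mutation, and every replica applies mutations in that order, so all replicas of a chunk converge on the same state. The classes below are hypothetical and ignore data push, leases, and retries; they only illustrate the serialization step.

    # Sketch of GFS-style write serialization through the primary replica.
    class Replica:
        def __init__(self):
            self.mutations = []                     # applied in serial order

        def apply(self, serial, mutation):
            self.mutations.append((serial, mutation))

    class Primary(Replica):
        def __init__(self, secondaries):
            super().__init__()
            self.secondaries = secondaries
            self.next_serial = 0

        def write(self, mutation):
            serial = self.next_serial               # the primary decides the order
            self.next_serial += 1
            self.apply(serial, mutation)
            for s in self.secondaries:              # forward the order, not the data
                s.apply(serial, mutation)
            return serial

    secondaries = [Replica(), Replica()]
    primary = Primary(secondaries)
    primary.write(b"append: record 1")
    primary.write(b"append: record 2")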
Discussion Points: (1) Even though interaction with the master is minimized to reduce bottlenecks, it could still be a single point of failure. (2) Although soft state has performance gains, it might not be sufficient for sensitive data operations, e.g., in Google Finance.

--
Muntasir

From: Simon Krueger [skruege2 SpamElide]
Sent: Tuesday, March 08, 2011 10:52 AM
To: Gupta, Indranil
Subject: 525 review 03/08

Finding a needle in Haystack: Facebook's photo storage, Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel

The core idea of this paper is to implement a storage service for Facebook's photos that scales to their workload. In particular, their old storage service relied heavily on content delivery networks (CDNs) and network-attached storage (NAS) appliances mounted over NFS. Their old service was bottlenecked by the metadata lookups on their files, so they implemented a new system called Haystack to achieve higher throughput, lower latency, and lower cost by storing file metadata in memory and using a log-based data store. Haystack is composed of three parts: a directory, a cache, and a store. The directory is responsible for mapping logical volumes to physical volumes of the store. The cache caches information from the store and can be used with or without CDNs, which lowers their dependence on CDNs. The store keeps 100 GB files on XFS, where each photo is stored in a sequential log structure within the file. Since it only takes 40 bytes of metadata for the four scaled photos of the same image, the index into these 100 GB files can be kept in memory, removing the disk overhead of metadata lookups.

Pros of the approach:
* Removed dependence on CDNs by implementing their own caching service
* Removed disk overhead for file metadata lookups
* Able to handle the long tail of photo requests
* Very simple design
* Handles batch uploads of albums

Cons of the approach:
* Deleted files have to go through the compaction process to be removed.

As mentioned for previous papers (FAWN and RAID), I wonder how well this type of storage system would work on SSDs, since the log structure assumes sequential reads and writes and the performance characteristics would probably be different for SSDs. How energy efficient is their solution? What other applications would this type of system be useful for?

Bigtable: A Distributed Storage System for Structured Data, Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber

The core idea of this paper is to implement a distributed key-value store that runs on commodity hardware. The store uses a multidimensional key consisting of a row index, a column index, and a timestamp index. Bigtable is built on top of the Google File System. Similar to the Google File System, there is a master server and a set of data storage servers called tablet servers. The master is responsible for managing tablets on tablet servers, while each tablet server manages a set of tablets. Clients talk directly to the tablet servers for reads and writes.
It seems that this is really just a wrapper around GFS to provide a new, higher-level abstraction. Additionally, it seems to share similar design characteristics with GFS, which could be good or bad depending on how you look at it.

The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

The core idea of the paper: The core idea of this paper is to present a scalable distributed file system that is fault tolerant and runs on commodity machines. Multi-GB files are common, and files are commonly only read, often sequentially; most new data is appended. GFS's API supports create, delete, open, close, read, and write on files, and files are organized in a hierarchical directory structure. GFS is designed to have a single master server that manages the file metadata, garbage collection, how files map to chunks, and other management duties. Additionally, there are multiple chunkservers that hold chunks, which are fixed-size partitions of the files in the system and are stored as regular files on the local chunkserver. Chunks are typically replicated on three chunkservers. Clients first contact the master server to find the chunkserver that holds the data, and then, for a limited time, do the rest of the communication directly with that chunkserver. Each chunk is 64 MB, which is advantageous because it limits the number of requests a client needs to make to the master (there is enough space to do many reads and writes within one chunk) and it reduces the size of the metadata that needs to be stored on the master server. Finally, GFS has a relaxed consistency model where concurrent successful writes may be consistent but undefined, while failed writes leave data inconsistent.

Pros of the approach:
* The design is simple
* The master server doesn't store chunk locations persistently; it asks each chunkserver about its chunks at master startup
* Metadata is stored in memory, so metadata operations are fast
* Chunk locations are periodically reported from the chunkservers to the master

Cons of the approach:
* The master is a centralized component
* The number of chunks that can be stored is limited by the amount of memory in the master

While I think the design is simple, I feel that they make some quick-and-dirty design choices (e.g., the somewhat centralized master server and their low level of consistency), but I guess it works well for them.

Simon Krueger

From: Andrew Harris [harris78 SpamElide]
Sent: Tuesday, March 08, 2011 10:38 AM
To: Gupta, Indranil
Subject: 525 review 03/08

Review of "Finding a needle in Haystack: Facebook's photo storage", Beaver et al, and "Bigtable: A Distributed Storage System for Structured Data", Chang et al

Facebook's photo storage and retrieval system, Haystack, is a stripped-down caching system for finding kilobyte-sized file objects at the petabyte scale. The researchers developed this system as a response to their previous NFS-based file system, which was (comparably) inefficient, requiring multiple disk reads and network crosstalk in order to find a particular file. By contrast, Haystack requires a single read, and uses the file names themselves to direct this reading. Haystack also implements a redundant array of machines, allowing for fault tolerance at machine-level granularity. Finally, Haystack is intentionally simplified in a number of areas (e.g., physical-to-logical volume mapping), which has allowed Facebook to roll out the system onto their production servers with a very quick turnaround time and minimal testing.
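The "single read" claim is easy to picture: everything needed to find the photo on disk (volume, key, alternate key) comes out of the URL, and the Store resolves it against an in-memory index, leaving exactly one seek-and-read on disk. The sketch below is hypothetical and omits cookies, error handling, and replication.

    # Sketch of a Haystack-style read: in-memory index lookup, one disk read.
    index = {
        # (logical volume id, photo key, alternate key) -> (offset, size)
        (42, 1001, 0): (0, 48_213),
        (42, 1002, 0): (48_213, 61_090),
    }

    def read_photo(volume_files, volume_id, key, alt_key):
        offset, size = index[(volume_id, key, alt_key)]   # memory only, no disk I/O
        f = volume_files[volume_id]                       # already-open volume file
        f.seek(offset)
        return f.read(size)                               # the single disk operation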
Haystack, while efficient for its scale, does not seem to be designed with multiple data centers in mind. Part of its implementation, the Cache, seems to function best when deployed physically close to the Store servers. Despite being a distributed hash table, if the Cache were not implemented in proximity to the Store, network latency across multiple datacenter sites would become a new bottleneck. A second limitation of Haystack seems to come in its file descriptor implementation. Suppose that Facebook decided to move its content storage to some new format. The current file descriptor format is based on the logical arrangement of Haystack servers, rather than on metadata stored about the file (such as the associated user account, date of creation, etc.). Were Facebook to move its data storage, the entirety of Haystack would need to be left active and in place until the end of replication, to ensure that metadata-equivalent information related to each file remains accessible. In other words, Haystack seems to assume in its implementation that it will be the end-all for distributed data storage, despite the researchers taking steps to future-proof it.

Google's BigTable is a system designed to store structured data on top of the Google File System, in contrast to Facebook's more generic object storage across XFS. The use case envisioned for BigTable in implementation was web crawl information, although it has been adapted to a number of Google products (Analytics and Earth are mentioned, among others). Information in BigTable, at the raw level, is stored as strings (which may be serialized to other formats). Metadata is stored in cascading tablets, a sort of multilayered cache system, which forms the core functionality of BigTable. Fault tolerance and scalability were also design goals for BigTable, and read-write access management through Chubby was also a consideration (a difference from Haystack, where read-write access was entirely mechanical).

BigTable has the reverse problem from Haystack, in that it ends up making multiple metadata parses in order to find single pieces of information. Luckily, with their particular implementation of metadata storage, only a small number of metadata rounds are required to find a certain piece of information. This also may be unavoidable, as BigTable is designed to generalize to arbitrary strings rather than single objects, and so must store information on those strings. In other words, the structured nature of BigTable seems to disallow features such as Haystack's logical server layout. BigTable is indeed scalable, though it too suffers from out-of-datacenter latency issues; this could be partially resolved by collocating applications' information beforehand, and only allowing individual applications' data to be stored at single physical locations.

A drawback for both systems seems to come in open-ended searches for information. Consider a partial string matching search; in Haystack this might be part of a Facebook user's name, and in BigTable this might be a phrase across a number of websites. Haystack has no good facility for such a search, and relies almost entirely on knowing everything about a data object before it is able to fetch it. In other words, Haystack is not suited for anything but the most exact data lookups. BigTable, on the other hand, could parse through its swaths of metadata; however, it may need to do so across many machines.
This introduces a natural parallelization of the search task, though all servers would need to be involved to give a truly exhaustive search. The choice of one of these systems over the other seems to rely most heavily on the use case. Facebook's Haystack is suitable only for instances where all information is known about an object, and where only single data objects are being stored per descriptor. BigTable is generalizable to just about every other remaining type of storage task! BigTable also relies on GFS and Chubby for a fair amount of its redundancy and integrity, so that too would need to be considered; if Haystack's simple implementation over XFS could be used, it would be preferred from a complexity-of-implementation standpoint to setting up an entire GFS cluster.

From: Agarwal, Rachit [rachit.ee SpamElide] on behalf of Rachit Agarwal [agarwa16 SpamElide]
Sent: Tuesday, March 08, 2011 10:07 AM
To: Gupta, Indranil
Subject: 525 review 03/08

----

Finding a needle in Haystack: Facebook's photo storage

The paper describes an object storage system that is optimized for a particular application (photo applications) and a specific class of workloads: write-once, read-often, never-modified, and rarely deleted data. The main observation that allows extreme optimization in Haystack is that the excessive number of disk operations caused by metadata lookups can be avoided at the expense of extra "main memory" for metadata storage.

For systems that are highly optimized for specific applications, a natural question is whether the performance of the system will significantly degrade under even slightly modified applications/workloads. Regarding the application and architecture, the following questions come to my mind:

1. The photos in Facebook *are* modified, maybe rarely though. For example, what about photo tagging? I believe photo comments are another possibility of *metadata* modification.
2. The authors argue that the directory determines whether a photo request should be handled by the CDN or by the cache. How does the directory do that? Doesn't it need to keep photo access history in order to determine this?
3. Since Haystack combines multiple photos in a single file, it has to keep track of the file offset that points to a particular photo in the file. Doesn't that increase the metadata size significantly?
4. It is not clear why the final Haystack architecture requires *at most* one disk operation, as claimed by the authors in the introduction.
5. It would have been really interesting to see comparison results for the NFS-based implementation and Haystack in terms of memory requirements, latency, and number of disk operations.

----

The Google File System

The paper presents another file system architecture designed for specific workloads: a modest number of write-once files and a lot of append-only, streaming data files. The system is highly optimized for the specific workloads that appear at Google. As may be usual in industry, the system falls into a "quick and dirty" design rather than a stable, neat, well-traded-off design. It is interesting that many distributed system aspects that we, in academia, consider really important can be completely ignored by industry. In particular, it is interesting that the authors make the following design decisions:

1. A single master node -- a single point of failure. Googlers are the real masters and of course, they scale :)
2. The architecture is more tailored towards high-bandwidth applications; it is not clear that Google does not care about latency. In fact, I would say that Google does care a lot about latency, but maybe they leave it to algorithmic/data-structure innovations rather than the data-access path.
3. One surprising design decision is that GFS decides to push most of the consistency checks to the application -- it is not clear if this approach, in general, is a good idea for file systems. In particular, how much responsibility should be pushed to the application? Maybe the answer depends completely on the application/workload, etc.
4. In terms of design decisions, it is not clear why the authors decided on a hierarchical directory structure given that they decided to go with a flat namespace.
5. It would have been interesting to see quantification of several aspects of the design decisions made. In particular, how much replication of data objects written to a file chunk occurs on average? How much fragmentation does the file system incur?

----

--
Siebel Center for Computer Science, University of Illinois at Urbana-Champaign, Web: http://agarwa16.wikidot.com

----
Sing as if No one is Listening, Dance as if No one is Watching, Dream as if You'll Live Forever, Live as if You'll Die Today !!

From: Shen LI [geminialex007 SpamElide]
Sent: Tuesday, March 08, 2011 9:07 AM
To: Gupta, Indranil
Subject: 525 review 03/08

Name: Shen Li

Bigtable: A Distributed Storage System for Structured Data

This paper proposes an application-friendly storage system built on top of the Google File System and the Chubby lock service. Bigtable itself is three-dimensional: each data cell is indexed by a (row, column, timestamp) triple. The table is sorted based on the lexicographic order of rows. Tables are partitioned into tablets, which are the basic unit for load balancing, fault tolerance, and other operations. The first time a client accesses Bigtable, it contacts the Chubby server to get the location of the root tablet, then gets the metadata tablet by using the location information in the root tablet (see the sketch below). The metadata tablets contain the location information of the actual tablets stored on servers. The master node is responsible for performing load balancing and fault recovery across the whole system. Each server acquires a lock on the Chubby server for each tablet stored on it, and each lock has a life period that has to be refreshed by the server. So the master node can simply monitor the related directory on the Chubby server to get a full view of the whole system. When one server fails, the master node will communicate with the Google File System to get the location information of the replicas stored by that server and restart those tablets elsewhere.

Pros:
1. The data structure provided by Bigtable can support a large range of applications, e.g., Google Earth, web crawls, Percolator, and so on. So it is a good middleware layer that compensates for what the Google File System lacks.
2. It allows users to tune the table to some degree. For example, users can indicate how many versions with different timestamps they want to keep. This kind of small flexibility can sometimes achieve large benefits.

Cons:
1. Given that the master node is responsible for tablet assignment, detecting expirations, load balancing, garbage collection, and column family creation, it does not seem to be a "light" job. And there is only one such node in this system. It can be a bottleneck for these operations.
2. Even though the Chubby cell usually contains five nodes, according to their design there is only one master node which is responsible for all operations. The Chubby server will be accessed not only by the nodes inside the data center, but also by users in the wide area. Will that be a bottleneck?
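The location chain in the summary above (Chubby file, root tablet, METADATA tablet, then the tablet server itself) can be sketched in a few lines. Everything below is hypothetical and heavily simplified; the real client library caches and prefetches these locations so the full walk is rarely repeated.

    # Sketch of Bigtable's three-level tablet location lookup.
    chubby = {"/bigtable/root-location": "root-tablet@ts-1"}       # step 1: Chubby
    root_tablet = {"end:zzzz": "metadata-tablet-0@ts-2"}            # step 2: root -> METADATA tablet
    metadata_tablet = {
        # (table, end row of tablet) -> tablet server holding that tablet
        ("webtable", "com.cnn.www"): "ts-7",
        ("webtable", "com.example"): "ts-9",
    }

    def locate(table, row):
        _root = chubby["/bigtable/root-location"]
        _meta = root_tablet["end:zzzz"]                             # single METADATA tablet here
        for (t, end_row), server in sorted(metadata_tablet.items()):
            if t == table and row <= end_row:                       # step 3: find the covering tablet
                return server
        raise KeyError((table, row))

    print(locate("webtable", "com.cnn.www"))   # -> ts-7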
According to their design, there is only one master node which is responsible for all operations. The Chubby server will be accessed not only by the nodes inside the data center, but also by users in the wide area. Will that be a bottleneck?

The Google File System

This paper proposes the Google File System, which is a scalable distributed file system for large distributed data-intensive applications. There are two different kinds of nodes inside the system, the master node and chunkservers. The master node stores the metadata of all chunks, and the chunkservers are responsible for storing the actual data of the chunks. When a client tries to get data from the system, the client first needs to communicate with the master node to get the location information of the chunks it needs. Then, all subsequent communication happens between the client and the chunkservers directly.

Pros:
1. The Google File System is well tuned for applications at Google. For example, it uses very large chunks (64MB).
2. The single global point design is good for failure detection and consistency maintenance.

Cons:
1. The Google File System does not provide any data format designs. All the data in the system is stored as plain Linux files. This eases development. However, it may not be good for optimizations such as prefetching prediction. For some applications based on objects, the data access pattern is highly related to the relationships among objects. If they took the data format into consideration, the prefetching policies could be very efficient.
2. The single master node design can be a bottleneck for the whole system.

From: harshitha.menon SpamElide on behalf of Harshitha Menon [gplkrsh2 SpamElide] Sent: Tuesday, March 08, 2011 2:29 AM To: Gupta, Indranil Subject: 525 review 03/08

GFS

GFS is a scalable file system for large distributed data-intensive applications. It provides fault tolerance and high throughput. Constant monitoring, error detection, fault tolerance and automatic recovery are integral parts of the system. Files are divided into fixed-size chunks of 64 MB. These chunks are stored at chunkservers. The master keeps track of the locations of all the chunkservers that store a given chunk. The default replication factor is three. The client contacts the master to read a particular chunk and the master gives back a handle to the chunkservers, including the replicas. The client reads the data directly from the chunkservers. The master periodically exchanges heartbeats with the chunkservers to get information about the chunks stored.

Pros:
-Clients interact with the master for metadata operations but all data-bearing communication goes directly to the chunkservers. This reduces the dependency on the master. Also, the chunkserver has the latest information about a chunk.
-The client caches the metadata information regarding a chunk, thereby reducing the communication overhead with the master.
-It is optimized for the use case found at Google where large chunks of files are read.
-The operation log stores all the mutations carried out on a chunk. The master also does checkpointing.
-Detects corruption of data by checksums.
-Uses a hierarchical locking mechanism whereby a read lock is acquired on the directory and a write lock on individual files.
-Control flows from client to primary but data flow can be optimized based on network topology.

Cons and Suggestions:
-The master maintains all the metadata in memory, which would limit the number of chunks. So even by increasing the number of chunkservers, we can't scale it beyond a threshold.
An alternative would be to use distributed storage to store the metadata, thereby making it very scalable.
-It's catered towards multi-GB files, so performance for small files is not great.
-Garbage collection is delayed and the files are not completely deleted until a few days later. Users sometimes have problems due to this, as they can't reuse the space immediately.
-Data flow could be modified to read different portions of chunks simultaneously from various replicas.
-They use checksums to detect corruption of files and, in such cases, read copies of chunks from replicas. Instead of this, we could use other techniques to detect and correct errors, thus avoiding data transfer.

Bigtable

Bigtable is a distributed storage system for managing structured data. It is designed to scale to petabytes of data across thousands of machines. Bigtable is a sparse, distributed, persistent, multidimensional sorted map. The map is indexed by a row key, a column key and a timestamp. Bigtable uses GFS to store log and data files. Bigtable data is stored as SSTables. The Bigtable implementation has three components: a client library, a master and many tablet servers. A tablet server is responsible for a number of tablets. A Bigtable cluster stores a number of tables; each table consists of a set of tablets and each tablet contains all data associated with a row range. Bigtable uses the Chubby lock service for various tasks like ensuring there is at most one master, discovering tablet servers, etc.

Pros:
-Client requests go directly to the tablet servers, thereby reducing interaction with the master.
-Tablet servers can be dynamically added or removed from a cluster to accommodate workloads.
-The client library caches tablet locations.
-Recently committed updates are stored in a memtable. When the memtable grows too large, it is written out as an SSTable. This reduces memory usage and also serves as checkpointing.
-Bigtable carries out major compactions in the background to reclaim resources used by deleted data.
-Bigtable tablet servers use two levels of caching. The scan cache caches key-value pairs returned by the SSTable interface. The block cache caches SSTable blocks that were read from GFS.

Cons and Suggestions:
-It's not a relational database.
-Bigtable does not support general transactions across row keys.
-It has a central master which can become a bottleneck.
-When a tablet server fails, all its tablets have to be migrated to other tablet servers. These tablet servers will have to read the log file and redo the mutations. All the new tablet servers reading this log file might cause performance hiccups. How about storing these log entries in the memory of some other tablet servers? When a tablet server goes down, we could restore the tablets from the in-memory logs, thereby reducing the overhead of reading from GFS.

Thank you -Harshitha (gplkrsh2)

From: lewis.tseng.taiwan.uiuc SpamElide on behalf of Lewis Tseng [ltseng3 SpamElide] Sent: Tuesday, March 08, 2011 1:39 AM To: indy SpamElide Subject: 525 review 03/08 CS 525 - Review: 03/08 Storage in Industry

Finding a Needle in Haystack: Facebook's Photo Storage, D. Beaver et al, OSDI 2010

Photo storage at Facebook is very different from what we usually use to manage disk storage, so this paper proposes a new, simple and robust system called Haystack to deal with one of Facebook's key needs: storing a huge number of photos and retrieving them quickly. Photo storage in Facebook has the following characteristics: written once, read often, never modified and rarely deleted.
Moreover, much of the metadata in a traditional POSIX-based filesystem, such as permissions, is useless in photo storage. Haystack exploits these distinct features to build a simple system that has high throughput and low latency, and is fault-tolerant and cost-effective (in terms of cost per TB of usable storage and normalized read rate for each terabyte of usable storage). The paper has several contributions. First, the paper identifies that an NFS-based design together with content delivery networks (CDNs) could not handle photo storage very effectively and efficiently, because such a design usually needs more than one disk operation per read, which results in limited performance due to slow disk I/Os. Moreover, Facebook also needs to support a large number of retrievals of less popular photos, which are seldom found in CDNs, and this leads to heavy traffic and high latency (the long-tail problem). The second contribution of the paper is a straightforward but effective design to overcome the disadvantages of the traditional design. Haystack combines multiple photos in a single very large file and simplifies the metadata by explicitly designing two different types: application metadata for users to construct URLs and filesystem metadata for the host to retrieve from local disk. As a result, each photo only needs about 10 bytes of overhead on average, so Haystack can store most of this type of information in main memory to reduce disk reads and increase performance. Moreover, Haystack adds one extra layer of caching, called the Haystack Cache, to better handle the long-tail problem.

Comments/questions/critiques: The paper does not pay much attention to failures. However, at such a huge scale of data storage, failure should be considered the norm. The paper mentions that Haystack uses RAID and a background tool to diagnose failures and to mark identified failed machines as read-only. Afterwards, Facebook "manually" addresses the underlying cause offline. This does not seem to be a very promising approach if failure is the norm. In addition, RAID might not be suitable for a large distributed system. Finally, it is unclear why marking a failed machine as read-only is enough to handle the fault. Wouldn't it be possible for a compromised photo to be accessed by users or propagated to other good machines? Since Facebook is a very popular worldwide system, the performance would definitely be improved if locality and geographical issues were considered. I was surprised these were not discussed or even mentioned as future work in the paper, since read-write policy for disk storage seems to be highly related to these two issues. Perhaps Facebook just has multiple servers and datacenters in different continents or countries, but it would be interesting to see how Facebook deploys its service and storage across the globe if all these servers and datacenters could jointly act as one geo-distributed file system and perform some optimizations to take locality into account.

The Google File System, S. Ghemawat et al, SOSP 2003

The paper first identifies that traditional file systems were not capable of handling rapidly growing data processing needs. Then the paper proposes four design choices that must be considered for a new large-scale distributed file system. They are: component failures are the common case; files are huge, usually multi-GB; most file writes are appends and random writes are rare; and the system must be flexible and provide high sustained bandwidth (low latency is less important).
The paper proposes a centralized file system called the Google File System (GFS) that takes the abovementioned features into account. The first advantage of GFS is the effort of cleverly separating tasks between the master and the chunkservers, so that the master's workload is minimized yet it has enough information about the whole system to perform complicated chunk placement and replication decisions that improve performance. More precisely, only file system control is handled by the master, and the more time-consuming data transfer tasks are handled by the chunkservers. In this way, the centralized master does not become a bottleneck of GFS. Another distinct feature of GFS is its locking scheme. Because many master operations might still be time-consuming, the GFS master should support multiple active operations concurrently. This is done by disallowing aliases for the same file or directory and using locks to ensure mutual exclusion for critical sections, such as the regions concerning the namespace. The final and probably most important feature of GFS is the way it handles failure. By adopting a simple replication scheme and using fast online monitoring and repair mechanisms, GFS is able to support high availability and ensure data integrity even when many frequent faults happen.

Comments/questions/critiques: A locking scheme is used in the master and seems to be effective, but nothing about failures inside a critical section is mentioned. I am wondering whether a traditional scheme can be adapted or whether GFS needs some other novel way to handle it. The scheme of a "shadow" master (only providing read access while the main master is down) is used. Why can't the usual replication of masters be used?

From: david.m.lundgren SpamElide on behalf of David Lundgren [lundgre4 SpamElide] Sent: Tuesday, March 08, 2011 12:53 AM To: Gupta, Indranil Subject: 525 review 03/08

Finding a Needle in Haystack: Facebook's Photo Storage

Haystack is Facebook's (FB) object store for their excessive amounts of pictorial content. Haystack is optimized for high read throughput and low latency to accommodate the long tail of users' photo requests. The system's design is kept intentionally simple and is shown to be more cost-effective than Facebook's previous photo store incarnation. Haystack is comprised of the Haystack Directory, Haystack Store, and Haystack Cache. A client's browser communicates with a FB web server, which then forwards the client's request to the Haystack Directory. The Directory load balances across storage machines for photo writes and across the Haystack Cache and content delivery network (CDN) for photo reads. The Directory also maintains a mapping from logical to physical volumes and various associated flags. After a request is sent to the Directory, it responds to FB's server, directing the client to either a CDN or FB's Haystack Cache. The Cache is a distributed hash table that is functionally equivalent to a FB-owned CDN. It stores photos requested directly by the user that reside on write-enabled machines (for optimization purposes). If a request is sent from a user to the Cache, the Cache replies directly back. Haystack's improvements over its predecessor are the result of reducing the number of disk reads needed to retrieve a photo by storing its metadata in memory. These improvements are evaluated empirically and it is shown that a "high" level of throughput is maintained.
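To make the Cache's admission rule described above concrete, here is a minimal Python sketch; the class and argument names are invented for illustration and are not Facebook's API:

    class HaystackCacheSketch:
        """Toy stand-in for the Haystack Cache described above."""
        def __init__(self):
            self.photos = {}  # photo_id -> photo bytes

        def get(self, photo_id, fetch_from_store, came_from_cdn, volume_is_writable):
            if photo_id in self.photos:
                return self.photos[photo_id]     # cache hit
            data = fetch_from_store(photo_id)    # miss: go to the Store
            # Cache only direct browser requests for photos on write-enabled machines,
            # since newly written photos are the ones most likely to be read again soon.
            if not came_from_cdn and volume_is_writable:
                self.photos[photo_id] = data
            return data

    # Example: a direct request for a fresh photo gets cached; a CDN-relayed one would not.
    cache = HaystackCacheSketch()
    cache.get(42, lambda pid: b"...jpeg bytes...", came_from_cdn=False, volume_is_writable=True)
    assert 42 in cache.photos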
Pros:
1. Facebook's characterization of photo requests was interesting and seemed to reinforce one's intuition that such read requests would follow a long-tail distribution, and that write requests would be most frequent after the weekend.
2. The design decision to create an internal CDN (the Haystack Cache) with modified caching requirements cleverly augments the benefits of traditional external CDNs.

Cons:
1. The authors discuss the importance of the Haystack Directory in managing the distribution of requests to either the Haystack Cache or an external content distribution network (CDN). I would like to have seen more details (or any) on their load balancing algorithms.
2. Perhaps the data is non-existent, but a comparison of the Haystress benchmarks (and more generally all of the benchmarks) to Facebook's previous photo storage system would have been useful and would have allowed for a more meaningful analysis.

Beaver et al. discuss the importance of non-reliance on traditional CDNs. I think a useful research contribution could be to characterize the limitations of extant CDN services.

----------------------------------------------------------------------

Bigtable: A Distributed Storage System for Structured Data

Bigtable is Google's distributed structured data storage system for some of Google's core products, including Google Analytics and web indexing. Bigtable's data model is intentionally simplistic and its reliance on the exotic features of the distributed lock service, Chubby, is purposefully kept to a minimum. Bigtable is a multidimensional map in that each entry in the map is indexed via a (row_key, column_key, timestamp) tuple. The values stored in Bigtable are uninterpreted arrays of bytes, to accommodate a wide variety of customer use cases. Bigtable's logs and data files are stored using the Google File System. Data files conform to Google's SSTable file format. A Bigtable is split into many smaller "tablets" that contain the data for neighboring clusters of row entries. These tablets are distributed to one of many tablet servers; this assignment of tablet to tablet server is managed by a master server. The master also serves to provide load balancing across tablet servers, perform garbage collection of files on the distributed file system, and maintain a list of active tablet servers. Various optimizations and refinements are presented and the system's performance is evaluated. Application use cases are detailed.

Pros:
1. The "real applications" section of the paper was highly relevant and provided concrete examples that truly motivate the work.

Cons:
1. The scalability results of Bigtable were disappointing. The authors point to poor load balancing across multiple server configurations and describe how these imbalances come about, but no possible improvements are discussed.
2. Master failure recovery is addressed but the time to restart a master is not covered. It is unclear to me whether restarts are rapid. The authors describe a periodic freezing of the memtable to perform minor and major table compactions. One is left wondering whether there are viable online algorithms for such compaction.

From: Chi-Yao Hong [cyhong1128 SpamElide] Sent: Tuesday, March 08, 2011 12:49 AM To: Gupta, Indranil Subject: 525 review 03/08

---- Finding a Needle in a Haystack: Facebook's Photo Storage, OSDI'10 ----

While 80% of photo accesses are for photos with age <200 days, the measurement results show that old photos (e.g., more than 1500 days old) still have a fair chance of being accessed.
Therefore, to provide great service over 260 billion images, Facebook requires a new file storage system that provides high throughput and low latency while being fault-tolerant and cost-effective. The paper shows that in the traditional design, where each image is stored in its own file, the enormous amount of metadata far exceeds the caching capacity. Therefore, multiple I/O operations per photo request slow down the response time. To deal with these problems, Facebook proposed Haystack, a new object store that keeps the images for each user in a single file. Under this setting, each user has only one index file, which stores the metadata about that user's image file. When a user performs a read operation, the store machine simply looks up the relevant metadata in its in-memory mappings and reads the needle at the proper offset. Therefore, this does not introduce any extra I/O operations. For write or update operations, the web server sends the related information (e.g., logical volume and data) to each store machine. Store machines then append the needle images in a synchronous fashion to avoid inconsistency, and each store machine updates its in-memory mapping.

Pros:
1. The hash function gives Facebook's photo storage well-balanced behavior. Also, the high hit rates for the Haystack Cache significantly improve the serving latency and throughput.
2. Using two benchmarks, Randomio and Haystress, the authors use synthetic workloads to show that the proposed file storage system can sustain high throughput even in the presence of concurrent write operations.

Cons:
1. The paper does not show why metadata should be stored on a per-user basis. Intuitively, per-user metadata could simplify the design, but could be less efficient. I would expect to see more discussion about the pros and cons of different methods for grouping related files. There should be a tradeoff between total metadata size and access granularity. Clearly, Facebook should seek an operating point where they can store the entire metadata in main memory, while at the same time they also want the smallest access granularity (e.g., caching granularity and file permission granularity).
2. This paper does not address the incremental growth problem. For Facebook, the current growth rate is 220 million new photos per week, which translates to 25TB of additional storage consumed weekly. The growth rate of Facebook's photo services could be faster than that of cache storage size and DRAM size.

---- Bigtable: A Distributed Storage System for Structured Data, OSDI'06 ----

Google requires a file storage system that can scale up to billions of URLs with hundreds of millions of users. Because of the enormous size requirement, current databases cannot provide good performance, and therefore Google seeks low-level storage optimizations to improve system performance. Google proposed Bigtable, a distributed storage system that can scale to the desired petabyte scale while preserving applicability, scalability, high performance and availability. The fundamental abstraction that Bigtable provides is a multi-dimensional sorted map. The building blocks of Bigtable include i) the Google File System to store persistent data, ii) a scheduler that assigns jobs involving BigTable to machines, iii) a lock service to perform master election and location bootstrapping, and iv) MapReduce for large-scale data processing.
Bigtable splits a large table into tablets organized in a multi-level hierarchy, and assigns a reasonable number of tablets to each serving machine. This setup gives Bigtable fast recovery and load balancing. Bigtable selects a master server to keep track of the status of the servers, and the master server assigns tablets to servers with performance considerations in mind.

Pros:
1. The performance is good. Bigtable can perform ~1M random writes/reads with 500 tablet servers. Also, the scalability is good: the aggregate throughput of Bigtable increases considerably as the number of serving servers increases from 1 to 500. It performs well in many real applications like Google Analytics and Google Earth.

Cons:
1. The Bigtable data model is somewhat restricted -- a sorted map with fixed dimensionality. I doubt the proposed refinements are general enough for other data models.

-- Chi-Yao Hong PhD Student, Computer Science University of Illinois at Urbana-Champaign http://hong78.projects.cs.illinois.edu/

From: yanenli SpamElide on behalf of Yanen Li [yanenli2 SpamElide] Sent: Tuesday, March 08, 2011 12:04 AM To: Gupta, Indranil Subject: 525 review 03/08 cs 525 Paper review 03/08/2011

Paper 1: BigTable: A System for Distributed Structured Storage

This paper describes Bigtable, an industrial distributed storage system for structured data inside Google. BigTable is a sparse, distributed, multi-dimensional sorted map. Indexing is performed using a (row key, column key, timestamp) tuple, and the data is an uninterpreted array of bytes. Operations on a single row are atomic. These operations are defined as insertion, deletion, lookup, etc. Columns are grouped into column families, which are all of the same type and are grouped and compressed together. Please note that BigTable doesn't support multi-row transactions or table-wide integrity constraints. Bigtable is built on top of the Google File System for storing data/logs, the SSTable format for mapping from keys to values, and Chubby for distributed locks. The data cells are stored in tablets, which are distributed across tablet servers. In order to make BigTable efficient and scalable, several implementation refinements are carried out, such as grouping column families together into an SSTable, shared logs for table edits, and data compression on columns as well as on rows.

Pros:
- Simple design, so that it is easier to detect problems and make modifications.
- Highly scalable to thousands of machines.
- Applicable to different kinds of applications.

Cons:
- Not ACID compliant; for example, no support for multi-row transactions or table-wide integrity constraints.
- No query optimization.
- It's not clear whether it works over the Internet (clusters spread across geographically distant sites).

Paper 2: Finding a Needle in Haystack: Facebook's Photo Storage

This paper discusses a specialized store for Facebook's photo storage. The motivation of the design is to provide (1) high throughput and low latency; (2) a good user experience; (3) fault tolerance; (4) cost-effectiveness; and (5) simplicity for the proposed photo storage. A photo is modeled as a generic object (needle) in a large 10 GB appendable file (Haystack). The key to the design is a fast index for determining needle offsets in the Haystack. The authors propose the Haystack Directory, which provides a mapping from logical volumes to physical volumes and load-balances writes across logical volumes. They've also designed the Haystack Cache, which is a distributed hash table used to locate cached photos.
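A rough Python sketch of those two Directory roles (the class, field names, and the uniform-random balancing rule are all illustrative assumptions, not the paper's implementation):

    import random

    class DirectorySketch:
        def __init__(self):
            # logical volume id -> list of (store machine, physical volume path) replicas
            self.logical_to_physical = {}
            self.writable = set()  # logical volumes not yet marked read-only

        def add_logical_volume(self, lv_id, replicas, writable=True):
            self.logical_to_physical[lv_id] = list(replicas)
            if writable:
                self.writable.add(lv_id)

        def replicas(self, lv_id):
            return self.logical_to_physical[lv_id]

        def pick_volume_for_write(self):
            # Naive balancing: any write-enabled logical volume, chosen uniformly.
            return random.choice(sorted(self.writable))

    # Usage: logical volume 7 is backed by three physical volumes on different machines.
    d = DirectorySketch()
    d.add_logical_volume(7, [("store-a", "/hay/v1"), ("store-b", "/hay/v9"), ("store-c", "/hay/v4")])
    print(d.pick_volume_for_write(), d.replicas(7))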
Operations provided by the system include: uploading a photo; reading; outputting needle data; writing/modifying the in-memory index; and deleting photos from the system.

Pros:
- Reduced disk I/O, resulting in higher disk throughput.
- Simplified metadata for easier lookups.
- Provides a single photo serving and storage layer.

Cons:
- Privacy concerns. Are cookies sufficient protection?
- How is consistency maintained within Haystack?

-- Yanen Li Ph.D Candidate Department of Computer Science University of Illinois at Urbana-Champaign Tel: (217)-766-0686 Email: yanenli2 SpamElide

From: mark overholt [overholt.mark SpamElide] Sent: Monday, March 07, 2011 8:18 PM To: Gupta, Indranil Subject: 525 review 03/08 Mark Overholt CS 525 03/08/2011

Finding a Needle in Haystack: Facebook's Photo Storage

Summary: Facebook is one of the biggest photo sharing websites, with over 60 billion photos and 25TB of new data uploaded per week. Due to the requirement of high availability of photos, Facebook uses four levels of caching: CDNs, Cachr, Photo servers and FHC (file handle servers). Each image is stored as a separate file in an NFS file system, so the amount of metadata associated with storage is enormous and hence a large amount of time is spent on retrieving metadata. It is hard to even fit the large amount of metadata in caches, leading to about 3 disk I/Os per read on average. In order to combat this problem of large metadata and increased I/O, the Facebook team came up with a photo storage solution called Haystack. Each user is assigned a haystack, in which all of that user's photos are stored as one file. A haystack object consists of a superblock and an array of needles. Each needle stores the metadata about an image and the actual image. When a user performs a read, an HTTP request containing the metadata is sent to the photo server (which caches the index files); it translates the request into the appropriate object, loads the requested file from the object using XFS and sends the reply to the user. During deletion, only the offset is set to zero and no data is deleted. In compaction, a new object is created and only images that are not deleted, or only their latest modifications, are copied. Hence, Haystack proposes a new mechanism for efficiently reducing the metadata associated with each image.

Discussion:
Pros: Seems to effectively reduce I/O operations on Facebook photos by reducing the size of the metadata. Facebook has millions of photos stored and retrieved every day with real-time performance. This is proof enough of the robustness and great performance of the system.
Cons: The photos are not actually deleted from the servers?!? This seems like a big privacy issue that users should be made aware of. The system makes use of an abundant amount of main memory. However, this increases the cost of the system, which may not be affordable for smaller organisations. Besides, with the current exploding rate of growth of Facebook, this may soon become a bottleneck too.

Bigtable: A Distributed Storage System for Structured Data

Summary: Bigtable is a storage system developed by Google to handle the large amounts of structured data used by many of their applications. Their goal was a system that could store petabytes of data across thousands of commodity servers, allow that data to be retrieved quickly, and be available as much as possible. Bigtable is akin to many distributed databases today. However, it does not support a relational model.
Instead, it provides clients with a simple data model that supports dynamic control over data layout and format. In short, Bigtable is a large, distributed, multi-dimensional sorted map. The map is indexed by a row, a column, and a timestamp. Data maintained by Bigtable is kept in lexicographical order based on the row key. Next, it groups columns together into families, allowing them to be compressed together. The columns are also sorted. Bigtable allows an arbitrary number of columns, which can have any name. Behind the scenes, Bigtable uses the Google File System (GFS) to distribute and load balance the data across the cluster of servers. Bigtable divides its data up into chunks called tablets. There is one master tablet server that saves only metadata on the tablets in the system. There is a second level of metadata tablet servers, which can be variable in size; when one of these metadata tablet servers gets too overloaded, it can split into two different servers. It should be noted that the master tablet server cannot be split. Lastly, the third level is the user tablet servers. Each of these servers holds between 10 and 1000 tablets, and they can dynamically load balance themselves if there are too many tablets or too many requests on one particular server. The metadata and index data for the tablets are stored in main memory on each server.

Discussion:
Pros: Bigtable seems to work quite well for everything Google asks it to do, considering it is already the backbone of countless Google applications in production today. Bigtable is very customizable to whatever application needs to use it. Bigtable can handle low latency or it can handle high availability. It is a very versatile system.
Cons: Because Bigtable uses a single master server to handle indexing, it is a central point of failure. Even though clients themselves access the user tablets directly, this master node is still a source of concern. I am curious to see what would happen to a Bigtable deployment if the master node stopped working. The experiments in this paper do not provide a comparison of how other systems would perform in a similar environment. As a result, it is difficult to assess the actual performance gain compared to a relational database.

From: Ankit Singla [iitb.ankit SpamElide] Sent: Sunday, March 06, 2011 5:42 PM To: Gupta, Indranil Subject: 525 review 03/08

1. Finding a Needle in a Haystack
--------------------------------------------------

Summary: The paper walks us through Facebook's storage system design. I think they did an excellent job of illustrating why the CDN alone wasn't enough (because ~20% of requests are for older photos, which are simply too large in volume to push to a CDN economically) and why the traditional NFS/NAS system didn't work for them (because, given that their entire set of metadata was too large to be cached, this system required at least 3 disk operations to read one image). The key idea of their system is to bunch files (encapsulated in 'needles') into large files of 100GB. Having these large files allowed them to keep the file descriptors for all 10 files on the disk open, while also allowing a smaller amount of metadata which could then be cached. This resulted in a design with only 1 disk operation to fetch an image (which is the least possible).
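A minimal sketch of that one-disk-read path, assuming invented names (Needle, StoreVolumeSketch); the real needle and index records carry more bookkeeping fields:

    import os
    from collections import namedtuple

    Needle = namedtuple("Needle", "offset size deleted")

    class StoreVolumeSketch:
        def __init__(self, path):
            self.fd = os.open(path, os.O_RDONLY)  # volume file kept open permanently
            self.index = {}                       # photo key -> Needle, rebuilt from the index file

        def read_photo(self, key):
            n = self.index.get(key)
            if n is None or n.deleted:
                return None
            # The only disk operation: read `size` bytes at `offset` of the open volume file.
            return os.pread(self.fd, n.size, n.offset)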
They made several other design decisions: a) mapping 3 large physical files on different disks to one logical volume to help with redundancy; b) doing security and validity ('not deleted') checks after reading the image from disk, so that this information doesn't have to be cached with other metadata in memory; c) using an internal CDN smartly - they shield the write-enabled machines from read requests by caching their data, rather than caching data for read-only machines, etc.

Comments: Overall, the paper presents some interesting engineering choices which are clearly dictated by their workload. They use the workload to justify these in a very convincing manner. The applicability of the same system to other scenarios is highly suspect. The paper doesn't discuss much about the distributed operation of the Haystack Directory and web servers (which of course are not the main focus of this paper), but I think a little more information on these would be good to know (for instance, how requests are load balanced across multiple web servers and whether these instances cooperate in load balancing over the storage or just do things independently, relying on randomization to work things out). I'm also curious about whether any of their systems explicitly utilizes the social network structure to make smarter choices (about replication, load balancing, caching, etc.). Another interesting thought that crossed my mind was whether they could ship the metadata (inode number, machine ID, etc.) to the users most likely to ask for a particular person's pictures (like close friends) and get away with the traditional NAS/NFS system with this added idea. This does reveal some internal structure to the user end, and people might be able to develop applications that completely sidestep visits to the website and just show pictures (which would be bad from an advertising perspective).

2. Google File System
----------------------------------

Summary: Like Facebook's paper, this paper is also highly geared towards making design choices that suit a particular workload. It immediately struck me that, from Google's perspective (at least in 2003), most file operations were being done not in direct response to user requests, but by background processing tasks done by bots. This is what seems to have prompted their journal-like file system design - bots read and write large chunks of data sequentially and data is rarely modified. Chunks are stored on chunkservers and a master keeps a record of which chunks are stored where. Providing efficient append operations for large writes is one of their key priorities. As a result, they use large chunks (64MB) as the core object for the file system to operate on - this means sequential reads impose only one contact with the master per chunk. The applications/clients can cache the locations of chunks for data they access frequently and maintain persistent TCP connections to chunkservers they contact often. Large chunks also reduce the metadata overhead on the master server. The focus on multiple concurrent operations means some compromise needs to be made on consistency, and that is the compromise they made. (Only the file namespace mutations seem to be atomic, locking operations.)

Comparison with Haystack: It's also significant that Google's data store was much smaller than Facebook's (100s of TB in 2003 versus 100s of PB at Facebook). I wonder how YouTube and Google's resounding successes in other application areas changed things relatively.
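A small sketch of the arithmetic behind the 64 MB chunk design summarized above (purely illustrative; the ask_master callback is a hypothetical stand-in for the master lookup, and only the offset-to-chunk-index mapping reflects the paper):

    CHUNK_SIZE = 64 * 1024 * 1024  # GFS's fixed 64 MB chunk size

    def chunk_index(byte_offset):
        # The client translates a file offset into a chunk index itself,
        # so the master is consulted once per chunk, not once per read.
        return byte_offset // CHUNK_SIZE

    location_cache = {}  # (filename, chunk index) -> list of chunkserver addresses

    def chunk_locations(filename, byte_offset, ask_master):
        key = (filename, chunk_index(byte_offset))
        if key not in location_cache:
            location_cache[key] = ask_master(*key)  # hypothetical master lookup
        return location_cache[key]

    # Reading 256 MB sequentially in 1 MB steps touches only 4 distinct chunks.
    print(len({chunk_index(o) for o in range(0, 256 * 2**20, 2**20)}))  # -> 4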
Another contrast between GFS and Haystack is that GFS values high throughput more than low latency (this again boils down to large streaming reads by bots versus small user requests). GFS is very focused on concurrent operations by multiple applications on the storage, which I think is a non-issue for Haystack because of the caching. The comparison of their design decisions is almost a lesson in tailoring your system to your workload!

Comments: For me, one major curiosity is that both GFS and Bigtable were published prior to YouTube being massively popular. (Even around 2006, when BigTable was published, YouTube was nowhere close to today's scale.) I strongly suspect that Google could benefit from an entirely separate file system for YouTube than for its other applications. It seems like YouTube could require different optimizations (for instance, efficient streaming of 10s of MBs with fairly predictable sequential reads). In a tech talk from 2008 (http://www.youtube.com/watch?v=HXevnuOOy48), one of their engineers described that even dealing with the large number of image thumbnails for videos was a significant challenge and they had separate infrastructure just to tackle that problem.

Ankit -------------------------------------------------------------------- Ankit Singla Graduate Student, Computer Science University of Illinois at Urbana-Champaign (UIUC) http://www.cs.illinois.edu/homes/singla2/

From: Tony Huang [tonyh1986 SpamElide] Sent: Saturday, March 05, 2011 3:16 PM To: Gupta, Indranil Cc: Tony Huang Subject: 525 review 03/08 Zu Huang zuhuang1

* Finding a Needle in Haystack

Core Idea: In this paper, the authors introduce a storage system optimized for picture storage and retrieval at Facebook. The main improvement of this system is that different pictures are concatenated into one giant file to reduce the amount of file metadata. An index file contains all the metadata for the picture files inside a giant file, including key, offset, size and bookkeeping flags. The index is stored entirely in memory to prevent disk seeks during operations. One such huge file is called a volume. Multiple physical volumes are controlled by a Haystack Store. On top of the Haystack Store is a cache level. The Haystack Cache is tailored to Facebook photos' access pattern, which is that new photos are accessed frequently, and their access frequency decreases after two days with a very long tail. To optimize for this, all new uploads are cached to shield the machines newly written to. On top of that, the Haystack Directory is a database that contains information about where the photos are stored.

Pros:
* A much more efficient photo storage system that serves its purpose.
* Again proves the idea that storing all of your index inside main memory is an efficient way to speed up your system.
* Illustrates how caching and CDNs can effectively reduce load on the actual storage machines and improve user experience.

Cons:
* Not tested under a more general workload.
* Why not use something like Bigtable to achieve this?
* The Haystack Directory may become a single point of failure.

* Bigtable

Core Idea: Bigtable is Google's distributed storage system. It is a sparse, distributed, multi-dimensional sorted map. Each piece of data is indexed by a row key, a column key and a timestamp. The data is stored as uninterpreted raw bytes. Data are maintained in lexicographic order on row key. Access to a single row is atomic. The data within a range of rows is called a tablet, and it is the unit of load balancing and distribution. Column keys are grouped into sets of column families, which are the basic unit of access control.
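To pin down the data model sketched in the last few sentences, here is a toy Python illustration (not Google's API): a sorted map keyed by row, column and timestamp, with all mutations to a single row applied as a unit; the example row and column names are the ones used in the Bigtable paper.

    from bisect import insort

    class TinyBigtable:
        def __init__(self):
            self.row_keys = []   # row keys kept in lexicographic order
            self.rows = {}       # row key -> {column key -> {timestamp -> value bytes}}

        def write(self, row_key, mutations, timestamp):
            """Apply all (column, value) mutations to one row as a unit (per-row atomicity)."""
            if row_key not in self.rows:
                insort(self.row_keys, row_key)
                self.rows[row_key] = {}
            row = self.rows[row_key]
            for column, value in mutations:
                row.setdefault(column, {})[timestamp] = value

        def read(self, row_key, column):
            versions = self.rows[row_key][column]
            return versions[max(versions)]  # newest timestamp wins

    t = TinyBigtable()
    t.write("com.cnn.www", [("anchor:cnnsi.com", b"CNN"), ("contents:", b"<html>...")], timestamp=3)
    print(t.read("com.cnn.www", "anchor:cnnsi.com"))  # b'CNN'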
Bigtable is built on top of other infrastructure management software inside Google. It depends on other software for scheduling, resource management, resource sharing and machine management. The SSTable file format is used to store the data. The system consists of a library that is linked into every client, one master server and many tablet servers. The master is responsible for assigning tablets to tablet servers, load balancing, garbage collection, tablet health monitoring and schema changes. Tablet servers manage the tablets assigned to them, and handle splitting a tablet if it gets too big. The master is mostly lightly loaded. Tablet locations are stored in a three-level, B+-tree-like data structure. Chubby is used to distribute the location of the root tablet. The root tablet, as well as the second-level tablets, stores only metadata and logging information.

Pros:
* Highly scalable and optimized for high-throughput projects.
* More expressive than a traditional DBMS, which requires schemas.

Cons:
* The master server can become a single point of failure and a resource constraint. As a matter of fact, Google has a project to shard the master server for both GFS and Bigtable.
* Not optimized for incremental updates.
* No geo-location consideration for tablet assignment.

-- Regards -- Tony