From: Curtis Wang [cwang89 SpamElide] Sent: Thursday, February 24, 2011 12:29 PM To: indy SpamElide Subject: 525 review 02/24 Curtis Wang (wang505) 2/24/11 Storage - 1 Innovator's Dilemma In "The Innovator's Dilemma", Christensen describes the hard disk industry as akin to fruit flies—they both have frequent cycles of change and allow for ease of study. Christensen states that the hard disk industry is a cycle of entrant firms beating out established firms. He attributes this to the fact that established firms tend to build sustaining technologies, i.e. improving their existing technologies for the existing market. Entrants, on the other hand, are more likely to build disruptive technologies, i.e. altering the use of existing hard disks to appeal to a new market (in the hard disk space, this meant reducing the size of the disk to follow the trend of reductions in computer sizes). Generally, this involved taking a step back in terms of performance for their disks, which was viewed unfavorably by established firms who, in terms of pure technology, were still leading the wave. When the established firms recognized the need to adopt the new uses that were propounded by the entrant firms, it was (most of the time) too late for them to recover. The hard disk industry was a very clear way to visualize the points that Christensen was making. However, it would have been nice for him to also extrapolate the claims he made and relate them to a different market, to provide a different perspective on what he was saying. RAID To increase performance and reduce costs and energy expenditure, Patterson calls for the use of Redundant Arrays of Inexpensive Disks in contrast to Single Large Expensive Disks. He argues that the performance (number of I/Os per second) of inexpensive disks is similar (within a factor of two) to that of larger disks, while being better in other metrics such as price per megabyte. One of the problems with using several disks is an overall increased chance of some disk in the array failing. To combat this, Patterson proposes RAID, a method of using additional disks to store redundant information to allow for recovery. He proposes several different array organizations called levels. It's interesting to note the parallels between technologies such as RAID and distributed computing. Moving away from SLED to RAID is similar to how many companies today use clusters of heterogeneous commodity computers instead of supercomputers. RAID aims to move away from "centralized" storage to "distributed" storage. As with distributed systems, similar problems arise, namely performance and fault tolerance. The RAID paper also follows a pattern similar to the story in the first chapter of "The Innovator's Dilemma". A new use is found for smaller, inexpensive, and lower-power disks, making them more useful than larger disks. This application was not discovered in the companies that created these disks but rather in academia. From: w SpamElide on behalf of Will Dietz [wdietz2 SpamElide] Sent: Thursday, February 24, 2011 12:29 PM To: Gupta, Indranil Subject: cs525 review 02/24 Will Dietz wdietz2 2-24-2011 The first paper I read was the classic 'Case for RAID' paper. This paper presents the simple yet ridiculously effective concept of using multiple cheap disks together instead of buying one large disk. They motivate this by breaking drives down into performance (with respect to two primary workload types: "Supercomputer" and "TPS"), various cost metrics, and reliability.
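Both summaries above hinge on the same mechanism: a check disk holding parity lets the array reconstruct any single failed disk. A minimal sketch of that idea in Python, assuming byte-level parity and a single-disk failure (the disk contents here are made up for illustration and are not from the paper):

from functools import reduce

# Hypothetical contents of three data disks (equal-sized blocks).
data_disks = [b"\x10\x20\x30", b"\x04\x05\x06", b"\xaa\xbb\xcc"]

def xor_blocks(blocks):
    # Byte-wise XOR across equal-length blocks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

parity = xor_blocks(data_disks)  # what the check disk would store

# Suppose disk 1 fails: XOR of the surviving data disks and the parity
# block reconstructs the lost contents.
rebuilt = xor_blocks([data_disks[0], data_disks[2], parity])
assert rebuilt == data_disks[1]

The RAID levels discussed next differ mainly in where this redundant information lives and at what granularity it is maintained.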
They present 5 levels (configuration styles) of RAID and examine how they perform on these metrics relative to each other as well as relative to single expensive disks. RAID 1 is just mirroring across disks. RAID 2 has dedicated parity disks using Hamming codes, like ECC on DRAM. Wikipedia points out that RAID 2 (bit-level parity checks) is no longer used much, since most hard drives (even the inexpensive ones like the ones discussed in the paper) have built-in parity checks these days. RAID 3 drifts further towards performance by having a single check disk per group (instead of enough check disks to both detect and correct an error as in RAID 2). RAID 4 is like RAID 3 but breaks the data at the block level. This also allows for independent I/O operations, yielding high read rates, but terrible read-modify-write performance, making it generally unsuitable for most workloads. RAID 5 strikes a happy compromise by not having any dedicated check disks, instead using a distributed type of parity, and is well suited to both workload types analyzed. All in all this was a good read, and the entire paper had a comforting pragmatism to it, and judging from how popular RAID setups are today, this paper's impact is hard to deny. The second paper I read was "How can great firms fail? Insight from the Hard Drive industry". This paper presents a history of the hard drive business to give insight into what works and what doesn't. Very early on the paper points out the 'innovator's dilemma': in order to succeed you need to listen to your customers and innovate towards their needs, but doing exactly that can also cause your business to fail. They expand this idea by pointing out that succeeding doesn't require this aggressive attempt to stay on top of things technology-wise--that is, the 'mudslide' hypothesis they propose is inconsistent with their research. They then analyze the history of HD companies, and present the concept of disruptive technologies--things that shook things up and, handled incorrectly, caused many industry leaders to fail. They found that the innovations that *sustain* the trajectory of performance were generally NOT the ones that brought down companies. Finally they discuss the relationship between the company and the customer, suggesting that customers can 'hold captive' companies (say by not wanting a new product because it's incompatible with their old one), causing established companies to have issues, while NEW companies to the market don't have such limitations. From: harshitha.menon SpamElide on behalf of Harshitha Menon [gplkrsh2 SpamElide] Sent: Thursday, February 24, 2011 12:01 PM To: Gupta, Indranil Subject: 525 review 24/02 FAWN FAWN is a cluster architecture for low-power data-intensive computing where they couple low-power CPUs with flash storage. The motivation is that for data-intensive computing, storage, network and memory are the bottleneck, which causes low CPU utilization. This system consists of front-ends which forward requests to appropriate backends that store data on flash. The backend storage nodes are organized in a ring using consistent hashing. Each physical node hosts multiple virtual nodes, each with a separate datastore file, and this helps with load balancing and fault tolerance. FAWN backend nodes use an in-memory hash index that maps keys to locations in the data log. Pros: -This system provides fast and energy-efficient processing for certain data-intensive computing workloads. -By using consistent hashing, this system is incrementally scalable.
-This system uses two levels of cache, one at the front-end and the other at the back-end. This reduces latency and handles load imbalances for popular queries. -FAWN offers a configurable replication factor, which makes it fault tolerant. Cons and Improvements: -The workload supported by this system is typically small. -The front-end maintains the entire node membership list, which makes it less scalable as the number of backend nodes grows. This could be fixed by using distributed hash table routing like Chord. -If a query comes to a front-end which is beyond its range, it forwards it to the right front-end. This involves doubling the traffic at the front-end. One solution could be that the client is aware of the node membership, which it periodically collects, and sends requests to the correct front-end. -Even though joins and leaves do not affect the entire system, they are still complicated, and join operations complete way after pre-copies are finished under high load. -This system is more optimized for reads than writes. -The design of the system is such that the put request is sent to the head replica, which sends it down the chain of replicas to the tail, and the tail replies with a response. This would increase the latency. Instead they could have lazy replication. -The put request is always sent to the head and the get is sent to the tail node. Instead, queries could be directed to the replicas that have the least load. A Case for RAID Redundant Arrays of Inexpensive Disks, based on magnetic disk technology, offer a good alternative to SLED, with performance improvement, more reliability, less power consumption and scalability. There are five levels, where the first level is mirrored disks: all the disks are duplicated and every write to a data disk is also written to a check disk. Level two is suitable for supercomputers; it has the same read performance as level 1 but uses fewer check disks. In level three, there is only one check disk per group. In the fourth level, RAID improves the performance of small transfers through parallelism. In the fifth level, the data, including the check information, is distributed across all the disks. Pros: -An array of disks has a low MTTF, and to overcome this RAID breaks the array into reliability groups, with each group having extra disks containing redundant information. If this extra disk is a hot backup, the MTTR is reduced. -The reduction in disk size simplifies the interconnection of many components, packaging and cabling. This improves the cost/performance metrics. -RAID offers modular growth over SLED. -RAID significantly reduces the power consumption per megabyte. Cons: -In an array of disks, failures become frequent, i.e., MTTF is lower. Therefore, there need to be extra measures for redundancy. -The reliability cost varies from 100% to 4% depending on the RAID level. This is basically the extra cost of the check disks which store the redundant data. -This is more suitable for systems smaller than SLED. Thank you -Harshitha From: anjalis2006 SpamElide on behalf of Anjali Sridhar [sridhar3 SpamElide] Sent: Thursday, February 24, 2011 12:01 PM To: Gupta, Indranil Subject: 525 review 02/24 FAWN FAWN is an architecture that is based on providing high-performance, low-power cluster operations. It uses flash storage instead of DRAM. The important aspects of flash storage are 1) fast random reads, 2) better query/joule performance and 3) slow random writes. Flash has faster access than disk and uses less power than DRAM.
FAWN uses low-power processors which are slower than traditionally used CPUs but provide more instructions per joule. The paper presents the implementation of this architecture by building a key-value store based on these basic principles. The FAWN-KV store consists of a log-structured data store and a chain of virtual nodes that provide replication and consistency. The data store is log-structured and writes by appending. The data store is indexed using a key that is stored in the hash table. In order to prevent random seeks for writing, this design ensures that all writes are appended to the data store and the offset is stored in the hash table. The store allows for lookup, store and delete operations. FAWN also handles node joins and leaves with a two-phase protocol. FAWN also has a set of front-end nodes that keep track of the membership of the KV store. They route the user's query to the right node. FAWN uses chain replication on a consistent hashing ring. The query is routed to the head of the chain and the tail of the chain sends a response back to the front-end. FAWN provides an energy-efficient solution. It provides this solution for read-intensive workloads of small size. A query routed through the chain would cause overhead if the physical nodes in the chain are located far from each other. FAWN does not mention query failures and how nodes handle them. It assumes only fail-stop failures of nodes. It does not look into communication failures. FAWN does not deploy the KV store in a practical setting where it might be useful. The idea of using flash storage coupled with slightly lower-performance processors might be useful in the current set of thin mobile devices. When a store is performed for a key that already has a value written, the new value is stored and the hash table gets a new offset to refer to it. The old memory space is not referred to anymore. This might lead to disk fragmentation. Compaction might be required very frequently if there are a large number of writes. The paper mentions that this slows down the query rate during this period. FAWN seems to target read-intensive workloads in its analysis, when it is the performance of random writes on flash storage that needs to be analyzed and improved upon. How can great firms fail? The author tackles one of the important questions faced by any industry in the face of escalating innovation. The disk drive industry example is used to summarize some of the reasons why industry giants are toppled by entrant firms. The new technology might be innovative and might offer more than what the earlier firms were offering. Sometimes the existing technology might be in demand, and servicing current customer requests might be bringing in more revenue than trying to ship a new technology. It might be bad business at that point in time, but a few years down the line the demand might be for the new technology. This can lead to companies going out of business by listening to their customers. The author states that the large companies lost their vision and mobility on an always-growing performance trajectory, which led to newer entrant firms gaining an advantage. The author does not consider companies that have many products that they are dealing with. For example, Microsoft is a software industry giant with many successful products. However it was late in jumping onto the mobile bandwagon and is now struggling in a market with already established companies.
Hence, though the vision of the changing industry might have been absent, it was not for any of the stated reasons that Microsoft failed to make its mark. From: nicholas.nj.jordan SpamElide on behalf of Nicholas Jordan [njordan3 SpamElide] Sent: Thursday, February 24, 2011 12:00 PM To: Gupta, Indranil Subject: 525 review 04/24 Nicholas Jordan njordan3 04/24 RAID levels: I found this paper really enjoyable. The idea of RAID started when someone realized that they could put together 12 smaller inexpensive disks that could cost less than a Single Large Expensive Disk (SLED). From my rough calculations, the original author could have used 12 Conner CP3100 disks against one IBM 3380 disk. The several RAID levels are good incremental advancements toward the final RAID 5 level. The question to ask is, is this the first instance of a "distributed" system? Yes, all the devices are in a single unit. It's distributed in the sense that the user sees the storage device as a single entity. However, behind the hardware, there are several disks that have to synchronize in some RAID levels. Also, data is spread among multiple disks, and in some levels several disks share the same error-checking parity information on the same check disk. FAWN: A Fast Array of Wimpy Nodes The design here exploits the cheap cost of flash memory and builds a powerful key-value store that is power-efficient compared to traditional key-value datacenters. Traditional in the sense that datacenters use mechanical disks and DRAM that continually need power to maintain the data. The design is similar to other key-value stores that use consistent hashing, and it uses replication for availability. The way it differs is that it uses flash memory. Flash memory is 175 times faster than magnetic disk for reads. It's two orders of magnitude more efficient than mechanical disk in terms of queries per joule. The fact that small writes are very expensive causes it to use a log structure. For small writes, you have to erase the whole block and then write the modified block in its entirety back to memory. So internally, it maintains a Data Log that you can only append data to. This avoids the write issue described above. It has a hash index, where each key is associated with a pointer to where the value is located in the Data Log. As for the test results, I'm questioning the real-world applications. For example, the entire Wikipedia can fit in 8 GB on disk, well less than the 20 GB they stored in their nodes. However, their query data sizes are 256 B and 1 KB. Maybe these small data sizes are plausible for a Twitter application or bank transactions. Is FAWN-KV still feasible for fetching web pages that are 100 KB or 200 KB in size? You have to take into account that most of a website can be boilerplate like navigation bars that won't change. When they showed the results for their join and leave operations, they only tested the system at 30% of its maximum supported query rate. This load rate equaled about 9,000 queries per second. I'm not sure if this is a high rate or if they are cheating. When these joins and leaves happen they move data between a couple of nodes, which basically adds more writes to the system. They basically showed that during the test, the queries per second weren't affected by these operations. Isn't this expected when you're operating at less than a third of your max capacity? It would have been nice to see how the system performed at max capacity. I look forward to seeing the results of how they tackle the sleeping disk issue.
I feel that with non-volatile memory this will be the next thing to take advantage of. -- Thanks, Nick Jordan From: Simon Krueger [skruege2 SpamElide] Sent: Thursday, February 24, 2011 11:36 AM To: Gupta, Indranil Subject: 525 review 02/24 A Case for Redundant Arrays of Inexpensive Disks (RAID) David A Patterson, Garth Gibson, and Randy H Katz While CPU and memory have been improving every year in size and performance at an approximately exponential rate, single large expensive disks (SLEDs) have not. This has caused I/O to become a bottleneck in computer systems, and because of Amdahl's law we know that if bottlenecks are not removed, speedups are wasted (e.g. consider a program that spends 10% of its time in I/O and 90% of its time in the CPU. By Amdahl's law we know that if this program receives a 100 times CPU speedup then the program will be less than 10 times faster, thus wasting 90% of the CPU speedup). This paper presents a design to fix the I/O bottleneck in computers by arranging multiple hard disk drives to allow distributed and parallel data reads and writes. They call this approach Redundant Arrays of Inexpensive Disks (RAID), which improves on SLEDs' performance and reliability and in turn removes the I/O bottleneck. The paper talks about 5 different RAID configurations. A RAID configuration is presented as a RAID level, and each RAID level has varying performance and overhead. Pros: The advantage of this approach is that the algorithm can be applied to any type of hard drive. That is, hard drives' performance and storage could increase and that would only make RAID more powerful. Additionally, with this approach disks become more reliable and there can be a power saving. Cons: Certain RAID levels (e.g. 5) might have to change based upon the underlying disk technology. Specifically, how well will this work with solid state disks (SSDs), since they have no concept of a sector? FAWN: A Fast Array of Wimpy Nodes David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, Vijay Vasudevan This paper constructs a new, highly energy-efficient, distributed key-value store called FAWN-KV, where FAWN stands for Fast Array of Wimpy Nodes. The idea behind FAWN-KV is that distributed key-value stores are I/O bound and not computationally intensive, thus wasting energy by running on powerful, energy-demanding machines. With this fact they architect FAWN to be similar to Dynamo and Voldemort except for how they actually store data and the types of CPUs. They use a special data store called FAWN-DS which is optimized for flash storage and embedded-systems-class CPUs. By using lower-powered CPUs and flash memory they are able to use considerably less energy and still have a high query/joule throughput. Advantages of their approach: Their FAWN-DS is able to have a much higher throughput than similar logging data structures such as BDB on flash memory. They are able to use considerably less energy and have a higher number of queries/joule compared to other systems, which translates into a lower-cost system. Additionally, since the hardware they used is ~2 years old and consists of small embedded-systems CPUs, it cost much less than commodity PCs. Cons of their approach: The machines used in FAWN may not be general-purpose enough to be used in any cloud computing application (e.g. they can only be used for key-value stores with small file sizes). What is the power saving of FAWN if they tried to put unused nodes to sleep?
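To make the FAWN-DS idea discussed in the FAWN reviews above more concrete, here is a minimal sketch, assuming a single node and a made-up record format (class and method names are hypothetical, not FAWN's API): every put appends to a log so flash never sees a random write, and a small in-memory hash index maps each key to the offset of its latest record.

import os

class AppendOnlyStore:
    # Toy log-structured key-value store: put() appends, get() does one seek.
    def __init__(self, path):
        self.log = open(path, "ab+")
        self.index = {}                      # key -> offset of latest record

    def put(self, key: bytes, value: bytes):
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        # Record layout (illustrative): 4-byte key length, 4-byte value length, key, value.
        self.log.write(len(key).to_bytes(4, "big"))
        self.log.write(len(value).to_bytes(4, "big"))
        self.log.write(key + value)
        self.log.flush()
        self.index[key] = offset             # the old record becomes garbage until compaction

    def get(self, key: bytes):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.log.seek(offset)
        klen = int.from_bytes(self.log.read(4), "big")
        vlen = int.from_bytes(self.log.read(4), "big")
        self.log.seek(klen, os.SEEK_CUR)     # skip the key bytes
        return self.log.read(vlen)

store = AppendOnlyStore("toy.log")
store.put(b"k1", b"v1")
store.put(b"k1", b"v2")                      # update: append again, re-point the index
assert store.get(b"k1") == b"v2"

Deletes would append a tombstone record, and compaction would rewrite the log keeping only the offsets still referenced by the index, which is where the space-versus-I/O trade-off mentioned in several of the reviews comes from.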
Simon Krueger From: Shen LI [geminialex007 SpamElide] Sent: Thursday, February 24, 2011 11:15 AM To: Gupta, Indranil Subject: 525 review 02/24 Name: Shen Li FAWN: A Fast Array of Wimpy Nodes This paper proposes using flash disks instead of magnetic disks in distributed key-value stores to achieve both high-performance random access and low power consumption. They point out that using flash storage is both beneficial and challenging. On one hand, flash disks can provide fast random reads and efficient I/O; on the other hand, random writes on flash are very expensive, because updating a single page requires first erasing an entire block of pages and then writing the modified block back to the device, which is as expensive as writing a whole new block. So they employ an append-only data log policy to reduce the random-write cost. On top of this basic idea, they built FAWN-KV, a cluster-based key-value storage system in which data is distributed across nodes using consistent hashing. They also designed operations to deal with node dynamics, including split, merge, and compact. Since they employ the append-only writing policy, the split and merge operations only call for sequential reading and writing. Pros: As they claim in the paper, their design can handle roughly 350 key-value queries per joule of energy, which is two orders of magnitude more than a disk-based system. Cons: 1. I am not sure whether it is a novel approach to use flash storage instead of magnetic disks. It seems that Apple has used this technology in their MacBooks for years. 2. An append-only data log can reduce the cost of random writes, but it will also cost a lot of extra storage space, especially in cases where there are a lot of updates. So one object can appear in their storage system many times. 3. They claim that FAWN-KV executes significantly more instructions per joule than previous approaches, but the comparison of access speeds between the two systems is not clear. I think commercial organizations care more about the performance of their product than about the energy cost. 4. The compact operation mentioned in Section 3.3.2 is costly, and cannot be done very frequently. 5. Their experiments are mostly focused on small entries. How is the performance for large data volumes? A Case for Redundant Arrays of Inexpensive Disks (RAID) As the gap between CPU and disk keeps growing, this paper proposes to use several redundant disks to achieve better performance and better fault tolerance. They provide five different levels of Redundant Arrays of Inexpensive Disks. Level 1 is mirroring one disk with one or more disks. This technique is simple to implement and provides better fault tolerance. Level 2 RAID employs bit-level striping and Hamming-code parity, which is capable of fault recovery. Compared to RAID 2, RAID 3 uses byte-level striping with parity. Since parity requires fewer bits than a Hamming code, the fault-recovery cost is also reduced compared with RAID 2. RAID 4 is based on block-level striping, which allows I/O requests to be performed in parallel. The difference between RAID 4 and RAID 5 is that RAID 5 uses distributed parity. Pros: The influence of RAID is significant. The first contribution is that RAID narrowed the gap between CPU, memory and disk. Secondly, the idea of using redundant but cheap components to provide better performance is still used today in different research areas.
For example, in data centers, people use cheap switches to build redundant connections between servers to provide both better fault tolerance and higher network capacity. Cons: RAID is not flexible enough. Whichever level of RAID is used, all data are stored in a redundant way, so it cannot accommodate application-specific requirements. From: mdford2 SpamElide Sent: Thursday, February 24, 2011 11:08 AM To: Gupta, Indranil Subject: 525 review 02/24 Insights from the Hard Disk Drive Industry Christensen uses the hard drive industry as the example for the opening chapter of his book based on companies' short lifetimes and the high rate of innovation. He argues that the main cause of companies failing was not the steep performance trends, but rather companies not adapting to new areas quickly. The price per megabyte has declined by roughly five percent per quarter over the time span of Christensen's data. The emergence of new markets is the eventual downfall of some companies. Those who only listen to their current customers and focus solely on their performance metrics often fail to see the benefit of alternative technologies to different audiences and never expand their market base. Then as customers themselves switch technologies, companies are left with decreases in market share, and no future. Unfortunately Christensen's data only extends to 1993. Over the past twenty years, there have been some remarkable changes. Also, the devil's advocate question arises: if timesharing, data processing, supercomputers, server farms, grids and clouds are all just the same thing, hasn't there always been a market segment that is looking for the same performance curve? Another puzzling question arises: if established companies were able to introduce models that were competitive even after a few years of emerging companies establishing themselves, why did they go bust? FAWN: A Fast Array of Wimpy Nodes FAWN is a storage architecture that couples embedded processors with local flash storage. Data centers consume huge amounts of energy, disks are inherently bad at random accesses, and while there are DRAM-based alternatives, the power/size ratio is terrible. "Flash is significantly faster than disk, much cheaper than the equivalent amount of DRAM, and consumes less power than either." FAWN-KV is implemented on top of this new hardware, a distributed key-value data store that uses an append-only log providing strong consistency via chain replication. Flash provides fast and efficient reads, but writes are very expensive, which has motivated previous work on log structures. FAWN keeps a hash index in memory, where entries point into the data log. Data log entries consist of (key, length, data) tuples. When a write occurs, a new entry is appended to the data log and the hash index is updated. When a block is full, a scan operation clears previous incarnations of data. This works as long as the memory on a node is large enough to store all of the data, but writing pages to disk, or rather flash, during the scanning process can be slow. Addressing failure rates for flash-based devices would be another interesting cost comparison in addition to the energy cost comparisons presented in the paper. From: Jason Croft [croft1 SpamElide] Sent: Thursday, February 24, 2011 10:44 AM To: Gupta, Indranil Subject: 525 review 02/24 Hi Professor Gupta, Below is my review for the 2/24 papers on storage.
Thanks, Jason Croft ---------------------------------------------------- A Case for Redundant Arrays of Inexpensive Disks (RAID) The goal of RAID is to improve on the growing gap between the speed of CPUs/memory and single large expensive disks. While the capacity of these disks doubles every three years, seek times only improve at a rate of 7% per year, and as Amdahl's Law shows, much of the speedup in CPUs would be lost. To improve on the seek and rotation delays, as well as fault tolerance, the authors propose using extra disks containing redundant information. The authors note that without this fault tolerance, an array of inexpensive disks is too unreliable to be useful. In addition, Patterson et al. describe five different organizations of an array of disks, called RAID levels. The cost of reliability varies with each level, between 100% and 4%, or between 50% and 96% usable storage capacity. Level one mirrors disks; that is, all disks are duplicated. With an appropriate number of controllers, this scheme can reduce average read seek time by 45%. Level two uses parity disks to detect errors on a group of disks where bits of data are interleaved across the group. It provides better performance per disk than level one, and is better suited for supercomputers. Level three uses only one check disk, increasing performance over level two since there are fewer check disks. It provides reliability at the lowest cost. Level four uses parallelism to improve small transfers by exploiting the bandwidth of the entire array. In level five, no dedicated check disk is used; rather, the check information is distributed across the disks in the array. One advantage of this organization over the previous levels is that it can perform multiple individual writes per group. While the authors' evaluation of each RAID level, and the strengths/trade-offs of each, is quite thorough, an evaluation of their implementation in hardware or software would be an interesting follow-up to this proposal. At the least, it could have shown the effect of caching, which would likely have been an improvement over their analysis, since in this respect it is worst-case. There is also an assumption that disk failures are exponential and random, which may not always be the case. FAWN: A Fast Array of Wimpy Nodes FAWN provides low-power, fast, and cost-effective access to large data by exploiting the observation that power consumption grows super-linearly with speed (i.e., low-frequency CPUs can provide over 1 billion instructions per Joule, compared to 1 million for multi-GHz processors). By using low-power embedded CPUs, FAWN reduces the amount of energy needed for key-value queries by two orders of magnitude over a traditional disk-based cluster. Flash is used for storage since it is cheaper than DRAM and consumes less power than DRAM or disks. FAWN's architecture consists of a key-value store that is organized into a ring and uses consistent hashing. The system uses a log-structured per-node datastore and chain replication for fast failover and fast node insertion. The datastore is optimized for flash, e.g., all writes are sequential and reads require a single random access, by maintaining a hash table mapping keys to the log in a similar manner to an append-only filesystem. Replicating data is accomplished by mapping the chain replication onto the hashing ring, so that each node becomes a member of several different chains.
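A small sketch of the chain-replication behavior described above, under simplifying assumptions (in-memory dicts stand in for per-node datastores, a hash of the key stands in for the ring-based replica mapping, and the node names and replication factor are made up): a put enters at the head of the key's chain and is acknowledged at the tail, while gets are answered by the tail, so a read never observes a write that has not reached every replica.

import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]   # hypothetical back-ends
R = 3                                                        # replication factor
stores = {n: {} for n in NODES}                              # each node's toy datastore

def chain_for(key: str):
    # Simplified stand-in for the ring: hash the key to a starting position
    # and take the next R nodes as its replica chain (head first, tail last).
    start = int(hashlib.sha1(key.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(R)]

def put(key: str, value: bytes):
    chain = chain_for(key)
    for node in chain:                # the write propagates head -> ... -> tail
        stores[node][key] = value
    return chain[-1]                  # the acknowledgement comes from the tail

def get(key: str):
    tail = chain_for(key)[-1]         # reads are served by the tail replica
    return stores[tail].get(key)

put("user:42", b"profile-bytes")
assert get("user:42") == b"profile-bytes"

Because a physical node appears at several points on the ring, it plays head for some chains and tail for others, which is the property the last sentence above refers to.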
The authors provide a very thorough evaluation of their implementation, and show significantly reduced power consumption for queries compared to disk-based systems. Interestingly, a considerable amount of the power consumption in their design comes from network switches. In addition, they show their system provides consistent query latencies, e.g., accesses vary between only 300us and 1ms, even during a split, and 90% of requests take less than 2ms. However, despite this detailed evaluation, the authors seemed to skim over failure detection, and a few corner cases in this area are left to future work. Additionally, I wonder if write longevity of the flash storage could be an issue in this design compared to disk-based clusters. Failure rates could possibly be higher, since flash memory has a limited number of write cycles (on the order of 100k to 1M). From: wzhou10 SpamElide Sent: Thursday, February 24, 2011 10:17 AM To: Gupta, Indranil Subject: CS525 Review 02/24 CS525 Review 02/24 Wenxuan Zhou Review on "A Case for Redundant Arrays of Inexpensive Disks (RAID)" Core idea The fact that the development of I/O cannot catch up with that of CPUs and memories bottlenecks the overall performance of computers. This paper presents a case for redundant arrays of inexpensive disks (RAID) to replace a Single Large Expensive Disk (SLED) as one solution to the above problem. RAID is able to achieve good performance, scalability, and low power consumption, while maintaining acceptable reliability. Pros 1. The approach of replacing one complex and expensive system (SLED) with multiple simple and cheap ones (RAID) is a clever way to overcome a technical barrier. Also, this approach provides flexibility to users; that is, most of it can be implemented in software, without requiring special hardware. 2. RAID uses redundancy to ensure reliability, and thus has to sacrifice performance to some extent. This paper discusses the tradeoff between performance and reliability associated with the five different levels. The discussion is thorough and educational, and convincing because of the supporting benchmarks. Cons 1. There are some assumptions in the failure analysis. However, there might be cases (though rare) where these assumptions don't hold, for instance, correlated failures among disks. It'd be better if the paper presented how often the cases violating those assumptions happen. The distribution of failures is assumed to be exponential and independent. As the authors admit, this is mainly for the sake of simplicity and due to lack of experience. 2. Another shortcoming is that no real-world experiment is shown. Review on "FAWN: A Fast Array of Wimpy Nodes" Core idea The paper presents a new cluster architecture, FAWN, for low-power data-intensive computing. Because power consumption scales quadratically with clock speed, FAWN replaces one brawny machine with many wimpy ones while maintaining the level of performance but with a much lower energy consumption. Besides, the implementation of a key-value store, FAWN-KV, over FAWN is another key point: it adopts a Chord-like idea and uses a hierarchical structure to balance workload. Pros 1. The idea of FAWN is simple, but it's able to achieve performance comparable to traditional systems. It shares one similar idea with the RAID paper, using a bunch of "wimpy" components to replace a "brawny" one, but paying more attention to the energy consumption issue. 2. The evaluation is solid and convincing, and the related work section is educational, giving a clear picture of previous work.
I also like some of the interesting data mentioned in the paper, such as the fact that the ratio of puts to gets in Facebook's workload is 1:6. Cons 1. It's not clear how FAWN deals with failures, since not many details are revealed, especially for backend nodes. I also wonder if FAWN has effective methods to handle concurrent failures. So I'm not sure about the reliability of FAWN. 2. FAWN is designed to solve the problem of large-scale data-intensive applications, but only small-scale (21 nodes) benchmarks are presented as evaluation. This makes the paper a little less persuasive. Thanks. Wenxuan From: Nipun Sehrawat [sehrawa2 SpamElide] Sent: Thursday, February 24, 2011 2:45 AM To: Gupta, Indranil Subject: 525 review 02/24 A Case for Redundant Array of Inexpensive Disks (RAID) This classic paper challenged the state-of-the-art SLEDs (Single Large Expensive Disks) of its time by following the opposite technique of using an array of inexpensive disks to achieve better performance, redundancy and scalability while lowering the total cost and power consumption. The authors first identify I/O as a potential bottleneck in overall computing performance (as governed by Amdahl's law) and define two broad usage patterns - a low number of I/O requests but large data transfers (supercomputing applications) or a high rate of random accesses (transaction-processing applications). They then present the concept of utilizing a large number of inexpensive disks collectively to improve the various performance parameters. However, using a collection of disks as a single logical disk decreases the MTTF (Mean Time To Failure) of the logical disk. To address this, some disks in the collection (called "Check Disks") are used to store redundant information for error detection/recovery. The paper presents five levels of RAID, a higher level being generally designed to overcome some shortcoming of a lower level. All levels follow the same conceptual design - the array is broken into "reliability groups", each having multiple data and check disks. The authors define measures such as "usable storage capacity percentage" and "reliability overhead cost" to quantify the trade-offs involved in following the RAID approach. First level RAID is rather naive and maintains one replica per data disk. Second level RAID distributes the data in a bit-interleaved fashion among the data disks in a group and maintains check disks per group for error correction. This makes it suitable only for large data transfers, and thus for supercomputing applications. However, a larger number of check disks is required in the second level, which is the motivation for the third RAID level. The number of check disks is reduced to one per group, relying on the disk controllers to detect disk failures. However, as accessing a data block needs data transfer from all the disks in a group, small I/O is expensive. Fourth level RAID distributes data at the block level, so that small reads can happen in parallel. However, writes still have to happen sequentially, as each write needs access to the check disk to change the parity. Fifth level RAID removes the bottleneck posed by the check disk in the previous level by distributing the parity information among all the data disks of a group. Now reads, as well as writes (to an extent), can occur in parallel, making fifth level RAID suitable for both supercomputing as well as transaction-processing applications. Pro: - The paper provides an incremental solution, identifying problems in the naive first level RAID and solving them incrementally.
- RAID was able to achieve performance, scalability, redundancy and cost gains over SLEDs. - Data striping and parity support can also be provided in software, obviating the need for specialized hardware. Cons: - Usable storage capacity of the disk array can be low (only 50% in first level RAID). FAWN: A Fast Array of Wimpy Nodes FAWN is a storage architecture built using nodes with low-frequency processors coupled with flash storage, primarily to reduce power consumption per query and improve I/O latencies. It is targeted to be a backend for key-value stores characterized by a large data set and a high query rate. The paper also presents FAWN-KV, which is a key-value store built on top of FAWN. FAWN chooses flash drives over conventional disks to decrease the response time of queries (disk access time is dominated by seek time). It chooses nodes powered by 500 MHz processors, which bring significant improvements in power consumption, and high processing power is not required to handle the I/O load. Each FAWN node maintains a FAWN-DS datastore. The nodes participating in the FAWN-KV key-value store are arranged in a ring and data is partitioned using consistent hashing. Each physical node is present as multiple virtual nodes in the ring. As small writes are expensive on a flash drive, FAWN-DS maintains a log-based key-value store. Writes are sequential (written at the tail of the log) and thus can be combined easily. Each node has sufficient DRAM to hold a hash table providing the mapping between the keys present at the node and the most recent value of each key in the log. Periodic checkpointing of the hash table is done to allow quick recovery from a failure. As a new node arrives, its neighbor splits its log and passes on the part of the log containing the new node's keys. A merge of logs is done when a node leaves. The system is divided into front-end nodes (which receive queries and maintain information for zero-hop routing) and back-end nodes (which store the actual data). Two-level caching is employed to tackle skew in the query distribution. Chain replication is used for replication and consistency. Pros: - Well suited for the key-value access pattern on large datasets, an important pattern nowadays. - Significantly reduces power consumption per query. Cons: - Flash drives have limited write cycles. Usage in workloads with a high write percentage might lead to a very short lifespan for the SSDs. From: trowerm SpamElide on behalf of Matt Trower [mtrower2 SpamElide] Sent: Thursday, February 24, 2011 2:27 AM To: indy SpamElide Subject: 525 review 02/24 How Can Great Firms Fail? C. M. Christensen uses the narrative of the hard disk industry from its inception in the early 50's at IBM up to the mid 90's to describe what makes companies succeed. He uses data from both established companies and startups regarding their introduction of new technologies. Interestingly enough, the reason for so many companies' failures wasn't the speed of technological change. Three trends were established after analyzing the transition of hard drives from 14 inch disks down to under 2 inches. The first is that the big leaps in technology are seldom technological leaps and more often unique use cases not previously possible for economic reasons. This allowed entrant firms to establish themselves before older companies offered the new products. Second, big companies innovate and are able to push the limits of technology in the direction that customers are interested in.
The final and most important point is that customers demand this continual increase in performance so much that established firms become encumbered in their path and are unable to adopt new leaps in technology. This opening chapter provides insight into what entrant and established firms do well. The author does not answer how they do these things well in this chapter though. Furthermore, it seems that the storage industry no longer operates as it once did. Finding industries which are similar to the storage industry of old could help entrepreneurs find the next big gap. RAID RAID is the answer to disproportionate advances in the performance of computer components. It is observed that while CPU and main memory speeds have increased steadily, secondary storage has remained stagnant in its speed. RAID is an attempt to increase throughput performance by sacrificing capacity, which has increased steadily. RAID uses cheap disks which have similar throughput characteristics to the high-end disks but much shorter MTTFs. By combining all the disks, the throughput can scale linearly, assuming an even distribution of reads across disks. In order for this performance boost to be meaningful, reliability must be kept within normal ranges. This is done by introducing redundant check disks. This trade-off of capacity for throughput can take many shapes. The paper presents five levels, or arrangements, for RAID. Level 1 is a direct copy backup, level 2 error correction with Hamming codes, level 3 a bit-level parity disk, level 4 a sector-level parity disk, and level 5 distributes the role of the check disk across all data disks. RAID brings high performance at low cost. Over time RAID has become more prevalent in the consumer market. Technology has changed some of the initial assumptions in the paper. The growing size of hard disks has increased the repair time for disks. RAID also does not scale down well. For end users looking for a performance boost the costs of reliability become extreme. Finally, RAID requires disks to be homogeneous. The exponential curve of disk capacity makes replacing failed disks with equal-capacity drives unattractive. From: david.m.lundgren SpamElide on behalf of David Lundgren [lundgre4 SpamElide] Sent: Thursday, February 24, 2011 1:46 AM To: Gupta, Indranil Subject: 525 review 02/24 A Case for Redundant Arrays of Inexpensive Disks (RAID) In A Case for Redundant Arrays of Inexpensive Disks (RAID), Patterson et al. describe five RAID setups and argue for RAID's adoption through a cost-performance evaluation of RAID in relation to single large expensive disks (SLED). The authors frame their argument by positing that with the rapid increase in processing and memory performance, input/output (read: disk performance) will become the bottleneck for computing. Moreover, by Amdahl's Law, this computing bottleneck will be substantial. Patterson et al. build upon earlier arrays-of-inexpensive-disks solutions by introducing redundancy to such arrays. PROS: - The authors' evaluation of the RAID levels based on data usage (i.e. RAID 3 for scientific computing vs. RAID 4 for merchant/transactional computing) was a practical consideration. - I thought the division of error redundancy vs. performance boosting was a wise practical distinction when evaluating RAID levels. - The paper presented a prophetic view of disk arrays becoming the norm. CONS: - The authors fail to examine the time to rebuild/recover a drive after failure. - I would have liked to have seen a more detailed examination of hardware vs.
software RAID controllers. - I did not notice the paper addressing concurrency issues. - The independence assumption for disk failures may not always hold (particularly given that many disks in an array have similar characteristics and usage statistics). How does the MTTF change when this independence assumption is removed? The Innovator's Dilemma, Chapter 1 Christensen examines the disk industry from 14 inch drives to the advent of the 1.8 inch drives. He introduces two categories for technological innovations: sustaining technologies and disruptive ones. Sustaining technologies are not defined as modest improvements on existing products, but rather as improvements that maintain the rate of performance improvement that customers expect. Disruptive technologies are paradigmatically different from sustaining technologies, and are characterized by the creation of new markets. PROS: - The sustaining/disruptive model that Christensen introduces is extraordinarily simple and elegant. - Christensen's warning of managers and firms being "held captive by their customers" is counterintuitive, but, given the data and the potential consequences, is extremely important. - The observation that established firms tend to lose insight into identifying new markets and applications leads one to imagine novel business strategies with a greater emphasis on this aspect. CONS: - It seems unlikely that it is truly possible to partition all innovations into two classes. It seems more likely that innovations lie on a spectrum between the two. - I disagree with saying a technology is innately disruptive or sustaining. This seems akin to claiming that one can objectively classify inventions as inherently "good" or "bad" in a social context. From: kevin larson [kevinlarson1 SpamElide] Sent: Thursday, February 24, 2011 1:11 AM To: Gupta, Indranil Subject: 525 review 02/24 Christensen's chapter on hard drive manufacturers demonstrates the remarkable rate at which firms enter and leave the market. Initial explanations attributed this to the blistering pace at which hard drive manufacturing moved forward. It was even likened to trying to stay atop a mudslide. Upon extensive analysis, it became evident this was not the cause of the firms' failure rates. Firms managed to increase data densities and quickly adopt new manufacturing techniques and advances, staying on pace with their markets. They had a remarkable ability to entertain the wishes and opinions of their markets. So much so, that it actually was the root cause of their downfall. The clients of the manufacturers of 14" disks made it clear their priorities lay in maximum storage per dollar, and that they were not interested in smaller disks. The manufacturers followed this and largely ignored the growing market for 8" drives. The 8" drive production capacity grew to the extent where the storage per dollar dropped below that of 14" drives, and eventually the 14" market collapsed. Then the 8" drive manufacturers fell into the same trap as those of the 14" drives, ignoring or joining the 5.25" market far too late. This trend continued, generation after generation, with only a few of the later-generation manufacturers surviving the incoming markets. Christensen offered a unique view of the hard drive market. Before his findings, most theories seemed to hold that it was just the pace of improvement and technological advances that buried the older firms. He showed that it was in fact the decreasing form factors that drove the older manufacturers out.
His evaluation strongly demonstrated the trends he had observed. Christensen's evaluation left a bit to be desired on the topic of how the firms which successfully survived a transition managed to do so. There was mention of how Seagate survived the transition from 5.25" to 3.5" drives, but there was nothing on how the other firms survived the change. Patterson et al. demonstrated that the rate at which hard disk seek speeds were growing was far less than that of processor speed, memory capacity, and hard disk capacity. They (accurately) predicted that SLEDs were soon to be a thing of the past and that RAID would be the way of the future. At the time of publication, drive MTTFs were reasonable for single drives, but when hundreds or thousands of drives were grouped together, the frequency of failures was quite high. In order to counteract this problem, some sort of redundancy was necessary. RAID1 proposed the mirroring of a drive. This allowed for considerable increases in MTTF, but at the expense of capacity as well as performance. RAID2 improved upon some of the flaws of RAID1 by using Hamming codes to recover the data from a failing drive in a more storage-efficient way. While RAID2 managed to utilize more of the disk space than RAID1, it had abysmal performance with small reads and writes. RAID3 was born in an attempt to remedy some of these problems. The redundancy of RAID2 was unnecessarily high, and RAID3 was able to reduce the number of parity drives to further increase the usable storage capacity of the drives as well as increase read/write performance over RAID2. This was successful, but it still did very poorly with small reads and writes. RAID4 used a different arrangement of parity. It grouped data so parallel accesses could be made efficiently. This greatly improved small reads (over 1000% improvement over L2/3), but the small write performance still needed serious improvement. By striping the parity drive across all the disks, RAID5 finally brought the small write performance to a reasonable level. While small writes/RMW were still below those of RAID1, large transfers, RMW, and storage utilization were increased by nearly a factor of two. In the present day, RAID2-4 don't get much attention. The evolution of RAID1 to RAID5 was well documented and evaluated. The methodology and evaluation were very easy to follow, and the future work mentioned in the conclusion was incredibly detailed. One point that would have been nice to hear more about was the recovery after drives fail. Questions of how the regeneration of lost data results in downtime or limited use of the RAID were not covered. From: Qingxi Li [cs.qingxi.li SpamElide] Sent: Thursday, February 24, 2011 12:23 AM To: Gupta, Indranil Subject: 525 review 02/24 FAWN: a fast array of wimpy nodes This paper introduces FAWN, a distributed key-value storage system based on low-power embedded CPUs and small amounts of local flash storage. FAWN is intended for systems in which 1) there are a lot of I/O operations but little computation, 2) most operations are independent, and 3) the total size of the stored objects is large but each object is small. FAWN provides high performance at low power cost. The whole process is that the client generates a query and sends it to a front-end, the front-end sends it to the corresponding back-end, and the back-end does the lookup and sends back the answer. Each back-end node uses FAWN-DS, which is an append-only, log-structured per-node datastore.
FAWN-DS includes an in-DRAM hash index for the log file, with one bit per entry indicating validity. For a delete operation, the back-end node invalidates the hash entry and appends a delete entry to the log for fault tolerance. For an update operation, the back-end node just appends the new data to the log file and points the hash index at the new data. Both of these operations waste a lot of space, which is only freed by reconstructing the log. However, if the log is reconstructed frequently, the I/O cost is large, and if it is reconstructed infrequently, a lot of disk space is wasted. The FAWN back-ends use consistent hashing with a ring structure. To balance load, each physical node can host multiple virtual nodes. To tolerate storage failures, each data item is replicated on all the nodes of a chain, and when the item is updated, all the nodes in the chain need to update their copies. For looking up data, the front-end forwards the query to the head of the chain and the head node forwards it to the tail node, which wastes a lot of time. I don't know why the authors want to do this. In fact, the head node could directly give the response, or the front-end could directly forward queries to the tail node. To avoid hot spots, the front-end nodes cache records. To detect back-end node failures, the front-end periodically sends heartbeat messages to the back-ends. Besides this, there is a management node which allocates back-end nodes to a front-end node when the front-end node joins. This gives the whole system a single point of failure and also reduces its scalability. RAID: a case for redundant arrays of inexpensive disks. This paper introduces RAID, which is based on magnetic disk technology and offers an alternative to SLED, promising an order of magnitude improvement in performance, reliability, power consumption and scalability. RAID builds the I/O system as arrays of inexpensive disks over the SCSI interface. To solve the unreliability problem, the authors give five levels of RAID. When a disk fails, a replacement disk is switched in electronically; however, a human operator needs to replace the failed disks periodically, and as more and more disks fail, the performance of the whole system decreases. 1st level: mirrored disks. Each data disk has one check disk. This RAID has a very long MTTF (mean time to failure), more than 500 years, far beyond any useful system lifetime. Of course, the cost is large and the time for writing and reading is doubled. 2nd level: Hamming code for ECC. This level of RAID is inefficient for small writes and reads, since a small write needs to read the whole group of sectors and then write it back, and a small read at least needs to read the whole group. Besides this, most of the check disks are used only for finding out which disk has failed, which seems wasteful. 3rd level: single check disk per group. In this level of RAID, each group has one check disk, which uses a parity code for recovery. However, if two or more disks in a group fail at the same time, the whole system fails. This level of RAID also doesn't support parallel reads and writes. 4th level: also uses a parity code, but instead of the whole group sharing one check unit, the data is interleaved at the sector level to achieve parallel reads, so that sector 1 of each data disk shares one check sector on the check disk. 5th level: also uses a parity code, but, as in the figure given by the authors, the check sectors rotate across the disks: sector 0 uses disk 5 as its check disk, sector 1 uses disk 4, and so on. This level of RAID supports the most parallel writing. However, it does not support parallel writes if I want to write sector 0 on disk 4 and sector 1 on disk 2, since both of these writes need to write to disk 4.
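A tiny sketch of the rotated-parity bookkeeping described just above, assuming the same illustrative numbering (five disks, parity for stripe 0 on disk 5, stripe 1 on disk 4, and so on; this mirrors the review's example, not necessarily the paper's exact figure):

N_DISKS = 5

def parity_disk(stripe: int) -> int:
    # Stripe 0 -> disk 5, stripe 1 -> disk 4, ... wrapping around.
    return N_DISKS - (stripe % N_DISKS)

def disks_touched_by_small_write(stripe: int, data_disk: int):
    # A small write updates the target data disk plus that stripe's parity disk.
    return {data_disk, parity_disk(stripe)}

# The conflict from the review: both writes need disk 4, so they serialize.
w1 = disks_touched_by_small_write(stripe=0, data_disk=4)   # {4, 5}
w2 = disks_touched_by_small_write(stripe=1, data_disk=2)   # {2, 4}
assert w1 & w2 == {4}

# Writes whose data and parity disks are all distinct can proceed in parallel.
w3 = disks_touched_by_small_write(stripe=2, data_disk=1)   # {1, 3}
assert not (w1 & w3)

Because the parity location rotates, most pairs of small writes land on disjoint sets of disks, which is why level 5 recovers much of the small-write parallelism that a single dedicated check disk destroys.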
From: lewis.tseng.taiwan.uiuc SpamElide on behalf of Lewis Tseng [ltseng3 SpamElide] Sent: Wednesday, February 23, 2011 11:32 PM To: indy SpamElide Subject: 525 review 02/24 CS 525 Review 02/24: Storage - 1 A Case for Redundant Arrays of Inexpensive Disks (RAID), D. Patterson et al, SIGMOD 1988 The paper identified the limitation of single large expensive magnetic disks (SLED) due to seek and rotation delays. Instead, the paper proposed a scheme called redundant arrays of inexpensive disks (RAID) that can improve price-performance and reliability. Five different schemes of RAID were proposed, examined, and compared with each other in terms of different metrics, such as MTTF (mean time to failure) and overhead cost. The simplest and most expensive option is to mirror all the data to backup disks. The second scheme is to store error-correction codes on extra disks (called parity or check disks). The third scheme is based on the observation that disk controllers are able to detect if a disk fails, so only one redundant disk storing parity bits is necessary. Therefore, the overhead cost is reduced from the second scheme to the third. The fourth scheme exploits parallelism by storing a whole transfer unit in a single sector. In this way, the overhead remains the same, but the efficiency of small reads increases a lot. The fifth scheme focuses on improving write operations by removing the "single" check disk and allowing every disk to store some redundant sectors. In this way, because multiple individual writes per group are possible, the efficiency of small writes and R-M-W (read-modify-write) increases. The idea of RAID, which exploits redundancy and the place and method of storing it to decrease overhead and increase performance, is quite novel. First, it is very simple, and thus the analysis is convincing (though only safety and throughput are discussed). Second, the deployment can be in either hardware or software, which provides flexibility. Another contribution is the use of MTTF analysis, which is widely used in the field. Comments/questions/critiques: Would the heterogeneity of disks in a group affect the performance a lot? In the paper and the analysis, every disk is assumed to be the same. Wouldn't a significantly slower disk decrease the performance? In this case, what would be a better scheme, to identify and ignore the slower disk or to simply decrease the workload of the slower disk? What would be the scalability of RAID? How many disks can it support? In the paper, 25 disks form a group. What if we want to have more space? Should we just simply put more disks in a group? Or should we form a super-group that contains many groups? FAWN: A Fast Array of Wimpy Nodes, D. G. Andersen et al, SOSP 2009 This paper proposed a cost-effective cluster architecture for data-intensive computing that usually has its bottleneck at I/O. The cluster is called FAWN and is built around slower CPUs and flash memory to greatly reduce energy consumption (to 1/10 of a traditional architecture's level) while preserving roughly the same computation power and all the properties, such as availability and latency. FAWN consists of two main components: several front-end nodes that deal with client requests and perform the first level of caching, and many back-end nodes that actually serve the requests forwarded by the front-end nodes using their unique Data Store (DS) system (similar to a database, except that no transactional or relational functions are supported).
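A minimal sketch of this front-end/back-end split, assuming a simplified consistent-hashing ring with virtual nodes standing in for FAWN-KV's actual membership protocol (all names here are hypothetical): the front-end caches popular keys and forwards misses to the back-end that owns the key on the ring.

import bisect, hashlib

BACKENDS = ["be-1", "be-2", "be-3"]        # hypothetical back-end nodes
VNODES_PER_BACKEND = 4                     # each physical node appears several times on the ring

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

# Build the ring of (position, back-end) pairs, one entry per virtual node.
ring = sorted((_h(f"{be}#{v}"), be) for be in BACKENDS for v in range(VNODES_PER_BACKEND))

def owner(key: str) -> str:
    # The first virtual node clockwise from the key's hash owns the key.
    positions = [p for p, _ in ring]
    i = bisect.bisect(positions, _h(key)) % len(ring)
    return ring[i][1]

class FrontEnd:
    def __init__(self, backend_stores):
        self.cache = {}                     # first-level cache for popular keys
        self.backends = backend_stores      # back-end name -> its datastore

    def get(self, key):
        if key in self.cache:               # cache hit: no back-end traffic
            return self.cache[key]
        value = self.backends[owner(key)].get(key)   # miss: forward to the owner
        self.cache[key] = value
        return value

backend_stores = {be: {} for be in BACKENDS}
backend_stores[owner("k1")]["k1"] = b"v1"   # place the value at its owning back-end
assert FrontEnd(backend_stores).get("k1") == b"v1"

The second level of caching mentioned in other reviews would sit inside each back-end, in front of its flash datastore.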
FAWN-DS provides the usual hash-table functions and is based on an append-only log structure. Moreover, it is specially adapted to flash; for example, it adopts a strategy of semi-random writes. FAWN manages storage and key-value placement with a Chord-like DHT scheme based on consistent hashing and chain replication. One difference is that the front-end nodes help with membership management, so even though there is non-negligible overhead in maintaining several replication chains on each back-end node, joins and leaves can still be handled efficiently with help from the front-ends. One main contribution of the paper is recognizing the importance of reducing energy consumption by increasing instructions per joule. FAWN's idea of building the system out of less powerful hardware is also interesting. A second contribution is the hierarchical design that seamlessly integrates the Chord-like key-value layer with the log structure.
Comments/Questions/Critiques: While the benchmark results do show that FAWN outperforms a general-purpose architecture running Berkeley DB, the result would be more convincing with some theoretical analysis. Only benign failures are assumed; how would FAWN be adapted to tolerate more aggressive or even malicious nodes? Although joins and leaves can be performed in a reasonable amount of time with cooperation between front- and back-end nodes, what happens if the churn rate is relatively high (since the hardware is assumed to be "wimpy", a high churn rate seems plausible)? The overhead of copying logs and maintaining chains is an issue. Is the datacenter architecture the right layer at which to address energy consumption? Although the instructions-per-joule figure is shown to improve, this comes at a cost and the design may even require more instructions overall. The trade-off is still not clear, and I am not completely convinced that FAWN would really help and be adopted by industry.
From: Tengfei Mu [tengfeimu SpamElide] Sent: Wednesday, February 23, 2011 7:33 PM To: indy SpamElide Cc: Gupta, Indranil Subject: 525 review 02/24
Tengfei Mu (mu3)
1. A case for redundant arrays of inexpensive disks (RAID)
This paper presents a solution that uses extra disks containing redundant information (RAID) to recover the original information when a disk fails, in order to overcome the reliability challenge. Before this paper, the most popular storage technology was the Single Large Expensive Disk (SLED); inexpensive disks were less reliable and not useful without proper fault tolerance. With this motivation, the paper presents five levels of RAID and analyzes their cost/performance trade-offs. Specifically, level 1: Mirrored Disks (duplicates every disk); level 2: Hamming Code for ECC (uses check disks to detect and correct a single error, so it requires fewer disks than level 1); level 3: Single Check Disk Per Group (relies on the disk controller to identify the failed disk, so one parity check disk per group is enough to reconstruct its contents); level 4: Independent Reads/Writes (stores the data for one transfer on a single disk, enabling concurrent read/write operations across the disks in a group; however, the single check disk becomes a bottleneck); level 5: No Single Check Disk (spreads the check information across all the disks in the group instead of dedicating one check disk).
Pro: 1. High price-performance, since RAID uses several inexpensive disks; it provides high I/O bandwidth and reliability.
Con:
1. The paper contains many measurements, but none of them are done on a real system. 2. The authors assume that disk failures are independent, but this is not always true.
2. FAWN: a fast array of wimpy nodes
This paper presents FAWN, which pairs low-power nodes with flash storage to provide fast and energy-efficient processing of random, read-intensive workloads. Flash memory is used because it consumes less power, is cheaper than DRAM, and is faster than disk. FAWN tries to fill the gap between CPU and memory/storage performance. The two main components of the FAWN system are FAWN-DS and FAWN-KV: FAWN-DS is a log-structured per-node data store, and FAWN-KV is a consistent, replicated, and highly available key-value store that handles request routing on top of it.
Pro: 1. FAWN can reduce the power consumption of data-intensive computation running on a cluster.
Con: 1. The paper says little about failures, especially communication failures. 2. The random key distribution policy could lead to fairness problems.
From: mark overholt [overholt.mark SpamElide] Sent: Wednesday, February 23, 2011 6:03 PM To: Gupta, Indranil Subject: cs525 review 02/24
Mark Overholt CS525 02/24/2011
A Case for Redundant Arrays of Inexpensive Disks (RAID)
Summary: The main idea of this paper is to propose the RAID organization of hard disk drives. The motivation was the increasing speed and decreasing cost of CPUs and main memory: CPU and memory efficiency was improving so rapidly that the I/O performance of computer systems could not keep up. A Single Large Expensive Disk (SLED) was too slow and too fault-prone for large systems; the disk was the bottleneck in most systems. The idea behind RAID is to take many low-cost disks and cluster them together to improve both fault tolerance and speed. The paper proposes 5 different levels of RAID and evaluates them on 6 different metrics, testing reads, writes, and read-modify-writes on both large and small data blocks.
Discussion:
Pros: Nowadays RAID is a de facto standard in the infrastructure community, allowing flexibility in storage designs and uses. RAID disks use less power and cost less than traditional SLED designs. The different RAID levels allow flexibility in your design, depending on the type of I/O operations your organization uses most (although I rarely hear of any company using a RAID level other than RAID 5, or maybe RAID 10, which is not discussed in this paper).
Cons: RAID provides recovery against only single-bit errors or single-disk failures. And while RAID provides durability (data is recoverable after a drive failure), the data is not always available: the recovery (rebuild) process can take a long time after a drive fails, leaving that data unavailable until recovery completes.
FAWN: A Fast Array of Wimpy Nodes
Summary: This paper presents a new cluster architecture, FAWN, aimed at lowering the energy consumption of data-intensive computing. It uses flash storage because flash consumes less power and is cheaper than DRAM and faster than disk. FAWN uses low-power embedded CPUs to save power, to reduce the I/O-induced idle cycles that waste energy, and to execute more instructions per joule than traditional processors.
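Several of the reviews above describe FAWN-KV's replication in terms of chains of replicas, with a put entering at the head of the chain and being acknowledged only once it is fully replicated. The toy sketch below illustrates generic chain replication under those assumptions (writes acknowledged after the tail applies them, reads served by the tail); it is a sketch of the technique, not FAWN's implementation.

# Toy sketch of chain replication over a fixed chain of replicas, as the
# reviews above describe it. Names and the 3-node chain are illustrative.

class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}

class Chain:
    def __init__(self, names=("head", "mid", "tail")):
        self.replicas = [Replica(n) for n in names]

    def put(self, key, value):
        # Propagate head -> ... -> tail; the write is acknowledged to the
        # front-end only after the tail has applied it.
        for replica in self.replicas:
            replica.store[key] = value
        return True  # ack from the tail

    def get(self, key):
        # Reads go to the tail, which holds only fully replicated writes.
        return self.replicas[-1].store.get(key)

if __name__ == "__main__":
    chain = Chain()
    chain.put("k", "v")
    print(chain.get("k"))  # -> v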
The paper proposes FAWN-KV, a cluster-based key-value store, and FAWN-DS, a log-structured per-node data store; together they form a practical system built on the FAWN hardware. FAWN-DS supports basic operations such as Store, Lookup, and Delete and is designed especially for flash storage. It maintains an append-only log of data indexed by a DRAM-based hash table. To reduce DRAM usage, only i bits of each key are used to map keys to hash buckets, and as a consequence multiple reads may be necessary to find the correct key-value pair. Each FAWN-DS node can be associated with several unique virtual IDs, and the data is distributed over these virtual nodes according to the keys assigned to them; to support this architecture, FAWN-DS provides Split, Merge, and Compact operations. FAWN-KV uses consistent hashing to map the VIDs onto a ring structure, which ensures that the amount of data movement on a node join or leave is small. It has several front-ends that maintain membership lists with the key ranges assigned to them and ensure that each query reaches the correct node responsible for the key. FAWN-KV replicates the data stored on one node onto R-1 of its successors, and as a result it needs mechanisms to keep these copies consistent across node joins and leaves.
Discussion:
Pros: I like that they are working on a system that is not focused on performance alone, but on performance with respect to power consumption. Based on the results given, they seem to have achieved the goal they set out at the beginning of the paper.
Cons: The micro-benchmarks cover a prototype with total storage of about 80GB. Data centers will require much more storage; is this solution scalable to those needs?
From: Agarwal, Rachit [rachit.ee SpamElide] Sent: Tuesday, February 22, 2011 11:57 PM To: Gupta, Indranil Subject: 525 review 02/24
---- A Case for Redundant Arrays of Inexpensive Disks (RAID)
The paper suggests that the performance of CPUs and memories has two factors: (a) capacity (processors/storage) per unit area, and (b) I/O speed. The authors argue that while (a) has improved with advances in system design, (b) has been more or less stagnant and is therefore becoming the bottleneck in system design. They propose five different designs/implementations of RAID, each with its own set of application domains. The most interesting aspect of the paper is the use of varying designs for varying system requirements. For instance, they present customized designs for systems that need little storage but many transactions (read/write operations) and for systems that need a lot of storage but few transactions. The techniques used in the back-of-the-envelope analysis are also very interesting. While the paper presents interesting designs, it would have been even more interesting to incorporate the "systems" aspects into the analysis. I list some of these potential systems aspects below: 1. RAID would require system buses of higher capacity and will generate more control traffic, and hence potentially larger delays. While reliability is important, delay may be an important concern for many applications. 2. The estimates of the cost of RAID are rather optimistic. The real cost of the system also depends on implementation requirements, for instance more expensive system buses and packaging and cabling costs. Even rough estimates of these expenses would have been interesting. 3.
By the same token, the claims about the power-consumption advantages of RAID are not immediately convincing. The claimed improvements are not based on a real deployment/implementation of RAID, and the analysis would have benefited from including other system costs. I also wonder how recent theoretical work on improved erasure-code designs and reduced memory costs changes the analysis presented in the paper.
---- FAWN: A Fast Array of Wimpy Nodes
The paper presents an alternative architecture for energy-efficient data-intensive computation. The main design decisions are the use of embedded CPUs, flash storage, and a highly customized key-value storage system. In short, the paper convinced me that one can save energy and cost at the expense of latency. The paper is definitely interesting in that it presents (a) an interesting design point in the energy/cost/latency trade-off, with a clean architecture; (b) some interesting data points on the read/write costs of various systems; and (c) a key-value system with very interesting optimizations. I believe the design space of systems for data-intensive computation can be explored further. In particular, I can think of some questions that FAWN does not seem to answer: 1. By data-intensive computing, the authors really mean computing based on random-access reads. This may not be entirely accurate: as soon as data-intensive tasks start requiring many writes, the performance of FAWN will, in my opinion, deteriorate significantly. 2. By using wimpy nodes and flash storage, the authors achieve low cost and low energy consumption, but they seem not to have cared much about latency. Both their architecture and their storage system seem to incur much higher latency (a significantly lower number of queries per second) compared to other data-intensive computing systems (if I am not already too sleepy!). It would be interesting to explore this three-way trade-off and maybe design a more flexible architecture with more knobs for tuning system performance. 3. Using a mix of wimpy nodes and flash storage also raises the question of system performance in the presence of churn/failures. It would have been very interesting to evaluate this over a realistic workload. 4. Finally, I'll also put in what I tried to avoid: scalability. It is not really clear how scalable the system is; neither the discussion nor the simulation results stop me from claiming that FAWN does not seem to scale to current data-center sizes.
---- Best regards, Rachit -- Siebel Center for Computer Science, University of Illinois at Urbana-Champaign, Web: http://agarwa16.wikidot.com ---- Sing as if No one is Listening, Dance as if No one is Watching, Dream as if You'll Live Forever, Live as if You'll Die Today !!
From: Tony Huang [tonyh1986 SpamElide] Sent: Monday, February 21, 2011 12:14 PM To: Gupta, Indranil Cc: Tony Huang Subject: 525 review 02/24 Attachments: Review 04.txt
RAID:
Core Idea: The authors introduce a mechanism that uses multiple commercially common hard disks to achieve performance comparable to expensive high-performance disks. It introduces the seminal idea of using disk-level parallelism to increase throughput. Reliability issues are addressed by the various schemes proposed, including full mirroring, Hamming codes, etc. The system is evaluated using MTTF (mean time to failure) and is shown to be comparable to a SLED.
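The MTTF argument referred to here can be reproduced in the paper's back-of-the-envelope style: a group of G data disks and C check disks is lost only if a second disk in the group fails before the first failed disk has been repaired, which gives roughly MTTF_group = MTTF_disk^2 / ((G + C) * (G + C - 1) * MTTR), where MTTR is the mean time to repair a disk. A small sketch with made-up numbers (not the paper's):

# Back-of-the-envelope group reliability in the style of the RAID analysis:
# the group is lost only if a second disk fails before the first failed
# disk has been repaired. The numbers below are purely illustrative.

def group_mttf_hours(mttf_disk, data_disks, check_disks, mttr):
    # MTTF_group ~= MTTF_disk^2 / ((G + C) * (G + C - 1) * MTTR)
    n = data_disks + check_disks
    return mttf_disk ** 2 / (n * (n - 1) * mttr)

if __name__ == "__main__":
    hours = group_mttf_hours(mttf_disk=30_000, data_disks=10, check_disks=1, mttr=1.0)
    print(f"{hours / (24 * 365):.0f} years")   # about 934 years with these numbers

Even with pessimistic repair times the estimate comes out in centuries rather than years, far longer than the lifetime of any single inexpensive disk.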
Pros:
* Uses multiple inexpensive disks to achieve high performance.
* Low in cost; widely accessible.
* Scales up gracefully as more disks are added to the system.
Cons:
* Does not explore the effect of hard-disk bus bandwidth limits on the system.
FAWN: Fast Array of Wimpy Nodes
Core Idea: The authors introduce an innovative architecture for building a distributed key-value store out of flash storage and less powerful mobile-class chips. The main purpose of this architecture is energy efficiency. The system is optimized to fit the characteristics of flash memory, namely:
* Fast random reads.
* Efficient I/O.
* Slow random writes.
The way FAWN exploits these properties is by using a log-structured key-value store on the flash storage, indexed by an in-memory hash table in high-performance DRAM. Writes are appended to the end of the log; a read looks up the index to find the offset of the current data. Split, merge, and compact operations on the data are implemented in the system. A key-value system is built on top of this data store. The nodes are organized into a ring, much like Chord, except that it uses zero-hop routing instead of log(n)-hop routing. Request routing is done by the front-end nodes. Data is replicated across 3 nodes on the ring, and a write is committed only once it has reached all 3 replicas. Joining and leaving operations are also explored. The evaluation shows that the log-structured store achieves much higher performance on flash storage than a traditional high-performance database while being much more energy efficient.
Pros:
* Develops a storage system that is much better suited to running on flash.
* Uses low-power processors to achieve energy efficiency.
Cons:
* No profiling or analysis of FAWN's performance compared to a normal disk-based cluster is presented.
Opinion: This paper points out an exciting direction for building clusters out of new storage technologies. I would be interested to see a performance comparison between this system and traditional machines, and an analysis of the performance bottlenecks in this system.
-- Regards -- Tony
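Finally, the log-structured store described in the FAWN reviews above (append-only writes on flash, an in-memory hash index from keys to log offsets, deletes recorded as tombstone entries, and a compact operation that reclaims dead space) can be illustrated with a toy in-memory sketch. The class and its methods are mine, chosen to mirror that description, and are not FAWN-DS's actual code.

# Rough sketch of an append-only, log-structured key-value store with an
# in-memory index, in the spirit of the FAWN-DS description above. The log
# is a Python list standing in for a file on flash.

class LogStore:
    def __init__(self):
        self.log = []      # append-only list of (key, value, is_delete) entries
        self.index = {}    # key -> offset in the log of the latest live entry

    def store(self, key, value):
        # An update just appends and repoints the index; the old entry
        # becomes dead space until compaction reclaims it.
        self.index[key] = len(self.log)
        self.log.append((key, value, False))

    def lookup(self, key):
        off = self.index.get(key)
        return None if off is None else self.log[off][1]

    def delete(self, key):
        # A delete appends a tombstone (kept so the store could be rebuilt
        # after a crash) and drops the index entry.
        self.log.append((key, None, True))
        self.index.pop(key, None)

    def compact(self):
        # Rewrite only the live entries; compacting often costs extra I/O,
        # compacting rarely wastes space.
        new_log, new_index = [], {}
        for key, off in self.index.items():
            new_index[key] = len(new_log)
            new_log.append(self.log[off])
        self.log, self.index = new_log, new_index

if __name__ == "__main__":
    db = LogStore()
    db.store("k1", "v1")
    db.store("k1", "v2")    # update: append + repoint the index
    db.delete("k1")
    db.store("k2", "v3")
    db.compact()            # only the live entry for k2 survives
    print(db.lookup("k2"), db.lookup("k1"))   # -> v3 None

The sketch also makes the compaction trade-off raised earlier visible: every update or delete leaves a dead entry in the log, and only compact() reclaims that space, at the cost of rewriting the live data.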