From: anjalis2006 SpamElide on behalf of Anjali Sridhar
Sent: Tuesday, April 26, 2011 12:15 PM
To: Gupta, Indranil
Subject: 525 review - 04/26

ACMS: The Akamai Configuration Management System, Sherman et al, NSDI 2005.

The Akamai Configuration Management System (ACMS) delivers configuration information to servers, providing a reliable, fault-tolerant, and asynchronous delivery method. Many web site providers use third-party content delivery networks (CDNs) to increase the availability of their web pages. To ensure that these customers retain control over how the CDN distributes their content, configuration information is disseminated to all servers of the CDN hosting the content. ACMS is split into two parts: the front end, which handles the acceptance and storage of configuration information, and the back end, which handles the delivery of these updates. The front end is a set of machines called Storage Points (SPs). The Akamai servers use a pull-based approach for receiving the latest configuration information from the SP machines. The SPs use a quorum-based approach and hence require only a majority of the SPs to be in working condition to decide whether an update should be disseminated to all the Akamai servers. The SPs use the Vector Exchange algorithm to decide on a submission. If a majority agree, the accepting SP replies with an "Accept" message to the publisher (the application sending the new configuration information). Download Points are additional sites from which the new configuration information can be downloaded. Recovery is performed using Index Merging: the group index file and the root index file are used to construct the tree that represents the latest snapshot of the SP.

It would be interesting to note the performance when the number of SPs is increased. There is a visible tradeoff between the reliability of using more SPs and the overhead of the vector messages. The quorum required can be chosen based on the consistency that is required; how often can this be varied? It seems that if the number of SPs is low and there are outages, configuration updates can be refused more frequently than necessary. The paper does mention an outage between a pair of SPs located in different Tier-1 networks. The pull-based approach of constantly querying for updates could be tuned per group index file: files that are updated more frequently could be polled at a higher rate.

Memcached

Memcached is a system used to reduce the load on databases, a well-known bottleneck for the performance of many systems. Though the actual data may be stored on a number of machines, the cache appears as one machine to the user. It is a distributed memory caching system where data is stored as key-value pairs. There is no predetermined format for keys; the user deploying it decides on one. If Memcached is used, the application first queries Memcached for the value corresponding to the key; if it is not present, the database is queried and the retrieved value is stored in the cache (a sketch of this lookup pattern follows below). Memcached uses non-blocking networking libraries to ensure that it can take heavy loads. Memcached provides open-source software for building efficient object caching systems that improve the retrieval performance of the database. The high availability and ease of use make it an attractive software choice for any developer. How does Memcached handle complex queries for the database? If a single value is associated with each key, a complex query might take multiple operations involving Memcached.
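[Editor's sketch] A minimal sketch of the look-aside pattern described above, assuming the pymemcache client library; the database helper is a hypothetical stub standing in for a real query.

    # Cache-aside lookup: try memcached first, fall back to the database on
    # a miss, then repopulate the cache. Assumes the pymemcache client.
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def get_user_from_db(user_id):
        return b"row-for-user-%d" % user_id   # hypothetical stand-in for a DB query

    def get_user(user_id):
        key = "user:%d" % user_id             # app-chosen key format (e.g. name + id)
        value = cache.get(key)
        if value is None:                     # miss: hit the database instead
            value = get_user_from_db(user_id)
            cache.set(key, value, expire=300) # repopulate with a 5-minute TTL
        return value

This pattern also bears on the complex-query question above: a query touching many keys pays one such round trip per key, unless the application caches the whole query result under a single composite key.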
Memcached does not handle failover or security on its own; it is up to the developer deploying it to add them. The free-form nature of the keys may lead to collisions if the key structure is not chosen carefully.

From: Long Kai
Sent: Tuesday, April 26, 2011 11:58 AM
To: Gupta, Indranil
Subject: CS 525 Reviews 04/26

Centrifuge: Integrated Lease Management and Partitioning for Cloud Services

Summary: Many current systems try to address the simultaneous needs of scale and low latency. One popular way to do this is to partition user data across pools of servers that operate purely on in-memory state. However, this model makes application programming very complicated, since developers must deal with consistency and concurrency issues. This paper is motivated by that problem and proposes a system that makes such development easy. Centrifuge integrates lease management and partitioning via a replicated state machine. Leases are used to ensure that only one server at a time is responsible for a given piece of state. Each piece of state is assigned to an in-memory server and requests are then sent to the appropriate server. Servers that want to send requests link in a Lookup library, while servers that want to receive leases and process requests link in an Owner library. All these requests are processed by the Manager service, which partitions a flat namespace of "keys" among all the servers linking in Owner libraries. The Manager then conveys to each Owner library its subset of the map using a lease protocol. This allows service programmers to focus on the logic particular to their service.

The problem is that the system is not as fine-grained as stated. When the system scales, the range covered by each lease automatically becomes larger. In real applications it is likely that many requests are made to neighboring objects, and because the range of a lease is too large, these requests unnecessarily block each other even though they touch different objects.

ACMS: The Akamai Configuration Management System

Summary: ACMS is motivated by the need for fast, reliable, and lightweight management and synchronization of configuration state in a large distributed system. In this paper, the authors present ACMS, the Akamai Configuration Management System, which was built to support customers' and internal services' configuration propagation requirements. Through the use of simple quorum-based algorithms, ACMS provides highly available, distributed, and fault-tolerant management of configuration updates. ACMS accepts distributed submissions of configuration information and disseminates this information to the Akamai CDN. ACMS provides persistent storage of configuration updates, and can scale if necessary to hundreds of thousands of servers.

-- Larry

From: Jason Croft
Sent: Tuesday, April 26, 2011 11:52 AM
To: Gupta, Indranil
Subject: 525 review 04/26

Centrifuge: Integrated Lease Management and Partitioning for Cloud Services

Centrifuge uses leasing and partitioning for in-memory servers to reduce distributed-system complexity for programmers, while also eliminating the need for datacenter load balancers. By using partitioning, the benefits of fine-grained leases can be applied to in-memory server pools without scalability limitations. Centrifuge consists of a Lookup library that is used by servers wishing to send requests and an Owner library used to receive leases and process requests.
Communication between the two libraries is handled by a Manager service, which uses consistent hashing to split keys among the servers linking in the Owner library. The Manager service consists of two sets of servers--one providing a Paxos group and one acting as either the leader or a standby. The leader is elected by the Paxos group from one of the standby servers, and the current leader handles granting leases to Owner libraries. Leases are used to ensure only one server at a given time is responsible for a piece of data. By default, leases in Centrifuge last only 60 seconds, at which point the Manager recalls all leases and computes a new assignment of ranges to the Owners. Owners also send requests to renew the lease every 15 seconds to the Manager, and after 3 consecutive missed requests the lease expires. Lookup libraries maintain a lease table, which contains the lease generation number and Owner node for every lease range. The table size is small--roughly 200KB for 100 owner nodes. Each Lookup must also contact the Manager every 30 seconds for incremental changes to the lease table.

Centrifuge's architecture is limited to server requests that originate and terminate within the datacenter. In addition, Owners and Lookups must frequently poll the Manager for updates, incurring some network overhead (up to 5MB/s). Also, statistical multiplexing is used on the objects in memory under the assumption that the processing load of any given data object is not near the capacity of the server. While this works well for the design of Live Mesh, where Centrifuge is used, it may not work well in other services that must serve CPU-intensive data.

ACMS: The Akamai Configuration Management System

ACMS is designed to propagate customer configuration options to the Akamai CDN in a distributed and synchronized manner. It allows for multiple entry points that accept and store configuration updates and guarantees updates are propagated to healthy nodes within several minutes. Since the Akamai CDN is optimized for HTTP downloads, a pull-based approach is used. A Storage Point (SP) accepts new update submissions and uses Vector Exchange (VE) to allow a quorum of SPs to agree on the submission. A recovery scheme called Index Merging is used to recover missed updates. CDN nodes run a Receiver process to coordinate subscriptions on that node, and Download Points (DPs) help to offload download bandwidth from the SPs.

The ACMS Acceptance Algorithm consists of two phases: replication and agreement. VE is used to reach agreement: a state vector is exchanged between SPs, with one bit per SP indicating that the SP knows about a given update. Ordering is needed to ensure the most recent update is propagated, since only the most recent configuration file is stored. Limited reordering is allowed, but ACMS guarantees that updates submitted more than 2T+1 seconds apart will be ordered correctly, where T is the maximum allowed clock skew between any two SPs. SPs accept only one update per second for a given configuration file, and submissions for the file are ordered first by timestamp and then by accepting SP name (a toy sketch of this ordering follows this review). A recovery protocol is used to account for missed updates and ensure SPs sync up.

ACMS's performance in submitting and propagating new configurations is highly sensitive to large files. Files under 100K take less than a second to submit, but files over 10MB take 200 seconds due to replication. Also, there is no discussion of security issues in submitting files. This topic may be outside the scope of their work, but what prevents a malicious user from submitting a configuration file for another user? Since only the most recent version of the file is stored, an attacker could take advantage of the long submission time for large files by overwriting this file. The user would then need to wait several seconds or more before the large, valid configuration file could be submitted.

From: Shen LI
Sent: Tuesday, April 26, 2011 11:35 AM
To: Gupta, Indranil
Subject: 525 review 04/26
Name: Shen Li

Centrifuge: Integrated Lease Management and Partitioning for Cloud Services

This paper proposes a lease manager to solve the request-partitioning problem. The problem arises because current load balancers cannot provide a good programming model for applications and may even route requests for the same state to different servers. For example, suppose several users hold an online video conference via a data center service. It is possible that the load balancer routes different users to different servers, which will either cause more traffic in the data center network or lead to the failure of the service. The authors use a leasing mechanism to handle this: for each unique piece of state, there is one and only one server that successfully acquires the lease for that state. The Manager service is responsible for two actions: 1) lookups against the current lease allocation map; 2) lease assignment and management. A Paxos group runs inside the Manager service to solve the consensus problem. When a new client wants to request a piece of state, it first queries the lease allocation table, from which it learns which server currently holds that state.

Pros: They clearly solve the request-partitioning problem, and application developers can simply focus on their own logic.

Cons: The Paxos group does not guarantee scalability. For example, Google's Chubby typically uses five servers to form a group, but at any point in time only one server acts as master and handles all requests; all the other servers just act as backups. In Section 2.4, the authors mention that Centrifuge will be used with a relatively small set of servers, so that scalability will not be a problem. But that is not a solution for scalability. In Google's Bigtable, by contrast, the design guarantees that the load on the Chubby server is very light, so Chubby does not become the bottleneck; Centrifuge's Manager service, however, has to handle a large number of lookups.

ACMS: The Akamai Configuration Management System

This paper describes the configuration management system used for the Akamai network. The challenge is to provide a management system that achieves high availability, fault-tolerant storage, and correct ordering of accepted updates, while at the same time propagating updates efficiently and securely. Their quorum-based algorithm basically contains two phases: 1) one SP receives an update from a publisher and sends the file to as many SPs as it can reach; 2) all participating nodes exchange a bit vector that indicates whether each node has already received the update. Once a majority of the servers are aware of the update, they are done.

Pros: 1. The quorum-based algorithm needs only a majority of the SPs to stay alive, so the system provides good fault tolerance. 2. Download Points are added to the front-end nodes, so the bandwidth demand placed on the Storage Points is alleviated.
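[Editor's sketch] The total order described above is simple to state in code. A hypothetical sketch, ordering submissions for one configuration file by timestamp and breaking ties by the accepting SP's name; the SP names and timestamps are made up.

    # Hypothetical sketch of the update ordering described above: submissions
    # for one configuration file are ordered by timestamp, with ties broken
    # by the accepting SP's name. SP names and timestamps are illustrative.
    from collections import namedtuple

    Update = namedtuple("Update", ["timestamp", "accepting_sp", "contents"])

    def latest_update(updates):
        # the winning (most recent) update for a configuration file
        return max(updates, key=lambda u: (u.timestamp, u.accepting_sp))

    updates = [
        Update(1303833600, "sp-b.example", b"v1"),
        Update(1303833600, "sp-a.example", b"v2"),  # same second: loses to sp-b on name
        Update(1303833700, "sp-c.example", b"v3"),  # newer timestamp wins outright
    ]
    assert latest_update(updates).contents == b"v3"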
Cons: 1. Even though Download Points are added, the storage nodes still need to propagate new updates as well as bit vectors. In the case where the arrival rate is very high, will the front end become a bottleneck?

From: Qingxi Li
Sent: Tuesday, April 26, 2011 11:27 AM
To: Gupta, Indranil
Subject: 525 review 04/26

ACMS: The Akamai Configuration Management System

ACMS is used to let the users of the CDN propagate their configuration requirements. ACMS provides persistent fault-tolerant storage, which means that when a machine fails and becomes available again, it will be synchronized to the latest state. Besides this, ACMS provides a unique order for all requests and makes updates secure by encrypting the configuration updates. Additionally, the authors argue that ACMS also provides efficiency and scalability; however, I don't think so.

The structure of Akamai includes publishers, which send configuration update requests; Storage Points (SPs), which accept and store the configuration updates; and CDN receivers, which subscribe to the configuration updates. In the acceptance process, the SP receiving the new configuration forwards it to all the other SPs, and when more than half the SPs have a copy of this configuration, the accepting SP sends the accepted message back to the publisher. If the accepting SP crashes before it finds out that more than half the SPs have the copy, then after it comes back it sends a "possible accept" to the publisher, and the publisher re-submits the update to another SP. To check whether more than half the SPs hold replicas, ACMS uses Vector Exchange; even though the authors argue that the total message count is O(n^2) once the broadcast times are desynchronized, I still think this process needs a large number of messages and a lot of time to converge. This is also the reason why I think ACMS is not scalable and efficient. To avoid most SPs crashing or disconnecting at the same time, the SPs are placed in different Tier-1 ASes. For machine recovery, ACMS has the recovering machine merge the file indexes of roughly half of the SPs to find the latest version of the updates, and then download them. However, there may be some skew between the clocks of the machines; to solve this, ACMS uses timestamping, which I think also needs message exchanges between the SPs and may be costly if updates are very frequent.

Scaling memcached at Facebook

This article introduces some modifications Facebook made when deploying memcached. First, instead of allocating memory for each TCP connection, Facebook uses a per-thread connection buffer pool for TCP and UDP sockets. I'm not sure how this works; I wonder whether this shared memory decreases memcached's security and whether it increases the time to read and write the buffers. The second change is using UDP instead of TCP and implementing some application-layer congestion control. However, when we did some network experiments last semester, we found that when the network is congested, the probability of a UDP packet being dropped is larger than that of a TCP packet, so I wonder whether changing from TCP to UDP will decrease performance. Third, Facebook spreads network transmission across every core to adapt memcached to multi-core servers. The last change is removing the stats collection's global lock and moving the stats to per-thread structures. This problem is also mentioned in an OS paper I read this semester: as the number of cores in current computers increases, abuse of a global lock causes growing performance degradation (a sketch of the per-thread idea follows below).
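[Editor's sketch] A minimal sketch of that per-thread statistics idea, assuming nothing from the actual memcached code: each thread bumps its own counter so no global lock sits on the hot path, and a reader aggregates the per-thread values on demand.

    # Sketch of per-thread stats collection (hypothetical, not memcached's
    # actual code): each worker thread updates a thread-local slot, so the
    # hot path takes no global lock; readers sum the slots on demand.
    import threading

    class PerThreadCounter:
        def __init__(self):
            self._local = threading.local()
            self._slots = []                    # one slot per thread
            self._register = threading.Lock()   # taken once per thread, not per op

        def incr(self, n=1):
            slot = getattr(self._local, "slot", None)
            if slot is None:                    # first use on this thread
                slot = [0]
                self._local.slot = slot
                with self._register:
                    self._slots.append(slot)
            slot[0] += n                        # no global lock on the hot path

        def total(self):
            return sum(slot[0] for slot in self._slots)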
From: muntasir.raihan SpamElide on behalf of muntasir raihan rahman
Sent: Tuesday, April 26, 2011 10:40 AM
To: Gupta, Indranil
Subject: 525 review 04/26

Centrifuge: Integrated Lease Management and Partitioning for Cloud Services

Summary: Centrifuge is a system for interactive cloud services that require low latency, operate on cached data, use many small objects, and do not need replication. The key idea in Centrifuge is to integrate leasing and partitioning, which allows scaling to massive numbers of objects. This integration provokes key architectural changes relative to existing systems, such as manager-directed leasing and a non-traditional API in which clients cannot directly request leases. The system architecture consists of a central Manager, Lookups, and Owners. Centrifuge avoids per-object leases for scalability reasons; instead it hands out leases on ranges, which are computed using consistent hashing (see the sketch after this review). Manager-directed leasing handles both leasing and partitioning, but clients cannot directly request leases on objects. Failure recovery is pushed to the clients to avoid replication costs and other complexities. The current implementation includes partitioning, consistency, and recovery, whereas membership and load balancing are left out of the current version.

Pros: (1) First solution to integrate leasing and partitioning. (2) Highly scalable system. (3) Low complexity due to no replication. (4) The system is deployed in Windows Live Mesh.

Cons: (1) The centralized Manager introduces a single point of failure. (2) The assumptions might be too restrictive. (3) Does the system work for batch systems? (4) It is not clear whether the system can handle high churn. (5) The clients might be overloaded with reliability responsibilities: when an app server crashes, items are not available until the client republishes.

ACMS: The Akamai Configuration Management System

Summary: ACMS deals with the configuration and control of a distributed platform, enabling customers to control how their content is served and to configure services just as in a centralized system. The problem is exacerbated by the large number of servers that need to be synchronized to the latest state within a few minutes, by server failures, and by the need to initiate configurations from anywhere on the network. The system assumes that submissions may originate from anywhere on the Internet, and that configuration files are submitted in their entirety. The proposed architecture has a front end consisting of a collection of storage servers, and a back end that ensures reliable and efficient delivery of configuration files by leveraging the Akamai CDN. The front end is fault-tolerant thanks to an agreement protocol on top of replication. The agreement protocol uses vector exchanges based on vector clocks. Each SP continuously runs a recovery protocol to query other SPs for missed agreements, and snapshots are used to further optimize the recovery process. In the back end, receivers periodically query SP snapshots to learn about updates; this is further optimized by leveraging the existing Akamai CDN infrastructure.

Pros: (1) ACMS is reliable, scalable, and fault-tolerant. (2) A centralized leader election protocol is no longer required due to the decentralized protocol. (3) The system utilizes the existing CDN infrastructure.

Cons: (1) Services are halted when there is no majority quorum. (2) It seems that the system does not have many novel contributions; it just uses existing techniques to build a full-fledged system.

-- Muntasir.
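[Editor's sketch] A hedged sketch of the range-based leasing idea described above: keys hash onto a ring, each owner's virtual nodes cut the ring into ranges, and one lease covers a whole range rather than a single object. The names and parameters are illustrative, not Centrifuge's actual API.

    # Illustrative sketch (not Centrifuge's real API): consistent hashing
    # partitions the key space into ranges, and one lease covers each range,
    # so the Manager hands out a few ranges instead of millions of objects.
    import bisect
    import hashlib

    def h(s):
        # hash a string to a point on a 64-bit ring
        return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

    class RangeLeases:
        def __init__(self, owners, vnodes=64):
            # each owner gets `vnodes` points on the ring; every arc between
            # consecutive points is a single lease range
            self.points = sorted(
                (h("%s#%d" % (o, i)), o) for o in owners for i in range(vnodes)
            )

        def owner_for(self, key):
            # the owner holding the lease on the range containing this key
            i = bisect.bisect(self.points, (h(key),)) % len(self.points)
            return self.points[i][1]

    leases = RangeLeases(["owner-1", "owner-2", "owner-3"])
    print(leases.owner_for("user:42"))  # every request for user:42 hits one owner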
From: wzhou10 SpamElide
Sent: Tuesday, April 26, 2011 9:56 AM
To: Gupta, Indranil
Subject: CS525 Review 04/26

Review of Memcached

Core idea: These articles introduce memcached, a high-performance distributed caching system for speeding up web applications. Caching improves performance by making recently accessed data faster to reach. Memcached uses a similar method, but for distributed systems, acting as a single logical cache for recently accessed data. Users can save and fetch items in the cache using keys. In this way, faster responses can be achieved. Facebook is one user of memcached.

Pros: 1. A very simple method is introduced for caching to speed up web applications, and the user interface is friendly. 2. Memcached is designed for distributed systems; it is scalable, unlike typical caches constrained in size.

Cons: 1. There is no security consideration in the memcached design. 2. Memcached only supports a single value per key, not multiple values. 3. Memcached doesn't provide redundancy. 4. Keys may collide, since there is no mechanism preventing users from choosing the same key. For instance, as one article suggests, users can use application name + id as the key; users then have to be careful about this, which increases their burden.

Review of "ACMS: The Akamai Configuration Management System"

Core idea: This paper describes the Akamai Configuration Management System (ACMS), part of one of the largest commercial CDNs today. It is fully decentralized, and allows reliable yet highly asynchronous delivery of configuration information. Note that ACMS propagates nothing more than configuration files, which wasn't clear to me at the beginning. ACMS's architecture consists of a small set of front-end distributed Storage Points (SPs), typically five, and a back-end process that manages downloads from the front end. It works as follows. When an update is submitted to one SP, this accepting SP first makes sure to replicate it to at least a quorum of SPs. Then the Vector Exchange algorithm runs to allow a quorum of SPs to agree on the submission, each marking its corresponding bit in a vector. These two phases together are called the Acceptance Algorithm. Once agreement is achieved, the SPs upload the data to their collocated HTTP servers. To deal with lost updates, ACMS uses a recovery scheme called Index Merging. The process of downloading configuration updates is much simpler: receivers look for files with the latest timestamps, and Download Points and caching aid in delivery.

Pros: 1. The evaluation was performed on the real Akamai network, and "Despite 36 network failures that we recorded in the last 9 months, that affected some ACMS Storage Points, the system continued to operate successfully", which is really impressive. 2. At the core of ACMS is its acceptance algorithm, and it is beautifully simple; VE is quite lightweight. 3. The trick of adding a small random delay before a rebroadcast is clever.

Cons: 1. To ensure unique identifiers, only one file is accepted per SP per second. Could this cause a bad experience due to the delay? It would be more convincing if there were evidence that this amount of time doesn't matter. 2. Though the authors argue that the five SPs they deployed do a good job, this small number of SPs may not work well as the CDN continues to grow. Also, it's not clear to me how they choose those SPs, beyond placing them in distinct ISPs.
3. I have some doubts about one assumption, that old configuration files are of no use. Is this true in the real world? 4. I personally like the Acceptance Algorithm, as I mentioned, but I wonder if the two phases could be combined. Also, did they ever consider any quorum-based schemes other than VE? 5. In Section 4.4, the SPs seem to be assumed to be able to tell which piece of information is relatively new, but I fail to see how.

From: Nipun Sehrawat
Sent: Tuesday, April 26, 2011 9:33 AM
To: Gupta, Indranil
Subject: 525 review 04/26

Centrifuge: Integrated Lease Management and Partitioning for Cloud Services

Centrifuge is a lease manager for partitioning incoming requests among a group of backend "in-memory" servers, which maintain their state in memory to satisfy requests quickly - e.g. Memcached, an in-memory key-value store. In such a configuration, data is partitioned among backend servers and an incoming request is directed, via a load balancer, to the server responsible for the requested data. But, according to the authors, this leads to a complex programming model for application developers, as they have to handle corner cases such as inconsistency arising when a request lands at multiple backend servers. At its core, Centrifuge leases data objects to the backend servers using consistent hashing and maintains replicated state for the leases.

Centrifuge features a central Manager service, which consists of a leader, several "standbys", and a group of Paxos nodes. The Manager service is responsible for: - Partitioning the data (keys) among the backend servers, using consistent hashing. - Communicating the pertinent leases to the backend servers (which run an 'Owner' library). - Communicating the entire key -> backend-server map (i.e. the collection of all leases) to the 'clients', which are the originating points of requests within a datacenter (they run a 'Lookup' library).

This configuration has both clients and servers inside the datacenter, with another layer of indirection involved: requests from outside-world clients are received by the clients inside the datacenter via a load balancer (such as F5's BigIP). The Manager service aims at high availability, which it attains by replication. It contains a Paxos group, which is responsible for electing a leader (that is then responsible for generating and distributing leases) and for maintaining a consistent record of the leases (used to roll back to the last consistent state if the current leader fails). For easier load balancing, each backend server is represented as n virtual servers; n can be increased or decreased to balance the load on a server. The paper also presents a protocol to avoid message races between two machines: each machine maintains a local sequence number and a sequence number for the other machine, and these two numbers are piggybacked on each message, allowing the receiving side to determine whether the message was sent after the sender had received all of its messages.

Pros: - Provides replicated, fine-grained lease management. - Leasing is adaptive to load; with virtual nodes it is easier to redistribute load among backend servers.

Cons: - On leader failure, backend servers broadcast lease-request messages to all the standbys. If the number of standbys is large (though not in Centrifuge's test setting), it might be helpful to have a DNS server pointing to the most recently elected leader.
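[Editor's sketch] A hypothetical sketch of the Owner-side lease loop, using the parameters quoted in the Centrifuge reviews above (60-second leases, a renewal attempt every 15 seconds, and loss of the lease after three consecutive missed renewals); manager.renew is a stand-in for the real Manager RPC.

    # Hypothetical Owner-side lease loop using the parameters quoted above.
    # manager.renew is a stand-in RPC, not Centrifuge's real API.
    import time

    RENEW_INTERVAL = 15
    MAX_MISSES = 3    # after three missed renewals the 60 s lease is presumed lost

    def owner_lease_loop(manager, lease_range, generation):
        misses = 0
        while True:
            time.sleep(RENEW_INTERVAL)
            try:
                generation = manager.renew(lease_range, generation)
                misses = 0
            except ConnectionError:
                misses += 1
                if misses >= MAX_MISSES:
                    # the range may already be reassigned: stop serving it
                    # and wait to be granted a fresh lease (new generation)
                    return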
--

Memcached

Memcached is a distributed cache of data (a key-value store) spread across multiple servers, used to cache frequently accessed data, avoid costly database lookups, and provide quick responses to data requests. Memcached features a simple programming model for application developers: 1. The application first looks up the data corresponding to a key in memcached. 2. If found, it proceeds; otherwise, it does a database lookup for the data and inserts it into memcached.

Memcached uses a consistent hash map from a key to the server holding the corresponding data. In a typical configuration, memcached has clients (that make requests for data) and servers (that store objects). Clients know all the servers and share a common hash function that maps a key to a server. Typically, clients and servers run on a trusted network (e.g. inside a datacenter), requiring few security features, but memcached also supports SASL authentication for clients running on untrusted networks.

Pros: - Reduces the average response time seen by clients, by minimizing database accesses for frequently accessed items. - Provides a simple programming model to application developers.

Cons: - No explicit load-balancing support; some backend servers might get overwhelmed if they cache a popular item. - No fault tolerance to counter failing servers.

Suggestions: - A natural response after reading the Centrifuge paper would be to run Centrifuge over a memcached installation to get better lease management.

From: Chi-Yao Hong
Sent: Tuesday, April 26, 2011 9:18 AM
To: Gupta, Indranil
Subject: 525 review 4/26

----ACMS: The Akamai Configuration Management System, NSDI'05----

This paper studies the problem of configuring a distributed system efficiently. The context is the Akamai system, where customers can change their configuration settings frequently and in a distributed manner. Given that Akamai runs more than 15,000 servers around the world, the configuration task becomes increasingly challenging. The proposed management system, ACMS, provides properties such as fault tolerance and reliability under an asynchronous delivery model.

Pros: 1. I like the idea of using asynchronous messages, where messages can be postponed until the destination machine becomes available. It is clearly an efficient way to make this system more scalable. 2. It is nice to provide a strong message-ordering property, even though they relax it because of clock skew (I don't think the bound of 2T is a good value, but I agree that the idea of message ordering is very important).

Cons: 1. I don't see why a pull-based approach is preferable in the Akamai CDN. I noticed that Akamai has optimized HTTP download, but why not do the same for push-based communication by adding an HTTP daemon at the storage points? Also, a pull-based scheme should require more communication overhead (or slower updates) compared with a push-based scheme. 2. The evaluation of scalability could be further discussed; the results in Section 7.2 are for a relatively small network (in terms of the number of storage points).
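[Editor's sketch] Several reviews note that receivers poll the SPs with HTTP "If-Modified-Since" (IMS) requests, the pull approach questioned above. A minimal sketch of such a pull loop, using Python's standard urllib; the URL, handler, and polling interval are illustrative stand-ins, not Akamai's real values.

    # Sketch of a pull-based receiver: periodically issue an HTTP GET with
    # If-Modified-Since, so an unchanged index costs only a 304 response.
    import time
    import urllib.error
    import urllib.request
    from email.utils import formatdate

    INDEX_URL = "http://storage-point.example/acms/index.xml"  # hypothetical

    def process_index(body):
        pass  # stand-in: parse the index tree, fetch changed files

    def poll_forever(interval=30):
        last_fetch = 0.0
        while True:
            req = urllib.request.Request(INDEX_URL)
            if last_fetch:
                req.add_header("If-Modified-Since",
                               formatdate(last_fetch, usegmt=True))
            try:
                with urllib.request.urlopen(req) as resp:
                    body = resp.read()        # 200: a newer index was published
                    last_fetch = time.time()
                    process_index(body)
            except urllib.error.HTTPError as e:
                if e.code != 304:             # 304: nothing changed; keep waiting
                    raise
            time.sleep(interval)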
---- Centrifuge: Integrated Lease Management and Partitioning for Cloud Services, NSDI'10 ----

The current state of the art is to implement the leasing and partitioning functions separately in the core of cloud services. However, this way of implementing them introduces drawbacks, the most important being a level of scalability that solving the problems independently cannot provide. To this end, the authors propose Centrifuge to scale both leasing and partitioning. For partitioning, sending leases for a huge number of individual objects would prevent scalability; to solve this, Centrifuge uses ranges of objects as a way to compress and pack the leases. In particular, consistent hashing is used to assign leases to virtual nodes. To further reduce communication overhead, the service that handles leasing for a node is also responsible for handling partitioning for the same node.

Comments: 1. With the Microsoft Live Mesh implementation, the authors show that Centrifuge is promising for supporting high user counts with high churn without compromising system load. 2. It would be interesting to discuss whether any previous data structure already provides a succinct format for a set of data objects with a contiguous range of values attached to it. 3. It would be very interesting to figure out the root causes of the data loss observed in Live Mesh. It would be even more interesting if the data loss were not highly correlated with either server failures (e.g., machine crashes) or network failures (e.g., network switch crashes).

-- Chi-Yao Hong Computer Science, Illinois http://cyhong.projects.cs.illinois.edu

From: Andrew Harris
Sent: Tuesday, April 26, 2011 4:26 AM
To: Gupta, Indranil
Subject: 525 review 04/26

Review of ACMS and memcached

Akamai's configuration management system (ACMS) is a highly available update system designed to push configuration file updates out to 15,000 servers, with each update possibly in the range of hundreds of megabytes. As an update is published from one of a number of Publisher machines, the so-called Storage Point machines come to a consensus over the data received (to check that it was received properly). Receivers send periodic "if modified since" requests for the configuration file, pulling it from their respective SP machines if needed. The system is provably correct for certain clock skew thresholds, in that it guarantees correct transmission of a configuration file within this time flexibility, and also guarantees that consensus will be reached for each update.

This configuration is acceptable for a network of their scale only because of the size of the information being transmitted. If these were multiple configuration files, each hundreds of gigabytes in size, the Publisher-SP-Receiver structure here would quickly become bogged down in pushing updates. No longer would clock skew be the bound on correct configurations; the speed at which the update is transferred would become the dominating variable. It would seem that propagating deltas would be much more efficient here than propagating whole configurations, as certain fields are bound to remain constant between revisions (a toy sketch of such delta propagation follows below). The agreement algorithm could still function in coming to a consensus over the delta being pushed; this and the other requirements of ACMS would be preserved with deltas, with the result being a massively reduced update size and latency. The only assumption for which this fails is the assumption that each new configuration overwrites the previous; keeping deltas would imply rebuilding the configuration at each iteration.
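[Editor's sketch] A toy sketch of the delta idea suggested above, assuming configurations are flat key-value maps; real ACMS configuration files are opaque blobs submitted in their entirety, so this is purely illustrative.

    # Toy delta propagation for a configuration treated as a flat key/value
    # map; purely illustrative, since real ACMS files are opaque blobs.
    def make_delta(old, new):
        # changed or added keys map to their new values; removed keys to None
        delta = {k: v for k, v in new.items() if old.get(k) != v}
        delta.update({k: None for k in old if k not in new})
        return delta

    def apply_delta(config, delta):
        out = dict(config)
        for k, v in delta.items():
            if v is None:
                out.pop(k, None)
            else:
                out[k] = v
        return out

    v1 = {"origin": "a.example", "ttl": "60", "gzip": "on"}
    v2 = {"origin": "b.example", "ttl": "60"}
    assert apply_delta(v1, make_delta(v1, v2)) == v2  # receiver rebuilds v2

Note how the sketch exhibits exactly the caveat raised in the review: each receiver must apply the delta to its previous version, i.e. rebuild the configuration at each iteration, instead of simply overwriting a file.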
Memcached is a distributed general-purpose caching system developed by the makers of the Movable Type blogging platform. It is designed as many caches are: to provide quick access to popular information as a first-stage lookup before reaching an underlying database, to reduce load on that database or speed up related operations. Particular features of memcached are its simple configuration and drop-in usage.

A drawback of memcached is noted by its developers: it is inherently insecure. Neither writes nor reads are controlled by the system, which relies on access control implemented on the memcached nodes for any semblance of security. This is fine, though, as the scope of memcached does not include reinventing security best practices for distributed systems.

A second drawback, mentioned but perhaps downplayed, is a complete lack of type safety in data stored within memcached. This allows for greater flexibility in one sense, in that memcached can be deployed to serve any arbitrary string-serializable data. Compared with structured systems such as relational databases, however, this explicitly removes any chance for optimization in the implementation beyond that which would apply to string handling. Consider an arbitrarily complex data structure serialized into a memcached entry. To reach a single data member within this structure, the entire entry must first be deserialized (illustrated in the sketch below). As the complexity of the structure increases, the cost of this deserialization grows. With a relational database, on the other hand, the explicit typing of each entry allows fast lookups without a deserialization step. Consider next an extension of memcached, with each K-V pair pointing to a structured data set instead of a raw string; the deserialization is already done, so direct accesses to members can be made with only the key for the K-V pair. Structured data would also allow the implementation to better predict its own resource consumption for some number of entries, which itself may improve load balancing across nodes. Finally, if serialization is later desired (say, to store an instance of memcached's contents to disk), it is no more complex than specifying an appropriate schema and letting that dictate the serialization.

A question then would be: why not simply run a small relational database entirely in memory and have it synchronize periodically to some larger disk-bound database? One reason this would be undesirable is performance in periods of heavy load. Memcached is not ACID-compliant, and does not block during multiple concurrent reads. This lends to its stability under load, but also sets it apart from most databases. From the other side, ACID compliance in many databases would lead to unnecessary read overhead in a cache structure.
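[Editor's sketch] A small sketch of the deserialization cost described above, using Python's pickle as the example serializer (memcached itself just stores opaque bytes): reading one field of a cached structure requires decoding the whole entry.

    # Whole-entry deserialization cost: memcached stores opaque bytes, so
    # reading one field means decoding the entire cached structure first.
    # pickle is just an example serializer; the dict stands in for a node's
    # key/value memory.
    import pickle

    store = {}
    profile = {"name": "alice", "friends": list(range(10000)), "email": "a@x.com"}
    store["user:1"] = pickle.dumps(profile)      # one big serialized entry

    # Reading just the email still decodes the 10,000-friend structure:
    email = pickle.loads(store["user:1"])["email"]

    # The usual workaround is to split the structure across keys on write:
    store["user:1:email"] = pickle.dumps(profile["email"])
    email = pickle.loads(store["user:1:email"])  # decodes only that one field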
From: Agarwal, Rachit on behalf of Rachit Agarwal
Sent: Tuesday, April 26, 2011 4:09 AM
To: Gupta, Indranil
Subject: 525 review 04/26

Rachit Agarwal

---- Centrifuge: Integrated lease management and partitioning for cloud services

Data center servers today face the challenge of handling massive amounts of state while providing consistency, low latency, and responsiveness to user requests. Today, servers operate purely on in-memory state, but technologies for building in-memory server pools offload the handling of many system design issues to the user end. This paper presents a technique to overcome this problem. The paper achieves its target by integrating the leasing and partitioning logic using a manager-directed service; by integrating the two mechanisms, it manages to resolve the inconsistency problem.

A few thoughts/comments:

1. The authors assume that although the number of objects is large, each object is itself small. I am wondering if there are applications where the servers have to deal with larger objects; for such applications, Centrifuge may run into scalability problems.

2. A trick used in the paper to avoid inconsistency problems in Centrifuge is to eradicate replication. The authors assume that, in case of failures, the state can be rebuilt either from back-up storage or by requiring the users to republish the content. This seems like a very weak position, in that a failure may result in large amounts of data transfer and/or the user losing the content (in case the user does not have a back-up service).

3. The concept of lease ranges is interesting. The assumption of everything running within a single data center made the use of lease ranges easier. I am wondering how things would change if the application servers were distributed across multiple geographically distributed data centers. I assume that many cloud providers would like to have such a distributed data center service for providing a better user experience.

---- Memcached

With the growing size of the information in data networks, caching is becoming an ever more important tool for scalability. Memcached is a distributed caching system typically used in web-based systems to reduce response time. These articles were pretty interesting to read; some of my comments:

1. At first glance, it is surprising that Facebook found memcached to be not so scalable. I wonder whether the connection memory allocation process is the only problem limiting scalability. For instance, memcached does not perform very well if data accesses are spread evenly across a chunk of data larger than the cache's memory, since this would typically require updating the cache on every access. I wonder whether there are other such reasons for memcached's non-scalability.

2. Memcached also seems to perform poorly in the presence of data that is created incrementally and/or updated frequently (the usual workaround, invalidating the cached entry on every write, is sketched below). I assume Facebook does not see much such data, but for many other applications I wonder whether an incremental and/or update-efficient memcached design is possible.

3. The lack of access control seems to be a major problem for memcached. Given the high importance of privacy in applications like Facebook, I wonder how Facebook handles these privacy problems without significant overhead.

----

-- Siebel Center for Computer Science, University of Illinois at Urbana-Champaign, Web: http://agarwa16.wikidot.com

---- Sing as if No one is Listening, Dance as if No one is Watching, Dream as if You'll Live Forever, Live as if You'll Die Today !!
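[Editor's sketch] A minimal sketch of that invalidate-on-write pattern for frequently updated data, again assuming the pymemcache client; db_update_user is a hypothetical stand-in for the real write path.

    # Invalidate-on-write: every database update deletes the cached copy, so
    # the next read repopulates the cache with fresh data. Assumes the
    # pymemcache client; db_update_user is a hypothetical stub.
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def db_update_user(user_id, fields):
        pass  # stand-in for the real database write

    def update_user(user_id, fields):
        db_update_user(user_id, fields)     # write the authoritative store first
        cache.delete("user:%d" % user_id)   # then drop the stale cached entry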
From: david.m.lundgren SpamElide on behalf of David Lundgren
Sent: Tuesday, April 26, 2011 3:41 AM
To: Gupta, Indranil
Subject: 525 review 04/26

ACMS: The Akamai Configuration Management System

Sherman et al. introduce ACMS for the fault-tolerant distribution of configuration files and the maintenance of consistent state in large-scale asynchronous distributed systems. The system is composed of two parts: a client-facing ``front-end'' that accepts and initially stores configuration files, and the back-end distribution network that propagates new configuration files. Acceptance of a new file occurs in two stages. Clients ``publish'' configuration changes to an accepting storage point (SP) server, which initially attempts to replicate the file on a quorum of other SPs. If this step is successful, the accepting SP initiates a vector exchange protocol to confirm quorum agreement among the SP ring (SPs are distributed across Internet service providers and in well-connected data centers). After this, SPs are free to upload the new file for download. Subscribed receivers are then free to download the configuration file index tree via HTTP IMS requests using a pull mechanism. Index recovery and optimizations are discussed, and system scalability and speed are evaluated.

Pros, Cons, Comments, and Questions: - While within the Akamai network there can be little debate over a pull vs. push mechanism for update distribution, I wonder how general this choice is and what repercussions such a design decision might have in networks not optimized for HTTP GETs like Akamai's. - The authors assume a relatively fixed, stable group of storage points. I think it might be interesting to examine a more dynamic structure here. - I enjoyed the authors' proofs of correctness, guarantees, and, in general, their theoretical analysis.

-------------------------------------------------------------------------

Articles and Documentation for Memcached

Memcached is designed as a massively scalable, low-latency, distributed object cache in memory. It was originally developed for LiveJournal, but is used by many web companies (including Facebook, who do not seem to have the honor of being listed on the memcached website). Memcached's primary purpose is to lessen database load; it is basically a massive (non-secure) key store that removes items based on expiration date and in least-recently-used order. A request to memcached is first hashed to determine the proper node storing the information, and then this node hashes internally to return the actual data. The standard memcached install does not seem to implement consistent hashing, but there are external libraries available for it. I thought the choices to provide zero redundancy, no failure handling, and no authentication/security were all good design decisions. It seems cleaner to encode logic such as this at the application level, and any attempt to incorporate it into memcached would hamper performance.

By default, memcached uses slab memory allocation rather than the more traditional malloc/free paradigm, and does this on a per-connection basis. Facebook modified this to pool memory among threads, and also switched get requests to use UDP rather than TCP. While growing their operations to web scale, Facebook also noticed that UDP socket lock contention caused large bottlenecks, forcing them to use one reply socket per thread. Facebook's impressive performance statistics attest to the scalability and excellent execution of memcached (with some clever tweaks) under load.

From: Curtis Wang
Sent: Tuesday, April 26, 2011 3:35 AM
To: Gupta, Indranil
Subject: 525 review 04/26

Curtis Wang (wang505) 4/26/11 Other Industrial Systems

ACMS: The Akamai Configuration Management System

The authors present the Akamai Configuration Management System, a system built for consistent, reliable, asynchronous delivery of configuration information. ACMS addresses users' concerns about maintaining close control over web services and the problem of quickly and reliably propagating a configuration update. The system is divided into two components—what they call the “front end” and the “back end”. The front end consists of Storage Points, which are responsible for accepting and storing configuration updates.
The Storage Points utilize a quorum-based (defined as "majority") acceptance algorithm that relies on Vector Exchange: SPs exchange state vectors, similar to the vector clocks concept (a toy sketch of Vector Exchange follows this review). Recovery in the front end is accomplished through Index Merging. The back end of the system is simply the Akamai CDN.

Pros: - Highly available, with significant fault tolerance. - Persistent storage of configuration updates for asynchronous updates. - Offers data suggesting that the system may be scalable. - Large, functional deployment in use.

Cons: - Makes many assumptions about the system, e.g. that there are only non-competing writers, which may not always be the case.

Facebook Memcached

Memcached is an open-source distributed in-memory object caching system. Though it has many applications, it was designed with the intended purpose of speeding up dynamic web applications by alleviating database load.

Pros: - Alleviates database load. - Can be accessed from anywhere and can span multiple machines. - Available for many popular programming languages. - Simple to use. - Open source.

Cons: - Does not have built-in security measures, so people inexperienced with memcached may not end up with the right level of security in their applications.
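[Editor's sketch] As promised above, a toy sketch of Vector Exchange: each SP sets its own bit for an update, broadcasts its vector, ORs in vectors it hears from peers, and declares agreement once a majority of bits are set. This simulates the idea only, with hypothetical SP names and no real networking.

    # Toy simulation of ACMS Vector Exchange: each SP keeps a bit vector per
    # update, sets its own bit once it has stored the file, ORs in vectors
    # received from peers, and treats the update as accepted when a majority
    # of bits are set. SP names are hypothetical; no networking is done.
    SPS = ["sp1", "sp2", "sp3", "sp4", "sp5"]
    QUORUM = len(SPS) // 2 + 1

    class StoragePoint:
        def __init__(self, name):
            self.name = name
            self.vector = {sp: 0 for sp in SPS}

        def store_update(self):
            self.vector[self.name] = 1      # I have the update on disk

        def receive_vector(self, other_vector):
            for sp, bit in other_vector.items():
                self.vector[sp] |= bit      # OR in everything the peer knows

        def agreed(self):
            return sum(self.vector.values()) >= QUORUM

    sps = {name: StoragePoint(name) for name in SPS}
    for name in ["sp1", "sp2", "sp3"]:      # three of five SPs store the file
        sps[name].store_update()
    for name in ["sp2", "sp3"]:             # their vectors reach the accepting SP
        sps["sp1"].receive_vector(sps[name].vector)
    assert sps["sp1"].agreed()              # majority reached: update accepted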
From: Tengfei Mu
Sent: Monday, April 25, 2011 11:16 PM
To: Gupta, Indranil
Subject: 525 review 04/26

1. ACMS: The Akamai configuration management system

This paper introduces the Akamai Configuration Management System, which manages configuration updates for the Akamai network, a content delivery network. By using simple quorum-based algorithms (Vector Exchange and Index Merging), ACMS propagates configuration changes and internal modifications to the CDN in a highly available, distributed, and fault-tolerant way. Specifically, ACMS consists of a front end and a back end: the front end has several Storage Points used to store and synchronize configuration files, and the back end delivers the configuration files to the CDN servers. More broadly, the schemes used in Akamai could be useful in other distributed systems.

Pro: 1. ACMS delivers configuration updates in a scalable, reliable, and fault-tolerant way.

Con: 1. The overall design is not innovative. 2. Each SP exchanging vectors with all the others may affect scalability.

2. Memcached

These articles introduce memcached, a tool for speeding up web applications. Memcached is used as a cache for storing frequently accessed data, providing quick responses to users even though the cache may be very large. The user can easily use the memory, simply saving and retrieving cache items with keys. Specifically, memcached treats the cache as a big associative array in which each item is indexed by a string key; the key uniquely identifies the cached data and is used to retrieve and remove it. Note that memcached does not contain any security mechanism, requiring the user to add one.

Pro: 1. Memory is used to speed up web applications by caching accessed information. 2. Memcached is easy to use, apart from the need to add a security mechanism yourself.

Con: 1. Security should be considered in future implementations.

From: mark overholt
Sent: Monday, April 25, 2011 6:24 PM
To: Gupta, Indranil
Subject: 525 review 04/26

Mark Overholt CS525 Review 04/26/2011

ACMS: The Akamai Configuration Management System

Summary: Akamai is a company that specializes in distributed mirrors of websites. To do this, they run a highly distributed set of servers hosting copies of websites.
These servers are spread all across the globe and provide a highly available subsystem for sites. The sites running on Akamai require configuration at the software level in order to function correctly, and these configuration files can change based on a client's needs, so Akamai created ACMS, a system for accepting new configuration files and propagating those changes out to their machines.

ACMS uses a 2-tier system to propagate updates across the whole deployment. The first tier is the front-end set of machines called Storage Points; there is always a fixed number of SPs. The user uploads new configuration files to the SPs, and it is the SPs' job to agree on the change, notify the user, and then start sending the change out to the Receivers. The Receivers are the second tier, the back-end system. To guarantee that an update remains in the system and keeps propagating despite failures, a quorum of the SPs must agree on the update and save it to disk; for Akamai, a quorum is a majority. When an update is uploaded, the entry SP saves the file to disk and broadcasts the update to all of the other SPs. When a majority of the SPs (including the initial SP) have acknowledged the update, the Vector Exchange begins: a timestamped vector is circulated to check whether the servers are prepared to upload the change and start serving it to the back end. When the initial SP receives the vector back with at least a majority of SPs having accepted, it notifies the user that the update was accepted by the system. Once an update is accepted, all of the SPs upload the file to their HTTP servers, from which it is downloaded to the back-end machines. The back-end machines use a pull method to get updates; that is, they poll the SPs for changes and pull new updates as they arrive.

Discussion: Pros: Using local caches for updates, as well as extra Download Points, allows the system to scale well. Cons: I am not sure their lax ordering assumption is sufficient; updates uploaded to different SPs within 20 seconds of each other could be reordered incorrectly.

Memcached

Summary: Memcached is a general-purpose distributed memory caching system. It is often used to speed up dynamic, database-driven websites by caching data and objects in RAM to reduce the number of times an external data source (such as a database or API) must be read. Memcached runs on Unix, Windows, and MacOS. Memcached's APIs provide a giant hash table distributed across multiple machines. When the table is full, subsequent inserts cause older data to be purged in least-recently-used (LRU) order. Applications using Memcached typically layer requests and additions into memory before falling back on a slower backing store, such as a database.

The system uses a client-server architecture. The servers maintain a key-value associative array; the clients populate this array and query it. Keys are up to 250 bytes long and values can be at most 1 megabyte in size. Clients use client-side libraries to contact the servers which, by default, expose their service on port 11211. Each client knows all the servers; the servers do not communicate with each other. If a client wishes to set or read the value corresponding to a certain key, the client's library first computes a hash of the key to determine which server to use, and then contacts that server. The server computes a second hash of the key to determine where to store or read the corresponding value (this two-level hashing is sketched below).
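[Editor's sketch] A minimal sketch of that two-level hashing, with hypothetical server names; a real client would normally use consistent hashing so that adding a server does not remap every key.

    # Sketch of memcached's two-level hashing: the client hashes the key to
    # pick a server, and that server hashes the same key again to pick a
    # slot in its own in-memory table. Server names and the bucket count
    # are illustrative only.
    import hashlib

    SERVERS = ["mc1:11211", "mc2:11211", "mc3:11211"]
    N_BUCKETS = 1024   # per-server hash-table size (illustrative)

    def digest(key, salt):
        return int(hashlib.md5((salt + key).encode()).hexdigest(), 16)

    def pick_server(key):
        return SERVERS[digest(key, "client") % len(SERVERS)]  # first hash: client side

    def pick_bucket(key):
        return digest(key, "server") % N_BUCKETS              # second hash: server side

    key = "user:42"
    print(pick_server(key), pick_bucket(key))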
The servers keep the values in RAM; if a server runs out of RAM, it discards the oldest values. Therefore, clients must treat Memcached as a transitory cache; they cannot assume that data stored in Memcached is still there when they need it.

Discussion: Pros: Seems like a great way to alleviate the strain on a heavily used database. The concept of using the main memory of a distributed set of servers is a huge improvement over going to disk every time you need to talk to a database. Being cross-platform and cross-language makes it easy to adopt, no matter what technology you are currently using. Cons: The Facebook article really shed some light on some of memcached's shortcomings when it is presented with huge load. Memcached seems to be great for moderately high load, but under huge load, modifications to how it interacts with the OS needed to be looked at.

From: Tony Huang
Sent: Sunday, April 24, 2011 11:16 PM
To: Gupta, Indranil
Subject: 525 review 04/26

Paper: ACMS: The Akamai Configuration Management System

Core idea: Akamai is a CDN, and ACMS is its configuration-file management and distribution service. The system is based on a set of front-end distributed Storage Points (SPs) and a back-end process that manages downloads from the front end. The front-end SPs are responsible for accepting and storing configuration updates, and the back end is the entire Akamai CDN, which subscribes to updates and delivers the update files. As an optimization, a set of Download Points (DPs) is added to the front end. Quorum-based algorithms are run among the SPs to synchronize configuration submissions. ACMS uses a pull-based approach to propagate updates. To propagate a file, a publisher first submits an update, transmitting the file to one SP. That SP makes sure the update has been replicated to at least a majority of the SPs before returning to the client. After agreement among the SPs is reached, the data can be downloaded. Each node then runs a daemon called the Receiver, which subscribes to the SPs and fetches new configuration files. The configuration files themselves are represented in a tree format, which facilitates merging two versions of a file. Special attention is paid to guarding against clock-skew-related errors.

Pros: - Uses existing, mature technology to implement a useful service for a large network service. - Special attention is paid to clock-skew-related errors, a design consideration that has not been fully discussed in previous papers we read. - Tree-based merging reduces traffic sent over the network.

Cons and thoughts: - The service uses broadcast and vector exchanges with all other machines in the system. I think scalability could be further improved by using an anti-entropy algorithm instead. - ACMS halts the system when there is no quorum of functional SPs. This would be unacceptable for a mission-critical system.

Memcached

Core idea: Memcached is a caching service built to alleviate the workload on a website's database storage. It provides the application a hash-table-like service: clients store objects in memcached under a key and later retrieve them with the same key, and they can specify a time-to-live that determines how long an object is kept in the cache. It employs a two-level hash structure, in which the user library first determines the machine responsible for a key, and the object is then looked up in that server's in-memory hash table. Facebook performed the following optimizations to the system:
* Instead of per-connection buffers, it employs a shared buffer pool for TCP and UDP sockets, reducing the memory overhead of sockets.
* Application-level flow control for multi-gets.
* Due to a Linux lock on the UDP socket, multi-threaded transmission through a single socket is slow; the system was changed to use a separate UDP socket per thread for transmitting replies.
* The burden of network soft-interrupt handling is spread across cores by using "opportunistic" polling of the network interfaces: a core polls the network interface when it is idle.

Thoughts: Memcached is useful infrastructure for large-scale services and is widely deployed at the most popular websites. However, caching is initiated by the user rather than happening automatically; as a result, using memcached requires careful analysis on the programmer's side. Is it possible to design a service that would automatically perform caching for users, to reduce the burden on the programmer? (A toy sketch of one such approach follows.) Also, what is the fundamental difference between using memcached and a CDN?

-- Regards -- Tony
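[Editor's sketch] One toy answer to the automatic-caching question above: a decorator that transparently caches a function's results in memcached, keyed by the function name plus a hash of its arguments. This assumes the pymemcache client and is only a sketch; it sidesteps the hard part, which is invalidating entries when the underlying data changes.

    # Toy "automatic" caching: a decorator that caches any function's result
    # in memcached, keyed by function name + a hash of its arguments.
    # Assumes the pymemcache client; ignores invalidation on writes.
    import hashlib
    import pickle
    from functools import wraps
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def cached(ttl=300):
        def decorate(fn):
            @wraps(fn)
            def wrapper(*args):
                key = fn.__name__ + ":" + hashlib.md5(pickle.dumps(args)).hexdigest()
                hit = cache.get(key)
                if hit is not None:
                    return pickle.loads(hit)
                result = fn(*args)
                cache.set(key, pickle.dumps(result), expire=ttl)
                return result
            return wrapper
        return decorate

    @cached(ttl=60)
    def expensive_query(user_id):
        ...  # stand-in for a slow database read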