From: trowerm SpamElide on behalf of Matt Trower [mtrower2 SpamElide] Sent: Thursday, March 31, 2011 1:59 PM To: indy SpamElide Subject: 525 review 03/31 Forgot to email. Gossip-Style Failure Detection In this paper the authors describe their implementation of a fault detection service which was deployed on Cornell's campus network. The protocol is based on the random dissemination of information to neighbors by all nodes in the network. Each node keeps a heartbeat counter for every other node, and each node updates only its own counter. The authors admit that the idea is not unique, but several tweaks to the system allowed for significant reductions in false positives and network overhead. Many of these improvements were based around adding a “cooling” period after an event happened to allow the network to settle. Other improvements were made by taking network locality into account and storing address prefixes in a separate table. Traffic was then generated in a hierarchically random manner, which localized most of the traffic. This paper made an interesting change from other works in its goals. Most failure detectors try to become more accurate at any cost. Perhaps because this system was actually deployed, the authors focused on scalability instead, which led to longer failure detection times. The authors claim that the service was able to catch 100% of the failures in the Cornell network over a three-week timespan. I am interested in what kind of failures these were and how they can guarantee that all failures were caught and that there were no false positives. Scalable and Efficient FD's This paper presents a formal framework which analyzes several important parameters in failure detection systems. Using the framework the authors show the optimal network load required to detect a node failure in T seconds (normalized to the longest round-trip time between nodes). Based on this work the authors propose their own failure detection system, based on randomized probing of nodes, which is within a constant factor of optimal in the average case. The most interesting part of the protocol is the out-sourcing of pings to other nodes. This has the nice property in practice that network path heterogeneity would increase reliability. Two points were brought up at the end of the paper which I thought fitting. The first is whether a protocol could be designed which allows all the parameters to be balanced, including accuracy and completeness. The other question is whether we can do as well as the random case presented with a deterministic or guaranteed-type algorithm. From: Long Kai [longkai1 SpamElide] Sent: Thursday, March 31, 2011 1:24 PM To: Gupta, Indranil Subject: 525 review 03/31 CS 525 Reviews 03/31 Long Kai (longkai1) A Gossip-Style Failure Detection Service Summary: This paper presents a failure detection service based on a gossip protocol. Each host in the network runs a failure detector and sends out continuous process failure and recovery reports to the clients. The service scales very well and provides timely detection. The paper further discusses how to leverage the underlying network topology to increase efficiency and save resources. The service is composed of two separate protocols to take advantage of the network topology automatically. The experimental results demonstrate the scalability and accuracy of the service. Pros: This service scales very well and the protocol is straightforward and easy to implement. Cons: However, like other gossiping and epidemic protocols, this service can only guarantee accuracy and low latency with high probability; in some cases, failures may not be detected at all or may take a long time to detect.
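A minimal sketch of the heartbeat-table scheme the reviews above describe, written in Python: each node increments only its own counter, pushes its whole table to one random peer per round, and merges incoming tables by taking the maximum counter per member. All names (GossipNode, T_GOSSIP, the send callback) are illustrative, not the paper's.

    import random
    import time

    T_GOSSIP = 1.0  # seconds between gossip rounds (illustrative value)

    class GossipNode:
        def __init__(self, my_id, members):
            # members is expected to include my_id
            self.my_id = my_id
            # per member: (heartbeat counter, local time the counter last advanced)
            self.table = {m: (0, time.time()) for m in members}

        def gossip_round(self, send):
            # increment own counter, then push the whole table to one random peer
            hb, _ = self.table[self.my_id]
            self.table[self.my_id] = (hb + 1, time.time())
            peers = [m for m in self.table if m != self.my_id]
            if peers:
                send(random.choice(peers), dict(self.table))

        def on_gossip(self, remote_table):
            # merge: keep the maximum heartbeat seen for each member
            now = time.time()
            for member, (hb, _) in remote_table.items():
                local_hb, _ = self.table.get(member, (-1, now))
                if hb > local_hb:
                    self.table[member] = (hb, now)  # counter advanced, so member looks alive

Failure detection then reduces to checking how long each entry's counter has been stalled, which is what the T_fail / T_cleanup timers discussed in several of the reviews below are for.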
SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol Summary: This paper presents SWIM, a generic software module that offers membership services for large-scale process groups. The protocol is scalable in the sense that both the expected time to first detection of each process failure and the expected message load per member do not vary with group size. SWIM separates the failure detection and membership update dissemination functionalities of the membership protocol. The first component is a failure detector that uses a random-probing based failure detector protocol. The second component is a dissemination component which disseminates membership updates via network multicast. Pros: The protocol is highly scalable and imposes less overhead on the network. Cons: The false positive rate cannot be bounded in this protocol and it could be high under some conditions. Best, -- Larry From: Igor Svecs [isvecs SpamElide] Sent: Thursday, March 31, 2011 12:17 PM To: Gupta, Indranil Subject: 525 review 03/31 Igors Svecs CS525 paper review 2011-03-31 A gossip-based failure detection service SUMMARY This paper presents a simple protocol for failure detection that is based on gossiping. Each node maintains a list with a (neighbor, heartbeat counter) entry for each other node. Periodically, a node increments its own counter and sends the entire list to a randomly chosen neighbor. Upon receiving the message, nodes merge the two lists by taking the maximum of the two heartbeat counters for each entry. Nodes announce joining by broadcasting, or by using gossip servers. Nodes that are considered faulty are cleaned up after a timeout that can be set to 2*T_fail. The protocol is extended by maintaining subnet information and including it in the updates. Nodes are then tuned to prefer gossiping within their own subnet, reducing cross-domain traffic and improving local detection time. PROS - Simple - Only completely unreachable hosts are detected as failed (as opposed to individual link failures) - can be seen as both a pro and a con. - Known probability of a false positive CONS - Linear growth of detection time, state (each member maintains a list with all known other members), and bandwidth / message size (the entire list is transferred). "We decided to slow gossiping down linearly with the number of members" (which also grows linearly). It seems that message size is the root cause of linear scaling. Partitioning the network into k * sqrt(N) overlapping sets could remove many linear factors. - Member management - as new nodes join, all other nodes need to update their lists. This is not acceptable for large networks with high churn (e.g., p2p). - The catastrophe recovery protocol assumes that IP broadcasts are available, while they are almost always filtered on the Internet. NOTES - Multi-level gossiping based on subnets is an interesting idea; it will probably work in intranets where several subnets are densely populated with hosts, but it will not be appropriate for the Internet (the second list that contains domains/masks will grow too large). It would be interesting to use information from a routing protocol to infer locality, although this will also not work on the Internet. - I propose using bits of a hash of a node's IP address to partition the network into multiple domains, similar to Pastry. Since the structure is inherent, it will not require additional agreement protocols.
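To make the subnet-biased peer selection mentioned above concrete, here is a rough Python sketch: gossip mostly inside the local subnet, occasionally across subnets, and rarely across domains. The probabilities and helper names (pick_gossip_target, p_cross_subnet, p_cross_domain) are illustrative guesses, not the paper's actual tuning, which aims for roughly one cross-subnet gossiper per subnet per round.

    import random

    def pick_gossip_target(my_addr, members, p_cross_subnet=0.1, p_cross_domain=0.01):
        """members: list of dotted-quad addresses known to this node."""
        subnet = lambda a: a.rsplit('.', 1)[0]     # e.g. 128.84.223.x
        domain = lambda a: a.split('.', 2)[0:2]    # e.g. 128.84.x.x
        others = [m for m in members if m != my_addr]
        r = random.random()
        if r < p_cross_domain:
            pool = [m for m in others if domain(m) != domain(my_addr)]
        elif r < p_cross_domain + p_cross_subnet:
            pool = [m for m in others if domain(m) == domain(my_addr)
                    and subnet(m) != subnet(my_addr)]
        else:
            pool = [m for m in others if subnet(m) == subnet(my_addr)]
        return random.choice(pool or others)

The design trade-off noted in the reviews shows up directly here: the smaller the cross-subnet and cross-domain probabilities, the less traffic crosses routers, but the more rounds it takes for a heartbeat to travel between subnets.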
SWIM: Scalable Weakly-consistent Infection-style process group Membership protocol SUMMARY SWIM is a group membership protocol that is based on gossiping. The two components of this protocol are failure detection and update dissemination. The failure detector on a member periodically selects a random neighbor and sends a ping message to it. If an ack is not received, the member asks k other members of the group to probe the neighbor. If no reply from any of the k members is received, the neighbor is marked as faulty, and this information is multicast* by the dissemination component. This protocol is optimized as follows. *Multicasting is eliminated by piggybacking membership updates on pings/acks. This ensures epidemic dissemination of membership updates. A suspicion mechanism is introduced (delayed marking of potentially failed processes, dissemination of suspected members) that allows tuning the false-positive-rate / detection-time tradeoff. Time-bounded strong completeness of failure detection can be ensured by selecting probe targets in a round-robin way. PROS - Failure detection time, rate of false positives, and message load per member are independent of the group size. - Indirect probing means that individual link failures will not affect failure detection. - Simple protocol, with optimizations available to reduce the false positive rate and ensure time-bounded strong completeness. CONS - Indirect probing leads to the possibility of an easy DDoS attack. - Probabilistic / weak consistency. From: Jason Croft [croft1 SpamElide] Sent: Thursday, March 31, 2011 12:13 PM To: Gupta, Indranil Subject: 525 review 03/31 SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol The SWIM protocol attempts to improve the performance and scalability of maintaining group membership by separating the membership dissemination component from failure detection. The protocol also does not use heartbeats for failure detection, as heartbeat-based failure detectors tend to suffer from scalability limitations. By separating failure detection and membership dissemination, SWIM achieves stable failure detection time, a stable rate of false positives, and low message load per group member. The authors note that there is a trade-off between four metrics: strong completeness (failures are detected by non-faulty members), speed of failure detection, accuracy (rate of false positives of failure detection), and network message load (bytes per second generated by the protocol). The basic SWIM protocol consists of the failure detector component and the dissemination component. The failure detection component does not require clocks to be synchronized, and uses pings, ping-reqs, and acks to detect failures. Indirect probing, through ping-req, is used after a prespecified time-out that must be based on the round-trip time within the network. The dissemination component can be implemented through hardware multicast or in infection-style. With this design, the failure detector can become inaccurate, as slow processes declare non-faulty processes as faulty. To improve on the design, processes are declared "suspected" before they are considered faulty, after some prespecified time-out. This suspicion subprotocol reduces the rate of failure detection false positives. However, there may be another issue with false positives that is not discussed in the paper, namely nodes that may not be reachable due to NATs or firewalls. In such a case, nodes may not receive pings, but are not faulty--a similar problem that can also affect DHTs.
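A toy sketch of the suspicion states that the subprotocol just described introduces, in Python. The state names, the timeout value, and the class itself are hypothetical; the real protocol additionally tags updates with incarnation numbers so that a suspected member can refute the suspicion itself.

    import time

    ALIVE, SUSPECTED, FAULTY = "alive", "suspected", "faulty"
    SUSPECT_TIMEOUT = 5.0  # seconds a member stays suspected before being confirmed faulty

    class MemberState:
        def __init__(self):
            self.status = ALIVE
            self.since = time.time()

        def on_probe_failed(self):
            if self.status == ALIVE:
                self.status, self.since = SUSPECTED, time.time()   # not declared faulty yet

        def on_ack_or_refutation(self):
            if self.status == SUSPECTED:
                self.status, self.since = ALIVE, time.time()       # second chance

        def expire(self):
            if self.status == SUSPECTED and time.time() - self.since > SUSPECT_TIMEOUT:
                self.status = FAULTY                               # confirm the failure

The point of the intermediate state is exactly the trade-off the reviews mention: a slow or briefly unreachable process gets SUSPECT_TIMEOUT extra time to answer before a failure is disseminated, at the cost of slower detection of real failures.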
Also, I thought there could have been more evaluation on the scalability of the protocol beyond the 55-member groups, especially as this was one of the metrics the protocol aimed to improve on compared to existing approaches. A Gossip-Style Failure Detection Service In this paper, a simple gossip protocol is proposed for detecting failed processes that aims to be resilient against message loss, allow some clock drift, and scale well for both detection time and network load. Each member maintains a list of known members and updates the list upon receiving a gossip message, which members send out after a specified time period. Unlike SWIM, this protocol is heartbeat-based. Therefore, members gossip at regular intervals, but not necessarily synchronized intervals. One interesting result of this protocol design is that it cannot detect link failures between hosts, but rather only if a host becomes completely unreachable. Thus, if A cannot reach B, but A can reach C and C can reach B, then A would not suspect B of failing. With this protocol, there is a trade-off between minimizing the probability of false detection and the failure detection time. In addition, if many members of a group begin failing, the number of wasted gossips increases. In the case of a network partition, where one set of members becomes unreachable for another set, the basic gossip algorithm performs poorly. To improve on some of the limitations, network topology is taken into account. The lengths of the subnet and host numbers for each domain are gossiped with the heartbeat counters, to allow a different protocol to be used for gossips that cross subnets and domains. This protocol tries to ensure that on average one member per subnet gossips to another subnet in the domain, so that bandwidth at each level is proportional to the number of entities in the next lower level. While this reduces bandwidth, it increases the number of rounds needed for dissemination. When the number and size of subnets are the same, the number of rounds is highest, though bandwidth usage becomes most efficient. From: harshitha.menon SpamElide on behalf of Harshitha Menon [gplkrsh2 SpamElide] Sent: Thursday, March 31, 2011 12:09 PM To: Gupta, Indranil Subject: 525 review 03/31 Gossip style failure detection service Failure detectors are valuable for system management, replication, load balancing and other services. But FD services scale badly as the number of members increases. This paper describes a failure detection protocol based on gossiping that scales well. The FD based on this has the properties that the probability that a member is falsely reported as having failed is independent of the number of processes, the algorithm is resilient against message loss and process failures, the detection time is O(n log n), and it scales in network load. Each member maintains a list entry for each known member with a heartbeat counter. Every Tgossip seconds, each member increments its own heartbeat counter and selects one other member at random to send its list to. Upon receiving a gossip message, a member merges the list in the message with its own list. If the heartbeat of a node has not increased within Tfail, then that node is marked as failed (a small sketch of this timeout check appears after the pros and cons below). Pros: -False positives are very low -The detection time is O(n log n) -Scales in network load -Resilient against message loss and process failures. -It can handle network partitions -Since this is randomized, there is no overhead of maintaining a topology and repairing it on failure. Cons: -In multi-level gossiping, the number of rounds needed for information to be disseminated through the system is high, and hence detection time across subnets and domains is high. -How will it work in the case of Byzantine failures?
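A minimal sketch of the Tfail / Tcleanup sweep referenced above, in Python: a member is marked failed if its heartbeat has not advanced within Tfail, and the failed entry is only forgotten after a further Tcleanup so that stale gossip cannot resurrect it. The function and constant names are illustrative, and the values are placeholders rather than the paper's recommendations (the paper suggests setting the cleanup delay relative to Tfail, e.g. twice it).

    import time

    T_FAIL = 10.0      # no heartbeat increase for this long -> mark failed
    T_CLEANUP = 20.0   # keep the failed entry around (e.g. 2 * T_FAIL) before removal

    def sweep(table, failed, now=None):
        """table: member -> (heartbeat, last_time_heartbeat_advanced).
        failed: member -> time the member was marked failed."""
        now = now if now is not None else time.time()
        for member, (hb, last_advance) in list(table.items()):
            if member in failed:
                if now - failed[member] > T_CLEANUP:
                    del table[member]    # safe to forget; gossip about it has died out
                    del failed[member]
            elif now - last_advance > T_FAIL:
                failed[member] = now     # mark as failed, but remember it for a while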
SWIM Peer-to-peer applications require weakly-consistent knowledge of process group membership information at all participating processes. SWIM is a gossip-based group membership protocol. SWIM separates failure detection and membership updates. The rate of false failure detection is reduced by allowing group members to suspect a process before declaring it failed. This protocol achieves scalability by avoiding heartbeats and using random probing. Pros: -Satisfies strong completeness -Membership updates are propagated efficiently via gossiping. This makes it reliable and lets it scale in network load. -Reduces the frequency of false positives by suspecting rather than immediately declaring failure. -The expected message load per member in this protocol is independent of the group size. Cons and suggestions: -The experimental results are shown for a group size of only 56. Since the paper talks about the protocol being scalable, there could have been runs for larger numbers. From: Anupam Das [anupam009 SpamElide] Sent: Thursday, March 31, 2011 12:05 PM To: Gupta, Indranil Subject: 525 review 03/31 i. On Scalable and Efficient Distributed Failure Detectors This paper discusses developing a complete and scalable failure detector for a distributed system. The paper concentrates on reducing the worst-case network load imposed by traditional failure detector schemes toward the optimum. The authors propose a randomized distributed failure detector for this purpose. The randomized distributed failure detector protocol depends on two parameters (protocol period T and subgroup size k) which are derived from the application-specified requirements (speed of detection and accuracy of the protocol) and network-level unreliability (message loss probability and node failure probability). For a given node (A) the randomized protocol first tries to ping a random node (say B) for its status and, upon failing to retrieve any feedback within a time-out period, it randomly selects k other non-faulty nodes and sends them ping-req messages. These k random nodes then send ping messages to the original node (B) and send their responses back to A. In this way, A can decide whether B is faulty or not through either a direct or an indirect response. Pros: 1. The parameters required by the protocol are derived from application requirements and network unreliability. So, it imposes no user-dependent parameters. 2. No clock synchronization is required. Only a constant drift rate is required. 3. The paper provides analytical results for the proposed protocol. 4. On average the network load of the protocol is much lower than the worst-case network load of distributed heartbeating algorithms. The worst-case network load differs from the optimal case by a factor greater than 1. 5. The sub-optimality of the worst-case network load is independent of the subgroup size (k). Cons: 1. The overhead per message is high, as each packet contains only a few bytes of usable data (the majority is control bytes and headers). 2. The protocol assumes that the maximal group membership is always the same and not dynamic.
This assumption might not always hold in practice. 3. Though the paper provides strong analytical results, it would have been better if the analytical results could be backed up by a few experimental results. -------------------------------------------------------------------- ii. SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol Existing heartbeat-based membership protocols suffer from a quadratic increase in message load with group size and from high false positive rates. This paper describes a new group membership protocol named SWIM which addresses the scalability and accuracy issues of existing membership protocols. SWIM basically has two components - a failure detector and an information dissemination component. The failure detector keeps track of which members are alive while the dissemination component distributes this information throughout the system. The failure detector is a random-probe detector (similar to an epidemic strategy) in the sense that it randomly selects nodes to probe. Compared to the all-to-all heartbeat scheme this scheme scales much better. To ensure completeness the nodes are not selected in a totally random manner; rather, a random permutation of the nodes is used for probing. Then, instead of broadcasting the derived information, SWIM piggy-backs this information on the PING, PING-REQ or ACK packets used during the failure detection phase. As a result the dissemination itself is infection-style. SWIM also proposes another optimization to lower false positive rates through the introduction of a pre-failure state called “suspected” to give nodes a second chance to come alive. The authors also implemented their protocol and tested it on a large testbed to highlight its performance. Pros: 1. The proposed protocol scales linearly with the group size. 2. Reduces false positives through the use of the ‘suspected’ state. 3. Message overhead is reduced by piggybacking membership updates on ping packets. Cons: 1. Since confirm messages can override alive messages, this can cause temporary oscillation in node status (e.g., Suspect->Alive->Confirm). Moreover, why do you need to send a separate confirm message when a suspected node will eventually be deleted after a time-out period? 2. The paper does not provide a comparison with other existing protocols. It would have been nice to see such comparisons. -----Anupam From: Nipun Sehrawat [sehrawa2 SpamElide] Sent: Thursday, March 31, 2011 11:45 AM To: Gupta, Indranil Subject: 525 review 03/31 On Scalable and Efficient Distributed Failure Detectors This paper first describes four aspects of any failure detector - completeness (guarantee that the failure of a member is eventually detected by every other non-faulty member), accuracy (no non-faulty member declares another non-faulty member as failed), speed (time taken to detect a failed member, by some member) and scalability (in the network overhead of failure detection). Accuracy, speed and scalability are clubbed together and referred to as "efficiency" requirements in the paper. Completeness is further classified into weak (some member detects the failure) and strong (all members detect the failure). The speed of failure detection is generally measured as the time taken to satisfy weak completeness. Following from the impossibility-of-consensus result for asynchronous distributed systems, it has been shown that it is impossible to guarantee both completeness and accuracy simultaneously in a failure detector.
The paper first presents a model for quantifying the network load generated by a failure detector which satisfies completeness (with probability 1) and an efficiency level dictated by the concerned application. The paper assumes a large group of members and that the set of potential group members is known beforehand. Thus, each member is able to maintain (failure-related) information about all other group members. It also assumes the same member can have different incarnations, where a member takes on a new incarnation every time it recovers from a failure. The optimal worst-case network load generated by a failure-detection protocol running on the above model is derived in the paper and is shown to be directly proportional to the number of members and to the log of the accuracy requirement, and inversely proportional to the log of the message loss probability. Failure-detection protocols can then be compared on the basis of a 'sub-optimality factor' criterion, which is the ratio of the load generated by the protocol to the optimal load. A randomized distributed failure detection protocol is described in the paper, which guarantees completeness (with prob. 1) and accuracy with high probability (tunable by the application developer). The protocol dictates that every non-failed member periodically contact a randomly selected member. It does this by sending a ping message to the selected (destination) member. If no ack is received within the worst-case RTT, the source sends a 'ping request' message to k other randomly selected members. On receiving a ping request, these members contact the destination and ask it to reply back to the source member. If the source still doesn't receive an ack from the destination within a fixed time period, the destination is declared failed. Pros: - Provides a failure-detection protocol with tunable speed and accuracy but with strong completeness. - Derives an expression for the optimal worst-case load that can be expected in running a failure detector. Cons: - As acknowledged by the authors, the assumption made in section 3, that message losses are independently distributed across messages, might not hold true in real life. Member failures might also be dependent, in cases such as an entire datacenter going down. -- A Gossip-Style Failure Detection Service The paper presents a gossip-based failure detection protocol which scales well, in both network load and detection time, with the number of nodes in the system. A fail-stop model (processes don't lie, in contrast to the Byzantine failure model) is assumed and no assumptions are made about message delivery time in the network. A basic protocol is first presented that assumes a uniform network topology and a bounded number of host crashes. In the basic protocol, each node maintains a list of all other nodes and heartbeat counters associated with them. Each node periodically selects a random node and sends its entire view of nodes to it. A recipient node merges the incoming list with its local list, updating each heartbeat counter to the maximum of the local and received ones. If the last update time of the heartbeat counter of a particular node is older than a threshold, that node is assumed to have failed. However, in this approach, nodes independently discover failures and can't simply 'forget' about a node that they deem to have failed. This is because some other node might send (outdated) information about the node which has already been judged as failed.
So, information about failed nodes is maintained for some time period, to achieve a desired probability of accuracy. The protocol doesn't reflect partial failures, where a node is unreachable from some set of nodes but reachable from others. A node is deemed failed only when it is entirely unreachable. The basic protocol can be extended to leverage the underlying network topology. The fact that nodes will be divided into many domains and subnets is leveraged to reduce inter-domain and inter-subnet gossips, thus making the protocol more scalable with the number of nodes. On average, a single node per subnet gossips with nodes in other subnets in the same domain and a single node per domain gossips with nodes in other domains, which reduces the network traffic flowing across Internet routers. This protocol provides quick failure detection at the intra-subnet and intra-domain level, but the time taken to detect failures across domains/subnets increases (due to infrequent inter-domain/subnet gossip). Pros: - Scales well, with the number of nodes, in failure detection time and network load. - Resilient against message losses, as some degree of redundancy (in info received via gossip messages) is introduced by random gossip. Cons: - Nodes detect failures independently, so it might take long for all nodes to detect a failure in the case of the topology-aware protocol. Comments/Suggestions: - After a node discovers a failure, it might be beneficial to broadcast this discovery to all other nodes in the system. This might help in achieving strong completeness, but might result in lots of broadcasts if multiple nodes discover a failure at the same time (they all broadcast before receiving anyone else's broadcast). From: Qingxi Li [cs.qingxi.li SpamElide] Sent: Thursday, March 31, 2011 11:27 AM To: Gupta, Indranil Subject: 525 review 03/31 On Scalable and Efficient Distributed Failure Detectors SWIM: Scalable Weakly-consistent Infection-style Process Group Membership These two papers introduce one failure detector protocol, SWIM, which uses randomization to detect failed nodes. Because it is a randomized algorithm, the expected time to first detection of a failure and the message load per node are independent of the group size. This protocol assumes that a failed node may recover, and the nodes in the system may hold different membership lists recording which nodes are alive and which are dead at a given time. Besides this, it also assumes that failures in the network happen independently. The whole protocol includes two parts, the failure detector algorithm and membership update dissemination. For the randomized distributed failure detector, each node has an ID and a version (incarnation) number. If the node recovers after a failure, it increases its version number. Node A considers node B as failed if B cannot be accessed by the nodes or if the version number recorded by node A is smaller than the current version number of node B. I wonder whether it’s useful to detect the failure after the node has already recovered. To detect failing nodes, each node randomly chooses one node and sends a ping message. If this node is alive, it sends an ack back. If after a timeout the sender has not received the ack, it randomly chooses some nodes to help it send pings to the destination. This avoids the sender considering the destination as failed when the destination is alive but the path between the sender and destination is disconnected.
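A rough Python sketch of the direct-ping / indirect-ping-req exchange just described. The transport callbacks, timeouts, and constant names are hypothetical; the real protocol also piggybacks membership updates on these messages and can pick targets from a shuffled list (round-robin) rather than purely at random in order to bound detection time.

    import random

    K = 3                  # number of helpers asked to probe indirectly
    DIRECT_TIMEOUT = 0.5   # seconds to wait for a direct ack
    ROUND_TIMEOUT = 2.0    # total protocol period

    def probe(target, members, ping, ping_req):
        """ping(node, timeout) -> True if an ack came back in time.
        ping_req(helper, target, timeout) -> True if helper reports target alive."""
        if ping(target, DIRECT_TIMEOUT):
            return "alive"
        others = [m for m in members if m != target]
        for helper in random.sample(others, min(K, len(others))):
            if ping_req(helper, target, ROUND_TIMEOUT - DIRECT_TIMEOUT):
                return "alive"       # some path to the target still works
        return "suspect"             # hand off to the suspicion / dissemination logic

Because a false alarm against a live, reachable target now requires the direct exchange and all of the indirect exchanges to fail, the false positive probability drops roughly geometrically in the number of helpers, under an assumption of independent message losses.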
Since the destination is chosen randomly, in the worst case a failure may never be detected. So in the second paper, instead of randomly choosing one node, it uses round-robin selection, and when a node joins, it is added to the member list at a random position. For membership update dissemination, after detecting the failure, the node sends it by piggybacking the information on the ping, ping-req, and ack messages. This avoids generating extra messages. However, the information needs multiple rounds to be disseminated to the whole network. Besides this, I also wonder whether it’s OK for the whole network to use the same membership view. For example, since node A can request other nodes to send pings to B, it will consider node B as alive even if it cannot access node B itself. However, from A’s point of view, node B has failed even though it is alive. To avoid erroneous detections (the node may be slow, or for some other reason not send the ack back), the sender considers it suspected and keeps pinging the destination until a timeout. After that, it disseminates that this node has failed. From: muntasir.raihan SpamElide on behalf of muntasir raihan rahman [mrahman2 SpamElide] Sent: Thursday, March 31, 2011 10:52 AM To: Gupta, Indranil Subject: 525 review 03/31 SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol Summary: This paper presents the design and implementation of SWIM, a scalable, weakly consistent, gossip-based membership management protocol. SWIM achieves scalability by separating the failure detection and membership update dissemination tasks, and by avoiding heart-beating mechanisms. The proposed random probing method incurs constant overhead on the nodes, as well as keeping the expected detection time constant. SWIM uses a distributed failure detector introduced in an earlier paper. Membership updates are disseminated in an epidemic style and piggybacked on failure detection packets. The novel suspicion mechanism reduces false positives, whereas round-robin probe target selection guarantees time-bounded detection of failures at non-faulty processes. Pros: (1) Increased scalability through stable failure detection time, stable false positive rates and low message complexity. (2) It satisfies strong completeness. (3) Provides a deterministic time bound on failure detection. (4) Epidemic-style dissemination retains all the desirable properties of gossip algorithms, including low overhead and high reliability. Future Works: (1) A mathematical performance analysis of the suspicion mechanism could be interesting. (2) It would be interesting to investigate whether the suspicion mechanism can help with other distributed computing problems, like byzantine agreement or consensus. On Scalable and Efficient Distributed Failure Detectors Summary: This paper formally specifies desirable properties of efficient failure detectors and consequently proposes an efficient, scalable and randomized distributed failure detector that provides near-optimal performance. An ideal failure detector should achieve completeness (each failure is detected), accuracy (0 false positives), speed (fast first-time detection), and scalability. A previous impossibility result forbids completeness and accuracy at the same time, forcing researchers to sacrifice accuracy (only a probabilistic guarantee) in favor of completeness. The paper proves the worst-case load of an optimal detector for the first time and shows that existing detectors don’t come close to the optimal.
A new algorithm is proposed that is parameterized by the protocol period T’, and an integer k. The basic idea is as follows. At each time period, a member i randomly selects a node j to probe. If the ping request to j fails, i does not give up immediately. Instead it sends indirect ping requests to k other nodes, asking them to check j. Only when all the k nodes fail to notify ping success to j does i declare j as failed. This additional randomization allows the proposed algorithm to reach near optimal performance. The paper also mathematically analyzes the performance of the new algorithm. Pros: (1) Proposes first bound on worst-case load of an optimal failure detector. (2) The proposed algorithm imposes equal load on all members. (3) The algorithm allows application designers to tune design parameters in order to achieve desired levels of performance. Cons: (1) The independent message loss and failure assumptions may not be practical in some cases. Future Works: (1) Does the algorithm also work for byzantine failures? (2) Extending the results to models that assume correlated message loss and failures would be interesting. -- Muntasir From: david.m.lundgren SpamElide on behalf of David Lundgren [lundgre4 SpamElide] Sent: Thursday, March 31, 2011 10:24 AM To: Gupta, Indranil Subject: 525 review 03/31 SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol SWIM is a group membership protocol composed of two parts: a Failure Detector Component (FDC) and a Membership Dissemination Component (MDC). The FDC functions as follows: an arbitrary group member is chosen; this member pings a random second member; if this second member acks back, we know the other member is alive; if not, a ping-request for the second member is sent to a group of randomly chosen members who attempt to contact the original second member and forward this response (or lack thereof) back to the initiating node. The FDC is described in the second reading, On Scalable Efficient Distributed Failure Detectors. The MDC uses the network traffic of the FDC as a vessel for updating nodes' membership lists. A ``suspicion mechanism'' is introduced to reduce the false positive rate. It classifies each node as either suspect, alive, or confirmed (dead). Round-robin probing is also introduced to provide time-bounded strong completeness for the FDC. SWIM is evaluated and the suspicion mechanism is shown to lower the frequency of false alarms. Overhead is shown to remain static in spite of varying group size. Pros, Cons, Comments, and Questions: - SWIM's separation of its failure detector from its membership update function allows the membership updater to ``piggy-back'' on the failure detector's periodic pings and acks. This allows SWIM's network bandwidth to remain (relatively) constant. - SWIM is by definition weakly-consistent. A characterization of the degree of SWIM's typical member-list drift would be beneficial. - I am not entirely convinced that the suspicion mechanism's orders of preference and overriding maintain the desired correctness properties of the Failure Detector Component. ------------------------------------------------------------------------- On Scalable and Efficient Distributed Failure Detectors Gupta et al. quantify the worst-case network load of distributed failure detectors (DFD). 
They do this under the assumptions of: a large, static group membership, non-Byzantine failures, independent failures, constant clock rate, independent message loss, unicast communication, and a constant maximal message size. Due to an impossibility result, the authors assume guaranteed strong consistency and provide probabilistic speed and accuracy guarantees. It is shown that the worst-case network load is a function of network failure probability, group size, and failure detection time; and that there exists a failure detector that has such a minimal network load. This immediately implies that the O(n^2) heartbeat detectors are overkill and are not scalable. A DFD algorithm is introduced. The algorithm initially picks an arbitrary group member. This member then pings a random second member; if this second member acks back, we know the other member is alive. If an ack is not given, a ping-request for the second member is sent to a group of randomly chosen members who attempt to contact the original second member and forward this response (or lack thereof) back to the initiating node. Theoretical guarantees are then analyzed. Pros, Cons, Comments, and Questions: - The randomized failure detector algorithm does not need clock synchronization across members, but it does assume a steady clock rate for all members. Is this a completely reasonable assumption, and what happens to the detection algorithm if this assumption is relaxed? - A consequence of the authors' proof of a minimal worst-case network load is the immediate discarding of distributed heart-beating schemes that attempt to notify all group members (such protocols are O(n^2)). I think this is extremely solid motivation for the authors' exploration of a randomized failure detector. - I think an interesting discussion point might be: strong vs. weak consistency, and the situations where one is more appropriate than the other. - An area of future research could be to determine the efficacy of the algorithm under heavy churn conditions. From: Andrew Harris [harris78 SpamElide] Sent: Thursday, March 31, 2011 7:01 AM To: Gupta, Indranil Subject: 525 review 03/31 Review of “SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol”, Das et al, and “A Gossip-Style Failure Detection Service”, van Renesse et al SWIM is a replacement for traditional heartbeat-based group membership monitoring approaches, which uses a cooperative neighborly scheme to test members’ aliveness. Nodes periodically send heartbeats to neighboring nodes. Upon detecting a missed heartbeat, the originating node contacts its other neighbors to test the offending node. If they also fail the connection, the node is marked for removal from the overall network. Updates are pushed across the network in multicast format, with an update to the system allowing these messages to piggyback off of heartbeat packets (thus zeroing their network overhead). The scheme here ensures that dead nodes are eventually found, making it suitable for networks where being able to reach every member at once is not a priority. The group with SWIM also implements a few enhancements. A round-robin neighbor-checking scheme is used in place of a probabilistic scheme, making the estimated time to dead-neighbor-detection directly calculable for a given node. The aforementioned piggybacking scheme slashes the already-low overhead for network structure updates. 
A system for allowing potentially dead nodes (suspected nodes) to be reported and monitored decreases the false positive rate of the scheme implemented here. The group reports a handful of testbed statistics which confirm their theoretical assessments of the usage of the SWIM scheme across generalized networks. One assumption that is not discussed in depth is the need for a densely connected network. To reliably test the connectivity of nodes, bottlenecks in the network should be minimized. If such bottlenecks exist - such as, a router that bridges different sub-networks - then it could be that only a few nodes are able to check that bottleneck node for connectivity. Assuming the bottleneck also normally experiences a heavier load than other nodes in the network, it would be presumably slower to respond than other nodes as well, so the heartbeat timeout would need to scale with traffic across the node. Finally, if the bottleneck were part of a chain of nodes, it could be that only as few as two nodes could feasibly check a single node for connectivity. If both experience a false positive due to heavy network load (or other factors), then the network may be inappropriately partitioned as the supposedly dead bottlenecking node is removed from the known topology. The node may then attempt to reinsert itself into the network as a response, however the network outage still occurs. Furthermore, there is a delay in propagating the insertion of any new node, so the network partitioning would persist for as long as it would take the insertion message to spread (which is especially problematic for high-diameter networks). A gossip-styled failure detection service (GFDS) functions similarly to SWIM, in that neighboring nodes end up reporting to one another about failed connections. The difference here is an assumption on the network: that it is divided into tangible subnets. The GFDS group seeks to minimize bandwidth consumption, rather than minimize time to detection, for distributed failure detection. This is done both out of an academic pursuit to explore such an approach, and out of a practical concern for preserving the usable capacity of certain interconnected links (consider that the Internet was one of their platforms in focus, and that dialup internet users would have very slow links as compared to Ethernet/broadband users). Among its contributions, the GFDS group addresses the network partition problem in greater depth than SWIM, and offers a proposed solution (at least insofar as their scheme is concerned) which is based on waiting to broadcast a failure until a certain maximum time-to-failure is reached while simultaneously attempting to reconnect the failing machine(s). They also reduce redundant connectivity checks by encouraging nodes to select other nodes distant from them; this allows a check over multiple layers of network topology. The group presents experimental data, which show that their approach to failure detection is both robust and bandwidth efficient for large-scale networks. While mentioned, the cold-start problem seems to be vastly understated here. For an existing physical network, the topology is known a priori, but for a dynamic distributed system (e.g. Gnutella) a node needs a place to start its gossiping for network detection. Without a starting point, a node is left with only naive within-subnet neighbor discovery, which may prove fruitless especially if the subnet is relatively sparse. 
The group notes that a Hosts file would need to be maintained for cold-starts, akin to Gnutella node lists that are published for new users, however these are potentially severe points of failure and are an outside source of network interference which should be minimized. Suppose for instance that a malicious cold-start list is circulated amongst connecting machines; every machine connecting to the network will, at first, pass through a malicious machine in attempting to discover the network.From: Agarwal, Rachit [rachit.ee SpamElide] Sent: Thursday, March 31, 2011 4:31 AM To: Gupta, Indranil Subject: 525 review 03/31 Rachit Agarwal ---- A Gossip-Style Failure Detection Service The paper considers the problem of scalable failure detection in asynchronous failure-prone networks. It proposes a gossip-based protocol that is scalable (in terms of detection time and network load) and is extremely resilient to failures. I really like the network model assumed in the paper - very neat and acceptable - no clock synchronization, arbitrary message delays, possibility of network partitioning etc. It is also interesting that the protocol achieves a false detection probability that is independent of the number of processes in the network. Some of my thoughts/ideas: 1. Is there a sweet spot that trade-offs network bandwidth consumption and detection time? I believe there must be some -- most of the gossip based schemes possess such a trade-off, but I do admit that this may be dependent on the network and the particular application that requires failure detection. 2. Can these protocols be naturally extended to handle link failures? There are cases where detecting link failures is an important task -- even if most of the nodes are reliable, there may be a large number of link failures in the network. 3. How do gossip-based schemes perform in presence of Byzantine failures and/or malicious processes? I am sure there must be a lot of work in this direction. 4. I am left wondering how to choose the various parameters (T-cleanup and T-fail) in a distributed fashion? This seems like a hard problem. ---- On Scalable and efficient distributed failure detectors The paper takes another step in the direction of failure detector design: formally defines the problem, characterizes the requirements, lower and upper bounds and presents a scheme that seems to perform fairly well in comparison to the bounds. It is interesting that they were able to prove the inefficiency of heartbeat (gossip?) style schemes. Most of my comments/ideas/thoughts are on the network model and failure requirements as stated in the paper: 1. The paper argues that the quickness of a failure detector depends on the time from a member failure to weak completeness. I am slightly confused about the following scenario: what if the system achieves weak completeness and the process detecting the failure fails itself? In such a scenario, weak completeness neither guarantees strong completeness nor the time to weak completeness specifies the quickness of failure detectors. I agree that in a probabilistic sense, it does -- but completeness, as defined in the paper, is deterministic. 2. The paper makes the assumption that failure detection time specified by the application is "much" larger than the message propagation time and/or congestion repair time. Is that true in practice? Even if it is, one of the motivation of the papers was to do scalable real-time failure detection (as mentioned in the introduction), in which case this may not be true, right? 
Am I missing something? 3. The paper also makes the assumption that each processor in the network has a non-volatile memory device attached to it (for implementing incarnations). Is that a good enough assumption in practice? I am just not too sure. 4. The protocol allows communication using unicast messages. I am wondering how the results translate to the case of wireless networks? In particular, how do these results translate to networks that have broadcast messages? Do things improve/worsen? ---- Best regards, Rachit -- Siebel Center for Computer Science, University of Illinois at Urbana-Champaign, Web: http://agarwa16.wikidot.com ---- Sing as if No one is Listening, Dance as if No one is Watching, Dream as if You'll Live Forever, Live as if You'll Die Today !! From: Shen LI [geminialex007 SpamElide] Sent: Wednesday, March 30, 2011 10:53 PM To: Gupta, Indranil Subject: 525 review 03/31 Name: Shen Li SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol This paper proposes a low-overhead, scalable mechanism for detecting and disseminating process group failures. The basic idea is that, instead of using traditional heart-beating protocols, they introduce probability and randomness into their system. In the detection part, a member i randomly selects another member j to contact. If the initial communication fails, it randomly selects k (where k is fixed) other members as proxies and tries to communicate with j through them. Any reply from j keeps j in the non-faulty member set. In this way, they reduce the impact of network congestion between i and j. In the dissemination phase, if i finds that j is faulty, i simply multicasts this information to the other members. Since IP multicast is only best-effort, they also propose to use infection-style dissemination to accomplish this task (a small sketch of this piggybacked dissemination appears at the end of this email). Pros: 1. By introducing randomness into their system, the communication overhead of each node is only a constant which does not increase with the size of the group. In this way, SWIM can scale to very large systems. 2. They also provide a deterministic bound on their detection time. Cons: 1. Their scheme has a relatively long delay for detection and dissemination. So their work can only be used in non-time-critical scenarios. Other applications like on-line gaming cannot use SWIM. 2. In order to keep the overhead constant on a single member, they choose to use infection-style dissemination. As a result, the time required for update propagation is longer. 3. The bound on detection time is a function of group size. I know they try to curb the overhead on the single-member side, but people arguably care more about how the whole system performs. On Scalable and Efficient Distributed Failure Detectors This paper actually first proposed the detection algorithm described above and provides more specific theoretical proofs and analysis. Thanks to their analysis, we can now configure the time interval (T') and the size of the failure detection subgroups (k) according to application-specified requirements, such as the expected time between a member's failure and its detection by some non-faulty member (SPEED), and the probability that a non-faulty member is detected as faulty (ACCURACY). Pros: Their theoretical analysis is really useful for real system designers. They simply need to plug in the SPEED and ACCURACY values they need. Cons: 1. I am also curious about the variance of the load on each node. In real systems, even if nodes have the same expected load, large variance can also lead to bad performance. 2. As they said in the paper, the theoretical work is based on several assumptions, and those assumptions are not guaranteed to hold.
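A minimal sketch of the piggybacked (infection-style) dissemination several of the reviews describe: recent membership updates ride along on ping/ack traffic instead of being multicast, and each update is only re-sent a limited number of times (roughly O(log N) times in SWIM). The buffer class, field names, and constants here are hypothetical, not SWIM's actual data structures.

    import heapq

    MAX_PIGGYBACK = 6      # updates attached to each outgoing message
    MAX_RETRANSMITS = 10   # stop gossiping an update after this many sends

    class UpdateBuffer:
        def __init__(self):
            self._heap = []   # (times_sent, seq, update) - prefer least-gossiped updates
            self._seq = 0

        def add(self, update):
            heapq.heappush(self._heap, (0, self._seq, update))
            self._seq += 1

        def select_for_message(self):
            """Pick updates to attach to an outgoing ping, ping-req, or ack."""
            chosen, keep = [], []
            while self._heap and len(chosen) < MAX_PIGGYBACK:
                times_sent, seq, update = heapq.heappop(self._heap)
                chosen.append(update)
                if times_sent + 1 < MAX_RETRANSMITS:
                    keep.append((times_sent + 1, seq, update))   # keep gossiping it
            for item in keep:
                heapq.heappush(self._heap, item)
            return chosen

Preferring the least-gossiped updates is what gives each piece of news an epidemic spread while keeping the per-message overhead bounded, which is the constant-load property the reviews emphasize.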
From: Tengfei Mu [tengfeimu SpamElide] Sent: Wednesday, March 30, 2011 10:33 PM To: Gupta, Indranil Subject: 525 review 03/31 1. A gossip-style failure detection service This paper presents a failure detection service based on a gossip protocol. It provides accurate failure detection with known probability, and is resilient against permanent network partitions, transient message loss and host failures. Moreover, this failure detection service uses two separate protocols to achieve scalability in both time and bandwidth. The bandwidth requirement increases at most linearly with the number of processes. As for the detection time, as reported by the paper, it increases as O(n log n) with the number of processes. In addition, the paper provides a basic version, a topology-aware version and a catastrophe recovery version of the protocol. The basic version uses the gossip-based heartbeating protocol to detect failures. The topology-aware version incorporates the network infrastructure into the membership list and reduces the load on switches and routers. The catastrophe recovery version employs probabilistic broadcast to ensure that each host gets the updates even if a large portion of nodes crash. The authors also implemented the protocol in their department domain, and it reported failures correctly without making a single false detection. Pro: 1. The protocol provides accurate failure detection with a known probability of false detection. 2. The bandwidth requirement for gossiping grows at most linearly with the number of processes. Con: 1. The paper didn’t consider the failure of the failure detector itself. 2. SWIM: scalable weakly-consistent infection-style process group membership protocol The paper presents SWIM, a scalable weakly-consistent process group membership protocol, which is motivated by the unscalability of the popular heartbeat-based protocols. SWIM separates the failure detector from the membership update dissemination component. As for the failure detector, in order to achieve scalability, SWIM uses random peer-to-peer probing of processes instead of heartbeating. In the membership update dissemination component, updates are propagated efficiently in an infection style by piggybacking on packets generated by the failure detector. In addition, SWIM uses a suspicion mechanism to reduce the false positive rate and a round-robin probe target selection optimization to ensure a bounded failure detection time. Pro: 1. SWIM achieves strong completeness and low false positives; the average delay and network overhead are independent of the number of nodes, and the maximum delay is linear in the number of nodes. Con: 1. Proactive probing may cause subtle problems in real-world network settings with NAT. From: Chi-Yao Hong [cyhong1128 SpamElide] Sent: Wednesday, March 30, 2011 2:04 PM To: Gupta, Indranil Subject: 525 review 3/31 ---- A Gossip-Style Failure Detection Service, Middleware’98 ---- Accurate failure detection is considered an important, yet difficult problem for distributed systems. It is very difficult to provide a deterministic guarantee on detection accuracy for a large-scale network without compromising network resources. Things get worse as devices might fail. In the worst case, the network might get partitioned.
Also, distributed systems are usually unsynchronized, which implies that message delivery time cannot be bounded. To address this, the paper relaxes the accuracy requirement and seeks a scalable detection scheme that can provide reasonable accuracy. In particular, a gossiping protocol is used to send heartbeats to other randomly chosen devices. With probabilistic analysis, the system administrator can choose the gossiping parameters to obtain a desirable expected detection time. As an optimization, the authors propose a location-aware gossiping scheme to improve network bandwidth consumption. Cons: 1. I am not convinced that gossiping is the best approach. For a network with high churn (e.g., wireless ad hoc / sensor networks), randomization (gossiping) is plausibly the easier thing to do. However, for a static network (e.g., desktops connected to the Internet), randomly choosing peers to send messages to could lead to suboptimal usage of network resources. I am wondering whether anyone has compared the gossiping scheme with the optimal case (e.g., solving the problem as an integer programming problem) to evaluate the efficiency of gossiping-based schemes. ---- SWIM: Scalable Weakly-consistent Infection-style process group Membership protocol, DSN’02 ---- In the past, failure detection algorithms relied on heartbeat-based communication. However, there are fundamental limitations of heartbeat-based schemes. These schemes provide either bad scalability or long detection/dissemination time. To solve this problem, SWIM proposes to decouple failure detection from dissemination. SWIM uses a two-phase algorithm to detect failures with a configurable false positive probability. Pros: 1. I like the idea of sending probes proactively, which provides a deterministic time bound on detection. 2. I like the idea of performing detection in two stages. The first stage consumes little bandwidth to select a set of suspected members. Then in the second stage SWIM leverages subgroups to check the aliveness of these suspects, which provides high completeness. Discussion: 1. It would be interesting to see more comparison between proactive and reactive probing, in either mathematical analysis or simulation. Why not use a mixed scheme? For example, if a node does not receive any proactive pings for a configurable period of time, it actively sends out heartbeat messages. -- Chi-Yao Hong PhD Student, Computer Science University of Illinois at Urbana-Champaign http://hong78.projects.cs.illinois.edu/ From: Tony Huang [tonyh1986 SpamElide] Sent: Monday, March 28, 2011 12:56 PM To: Gupta, Indranil Cc: Tony Huang Subject: 525 review 03/31 Paper: A gossip-based failure detection service Core Idea: The paper introduces a decentralized, distributed failure detection service based on a gossip-style protocol. The basic protocol requires each machine in the network to maintain a heartbeat counter. Every T_gossip seconds, each member increments its own counter and gossips it to one of its neighbors. Each member occasionally broadcasts its list for bootstrapping purposes. A node is considered failed if its counter has not been incremented for T_fail seconds. The threshold T_fail is chosen so that the probability of making an erroneous detection is less than some small threshold P_mistake (a toy way to estimate such a T_fail is sketched below). The faulty node will remain on the list for a T_cleanup period before it is removed from the list, to prevent it from being accidentally re-added by nodes that have not yet received the update.
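A toy Monte-Carlo way to estimate how large T_fail must be for a target P_mistake, under a deliberately simplified push-gossip model (one random push per informed member per round, no message loss). This is only an illustration of the kind of calculation involved; it is not the paper's analytical derivation, and all names and defaults are hypothetical.

    import random

    def rounds_to_reach(n, trials=2000):
        """Sample how many rounds one fresh heartbeat takes to reach a fixed node."""
        samples = []
        for _ in range(trials):
            informed = {0}            # node 0 just incremented its heartbeat
            rounds = 0
            while (n - 1) not in informed:
                pushes = {random.randrange(n) for _ in informed}
                informed |= pushes
                rounds += 1
            samples.append(rounds)
        return samples

    def pick_t_fail(n, t_gossip, p_mistake=0.001):
        samples = sorted(rounds_to_reach(n))
        idx = min(len(samples) - 1, int((1 - p_mistake) * len(samples)))
        return samples[idx] * t_gossip   # time by which the update has arrived w.h.p.

    # e.g. pick_t_fail(n=100, t_gossip=1.0) -> a T_fail estimate in seconds

The trade-off Tony notes next falls straight out of this: shrinking P_mistake pushes the chosen quantile further into the tail of the propagation-delay distribution, so T_fail (and hence detection time) grows.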
The protocol only detects hosts that are completely unreachable. The author evaluates the protocol and shows that it also displays the classic tradeoff between convergence time and probability of mistakes experienced by most randomized algorithms. The author also notes that the baseline protocol does not perform satisfactorily under network partition conditions. To address the problem of scalability, the author introduces the idea of multi-level gossiping, in which the nodes are partitioned according to their IP subnet masks. Nodes within one subnet mostly gossip within their own cluster. To improve the performance of catastrophe recovery, the protocol uses a broadcast protocol to attempt to restore connections to the remaining members. Pros: * It's conceptually simple, easy to understand and uses a mature, well-understood protocol to perform the task. * The convergence bound property is nice to have since it can guarantee very accurate detection of failures. Cons and thoughts: * It is not designed to handle Byzantine failures. * Detection cannot be made absolutely accurate, since that would theoretically require infinite time to converge. For applications that require absolutely accurate detection, the protocol becomes inapplicable. * The paper does not address the problem of a mis-detection. It seems to me that once a node is mis-classified by another node, its behavior is unpredictable. Paper: SWIM: Scalable Weakly-consistent Infection-style process group Membership protocol Core Idea: The paper introduces a membership protocol based on a gossip-style protocol. A membership protocol provides each member in the system with a list of the current non-faulty nodes in the system. The authors' proposed solution includes two components, a failure detector and a dissemination component. The SWIM failure detector employs a two-stage protocol for failure detection. It has two parameters, a protocol period T and the size of the failure detection subgroups k. During each protocol period, a member randomly selects a member of the group to ping. If the ping does not return a response, the member asks a subgroup of other members to ping the target and report back if it responds. A node is considered dead if none of the members returns an ack. After detecting the failure, the baseline algorithm simply multicasts the identity of the failed node to its group. In an enhanced version of the system, a node that fails the probe is first marked as 'suspected', and is marked as permanently failed only after a certain period with no successful contact from other members of the group. The authors also show that, by carefully choosing the member to ping, we obtain a time bound on how long the protocol takes to detect a failure. The evaluation shows that the protocol generates very low traffic (5 messages for 99% of the time). The group size does not seem to directly affect the time to detection or the latency of the spread of infection, and the protocol achieves stable behavior when facing high packet drop rates. Pros: * Clock synchronization is not required. * Simple, effective and scalable, as shown by the evaluation. Cons and thoughts: * As with all gossip-style protocols, the time bound on latency and final convergence is not a hard limit. This prompts a question: is it possible to have a short convergence time and a high success rate at the same time for a gossip-style protocol? * The paper did not describe how to form the multicast group at the beginning without human intervention. -- Regards -- Tony