Lab 8 - Community Detection Algorithms¶

Peter Kairouz and Pramod Viswanath¶

In this lab, you will learn how to design community detection algorithms to cluster a given social network into non-overlapping communities.

After filling this notebook and running all the cells, rename the file lab8.ipynb to firstname_lastname_lab8.ipynb, include your well commented code, and submit it by email. Avoid unneeded steps/computations and make sure your code runs before submitting it. Grading is based on your submission which is due at 4 p.m. March 16, 2016. Your grade will be deducted 20 points for each day after the due date (late penalty).

This lab is adapted from UC Berkeley's EE126 by Prof. Kannan Ramchandran.

You will need the following helper function.

%matplotlib inline
from pylab import *
import random as rnd
import networkx as nx
from __future__ import division

rcParams['figure.figsize'] = 12, 12  # that's default image size for this interactive session

def draw_graph(graph, labels=None, graph_layout='shell',
               node_size=1600, node_color='blue', node_alpha=0.3,
               node_text_size=12,
               edge_color='blue', edge_alpha=0.3, edge_tickness=1,
               edge_text_pos=0.3,
               text_font='sans-serif'):
    """ 
    Based on: https://www.udacity.com/wiki/creating-network-graphs-with-python
    We describe a graph as a list enumerating all edges.
    Ex: graph = [(1,2), (2,3)] represents a graph with 2 edges - (node1 - node2) and (node2 - node3)
    """
    
    # create networkx graph
    G=nx.Graph()

    # add edges
    for edge in graph:
        G.add_edge(edge[0], edge[1])

    # these are different layouts for the network you may try
    # shell seems to work best
    if graph_layout == 'spring':
        graph_pos=nx.spring_layout(G)
    elif graph_layout == 'spectral':
        graph_pos=nx.spectral_layout(G)
    elif graph_layout == 'random':
        graph_pos=nx.random_layout(G)
    else:
        graph_pos=nx.shell_layout(G)

    # draw graph
    nx.draw_networkx_nodes(G,graph_pos,node_size=node_size, 
                           alpha=node_alpha, node_color=node_color)
    nx.draw_networkx_edges(G,graph_pos,width=edge_tickness,
                           alpha=edge_alpha,edge_color=edge_color)
    nx.draw_networkx_labels(G, graph_pos,font_size=node_text_size,
                            font_family=text_font)
    # show graph
    plt.show()

Problem 1: Erdős-Rényi Graphs (10 pts)¶

In this problem, we explore a class of random graphs introduced by Erdős and Rényi. The Erdős-Rényi model states that given $n$ nodes, any pair of nodes $(i,j)$ are connected with probability $p$.

Fill out the following function to create an Erdős–Rényi random graph $\operatorname{G}(n,p)$. For each pair of nodes, randomly create an edge between them with probability $p$. Return the resulting graph (same format as before).

def G(n,p):
    graph = [] 
    # in this lab, we describe a graph as a list of tuples enumerating all edges - node names can be numbers.
    
    # your code goes here
    
    return graph

# try this
graph = G(10,0.1)
draw_graph(graph)

Write a function that accepts a graph (list of tuples) as input and converts it to an adjacency list.

def adjacency_list(graph):
    """
    Takes in the current representation of the graph, outputs an equivalent
    adjacency list
    Example: graph = [(0,1), (1,2), (2,0)] --> adjacency = [ [1, 2], [0, 2], [0, 1]]
    """
    # your code goes here
    
    return adjacency

# try this
graph = G(10,0.1)
draw_graph(graph)
print adjacency_list(graph)

Problem 2: The Stochastic Block Model & Min-Bisection Algorithm (40 pts)¶

The Stochastic Block Model¶

Let's revisit the republicans vs. democrats example. Let's assume, for the sake of simplicity, that there is an equal number of republicans and democrats, and we have a graph with nodes representing the American citizens (edges represent friendships between them). Given such a graph, can we determine which citizens are part of the same community (political party in this case)?

In the stochastic block model (let's call it SBM), we have graphs of the form $G(n,p,q)$. For simplicity, let's assume that $n$ is even and $p>q$. In this model, we have two "communities" each of size $n/2$ such that the probability of an edge existing between any two nodes within a community is $p$ and the probability of an edge between the two communities is $q$. We are interested in recovering the communities from a realization of the random graph.

Use the function you wrote in problem 1 to implement the stochastic block model.

def SBM(n,p,q):
    """
    Let the first n/2 nodes be part of community A and 
    the second n/2 part of community B.
    """
    assert(n % 2 == 0)
    mid = int(n/2)
    graph = []
    for i in xrange(n):
        graph.append((i,i))
        
    # create community A
    # your code goes here
                
    # create community B       
    # your code goes here
                
    # form connections between communities
    for i in xrange(mid):
        for j in xrange(mid, n):
            if rnd.random() < q:
                graph.append( (i, j) )
    return graph

# try this
graph = SBM(20,0.6,0.05)
draw_graph(graph,graph_layout='spring')

The Min-Bisection Algorithm¶

Recall the problem setup: we are given a graph of connections, and we know that half of the nodes belong to one community and that the other half belong to the other community. Intuitively, we expect the $n/2$ nodes in the same community to have more connections amongst each other than with the other $n/2$ nodes. Consider the following toy example. We are given the above graph and would like to recover the community assignments. Perhaps it is clear by inspection that the proper community assignments should be:

In order to properly address this question, we must first introduce the notion of min-bisection. Formally, consider a graph $G=(V,E)$, where $V$ denotes the set of vertices and $E$ denotes the set of edges. We define a $(2,\frac{n}{2})$ partition of the graph to be a split of the graph into $2$ groups of nodes of size $\frac{n}{2}$ each. The min-bisection of the graph is the $(2, \frac{n}{2})$ partition with the minimum total edge weight across partitions. Looking back at the toy example above we can see that the community assignment corresponds to a min-bisection of the graph.

Fill in the functions below to implement the min-bisection algorithm.

def community_degree(community, adjacency):
    """
    Return the number of edges crossing from one community to the other
    """
    # your code goes here 
    
    return total_edges

def min_bisection(graph):
    """
    Return two lists of size n/2, one corresponding to each community in the graph
    """
    adjacency = adjacency_list(graph) # it is easier to implement the min-bisection algorithm by inspecting an adjacency list
    # your code goes here
    
    # use community_degree to count the number of edges crossing from one community to the other
    
    return min_bisection_community

# test your code here!
graph = SBM(20,0.6,0.05)
draw_graph(graph,graph_layout='spring')
min_bisection_community = min_bisection(graph)
print min_bisection_community[0]
print min_bisection_community[1]

In the graphs we create, we know that nodes $(0, \frac{n}{2} - 1)$ from community A while nodes $(\frac{n}{2}, n)$ form community B. Let $p = \alpha\log(n)/n$ and $q = \beta \log(n)/n$. Using this knowledge, simulate the probability of exact recovery when $\frac{\alpha + \beta}{2} - \sqrt{\alpha \beta} > 1$. Then do the same for $\frac{\alpha + \beta}{2} - \sqrt{\alpha \beta} < 1$

import numpy as np 

def prob_recovery(L, n, alpha, beta):
    mid = int(n/2)
    ground_truth1 = tuple(np.arange(mid)) # community A
    ground_truth2 = tuple(np.arange(mid, n)) # community B
    p = (alpha)*np.log(n) / n
    q = (beta)*np.log(n) / n
    num_correct = 0
    
    # your code goes here
    
    # do the following L times
    # generate an SBM graph 
    # use min_bisection(graph) to find the 2 communities in the randomly generated graph
    # compare the communities returned by min_bisection to the true communities (community A and B)
    # if the match increment num_correct
    
    return int(num_correct/L)

L = 100 # number of iterations to average over
# case 1: (alpha + beta)/2 - np.sqrt(alpha * beta) > 1 
alpha = 4
beta = 0.25
n = 10
p = (alpha)*np.log(n) / n
q = (beta)*np.log(n) / n
print "Probability of intra-community edges: ", p
print "Probability of inter-community edges: ", q
print "Value threshold: ", (alpha + beta)/2 - np.sqrt(alpha * beta)
print "Probability of recovery: ", prob_recovery(L, n,alpha,beta)

# case 2: (alpha + beta)/2 - np.sqrt(alpha * beta) < 1 
alpha = 0.5
beta = 0.25
n = 10
p = (alpha)*np.log(n) / n
q = (beta)*np.log(n) / n
print "Probability of intra-community edges: ", p
print "Probability of inter-community edges: ", q
print "Value threshold: ", (alpha + beta)/2 - np.sqrt(alpha * beta)
print "Probability of recovery: ", prob_recovery(L, n,alpha,beta)

# you can also try other values of alpha and beta (just make sure p and q are both between 0 and 1)

What do you observe?

Your answer.

In the above tests, we used $n=10$, which is an incredibly small graph size. We would like to be able to analyze bigger graphs where $n=100, 500, 1000$, etc. What happens when you try to run the min-bisection algorithm on a graph of size $n=100$? Why do you think this is happening?