In this lab, you will learn how to design community detection algorithms to cluster a given social network into non-overlapping communities.
After filling this notebook and running all the cells, rename the file lab8.ipynb to firstname_lastname_lab8.ipynb, include your well commented code, and submit it by email. Avoid unneeded steps/computations and make sure your code runs before submitting it. Grading is based on your submission which is due at 4 p.m. March 16, 2016. Your grade will be deducted 20 points for each day after the due date (late penalty).
This lab is adapted from UC Berkeley's EE126 by Prof. Kannan Ramchandran.
You will need the following helper function.
%matplotlib inline
from pylab import *
import random as rnd
import networkx as nx
from __future__ import division
rcParams['figure.figsize'] = 12, 12 # that's default image size for this interactive session
def draw_graph(graph, labels=None, graph_layout='shell',
node_size=1600, node_color='blue', node_alpha=0.3,
node_text_size=12,
edge_color='blue', edge_alpha=0.3, edge_tickness=1,
edge_text_pos=0.3,
text_font='sans-serif'):
"""
Based on: https://www.udacity.com/wiki/creating-network-graphs-with-python
We describe a graph as a list enumerating all edges.
Ex: graph = [(1,2), (2,3)] represents a graph with 2 edges - (node1 - node2) and (node2 - node3)
"""
# create networkx graph
G=nx.Graph()
# add edges
for edge in graph:
G.add_edge(edge[0], edge[1])
# these are different layouts for the network you may try
# shell seems to work best
if graph_layout == 'spring':
graph_pos=nx.spring_layout(G)
elif graph_layout == 'spectral':
graph_pos=nx.spectral_layout(G)
elif graph_layout == 'random':
graph_pos=nx.random_layout(G)
else:
graph_pos=nx.shell_layout(G)
# draw graph
nx.draw_networkx_nodes(G,graph_pos,node_size=node_size,
alpha=node_alpha, node_color=node_color)
nx.draw_networkx_edges(G,graph_pos,width=edge_tickness,
alpha=edge_alpha,edge_color=edge_color)
nx.draw_networkx_labels(G, graph_pos,font_size=node_text_size,
font_family=text_font)
# show graph
plt.show()
In this problem, we explore a class of random graphs introduced by Erdős and Rényi. The Erdős-Rényi model states that given $n$ nodes, any pair of nodes $(i,j)$ are connected with probability $p$.
def G(n,p):
graph = []
# in this lab, we describe a graph as a list of tuples enumerating all edges - node names can be numbers.
# your code goes here
return graph
# try this
graph = G(10,0.1)
draw_graph(graph)
def adjacency_list(graph):
"""
Takes in the current representation of the graph, outputs an equivalent
adjacency list
Example: graph = [(0,1), (1,2), (2,0)] --> adjacency = [ [1, 2], [0, 2], [0, 1]]
"""
# your code goes here
return adjacency
# try this
graph = G(10,0.1)
draw_graph(graph)
print adjacency_list(graph)
Let's revisit the republicans vs. democrats example. Let's assume, for the sake of simplicity, that there is an equal number of republicans and democrats, and we have a graph with nodes representing the American citizens (edges represent friendships between them). Given such a graph, can we determine which citizens are part of the same community (political party in this case)?
In the stochastic block model (let's call it SBM), we have graphs of the form $G(n,p,q)$. For simplicity, let's assume that $n$ is even and $p>q$. In this model, we have two "communities" each of size $n/2$ such that the probability of an edge existing between any two nodes within a community is $p$ and the probability of an edge between the two communities is $q$. We are interested in recovering the communities from a realization of the random graph.
def SBM(n,p,q):
"""
Let the first n/2 nodes be part of community A and
the second n/2 part of community B.
"""
assert(n % 2 == 0)
mid = int(n/2)
graph = []
for i in xrange(n):
graph.append((i,i))
# create community A
# your code goes here
# create community B
# your code goes here
# form connections between communities
for i in xrange(mid):
for j in xrange(mid, n):
if rnd.random() < q:
graph.append( (i, j) )
return graph
# try this
graph = SBM(20,0.6,0.05)
draw_graph(graph,graph_layout='spring')
Recall the problem setup: we are given a graph of connections, and we know that half of the nodes belong to one community and that the other half belong to the other community. Intuitively, we expect the $n/2$ nodes in the same community to have more connections amongst each other than with the other $n/2$ nodes. Consider the following toy example. We are given the above graph and would like to recover the community assignments. Perhaps it is clear by inspection that the proper community assignments should be:
In order to properly address this question, we must first introduce the notion of min-bisection. Formally, consider a graph $G=(V,E)$, where $V$ denotes the set of vertices and $E$ denotes the set of edges. We define a $(2,\frac{n}{2})$ partition of the graph to be a split of the graph into $2$ groups of nodes of size $\frac{n}{2}$ each. The min-bisection of the graph is the $(2, \frac{n}{2})$ partition with the minimum total edge weight across partitions. Looking back at the toy example above we can see that the community assignment corresponds to a min-bisection of the graph.
def community_degree(community, adjacency):
"""
Return the number of edges crossing from one community to the other
"""
# your code goes here
return total_edges
def min_bisection(graph):
"""
Return two lists of size n/2, one corresponding to each community in the graph
"""
adjacency = adjacency_list(graph) # it is easier to implement the min-bisection algorithm by inspecting an adjacency list
# your code goes here
# use community_degree to count the number of edges crossing from one community to the other
return min_bisection_community
# test your code here!
graph = SBM(20,0.6,0.05)
draw_graph(graph,graph_layout='spring')
min_bisection_community = min_bisection(graph)
print min_bisection_community[0]
print min_bisection_community[1]
import numpy as np
def prob_recovery(L, n, alpha, beta):
mid = int(n/2)
ground_truth1 = tuple(np.arange(mid)) # community A
ground_truth2 = tuple(np.arange(mid, n)) # community B
p = (alpha)*np.log(n) / n
q = (beta)*np.log(n) / n
num_correct = 0
# your code goes here
# do the following L times
# generate an SBM graph
# use min_bisection(graph) to find the 2 communities in the randomly generated graph
# compare the communities returned by min_bisection to the true communities (community A and B)
# if the match increment num_correct
return int(num_correct/L)
L = 100 # number of iterations to average over
# case 1: (alpha + beta)/2 - np.sqrt(alpha * beta) > 1
alpha = 4
beta = 0.25
n = 10
p = (alpha)*np.log(n) / n
q = (beta)*np.log(n) / n
print "Probability of intra-community edges: ", p
print "Probability of inter-community edges: ", q
print "Value threshold: ", (alpha + beta)/2 - np.sqrt(alpha * beta)
print "Probability of recovery: ", prob_recovery(L, n,alpha,beta)
# case 2: (alpha + beta)/2 - np.sqrt(alpha * beta) < 1
alpha = 0.5
beta = 0.25
n = 10
p = (alpha)*np.log(n) / n
q = (beta)*np.log(n) / n
print "Probability of intra-community edges: ", p
print "Probability of inter-community edges: ", q
print "Value threshold: ", (alpha + beta)/2 - np.sqrt(alpha * beta)
print "Probability of recovery: ", prob_recovery(L, n,alpha,beta)
# you can also try other values of alpha and beta (just make sure p and q are both between 0 and 1)