# Community Detection in Graphs



### Thomas Bonald

Ongoing work with Marc Lelarge and Alexandre Hollocou

# Motivation

Graphs are everywhere

• Infrastructure (roads, airlines, Internet,...)
• Information (Web, Wikipedia,...)
• Social networks (Facebook, Twitter,...)
• Biology (brain, protein interaction, ...)

How to extract information from these graphs?

# Community detection

Real-life graphs are organized in communities ($\approx$ dense groups of nodes)

We seek to infer these communities

Applications:

• Search engines
• Content recommendation
• Data visualization
• Classification
• ...

# Local community detection

Most existing algorithms give a partition of the whole graph

We are interested in algorithms revealing the local structure around some target nodes (the seed set)

These local communities are typically:

• centered
• hierarchical
• overlapping

# Outline

1. Background
2. Ranking
3. Quality metrics
4. Live examples
5. Future work

# What do graphs look like?

Most real-life graphs are large, scale-free and small-world

$$\begin{array}{l|c|c} & \text{DBLP} & \text{Wikipedia} \\ \hline \text{# nodes} & 1.2\text{M} & 4.2\text{M} \\ \text{# edges} & 5.3\text{M} & 101\text{M} \\ \text{average degree} & 8 & 24\\ \text{standard deviation} & 21 & 48\\ \text{3-hop population} & 91\% & 20\% \text{(out) } 98\% \text{(in)}\\ \end{array}$$

Karinthy 1929, Milgram 1967

# Modularity

Given a partition of the (undirected) graph, the modularity is defined by $$Q = \frac 1 {2m}\sum_{i,j}A_{ij}\delta_{ij} - \frac 1{(2m)^2} \sum_{i,j}d_{i}d_j\delta_{ij}$$ where

• $A$ is the adjacency matrix
• $d = A1$ is the vector of degrees
• $m$ is the number of edges, so that $2m = \sum_i d_i$
• $\delta_{ij}=1$ if $i$ and $j$ are in the same community, and $0$ otherwise
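
As a rough illustration of this definition, the sketch below computes $Q$ with NumPy on a toy graph made of two triangles joined by one edge. The function name `modularity_sketch` and the example graph are assumptions made for this example; it is not the `modularity` function called in the notebook.

```python
import numpy as np

def modularity_sketch(A, labels):
    """Modularity Q of a partition of an undirected graph.

    A      : symmetric (n, n) adjacency matrix
    labels : length-n sequence, labels[i] is the community of node i
    """
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    d = A.sum(axis=1)                            # degree vector d = A1
    two_m = d.sum()                              # 2m = sum of degrees
    delta = labels[:, None] == labels[None, :]   # delta_ij as a boolean matrix
    return (A * delta).sum() / two_m - (np.outer(d, d) * delta).sum() / two_m**2

# Toy graph (hypothetical): triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

print(modularity_sketch(A, [0, 0, 0, 1, 1, 1]))   # about 0.357 (= 5/14)
print(modularity_sketch(A, [0, 0, 0, 0, 0, 0]))   # 0 for the trivial partition
```

The dense $n \times n$ matrix `delta` keeps the code close to the formula; on graphs of the size discussed above a sparse, per-community computation would be needed instead.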

# Example

In [47]:
modularity(A,C)

Out[47]:
0.39998548410509505

# Modularity of weighted graphs

$$Q = \frac 1 {w^T1}\sum_{i,j}A_{ij}\delta_{ij} - \frac 1{(w^T1)^2} \sum_{i,j}w_{i}w_j\delta_{ij}$$

where

• $A$ is the weighted adjacency matrix
• $w = A1$ is the vector of node weights
• $\delta_{ij}=1$ if $i$ and $j$ are in the same community

# Existing algorithms

• Greedy algorithms Newman 2004, Blondel et al. 2008
• Simulated annealing Guimera & Amaral 2005
• Spectral methods von Luxburg 2007, Newman 2013
• Statistical inference Hastings 2006, Newman & Leicht 2007
• Random walks Pons & Latapy 2005, Rosvall & Bergstrom 2007

# Datasets

SNAP = Stanford Network Analysis Project

Graphs with ground-truth communities:

• Social networks $\to$ groups
• Amazon product network $\to$ product categories
• DBLP collaboration network $\to$ conferences, journals

Yang & Leskovec 2012

# Local community detection

Classical approach:

• Rank nodes with respect to their "distance" to the seed set
• Evaluate the resulting successive communities
• Select the best one(s)
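
The three steps above can be sketched as a simple sweep over a ranking. In this sketch the ranking `order` is taken as given (how to produce it is a separate question), and conductance (cut weight divided by volume) is used as the quality score, which is one common choice; the function name `sweep` and the toy graph are assumptions made for this example.

```python
import numpy as np

def sweep(A, order):
    """Conductance of the successive communities order[:1], order[:2], ..."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    in_C = np.zeros(len(d), dtype=bool)
    scores = []
    for i in order[:-1]:                      # the last prefix is the whole graph
        in_C[i] = True                        # grow the community by one node
        cut = A[in_C][:, ~in_C].sum()         # weight of edges leaving the community
        scores.append(cut / d[in_C].sum())    # conductance of the current prefix
    return scores

# Toy graph (hypothetical): triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

order = [0, 1, 2, 3, 4, 5]              # assumed ranking around the seed node 0
scores = sweep(A, order)
best_size = int(np.argmin(scores)) + 1  # size of the prefix with the lowest score
print(order[:best_size])                # the first triangle: [0, 1, 2]
```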

Clauset 2005, Andersen & Lang 2006

Other approach: Sozio & Gionis 2010

# Outline

1. Background
2. Ranking
3. Quality metrics
4. Live examples
5. Future work

# PageRank

Ranking = frequency of visits of a random walk

Brin & Page 1998

# Case of undirected graphs

Let $\mu^{(t)}$ be the distribution of the random walk at time $t$: $$\mu^{(t)} = \mu^{(t-1)} P$$ where $P$ is the transition matrix $$P_{ij} = \frac{A_{ij}}{d_i}$$

Limiting distribution: $$\mu^{(t)} \to \mu \propto d$$
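
The convergence $\mu^{(t)} \to \mu \propto d$ can be checked numerically on a small example. The sketch below assumes a connected, non-bipartite graph (so that the walk does converge); the graph itself is made up for the illustration.

```python
import numpy as np

# Toy undirected graph (hypothetical): a triangle 0-1-2 plus a pendant node 3 attached to 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
P = A / d[:, None]                     # P_ij = A_ij / d_i

mu = np.array([1.0, 0.0, 0.0, 0.0])    # walk started at node 0
for _ in range(200):
    mu = mu @ P                        # mu^(t) = mu^(t-1) P

print(mu)              # distribution after 200 jumps
print(d / d.sum())     # degrees normalized to sum to 1
```

Both printed vectors agree up to numerical precision, whatever the starting node.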

# Case of directed graphs

What about sinks / absorbing sets?

# Sinks

Modified transition matrix: $$P_{ij} = \left\{ \begin{array}{ll} \frac{A_{ij}}{d^+_i} & \text{if } d^+_i>0\\ \frac 1 n & \text{otherwise} \end{array} \right.$$
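
A minimal sketch of this modified transition matrix with NumPy, assuming a dense adjacency matrix; the function name `transition_matrix` and the three-node graph are hypothetical.

```python
import numpy as np

def transition_matrix(A):
    """Row-stochastic matrix where rows of sinks are replaced by the uniform distribution."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    d_out = A.sum(axis=1)                   # out-degrees d^+
    P = np.full((n, n), 1.0 / n)            # default row: uniform (used for sinks)
    has_out = d_out > 0
    P[has_out] = A[has_out] / d_out[has_out, None]   # A_ij / d^+_i elsewhere
    return P

# Toy directed graph (hypothetical): 0 -> 1, 0 -> 2, 1 -> 2; node 2 is a sink.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])
P = transition_matrix(A)
print(P)
```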

# Absorbing sets

Damping factor $\alpha \in (0,1)$:

• Walk with probability $\alpha$
• Teleport with probability $1-\alpha$

Default value $\alpha = 0.85$

The walking distance between two teleportations is geometric with mean $\frac \alpha{1-\alpha}\approx 5.7$ for $\alpha = 0.85$

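# Dynamics

Let $p^{(t)}$ be the distribution of the random walk at time $t$: $$p^{(t)} = (1-\alpha)p^{(0)} + \alpha p^{(t-1)}P$$ Limiting distribution: $$p^{(t)}\to p = (1-\alpha)\sum_{t=0}^{+\infty}\alpha^t \mu^{(t)}$$ where $\mu^{(t)} = p^{(0)}P^t$ is the distribution after $t$ jumps of the walk without teleportation, started from $\mu^{(0)} = p^{(0)}$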

• $\alpha \to 1$ (long paths): $p\to \mu^{(\infty)}$
• $\alpha \to 0$ (short paths): $p\to \mu^{(0)}$ (uniform distribution)
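
The recursion and its limit can be compared numerically. The sketch below iterates the recursion and checks the result against the closed form $p = (1-\alpha)\,p^{(0)}(I-\alpha P)^{-1}$, which is the sum of the geometric series above. The function name `pagerank_sketch` and the small transition matrix are assumptions made for the example.

```python
import numpy as np

def pagerank_sketch(P, p0, alpha=0.85, n_iter=200):
    """Iterate p <- (1 - alpha) p0 + alpha p P, starting from p0."""
    p = p0.copy()
    for _ in range(n_iter):
        p = (1 - alpha) * p0 + alpha * p @ P
    return p

# Hypothetical row-stochastic transition matrix on 3 nodes.
P = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1 / 3, 1 / 3, 1 / 3]])
n = P.shape[0]
p0 = np.full(n, 1.0 / n)            # uniform teleportation distribution
alpha = 0.85

p = pagerank_sketch(P, p0, alpha)
p_closed = (1 - alpha) * p0 @ np.linalg.inv(np.eye(n) - alpha * P)
print(p)
print(p_closed)
```

The matrix inverse is only there as a check; on large graphs the iteration is the practical route, and its error decays like $\alpha^t$.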

# Revisiting PageRank

Idea: In a directed graph, two nodes are "close" if they have many successors in common $\to$ co-citation graph

The weight between $i$ and $j$ is the number of co-citations: $$w_{ij} = \sum_k {A_{ik}A_{jk}}$$ The weight of node $i$ is: $$w_i =\sum_j w_{ij} = \sum_{k}A_{ik} d^-_k$$

# Normalized weights

Contribution of each co-citation $k$ normalized by its number of citations: $$w'_{ij} = \sum_k \frac{A_{ik}A_{jk}}{d^-_k}$$ Weight of node $i$ is: $$w'_i =\sum_j w'_{ij} = \sum_{k}A_{ik} = d^+_i$$

A random walk in the co-citation graph corresponds to a forward-backward random walk in the original graph ($\approx$ HITS algorithm)
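
The identities above are easy to verify numerically. The sketch below builds both weight matrices for a small directed graph (made up for the example), checks the node weights, and checks that the random walk on the normalized graph is one forward jump followed by one backward jump in the original graph.

```python
import numpy as np

# Toy directed graph (hypothetical): A[i, k] = 1 if i points to k.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 1, 0]], dtype=float)
d_out = A.sum(axis=1)                  # d^+
d_in = A.sum(axis=0)                   # d^-
d_in_safe = np.maximum(d_in, 1)        # columns with d^- = 0 are all zero anyway

W = A @ A.T                            # w_ij  = sum_k A_ik A_jk
W_norm = (A / d_in_safe) @ A.T         # w'_ij = sum_k A_ik A_jk / d^-_k

print(W.sum(axis=1), A @ d_in)         # node weights w_i, computed two ways
print(W_norm.sum(axis=1), d_out)       # node weights w'_i versus out-degrees

# Forward jump (i -> k) followed by a backward jump (k -> j).
forward = A / d_out[:, None]
backward = (A / d_in_safe).T
P_fb = forward @ backward
```

This example has no sink; with sinks, the rows of `forward` would need the same fix as for the plain random walk.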

# Removing self-loops

Usual weights: $$\forall i\ne j,\ w_{ij} = \sum_k {A_{ik}A_{jk}} \Longrightarrow w_i = \sum_{j\ne i }w_{ij} = \sum_k A_{ik} (d^-_k - 1)$$ Normalized weights: $$\forall i\ne j,\ w'_{ij} = \sum_k \frac{A_{ik}A_{jk}}{d^-_k} \Longrightarrow w'_i = \sum_{j\ne i }w'_{ij} = \sum_k A_{ik} \frac{d^-_k - 1}{d_k^-}$$ $\to$ non-backtracking forward-backward random walk in the original graph

# Personalized PageRank

Damping factor $\alpha\in (0,1)$:

• Start from the seed set
• Walk with probability $\alpha$
• Teleport to the seed set with probability $1-\alpha$

The walking distance between two teleportations is geometric with mean $\frac \alpha{1-\alpha}$ ($\approx 5.7$ for $\alpha = 0.85$)

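# Dynamics

Let $p^{(t)}$ be the distribution of the random walk at time $t$: $$p^{(t)} = (1-\alpha)p^{(0)} + \alpha p^{(t-1)}P$$ Limiting distribution: $$p^{(t)}\to p = (1-\alpha)\sum_{t=0}^{+\infty}\alpha^t \mu^{(t)}$$ where $p^{(0)}$ is supported on the seed set and $\mu^{(t)} = p^{(0)}P^t$ is the distribution after $t$ jumps of the walk without teleportation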

• $\alpha \to 1$ (long paths): $p\to \mu^{(\infty)}$
• $\alpha \to 0$ (short paths): $p\propto \mu^{(0)}+ \alpha \mu^{(1)}+\alpha^2 \mu^{(2)}+\ldots$, dominated by the first terms
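
A minimal sketch of the resulting ranking around a seed, assuming an undirected graph and a starting distribution that is uniform on the seed set (the slides do not specify it). The function name `personalized_pagerank` and the toy graph are hypothetical.

```python
import numpy as np

def personalized_pagerank(A, seeds, alpha=0.85, n_iter=200):
    """Iterate p <- (1 - alpha) p0 + alpha p P with p0 uniform on the seed set."""
    A = np.asarray(A, dtype=float)
    P = A / A.sum(axis=1)[:, None]        # assumes no isolated node
    p0 = np.zeros(A.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)    # assumption: uniform over the seeds
    p = p0.copy()
    for _ in range(n_iter):
        p = (1 - alpha) * p0 + alpha * p @ P
    return p

# Toy graph (hypothetical): triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

p = personalized_pagerank(A, seeds=[0])
order = np.argsort(-p)                    # nodes by decreasing score
print(order[:3])                          # the seed's own triangle comes first
```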

# LexRank

• Start from the seed set
• For $k = 1, 2,\ldots$, rank the $k$-hop neighbors of the seed set by their probability $\mu^{(k)}$ after $k$ jumps of the random walk

If the graph is directed, use the co-citation graph!
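
One possible reading of this procedure, sketched for an undirected graph: nodes are ordered first by hop distance to the seed set, then by decreasing probability of the random walk at the step where they are first reached. The function name `lexrank_sketch`, the uniform start on the seed set, and the tie-breaking by node index are all assumptions of this sketch, not details taken from the slides.

```python
import numpy as np

def lexrank_sketch(A, seeds):
    """Order nodes by hop distance to the seeds, then by mu^(k) at the first hit."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    P = A / np.maximum(A.sum(axis=1), 1)[:, None]
    mu = np.zeros(n)
    mu[list(seeds)] = 1.0 / len(seeds)     # assumption: uniform start on the seeds
    seen = mu > 0
    order = sorted(seeds)
    for _ in range(n):
        mu = mu @ P                                # one more jump of the walk
        new = np.flatnonzero((mu > 0) & ~seen)     # nodes first reached at this step
        if new.size == 0:
            break                                  # nothing left to reach
        order += sorted(new.tolist(), key=lambda j: -mu[j])
        seen[new] = True
    return order

# Toy graph (hypothetical) with edges 0-1, 0-2, 1-4, 2-4, 2-3.
A = np.zeros((5, 5))
for i, j in [(0, 1), (0, 2), (1, 4), (2, 4), (2, 3)]:
    A[i, j] = A[j, i] = 1

print(lexrank_sketch(A, [0]))   # [0, 1, 2, 4, 3]
```

Nodes 3 and 4 are both two hops away from the seed, but node 4 is reached through two paths and is therefore ranked first.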

# Outline

1. Background
2. Ranking
3. Quality metrics
4. Live examples
5. Future work

# Conductance

The conductance of a community $C$ is: $$\phi(C) = \frac{\sum_{i\in C, j\not\in C} A_{ij}}{\sum_{i\in C, j\in V} A_{ij}}= \frac{\sum_{i\in C} d_i^{\rm out}}{\sum_{i\in C} d_i}$$ where $d_i^{\rm out} = \sum_{j\not\in C} A_{ij}$ is the number of neighbors of node $i$ outside $C$ (not to be confused with the out-degree $d_i^+$ of a directed graph). This is the probability that a random walk in steady state, given that it is in $C$, leaves $C$ in one jump: $$\phi(C) = \frac{\sum_{i\in C,j\not \in C} d_i P_{ij}}{\sum_{i\in C} d_i}$$ A "good" community has a low conductance
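
A minimal sketch of this definition with NumPy, also checking the random-walk expression. The function name `conductance` and the toy graph are assumptions made for the example.

```python
import numpy as np

def conductance(A, C):
    """Weight of edges leaving C divided by the total degree of C."""
    A = np.asarray(A, dtype=float)
    in_C = np.zeros(A.shape[0], dtype=bool)
    in_C[list(C)] = True
    cut = A[in_C][:, ~in_C].sum()     # sum of A_ij for i in C, j not in C
    vol = A[in_C].sum()               # sum of d_i for i in C
    return cut / vol

# Toy graph (hypothetical): triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

C = [0, 1, 2]
phi = conductance(A, C)
print(phi)                            # 1/7: one cut edge over a total degree of 7

# Same quantity from the random-walk point of view.
d = A.sum(axis=1)
P = A / d[:, None]
outside = [3, 4, 5]
phi_walk = (d[C, None] * P[np.ix_(C, outside)]).sum() / d[C].sum()
```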

# Strength

The strength of a community $C$ is: $$\sigma(C) = 1-\phi(C) = \frac{\sum_{i\in C, j\in C} A_{ij}}{\sum_{i\in C, j\in V} A_{ij}} = \frac{\sum_{i\in C} d_i^{\rm in}}{\sum_{i\in C} d_i}$$ where $d_i^{\rm in} = \sum_{j\in C} A_{ij}$ is the number of neighbors of node $i$ inside $C$. This is the probability that a random walk in steady state, given that it is in $C$, stays in $C$ in one jump: $$\sigma(C) = \frac{\sum_{i\in C,j \in C} d_i P_{ij}}{\sum_{i\in C} d_i}$$ A "good" community is strong in the sense that $\sigma(C)\ge \mu(C)$, where $\mu(C) = \frac 1{2m}\sum_{i\in C} d_i$ is the stationary probability of $C$

# Modularity

The modularity is related to the average strength of the communities: $$Q = \sum_{C} \mu(C)(\sigma(C)-\mu(C))$$

Chang et al. 2015
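
This decomposition can be checked against the direct definition of $Q$. The sketch below does so on a toy graph, taking $\mu(C) = \frac 1{2m}\sum_{i\in C} d_i$ (the stationary probability of $C$, since $\mu \propto d$); the graph and variable names are assumptions made for the example.

```python
import numpy as np

# Toy graph (hypothetical): triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
labels = np.array([0, 0, 0, 1, 1, 1])

d = A.sum(axis=1)
two_m = d.sum()

# Direct definition of Q.
delta = labels[:, None] == labels[None, :]
Q_direct = (A * delta).sum() / two_m - (np.outer(d, d) * delta).sum() / two_m**2

# Sum over communities of mu(C) * (sigma(C) - mu(C)).
Q_sum = 0.0
for c in np.unique(labels):
    in_C = labels == c
    mu_C = d[in_C].sum() / two_m                       # stationary probability of C
    sigma_C = A[in_C][:, in_C].sum() / d[in_C].sum()   # strength of C
    Q_sum += mu_C * (sigma_C - mu_C)

print(Q_direct, Q_sum)   # both about 0.357 (= 5/14)
```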

# Normalized conductance

The normalized conductance of a community $C$ is: $$\phi'(C) = \frac{\sum_{i\in C, j\not\in C} A_{ij}/d_i}{\sum_{i\in C, j\in V} A_{ij}/d_i} = \frac 1 {|C|}\sum_{i\in C} \frac{d^{\rm out}_{i}}{d_i}$$ This is the probability that a random walk starting from a node chosen uniformly at random in $C$ leaves $C$ in one jump: $$\phi'(C) = \frac 1{|C|} \sum_{i\in C,j\not \in C} P_{ij}$$

# Normalized strength

The normalized strength of a community $C$ is: $$\sigma'(C) = 1-\phi'(C) = \frac{\sum_{i\in C, j\in C} A_{ij}/d_i}{\sum_{i\in C, j\in V} A_{ij}/d_i} = \frac 1 {|C|}\sum_{i\in C} \frac{d^{\rm in}_{i}}{d_i}$$ This is the probability that a random walk starting from a node chosen uniformly at random in $C$ stays in $C$ in one jump: $$\sigma'(C) = \frac 1{|C|} \sum_{i\in C,j\in C} P_{ij}$$ A "good" community is strong in the sense that $\sigma'(C)\ge |C|/n$
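
A minimal sketch of the normalized strength, compared with the baseline $|C|/n$. The function name `normalized_strength` and the toy graph are assumptions made for the example.

```python
import numpy as np

def normalized_strength(A, C):
    """Average over i in C of the fraction of i's neighbors that lie in C."""
    A = np.asarray(A, dtype=float)
    C = list(C)
    P = A / A.sum(axis=1)[:, None]        # assumes no isolated node
    return P[np.ix_(C, C)].sum() / len(C)

# Toy graph (hypothetical): triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1

C = [0, 1, 2]
s = normalized_strength(A, C)
print(s, len(C) / A.shape[0])   # 8/9 versus the baseline 1/2
```

Here nodes 0 and 1 have all their neighbors in $C$ and node 2 has two out of three, hence $(1 + 1 + 2/3)/3 = 8/9$.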

# Normalized modularity

The normalized modularity is related to the average normalized strength of the communities: $$Q' = \sum_{C} \frac {|C|}n(\sigma'(C)-\frac {|C|}n)$$ We get: $$Q' = \frac 1 {n}\sum_{i,j}\frac{A_{ij}}{d_i}\delta_{ij} - \frac 1{n^2} \sum_{i,j}\delta_{ij}=\frac 1 {n}\sum_{i}\frac{d^{\rm in}_{i}}{d_i} - \frac 1{n^2} \sum_{i,j}\delta_{ij}$$
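
The two expressions of $Q'$ can be compared numerically. The sketch below does so on a toy graph; the graph and variable names are assumptions made for the example.

```python
import numpy as np

# Toy graph (hypothetical): triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
labels = np.array([0, 0, 0, 1, 1, 1])

n = A.shape[0]
P = A / A.sum(axis=1)[:, None]                   # P_ij = A_ij / d_i
delta = labels[:, None] == labels[None, :]       # delta_ij as a boolean matrix

# Expanded form: (1/n) sum_ij P_ij delta_ij - (1/n^2) sum_ij delta_ij.
Q_norm = (P * delta).sum() / n - delta.sum() / n**2

# Sum over communities of (|C|/n) * (sigma'(C) - |C|/n).
Q_norm_sum = 0.0
for c in np.unique(labels):
    in_C = labels == c
    size = in_C.sum()
    sigma_norm = P[in_C][:, in_C].sum() / size   # normalized strength of C
    Q_norm_sum += (size / n) * (sigma_norm - size / n)

print(Q_norm, Q_norm_sum)   # both about 0.389 (= 7/18)
```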

# Outline

1. Background
2. Ranking
3. Quality metrics
4. Live examples
5. Future work

# Approach: directional ranking

Input: seed node $s$

For each neighbor $u$ of $s$:

• Rank nodes for the seed set $S=\{s,u\}$
• Compute the normalized strength of the resulting successive communities

# Example: Wikipedia

In [66]:
s = inv_page['Donald Trump']
top_pages(s)

Donald Trump
0 Mitt Romney
1 Tim Pawlenty
2 Ann Coulter
3 John McCain presidential campaign, 2008
4 United States House of Representatives elections, 2006
5 Manhattan
6 John McCain
7 CNN
8 United States cable news
9 Michele Bachmann
10 Michael Bloomberg
11 Newt Gingrich
12 Hillary Rodham Clinton
13 Pat Boone
14 The Rush Limbaugh Show
15 Larry King Live
16 Rudy Giuliani
17 Barack Obama citizenship conspiracy theories
18 112th United States Congress
19 Gary Johnson


# Example: Wikipedia

In [67]:
direction = [12]

In [68]:
pagerank_score = simple_detection(s,direction,algo="pagerank")
lexrank_score = simple_detection(s,direction,algo="lexrank")
lexrank_star_score = simple_detection(s,direction,algo="lexrank_star")
plot(pagerank_score,label="PageRank")
plot(lexrank_score,label="LexRank")
plot(lexrank_star_score,label="LexRank*")
xlabel("Community")
ylabel("Normalized strength")
legend(bbox_to_anchor=(1.4, 1))
show()

Donald Trump, Hillary Rodham Clinton, United States, Barack Obama, Republican Party (United States), Democratic Party (United States), New York City, President of the United States, George W. Bush, John McCain,
Donald Trump, Hillary Rodham Clinton, Ted Kennedy, Democratic Party (United States), First inauguration of Barack Obama, John McCain, United States, Joe Biden, John Kerry, History of the United States Democratic Party,
Donald Trump, Hillary Rodham Clinton, John McCain, Mitt Romney, Ted Kennedy, Bill Clinton, Joe Biden, Presidency of Bill Clinton, Ann Coulter, 111th United States Congress,


# Example: Wikipedia

In [63]:
direction = [18]

In [64]:
pagerank_score,u_pagerank_score = detection(s,direction,algo="pagerank")
lexrank_score,u_lexrank_score = detection(s,direction,algo="lexrank")
lexrank_star_score,u_lexrank_star_score = detection(s,direction,algo="lexrank_star")
figure(figsize=(12, 4))
subplot(121)
plot(u_pagerank_score,label="PageRank")
plot(u_lexrank_score,label="LexRank")
plot(u_lexrank_star_score,label="LexRank*")
xlabel("Community")
ylabel("Strength")
subplot(122)
plot(pagerank_score,label="PageRank")
plot(lexrank_score,label="LexRank")
plot(lexrank_star_score,label="LexRank*")
xlabel("Community")
ylabel("Normalized strength")
legend(bbox_to_anchor=(1.4, 1))
show()

Paris, French Revolution, France, Napoleon, Louis XVI of France, Reign of Terror, Departments of France, National Convention, Ancien Régime, National Constituent Assembly,
Paris, French Revolution, France, History of France, Napoleon, Liberalism, History of Europe, July Monarchy, Louis XVI of France, Maximilien de Robespierre,
Paris, French Revolution, France, History of France, Maximilien de Robespierre, Napoleon, Louis XVI of France, Gilbert du Motier, Marquis de Lafayette, July Monarchy, Georges Danton,


# Outline

1. Background
2. Ranking
3. Quality metrics
4. Live examples
5. Future work

# Future work

Improved algorithms through

• Adaptive ranking
• Stopping criterion
• Post-processing (selection / merge of communities)

Test on both real and synthetic data

Application to data analysis $\to$ similarity graph