Graphs are everywhere
How to extract information from these graphs?
Real-life graphs are organized in communities ($\approx$ dense groups of nodes)
We seek to infer these communities
Applications:
Most existing algorithms give a partition of the whole graph
We are interested in algorithms revealing the local structure around some target nodes (the seed set)
These local communities are typically:
Most graphs are big, scale-free and small-world
$$ \begin{array}{l|c|c} & \text{DBLP} & \text{Wikipedia} \\ \hline \text{# nodes} & 1.2\text{M} & 4.2\text{M} \\ \text{# edges} & 5.3\text{M} & 101\text{M} \\ \text{average degree} & 8 & 24\\ \text{standard deviation} & 21 & 48\\ \text{3-hop population} & 91\% & 20\%\ \text{(out)},\ 98\%\ \text{(in)}\\ \end{array} $$
Karinthy 1929, Milgram 1967
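Given a partition of the (undirected) graph, the modularity is defined by $$ Q = \frac 1 {2m}\sum_{i,j}A_{ij}\delta_{ij} - \frac 1{(2m)^2} \sum_{i,j}d_{i}d_j\delta_{ij} $$ where $A$ is the adjacency matrix, $d_i$ is the degree of node $i$, $m$ is the number of edges, and $\delta_{ij} = 1$ if nodes $i$ and $j$ belong to the same community ($0$ otherwise)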
modularity(A,C)
where
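As a rough illustration of the definition of $Q$, the sketch below sums over all node pairs. The function name mirrors the `modularity(A,C)` call shown above, but the body and the input conventions are assumptions made for this example, not the original implementation: `A` is a symmetric list-of-lists adjacency matrix, `C[i]` is the community label of node `i`, and the `adjacency` helper and the toy graph are hypothetical.

```python
def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix (list of lists) of an undirected graph."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1
    return A


def modularity(A, C):
    """Modularity Q of a partition, where C[i] is the community label of node i."""
    n = len(A)
    d = [sum(row) for row in A]   # degrees d_i
    two_m = sum(d)                # 2m = sum of degrees
    q = 0.0
    for i in range(n):
        for j in range(n):
            if C[i] == C[j]:      # delta_ij = 1: same community
                q += A[i][j] / two_m - d[i] * d[j] / two_m ** 2
    return q


# Toy graph (assumed for illustration): two triangles joined by one edge.
A = adjacency(6, [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])
q_two = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_one = modularity(A, [0] * 6)              # everything in one community
print(q_two, q_one)
```

On this toy graph the two-triangle partition gives $Q = 5/14 \approx 0.357$, while the trivial one-community partition gives $Q = 0$.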
SNAP = Stanford Network Analysis Project
Graphs with ground-truth communities:
Yang & Leskovec 2012
Classical approach:
Clauset 2005, Andersen & Lang 2006
Other approach: Sozio & Gionis 2010
Let $\mu^{(t)}$ be the distribution of the random walk at time $t$: $$ \mu^{(t)} = \mu^{(t-1)} P $$ where $P$ is the transition matrix $$ P_{ij} = \frac{A_{ij}}{d_i} $$
Limiting distribution: $$ \mu^{(t)} \to \mu \propto d $$
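A minimal sketch of this convergence, assuming a small connected, non-bipartite undirected graph (the condition under which the limit exists from any starting point). The toy graph and the helper names `transition_matrix` and `step` are assumptions for illustration; plain Python lists keep the example dependency-free.

```python
def transition_matrix(A):
    """P_ij = A_ij / d_i for an undirected graph without isolated nodes."""
    return [[a / sum(row) for a in row] for row in A]


def step(mu, P):
    """One step of the walk: returns the row vector mu P."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]


# Toy graph (assumed): a triangle 0-1-2 with a pendant node 3 attached to node 2.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
P = transition_matrix(A)

mu = [1.0, 0.0, 0.0, 0.0]          # walk started at node 0
for _ in range(100):
    mu = step(mu, P)

d = [sum(row) for row in A]
target = [x / sum(d) for x in d]   # d / (2m), the distribution proportional to d
print(mu)
print(target)
```

After 100 steps the two printed vectors agree to many decimal places, whatever the starting node.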
What about sinks / absorbing sets?
Modified transition matrix: $$ P_{ij} = \left\{ \begin{array}{ll} \frac{A_{ij}}{d^+_i} & \text{if } d^+_i>0\\ \frac 1 n & \text{otherwise} \end{array} \right. $$
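The modified matrix can be sketched as follows. The function name `transition_matrix` and the three-node graph are assumptions for illustration; the only point is that a sink row becomes uniform over all $n$ nodes, so every row still sums to 1.

```python
def transition_matrix(A):
    """Row-stochastic matrix of a directed graph; sinks jump uniformly at random."""
    n = len(A)
    P = []
    for row in A:
        d_out = sum(row)                       # out-degree d_i^+
        if d_out > 0:
            P.append([a / d_out for a in row])
        else:
            P.append([1.0 / n] * n)            # sink: uniform over all n nodes
    return P


# Toy directed graph (assumed): 0 -> 1, 0 -> 2, 1 -> 2; node 2 is a sink.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
P = transition_matrix(A)
for row in P:
    print(row)
```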
Damping factor $\alpha \in (0,1)$:
Default value $\alpha = 0.85$
Walking distance geometric with mean $\frac \alpha{1-\alpha}\approx 5.7$
Let $p^{(t)}$ be the distribution of the random walk at time $t$: $$ p^{(t)} = (1-\alpha)p^{(0)} + \alpha p^{(t-1)}P $$ Limiting distribution: $$ p^{(t)}\to p = (1-\alpha)\sum_{t=0}^{+\infty}\alpha^t \mu^{(t)} $$
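The recursion and the series can be checked against each other numerically. This is a sketch under stated assumptions: $\mu^{(0)} = p^{(0)}$ is concentrated on a single seed node, the toy graph is undirected, and the names `pagerank`, `pagerank_series`, and `step` are hypothetical helpers, not the original code.

```python
def step(p, P):
    """Row vector times matrix: returns p P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]


def pagerank(P, p0, alpha=0.85, n_iter=300):
    """Iterate p <- (1 - alpha) p0 + alpha p P, starting from p = p0."""
    p = list(p0)
    for _ in range(n_iter):
        walk = step(p, P)
        p = [(1 - alpha) * a + alpha * b for a, b in zip(p0, walk)]
    return p


def pagerank_series(P, p0, alpha=0.85, n_terms=300):
    """Truncated series (1 - alpha) * sum_t alpha^t mu^(t), with mu^(0) = p0."""
    mu, weight = list(p0), 1 - alpha
    total = [0.0] * len(p0)
    for _ in range(n_terms):
        total = [s + weight * m for s, m in zip(total, mu)]
        mu, weight = step(mu, P), weight * alpha
    return total


# Toy graph (assumed): a triangle 0-1-2 with a pendant node 3 attached to node 2.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
P = [[a / sum(row) for a in row] for row in A]
p0 = [1.0, 0.0, 0.0, 0.0]               # seed set = {node 0}

p = pagerank(P, p0)
p_series = pagerank_series(P, p0)
mean_walk_length = 0.85 / (1 - 0.85)    # alpha / (1 - alpha)
print(p)
print(p_series)
print(mean_walk_length)
```

Both computations converge geometrically at rate $\alpha$, so a few hundred iterations are far more than needed at $\alpha = 0.85$.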
Idea: In a directed graph, two nodes are "close" if they have many successors in common $\to$ co-citation graph
The weight between $i$ and $j$ is the number of co-citations: $$ w_{ij} = \sum_k {A_{ik}A_{jk}} $$ The weight of node $i$ is: $$ w_i =\sum_j w_{ij} = \sum_{k}A_{ik} d^-_k $$
Contribution of each co-citation $k$ normalized by its number of citations: $$ w'_{ij} = \sum_k \frac{A_{ik}A_{jk}}{d^-_k} $$ Weight of node $i$ is: $$ w'_i =\sum_j w'_{ij} = \sum_{k}A_{ik} = d^+_i $$
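The two node-weight identities above can be verified on a small example. The directed toy graph and the variable names are assumptions for illustration; diagonal terms $w_{ii}$ are included, as in the formulas above.

```python
# Toy directed graph (assumed): 0 -> 2, 0 -> 3, 1 -> 2, 1 -> 3, 2 -> 3.
A = [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
n = len(A)
d_out = [sum(row) for row in A]                              # d_i^+
d_in = [sum(A[i][k] for i in range(n)) for k in range(n)]    # d_k^-

# Co-citation weights w_ij = sum_k A_ik A_jk
w = [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(n)]
     for i in range(n)]

# Normalized weights w'_ij = sum_k A_ik A_jk / d_k^-
# (nodes k that nobody cites contribute nothing and are skipped)
w_norm = [[sum(A[i][k] * A[j][k] / d_in[k] for k in range(n) if d_in[k] > 0)
           for j in range(n)] for i in range(n)]

node_w = [sum(row) for row in w]             # w_i
node_w_norm = [sum(row) for row in w_norm]   # w'_i
expected_w = [sum(A[i][k] * d_in[k] for k in range(n)) for i in range(n)]
print(node_w, expected_w)
print(node_w_norm, d_out)
```

With the normalization, each node's weight in the co-citation graph is simply its out-degree in the original graph.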
A random walk in the co-citation graph corresponds to a forward-backward random walk in the original graph ($\approx$ HITS algorithm)
Usual weights: $$ \forall i\ne j,\ w_{ij} = \sum_k {A_{ik}A_{jk}} \Longrightarrow w_i = \sum_{j\ne i }w_{ij} = \sum_k A_{ik} (d^-_k - 1) $$ Normalized weights: $$ \forall i\ne j,\ w'_{ij} = \sum_k \frac{A_{ik}A_{jk}}{d^-_k} \Longrightarrow w'_i = \sum_{j\ne i }w'_{ij} = \sum_k A_{ik} \frac{d^-_k - 1}{d_k^-} $$ $\to$ non-backtracking forward-backward random walk in the original graph
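The closed forms above follow in one step, assuming an unweighted graph so that $A_{ik}\in\{0,1\}$ and $A_{ik}^2 = A_{ik}$: $$ \sum_{j\ne i} w_{ij} = \sum_k A_{ik}\Big(\sum_{j} A_{jk} - A_{ik}\Big) = \sum_k A_{ik}\,(d^-_k - A_{ik}) = \sum_k A_{ik}\,(d^-_k - 1) $$ Dividing the $k$-th term by $d^-_k$ gives the normalized version. The factor $d^-_k - 1$ counts the predecessors of $k$ other than $i$ itself, which is why the walk never steps back to the node it came from.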
Damping factor $\alpha\in (0,1)$:
Walking distance geometric with mean $\frac \alpha{1-\alpha}\approx 5.7$
Let $p^{(t)}$ be the distribution of the random walk at time $t$: $$ p^{(t)} = (1-\alpha)p^{(0)} + \alpha p^{(t-1)}P $$ Limiting distribution: $$ p^{(t)}\to p = (1-\alpha)\sum_{t=0}^{+\infty}\alpha^t \mu^{(t)} $$
If the graph is directed, use the co-citation graph!
The conductance of a community $C$ is: $$ \phi(C) = \frac{\sum_{i\in C, j\not\in C} A_{ij}}{\sum_{i\in C, j\in V} A_{ij}}= \frac{\sum_{i\in C} d_i^{\rm out}}{\sum_{i\in C} d_i} $$ This is the probability that a random walk starting from $C$ in steady state leaves $C$ in one jump: $$ \phi(C) = \frac{\sum_{i\in C,j\not \in C} d_i P_{ij}}{\sum_{i\in C} d_i} $$ A "good" community has a low conductance
The strength of a community $C$ is: $$ \sigma(C) = 1-\phi(C) = \frac{\sum_{i\in C, j\in C} A_{ij}}{\sum_{i\in C, j\in V} A_{ij}} = \frac{\sum_{i\in C} d_i^{\rm in}}{\sum_{i\in C} d_i} $$ This is the probability that a random walk starting from $C$ in steady state stays in $C$ in one jump: $$ \sigma(C) = \frac{\sum_{i\in C,j \in C} d_i P_{ij}}{\sum_{i\in C} d_i} $$ A "good" community is strong in the sense that $\sigma(C)\ge \mu(C)$, where $\mu(C) = \frac 1{2m}\sum_{i\in C} d_i$ is the steady-state probability that the random walk is in $C$
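A short sketch of conductance and strength on a toy graph, taking $\mu(C)$ to be the stationary probability $\sum_{i\in C} d_i/(2m)$. The function names and the graph are assumptions for illustration.

```python
def conductance(A, C):
    """phi(C): weight of edges leaving C divided by the total degree of C."""
    members = set(C)
    cut = sum(A[i][j] for i in members for j in range(len(A)) if j not in members)
    volume = sum(sum(A[i]) for i in members)
    return cut / volume


def strength(A, C):
    """sigma(C) = 1 - phi(C)."""
    return 1 - conductance(A, C)


def stationary_mass(A, C):
    """mu(C) = (sum of degrees in C) / 2m."""
    return sum(sum(A[i]) for i in C) / sum(sum(row) for row in A)


# Toy graph (assumed): triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
C = [0, 1, 2]
phi = conductance(A, C)
sigma = strength(A, C)
mu_C = stationary_mass(A, C)
print(phi, sigma, mu_C)
```

Here one of the seven edge endpoints of the triangle leads outside, so $\phi(C) = 1/7$ and $\sigma(C) = 6/7$, well above $\mu(C) = 1/2$.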
The modularity is related to the average strength of the communities: $$ Q = \sum_{C} \mu(C)(\sigma(C)-\mu(C)) $$
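This identity can be recovered from the definition of $Q$ by grouping the pairs $(i,j)$ by their common community, with $\mu(C) = \frac 1{2m}\sum_{i\in C} d_i$ (the stationary probability of $C$, assumed here as the meaning of $\mu(C)$): $$ \frac 1{2m}\sum_{i,j}A_{ij}\delta_{ij} = \sum_C \frac{\sum_{i\in C} d_i}{2m}\cdot\frac{\sum_{i\in C, j\in C} A_{ij}}{\sum_{i\in C} d_i} = \sum_C \mu(C)\sigma(C) $$ $$ \frac 1{(2m)^2}\sum_{i,j}d_id_j\delta_{ij} = \sum_C \Big(\frac{\sum_{i\in C} d_i}{2m}\Big)^2 = \sum_C \mu(C)^2 $$ Subtracting the second line from the first gives $Q = \sum_C \mu(C)(\sigma(C)-\mu(C))$.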
Chang et al. 2015
The normalized conductance of a community $C$ is: $$ \phi'(C) = \frac{\sum_{i\in C, j\not\in C} A_{ij}/d_i}{\sum_{i\in C, j\in V} A_{ij}/d_i} = \frac 1 {|C|}\sum_{i\in C} \frac{d^{\rm out}_{i}}{d_i} $$ This is the probability that a random walk starting from a node of $C$ chosen uniformly at random leaves $C$ in one jump: $$ \phi'(C) = \frac 1{|C|} \sum_{i\in C,j\not \in C} P_{ij} $$
The normalized strength of a community $C$ is: $$ \sigma'(C) = 1-\phi'(C) = \frac{\sum_{i\in C, j\in C} A_{ij}/d_i}{\sum_{i\in C, j\in V} A_{ij}/d_i} = \frac 1 {|C|}\sum_{i\in C} \frac{d^{\rm in}_{i}}{d_i} $$ This is the probability that a random walk starting from a node of $C$ chosen uniformly at random stays in $C$ in one jump: $$ \sigma'(C) = \frac 1{|C|} \sum_{i\in C,j\in C} P_{ij} $$ A "good" community is strong in the sense that $\sigma'(C)\ge |C|/n$
The normalized modularity is related to the average normalized strength of the communities: $$ Q' = \sum_{C} \frac {|C|}n(\sigma'(C)-\frac {|C|}n) $$ We get: $$ Q' = \frac 1 {n}\sum_{i,j}\frac{A_{ij}}{d_i}\delta_{ij} - \frac 1{n^2} \sum_{i,j}\delta_{ij}=\frac 1 {n}\sum_{i}\frac{d^{\rm in}_{i}}{d_i} - \frac 1{n^2} \sum_{i,j}\delta_{ij} $$
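The two expressions for $Q'$ can be compared numerically. This is a sketch only: the function names, the representation of a partition as a list of node lists, and the toy graph are assumptions, and every node is assumed to have $d_i > 0$. As in the formulas above, $d_i^{\rm in}$ counts the neighbors of $i$ inside its own community.

```python
def normalized_strength(A, C):
    """sigma'(C): average over nodes i in C of d_i^in / d_i (requires d_i > 0)."""
    members = set(C)
    ratios = [sum(A[i][j] for j in members) / sum(A[i]) for i in members]
    return sum(ratios) / len(ratios)


def normalized_modularity(A, communities):
    """Q' = sum_C |C|/n * (sigma'(C) - |C|/n) for a partition given as node lists."""
    n = len(A)
    return sum(len(C) / n * (normalized_strength(A, C) - len(C) / n)
               for C in communities)


# Toy graph (assumed): triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
communities = [[0, 1, 2], [3, 4, 5]]
n = len(A)

s = normalized_strength(A, communities[0])
q_norm = normalized_modularity(A, communities)

# Pairwise form: (1/n) sum_ij (A_ij / d_i) delta_ij - (1/n^2) sum_ij delta_ij
label = {i: c for c, C in enumerate(communities) for i in C}
same = [(i, j) for i in range(n) for j in range(n) if label[i] == label[j]]
q_direct = sum(A[i][j] / sum(A[i]) for i, j in same) / n - len(same) / n ** 2
print(s, q_norm, q_direct)
```

On this graph $\sigma'(C) = 8/9$ for each triangle and both forms give $Q' = 7/18 \approx 0.389$.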
Input: seed node $s$
For each neighbor $u$ of $s$:
s = inv_page['Donald Trump']
top_pages(s)
direction = [12]
pagerank_score = simple_detection(s,direction,algo="pagerank")
lexrank_score = simple_detection(s,direction,algo="lexrank")
lexrank_star_score = simple_detection(s,direction,algo="lexrank_star")
plot(pagerank_score,label="PageRank")
plot(lexrank_score,label="LexRank")
plot(lexrank_star_score,label="LexRank*")
xlabel("Community")
ylabel("Normalized strength")
legend(bbox_to_anchor=(1.4, 1))
show()
direction = [18]
pagerank_score,u_pagerank_score = detection(s,direction,algo="pagerank")
lexrank_score,u_lexrank_score = detection(s,direction,algo="lexrank")
lexrank_star_score,u_lexrank_star_score = detection(s,direction,algo="lexrank_star")
figure(figsize=(12, 4))
subplot(121)
plot(u_pagerank_score,label="PageRank")
plot(u_lexrank_score,label="LexRank")
plot(u_lexrank_star_score,label="LexRank*")
xlabel("Community")
ylabel("Strength")
subplot(122)
plot(pagerank_score,label="PageRank")
plot(lexrank_score,label="LexRank")
plot(lexrank_star_score,label="LexRank*")
xlabel("Community")
ylabel("Normalized strength")
legend(bbox_to_anchor=(1.4, 1))
show()
Improved algorithms through
Test on both real and synthetic data
Application to data analysis $\to$ similarity graph