Graph Representation Learning (Graph Neural Networks)

Graph Representation Learning

(Graph Neural Networks, GNN)

Graph neural networks: A review of methods and applications, Zhou Jie et al., 2020, AI Open

Figure. An overview of the computational modules of GNN (Zhou Jie et al., 2020, AI Open)

Figure. An overview of GNN variants from the graph type perspective (Zhou Jie et al., 2020, AI Open)

Neural fingerprints(Duvenaud et al., 2015, on NIPS):

\[\bm h^{t+1}_v = \sigma\left(\bm W_{|\mathcal N(v)|}^{t+1}\left(\bm h_v^t + \sum_{u\in \mathcal N(v)}\bm h_u^t\right)\right) \]

, where \(\bm W^{t+1}_{|\mathcal N(v)|}\) is the weight matrix for nodes with degree \(|\mathcal N(v)|\) at layer \(t+1\), and \(\sigma\) is a nonlinear activation (a sigmoid in the original paper).
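Below is a minimal NumPy sketch of this per-node update, assuming a sigmoid nonlinearity and a plain dict `W_by_degree` mapping node degree to that layer's weight matrix (these names and the fixed hidden size are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neural_fp_layer(H, adj, W_by_degree):
    """One update: h_v^{t+1} = sigma(W_{|N(v)|} (h_v^t + sum_{u in N(v)} h_u^t)).

    H: (N, F) node states, adj: (N, N) 0/1 adjacency,
    W_by_degree: dict degree -> (F, F) weight matrix (hidden size kept fixed).
    """
    H_next = np.zeros_like(H)
    for v in range(H.shape[0]):
        neighbors = np.nonzero(adj[v])[0]
        pooled = H[v] + H[neighbors].sum(axis=0)          # self state + neighbor sum
        H_next[v] = sigmoid(pooled @ W_by_degree[len(neighbors)])
    return H_next
```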

Spectral Network (Bruna et al., 2014)

It uses a learnable diagonal matrix as the filter, \(\bm g_w = \rm{diag}(\bm w)\) .

ChebNet (Defferrard et al., 2016)

\[\bm g_w \star \bm x \approx \sum_{k=0}^K w_k \bm T_k (\tilde {\bm L}) \bm x \]

, where \(\star\) denotes the convolution operation, \(\bm T_k(\cdot)\) denotes the Chebyshev polynomial of order \(k\), \(\tilde {\bm L}=\frac{2}{\lambda_{max} }\bm L- \bm I\), \(\lambda_{max}\) denotes the largest eigenvalue of \(\bm L\), and the eigenvalues of \(\tilde{ \bm L }\) lie in \([-1, 1]\). The Chebyshev polynomials are defined by the recurrence \(\bm T_k(\bm x)=2\bm x\bm T_{k-1}(\bm x)-\bm T_{k-2}(\bm x), \bm T_1(\bm x)= \bm x, \bm T_0(\bm x)=\bm 1\).
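A small NumPy sketch of this K-th order approximation via the Chebyshev recurrence; `L_norm` (the normalized Laplacian), `lambda_max` and the weight vector `w` are assumed to be supplied by the caller:

```python
import numpy as np

def cheb_filter(x, L_norm, lambda_max, w):
    """Approximate g_w * x by sum_{k=0}^K w[k] T_k(L_tilde) x."""
    N = L_norm.shape[0]
    L_tilde = (2.0 / lambda_max) * L_norm - np.eye(N)     # rescaled Laplacian
    T_prev, T_curr = x, L_tilde @ x                       # T_0(L~)x = x, T_1(L~)x = L~ x
    out = w[0] * T_prev + (w[1] * T_curr if len(w) > 1 else 0.0)
    for k in range(2, len(w)):
        T_next = 2.0 * L_tilde @ T_curr - T_prev          # Chebyshev recurrence
        out = out + w[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return out
```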

GCN (Kipf and Welling, 2017)

With \(K=1\) and \(\lambda_{max}\approx 2\), the equation in ChebNet (Defferrard et al., 2016) simplifies to

\[\bm g_w \star \bm x \approx w_0 \bm x+ w_1(\bm L-\bm I) \bm x = w_0 \bm x - w_1\bm D^{-\frac12}\bm A \bm D^{-\frac12} \bm x \]

. Further constraining \(w_0= -w_1\) and writing \(w\) for the single remaining parameter gives

\[\bm g_w \star \bm x\approx w(\bm I + \bm D^{-\frac12}\bm A \bm D^{-\frac12}) \bm x \]

. A renormalization trick is further introduced to avoid the exploding/vanishing gradient problem: \((\bm I + \bm D^{-\frac12}\bm A \bm D^{-\frac12}) \rightarrow \tilde{\bm D}^{-\frac12}\tilde{\bm A} \tilde{\bm D}^{-\frac12}\), where \(\tilde{\bm A}\) denotes the adjacency matrix with self-loops (\(\tilde {\bm A}=\bm A+\bm I\)) and \(\tilde{\bm D}\) denotes the corresponding degree matrix (\(\tilde{ \bm D}_{ii}=\sum_j \tilde{\bm A}_{ij}\)).
Finally, the GCN layer is defined as

\[\bm H=\tilde{\bm D}^{-\frac12}\tilde{\bm A} \tilde{\bm D}^{-\frac12}\bm X\bm W \]

, where \(\bm X\) is the input feature matrix, \(\bm W\) is the trainable weight matrix, and \(\bm H\) is the convolved output matrix.
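For illustration, a minimal dense NumPy sketch of this propagation rule (a nonlinearity such as ReLU would normally follow; it is omitted here to match the formula):

```python
import numpy as np

def gcn_layer(A, X, W):
    A_tilde = A + np.eye(A.shape[0])              # add self-loops
    d_tilde = A_tilde.sum(axis=1)                 # degrees with self-loops (always >= 1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # renormalized adjacency
    return A_hat @ X @ W

# Tiny usage example on a 3-node path graph:
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.random.randn(3, 4)
W = np.random.randn(4, 2)
H = gcn_layer(A, X, W)                            # shape (3, 2)
```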

AGCN (adaptive graph convolution network). Li et al., 2018a
DGCN (dual graph convolution network). Zhuang and Ma, 2018

It uses two convolutional networks to capture local and global consistency.

The first is \(\bm H=\tilde{\bm D}^{-\frac12}\tilde{\bm A} \tilde{\bm D}^{-\frac12}\bm X\bm W\).
The second is \(\bm H'=\rho\left(\tilde{\bm D}_P^{-\frac12}\tilde{\bm A}_P \tilde{\bm D}_P^{-\frac12}\bm H\bm W\right)\). It replaces the adjacency matrix in the first GCN with the positive pointwise mutual information (PPMI) matrix.

GWNN (graph wavelet neural network). Xu et al., 2019a

It uses the graph wavelet transform to replace the graph Fourier transform. Advantages: (1) graph wavelets can be obtained efficiently without matrix decomposition; (2) graph wavelets are sparse and localized, so the results are better and more interpretable.


In almost all of the spectral approaches mentioned above, the learned filters depend on graph structure. That is to say, the filters cannot be applied to a graph with a different structure and those models can only be applied under the “transductive” setting of graph tasks.


DCNN (diffusion convolutional neural network). Atwood and Towsley, 2016

It uses a transition matrix to define the neighborhood of nodes. For node classification, the diffusion representation of each node can be expressed as:

\[\bm H=f(\bm W\odot \bm P^* \bm X) \in \R^{N\times K\times F} \]

, where \(\bm X\in\R^{N\times F}\) is the input matrix, \(\bm P^*\) is an \(N\times K\times N\) tensor which contains the power series \(\{\bm P, \bm P^2, \dots, \bm P^K\}\) of \(\bm P\), and \(\bm P\) is the degree-normalized transition matrix obtained from the graph's adjacency matrix \(\bm A\).
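An illustrative NumPy sketch of these diffusion representations; the choice of `tanh` for \(f\) and the exact shape conventions are assumptions of the sketch, with `W` holding one weight per (hop, feature) pair:

```python
import numpy as np

def dcnn_representations(A, X, W, K, f=np.tanh):
    """A: (N, N) adjacency (no isolated nodes), X: (N, F) features, W: (K, F) weights."""
    P = A / A.sum(axis=1, keepdims=True)                                         # transition matrix
    P_star = np.stack([np.linalg.matrix_power(P, k) for k in range(1, K + 1)])   # (K, N, N)
    PX = np.einsum('knm,mf->nkf', P_star, X)                                     # (N, K, F) diffused features
    return f(W * PX)                                                             # elementwise weights, broadcast over N
```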

LGCN (learnable graph convolutional network) (Gao et al., 2018a)

It performs max pooling on neighborhood matrices of nodes to get top-k feature elements and then applies 1-D CNN to compute hidden representations.

GraphSAGE. Hamilton et al., 2017a.

It generates embeddings by sampling and aggregating features from a node's local neighborhood.

\[\bm h^{t+1}_{\mathcal N(v)}=\rm{AGG}_{t+1}(\{\bm h_u^t: u\in \mathcal N(v)\}) \\ \bm h_v^{t+1}=\sigma\left(\bm W^{t+1}\cdot \left[\bm h_v^t \| \bm h_{\mathcal N(v)}^{t+1}\right]\right) \]

, where \(\|\) denotes vector concatenation.
GraphSAGE uniformly samples a fixed-size set of neighbors to aggregate information. GraphSAGE suggests three aggregators: mean, LSTM, and pooling. GraphSAGE with the mean aggregator can be regarded as an inductive version of GCN.
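A hedged NumPy sketch of one GraphSAGE layer with uniform neighbor sampling and the mean aggregator; the sample size, the ReLU nonlinearity and the final L2 normalization are illustrative choices here:

```python
import numpy as np

def graphsage_mean_layer(H, neighbors, W, sample_size=5, rng=np.random.default_rng(0)):
    """H: (N, F) states, neighbors: dict node -> list of neighbor ids, W: (F', 2F)."""
    H_next = []
    for v in range(H.shape[0]):
        nbrs = np.asarray(neighbors[v])
        if len(nbrs) > sample_size:                       # uniform fixed-size sampling
            nbrs = rng.choice(nbrs, size=sample_size, replace=False)
        h_nbr = H[nbrs].mean(axis=0)                      # AGG = mean
        h_cat = np.concatenate([H[v], h_nbr])             # [h_v || h_N(v)]
        H_next.append(np.maximum(0.0, W @ h_cat))         # sigma = ReLU (assumed)
    H_next = np.stack(H_next)
    norms = np.linalg.norm(H_next, axis=1, keepdims=True) + 1e-12
    return H_next / norms                                 # optional L2 normalization
```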

Loss function:

\[J(u) = -\log\left(\sigma( \bm z_u ^T \bm z_v)\right)-Q\cdot\mathbb E_{v_n\sim p_n(v)}\log\left(\sigma(-\bm z_u^T \bm z_{v_n})\right) \]

, where \(\bm z_u = \bm h_u^L\) is the output of the final hidden layer (with \(L\) layers in total), \(v\) is a node that co-occurs near \(u\) on a fixed-length random walk, \(v_n\) is a node drawn from the negative sampling distribution \(p_n(v)\), and \(Q\) is the number of negative samples.

GAT (graph attention network). (Velickovic et al., 2018)

Neighbourhood Attention

aggregation:

\[\bm h_u=f\left(\sum_{v\in \mathcal N(u)} \alpha_{u,v} \bm W \bm h_v \right) \]

where \(\alpha_{u,v}\) denotes the attention on neighbour \(v\in \mathcal N(u)\) .

\[\alpha_{u,v}=\frac{\exp \left(\rm{LeakyReLU}\left( \bm a^T [\bm W\bm h_u ⧺ \bm W \bm h_v] \right)\right)}{\sum_{z\in \mathcal N(u)}\exp\left(\rm{LeakyReLU}\left(\bm a^T [\bm W\bm h_u ⧺ \bm W \bm h_z]\right)\right)} \]

where \(⧺\) denotes vector concatenation, \(\bm a, \bm W\) are trainable parameters.

multi-head attention:

  1. by concatenation:

\[\bm h^{t+1}_u = ⧺^K_{k=1}\sigma\left(\sum_{v\in\mathcal N(u)} \alpha_{uv}^{(k)}\bm W_k \bm h_v^t\right) \]

  1. by average:

\[\bm h_u^{t+1} = \frac1K \sum_{k=1}^K \sigma\left(\sum_{v\in\mathcal N(u)} \alpha_{uv}^{(k)}\bm W_k \bm h_v^t \right) \]

Variants of GAT-style attention:

\[\alpha_{u,v}=\frac{\exp \left(\bm h_u^T \bm W \bm h_v \right)}{\sum_{z\in \mathcal N(u)}\exp\left(\bm h_u^T \bm W \bm h_z\right)} \]

\[\alpha_{u,v}=\frac{\exp \left( \mathrm{NeurNet}(\bm h_u, \bm h_v) \right)}{\sum_{z\in \mathcal N(u)}\exp\left(\mathrm{NeurNet}(\bm h_u, \bm h_z)\right)} \]

where \(\mathrm{NeurNet}\) denotes a neural network which is restricted to a scalar output.
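A single-head NumPy sketch of the first (LeakyReLU/softmax) attention form above, computed for one node \(u\); the tanh output stands in for \(f\) in the aggregation formula and is an assumption:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_node_update(u, neighbors, H, W, a):
    """H: (N, F) states, W: (F', F) shared weights, a: (2F',) attention vector."""
    Wh_u = W @ H[u]
    scores = np.array([leaky_relu(a @ np.concatenate([Wh_u, W @ H[v]]))
                       for v in neighbors])               # a^T [Wh_u ++ Wh_v]
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                           # softmax over N(u)
    h_u = sum(alpha[i] * (W @ H[v]) for i, v in enumerate(neighbors))
    return np.tanh(h_u)                                   # f assumed to be tanh
```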

GaAN (gated attention network) Zhang et al. 2018

multi-head attention. It uses a self-attention mechanism to gather information from multiple heads instead of the averaging operation of GAT.

MPNN (message passing neural network) Gilmer et al., 2017

\[\bm m_v^{t+1} = \sum_{u\in\mathcal N(v)} M_t (\bm h_v^t, \bm h_u^t, \bm e_{vu}) \\ \bm h_v^{t+1} = U_t (\bm h_v^t, \bm m_v^{t+1}) \\ \hat{\bm y} = R(\{\bm h_v^T\mid v\in G\}) \]

, where \(M_t\) is the message function, \(U_t\) the update function, \(R\) the readout function, \(t\) the time step, and \(T\) the total number of time steps.
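A generic NumPy sketch of this message-passing scheme; the concrete \(M_t\), \(U_t\) and \(R\) shown are placeholder choices (and, for simplicity, the same \(M\) and \(U\) are reused at every step), not the functions of any specific MPNN instance:

```python
import numpy as np

def mpnn_forward(H, edges, edge_feats, M, U, R, T):
    """H: (N, F) initial states; edges: directed (u, v) pairs (list both
    directions for an undirected graph); edge_feats[(u, v)] = e_uv."""
    for _ in range(T):
        messages = np.zeros_like(H)
        for (u, v) in edges:
            messages[v] += M(H[v], H[u], edge_feats[(u, v)])             # m_v = sum over N(v)
        H = np.stack([U(H[v], messages[v]) for v in range(H.shape[0])])  # node update
    return R(H)                                                          # graph-level readout

# Placeholder functions (illustrative assumptions only):
M = lambda h_v, h_u, e: e * h_u        # scalar edge feature scales the sender state
U = lambda h_v, m: np.tanh(h_v + m)
R = lambda H: H.sum(axis=0)
```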

NLNN (non-local neural network)

It generalizes and extends the classic non-local mean operation in computer vision. The generic non-local operation is defined as

\[\bm h_v^{t+1} = \frac{1}{c(\{\bm h_w^t| \forall w\})}\sum_{\forall u} f(\bm h_v^t, \bm h_u^t) g(\bm h_u^t) \]

, where \(c(\{\bm h_w^t| \forall w\})\) is a normalization factor.

The NLNN can be viewed as a unification of different “self-attention”-style methods (Hoshen, 2017; Vaswani et al., 2017; Velickovic et al., 2018).

Graph Network (Battaglia et al., 2018)

It is a more general framework than the others, learning node-level, edge-level and graph-level representations. It can unify many variants like MPNN, NLNN, Interaction Networks (Battaglia et al., 2016; Watters et al., 2017), Neural Physics Engine (Chang et al., 2017), CommNet (Sukhbaatar et al., 2016), structure2vec (Dai et al., 2016; Khalil et al., 2017), GGNN (Li et al., 2016), Relation Network (Raposo et al., 2017; Santoro et al., 2017), Deep Sets (Zaheer et al., 2017), PointNet (Qi et al., 2017a) and so on.

Tree-LSTM (Tai et al. 2015)

two types: Child-Sum, N-ary.

In contrast to the traditional LSTM, which uses a single forget gate, the Tree-LSTM unit for node v contains one forget gate per child. Furthermore, the N-ary Tree-LSTM is designed for trees where each node has at most K children and the children are ordered.

GGNN (gated graph neural network) (Li et al., 2016)

\[{\bm h}_{N_v}^t = \sum_{u\in N_v} {\bm h}_u^{t-1} + \bm b \\ {\bm z}_v^t = \sigma(\bm W_z \bm h_{N_v}^t + \bm U_z \bm h_v^{t-1}) \\ {\bm r}_v^t = \sigma(\bm W_r \bm h_{N_v}^t + \bm U_r \bm h_v^{t-1}) \\ \tilde {\bm h}_v^t =\tanh (\bm W \bm h_{N_v}^{t}+ \bm U(\bm r_v^t \odot {\bm h}_v^{t-1})) \\ \bm h^{t}_v = (1-\bm z_v^t)\odot \bm h_v^{t-1} + \bm z_v^t \odot \tilde {\bm h}_v^t \]
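A NumPy sketch of this gated update for all nodes at once, written in row-vector form (so each weight matrix acts on the right); the matrices are assumed square so the hidden size stays fixed:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ggnn_step(H, adj, b, Wz, Uz, Wr, Ur, W, U):
    """One propagation step: H holds h_v^{t-1} row-wise, adj is 0/1 adjacency."""
    H_nbr = adj @ H + b                           # h_{N_v}^t: neighbor sum plus bias
    Z = sigmoid(H_nbr @ Wz + H @ Uz)              # update gate z_v^t
    R = sigmoid(H_nbr @ Wr + H @ Ur)              # reset gate r_v^t
    H_cand = np.tanh(H_nbr @ W + (R * H) @ U)     # candidate state h~_v^t
    return (1.0 - Z) * H + Z * H_cand             # gated combination -> h_v^t
```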

Graph LSTM

graph-structure LSTM

S-LSTM (Sentence LSTM) (Zhang et al., 2018d)

It converts text into a graph and utilizes the graph LSTM to learn the representation.

Highway GCN (Rahimi et al., 2018)

layerwise gates.

\[\bm T(\bm h^t)=\sigma(\bm W_t\bm h^t + \bm b_t)\\ \bm h^{t+1} = \bm h^{t+1}\odot \bm T(\bm h^t) + \bm h^t \odot (1 - \bm T(\bm h^t)) \]

JKN (jumping knowledge network) (Xu et al. 2018)

It can learn adaptive, structure-aware representations.

JKN selects from all of the intermediate representations of each node at the last layer (they 'jump' to the last layer).

DeepGCNs (Li et al. 2019a)

It borrows ideas from ResNet and DenseNet.

\[\bm h^{t+1}_{Res} = \bm h^{t+1} +\bm h^t \\ \bm h^{t+1}_{Dense} = \|_{i=0}^{t+1} \bm h^i \]

, where \(\|\) denotes vector concatenation.

The experiments show that the best results are achieved with a 56-layer network.
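A tiny sketch of the two skip-connection styles, wrapped around an arbitrary per-layer transform passed in as a callable (a stand-in for the actual GCN layer):

```python
import numpy as np

def res_block(H, layer):
    return layer(H) + H                             # h^{t+1}_Res = h^{t+1} + h^t

def dense_block(H_all, layer):
    """H_all: list of all previous outputs [h^0, ..., h^t]; feature width grows."""
    H_new = layer(H_all[-1])
    return np.concatenate(H_all + [H_new], axis=1)  # h^{t+1}_Dense = h^0 || ... || h^{t+1}
```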

HGT (Hu et al., 2020a)

It designs a sampling method, HGSampling, which is a heterogeneous version of LADIES.

GPT-GNN (Hu et al., 2020b)

It focuses on the academic knowledge graph.

Sampling modules

Node sampling, layer sampling, subgraph sampling.

Node sampling
  • randomly sampling a fixed number of neighbors. (GraphSAGE)
  • control-variate based stochastic approximation (restricted to the 1-hop neighbors).
  • importance-based sampling. (PinSage, Ying et al. 2018a)
  • random walks.
Layer sampling

Layer sampling retains a set of nodes for aggregation in each layer.

  • importance-based sampling. FastGCN (Chen et al. 2018a)
  • parameterized and trainable layer-wise sampling conditioned on the former layer. (Huang et al. 2018)
  • generating samples from the union of neighbors to alleviate the sparsity issue. (LADIES, Zou et al. 2019)
Subgraph sampling

ClusterGCN (Chiang et al. 2019) uses graph clustering algorithms to sample subgraphs.

GraphSAINT (Zeng et al. 2020) directly samples nodes or edges.

Pooling modules

direct pooling, hierarchical pooling.

Direct pooling
  • mean, max, sum, attention.
  • set2set. It deals with unordered sets, using an LSTM-based method to produce an order-invariant representation.
  • SortPooling. sorting according to the structural roles of nodes, then feeding into CNNs.
Hierarchical pooling

Graclus (2007), a faster way to cluster nodes, used in ChebNet and MoNet.

  • ECC (Edge-Conditioned Convolution) (2017). recursively downsampling based on splitting graph into two components by the sign of the largest eigenvector of the Laplacian.

  • DiffPool (2018). learnable assignment matrix \(\bm S^t\) for each layer.

\[\bm S^t = \rm{softmax} (\rm{GNN}_t(\bm A^t, \bm H^t)), \\ \bm A^{t+1}=(\bm S^t)^T \bm A^t \bm S^t \]

, where \(\bm H^t\) denotes node features, \(\bm A^t\) the (coarsened) adjacency matrix, and \(\bm S^t\) the probabilities that a node at layer \(t\) is assigned to a coarser node at layer \(t+1\) (a sketch of one DiffPool step follows after this list).

  • gPool (2019). It learns a projection vector to score nodes (it does not consider the graph structure).
  • EigenPooling (2019). It uses node features and local structural features jointly, and uses the graph Fourier transform to extract subgraph information.
  • SAGPool (2019). It uses node features and topology jointly, with a self-attention-based method.
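As noted in the DiffPool item above, here is a NumPy sketch of one coarsening step; the two GNNs that would normally produce the assignment matrix and the pooled embeddings are replaced by a single `A @ H @ W` propagation purely for illustration:

```python
import numpy as np

def softmax(X, axis=-1):
    e = np.exp(X - X.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diffpool_step(A, H, W_assign, W_embed):
    """A: (N, N), H: (N, F); W_assign: (F, C) for C coarse nodes, W_embed: (F, F')."""
    S = softmax(A @ H @ W_assign, axis=1)          # soft assignment S^t, rows sum to 1
    Z = A @ H @ W_embed                            # stand-in for the embedding GNN
    A_next = S.T @ A @ S                           # coarsened adjacency A^{t+1}
    H_next = S.T @ Z                               # pooled node features
    return A_next, H_next
```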

Graph types

  • directed graphs.
  • heterogeneous graphs.
    • meta-path-based methods.
    • edge-based methods.
    • methods for relational graphs. (relational graph: edges of the graph carry rich information, or there are many edge types.)
    • methods for multiplex graphs. (multiplex graph: there are multiple edges of different types between two nodes.)
  • dynamic graphs. (dynamic graph: the graph structure and the node/edge information vary over time.)
  • hypergraphs (an edge connects two or more nodes).
  • signed graphs (signed edges, which can be positive or negative, e.g. friend/enemy edge).
  • large-scale graphs.

Unsupervised training

graph auto-encoders, contrastive learning.

graph auto-encoders

GAE (graph auto-encoder) and VGAE (variational GAE) (Kipf and Welling, 2016) use a simple decoder to reconstruct the adjacency matrix.

\[\bm H = \rm{GCN}(\bm X, \bm A) \\ \tilde {\bm A}= f(\bm H \bm H^T) \]
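A minimal NumPy sketch of this encoder–decoder pair, assuming a two-layer GCN encoder (reusing the propagation rule from the GCN section) and a sigmoid inner-product decoder as \(f\):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gae_reconstruct(A, X, W1, W2):
    A_tilde = A + np.eye(A.shape[0])
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt
    H = np.maximum(0.0, A_hat @ X @ W1)            # first GCN layer with ReLU
    Z = A_hat @ H @ W2                             # node embeddings H
    return sigmoid(Z @ Z.T)                        # reconstructed adjacency f(H H^T)
```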

Wang et al. (2017) and Park et al. (2019) instead try to reconstruct the feature matrix.

MGAE (Wang et al., 2017) utilizes marginalized denoising graph auto-encoders.

GALA (Park et al., 2019) proposed Laplacian sharpening (inverse of Laplacian smoothing) to alleviate the oversmoothing issue in GNN training.

AGE (Cui et al., 2020) employed adaptive learning for the measurement of pairwise node similarity (for node clustering and link prediction).

contrastive learning

DGI (Deep Graph Infomax) (Velickovic et al., 2019) maximizes the mutual information between node representations and graph representations.

InfoGraph (Sun et al., 2020) learns graph representations by maximizing the mutual information between different scales of objects, including nodes, edges, triangles, and the whole graph.

Multi-view (Hassani and Khasahmadi, 2020) contrasts representations from a first-order adjacency matrix and graph diffusion.

Graph Signal Processing

Li et al. (2018c) first show that the graph convolution in graph neural networks is actually Laplacian smoothing, which smooths the feature matrix so that nearby nodes obtain similar hidden representations. Laplacian smoothing reflects the homophily assumption that nearby nodes are supposed to be similar. The Laplacian matrix serves as a low-pass filter for the input features. SGC (Wu et al., 2019b) further removes the weight matrices and nonlinearities between layers, showing that the low-pass filter is the reason why GNNs work.

Following the idea of low-pass filtering, Zhang et al. (2019c), Cui et al. (2020), NT and Maehara (Nt and Maehara, 2019) and Chen et al. (2020b) analyze different filters and provide new insights. To achieve low-pass filtering for all the eigenvalues, AGC (Zhang et al., 2019c) designs the graph filter \(\bm I-\frac12 \bm L\) according to the frequency response function. AGE (Cui et al., 2020) further demonstrates that the filter \(\bm I-\frac{1}{\lambda_{max}}\bm L\) can get better results, where \(\lambda_{max}\) is the maximum eigenvalue of the Laplacian matrix. Besides linear filters, GraphHeat (Xu et al., 2019b) leverages heat kernels for better low-pass properties. NT and Maehara (Nt and Maehara, 2019) state that graph convolution is mainly a denoising process for input features, and that model performance heavily depends on the amount of noise in the feature matrix. To alleviate the over-smoothing issue, Chen et al. (2020b) present two metrics for measuring the smoothness of node representations and the over-smoothness of GNN models, and conclude that the information-to-noise ratio is the key factor for over-smoothing.

generative models

NetGAN (Shchur et al., 2018b)

GCPN (You et al., 2018a) (incorporating domain-specific rules through reinforcement learning)

GraphRNN (You et al., 2018b)

Li et al. (2018d)

GraphAF (Shi et al., 2020)

MolGAN (De Cao and Kipf, 2018) (generating the adjacency matrix)

Ma et al. (2018)

GNF (Liu et al., 2019) (normalizing flow)

Graphite (Grover et al., 2019)

GNN for recommendation

GC-MC (van den Berg et al., 2017) (user-item rating)

PinSage (Ying et al., 2018a) (user-item)

GraphRec (Fan et al., 2019) (social network)

Wu et al. (2019c) (social network)


Shchur et al. (2018a) conclude that different dataset splits lead to dramatically different rankings of models, and that simple models can outperform complicated ones under proper settings.

In graph learning, widely-adopted benchmarks are problematic. For example, most node classification datasets contain only 3000 to 20,000 nodes.


Datasets

Dataset lists (as of 2020) are given in Appendix A of the paper Graph neural networks: A review of methods and applications.

Open source implementations

Official open-source implementations of the published papers are listed in Appendix B of the paper Graph neural networks: A review of methods and applications.



Graph Node Classification

We can define a negative log-likelihood loss function as

\[\mathcal L(G)=\sum_{u\in G} -\log({\rm likelihood}(\bm z_u, \bm y_u)), \qquad {\rm likelihood}(\bm z_u, \bm y_u)=\bm y_u^T \sigma(\bm W\bm z_u) \]

where \(\bm z_u\in\R^d\) is the embedding of node \(u\), \(\bm y_u\in\{0,1\}^{|C|}\) is the one-hot label vector of node \(u\), \(C\) is the set of class labels, \(\bm W\in\R^{|C|\times d}\) is a trainable parameter matrix, and \(\sigma\) is the softmax function \(\sigma(\bm W\bm z_u)[i]=\frac{\exp (\bm w_i \bm z_u)}{\sum_{j\in C} \exp (\bm w_j \bm z_u)}\).
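A NumPy sketch of this negative log-likelihood loss over all labeled nodes, computed with a log-sum-exp for numerical stability:

```python
import numpy as np

def node_classification_loss(Z, Y, W):
    """Z: (N, d) node embeddings, Y: (N, |C|) one-hot labels, W: (|C|, d) weights."""
    logits = Z @ W.T                                        # (N, |C|)
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log softmax
    return -(Y * log_probs).sum()                           # sum of -log likelihood over nodes
```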


For a symmetric normalized Laplacian matrix \(L\) , the Fourier transform and the inverse Fourier transform are defined as

\[\mathcal F(\bm x)=U^T \bm x \\ \mathcal F^{-1}(\bm x)= U \bm x \]

, where \(U\) is the matrix of eigenvectors of \(L\) , or \(L=U\Lambda U^T\) .

The convolution of \(\bm g\) and \(\bm x\) is defined as

\[\bm g * \bm x = \mathcal F^{-1}(\mathcal F(\bm g)\odot \mathcal F(\bm x)) = U(U^T\bm g\odot U^T \bm x) \]
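A direct NumPy sketch of this spectral convolution via an eigendecomposition of the normalized Laplacian; it is only practical for small graphs, which is exactly why ChebNet/GCN-style approximations avoid the explicit decomposition:

```python
import numpy as np

def spectral_conv(A, g, x):
    """A: (N, N) adjacency with no isolated nodes; g, x: (N,) graph signals."""
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    L = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt    # symmetric normalized Laplacian
    _, U = np.linalg.eigh(L)                                # L = U diag(lambda) U^T
    return U @ ((U.T @ g) * (U.T @ x))                      # U (U^T g ⊙ U^T x)
```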
