Transformers are Graph Neural Networks

It seems today everyone is talking about this blog post. It describes how NLP's Transformers are just part of a more general concept we call Graph Neural Networks.

In a Transformer, we have an encoder that gives an embedding for every word. The idea is very simple: let the embedding of word A be a weighted average of the embeddings of the other words in the sentence. That's it. That's what everyone calls attention. Attention = weighted average. The weights are trained together with the embeddings; you also add normalization tricks, positional encodings, an MLP, and other deep learning mumbles, but this is like in any other model. What's novel is that you use a weighted average instead of aggregating word by word, sequentially.

In a GNN we also have an encoder; this encoder just takes a graph in and outputs embeddings of its nodes. The most popular way to do this in GNNs is through a message-passing mechanism, the idea of which is also simple: the embedding of node A is a function of the embeddings of its neighbors. Two changes from the Transformer:
1) We can have any function, instead of a weighted average.
2) We aggregate over the neighbors, instead of over all nodes.

If you take the weighted average, i.e. attention, as the function, then you have a Graph Attention Network. Surprise. You could also aggregate over nodes beyond your neighbors, and I believe there are works that do this, but the common assumption is that graph connections are already constructed in such a way that a node is most influenced by its incoming edges and not by some distant nodes.

So Transformers are not GNNs in general; at most, Transformers are Graph Attention Networks, one particular kind of GNN, applied to a graph in which every word is connected to every other word. This is important if we want to draw conclusions for NLP from the GNN area. For example, the Graph Attention Network is less powerful than many other GNNs, such as GIN. It also does not have the function approximation guarantees that I described in previous posts. But sure, the connection between NLP and GML exists and is not fully explored.
I bet you can leverage graphs when you analyze text, or you can use a GNN for better encoding. GML, on the other hand, has already benefited from NLP in many ways: usually we first see some results in NLP (RNN, Transformers, BERT) and then see their adaptation in GML. So maybe it's the moment we'll see more contributions to NLP with graphs.
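
To make the "attention = weighted average" point concrete, here is a minimal sketch in plain Python. It is a toy under loud assumptions: the names `softmax`, `dot`, `attention_update`, and `neighbors` are made up for illustration, there are no learned weights, no multiple heads, and no positional encodings, and the scores are plain dot products (a real Graph Attention Network uses a learned scoring function instead). The only thing it tries to show is the difference from the post: averaging over all positions versus averaging over a node's neighbors.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product, used here as an (untrained) similarity score."""
    return sum(a * b for a, b in zip(u, v))


def attention_update(embeddings, neighbors=None):
    """Replace every embedding by a weighted average of other embeddings.

    neighbors[i] lists the indices that node i may aggregate over.
    neighbors=None means every position sees every position, which is
    the Transformer-style case; passing a graph's adjacency lists gives
    the neighbor-only, Graph-Attention-style case.
    """
    n = len(embeddings)
    updated = []
    for i in range(n):
        allowed = list(range(n)) if neighbors is None else list(neighbors[i])
        # Assumption: scores are dot products of the raw embeddings.
        weights = softmax([dot(embeddings[i], embeddings[j]) for j in allowed])
        dim = len(embeddings[i])
        updated.append([
            sum(w * embeddings[j][d] for w, j in zip(weights, allowed))
            for d in range(dim)
        ])
    return updated


# Three "words" (or nodes) with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Transformer-style: each word averages over the whole sentence.
full = attention_update(embeddings)

# Graph-style: a path graph 0 - 1 - 2 with self-loops, so node 0 never
# looks at node 2 directly.
local = attention_update(embeddings, neighbors=[[0, 1], [0, 1, 2], [1, 2]])
```

Under these assumptions, calling `attention_update` with a fully connected neighbor list gives exactly the same result as calling it with `neighbors=None`, which is the sense in which the Transformer case is the special case of attention over a complete graph.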