Contenu du post
Compositional Tokenization for Knowledge Graphs This is a guest post by Michael Galkin about their new paper of reducing the memory issues of existing approaches. Pretty much all KG embedding algorithms are, in fact, shallow embedding algorithms. It means that each node is mapped to a unique vector - and as a basis for all downstream tasks you need to store the whole embedding matrix in memory. Already at the OGB scale (2.5-5M nodes) you’d need 2-10 GB VRAM on the embeddings only, not counting forward passes and backprop. The more nodes you have - the bigger the matrix, the more expensive GPU you need. Looking back to 2015, it resembles word2vec and GloVe a lot - huge shallow word vocabularies of 0.5-3M words, every other word is OOV (out of vocab). Then, subword units arrived (as Byte-Pair Encoding or WordPiece) and dramatically reduced vocab sizes allowing building infinite combinations from a rather small tokens vocab (30-50K in BERT & GPT-2/3 ). Saved params are now better invested into a flurry of Transformer encoders. If we treat nodes in a graph like “words”, what would be their “sub-word” units? Can we have a similar approach that would allow to bootstrap a representation of both seen and unseen nodes using the same vocab? We tackle those questions in our new work where we design NodePiece (pun intended), a compositional tokenization scheme for KGs where tokens are anchor nodes and relation types. Going from shallow to compositional encoding, we reduce embedding matrices 10-1000x times and still observe a competitive performance. Interestingly, sometimes you don’t even need trainable node embeddings to perform well on node classification and relation prediction, i.e, relations around the node are enough! We encourage you to find even more details in the pre-print, Medum blog, and try out the code in Github repo.