Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework

This is a post by Michael Galkin (@gimmeblues) about their new work on a comprehensive evaluation of knowledge graph embeddings, with a lot of interesting insights about knowledge graphs.

Today we are publishing the results of our large-scale benchmarking study of knowledge graph (KG) embedding approaches. We are also releasing the code of PyKEEN 1.0, the PyTorch library behind the study!

What makes KGs special: they often have hundreds or thousands of different relations (edge types), and good representations are essential for reasoning in embedding spaces as well as for numerous NLP tasks. We often evaluate KG embeddings on the link prediction task: given a subject and a predicate, the model has to predict the most plausible objects. As typical KGs contain 50k-100k different entities, you can guess that the top-1/top-10 ranking task is quite complex!

Why benchmarking is important: currently, there are no baseline numbers to refer to. Lots of papers in the domain are not reproducible, or the authors simply take metric values as reported in other papers without reproducing the results.

In this study, we ran 65K+ experiments and spent 21K+ GPU hours evaluating 19 models, spanning from RESCAL, first published in 2011, to RotatE and TuckER from late 2019, along with 5 loss functions, training strategies with and without negative sampling, and many more hyperparameters that turn out to be important to consider.
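To make the link prediction evaluation described above more concrete, here is a minimal sketch in plain Python. It is not code from PyKEEN or from the paper: the entities, the 2-d embeddings, and all function names are made up for illustration, and the scoring function is a simple TransE-style distance. Every entity is scored as a candidate object for a given subject and predicate, and the rank of the true object feeds into Hits@k and mean reciprocal rank.

```python
import math

# Toy 2-d embeddings. All names and values are hypothetical, chosen by hand.
entity_emb = {
    "berlin":  [0.0, 0.0],
    "germany": [1.0, 1.0],
    "paris":   [3.0, 0.0],
    "france":  [4.0, 2.0],
    "lyon":    [4.0, 1.2],
}
relation_emb = {"capital_of": [1.0, 1.0]}


def transe_score(subj, pred, obj):
    """TransE-style plausibility: negative L2 distance between (s + p) and o."""
    s, p, o = entity_emb[subj], relation_emb[pred], entity_emb[obj]
    return -math.sqrt(sum((si + pi - oi) ** 2 for si, pi, oi in zip(s, p, o)))


def rank_of_true_object(subj, pred, true_obj):
    """Score every entity as a candidate object; return the 1-based rank of the true one."""
    scores = {e: transe_score(subj, pred, e) for e in entity_emb}
    # Count the candidates that score strictly higher than the true object.
    better = sum(1 for sc in scores.values() if sc > scores[true_obj])
    return better + 1


def hits_at_k(ranks, k):
    """Fraction of test triples whose true object is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)


test_triples = [
    ("berlin", "capital_of", "germany"),
    ("paris", "capital_of", "france"),
]
ranks = [rank_of_true_object(s, p, o) for s, p, o in test_triples]
print(ranks, hits_at_k(ranks, 1), mean_reciprocal_rank(ranks))
```

With only five candidate entities the ranking is trivial. On a real KG the same loop runs over tens of thousands of candidates per test triple, which is why top-1 and top-10 are hard to hit. The sketch also ranks against all entities without filtering out other known true objects, a simplification assumed here for brevity.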
Key findings:
- Careful hyperparameter optimization (HPO) brings new SOTA results, with significant gains of 4-5% over the results reported in the respective papers (btw, we used Optuna for HPO);
- Properly tuned classical models (TransE, DistMult) are still good and actually outperform several newer models;
- There is no best-of-the-best silver-bullet model that beats all others across all tasks: some models better capture transitivity, whereas others better capture symmetric relations;
- Surprisingly, for an inherently ranking task, the ranking loss (MarginRankingLoss in PyTorch) is suboptimal. Instead, cross-entropy and its variations show better results;
- Using all entities for negative sampling, i.e., a sigmoid/softmax distribution over all entities, works well but can be quite expensive on large KGs. Stochastic negative sampling is the way to go then;
- Computationally expensive and bigger models do not yield drastic performance gains. In fact, 64-d RotatE is better than most 500-d models.

Paper: https://arxiv.org/abs/2006.13365
Code: https://github.com/pykeen/pykeen
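
As a rough illustration of the loss and negative-sampling findings above, the sketch below contrasts a pairwise margin ranking loss with a softmax cross-entropy over all candidates, and shows uniform stochastic negative sampling. It is written in plain Python with hypothetical function names and toy scores. These are simplified stand-ins, not the PyTorch or PyKEEN implementations, and not the experimental setup of the paper.

```python
import math
import random


def margin_ranking_loss(pos_score, neg_scores, margin=1.0):
    """Pairwise hinge loss: each negative should score at least `margin` below the positive."""
    return sum(max(0.0, margin - pos_score + n) for n in neg_scores) / len(neg_scores)


def softmax_cross_entropy(scores, true_index):
    """Cross-entropy of a softmax over all candidate scores (log-sum-exp for stability)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[true_index]


def sample_negatives(num_entities, true_index, k, rng):
    """Draw k distinct entity indices uniformly at random, excluding the true entity."""
    candidates = [i for i in range(num_entities) if i != true_index]
    return rng.sample(candidates, k)


# Toy scores for four candidate objects; index 0 is the true object.
scores = [2.0, 0.5, -1.0, 0.0]
true_index = 0

# Loss over all entities: the softmax normalizer touches every candidate.
ce_all = softmax_cross_entropy(scores, true_index)

# Margin loss against every negative: zero here, since all margins are satisfied.
mr_all = margin_ranking_loss(scores[true_index], scores[1:])

# Stochastic variant: score only a small random subset of negatives.
rng = random.Random(0)
neg_idx = sample_negatives(len(scores), true_index, k=2, rng=rng)
mr_sampled = margin_ranking_loss(scores[true_index], [scores[i] for i in neg_idx])

print(ce_all, mr_all, mr_sampled)
```

In this toy case the margin loss is already zero once every negative sits a full margin below the positive, while the cross-entropy stays positive. That is a property of the two formulas and not necessarily the reason for the result reported in the paper. The sampled variant touches only `k` negatives per positive, which is what keeps the cost manageable on large KGs.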