A Critical Look at the Evaluation of GNNs under Heterophily: Are we really making progress?

ICLR 2023, guest post by Oleg Platonov

Stop evaluating on Squirrel and Chameleon

It is often believed that standard GNNs work well for node classification only on homophilous graphs. As a result, many specialized models have recently been proposed for learning on heterophilous graphs. However, these models are typically evaluated on the same set of six heterophilous graphs: Squirrel, Chameleon, Actor, Texas, Cornell, and Wisconsin.

In our recent paper, we show that these datasets have serious problems that make results obtained on them unreliable. These problems include low diversity, small graph size, and strong class imbalance. The most significant problem, however, is the presence of a large number of duplicated nodes in Squirrel and Chameleon, which leads to train-test data leakage. We show that removing the duplicates strongly affects the performance of GNNs on these datasets.

We have proposed an alternative benchmark of five diverse heterophilous graphs that come from different domains and exhibit a variety of structural properties. Our benchmark includes Roman-empire (a word dependency graph), Amazon-ratings (a product co-purchasing network), Minesweeper (a synthetic graph emulating the Minesweeper game), Tolokers (a crowdsourcing platform worker network), and Questions (a question-answering website interaction network).

We have evaluated a large number of models, both standard GNNs and heterophily-specific methods. Surprisingly, we found that standard GNNs augmented with skip connections and layer normalization almost always outperform the specialized models. We hope that the proposed benchmark and the insights obtained using it will facilitate further research on learning under heterophily.

The datasets are available on GitHub and in PyG Datasets. For more details, see our paper.
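
To make the leakage problem with duplicated nodes more concrete, here is a minimal sketch of how one might look for it on a toy graph. It assumes a simplified notion of a duplicate (two nodes with the same feature vector and the same neighbor set), which may differ from the exact criterion used in the paper. The function names and the toy data are hypothetical and are not taken from the paper's code.

```python
from collections import defaultdict


def find_duplicate_groups(features, edges):
    """Group node ids that share identical features and identical neighbor sets.

    Assumption: this is a simplified duplicate criterion chosen for illustration.
    """
    num_nodes = len(features)
    neighbors = [set() for _ in range(num_nodes)]
    for u, v in edges:  # treat the graph as undirected
        neighbors[u].add(v)
        neighbors[v].add(u)

    groups = defaultdict(list)
    for node in range(num_nodes):
        key = (tuple(features[node]), frozenset(neighbors[node]))
        groups[key].append(node)

    # Only groups with more than one node contain duplicates.
    return [group for group in groups.values() if len(group) > 1]


def leaked_test_nodes(groups, train_ids, test_ids):
    """Return test nodes that have at least one duplicate in the training set."""
    train, test = set(train_ids), set(test_ids)
    leaked = set()
    for group in groups:
        if any(node in train for node in group):
            leaked.update(node for node in group if node in test)
    return sorted(leaked)


# Toy graph: nodes 0 and 1 have the same features and the same neighbors {2, 3}.
features = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]]
edges = [(0, 2), (1, 2), (0, 3), (1, 3), (2, 4)]

groups = find_duplicate_groups(features, edges)
print(groups)  # [[0, 1]]
print(leaked_test_nodes(groups, train_ids=[0, 2], test_ids=[1, 3, 4]))  # [1]
```

In this toy split, node 1 sits in the test set while its duplicate, node 0, sits in the training set, so a model can score well on node 1 simply by memorizing node 0. Nodes 2 and 3 share features but not neighbors, so this sketch does not count them as duplicates.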