On the Cora dataset

Cora, Citeseer, and Pubmed are three popular datasets for node classification. Cora is one of those cases where you can clearly see the power of GNNs. For example, on Cora GNNs reach around 80% accuracy, while GBDT/MLP reach only around 60%. This is not often the case: for many datasets I see only a marginal win for GNNs over non-graph methods, and for some datasets GNN performance is actually lower.

So why is GNN performance so good on this dataset? I don't have a good answer, but here are some thoughts. Cora is a citation network, where nodes are papers and classes are the papers' fields. However, it's not clear what the links between these documents are. The original paper didn't describe exactly how the links are established. If links were based on citations, i.e. two papers are connected if one cites the other, that could explain such a big improvement: a GNN sees all nodes during training, while an MLP sees only the training nodes, and since two connected papers are likely to share the same field, the GNN leverages this graph information. If that's the case, a simple k-NN majority-vote baseline would perform similarly to a GNN.

However, there is an opinion from people who know the authors of the original paper that the links are established based on word similarity between documents. If that's true, I'm not sure why GNNs do so well on this dataset. In any case, building graphs from real-world data requires a lot of attention and visibility, which is why structure learning is such an active topic.
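To make the majority-vote baseline mentioned above concrete, here is a minimal sketch in plain Python. It assumes that "neighbors" means direct graph neighbors and that only training nodes carry labels. The function name `neighbor_majority_vote`, the `default_label` fallback, and the toy graph are all made up for illustration; this is not Cora data and not a method from the original paper.

```python
from collections import Counter


def neighbor_majority_vote(edges, train_labels, num_nodes, default_label):
    """Predict a class for every unlabeled node by majority vote
    over its labeled direct neighbors (hypothetical helper)."""
    # Build an undirected adjacency list from the edge list.
    adj = [set() for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    predictions = {}
    for node in range(num_nodes):
        if node in train_labels:
            continue  # only predict for nodes without a training label
        # Count the labels of neighbors that belong to the training set.
        votes = Counter(train_labels[n] for n in adj[node] if n in train_labels)
        # Fall back to a default (e.g. the most frequent training class)
        # when a node has no labeled neighbor.
        predictions[node] = votes.most_common(1)[0][0] if votes else default_label
    return predictions


# Toy graph: two tightly connected groups joined by one edge (assumed data).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
train_labels = {0: "ML", 1: "ML", 4: "DB", 5: "DB"}

predictions = neighbor_majority_vote(
    edges, train_labels, num_nodes=6, default_label="ML"
)
print(predictions)
```

If connected papers really do tend to share a field, a baseline like this should land close to GNN accuracy. A large gap between the two would suggest the GNN is getting something more than neighbor label agreement out of the graph.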