GraphML News (Jan 18th) - MatterGen release, Aviary, Metagene-1

We are getting back to the regular schedule after the winter break!

⚛️ MSR AI 4 Science has finally released the code and weights of MatterGen, the generative model for inorganic materials, together with its publication in Nature (open access, no paywall). The final version includes a new evaluation pipeline that accounts for compositional disorder (which unrealistically inflates the performance metrics of recent generative models) and experimental validation of the first generated material, TaCr2O6. MatterSim, an ML potential model, was also used during the filtering stages along with standard DFT calculations. A great result for MSR, the materials science community, and diffusion model experts who can apply their bag of tricks to a new domain 👏

📜 Together with MatterGen, MaSIF-neosurf from EPFL was published in Nature as well. MaSIF-neosurf is a geometric model for studying surfaces of protein-ligand complexes; it was experimentally validated by designing binders against three real-world protein complexes. To conclude the celebration of massive scientific publications, ESM 3 - the foundation model for proteins - got accepted at Science.

🕊️ FutureHouse released Aviary, an agentic framework for scientific tasks like molecular cloning, protein stability, and scientific QA. Aviary extends the framework of PaperQA (the best open source scientific RAG tool) with learnable RL environments and demonstrates that even small LLMs like Llama 3.1 8B excel at these tasks given enough inference compute. Get ready to hear about inference-time scaling throughout 2025 😉

✏️ The AI 4 Science consortium published a blog post, AI 4 Science 2024, highlighting AF3 and its replications, the success of non-equivariant models, scientific foundation models, and new progress in small molecules and quantum chemistry. A short but insightful read.

🪣 Metagene-1 (by USC and Prime Intellect) is a foundation model trained on metagenomic sequences ("human wastewater samples", we all know what that means 💩) that might help in pandemic monitoring and pathogen detection. It's a standard Llama 2 7B architecture but, interestingly, it outperforms some state space models like HyenaDNA on genome understanding.

Weekend reading:

GenMol: A Drug Discovery Generalist with Discrete Diffusion by Seul Lee, NVIDIA and KAIST - a generalist generative model for a suite of drug discovery tasks like de novo generation, fragment-conditioned generation, and lead optimization.

The Jungle of Generative Drug Discovery: Traps, Treasures, and Ways Out by Riza Özçelik and Francesca Grisoni - on metrics and benchmarking for generative models of molecules.

Explaining k-Nearest Neighbors: Abductive and Counterfactual Explanations by Pablo Barceló and the CENIA team from Chile - a theoretical work tackling classical (but still important) kNN classifiers and how their predictions can be explained. The experiments use MNIST and can run on a laptop.