Post #1709

@huggingface

Hugging Face

Views: 17
Published: 12 Nov 2025, 15:45
Post content

Hugging Face (Twitter)

RT @Thom_Wolf: We’re releasing a new, extremely high-quality large dataset: *FinePDF-edu*

This continues our work on open-sourcing everything possible about the science of pre-training datasets.

In this new data, we applied the same filtering techniques we used for the widely adopted *FineWeb-edu*, but this time to our recently published PDF dataset called FinePDF: 3 trillion tokens of Common Crawl PDFs released a few months ago.

The result is *FinePDF-edu*, a high-quality dataset of about 350B tokens (130B English), representing the most educational slice of all Common Crawl PDFs and outperforming our previously released pretraining dataset on our benchmark.

Some details 👇

High-level:
- 350B+ highly educational tokens in 69 languages with strong performance
- 69 education classifiers powered by ModernBERT and mmBERT
- 300k+ EDU annotations per language, generated with Qwen3-235B

Method:
We filtered this dataset by asking an LLM to score the of a...