Content
Hugging Face (Twitter) RT @rohanpaul_ai: MASSIVE. The largest open-source PDF dataset just dropped on @huggingface. FinePDFs: 3 trillion tokens across 475 million documents in 1733 languages. This is the largest publicly available corpus sourced exclusively from PDFs. The data was sourced from 105 CommonCrawl snapshots, spanning the summer of 2013 to February 2025, as well as refetched from the internet, and processed using 🏭 datatrove, Hugging Face's large-scale data processing library. This carefully deduplicated and filtered dataset comprises roughly 3.65 terabytes of data, or about 3T tokens. For PII and opt-out, see "Personal and Sensitive Information and opt-out". The dataset is fully reproducible and released under the ODC-By 1.0 license. You will be able to access the reproduction code, ablation, and evaluation setup in this GitHub repository soon 👷. Compared to HTML datasets, despite being only mildly filtered, it achieves results nearly on par with...
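The headline figures in the post (3 trillion tokens, 475 million documents, 1733 languages, roughly 3.65 terabytes) imply some rough per-document averages. Below is a minimal back-of-the-envelope sketch of that arithmetic. The function name `corpus_stats` is hypothetical, and treating a terabyte as 10**12 bytes is an assumption, since the post does not say whether the size is decimal, binary, or compressed.

```python
# Totals as quoted in the post. Decimal terabytes (10**12 bytes) are an assumption.
TOKENS = 3e12
DOCUMENTS = 475e6
LANGUAGES = 1733
SIZE_BYTES = 3.65e12


def corpus_stats(tokens, documents, languages, size_bytes):
    """Return rough per-unit averages derived from corpus-level totals."""
    return {
        "tokens_per_doc": tokens / documents,
        "docs_per_language": documents / languages,
        "bytes_per_doc": size_bytes / documents,
        "bytes_per_token": size_bytes / tokens,
    }


stats = corpus_stats(TOKENS, DOCUMENTS, LANGUAGES, SIZE_BYTES)
for name, value in stats.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the means come to about 6,300 tokens per document and about 274,000 documents per language. These are averages only and say nothing about how documents or tokens are actually distributed across languages.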