Post content
Hugging Face (Twitter)
RT @maximelabonne: Pheww, another banger dataset from @huggingface!
> 3T tokens, 475M PDFs, 1733 languages
> Close to Nemotron-CC v2 and FineWeb-Edu+DCLM on its own (‼️)
> Greatly boosts perf when combined, likely because it provides high diversity that complements the other datasets well