Post content
Hugging Face (Twitter) RT @Thom_Wolf: This is huge. Continuing our foundational work to enable anyone to train state-of-the-art AI models, we’re thrilled to release «FinePDFs»: 3T tokens of textual data that until now was locked away in PDFs, arguably some of the highest-quality publicly available data out there. We gathered FinePDFs to create the largest permissively licensed corpus sourced exclusively from PDFs. Amazingly challenging infra and processing work, h/t to the fineweb team. https://twitter.com/HKydlicek/status/1964584936524124645#m