Hugging Face (Twitter)
RT @eliebakouch: DeepSeek-OCR has some weird architectural choices for the LLM decoder: DeepSeek3B-MoE-A570M
-> uses MHA, no MLA (not even GQA?)
-> 2 shared experts (like DeepSeek V2, but V3 only has 1)
-> quite low sparsity, activation ratio is 12.5%. For V3 it’s 3.52%, for V2 it’s 5%
-> not very deep, 12 layers
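
The activation ratios quoted above can be reproduced with a little arithmetic. This is a rough sketch, not the author's calculation. The expert counts (64, 160, and 256 routed experts, with 6, 6, and 8 routed experts active per token) are not in the post; they are assumptions based on the publicly released model configs. The convention of dividing by the routed-expert count only is also an assumption, chosen because it matches the quoted percentages.

```python
# Expert counts are assumptions recalled from the public model configs,
# not figures stated in the post.
CONFIGS = {
    "DeepSeek-OCR decoder": {"routed": 64, "active_routed": 6, "shared": 2},
    "DeepSeek V2": {"routed": 160, "active_routed": 6, "shared": 2},
    "DeepSeek V3": {"routed": 256, "active_routed": 8, "shared": 1},
}


def activation_ratio(routed: int, active_routed: int, shared: int) -> float:
    """Experts used per token divided by the routed-expert count.

    Assumed convention: shared experts count as active but are left out
    of the denominator.
    """
    return (active_routed + shared) / routed


RATIOS = {name: activation_ratio(**cfg) for name, cfg in CONFIGS.items()}

for name, ratio in RATIOS.items():
    print(f"{name}: {ratio:.2%}")
```

Under these assumptions the OCR decoder activates a much larger fraction of its experts per token than V2 or V3, which is the "quite low sparsity" the post points at.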