Post #625

@dps_build

DPS Build

Views: 412

Posted: Mar 4 (03/04/2026, 02:08 AM)


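Yesterday I came across a project that reverse-engineers Apple's ANE, so I handed it to Claude and had it adapted to run the Qwen 3.5 dense models. The first results were mediocre: only the 0.8b model ran, and neither 4b nor 9b would run, because the ANE has a limit of 119 kernels. Today I saw the ANE-LM project, which has more innovations, so I had Claude adapt the code again, and now all three models run on an M4 Pro. See the screenshot for results: the larger the model, the more pronounced the ANE's advantage.

- Opt 1: Saves ~64KB zeroing × 96 calls/forward → minor latency reduction
- Opt 2: Eliminates 320 powf/cosf/sinf calls per layer → measurable CPU savings
- Opt 3: Removes inner loops in conv1d hot path → tighter CPU code
- Opt 4: Saves 1 ane_eval + 1 IOSurface round-trip per layer → ~36ms total for ANE mode (biggest win)
- Opt 5: Eliminates MPS object allocation per matvec → GPU mode overhead reduction

Finally, one more project worth sharing: https://github.com/vipuldivyanshu92/ANEgpt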
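For anyone curious what an optimization like Opt 2 can look like, here is a minimal Python sketch. It assumes the powf/cosf/sinf calls come from recomputing rotary position embedding (RoPE) angles on every layer, which the post does not actually state. The names `RopeCache`, `rope_angles_naive`, `apply_rope` and the base of 10000 are all hypothetical and are not taken from ANE-LM or the modified code.

```python
import math

def rope_angles_naive(pos, head_dim, base=10000.0):
    """Recompute (cos, sin) on every call: one pow, one cos, one sin per pair."""
    out = []
    for i in range(head_dim // 2):
        inv_freq = base ** (-2.0 * i / head_dim)  # the powf-style call
        angle = pos * inv_freq
        out.append((math.cos(angle), math.sin(angle)))
    return out

class RopeCache:
    """Hypothetical cache: build the table once, then every layer only reads it."""

    def __init__(self, head_dim, max_pos, base=10000.0):
        inv_freq = [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
        self.table = [
            [(math.cos(p * f), math.sin(p * f)) for f in inv_freq]
            for p in range(max_pos)
        ]

    def lookup(self, pos):
        # No transcendental math here, just a table read.
        return self.table[pos]

def apply_rope(vec, cos_sin):
    """Rotate consecutive pairs (x0, x1) of vec by the given angles."""
    out = list(vec)
    for i, (c, s) in enumerate(cos_sin):
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out

cache = RopeCache(head_dim=8, max_pos=16)
query = [1.0, 0.0, 0.5, -0.5, 2.0, 1.0, -1.0, 3.0]
rotated_cached = apply_rope(query, cache.lookup(5))
rotated_naive = apply_rope(query, rope_angles_naive(5, 8))
```

Both paths give the same rotated vector. The cached version pays for the transcendental calls once at startup instead of once per layer per token, trading a small amount of memory for less CPU work in the hot path.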