Content
Hugging Face (Twitter)

RT @Thom_Wolf: Very excited to finally release this new long form open-science post from the team: our first work on mechanistic interpretability.

@dlouapre spent several months working on a fully open-source and shareable reproduction of the « Golden Gate Claude » experiments. It was quite a journey, way less straightforward than it seemed from the start. Hence a great occasion to explore practical mechanistic interpretability challenges :)

Here are the main findings of this reproduction (full interactive blog post and demo online):

- The steering ‘sweet spot’ is small. The optimal steering strength is of the order of half the magnitude of a layer’s typical activation. This is consistent with the idea that steering vectors should not overwhelm the model’s natural activations. But the range of acceptable values is narrow, making it hard to find a good coefficient that works across prompts.
- Clamping is more effective than adding. We found that...
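To make the two findings concrete, here is a minimal NumPy sketch of the difference between adding a steering vector and clamping along it. It is not the team's code: the function names, array shapes, and random data are hypothetical, and "clamping" is assumed here to mean overwriting the activation's component along the steering direction with a fixed target value. The coefficient is set to half the mean activation norm only to mirror the "half the magnitude of a typical activation" remark.

```python
import numpy as np


def steer_add(h, v, alpha):
    """Additive steering: shift each activation by alpha along the unit direction of v."""
    v_hat = v / np.linalg.norm(v)
    return h + alpha * v_hat


def steer_clamp(h, v, target):
    """Clamping (assumed definition): set the component of each activation along v to target."""
    v_hat = v / np.linalg.norm(v)
    proj = h @ v_hat  # current component along v, one value per row
    return h + (target - proj)[:, None] * v_hat


# Hypothetical data: 4 token positions, 16-dimensional activations.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16))
v = rng.normal(size=16)
v_hat = v / np.linalg.norm(v)

# Coefficient of about half the typical activation magnitude.
typical_norm = np.linalg.norm(h, axis=1).mean()
alpha = 0.5 * typical_norm

added = steer_add(h, v, alpha)
clamped = steer_clamp(h, v, alpha)

# After adding, the component along v still varies from token to token;
# after clamping, it is the same fixed value at every position.
print("added  :", np.round(added @ v_hat, 3))
print("clamped:", np.round(clamped @ v_hat, 3))
```

Under these assumptions, additive steering keeps whatever component the activation already had along the direction and shifts it, while clamping discards that component and pins it to the target. In a real model the same operations would typically be applied to a layer's hidden states during the forward pass.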