

AI Safety Grant Update: Purging Corrupted Capabilities across Language Models

We're excited to share a progress update from one of Martian's AI safety grant teams. Implementing safety measures across many different AI models is a growing challenge, and this team's research introduces a novel approach: transferring mechanistic interpretability insights and safety behaviors between language models rather than analyzing each model from scratch, representing a potential step toward scaling AI safety work.
Key contributions:
- A new technique for transferring mechanistic interpretability insights across LLMs, reducing the need to analyze each model individually and potentially saving significant compute resources
- A method for transferring safety behaviors between models using steering vectors, advancing our ability to mitigate undesirable behaviors in LLMs at scale (a brief illustrative sketch of activation steering follows this list)
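
To give a flavor of what a steering vector is, here is a minimal sketch of activation steering on GPT-2 via Hugging Face transformers. This is not the grant team's code or their transfer method; the layer index, contrast prompts, and steering strength are illustrative assumptions chosen for brevity.

```python
# Minimal activation-steering sketch (illustrative only, not the grant team's method).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6  # hypothetical choice of residual-stream layer to steer

def mean_activation(text: str) -> torch.Tensor:
    """Average the chosen layer's hidden states over a prompt's tokens."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Steering vector = difference of mean activations on contrasting prompts.
steer = mean_activation("I am helpful and harmless.") - mean_activation("I am rude and harmful.")

def add_steering(module, inputs, output):
    """Forward hook: add the steering vector to the block's hidden states."""
    hidden = output[0] + 4.0 * steer  # 4.0 is an illustrative steering strength
    return (hidden,) + output[1:]

# Hook the block whose output corresponds to hidden_states[LAYER], then generate.
handle = model.transformer.h[LAYER - 1].register_forward_hook(add_steering)
prompt = tok("The user asked a tricky question, and the assistant", return_tensors="pt")
print(tok.decode(model.generate(**prompt, max_new_tokens=30, do_sample=False)[0]))
handle.remove()
```

The grant team's contribution concerns moving vectors like this one between different models; the sketch above only shows the single-model baseline that such transfer builds on.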
Read the complete progress report on LessWrong.
Interested in working on projects like this or exploring the frontiers of machine intelligence? We're hiring!