
"Weight-sparse transformers have interpretable circuits" (Nov 13, 2025), openai.com/index/understanding-neural-networks-through-sparse-circuits/

There is also coverage at MIT Technology Review, archive.md/quI1s (that is, www.technologyreview.com/2025/11/13/1127914/openais-new-llm-exposes-the-secrets-of-how-ai-really-works/), which includes comments from Leo Gao and from the paper's senior author, Dan Mossing, "who leads the mechanistic interpretability team at OpenAI".

We'll look at the details of sparsity below. Here are more quotes from MIT Technology Review:

>This [sparsity] forced the model to represent features in localized clusters rather than spread them out.
>
>Their model is far slower than any LLM on the market. But it is easier to relate its neurons or groups of neurons to specific concepts and functions. “There’s a really drastic difference in how interpretable the model is,” says Gao.


As we see, they still don't expect this technique to help with capabilities, only with interpretability. Whether that expectation holds remains, of course, to be seen:

>Where will the research go next? Grigsby is not convinced the technique would scale up to larger models that have to handle a variety of more difficult tasks.    
>
>Gao and Mossing acknowledge that this is a big limitation of the model they have built so far and agree that the approach will never lead to models that match the performance of cutting-edge products like GPT-5. And yet OpenAI thinks it might be able to improve the technique enough to build a transparent model on a par with GPT-3, the firm’s breakthrough 2021 LLM. 
>
>“Maybe within a few years, we could have a fully interpretable GPT-3, so that you could go inside every single part of it and you could understand how it does every single thing,” says Gao. “If we had such a system, we would learn so much.”


***

Quoting the OpenAI blog post for context, openai.com/index/understanding-neural-networks-through-sparse-circuits/:

>Interpretability refers to methods that help us understand why a model produced a given output. There are many ways we might achieve this. 
>
>For example, reasoning models are incentivized to explain their work on the way to a final answer. Chain of thought interpretability leverages these explanations to monitor the model’s behavior. This is immediately useful: current reasoning models’ chains of thought seem to be informative with respect to concerning behaviors like deception. However, fully relying on this property is a brittle strategy, and this may break down over time.
>
>On the other hand, mechanistic interpretability, which is the focus of this work, seeks to completely reverse engineer a model’s computations. It has so far been less immediately useful, but in principle, could offer a more complete explanation of the model’s behavior. By seeking to explain model behavior at the most granular level, mechanistic interpretability can make fewer assumptions and give us more confidence. But the path from low-level details to explanations of complex behaviors is much longer and more difficult.
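
As a toy illustration (my own, not taken from the paper or the released repo) of why extreme weight sparsity makes this kind of reverse engineering tractable: when each unit has only a handful of nonzero incoming weights, the subgraph feeding any particular output can simply be enumerated. All sizes and names below are invented for the example.

```python
# Toy sketch: tracing a "circuit" backwards through a weight-sparse two-layer MLP.
# This illustrates the general idea only; it is not code from openai/circuit_sparsity.
import numpy as np

rng = np.random.default_rng(0)

def sparse_matrix(rows, cols, nnz_per_row, rng):
    """Random weight matrix with only nnz_per_row nonzero entries per row."""
    w = np.zeros((rows, cols))
    for r in range(rows):
        idx = rng.choice(cols, size=nnz_per_row, replace=False)
        w[r, idx] = rng.normal(size=nnz_per_row)
    return w

# Hypothetical sizes: 32 inputs -> 16 hidden units -> 8 outputs, 3 nonzero weights per unit.
w1 = sparse_matrix(rows=16, cols=32, nnz_per_row=3, rng=rng)  # hidden x input
w2 = sparse_matrix(rows=8, cols=16, nnz_per_row=3, rng=rng)   # output x hidden

def trace_circuit(output_unit):
    """Collect every nonzero-weight edge that can influence the chosen output unit."""
    edges = []
    for h in np.flatnonzero(w2[output_unit]):
        edges.append(("hidden", int(h), "output", output_unit, w2[output_unit, h]))
        for i in np.flatnonzero(w1[h]):
            edges.append(("input", int(i), "hidden", int(h), w1[h, i]))
    return edges

# In the dense version of this toy net the subgraph for one output would have
# 16 + 16*32 = 528 edges; under this sparsity it has at most 3 + 3*3 = 12.
for src_kind, src, dst_kind, dst, weight in trace_circuit(output_unit=0):
    print(f"{src_kind}[{src}] -> {dst_kind}[{dst}]  weight={weight:+.2f}")
```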

***

They have published the models and the code for inference and visualizations, but not the code for training: github.com/openai/circuit_sparsity/ 

The first author has a Twitter thread on this: x.com/nabla_theta/status/1989043939374924251

I have started talking to GPT-5.1 (Extended Thinking) to try to understand the details of the paper better: chatgpt.com/share/6917aedb-6f18-8010-9169-872a6431104c

If I end up understanding it better, I might add the relevant info in the comments or I might make a follow-up post.

One final remark is that the models they are using are very classical Transformers, just very sparse.
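
For concreteness, here is a minimal sketch of the kind of constraint "weight-sparse" usually refers to: after every optimizer step, keep only the largest-magnitude entries of each weight matrix and zero out the rest, so the network is trained under the sparsity constraint rather than pruned once at the end. This is my own illustration of the general family of techniques (top-k magnitude masking); the paper's exact training recipe, sparsity levels, and choice of which matrices to sparsify may differ.

```python
# Minimal sketch of top-k magnitude weight sparsity, applied after each optimizer step.
# An assumption about the general technique, not the OpenAI paper's exact procedure.
import torch

def apply_topk_weight_sparsity(module: torch.nn.Module, keep_fraction: float = 0.05):
    """Zero out all but the top keep_fraction of weights (by magnitude) in each Linear layer."""
    with torch.no_grad():
        for layer in module.modules():
            if isinstance(layer, torch.nn.Linear):
                w = layer.weight
                k = max(1, int(keep_fraction * w.numel()))
                # Threshold = magnitude of the k-th largest entry, i.e. the
                # (numel - k + 1)-th smallest of the flattened absolute values.
                threshold = w.abs().flatten().kthvalue(w.numel() - k + 1).values
                w.mul_((w.abs() >= threshold).to(w.dtype))

# Usage sketch on a tiny stand-in model: the mask is reapplied every step,
# so gradient updates happen under the sparsity constraint throughout training.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 64))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 64), torch.randn(8, 64)
for _ in range(3):
    optimizer.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    optimizer.step()
    apply_topk_weight_sparsity(model, keep_fraction=0.05)
```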

One might be able to obtain better results with different sparse models (more sophisticated and less "flat"). After all, the whole point of the Transformer architecture was to run very efficiently on GPUs; since these weight-sparse models give up much of that efficiency anyway, there is a lot of room for variations of the network architecture.