A follow-up to the previous post (about an OpenAI paper on weight-sparse transformers by Leo Gao et al.).
Looking at the paper and at the discussion chatgpt.com/share/6917aedb-6f18-8010-9169-872a6431104c one can extract quite a number of interesting tidbits.
Tidbit 1.
>Gradients and Adam moments remain dense; only the weights are hard-thresholded
That's interesting. GPT-5.1 confirms that this does imply that zeroed-out weights might become non-zero at subsequent iterations.
When I was doing similar "sparsifying by pruning during training" experiments in 2022-2023, once something had been pruned, it stayed pruned (the gradients and Adam moments got pruned as well). This is a very interesting technical difference.
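A minimal PyTorch sketch of this regime, assuming a simple global magnitude-TopK re-applied after every optimizer step (the paper's actual sparsification schedule and sparsity fraction are not reproduced here); the point is just that, with dense gradients and dense Adam moments, an entry zeroed at one step can grow back later:

```python
import torch

def topk_mask(w: torch.Tensor, k: int) -> torch.Tensor:
    # Keep only the k largest-magnitude entries; everything else is zeroed.
    threshold = torch.topk(w.abs().flatten(), k).values.min()
    return (w.abs() >= threshold).float()

# Toy layer and dense Adam state (sizes are illustrative, not from the paper).
w = torch.nn.Parameter(torch.randn(64, 64))
opt = torch.optim.Adam([w], lr=1e-3)   # moments stay dense for all entries
k = 64 * 64 // 10                      # keep ~10% of the weights

for step in range(100):
    x = torch.randn(32, 64)
    loss = (x @ w).pow(2).mean()       # stand-in for the real training loss
    opt.zero_grad()
    loss.backward()                    # gradient is dense, including at zeroed weights
    opt.step()                         # Adam updates every entry, zeroed or not
    with torch.no_grad():
        w.mul_(topk_mask(w, k))        # only the weights are hard-thresholded
# Because gradients and moments are never masked, a weight pruned at one step
# can cross the magnitude threshold at a later step and become non-zero again.
```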
***
Tidbit 2.
Another important aspect is a "bridge" technique: they connect a pre-trained dense model and the sparse one layer-wise and train that configuration so that the activity of the sparse model matches the activity of the dense one, with the goal of using it to extract circuits from the actual dense model. See sections 2.3, 3.3, A.3, and A.4, and the relevant place in the chat I am quoting from (a sketch of a single bridge follows the quoted summary):
********************
1 Start from a fixed dense model.
2 Train a new weight-sparse model of similar depth plus a series of bridges at each sublayer (before each attention and MLP):
2.1 Each bridge = encoder (dense → sparse) + decoder (sparse → dense).
2.2 Encoder: linear map + AbsTopK; decoder: linear. These can be seen as sparse autoencoders whose latent is the sparse model’s residual stream.
3 Loss: normal pretraining loss plus several bridge terms:
3.1 normalized MSE between dense and sparse activations (both directions),
3.2 KL between the dense model and “hybrid” forward passes that mix dense and sparse layers via bridges.
This gives a sparse model whose internal computations are tightly coupled to the dense model. They then do the same structured pruning on the sparse model to get circuits, and use the bridges to map perturbations of these sparse circuits back into the dense model.
********************
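A rough illustration of what one such bridge might look like, as a minimal PyTorch sketch; the names (`Bridge`, `abs_topk`, `normalized_mse`), the dimensions, and the per-row AbsTopK are my assumptions, and the KL terms over hybrid dense/sparse forward passes from point 3.2 are not implemented here:

```python
import torch
import torch.nn as nn

def abs_topk(x: torch.Tensor, k: int) -> torch.Tensor:
    # Keep the k largest-magnitude entries per row, zero out the rest.
    _, idx = x.abs().topk(k, dim=-1)
    mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
    return x * mask

class Bridge(nn.Module):
    """One bridge at one sublayer: dense residual stream <-> sparse residual stream.

    Encoder: linear map + AbsTopK; decoder: linear. Together they act like a
    sparse autoencoder whose latent is the sparse model's residual stream.
    """
    def __init__(self, d_dense: int, d_sparse: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_dense, d_sparse)
        self.decoder = nn.Linear(d_sparse, d_dense)
        self.k = k

    def dense_to_sparse(self, h_dense: torch.Tensor) -> torch.Tensor:
        return abs_topk(self.encoder(h_dense), self.k)

    def sparse_to_dense(self, h_sparse: torch.Tensor) -> torch.Tensor:
        return self.decoder(h_sparse)

def normalized_mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # One of the bridge loss terms: MSE normalized by the scale of the target.
    return (pred - target).pow(2).mean() / target.pow(2).mean().clamp_min(1e-8)
```

The full loss then combines the normal pretraining loss, these normalized-MSE terms in both directions, and the KL terms on "hybrid" forward passes that splice sparse sublayers into the dense model via the bridges.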
Section A.4 "Bridge interventions" says
>Bridges are trained to convert between residual stream locations in a dense and in a sparse model. We would like to reuse these bridges to convert single-node interventions in sparse models to “interpretable” dense interventions in dense models.
>
> ...
>
>To construct the sparse model intervention, we first subtract the activation of the channel of interest in the presented condition (e.g. double quote) from the activation in the counterfactual condition (e.g. single quote) at all tokens. We then take the outer product with the corresponding row of the bridge (scaled as described above) and construct a tensor of tokens by dense model hidden dimension. We scale this by a “steering strength” between 0.0 (no intervention) and 1.0 (fully patched).
This is what one would want to look at as a starting point when pondering possible modifications of this setup for capability boosts.
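Concretely, here is my reading of that recipe as a minimal PyTorch sketch; the bridge row and its scaling are taken as given ("as described above" in the paper), and the shapes are my assumptions:

```python
import torch

def bridge_intervention(
    act_presented: torch.Tensor,       # [n_tokens] one sparse channel, presented condition (e.g. double quote)
    act_counterfactual: torch.Tensor,  # [n_tokens] same channel, counterfactual condition (e.g. single quote)
    bridge_row: torch.Tensor,          # [d_dense] the (already scaled) bridge row for this channel
    steering_strength: float = 1.0,    # 0.0 = no intervention, 1.0 = fully patched
) -> torch.Tensor:
    # Per-token difference between the counterfactual and the presented activations.
    delta = act_counterfactual - act_presented        # [n_tokens]
    # Outer product: a tensor of tokens by dense model hidden dimension.
    patch = torch.outer(delta, bridge_row)            # [n_tokens, d_dense]
    # Scaled patch to be added into the dense model's residual stream.
    return steering_strength * patch
```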
***
Tidbit 3.
They have a whole section dedicated to fighting the inefficiency of sparse computations and to possible ways of improving the situation: Appendix B, "Systems considerations for scaling".
However, they don't seem to be aware that one can actually use Tensor Cores for unstructured sparsity, although it's not "a Tensor Core fast path":
"High Performance Unstructured SpMM Computation Using Tensor Cores", arxiv.org/abs/2408.11551
Generally speaking, NVIDIA seems to have a new effort to improve the situation with unstructured sparsity in the future, but it's not clear how serious this effort is: chatgpt.com/share/68f872a2-3e10-8010-93ee-22e7f3cf5983