But for the tasks this paper uses for RL training, it's all about improving the way the net is manipulating concepts. So the middle layers are where the focus should be.
Note: RL is also used for tasks that aren't about conceptual manipulation, like instruct training. I bet that their result doesn't hold for that because the delta vs the foundation model is all about the selection of words and flow of the text, not the core understanding.
https://dnhkng.github.io/posts/rys/
It feels like it should be straightforward to integrate a small network into LLMs to control the looping. Or just duplicate entire blocks of layers after the initial training.
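To make the second suggestion concrete, here is a rough sketch of block duplication, assuming the model can be treated as a plain list of layer callables. `duplicate_block` and `run_stack` are made-up names for illustration, not anything from the linked post or a real library.

```python
def duplicate_block(layers, start, end, repeats=2):
    """Return a new layer list in which layers[start:end] runs `repeats` times in a row.

    The repeated entries are the same objects, so the copies share weights.
    """
    if not 0 <= start < end <= len(layers):
        raise ValueError("need 0 <= start < end <= len(layers)")
    return layers[:start] + layers[start:end] * repeats + layers[end:]


def run_stack(x, layers):
    """Apply the layers in order, feeding each output into the next layer."""
    for layer in layers:
        x = layer(x)
    return x


# Toy stand-ins for transformer layers (hypothetical, scalar in and out).
toy_layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
looped = duplicate_block(toy_layers, 1, 2, repeats=3)  # run the middle layer 3 times
result = run_stack(1, looped)
```

Whether the repeated block still behaves sensibly without further training is exactly the open question; the sketch only shows the wiring.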
Most errors are probably responses that didn’t finish before their 3K token limit. They’ve measured how well RL is able to shorten the response to their limit.
RL post-training alters the parameters of the transformer, while your f(manifold) idea seems to suggest that a new layer on top would suffice, no need to alter the transformer itself at all.
It would be extremely handy if that were so, but I'm guessing it isn't, or it would be the prevailing approach.
Worth noting a different manifold "exists" after each transformation (e.g. layer). You only sample from the same manifold when you apply the same transformation(s).
[0] not simply
The current model architectures we use have a fixed routing of residuals per layer, from the first to the last. I'm imagining replacing this with a matrix of routing weights[0] that determines how "strong" the connection is between each Transformer layer. We still evaluate each layer "in order", but now instead of just giving the layer the last layer's residuals, it gets the sum of all prior layers times their weight in the routing matrix. Recurrent connections (i.e. output of layer 9 to input of layer 3) could be handled by doing a second pass and using the first pass's recurrent residuals as inputs. You could then "loop" the model as many times as desired per token, or even have it do parallel decoding with each token communicating with the others while also recurring on itself.
You'd probably need some kind of normalization akin to what DeepSeek did with Manifold-Constrained Hyper-Connections (mHC). Hell, mHC might also be useful in combination with this kind of layer routing, so the model could grow different recurrent loops for various bits of its thought-space.
EDIT: if anyone uses it please call it "neuralese recurrence" just to scare the AI safety bros
[0] I'm not sure how you'd initialize these weights. Maybe each row/column is a narrow gaussian centered around the prior layer, with some random or constant weighting everywhere else?
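A toy sketch of the forward-only part of this, to pin down what I mean. Assumptions: each layer is just a function from vector to vector, source 0 is the embedding, and each row of the routing matrix is normalized to sum to 1 (a crude stand-in for real normalization, not what mHC does). `init_routing` and `routed_forward` are made-up names, and the recurrent second pass is left out.

```python
import math


def init_routing(n_layers, sigma=0.5, floor=0.0):
    """Build one row of routing weights per layer.

    Row i covers sources 0..i, where source 0 is the embedding and source j
    is the output of layer j. Each row is a narrow gaussian centered on the
    immediately preceding source, plus a constant floor, normalized to sum to 1.
    """
    rows = []
    for i in range(n_layers):
        raw = [math.exp(-((j - i) ** 2) / (2 * sigma ** 2)) + floor
               for j in range(i + 1)]
        total = sum(raw)
        rows.append([w / total for w in raw])
    return rows


def routed_forward(x, layers, routing):
    """Evaluate layers in order; each layer's input is a weighted sum of all prior outputs."""
    outputs = [x]  # source 0: the embedding
    for layer, weights in zip(layers, routing):
        mixed = [sum(w * out[k] for w, out in zip(weights, outputs))
                 for k in range(len(x))]
        outputs.append(layer(mixed))
    return outputs[-1]


# Toy layers (hypothetical): elementwise maps on a list of floats.
toy_layers = [
    lambda v: [a + 1.0 for a in v],
    lambda v: [2.0 * a for a in v],
    lambda v: [a - 3.0 for a in v],
]

# One-hot routing on the previous source reproduces an ordinary sequential stack.
one_hot = [[1.0], [0.0, 1.0], [0.0, 0.0, 1.0]]
sequential = routed_forward([1.0, 2.0], toy_layers, one_hot)

# Gaussian init: mostly the previous source, with a little leakage from earlier ones.
soft = routed_forward([1.0, 2.0], toy_layers, init_routing(3, sigma=0.5, floor=0.01))
```

With one-hot rows this collapses to the usual layer-by-layer pass, so the gaussian init starts training close to the standard architecture and lets the off-diagonal weights grow from there.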
It seems that the input layers of a Transformer are necessarily going to be doing the most low-level work of syntax -> semantic augmentation, starting with things like tagging parts of speech. Similarly, the output layers are by necessity going to be concerned with mapping high-level representations back into surface-level word sequences. This leaves the middle layers to do the work of first recognizing deep enough patterns to support good-quality prediction, then doing the high-level prediction itself, which is what RL is typically going to be trying to shape.