
Why Self-Distillation Is Taking Over LLM Post-Training (w/ the Researchers Behind It)

very pumped for this one. shout out to Jonas Hübotter (ETH Zurich) and Idan Shenfeld (MIT) for blessing us with their knowledge.

I had an awesome time interviewing @IdanShenfeld and @jonashubotter from MIT and ETH Zurich about self-distillation.

self-distillation is a very promising post-training paradigm where the model acts as its own teacher by conditioning on environment feedback or demonstrations.

we cover the SDPO algorithm for reinforcement learning with rich feedback and SDFT for continual learning without forgetting, along with many applications.

we dig into how it works, why it’s simpler and faster than GRPO, and where this is already showing up in production systems.
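to make the mechanic concrete, here is a minimal sketch of the core loop as I understood it from the conversation. this is not the exact SDPO objective from the paper, just the flavor: the same weights play teacher and student, the teacher simply gets the feedback in its context, and we distill on the student's own samples. the model choice, the prompt/feedback strings, and all names below are mine.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # stand-in; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def completion_logits(context_ids, completion_ids):
    """Logits predicting each completion token, given the context."""
    full = torch.cat([context_ids, completion_ids], dim=1)
    logits = model(full).logits
    n = context_ids.shape[1]
    # logits at position i predict token i+1, so this slice covers the completion
    return logits[:, n - 1 : -1]

prompt = "Write a function that reverses a string.\n"
feedback = "Feedback: your last attempt crashed on empty strings.\n"

# 1) sample on-policy from the student (plain prompt, no feedback in context)
enc = tok(prompt, return_tensors="pt")
out = model.generate(**enc, max_new_tokens=64, do_sample=True)
completion_ids = out[:, enc.input_ids.shape[1]:]  # the student's own attempt

# 2) teacher = the SAME weights, just conditioned on the feedback as well
teacher_ctx = tok(prompt + feedback, return_tensors="pt").input_ids
with torch.no_grad():
    t_logits = completion_logits(teacher_ctx, completion_ids)

# 3) student scores the same tokens without the feedback in context
s_logits = completion_logits(enc.input_ids, completion_ids)

# 4) distill: token-level KL(teacher || student) on the student's own sample,
#    pulling the no-feedback model toward its feedback-conditioned self
loss = F.kl_div(F.log_softmax(s_logits, -1), F.log_softmax(t_logits, -1),
                log_target=True, reduction="batchmean")
loss.backward()  # then step an optimizer as usual
```

swap the feedback string for a demonstration and you get the flavor of the continual-learning setup (SDFT). either way, the nice property is that teacher and student share weights, so the teacher is never far ahead of the student.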

table of contents:

0:00 - what is self-distillation

2:50 - idan (MIT) and jonas (ETH Zurich) introduction and motivation

18:40 - different perspectives on on-policy self-distillation (presentation)

36:00 - metacognition and specificity in self-distillation

37:24 - very long, hard tasks and self-distillation

42:00 - continual learning with self-distillation (presentation)

1:16:50 - what is next in this research direction?

1:20:00 - is there any experience with subjective feedback?

1:22:50 - quality vs. quantity of feedback?

1:26:40 - in what settings would self-distillation struggle vs. GRPO?

my random thoughts on the paradigm

I think this is it. I think this will have the same impact as CoT had when it was first introduced, and given its strong performance against SFT, I would be very surprised if it is not already woven into the main closed-source models.

It is also becoming clearer to me that the traditionally very rigid boundaries between pre-, mid-, and post-training are starting to collapse a bit, and that the reality is more of a mix that depends heavily on the expected performance of the model.

on a more meta note, I think having the model be its own teacher just makes so much sense. like, I was looking at my kids and realized that each one of them has this bias of looking at the next one up in age for clues about how to do things.

they have this innate fascination with HOW the one that is a bit older does things, not even the eldest.

and I don’t think it’s just a fun quirk of nature.

the “policy” of the next-in-line kid is technically very close to wherever the base kid is at the moment. yes, they can learn from me or my wife, but we are so much more advanced that it’s a bit hard for them to understand how we do things (even though our movements are more precise).

it’s much easier to copy the one that is a bit older, even with a slightly flawed policy, and to listen to their feedback, because they can just understand them better.

it reminded me of the example idan gave: if a smaller model just imitates the big teacher model in robotics, it will hit points in the data it can’t recover from, because it will make mistakes that are simply not possible for the larger model.

anyway, this paradigm has a beautiful kernel of truth that is fundamental in my view, and it’s a very exciting angle to get up to speed on!

papers

the slides were super crisp, really cool of them to share!

if you are interested in digging more into the literature, here are a few papers worth checking out:

📌 Reinforcement Learning via Self-Distillation (SDPO): https://arxiv.org/abs/2601.20802

📌 Self-Distillation Enables Continual Learning (SDFT): https://arxiv.org/abs/2601.19897

📌 Aligning Language Models from User Interactions: https://arxiv.org/abs/2603.12273

📌 RL’s Razor: Why Online Reinforcement Learning Forgets Less: https://arxiv.org/abs/2509.04259

📌 Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs: https://arxiv.org/abs/2410.08020

enjoy my guys 🌹
