Paper · arXiv cs.LG · 2w ago · Worth watching
Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models
Category: academic paper / technical report
TL;DR
arXiv:2605.13935v1. Diffusion language models are a promising alternative to autoregressive models, yet post-training methods for them largely adapt reward-maximizing objectives. We identify a central failure mode in this setting we call trajectory locking: sampled reward-driven updates over-concentrate probability mass onto a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling. To address this, we propose TraFL (T…
Key points
- 01 Diffusion language models are a promising alternative to autoregressive models, yet post-training methods for them largely adapt reward-maximizing objectives.
- 02 The authors identify a central failure mode in this setting, which they call trajectory locking: sampled reward-driven updates over-concentrate probability mass onto a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling.
- 03 To address this, they propose TraFL (T…
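The coverage claim in the abstract can be made concrete with a toy calculation. The sketch below is not the paper's method or evaluation. It assumes a hypothetical set of four equally correct solutions and two made-up sampling distributions, one spread evenly and one "locked" onto a single solution, and computes the expected number of distinct solutions seen in `k` independent samples. The function name and all numbers are illustrative assumptions.

```python
def expected_distinct_covered(probs, k):
    """Expected number of distinct outcomes seen at least once in k i.i.d. samples.

    Outcome i is missed by all k samples with probability (1 - p_i) ** k,
    so it is covered with probability 1 - (1 - p_i) ** k.
    """
    return sum(1.0 - (1.0 - p) ** k for p in probs)


# Hypothetical toy distributions over four equally correct solutions.
spread = [0.25, 0.25, 0.25, 0.25]   # mass shared across all solutions
locked = [0.97, 0.01, 0.01, 0.01]   # mass concentrated on one solution

k = 8  # number of repeated samples (assumed)
spread_cov = expected_distinct_covered(spread, k)
locked_cov = expected_distinct_covered(locked, k)

print(f"spread: {spread_cov:.3f} distinct solutions expected in {k} samples")
print(f"locked: {locked_cov:.3f} distinct solutions expected in {k} samples")
```

Under these assumed numbers the concentrated distribution covers far fewer distinct solutions for the same sampling budget, even though every sample it produces is correct. That is the kind of loss in coverage the abstract attributes to trajectory locking.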
Why it matters
What this means for your engineering practice
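| Role | What you should do |
|---|---|
| Tech Lead | Weigh diffusion language models against autoregressive models when choosing a technical direction, and use this study to judge post-training stability |
| Application engineer | No direct impact for now; follow best practices on sampling diversity for diffusion language models |
| Ops / Platform | No direct impact for now; awareness is enough |
| Product / Business | No direct impact for now; keep track of progress in this line of research |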
The TL;DR and "Why it matters" sections on this page are LLM-generated · Models: MiniMax-M2.7 / Claude Haiku 4.5