Paper · arxiv cs.LG · 2w ago · Needs attention

Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models

Category: academic paper / technical report

TL;DR

arXiv:2605.13935v1 (announce type: new). Abstract: Diffusion language models are a promising alternative to autoregressive models, yet post-training methods for them largely adapt reward-maximizing objectives. We identify a central failure mode in this setting we call trajectory locking: sampled reward-driven updates over-concentrate probability mass onto a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling. To address this, we propose TraFL (T

Key points

  • 01 Diffusion language models are a promising alternative to autoregressive models, yet post-training methods for them largely adapt reward-maximizing objectives.
  • 02 The authors identify a central failure mode in this setting, which they call trajectory locking: sampled reward-driven updates over-concentrate probability mass onto a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling.
  • 03 To address this, they propose TraFL (T
Why it matters

What this means for your engineering practice

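Role | What you should do
Tech Lead | When choosing between diffusion language models and autoregressive models, use this research to assess post-training stability.
Application engineer | No direct impact for now; following best practices on sampling diversity in diffusion language models is enough.
Ops / Platform | No direct impact for now; awareness is enough.
Product / Business | No direct impact for now; awareness of this research is enough.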
Source: arxiv cs.LG

The TL;DR and "Why it matters" sections on this page were generated by an LLM · Models: MiniMax-M2.7 / Claude Haiku 4.5