[8 September 2025] Interesting Things I Learnt This Week

1. Multi-Token Prediction Reshapes LLM Training Paradigms - This research proposes training large language models to predict multiple future tokens simultaneously, rather than just the next one, to improve sample efficiency. The model predicts the next 'n' tokens at each position in the training data using separate output heads on a shared model trunk. When used as an auxiliary task, multi-token prediction improves downstream performance for both code and natural language models without increasing training time. The benefits grow with model size and persist across multiple training epochs. Notably, the method excels in generative tasks like coding, where a 13B-parameter model outperforms comparable next-token models by solving 12% more problems on HumanEval and 17% more on MBPP. It also aids the development of induction heads and algorithmic reasoning on small tasks, and offers up to 3x faster inference with 4-token prediction, as sketched below.
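
The core mechanism is simple enough to sketch: a shared trunk encodes each position once, and n output heads each predict the token at a different future offset, with the training loss being the sum of the per-head cross-entropies. The PyTorch snippet below is my own minimal illustration under that reading, not the paper's implementation; names like `MultiTokenHead` and the toy dimensions are hypothetical.

```python
# Minimal sketch (hypothetical, not the paper's code) of multi-token prediction:
# a shared trunk produces one hidden state per position, and n independent
# linear heads each predict the token k+1 steps ahead. The loss is the sum of
# the per-head cross-entropies.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTokenHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, n_heads: int = 4):
        super().__init__()
        # One unembedding projection per future offset; the trunk is shared.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size, bias=False) for _ in range(n_heads)]
        )

    def forward(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) from the shared trunk
        # tokens: (batch, seq_len) input token ids; head k at position t
        #         is trained to predict tokens[t + k + 1]
        loss = hidden.new_zeros(())
        seq_len = hidden.size(1)
        for k, head in enumerate(self.heads):
            offset = k + 1
            if seq_len <= offset:
                break
            logits = head(hidden[:, :-offset])   # positions that have a target
            target_k = tokens[:, offset:]        # tokens offset steps ahead
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target_k.reshape(-1)
            )
        return loss


if __name__ == "__main__":
    # Toy check with random hidden states standing in for a trunk's output.
    B, T, d_model, vocab = 2, 16, 32, 100
    mtp = MultiTokenHead(d_model, vocab, n_heads=4)
    hidden = torch.randn(B, T, d_model)
    tokens = torch.randint(0, vocab, (B, T))
    print(mtp(hidden, tokens))  # scalar training loss summed over the 4 heads
```

Note that head k=0 reduces to ordinary next-token prediction, which is why the extra heads can be treated as an auxiliary loss and dropped (or reused for faster decoding) at inference time.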