{"id":12,"url":"https://pm.philipcastiglione.com/papers/12.json","title":"Scaling Laws for Neural Language Models","read":false,"authors":"Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei","year":2020,"auto_summary":"The paper \"Scaling Laws for Neural Language Models\" by Kaplan et al. investigates how the performance of neural language models, specifically Transformers, scales with model size, dataset size, and compute resources. The authors find that the cross-entropy loss of language models follows predictable power-law relationships with these factors, allowing compute budgets to be allocated optimally.\n\nKey findings include:\n1. **Power-Law Scaling**: The performance of language models improves as a power-law function of model size (number of parameters), dataset size (number of tokens), and compute used for training. These relationships hold over several orders of magnitude.\n2. **Optimal Compute Allocation**: Larger models are more sample-efficient, suggesting that the most compute-efficient training involves using large models with modest data and stopping training well before full convergence.\n3. **Universality of Overfitting**: Overfitting depends predictably on the ratio of model size to dataset size. Increasing model size requires a less than proportional increase in data to avoid overfitting.\n4. **Training Dynamics**: Training curves follow predictable power laws, and early training dynamics can be extrapolated to predict the performance that longer training would achieve.\n5. **Transfer Performance**: Models transfer well to different text distributions, incurring only a constant offset in loss, indicating that improvements in training performance carry over to transfer performance.\n6. **Critical Batch Size**: The critical batch size scales roughly as a power of the loss, and training at this batch size optimizes the trade-off between training time and compute efficiency.\n\nThe authors conclude that larger models trained with optimal compute allocation will continue to outperform smaller ones, and they suggest that the observed scaling laws may apply to other generative modeling tasks. They also note potential contradictions among the scaling laws at extreme scales, which might indicate limits to current modeling approaches. The paper provides a framework for understanding and predicting the performance of language models based on their scale, offering insights for future research and development in AI language modeling.","notes":{"id":12,"name":"notes","body":null,"record_type":"Paper","record_id":12,"created_at":"2024-12-10T04:32:30.046Z","updated_at":"2024-12-10T04:32:30.046Z"},"created_at":"2024-12-10T04:32:14.400Z","updated_at":"2024-12-10T04:32:35.079Z"}