This article outlines how researchers trained a GPT-2 model on a carefully curated 1 billion token dataset, reaching over 90% of the performance of models trained on roughly 10 times more data. A static mixture of 50% finePDFs, 30% DCLM-baseline, and 20% FineWeb-Edu outperformed traditional curriculum learning schedules that change the data mix over training. Key takeaways include the importance of dataset quality and the performance cost of abrupt transitions between data distributions.
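To make the static-mixture idea concrete, here is a minimal sketch of fixed-proportion interleaving using the Hugging Face `datasets` library. The exact dataset repository IDs, split names, and the tokenizer choice are assumptions for illustration; the article's own preprocessing pipeline may differ.

```python
# Sketch: build a static 50/30/20 mixture of three pretraining corpora.
# Repo IDs and field names below are illustrative assumptions.
from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("HuggingFaceFW/finepdfs", split="train", streaming=True)
dclm = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

# A static mix keeps the same sampling probabilities for the entire run,
# in contrast to curriculum learning, which changes them over time.
mixed = interleave_datasets(
    [finepdfs, dclm, fineweb_edu],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",
)

# Downstream, the mixed stream would be tokenized and packed into
# fixed-length sequences until the ~1B token budget is reached.
for i, example in enumerate(mixed):
    if i >= 3:
        break
    print(example.get("text", "")[:80])
```

Because the probabilities are constant, every training batch is drawn from the same distribution, which avoids the abrupt distribution shifts that the article identifies as harmful in curriculum-style schedules.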