6 min read | Saved February 14, 2026
Do you care about this?
This article outlines how researchers trained a GPT-2 model on a carefully curated 1-billion-token dataset, reaching over 90% of the performance of models trained on ten times as much data. They found that a static mix of 50% finePDFs, 30% DCLM-baseline, and 20% FineWeb-Edu outperformed traditional curriculum learning. Key insights include the importance of dataset quality and the dangers of abrupt transitions between data distributions.
If you do, here's more
The piece details a study in which researchers trained a GPT-2 model on only 1 billion tokens, a fraction of the roughly 10 trillion used for modern language models, aiming to match or exceed 90% of the performance of models trained on ten times the data. Through over 50 experiments, they found that a mix of 50% high-quality educational PDFs (finePDFs), 30% diverse web content (DCLM-baseline), and 20% curated educational web text (FineWeb-Edu) produced the best results. This combination achieved a validation perplexity of 27.38 and a FineWiki perplexity of 346, showing strong in-domain performance alongside generalization to unseen data.
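The static mix described above can be sketched as a weighted sampler over the three sources. The dataset names and proportions come from the article; the sampler itself is an illustrative assumption, not the researchers' actual pipeline.

```python
import random

# Mixture weights reported in the article: 50% finePDFs,
# 30% DCLM-baseline, 20% FineWeb-Edu.
MIX = {
    "finePDFs": 0.5,
    "DCLM-baseline": 0.3,
    "FineWeb-Edu": 0.2,
}

def sample_source(rng: random.Random) -> str:
    """Pick which dataset the next training document is drawn from."""
    names = list(MIX)
    weights = [MIX[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Over many draws the empirical proportions approach 50/30/20,
# so every stretch of training sees the same data distribution.
rng = random.Random(0)
counts = {name: 0 for name in MIX}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
```

Because the weights never change during training, the model never experiences the distribution shift that the article identifies as the failure mode of curriculum learning.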
A significant takeaway was the validation-generalization tradeoff. Pure synthetic data yielded excellent validation scores but failed to generalize, while diverse data showed the opposite pattern. The 50-30-20 mix struck a balance, sacrificing some validation performance for much better generalization. The researchers also debunked the effectiveness of curriculum learning, which had been thought to enhance training: shifting from synthetic to diverse data caused catastrophic forgetting and worse performance. In contrast, the static mixing approach maintained a consistent data distribution, which improved both training speed and model quality.
Ultimately, the team successfully trained the codelion/gpt-2-70m model with 70 million parameters in about eight hours, using their optimal dataset mix. This model's architecture included 12 layers and 8 attention heads, demonstrating that efficient training can be achieved without excessive data or complex strategies.
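A rough parameter budget for such a model can be checked with a back-of-the-envelope count. The 12 layers and 8 heads come from the article; the hidden size of 512 (8 heads x 64 dims) and the GPT-2 vocabulary and context sizes are assumptions for illustration.

```python
def gpt2_param_count(n_layer: int, n_embd: int,
                     vocab: int = 50257, n_ctx: int = 1024) -> int:
    """Approximate parameter count for a GPT-2-style decoder with
    weight-tied token embeddings and learned position embeddings."""
    embeddings = vocab * n_embd + n_ctx * n_embd
    # Per block: fused QKV projection (3d^2 + 3d) plus output
    # projection (d^2 + d), a 4x-wide MLP (8d^2 + 5d), and two
    # LayerNorms (4d).
    per_block = (4 * n_embd * n_embd + 4 * n_embd
                 + 8 * n_embd * n_embd + 5 * n_embd
                 + 4 * n_embd)
    final_ln = 2 * n_embd
    return embeddings + n_layer * per_block + final_ln

# With the article's 12 layers and an assumed hidden size of 512,
# the estimate lands in the mid-60M range, consistent with the
# reported ~70 million parameters.
total = gpt2_param_count(n_layer=12, n_embd=512)
```

The count illustrates why a model this size trains in hours rather than days: the per-block cost grows with the square of the hidden size, and 512 is a third of GPT-2 small's 768.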