This article outlines how researchers trained a GPT-2 model on a carefully curated 1 billion token dataset, reaching over 90% of the performance of models trained on roughly 10 times more data. A static mixture of 50% finePDFs, 30% DCLM-baseline, and 20% FineWeb-Edu outperformed traditional curriculum learning schedules that change the data mix over training. Key takeaways include the importance of dataset quality and the performance cost of abrupt transitions between data distributions.
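To make the static-mixture idea concrete, here is a minimal sketch of fixed-proportion interleaving using the Hugging Face `datasets` library. The exact dataset repository IDs, split names, and the tokenizer choice are assumptions for illustration; the article's own preprocessing pipeline may differ.

```python
# Sketch: build a static 50/30/20 mixture of three pretraining corpora.
# Repo IDs and field names below are illustrative assumptions.
from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("HuggingFaceFW/finepdfs", split="train", streaming=True)
dclm = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

# A static mix keeps the same sampling probabilities for the entire run,
# in contrast to curriculum learning, which changes them over time.
mixed = interleave_datasets(
    [finepdfs, dclm, fineweb_edu],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",
)

# Downstream, the mixed stream would be tokenized and packed into
# fixed-length sequences until the ~1B token budget is reached.
for i, example in enumerate(mixed):
    if i >= 3:
        break
    print(example.get("text", "")[:80])
```

Because the probabilities are constant, every training batch is drawn from the same distribution, which avoids the abrupt distribution shifts that the article identifies as harmful in curriculum-style schedules.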