DataDecide is a newly released suite from Ai2 that lets researchers test how well small-scale experiments predict which pretraining datasets will produce the best language models. The findings suggest that simply ranking datasets by their results in small experiments outperforms more complex scaling-law extrapolation, and that some benchmarks can be predicted well with far less compute than a full-scale training run. The suite is intended to make model development more efficient by offering practical guidance on dataset selection and on the choice of evaluation metrics.
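The idea of ranking datasets from small experiments can be illustrated with a short sketch. This is not Ai2's code or the DataDecide API. The function names `rank_datasets` and `pairwise_agreement`, the corpus names, and all scores are hypothetical and were made up for illustration. The sketch ranks candidate datasets by a small-model benchmark score and then measures how often that ranking orders a pair of datasets the same way as a larger model does.

```python
from itertools import combinations


def rank_datasets(scores):
    """Return dataset names ordered best-first (higher score is better)."""
    return sorted(scores, key=scores.get, reverse=True)


def pairwise_agreement(small, large):
    """Fraction of dataset pairs that are ordered the same way at both scales."""
    pairs = list(combinations(small, 2))
    agree = sum(
        (small[a] > small[b]) == (large[a] > large[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Hypothetical benchmark accuracies, invented for this example only.
small_scale = {"corpus_a": 0.41, "corpus_b": 0.38, "corpus_c": 0.44, "corpus_d": 0.36}
large_scale = {"corpus_a": 0.58, "corpus_b": 0.59, "corpus_c": 0.63, "corpus_d": 0.50}

print("Small-scale ranking:", rank_datasets(small_scale))
print("Pairwise agreement with large scale:", pairwise_agreement(small_scale, large_scale))
```

A pairwise agreement close to 1.0 would mean the cheap experiments almost always pick the same winner as the expensive ones, which is the situation in which a simple ranking is enough to choose a dataset.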
data-decisions
model-development
pretraining
benchmarks
scaling-laws