2 min read | Saved February 14, 2026
Do you care about this?
This article discusses a method for shaping language model capabilities during pretraining by filtering tokens from the training data. The authors demonstrate that token filtering is more effective and efficient than document filtering, particularly for minimizing unwanted medical capabilities. They also introduce a new labeling methodology and show that this approach remains effective even with noisy labels.
If you do, here's more
The paper presents a novel approach to mitigating unwanted capabilities in language models during their pretraining phase. Traditional methods typically intervene after training, which leaves the underlying capability in place and makes the safeguards susceptible to circumvention. Instead, the authors filter the training data at the token level before training begins, using medical capabilities as a case study. Their findings indicate that token-level filtering is not only effective but also more efficient than filtering entire documents.
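The core mechanism can be illustrated with a minimal sketch: tokens flagged as belonging to the unwanted ("forget") domain are simply excluded from the next-token training loss, while the rest of the document is kept. Everything below (function names, the toy probabilities, and the mask itself) is illustrative, not taken from the paper; in the actual work the per-token labels come from trained classifiers.

```python
import math

def token_loss(prob):
    """Cross-entropy contribution of a single token, given the model's
    predicted probability for the correct next token."""
    return -math.log(prob)

def filtered_loss(token_probs, forget_mask):
    """Average loss over tokens NOT flagged as forget-domain.
    Flagged tokens contribute nothing, so the model is never
    trained to predict them."""
    kept = [token_loss(p) for p, drop in zip(token_probs, forget_mask) if not drop]
    return sum(kept) / len(kept) if kept else 0.0

# Example: four tokens, the third flagged as forget-domain.
probs = [0.9, 0.8, 0.1, 0.7]
mask = [False, False, True, False]
loss = filtered_loss(probs, mask)
```

Because only individual tokens are dropped, the surrounding document still contributes training signal, which is one intuition for why this can be more efficient than discarding whole documents.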
Through experiments with models of varying sizes, the authors show that token filtering becomes more effective as models scale. For larger models, filtering can slow learning on the "forget" domain by the equivalent of up to 7000 times in compute, while performance outside that domain is preserved. They also introduce a methodology for labeling tokens using sparse autoencoders and train classifiers that remain robust even in the presence of noisy labels. This approach allows precise control over the training data, curbing unwanted capabilities without sacrificing overall model performance.
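The labeling step can be sketched as thresholding a sparse-autoencoder feature that fires on forget-domain text. The feature name, threshold value, and activations below are all hypothetical stand-ins; in the paper the activations would come from an SAE trained on the model's internal representations.

```python
# Assumed cutoff for illustration only, not a value from the paper.
MEDICAL_FEATURE_THRESHOLD = 0.5

def label_tokens(feature_activations, threshold=MEDICAL_FEATURE_THRESHOLD):
    """Flag tokens whose 'medical' feature activation exceeds the
    threshold; these are the tokens the filter drops from training."""
    return [a > threshold for a in feature_activations]

# Per-token activations (illustrative numbers).
activations = [0.02, 0.91, 0.64, 0.10]
labels = label_tokens(activations)  # → [False, True, True, False]
```

A thresholded labeler like this will inevitably mislabel some tokens, which is why the authors' robustness result under noisy labels matters: the downstream filtering still works even when the labels are imperfect.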