1 link tagged with all of: language-models + pretraining + machine-learning
Links
This article presents a method for shaping language model capabilities during pretraining by filtering individual tokens from the training data. The authors show that token-level filtering is more effective and more efficient than document-level filtering, particularly for suppressing unwanted medical capabilities. They also introduce a new labeling methodology and show that the approach remains effective even when the token labels are noisy.
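The article's exact training objective isn't reproduced in this summary, so the sketch below rests on an assumption: a common way to realize token-level filtering is to zero out the loss on flagged tokens while keeping the rest of the document, whereas document-level filtering discards any sequence containing a flagged token. The PyTorch sketch contrasts the two; the function names and the `flagged` mask are hypothetical, standing in for whatever labeler the article uses.

```python
import torch
import torch.nn.functional as F

def token_filtered_loss(logits: torch.Tensor,
                        targets: torch.Tensor,
                        flagged: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy that skips tokens flagged by the labeler.

    logits:  (batch, seq, vocab) model outputs
    targets: (batch, seq) next-token ids
    flagged: (batch, seq) bool, True where the (possibly noisy) labeler
             marked a token as carrying the unwanted capability
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    keep = (~flagged).float()
    # Average only over kept tokens; flagged tokens contribute no gradient.
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)

def document_filtered_loss(logits: torch.Tensor,
                           targets: torch.Tensor,
                           flagged: torch.Tensor) -> torch.Tensor:
    """Document-level baseline: drop an entire sequence if any token is flagged."""
    keep_doc = ~flagged.any(dim=1)                          # (batch,) bool
    keep = keep_doc.unsqueeze(1).expand_as(targets).float() # broadcast to (batch, seq)
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```

Under this reading, the efficiency claim has a simple mechanical explanation: token filtering retains the unflagged tokens of a mixed document as training signal, while document filtering throws the whole document away.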