1 link tagged with all of: language-models + pretraining + machine-learning
Links
This article presents a method for shaping language model capabilities during pretraining by filtering individual tokens from the training data. The authors show that token-level filtering is more effective and more efficient than document-level filtering, particularly for suppressing unwanted medical capabilities. They also introduce a new labeling methodology and show that the approach remains effective even when the token labels are noisy.
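The article's exact training objective isn't reproduced in this summary, so the sketch below rests on an assumption: a common way to realize token-level filtering is to zero out the loss on flagged tokens while keeping the rest of the document, whereas document-level filtering discards any sequence containing a flagged token. The PyTorch sketch contrasts the two; the function names and the `flagged` mask are hypothetical, standing in for whatever labeler the article uses.

```python
import torch
import torch.nn.functional as F

def token_filtered_loss(logits: torch.Tensor,
                        targets: torch.Tensor,
                        flagged: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy that skips tokens flagged by the labeler.

    logits:  (batch, seq, vocab) model outputs
    targets: (batch, seq) next-token ids
    flagged: (batch, seq) bool, True where the (possibly noisy) labeler
             marked a token as carrying the unwanted capability
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    keep = (~flagged).float()
    # Average only over kept tokens; flagged tokens contribute no gradient.
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)

def document_filtered_loss(logits: torch.Tensor,
                           targets: torch.Tensor,
                           flagged: torch.Tensor) -> torch.Tensor:
    """Document-level baseline: drop an entire sequence if any token is flagged."""
    keep_doc = ~flagged.any(dim=1)                          # (batch,) bool
    keep = keep_doc.unsqueeze(1).expand_as(targets).float() # broadcast to (batch, seq)
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```

Under this reading, the efficiency claim has a simple mechanical explanation: token filtering retains the unflagged tokens of a mixed document as training signal, while document filtering throws the whole document away.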