3 links tagged with all of: language-models + tokenization
Links
The article discusses the limitations of tokenization in large language models (LLMs) and argues for a shift towards more general methods that leverage compute and data, in line with the Bitter Lesson. It explores potential alternatives such as Byte Latent Transformers and examines the implications of moving beyond traditional tokenization, emphasizing the need for better modeling of natural language.
The article examines the phenomenon that shorter tokens in language models tend to receive higher likelihood than longer ones. It explores what this tendency implies for how these models process language, and how token length affects their efficiency and accuracy.
StochasTok is a novel stochastic tokenization method that enhances large language models' (LLMs') understanding of subword structure by randomly splitting tokens during training. This approach significantly improves performance on subword-level tasks such as character counting and substring identification, without the high computational costs of previous methods. StochasTok can also be applied to existing pretrained models, yielding considerable improvements with minimal changes.
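As a rough illustration of the idea described above (not the authors' implementation), the sketch below shows one way a token could be stochastically re-expressed as an equivalent pair of sub-tokens during training. The toy vocabulary, the split probability, and the helper names are illustrative assumptions.

```python
import random

# Toy vocabulary mapping token strings to ids (assumption: a real setup would
# use the model's own tokenizer vocabulary).
VOCAB = {"straw": 0, "berry": 1, "strawberry": 2, "st": 3, "raw": 4, "ber": 5, "ry": 6}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def candidate_splits(token: str):
    """Return all ways to split `token` into two in-vocabulary pieces."""
    splits = []
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in VOCAB and right in VOCAB:
            splits.append((VOCAB[left], VOCAB[right]))
    return splits

def stochastically_split(token_ids, p_split=0.1, rng=random):
    """With probability p_split, replace a token by an equivalent pair of sub-tokens.

    The output decodes to the same text as the input, so the model sees the
    same string under varying token boundaries across training steps.
    """
    out = []
    for tid in token_ids:
        splits = candidate_splits(ID_TO_TOKEN[tid])
        if splits and rng.random() < p_split:
            out.extend(rng.choice(splits))
        else:
            out.append(tid)
    return out

# Example: "strawberry" may stay whole or be re-expressed as ["straw", "berry"].
print(stochastically_split([VOCAB["strawberry"]], p_split=1.0))
```

Because every split decodes to the same text, the training data's content is unchanged; the model simply sees the same strings under many different token boundaries, which plausibly underlies the reported gains on subword-level tasks.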