4 links tagged with all of: tokenization + machine-learning
Links
FlexTok is a method for resampling images into 1D token sequences of flexible length, with official implementations and pre-trained models available on GitHub. The repository includes installation instructions, usage examples, and model checkpoints, and warns that checkpoints should be loaded only from trusted sources because of potential security risks during checkpoint loading. Provided code snippets and Jupyter notebooks make it straightforward to integrate the FlexTok tokenizer and VAE inference into other projects.
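The coarse-to-fine property behind flexible-length token sequences can be illustrated without the real model. The sketch below is not the FlexTok API; it uses an SVD as a stand-in for an ordered 1D representation in which decoding any prefix of the sequence yields a progressively better reconstruction.

```python
import numpy as np

# Conceptual sketch of flexible-length decoding (SVD stand-in, not FlexTok):
# components are ordered by importance, so the first k "tokens" carry the
# most reconstruction signal, and larger prefixes add detail.

def encode_ordered(image: np.ndarray):
    # SVD returns singular components sorted by singular value,
    # mimicking an ordered coarse-to-fine token sequence.
    u, s, vt = np.linalg.svd(image, full_matrices=False)
    return u, s, vt

def decode_prefix(u, s, vt, k: int) -> np.ndarray:
    # Decode using only the first k "tokens"; larger k -> finer detail.
    return (u[:, :k] * s[:k]) @ vt[:k, :]

image = np.random.default_rng(0).random((64, 64))
for k in (4, 16, 64):
    err = np.linalg.norm(image - decode_prefix(*encode_ordered(image), k))
    print(f"{k:3d} tokens -> reconstruction error {err:.3f}")
```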
The article discusses the limitations of tokenization in large language models (LLMs) and argues for a shift toward more general methods that scale with compute and data, in line with The Bitter Lesson. It explores alternatives such as Byte Latent Transformers and examines the implications of moving beyond traditional tokenization, emphasizing the need for improved modeling of natural language.
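To make the Byte Latent Transformer alternative concrete: BLT groups raw bytes into variable-size patches, placing boundaries where the next byte is hard to predict, so compute is spent where the stream is informative. The sketch below is a toy version; BLT scores entropy with a small byte-level language model, while here a unigram surprisal estimate and an arbitrary threshold stand in for it.

```python
import math
from collections import Counter

# Toy entropy-based byte patching: start a new patch at bytes that are
# hard to predict under a unigram model (a stand-in for BLT's byte-level LM).

def patch_bytes(data: bytes, threshold: float = 4.0) -> list[bytes]:
    counts = Counter(data)
    total = len(data)
    # Surprisal in bits for each byte value observed in the stream.
    surprisal = {b: -math.log2(c / total) for b, c in counts.items()}
    patches, current = [], bytearray()
    for b in data:
        # High surprisal -> patch boundary, so rare bytes start new patches.
        if current and surprisal[b] > threshold:
            patches.append(bytes(current))
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches

text = "The quick brown fox jumps over the lazy dog.".encode()
print([p.decode() for p in patch_bytes(text)])
```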
GigaTok is a novel method for scaling visual tokenizers to 3 billion parameters, addressing the reconstruction-versus-generation dilemma through semantic regularization. The repository offers a comprehensive framework for training and evaluating tokenizers, along with multiple model configurations and instructions for setup and usage, and points to further resources from the underlying research.
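A rough sketch of what a semantic-regularization objective looks like: reconstruction loss plus a term pulling the tokenizer's intermediate features toward those of a frozen pretrained encoder. The cosine form, the shapes, and the weight `lam` below are illustrative assumptions, not GigaTok's exact formulation.

```python
import numpy as np

# Sketch of semantic regularization: train the tokenizer on reconstruction
# while aligning its features with a frozen teacher encoder's features,
# so a bigger tokenizer improves generation, not just reconstruction.

def cosine_alignment(f_tok: np.ndarray, f_teacher: np.ndarray) -> float:
    # Mean cosine distance between tokenizer features and frozen teacher
    # features, computed per spatial token (rows).
    f1 = f_tok / np.linalg.norm(f_tok, axis=-1, keepdims=True)
    f2 = f_teacher / np.linalg.norm(f_teacher, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(f1 * f2, axis=-1)))

def tokenizer_loss(recon, target, f_tok, f_teacher, lam: float = 0.5) -> float:
    # Pixel reconstruction term plus the semantic alignment term.
    recon_loss = float(np.mean((recon - target) ** 2))
    return recon_loss + lam * cosine_alignment(f_tok, f_teacher)

rng = np.random.default_rng(0)
x, x_hat = rng.random((3, 8)), rng.random((3, 8))
f, f_t = rng.random((16, 32)), rng.random((16, 32))
print(f"total loss: {tokenizer_loss(x_hat, x, f, f_t):.3f}")
```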
StochasTok is a novel stochastic tokenization method that improves the subword-structure understanding of large language models (LLMs) by randomly splitting tokens during training. This approach significantly improves performance on subword-level tasks such as character counting and substring identification, without the high computational costs of previous methods. StochasTok can also be applied to existing pretrained models with minimal changes, yielding considerable improvements.
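The core operation is easy to state in code: with some probability, replace a token with a pair of vocabulary tokens that concatenate back to the same string. The toy vocabulary and split policy below are illustrative, not StochasTok's actual implementation.

```python
import random

# Toy stochastic token splitting: each token is, with probability p,
# replaced by two in-vocabulary tokens whose concatenation equals the
# original, exposing the model to subword structure during training.

def stochastok(tokens: list[str], vocab: set[str], p: float = 0.3,
               rng: random.Random | None = None) -> list[str]:
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        # All ways to cut the token into two pieces that both exist
        # in the vocabulary.
        splits = [(tok[:i], tok[i:]) for i in range(1, len(tok))
                  if tok[:i] in vocab and tok[i:] in vocab]
        if splits and rng.random() < p:
            out.extend(rng.choice(splits))  # same text, finer tokens
        else:
            out.append(tok)
    return out

vocab = {"count", "c", "ount", "co", "unt", "ing", "in", "g"}
print(stochastok(["count", "ing"], vocab, p=0.9))
```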