Quit Emailing Yourself

TorchAO Quantized Models and Quantization Recipes Now Available on HuggingFace Hub

PyTorch has released natively quantized versions of models such as Phi4-mini-instruct and Qwen3, targeting both server and mobile deployment with int4 and float8 quantization. The models offer faster, more memory-efficient inference with minimal accuracy loss, and they come with recipes that users can follow to quantize their own models. Planned follow-up work includes new features and collaborations to improve quantization techniques and performance.

Saved by tldr-importer · Last saved October 29, 2025 · 2 min read

Tags: pytorch quantization, machine-learning models, deployment
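To make "int4 weight-only quantization" concrete, here is a minimal pure-Python sketch of group-wise symmetric int4 quantization: each group of weights shares one scale, and every weight is stored as a 4-bit integer code. This is not torchao's implementation; the function names, the group size of 128, and the 16-bit scale are illustrative assumptions. The sketch also works out the storage cost per weight.

```python
import random

# Illustrative sketch only: group-wise symmetric int4 weight quantization.
# Group size 128 and 16-bit scales are assumed values, not torchao defaults.

def quantize_int4_groupwise(weights, group_size=128):
    """Return int4 codes in [-8, 7] and one float scale per group."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0  # largest magnitude maps to +/-7
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales

def dequantize_int4_groupwise(codes, scales, group_size=128):
    """Reconstruct approximate weights: each code times its group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

def bits_per_weight(group_size=128, scale_bits=16):
    """Storage cost: a 4-bit code per weight plus one shared scale per group."""
    return 4 + scale_bits / group_size

random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(1024)]
codes, scales = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(codes, scales)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(f"groups={len(scales)}, max abs error={max_err:.6f}, bits/weight={bits_per_weight()}")
```

With these assumed settings, storage works out to 4 + 16/128 = 4.125 bits per weight. Smaller groups follow local weight ranges more closely and lower the rounding error, at the cost of more scale storage per weight.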

ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization

ParetoQ is a unified framework for extremely low-bit quantization of large language models that covers binary, ternary, and 2- to 4-bit quantization-aware training in a single setup. It reports state-of-the-art results at every bit width and provides a consistent basis for comparing quantization methods. Its results show that sub-4-bit (ternary, 2-bit, and 3-bit) quantization can match or surpass conventional 4-bit quantization on the trade-off between accuracy and model size. ParetoQ is integrated into the torchao library, which makes it easier to deploy low-bit models on edge devices while balancing accuracy against compression.

Saved by tldr-importer · Last saved October 29, 2025 · 6 min read

Tags: quantization, deep-learning models, performance, paretoq
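The bit widths ParetoQ compares can be illustrated with simple textbook quantizers. The pure-Python sketch below is not ParetoQ's quantizer and involves no training: it uses sign quantization with a mean-|w| scale for 1 bit, the Ternary Weight Networks threshold heuristic (0.7 times mean |w|) for ternary (about 1.58 bits), and a symmetric half-integer grid with max-|w| scaling for 2 to 4 bits. Function names and the Gaussian stand-in weights are assumptions, and the printed reconstruction error is only a rough proxy for the post-QAT accuracy-versus-size comparison the paper actually makes.

```python
import math
import random

def binary_quant(w):
    """1-bit: sign(w) times mean |w| (the MSE-optimal scale for sign quantization)."""
    alpha = sum(abs(x) for x in w) / len(w)
    return [alpha if x >= 0 else -alpha for x in w]

def ternary_quant(w, threshold_ratio=0.7):
    """~1.58-bit: values in {-alpha, 0, +alpha}, using the TWN threshold heuristic."""
    delta = threshold_ratio * sum(abs(x) for x in w) / len(w)
    kept = [abs(x) for x in w if abs(x) > delta]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return [alpha if x > delta else (-alpha if x < -delta else 0.0) for x in w]

def uniform_quant(w, bits):
    """k-bit symmetric grid: 2**bits levels at half-integer multiples of a step."""
    n = 2 ** (bits - 1)
    step = max(abs(x) for x in w) / (n - 0.5)  # outermost levels land on +/-max|w|
    out = []
    for x in w:
        idx = min(max(math.floor(x / step), -n), n - 1)  # cell index, clamped to the grid
        out.append((idx + 0.5) * step)                    # snap to the cell center
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(4096)]  # stand-in for one weight tensor
results = {
    1.0: mse(w, binary_quant(w)),
    math.log2(3): mse(w, ternary_quant(w)),
    2.0: mse(w, uniform_quant(w, 2)),
    3.0: mse(w, uniform_quant(w, 3)),
    4.0: mse(w, uniform_quant(w, 4)),
}
for bits, err in results.items():
    print(f"{bits:.2f} bits/weight -> reconstruction MSE {err:.4f}")
```

Lower error per bit is only half of the story: a model quantized to fewer bits can afford more parameters in the same memory budget, which is the size-accuracy trade-off ParetoQ evaluates after quantization-aware training.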
