Mooncake has been integrated into the PyTorch Ecosystem to enhance the performance of large language models. It offers advanced KVCache solutions that improve efficiency and scalability in model serving. The article details Mooncake’s features and deployment configurations with various inference engines.
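The core idea behind a KVCache pool is reusing attention key/value state computed for a shared token prefix instead of recomputing it per request. The toy sketch below illustrates only that prefix-reuse concept; the class and method names are hypothetical, and Mooncake's actual design is a far more elaborate, distributed system.

```python
# Toy illustration of prefix KV-cache reuse (the general idea behind
# KVCache pooling). All names here are hypothetical, not Mooncake's API.
class PrefixKVCache:
    def __init__(self):
        # Maps a token-prefix tuple to its cached "KV" blob (stand-in for
        # real attention key/value tensors).
        self.store = {}

    def lookup(self, tokens):
        """Return (length, blob) for the longest cached prefix, or (0, None)."""
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self.store:
                return end, self.store[key]
        return 0, None

    def insert(self, tokens, blob):
        self.store[tuple(tokens)] = blob


cache = PrefixKVCache()
cache.insert([1, 2, 3], "kv-for-shared-system-prompt")
# A later request sharing the first three tokens only needs to compute
# KV state for the suffix.
hit_len, blob = cache.lookup([1, 2, 3, 4, 5])
print(hit_len)  # → 3
```

In a real serving stack the blob would be GPU/CPU/SSD-resident tensor pages and the lookup would be block- or hash-based, but the cache-hit logic follows this shape.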
ExecuTorch is a tool for deploying AI models directly on devices like smartphones and microcontrollers without needing intermediate format conversions. It supports various hardware backends and simplifies the process of exporting, optimizing, and running models with familiar PyTorch APIs. This makes it easier for developers to implement on-device AI across multiple platforms.
Learn how to build and deploy custom CUDA kernels using the kernel-builder library, which streamlines the development process and ensures scalability and efficiency. The guide walks through creating a practical RGB to grayscale image conversion kernel with PyTorch, covering project structure, CUDA coding, and registration as a native PyTorch operator. It also discusses reproducibility, testing, and sharing the kernel with the community.
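For reference, the per-pixel math such an RGB-to-grayscale kernel parallelizes is a weighted sum of the three channels. The sketch below uses the common ITU-R BT.601 luminance weights as an assumption; the article's CUDA kernel may choose different coefficients:

```python
# Per-pixel luminance computation that an RGB-to-grayscale kernel applies
# to every pixel in parallel. Weights are the ITU-R BT.601 coefficients
# (an assumption; not necessarily the ones used in the article's kernel).
def rgb_to_gray(pixels):
    """pixels: iterable of (r, g, b) tuples in [0, 255]; returns gray values."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in pixels]


print(rgb_to_gray([(255, 255, 255), (0, 0, 0)]))  # → [255, 0]
```

In the CUDA version, each thread computes this expression for one pixel index derived from its block and thread IDs; registering the launcher as a native operator is what lets it be called like any other PyTorch function.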
PyTorch has released native quantized models, including Phi-4-mini-instruct and Qwen3, optimized for both server and mobile platforms using int4 and float8 quantization methods. These models offer efficient inference with minimal accuracy degradation and come with comprehensive recipes for users to apply quantization to their own models. Future updates will include new features and collaborations aimed at enhancing quantization techniques and performance.
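To make the int4 idea concrete, the sketch below shows symmetric per-tensor quantization to the signed 4-bit range [-8, 7]. It is a minimal illustration of the round-trip math, not the torchao recipes the release actually uses (those quantize per-group with packed weights):

```python
# Minimal sketch of symmetric int4 quantization: scale floats into the
# signed 4-bit range [-8, 7], then dequantize by multiplying back.
# This is illustrative only, not PyTorch's production quantization path.
def quantize_int4(values):
    """Return (int4 codes, scale) for a list of floats."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


codes, scale = quantize_int4([1.0, -0.5, 0.25, 0.0])
print(codes)          # 4-bit integer codes, each in [-8, 7]
print(dequantize(codes, scale))  # approximate reconstruction of the inputs
```

The reconstruction error introduced by this rounding is the "minimal accuracy degradation" the release refers to; float8 works analogously but keeps a floating-point code per value, trading range for precision.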