35 links tagged with pytorch
Links
Researchers demonstrated the use of torchft and torchtitan for training a model under extreme synthetic failure rates, achieving fault tolerance without relying on checkpoints. By employing a novel asynchronous weight transfer method, they successfully isolated failures and maintained training continuity across multiple GPU groups.
ConceptAttention is an interpretability method designed for multi-modal diffusion transformers, specifically implemented for the Flux DiT architecture using PyTorch. The article provides installation instructions and a code example for generating images and concept attention heatmaps. It also references the associated research paper for further details.
PyTorch Distributed Checkpointing (DCP) offers a customizable solution for managing model checkpoints in distributed training, allowing significant reductions in storage size through compression techniques. By implementing the zstd compression algorithm, the team achieved a 22% decrease in checkpoint sizes while optimizing performance with multi-threading. The article details the customization process and encourages developers to explore DCP's extensibility for improved efficiency in their workflows.
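As a rough illustration of the idea (not DCP's actual extension API, which the article covers), the core move is compressing serialized checkpoint bytes with zstd via the `zstandard` package:

```python
import io

import torch
import zstandard as zstd  # pip install zstandard

def save_compressed(state_dict, path, level=3):
    # serialize to memory, then zstd-compress the byte stream before writing
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    data = zstd.ZstdCompressor(level=level).compress(buf.getvalue())
    with open(path, "wb") as f:
        f.write(data)

def load_compressed(path):
    # reverse the pipeline: decompress, then deserialize
    with open(path, "rb") as f:
        raw = zstd.ZstdDecompressor().decompress(f.read())
    return torch.load(io.BytesIO(raw))
```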
PyTorch Day France on May 7 in Paris marks the inaugural event in a new international series aimed at showcasing advancements in open source AI and fostering community collaboration. Attendees will hear from industry leaders and participate in technical sessions covering a range of AI topics, alongside the GOSIM AI Paris event. Registration is free with a special code for access to all sessions.
Miloš Švaňa discusses the difficulties of setting up a PyTorch project that functions across various operating systems and hardware accelerators. He explores solutions using PEP 508 for dependency management and ultimately decides to switch from PyTorch to ONNX Runtime for easier installation and better compatibility with PyPI.
Learn how to build and deploy custom CUDA kernels using the kernel-builder library, which streamlines the development process and ensures scalability and efficiency. The guide walks through creating a practical RGB to grayscale image conversion kernel with PyTorch, covering project structure, CUDA coding, and registration as a native PyTorch operator. It also discusses reproducibility, testing, and sharing the kernel with the community.
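For reference, the math such a kernel computes is the standard BT.601 luma weighting; a pure-PyTorch baseline like the sketch below (not the kernel-builder code itself) is handy for testing a custom CUDA kernel against:

```python
import torch

def rgb_to_grayscale(img: torch.Tensor) -> torch.Tensor:
    # img: (..., 3, H, W); weighted sum over the channel dimension
    weights = torch.tensor([0.299, 0.587, 0.114], device=img.device, dtype=img.dtype)
    return (img * weights.view(-1, 1, 1)).sum(dim=-3, keepdim=True)

x = torch.rand(1, 3, 64, 64, device="cuda" if torch.cuda.is_available() else "cpu")
gray = rgb_to_grayscale(x)  # (1, 1, 64, 64)
```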
Modern techniques have emerged since the original "Attention Is All You Need" paper to optimize transformer architectures, focusing on reducing memory usage and computational costs during inference. Key advancements include Group Query Attention, Multi-head Latent Attention, and various architectural innovations that enhance performance without significantly compromising quality. These methods aim to improve the efficiency of large models in practical applications.
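A minimal sketch of Group Query Attention (illustrative, not taken from the article): the model keeps fewer KV heads than query heads and shares each KV head across a group of queries, shrinking the KV cache:

```python
import torch
import torch.nn.functional as F

def gqa(q, k, v):
    # q: (B, n_heads, T, D); k, v: (B, n_kv_heads, T, D) with n_kv_heads < n_heads
    groups = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(groups, dim=1)  # share each KV head across its query group
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

B, T, D = 2, 16, 64
out = gqa(torch.randn(B, 8, T, D), torch.randn(B, 2, T, D), torch.randn(B, 2, T, D))
```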
The 2025 PyTorch Docathon is a community-driven event focused on improving PyTorch documentation, making it accessible for newcomers and enhancing user experience. Participants can expect a collaborative environment to learn, contribute, and see the tangible impact of their work. The event runs from June 3 to June 18, with various skill-level tasks available.
The article provides an overview of a codebase for training language and vision-language models using PyTorch, highlighting installation instructions, model inference, and training setup. It details the required dependencies, configuration paths, and methods for integrating new datasets and models, while also addressing the usage of various GPU resources for efficient training and evaluation.
FlashPack is a new file format and loading mechanism for PyTorch that significantly speeds up model checkpoint loading, achieving 3-6 times faster performance than existing methods. By flattening weights into a contiguous byte stream and optimizing parallel processing between CPU and GPU, FlashPack enhances efficiency in model I/O, making it ideal for machine learning applications. Users can easily convert and integrate their models with FlashPack to benefit from faster loading times.
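A toy illustration of the flattening idea (not FlashPack's actual file format, which also records dtypes and overlaps disk reads with GPU transfer): pack all weights into one contiguous buffer and rebuild named views from recorded offsets:

```python
import torch

def pack(state_dict):
    # assumes all tensors share one dtype for simplicity
    flat = torch.cat([t.reshape(-1) for t in state_dict.values()])
    index, offset = {}, 0
    for name, t in state_dict.items():
        index[name] = (offset, t.shape)
        offset += t.numel()
    return flat, index

def unpack(flat, index):
    # rebuild named tensors as zero-copy views into the single flat buffer
    return {name: flat[off:off + shape.numel()].view(shape)
            for name, (off, shape) in index.items()}
```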
The article discusses advancements in accelerating graph learning models using PyG (PyTorch Geometric) and Torch Compile, highlighting methods that enhance performance and efficiency in processing graph data. It details practical implementations and the impact of these optimizations on machine learning tasks involving graphs.
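A hedged sketch of the basic recipe (the article's models and benchmarks may differ): wrap a small PyG model in torch.compile so its message-passing kernels get fused and specialized on first call:

```python
import torch
from torch_geometric.nn import GCNConv  # pip install torch_geometric

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, out_dim)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

model = torch.compile(GCN(16, 32, 4))  # compiled on first invocation
x, edge_index = torch.randn(100, 16), torch.randint(0, 100, (2, 400))
out = model(x, edge_index)
```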
LSNet is a new family of lightweight vision models that leverage a "See Large, Focus Small" strategy, inspired by the human visual system, to improve efficiency and performance in various vision tasks. Utilizing LS convolution, which combines large-kernel perception with small-kernel aggregation, LSNet outperforms existing lightweight networks while maintaining computational efficiency. The models were trained on ImageNet-1K, with throughput measured on an Nvidia RTX 3090.
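A loose sketch of the large-kernel/small-kernel pattern only (this is NOT the paper's LS convolution, just the general shape of the idea): a large-kernel depthwise conv gathers broad context, and a small-kernel conv aggregates locally over the context-enriched features:

```python
import torch
import torch.nn as nn

class LargeSmallBlock(nn.Module):
    def __init__(self, dim, large_k=7, small_k=3):
        super().__init__()
        # "see large": cheap large-kernel depthwise perception
        self.perceive = nn.Conv2d(dim, dim, large_k, padding=large_k // 2, groups=dim)
        # "focus small": small-kernel aggregation across channels
        self.aggregate = nn.Conv2d(dim, dim, small_k, padding=small_k // 2)

    def forward(self, x):
        return self.aggregate(self.perceive(x))

y = LargeSmallBlock(32)(torch.randn(1, 32, 56, 56))  # shape preserved
```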
PyTorch Conference 2025 will take place in San Francisco on October 22-23, featuring keynotes, technical sessions, and workshops dedicated to AI advancements. The event includes a range of summits on topics like measuring intelligence and AI infrastructure, as well as training and certification opportunities. Attendees will connect with leaders and innovators in the AI community.
The Kubeflow Trainer project has been integrated into the PyTorch ecosystem, providing a scalable and community-supported solution for running PyTorch on Kubernetes. It simplifies distributed training of AI models and fine-tuning of large language models (LLMs) while optimizing GPU utilization and supporting advanced scheduling capabilities. The integration enhances the deployment of distributed PyTorch applications and offers a streamlined experience for AI practitioners and platform admins alike.
ZClip is an adaptive gradient clipping technique for mitigating gradient spikes during LLM pre-training, utilizing Exponential Moving Averages to adjust clipping thresholds dynamically. It enhances training stability and efficiency by responding to changes in gradient norms without relying on fixed thresholds. The implementation is compatible with PyTorch and PyTorch Lightning, allowing seamless integration into training pipelines.
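A simplified sketch in the spirit of ZClip (not the reference implementation; the smoothing factor and multiplier here are illustrative guesses): track an EMA of the gradient norm and clip whenever the current norm spikes past a multiple of that running estimate:

```python
import torch

class AdaptiveClipper:
    def __init__(self, alpha=0.97, factor=2.5):
        self.alpha, self.factor, self.ema = alpha, factor, None

    def __call__(self, parameters):
        params = [p for p in parameters if p.grad is not None]
        norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
        if self.ema is None:
            self.ema = norm.item()
        threshold = self.factor * self.ema
        if norm > threshold:  # spike detected: rescale gradients in place
            torch.nn.utils.clip_grad_norm_(params, threshold)
        # update EMA with the clipped value so spikes don't pollute the statistic
        self.ema = self.alpha * self.ema + (1 - self.alpha) * min(norm.item(), threshold)
```

Called as `clipper(model.parameters())` between `loss.backward()` and `optimizer.step()`.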
Helion introduces a high-level domain-specific language that simplifies kernel development for machine learning by compiling Python-embedded code into optimized Triton code. It automates complex tasks like memory management and tuning, allowing developers to focus on algorithmic logic rather than hardware specifics. Helion's autotuning engine enhances performance portability across different hardware architectures with minimal effort.
PyTorch has evolved from an AI research framework to a foundational tool for production and generative AI, supported by major industry players. The PyTorch Foundation is expanding to encompass a broader ecosystem, addressing current challenges in AI while aiming to establish itself as the "Open Language of AI." Future initiatives will focus on improving performance, model deployment, and fostering a diverse community around AI development.
PyTorch and vLLM have been integrated to enhance generative AI applications by implementing Prefill/Decode Disaggregation, which improves inference efficiency at scale. This collaboration has optimized Meta's internal inference stack by allowing independent scaling of prefill and decode processes, resulting in better performance metrics. Key optimizations include enhanced KV cache transfer and load balancing, ultimately leading to reduced latency and increased throughput.
PyTorch Distributed Checkpointing (DCP) has integrated support for HuggingFace safetensors, allowing users to save and load checkpoints directly within the HuggingFace ecosystem without custom converters. This enhancement simplifies the user experience for machine learning engineers and improves efficiency in projects like torchtune by eliminating the need for format-specific checkpointing solutions. Future developments will focus on advanced support for distributed loading and saving of safetensors.
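For context, the target format itself is exercised by the standalone safetensors package; the DCP integration described in the post reads and writes this format directly, so single-process round-trips like the one below no longer need custom converters in distributed settings:

```python
import torch
from safetensors.torch import save_file, load_file

state = {"weight": torch.randn(4, 4), "bias": torch.zeros(4)}
save_file(state, "model.safetensors")   # plain, framework-neutral tensor file
restored = load_file("model.safetensors")
```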
ZeroGPU enables efficient use of Nvidia H200 hardware in Hugging Face Spaces by allowing users to avoid keeping GPUs locked during idle periods. The article discusses how ahead-of-time (AoT) compilation with PyTorch can significantly enhance performance, reducing processing time for generating images and videos with speedups of 1.3x to 1.8x. It also provides a guide on implementing AoT compilation in ZeroGPU Spaces, including advanced techniques like FP8 quantization.
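A minimal sketch of the underlying AoT path in recent PyTorch releases (torch.export plus AOTInductor; the article layers ZeroGPU-specific helpers and FP8 quantization on top of this):

```python
import torch

class Model(torch.nn.Module):
    def forward(self, x):
        return torch.sin(x) + torch.cos(x)

example = (torch.randn(8, 16),)
exported = torch.export.export(Model(), example)              # trace to a stable graph
package = torch._inductor.aoti_compile_and_package(exported)  # compile once, ahead of time
compiled = torch._inductor.aoti_load_package(package)         # reload without recompiling
out = compiled(*example)
```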
PyTorch has released native quantized models, including Phi4-mini-instruct and Qwen3, optimized for both server and mobile platforms using int4 and float8 quantization methods. These models offer efficient inference with minimal accuracy degradation and come with comprehensive recipes for users to apply quantization to their own models. Future updates will include new features and collaborations aimed at enhancing quantization techniques and performance.
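A hedged sketch of applying TorchAO weight-only quantization to one's own model (API as of recent torchao releases; the int4 path expects a CUDA device and bfloat16 weights):

```python
import torch
from torchao.quantization import quantize_, int4_weight_only  # pip install torchao

model = torch.nn.Sequential(torch.nn.Linear(512, 512)).to(torch.bfloat16).cuda()
quantize_(model, int4_weight_only())  # swaps Linear weights to int4 in place
```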
The article introduces the PyTorch Native Agentic Stack, a new framework designed to enhance the development of AI applications by providing a more efficient and integrated approach to leveraging PyTorch's capabilities. It emphasizes the stack's ability to simplify the implementation of agent-based systems and improve overall performance in machine learning tasks.
UCGM is an official PyTorch implementation that provides a unified framework for training and sampling continuous generative models, such as diffusion and flow-matching models. It enables significant acceleration of sampling processes and efficient tuning of pre-trained models, achieving impressive FID scores across various datasets and resolutions. The framework supports diverse architectures and offers tools for both training and evaluating generative models.
PyTorch Conference 2025 will take place in San Francisco from October 22-23, featuring keynotes, workshops, and technical sessions focused on advancements in AI. The event includes co-located summits and the launch of PyTorch training and certification, aimed at connecting AI innovators and practitioners. Session recordings and presentation slides will be available for attendees to review after the conference.
Monarch is a distributed programming framework for PyTorch that utilizes scalable actor messaging and features such as fault tolerance, point-to-point RDMA transfers, and support for distributed tensors. The framework is currently in experimental development, and users are encouraged to report bugs and contribute to its improvement. Installation requires specific dependencies and can be set up on various operating systems, with examples provided to guide users in utilizing its APIs effectively.
PyTorch and vLLM are increasingly integrated to enhance generative AI applications, providing optimized performance and support for various hardware types. Key features include torch.compile for model optimization, TorchAO for quantization, and FlexAttention for custom attention patterns, all aimed at streamlining the deployment of advanced models. Collaborative efforts are focused on improving large-scale inference and post-training processes for AI systems.
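Of the features named above, FlexAttention is the most self-contained to demonstrate; a small example of its score_mod hook (requires a recent PyTorch, shown eagerly here; in practice it is wrapped in torch.compile for speed, and a distance-based bias stands in for whatever custom pattern a model needs):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def rel_bias(score, b, h, q_idx, kv_idx):
    # penalize attention scores by query/key distance
    return score - 0.1 * (q_idx - kv_idx).abs()

q = k = v = torch.randn(1, 4, 128, 64)
out = flex_attention(q, k, v, score_mod=rel_bias)
```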
The article provides detailed information about registration rates and procedures for the upcoming PyTorch Conference, including deadlines for various attendee categories and special discounts for groups and small businesses. It also outlines the refund policy and options for substitutions and certificate downloads post-event.
The article discusses the implementation of Andrej Karpathy's original recurrent neural network (RNN) code using PyTorch, emphasizing hands-on coding to understand RNNs better. It also highlights the differences in dataset formatting for training RNNs compared to transformer-based language models. Future posts will delve deeper into the author's personal implementations of RNNs.
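A minimal character-level RNN in the spirit of the post (an illustrative sketch, not the author's code): embed characters, run them through `nn.RNN`, and project to next-character logits at every position:

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.RNN(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, h=None):
        x, h = self.rnn(self.embed(tokens), h)
        return self.head(x), h  # logits per position, plus carried hidden state

model = CharRNN(vocab_size=65)
logits, h = model(torch.randint(0, 65, (1, 32)))
```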
The article discusses the author's experience of deploying the DeepSeek-OCR model on an NVIDIA Spark using Claude Code, emphasizing the challenges faced with compatibility and dependencies in running a PyTorch CUDA model. The author details the process of setting up the environment, troubleshooting issues, and successfully executing OCR on an image after overcoming obstacles related to GPU capabilities and software versions.
The article discusses the competitive landscape of machine learning frameworks in 2019, highlighting the shift from TensorFlow to PyTorch among researchers. It presents data showing PyTorch's growing dominance in academic publications while TensorFlow remains prevalent in industry applications. The author suggests that researchers' preference for PyTorch's simplicity and API design puts TensorFlow's future in research at risk.
The article introduces "create-llm," a CLI tool designed to quickly scaffold production-ready PyTorch training projects for language models. It offers various templates tailored for different project scopes and includes essential features like data preprocessing, tokenizer training, and deployment tools, enabling users to train their own language models efficiently.
The article introduces torchcomms, a lightweight communication API designed for PyTorch Distributed, aimed at enhancing large-scale model training. It offers a flexible framework for rapid prototyping, supports scaling to over 100,000 GPUs, and emphasizes fault tolerance and device-centric communication. The development process is open to community feedback as it evolves towards comprehensive support for next-generation distributed technologies.
The article discusses a challenging bug encountered while using PyTorch, which caused training loss to plateau due to a GPU kernel issue on the Apple Silicon MPS backend. After extensive debugging and investigation, the author uncovered the underlying problem related to non-contiguous memory layouts, ultimately leading to insights about PyTorch internals and the importance of understanding framework details in troubleshooting. The article serves as a guide for others who may face similar issues, offering a thorough walkthrough of the debugging process.
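The kind of contiguity check that helps isolate such bugs: ops like transpose return views whose strides no longer match a compact row-major layout, which some backend kernels mishandle, and `.contiguous()` materializes a compact copy as a workaround:

```python
import torch

x = torch.randn(4, 8)
y = x.t()                  # a view with swapped strides, same storage
print(y.is_contiguous())   # False
z = y.contiguous()         # compact copy; safe to hand to layout-sensitive kernels
print(z.is_contiguous())   # True
```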
The article discusses a method for visualizing high-dimensional tensors by representing them as matrices of matrices, which helps in identifying the dimensions more clearly. The author demonstrates this technique with examples of tensors from 0D to 5D, explaining how to stack lower-dimensional matrices both horizontally and vertically to maintain clarity. Additionally, the article touches on the fractal nature of this representation and provides a knowledge check on splitting tensors using PyTorch functions.
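A hedged sketch of the "matrix of matrices" view using permute/reshape and chunk (the article's own knowledge check covers PyTorch's splitting functions): a 4-D tensor prints as an outer grid of inner 2-D blocks:

```python
import torch

t = torch.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)  # (outer_r, outer_c, r, c)
grid = t.permute(0, 2, 1, 3).reshape(2 * 4, 3 * 5)   # 2x3 grid of 4x5 blocks
rows = torch.chunk(grid, 2, dim=0)                   # split back into block-rows
blocks = [block for row in t for block in row]       # the six 4x5 blocks, listed out
```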
The article introduces PyTorch Monarch, a new distributed programming framework designed to simplify the complexity of distributed machine learning workflows. By adopting a single controller model, Monarch allows developers to program clusters as if they were single machines, seamlessly integrating with PyTorch while managing processes and actors efficiently across large GPU clusters. It aims to enhance fault handling and data transfer, making distributed computing more accessible and efficient for ML applications.