Quit Emailing Yourself

From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels

Learn how to build and deploy custom CUDA kernels with the kernel-builder library, which streamlines development and is designed to keep kernels scalable and efficient. The guide walks through a practical RGB-to-grayscale image conversion kernel for PyTorch, covering project structure, the CUDA code, and registration as a native PyTorch operator. It also discusses reproducibility, testing, and sharing the kernel with the community.

Saved by tldr-importer · Last saved October 29, 2025 · 6 min read

cuda · pytorch · kernel-builder · development · deployment
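
As a rough illustration of the per-pixel math an RGB-to-grayscale kernel performs, here is a minimal CPU reference sketch in plain Python. It is not taken from the guide: the function name `rgb_to_gray` is hypothetical, and the ITU-R BT.601 luma weights are an assumption, since the guide's kernel may use different coefficients.

```python
# Reference RGB-to-grayscale conversion in plain Python (illustrative only).
# The luma weights below are the ITU-R BT.601 coefficients. They are an
# assumption here; the guide's CUDA kernel may use different values.
LUMA_WEIGHTS = (0.299, 0.587, 0.114)


def rgb_to_gray(image):
    """Convert rows of (r, g, b) pixels (0-255) to rows of gray values (0-255)."""
    wr, wg, wb = LUMA_WEIGHTS
    return [
        # Weighted sum per pixel, rounded and clamped to the 8-bit range.
        [min(255, int(round(wr * r + wg * g + wb * b))) for (r, g, b) in row]
        for row in image
    ]


if __name__ == "__main__":
    # A 2x2 test image: red, green, blue, and white pixels.
    tiny = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (255, 255, 255)]]
    print(rgb_to_gray(tiny))
```

A slow but obviously correct reference like this is one common way to check a GPU kernel's output on small inputs before trusting it at scale.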

[WIP] CUDA backend by zcbenz · Pull Request #1983 · ml-explore/mlx

A work-in-progress pull request (PR) adds a CUDA backend to the MLX project, with the goal of improving the developer experience for testing locally and deploying to supercomputers. The backend is still incomplete, but optimizations so far have led to significant performance improvements. Collaboration is encouraged on further development, including ROCm support, and on testing across different environments.

Saved by tldr-importer · Last saved October 29, 2025 · 7 min read

cuda · mlx · optimization · development · collaboration
