Links
This article explores an unusual optimization in which adding "cutlass" to a CUDA kernel's name can significantly increase performance, in some cases by over 100 TFLOPS. It examines the compiler mechanics behind this behavior and its varying effects across architectures and projects, emphasizing the importance of benchmarking.
A pull request (PR) is under development to add a CUDA backend to the MLX project, aiming to improve the developer experience for both local testing and deployment to supercomputers. While the CUDA backend is still a work in progress, optimizations have already yielded significant performance improvements, and collaboration is encouraged for further development and testing across different environments, including ROCm support.
The blog post details a reverse-engineering effort of Flash Attention 4 (FA4), a new CUDA kernel optimized for Nvidia's latest architecture that achieves a roughly 20% speedup over previous versions. It walks through the kernel's architecture and asynchronous operations in a way that is accessible to software engineers without CUDA experience, offering insights into its tile-based computation and its optimizations for generative AI workloads.