Links
This article explores an unusual optimization in which adding "cutlass" to a CUDA kernel's name can significantly increase performance, in some cases by over 100 TFLOPS. It examines the compiler mechanics behind this behavior and its varying effects across architectures and projects, emphasizing the importance of benchmarking.
A pull request (PR) is under development to add a CUDA backend to the MLX project, aiming to improve the developer experience for both local testing and deployment to supercomputers. While the CUDA backend is still a work in progress, optimizations have already yielded significant performance improvements, and collaboration is encouraged for further development and testing across different environments, including ROCm support.
The blog post details a reverse-engineering effort of Flash Attention 4 (FA4), a new CUDA kernel optimized for Nvidia's latest architecture that achieves a roughly 20% speedup over previous versions. It walks through the kernel's architecture and asynchronous operations in a way that is accessible to software engineers without CUDA experience, offering insights into its tile-based computation and its optimizations for generative AI workloads.