6 min read | Saved February 14, 2026
This article explores an unusual optimization where adding "cutlass" to a CUDA kernel's name can significantly increase performance, sometimes by over 100 TFLOPs. It discusses the underlying mechanics of this optimization and its varying effects on different architectures and projects, emphasizing the importance of benchmarking.
A recent observation highlights that adding "cutlass" to CUDA or Triton kernel names can significantly boost performance, sometimes by 100-150 TFLOPs. This phenomenon stems from how the CUDA compilation toolchain, particularly the ptxas optimizer, uses kernel names to make optimization decisions. For instance, simply renaming a kernel from `add` to `add_cutlass` can lead to better performance because the compiler selects and reorders instructions differently.
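In Triton, the emitted kernel's symbol is derived from the Python function's name, so the rename can be applied without touching the kernel body. The sketch below (an assumption about the mechanism, not code from the article; `rename_function` is a hypothetical helper) shows how one might produce a renamed copy of a plain Python function, which is the same idea applied to a `triton.jit`-decorated kernel:

```python
import types

def rename_function(fn, new_name):
    """Return a copy of fn whose __name__ carries the new name.
    In Triton, the kernel symbol seen by ptxas follows the Python
    function name, so this is enough to trigger the heuristic."""
    g = types.FunctionType(
        fn.__code__, fn.__globals__, name=new_name,
        argdefs=fn.__defaults__, closure=fn.__closure__,
    )
    g.__dict__.update(fn.__dict__)  # preserve any attached attributes
    return g

def add(x, y):
    return x + y

# Same body, new name containing "cutlass".
add_cutlass = rename_function(add, "add_cutlass")
print(add_cutlass.__name__)   # add_cutlass
print(add_cutlass(2, 3))      # 5
```

For a raw CUDA kernel the equivalent step is even simpler: rename the `__global__` function (or wrap it so the mangled symbol contains "cutlass") and recompile.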
Benchmarks from the article show varied results across different GPUs, including the RTX 3090 and H100. Notably, running llama.cpp on the RTX 3090 with the cutlass rename yielded a 1% performance increase. However, not all applications benefit equally; some, like Flash Attention 2, saw performance drops of up to 1% when the rename was applied. The article provides a detailed comparison of the generated assembly before and after the modification, revealing shifts in register usage and instruction selection that can influence performance and register pressure.
For those looking to implement this technique, the article offers practical examples in Triton and at the ptxas level, demonstrating how to rename kernels to leverage this behavior. It emphasizes the need for benchmarking, as results can differ significantly based on the specific hardware and application. The key takeaway is that while this naming trick can enhance performance, it's not a one-size-fits-all solution.