Links
This article explores an unusual optimization: adding "cutlass" to a CUDA kernel's name can significantly increase performance, in some cases by more than 100 TFLOPS. It discusses the mechanics behind the effect and how the effect varies across architectures and projects, emphasizing the importance of benchmarking.