
KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

2 min read | Saved February 14, 2026

kernel-evolution, optimization, deep-learning, heterogeneous-hardware, automation

Do you care about this?

This paper introduces KernelEvolve, a framework designed to automate the generation and optimization of kernels for deep learning recommendation models across various hardware platforms. It addresses challenges related to model and kernel diversity by using a graph-based search method for efficient kernel optimization. The framework has been validated on multiple NVIDIA and AMD GPUs and Meta's AI accelerators, achieving high correctness and significantly reducing development time.

If you do, here's more

KernelEvolve is a framework aimed at optimizing deep learning recommendation models (DLRMs) efficiently across various hardware architectures. It addresses significant challenges in the field: the diversity of model architectures, the variety of kernel primitives, and the differences among hardware generations. By automating kernel generation and optimization, KernelEvolve simplifies the development process, allowing for faster and more efficient training and inference.

The framework operates at multiple programming levels, from high-level languages like Triton and CuTe DSL down to low-level hardware-agnostic languages. It treats kernel optimization as a graph-based search that adapts dynamically to the runtime execution context. The search is driven by components such as a selection policy and a fitness function, which lets the same process apply across different hardware environments. KernelEvolve has been implemented and tested on production recommendation models. In validation, it achieved a 100% pass rate on 250 problems across three difficulty levels and verified 160 PyTorch ATen operators across diverse platforms.
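To make the search idea concrete, here is a minimal sketch of a graph-based search with a selection policy, a fitness function, and a termination rule. It is not KernelEvolve's implementation. Everything in it is a hypothetical stand-in chosen for illustration: the tuning space (`BLOCK_SIZES`, `NUM_WARPS`), the `Candidate` type, the toy cost model in `simulated_latency_ms`, the best-first selection policy, and the evaluation budget. A real system would compile and time each candidate kernel on the target hardware instead of calling a formula.

```python
import heapq
from dataclasses import dataclass
from typing import Iterator, Tuple

# Hypothetical tuning space for a tiled kernel (not taken from the paper).
BLOCK_SIZES = (16, 32, 64, 128, 256)
NUM_WARPS = (1, 2, 4, 8)


@dataclass(frozen=True)
class Candidate:
    """One node in the search graph: a kernel configuration."""
    block_size: int
    num_warps: int


def simulated_latency_ms(c: Candidate) -> float:
    """Toy cost model standing in for a real compile-and-benchmark step.

    Assumption for illustration: latency is lowest at
    block_size=64, num_warps=4.
    """
    return 1.0 + abs(c.block_size - 64) / 64 + abs(c.num_warps - 4) / 4


def fitness(c: Candidate, baseline_ms: float) -> float:
    """Fitness function: speedup over the baseline latency."""
    return baseline_ms / simulated_latency_ms(c)


def neighbors(c: Candidate) -> Iterator[Candidate]:
    """Graph edges: move one parameter one step up or down."""
    i = BLOCK_SIZES.index(c.block_size)
    j = NUM_WARPS.index(c.num_warps)
    for di in (-1, 1):
        if 0 <= i + di < len(BLOCK_SIZES):
            yield Candidate(BLOCK_SIZES[i + di], c.num_warps)
    for dj in (-1, 1):
        if 0 <= j + dj < len(NUM_WARPS):
            yield Candidate(c.block_size, NUM_WARPS[j + dj])


def search(start: Candidate, baseline_ms: float,
           budget: int = 20) -> Tuple[Candidate, float]:
    """Best-first search over the configuration graph.

    Selection policy: always expand the unexpanded node with the
    highest fitness. Termination rule: stop once the evaluation
    budget is reached (checked between expansions) or the frontier
    is empty.
    """
    seen = {start}
    best, best_fit = start, fitness(start, baseline_ms)
    # heapq is a min-heap, so store negated fitness; the two ints
    # break ties deterministically.
    frontier = [(-best_fit, start.block_size, start.num_warps)]
    evaluations = 1
    while frontier and evaluations < budget:
        _, block_size, num_warps = heapq.heappop(frontier)
        for cand in neighbors(Candidate(block_size, num_warps)):
            if cand in seen:
                continue
            seen.add(cand)
            evaluations += 1
            fit = fitness(cand, baseline_ms)
            if fit > best_fit:
                best, best_fit = cand, fit
            heapq.heappush(frontier, (-fit, cand.block_size, cand.num_warps))
    return best, best_fit


start = Candidate(block_size=16, num_warps=1)
baseline = simulated_latency_ms(start)
best, speedup = search(start, baseline)
print(best, round(speedup, 2))
```

Under this toy cost model the search settles on `block_size=64, num_warps=4`, a 2.5x speedup over the starting configuration. Swapping in a different selection policy (for example, one that sometimes explores lower-fitness nodes) or a different fitness function only requires changing `search` or `fitness`, which is the kind of modularity the description above suggests.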

KernelEvolve cuts kernel development time from weeks to hours while improving performance over PyTorch baselines. Its automated approach also lowers the programming barrier for new AI hardware, making in-house AI accelerators easier for developers to target. This advancement not only improves efficiency but also fosters innovation in the deployment of AI technologies.
