Links
The article explains how low-bit inference techniques help optimize large AI models by reducing memory and computational demands. It discusses quantization methods, their impact on performance, and trade-offs for running AI workloads effectively on GPUs.
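To make the memory savings concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, one common low-bit technique of the kind the article covers. The function names and details are illustrative, not taken from the article.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: map float32 weights to int8 [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# int8 storage is 4x smaller than float32; rounding error is at most scale/2 per weight
```

Per-channel scales and sub-8-bit formats follow the same pattern, trading a little extra metadata for lower quantization error.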
This article discusses the unique difficulties in hardware design for large language model inference, particularly during the autoregressive Decode phase. It identifies memory and interconnect issues as primary challenges and proposes four research directions to improve performance, focusing on datacenter AI but also considering mobile applications.
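A quick back-of-the-envelope calculation shows why the decode phase stresses memory rather than compute: at batch size 1, each layer's matrix-vector product reads every weight once but performs only two floating-point operations per weight. The numbers below are illustrative assumptions, not figures from the article.

```python
def arithmetic_intensity(d_model: int = 8192, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte moved for a single GEMV (one token, fp16 weights)."""
    flops = 2 * d_model * d_model                    # multiply + add per weight
    bytes_moved = bytes_per_weight * d_model * d_model
    return flops / bytes_moved

print(arithmetic_intensity())  # 1.0 FLOP/byte
```

Since modern accelerators sustain hundreds of FLOPs per byte of memory bandwidth, a decode-phase intensity near 1 leaves the compute units mostly idle, which is the memory-bound regime the article's proposed research directions target.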
This article explores the development and significance of Google's Tensor Processing Unit (TPU), detailing its evolution from a research project to a powerful hardware accelerator for deep learning. It highlights how the TPU is specialized for neural network tasks and addresses the challenges posed by the slowing pace of traditional chip scaling.
Microsoft has unveiled Maia 200, an AI inference accelerator built on TSMC’s 3nm process, designed to enhance AI token generation efficiency. It features advanced memory systems and high-performance capabilities, making it more efficient than previous generations of AI hardware. Maia 200 will support multiple models, including OpenAI's GPT-5.2, and aims to streamline AI development across Microsoft's cloud services.