2 links tagged with all of: performance + inference + gpu
Links
Azure's ND GB300 v6 virtual machines set a record aggregate inference throughput of 1.1 million tokens per second on the Llama 2 70B model, a 27% improvement over the previous record. The VMs pair hardware upgrades with software optimizations targeted at inference workloads, and the results were independently verified by Signal65.
GPUs are critical for high-performance computing, particularly for neural network inference workloads, but achieving high GPU utilization is challenging. This guide defines three distinct metrics of GPU utilization—GPU allocation utilization, GPU kernel utilization, and model FLOP/s utilization (MFU)—and discusses strategies for improving each. Modal's platform focuses on raising allocation and kernel utilization, helping users get better performance per dollar from their GPUs.
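The third metric above, model FLOP/s utilization, is commonly estimated from model size and decoding throughput. A minimal sketch, assuming the standard ~2 FLOPs-per-parameter-per-token approximation for a decoder-only transformer forward pass; the function name and the peak-throughput figure below are illustrative assumptions, not values from the linked guide:

```python
def mfu(params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Model FLOP/s utilization: achieved FLOP/s divided by hardware peak.

    A decoder-only transformer's forward pass costs roughly 2 * params
    FLOPs per generated token, so achieved FLOP/s is approximately
    2 * params * tokens_per_sec.
    """
    return (2 * params * tokens_per_sec) / peak_flops

# Example (illustrative numbers): a 70B-parameter model decoding
# 1,000 tokens/s on a GPU with a nominal 989 TFLOP/s dense BF16 peak.
print(f"MFU: {mfu(70e9, 1_000, 989e12):.1%}")  # → MFU: 14.2%
```

MFU is usually well below 100% for latency-bound decoding, since memory bandwidth rather than compute tends to be the bottleneck; that gap is why the guide treats it separately from allocation and kernel utilization.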