4 min read | Saved February 14, 2026
Do you care about this?
Azure's ND GB300 v6 virtual machines set a record of 1.1 million tokens per second of aggregate inference throughput on the Llama 2 70B model, surpassing the previous record by 27%. The gain comes from hardware upgrades targeted at inference workloads, and the results were independently verified by Signal65.
If you do, here's more
Azure's ND GB300 v6 virtual machines have set a new record in AI inference, achieving 1,100,000 tokens per second with the Llama 2 70B model. This surpasses the previous record of 865,000 tokens per second, set by the ND GB200 v6, by 27%. The ND GB300 v6 machines use NVIDIA's Blackwell architecture, which provides 50% more GPU memory and 16% higher Thermal Design Power (TDP) than its predecessor. Together, these hardware enhancements position Azure's offering as a leading option for large-scale AI deployments.
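The 27% figure follows directly from the two throughput numbers quoted above; a quick back-of-envelope check:

```python
# Back-of-envelope check of the quoted 27% uplift, using the two
# aggregate throughput figures cited in the article.
prev_record = 865_000     # ND GB200 v6, tokens/s
new_record = 1_100_000    # ND GB300 v6, tokens/s

uplift = new_record / prev_record - 1
print(f"Uplift: {uplift:.1%}")  # → Uplift: 27.2%
```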
The tests were conducted using 18 virtual machines in a single NVIDIA GB300 NVL72 rack, with performance metrics showing an average throughput of approximately 61,163 tokens per second per node. Notably, the ND GB300 v6 machines deliver five times the throughput per GPU of the previous ND H100 v5 machines. The benchmarks also indicate significant gains in GEMM efficiency, high-bandwidth memory throughput, and faster CPU-to-GPU transfer speeds due to improved NVLink connectivity.
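The per-node figure is consistent with simply dividing the aggregate rate across the rack. A rough cross-check, assuming the load spreads evenly over the 18 VMs (and over the 72 GPUs of an NVL72 rack, i.e. 4 GPUs per VM):

```python
# Rough cross-check of the per-node number: 18 VMs share one
# GB300 NVL72 rack, which contains 72 GPUs (4 per VM).
aggregate = 1_100_000   # tokens/s across the whole rack
vms = 18
gpus = 72

per_vm = aggregate / vms
per_gpu = aggregate / gpus
print(f"{per_vm:,.0f} tokens/s per VM")   # ≈ 61,111, close to the reported 61,163
print(f"{per_gpu:,.0f} tokens/s per GPU")
```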
For developers looking to replicate these results, the article provides a step-by-step guide on setting up the environment and running benchmarks. Key steps include cloning specific repositories, downloading necessary models and datasets, and configuring Docker containers for the tests. The use of FP4 precision during inference plays a crucial role in achieving high speeds while maintaining accuracy. Overall, this breakthrough highlights Azure's commitment to enhancing AI infrastructure for enterprise needs.
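FP4 here refers to a 4-bit floating-point format (E2M1, whose representable magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6}). A minimal sketch of what rounding weights onto that grid looks like, assuming a single per-tensor scale for illustration; the real pipeline uses finer-grained block scaling and dedicated GPU kernels:

```python
# Illustrative sketch only: round values to the E2M1 (FP4) grid.
# Real FP4 inference uses per-block scale factors and tensor-core
# kernels; this just shows the rounding step conceptually.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(x, scale=1.0):
    """Map x to the nearest representable FP4 value times `scale`."""
    mag = min(abs(x) / scale, 6.0)                     # clamp to FP4 range
    nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
    return (nearest if x >= 0 else -nearest) * scale

weights = [0.07, -1.9, 3.3, 5.1]
print([quantize_fp4(w) for w in weights])  # → [0.0, -2.0, 3.0, 6.0]
```

Because only 16 values (8 magnitudes with sign) are representable, weights and activations shrink to a quarter of FP16 size, which is what drives the memory-bandwidth and throughput gains described above.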