H2 is a framework for training large language models (LLMs) on hyper-heterogeneous clusters of more than 1,000 chips, where diverse hardware and software environments otherwise cause severe inefficiency. It integrates DiTorch, which provides a consistent programming interface across different chips, and DiComm, which optimizes inter-chip communication, together with an adaptive pipeline parallelism strategy that delivers significant speedups over traditional homogeneous training methods. Experiments on a 100-billion-parameter LLM show a performance improvement of up to 16.37%, demonstrating the framework's effectiveness at large scales.
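To make the adaptive pipeline parallelism idea concrete, the sketch below shows one simple way a planner could assign transformer layers to heterogeneous chips in proportion to their measured throughput, so that per-stage compute time is roughly balanced. This is an illustrative assumption, not the H2 implementation; the `partition_layers` helper and the proportional-split heuristic are hypothetical.

```python
# Hypothetical sketch (not the H2 algorithm): balance pipeline stages across
# heterogeneous chips by giving faster chips proportionally more layers.
from typing import Dict, List


def partition_layers(num_layers: int, throughput: Dict[str, float]) -> Dict[str, List[int]]:
    """Assign `num_layers` model layers to devices proportionally to throughput.

    `throughput` maps a device name to its measured layers-per-second.
    """
    total = sum(throughput.values())
    assignment: Dict[str, List[int]] = {}
    start = 0
    devices = list(throughput.items())
    for i, (device, rate) in enumerate(devices):
        if i == len(devices) - 1:
            count = num_layers - start  # last device takes the remainder
        else:
            count = round(num_layers * rate / total)
        assignment[device] = list(range(start, start + count))
        start += count
    return assignment


if __name__ == "__main__":
    # Example: a 48-layer model split across three chip types with unequal speeds.
    plan = partition_layers(48, {"chip_a": 3.0, "chip_b": 2.0, "chip_c": 1.0})
    for device, layers in plan.items():
        print(device, f"{len(layers)} layers")
```

In this toy example the fastest chip receives roughly half of the layers, which mirrors the general goal of heterogeneity-aware stage assignment: equalizing per-stage latency so no chip type becomes the pipeline bottleneck.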