3 links tagged with distributed
Links
The article introduces torchcomms, a lightweight communication API for PyTorch Distributed aimed at large-scale model training. It offers a flexible framework for rapid prototyping, is designed to scale beyond 100,000 GPUs, and emphasizes fault tolerance and device-centric communication. Development is happening in the open, with community feedback guiding its evolution toward full support for next-generation distributed technologies.
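For context on the kind of operation such a communication layer provides, here is a minimal all-reduce using today's torch.distributed API. torchcomms positions itself as a lighter, object-based alternative to this global-process-group style; its own API is not shown here, and nothing below should be read as torchcomms code.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # Each process joins one implicit global process group. torchcomms'
    # stated goal is to replace this global state with explicit,
    # lightweight communicator objects (API not shown here).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    t = torch.ones(4) * (rank + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sums the tensor across all ranks
    print(f"rank {rank}: {t.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```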
The article introduces TernFS, an open-source, exabyte-scale distributed filesystem developed by XTX Markets to meet the growing storage demands of its algorithmic trading operations. TernFS is built for very large compute clusters, with redundancy, multi-region support, and a permissionless architecture, and it addresses the limitations XTX encountered with existing filesystems. The article outlines its architecture, key components, and practical implementation details.
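As a rough illustration of the metadata-sharding idea the article walks through, the toy routing function below partitions file metadata across a fixed set of shards so a client can find the shard owning a given id. The shard count and id layout here are assumptions chosen for illustration, not TernFS's actual format.

```python
# Toy metadata-shard routing, loosely inspired by the article's description
# of TernFS partitioning filesystem metadata across shards. The 256-shard
# count and "low 8 bits of the id" layout are illustrative assumptions.
NUM_SHARDS = 256


def shard_for(inode_id: int) -> int:
    # Assume the low 8 bits of an id name its metadata shard.
    return inode_id & 0xFF


def route(inode_id: int, shards: list) -> dict:
    """Return the shard-local metadata table owning this id."""
    return shards[shard_for(inode_id)]


if __name__ == "__main__":
    shards = [dict() for _ in range(NUM_SHARDS)]
    route(0x1234_0042, shards)["0x12340042"] = {"size": 4096}
    print(shard_for(0x1234_0042))  # -> 0x42 == 66
```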
The article introduces PyTorch Monarch, a new distributed programming framework that simplifies distributed machine learning workflows. By adopting a single-controller model, Monarch lets developers program a cluster as if it were a single machine, integrating with PyTorch while managing processes and actors across large GPU clusters. It aims to improve fault handling and data transfer, making distributed computing more accessible and efficient for ML applications.
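To make the single-controller idea concrete, here is a toy analogy using only the Python standard library: one driver script owns all control flow, dispatches work to a pool of worker processes, and aggregates the results. Monarch's real API (actors and meshes integrated with PyTorch) is far richer; this sketch only mirrors the programming model, not its interfaces.

```python
# Toy single-controller pattern: the driver below plays the role of
# Monarch's single controller, farming work out to workers and collecting
# results, so the cluster is programmed "as if it were a single machine."
from concurrent.futures import ProcessPoolExecutor


def shard_sum(shard: list) -> int:
    # In Monarch this would run on a remote actor; here it is a local process.
    return sum(shard)


if __name__ == "__main__":
    data = list(range(100))
    shards = [data[i::4] for i in range(4)]  # split work across 4 "workers"
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(shard_sum, shards))
    print(sum(partials))  # the controller aggregates partial sums: 4950
```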