The article surveys the competitive landscape of machine learning frameworks in 2019, focusing on researchers' shift from TensorFlow to PyTorch. It presents data showing PyTorch's growing dominance in academic publications while TensorFlow remains prevalent in industry applications. The author suggests that PyTorch's simplicity, its API design, and the research community's preference for it may erode TensorFlow's position in research.
The article recounts a hard-to-diagnose bug encountered while using PyTorch: training loss plateaued because of a GPU kernel issue in the MPS backend on Apple Silicon. After extensive debugging, the author traced the underlying problem to non-contiguous memory layouts, and draws lessons about PyTorch internals and the importance of understanding framework details when troubleshooting. The article walks through the debugging process in detail, as a guide for others who may face similar issues.
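
To make the term "non-contiguous memory layout" concrete, here is a minimal pure-Python sketch of how a strided array maps indices onto flat storage. It is an illustration only, not PyTorch's implementation and not a reproduction of the bug; the helper names `contiguous_strides`, `is_contiguous`, and `read` are hypothetical, and the contiguity check is simplified (it does not special-case dimensions of size 1).

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a densely packed array of this shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """True when the strides match the dense row-major layout for the shape."""
    return tuple(strides) == contiguous_strides(shape)


def read(storage, strides, index):
    """Look up a multi-dimensional index in flat storage through the strides."""
    offset = sum(i * s for i, s in zip(index, strides))
    return storage[offset]


# A 2x3 matrix stored row by row in one flat buffer.
storage = [0, 1, 2, 3, 4, 5]
shape = (2, 3)
strides = contiguous_strides(shape)

# Transposing swaps shape and strides; the buffer itself is untouched,
# so the transposed view is no longer laid out densely in row-major order.
t_shape = shape[::-1]
t_strides = strides[::-1]

print(is_contiguous(shape, strides))      # original view
print(is_contiguous(t_shape, t_strides))  # transposed view
```

In PyTorch the corresponding information is exposed through `Tensor.stride()` and `Tensor.is_contiguous()`, and `Tensor.contiguous()` returns a densely packed copy when the layout is not already contiguous. A kernel that assumes dense storage but receives a strided view would read or write the wrong elements, which is the general class of problem the summary describes.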