This article investigates why transformer models struggle with multi-digit multiplication despite their broader capabilities. By reverse-engineering a trained model, the authors find that although the architecture can represent the long-range dependencies the task requires, standard training converges to a local optimum that fails to capture them. They suggest that adding an auxiliary loss can help the model learn the task effectively.
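As a rough illustration of what "adding an auxiliary loss" could look like in practice, the sketch below combines the main digit-prediction loss with an auxiliary loss on an assumed intermediate signal (here, carry bits). This is a minimal, hypothetical example: the model class `TinyMultiplier`, the `carry_head`, and the `aux_weight` factor are illustrative assumptions and not the paper's exact formulation.

```python
# Hypothetical sketch (PyTorch): main digit loss + auxiliary loss on an
# assumed intermediate signal (carry bits). Shapes and names are illustrative.
import torch
import torch.nn as nn

class TinyMultiplier(nn.Module):
    """Toy transformer that predicts answer digits and, via an auxiliary head,
    a carry bit at each position (an assumed intermediate supervision target)."""
    def __init__(self, vocab=12, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.digit_head = nn.Linear(d_model, vocab)  # main task: answer digits
        self.carry_head = nn.Linear(d_model, 2)      # auxiliary task: carry bit

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))
        return self.digit_head(h), self.carry_head(h)

def combined_loss(digit_logits, carry_logits, digit_targets, carry_targets,
                  aux_weight=0.5):
    """Cross-entropy on answer digits plus a weighted auxiliary cross-entropy
    on the intermediate carry targets."""
    ce = nn.functional.cross_entropy
    main = ce(digit_logits.flatten(0, 1), digit_targets.flatten())
    aux = ce(carry_logits.flatten(0, 1), carry_targets.flatten())
    return main + aux_weight * aux

# Usage sketch with random data standing in for tokenized multiplication problems.
model = TinyMultiplier()
tokens = torch.randint(0, 12, (8, 16))
digit_targets = torch.randint(0, 12, (8, 16))
carry_targets = torch.randint(0, 2, (8, 16))
digit_logits, carry_logits = model(tokens)
loss = combined_loss(digit_logits, carry_logits, digit_targets, carry_targets)
loss.backward()
```

The intuition, following the article's finding, is that supervising an intermediate quantity gives the model a gradient signal toward the long-range structure it can represent but would not otherwise learn from the final answer alone.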