Continued scaling of large language models (LLMs) may not be subject to the diminishing returns previously assumed: because per-step errors compound over long tasks, even small gains in single-step accuracy can translate into large gains in the length of tasks a model can complete. The study finds that LLMs struggle with longer tasks not because of reasoning limitations, but because execution errors accumulate over many steps, and it highlights model scale and strategic, step-by-step thinking as key levers for improving long-horizon performance.
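The compounding effect can be illustrated with a minimal sketch (an illustrative model, not taken from the study, assuming per-step errors are independent and any single error fails the task): under these assumptions, the achievable task horizon grows much faster than per-step accuracy does.

```python
import math

def task_success_prob(step_acc: float, n_steps: int) -> float:
    # Probability of completing all n_steps without a single execution error,
    # assuming independent errors and that any one error fails the task.
    return step_acc ** n_steps

def horizon_at(step_acc: float, target: float = 0.5) -> float:
    # Longest task length completed with probability >= target,
    # obtained by solving step_acc ** H = target for H.
    return math.log(target) / math.log(step_acc)

# Raising per-step accuracy from 99% to 99.9% (a ~0.9-point gain)
# multiplies the 50%-success horizon roughly tenfold.
for acc in (0.99, 0.999):
    print(f"step accuracy {acc}: 50%-success horizon ≈ {horizon_at(acc):.0f} steps")
```

Under this toy model, 99% per-step accuracy yields a 50%-success horizon of about 69 steps, while 99.9% yields about 693, which is why a seemingly marginal accuracy gain can unlock dramatically longer tasks.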