Modern language models that use sliding window attention (SWA) struggle to access information from distant words because of information dilution and the effects of residual connections. Although these models can theoretically see a vast amount of context, practical constraints reduce their effective memory to around 1,500 words. The article explores these limitations through mathematical modeling, showing how the architecture shapes information flow and retention.
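
To make the mechanism concrete, here is a minimal sketch of single-head sliding window attention in NumPy. It is not the article's model; the function name, window size, and dimensions are illustrative. The point is that each position mixes only the values inside its window, so information from older tokens can reach the present only by being relayed layer after layer, getting diluted at each hop.

```python
import numpy as np

def sliding_window_attention(q, k, v, window):
    """Single-head attention where each position attends only to the
    previous `window` positions (including itself)."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (seq_len, seq_len)

    # Causal sliding-window mask: position i sees positions (i-window+1)..i.
    idx = np.arange(seq_len)
    allowed = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    scores = np.where(allowed, scores, -np.inf)

    # Row-wise softmax; each row mixes at most `window` value vectors,
    # so distant tokens contribute only indirectly, via earlier layers.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy usage: 8 tokens, 4-dim head, window of 3 (all values illustrative).
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4))
k = rng.standard_normal((8, 4))
v = rng.standard_normal((8, 4))
out = sliding_window_attention(q, k, v, window=3)
print(out.shape)  # (8, 4)
```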