Modern language models that use sliding window attention (SWA) struggle to access information from distant words because of information dilution and the effect of residual connections. Although such models can theoretically see a vast amount of context, these practical constraints shrink their effective memory to roughly 1,500 words. The article explores these limitations through mathematical modeling, showing how the architecture shapes information flow and retention.
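As a minimal sketch of the mechanism being discussed (my own illustration, not the article's model), the snippet below builds a sliding-window attention mask and computes the purely theoretical receptive field, which grows by at most the window size per layer; the window size and layer count used are arbitrary example values.

```python
# Sketch of sliding window attention: each token attends only to the
# previous `window` tokens, so information from further back must be
# relayed through intermediate tokens layer by layer.
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if query i may attend to key j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i                 # no attention to future tokens
    local = (i - j) < window        # only the last `window` tokens (incl. self)
    return causal & local

def theoretical_reach(num_layers: int, window: int) -> int:
    """Upper bound on how far back information can propagate:
    each layer can relay information at most `window - 1` positions."""
    return num_layers * (window - 1)

if __name__ == "__main__":
    print(sliding_window_mask(seq_len=8, window=3).astype(int))
    # With e.g. 32 layers and a 4096-token window the theoretical reach is
    # enormous; the article's point is that dilution and residual connections
    # push the *effective* memory far below this bound.
    print(theoretical_reach(num_layers=32, window=4096))
```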
Long-context large language models (LLMs) have made significant progress thanks to methods such as Rotary Position Embedding (RoPE). This paper analyzes several attention mechanisms, reveals performance limitations of RoPE, and proposes a new hybrid attention architecture that combines global and local attention spans, improving both performance and efficiency on long-context tasks.
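To make the idea of mixing attention spans concrete, here is a small sketch (my own illustration, not the paper's implementation) of a hybrid layout in which some layers use full global causal attention while the rest use local sliding-window attention; the 1-in-4 global ratio, window size, and layer count are arbitrary assumptions.

```python
# Sketch of a hybrid global/local attention layout: every `global_every`-th
# layer gets a full causal mask, the remaining layers get a local mask.
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Full (global) causal mask: attend to all previous tokens and self."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return j <= i

def local_mask(seq_len: int, window: int) -> np.ndarray:
    """Sliding-window causal mask: attend only to the last `window` tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & ((i - j) < window)

def hybrid_layer_masks(num_layers: int, seq_len: int, window: int,
                       global_every: int = 4) -> list[np.ndarray]:
    """Assign a global mask to every `global_every`-th layer, local otherwise."""
    return [
        causal_mask(seq_len) if layer % global_every == 0
        else local_mask(seq_len, window)
        for layer in range(num_layers)
    ]

if __name__ == "__main__":
    masks = hybrid_layer_masks(num_layers=8, seq_len=6, window=2)
    for idx, m in enumerate(masks):
        kind = "global" if idx % 4 == 0 else "local"
        print(f"layer {idx}: {kind} attention, {int(m.sum())} allowed query-key pairs")
```

The local layers keep compute and memory roughly linear in sequence length, while the occasional global layers preserve a direct path to distant context, which is the trade-off the paper's hybrid architecture is aiming at.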