This paper proposes a new method for estimating the memorization capacity of language models, separating unintended memorization (information stored about a specific training set) from generalization. The study estimates that GPT-style models store roughly 3.6 bits per parameter and finds that models memorize training data until this capacity is saturated, after which further training shifts toward generalization.
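As a back-of-the-envelope illustration of what the reported ~3.6 bits/parameter figure implies, the following sketch converts a model's parameter count into an aggregate capacity estimate. The function name and example model sizes are illustrative, not from the paper, and the constant is the study's reported estimate rather than an exact value.

```python
# Illustrative sketch: total unintended-memorization capacity implied by the
# reported ~3.6 bits/parameter estimate for GPT-style models.
BITS_PER_PARAM = 3.6  # estimate reported in the study, not an exact constant

def memorization_capacity_bits(num_params: float) -> float:
    """Estimated aggregate memorization capacity in bits."""
    return num_params * BITS_PER_PARAM

if __name__ == "__main__":
    # Hypothetical model sizes chosen only for illustration.
    for n in (125e6, 1.3e9, 7e9):
        bits = memorization_capacity_bits(n)
        print(f"{n / 1e9:5.2f}B params -> ~{bits / 8 / 1e9:.2f} GB of capacity")
```

Under this estimate, a 7B-parameter model could store on the order of a few gigabytes of training-set-specific information before saturating.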
Mixture-of-Experts (MoE) architectures improve the efficiency of large language models (LLMs) by decoupling parameter count from per-token computational cost. This study introduces the Efficiency Leverage (EL) metric to quantify the computational advantage of MoE models and fits a unified scaling law that predicts EL from architectural configuration, showing that an MoE model with far fewer active parameters can match the performance of a larger dense model at a fraction of the compute.
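A minimal sketch of the Efficiency Leverage idea follows, assuming EL is defined as the ratio between the compute a dense model would need to reach a given quality level and the compute the MoE model actually uses; the paper's fitted scaling law, which predicts EL from configuration factors such as activation ratio and expert granularity, is not reproduced here, and all numbers below are placeholders.

```python
# Sketch of the Efficiency Leverage (EL) ratio under the assumption that EL
# compares compute-equivalent dense training cost against actual MoE cost.
# The scaling law that predicts EL from MoE configuration is in the paper;
# the FLOP values below are hypothetical.

def efficiency_leverage(dense_equiv_flops: float, moe_flops: float) -> float:
    """EL > 1 means the MoE reaches the same quality with less compute."""
    return dense_equiv_flops / moe_flops

# Hypothetical example: an MoE matching a dense model that would need 7x the
# training FLOPs has an efficiency leverage of 7.
print(efficiency_leverage(dense_equiv_flops=7.0e21, moe_flops=1.0e21))  # 7.0
```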