Quit Emailing Yourself

There is no data-generating distribution

5 min read | Saved February 14, 2026 | Copied!

machine-learning 🤖 statistics 🤖 decision-theory 🤖 population-models 🤖 education 🤖

Do you care about this?

The author reflects on teaching machine learning without relying on the concept of data-generating distributions, arguing that such distributions don’t exist in practice. Instead, the focus should be on population models and how they inform decision-making based on sample data. The article emphasizes the importance of understanding how samples relate to the populations they represent.

If you do, here's more

The author reflects on their experience teaching a graduate machine learning course, particularly focusing on the misconceptions surrounding the concept of the data-generating distribution. They argue that this idea, often central to machine learning theory, doesn't actually exist. Instead, they emphasize that machine learning is about identifying patterns in data and making predictions based on those patterns. The belief in a data-generating distribution has been a longstanding myth that the author has worked to dismantle throughout their teaching.

The article outlines various models of machine learning, emphasizing that understanding the population from which data is drawn is critical for making predictions. The author introduces the concept of "metrical determinism," which highlights that the choice of scoring system for predictions at the population level significantly influences decision-making. They explain three learning paradigms: Batch Learning, where models are evaluated based on how well they perform on the larger population; Online Learning, which involves sequential predictions and adjustments without relying on randomness; and Empiricist Learning, which focuses on decision-making within the population context. 

The author also critiques the reliance on the data-generating distribution as a simplification that doesn’t necessarily aid in understanding or applying machine learning techniques. They argue that various methods can be utilized regardless of whether data is generated by true randomness or not. By stripping away the myth of the data-generating distribution, the author aims to provide a clearer framework for thinking about machine learning in practical terms, steering the focus back to how models interact with the populations they aim to represent.

Questions about this article

No questions yet.