5 min read | Saved February 12, 2026
Do you care about this?
This article explores an unconventional method for classifying text by leveraging compression algorithms. The author demonstrates how to concatenate labeled documents, compress them, and use the compressed sizes to predict labels for new texts. While the method shows promise, it is computationally expensive and generally underperforms compared to traditional classifiers.
If you do, here's more
The article explores using data compression algorithms for text classification, a concept inspired by a chapter in "Artificial Intelligence: A Modern Approach." The author outlines a method where a compression algorithm, particularly LZW, can model the probability distribution of words in a text corpus. By compressing concatenated texts from labeled training sets, the method seeks to classify new documents based on the size increase of compressed outputs. The smaller the increase, the more similar the document is to the training texts associated with a given label.
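The compress-and-compare idea can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the article's exact code: the toy corpora and the `classify` helper are hypothetical stand-ins for the concatenated training texts per label, and gzip is used in place of LZW.

```python
import gzip

def compressed_size(text: str) -> int:
    """Length in bytes of the gzipped UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))

def classify(document: str, corpora: dict) -> str:
    """Predict the label whose training corpus grows the least
    (in compressed bytes) when the document is appended."""
    def size_increase(label: str) -> int:
        corpus = corpora[label]
        return compressed_size(corpus + " " + document) - compressed_size(corpus)
    return min(corpora, key=size_increase)

# Hypothetical toy corpora standing in for concatenated training texts.
corpora = {
    "space": "orbit rocket launch satellite orbit moon rocket launch",
    "graphics": "pixel render shader texture pixel polygon render shader",
}

print(classify("the rocket reached orbit", corpora))
```

Because the document shares tokens with the "space" corpus, the compressor can encode it with back-references, so that corpus grows the least and wins.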
The implementation uses the 20 Newsgroups dataset from scikit-learn, restricted to four categories: alt.atheism, talk.religion.misc, comp.graphics, and sci.space. After preprocessing the texts and measuring the compressed sizes, the gzip-based classifier reaches a macro F1-score of 0.749. The method is also intriguingly slow: over 5 minutes for just 1,353 test cases, versus fast baselines such as multinomial Naive Bayes, which achieves a macro F1-score of 0.88.
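For contrast, the multinomial Naive Bayes baseline mentioned above boils down to counting tokens per class and scoring with smoothed log-probabilities. The sketch below is a from-scratch, standard-library version on hypothetical toy data, not the scikit-learn pipeline the article presumably used:

```python
import math
from collections import Counter

def train_nb(docs):
    """Fit multinomial Naive Bayes from (label, text) pairs.
    Tokens are whitespace-split; counts are kept per label."""
    word_counts = {}          # label -> Counter of token counts
    doc_counts = Counter()    # label -> number of documents
    vocab = set()
    for label, text in docs:
        tokens = text.split()
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def predict_nb(model, text):
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    def log_posterior(label):
        counts = word_counts[label]
        total = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for tok in text.split():
            # add-one (Laplace) smoothed log likelihood
            score += math.log((counts[tok] + 1) / (total + len(vocab)))
        return score
    return max(doc_counts, key=log_posterior)

model = train_nb([
    ("space", "rocket orbit launch"),
    ("space", "satellite orbit moon"),
    ("graphics", "pixel shader render"),
    ("graphics", "texture polygon pixel"),
])
print(predict_nb(model, "orbit launch"))
```

Since training and prediction are just dictionary lookups and additions, it is easy to see why this baseline runs orders of magnitude faster than recompressing a corpus for every test document.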
The author tests different compression methods, including zlib, bz2, and lzma. The results vary significantly, with lzma surprisingly achieving a high accuracy of 0.897, albeit at a steep computational cost of 32 minutes. This suggests that while compression algorithms can yield interesting results in text classification, they may not be practical due to their inefficiency. The author emphasizes that this approach illustrates the connection between statistical modeling and information theory, highlighting the potential for unconventional thinking in machine learning.
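Swapping compressors is straightforward because Python ships zlib, gzip, bz2, and lzma with matching `compress` functions. The sketch below (with a hypothetical repetitive corpus, not the article's data) shows how one could measure the size increase each backend reports for the same appended document:

```python
import bz2
import gzip
import lzma
import zlib

# All four standard-library backends expose a one-shot compress().
COMPRESSORS = {
    "zlib": zlib.compress,
    "gzip": gzip.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}

# Hypothetical stand-in for a concatenated training corpus.
corpus = ("orbit rocket launch satellite " * 50).encode("utf-8")
doc = b"the rocket reached orbit"

for name, compress in COMPRESSORS.items():
    base = len(compress(corpus))
    grown = len(compress(corpus + b" " + doc))
    print(f"{name:5s} base={base:4d} bytes, delta={grown - base} bytes")
```

The trade-off the article observes shows up here in miniature: lzma tends to model the data most tightly but is by far the slowest of the four, which is consistent with its 0.897 score costing 32 minutes.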