GitHub - derrickburns/generalized-kmeans-clustering: Production-ready K-Means clustering for Apache Spark with pluggable Bregman divergences (KL, Itakura-Saito, L1, etc.). 6 algorithms, 740 tests, cross-version persistence. Drop-in replacement for MLlib with mathematically correct distance functions for probability distributions, spectral data, and count data.

This GitHub repository provides a production-ready K-Means clustering implementation for Apache Spark with pluggable Bregman divergences such as KL and Itakura-Saito. It ships 6 algorithms and 740 tests, supports cross-version persistence, and is positioned as a drop-in replacement for MLlib with mathematically correct distance functions for probability distributions, spectral data, and count data. The project supports a modern DataFrame API and maintains a legacy RDD API for backward compatibility.
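To make "pluggable Bregman divergences" concrete, here is a minimal sketch in plain Python of one K-Means step where the divergence is passed in as a function. It is not the repository's API: the function names (`generalized_kl`, `itakura_saito`, `assign`, `update`) are hypothetical and chosen only for illustration, and inputs are assumed to be strictly positive vectors. The sketch relies on a standard property of Bregman divergences: the center that minimizes the total divergence from a cluster's points is their arithmetic mean, so only the assignment step changes when the divergence is swapped.

```python
import math

def squared_euclidean(x, y):
    # The classic K-Means distance; also a Bregman divergence.
    return sum((a - b) ** 2 for a, b in zip(x, y))

def generalized_kl(x, y):
    # Generalized KL divergence; assumes all entries are strictly positive.
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, y))

def itakura_saito(x, y):
    # Itakura-Saito divergence; assumes all entries are strictly positive.
    return sum(a / b - math.log(a / b) - 1 for a, b in zip(x, y))

def assign(points, centers, divergence):
    # Label each point with the index of the center of least divergence.
    return [
        min(range(len(centers)), key=lambda j: divergence(p, centers[j]))
        for p in points
    ]

def update(points, labels, centers):
    # Recompute each center as the arithmetic mean of its members.
    # An empty cluster keeps its previous center.
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centers.append(list(old))
            continue
        dim = len(old)
        new_centers.append(
            [sum(m[d] for m in members) / len(members) for d in range(dim)]
        )
    return new_centers

# Hypothetical toy data: four probability vectors and two starting centers.
points = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
centers = [[0.7, 0.3], [0.3, 0.7]]

labels = assign(points, centers, generalized_kl)
centers = update(points, labels, centers)
```

Passing `itakura_saito` or `squared_euclidean` to `assign` instead of `generalized_kl` changes the clustering geometry without touching `update`.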

Saved by hn_user_8 · Last saved October 27, 2025 · 2 min read

Tags: k-means, clustering, apache-spark
