The article presents a GitHub repository containing a production-ready K-Means clustering implementation for Apache Spark with pluggable Bregman divergences, including Kullback-Leibler (KL) and Itakura-Saito. The project bundles multiple clustering algorithms and a comprehensive test suite, and is positioned as a drop-in replacement for MLlib's K-Means that offers mathematically appropriate divergences for different kinds of data. It provides a modern DataFrame API and keeps a legacy RDD API for backward compatibility.
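To make the idea of pluggable Bregman divergences concrete, here is a minimal single-machine sketch in plain Python. It is not the repository's code or API: every function name below (`squared_euclidean`, `generalized_kl`, `itakura_saito`, `lloyd`) is hypothetical and chosen only for illustration, and the KL variant shown is the generalized form for positive vectors. The sketch relies on one well-known property: for any Bregman divergence measured from a point to a center, the center that minimizes the total divergence of a cluster is the arithmetic mean, so only the assignment step changes when the divergence is swapped.

```python
import math

# Hypothetical divergence functions d(point, center); inputs for the KL and
# Itakura-Saito variants are assumed to be strictly positive.
def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def generalized_kl(x, y):
    # Generalized KL divergence: sum of x*log(x/y) - x + y
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, y))

def itakura_saito(x, y):
    # Itakura-Saito divergence: sum of x/y - log(x/y) - 1
    return sum(a / b - math.log(a / b) - 1 for a, b in zip(x, y))

def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def assign(points, centers, divergence):
    # Index of the closest center for each point under the chosen divergence
    return [min(range(len(centers)), key=lambda j: divergence(p, centers[j]))
            for p in points]

def lloyd(points, centers, divergence, iterations=10):
    # Lloyd-style iteration: assign with the divergence, update with the mean
    for _ in range(iterations):
        labels = assign(points, centers, divergence)
        updated = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the previous center if a cluster ends up empty
            updated.append(mean(members) if members else old)
        centers = updated
    return centers, assign(points, centers, divergence)

points = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.2, 4.8]]
centers, labels = lloyd(points, [[1.0, 1.0], [5.0, 5.0]], generalized_kl)
```

Passing `itakura_saito` or `squared_euclidean` instead of `generalized_kl` changes only how points are assigned; the mean-based update stays the same. A distributed version would perform the assignment and the per-cluster sums across partitions, which is outside the scope of this sketch.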
k-means
clustering
apache-spark