6 min read | Saved February 14, 2026
Do you care about this?
This article discusses Grab's approach to optimizing CPU provisioning for Flink applications using machine learning. It highlights the limitations of reactive autoscaling and proposes a predictive model that forecasts workload demands to improve resource allocation and reduce inefficiencies.
If you do, here's more
Grab is experiencing rapid growth in stream-processing applications using Apache Flink, with a notable 2.5 times increase over the past year. As more users from diverse backgrounds create Flink pipelines, they often struggle with resource configuration, leading to over-provisioning and wasted resources. The internal Flink platform team is focusing on efficient CPU provisioning for TaskManagers, especially for applications sourcing data from systems like Kafka. These workloads present opportunities for cost savings due to their predictable seasonal patterns.
Grab's initial approach relied on Flink's Adaptive Scheduler in Reactive Mode, which used Kubernetes' Horizontal Pod Autoscaler (HPA) to scale TaskManagers based on metrics such as CPU usage. This reactive setup caused "restart spikes": each rescaling restarted the job, producing temporary surges in CPU usage and consumer latency. For instance, during peak times CPU usage could jump from 0.5 to 2.5 cores, and consumer latency could spike from under a second to several minutes. Because the surge induced by a restart could itself trigger further scaling, the cycle sometimes spiraled out of control, leading to repeated restarts.
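The reactive setup described above is typically expressed as a standard Kubernetes HPA manifest. The sketch below is illustrative only: the deployment name, replica bounds, and 70% utilization target are assumptions for the example, not Grab's actual configuration.

```yaml
# Illustrative HPA for a Flink TaskManager deployment.
# All names and thresholds are hypothetical, not Grab's real values.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flink-taskmanager-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flink-taskmanager
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

In Reactive Mode, every replica change forces the Flink job to restart and redistribute work, which is what turns an ordinary scale-out event into the restart spikes the article describes.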
To mitigate these issues, the proposed solution emphasizes predictive autoscaling rather than reactive measures. By focusing on vertical scaling, the team aims to address the limitations of fixed parallelism in Kafka connectors. The new approach involves calculating CPU requirements based on anticipated workload changes before they occur, which should prevent the artificial workload increases that current reactive methods create. This predictive model aims for accurate and consistent resource allocation, reducing the trial-and-error process users currently face.
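The predictive idea above can be sketched in a few lines of Python. This is a minimal illustration, not Grab's model: it assumes a naive seasonal baseline (average rate at the same hour of day) stands in for the ML forecast, and the function names, the cores-per-1k-messages cost factor, and the 1.2× headroom are all hypothetical.

```python
import math
from collections import defaultdict

def seasonal_forecast(history, hour):
    """Naive seasonal baseline standing in for the ML forecaster:
    average the message rates observed at the same hour of day.
    `history` is a list of (hour_of_day, msgs_per_second) samples."""
    by_hour = defaultdict(list)
    for h, rate in history:
        by_hour[h].append(rate)
    samples = by_hour.get(hour)
    return sum(samples) / len(samples) if samples else 0.0

def required_cores(predicted_rate, cores_per_1k_msgs, headroom=1.2):
    """Translate a predicted message rate into a CPU request ahead of
    the workload change, with a safety margin instead of reacting to
    CPU pressure after the fact. Returns whole cores (rounded up)."""
    return math.ceil(predicted_rate / 1000 * cores_per_1k_msgs * headroom)

# Hypothetical usage: forecast the 9 a.m. rate, then size the
# TaskManager before the peak arrives.
history = [(9, 4800), (9, 5200), (10, 8000)]
predicted = seasonal_forecast(history, hour=9)   # 5000.0 msgs/s
cores = required_cores(predicted, cores_per_1k_msgs=0.3)  # 2 cores
```

Because the CPU request is computed before the workload rises, the job is resized once, ahead of the peak, rather than restarting repeatedly while chasing a reactive metric.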