Links
Kthena is a new Kubernetes-native system that optimizes the routing, orchestration, and scheduling of Large Language Model (LLM) inference. It addresses key challenges such as resource utilization and latency, offering intelligent routing and production-grade orchestration. As a sub-project of Volcano, it extends Volcano's support for AI lifecycle management.
This article explains how to run large-scale language model inference on Kubernetes. It covers key concepts such as batching strategies, performance metrics, and intelligent routing for optimizing GPU utilization, and includes practical deployment examples along with a discussion of the challenges of managing inference at scale.
KServe v0.15 has been released, enhancing capabilities for serving generative AI models, including support for large language models (LLMs) and advanced caching mechanisms. Key features include integration with Envoy AI Gateway, multi-node inference, and autoscaling with KEDA, aimed at improving performance and scalability for AI workloads. The update also introduces a dedicated documentation section for generative AI and various performance optimizations.
This guide outlines how to deploy large language models (LLMs) at scale using Google Kubernetes Engine (GKE) and the GKE Inference Gateway, which optimizes load balancing based on AI-specific metrics. It provides a step-by-step walkthrough for setting up an inference pipeline with the vLLM framework, with an emphasis on efficient resource management and performance for AI workloads. Key features include intelligent load balancing, simplified operations, and support for multiple models and hardware configurations.