Links
This article explains how to implement large-scale language-model inference on Kubernetes. It covers batching strategies, serving performance metrics, and intelligent request routing to keep GPUs well utilized, and discusses practical deployment examples along with the operational challenges of managing inference at scale.
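To make the batching idea concrete, here is a minimal sketch of dynamic batching in Python, under stated assumptions: `run_model_batch` is a dummy stand-in for a real GPU-backed engine call, and the batch-size and timeout values are arbitrary, not taken from the article.

```python
import asyncio
import time

# Hypothetical stand-in for the batched model call; a real server would
# invoke a GPU-backed engine here instead.
async def run_model_batch(prompts):
    await asyncio.sleep(0.05)  # simulate one batched forward pass
    return [f"completion for: {p}" for p in prompts]

class DynamicBatcher:
    """Collect incoming requests and flush them as one batch when either
    the batch is full or a short timeout expires -- the core idea behind
    dynamic batching on an inference server."""

    def __init__(self, max_batch_size=8, max_wait_s=0.01):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def submit(self, prompt):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def run(self):
        while True:
            # Block until at least one request arrives, then start a batch.
            batch = [await self.queue.get()]
            deadline = time.monotonic() + self.max_wait_s
            # Keep filling the batch until it is full or the deadline passes.
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = await run_model_batch([p for p, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)

async def main():
    batcher = DynamicBatcher()
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(f"req-{i}") for i in range(5)))
    print(results)
    worker.cancel()

asyncio.run(main())
```

Here all five requests land within the 10 ms window, so they run as a single batch; the same trade-off (wait a little, amortize a lot) is what the article's batching strategies tune.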
oLLM is a lightweight Python library for large-context LLM inference that lets users run sizeable models on consumer-grade GPUs without quantization. The latest update adds support for more models, improves VRAM management, and introduces features such as AutoInference and multimodal capabilities, making it suitable for tasks that involve large inputs and complex processing.
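A usage sketch of the pattern such a library enables is below. The class and method names (`Inference`, `ini_model`, `DiskCache`, `generate`) and the model id are assumptions for illustration, not oLLM's confirmed API; consult the project README for the real interface.

```python
# Hypothetical sketch only: Inference, ini_model, DiskCache and the model id
# are assumed names, not oLLM's documented API.
from ollm import Inference

llm = Inference("llama3-8B-chat", device="cuda:0")
llm.ini_model(models_dir="./models/")  # weights streamed from SSD layer by
                                       # layer, so the full model never has
                                       # to sit in VRAM at once

# Spill the KV cache to disk so a very long context need not fit in GPU memory.
kv_cache = llm.DiskCache(cache_dir="./kv_cache/")

with open("contract.txt") as f:
    prompt = "Summarize the following contract:\n" + f.read()

output = llm.generate(prompt, past_key_values=kv_cache, max_new_tokens=500)
print(output)
```

The design choice worth noting is the trade: streaming weights and cache from SSD keeps VRAM needs small without quantization, at the cost of throughput, which is why this suits offline processing of large inputs rather than latency-sensitive serving.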
A new compiler called Mirage Persistent Kernel (MPK) transforms large language model (LLM) inference into a single high-performance megakernel, reducing latency by 1.2-6.7x. By fusing computation and communication across multiple GPUs into one persistent kernel, MPK maximizes hardware utilization and avoids the overhead of launching many separate kernels. The compiler is designed to be easy to use, requiring minimal input to compile an LLM into an optimized megakernel.
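To illustrate the idea, here is a conceptual sketch in plain Python (no GPU code). It is not MPK's actual output; all names in it are hypothetical. It only shows the control-flow difference between launching one kernel per operator and running one persistent loop that pulls fine-grained compute and communication tasks from a dependency graph as soon as they are ready.

```python
# Traditional execution: one launch per operator, each paying launch
# overhead and implicitly synchronizing before the next step.
def run_layer_traditional(x, launch):
    x = launch("attention", x)   # kernel launch 1
    x = launch("allreduce", x)   # kernel launch 2 (communication)
    x = launch("mlp", x)         # kernel launch 3
    return x

# Megakernel-style execution: a single long-lived loop that executes
# fine-grained tasks the moment their dependencies resolve, letting
# compute and communication overlap instead of running in lockstep.
def run_megakernel(task_graph):
    done, scheduled = set(), set()
    ready = [t for t in task_graph if not t["deps"]]
    scheduled.update(t["name"] for t in ready)
    while ready:
        task = ready.pop()
        task["fn"]()                       # one tile of compute or communication
        done.add(task["name"])
        for t in task_graph:               # unlock dependents immediately
            if t["name"] not in scheduled and all(d in done for d in t["deps"]):
                ready.append(t)
                scheduled.add(t["name"])

# Toy task graph: two attention tiles feed an all-reduce, which feeds the MLP.
graph = [
    {"name": "attn_tile0", "deps": [], "fn": lambda: print("attn tile 0")},
    {"name": "attn_tile1", "deps": [], "fn": lambda: print("attn tile 1")},
    {"name": "allreduce", "deps": ["attn_tile0", "attn_tile1"],
     "fn": lambda: print("all-reduce, overlapped with other ready work")},
    {"name": "mlp", "deps": ["allreduce"], "fn": lambda: print("mlp")},
]
run_megakernel(graph)
```

In the real system this loop lives inside one persistent GPU kernel, so there is a single launch for the whole model rather than one per operator, which is where the latency savings come from.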