4 links tagged with all of: gpu + llm
Links
oLLM is a lightweight Python library for large-context LLM inference, letting users run large models on consumer-grade GPUs without quantization. The latest update adds support for more models, improved VRAM management, and features such as AutoInference and multimodal input, making it suitable for long-context workloads like processing very large documents or datasets.
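To give a feel for the workflow, here is a minimal sketch of large-context inference in this style; the `Inference` class, model identifier, and method names are assumptions for illustration, not oLLM's verified API.

```python
# Hypothetical sketch of large-context inference in the oLLM style.
# The class, model id, and method names are assumptions; see the
# oLLM README for the real API.
from ollm import Inference  # assumed entry point

# Load a full-precision model on a consumer GPU; the library manages
# VRAM by offloading whatever does not fit, so no quantization is needed.
o = Inference("llama3-8B-chat", device="cuda:0")  # assumed signature

# Feed a context far larger than the GPU's VRAM would normally allow.
with open("big_document.txt") as f:
    context = f.read()

answer = o.generate(prompt=f"Summarize:\n{context}", max_new_tokens=512)
print(answer)
```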
Mirage Persistent Kernel (MPK) is a new compiler that transforms large language model (LLM) inference into a single high-performance megakernel, cutting latency by a factor of 1.2 to 6.7. By fusing computation and communication across multiple GPUs into one persistent kernel, MPK maximizes hardware utilization and avoids the overhead of repeated kernel launches. The compiler is designed to be user-friendly, requiring minimal input to compile an LLM into an optimized megakernel.
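The compile-once, launch-once workflow might look like the sketch below; the `mirage` package name and the `PersistentKernel`, `compile`, and `decode` calls are all assumptions made for illustration, not the project's confirmed interface.

```python
# Hypothetical sketch of compiling an LLM into a single megakernel.
# Every module and method name here is an assumption for illustration.
import mirage as mi  # assumed package name

# Describe the model once; the compiler fuses every per-layer kernel
# and the cross-GPU communication into one persistent megakernel, so
# control never returns to the host between layers during decoding.
mpk = mi.PersistentKernel(model="qwen2.5-7b", num_gpus=2)  # assumed API
megakernel = mpk.compile()  # a single kernel launch replaces thousands

prompt_ids = [101, 2023, 2003]  # placeholder token ids
out = megakernel.decode(prompt_ids, max_tokens=256)  # assumed call
```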
KTransformers is a Python-based framework for optimizing large language model (LLM) inference; its central idea is letting users inject optimized modules into an existing model through a simple, extensible interface. It supports multi-GPU setups and advanced quantization techniques, and it integrates with existing APIs for seamless deployment. The framework targets faster local deployments, particularly in resource-constrained environments, while fostering community contributions and ongoing development.
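The injection idea is the interesting part: match modules in a stock Hugging Face model and swap in optimized replacements. The sketch below illustrates that pattern with an assumed `optimize()` entry point and rule format; KTransformers' real interface differs in detail (its docs describe rule-file-driven injection).

```python
# Hypothetical sketch of the KTransformers-style injection pattern.
# The optimize() helper, rule format, and operator path are
# assumptions; consult the KTransformers docs for the real rules.
import ktransformers  # assumed import
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B")

# A rule matches modules by class or name and names the optimized
# replacement (e.g. a quantized linear kernel) to inject in its place.
rules = [
    {"match": {"class": "torch.nn.Linear"},
     "replace": {"class": "ktransformers.operators.KTransformersLinear"}},  # assumed path
]
optimized = ktransformers.optimize(model, rules)  # assumed entry point
# The optimized model is then used like any Hugging Face model.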
Lemonade is a tool for running local large language models (LLMs) efficiently by configuring advanced inference engines for the user's hardware, including NPUs and GPUs. It supports both GGUF and ONNX models, offers a user-friendly interface for model management, and is used by organizations ranging from startups to large companies such as AMD. The platform also provides an API and CLI for integration into Python applications, along with broad hardware support and opportunities for community collaboration.
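A simple way to integrate from Python, assuming the Lemonade server exposes an OpenAI-compatible endpoint, is to point the standard `openai` client at it; the port, path, and model name below are assumptions to check against the Lemonade docs.

```python
# Hypothetical sketch of calling a locally running Lemonade server,
# assuming it speaks the OpenAI-compatible protocol. The port, path,
# and model name are assumptions; check the Lemonade docs.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/api/v1",  # assumed local endpoint
    api_key="unused",  # local servers typically ignore the key
)

resp = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct",  # assumed model name
    messages=[{"role": "user", "content": "What hardware am I running on?"}],
)
print(resp.choices[0].message.content)
```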