8 links tagged with all of: llm + inference
Links
bitnet.cpp is a framework for efficient inference of 1-bit large language models (LLMs), delivering significant speedups and energy savings on both ARM and x86 CPUs. It lets large models run locally at speeds comparable to human reading speed, and aims to spur further development of 1-bit LLMs. Future plans include GPU support and extensions to other low-bit models.
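As a rough illustration of the idea behind 1-bit models (not bitnet.cpp's C++ kernels), here is a minimal NumPy sketch of absmean ternary quantization as described for BitNet b1.58, where every weight becomes -1, 0, or +1 and matrix multiplies reduce to signed additions:

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-5):
    """Absmean ternary quantization: scale by the mean absolute weight,
    then round each weight to the nearest value in {-1, 0, +1}."""
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q.astype(np.int8), scale

def ternary_matmul(x: np.ndarray, q: np.ndarray, scale: float):
    # A matmul against ternary weights needs no real multiplications:
    # each output is a signed sum of activations (real kernels use adds/LUTs).
    return (x @ q.astype(x.dtype)) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = ternary_quantize(w)
x = np.random.randn(2, 4).astype(np.float32)
print(ternary_matmul(x, q, s))
```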
oLLM is a lightweight Python library for large-context LLM inference that lets users run large models on consumer-grade GPUs without quantization. The latest update adds support for more models, improved VRAM management, and features such as AutoInference and multimodal capabilities, making it well suited to long-context tasks over large datasets.
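oLLM's own API and storage format differ, but the core trick behind fitting a big model into small VRAM, streaming weights through the GPU one layer at a time, can be sketched as follows (the per-layer checkpoint files here are hypothetical):

```python
import torch

# Hypothetical per-layer checkpoint files; oLLM's real loader differs.
LAYER_FILES = [f"weights/layer_{i:02d}.pt" for i in range(32)]

@torch.no_grad()
def offloaded_forward(hidden: torch.Tensor) -> torch.Tensor:
    """Stream one transformer layer at a time through the GPU so peak
    VRAM holds a single layer plus activations, not the whole model."""
    for path in LAYER_FILES:
        layer = torch.load(path, map_location="cpu")  # weights stay on disk/CPU
        layer = layer.to("cuda")                      # stage just this layer
        hidden = layer(hidden)
        del layer                                     # free VRAM before the next layer
        torch.cuda.empty_cache()
    return hidden
```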
A new compiler called Mirage Persistent Kernel (MPK) transforms large language model (LLM) inference into a single high-performance megakernel, cutting latency by 1.2x to 6.7x. By fusing computation and communication across multiple GPUs, MPK maximizes hardware utilization and avoids the overhead of many separate kernel launches. The compiler is designed to be user-friendly, requiring minimal input to compile LLMs into optimized megakernels.
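A CPU-side analogy for the persistent-kernel idea: instead of launching one worker per operation, long-lived workers drain a shared task queue, amortizing launch overhead. A toy Python sketch of that scheduling pattern (MPK does this on-GPU, across SMs, with dependencies resolved on-device):

```python
import queue
import threading

# Toy task graph: each entry is an independent unit of work.
tasks = queue.Queue()
for i in range(8):
    tasks.put(lambda i=i: i * i)

results, lock = [], threading.Lock()

def persistent_worker():
    """Stays alive for the whole run, pulling work instead of being
    launched once per task -- the analogue of a persistent megakernel
    amortizing per-kernel launch overhead."""
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return
        r = task()
        with lock:
            results.append(r)

workers = [threading.Thread(target=persistent_worker) for _ in range(4)]
for w in workers: w.start()
for w in workers: w.join()
print(sorted(results))
```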
Groq is now available as an Inference Provider on the Hugging Face Hub, expanding serverless inference for a range of text and conversational models. Built on Groq's Language Processing Unit (LPU™), the integration gives developers faster inference for large language models through a pay-as-you-go API, with provider preferences and API keys managed directly from their Hugging Face accounts.
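The integration follows the Hub's standard provider pattern in huggingface_hub; a minimal sketch (the model ID is just an example of a Groq-served model):

```python
import os
from huggingface_hub import InferenceClient

# Route the request through Groq via the Hub's provider abstraction.
client = InferenceClient(provider="groq", api_key=os.environ["HF_TOKEN"])

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # illustrative model choice
    messages=[{"role": "user", "content": "Explain KV caching in one sentence."}],
)
print(completion.choices[0].message.content)
```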
Charlotte Qi discusses the challenges of serving large language models (LLMs) at Meta, focusing on the complexities of LLM inference and the need for efficient hardware and software solutions. She outlines the critical steps to optimize LLM serving, including fitting models to hardware, managing latency, and leveraging techniques like continuous batching and disaggregation to enhance performance.
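As a sketch of the continuous batching technique she discusses (not Meta's implementation): finished sequences leave the batch after each decode step and queued requests join immediately, instead of waiting for the whole batch to drain as in static batching:

```python
from collections import deque

def continuous_batching(pending: deque, max_batch: int = 4):
    """Toy scheduler: admit waiting requests into the running batch after
    every decode step, as soon as slots free up."""
    active = []  # (request_id, tokens_remaining)
    step = 0
    while pending or active:
        while pending and len(active) < max_batch:    # admit new work each step
            active.append(pending.popleft())
        step += 1
        active = [(rid, n - 1) for rid, n in active]  # one decode step for all
        for rid, n in active:
            if n == 0:
                print(f"step {step}: request {rid} finished")
        active = [(rid, n) for rid, n in active if n > 0]

continuous_batching(deque([("a", 3), ("b", 5), ("c", 2), ("d", 4), ("e", 1)]))
```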
Tokasaurus is a newly released LLM inference engine designed for high-throughput workloads, outperforming existing engines like vLLM and SGLang by more than 3x in benchmarks. It includes optimizations for both small and large models, such as dynamic prefix identification and several parallelism techniques that improve efficiency and reduce CPU overhead. The engine supports multiple model families and is available as an open-source project on GitHub and PyPI.
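Tokasaurus's dynamic prefix identification is more sophisticated than this, but the underlying observation can be sketched with a toy grouping function: prompts that share a long common prefix can compute that prefix's KV cache once and reuse it across the group:

```python
import os

def group_by_shared_prefix(prompts: list[str], min_len: int = 8):
    """Toy shared-prefix detection: bucket prompts whose leading text
    overlaps by at least min_len characters, so the shared prefix's
    KV cache could be computed once per bucket."""
    groups: dict[str, list[str]] = {}
    for p in prompts:
        for prefix in list(groups):
            common = os.path.commonprefix([prefix, p])
            if len(common) >= min_len:
                groups[common] = groups.pop(prefix) + [p]
                break
        else:
            groups[p] = [p]
    return groups

prompts = ["You are a helpful assistant. Summarize: A...",
           "You are a helpful assistant. Summarize: B...",
           "Translate to French: hello"]
for prefix, members in group_by_shared_prefix(prompts).items():
    print(f"{len(members)} request(s) share prefix {prefix[:40]!r}")
```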
ANEMLL is an open-source project for porting large language models (LLMs) to the Apple Neural Engine (ANE), with model evaluation tools, optimized conversion utilities, and on-device inference capabilities. The project supports several model architectures, ships a reference implementation in Swift, and provides automated testing scripts for integration into applications. Its goal is privacy and efficiency on edge devices through local model execution.
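ANEMLL's conversion pipeline applies further ANE-specific rewrites, but targeting the Neural Engine from Python generally goes through coremltools; a generic sketch of such a conversion (the toy module is only for illustration):

```python
import torch
import coremltools as ct

class TinyBlock(torch.nn.Module):
    """Stand-in for a model component being ported to the ANE."""
    def forward(self, x):
        return torch.nn.functional.gelu(x @ x.transpose(-1, -2))

example = torch.randn(1, 64, 64)
traced = torch.jit.trace(TinyBlock().eval(), example)

mlmodel = ct.convert(
    traced,
    convert_to="mlprogram",                   # Core ML's ML Program format
    inputs=[ct.TensorType(shape=example.shape)],
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # prefer the Neural Engine
)
mlmodel.save("tiny_block.mlpackage")
```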
The paper presents ChunkLLM, a lightweight, pluggable framework for accelerating inference in transformer-based large language models (LLMs) while preserving performance. It introduces two novel components, the QK Adapter and the Chunk Adapter, which handle feature compression and chunk-attention acquisition, yielding significant inference speedups, especially on long texts. Experiments show that ChunkLLM retains a high level of performance while running up to 4.48x faster than standard transformer models.
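The paper's adapters are learned; as a rough, training-free sketch of the chunk-attention idea itself: summarize each key chunk, pick the top-scoring chunks per query, and attend only within them, so attention cost scales with the selected chunks rather than the full sequence:

```python
import torch
import torch.nn.functional as F

def chunk_attention(q, k, v, chunk=16, top_k=2):
    """Heuristic chunk attention: score each key chunk by its mean key,
    keep the top_k chunks per query, and attend only to those tokens.
    ChunkLLM learns this selection with adapters; this is illustration only."""
    T, d = k.shape
    n = T // chunk
    chunk_keys = k[: n * chunk].reshape(n, chunk, d).mean(dim=1)  # (n, d) summaries
    scores = q @ chunk_keys.T                                     # (Q, n)
    keep = scores.topk(top_k, dim=-1).indices                     # chunks per query
    out = torch.zeros(q.shape[0], d)
    for i in range(q.shape[0]):
        idx = torch.cat([torch.arange(c * chunk, (c + 1) * chunk) for c in keep[i]])
        attn = F.softmax(q[i] @ k[idx].T / d ** 0.5, dim=-1)
        out[i] = attn @ v[idx]
    return out

q = torch.randn(4, 32); k = torch.randn(128, 32); v = torch.randn(128, 32)
print(chunk_attention(q, k, v).shape)  # torch.Size([4, 32])
```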