22 links
tagged with all of: machine-learning + optimization
Links
The article discusses the transformation of a batch machine learning inference system into a real-time system to handle explosive user growth, achieving a 5.8x reduction in latency and maintaining over 99.9% reliability. Key optimizations included migrating to Redis for faster data access, compiling models to native C binaries, and implementing gRPC for improved data transmission. These changes enabled the system to serve millions of predictions quickly while capturing significant revenue that would have otherwise been lost.
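The article credits Redis, compiled model binaries, and gRPC for most of the win; as a rough illustration of the feature-store piece, here is a minimal Redis lookup in Python (the key layout and feature names are invented for illustration, and a local Redis instance is assumed):

```python
import redis  # redis-py client

# Hypothetical feature-store layout: one hash per user, written by the
# offline pipeline and read by the online predictor at request time.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
r.hset("features:user:42", mapping={"trips_7d": 18, "avg_rating": 4.9})

def fetch_features(user_id: int) -> dict:
    # A single O(1) in-memory hash read replaces the slower database
    # query a batch system would issue per prediction.
    return r.hgetall(f"features:user:{user_id}")

print(fetch_features(42))
```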
The article discusses how to optimize the performance of diffusion models using the torch.compile feature, which enhances speed with minimal user experience impact. It provides practical advice for both model authors and users on implementing compilation strategies, such as regional compilation and handling recompilations, to achieve significant efficiency gains. Additionally, it highlights methods to extend these optimizations to popular Diffusers features, making them compatible with memory-constrained GPUs and rapid personalization techniques.
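For readers who have not tried it, here is a minimal sketch of the regional-compilation idea in plain PyTorch; the toy Block stands in for one repeated diffusion-transformer layer, and the same pattern applies to the blocks of a Diffusers pipeline:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    # Stand-in for one repeated transformer block in a diffusion model.
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.net(x)

model = nn.Sequential(*[Block() for _ in range(8)])

# Regional compilation: compile each repeated block rather than the whole
# model, so the compiler traces the block structure once and reuses the
# compiled artifact, cutting cold-start compile time sharply.
for i, block in enumerate(model):
    model[i] = torch.compile(block)

out = model(torch.randn(2, 64))
```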
The EdgeAI for Beginners course offers a comprehensive introduction to deploying artificial intelligence on edge devices, emphasizing practical applications, privacy, and real-time performance. It covers small language models, optimization techniques, and production strategies, with hands-on workshops and resources for various technical roles across multiple industries. Participants can follow a structured learning path and engage with a community of developers for support.
This study presents a framework for dynamic assortment selection and pricing using a censored multinomial logit choice model, where sellers can optimize product offerings and prices based on buyer preferences and valuations. By employing a Lower Confidence Bound pricing strategy alongside Upper Confidence Bound or Thompson Sampling approaches, the proposed algorithms achieve significant regret bounds, which are validated through simulations.
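The paper's confidence radii are derived for the censored MNL model specifically, but the generic shape of the optimistic (UCB) and conservative (LCB) indices can be sketched as follows; the exploration constant and counts are placeholders, not the paper's calibration:

```python
import math

def ucb(mean: float, n: int, t: int, c: float = 1.0) -> float:
    # Optimistic utility estimate: empirical mean plus a confidence
    # radius that shrinks as the product is observed more often.
    return mean + c * math.sqrt(math.log(t + 1) / max(n, 1))

def lcb_price(mean_valuation: float, n: int, t: int, c: float = 1.0) -> float:
    # Conservative price: shade the estimated valuation down by the same
    # radius, so quoted prices err on the side of being accepted.
    return mean_valuation - c * math.sqrt(math.log(t + 1) / max(n, 1))

# A product seen 10 times by round 100: optimistic utility, shaded price.
print(ucb(0.4, n=10, t=100), lcb_price(3.5, n=10, t=100))
```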
Lyft tackles the complex challenge of matching drivers to riders in real-time using graph theory and optimization techniques. By modeling the problem as a bipartite graph, Lyft aims to maximize efficiency while adapting to dynamic urban conditions and demand fluctuations. The article discusses the mathematical foundations of matching problems and the practical considerations involved in dispatching within a ridesharing framework.
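As a concrete toy version of the core assignment step, SciPy's Hungarian-algorithm solver computes a minimum-cost matching on a small driver-rider cost matrix; production edge weights fold in far more than pickup ETA:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: estimated pickup time (minutes) from driver i to rider j.
eta = np.array([
    [4.0, 9.5, 7.2],
    [3.1, 2.8, 6.4],
    [8.7, 5.0, 2.2],
])

# Minimum-cost perfect matching on the bipartite driver-rider graph.
drivers, riders = linear_sum_assignment(eta)
for d, r in zip(drivers, riders):
    print(f"driver {d} -> rider {r} (ETA {eta[d, r]:.1f} min)")
```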
Moonshot AI's Kimi K2 model outperforms GPT-4 in several benchmark tests, showcasing superior capabilities in autonomous task execution and mathematical reasoning. Its innovative MuonClip optimizer promises to revolutionize AI training efficiency, potentially disrupting the competitive landscape among major AI providers.
Prompt bloat can significantly hinder the quality of outputs generated by large language models (LLMs) due to irrelevant or excessive information. This article explores the impact of prompt length and extraneous details on LLM performance, highlighting the need for effective techniques to optimize prompts for better accuracy and relevance.
The article distills practical lessons for working effectively with large language models (LLMs), stressing that good results depend on understanding both what the models can do and where they break down. It offers guidance on shaping interactions with LLMs so that they are more useful across a range of applications.
An in-depth exploration of DoorDash's proprietary search engine reveals how it enhances the user experience by personalizing and optimizing search results for food delivery. The system leverages machine learning algorithms and user data to improve accuracy and relevance, ultimately aiming to increase customer satisfaction and operational efficiency.
VistaDPO is a new framework for optimizing video understanding in Large Video Models (LVMs) by aligning text-video preferences at three hierarchical levels: instance, temporal, and perceptive. The authors introduce a dataset, VistaDPO-7k, consisting of 7.2K annotated QA pairs to address the challenges of video-language misalignment and hallucinations, showing significant performance improvements in various benchmarks.
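For reference, VistaDPO builds on the standard Direct Preference Optimization objective below, applying variants of it at each of the three levels; the per-level weightings are detailed in the paper:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log\sigma\!\left(
\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
\right)\right]
```

Here $y_w$ and $y_l$ are the preferred and dispreferred responses, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ controls the strength of the preference margin.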
The article discusses advancements in accelerating graph learning models using PyG (PyTorch Geometric) and Torch Compile, highlighting methods that enhance performance and efficiency in processing graph data. It details practical implementations and the impact of these optimizations on machine learning tasks involving graphs.
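A minimal sketch of the usage pattern, assuming a recent PyG release with torch.compile support; the two-layer GCN and random graph are purely illustrative:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_dim: int, hidden: int, out_dim: int):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, out_dim)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

model = torch.compile(GCN(16, 32, 4))  # fuses the gather/scatter kernels
x = torch.randn(100, 16)                      # 100 nodes, 16 features
edge_index = torch.randint(0, 100, (2, 500))  # 500 random directed edges
out = model(x, edge_index)
```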
Strategies for deploying the DeepSeek-V3/R1 model are explored, emphasizing parallelization techniques, Multi-Token Prediction for improved efficiency, and future optimizations like Prefill Disaggregation. The article highlights the importance of adapting computational strategies for different phases of processing to enhance overall model performance.
The article provides an in-depth walkthrough of how the vLLM framework handles an inference request, tracing each step from request receipt to efficient processing. It emphasizes the benefits of vLLM for serving machine learning workloads, particularly its performance optimizations and resource management during inference.
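For orientation, the offline entry point that kicks off the request flow described there is vLLM's standard quickstart; the model choice and sampling settings below are arbitrary:

```python
from vllm import LLM, SamplingParams

# Each prompt becomes a request; the engine schedules requests together
# and batches token generation continuously across them.
llm = LLM(model="facebook/opt-125m")  # any Hugging Face model id
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["The key idea behind paged attention is"], params)
for out in outputs:
    print(out.outputs[0].text)
```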
Introducing static network sparsity through one-shot random pruning can enhance the scaling potential of deep reinforcement learning (DRL) models. This approach provides higher parameter efficiency and better optimization resilience compared to traditional dense networks, demonstrating benefits in both visual and streaming RL scenarios.
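One-shot random pruning is easy to reproduce with PyTorch's built-in pruning utilities; this sketch uses a toy network and a placeholder sparsity level rather than the paper's setup:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy DRL trunk; the actual networks are larger, but the step is the same.
net = nn.Sequential(
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 4),
)

# One-shot random pruning at initialization: fix a random static mask
# (here 90% sparsity) before any training. The pruning hook keeps masked
# weights at zero for the whole run, so the sparsity pattern never moves.
for module in net:
    if isinstance(module, nn.Linear):
        prune.random_unstructured(module, name="weight", amount=0.9)
```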
H2 is a framework designed to enhance the training of large language models (LLMs) on hyper-heterogeneous clusters with over 1,000 chips, addressing inefficiencies caused by diverse hardware and software environments. It integrates DiTorch for consistent programming across chips and DiComm for optimized communication, alongside an adaptive pipeline parallelism strategy that achieves significant speedup compared to traditional homogeneous training methods. Experimental results show a performance improvement of up to 16.37% on a 100-billion-parameter LLM, demonstrating the framework's effectiveness at large scales.
TreeRL is a novel reinforcement learning framework that integrates on-policy tree search to enhance the training of language models. By incorporating intermediate supervision and optimizing search efficiency, TreeRL addresses issues common in traditional reinforcement learning methods, such as distribution mismatch and reward hacking. Experimental results show that TreeRL outperforms existing methods in math and code reasoning tasks, showcasing the effectiveness of tree search in this domain.
DuPO introduces a dual learning-based preference optimization framework designed to generate annotation-free feedback, overcoming limitations of existing methods such as RLVR and traditional dual learning. By decomposing a task's input into known and unknown components and reconstructing the unknown part, DuPO enhances various tasks, achieving significant improvements in translation quality and mathematical reasoning accuracy. This framework positions itself as a scalable and general approach for optimizing large language models (LLMs) without the need for costly labels.
Lyft leverages machine learning to enhance its ride-sharing services, resulting in significant financial benefits. By optimizing driver allocation and improving customer experience through data analysis, Lyft aims to generate an additional $100 million in revenue. This strategic use of technology highlights the company's commitment to innovation in the competitive transportation sector.
The study introduces a theoretical framework for understanding in-context learning (ICL) in large language models (LLMs) by utilizing hierarchical concept modeling and optimization theory. It demonstrates how nonlinear residual transformers can effectively perform factual-recall tasks through vector arithmetic, proving strong generalization and robustness against concept recombination and distribution shifts. Empirical simulations support these theoretical findings, showcasing the advantages of transformers over traditional static embeddings.
An optimized Triton BF16 Grouped GEMM kernel is presented, achieving up to 2.62x speedup over the manual PyTorch implementation for Mixture-of-Experts (MoE) models like DeepSeekv3 on NVIDIA H100 GPUs. The article details several optimization techniques, including persistent kernel design, grouped launch ordering for improved cache performance, and efficient utilization of the Tensor Memory Accelerator (TMA) for expert weights. End-to-end benchmarking results demonstrate significant improvements in training throughput.
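For contrast, the eager baseline that a grouped kernel replaces is essentially a Python loop of per-expert GEMMs, as in this sketch with illustrative (non-DeepSeek) shapes:

```python
import torch

num_experts, d_model, d_ff = 8, 512, 2048
tokens_per_expert = [128, 96, 160, 64, 128, 112, 80, 144]

weights = [torch.randn(d_model, d_ff, dtype=torch.bfloat16) for _ in range(num_experts)]
inputs = [torch.randn(n, d_model, dtype=torch.bfloat16) for n in tokens_per_expert]

# A grouped GEMM fuses all of these ragged products into one kernel
# launch; the loop below pays a launch (and cache) cost per expert.
outputs = [x @ w for x, w in zip(inputs, weights)]
```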
Pinterest has enhanced its machine learning (ML) infrastructure by extending the capabilities of Ray beyond just training and inference. By addressing challenges such as slow data pipelines and inefficient compute usage, Pinterest implemented a Ray-native ML infrastructure that improves feature development, sampling, and labeling, leading to faster, more scalable ML iteration.
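Not Pinterest's code, but a toy Ray Data pipeline illustrating the kind of distributed transform that a Ray-native feature or sampling stage builds on:

```python
import ray

ray.init()

# Streaming transforms execute across the cluster's workers instead of
# bottlenecking in a single-process data loader.
ds = ray.data.range(10_000)
ds = ds.map(lambda row: {"id": row["id"], "label": row["id"] % 2})
print(ds.take(3))
```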
The article highlights impactful papers and blog posts that have significantly influenced the author's understanding of programming languages and compilers. Each referenced work introduced new concepts, improved problem-solving techniques, or offered fresh perspectives on optimization and compiler design. The author encourages readers to explore these transformative resources for deeper insights into the field.