7 links
tagged with all of: infrastructure + machine-learning
Links
LinkedIn has developed OpenConnect, a next-generation AI pipeline ecosystem that improves the efficiency and reliability of processing large volumes of data for AI applications. By addressing the limitations of its previous ProML system, OpenConnect reduces launch times, speeds up iteration, and supports robust experimentation, helping LinkedIn deliver AI features to over 1.2 billion members.
OpenAI uses Kubernetes together with Apache technologies to manage its infrastructure at scale, so that machine learning models can be deployed and maintained reliably. This tooling provides efficient resource management and orchestration, enabling OpenAI to handle complex workloads and improve service delivery.
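The summary stays at the architecture level; as a rough sketch of how a model-serving workload can be declared on Kubernetes, the snippet below uses the official Kubernetes Python client to create a Deployment. The namespace, image, labels, and resource requests are hypothetical placeholders, not OpenAI's actual configuration.

```python
# Hypothetical sketch: declaring a GPU-backed model-serving Deployment with the
# official Kubernetes Python client. Names, image, and resources are illustrative.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in a pod

container = client.V1Container(
    name="model-server",
    image="registry.example.com/model-server:latest",  # placeholder image
    ports=[client.V1ContainerPort(container_port=8080)],
    resources=client.V1ResourceRequirements(
        requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
        limits={"nvidia.com/gpu": "1"},
    ),
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="model-server"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "model-server"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "model-server"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="ml-serving", body=deployment)
```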
Discord has moved from a single-node setup to multi-GPU clusters, making distributed computing more accessible to its machine learning engineers. The shift improves performance and efficiency on complex machine learning workloads; the article details the technical changes and their impact on Discord's infrastructure.
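For orientation on what multi-GPU training typically looks like (not Discord's actual code), here is a minimal sketch using PyTorch DistributedDataParallel, with one process per GPU launched via torchrun; the model and data are toy placeholders.

```python
# Minimal multi-GPU training sketch with PyTorch DDP, launched via
# `torchrun --nproc_per_node=<num_gpus> train.py`. Model and data are toys.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")           # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])        # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 1).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    for step in range(100):
        x = torch.randn(64, 128, device=local_rank)   # placeholder batch
        y = torch.randn(64, 1, device=local_rank)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()                               # gradients all-reduced across GPUs
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```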
Grab has evolved its machine learning feature store, moving from a traditional model to a feature table design backed by Amazon Aurora PostgreSQL for efficient data management and retrieval. The new architecture handles high-cardinality data and provides atomic updates, ensuring consistency and reliability in ML model serving. The feature tables improve the user experience, streamline the model lifecycle, and ultimately lead to better-performing ML models.
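The write-up describes the design rather than code; as a hedged sketch of what an atomic feature-table write on Postgres could look like, the snippet below upserts a row of features inside a single transaction using psycopg2. The table name, columns, and connection string are hypothetical, not Grab's schema.

```python
# Hypothetical sketch of an atomic feature-table upsert on Postgres with psycopg2.
# Table name, columns, and DSN are illustrative only.
import json
import psycopg2

UPSERT_SQL = """
INSERT INTO driver_features (driver_id, features, updated_at)
VALUES (%s, %s, NOW())
ON CONFLICT (driver_id)
DO UPDATE SET features = EXCLUDED.features, updated_at = EXCLUDED.updated_at;
"""

def write_features(dsn: str, driver_id: str, features: dict) -> None:
    # The connection context manager wraps the statement in a transaction,
    # so the row is either fully updated or left untouched.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(UPSERT_SQL, (driver_id, json.dumps(features)))

write_features(
    "postgresql://user:pass@aurora-endpoint:5432/featurestore",  # placeholder DSN
    "driver-123",
    {"completed_trips_7d": 42, "avg_rating_30d": 4.8},
)
```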
Pinterest has enhanced its machine learning (ML) infrastructure by extending the capabilities of Ray beyond just training and inference. By addressing challenges such as slow data pipelines and inefficient compute usage, Pinterest implemented a Ray-native ML infrastructure that improves feature development, sampling, and labeling, leading to faster, more scalable ML iteration.
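As a rough illustration of the Ray Data primitives such a pipeline builds on (not Pinterest's internal code), the sketch below reads a Parquet dataset, applies a feature transform in parallel batches, and draws a random sample; the S3 path, columns, and transform are placeholders.

```python
# Rough sketch of Ray Data primitives for feature processing and sampling.
# Path, columns, and transform are placeholders.
import ray
import pandas as pd

ray.init()  # connects to a local or existing Ray cluster

def add_engagement_feature(batch: pd.DataFrame) -> pd.DataFrame:
    # Placeholder feature transform applied in parallel across the cluster.
    batch["engagement_rate"] = batch["clicks"] / batch["impressions"].clip(lower=1)
    return batch

ds = (
    ray.data.read_parquet("s3://example-bucket/training-events/")  # placeholder path
    .map_batches(add_engagement_feature, batch_format="pandas")
    .random_sample(0.1)  # downsample to 10% for faster iteration
)

for batch in ds.iter_batches(batch_size=1024, batch_format="pandas"):
    pass  # feed batches into training, labeling, or analysis
```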
The complexity of modern infrastructure demands advanced observability tooling, built on cost-effective storage, standardized data collection with OpenTelemetry, and the integration of machine learning and AI for better insight and efficiency. This evolution in observability is driven by the need for high-fidelity data, seamless signal correlation, and intelligent alert management that can keep pace with scaling systems. Ultimately, effective observability will hinge on these innovations to maintain operational efficacy in increasingly complex environments.
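As a minimal example of the standardized collection the piece points to, the sketch below configures OpenTelemetry tracing in Python with a console exporter; a real deployment would export to a collector or backend, and the service and span names here are illustrative.

```python
# Minimal OpenTelemetry tracing setup in Python. The console exporter and
# span names are illustrative; production setups export to a collector.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("inference.service")  # instrumentation scope name

def handle_request(payload: dict) -> dict:
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("payload.size", len(payload))
        with tracer.start_as_current_span("model_inference"):
            result = {"score": 0.42}  # placeholder model call
        return result

handle_request({"feature_a": 1.0, "feature_b": 2.0})
```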
Grab has modernized its machine learning model serving platform, Catwalk, by adopting NVIDIA Triton Inference Server to improve performance and reduce costs. The transition involved building a "Triton manager" for seamless integration and backward compatibility, yielding significant reductions in latency and infrastructure spend for deployed models.
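For context on the serving side, the sketch below shows a typical client request to a Triton Inference Server over HTTP using the tritonclient package; the model name, tensor names, and shapes depend on the model's config.pbtxt and are hypothetical here.

```python
# Hypothetical client request to a Triton Inference Server over HTTP.
# Model name, tensor names, and shapes depend on the model's config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(1, 128).astype(np.float32)        # placeholder features
infer_input = httpclient.InferInput("INPUT__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

response = client.infer(
    model_name="ranking_model",                          # placeholder model name
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUTPUT__0")],
)
scores = response.as_numpy("OUTPUT__0")
print(scores)
```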