Links
This article explains Slonk, a system developed at Character.ai that combines SLURM and Kubernetes to manage GPU research clusters effectively. It addresses the challenges of providing a reliable scheduling environment for researchers while maintaining the operational benefits of Kubernetes. The open-source snapshot offers tools and configurations for others to implement similar systems.
This article explains how to use the Pulumi Kubernetes Operator and Kargo together for effective change management in Kubernetes environments. It covers features like controlled promotions, automatic verification, and approval gates to streamline infrastructure rollouts.
This article explores how Claude Code enhances development workflows by simplifying Git worktree management and streamlining Kubernetes deployments. It highlights the benefits of using AI to handle complex infrastructure tasks, making it easier for teams to work in parallel without conflicts.
This article explores viewing Kubernetes not just as a container orchestrator, but as a runtime for declarative infrastructure. It emphasizes the importance of its type system and continuous reconciliation processes, which help maintain the desired state of applications. The author highlights practical approaches for managing Kubernetes clusters effectively.
The article discusses how Airbnb achieved high availability for its distributed database systems using Kubernetes. It highlights the technical challenges faced and the solutions implemented to ensure robust performance and reliability in managing data across multiple services. The focus is on the architectural improvements and operational strategies that support scalable database management.
Rafay offers an infrastructure orchestration layer for enterprise AI workloads and Kubernetes management, intended to reduce the complexity and cost of traditional infrastructure. The platform manages GPU and CPU resources and provides a secure environment for AI development. An accompanying eBook presents analyst insights on how GPU Clouds can accelerate AI application deployment.
The article discusses the migration of over 30 Kubernetes clusters to Terraform, detailing the challenges faced with previous tools like Sceptre and AWS CDK, and outlining a structured, iterative approach to the transition. Key strategies included automating processes, ensuring safety during rollbacks, and emphasizing hands-on knowledge transfer over traditional documentation. The authors share insights on tooling, risk management, and team collaboration throughout the migration journey.
Crossplane 2.0 has been launched, marking a significant evolution in how platform teams manage both applications and infrastructure within Kubernetes. The new version introduces first-class application support, broader composition capabilities, and declarative operations, while maintaining backward compatibility. This release aims to simplify the user experience and enhance self-service APIs for developers.
OpenAI uses Kubernetes and Apache technologies to manage its scalable infrastructure so that machine learning models can be deployed and maintained reliably. Integrating these tools supports efficient resource management and orchestration, helping OpenAI handle complex workloads and improve service delivery.
k0rdent v1.0.0 has been released, marking a significant milestone with enhanced features for managing distributed infrastructure at scale using Kubernetes. This version focuses on unified observability, cost optimization, and improved operational capabilities through the k0rdent Cluster Manager and Observability & FinOps components, providing production-grade stability and advanced service management. Key highlights include automated IP management, multi-cluster support, and integration with popular observability tools for better resource tracking and financial accountability.