Links
Helm 4 has launched, marking the package manager's 10th anniversary and the first major update in six years. This version enhances application deployment safety and addresses challenges like CI/CD complexity and security across Kubernetes environments. Key features include an improved SDK, a new plugin system, and support for future chart enhancements.
This article lists essential resources for learning and staying updated on Kubernetes, catering to both beginners and experienced users. It includes tutorials, hands-on labs, podcasts, community forums, and advanced topics to help users deepen their understanding and skills in the cloud-native ecosystem.
CAST AI's report reveals that organizations waste significant cloud resources, using only 13% of provisioned CPUs and 20% of memory in Kubernetes clusters. The study highlights overprovisioning and low utilization of spot instances as key factors. It calls for AI-driven solutions to optimize resource management amid rising cloud costs.
The article discusses recent advancements in Kubernetes GPU management, focusing on dynamic resource allocation (DRA) and a new workload abstraction. DRA allows for more flexible GPU requests, while the workload abstraction aims to improve scheduling for complex AI deployments.
This article provides a guide to 15 essential metrics for monitoring Kubernetes environments. It focuses on how these metrics can help optimize performance, troubleshoot issues, and maintain system health. The content is aimed at developers and IT operations teams.
mirrord for CI allows developers to run tests directly against a shared staging environment in Kubernetes without deploying code or creating separate test setups. It enhances testing speed and accuracy by connecting CI runners to real services, cutting down on setup time and costs.
Google introduced Agent Sandbox, a new feature for Kubernetes that enhances security and performance for AI agents. It allows rapid provisioning of isolated environments for executing agent tasks, optimizing resource use while maintaining strong operational guardrails. GKE users can also leverage Pod Snapshots for faster start-up times.
This article explores essential FinOps tools that platform engineers should consider to optimize cloud costs effectively. It highlights the challenges of managing costs in dynamic environments and presents a structured framework for evaluation, focusing on Kubernetes integration, API-first architecture, and seamless workflows.
Kubernetes v1.35 introduces 60 enhancements, including significant features like in-place Pod resource updates and native workload identity with automated certificate rotation. The release also features new stable, beta, and alpha functionalities, along with some deprecations. Community contributions continue to drive the project's growth and improvement.
Kubernetes 1.35 introduces five key features that improve Day 2 operations, including in-place pod resource updates and fine-grained supplemental group control. These enhancements streamline resource management, security, and network efficiency for containerized applications.
The Node Readiness Controller enhances node readiness management in Kubernetes by allowing operators to define custom readiness conditions tailored to their infrastructure needs. It automates the application of node taints based on specific health signals, ensuring workloads are only scheduled on fully operational nodes. This controller supports flexible enforcement modes and integrates with existing health reporting tools.
This article explains how to use the Pulumi Kubernetes Operator and Kargo together for effective change management in Kubernetes environments. It covers features like controlled promotions, automatic verification, and approval gates to streamline infrastructure rollouts.
This article introduces debugwand, a tool for debugging Python applications in Kubernetes and Docker without the usual setup hassles. It leverages the new sys.remote_exec() feature in Python 3.14 to inject a debug server into a running process, allowing for real-time debugging with minimal configuration.
SurveyMonkey faced challenges with remote workflows after COVID, leading to unreliable local development setups. They adopted mirrord, which provided seamless Kubernetes integration, significantly reducing time-to-ship, improving code quality, and enhancing developer onboarding and satisfaction.
This article outlines five often-overlooked factors to consider when selecting a container platform, focusing on Kubernetes. It highlights the importance of timely access to updates, long support durations, the ability to run multiple versions, included tools, and treating VMs and containers equally.
This article marks the 10th anniversary of Helm, which started during a hackathon after Kubernetes 1.1.0 was released. It highlights Helm's evolution from its initial commit to becoming a key part of the Kubernetes ecosystem. The post also notes its early exposure at the first KubeCon.
This article explains Slonk, a system developed at Character.ai that combines SLURM and Kubernetes to manage GPU research clusters effectively. It addresses the challenges of providing a reliable scheduling environment for researchers while maintaining the operational benefits of Kubernetes. The open-source snapshot offers tools and configurations for others to implement similar systems.
This article discusses the security challenges of exposing AI workloads in Kubernetes, emphasizing the need for enhanced ingress security measures. It highlights various threats, such as resource exhaustion and prompt injection, and suggests using a specialized gateway like Calico Ingress Gateway with integrated WAF for better protection.
This article details a mentorship experience focused on enhancing the performance of the Kyverno CLI by identifying and addressing key bottlenecks. The author implemented solutions that reduced execution time for policy application from 15 minutes to just 1-2 seconds for large clusters. Insights into open source contribution and community support are also shared.
Amazon announced new capabilities for its Elastic Kubernetes Service (EKS) that simplify workload orchestration and cloud resource management. These features include Argo CD, AWS Controllers for Kubernetes, and Kube Resource Orchestrator, allowing users to manage Kubernetes applications more efficiently without handling underlying infrastructure complexities.
This article discusses the archiving of ingress-nginx, which will stop receiving maintenance in 2026. It outlines two main migration paths for Kubernetes users: moving to Cilium Ingress for a quick transition or adopting the Gateway API for enhanced traffic management.
This article offers a checklist to help platform engineers and SREs secure cloud and container workloads. It emphasizes the need for updated strategies in light of expanding attack surfaces and the integration of AI. The checklist covers asset inventory, vulnerability assessment, and compliance monitoring.
Kubernetes v1.35 adds a new alpha feature called Restart All Containers, allowing users to restart all containers in a Pod efficiently without deleting it. This is particularly beneficial for complex applications with inter-container dependencies and helps reduce resource waste during AI/ML workloads.
This article highlights the pitfalls of adopting technologies without understanding business needs, illustrated through examples like cloud migrations and Kubernetes usage. It emphasizes the importance of aligning technology choices with specific requirements and offers practical recommendations for better architectural decisions.
This article explores how Claude Code enhances development workflows by simplifying Git worktree management and streamlining Kubernetes deployments. It highlights the benefits of using AI to handle complex infrastructure tasks, making it easier for teams to work in parallel without conflicts.
This article discusses the challenges of scaling Kubernetes nodes from zero, focusing on the startup latency that can occur. It introduces concepts like reservation and overprovisioning placeholders to reduce delays and improve user experience, especially during spikes in traffic.
This article discusses Google's latest advancements in Google Kubernetes Engine (GKE) as it marks its 10th anniversary. Key updates include the introduction of Agent Sandbox for AI workloads, enhancements to autoscaling, and new compute classes to improve efficiency and performance across various workloads.
This article explores Kubernetes' architecture and its various attack vectors. It discusses security concerns, threat hunting, and how tools like Falco can help detect and mitigate potential threats within Kubernetes environments.
This repository offers a set of Falco detection rules and configuration files aimed at identifying various Kubernetes attack techniques. It includes scripts for testing these detections by simulating attacker behavior in a controlled environment.
Kubernetes v1.35 introduces an opt-in feature for CSI drivers to receive service account tokens through a dedicated secrets field instead of the volume context. This change aims to improve security by preventing accidental logging of sensitive tokens and standardizing how they are handled. Drivers can opt in at their own pace, ensuring backward compatibility.
A security researcher revealed a Kubernetes vulnerability that allows users with read-only permissions to execute arbitrary commands on pods. This exploit stems from the nodes/proxy GET resource, which many monitoring tools use, and poses significant risks to cluster security. Until the upcoming KEP-2862 is fully implemented, organizations need to audit their permissions and consider stricter access controls.
Three serious vulnerabilities in the runC container runtime could allow attackers to bypass isolation and gain root access to the host system. The flaws affect multiple versions of runC, with potential exploits requiring the ability to configure custom mounts. While no active exploitation has been reported, developers recommend using mitigations like user namespaces and rootless containers.
OpenEverest, developed by Percona and now part of the CNCF, is a tool for managing multiple databases like PostgreSQL, MySQL, and MongoDB within Kubernetes environments. It offers a unified interface and simplifies database operations, making it easier for organizations to handle their data infrastructure. The project aims to become vendor-agnostic and community-driven.
The kgateway v2.1 release introduces significant features, including the integration of agentgateway for AI workloads and a new global policy attachment capability. It also enhances session affinity options and passive health checks for improved performance and reliability. Deprecated features include the Envoy-based AI Gateway, which will be removed in future versions.
The article explores security vulnerabilities in AWS EKS by deploying misconfigured Kubernetes pods. It demonstrates how an attacker can escape from a compromised pod to gain root access on the host and potentially access other services. The focus is on the implications of specific dangerous configurations and their exploitation.
Kthena is a new system tailored for Kubernetes that optimizes the routing, orchestration, and scheduling of Large Language Model (LLM) inference. It addresses key challenges like resource utilization and latency, offering features such as intelligent routing and production-grade orchestration. This sub-project of Volcano enhances support for AI lifecycle management.
This article explores AWS Bottlerocket, a secure operating system designed for container hosting. It tests how Bottlerocket defends against common container escape techniques, demonstrating its effective security measures compared to less hardened systems like Ubuntu.
The Ingress NGINX controller for Kubernetes will be discontinued in March 2026 due to a lack of maintainers and funding. Despite its widespread use, the project has struggled to attract support, leading to its decline and unresolved security issues. Experts argue that the open source community must find ways to fund and sustain crucial projects like this one.
This article explores viewing Kubernetes not just as a container orchestrator, but as a runtime for declarative infrastructure. It emphasizes the importance of its type system and continuous reconciliation processes, which help maintain the desired state of applications. The author highlights practical approaches for managing Kubernetes clusters effectively.
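The continuous reconciliation idea can be illustrated with a small sketch. This is a simplified model rather than code from the article or from any Kubernetes controller: the `reconcile` function and the dict-based "desired" and "actual" states are assumptions made for illustration.

```python
def reconcile(desired, actual):
    """Return the actions that move `actual` toward `desired`.

    Both arguments map a resource name to its spec (hypothetical model).
    """
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


if __name__ == "__main__":
    desired = {"web": {"replicas": 3}, "cache": {"replicas": 1}}
    actual = {"web": {"replicas": 2}, "old-job": {"replicas": 1}}
    print(reconcile(desired, actual))
```

A real controller runs this comparison repeatedly, so drift is corrected on the next pass instead of needing a one-off fix.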
This article explains how to multiplex MCP servers to give AI agents access to specialized tools for specific tasks. It highlights the need for agents to interact with multiple servers simultaneously to enhance their capabilities, particularly in enterprise environments. The post also includes deployment instructions for two example servers: one for math functions and another for retrieving the current date.
This article outlines the development of Argo CD integration within Octopus Deploy, based on user feedback and design thinking. It details the features in the Early Access release and the design choices made to enhance usability and connectivity. The team also invites users to share their experiences for future improvements.
The CNCF Technical Oversight Committee has approved KServe as an incubating project, recognizing its role as a scalable AI inference platform on Kubernetes. Originally developed under Kubeflow, KServe supports generative and predictive AI workloads and has seen broad adoption across various industries.
This article promotes Octopus, a tool designed for efficient software deployment across various environments like Kubernetes and multi-cloud setups. It highlights the benefits of using Octopus, including improved deployment frequency and reduced downtime, and invites users to book a demo to learn more.
This article details a vulnerability in Kubernetes where service accounts with nodes/proxy GET permissions can execute commands in any Pod across reachable Nodes. This issue arises from how the Kubelet authorizes WebSocket connections, potentially leading to full cluster compromise without proper logging.
This article discusses how Aurea Imaging uses Kairos to manage NVIDIA Jetson devices for remote sensing in agriculture. By adopting an immutable OS approach, they simplify updates and maintenance of their fleet, ensuring reliable operations in the field. The collaboration with the Kairos community also enhances their device management capabilities.
This article introduces Container Network Observability for Amazon EKS, a feature that enhances visibility into network performance and traffic patterns within Kubernetes clusters. It details key functionalities like performance metrics, service maps, and flow tables to help teams troubleshoot and optimize their containerized applications.
This article discusses clopus-watcher, an autonomous agent designed to monitor applications in Kubernetes and apply hotfixes as needed. The author argues that such systems could eventually replace many roles currently held by 24/7 on-call engineers.
Kubernetes v1.35 introduces workload-aware scheduling, enhancing how multiple Pods are scheduled together. It features a new Workload API for defining scheduling requirements and supports gang scheduling to optimize resource use for large workloads. The update also includes opportunistic batching to speed up scheduling for identical Pods.
This article explains the in-place Pod resizing feature introduced in Kubernetes 1.27, allowing users to adjust resource limits without restarting Pods. It covers how the resizing process works, practical use cases, and limitations. The author provides step-by-step instructions on implementing this feature.
Kubernetes v1.35 introduces a beta feature that allows CSI drivers to opt in to receive service account tokens via a more secure secrets field instead of the volume context. This change aims to reduce the risk of sensitive token exposure in logs and improve consistency across drivers. Authors of CSI drivers are encouraged to adopt this feature with backward compatibility in mind.
The article discusses the rising adoption of GPUs for AI workloads and how organizations are increasingly using serverless compute services like AWS Lambda and Google Cloud Run. It highlights the inefficiencies in resource utilization across various platforms and the growing use of Kubernetes features like Horizontal Pod Autoscaler to optimize resource management.
Industry experts predict significant changes in Kubernetes networking by 2026, focusing on the integration of VMs and containers, improved user experiences with KubeVirt, and the emergence of specialized roles like the Kubernetworker. The increasing demand for AI workloads will drive innovations in network management and microsegmentation strategies.
The article discusses the Kube Resource Orchestrator (kro) and its role in enhancing Kubernetes resource management. It highlights kro's ability to simplify composition while acknowledging its limitations in addressing broader workflow and environment needs. The piece emphasizes the importance of integrating kro with other tools for a comprehensive platform strategy.
The External Secrets Operator integrates various external secret management systems with Kubernetes, automatically injecting secrets from providers like AWS, HashiCorp, and Google. It is an open-source project welcoming contributions and offers resources for developers, including a roadmap and meeting details.
This article discusses the challenges of managing Kubernetes contexts, which define the cluster, user, and namespace for commands. The author suggests using the `$KUBECONFIG` environment variable to separate configurations for different environments, making it easier to avoid mistakes in production.
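A minimal sketch of the `$KUBECONFIG` approach, assuming one kubeconfig file per environment. The `~/.kube/envs/` directory layout and the `kubectl_env` helper are hypothetical and not taken from the article.

```python
import os
from pathlib import Path

# Hypothetical layout: one kubeconfig file per environment.
CONFIG_DIR = Path.home() / ".kube" / "envs"


def kubectl_env(environment, base_env=None):
    """Return a copy of the environment with KUBECONFIG pinned to one file."""
    env = dict(os.environ if base_env is None else base_env)
    env["KUBECONFIG"] = str(CONFIG_DIR / f"{environment}.yaml")
    return env


if __name__ == "__main__":
    # Pass the result as env= to subprocess.run(["kubectl", ...]) so the
    # command only sees the contexts defined in the chosen file.
    print(kubectl_env("staging")["KUBECONFIG"])
```

Because the production file is never loaded unless it is requested by name, a stray command run in a staging shell cannot reach a production context.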
This article discusses the Kubernetes Guardrail Extension, which provides real-time compliance checks for Kubernetes YAML configurations directly in GitHub and GitLab. It aims to prevent issues by offering instant feedback and recommendations, allowing developers to address compliance concerns early in the development process.
Kubernetes v1.35 introduces a security feature allowing users to control which executables can run via kubeconfig. By configuring an allowlist in the kuberc file, users can restrict or permit specific credential plugins, enhancing security against potential supply-chain attacks.
Qovery is a Kubernetes management platform designed to simplify operations and automate DevOps tasks using AI. It allows teams to manage cloud infrastructure efficiently, reducing complexity and eliminating vendor lock-in. The platform offers predictable pricing and integrates seamlessly with existing tools.
This article explains how to implement large-scale inference for language models using Kubernetes. It covers key concepts like batching strategies, performance metrics, and intelligent routing to optimize GPU usage. Practical deployment examples and challenges in managing inference are also discussed.
Amazon EKS now offers a Provisioned Control Plane that allows users to pre-allocate control plane capacity for predictable and high performance during demanding workloads. This feature provides multiple scaling tiers to ensure responsiveness during peak traffic without needing to scale dynamically. Users can monitor and adjust their control plane tier as workload requirements change.
SkyPilot is a platform that allows AI teams to run and manage workloads across various infrastructures like Kubernetes and cloud services. It offers an easy interface for job management, resource provisioning, and cost optimization, supporting multiple hardware configurations without code changes.
This article explains how Dapr, an open-source project under the CNCF, streamlines the development of microservices by automating common tasks like messaging, tracing, and observability. It also discusses Dapr's integration with other tools like KEDA for dynamic autoscaling in event-driven applications.
This article discusses the security challenges of deploying AI and machine learning workloads on Oracle Kubernetes Engine and Oracle Cloud Infrastructure. It highlights the shared responsibility model for security and outlines strategies for protecting against evolving threats, including runtime detection and posture management.
Brex faced significant challenges with unmaintained Kubernetes preview environments, leading to high costs and reduced developer productivity. By adopting Signadot, they created isolated sandboxes within a shared cluster, drastically improving testing efficiency and reducing infrastructure costs. This shift enhanced both developer experience and the quality of testing data.
This article outlines the ten new Alpha features in Kubernetes 1.35, focusing on enhancements aimed at improving AI workload orchestration and resource management. Key features include Device Binding Conditions, Mixed Version Proxy, and support for gang scheduling, which collectively enhance scheduling reliability and efficiency.
This article details how Klaviyo developed DART Jobs, a system that simplifies running distributed machine learning tasks using the Ray framework. It highlights the architecture, including the DART Jobs API, central database, and sync service, which together ensure reliable job management across multiple Kubernetes clusters.
This article explains how to use the AWS Secrets Manager Agent as a sidecar container in Amazon EKS. It details the benefits of caching secrets locally to reduce API calls and enhance application security. The post also covers the deployment steps, prerequisites, and IAM role configuration required for setup.
Kubernetes v1.35 introduces Extended Toleration Operators, allowing numeric comparisons for scheduling. This lets users set specific thresholds for tolerations, improving workload placement based on metrics like failure probability and performance. The new Gt and Lt operators enhance flexibility in managing workloads across on-demand and spot nodes.
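The numeric comparison can be sketched as a simplified matching function. This is a model for illustration, not the scheduler's implementation: it assumes that `Gt` matches when the taint's value is greater than the toleration's value and `Lt` when it is smaller, and the `failure-probability` key in the usage example is hypothetical.

```python
def tolerates(toleration, taint):
    """Return True if `toleration` matches `taint` (simplified model)."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    op = toleration.get("operator", "Equal")
    if op == "Exists":
        return True
    if op == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    if op in ("Gt", "Lt"):
        try:
            taint_value = int(taint["value"])
            limit = int(toleration["value"])
        except (KeyError, ValueError):
            return False  # non-numeric values never match a numeric operator
        # Assumed direction: the taint's value is compared against the
        # toleration's value.
        return taint_value > limit if op == "Gt" else taint_value < limit
    return False


if __name__ == "__main__":
    toleration = {"key": "failure-probability", "operator": "Lt",
                  "value": "5", "effect": "NoExecute"}
    taint = {"key": "failure-probability", "value": "3", "effect": "NoExecute"}
    print(tolerates(toleration, taint))
```

Under that assumption, a workload that should avoid unreliable spot nodes would carry an `Lt` toleration with its maximum acceptable threshold.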
This article explains Kubernetes metrics and their importance in monitoring cluster health and performance. It covers various types of metrics, such as cluster, node, pod, network, storage, and application metrics, along with tools for effective monitoring.
Tim Goodwin developed Kamera, a simulation tool for verifying Kubernetes controller logic. It helps developers understand and debug controller interactions without needing a live cluster, using model checking to explore execution paths and identify potential issues.
This article explains Istio Ambient Mode, a sidecarless service mesh designed to reduce operational complexity in Kubernetes environments. It highlights how this approach streamlines networking, security, and observability while improving efficiency and security through mutual TLS (mTLS) without requiring changes to application code.
This article discusses how Signadot enables local development by connecting machines to Kubernetes clusters. It emphasizes faster testing and validation processes, allowing developers to iterate quickly and efficiently without complex setups.
This article explores the challenges of scaling Next.js in Kubernetes and presents Watt as a solution. It details performance improvements, including faster request handling and better resource management, supported by benchmark results.
mirrord allows developers to run local processes in a Kubernetes context without deploying to the cloud. It connects your local environment to a selected pod, mirroring traffic and file interactions. It is available as a VS Code extension, an IntelliJ plugin, and a CLI tool.
Tigera has updated its visual identity to better reflect its mission and community focus. The new design features sharper lines and a modern aesthetic while maintaining its commitment to open-source networking and security for Kubernetes. This evolution includes updates across various platforms, including documentation and community channels.
This article explains how Kubernetes rolling updates work to deploy new application versions without downtime. It details the configuration options for managing Pod availability during updates, including how to roll back or pause updates when needed.
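The Pod availability options come down to simple arithmetic, sketched below. The helper names are illustrative. The rounding follows the documented Deployment behavior: a percentage `maxSurge` rounds up, a percentage `maxUnavailable` rounds down, and both default to 25%.

```python
import math


def resolve(value, replicas, round_up):
    """Resolve an int or a percentage string such as "25%" to a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return (min_available, max_total) Pods during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge


if __name__ == "__main__":
    print(rollout_bounds(10))  # (8, 13)
```

With 10 replicas and the defaults, at least 8 Pods stay available and at most 13 exist at any point in the rollout.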
Major is a platform for quickly creating internal applications connected to your existing systems. It offers features like fine-grained access control, self-hosting options, and integration with various data sources, ensuring security and performance.
This article explains how to simplify the management of Amazon EKS clusters using Kube Resource Orchestrator (kro) and AWS Controllers for Kubernetes (ACK). It details the process of creating interconnected resources for Kubernetes clusters, addressing dependency management, and enabling GitOps workflows for better operational efficiency.
Google Cloud successfully tested a 130,000-node Kubernetes cluster, doubling the previous limit. The article details the architectural innovations that enable this scale and the implications for AI workloads, including advanced job scheduling and optimized storage solutions.
Kubernetes 1.35 introduces significant changes to its security features, including the removal of cgroup v1 support and enhanced image pull verification. Users will need to review their RBAC policies and ensure proper credentials are in place to avoid potential issues during upgrades.
Amazon EKS and EKS Distro now support Kubernetes version 1.35, which includes features like In-Place Pod Resource Updates and PreferSameNode Traffic Distribution. Users can create new clusters or upgrade existing ones to this version through various tools. The update is available in all AWS regions, including GovCloud.
This guide offers practical strategies for improving Kubernetes operations, addressing common challenges like high cloud costs and security vulnerabilities. It covers resource management, service availability, security measures, and scaling techniques to help teams streamline their Kubernetes environments.
This article outlines how zero trust architecture addresses security challenges in cloud-native environments like Kubernetes. It emphasizes the need for strict authentication and authorization at every layer, ensuring that every request is verified regardless of network location. The piece also discusses implementing policies and security measures to protect shared infrastructures.
This guide outlines essential practices for deploying Kubernetes effectively in production environments. It addresses common challenges like high cloud costs, security vulnerabilities, and service interruptions, offering practical solutions to improve resource management and system reliability.
Signadot provides a platform for agile code validation using isolated sandboxes within Kubernetes. It allows developers and AI agents to quickly test and verify code changes in real-time, ensuring efficient workflows and reducing errors before merging.
HolmesGPT is an open-source AI tool designed to streamline troubleshooting in Kubernetes environments. It aggregates logs, metrics, and traces, helping on-call engineers diagnose issues faster by providing clear, actionable insights. The tool is extensible and community-driven, promoting collaboration in observability practices.
The OpenCost project outlined its achievements in 2025 and plans for 2026, including 11 releases that improved usability and multi-cloud cost tracking. Key advancements include an AI-ready MCP server for real-time cost analysis and ongoing community mentorship efforts. Future goals focus on tracking machine-learning workloads and enhancing supply chain security.
The Amazon EKS Auto Mode workshop offers hands-on training for deploying workloads using Amazon Elastic Kubernetes Service (EKS) Auto Mode, which simplifies Kubernetes operations on AWS. Participants will learn to enable Auto Mode, deploy applications, and manage upgrades while gaining insights into migrating existing workloads. The workshop is designed for users with a basic understanding of Kubernetes and is accessible through AWS accounts or hosted events.
Implementing guardrails around containerized large language models (LLMs) on Kubernetes is crucial for ensuring security and compliance. This involves setting resource limits, using namespaces for isolation, and implementing access controls to mitigate risks associated with running LLMs in a production environment. Properly configured guardrails can help organizations leverage the power of LLMs while maintaining operational integrity.
The article discusses the introduction of a new feature in Kubernetes v1.33 that ensures secrets are used to pull images securely. It highlights the significance of this update in enhancing security measures for container deployments. The feature is currently in alpha stage, indicating ongoing development and testing.
Thorium is a scalable file analysis and data generation platform that enables users to orchestrate various tools at scale, offering features like static and dynamic analysis sandboxes, a user-friendly interface, and a RESTful API. It supports multi-tenant permissions, full-text search, and the import of numerous analysis tools, making it suitable for both development and analytical purposes. Thorium is designed for deployment in Kubernetes clusters but can also run on a local machine with limited production capabilities.
OSDFIR Infrastructure facilitates the deployment and integration of various open-source digital forensics tools on Kubernetes clusters using Helm. It supports tools like Timesketch, Yeti, and GRR, enabling collaborative forensic analysis and incident response. Users can easily install and configure the infrastructure by following Helm commands and documentation provided in the repository.
The article provides a step-by-step guide for testing configuration scanners on a deliberately insecure Kubernetes deployment using Terraform and Helm. It outlines the setup of an EKS cluster with insecure application pods, detailing the commands needed for deployment, testing, and cleanup, while highlighting the various security vulnerabilities present in the deployed applications.
Pinterest encountered a significant performance issue during the migration of its search infrastructure, Manas, to Kubernetes, where one in a million search requests experienced latency spikes. The investigation revealed that cAdvisor’s memory monitoring processes were causing excessive contention, leading to these delays. The team resolved the issue by disabling a specific metric in cAdvisor, allowing them to continue their migration efforts without compromising performance.
The article discusses the transition to a self-service approach for connecting applications to datastores, highlighting the use of Kubernetes to automate credential management and rotation. By implementing mutating admission webhooks and init containers, developers can deploy applications without manual credential handling, enhancing security and efficiency. This allows developers to focus on writing code rather than managing datastore complexities.
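A mutating admission webhook of this kind answers each request with a base64-encoded JSON Patch. The sketch below shows only that response shape. The init container that gets injected and the `build_review_response` helper are hypothetical, not the article's implementation.

```python
import base64
import json


def build_review_response(request, init_container):
    """Build an AdmissionReview response that injects one init container."""
    pod = request["object"]
    if pod["spec"].get("initContainers"):
        # Append to the existing list.
        ops = [{"op": "add", "path": "/spec/initContainers/-",
                "value": init_container}]
    else:
        # JSON Patch cannot append to a missing list, so create it.
        ops = [{"op": "add", "path": "/spec/initContainers",
                "value": [init_container]}]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(ops).encode()).decode(),
        },
    }
```

The `uid` must echo the one in the incoming request. The webhook server, TLS setup, and credential-fetching logic are left out.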
The article compares the security features of AWS Elastic Kubernetes Service (EKS) and Google Kubernetes Engine (GKE), focusing on key areas such as identity and access management, network traffic control, configuration management, vulnerability management, and runtime threat detection. It highlights the differences in default settings and capabilities of both managed services, emphasizing aspects like IAM integration, firewall options, and runtime security tools.
Kubernetes v1.34 introduces beta support for PSI (Pressure Stall Information) metrics, allowing users to monitor resource pressure on nodes more effectively. This enhancement aims to provide better insights into resource allocation and improve overall cluster performance. The update includes detailed guidance on how to enable and use these metrics within Kubernetes environments.
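For context, the Linux kernel reports PSI as short text records such as those in `/proc/pressure/memory`. The sketch below parses that kernel format. It does not show how Kubernetes exposes the metrics, and the `parse_psi` helper is illustrative.

```python
def parse_psi(text):
    """Parse kernel PSI text into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result


if __name__ == "__main__":
    sample = (
        "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
        "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
    )
    print(parse_psi(sample)["some"]["avg10"])
```

The `some` line covers time in which at least one task was stalled on the resource, and `full` covers time in which all non-idle tasks were stalled. The `avg` fields are percentages over 10, 60, and 300 second windows.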
The blog post introduces the key features and improvements in Calico v3.31, focusing on the integration of eBPF (Extended Berkeley Packet Filter) and nftables, which enhance network performance and security. It highlights advancements in network policy management and observability, aiming to streamline Kubernetes networking capabilities.
The article provides an in-depth exploration of OrbStack, a tool designed to simplify container and Kubernetes development. It highlights the features, advantages, and potential use cases of OrbStack in streamlining the development process for developers working with containerized applications.
Amazon EKS has announced support for ultra-scale clusters with up to 100,000 nodes, enabling significant advancements in artificial intelligence and machine learning workloads. The enhancements include architectural improvements and optimizations in the etcd data store, API servers, and overall cluster management, allowing for better performance, scalability, and reliability for AI/ML applications.