Keep your cluster secure by following the best practices around Kubernetes upgrades.

Kubernetes Upgrades


The Kubernetes project releases new minor versions on a regular cadence of roughly every four months, or three releases per year. New releases can include additional features, bug fixes, and security enhancements, all of which are critical for infrastructure running in production environments.

While this frequent release cycle delivers a steady stream of useful functionality, users often struggle to keep up with the tasks involved in planning and executing regular cluster upgrades. Upstream changes to the Kubernetes project usually require steps such as reading release notes, upgrading other Kubernetes components for compatibility, resolving breaking API changes, mitigating downtime during cluster upgrades, and recovering from broken upgrades.

Due to the operational overhead involved with each cluster upgrade, many users run Kubernetes versions far behind the upstream recommended releases. Running outdated versions causes its own set of problems, such as buggy software, the inability to use the latest Kubernetes features, and forced cluster upgrades by cloud providers that enforce minimum version policies.

This article discusses how to manage the challenge of keeping clusters up to date effectively by building a high-quality upgrade strategy. An effective strategy involves analyzing upcoming changes in upstream Kubernetes releases, testing compatibility with running cluster workloads, rolling out changes safely, monitoring for potential issues, and recovering from disasters if necessary. Planning a strategy for approaching cluster upgrades enables users to gain confidence in their ability to stay current with upstream Kubernetes releases and to roll out changes safely and frequently.

Summary of key Kubernetes upgrade concepts

Keeping Kubernetes clusters up to date is essential due to the pace at which the project releases versions and the requirements set by the ecosystem of third-party projects. The table below summarizes more specific reasons for keeping clusters current and best practices for doing so.

Why staying updated is important

  • Maintaining compatibility with Kubernetes tools: The massive ecosystem of thousands of Kubernetes projects provides valuable functionality for Kubernetes users. Keeping clusters current is required to maintain compatibility with these tools.
  • Mitigating forced upgrades by cloud providers: Major cloud providers like AWS, GCP, and Azure will forcefully upgrade Kubernetes clusters, which is unwanted in production environments. Keeping clusters up to date prevents this problem.
  • New features, bug fixes, and security enhancements: Each new release of Kubernetes provides valuable features, a long list of bug fixes, and critical security patches. Ensuring that clusters stay up to date enables users to access this functionality sooner.

Kubernetes upgrade best practices

  • Plan upgrades ahead of time: Proper planning and documentation ensure that cluster upgrades are executed as safely and smoothly as possible and that lessons learned can be applied to future upgrades.
  • Test in non-production environments: Validating application compatibility with newer Kubernetes versions should be done in non-production environments before modifying production clusters.
  • Use rolling or blue/green node upgrades: Rolling out changes to worker nodes can be done with two common strategies; the appropriate choice will depend on upgrade speed, safety, and disaster recovery requirements.
  • Monitor the upgrade: Implementing observability tools allows users to gain clear insight into their clusters' behavior during upgrades, allowing for validation and effective troubleshooting.
  • Develop a disaster recovery plan: Recovering the cluster's state in case of upgrade failures is essential for maintaining application uptime.

Why staying updated is important

There are many reasons why users will benefit from keeping clusters up to date.

Maintaining compatibility with Kubernetes tools

The Kubernetes ecosystem includes a massive collection of high-value projects for enabling critical functionality in Kubernetes clusters, such as Istio, CoreDNS, Helm, and Prometheus. The developers of these projects are responsible for ensuring that their software is compatible with Kubernetes by keeping pace with changes to the Kubernetes API.

There are thousands of projects supporting the Kubernetes ecosystem.

The developers of Kubernetes-related projects typically support only a limited set of Kubernetes versions because the overhead of maintaining compatibility with older versions is too high. Therefore, most projects have a support policy that defines which specific Kubernetes versions their software supports and for how long.

For example, let’s look at the Istio project’s support policy: The latest release of Istio will only support the four most recent releases of Kubernetes. This allows the project developers to avoid the overhead of supporting old software while ensuring that users are aware of their responsibility to upgrade their Kubernetes cluster versions in a timely manner to maintain compatibility.

Due to the Kubernetes version skew policy, users also need to ensure that their cluster components run tightly coupled versions. For example, kubectl is only supported within one minor version of the API server, so using a 1.24 kubectl against a 1.28 cluster may not work. Worker node components such as the kubelet and kube-proxy are similarly constrained and cannot skew more than a few minor versions behind the control plane (two minor versions in older releases, three as of Kubernetes 1.28).
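
As a quick sanity check before an upgrade, the client, control plane, and node versions can be compared directly. A minimal sketch using standard kubectl commands (assuming a configured kubeconfig context):

    # Compare the kubectl client version against the API server version
    kubectl version

    # List the kubelet version reported by each worker node
    kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion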

Compatibility is a key reason for keeping Kubernetes cluster versions up to date. Almost all Kubernetes-related tooling will have a strict support policy like Istio’s, and users who fall behind on their cluster versions will be unable to use newer tool versions. A further risk is that users will typically not receive troubleshooting support from developers when running versions that are too old.

Mitigating forced upgrades by cloud providers

Similar to the point above, cloud providers like AWS, GCP, and Azure support only a limited number of Kubernetes versions and expect users to keep their clusters up to date through regular upgrades. Users running managed Kubernetes distributions who fall behind risk being forced to upgrade at an unexpected time: all three of these providers will automatically upgrade clusters still running Kubernetes versions that have fallen out of support.

Cloud provider    Support duration per version    Number of supported versions
Amazon EKS        14 months                       5
Google GKE        14 months                       5
Azure AKS         12 months                       4

Users may also fail to receive assistance via their cloud provider’s paid support when running outdated Kubernetes versions.

New features, bug fixes, and security enhancements

Each release of Kubernetes contains many changes that improve the functionality and stability of Kubernetes clusters. Releases involve changes to core components like the kubelet, kube-proxy, CoreDNS, containerd, the API server, and the controller managers, along with the many other core projects required for clusters to operate.

Users who upgrade their clusters frequently will benefit sooner from new features released by the upstream Kubernetes project, as well as from ecosystem tools that build new functionality on top of new Kubernetes APIs.

Security patches for the many Kubernetes cluster components are also released frequently to address reported vulnerabilities. The security of any workload running on a Kubernetes cluster depends heavily on the security of the cluster itself. Because vulnerabilities in Kubernetes and related tooling are patched regularly, keeping clusters current is an important part of maintaining a strong security posture.

Kubernetes upgrade best practices

Here are the best practices for keeping your Kubernetes installation current.

Plan upgrades ahead of time

Cluster upgrades are complex operational tasks that require proper planning to execute correctly. Developing a high-quality upgrade plan is important so that users can document and improve their upgrade processes over time. Iteratively learning from and improving the process helps reduce the impact and overhead of future upgrades.

Developing an upgrade plan may involve answering questions such as the following:

  • What breaking changes are occurring for the Kubernetes API based on the release notes published by the Kubernetes team? (A sketch for detecting deprecated API usage follows this list.)
  • What breaking changes are occurring for third-party projects utilized in the cluster based on the release notes provided by third-party project maintainers?
  • Are there any release notes published by the cloud provider (if the user is running a managed Kubernetes service) indicating that manual changes are required prior to upgrading?
  • What changes are required to the cluster and its applications to navigate the breaking changes identified above?
  • What are the organization's vendor support policies? For example, support for the operating system, infrastructure, or application components may have specific requirements to maintain an active support policy.
  • How long will it take to implement the changes to prepare for the upgrade?
  • Are there any new features from the upgraded version that may be beneficial for the user or the organization to implement?
  • For multi-tenant clusters, what notification needs to be communicated to others to ensure a smooth upgrade? For example, do tenants need to modify anything in their applications to maintain compatibility with the upgraded cluster?
  • How should the upgrade be tested and deployed?
  • What can be improved based on learnings from the last upgrade attempt?
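
As a hedged sketch of answering the first question above: the API server exposes a metric recording requests to deprecated APIs, and open-source scanners such as Fairwinds' Pluto can check manifests against a target version. The commands below assume cluster-admin access and that Pluto is installed; the manifest directory and target version are illustrative.

    # Show which deprecated APIs the cluster is still serving requests for
    kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis

    # Scan local manifests for APIs removed in the target version (paths are placeholders)
    pluto detect-files -d ./manifests --target-versions k8s=v1.29.0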

Any other upgrade aspects where the user may encounter challenges should be documented for future review. Revisiting the plan after an upgrade is complete will help validate whether the plan was useful for preparing for the upgrade and what areas of the plan should be revised to improve the upgrade process in the future.

Learning from each upgrade will help users refine their approach until it is smooth while simultaneously improving their confidence in executing upgrades in the future.

Test in non-production environments

Users should test new Kubernetes versions in non-production environments to validate their stability and compatibility before modifying production clusters. The goal is to minimize the risk of deploying breaking changes to production by detecting potential issues in a separate environment.

Users may follow these overall steps for testing in non-production environments:

  1. Create a Kubernetes cluster separate from the production environment. This cluster should run the same Kubernetes version and have the same infrastructure configuration and applications deployed. While this may not always be practical, aiming to create a replica cluster that mirrors production as closely as possible will help provide accurate test results. Infrastructure-as-code tools like Terraform help replicate clusters quickly without manual configuration.
  2. Upgrade the Kubernetes version of the non-production cluster. Document the steps taken to ensure any challenges and learnings are recorded for reference in the production cluster upgrade.
  3. Test and validate the workloads running in the upgraded non-production cluster. Automated integration and load tests will help validate that applications are running as expected. Standard tests include using tools like Apache Benchmark to load test web-based applications and validate their performance and stability (see the sketch after this list). Observability tools like Prometheus and Grafana provide data on how many pods are running successfully versus stuck in a pending or crashing state, allowing users to see whether pod stability changed following the upgrade. Application logs provide insight into error messages reported by applications, and users can watch for key error messages with tools like Fluent Bit. The approach to testing applications in the non-production cluster will vary depending on the application's design, but the overall objective is to ensure that the cluster upgrade has no adverse effect on the applications.
  4. Review, document, and improve the upgrade process based on the learnings from the above. Upgrading the non-production cluster will provide valuable data on how the upgrade progressed, what areas require special attention, what application issues might occur, and approximate timelines for how long the end-to-end upgrade process takes. Recording this data in a document will be valuable to reference when proceeding to the production clusters.
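
For step 3, a minimal post-upgrade smoke test might combine a quick load test with a check for unhealthy pods. The sketch below assumes Apache Benchmark (ab) is installed, and the application URL is a placeholder:

    # Load test a web endpoint: 5,000 requests, 50 concurrent
    ab -n 5000 -c 50 https://staging-app.example.com/

    # List any pods that are not Running or Succeeded after the upgrade
    kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded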

Testing cluster upgrades in a non-production environment is valuable for validating changes before modifying a production environment. Applying upgrades to production clusters directly without adequate testing is risky due to the number of changes introduced in every Kubernetes version.

Use rolling or blue/green node upgrades

Various upgrade strategies are available for determining how worker node changes are rolled out to a cluster. The two most common strategies are rolling upgrades and blue/green upgrades. Either option is better than upgrading an entire cluster in place all at once, given the significant risk and lack of rollback capability of a wholesale upgrade.

Users should evaluate an appropriate upgrade strategy based on their use case and requirements. The key factors are operational overhead, speed of upgrades, safety, and rollback capability. Starting with rolling upgrades is typically a good first strategy for new users.

Rolling upgrades

This approach involves upgrading the worker nodes gradually rather than all at once. Worker nodes may be upgraded one at a time or in batches, depending on the use case. For example, upgrading too few nodes at once will delay the upgrade process significantly in large clusters with hundreds of nodes. Upgrading in batches is a common approach to balance the speed and safety of the upgrade.

Rolling upgrades aim to incrementally deploy upgrades to a subset of nodes to allow time for testing. The behavior of applications running on the subset of new nodes can be observed to validate that they’re running correctly. If an issue occurs, the user benefits from only a small subset of nodes requiring downgrading or removal.

The rolling upgrade approach is a common strategy used to upgrade Kubernetes clusters because it balances speed, safety, and rollback capability. Rolling upgrades are typically implemented by launching upgraded worker nodes and then gradually cordoning and draining existing worker nodes to migrate pods onto upgraded nodes.
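
A minimal sketch of one rolling batch, assuming replacement nodes running the new version have already joined the cluster (node names are placeholders):

    # Upgrade one batch of old nodes: stop scheduling, then evict pods gracefully
    for node in worker-1 worker-2 worker-3; do
      kubectl cordon "$node"
      kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
    done

    # Validate that workloads landed on the new nodes before draining the next batch
    kubectl get pods --all-namespaces -o wide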

This diagram shows how nodes can be divided into batches to roll out upgrades in incremental groups. With this strategy, a subset of nodes is upgraded and validated before proceeding to the next batch.

Blue/green upgrades

This approach involves fully replicating the entire Kubernetes cluster. The replica cluster will have the same configuration and infrastructure as the initial cluster. All applications are deployed in the new cluster, which is then upgraded to the new Kubernetes version. Traffic is shifted from the old cluster to the new one, and the old cluster is terminated.

This upgrade strategy is more complex, slower, and less cost-effective than rolling upgrades. However, it provides a high degree of safety and rollback capability. The upgrade is safe because all upgrades are applied to a separate cluster replica that is not yet receiving any production traffic. Therefore, if the upgrade breaks any part of the cluster, it has no impact: The cluster can be disposed of and recreated if necessary. If anything goes wrong while traffic is being shifted from the old cluster to the newly upgraded one, the rollback plan is to simply shift traffic back. These benefits are significant for any use case where safety and rollback capabilities are highly critical. The added infrastructure cost and operational overhead involved with blue/green upgrades may be worthwhile for those use cases.

Incoming traffic is shifted to the new cluster running the upgraded Kubernetes version. Rolling back is as easy as shifting the traffic back. Traffic can also be shifted incrementally (e.g., 10% of requests) to validate the upgrade with a subset of traffic volume.
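
As one illustration of incremental shifting, a weighted DNS record on AWS Route 53 can send a small slice of traffic to the upgraded cluster's load balancer. The hosted zone ID, record names, and weight below are placeholders, and the old cluster's record would keep the remaining weight:

    # Send roughly 10% of traffic to the green (upgraded) cluster
    aws route53 change-resource-record-sets \
      --hosted-zone-id Z0000EXAMPLE \
      --change-batch '{
        "Changes": [{
          "Action": "UPSERT",
          "ResourceRecordSet": {
            "Name": "app.example.com",
            "Type": "CNAME",
            "SetIdentifier": "green-cluster",
            "Weight": 10,
            "TTL": 60,
            "ResourceRecords": [{"Value": "green-lb.example.com"}]
          }
        }]
      }'

Raising this record's weight (and lowering the old cluster's) completes the cutover; setting it back to zero rolls traffic back to the original cluster.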

Monitor the upgrade

Users will benefit from implementing observability tools for monitoring cluster upgrades and validating the state of the cluster.

Observability tools provide visibility into cluster metrics, logs, and traces, giving users insight into the behavior of the cluster and the applications running inside it. It is standard for any production cluster to implement appropriate observability tooling to monitor every aspect of the cluster.

Standard tools include Prometheus, Grafana, Fluent Bit, Alertmanager, and Jaeger. The overall objective of these types of tools is to ensure that users can analyze their clusters in any way necessary to maintain operational hygiene. These aspects may include audit logging for forensic analysis and security, analyzing performance bottlenecks, detecting failures and breakages in cluster infrastructure, and aggregating log data from applications.
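
As a sketch of a common starting point, the kube-prometheus-stack Helm chart bundles Prometheus, Grafana, and Alertmanager into one install. This assumes Helm is available; the release and namespace names are arbitrary:

    # Add the community chart repository and install a standard monitoring stack
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    helm install monitoring prometheus-community/kube-prometheus-stack \
      --namespace monitoring --create-namespace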

In the context of cluster upgrades, observability tools provide valuable data related to the following:

  • Obtaining an application performance baseline. Having a performance baseline before executing an upgrade allows users to compare the performance impact after the upgrade is complete. Any drops in performance may require investigation and further analysis (see the query sketch after this list).
  • Verifying the availability/uptime of the cluster’s applications. Users may want to have data to verify whether cluster upgrades are causing any application downtime. Observe running application behavior, such as error messages, dropped requests, latency spikes, etc., to determine if cluster upgrades are impacting applications. Analyzing the impact on running applications will help determine a root cause and implement a mitigation plan.
  • Verifying the cluster’s overall health during and after an upgrade. A cluster upgrade can only be validated as a success or failure if there’s observability data to confirm whether any problems occurred during the upgrade. Implementing tools to gather cluster data will help users gain confidence that their clusters were upgraded successfully without adverse impact.
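
As a small example of gathering baseline and health data, queries can be issued against the Prometheus HTTP API, assuming Prometheus with kube-state-metrics is running in the cluster (the hostname is a placeholder):

    # API server 5xx error rate over the last five minutes
    curl -s http://prometheus.example:9090/api/v1/query \
      --data-urlencode 'query=sum(rate(apiserver_request_total{code=~"5.."}[5m]))'

    # Pod counts per phase, to spot Pending or Failed spikes during the upgrade
    curl -s http://prometheus.example:9090/api/v1/query \
      --data-urlencode 'query=sum by (phase) (kube_pod_status_phase)'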

Implementing observability is a crucial aspect of running a production cluster and is particularly useful for validating cluster upgrades. Users will benefit from the ability to investigate potential upgrade-related issues, confirm whether upgrades were completed successfully, and obtain a baseline for what to expect during the cluster upgrade process.

Develop a disaster recovery plan

Despite all the planning and testing users may implement to prepare for an upgrade, the upgrade process may still cause issues for the cluster and its applications. This situation may require disaster recovery to revert the cluster to a working, usable, and stable state. A disaster recovery plan is vital for users running managed Kubernetes on cloud providers because providers typically don’t allow downgrading the cluster control plane version.

A disaster recovery plan aims to roll back and revert the state of the cluster and its applications as quickly and accurately as possible to mitigate downtime and potential data loss. The key elements of a recovery plan will include the following:

  • Backing up the cluster state. There are many tools available for backing up Kubernetes clusters. Velero is an example of an open-source project that can back up the Kubernetes objects running in a cluster as well as the data in any Persistent Volume resources. Deploying the cluster via infrastructure-as-code tools like Terraform makes backing up the cluster's configuration easier. Overall, the objective of backups is to ensure that the entire cluster can be replicated easily in an identical state if necessary. Backups should be tested periodically by restoring to a separate cluster to verify that they are valid (a backup/restore sketch follows this list).
  • Documenting the current state of the cluster. Combining documentation with backups will help users reassemble clusters in a disaster recovery situation. Useful items to document include any manual customizations applied to the cluster, architectural design decisions explaining why the cluster is set up a particular way, historical benchmarks from observability tools to validate the recovered cluster's state, the client-side tools and commands executed to set up and configure the cluster, and any other details required to build an exact replica of the broken cluster.
  • Monitoring observability tools to determine where problems are originating from. Examining data to determine the root cause of a problem caused by an upgrade may enable easier disaster recovery than trying to reset the cluster’s state. Observability tools can provide insight into what infrastructure or applications in a cluster are failing and can enable users to focus their investigations accordingly.
  • Recording any learnings from disaster recovery situations. This will ensure that these findings lead to improvements in addressing similar problems in the future.
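
A brief backup-and-restore sketch using Velero, assuming the Velero CLI and its in-cluster components are already installed (backup names are placeholders):

    # Take a full cluster backup before starting the upgrade
    velero backup create pre-upgrade-backup --wait

    # Verify that the backup completed successfully
    velero backup describe pre-upgrade-backup

    # Restore into a recovery cluster if the upgrade cannot be fixed forward
    velero restore create --from-backup pre-upgrade-backup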

Designing a proper rollback plan is essential for ensuring that cluster issues can be mitigated effectively. Recovering the cluster’s state via either root cause analysis or replicating the cluster’s infrastructure can be necessary for a disaster recovery plan, especially for production environments that are sensitive to downtime.

Summary

Maintaining up-to-date Kubernetes clusters is a challenging but necessary task that effective planning and execution make far more manageable. Understanding how to plan and prepare for cluster upgrades enables users to keep their clusters current with greater confidence, mitigate the technical debt and security risks associated with outdated clusters, avoid forced upgrades by cloud providers, and access new features sooner. Regular upgrades are unavoidable for any Kubernetes setup given the nature of the project and its ecosystem, so investing in a well-developed upgrade strategy will pay off significantly.
