How to Zero-Downtime Upgrade Kubernetes on Amazon EKS with EKSCTL
Upgrading an Amazon EKS Kubernetes cluster or worker nodes can be tricky. There are a lot of details and steps to follow to ensure the availability of your services is uninterrupted. Follow along while I show you just what is involved.
Kubernetes is the modern platform for building platforms, offering all the features a modern engineering-based organization needs to deploy, set up, maintain, scale, and monitor applications and services in the cloud.
This platform has regular improvements and security patches; to ensure customers don't keep around old and possibly insecure technologies, Amazon's managed Kubernetes service forces you to upgrade your cluster on a fairly regular cadence. Because of this, it's important to know how upgrades work and how to perform them with minimal to no impact on your company's operations and services. That is the focus of this article, and to help manage the process today we'll be using Amazon's recommended tool, EKSCTL.
Below are the steps necessary to perform zero-downtime upgrades with Amazon's managed Kubernetes service, EKS. However, some points of note I must stress before we begin...
- You should ALWAYS upgrade only one minor version at a time.
Eg: Do not go from 1.21 to 1.23 directly; go through the full process for 1.22 first.
- Go through ALL the steps below for each release, even if you need to do two releases in a row.
- A zero-downtime upgrade relies on you having set up and used Kubernetes in a way that supports high availability and zero downtime.
Specifically, you need to be running > 1 pod for every service, and you must have a PodDisruptionBudget on every critical service (a minimal example follows this list). If you use my recommended set of open-source Universal Helm Charts, and you specified > 1 pod for all services and/or turned on autoscaling, this is done for you automatically.
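For reference, a minimal PodDisruptionBudget looks something like this (the name and label selector are placeholders you would swap for one of your own services):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service        # Placeholder name; one PDB per critical service
spec:
  minAvailable: 1         # Always keep at least one pod up during voluntary disruptions (eg: node drains)
  selector:
    matchLabels:
      app: my-service     # Must match the pod labels of that service's Deployment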
Pre-upgrade - Validate Compatibility
Before you begin the upgrade process, it is critical to go through the following checklist to ensure backwards and forwards compatibility.
1. If using EKSCTL, which I highly recommend, update your machine to ensure you have the latest stable version of EKSCTL (on macOS with Homebrew, run brew upgrade eksctl && brew link --overwrite eksctl).
2. Go through all controllers and foundational components in your cluster (storage, network, AWS, DNS, mesh networks, ingress controllers, etc.) and ensure compatibility with the new version of EKS. Basically, you are looking for a version of each component that is compatible with both your current version of EKS and the next version of EKS; there is always a version with this compatibility, precisely to facilitate a zero-downtime upgrade. This includes services such as aws-ebs-csi-driver, aws-fsx-csi-driver, aws-efs-csi-driver, aws-load-balancer-controller, external-dns, aws-node-termination-handler, gitlab-runner (or GitHub Actions runner, or other CI/CD runners), metrics-server, cluster-autoscaler, ingress-nginx (or another ingress controller, eg: Kong/Ambassador/Traefik/etc), Prometheus, Kubernetes-Volume-Autoscaler, and Grafana. If your current version of any such controller is not compatible with both the current and next version of EKS, upgrade that controller to a version that is. Make sure the version you install supports both your old AND your new version of EKS; this may not necessarily be the latest release of that controller. Most tools in the Kubernetes ecosystem are intentionally engineered to support multiple incremental versions, specifically to support a zero-downtime upgrade pattern. (A quick way to inventory the versions you are currently running is shown after the commands below.)
3. (Optionally, if desired) Take this opportunity to upgrade every component you assessed in step 2 to the latest stable release that supports both EKS versions. This is purely per your preference, depending on the components and your willingness to accept more work if upgrading them causes regressions or issues you will need to handle.
4. Ensure the nodes in your cluster are running the latest patch release of your current Kubernetes version; this ensures forwards compatibility. The easiest way to do this is to roll out completely new nodes with EKSCTL that have exactly the same configuration as your old nodes, but a different name (use the same version-increment naming trick described below). EKSCTL automatically uses the latest patch release every time it creates a new node group (provided you updated EKSCTL to the latest version in step 1 above).
5. Review the Amazon EKS upgrade notes: https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html to ensure the version you're jumping to doesn't require manual steps. When EKS is upgraded, certain components occasionally MUST be at a specific version, and if they are not, downtime can occur. Pay close attention to this, specifically for the version you're about to upgrade to.
6. Finally, upgrade all EKS built-in components with help from EKSCTL, using the following commands. Make SURE you have updated EKSCTL before you do this.
eksctl utils update-kube-proxy --cluster CLUSTER_NAME_HERE # This is the "plan/preview"
eksctl utils update-kube-proxy --cluster CLUSTER_NAME_HERE --approve # This is the "apply"
eksctl utils update-coredns --cluster CLUSTER_NAME_HERE # This is the "plan/preview"
eksctl utils update-coredns --cluster CLUSTER_NAME_HERE --approve # This is the "apply"
# Finally check for the latest here: https://github.com/aws/amazon-vpc-cni-k8s/releases
# And copy/paste the command it shows you there similar to the following (Latest as of Nov 24, 2022)
kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.12.0/config/master/aws-k8s-cni.yaml
# And then the following commands to re-update the VPC CNI to the defined standard configuration (again from README.md)
## Enable the CNI plugin to manage network interfaces for pods
kubectl set env daemonset aws-node -n kube-system ENABLE_POD_ENI=true
## Allow kubelet liveness/readiness probes to succeed for pods that have a security group attached
kubectl patch daemonset aws-node -n kube-system -p '{"spec": {"template": {"spec": {"initContainers": [{"env":[{"name":"DISABLE_TCP_EARLY_DEMUX","value":"true"}],"name":"aws-vpc-cni-init"}]}}}}'
## Also set this to allow many more IP addresses to be assigned per node (meaning more pods with security groups allowed per node)
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
# After doing all these, make sure they all "settle" and get healthy. Debug them if not
kubectl --namespace kube-system get pods -w
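To inventory which controller versions you are currently running (step 2 above), a quick read-only check is to list the images behind every Deployment and DaemonSet and compare them against each project's compatibility matrix; something like this works:
# List every Deployment and DaemonSet and the container image(s) it runs
kubectl get deployments --all-namespaces -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,IMAGES:.spec.template.spec.containers[*].image'
kubectl get daemonsets --all-namespaces -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,IMAGES:.spec.template.spec.containers[*].image'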
Upgrade EKS (prepare)
First, let's do some preparation, auditing our environment so we can confirm everything still looks okay after the upgrade, and confirming our cluster is in a healthy place right now.
I like to run the following commands and save their output just for auditing/validation purposes later:
# Show all pods and their status
kubectl get pods --all-namespaces
# Show events that happened in the last hour
kubectl get events --all-namespaces
In the output above, what you are generally looking for is stability. If there are a bunch of pods unable to launch, or a bunch of troubling churn in the event stream, do not proceed with the upgrade; solve those problems first.
Assuming we are happy with the output of the above commands and consider our cluster stable (and this is a low-load time or a pre-agreed change window), we can proceed. Make sure to note down the output of the above commands somewhere, to refer to and compare against later.
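One simple way to "note down" this output is to redirect it to timestamped files you can compare against after the upgrade, for example:
# Save a pre-upgrade snapshot of pod status and recent events for later comparison
kubectl get pods --all-namespaces > pods-before-upgrade-$(date +%Y%m%d-%H%M).txt
kubectl get events --all-namespaces > events-before-upgrade-$(date +%Y%m%d-%H%M).txt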
Upgrade EKS (masters)
Next, we need to upgrade the AWS managed service, EKS.
This will perform a highly-available upgrade of their Kubernetes masters. There's nothing you need to do besides initiate this process; it is a fully managed, zero-downtime activity that Amazon handles for you. This is also one of the advantages of using Amazon's managed service instead of running a self-managed Kubernetes cluster.
WARNING: This process can take up to around an hour to perform; I usually find it takes around 25-30 minutes. Just let it finish; it runs automatically and in a highly-available fashion.
# Assuming you're using EKSCTL, preview/plan an upgrade
eksctl upgrade cluster --name CLUSTER_NAME_HERE --version 1.23
# Execute an upgrade
eksctl upgrade cluster --name CLUSTER_NAME_HERE --version 1.23 --approve
# NOTE: While waiting for the above to finish, update your `cluster.yaml` file with the new EKS version
If you aren't using EKSCTL, feel free to just use the AWS Console to perform the update. EKSCTL doesn't really add any magic to this part; it simply triggers a process that AWS runs on your behalf.
After this is finished, update the EKSCTL YAML file to reflect the version you just upgraded this cluster to, setting for example metadata.version: "1.23" in the cluster.yaml file (or whatever you've named it).
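For example, the top of the file would end up looking roughly like this (using the cluster name and region from the example config referenced throughout this article):
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: stage
  region: us-west-2
  version: "1.23"   # Bumped to match the version the masters were just upgraded to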
At this point, having upgraded only the masters (with the above command) should cause no outages or issues, as none of your traffic flows through them; they are merely the puppetmasters of Kubernetes. Your worker nodes, and any pods and services running on them, should be unaffected and generally unchanged. Still, do some basic (though not necessarily extensive) validation. At minimum, glance at the Kubernetes event log (kubectl get events) and at your pods in general (kubectl get pods --all-namespaces) to see if anything weird, or any odd failures, shows up.
Also, if you have a metrics and visualization suite up (eg: Prometheus + Grafana) it would be good to validate there that everything is still working perfectly, all services are receiving traffic, have good/healthy/successful HTTP response codes, etc.
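Another quick sanity check at this point is to confirm the API server is actually reporting the new version:
# The "Server Version" in the output should now report the new Kubernetes minor version (eg: v1.23.x)
kubectl version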
Upgrade EKS (workers)
Next, we need to upgrade the worker node(s) and node groups.
Every cluster has at least one node group, but typically many. Generally, what is recommended for scalability, and to avoid mismatching the AZ alignment of persistent volumes (as they are AZ-bound), is to have one node group per AZ. This is what we have standardized on here.
Technically, all you need to do is roll out the SAME set of node groups that you already have, but on the new EKS version. You will notice in my example EKSCTL config file that each node group name carries a version suffix, eg: "primary-spot-uw2a-v1". This is deliberate: by simply incrementing that "v1" value we can re-deploy the same node group with no other changes to EKSCTL, picking up the latest stable patch release and/or the new EKS version.
What I typically do is create two more node groups with a number increment on the version string. So for example in cluster.yaml add two new node groups below with something like the following:
# ASSUMING that this file only has "v1" for both node groups, just add these two below them (make sure of the AZ and naming)
- name: primary-spot-uw2a-v2
# Items for this node group only
availabilityZones: ["us-west-2a"]
# Import all defaults from above
<<: *spotNodeGroupDefaults
<<: *primarySpotInstanceConfiguration
- name: primary-spot-uw2b-v2
# Items for this node group only
availabilityZones: ["us-west-2b"]
# Import all defaults from above
<<: *spotNodeGroupDefaults
<<: *primarySpotInstanceConfiguration
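The <<: *spotNodeGroupDefaults and <<: *primarySpotInstanceConfiguration lines are YAML merge keys: they pull in shared settings that are defined once, earlier in the file, as YAML anchors. As a generic illustration only (the keys and values here are hypothetical, not a complete eksctl config), the mechanism works like this:
# Generic YAML anchor / merge-key illustration; your real defaults will differ
spotNodeGroupDefaults: &spotNodeGroupDefaults
  minSize: 1
  maxSize: 10
  volumeSize: 100
nodeGroups:
  - name: primary-spot-uw2a-v2
    availabilityZones: ["us-west-2a"]
    <<: *spotNodeGroupDefaults   # Merges minSize/maxSize/volumeSize into this node group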
If this syntax looks unfamiliar to you, I highly recommend you read this article on YAML Anchors and then review any of my example EKSCTL configurations to help grasp it. With that done, we will create these new node groups with the following command:
eksctl create nodegroup --config-file=./cluster.yaml
Only when that command has run successfully, and you can see the new nodes (on the new version) in kubectl, should you proceed. Your output should look similar to the following:
kubectl get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-172-20-123-3.us-west-2.compute.internal   Ready    <none>   26d     v1.22.12-eks-ba74326
ip-172-20-123-4.us-west-2.compute.internal   Ready    <none>   24d     v1.22.12-eks-ba74326
ip-172-20-123-5.us-west-2.compute.internal   Ready    <none>   2m24s   v1.23.13-eks-fb459a0   <-- New node
ip-172-20-123-6.us-west-2.compute.internal   Ready    <none>   78s     v1.23.13-eks-fb459a0   <-- New node
Validate new node receives/sends traffic
Before we continue, you need to validate that the ingresses and your services will work on the new node. This is an important step to take, because there are a few edge cases you may encounter where traffic will not flow into your new nodes from the AWS load balancers due to security group issues. Generally, the AWS Load Balancer Controller notices and fixes this, but this has happened to me enough times to warrant a manual check EVERY time I roll out new node groups when I need zero-downtime.
To validate this works properly, list your nodes with kubectl get nodes and cordon every node that is not on the latest version with kubectl cordon <nodename>. Now run kubectl get nodes again and make sure that your new nodes are active and your old nodes are cordoned. Your output should look as follows:
kubectl get nodes
NAME                                         STATUS                     ROLES    AGE     VERSION
ip-172-20-123-3.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   26d     v1.22.12-eks-ba74326
ip-172-20-123-4.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   24d     v1.22.12-eks-ba74326
ip-172-20-123-5.us-west-2.compute.internal   Ready                      <none>   2m24s   v1.23.13-eks-fb459a0   <-- New node
ip-172-20-123-6.us-west-2.compute.internal   Ready                      <none>   78s     v1.23.13-eks-fb459a0   <-- New node
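If you have many nodes, cordoning them one at a time gets tedious; a one-liner along these lines can cordon every old node in one pass (adjust the "v1.22" pattern to match your old kubelet version):
# Cordon every node still running the old kubelet version
kubectl get nodes --no-headers | grep 'v1.22' | awk '{print $1}' | xargs kubectl cordon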
Once this is done, you can follow one of two strategies.
First, you can drain one of the old nodes to force ALL of its pods to be replaced onto the new nodes; however, I find this to be a bit of a "blunt" process, and if there are issues then likely none of your pods will place/route properly. Alternatively, you can find your ingress controller pods (eg: the ingress-nginx controller pods) and delete one of them off an old node, causing it to be re-launched onto a different node. Because the only nodes that are not cordoned are your new nodes, it will be scheduled and launched on a new node.
# Get the pods (in whatever namespace your controllers are in)
kubectl get pods
NAME                                       READY   STATUS    RESTARTS   AGE
nginx-ingress-controller-86cdd954c-sdjjw   1/1     Running   0          7m18s
nginx-ingress-controller-86cdd954c-xkdg6   1/1     Running   0          7m18s
# Then delete the pod...
kubectl delete pod nginx-ingress-controller-86cdd954c-sdjjw
pod "nginx-ingress-controller-86cdd954c-sdjjw" deleted
# Then validate the new pod is launched on the new version node
kubectl get pods
# Then grab the new pod name and...
kubectl describe pod nginx-ingress-controller-86cdd954c-dkxd4 | grep -i node:
Node:         ip-172-20-123-5.us-west-2.compute.internal/172.20.123.5
# As long as this ^ node matches one of your new-version nodes from above, you now need to view its logs to validate it is both receiving traffic from the ALB successfully AND sending it on to other services properly...
kubectl logs nginx-ingress-controller-86cdd954c-dkxd4 --follow
# If you don't see traffic flowing through it here, you may need to debug. The first thing I'd do is restart the AWS Load Balancer Controller, which forces it to re-analyze your environment, autoscaler, node groups, etc., and adjust any Security Groups it needs to
# If that doesn't fix it, you will need to debug through the AWS Console to find where the traffic is getting stuck
# If things aren't working on this new node, this is a production cluster, and you don't have time to fix it now, pull your rip-cord and "undo": quickly uncordon your old, working nodes, cordon your new nodes, then use the "delete nodegroup" feature to delete your new node groups, and try again another time once you've worked out the issues
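If you opt for the drain strategy mentioned above instead, the command typically needs a couple of flags to get past DaemonSet pods and emptyDir volumes; roughly like this (using one of the example old node names from earlier):
# Drain one old node at a time; evictions respect your PodDisruptionBudgets
kubectl drain ip-172-20-123-3.us-west-2.compute.internal --ignore-daemonsets --delete-emptydir-data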
Once you have followed the above and validated that traffic is flowing through it successfully, we can proceed to removing our old nodes.
NOTE: You do NOT need to "uncordon" the old nodes above; leave them cordoned, because they will be removed shortly below.
Removing old nodes and node groups
Now we will proceed to removing our old nodes and node groups. This is pretty painless, fully automatic, and zero-downtime with EKSCTL, assuming you configured all your services to be highly available: running at least two pods, with an anti-affinity rule to ensure the pods land on different AWS AZs and different nodes. If you use our open-source Helm charts, this is automated and baked in to every chart.
So now you just need to identify the old node groups and delete them. For more precision/cautiousness, do this deletion one node group at a time. Assuming, as stated above, our old ones were suffixed v1, you would run:
# First we delete one node group from one AZ, letting it finish successfully before moving on. Start with the preview/plan...
eksctl delete nodegroup --config-file=./cluster.yaml --include "primary-spot-uw2a-v1"
# If the above "plan" looks good, then approve/apply it with...
eksctl delete nodegroup --config-file=./cluster.yaml --include "primary-spot-uw2a-v1" --approve
### NOTE: Before proceeding to more node groups, please validate that pods and all your services launch on your new nodes properly
### Compare against the output from the "prepare" steps above, checking the pods and the event log. Make sure everything looks stable
### I'm not joking: please run 'kubectl get pods --all-namespaces' and make sure everything is running properly and nothing is sitting "Pending"
### If there is stuff sitting Pending and you continue, you're likely to cause an outage. Don't say I didn't warn you...
# Then plan the deletion of the next node group...
eksctl delete nodegroup --config-file=./cluster.yaml --include "primary-spot-uw2b-v1"
# Similar to above, now approve/apply this with...
eksctl delete nodegroup --config-file=./cluster.yaml --include "primary-spot-uw2b-v1" --approve
# And any further if you have more AZs... (though, you should really only have 2 or 3)
After all the old node groups have been removed, you need to remove them from the EKSCTL config file (eg: cluster.yaml) so they do not get re-added. However, your v1 node groups (typically the first one) act as the YAML anchor template for all the others. So what I usually do now is delete the two newly-created v2 stanzas (from above) from the file (eg: the primary-spot-uw2a-v2 and primary-spot-uw2b-v2 stanzas), and then rename the original v1 stanzas to the new version string. Your "diff" on this file after an upgrade should look something like this:
diff --git a/cluster.yaml b/cluster.yaml
index c35b410..ea2378e 100644
--- a/cluster.yaml
+++ b/cluster.yaml
@@ -4,7 +4,7 @@ kind: ClusterConfig
metadata:
name: stage
region: us-west-2
- version: "1.22"
+ version: "1.23"
iam:
withOIDC: true
@@ -77,10 +77,10 @@ nodeGroups:
# <<: *primaryNodeGroupSettings
- # eksctl delete nodegroup --config-file=./cluster.yaml --include "primary-spot-uw2a-v1"
- # eksctl create nodegroup --config-file=./cluster.yaml --include "primary-spot-uw2a-v1"
+ # eksctl delete nodegroup --config-file=./cluster.yaml --include "primary-spot-uw2a-v2"
+ # eksctl create nodegroup --config-file=./cluster.yaml --include "primary-spot-uw2a-v2"
# Node Group #3: us-west-2a spot (cheap but still fast, at the risk of "spot death")
- - name: primary-spot-uw2a-v1
+ - name: primary-spot-uw2a-v2
# Items for this node group only
availabilityZones: ["us-west-2a"]
@@ -134,10 +134,10 @@ nodeGroups:
- arn:aws:iam::aws:policy/AmazonEKSVPCResourceController # Requires for pod security groups support
- # eksctl delete nodegroup --config-file=./cluster.yaml --include "primary-spot-uw2b-v1"
- # eksctl create nodegroup --config-file=./cluster.yaml --include "primary-spot-uw2b-v1"
+ # eksctl delete nodegroup --config-file=./cluster.yaml --include "primary-spot-uw2b-v2"
+ # eksctl create nodegroup --config-file=./cluster.yaml --include "primary-spot-uw2b-v2"
# Node Group #4: us-west-2a spot (cheap but still fast, at the risk of "spot death")
- - name: primary-spot-uw2b-v1
+ - name: primary-spot-uw2b-v2
# Items for this node group only
If that roughly matches what your changes look like, then you should be golden. Commit and push this file so that your EKSCTL config in Git is "in sync" with what is deployed, and so you can share/collaborate with your team.
WARNING: If the delete nodegroup command above stalls out because some pods are unable to be deleted for some reason, use kubectl in a separate terminal window to manually rectify the situation.
One edge case that sometimes occurs is that a Pod Disruption Budget (PDB) disallows evicting a pod from the node being removed. Kubernetes and EKSCTL are giving you pause here because they KNOW (or assume) that you are likely in a situation that will cause downtime, and they do not want to do that to you. So you either accept the downtime and delete the pod(s) manually, OR scale that service up to work around the PDB; or, if you cannot tolerate downtime for this pod, cancel the node group deletion and schedule it for a period in which you can tolerate downtime.
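To see which PodDisruptionBudget is blocking the eviction, and to work around it by temporarily scaling the affected workload up, commands along these lines help (the deployment name is a placeholder):
# Show all PDBs and how many disruptions each currently allows (an ALLOWED DISRUPTIONS of 0 is what blocks the eviction)
kubectl get pdb --all-namespaces
# Temporarily scale the affected workload up so the PDB can be satisfied while its old pod is evicted
kubectl scale deployment my-service --replicas=2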
Re-upgrade/update EKS built-in components
Next, after the masters and worker nodes have been upgraded to the new version, EKSCTL may need to further update certain built-in components.
This is the same set of commands explained above in the Pre-upgrade steps. It also needs to be done AFTER the EKS upgrade, because EKSCTL may recommend slightly different settings/configuration for the new version of EKS.
eksctl utils update-kube-proxy --cluster CLUSTER_NAME_HERE # This is the "plan/preview"
eksctl utils update-kube-proxy --cluster CLUSTER_NAME_HERE --approve # This is the "apply"
eksctl utils update-coredns --cluster CLUSTER_NAME_HERE # This is the "plan/preview"
eksctl utils update-coredns --cluster CLUSTER_NAME_HERE --approve # This is the "apply"
# NOTE: You do NOT need to re-apply the VPC CNI update (the copy/paste step from the Pre-upgrade section above)
Post-upgrade - Validate Compatibility
There is one final, tricky step: validate that everything still deploys properly. Meaning, however you deploy updates to your cluster (via Helm, kubectl apply, or otherwise), confirm that you can still perform a few service upgrades.
The reason we need to check this is that between Kubernetes releases, API versions of objects are sometimes deprecated or removed entirely. When this happens, the objects already stored in Kubernetes are automatically migrated from the previous/deprecated API version to the new one during the upgrade. However, this does not help the manifests you apply: if your charts or manifests still reference a removed API version, deploying them will fail.
Typically, I grab a few minor services and make a small change to force them through their full CI/CD pipeline of building images, publishing them, and deploying them into the cluster. If this completes successfully, then you are likely okay. If not, then some changes to the relevant Helm chart are necessary. Most well-engineered community Helm charts auto-detect and change the API version of these objects automatically for you. However, if you have engineered your own, you likely do not have this logic in there. If you want to resolve that, I've created these Universal Kubernetes Helm Charts for you; grab them on GitHub.
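If you want to check a specific manifest against the upgraded API server without actually deploying it, a server-side dry-run is a cheap way to catch removed API versions (the file path here is just an example):
# The API server will reject any apiVersion that was removed in the new Kubernetes release
kubectl apply --dry-run=server -f ./my-service-manifests.yaml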
Profit?
Good job, you did it! You have upgraded EKS with zero downtime.