跳至主要内容

[29] Why Does EKS Throw HTTP 502 Errors While Updating a Kubernetes Deployment (Part 4)

This post takes the two fixes covered in yesterday's entry, a preStop hook combined with terminationGracePeriodSeconds plus the AWS Load Balancer Controller's Pod Readiness Gates feature, and tests them against each of the following setups:

  • ALB as the Ingress endpoint
  • NLB as the Service endpoint

The environment below runs AWS Load Balancer Controller v2.4.3:

$ kubectl -n kube-system get deploy aws-load-balancer-controller -o yaml | grep "image: "
image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.3

ALB

1. Add the label elbv2.k8s.aws/pod-readiness-gate-inject=enabled to the namespace to enable the Pod Readiness Gates feature.
$ kubectl label namespace demo-nginx elbv2.k8s.aws/pod-readiness-gate-inject=enabled
namespace/demo-nginx labeled
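To double-check that injection is in effect, you can inspect the namespace label and, for any Pod created after labeling, look for a readiness gate whose conditionType starts with target-health.elbv2.k8s.aws/ (a quick sanity check; <pod-name> below is a placeholder):

$ kubectl get namespace demo-nginx --show-labels
$ kubectl -n demo-nginx get pod <pod-name> -o jsonpath='{.spec.readinessGates}'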
2. Update the nginx-demo-ingress.yaml manifest. The version below uses game 2048 as the test server: during earlier verification, the Nginx image terminated so quickly that it was hard to tell whether any HTTP 500-level errors had actually occurred, so it was swapped for the game 2048 image.

The game 2048 container takes roughly 40 seconds to terminate completely, so the preStop hook sleeps for about 40 seconds, and terminationGracePeriodSeconds is set to 60 seconds to allow an extra 20-second buffer on top of that (a sketch for observing this termination timeline follows the manifest).

$ cat ./nginx-demo-ingress.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: "demo-nginx"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: "demo-nginx-deployment"
  namespace: "demo-nginx"
spec:
  selector:
    matchLabels:
      app: "demo-nginx"
  replicas: 10
  template:
    metadata:
      labels:
        app: "demo-nginx"
        role: "backend"
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        # - image: nginx:latest
        - image: public.ecr.aws/l6m2t8p7/docker-2048:latest
          imagePullPolicy: Always
          name: "demo-nginx"
          ports:
            - containerPort: 80
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 40"]
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  name: "service-demo-nginx"
  namespace: "demo-nginx"
spec:
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: "demo-nginx"
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: "demo-nginx"
  name: "ingress-nginx"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  rules:
    - http:
        paths:
          - path: "/"
            pathType: Prefix
            backend:
              service:
                name: service-demo-nginx
                port:
                  number: 80
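As a rough sketch of the expected termination timeline (my reading of the settings above, not output from the test run): deleting a Pod starts the preStop sleep while the controller deregisters and drains the target, SIGTERM arrives at about t=40s, and the kubelet force-kills the container at t=60s if it has not exited by then. One way to observe this on a single Pod, with <demo-nginx-pod> as a placeholder name:

$ kubectl -n demo-nginx delete pod <demo-nginx-pod> --wait=false
$ kubectl -n demo-nginx get pod <demo-nginx-pod> -w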
3. Deploy the manifest.
$ kubectl apply -f ./nginx-demo-ingress.yaml
4. Use the following curl command to hit the ALB endpoint every 0.5 seconds.
$ counter=1; for i in {1..200}; do echo $counter; counter=$((counter+1)); sleep 0.5; curl -I k8s-demongin-ingressn-843da381ab-1111111111.eu-west-1.elb.amazonaws.com >> test-result.txt; done
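If full headers are more than you need, a variant like the one below logs a timestamp and status code per request, which makes 5xx responses easier to spot; this is an alternative sketch, not the loop used in this test:

$ for i in {1..200}; do echo "$(date +%T) $(curl -s -o /dev/null -w '%{http_code}' k8s-demongin-ingressn-843da381ab-1111111111.eu-west-1.elb.amazonaws.com)" >> test-result.txt; sleep 0.5; done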
5. At the same time as step 4, run the following in another terminal to update the image to the latest Nginx image.
kubectl -n demo-nginx set image deploy demo-nginx-deployment demo-nginx="nginx:latest" && kubectl -n demo-nginx rollout status deploy demo-nginx-deployment
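While the rollout runs, kubectl get pod -o wide includes a READINESS GATES column, so you can also watch new Pods stay unready until their ALB target passes its health checks:

$ kubectl -n demo-nginx get pods -o wide -w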

Results

No HTTP 500-level errors appear in the output, and the Server header shows responses gradually switching from nginx (the game 2048 image) to nginx/1.23.1:

$ grep "50[2|3|4]" ./test-result.txt

$ grep "Server: " ./test-result.txt
...
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx/1.23.1
...
...
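A side note on the grep pattern: inside a character class, 50[2|3|4] also matches a literal | and skips 500 and 501. A broader check for any 5xx status line in curl -I output would be something like:

$ grep -E "^HTTP/[0-9.]+ 5" ./test-result.txt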

NLB

1. Delete all resources in the demo-nginx namespace.
kubectl delete ns demo-nginx --grace-period=0
2. Create nginx-demo-nlb.yaml, reusing the same terminationGracePeriodSeconds and preStop settings as the ALB test.
$ cat ./nginx-demo-nlb.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: "demo-nginx"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: "demo-nginx-deployment"
  namespace: "demo-nginx"
spec:
  selector:
    matchLabels:
      app: "demo-nginx"
  replicas: 10
  template:
    metadata:
      labels:
        app: "demo-nginx"
        role: "backend"
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        # - image: nginx:latest
        - image: public.ecr.aws/l6m2t8p7/docker-2048:latest
          imagePullPolicy: Always
          name: "demo-nginx"
          ports:
            - containerPort: 80
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 40"]
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  name: "service-demo-nginx"
  namespace: "demo-nginx"
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  loadBalancerClass: service.k8s.aws/nlb
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: "demo-nginx"
3. Deploy the manifest.
$ kubectl apply -f ./nginx-demo-nlb.yaml
namespace/demo-nginx created
deployment.apps/demo-nginx-deployment created
service/service-demo-nginx created

$ kubectl -n demo-nginx get svc -o wide
NAME                         TYPE           CLUSTER-IP      EXTERNAL-IP                                                                      PORT(S)        AGE   SELECTOR
service/service-demo-nginx   LoadBalancer   10.100.58.251   k8s-demongin-serviced-0d3bf709e1-1111111111111111.elb.eu-west-1.amazonaws.com   80:30415/TCP   5m    app=demo-nginx
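Behind the scenes the controller reconciles this Service into a TargetGroupBinding custom resource, which points at the Target Group it manages; listing it is a quick way to find the Target Group referenced in the steps below:

$ kubectl -n demo-nginx get targetgroupbindings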
4. Add the label elbv2.k8s.aws/pod-readiness-gate-inject=enabled to the namespace to enable the Pod Readiness Gates feature.
$ kubectl label namespace demo-nginx elbv2.k8s.aws/pod-readiness-gate-inject=enabled
namespace/demo-nginx labeled

If the Pods do not pick up the new readiness condition, they need to be restarted:
$ kubectl -n demo-nginx rollout restart deploy demo-nginx-deployment
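Whether a given Pod has picked up the gate can be verified from its status conditions; a condition of type target-health.elbv2.k8s.aws/... (the exact suffix depends on the generated TargetGroupBinding name, and <pod-name> is a placeholder) must report True before the Pod counts as Ready:

$ kubectl -n demo-nginx get pod <pod-name> -o jsonpath='{.status.conditions}'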
5. Confirm that the Target Group uses the IP target type.
$ aws elbv2 describe-target-groups --load-balancer-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/net/k8s-demongin-serviced-0d3bf709e1/1111111111111111

{
    "TargetGroups": [
        {
            "TargetGroupArn": "arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-4a99c83f08/c9a702f211fbfc53",
            "TargetGroupName": "k8s-demongin-serviced-4a99c83f08",
            "Protocol": "TCP",
            "Port": 80,
            "VpcId": "vpc-0b150640c72b722db",
            "HealthCheckProtocol": "TCP",
            "HealthCheckPort": "traffic-port",
            "HealthCheckEnabled": true,
            "HealthCheckIntervalSeconds": 10,
            "HealthCheckTimeoutSeconds": 10,
            "HealthyThresholdCount": 3,
            "UnhealthyThresholdCount": 3,
            "LoadBalancerArns": [
                "arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/net/k8s-demongin-serviced-0d3bf709e1/1111111111111111"
            ],
            "TargetType": "ip",
            "IpAddressType": "ipv4"
        }
    ]
}

You can also check that the Target Ids are IP addresses.

$ aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-4a99c83f08/c9a702f211fbfc53 --query 'TargetHealthDescriptions[].Target' --output table
------------------------------------------------
|              DescribeTargetHealth            |
+-------------------+------------------+-------+
|  AvailabilityZone |        Id        | Port  |
+-------------------+------------------+-------+
|  eu-west-1a       |  192.168.0.223   |  80   |
|  eu-west-1c       |  192.168.44.248  |  80   |
|  eu-west-1a       |  192.168.19.10   |  80   |
|  eu-west-1c       |  192.168.41.36   |  80   |
|  eu-west-1b       |  192.168.76.79   |  80   |
|  eu-west-1b       |  192.168.83.233  |  80   |
|  eu-west-1a       |  192.168.22.141  |  80   |
|  eu-west-1b       |  192.168.80.209  |  80   |
|  eu-west-1b       |  192.168.66.152  |  80   |
|  eu-west-1b       |  192.168.66.129  |  80   |
|  eu-west-1c       |  192.168.55.7    |  80   |
|  eu-west-1c       |  192.168.61.246  |  80   |
|  eu-west-1c       |  192.168.49.142  |  80   |
|  eu-west-1c       |  192.168.45.23   |  80   |
|  eu-west-1b       |  192.168.67.240  |  80   |
+-------------------+------------------+-------+

6. Use the following curl command to hit the NLB endpoint every 0.5 seconds. Newly registered NLB targets take longer to pass their initial health checks, so the test window is extended (see also the waiter sketch after the command).
counter=1; for i in {1..600}; do echo $counter; counter=$((counter+1)); sleep 0.5; curl -I k8s-demongin-serviced-0d3bf709e1-1111111111111111.elb.eu-west-1.amazonaws.com >> test-nlb-result.txt; done
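Rather than guessing how long the initial health checks take, you could also block until every registered target is in service with the AWS CLI waiter below, reusing the Target Group ARN from step 5:

$ aws elbv2 wait target-in-service --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-4a99c83f08/c9a702f211fbfc53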
7. At the same time as step 6, run the following in another terminal to update the image to the latest Nginx image.
kubectl -n demo-nginx set image deploy demo-nginx-deployment demo-nginx="nginx:latest" && kubectl -n demo-nginx rollout status deploy demo-nginx-deployment

Results

While the Pods are being replaced, checking the Target Group health status with the AWS CLI command aws elbv2 describe-target-health shows targets moving through the initial and draining states:

$ aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-4a99c83f08/c9a702f211fbfc53 --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' --output table
--------------------------------
|     DescribeTargetHealth     |
+-----------------+------------+
|  192.168.48.65  |  draining  |
|  192.168.0.136  |  initial   |
|  192.168.25.57  |  healthy   |
|  192.168.84.107 |  healthy   |
|  192.168.22.132 |  draining  |
|  192.168.69.15  |  healthy   |
|  192.168.56.52  |  draining  |
|  192.168.19.10  |  draining  |
|  192.168.41.36  |  initial   |
|  192.168.80.209 |  initial   |
|  192.168.66.152 |  healthy   |
|  192.168.48.198 |  healthy   |
|  192.168.56.132 |  healthy   |
|  192.168.66.129 |  healthy   |
|  192.168.55.7   |  initial   |
|  192.168.49.142 |  draining  |
|  192.168.67.240 |  initial   |
|  192.168.45.23  |  healthy   |
+-----------------+------------+
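To follow these state transitions continuously instead of re-running the command by hand, one option is to wrap it in watch (a generic convenience, not part of the original test):

$ watch -n 5 "aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-4a99c83f08/c9a702f211fbfc53 --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' --output table"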

Inspecting the output in test-nlb-result.txt shows no HTTP 500-level errors, and the Server header again gradually switches from nginx to nginx/1.23.1:

$ grep "50[2|3|4]" ./test-nlb-result.txt

$ grep "Server: " ./test-nlb-result.txt
...
...
...
Server: nginx
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
...
...
...

Summary

Whether the service is exposed through an ALB as the Ingress endpoint or an NLB as the Service endpoint, the combination of a preStop hook (with a matching terminationGracePeriodSeconds) and the Pod Readiness Gates feature keeps the deployment from hitting HTTP 500-level errors during a rolling update.