跳至主要内容

[29] Why Does EKS Throw HTTP 502 Errors While Updating a Kubernetes Deployment (Part 4)

This post takes the two fixes covered in yesterday's entry, a preStop hook combined with terminationGracePeriodSeconds plus the AWS Load Balancer Controller's Pod Readiness Gates feature, and tests them against each of the following setups:

  • ALB as the Ingress endpoint
  • NLB as the Service endpoint

The environment below runs AWS Load Balancer Controller v2.4.3:

$ kubectl -n kube-system get deploy aws-load-balancer-controller -o yaml | grep "image: "
image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.3

ALB

1. Add the label elbv2.k8s.aws/pod-readiness-gate-inject=enabled to the namespace to enable the Pod Readiness Gates feature.
$ kubectl label namespace demo-nginx elbv2.k8s.aws/pod-readiness-gate-inject=enabled
namespace/demo-nginx labeled
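To double-check that injection is in effect, you can inspect the namespace label and, for any Pod created after labeling, look for a readiness gate whose conditionType starts with target-health.elbv2.k8s.aws/ (a quick sanity check; <pod-name> below is a placeholder):

$ kubectl get namespace demo-nginx --show-labels
$ kubectl -n demo-nginx get pod <pod-name> -o jsonpath='{.spec.readinessGates}'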
2. Update the nginx-demo-ingress.yaml manifest. The version below uses game 2048 as the test server: during earlier verification, the Nginx image terminated so quickly that it was hard to tell whether any HTTP 500-level errors had actually occurred, so it was swapped for the game 2048 image.

The game 2048 container takes roughly 40 seconds to terminate completely, so the preStop hook sleeps for about 40 seconds, and terminationGracePeriodSeconds is set to 60 seconds to allow an extra 20-second buffer on top of that (a sketch for observing this termination timeline follows the manifest).

$ cat ./nginx-demo-ingress.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: "demo-nginx"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: "demo-nginx-deployment"
  namespace: "demo-nginx"
spec:
  selector:
    matchLabels:
      app: "demo-nginx"
  replicas: 10
  template:
    metadata:
      labels:
        app: "demo-nginx"
        role: "backend"
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        # - image: nginx:latest
        - image: public.ecr.aws/l6m2t8p7/docker-2048:latest
          imagePullPolicy: Always
          name: "demo-nginx"
          ports:
            - containerPort: 80
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 40"]
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  name: "service-demo-nginx"
  namespace: "demo-nginx"
spec:
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: "demo-nginx"
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: "demo-nginx"
  name: "ingress-nginx"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  rules:
    - http:
        paths:
          - path: "/"
            pathType: Prefix
            backend:
              service:
                name: service-demo-nginx
                port:
                  number: 80
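As a rough sketch of the expected termination timeline (my reading of the settings above, not output from the test run): deleting a Pod starts the preStop sleep while the controller deregisters and drains the target, SIGTERM arrives at about t=40s, and the kubelet force-kills the container at t=60s if it has not exited by then. One way to observe this on a single Pod, with <demo-nginx-pod> as a placeholder name:

$ kubectl -n demo-nginx delete pod <demo-nginx-pod> --wait=false
$ kubectl -n demo-nginx get pod <demo-nginx-pod> -w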
3. Deploy the manifest.
$ kubectl apply -f ./nginx-demo-ingress.yaml
4. Use the following curl command to hit the ALB endpoint every 0.5 seconds.
$ counter=1; for i in {1..200}; do echo $counter; counter=$((counter+1)); sleep 0.5; curl -I k8s-demongin-ingressn-843da381ab-1111111111.eu-west-1.elb.amazonaws.com >> test-result.txt; done
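If full headers are more than you need, a variant like the one below logs a timestamp and status code per request, which makes 5xx responses easier to spot; this is an alternative sketch, not the loop used in this test:

$ for i in {1..200}; do echo "$(date +%T) $(curl -s -o /dev/null -w '%{http_code}' k8s-demongin-ingressn-843da381ab-1111111111.eu-west-1.elb.amazonaws.com)" >> test-result.txt; sleep 0.5; done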
5. At the same time as step 4, run the following in another terminal to update the image to the latest Nginx image.
kubectl -n demo-nginx set image deploy demo-nginx-deployment demo-nginx="nginx:latest" && kubectl -n demo-nginx rollout status deploy demo-nginx-deployment
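While the rollout runs, kubectl get pod -o wide includes a READINESS GATES column, so you can also watch new Pods stay unready until their ALB target passes its health checks:

$ kubectl -n demo-nginx get pods -o wide -w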

Results

No HTTP 500-level errors appear in the output, and the Server header shows responses gradually switching from nginx (the game 2048 image) to nginx/1.23.1:

$ grep "50[2|3|4]" ./test-result.txt

$ grep "Server: " ./test-result.txt
...
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx/1.23.1
...
...
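A side note on the grep pattern: inside a character class, 50[2|3|4] also matches a literal | and skips 500 and 501. A broader check for any 5xx status line in curl -I output would be something like:

$ grep -E "^HTTP/[0-9.]+ 5" ./test-result.txt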

NLB

1. Delete all resources in the demo-nginx namespace.
kubectl delete ns demo-nginx --grace-period=0
2. Create nginx-demo-nlb.yaml, reusing the same terminationGracePeriodSeconds and preStop settings as the ALB test.
$ cat ./nginx-demo-nlb.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: "demo-nginx"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: "demo-nginx-deployment"
  namespace: "demo-nginx"
spec:
  selector:
    matchLabels:
      app: "demo-nginx"
  replicas: 10
  template:
    metadata:
      labels:
        app: "demo-nginx"
        role: "backend"
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        # - image: nginx:latest
        - image: public.ecr.aws/l6m2t8p7/docker-2048:latest
          imagePullPolicy: Always
          name: "demo-nginx"
          ports:
            - containerPort: 80
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 40"]
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  name: "service-demo-nginx"
  namespace: "demo-nginx"
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  loadBalancerClass: service.k8s.aws/nlb
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: "demo-nginx"
3. Deploy the manifest.
$ kubectl apply -f ./nginx-demo-nlb.yaml
namespace/demo-nginx created
deployment.apps/demo-nginx-deployment created
service/service-demo-nginx created

$ kubectl -n demo-nginx get svc -o wide
NAME                         TYPE           CLUSTER-IP      EXTERNAL-IP                                                                      PORT(S)        AGE   SELECTOR
service/service-demo-nginx   LoadBalancer   10.100.58.251   k8s-demongin-serviced-0d3bf709e1-1111111111111111.elb.eu-west-1.amazonaws.com   80:30415/TCP   5m    app=demo-nginx
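Behind the scenes the controller reconciles this Service into a TargetGroupBinding custom resource, which points at the Target Group it manages; listing it is a quick way to find the Target Group referenced in the steps below:

$ kubectl -n demo-nginx get targetgroupbindings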
4. Add the label elbv2.k8s.aws/pod-readiness-gate-inject=enabled to the namespace to enable the Pod Readiness Gates feature.
$ kubectl label namespace demo-nginx elbv2.k8s.aws/pod-readiness-gate-inject=enabled
namespace/demo-nginx labeled

If the Pods do not pick up the new readiness condition, they need to be restarted:
$ kubectl -n demo-nginx rollout restart deploy demo-nginx-deployment
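Whether a given Pod has picked up the gate can be verified from its status conditions; a condition of type target-health.elbv2.k8s.aws/... (the exact suffix depends on the generated TargetGroupBinding name, and <pod-name> is a placeholder) must report True before the Pod counts as Ready:

$ kubectl -n demo-nginx get pod <pod-name> -o jsonpath='{.status.conditions}'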
5. Confirm that the Target Group uses the IP target type.
$ aws elbv2 describe-target-groups --load-balancer-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/net/k8s-demongin-serviced-0d3bf709e1/1111111111111111

{
    "TargetGroups": [
        {
            "TargetGroupArn": "arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-4a99c83f08/c9a702f211fbfc53",
            "TargetGroupName": "k8s-demongin-serviced-4a99c83f08",
            "Protocol": "TCP",
            "Port": 80,
            "VpcId": "vpc-0b150640c72b722db",
            "HealthCheckProtocol": "TCP",
            "HealthCheckPort": "traffic-port",
            "HealthCheckEnabled": true,
            "HealthCheckIntervalSeconds": 10,
            "HealthCheckTimeoutSeconds": 10,
            "HealthyThresholdCount": 3,
            "UnhealthyThresholdCount": 3,
            "LoadBalancerArns": [
                "arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/net/k8s-demongin-serviced-0d3bf709e1/1111111111111111"
            ],
            "TargetType": "ip",
            "IpAddressType": "ipv4"
        }
    ]
}

You can also check that the Target Ids are IP addresses.

$ aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-4a99c83f08/c9a702f211fbfc53 --query 'TargetHealthDescriptions[].Target' --output table
------------------------------------------------
|              DescribeTargetHealth            |
+-------------------+------------------+-------+
|  AvailabilityZone |        Id        | Port  |
+-------------------+------------------+-------+
|  eu-west-1a       |  192.168.0.223   |  80   |
|  eu-west-1c       |  192.168.44.248  |  80   |
|  eu-west-1a       |  192.168.19.10   |  80   |
|  eu-west-1c       |  192.168.41.36   |  80   |
|  eu-west-1b       |  192.168.76.79   |  80   |
|  eu-west-1b       |  192.168.83.233  |  80   |
|  eu-west-1a       |  192.168.22.141  |  80   |
|  eu-west-1b       |  192.168.80.209  |  80   |
|  eu-west-1b       |  192.168.66.152  |  80   |
|  eu-west-1b       |  192.168.66.129  |  80   |
|  eu-west-1c       |  192.168.55.7    |  80   |
|  eu-west-1c       |  192.168.61.246  |  80   |
|  eu-west-1c       |  192.168.49.142  |  80   |
|  eu-west-1c       |  192.168.45.23   |  80   |
|  eu-west-1b       |  192.168.67.240  |  80   |
+-------------------+------------------+-------+

6. Use the following curl command to hit the NLB endpoint every 0.5 seconds. Newly registered NLB targets take longer to pass their initial health checks, so the test window is extended (see also the waiter sketch after the command).
counter=1; for i in {1..600}; do echo $counter; counter=$((counter+1)); sleep 0.5; curl -I k8s-demongin-serviced-0d3bf709e1-1111111111111111.elb.eu-west-1.amazonaws.com >> test-nlb-result.txt; done
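Rather than guessing how long the initial health checks take, you could also block until every registered target is in service with the AWS CLI waiter below, reusing the Target Group ARN from step 5:

$ aws elbv2 wait target-in-service --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-4a99c83f08/c9a702f211fbfc53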
7. At the same time as step 6, run the following in another terminal to update the image to the latest Nginx image.
kubectl -n demo-nginx set image deploy demo-nginx-deployment demo-nginx="nginx:latest" && kubectl -n demo-nginx rollout status deploy demo-nginx-deployment

Results

While the Pods are being replaced, checking the Target Group health status with the AWS CLI command aws elbv2 describe-target-health shows targets moving through the initial and draining states:

$ aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-4a99c83f08/c9a702f211fbfc53 --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' --output table
--------------------------------
|     DescribeTargetHealth     |
+-----------------+------------+
|  192.168.48.65  |  draining  |
|  192.168.0.136  |  initial   |
|  192.168.25.57  |  healthy   |
|  192.168.84.107 |  healthy   |
|  192.168.22.132 |  draining  |
|  192.168.69.15  |  healthy   |
|  192.168.56.52  |  draining  |
|  192.168.19.10  |  draining  |
|  192.168.41.36  |  initial   |
|  192.168.80.209 |  initial   |
|  192.168.66.152 |  healthy   |
|  192.168.48.198 |  healthy   |
|  192.168.56.132 |  healthy   |
|  192.168.66.129 |  healthy   |
|  192.168.55.7   |  initial   |
|  192.168.49.142 |  draining  |
|  192.168.67.240 |  initial   |
|  192.168.45.23  |  healthy   |
+-----------------+------------+
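To follow these state transitions continuously instead of re-running the command by hand, one option is to wrap it in watch (a generic convenience, not part of the original test):

$ watch -n 5 "aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-4a99c83f08/c9a702f211fbfc53 --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' --output table"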

Inspecting the output in test-nlb-result.txt shows no HTTP 500-level errors, and the Server header again gradually switches from nginx to nginx/1.23.1:

$ grep "50[2|3|4]" ./test-nlb-result.txt

$ grep "Server: " ./test-nlb-result.txt
...
...
...
Server: nginx
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx/1.23.1
Server: nginx
Server: nginx
Server: nginx/1.23.1
Server: nginx
...
...
...

Summary

Whether the service is exposed through an ALB as the Ingress endpoint or an NLB as the Service endpoint, the combination of a preStop hook (with a matching terminationGracePeriodSeconds) and the Pod Readiness Gates feature keeps the deployment from hitting HTTP 500-level errors during a rolling update.