
K8S Troubleshooting: apiserver Token Expiration After Upgrading K8S

Background

After a K8S cluster had been running stably for a year, a restarted pod got stuck in the ContainerCreating status:

[root@node1 ~]# kubectl get pod -n kube-system -owide
NAME READY STATUS RESTARTS AGE IP NODE
calico-kube-controllers-cd96b6c89-bpjp6 1/1 Running 0 40h 10.10.0.1 node3
calico-node-ffsz8 1/1 Running 0 14s 10.10.0.1 node3
calico-node-nsmwl 1/1 Running 0 14s 10.10.0.2 node2
calico-node-w4ngt 1/1 Running 0 14s 10.10.0.1 node1
coredns-55c8f5fd88-hw76t 1/1 Running 1 260d 192.168.135.55 node3
xxx-55c8f5fd88-vqwbz 1/1 ContainerCreating 1 319d 192.168.104.22 node2

Analysis

Inspect the pod with describe

[root@node1 ~]# kubectl describe pod -n xxx xxx
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 52m default-scheduler Successfully assigned xxx/xxx to node1
Warning FailedCreatePodSandBox 52m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "xxx" network for pod "xxx": networkPlugin cni failed to set up pod "xxx" network: connection is unauthorized: Unauthorized, failed to clean up sandbox container "xxx" network for pod "xxx": networkPlugin cni failed to teardown pod "xxx" network: error getting ClusterInformation: connection is unauthorized: Unauthorized]
Normal SandboxChanged 50m (x10 over 52m) kubelet Pod sandbox changed, it will be killed and re-created.

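The Unauthorized in the events means the caller has no permission to fetch the relevant information from kube-apiserver. Inspecting the token used by the corresponding pod shows that it does carry an expiration time: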

{
  alg: "RS256",
  kid: "nuXGyK2zjFNBRnO1ayeOxJDm_luMf4eqQFnqJbsVl7I"
}.
{
  aud: [
    "https://kubernetes.default.svc.cluster.local"
  ],
  exp: 1703086264, // expiration time: the token expires one year after it was issued
  iat: 1671550264,
  nbf: 1671550264,
  iss: "https://kubernetes.default.svc.cluster.local",
  kubernetes.io: {
    namespace: "kube-system",
    pod: {
      name: "xxx",
      uid: "c7300d73-c716-4bbc-ad2b-80353d99073b"
    },
    serviceaccount: {
      name: "multus",
      uid: "1600e098-6a86-4296-8410-2051d45651ce"
    },
    warnafter: 1671553871
  },
  sub: "system:serviceaccount:kube-system:xxx"
}.
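
The one-year lifetime can be checked directly from the claims. The following is a minimal sketch in Python, not part of the original investigation: the helper names `decode_jwt_payload` and `token_lifetime_seconds` are hypothetical, and the payload is decoded without verifying the signature, which is only suitable for inspection.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload (second segment) of a JWT without verifying it."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded with padding stripped; restore it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def token_lifetime_seconds(claims: dict) -> int:
    """Return the validity period in seconds (exp minus iat)."""
    return claims["exp"] - claims["iat"]


# exp and iat copied from the payload shown above.
claims = {"exp": 1703086264, "iat": 1671550264}
print(token_lifetime_seconds(claims) / 86400)  # 365.0 (days)
```

With a real token, `decode_jwt_payload` could be applied to the contents of the token file mounted in the pod to obtain the same claims.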

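The related issues [1, 2, 3] largely confirm that the problem **is caused by the K8S version upgrade**: to provide a more secure token mechanism, the BoundServiceAccountTokenVolume feature was promoted to beta and enabled by default starting with v1.21.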

Solution

  1. If you do not want to use this feature, disable it by adding the feature gate to the kube-apiserver and kube-controller-manager components, following the method described in [4]:
1. How can this feature be enabled / disabled in a live cluster?
Feature gate name: BoundServiceAccountTokenVolume
Components depending on the feature gate: kube-apiserver and kube-controller-manager
Will enabling / disabling the feature require downtime of the control plane? yes, need to restart kube-apiserver and kube-controller-manager.
Will enabling / disabling the feature require downtime or reprovisioning of a node? no.
2. Does enabling the feature change any default behavior? yes, pods' service account tokens will expire after 1 year by default and are not stored as Secrets any more.
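  2. If you do want to use this feature, whatever consumes the token has to adapt: either do not cache the token, or automatically reload a fresh token into memory once the cached one becomes invalid. Recent versions of the client-go and fabric8 clients are known to support this.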
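
For a client that manages the token itself, the "do not cache, or refresh automatically" requirement might look roughly like the sketch below. It is written in Python for brevity and is not how client-go or fabric8 implement it. The class `ReloadingToken` and its parameters are hypothetical, and the sketch assumes that the token file mounted in the pod is kept up to date on disk, so re-reading it yields a valid token.

```python
import time
from pathlib import Path

# Standard mount path of the service account token inside a pod.
TOKEN_PATH = Path("/var/run/secrets/kubernetes.io/serviceaccount/token")


class ReloadingToken:
    """Hypothetical helper: re-reads the token file when the cached copy is stale."""

    def __init__(self, path=TOKEN_PATH, max_age_seconds=60.0, clock=time.monotonic):
        self._path = Path(path)
        self._max_age = max_age_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._token = None
        self._loaded_at = 0.0

    def get(self) -> str:
        """Return the token, reloading it from disk if the cache is empty or too old."""
        now = self._clock()
        if self._token is None or now - self._loaded_at >= self._max_age:
            self._token = self._path.read_text().strip()
            self._loaded_at = now
        return self._token

    def invalidate(self) -> None:
        """Drop the cached token, e.g. after the API server answers 401 Unauthorized."""
        self._token = None
```

A caller would use `ReloadingToken().get()` to build the `Authorization: Bearer ...` header for every request, and call `invalidate()` before retrying when a request comes back Unauthorized. The 60-second maximum age is an arbitrary value chosen for illustration.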

References

  1. https://github.com/k8snetworkplumbingwg/multus-cni/issues/852
  2. https://github.com/projectcalico/calico/issues/5712
  3. https://www.cnblogs.com/bystander/p/rancher-jian-kong-bu-xian-shi-jian-kong-shu-ju-wen.html
  4. https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/1205-bound-service-account-tokens/README.md