
K8S Troubleshooting: Pod Stuck in Terminating After Deletion

Background

After deleting a business Pod with kubectl delete, the Pod stayed in the Terminating state indefinitely.
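For reference, the stuck state looks like this from the client side (same pod and namespace as in the rest of this post); -o wide also reveals which node the Pod runs on, and therefore which kubelet to inspect:

# Watch the Pod: it never leaves the Terminating state
kubectl get pod -n xxx cam1-78b6fc6bc8-cjsw5 -o wide -w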

Root Cause Analysis

Judging from the symptom, some step in the deletion flow must have gone wrong, leaving the deletion stuck at Terminating. Start with a describe:

[root@node1 ~]# kubectl describe pod -n xxx cam1-78b6fc6bc8-cjsw5
// Nothing unusual was found, so the output is not pasted here

The Events showed nothing unusual, so the next step is the logs of the kubelet component, which performs the actual deletion on the node.
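How the raw kubelet logs are pulled depends on the environment; two common variants are shown below (the systemd unit name and the log file path are assumptions, so adjust them to your setup):

# kubelet managed by systemd
journalctl -u kubelet --since "2021-07-28 16:24" | grep cam1-78b6fc6bc8-cjsw5

# kubelet writing to a log file, as in this cluster
grep cam1-78b6fc6bc8-cjsw5 /var/log/kubelet.log

The key lines filtered out for the failing Pod are shown below.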

I0728 16:24:57.339295    9744 kubelet.go:1904] SyncLoop (DELETE, "api"): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)"
I0728 16:24:57.339720 9744 kuberuntime_container.go:581] Killing container "docker://a73082a4a9a4cec174bb0d1c256cc11d804d93137551b9bfd3e6fa1522e98589" with 60 second grace period
I0728 16:25:18.259418 9744 kubelet.go:1904] SyncLoop (DELETE, "api"): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)"
2021-07-28 16:25:19.247 [INFO][394011] ipam.go 1173: Releasing all IPs with handle 'cam.cam1-78b6fc6bc8-cjsw5'
2021-07-28 16:25:19.254 [INFO][393585] k8s.go 498: Teardown processing complete.

// Suspicious point 1: the pod IP could not be obtained
W0728 16:25:19.303513 9744 docker_sandbox.go:384] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "cam1-78b6fc6bc8-cjsw5_cam": Unexpected command output Device "eth0" does not exist.
with error: exit status 1

I0728 16:25:19.341068 9744 kubelet.go:1933] SyncLoop (PLEG): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)", event: &pleg.PodLifecycleEvent{ID:"5c948341-c030-4996-b888-f032577d97b0", Type:"ContainerDied", Data:"a73082a4a9a4cec174bb0d1c256cc11d804d93137551b9bfd3e6fa1522e98589"}
I0728 16:25:20.578095 9744 kubelet.go:1933] SyncLoop (PLEG): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)", event: &pleg.PodLifecycleEvent{ID:"5c948341-c030-4996-b888-f032577d97b0", Type:"ContainerDied", Data:"c3b992465cd2085300995066526a36665664558446ff6e1756135c3a5b6df2e6"}

I0728 16:25:20.711967 9744 kubelet_pods.go:1090] Killing unwanted pod "cam1-78b6fc6bc8-cjsw5"

// Suspicious point 2: Unmount failed
E0728 16:25:20.939400 9744 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/glusterfs/5c948341-c030-4996-b888-f032577d97b0-cam-pv-50g podName:5c948341-c030-4996-b888-f032577d97b0 nodeName:}" failed. No retries permitted until 2021-07-28 16:25:21.439325811 +0800 CST m=+199182.605079651 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"diag-log\" (UniqueName: \"kubernetes.io/glusterfs/5c948341-c030-4996-b888-f032577d97b0-cam-pv-50g\") pod \"5c948341-c030-4996-b888-f032577d97b0\" (UID: \"5c948341-c030-4996-b888-f032577d97b0\") : Unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g\nOutput: umount: /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g: target is busy.\n (In some cases useful info about processes that use\n the device is found by lsof(8) or fuser(1))\n\n"

The deletion logs show two suspicious points:

  1. the failure to read the pod IP, logged at docker_sandbox.go:384;
  2. the Unmount failure, logged at nestedpendingoperations.go:301;

Start with point 1. The log maps to the following location in the source [1]: the IP could not be obtained, so a warning is printed and an empty IP address is returned.

pkg/kubelet/dockershim/docker_sandbox.go:348
func (ds *dockerService) getIP(podSandboxID string, sandbox *dockertypes.ContainerJSON) string {
    if sandbox.NetworkSettings == nil {
        return ""
    }
    if networkNamespaceMode(sandbox) == runtimeapi.NamespaceMode_NODE {
        // For sandboxes using host network, the shim is not responsible for
        // reporting the IP.
        return ""
    }

    // Don't bother getting IP if the pod is known and networking isn't ready
    ready, ok := ds.getNetworkReady(podSandboxID)
    if ok && !ready {
        return ""
    }

    ip, err := ds.getIPFromPlugin(sandbox)
    if err == nil {
        return ip
    }

    if sandbox.NetworkSettings.IPAddress != "" {
        return sandbox.NetworkSettings.IPAddress
    }
    if sandbox.NetworkSettings.GlobalIPv6Address != "" {
        return sandbox.NetworkSettings.GlobalIPv6Address
    }

    // The warning seen in the kubelet log is printed here
    klog.Warningf("failed to read pod IP from plugin/docker: %v", err)
    return ""
}

Next, look at the caller of getIP. If no IP was obtained, nothing special happens either: the empty value is simply placed into the PodSandboxStatusResponse and returned.

pkg/kubelet/dockershim/docker_sandbox.go:404
func (ds *dockerService) PodSandboxStatus(ctx context.Context, req *runtimeapi.PodSandboxStatusRequest) (*runtimeapi.PodSandboxStatusResponse, error) {
    podSandboxID := req.PodSandboxId

    r, metadata, err := ds.getPodSandboxDetails(podSandboxID)
    if err != nil {
        return nil, err
    }

    // Parse the timestamps.
    createdAt, _, _, err := getContainerTimestamps(r)
    if err != nil {
        return nil, fmt.Errorf("failed to parse timestamp for container %q: %v", podSandboxID, err)
    }
    ct := createdAt.UnixNano()

    // Translate container to sandbox state.
    state := runtimeapi.PodSandboxState_SANDBOX_NOTREADY
    if r.State.Running {
        state = runtimeapi.PodSandboxState_SANDBOX_READY
    }

    // This is where getIP is called
    var IP string
    if IP = ds.determinePodIPBySandboxID(podSandboxID); IP == "" {
        IP = ds.getIP(podSandboxID, r)
    }

    labels, annotations := extractLabels(r.Config.Labels)
    status := &runtimeapi.PodSandboxStatus{
        Id:          r.ID,
        State:       state,
        CreatedAt:   ct,
        Metadata:    metadata,
        Labels:      labels,
        Annotations: annotations,
        Network: &runtimeapi.PodSandboxNetworkStatus{
            Ip: IP,
        },
        Linux: &runtimeapi.LinuxPodSandboxStatus{
            Namespaces: &runtimeapi.Namespace{
                Options: &runtimeapi.NamespaceOption{
                    Network: networkNamespaceMode(r),
                    Pid:     pidNamespaceMode(r),
                    Ipc:     ipcNamespaceMode(r),
                },
            },
        },
    }
    return &runtimeapi.PodSandboxStatusResponse{Status: status}, nil
}

At this point it is still unclear whether this error interrupts the deletion flow, so let's reproduce it locally. Modify the code above to force the IP to be empty right after getIP is called (the local test showed the Pod was still deleted normally, so this error has nothing to do with the problem):

// This is where getIP is called
var IP string
if IP = ds.determinePodIPBySandboxID(podSandboxID); IP == "" {
    IP = ds.getIP(podSandboxID, r)
}

// Newly added debug logic: force the IP to be empty for the designated test Pod
isTestPod := strings.Contains(metadata.GetName(), "testpod")
if isTestPod {
    IP = ""
}
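One way to drive this patched branch (the pod name and image below are hypothetical; any Pod whose name contains "testpod" hits the forced-empty-IP path) is to create a throwaway Pod, delete it, and confirm it actually goes away:

# Create a throwaway Pod whose name matches the patched condition
kubectl run testpod-ip-check --image=nginx --restart=Never -n xxx

# Delete it; despite the empty IP it should disappear instead of sticking in Terminating
kubectl delete pod -n xxx testpod-ip-check
kubectl get pod -n xxx testpod-ip-check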

Now for point 2. This one is an ERROR-level message: unmounting the mount point failed. Would a failed unmount cause the deletion flow to stop early? There are plenty of source-code walkthroughs of the Pod deletion flow online, so let's consult a few [2,3,4] and see whether they answer that question.

Briefly summarized, deleting a Pod goes through the following steps (step 2 can be checked directly on the stuck Pod, as shown after the list):

  1. The client (kubectl) calls the kube-apiserver DELETE API (with the default grace-period=30s);
  2. This first DELETE only updates the Pod object's metadata (the DeletionTimestamp and DeletionGracePeriodSeconds fields); the record is not yet removed from etcd;
  3. The kubectl command blocks, showing that the Pod is being deleted;
  4. kubelet sees the Pod update event and runs the killPod() method;
  5. kubelet sees the Pod deletion event and calls the kube-apiserver DELETE API a second time (with grace-period=0);
  6. This second DELETE removes the Pod object from etcd;
  7. The kubectl command returns: the Pod has been deleted.
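For a Pod stuck in Terminating, the effect of step 2 can be observed on the live object, since the first DELETE has already stamped the metadata (pod name taken from this case):

# Show the DeletionTimestamp and DeletionGracePeriodSeconds set by the first DELETE
kubectl get pod -n xxx cam1-78b6fc6bc8-cjsw5 \
  -o jsonpath='{.metadata.deletionTimestamp}{"  "}{.metadata.deletionGracePeriodSeconds}{"\n"}'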

The kubelet logs from the failed deletion do show two DELETE operations, with a Killing container line in between. According to the flow above, however, it is the killPod() call that should sit between the two DELETEs, and the source shows that the corresponding log line is Killing unwanted pod. In other words, the second DELETE (the grace-period=0 one) was never actually triggered.

pkg/kubelet/kubelet_pods.go:1073
func (kl *Kubelet) podKiller() {
    killing := sets.NewString()
    // guard for the killing set
    lock := sync.Mutex{}
    for podPair := range kl.podKillingCh {
        runningPod := podPair.RunningPod
        apiPod := podPair.APIPod

        lock.Lock()
        exists := killing.Has(string(runningPod.ID))
        if !exists {
            killing.Insert(string(runningPod.ID))
        }
        lock.Unlock()

        // A v2-level log line is printed right before killPod is called
        if !exists {
            go func(apiPod *v1.Pod, runningPod *kubecontainer.Pod) {
                klog.V(2).Infof("Killing unwanted pod %q", runningPod.Name)
                err := kl.killPod(apiPod, runningPod, nil, nil)
                if err != nil {
                    klog.Errorf("Failed killing the pod %q: %v", runningPod.Name, err)
                }
                lock.Lock()
                killing.Delete(string(runningPod.ID))
                lock.Unlock()
            }(apiPod, runningPod)
        }
    }
}

How do we confirm whether the second DELETE was triggered? Easy: read the code, or simply verify it. Here I deleted a Pod in a test environment and watched the corresponding logs:

[root@node2 ~]# kubectl delete pod -n xxx  testpodrc2-7b749f6c9c-qh68l
pod "testpodrc2-7b749f6c9c-qh68l" deleted

// Only the key log lines are kept
[root@node2 ~]# tailf kubelet.log
I0730 13:27:31.854178 24588 kubelet.go:1904] SyncLoop (DELETE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"
I0730 13:27:31.854511 24588 kuberuntime_container.go:581] Killing container "docker://e2a1cd5f2165e12cf0b46e12f9cd4d656d593f75e85c0de058e0a2f376a5557e" with 30 second grace period
I0730 13:27:32.203167 24588 kubelet.go:1904] SyncLoop (DELETE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"

I0730 13:27:32.993294 24588 kubelet.go:1933] SyncLoop (PLEG): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)", event: &pleg.PodLifecycleEvent{ID:"85ee282f-a843-4f10-a99c-79d447f83f2a", Type:"ContainerDied", Data:"e2a1cd5f2165e12cf0b46e12f9cd4d656d593f75e85c0de058e0a2f376a5557e"}
I0730 13:27:32.993428 24588 kubelet.go:1933] SyncLoop (PLEG): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)", event: &pleg.PodLifecycleEvent{ID:"85ee282f-a843-4f10-a99c-79d447f83f2a", Type:"ContainerDied", Data:"c6a587614976beed0cbb6e5fabf70a2d039eec6c160154fce007fe2bb1ba3b4f"}

I0730 13:27:34.072494 24588 kubelet_pods.go:1090] Killing unwanted pod "testpodrc2-7b749f6c9c-qh68l"

I0730 13:27:40.084182 24588 kubelet.go:1904] SyncLoop (DELETE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"
I0730 13:27:40.085735 24588 kubelet.go:1898] SyncLoop (REMOVE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"

Comparing the normal and abnormal logs, a normal deletion has a DELETE and a REMOVE after the Killing unwanted pod line. This confirms that the problem lies in the second DELETE never being triggered. Let's look at the relevant code:

pkg/kubelet/status/status_manager.go:470
// The kubelet statusManager module calls syncPod() in a loop;
// inside, it may call the kube-apiserver DELETE API (a forced, non-graceful delete).
func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
    ...

    // Only when the pod carries a DeletionTimestamp AND its containers, persistent
    // volumes, etc. have all been cleaned up does execution enter this if block.
    if m.canBeDeleted(pod, status.status) {
        deleteOptions := metav1.NewDeleteOptions(0)
        deleteOptions.Preconditions = metav1.NewUIDPreconditions(string(pod.UID))

        // Force-delete the pod object, equivalent to: kubectl delete pod podA --grace-period=0
        err = m.kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, deleteOptions)
        ...
    }
}

The source shows that whether the second DELETE fires depends on the result of the canBeDeleted check, and that check verifies, among other things, whether the pod's volumes have been cleaned up:

pkg/kubelet/status/status_manager.go:538
func (m *manager) canBeDeleted(pod *v1.Pod, status v1.PodStatus) bool {
    if pod.DeletionTimestamp == nil || kubepod.IsMirrorPod(pod) {
        return false
    }
    return m.podDeletionSafety.PodResourcesAreReclaimed(pod, status)
}

pkg/kubelet/kubelet_pods.go:900
func (kl *Kubelet) PodResourcesAreReclaimed(pod *v1.Pod, status v1.PodStatus) bool {
    ...

    // This is where kubelet checks whether the pod's volumes have been unmounted
    if kl.podVolumesExist(pod.UID) && !kl.keepTerminatedPodVolumes {
        // We shouldnt delete pods whose volumes have not been cleaned up if we are not keeping terminated pod volumes
        klog.V(3).Infof("Pod %q is terminated, but some volumes have not been cleaned up", format.Pod(pod))
        return false
    }
    if kl.kubeletConfiguration.CgroupsPerQOS {
        pcm := kl.containerManager.NewPodContainerManager()
        if pcm.Exists(pod) {
            klog.V(3).Infof("Pod %q is terminated, but pod cgroup sandbox has not been cleaned up", format.Pod(pod))
            return false
        }
    }
    return true
}
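You can roughly approximate what the podVolumesExist check cares about by looking directly at the node: the pod's volumes live under /var/lib/kubelet/pods/<pod-UID>/volumes (the UID below is the one from this case), and anything still mounted there keeps the pod from being reclaimed:

# Volume directories kubelet still tracks for this pod
ls /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/

# Mounts that have not been torn down yet
mount | grep 5c948341-c030-4996-b888-f032577d97b0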

Combined with the error in our logs, we can be fairly sure the deletion is blocked by the failed Unmount of the mount point. So why does the Unmount fail?

// Key log line for the umount failure
Unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-pv-50g\nOutput: umount: /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-pv-50g: target is busy.\n (In some cases useful info about processes that use\n the device is found by lsof(8) or fuser(1))\n\n"

Looking closely at the failure, the backend of this mount point is glusterfs, and "target is busy" generally means the storage is still in use, so it cannot be unmounted. Let's find out whether some process is using the mount directory (the following investigation was provided by a colleague responsible for glusterfs):

[root@node1 ~]# fuser -mv /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-pv-50g
USER PID ACCESS COMMAND
root kernel mount /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-dialog-gluster-pv-50g
root 94549 f.... glusterfs

Besides the kernel mount itself, there is a glusterfs process with pid 94549 occupying the directory of the mount point. Let's see what that process is:

[root@node1 ~]# ps -ef| grep 94549
root 94549 1 0 Jul26 ? 00:01:13 /usr/sbin/glusterfs --log-level=ERROR --log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/global-diaglog-pv/web-fddf96444-stxpf-glusterfs.log --fuse-mountopts=auto_unmount --process-name fuse --volfile-server=xxx --volfile-server=xxx --volfile-server=xxx --volfile-id=global-diaglog --fuse-mountopts=auto_unmount /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/global-diaglog-pv

It turns out this process serves the mount for web-xxx, and web-xxx has no relationship with cam-xxx whatsoever. The conclusion is that the mount information managed by glusterfs got mixed up; the root cause of the mix-up was handed over to the colleagues who own glusterfs.

Solution

From the analysis, the Pod deletion got stuck because a shared storage volume could not be unmounted properly; it is not a Kubernetes problem.
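Once the stray mount holder has been confirmed with the storage team, a manual cleanup on the node can unblock the stuck deletion. The sketch below reuses the pid and paths from this case and is only one possible sequence; in particular, killing the glusterfs client affects whatever mount it is actually serving, so coordinate before touching it:

# Re-confirm who is holding the mount point
fuser -mv /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g

# Either stop the stray glusterfs client ...
kill 94549

# ... or lazily detach the busy mount point
umount -l /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g

# Once the unmount retry succeeds, kubelet's status manager issues the final DELETE on its own.
# As a last resort only, force-remove the Pod object from the API server:
kubectl delete pod -n xxx cam1-78b6fc6bc8-cjsw5 --force --grace-period=0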

References

  1. Kubernetes v1.15.12 source code
  2. A brief source-code analysis of the Pod deletion flow in Kubernetes
  3. Kubernetes source code analysis: Pod deletion
  4. Troubleshooting a Kubernetes grace period failure