[root@node1 ~]# kubectl describe pod -n xxx cam1-78b6fc6bc8-cjsw5   // nothing unusual in the output, so it is not pasted here
Nothing obviously abnormal shows up in the Events, so the next step is the log of the kubelet, the component that actually tears the Pod down on the node. Only the key lines are kept below.
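They could be pulled with something like this (a sketch — it assumes kubelet runs as a systemd unit and logs to journald; the exact unit name may differ on this node):

[root@node1 ~]# journalctl -u kubelet | grep cam1-78b6fc6bc8-cjsw5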
I0728 16:24:57.339295 9744 kubelet.go:1904] SyncLoop (DELETE, "api"): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)"
I0728 16:24:57.339720 9744 kuberuntime_container.go:581] Killing container "docker://a73082a4a9a4cec174bb0d1c256cc11d804d93137551b9bfd3e6fa1522e98589" with 60 second grace period
I0728 16:25:18.259418 9744 kubelet.go:1904] SyncLoop (DELETE, "api"): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)"
2021-07-28 16:25:19.247 [INFO][394011] ipam.go 1173: Releasing all IPs with handle 'cam.cam1-78b6fc6bc8-cjsw5'
2021-07-28 16:25:19.254 [INFO][393585] k8s.go 498: Teardown processing complete.
// Suspicious point 1: the pod IP could not be read
W0728 16:25:19.303513 9744 docker_sandbox.go:384] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "cam1-78b6fc6bc8-cjsw5_cam": Unexpected command output Device "eth0" does not exist. with error: exit status 1
I0728 16:25:19.341068 9744 kubelet.go:1933] SyncLoop (PLEG): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)", event: &pleg.PodLifecycleEvent{ID:"5c948341-c030-4996-b888-f032577d97b0", Type:"ContainerDied", Data:"a73082a4a9a4cec174bb0d1c256cc11d804d93137551b9bfd3e6fa1522e98589"}
I0728 16:25:20.578095 9744 kubelet.go:1933] SyncLoop (PLEG): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)", event: &pleg.PodLifecycleEvent{ID:"5c948341-c030-4996-b888-f032577d97b0", Type:"ContainerDied", Data:"c3b992465cd2085300995066526a36665664558446ff6e1756135c3a5b6df2e6"}
I0728 16:25:20.711967 9744 kubelet_pods.go:1090] Killing unwanted pod "cam1-78b6fc6bc8-cjsw5"
// Suspicious point 2: Unmount failed
E0728 16:25:20.939400 9744 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/glusterfs/5c948341-c030-4996-b888-f032577d97b0-cam-pv-50g podName:5c948341-c030-4996-b888-f032577d97b0 nodeName:}" failed. No retries permitted until 2021-07-28 16:25:21.439325811 +0800 CST m=+199182.605079651 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"diag-log\" (UniqueName: \"kubernetes.io/glusterfs/5c948341-c030-4996-b888-f032577d97b0-cam-pv-50g\") pod \"5c948341-c030-4996-b888-f032577d97b0\" (UID: \"5c948341-c030-4996-b888-f032577d97b0\") : Unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g\nOutput: umount: /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g: target is busy.\n (In some cases useful info about processes that\n use the device is found by lsof(8) or fuser(1).)\n\n"
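The unmount error is the classic "target is busy" case: some process on the node still has files open under the mount point, so umount exits with status 32 and kubelet keeps retrying the teardown. Following the hint in the error output itself, the holder can be located roughly like this (a sketch — the mount path is copied from the log above; these commands are not part of the original troubleshooting transcript):

[root@node1 ~]# fuser -mv /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g   // list PIDs holding files open on the glusterfs mount
[root@node1 ~]# lsof /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g        // same information via lsof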
pkg/kubelet/dockershim/docker_sandbox.go:348
func (ds *dockerService) getIP(podSandboxID string, sandbox *dockertypes.ContainerJSON) string {
    if sandbox.NetworkSettings == nil {
        return ""
    }
    if networkNamespaceMode(sandbox) == runtimeapi.NamespaceMode_NODE {
        // For sandboxes using host network, the shim is not responsible for
        // reporting the IP.
        return ""
    }

    // Don't bother getting IP if the pod is known and networking isn't ready
    ready, ok := ds.getNetworkReady(podSandboxID)
    if ok && !ready {
        return ""
    }

    // Ask the CNI plugin first; fall back to the docker-reported addresses if that fails
    ip, err := ds.getIPFromPlugin(sandbox)
    if err == nil {
        return ip
    }
    if sandbox.NetworkSettings.IPAddress != "" {
        return sandbox.NetworkSettings.IPAddress
    }
    if sandbox.NetworkSettings.GlobalIPv6Address != "" {
        return sandbox.NetworkSettings.GlobalIPv6Address
    }

    // The warning seen in the kubelet log above is emitted here
    klog.Warningf("failed to read pod IP from plugin/docker: %v", err)
    return ""
}
    // Parse the timestamps.
    createdAt, _, _, err := getContainerTimestamps(r)
    if err != nil {
        return nil, fmt.Errorf("failed to parse timestamp for container %q: %v", podSandboxID, err)
    }
    ct := createdAt.UnixNano()

    // Translate container to sandbox state.
    state := runtimeapi.PodSandboxState_SANDBOX_NOTREADY
    if r.State.Running {
        state = runtimeapi.PodSandboxState_SANDBOX_READY
    }

    // This is where getIP is called while building the sandbox status
    var IP string
    if IP = ds.determinePodIPBySandboxID(podSandboxID); IP == "" {
        IP = ds.getIP(podSandboxID, r)
    }
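For context, the Device "eth0" does not exist. fragment in suspicious point 1 is not kubelet's own wording: getIPFromPlugin asks the CNI plugin for the pod IP, and on this dockershim code path the status hook effectively runs the ip tool inside the pod's network namespace. Once the sandbox has been torn down there is no eth0 left, so ip fails with exactly that message and exit status 1. It can be reproduced on any node (a sketch purely to show where the text comes from; the namespace name demo-ns is made up for illustration):

[root@node1 ~]# ip netns add demo-ns
[root@node1 ~]# ip netns exec demo-ns ip addr show dev eth0   // prints: Device "eth0" does not exist.  (exit status 1)
[root@node1 ~]# ip netns del demo-ns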
pkg/kubelet/kubelet_pods.go:900
func (kl *Kubelet) PodResourcesAreReclaimed(pod *v1.Pod, status v1.PodStatus) bool {
    ...
    // Here kubelet checks whether the pod's volumes have been unmounted
    if kl.podVolumesExist(pod.UID) && !kl.keepTerminatedPodVolumes {
        // We shouldnt delete pods whose volumes have not been cleaned up if we are not keeping terminated pod volumes
        klog.V(3).Infof("Pod %q is terminated, but some volumes have not been cleaned up", format.Pod(pod))
        return false
    }
    if kl.kubeletConfiguration.CgroupsPerQOS {
        pcm := kl.containerManager.NewPodContainerManager()
        if pcm.Exists(pod) {
            klog.V(3).Infof("Pod %q is terminated, but pod cgroup sandbox has not been cleaned up", format.Pod(pod))
            return false
        }
    }
    return true
}
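This check is what ties suspicious point 2 to the stuck deletion: as long as the glusterfs volume cannot be unmounted, podVolumesExist(pod.UID) keeps returning true, PodResourcesAreReclaimed returns false, and kubelet will not finish removing the Pod, so it stays in Terminating. Whether that is the case can be confirmed on the node (a sketch — the UID and path are taken from the unmount error above):

[root@node1 ~]# ls /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/   // cam-pv-50g still listed => the volume has not been cleaned up
[root@node1 ~]# mount | grep 5c948341-c030-4996-b888-f032577d97b0                                                // the glusterfs mount is still active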