
Problem Background

  1. Inspecting the system components under kube-system showed that harbor, influxdb, coredns and several other components were restarting repeatedly;
  2. Listing the pods with kubectl get pod -n kube-system, the command consistently hung for about 15s.

Root Cause Analysis

Starting with the first point: these pods restarting repeatedly is something we have hit several times before, and readers of the earlier troubleshooting posts [1,2] will know it can happen under heavy business load. First check the load with top:

load average: 14.76  18.45  17.85
Tasks: 1998 total, 7 running, 1937 sleeping, 0 stopped, 54 zombie
%CPU: 15.2 us, 8.3 sys, 0.7 ni, 75.3 id

cat /proc/cpuinfo|grep MHz| wc -l
40

The actual load is not high, so that possibility is ruled out (during the investigation we did tune the related kernel parameters based on earlier experience, and it indeed made no difference). Moving on, describe a few of the abnormal pods to look at the error messages:

kubelet     Liveness probe failed: Get "http://177.177.138.139:8885/api/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
kubelet Readiness probe failed: Get "http://177.177.138.139:8083/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
kubelet Back-off restarting failed container

The errors are essentially the same: kubelet's calls to the pods' health check endpoints time out, so the containers keep getting killed and restarted. Why can't the endpoints be reached? Simulating what kubelet does, curl-ing the endpoint from the node hosting the pod also times out. Next, ping the pod to check basic connectivity:

ping 177.177.212.186
^C

The ping fails! That explains the failing health checks: kubelet simply cannot talk to the pods on its own node. Why? After a round of testing a clear pattern emerged: every node in the cluster is unable to reach the pods running on itself, but can reach pods on other nodes.
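A minimal sketch of how that pattern can be confirmed (the health port and path come from the probe errors above; the pod IPs are placeholders to fill in per node):

NODE=$(hostname)
# list the pods scheduled on this node, then pick one local and one remote pod IP
kubectl get pod -A -o wide --field-selector spec.nodeName=$NODE
ping -c 3 -W 2 <local-pod-ip>                     # fails on the affected cluster
ping -c 3 -W 2 <remote-pod-ip>                    # works
curl -m 5 http://<local-pod-ip>:8885/api/health   # the probe URL from the events above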

Read more »

Problem Background

To work around the issue of some service requests being delayed by 63s on the nodes [1], we temporarily switched the OS to Red Hat 8.4 (kernel 4.18.0-305). After virtualizing a 3-node cluster on VMware, deployment across a layer-3 boundary failed, reporting that the Harbor deployment had failed.

Root Cause Analysis

Note: quite some time has passed since this was investigated, and the root cause was never fully identified, so the process was not written up at the time. This is just a brief record.

Harbor's logs showed that the deployment failed its health check: requests to Harbor's podIP:port timed out across the layer-3 boundary, and a packet capture showed the request passing through vxlan.calico and then stalling with the connection stuck in SYN_SENT;

Further testing showed that it was not only Harbor's podIP:port requests that timed out: requests to other pod services crossing the layer-3 network had the same problem, so this is a general issue. Layer-7 traffic such as HTTP on the pod network is affected, while ICMP (ping) traffic is not;

Further captures confirmed that the podIP:port request was indeed sent out of the node's NIC, but the node on the far side of the layer-3 boundary never received it, which rules out iptables and routing on the host. We did in fact trace the request through iptables and confirmed it was not dropped there;
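For reference, a minimal sketch of that kind of iptables trace, assuming the legacy xtables TRACE target and a placeholder destination port (where the trace output lands varies by distribution; on nftables-based hosts xtables-monitor would be used instead):

# Run on the sending node; <port> is a placeholder for the Harbor service port.
modprobe nf_log_ipv4                                      # TRACE logs via nf_log on many kernels
iptables -t raw -I OUTPUT     -p tcp --dport <port> -j TRACE
iptables -t raw -I PREROUTING -p tcp --dport <port> -j TRACE
# reproduce the request, then inspect the per-chain trace in the kernel log:
dmesg | grep 'TRACE:' | tail -n 20
# clean up afterwards:
iptables -t raw -D OUTPUT     -p tcp --dport <port> -j TRACE
iptables -t raw -D PREROUTING -p tcp --dport <port> -j TRACE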

Putting this together, the initial suspicion was that the requests were being dropped by some intermediate device on the layer-3 path (a switch or router). Colleagues later captured packets on the switch: the ping packets showed up, but the HTTP requests still did not, so the switch side never received them either. That narrows the possibilities down to:

  1. Dropped at the outgoing NIC;
  2. Dropped after leaving the NIC and before entering the switch.

Checking the VM NIC packet counters with ethtool -S xxx showed no drops, which suggests the problem is not at the VM's outgoing NIC;
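A rough sketch of these checks (the NIC name and peer node IP are placeholders):

# Drop/error counters on the VM's uplink NIC
ethtool -S <nic> | grep -Ei 'drop|err|discard'
ip -s link show <nic>                         # interface-level RX/TX errors and drops
# Watch the encapsulated VXLAN traffic (UDP/4789) actually leaving the node
tcpdump -ni <nic> udp port 4789 and host <peer-node-ip>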

Read more »

Problem Background

After switching calico's networking to vxlan mode in a K8S cluster, some services could no longer be accessed from the nodes (strictly speaking, access was delayed, and the delay was consistently around 1min3s).

[root@node2 ~]# time curl -s http://10.96.91.255
real 1m3.136s
user 0m0.005s
sys 0m0.005s

Root Cause Analysis

First, narrow down the scope. Testing several services on the environment showed that the delay only appears when the node issuing the request and the backing pod are on different nodes; otherwise the request is fine. In other words, only cross-node access is affected. Accessing the pod IP behind the service directly also works, which suggests the problem sits on the service-to-pod path.
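A minimal sketch of that test matrix, reusing the service and pod addresses that appear later in this post:

# Run once from a node that does NOT host the backing pod, once from the node that does.
time curl -s -m 70 http://10.96.91.255           # via the service address
time curl -s -m 70 http://166.166.166.168:8082   # the backing pod directly
# On the affected cluster, only the cross-node request via the service address
# stalls for ~63s; direct pod access and same-node access return immediately.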

Second, confirm the base environment. Nothing in the OS, K8S or calico itself changed; the only change was switching calico's network mode from BGP to vxlan. That change alters the service-to-pod request path inside the cluster: all container traffic now has to be encapsulated through the newly added vxlan.calico interface on each node. The problem did not exist before the mode change, reproduces every time after it, and disappears when switching back to BGP, so it is very likely related to the new vxlan.calico interface.
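For context, a sketch of how the pool's encapsulation mode can be checked with calicoctl (the pool name default-ipv4-ippool is an assumption; it varies by installation):

calicoctl get ippool default-ipv4-ippool -o yaml
# The relevant fields in spec:
#   vxlanMode: Always     # traffic is encapsulated via the vxlan.calico interface
#   ipipMode: Never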

First look at the environment and trigger a service request:

# Test setup: node2 (10.10.72.11) --> node1 (10.10.72.10)
# Test method: on node2, curl the service 10.96.91.255 --> pod 166.166.166.168:8082 on node1
[root@node2 ~]# ip addr
10: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UNKNOWN group default
link/ether 66:2d:bf:44:a6:8b brd ff:ff:ff:ff:ff:ff
inet 166.166.104.10/32 brd 166.166.104.10 scope global vxlan.calico
valid_lft forever preferred_lft forever

[root@node1 ~]# ip addr
15: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UNKNOWN group default
link/ether 66:f9:37:c3:7e:94 brd ff:ff:ff:ff:ff:ff
inet 166.166.166.175/32 brd 166.166.166.175 scope global vxlan.calico
valid_lft forever preferred_lft forever

[root@node2 ~]# time curl http://10.96.91.255

Capture on node1's host NIC to see whether the encapsulated request arrives:

[root@node1 ~]# tcpdump -n -vv -i eth0 host 10.10.72.11 and udp
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
07:19:42.730996 IP (tos 0x0, ttl 64, id 6470, offset 0, flags [none], proto UDP (17), length 110)
10.10.72.11.nim-vdrshell > 10.10.72.10.4789: [bad udp cksum 0xffff -> 0x3af0!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 39190, offset 0, flags [DF], proto TCP (6), length 60)
166.166.104.10.35632 > 166.166.166.168.us-cli: Flags [S], cksum 0xe556 (correct), seq 3025623348, win 29200, options [mss 1460,sackOK,TS val 101892130 ecr 0,nop,wscale 7], length 0
07:19:43.733741 IP (tos 0x0, ttl 64, id 6804, offset 0, flags [none], proto UDP (17), length 110)
10.10.72.11.nim-vdrshell > 10.10.72.10.4789: [bad udp cksum 0xffff -> 0x3af0!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 39191, offset 0, flags [DF], proto TCP (6), length 60)
166.166.104.10.35632 > 166.166.166.168.us-cli: Flags [S], cksum 0xe16b (correct), seq 3025623348, win 29200, options [mss 1460,sackOK,TS val 101893133 ecr 0,nop,wscale 7], length 0
07:19:45.736729 IP (tos 0x0, ttl 64, id 7403, offset 0, flags [none], proto UDP (17), length 110)
10.10.72.11.nim-vdrshell > 10.10.72.10.4789: [bad udp cksum 0xffff -> 0x3af0!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 39192, offset 0, flags [DF], proto TCP (6), length 60)
166.166.104.10.35632 > 166.166.166.168.us-cli: Flags [S], cksum 0xd998 (correct), seq 3025623348, win 29200, options [mss 1460,sackOK,TS val 101895136 ecr 0,nop,wscale 7], length 0
07:19:49.744801 IP (tos 0x0, ttl 64, id 9648, offset 0, flags [none], proto UDP (17), length 110)
10.10.72.11.nim-vdrshell > 10.10.72.10.4789: [bad udp cksum 0xffff -> 0x3af0!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 39193, offset 0, flags [DF], proto TCP (6), length 60)
166.166.104.10.35632 > 166.166.166.168.us-cli: Flags [S], cksum 0xc9f0 (correct), seq 3025623348, win 29200, options [mss 1460,sackOK,TS val 101899144 ecr 0,nop,wscale 7], length 0
07:19:57.768735 IP (tos 0x0, ttl 64, id 12853, offset 0, flags [none], proto UDP (17), length 110)
10.10.72.11.nim-vdrshell > 10.10.72.10.4789: [bad udp cksum 0xffff -> 0x3af0!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 39194, offset 0, flags [DF], proto TCP (6), length 60)
166.166.104.10.35632 > 166.166.166.168.us-cli: Flags [S], cksum 0xaa98 (correct), seq 3025623348, win 29200, options [mss 1460,sackOK,TS val 101907168 ecr 0,nop,wscale 7], length 0
07:20:05.087057 IP (tos 0x0, ttl 64, id 8479, offset 0, flags [none], proto UDP (17), length 164)
10.10.72.10.34565 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3425, offset 0, flags [DF], proto UDP (17), length 114)
166.166.166.168.57850 > 166.166.104.6.domain: [udp sum ok] 10121+ AAAA? influxdb-nginx-service.kube-system.svc.kube-system.svc.cluster.local. (86)
07:20:05.087076 IP (tos 0x0, ttl 64, id 54475, offset 0, flags [none], proto UDP (17), length 164)
10.10.72.10.51841 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3424, offset 0, flags [DF], proto UDP (17), length 114)
166.166.166.168.57984 > 166.166.104.6.domain: [udp sum ok] 20020+ A? influxdb-nginx-service.kube-system.svc.kube-system.svc.cluster.local. (86)
07:20:05.087671 IP (tos 0x0, ttl 64, id 13540, offset 0, flags [none], proto UDP (17), length 257)
10.10.72.11.60395 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19190, offset 0, flags [DF], proto UDP (17), length 207)
166.166.104.6.domain > 166.166.166.168.57850: [udp sum ok] 10121 NXDomain*- q: AAAA? influxdb-nginx-service.kube-system.svc.kube-system.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1647633218 7200 1800 86400 5 (179)
07:20:05.087702 IP (tos 0x0, ttl 64, id 13541, offset 0, flags [none], proto UDP (17), length 257)
10.10.72.11.48571 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19191, offset 0, flags [DF], proto UDP (17), length 207)
166.166.104.6.domain > 166.166.166.168.57984: [udp sum ok] 20020 NXDomain*- q: A? influxdb-nginx-service.kube-system.svc.kube-system.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1647633218 7200 1800 86400 5 (179)
07:20:05.088801 IP (tos 0x0, ttl 64, id 8480, offset 0, flags [none], proto UDP (17), length 152)
10.10.72.10.55780 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3427, offset 0, flags [DF], proto UDP (17), length 102)
166.166.166.168.56015 > 166.166.104.6.domain: [udp sum ok] 19167+ AAAA? influxdb-nginx-service.kube-system.svc.svc.cluster.local. (74)
07:20:05.089048 IP (tos 0x0, ttl 64, id 13542, offset 0, flags [none], proto UDP (17), length 245)
10.10.72.11.50151 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19192, offset 0, flags [DF], proto UDP (17), length 195)
166.166.104.6.domain > 166.166.166.168.56015: [udp sum ok] 19167 NXDomain*- q: AAAA? influxdb-nginx-service.kube-system.svc.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1647633218 7200 1800 86400 5 (167)
07:20:05.089212 IP (tos 0x0, ttl 64, id 8481, offset 0, flags [none], proto UDP (17), length 148)
10.10.72.10.50272 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3430, offset 0, flags [DF], proto UDP (17), length 98)
166.166.166.168.54926 > 166.166.104.6.domain: [udp sum ok] 40948+ A? influxdb-nginx-service.kube-system.svc.cluster.local. (70)
07:20:05.089403 IP (tos 0x0, ttl 64, id 13543, offset 0, flags [none], proto UDP (17), length 241)
10.10.72.11.59882 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19193, offset 0, flags [DF], proto UDP (17), length 191)
166.166.104.6.domain > 166.166.166.168.54926: [udp sum ok] 40948 NXDomain*- q: A? influxdb-nginx-service.kube-system.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1647633218 7200 1800 86400 5 (163)
07:20:05.089524 IP (tos 0x0, ttl 64, id 8482, offset 0, flags [none], proto UDP (17), length 134)
10.10.72.10.58964 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3431, offset 0, flags [DF], proto UDP (17), length 84)
166.166.166.168.50263 > 166.166.104.6.domain: [udp sum ok] 18815+ A? influxdb-nginx-service.kube-system.svc. (56)
07:20:05.089681 IP (tos 0x0, ttl 64, id 13544, offset 0, flags [none], proto UDP (17), length 134)
10.10.72.11.51874 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19194, offset 0, flags [DF], proto UDP (17), length 84)
166.166.104.6.domain > 166.166.166.168.50263: [udp sum ok] 18815 ServFail- q: A? influxdb-nginx-service.kube-system.svc. 0/0/0 (56)
07:20:05.089706 IP (tos 0x0, ttl 64, id 8483, offset 0, flags [none], proto UDP (17), length 134)
10.10.72.10.59891 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3433, offset 0, flags [DF], proto UDP (17), length 84)
166.166.166.168.49202 > 166.166.104.6.domain: [udp sum ok] 58612+ AAAA? influxdb-nginx-service.kube-system.svc. (56)
07:20:05.089859 IP (tos 0x0, ttl 64, id 13545, offset 0, flags [none], proto UDP (17), length 134)
10.10.72.11.44146 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19195, offset 0, flags [DF], proto UDP (17), length 84)
166.166.104.6.domain > 166.166.166.168.49202: [udp sum ok] 58612 ServFail- q: AAAA? influxdb-nginx-service.kube-system.svc. 0/0/0 (56)
Read more »

Problem Description

After an unexpected power loss, the Docker daemon is stuck in the activating state and its memory usage keeps growing without bound.

[root@node01 docker]# service docker status
Redirecting to /bin/systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/docker.service.d
└─restart.conf
Active: activating (start) since Tue 2021-12-07 15:44:32 CST; 1min 18s ago
Docs: https://docs.docker.com
Main PID: 274797 (dockerd)
Tasks: 481
Memory: 15.4G   -- abnormally high
CGroup: /system
[root@node01 docker]# service docker status
Redirecting to /bin/systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/docker.service.d
└─restart.conf
Active: deactivating (stop-sigterm)
Docs: https://docs.docker.com
Main PID: 274797 (dockerd)
Tasks: 481
Memory: 247.3G   -- keeps growing rapidly
CGroup: /system.slice/docker.service
└─274

Troubleshooting Process

First, check the docker version and docker info output:

[root@node01 files]# docker version
Client: Docker Engine - Community
Version: 20.10.7
API version: 1.41
Go version: go1.13.15
Git commit: f0df350
Built: Wed Jun 2 11:58:10 2021
OS/Arch: linux/amd64
Context: default
Experimental: true

Server: Docker Engine - Community
Engine:
Version: 20.10.7
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: b0f5bc3
runc:
Version: 1.0.0-rc95
GitCommit: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
docker-init:
Version: 0.19.0
GitCommit: de40ad0
[root@node01 files]# docker info
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
scan: Docker Scan (Docker Inc., v0.8.0)

Server:
Server Version: 20.10.7
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: d71fcd7d8303cbf684402823e425e9dd2e99285d
runc version: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
init version: de40ad0
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-957.21.3.el7.x86_64
Operating System: CentOS 7.6
OSType: linux
Architecture: x86_64
Total Memory: 256GiB
Name: node01
Docker Root Dir: /var/lib/docker
Debug Mode: false
File Descriptors: 366
Goroutines: 290
EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: true

The version is reasonably recent. Check the docker logs:

Dec 07 14:57:57 node01 dockerd[239420]: time="2021-12-07T14:57:57.831370861+08:00" level=info msg="Starting up"
Dec 07 14:57:57 node01 dockerd[239420]: time="2021-12-07T14:57:57.950367668+08:00" level=info msg="parsed scheme: \"unix\"" module=grpc
Dec 07 14:57:57 node01 dockerd[239420]: time="2021-12-07T14:57:57.950430356+08:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Dec 07 14:57:57 node01 dockerd[239420]: time="2021-12-07T14:57:57.950486773+08:00" level=info msg="ccResolverWrapper: sending update to cc: {[{unix://run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Dec 07 14:57:57 node01 dockerd[239420]: time="2021-12-07T14:57:57.950524941+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Dec 07 14:57:58 node01 dockerd[239420]: time="2021-12-07T14:57:58.004622322+08:00" level=info msg="parsed scheme: \"unix\"" module=grpc
Dec 07 14:57:58 node01 dockerd[239420]: time="2021-12-07T14:57:58.004663918+08:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Dec 07 14:57:58 node01 dockerd[239420]: time="2021-12-07T14:57:58.004697382+08:00" level=info msg="ccResolverWrapper: sending update to cc: {[{unix://run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Dec 07 14:57:58 node01 dockerd[239420]: time="2021-12-07T14:57:58.004726533+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Dec 07 14:58:14 node01 dockerd[239420]: time="2021-12-07T14:58:14.832862519+08:00" level=info msg="[graphdriver] using prior storage driver: overlay2"
Dec 07 14:59:00 node01 dockerd[239420]: time="2021-12-07T14:59:00.039554832+08:00" level=info msg="Loading containers: start."
Dec 07 15:02:27 node01 dockerd[239420]: fatal error: runtime: out of memory
Dec 07 15:02:27 node01 dockerd[239420]: runtime stack:
Dec 07 15:02:27 node01 dockerd[239420]: runtime.throw(0x5604d0deee9d, 0x16)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/panic.go:774 +0x74
Dec 07 15:02:28 node01 dockerd[239420]: runtime.sysMap(0xfb6c000000, 0x5c000000, 0x5604d35dab58)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mem_linux.go:169 +0xc7
Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).sysAlloc(0x5604d35be240, 0x5beee000, 0x7f6fd1fbab88, 0x5604cf22bf6b)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/malloc.go:701 +0x1cf
Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).grow(0x5604d35be240, 0x2df77, 0xffffffff)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mheap.go:1255 +0xa5
Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).allocSpanLocked(0x5604d35be240, 0x2df77, 0x5604d35dab68, 0x5604cf212927)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mheap.go:1170 +0x268
Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).alloc_m(0x5604d35be240, 0x2df77, 0x7f6c01180100, 0x7f6c01184350)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mheap.go:1022 +0xc6
Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).alloc.func1()
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mheap.go:1093 +0x4e
Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).alloc(0x5604d35be240, 0x2df77, 0x7f6fd1010100, 0x7f6c01184350)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mheap.go:1092 +0x8c
Dec 07 15:02:28 node01 dockerd[239420]: runtime.largeAlloc(0x5beee000, 0x5604cf250001, 0x7f6c01184350)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/malloc.go:1138 +0x99
Dec 07 15:02:28 node01 dockerd[239420]: runtime.mallocgc.func1()
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/malloc.go:1033 +0x48
Dec 07 15:02:28 node01 dockerd[239420]: runtime.systemstack(0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/asm_amd64.s:370 +0x63
Dec 07 15:02:28 node01 dockerd[239420]: runtime.mstart()
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/proc.go:1146
Dec 07 15:02:28 node01 dockerd[239420]: goroutine 1 [running]:
Dec 07 15:02:28 node01 dockerd[239420]: runtime.systemstack_switch()
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc0008bff70 sp=0xc0008bff68 pc=0x5604cf2528e0
Dec 07 15:02:28 node01 dockerd[239420]: runtime.mallocgc(0x5beee000, 0x5604d1e8fde0, 0x1, 0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/malloc.go:1032 +0x8a7 fp=0xc0008c0010 sp=0xc0008bff70 pc=0x5604cf200ef7
Dec 07 15:02:28 node01 dockerd[239420]: runtime.growslice(0x5604d1e8fde0, 0xef119e4000, 0x3107eaa, 0x3107eaa, 0x3107eab, 0x5604cf20146b, 0xfb6bbbfb80, 0x7f6c01184350)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/slice.go:181 +0x1e4 fp=0xc0008c0078 sp=0xc0008c0010 pc=0x5604cf23b0b4
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt.(*Cursor).first(0xc0008c0288)
Dec 07 15:02:28 node01 dockerd[239420]: docker/vendor/go.etcd.io/bbolt/cursor.go:182 +0x138 fp=0xc0008c00d8 sp=0xc0008c0078 pc=0x5604d012d348
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt.(*Cursor).next(0xc0008c0288, 0x0, 0x9f, 0x9f, 0x7f7150506429, 0x2b8, 0xc0008c0188, 0x5604cf23ee10)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt/cursor.go:234 +0x84 fp=0xc0008c0128 sp=0xc0008c00d8 pc=0x5604d012d684
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt.(*Cursor).Next(0xc0008c0288, 0x7f715050638a, 0x9f, 0x9f, 0xfb6bbcf720, 0x9f, 0x2b8)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt/cursor.go:75 +0x43 fp=0xc0008c0198 sp=0xc0008c0128 pc=0x5604d012ca43
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libkv/store/boltdb.(*BoltDB).List.func1(0xc0034350a0, 0xc0008c0200, 0xc0034350a0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libkv/store/boltdb/boltdb.go:288 +0x19a fp=0xc0008c02b8 sp=0xc0008c0198 pc=0x5604d0142a4a
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt.(*DB).View(0xc0001dc200, 0xc0008c03f8, 0x0, 0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt/db.go:725 +0xaa fp=0xc0008c0340 sp=0xc0008c02b8 pc=0x5604d0130a3a
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libkv/store/boltdb.(*BoltDB).List(0xc00482f2c0, 0xc001a7af00, 0x5e, 0x0, 0x0, 0x0, 0x0, 0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libkv/store/boltdb/boltdb.go:279 +0x1b3 fp=0xc0008c0430 sp=0xc0008c0340 pc=0x5604d0141583
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore.(*cache).kmap(0xc003bf5900, 0x5604d220c1a0, 0xc001e7c160, 0x0, 0x5e, 0xc001a7ae40)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore/cache.go:43 +0x286 fp=0xc0008c0560 sp=0xc0008c0430 pc=0x5604cfcc81a6
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore.(*cache).list(0xc003bf5900, 0x5604d220c1a0, 0xc001e7c160, 0x0, 0x0, 0x0, 0x0, 0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore/cache.go:164 +0x7e fp=0xc0008c0678 sp=0xc0008c0560 pc=0x5604cfcc988e
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore.(*datastore).List(0xc0079ff680, 0xc001a7ae40, 0x5e, 0x5604d220c1a0, 0xc001e7c160, 0x0, 0x0, 0x0, 0x0, 0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore/datastore.go:517 +0x205 fp=0xc0008c0730 sp=0xc0008c0678 pc=0x5604cfccc745
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork.(*network).getEndpointsFromStore(0xc00289a380, 0xc004a99258, 0x7, 0x5604d2206920, 0xc001a515f0, 0xc002f3ff48)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/store.go:190 +0x343 fp=0xc0008c09f8 sp=0xc0008c0730 pc=0x5604d01acaf3
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork.(*controller).reservePools(0xc0004b7400)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/controller.go:977 +0x4c1 fp=0xc0008c0c28 sp=0xc0008c09f8 pc=0x5604d0159701
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork.New(0xc001055d00, 0x9, 0x10, 0xc0007ac870, 0xc001387fb0, 0xc001055d00, 0x9)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/controller.go:245 +0x615 fp=0xc0008c0dd8 sp=0xc0008c0c28 pc=0x5604d0154815
Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon.(*Daemon).initNetworkController(0xc00000c1e0, 0xc000207080, 0xc001387fb0, 0xc000236f50, 0xc00000c1e0, 0xc0007170c8, 0xc001387fb0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon/daemon_unix.go:855 +0xa9 fp=0xc0008c0e90 sp=0xc0008c0dd8 pc=0x5604d0a49429
Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon.(*Daemon).restore(0xc00000c1e0, 0xc00004e5c0, 0xc000220000)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon/daemon.go:490 +0x50b fp=0xc0008c1088 sp=0xc0008c0e90 pc=0x5604d0a3898b
Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon.NewDaemon(0x5604d21de4c0, 0xc00004e5c0, 0xc000207080, 0xc0007ac870, 0x0, 0x0, 0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon/daemon.go:1147 +0x2be8 fp=0xc0008c19c8 sp=0xc0008c1088 pc=0x5604d0a3c938
Dec 07 15:02:28 node01 dockerd[239420]: main.(*DaemonCli).start(0xc0007ab410, 0xc000218600, 0x0, 0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/cmd/dockerd/daemon.go:195 +0x743 fp=0xc0008c1d00 sp=0xc0008c19c8 pc=0x5604d0dbb3c3
Dec 07 15:02:28 node01 dockerd[239420]: main.runDaemon(...)
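The goroutine that runs out of memory is iterating a bbolt cursor inside libnetwork's boltdb-backed datastore (libkv/store/boltdb) while the daemon restores network state during startup. A quick sanity check on that store, assuming the default docker data root of /var/lib/docker:

# libnetwork keeps its local key-value store in a boltdb file; an abnormally
# large file here is consistent with an OOM while bbolt iterates it at startup.
ls -lh /var/lib/docker/network/files/local-kv.db
ls /var/lib/docker/network/files/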
Read more »

Problem Background

The karaf framework does not depend on the log4j package directly, so simply upgrading log4j in the project (or the fact that the project has no direct log4j dependency at all) does not address the recently disclosed vulnerabilities (CVE-2021-44228, CVE-2021-45046 and the related log4j CVEs).

Analysis Process

Analysis shows that the logging packages karaf actually depends on are org.ops4j.pax.logging.xxx, which in turn depend on log4j, essentially acting as a wrapper. So there are three ways to fix the vulnerabilities:

  1. Upgrade the framework: the impact is large, and framework releases are slow. There is no prebuilt framework package yet, so upgrading the framework would mean compiling all the framework bundles ourselves, which is risky;
  2. Upgrade the dependency alone: small impact, but if there is nowhere to configure the dependency it may be impossible (confirmed in practice: it cannot be upgraded independently);
  3. Modify the dependency in the current version and rebuild: small impact, but if the gap to the latest version is large there could be many changes;

Weighing these options, option 3 looks worth trying. From the commit history in reference [1], the fix for the log4j vulnerabilities in org.ops4j.pax.logging only involves bumping the log4j dependency version, with pax-logging itself going from 1.11.9 to 1.11.12, a small jump. Let's see what the actual changes are and first try to build it:

pom.xml
<version.org.apache.felix.configadmin>1.9.20</version.org.apache.felix.configadmin>
<version.org.apache.felix.framework>5.6.12</version.org.apache.felix.framework>
<version.org.apache.felix6.framework>6.0.3</version.org.apache.felix6.framework>
- <version.org.apache.logging.log4j>2.16.0</version.org.apache.logging.log4j>
+ <version.org.apache.logging.log4j>2.17.0</version.org.apache.logging.log4j>
<version.org.apache.servicemix.bundles.javax-inject>1_3</version.org.apache.servicemix.bundles.javax-inject>
<version.org.jboss.logging>3.4.1.Final</version.org.jboss.logging>
<version.org.mockito>3.7.7</version.org.mockito>

Download the version currently in use. Judging from the modules in the source tree, JDK 9 may be required:

[root@node org.ops4j.pax.logging-logging-1.11.9]# ll
total 144
-rw-r--r-- 1 root root 939 Feb 23 2021 CONTRIBUTORS.txt
-rw-r--r-- 1 root root 755 Feb 23 2021 license-header.txt
-rw-r--r-- 1 root root 12525 Feb 23 2021 LICENSE.txt
drwxr-xr-x 4 root root 99 Dec 21 17:51 pax-logging-api
drwxr-xr-x 4 root root 46 Dec 21 17:51 pax-logging-api-java9
drwxr-xr-x 4 root root 46 Dec 21 17:51 pax-logging-it
drwxr-xr-x 6 root root 106 Feb 23 2021 pax-logging-it-karaf
drwxr-xr-x 4 root root 99 Dec 21 17:51 pax-logging-log4j2
drwxr-xr-x 4 root root 99 Dec 21 17:51 pax-logging-log4j2-extra
drwxr-xr-x 4 root root 99 Dec 21 17:51 pax-logging-logback
drwxr-xr-x 2 root root 21 Feb 23 2021 pax-logging-report
drwxr-xr-x 10 root root 166 Feb 23 2021 pax-logging-samples
drwxr-xr-x 4 root root 99 Dec 21 17:51 pax-logging-service
-rw-r--r-- 1 root root 46604 Dec 21 17:46 pom.xml
-rw-r--r-- 1 root root 67356 Feb 23 2021 readme.adoc
drwxr-xr-x 3 root root 18 Feb 23 2021 src

First try building with a maven:3.3.9-jdk-9 build image:
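A minimal sketch of such a containerized build, using the image tag mentioned above (the mount paths and the skip-tests flag are assumptions):

docker run --rm -it \
  -v $(pwd)/org.ops4j.pax.logging-logging-1.11.9:/build \
  -v $HOME/.m2:/root/.m2 \
  -w /build \
  maven:3.3.9-jdk-9 \
  mvn clean install -DskipTests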

Read more »

The issue described here is a pitfall in older Kubernetes versions. While researching it I found that fatedier had already written an excellent analysis, which is reproduced here purely as a study note.

Author: fatedier
Original: https://blog.fatedier.com/2020/04/17/pod-loopcrash-of-k8s-subpath/
The article is copyrighted by the author. Reposting is welcome, but this notice must be kept and a link to the original must be given in a prominent position on the page; otherwise the author reserves the right to pursue legal liability.

Problem Background

For a container that mounts a configmap (or other volume) via subPath, if the volume content changes and the container then exits unexpectedly, it will crash continuously and never start again.

Related community issue: #68211. The problem is fixed in v1.19.

Analysis Process

Reproduction Steps

---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  volumes:
  - configMap:
      name: extra-cfg
    name: extra-cfg
  containers:
  - name: test
    image: ubuntu:bionic
    command: ["sleep", "30"]
    resources:
      requests:
        cpu: 100m
    volumeMounts:
    - name: extra-cfg
      mountPath: /etc/extra.ini
      subPath: extra.ini
---
apiVersion: v1
data:
  extra.ini: |
    somedata
kind: ConfigMap
metadata:
  name: extra-cfg

Apply this configuration. After the Pod starts, modify the configmap's content and wait 30 seconds for the container to exit on its own. kubelet then restarts the container, and from that point on the container fails to mount repeatedly:

Error: failed to start container "test": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/e044883a-48da-4d28-b304-1a57dcb32203/volume-subpaths/extra-cfg/test/0\\\" to rootfs \\\"/var/lib/docker/overlay2/31b076d0012aad47aa938b482de24ecda8b41505489a22f63b8a3e4ce39b43ba/merged\\\" at \\\"/var/lib/docker/overlay2/31b076d0012aad47aa938b482de24ecda8b41505489a22f63b8a3e4ce39b43ba/merged/etc/extra.ini\\\" caused \\\"no such file or directory\\\"\"": unknown
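A sketch of the reproduction commands described above (the manifest file name and the replacement value are placeholders; any change to the configmap data will do):

kubectl apply -f subpath-repro.yaml
kubectl wait --for=condition=Ready pod/test-pod
# change the configmap while the container is still running
kubectl patch configmap extra-cfg --type merge -p '{"data":{"extra.ini":"otherdata\n"}}'
# after the 30s sleep the container exits, and kubelet keeps failing to remount the subPath
kubectl get pod test-pod -w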
Read more »

Feature Summary

Exporting Packets

Method: File | Export Packet Dissections | As CSV (Comma Separated Values packet summary) file…

  1. Export formats include plain text, CSV, XML, JSON, etc.;
  2. You can export just the selected columns, the columns currently displayed, or all columns (configured in the export dialog);
  3. The packet format can also be configured, e.g. whether to include the summary line, column headings and packet details (a rough command-line equivalent using tshark is sketched after this list);
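For reference, a rough tshark equivalent of the CSV export (the capture file name and the field list are placeholders):

tshark -r capture.pcap -T fields -E header=y -E separator=, \
  -e frame.number -e frame.time_relative -e ip.src -e ip.dst \
  -e _ws.col.Protocol -e _ws.col.Info > packets.csv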

Adding a Display Column

Method: expand the packet details in the Packet Details pane, find the desired field, right-click it and choose Apply as Column.

Displaying a Single TCP/UDP Conversation

Method 1: select a packet, right-click and choose Conversation Filter | [TCP/UDP].

Method 2: select a packet, right-click and choose Follow | [TCP/UDP] Stream.

Method 3: choose Statistics | Conversations from the menu.

Read more »

Problem Background

In an IPv6 environment, accessing the web service from a browser via http://[vip]:port fails: the browser reports that the site cannot be reached because [vip] took too long to respond.

Analysis Process

We have run into cases before where a VIP could not be reached from a PC browser, and the troubleshooting steps are well established:

  1. Does access work on the cluster node that holds the VIP?
  2. Does access work from other nodes in the cluster?
  3. Does access work from a Linux host in the same subnet but outside the cluster?
  4. Does access work from a PC browser in another environment?

Testing showed that access failed even on the node holding the VIP! Logging in to that node, ip addr shows the address is indeed configured, but ping6 to the address gets no response, while ping to the corresponding IPv4 address does. Pinging a local address should not depend on the link state at all, so what is going on? A closer look at the address configuration revealed the flags tentative dadfailed:

17: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 0c:da:41:1d:a8:62 brd ff:ff:ff:ff:ff:ff
inet 10.10.10.17/16 scope global eth0
valid_lft forever preferred_lft forever
inet6 2000::10:18/128 scope global tentative dadfailed
valid_lft forever preferred_lft 0sec
inet6 fe80::eda:41ff:fe1d:a862/64 scope link
valid_lft forever preferred_lft forever

ip-address(8) documents the tentative flag as follows:

tentative
(IPv6 only) only list addresses which have not yet passed duplicate address detection.

Clearly the address failed duplicate address detection (DAD), a check that only applies to IPv6. It was confirmed that this environment was the only user of the IPv6 subnet and that no IPv6 address had been configured manually, but the environment had previously gone through a failover in which the active (VIP-holding) node switched.
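A minimal sketch of spotting and recovering from this state, using the address and interface from the output above (the sysctl workaround is an assumption and disables DAD entirely, so use it with care):

# Find IPv6 addresses stuck in DAD
ip -6 addr show | grep -E 'tentative|dadfailed'
# Deleting and re-adding the address re-triggers duplicate address detection
ip -6 addr del 2000::10:18/128 dev eth0
ip -6 addr add 2000::10:18/128 dev eth0
# Blunt workaround: disable DAD on the interface
sysctl -w net.ipv6.conf.eth0.accept_dad=0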

Read more »

Problem Background

After a server reboot, Pods using an additional (secondary) network fail to start, with the following errors:

Events:
Type Reason Age From Message
Normal Scheduled 53m default-scheduler Successfully assigned xxx/xxx1-64784c458b-q67tx to node001
Warning FailedCreatePodSandBox 53m kubelet, node001 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "xxx" network for pod "xxx1-64784c458b-q67tx": NetworkPlugin cni failed to set up pod "xxx1-64784c458b-q67tx_xxx" network: Multus: Err adding pod to network "net-net1-node001": Multus: error in invoke Delegate add - "macvlan": failed to create macvlan: device or resource busy
Warning FailedCreatePodSandBox 53m kubelet, node001 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "xxx" network for pod "xxx1-64784c458b-q67tx": NetworkPlugin cni failed to set up pod "xxx1-64784c458b-q67tx_xxx" network: Multus: Err adding pod to network "net-net1-node001": Multus: error in invoke Delegate add - "macvlan": failed to create macvlan: device or resource busy
...

Analysis Process

From the logs, creating the Pod's sandbox fails: specifically, Multus cannot add the Pod to the "net-net1-node001" network, and more specifically Multus fails to create the macvlan interface, the reason being device or resource busy.

That last error message is fairly common. Taken literally it means some device or resource is in use, and it is typically seen when unmounting shared storage. Something similar must be happening here: some device or resource is held busy, so creating the macvlan fails. Since the problem concerns the additional network, the first thing checked was the related network CRD resources (see the sketch below); nothing abnormal was found there.
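For context, a sketch of that CRD check (resource and namespace names are placeholders):

# NetworkAttachmentDefinition is the CRD Multus uses to describe additional networks
kubectl get network-attachment-definitions -A
kubectl get net-attach-def <net-name> -n <namespace> -o yaml   # inspect the macvlan config and its master interface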

Searching the web based on the error did not turn up anything closely related, so on to the code. Multus' own source does not contain the logic that produces this message. On reflection, Multus delegates the macvlan network to the macvlan CNI plugin, and the plugin's source does contain the relevant logic:

plugins/main/macvlan/macvlan.go:169
    if err := netlink.LinkAdd(mv); err != nil {
        return nil, fmt.Errorf("failed to create macvlan: %v", err)
    }

// LinkAdd adds a new link device. The type and features of the device
// are taken from the parameters in the link object.
// Equivalent to: `ip link add $link`
func LinkAdd(link Link) error {
    return pkgHandle.LinkAdd(link)
}

// LinkAdd adds a new link device. The type and features of the device
// are taken from the parameters in the link object.
// Equivalent to: `ip link add $link`
func (h *Handle) LinkAdd(link Link) error {
    return h.linkModify(link, unix.NLM_F_CREATE|unix.NLM_F_EXCL|unix.NLM_F_ACK)
}
...

From the code and its comments, this boils down to ip link add $link failing. Verify that directly:

[root@node001 ~]# ip link add link bond1 name macvlan1 type macvlan mode bridge
RTNETLINK answers: Device or resource busy
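So the failure reproduces outside of Kubernetes as well. A sketch of follow-up checks to see what might be occupying the parent interface (names follow the command above):

# Any macvlan (or other conflicting link) already attached to bond1?
ip -d link show type macvlan
ip -d link show bond1        # parent interface details and flags
# The kernel log sometimes says more than "Device or resource busy"
dmesg | tail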
Read more »

Problem Background

After deleting a business Pod with kubectl delete, the Pod stays in the Terminating state indefinitely.

Root Cause Analysis

From the symptom, some step in the deletion flow must be failing, leaving the deletion stuck in Terminating. Start with describe:

[root@node1 ~]# kubectl describe pod -n xxx cam1-78b6fc6bc8-cjsw5
// nothing abnormal was found, so the output is omitted here

No obvious anomaly in the events, so look at the logs of kubelet, the component responsible for actually deleting the Pod (only the key lines are kept):

I0728 16:24:57.339295    9744 kubelet.go:1904] SyncLoop (DELETE, "api"): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)"
I0728 16:24:57.339720 9744 kuberuntime_container.go:581] Killing container "docker://a73082a4a9a4cec174bb0d1c256cc11d804d93137551b9bfd3e6fa1522e98589" with 60 second grace period
I0728 16:25:18.259418 9744 kubelet.go:1904] SyncLoop (DELETE, "api"): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)"
2021-07-28 16:25:19.247 [INFO][394011] ipam.go 1173: Releasing all IPs with handle 'cam.cam1-78b6fc6bc8-cjsw5'
2021-07-28 16:25:19.254 [INFO][393585] k8s.go 498: Teardown processing complete.

// Suspicious point 1: the pod IP could not be obtained
W0728 16:25:19.303513 9744 docker_sandbox.go:384] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "cam1-78b6fc6bc8-cjsw5_cam": Unexpected command output Device "eth0" does not exist.
with error: exit status 1

I0728 16:25:19.341068 9744 kubelet.go:1933] SyncLoop (PLEG): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)", event: &pleg.PodLifecycleEvent{ID:"5c948341-c030-4996-b888-f032577d97b0", Type:"ContainerDied", Data:"a73082a4a9a4cec174bb0d1c256cc11d804d93137551b9bfd3e6fa1522e98589"}
I0728 16:25:20.578095 9744 kubelet.go:1933] SyncLoop (PLEG): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)", event: &pleg.PodLifecycleEvent{ID:"5c948341-c030-4996-b888-f032577d97b0", Type:"ContainerDied", Data:"c3b992465cd2085300995066526a36665664558446ff6e1756135c3a5b6df2e6"}

I0728 16:25:20.711967 9744 kubelet_pods.go:1090] Killing unwanted pod "cam1-78b6fc6bc8-cjsw5"

// Suspicious point 2: unmount failed
E0728 16:25:20.939400 9744 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/glusterfs/5c948341-c030-4996-b888-f032577d97b0-cam-pv-50g podName:5c948341-c030-4996-b888-f032577d97b0 nodeName:}" failed. No retries permitted until 2021-07-28 16:25:21.439325811 +0800 CST m=+199182.605079651 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"diag-log\" (UniqueName: \"kubernetes.io/glusterfs/5c948341-c030-4996-b888-f032577d97b0-cam-pv-50g\") pod \"5c948341-c030-4996-b888-f032577d97b0\" (UID: \"5c948341-c030-4996-b888-f032577d97b0\") : Unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g\nOutput: umount: /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g: target is busy.\n (In some cases useful info about processes that use\n the device is found by lsof(8) or fuser(1).)\n\n"

Two suspicious points stand out in the deletion logs:

  1. The failure to read the pod IP reported at docker_sandbox.go:384;
  2. The unmount failure reported at nestedpendingoperations.go:301 (a quick way to see what keeps such a mount busy is sketched after this list);
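For the second point, a minimal sketch of checking what keeps such a mount busy, using the path from the error above:

MNT=/var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g
# Which processes still hold files open under the mount point?
fuser -vm $MNT
lsof +D $MNT 2>/dev/null | head
# Has the mount been propagated into other mount namespaces (e.g. containers)?
grep cam-pv-50g /proc/*/mountinfo 2>/dev/null | cut -d/ -f3 | sort -u | head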

Starting with the first point, the corresponding code [1] is located as shown below: the IP could not be obtained, so the code prints a warning and returns an empty IP address;

Read more »