0%

K8S问题排查-Calico的Vxlan模式下节点发起K8s Service请求延迟

问题背景

K8S集群中修改calico的网络为vxlan模式后,发现部分service在节点上无法访问(实际上是延迟访问,延迟时间稳定在1min3s左右)。

1
2
3
4
[root@node2 ~]# time curl -s http://10.96.91.255
real 1m3.136s
user 0m0.005s
sys 0m0.005s

原因分析

先确认问题范围,在环境上找多个service依次测试发现,如果调用service的节点和实际pod不在同一个节点上,则出现延迟,否则请求正常。也就是说跨节点的访问才有问题。而直接用service对应的podIP访问,也不存在问题,猜测问题可能出在servicepod的过程。

再确认基本环境,OSK8Scalico等基础环境没有发生任何变化,仅仅是把calico的网络模式从BGP改为了vxlan,但是这个改动改变了集群内servicepod的请求路径,也即所有的容器请求需要走节点上新增的calico.vxlan接口封装一下。网络模式修改前没有问题,修改后必现,之后切回BGP模式问题就消失了,说明问题可能跟新增的calico.vxlan接口有关系。

先看下环境情况,并触发service请求:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# 验证环境:node2(10.10.72.11)——> node1(10.10.72.10)
# 验证方法:node2上curl service:10.96.91.255 ——> node1上pod:166.166.166.168:8082
[root@node2 ~]# ip addr
10: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UNKNOWN group default
link/ether 66:2d:bf:44:a6:8b brd ff:ff:ff:ff:ff:ff
inet 166.166.104.10/32 brd 166.166.104.10 scope global vxlan.calico
valid_lft forever preferred_lft forever

[root@node1 ~]# ip addr
15: vxlan.calico: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UNKNOWN group default
link/ether 66:f9:37:c3:7e:94 brd ff:ff:ff:ff:ff:ff
inet 166.166.166.175/32 brd 166.166.166.175 scope global vxlan.calico
valid_lft forever preferred_lft forever

[root@node2 ~]# time curl http://10.96.91.255

node1的主机网卡上抓包看看封装后的请求是否已到达:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
[root@node1 ~]# tcpdump -n -vv -i eth0 host 10.10.72.11 and udp
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
07:19:42.730996 IP (tos 0x0, ttl 64, id 6470, offset 0, flags [none], proto UDP (17), length 110)
10.10.72.11.nim-vdrshell > 10.10.72.10.4789: [bad udp cksum 0xffff -> 0x3af0!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 39190, offset 0, flags [DF], proto TCP (6), length 60)
166.166.104.10.35632 > 166.166.166.168.us-cli: Flags [S], cksum 0xe556 (correct), seq 3025623348, win 29200, options [mss 1460,sackOK,TS val 101892130 ecr 0,nop,wscale 7], length 0
07:19:43.733741 IP (tos 0x0, ttl 64, id 6804, offset 0, flags [none], proto UDP (17), length 110)
10.10.72.11.nim-vdrshell > 10.10.72.10.4789: [bad udp cksum 0xffff -> 0x3af0!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 39191, offset 0, flags [DF], proto TCP (6), length 60)
166.166.104.10.35632 > 166.166.166.168.us-cli: Flags [S], cksum 0xe16b (correct), seq 3025623348, win 29200, options [mss 1460,sackOK,TS val 101893133 ecr 0,nop,wscale 7], length 0
07:19:45.736729 IP (tos 0x0, ttl 64, id 7403, offset 0, flags [none], proto UDP (17), length 110)
10.10.72.11.nim-vdrshell > 10.10.72.10.4789: [bad udp cksum 0xffff -> 0x3af0!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 39192, offset 0, flags [DF], proto TCP (6), length 60)
166.166.104.10.35632 > 166.166.166.168.us-cli: Flags [S], cksum 0xd998 (correct), seq 3025623348, win 29200, options [mss 1460,sackOK,TS val 101895136 ecr 0,nop,wscale 7], length 0
07:19:49.744801 IP (tos 0x0, ttl 64, id 9648, offset 0, flags [none], proto UDP (17), length 110)
10.10.72.11.nim-vdrshell > 10.10.72.10.4789: [bad udp cksum 0xffff -> 0x3af0!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 39193, offset 0, flags [DF], proto TCP (6), length 60)
166.166.104.10.35632 > 166.166.166.168.us-cli: Flags [S], cksum 0xc9f0 (correct), seq 3025623348, win 29200, options [mss 1460,sackOK,TS val 101899144 ecr 0,nop,wscale 7], length 0
07:19:57.768735 IP (tos 0x0, ttl 64, id 12853, offset 0, flags [none], proto UDP (17), length 110)
10.10.72.11.nim-vdrshell > 10.10.72.10.4789: [bad udp cksum 0xffff -> 0x3af0!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 39194, offset 0, flags [DF], proto TCP (6), length 60)
166.166.104.10.35632 > 166.166.166.168.us-cli: Flags [S], cksum 0xaa98 (correct), seq 3025623348, win 29200, options [mss 1460,sackOK,TS val 101907168 ecr 0,nop,wscale 7], length 0
07:20:05.087057 IP (tos 0x0, ttl 64, id 8479, offset 0, flags [none], proto UDP (17), length 164)
10.10.72.10.34565 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3425, offset 0, flags [DF], proto UDP (17), length 114)
166.166.166.168.57850 > 166.166.104.6.domain: [udp sum ok] 10121+ AAAA? influxdb-nginx-service.kube-system.svc.kube-system.svc.cluster.local. (86)
07:20:05.087076 IP (tos 0x0, ttl 64, id 54475, offset 0, flags [none], proto UDP (17), length 164)
10.10.72.10.51841 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3424, offset 0, flags [DF], proto UDP (17), length 114)
166.166.166.168.57984 > 166.166.104.6.domain: [udp sum ok] 20020+ A? influxdb-nginx-service.kube-system.svc.kube-system.svc.cluster.local. (86)
07:20:05.087671 IP (tos 0x0, ttl 64, id 13540, offset 0, flags [none], proto UDP (17), length 257)
10.10.72.11.60395 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19190, offset 0, flags [DF], proto UDP (17), length 207)
166.166.104.6.domain > 166.166.166.168.57850: [udp sum ok] 10121 NXDomain*- q: AAAA? influxdb-nginx-service.kube-system.svc.kube-system.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1647633218 7200 1800 86400 5 (179)
07:20:05.087702 IP (tos 0x0, ttl 64, id 13541, offset 0, flags [none], proto UDP (17), length 257)
10.10.72.11.48571 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19191, offset 0, flags [DF], proto UDP (17), length 207)
166.166.104.6.domain > 166.166.166.168.57984: [udp sum ok] 20020 NXDomain*- q: A? influxdb-nginx-service.kube-system.svc.kube-system.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1647633218 7200 1800 86400 5 (179)
07:20:05.088801 IP (tos 0x0, ttl 64, id 8480, offset 0, flags [none], proto UDP (17), length 152)
10.10.72.10.55780 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3427, offset 0, flags [DF], proto UDP (17), length 102)
166.166.166.168.56015 > 166.166.104.6.domain: [udp sum ok] 19167+ AAAA? influxdb-nginx-service.kube-system.svc.svc.cluster.local. (74)
07:20:05.089048 IP (tos 0x0, ttl 64, id 13542, offset 0, flags [none], proto UDP (17), length 245)
10.10.72.11.50151 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19192, offset 0, flags [DF], proto UDP (17), length 195)
166.166.104.6.domain > 166.166.166.168.56015: [udp sum ok] 19167 NXDomain*- q: AAAA? influxdb-nginx-service.kube-system.svc.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1647633218 7200 1800 86400 5 (167)
07:20:05.089212 IP (tos 0x0, ttl 64, id 8481, offset 0, flags [none], proto UDP (17), length 148)
10.10.72.10.50272 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3430, offset 0, flags [DF], proto UDP (17), length 98)
166.166.166.168.54926 > 166.166.104.6.domain: [udp sum ok] 40948+ A? influxdb-nginx-service.kube-system.svc.cluster.local. (70)
07:20:05.089403 IP (tos 0x0, ttl 64, id 13543, offset 0, flags [none], proto UDP (17), length 241)
10.10.72.11.59882 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19193, offset 0, flags [DF], proto UDP (17), length 191)
166.166.104.6.domain > 166.166.166.168.54926: [udp sum ok] 40948 NXDomain*- q: A? influxdb-nginx-service.kube-system.svc.cluster.local. 0/1/0 ns: cluster.local. SOA ns.dns.cluster.local. hostmaster.cluster.local. 1647633218 7200 1800 86400 5 (163)
07:20:05.089524 IP (tos 0x0, ttl 64, id 8482, offset 0, flags [none], proto UDP (17), length 134)
10.10.72.10.58964 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3431, offset 0, flags [DF], proto UDP (17), length 84)
166.166.166.168.50263 > 166.166.104.6.domain: [udp sum ok] 18815+ A? influxdb-nginx-service.kube-system.svc. (56)
07:20:05.089681 IP (tos 0x0, ttl 64, id 13544, offset 0, flags [none], proto UDP (17), length 134)
10.10.72.11.51874 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19194, offset 0, flags [DF], proto UDP (17), length 84)
166.166.104.6.domain > 166.166.166.168.50263: [udp sum ok] 18815 ServFail- q: A? influxdb-nginx-service.kube-system.svc. 0/0/0 (56)
07:20:05.089706 IP (tos 0x0, ttl 64, id 8483, offset 0, flags [none], proto UDP (17), length 134)
10.10.72.10.59891 > 10.10.72.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 3433, offset 0, flags [DF], proto UDP (17), length 84)
166.166.166.168.49202 > 166.166.104.6.domain: [udp sum ok] 58612+ AAAA? influxdb-nginx-service.kube-system.svc. (56)
07:20:05.089859 IP (tos 0x0, ttl 64, id 13545, offset 0, flags [none], proto UDP (17), length 134)
10.10.72.11.44146 > 10.10.72.10.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 19195, offset 0, flags [DF], proto UDP (17), length 84)
166.166.104.6.domain > 166.166.166.168.49202: [udp sum ok] 58612 ServFail- q: AAAA? influxdb-nginx-service.kube-system.svc. 0/0/0 (56)

从抓包结果看,出现一个可疑点:前几个报文中提示bad udp cksum 0xffff,请求通的最后几个报文提示的是no cksum

根据这个错误信息,搜索发现是个已知bug,相关的详细定位可以参考[1]-[3],这里就不细说了。大概原因如下所述:

内核中存在一个和VXLAN处理有关的缺陷,该缺陷会导致Checksum Offloading不能正确完成。这个缺陷仅仅在很边缘的场景下才会表现出来。

VXLANUDP头被NAT过的前提下,如果:

  1. VXLAN设备禁用(这是RFC的建议)了UDP Checksum
  2. VXLAN设备启用了Tx Checksum Offloading

就会导致生成错误的UDP Checksum

从资料[1]看,K8Sv1.18.5版本已经修复了这个问题,但我的问题是在v1.21.0上发现的,所以不确定只升级K8S是否可以解决该问题,或者升级后还需要额外配置什么?

从资料[3]和[4]看,calicov3.20.0版本做了修改:在kernels < v5.7时也禁用了calico.vxlan接口的Offloading功能。

本地临时禁用并验证:

1
2
3
4
5
6
7
8
9
10
11
[root@node2 ~]# ethtool --offload vxlan.calico rx off tx off
Actual changes:
rx-checksumming: off
tx-checksumming: off
tx-checksum-ip-generic: off
tcp-segmentation-offload: off
tx-tcp-segmentation: off [requested on]
tx-tcp-ecn-segmentation: off [requested on]
tx-tcp6-segmentation: off [requested on]
tx-tcp-mangleid-segmentation: off [requested on]
udp-fragmentation-offload: off [requested on]
1
2
3
4
[root@node2 ~]# time curl http://10.96.91.255
real 0m0.009s
user 0m0.002s
sys 0m0.007s

请求恢复正常。

解决方案

  1. 临时解决: ethtool --offload vxlan.calico rx off tx off
  2. 永久解决:升级calico >=v3.20.0或升级内核到5.6.13, 5.4.41, 4.19.123, 4.14.181,单独升级K8S >= v1.18.5版本待确认是否能解决

参考资料

  1. https://blog.gmem.cc/nodeport-63s-delay-due-to-kernel-issue
  2. https://cloudnative.to/blog/kubernetes-1-17-vxlan-63s-delay/
  3. https://bowser1704.github.io/posts/vxlan-bug/
  4. https://github.com/projectcalico/calico/issues/3145
  5. https://github.com/projectcalico/felix/pull/2811/files