
问题背景

IPv6环境下,在浏览器中通过http://[vip]:port访问web业务,提示"无法访问此网站,[vip]的响应时间过长"。

分析过程

之前碰到过多次在PC浏览器上无法访问vip的情况,排查方法也很明确:

  1. 在集群的vip所在节点上访问是否正常;
  2. 在集群范围内其他节点上访问是否正常;
  3. 在集群之外的同网段linux环境上访问是否正常;
  4. 在其他环境的PC浏览器上访问是否正常;

验证发现,直接在vip所在节点上访问竟然不通!登录vip所在节点执行ip addr可以看到该地址确实已正确配置,但ping6该地址无回应,而对应的ipv4地址ping有回应。按说ping本机的地址不应该和链路状态有关系,那会是什么原因呢?仔细检查地址配置情况后,发现该地址带有tentative dadfailed标记:

17: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 0c:da:41:1d:a8:62 brd ff:ff:ff:ff:ff:ff
inet 10.10.10.17/16 scope global eth0
valid_lft forever preferred_lft forever
inet6 2000::10:18/128 scope global tentative dadfailed
valid_lft forever preferred_lft 0sec
inet6 fe80::eda:41ff:fe1d:a862/64 scope link
valid_lft forever preferred_lft forever

在ip-address(8)手册中查到对该标记的解释如下:

tentative
(IPv6 only) only list addresses which have not yet passed duplicate address detection.

显然该地址没有通过地址重复探测(duplicate address detection,简称DAD),而且这种检查机制只针对IPv6。经确认,该环境的IPv6网段只有自己在用,且未手工配置过IPv6地址,但该环境曾经发生过切主。

至此问题基本明确了:切主时会把老的主节点上的vip删除,再到新的主节点上把vip添加上去。如果一切正常,按照这个顺序切主没有问题,但也存在某些异常情况(比如老主上的vip没有及时删掉,而新主上已经添加好了),此时就会触发DAD机制。经过验证,一旦出现dadfailed,即使地址冲突解决了,该地址依然无法访问。

解决方案

方案1:在sysctl配置中增加如下内核参数:

net.ipv6.conf.all.accept_dad = 0
net.ipv6.conf.default.accept_dad = 0
net.ipv6.conf.eth0.accept_dad = 0

# IPv6 Privacy Extensions (RFC 4941)
net.ipv6.conf.all.use_tempaddr = 0
net.ipv6.conf.default.use_tempaddr = 0

方案2:在ip addr add命令执行时增加nodad标识:

ip addr add 2000::10:18/128 dev eth0 nodad
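
对于已经处于dadfailed状态的地址,仅解决地址冲突并不能恢复,一个可行的恢复思路是把该地址删除后再带nodad重新添加。下面是一个操作示意(地址2000::10:18/128和接口eth0沿用前文示例,实际使用时请替换):

# 确认是否存在dadfailed的IPv6地址
ip -6 addr show dev eth0 | grep dadfailed

# 删除后带nodad重新添加,跳过重复地址检测
ip addr del 2000::10:18/128 dev eth0
ip addr add 2000::10:18/128 dev eth0 nodad

# 验证本机可达
ping6 -c 3 2000::10:18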


问题背景

使用附加网络的Pod在服务器重启后启动异常,报错信息如下:

Events:
Type Reason Age From Message
Normal Scheduled 53m default-scheduler Successfully assigned xxx/xxx1-64784c458b-q67tx to node001
Warning FailedCreatePodSandBox 53m kubelet, node001 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "xxx" network for pod "xxx1-64784c458b-q67tx": NetworkPlugin cni failed to set up pod "xxx1-64784c458b-q67tx_xxx" network: Multus: Err adding pod to network "net-net1-node001": Multus: error in invoke Delegate add - "macvlan": failed to create macvlan: device or resource busy
Warning FailedCreatePodSandBox 53m kubelet, node001 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "xxx" network for pod "xxx1-64784c458b-q67tx": NetworkPlugin cni failed to set up pod "xxx1-64784c458b-q67tx_xxx" network: Multus: Err adding pod to network "net-net1-node001": Multus: error in invoke Delegate add - "macvlan": failed to create macvlan: device or resource busy
...

分析过程

从日志初步看,创建Pod的sandbox异常,具体是Multus无法将Pod添加到net-net1-node001这个网络,再具体点是Multus无法创建macvlan网络,原因是device or resource busy。

最后的这个错误信息还是比较常见的,从字面理解,就是设备或资源忙,常见于共享存储的卸载场景。那这里也应该类似,是有什么设备或资源处于被占用状态,所以macvlan的创建失败。既然是附加网络的问题,那就优先查看了下附加网络相关的CRD资源,没什么异常;

网上根据日志搜索一番,也没有什么比较相关的问题,那就看代码吧,首先找到Multus的源码,根据上述日志找相关处理逻辑,没有找到。再一想,Multus实现macvlan网络使用的是macvlan插件,再下载插件代码,找到了相关处理逻辑:

plugins/main/macvlan/macvlan.go:169
    if err := netlink.LinkAdd(mv); err != nil {
        return nil, fmt.Errorf("failed to create macvlan: %v", err)
    }

// LinkAdd adds a new link device. The type and features of the device
// are taken from the parameters in the link object.
// Equivalent to: `ip link add $link`
func LinkAdd(link Link) error {
    return pkgHandle.LinkAdd(link)
}

// LinkAdd adds a new link device. The type and features of the device
// are taken from the parameters in the link object.
// Equivalent to: `ip link add $link`
func (h *Handle) LinkAdd(link Link) error {
    return h.linkModify(link, unix.NLM_F_CREATE|unix.NLM_F_EXCL|unix.NLM_F_ACK)
}
...

根据上述代码和注释简单地看,是在执行ip link add $link命令时报错,实际验证看看:

[root@node001 ~] ip link add link bond1 name macvlan1 type macvlan mode bridge
RTNETLINK answers: Device or resource busy

确实如此,在bond1接口上无法配置macvlan,那换一个接口试试:

[root@node001 ~] ip link add link bond0 name macvlan1 type macvlan mode bridge
[root@node001 ~] ip link show
...
110: macvlan1@bond0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ea:31:c9:7f:d9:a4 brd ff:ff:ff:ff:ff:ff
...

配置成功,说明bond1接口有什么问题,看看这俩接口有没有差异:

[root@node001 ~] ip addr show
...
2: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 0c:da:41:1d:6f:ca brd ff:ff:ff:ff:ff:ff
inet x.x.x.x/16 brd x.x.255.255 scope global bond0
valid_lft forever preferred_lft forever
inet6 fe80::eda:41ff:fe1d:6fca/64 scope link
valid_lft forever preferred_lft forever
...
17: bond1: <BROADCAST,MULTICAST,MASTER,SLAVE,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 0c:da:41:1d:a8:62 brd ff:ff:ff:ff:ff:ff
...

对比两个接口可以发现两个差异点:

  1. bond0配置了IP地址,而bond1没有配置;
  2. bond0是MASTER角色,bond1既是MASTER,又是SLAVE角色;

考虑到bond0接口是用来建集群的,bond1接口是给Multus创建macvlan网络用的,所以第一个差异点属于正常现象。第二个是什么情况呢?一般来说,配置bond的目的是把几个物理接口作为SLAVE角色聚合成bond接口,这样既能提高服务器网络的可靠性,又能增加可用网络带宽,为用户提供不间断的网络服务。配置后,实际的物理接口应该是SLAVE角色,而聚合后的bond接口应该是MASTER角色,所以正常来说,同一个接口不应该同时出现两个角色才对;
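
排查这类问题时,可以用下面几条命令进一步确认各接口的主从关系(接口名沿用前文示例,仅作演示):

# 查看bond0的模式和成员口,正常情况下Slave Interface里只应出现物理网卡
grep -E "Bonding Mode|Slave Interface" /proc/net/bonding/bond0

# 查看bond1的详细信息,若被错误地加入了其他bond,会带有bond_slave相关信息
ip -d link show bond1

# 若该符号链接存在,说明bond1被当成了其他接口的slave
ls -l /sys/class/net/bond1/master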

查看两个bond的相关配置,没有发现什么异常,反过来讲,如果配置的有问题,那初次部署就应该报错了,而不是重启节点才发现。所以,问题的关键是重启导致的。也就是说,可能是在重启后的启动脚本里加了什么配置影响的;

搜索相关资料[1],发现在配置过程中可能有这么一个操作:

4、在/etc/rc.d/rc.local文件中加入如下语句,使系统启动自动运行
ifenslave bond0 eth0 eth1

查看问题环境上怎么配置的:

[root@node001 ~] cat /etc/rc.local
...
touch /var/lock/subsys/local
ifenslave bond0 bond1 enp661s0f0 enp661s0f1 ens1f0 ens1f1

发现有类似的配置,但不同的是,问题环境上配置了两个bond,并且配置在了同一条命令里。感觉不太对:按照个人理解,这么配置会把bond1也当作bond0的SLAVE,修改一下试试:

[root@node001 ~] cat /etc/rc.local
...
touch /var/lock/subsys/local
ifenslave bond0 enp661s0f0 enp661s0f1
ifenslave bond1 ens1f0 ens1f1
[root@node001 ~] systemctl restart network

再观察两个bond接口的角色,发现恢复正常,再看看异常Pod,也都起来了。

[root@node001 ~] kubectl get pod -A |grep -v Running
NAMESPACE NAME READY STATUS RESTARTS AGE

解决方案

将rc.local里两个bond的ifenslave命令拆开,分别配置即可。

参考资料

  1. https://www.cnblogs.com/geaozhang/p/6763876.html

问题背景

通过kubectl delete命令删除某个业务Pod后,该Pod一直处于Terminating状态。

原因分析

根据现象看,应该是删除过程中有哪个流程异常,导致最终的删除卡在了Terminating状态。先describe看一下:

[root@node1 ~]# kubectl describe pod -n xxx cam1-78b6fc6bc8-cjsw5
// 没有发现什么异常信息,这里就不贴日志了

Event事件中未见明显异常,那就看负责删除Pod的kubelet组件日志(已过滤出关键性日志):

I0728 16:24:57.339295    9744 kubelet.go:1904] SyncLoop (DELETE, "api"): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)"
I0728 16:24:57.339720 9744 kuberuntime_container.go:581] Killing container "docker://a73082a4a9a4cec174bb0d1c256cc11d804d93137551b9bfd3e6fa1522e98589" with 60 second grace period
I0728 16:25:18.259418 9744 kubelet.go:1904] SyncLoop (DELETE, "api"): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)"
2021-07-28 16:25:19.247 [INFO][394011] ipam.go 1173: Releasing all IPs with handle 'cam.cam1-78b6fc6bc8-cjsw5'
2021-07-28 16:25:19.254 [INFO][393585] k8s.go 498: Teardown processing complete.

// 可疑点1:没有获取到pod IP
W0728 16:25:19.303513 9744 docker_sandbox.go:384] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "cam1-78b6fc6bc8-cjsw5_cam": Unexpected command output Device "eth0" does not exist.
with error: exit status 1

I0728 16:25:19.341068 9744 kubelet.go:1933] SyncLoop (PLEG): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)", event: &pleg.PodLifecycleEvent{ID:"5c948341-c030-4996-b888-f032577d97b0", Type:"ContainerDied", Data:"a73082a4a9a4cec174bb0d1c256cc11d804d93137551b9bfd3e6fa1522e98589"}
I0728 16:25:20.578095 9744 kubelet.go:1933] SyncLoop (PLEG): "cam1-78b6fc6bc8-cjsw5_cam(5c948341-c030-4996-b888-f032577d97b0)", event: &pleg.PodLifecycleEvent{ID:"5c948341-c030-4996-b888-f032577d97b0", Type:"ContainerDied", Data:"c3b992465cd2085300995066526a36665664558446ff6e1756135c3a5b6df2e6"}

I0728 16:25:20.711967 9744 kubelet_pods.go:1090] Killing unwanted pod "cam1-78b6fc6bc8-cjsw5"

// 可疑点2:Unmount失败
E0728 16:25:20.939400 9744 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/glusterfs/5c948341-c030-4996-b888-f032577d97b0-cam-pv-50g podName:5c948341-c030-4996-b888-f032577d97b0 nodeName:}" failed. No retries permitted until 2021-07-28 16:25:21.439325811 +0800 CST m=+199182.605079651 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"diag-log\" (UniqueName: \"kubernetes.io/glusterfs/5c948341-c030-4996-b888-f032577d97b0-cam-pv-50g\") pod \"5c948341-c030-4996-b888-f032577d97b0\" (UID: \"5c948341-c030-4996-b888-f032577d97b0\") : Unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g\nOutput: umount: /var/lib/kubelet/pods/5c948341-c030-4996-b888-f032577d97b0/volumes/kubernetes.io~glusterfs/cam-pv-50g:目标忙。\n (有些情况下通过 lsof(8) 或 fuser(1) 可以\n 找到有关使用该设备的进程的有用信息。)\n\n"

从删除Pod的日志看,有2个可疑点:

  1. docker_sandbox.go:384打印的获取pod IP错误;
  2. nestedpendingoperations.go:301打印的Unmount失败错误;

先看第1点,根据日志定位到代码[1]位置如下,IP没有拿到所以打印了个告警并返回空IP地址;

pkg/kubelet/dockershim/docker_sandbox.go:348
func (ds *dockerService) getIP(podSandboxID string, sandbox *dockertypes.ContainerJSON) string {
    if sandbox.NetworkSettings == nil {
        return ""
    }
    if networkNamespaceMode(sandbox) == runtimeapi.NamespaceMode_NODE {
        // For sandboxes using host network, the shim is not responsible for
        // reporting the IP.
        return ""
    }

    // Don't bother getting IP if the pod is known and networking isn't ready
    ready, ok := ds.getNetworkReady(podSandboxID)
    if ok && !ready {
        return ""
    }

    ip, err := ds.getIPFromPlugin(sandbox)
    if err == nil {
        return ip
    }

    if sandbox.NetworkSettings.IPAddress != "" {
        return sandbox.NetworkSettings.IPAddress
    }
    if sandbox.NetworkSettings.GlobalIPv6Address != "" {
        return sandbox.NetworkSettings.GlobalIPv6Address
    }

    // 错误日志在这里
    klog.Warningf("failed to read pod IP from plugin/docker: %v", err)
    return ""
}

继续看getIP方法的调用处代码,这里如果没有拿到IP,也没有什么异常,直接把空值放到PodSandboxStatusResponse中并返回;

pkg/kubelet/dockershim/docker_sandbox.go:404
func (ds *dockerService) PodSandboxStatus(ctx context.Context, req *runtimeapi.PodSandboxStatusRequest) (*runtimeapi.PodSandboxStatusResponse, error) {
    podSandboxID := req.PodSandboxId

    r, metadata, err := ds.getPodSandboxDetails(podSandboxID)
    if err != nil {
        return nil, err
    }

    // Parse the timestamps.
    createdAt, _, _, err := getContainerTimestamps(r)
    if err != nil {
        return nil, fmt.Errorf("failed to parse timestamp for container %q: %v", podSandboxID, err)
    }
    ct := createdAt.UnixNano()

    // Translate container to sandbox state.
    state := runtimeapi.PodSandboxState_SANDBOX_NOTREADY
    if r.State.Running {
        state = runtimeapi.PodSandboxState_SANDBOX_READY
    }

    // 调用getIP方法的位置
    var IP string
    if IP = ds.determinePodIPBySandboxID(podSandboxID); IP == "" {
        IP = ds.getIP(podSandboxID, r)
    }

    labels, annotations := extractLabels(r.Config.Labels)
    status := &runtimeapi.PodSandboxStatus{
        Id:          r.ID,
        State:       state,
        CreatedAt:   ct,
        Metadata:    metadata,
        Labels:      labels,
        Annotations: annotations,
        Network: &runtimeapi.PodSandboxNetworkStatus{
            Ip: IP,
        },
        Linux: &runtimeapi.LinuxPodSandboxStatus{
            Namespaces: &runtimeapi.Namespace{
                Options: &runtimeapi.NamespaceOption{
                    Network: networkNamespaceMode(r),
                    Pid:     pidNamespaceMode(r),
                    Ipc:     ipcNamespaceMode(r),
                },
            },
        },
    }
    return &runtimeapi.PodSandboxStatusResponse{Status: status}, nil
}

到此看不出这个错误会不会中断删除流程,那就本地构造一下试试。修改上面的代码,在调用getIP方法的位置后面增加调试日志(从本地验证结果看,Pod正常删除,说明异常问题与此处无关);

// 调用getIP方法的位置
var IP string
if IP = ds.determinePodIPBySandboxID(podSandboxID); IP == "" {
IP = ds.getIP(podSandboxID, r)
}

// 新加调试日志,如果是指定的Pod,强制将IP置空
isTestPod := strings.Contains(metadata.GetName(), "testpod")
if isTestPod {
IP = ""
}

再看第2点,这个是ERROR级别的错误,问题出在Unmount挂载点时失败。那么卸载挂载点失败会导致删除流程提前终止吗?网上关于Pod删除流程的源码分析文章很多,我们就直接找几篇[2,3,4]看看能不能解答上面的问题。

简单总结来说,删除一个Pod的流程如下:

  1. 调用kube-apiserver的DELETE接口(默认带grace-period=30s);
  2. 第一次的删除只是更新Pod对象的元信息(DeletionTimestamp字段和DeletionGracePeriodSeconds字段),并没有在Etcd中删除记录;
  3. kubectl命令的执行会阻塞并显示正在删除Pod;
  4. kubelet组件监听到Pod对象的更新事件,执行killPod()方法;
  5. kubelet组件监听到Pod的删除事件,第二次调用kube-apiserver的DELETE接口(带grace-period=0);
  6. kube-apiserver的DELETE接口去etcd中删除Pod对象;
  7. kubectl命令的执行返回,删除Pod成功;

从前面kubelet删除异常的日志看,确实有两次DELETE操作,并且中间有个Killing container的日志,但从上面的删除流程看,两次DELETE操作之间应该是调用killPod()方法,通过查看源码,对应的日志应该是Killing unwanted pod,所以,实际上第二次的DELETE操作并没有触发。

pkg/kubelet/kubelet_pods.go:1073
func (kl *Kubelet) podKiller() {
    killing := sets.NewString()
    // guard for the killing set
    lock := sync.Mutex{}
    for podPair := range kl.podKillingCh {
        runningPod := podPair.RunningPod
        apiPod := podPair.APIPod

        lock.Lock()
        exists := killing.Has(string(runningPod.ID))
        if !exists {
            killing.Insert(string(runningPod.ID))
        }
        lock.Unlock()

        // 这里在调用killPod方法前会打印v2级别的日志
        if !exists {
            go func(apiPod *v1.Pod, runningPod *kubecontainer.Pod) {
                klog.V(2).Infof("Killing unwanted pod %q", runningPod.Name)
                err := kl.killPod(apiPod, runningPod, nil, nil)
                if err != nil {
                    klog.Errorf("Failed killing the pod %q: %v", runningPod.Name, err)
                }
                lock.Lock()
                killing.Delete(string(runningPod.ID))
                lock.Unlock()
            }(apiPod, runningPod)
        }
    }
}

怎么确认第二次的DELETE操作有没有触发呢?很简单,看代码或者实际验证都可以。这里我就在测试环境删除个Pod看下相关日志:

[root@node2 ~]# kubectl delete pod -n xxx  testpodrc2-7b749f6c9c-qh68l
pod "testpodrc2-7b749f6c9c-qh68l" deleted

// 已过滤出关键性日志
[root@node2 ~]# tailf kubelet.log
I0730 13:27:31.854178 24588 kubelet.go:1904] SyncLoop (DELETE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"
I0730 13:27:31.854511 24588 kuberuntime_container.go:581] Killing container "docker://e2a1cd5f2165e12cf0b46e12f9cd4d656d593f75e85c0de058e0a2f376a5557e" with 30 second grace period
I0730 13:27:32.203167 24588 kubelet.go:1904] SyncLoop (DELETE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"

I0730 13:27:32.993294 24588 kubelet.go:1933] SyncLoop (PLEG): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)", event: &pleg.PodLifecycleEvent{ID:"85ee282f-a843-4f10-a99c-79d447f83f2a", Type:"ContainerDied", Data:"e2a1cd5f2165e12cf0b46e12f9cd4d656d593f75e85c0de058e0a2f376a5557e"}
I0730 13:27:32.993428 24588 kubelet.go:1933] SyncLoop (PLEG): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)", event: &pleg.PodLifecycleEvent{ID:"85ee282f-a843-4f10-a99c-79d447f83f2a", Type:"ContainerDied", Data:"c6a587614976beed0cbb6e5fabf70a2d039eec6c160154fce007fe2bb1ba3b4f"}

I0730 13:27:34.072494 24588 kubelet_pods.go:1090] Killing unwanted pod "testpodrc2-7b749f6c9c-qh68l"

I0730 13:27:40.084182 24588 kubelet.go:1904] SyncLoop (DELETE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"
I0730 13:27:40.085735 24588 kubelet.go:1898] SyncLoop (REMOVE, "api"): "testpodrc2-7b749f6c9c-qh68l_testpod(85ee282f-a843-4f10-a99c-79d447f83f2a)"

对比正常和异常场景下的日志可以看出,正常的删除操作下,Killing unwanted pod日志之后还会有DELETE和REMOVE的操作,这也就说明问题出在第二次DELETE操作没有触发。查看相关代码:

pkg/kubelet/status/status_manager.go:470
// kubelet组件有一个statusManager模块,它会for循环调用syncPod()方法
// 方法内部有机会调用kube-apiserver的DELETE接口(强制删除,非平滑)
func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
    ...

    // 当pod带有DeletionTimestamp字段,并且其内容器已被删除、持久卷已被删除等多个条件同时满足时,才会进入if语句内部
    if m.canBeDeleted(pod, status.status) {
        deleteOptions := metav1.NewDeleteOptions(0)
        deleteOptions.Preconditions = metav1.NewUIDPreconditions(string(pod.UID))

        // 强制删除pod对象:kubectl delete pod podA --grace-period=0
        err = m.kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, deleteOptions)
        ...
    }
}

从源码可以看出,第二次DELETE操作是否触发依赖于canBeDeleted方法的校验结果,而这个方法内会检查持久卷是否已经被删除:

pkg/kubelet/status/status_manager.go:538
func (m *manager) canBeDeleted(pod *v1.Pod, status v1.PodStatus) bool {
    if pod.DeletionTimestamp == nil || kubepod.IsMirrorPod(pod) {
        return false
    }
    return m.podDeletionSafety.PodResourcesAreReclaimed(pod, status)
}

pkg/kubelet/kubelet_pods.go:900
func (kl *Kubelet) PodResourcesAreReclaimed(pod *v1.Pod, status v1.PodStatus) bool {
    ...

    // 这里会判断挂载卷是否已卸载
    if kl.podVolumesExist(pod.UID) && !kl.keepTerminatedPodVolumes {
        // We shouldnt delete pods whose volumes have not been cleaned up if we are not keeping terminated pod volumes
        klog.V(3).Infof("Pod %q is terminated, but some volumes have not been cleaned up", format.Pod(pod))
        return false
    }
    if kl.kubeletConfiguration.CgroupsPerQOS {
        pcm := kl.containerManager.NewPodContainerManager()
        if pcm.Exists(pod) {
            klog.V(3).Infof("Pod %q is terminated, but pod cgroup sandbox has not been cleaned up", format.Pod(pod))
            return false
        }
    }
    return true
}

结合出问题的日志,基本能确认是Unmount挂载点失败导致的异常。那么,挂载点为啥会Unmount失败?

// umount失败关键日志
Unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-pv-50g\nOutput: umount: /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-pv-50g:目标忙。\n (有些情况下通过 lsof(8) 或 fuser(1) 可以\n 找到有关使用该设备的进程的有用信息。)\n\n"

仔细看卸载失败的日志,可以看到这个挂载点的后端存储是glusterfs,而目标忙一般来说是存储设备侧在使用,所以无法卸载。那就找找看是不是哪个进程使用了这个挂载目录(以下定位由负责glusterfs的同事提供):

[root@node1 ~]# fuser -mv /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-pv-50g
用户 进程号 权限 命令
root kernel mount /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/cam-dialog-gluster-pv-50g
root 94549 f.... glusterfs

除了内核的mount,还有个pid=94549的glusterfs进程在占用挂载点所在目录,看看是什么进程:

[root@node1 ~]# ps -ef| grep 94549
root 94549 1 0 7月26 ? 00:01:13 /usr/sbin/glusterfs --log-level=ERROR --log-file=/var/lib/kubelet/plugins/kubernetes.io/glusterfs/global-diaglog-pv/web-fddf96444-stxpf-glusterfs.log --fuse-mountopts=auto_unmount --process-name fuse --volfile-server=xxx --volfile-server=xxx --volfile-server=xxx --volfile-id=global-diaglog --fuse-mountopts=auto_unmount /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~glusterfs/global-diaglog-pv

发现这个进程维护的是web-xxx的挂载信息,而web-xxx与cam-xxx没有任何关联。由此推断是glusterfs管理的挂载信息发生错乱导致的,具体错乱原因就转给相关负责的同事看了。

解决方案

从分析结果看,是共享存储卷未正常卸载导致的删除Pod异常,非K8S问题。
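
遇到类似的Terminating问题,可以按下面的思路快速确认是否卡在挂载卷清理上(命令中的命名空间、Pod名和路径均为示意,需按实际环境替换):

# 确认Pod已带上删除时间戳
kubectl get pod -n xxx cam1-78b6fc6bc8-cjsw5 -o jsonpath='{.metadata.deletionTimestamp}'

# 查看节点上该Pod是否还有未清理的volume目录(pod-uid即Pod的metadata.uid)
ls /var/lib/kubelet/pods/<pod-uid>/volumes/

# 若有残留挂载点,用fuser确认是哪个进程在占用
fuser -mv /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~glusterfs/<pv-name>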

参考资料

  1. Kubernetes v1.15.12源码
  2. kubernetes删除pod的流程的源码简析
  3. Kubernetes源码分析之Pod的删除
  4. kubernetes grace period 失效问题排查

问题背景

部署在服务器上的Web应用因为机房迁移,导致PC上无法正常访问Web页面。

原因分析

本次遇到的问题纯属网络层面问题,不用多想,先登录到服务器上,查看服务端口的监听状态:

[root@node2]# netstat -anp|grep 443
tcp6 0 0 :::443 :::* LISTEN 8450/java

在服务器所在节点、以及服务器之外的其他节点上curl监听端口,看看是否有响应:

[root@node2]# curl -i -k https://192.168.10.10:443
HTTP/1.1 302 Found
Location: https://127.0.0.1:443
Content-Length: 0

[root@node2]# curl -i -k https://192.168.10.11:443
HTTP/1.1 302 Found
Location: https://192.168.10.11:443
Content-Length: 0

到此为止,说明Web服务运行正常,问题出在了PC到服务器这个通信过程。本地wireshark抓包看看,相关异常报文如下:

371 70.961626   3.2.253.177     172.30.31.151   TCP     66  52541 → 443 [SYN] Seq=0 Win=8192 Len=0 MSS=1460 WS=4 SACK_PERM=1
373 70.962516 172.30.31.151 3.2.253.177 TCP 66 443 → 52541 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460 SACK_PERM=1 WS=128
375 70.962563 3.2.253.177 172.30.31.151 TCP 54 52541 → 443 [ACK] Seq=1 Ack=1 Win=65700 Len=0
377 70.963248 3.2.253.177 172.30.31.151 TLSv1.2 571 Client Hello
379 70.964323 172.30.31.151 3.2.253.177 TCP 60 443 → 52541 [ACK] Seq=1 Ack=518 Win=30336 Len=0
381 70.965327 172.30.31.151 3.2.253.177 TLSv1.2 144 Server Hello
383 70.965327 172.30.31.151 3.2.253.177 TLSv1.2 105 Change Cipher Spec, Encrypted Handshake Message
385 70.965364 3.2.253.177 172.30.31.151 TCP 54 52541 → 443 [ACK] Seq=518 Ack=142 Win=65556 Len=0
387 70.967194 3.2.253.177 172.30.31.151 TLSv1.2 61 Alert (Level: Fatal, Description: Certificate Unknown)
388 70.967233 3.2.253.177 172.30.31.151 TCP 54 52541 → 443 [FIN, ACK] Seq=525 Ack=142 Win=65556 Len=0
391 70.968320 172.30.31.151 3.2.253.177 TLSv1.2 85 Encrypted Alert
392 70.968321 172.30.31.151 3.2.253.177 TCP 60 443 → 52541 [FIN, ACK] Seq=173 Ack=526 Win=30336 Len=0
394 70.968356 3.2.253.177 172.30.31.151 TCP 54 52541 → 443 [RST, ACK] Seq=526 Ack=173 Win=0 Len=0
395 70.968370 3.2.253.177 172.30.31.151 TCP 54 52541 → 443 [RST] Seq=526 Win=0 Len=0

关键是最后两个报文,可以看出连接被复位(带RST标志)。与提供环境的人了解到,PC与服务器之间使用的交换机是通过GRE隧道打通网络的,基本怀疑是交换机配置存在问题;

同时观察到PC访问集群的ftp也存在异常,说明是一个通用问题;而PC上ping、ssh服务器都没有问题,说明是配置导致的部分协议的连接问题;

后来提供环境的人排查交换机配置,发现GRE隧道的默认MTU为1464,而集群网卡上的MTU为1500,最后协商出的MSS为1460(见抓包中的前两个报文):

[leaf11]dis interface Tunnel
Tunnel0
Current state: UP
Line protocol state: UP
Description: Tunnel0 Interface
Bandwidth: 64 kbps
Maximum transmission unit: 1464
Internet protocol processing: Disabled
Last clearing of counters: Never
Tunnel source 3.1.1.11, destination 2.1.1.222
Tunnel protocol/transport UDP_VXLAN/IP
Last 300 seconds input rate: 0 bytes/sec, 0 bits/sec, 0 packets/sec
Last 300 seconds output rate: 0 bytes/sec, 0 bits/

这种情况下,最大的报文发到交换机后,由于隧道口允许的最大TCP载荷只有1464-40=1424字节(IP头和TCP头各占20字节),小于协商出的MSS 1460,超长报文无法通过,所以出现了上述现象。这同时也解释了为什么http、ftp有问题(长报文),而ping、ssh没有问题(短报文)。
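
这类MTU不一致的问题,也可以用带DF标志的ping快速验证路径MTU,下面是一个验证示意(目的地址沿用前文抓包中的服务器地址172.30.31.151,实际使用时请替换):

# -M do表示禁止分片(设置DF),-s指定ICMP载荷大小
# 载荷1436 + ICMP头8 + IP头20 = 1464字节,恰好等于隧道MTU,预期可以通
ping -M do -s 1436 -c 3 172.30.31.151

# 载荷1437对应1465字节的IP报文,超过隧道MTU,预期失败(丢包或提示需要分片)
ping -M do -s 1437 -c 3 172.30.31.151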

解决方案

方案1:修改隧道口和物理口的MTU值,但是取值不好定,因为不知道应用最长报文的长度。
方案2:在GRE隧道口配置TCP MSS,超出后分片处理。

设置TCP MSS的参考命令:

【命令】
tcp mss value
undo tcp mss
【缺省情况】
未配置接口的TCP最大报文段长度。
【视图】
接口视图
【缺省用户角色】
network-admin
mdc-admin
【参数】
value:TCP最大报文段长度,取值范围为128~(接口的最大MTU值-40),单位为字节。
【使用指导】
TCP最大报文段长度(Max Segment Size,MSS)表示TCP连接的对端发往本端的最大TCP报文段的长度,目前作为TCP连接建立时的一个选项来协商:当一个TCP连接建立时,连接的双方要将MSS作为TCP报文的一个选项通告给对端,对端会记录下这个MSS值,后续在发送TCP报文时,会限制TCP报文的大小不超过该MSS值。当对端发送的TCP报文的长度小于本端的TCP最大报文段长度时,TCP报文不需要分段;否则,对端需要对TCP报文按照最大报文段长度进行分段处理后再发给本端。
该配置仅对新建的TCP连接生效,对于配置前已建立的TCP连接不生效。
该配置仅对IP报文生效,当接口上配置了MPLS功能后,不建议再配置本功能。


问题背景

如下所示,用户使用kubectl top命令看到其中一个节点上的Harbor占用内存约3.7G(其他业务Pod也存在类似现象),整体上来说,有点偏高。

[root@node02 ~]# kubectl get node -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP
node01 Ready master 10d v1.15.12 100.1.0.10 <none>
node02 Ready master 12d v1.15.12 100.1.0.11 <none>
node03 Ready master 10d v1.15.12 100.1.0.12 <none>

[root@node02 ~]# kubectl top pod -A |grep harbor
kube-system harbor-master1-sxg2l 15m 150Mi
kube-system harbor-master2-ncvb8 8m 3781Mi
kube-system harbor-master3-2gdsn 14m 227Mi

原因分析

我们知道,查看容器的内存占用,可以使用kubectl top命令,也可以使用docker stats命令,并且理论上来说,docker stats命令查到的结果应该比kubectl top更准确。查看并统计发现,实际上Harbor总内存占用约为140M,远没有达到3.7G:

[root@node02 ~]# docker stats |grep harbor
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM %
10a230bee3c7 k8s_nginx_harbor-master2-xxx 0.02% 14.15MiB / 94.26GiB 0.01%
6ba14a04fd77 k8s_harbor-portal_harbor-master2-xxx 0.01% 13.73MiB / 94.26GiB 0.01%
324413da20a9 k8s_harbor-jobservice_harbor-master2-xxx 0.11% 21.54MiB / 94.26GiB 0.02%
d880b61cf4cb k8s_harbor-core_harbor-master2-xxx 0.12% 33.2MiB / 94.26GiB 0.03%
186c064d0930 k8s_harbor-registryctl_harbor-master2-xxx 0.01% 8.34MiB / 94.26GiB 0.01%
52a50204a962 k8s_harbor-registry_harbor-master2-xxx 0.06% 29.99MiB / 94.26GiB 0.03%
86031ddd0314 k8s_harbor-redis_harbor-master2-xxx 0.14% 11.51MiB / 94.26GiB 0.01%
6366207680f2 k8s_harbor-database_harbor-master2-xxx 0.45% 8.859MiB / 94.26GiB 0.01%

这是什么情况?两个命令查到的结果差距也太大了。查看资料[1]可以知道:

  1. kubectl top命令的计算公式:memory.usage_in_bytes - inactive_file
  2. docker stats命令的计算公式:memory.usage_in_bytes - cache

可以看出,两种方式收集机制不一样,如果cache比较大,kubectl top命令看到的结果会偏高。根据上面的计算公式验证看看是否正确:

curl -s --unix-socket /var/run/docker.sock http:/v1.24/containers/xxx/stats | jq ."memory_stats"
"memory_stats": {
"usage": 14913536,
"max_usage": 15183872,
"stats": {
"active_anon": 14835712,
"active_file": 0,
"cache": 77824,
"dirty": 0,
"hierarchical_memory_limit": 101205622784,
"hierarchical_memsw_limit": 9223372036854772000,
"inactive_anon": 4096,
"inactive_file": 73728,
...
}

"memory_stats": {
"usage": 14405632,
"max_usage": 14508032,
"stats": {
"active_anon": 14397440,
"active_file": 0,
"cache": 8192,
"dirty": 0,
"hierarchical_memory_limit": 101205622784,
"hierarchical_memsw_limit": 9223372036854772000,
"inactive_anon": 4096,
"inactive_file": 4096,
...
}

"memory_stats": {
"usage": 26644480,
"max_usage": 31801344,
"stats": {
"active_anon": 22810624,
"active_file": 790528,
"cache": 3833856,
"dirty": 0,
"hierarchical_memory_limit": 101205622784,
"hierarchical_memsw_limit": 9223372036854772000,
"inactive_anon": 0,
"inactive_file": 3043328,
...
}

"memory_stats": {
"usage": 40153088,
"max_usage": 90615808,
"stats": {
"active_anon": 35123200,
"active_file": 1372160,
"cache": 5029888,
"dirty": 0,
"hierarchical_memory_limit": 101205622784,
"hierarchical_memsw_limit": 9223372036854772000,
"inactive_anon": 0,
"inactive_file": 3657728,
...
}

"memory_stats": {
"usage": 10342400,
"max_usage": 12390400,
"stats": {
"active_anon": 8704000,
"active_file": 241664,
"cache": 1638400,
"dirty": 0,
"hierarchical_memory_limit": 101205622784,
"hierarchical_memsw_limit": 9223372036854772000,
"inactive_anon": 0,
"inactive_file": 1396736,
...
}

"memory_stats": {
"usage": 5845127168,
"max_usage": 22050988032,
"stats": {
"active_anon": 31576064,
"active_file": 3778052096,
"cache": 5813551104,
"dirty": 0,
"hierarchical_memory_limit": 101205622784,
"hierarchical_memsw_limit": 9223372036854772000,
"inactive_anon": 0,
"inactive_file": 2035499008,
...
}

"memory_stats": {
"usage": 13250560,
"max_usage": 34791424,
"stats": {
"active_anon": 12070912,
"active_file": 45056,
"cache": 1179648,
"dirty": 0,
"hierarchical_memory_limit": 101205622784,
"hierarchical_memsw_limit": 9223372036854772000,
"inactive_anon": 0,
"inactive_file": 1134592,
...
}

"memory_stats": {
"usage": 50724864,
"max_usage": 124682240,
"stats": {
"active_anon": 23502848,
"active_file": 13864960,
"cache": 41435136,
"dirty": 0,
"hierarchical_memory_limit": 101205622784,
"hierarchical_memsw_limit": 9223372036854772000,
"inactive_anon": 6836224,
"inactive_file": 6520832,
...
}

根据上面提供的计算公式和实际获取的memory_stats数据,验证kubectl top结果和docker stats结果符合预期。那为什么Harbor缓存会占用那么高呢?
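
手工验证时,也可以直接让jq按这两个公式计算,下面是一个示意(容器ID需替换为实际值,公式即上文参考资料中的两种统计口径):

# 取一次内存统计快照,分别按kubectl top和docker stats的口径计算(单位:字节)
curl -s --unix-socket /var/run/docker.sock "http://localhost/v1.24/containers/<container_id>/stats?stream=false" \
  | jq '.memory_stats | {kubectl_top: (.usage - .stats.inactive_file), docker_stats: (.usage - .stats.cache)}'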

通过实际环境分析看,Harbor中占用缓存较高的组件是registry(如下所示,缓存有5.4G),考虑到registry负责docker镜像的存储,在处理镜像时会有大量的镜像层文件的读写操作,所以正常情况下这些操作确实会比较耗缓存;

"memory_stats": {
"usage": 5845127168,
"max_usage": 22050988032,
"stats": {
"active_anon": 31576064,
"active_file": 3778052096,
"cache": 5813551104,
"dirty": 0,
"hierarchical_memory_limit": 101205622784,
"hierarchical_memsw_limit": 9223372036854772000,
"inactive_anon": 0,
"inactive_file": 2035499008,
...
}

解决方案

与用户沟通,说明kubectl top看到的结果包含了容器内使用的cache,结果会偏高,这部分缓存在内存紧张情况下会被系统回收,或者手工操作也可以释放,建议使用docker stats命令查看实际内存使用率。
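
上面提到的"手工释放"通常是指让内核回收page cache,下面是一个系统级操作的示意(会影响整机缓存,生产环境请谨慎执行):

# 先把脏页落盘,再释放page cache(echo 2释放dentry/inode缓存,echo 3两者都释放)
sync
echo 1 > /proc/sys/vm/drop_caches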

参考资料

  1. https://blog.csdn.net/xyclianying/article/details/108513122

问题背景

接上次的问题,一段时间后,环境再次出现harbor、calico因为健康检查不通过而反复重启的问题,并且使用kubectl命令进入Pod也响应非常慢甚至超时。

[root@node01 ~]# kubectl exec -it -n system node1-59c9475bc6-zkhq5 bash
^

原因分析

反复重启的原因上次已定位,这次上环境简单看还是因为健康检查超时的问题,并且现象也一样,TCP的连接卡在了第一次握手的SYN_SENT阶段。

[root@node01 ~]# netstat -anp|grep 23380
tcp 0 0 127.0.0.1:23380 0.0.0.0:* LISTEN 38914/kubelet
tcp 0 0 127.0.0.1:38983 127.0.0.1:23380 SYN_SENT -

也就是说,除了TCP连接队列的问题,还存在其他问题会导致该现象。先看看上次的参数还在不在:

[root@node01 ~]# cat /etc/sysctl.conf
net.ipv4.tcp_max_syn_backlog = 32768
net.core.somaxconn = 32768

再看下上次修改的参数是否生效:

[root@node01 ~]# ss -lnt
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 32768 127.0.0.1:23380 *:*

参数的修改也生效了,那为什么还会卡在SYN_SENT阶段呢?从现有情况,看不出还有什么原因会导致该问题,只能摸索看看。

  1. 在问题节点和非问题节点上分别抓包,看报文交互是否存在什么异常;
  2. 根据参考资料[1],排查是否为相同问题;
  3. 根据参考资料[2],排查是否相同问题;

摸索一番,没发现什么异常。回过头来想想,既然是业务下发大量配置导致的,并且影响是全局的(除了业务Pod自身,其他组件也受到了影响),说明大概率还是系统层面存在性能瓶颈。业务量大时,除了CPU,一般还会影响内存、磁盘、连接数等等。与开发人员确认,他们使用的是长连接,那么连接数很大的情况下会受到什么内核参数的影响呢?其中一个就是我们熟知的文件句柄数。

[root@node01 ~]# lsof -p 45775 | wc -l
17974

[root@node01 ~]# lsof -p 45775|grep "sock"| wc -l
12051

嗯,打开了1w+的文件句柄数并且基本都是sock连接,而我们使用的操作系统默认情况下每个进程的文件句柄数限制为1024,查看确认一下:

[root@node01 ~]# ulimit  -n
1024

超额使用了这么多,业务Pod竟然没有too many open files错误:

[root@node01 ~]# kubectl logs -n system node1-59c9475bc6-zkhq5
start config
...

临时修改一下:

[root@node01 ~]# ulimit -n 65535
[root@node01 ~]# ulimit -n
65535

再次使用kubectl命令进入业务Pod,响应恢复正常,并且查看连接也不再有卡住的SYN_SENT阶段:

[root@node01 ~]# kubectl exec -it -n system node1-59c9475bc6-zkhq5 bash
[root@node1-59c9475bc6-zkhq5]# exit
[root@node01 ~]# kubectl exec -it -n system node1-59c9475bc6-zkhq5 bash
[root@node1-59c9475bc6-zkhq5]# exit
[root@node01 ~]# kubectl exec -it -n system node1-59c9475bc6-zkhq5 bash
[root@node1-59c9475bc6-zkhq5]# exit

[root@node01 ~]# netstat -anp|grep 23380
tcp 0 0 127.0.0.1:23380 0.0.0.0:* LISTEN 38914/kubelet
tcp 0 0 127.0.0.1:56369 127.0.0.1:23380 TIME_WAIT -
tcp 0 0 127.0.0.1:23380 127.0.0.1:57601 TIME_WAIT -
tcp 0 0 127.0.0.1:23380 127.0.0.1:57479 TIME_WAIT -

解决方案

  1. 业务根据实际情况调整文件句柄数(持久化配置的示意见下方示例)。
  2. 针对业务量大的环境,强烈建议整体做一下操作系统层面的性能优化,否则,不定哪个系统参数就成了性能瓶颈,网上找了个调优案例[3],感兴趣的可以参考。
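
针对方案1,临时的ulimit修改在节点重启后会失效,下面给出两种常见的持久化思路作为示意(文件路径和数值需结合实际操作系统与部署方式确认):

# 方式一:通过/etc/security/limits.conf调整登录会话的nofile限制
cat >> /etc/security/limits.conf <<'EOF'
* soft nofile 65535
* hard nofile 65535
EOF

# 方式二:容器进程的限制通常继承自容器运行时,可为docker服务配置systemd的LimitNOFILE
mkdir -p /etc/systemd/system/docker.service.d
cat > /etc/systemd/system/docker.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65535
EOF
systemctl daemon-reload && systemctl restart docker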

参考资料

  1. https://blog.csdn.net/pyxllq/article/details/80351827
  2. http://mdba.cn/2015/03/10/tcp-socket文件句柄泄漏
  3. https://www.shuzhiduo.com/A/RnJW7NLyJq/

问题背景

K8S集群内,Influxdb监控数据获取异常,最终CPU、内存和磁盘使用率都无法获取。

监控项         使用率
CPU(核) 3%
内存(GB) 18%
磁盘空间(GB) 0%

监控项 使用率
CPU(核) 7%
内存(GB) 18%
磁盘空间(GB) 1%

监控项 使用率
CPU(核) 0%
内存(GB) 0%
磁盘空间(GB) 0%

...

Influxdb监控架构图参考[1],其中Load Balancer采用nginx实现:

        ┌─────────────────┐                 
│writes & queries │
└─────────────────┘


┌───────────────┐
│ │
┌────────│ Load Balancer │─────────┐
│ │ │ │
│ └──────┬─┬──────┘ │
│ │ │ │
│ │ │ │
│ ┌──────┘ └────────┐ │
│ │ ┌─────────────┐ │ │┌──────┐
│ │ │/write or UDP│ │ ││/query│
│ ▼ └─────────────┘ ▼ │└──────┘
│ ┌──────────┐ ┌──────────┐ │
│ │ InfluxDB │ │ InfluxDB │ │
│ │ Relay │ │ Relay │ │
│ └──┬────┬──┘ └────┬──┬──┘ │
│ │ | | │ │
│ | ┌─┼──────────────┘ | │
│ │ │ └──────────────┐ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ │
│ │ │ │ │ │
└─▶│ InfluxDB │ │ InfluxDB │◀─┘
│ │ │ │
└──────────┘ └──────────┘

原因分析

因为获取的数据来源是influxdb数据库,所以先搞清楚异常的原因是请求路径上的问题,还是influxdb数据库自身没有数据的问题:

# 找到influxdb-nginx的service
kubectl get svc -n kube-system -owide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
grafana-service ClusterIP 10.96.177.245 <none> 3000/TCP 21d app=grafana
heapster ClusterIP 10.96.239.225 <none> 80/TCP 21d app=heapster
influxdb-nginx-service ClusterIP 10.96.170.72 <none> 7076/TCP 21d app=influxdb-nginx
influxdb-relay-service ClusterIP 10.96.196.45 <none> 9096/TCP 21d app=influxdb-relay
influxdb-service ClusterIP 10.96.127.45 <none> 8086/TCP 21d app=influxdb

# 在集群节点上检查访问influxdb-nginx的service是否正常
curl -i 10.96.170.72:7076/query
HTTP/1.1 401 Unauthorized
Server: nginx/1.17.2

可以看出,请求发送到influxdb-nginx的service是正常的,也就是请求可以正常发送到后端的influxdb数据库。那就继续确认是不是influxdb数据库自身没有数据的问题:

# 找到influxdb数据库的pod
kubectl get pod -n kube-system -owide |grep influxdb
influxdb-nginx-4x8pr 1/1 Running 3 21d 177.177.52.201 node3
influxdb-nginx-tpngh 1/1 Running 6 21d 177.177.41.214 node1
influxdb-nginx-wh6kc 1/1 Running 5 21d 177.177.250.180 node2
influxdb-relay-rs-65c94bbf5f-dp7s4 1/1 Running 2 21d 177.177.250.148 node2
influxdb1-6ff9466d46-q6w5r 1/1 Running 3 21d 177.177.41.230 node1
influxdb2-d6d6697f5-zzcnk 1/1 Running 3 21d 177.177.250.161 node2
influxdb3-65ddfc7476-hxhr8 1/1 Running 4 21d 177.177.52.217 node3

# 登录任意一个influxdb容器内并进入交互式命令
kubectl exec -it -n kube-system influxdb-rs3-65ddfc7476-hxhr8 bash
root@influxdb-rs3-65ddfc7476-hxhr8:/# influx
Connected to http://localhost:8086 version 1.7.7
InfluxDB shell version: 1.7.7
> auth
username: admin
password: xxx
> use xxx;
Using database xxx

根据业务层面的查询语句,在influxdb交互式命令下手工查询验证:

> select sum(value) from "cpu/node_capacity" where "type" = 'node' and "nodename" = 'node1' and time > now() - 2m
>

结果发现确实没有查到数据,既然2min内的数据没有,那把时间线拉长一些看看呢?

# 不限制时间范围的查询
> select sum(value) from "cpu/node_capacity";
name: cpu/node_capacity
time sum
---- ---
0 5301432000

# 查询72min内的数据
> select sum(value) from "cpu/node_capacity" where "type" = 'node' and "nodename" = 'node1' and time > now() - 72m
name: cpu/node_capacity
time sum
---- ---
1624348319900503945 72000

# sleep 1min,继续查询72min内的数据
> select sum(value) from "cpu/node_capacity" where "type" = 'node' and "nodename" = 'node1' and time > now() - 72m
name: cpu/node_capacity
>

# 查询73min内的数据
> select sum(value) from "cpu/node_capacity" where "type" = 'node' and "nodename" = 'node1' and time > now() - 73m
name: cpu/node_capacity
time sum
---- ---
1624348319900503945 72000

根据查询结果看,不添加时间范围的查询是有记录的,并且通过多次验证看,数据无法获取的原因是数据在某个时间点不再写入导致的。查看influxdb的日志看看有没有什么相关日志:

kubectl logs -n kube-system influxdb-rs3-65ddfc7476-hxhr8
ts=2021-06-22T09:56:49.658621Z lvl=warn msg="max-values-per-tag limit may be exceeded soon" log_id=0UYIcREl000 service=store perc=100% n=100000 max=100000 db_instance=xxx measurement=network/rx tag=pod_name
ts=2021-06-22T09:56:49.658702Z lvl=warn msg="max-values-per-tag limit may be exceeded soon" log_id=0UYIcREl000 service=store perc=100% n=100000 max=100000 db_instance=xxx measurement=network/rx_errors tag=pod_name
ts=2021-06-22T09:56:49.658815Z lvl=warn msg="max-values-per-tag limit may be exceeded soon" log_id=0UYIcREl000 service=store perc=100% n=100000 max=100000 db_instance=xxx measurement=network/tx tag=pod_name
ts=2021-06-22T09:56:49.658893Z lvl=warn msg="max-values-per-tag limit may be exceeded soon" log_id=0UYIcREl000 service=store perc=100% n=100000 max=100000 db_instance=xxx measurement=network/tx_errors tag=pod_name
ts=2021-06-22T09:56:49.659062Z lvl=warn msg="max-values-per-tag limit may be exceeded soon" log_id=0UYIcREl000 service=store perc=100% n=100003 max=100000 db_instance=xxx measurement=uptime tag=pod_name

果然,有大量warn日志,提示max-values-per-tag limit may be exceeded soon,从日志可以看出,这个参数的默认值为100000。通过搜索,找到了这个参数引入的issue[2],引入原因大概意思是:

如果不小心加载了大量的cardinality数据,那么当我们删除数据的时候,InfluxDB很容易会发生OOM。
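
在调整参数之前,可以先确认当前数据库的series和tag基数是否确实逼近上限,下面是基于InfluxQL的检查示意(在influx交互命令行中执行,数据库名和tag名按实际情况替换,具体语法以所用InfluxDB版本的文档为准):

> use xxx
> SHOW SERIES CARDINALITY
> SHOW TAG VALUES CARDINALITY WITH KEY = "pod_name"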

通过临时修改max-values-per-tag参数,验证问题是否解决:

cat influxdb.conf
[meta]
dir = "/var/lib/influxdb/meta"
[data]
dir = "/var/lib/influxdb/data"
engine = "tsm1"
wal-dir = "/var/lib/influxdb/wal"
max-series-per-database = 0
max-values-per-tag = 0
[http]
auth-enabled = true
修改配置后,删除对应的Pod使其重建并加载新配置:
kubectl delete pod -n kube-system influxdb-rs1-6ff9466d46-q6w5r
pod "influxdb-rs1-6ff9466d46-q6w5r" deleted

kubectl delete pod -n kube-system influxdb-rs2-d6d6697f5-zzcnk
pod "influxdb-rs2-d6d6697f5-zzcnk" deleted

kubectl delete pod -n kube-system influxdb-rs3-65ddfc7476-hxhr8
pod "influxdb-rs3-65ddfc7476-hxhr8" deleted

再次观察业务层面获取的Influxdb监控数据,最终CPU、内存和磁盘使用率正常获取。

监控项         使用率
CPU(核) 19%
内存(GB) 22%
磁盘空间(GB) 2%

解决方案

根据业务情况,将influxdb的max-values-per-tag参数调整到合适值。

参考资料

  1. https://github.com/influxdata/influxdb-relay
  2. https://github.com/influxdata/influxdb/issues/7146

问题背景

K8S集群内,PodA使用服务名称访问PodB,请求出现异常。其中,PodA在node1节点上,PodB在node2节点上。

原因分析

先上tcpdump,观察请求是否有异常:

[root@node1 ~]# tcpdump -n -i ens192 port 50300
...
13:48:17.630335 IP 177.177.176.150.distinct -> 10.96.22.136.50300: UDP, length 214
13:48:17.630407 IP 192.168.7.21.distinct -> 10.96.22.136.50300: UDP, length 214
...

从抓包数据可以看出,请求源地址端口号为177.177.176.150:50901,目标地址端口号为10.96.22.136:50300 ,其中10.96.22.136是PodA使用server-svc这个serviceName请求得到的目的地址,也就是server-svc对应的serviceIP,那就确认一下这个地址有没有问题:

[root@node1 ~]# kubectl get pod -A -owide|grep server
ss server-xxx-xxx 1/1 Running 0 20h 177.177.176.150 node1
ss server-xxx-xxx 1/1 Running 0 20h 177.177.254.245 node2
ss server-xxx-xxx 1/1 Running 0 20h 177.177.18.152 node3
[root@node1 ~]# kubectl get svc -A -owide|grep server
ss server-svc ClusterIP 10.96.182.195 <none> 50300/UDP

可以看出,源地址没有问题,但目标地址跟预期不符,实际查到的服务名server-svc对应的地址为10.96.182.195,这是怎么回事儿呢?我们知道,K8S从v1.13版本开始默认使用CoreDNS作为服务发现,PodA使用服务名server-svc发起请求时,需要经过CoreDNS的解析,将服务名解析为serviceIP,那就登录到PodA内,验证域名解析是不是有问题:

[root@node1 ~]# kubectl exec -it -n ss server-xxx-xxx -- cat /etc/resolv.conf
nameserver 10.96.0.10
search ss.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

[root@node1 ~]# kubectl exec -it -n ss server-xxx-xxx -- nslookup server-svc
Server: 10.96.0.10

Name: ss
Address: 10.96.182.195

从查看结果看,域名解析没有问题,PodA内也可以正确解析出server-svc对应的serviceIP为10.96.182.195。那最初使用tcpdump命令抓到的serviceIP是10.96.22.136,难道这个地址是其他业务的服务,或者是残留的iptables规则,或者是有什么相关路由?分别查一下看看:

[root@node1 ~]# kubectl get svc -A -owide|grep 10.96.22.136

[root@node1 ~]# iptables-save|grep 10.96.22.136

[root@node1 ~]# ip route|grep 10.96.22.136

结果是,集群上根本不存在10.96.22.136这个地址,那PodA请求的目标地址为什么是它?既然主机上抓包时,目标地址已经是10.96.22.136,那再确认下出PodA时目标地址是什么:

[root@node1 ~]# ip route|grep 177.177.176.150
177.177.176.150 dev cali9afa4438787 scope link

[root@node1 ~]# tcpdump -n -i cali9afa4438787 port 50300
...
14:16:40.821511 IP 177.177.176.150.50902 -> 10.96.22.136.50300: UDP, length 214
...

原来出PodA时,目标地址已经是错误的serviceIP。而结合上面域名解析的验证结果看,请求出PodA时的域名解析应该不存在问题。综合上面的定位情况,基本可以推测出,问题出在发送方。

为了进一步区分出,是PodA内的所有发送请求都存在问题,还是只有业务自身的发送请求存在问题,我们使用nc命令在PodA内模拟发送一个UDP数据包,然后在主机上抓包验证(PodA内恰巧有nc命令,如果没有,感兴趣的同学可以使用/dev/{tcp|udp}模拟[1]):

[root@node1 ~]# kubectl exec -it -n ss server-xxx-xxx -- sh -c 'echo "test" | nc -u server-svc 50300 -p 9999'

[root@node1 ~]# tcpdump -n -i cali9afa4438787 port 50300
...
15:46:45.871580 IP 177.177.176.150.50902 -> 10.96.182.195.50300: UDP, length 54
...

可以看出,PodA内模拟发送的请求,目标地址是可以正确解析的,也就把问题限定在了业务自身的发送请求存在问题。因为问题是服务名没有解析为正确的IP地址,所以怀疑是业务使用了什么缓存,如果猜想正确,那么重启PodA,理论上可以解决。而考虑到业务是多副本的,我们重启其中一个,其他副本上的问题环境还可以保留,跟开发沟通后重启并验证业务的请求:

[root@node1 ~]# docker ps |grep server-xxx-xxx | grep -v POD |awk '{print $1}' |xargs docker restart

[root@node1 ~]# tcpdump -n -i ens192 port 50300
...
15:58:17.150535 IP 177.177.176.150.distinct -> 10.96.182.195.50300: UDP, length 214
15:58:17.150607 IP 192.168.7.21.distinct -> 10.96.182.195.50300: UDP, length 214
...

验证符合预期,进一步证明了业务可能是使用了什么缓存。与开发同学了解,业务的发送使用的是java原生的API发送UDP数据,会不会是java在使用域名建立socket时默认会做缓存呢?

通过一番搜索,找了一篇相关博客[2],关键内容附上:

在通过DNS查找域名的过程中,可能会经过多台中间DNS服务器才能找到指定的域名,因此,在DNS服务器上查找域名是非常昂贵的操作。在Java中为了缓解这个问题,提供了DNS缓存。当InetAddress类第一次使用某个域名创建InetAddress对象后,JVM就会将这个域名和它从DNS上获得的信息(如IP地址)都保存在DNS缓存中。当下一次InetAddress类再使用这个域名时,就直接从DNS缓存里获得所需的信息,而无需再访问DNS服务器。

还真是,继续看怎么解决:

DNS缓存在默认时将永远保留曾经访问过的域名信息,但我们可以修改这个默认值。一般有两种方法可以修改这个默认值:

  1. 在程序中通过java.security.Security.setProperty方法设置安全属性networkaddress.cache.ttl的值(单位:秒)

  2. 设置java.security文件中的networkaddress.cache.negative.ttl属性。假设JDK的安装目录是C:/jdk1.6,那么java.security文件位于c:/jdk1.6/jre/lib/security目录中。打开这个文件,找到networkaddress.cache.ttl属性,并将这个属性值设为相应的缓存超时(单位:秒)

注:如果将networkaddress.cache.ttl属性值设为-1,那么DNS缓存数据将永远不会释放。

至此,问题定位结束。

解决方案

业务侧根据业务场景调整DNS缓存的设置。
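
如果选择在JDK层面调整,可以参考下面的操作示意,查看并修改java.security中的缓存TTL(路径以前文引用的JDK目录结构为例,实际以业务使用的JAVA_HOME为准,修改前建议先备份):

# 查看当前的DNS缓存配置
grep -n "networkaddress.cache" "$JAVA_HOME/jre/lib/security/java.security"

# 将正向解析的缓存TTL改为60秒(该属性默认可能处于注释状态)
sed -i 's/^#\?networkaddress.cache.ttl=.*/networkaddress.cache.ttl=60/' "$JAVA_HOME/jre/lib/security/java.security"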

参考资料

  1. https://blog.csdn.net/michaelwoshi/article/details/101107042
  2. https://blog.csdn.net/turkeyzhou/article/details/5510960

什么是Sealer

引用官方文档的介绍[1]:

  • sealer[ˈsiːlər]是一款分布式应用打包交付运行的解决方案,通过把分布式应用及其数据库中间件等依赖一起打包以解决复杂应用的交付问题。
  • sealer构建出来的产物我们称之为“集群镜像”, 集群镜像里内嵌了一个kubernetes,解决了分布式应用的交付一致性问题。
  • 集群镜像可以push到registry中共享给其他用户使用,也可以在官方仓库中找到非常通用的分布式软件直接使用。
  • Docker可以把一个操作系统的rootfs+应用 build成一个容器镜像,sealer把kubernetes看成操作系统,在这个更高的抽象纬度上做出来的镜像就是集群镜像。 实现整个集群的Build Share Run !!!

快速部署K8S集群

准备一个节点,先下载并安装Sealer:

[root@node1]# wget https://github.com/alibaba/sealer/releases/download/v0.1.5/sealer-v0.1.5-linux-amd64.tar.gz && tar zxvf sealer-v0.1.5-linux-amd64.tar.gz && mv sealer /usr/bin

[root@node1]# sealer version
{"gitVersion":"v0.1.5","gitCommit":"9143e60","buildDate":"2021-06-04 07:41:03","goVersion":"go1.14.15","compiler":"gc","platform":"linux/amd64"}

根据官方文档,如果要在一个已存在的机器上部署kubernetes,直接执行以下命令:

[root@node1]# sealer run kubernetes:v1.19.9 --masters xx.xx.xx.xx --passwd xxxx
2021-06-19 17:22:14 [WARN] [registry_client.go:37] failed to get auth info for registry.cn-qingdao.aliyuncs.com, err: auth for registry.cn-qingdao.aliyuncs.com doesn't exist
2021-06-19 17:22:15 [INFO] [current_cluster.go:39] current cluster not found, will create a new cluster new kube build config failed: stat /root/.kube/config: no such file or directory
2021-06-19 17:22:15 [WARN] [default_image.go:89] failed to get auth info, err: auth for registry.cn-qingdao.aliyuncs.com doesn't exist
Start to Pull Image kubernetes:v1.19.9
191908a896ce: pull completed
2021-06-19 17:22:49 [INFO] [filesystem.go:88] image name is registry.cn-qingdao.aliyuncs.com/sealer-io/kubernetes:v1.19.9.alpha.1
2021-06-19 17:22:49 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : mkdir -p /var/lib/sealer/data/my-cluster || true
copying files to 10.10.11.49: 198/198
2021-06-19 17:25:22 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : cd /var/lib/sealer/data/my-cluster/rootfs && chmod +x scripts/* && cd scripts && sh init.sh
+ storage=/var/lib/docker
+ mkdir -p /var/lib/docker
+ command_exists docker
+ command -v docker
+ systemctl daemon-reload
+ systemctl restart docker.service
++ docker info
++ grep Cg
+ cgroupDriver=' Cgroup Driver: cgroupfs'
+ driver=cgroupfs
+ echo 'driver is cgroupfs'
driver is cgroupfs
+ export criDriver=cgroupfs
+ criDriver=cgroupfs
* Applying /usr/lib/sysctl.d/00-system.conf ...
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
* Applying /usr/lib/sysctl.d/10-default-yama-scope.conf ...
kernel.yama.ptrace_scope = 0
* Applying /usr/lib/sysctl.d/50-default.conf ...
kernel.sysrq = 16
kernel.core_uses_pid = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.default.promote_secondaries = 1
net.ipv4.conf.all.promote_secondaries = 1
fs.protected_hardlinks = 1
fs.protected_symlinks = 1
* Applying /usr/lib/sysctl.d/60-libvirtd.conf ...
fs.aio-max-nr = 1048576
* Applying /etc/sysctl.d/99-sysctl.conf ...
net.ipv4.ip_forward = 1
net.ipv4.conf.all.rp_filter = 1
* Applying /etc/sysctl.d/k8s.conf ...
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.conf.all.rp_filter = 0
* Applying /etc/sysctl.conf ...
net.ipv4.ip_forward = 1
net.ipv4.conf.all.rp_filter = 1
net.ipv4.ip_forward = 1
2021-06-19 17:25:26 [INFO] [runtime.go:107] metadata version v1.19.9
2021-06-19 17:25:26 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : cd /var/lib/sealer/data/my-cluster/rootfs && echo "
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: v1.19.9
controlPlaneEndpoint: "apiserver.cluster.local:6443"
imageRepository: sea.hub:5000/library
networking:
# dnsDomain: cluster.local
podSubnet: 100.64.0.0/10
serviceSubnet: 10.96.0.0/22
apiServer:
certSANs:
- 127.0.0.1
- apiserver.cluster.local
- 10.10.11.49
- aliyun-inc.com
- 10.0.0.2
- 127.0.0.1
- apiserver.cluster.local
- 10.103.97.2
- 10.10.11.49
- 10.103.97.2
extraArgs:
etcd-servers: https://10.10.11.49:2379
feature-gates: TTLAfterFinished=true,EphemeralContainers=true
audit-policy-file: "/etc/kubernetes/audit-policy.yml"
audit-log-path: "/var/log/kubernetes/audit.log"
audit-log-format: json
audit-log-maxbackup: '"10"'
audit-log-maxsize: '"100"'
audit-log-maxage: '"7"'
enable-aggregator-routing: '"true"'
extraVolumes:
- name: "audit"
hostPath: "/etc/kubernetes"
mountPath: "/etc/kubernetes"
pathType: DirectoryOrCreate
- name: "audit-log"
hostPath: "/var/log/kubernetes"
mountPath: "/var/log/kubernetes"
pathType: DirectoryOrCreate
- name: localtime
hostPath: /etc/localtime
mountPath: /etc/localtime
readOnly: true
pathType: File
controllerManager:
extraArgs:
feature-gates: TTLAfterFinished=true,EphemeralContainers=true
experimental-cluster-signing-duration: 876000h
extraVolumes:
- hostPath: /etc/localtime
mountPath: /etc/localtime
name: localtime
readOnly: true
pathType: File
scheduler:
extraArgs:
feature-gates: TTLAfterFinished=true,EphemeralContainers=true
extraVolumes:
- hostPath: /etc/localtime
mountPath: /etc/localtime
name: localtime
readOnly: true
pathType: File
etcd:
local:
extraArgs:
listen-metrics-urls: http://0.0.0.0:2381
" > kubeadm-config.yaml
2021-06-19 17:25:27 [INFO] [kube_certs.go:234] APIserver altNames : {map[aliyun-inc.com:aliyun-inc.com apiserver.cluster.local:apiserver.cluster.local kubernetes:kubernetes kubernetes.default:kubernetes.default kubernetes.default.svc:kubernetes.default.svc kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local localhost:localhost node1:node1] map[10.0.0.2:10.0.0.2 10.103.97.2:10.103.97.2 10.96.0.1:10.96.0.1 127.0.0.1:127.0.0.1 10.10.11.49:10.10.11.49]}
2021-06-19 17:25:27 [INFO] [kube_certs.go:254] Etcd altnames : {map[localhost:localhost node1:node1] map[127.0.0.1:127.0.0.1 10.10.11.49:10.10.11.49 ::1:::1]}, commonName : node1
2021-06-19 17:25:30 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : mkdir -p /etc/kubernetes || true
copying files to 10.10.11.49: 22/22
2021-06-19 17:25:43 [INFO] [kubeconfig.go:267] [kubeconfig] Writing "admin.conf" kubeconfig file
2021-06-19 17:25:43 [INFO] [kubeconfig.go:267] [kubeconfig] Writing "controller-manager.conf" kubeconfig file
2021-06-19 17:25:43 [INFO] [kubeconfig.go:267] [kubeconfig] Writing "scheduler.conf" kubeconfig file
2021-06-19 17:25:43 [INFO] [kubeconfig.go:267] [kubeconfig] Writing "kubelet.conf" kubeconfig file
2021-06-19 17:25:44 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : mkdir -p /etc/kubernetes && cp -f /var/lib/sealer/data/my-cluster/rootfs/statics/audit-policy.yml /etc/kubernetes/audit-policy.yml
2021-06-19 17:25:44 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : cd /var/lib/sealer/data/my-cluster/rootfs/scripts && sh init-registry.sh 5000 /var/lib/sealer/data/my-cluster/rootfs/registry
++ dirname init-registry.sh
+ cd .
+ REGISTRY_PORT=5000
+ VOLUME=/var/lib/sealer/data/my-cluster/rootfs/registry
+ container=sealer-registry
+ mkdir -p /var/lib/sealer/data/my-cluster/rootfs/registry
+ docker load -q -i ../images/registry.tar
Loaded image: registry:2.7.1
+ docker run -d --restart=always --name sealer-registry -p 5000:5000 -v /var/lib/sealer/data/my-cluster/rootfs/registry:/var/lib/registry registry:2.7.1
e35aeefcfb415290764773f28dd843fc53dab8d1210373ca2c0f1f4773391686
2021-06-19 17:25:45 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : mkdir -p /etc/kubernetes || true
copying files to 10.10.11.49: 1/1
2021-06-19 17:25:46 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : mkdir -p /etc/kubernetes || true
copying files to 10.10.11.49: 1/1
2021-06-19 17:25:47 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : mkdir -p /etc/kubernetes || true
copying files to 10.10.11.49: 1/1
2021-06-19 17:25:48 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : mkdir -p /etc/kubernetes || true
copying files to 10.10.11.49: 1/1
2021-06-19 17:25:49 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : echo 10.10.11.49 apiserver.cluster.local >> /etc/hosts
2021-06-19 17:25:50 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : echo 10.10.11.49 sea.hub >> /etc/hosts
2021-06-19 17:25:50 [INFO] [init.go:211] start to init master0...
[ssh][10.10.11.49]failed to run command [kubeadm init --config=/var/lib/sealer/data/my-cluster/rootfs/kubeadm-config.yaml --upload-certs -v 0 --ignore-preflight-errors=SystemVerification],output is: W0619 17:25:50.649054 122163 common.go:77] your configuration file uses a deprecated API spec: "kubeadm.k8s.io/v1beta1". Please use 'kubeadm config migrate --old-config old.yaml --new-config new.yaml', which will write the new, similar spec using a newer API version.

W0619 17:25:50.702549 122163 configset.go:348] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[init] Using Kubernetes version: v1.19.9
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
[WARNING FileExisting-socat]: socat not found in system path
[WARNING Hostname]: hostname "node1" could not be reached
[WARNING Hostname]: hostname "node1": lookup node1 on 10.72.66.37:53: no such host
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
error execution phase preflight: [preflight] Some fatal errors occurred:

[ERROR ImagePull]: failed to pull image sea.hub:5000/library/kube-apiserver:v1.19.9: output: Error response from daemon: Get https://sea.hub:5000/v2/: http: server gave HTTP response to HTTPS client, error: exit status 1
[ERROR ImagePull]: failed to pull image sea.hub:5000/library/kube-controller-manager:v1.19.9: output: Error response from daemon: Get https://sea.hub:5000/v2/: http: server gave HTTP response to HTTPS client, error: exit status 1
[ERROR ImagePull]: failed to pull image sea.hub:5000/library/kube-scheduler:v1.19.9: output: Error response from daemon: Get https://sea.hub:5000/v2/: http: server gave HTTP response to HTTPS client, error: exit status 1
[ERROR ImagePull]: failed to pull image sea.hub:5000/library/kube-proxy:v1.19.9: output: Error response from daemon: Get https://sea.hub:5000/v2/: http: server gave HTTP response to HTTPS client, error: exit status 1
[ERROR ImagePull]: failed to pull image sea.hub:5000/library/pause:3.2: output: Error response from daemon: Get https://sea.hub:5000/v2/: http: server gave HTTP response to HTTPS client, error: exit status 1
[ERROR ImagePull]: failed to pull image sea.hub:5000/library/etcd:3.4.13-0: output: Error response from daemon: Get https://sea.hub:5000/v2/: http: server gave HTTP response to HTTPS client, error: exit status 1
[ERROR ImagePull]: failed to pull image sea.hub:5000/library/coredns:1.7.0: output: Error response from daemon: Get https://sea.hub:5000/v2/: http: server gave HTTP response to HTTPS client, error: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
2021-06-19 17:25:52 [EROR] [run.go:55] init master0 failed, error: [ssh][10.10.11.49]run command failed [kubeadm init --config=/var/lib/sealer/data/my-cluster/rootfs/kubeadm-config.yaml --upload-certs -v 0 --ignore-preflight-errors=SystemVerification]. Please clean and reinstall

部署报错,从错误日志看,是尝试访问Sealer自己搭建的私有registry异常。从报错信息server gave HTTP response to HTTPS client可以知道,应该是docker中没有配置insecure-registries字段导致的。查看docker的配置文件确认一下:

[root@node1]# cat /etc/docker/daemon.json 
{
"max-concurrent-downloads": 10,
"log-driver": "json-file",
"log-level": "warn",
"insecure-registries":["127.0.0.1"],
"data-root":"/var/lib/docker"
}

可以看出,insecure-registries字段配置得不对。考虑到该节点在部署之前已经安装过docker,所以不确定这个配置是之前就存在的,还是Sealer配置错了,那就自己修改一下吧:

[root@node1]# cat /etc/docker/daemon.json 
{
"max-concurrent-downloads": 10,
"log-driver": "json-file",
"log-level": "warn",
"insecure-registries":["sea.hub:5000"],
"data-root":"/var/lib/docker"
}
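
修改daemon.json后需要让docker重新加载配置才会生效,下面是一个常见做法的示意(假设docker由systemd管理):

# 重启docker并确认insecure registries中已包含sea.hub:5000
systemctl restart docker
docker info 2>/dev/null | grep -A 3 "Insecure Registries"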

再次执行部署命令:

sealer run kubernetes:v1.19.9 --masters xx.xx.xx.xx --passwd xxxx
...
2021-06-19 17:43:56 [INFO] [kubeconfig.go:277] [kubeconfig] Using existing kubeconfig file: "/var/lib/sealer/data/my-cluster/admin.conf"
2021-06-19 17:43:57 [INFO] [kubeconfig.go:277] [kubeconfig] Using existing kubeconfig file: "/var/lib/sealer/data/my-cluster/controller-manager.conf"
2021-06-19 17:43:57 [INFO] [kubeconfig.go:277] [kubeconfig] Using existing kubeconfig file: "/var/lib/sealer/data/my-cluster/scheduler.conf"
2021-06-19 17:43:57 [INFO] [kubeconfig.go:277] [kubeconfig] Using existing kubeconfig file: "/var/lib/sealer/data/my-cluster/kubelet.conf"
2021-06-19 17:43:57 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : mkdir -p /etc/kubernetes && cp -f /var/lib/sealer/data/my-cluster/rootfs/statics/audit-policy.yml /etc/kubernetes/audit-policy.yml
2021-06-19 17:43:57 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : cd /var/lib/sealer/data/my-cluster/rootfs/scripts && sh init-registry.sh 5000 /var/lib/sealer/data/my-cluster/rootfs/registry
++ dirname init-registry.sh
+ cd .
+ REGISTRY_PORT=5000
+ VOLUME=/var/lib/sealer/data/my-cluster/rootfs/registry
+ container=sealer-registry
+ mkdir -p /var/lib/sealer/data/my-cluster/rootfs/registry
+ docker load -q -i ../images/registry.tar
Loaded image: registry:2.7.1
+ docker run -d --restart=always --name sealer-registry -p 5000:5000 -v /var/lib/sealer/data/my-cluster/rootfs/registry:/var/lib/registry registry:2.7.1
docker: Error response from daemon: Conflict. The container name "/sealer-registry" is already in use by container "e35aeefcfb415290764773f28dd843fc53dab8d1210373ca2c0f1f4773391686". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.
+ true
2021-06-19 17:43:58 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : mkdir -p /etc/kubernetes || true
copying files to 10.10.11.49: 1/1
2021-06-19 17:43:59 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : mkdir -p /etc/kubernetes || true
copying files to 10.10.11.49: 1/1
2021-06-19 17:44:00 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : mkdir -p /etc/kubernetes || true
copying files to 10.10.11.49: 1/1
2021-06-19 17:44:01 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : mkdir -p /etc/kubernetes || true
copying files to 10.10.11.49: 1/1
2021-06-19 17:44:02 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : echo 10.10.11.49 apiserver.cluster.local >> /etc/hosts
2021-06-19 17:44:02 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : echo 10.10.11.49 sea.hub >> /etc/hosts
2021-06-19 17:44:03 [INFO] [init.go:211] start to init master0...
2021-06-19 17:46:53 [INFO] [init.go:286] [globals]join command is: apiserver.cluster.local:6443 --token comygj.c0kj18d7fh2h4xta \
--discovery-token-ca-cert-hash sha256:cd8988f9a061765914dddb24d4e578ad446d8d31b0e30dba96a89e0c4f1e7240 \
--control-plane --certificate-key b27f10340d2f89790f7e980af72cf9d54d790b53bfd4da823947d914359d6e81

2021-06-19 17:46:53 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : rm -rf .kube/config && mkdir -p /root/.kube && cp /etc/kubernetes/admin.conf /root/.kube/config
2021-06-19 17:46:53 [INFO] [init.go:230] start to install CNI
2021-06-19 17:46:53 [INFO] [init.go:250] render cni yaml success
2021-06-19 17:46:54 [INFO] [sshcmd.go:48] [ssh][10.10.11.49] : echo '
---
# Source: calico/templates/calico-config.yaml
# This ConfigMap is used to configure a self-hosted Calico installation.
kind: ConfigMap
apiVersion: v1
metadata:
name: calico-config
namespace: kube-system
data:
# Typha is disabled.
typha_service_name: "none"
# Configure the backend to use.
calico_backend: "bird"

# Configure the MTU to use
veth_mtu: "1550"

# The CNI network configuration to install on each node. The special
# values in this config will be automatically populated.
cni_network_config: |-
{
"name": "k8s-pod-network",
"cniVersion": "0.3.1",
"plugins": [
{
"type": "calico",
"log_level": "info",
"datastore_type": "kubernetes",
"nodename": "__KUBERNETES_NODE_NAME__",
"mtu": __CNI_MTU__,
"ipam": {
"type": "calico-ipam"
},
"policy": {
"type": "k8s"
},
"kubernetes": {
"kubeconfig": "__KUBECONFIG_FILEPATH__"
}
},
{
"type": "portmap",
"snat": true,
"capabilities": {"portMappings": true}
}
]
}
---
# Source: calico/templates/kdd-crds.yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: felixconfigurations.crd.projectcalico.org
spec:
scope: Cluster
group: crd.projectcalico.org
version: v1
names:
kind: FelixConfiguration
plural: felixconfigurations
singular: felixconfiguration
---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: ipamblocks.crd.projectcalico.org
spec:
scope: Cluster
group: crd.projectcalico.org
version: v1
names:
kind: IPAMBlock
plural: ipamblocks
singular: ipamblock

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: blockaffinities.crd.projectcalico.org
spec:
scope: Cluster
group: crd.projectcalico.org
version: v1
names:
kind: BlockAffinity
plural: blockaffinities
singular: blockaffinity

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: ipamhandles.crd.projectcalico.org
spec:
scope: Cluster
group: crd.projectcalico.org
version: v1
names:
kind: IPAMHandle
plural: ipamhandles
singular: ipamhandle

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: ipamconfigs.crd.projectcalico.org
spec:
scope: Cluster
group: crd.projectcalico.org
version: v1
names:
kind: IPAMConfig
plural: ipamconfigs
singular: ipamconfig

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: bgppeers.crd.projectcalico.org
spec:
scope: Cluster
group: crd.projectcalico.org
version: v1
names:
kind: BGPPeer
plural: bgppeers
singular: bgppeer

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: bgpconfigurations.crd.projectcalico.org
spec:
scope: Cluster
group: crd.projectcalico.org
version: v1
names:
kind: BGPConfiguration
plural: bgpconfigurations
singular: bgpconfiguration

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: ippools.crd.projectcalico.org
spec:
scope: Cluster
group: crd.projectcalico.org
version: v1
names:
kind: IPPool
plural: ippools
singular: ippool

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: hostendpoints.crd.projectcalico.org
spec:
scope: Cluster
group: crd.projectcalico.org
version: v1
names:
kind: HostEndpoint
plural: hostendpoints
singular: hostendpoint

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: clusterinformations.crd.projectcalico.org
spec:
scope: Cluster
group: crd.projectcalico.org
version: v1
names:
kind: ClusterInformation
plural: clusterinformations
singular: clusterinformation

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: globalnetworkpolicies.crd.projectcalico.org
spec:
scope: Cluster
group: crd.projectcalico.org
version: v1
names:
kind: GlobalNetworkPolicy
plural: globalnetworkpolicies
singular: globalnetworkpolicy

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: globalnetworksets.crd.projectcalico.org
spec:
scope: Cluster
group: crd.projectcalico.org
version: v1
names:
kind: GlobalNetworkSet
plural: globalnetworksets
singular: globalnetworkset

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: networkpolicies.crd.projectcalico.org
spec:
scope: Namespaced
group: crd.projectcalico.org
version: v1
names:
kind: NetworkPolicy
plural: networkpolicies
singular: networkpolicy

---

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: networksets.crd.projectcalico.org
spec:
scope: Namespaced
group: crd.projectcalico.org
version: v1
names:
kind: NetworkSet
plural: networksets
singular: networkset
---
# Source: calico/templates/rbac.yaml

# Include a clusterrole for the kube-controllers component,
# and bind it to the calico-kube-controllers serviceaccount.
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: calico-kube-controllers
rules:
# Nodes are watched to monitor for deletions.
- apiGroups: [""]
resources:
- nodes
verbs:
- watch
- list
- get
# Pods are queried to check for existence.
- apiGroups: [""]
resources:
- pods
verbs:
- get
# IPAM resources are manipulated when nodes are deleted.
- apiGroups: ["crd.projectcalico.org"]
resources:
- ippools
verbs:
- list
- apiGroups: ["crd.projectcalico.org"]
resources:
- blockaffinities
- ipamblocks
- ipamhandles
verbs:
- get
- list
- create
- update
- delete
# Needs access to update clusterinformations.
- apiGroups: ["crd.projectcalico.org"]
resources:
- clusterinformations
verbs:
- get
- create
- update
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: calico-kube-controllers
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: calico-kube-controllers
subjects:
- kind: ServiceAccount
name: calico-kube-controllers
namespace: kube-system
---
# Include a clusterrole for the calico-node DaemonSet,
# and bind it to the calico-node serviceaccount.
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: calico-node
rules:
# The CNI plugin needs to get pods, nodes, and namespaces.
- apiGroups: [""]
resources:
- pods
- nodes
- namespaces
verbs:
- get
- apiGroups: [""]
resources:
- endpoints
- services
verbs:
# Used to discover service IPs for advertisement.
- watch
- list
# Used to discover Typhas.
- get
- apiGroups: [""]
resources:
- nodes/status
verbs:
# Needed for clearing NodeNetworkUnavailable flag.
- patch
# Calico stores some configuration information in node annotations.
- update
# Watch for changes to Kubernetes NetworkPolicies.
- apiGroups: ["networking.k8s.io"]
resources:
- networkpolicies
verbs:
- watch
- list
# Used by Calico for policy information.
- apiGroups: [""]
resources:
- pods
- namespaces
- serviceaccounts
verbs:
- list
- watch
# The CNI plugin patches pods/status.
- apiGroups: [""]
resources:
- pods/status
verbs:
- patch
# Calico monitors various CRDs for config.
- apiGroups: ["crd.projectcalico.org"]
resources:
- globalfelixconfigs
- felixconfigurations
- bgppeers
- globalbgpconfigs
- bgpconfigurations
- ippools
- ipamblocks
- globalnetworkpolicies
- globalnetworksets
- networkpolicies
- networksets
- clusterinformations
- hostendpoints
verbs:
- get
- list
- watch
# Calico must create and update some CRDs on startup.
- apiGroups: ["crd.projectcalico.org"]
resources:
- ippools
- felixconfigurations
- clusterinformations
verbs:
- create
- update
# Calico stores some configuration information on the node.
- apiGroups: [""]
resources:
- nodes
verbs:
- get
- list
- watch
# These permissions are only required for upgrade from v2.6, and can
# be removed after upgrade or on fresh installations.
- apiGroups: ["crd.projectcalico.org"]
resources:
- bgpconfigurations
- bgppeers
verbs:
- create
- update
# These permissions are required for Calico CNI to perform IPAM allocations.
- apiGroups: ["crd.projectcalico.org"]
resources:
- blockaffinities
- ipamblocks
- ipamhandles
verbs:
- get
- list
- create
- update
- delete
- apiGroups: ["crd.projectcalico.org"]
resources:
- ipamconfigs
verbs:
- get
# Block affinities must also be watchable by confd for route aggregation.
- apiGroups: ["crd.projectcalico.org"]
resources:
- blockaffinities
verbs:
- watch
# The Calico IPAM migration needs to get daemonsets. These permissions can be
# removed if not upgrading from an installation using host-local IPAM.
- apiGroups: ["apps"]
resources:
- daemonsets
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: calico-node
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: calico-node
subjects:
- kind: ServiceAccount
name: calico-node
namespace: kube-system

---
# Source: calico/templates/calico-node.yaml
# This manifest installs the calico-node container, as well
# as the CNI plugins and network config on
# each master and worker node in a Kubernetes cluster.
kind: DaemonSet
apiVersion: apps/v1
metadata:
name: calico-node
namespace: kube-system
labels:
k8s-app: calico-node
spec:
selector:
matchLabels:
k8s-app: calico-node
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
template:
metadata:
labels:
k8s-app: calico-node
annotations:
# This, along with the CriticalAddonsOnly toleration below,
# marks the pod as a critical add-on, ensuring it gets
# priority scheduling and that its resources are reserved
# if it ever gets evicted.
spec:
nodeSelector:
beta.kubernetes.io/os: linux
hostNetwork: true
tolerations:
# Make sure calico-node gets scheduled on all nodes.
- effect: NoSchedule
operator: Exists
# Mark the pod as a critical add-on for rescheduling.
- key: CriticalAddonsOnly
operator: Exists
- effect: NoExecute
operator: Exists
serviceAccountName: calico-node
# Minimize downtime during a rolling upgrade or deletion; tell Kubernetes to do a "force
# deletion": https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods.
terminationGracePeriodSeconds: 0
priorityClassName: system-node-critical
initContainers:
# This container performs upgrade from host-local IPAM to calico-ipam.
# It can be deleted if this is a fresh installation, or if you have already
# upgraded to use calico-ipam.
- name: upgrade-ipam
image: sea.hub:5000/calico/cni:v3.8.2
command: ["/opt/cni/bin/calico-ipam", "-upgrade"]
env:
- name: KUBERNETES_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: CALICO_NETWORKING_BACKEND
valueFrom:
configMapKeyRef:
name: calico-config
key: calico_backend
volumeMounts:
- mountPath: /var/lib/cni/networks
name: host-local-net-dir
- mountPath: /host/opt/cni/bin
name: cni-bin-dir
# This container installs the CNI binaries
# and CNI network config file on each node.
- name: install-cni
image: sea.hub:5000/calico/cni:v3.8.2
command: ["/install-cni.sh"]
env:
# Name of the CNI config file to create.
- name: CNI_CONF_NAME
value: "10-calico.conflist"
# The CNI network config to install on each node.
- name: CNI_NETWORK_CONFIG
valueFrom:
configMapKeyRef:
name: calico-config
key: cni_network_config
# Set the hostname based on the k8s node name.
- name: KUBERNETES_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
# CNI MTU Config variable
- name: CNI_MTU
valueFrom:
configMapKeyRef:
name: calico-config
key: veth_mtu
# Prevents the container from sleeping forever.
- name: SLEEP
value: "false"
volumeMounts:
- mountPath: /host/opt/cni/bin
name: cni-bin-dir
- mountPath: /host/etc/cni/net.d
name: cni-net-dir
# Adds a Flex Volume Driver that creates a per-pod Unix Domain Socket to allow Dikastes
# to communicate with Felix over the Policy Sync API.
- name: flexvol-driver
image: sea.hub:5000/calico/pod2daemon-flexvol:v3.8.2
volumeMounts:
- name: flexvol-driver-host
mountPath: /host/driver
containers:
# Runs calico-node container on each Kubernetes node. This
# container programs network policy and routes on each
# host.
- name: calico-node
image: sea.hub:5000/calico/node:v3.8.2
env:
# Use Kubernetes API as the backing datastore.
- name: DATASTORE_TYPE
value: "kubernetes"
# Wait for the datastore.
- name: WAIT_FOR_DATASTORE
value: "true"
# Set based on the k8s node name.
- name: NODENAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
# Choose the backend to use.
- name: CALICO_NETWORKING_BACKEND
valueFrom:
configMapKeyRef:
name: calico-config
key: calico_backend
# Cluster type to identify the deployment type
- name: CLUSTER_TYPE
value: "k8s,bgp"
# Auto-detect the BGP IP address.
- name: IP
value: "autodetect"
- name: IP_AUTODETECTION_METHOD
value: "interface=eth0"
# Enable IPIP
- name: CALICO_IPV4POOL_IPIP
value: "Off"
# Set MTU for tunnel device used if ipip is enabled
- name: FELIX_IPINIPMTU
valueFrom:
configMapKeyRef:
name: calico-config
key: veth_mtu
# The default IPv4 pool to create on startup if none exists. Pod IPs will be
# chosen from this range. Changing this value after installation will have
- name: CALICO_IPV4POOL_CIDR
value: "100.64.0.0/10"
- name: CALICO_DISABLE_FILE_LOGGING
value: "true"
# Set Felix endpoint to host default action to ACCEPT.
- name: FELIX_DEFAULTENDPOINTTOHOSTACTION
value: "ACCEPT"
# Disable IPv6 on Kubernetes.
- name: FELIX_IPV6SUPPORT
value: "false"
# Set Felix logging to "info"
- name: FELIX_LOGSEVERITYSCREEN
value: "info"
- name: FELIX_HEALTHENABLED
value: "true"
securityContext:
privileged: true
resources:
requests:
cpu: 250m
livenessProbe:
httpGet:
path: /liveness
port: 9099
host: localhost
periodSeconds: 10
initialDelaySeconds: 10
failureThreshold: 6
readinessProbe:
exec:
command:
- /bin/calico-node
- -bird-ready
- -felix-ready
periodSeconds: 10
volumeMounts:
- mountPath: /lib/modules
name: lib-modules
readOnly: true
- mountPath: /run/xtables.lock
name: xtables-lock
readOnly: false
- mountPath: /var/run/calico
name: var-run-calico
readOnly: false
- mountPath: /var/lib/calico
name: var-lib-calico
readOnly: false
- name: policysync
mountPath: /var/run/nodeagent
volumes:
# Used by calico-node.
- name: lib-modules
hostPath:
path: /lib/modules
- name: var-run-calico
hostPath:
path: /var/run/calico
- name: var-lib-calico
hostPath:
path: /var/lib/calico
- name: xtables-lock
hostPath:
path: /run/xtables.lock
type: FileOrCreate
# Used to install CNI.
- name: cni-bin-dir
hostPath:
path: /opt/cni/bin
- name: cni-net-dir
hostPath:
path: /etc/cni/net.d
# Mount in the directory for host-local IPAM allocations. This is
# used when upgrading from host-local to calico-ipam, and can be removed
# if not using the upgrade-ipam init container.
- name: host-local-net-dir
hostPath:
path: /var/lib/cni/networks
# Used to create per-pod Unix Domain Sockets
- name: policysync
hostPath:
type: DirectoryOrCreate
path: /var/run/nodeagent
# Used to install Flex Volume Driver
- name: flexvol-driver-host
hostPath:
type: DirectoryOrCreate
path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
---

apiVersion: v1
kind: ServiceAccount
metadata:
name: calico-node
namespace: kube-system

---
# Source: calico/templates/calico-kube-controllers.yaml

# See https://github.com/projectcalico/kube-controllers
apiVersion: apps/v1
kind: Deployment
metadata:
name: calico-kube-controllers
namespace: kube-system
labels:
k8s-app: calico-kube-controllers
spec:
# The controllers can only have a single active instance.
replicas: 1
selector:
matchLabels:
k8s-app: calico-kube-controllers
strategy:
type: Recreate
template:
metadata:
name: calico-kube-controllers
namespace: kube-system
labels:
k8s-app: calico-kube-controllers
annotations:
spec:
nodeSelector:
beta.kubernetes.io/os: linux
tolerations:
# Mark the pod as a critical add-on for rescheduling.
- key: CriticalAddonsOnly
operator: Exists
- key: node-role.kubernetes.io/master
effect: NoSchedule
serviceAccountName: calico-kube-controllers
priorityClassName: system-cluster-critical
containers:
- name: calico-kube-controllers
image: sea.hub:5000/calico/kube-controllers:v3.8.2
env:
# Choose which controllers to run.
- name: ENABLED_CONTROLLERS
value: node
- name: DATASTORE_TYPE
value: kubernetes
readinessProbe:
exec:
command:
- /usr/bin/check-status
- -r

---

apiVersion: v1
kind: ServiceAccount
metadata:
name: calico-kube-controllers
namespace: kube-system
' | kubectl apply -f -
configmap/calico-config created
Warning: apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamblocks.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/blockaffinities.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamhandles.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamconfigs.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgppeers.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networksets.crd.projectcalico.org created
clusterrole.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrole.rbac.authorization.k8s.io/calico-node created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/calico-node created
serviceaccount/calico-node created
deployment.apps/calico-kube-controllers created
serviceaccount/calico-kube-controllers created

At this point the Kubernetes cluster deployment is complete; check the cluster status:

[root@node1]# kubectl get node -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node1 Ready master 2m50s v1.19.9 10.10.11.49 <none> CentOS Linux 7 (Core) 3.10.0-862.11.6.el7.x86_64 docker://19.3.0

[root@node1]# kubectl get pod -A -owide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system calico-kube-controllers-5565b777b6-w9mhw 1/1 Running 0 2m32s 100.76.153.65 node1
kube-system calico-node-mwkg2 1/1 Running 0 2m32s 10.10.11.49 node1
kube-system coredns-597c5579bc-dpqbx 1/1 Running 0 2m32s 100.76.153.64 node1
kube-system coredns-597c5579bc-fjnmq 1/1 Running 0 2m32s 100.76.153.66 node1
kube-system etcd-node1 1/1 Running 0 2m51s 10.10.11.49 node1
kube-system kube-apiserver-node1 1/1 Running 0 2m51s 10.10.11.49 node1
kube-system kube-controller-manager-node1 1/1 Running 0 2m51s 10.10.11.49 node1
kube-system kube-proxy-qgt9w 1/1 Running 0 2m32s 10.10.11.49 node1
kube-system kube-scheduler-node1 1/1 Running 0 2m51s 10.10.11.49 node1

References

  1. https://github.com/alibaba/sealer/blob/main/docs/README_zh.md

Problem Background

In a K8S cluster, a business application was pushing down a large volume of configuration (a job lasting several hours or even longer), and during this period the calico Pod was observed restarting repeatedly.

[root@node02 ~]# kubectl get pod -n kube-system -owide|grep node01
calico-kube-controllers-6f59b8cdd8-8v2qw 1/1 Running 0 4h45m 10.10.119.238 node01 <none> <none>
calico-node-b8w2b 1/1 CrashLoopBackOff 43 3d19h 10.10.119.238 node01 <none> <none>
coredns-795cc9c45c-k7qpb 1/1 Running 0 4h45m 177.177.237.42 node01 <none> <none>
...

Analysis

Seeing the Pod in CrashLoopBackOff, the most likely cause is the service inside the Pod itself, so first take a look with kubectl describe:

[root@node02 ~]# kubectl describe pod -n kube-system calico-node-b8w2b
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 58m (x111 over 3h12m) kubelet, node01 (combined from similar events): Liveness probe failed: Get http://localhost:9099/liveness: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Normal Pulled 43m (x36 over 3d19h) kubelet, node01 Container image "calico/node:v3.15.1" already present on machine
Warning Unhealthy 8m16s (x499 over 3h43m) kubelet, node01 Liveness probe failed: Get http://localhost:9099/liveness: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning BackOff 3m31s (x437 over 3h3m) kubelet, node01 Back-off restarting failed container

The events show that the restarts are caused by calico failing its liveness probe, and the error is explicit: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers), which means the connection timed out before it could be established[1]. Running the health check manually on the console confirms that it really is slow (on a healthy node it returns in milliseconds):

[root@node01 ~]# time curl -i http://localhost:9099/liveness
HTTP/1.1 204 No Content
Date: Tue, 15 Jun 2021 06:24:35 GMT
real    0m1.012s
user    0m0.003s
sys     0m0.005s
[root@node01 ~]# time curl -i http://localhost:9099/liveness
HTTP/1.1 204 No Content
Date: Tue, 15 Jun 2021 06:24:39 GMT
real    0m3.014s
user    0m0.002s
sys     0m0.005s
[root@node01 ~]# time curl -i http://localhost:9099/liveness
real    1m52.510s
user    0m0.002s
sys     0m0.013s
[root@node01 ~]# time curl -i http://localhost:9099/liveness
^C
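
As a side note on where this error text comes from: the kubelet probe is a Go net/http client with a hard timeout, and the "(Client.Timeout exceeded while awaiting headers)" suffix is what that client reports when the connection or the response headers take longer than its Timeout. The minimal sketch below reproduces the same class of failure; it simply reuses the liveness URL for illustration and is not kubelet's actual probe code:

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// A client with a hard timeout, similar in spirit to an HTTP liveness probe.
	client := &http.Client{Timeout: 1 * time.Second}

	resp, err := client.Get("http://localhost:9099/liveness")
	if err != nil {
		// Against a listener that is slow to accept or respond, the error text
		// ends with "(Client.Timeout exceeded while awaiting headers)".
		fmt.Println("probe failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("probe ok:", resp.Status)
}

Run on a node in the faulty state it fails the same way the probe does; on a healthy node it prints the 204 status.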

Start with the calico-related logs: the bird, confd, and felix logs were checked in turn and showed no obvious errors. Next, check whether the port is listening normally:

[root@node02 ~]# netstat -anp|grep 9099
tcp 0 0 127.0.0.1:9099 0.0.0.0:* LISTEN 1202/calico-node
tcp 0 0 127.0.0.1:9099 127.0.0.1:56728 TIME_WAIT -
tcp 0 0 127.0.0.1:56546 127.0.0.1:9099 TIME_WAIT -

Considering that the error is a connection-establishment timeout and the business volume is high, first look at the distribution of TCP connection states:

[root@node01 ~]# netstat -na | awk '/^tcp/{s[$6]++}END{for(key in s) print key,s[key]}'
LISTEN 49
ESTABLISHED 284
SYN_SENT 4
TIME_WAIT 176

The connection states show nothing unusual, so check the CPU load with top. Sure enough, the business Java process is at 700% CPU, and over a longer observation it peaks above 2000%. The application developers said they were running a stress test and that production might well see the same level of concurrency. Fine, then keep checking whether the CPU is actually under heavy load in this state:

[root@node01 ~]# top
top - 14:28:57 up 13 days, 27 min, 2 users, load average: 9.55, 9.93, 9.91
Tasks: 1149 total, 1 running, 1146 sleeping, 0 stopped, 2 zombie
%Cpu(s): 16.0 us, 2.9 sy, 0.0 ni, 80.9 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 15249982+total, 21419184 free, 55542588 used, 75538048 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 94226176 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6754 root 20 0 66.8g 25.1g 290100 S 700.0 17.3 2971:49 java
25214 root 20 0 6309076 179992 37016 S 36.8 0.1 439:06.29 kubelet
20331 root 20 0 3196660 172364 24908 S 21.1 0.1 349:56.64 dockerd

Check the total number of CPUs; combined with the load average and CPU utilization above, the load does not look outrageously high either:

[root@node01 ~]# cat /proc/cpuinfo| grep "physical id"| sort| uniq| wc -l
48
[root@node01 ~]# cat /proc/cpuinfo| grep "cpu cores"| uniq
cpu cores: 1
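
For a quick "is the load high relative to the core count" sanity check, the same arithmetic can be scripted; a small sketch comparing /proc/loadavg with the logical CPU count (nothing here is specific to this environment):

package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

func main() {
	// /proc/loadavg begins with the 1-, 5- and 15-minute load averages.
	data, err := os.ReadFile("/proc/loadavg")
	if err != nil {
		fmt.Println("read /proc/loadavg:", err)
		return
	}
	load1, err := strconv.ParseFloat(strings.Fields(string(data))[0], 64)
	if err != nil {
		fmt.Println("parse load average:", err)
		return
	}

	cpus := runtime.NumCPU() // logical CPUs visible to the process
	fmt.Printf("load1=%.2f cpus=%d load-per-cpu=%.2f\n", load1, cpus, load1/float64(cpus))
}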

This is odd. Intuitively the problem is most likely caused by the high concurrency, but since nothing stands out here, go back to the connection-establishment timeout itself. A connect timeout brings to mind the stages of TCP connection establishment (see the diagram below): in which stage does the timeout occur?

(Figure: TCP state transition diagram)

A search for related material[2] turns up the following, quoted here:

During the TCP three-way handshake, a timeout can occur in the following two situations:

  1. The client sends a SYN, enters the SYN_SENT state, and waits for the server's SYN+ACK.
  2. The server receives the SYN, replies with SYN+ACK, enters the SYN_RECV state, and waits for the client's ACK.

So in which stage does our problem occur? The verification below shows that the requests are stuck in the SYN_SENT stage, and it is not only calico's health check that hangs: components such as kubelet and kube-controller-manager hang in the same way:

[root@node01 ~]# curl http://localhost:9099/liveness
^C
[root@node01 ~]# netstat -anp|grep 9099
tcp 0 0 127.0.0.1:44360 127.0.0.1:9099 TIME_WAIT -
tcp 0 1 127.0.0.1:47496 127.0.0.1:9099 SYN_SENT 16242/curl

[root@node01 ~]# netstat -anp|grep SYN_SENT
tcp 0 1 127.0.0.1:47496 127.0.0.1:9099 SYN_SENT 16242/curl
tcp 0 1 127.0.0.1:39142 127.0.0.1:37807 SYN_SENT 25214/kubelet
tcp 0 1 127.0.0.1:38808 127.0.0.1:10251 SYN_SENT 25214/kubelet
tcp 0 1 127.0.0.1:53726 127.0.0.1:10252 SYN_SENT 25214/kubelet
...
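
The same behaviour can also be observed from code instead of curl. The sketch below bounds the connect with a deadline (reusing 127.0.0.1:9099 purely as an example endpoint); while it waits, netstat on the node shows the local socket parked in SYN_SENT, exactly like the entries above:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// While connect() is waiting for the SYN+ACK, the local socket sits in
	// SYN_SENT; the kernel keeps retransmitting the SYN until it gives up or
	// the caller's deadline fires, which is what the hung curl was showing.
	start := time.Now()
	conn, err := net.DialTimeout("tcp", "127.0.0.1:9099", 3*time.Second)
	if err != nil {
		fmt.Printf("dial failed after %v: %v\n", time.Since(start), err)
		return
	}
	defer conn.Close()
	fmt.Println("connected in", time.Since(start))
}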

So far, two conclusions can be drawn:

  1. calico's health check fails because the TCP request gets stuck in the SYN_SENT stage;
  2. the problem is not limited to a particular Pod; it is a system-level issue affecting every component on the node.

Combining these two conclusions, the suspicion falls on the TCP-related kernel parameters, in particular those connected with the SYN_SENT state[3]:

net.ipv4.tcp_max_syn_backlog: defaults to 1024; the length of the SYN (half-open connection) queue.
net.core.somaxconn: defaults to 128; the upper bound on a listening socket's accept (fully established connection) queue. Under high concurrency the default can cause connection timeouts or retransmissions, so it should be sized according to the expected number of concurrent requests.
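
To read the live values of these parameters programmatically (equivalent to sysctl -n), a small sketch that pulls them straight from /proc/sys; the name-to-path mapping is the standard procfs layout:

package main

import (
	"fmt"
	"os"
	"strings"
)

// readSysctl reads a kernel parameter from /proc/sys, e.g.
// "net.core.somaxconn" maps to /proc/sys/net/core/somaxconn.
func readSysctl(name string) (string, error) {
	path := "/proc/sys/" + strings.ReplaceAll(name, ".", "/")
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

func main() {
	for _, p := range []string{
		"net.ipv4.tcp_max_syn_backlog", // SYN (half-open) queue length
		"net.core.somaxconn",           // accept (established) queue upper bound
	} {
		v, err := readSysctl(p)
		if err != nil {
			fmt.Println(p, ":", err)
			continue
		}
		fmt.Println(p, "=", v)
	}
}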

Checking the configuration on the system, the values were essentially all defaults, so raise the two parameters above and make them take effect:

[root@node01 ~]# cat /etc/sysctl.conf 
...
net.ipv4.tcp_max_syn_backlog = 32768
net.core.somaxconn = 32768

[root@node01 ~]# sysctl -p
...
net.ipv4.tcp_max_syn_backlog = 32768
net.core.somaxconn = 32768

Run calico's health check again: the request no longer hangs, the problem is gone, and the previously abnormal Pod returns to normal:

[root@node01 ~]# time curl -i http://localhost:9099/liveness
HTTP/1.1 204 No Content
Date: Tue, 15 Jun 2021 14:48:38 GMT
real 0m0.011s
user 0m0.004s
sys 0m0.004s
[root@node01 ~]# time curl -i http://localhost:9099/liveness
HTTP/1.1 204 No Content
Date: Tue, 15 Jun 2021 14:48:39 GMT
real 0m0.010s
user 0m0.001s
sys 0m0.005s
[root@node01 ~]# time curl -i http://localhost:9099/liveness
HTTP/1.1 204 No Content
Date: Tue, 15 Jun 2021 14:48:40 GMT
real 0m0.011s
user 0m0.002s

Truth be told, the final fix was reached half by guessing and half by verifying. Reasoning forward, once the TCP requests were found to be stuck in the SYN_SENT stage, the proper next step would have been to confirm whether the related kernel parameters really were too small.

Solution

In high-concurrency scenarios, tune the server's kernel parameters accordingly.

References

  1. https://romatic.net/post/go_net_errors/
  2. http://blog.qiusuo.im/blog/2014/03/19/tcp-timeout/
  3. http://www.51testing.com/html/13/235813-3710663.html