
K8S Troubleshooting: BC-Linux for Euler OS Adaptation Issue

Background

After deploying a K8S cluster on the BC-Linux for Euler operating system and installing the upper-layer business components on top of it, Pods suddenly stopped starting, beginning with one particular component.

Root Cause Analysis

The pod's describe events show an obvious too many open files error:

Normal   Scheduled                41m                  default-scheduler  Successfully assigned xx/xxx-v3falue7-6f59dd5766-npd2x to node1
Warning  FailedCreatePodSandBox   26m (x301 over 41m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "xxx-v3falue7-6f59dd5766-npd2x": Error response from daemon: start failed: : pipe2: too many open files: unknown
Normal   SandboxChanged           66s (x808 over 41m)  kubelet            Pod sandbox changed, it will be killed and re-created.
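
For reference, these events can be pulled with kubectl; the namespace (xx) and pod name are taken from the log above:

[root@node1 ~]# kubectl -n xx describe pod xxx-v3falue7-6f59dd5766-npd2x
[root@node1 ~]# kubectl -n xx get events --field-selector involvedObject.name=xxx-v3falue7-6f59dd5766-npd2x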

Since Docker is used as the CRI, start with the Docker daemon logs:

time="2023-11-13T14:56:05.734166795+08:00" level=info msg="/etc/resolv.conf does not exist"
time="2023-11-13T14:56:05.734193544+08:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
time="2023-11-13T14:56:05.734202079+08:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
time="2023-11-13T14:56:05.740830618+08:00" level=error msg="stream copy error: reading from a closed fifo"
time="2023-11-13T14:56:05.740850537+08:00" level=error msg="stream copy error: reading from a closed fifo"
time="2023-11-13T14:56:05.751993232+08:00" level=error msg="1622cfb1c90d926b867db7bcb0a86498ccad59db81223e861ac515ec75ed7c27 cleanup: failed to delete container from containerd: no such container"
time="2023-11-13T14:56:05.752024358+08:00" level=error msg="Handler for POST /v1.41/containers/1622cfb1c90d926b867db7bcb0a86498ccad59db81223e861ac515ec75ed7c27/start returned error: start failed: : fork/exec /usr/bin/containerd-shim-runc-v2: too many open files: unknown"

The Docker logs point to the error fork/exec /usr/bin/containerd-shim-runc-v2: too many open files: unknown, which all but confirms the cause: **containerd has too many open file descriptors**.

As shown below, the containerd process is running with the default soft limit of 1024 open files, which is far too low for a container runtime. Once enough containers accumulate on a node, new containers can no longer start.

[root@node1 ~]# systemctl status containerd.service
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: active (running) since Sat 2023-11-01 11:02:14 CST; 1 weeks 10 days ago
     Docs: https://containerd.io
 Main PID: 999 (containerd)
    Tasks: 1622
   Memory: 3.5G
   CGroup: /system.slice/containerd.service
           └─999 /usr/bin/containerd

[root@node1 ~]# cat /proc/999/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             319973               319973               processes
Max open files            1024                 524288               files
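
To gauge how close containerd is to that ceiling, the number of file descriptors it currently holds can be compared against the soft limit (a quick check, assuming a single containerd process on the node):

[root@node1 ~]# ls /proc/$(pidof containerd)/fd | wc -l
[root@node1 ~]# grep "Max open files" /proc/$(pidof containerd)/limits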

The containerd.service unit file contains no explicit file descriptor limit (for comparison, on other, healthy environments the containerd.service file present after OS installation does carry a LimitNOFILE setting):

[root@node1 ~]# cat /usr/lib/systemd/system/containerd.service
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target

[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd
KillMode=process
Delegate=yes

[Install]
WantedBy=multi-user.target
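
Whether a unit sets any limits, and which value systemd actually applies, can also be checked without reading /proc:

[root@node1 ~]# grep -i "^Limit" /usr/lib/systemd/system/containerd.service
[root@node1 ~]# systemctl show containerd.service --property=LimitNOFILE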

Solution

Modify the containerd.service file to set the file descriptor limit explicitly; size the value according to actual needs:

[root@node1 ~]# cat /usr/lib/systemd/system/containerd.service
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target

[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd
KillMode=process
Delegate=yes
LimitNOFILE=1048576
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity

[Install]
WantedBy=multi-user.target
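
For the new limits to take effect, systemd must reload its unit files and containerd must be restarted (note that restarting containerd briefly disrupts container management on the node, so plan accordingly):

[root@node1 ~]# systemctl daemon-reload
[root@node1 ~]# systemctl restart containerd.service
[root@node1 ~]# grep "Max open files" /proc/$(pidof containerd)/limits

As an alternative to editing the vendor unit file in place (which a package upgrade may overwrite), the same [Service] limit lines can be placed in a drop-in override created with systemctl edit containerd.service.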
