
Troubleshooting at work occasionally requires access to Red Hat's knowledge base. Following reference [1], the few steps below are all it takes:

Steps

Step 1: Log in at https://access.redhat.com/ and create an account.

Step 2: Visit https://developers.redhat.com/products/rhel/download to activate the subscription (an email arrives; activate through it).

Step 3: Visit https://access.redhat.com/management to confirm the account has a developer subscription:

14904535	Red Hat Developer Subscription for Individuals
14904536 Red Hat Beta Access

Step 4: Register a RHEL system with the username and password from registration:

subscription-manager register --auto-attach --username ******** --password ********
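
To confirm the registration actually took before moving on, subscription-manager can report the entitlement state directly (a quick sanity check):

subscription-manager status            # overall status should read "Current"
subscription-manager list --consumed   # should list the developer subscription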

Step 5: Visit https://access.redhat.com/solutions/6178422 to test that the knowledge base is accessible.

As for having a RHEL system to register, the quick and convenient route here is to deploy a RHEL 8 OS with Vagrant. That flow only takes a few steps as well:

# Initialize a Vagrantfile
$ vagrant init generic/rhel8
A `Vagrantfile` has been placed in this directory. You are now
ready to `vagrant up` your first virtual environment! Please read
the comments in the Vagrantfile as well as documentation on
`vagrantup.com` for more information on using Vagrant.

# Boot the RHEL 8 system
$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Box 'generic/rhel8' could not be found. Attempting to find and install...
default: Box Provider: virtualbox
default: Box Version: >= 0
==> default: Loading metadata for box 'generic/rhel8'
default: URL: https://vagrantcloud.com/generic/rhel8
==> default: Adding box 'generic/rhel8' (v4.3.12) for provider: virtualbox
default: Downloading: https://vagrantcloud.com/generic/boxes/rhel8/versions/4.3.12/providers/virtualbox/amd64/vagrant.box
default:
default: Calculating and comparing box checksum...
==> default: Successfully added box 'generic/rhel8' (v4.3.12) for 'virtualbox'!
==> default: Importing base box 'generic/rhel8'...
==> default: Matching MAC address for NAT networking...
==> default: Checking if box 'generic/rhel8' version '4.3.12' is up to date...
==> default: Setting the name of the VM: Redhat8_default_1715673627487_1933
==> default: Vagrant has detected a configuration issue which exposes a
==> default: vulnerability with the installed version of VirtualBox. The
==> default: current guest is configured to use an E1000 NIC type for a
==> default: network adapter which is vulnerable in this version of VirtualBox.
==> default: Ensure the guest is trusted to use this configuration or update
==> default: the NIC type using one of the methods below:
==> default:
==> default: https://www.vagrantup.com/docs/virtualbox/configuration.html#default-nic-type
==> default: https://www.vagrantup.com/docs/virtualbox/networking.html#virtualbox-nic-type
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
default: Adapter 1: nat
==> default: Forwarding ports...
default: 22 (guest) => 2222 (host) (adapter 1)
==> default: Running 'pre-boot' VM customizations...
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
default: SSH address: 127.0.0.1:2222
default: SSH username: vagrant
default: SSH auth method: private key
The guest machine entered an invalid state while waiting for it
to boot. Valid states are 'starting, running'. The machine is in the
'paused' state. Please verify everything is configured
properly and try again.

If the provider you're using has a GUI that comes with it,
it is often helpful to open that and watch the machine, since the
GUI often has more helpful error messages than Vagrant can retrieve.
For example, if you're using VirtualBox, run `vagrant up` while the
VirtualBox GUI is open.

The primary issue for this error is that the provider you're using
is not properly configured. This is very rarely a Vagrant issue.

# After the command above the VM sits in the paused state; bring it up once more
$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Checking if box 'generic/rhel8' version '4.3.12' is up to date...
==> default: Unpausing the VM...

# SSH into the freshly installed RHEL system
$ vagrant ssh
Register this system with Red Hat Insights: insights-client --register
Create an account or view all your systems at https://red.ht/insights-dashboard

# Run the registration inside the RHEL system
[root@rhel8 ~]# subscription-manager register --auto-attach --username ******** --password ********
Registering to: subscription.rhsm.redhat.com:443/subscription
The system has been registered with ID: xxxx-xxxx-xxxx-xxxx-xxxx
The registered system name is: rhel8.localdomain

# When finished with the system, just shut it down
$ vagrant halt
==> default: Attempting graceful shutdown of VM...
default:
default: Vagrant insecure key detected. Vagrant will automatically replace
default: this with a newly generated keypair for better security.
default:
default: Inserting generated public key within guest...
default: Removing insecure key from the guest if it's present...
default: Key inserted! Disconnecting and reconnecting using new SSH key...
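
Incidentally, the E1000 warning printed during vagrant up can be silenced by overriding the NIC type, per the Vagrant docs linked in that output. A minimal sketch, appending a provider override to the generated Vagrantfile (Vagrant merges multiple configure blocks; virtio is one common choice):

cat >> Vagrantfile <<'EOF'
Vagrant.configure("2") do |config|
  config.vm.provider "virtualbox" do |vb|
    vb.default_nic_type = "virtio"   # avoid the vulnerable E1000 NIC type
  end
end
EOF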

References

1. https://wangzheng422.github.io/docker_env/notes/2022/2022.04.no-cost.rhel.sub.html

Problem Background

In a K8S environment, someone mistakenly restarted the system dbus service for one workload, after which all Pods failed to start. The relevant log:

unable to ensure pod container exists: failed to create container for [kubepods besteffort ...] : dbus: connection closed by user

Root Cause Analysis

Searching on the error message leads to the related issue [1]; the cause is as follows:

When creating a Pod, the kubelet service calls /var/run/dbus/system_bus_socket. If the dbus service restarts because of some fault, that socket file is recreated; kubelet, still sending data to the old socket, then produces the error above.
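
A quick way to see the mismatch on a live node is to compare when the socket file was recreated with when kubelet last started (a diagnostic sketch, not from the issue itself):

stat -c '%y' /var/run/dbus/system_bus_socket      # socket recreation time
systemctl show kubelet -p ActiveEnterTimestamp    # kubelet start time
# if the socket is newer than kubelet, kubelet is holding the stale one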

Solution

Workaround: restart the kubelet service.

Permanent fix: upgrade K8S to v1.25+.

Follow-up Issue

After the dbus and kubelet services had been restarted, SSH logins by non-root users became slow. The secure log shows these errors:

pam_systemd(crond:session): Failed to create session: Activation of org.freedesktop.login1 timed out
pam_systemd(crond:session): Failed to create session: Connection timed out

Per reference [2], the cause is that sshd depends on the systemd-logind service, which in turn depends on dbus; restarting systemd-logind resolved it:

[root@core log]# systemctl restart systemd-logind 

References

1. https://github.com/kubernetes/kubernetes/issues/100328

2. https://www.jianshu.com/p/bb66d7f8c859

Symptom

Networking between all nodes of a K8S cluster is abnormal; normal SSH operations fail.

Root Cause Analysis

Given the symptom, the first suspicion was a wrong password. We checked whether the password in use matched the real one; the password stored by the application was confirmed to match, ruling out a mismatch.

Next, check whether some unexpected IP is connecting with a wrong password.

IPv6 addresses are in use here. Note that default netstat output truncates IPv6 addresses, making the full address hard to read; add the -W option to display them completely:

[root@node1 ~]# netstat -anp -v|grep -w 22 
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 156183/sshd: /usr/s
tcp6 0 0 :::22 :::* LISTEN 156183/sshd: /usr/s
tcp6 0 0 2000:8080:5a0a:2f:59732 2000:8080:5a0a:2f40::22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f:44072 2000:8080:5a0a:2f40::22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f:35666 2000:8080:5a0a:2f40::22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f:42998 2000:8080:5a0a:2f40::22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f:59834 2000:8080:5a0a:2f40::22 ESTABLISHED 170769/java
tcp6 0 0 2000:8080:5a0a:2f:59652 2000:8080:5a0a:2f40::22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f:39430 2000:8080:5a0a:2f40::22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f:35648 2000:8080:5a0a:2f40::22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f:36852 2000:8080:5a0a:2f40::22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f:43162 2000:8080:5a0a:2f40::22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f:35002 2000:8080:5a0a:2f40::22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f:36052 2000:8080:5a0a:2f40::22 ESTABLISHED 170769/java

With full IPv6 addresses, the SSH connections look like this:

[root@node1 ~]# netstat -anp -W|grep -w 22 |grep -v ::4
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 156183/sshd: /usr/s
tcp6 0 0 :::22 :::* LISTEN 156183/sshd: /usr/s
tcp6 0 0 2000:8080:5a0a:2f40:8002::5:48950 2000:8080:5a0a:2f40:8002::5:22 ESTABLISHED 170769/java
tcp6 0 0 2000:8080:5a0a:2f40:8002::5:52506 2000:8080:5a0a:2f40:8002::5:22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f40:8002::5:56798 2000:8080:5a0a:2f40:8002::6:22 ESTABLISHED 170769/java
tcp6 0 0 2000:8080:5a0a:2f40:8002::5:52624 2000:8080:5a0a:2f40:8002::5:22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f40:8002::5:56860 2000:8080:5a0a:2f40:8002::6:22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f40:8002::5:52396 2000:8080:5a0a:2f40:8002::5:22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f40:8002::5:52398 2000:8080:5a0a:2f40:8002::5:22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f40:8002::5:22 2000:8080:5a0a:2f40:8002::5:45532 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f40:8002::5:52202 2000:8080:5a0a:2f40:8002::5:22 TIME_WAIT -
tcp6 0 0 2000:8080:5a0a:2f40:8002::5:52348 2000:8080:5a0a:2f40:8002::5:22 ESTABLISHED 170769/java

From the records above, at least for the moment there is no SSH connection from an unexpected IP. Next, confirm whether earlier wrong-password attempts had locked the account.

/var/log/secure had already rotated, so the time the problem began could not be pinned down from it. The system had not rebooted recently, so we turned to the login failures in the journal for the current boot (journalctl --boot). That located the moment the problem started and shows the source IP 2000:8080:5a0a:2f47::2 logging in with a wrong password over and over:

cat boot.log |grep "Failed password"|less
3月 26 10:42:19 node1 sshd[114043]: Failed password for admin from 2000:8080:5a0a:2f47::2 port 34968 ssh2
3月 26 10:42:23 node1 sshd[114043]: Failed password for admin from 2000:8080:5a0a:2f47::2 port 34968 ssh2
3月 26 10:42:25 node1 sshd[114043]: Failed password for admin from 2000:8080:5a0a:2f47::2 port 34968 ssh2
3月 26 10:42:28 node1 sshd[114043]: Failed password for admin from 2000:8080:5a0a:2f47::2 port 34968 ssh2
3月 26 10:42:31 node1 sshd[116187]: Failed password for admin from 2000:8080:5a0a:2f47::2 port 35194 ssh2
3月 26 10:42:34 node1 sshd[116187]: Failed password for admin from 2000:8080:5a0a:2f47::2 port 35194 ssh2
3月 26 10:42:36 node1 sshd[116187]: Failed password for admin from 2000:8080:5a0a:2f47::2 port 35194 ssh2

Normally, after wrong passwords lock an account, it unlocks automatically once the configured time has passed. On the problem environment, though, even with no wrong-password connections currently happening, the correct password still could not log in.
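
Since pam_tally2 is the lockout module in play here, its counter can be inspected and cleared directly (a sketch, with admin as the example account):

pam_tally2 --user admin           # show the current failure count
pam_tally2 --user admin --reset   # clear the counter and unlock at once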

To verify whether the lockout configuration was behind the SSH failures, temporarily comment out the auth entries in /etc/pam.d/system-auth and /etc/pam.d/password-auth:

# auth required pam_tally2.so onerr=fail deny=5 unlock_time=900 even_deny_root

After the change, SSH stayed normal for a while under observation; restoring the configuration made SSH fail again, which essentially confirms a configuration problem. Colleagues on the OS side explained that the lockout module used here, pam_tally2, is an old module deprecated for its defects. One of those defects is exactly this: once wrong passwords lock the account, even the correct password cannot unlock it. They recommend replacing it with the faillock module, configured as follows:

[root@node1 ~]# vim /etc/pam.d/system-auth    (or vi /etc/pam.d/login)
# Add the following at the top of the file:
auth [success=1 default=bad] pam_unix.so
auth [default=die] pam_faillock.so authfail deny=5 even_deny_root unlock_time=900 root_unlock_time=10
auth sufficient pam_faillock.so authsucc deny=5 even_deny_root unlock_time=900 root_unlock_time=10
auth required pam_deny.so

[root@node1 ~]# vim /etc/pam.d/password-auth    (or vi /etc/pam.d/sshd)
# Add the following on line 2 (line 1 is #%PAM-1.0):
auth [success=1 default=bad] pam_unix.so
auth [default=die] pam_faillock.so authfail deny=5 even_deny_root unlock_time=900 root_unlock_time=10
auth sufficient pam_faillock.so authsucc deny=5 even_deny_root unlock_time=900 root_unlock_time=10
auth required pam_deny.so

Note: with the faillock module there is no prompt of any kind during remote or local login when a user is locked; the only symptom is that during the lock window even the correct password cannot log in, and after unlocking, login works again.
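
For day-to-day operations, faillock ships a matching command-line tool to inspect and clear the lock state (a sketch, again with admin as the example):

faillock --user admin           # list the recorded authentication failures
faillock --user admin --reset   # remove the records and unlock the account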

As for why this happened at all, it finally emerged that the customer's vulnerability-scanning platform deliberately probes with weak passwords. It normally scans only once; why it triggered repeated scans is unclear.

Solution

For the password-lockout hardening, use the faillock module in place of the legacy pam_tally2 module.

1. Locate Harbor's configmap file, harbor-cm.yaml in this example, and edit it: vim harbor-cm.yaml

http {
...
    server {
        listen 80;
        server_tokens off;
        client_max_body_size 0;

        location / {
            proxy_pass http://localhost:80/;   --delete this line
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            proxy_buffering off;
            proxy_request_buffering off;
            return 403;   --add this line
        }
...

2. Run kubectl apply -f harbor-cm.yaml to update the configuration.
3. Locate the YAML where the Harbor Pod is defined, harbor1.yaml in this example (vim harbor1.yaml), and change the Harbor nginx probes from the default path / to /api/systeminfo:

- image: goharbor/nginx-photon:v1.6.4
  imagePullPolicy: IfNotPresent
  livenessProbe:
    httpGet:
      path: /api/systeminfo   --changed from the default /
      port: 80
    initialDelaySeconds: 1
    periodSeconds: 10
  name: nginx
  ports:
  - containerPort: 80
  readinessProbe:
    httpGet:
      path: /api/systeminfo   --changed from the default /
      port: 80
    initialDelaySeconds: 1
    periodSeconds: 10
4. Run kubectl apply -f harbor1.yaml to restart the Harbor service.
5. Log in to the Harbor page again; the expected result is 403 Forbidden, so login is no longer possible.
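
The change can also be verified from the command line instead of the browser (a sketch; <harbor-node> stands in for the actual address):

curl -s -o /dev/null -w '%{http_code}\n' http://<harbor-node>/                 # expect 403
curl -s -o /dev/null -w '%{http_code}\n' http://<harbor-node>/api/systeminfo   # expect non-403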

Problem Background

While deploying a K8S cluster on servers with the domestic Hygon C86 7265 CPU, calico-node failed to start. The relevant output:

[root@node1 ~]# kubectl get pod -A -owide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
kube-system calico-kube-controllers-7c7986989c-bwvw4 0/1 Pending 0 5m11s <none> <none>
kube-system calico-node-v64fv 0/1 CrashLoopBackOff 5 5m11s 10.10.26.120 node1
kube-system coredns-6db7677797-jkhpd 0/1 Pending 0 5m11s <none> <none>
kube-system coredns-6db7677797-r58c5 0/1 Pending 0 5m11s <none> <none>
kube-system kube-apiserver-node1 1/1 Running 6 5m23s 10.10.26.120 node1
kube-system kube-controller-manager-node1 1/1 Running 8 5m28s 10.10.26.120 node1
kube-system kube-proxy-ncw4g 1/1 Running 0 5m11s 10.10.26.120 node1
kube-system kube-scheduler-node1 1/1 Running 6 5m29s 10.10.26.120 node1

Root Cause Analysis

Check the detailed error log:

[root@node1 ~]# kubectl logs -n kube-system calico-node-v64fv
2024-04-03 14:29:25.424 [INFO][9] startup/startup.go 427: Early log level set to info
2024-04-03 14:29:25.425 [INFO][9] startup/utils.go 131: Using HOSTNAME environment (lowercase) for node name node1
2024-04-03 14:29:25.425 [INFO][9] startup/utils.go 139: Determined node name: node1
2024-04-03 14:29:25.428 [INFO][9] startup/startup.go 106: Skipping datastore connection test
CRNGT failed.
SIGABRT: abort
PC=0x7efbf7409a9f m=13 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7efbf7409a9f
stack: frame={sp:0x7efbaa7fb780, fp:0x0} stack=[0x7efba9ffc250,0x7efbaa7fbe50)fffff 0x00007efbf7de02cc
0x00007efbaa7fb6c0: 0x00007efbf73c2340 0x00007efbf7ffbed0
0x00007efbaa7fb6d0: 0x00007efbf73c8cd0 0x00007efbf7ffbed0
0x00007efbaa7fb6e0: 0x0000000000000001 0x00007efbf7de06be
0x00007efbaa7fb6f0: 0x000000000000015f 0x00007efbf7783360
0x00007efbaa7fb700: 0x00007efbf7ffb9e0 0x0000000004904060
0x00007efbaa7fb710: 0x00007efbaa7fbdf0 0x0000000000000000
0x00007efbaa7fb720: 0x0000000000000020 0x00007efb94000dd0
0x00007efbaa7fb730: 0x00007efb94000dd0 0x00007efbf7de5574
0x00007efbaa7fb740: 0x0000000000000005 0x0000000000000000
0x00007efbaa7fb750: 0x0000000000000005 0x00007efbf73c2340
0x00007efbaa7fb760: 0x00007efbaa7fb9b0 0x00007efbf7dd2ae7
0x00007efbaa7fb770: 0x0000000000000001 0x00007efbf74db5df
0x00007efbaa7fb780: <0x0000000000000000 0x00007efbf777c850
0x00007efbaa7fb790: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7a0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7b0: 0x000000000000037f 0x0000000000000000
0x00007efbaa7fb7c0: 0x0000000000000000 0x0002ffff00001fa0
0x00007efbaa7fb7d0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7e0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7f0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb800: 0xfffffffe7fffffff 0xffffffffffffffff
0x00007efbaa7fb810: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb820: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb830: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb840: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb850: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb860: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb870: 0xffffffffffffffff 0xffffffffffffffff
runtime: unknown pc 0x7efbf7409a9f
stack: frame={sp:0x7efbaa7fb780, fp:0x0} stack=[0x7efba9ffc250,0x7efbaa7fbe50)
0x00007efbaa7fb680: 0x00007efbaa7fb6c0 0x00007efbf8000558
0x00007efbaa7fb690: 0x0000000000000000 0x00007efbf8000558
0x00007efbaa7fb6a0: 0x0000000000000001 0x0000000000000000
0x00007efbaa7fb6b0: 0x00000000ffffffff 0x00007efbf7de02cc
0x00007efbaa7fb6c0: 0x00007efbf73c2340 0x00007efbf7ffbed0
0x00007efbaa7fb6d0: 0x00007efbf73c8cd0 0x00007efbf7ffbed0
0x00007efbaa7fb6e0: 0x0000000000000001 0x00007efbf7de06be
0x00007efbaa7fb6f0: 0x000000000000015f 0x00007efbf7783360
0x00007efbaa7fb700: 0x00007efbf7ffb9e0 0x0000000004904060
0x00007efbaa7fb710: 0x00007efbaa7fbdf0 0x0000000000000000
0x00007efbaa7fb720: 0x0000000000000020 0x00007efb94000dd0
0x00007efbaa7fb730: 0x00007efb94000dd0 0x00007efbf7de5574
0x00007efbaa7fb740: 0x0000000000000005 0x0000000000000000
0x00007efbaa7fb750: 0x0000000000000005 0x00007efbf73c2340
0x00007efbaa7fb760: 0x00007efbaa7fb9b0 0x00007efbf7dd2ae7
0x00007efbaa7fb770: 0x0000000000000001 0x00007efbf74db5df
0x00007efbaa7fb780: <0x0000000000000000 0x00007efbf777c850
0x00007efbaa7fb790: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7a0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7b0: 0x000000000000037f 0x0000000000000000
0x00007efbaa7fb7c0: 0x0000000000000000 0x0002ffff00001fa0
0x00007efbaa7fb7d0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7e0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7f0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb800: 0xfffffffe7fffffff 0xffffffffffffffff
0x00007efbaa7fb810: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb820: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb830: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb840: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb850: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb860: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb870: 0xffffffffffffffff 0xffffffffffffffff

goroutine 131 [syscall]:
runtime.cgocall(0x2629971, 0xc00067da20)
/usr/local/go-cgo/src/runtime/cgocall.go:156 +0x5c fp=0xc00067d9f8 sp=0xc00067d9c0 pc=0x41081c
crypto/internal/boring._Cfunc__goboringcrypto_RAND_bytes(0xc0006ba680, 0x20)
_cgo_gotypes.go:1140 +0x4c fp=0xc00067da20 sp=0xc00067d9f8 pc=0x66a0ac
crypto/internal/boring.randReader.Read(0x0, {0xc0006ba680, 0x20, 0x20})
/usr/local/go-cgo/src/crypto/internal/boring/rand.go:21 +0x31 fp=0xc00067da48 sp=0xc00067da20 pc=0x66e691
crypto/internal/boring.(*randReader).Read(0x3333408, {0xc0006ba680, 0xc00067dab0, 0x45acb2})
<autogenerated>:1 +0x34 fp=0xc00067da78 sp=0xc00067da48 pc=0x6754f4
io.ReadAtLeast({0x336aba0, 0x3333408}, {0xc0006ba680, 0x20, 0x20}, 0x20)
/usr/local/go-cgo/src/io/io.go:328 +0x9a fp=0xc00067dac0 sp=0xc00067da78 pc=0x4b6ffa
io.ReadFull(...)
/usr/local/go-cgo/src/io/io.go:347
crypto/tls.(*Conn).makeClientHello(0xc000410700)
/usr/local/go-cgo/src/crypto/tls/handshake_client.go:107 +0x6a5 fp=0xc00067dbe8 sp=0xc00067dac0 pc=0x728f25
crypto/tls.(*Conn).clientHandshake(0xc000410700, {0x33dd910, 0xc000a91880})
/usr/local/go-cgo/src/crypto/tls/handshake_client.go:157 +0x96 fp=0xc00067de78 sp=0xc00067dbe8 pc=0x7295f6
crypto/tls.(*Conn).clientHandshake-fm({0x33dd910, 0xc000a91880})
/usr/local/go-cgo/src/crypto/tls/handshake_client.go:148 +0x39 fp=0xc00067dea0 sp=0xc00067de78 pc=0x759899
crypto/tls.(*Conn).handshakeContext(0xc000410700, {0x33dd980, 0xc00057a240})
/usr/local/go-cgo/src/crypto/tls/conn.go:1445 +0x3d1 fp=0xc00067df70 sp=0xc00067dea0 pc=0x727bf1
crypto/tls.(*Conn).HandshakeContext(...)
/usr/local/go-cgo/src/crypto/tls/conn.go:1395
net/http.(*persistConn).addTLS.func2()
/usr/local/go-cgo/src/net/http/transport.go:1534 +0x71 fp=0xc00067dfe0 sp=0xc00067df70 pc=0x7d4dd1
runtime.goexit()
/usr/local/go-cgo/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc00067dfe8 sp=0xc00067dfe0 pc=0x4759c1
created by net/http.(*persistConn).addTLS
/usr/local/go-cgo/src/net/http/transport.go:1530 +0x345

goroutine 1 [select]:
net/http.(*Transport).getConn(0xc00033e140, 0xc0006b6f80, {{}, 0x0, {0xc000c6a0a0, 0x5}, {0xc00046c680, 0xd}, 0x0})
/usr/local/go-cgo/src/net/http/transport.go:1372 +0x5d2
net/http.(*Transport).roundTrip(0xc00033e140, 0xc000650e00)
/usr/local/go-cgo/src/net/http/transport.go:581 +0x774
net/http.(*Transport).RoundTrip(0x2cc6880, 0xc000a6b1d0)
/usr/local/go-cgo/src/net/http/roundtrip.go:18 +0x19
k8s.io/client-go/transport.(*bearerAuthRoundTripper).RoundTrip(0xc000a98360, 0xc000650a00)
/go/pkg/mod/k8s.io/client-go@v0.23.3/transport/round_trippers.go:317 +0x242
net/http.send(0xc000650900, {0x336a040, 0xc000a98360}, {0x2e2a640, 0x4d0701, 0x4a61720})
/usr/local/go-cgo/src/net/http/client.go:252 +0x5d8
net/http.(*Client).send(0xc000a983f0, 0xc000650900, {0x2ec14f9, 0xe, 0x4a61720})
/usr/local/go-cgo/src/net/http/client.go:176 +0x9b
net/http.(*Client).do(0xc000a983f0, 0xc000650900)
/usr/local/go-cgo/src/net/http/client.go:725 +0x908
net/http.(*Client).Do(...)
/usr/local/go-cgo/src/net/http/client.go:593
k8s.io/client-go/rest.(*Request).request(0xc000650700, {0x33dd980, 0xc00057a240}, 0x4a95f98)
/go/pkg/mod/k8s.io/client-go@v0.23.3/rest/request.go:980 +0x439
k8s.io/client-go/rest.(*Request).Do(0x20, {0x33dd948, 0xc000056060})
/go/pkg/mod/k8s.io/client-go@v0.23.3/rest/request.go:1038 +0xcc
k8s.io/client-go/kubernetes/typed/core/v1.(*configMaps).Get(0xc00004fa40, {0x33dd948, 0xc000056060}, {0x2ec14f9, 0xe}, {{{0x0, 0x0}, {0x0, 0x0}}, {0x0, ...}})
/go/pkg/mod/k8s.io/client-go@v0.23.3/kubernetes/typed/core/v1/configmap.go:78 +0x15a
github.com/projectcalico/calico/node/pkg/lifecycle/startup.Run()
/go/src/github.com/projectcalico/calico/node/pkg/lifecycle/startup/startup.go:148 +0x422
main.main()
/go/src/github.com/projectcalico/calico/node/cmd/calico-node/main.go:142 +0x732

goroutine 8 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0xc000202900)
/go/pkg/mod/k8s.io/klog/v2@v2.40.1/klog.go:1283 +0x6a
created by k8s.io/klog/v2.init.0
/go/pkg/mod/k8s.io/klog/v2@v2.40.1/klog.go:420 +0xfb

goroutine 29 [select]:
net/http.setRequestCancel.func4()
/usr/local/go-cgo/src/net/http/client.go:398 +0x94
created by net/http.setRequestCancel
/usr/local/go-cgo/src/net/http/client.go:397 +0x43e

goroutine 30 [chan receive]:
net/http.(*persistConn).addTLS(0xc0005617a0, {0x33dd980, 0xc00057a240}, {0xc00046c680, 0x9}, 0x0)
/usr/local/go-cgo/src/net/http/transport.go:1540 +0x365
net/http.(*Transport).dialConn(0xc00033e140, {0x33dd980, 0xc00057a240}, {{}, 0x0, {0xc000c6a0a0, 0x5}, {0xc00046c680, 0xd}, 0x0})
/usr/local/go-cgo/src/net/http/transport.go:1614 +0xab7
net/http.(*Transport).dialConnFor(0x0, 0xc00077d6b0)
/usr/local/go-cgo/src/net/http/transport.go:1446 +0xb0
created by net/http.(*Transport).queueForDial
/usr/local/go-cgo/src/net/http/transport.go:1415 +0x3d7

goroutine 183 [select]:
google.golang.org/grpc.(*ccBalancerWrapper).watcher(0xc0001ac0a0)
/go/pkg/mod/google.golang.org/grpc@v1.40.0/balancer_conn_wrappers.go:71 +0xa5
created by google.golang.org/grpc.newCCBalancerWrapper
/go/pkg/mod/google.golang.org/grpc@v1.40.0/balancer_conn_wrappers.go:62 +0x246

goroutine 184 [chan receive]:
google.golang.org/grpc.(*addrConn).resetTransport(0xc000660000)
/go/pkg/mod/google.golang.org/grpc@v1.40.0/clientconn.go:1219 +0x48f
created by google.golang.org/grpc.(*addrConn).connect
/go/pkg/mod/google.golang.org/grpc@v1.40.0/clientconn.go:849 +0x147

goroutine 194 [select]:
google.golang.org/grpc/internal/transport.(*http2Client).keepalive(0xc00000c5a0)
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/http2_client.go:1569 +0x169
created by google.golang.org/grpc/internal/transport.newHTTP2Client
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/http2_client.go:350 +0x18a5

goroutine 195 [IO wait]:
internal/poll.runtime_pollWait(0x7efbf7e43728, 0x72)
/usr/local/go-cgo/src/runtime/netpoll.go:303 +0x85
internal/poll.(*pollDesc).wait(0xc000a92980, 0xc0005b4000, 0x0)
/usr/local/go-cgo/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go-cgo/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000a92980, {0xc0005b4000, 0x8000, 0x8000})
/usr/local/go-cgo/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc000a92980, {0xc0005b4000, 0x1060100000000, 0x8})
/usr/local/go-cgo/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc00060e078, {0xc0005b4000, 0x9c8430, 0xc00033c500})
/usr/local/go-cgo/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc000197380, {0xc0003c04a0, 0x9, 0x18})
/usr/local/go-cgo/src/bufio/bufio.go:227 +0x1b4
io.ReadAtLeast({0x3365ca0, 0xc000197380}, {0xc0003c04a0, 0x9, 0x9}, 0x9)
/usr/local/go-cgo/src/io/io.go:328 +0x9a
io.ReadFull(...)
/usr/local/go-cgo/src/io/io.go:347
golang.org/x/net/http2.readFrameHeader({0xc0003c04a0, 0x9, 0x3f69d15}, {0x3365ca0, 0xc000197380})
/go/pkg/mod/golang.org/x/net@v0.0.0-20220520000938-2e3eb7b945c2/http2/frame.go:237 +0x6e
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0003c0460)
/go/pkg/mod/golang.org/x/net@v0.0.0-20220520000938-2e3eb7b945c2/http2/frame.go:498 +0x95
google.golang.org/grpc/internal/transport.(*http2Client).reader(0xc00000c5a0)
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/http2_client.go:1495 +0x41f
created by google.golang.org/grpc/internal/transport.newHTTP2Client
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/http2_client.go:355 +0x18ef

goroutine 196 [select]:
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc000204230, 0x1)
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/controlbuf.go:406 +0x11b
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc000197440)
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/controlbuf.go:533 +0x85
google.golang.org/grpc/internal/transport.newHTTP2Client.func3()
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/http2_client.go:405 +0x65
created by google.golang.org/grpc/internal/transport.newHTTP2Client
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/http2_client.go:403 +0x1f45

goroutine 132 [select]:
crypto/tls.(*Conn).handshakeContext.func2()
/usr/local/go-cgo/src/crypto/tls/conn.go:1421 +0x9e
created by crypto/tls.(*Conn).handshakeContext
/usr/local/go-cgo/src/crypto/tls/conn.go:1420 +0x1bd

rax 0x0
rbx 0x6
rcx 0xffffffffffffffff
rdx 0x0
rdi 0x2
rsi 0x7efbaa7fb780
rbp 0x7efbaa7fbdf0
rsp 0x7efbaa7fb780
r8 0x0
r9 0x7efbaa7fb780
r10 0x8
r11 0x246
r12 0x0
r13 0x20
r14 0x7efb94000dd0
r15 0x7efb94000dd0
rip 0x7efbf7409a9f
rflags 0x246
cs 0x33
fs 0x0
gs 0x0
Calico node failed to start

It was confirmed that the previously installed release had no problem on this environment; only the latest release shows it. Since the latest release had upgraded calico, the calico version was suspected first, but upgrading calico to the newest version left the problem in place.

Searching on the error log line CRNGT failed turns up reference [1]: others hit the same error because AMD Ryzen 9 3000 series CPUs have a bug that stops RDRAND from generating proper random numbers under certain BIOS versions, and the fix there was a BIOS upgrade.

Following the detection method from reference [1], create a main.go file:

package main

import (
	"crypto/rand"
	"fmt"
)

func main() {
	a := make([]byte, 10)
	_, err := rand.Read(a)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(a))
}

Running the following command on the problem server did not reproduce the issue:

$ GOEXPERIMENT=boringcrypto go run main.go

Reference [2] covers the same AMD Ryzen 9 3000 series CPUs and gives another way to check:

you@ubuntu-live:~$ wget https://cdn.arstechnica.net/wp-content/uploads/2019/10/rdrand-test.zip
you@ubuntu-live:~$ unzip rdrand-test.zip
you@ubuntu-live:~$ cd rdrand-test
you@ubuntu-live:~$ ./amd-rdrand.bug

That link is dead now, though. Further searching turned up another test tool [3] that ships prebuilt binaries; on the faulty server it behaves like this:

[root@node1 ~]# ./RDRAND_Tester_Linux_x86_64   --first run: the numbers differ
RDRAND Tester v20210328 x86_64
Compiled on Apr 16 2021
Compiled with GNU Compiler Collection (GCC) 10.3.0
Running on Hygon C86 7265 24-core Processor
This CPU supports the following instructions:
RDRAND: Supported
RDSEED: Supported

Testing RDRAND...
try: 1 success: 1 random number: 17705883718297935842 (0xf5b7ef6e97855fe2)
try: 2 success: 1 random number: 6443855104021096318 (0x596d2c137b43e77e)
try: 3 success: 1 random number: 10126471306861746785 (0x8c88740051ae1a61)
try: 4 success: 1 random number: 13463061200056996464 (0xbad666d4c2bdd270)
try: 5 success: 1 random number: 7695825692332247646 (0x6acd10b164c9ca5e)
try: 6 success: 1 random number: 1263849930341660097 (0x118a18d0c36ab5c1)
try: 7 success: 1 random number: 2580393233033016710 (0x23cf65f953c13586)
try: 8 success: 1 random number: 1842118076754864861 (0x199084a17f4caadd)
try: 9 success: 1 random number: 2896900625228522073 (0x2833dbc52c5a6259)
try: 10 success: 1 random number: 3899901262805814503 (0x361f3b8934a34ce7)
try: 11 success: 1 random number: 3597359862242937122 (0x31ec63bc2e3d0922)
try: 12 success: 1 random number: 12246743104637488545 (0xa9f52bf7b761cda1)
try: 13 success: 1 random number: 16491679937497687446 (0xe4de3786c7fc6596)
try: 14 success: 1 random number: 7270227793600200162 (0x64e509a8b1b63de2)
try: 15 success: 1 random number: 15697857806096052438 (0xd9d9fe80faf2b0d6)
try: 16 success: 1 random number: 2546933488048450266 (0x235886835dacaada)
try: 17 success: 1 random number: 6670897529050922874 (0x5c93c9f5701c7f7a)
try: 18 success: 1 random number: 14670415794664541721 (0xcb97c97024428e19)
try: 19 success: 1 random number: 2452728878003037248 (0x2209d7eb5fb6a440)
try: 20 success: 1 random number: 16252906931536406850 (0xe18decbe1db62942)

The RDRAND instruction of this CPU appears to be working.
The numbers generated should be different and random.
If the numbers generated appears to be similar, the RDRAND instruction is
broken.

[root@node1 ~]# ./RDRAND_Tester_Linux_x86_64   --subsequent runs: the numbers are all identical
RDRAND Tester v20210328 x86_64
Compiled on Apr 16 2021
Compiled with GNU Compiler Collection (GCC) 10.3.0
Running on Hygon C86 7265 24-core Processor
This CPU supports the following instructions:
RDRAND: Supported
RDSEED: Supported

Testing RDRAND...
try: 1 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 2 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 3 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 4 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 5 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 6 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 7 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 8 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 9 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 10 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 11 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 12 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 13 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 14 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 15 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 16 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 17 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 18 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 19 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 20 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)

The RDRAND instruction of this CPU appears to be broken!
The numbers generated are NOT random but the CPU returns the success flag.

On a healthy server it runs like this:

[root@node1 ~]# ./RDRAND_Tester_Linux_x86_64   --across repeated runs, no identical numbers appeared
RDRAND Tester v20210328 x86_64
Compiled on Apr 16 2021
Compiled with GNU Compiler Collection (GCC) 10.3.0
Running on Hygon C86 7265 24-core Processor
This CPU supports the following instructions:
RDRAND: Supported
RDSEED: Supported

Testing RDRAND...
try: 1 success: 1 random number: 17914541561690204462 (0xf89d3ca29284292e)
try: 2 success: 1 random number: 14332812162628513309 (0xc6e860b931deee1d)
try: 3 success: 1 random number: 11906898495071391800 (0xa53dcd18875d1038)
try: 4 success: 1 random number: 5465211412374691004 (0x4bd854d2d9011cbc)
try: 5 success: 1 random number: 13927489571584093018 (0xc14861f96f3a2b5a)
try: 6 success: 1 random number: 70328156090550554 (0x00f9db15d97c491a)
try: 7 success: 1 random number: 9065062530023621999 (0x7dcd9257a0c3056f)
try: 8 success: 1 random number: 283806862943046502 (0x03f048d69289cb66)
try: 9 success: 1 random number: 7602503365830811759 (0x698184880c0ea06f)
try: 10 success: 1 random number: 3090051278467342602 (0x2ae2114416c9a10a)
try: 11 success: 1 random number: 2685951337108651825 (0x25466a82a458bf31)
try: 12 success: 1 random number: 15486706753868706299 (0xd6ebd5bd94fcb1fb)
try: 13 success: 1 random number: 11789666617122680772 (0xa39d4f52ede0efc4)
try: 14 success: 1 random number: 1388997005975229823 (0x1346b56aef3c157f)
try: 15 success: 1 random number: 11566015841037137779 (0xa082be20c78f3773)
try: 16 success: 1 random number: 14397918040333260716 (0xc7cfae2c9b4097ac)
try: 17 success: 1 random number: 10383120616855762267 (0x901841305bb8f55b)
try: 18 success: 1 random number: 6694856356368217838 (0x5ce8e8629f97f6ee)
try: 19 success: 1 random number: 2307408338273596455 (0x20058fa892927427)
try: 20 success: 1 random number: 6317182892917504808 (0x57ab245f0985bb28)

The RDRAND instruction of this CPU appears to be working.
The numbers generated should be different and random.
If the numbers generated appears to be similar, the RDRAND instruction is
broken.

Compare the BIOS of the two servers:

Faulty server:
[root@node1 ~]# dmidecode -t bios
dmidecode 3.1
Getting SMBIOS data from sysfs.
SMBIOS 3.1 present.

Handle 0x0068, DMI type 0, 26 bytes
BIOS Information
Vendor: Byosoft
Version: 3.07.09P01
Release Date: 12/16/2020
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 0 MB
Characteristics:
PCI is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
ACPI is supported
USB legacy is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 17.0

Handle 0x0070, DMI type 13, 22 bytes
BIOS Language Information
Language Description Format: Long
Installable Languages: 2
en|US|iso8859-1
zh|CN|unicode
Currently Installed Language: zh|CN|unicode
Healthy server:
[root@node1 ~]# dmidecode -t bios
dmidecode 3.1
Getting SMBIOS data from sysfs.
SMBIOS 3.1 present.

Handle 0x0069, DMI type 0, 26 bytes
BIOS Information
Vendor: Byosoft
Version: 5.19
Release Date: 03/04/2022
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 16 MB
Characteristics:
ISA is supported
PCI is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
ACPI is supported
USB legacy is supported
ATAPI Zip drive boot is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
System is a virtual machine
BIOS Revision: 5.19

Handle 0x0070, DMI type 13, 22 bytes
BIOS Language Information
Language Description Format: Long
Installable Languages: 2
en|US|iso8859-1
zh|CN|unicode
Currently Installed Language: zh|CN|unicode

The comparison shows the faulty server's BIOS is Version: 3.07.09P01 while the healthy server's is Version: 5.19, which all but confirms the BIOS version difference as the cause. After the BIOS was upgraded, another test showed random numbers generating normally, and calico-node started fine.
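
For checking a whole fleet, the BIOS version and the RDRAND capability can each be pulled with a one-liner instead of reading the full dmidecode dump (a convenience sketch):

dmidecode -s bios-version            # prints just the BIOS version string
grep -m1 -o rdrand /proc/cpuinfo     # confirms the CPU advertises RDRAND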

Solution

Upgrade the BIOS.

References

1. https://github.com/projectcalico/calico/issues/7001
2. https://arstechnica.com/gadgets/2019/10/how-a-months-old-amd-microcode-bug-destroyed-my-weekend/
3. https://github.com/cjee21/RDRAND-Tester

Problem Background

A Java project pulls in the sshj dependency to run SSH commands remotely. The SSH commands run fine on the environment, but verifying them from a unit test reports this error:

2024-03-28 17:25:22 WARN  DefaultConfig:206 - Disabling high-strength ciphers: cipher strengths apparently limited by JCE policy
2024-03-28 17:25:22 INFO TransportImpl:214 - Client identity string: SSH-2.0-SSHJ_0.27.0
2024-03-28 17:25:22 INFO TransportImpl:178 - Server identity string: SSH-2.0-OpenSSH_7.4
2024-03-28 17:25:23 ERROR TransportImpl:593 - Dying because - Invalid signature file digest for Manifest main attributes
java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
at sun.security.util.SignatureFileVerifier.processImpl(SignatureFileVerifier.java:317)
at sun.security.util.SignatureFileVerifier.process(SignatureFileVerifier.java:259)
at java.util.jar.JarVerifier.processEntry(JarVerifier.java:323)
at java.util.jar.JarVerifier.update(JarVerifier.java:234)
at java.util.jar.JarFile.initializeVerifier(JarFile.java:394)
at java.util.jar.JarFile.ensureInitialization(JarFile.java:632)
at java.util.jar.JavaUtilJarAccessImpl.ensureInitialization(JavaUtilJarAccessImpl.java:69)
at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:993)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:456)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at net.schmizz.sshj.common.KeyType$3.isMyType(KeyType.java:124)
at net.schmizz.sshj.common.KeyType.fromKey(KeyType.java:288)
at net.schmizz.sshj.transport.kex.AbstractDHG.next(AbstractDHG.java:82)
at net.schmizz.sshj.transport.KeyExchanger.handle(KeyExchanger.java:364)
at net.schmizz.sshj.transport.TransportImpl.handle(TransportImpl.java:503)
at net.schmizz.sshj.transport.Decoder.decodeMte(Decoder.java:159)
at net.schmizz.sshj.transport.Decoder.decode(Decoder.java:79)
at net.schmizz.sshj.transport.Decoder.received(Decoder.java:231)
at net.schmizz.sshj.transport.Reader.run(Reader.java:59)
2024-03-28 17:25:23 INFO TransportImpl:192 - Disconnected - UNKNOWN
2024-03-28 17:25:23 ERROR Promise:174 - <<kex done>> woke to: net.schmizz.sshj.transport.TransportException: Invalid signature file digest for Manifest main attributes
2024-03-28 17:25:23 ERROR matrix:573 - failed exec command ls /root/ on node 10.10.2.8

Searching on the message Invalid signature file digest for Manifest main attributes, the following suggested fixes were tried with no effect:

1. Registering a custom provider: Security.addProvider(new sun.security.ec.SunEC());
2. Disabling the JCE crypto restriction: Security.setProperty("crypto.policy", "unlimited");
3. Setting the provider through sshj's SecurityUtils:
// Set the BC provider as sshj's security provider
SecurityUtils.setSecurityProvider(String.valueOf(Security.getProvider("BC")));

// Set the SunJCE provider as sshj's security provider
SecurityUtils.setSecurityProvider(String.valueOf(Security.getProvider("SunJCE")));

Checking sshj's issues [1] turned up a similar problem whose cause was that bcprov's signature could not be verified. Inspect the bcprov signatures:

Problematic version:
[root@node1 1.0.0]# /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.402.b06-1.el7_9.x86_64/bin/jarsigner -verify bcprov-jdk15on-1.60.jar
jarsigner: java.lang.SecurityException: Invalid signature file digest for Manifest main attributes

Newer version:
[root@node1 1.0.0]# /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.402.b06-1.el7_9.x86_64/bin/jarsigner -verify bcprov-jdk15on-1.69.jar
jar verified.
Warning:
This jar contains entries whose certificate chain is invalid. Reason: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
The DSA signing key has a keysize of 1024 which is considered a security risk. This key size will be disabled in a future update.

It does look like a version problem. After updating the project's bcprov to 1.69 and retesting, the error is gone and the command output comes back normally:

2024-03-29 09:00:08 INFO  BouncyCastleRandom:48 - Generating random seed from SecureRandom.
2024-03-29 09:00:08 INFO TransportImpl:214 - Client identity string: SSH-2.0-SSHJ_0.27.0
2024-03-29 09:00:08 INFO TransportImpl:178 - Server identity string: SSH-2.0-OpenSSH_7.4
2024-03-29 09:00:08 INFO TransportImpl:192 - Disconnected - BY_APPLICATION
anaconda-ks.cfg

Solution

Upgrade the bcprov dependency to 1.69.
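
It is also worth confirming which bcprov version the build actually resolves, since another dependency may drag the old jar back in (a sketch, assuming a Maven project):

mvn dependency:tree -Dincludes=org.bouncycastle:bcprov-jdk15on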

References

  1. https://github.com/hierynomus/sshj/issues/701

Problem Background

Per reference [1], the initial symptom was: in a VMware virtualization environment, Pod communication over VXLAN is abnormal.

Per reference [2], there are two workarounds:

Option 1: adjust the NIC offload configuration:

ethtool -K vxlan.calico tx-checksum-ip-generic off

Option 2: when creating the VM, set the network adapter type to E1000 or E1000e.

Solution

While recently following the related issues [3,4], I found two more ways to solve the problem (not verified by actual testing).

Option 1: with the flannel network plugin, the following command works around it temporarily:

iptables -A OUTPUT -p udp -m udp --dport 8472 -j MARK --set-xmark 0x0

Option 2: with the calico network plugin, the following feature override solves it permanently:

featureDetectOverride: "ChecksumOffloadBroken=true"
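
The override ends up in Felix, calico's per-node agent, so one way to apply it is through the corresponding environment variable on the calico-node DaemonSet (a sketch, untested here like the options above):

kubectl -n kube-system set env daemonset/calico-node FELIX_FEATUREDETECTOVERRIDE="ChecksumOffloadBroken=true"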

References

1. https://lyyao09.github.io/2022/06/05/k8s/K8S%E9%97%AE%E9%A2%98%E6%8E%92%E6%9F%A5-VMWare%E8%99%9A%E6%8B%9F%E5%8C%96%E7%8E%AF%E5%A2%83%E4%B8%8BPod%E8%B7%A8VXLAN%E9%80%9A%E4%BF%A1%E5%BC%82%E5%B8%B8/

2. https://lyyao09.github.io/2023/04/05/k8s/K8S%E9%97%AE%E9%A2%98%E6%8E%92%E6%9F%A5-VMWare%E8%99%9A%E6%8B%9F%E5%8C%96%E7%8E%AF%E5%A2%83%E4%B8%8BPod%E8%B7%A8VXLAN%E9%80%9A%E4%BF%A1%E5%BC%82%E5%B8%B8%EF%BC%88%E7%BB%AD%EF%BC%89/

3. https://github.com/kubernetes-sigs/kubespray/issues/8992

4. https://github.com/flannel-io/flannel/issues/1279

Problem 1: The profiler command on OpenJ9 outputs an empty allocation flame graph

Symptom

On OpenJDK, with the -d flag stopping the run automatically after 60s, the HTML report opens normally in a browser:

[root@test arthas]# /usr/bin/java -jar arthas-boot.jar 
[arthas@44794]$ profiler start -e alloc -d 60
Profiling started
profiler will silent stop after 60 seconds.
profiler output file will be: /arthas-output/20240326-171613.html

On OpenJDK-OpenJ9, the same -d 60 run yields an HTML report that opens blank:

[root@test arthas]# /usr/lib/openj9/bin/java -jar arthas-boot.jar 
[arthas@7857]$ profiler start -e alloc -d 60
Profiling started
profiler will silent stop after 60 seconds.
profiler output file will be: /arthas-output/20240326-163013.html

Workaround

profileropenjdk-openj9的支持还不够全面,可以通过不加-d参数,主动stop临时规避:

[root@test arthas]# /usr/lib/openj9/bin/java -jar arthas-boot.jar 
[arthas@7857]$ profiler start -e alloc
Profiling started

[arthas@7857]$ profiler stop
OK
profiler output file : /arthas-output/20240326-163522.html

Used this way, the allocation flame graph is output normally.
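
Before stopping, profiler status can confirm that sampling is still active (part of the arthas profiler command set):

[arthas@7857]$ profiler status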

Following the official KubeSphere documentation [1], this records several problems hit while setting up an air-gapped deployment environment.

Problem 1: Building the offline installation package on the internet-connected host fails

Some image pulls time out during packaging, likely a network issue; retrying several times gets through.

[root@node kubesphere]# ./kk artifact export -m manifest-sample.yaml -o kubesphere.tar.gz

Problem 2: Failure in the Harbor installation phase

The Harbor installation phase fails with an unable to sign certificate: must specify a CommonName error:

[root@node1 kubesphere]# ./kk init registry -f config-sample.yaml -a kubesphere.tar.gz
19:37:46 CST [GreetingsModule] Greetings
19:37:47 CST message: [master]
Greetings, KubeKey!
19:37:47 CST success: [master]
19:37:47 CST [UnArchiveArtifactModule] Check the KubeKey artifact md5 value
19:37:47 CST success: [LocalHost]
...
19:48:16 CST success: [master]
19:48:16 CST [ConfigureOSModule] configure the ntp server for each node
19:48:17 CST skipped: [master]
19:48:17 CST [InitRegistryModule] Fetch registry certs
19:48:18 CST success: [master]
19:48:18 CST [InitRegistryModule] Generate registry Certs
[certs] Using existing ca certificate authority
19:48:18 CST message: [LocalHost]
unable to sign certificate: must specify a CommonName
19:48:18 CST failed: [LocalHost]
error: Pipeline[InitRegistryPipeline] execute failed: Module[InitRegistryModule] exec failed:
failed: [LocalHost] [GenerateRegistryCerts] exec failed after 1 retries: unable to sign certificate: must specify a CommonName

Per reference [2], modify the registry-related configuration:

registry:
  type: harbor
  auths:
    "dockerhub.kubekey.local":
      username: admin
      password: Harbor12345
      certsPath: "/etc/docker/certs.d/dockerhub.kubekey.local"
  privateRegistry: "dockerhub.kubekey.local"
  namespaceOverride: "kubesphereio"
  registryMirrors: []
  insecureRegistries: []

Problem 3: Failure in the cluster-creation phase

Downloading the Kubernetes binaries fails:

[root@node1 kubesphere]# ./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz
23:29:32 CST [NodeBinariesModule] Download installation binaries
23:29:32 CST message: [localhost]
downloading amd64 kubeadm v1.22.12 ...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (6) Could not resolve host: storage.googleapis.com; Unknown error
23:29:32 CST [WARN] Having a problem with accessing https://storage.googleapis.com? You can try again after setting environment 'export KKZONE=cn'
23:29:32 CST message: [LocalHost]
Failed to download kubeadm binary: curl -L -o /home/k8s/kubesphere/kubekey/kube/v1.22.12/amd64/kubeadm https://storage.googleapis.com/kubernetes-release/release/v1.22.12/bin/linux/amd64/kubeadm error: exit status 6
23:29:32 CST failed: [LocalHost]
error: Pipeline[CreateClusterPipeline] execute failed: Module[NodeBinariesModule] exec failed:
failed: [LocalHost] [DownloadBinaries] exec failed after 1 retries: Failed to download kubeadm binary: curl -L -o /home/k8s/kubesphere/kubekey/kube/v1.22.12/amd64/kubeadm https://storage.googleapis.com/kubernetes-release/release/v1.22.12/bin/linux/amd64/kubeadm error: exit status 6

This error occurs because config-sample.yaml was not generated by the kk command, so its Kubernetes version is wrong. The command's help shows that the default KubeSphere version is v3.4.1:

[root@node1 kubesphere]# ./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz -h
Create a Kubernetes or KubeSphere cluster

Usage:
kk create cluster [flags]

Flags:
-a, --artifact string Path to a KubeKey artifact
--container-manager string Container runtime: docker, crio, containerd and isula. (default "docker")
--debug Print detailed information
--download-cmd string The user defined command to download the necessary binary files. The first param '%s' is output path, the second param '%s', is the URL (default "curl -L -o %s %s")
-f, --filename string Path to a configuration file
-h, --help help for cluster
--ignore-err Ignore the error message, remove the host which reported error and force to continue
--namespace string KubeKey namespace to use (default "kubekey-system")
--skip-pull-images Skip pre pull images
--skip-push-images Skip pre push images
--with-kubernetes string Specify a supported version of kubernetes
--with-kubesphere Deploy a specific version of kubesphere (default v3.4.1)
--with-local-storage Deploy a local PV provisioner
--with-packages install operation system packages by artifact
--with-security-enhancement Security enhancement
-y, --yes

Adjust the command to pin the KubeSphere version:

[root@node1 kubesphere]# ./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz --with-kubesphere 3.4.0
W1205 00:36:57.266052 1453 utils.go:69] The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.96.0.10]; the provided value is: [169.254.25.10]
[init] Using Kubernetes version: v1.23.15
[preflight] Running pre-flight checks
[WARNING FileExisting-socat]: socat not found in system path
[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 24.0.6. Latest validated version: 20.10
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR FileExisting-conntrack]: conntrack not found in system path
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
00:36:58 CST stdout: [master]
[preflight] Running pre-flight checks
W1205 00:36:58.323079 1534 removeetcdmember.go:80] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] No etcd config found. Assuming external etcd
[reset] Please, manually reset etcd to prevent further issues
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
W1205 00:36:58.327376 1534 cleanupnode.go:109] [reset] Failed to evaluate the "/var/lib/kubelet" directory. Skipping its unmount and cleanup: lstat /var/lib/kubelet: no such file or directory
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
[reset] Deleting contents of stateful directories: [/var/lib/dockershim /var/run/kubernetes /var/lib/cni]

The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d

The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.

If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.

The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
00:36:58 CST message: [master]
init kubernetes cluster failed: Failed to exec command: sudo -E /bin/bash -c "/usr/local/bin/kubeadm init --config=/etc/kubernetes/kubeadm-config.yaml --ignore-preflight-errors=FileExisting-crictl,ImagePull"
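
The retry then stops at the preflight errors shown above: conntrack is a hard failure and socat a warning, both just missing host packages. Installing them and re-running the kk command clears this step (a sketch; package names as on CentOS 7):

yum install -y conntrack-tools socat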

Problem 4: Some Pods stuck in the startup phase

kubesphere-system              ks-apiserver-86757d49bb-m9pp4          ContainerCreating
kubesphere-system ks-console-cbdb4558c-7z6lg Running
kubesphere-system ks-controller-manager-64b5dcb7d-9mrsw ContainerCreating
kubesphere-system ks-installer-ff66855c9-d8x4k Running

Per reference [3], progress can be watched with kubectl logs -n kubesphere-system $(kubectl get pod -n kubesphere-system -l app=ks-install -o jsonpath='{.items[0].metadata.name}') -f; the abnormal Pods above only come up after all components finish installing:

#####################################################
### Welcome to KubeSphere! ###
#####################################################

Console: http://10.10.10.30:30880
Account: admin
Password: P@88w0rd
NOTES:
1. After you log into the console, please check the
monitoring status of service components in
"Cluster Management". If any service is not
ready, please wait patiently until all components
are up and running.
2. Please change the default password after login.

#####################################################
https://kubesphere.io 2023-12-05 01:24:00
#####################################################
01:24:04 CST success: [master]
01:24:04 CST Pipeline[CreateClusterPipeline] execute successfully
Installation is complete.

Please check the result using the command:

kubectl logs -n kubesphere-system $(kubectl get pod -n kubesphere-system -l 'app in (ks-install, ks-installer)' -o jsonpath='{.items[0].metadata.name}') -f

Problem 5: metrics-server fails to start

The logs show that when the Harbor registry is installed on the master node, the ports conflict:

[root@master ~]# kubectl logs -f -n kube-system metrics-server-6d987cb45c-4swvd
panic: failed to create listener: failed to listen on 0.0.0.0:4443: listen tcp 0.0.0.0:4443: bind: address already in use

goroutine 1 [running]:
main.main()
/go/src/sigs.k8s.io/metrics-server/cmd/metrics-server/metrics-server.go:39 +0xfc
[root@master ~]# netstat -anp|grep 4443
tcp 0 0 0.0.0.0:4443 0.0.0.0:* LISTEN 22372/docker-proxy
tcp6 0 0 :::4443 :::* LISTEN 22378/docker-proxy

[root@master ~]# docker ps |grep harbor|grep 4443
1733e9580af5 goharbor/nginx-photon:v2.5.3 "nginx -g 'daemon of…" 4 hours ago Up 4 hours (healthy) 0.0.0.0:4443->4443/tcp, :::4443->4443/tcp, 0.0.0.0:80->8080/tcp, :::80->8080/tcp, 0.0.0.0:443->8443/tcp, :::443->8443/tcp nginx

Changing the port restored it.
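
The port change itself is a small edit to the metrics-server Deployment (a sketch; 4444 is just an arbitrary free port):

kubectl -n kube-system edit deployment metrics-server
# change the container arg --secure-port=4443 to --secure-port=4444,
# and update the matching containerPort in the same Pod spec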

Problem 6: Some Pods fail to pull images

kubesphere-logging-system      opensearch-cluster-data-0            init:ImagePullBackOff
kubesphere-logging-system opensearch-cluster-master-0 init:ImagePullBackOff
istio-system istio-cni-node-vlzt7 ImagePullBackOff
kubesphere-controls-system kubesphere-router-test-55b5fcc887-xlzsh ImagePullBackOff

Inspection shows the init containers fail because they use the busybox image, which the offline package had not pre-downloaded:

initContainers:
- args:
  - chown -R 1000:1000 /usr/share/opensearch/data
  command:
  - sh
  - -c
  image: busybox:latest
  imagePullPolicy: Always

The latter two image pull failures are likewise images the offline package did not include in advance:

Normal   BackOff    21s (x51 over 15m)  kubelet            Back-off pulling image "dockerhub.kubekey.local/kubesphereio/install-cni:1.14.6"

After manually downloading the images and importing them into the offline environment, the abnormal Pods recovered.
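
The manual import is the usual save/load/push sequence through the private registry (a sketch, using the busybox case and the dockerhub.kubekey.local registry from above):

# on an internet-connected host
docker pull busybox:latest
docker save busybox:latest -o busybox.tar

# on the air-gapped registry node, after copying busybox.tar over
docker load -i busybox.tar
docker tag busybox:latest dockerhub.kubekey.local/kubesphereio/busybox:latest
docker push dockerhub.kubekey.local/kubesphereio/busybox:latest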

References

1. https://kubesphere.io/zh/docs/v3.3/installing-on-linux/introduction/air-gapped-installation/

2. https://github.com/kubesphere/kubekey/issues/1762#issuecomment-1681625989

3. https://github.com/kubesphere/ks-installer/issues/907

Prerequisites

1. Make sure the Docker version is at least 19.03, and enable the buildx plugin by setting the DOCKER_CLI_EXPERIMENTAL environment variable. The commands below enable it for the current terminal and verify that it is on [1]:
[root@node1 root]# export DOCKER_CLI_EXPERIMENTAL=enabled

[root@node1 root]# docker buildx version
github.com/docker/buildx v0.3.1-tp-docker 6db68d029599c6710a32aa7adcba8e5a344795a7
2. Make sure the Linux kernel is upgraded above 4.8, otherwise the following failure occurs [2]:
[root@node1 root]# docker run --privileged --rm tonistiigi/binfmt --install all
Unable to find image 'tonistiigi/binfmt:latest' locally
latest: Pulling from tonistiigi/binfmt
2a625f6055a5: Pull complete
71d6c64c6702: Pull complete
Digest: sha256:8de6f2decb92e9001d094534bf8a92880c175bd5dfb4a9d8579f26f09821cfa2
Status: Downloaded newer image for tonistiigi/binfmt:latest
installing: arm64 cannot register "/usr/bin/qemu-aarch64" to /proc/sys/fs/binfmt_misc/register: write /proc/sys/fs/binfmt_misc/register: invalid argument
installing: s390x cannot register "/usr/bin/qemu-s390x" to /proc/sys/fs/binfmt_misc/register: write /proc/sys/fs/binfmt_misc/register: invalid argument
installing: riscv64 cannot register "/usr/bin/qemu-riscv64" to /proc/sys/fs/binfmt_misc/register: write /proc/sys/fs/binfmt_misc/register: invalid argument
installing: mips64le cannot register "/usr/bin/qemu-mips64el" to /proc/sys/fs/binfmt_misc/register: write /proc/sys/fs/binfmt_misc/register: invalid argument
installing: mips64 cannot register "/usr/bin/qemu-mips64" to /proc/sys/fs/binfmt_misc/register: write /proc/sys/fs/binfmt_misc/register: invalid argument
installing: arm cannot register "/usr/bin/qemu-arm" to /proc/sys/fs/binfmt_misc/register: write /proc/sys/fs/binfmt_misc/register: invalid argument
installing: ppc64le cannot register "/usr/bin/qemu-ppc64le" to /proc/sys/fs/binfmt_misc/register: write /proc/sys/fs/binfmt_misc/register: invalid argument
{
"supported": [
"linux/amd64",
"linux/386"
],
"emulators": null
}

Environment Preparation

1. Upgrade the kernel. Taking 4.9 as the example, the rpm packages are available at link [3]:
[root@node1 4.9]# ll
total 13400
-rw-r--r-- 1 root root 1114112 Dec 12 20:22 kernel-4.9.241-37.el7.x86_64.rpm
-rw-r--r-- 1 root root 11686072 Dec 12 20:22 kernel-devel-4.9.241-37.el7.x86_64.rpm

[root@node1 4.9]# rpm -ivh kernel-*
warning: kernel-4.9.241-37.el7.x86_64.rpm: Header V4 RSA/SHA1 Signature, key ID 61e8806c: NOKEY
Preparing... ################################# [100%]
Updating / installing...
1:kernel-devel-4.9.241-37.el7 ################################# [ 50%]
2:kernel-4.9.241-37.el7 ################################# [100%]

[root@node1 4.9]# reboot

[root@node1 4.9]# uname -a
Linux node1 4.9.241-37.el7.x86_64 #1 SMP Mon Nov 2 13:55:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
2. Enable binfmt_misc and check the result:
[root@node1 ~]# docker run --privileged --rm tonistiigi/binfmt --install all

installing: arm OK
installing: s390x OK
installing: ppc64le OK
installing: arm64 OK
installing: riscv64 OK
installing: mips64le OK
installing: mips64 OK
{
"supported": [
"linux/amd64",
"linux/arm64",
"linux/riscv64",
"linux/ppc64le",
"linux/s390x",
"linux/386",
"linux/mips64le",
"linux/mips64",
"linux/arm/v7",
"linux/arm/v6"
],
"emulators": [
"qemu-aarch64",
"qemu-arm",
"qemu-mips64",
"qemu-mips64el",
"qemu-ppc64le",
"qemu-riscv64",
"qemu-s390x"
]
}

[root@node1 ~]# ls -al /proc/sys/fs/binfmt_misc/
total 0
drwxr-xr-x 2 root root 0 Dec 13 16:29 .
dr-xr-xr-x 1 root root 0 Dec 13 16:27 ..
-rw-r--r-- 1 root root 0 Dec 13 16:29 qemu-aarch64
-rw-r--r-- 1 root root 0 Dec 13 16:29 qemu-arm
-rw-r--r-- 1 root root 0 Dec 13 16:29 qemu-mips64
-rw-r--r-- 1 root root 0 Dec 13 16:29 qemu-mips64el
-rw-r--r-- 1 root root 0 Dec 13 16:29 qemu-ppc64le
-rw-r--r-- 1 root root 0 Dec 13 16:29 qemu-riscv64
-rw-r--r-- 1 root root 0 Dec 13 16:29 qemu-s390x
--w------- 1 root root 0 Dec 13 16:29 register
-rw-r--r-- 1 root root 0 Dec 13 16:29 status

Build and Verify

Create a new builder and bootstrap it:

[root@node1 ~]# docker buildx create --use --name mybuilder
mybuilder

[root@node1 ~]# docker buildx inspect mybuilder --bootstrap
[+] Building 105.8s (1/1) FINISHED
=> [internal] booting buildkit 105.8s
=> => pulling image moby/buildkit:buildx-stable-1 105.3s
=> => creating container buildx_buildkit_mybuilder0 0.6s
Name: mybuilder
Driver: docker-container
Last Activity: 2023-12-13 08:35:03 +0000 UTC

Nodes:
Name: mybuilder0
Endpoint: unix:///var/run/docker.sock
Status: running
Buildkit: v0.9.3
Platforms: linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6

As an example, build the xxx image for arm64 and keep the built image locally by specifying type as docker:

[root@node1 images]# docker buildx build -t xxx/xxx --platform=linux/arm64 -o type=docker .
[+] Building 5.5s (6/6) FINISHED docker-container:mybuilder
=> [internal] load build definition from Dockerfile 0.1s
=> => transferring dockerfile: 219B 0.0s
=> [internal] load .dockerignore 0.1s
=> => transferring context: 2B 0.0s
=> [internal] load build context 0.9s
=> => transferring context: 68.42MB 0.8s
=> [1/1] COPY ./xxx /bin/xxx 0.1s
=> exporting to oci image format 4.3s
=> => exporting layers 3.0s
=> => exporting manifest sha256:33877987488ccd8fb6803f06f6b90b5ff667dd172db23b339e96acee31af354f 0.0s
=> => exporting config sha256:f16ad6c6fc37b1cad030e7880c094f75f2cb6959ebbc3712808f25e04b96a395 0.0s
=> => sending tarball 1.3s
=> importing to docker

Check the image:

[root@node1 images]# docker images|grep xxx
xxx/xxx latest f16ad6c6fc37 2 minutes ago 68.4MB
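
A quick check that the local copy really carries the target architecture rather than the host's:

docker inspect xxx/xxx --format '{{.Os}}/{{.Architecture}}'   # expect linux/arm64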

References

1. https://cloud.tencent.com/developer/article/1543689

2. https://www.cnblogs.com/frankming/p/16870285.html

3. http://ftp.usf.edu/pub/centos/7/virt/x86_64/xen-414/Packages/k/