
K8S Troubleshooting: K8S Cluster Deployment Fails on Servers with Hygon CPUs

Background

While deploying a K8S cluster on a server with the Chinese-made Hygon C86 7265 CPU, calico-node failed to start. The relevant output is shown below:

[root@node1 ~]# kubectl get pod -A -owide
NAMESPACE     NAME                                       READY   STATUS             RESTARTS   AGE     IP             NODE
kube-system   calico-kube-controllers-7c7986989c-bwvw4   0/1     Pending            0          5m11s   <none>         <none>
kube-system   calico-node-v64fv                          0/1     CrashLoopBackOff   5          5m11s   10.10.26.120   node1
kube-system   coredns-6db7677797-jkhpd                   0/1     Pending            0          5m11s   <none>         <none>
kube-system   coredns-6db7677797-r58c5                   0/1     Pending            0          5m11s   <none>         <none>
kube-system   kube-apiserver-node1                       1/1     Running            6          5m23s   10.10.26.120   node1
kube-system   kube-controller-manager-node1              1/1     Running            8          5m28s   10.10.26.120   node1
kube-system   kube-proxy-ncw4g                           1/1     Running            0          5m11s   10.10.26.120   node1
kube-system   kube-scheduler-node1                       1/1     Running            6          5m29s   10.10.26.120   node1

Root Cause Analysis

Check the detailed error log:

[root@node1 ~]# kubectl logs -n kube-system calico-node-v64fv
2024-04-03 14:29:25.424 [INFO][9] startup/startup.go 427: Early log level set to info
2024-04-03 14:29:25.425 [INFO][9] startup/utils.go 131: Using HOSTNAME environment (lowercase) for node name node1
2024-04-03 14:29:25.425 [INFO][9] startup/utils.go 139: Determined node name: node1
2024-04-03 14:29:25.428 [INFO][9] startup/startup.go 106: Skipping datastore connection test
CRNGT failed.
SIGABRT: abort
PC=0x7efbf7409a9f m=13 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7efbf7409a9f
stack: frame={sp:0x7efbaa7fb780, fp:0x0} stack=[0x7efba9ffc250,0x7efbaa7fbe50)fffff 0x00007efbf7de02cc
0x00007efbaa7fb6c0: 0x00007efbf73c2340 0x00007efbf7ffbed0
0x00007efbaa7fb6d0: 0x00007efbf73c8cd0 0x00007efbf7ffbed0
0x00007efbaa7fb6e0: 0x0000000000000001 0x00007efbf7de06be
0x00007efbaa7fb6f0: 0x000000000000015f 0x00007efbf7783360
0x00007efbaa7fb700: 0x00007efbf7ffb9e0 0x0000000004904060
0x00007efbaa7fb710: 0x00007efbaa7fbdf0 0x0000000000000000
0x00007efbaa7fb720: 0x0000000000000020 0x00007efb94000dd0
0x00007efbaa7fb730: 0x00007efb94000dd0 0x00007efbf7de5574
0x00007efbaa7fb740: 0x0000000000000005 0x0000000000000000
0x00007efbaa7fb750: 0x0000000000000005 0x00007efbf73c2340
0x00007efbaa7fb760: 0x00007efbaa7fb9b0 0x00007efbf7dd2ae7
0x00007efbaa7fb770: 0x0000000000000001 0x00007efbf74db5df
0x00007efbaa7fb780: <0x0000000000000000 0x00007efbf777c850
0x00007efbaa7fb790: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7a0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7b0: 0x000000000000037f 0x0000000000000000
0x00007efbaa7fb7c0: 0x0000000000000000 0x0002ffff00001fa0
0x00007efbaa7fb7d0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7e0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7f0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb800: 0xfffffffe7fffffff 0xffffffffffffffff
0x00007efbaa7fb810: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb820: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb830: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb840: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb850: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb860: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb870: 0xffffffffffffffff 0xffffffffffffffff
runtime: unknown pc 0x7efbf7409a9f
stack: frame={sp:0x7efbaa7fb780, fp:0x0} stack=[0x7efba9ffc250,0x7efbaa7fbe50)
0x00007efbaa7fb680: 0x00007efbaa7fb6c0 0x00007efbf8000558
0x00007efbaa7fb690: 0x0000000000000000 0x00007efbf8000558
0x00007efbaa7fb6a0: 0x0000000000000001 0x0000000000000000
0x00007efbaa7fb6b0: 0x00000000ffffffff 0x00007efbf7de02cc
0x00007efbaa7fb6c0: 0x00007efbf73c2340 0x00007efbf7ffbed0
0x00007efbaa7fb6d0: 0x00007efbf73c8cd0 0x00007efbf7ffbed0
0x00007efbaa7fb6e0: 0x0000000000000001 0x00007efbf7de06be
0x00007efbaa7fb6f0: 0x000000000000015f 0x00007efbf7783360
0x00007efbaa7fb700: 0x00007efbf7ffb9e0 0x0000000004904060
0x00007efbaa7fb710: 0x00007efbaa7fbdf0 0x0000000000000000
0x00007efbaa7fb720: 0x0000000000000020 0x00007efb94000dd0
0x00007efbaa7fb730: 0x00007efb94000dd0 0x00007efbf7de5574
0x00007efbaa7fb740: 0x0000000000000005 0x0000000000000000
0x00007efbaa7fb750: 0x0000000000000005 0x00007efbf73c2340
0x00007efbaa7fb760: 0x00007efbaa7fb9b0 0x00007efbf7dd2ae7
0x00007efbaa7fb770: 0x0000000000000001 0x00007efbf74db5df
0x00007efbaa7fb780: <0x0000000000000000 0x00007efbf777c850
0x00007efbaa7fb790: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7a0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7b0: 0x000000000000037f 0x0000000000000000
0x00007efbaa7fb7c0: 0x0000000000000000 0x0002ffff00001fa0
0x00007efbaa7fb7d0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7e0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb7f0: 0x0000000000000000 0x0000000000000000
0x00007efbaa7fb800: 0xfffffffe7fffffff 0xffffffffffffffff
0x00007efbaa7fb810: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb820: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb830: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb840: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb850: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb860: 0xffffffffffffffff 0xffffffffffffffff
0x00007efbaa7fb870: 0xffffffffffffffff 0xffffffffffffffff

goroutine 131 [syscall]:
runtime.cgocall(0x2629971, 0xc00067da20)
/usr/local/go-cgo/src/runtime/cgocall.go:156 +0x5c fp=0xc00067d9f8 sp=0xc00067d9c0 pc=0x41081c
crypto/internal/boring._Cfunc__goboringcrypto_RAND_bytes(0xc0006ba680, 0x20)
_cgo_gotypes.go:1140 +0x4c fp=0xc00067da20 sp=0xc00067d9f8 pc=0x66a0ac
crypto/internal/boring.randReader.Read(0x0, {0xc0006ba680, 0x20, 0x20})
/usr/local/go-cgo/src/crypto/internal/boring/rand.go:21 +0x31 fp=0xc00067da48 sp=0xc00067da20 pc=0x66e691
crypto/internal/boring.(*randReader).Read(0x3333408, {0xc0006ba680, 0xc00067dab0, 0x45acb2})
<autogenerated>:1 +0x34 fp=0xc00067da78 sp=0xc00067da48 pc=0x6754f4
io.ReadAtLeast({0x336aba0, 0x3333408}, {0xc0006ba680, 0x20, 0x20}, 0x20)
/usr/local/go-cgo/src/io/io.go:328 +0x9a fp=0xc00067dac0 sp=0xc00067da78 pc=0x4b6ffa
io.ReadFull(...)
/usr/local/go-cgo/src/io/io.go:347
crypto/tls.(*Conn).makeClientHello(0xc000410700)
/usr/local/go-cgo/src/crypto/tls/handshake_client.go:107 +0x6a5 fp=0xc00067dbe8 sp=0xc00067dac0 pc=0x728f25
crypto/tls.(*Conn).clientHandshake(0xc000410700, {0x33dd910, 0xc000a91880})
/usr/local/go-cgo/src/crypto/tls/handshake_client.go:157 +0x96 fp=0xc00067de78 sp=0xc00067dbe8 pc=0x7295f6
crypto/tls.(*Conn).clientHandshake-fm({0x33dd910, 0xc000a91880})
/usr/local/go-cgo/src/crypto/tls/handshake_client.go:148 +0x39 fp=0xc00067dea0 sp=0xc00067de78 pc=0x759899
crypto/tls.(*Conn).handshakeContext(0xc000410700, {0x33dd980, 0xc00057a240})
/usr/local/go-cgo/src/crypto/tls/conn.go:1445 +0x3d1 fp=0xc00067df70 sp=0xc00067dea0 pc=0x727bf1
crypto/tls.(*Conn).HandshakeContext(...)
/usr/local/go-cgo/src/crypto/tls/conn.go:1395
net/http.(*persistConn).addTLS.func2()
/usr/local/go-cgo/src/net/http/transport.go:1534 +0x71 fp=0xc00067dfe0 sp=0xc00067df70 pc=0x7d4dd1
runtime.goexit()
/usr/local/go-cgo/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc00067dfe8 sp=0xc00067dfe0 pc=0x4759c1
created by net/http.(*persistConn).addTLS
/usr/local/go-cgo/src/net/http/transport.go:1530 +0x345

goroutine 1 [select]:
net/http.(*Transport).getConn(0xc00033e140, 0xc0006b6f80, {{}, 0x0, {0xc000c6a0a0, 0x5}, {0xc00046c680, 0xd}, 0x0})
/usr/local/go-cgo/src/net/http/transport.go:1372 +0x5d2
net/http.(*Transport).roundTrip(0xc00033e140, 0xc000650e00)
/usr/local/go-cgo/src/net/http/transport.go:581 +0x774
net/http.(*Transport).RoundTrip(0x2cc6880, 0xc000a6b1d0)
/usr/local/go-cgo/src/net/http/roundtrip.go:18 +0x19
k8s.io/client-go/transport.(*bearerAuthRoundTripper).RoundTrip(0xc000a98360, 0xc000650a00)
/go/pkg/mod/k8s.io/client-go@v0.23.3/transport/round_trippers.go:317 +0x242
net/http.send(0xc000650900, {0x336a040, 0xc000a98360}, {0x2e2a640, 0x4d0701, 0x4a61720})
/usr/local/go-cgo/src/net/http/client.go:252 +0x5d8
net/http.(*Client).send(0xc000a983f0, 0xc000650900, {0x2ec14f9, 0xe, 0x4a61720})
/usr/local/go-cgo/src/net/http/client.go:176 +0x9b
net/http.(*Client).do(0xc000a983f0, 0xc000650900)
/usr/local/go-cgo/src/net/http/client.go:725 +0x908
net/http.(*Client).Do(...)
/usr/local/go-cgo/src/net/http/client.go:593
k8s.io/client-go/rest.(*Request).request(0xc000650700, {0x33dd980, 0xc00057a240}, 0x4a95f98)
/go/pkg/mod/k8s.io/client-go@v0.23.3/rest/request.go:980 +0x439
k8s.io/client-go/rest.(*Request).Do(0x20, {0x33dd948, 0xc000056060})
/go/pkg/mod/k8s.io/client-go@v0.23.3/rest/request.go:1038 +0xcc
k8s.io/client-go/kubernetes/typed/core/v1.(*configMaps).Get(0xc00004fa40, {0x33dd948, 0xc000056060}, {0x2ec14f9, 0xe}, {{{0x0, 0x0}, {0x0, 0x0}}, {0x0, ...}})
/go/pkg/mod/k8s.io/client-go@v0.23.3/kubernetes/typed/core/v1/configmap.go:78 +0x15a
github.com/projectcalico/calico/node/pkg/lifecycle/startup.Run()
/go/src/github.com/projectcalico/calico/node/pkg/lifecycle/startup/startup.go:148 +0x422
main.main()
/go/src/github.com/projectcalico/calico/node/cmd/calico-node/main.go:142 +0x732

goroutine 8 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0xc000202900)
/go/pkg/mod/k8s.io/klog/v2@v2.40.1/klog.go:1283 +0x6a
created by k8s.io/klog/v2.init.0
/go/pkg/mod/k8s.io/klog/v2@v2.40.1/klog.go:420 +0xfb

goroutine 29 [select]:
net/http.setRequestCancel.func4()
/usr/local/go-cgo/src/net/http/client.go:398 +0x94
created by net/http.setRequestCancel
/usr/local/go-cgo/src/net/http/client.go:397 +0x43e

goroutine 30 [chan receive]:
net/http.(*persistConn).addTLS(0xc0005617a0, {0x33dd980, 0xc00057a240}, {0xc00046c680, 0x9}, 0x0)
/usr/local/go-cgo/src/net/http/transport.go:1540 +0x365
net/http.(*Transport).dialConn(0xc00033e140, {0x33dd980, 0xc00057a240}, {{}, 0x0, {0xc000c6a0a0, 0x5}, {0xc00046c680, 0xd}, 0x0})
/usr/local/go-cgo/src/net/http/transport.go:1614 +0xab7
net/http.(*Transport).dialConnFor(0x0, 0xc00077d6b0)
/usr/local/go-cgo/src/net/http/transport.go:1446 +0xb0
created by net/http.(*Transport).queueForDial
/usr/local/go-cgo/src/net/http/transport.go:1415 +0x3d7

goroutine 183 [select]:
google.golang.org/grpc.(*ccBalancerWrapper).watcher(0xc0001ac0a0)
/go/pkg/mod/google.golang.org/grpc@v1.40.0/balancer_conn_wrappers.go:71 +0xa5
created by google.golang.org/grpc.newCCBalancerWrapper
/go/pkg/mod/google.golang.org/grpc@v1.40.0/balancer_conn_wrappers.go:62 +0x246

goroutine 184 [chan receive]:
google.golang.org/grpc.(*addrConn).resetTransport(0xc000660000)
/go/pkg/mod/google.golang.org/grpc@v1.40.0/clientconn.go:1219 +0x48f
created by google.golang.org/grpc.(*addrConn).connect
/go/pkg/mod/google.golang.org/grpc@v1.40.0/clientconn.go:849 +0x147

goroutine 194 [select]:
google.golang.org/grpc/internal/transport.(*http2Client).keepalive(0xc00000c5a0)
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/http2_client.go:1569 +0x169
created by google.golang.org/grpc/internal/transport.newHTTP2Client
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/http2_client.go:350 +0x18a5

goroutine 195 [IO wait]:
internal/poll.runtime_pollWait(0x7efbf7e43728, 0x72)
/usr/local/go-cgo/src/runtime/netpoll.go:303 +0x85
internal/poll.(*pollDesc).wait(0xc000a92980, 0xc0005b4000, 0x0)
/usr/local/go-cgo/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go-cgo/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000a92980, {0xc0005b4000, 0x8000, 0x8000})
/usr/local/go-cgo/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc000a92980, {0xc0005b4000, 0x1060100000000, 0x8})
/usr/local/go-cgo/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc00060e078, {0xc0005b4000, 0x9c8430, 0xc00033c500})
/usr/local/go-cgo/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc000197380, {0xc0003c04a0, 0x9, 0x18})
/usr/local/go-cgo/src/bufio/bufio.go:227 +0x1b4
io.ReadAtLeast({0x3365ca0, 0xc000197380}, {0xc0003c04a0, 0x9, 0x9}, 0x9)
/usr/local/go-cgo/src/io/io.go:328 +0x9a
io.ReadFull(...)
/usr/local/go-cgo/src/io/io.go:347
golang.org/x/net/http2.readFrameHeader({0xc0003c04a0, 0x9, 0x3f69d15}, {0x3365ca0, 0xc000197380})
/go/pkg/mod/golang.org/x/net@v0.0.0-20220520000938-2e3eb7b945c2/http2/frame.go:237 +0x6e
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0003c0460)
/go/pkg/mod/golang.org/x/net@v0.0.0-20220520000938-2e3eb7b945c2/http2/frame.go:498 +0x95
google.golang.org/grpc/internal/transport.(*http2Client).reader(0xc00000c5a0)
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/http2_client.go:1495 +0x41f
created by google.golang.org/grpc/internal/transport.newHTTP2Client
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/http2_client.go:355 +0x18ef

goroutine 196 [select]:
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc000204230, 0x1)
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/controlbuf.go:406 +0x11b
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc000197440)
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/controlbuf.go:533 +0x85
google.golang.org/grpc/internal/transport.newHTTP2Client.func3()
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/http2_client.go:405 +0x65
created by google.golang.org/grpc/internal/transport.newHTTP2Client
/go/pkg/mod/google.golang.org/grpc@v1.40.0/internal/transport/http2_client.go:403 +0x1f45

goroutine 132 [select]:
crypto/tls.(*Conn).handshakeContext.func2()
/usr/local/go-cgo/src/crypto/tls/conn.go:1421 +0x9e
created by crypto/tls.(*Conn).handshakeContext
/usr/local/go-cgo/src/crypto/tls/conn.go:1420 +0x1bd

rax 0x0
rbx 0x6
rcx 0xffffffffffffffff
rdx 0x0
rdi 0x2
rsi 0x7efbaa7fb780
rbp 0x7efbaa7fbdf0
rsp 0x7efbaa7fb780
r8 0x0
r9 0x7efbaa7fb780
r10 0x8
r11 0x246
r12 0x0
r13 0x20
r14 0x7efb94000dd0
r15 0x7efb94000dd0
rip 0x7efbf7409a9f
rflags 0x246
cs 0x33
fs 0x0
gs 0x0
Calico node failed to start

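After checking with the team, we confirmed that the release previously installed in this environment ran without problems and that the failure appears only with the latest release. Because the latest release included a calico upgrade, we first suspected the calico version. Upgrading calico to its newest version did not help: the problem persisted.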

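Searching for the error message CRNGT failed from the log led to an issue [1] in which someone hit the same error. The cause there was a bug in AMD Ryzen 9 3000 series CPUs: with certain BIOS versions, the RDRAND instruction fails to generate random numbers correctly. The fix was to upgrade the BIOS.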

Following the check described in [1], create a main.go file:

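package main

import (
	"crypto/rand"
	"fmt"
)

func main() {
	// Request 10 random bytes from crypto/rand.
	a := make([]byte, 10)
	_, err := rand.Read(a)
	if err != nil {
		panic(err)
	}
	// Print the raw bytes; the output is usually not printable text.
	fmt.Println(string(a))
}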

Running the following command on the affected server did not reproduce the problem:

$ GOEXPERIMENT=boringcrypto go run main.go

Further searching turned up [2], which also concerns AMD Ryzen 9 3000 series CPUs and gives another way to check:

you@ubuntu-live:~$ wget https://cdn.arstechnica.net/wp-content/uploads/2019/10/rdrand-test.zip
you@ubuntu-live:~$ unzip rdrand-test.zip
you@ubuntu-live:~$ cd rdrand-test
you@ubuntu-live:~$ ./amd-rdrand.bug

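However, that download link is no longer valid. Searching further turned up another test tool [3], which also provides prebuilt binaries. Running it on the faulty server gives the following results: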

[root@node1 ~]# ./RDRAND_Tester_Linux_x86_64  # first run: the random numbers differ
RDRAND Tester v20210328 x86_64
Compiled on Apr 16 2021
Compiled with GNU Compiler Collection (GCC) 10.3.0
Running on Hygon C86 7265 24-core Processor
This CPU supports the following instructions:
RDRAND: Supported
RDSEED: Supported

Testing RDRAND...
try: 1 success: 1 random number: 17705883718297935842 (0xf5b7ef6e97855fe2)
try: 2 success: 1 random number: 6443855104021096318 (0x596d2c137b43e77e)
try: 3 success: 1 random number: 10126471306861746785 (0x8c88740051ae1a61)
try: 4 success: 1 random number: 13463061200056996464 (0xbad666d4c2bdd270)
try: 5 success: 1 random number: 7695825692332247646 (0x6acd10b164c9ca5e)
try: 6 success: 1 random number: 1263849930341660097 (0x118a18d0c36ab5c1)
try: 7 success: 1 random number: 2580393233033016710 (0x23cf65f953c13586)
try: 8 success: 1 random number: 1842118076754864861 (0x199084a17f4caadd)
try: 9 success: 1 random number: 2896900625228522073 (0x2833dbc52c5a6259)
try: 10 success: 1 random number: 3899901262805814503 (0x361f3b8934a34ce7)
try: 11 success: 1 random number: 3597359862242937122 (0x31ec63bc2e3d0922)
try: 12 success: 1 random number: 12246743104637488545 (0xa9f52bf7b761cda1)
try: 13 success: 1 random number: 16491679937497687446 (0xe4de3786c7fc6596)
try: 14 success: 1 random number: 7270227793600200162 (0x64e509a8b1b63de2)
try: 15 success: 1 random number: 15697857806096052438 (0xd9d9fe80faf2b0d6)
try: 16 success: 1 random number: 2546933488048450266 (0x235886835dacaada)
try: 17 success: 1 random number: 6670897529050922874 (0x5c93c9f5701c7f7a)
try: 18 success: 1 random number: 14670415794664541721 (0xcb97c97024428e19)
try: 19 success: 1 random number: 2452728878003037248 (0x2209d7eb5fb6a440)
try: 20 success: 1 random number: 16252906931536406850 (0xe18decbe1db62942)

The RDRAND instruction of this CPU appears to be working.
The numbers generated should be different and random.
If the numbers generated appears to be similar, the RDRAND instruction is
broken.

[root@node1 ~]# ./RDRAND_Tester_Linux_x86_64  # later runs: every random number is identical
RDRAND Tester v20210328 x86_64
Compiled on Apr 16 2021
Compiled with GNU Compiler Collection (GCC) 10.3.0
Running on Hygon C86 7265 24-core Processor
This CPU supports the following instructions:
RDRAND: Supported
RDSEED: Supported

Testing RDRAND...
try: 1 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 2 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 3 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 4 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 5 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 6 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 7 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 8 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 9 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 10 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 11 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 12 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 13 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 14 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 15 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 16 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 17 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 18 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 19 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)
try: 20 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)

The RDRAND instruction of this CPU appears to be broken!
The numbers generated are NOT random but the CPU returns the success flag.

On a healthy server the output is as follows:

[root@node1 ~]# ./RDRAND_Tester_Linux_x86_64  # tested repeatedly: no identical random numbers
RDRAND Tester v20210328 x86_64
Compiled on Apr 16 2021
Compiled with GNU Compiler Collection (GCC) 10.3.0
Running on Hygon C86 7265 24-core Processor
This CPU supports the following instructions:
RDRAND: Supported
RDSEED: Supported

Testing RDRAND...
try: 1 success: 1 random number: 17914541561690204462 (0xf89d3ca29284292e)
try: 2 success: 1 random number: 14332812162628513309 (0xc6e860b931deee1d)
try: 3 success: 1 random number: 11906898495071391800 (0xa53dcd18875d1038)
try: 4 success: 1 random number: 5465211412374691004 (0x4bd854d2d9011cbc)
try: 5 success: 1 random number: 13927489571584093018 (0xc14861f96f3a2b5a)
try: 6 success: 1 random number: 70328156090550554 (0x00f9db15d97c491a)
try: 7 success: 1 random number: 9065062530023621999 (0x7dcd9257a0c3056f)
try: 8 success: 1 random number: 283806862943046502 (0x03f048d69289cb66)
try: 9 success: 1 random number: 7602503365830811759 (0x698184880c0ea06f)
try: 10 success: 1 random number: 3090051278467342602 (0x2ae2114416c9a10a)
try: 11 success: 1 random number: 2685951337108651825 (0x25466a82a458bf31)
try: 12 success: 1 random number: 15486706753868706299 (0xd6ebd5bd94fcb1fb)
try: 13 success: 1 random number: 11789666617122680772 (0xa39d4f52ede0efc4)
try: 14 success: 1 random number: 1388997005975229823 (0x1346b56aef3c157f)
try: 15 success: 1 random number: 11566015841037137779 (0xa082be20c78f3773)
try: 16 success: 1 random number: 14397918040333260716 (0xc7cfae2c9b4097ac)
try: 17 success: 1 random number: 10383120616855762267 (0x901841305bb8f55b)
try: 18 success: 1 random number: 6694856356368217838 (0x5ce8e8629f97f6ee)
try: 19 success: 1 random number: 2307408338273596455 (0x20058fa892927427)
try: 20 success: 1 random number: 6317182892917504808 (0x57ab245f0985bb28)

The RDRAND instruction of this CPU appears to be working.
The numbers generated should be different and random.
If the numbers generated appears to be similar, the RDRAND instruction is
broken.
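When the tester has to be run many times or on many nodes, comparing the numbers by eye gets tedious. The sketch below is one way to automate the comparison, assuming the output has been captured as text in the `try: N success: 1 random number: X (0x...)` format shown above. The helper names `parseSamples` and `hasDuplicates` are hypothetical and are not part of the RDRAND Tester project.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// sampleRe matches lines such as
// "try: 1 success: 1 random number: 123 (0x7b)" and captures the decimal value.
var sampleRe = regexp.MustCompile(`random number:\s+(\d+)\s+\(0x[0-9a-fA-F]+\)`)

// parseSamples extracts the decimal random numbers from captured tester output.
func parseSamples(output string) ([]uint64, error) {
	var samples []uint64
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		m := sampleRe.FindStringSubmatch(sc.Text())
		if m == nil {
			continue // not a sample line
		}
		v, err := strconv.ParseUint(m[1], 10, 64)
		if err != nil {
			return nil, err
		}
		samples = append(samples, v)
	}
	return samples, sc.Err()
}

// hasDuplicates reports whether any value occurs more than once.
func hasDuplicates(samples []uint64) bool {
	seen := make(map[uint64]bool, len(samples))
	for _, v := range samples {
		if seen[v] {
			return true
		}
		seen[v] = true
	}
	return false
}

func main() {
	// Two lines copied from the faulty server's output.
	broken := "try: 1 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)\n" +
		"try: 2 success: 1 random number: 18446744073709551615 (0xffffffffffffffff)\n"
	samples, err := parseSamples(broken)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(samples), hasDuplicates(samples))
}
```

Any repeated 64-bit value within 20 samples is a strong hint that RDRAND is stuck, since a working generator would produce a collision only with negligible probability.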

Compare the BIOS information of the two servers:

Faulty server:
[root@node1 ~]# dmidecode -t bios
dmidecode 3.1
Getting SMBIOS data from sysfs.
SMBIOS 3.1 present.

Handle 0x0068, DMI type 0, 26 bytes
BIOS Information
Vendor: Byosoft
Version: 3.07.09P01
Release Date: 12/16/2020
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 0 MB
Characteristics:
PCI is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
ACPI is supported
USB legacy is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 17.0

Handle 0x0070, DMI type 13, 22 bytes
BIOS Language Information
Language Description Format: Long
Installable Languages: 2
en|US|iso8859-1
zh|CN|unicode
Currently Installed Language: zh|CN|unicode

Healthy server:
[root@node1 ~]# dmidecode -t bios
dmidecode 3.1
Getting SMBIOS data from sysfs.
SMBIOS 3.1 present.

Handle 0x0069, DMI type 0, 26 bytes
BIOS Information
Vendor: Byosoft
Version: 5.19
Release Date: 03/04/2022
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 16 MB
Characteristics:
ISA is supported
PCI is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
ACPI is supported
USB legacy is supported
ATAPI Zip drive boot is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
System is a virtual machine
BIOS Revision: 5.19

Handle 0x0070, DMI type 13, 22 bytes
BIOS Language Information
Language Description Format: Long
Installable Languages: 2
en|US|iso8859-1
zh|CN|unicode
Currently Installed Language: zh|CN|unicode

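The comparison shows that the faulty server runs BIOS Version: 3.07.09P01, while the healthy server runs Version: 5.19, so the difference in BIOS version is very likely the cause. After upgrading the BIOS and testing again, random numbers were generated correctly and calico-node started normally.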
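With more than a couple of servers, pulling the two relevant fields out of the `dmidecode -t bios` output by script makes the comparison easier. The following Go sketch assumes the output has already been captured as a string. `biosInfo` and `parseBIOS` are hypothetical names used only for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// biosInfo holds the two fields compared in the text.
type biosInfo struct {
	Version     string
	ReleaseDate string
}

// parseBIOS scans `dmidecode -t bios` output for the "Version:" and
// "Release Date:" lines. Other keys, such as "BIOS Revision", are ignored.
func parseBIOS(output string) biosInfo {
	var info biosInfo
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		key, value, ok := strings.Cut(strings.TrimSpace(sc.Text()), ":")
		if !ok {
			continue // line has no "key: value" shape
		}
		switch strings.TrimSpace(key) {
		case "Version":
			info.Version = strings.TrimSpace(value)
		case "Release Date":
			info.ReleaseDate = strings.TrimSpace(value)
		}
	}
	return info
}

func main() {
	// Excerpts taken from the two dmidecode outputs above.
	faulty := "BIOS Information\nVendor: Byosoft\nVersion: 3.07.09P01\nRelease Date: 12/16/2020\nBIOS Revision: 17.0\n"
	healthy := "BIOS Information\nVendor: Byosoft\nVersion: 5.19\nRelease Date: 03/04/2022\nBIOS Revision: 5.19\n"
	a, b := parseBIOS(faulty), parseBIOS(healthy)
	fmt.Printf("faulty:  %s (%s)\n", a.Version, a.ReleaseDate)
	fmt.Printf("healthy: %s (%s)\n", b.Version, b.ReleaseDate)
	fmt.Println("same BIOS version:", a.Version == b.Version)
}
```

The sketch only extracts and compares strings. Deciding which version is newer is left to the operator, because vendor version schemes such as 3.07.09P01 and 5.19 do not follow a common format.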

Solution

Upgrade the BIOS version.

References

1. https://github.com/projectcalico/calico/issues/7001
2. https://arstechnica.com/gadgets/2019/10/how-a-months-old-amd-microcode-bug-destroyed-my-weekend/
3. https://github.com/cjee21/RDRAND-Tester