
Docker Troubleshooting: Abnormal Power-Off Leaves Docker Stuck on Startup

Problem Description

After an abnormal power-off, the Docker daemon was stuck in the activating state, and its memory usage kept growing without bound.

[root@node01 docker]# service docker status
Redirecting to /bin/systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/docker.service.d
└─restart.conf
Active: activating (start) since Tue 2021-12-07 15:44:32 CST; 1min 18s ago
Docs: https://docs.docker.com
Main PID: 274797 (dockerd)
Tasks: 481
Memory: 15.4G -- abnormally high
CGroup: /system
[root@node01 docker]# service docker status
Redirecting to /bin/systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/docker.service.d
└─restart.conf
Active: deactivating (stop-sigterm)
Docs: https://docs.docker.com
Main PID: 274797 (dockerd)
Tasks: 481
Memory: 247.3G -- still climbing rapidly
CGroup: /system.slice/docker.service
└─274

Diagnosis

First, check the Docker version and the docker info output:

[root@node01 files]# docker version
Client: Docker Engine - Community
Version: 20.10.7
API version: 1.41
Go version: go1.13.15
Git commit: f0df350
Built: Wed Jun 2 11:58:10 2021
OS/Arch: linux/amd64
Context: default
Experimental: true

Server: Docker Engine - Community
Engine:
Version: 20.10.7
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: b0f5bc3
runc:
Version: 1.0.0-rc95
GitCommit: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
docker-init:
Version: 0.19.0
GitCommit: de40ad0
[root@node01 files]# docker info
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
scan: Docker Scan (Docker Inc., v0.8.0)

Server:
Server Version: 20.10.7
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: d71fcd7d8303cbf684402823e425e9dd2e99285d
runc version: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
init version: de40ad0
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-957.21.3.el7.x86_64
Operating System: CentOS 7.6
OSType: linux
Architecture: x86_64
Total Memory: 256GiB
Name: node01
Docker Root Dir: /var/lib/docker
Debug Mode: false
File Descriptors: 366
Goroutines: 290
EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: true

The version is fairly recent. Next, check the Docker logs:

Dec 07 14:57:57 node01 dockerd[239420]: time="2021-12-07T14:57:57.831370861+08:00" level=info msg="Starting up"
Dec 07 14:57:57 node01 dockerd[239420]: time="2021-12-07T14:57:57.950367668+08:00" level=info msg="parsed scheme: \"unix\"" module=grpc
Dec 07 14:57:57 node01 dockerd[239420]: time="2021-12-07T14:57:57.950430356+08:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Dec 07 14:57:57 node01 dockerd[239420]: time="2021-12-07T14:57:57.950486773+08:00" level=info msg="ccResolverWrapper: sending update to cc: {[{unix://run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Dec 07 14:57:57 node01 dockerd[239420]: time="2021-12-07T14:57:57.950524941+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Dec 07 14:57:58 node01 dockerd[239420]: time="2021-12-07T14:57:58.004622322+08:00" level=info msg="parsed scheme: \"unix\"" module=grpc
Dec 07 14:57:58 node01 dockerd[239420]: time="2021-12-07T14:57:58.004663918+08:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Dec 07 14:57:58 node01 dockerd[239420]: time="2021-12-07T14:57:58.004697382+08:00" level=info msg="ccResolverWrapper: sending update to cc: {[{unix://run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}" module=grpc
Dec 07 14:57:58 node01 dockerd[239420]: time="2021-12-07T14:57:58.004726533+08:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Dec 07 14:58:14 node01 dockerd[239420]: time="2021-12-07T14:58:14.832862519+08:00" level=info msg="[graphdriver] using prior storage driver: overlay2"
Dec 07 14:59:00 node01 dockerd[239420]: time="2021-12-07T14:59:00.039554832+08:00" level=info msg="Loading containers: start."
Dec 07 15:02:27 node01 dockerd[239420]: fatal error: runtime: out of memory
Dec 07 15:02:27 node01 dockerd[239420]: runtime stack:
Dec 07 15:02:27 node01 dockerd[239420]: runtime.throw(0x5604d0deee9d, 0x16)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/panic.go:774 +0x74
Dec 07 15:02:28 node01 dockerd[239420]: runtime.sysMap(0xfb6c000000, 0x5c000000, 0x5604d35dab58)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mem_linux.go:169 +0xc7
Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).sysAlloc(0x5604d35be240, 0x5beee000, 0x7f6fd1fbab88, 0x5604cf22bf6b)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/malloc.go:701 +0x1cf
Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).grow(0x5604d35be240, 0x2df77, 0xffffffff)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mheap.go:1255 +0xa5
Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).allocSpanLocked(0x5604d35be240, 0x2df77, 0x5604d35dab68, 0x5604cf212927)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mheap.go:1170 +0x268
Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).alloc_m(0x5604d35be240, 0x2df77, 0x7f6c01180100, 0x7f6c01184350)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mheap.go:1022 +0xc6
Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).alloc.func1()
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mheap.go:1093 +0x4e
Dec 07 15:02:28 node01 dockerd[239420]: runtime.(*mheap).alloc(0x5604d35be240, 0x2df77, 0x7f6fd1010100, 0x7f6c01184350)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/mheap.go:1092 +0x8c
Dec 07 15:02:28 node01 dockerd[239420]: runtime.largeAlloc(0x5beee000, 0x5604cf250001, 0x7f6c01184350)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/malloc.go:1138 +0x99
Dec 07 15:02:28 node01 dockerd[239420]: runtime.mallocgc.func1()
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/malloc.go:1033 +0x48
Dec 07 15:02:28 node01 dockerd[239420]: runtime.systemstack(0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/asm_amd64.s:370 +0x63
Dec 07 15:02:28 node01 dockerd[239420]: runtime.mstart()
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/proc.go:1146
Dec 07 15:02:28 node01 dockerd[239420]: goroutine 1 [running]:
Dec 07 15:02:28 node01 dockerd[239420]: runtime.systemstack_switch()
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc0008bff70 sp=0xc0008bff68 pc=0x5604cf2528e0
Dec 07 15:02:28 node01 dockerd[239420]: runtime.mallocgc(0x5beee000, 0x5604d1e8fde0, 0x1, 0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/malloc.go:1032 +0x8a7 fp=0xc0008c0010 sp=0xc0008bff70 pc=0x5604cf200ef7
Dec 07 15:02:28 node01 dockerd[239420]: runtime.growslice(0x5604d1e8fde0, 0xef119e4000, 0x3107eaa, 0x3107eaa, 0x3107eab, 0x5604cf20146b, 0xfb6bbbfb80, 0x7f6c01184350)
Dec 07 15:02:28 node01 dockerd[239420]: /usr/local/go/src/runtime/slice.go:181 +0x1e4 fp=0xc0008c0078 sp=0xc0008c0010 pc=0x5604cf23b0b4
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt.(*Cursor).first(0xc0008c0288)
Dec 07 15:02:28 node01 dockerd[239420]: docker/vendor/go.etcd.io/bbolt/cursor.go:182 +0x138 fp=0xc0008c00d8 sp=0xc0008c0078 pc=0x5604d012d348
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt.(*Cursor).next(0xc0008c0288, 0x0, 0x9f, 0x9f, 0x7f7150506429, 0x2b8, 0xc0008c0188, 0x5604cf23ee10)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt/cursor.go:234 +0x84 fp=0xc0008c0128 sp=0xc0008c00d8 pc=0x5604d012d684
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt.(*Cursor).Next(0xc0008c0288, 0x7f715050638a, 0x9f, 0x9f, 0xfb6bbcf720, 0x9f, 0x2b8)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt/cursor.go:75 +0x43 fp=0xc0008c0198 sp=0xc0008c0128 pc=0x5604d012ca43
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libkv/store/boltdb.(*BoltDB).List.func1(0xc0034350a0, 0xc0008c0200, 0xc0034350a0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libkv/store/boltdb/boltdb.go:288 +0x19a fp=0xc0008c02b8 sp=0xc0008c0198 pc=0x5604d0142a4a
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt.(*DB).View(0xc0001dc200, 0xc0008c03f8, 0x0, 0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/go.etcd.io/bbolt/db.go:725 +0xaa fp=0xc0008c0340 sp=0xc0008c02b8 pc=0x5604d0130a3a
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libkv/store/boltdb.(*BoltDB).List(0xc00482f2c0, 0xc001a7af00, 0x5e, 0x0, 0x0, 0x0, 0x0, 0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libkv/store/boltdb/boltdb.go:279 +0x1b3 fp=0xc0008c0430 sp=0xc0008c0340 pc=0x5604d0141583
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore.(*cache).kmap(0xc003bf5900, 0x5604d220c1a0, 0xc001e7c160, 0x0, 0x5e, 0xc001a7ae40)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore/cache.go:43 +0x286 fp=0xc0008c0560 sp=0xc0008c0430 pc=0x5604cfcc81a6
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore.(*cache).list(0xc003bf5900, 0x5604d220c1a0, 0xc001e7c160, 0x0, 0x0, 0x0, 0x0, 0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore/cache.go:164 +0x7e fp=0xc0008c0678 sp=0xc0008c0560 pc=0x5604cfcc988e
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore.(*datastore).List(0xc0079ff680, 0xc001a7ae40, 0x5e, 0x5604d220c1a0, 0xc001e7c160, 0x0, 0x0, 0x0, 0x0, 0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/datastore/datastore.go:517 +0x205 fp=0xc0008c0730 sp=0xc0008c0678 pc=0x5604cfccc745
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork.(*network).getEndpointsFromStore(0xc00289a380, 0xc004a99258, 0x7, 0x5604d2206920, 0xc001a515f0, 0xc002f3ff48)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/store.go:190 +0x343 fp=0xc0008c09f8 sp=0xc0008c0730 pc=0x5604d01acaf3
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork.(*controller).reservePools(0xc0004b7400)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/controller.go:977 +0x4c1 fp=0xc0008c0c28 sp=0xc0008c09f8 pc=0x5604d0159701
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork.New(0xc001055d00, 0x9, 0x10, 0xc0007ac870, 0xc001387fb0, 0xc001055d00, 0x9)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/vendor/libnetwork/controller.go:245 +0x615 fp=0xc0008c0dd8 sp=0xc0008c0c28 pc=0x5604d0154815
Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon.(*Daemon).initNetworkController(0xc00000c1e0, 0xc000207080, 0xc001387fb0, 0xc000236f50, 0xc00000c1e0, 0xc0007170c8, 0xc001387fb0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon/daemon_unix.go:855 +0xa9 fp=0xc0008c0e90 sp=0xc0008c0dd8 pc=0x5604d0a49429
Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon.(*Daemon).restore(0xc00000c1e0, 0xc00004e5c0, 0xc000220000)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon/daemon.go:490 +0x50b fp=0xc0008c1088 sp=0xc0008c0e90 pc=0x5604d0a3898b
Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon.NewDaemon(0x5604d21de4c0, 0xc00004e5c0, 0xc000207080, 0xc0007ac870, 0x0, 0x0, 0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/daemon/daemon.go:1147 +0x2be8 fp=0xc0008c19c8 sp=0xc0008c1088 pc=0x5604d0a3c938
Dec 07 15:02:28 node01 dockerd[239420]: main.(*DaemonCli).start(0xc0007ab410, 0xc000218600, 0x0, 0x0)
Dec 07 15:02:28 node01 dockerd[239420]: /docker/cmd/dockerd/daemon.go:195 +0x743 fp=0xc0008c1d00 sp=0xc0008c19c8 pc=0x5604d0dbb3c3
Dec 07 15:02:28 node01 dockerd[239420]: main.runDaemon(...)

The log contains an obvious crash stack. It shows that startup hangs right after "Loading containers: start.", then prints "fatal error: runtime: out of memory": the daemon has exhausted its memory. From the stack trace, the failing call path is:

NewDaemon —> restore() —> initNetworkController —> libnetwork.New() —> reservePools() —> getEndpointsFromStore() —> List —> cache.list(kvObject) —> cache.kmap(kvObject) —> List(keyPrefix) —> Next —> next —> first

Pull the source for the matching version and, following the call path above, locate the code in question:

// List returns the range of keys starting with the passed in prefix
func (b *BoltDB) List(keyPrefix string) ([]*store.KVPair, error) {
    var (
        db  *bolt.DB
        err error
    )
    b.Lock()
    defer b.Unlock()

    kv := []*store.KVPair{}

    if db, err = b.getDBhandle(); err != nil {
        return nil, err
    }
    defer b.releaseDBhandle()

    err = db.View(func(tx *bolt.Tx) error {
        bucket := tx.Bucket(b.boltBucket)
        if bucket == nil {
            return store.ErrKeyNotFound
        }

        cursor := bucket.Cursor()
        prefix := []byte(keyPrefix)

        for key, v := cursor.Seek(prefix); bytes.HasPrefix(key, prefix); key, v = cursor.Next() {

            dbIndex := binary.LittleEndian.Uint64(v[:libkvmetadatalen])
            v = v[libkvmetadatalen:]
            val := make([]byte, len(v))
            copy(val, v)

            kv = append(kv, &store.KVPair{
                Key:       string(key),
                Value:     val,
                LastIndex: dbIndex,
            })
        }
        return nil
    })
    if len(kv) == 0 {
        return nil, store.ErrKeyNotFound
    }
    return kv, err
}

This code reads the boltdb database that libnetwork uses to cache network configuration. From past experience, this database is easily corrupted by an abnormal power-off, so the suspicion is that corruption made this traversal loop far more times than expected; every iteration allocates new variables and memory, until the process runs out of memory (the fatal error: runtime: out of memory seen in the log above).
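One way to make such a traversal fail fast is to bound it. The sketch below is not Docker code, only a minimal illustration of the idea; collectKeys and maxEntries are hypothetical names, and a real patch would need to pick a limit appropriate for libnetwork's datastore:

```go
package main

import (
	"errors"
	"fmt"
)

// maxEntries is a hypothetical sanity bound: a store that yields more
// entries than this is treated as damaged rather than read to exhaustion.
const maxEntries = 1 << 20

var errStoreCorrupt = errors.New("store: entry count exceeds sanity bound, db likely corrupt")

// collectKeys drains an iterator-style next() function, but refuses to grow
// the result without bound the way the cursor loop in List can.
func collectKeys(next func() ([]byte, bool)) ([][]byte, error) {
	var out [][]byte
	for key, ok := next(); ok; key, ok = next() {
		if len(out) >= maxEntries {
			return nil, errStoreCorrupt
		}
		out = append(out, key)
	}
	return out, nil
}

func main() {
	// Healthy case: three keys, then the iterator is exhausted.
	keys := [][]byte{[]byte("a"), []byte("b"), []byte("c")}
	i := 0
	next := func() ([]byte, bool) {
		if i >= len(keys) {
			return nil, false
		}
		k := keys[i]
		i++
		return k, true
	}
	got, err := collectKeys(next)
	fmt.Println(len(got), err) // 3 <nil>

	// Corrupt case: an iterator that never terminates trips the bound
	// instead of allocating until the runtime aborts.
	_, err = collectKeys(func() ([]byte, bool) { return []byte("x"), true })
	fmt.Println(err)
}
```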

Searching the Docker community turned up related issues [1] [2] confirming that boltdb corruption does occur, although the final error reported there looks somewhat different. An issue was also filed with the community [3]; the feedback likewise suspected boltdb corruption, and suggested deleting the local-kv.db file and restarting to recover.

In my view, the failing code could be hardened: check the db file up front and raise an error early if it is damaged, instead of eating memory indefinitely. (The faulty environment has since been wiped, so this idea will have to wait until the problem reproduces.)

Solution

  1. Temporary fix: delete the local-kv.db file and restart the Docker service.
  2. Permanent fix: harden the failing code.

References

  1. https://github.com/moby/moby/issues/37238
  2. https://github.com/moby/moby/issues/42099
  3. https://github.com/moby/moby/issues/43072