
Tool Share: One-Click K8s Cluster Deployment with Kubeasz

Background

To verify whether a certain bug had been fixed in the latest version of Kubernetes, I needed to spin up a K8s environment quickly. This post uses the kubeasz tool from reference [1] and records the deployment process along with the problems encountered on the way.

Deployment Process

First, download the tool script, the kubeasz code, the binaries, and the default container images.

Start the installation with the following command:

[root@node01 k8s]# ./ezdown -S
2023-03-22 13:39:40 INFO Action begin: start_kubeasz_docker
2023-03-22 13:39:41 INFO try to run kubeasz in a container
2023-03-22 13:39:41 DEBUG get host IP: 10.10.11.49
2023-03-22 13:39:41 DEBUG generate ssh key pair
# 10.10.11.49 SSH-2.0-OpenSSH_6.6.1
f1b442b7fdaf757c7787536b17d12d76208a2dd7884d56fbd1d35817dc2e94ca
2023-03-22 13:39:41 INFO Action successed: start_kubeasz_docker

[root@node01 k8s]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f1b442b7fdaf easzlab/kubeasz:3.5.0 "sleep 36000" 15 seconds ago Up 14 seconds kubeasz

The output does not make it obvious whether this succeeded or failed. Following the documentation, enter the container and run the setup command manually:

[root@node01 ~]# docker exec -it kubeasz ezctl start-aio
2023-03-22 06:15:05 INFO get local host ipadd: 10.10.11.49
2023-03-22 06:15:05 DEBUG generate custom cluster files in /etc/kubeasz/clusters/default
2023-03-22 06:15:05 DEBUG set versions
2023-03-22 06:15:05 DEBUG disable registry mirrors
2023-03-22 06:15:05 DEBUG cluster default: files successfully created.
2023-03-22 06:15:05 INFO next steps 1: to config '/etc/kubeasz/clusters/default/hosts'
2023-03-22 06:15:05 INFO next steps 2: to config '/etc/kubeasz/clusters/default/config.yml'
ansible-playbook -i clusters/default/hosts -e @clusters/default/config.yml playbooks/90.setup.yml
2023-03-22 06:15:05 INFO cluster:default setup step:all begins in 5s, press any key to abort:

PLAY [kube_master,kube_node,etcd,ex_lb,chrony] **********************************************************************************************************************************************************

TASK [Gathering Facts] **********************************************************************************************************************************************************************************
fatal: [10.10.11.49]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: root@10.10.11.49: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).", "unreachable": true}

PLAY RECAP **********************************************************************************************************************************************************************************************
10.10.11.49 : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0

The log points to a permissions problem, yet a manual test of passwordless SSH login appeared to work:

bash-5.1# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
/root/.ssh/id_rsa already exists.
Overwrite (y/n)?
bash-5.1# ssh-copy-id root@10.10.11.49
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
expr: warning: '^ERROR: ': using '^' as the first character
of a basic regular expression is not portable; it is ignored
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@10.10.11.49's password:

Number of key(s) added: 1

Now try logging into the machine, with: "ssh 'root@10.10.11.49'"
and check to make sure that only the key(s) you wanted were added.

bash-5.1# ssh root@10.10.11.49
root@10.10.11.49's password:

The relevant configuration files also have sane permissions:

[root@node01 kubeasz]# ll ~/.ssh
total 16
-rw------- 1 root root 1752 Mar 22 14:25 authorized_keys
-rw------- 1 root root 2602 Mar 22 14:25 id_rsa
-rw-r--r-- 1 root root 567 Mar 22 14:25 id_rsa.pub
-rw-r--r-- 1 root root 1295 Mar 22 13:39 known_hosts
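sshd is strict about these mode bits, and it also rejects keys for reasons that `ll ~/.ssh` alone does not reveal (for example a group-writable home directory), so the check is worth scripting when debugging key-auth failures. A minimal sketch, assuming GNU `stat` on Linux; `check_ssh_perms` is a hypothetical helper name:

```shell
# Sanity-check the permissions sshd expects before accepting key auth.
# check_ssh_perms is a hypothetical helper; assumes GNU stat (Linux).
check_ssh_perms() {
  dir="$1"
  # the .ssh directory itself must not be group/world accessible
  [ "$(stat -c '%a' "$dir")" = "700" ] || { echo "bad mode on $dir"; return 1; }
  # the private key must be readable by the owner only
  if [ -e "$dir/id_rsa" ]; then
    [ "$(stat -c '%a' "$dir/id_rsa")" = "600" ] || { echo "id_rsa must be 600"; return 1; }
  fi
  echo ok
}
```

Run it as `check_ssh_perms /root/.ssh` on the target host.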

It was unclear what exactly was wrong, so following reference [2] I switched to username/password authentication instead.

Configure the username and password inside the container; the connectivity check passes:

bash-5.1# vi /etc/ansible/hosts
[webservers]
10.10.11.49

[webservers:vars]
ansible_ssh_pass='******'
ansible_ssh_user='root'

bash-5.1# ansible webservers -m ping
10.10.11.49 | SUCCESS => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": false,
"ping": "pong"
}

Edit the clusters/default/hosts file that the cluster installation depends on, adding the same username/password settings:

[etcd]
10.10.11.49

[etcd:vars]
ansible_ssh_pass='******'
ansible_ssh_user='root'

# master node(s)
[kube_master]
10.10.11.49

[kube_master:vars]
ansible_ssh_pass='******'
ansible_ssh_user='root'


# work node(s)
[kube_node]
10.10.11.49

[kube_node:vars]
ansible_ssh_pass='******'
ansible_ssh_user='root'
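Repeating the same two variables under every group works, but is easy to get out of sync. Ansible inventories also accept a single `[all:vars]` section that applies to every host, so the settings can be appended once. A sketch, with the inventory path taken as a parameter and the password left as a placeholder:

```shell
# Append shared SSH credentials once instead of per group.
# The inventory path and password value are placeholders for your setup.
add_all_vars() {
  hosts_file="$1"
  cat >> "$hosts_file" <<'EOF'

[all:vars]
ansible_ssh_user='root'
ansible_ssh_pass='******'
EOF
}
```

Inside the kubeasz container this would be `add_all_vars /etc/kubeasz/clusters/default/hosts`.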

Running the setup command again complains that the sshpass tool is missing:

[root@node01 kubeasz]# docker exec -it kubeasz ezctl setup default all
ansible-playbook -i clusters/default/hosts -e @clusters/default/config.yml playbooks/90.setup.yml
2023-03-22 07:35:46 INFO cluster:default setup step:all begins in 5s, press any key to abort:

PLAY [kube_master,kube_node,etcd,ex_lb,chrony] **********************************************************************************************************************************************************

TASK [Gathering Facts] **********************************************************************************************************************************************************************************
fatal: [10.10.11.49]: FAILED! => {"msg": "to use the 'ssh' connection type with passwords, you must install the sshpass program"}

PLAY RECAP **********************************************************************************************************************************************************************************************
10.10.11.49 : ok=0 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0

Install the sshpass package inside the container:

bash-5.1# apk add sshpass
fetch https://dl-cdn.alpinelinux.org/alpine/v3.16/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.16/community/x86_64/APKINDEX.tar.gz
(1/1) Installing sshpass (1.09-r0)
Executing busybox-1.35.0-r17.trigger
OK: 21 MiB in 47 packages

Re-run the command:

[root@node01 kubeasz]# docker exec -it kubeasz ezctl setup default all
ansible-playbook -i clusters/default/hosts -e @clusters/default/config.yml playbooks/90.setup.yml
2023-03-22 07:36:37 INFO cluster:default setup step:all begins in 5s, press any key to abort:

...

TASK [kube-node : 轮询等待kube-proxy启动] *********************************************************************************************************************************************************************
changed: [10.10.11.49]
FAILED - RETRYING: 轮询等待kubelet启动 (4 retries left).
FAILED - RETRYING: 轮询等待kubelet启动 (3 retries left).
FAILED - RETRYING: 轮询等待kubelet启动 (2 retries left).
FAILED - RETRYING: 轮询等待kubelet启动 (1 retries left).

TASK [kube-node : 轮询等待kubelet启动] ************************************************************************************************************************************************************************
fatal: [10.10.11.49]: FAILED! => {"attempts": 4, "changed": true, "cmd": "systemctl is-active kubelet.service", "delta": "0:00:00.014621", "end": "2023-03-22 15:42:07.230186", "msg": "non-zero return code", "rc": 3, "start": "2023-03-22 15:42:07.215565", "stderr": "", "stderr_lines": [], "stdout": "activating", "stdout_lines": ["activating"]}

PLAY RECAP **********************************************************************************************************************************************************************************************
10.10.11.49 : ok=85 changed=78 unreachable=0 failed=1 skipped=123 rescued=0 ignored=0
localhost : ok=33 changed=30 unreachable=0 failed=0 skipped=11 rescued=0 ignored=0

The kubelet step failed, so check the kubelet service:

[root@node01 log]# service kubelet status -l
Redirecting to /bin/systemctl status -l kubelet.service
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Wed 2023-03-22 15:56:31 CST; 1s ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Process: 147581 ExecStart=/opt/kube/bin/kubelet --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///run/containerd/containerd.sock --hostname-override=10.10.11.49 --kubeconfig=/etc/kubernetes/kubelet.kubeconfig --root-dir=/var/lib/kubelet --v=2 (code=exited, status=1/FAILURE)
Main PID: 147581 (code=exited, status=1/FAILURE)

Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.719832 147581 manager.go:228] Version: {KernelVersion:3.10.0-862.11.6.el7.x86_64 ContainerOsVersion:CentOS Linux 7 (Core) DockerVersion: DockerAPIVersion: CadvisorVersion: CadvisorRevision:}
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.720896 147581 server.go:659] "--cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /"
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.721939 147581 container_manager_linux.go:267] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.722392 147581 container_manager_linux.go:272] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName:
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.722503 147581 topology_manager.go:134] "Creating topology manager with policy per scope" topologyPolicyName="none" topologyScopeName="container"
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.722609 147581 container_manager_linux.go:308] "Creating device plugin manager"
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.722689 147581 manager.go:125] "Creating Device Plugin manager" path="/var/lib/kubelet/device-plugins/kubelet.sock"
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.722763 147581 server.go:66] "Creating device plugin registration server" version="v1beta1" socket="/var/lib/kubelet/device-plugins/kubelet.sock"
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.722905 147581 state_mem.go:36] "Initialized new in-memory state store"
Mar 22 15:56:31 node01 kubelet[147581]: E0322 15:56:31.726502 147581 run.go:74] "command failed" err="failed to run Kubelet: validate service connection: CRI v1 runtime API is not implemented for endpoint \"unix:///run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"

Based on the error in the log and reference [3], the fix is to remove the /etc/containerd/config.toml file and restart containerd:

mv /etc/containerd/config.toml /root/config.toml.bak
systemctl restart containerd
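The usual reason this fix works is that containerd packages shipped alongside Docker write a default config.toml containing `disabled_plugins = ["cri"]`, which turns off exactly the CRI service kubelet is trying to reach; removing the file makes containerd fall back to built-in defaults with CRI enabled. If you want the move to be repeat-safe, it can be wrapped in a small guard (paths match the ones above; `reset_containerd_config` is a hypothetical helper name):

```shell
# Repeat-safe variant of the fix above: move config.toml aside only if
# it exists. reset_containerd_config is a hypothetical helper name.
reset_containerd_config() {
  cfg="${1:-/etc/containerd/config.toml}"
  backup="${2:-/root/config.toml.bak}"
  if [ -f "$cfg" ]; then
    mv "$cfg" "$backup"
  fi
}
# afterwards: systemctl restart containerd
```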

After re-running the command, a check in the background showed that calico-node failed to start. Its events look like this:

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 41s default-scheduler Successfully assigned kube-system/calico-node-rqpjm to 10.10.11.49
Normal Pulling 20s (x2 over 31s) kubelet Pulling image "easzlab.io.local:5000/calico/cni:v3.23.5"
Warning Failed 19s (x2 over 31s) kubelet Failed to pull image "easzlab.io.local:5000/calico/cni:v3.23.5": rpc error: code = Unknown desc = failed to pull and unpack image "easzlab.io.local:5000/calico/cni:v3.23.5": failed to resolve reference "easzlab.io.local:5000/calico/cni:v3.23.5": failed to do request: Head "https://easzlab.io.local:5000/v2/calico/cni/manifests/v3.23.5": http: server gave HTTP response to HTTPS client
Warning Failed 19s (x2 over 31s) kubelet Error: ErrImagePull
Normal BackOff 5s (x2 over 30s) kubelet Back-off pulling image "easzlab.io.local:5000/calico/cni:v3.23.5"
Warning Failed 5s (x2 over 30s) kubelet Error: ImagePullBackOff

At the Docker level the configuration looks correct, and pulling the image works:

[root@node01 ~]# cat /etc/docker/daemon.json
{
"max-concurrent-downloads": 10,
"insecure-registries": ["easzlab.io.local:5000"],
"log-driver": "json-file",
"log-level": "warn",
"log-opts": {
"max-size": "10m",
"max-file": "3"
},
"data-root":"/var/lib/docker"
}

[root@node01 log]# docker pull easzlab.io.local:5000/calico/cni:v3.23.5
v3.23.5: Pulling from calico/cni
Digest: sha256:9c5055a2b5bc0237ab160aee058135ca9f2a8f3c3eee313747a02edcec482f29
Status: Image is up to date for easzlab.io.local:5000/calico/cni:v3.23.5
easzlab.io.local:5000/calico/cni:v3.23.5

At the containerd level, pulling the image also works:

[root@node01 log]# ctr image pull --plain-http=true easzlab.io.local:5000/calico/cni:v3.23.5
easzlab.io.local:5000/calico/cni:v3.23.5: resolved |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:9c5055a2b5bc0237ab160aee058135ca9f2a8f3c3eee313747a02edcec482f29: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:cc0e45adf05a30a90384ba7024dbabdad9ae0bcd7b5a535c28dede741298fea3: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:47c5dbbec31222325790ebad8c07d270a63689bd10dc8f54115c65db7c30ad1f: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:8efc3d73e2741a93be09f68c859da466f525b9d0bddb1cd2b2b633f14f232941: done |++++++++++++++++++++++++++++++++++++++|
config-sha256:1c979d623de9aef043cb4ff489da5636d61c39e30676224af0055240e1816382: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:4c98a4f67c5a7b1058111d463051c98b23e46b75fc943fc2535899a73fc0c9f1: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:51729c6e2acda05a05e203289f5956954814d878f67feb1a03f9941ec5b4008b: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:050b055d5078c5c6ad085d106c232561b0c705aa2173edafd5e7a94a1e908fc5: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:7430548aa23e56c14da929bbe5e9a2af0f9fd0beca3bd95e8925244058b83748: done |++++++++++++++++++++++++++++++++++++++|
elapsed: 3.1 s total: 103.0 (33.2 MiB/s)
unpacking linux/amd64 sha256:9c5055a2b5bc0237ab160aee058135ca9f2a8f3c3eee313747a02edcec482f29...
done: 6.82968396s

Following reference [4], generate the default containerd configuration and add an entry for the private registry:

[root@node01 ~]# containerd config default > /etc/containerd/config.toml

[root@node01 ~]# vim /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = ""

[plugins."io.containerd.grpc.v1.cri".registry.auths]

[plugins."io.containerd.grpc.v1.cri".registry.configs]

[plugins."io.containerd.grpc.v1.cri".registry.headers]

[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."easzlab.io.local:5000"]
endpoint = ["http://easzlab.io.local:5000"]

[root@node01 ~]# service containerd restart
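Hand-editing the generated file is error-prone; since TOML allows table headers to appear in any order, the mirror entry can also be appended to the end of the file. A sketch that takes the config path as a parameter, so it can be tried on a copy before touching the real /etc/containerd/config.toml (registry name as above):

```shell
# Append an HTTP mirror entry for the local kubeasz registry to a
# containerd config. The config path is passed in so the sketch can be
# tested against a copy first.
add_registry_mirror() {
  cfg="$1"
  registry="${2:-easzlab.io.local:5000}"
  cat >> "$cfg" <<EOF

[plugins."io.containerd.grpc.v1.cri".registry.mirrors."$registry"]
  endpoint = ["http://$registry"]
EOF
}
# afterwards: systemctl restart containerd
```

Note that containerd will reject the file if the same TOML table is defined twice, so only append the entry once.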

Checking the pod status again, several pods are now stuck in ContainerCreating:

[root@node01 ~]# kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-89b744d6c-klzwh 1/1 Running 0 5m35s
kube-system calico-node-wmvff 1/1 Running 0 5m35s
kube-system coredns-6665999d97-mp7xc 0/1 ContainerCreating 0 5m35s
kube-system dashboard-metrics-scraper-57566685b4-8q5fm 0/1 ContainerCreating 0 5m35s
kube-system kubernetes-dashboard-57db9bfd5b-h6jp4 0/1 ContainerCreating 0 5m35s
kube-system metrics-server-6bd9f986fc-njpnj 0/1 ContainerCreating 0 5m35s
kube-system node-local-dns-wz9bg 1/1 Running 0 5m31s

Pick one and describe it:

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 6m7s default-scheduler 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
Normal Scheduled 5m47s default-scheduler Successfully assigned kube-system/coredns-6665999d97-mp7xc to 10.10.11.49
Warning FailedCreatePodSandBox 5m46s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "072c164d79f4874a8d851d36115ea04b75a2155dae3cecdc764e923c9f38f86b": plugin type="calico" failed (add): failed to find plugin "calico" in path [/opt/cni/bin]
Normal SandboxChanged 33s (x25 over 5m46s) kubelet Pod sandbox changed, it will be killed and re-created.

The log shows the calico CNI plugin binary is missing from /opt/cni/bin. After copying the plugins over manually, check the pod status again:

[root@node01 bin]# cd /opt/cni/bin/
[root@node01 bin]# chmod +x *
[root@node01 bin]# ll -h
total 186M
-rwxr-xr-x 1 root root 3.7M Mar 22 17:46 bandwidth
-rwxr-xr-x 1 root root 56M Mar 22 17:46 calico
-rwxr-xr-x 1 root root 56M Mar 22 17:46 calico-ipam
-rwxr-xr-x 1 root root 2.4M Mar 22 17:46 flannel
-rwxr-xr-x 1 root root 3.1M Mar 22 17:46 host-local
-rwxr-xr-x 1 root root 56M Mar 22 17:46 install
-rwxr-xr-x 1 root root 3.2M Mar 22 17:46 loopback
-rwxr-xr-x 1 root root 3.6M Mar 22 17:46 portmap
-rwxr-xr-x 1 root root 3.3M Mar 22 17:46 tuning

[root@node01 bin]# kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-89b744d6c-mpfgq 1/1 Running 0 37m
kube-system calico-node-h9sm2 1/1 Running 0 37m
kube-system coredns-6665999d97-8pdbd 1/1 Running 0 37m
kube-system dashboard-metrics-scraper-57566685b4-c2l8w 1/1 Running 0 37m
kube-system kubernetes-dashboard-57db9bfd5b-74lmb 1/1 Running 0 37m
kube-system metrics-server-6bd9f986fc-d9crl 1/1 Running 0 37m
kube-system node-local-dns-kvgv6 1/1 Running 0 37m
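The manual copy above can be sketched as a small helper. The source directory is an assumption: kubeasz normally unpacks its downloaded binaries under /etc/kubeasz/bin on the deploy host, so adjust it to wherever your CNI plugin binaries actually live:

```shell
# Install CNI plugin binaries into the path kubelet searches
# (/opt/cni/bin, per the failed-sandbox event above). The source
# directory is an assumption about where kubeasz keeps the binaries.
install_cni_plugins() {
  src="$1"
  dst="${2:-/opt/cni/bin}"
  mkdir -p "$dst"
  cp "$src"/* "$dst"/
  chmod +x "$dst"/*
}
# e.g. install_cni_plugins /etc/kubeasz/bin/cni   # hypothetical source path
```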

The deployment is complete.

References

[1] https://github.com/easzlab/kubeasz/blob/master/docs/setup/quickStart.md

[2] https://www.jianshu.com/p/c48b4a24c7d4

[3] https://www.cnblogs.com/immaxfang/p/16721407.html

[4] https://github.com/containerd/containerd/issues/4938