
Following the official KubeSphere documentation [1], this post records several problems encountered while setting up an offline (air-gapped) deployment environment.

Problem 1: Building the offline installation package on an internet-connected host fails

During the build, some image pulls time out. This is most likely a network issue, and retrying a few times is enough (a convenience retry sketch follows the command below).

[root@node kubesphere]# ./kk artifact export -m manifest-sample.yaml -o kubesphere.tar.gz
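
A convenience sketch of such a retry loop around the same kk invocation:

# Retry the artifact export a few times; image pull timeouts are usually transient.
for i in 1 2 3 4 5; do
    ./kk artifact export -m manifest-sample.yaml -o kubesphere.tar.gz && break
    echo "attempt ${i} failed, retrying in 30s..."
    sleep 30
done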

Problem 2: Failure during the Harbor installation stage

The Harbor installation stage fails with an unable to sign certificate: must specify a CommonName error:

[root@node1 kubesphere]# ./kk init registry -f config-sample.yaml -a kubesphere.tar.gz
19:37:46 CST [GreetingsModule] Greetings
19:37:47 CST message: [master]
Greetings, KubeKey!
19:37:47 CST success: [master]
19:37:47 CST [UnArchiveArtifactModule] Check the KubeKey artifact md5 value
19:37:47 CST success: [LocalHost]
...
19:48:16 CST success: [master]
19:48:16 CST [ConfigureOSModule] configure the ntp server for each node
19:48:17 CST skipped: [master]
19:48:17 CST [InitRegistryModule] Fetch registry certs
19:48:18 CST success: [master]
19:48:18 CST [InitRegistryModule] Generate registry Certs
[certs] Using existing ca certificate authority
19:48:18 CST message: [LocalHost]
unable to sign certificate: must specify a CommonName
19:48:18 CST failed: [LocalHost]
error: Pipeline[InitRegistryPipeline] execute failed: Module[InitRegistryModule] exec failed:
failed: [LocalHost] [GenerateRegistryCerts] exec failed after 1 retries: unable to sign certificate: must specify a CommonName

Following reference [2], modify the registry-related configuration:

registry:
  type: harbor
  auths:
    "dockerhub.kubekey.local":
      username: admin
      password: Harbor12345
      certsPath: "/etc/docker/certs.d/dockerhub.kubekey.local"
  privateRegistry: "dockerhub.kubekey.local"
  namespaceOverride: "kubesphereio"
  registryMirrors: []
  insecureRegistries: []

Problem 3: Failures during the cluster creation stage

kk tries to download the Kubernetes binaries and fails:

[root@node1 kubesphere]# ./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz
23:29:32 CST [NodeBinariesModule] Download installation binaries
23:29:32 CST message: [localhost]
downloading amd64 kubeadm v1.22.12 ...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (6) Could not resolve host: storage.googleapis.com; 未知的错误
23:29:32 CST [WARN] Having a problem with accessing https://storage.googleapis.com? You can try again after setting environment 'export KKZONE=cn'
23:29:32 CST message: [LocalHost]
Failed to download kubeadm binary: curl -L -o /home/k8s/kubesphere/kubekey/kube/v1.22.12/amd64/kubeadm https://storage.googleapis.com/kubernetes-release/release/v1.22.12/bin/linux/amd64/kubeadm error: exit status 6
23:29:32 CST failed: [LocalHost]
error: Pipeline[CreateClusterPipeline] execute failed: Module[NodeBinariesModule] exec failed:
failed: [LocalHost] [DownloadBinaries] exec failed after 1 retries: Failed to download kubeadm binary: curl -L -o /home/k8s/kubesphere/kubekey/kube/v1.22.12/amd64/kubeadm https://storage.googleapis.com/kubernetes-release/release/v1.22.12/bin/linux/amd64/kubeadm error: exit status 6

This error occurs because config-sample.yaml was not generated by the kk command, so the Kubernetes version in it does not match the artifact. Checking the command's help output shows that the default KubeSphere version is v3.4.1:

[root@node1 kubesphere]# ./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz -h
Create a Kubernetes or KubeSphere cluster

Usage:
kk create cluster [flags]

Flags:
-a, --artifact string Path to a KubeKey artifact
--container-manager string Container runtime: docker, crio, containerd and isula. (default "docker")
--debug Print detailed information
--download-cmd string The user defined command to download the necessary binary files. The first param '%s' is output path, the second param '%s', is the URL (default "curl -L -o %s %s")
-f, --filename string Path to a configuration file
-h, --help help for cluster
--ignore-err Ignore the error message, remove the host which reported error and force to continue
--namespace string KubeKey namespace to use (default "kubekey-system")
--skip-pull-images Skip pre pull images
--skip-push-images Skip pre push images
--with-kubernetes string Specify a supported version of kubernetes
--with-kubesphere Deploy a specific version of kubesphere (default v3.4.1)
--with-local-storage Deploy a local PV provisioner
--with-packages install operation system packages by artifact
--with-security-enhancement Security enhancement
-y, --yes

Modify the command to specify the KubeSphere version explicitly:

[root@node1 kubesphere]# ./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz --with-kubesphere 3.4.0
W1205 00:36:57.266052 1453 utils.go:69] The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.96.0.10]; the provided value is: [169.254.25.10]
[init] Using Kubernetes version: v1.23.15
[preflight] Running pre-flight checks
[WARNING FileExisting-socat]: socat not found in system path
[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 24.0.6. Latest validated version: 20.10
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR FileExisting-conntrack]: conntrack not found in system path
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
00:36:58 CST stdout: [master]
[preflight] Running pre-flight checks
W1205 00:36:58.323079 1534 removeetcdmember.go:80] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] No etcd config found. Assuming external etcd
[reset] Please, manually reset etcd to prevent further issues
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
W1205 00:36:58.327376 1534 cleanupnode.go:109] [reset] Failed to evaluate the "/var/lib/kubelet" directory. Skipping its unmount and cleanup: lstat /var/lib/kubelet: no such file or directory
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
[reset] Deleting contents of stateful directories: [/var/lib/dockershim /var/run/kubernetes /var/lib/cni]

The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d

The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.

If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.

The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
00:36:58 CST message: [master]
init kubernetes cluster failed: Failed to exec command: sudo -E /bin/bash -c "/usr/local/bin/kubeadm init --config=/etc/kubernetes/kubeadm-config.yaml --ignore-preflight-errors=FileExisting-crictl,ImagePull"

Problem 4: Some Pods are stuck starting up

kubesphere-system              ks-apiserver-86757d49bb-m9pp4          ContainerCreating
kubesphere-system ks-console-cbdb4558c-7z6lg Running
kubesphere-system ks-controller-manager-64b5dcb7d-9mrsw ContainerCreating
kubesphere-system ks-installer-ff66855c9-d8x4k Running

According to reference [3], the installation progress can be followed with kubectl logs -n kubesphere-system $(kubectl get pod -n kubesphere-system -l app=ks-install -o jsonpath='{.items[0].metadata.name}') -f; the abnormal Pods above only come up after all components have been installed:

#####################################################
### Welcome to KubeSphere! ###
#####################################################

Console: http://10.10.10.30:30880
Account: admin
Password: P@88w0rd
NOTES:
1. After you log into the console, please check the
monitoring status of service components in
"Cluster Management". If any service is not
ready, please wait patiently until all components
are up and running.
2. Please change the default password after login.

#####################################################
https://kubesphere.io 2023-12-05 01:24:00
#####################################################
01:24:04 CST success: [master]
01:24:04 CST Pipeline[CreateClusterPipeline] execute successfully
Installation is complete.

Please check the result using the command:

kubectl logs -n kubesphere-system $(kubectl get pod -n kubesphere-system -l 'app in (ks-install, ks-installer)' -o jsonpath='{.items[0].metadata.name}') -f

Problem 5: metrics-server fails to start

The logs show that if the Harbor registry is installed on the master node, there is a port conflict:

[root@master ~]# kubectl logs -f -n kube-system metrics-server-6d987cb45c-4swvd
panic: failed to create listener: failed to listen on 0.0.0.0:4443: listen tcp 0.0.0.0:4443: bind: address already in use

goroutine 1 [running]:
main.main()
/go/src/sigs.k8s.io/metrics-server/cmd/metrics-server/metrics-server.go:39 +0xfc
[root@master ~]# netstat -anp|grep 4443
tcp 0 0 0.0.0.0:4443 0.0.0.0:* LISTEN 22372/docker-proxy
tcp6 0 0 :::4443 :::* LISTEN 22378/docker-proxy

[root@master ~]# docker ps |grep harbor|grep 4443
1733e9580af5 goharbor/nginx-photon:v2.5.3 "nginx -g 'daemon of…" 4 hours ago Up 4 hours (healthy) 0.0.0.0:4443->4443/tcp, :::4443->4443/tcp, 0.0.0.0:80->8080/tcp, :::80->8080/tcp, 0.0.0.0:443->8443/tcp, :::443->8443/tcp nginx

After changing the port, metrics-server recovered.
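
For reference, one hedged way to resolve such a clash is to move metrics-server to a different secure port; the flag and port values below are illustrative and should be checked against the actual Deployment manifest, and simply not installing Harbor on the master node also avoids the conflict:

# Edit the metrics-server Deployment and adjust the listening port (illustrative values):
kubectl -n kube-system edit deployment metrics-server
#
#   containers:
#   - args:
#     - --secure-port=4445        # was 4443, which Harbor's nginx already publishes on the host
#     ports:
#     - containerPort: 4445       # keep the port name unchanged so the Service still matches it
#       name: https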

Problem 6: Some Pods fail to pull images

kubesphere-logging-system      opensearch-cluster-data-0            init:ImagePullBackOff
kubesphere-logging-system opensearch-cluster-master-0 init:ImagePullBackOff
istio-system istio-cni-node-vlzt7 ImagePullBackOff
kubesphere-controls-system kubesphere-router-test-55b5fcc887-xlzsh ImagePullBackOff

Inspection shows that the init containers fail because they use the busybox image, which was not included in the offline package in advance:

initContainers:
- args:
  - chown -R 1000:1000 /usr/share/opensearch/data
  command:
  - sh
  - -c
  image: busybox:latest
  imagePullPolicy: Always

The other two image pull failures have the same cause: the images were not downloaded into the offline package in advance:

Normal   BackOff    21s (x51 over 15m)  kubelet            Back-off pulling image "dockerhub.kubekey.local/kubesphereio/install-cni:1.14.6"

After manually downloading the images and importing them into the offline environment, the abnormal Pods recovered.
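
For reference, a sketch of the manual download/import flow (image names and tags below are illustrative and must match how the Pod specs actually reference them):

# On a host with internet access: pull and save the missing images.
docker pull busybox:latest
docker pull istio/install-cni:1.14.6
docker save -o offline-images.tar busybox:latest istio/install-cni:1.14.6

# On the offline side: load the archive, re-tag for the private Harbor registry and push.
# busybox is referenced as busybox:latest in the Pod spec, so it can also simply be loaded on each node.
docker load -i offline-images.tar
docker tag istio/install-cni:1.14.6 dockerhub.kubekey.local/kubesphereio/install-cni:1.14.6
docker push dockerhub.kubekey.local/kubesphereio/install-cni:1.14.6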

References

1. https://kubesphere.io/zh/docs/v3.3/installing-on-linux/introduction/air-gapped-installation/

2. https://github.com/kubesphere/kubekey/issues/1762#issuecomment-1681625989

3. https://github.com/kubesphere/ks-installer/issues/907

Prerequisites

  1. Make sure the Docker version is at least 19.03 and that experimental CLI features are enabled via the DOCKER_CLI_EXPERIMENTAL environment variable. The following commands enable the buildx plugin for the current terminal and verify that it is available [1]:
[root@node1 root]# export DOCKER_CLI_EXPERIMENTAL=enabled

[root@node1 root]# docker buildx version
github.com/docker/buildx v0.3.1-tp-docker 6db68d029599c6710a32aa7adcba8e5a344795a7
  2. Make sure the Linux kernel is upgraded to 4.8 or above; otherwise the following error occurs [2]:
[root@node1 root]# docker run --privileged --rm tonistiigi/binfmt --install all
Unable to find image 'tonistiigi/binfmt:latest' locally
latest: Pulling from tonistiigi/binfmt
2a625f6055a5: Pull complete
71d6c64c6702: Pull complete
Digest: sha256:8de6f2decb92e9001d094534bf8a92880c175bd5dfb4a9d8579f26f09821cfa2
Status: Downloaded newer image for tonistiigi/binfmt:latest
installing: arm64 cannot register "/usr/bin/qemu-aarch64" to /proc/sys/fs/binfmt_misc/register: write /proc/sys/fs/binfmt_misc/register: invalid argument
installing: s390x cannot register "/usr/bin/qemu-s390x" to /proc/sys/fs/binfmt_misc/register: write /proc/sys/fs/binfmt_misc/register: invalid argument
installing: riscv64 cannot register "/usr/bin/qemu-riscv64" to /proc/sys/fs/binfmt_misc/register: write /proc/sys/fs/binfmt_misc/register: invalid argument
installing: mips64le cannot register "/usr/bin/qemu-mips64el" to /proc/sys/fs/binfmt_misc/register: write /proc/sys/fs/binfmt_misc/register: invalid argument
installing: mips64 cannot register "/usr/bin/qemu-mips64" to /proc/sys/fs/binfmt_misc/register: write /proc/sys/fs/binfmt_misc/register: invalid argument
installing: arm cannot register "/usr/bin/qemu-arm" to /proc/sys/fs/binfmt_misc/register: write /proc/sys/fs/binfmt_misc/register: invalid argument
installing: ppc64le cannot register "/usr/bin/qemu-ppc64le" to /proc/sys/fs/binfmt_misc/register: write /proc/sys/fs/binfmt_misc/register: invalid argument
{
"supported": [
"linux/amd64",
"linux/386"
],
"emulators": null
}

Environment preparation

  1. Upgrade the kernel. Taking 4.9 as an example, the rpm packages are available at [3]:
[root@node1 4.9]# ll
total 13400
-rw-r--r-- 1 root root 1114112 Dec 12 20:22 kernel-4.9.241-37.el7.x86_64.rpm
-rw-r--r-- 1 root root 11686072 Dec 12 20:22 kernel-devel-4.9.241-37.el7.x86_64.rpm

[root@node1 4.9]# rpm -ivh kernel-*
warning: kernel-4.9.241-37.el7.x86_64.rpm: Header V4 RSA/SHA1 Signature, key ID 61e8806c: NOKEY
Preparing... ################################# [100%]
Updating / installing...
1:kernel-devel-4.9.241-37.el7 ################################# [ 50%]
2:kernel-4.9.241-37.el7 ################################# [100%]

[root@node1 4.9]# reboot

[root@node1 4.9]# uname -a
Linux node1 4.9.241-37.el7.x86_64 #1 SMP Mon Nov 2 13:55:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  2. Enable binfmt_misc and check the result:
[root@node1 ~]# docker run --privileged --rm tonistiigi/binfmt --install all

installing: arm OK
installing: s390x OK
installing: ppc64le OK
installing: arm64 OK
installing: riscv64 OK
installing: mips64le OK
installing: mips64 OK
{
"supported": [
"linux/amd64",
"linux/arm64",
"linux/riscv64",
"linux/ppc64le",
"linux/s390x",
"linux/386",
"linux/mips64le",
"linux/mips64",
"linux/arm/v7",
"linux/arm/v6"
],
"emulators": [
"qemu-aarch64",
"qemu-arm",
"qemu-mips64",
"qemu-mips64el",
"qemu-ppc64le",
"qemu-riscv64",
"qemu-s390x"
]
}

[root@node1 ~]# ls -al /proc/sys/fs/binfmt_misc/
total 0
drwxr-xr-x 2 root root 0 Dec 13 16:29 .
dr-xr-xr-x 1 root root 0 Dec 13 16:27 ..
-rw-r--r-- 1 root root 0 Dec 13 16:29 qemu-aarch64
-rw-r--r-- 1 root root 0 Dec 13 16:29 qemu-arm
-rw-r--r-- 1 root root 0 Dec 13 16:29 qemu-mips64
-rw-r--r-- 1 root root 0 Dec 13 16:29 qemu-mips64el
-rw-r--r-- 1 root root 0 Dec 13 16:29 qemu-ppc64le
-rw-r--r-- 1 root root 0 Dec 13 16:29 qemu-riscv64
-rw-r--r-- 1 root root 0 Dec 13 16:29 qemu-s390x
--w------- 1 root root 0 Dec 13 16:29 register
-rw-r--r-- 1 root root 0 Dec 13 16:29 status

Build verification

Create a new builder and bootstrap it:

[root@node1 ~]# docker buildx create --use --name mybuilder
mybuilder

[root@node1 ~]# docker buildx inspect mybuilder --bootstrap
[+] Building 105.8s (1/1) FINISHED
=> [internal] booting buildkit 105.8s
=> => pulling image moby/buildkit:buildx-stable-1 105.3s
=> => creating container buildx_buildkit_mybuilder0 0.6s
Name: mybuilder
Driver: docker-container
Last Activity: 2023-12-13 08:35:03 +0000 UTC

Nodes:
Name: mybuilder0
Endpoint: unix:///var/run/docker.sock
Status: running
Buildkit: v0.9.3
Platforms: linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6

Taking the xxx image as an example, build it and keep the result in the local Docker daemon by setting the output type to docker:

[root@node1 images]# docker buildx build -t xxx/xxx --platform=linux/arm64 -o type=docker .
[+] Building 5.5s (6/6) FINISHED docker-container:mybuilder
=> [internal] load build definition from Dockerfile 0.1s
=> => transferring dockerfile: 219B 0.0s
=> [internal] load .dockerignore 0.1s
=> => transferring context: 2B 0.0s
=> [internal] load build context 0.9s
=> => transferring context: 68.42MB 0.8s
=> [1/1] COPY ./xxx /bin/xxx 0.1s
=> exporting to oci image format 4.3s
=> => exporting layers 3.0s
=> => exporting manifest sha256:33877987488ccd8fb6803f06f6b90b5ff667dd172db23b339e96acee31af354f 0.0s
=> => exporting config sha256:f16ad6c6fc37b1cad030e7880c094f75f2cb6959ebbc3712808f25e04b96a395 0.0s
=> => sending tarball 1.3s
=> importing to docker

Check the image:

[root@node1 images]# docker images|grep xxx
xxx/xxx latest f16ad6c6fc37 2 minutes ago 68.4MB
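
Since the image now only exists in the local Docker daemon, moving it to an offline or arm64 target host is a plain save/load round trip:

# Export the freshly built image and import it on the target host.
docker save -o xxx.tar xxx/xxx:latest
# ...copy xxx.tar to the target host, then:
docker load -i xxx.tar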

References

1. https://cloud.tencent.com/developer/article/1543689

2. https://www.cnblogs.com/frankming/p/16870285.html

3. http://ftp.usf.edu/pub/centos/7/virt/x86_64/xen-414/Packages/k/

Problem background

After deploying a K8S cluster on the bclinux-for-eular operating system, while installing the upper-layer business components, Pods suddenly stopped starting normally from a certain component onwards.

Root cause analysis

The Pod's describe output shows an obvious too many open files error:

Normal   Scheduled               41m                  default-scheduler  Successfully assigned xx/xxx-v3falue7-6f59dd5766-npd2x to node1
Warning FailedCreatePodSandBox 26m (x301 over 41m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "xxx-v3falue7-6f59dd5766-npd2x": Error response from daemon: start failed: : pipe2: too many open files: unknown
Normal SandboxChanged 66s (x808 over 41m) kubelet Pod sandbox changed, it will be killed and re-created.

Since Docker is used as the container runtime (CRI), check the Docker logs first:

time="2023-11-13T14:56:05.734166795+08:00" level=info msg="/etc/resolv.conf does not exist"
time="2023-11-13T14:56:05.734193544+08:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
time="2023-11-13T14:56:05.734202079+08:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
time="2023-11-13T14:56:05.740830618+08:00" level=error msg="stream copy error: reading from a closed fifo"
time="2023-11-13T14:56:05.740850537+08:00" level=error msg="stream copy error: reading from a closed fifo"
time="2023-11-13T14:56:05.751993232+08:00" level=error msg="1622cfb1c90d926b867db7bcb0a86498ccad59db81223e861ac515ec75ed7c27 cleanup: failed to delete container from containerd: no such container"
time="2023-11-13T14:56:05.752024358+08:00" level=error msg="Handler for POST /v1.41/containers/1622cfb1c90d926b867db7bcb0a86498ccad59db81223e861ac515ec75ed7c27/start returned error: start failed: : fork/exec /usr/bin/containerd-shim-runc-v2: too many open files: unknown"

From the Docker logs, the error is fork/exec /usr/bin/containerd-shim-runc-v2: too many open files: unknown, which basically confirms that **containerd has exhausted its open file descriptor limit**.

As shown below, the containerd runtime's open file limit is the default 1024, which is rather low. Once a node runs too many containers, new containers fail to start.

[root@node1 ~]# systemctl status containerd.service
● containerd.service - containerd container runtime
Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
Active: active (running) since Sat 2023-11-01 11:02:14 CST; 1 weeks 10 days ago
Docs: https://containerd.io
Main PID: 1999 (containerd)
Tasks: 1622
Memory: 3.5G
CGroup: /system.slice/containerd.service
├─ 999 /usr/bin/containerd

cat /proc/999/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size unlimited unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 319973 319973 processes
Max open files 1024 524288 files

The containerd.service unit file contains no explicit file descriptor limit (on other, healthy environments the containerd.service file installed with the OS does contain a LimitNOFILE setting):

[root@node1 ~]# cat /usr/lib/systemd/system/containerd.service
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target

[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd
KillMode=process
Delegate=yes

[Install]
WantedBy=multi-user.target

Solution

Modify the containerd.service file and configure the file descriptor limit explicitly; choose the value according to actual needs:

[root@node1 ~]# cat /usr/lib/systemd/system/containerd.service
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target

[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd
KillMode=process
Delegate=yes
LimitNOFILE=1048576
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity

[Install]
WantedBy=multi-user.target
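
Instead of editing the vendor unit file in place (which a later package update may overwrite), the same limits can be applied through a systemd drop-in; a sketch:

# Create a drop-in that only overrides the resource limits.
mkdir -p /etc/systemd/system/containerd.service.d
cat > /etc/systemd/system/containerd.service.d/override.conf <<'EOF'
[Service]
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
EOF

# Reload systemd, restart containerd and verify the new limit.
systemctl daemon-reload
systemctl restart containerd
cat /proc/$(pidof containerd)/limits | grep "open files"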


Problem background

In a K8S cluster, a containerized component A needs to create iptables rules on the node to translate requests. This works fine on CentOS 7, but on Red Hat 8 the feature stops working; specifically, the iptables commands executed inside the container take no effect.

Analysis

Component A's functionality is based on the iptables command. Check the iptables version used inside the container:

~ # iptables -V
iptables v1.6.2

Check the iptables version on the CentOS 7 host:

[root@node1 ~]# iptables -V
iptables v1.4.21

By comparison, the Red Hat 8.4 environment ships a newer iptables, and **its mode has changed from the default legacy to nf_tables**:

[root@node1 ~]# iptables -V
iptables v1.8.4 (nf_tables)

According to Red Hat's official documentation [1], this mode cannot be changed, because the binaries required by legacy mode were dropped when the rpm package was built:

iptables changelog:
Raw
* Wed Jul 11 2018 Phil Sutter - 1.8.0-1
- New upstream version 1.8.0
- Drop compat sub-package
- Use nft tool versions, drop legacy ones

Other operating systems such as Debian and Alpine ship newer iptables builds that support both legacy and nf_tables modes and allow switching between them:

update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy

update-alternatives --set iptables /usr/sbin/iptables-nft
update-alternatives --set ip6tables /usr/sbin/ip6tables-nft

At this point it is basically confirmed that the iptables mode mismatch is what makes the rule provisioning ineffective. Testing on the Red Hat 8.4 environment confirms this: after switching iptables to nf_tables mode with the commands above inside component A (built on an Alpine base image), the rules are applied correctly.

Solution

To support both CentOS 7 and Red Hat 8, component A needs to select the iptables mode at container startup: old environments keep the original legacy mode, while new environments switch to nf_tables mode. A possible startup sketch follows.
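
A minimal startup sketch, assuming an Alpine/Debian-style image where update-alternatives manages the iptables symlinks; the detection heuristic ("prefer the backend that already holds the host's rules") is a commonly used approach, not component A's actual code:

#!/bin/sh
# Count the rules visible to each backend; the host's tables are visible because
# component A runs with host networking and NET_ADMIN in this scenario.
legacy_rules=$(iptables-legacy-save 2>/dev/null | grep -c '^-A' || true)
nft_rules=$(iptables-nft-save 2>/dev/null | grep -c '^-A' || true)

if [ "${nft_rules:-0}" -gt "${legacy_rules:-0}" ]; then
    update-alternatives --set iptables /usr/sbin/iptables-nft
    update-alternatives --set ip6tables /usr/sbin/ip6tables-nft
else
    update-alternatives --set iptables /usr/sbin/iptables-legacy
    update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
fi

exec "$@"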

References

  1. https://access.redhat.com/solutions/4377321
  2. https://www.cnblogs.com/redcat8850/p/16135814.html

Problem background

After a K8S cluster had been running stably for a year, restarted Pods got stuck in the ContainerCreating state:

[root@node1 ~]# kubectl get pod -n kube-system -owide
NAME READY STATUS RESTARTS AGE IP NODE
calico-kube-controllers-cd96b6c89-bpjp6 1/1 Running 0 40h 10.10.0.1 node3
calico-node-ffsz8 1/1 Running 0 14s 10.10.0.1 node3
calico-node-nsmwl 1/1 Running 0 14s 10.10.0.2 node2
calico-node-w4ngt 1/1 Running 0 14s 10.10.0.1 node1
coredns-55c8f5fd88-hw76t 1/1 Running 1 260d 192.168.135.55 node3
xxx-55c8f5fd88-vqwbz 1/1 ContainerCreating 1 319d 192.168.104.22 node2

Analysis

Check with kubectl describe:

[root@node1 ~]# kubectl describe pod -n xxx xxx
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 52m default-scheduler Successfully assigned xxx/xxx to node1
Warning FailedCreatePodSandBox 52m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "xxx" network for pod "xxx": networkPlugin cni failed to set up pod "xxx" network: connection is unauthorized: Unauthorized, failed to clean up sandbox container "xxx" network for pod "xxx": networkPlugin cni failed to teardown pod "xxx" network: error getting ClusterInformation: connection is unauthorized: Unauthorized]
Normal SandboxChanged 50m (x10 over 52m) kubelet Pod sandbox changed, it will be killed and re-created.

The events show Unauthorized, i.e. the CNI plugin no longer has permission to fetch information from kube-apiserver. Decoding the token used by the corresponding Pod shows that an expiration time is indeed defined:

{
alg: "RS256",
kid: "nuXGyK2zjFNBRnO1ayeOxJDm_luMf4eqQFnqJbsVl7I"
}.
{
aud: [
"https://kubernetes.default.svc.cluster.local"
],
exp: 1703086264, // expiration time: the token expires one year after issuance
iat: 1671550264,
nbf: 1671550264,
iss: "https://kubernetes.default.svc.cluster.local",
kubernetes.io: {
namespace: "kube-system",
pod: {
name: "xxx",
uid: "c7300d73-c716-4bbc-ad2b-80353d99073b"
},
serviceaccount: {
name: "multus",
uid: "1600e098-6a86-4296-8410-2051d45651ce"
},
warnafter: 1671553871
},
sub: "system:serviceaccount:kube-system:xxx"
}.
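
For reference, the payload above can be dumped from a running Pod with a short sequence like the following (a sketch; <pod-name> is a placeholder and the base64url payload needs its padding restored before decoding):

# Read the projected service account token from the Pod and decode its JWT payload.
TOKEN=$(kubectl exec -n kube-system <pod-name> -- cat /var/run/secrets/kubernetes.io/serviceaccount/token)
PAYLOAD=$(echo "$TOKEN" | cut -d. -f2 | tr '_-' '/+')
# Pad to a multiple of 4 characters so base64 -d accepts it.
case $(( ${#PAYLOAD} % 4 )) in 2) PAYLOAD="${PAYLOAD}==";; 3) PAYLOAD="${PAYLOAD}=";; esac
echo "$PAYLOAD" | base64 -d; echo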

Checking the related issues [1,2,3] basically confirms that **this is caused by the Kubernetes version upgrade**: to provide a more secure token mechanism, the BoundServiceAccountTokenVolume feature graduated to beta and is enabled by default starting with v1.21.

Solution

  1. If you do not want this feature, you can disable it via the feature gate on the kube-apiserver and kube-controller-manager components, following the method below [4].
1. How can this feature be enabled / disabled in a live cluster?
Feature gate name: BoundServiceAccountTokenVolume
Components depending on the feature gate: kube-apiserver and kube-controller-manager
Will enabling / disabling the feature require downtime of the control plane? yes, need to restart kube-apiserver and kube-controller-manager.
Will enabling / disabling the feature require downtime or reprovisioning of a node? no.
2. Does enabling the feature change any default behavior? yes, pods' service account tokens will expire after 1 year by default and are not stored as Secrets any more.
  2. If you do want to use the feature, the token consumers must be adapted: either do not cache the token, or automatically reload a fresh token into memory once it expires. Recent versions of the client-go and fabric8 clients already support this.

References

  1. https://github.com/k8snetworkplumbingwg/multus-cni/issues/852
  2. https://github.com/projectcalico/calico/issues/5712
  3. https://www.cnblogs.com/bystander/p/rancher-jian-kong-bu-xian-shi-jian-kong-shu-ju-wen.html
  4. https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/1205-bound-service-account-tokens/README.md

Build process

Following the official documentation [1], run the build command:

[root@node01 projectcalico]# make -C node image
"Build dependency versions"
BIRD_VERSION = v0.3.3-151-g767b5389
"Test dependency versions"
CNI_VER = master
"Calico git version"
GIT_VERSION =
make: Entering directory `/home/go/gopath/src/github.com/projectcalico/node'
mkdir -p .go-pkg-cache bin /home/go/gopath/pkg/mod && docker run --rm --net=host --init -e GOPRIVATE='github.com/tigera/*' -e GO111MODULE=on -v /home/go/gopath/pkg/mod:/go/pkg/mod:rw -e LOCAL_USER_ID=0 -e GOCACHE=/go-cache -e GOARCH=amd64 -e GOPATH=/go -e OS=linux -e GOOS=linux -e GOFLAGS= -v /home/go/gopath/src/github.com/projectcalico/node:/go/src/github.com/projectcalico/node:rw -v /home/go/gopath/src/github.com/projectcalico/node/.go-pkg-cache:/go-cache:rw -w /go/src/github.com/projectcalico/node calico/go-build:v0.40 sh -c ' go mod download'
...
Starting with UID : 0
useradd: UID 0 is not unique
su-exec: getpwnam(user): Success
make: *** [remote-deps] Error 1
make: Leaving directory `/home/go/gopath/src/github.com/projectcalico/node'

From the log, the build fails at the remote-deps stage with useradd: UID 0 is not unique. The docker run command in the log passes LOCAL_USER_ID=0, meaning the container is supposed to run as root, yet the process still executed useradd to add a user (in theory it should never reach that point).

Looking at calico-node's entrypoint.sh, if the container is started as root the script should stop at the exec "$@" around line 10, and whether it is treated as root is decided by the RUN_AS_ROOT variable.

#!/bin/bash

# Add local user
# Either use the LOCAL_USER_ID if passed in at runtime or
# fallback

USER_ID=${LOCAL_USER_ID:-9001}

if [ "${RUN_AS_ROOT}" = "true" ]; then
  exec "$@"
fi

echo "Starting with UID : $USER_ID" 1>&2
# Do not create mail box.
/bin/sed -i 's/^CREATE_MAIL_SPOOL=yes/CREATE_MAIL_SPOOL=no/' /etc/default/useradd
# Don't pass "-m" to useradd if the home directory already exists (which can occur if it was volume mounted in) otherwise it will fail.
if [[ ! -d "/home/user" ]]; then
  /usr/sbin/useradd -m -U -s /bin/bash -u $USER_ID user
else
  /usr/sbin/useradd -U -s /bin/bash -u $USER_ID user
fi

...

exec /sbin/su-exec user "$@"

make的执行结果看,没有发现RUN_AS_ROOT变量,再查看calico-nodeMakefile文件,也没有定义,猜测是缺少了RUN_AS_ROOT变量定义导致的

[root@node01 projectcalico]# grep -r "RUN_AS_ROOT" ./node/

Looking at the official go-build repository [2], its Makefile does handle the root user case:

ifeq ("$(LOCAL_USER_ID)", "0")
# The build needs to run as root.
EXTRA_DOCKER_ARGS+=-e RUN_AS_ROOT='true'
endif

Apply the same change to calico-node's Makefile.common:

[root@node01 projectcalico]# grep -r "RUN_AS_ROOT" ./node/
./node/Makefile.common: EXTRA_DOCKER_ARGS+=-e RUN_AS_ROOT='true'

Run make again:

[root@node01 projectcalico]# make -C node image
"Build dependency versions"
BIRD_VERSION = v0.3.3-151-g767b5389
"Test dependency versions"
CNI_VER = master
"Calico git version"
GIT_VERSION =
make: Entering directory `/home/go/gopath/src/github.com/projectcalico/node'
mkdir -p .go-pkg-cache bin /home/go/gopath/pkg/mod && docker run --rm --net=host --init -e GOPRIVATE='github.com/tigera/*' -e RUN_AS_ROOT='true' -e GO111MODULE=on -v /home/go/gopath/pkg/mod:/go/pkg/mod:rw -e GOCACHE=/go-cache -e GOARCH=amd64 -e GOPATH=/go -e OS=linux -e GOOS=linux -e GOFLAGS= -e LOCAL_USER_ID=0 -v /home/go/gopath/src/github.com/projectcalico/node:/go/src/github.com/projectcalico/node:rw -v /home/go/gopath/src/github.com/projectcalico/node/.go-pkg-cache:/go-cache:rw -w /go/src/github.com/projectcalico/node -e CGO_ENABLED=1 calico/go-build:v0.40 sh -c ' go build -v -o dist/bin//calico-node-amd64 -ldflags " -X github.com/projectcalico/node/pkg/startup.VERSION= -X github.com/projectcalico/node/buildinfo.GitVersion=<unknown> -X github.com/projectcalico/node/buildinfo.BuildDate=2023-05-09T06:06:42+0000 -X github.com/projectcalico/node/buildinfo.GitRevision=<unknown>" ./cmd/calico-node/main.go'
github.com/kelseyhightower/confd/pkg/backends
github.com/projectcalico/libcalico-go/lib/apis/v1/unversioned
github.com/projectcalico/libcalico-go/lib/backend/encap
...
Starting with UID : 9001
calico-node-amd64 -v

docker build --pull -t calico/node:latest-amd64 . --build-arg BIRD_IMAGE=calico/bird:v0.3.3-151-g767b5389-amd64 --build-arg QEMU_IMAGE=calico/go-build:v0.40 --build-arg GIT_VERSION= -f ./Dockerfile.amd64
Sending build context to Docker daemon 66.3MB
Step 1/40 : ARG ARCH=x86_64
Step 2/40 : ARG GIT_VERSION=unknown
Step 3/40 : ARG IPTABLES_VER=1.8.2-16
Step 4/40 : ARG RUNIT_VER=2.1.2
Step 5/40 : ARG BIRD_IMAGE=calico/bird:latest
Step 6/40 : FROM calico/bpftool:v5.3-amd64 as bpftool
...
Step 16/40 : RUN dnf install -y 'dnf-command(config-manager)' && dnf config-manager --set-enabled PowerTools && yum install -y rpm-build yum-utils make && yum install -y wget glibc-static gcc && yum -y update-minimal --security --sec-severity=Important --sec-severity=Critical
---> Running in eca2b4c5f0b4
CentOS Linux 8 - AppStream 51 B/s | 38 B 00:00
Error: Failed to download metadata for repo 'appstream': Cannot prepare internal mirrorlist: No URLs in mirrorlist

The build log shows that installing dependencies with yum fails because the default vault.centos.org mirror is used. Edit Dockerfile.amd64 and switch to the domestic Aliyun mirror [3]:

-ARG CENTOS_MIRROR_BASE_URL=http://vault.centos.org/8.1.1911
+ARG CENTOS_MIRROR_BASE_URL=https://mirrors.aliyun.com/centos-vault/8.1.1911

+RUN mv /etc/yum.repos.d /etc/yum.repo.d-bk && \
+ mkdir -p /etc/yum.repos.d && mv /centos.repo /etc/yum.repos.d && \
+ yum clean all && yum makecache && \
dnf install -y 'dnf-command(config-manager)' && \
# Enable PowerTools repo for '-devel' packages
- dnf config-manager --set-enabled PowerTools && \

Modify the centos.repo file to skip the gpgcheck verification and add the PowerTools repository:

[centos-8-base-os]
name = CentOS - BaseOS
baseurl = https://mirrors.aliyun.com/centos-vault/8.1.1911/BaseOS/x86_64/os
enabled = 1
gpgkey = https://mirrors.aliyun.com/keys/RPM-GPG-KEY-CentOS-Official
gpgcheck = 0

[centos-8-appstream]
name = CentOS - AppStream
baseurl = https://mirrors.aliyun.com/centos-vault/8.1.1911/AppStream/x86_64/os
enabled = 1
gpgkey = https://mirrors.aliyun.com/keys/RPM-GPG-KEY-CentOS-Official
gpgcheck = 0

[Centos8-PowerTool-local1]
name=Centos8-PowerTool-local1
baseurl=https://mirrors.aliyun.com/centos-vault/8.1.1911/PowerTools/x86_64/os
enabled=1
gpgcheck=0

Continue the build:

...
docker build --pull -t calico/node:latest-amd64 . --build-arg BIRD_IMAGE=calico/bird:v0.3.3-151-g767b5389-amd64 --build-arg QEMU_IMAGE=calico/go-build:v0.40 --build-arg GIT_VERSION= -f ./Dockerfile.amd64
Sending build context to Docker daemon 66.3MB
Step 1/41 : ARG ARCH=x86_64
Step 2/41 : ARG GIT_VERSION=unknown
Step 3/41 : ARG IPTABLES_VER=1.8.2-16
Step 4/41 : ARG RUNIT_VER=2.1.2
Step 5/41 : ARG BIRD_IMAGE=calico/bird:latest
Step 6/41 : FROM calico/bpftool:v5.3-amd64 as bpftool
...
Step 12/41 : ARG CENTOS_MIRROR_BASE_URL=https://mirrors.aliyun.com/centos-vault/8.1.1911
---> Using cache
---> a96f716928d7
...
Step 17/41 : RUN mv /etc/yum.repos.d /etc/yum.repo.d-bk && mkdir -p /etc/yum.repos.d && mv /centos.repo /etc/yum.repos.d && yum clean all && yum makecache && dnf install -y 'dnf-command(config-manager)' && yum install -y rpm-build yum-utils make && yum install -y wget glibc-static gcc && yum -y update-minimal --security --sec-severity=Important --sec-severity=Critical
---> Using cache
---> a9ffd418a7a4
...
Step 24/41 : FROM registry.access.redhat.com/ubi8/ubi-minimal:8.1-407
8.1-407: Pulling from ubi8/ubi-minimal
Digest: sha256:01b8fb7b3ad16a575651a4e007e8f4d95b68f727b3a41fc57996be9a790dc4fa
Status: Image is up to date for registry.access.redhat.com/ubi8/ubi-minimal:8.1-407
---> 6ce38bb5210c
...
Step 39/41 : COPY dist/bin/calico-node-amd64 /bin/calico-node
---> Using cache
---> 916fbf133fb0
Step 40/41 : COPY --from=bpftool /bpftool /bin
---> Using cache
---> f797db5c4eb4
Step 41/41 : CMD ["start_runit"]
---> Using cache
---> fe6496ded4a6
[Warning] One or more build-args [QEMU_IMAGE] were not consumed
Successfully built fe6496ded4a6
Successfully tagged calico/node:latest-amd64
touch .calico_node.created-amd64
make: Leaving directory `/home/go/gopath/src/github.com/projectcalico/node'

Check the built images:

[root@node01 github.com]# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
calico/node latest-amd64 77f4ca933207 7 hours ago 264MB
<none> <none> 420e5252b060 7 hours ago 633MB

References

  1. https://github.com/projectcalico/calico/blob/master/DEVELOPER_GUIDE.md
  2. https://github.com/projectcalico/go-build/blob/7a75e06f7e9b39df8697ca96f6d5f42369155902/Makefile.common
  3. https://mirrors.aliyun.com/centos-vault/8.1.1911/

Problem background

The customer's firewall captured requests sent to a Service that has no Endpoints. From a K8S point of view this should not normally happen, because requests to a Service without Endpoints are supposed to be rejected by iptables rules.

Analysis

First reproduce it locally: create a service with no backend, e.g. grafana-service111:

 [root@node01 ~]# kubectl get svc -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 2d
kube-system grafana-service ClusterIP 10.96.78.163 <none> 3000/TCP 2d
kube-system grafana-service111 ClusterIP 10.96.52.101 <none> 3000/TCP 13s

[root@node01 ~]# kubectl get ep -A
NAMESPACE NAME ENDPOINTS AGE
default kubernetes 10.10.72.15:6443 2d
kube-system grafana-service 10.78.104.6:3000,10.78.135.5:3000 2d
kube-system grafana-service111 <none> 18s

Exec into a business Pod and request grafana-service111; the request hangs and eventually times out:

[root@node01 ~]# kubectl exec -it -n kube-system   influxdb-rs1-5bdc67f4cb-lnfgt bash
root@influxdb-rs1-5bdc67f4cb-lnfgt:/# time curl http://10.96.52.101:3000
curl: (7) Failed to connect to 10.96.52.101 port 3000: Connection timed out

real 2m7.307s
user 0m0.006s
sys 0m0.008s

Check the iptables rules for grafana-service111: a reject rule does exist, but judging from the behaviour above it is evidently not taking effect:

[root@node01 ~]# iptables-save |grep 10.96.52.101
-A KUBE-SERVICES -d 10.96.52.101/32 -p tcp -m comment --comment "kube-system/grafana-service111: has no endpoints" -m tcp --dport 3000 -j REJECT --reject-with icmp-port-unreachable

Capturing packets on the Pod's container interface shows no response packet (not as expected):

[root@node01 ~]# tcpdump -n -i calie2568ca85e4 host 10.96.52.101
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on calie2568ca85e4, link-type EN10MB (Ethernet), capture size 262144 bytes
20:31:34.647286 IP 10.78.166.136.39230 > 10.96.52.101.hbci: Flags [S], seq 1890821953, win 29200, options [mss 1460,sackOK,TS val 792301056 ecr 0,nop,wscale 7], length 0

Capturing on the node's NIC shows the service request packets leaving the node (not as expected):

[root@node01 ~]# tcpdump -n -i eth0 host 10.96.52.101
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
20:33:36.994881 IP 10.10.72.10.41234 > 10.96.52.101.hbci: Flags [S], seq 3530065013, win 29200, options [mss 1460,sackOK,TS val 792423403 ecr 0,nop,wscale 7], length 0
20:33:37.995298 IP 10.10.72.10.41234 > 10.96.52.101.hbci: Flags [S], seq 3530065013, win 29200, options [mss 1460,sackOK,TS val 792424404 ecr 0,nop,wscale 7], length 0
20:33:39.999285 IP 10.10.72.10.41234 > 10.96.52.101.hbci: Flags [S], seq 3530065013, win 29200, options [mss 1460,sackOK,TS val 792426408 ecr 0,nop,wscale 7], length 0

Since the reject rule exists, two components could plausibly interfere with it:

  1. kube-proxy
  2. calico-node

Using the cluster from the previous post, "Deploying a K8S cluster with Kubeasz in one step", the same test on the latest K8S cluster does not show the problem, which means it has already been fixed in newer versions. Searching the Kubernetes and Calico issue trackers shows that this is a Calico bug; see the issues in references [1, 2] and the fix in reference [3].

Below is the difference in how old and new Calico versions handle the cali-FORWARD chain:

Problematic environment:
[root@node4 ~]# iptables -t filter -S cali-FORWARD
-N cali-FORWARD
-A cali-FORWARD -m comment --comment "cali:vjrMJCRpqwy5oRoX" -j MARK --set-xmark 0x0/0xe0000
-A cali-FORWARD -m comment --comment "cali:A_sPAO0mcxbT9mOV" -m mark --mark 0x0/0x10000 -j cali-from-hep-forward
-A cali-FORWARD -i cali+ -m comment --comment "cali:8ZoYfO5HKXWbB3pk" -j cali-from-wl-dispatch
-A cali-FORWARD -o cali+ -m comment --comment "cali:jdEuaPBe14V2hutn" -j cali-to-wl-dispatch
-A cali-FORWARD -m comment --comment "cali:12bc6HljsMKsmfr-" -j cali-to-hep-forward
-A cali-FORWARD -m comment --comment "cali:MH9kMp5aNICL-Olv" -m comment --comment "Policy explicitly accepted packet." -m mark --mark 0x10000/0x10000 -j ACCEPT
// The problem is this last rule: newer Calico versions moved it to the top-level FORWARD chain

Normal environment:
[root@node01 ~]# iptables -t filter -S cali-FORWARD
-N cali-FORWARD
-A cali-FORWARD -m comment --comment "cali:vjrMJCRpqwy5oRoX" -j MARK --set-xmark 0x0/0xe0000
-A cali-FORWARD -m comment --comment "cali:A_sPAO0mcxbT9mOV" -m mark --mark 0x0/0x10000 -j cali-from-hep-forward
-A cali-FORWARD -i cali+ -m comment --comment "cali:8ZoYfO5HKXWbB3pk" -j cali-from-wl-dispatch
-A cali-FORWARD -o cali+ -m comment --comment "cali:jdEuaPBe14V2hutn" -j cali-to-wl-dispatch
-A cali-FORWARD -m comment --comment "cali:12bc6HljsMKsmfr-" -j cali-to-hep-forward
-A cali-FORWARD -m comment --comment "cali:NOSxoaGx8OIstr1z" -j cali-cidr-block
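
To tell which variant a given environment runs, check where the "Policy explicitly accepted packet" ACCEPT rule lives: inside cali-FORWARD on affected versions, or in the top-level FORWARD chain on fixed ones:

# Affected versions keep the mark-0x10000 ACCEPT rule inside cali-FORWARD; fixed ones put it in FORWARD.
iptables -t filter -S cali-FORWARD | grep "Policy explicitly accepted packet"
iptables -t filter -S FORWARD | grep "Policy explicitly accepted packet"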

Below is the record of the same test on the latest K8S cluster, for comparison with the problematic environment.

Create a Pod to simulate a client:

[root@node01 home]# kubectl run busybox --image=busybox-curl:v1.0 --image-pull-policy=IfNotPresent -- sleep 300000
pod/busybox created

[root@node01 home]# kubectl get pod -A -owide
NAMESPACE   NAME      READY   STATUS    RESTARTS   AGE   IP             NODE
default     busybox   1/1     Running   0          14h   10.78.153.73   10.10.11.49

Create a Service metrics-server111 to simulate a backend service, with no endpoints behind it:

[root@node01 home]# kubectl get svc -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default kubernetes ClusterIP 10.68.0.1 <none> 443/TCP 18h
kube-system dashboard-metrics-scraper ClusterIP 10.68.174.38 <none> 8000/TCP 17h
kube-system kube-dns ClusterIP 10.68.0.2 <none> 53/UDP,53/TCP,9153/TCP 17h
kube-system kube-dns-upstream ClusterIP 10.68.41.41 <none> 53/UDP,53/TCP 17h
kube-system kubernetes-dashboard NodePort 10.68.160.45 <none> 443:30861/TCP 17h
kube-system metrics-server ClusterIP 10.68.65.249 <none> 443/TCP 17h
kube-system metrics-server111 ClusterIP 10.68.224.53 <none> 443/TCP 14h
kube-system node-local-dns ClusterIP None <none> 9253/TCP 17h

[root@node01 ~]# kubectl get ep -A
NAMESPACE NAME ENDPOINTS AGE
default kubernetes 172.28.11.49:6443 18h
kube-system dashboard-metrics-scraper 10.78.153.68:8000 18h
kube-system kube-dns 10.78.153.67:53,10.78.153.67:53,10.78.153.67:9153 18h
kube-system kube-dns-upstream 10.78.153.67:53,10.78.153.67:53 18h
kube-system kubernetes-dashboard 10.78.153.66:8443 18h
kube-system metrics-server 10.78.153.65:4443 18h
kube-system metrics-server111 <none> 15h
kube-system node-local-dns 172.28.11.49:9253 18h

Exec into the client Pod and run a curl test; the request is rejected immediately (as expected):

[root@node01 02-k8s]# kubectl exec -it busybox bash
/ # curl -i -k https://10.68.224.53:443
curl: (7) Failed to connect to 10.68.224.53 port 443 after 2 ms: Connection refused

Capturing on the container interface with tcpdump shows tcp port https unreachable (as expected):

tcpdump -n -i cali12d4a061371
21:54:42.697437 IP 10.78.153.73.41606 > 10.68.224.53.https: Flags [S], seq 3510100476, win 29200, options [mss 1460,sackOK,TS val 2134372616 ecr 0,nop,wscale 7], length 0
21:54:42.698804 IP 10.10.11.49> 10.78.153.73: ICMP 10.68.224.53 tcp port https unreachable, length 68

Capturing on the node's NIC with tcpdump shows no request from the test container leaving the cluster (as expected):

[root@node01 bin]# tcpdump -n -i eth0 host 10.68.224.53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
2 packets received by filter
0 packets dropped by kernel

Solution

Upgrade Calico; version >= v3.16.0 is required.
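
To confirm the Calico version actually running before and after the upgrade, the image tag of the calico-node DaemonSet is a quick indicator (a sketch):

# The image tag of calico-node reflects the deployed Calico version.
kubectl -n kube-system get daemonset calico-node \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'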

References

1. https://github.com/projectcalico/calico/issues/1055
2. https://github.com/projectcalico/calico/issues/3901
3. https://github.com/projectcalico/felix/pull/2424

Problem background

This problem [1] had been shelved after being traced down to the VMware virtualization layer, but since the test environments rely heavily on virtual machines and supporting virtualized deployments is only a matter of time, I kept following the related VMware and Red Hat material.

Root cause analysis

Fortunately, Red Hat's website does document this problem [2]. According to the description, it occurs on Red Hat 8.3 and 8.4 when using the vmxnet3 adapter together with a UDP tunnelling protocol such as VXLAN or GRE.

The official solutions are:

  1. Upgrade Red Hat to 8.5 (kernel-4.18.0-348.el8) or later.
  2. Upgrade VMware ESXi to 6.7 P07 or 7.0 U3 (7.0.3) or later.

These updates include logic that disables tx-checksum-ip-generic for tunnels the vmxnet3 NIC does not support, so the end result is the same as the workaround below:

ethtool -K DEVNAME tx-checksum-ip-generic off

However, actual testing shows that neither the Red Hat nor the VMware upgrade fixes the problem, while the temporary disable command does work. Since the temporary command also raises the question of persisting it across reboots, another approach was still needed.
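
If the temporary ethtool workaround does have to be kept around, one hedged way to persist it across reboots is a small systemd oneshot unit (the device name ens192 is only an example; in the end switching the NIC type turned out to be the simpler route):

cat > /etc/systemd/system/disable-tx-checksum.service <<'EOF'
[Unit]
Description=Disable tx-checksum-ip-generic on the vmxnet3 NIC (VXLAN workaround)
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -K ens192 tx-checksum-ip-generic off

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now disable-tx-checksum.service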

Since the vmxnet3 NIC is the problem, why not switch the NIC type? Following this idea and digging further [3], using an E1000-type NIC indeed resolves the problem, and testing confirms it.

Solution

Interim solution: when creating the virtual machine, set the network adapter type to E1000 or E1000e.

Permanent solution: still waiting for an official fix from VMware or Red Hat.

References

  1. https://lyyao09.github.io/2022/06/05/k8s/K8S%E9%97%AE%E9%A2%98%E6%8E%92%E6%9F%A5-VMWare%E8%99%9A%E6%8B%9F%E5%8C%96%E7%8E%AF%E5%A2%83%E4%B8%8BPod%E8%B7%A8VXLAN%E9%80%9A%E4%BF%A1%E5%BC%82%E5%B8%B8/
  2. https://access.redhat.com/solutions/5881451
  3. https://zhangguanzhang.github.io/2022/07/28/redhat84-vxlan-esxi

Problem background

To verify whether the latest Kubernetes version has fixed a certain bug, I needed to set up a K8S environment quickly. This post uses the kubeasz tool from reference [1] and records the deployment process and the problems encountered.

Deployment process

First download the tool script, the kubeasz code, the binaries and the default container images (see the sketch below).
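
The download step follows the kubeasz quickstart [1]; a sketch, assuming release 3.5.0 (which matches the easzlab/kubeasz:3.5.0 container seen below):

# Fetch the ezdown helper script and let it download kubeasz, the binaries and the default images.
export release=3.5.0
wget https://github.com/easzlab/kubeasz/releases/download/${release}/ezdown
chmod +x ezdown
./ezdown -D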

Start the installation with the following command:

[root@node01 k8s]# ./ezdown -S
2023-03-22 13:39:40 INFO Action begin: start_kubeasz_docker
2023-03-22 13:39:41 INFO try to run kubeasz in a container
2023-03-22 13:39:41 DEBUG get host IP: 10.10.11.49
2023-03-22 13:39:41 DEBUG generate ssh key pair
# 10.10.11.49 SSH-2.0-OpenSSH_6.6.1
f1b442b7fdaf757c7787536b17d12d76208a2dd7884d56fbd1d35817dc2e94ca
2023-03-22 13:39:41 INFO Action successed: start_kubeasz_docker

[root@node01 k8s]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f1b442b7fdaf easzlab/kubeasz:3.5.0 "sleep 36000" 15 seconds ago Up 14 seconds kubeasz

The output does not make it clear whether this succeeded or failed. Following the documentation, run the setup command inside the container:

[root@node01 ~]# docker exec -it kubeasz ezctl start-aio
2023-03-22 06:15:05 INFO get local host ipadd: 10.10.11.49
2023-03-22 06:15:05 DEBUG generate custom cluster files in /etc/kubeasz/clusters/default
2023-03-22 06:15:05 DEBUG set versions
2023-03-22 06:15:05 DEBUG disable registry mirrors
2023-03-22 06:15:05 DEBUG cluster default: files successfully created.
2023-03-22 06:15:05 INFO next steps 1: to config '/etc/kubeasz/clusters/default/hosts'
2023-03-22 06:15:05 INFO next steps 2: to config '/etc/kubeasz/clusters/default/config.yml'
ansible-playbook -i clusters/default/hosts -e @clusters/default/config.yml playbooks/90.setup.yml
2023-03-22 06:15:05 INFO cluster:default setup step:all begins in 5s, press any key to abort:

PLAY [kube_master,kube_node,etcd,ex_lb,chrony] **********************************************************************************************************************************************************

TASK [Gathering Facts] **********************************************************************************************************************************************************************************
fatal: [10.10.11.49]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: root@10.10.11.49: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).", "unreachable": true}

PLAY RECAP **********************************************************************************************************************************************************************************************
10.10.11.49 : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0

The log suggests a permission problem, yet passwordless SSH login actually works fine when tested:

bash-5.1# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
/root/.ssh/id_rsa already exists.
Overwrite (y/n)?
bash-5.1# ssh-copy-id root@10.10.11.49
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
expr: warning: '^ERROR: ': using '^' as the first character
of a basic regular expression is not portable; it is ignored
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@10.10.11.49's password:

Number of key(s) added: 1

Now try logging into the machine, with: "ssh 'root@10.10.11.49'"
and check to make sure that only the key(s) you wanted were added.

bash-5.1# ssh root@10.10.11.49
root@10.10.11.49's password:

The related configuration files also have normal permissions:

[root@node01 kubeasz]# ll ~/.ssh
total 16
-rw------- 1 root root 1752 Mar 22 14:25 authorized_keys
-rw------- 1 root root 2602 Mar 22 14:25 id_rsa
-rw-r--r-- 1 root root 567 Mar 22 14:25 id_rsa.pub
-rw-r--r-- 1 root root 1295 Mar 22 13:39 known_hosts

It is unclear where exactly the problem lies, so following reference [2], switch to using a username and password instead.

Configure the username and password inside the container; the check passes:

bash-5.1# vi /etc/ansible/hosts
[webservers]
10.10.11.49

[webservers:vars]
ansible_ssh_pass='******'
ansible_ssh_user='root'

bash-5.1# ansible webservers -m ping
10.10.11.49 | SUCCESS => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": false,
"ping": "pong"
}

Modify the clusters/default/hosts file used for the cluster installation and add the same username/password configuration:

[etcd]
10.10.11.49

[etcd:vars]
ansible_ssh_pass='******'
ansible_ssh_user='root'

# master node(s)
[kube_master]
10.10.11.49

[kube_master:vars]
ansible_ssh_pass='******'
ansible_ssh_user='root'


# work node(s)
[kube_node]
10.10.11.49

[kube_node:vars]
ansible_ssh_pass='******'
ansible_ssh_user='root'

Run the command again; it complains that the sshpass tool is missing:

[root@node01 kubeasz]# docker exec -it kubeasz ezctl setup default all
ansible-playbook -i clusters/default/hosts -e @clusters/default/config.yml playbooks/90.setup.yml
2023-03-22 07:35:46 INFO cluster:default setup step:all begins in 5s, press any key to abort:

PLAY [kube_master,kube_node,etcd,ex_lb,chrony] **********************************************************************************************************************************************************

TASK [Gathering Facts] **********************************************************************************************************************************************************************************
fatal: [10.10.11.49]: FAILED! => {"msg": "to use the 'ssh' connection type with passwords, you must install the sshpass program"}

PLAY RECAP **********************************************************************************************************************************************************************************************
10.10.11.49 : ok=0 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0

Install the sshpass dependency:

bash-5.1# apk add sshpass
fetch https://dl-cdn.alpinelinux.org/alpine/v3.16/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.16/community/x86_64/APKINDEX.tar.gz
(1/1) Installing sshpass (1.09-r0)
Executing busybox-1.35.0-r17.trigger
OK: 21 MiB in 47 packages

Run the command again:

[root@node01 kubeasz]# docker exec -it kubeasz ezctl setup default all
ansible-playbook -i clusters/default/hosts -e @clusters/default/config.yml playbooks/90.setup.yml
2023-03-22 07:36:37 INFO cluster:default setup step:all begins in 5s, press any key to abort:

...

TASK [kube-node : 轮询等待kube-proxy启动] *********************************************************************************************************************************************************************
changed: [10.10.11.49]
FAILED - RETRYING: 轮询等待kubelet启动 (4 retries left).
FAILED - RETRYING: 轮询等待kubelet启动 (3 retries left).
FAILED - RETRYING: 轮询等待kubelet启动 (2 retries left).
FAILED - RETRYING: 轮询等待kubelet启动 (1 retries left).

TASK [kube-node : 轮询等待kubelet启动] ************************************************************************************************************************************************************************
fatal: [10.10.11.49]: FAILED! => {"attempts": 4, "changed": true, "cmd": "systemctl is-active kubelet.service", "delta": "0:00:00.014621", "end": "2023-03-22 15:42:07.230186", "msg": "non-zero return code", "rc": 3, "start": "2023-03-22 15:42:07.215565", "stderr": "", "stderr_lines": [], "stdout": "activating", "stdout_lines": ["activating"]}

PLAY RECAP **********************************************************************************************************************************************************************************************
10.10.11.49 : ok=85 changed=78 unreachable=0 failed=1 skipped=123 rescued=0 ignored=0
localhost : ok=33 changed=30 unreachable=0 failed=0 skipped=11 rescued=0 ignored=0

The kubelet stage fails; check the kubelet service:

[root@node01 log]# service kubelet status -l
Redirecting to /bin/systemctl status -l kubelet.service
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Wed 2023-03-22 15:56:31 CST; 1s ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Process: 147581 ExecStart=/opt/kube/bin/kubelet --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///run/containerd/containerd.sock --hostname-override=10.10.11.49 --kubeconfig=/etc/kubernetes/kubelet.kubeconfig --root-dir=/var/lib/kubelet --v=2 (code=exited, status=1/FAILURE)
Main PID: 147581 (code=exited, status=1/FAILURE)

Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.719832 147581 manager.go:228] Version: {KernelVersion:3.10.0-862.11.6.el7.x86_64 ContainerOsVersion:CentOS Linux 7 (Core) DockerVersion: DockerAPIVersion: CadvisorVersion: CadvisorRevision:}
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.720896 147581 server.go:659] "--cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /"
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.721939 147581 container_manager_linux.go:267] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.722392 147581 container_manager_linux.go:272] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName:
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.722503 147581 topology_manager.go:134] "Creating topology manager with policy per scope" topologyPolicyName="none" topologyScopeName="container"
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.722609 147581 container_manager_linux.go:308] "Creating device plugin manager"
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.722689 147581 manager.go:125] "Creating Device Plugin manager" path="/var/lib/kubelet/device-plugins/kubelet.sock"
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.722763 147581 server.go:66] "Creating device plugin registration server" version="v1beta1" socket="/var/lib/kubelet/device-plugins/kubelet.sock"
Mar 22 15:56:31 node01 kubelet[147581]: I0322 15:56:31.722905 147581 state_mem.go:36] "Initialized new in-memory state store"
Mar 22 15:56:31 node01 kubelet[147581]: E0322 15:56:31.726502 147581 run.go:74] "command failed" err="failed to run Kubelet: validate service connection: CRI v1 runtime API is not implemented for endpoint \"unix:///run/containerd/containerd.sock\": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService"

Based on the error in the log and reference [3], deleting the /etc/containerd/config.toml file and restarting containerd is enough:

mv /etc/containerd/config.toml /root/config.toml.bak
systemctl restart containerd

Run the command again; this time calico-node fails to start, with the following events:

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 41s default-scheduler Successfully assigned kube-system/calico-node-rqpjm to 10.10.11.49
Normal Pulling 20s (x2 over 31s) kubelet Pulling image "easzlab.io.local:5000/calico/cni:v3.23.5"
Warning Failed 19s (x2 over 31s) kubelet Failed to pull image "easzlab.io.local:5000/calico/cni:v3.23.5": rpc error: code = Unknown desc = failed to pull and unpack image "easzlab.io.local:5000/calico/cni:v3.23.5": failed to resolve reference "easzlab.io.local:5000/calico/cni:v3.23.5": failed to do request: Head "https://easzlab.io.local:5000/v2/calico/cni/manifests/v3.23.5": http: server gave HTTP response to HTTPS client
Warning Failed 19s (x2 over 31s) kubelet Error: ErrImagePull
Normal BackOff 5s (x2 over 30s) kubelet Back-off pulling image "easzlab.io.local:5000/calico/cni:v3.23.5"
Warning Failed 5s (x2 over 30s) kubelet Error: ImagePullBackOff

Check the Docker-level configuration and test pulling the image; it works:

[root@node01 ~]# cat /etc/docker/daemon.json
{
"max-concurrent-downloads": 10,
"insecure-registries": ["easzlab.io.local:5000"],
"log-driver": "json-file",
"log-level": "warn",
"log-opts": {
"max-size": "10m",
"max-file": "3"
},
"data-root":"/var/lib/docker"
}

[root@node01 log]# docker pull easzlab.io.local:5000/calico/cni:v3.23.5
v3.23.5: Pulling from calico/cni
Digest: sha256:9c5055a2b5bc0237ab160aee058135ca9f2a8f3c3eee313747a02edcec482f29
Status: Image is up to date for easzlab.io.local:5000/calico/cni:v3.23.5
easzlab.io.local:5000/calico/cni:v3.23.5

Check the containerd level and test pulling the image; it also works:

[root@node01 log]# ctr image pull --plain-http=true easzlab.io.local:5000/calico/cni:v3.23.5
easzlab.io.local:5000/calico/cni:v3.23.5: resolved |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:9c5055a2b5bc0237ab160aee058135ca9f2a8f3c3eee313747a02edcec482f29: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:cc0e45adf05a30a90384ba7024dbabdad9ae0bcd7b5a535c28dede741298fea3: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:47c5dbbec31222325790ebad8c07d270a63689bd10dc8f54115c65db7c30ad1f: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:8efc3d73e2741a93be09f68c859da466f525b9d0bddb1cd2b2b633f14f232941: done |++++++++++++++++++++++++++++++++++++++|
config-sha256:1c979d623de9aef043cb4ff489da5636d61c39e30676224af0055240e1816382: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:4c98a4f67c5a7b1058111d463051c98b23e46b75fc943fc2535899a73fc0c9f1: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:51729c6e2acda05a05e203289f5956954814d878f67feb1a03f9941ec5b4008b: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:050b055d5078c5c6ad085d106c232561b0c705aa2173edafd5e7a94a1e908fc5: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:7430548aa23e56c14da929bbe5e9a2af0f9fd0beca3bd95e8925244058b83748: done |++++++++++++++++++++++++++++++++++++++|
elapsed: 3.1 s total: 103.0 (33.2 MiB/s)
unpacking linux/amd64 sha256:9c5055a2b5bc0237ab160aee058135ca9f2a8f3c3eee313747a02edcec482f29...
done: 6.82968396s

Following reference [4], check the containerd configuration and add the private registry settings:

[root@node01 ~]# containerd config default > /etc/containerd/config.toml

[root@node01 ~]# vim /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = ""

[plugins."io.containerd.grpc.v1.cri".registry.auths]

[plugins."io.containerd.grpc.v1.cri".registry.configs]

[plugins."io.containerd.grpc.v1.cri".registry.headers]

[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."easzlab.io.local:5000"]
endpoint = ["http://easzlab.io.local:5000"]

[root@node01 ~]# service containerd restart

Check the Pod status again; now several Pods are stuck in ContainerCreating:

[root@node01 ~]# kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-89b744d6c-klzwh 1/1 Running 0 5m35s
kube-system calico-node-wmvff 1/1 Running 0 5m35s
kube-system coredns-6665999d97-mp7xc 0/1 ContainerCreating 0 5m35s
kube-system dashboard-metrics-scraper-57566685b4-8q5fm 0/1 ContainerCreating 0 5m35s
kube-system kubernetes-dashboard-57db9bfd5b-h6jp4 0/1 ContainerCreating 0 5m35s
kube-system metrics-server-6bd9f986fc-njpnj 0/1 ContainerCreating 0 5m35s
kube-system node-local-dns-wz9bg 1/1 Running 0 5m31s

Describe one of them:

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 6m7s default-scheduler 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
Normal Scheduled 5m47s default-scheduler Successfully assigned kube-system/coredns-6665999d97-mp7xc to 10.10.11.49
Warning FailedCreatePodSandBox 5m46s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "072c164d79f4874a8d851d36115ea04b75a2155dae3cecdc764e923c9f38f86b": plugin type="calico" failed (add): failed to find plugin "calico" in path [/opt/cni/bin]
Normal SandboxChanged 33s (x25 over 5m46s) kubelet Pod sandbox changed, it will be killed and re-created.

The events show that the calico CNI plugin binary is missing from /opt/cni/bin. After copying the plugins there manually, check the Pod status again:

[root@node01 bin]# cd /opt/cni/bin/
[root@node01 bin]# chmod +x *
[root@node01 bin]# ll -h
total 186M
-rwxr-xr-x 1 root root 3.7M Mar 22 17:46 bandwidth
-rwxr-xr-x 1 root root 56M Mar 22 17:46 calico
-rwxr-xr-x 1 root root 56M Mar 22 17:46 calico-ipam
-rwxr-xr-x 1 root root 2.4M Mar 22 17:46 flannel
-rwxr-xr-x 1 root root 3.1M Mar 22 17:46 host-local
-rwxr-xr-x 1 root root 56M Mar 22 17:46 install
-rwxr-xr-x 1 root root 3.2M Mar 22 17:46 loopback
-rwxr-xr-x 1 root root 3.6M Mar 22 17:46 portmap
-rwxr-xr-x 1 root root 3.3M Mar 22 17:46 tuning

[root@node01 bin]# kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-89b744d6c-mpfgq 1/1 Running 0 37m
kube-system calico-node-h9sm2 1/1 Running 0 37m
kube-system coredns-6665999d97-8pdbd 1/1 Running 0 37m
kube-system dashboard-metrics-scraper-57566685b4-c2l8w 1/1 Running 0 37m
kube-system kubernetes-dashboard-57db9bfd5b-74lmb 1/1 Running 0 37m
kube-system metrics-server-6bd9f986fc-d9crl 1/1 Running 0 37m
kube-system node-local-dns-kvgv6 1/1 Running 0 37m

Deployment complete.

References

1. https://github.com/easzlab/kubeasz/blob/master/docs/setup/quickStart.md

2. https://www.jianshu.com/p/c48b4a24c7d4

3. https://www.cnblogs.com/immaxfang/p/16721407.html

4. https://github.com/containerd/containerd/issues/4938

Background

Anyone doing Go development is familiar with the GoLand IDE, but since most people's machines have limited resources, especially company-issued ones, running GoLand locally is often less than smooth.

If the company has plenty of server resources, running GoLand as a container on a server is a nice solution. Searching with this idea in mind [1], it turns out JetBrains really does provide a containerized version of GoLand, accessible both from a browser and from a native client. Below are the configuration steps for an intranet environment and the pitfalls encountered.

Notes:

  1. The steps below also apply to other JetBrains IDEs, such as IDEA.
  2. Licensing/activation of GoLand is out of scope for this tutorial; please handle it yourself.

Obtaining the image

The GoLand web version is a Docker image provided officially by JetBrains [2], so the prerequisite for an intranet setup is to pull the required image from the internet first, then export it and copy it into the intranet:

docker pull registry.jetbrains.team/p/prj/containers/projector-goland
docker save -o projector-goland.tar registry.jetbrains.team/p/prj/containers/projector-goland

Note: if you cannot pull the official image, you can reply "docker goland" in the WeChat official account backend to get the GoLand web image.

Starting the service

Once you have the image, find a server or VM with Docker installed and start it with docker run:

docker run -itd \
-u root \
-p 8887:8887 \
--net=host \
--privileged \
-v /home/admin/goland-dir:/root \
-v /etc/localtime:/etc/localtime \
-v /home/admin/goland-dir/sources.list:/etc/apt/sources.list \
--name goland \
--restart always \
registry.jetbrains.team/p/prj/containers/projector-goland

(Important) Notes on some of the parameters:

  1. User: optional. By default the container starts as the non-root user projector-user; starting as root here avoids permission problems in later steps.
  2. Host network: optional. Convenient for pulling code through a proxy; without a proxy you can also download the code from the internet first.
  3. Privileged mode: optional. Convenient for debugging; without it, debugging directly in GoLand reports permission errors.
  4. Mount 1: required. Under the default user, mount /home/projector-user locally; when running as root, mount the /root directory instead.
  5. Mount 2: optional. Keeps the container time consistent with the host.
  6. Mount 3: optional. Configures an intranet package source, making it easy to install gcc and other build dependencies.

Browser access

Once the container is up, open http://x.x.x.x:8887 in a browser to log in to the GoLand web UI.

Native client access

If you prefer not to use a browser, JetBrains also provides a native client; download it from [3], open it and enter the server address.

Project import example

Taking the Kubernetes source code as an example, log in to the container and clone the kubernetes repository with git:

projector-user@storage:~/go/src/github.com$ git clone https://github.com/kubernetes/kubernetes.git
Cloning into 'kubernetes'...
fatal: unable to access 'https://github.com/kubernetes/kubernetes.git/': server certificate verification failed. CAfile: none CRLfile: none

The clone fails with a CA certificate problem; fix it with the following command:

git config --global http.sslVerify false

The clone fails again:

projector-user@storage:~/go/src/github.com$ git clone https://github.com/kubernetes/kubernetes.git
Cloning into 'kubernetes'...
fatal: unable to update url base from redirection:
asked for: https://github.com/kubernetes/kubernetes.git/info/refs?service=git-upload-pack
redirect: http://x.x.x.x/proxy.html?template=default&tabs=pwd&vlanid=0&url=https://github.com%2Fkubernetes%2Fkubernetes.git%2Finfo%2Frefs%3Fservice%3Dgit-upload-pack

This is because no proxy is configured; fix it with the following commands:

Set:   git config --global http.proxy http://user:password@x.x.x.xx:8080
Check: git config --get --global http.proxy

Note: if the password contains special characters, escape them first.

Try the clone again; this time it succeeds:

projector-user@storage:~/go/src/github.com$ git clone https://github.com/kubernetes/kubernetes.git
Cloning into 'kubernetes'...
remote: Enumerating objects: 1258541, done.
remote: Counting objects: 100% (316/316), done.
remote: Compressing objects: 100% (201/201), done.
remote: Total 1258541 (delta 131), reused 150 (delta 111), pack-reused 1258225
Receiving objects: 100% (1258541/1258541), 773.55 MiB | 805.00 KiB/s, done.
Resolving deltas: 100% (906256/906256), done.
Checking out files: 100% (23196/23196), done.

Common issues

1. Copy and paste issue

According to reference [4], this can be solved by setting the ORG_JETBRAINS_PROJECTOR_SERVER_SSL_PROPERTIES_PATH environment variable:

docker run -e ORG_JETBRAINS_PROJECTOR_SERVER_SSL_PROPERTIES_PATH=/root/ssl/ssl.properties ...

An example SSL properties file:

STORE_TYPE=JKS
FILE_PATH=/root/ssl/keystore
STORE_PASSWORD=xxx
KEY_PASSWORD=xxx
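
The keystore referenced by FILE_PATH can be generated with the JDK's keytool, for example as a self-signed certificate (a sketch; the passwords must match the properties file):

# Generate a self-signed JKS keystore for the projector server.
keytool -genkeypair -keyalg RSA -keysize 2048 -validity 365 \
  -alias projector -dname "CN=goland-server" \
  -keystore /root/ssl/keystore -storetype JKS \
  -storepass xxx -keypass xxx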

Check the startup log to confirm that SSL is configured correctly. As shown below, the line WebSocket SSL is enabled: /root/ssl/ssl.properties means it worked; from then on, access it in the browser via https://xxx:8887/?wss.

Found IDE: goland
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
[DEBUG] :: IdeState :: Starting attempts to Init ProjectorClassLoader
[DEBUG] :: IdeState :: Starting attempts to attach IJ injector agent
[DEBUG] :: IdeState :: Starting attempts to initialize IDEA: fix AA and disable smooth scrolling (at start)
[DEBUG] :: IdeState :: Starting attempts to Getting IDE colors
[DEBUG] :: ProjectorServer :: Daemon thread starts
[DEBUG] :: IdeState :: Starting attempts to search for editors
[INFO] :: ProjectorServer :: ProjectorServer is starting on host 0.0.0.0/0.0.0.0 and port 8887
[INFO] :: HttpWsServerBuilder :: WebSocket SSL is enabled: /root/ssl/ssl.properties
[INFO] :: HttpWsServer :: Server started on host 0.0.0.0/0.0.0.0 and port 8887
[DEBUG] :: IdeState :: "Init ProjectorClassLoader" is done
[DEBUG] :: IdeState :: "search for editors" is done

After logging in again, Ctrl+C and Ctrl+V work happily once more.

2. Custom keymap getting reset

According to reference [4], this can be solved by setting the environment variable ORG_JETBRAINS_PROJECTOR_SERVER_AUTO_KEYMAP=false:

docker run -e ORG_JETBRAINS_PROJECTOR_SERVER_AUTO_KEYMAP=false ...

After logging in again, the custom keymap no longer gets mysteriously reset.

3. Blurry global search results in the native client

The blurry part happens to be the searched string itself, so it is not a big problem in practice; if it bothers you, just use the browser for the time being (the problem does not occur there).

Update: this has been fixed in v1.0.2 of the client.

References

  1. https://fuckcloudnative.io/posts/run-jetbrains-ide-in-docker/
  2. https://github.com/JetBrains/projector-docker
  3. https://github.com/JetBrains/projector-client/releases
  4. https://jetbrains.github.io/projector-client/mkdocs/latest/ij_user_guide/server_customization/