2024-03-28 17:25:22 WARN DefaultConfig:206 - Disabling high-strength ciphers: cipher strengths apparently limited by JCE policy
2024-03-28 17:25:22 INFO TransportImpl:214 - Client identity string: SSH-2.0-SSHJ_0.27.0
2024-03-28 17:25:22 INFO TransportImpl:178 - Server identity string: SSH-2.0-OpenSSH_7.4
2024-03-28 17:25:23 ERROR TransportImpl:593 - Dying because - Invalid signature file digest for Manifest main attributes
java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
    at sun.security.util.SignatureFileVerifier.processImpl(SignatureFileVerifier.java:317)
    at sun.security.util.SignatureFileVerifier.process(SignatureFileVerifier.java:259)
    at java.util.jar.JarVerifier.processEntry(JarVerifier.java:323)
    at java.util.jar.JarVerifier.update(JarVerifier.java:234)
    at java.util.jar.JarFile.initializeVerifier(JarFile.java:394)
    at java.util.jar.JarFile.ensureInitialization(JarFile.java:632)
    at java.util.jar.JavaUtilJarAccessImpl.ensureInitialization(JavaUtilJarAccessImpl.java:69)
    at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:993)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:456)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at net.schmizz.sshj.common.KeyType$3.isMyType(KeyType.java:124)
    at net.schmizz.sshj.common.KeyType.fromKey(KeyType.java:288)
    at net.schmizz.sshj.transport.kex.AbstractDHG.next(AbstractDHG.java:82)
    at net.schmizz.sshj.transport.KeyExchanger.handle(KeyExchanger.java:364)
    at net.schmizz.sshj.transport.TransportImpl.handle(TransportImpl.java:503)
    at net.schmizz.sshj.transport.Decoder.decodeMte(Decoder.java:159)
    at net.schmizz.sshj.transport.Decoder.decode(Decoder.java:79)
    at net.schmizz.sshj.transport.Decoder.received(Decoder.java:231)
    at net.schmizz.sshj.transport.Reader.run(Reader.java:59)
2024-03-28 17:25:23 INFO TransportImpl:192 - Disconnected - UNKNOWN
2024-03-28 17:25:23 ERROR Promise:174 - <<kex done>> woke to: net.schmizz.sshj.transport.TransportException: Invalid signature file digest for Manifest main attributes
2024-03-28 17:25:23 ERROR matrix:573 - failed exec command ls /root/ on node 10.10.2.8
Based on the error message "Invalid signature file digest for Manifest main attributes", I searched for related material and tried the following approaches, none of which worked:
Problematic version:

[root@node1 1.0.0]# /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.402.b06-1.el7_9.x86_64/bin/jarsigner -verify bcprov-jdk15on-1.60.jar
jarsigner: java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
Newer version:

[root@node1 1.0.0]# /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.402.b06-1.el7_9.x86_64/bin/jarsigner -verify bcprov-jdk15on-1.69.jar
jar verified.

Warning: This jar contains entries whose certificate chain is invalid. Reason: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
The DSA signing key has a keysize of 1024 which is considered a security risk. This key size will be disabled in a future update.
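So bcprov-jdk15on-1.60.jar fails signature verification on this OpenJDK build while 1.69 verifies. A diagnostic sketch (the lib directory below is a placeholder, not the application's real path): loop jarsigner -verify over all jars on the classpath to find every archive this JDK rejects.

# A diagnostic sketch: find every jar whose signature this JDK rejects.
# /path/to/app/lib is a placeholder for the application's classpath directory.
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.402.b06-1.el7_9.x86_64
for jar in /path/to/app/lib/*.jar; do
    if ! "$JAVA_HOME"/bin/jarsigner -verify "$jar" >/dev/null 2>&1; then
        echo "signature verification failed: $jar"
    fi
done

Replacing the failing BouncyCastle jar with a release that verifies cleanly (such as 1.69 above) is the more common way out; stripping the META-INF signature entries is sometimes suggested, but is questionable for a JCE provider jar.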
[root@node1 kubesphere]# ./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz -h
Create a Kubernetes or KubeSphere cluster
Usage:
  kk create cluster [flags]

Flags:
  -a, --artifact string              Path to a KubeKey artifact
      --container-manager string     Container runtime: docker, crio, containerd and isula. (default "docker")
      --debug                        Print detailed information
      --download-cmd string          The user defined command to download the necessary binary files. The first param '%s' is output path, the second param '%s', is the URL (default "curl -L -o %s %s")
  -f, --filename string              Path to a configuration file
  -h, --help                         help for cluster
      --ignore-err                   Ignore the error message, remove the host which reported error and force to continue
      --namespace string             KubeKey namespace to use (default "kubekey-system")
      --skip-pull-images             Skip pre pull images
      --skip-push-images             Skip pre push images
      --with-kubernetes string       Specify a supported version of kubernetes
      --with-kubesphere              Deploy a specific version of kubesphere (default v3.4.1)
      --with-local-storage           Deploy a local PV provisioner
      --with-packages                install operation system packages by artifact
      --with-security-enhancement    Security enhancement
  -y, --yes
[root@node1 kubesphere]# ./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz --with-kubesphere 3.4.0
W1205 00:36:57.266052    1453 utils.go:69] The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.96.0.10]; the provided value is: [169.254.25.10]
[init] Using Kubernetes version: v1.23.15
[preflight] Running pre-flight checks
        [WARNING FileExisting-socat]: socat not found in system path
        [WARNING SystemVerification]: this Docker version is not on the list of validated versions: 24.0.6. Latest validated version: 20.10
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR FileExisting-conntrack]: conntrack not found in system path
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
00:36:58 CST stdout: [master]
[preflight] Running pre-flight checks
W1205 00:36:58.323079    1534 removeetcdmember.go:80] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] No etcd config found. Assuming external etcd
[reset] Please, manually reset etcd to prevent further issues
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
W1205 00:36:58.327376    1534 cleanupnode.go:109] [reset] Failed to evaluate the "/var/lib/kubelet" directory. Skipping its unmount and cleanup: lstat /var/lib/kubelet: no such file or directory
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
[reset] Deleting contents of stateful directories: [/var/lib/dockershim /var/run/kubernetes /var/lib/cni]
The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d
The reset process does not reset or clean up iptables rules or IPVS tables. If you wish to reset iptables, you must do so manually by using the "iptables" command.
If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar) to reset your system's IPVS tables.
The reset process does not clean your kubeconfig files and you must remove them manually. Please, check the contents of the $HOME/.kube/config file.
00:36:58 CST message: [master]
init kubernetes cluster failed: Failed to exec command: sudo -E /bin/bash -c "/usr/local/bin/kubeadm init --config=/etc/kubernetes/kubeadm-config.yaml --ignore-preflight-errors=FileExisting-crictl,ImagePull"
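The fatal preflight error above is the missing conntrack binary (socat is only a warning). A minimal fix sketch, assuming CentOS 7 nodes with a working yum repository: install the missing tools on every node and rerun the kk command.

# Install the binaries the kubeadm preflight check complained about, then rerun kk.
yum install -y conntrack-tools socat
./kk create cluster -f config-sample.yaml -a kubesphere.tar.gz --with-kubesphere 3.4.0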
#####################################################
###              Welcome to KubeSphere!            ###
#####################################################
Console: http://10.10.10.30:30880
Account: admin
Password: P@88w0rd

NOTES:
  1. After you log into the console, please check the monitoring status of service components in "Cluster Management". If any service is not ready, please wait patiently until all components are up and running.
  2. Please change the default password after login.
Normal   Scheduled               41m                   default-scheduler  Successfully assigned xx/xxx-v3falue7-6f59dd5766-npd2x to node1
Warning  FailedCreatePodSandBox  26m (x301 over 41m)   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "xxx-v3falue7-6f59dd5766-npd2x": Error response from daemon: start failed: : pipe2: too many open files: unknown
Normal   SandboxChanged          66s (x808 over 41m)   kubelet            Pod sandbox changed, it will be killed and re-created.
Since Docker is used as the CRI, check the Docker logs first:
time="2023-11-13T14:56:05.734166795+08:00" level=info msg="/etc/resolv.conf does not exist" time="2023-11-13T14:56:05.734193544+08:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]" time="2023-11-13T14:56:05.734202079+08:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]" time="2023-11-13T14:56:05.740830618+08:00" level=error msg="stream copy error: reading from a closed fifo" time="2023-11-13T14:56:05.740850537+08:00" level=error msg="stream copy error: reading from a closed fifo" time="2023-11-13T14:56:05.751993232+08:00" level=error msg="1622cfb1c90d926b867db7bcb0a86498ccad59db81223e861ac515ec75ed7c27 cleanup: failed to delete container from containerd: no such container" time="2023-11-13T14:56:05.752024358+08:00" level=error msg="Handler for POST /v1.41/containers/1622cfb1c90d926b867db7bcb0a86498ccad59db81223e861ac515ec75ed7c27/start returned error: start failed: : fork/exec /usr/bin/containerd-shim-runc-v2: too many open files: unknown"
From the Docker logs, the error is: fork/exec /usr/bin/containerd-shim-runc-v2: too many open files: unknown, which basically confirms that **containerd has too many open file handles**.
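A quick way to confirm this on the node (a sketch, assuming a single containerd daemon per node) is to count the file descriptors the daemon currently holds and compare them against its soft limit:

# Count the file descriptors held by the containerd daemon (assumes one daemon).
ls /proc/$(pidof containerd)/fd | wc -l
# Compare against the limits of the running process.
grep "Max open files" /proc/$(pidof containerd)/limits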
[root@node1 ~]# systemctl status containerd.service
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: active (running) since Sat 2023-11-01 11:02:14 CST; 1 weeks 10 days ago
     Docs: https://containerd.io
 Main PID: 1999 (containerd)
    Tasks: 1622
   Memory: 3.5G
   CGroup: /system.slice/containerd.service
           ├─ 999 /usr/bin/containerd

cat /proc/999/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             319973               319973               processes
Max open files            1024                 524288               files
[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/containerd
KillMode=process
Delegate=yes
LimitNOFILE=1048576
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
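If the unit file sets LimitNOFILE=1048576 as shown above but the running daemon still reports a soft limit of 1024, the service most likely has not picked up the change (or a drop-in overrides it). A sketch of applying and verifying the limit:

# Reload systemd so the updated unit is read, restart containerd, and confirm the
# running daemon picked up the new open-files limit.
systemctl daemon-reload
systemctl restart containerd.service
grep "Max open files" /proc/$(pidof containerd)/limits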
iptables changelog:

* Wed Jul 11 2018 Phil Sutter - 1.8.0-1
- New upstream version 1.8.0
- Drop compat sub-package
- Use nft tool versions, drop legacy ones
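Starting with iptables 1.8, the iptables command can be a frontend for the nf_tables backend instead of the legacy one, which matters when the host and a container image (such as calico/node) end up using different backends. A quick check of which backend a node uses (a sketch):

# Recent iptables builds print the backend in the version string, e.g.
# "iptables v1.8.2 (nf_tables)" or "iptables v1.8.2 (legacy)".
iptables --version
# On distributions that use alternatives, this shows which variant is selected.
update-alternatives --display iptables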
[root@node1 ~]# kubectl get pod -n kube-system -owide
NAME                                      READY   STATUS              RESTARTS   AGE    IP               NODE
calico-kube-controllers-cd96b6c89-bpjp6   1/1     Running             0          40h    10.10.0.1        node3
calico-node-ffsz8                         1/1     Running             0          14s    10.10.0.1        node3
calico-node-nsmwl                         1/1     Running             0          14s    10.10.0.2        node2
calico-node-w4ngt                         1/1     Running             0          14s    10.10.0.1        node1
coredns-55c8f5fd88-hw76t                  1/1     Running             1          260d   192.168.135.55   node3
xxx-55c8f5fd88-vqwbz                      1/1     ContainerCreating   1          319d   192.168.104.22   node2
Analysis
Check with kubectl describe
[root@node1 ~]# kubectl describe pod -n xxx xxx
Events:
  Type     Reason                  Age                 From               Message
  ----     ------                  ----                ----               -------
  Normal   Scheduled               52m                 default-scheduler  Successfully assigned xxx/xxx to node1
  Warning  FailedCreatePodSandBox  52m                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "xxx" network for pod "xxx": networkPlugin cni failed to set up pod "xxx" network: connection is unauthorized: Unauthorized, failed to clean up sandbox container "xxx" network for pod "xxx": networkPlugin cni failed to teardown pod "xxx" network: error getting ClusterInformation: connection is unauthorized: Unauthorized]
  Normal   SandboxChanged          50m (x10 over 52m)  kubelet            Pod sandbox changed, it will be killed and re-created.
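The Unauthorized error comes from the Calico CNI plugin talking to the API server with the credentials in its CNI kubeconfig. One way to reproduce it directly from the node (a sketch, assuming the default Calico install path):

# Use the same kubeconfig the CNI plugin uses; an expired token reproduces the
# "Unauthorized" error seen in the pod events.
kubectl --kubeconfig=/etc/cni/net.d/calico-kubeconfig \
    get clusterinformations.crd.projectcalico.org default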
1. How can this feature be enabled / disabled in a live cluster?
   - Feature gate name: BoundServiceAccountTokenVolume
   - Components depending on the feature gate: kube-apiserver and kube-controller-manager
   - Will enabling / disabling the feature require downtime of the control plane? yes, need to restart kube-apiserver and kube-controller-manager.
   - Will enabling / disabling the feature require downtime or reprovisioning of a node? no.
2. Does enabling the feature change any default behavior?
   yes, pods' service account tokens will expire after 1 year by default and are not stored as Secrets any more.
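With BoundServiceAccountTokenVolume enabled, the token Calico wrote into the CNI kubeconfig eventually expires (after one year by default), and the stale file keeps producing Unauthorized until it is regenerated. A sketch for checking the expiry claim of that token (path assumed from a typical Calico install):

# Extract the token from the CNI kubeconfig, pad the JWT payload, and read its
# "exp" (expiry, Unix timestamp) claim.
TOKEN=$(awk '/token:/ {print $2}' /etc/cni/net.d/calico-kubeconfig)
echo "$TOKEN" | cut -d. -f2 \
    | awk '{ while (length($0) % 4) $0 = $0 "="; print }' \
    | base64 -d 2>/dev/null | grep -o '"exp":[0-9]*'

If the token has indeed expired, restarting the calico-node pod on the affected node typically rewrites /etc/cni/net.d/calico-kubeconfig with a fresh token.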
#!/bin/bash
# Add local user
# Either use the LOCAL_USER_ID if passed in at runtime or
# fallback
USER_ID=${LOCAL_USER_ID:-9001}
if [ "${RUN_AS_ROOT}" = "true" ]; then exec "$@" fi
echo "Starting with UID : $USER_ID" 1>&2 # Do not create mail box. /bin/sed -i 's/^CREATE_MAIL_SPOOL=yes/CREATE_MAIL_SPOOL=no/' /etc/default/useradd # Don't pass "-m" to useradd if the home directory already exists (which can occur if it was volume mounted in) otherwise it will fail. if [[ ! -d "/home/user" ]]; then /usr/sbin/useradd -m -U -s /bin/bash -u $USER_ID user else /usr/sbin/useradd -U -s /bin/bash -u $USER_ID user fi
...
docker build --pull -t calico/node:latest-amd64 . --build-arg BIRD_IMAGE=calico/bird:v0.3.3-151-g767b5389-amd64 --build-arg QEMU_IMAGE=calico/go-build:v0.40 --build-arg GIT_VERSION= -f ./Dockerfile.amd64
Sending build context to Docker daemon 66.3MB
Step 1/41 : ARG ARCH=x86_64
Step 2/41 : ARG GIT_VERSION=unknown
Step 3/41 : ARG IPTABLES_VER=1.8.2-16
Step 4/41 : ARG RUNIT_VER=2.1.2
Step 5/41 : ARG BIRD_IMAGE=calico/bird:latest
Step 6/41 : FROM calico/bpftool:v5.3-amd64 as bpftool
...
Step 12/41 : ARG CENTOS_MIRROR_BASE_URL=https://mirrors.aliyun.com/centos-vault/8.1.1911
---> Using cache
---> a96f716928d7
...
Step 17/41 : RUN mv /etc/yum.repos.d /etc/yum.repo.d-bk && mkdir -p /etc/yum.repos.d && mv /centos.repo /etc/yum.repos.d && yum clean all && yum makecache && dnf install -y 'dnf-command(config-manager)' && yum install -y rpm-build yum-utils make && yum install -y wget glibc-static gcc && yum -y update-minimal --security --sec-severity=Important --sec-severity=Critical
---> Using cache
---> a9ffd418a7a4
...
Step 24/41 : FROM registry.access.redhat.com/ubi8/ubi-minimal:8.1-407
8.1-407: Pulling from ubi8/ubi-minimal
Digest: sha256:01b8fb7b3ad16a575651a4e007e8f4d95b68f727b3a41fc57996be9a790dc4fa
Status: Image is up to date for registry.access.redhat.com/ubi8/ubi-minimal:8.1-407
---> 6ce38bb5210c
...
Step 39/41 : COPY dist/bin/calico-node-amd64 /bin/calico-node
---> Using cache
---> 916fbf133fb0
Step 40/41 : COPY --from=bpftool /bpftool /bin
---> Using cache
---> f797db5c4eb4
Step 41/41 : CMD ["start_runit"]
---> Using cache
---> fe6496ded4a6
[Warning] One or more build-args [QEMU_IMAGE] were not consumed
Successfully built fe6496ded4a6
Successfully tagged calico/node:latest-amd64
touch .calico_node.created-amd64
make: Leaving directory `/home/go/gopath/src/github.com/projectcalico/node'
Check the built images:
[root@node01 github.com]# docker images
REPOSITORY    TAG            IMAGE ID       CREATED       SIZE
calico/node   latest-amd64   77f4ca933207   7 hours ago   264MB
<none>        <none>         420e5252b060   7 hours ago   633MB
[root@node01 ~]# kubectl get svc -A
NAMESPACE     NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
default       kubernetes           ClusterIP   10.96.0.1      <none>        443/TCP    2d
kube-system   grafana-service      ClusterIP   10.96.78.163   <none>        3000/TCP   2d
kube-system   grafana-service111   ClusterIP   10.96.52.101   <none>        3000/TCP   13s
[root@node01 ~]# kubectl get ep -A
NAMESPACE     NAME                 ENDPOINTS                           AGE
default       kubernetes           10.10.72.15:6443                    2d
kube-system   grafana-service      10.78.104.6:3000,10.78.135.5:3000   2d
kube-system   grafana-service111   <none>                              18s
Exec into a business pod and send a request to grafana-service111; the request hangs and eventually times out:
[root@node01 ~]# kubectl exec -it -n kube-system influxdb-rs1-5bdc67f4cb-lnfgt bash
root@influxdb-rs1-5bdc67f4cb-lnfgt:/# time curl http://10.96.52.101:3000
curl: (7) Failed to connect to 10.96.52.101 port 3000: Connection timed out
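Note that grafana-service111 has no endpoints, so with kube-proxy in iptables mode the connection would normally be rejected immediately (connection refused) rather than time out. A check sketch (run on the node hosting the client pod, assuming iptables-mode kube-proxy):

# kube-proxy (iptables mode) installs a REJECT rule for Services without endpoints;
# if the rule is present but the request still times out, the packet never reaches it.
iptables -t filter -S KUBE-SERVICES | grep 10.96.52.101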
Problematic environment:

[root@node4 ~]# iptables -t filter -S cali-FORWARD
-N cali-FORWARD
-A cali-FORWARD -m comment --comment "cali:vjrMJCRpqwy5oRoX" -j MARK --set-xmark 0x0/0xe0000
-A cali-FORWARD -m comment --comment "cali:A_sPAO0mcxbT9mOV" -m mark --mark 0x0/0x10000 -j cali-from-hep-forward
-A cali-FORWARD -i cali+ -m comment --comment "cali:8ZoYfO5HKXWbB3pk" -j cali-from-wl-dispatch
-A cali-FORWARD -o cali+ -m comment --comment "cali:jdEuaPBe14V2hutn" -j cali-to-wl-dispatch
-A cali-FORWARD -m comment --comment "cali:12bc6HljsMKsmfr-" -j cali-to-hep-forward
-A cali-FORWARD -m comment --comment "cali:MH9kMp5aNICL-Olv" -m comment --comment "Policy explicitly accepted packet." -m mark --mark 0x10000/0x10000 -j ACCEPT   // The problem is this last rule; newer versions of Calico moved it to the FORWARD chain
Normal environment:

[root@node01 ~]# iptables -t filter -S cali-FORWARD
-N cali-FORWARD
-A cali-FORWARD -m comment --comment "cali:vjrMJCRpqwy5oRoX" -j MARK --set-xmark 0x0/0xe0000
-A cali-FORWARD -m comment --comment "cali:A_sPAO0mcxbT9mOV" -m mark --mark 0x0/0x10000 -j cali-from-hep-forward
-A cali-FORWARD -i cali+ -m comment --comment "cali:8ZoYfO5HKXWbB3pk" -j cali-from-wl-dispatch
-A cali-FORWARD -o cali+ -m comment --comment "cali:jdEuaPBe14V2hutn" -j cali-to-wl-dispatch
-A cali-FORWARD -m comment --comment "cali:12bc6HljsMKsmfr-" -j cali-to-hep-forward
-A cali-FORWARD -m comment --comment "cali:NOSxoaGx8OIstr1z" -j cali-cidr-block
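In the newer Calico version the "Policy explicitly accepted packet" ACCEPT rule should therefore appear in the FORWARD chain itself rather than in cali-FORWARD. A quick way to confirm this on the normal environment (a sketch):

# On newer Calico versions the ACCEPT for policy-marked packets lives directly in FORWARD.
iptables -t filter -S FORWARD | grep "Policy explicitly accepted packet"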
Below is a record of the same test on the latest Kubernetes cluster, which can be compared against the abnormal environment.
Simulate a business request pod:
[root@node01 home]# kubectl run busybox --image=busybox-curl:v1.0 --image-pull-policy=IfNotPresent -- sleep 300000
pod/busybox created
[root@node01 home]# kubectl get pod -A -owide NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE default busybox 1/1 Running 0 14h 10.78.153.73 10.10.11.49