## Problem Background

Inside a K8S cluster, InfluxDB monitoring data retrieval became abnormal; eventually CPU, memory, and disk usage could not be retrieved at all.
```
Metric            Usage
CPU (cores)       3%
Memory (GB)       18%
Disk space (GB)   0%

Metric            Usage
CPU (cores)       7%
Memory (GB)       18%
Disk space (GB)   1%

Metric            Usage
CPU (cores)       0%
Memory (GB)       0%
Disk space (GB)   0%
...
```
The InfluxDB monitoring architecture follows the diagram from [1], with the Load Balancer implemented by nginx:
```
        ┌─────────────────┐
        │writes & queries │
        └─────────────────┘
                 │
                 ▼
         ┌───────────────┐
         │               │
┌────────│ Load Balancer │─────────┐
│        │               │         │
│        └──────┬─┬──────┘         │
│               │ │                │
│               │ │                │
│        ┌──────┘ └────────┐       │
│        │ ┌─────────────┐ │       │┌──────┐
│        │ │/write or UDP│ │       ││/query│
│        ▼ └─────────────┘ ▼       │└──────┘
│  ┌──────────┐      ┌──────────┐  │
│  │ InfluxDB │      │ InfluxDB │  │
│  │ Relay    │      │ Relay    │  │
│  └──┬────┬──┘      └────┬──┬──┘  │
│     │    │              │  │     │
│     │  ┌─┼──────────────┘  │     │
│     │  │ └──────────────┐  │     │
│     ▼  ▼                ▼  ▼     │
│  ┌──────────┐      ┌──────────┐  │
│  │          │      │          │  │
└─▶│ InfluxDB │      │ InfluxDB │◀─┘
   │          │      │          │
   └──────────┘      └──────────┘
```
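For concreteness, here is a minimal sketch of the nginx split implied by the diagram: writes fan out through the relays, queries go straight to the InfluxDB instances. The upstream service names and ports reuse the ones listed in this cluster below, but the configuration itself is an assumption, not taken from the deployment:

```nginx
# Sketch only: route /write through the relays, /query directly to InfluxDB.
# Service names and ports match the kubectl output below; everything else is assumed.
upstream influxdb_relay {
    server influxdb-relay-service:9096;
}
upstream influxdb {
    server influxdb-service:8086;
}
server {
    listen 7076;
    location /write {
        proxy_pass http://influxdb_relay;
    }
    location /query {
        proxy_pass http://influxdb;
    }
}
```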
## Root Cause Analysis

Since the data is served out of the InfluxDB database, the first step is to determine whether the anomaly lies on the request path or in the InfluxDB database itself having no data:
```
# Find the influxdb-nginx service
kubectl get svc -n kube-system -owide
NAME                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE   SELECTOR
grafana-service          ClusterIP   10.96.177.245   <none>        3000/TCP   21d   app=grafana
heapster                 ClusterIP   10.96.239.225   <none>        80/TCP     21d   app=heapster
influxdb-nginx-service   ClusterIP   10.96.170.72    <none>        7076/TCP   21d   app=influxdb-nginx
influxdb-relay-service   ClusterIP   10.96.196.45    <none>        9096/TCP   21d   app=influxdb-relay
influxdb-service         ClusterIP   10.96.127.45    <none>        8086/TCP   21d   app=influxdb

# Check from a cluster node whether the influxdb-nginx service responds
curl -i 10.96.170.72:7076/query
HTTP/1.1 401 Unauthorized
Server: nginx/1.17.2
```
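The 401 is itself informative: nginx forwarded the request and InfluxDB's authentication layer answered. To rule out authentication on the full path, the same probe can be repeated with credentials (a sketch; admin/xxx are the placeholder credentials used in the influx shell later):

```bash
# Authenticated probe through the same service; a 200 with a JSON body
# confirms the request path end to end (credentials are placeholders)
curl -i -G 'http://10.96.170.72:7076/query' -u admin:xxx \
     --data-urlencode 'q=SHOW DATABASES'
```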
As shown, requests sent to the influxdb-nginx service work: the request does reach the backend InfluxDB database. Next, confirm whether the InfluxDB database itself is missing data:
```
# Find the influxdb database pods
kubectl get pod -n kube-system -owide | grep influxdb
influxdb-nginx-4x8pr                 1/1   Running   3   21d   177.177.52.201    node3
influxdb-nginx-tpngh                 1/1   Running   6   21d   177.177.41.214    node1
influxdb-nginx-wh6kc                 1/1   Running   5   21d   177.177.250.180   node2
influxdb-relay-rs-65c94bbf5f-dp7s4   1/1   Running   2   21d   177.177.250.148   node2
influxdb-rs1-6ff9466d46-q6w5r        1/1   Running   3   21d   177.177.41.230    node1
influxdb-rs2-d6d6697f5-zzcnk         1/1   Running   3   21d   177.177.250.161   node2
influxdb-rs3-65ddfc7476-hxhr8        1/1   Running   4   21d   177.177.52.217    node3

# Exec into any influxdb container and start the interactive shell
kubectl exec -it -n kube-system influxdb-rs3-65ddfc7476-hxhr8 bash
root@influxdb-rs3-65ddfc7476-hxhr8:/# influx
Connected to http://localhost:8086 version 1.7.7
InfluxDB shell version: 1.7.7
> auth
username: admin
password: xxx
> use xxx;
Using database xxx
```
Run the same query the business layer issues, manually, in the influx interactive shell:
```
> select sum(value) from "cpu/node_capacity" where "type" = 'node' and "nodename" = 'node1' and time > now() - 2m
>
```
Indeed, no data comes back. If there is nothing within the last 2 minutes, what happens with a longer time window?
```
# Query with no time range restriction
> select sum(value) from "cpu/node_capacity"
name: cpu/node_capacity
time sum
---- ---
0    5301432000

# Query the last 72 minutes
> select sum(value) from "cpu/node_capacity" where "type" = 'node' and "nodename" = 'node1' and time > now() - 72m
name: cpu/node_capacity
time                sum
----                ---
1624348319900503945 72000

# Sleep 1min, then query the last 72 minutes again
> select sum(value) from "cpu/node_capacity" where "type" = 'node' and "nodename" = 'node1' and time > now() - 72m
name: cpu/node_capacity
>

# Query the last 73 minutes
> select sum(value) from "cpu/node_capacity" where "type" = 'node' and "nodename" = 'node1' and time > now() - 73m
name: cpu/node_capacity
time                sum
----                ---
1624348319900503945 72000
```
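A more direct way to pin down when writes stopped is to ask for the newest point rather than bisecting the time window by hand (a sketch; this query was not part of the original session):

```
# The timestamp of the most recent point tells you when the write stream ended
> select last(value) from "cpu/node_capacity" where "type" = 'node' and "nodename" = 'node1'
```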
The query results show that records do exist when no time range is applied, and the 72m/73m contrast explains the symptom: the newest point is pinned at timestamp 1624348319900503945, so after sleeping one minute the same 72-minute window slides past it, while a 73-minute window still covers it. In other words, data stopped being written at some point roughly 72 minutes earlier. Check the InfluxDB logs for anything related:
```
kubectl logs -n kube-system influxdb-rs3-65ddfc7476-hxhr8
ts=2021-06-22T09:56:49.658621Z lvl=warn msg="max-values-per-tag limit may be exceeded soon" log_id=0UYIcREl000 service=store perc=100% n=100000 max=100000 db_instance=xxx measurement=network/rx tag=pod_name
ts=2021-06-22T09:56:49.658702Z lvl=warn msg="max-values-per-tag limit may be exceeded soon" log_id=0UYIcREl000 service=store perc=100% n=100000 max=100000 db_instance=xxx measurement=network/rx_errors tag=pod_name
ts=2021-06-22T09:56:49.658815Z lvl=warn msg="max-values-per-tag limit may be exceeded soon" log_id=0UYIcREl000 service=store perc=100% n=100000 max=100000 db_instance=xxx measurement=network/tx tag=pod_name
ts=2021-06-22T09:56:49.658893Z lvl=warn msg="max-values-per-tag limit may be exceeded soon" log_id=0UYIcREl000 service=store perc=100% n=100000 max=100000 db_instance=xxx measurement=network/tx_errors tag=pod_name
ts=2021-06-22T09:56:49.659062Z lvl=warn msg="max-values-per-tag limit may be exceeded soon" log_id=0UYIcREl000 service=store perc=100% n=100003 max=100000 db_instance=xxx measurement=uptime tag=pod_name
```
Sure enough, there are plenty of warn logs reporting max-values-per-tag limit may be exceeded soon. The log fields show the parameter's default value is 100000 (max=100000), and the pod_name tag has reached it (n=100003 on the uptime measurement), which is the point at which InfluxDB starts rejecting points that carry new tag values. Searching turned up the issue that introduced this parameter [2]; the rationale, roughly translated: if a large amount of high-cardinality data is loaded by accident, InfluxDB can easily go OOM when that data is later deleted.
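To see which tags are responsible, the cardinality can be inspected from the influx shell; a sketch, with the measurement and tag names taken from the warnings above:

```
# Estimated number of series in the current database
> show series cardinality

# Exact number of values of the offending tag on one measurement
> show tag values exact cardinality from "uptime" with key = "pod_name"
```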
Temporarily modify the max-values-per-tag parameter to verify whether this resolves the problem:
```
cat influxdb.conf
[meta]
  dir = "/var/lib/influxdb/meta"

[data]
  dir = "/var/lib/influxdb/data"
  engine = "tsm1"
  wal-dir = "/var/lib/influxdb/wal"
  max-series-per-database = 0
  max-values-per-tag = 0

[http]
  auth-enabled = true
```
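In a K8S deployment like this one, influxdb.conf is typically mounted from a ConfigMap, so the change has to land there before the pods are recreated; a sketch, where the ConfigMap name influxdb-config is an assumption:

```bash
# Update the mounted config (the ConfigMap name is an assumption),
# then recreate the pods below so they pick up the new file
kubectl -n kube-system edit configmap influxdb-config
```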
```
kubectl delete pod -n kube-system influxdb-rs1-6ff9466d46-q6w5r
pod "influxdb-rs1-6ff9466d46-q6w5r" deleted

kubectl delete pod -n kube-system influxdb-rs2-d6d6697f5-zzcnk
pod "influxdb-rs2-d6d6697f5-zzcnk" deleted

kubectl delete pod -n kube-system influxdb-rs3-65ddfc7476-hxhr8
pod "influxdb-rs3-65ddfc7476-hxhr8" deleted
```
Watching the business-layer InfluxDB monitoring data again, CPU, memory, and disk usage are now retrieved normally:
```
Metric            Usage
CPU (cores)       19%
Memory (GB)       22%
Disk space (GB)   2%
```
## Solution

Based on the workload, tune InfluxDB's max-values-per-tag parameter to an appropriate value.
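Note that 0 disables the check entirely, which restores writes but also removes the OOM guardrail the limit exists for. A safer long-term choice is a finite value sized above the expected tag cardinality, for example:

```toml
[data]
  # Illustrative value, not a recommendation from the original incident:
  # leave headroom above the observed ~100k pod_name values; 0 disables the check
  max-values-per-tag = 1000000
```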
## References

[1] https://github.com/influxdata/influxdb-relay
[2] https://github.com/influxdata/influxdb/issues/7146