
K8S Troubleshooting - InfluxDB Monitoring Data Retrieval Failure

Problem Background

Inside the K8S cluster, retrieval of InfluxDB monitoring data became abnormal, and eventually the CPU, memory, and disk usage could no longer be obtained at all:

Metric            Usage
CPU (cores)       3%
Memory (GB)       18%
Disk space (GB)   0%

Metric            Usage
CPU (cores)       7%
Memory (GB)       18%
Disk space (GB)   1%

Metric            Usage
CPU (cores)       0%
Memory (GB)       0%
Disk space (GB)   0%

...

The InfluxDB monitoring architecture follows the diagram from [1], with the Load Balancer implemented using nginx:

        ┌─────────────────┐
        │writes & queries │
        └─────────────────┘
                 │
                 ▼
         ┌───────────────┐
         │               │
┌────────│ Load Balancer │─────────┐
│        │               │         │
│        └──────┬─┬──────┘         │
│               │ │                │
│               │ │                │
│        ┌──────┘ └────────┐       │
│        │ ┌─────────────┐ │       │┌──────┐
│        │ │/write or UDP│ │       ││/query│
│        ▼ └─────────────┘ ▼       │└──────┘
│  ┌──────────┐      ┌──────────┐  │
│  │ InfluxDB │      │ InfluxDB │  │
│  │ Relay    │      │ Relay    │  │
│  └──┬────┬──┘      └────┬──┬──┘  │
│     │    |              |  │     │
│     |    ┌─┼────────────┘  |     │
│     │    │ └──────────────┐│     │
│     ▼    ▼                ▼▼     │
│  ┌──────────┐      ┌──────────┐  │
│  │          │      │          │  │
└─▶│ InfluxDB │      │ InfluxDB │◀─┘
   │          │      │          │
   └──────────┘      └──────────┘
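
The nginx layer in this setup mainly has to route /write traffic to the relays (which replicate writes to every InfluxDB instance) and /query traffic straight to an InfluxDB instance. A minimal sketch of what such a configuration could look like, using the Service names and ports that show up in the analysis below; the actual config in the cluster may differ:

# Hypothetical nginx config fragment for the Load Balancer role above
# (upstream names and ports are assumptions based on the Services listed below)
upstream influxdb_relay {
    server influxdb-relay-service.kube-system:9096;
}
upstream influxdb {
    server influxdb-service.kube-system:8086;
}
server {
    listen 7076;
    # writes go through the relays so they are replicated to all InfluxDB instances
    location /write {
        proxy_pass http://influxdb_relay;
    }
    # queries go directly to InfluxDB
    location /query {
        proxy_pass http://influxdb;
    }
}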

Root Cause Analysis

Since the data being fetched comes from the influxdb database, the first step is to figure out whether the problem lies on the request path or whether the influxdb database itself simply has no data:

# Find the influxdb-nginx Service
kubectl get svc -n kube-system -owide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
grafana-service ClusterIP 10.96.177.245 <none> 3000/TCP 21d app=grafana
heapster ClusterIP 10.96.239.225 <none> 80/TCP 21d app=heapster
influxdb-nginx-service ClusterIP 10.96.170.72 <none> 7076/TCP 21d app=influxdb-nginx
influxdb-relay-service ClusterIP 10.96.196.45 <none> 9096/TCP 21d app=influxdb-relay
influxdb-service ClusterIP 10.96.127.45 <none> 8086/TCP 21d app=influxdb

# On a cluster node, check whether access to the influxdb-nginx Service works
curl -i 10.96.170.72:7076/query
HTTP/1.1 401 Unauthorized
Server: nginx/1.17.2
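
A 401 here is expected: InfluxDB has authentication enabled (auth-enabled = true in the config shown later), and this curl carries no credentials. With the credentials used later in the influx CLI (the password is elided as xxx), the same path can be exercised end to end, for example:

# Authenticated query through the influxdb-nginx Service; should return HTTP 200 and a JSON body
curl -G -u admin:xxx 'http://10.96.170.72:7076/query' --data-urlencode 'q=SHOW DATABASES'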

So requests sent to the influxdb-nginx Service do get through, i.e. they reach the backend influxdb database normally, and the request path is not the problem. The next step is to check whether the influxdb database itself is missing data:

# Find the influxdb database pods
kubectl get pod -n kube-system -owide |grep influxdb
influxdb-nginx-4x8pr 1/1 Running 3 21d 177.177.52.201 node3
influxdb-nginx-tpngh 1/1 Running 6 21d 177.177.41.214 node1
influxdb-nginx-wh6kc 1/1 Running 5 21d 177.177.250.180 node2
influxdb-relay-rs-65c94bbf5f-dp7s4 1/1 Running 2 21d 177.177.250.148 node2
influxdb-rs1-6ff9466d46-q6w5r 1/1 Running 3 21d 177.177.41.230 node1
influxdb-rs2-d6d6697f5-zzcnk 1/1 Running 3 21d 177.177.250.161 node2
influxdb-rs3-65ddfc7476-hxhr8 1/1 Running 4 21d 177.177.52.217 node3

# Exec into any one of the influxdb containers and start the interactive CLI
kubectl exec -it -n kube-system influxdb-rs3-65ddfc7476-hxhr8 bash
root@influxdb-rs3-65ddfc7476-hxhr8:/# influx
Connected to http://localhost:8086 version 1.7.7
InfluxDB shell version: 1.7.7
> auth
username: admin
password: xxx
> use xxx;
Using database xxx

Using the same query that the application layer issues, verify it manually in the influxdb interactive shell:

> select sum(value) from "cpu/node_capacity" where "type" = 'node' and "nodename" = 'node1' and time > now() - 2m
>

Indeed, no data was returned. Since there is nothing in the last 2 minutes, what about stretching the time window?

# Query without any time range restriction
> select sum(value) from "cpu/node_capacity";
name: cpu/node_capacity
time sum
---- ---
0 5301432000

# Query data from the last 72 minutes
> select sum(value) from "cpu/node_capacity" where "type" = 'node' and "nodename" = 'node1' and time > now() - 72m
name: cpu/node_capacity
time sum
---- ---
1624348319900503945 72000

# Sleep for 1 minute, then query the last 72 minutes again
> select sum(value) from "cpu/node_capacity" where "type" = 'node' and "nodename" = 'node1' and time > now() - 72m
name: cpu/node_capacity
>

# Query data from the last 73 minutes
> select sum(value) from "cpu/node_capacity" where "type" = 'node' and "nodename" = 'node1' and time > now() - 73m
name: cpu/node_capacity
time sum
---- ---
1624348319900503945 72000
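
One way to pin down exactly when the last point was written is to ask for the newest point and its timestamp with InfluxQL's last() selector (a sketch, run in the same influx session):

> select last(value) from "cpu/node_capacity" where "type" = 'node' and "nodename" = 'node1'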

From these results, records do exist when no time range is given, and the row returned by the 72-minute query disappears one minute later while the 73-minute query still returns it. In other words, nothing new has been written for roughly 72 minutes: at some point the data simply stopped being ingested. Check the influxdb logs for anything relevant:

kubectl logs -n kube-system influxdb-rs3-65ddfc7476-hxhr8
ts=2021-06-22T09:56:49.658621Z lvl=warn msg="max-values-per-tag limit may be exceeded soon" log_id=0UYIcREl000 service=store perc=100% n=100000 max=100000 db_instance=xxx measurement=network/rx tag=pod_name
ts=2021-06-22T09:56:49.658702Z lvl=warn msg="max-values-per-tag limit may be exceeded soon" log_id=0UYIcREl000 service=store perc=100% n=100000 max=100000 db_instance=xxx measurement=network/rx_errors tag=pod_name
ts=2021-06-22T09:56:49.658815Z lvl=warn msg="max-values-per-tag limit may be exceeded soon" log_id=0UYIcREl000 service=store perc=100% n=100000 max=100000 db_instance=xxx measurement=network/tx tag=pod_name
ts=2021-06-22T09:56:49.658893Z lvl=warn msg="max-values-per-tag limit may be exceeded soon" log_id=0UYIcREl000 service=store perc=100% n=100000 max=100000 db_instance=xxx measurement=network/tx_errors tag=pod_name
ts=2021-06-22T09:56:49.659062Z lvl=warn msg="max-values-per-tag limit may be exceeded soon" log_id=0UYIcREl000 service=store perc=100% n=100003 max=100000 db_instance=xxx measurement=uptime tag=pod_name

Sure enough, there are lots of warn logs saying "max-values-per-tag limit may be exceeded soon", and they show that the default value of this parameter is 100000. Searching for it leads to the issue that introduced the parameter [2]; the motivation is roughly:

If a large amount of high-cardinality data is loaded by accident, InfluxDB can easily OOM when that data is later deleted.
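
Here the pod_name tag has reached the 100000 default (perc=100%, and n=100003 for the uptime measurement, in the warnings above). Once max-values-per-tag is exceeded, InfluxDB rejects writes that would create new tag values, which is consistent with the write stop observed earlier. The current cardinality can be checked from the influx CLI; a sketch using InfluxQL's cardinality statements, with the tag name taken from the warning logs:

# How many distinct values the pod_name tag has, and how many series exist in total
> show tag values exact cardinality with key = "pod_name"
> show series exact cardinality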

Temporarily modify the max-values-per-tag parameter to verify whether this resolves the problem:

cat influxdb.conf
[meta]
dir = "/var/lib/influxdb/meta"
[data]
dir = "/var/lib/influxdb/data"
engine = "tsm1"
wal-dir = "/var/lib/influxdb/wal"
max-series-per-database = 0
max-values-per-tag = 0
[http]
auth-enabled = true
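
Setting max-values-per-tag = 0 (and max-series-per-database = 0) disables these limits entirely. How the new influxdb.conf reaches the pods depends on how it is mounted; a sketch assuming it comes from a ConfigMap named influxdb-config in kube-system (the name is hypothetical):

# Rebuild the ConfigMap from the edited influxdb.conf and apply it (ConfigMap name is an assumption)
kubectl create configmap influxdb-config -n kube-system \
    --from-file=influxdb.conf --dry-run=client -o yaml | kubectl apply -f -

The influxdb pods then need to be restarted so they pick up the new configuration: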
kubectl delete pod -n kube-system influxdb-rs1-6ff9466d46-q6w5r
pod "influxdb-rs1-6ff9466d46-q6w5r" deleted

kubectl delete pod -n kube-system influxdb-rs2-d6d6697f5-zzcnk
pod "influxdb-rs2-d6d6697f5-zzcnk" deleted

kubectl delete pod -n kube-system influxdb-rs3-65ddfc7476-hxhr8
pod "influxdb-rs3-65ddfc7476-hxhr8" deleted

Watching the InfluxDB monitoring data at the application layer again, CPU, memory, and disk usage are finally retrieved normally:

Metric            Usage
CPU (cores)       19%
Memory (GB)       22%
Disk space (GB)   2%
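
As a final sanity check, the 2-minute query from the analysis above can be re-run in the influx CLI; with writes flowing again it should now return a recent row:

> select sum(value) from "cpu/node_capacity" where "type" = 'node' and "nodename" = 'node1' and time > now() - 2m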

Solution

Based on the actual workload, adjust InfluxDB's max-values-per-tag parameter to an appropriate value.
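
Note that 0 disables the limit completely and gives up the OOM protection it was introduced for [2]. A value comfortably above the real pod_name cardinality keeps that protection; a sketch (the concrete number is only an example):

[data]
  # Raise the limit instead of disabling it; pick a value above the expected tag cardinality
  max-values-per-tag = 1000000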

References

  1. https://github.com/influxdata/influxdb-relay
  2. https://github.com/influxdata/influxdb/issues/7146