Prometheus
Author: 陶健
Date: 2022-08-30
Background
As a service monitoring tool, Prometheus can be used to monitor and record server status information. It also integrates quickly with email and WeChat Work to deliver alert notifications.
Node Exporter
Install Node Exporter on every machine that needs to be monitored.
$ curl -OL https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
$ tar -xzf node_exporter-1.3.1.linux-amd64.tar.gz
Configure Node Exporter as a systemd service
$ cat > /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
User=root
ExecStart=/opt/node_exporter-1.3.1.linux-amd64/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
$ systemctl daemon-reload
$ systemctl start node_exporter
$ systemctl enable node_exporter
Adjust the node_exporter directory path in ExecStart to match where the archive was extracted.
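To confirm the exporter is running and serving metrics, a quick check (assuming the default port 9100):
$ systemctl status node_exporter
$ curl -s http://localhost:9100/metrics | head -n 5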
Prometheus
Install Prometheus
$ docker run \
-d \
-p 9090:9090 \
-v /YOUR_PATH/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
-v /YOUR_PATH/prometheus/rules:/etc/prometheus/rules \
--name prometheus \
--restart=always \
prom/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/prometheus \
--web.enable-lifecycle \
--web.enable-admin-api
Adjust the mapped host directory paths to match your environment.
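Once the container is up, Prometheus exposes health and readiness endpoints that serve as a quick sanity check (assuming the port mapping above):
$ docker logs prometheus
$ curl -s http://localhost:9090/-/healthy
$ curl -s http://localhost:9090/-/ready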
Configure Prometheus (prometheus.yml)
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
rule_files:
  - /etc/prometheus/rules/*.rules
scrape_configs:
  - job_name: 'prometheus'
    metrics_path: /metrics
    static_configs:
      - targets:
          - localhost:9090
  - job_name: 'node'
    metrics_path: /metrics
    static_configs:
      - targets:
          - node_1_ip:9100
          - node_2_ip:9100
Adjust the targets to match your environment.
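Before (re)loading the configuration, it can be validated with promtool, which ships inside the prom/prometheus image, and the scrape targets can then be inspected through the HTTP API. The container name and paths follow the docker run example above:
$ docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
$ curl -s http://localhost:9090/api/v1/targets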
Alert rules (xxx.rules)
groups:
  - name: office-2101
    rules:
      - alert: "Instance lost"
        expr: up{job="office-2101"} == 0
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} lost"
          description: "{{ $labels.instance }}'s job {{ $labels.job }} has been down for more than 2m"
      - alert: "CPU usage above 80%"
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{job="office-2101",mode="idle"}[30s])) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Host high CPU load (instance {{ $labels.instance }})
          description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: "Available memory below 20%"
        expr: node_memory_MemAvailable_bytes{job="office-2101"} / node_memory_MemTotal_bytes{job="office-2101"} * 100 < 20
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Host out of memory (instance {{ $labels.instance }})
          description: "Node memory is filling up (< 20% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: "Free disk space below 10%"
        expr: (node_filesystem_avail_bytes{job="office-2101"} * 100) / node_filesystem_size_bytes{job="office-2101"} < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{job="office-2101"} == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Host out of disk space (instance {{ $labels.instance }})
          description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: "Disk read I/O above 50MB/s"
        expr: sum by (instance) (rate(node_disk_read_bytes_total{job="office-2101"}[2m])) / 1024 / 1024 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Host unusual disk read rate (instance {{ $labels.instance }})
          description: "Disk is probably reading too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: "Disk write I/O above 50MB/s"
        expr: sum by (instance) (rate(node_disk_written_bytes_total{job="office-2101"}[2m])) / 1024 / 1024 > 50
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Host unusual disk write rate (instance {{ $labels.instance }})
          description: "Disk is probably writing too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: "Network inbound rate above 10MB/s"
        expr: sum by (instance) (rate(node_network_receive_bytes_total{job="office-2101"}[2m])) / 1024 / 1024 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Host unusual network throughput in (instance {{ $labels.instance }})
          description: "Host network interfaces are probably receiving too much data (> 10 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: "Network outbound rate above 10MB/s"
        expr: sum by (instance) (rate(node_network_transmit_bytes_total{job="office-2101"}[2m])) / 1024 / 1024 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Host unusual network throughput out (instance {{ $labels.instance }})
          description: "Host network interfaces are probably sending too much data (> 10 MB/s)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
Place the rule file under the rule_files path defined in prometheus.yml.
Adjust {job="office-2101"} to match your environment; if the rules should apply to all jobs, drop the job matcher entirely.
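The rule file can also be validated with promtool before reloading; xxx.rules below is the placeholder file name from this section and should match the rule_files glob:
$ docker exec prometheus promtool check rules /etc/prometheus/rules/xxx.rules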
Testing the alert rules
# Instance-down alert: stop the exporter on a node
$ systemctl stop node_exporter
# Fully load 4 CPU cores for 10 minutes
$ stress-ng --cpu 4 --timeout 10m
# Occupy 80% of memory for 10 minutes
$ stress-ng --vm 8 --vm-bytes 80% -t 10m
# Occupy 80% of disk space for 10 minutes
$ stress-ng --iomix 2 --iomix-bytes 80% -t 10m
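stress-ng is usually not installed by default; a hedged install sketch for common distributions (package names assumed from the standard repositories, EPEL on CentOS/RHEL):
# Debian / Ubuntu
$ apt-get install -y stress-ng
# CentOS / RHEL (requires EPEL)
$ yum install -y epel-release && yum install -y stress-ng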
Hot-reload the configuration and rules
$ curl -X POST http://127.0.0.1:9090/-/reload
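The reload endpoint works because the container was started with --web.enable-lifecycle. Whether the new configuration and rules were actually picked up can be checked through the status and rules APIs:
$ curl -s http://localhost:9090/api/v1/status/config
$ curl -s http://localhost:9090/api/v1/rules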
Grafana
Install Grafana
$ docker run \
-d \
-p 3000:3000 \
--name grafana \
--restart=always \
grafana/grafana
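After logging in (Grafana defaults to admin/admin on first start), Prometheus can be added as a data source either in the UI or via the HTTP API. The sketch below assumes the default credentials and a Prometheus reachable at PROMETHEUS_IP:9090:
$ curl -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{"name": "Prometheus", "type": "prometheus", "url": "http://PROMETHEUS_IP:9090", "access": "proxy"}'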
AlertManager
Install AlertManager
$ docker run \
-d \
-p 9093:9093 \
-v /YOUR_PATH/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
-v /YOUR_PATH/alertmanager/template:/etc/alertmanager/template \
--name=alertmanager \
--restart=always \
prom/alertmanager
Adjust the mapped host directory paths to match your environment.
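A quick health check for the AlertManager container (assuming the port mapping above):
$ docker logs alertmanager
$ curl -s http://localhost:9093/-/healthy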
Configure the email template (email.tmpl)
See the reference template:
https://raw.githubusercontent.com/prometheus/alertmanager/main/template/email.tmpl
Configure the WeChat Work template (wechat.tmpl)
{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
===== Monitoring Alert =====
Alert name: {{ $alert.Labels.alertname }}
Severity: {{ $alert.Labels.severity }}
Status: {{ .Status }}
Host: {{ $alert.Labels.instance }}
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.message }} {{ $alert.Annotations.description }};
Trigger value: {{ .Annotations.value }}
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
===== ==END== =====
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
===== Alert Resolved =====
Alert name: {{ $alert.Labels.alertname }}
Status: {{ .Status }}
Host: {{ $alert.Labels.instance }}
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.message }} {{ $alert.Annotations.description }};
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{ $alert.Labels.instance }}
{{- end }}
===== ==END== =====
{{- end }}
{{- end }}
{{- end }}
{{- end }}
Configure AlertManager (alertmanager.yml)
global:
  smtp_smarthost: 'smtp.exmail.qq.com:465'
  smtp_from: 'YOUR_EMAIL_ADDRESS'
  smtp_auth_username: 'YOUR_EMAIL_AUTH_USERNAME'
  smtp_auth_password: 'YOUR_EMAIL_AUTH_PASSWORD'
  smtp_require_tls: false
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: 'YOUR_WECHAT_CORP_ID'
  wechat_api_secret: 'YOUR_WECHAT_API_SECRET'
templates:
  - '/etc/alertmanager/template/*.tmpl'
route:
  receiver: 'wechat'
  group_by:
    - 'env'
    - 'instance'
    - 'type'
    - 'group'
    - 'job'
    - 'alertname'
  group_wait: 10s       # how long to wait before sending the first notification for a new group
  group_interval: 10s   # how long to wait before notifying about new alerts added to an existing group
  repeat_interval: 5m   # how long to wait before re-sending a notification that was already sent
receivers:
  - name: 'email'
    email_configs:
      - send_resolved: true
        to: 'taojian@woodare.com, lu_f@woodare.com'
  - name: 'wechat'
    wechat_configs:
      - send_resolved: true
        corp_id: 'YOUR_WECHAT_CORP_ID'
        to_party: '3'
        message: '{{ template "wechat.default.message" . }}'
        agent_id: 'YOUR_AGENT_ID'
        api_secret: 'YOUR_WECHAT_API_SECRET'
Adjust the SMTP settings to match your environment.
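The configuration can be validated with amtool, which ships inside the prom/alertmanager image:
$ docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml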
Configure Prometheus alerting (prometheus.yml)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
Adjust the targets to match your environment.
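To exercise the whole notification chain without waiting for a real incident, a test alert can be pushed directly to AlertManager's v2 API; the alertname and labels below are made up for the test:
$ curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "warning", "instance": "manual-test"}}]'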
