
Prometheus

Author: 陶健

Date: 2022-08-30

Background


As a service monitoring tool, Prometheus can be used to monitor and record server status information. It also integrates quickly with email and WeCom (企业微信) to deliver alert notifications.

Node Exporter


Install Node Exporter on every machine that needs to be monitored:

$ curl -OL https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
$ tar -xzf node_exporter-1.3.1.linux-amd64.tar.gz

Configure Node Exporter as a systemd service:

$ cat > /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
User=root
ExecStart=/opt/node_exporter-1.3.1.linux-amd64/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF

$ systemctl daemon-reload
$ systemctl start node_exporter
$ systemctl enable node_exporter

Adjust the node_exporter directory in ExecStart to match your environment.
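
A quick sanity check, assuming the default listen port 9100 is unchanged: the exporter should be running as a service and answering on /metrics.

$ systemctl status node_exporter
$ curl -s http://localhost:9100/metrics | head -n 5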

Prometheus


Install Prometheus:

$ docker run \
-d \
-p 9090:9090 \
-v /YOUR_PATH/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
-v /YOUR_PATH/prometheus/rules:/etc/prometheus/rules \
--name prometheus \
--restart=always \
prom/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/prometheus \
--web.enable-lifecycle \
--web.enable-admin-api

Adjust the mounted host paths to match your environment; prometheus.yml and the rules directory must exist on the host before the container is started.
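
To confirm the container came up, Prometheus exposes a health endpoint; a minimal check:

$ docker ps --filter name=prometheus
$ curl -s http://localhost:9090/-/healthy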

Configure Prometheus (prometheus.yml):

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s

rule_files:
- /etc/prometheus/rules/*.rules

scrape_configs:
- job_name: 'prometheus'
  metrics_path: /metrics
  static_configs:
  - targets:
    - localhost:9090
- job_name: 'node'
  metrics_path: /metrics
  static_configs:
  - targets:
    - node_1_ip:9100
    - node_2_ip:9100

Adjust the targets to match your environment.
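
Before restarting or reloading, the configuration can be validated with promtool, which ships inside the prom/prometheus image; a sketch, assuming the container name used above:

$ docker exec prometheus promtool check config /etc/prometheus/prometheus.yml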

Alerting rules (xxx.rules):

groups:
- name: office-2101
  rules:
  - alert: "实例丢失"
    expr: up{job="office-2101"} == 0
    for: 2m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} lost"
      description: "{{ $labels.instance }}'s job {{ $labels.job }} has stopped more tham 1m"

  - alert: "CPU 使用率大于 80%"
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{job="office-2101",mode="idle"}[30s])) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host high CPU load (instance {{ $labels.instance }})
      description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: "内存容量小于 20%"
    expr: node_memory_MemAvailable_bytes{job="office-2101"} / node_memory_MemTotal_bytes{job="office-2101"} * 100 < 20
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "Node memory is filling up (< 20% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: "磁盘容量小于 10%"
    expr: (node_filesystem_avail_bytes{job="office-2101"} * 100) / node_filesystem_size_bytes{job="office-2101"} < 10 and ON (instance, device, mountpoint) node_filesystem_readonly{job="office-2101"} == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of disk space (instance {{ $labels.instance }})
      description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
 
  - alert: "磁盘读 I/O 超过 50MB/s"
    expr: sum by (instance) (rate(node_disk_read_bytes_total{job="office-2101"}[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk read rate (instance {{ $labels.instance }})
      description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
 
  - alert: "磁盘写 I/O 超过 50MB/s"
    expr: sum by (instance) (rate(node_disk_written_bytes_total{job="office-2101"}[2m])) / 1024 / 1024 > 50
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk write rate (instance {{ $labels.instance }})
      description: "Disk is probably writing too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
 
  - alert: "网卡流入速率大于 10MB/s"
    expr: sum by (instance) (rate(node_network_receive_bytes_total{job="office-2101"}[2m])) / 1024 / 1024 > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual network throughput in (instance {{ $labels.instance }})
      description: "Host network interfaces are probably receiving too much data (> 10 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: "网卡流出速率大于 10MB/s"
    expr: sum by (instance) (rate(node_network_transmit_bytes_total{job="office-2101"}[2m])) / 1024 / 1024 > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual network throughput out (instance {{ $labels.instance }})
      description: "Host network interfaces are probably sending too much data (> 10 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Place the rule file under the rule_files path defined in prometheus.yml.

Adjust {job="office-2101"} to match your environment; to make a rule apply to all jobs, omit the job matcher.
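
Rule files can also be validated with promtool before reloading; a sketch, assuming the file is mounted at /etc/prometheus/rules/office-2101.rules (substitute your own file name):

$ docker exec prometheus promtool check rules /etc/prometheus/rules/office-2101.rules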

Testing the alerting rules

# Node offline
$ systemctl stop node_exporter

# CPU: 4 cores at full load for 10 minutes
$ stress-ng --cpu 4 --timeout 10m

# Memory: occupy 80% of RAM for 10 minutes
$ stress-ng --vm 8 --vm-bytes 80% -t 10m

# Disk: occupy 80% of space with mixed I/O for 10 minutes
$ stress-ng --iomix 2 --iomix-bytes 80% -t 10m
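
While a test is running, firing alerts can be inspected on the Alerts page of the Prometheus UI or via the HTTP API; a minimal check:

$ curl -s http://localhost:9090/api/v1/alerts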

Hot-reload the configuration and rules:

$ curl -X POST http://127.0.0.1:9090/-/reload
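
The reload endpoint is available because --web.enable-lifecycle was passed when starting the container. To confirm the rules were picked up after a reload, the rules API can be queried:

$ curl -s http://localhost:9090/api/v1/rules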

Grafana


Install Grafana:

$ docker run \
-d \
-p 3000:3000 \
--name grafana \
--restart=always \
grafana/grafana
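
After Grafana starts, log in at http://YOUR_HOST:3000 (default credentials admin/admin) and add Prometheus as a data source. The same can be done through Grafana's HTTP API; a sketch, assuming default credentials and that Prometheus is reachable from the Grafana container at YOUR_PROMETHEUS_HOST:9090 (adjust both):

$ curl -s -X POST http://admin:admin@localhost:3000/api/datasources \
    -H 'Content-Type: application/json' \
    -d '{"name":"Prometheus","type":"prometheus","url":"http://YOUR_PROMETHEUS_HOST:9090","access":"proxy"}'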

AlertManager


Install AlertManager:

$ docker run \
-d \
-p 9093:9093 \
-v /YOUR_PATH/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
-v /YOUR_PATH/alertmanager/template:/etc/alertmanager/template \
--name=alertmanager \
--restart=always \
prom/alertmanager

Adjust the mounted host paths to match your environment.
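
AlertManager also exposes a health endpoint, so a quick check that the container is serving:

$ curl -s http://localhost:9093/-/healthy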

Configure the email template (email.tmpl)

See https://raw.githubusercontent.com/prometheus/alertmanager/main/template/email.tmpl

Configure the WeCom template (wechat.tmpl)

{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
===== 监控报警 =====
告警类型: {{ $alert.Labels.alertname }}
告警级别: {{ $alert.Labels.severity }}
告警状态: {{ .Status }}
故障主机: {{ $alert.Labels.instance }}
告警主题: {{ $alert.Annotations.summary }}
告警详情: {{ $alert.Annotations.message }} {{ $alert.Annotations.description}};
触发阀值: {{ .Annotations.value }}
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
===== ==END==  =====
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
===== 异常恢复 =====
告警类型: {{ $alert.Labels.alertname }}
告警状态: {{ .Status }}
故障主机: {{ $alert.Labels.instance }}
告警主题: {{ $alert.Annotations.summary }}
告警详情: {{ $alert.Annotations.message }} {{ $alert.Annotations.description}};
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
恢复时间: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
实例信息: {{ $alert.Labels.instance }}
{{- end }}
===== ==END==  =====
{{- end }}
{{- end }}
{{- end }}
{{- end }}

Configure AlertManager (alertmanager.yml):

global:
  smtp_smarthost: 'smtp.exmail.qq.com:465'
  smtp_from: 'YOUR_EMAIL_ADDRESS'
  smtp_auth_username: 'YOUR_EMAIL_AUTH_USERNAME'
  smtp_auth_password: 'YOUR_EMAIL_AUTH_PASSWORD'
  smtp_require_tls: false
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: 'YOUR_WECHAT_CORP_ID'
  wechat_api_secret: 'YOUR_WECHAT_API_SECRET'

templates:
- '/etc/alertmanager/template/*.tmpl'

route:
  receiver: 'wechat'
  group_by: 
  - 'env'
  - 'instance'
  - 'type'
  - 'group'
  - 'job'
  - 'alertname'
  group_wait: 10s        # how long to wait before sending the first notification for a new alert group
  group_interval: 10s    # how long to wait before notifying about new alerts added to an existing group
  repeat_interval: 5m    # how long to wait before re-sending a notification that is still firing

receivers:
- name: 'email'
  email_configs: 
  - send_resolved: true
    to: 'taojian@woodare.com, lu_f@woodare.com'
- name: 'wechat'
  wechat_configs:
  - send_resolved: true
    corp_id: 'YOUR_WECHAT_CORP_ID'
    to_party: '3'
    message: '{{ template "wechat.default.message" . }}'
    agent_id: 'YOUR_AGENT_ID'
    api_secret: 'YOUR_WECHAT_API_SECRET'

Adjust the SMTP settings to match your environment.
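
The configuration can be validated with amtool, which is included in the prom/alertmanager image; a sketch, assuming the container name used above:

$ docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml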

Point Prometheus at AlertManager by adding the following to prometheus.yml:

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

Adjust the targets to match your environment, then hot-reload or restart Prometheus.
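
To confirm Prometheus has discovered the AlertManager, reload and query the alertmanagers API:

$ curl -X POST http://127.0.0.1:9090/-/reload
$ curl -s http://localhost:9090/api/v1/alertmanagers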
