Prometheus: Monitoring System Resources and Alerting on Failures

Published: 2023/12/13


Environment:

Host	OS	Deployed components
192.168.220.130	CentOS 7.6	node_exporter-1.2.0.linux-amd64
192.168.220.131	CentOS 7.6	node_exporter-1.2.0.linux-amd64
192.168.220.129	CentOS 7.6	prometheus-2.28.1.linux-amd64, alertmanager-0.21.0, grafana-8.0.6-1.x86_64.rpm, node_exporter-1.2.0.linux-amd64

Installing node_exporter

Download and extract the release tarball. Download URL:
https://github.com/prometheus/node_exporter/releases/download/v1.2.0/node_exporter-1.2.0.linux-amd64.tar.gz

tar -xzvf node_exporter-1.2.0.linux-amd64.tar.gz -C /opt
ln -s /opt/node_exporter-1.2.0.linux-amd64 /opt/node_exporter
vim  /usr/lib/systemd/system/node_exporter.service 
[Unit]
Description=node_exporter
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/opt/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target

chown prometheus:prometheus /usr/lib/systemd/system/node_exporter.service
chown prometheus:prometheus /opt/node_exporter

Enable the service at boot and start it

systemctl daemon-reload
systemctl enable node_exporter.service
systemctl start node_exporter.service
systemctl status node_exporter.service

Open http://192.168.220.129:9100/metrics in a browser to see the metrics page. The values shown are refreshed each time the endpoint is polled.
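The same endpoint can be checked from the command line. The sketch below assumes curl and grep are installed; filter_metric and fetch_metric are hypothetical helper names used only for illustration, not part of node_exporter.

```shell
# Keep only the sample lines for one metric name, dropping the
# "# HELP" / "# TYPE" comments and metrics that merely share a prefix.
filter_metric() {
  grep -E "^$1(\{| )"
}

# Hypothetical helper: fetch the exporter output from a host and filter it.
# Port 9100 is the node_exporter port used throughout this article.
fetch_metric() {
  curl -s "http://$1:9100/metrics" | filter_metric "$2"
}
```

For example, fetch_metric 192.168.220.129 node_load1 should print the current 1-minute load sample if the exporter is reachable.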

Modifying prometheus.yml

Add node_exporter to the prometheus.yml configuration

vim /opt/prometheus/prometheus.yml
# add under the existing scrape_configs: section
  - job_name: 'Linux'
    file_sd_configs:
      - files: ['/opt/prometheus/rules/test_cluster.yml']
        refresh_interval: 5s

vim /opt/prometheus/rules/test_cluster.yml
- targets: ['192.168.220.129:9100']
  labels:
    name: Linux-test1

- targets: ['192.168.220.130:9100']
  labels:
    name: Linux-test2

- targets: ['192.168.220.131:9100']
  labels:
    name: Linux-test3
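When the host list grows, a target file in this layout can be generated rather than typed by hand. This is a minimal sketch: gen_targets is a hypothetical helper, and the Linux-testN naming simply follows the example above.

```shell
# Print one file_sd entry per host, numbering the name label from 1.
gen_targets() {
  i=1
  for host in "$@"; do
    printf -- "- targets: ['%s:9100']\n  labels:\n    name: Linux-test%d\n\n" "$host" "$i"
    i=$((i + 1))
  done
}

# Prints the entries for the three hosts used in this article.
gen_targets 192.168.220.129 192.168.220.130 192.168.220.131
```

Redirecting the output to /opt/prometheus/rules/test_cluster.yml would reproduce the file above; with file_sd_configs, Prometheus picks up changes to that file without a restart.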

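Restart the prometheus service

systemctl restart prometheus.service
Or hot-reload the configuration (the reload endpoint only works when Prometheus was started with --web.enable-lifecycle)
curl -X POST  http://localhost:9090/-/reload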
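It can be safer to validate the configuration before reloading, since a broken file would otherwise only surface in the Prometheus log. The sketch below uses promtool, which ships in the Prometheus tarball; safe_reload and the PROMTOOL override variable are hypothetical names, and the config path is the one used in this article.

```shell
# Reload Prometheus only if the configuration passes promtool's check.
# The reload call requires Prometheus to run with --web.enable-lifecycle.
safe_reload() {
  config=${1:-/opt/prometheus/prometheus.yml}
  if ! "${PROMTOOL:-promtool}" check config "$config"; then
    echo "config check failed; not reloading" >&2
    return 1
  fi
  curl -fsS -X POST http://localhost:9090/-/reload
}
```

Calling safe_reload with no arguments checks /opt/prometheus/prometheus.yml and then posts to the reload endpoint.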

Importing the Grafana dashboard template

Download the template

https://grafana.com/grafana/dashboards/11074

Import it as a dashboard in Grafana

Installing alertmanager

Download the package and configure it

Download URL:
https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz

tar -xzvf alertmanager-0.21.0.linux-amd64.tar.gz -C /opt
ln -s /opt/alertmanager-0.21.0.linux-amd64 /opt/alertmanager
vim /opt/alertmanager/conf/alertmanager.yml
global:
  smtp_smarthost: smtp.exmail.xxx.com:465 # SMTP server of the sender's mailbox
  smtp_auth_username: xxxx@xxx.com # sender's mailbox account
  smtp_from: xxx@xxx.com # sender address
  smtp_auth_password: xxxxxx # sender's mailbox password (mailbox authorization code)
  resolve_timeout: 5m
  smtp_require_tls: false

route:
  # group_by: ['alertname'] # labels used to group alerts
  group_wait: 10s # how long to wait before sending the first notification for a new group
  group_interval: 10s # how long to wait before notifying about new alerts added to an existing group
  repeat_interval: 1m # how long to wait before re-sending a notification that was already sent
  receiver: 'email'

receivers:
- name: email
  email_configs:
  - send_resolved: true
    to: xxx@xxx.com # recipient's mailbox

Set up alertmanager as a system service and enable it at boot

vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/opt/alertmanager/alertmanager --config.file=/opt/alertmanager/conf/alertmanager.yml --storage.path=/opt/alertmanager/data
Restart=on-failure
[Install]
WantedBy=multi-user.target

Enable the service at boot and start it

systemctl daemon-reload
systemctl enable alertmanager.service
systemctl start alertmanager.service
systemctl status alertmanager.service

Prometheus configuration

In the prometheus directory, create the alert rule file system_rules.yml and add some custom alert rules.

groups:
- name: Host
  rules:
  - alert: HostDown
    expr: up == 0
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "{{$labels.instance}}: server is down"
      description: "{{$labels.instance}}: server has been unreachable for more than 1 minute"
          
  - alert: HighCpuUsage
    expr: 100 * (1 - avg(irate(node_cpu_seconds_total{mode="idle"}[2m])) by(instance)) > 90
    for: 1m
    labels:
      severity: middle
    annotations:
      summary: "{{$labels.instance}}: High CPU Usage Detected"
      description: "{{$labels.instance}}: CPU usage is {{$value}}, above 90%"
 
  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "{{$labels.instance}}: High Memory Usage Detected"
      description: "{{$labels.instance}}: Memory Usage is {{ $value }}, above 85%"
 
  - alert: HighDiskUsage
    expr: 100 * (node_filesystem_size_bytes{fstype=~"xfs|ext4"} - node_filesystem_avail_bytes) / node_filesystem_size_bytes > 90
    for: 1m
    labels:
      severity: middle
    annotations:
      summary: "{{$labels.instance}}: High Disk Usage Detected"
      description: "{{$labels.instance}}, mountpoint {{$labels.mountpoint}}: Disk Usage is {{ $value }}, above 90%"

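  - alert: HighDiskIO
    # Fires when average disk I/O utilization exceeds 60%. The result is aggregated by instance, so only that label is available.
    expr: avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100 > 60
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "{{$labels.instance}}: disk I/O utilization is too high"
      description: "{{$labels.instance}}: disk I/O utilization is above 60% (current value: {{$value}})"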

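  - alert: HighNetworkTraffic
    # rate() is in bytes per second, so as written this fires above 10,240,000 bytes/s (about 9.8 MiB/s) of inbound traffic. Adjust the threshold to your link speed.
    expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "{{$labels.instance}}: inbound network bandwidth is too high"
      description: "{{$labels.instance}}: inbound network traffic has stayed above the threshold for more than 1 minute (RX value: {{$value}})"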
   
  - alert: HighTcpConnections
    expr: node_netstat_Tcp_CurrEstab > 1000
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "{{$labels.instance}}: too many established TCP connections"
      description: "{{$labels.instance}}: established TCP connections exceed 1000 (current value: {{$value}})"
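The memory and disk rules above share one shape: (total - available) / total * 100. A small worked example of that arithmetic, with made-up numbers, may help when choosing thresholds; usage_pct is a hypothetical helper, not a Prometheus tool.

```shell
# Compute used percentage from a total and an available amount,
# mirroring the (total - available) / total * 100 expression in the rules.
usage_pct() {
  awk -v total="$1" -v avail="$2" 'BEGIN { printf "%.1f\n", (total - avail) / total * 100 }'
}

# Example: 8000 MiB total, 1000 MiB available -> prints 87.5,
# which would exceed the 85% memory threshold.
usage_pct 8000 1000
```

The rule file itself can be validated before Prometheus loads it with promtool check rules, passing the path to system_rules.yml.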

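In the prometheus directory, edit the Prometheus configuration file prometheus.yml and add the alerting configuration to it.

[Figure: prometheus.yml with the alerting configuration added]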
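Since the original figure is not reproduced, the following is only a sketch of what such an addition typically looks like. It assumes Alertmanager listens on its default port 9093 on 192.168.220.129 and that the rule file was saved as /opt/prometheus/system_rules.yml; both are assumptions, so adjust them to your setup. alerting_fragment is a hypothetical helper that just prints the fragment.

```shell
# Print the prometheus.yml fragment that points Prometheus at
# Alertmanager and at the rule file (address and path are assumptions).
alerting_fragment() {
  cat <<'EOF'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.220.129:9093']

rule_files:
  - /opt/prometheus/system_rules.yml
EOF
}

alerting_fragment
```

The alerting and rule_files keys are top-level keys in prometheus.yml, at the same level as scrape_configs.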

Restart Prometheus to load the configuration
systemctl restart prometheus.service

Verify: http://192.168.220.129:9090/alerts

Verifying email alerts

Log in to the Prometheus web UI and check the alert information.

Enter Prometheus_IP:9090 in a browser to see the state of each alert rule.

Check the mailbox for the alert email
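To test the email path without waiting for a real failure, an alert can be posted to Alertmanager directly through its v2 HTTP API. This is a sketch under assumptions: alert_payload and send_test_alert are hypothetical helpers, the label values are made up, and Alertmanager is assumed to listen on its default port 9093.

```shell
# Build a one-alert JSON payload for POST /api/v2/alerts.
alert_payload() {
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"high"},"annotations":{"summary":"manual test alert"}}]' "$1" "$2"
}

# Hypothetical helper: send the payload to Alertmanager
# (third argument is host:port, defaulting to localhost:9093).
send_test_alert() {
  alert_payload "$1" "$2" |
    curl -fsS -X POST -H 'Content-Type: application/json' --data @- \
      "http://${3:-localhost:9093}/api/v2/alerts"
}
```

For example, send_test_alert ManualTest 192.168.220.129:9100 should lead to an email once group_wait has elapsed, provided the SMTP settings are correct.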
