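Prometheus in practice:
I. Installation:
https://prometheus.io/download/ ### download the release archive and extract it
https://grafana.com/grafana/dashboards ### search for dashboards whose data source is Prometheus
Downloaded here: prometheus, node_exporter, alertmanager, pushgateway
The machine also needs Docker installed.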
yum install docker -y
systemctl start docker.service
Install Grafana:
wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-5.3.4-1.x86_64.rpm
yum localinstall grafana-5.3.4-1.x86_64.rpm
systemctl start grafana-server
II. Configuration:
1. Prometheus configuration:
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  scrape_timeout: 10s
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093
rule_files:
  - "./rules/rule_*.yml"
  # - "second_rules.yml"
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus' ### self-scrape job (keep this one); everything scraped from this address automatically gets the label `job=prometheus`
    # metrics_path defaults to '/metrics' # path scraped on the target; change it if your service exposes metrics elsewhere
    # scheme defaults to 'http'.
    static_configs: # static configuration: lists the IP and port of each target directly
    - targets: ['localhost:9090']
  - job_name: gateway
    static_configs:
    - targets: ['127.0.0.1:9091']
      labels: ## attach labels; instance is set to 'gateway'
        instance: gateway
  - job_name: node_export
    file_sd_configs:
    - files:
      - /data/prometheus-2.12.0.linux-amd64/node_discovery.json
      #refresh_interval: 1m # how often the discovery files are re-read
  - job_name: mysql_discovery
    file_sd_configs:
    - files:
      - /data/prometheus-2.12.0.linux-amd64/mysql_discovery.json
      #refresh_interval: 1m # how often the discovery files are re-read
  - job_name: redis_discovery
    file_sd_configs:
    - files:
      - /data/prometheus-2.12.0.linux-amd64/redis_discovery.json
      #refresh_interval: 1m # how often the discovery files are re-read
Template for each discovery file:
[
  {"targets": ["127.0.0.1:9100"], "labels": {"instance": "test", "idc": "beijing"}},
  {"targets": ["127.0.0.1:9101"], "labels": {"instance": "test2", "idc": "beijing"}}
]
2. Alertmanager configuration:
global:
  resolve_timeout: 5m
#templates:
#  - 'demo.tmpl'
route:
  receiver: webhook
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 10m
  group_by: ['alertname']
  routes:
  - receiver: webhook
    group_wait: 10s
    match_re: # regex on the job label; plain `match` would compare against the literal string, and the pattern is anchored, so mysql.* also covers mysql_discovery
      job: mysql.*|kubernetes.*
  - receiver: 'webhook-kafka'
    group_by: [instance, alertname]
    match_re:
      instance: ^kafka-(.*)
receivers:
- name: webhook
  webhook_configs:
  - url: http://localhost:8060/dingtalk/ops_dingding/send
    send_resolved: true
- name: webhook-kafka
  webhook_configs:
  - url: http://localhost:8062/dingtalk/ops_59/send
    send_resolved: true
#### Note: in the Prometheus configuration, prefer file-based dynamic discovery so the config file does not grow too long, for example:
  - job_name: mysql_discovery
    file_sd_configs:
    - files:
      - /data/prometheus-2.12.0.linux-amd64/mysql_discovery.json
      #refresh_interval: 1m # how often the discovery files are re-read
III. Starting the services:
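Before starting the services, it can be worth checking that every discovery file follows the template shown earlier: a JSON array of groups, each with a "targets" list and an optional "labels" object. The following is a minimal Python sketch of such a check. The function name validate_file_sd is made up for this example, and it only enforces the shape of the template, not everything Prometheus itself validates.

```python
import json


def validate_file_sd(text):
    """Return a list of problems found in a file_sd JSON document (empty if none)."""
    try:
        groups = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(groups, list):
        return ["top level must be a JSON array of target groups"]

    problems = []
    for i, group in enumerate(groups):
        if not isinstance(group, dict):
            problems.append(f"group {i}: must be an object")
            continue
        # "targets" is a list of "host:port" strings.
        targets = group.get("targets")
        if not isinstance(targets, list) or not all(isinstance(t, str) for t in targets):
            problems.append(f"group {i}: 'targets' must be a list of strings")
        # "labels" is optional; when present, keys and values are strings.
        labels = group.get("labels", {})
        if not isinstance(labels, dict) or not all(
            isinstance(k, str) and isinstance(v, str) for k, v in labels.items()
        ):
            problems.append(f"group {i}: 'labels' must map strings to strings")
    return problems


if __name__ == "__main__":
    sample = '[{"targets": ["127.0.0.1:9100"], "labels": {"instance": "test", "idc": "beijing"}}]'
    print(validate_file_sd(sample) or "ok")
```

Running this over node_discovery.json, mysql_discovery.json and redis_discovery.json before the first start catches the most common slip, which is writing "targets" as a single string instead of a list.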
1. Starting the Prometheus server:
nohup sh start.sh > prometheus.log 2>&1 &
-- start.sh contains:
./prometheus --storage.tsdb.path=./data --storage.tsdb.retention.time=168h --web.enable-lifecycle --storage.tsdb.no-lockfile
2. Starting Alertmanager:
nohup sh start.sh > alertmanager.log 2>&1 &
-- start.sh contains:
./alertmanager --config.file="alertmanager.yml"
Starting the remaining services:
nohup ./node_exporter &
nohup ./pushgateway &
--------------------------
Start the exporters for the cloud databases:
docker run -d \
  -p 9104:9104 \
  -e DATA_SOURCE_NAME="user:password@(url)/" \
  prom/mysqld-exporter
docker run -d \
  -p 9121:9121 \
  -e REDIS_ADDR="redis://url:port" \
  -e REDIS_PASSWORD="password" \
  oliver006/redis_exporter
If there are several instances, only the published host port and the connection details need to change.
-----
Alert rule configuration:
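The rule files in this section are loaded through the rule_files pattern "./rules/rule_*.yml" from the Prometheus configuration, so each file has to sit in the rules directory and follow that naming. Below is a small Python sketch of the name check. It uses fnmatchcase as a stand-in for Prometheus' own globbing, which is an approximation and not the same implementation, and the helper name matched_rule_files is hypothetical.

```python
from fnmatch import fnmatchcase

# File-name part of the "./rules/rule_*.yml" entry in rule_files.
RULE_PATTERN = "rule_*.yml"


def matched_rule_files(names, pattern=RULE_PATTERN):
    """Return the file names the pattern would pick up, in input order."""
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    candidates = [
        "rule_node.yml",
        "rule_mysql.yml",
        "rule_redis.yml",
        "node_rules.yml",   # wrong prefix, would be ignored
        "rule_node.yaml",   # wrong extension, would be ignored
    ]
    print(matched_rule_files(candidates))
```

A file that does not match the pattern is skipped without any error, so a misnamed rule file simply never fires.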
rule_node.yml
groups:
- name: Host status monitoring alerts
  rules:
  - alert: Host status
    expr: up == 0
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{$labels.instance}}: server is down"
      description: "{{$labels.instance}}: server has been unreachable for more than 1 minute"
  - alert: CPU usage
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
    for: 1m
    labels:
      status: warning
    annotations:
      summary: "{{$labels.instance}} CPU usage is too high!"
      description: "{{$labels.instance}} CPU usage is above 60% (current: {{$value}}%)"
  - alert: Memory usage
    expr: 100 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} memory usage is too high!"
      description: "{{$labels.instance}} memory usage is above 80% (current: {{$value}}%)"
  - alert: Disk IO
    expr: (avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100) > 60
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} disk IO utilization is too high!"
      description: "{{$labels.instance}} disk IO utilization is above 60% (current: {{$value}}%)"
  - alert: Network receive
    expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 1024) > 102400
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} inbound network bandwidth is too high!"
      description: "{{$labels.instance}} inbound network bandwidth has been above 100 MB/s for 1 minute. RX rate: {{$value}} KB/s"
  - alert: Network transmit
    expr: ((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 1024) > 102400
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} outbound network bandwidth is too high!"
      description: "{{$labels.instance}} outbound network bandwidth has been above 100 MB/s for 1 minute. TX rate: {{$value}} KB/s"
  - alert: TCP sessions
    expr: node_netstat_Tcp_CurrEstab > 1000
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} has too many TCP_ESTABLISHED connections!"
      description: "{{$labels.instance}} TCP_ESTABLISHED is above 1000 (current: {{$value}})"
  - alert: Disk capacity
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} {{$labels.mountpoint}} disk partition usage is too high!"
      description: "{{$labels.instance}} {{$labels.mountpoint}} disk partition usage is above 80% (current: {{$value}}%)"
rule_mysql.yml
groups:
- name: MySQLStatsAlert
  rules:
  - alert: MySQL is down
    expr: up{job="mysql_discovery"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} MySQL is down"
      description: "MySQL database is down. This requires immediate action!"
  - alert: Read buffer size is bigger than max. allowed packet size
    expr: mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Read buffer size is bigger than max. allowed packet size"
      description: "Read buffer size (read_buffer_size) is bigger than max. allowed packet size (max_allowed_packet). This can break your replication."
  - alert: Sort buffer possibly misconfigured
    expr: mysql_global_variables_innodb_sort_buffer_size < 256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Sort buffer possibly misconfigured"
      description: "Sort buffer size is either too big or too small. A good value for sort_buffer_size is between 256k and 4M."
  - alert: Thread stack size is too small
    expr: mysql_global_variables_thread_stack < 196608
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Thread stack size is too small"
      description: "Thread stack size is too small. This can cause problems when you use Stored Language constructs, for example. A typical value for thread_stack is 256k."
  - alert: Used more than 80% of max connections limit
    expr: mysql_global_status_threads_connected > mysql_global_variables_max_connections * 0.8
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Used more than 80% of max connections limit"
      description: "Used more than 80% of max connections limit"
  - alert: InnoDB Force Recovery is enabled
    expr: mysql_global_variables_innodb_force_recovery != 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Force Recovery is enabled"
      description: "InnoDB Force Recovery is enabled. This mode should be used for data recovery purposes only. It prohibits writing to the data."
  - alert: InnoDB Log File size is too small
    expr: mysql_global_variables_innodb_log_file_size < 16777216
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Log File size is too small"
      description: "The InnoDB Log File size is possibly too small. Choosing a small InnoDB Log File size can have significant performance impacts."
  - alert: Table definition cache too small
    expr: mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Table definition cache too small"
      description: "Your Table Definition Cache is possibly too small. If it is much too small this can have significant performance impacts!"
  - alert: Thread stack size is possibly too small
    expr: mysql_global_variables_thread_stack < 262144
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Thread stack size is possibly too small"
      description: "Thread stack size is possibly too small. This can cause problems when you use Stored Language constructs, for example. A typical value for thread_stack is 256k."
  - alert: InnoDB Plugin is enabled
    expr: mysql_global_variables_ignore_builtin_innodb == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Plugin is enabled"
      description: "InnoDB Plugin is enabled"
  - alert: Binary Log is disabled
    expr: mysql_global_variables_log_bin != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Binary Log is disabled"
      description: "Binary Log is disabled. This prevents Point in Time Recovery (PiTR)."
  - alert: IO thread stopped
    expr: mysql_slave_status_slave_io_running != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} IO thread stopped"
      description: "IO thread has stopped. This is usually because it cannot connect to the Master any more."
  - alert: SQL thread stopped
    expr: mysql_slave_status_slave_sql_running == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} SQL thread stopped"
      description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
  - alert: SQL thread stopped
    expr: mysql_slave_status_slave_sql_running != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} SQL thread stopped"
      description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
  - alert: Slave lagging behind Master
    expr: mysql_slave_status_seconds_behind_master > 30
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Slave lagging behind Master"
      description: "Slave is lagging behind Master. Please check if Slave threads are running and if there are some performance issues!"
  - alert: Instance has slow logs
    expr: irate(mysql_global_status_slow_queries[5m]) > 10
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} has slow log"
      description: "slow log"
rule_redis.yml:
groups:
- name: RedisStatsAlert
  rules:
  - alert: redis is down
    expr: up{job="redis_discovery"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} Redis is down"
      description: "Redis is down. This requires immediate action!"
  - alert: redis memory alert
    expr: 100 * (redis_memory_used_bytes{instance!~"pro-sas|pro-redis-dun"} / redis_config_maxmemory{instance!~"pro-sas|pro-redis-dun"}) > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} memory used over 90%"
      description: "Redis is running short of memory"
Final notes:
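Since the server is started with --web.enable-lifecycle, changes to prometheus.yml or to the rule files can be applied without a restart by sending an HTTP POST to the /-/reload endpoint. A minimal Python sketch, assuming Prometheus listens on localhost:9090 as in the scrape configuration; the helper names build_reload_request and reload_prometheus are hypothetical:

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build (but do not send) the POST request for the /-/reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=10):
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Only show what would be sent; call reload_prometheus() against a live server.
    req = build_reload_request()
    print(req.get_method(), req.full_url)
```

If the new configuration is invalid, the reload request fails and the server keeps running with the previous configuration, so the response status is worth checking.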
Monitoring k8s with Prometheus:
http://note.youdao.com/noteshare?id=dbbc868d32835cab5eac5a455df243ed
Prometheus alerting through DingTalk:
https://github.com/timonwong/prometheus-webhook-dingtalk
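The webhook receivers in the Alertmanager configuration deliver each notification as a JSON document with a top-level "status" and an "alerts" list, where every alert carries its own "status", "labels" and "annotations". To illustrate what a bridge has to work with, here is a minimal Python sketch that turns such a payload into a short text message. It is not how prometheus-webhook-dingtalk is implemented; the function name summarize_alerts and the message layout are assumptions made for this example.

```python
def summarize_alerts(payload):
    """Turn a parsed Alertmanager webhook payload into a short text message."""
    alerts = payload.get("alerts", [])
    status = payload.get("status", "unknown").upper()
    lines = [f"[{status}] {len(alerts)} alert(s)"]
    for alert in alerts:
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        name = labels.get("alertname", "?")
        instance = labels.get("instance", "?")
        summary = annotations.get("summary", "")
        state = alert.get("status", "?")
        lines.append(f"- {name} on {instance}: {summary} ({state})")
    return "\n".join(lines)


if __name__ == "__main__":
    example = {
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": "MySQL is down", "instance": "test"},
                "annotations": {"summary": "Instance test MySQL is down"},
            }
        ],
    }
    print(summarize_alerts(example))
```

The summary and description annotations written in the rule files are what end up in the message, which is why they should name the instance rather than a label the expression has aggregated away.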