Getting Started with Prometheus (Part 3)

Date: May 14, 2023

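Introduction to Prometheus alerting:

Prometheus defines alert conditions as PromQL expressions. Once a condition is met, the alert is shown on the Prometheus web page. With Alertmanager connected, alert notifications can then be pushed through Alertmanager to different platforms.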

Prometheus alerting architecture:

[Figure: Prometheus alerting architecture diagram]
 

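Configuring Prometheus alerts:

Prometheus alerting rules use PromQL expressions to define alert conditions; when a condition is met, an alert is triggered.

1. Edit prometheus.yml and set the path to the rules files:

    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
      - /usr/local/prometheus/*.yml   # load every rules file under the prometheus directory

By default the rules are evaluated once per minute; set evaluation_interval to override the evaluation period.

2. Edit the rules file and define the alerting rule:

    groups:                            # a rule group can hold several rules
      - name: hostStatsAlert           # rule group name
        rules:
          - alert: hostCpuUsageAlert   # alert name
            # PromQL alert expression; the alert is triggered while it holds
            expr: (sum(increase(node_cpu_seconds_total[1m])) by (instance)) > 59
            # Optional wait time: the alert is sent only after the condition has
            # held for this long; until then a new alert is in the pending state
            for: 1m
            labels:                    # custom labels to attach to the alert
              severity: page
            annotations:               # additional information
              summary: "Instance {{ $labels.instance }} CPU usage high"   # short summary
              description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"   # detailed description

The $labels.<labelname> variable gives the value of the named label on the current alert instance, and $value gives the sample value computed by the PromQL expression. Note that this expression sums node_cpu_seconds_total over every CPU mode, idle included, so its value is roughly 60 CPU-seconds per core per minute whatever the load, and the threshold of 59 does not correspond to the 85% in the description; adding a mode!="idle" matcher would count only busy time.

3. Restart the Prometheus server.

4. Manually drive up CPU utilization:

    root@dokcer:~# cat /dev/zero>/dev/null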
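To make the arithmetic behind the rule expression concrete, here is a minimal Python sketch of what an increase over a one-minute window looks like for a CPU-seconds counter. The counter readings, the three modes and the single-core host are made-up assumptions for illustration, and the helper names are hypothetical; real node_exporter output has more modes and one series per core.

```python
def counter_increase(start, end):
    """Per-mode growth of a CPU-seconds counter between two samples."""
    return {mode: end[mode] - start[mode] for mode in start}


def busy_percent(increase, window_seconds, cores):
    """Share of the available CPU time spent in any mode other than idle."""
    busy = sum(value for mode, value in increase.items() if mode != "idle")
    return 100.0 * busy / (window_seconds * cores)


# Hypothetical counter readings taken 60 s apart on a single-core host.
t0 = {"user": 1000.0, "system": 400.0, "idle": 9000.0}
t1 = {"user": 1006.0, "system": 403.0, "idle": 9051.0}

inc = counter_increase(t0, t1)
total = sum(inc.values())            # all modes together, idle included
usage = busy_percent(inc, 60.0, 1)   # non-idle share of the window
print(total, usage)
```

With these numbers the sum over all modes is 60 CPU-seconds even though the host is only 15% busy, which is why a usage percentage is normally derived from the non-idle modes.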
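After the Prometheus server restarts, the configured alerting rules and the current alert state are visible:
[Screenshot: alerting rules and current alert state]
Because the wait time is set to one minute, the alert moves from PENDING to FIRING only after one minute:
[Screenshot: alert in the FIRING state]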
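The PENDING to FIRING transition described above can be sketched as a small state machine. This is an illustrative model, not Prometheus source code: the function names are hypothetical, and the 30-second evaluation interval in the example is an assumption chosen to make the pending phase visible.

```python
from datetime import datetime, timedelta


def alert_state(active_since, now, for_duration):
    """Return 'inactive', 'pending' or 'firing' for one alert instance."""
    if active_since is None:
        return "inactive"
    if now - active_since >= for_duration:
        return "firing"
    return "pending"


def run_evaluations(results, start, interval, for_duration):
    """Walk a list of booleans (one per rule evaluation) and collect states."""
    states = []
    active_since = None
    for i, condition_met in enumerate(results):
        now = start + i * interval
        if condition_met:
            if active_since is None:
                active_since = now   # condition just became true
        else:
            active_since = None      # condition cleared, reset the timer
        states.append(alert_state(active_since, now, for_duration))
    return states


# Condition is false once, true for three evaluations, then false again.
states = run_evaluations(
    [False, True, True, True, False],
    start=datetime(2023, 5, 14, 12, 0, 0),
    interval=timedelta(seconds=30),
    for_duration=timedelta(minutes=1),
)
print(states)
```

The alert only reaches the firing state once the condition has held continuously for the whole `for` duration; a single false evaluation resets it to inactive.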
 
 

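Deploying Alertmanager and connecting it to Prometheus:

Alertmanager configuration:

| Section | Purpose |
| --- | --- |
| Global configuration (global) | Defines shared global parameters, such as the global SMTP and Slack settings |
| Templates (templates) | Defines the templates used for alert notifications, such as HTML or email templates |
| Alert routing (route) | Matches on labels to decide how the current alert should be handled |
| Receivers (receivers) | A receiver is an abstract concept: it can be an email address, WeChat, Slack, a webhook and so on; receivers are normally used together with alert routing |
| Inhibition rules (inhibit_rules) | Well-chosen inhibition rules reduce the number of junk alerts |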
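How an inhibition rule decides whether to mute an alert can be sketched in a few lines of Python. This is a simplified model written for illustration, assuming exact-match label comparison and treating a missing label as an empty value; the function name and the sample alerts are hypothetical.

```python
def is_inhibited(target, firing_alerts, source_match, target_match, equal):
    """Return True if `target` should be muted by some firing source alert."""
    def matches(labels, wanted):
        return all(labels.get(key) == value for key, value in wanted.items())

    if not matches(target, target_match):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, source_match):
            continue
        # A missing label is treated like an empty value on both sides.
        if all(source.get(name, "") == target.get(name, "") for name in equal):
            return True
    return False


critical = {"alertname": "hostCpuUsageAlert", "instance": "node1", "severity": "critical"}
warning_same = {"alertname": "hostCpuUsageAlert", "instance": "node1", "severity": "warning"}
warning_other = {"alertname": "hostCpuUsageAlert", "instance": "node2", "severity": "warning"}
firing = [critical, warning_same, warning_other]

src = {"severity": "critical"}
tgt = {"severity": "warning"}

muted_same = is_inhibited(warning_same, firing, src, tgt, ["alertname", "dev", "instance"])
muted_other = is_inhibited(warning_other, firing, src, tgt, ["alertname", "dev", "instance"])
muted_by_severity = is_inhibited(warning_same, firing, src, tgt, ["severity", "dev", "instance"])
print(muted_same, muted_other, muted_by_severity)
```

The warning on node1 is muted because a critical alert with the same alertname and instance is firing, while the warning on node2 is not. Listing severity in `equal` never mutes anything in this model, because the source and target are required to differ on exactly that label.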
1. Download Alertmanager:

    root@dokcer:~# wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz

2. Extract the Alertmanager archive:

    root@dokcer:~# tar -xzvf alertmanager-0.24.0.linux-amd64.tar.gz -C /usr/local/

3. Create a symbolic link:

    root@dokcer:~# ln -sv /usr/local/alertmanager-0.24.0.linux-amd64/alertmanager /usr/local/bin/alertmanager
    '/usr/local/bin/alertmanager' -> '/usr/local/alertmanager-0.24.0.linux-amd64/alertmanager'

4. Edit alertmanager.yml:

    route:                       # routing
      group_by: ['severity']     # labels used to group alerts
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'web.hook'
    receivers:
      - name: 'web.hook'
        webhook_configs:
          - url: 'http://127.0.0.1:5001/'
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        # the warning is suppressed only when both alerts have equal values for these labels
        equal: ['alertname', 'dev', 'instance']

5. Start Alertmanager:

    root@dokcer:~# nohup alertmanager --config.file='/usr/local/alertmanager-0.24.0.linux-amd64/alertmanager.yml' &

Open http://IP:9093 to see the alerts in the web interface:
[Screenshot: Alertmanager web interface]
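The effect of the group_by setting in the route above can be illustrated with a short Python sketch: alerts that share the same values for the listed labels land in one group and are notified together. The function name and sample alerts are hypothetical, and the timing settings (group_wait, group_interval, repeat_interval) are deliberately not modelled.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alert label sets by the values of the `group_by` labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "hostCpuUsageAlert", "instance": "node1", "severity": "page"},
    {"alertname": "hostCpuUsageAlert", "instance": "node2", "severity": "page"},
    {"alertname": "hostMemUsageAlert", "instance": "node1", "severity": "warning"},
]

by_severity = group_alerts(alerts, ["severity"])
for key, members in by_severity.items():
    print(key, len(members))
```

Grouping by severity yields one notification group for the two page alerts and another for the warning; grouping by alertname or instance instead would split them differently.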
 

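Connecting Prometheus to Alertmanager:

1. Edit the alerting section of prometheus.yml:

    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["192.168.0.50:9093"]
              # - alertmanager:9093

2. Restart Prometheus.

After this, alerts are forwarded from Prometheus to Alertmanager, which then pushes them to different platforms (email, mobile, webhook and so on) according to its configuration.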
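For testing the Alertmanager side without waiting for a real rule to fire, an alert can also be posted by hand. The sketch below only builds the HTTP request and does not send it; it assumes Alertmanager's v2 HTTP API endpoint /api/v2/alerts, and the helper name and label values are made up for illustration.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_alert_request(alertmanager, labels, annotations, starts_at):
    """Build (but do not send) a POST request carrying one alert."""
    body = [{
        "labels": labels,
        "annotations": annotations,
        "startsAt": starts_at.isoformat(),   # timezone-aware, RFC 3339 style
    }]
    return urllib.request.Request(
        url=f"http://{alertmanager}/api/v2/alerts",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_alert_request(
    "192.168.0.50:9093",
    labels={"alertname": "manualTestAlert", "severity": "page"},
    annotations={"summary": "manually injected test alert"},
    starts_at=datetime(2023, 5, 14, 12, 0, 0, tzinfo=timezone.utc),
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=5)
```

Sending is left commented out so the example has no side effects; once sent, the alert should appear in the Alertmanager web interface and follow the configured route.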

Sending alerts through a webhook:

    route:                       # routing
      group_by: ['severity']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'web.hook'       # receiver
    receivers:
      - name: 'web.hook'
        webhook_configs:         # this receiver uses a webhook
          - url: 'http://127.0.0.1:5001/'   # address to push to
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']

When an alert is triggered, Alertmanager sends a JSON request to the url address via HTTP POST.

JSON format:

    {
      "version": "4",
      "groupKey": <string>,              // key identifying the group of alerts (e.g. to deduplicate)
      "truncatedAlerts": <int>,          // how many alerts have been truncated due to "max_alerts"
      "status": "<resolved|firing>",
      "receiver": <string>,
      "groupLabels": <object>,
      "commonLabels": <object>,
      "commonAnnotations": <object>,
      "externalURL": <string>,           // backlink to the Alertmanager.
      "alerts": [
        {
          "status": "<resolved|firing>",
          "labels": <object>,
          "annotations": <object>,
          "startsAt": "<rfc3339>",
          "endsAt": "<rfc3339>",
          "generatorURL": <string>,      // identifies the entity that caused the alert
          "fingerprint": <string>        // fingerprint to identify the alert
        },
        ...
      ]
    }
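A receiver usually wants a few fields out of this payload rather than the whole document. The following sketch turns a webhook body into one text line per alert; the function name is hypothetical, and the sample body is a hand-written example that follows the format above, not captured output.

```python
import json


def summarize_webhook(raw_body):
    """Turn a webhook JSON body into one short text line per alert."""
    payload = json.loads(raw_body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            f'[{alert.get("status", "unknown")}] '
            f'{labels.get("alertname", "?")} on {labels.get("instance", "?")}: '
            f'{annotations.get("summary", "")}'
        )
    return lines


# Hand-written sample body; the instance address is an assumption.
sample = json.dumps({
    "version": "4",
    "status": "firing",
    "receiver": "web.hook",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "hostCpuUsageAlert",
                   "instance": "192.168.0.50:9100", "severity": "page"},
        "annotations": {"summary": "Instance 192.168.0.50:9100 CPU usage high"},
    }],
})

lines = summarize_webhook(sample)
print("\n".join(lines))
```

Using `.get` with defaults keeps the function tolerant of alerts that lack an instance label or a summary annotation.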

Verifying the webhook:

Write a simple web server in Python. Once the webhook url is set to its address, it receives the POST requests sent by Alertmanager:
web_server:

    import socket


    def server_start(port):
        server = socket.socket()
        # Allow the port to be reused immediately after a restart
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, True)
        server.bind(("192.168.0.76", port))
        server.listen(128)
        while True:
            client, ip_port = server.accept()
            print(f"Client {ip_port[0]} connected")
            # Reads at most 1024 bytes, so a long alert body may be cut off
            request_data = client.recv(1024).decode(errors="replace")
            print(request_data)  # print the received request
            if len(request_data) == 0:
                client.close()
            else:
                request_path = request_data.split(" ")[1]
                if request_path == "/":
                    request_path = "index.html"
                else:
                    request_path = request_path.replace("/", "")
                print(request_path)
                try:
                    with open(request_path, "rb") as file:
                        file_content = file.read()
                except OSError:
                    # Requested file is missing: answer 404 with the error page
                    response_line = "HTTP/1.1 404 NOT FOUND\r\n"
                    response_head = "Server: Python Server2.0\r\n"
                    with open("../miniweb/error.html", "rb") as error_file:
                        error_data = error_file.read()
                    response_data = (response_line + response_head + "\r\n").encode() + error_data
                    client.sendall(response_data)
                else:
                    response_line = "HTTP/1.1 200 OK\r\n"
                    response_head = "Server: Python Server2.0\r\n"
                    response_data = (response_line + response_head + "\r\n").encode() + file_content
                    client.sendall(response_data)
                finally:
                    client.close()


    if __name__ == '__main__':
        server_start(7777)

The received alert:
[Screenshot: alert request received by the web server]
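As an alternative to the raw socket server, a receiver built on Python's standard http.server module reads the whole request body using the Content-Length header and parses it as JSON. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, payloads are simply kept in an in-memory list, and there is no authentication or error reporting beyond a 400 response.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # parsed payloads, kept in memory for inspection


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly Content-Length bytes so long bodies are not truncated.
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        try:
            payload = json.loads(raw)
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            self.send_response(400)
            self.end_headers()
            return
        received.append(payload)
        print(f"{payload.get('status')}: {len(payload.get('alerts', []))} alert(s)")
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence the default per-request access log


def start_receiver(host="127.0.0.1", port=0):
    """Start the receiver in a background thread and return the server."""
    server = HTTPServer((host, port), WebhookHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_receiver("192.168.0.76", 7777)` would listen on the same address as the socket server above; because the serving thread is a daemon, a standalone script also needs to keep its main thread alive, for example by calling `serve_forever()` on an `HTTPServer` directly instead.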
 