
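Deploying Prometheus + Grafana monitoring

The files and commands used for the deployment are listed directly below. First, label the target node (xxx is a placeholder for the node name):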

kubectl label node xxx monitoring=true


Create the namespace
 

kubectl create ns monitoring

Deploy the operator

kubectl apply -f operator-rbac.yml
kubectl apply -f operator-dp.yml
kubectl apply -f operator-crd.yml


 # Define node-exporter

kubectl apply -f ./node-exporter/node-exporter-sa.yml
kubectl apply -f ./node-exporter/node-exporter-rbac.yml
kubectl apply -f ./node-exporter/node-exporter-svc.yml
kubectl apply -f ./node-exporter/node-exporter-ds.yml

 # Custom configuration files that define how the data is displayed (Grafana)

kubectl apply -f ./grafana/pv-pvc-hostpath.yml
kubectl apply -f ./grafana/grafana-sa.yml
kubectl apply -f ./grafana/grafana-source.yml
kubectl apply -f ./grafana/grafana-datasources.yml
kubectl apply -f ./grafana/grafana-admin-secret.yml
kubectl apply -f ./grafana/grafana-svc.yml


 # Create the ConfigMaps

kubectl create configmap grafana-config --from-file=./grafana/grafana.ini --namespace=monitoring
kubectl create configmap all-grafana-dashboards --from-file=./grafana/dashboard --namespace=monitoring
kubectl apply -f ./grafana/grafana-dp.yml
kubectl apply -f ./service-discovery/kube-controller-manager-svc.yml
kubectl apply -f ./service-discovery/kube-scheduler-svc.yml

 # Custom configuration files that define scrape and alerting rules
 

kubectl apply -f ./prometheus/prometheus-secret.yml
kubectl apply -f ./prometheus/prometheus-rules.yml
kubectl apply -f ./prometheus/prometheus-rbac.yml
kubectl apply -f ./prometheus/prometheus-svc.yml

# These resources can only be created after prometheus-operator has been deployed successfully
 

kubectl apply -f ./prometheus/pv-pvc-hostpath.yaml
kubectl apply -f ./prometheus/prometheus-main.yml

 # Monitoring targets. The label must be k8s-app, because that is the label Prometheus uses to look targets up; otherwise Prometheus cannot scrape them.

kubectl apply -f ./servicemonitor/alertmanager-sm.yml
kubectl apply -f ./servicemonitor/coredns-sm.yml
kubectl apply -f ./servicemonitor/kube-apiserver-sm.yml
kubectl apply -f ./servicemonitor/kube-controller-manager-sm.yml
kubectl apply -f ./servicemonitor/kube-scheduler-sm.yml
kubectl apply -f ./servicemonitor/kubelet-sm.yml
kubectl apply -f ./servicemonitor/kubestate-metrics-sm.yml
kubectl apply -f ./servicemonitor/node-exporter-sm.yml
kubectl apply -f ./servicemonitor/prometheus-operator-sm.yml
kubectl apply -f ./servicemonitor/prometheus-sm.yml
kubectl apply -f ./servicemonitor/pushgateway-sm.yml
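
The label requirement noted above can be illustrated with a small sketch of equality-based label selection. This is a simplified model and not the Prometheus Operator's actual implementation; the function name selector_matches and the sample labels are hypothetical.

```python
def selector_matches(match_labels, service_labels):
    """Return True when every key/value in match_labels is present in service_labels."""
    return all(service_labels.get(k) == v for k, v in match_labels.items())


# Hypothetical ServiceMonitor selector and Service labels, for illustration only.
monitor_selector = {"k8s-app": "node-exporter"}
labelled_service = {"k8s-app": "node-exporter", "tier": "infra"}
unlabelled_service = {"app": "node-exporter"}

print(selector_matches(monitor_selector, labelled_service))    # True
print(selector_matches(monitor_selector, unlabelled_service))  # False
```

A Service that carries the same name under a different label key (app instead of k8s-app) is simply not selected, which is why the target never shows up in Prometheus.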

# Deploy prometheus-adapter

kubectl apply -f ./prometheus_adapter/metric_rule.yaml
kubectl apply -f ./prometheus_adapter/prometheus_adapter.yaml

For reasons of space the contents of the deployment scripts are not reproduced here; for details see

GitHub - chenrui2200/prometheus_k8s_install

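This creates the monitoring pods

node_exporter and prometheus_k8s are two important components for monitoring and collecting metrics in a Kubernetes environment. The two are closely related; each is described below, followed by how they fit together.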

Node Exporter

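  • Definition: node_exporter is an official Prometheus tool that collects and exposes operating-system and hardware performance metrics, including system-level data for CPU, memory, disk, and network.

  • Function:

    • It runs on every node and monitors that node's system resource usage.
    • It exposes metrics in the format Prometheus expects, so Prometheus can scrape them directly.
  • Use cases:

    • Monitoring the infrastructure health of physical machines, virtual machines, or Kubernetes nodes.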
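
To make the point about the scrape format concrete, here is a minimal sketch that parses a few lines of the Prometheus text exposition format, which is what node_exporter serves on its /metrics endpoint. The sample values are made up, and the parser is deliberately simplified: it assumes label values contain no escaped quotes or braces, and it skips any line it does not recognise (including samples that carry an optional timestamp).

```python
import re

# Made-up sample in the Prometheus text exposition format.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
node_cpu_seconds_total{cpu="0",mode="user"} 789.1
node_load1 0.42
"""

LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # Not a sample line this simplified parser understands.
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Each sample line is a metric name, an optional set of labels in braces, and a numeric value, which is all Prometheus needs in order to store it as a time series.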

Prometheus K8s

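  • Definition: prometheus_k8s is a Prometheus instance dedicated to monitoring Kubernetes resources and the applications running in a Kubernetes cluster.

  • Function:

    • It uses the Kubernetes API to discover services and containers in the cluster automatically and scrapes the metrics they expose.
    • It can monitor Kubernetes components (such as kube-apiserver, kube-scheduler, and kube-controller-manager) as well as the performance of individual applications.
  • Use cases:

    • Monitoring the health and performance of the whole Kubernetes cluster.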
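
As a rough illustration of what service discovery looks like from the client side, the sketch below summarises the JSON that Prometheus returns from its /api/v1/targets endpoint, counting discovered targets per job and health state. The sample payload is hand-written and heavily trimmed, and the function name summarise_targets is hypothetical.

```python
from collections import Counter

# Hand-written, trimmed example of a /api/v1/targets response body.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node-exporter", "instance": "10.0.0.1:9100"}, "health": "up"},
            {"labels": {"job": "node-exporter", "instance": "10.0.0.2:9100"}, "health": "down"},
            {"labels": {"job": "kubelet", "instance": "10.0.0.1:10250"}, "health": "up"},
        ]
    },
}


def summarise_targets(response):
    """Count active targets per (job, health) pair."""
    counts = Counter()
    if response.get("status") != "success":
        return counts
    for target in response.get("data", {}).get("activeTargets", []):
        job = target.get("labels", {}).get("job", "unknown")
        counts[(job, target.get("health", "unknown"))] += 1
    return counts


for (job, health), n in sorted(summarise_targets(SAMPLE_RESPONSE).items()):
    print(f"{job:15s} {health:5s} {n}")
```

Against a live server the same dictionary could be obtained by fetching http://<prometheus-host>/api/v1/targets and decoding the JSON body; the host is left as a placeholder here.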

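How the two relate

  1. Data sources:

    • node_exporter collects system performance metrics on each node, which Prometheus then scrapes.
    • prometheus_k8s collects metrics from the other components and applications in Kubernetes.
  2. Monitoring layer:

    • node_exporter focuses on the underlying hardware and the operating system.
    • prometheus_k8s focuses on Kubernetes resources and applications, including Pods, Services, and other Kubernetes objects.
  3. Integration:

    • In a Kubernetes environment, node_exporter typically runs on every node, and prometheus_k8s is configured to scrape it periodically, which gives full coverage of node performance.

node_exporter and prometheus_k8s complement each other. node_exporter provides infrastructure-level monitoring, while prometheus_k8s adds monitoring of Kubernetes resources and applications. Used together they cover the whole system and help operators find and fix problems promptly.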

Viewing the front end

A Prometheus client in Python

import json, datetime, time
import requests
import pysnooper  # only needed if the @pysnooper.snoop() decorators are enabled


class Prometheus():

    def __init__(self, host=''):
        # '/api/v1/query_range'  range queries
        # '/api/v1/query'        instant queries
        self.host = host
        self.query_path = 'http://%s/api/v1/query' % self.host
        self.query_range_path = 'http://%s/api/v1/query_range' % self.host

    # @pysnooper.snoop()
    def get_istio_service_metric(self, namespace):
        service_metric = {
            "qps": {},
            "gpu": {},
            "memory": {},
            "cpu": {}
        }
        # QPS (request rate)
        qps_expr = 'sum by (destination_workload,response_code) (irate(istio_requests_total{destination_service_namespace="%s"}[1m]))' % (namespace,)
        # print(qps_expr)
        params = {
            'query': qps_expr,
            'start': int(time.time()) - 300,
            'end': int(time.time()),
            'step': "1m",  # pods that ran for less than 1 minute will not be captured
            # 'timeout':"30s"
        }
        print(params)

        try:
            res = requests.get(url=self.query_range_path, params=params)
            metrics = json.loads(res.content.decode('utf8', 'ignore'))
            if metrics['status'] == 'success':
                metrics = metrics['data']['result']
                if metrics:
                    for service in metrics:
                        service_name = service['metric']['destination_workload']
                        if service_name not in service_metric['qps']:
                            service_metric['qps'][service_name] = {}
                        service_metric["qps"][service_name] = service['values']
        except Exception as e:
            print(e)

        # Memory
        mem_expr = 'sum by (pod) (container_memory_working_set_bytes{job="kubelet", image!="",container_name!="POD",namespace="%s"})' % (namespace,)
        # print(mem_expr)
        params = {
            'query': mem_expr,
            'start': int(time.time()) - 300,
            'end': int(time.time()),
            'step': "1m",  # pods that ran for less than 1 minute will not be captured
            # 'timeout':"30s"
        }
        print(params)

        try:
            res = requests.get(url=self.query_range_path, params=params)
            metrics = json.loads(res.content.decode('utf8', 'ignore'))
            if metrics['status'] == 'success':
                metrics = metrics['data']['result']
                if metrics:
                    for pod in metrics:
                        pod_name = pod['metric']['pod']
                        # Fixed: the original initialised service_metric[pod_name] instead
                        if pod_name not in service_metric['memory']:
                            service_metric['memory'][pod_name] = {}
                        service_metric['memory'][pod_name] = pod['values']
        except Exception as e:
            print(e)

        # CPU
        cpu_expr = "sum by (pod) (rate(container_cpu_usage_seconds_total{namespace='%s',container!='POD'}[1m]))" % (namespace)
        params = {
            'query': cpu_expr,
            'start': int(time.time()) - 300,
            'end': int(time.time()),
            'step': "1m",  # pods that ran for less than 1 minute will not be captured
            # 'timeout':"30s"
        }
        print(params)

        try:
            res = requests.get(url=self.query_range_path, params=params)
            metrics = json.loads(res.content.decode('utf8', 'ignore'))
            if metrics['status'] == 'success':
                metrics = metrics['data']['result']
                if metrics:
                    for pod in metrics:
                        pod_name = pod['metric']['pod']
                        # Fixed: the original initialised service_metric[pod_name] instead
                        if pod_name not in service_metric['cpu']:
                            service_metric['cpu'][pod_name] = {}
                        service_metric['cpu'][pod_name] = pod['values']
        except Exception as e:
            print(e)

        # GPU: last 24 hours, start/end passed as timestamp strings
        # (the extra 8-hour offset presumably converts UTC+8 local time to UTC)
        gpu_expr = "avg by (pod) (DCGM_FI_DEV_GPU_UTIL{namespace='%s'})" % (namespace)
        params = {
            'query': gpu_expr,
            'start': (datetime.datetime.now() - datetime.timedelta(days=1) - datetime.timedelta(hours=8)).strftime('%Y-%m-%dT%H:%M:%S.000Z'),
            'end': datetime.datetime.now().strftime('%Y-%m-%dT%H:%M:%S.000Z'),
            'step': "1m",  # pods that ran for less than 1 minute will not be captured
            # 'timeout':"30s"
        }
        print(params)

        try:
            res = requests.get(url=self.query_range_path, params=params)
            metrics = json.loads(res.content.decode('utf8', 'ignore'))
            if metrics['status'] == 'success':
                metrics = metrics['data']['result']
                if metrics:
                    # print(metrics)
                    for pod in metrics:
                        pod_name = pod['metric']['pod']
                        if pod_name not in service_metric['gpu']:
                            service_metric['gpu'][pod_name] = {}
                        service_metric['gpu'][pod_name] = pod['values']
        except Exception as e:
            print(e)

        return service_metric

    # Current per-pod utilization across the cluster
    # @pysnooper.snoop()
    def get_resource_metric(self):
        pod_metric = {}
        # Peak memory per pod over the query window (the last 5 minutes here)
        mem_expr = "sum by (pod) (container_memory_working_set_bytes{container!='POD', container!=''})"
        # print(mem_expr)
        params = {
            'query': mem_expr,
            'start': int(time.time()) - 300,
            'end': int(time.time()),
            'step': "1m",  # pods that ran for less than 1 minute will not be captured
            # 'timeout':"30s"
        }
        print(params)

        try:
            res = requests.get(url=self.query_range_path, params=params)
            metrics = json.loads(res.content.decode('utf8', 'ignore'))
            if metrics['status'] == 'success':
                metrics = metrics['data']['result']
                if metrics:
                    for pod in metrics:
                        if pod['metric']:
                            pod_name = pod['metric']['pod']
                            values = max([float(x[1]) for x in pod['values']])
                            if pod_name not in pod_metric:
                                pod_metric[pod_name] = {}
                            # bytes -> GiB
                            pod_metric[pod_name]['memory'] = round(values / 1024 / 1024 / 1024, 2)
        except Exception as e:
            print(e)

        cpu_expr = "sum by (pod) (rate(container_cpu_usage_seconds_total{container!='POD'}[1m]))"
        params = {
            'query': cpu_expr,
            'start': int(time.time()) - 300,
            'end': int(time.time()),
            'step': "1m",  # pods that ran for less than 1 minute will not be captured
            # 'timeout':"30s"
        }
        print(params)

        try:
            res = requests.get(url=self.query_range_path, params=params)
            metrics = json.loads(res.content.decode('utf8', 'ignore'))
            if metrics['status'] == 'success':
                metrics = metrics['data']['result']
                if metrics:
                    for pod in metrics:
                        if pod['metric']:
                            pod_name = pod['metric']['pod']
                            values = [float(x[1]) for x in pod['values']]
                            # values = round(sum(values) / len(values), 2)
                            values = round(max(values), 2)
                            if pod_name not in pod_metric:
                                pod_metric[pod_name] = {}
                            pod_metric[pod_name]['cpu'] = values
        except Exception as e:
            print(e)

        gpu_expr = "avg by (pod) (DCGM_FI_DEV_GPU_UTIL)"
        params = {
            'query': gpu_expr,
            'start': int(time.time()) - 300,
            'end': int(time.time()),
            'step': "1m",  # pods that ran for less than 1 minute will not be captured
            # 'timeout':"30s"
        }
        print(params)

        try:
            res = requests.get(url=self.query_range_path, params=params)
            metrics = json.loads(res.content.decode('utf8', 'ignore'))
            if metrics['status'] == 'success':
                metrics = metrics['data']['result']
                if metrics:
                    # print(metrics)
                    for pod in metrics:
                        if pod['metric']:
                            pod_name = pod['metric']['pod']
                            values = [float(x[1]) for x in pod['values']]
                            # values = round(sum(values)/len(values),2)
                            values = round(max(values), 2)
                            if pod_name not in pod_metric:
                                pod_metric[pod_name] = {}
                            # percent -> fraction of a GPU
                            pod_metric[pod_name]['gpu'] = values / 100
        except Exception as e:
            print(e)

        return pod_metric

    # @pysnooper.snoop()
    def get_namespace_resource_metric(self, namespace):
        pod_metric = {}
        # Peak memory per pod in the namespace over the last 24 hours
        mem_expr = "sum by (pod) (container_memory_working_set_bytes{namespace='%s',container!='POD', container!=''})" % (namespace,)
        # print(mem_expr)
        params = {
            'query': mem_expr,
            'start': int(time.time()) - 60*60*24,
            'end': int(time.time()),
            'step': "1m",  # pods that ran for less than 1 minute will not be captured
            # 'timeout':"30s"
        }
        print(params)

        try:
            res = requests.get(url=self.query_range_path, params=params)
            metrics = json.loads(res.content.decode('utf8', 'ignore'))
            if metrics['status'] == 'success':
                metrics = metrics['data']['result']
                if metrics:
                    for pod in metrics:
                        pod_name = pod['metric']['pod']
                        values = max([float(x[1]) for x in pod['values']])
                        if pod_name not in pod_metric:
                            pod_metric[pod_name] = {}
                        pod_metric[pod_name]['memory'] = round(values / 1024 / 1024 / 1024, 2)
        except Exception as e:
            print(e)

        cpu_expr = "sum by (pod) (rate(container_cpu_usage_seconds_total{namespace='%s',container!='POD'}[1m]))" % (namespace)
        params = {
            'query': cpu_expr,
            'start': int(time.time()) - 60*60*24,
            'end': int(time.time()),
            'step': "1m",  # pods that ran for less than 1 minute will not be captured
            # 'timeout':"30s"
        }
        print(params)

        try:
            res = requests.get(url=self.query_range_path, params=params)
            metrics = json.loads(res.content.decode('utf8', 'ignore'))
            if metrics['status'] == 'success':
                metrics = metrics['data']['result']
                if metrics:
                    for pod in metrics:
                        pod_name = pod['metric']['pod']
                        values = [float(x[1]) for x in pod['values']]
                        # values = round(sum(values) / len(values), 2)
                        values = round(max(values), 2)
                        if pod_name not in pod_metric:
                            pod_metric[pod_name] = {}
                        pod_metric[pod_name]['cpu'] = values
        except Exception as e:
            print(e)

        gpu_expr = "avg by (pod) (DCGM_FI_DEV_GPU_UTIL{namespace='%s'})" % (namespace)
        params = {
            'query': gpu_expr,
            'start': int(time.time()) - 60*60*24,
            'end': int(time.time()),
            'step': "1m",  # pods that ran for less than 1 minute will not be captured
            # 'timeout':"30s"
        }
        print(params)

        try:
            res = requests.get(url=self.query_range_path, params=params)
            metrics = json.loads(res.content.decode('utf8', 'ignore'))
            if metrics['status'] == 'success':
                metrics = metrics['data']['result']
                if metrics:
                    # print(metrics)
                    for pod in metrics:
                        pod_name = pod['metric']['pod']
                        values = [float(x[1]) for x in pod['values']]
                        # values = round(sum(values)/len(values),2)
                        values = round(max(values), 2)
                        if pod_name not in pod_metric:
                            pod_metric[pod_name] = {}
                        pod_metric[pod_name]['gpu'] = values / 100
        except Exception as e:
            print(e)

        return pod_metric

    # @pysnooper.snoop()
    def get_pod_resource_metric(self, pod_name, namespace):
        max_cpu = 0
        max_mem = 0
        ave_gpu = 0

        # Peak memory of this pod over the last 24 hours
        mem_expr = "sum by (pod) (container_memory_working_set_bytes{namespace='%s', pod=~'%s.*',container!='POD', container!=''})" % (namespace, pod_name)
        # print(mem_expr)
        params = {
            'query': mem_expr,
            'start': int(time.time()) - 60*60*24,
            'end': int(time.time()),
            'step': "1m",  # pods that ran for less than 1 minute will not be captured
            # 'timeout':"30s"
        }
        print(params)

        try:
            res = requests.get(url=self.query_range_path, params=params)
            metrics = json.loads(res.content.decode('utf8', 'ignore'))
            if metrics['status'] == 'success':
                metrics = metrics['data']['result']
                if metrics:
                    metrics = metrics[0]['values']
                    for metric in metrics:
                        # Fixed: convert to GiB before comparing, so both sides use the same unit
                        mem_gb = float(metric[1]) / 1024 / 1024 / 1024
                        if mem_gb > max_mem:
                            max_mem = mem_gb
        except Exception as e:
            print(e)

        cpu_expr = "sum by (pod) (rate(container_cpu_usage_seconds_total{namespace='%s',pod=~'%s.*',container!='POD'}[1m]))" % (namespace, pod_name)
        params = {
            'query': cpu_expr,
            'start': int(time.time()) - 60*60*24,
            'end': int(time.time()),
            'step': "1m",  # pods that ran for less than 1 minute will not be captured
            # 'timeout':"30s"
        }
        print(params)

        try:
            res = requests.get(url=self.query_range_path, params=params)
            metrics = json.loads(res.content.decode('utf8', 'ignore'))
            if metrics['status'] == 'success':
                metrics = metrics['data']['result']
                if metrics:
                    metrics = metrics[0]['values']
                    for metric in metrics:
                        if float(metric[1]) > max_cpu:
                            max_cpu = float(metric[1])
        except Exception as e:
            print(e)

        gpu_expr = "avg by (pod) (DCGM_FI_DEV_GPU_UTIL{namespace='%s',pod=~'%s.*'})" % (namespace, pod_name)
        params = {
            'query': gpu_expr,
            'start': int(time.time()) - 60*60*24,
            'end': int(time.time()),
            'step': "1m",  # pods that ran for less than 1 minute will not be captured
            # 'timeout':"30s"
        }
        print(params)

        try:
            res = requests.get(url=self.query_range_path, params=params)
            metrics = json.loads(res.content.decode('utf8', 'ignore'))
            if metrics['status'] == 'success':
                metrics = metrics['data']['result']
                if metrics:
                    metrics = metrics[0]['values']
                    all_util = [float(metric[1]) for metric in metrics]
                    ave_gpu = sum(all_util) / len(all_util) / 100
        except Exception as e:
            print(e)

        return {"cpu": round(max_cpu, 2), "memory": round(max_mem, 2), 'gpu': round(ave_gpu, 2)}

    # TODO: fill in the remaining machine-load queries
    # @pysnooper.snoop()
    def get_machine_metric(self):
        # PromQL expression per metric; the empty ones are still to be written
        exprs = {
            "pod_num": "sum(kubelet_running_pod_count)by (node)",
            "request_memory": "",
            "request_cpu": "",
            "request_gpu": "",
            "used_memory": "",
            "used_cpu": "",
            "used_gpu": "",
        }
        back = {}
        for metric_name in exprs:
            params = {
                'query': exprs[metric_name],
                'timeout': "30s"
            }
            print(params)
            back[metric_name] = {}

            try:
                res = requests.get(url=self.query_path, params=params)
                # Fixed: the original reused the name of the dict being iterated for the response
                metrics = json.loads(res.content.decode('utf8', 'ignore'))
                if metrics['status'] == 'success':
                    metrics = metrics['data']['result']
                    if metrics:
                        for metric in metrics:
                            node = metric['metric']['node']
                            if ':' in node:
                                node = node[:node.index(':')]
                            value = metric['value'][1]
                            back[metric_name][node] = int(value)
            except Exception as e:
                print(e)

        return back
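
Each method in the client above follows the same pattern: call query_range, check the status field, then walk data.result. The sketch below isolates the parsing half on a hand-written response so the shape of a range-query (matrix) result is easier to see. The helper name peak_by_pod and the sample numbers are illustrative and are not part of the original client.

```python
# Hand-written example of a /api/v1/query_range response (resultType "matrix").
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {"metric": {"pod": "web-0"},
             "values": [[1700000000, "1073741824"], [1700000060, "2147483648"]]},
            {"metric": {"pod": "db-0"},
             "values": [[1700000000, "536870912"]]},
            {"metric": {}, "values": [[1700000000, "1"]]},  # series without a pod label
        ],
    },
}


def peak_by_pod(response, scale=1.0):
    """Return {pod: max sample value / scale}, skipping series without a pod label."""
    if response.get("status") != "success":
        return {}
    peaks = {}
    for series in response["data"]["result"]:
        pod = series["metric"].get("pod")
        if pod is None or not series["values"]:
            continue
        # Each sample is [unix_timestamp, "value as string"].
        peaks[pod] = max(float(value) for _, value in series["values"]) / scale
    return peaks


GIB = 1024 ** 3
print(peak_by_pod(SAMPLE_RESPONSE, scale=GIB))  # {'web-0': 2.0, 'db-0': 0.5}
```

Sample values arrive as strings, so they have to be converted with float() before any comparison or arithmetic.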

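Prometheus queries are written in PromQL, a functional-style expression language for selecting and aggregating data from the time-series database. The cpu_expr query in the code above is a typical example; its parts are explained below:

  1. sum: an aggregation function that adds up a set of time series. It yields the total across all matching series.

  2. by (pod): a grouping clause that groups the result by the pod label, so the query returns one CPU usage total per pod.

  3. rate(container_cpu_usage_seconds_total{...}[1m]):

    • rate computes the per-second average rate of increase of a time series. Here it computes the rate of change of container_cpu_usage_seconds_total over the past 1 minute.
    • container_cpu_usage_seconds_total is a counter metric that records the cumulative CPU time consumed by a container.
    • {namespace='%s',container!='POD'} is a label selector that filters the series: it selects all containers in the given namespace but excludes the container named POD.

Overall, the query computes the CPU usage of all non-POD containers in the given namespace, grouped by pod. Because it uses rate, it returns the current rate of CPU consumption rather than the cumulative total, which better reflects the containers' current state.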
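
The arithmetic behind sum by (pod) (rate(...)) can be sketched in a few lines of Python. This is a simplified model on made-up samples: it takes the per-second increase between the first and last sample in the window, and it ignores the counter-reset handling and extrapolation that the real rate() performs.

```python
from collections import defaultdict

# Made-up counter samples: (pod, container) -> [(unix_timestamp, cpu_seconds_total), ...]
SAMPLES = {
    ("web-0", "app"):     [(0, 100.0), (30, 112.0), (60, 130.0)],
    ("web-0", "sidecar"): [(0, 10.0), (30, 17.0), (60, 25.0)],
    ("db-0", "postgres"): [(0, 500.0), (60, 560.0)],
}


def simple_rate(points):
    """Per-second increase between the first and last sample in the window."""
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)


def sum_rate_by_pod(samples):
    """Mimic sum by (pod) (rate(...)): add up the per-container rates for each pod."""
    totals = defaultdict(float)
    for (pod, _container), points in samples.items():
        totals[pod] += simple_rate(points)
    return dict(totals)


print(sum_rate_by_pod(SAMPLES))  # {'web-0': 0.75, 'db-0': 1.0}
```

The result is in CPU-seconds per second, that is, cores: web-0 is using three quarters of a core across its two containers, and db-0 one full core.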

    # conf, global_cluster_load and __ (the i18n helper) come from the surrounding project and are not shown here
    def echart(self, filters=None):
        # Address of the prometheus-k8s service
        prometheus = Prometheus(conf.get('PROMETHEUS', 'prometheus-k8s.monitoring:9090'))

        pod_resource_metric = prometheus.get_resource_metric()
        print('pod_resource_metric', pod_resource_metric)

        # Total capacity across all clusters
        all_resource = {
            "mem_all": sum([int(global_cluster_load[cluster_name]['mem_all']) for cluster_name in global_cluster_load]),
            "cpu_all": sum([int(global_cluster_load[cluster_name]['cpu_all']) for cluster_name in global_cluster_load]),
            "gpu_all": sum([int(global_cluster_load[cluster_name]['gpu_all']) for cluster_name in global_cluster_load]),
        }
        # Resources requested by workloads
        all_resource_req = {
            "mem_req": sum([int(global_cluster_load[cluster_name]['mem_req']) for cluster_name in global_cluster_load]),
            "cpu_req": sum([int(global_cluster_load[cluster_name]['cpu_req']) for cluster_name in global_cluster_load]),
            "gpu_req": sum([int(global_cluster_load[cluster_name]['gpu_req']) for cluster_name in global_cluster_load]),
        }
        # Resources actually used, as reported by Prometheus
        all_resource_used = {
            "mem_used": sum([pod_resource_metric[x].get('memory', 0) for x in pod_resource_metric]),
            "cpu_used": sum([pod_resource_metric[x].get('cpu', 0) for x in pod_resource_metric]),
            "gpu_used": sum([pod_resource_metric[x].get('gpu', 0) for x in pod_resource_metric]),
        }

        # The 0.001 term in each denominator guards against division by zero
        option = {
            "title": [
                {
                    "subtext": __('Cluster info'),
                    "MEM_NAME": __('Memory request ratio'),
                    "MEM_VALUE": int(100 * all_resource_req['mem_req'] / (0.001 + all_resource['mem_all'])),
                    "CPU_NAME": __('CPU allocation ratio'),
                    "CPU_VALUE": int(all_resource['cpu_all']),
                    'GPU_NAME': __('Total GPUs (cards)'),
                    'GPU_VALUE': int(all_resource['gpu_all']),
                    'MEM_MAX': int(all_resource['mem_all'] * 2),
                    'CPU_MAX': int(all_resource['cpu_all'] * 2),
                    'GPU_MAX': int(all_resource['gpu_all'] * 2)
                },
                {
                    "subtext": __('Resource allocation ratio'),
                    "MEM_NAME": __('Memory allocation ratio'),
                    "MEM_VALUE": int(100 * all_resource_req['mem_req'] / (0.001 + all_resource['mem_all'])),
                    "CPU_NAME": __('CPU allocation ratio'),
                    "CPU_VALUE": int(100 * all_resource_req['cpu_req'] / (0.001 + all_resource['cpu_all'])),
                    "GPU_NAME": __('GPU allocation ratio'),
                    "GPU_VALUE": int(100 * all_resource_req['gpu_req'] / (0.001 + all_resource['gpu_all']))
                },
                {
                    "subtext": __('Resource utilization'),
                    'MEM_NAME': __('Memory utilization'),
                    'MEM_VALUE': str(min(100, int(100 * all_resource_used['mem_used'] / (0.001 + all_resource['mem_all'])))),
                    'CPU_NAME': __('CPU utilization'),
                    'CPU_VALUE': str(min(100, int(100 * all_resource_used['cpu_used'] / (0.001 + all_resource['cpu_all'])))),
                    'GPU_NAME': __('GPU utilization'),
                    'GPU_VALUE': str(min(100, int(100 * all_resource_used['gpu_used'] / (0.001 + all_resource['gpu_all']))))
                }
            ]
        }
        return option
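
The percentage arithmetic used throughout echart can be pulled out into one small function. The sketch below assumes the same formula as the utilization entries (a 0.001 term in the denominator, truncation with int(), and a cap at 100); the function name percent is hypothetical.

```python
def percent(used, total, cap=100):
    """Integer percentage of used/total; 0.001 avoids division by zero, cap limits overshoot."""
    return min(cap, int(100 * used / (0.001 + total)))


print(percent(32, 128))  # 24
print(percent(5, 0))     # 100
print(percent(0, 0))     # 0
```

One side effect is visible in the first call: the 0.001 term together with int() truncation turns an exact 25% into 24. Using round() instead of int() would avoid that, at the cost of departing from the formula in the code above.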

Attached is an operations dashboard built with the code above.
