Deploying Thanos with MinIO on a k8s 1.28.2 cluster for long-term Prometheus data storage


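What is Thanos

Thanos is an extension layer for Prometheus that addresses Prometheus's storage, scalability, and high-availability limits in large environments.

It is a good fit for monitoring large clusters, especially when you need long-term retention of monitoring data and a global query view.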

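Main features of Thanos

  • Global Query View
    • The Querier component can query multiple Prometheus instances and deduplicate results across data sources
    • Even when a large cluster runs many Prometheus instances, all monitoring data can be queried through a single endpoint
  • Unlimited Retention
    • Prometheus's local storage is meant for short-term data; Thanos uploads monitoring data to object storage (Amazon S3, Google Cloud Storage, MinIO, and so on) for long-term retention
  • Prometheus Compatible
    • Grafana and any other tool that supports the Prometheus query API can query Prometheus data through Thanos
  • Downsampling & Compaction
    • The Compactor component periodically compacts and downsamples the data in object storage, and can deduplicate it, to cut storage cost and speed up queries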

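Thanos architecture components

Following the KISS and Unix philosophies, Thanos is made up of a set of components, each with a specific role.

  • Sidecar
    • Deployed next to each Prometheus instance; uploads its data to object storage and exposes the Prometheus data to the Querier (Prometheus v2.13.0+ is strongly recommended because of its improved remote read)
  • Store Gateway
    • Store for short; serves historical monitoring data from object storage (AWS S3, Google Cloud Storage, MinIO, and so on)
  • Compactor
    • Compacts, downsamples, and optimizes the data in object storage to improve query performance and reduce storage cost
  • Receiver
    • Receives and stores the data that Prometheus instances send through Remote Write
  • Ruler/Rule
    • Evaluates recording and alerting rules against the stored data, much like the rule evaluation built into Prometheus, and sends firing alerts to Alertmanager
  • Querier/Query
    • The global query component; pulls data from multiple Prometheus instances and from object storage and exposes a single query interface
  • Query Frontend
    • Sits in front of Query and speeds up heavy queries through query splitting, caching, and request queueing, improving response time under high load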

Thanos deployment architectures

Sidecar

The Sidecar uses Prometheus's reload endpoint, so make sure Prometheus is started with the --web.enable-lifecycle flag.
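As a rough illustration of what that endpoint looks like, the sketch below builds the reload URL and only prints it; calling reload_prometheus would actually POST to it with curl. The function names and the PROM_URL default are assumptions made for this example. /-/reload itself is the Prometheus lifecycle endpoint that --web.enable-lifecycle turns on.

```shell
# Base URL of the Prometheus instance; localhost:9090 is an assumed default.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print the lifecycle reload URL for a given base URL (a trailing slash is dropped).
reload_url() {
  printf '%s/-/reload\n' "${1%/}"
}

# POST to the reload endpoint; curl exits non-zero on an HTTP error,
# for example when the lifecycle endpoints are not enabled.
reload_prometheus() {
  curl -fsS -X POST "$(reload_url "$1")"
}

# Only show what would be called; run `reload_prometheus "$PROM_URL"` to reload.
echo "reload endpoint: $(reload_url "$PROM_URL")"
```

Keeping the URL builder separate from the curl call makes it easy to check the target before sending anything to a live Prometheus.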

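  • Pros
    • Lightweight: the Sidecar is a small proxy that only has to run next to the Prometheus instance, with no major changes to Prometheus itself.
    • Real-time data access: the Sidecar lets Thanos read Prometheus's live data directly, so the newest monitoring data is always queryable.
    • Long-term storage integration: Prometheus data is uploaded to object storage periodically, which makes up for Prometheus's lack of native long-term storage.
  • Cons
    • Depends on Prometheus: the Sidecar needs a running Prometheus instance; if that instance goes down, the Sidecar cannot serve queries either.
    • Limited horizontal scaling: the Sidecar is not designed for large-scale ingestion. It is a companion to Prometheus and cannot scale out to absorb large data volumes the way Receiver can.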

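Receiver

  • Pros
    • Large-scale ingestion: Receiver can efficiently take in large volumes of data from Prometheus instances, which suits large deployments.
    • Multi-tenancy: it can handle and isolate data from multiple tenants, which is useful when monitoring several independent environments.
    • Horizontal scaling: by sharding data and adding Receiver instances, it can keep up with a growing ingestion load.
    • Deduplication and high availability: Receiver supports highly available multi-instance setups and uses deduplication to avoid storing the same data twice.
  • Cons
    • No direct query capability: Receiver cannot be queried on its own; the data it receives is queried and analyzed through other Thanos components (such as Querier and Store).
    • Lower freshness: compared with querying a Prometheus instance directly, Receiver may add some delay in data processing and querying.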

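Sidecar vs. Receiver at a glance (taken from ChatGPT)

Feature | Thanos Sidecar | Thanos Receiver
Main function | Integrates with a Prometheus instance; provides real-time data access and long-term storage | Receives and stores data remote-written by Prometheus instances
Data source | Reads directly from Prometheus | Prometheus Remote Write data
Storage | Periodically uploads Prometheus blocks to object storage | Stores received data in a local TSDB and uploads blocks to object storage
Horizontal scaling | None; tied to a single Prometheus instance | Scales out by adding instances
Real-time queries | Serves real-time Prometheus data | No query interface of its own; data is read through Querier
Multi-tenancy | Not supported | Supported; suited to multi-tenant environments
High availability | Depends on the Prometheus instance | Supports HA deployment with deduplication
Use case | Integrating with existing Prometheus instances for long-term storage | Data ingestion and storage for large, multi-tenant environments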

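Choosing an architecture

  • "Multi-cluster Thanos monitoring and alerting in practice" (original title: 多集群 thanos 监控告警实践)
  • "Building a large cloud-native distributed monitoring system (3): Thanos deployment and practice" (original title: 打造云原生大型分布式监控系统 (三): Thanos 部署与实践)
  • The advice below is drawn from these two blog posts; which architecture to pick is something you have to verify and judge against your own situation
  • The main difference between Sidecar and Receiver is how the latest data is queried
    • Sidecar: the latest data is read directly from the local Prometheus instance
    • Receiver: Prometheus remote-writes everything to Receiver, so the latest data is served by Receiver and older blocks come from object storage (S3 and the like)
  • When the Prometheus fleet is small and scrapes few services, query latency is usually acceptable even if the Sidecar and the global Query sit in different data centers, as long as both are in the same country
  • When the Prometheus fleet is large and scrapes a lot of data, still prefer the Sidecar architecture where possible: once data volume spikes, Receiver comes under very heavy load and needs a lot of resources and very fast storage
  • Unless the main goal is analyzing historical metrics, or Prometheus cannot persist data in some special scenario, Sidecar is the recommended choice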

Starting the deployment

This deployment uses the sidecar mode.

Deployment layout

Alerting is handled by Prometheus's own rules, so Thanos-rule is not deployed here.

k8s cluster A | k8s cluster B
Prometheus:v2.54.1 | Prometheus:v2.54.1
node-exporter:v1.8.2 | node-exporter:v1.8.2
kube-state-metrics:v2.11.0 | kube-state-metrics:v2.11.0
Thanos-sidecar:v0.36.1 | Thanos-sidecar:v0.36.1
Thanos-query:v0.36.1 | Thanos-query:v0.36.1
Thanos-store-gateway:v0.36.1 | Thanos-store-gateway:v0.36.1
Thanos-compact:v0.36.1
Thanos-query-global:v0.36.1
Thanos-query-frontend:v0.36.1
Grafana

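For the MinIO deployment, see my earlier post "Deploying a distributed MinIO cluster on a k8s 1.28.2 cluster"; have the MinIO cluster ready in advance.

Create the namespace

Every k command below stands for kubectl. Only one environment's deployment is shown here; I have two k8s clusters, so Prometheus has to be deployed twice. The namespace name must match the namespace: monitoring used in all the manifests below.

k create ns monitoring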

Deploying node-exporter

---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: node-exporter
  name: node-exporter-svc
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: http
    port: 9100
    protocol: TCP
  selector:
    app.kubernetes.io/name: node-exporter
  type: ClusterIP
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/name: node-exporter
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
    spec:
      containers:
      - args:
        - --path.rootfs=/rootfs
        - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
        image: docker.m.daocloud.io/prom/node-exporter:v1.8.2
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: http
        volumeMounts:
        - mountPath: /rootfs
          name: root
          readOnly: true
      hostIPC: true
      hostNetwork: true
      hostPID: true
      volumes:
      - hostPath:
          path: /
        name: root

Deploying kube-state-metrics

---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics-sa
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - serviceaccounts
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  - ingressclasses
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - list
  - watch
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - clusterrolebindings
  - clusterroles
  - rolebindings
  - roles
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
  - name: telemetry
    port: 8081
    targetPort: telemetry
  selector:
    app.kubernetes.io/name: kube-state-metrics
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
  name: kube-state-metrics
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
    spec:
      automountServiceAccountToken: true
      containers:
      - image: docker.m.daocloud.io/registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.11.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          httpGet:
            path: /livez
            port: http-metrics
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /readyz
            port: telemetry
          initialDelaySeconds: 5
          timeoutSeconds: 5
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics-sa

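Deploying Prometheus + Thanos-Sidecar

Label the node Prometheus is pinned to

k label node 192.168.22.125 prometheus=true

Generate the secrets

MinIO configuration

Because this contains the MinIO access_key and secret_key, avoid reading it in plain text from a ConfigMap; use a Secret instead. Join the base64 output of the command below into a single line (GNU base64 can do this directly with -w 0) and use it to replace the config value in the Secret further down.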

cat <<EOF | base64 -
type: S3
config:
  bucket: "prom-thanos-sidecar"
  endpoint: "minio.api.devops.icu"
  access_key: "gsl2dzAHviNzabSn0ikw"
  secret_key: "82zQ0UMDlOo3LxCQM9TqSygEYrMuxSSRYQdO1KXF"
  insecure: true
EOF
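
Before pasting the value into a Secret, it can help to confirm that joining the lines did not damage it. The following is a minimal sketch of such a round trip; the bucket, endpoint, and keys are placeholders invented for this example, not the values used in this post.

```shell
# Placeholder object-store config; every value here is made up for the example.
objstore_config() {
  cat <<'EOF'
type: S3
config:
  bucket: "example-bucket"
  endpoint: "minio.example.internal:9000"
  access_key: "EXAMPLE_ACCESS_KEY"
  secret_key: "EXAMPLE_SECRET_KEY"
  insecure: true
EOF
}

# Encode and strip newlines so the value fits on one line of a Secret's data field.
encoded=$(objstore_config | base64 | tr -d '\n')

# Decode again to confirm nothing was lost when the lines were joined.
decoded=$(printf '%s' "$encoded" | base64 -d)

if [ "$decoded" = "$(objstore_config)" ]; then
  echo "round trip ok"
else
  echo "round trip failed" >&2
fi
```

Piping through tr -d '\n' gives the same single-line result as -w 0 while also working with base64 implementations that lack that option.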

etcd certificates

My k8s cluster was deployed with kubeadm, so the certificates live in /etc/kubernetes/pki/etcd; I create the Secret directly from those local files.

certs_dir=/etc/kubernetes/pki/etcd; \
k create secret generic etcd-pki -n monitoring \
--from-file=ca=${certs_dir}/ca.crt \
--from-file=cert=${certs_dir}/server.crt \
--from-file=key=${certs_dir}/server.key

Start Prometheus + Thanos-Sidecar

Prometheus stores its data on a local hostPath volume. Thanos has to read that data, so both containers must run as the same user; otherwise permission errors stop Thanos from reading the blocks and from uploading them to MinIO. The error looks like this: ts=2024-10-21T06:09:16.284378709Z caller=sidecar.go:410 level=warn err="upload 01JAP2JAZ0AQT8BEYFY30A4VVD: hard link block: hard link file chunks/000001: link /etc/prometheus/data/01JAP2JAZ0AQT8BEYFY30A4VVD/chunks/000001 /etc/prometheus/data/thanos/upload/01JAP2JAZ0AQT8BEYFY30A4VVD/chunks/000001: operation not permitted" uploaded=0
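
The log line shows the Sidecar hard-linking a block's chunk file into a thanos/upload directory under the TSDB path before uploading it. The sketch below reproduces only that hard-link step in a throwaway temp directory, as a quick way to see whether the current user is allowed to create such links; the directory layout and file names are stand-ins, not real Prometheus blocks.

```shell
# Work in a scratch directory so no real Prometheus data is touched.
data_dir=$(mktemp -d)

# Stand-ins for a TSDB block's chunk file and the upload staging directory.
mkdir -p "$data_dir/BLOCK/chunks" "$data_dir/thanos/upload/BLOCK/chunks"
echo "chunk-data" > "$data_dir/BLOCK/chunks/000001"

# Hard link the chunk the way the log line describes; this is the step that
# reports "operation not permitted" when the user may not link the source file.
if ln "$data_dir/BLOCK/chunks/000001" "$data_dir/thanos/upload/BLOCK/chunks/000001"; then
  link_status="ok"
else
  link_status="failed"
fi
echo "hard link: $link_status"

# Read the file through its second name to confirm both names share the content.
linked_content=$(cat "$data_dir/thanos/upload/BLOCK/chunks/000001" 2>/dev/null)

rm -rf "$data_dir"
```

To get closer to the real case, create the source file as one user and run the ln step as another, which is the mismatch the paragraph above warns about.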

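  • Prometheus flags in brief
    • --storage.tsdb.min-block-duration=2h: minimum time span of a data block; a new block is cut every 2 hours
    • --storage.tsdb.max-block-duration=2h: maximum time span of a data block; setting it equal to the minimum turns off Prometheus's local compaction, which the Thanos Sidecar needs in order to upload blocks
    • --storage.tsdb.retention.time=6h: how long Prometheus keeps local data (the default is 15 days); adjust it to your disk capacity
    • --storage.tsdb.wal-compression: compresses the WAL to reduce its size and the disk space it needs
    • --storage.tsdb.no-lockfile: does not create a lock file in the data directory, so it cannot interfere with Thanos uploading blocks to MinIO
    • --web.enable-lifecycle: enables the lifecycle endpoints, so the configuration can be hot-reloaded through localhost:9090/-/reload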
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-sa
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: prometheus-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: prometheus
  name: prometheus-svc
  namespace: monitoring
spec:
  ports:
  - name: http
    port: 9090
    targetPort: 9090
  - name: grpc
    port: 10901
    targetPort: 10901
  selector:
    app: prometheus
  type: ClusterIP
---
apiVersion: v1
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
      evaluation_interval: 30s
      scrape_timeout: 10s
      external_labels:
        cluster: devops
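        # NOTE: Prometheus does not expand $(POD_NAME) in this file, so the label value is this
        # literal string. To get the real pod name, write ${POD_NAME}, define a POD_NAME env var
        # on the container, and start Prometheus with --enable-feature=expand-external-labels.
        replica: $(POD_NAME)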
    rule_files:
    - /etc/prometheus/rules/*.yml
    scrape_configs:
    - job_name: prometheus
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: prometheus
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:9090
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: kube-apiserver
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: kubelet
      metrics_path: /metrics/cadvisor
      scheme: https
      tls_config:
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [instance]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: etcd
      kubernetes_sd_configs:
      - role: pod
      scheme: https
      tls_config:
        ca_file: /etc/prometheus/etcd-ssl/ca
        cert_file: /etc/prometheus/etcd-ssl/cert
        key_file: /etc/prometheus/etcd-ssl/key
        insecure_skip_verify: false
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_component]
        regex: etcd
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:2379
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: coredns
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_k8s_app]
        regex: kube-dns
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:9153
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: node-exporter
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        action: replace
        target_label: ip
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: kube-state-metrics
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        regex: monitoring;kube-state-metrics
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:8080
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
kind: ConfigMap
metadata:
  name: prometheus-cm
  namespace: monitoring
---
apiVersion: v1
data:
  config: dHlwZTogUzMKY29uZmlnOgogIGJ1Y2tldDogInByb20tdGhhbm9zLXNpZGVjYXIiCiAgZW5kcG9pbnQ6ICJtaW5pby5hcGkuZGV2b3BzLmljdSIKICBhY2Nlc3Nfa2V5OiAiZ3NsMmR6QUh2aU56YWJTbjBpa3ciCiAgc2VjcmV0X2tleTogIjgyelEwVU1EbE9vM0x4Q1FNOVRxU3lnRVlyTXV4U1NSWVFkTzFLWEYiCiAgaW5zZWN1cmU6IHRydWUK
kind: Secret
metadata:
  labels:
    app.kubernetes.io/name: prometheus
  name: thanos-config
  namespace: monitoring
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: prometheus
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: prometheus
                operator: In
                values:
                - "true"
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - prometheus
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - --config.file=/etc/prometheus/config/prometheus.yml
        - --storage.tsdb.path=/etc/prometheus/data
        - --storage.tsdb.min-block-duration=2h
        - --storage.tsdb.max-block-duration=2h
        - --storage.tsdb.retention.time=6h
        - --storage.tsdb.wal-compression
        - --storage.tsdb.no-lockfile
        - --web.enable-lifecycle
        command:
        - /bin/prometheus
        env:
        - name: TZ
          value: Asia/Shanghai
        image: quay.io/prometheus/prometheus:v2.54.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 60
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: http
          timeoutSeconds: 1
        name: prometheus
        ports:
        - containerPort: 9090
          name: http
        readinessProbe:
          failureThreshold: 60
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: http
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 500m
            memory: 1024Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - mountPath: /etc/prometheus/data
          name: prometheus-home
        - mountPath: /etc/prometheus/config
          name: prometheus-config
        - mountPath: /etc/prometheus/etcd-ssl
          name: etcd-ssl
      - args:
        - sidecar
        - --log.level=info
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --tsdb.path=/etc/prometheus/data
        - --prometheus.url=http://localhost:9090
        - --objstore.config-file=/etc/thanos/config/thanos-sidecar.yml
        image: quay.io/thanos/thanos:v0.36.1
        imagePullPolicy: IfNotPresent
        name: thanos-sidecar
        ports:
        - containerPort: 10901
          name: grpc
        volumeMounts:
        - mountPath: /etc/prometheus/data
          name: prometheus-home
        - mountPath: /etc/thanos/config/thanos-sidecar.yml
          name: thanos-config
          readOnly: true
          subPath: config
      imagePullSecrets:
      - name: harbor-secret
      initContainers:
      - command:
        - sh
        - -c
        - '[ -d /etc/prometheus/data/thanos ] || chown -R 65534:65534 /etc/prometheus/data'
        image: quay.io/prometheus/prometheus:v2.54.1
        imagePullPolicy: IfNotPresent
        name: init-dir
        securityContext:
          runAsUser: 0
        volumeMounts:
        - mountPath: /etc/prometheus/data
          name: prometheus-home
      securityContext:
        runAsUser: 65534
      serviceAccount: prometheus-sa
      terminationGracePeriodSeconds: 0
      volumes:
      - hostPath:
          path: /approot/k8s_data/prometheus
          type: DirectoryOrCreate
        name: prometheus-home
      - configMap:
          name: prometheus-cm
        name: prometheus-config
      - name: thanos-config
        secret:
          secretName: thanos-config
      - name: etcd-ssl
        secret:
          secretName: etcd-pki

Deploying Thanos-store-gateway

The Secret content is the same as the one used for the sidecar; remember to replace it with your own.

---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-store-gateway-sa
  namespace: monitoring
---
apiVersion: v1
data:
  config: dHlwZTogUzMKY29uZmlnOgogIGJ1Y2tldDogInByb20tdGhhbm9zLXNpZGVjYXIiCiAgZW5kcG9pbnQ6ICJtaW5pby5hcGkuZGV2b3BzLmljdSIKICBhY2Nlc3Nfa2V5OiAiZ3NsMmR6QUh2aU56YWJTbjBpa3ciCiAgc2VjcmV0X2tleTogIjgyelEwVU1EbE9vM0x4Q1FNOVRxU3lnRVlyTXV4U1NSWVFkTzFLWEYiCiAgaW5zZWN1cmU6IHRydWUK
kind: Secret
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-objstore-config
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-store-gateway-headless
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: grpc
    port: 10901
    targetPort: grpc
  - name: http
    port: 10902
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/name: thanos-store-gateway
  type: ClusterIP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: thanos-store-gateway
  name: thanos-store-gateway
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-store-gateway
  serviceName: thanos-store-gateway-headless
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-store-gateway
    spec:
      containers:
      - args:
        - store
        - --log.level=info
        - --log.format=logfmt
        - --data-dir=/var/thanos/store
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --no-cache-index-header
        - --objstore.config-file=/etc/thanos/objstore.yaml
        env:
        - name: NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: HOST_IP_ADDRESS
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        image: quay.io/thanos/thanos:v0.36.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-store-gateway
        ports:
        - containerPort: 10901
          name: grpc
          protocol: TCP
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
        volumeMounts:
        - mountPath: /etc/thanos/objstore.yaml
          name: objstore-config
          readOnly: true
          subPath: config
        - mountPath: /var/thanos/store
          name: data
          readOnly: false
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-store-gateway-sa
      volumes:
      - name: objstore-config
        secret:
          secretName: thanos-objstore-config
      - emptyDir:
          sizeLimit: 100Mi
        name: data

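Deploying Thanos-compact

Thanos-compact reuses the thanos-objstore-config Secret created in the Thanos-store-gateway step.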

---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-compact
  name: thanos-compact-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-compact
  name: thanos-compact-headless
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: http
    port: 10902
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/name: thanos-compact
  type: ClusterIP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: thanos-compact
  name: thanos-compact
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-compact
  serviceName: thanos-compact-headless
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-compact
    spec:
      containers:
      - args:
        - compact
        - --wait
        - --log.level=info
        - --log.format=logfmt
        - --data-dir=/var/thanos/compact
        - --http-address=0.0.0.0:10902
        - --objstore.config-file=/etc/thanos/objstore.yaml
        - --compact.enable-vertical-compaction
        - --deduplication.replica-label=replica
        - --deduplication.func=penalty
        - --delete-delay=1d
        - --retention.resolution-raw=7d
        - --retention.resolution-5m=15d
        - --retention.resolution-1h=30d
        env:
        - name: NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: HOST_IP_ADDRESS
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        image: quay.io/thanos/thanos:v0.36.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-compact
        ports:
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
        volumeMounts:
        - mountPath: /etc/thanos/objstore.yaml
          name: objstore-config
          readOnly: true
          subPath: config
        - mountPath: /var/thanos/compact
          name: data
          readOnly: false
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-compact-sa
      volumes:
      - name: objstore-config
        secret:
          secretName: thanos-objstore-config
      - emptyDir:
          sizeLimit: 100Mi
        name: data

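Thanos-query deployment

  • The --query.replica-label flag names the label used to deduplicate data; it must match the label configured in Prometheus's external_labels
  • Expose the Thanos-query gRPC port through a dedicated Service of type NodePort, register each cluster's Thanos-query with one global Thanos-query, and serve all queries through Thanos-query-frontend
    • If resources allow, each cluster can also run one more Thanos-query as the global querier, which separates in-cluster query traffic from external query traffic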
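
As a rough sketch of where that label comes from, the fragment below shows a Prometheus global.external_labels block that would pair with --query.replica-label=replica. Only the label name replica is taken from the manifests in this post. The values cluster-a and prometheus-0 are hypothetical placeholders, and the cluster label is an assumption based on the cluster variable used in the dashboards.

```yaml
# prometheus.yml (fragment): a minimal sketch, all values are hypothetical
global:
  external_labels:
    cluster: cluster-a      # hypothetical: identifies the k8s cluster
    replica: prometheus-0   # hypothetical: must differ between HA replicas
```

Two Prometheus replicas scraping the same targets would then differ only in the replica label, which is what lets Thanos-query merge their series into one.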
---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-query
  name: thanos-query-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-query
  name: thanos-query-svc
  namespace: monitoring
spec:
  ports:
  - name: grpc
    port: 10901
    targetPort: grpc
  - name: http
    port: 10902
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/name: thanos-query
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-query
  name: thanos-query-np-svc
  namespace: monitoring
spec:
  ports:
  - name: grpc
    nodePort: 31901
    port: 10901
    targetPort: grpc
  selector:
    app.kubernetes.io/name: thanos-query
  type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: thanos-query
  name: thanos-query
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-query
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-query
    spec:
      containers:
      - args:
        - query
        - --log.level=info
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --query.replica-label=replica
        - --endpoint=dnssrv+_grpc._tcp.thanos-store-gateway-headless.monitoring.svc.cluster.local
        - --endpoint=dnssrv+_grpc._tcp.prometheus-svc.monitoring.svc.cluster.local
        env:
        - name: HOST_IP_ADDRESS
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        image: quay.io/thanos/thanos:v0.36.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-query
        ports:
        - containerPort: 10901
          name: grpc
          protocol: TCP
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-query-sa

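Thanos-query-globle deployment

For --endpoint, I picked two nodes from each of the two clusters and pointed at the NodePort (31901) exposed for the Thanos-query gRPC port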

---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-globle
  name: thanos-query-globle-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-globle
  name: thanos-query-globle-svc
  namespace: monitoring
spec:
  ports:
  - name: grpc
    port: 10901
    targetPort: grpc
  - name: http
    port: 10902
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/name: thanos-query-globle
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-globle
  name: thanos-query-globle
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-query-globle
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-query-globle
    spec:
      containers:
      - args:
        - query
        - --log.level=info
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --query.replica-label=replica
        - --endpoint=192.168.22.112:31901
        - --endpoint=192.168.22.113:31901
        - --endpoint=192.168.22.122:31901
        - --endpoint=192.168.22.123:31901
        env:
        - name: HOST_IP_ADDRESS
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        image: quay.io/thanos/thanos:v0.36.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-query-globle
        ports:
        - containerPort: 10901
          name: grpc
          protocol: TCP
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-query-globle-sa

Thanos-query-frontend deployment

---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-frontend
  name: thanos-query-frontend-sa
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-frontend
  name: thanos-query-frontend-svc
  namespace: monitoring
spec:
  ports:
  - name: http
    port: 10902
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/name: thanos-query-frontend
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: thanos-query-frontend
  name: thanos-query-frontend
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-query-frontend
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-query-frontend
    spec:
      containers:
      - args:
        - query-frontend
        - --log.level=info
        - --log.format=logfmt
        - --http-address=0.0.0.0:10902
        - --query-frontend.downstream-url=http://thanos-query-globle-svc.monitoring.svc.cluster.local:10902
        env:
        - name: HOST_IP_ADDRESS
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        image: quay.io/thanos/thanos:v0.36.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-query-frontend
        ports:
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 0
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65534
          seccompProfile:
            type: RuntimeDefault
      securityContext:
        fsGroup: 65534
        runAsGroup: 65532
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: thanos-query-frontend-sa
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thanos-query-frontend
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
  - host: thanos.devops.icu
    http:
      paths:
      - backend:
          service:
            name: thanos-query-frontend-svc
            port:
              number: 10902
        path: /
        pathType: Prefix

Grafana deployment

The dashboard JSON files are persisted on NFS here, so changing or adding a dashboard only requires uploading the file to the NFS share

---
apiVersion: v1
data:
  grafana.ini: |
    [paths]
    provisioning = /etc/grafana/provisioning
kind: ConfigMap
metadata:
  name: grafana-cm
  namespace: monitoring
---
apiVersion: v1
data:
  prometheus.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://thanos-query-globle-svc.monitoring.svc.cluster.local:10902
kind: ConfigMap
metadata:
  name: grafana-datasource
  namespace: monitoring
---
apiVersion: v1
data:
  dashboards.yaml: |
    apiVersion: 1
    providers:
    - name: 'a unique provider name'
      orgId: 1
      folder: ''
      folderUid: ''
      type: file
      disableDeletion: false
      editable: true
      updateIntervalSeconds: 10
      allowUiUpdates: true
      options:
        # <string, required> path to dashboard files on disk. Required
        path: /etc/grafana/provisioning/dashboards/views
kind: ConfigMap
metadata:
  name: grafana-dashboard
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: grafana
  name: grafana-svc
  namespace: monitoring
spec:
  ports:
  - port: 3000
    protocol: TCP
    targetPort: http-grafana
  selector:
    app.kubernetes.io/name: grafana
  type: ClusterIP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: grafana
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: grafana
  template:
    metadata:
      labels:
        app.kubernetes.io/name: grafana
    spec:
      containers:
      - env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: docker.m.daocloud.io/grafana/grafana:11.3.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 3000
          timeoutSeconds: 1
        name: grafana
        ports:
        - containerPort: 3000
          name: http-grafana
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /robots.txt
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 2
        resources:
          limits:
            cpu: 1000m
            memory: 1024Mi
          requests:
            cpu: 250m
            memory: 750Mi
        volumeMounts:
        - mountPath: /etc/grafana/grafana.ini
          name: grafana-config
          subPath: grafana.ini
        - mountPath: /etc/grafana/provisioning/datasources/prometheus.yaml
          name: grafana-datasource
          subPath: prometheus.yaml
        - mountPath: /etc/grafana/provisioning/dashboards/grafana-dashboard.yaml
          name: grafana-dashboard
          subPath: dashboards.yaml
        - mountPath: /etc/grafana/provisioning/dashboards/views
          name: grafana
          subPathExpr: $(POD_NAME)
      securityContext:
        fsGroup: 472
        supplementalGroups:
        - 0
      volumes:
      - configMap:
          name: grafana-cm
        name: grafana-config
      - configMap:
          name: grafana-datasource
        name: grafana-datasource
      - configMap:
          name: grafana-dashboard
        name: grafana-dashboard
  volumeClaimTemplates:
  - metadata:
      name: grafana
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: nfs-client
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
  - host: grafana.devops.icu
    http:
      paths:
      - backend:
          service:
            name: grafana-svc
            port:
              number: 3000
        path: /
        pathType: Prefix

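Adding Thanos and MinIO monitoring

MinIO requires authentication when Prometheus scrapes its metrics. Configure JWT authentication with the mc command; see the official documentation for mc admin prometheus generate

Alternatively, set MINIO_PROMETHEUS_AUTH_TYPE=public on MinIO (a MinIO restart is required for it to take effect) so that Prometheus can access the metrics API directly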
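
The two options could look roughly like this. The first document is the kind of scrape job that mc admin prometheus generate prints, with the token left as a placeholder. The second sets the environment variable on the MinIO container. The target address is assumed from the minio-svc Service in the storage namespace and port 9000 used in this post's scrape job, and the job name minio-jwt is hypothetical.

```yaml
# Option 1: Prometheus scrape job using JWT (bearer token) authentication.
- job_name: minio-jwt                    # hypothetical job name
  bearer_token: "<token>"                # placeholder: token printed by mc
  metrics_path: /minio/v2/metrics/cluster
  scheme: http
  static_configs:
  - targets: ['minio-svc.storage.svc.cluster.local:9000']  # assumed address
---
# Option 2: fragment of the MinIO container spec that disables metrics auth.
env:
- name: MINIO_PROMETHEUS_AUTH_TYPE
  value: public
```

With a kubernetes_sd_configs based job, the same bearer_token line can be added to that job instead of switching to static_configs.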

    - job_name: minio
      metrics_path: /minio/v2/metrics/cluster
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        regex: storage;minio-svc
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:9000
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: thanos-query
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        regex: monitoring;thanos-query-svc
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:10902
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: thanos-store-gateway
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        regex: monitoring;thanos-store-gateway-headless
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:10902
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    - job_name: thanos-compact
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        regex: monitoring;thanos-compact-headless
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:10902
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

Grafana dashboard

Here are the dashboard IDs I use. Because I run two k8s clusters, a cluster variable has to be added to each dashboard, and most of them need some further tuning

coredns

14981

etcd

Uses the official template: grafana.json

Thanos

12937

node-exporter

12633 or 21902

16098

Finally

The YAML manifests and dashboard JSON files are available on gitee: https://gitee.com/chen2ha/yaml_for_kubernetes/tree/master/thanos

posted @ 2024-10-29 14:25  月巴左耳东