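Prometheus + Grafana (13): System Monitoring for Cassandra
Preface
This post monitors Cassandra through jmx_exporter.
Configuring the Java agent
Every node in the Cassandra cluster needs the following configuration.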
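- Upload
Download jmx_prometheus_javaagent-0.12.0.jar and upload it to the $CASSANDRA_HOME/lib/ directory on the Cassandra cluster.
Download link: https://github.com/prometheus/jmx_exporter/blob/master/README.md
- Configuration
1. Add the configuration file cassandra-jmx.yml to the conf/ directory of the Cassandra cluster: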
lowercaseOutputName: true
lowercaseOutputLabelNames: true
whitelistObjectNames: [
  "org.apache.cassandra.metrics:type=ColumnFamily,name=RangeLatency,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=LiveSSTableCount,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=SSTablesPerReadHistogram,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=SpeculativeRetries,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=MemtableOnHeapSize,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=MemtableSwitchCount,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=MemtableLiveDataSize,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=MemtableColumnsCount,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=MemtableOffHeapSize,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=BloomFilterFalsePositives,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=BloomFilterFalseRatio,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=BloomFilterDiskSpaceUsed,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=BloomFilterOffHeapMemoryUsed,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=SnapshotsSize,*",
  "org.apache.cassandra.metrics:type=ColumnFamily,name=TotalDiskSpaceUsed,*",
  "org.apache.cassandra.metrics:type=CQL,name=RegularStatementsExecuted,*",
  "org.apache.cassandra.metrics:type=CQL,name=PreparedStatementsExecuted,*",
  "org.apache.cassandra.metrics:type=Compaction,name=PendingTasks,*",
  "org.apache.cassandra.metrics:type=Compaction,name=CompletedTasks,*",
  "org.apache.cassandra.metrics:type=Compaction,name=BytesCompacted,*",
  "org.apache.cassandra.metrics:type=Compaction,name=TotalCompactionsCompleted,*",
  "org.apache.cassandra.metrics:type=ClientRequest,name=Latency,*",
  "org.apache.cassandra.metrics:type=ClientRequest,name=Unavailables,*",
  "org.apache.cassandra.metrics:type=ClientRequest,name=Timeouts,*",
  "org.apache.cassandra.metrics:type=Storage,name=Exceptions,*",
  "org.apache.cassandra.metrics:type=Storage,name=TotalHints,*",
  "org.apache.cassandra.metrics:type=Storage,name=TotalHintsInProgress,*",
  "org.apache.cassandra.metrics:type=Storage,name=Load,*",
  "org.apache.cassandra.metrics:type=Connection,name=TotalTimeouts,*",
  "org.apache.cassandra.metrics:type=ThreadPools,name=CompletedTasks,*",
  "org.apache.cassandra.metrics:type=ThreadPools,name=PendingTasks,*",
  "org.apache.cassandra.metrics:type=ThreadPools,name=ActiveTasks,*",
  "org.apache.cassandra.metrics:type=ThreadPools,name=TotalBlockedTasks,*",
  "org.apache.cassandra.metrics:type=ThreadPools,name=CurrentlyBlockedTasks,*",
  "org.apache.cassandra.metrics:type=DroppedMessage,name=Dropped,*",
  "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=HitRate,*",
  "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Hits,*",
  "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Requests,*",
  "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Entries,*",
  "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Size,*",
  #"org.apache.cassandra.metrics:type=Streaming,name=TotalIncomingBytes,*",
  #"org.apache.cassandra.metrics:type=Streaming,name=TotalOutgoingBytes,*",
  "org.apache.cassandra.metrics:type=Client,name=connectedNativeClients,*",
  "org.apache.cassandra.metrics:type=Client,name=connectedThriftClients,*",
  "org.apache.cassandra.metrics:type=Table,name=WriteLatency,*",
  "org.apache.cassandra.metrics:type=Table,name=ReadLatency,*",
  "org.apache.cassandra.net:type=FailureDetector,*",
]
#blacklistObjectNames: ["org.apache.cassandra.metrics:type=ColumnFamily,*"]
rules:
  - pattern: org.apache.cassandra.metrics<type=(Connection|Streaming), scope=(\S*), name=(\S*)><>(Count|Value)
    name: cassandra_$1_$3
    labels:
      address: "$2"
  - pattern: org.apache.cassandra.metrics<type=(ColumnFamily), name=(RangeLatency)><>(Mean)
    name: cassandra_$1_$2_$3
  - pattern: org.apache.cassandra.net<type=(FailureDetector)><>(DownEndpointCount)
    name: cassandra_$1_$2
  - pattern: org.apache.cassandra.metrics<type=(Keyspace), keyspace=(\S*), name=(\S*)><>(Count|Mean|95thPercentile)
    name: cassandra_$1_$3_$4
    labels:
      "$1": "$2"
  - pattern: org.apache.cassandra.metrics<type=(Table), keyspace=(\S*), scope=(\S*), name=(\S*)><>(Count|Mean|95thPercentile)
    name: cassandra_$1_$4_$5
    labels:
      "keyspace": "$2"
      "table": "$3"
  - pattern: org.apache.cassandra.metrics<type=(ClientRequest), scope=(\S*), name=(\S*)><>(Count|Mean|95thPercentile)
    name: cassandra_$1_$3_$4
    labels:
      "type": "$2"
  - pattern: org.apache.cassandra.metrics<type=(\S*)(?:, ((?!scope)\S*)=(\S*))?(?:, scope=(\S*))?, name=(\S*)><>(Count|Value)
    name: cassandra_$1_$5
    labels:
      "$1": "$4"
      "$2": "$3"
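To make the rules section of cassandra-jmx.yml easier to read, here is its ClientRequest rule once more, annotated with how the capture groups turn into a metric name and a label. The sample MBean (scope=Read) and the resulting series name are illustrative assumptions, not output captured from a real cluster.

```yaml
rules:
  # jmx_exporter matches each pattern against a string roughly of the form
  #   domain<key1=value1, key2=value2><>attribute
  # Illustrative input (assumed for this example):
  #   org.apache.cassandra.metrics<type=ClientRequest, scope=Read, name=Latency><>Count
  - pattern: org.apache.cassandra.metrics<type=(ClientRequest), scope=(\S*), name=(\S*)><>(Count|Mean|95thPercentile)
    # $1 = ClientRequest, $3 = Latency, $4 = Count
    name: cassandra_$1_$3_$4
    labels:
      # $2 = Read
      "type": "$2"
# Because lowercaseOutputName is true, the sample input should be exposed as
#   cassandra_clientrequest_latency_count{type="Read"}
```

Rules are tried in order, so the specific patterns come first and the generic catch-all pattern stays last.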
2. Edit the Cassandra configuration file conf/cassandra-env.sh and add the javaagent:
JVM_OPTS="$JVM_OPTS -javaagent:$CASSANDRA_HOME/lib/jamm-0.3.0.jar -javaagent:$CASSANDRA_HOME/lib/jmx_prometheus_javaagent-0.12.0.jar=7070:${CASSANDRA_HOME}/conf/cassandra-jmx.yml"
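Note: 7070 is the port from which Prometheus scrapes the metrics.
- Start
Restart the Cassandra service. Once it is up, open http://10.x.xx.100:7070/metrics/ (replace the IP and port with those of your environment) to view the metrics it exposes.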
Prometheus configuration
- Configuration
Edit prometheus.yml of the Prometheus installation and add the Cassandra monitoring job:
vi /usr/local/prometheus-2.15.1/prometheus.yml
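The snippet added to prometheus.yml is not shown in the text, so the following is only a minimal sketch of what such a scrape job could look like. The job name cassandra and the scrape interval are assumptions; the target reuses the placeholder address and the agent port 7070 from the earlier steps, with one entry needed per Cassandra node.

```yaml
scrape_configs:
  # Existing jobs (for example the default "prometheus" job) stay as they are.
  - job_name: 'cassandra'        # assumed job name
    scrape_interval: 15s         # assumed; omit to inherit the global interval
    metrics_path: /metrics       # Prometheus default, written out for clarity
    static_configs:
      - targets:
          - '10.x.xx.100:7070'   # placeholder address from this article; list every node
```

If prometheus.yml already contains a scrape_configs key, only the job entry is appended under it rather than adding a second key.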
- Start and verify
Kill the running Prometheus process, restart it as follows, then check the Targets page:
cd /usr/local/prometheus-2.15.1
nohup ./prometheus --config.file=prometheus.yml &
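Note: State=UP means the target is being scraped successfully.
Grafana configuration
- Import the dashboard template
Import the dashboard https://grafana.com/dashboards/5408, then adapt it to your own workload to arrive at the final dashboard.
Note that the JSON of this Cassandra metrics dashboard (https://grafana.com/grafana/dashboards/5408) contains some errors that have to be fixed by hand.
- Alert metrics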
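No. | Alert name | Alert rule | Description
1 | Memory alert | Alert when memory usage exceeds the threshold (> 80%) |
2 | GC time alert | Alert when GC time exceeds the threshold (> 0.3 s) |
3 | GC count alert | Alert when the number of GCs per second exceeds the threshold (> 5) |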
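The table gives thresholds but not the queries behind them. If these alerts were written as Prometheus alerting rules, a sketch could look like the following. It assumes the alerts refer to the Cassandra JVM and relies on the jvm_memory_bytes_* and jvm_gc_collection_seconds_* metrics that the jmx_exporter Java agent exposes by default. The group name, alert names, job label, rate windows, and "for" durations are all hypothetical, and "GC time" is interpreted here as average seconds per collection.

```yaml
groups:
  - name: cassandra-jvm                # hypothetical group name
    rules:
      - alert: CassandraHeapUsageHigh  # hypothetical alert name
        # Heap used divided by heap max, above 80%
        expr: |
          jvm_memory_bytes_used{job="cassandra", area="heap"}
            / jvm_memory_bytes_max{job="cassandra", area="heap"} > 0.8
        for: 5m
      - alert: CassandraGcTimeHigh
        # Average seconds per collection over the last 5 minutes, above 0.3 s
        expr: |
          rate(jvm_gc_collection_seconds_sum{job="cassandra"}[5m])
            / rate(jvm_gc_collection_seconds_count{job="cassandra"}[5m]) > 0.3
        for: 5m
      - alert: CassandraGcCountHigh
        # Collections per second over the last 5 minutes, above 5
        expr: rate(jvm_gc_collection_seconds_count{job="cassandra"}[5m]) > 5
        for: 5m
```

A file like this would be listed under rule_files in prometheus.yml. The same expressions can also serve as the queries behind Grafana alerts if the alerting is done there instead.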