Hadoop Notes: Summary of Hadoop Monitoring Metrics

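System parameter monitoring metrics

load_one    System load average over the last minute

load_fifteen    System load average over the last 15 minutes

load_five    System load average over the last 5 minutes

boottime    System boot time, accurate to the second

bytes_in    Network receive rate, in bytes/sec

bytes_out    Network send rate, in bytes/sec

cpu_aidle    Percentage of CPU time spent idle since boot

cpu_idle    Percentage of CPU time spent idle

cpu_nice    Percentage of CPU time used by user-space processes running at an altered (nice) priority

cpu_num    Total number of CPUs (logical processors)

cpu_report    Summary report of CPU usage

cpu_speed    CPU speed (MHz)

cpu_system    Percentage of CPU time spent in kernel space

cpu_user    Percentage of CPU time spent in user space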

cpu_wio    Percentage of time the CPU was idle while the system had an outstanding disk I/O request

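proc_total    Total number of processes

swap_free    Free swap space

swap_total    Total swap space (in KB)

disk_free    Remaining disk space

disk_total    Total disk size

ip_address    List of IP addresses

last_reported    Time of the last report

load_report    Summary report of system load

location    Location information (latitude and longitude)

machine_type    Machine architecture (x86 or 64-bit)

mem_buffers    Amount of memory used for kernel buffers

mem_cached    Amount of cached memory

mem_free    Amount of free memory

mem_report    Summary report of memory usage

mem_shared    Amount of shared memory

mem_total    Total physical memory (in KB)

os_name    Operating system name

os_release    Operating system release

pkts_in    Packets received per second

pkts_out    Packets sent per second

proc_run    Total number of running processes

packet_report    Summary report of packets

network_report    Summary report of network traffic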
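
The raw system metrics above are usually combined into derived values before alerting. The following is a minimal sketch, not part of any monitoring tool: the function names are made up for illustration, all memory values are assumed to be in KB (the list states the unit only for mem_total), and treating buffers and cache as reclaimable memory is a common convention rather than something the list specifies.

```python
def memory_used_percent(mem_total, mem_free, mem_buffers, mem_cached):
    """Percent of physical memory in use, not counting buffers and cache.

    All arguments are assumed to be in KB.
    """
    if mem_total <= 0:
        raise ValueError("mem_total must be positive")
    used = mem_total - mem_free - mem_buffers - mem_cached
    return 100.0 * used / mem_total


def load_per_cpu(load_one, cpu_num):
    """One-minute load average normalized by the number of CPUs (cpu_num)."""
    if cpu_num <= 0:
        raise ValueError("cpu_num must be positive")
    return load_one / cpu_num
```

A normalized load persistently above 1.0 suggests more runnable work than CPUs, which is often easier to compare across hosts than the raw load_one value.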

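NameNode monitoring metrics

dfs.namenode.SafeModeTime    Time spent in safe mode

dfs.namenode.AddBlockOps    Number of block-add (write) operations

dfs.namenode.BlockReportAvgTime    Average time to process a block report

dfs.namenode.BlockReportNumOps    Number of block reports processed

dfs.namenode.CreateFileOps    Number of file creation operations

dfs.namenode.DeleteFileOps    Number of file deletion operations

dfs.namenode.FileInfoOps    Number of file-info lookup operations

dfs.namenode.FilesCreated    Number of files created

dfs.namenode.FilesDeleted    Number of files deleted

dfs.namenode.FilesInGetListingOps    Number of files and directories returned by directory listing operations

dfs.namenode.FilesRenamed    Number of files renamed

dfs.namenode.FsImageLoadTime    Time taken to load the fsimage

dfs.namenode.GetAdditionalDatanodeOps    Number of GetAdditionalDatanode operations

dfs.namenode.GetBlockLocations    Number of get-block-locations operations

dfs.namenode.GetListingOps    Number of getListing (directory listing) operations

dfs.namenode.SyncsAvgTime    Average time to sync operations to the edit log

dfs.namenode.SyncsNumOps    Number of syncs to the edit log

dfs.namenode.TransactionsAvgTime    Average transaction time

dfs.namenode.TransactionsBatchedInSync    Number of transactions found at flush time to have already been synced (batched into an earlier sync)

dfs.namenode.TransactionsNumOps    Number of transactions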
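
Many metrics in this document come in *AvgTime / *NumOps pairs (for example SyncsAvgTime and SyncsNumOps). A rough sketch of how several reporting intervals can be merged into one average follows. It assumes each AvgTime value is the mean over exactly the NumOps operations of the same interval; combined_average is a hypothetical helper, not a Hadoop API.

```python
def combined_average(samples):
    """Merge (avg_time, num_ops) pairs into one operation-weighted average.

    samples: iterable of (avg_time, num_ops) tuples, one per interval.
    Returns None when no operations were recorded at all.
    """
    total_ops = 0
    total_time = 0.0
    for avg_time, num_ops in samples:
        total_ops += num_ops
        total_time += avg_time * num_ops
    if total_ops == 0:
        return None
    return total_time / total_ops
```

Weighting by the operation count matters: simply averaging the AvgTime values would give a quiet interval the same influence as a busy one.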

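DataNode parameter monitoring metrics

dfs.datanode.BlockReportsAvgTime    Average time to send a block report to the NameNode

dfs.datanode.BlockReportsNumOps    Number of block reports sent to the NameNode

dfs.datanode.BlocksRead    Number of blocks read from disk

dfs.datanode.BlocksRemoved    Number of blocks removed

dfs.datanode.BlocksReplicated    Number of blocks replicated

dfs.datanode.BlocksVerified    Number of blocks verified

dfs.datanode.BlocksWritten    Number of blocks written

dfs.datanode.BytesRead    Total bytes read

dfs.datanode.BytesWritten    Total bytes written

dfs.datanode.CopyBlockOpAvgTime    Average time of block copy operations

dfs.datanode.CopyBlockOpNumOps    Number of block copy operations

dfs.datanode.HeartbeatsAvgTime    Average time of heartbeats sent to the NameNode

dfs.datanode.HeartbeatsNumOps    Number of heartbeats sent to the NameNode

dfs.datanode.ReadBlockOpAvgTime    Average time of block read operations

dfs.datanode.ReadBlockOpNumOps    Number of block read operations

dfs.datanode.ReadsFromLocalClient    Number of reads from local clients

dfs.datanode.ReadsFromRemoteClient    Number of reads from remote clients

dfs.datanode.WriteBlockOpAvgTime    Average time of block write operations

dfs.datanode.WriteBlockOpNumOps    Number of block write operations

dfs.datanode.WritesFromLocalClient    Number of writes from local clients

dfs.datanode.WritesFromRemoteClient    Number of writes from remote clients

dfs.datanode.PacketAckRoundTripTimeNanosAvgTime    Average round-trip time of packet acknowledgements (nanoseconds)

dfs.datanode.PacketAckRoundTripTimeNanosNumOps    Number of packet acknowledgements

dfs.datanode.FlushNanosAvgTime    Average file system flush time (nanoseconds)

dfs.datanode.FlushNanosNumOps    Number of file system flushes

dfs.datanode.ReplaceBlockOpAvgTime    Average time of block replace operations

dfs.datanode.ReplaceBlockOpNumOps    Number of block replace operations

dfs.datanode.SendDataPacketBlockedOnNetworkNanosAvgTime    Average time that sending a data packet was blocked on the network (nanoseconds)

dfs.datanode.SendDataPacketBlockedOnNetworkNanosNumOps    Number of data packet sends measured for network blocking

dfs.datanode.SendDataPacketTransferNanosAvgTime    Average transfer time of data packets sent over the network (nanoseconds)

dfs.datanode.SendDataPacketTransferNanosNumOps    Number of data packets sent over the network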
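
Totals such as BytesRead and BytesWritten are easier to read as throughput. The sketch below assumes these values arrive as monotonically increasing counters sampled at a fixed interval; whether that holds depends on the metrics sink in use, so treat it as an assumption. rate_per_second is a hypothetical helper name.

```python
def rate_per_second(previous, current, interval_seconds):
    """Per-second rate from two samples of a cumulative counter.

    Returns None if the counter went backwards, which is taken here to
    mean the DataNode restarted and the counter was reset.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if current < previous:
        return None
    return (current - previous) / interval_seconds
```

For example, two BytesRead samples of 1000 and 4000 taken 60 seconds apart correspond to 50 bytes/sec.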

HDFS file system metrics

dfs.FSNamesystem.BlockCapacity    Total block capacity

dfs.FSNamesystem.BlocksTotal    Current number of allocated blocks

dfs.FSNamesystem.CapacityRemainingGB    Remaining capacity of the HDFS file system (GB)

dfs.FSNamesystem.CapacityTotalGB    Total capacity of the HDFS file system (GB)

dfs.FSNamesystem.CapacityUsedGB    Used capacity of the HDFS file system (GB)

dfs.FSNamesystem.CorruptBlocks    Number of corrupt blocks

dfs.FSNamesystem.ExcessBlocks    Number of excess blocks

dfs.FSNamesystem.ExpiredHeartbeats    Number of expired heartbeats

dfs.FSNamesystem.FilesTotal    Total number of files

dfs.FSNamesystem.LastCheckpointTime    Time of the most recent checkpoint

dfs.FSNamesystem.LastWrittenTransactionId    ID of the most recently written transaction

dfs.FSNamesystem.MillisSinceLastLoadedEdits    Time since edits were last loaded (milliseconds)

dfs.FSNamesystem.MissingBlocks    Number of missing blocks

dfs.FSNamesystem.TotalFiles    Total number of files (duplicates FilesTotal)

dfs.FSNamesystem.UnderReplicatedBlocks    Number of blocks with fewer replicas than required

dfs.FSNamesystem.PendingDataNodeMessageCount    Number of DataNode messages queued on the standby NameNode

dfs.FSNamesystem.PendingDeletionBlocks    Number of blocks pending deletion

dfs.FSNamesystem.PendingReplicationBlocks    Number of blocks waiting to be replicated

dfs.FSNamesystem.PostponedMisreplicatedBlocks    Number of mis-replicated blocks whose processing has been postponed

dfs.FSNamesystem.ScheduledReplicationBlocks    Number of blocks scheduled for replication

dfs.FSNamesystem.TotalLoad    Number of xceivers (current connections) seen by the NameNode

dfs.FSNamesystem.TransactionsSinceLastCheckpoint    Number of new transactions since the last checkpoint

dfs.FSNamesystem.TransactionsSinceLastLogRoll    Number of new transactions since the last edit log roll
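
The capacity and block counters above lend themselves to a simple health summary. The sketch below is illustrative only: the function names and the three severity labels are assumptions made here, not thresholds defined by HDFS.

```python
def hdfs_used_percent(capacity_used_gb, capacity_total_gb):
    """Used share of HDFS capacity, from CapacityUsedGB and CapacityTotalGB."""
    if capacity_total_gb <= 0:
        raise ValueError("capacity_total_gb must be positive")
    return 100.0 * capacity_used_gb / capacity_total_gb


def block_health(missing_blocks, corrupt_blocks, under_replicated_blocks):
    """Classify block state; the severity policy is a made-up example."""
    if missing_blocks > 0 or corrupt_blocks > 0:
        return "critical"  # data may already be unreadable
    if under_replicated_blocks > 0:
        return "warning"   # data is readable but less redundant than desired
    return "ok"
```

Missing and corrupt blocks are ranked above under-replicated ones here because under-replication is normally repaired by re-replication, while a missing block has no healthy replica left to copy from.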

hbase.master metrics

hbase.master.cluster_requests    Total number of requests across the cluster

hbase.master.splitSize_avg_time    Average size of the logs split

hbase.master.splitSize_num_ops    Number of log splits

hbase.master.splitTime_avg_time    Average time taken to split logs

hbase.master.splitTime_num_ops    Number of log splits

HBase parameter monitoring metrics

hbase.regionserver.blockCacheCount    Number of blocks held in the RegionServer's block cache.

hbase.regionserver.blockCacheEvictedCount    Number of blocks evicted from the block cache.

hbase.regionserver.blockCacheFree    Free memory remaining in the block cache.

hbase.regionserver.blockCacheHitCachingRatio    Block cache hit ratio counted only over reads that have block caching enabled

hbase.regionserver.blockCacheHitCount    Number of block cache hits

hbase.regionserver.blockCacheHitRatio    Block cache hit ratio

hbase.regionserver.blockCacheMissCount    Number of block cache misses

hbase.regionserver.blockCacheSize    Size of the block cache

hbase.regionserver.compactionQueueSize    Size of the compaction queue

hbase.regionserver.compactionSize_avg_time    Average data size per compaction

hbase.regionserver.compactionSize_num_ops    Number of compactions performed

hbase.regionserver.compactionTime_avg_time    Average time per compaction

hbase.regionserver.compactionTime_num_ops    Number of compactions performed

hbase.regionserver.deleteRequestLatency_75th_percentile    75th percentile of delete request latency

hbase.regionserver.deleteRequestLatency_95th_percentile    95th percentile of delete request latency

hbase.regionserver.deleteRequestLatency_99th_percentile    99th percentile of delete request latency

hbase.regionserver.deleteRequestLatency_max    Maximum delete request latency

hbase.regionserver.deleteRequestLatency_mean    Mean delete request latency

hbase.regionserver.deleteRequestLatency_median    Median delete request latency

hbase.regionserver.deleteRequestLatency_min    Minimum delete request latency

hbase.regionserver.deleteRequestLatency_num_ops    Number of delete requests

hbase.regionserver.deleteRequestLatency_std_dev    Standard deviation of delete request latency

hbase.regionserver.flushQueueSize    Size of the flush queue

hbase.regionserver.flushSize_avg_time    Average data size per flush

hbase.regionserver.flushSize_num_ops    Number of flushes performed

hbase.regionserver.flushTime_avg_time    Average time per flush

hbase.regionserver.flushTime_num_ops    Number of flushes performed

hbase.regionserver.fsReadLatencyHistogram_75th_percentile    75th percentile of HLog read latency

hbase.regionserver.fsReadLatencyHistogram_95th_percentile    95th percentile of HLog read latency

hbase.regionserver.fsReadLatencyHistogram_99th_percentile    99th percentile of HLog read latency

hbase.regionserver.fsReadLatencyHistogram_max    Maximum HLog read latency

hbase.regionserver.fsReadLatencyHistogram_mean    Mean HLog read latency

hbase.regionserver.fsReadLatencyHistogram_median    Median HLog read latency

hbase.regionserver.fsReadLatencyHistogram_min    Minimum HLog read latency

hbase.regionserver.fsReadLatencyHistogram_num_ops    Number of HLog reads

hbase.regionserver.fsReadLatencyHistogram_std_dev    Standard deviation of HLog read latency

hbase.regionserver.fsReadLatency_avg_time    Average HLog read latency

hbase.regionserver.fsReadLatency_num_ops    Number of HLog reads

hbase.regionserver.fsSyncLatency_avg_time    Average HLog sync latency

hbase.regionserver.fsSyncLatency_num_ops    Number of HLog syncs

hbase.regionserver.fsWriteLatencyHistogram_75th_percentile    75th percentile of HLog write latency

hbase.regionserver.fsWriteLatencyHistogram_95th_percentile    95th percentile of HLog write latency

hbase.regionserver.fsWriteLatencyHistogram_99th_percentile    99th percentile of HLog write latency

hbase.regionserver.fsWriteLatencyHistogram_max    Maximum HLog write latency

hbase.regionserver.fsWriteLatencyHistogram_mean    Mean HLog write latency

hbase.regionserver.fsWriteLatencyHistogram_median    Median HLog write latency

hbase.regionserver.fsWriteLatencyHistogram_min    Minimum HLog write latency

hbase.regionserver.fsWriteLatencyHistogram_num_ops    Number of HLog writes

hbase.regionserver.fsWriteLatencyHistogram_std_dev    Standard deviation of HLog write latency

hbase.regionserver.fsWriteLatency_avg_time    Average latency of HLog write operations

hbase.regionserver.fsWriteLatency_num_ops    Number of HLog write operations

hbase.regionserver.getRequestLatency_75th_percentile    75th percentile of get request latency

hbase.regionserver.getRequestLatency_95th_percentile    95th percentile of get request latency

hbase.regionserver.getRequestLatency_99th_percentile    99th percentile of get request latency

hbase.regionserver.getRequestLatency_max    Maximum get request latency

hbase.regionserver.getRequestLatency_mean    Mean get request latency

hbase.regionserver.getRequestLatency_median    Median get request latency

hbase.regionserver.getRequestLatency_min    Minimum get request latency

hbase.regionserver.getRequestLatency_num_ops    Number of get requests

hbase.regionserver.getRequestLatency_std_dev    Standard deviation of get request latency

hbase.regionserver.hdfsBlocksLocalityIndex    Percentage of HDFS block data that is local to the RegionServer's host

hbase.regionserver.hlogFileCount    Number of HLog files

hbase.regionserver.mbInMemoryWithoutWAL    MemStore space occupied by data from Put operations that skip the WAL on the RegionServer

hbase.regionserver.memstoreSizeMB    Total size of the MemStores of all HRegions on the RegionServer

hbase.regionserver.numPutsWithoutWAL    Number of Put operations on the RegionServer that skip the WAL (write-ahead log)

hbase.regionserver.putRequestLatency_75th_percentile    75th percentile of put request latency

hbase.regionserver.putRequestLatency_95th_percentile    95th percentile of put request latency

hbase.regionserver.putRequestLatency_99th_percentile    99th percentile of put request latency

hbase.regionserver.putRequestLatency_max    Maximum put request latency

hbase.regionserver.putRequestLatency_mean    Mean put request latency

hbase.regionserver.putRequestLatency_median    Median put request latency

hbase.regionserver.putRequestLatency_min    Minimum put request latency

hbase.regionserver.putRequestLatency_num_ops    Number of put requests

hbase.regionserver.putRequestLatency_std_dev    Standard deviation of put request latency

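hbase.regionserver.readRequestsCount    Number of read requests. This is not equivalent to the number of rows a client reads, and in most cases it is far smaller, because a call such as next(1000) counts as a single request.

hbase.regionserver.regionSplitFailureCount    Number of failed region splits

hbase.regionserver.regionSplitSuccessCount    Number of successful region splits

hbase.regionserver.regions    Number of regions

hbase.regionserver.requests    Number of requests

hbase.regionserver.rootIndexSizeKB    Size of the store file index (in KB); the same quantity as storefileIndexSizeMB

hbase.regionserver.storefileIndexSizeMB    Size of the store file index (in MB)

hbase.regionserver.storefiles    Total number of store files on the RegionServer

hbase.regionserver.stores    Number of Stores on the RegionServer

hbase.regionserver.totalStaticBloomSizeKB    Total size of the Bloom filters across all Stores.

hbase.regionserver.totalStaticIndexSizeKB    Total index size of the HFiles on the HRegionServer, that is, the sum of all uncompressed HFile block index data without any other information.

hbase.regionserver.writeRequestsCount    Number of write requests. This is not exactly the number of client write operations: a batched write counts as one request, so in most cases it is far smaller than the number of client writes, especially when batched writes are frequent.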
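
The relationship between blockCacheHitCount, blockCacheMissCount and blockCacheHitRatio can be sketched as below. This is only an illustration of the arithmetic: block_cache_hit_ratio is a hypothetical helper, and it returns a fraction between 0 and 1, whereas the unit HBase uses when reporting the ratio is not stated in this list.

```python
def block_cache_hit_ratio(hit_count, miss_count):
    """Fraction of block cache lookups that were hits.

    Returns None when there were no lookups, to avoid dividing by zero.
    """
    total = hit_count + miss_count
    if total == 0:
        return None
    return hit_count / total
```

Computing the ratio from the difference between two samples of each counter, rather than from lifetime totals, gives a view of recent behavior instead of an average since the RegionServer started.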

map/reduce parameter monitoring metrics

mapred.ShuffleMetrics.ShuffleConnections    Number of shuffle connections

mapred.ShuffleMetrics.ShuffleOutputBytes    Size of shuffle output data

mapred.ShuffleMetrics.ShuffleOutputsFailed    Number of failed shuffle outputs

mapred.ShuffleMetrics.ShuffleOutputsOK    Number of successful shuffle outputs

YARN (map/reduce v2) parameter monitoring metrics

yarn.NodeManagerMetrics.AllocatedContainers    Number of containers currently allocated

yarn.NodeManagerMetrics.AllocatedGB    Memory currently allocated to containers (GB)

yarn.NodeManagerMetrics.AvailableGB    Memory currently free (GB)

yarn.NodeManagerMetrics.ContainersCompleted    Number of containers in the completed state

yarn.NodeManagerMetrics.ContainersIniting    Number of containers in the initializing state

yarn.NodeManagerMetrics.ContainersKilled    Number of containers in the killed state

yarn.NodeManagerMetrics.ContainersLaunched    Number of containers launched

yarn.NodeManagerMetrics.ContainersRunning    Number of containers in the running state

YARN cluster metrics

yarn.ClusterMetrics.NumActiveNMs    Number of active NodeManagers

yarn.ClusterMetrics.NumLostNMs    Number of lost NodeManagers

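YARN job queue metrics

yarn.QueueMetrics.ActiveApplications    Number of active applications

yarn.QueueMetrics.ActiveUsers    Number of active users

yarn.QueueMetrics.AggregateContainersAllocated    Total number of containers allocated

yarn.QueueMetrics.AggregateContainersReleased    Total number of containers released

yarn.QueueMetrics.AllocatedContainers    Number of containers currently allocated

yarn.QueueMetrics.AllocatedMB    Memory currently allocated (MB)

yarn.QueueMetrics.AppsCompleted    Number of completed applications

yarn.QueueMetrics.AppsPending    Number of pending applications

yarn.QueueMetrics.AppsRunning    Number of running applications

yarn.QueueMetrics.AppsSubmitted    Number of submitted applications

yarn.QueueMetrics.AvailableMB    Available memory (MB)

yarn.QueueMetrics.PendingContainers    Number of pending containers

yarn.QueueMetrics.PendingMB    Pending memory requests (MB)

yarn.QueueMetrics.running_0    Number of applications that have been running for 0-60 minutes

yarn.QueueMetrics.running_1440    Number of applications that have been running for more than 1440 minutes

yarn.QueueMetrics.running_300    Number of applications that have been running for 300-1440 minutes

yarn.QueueMetrics.running_60    Number of applications that have been running for 60-300 minutes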
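
The four running_* metrics partition running applications by elapsed time. A small sketch of that bucketing follows. The list does not say which bucket an application exactly on a boundary (60, 300 or 1440 minutes) falls into, so the lower-inclusive ranges used here are an assumption, and running_bucket is a hypothetical helper.

```python
def running_bucket(elapsed_minutes):
    """Name of the running_* metric that covers the given elapsed time.

    Boundaries are treated as lower-inclusive (an assumption).
    """
    if elapsed_minutes < 0:
        raise ValueError("elapsed_minutes must not be negative")
    if elapsed_minutes < 60:
        return "running_0"
    if elapsed_minutes < 300:
        return "running_60"
    if elapsed_minutes < 1440:
        return "running_300"
    return "running_1440"
```

The suffix of each metric name is the lower bound of its range in minutes, which is why running_1440 covers everything beyond one day.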

Hadoop RPC parameter monitoring metrics

rpc.metrics.NumOpenConnections    Number of open RPC connections

rpc.metrics.ReceivedBytes    Number of bytes received over RPC

rpc.metrics.RpcProcessingTime_avg_time    Average processing time of RPC operations in the last interval

rpc.metrics.RpcProcessingTime_num_ops    Number of RPC operations processed in the last interval

rpc.metrics.RpcQueueTime_avg_time    Average time RPC calls spent waiting in the queue in the last interval

rpc.metrics.RpcQueueTime_num_ops    Number of RPC operations that passed through the RPC queue

rpc.metrics.SentBytes    Number of bytes sent over RPC

rpc.metrics.callQueueLen    Length of the RPC call queue

rpc.metrics.rpcAuthenticationFailures    Number of failed RPC authentications

rpc.metrics.rpcAuthenticationSuccesses    Number of successful RPC authentications

rpc.metrics.rpcAuthorizationFailures    Number of failed RPC authorizations

rpc.metrics.rpcAuthorizationSuccesses    Number of successful RPC authorizations

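rpc.detailed-metrics.canCommit_avg_time    Average time of RPC calls asking whether a task may commit

rpc.detailed-metrics.canCommit_num_ops    Number of RPC calls asking whether a task may commit

rpc.detailed-metrics.commitPending_avg_time    Average time of RPC calls reporting that a task has finished but its commit is still pending

rpc.detailed-metrics.commitPending_num_ops    Number of RPC calls reporting that a task has finished but its commit is still pending

rpc.detailed-metrics.done_avg_time    Average time of RPC calls reporting that a task completed successfully

rpc.detailed-metrics.done_num_ops    Number of RPC calls reporting that a task completed successfully

rpc.detailed-metrics.fatalError_avg_time    Average time of RPC calls reporting that a task hit a fatal error

rpc.detailed-metrics.fatalError_num_ops    Number of RPC calls reporting that a task hit a fatal error

rpc.detailed-metrics.getBlockInfo_avg_time    Average time to get a block from the specified DataNode

rpc.detailed-metrics.getBlockInfo_num_ops    Number of times a block was fetched from the specified DataNode

rpc.detailed-metrics.getMapCompletionEvents_avg_time    Average time for a reduce task to fetch the output-location events of completed map tasks

rpc.detailed-metrics.getMapCompletionEvents_num_ops    Number of times a reduce task fetched the output-location events of completed map tasks

rpc.detailed-metrics.getProtocolVersion_avg_time    Average time to fetch the RPC protocol version

rpc.detailed-metrics.getProtocolVersion_num_ops    Number of times the RPC protocol version was fetched

rpc.detailed-metrics.getTask_avg_time    Average time for a child process to fetch its JvmTask after starting

rpc.detailed-metrics.getTask_num_ops    Number of times a child process fetched its JvmTask after starting

rpc.detailed-metrics.ping_avg_time    Average time of the periodic checks by which a child process verifies that its parent is still alive

rpc.detailed-metrics.ping_num_ops    Number of periodic checks by which a child process verifies that its parent is still alive

rpc.detailed-metrics.recoverBlock_avg_time    Average time to start generation-stamp recovery for the specified block

rpc.detailed-metrics.recoverBlock_num_ops    Number of times generation-stamp recovery was started for the specified block

rpc.detailed-metrics.reportDiagnosticInfo_avg_time    Average time to report task error messages to the parent process; these calls should be kept to a minimum, since the messages are stored in the JobTracker

rpc.detailed-metrics.reportDiagnosticInfo_num_ops    Number of task error messages reported to the parent process

rpc.detailed-metrics.startBlockRecovery_avg_time    Average time to start block recovery

rpc.detailed-metrics.startBlockRecovery_num_ops    Number of times block recovery was started

rpc.detailed-metrics.statusUpdate_avg_time    Average time to report child process progress to the parent process

rpc.detailed-metrics.statusUpdate_num_ops    Number of times child process progress was reported to the parent process

rpc.detailed-metrics.updateBlock_avg_time    Average time to update a block to a new generation stamp and length

rpc.detailed-metrics.updateBlock_num_ops    Number of times a block was updated to a new generation stamp and length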
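
Two derived values are often read off the rpc.metrics counters: the average time a call spends in the server (queue wait plus processing) and the share of failed authentications. The helpers below are a hedged sketch with invented names; adding the two averages assumes both were measured over the same set of calls in the same interval.

```python
def rpc_average_latency(queue_avg_time, processing_avg_time):
    """Approximate average server-side time per call: queue wait plus processing."""
    return queue_avg_time + processing_avg_time


def failure_ratio(failures, successes):
    """Share of failed attempts, e.g. from rpcAuthenticationFailures and
    rpcAuthenticationSuccesses. Returns None when there were no attempts."""
    total = failures + successes
    if total == 0:
        return None
    return failures / total
```

A growing RpcQueueTime_avg_time with a flat RpcProcessingTime_avg_time points at queueing (too few handlers or a burst of callers) rather than slow handlers, which is why the two are worth tracking separately as well as summed.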

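JVM parameter monitoring metrics

jvm.JvmMetrics.GcCount    Number of garbage collections performed by the JVM

jvm.JvmMetrics.GcTimeMillis    Time spent in GC, in milliseconds

jvm.JvmMetrics.LogError    Number of ERROR entries written to the log

jvm.JvmMetrics.LogFatal    Number of FATAL entries written to the log

jvm.JvmMetrics.LogInfo    Number of INFO entries written to the log

jvm.JvmMetrics.LogWarn    Number of WARN entries written to the log

jvm.JvmMetrics.MemHeapCommittedM    Heap size committed by the JVM (MB)

jvm.JvmMetrics.MemHeapUsedM    Heap size in use by the JVM (MB)

jvm.JvmMetrics.MemNonHeapCommittedM    Non-heap memory committed by the JVM (MB)

jvm.JvmMetrics.MemNonHeapUsedM    Non-heap memory in use by the JVM (MB)

jvm.JvmMetrics.ThreadsBlocked    Number of threads in the BLOCKED state

jvm.JvmMetrics.ThreadsNew    Number of threads in the NEW state

jvm.JvmMetrics.ThreadsRunnable    Number of threads in the RUNNABLE state

jvm.JvmMetrics.ThreadsTerminated    Number of threads in the TERMINATED state

jvm.JvmMetrics.ThreadsTimedWaiting    Number of threads in the TIMED_WAITING state

jvm.JvmMetrics.ThreadsWaiting    Number of threads in the WAITING state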
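
As a rough illustration of how the JVM metrics combine, the sketch below derives heap utilization and the average time per collection. The function names are hypothetical, and the second helper assumes GcCount and GcTimeMillis cover the same period.

```python
def heap_used_percent(mem_heap_used_m, mem_heap_committed_m):
    """Used share of the committed heap, from MemHeapUsedM and MemHeapCommittedM."""
    if mem_heap_committed_m <= 0:
        raise ValueError("mem_heap_committed_m must be positive")
    return 100.0 * mem_heap_used_m / mem_heap_committed_m


def average_gc_pause_ms(gc_time_millis, gc_count):
    """Average milliseconds per collection; None if no GC has run yet."""
    if gc_count == 0:
        return None
    return gc_time_millis / gc_count
```

Note that the committed heap can grow up to the configured maximum, so a high percentage against MemHeapCommittedM does not by itself mean the process is about to run out of heap.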

posted @ 2019-04-04 10:12  信方