Using the Ozone Insight Tool
Preface
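A distributed system is considerably more complex at runtime than a typical enterprise system: it involves many service calls and intricate parallel processing logic. Diagnosing problems in a distributed system is therefore no simple task. If we have some way to see what is going on inside it, for example through key metrics, that helps a great deal. Metrics are the information that most existing systems expose to the outside, but they are not always convenient to query. If a system offered a direct command that let users fetch these indicators, its usability would improve greatly. Ozone has a dedicated implementation for exactly this: the insight command-line tool, built to improve its observability. In this article I briefly walk through this insight tool.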
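Ozone's Insight Perspective
Before introducing the Ozone insight command, let's first look at what "Insight" actually means inside Ozone.
To improve its observability, Ozone implements endpoints for its key internal service modules (not only at the process level, but also at the internal thread level and the protocol level) and uses them to expose useful information to the outside. This information includes: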
- (Real-time) logs of key services
- Metrics of key services
- Configuration of key services
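To make the idea concrete, here is a minimal sketch of how an insight point could be modeled as a named bundle of loggers, metric groups, and configuration keys. The `InsightPoint` class, the `REGISTRY` dictionary, and the `register` and `describe` helpers are hypothetical names of my own and are not Ozone's actual classes; the point names, logger class, and configuration keys are taken from the command output shown later in this article.

```python
from dataclasses import dataclass, field


@dataclass
class InsightPoint:
    """Hypothetical model of one insight point (not Ozone's real class)."""
    name: str
    description: str
    loggers: list = field(default_factory=list)        # logger class names to follow
    metric_groups: list = field(default_factory=list)  # metric group titles to show
    config_keys: list = field(default_factory=list)    # configuration keys to report


# Hypothetical registry keyed by insight point name.
REGISTRY = {}


def register(point):
    """Add a point to the registry and return it."""
    REGISTRY[point.name] = point
    return point


register(InsightPoint(
    name="scm.node-manager",
    description="SCM Datanode management related information.",
    loggers=["org.apache.hadoop.hdds.scm.node.SCMNodeManager"],
))
register(InsightPoint(
    name="scm.replica-manager",
    description="SCM closed container replication manager",
    config_keys=[
        "hdds.scm.replication.thread.interval",
        "hdds.scm.replication.event.timeout",
    ],
))


def describe(name):
    """Return the three kinds of information attached to one point."""
    point = REGISTRY[name]
    return {
        "log": point.loggers,
        "metrics": point.metric_groups,
        "config": point.config_keys,
    }
```

The design choice worth noting is that the point name, not the process, is the lookup key, which is what lets one tool serve logs, metrics, and configuration for the same service boundary.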
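I described the implementation principles in an earlier article, "How to Improve the Observability of Distributed Systems: Introducing the Insight Tool". Interested readers can find the details there, so I won't repeat them here.
Some might say that there is nothing special about these three kinds of information, and that an ordinary system can provide them too. That is true, but Ozone turns these queries into tool commands that users can run directly, which is a fairly innovative touch. Let's look at how the insight commands are used, and we will see how convenient they are.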
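Using the Ozone insight commands
First, we can ask the insight command for its help text to list all available subcommands. As the first line of the output below shows, -help itself is not a recognized option (the supported forms are -h and --help), but the usage text is printed anyway: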
[hdfs@lyq bin]$ ./ozone insight -help
Unknown option: -elp (while processing option: '-help')
Usage: ozone insight [-hV] [--verbose] [-conf=<configurationPath>]
[-D=<String=String>]... [COMMAND]
Show debug information about a selected Ozone component
--verbose More verbose output. Show the stack trace of the errors.
-conf=<configurationPath>
-D, --set=<String=String>
-h, --help Show this help message and exit.
-V, --version Print version information and exit.
Commands:
list Show available insight points.
log, logs Show log4j events related to the insight point
metrics, metric Show available metrics.
config Show configuration for a specific subcomponents
Before using the commands, we need to know which insight points are currently available. An insight point is a key service point, such as a key thread service or a key protocol operation.
[hdfs@lyq bin]$ ./ozone insight list
Available insight points:
scm.node-manager SCM Datanode management related information.
scm.replica-manager SCM closed container replication manager
scm.event-queue Information about the internal async event delivery
scm.protocol.block-location SCM Block location protocol endpoint
om.key-manager OM Key Manager
om.protocol.client Ozone Manager RPC endpoint
As we can see, the insight points above are already very fine-grained.
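If you want to feed this list into a script, a small parser is enough. The sketch below assumes the plain-text layout shown above (one point per line, name first, then the description); `parse_insight_list` is a hypothetical helper of mine, not part of Ozone.

```python
SAMPLE_LIST = """\
Available insight points:
scm.node-manager SCM Datanode management related information.
scm.replica-manager SCM closed container replication manager
scm.event-queue Information about the internal async event delivery
scm.protocol.block-location SCM Block location protocol endpoint
om.key-manager OM Key Manager
om.protocol.client Ozone Manager RPC endpoint
"""


def parse_insight_list(text):
    """Map each insight point name to its description."""
    points = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("Available insight points"):
            continue
        # The name never contains whitespace, so split once on the first gap.
        parts = line.split(None, 1)
        points[parts[0]] = parts[1] if len(parts) > 1 else ""
    return points


def by_service(points):
    """Group point names by their prefix (the part before the first dot)."""
    grouped = {}
    for name in points:
        grouped.setdefault(name.split(".", 1)[0], []).append(name)
    return grouped
```

Grouping by prefix makes the granularity visible at a glance: in this sample, four points belong to `scm` and two to `om`.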
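Next, let's go through the three subcommands in turn. The first is the log command, which captures, in real time, the output of the logger classes associated with the target insight point. Here is the log output for the scm.node-manager point: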
[hdfs@lyq apache]$ ozone/bin/ozone insight log scm.node-manager
[SCM] 2019-12-13 21:04:46,966 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=lyq-xxx.com]
[SCM] 2019-12-13 21:05:14,998 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=lyq-xxx.com]
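These log lines have a regular shape, so they are easy to post-process. The sketch below parses the two lines above and computes the time between them. The regular expression is inferred only from this sample, and the field name `source` for the third bracketed value is my own label, since the output does not say what that field means.

```python
import re
from datetime import datetime

# Pattern inferred from the sample output above; other layouts are not handled.
LOG_RE = re.compile(
    r"^\[(?P<component>[^\]]+)\]\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"\[(?P<level>[^|]+)\|(?P<logger>[^|]+)\|(?P<source>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)

SAMPLE_LOG = [
    "[SCM] 2019-12-13 21:04:46,966 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=lyq-xxx.com]",
    "[SCM] 2019-12-13 21:05:14,998 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=lyq-xxx.com]",
]


def parse_log_line(line):
    """Return the fields of one insight log line, or None if it does not match."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%Y-%m-%d %H:%M:%S,%f"
    )
    return fields


def seconds_between(first, second):
    """Elapsed seconds between two parsed log records."""
    return (second["timestamp"] - first["timestamp"]).total_seconds()
```

Applied to the two sample lines, `seconds_between` gives the gap between two consecutive node reports from the same datanode.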
Next is metrics retrieval. The metrics here are essentially the same as those we normally get through JMX on the web page, but here they are regrouped by insight point.
[hdfs@lyq apache]$ ozone/bin/ozone insight metric om.protocol.client
Metrics for `om.protocol.client` (Ozone Manager RPC endpoint)
RPC connections
Open connections: 0
Dropped connections: 0
Received bytes: 2037
Sent bytes: 1760
RPC queue
RPC average queue time: 0.5
RPC call queue length: 0
RPC performance
RPC processing time average: 8.0
Number of slow calls: 0
Message type counters
Number of CreateVolume: 1
Number of SetVolumeProperty: 0
Number of CheckVolumeAccess: 0
Number of InfoVolume: 2
Number of DeleteVolume: 0
Number of ListVolume: 0
Number of CreateBucket: 0
Number of InfoBucket: 0
Number of SetBucketProperty: 0
Number of DeleteBucket: 0
Number of ListBuckets: 0
Number of CreateKey: 0
Number of LookupKey: 0
Number of RenameKey: 0
Number of DeleteKey: 0
Number of ListKeys: 0
Number of CommitKey: 0
Number of AllocateBlock: 0
Number of CreateS3Bucket: 0
Number of DeleteS3Bucket: 0
Number of InfoS3Bucket: 0
Number of ListS3Buckets: 0
Number of InitiateMultiPartUpload: 0
Number of CommitMultiPartUpload: 0
Number of CompleteMultiPartUpload: 0
Number of AbortMultiPartUpload: 0
Number of GetS3Secret: 0
Number of ListMultiPartUploadParts: 0
Number of ServiceList: 4
Number of DBUpdates: 0
Number of GetDelegationToken: 0
Number of RenewDelegationToken: 0
Number of CancelDelegationToken: 0
Number of GetFileStatus: 0
Number of CreateDirectory: 0
Number of CreateFile: 0
Number of LookupFile: 0
Number of ListStatus: 0
Number of AddAcl: 0
Number of RemoveAcl: 0
Number of SetAcl: 0
Number of GetAcl: 1
Number of PurgeKeys: 0
Number of ListMultipartUploads: 0
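Most of the counters above are zero, so in practice it helps to filter the output. The sketch below turns the text into a nested dictionary and keeps only the non-zero counters. It assumes the layout shown above (a group title on its own line, followed by `name: value` lines); `parse_metrics` and `nonzero` are hypothetical helpers, and the sample is an abridged copy of the output.

```python
# Abridged copy of the output above.
SAMPLE_METRICS = """\
Metrics for `om.protocol.client` (Ozone Manager RPC endpoint)
RPC connections
Open connections: 0
Dropped connections: 0
Received bytes: 2037
Sent bytes: 1760
Message type counters
Number of CreateVolume: 1
Number of SetVolumeProperty: 0
Number of InfoVolume: 2
Number of ServiceList: 4
Number of GetAcl: 1
"""


def parse_metrics(text):
    """Return {group title: {metric name: value}} from the metrics text."""
    groups = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("Metrics for"):
            continue
        if ":" not in line:
            # A line without a colon starts a new metric group.
            current = groups.setdefault(line, {})
            continue
        if current is None:
            current = groups.setdefault("ungrouped", {})
        name, _, raw = line.partition(":")
        try:
            value = float(raw)
        except ValueError:
            value = raw.strip()  # keep non-numeric values as text
        current[name.strip()] = value
    return groups


def nonzero(metrics):
    """Keep only numeric metrics whose value is not zero."""
    return {k: v for k, v in metrics.items()
            if isinstance(v, float) and v != 0}
```

For the sample, `nonzero(parse_metrics(SAMPLE_METRICS)["Message type counters"])` leaves only the CreateVolume, InfoVolume, ServiceList, and GetAcl counters.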
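The last command is config, which retrieves configuration values. What it returns are the values currently loaded and used by the running system, not the values in the local configuration file. The values the system is actually using are what we really want to know.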
[hdfs@lyq bin]$ ./ozone insight config scm.replica-manager
Configuration for `scm.replica-manager` (SCM closed container replication manager)
hdds.scm.replication.thread.interval
default: 300s
current: 300s
When a heartbeat from the data node arrives on SCM, It is queued for processing with the time stamp of when the heartbeat arrived. There is a heartbeat processing thread inside SCM that runs at a specified interval. This value controls how frequently this thread is run.
There are some assumptions build into SCM such as this value should allow the heartbeat processing thread to run at least three times more frequently than heartbeats and at least five times more than stale node detection time. If you specify a wrong value, SCM will gracefully refuse to run. For more info look at the node manager tests in SCM.
In short, you don't need to change this.
hdds.scm.replication.event.timeout
default: 10m
current: 10m
Timeout for the container replication/deletion commands sent to datanodes. After this timeout the command will be retried.
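The output of the config command above shows the configuration related to the insight point, which is very user-friendly: it includes not only the current value but also the default value, along with a description of each setting.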
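Because every entry carries both a default and a current value, it is straightforward to check which settings have been overridden on a running cluster. The sketch below assumes the layout shown above (a key line, then `default:` and `current:` lines, then free-text description); `parse_config` and `overridden` are hypothetical helpers, and the descriptions in the sample are shortened.

```python
import re

# A configuration key is assumed to be a dotted lowercase name on its own line.
KEY_RE = re.compile(r"^[a-z0-9]+(?:\.[a-z0-9_-]+)+$")

# Copy of the output above with the long descriptions shortened.
SAMPLE_CONFIG = """\
Configuration for `scm.replica-manager` (SCM closed container replication manager)
hdds.scm.replication.thread.interval
default: 300s
current: 300s
In short, you don't need to change this.
hdds.scm.replication.event.timeout
default: 10m
current: 10m
Timeout for the container replication/deletion commands sent to datanodes.
"""


def parse_config(text):
    """Return {key: {"default": ..., "current": ..., "description": [...]}}."""
    entries = {}
    key = None
    for line in text.splitlines():
        line = line.strip()
        if KEY_RE.match(line):
            key = line
            entries[key] = {"default": None, "current": None, "description": []}
        elif key is None or not line:
            continue  # header line or blank line
        elif line.startswith("default:"):
            entries[key]["default"] = line[len("default:"):].strip()
        elif line.startswith("current:"):
            entries[key]["current"] = line[len("current:"):].strip()
        else:
            entries[key]["description"].append(line)
    return entries


def overridden(entries):
    """Return {key: (default, current)} for keys whose values differ."""
    return {k: (v["default"], v["current"])
            for k, v in entries.items()
            if v["default"] != v["current"]}
```

For the output shown in this article, `overridden(parse_config(SAMPLE_CONFIG))` is empty, because both settings are still at their defaults.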
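After trying out this tool, I have to say that the insight tool Ozone has implemented is very usable. Its core idea is to define insight points on key services and then expose their information to the outside.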
References
[1].https://blog.csdn.net/Androidlushangderen/article/details/100824677
[2].https://issues.apache.org/jira/browse/HDDS-1935 . Improve the visibility with Ozone Insight tool