prometheus-net.DotNetRuntime 获取 CLR 指标原理解析

`prometheus-net.DotNetRuntime` 介绍

Intro

前面集成 Prometheus 的文章中简单提到过，prometheus-net.DotNetRuntime 可以获取到一些 CLR 的数据，比如说 GC, ThreadPool, Contention, JIT 等指标，而这些指标可以很大程度上帮助我们解决很多问题，比如应用执行过程中是否经常发生 GC，GC 等待时间时间是否过长，是否有发生死锁或竞争锁时间过长，是否有发生线程池饿死等等一系列问题，有了这些指标我们就可以清晰的在运行时了解到这些信息。

来看一下官方介绍

A plugin for the prometheus-net package, exposing .NET core runtime metrics including:

Garbage collection collection frequencies and timings by generation/ type, pause timings and GC CPU consumption ratio

Heap size by generation

Bytes allocated by small/ large object heap

JIT compilations and JIT CPU consumption ratio

Thread pool size, scheduling delays and reasons for growing/ shrinking

Lock contention

Exceptions thrown, broken down by type

These metrics are essential for understanding the peformance of any non-trivial application. Even if your application is well instrumented, you're only getting half the story- what the runtime is doing completes the picture.

支持的指标

Contention Events

只要运行时使用的 System.Threading.Monitor 锁或 Native锁出现争用情况，就会引发争用事件。

一个线程等待的锁被另一线程占有时将发生争用。

Name	Description	Type
dotnet_contention_seconds_total	发生锁争用的耗时(秒)总计	Counter
dotnet_contention_total	锁争用获得锁的数量总计	Counter

Thread Pool Events

Worker thread 线程池和 IO thread 线程池信息

Name	Description	Type
dotnet_threadpool_num_threads	线程池中活跃的线程数量	Gauge
dotnet_threadpool_io_num_threads	IO 线程池中活跃线程数量(WindowsOnly)	Gauge
dotnet_threadpool_adjustments_total	线程池中线程调整总计	Counter

Garbage Collection Events

Captures information pertaining to garbage collection, to help in diagnostics and debugging.

Name	Description	Type
dotnet_gc_collection_seconds	执行 GC 回收过程耗费的时间（秒）	Histogram
dotnet_gc_pause_seconds	GC 回收造成的 Pause 耗费的时间（秒）	Histogram
dotnet_gc_collection_reasons_total	触发 GC 垃圾回收的原因统计	Counter
dotnet_gc_cpu_ratio	运行垃圾收集所花费的进程CPU时间的百分比	Gauge
dotnet_gc_pause_ratio	进程暂停进行垃圾收集所花费的时间百分比	Gauge
dotnet_gc_heap_size_bytes	当前各个 GC 堆的大小 (发生垃圾回收之后才会更新)	Gauge
dotnet_gc_allocated_bytes_total	大小对象堆上已分配的字节总数（每100 KB分配更新）	Counter
dotnet_gc_pinned_objects	pinned 对象的数量	Gauge
dotnet_gc_finalization_queue_length	等待 finalize 的对象数	Gauge

JIT Events

Name	Description	Type
dotnet_jit_method_total	JIT编译器编译的方法总数	Counter
dotnet_jit_method_seconds_total	JIT编译器中花费的总时间（秒）	Counter
dotnet_jit_cpu_ratio	JIT 花费的 CPU 时间	Gauge

集成方式

上面的列出来的指标是我觉得比较重要的指标，还有一些 ThreadPool Scheduling 的指标和 CLR Exception 的指标我觉得意义不是特别大，有需要的可以去源码里看一看

集成的方式有两种，一种是作者提供了一个默认的 Collector 会去收集所有支持的 CLR 指标信息，另外一种则是可以自己自定义的要收集的 CLR 指标类型，来看示例：

使用默认的 Collector 收集 CLR 指标

DotNetRuntimeStatsBuilder.Default().StartCollecting();

使用自定义的 Collector 收集 CLR 指标

DotNetRuntimeStatsBuilder.Customize()
    .WithContentionStats() // Contention event
    .WithGcStats() // GC 指标
    .WithThreadPoolStats() // ThreadPool 指标
    // .WithCustomCollector(null) // 你可以自己实现一个自定义的 Collector
    .StartCollecting();

上面提到过默认的 Collector 会收集支持的所有的 CLR 指标，且看源码怎么做的

构建了一个 Builder 通过建造者模式来构建复杂配置的收集器，类似于 .net core 里的 HostBuilder/LoggingBuilder ...，像极了 Host.CreateDefaultBuilder，做了一些变形

源码地址：https://github.com/djluck/prometheus-net.DotNetRuntime/blob/master/src/prometheus-net.DotNetRuntime/DotNetRuntimeStatsBuilder.cs

实现原理

那它是如何工作的呢，如何实现捕获 CLR 的指标的呢，下面我们就来解密一下，

在项目 README 里已经有了简单的介绍，是基于 CLR 的 ETW Events 来实现的，具体的 CLR 支持的 ETW Events 可以参考文档：https://docs.microsoft.com/en-us/dotnet/framework/performance/clr-etw-events

而 ETW Events 是通过 EventSource 的方式使得我们可以在进程外获取到进程的一些运行信息，这也是我们可以通过 PerfMonitor/PerfView 等方式进程外获取进程 CLR 信息的重要实现方式，同样的微软的新的诊断工具 dotnet diagnostic tools 的实现方式 EventPipe 也是基于 EventSOurce 的

而 EventSource 的事件不仅仅可以通过进程外的这些工具来消费，我们也可以在应用程序中实现 EventListener 来实现进程内的 EventSource 事件消费，而这就是 prometheus-net.DotNetRuntime 这个库的实现本质方法

可以参考源码：https://github.com/djluck/prometheus-net.DotNetRuntime/blob/master/src/prometheus-net.DotNetRuntime/DotNetEventListener.cs

具体的事件处理是在对应的 Collector 中：

https://github.com/djluck/prometheus-net.DotNetRuntime/tree/master/src/prometheus-net.DotNetRuntime/StatsCollectors

Metrics Samples

为了比较直观的看到这些指标可以带来的效果，分享一下我的应用中用到的一些 dashboard 截图

Lock Contention

GC

从上面的图可以清晰的看到这个时间点发生了一次垃圾回收，此时 GC Heap 的大小和 GC 垃圾回收的CPU 占用率和耗时都可以大概看的出来，对于我们运行时诊断应用程序问题会很有帮助

Thread

Thread 的信息还可以拿到一些 threadpool 线程调度的数量以及延迟，这里没有展示出来，

目前我主要关注的是线程池中线程的数量和线程池线程调整的原因，线程池线程调整的原因中有一个是 starvation，这个指标尤其需要关注一下，应避免出现 threadpool starvation 的情况，出现这个的原因通常是因为有一些不当的用法，如： Task.Wait、Task.Result、await Task.Run() 来把一个同步方法变成异步等不好的用法导致的

DiagnosticSource

除了 EventSource 之外，还有一个 DiagnosticSource 可以帮助我们诊断应用程序的性能问题，目前微软也是推荐类库中使用 DiagnosticSource 的方式来让应用诊断类库中的一些性能问题，这也是目前大多数 APM 实现的机制，Skywalking、ElasticAPM、OpenTelemetry 等都使用了 DiagnosticSource 的方式来实现应用程序的性能诊断

如果是进程外应用程序的性能诊断推荐首选 EventSource，如果是进程内推荐首选 DiagnosticSource

通常我们都应该使用 DiagnosticSource，即使想进程外捕获，也是可以做到的

关于这二者的使用，可以看一下这个 Comment https://github.com/dotnet/aspnetcore/issues/2312#issuecomment-359514074

除了上面列出来的那些指标还有一些指标，比如 exception，threadpool scheduling，还有当前 dotnet 的环境（系统版本，GC 类型，Runtime 版本，程序 TargetFramework，CPU 数量等），有兴趣的可以用一下试一下

exception 指标使用下来感觉帮助不大，有一些即使是已经处理的或者忽略的 Exception 也会被统计，这些 Exception 大多并不会影响应用程序的运行，如果参考这个的话可能会带来很多的困扰，所以我觉得还是需要应用程序来统计 exception 指标更为合适一些

prometheus-net.DotNetRuntime 作为 prometheus-net 的一个插件，依赖于 prometheus-net 去写 metrics 信息，也就是说 metrics 的信息可以通过 prometheus-net 来获取

集成 asp.net core 的时候和之前集成 prometheus-net 是一样的，metrics path 是同一个，可以参考我这个项目: https://github.com/OpenReservation/ReservationServer/tree/dev/OpenReservation

注意：作者推荐 .netcore3.0 以上使用，.netcore 2.x 会有一些 BUG，可以在 Issue 里看到

Reference

posted @ 2020-12-05 11:32 WeihanLi 阅读(1506) 评论(2) 编辑收藏举报

刷新页面返回顶部

Loading

Love it or leave it

prometheus-net.DotNetRuntime 获取 CLR 指标原理解析

`prometheus-net.DotNetRuntime` 介绍

Intro

支持的指标

Contention Events

Thread Pool Events

Garbage Collection Events

JIT Events

集成方式

实现原理

Metrics Samples

Lock Contention

GC

Thread

DiagnosticSource

More

Reference

公告

Loading

Love it or leave it

prometheus-net.DotNetRuntime 获取 CLR 指标原理解析

prometheus-net.DotNetRuntime 介绍

Intro

支持的指标

Contention Events

Thread Pool Events

Garbage Collection Events

JIT Events

集成方式

实现原理

Metrics Samples

Lock Contention

GC

Thread

DiagnosticSource

More

Reference

公告

`prometheus-net.DotNetRuntime` 介绍