数据库连接池配置不当导致的full gc问题记录

背景

线上有个流量比较大的服务,qps日常在2000-3000之间,请求方对耗时要求比较高,响应时间要求在300ms以内,服务上线以后,运行也比较平稳。某天夜里,突然有几千个dubbo请求超时,找了几个请求查看调用链路发现超时的地方各不相同,服务依赖的接口方均表示服务没有异常。
这种情况下,一般就是服务自身的问题了,仔细看了下dubbo的超时,发现超时的都是同一个provider,于是看了下这个出问题的provider,看了下这台机器的gc情况,发现超时时有一次full gc,gc耗时达到了1s以上。

full gc时间过长问题排查

gc日志查看

发现full gc时间过长以后,就去对应的机器上把gc log下载了下来,对gc日志进行分析。

36402.792: [GC (Allocation Failure) 36402.792: [ParNew: 2105440K->7584K(2359296K), 0.0124564 secs] 4517143K->2419327K(4980736K), 0.0129547 secs] [Times: user=0.13 sys=0.00, real=0.01 secs] 
36402.810: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2411743K(2621440K)] 2419601K(4980736K), 0.0037437 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
36402.814: [CMS-concurrent-mark-start]
36403.622: [CMS-concurrent-mark: 0.800/0.808 secs] [Times: user=5.73 sys=0.26, real=0.81 secs] 
36403.623: [CMS-concurrent-preclean-start]
36403.801: [CMS-concurrent-preclean: 0.177/0.178 secs] [Times: user=0.69 sys=0.05, real=0.18 secs] 
36403.801: [CMS-concurrent-abortable-preclean-start]
36403.998: [GC (Allocation Failure) 36403.998: [ParNew: 2104736K->7273K(2359296K), 0.0122722 secs] 4516479K->2419078K(4980736K), 0.0127518 secs] [Times: user=0.13 sys=0.01, real=0.02 secs] 
36404.672: [CMS-concurrent-abortable-preclean: 0.826/0.870 secs] [Times: user=3.39 sys=0.26, real=0.87 secs] 
36404.677: [GC (CMS Final Remark) [YG occupancy: 1097431 K (2359296 K)]36404.677: [GC (CMS Final Remark) 36404.678: [ParNew: 1097431K->7946K(2359296K), 0.0146413 secs] 3509235K->2419797K(4980736K), 0.0151127 secs] [Times: user=0.17 sys=0.00, real=0.02 secs] 
36404.693: [Rescan (parallel) , 0.0044661 secs]36404.697: [weak refs processing, 1.6796757 secs]36406.377: [class unloading, 0.1835045 secs]36406.560: [scrub symbol table, 0.0291043 secs]36406.590: [scrub string table, 0.0029833 secs][1 CMS-remark: 2411851K(2621440K)] 2419797K(4980736K), 2.1253433 secs] [Times: user=2.33 sys=0.00, real=2.13 secs] 
36406.803: [CMS-concurrent-sweep-start]
36409.458: [GC (Allocation Failure) 36409.458: [ParNew: 2105098K->53837K(2359296K), 0.0675226 secs] 4328721K->2277488K(4980736K), 0.0681548 secs] [Times: user=0.86 sys=0.00, real=0.07 secs] 
36410.035: [GC (Allocation Failure) 36410.035: [ParNew: 2150989K->63917K(2359296K), 0.0566114 secs] 4128382K->2041367K(4980736K), 0.0572665 secs] [Times: user=0.71 sys=0.00, real=0.06 secs] 
36410.966: [GC (Allocation Failure) 36410.966: [ParNew: 2161069K->51123K(2359296K), 0.0144844 secs] 3640286K->1530366K(4980736K), 0.0149783 secs] [Times: user=0.16 sys=0.00, real=0.02 secs] 
36411.552: [CMS-concurrent-sweep: 2.287/4.749 secs] [Times: user=23.03 sys=1.41, real=4.75 secs] 
36411.552: [CMS-concurrent-reset-start]
36411.558: [CMS-concurrent-reset: 0.005/0.005 secs] [Times: user=0.03 sys=0.00, real=0.00 secs] 

gc日志里面各项代表的意义可以看这篇文章:https://www.jianshu.com/p/a5b3905bf225
我们的老年代收集算法用的是cms算法,现在主要看一下里面会stw的两个阶段耗时,Initial Mark阶段耗时real是0.01秒,耗时很低。然后就是重新标记Final Remark阶段,耗时达到了2.13s,这里面看具体的耗时,weak refs processing, 1.6796757 secs这个占用了比较久的时间,那么这个弱引用处理,究竟在处理什么呢,就要看具体dump出来的堆了。

weak refs processing处理逻辑

weak refs processing的源码如下:

void ReferenceProcessor::process_discovered_references(
  BoolObjectClosure*           is_alive,
  OopClosure*                  keep_alive,
  VoidClosure*                 complete_gc,
  AbstractRefProcTaskExecutor* task_executor) {
  NOT_PRODUCT(verify_ok_to_handle_reflists());
 
  assert(!enqueuing_is_done(), "If here enqueuing should not be complete");
  // Stop treating discovered references specially.
  disable_discovery();
 
  bool trace_time = PrintGCDetails && PrintReferenceGC;
  // Soft references
  {
    TraceTime tt("SoftReference", trace_time, false, gclog_or_tty);
    process_discovered_reflist(_discoveredSoftRefs, _current_soft_ref_policy, true,
                               is_alive, keep_alive, complete_gc, task_executor);
  }
 
  update_soft_ref_master_clock();
 
  // Weak references
  {
    TraceTime tt("WeakReference", trace_time, false, gclog_or_tty);
    process_discovered_reflist(_discoveredWeakRefs, NULL, true,
                               is_alive, keep_alive, complete_gc, task_executor);
  }
 
  // Final references
  {
    TraceTime tt("FinalReference", trace_time, false, gclog_or_tty);
    process_discovered_reflist(_discoveredFinalRefs, NULL, false,
                               is_alive, keep_alive, complete_gc, task_executor);
  }
 
  // Phantom references
  {
    TraceTime tt("PhantomReference", trace_time, false, gclog_or_tty);
    process_discovered_reflist(_discoveredPhantomRefs, NULL, false,
                               is_alive, keep_alive, complete_gc, task_executor);
  }
 
  // Weak global JNI references. It would make more sense (semantically) to
  // traverse these simultaneously with the regular weak references above, but
  // that is not how the JDK1.2 specification is. See #4126360. Native code can
  // thus use JNI weak references to circumvent the phantom references and
  // resurrect a "post-mortem" object.
  {
    TraceTime tt("JNI Weak Reference", trace_time, false, gclog_or_tty);
    if (task_executor != NULL) {
      task_executor->set_single_threaded_mode();
    }
    process_phaseJNI(is_alive, keep_alive, complete_gc);
  }
}

看JVM源码,weak refs processing主要包括SoftReference、WeakReference、FinalReference、PhantomReference以及JNI Weak Reference这五种Reference对象的处理,处理的主要内容是对之前标记的Reference对象重新处理,重新判断是否需要标记(不标记就是要回收的),如果不标记就需要放到refqueue里,等待java ReferenceHandler线程处理。
所以从工作的主要内容看各种Reference的处理时间跟reference的个数成正比。

dump出堆进行分析

怎么查看堆内都有哪些对象呢
初步定位
简单查看对象数目可以使用下面命令,初步定位问题。

// 这里面7代表要查看进程id
jmap -histo:live  7  > dump_7

这个dump_7中就会有堆中对象的个数和占用空间大小信息,这个由于我没有线上机权限,并没有执行。
dump出堆进行分析
这里使用如下命令可以dump出堆,这个操作会影响生产环境应用,慎用

jmap -dump:format=b,file=headdump_7.hprof  7

dump出来堆文件后,将这个文件用mat打开,可以看到内存泄漏暴露大概的地方:


 
image.png

里面的内容复制出来如下:

The class "com.mysql.cj.jdbc.AbandonedConnectionCleanupThread", loaded by "org.apache.catalina.loader.ParallelWebappClassLoader @ 0x721856020", occupies 119,594,176 (45.21%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Node[]", loaded by "<system class loader>", which occupies 119,594,016 (45.21%) bytes.

Keywords:
com.mysql.cj.jdbc.AbandonedConnectionCleanupThread
org.apache.catalina.loader.ParallelWebappClassLoader @ 0x721856020
java.util.concurrent.ConcurrentHashMap$Node[]

查看详情可以看到,connectionFinalizerPhantomRefs占用了其中大部分空间,下一节会分析下AbandonedConnectionCleanupThread类中的connectionFinalizerPhantomRefs:


 
image.png

AbandonedConnectionCleanupThread类分析

AbandonedConnectionCleanupThread类主要变量

AbandonedConnectionCleanupThread类维护了一个单例的单线程池,负责定期清理已回收的connection连接。

public class AbandonedConnectionCleanupThread implements Runnable {
   // 存放connection对应的幻象引用集合
    private static final Set<ConnectionFinalizerPhantomReference> connectionFinalizerPhantomRefs = ConcurrentHashMap.newKeySet();
    // 引用队列,从这个里面取连接出来进行释放
    private static final ReferenceQueue<MysqlConnection> referenceQueue = new ReferenceQueue<>();

    private static final ExecutorService cleanupThreadExcecutorService;
    private static Thread threadRef = null;
    private static Lock threadRefLock = new ReentrantLock();

    static {
        cleanupThreadExcecutorService = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "mysql-cj-abandoned-connection-cleanup");
            t.setDaemon(true);
            // Tie the thread's context ClassLoader to the ClassLoader that loaded the class instead of inheriting the context ClassLoader from the current
            // thread, which would happen by default.
            // Application servers may use this information if they attempt to shutdown this thread. By leaving the default context ClassLoader this thread
            // could end up being shut down even when it is shared by other applications and, being it statically initialized, thus, never restarted again.

            ClassLoader classLoader = AbandonedConnectionCleanupThread.class.getClassLoader();
            if (classLoader == null) {
                // This class was loaded by the Bootstrap ClassLoader, so lets tie the thread's context ClassLoader to the System ClassLoader instead.
                classLoader = ClassLoader.getSystemClassLoader();
            }

            t.setContextClassLoader(classLoader);
            return threadRef = t;
        });
        cleanupThreadExcecutorService.execute(new AbandonedConnectionCleanupThread());
    }
}

加入connectionFinalizerPhantomRefs分析

下面看下连接是什么时候加入到connectionFinalizerPhantomRefs的。

// com.mysql.cj.jdbc.AbandonedConnectionCleanupThread
/**
     * Tracks the finalization of a {@link MysqlConnection} object and keeps a reference to its {@link NetworkResources} so that they can be later released.
     * 
     * @param conn
     *            the Connection object to track for finalization
     * @param io
     *            the network resources to close on the connection finalization
     */
    protected static void trackConnection(MysqlConnection conn, NetworkResources io) {
        threadRefLock.lock();
        try {
            if (isAlive()) {
                ConnectionFinalizerPhantomReference reference = new ConnectionFinalizerPhantomReference(conn, io, referenceQueue);
                connectionFinalizerPhantomRefs.add(reference);
            }
        } finally {
            threadRefLock.unlock();
        }
    }

这个trackConnection方法是在生产连接的时候会被调用,代码如下:

// com.mysql.cj.jdbc.ConnectionImpl
/**
     * Creates a connection to a MySQL Server.
     * 
     * @param hostInfo
     *            the {@link HostInfo} instance that contains the host, user and connections attributes for this connection
     * @exception SQLException
     *                if a database access error occurs
     */
    public ConnectionImpl(HostInfo hostInfo) throws SQLException {

        try {
            
          // 省去连接生成过程

        } catch (CJException e1) {
            throw SQLExceptionsMapping.translateException(e1, getExceptionInterceptor());
        }

        try {
            createNewIO(false);

            unSafeQueryInterceptors();
            // 将连接加入到清理线程中的幻象引用集合中
            AbandonedConnectionCleanupThread.trackConnection(this, this.getSession().getNetworkResources());
        } catch (SQLException ex) {
            cleanup(ex);

            // don't clobber SQL exceptions
            throw ex;
        } catch (Exception ex) {
            cleanup(ex);

            throw SQLError
                    .createSQLException(
                            this.propertySet.getBooleanProperty(PropertyKey.paranoid).getValue() ? Messages.getString("Connection.0")
                                    : Messages.getString("Connection.1",
                                            new Object[] { this.session.getHostInfo().getHost(), this.session.getHostInfo().getPort() }),
                            MysqlErrorNumbers.SQL_STATE_COMMUNICATION_LINK_FAILURE, ex, getExceptionInterceptor());
        }

    }

ConnectionFinalizerPhantomReference类

下面看下ConnectionFinalizerPhantomReference连接清理幻象引用的实现,这个类的构造方法中有MysqlConnection 数据库连接对象,NetworkResources 数据库连接资源,ReferenceQueue引用队列,用于在被PhantomReference引用的connection被gc时,JVM会将PhantomReference对象扔到ReferenceQueue中。

 /**
     * {@link PhantomReference} subclass to track {@link MysqlConnection} objects finalization.
     * This class holds a reference to the Connection's {@link NetworkResources} so they it can be later closed.
     */
    private static class ConnectionFinalizerPhantomReference extends PhantomReference<MysqlConnection> {
        private NetworkResources networkResources;

        ConnectionFinalizerPhantomReference(MysqlConnection conn, NetworkResources networkResources, ReferenceQueue<? super MysqlConnection> refQueue) {
            super(conn, refQueue);
            this.networkResources = networkResources;
        }

        void finalizeResources() {
            if (this.networkResources != null) {
                try {
                    this.networkResources.forceClose();
                } finally {
                    this.networkResources = null;
                }
            }
        }
    }

连接关闭的操作

放到引用队列以后,AbandonedConnectionCleanupThread中前面提到的线程池会去执行关闭连接的操作。

public void run() {
        for (;;) {
            try {
                checkThreadContextClassLoader();
                Reference<? extends MysqlConnection> reference = referenceQueue.remove(5000);
                if (reference != null) {
                    // 释放连接资源
                    finalizeResource((ConnectionFinalizerPhantomReference) reference);
                }
            } catch (InterruptedException e) {
                threadRefLock.lock();
                try {
                    threadRef = null;

                    // Finalize remaining references.
                    Reference<? extends MysqlConnection> reference;
                    while ((reference = referenceQueue.poll()) != null) {
                        finalizeResource((ConnectionFinalizerPhantomReference) reference);
                    }
                    connectionFinalizerPhantomRefs.clear();
                } finally {
                    threadRefLock.unlock();
                }
                return;
            } catch (Exception ex) {
                // Nowhere to really log this.
            }
        }
    }

finalizeResource实现:

/**
     * Release resources from the given {@link ConnectionFinalizerPhantomReference} and remove it from the references set.
     * 
     * @param reference
     *            the {@link ConnectionFinalizerPhantomReference} to finalize.
     */
    private static void finalizeResource(ConnectionFinalizerPhantomReference reference) {
        try {
           // 调用引用中回收资源的方法
            reference.finalizeResources();
            reference.clear();
        } finally {
            // 引用从集合中删除
            connectionFinalizerPhantomRefs.remove(reference);
        }
    }

ConnectionFinalizerPhantomReference#finalizeResources方法实现:

 /**
     * {@link PhantomReference} subclass to track {@link MysqlConnection} objects finalization.
     * This class holds a reference to the Connection's {@link NetworkResources} so they it can be later closed.
     */
    private static class ConnectionFinalizerPhantomReference extends PhantomReference<MysqlConnection> {
        private NetworkResources networkResources;

        ConnectionFinalizerPhantomReference(MysqlConnection conn, NetworkResources networkResources, ReferenceQueue<? super MysqlConnection> refQueue) {
            super(conn, refQueue);
            this.networkResources = networkResources;
        }

        void finalizeResources() {
            if (this.networkResources != null) {
                try {
                   // 关闭其中的网络连接资源
                    this.networkResources.forceClose();
                } finally {
                    this.networkResources = null;
                }
            }
        }
    }

看到这里大家应该明白,是jdbc为每个connection都生成了一个ConnectionFinalizerPhantomReference,目的是为了当connection对象回收时,顺便回收相关资源。这其实是一个保底操作,是怕connnection资源被上层的连接池或者使用者忘记close,从而导致资源泄漏。

问题解决方式

现在知道了为什么ConnectionFinalizerPhantomReference会产生,下面分析下为什么会有这么多的ConnectionFinalizerPhantomReference,我们使用的是hikari线程池,hikari连接池参数配置介绍见https://www.jianshu.com/p/eb85103f29d6
我们现有的配置如下:

  • 最小连接数没设置,默认会和最大连接数一样,是100
  • maxLifeTime用的是默认值,30min,30分钟后连接会被回收
    也就是说每30分钟都会产生100个无效的连接,一天会产生4800个ConnectionFinalizerPhantomReference。

这个问题优化的方式有两种:
调整数据库连接池配置
现在连接数100个过多,默认的10个就够用了,这样连接数的产生就会小10倍,然后可以调整maxLifeTime,数据库默认的连接空闲时间是8小时,maxLifeTime可以调整为6个小时,这样每天只会产生40个连接,这个ConnectionFinalizerPhantomReference过多的问题也就能解决了。
删除connectionFinalizerPhantomRefs集合
数据库连接的释放有连接池保证,这个保底机制其实是多余的,可以起个定时任务线程,去清理这个connectionFinalizerPhantomRefs集合。
下面是一个旧版本的清理示例代码。

 
image.png

 

参考连接

PhantomReference导致CMS GC耗时严重
PhantomReference 引发的GC问题


本文来源链接:https://www.jianshu.com/p/2db280229343

 
posted @ 2022-11-21 17:22  looyee  阅读(87)  评论(0编辑  收藏  举报