主要环境与用到的(关键)组件:
Springboot2.3.2 其中,dubbo-spring-boot-starter版本为2.7.8 zookeeper3.5.9
首先是服务报错:
No provider available from registry ... for ... on consumer ... use dubbo version 2.7.8, please check status of providers(disabled, not registered or in blacklist).
很多使用过dubbo的同学对这个报错信息应该都不陌生,一般情况下,如果没有配置好生产端、消费端或者注册中心,启动的时候就会报这个错。但这里的情况是,消费端和生产端都已经正常运行了一段时间了,大概10天左右,突然报这个错。
这里借助网上总结的一张图,来梳理一下排查No Provider的情况:
首先就来看看zookeeper下,生产端的provider情况,找到zookeeper的安装bin目录下:
sh zkCli.sh
ls /dubbo/xxx/provider
查看生产端的信息,结果发现[],这时的生产端程序还是running状态(ps -ef进程存在),大概率就是这个生产端出了问题。由于日志有滚动删除,而且出问题的时间点也不确定,所以这个问题感觉没办法在服务本身的日志中查看。于是想到查看zookeeper下的事务日志:
cd lib/ java -cp .:zookeeper-3.5.9.jar:slf4j-api-1.7.25.jar:zookeeper-jute-3.5.9.jar org.apache.zookeeper.server.LogFormatter /tmp/zookeeper/version-2/log.7d1 > /tmp/zookeeper/version-2/log.log
但也看不出什么端倪。这里不太确定是服务本身的问题,还是服务于zookeeper之间链接的问题。
于是在消费服务中设置了一个定时任务,每秒钟去调用生产端的一个接口,并在生产和消费端都打印日志,同时加个监控,这样在除了问题之后,就能第一时间找到具体的错误信息。
重启服务8天后,重现了这个错误,这一次直接看到了错误日志。
首先,消费端爆出了
2022-02-18 16:23:27.713 WARN 2386027 --- [scheduling-1] o.a.d.r.c.s.FailoverClusterInvoker : [DUBBO] Although retry the method heart in the service com.jf.zk.proxy.facade.BindFacade was successful by the provider 192.168.1.13:20010, but there have been failed providers [192.168.1.13:20010] (1/1) from the registry 192.168.1.13:2181 on the consumer 192.168.1.13 using the dubbo version 2.7.8. Last error is: Invoke remote method timeout. method: heart, provider: dubbo://192.168.1.13:20010/com.jf.zk.proxy.facade.BindFacade?anyhost=true&application=proxy-web&check=false&deprecated=false&dubbo=2.0.2&dynamic=true&generic=false&group=hcontrol-test&init=false&interface=com.jf.zk.proxy.facade.BindFacade&metadata-type=remote&methods=appUserBind,unbind,heart,ideUserBind&pid=2386027&qos.enable=false®ister.ip=192.168.1.13&release=2.7.8&remote.application=proxy-service&revision=1.3.1&side=consumer&sticky=false&timeout=30000×tamp=1644487367193&version=1.0, cause: org.apache.dubbo.remoting.TimeoutException: Waiting server-side response timeout by scan timer. start time: 2022-02-18 16:22:39.000, end time: 2022-02-18 16:23:09.020, client elapsed: 0 ms, server elapsed: 30020 ms, timeout: 30000 ms, request: Request [id=707282, version=2.0.2, twoway=true, event=false, broken=false, data=null], channel: /192.168.1.13:54056 -> /192.168.1.13:20010, dubbo version: 2.7.8, current host: 192.168.1.13 org.apache.dubbo.rpc.RpcException: Invoke remote method timeout. method: heart, provider: dubbo://192.168.1.13:20010/com.jf.zk.proxy.facade.BindFacade?anyhost=true&application=proxy-web&check=false&deprecated=false&dubbo=2.0.2&dynamic=true&generic=false&group=hcontrol-test&init=false&interface=com.jf.zk.proxy.facade.BindFacade&metadata-type=remote&methods=appUserBind,unbind,heart,ideUserBind&pid=2386027&qos.enable=false®ister.ip=192.168.1.13&release=2.7.8&remote.application=proxy-service&revision=1.3.1&side=consumer&sticky=false&timeout=30000×tamp=1644487367193&version=1.0, cause: org.apache.dubbo.remoting.TimeoutException: Waiting server-side response timeout by scan timer. start time: 2022-02-18 16:22:39.000, end time: 2022-02-18 16:23:09.020, client elapsed: 0 ms, server elapsed: 30020 ms, timeout: 30000 ms, request: Request [id=707282, version=2.0.2, twoway=true, event=false, broken=false, data=null], channel: /192.168.1.13:54056 -> /192.168.1.13:20010 ...... Caused by: java.util.concurrent.ExecutionException: org.apache.dubbo.remoting.TimeoutException: Waiting server-side response timeout by scan timer. start time: 2022-02-18 16:22:39.000, end time: 2022-02-18 16:23:09.020, client elapsed: 0 ms, server elapsed: 30020 ms, timeout: 30000 ms, request: Request [id=707282, version=2.0.2, twoway=true, event=false, broken=false, data=null], channel: /192.168.1.13:54056 -> /192.168.1.13:20010 ...... 2022-02-18 16:23:28.009 ERROR 2386027 --- [scheduling-1] o.s.s.s.TaskUtils$LoggingErrorHandler : Unexpected error occurred in scheduled task org.apache.dubbo.rpc.RpcException: No provider available from registry ...
这里看到是一些请求超过了consumer端的timeout,后续就是No provider,猜想应该是生产端出现了什么问题,响应越来越慢,到后来服务长时间不可用,直至与其它服务断开了链接。这里就要看看生产端到底发生了什么,关键日志如下:
2022-02-18 16:23:47.942 INFO 2385338 --- [main-SendThread(192.168.1.13:2181)] org.apache.zookeeper.ClientCnxn : Unable to read additional data from server sessionid 0x100000040fc01ae, likely server has closed socket, closing socket connection and attempting reconnect 2022-02-18 16:23:32.144 WARN 2385338 --- [NettyServerWorker-6-2] o.a.d.remoting.transport.AbstractServer : [DUBBO] All clients has disconnected from /192.168.1.13:20010. You can graceful shutdown now., dubbo version: 2.7.8, current host: 192.168.1.13 2022-02-18 16:27:15.830 WARN 2385338 --- [DubboClientHandler-192.168.1.13:20004-thread-2] o.a.d.r.exchange.support.DefaultFuture : [DUBBO] The timeout response finally returned at 2022-02-18 16:27:13.882, response status is 31, channel: /192.168.1.13:35954 -> /192.168.1.13:20004, please check provider side for detailed result., dubbo version: 2.7.8, current host: 192.168.1.13 2022-02-18 16:27:38.590 WARN 2385338 --- [DubboClientHandler-192.168.1.13:20004-thread-3] o.a.d.r.exchange.support.DefaultFuture : [DUBBO] The timeout response finally returned at 2022-02-18 16:26:53.015, response status is 31, channel: /192.168.1.13:35954 -> /192.168.1.13:20004, please check provider side for detailed result., dubbo version: 2.7.8, current host: 192.168.1.13 ...... 2022-02-18 16:23:47.117 INFO 2385338 --- [main-SendThread(192.168.1.13:2181)] org.apache.zookeeper.ClientCnxn : Unable to read additional data from server sessionid 0x100000040fc01ad, likely server has closed socket, closing socket connection and attempting reconnect 2022-02-18 16:35:32.357 INFO 2385338 --- [NettyClientWorker-1-2] o.a.d.r.t.netty4.NettyClientHandler : [DUBBO] The connection of /192.168.1.13:44900 -> /192.168.1.13:8003 is disconnected., dubbo version: 2.7.8, current host: 192.168.1.13 2022-02-18 16:36:13.950 WARN 2385338 --- [dubbo-client-idleCheck-thread-1] o.a.d.r.e.s.header.ReconnectTimerTask : [DUBBO] Reconnect to channel HeaderExchangeClient [channel=org.apache.dubbo.remoting.transport.netty4.NettyClient [/192.168.1.13:35954 -> /192.168.1.13:20004]], because heartbeat read idle time out: 180000ms, dubbo version: 2.7.8, current host: 192.168.1.13 ...... 2022-02-18 18:46:33.184 WARN 2385338 --- [nioEventLoopGroup-10-3] i.n.c.AbstractChannelHandlerContext : An exception 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception: java.lang.OutOfMemoryError: Java heap space
这个服务虽然出现了很多与另一个服务的链接问题,但后续出现了java.lang.OutOfMemoryError: Java heap space,大概率是由于这个内存问题导致的。
由于服务还是running(ps -ef进程存在),我们可以直接把它的内存dump出来。
jmap -dump:file=javaDump.hprof,format=b {该服务的pid}
而后使用jvisualvm.exe(jdk自带工具,其实JDK自带了很多有用的工具,有必要了解一下)将这个hprof文件打开,可以查看很多信息:
查看到了一个我们代码里定义的SessionModel对象异常地多,最终发现是同事代码里每次新建一个业务会话时,都会生成一个SessionModel对象,将其放在ConcurrentHashMap中,但没有clear或者remove的逻辑,导致相关的对象越来越多,最终导致OOM。
猜测这个服务在内存资源消耗到一定程度时,已经无法处理其他的请求,发出去的请求也无法应答,所以与之相关的rpc链接会被断开,最终这个服务从注册中心被清理。
这里还可以看看这个生产端的服务A调用其它服务B的日志:
...... 2022-02-18 16:32:07.642 INFO 348283 --- [NettyServerWorker-6-3] o.a.d.r.t.netty4.NettyServerHandler : [DUBBO] IdleStateEvent triggered, close channel NettyChannel [channel=[id: 0x0efa3ba8, L:/192.168.1.13:20004 - R:/192.168.1.13:35954]], dubbo version: 2.7.8, current host: 192.168.1.13 2022-02-18 16:32:07.642 INFO 348283 --- [NettyServerWorker-6-3] o.a.d.r.transport.netty4.NettyChannel : [DUBBO] Close netty channel [id: 0x0efa3ba8, L:/192.168.1.13:20004 - R:/192.168.1.13:35954], dubbo version: 2.7.8, current host: 192.168.1.13 2022-02-18 16:32:07.642 INFO 348283 --- [NettyServerWorker-6-3] o.a.d.r.t.netty4.NettyServerHandler : [DUBBO] The connection of /192.168.1.13:35954 -> /192.168.1.13:20004 is disconnected., dubbo version: 2.7.8, current host: 192.168.1.13
这里找到对应日志的代码:
猜测是服务B由于长时间没有接收到来自上文提到的服务A(此时应该是客户端的角色)的心跳,持续了timeout这么长时间,于是服务B就把链接关闭了,此时对于A来说,应该就是收不到B服务的响应,所以也会在日志前期报访问服务B的timeout。
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 阿里最新开源QwQ-32B,效果媲美deepseek-r1满血版,部署成本又又又降低了!
· 单线程的Redis速度为什么快?
· SQL Server 2025 AI相关能力初探
· AI编程工具终极对决:字节Trae VS Cursor,谁才是开发者新宠?
· 展开说说关于C#中ORM框架的用法!