《博客园.NET历险记》第2季 -《memcached连接暴增案》第1集：问题表现

在《.NET 5.0 背锅案》第7集-大结局之后，园子和 .NET 继续过上了幸福生活。。。剧情很美好，现实很残酷。。。现实是旧案刚结，新案立至，而且新案与旧案有关联，被迫继续拍剧，并对该剧重新进行命名，剧名叫~~《高并发下的.NET》~~（后来改名为《博客园.NET历险记》），第1季的剧情是《.NET 5.0 背锅案》，第2季的剧情是《memcached连接暴增案》。

就在我们发布修复 StackExchange.Redis.Extensions 并发问题的更新后的第2天就出现了 memcached tcp 连接暴增的情况（通常连接数在3k以下，暴增时竟然超过15k），而这周一至周四每天都会出现。

下图是其中1台 memcached 服务器（一共有3台）的 tcp 连接数监控图（来自阿里云云监控）

阿里云云监控的报警邮件：

华东1(杭州) 云服务器ECS实例：memcached1，云服务器实例ID=i-bp1f2iqt161cleifjsfl，状态=ESTABLISHED，instanceName=memcached1 ，Host.tcpconnection于18:41发生告警，平均值为15265.55Count，持续时间0分钟。

而昨天下午（11月26日）16:17 的访问高峰出现了一个相反情况，memcached 服务器 tcp 连接数突然暴跌

接着，16:45:07.025 开始出现 EnyimMemcachedCore 重置 socket 连接超时的错误日志

Enyim.Caching.MemcachedClient
2020-11-26 16:45:07.025 [Error] Timeout to reset an acquired socket.

16:45:07.774 开始出现 EnyimMemcachedCore 重置 socket 连接失败的错误日志

Enyim.Caching.Configuration.MemcachedClientConfiguration
2020-11-26 16:45:07.774 [Error] Failed to reset an acquired socket.

16:45:56.582 开始出现 SqlClient 连接数据库超时的错误日志，遇到这个错误的请求会出现500错误

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
---> System.ComponentModel.Win32Exception (258): Unknown error 258

16:45:56.658 开始出现 StackExchange.Redis 的超时错误

StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=0KiB, 1444ms elapsed, timeout is 1000ms), command=MGET, next: GET blog_v2_BlogPosts-389476, inst: 0, qu: 0, qs: 3, aw: False, rs: ReadAsync, ws: Idle, in: 14559, in-pipe: 0, out-pipe: 0, serverEndpoint: redis-blog:6379, mc: 1/1/0, mgr: 8 of 10 available, clientName: blog-web-5cf67c9795-6wmvz, IOCP: (Busy=0,Free=1000,Min=20,Max=1000), WORKER: (Busy=50,Free=32717,Min=50,Max=32767), v: 2.1.58.34321 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
at StackExchange.Redis.Extensions.Core.Implementations.RedisDatabase.GetAllAsync(...)

16:45:57.305 开始出现 pod 重启的日志，说明开始有 pod 因为健康检查失败而被强制重启

2020-11-26 16:45:57.305 [Error] Request Execution Time: 7740 ms
ThreadPool.ThreadCount: 50, Pending work item count: 216, Completed work item count: 226710523

pod 启动后接入负载时由于并发请求大，EnyimMemcachedCore 出现下面的错误日志

[Error] Pool is full, timeouting. 10.0.78.156:11211

随后出现更多 SqlClient 连接数据库超时的错误日志。

Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding

16:52:00.028 EnyimMemcachedCore 开始出现告警日志

[Warning] Socket bound to "10.0.78.156:11211" has 44 unread data! This is probably a bug in the code. InstanceID was a247c5cd-575f-4f04-8573-fb03f199cb9c.

16:55 开始，从日志记录可以看出更多 pod 在重启

16:56:00.032 开始出现连接数据库服务器失败的错误

An error occurred using the connection to database '"xxx"' on server '"yyyy.sqlserver.rds.aliyuncs.com,3433"'.
Microsoft.EntityFrameworkCore.Database.Connection

日志中继续出现大量连接数据库服务器失败或者超时的错误

17:03 开始，从日志看系统恢复了正常。

目前我们还没有找到问题的线索，正在进一步排查。

最后，附上阿里云RDS的数据库连接监控图

posted @ 2020-11-27 12:41 博客园团队阅读(2439) 评论(16) 收藏举报

刷新页面返回顶部

《博客园.NET历险记》第2季 -《memcached连接暴增案》第1集：问题表现

公告