A MongoDB CPU Usage Spike and Its Optimization
The Symptom
Arrived at work one morning to find one of our MongoDB hosts under heavy load: the load average had climbed past 300 and was holding steady there. Surely the hardware hadn't suddenly become obsolete.
Checking the host's resource usage showed a single mongod process consuming almost all of the CPU.
top - 13:55:32 up 134 days, 20:24, 4 users, load average: 313.44, 304.76, 306.63
Tasks: 850 total, 2 running, 848 sleeping, 0 stopped, 0 zombie
%Cpu(s): 95.1 us, 3.2 sy, 0.0 ni, 1.5 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 39464614+total, 14463147+free, 79623008 used, 17039164+buff/cache
KiB Swap: 0 total, 0 free, 0 used. 31228908+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
219329 mongo 20 0 33.1g 28.6g 9456 S 9286 7.6 21315:15 mongod
310443 mongo 20 0 9264864 672576 5432 S 57.4 0.2 904:50.28 mongos
229368 mongo 20 0 19.9g 13.6g 10460 S 54.2 3.6 33754:51 mongod
213236 mongo 20 0 16.1g 13.0g 8664 S 31.3 3.4 669:49.10 mongod
229190 mongo 20 0 3111000 819996 9856 S 1.9 0.2 1607:46 mongod
444095 mongo 20 0 3004816 720496 9736 S 1.6 0.2 1347:41 mongod
A look at perf showed nearly all the CPU time going to two functions, __wt_btcur_next and __wt_cursor_get_key. Judging by the names, these are the WiredTiger engine functions that walk the B+ tree and read data through cursors, so the load was very likely driven by MongoDB read/write activity.
mongostat on the affected shard showed relatively few operations actually executing, but the queues were extremely long — a strong hint that some operation on this mongod was misbehaving.
############################ xx.xx.xx.48 mongostat at 2024-11-07-13-59-01 ###################################
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
52 519 75 340 131 263|0 2.0% 58.2% 0 33.3G 28.8G 0|1776 1|126 478k 502k 2214 shardsrv2 PRI Nov 7 13:59:10.224
50 556 66 370 118 231|0 2.0% 58.2% 0 33.3G 28.8G 0|1783 1|127 465k 515k 2214 shardsrv2 PRI Nov 7 13:59:12.217
57 544 72 348 126 243|0 2.0% 58.2% 0 33.3G 28.8G 0|1777 1|128 474k 508k 2214 shardsrv2 PRI Nov 7 13:59:14.222
This mongod is one shard of a sharded cluster, and the other shards' resource usage was low. That suggested two possibilities: either the data was unevenly distributed across the shards, or the problem collection was not a sharded collection.
First I checked the balancer: its window is configured to run from after midnight until 6 a.m. Several days of logs showed no errors, and the data volume on each shard was even, so uneven data distribution could be ruled out.
Next, the mongod log revealed something telling: a single delete had been running for nearly two hours.
2024-11-07T13:32:33.227+0800 I WRITE [conn444] remove PS_DB.PS_ACCOUNT query: { "Uin": 2527524969 } planSummary: COLLSCAN keysExamined:0 docsExamined:123458497 ndeleted:0 numYields:970758 locks:{ Global: { acquireCount: { r: 970759, w: 970759 } }, Database: { acquireCount: { w: 970759 } }, Collection: { acquireCount: { w: 970759 } } } 7993866ms
2024-11-07T13:32:33.239+0800 I COMMAND [conn444] command PS_DB.$cmd command: delete { delete: "PS_ACCOUNT", deletes: [ { q: { "Uin": 2527524969 }, limit: 0 } ], writeConcern: {}, ordered: true, shardVersion: [ Timestamp 0|0, ObjectId('000000000000000000000000') ] } numYields:0 reslen:221 locks:{ Global: { acquireCount: { r: 970760, w: 970760 } }, Database: { acquireCount: { w: 970760 } }, Collection: { acquireCount: { w: 970760 } }, Metadata: { acquireCount: { W: 1 } } } protocol:op_command 7993879ms
That confirmed it: this one statement was burning all the CPU — a genuine electricity-bill killer, practically a strategic partner of the power grid.
Looking at the currently running operations, I filtered for the longest-running ones first to find more statements like it:
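The numbers in the WRITE log line above tell the whole story: planSummary: COLLSCAN (full collection scan), docsExamined:123458497 (documents inspected), ndeleted:0 (nothing matched), and the trailing 7993866ms (total duration). A small sketch that pulls those fields out of such a line (the parsing helper is my own, not a MongoDB tool):

```javascript
// Extract the interesting fields from a mongod slow-operation log line.
const line = 'remove PS_DB.PS_ACCOUNT query: { "Uin": 2527524969 } ' +
  'planSummary: COLLSCAN keysExamined:0 docsExamined:123458497 ' +
  'ndeleted:0 numYields:970758 7993866ms';

function parseSlowOp(line) {
  // Grab "name:123" style counters from the line.
  const num = (name) => {
    const m = line.match(new RegExp(name + ":(\\d+)"));
    return m ? Number(m[1]) : null;
  };
  return {
    planSummary: (line.match(/planSummary: (\S+)/) || [])[1],
    docsExamined: num("docsExamined"),
    ndeleted: num("ndeleted"),
    millis: Number((line.match(/(\d+)ms\b/) || [])[1]), // trailing duration
  };
}

console.log(parseSlowOp(line));
// { planSummary: 'COLLSCAN', docsExamined: 123458497, ndeleted: 0, millis: 7993866 }
```

A scan of 123 million documents that deletes nothing, repeated per statement, is exactly the shape of load that pins CPUs the way top showed above.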
{
"desc" : "conn2389",
"threadId" : "140234727237376",
"connectionId" : 2389,
"client" : "xx.xx.xx.xx:38706",
"clientMetadata" : {
"driver" : {
"name" : "MongoDB Internal Client",
"version" : "x.x.xx"
},
"os" : {
"type" : "Linux",
"name" : xxxxxxx ",
"architecture" : "x86_64",
"version" : "Kernel xxxxxx"
}
},
"active" : true,
"opid" : 10218935,
"secs_running" : 6566,
"microsecs_running" : NumberLong("6566038844"),
"op" : "remove",
"ns" : "PS_DB.PS_ACCOUNT",
"query" : {
""Uin"" : NumberLong(1586603223)
},
"planSummary" : "COLLSCAN",
"numYields" : 551828,
"locks" : {
"Global" : "r"
},
"waitingForLock" : false,
"lockStats" : {
"Global" : {
"acquireCount" : {
"r" : NumberLong(551829),
"w" : NumberLong(551828)
}
},
"Database" : {
"acquireCount" : {
"w" : NumberLong(551828)
}
},
"Collection" : {
"acquireCount" : {
"w" : NumberLong(551828)
}
}
}
}
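In the mongo shell this kind of search is typically done with db.currentOp(), filtering on secs_running. The selection logic amounts to the following (a plain-JavaScript sketch over currentOp-shaped documents; the sample data and the 60-second threshold are illustrative choices, not values from the incident):

```javascript
// currentOp-style documents, shaped like the output above.
const ops = [
  { opid: 10218935, op: "remove", ns: "PS_DB.PS_ACCOUNT",
    active: true, secs_running: 6566, planSummary: "COLLSCAN" },
  { opid: 10218999, op: "query", ns: "PS_DB.PS_ACCOUNT",
    active: true, secs_running: 2, planSummary: "IXSCAN" },
];

// Keep only active operations that have run longer than the threshold.
function findLongRunningOps(ops, minSecs) {
  return ops.filter((o) => o.active && o.secs_running >= minSecs);
}

console.log(findLongRunningOps(ops, 60).map((o) => o.opid)); // [ 10218935 ]
```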
All of them had extremely long running times, there were many of them, and every one was doing a full collection scan.
Next, check the indexes on the collection:
shardsrv2:PRIMARY> db.PS_ACCOUNT.getIndexes()
[
{
"v" : 2,
"key" : {
"UIN" : 1
},
"name" : "UIN_1",
"ns" : "PS_DB.PS_ACCOUNT"
},
{
"v" : 2,
"key" : {
"Uin" : 1
},
"name" : "Uin_1",
"ns" : "PS_DB.PS_ACCOUNT"
},
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "PS_DB.PS_ACCOUNT"
}
]
The collection does have a suitable index, yet it wasn't being used. Poor selectivity, perhaps? No — on inspection, Uin is quite selective as a predicate.
Time to run an explain myself:
mongos> db.PS_ACCOUNT.find({ Uin: { $gte: 0 } }).explain()
{
"queryPlanner" : {
"mongosPlannerVersion" : 1,
"winningPlan" : {
"stage" : "SINGLE_SHARD",
"shards" : [
{
"shardName" : "config",
"connectionString" : "configsrv/xx.xx.xx.47:23000,xx.xx.xx.48:23000,xx.xx.xx.49:23000",
"serverInfo" : {
"host" : "mongo-016",
"port" : 23000,
"version" : "x.x.xx",
"gitVersion" : "00000000000000000000"
},
"plannerVersion" : 1,
"namespace" : "admin.PS_ACCOUNT",
"indexFilterSet" : false,
"parsedQuery" : {
"Uin" : {
"$gte" : 0
}
},
"winningPlan" : {
"stage" : "EOF"
},
"rejectedPlans" : [ ]
}
]
}
},
"ok" : 1
}
mongos> db.PS_ACCOUNT.find({ Uin: "1294288363" }).explain()
{
"queryPlanner" : {
"mongosPlannerVersion" : 1,
"winningPlan" : {
"stage" : "SINGLE_SHARD",
"shards" : [
{
"shardName" : "config",
"connectionString" : "configsrv/xx.xx.xx.47:23000,xx.xx.xx.48:23000,xx.xx.xx.49:23000",
"serverInfo" : {
"host" : "mongo-016",
"port" : 23000,
"version" : "x.x.xx",
"gitVersion" : "0000000000000000000000"
},
"plannerVersion" : 1,
"namespace" : "admin.PS_ACCOUNT",
"indexFilterSet" : false,
"parsedQuery" : {
"Uin" : {
"$eq" : "1294288363"
}
},
"winningPlan" : {
"stage" : "EOF"
},
"rejectedPlans" : [ ]
}
]
}
},
"ok" : 1
}
I ran explain myself and also executed statements with the same predicate; all of them used the index normally. Puzzled, I turned on verbose logging and waited for more entries to accumulate before analyzing further.
2024-11-06T19:00:45.013+0800 I COMMAND [conn2108] command PS_DB.$cmd command: delete { delete: "PS_ACCOUNT", deletes: [ { q: { Uin: 3801898340 }, limit: 0 } ], writeConcern: { w: 1 }, ordered: true, shardVersion: [ Timestamp 0|0, ObjectId('000000000000000000000000') ] } numYields:0 reslen:221 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_command 12ms
2024-11-07T13:32:33.227+0800 I WRITE [conn444] remove PS_DB.PS_ACCOUNT query: { "Uin": 2527524969 } planSummary: COLLSCAN keysExamined:0 docsExamined:123458497 ndeleted:0 numYields:970758 locks:{ Global: { acquireCount: { r: 970759, w: 970759 } }, Database: { acquireCount: { w: 970759 } }, Collection: { acquireCount: { w: 970759 } } } 7993866ms
2024-11-07T13:32:33.239+0800 I COMMAND [conn444] command PS_DB.$cmd command: delete { delete: "PS_ACCOUNT", deletes: [ { q: { "Uin": 2527524969 }, limit: 0 } ], writeConcern: {}, ordered: true, shardVersion: [ Timestamp 0|0, ObjectId('000000000000000000000000') ] } numYields:0 reslen:221 locks:{ Global: { acquireCount: { r: 970760, w: 970760 } }, Database: { acquireCount: { w: 970760 } }, Collection: { acquireCount: { w: 970760 } }, Metadata: { acquireCount: { W: 1 } } } protocol:op_command 7993879ms
The log recorded two almost identical statements: one took 12ms, the other 7993879ms. Same statement — how could the gap be that large?
Drawing on twenty years of experience playing spot-the-difference, close inspection revealed that in the second statement, Uin is wrapped in an extra pair of double quotes.
I reproduced both the normal and the abnormal query in a test environment:
--Normal query
db.PS_USERROUTER.find({"Uin" : 1234576}).explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "PS_CACHE.PS_USERROUTER",
"indexFilterSet" : false,
"parsedQuery" : {
"Uin" : {
"$eq" : 1234576
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"Uin" : 1
},
"indexName" : "Uin_1",
"isMultiKey" : false,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"Uin" : [
"[1234576.0, 1234576.0]"
]
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "test9211",
"port" : 27017,
"version" : "3.4.10",
"gitVersion" : "078f28920cb24de0dd479b5ea6c66c644f6326e9"
},
"ok" : 1
}
--Abnormal query
db.PS_USERROUTER.find({"\"Uin\"" : 1234576}).explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "PS_CACHE.PS_USERROUTER",
"indexFilterSet" : false,
"parsedQuery" : {
""Uin"" : {
"$eq" : 1234576
}
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
""Uin"" : {
"$eq" : 1234576
}
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "test9211",
"port" : 27017,
"version" : "3.4.10",
"gitVersion" : "078f28920cb24de0dd479b5ea6c66c644f6326e9"
},
"ok" : 1
}
That reproduced the behavior exactly: the first query uses the index, the second does not.
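The root cause is that `"Uin"` (quotes included) and `Uin` are two entirely different field names once the query document is parsed. Since no document has a field literally named `"Uin"`, the index on `Uin` cannot apply and every document must be inspected. A minimal Node.js sketch of the difference (the value is illustrative):

```javascript
// The escaped quotes become part of the key itself:
// the second query looks for a field literally named "Uin", quotes and all.
const good = JSON.parse('{ "Uin": 1234576 }');
const bad  = JSON.parse('{ "\\"Uin\\"": 1234576 }');

console.log(Object.keys(good)); // [ 'Uin' ]
console.log(Object.keys(bad));  // [ '"Uin"' ]
console.log(Object.keys(good)[0] === Object.keys(bad)[0]); // false
```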
The Fix
Once the double quotes were identified as the culprit, the cause of the problem was clear, and two possible fixes followed:
1. Have the development team locate the offending statements and correct them.
2. Create an oddball index whose key itself contains the double quotes:
db.PS_USERROUTER.createIndex({"\"Uin\"": 1});
> db.PS_USERROUTER.find({"\"Uin\"" : 1234576}).explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "PS_CACHE.PS_USERROUTER",
"indexFilterSet" : false,
"parsedQuery" : {
""Uin"" : {
"$eq" : 1234576
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
""Uin"" : 1
},
"indexName" : "\"Uin\"_1",
"isMultiKey" : false,
"multiKeyPaths" : {
""Uin"" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
""Uin"" : [
"[1234576.0, 1234576.0]"
]
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "test9211",
"port" : 27017,
"version" : "x.x.xx",
"gitVersion" : "00000000000000000000000"
},
"ok" : 1
In the test environment, the index with the quoted key solved the problem completely.
The load numbers were alarming, and the first fix would have to wait for a release window. So I created the quoted index on the production database as well, as a temporary mitigation.
Summary
In a relational database, a misspelled column name would be caught by the parser. MongoDB collections, however, impose no strict column schema: given a predicate with a wrong field name or stray extra characters, no index is hit, and the database dutifully scans the entire collection for matches that are guaranteed not to exist — the engine's schemaless nature leaves it no choice.
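One way to catch this class of bug before it reaches the server is to validate query field names on the client side. A hypothetical sketch — the guard function and the known-field list are my own assumptions for illustration, not part of the original fix:

```javascript
// Hypothetical client-side guard: reject query keys that carry stray
// quotes or are not in the collection's known field list.
const KNOWN_FIELDS = new Set(["_id", "Uin", "UIN"]); // assumed schema

function checkQueryKeys(query) {
  for (const key of Object.keys(query)) {
    if (key.startsWith("$")) continue;   // query operators are fine
    const field = key.split(".")[0];     // top-level field of a dotted path
    if (/["']/.test(field) || !KNOWN_FIELDS.has(field)) {
      throw new Error(`suspicious query field: ${key}`);
    }
  }
  return query;
}

checkQueryKeys({ Uin: 2527524969 });        // passes
// checkQueryKeys({ '"Uin"': 2527524969 }); // would throw
```

A check like this would have turned a two-hour collection scan into an immediate client-side error.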