A MongoDB CPU Usage Spike and Its Optimization
The Symptom
Arrived at work one morning to find one of our MongoDB hosts under heavy load: the load average had climbed past 300 and was holding steady there. Surely the hardware hadn't suddenly become obsolete.
Checking the host's resource usage showed a single mongod process consuming almost all of the CPU.
top - 13:55:32 up 134 days, 20:24, 4 users, load average: 313.44, 304.76, 306.63
Tasks: 850 total, 2 running, 848 sleeping, 0 stopped, 0 zombie
%Cpu(s): 95.1 us, 3.2 sy, 0.0 ni, 1.5 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 39464614+total, 14463147+free, 79623008 used, 17039164+buff/cache
KiB Swap: 0 total, 0 free, 0 used. 31228908+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
219329 mongo 20 0 33.1g 28.6g 9456 S 9286 7.6 21315:15 mongod
310443 mongo 20 0 9264864 672576 5432 S 57.4 0.2 904:50.28 mongos
229368 mongo 20 0 19.9g 13.6g 10460 S 54.2 3.6 33754:51 mongod
213236 mongo 20 0 16.1g 13.0g 8664 S 31.3 3.4 669:49.10 mongod
229190 mongo 20 0 3111000 819996 9856 S 1.9 0.2 1607:46 mongod
444095 mongo 20 0 3004816 720496 9736 S 1.6 0.2 1347:41 mongod
A look at perf showed nearly all the CPU time going to two functions, __wt_btcur_next and __wt_cursor_get_key. Judging by the names, these are the WiredTiger engine functions that walk the B+ tree and read data through cursors, so the load was very likely driven by MongoDB read/write activity.
mongostat on the affected shard showed relatively few operations actually executing, but the queues were extremely long — a strong hint that some operation on this mongod was misbehaving.
############################ xx.xx.xx.48 mongostat at 2024-11-07-13-59-01 ###################################
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
52 519 75 340 131 263|0 2.0% 58.2% 0 33.3G 28.8G 0|1776 1|126 478k 502k 2214 shardsrv2 PRI Nov 7 13:59:10.224
50 556 66 370 118 231|0 2.0% 58.2% 0 33.3G 28.8G 0|1783 1|127 465k 515k 2214 shardsrv2 PRI Nov 7 13:59:12.217
57 544 72 348 126 243|0 2.0% 58.2% 0 33.3G 28.8G 0|1777 1|128 474k 508k 2214 shardsrv2 PRI Nov 7 13:59:14.222
This mongod is one shard of a sharded cluster, and the other shards' resource usage was low. That suggested two possibilities: either the data was unevenly distributed across the shards, or the problem collection was not a sharded collection.
First I checked the balancer: its window is configured to run from after midnight until 6 a.m. Several days of logs showed no errors, and the data volume on each shard was even, so uneven data distribution could be ruled out.
Next, the mongod log revealed something telling: a single delete had been running for nearly two hours.
2024-11-07T13:32:33.227+0800 I WRITE [conn444] remove PS_DB.PS_ACCOUNT query: { "Uin": 2527524969 } planSummary: COLLSCAN keysExamined:0 docsExamined:123458497 ndeleted:0 numYields:970758 locks:{ Global: { acquireCount: { r: 970759, w: 970759 } }, Database: { acquireCount: { w: 970759 } }, Collection: { acquireCount: { w: 970759 } } } 7993866ms
2024-11-07T13:32:33.239+0800 I COMMAND [conn444] command PS_DB.$cmd command: delete { delete: "PS_ACCOUNT", deletes: [ { q: { "Uin": 2527524969 }, limit: 0 } ], writeConcern: {}, ordered: true, shardVersion: [ Timestamp 0|0, ObjectId('000000000000000000000000') ] } numYields:0 reslen:221 locks:{ Global: { acquireCount: { r: 970760, w: 970760 } }, Database: { acquireCount: { w: 970760 } }, Collection: { acquireCount: { w: 970760 } }, Metadata: { acquireCount: { W: 1 } } } protocol:op_command 7993879ms
That confirmed it: this one statement was burning all the CPU — a genuine electricity-bill killer, practically a strategic partner of the power grid.
Looking at the currently running operations, I filtered for the longest-running ones first to find more statements like it:
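The numbers in the WRITE log line above tell the whole story: planSummary: COLLSCAN (full collection scan), docsExamined:123458497 (documents inspected), ndeleted:0 (nothing matched), and the trailing 7993866ms (total duration). A small sketch that pulls those fields out of such a line (the parsing helper is my own, not a MongoDB tool):

```javascript
// Extract the interesting fields from a mongod slow-operation log line.
const line = 'remove PS_DB.PS_ACCOUNT query: { "Uin": 2527524969 } ' +
  'planSummary: COLLSCAN keysExamined:0 docsExamined:123458497 ' +
  'ndeleted:0 numYields:970758 7993866ms';

function parseSlowOp(line) {
  // Grab "name:123" style counters from the line.
  const num = (name) => {
    const m = line.match(new RegExp(name + ":(\\d+)"));
    return m ? Number(m[1]) : null;
  };
  return {
    planSummary: (line.match(/planSummary: (\S+)/) || [])[1],
    docsExamined: num("docsExamined"),
    ndeleted: num("ndeleted"),
    millis: Number((line.match(/(\d+)ms\b/) || [])[1]), // trailing duration
  };
}

console.log(parseSlowOp(line));
// { planSummary: 'COLLSCAN', docsExamined: 123458497, ndeleted: 0, millis: 7993866 }
```

A scan of 123 million documents that deletes nothing, repeated per statement, is exactly the shape of load that pins CPUs the way top showed above.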
{
"desc" : "conn2389",
"threadId" : "140234727237376",
"connectionId" : 2389,
"client" : "xx.xx.xx.xx:38706",
"clientMetadata" : {
"driver" : {
"name" : "MongoDB Internal Client",
"version" : "x.x.xx"
},
"os" : {
"type" : "Linux",
"name" : xxxxxxx ",
"architecture" : "x86_64",
"version" : "Kernel xxxxxx"
}
},
"active" : true,
"opid" : 10218935,
"secs_running" : 6566,
"microsecs_running" : NumberLong("6566038844"),
"op" : "remove",
"ns" : "PS_DB.PS_ACCOUNT",
"query" : {
""Uin"" : NumberLong(1586603223)
},
"planSummary" : "COLLSCAN",
"numYields" : 551828,
"locks" : {
"Global" : "r"
},
"waitingForLock" : false,
"lockStats" : {
"Global" : {
"acquireCount" : {
"r" : NumberLong(551829),
"w" : NumberLong(551828)
}
},
"Database" : {
"acquireCount" : {
"w" : NumberLong(551828)
}
},
"Collection" : {
"acquireCount" : {
"w" : NumberLong(551828)
}
}
}
}
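In the mongo shell this kind of search is typically done with db.currentOp(), filtering on secs_running. The selection logic amounts to the following (a plain-JavaScript sketch over currentOp-shaped documents; the sample data and the 60-second threshold are illustrative choices, not values from the incident):

```javascript
// currentOp-style documents, shaped like the output above.
const ops = [
  { opid: 10218935, op: "remove", ns: "PS_DB.PS_ACCOUNT",
    active: true, secs_running: 6566, planSummary: "COLLSCAN" },
  { opid: 10218999, op: "query", ns: "PS_DB.PS_ACCOUNT",
    active: true, secs_running: 2, planSummary: "IXSCAN" },
];

// Keep only active operations that have run longer than the threshold.
function findLongRunningOps(ops, minSecs) {
  return ops.filter((o) => o.active && o.secs_running >= minSecs);
}

console.log(findLongRunningOps(ops, 60).map((o) => o.opid)); // [ 10218935 ]
```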
All of them had extremely long running times, there were many of them, and every one was doing a full collection scan.
Next, check the indexes on the collection:
shardsrv2:PRIMARY> db.PS_ACCOUNT.getIndexes()
[
{
"v" : 2,
"key" : {
"UIN" : 1
},
"name" : "UIN_1",
"ns" : "PS_DB.PS_ACCOUNT"
},
{
"v" : 2,
"key" : {
"Uin" : 1
},
"name" : "Uin_1",
"ns" : "PS_DB.PS_ACCOUNT"
},
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "PS_DB.PS_ACCOUNT"
}
]
The collection does have a suitable index, yet it wasn't being used. Poor selectivity, perhaps? No — on inspection, Uin is quite selective as a predicate.
Time to run an explain myself:
mongos> db.PS_ACCOUNT.find({ Uin: { $gte: 0 } }).explain()
{
"queryPlanner" : {
"mongosPlannerVersion" : 1,
"winningPlan" : {
"stage" : "SINGLE_SHARD",
"shards" : [
{
"shardName" : "config",
"connectionString" : "configsrv/xx.xx.xx.47:23000,xx.xx.xx.48:23000,xx.xx.xx.49:23000",
"serverInfo" : {
"host" : "mongo-016",
"port" : 23000,
"version" : "x.x.xx",
"gitVersion" : "00000000000000000000"
},
"plannerVersion" : 1,
"namespace" : "admin.PS_ACCOUNT",
"indexFilterSet" : false,
"parsedQuery" : {
"Uin" : {
"$gte" : 0
}
},
"winningPlan" : {
"stage" : "EOF"
},
"rejectedPlans" : [ ]
}
]
}
},
"ok" : 1
}
mongos> db.PS_ACCOUNT.find({ Uin: "1294288363" }).explain()
{
"queryPlanner" : {
"mongosPlannerVersion" : 1,
"winningPlan" : {
"stage" : "SINGLE_SHARD",
"shards" : [
{
"shardName" : "config",
"connectionString" : "configsrv/xx.xx.xx.47:23000,xx.xx.xx.48:23000,xx.xx.xx.49:23000",
"serverInfo" : {
"host" : "mongo-016",
"port" : 23000,
"version" : "x.x.xx",
"gitVersion" : "0000000000000000000000"
},
"plannerVersion" : 1,
"namespace" : "admin.PS_ACCOUNT",
"indexFilterSet" : false,
"parsedQuery" : {
"Uin" : {
"$eq" : "1294288363"
}
},
"winningPlan" : {
"stage" : "EOF"
},
"rejectedPlans" : [ ]
}
]
}
},
"ok" : 1
}
I ran explain myself and also executed statements with the same predicate; all of them used the index normally. Puzzled, I turned on verbose logging and waited for more entries to accumulate before analyzing further.
2024-11-06T19:00:45.013+0800 I COMMAND [conn2108] command PS_DB.$cmd command: delete { delete: "PS_ACCOUNT", deletes: [ { q: { Uin: 3801898340 }, limit: 0 } ], writeConcern: { w: 1 }, ordered: true, shardVersion: [ Timestamp 0|0, ObjectId('000000000000000000000000') ] } numYields:0 reslen:221 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_command 12ms
2024-11-07T13:32:33.227+0800 I WRITE [conn444] remove PS_DB.PS_ACCOUNT query: { "Uin": 2527524969 } planSummary: COLLSCAN keysExamined:0 docsExamined:123458497 ndeleted:0 numYields:970758 locks:{ Global: { acquireCount: { r: 970759, w: 970759 } }, Database: { acquireCount: { w: 970759 } }, Collection: { acquireCount: { w: 970759 } } } 7993866ms
2024-11-07T13:32:33.239+0800 I COMMAND [conn444] command PS_DB.$cmd command: delete { delete: "PS_ACCOUNT", deletes: [ { q: { "Uin": 2527524969 }, limit: 0 } ], writeConcern: {}, ordered: true, shardVersion: [ Timestamp 0|0, ObjectId('000000000000000000000000') ] } numYields:0 reslen:221 locks:{ Global: { acquireCount: { r: 970760, w: 970760 } }, Database: { acquireCount: { w: 970760 } }, Collection: { acquireCount: { w: 970760 } }, Metadata: { acquireCount: { W: 1 } } } protocol:op_command 7993879ms
The log recorded two almost identical statements: one took 12ms, the other 7993879ms. Same statement — how could the gap be that large?
Drawing on twenty years of experience playing spot-the-difference, close inspection revealed that in the second statement, Uin is wrapped in an extra pair of double quotes.
I reproduced both the normal and the abnormal query in a test environment:
--Normal query
db.PS_USERROUTER.find({"Uin" : 1234576}).explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "PS_CACHE.PS_USERROUTER",
"indexFilterSet" : false,
"parsedQuery" : {
"Uin" : {
"$eq" : 1234576
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"Uin" : 1
},
"indexName" : "Uin_1",
"isMultiKey" : false,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"Uin" : [
"[1234576.0, 1234576.0]"
]
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "test9211",
"port" : 27017,
"version" : "3.4.10",
"gitVersion" : "078f28920cb24de0dd479b5ea6c66c644f6326e9"
},
"ok" : 1
}
--Abnormal query
db.PS_USERROUTER.find({"\"Uin\"" : 1234576}).explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "PS_CACHE.PS_USERROUTER",
"indexFilterSet" : false,
"parsedQuery" : {
""Uin"" : {
"$eq" : 1234576
}
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
""Uin"" : {
"$eq" : 1234576
}
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "test9211",
"port" : 27017,
"version" : "3.4.10",
"gitVersion" : "078f28920cb24de0dd479b5ea6c66c644f6326e9"
},
"ok" : 1
}
That reproduced the behavior exactly: the first query uses the index, the second does not.
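The root cause is that `"Uin"` (quotes included) and `Uin` are two entirely different field names once the query document is parsed. Since no document has a field literally named `"Uin"`, the index on `Uin` cannot apply and every document must be inspected. A minimal Node.js sketch of the difference (the value is illustrative):

```javascript
// The escaped quotes become part of the key itself:
// the second query looks for a field literally named "Uin", quotes and all.
const good = JSON.parse('{ "Uin": 1234576 }');
const bad  = JSON.parse('{ "\\"Uin\\"": 1234576 }');

console.log(Object.keys(good)); // [ 'Uin' ]
console.log(Object.keys(bad));  // [ '"Uin"' ]
console.log(Object.keys(good)[0] === Object.keys(bad)[0]); // false
```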
The Fix
Once the double quotes were identified as the culprit, the cause of the problem was clear, and two possible fixes followed:
1. Have the development team locate the offending statements and correct them.
2. Create an oddball index whose key itself contains the double quotes:
db.PS_USERROUTER.createIndex({"\"Uin\"": 1});
> db.PS_USERROUTER.find({"\"Uin\"" : 1234576}).explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "PS_CACHE.PS_USERROUTER",
"indexFilterSet" : false,
"parsedQuery" : {
""Uin"" : {
"$eq" : 1234576
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
""Uin"" : 1
},
"indexName" : "\"Uin\"_1",
"isMultiKey" : false,
"multiKeyPaths" : {
""Uin"" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
""Uin"" : [
"[1234576.0, 1234576.0]"
]
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "test9211",
"port" : 27017,
"version" : "x.x.xx",
"gitVersion" : "00000000000000000000000"
},
"ok" : 1
In the test environment, the index with the quoted key solved the problem completely.
The load numbers were alarming, and the first fix would have to wait for a release window. So I created the quoted index on the production database as well, as a temporary mitigation.
Summary
In a relational database, a misspelled column name would be caught by the parser. MongoDB collections, however, impose no strict column schema: given a predicate with a wrong field name or stray extra characters, no index is hit, and the database dutifully scans the entire collection for matches that are guaranteed not to exist — the engine's schemaless nature leaves it no choice.
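One way to catch this class of bug before it reaches the server is to validate query field names on the client side. A hypothetical sketch — the guard function and the known-field list are my own assumptions for illustration, not part of the original fix:

```javascript
// Hypothetical client-side guard: reject query keys that carry stray
// quotes or are not in the collection's known field list.
const KNOWN_FIELDS = new Set(["_id", "Uin", "UIN"]); // assumed schema

function checkQueryKeys(query) {
  for (const key of Object.keys(query)) {
    if (key.startsWith("$")) continue;   // query operators are fine
    const field = key.split(".")[0];     // top-level field of a dotted path
    if (/["']/.test(field) || !KNOWN_FIELDS.has(field)) {
      throw new Error(`suspicious query field: ${key}`);
    }
  }
  return query;
}

checkQueryKeys({ Uin: 2527524969 });        // passes
// checkQueryKeys({ '"Uin"': 2527524969 }); // would throw
```

A check like this would have turned a two-hour collection scan into an immediate client-side error.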