I. Discovering the Problem
While logged in to the database instance from the command line to run a manual ALTER on a table, I got the following errors.
mysql> use xx_xxx;
No connection. Trying to reconnect...
Connection id:    5
Current database: *** NONE ***
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    3
Current database: xx_xxx
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/tmp/mysql.sock' (111)
ERROR: Can't connect to the server
II. Locating the Problem
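This kind of error usually means that the MySQL instance was shut down or crashed, that the connection timed out, or that the request thread was killed. In our environment the review platform could still connect to the instance and could even create tables in the problem database. The MySQL service was reachable from outside, the error log showed no abnormal output, and no developer or tester had reported trouble using the problem database. Opening the problem database in Navicat, however, failed with "MySQL server has gone away". From the command line, commands ran normally in every other database, but inside the problem database even the basic variable and status values could not be shown.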
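I first tracked down the source of brute-force connection attempts against the database to narrow the possible causes (this turned out to be a detour), but the problem persisted. What was hard to explain was that creating a table in the problem database through the review platform still succeeded, which differed from the whole-database data file corruption I had seen before. I therefore suspected that the data file of a single table was corrupted.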
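The log showed an abnormal shutdown and a crash recovery on the instance, but the exact cause was still unclear. What was clear was that only one database was affected, so it could be restored by other means. Still, a DBA earns their keep by first trying to solve the problem at the lowest possible cost. I had previously handled several cases of data file corruption caused by hardware failure, which were recovered from other cluster instances and backups without much impact on the business. In my own testing I had also seen the files of a single database damaged badly, yet the data could still be recovered by dumping it out.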
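Tracing the problem back after it was solved, I found the following failure point in the log. (This log entry is fairly old. When an incident hits there may be little time for analysis: you need to locate the immediate problem quickly and fix it, and you may not come across this important evidence while doing so. The log is included only for reference and for tracing the cause of the failure.)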
2018-12-18T12:30:29.409505Z 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 3514141201
2018-12-18T12:30:29.409520Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 3514141210
2018-12-18T12:30:29.409677Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 3514141210
2018-12-18T12:30:29.409682Z 0 [Note] InnoDB: Database was not shutdown normally!
2018-12-18T12:30:29.409685Z 0 [Note] InnoDB: Starting crash recovery.
2018-12-18T12:30:30.026781Z 0 [ERROR] InnoDB: In file './xx_xxxxxx/xx_xxxxxx_fans_person.ibd', tablespace id and flags are 2760 and 33, but in the InnoDB data dictionary they are 974 and 33. Have you moved InnoDB .ibd files around without using the commands DISCARD TABLESPACE and IMPORT TABLESPACE? Please refer to http://dev.mysql.com/doc/refman/5.7/en/innodb-troubleshooting-datadict.html for how to resolve the issue.
2018-12-18T12:30:30.026819Z 0 [ERROR] InnoDB: Operating system error number 2 in a file operation.
2018-12-18T12:30:30.026823Z 0 [ERROR] InnoDB: The error means the system cannot find the path specified.
2018-12-18T12:30:30.026827Z 0 [ERROR] InnoDB: If you are installing InnoDB, remember that you must create directories yourself, InnoDB does not create them.
2018-12-18T12:30:30.026831Z 0 [ERROR] InnoDB: Could not find a valid tablespace file for `xx_xxxxxx/xx_xxxxxx_fans_person`. Please refer to http://dev.mysql.com/doc/refman/5.7/en/innodb-troubleshooting-datadict.html for how to resolve the issue.
2018-12-18T12:30:30.026841Z 0 [Warning] InnoDB: Ignoring tablespace `xx_xxxxxx/xx_xxxxxx_fans_person` because it could not be opened.
2018-12-18T12:30:32.199013Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2018-12-18T12:30:32.199035Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2018-12-18T12:30:32.199088Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2018-12-18T12:30:32.286423Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
[ERROR] [FATAL] InnoDB: Tablespace id is 974 in the data dictionary but in file ./xx_xxxxxx/xx_xxxxxx_fans_person.ibd it is 2760!
2018-12-18 20:30:29 0x7f014872b700 InnoDB: Assertion failure in thread 139643487172352 in file ut0ut.cc line 916
InnoDB: We intentionally generate a memory trap.
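The decisive lines in this log are the InnoDB [ERROR] entries that mention a tablespace. As a minimal sketch of pulling such lines out of a long error log quickly, the helper below filters a log file with grep. The function name find_tablespace_errors is hypothetical, and the demo runs on two lines copied from the log above rather than on a real log file.

```shell
# Print InnoDB [ERROR] lines that mention a tablespace.
# $1 = path to a MySQL error log file
find_tablespace_errors() {
  grep -E '\[ERROR\].*InnoDB:' "$1" | grep -i 'tablespace'
}

# Demo: write two lines from the log above to a temp file and filter them.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2018-12-18T12:30:29.409685Z 0 [Note] InnoDB: Starting crash recovery.
2018-12-18T12:30:30.026831Z 0 [ERROR] InnoDB: Could not find a valid tablespace file for `xx_xxxxxx/xx_xxxxxx_fans_person`.
EOF
find_tablespace_errors "$sample"
rm -f "$sample"
```

Only the second line of the demo is printed, because the first is a [Note] entry. Against a real instance the argument would be the path of the server's error log, which depends on the local configuration.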
III. Solving the Problem
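When corrupted data files make data inaccessible, the usual fix is to restore from backup, including backing up and restoring only the damaged part. The idea is sound, but it is not always easy in practice: sure enough, when I tried to dump the data before attempting a repair, the dump failed with "MySQL server has gone away". Good problems are worth sharing, because when a problem is broad and hard to pin down, the right direction is easy to overlook. Prompted by a reminder from friends, I noticed that the output of use mentioned the -A option, and found that with it you can enter a database without loading its metadata.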
What the -A option means
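When we open a database with use dbname, the mysql client pre-reads the database's table and column information for name completion. If the database is large, meaning it contains a great many tables, this pre-read is very slow and the client appears to hang. If the database contains few tables, the problem does not show up. The -A option (long form --no-auto-rehash) turns this pre-read off.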
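As a minimal sketch of starting the client with the pre-read disabled, the helper below prints the command line to use. The socket path comes from the error output at the start of this post. The user root and the function name mysql_no_rehash_cmd are assumptions for illustration.

```shell
# Print the mysql client command that opens a schema with auto-rehash off.
# -A is the short form of --no-auto-rehash (no table/column name pre-read).
# $1 = socket path, $2 = user (assumed), $3 = schema
mysql_no_rehash_cmd() {
  printf 'mysql -A -S %s -u %s -p %s\n' "$1" "$2" "$3"
}

mysql_no_rehash_cmd /tmp/mysql.sock root xx_xxx
```

Because -p is followed by a space, the client prompts for the password and treats xx_xxx as the database name.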
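Fortunately, with the pre-read skipped I could list all tables in the current database and view the system variable and status values. I then tried to repair the InnoDB and MyISAM tables in bulk. At this point the data should first be backed up with select ... into. Because this was a test environment with a redundant copy, I repaired without taking a backup. The commands below query all base tables and build the SQL statements. Running them did reveal a bad table that could not be repaired, which confirmed the information in the MySQL error log.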
# Repair MyISAM tables in bulk
select concat('repair table ',table_name,';') from information_schema.tables where table_schema = 'xx_xxxx' and table_type = 'BASE TABLE' and engine = 'MyISAM';
# Repair InnoDB tables in bulk (REPAIR TABLE does not support InnoDB, so OPTIMIZE TABLE is used to rebuild each table)
select concat('optimize table ',table_name,';') from information_schema.tables where table_schema = 'xx_xxxx' and table_type = 'BASE TABLE' and engine = 'InnoDB';
# Example of a generated statement
optimize table xx_xxxxxx_fans_person;
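The two generator queries differ only in the engine name and the statement verb, so one helper can produce both. The sketch below prints the query text and leaves execution to the caller. The function name gen_maintenance_query and the output file name in the usage comment are hypothetical.

```shell
# Print a query that generates one maintenance statement per base table.
# $1 = schema, $2 = engine (MyISAM or InnoDB), $3 = statement verb
gen_maintenance_query() {
  schema=$1
  engine=$2
  verb=$3
  cat <<EOF
SELECT CONCAT('${verb} ', table_name, ';')
FROM information_schema.tables
WHERE table_schema = '${schema}'
  AND table_type = 'BASE TABLE'
  AND engine = '${engine}';
EOF
}

gen_maintenance_query xx_xxxx InnoDB 'optimize table'

# Usage against a live server (not run here):
#   gen_maintenance_query xx_xxxx InnoDB 'optimize table' | mysql -A -N xx_xxxx > fix_tables.sql
#   mysql -A xx_xxxx < fix_tables.sql
```

In the usage comment, -N suppresses the column header so the output file contains only statements. Writing the statements to a file first gives a chance to review them before anything touches the tables.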
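The commands above identified the table whose repair result was not OK, and checking its row count confirmed that the data could no longer be exported. I dropped the bad table and recreated it (drop table may report that the table does not exist, and create table may fail with error 1068), imported the most recent data backup, and restarted the MySQL instance. The problem was solved and the problem database could be accessed normally.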
1. Drop the bad table xx_xxxxxx_fans_person
2. Recreate the table
mysql> CREATE TABLE `xx_xxxxxx_fans_person` (
    -> `person_id` int(20) NOT NULL AUTO_INCREMENT,
    -> `person_circle_id` int(20) NOT NULL,
    -> `person_user_id` int(20) NOT NULL,
    -> `person_time` datetime NOT NULL,
    -> `type` int(4) DEFAULT '1' COMMENT '1. group leader 2. member',
    -> `merchant_id` int(11) DEFAULT '0',
    -> `leave_type` int(11) DEFAULT '0' COMMENT 'leave status 0. not on leave 1. on leave',
    -> `leave_start_time` datetime DEFAULT NULL COMMENT 'leave start time',
    -> `leave_end_time` datetime DEFAULT NULL COMMENT 'leave end time',
    -> `is_invalid` int(10) DEFAULT '0' COMMENT 'invalid flag 0 valid 1 invalid',
    -> `invalid_id` int(10) DEFAULT '0' COMMENT 'ID of the related invalidation record',
    -> PRIMARY KEY (`person_id`),
    -> KEY `person_circle_id` (`person_circle_id`) USING BTREE,
    -> KEY `person_user_id` (`person_user_id`) USING BTREE
    -> ) ENGINE=InnoDB;
ERROR 1030 (HY000): Got error 168 from storage engine
3. Restart
mysql> select count(*) from xx_xxxxxx_fans_person;
ERROR 1812 (HY000): Tablespace is missing for table `xx_xxxxxx`.`xx_xxxxxx_fans_person`.
mysql> drop table xx_xxxxxx_fans_person;
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE TABLE `xx_xxxxxx_fans_person` (
    -> `person_id` int(20) NOT NULL AUTO_INCREMENT,
    -> `person_circle_id` int(20) NOT NULL,
    -> `person_user_id` int(20) NOT NULL,
    -> `person_time` datetime NOT NULL,
    -> `type` int(4) DEFAULT '1' COMMENT '1. group leader 2. member',
    -> `merchant_id` int(11) DEFAULT '0',
    -> `leave_type` int(11) DEFAULT '0' COMMENT 'leave status 0. not on leave 1. on leave',
    -> `leave_start_time` datetime DEFAULT NULL COMMENT 'leave start time',
    -> `leave_end_time` datetime DEFAULT NULL COMMENT 'leave end time',
    -> `is_invalid` int(10) DEFAULT '0' COMMENT 'invalid flag 0 valid 1 invalid',
    -> `invalid_id` int(10) DEFAULT '0' COMMENT 'ID of the related invalidation record',
    -> PRIMARY KEY (`person_id`),
    -> KEY `person_circle_id` (`person_circle_id`) USING BTREE,
    -> KEY `person_user_id` (`person_user_id`) USING BTREE
    -> ) ENGINE=InnoDB;
ERROR 1813 (HY000): Tablespace '`xx_xxxxxx`.`xx_xxxxxx_fans_person`' exists.
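The second CREATE TABLE in this session ends with error 1813. On MySQL 5.7 with file-per-table tablespaces, a common cause of that error is an .ibd file left in the schema directory after its table definition is gone. This post does not record whether that was the case here. As a hedged sketch under that assumption, the helper below lists .ibd files that have no matching .frm file. The function name list_orphan_ibd and the example path are hypothetical, and partitioned tables would be reported falsely because their .ibd files carry partition suffixes.

```shell
# List .ibd files in a schema directory that have no matching .frm file.
# Assumes the MySQL 5.7 on-disk layout (one .frm and one .ibd per table).
# $1 = schema directory under the MySQL data directory
list_orphan_ibd() {
  for ibd in "$1"/*.ibd; do
    # Skip the unexpanded glob when the directory has no .ibd files.
    [ -e "$ibd" ] || continue
    frm="${ibd%.ibd}.frm"
    [ -e "$frm" ] || printf '%s\n' "$ibd"
  done
}

# Usage (hypothetical data directory, not run here):
#   list_orphan_ibd /var/lib/mysql/xx_xxxxxx
```

The helper only reports files and changes nothing. Any file it lists should be copied elsewhere before it is moved or removed.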
IV. Summary
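1. Back up the database on a regular schedule to guard against data file corruption caused by hardware or other problems.
2. Analyze the problem first and rule out the basic impossibilities. Always read the logs, and pay attention to the hints printed with command errors, which may help with diagnosis or repair.
3. Try the -A option to skip loading database information when attempting table repair, and make a backup beforehand.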
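For the backups mentioned in points 1 and 3, one possible starting point is a dated logical dump per schema. The sketch below only prints the mysqldump command so that it can be reviewed or placed in a scheduled job. The function name dump_cmd and the directory /data/backup are hypothetical, and connection options are omitted.

```shell
# Print a mysqldump command for one schema with a dated output file.
# $1 = schema, $2 = backup directory (hypothetical), $3 = date stamp
dump_cmd() {
  printf 'mysqldump --single-transaction --routines --triggers %s > %s/%s_%s.sql\n' "$1" "$2" "$1" "$3"
}

dump_cmd xx_xxxx /data/backup "$(date +%Y%m%d)"
```

The --single-transaction option gives a consistent snapshot for InnoDB tables without locking them. MyISAM tables are not covered by that guarantee.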