GTID环境中手动修复主从故障一例(Error 1146)
Preface
In my last test of pt-heartbeat,both of master and slave were out of disk.And the mysql client was hang.In order to resolve the issue,I've tryed to fix the replicaiton environment without using mysqldump to reconfigure the slave.Let's see the details.
Procedure
I dropped test tables in database "sysbench" to release the disk space on master.
1 [root@zlm2 08:56:51 /data/mysql/mysql3306/logs] 2 #sysbench oltp_read_write.lua --mysql-host=192.168.1.101 --mysql-port=3306 --mysql-user=zlm --mysql-password=zlmzlm --mysql-db=sysbench --tables=1 --table-size=10000000 --mysql-storage-engine=innodb cleanup 3 sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2) 4 5 Dropping table 'sbtest1'... 6 7 (zlm@192.168.1.101 3306)[(none)]>use sysbench; 8 Reading table information for completion of table and column names 9 You can turn off this feature to get a quicker startup with -A 10 11 Database changed 12 (zlm@192.168.1.101 3306)[sysbench]>show tables; 13 +--------------------+ 14 | Tables_in_sysbench | 15 +--------------------+ 16 | hb | 17 | sbtest2 | 18 | sbtest3 | 19 | sbtest4 | 20 | sbtest5 | 21 +--------------------+ 22 5 rows in set (0.00 sec) 23 24 //Only sbtest1 was deleted.It's not enough. 25 26 [root@zlm2 08:59:04 ~/sysbench-1.0/src/lua] 27 #sysbench oltp_read_write.lua --mysql-host=192.168.1.101 --mysql-port=3306 --mysql-user=zlm --mysql-password=zlmzlm --mysql-db=sysbench --tables=5 --table-size=10000000 --mysql-storage-engine=innodb cleanup 28 sysbench 1.0.15 (using bundled LuaJIT 2.1.0-beta2) 29 30 Dropping table 'sbtest1'... 31 Dropping table 'sbtest2'... 32 Dropping table 'sbtest3'... 33 Dropping table 'sbtest4'... 34 Dropping table 'sbtest5'... 35 36 [root@zlm2 08:59:09 ~/sysbench-1.0/src/lua] 37 #df -h 38 Filesystem Size Used Avail Use% Mounted on 39 /dev/mapper/centos-root 8.4G 6.9G 1.5G 83% / //I'd got 27% free space. 40 devtmpfs 488M 0 488M 0% /dev 41 tmpfs 497M 0 497M 0% /dev/shm 42 tmpfs 497M 6.6M 491M 2% /run 43 tmpfs 497M 0 497M 0% /sys/fs/cgroup 44 /dev/sda1 497M 118M 379M 24% /boot 45 none 87G 80G 6.6G 93% /vagrant 46 47 (zlm@192.168.1.101 3306)[(none)]>drop database sysbench; 48 Query OK, 1 row affected (0.04 sec) 49 50 //Further more,I dropped the "sysbench".
The slave hung still and disk space was full.
1 [root@zlm3 08:55:17 ~] 2 #df -h 3 Filesystem Size Used Avail Use% Mounted on 4 /dev/mapper/centos-root 8.4G 8.4G 20K 100% / 5 devtmpfs 488M 0 488M 0% /dev 6 tmpfs 497M 0 497M 0% /dev/shm 7 tmpfs 497M 6.5M 491M 2% /run 8 tmpfs 497M 0 497M 0% /sys/fs/cgroup 9 /dev/sda1 497M 118M 379M 24% /boot 10 none 87G 80G 6.6G 93% /vagrant 11 12 (zlm@192.168.1.102 3306)[(none)]>show slave status\G 13 ^C^C -- query aborted 14 15 16 ^Z 17 [6]+ Stopped mysql 18 19 [root@zlm3 08:55:58 ~] 20 #pkill mysqld 21 22 [root@zlm3 08:56:04 ~] 23 #./mysqld.sh 24 25 [root@zlm3 08:56:07 ~] 26 #mysql 27 ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.1.102' (111) 28 29 [root@zlm3 09:00:35 ~] 30 #cd /data/mysql/mysql3306/data 31 32 [root@zlm3 09:00:46 /data/mysql/mysql3306/data] 33 #cat error.log |tail -n 30 34 2018-07-19T08:57:02.581937+01:00 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 17039436409 35 2018-07-19T08:57:02.581958+01:00 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 17039436418 36 2018-07-19T08:57:02.581963+01:00 0 [Note] InnoDB: Database was not shutdown normally! 37 2018-07-19T08:57:02.581965+01:00 0 [Note] InnoDB: Starting crash recovery. 38 2018-07-19T08:57:02.696292+01:00 0 [Note] InnoDB: Transaction 31700139 was in the XA prepared state. 39 2018-07-19T08:57:02.700688+01:00 0 [Note] InnoDB: Transaction 31700139 was in the XA prepared state. 40 2018-07-19T08:57:02.700814+01:00 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 0 row operations to undo 41 2018-07-19T08:57:02.700821+01:00 0 [Note] InnoDB: Trx id counter is 31700480 42 2018-07-19T08:57:02.701719+01:00 0 [Note] InnoDB: Last MySQL binlog file position 0 99139927, file name mysql-bin.000016 43 2018-07-19T08:57:02.805965+01:00 0 [Note] InnoDB: Ignoring tablespace `zlm`.`sbtest2` because the DISCARD flag is set . 44 2018-07-19T08:57:02.806462+01:00 0 [Note] InnoDB: Creating shared tablespace for temporary tables 45 2018-07-19T08:57:02.807316+01:00 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ... 46 2018-07-19T08:57:02.807568+01:00 0 [Note] InnoDB: Starting in background the rollback of uncommitted transactions 47 2018-07-19T08:57:02.807594+01:00 0 [Note] InnoDB: Rollback of non-prepared transactions completed 48 2018-07-19T08:57:02.871396+01:00 0 [Warning] InnoDB: Retry attempts for writing partial data failed. 49 2018-07-19T08:57:02.871423+01:00 0 [ERROR] InnoDB: Write to file ./ibtmp1failed at offset 11534336, 1048576 bytes should have been written, only 0 were written. Operating system error number 28. Check that your OS and file system support files of this size. Check also that the disk is not full or a disk quota exceeded. 50 2018-07-19T08:57:02.871441+01:00 0 [ERROR] InnoDB: Error number 28 means 'No space left on device' 51 2018-07-19T08:57:02.871446+01:00 0 [Note] InnoDB: Some operating system error numbers are described at http://dev.mysql.com/doc/refman/5.7/en/operating-system-error-codes.html 52 2018-07-19T08:57:02.871451+01:00 0 [ERROR] InnoDB: Could not set the file size of './ibtmp1'. Probably out of disk space 53 2018-07-19T08:57:02.871456+01:00 0 [ERROR] InnoDB: Unable to create the shared innodb_temporary 54 2018-07-19T08:57:02.871459+01:00 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error 55 2018-07-19T08:57:03.273011+01:00 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1" 56 2018-07-19T08:57:03.273029+01:00 0 [ERROR] Plugin 'InnoDB' init function returned error. 57 2018-07-19T08:57:03.273033+01:00 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed. 58 2018-07-19T08:57:03.273037+01:00 0 [ERROR] Failed to initialize builtin plugins. 59 2018-07-19T08:57:03.273040+01:00 0 [ERROR] Aborting 60 61 2018-07-19T08:57:03.273046+01:00 0 [Note] Binlog end 62 2018-07-19T08:57:03.273389+01:00 0 [Note] mysqld: Shutdown complete 63 64 //The mysqld process could not run again because of no free disk space.
I decided to drop all the binlogs on slave to release the disk space.
1 [root@zlm3 09:01:34 /data/mysql/mysql3306] 2 #cd logs 3 4 [root@zlm3 09:01:35 /data/mysql/mysql3306/logs] 5 #ls -l 6 total 2691436 7 -rw-r----- 1 mysql mysql 4333745 Jul 13 11:07 mysql-bin.000001 8 -rw-r----- 1 mysql mysql 653 Jul 16 09:28 mysql-bin.000002 9 -rw-r----- 1 mysql mysql 268435663 Jul 16 10:03 mysql-bin.000003 10 -rw-r----- 1 mysql mysql 268435798 Jul 16 10:22 mysql-bin.000004 11 -rw-r----- 1 mysql mysql 268435798 Jul 16 10:32 mysql-bin.000005 12 -rw-r----- 1 mysql mysql 268435798 Jul 16 10:48 mysql-bin.000006 13 -rw-r----- 1 mysql mysql 268435798 Jul 16 11:29 mysql-bin.000007 14 -rw-r----- 1 mysql mysql 35214498 Jul 16 11:59 mysql-bin.000008 15 -rw-r----- 1 mysql mysql 249 Jul 16 11:59 mysql-bin.000009 16 -rw-r----- 1 mysql mysql 199908536 Jul 18 11:27 mysql-bin.000010 17 -rw-r----- 1 mysql mysql 249 Jul 18 11:27 mysql-bin.000011 18 -rw-r----- 1 mysql mysql 268474318 Jul 19 04:20 mysql-bin.000012 19 -rw-r----- 1 mysql mysql 268817026 Jul 19 04:24 mysql-bin.000013 20 -rw-r----- 1 mysql mysql 268802986 Jul 19 04:28 mysql-bin.000014 21 -rw-r----- 1 mysql mysql 268826152 Jul 19 04:32 mysql-bin.000015 22 -rw-r----- 1 mysql mysql 99422208 Jul 19 08:56 mysql-bin.000016 23 -rw-r----- 1 mysql mysql 704 Jul 19 04:32 mysql-bin.index 24 25 [root@zlm3 09:01:36 /data/mysql/mysql3306/logs] 26 #rm -f * 27 28 [root@zlm3 09:01:38 /data/mysql/mysql3306/logs] 29 #ls -l 30 total 0 31 32 [root@zlm3 09:01:48 ~] 33 #df -h 34 Filesystem Size Used Avail Use% Mounted on 35 /dev/mapper/centos-root 8.4G 4.5G 4.0G 53% / //The free disk space had been reduced to 47%. 36 devtmpfs 488M 0 488M 0% /dev 37 tmpfs 497M 0 497M 0% /dev/shm 38 tmpfs 497M 6.5M 491M 2% /run 39 tmpfs 497M 0 497M 0% /sys/fs/cgroup 40 /dev/sda1 497M 118M 379M 24% /boot 41 none 87G 80G 6.6G 93% /vagrant
Ran the mysqld again and dropped the database "sysbench" on slave.
1 [root@zlm3 09:01:42 /data/mysql/mysql3306/logs] 2 #sh /root/mysqld.sh 3 4 [root@zlm3 09:01:51 /data/mysql/mysql3306/logs] 5 #ps aux|grep mysqld 6 mysql 4346 7.0 17.8 1109984 181576 pts/0 Sl 09:01 0:00 mysqld --defaults-file=/data/mysql/mysql3306/my.cnf 7 root 4380 0.0 0.0 112640 960 pts/0 R+ 09:01 0:00 grep --color=auto mysqld 8 9 [root@zlm3 09:01:55 /data/mysql/mysql3306/logs] 10 #mysql 11 Welcome to the MySQL monitor. Commands end with ; or \g. 12 Your MySQL connection id is 2 13 Server version: 5.7.21-log MySQL Community Server (GPL) 14 15 Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved. 16 17 Oracle is a registered trademark of Oracle Corporation and/or its 18 affiliates. Other names may be trademarks of their respective 19 owners. 20 21 Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. 22 23 (zlm@192.168.1.102 3306)[(none)]>drop database sysbench; 24 Query OK, 6 rows affected (0.11 sec)
Started the replication threads of slave.
1 (zlm@192.168.1.102 3306)[(none)]>start slave; 2 Query OK, 0 rows affected (0.00 sec) 3 4 (zlm@192.168.1.102 3306)[(none)]>show slave status\G 5 *************************** 1. row *************************** 6 Slave_IO_State: Waiting for master to send event 7 Master_Host: 192.168.1.101 8 Master_User: repl 9 Master_Port: 3306 10 Connect_Retry: 60 11 Master_Log_File: mysql-bin.000019 12 Read_Master_Log_Pos: 155765801 13 Relay_Log_File: relay-bin.000068 14 Relay_Log_Pos: 398 15 Relay_Master_Log_File: mysql-bin.000019 16 Slave_IO_Running: Yes 17 Slave_SQL_Running: No 18 Replicate_Do_DB: 19 Replicate_Ignore_DB: 20 Replicate_Do_Table: 21 Replicate_Ignore_Table: 22 Replicate_Wild_Do_Table: 23 Replicate_Wild_Ignore_Table: 24 Last_Errno: 1146 25 Last_Error: Error executing row event: 'Table 'sysbench.sbtest1' doesn't exist' 26 Skip_Counter: 0 27 Exec_Master_Log_Pos: 99148246 28 Relay_Log_Space: 56618146 29 Until_Condition: None 30 Until_Log_File: 31 Until_Log_Pos: 0 32 Master_SSL_Allowed: No 33 Master_SSL_CA_File: 34 Master_SSL_CA_Path: 35 Master_SSL_Cert: 36 Master_SSL_Cipher: 37 Master_SSL_Key: 38 Seconds_Behind_Master: NULL 39 Master_SSL_Verify_Server_Cert: No 40 Last_IO_Errno: 0 41 Last_IO_Error: 42 Last_SQL_Errno: 1146 43 Last_SQL_Error: Error executing row event: 'Table 'sysbench.sbtest1' doesn't exist' //Since the database had been droppted.This error was notable. 44 Replicate_Ignore_Server_Ids: 45 Master_Server_Id: 1013306 46 Master_UUID: 1b7181ee-6eaf-11e8-998e-080027de0e0e 47 Master_Info_File: mysql.slave_master_info 48 SQL_Delay: 0 49 SQL_Remaining_Delay: NULL 50 Slave_SQL_Running_State: 51 Master_Retry_Count: 86400 52 Master_Bind: 53 Last_IO_Error_Timestamp: 54 Last_SQL_Error_Timestamp: 180719 09:02:07 55 Master_SSL_Crl: 56 Master_SSL_Crlpath: 57 Retrieved_Gtid_Set: 1b7181ee-6eaf-11e8-998e-080027de0e0e:3714549-3730021 //It was stuck on transaction 3714549(which contained error). 58 Executed_Gtid_Set: 1b7181ee-6eaf-11e8-998e-080027de0e0e:1-3714548, 59 5c77c31b-4add-11e8-81e2-080027de0e0e:1-168 60 Auto_Position: 1 61 Replicate_Rewrite_DB: 62 Channel_Name: 63 Master_TLS_Version: 64 1 row in set (0.00 sec) 65 66 [root@zlm3 09:10:42 ~] 67 #perror 1146 68 MySQL error code 1146 (ER_NO_SUCH_TABLE): Table '%-.192s.%-.192s' doesn't exist 69 70 //Error 1146 indicated the absence of table "sbtest1" in "sysbench" database. 71 //Obviously,the slave was replaying the operations relevant to this table on master.The table even the database had been dropped. 72 //How could I do next step?Do I have to generate a new mysqldump file and reconfigure the slave again? 73 //There's One thing I'm rather sure that there were no other transactions generated in the whole course except the operations on "sysbench" database. 74 //Since I'd drop "sysbentch" database on both master and slave.Maybe I can fix the issue easily.
Checked the Executed_Gtid_Set on master.
1 (zlm@192.168.1.101 3306)[(none)]>show master status; 2 +------------------+-----------+--------------+------------------+------------------------------------------------+ 3 | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | 4 +------------------+-----------+--------------+------------------+------------------------------------------------+ 5 | mysql-bin.000019 | 155765801 | | | 1b7181ee-6eaf-11e8-998e-080027de0e0e:1-3730021 | 6 +------------------+-----------+--------------+------------------+------------------------------------------------+ 7 1 row in set (0.00 sec) 8 9 //The executed gtid was upto "3730021".
Tryed to fix the replica of master.
1 (zlm@192.168.1.102 3306)[(none)]>show slave status\G 2 *************************** 1. row *************************** 3 Slave_IO_State: Waiting for master to send event 4 Master_Host: 192.168.1.101 5 Master_User: repl 6 Master_Port: 3306 7 Connect_Retry: 60 8 Master_Log_File: mysql-bin.000019 9 Read_Master_Log_Pos: 155765801 10 Relay_Log_File: relay-bin.000068 11 Relay_Log_Pos: 398 12 Relay_Master_Log_File: mysql-bin.000019 13 Slave_IO_Running: Yes 14 Slave_SQL_Running: No 15 Replicate_Do_DB: 16 Replicate_Ignore_DB: 17 Replicate_Do_Table: 18 Replicate_Ignore_Table: 19 Replicate_Wild_Do_Table: 20 Replicate_Wild_Ignore_Table: 21 Last_Errno: 1146 22 Last_Error: Error executing row event: 'Table 'sysbench.sbtest1' doesn't exist' 23 Skip_Counter: 0 24 Exec_Master_Log_Pos: 99148246 25 Relay_Log_Space: 56618627 26 Until_Condition: None 27 Until_Log_File: 28 Until_Log_Pos: 0 29 Master_SSL_Allowed: No 30 Master_SSL_CA_File: 31 Master_SSL_CA_Path: 32 Master_SSL_Cert: 33 Master_SSL_Cipher: 34 Master_SSL_Key: 35 Seconds_Behind_Master: NULL 36 Master_SSL_Verify_Server_Cert: No 37 Last_IO_Errno: 0 38 Last_IO_Error: 39 Last_SQL_Errno: 1146 40 Last_SQL_Error: Error executing row event: 'Table 'sysbench.sbtest1' doesn't exist' 41 Replicate_Ignore_Server_Ids: 42 Master_Server_Id: 1013306 43 Master_UUID: 1b7181ee-6eaf-11e8-998e-080027de0e0e 44 Master_Info_File: mysql.slave_master_info 45 SQL_Delay: 0 46 SQL_Remaining_Delay: NULL 47 Slave_SQL_Running_State: 48 Master_Retry_Count: 86400 49 Master_Bind: 50 Last_IO_Error_Timestamp: 51 Last_SQL_Error_Timestamp: 180719 09:02:57 52 Master_SSL_Crl: 53 Master_SSL_Crlpath: 54 Retrieved_Gtid_Set: 1b7181ee-6eaf-11e8-998e-080027de0e0e:3714549-3730021 55 Executed_Gtid_Set: 56 Auto_Position: 1 57 Replicate_Rewrite_DB: 58 Channel_Name: 59 Master_TLS_Version: 60 1 row in set (0.00 sec) 61 62 (zlm@192.168.1.102 3306)[(none)]>reset master; 63 Query OK, 0 rows affected (0.02 sec) 64 65 (zlm@192.168.1.102 3306)[(none)]>set @@global.gtid_purged='1b7181ee-6eaf-11e8-998e-080027de0e0e:1-3730021'; 66 Query OK, 0 rows affected (0.00 sec) 67 68 //On account of surely knowing there were no other transactions at all.I set the "gtid_purged" variable to the value of "gtid_executed" on master. 69 //It means I guised that all the transactions generated on master had been replayed on slave already.The slave could retrieve new GTID at the moment. 70 71 (zlm@192.168.1.102 3306)[(none)]>start slave sql_thread; 72 Query OK, 0 rows affected (0.02 sec) 73 74 (zlm@192.168.1.102 3306)[(none)]>show slave status\G 75 *************************** 1. row *************************** 76 Slave_IO_State: Waiting for master to send event 77 Master_Host: 192.168.1.101 78 Master_User: repl 79 Master_Port: 3306 80 Connect_Retry: 60 81 Master_Log_File: mysql-bin.000019 82 Read_Master_Log_Pos: 155765801 83 Relay_Log_File: relay-bin.000069 84 Relay_Log_Pos: 438 85 Relay_Master_Log_File: mysql-bin.000019 86 Slave_IO_Running: Yes 87 Slave_SQL_Running: Yes //The sql_thread became "Yes". 88 Replicate_Do_DB: 89 Replicate_Ignore_DB: 90 Replicate_Do_Table: 91 Replicate_Ignore_Table: 92 Replicate_Wild_Do_Table: 93 Replicate_Wild_Ignore_Table: 94 Last_Errno: 0 95 Last_Error: 96 Skip_Counter: 0 97 Exec_Master_Log_Pos: 155765801 98 Relay_Log_Space: 56618434 99 Until_Condition: None 100 Until_Log_File: 101 Until_Log_Pos: 0 102 Master_SSL_Allowed: No 103 Master_SSL_CA_File: 104 Master_SSL_CA_Path: 105 Master_SSL_Cert: 106 Master_SSL_Cipher: 107 Master_SSL_Key: 108 Seconds_Behind_Master: 0 109 Master_SSL_Verify_Server_Cert: No 110 Last_IO_Errno: 0 111 Last_IO_Error: 112 Last_SQL_Errno: 0 113 Last_SQL_Error: 114 Replicate_Ignore_Server_Ids: 115 Master_Server_Id: 1013306 116 Master_UUID: 1b7181ee-6eaf-11e8-998e-080027de0e0e 117 Master_Info_File: mysql.slave_master_info 118 SQL_Delay: 0 119 SQL_Remaining_Delay: NULL 120 Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates 121 Master_Retry_Count: 86400 122 Master_Bind: 123 Last_IO_Error_Timestamp: 124 Last_SQL_Error_Timestamp: 125 Master_SSL_Crl: 126 Master_SSL_Crlpath: 127 Retrieved_Gtid_Set: 1b7181ee-6eaf-11e8-998e-080027de0e0e:3714549-3730021 128 Executed_Gtid_Set: 1b7181ee-6eaf-11e8-998e-080027de0e0e:1-3730021 //The slave had skipped those GTID(which contained error 1146) of master and waited for newer GTID.The replica had been fixed up. 129 Auto_Position: 1 130 Replicate_Rewrite_DB: 131 Channel_Name: 132 Master_TLS_Version: 133 1 row in set (0.00 sec)
Summary
- The variable "gtid_purged" cannot be set if "gtid_executed" is not empty.
- Caution,"reset master" can only be used on slave.Keep in mind that don't do it on master anytime.
- This case can be followed only in test environment 'cause you cannot guarantee whether all the transactions are really replayed on slave.
版权声明:本文为博主原创文章,如需转载请保留此声明及博客链接,谢谢!
博客地址: http://www.cnblogs.com/aaron8219 & http://blog.csdn.net/aaron8219
博客地址: http://www.cnblogs.com/aaron8219 & http://blog.csdn.net/aaron8219