What happens when two Redis instances SLAVEOF each other

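Today I was configuring Redis Sentinel to monitor my Redis servers, and partway through a thought occurred to me: what happens if two Redis instances are each made a slave of the other? Here is my experiment: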

Two Redis instances: redis1 is configured as the master, and redis2 is configured as its slave with slaveof redis1.

Start redis1 and redis2.

Once both are up and redis2 has successfully become a slave of redis1, connect to redis1 with redis-cli and run the following command to make redis1 a slave of redis2:

slaveof [redis2 IP]  [redis2 port] 

The result: both instances keep logging, over and over, that the SYNC command failed. Clearly, two Redis instances cannot be slaves of each other.

Log from redis1 after running slaveof:

[14793] 06 Sep 17:36:20.426 * SLAVE OF 10.18.129.49:9778 enabled (user request)
[14793] 06 Sep 17:36:20.636 - Accepted 10.18.129.49:44277
[14793] 06 Sep 17:36:20.637 - Client closed connection
[14793] 06 Sep 17:36:20.804 * Connecting to MASTER...
[14793] 06 Sep 17:36:20.804 * MASTER <-> SLAVE sync started
[14793] 06 Sep 17:36:20.804 * Non blocking connect for SYNC fired the event.
[14793] 06 Sep 17:36:20.804 * Master replied to PING, replication can continue...
[14793] 06 Sep 17:36:20.804 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
[14793] 06 Sep 17:36:21.636 - Accepted 10.18.129.49:44279
[14793] 06 Sep 17:36:21.637 - Client closed connection
[14793] 06 Sep 17:36:21.804 * Connecting to MASTER...
[14793] 06 Sep 17:36:21.804 * MASTER <-> SLAVE sync started
[14793] 06 Sep 17:36:21.804 * Non blocking connect for SYNC fired the event.
[14793] 06 Sep 17:36:21.804 * Master replied to PING, replication can continue...
[14793] 06 Sep 17:36:21.804 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
[14793] 06 Sep 17:36:22.636 - Accepted 10.18.129.49:44281
[14793] 06 Sep 17:36:22.637 - Client closed connection
[14793] 06 Sep 17:36:22.804 * Connecting to MASTER...
[14793] 06 Sep 17:36:22.804 * MASTER <-> SLAVE sync started
[14793] 06 Sep 17:36:22.804 * Non blocking connect for SYNC fired the event.
[14793] 06 Sep 17:36:22.804 * Master replied to PING, replication can continue..        

Log from redis2:

[14796] 06 Sep 17:36:20.426 - Client closed connection
[14796] 06 Sep 17:36:20.636 * Connecting to MASTER...
[14796] 06 Sep 17:36:20.636 * MASTER <-> SLAVE sync started
[14796] 06 Sep 17:36:20.636 * Non blocking connect for SYNC fired the event.
[14796] 06 Sep 17:36:20.636 * Master replied to PING, replication can continue...
[14796] 06 Sep 17:36:20.636 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
[14796] 06 Sep 17:36:20.804 - Accepted 10.18.129.49:51034
[14796] 06 Sep 17:36:20.805 - Client closed connection
[14796] 06 Sep 17:36:21.636 * Connecting to MASTER...
[14796] 06 Sep 17:36:21.636 * MASTER <-> SLAVE sync started
[14796] 06 Sep 17:36:21.636 * Non blocking connect for SYNC fired the event.
[14796] 06 Sep 17:36:21.636 * Master replied to PING, replication can continue...
[14796] 06 Sep 17:36:21.637 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
[14796] 06 Sep 17:36:21.804 - Accepted 10.18.129.49:51036
[14796] 06 Sep 17:36:21.805 - Client closed connection
[14796] 06 Sep 17:36:22.636 - DB 0: 20 keys (0 volatile) in 32 slots HT.
[14796] 06 Sep 17:36:22.636 - 0 clients connected (0 slaves), 801176 bytes in use
[14796] 06 Sep 17:36:22.636 * Connecting to MASTER...
[14796] 06 Sep 17:36:22.636 * MASTER <-> SLAVE sync started
[14796] 06 Sep 17:36:22.636 * Non blocking connect for SYNC fired the event.
[14796] 06 Sep 17:36:22.636 * Master replied to PING, replication can continue..

Both instances are now stuck in an endless loop of failed SYNC attempts.

My first question: why does redis2, the original slave, issue the SYNC command again?

The first line of the redis2 log above shows that the original master-slave connection was closed.

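I read replication.c, the source file that implements replication setup. Below is the code redis1 runs for the slaveof command. Partway through, it calls disconnectSlaves(), which drops the existing master-slave connection: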

void slaveofCommand(redisClient *c) {
    if (!strcasecmp(c->argv[1]->ptr,"no") && !strcasecmp(c->argv[2]->ptr,"one")) {
        // omitted
    } else {
        // omitted
        /* There was no previous master or the user specified a different one,
         * we can continue. */
        sdsfree(server.masterhost);
        server.masterhost = sdsdup(c->argv[1]->ptr);
        server.masterport = port;
        if (server.master) freeClient(server.master);
        disconnectSlaves(); /* Force our slaves to resync with us as well. */
        cancelReplicationHandshake();
        server.repl_state = REDIS_REPL_CONNECT;
        redisLog(REDIS_NOTICE,"SLAVE OF %s:%d enabled (user request)",
            server.masterhost, server.masterport);
    }
    addReply(c,shared.ok);
}

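The comment next to disconnectSlaves() reads: Force our slaves to resync with us as well. Roughly, it means: I (redis1) disconnect you (redis2) first, and once I have finished syncing from my own master, you can come back and sync from me. So redis2 gets disconnected from redis1, and because redis2 was a slave to begin with, it keeps trying to reconnect to its master and issue SYNC until it succeeds.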
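
To make that state change concrete, here is a minimal toy model in C. It is only a sketch of the behaviour described above, not Redis code: the names toy_state, toy_node and toy_slaveof are hypothetical, and the real server tracks far more state than one pointer and one enum.

```c
#include <assert.h>
#include <stddef.h>

/* Toy replication states, loosely mirroring REDIS_REPL_CONNECT and
 * REDIS_REPL_CONNECTED. Hypothetical names, not taken from Redis. */
enum toy_state { TOY_NONE, TOY_CONNECT, TOY_CONNECTED };

struct toy_node {
    struct toy_node *master;  /* NULL while the node is a plain master */
    enum toy_state state;     /* state of the link to `master` */
};

/* Model of running "SLAVEOF host port" on `self`: point it at
 * `new_master`, then drop every replica currently attached to `self`
 * so that each one has to reconnect and SYNC again. */
void toy_slaveof(struct toy_node *self, struct toy_node *new_master,
                 struct toy_node **replicas, size_t n)
{
    assert(self != NULL && new_master != NULL);
    self->master = new_master;
    self->state = TOY_CONNECT;            /* like repl_state = REDIS_REPL_CONNECT */
    for (size_t i = 0; i < n; i++) {
        if (replicas[i]->master == self)  /* like disconnectSlaves() */
            replicas[i]->state = TOY_CONNECT;
    }
}
```

Calling toy_slaveof(&r1, &r2, ...) with r2 already attached to r1 leaves both nodes in TOY_CONNECT: each one now has a master and neither link is up, which is the situation visible in the two logs.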

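That explains why redis2 also issues SYNC. The second question is why SYNC keeps failing on both instances, and the cause is closely related to the first. Both logs show the same error: ERR Can't SYNC while not connected with my master. In the source, this message comes from: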

void syncCommand(redisClient *c) {
    /* ignore SYNC if already slave or in monitor mode */
    if (c->flags & REDIS_SLAVE) return;

    /* Refuse SYNC requests if we are a slave but the link with our master
     * is not ok... */
    if (server.masterhost && server.repl_state != REDIS_REPL_CONNECTED) {
        addReplyError(c,"Can't SYNC while not connected with my master");
        return;
    }

    /* SYNC can't be issued when the server has pending data to send to
     * the client about already issued commands. We need a fresh reply
     * buffer registering the differences between the BGSAVE and the current
     * dataset, so that we can copy to other slaves if needed. */
    if (listLength(c->reply) != 0) {
        addReplyError(c,"SYNC is invalid with pending input");
        return;
    }
    // omitted
}

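syncCommand is the handler a Redis instance runs when, acting as a master, it receives a SYNC command from a slave. Note the comment above: "Refuse SYNC requests if we are a slave but the link with our master is not ok...". When redis1 receives SYNC from redis2, it runs syncCommand and reaches the check if (server.masterhost && server.repl_state != REDIS_REPL_CONNECTED). At that point redis1 has just been made a slave of another master (redis2) but has not finished syncing. To finish, redis1 must send SYNC to redis2 and get a successful reply, yet redis2 can only serve that SYNC once its own SYNC to redis1 has succeeded. That is a deadlock. The condition is therefore true, and redis1 replies with the error Can't SYNC while not connected with my master. The same holds for redis2, so both sides stay stuck in the Can't SYNC while not connected with my master state.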
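
The circular wait can be reproduced with a small simulation. This is a sketch under simplifying assumptions, not the Redis implementation: sim_node, sim_accepts_sync, sim_try_sync and sim_run are hypothetical names, a successful SYNC is assumed to bring the link up instantly, and only the one check quoted from syncCommand is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sim_state { SIM_CONNECT, SIM_CONNECTED };

struct sim_node {
    struct sim_node *master;  /* NULL when the node is a plain master */
    enum sim_state state;     /* state of the link to `master` */
};

/* Mirrors the check in syncCommand(): refuse SYNC when we are a slave
 * ourselves and the link to our own master is not up yet. */
bool sim_accepts_sync(const struct sim_node *n)
{
    return !(n->master != NULL && n->state != SIM_CONNECTED);
}

/* One retry: `slave` asks its master for a SYNC. Returns true if the
 * master accepted; in this toy model the link is then up at once. */
bool sim_try_sync(struct sim_node *slave)
{
    assert(slave != NULL && slave->master != NULL);
    if (!sim_accepts_sync(slave->master))
        return false;  /* "Can't SYNC while not connected with my master" */
    slave->state = SIM_CONNECTED;
    return true;
}

/* True when the node is a slave whose link is still down. */
static bool sim_needs_sync(const struct sim_node *n)
{
    return n->master != NULL && n->state != SIM_CONNECTED;
}

/* Let both nodes retry for `rounds` rounds and count successful SYNCs. */
int sim_run(struct sim_node *a, struct sim_node *b, int rounds)
{
    int ok = 0;
    for (int i = 0; i < rounds; i++) {
        if (sim_needs_sync(a) && sim_try_sync(a)) ok++;
        if (sim_needs_sync(b) && sim_try_sync(b)) ok++;
    }
    return ok;
}
```

With two nodes that name each other as master and both start in SIM_CONNECT, sim_run counts zero successful SYNCs however many rounds it is given, because each side refuses while waiting on the other. With an ordinary master and one slave, the first attempt succeeds.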

Comments are welcome!

posted @ 2013-09-06 18:38  Shaopeng