What happens when two Redis instances SLAVEOF each other

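Today I was setting up Redis Sentinel to monitor Redis servers. While thinking through some scenarios, I wondered what would happen if two Redis instances were made slaves of each other with SLAVEOF. Here is my experiment: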

Two Redis instances: redis1 is configured as the master, and redis2 as its slave (slaveof redis1).

Start redis1 and redis2.

Once both are running and redis2 has successfully become a slave of redis1, connect to redis1 with redis-cli and make redis1 a slave of redis2:

slaveof [redis2 IP] [redis2 port]

The result: both instances log failed SYNC attempts over and over. Two Redis instances clearly cannot be slaves of each other.

redis1's log after running SLAVEOF:

[14793] 06 Sep 17:36:20.426 * SLAVE OF 10.18.129.49:9778 enabled (user request)
[14793] 06 Sep 17:36:20.636 - Accepted 10.18.129.49:44277
[14793] 06 Sep 17:36:20.637 - Client closed connection
[14793] 06 Sep 17:36:20.804 * Connecting to MASTER...
[14793] 06 Sep 17:36:20.804 * MASTER <-> SLAVE sync started
[14793] 06 Sep 17:36:20.804 * Non blocking connect for SYNC fired the event.
[14793] 06 Sep 17:36:20.804 * Master replied to PING, replication can continue...
[14793] 06 Sep 17:36:20.804 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
[14793] 06 Sep 17:36:21.636 - Accepted 10.18.129.49:44279
[14793] 06 Sep 17:36:21.637 - Client closed connection
[14793] 06 Sep 17:36:21.804 * Connecting to MASTER...
[14793] 06 Sep 17:36:21.804 * MASTER <-> SLAVE sync started
[14793] 06 Sep 17:36:21.804 * Non blocking connect for SYNC fired the event.
[14793] 06 Sep 17:36:21.804 * Master replied to PING, replication can continue...
[14793] 06 Sep 17:36:21.804 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
[14793] 06 Sep 17:36:22.636 - Accepted 10.18.129.49:44281
[14793] 06 Sep 17:36:22.637 - Client closed connection
[14793] 06 Sep 17:36:22.804 * Connecting to MASTER...
[14793] 06 Sep 17:36:22.804 * MASTER <-> SLAVE sync started
[14793] 06 Sep 17:36:22.804 * Non blocking connect for SYNC fired the event.
[14793] 06 Sep 17:36:22.804 * Master replied to PING, replication can continue..

redis2's log:

[14796] 06 Sep 17:36:20.426 - Client closed connection
[14796] 06 Sep 17:36:20.636 * Connecting to MASTER...
[14796] 06 Sep 17:36:20.636 * MASTER <-> SLAVE sync started
[14796] 06 Sep 17:36:20.636 * Non blocking connect for SYNC fired the event.
[14796] 06 Sep 17:36:20.636 * Master replied to PING, replication can continue...
[14796] 06 Sep 17:36:20.636 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
[14796] 06 Sep 17:36:20.804 - Accepted 10.18.129.49:51034
[14796] 06 Sep 17:36:20.805 - Client closed connection
[14796] 06 Sep 17:36:21.636 * Connecting to MASTER...
[14796] 06 Sep 17:36:21.636 * MASTER <-> SLAVE sync started
[14796] 06 Sep 17:36:21.636 * Non blocking connect for SYNC fired the event.
[14796] 06 Sep 17:36:21.636 * Master replied to PING, replication can continue...
[14796] 06 Sep 17:36:21.637 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
[14796] 06 Sep 17:36:21.804 - Accepted 10.18.129.49:51036
[14796] 06 Sep 17:36:21.805 - Client closed connection
[14796] 06 Sep 17:36:22.636 - DB 0: 20 keys (0 volatile) in 32 slots HT.
[14796] 06 Sep 17:36:22.636 - 0 clients connected (0 slaves), 801176 bytes in use
[14796] 06 Sep 17:36:22.636 * Connecting to MASTER...
[14796] 06 Sep 17:36:22.636 * MASTER <-> SLAVE sync started
[14796] 06 Sep 17:36:22.636 * Non blocking connect for SYNC fired the event.
[14796] 06 Sep 17:36:22.636 * Master replied to PING, replication can continue..

Both instances are now stuck in an endless loop of failed SYNCs.

My first question was this: why does redis2, the original slave, start issuing SYNC again?

The first line of redis2's log above shows that the original master-slave connection was closed.

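I looked at the replication source, replication.c. Below is the code that runs when redis1 executes SLAVEOF. Partway through, it calls disconnectSlaves(), and this breaks the existing master-slave connection: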

void slaveofCommand(redisClient *c) {
    if (!strcasecmp(c->argv[1]->ptr,"no") && !strcasecmp(c->argv[2]->ptr,"one")) {
        // omitted
    } else {
        // omitted
        /* There was no previous master or the user specified a different one,
         * we can continue. */
        sdsfree(server.masterhost);
        server.masterhost = sdsdup(c->argv[1]->ptr);
        server.masterport = port;
        if (server.master) freeClient(server.master);
        disconnectSlaves(); /* Force our slaves to resync with us as well. */
        cancelReplicationHandshake();
        server.repl_state = REDIS_REPL_CONNECT;
        redisLog(REDIS_NOTICE,"SLAVE OF %s:%d enabled (user request)",
            server.masterhost, server.masterport);
    }
    addReply(c,shared.ok);
}

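The comment next to disconnectSlaves() reads: Force our slaves to resync with us as well. In other words: "I (redis1) disconnect you (redis2) first. Once I have finished syncing with my own master, you come back and sync with me." This is why redis2 gets disconnected from redis1. And because redis2 started out as a slave, whenever its link to its master drops it keeps reconnecting and re-issuing SYNC until that succeeds.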
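To make the sequence concrete, here is a minimal toy model in C of the state changes described above. It is not Redis source: toy_node, toy_repl_state, toy_disconnect_slaves and toy_slaveof are hypothetical names, and the model only tracks which node each instance follows and whether that link is up. freeClient(server.master) and cancelReplicationHandshake() are folded into simply reassigning the master pointer. What it keeps are the three steps that matter here: follow the new master, drop our own slaves (disconnectSlaves()), and restart our own handshake (repl_state = REDIS_REPL_CONNECT).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified replication states, loosely modeled on
 * REDIS_REPL_NONE / REDIS_REPL_CONNECT / REDIS_REPL_CONNECTED. */
typedef enum { TOY_REPL_NONE, TOY_REPL_CONNECT, TOY_REPL_CONNECTED } toy_repl_state;

/* Hypothetical toy instance: who it follows and whether that link is up. */
typedef struct toy_node {
    const char *name;
    struct toy_node *master;   /* NULL means this node is a master */
    toy_repl_state repl_state; /* state of the link to `master` */
} toy_node;

/* Counterpart of disconnectSlaves(): every node following `self` loses its
 * link and falls back to CONNECT, so it will retry SYNC later. */
void toy_disconnect_slaves(const toy_node *self, toy_node **all, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (all[i]->master == self)
            all[i]->repl_state = TOY_REPL_CONNECT;
}

/* Toy "SLAVEOF host port": follow the new master, force our own slaves
 * to resync, and restart our own handshake from CONNECT. */
void toy_slaveof(toy_node *self, toy_node *new_master, toy_node **all, size_t n) {
    assert(self != new_master); /* a node cannot follow itself */
    self->master = new_master;
    toy_disconnect_slaves(self, all, n);
    self->repl_state = TOY_REPL_CONNECT;
}
```

Applied to the experiment: start with redis2 following redis1 over a live link, then call toy_slaveof with redis1 as self and redis2 as the new master. Both nodes end up in TOY_REPL_CONNECT, each pointing at the other. In that state both logs start printing "Connecting to MASTER...".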

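That explains why redis2 issues SYNC again. The second question is why the SYNC attempts on both sides keep failing, and the reason is closely related. Both logs report the same error: ERR Can't SYNC while not connected with my master. That message comes from this code: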

void syncCommand(redisClient *c) {
    /* ignore SYNC if already slave or in monitor mode */
    if (c->flags & REDIS_SLAVE) return;

    /* Refuse SYNC requests if we are a slave but the link with our master
     * is not ok... */
    if (server.masterhost && server.repl_state != REDIS_REPL_CONNECTED) {
        addReplyError(c,"Can't SYNC while not connected with my master");
        return;
    }

    /* SYNC can't be issued when the server has pending data to send to
     * the client about already issued commands. We need a fresh reply
     * buffer registering the differences between the BGSAVE and the current
     * dataset, so that we can copy to other slaves if needed. */
    if (listLength(c->reply) != 0) {
        addReplyError(c,"SYNC is invalid with pending input");
        return;
    }
    // omitted
}

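syncCommand is what a Redis instance runs, in its role as master, when a slave sends it SYNC. Note the comment "Refuse SYNC requests if we are a slave but the link with our master is not ok...". When redis1, as master, receives SYNC from its slave, it runs syncCommand and reaches the check if (server.masterhost && server.repl_state != REDIS_REPL_CONNECTED). redis1 has just been made a slave of another master (redis2) but has not finished syncing with it. To finish, redis1 must send SYNC to redis2 and have it succeed. However, redis2 can only accept redis1's SYNC after redis2's own SYNC to redis1 has succeeded. That is a deadlock. So the condition is true, and redis1 immediately replies with the error Can't SYNC while not connected with my master. redis2 is in exactly the same situation, so both sides stay stuck at Can't SYNC while not connected with my master.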
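The deadlock can be reproduced in a small C simulation. This is again a toy model with hypothetical names (toy_node, toy_sync_command, toy_tick, toy_run), not Redis source. It keeps only the first guard of syncCommand and the once-per-second retry visible in the logs. It collapses the intermediate handshake states into a single CONNECT state and ignores the actual data transfer. It shows why the retry can never succeed: a node accepts SYNC only if it has no master or its own link is already up, and in a two-node cycle each node's link waits on the other's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified replication states (not the real Redis enum). */
typedef enum { TOY_REPL_NONE, TOY_REPL_CONNECT, TOY_REPL_CONNECTED } toy_repl_state;

typedef struct toy_node {
    const char *name;
    struct toy_node *master;   /* NULL means this node is a master */
    toy_repl_state repl_state; /* state of the link to `master` */
} toy_node;

/* Master side, mirroring the first guard in syncCommand(): a node that is
 * itself a slave refuses SYNC unless its own link is up.
 * Returns NULL when SYNC is accepted, otherwise the error text. */
const char *toy_sync_command(const toy_node *me) {
    if (me->master != NULL && me->repl_state != TOY_REPL_CONNECTED)
        return "ERR Can't SYNC while not connected with my master";
    return NULL;
}

/* One retry round: every slave still in CONNECT sends SYNC to its master
 * and becomes CONNECTED only if the master accepts. Nodes are processed
 * in array order. */
void toy_tick(toy_node **nodes, size_t n) {
    for (size_t i = 0; i < n; i++) {
        toy_node *s = nodes[i];
        if (s->master != NULL && s->repl_state == TOY_REPL_CONNECT &&
            toy_sync_command(s->master) == NULL)
            s->repl_state = TOY_REPL_CONNECTED;
    }
}

/* True when every node that has a master has a live link to it. */
bool toy_all_linked(toy_node **nodes, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (nodes[i]->master != NULL && nodes[i]->repl_state != TOY_REPL_CONNECTED)
            return false;
    return true;
}

/* Retry for up to max_ticks rounds. Returns how many rounds it took for
 * every link to come up, or -1 if they never did. */
int toy_run(toy_node **nodes, size_t n, int max_ticks) {
    assert(max_ticks >= 0);
    for (int t = 0; t < max_ticks; t++) {
        if (toy_all_linked(nodes, n))
            return t;
        toy_tick(nodes, n);
    }
    return toy_all_linked(nodes, n) ? max_ticks : -1;
}
```

When redis2 follows a genuine master redis1, toy_run returns 1 because the link comes up on the first retry. When redis1 and redis2 follow each other, it returns -1 no matter how large max_ticks is. A longer chain (A ← B ← C) still converges because the node at the top has no master and accepts SYNC unconditionally. A cycle has no such node, so nothing ever unblocks.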

Comments welcome!

posted @ 2013-09-06 18:38 Shaopeng
