An In-Depth Look at Redis Master-Slave Replication

 
[The internal protocol and mechanisms of Redis master-slave replication]
 

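1. Master-Slave Overview

Redis supports a master-slave mode: a Redis server can be configured as the master (or the slave) of another Redis server, and a slave replicates data from its master. A slave can itself be the master of another Redis server, so the master-slave topology looks like a directed acyclic graph (DAG). The result is a cluster of Redis servers in which every node, master or slave, is a Redis server and can serve requests.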


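Once configured, the master can serve both reads and writes, while slaves serve reads only. Redis provides this setup to support weak consistency, that is, eventual consistency. Whether to choose strong or weak consistency depends on the business requirements: a service like Weibo can live with a weak consistency model, whereas a service like Taobao may choose a strong consistency model.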

 

The implementation of Redis master-slave replication lives mainly in replication.c.

 

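This article contains a lot of code, which I have trimmed as far as possible while still showing the essence of the problem. To stay close to the original source and let readers see its own comments, the articles in this Redis series keep the English comments from the code and add my annotations alongside them.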

 

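2. The Replication Backlog

The article "An In-Depth Look at the Redis AOF Persistence Strategy" introduced the concept of an update buffer. For example, when a client sends the command set name Jhon, the update is recorded as *3\r\n$3\r\nSET\r\n$4\r\nname\r\n$4\r\nJhon\r\n and stored in the update buffer.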
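
The encoding of an update record can be sketched in a few lines of C. This is a minimal illustration, not Redis code: resp_encode is a hypothetical helper name, and it assumes NUL-terminated arguments, whereas Redis itself works on binary-safe strings.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Redis): encode argc NUL-terminated
 * arguments as a multi-bulk request into buf.
 * Returns the number of bytes written, or -1 if buf is too small. */
int resp_encode(char *buf, size_t cap, int argc, const char **argv) {
    size_t used;
    int n = snprintf(buf, cap, "*%d\r\n", argc);      /* argument count */
    if (n < 0 || (size_t)n >= cap) return -1;
    used = (size_t)n;
    for (int j = 0; j < argc; j++) {
        /* Each argument: $<length>\r\n<bytes>\r\n */
        n = snprintf(buf + used, cap - used, "$%zu\r\n%s\r\n",
                     strlen(argv[j]), argv[j]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Calling resp_encode with the three arguments SET, name and Jhon produces the 33-byte record quoted above.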

 

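Replication has an update buffer as well. The two serve different purposes: the former is written to local disk, the latter is sent to slaves. Here we call the latter the backlog.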

 

The backlog is stored in server.repl_backlog. Redis treats it as a circular buffer, which saves space and avoids memory reallocation.

struct redisServer {
    /* Replication (master) */
    // Database most recently SELECTed in the replication stream
    int slaveseldb;                 /* Last SELECTed DB in replication output */

    // Global replication offset
    long long master_repl_offset;   /* Global replication offset */

    // Heartbeat (PING) period of the master-slave link
    int repl_ping_slave_period;     /* Master pings the slave every N seconds */

    // Pointer to the backlog
    char *repl_backlog;             /* Replication backlog for partial syncs */

    // Size of the backlog
    long long repl_backlog_size;    /* Backlog circular buffer size */

    // Amount of valid data currently held in the backlog
    long long repl_backlog_histlen; /* Backlog actual data length */

    // Position where the next write into the backlog starts
    long long repl_backlog_idx;     /* Backlog circular buffer current offset */

    // Replication offset of the first byte in the backlog (a global value)
    long long repl_backlog_off;     /* Replication offset of first byte
                                       in the backlog buffer. */

    // How long the backlog is kept once no slaves are connected
    time_t repl_backlog_time_limit; /* Time without slaves after the backlog gets released. */
};
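
How these fields cooperate when data is appended can be shown with a small stand-alone model. This is a sketch under assumptions: backlog_t and backlog_feed are hypothetical names, the fields mirror the ones listed above, and the bookkeeping follows my reading of how a circular backlog has to behave rather than quoting the Redis source.

```c
#include <string.h>

/* Hypothetical stand-alone model of the backlog fields listed above. */
typedef struct {
    char *buf;            /* repl_backlog */
    long long size;       /* repl_backlog_size */
    long long histlen;    /* repl_backlog_histlen */
    long long idx;        /* repl_backlog_idx */
    long long off;        /* repl_backlog_off */
    long long master_off; /* master_repl_offset */
} backlog_t;

/* Append len bytes, wrapping around at the end of the buffer. */
void backlog_feed(backlog_t *b, const void *ptr, size_t len) {
    const char *p = ptr;
    b->master_off += (long long)len;
    while (len > 0) {
        size_t chunk = (size_t)(b->size - b->idx);  /* room before the wrap */
        if (chunk > len) chunk = len;
        memcpy(b->buf + b->idx, p, chunk);
        b->idx += (long long)chunk;
        if (b->idx == b->size) b->idx = 0;          /* wrap to the start */
        b->histlen += (long long)chunk;
        p += chunk;
        len -= chunk;
    }
    /* Once the buffer is full, older bytes have been overwritten. */
    if (b->histlen > b->size) b->histlen = b->size;
    /* Offset of the oldest byte still available. */
    b->off = b->master_off - b->histlen + 1;
}
```

With an 8-byte buffer, feeding "abcdef" and then "ghij" leaves the buffer holding "ijcdefgh": the write position is 2, 8 bytes are valid, and the oldest available byte ("c") has replication offset 3.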

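When are change records written to the backlog? Whenever a Redis command that modifies data (a write) is executed, its change record is propagated. In the Redis source this happens along the path call()->propagate()->replicationFeedSlaves().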

 

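Note: commands are actually executed in call(). If call() finds that data was modified (dirty), it calls propagate(), and replicationFeedSlaves() writes the change record to the backlog and to all connected slaves.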

 

A question may arise here: why add the data to the backlog and also send it to every slave? Why not simply send it to the slaves?

 

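Because some slaves get disconnected from the master for one reason or another (a network failure, for example) and therefore miss updates while they are away. Note that a slave caches the master's state before the link drops. So that a disconnected slave can catch up after it reconnects, Redis also stores updates in the backlog. As the implementation of replicationFeedSlaves() shows, online slaves receive each update record immediately, while a slave that was temporarily disconnected has to recover the records it missed from the backlog. If the disconnection lasted long enough, the master refuses the slave's partial resynchronization request and the slave has no choice but to perform a full resynchronization.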
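
Recovering missed records from a circular backlog can be sketched as follows. This is an illustration only: backlog_t and backlog_read_from are hypothetical names, the fields mirror repl_backlog, repl_backlog_size, repl_backlog_histlen, repl_backlog_idx and repl_backlog_off, and the boundary conditions are my assumption of what a consistent implementation needs.

```c
#include <string.h>

/* Hypothetical read-only view of the backlog fields. */
typedef struct {
    const char *buf;    /* repl_backlog */
    long long size;     /* repl_backlog_size */
    long long histlen;  /* repl_backlog_histlen */
    long long idx;      /* repl_backlog_idx */
    long long off;      /* repl_backlog_off */
} backlog_t;

/* Copy everything from replication offset `offset` up to the newest byte
 * into out. Returns the number of bytes copied, or -1 when the offset is
 * outside the backlog window (a full resynchronization is needed) or out
 * is too small. */
long long backlog_read_from(const backlog_t *b, long long offset,
                            char *out, long long cap) {
    if (offset < b->off || offset > b->off + b->histlen) return -1;
    long long skip = offset - b->off;       /* bytes the slave already has */
    long long len = b->histlen - skip;      /* bytes the slave is missing */
    if (len > cap) return -1;
    /* Index of the oldest byte, then advance past the skipped part. */
    long long j = (b->idx + (b->size - b->histlen)) % b->size;
    j = (j + skip) % b->size;
    long long copied = 0;
    while (copied < len) {
        long long chunk = b->size - j;      /* bytes until the wrap point */
        if (chunk > len - copied) chunk = len - copied;
        memcpy(out + copied, b->buf + j, (size_t)chunk);
        copied += chunk;
        j = 0;                              /* continue from the start */
    }
    return len;
}
```

A request that falls before the oldest retained byte returns -1, which corresponds to the case where the slave stayed away too long and the records it needs have already been overwritten.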

 

The annotated source follows:

// call() is the core function of command execution; this is where commands actually run
/* Call() is the core of Redis execution of a command */
void call(redisClient *c, int flags) {
    /* ... */
    /* Call the command. */
    c->flags &= ~(REDIS_FORCE_AOF | REDIS_FORCE_REPL);
    redisOpArrayInit(&server.also_propagate);

    // Snapshot the dirty counter to detect whether the command modifies data
    dirty = server.dirty;

    // Run the function that implements the command
    c->cmd->proc(c);

    dirty = server.dirty - dirty;
    duration = ustime() - start;

    /* ... */

    // Propagate the change requested by the client to the AOF and to the slaves
    /* Propagate the command into the AOF and replication link */
    if (flags & REDIS_CALL_PROPAGATE) {
        int flags = REDIS_PROPAGATE_NONE;

        // Replication forced
        if (c->flags & REDIS_FORCE_REPL) flags |= REDIS_PROPAGATE_REPL;

        // AOF persistence forced
        if (c->flags & REDIS_FORCE_AOF) flags |= REDIS_PROPAGATE_AOF;

        // Data was modified
        if (dirty)
            flags |= (REDIS_PROPAGATE_REPL | REDIS_PROPAGATE_AOF);

        // Propagate the change record
        if (flags != REDIS_PROPAGATE_NONE)
            propagate(c->cmd, c->db->id, c->argv, c->argc, flags);
    }
    /* ... */
}
                                                        
// Publish a data update to the AOF and to the slaves
/* Propagate the specified command (in the context of the specified database id)
 * to AOF and Slaves.
 *
 * flags are an xor between:
 * + REDIS_PROPAGATE_NONE (no propagation of command at all)
 * + REDIS_PROPAGATE_AOF (propagate into the AOF file if is enabled)
 * + REDIS_PROPAGATE_REPL (propagate into the replication link)
 */
void propagate(struct redisCommand *cmd, int dbid, robj **argv, int argc,
               int flags)
{
    // AOF must be enabled and the AOF flag set: write the update to the local file
    if (server.aof_state != REDIS_AOF_OFF && flags & REDIS_PROPAGATE_AOF)
        feedAppendOnlyFile(cmd, dbid, argv, argc);

    // Replication flag set: send the update to the slaves
    if (flags & REDIS_PROPAGATE_REPL)
        replicationFeedSlaves(server.slaves,dbid,argv,argc);
}
                                                        
// Send data to the backlog and to the slaves
void replicationFeedSlaves(list *slaves, int dictid, robj **argv, int argc) {
    listNode *ln;
    listIter li;
    int j, len;
    char llstr[REDIS_LONGSTR_SIZE];

    // No backlog and no slaves: return right away
    /* If there aren't slaves, and there is no backlog buffer to populate,
     * we can return ASAP. */
    if (server.repl_backlog == NULL && listLength(slaves) == 0) return;

    /* We can't have slaves attached and no backlog. */
    redisAssert(!(listLength(slaves) != 0 && server.repl_backlog == NULL));

    /* Send SELECT command to every slave if needed. */
    if (server.slaveseldb != dictid) {
        robj *selectcmd;

        // Database ids below REDIS_SHARED_SELECT_CMDS can use a shared object
        /* For a few DBs we have pre-computed SELECT command. */
        if (dictid >= 0 && dictid < REDIS_SHARED_SELECT_CMDS) {
            selectcmd = shared.select[dictid];
        } else {
            // No shared object available: build a redis object for the SELECT command
            int dictid_len;

            dictid_len = ll2string(llstr, sizeof(llstr), dictid);
            selectcmd = createObject(REDIS_STRING,
                sdscatprintf(sdsempty(),
                "*2\r\n$6\r\nSELECT\r\n$%d\r\n%s\r\n",
                dictid_len, llstr));
        }

        // Why add the data to the backlog and also send it to every slave,
        // instead of only sending it to the slaves?
        // Some slaves get disconnected from the master for one reason or
        // another and miss updates while they are away (a slave caches the
        // master's state before the link drops). So that such a slave can
        // catch up after reconnecting, redis also stores updates in the backlog.

        // Add the redis object for the SELECT command to the backlog
        /* Add the SELECT command into the backlog. */
        if (server.repl_backlog) feedReplicationBacklogWithObject(selectcmd);

        // Send it to all slaves
        /* Send it to slaves. */
        listRewind(slaves, &li);
        while((ln = listNext(&li))) {
            redisClient *slave = ln->value;
            addReply(slave, selectcmd);
        }

        // Release the object if it was not a shared one
        if (dictid < 0 || dictid >= REDIS_SHARED_SELECT_CMDS)
            decrRefCount(selectcmd);
    }
                                                        
    // Remember the database most recently SELECTed
    server.slaveseldb = dictid;

    // Write the command to the backlog
    /* Write the command to the replication backlog if any. */
    if (server.repl_backlog) {
        char aux[REDIS_LONGSTR_SIZE+3];

        // Number of arguments
        /* Add the multi bulk reply length. */
        aux[0] = '*';
        len = ll2string(aux + 1, sizeof(aux) - 1, argc);
        aux[len+1] = '\r';
        aux[len+2] = '\n';
        feedReplicationBacklog(aux, len + 3);

        // Write the arguments one by one
        for (j = 0; j < argc; j++) {
            long objlen = stringObjectLen(argv[j]);

            /* We need to feed the buffer with the object as a bulk reply
             * not just as a plain string, so create the $..CRLF payload len
             * ad add the final CRLF */
            aux[0] = '$';
            len = ll2string(aux + 1, sizeof(aux) - 1, objlen);
            aux[len+1] = '\r';
            aux[len+2] = '\n';

            /* Each argument is written as a bulk string, so the command
             * set name Jhon ends up in the backlog as:
             * *3
             * $3
             * SET
             * $4
             * name
             * $4
             * Jhon */

            // Length prefix of the argument
            feedReplicationBacklog(aux, len + 3);
            // The argument itself
            feedReplicationBacklogWithObject(argv[j]);
            // Trailing CRLF
            feedReplicationBacklog(aux + len + 1, 2);
        }
    }

    // Send the command to every slave right away
    /* Write the command to every slave. */
    listRewind(slaves, &li);
    while((ln = listNext(&li))) {
        redisClient *slave = ln->value;

        // Skip slaves that requested a full resynchronization and are still
        // waiting for BGSAVE to start
        /* Don't feed slaves that are still waiting for BGSAVE to start */
        if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_START) continue;

        /* Feed slaves that are waiting for the initial SYNC (so these commands
         * are queued in the output buffer until the initial SYNC completes),
         * or are already in sync with the master. */

        // Send the argument count to the slave
        /* Add the multi bulk length. */
        addReplyMultiBulkLen(slave, argc);

        // Send the arguments to the slave
        /* Finally any additional argument that was not stored inside the
         * static buffer if any (from j to argc). */
        for (j = 0; j < argc; j++)
            addReplyBulk(slave, argv[j]);
    }
}
 

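3. Overview of the Master-Slave Synchronization Mechanism

Redis master-slave synchronization comes in two forms (or two phases): full resynchronization and partial resynchronization.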

 

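When a master and a slave first connect, they perform a full resynchronization; once it completes, partial resynchronization is used. A slave can of course request a full resynchronization at any time if it needs one. Redis's strategy is to always attempt a partial resynchronization first. If that fails, the master tells the slave to do a full resynchronization and starts BGSAVE, then transfers the RDB file once BGSAVE has finished. If it succeeds, the master lets the slave proceed with the partial resynchronization and sends it the data in the backlog.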

 

The figure below summarizes the master-slave synchronization mechanism:

[Figure: summary of the master-slave synchronization mechanism]

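To set up a slave, a client sends SLAVEOF hostname port to the server that is to become the slave. On receiving the command, that server connects to the master automatically and registers the corresponding read/write event handler (syncWithMaster()).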

// Change the master
void slaveofCommand(redisClient *c) {
    if (!strcasecmp(c->argv[1]->ptr, "no") &&
        !strcasecmp(c->argv[2]->ptr, "one")) {
        // slaveof no one: drop the connection to the master
        if (server.masterhost) {
            replicationUnsetMaster();
            redisLog(REDIS_NOTICE, "MASTER MODE enabled (user request)");
        }
    } else {
        long port;

        if ((getLongFromObjectOrReply(c, c->argv[2], &port, NULL) != REDIS_OK))
            return;

        // We may already be connected to the requested master
        /* Check if we are already attached to the specified slave */
        if (server.masterhost && !strcasecmp(server.masterhost,c->argv[1]->ptr)
            && server.masterport == port) {
            redisLog(REDIS_NOTICE,
                "SLAVE OF would result into synchronization with the master "
                "we are already connected with. No operation performed.");

            addReplySds(c, sdsnew("+OK Already connected to specified master\r\n"));
            return;
        }

        // Drop the previous master and switch to the new one. replicationSetMaster()
        // does not actually connect; it only updates the master settings in the
        // server struct. The real connection is made in replicationCron()
        /* There was no previous master or the user specified a different one,
         * we can continue. */
        replicationSetMaster(c->argv[1]->ptr, port);
        redisLog(REDIS_NOTICE, "SLAVE OF %s:%d enabled (user request)",
            server.masterhost, server.masterport);
    }
    addReply(c,shared.ok);
}

// Set the new master
/* Set replication to the specified master address and port. */
void replicationSetMaster(char *ip, int port) {
    sdsfree(server.masterhost);
    server.masterhost = sdsdup(ip);
    server.masterport = port;

    // Drop the connection to the previous master
    if (server.master) freeClient(server.master);
    disconnectSlaves(); /* Force our slaves to resync with us as well. */

    // Discard the cached master
    replicationDiscardCachedMaster(); /* Don't try a PSYNC. */

    // Free the backlog
    freeReplicationBacklog(); /* Don't allow our chained slaves to PSYNC. */

    // cancelReplicationHandshake() aborts any transfer in progress and the master connection
    cancelReplicationHandshake();
    server.repl_state = REDIS_REPL_CONNECT;
    server.master_repl_offset = 0;
}

// Periodic routine that manages the master-slave link, run once per second
// Called from serverCron()
/* --------------------------- REPLICATION CRON  ----------------------------- */

/* Replication cron function, called 1 time per second. */
void replicationCron(void) {
    /* ... */
    // If needed (REDIS_REPL_CONNECT), try to connect to the master;
    // this is where the connection is really made
    /* Check if we should connect to a MASTER */
    if (server.repl_state == REDIS_REPL_CONNECT) {
        redisLog(REDIS_NOTICE, "Connecting to MASTER %s:%d",
            server.masterhost, server.masterport);
        if (connectWithMaster() == REDIS_OK) {
            redisLog(REDIS_NOTICE, "MASTER <-> SLAVE sync started");
        }
    }
    /* ... */
}

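4. Full Resynchronization

Once connected, the slave automatically sends PSYNC to ask the master to synchronize. Redis always attempts a partial resynchronization first and falls back to a full resynchronization only if that fails; a freshly established master-slave link always needs a full resynchronization.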

 

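After connecting to the master, the slave issues the PSYNC command, supplying a master_runid and an offset, and the master checks whether both are valid. master_runid works like an identity token for the master and tells it which master the slave was last connected to; offset is the global offset of the backlog data.

If validation fails, a full resynchronization takes place: the master replies +FULLRESYNC master_runid offset (the slave records master_runid and offset and gets ready to receive the RDB file), starts BGSAVE to produce the RDB file and, once BGSAVE has finished, transfers the file to the slave, which completes the full resynchronization.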
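
How a slave might pick master_runid and offset out of the +FULLRESYNC reply can be sketched like this. It is a minimal illustration rather than the Redis implementation: parse_fullresync is a hypothetical name, and the sketch assumes the run id is exactly 40 characters long.

```c
#include <stdio.h>
#include <string.h>

#define RUN_ID_SIZE 40  /* assumed run id length */

/* Hypothetical helper: parse "+FULLRESYNC <runid> <offset>".
 * On success stores the run id (NUL-terminated) and the offset and
 * returns 0; returns -1 if the reply has a different shape. */
int parse_fullresync(const char *reply, char runid[RUN_ID_SIZE + 1],
                     long long *offset) {
    if (strncmp(reply, "+FULLRESYNC ", 12) != 0) return -1;
    const char *p = reply + 12;
    const char *sp = strchr(p, ' ');        /* separator after the run id */
    if (sp == NULL || sp - p != RUN_ID_SIZE) return -1;
    memcpy(runid, p, RUN_ID_SIZE);
    runid[RUN_ID_SIZE] = '\0';
    if (sscanf(sp + 1, "%lld", offset) != 1) return -1;
    return 0;
}
```

The slave keeps both values so that, after a later disconnection, it can present them in its next PSYNC and possibly avoid another full transfer.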

// Registered as the callback when connectWithMaster() connects to the master
void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {
    char tmpfile[256], *err;
    int dfd, maxtries = 5;
    int sockerr = 0, psync_result;
    socklen_t errlen = sizeof(sockerr);

    /* ... */

    // Ask the master for a partial resynchronization; the master replies to
    // accept or refuse. If it refuses, it returns +FULLRESYNC master_runid offset
    // and the slave prepares for a full resynchronization
    psync_result = slaveTryPartialResynchronization(fd);
    if (psync_result == PSYNC_CONTINUE) {
        redisLog(REDIS_NOTICE,
            "MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.");
        return;
    }

    // Perform a full resynchronization
    /* Fall back to SYNC if needed. Otherwise psync_result == PSYNC_FULLRESYNC
     * and the server.repl_master_runid and repl_master_initial_offset are
     * already populated. */

    // The master does not understand PSYNC: retry with the older SYNC command
    if (psync_result == PSYNC_NOT_SUPPORTED) {
        redisLog(REDIS_NOTICE,"Retrying with SYNC...");
        if (syncWrite(fd,"SYNC\r\n",6,server.repl_syncio_timeout*1000) == -1) {
            redisLog(REDIS_WARNING,"I/O error writing to MASTER: %s",
                strerror(errno));
            goto error;
        }
    }

    // Up to 5 attempts to create the temp file (the source does not say why 5)
    /* Prepare a suitable temp file for bulk transfer */
    while(maxtries--) {
        snprintf(tmpfile, 256, "temp-%d.%ld.rdb", (int)server.unixtime, (long int)getpid());
        dfd = open(tmpfile, O_CREAT|O_WRONLY|O_EXCL, 0644);
        if (dfd != -1) break;
        sleep(1);
    }

    if (dfd == -1) {
        redisLog(REDIS_WARNING,
            "Opening the temp file needed for MASTER <-> SLAVE synchronization: %s",
            strerror(errno));
        goto error;
    }

    // Register a read event with callback readSyncBulkPayload() to receive the RDB file
    /* Setup the non blocking download of the bulk file. */
    if (aeCreateFileEvent(server.el,fd, AE_READABLE, readSyncBulkPayload, NULL) == AE_ERR) {
        redisLog(REDIS_WARNING, "Can't create readable event for SYNC: %s (fd=%d)",
            strerror(errno), fd);
        goto error;
    }

    // Set up the state for the RDB file transfer
    // Replication state
    server.repl_state = REDIS_REPL_TRANSFER;
    // Size of the RDB file (not known yet)
    server.repl_transfer_size = -1;
    // Bytes received so far
    server.repl_transfer_read = 0;
    // Offset of the last fsync, used to flush to disk periodically
    server.repl_transfer_last_fsync_off = 0;
    // File descriptor of the local temporary RDB file
    server.repl_transfer_fd = dfd;
    // Time of the last I/O on the transfer
    server.repl_transfer_lastio = server.unixtime;
    // Name of the temp file
    server.repl_transfer_tmpfile = zstrdup(tmpfile);
    return;

error:
    close(fd);
    server.repl_transfer_s = -1;
    server.repl_state = REDIS_REPL_CONNECT;
    return;
}

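A full resynchronization transfers the RDB data file plus the data in the backlog. For the RDB file, see "An In-Depth Look at the Redis RDB Persistence Strategy". If no background BGSAVE process is running, one is started. Otherwise the slave requesting the full resynchronization is marked as waiting either for the running BGSAVE to end or, when that dump cannot be reused, for the next BGSAVE to start. As soon as BGSAVE finishes, the master starts sending the RDB file to all slaves that were waiting for it.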

// Master-side handler for SYNC and PSYNC; tries partial, then full resynchronization
/* SYNC and PSYNC command implementation. */
void syncCommand(redisClient *c) {
    /* ... */
    // The master tries a partial resynchronization first; if that fails it sends
    // +FULLRESYNC master_runid offset to the slave and then starts BGSAVE

    // Full resynchronization:
    /* Full resynchronization. */
    server.stat_sync_full++;

    /* Here we need to check if there is a background saving operation
     * in progress, or if it is required to start one */
    if (server.rdb_child_pid != -1) {
        /* A BGSAVE child is already running.
           1. If one of the slaves currently attached to the master is in state
              REDIS_REPL_WAIT_BGSAVE_END, set slave c to REDIS_REPL_WAIT_BGSAVE_END;
           2. otherwise set it to REDIS_REPL_WAIT_BGSAVE_START */

        /* Ok a background save is in progress. Let's check if it is a good
         * one for replication, i.e. if there is another slave that is
         * registering differences since the server forked to save */
        redisClient *slave;
        listNode *ln;
        listIter li;

        // Check whether another slave has already requested a full resynchronization
        listRewind(server.slaves, &li);
        while((ln = listNext(&li))) {
            slave = ln->value;
            if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_END) break;
        }

        if (ln) {
            // A slave in state REDIS_REPL_WAIT_BGSAVE_END exists: put slave c
            // in REDIS_REPL_WAIT_BGSAVE_END as well, so the RDB file can be
            // sent to it once the BGSAVE child exits, and copy the updates
            // queued for that slave over to slave c.

            /* Perfect, the server is already registering differences for
             * another slave. Set the right state, and copy the buffer. */

            // Copy the other slave's pending output buffer to slave c
            copyClientOutputBuffer(c, slave);

            // Set slave c to "waiting for the BGSAVE child to finish"
            c->replstate = REDIS_REPL_WAIT_BGSAVE_END;
            redisLog(REDIS_NOTICE, "Waiting for end of BGSAVE for SYNC");
        } else {
            // No slave is in state REDIS_REPL_WAIT_BGSAVE_END: put slave c in
            // REDIS_REPL_WAIT_BGSAVE_START, i.e. "waiting for a new BGSAVE
            // child to start"
            /* No way, we need to wait for the next BGSAVE in order to
             * register differences */
            c->replstate = REDIS_REPL_WAIT_BGSAVE_START;
            redisLog(REDIS_NOTICE, "Waiting for next BGSAVE for SYNC");
        }
    } else {
        // No BGSAVE child is running: start a new one
        /* Ok we don't have a BGSAVE in progress, let's start one */
        redisLog(REDIS_NOTICE, "Starting BGSAVE for SYNC");
        if (rdbSaveBackground(server.rdb_filename) != REDIS_OK) {
            redisLog(REDIS_NOTICE, "Replication failed, can't BGSAVE");
            addReplyError(c, "Unable to perform background save");
            return;
        }

        // Put slave c in REDIS_REPL_WAIT_BGSAVE_END so that the RDB file is
        // sent to it once the BGSAVE child exits
        c->replstate = REDIS_REPL_WAIT_BGSAVE_END;

        // Flush the replication script cache
        /* Flush the script cache for the new slave. */
        replicationScriptCacheFlush();
    }

    if (server.repl_disable_tcp_nodelay)
        anetDisableTcpNoDelay(NULL, c->fd); /* Non critical if it fails. */
    c->repldbfd = -1;
    c->flags |= REDIS_SLAVE;
    server.slaveseldb = -1; /* Force to re-emit the SELECT command. */
    listAddNodeTail(server.slaves,c);
    if (listLength(server.slaves) == 1 && server.repl_backlog == NULL)
        createReplicationBacklog();
    return;
}
                                                                    
// Called when BGSAVE finishes
/* A background saving child (BGSAVE) terminated its work. Handle this. */
void backgroundSaveDoneHandler(int exitcode, int bysignal) {
    // Other work
    /* ... */
    // Some slaves may be waiting for the BGSAVE child to terminate
    /* Possibly there are slaves waiting for a BGSAVE in order to be served
     * (the first stage of SYNC is a bulk transfer of dump.rdb) */
    updateSlavesWaitingBgsave(exitcode == 0 ? REDIS_OK : REDIS_ERR);
}
                                                                    
// Called when RDB persistence finishes (from backgroundSaveDoneHandler())
// The RDB file is ready: send it to all waiting slaves
/* This function is called at the end of every background saving.
 * The argument bgsaveerr is REDIS_OK if the background saving succeeded
 * otherwise REDIS_ERR is passed to the function.
 *
 * The goal of this function is to handle slaves waiting for a successful
 * background saving in order to perform non-blocking synchronization. */
void updateSlavesWaitingBgsave(int bgsaveerr) {
    listNode *ln;
    int startbgsave = 0;
    listIter li;

    listRewind(server.slaves,&li);
    while((ln = listNext(&li))) {
        redisClient *slave = ln->value;

        // Waiting for BGSAVE to start: switch to waiting for the end of the next BGSAVE
        if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_START) {
            startbgsave = 1;

            slave->replstate = REDIS_REPL_WAIT_BGSAVE_END;

        // Waiting for BGSAVE to end: get ready to send the RDB file to the slave
        } else if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_END) {
            struct redis_stat buf;

            // If RDB persistence failed, bgsaveerr is REDIS_ERR
            if (bgsaveerr != REDIS_OK) {
                freeClient(slave);
                redisLog(REDIS_WARNING, "SYNC failed. BGSAVE child returned an error");
                continue;
            }

            // Open the RDB file
            if ((slave->repldbfd = open(server.rdb_filename, O_RDONLY)) == -1 ||
                redis_fstat(slave->repldbfd, &buf) == -1) {
                freeClient(slave);
                redisLog(REDIS_WARNING,
                    "SYNC failed. Can't open/stat DB after BGSAVE: %s", strerror(errno));
                continue;
            }

            slave->repldboff = 0;
            slave->repldbsize = buf.st_size;
            slave->replstate = REDIS_REPL_SEND_BULK;

            // Remove any write event registered earlier
            aeDeleteFileEvent(server.el,slave->fd,AE_WRITABLE);

            // Register a new write event; sendBulkToSlave() transfers the RDB file
            if (aeCreateFileEvent(server.el, slave->fd, AE_WRITABLE, sendBulkToSlave,
                slave) == AE_ERR) {
                freeClient(slave);
                continue;
            }
        }
    }

    // startbgsave is set when some slaves were waiting for a BGSAVE to start:
    // launch a new BGSAVE for them
    if (startbgsave) {
        /* Since we are starting a new background save for one or more slaves,
         * we flush the Replication Script Cache to use EVAL to propagate every
         * new EVALSHA for the first time, since all the new slaves don't know
         * about previous scripts. */
        replicationScriptCacheFlush();

        if (rdbSaveBackground(server.rdb_filename) != REDIS_OK) {
            /* BGSAVE may fail to fork; every slave waiting for BGSAVE is then
             * disconnected. This is a self-protection measure: a failed fork
             * most likely means memory is tight.
             */
            listIter li;

            listRewind(server.slaves,&li);
            redisLog(REDIS_WARNING, "SYNC failed. BGSAVE failed");
            while((ln = listNext(&li))) {
                redisClient *slave = ln->value;

                if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_START)
                    freeClient(slave);
            }
        }
    }
}

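5. Partial Resynchronization

As noted above, Redis always attempts a partial resynchronization first. A partial resynchronization sends the slave the data buffered in the backlog, that is, the update records.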

 

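After connecting to the master, the slave issues the PSYNC command, supplying a master_runid and an offset, and the master checks whether both are valid.

If validation succeeds, a partial resynchronization takes place: the master replies +CONTINUE (on receiving it, the slave registers an event handler for the backlog data) and then sends the data in the backlog.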
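
The master-side validation can be condensed into a single decision function. This is a sketch under stated assumptions: psync_decide and the master_reply enum are hypothetical names, the run id comparison is a plain string compare, and the accepted offset window (from the first backlog byte up to one past the newest byte) is my reading of what partial resynchronization requires.

```c
#include <string.h>

typedef enum { MASTER_REPLY_CONTINUE, MASTER_REPLY_FULLRESYNC } master_reply;

/* Hypothetical helper: decide how the master answers
 * "PSYNC <req_runid> <req_offset>". */
master_reply psync_decide(const char *my_runid, const char *req_runid,
                          long long req_offset, int have_backlog,
                          long long backlog_off, long long backlog_histlen) {
    /* A different run id (including "?") means the slave was replicating
     * from another master instance, or from none. */
    if (strcmp(my_runid, req_runid) != 0) return MASTER_REPLY_FULLRESYNC;
    /* Without a backlog there is nothing to replay. */
    if (!have_backlog) return MASTER_REPLY_FULLRESYNC;
    /* The requested offset must still be covered by the backlog. */
    if (req_offset < backlog_off ||
        req_offset > backlog_off + backlog_histlen)
        return MASTER_REPLY_FULLRESYNC;
    return MASTER_REPLY_CONTINUE;
}
```

A slave without cached master information sends PSYNC ? -1, which fails the run id test and therefore always yields a full resynchronization.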

// Registered as the callback when connectWithMaster() connects to the master
void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {
    char tmpfile[256], *err;
    int dfd, maxtries = 5;
    int sockerr = 0, psync_result;
    socklen_t errlen = sizeof(sockerr);

    /* ... */

    // Try a partial resynchronization. If the master allows it, it replies
    // +CONTINUE and the slave registers the corresponding event

    /* Try a partial resynchronization. If we don't have a cached master
     * slaveTryPartialResynchronization() will at least try to use PSYNC
     * to start a full resynchronization so that we get the master run id
     * and the global offset, to try a partial resync at the next
     * reconnection attempt. */

    // The function returns one of three results:
    // PSYNC_CONTINUE: a partial resynchronization will take place; the callback
    // readQueryFromClient() has already been set in slaveTryPartialResynchronization()
    // PSYNC_FULLRESYNC: full resynchronization, the RDB file will be downloaded
    // PSYNC_NOT_SUPPORTED: the master does not support PSYNC
    psync_result = slaveTryPartialResynchronization(fd);
    if (psync_result == PSYNC_CONTINUE) {
        redisLog(REDIS_NOTICE,
            "MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.");
        return;
    }

    // Perform a full resynchronization
    /* ... */
}
                                                
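// The function returns one of three states:
// PSYNC_CONTINUE: a partial resync will take place; the callback is already installed
// PSYNC_FULLRESYNC: full resync; the RDB file will be downloaded
// PSYNC_NOT_SUPPORTED: the master does not understand PSYNC or sent an unexpected reply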
#define PSYNC_CONTINUE 0
#define PSYNC_FULLRESYNC 1
#define PSYNC_NOT_SUPPORTED 2
int slaveTryPartialResynchronization(int fd) {
    char *psync_runid;
    char psync_offset[32];
    sds reply;
                                                
    /* Initially set repl_master_initial_offset to -1 to mark the current
     * master run_id and offset as not valid. Later if we'll be able to do
     * a FULL resync using the PSYNC command we'll set the offset at the
     * right value, so that this information will be propagated to the
     * client structure representing the master into server.master. */
    server.repl_master_initial_offset = -1;
                                                
    if (server.cached_master) {
        // State from the previous connection to the master is cached, so try
        // a partial resync to cut down on data transfer
        psync_runid = server.cached_master->replrunid;
        snprintf(psync_offset, sizeof(psync_offset), "%lld",
            server.cached_master->reploff + 1);

        redisLog(REDIS_NOTICE,
            "Trying a partial resynchronization (request %s:%s).",
            psync_runid, psync_offset);
    } else {
        // No state from a previous connection is cached, so do a full resync.
        // "PSYNC ? -1" still lets the slave learn the master's run ID
        redisLog(REDIS_NOTICE, "Partial resynchronization not possible (no cached master)");
        psync_runid = "?";
        memcpy(psync_offset, "-1", 3);
    }
                                                
    // Send the command to the master and read the reply
    /* Issue the PSYNC command */
    reply = sendSynchronousCommand(fd, "PSYNC", psync_runid, psync_offset, NULL);
                                                
    // Full resync
    if (!strncmp(reply, "+FULLRESYNC", 11)) {
        char *runid = NULL, *offset = NULL;
                                                
        /* FULL RESYNC, parse the reply in order to extract the run id
         * and the replication offset. */
        runid = strchr(reply, ' ');
        if (runid) {
            runid++;
            offset = strchr(runid, ' ');
            if (offset) offset++;
        }
        if (!runid || !offset || (offset-runid-1) != REDIS_RUN_ID_SIZE) {
            redisLog(REDIS_WARNING,
                "Master replied with wrong +FULLRESYNC syntax.");
                                            
            /* This is an unexpected condition, actually the +FULLRESYNC
             * reply means that the master supports PSYNC, but the reply
             * format seems wrong. To stay safe we blank the master
             * runid to make sure next PSYNCs will fail. */
            memset(server.repl_master_runid, 0, REDIS_RUN_ID_SIZE + 1);
        } else {
            // Copy the run ID
            memcpy(server.repl_master_runid, runid, offset-runid-1);
            server.repl_master_runid[REDIS_RUN_ID_SIZE] = '\0';
            server.repl_master_initial_offset = strtoll(offset,NULL,10);
            redisLog(REDIS_NOTICE, "Full resync from master: %s:%lld",
                server.repl_master_runid,
                server.repl_master_initial_offset);
        }
        /* We are going to full resync, discard the cached master structure. */
        replicationDiscardCachedMaster();
        sdsfree(reply);
        return PSYNC_FULLRESYNC;
    }
                                                
    // Partial resync
    if (!strncmp(reply, "+CONTINUE", 9)) {
        /* Partial resync was accepted, set the replication state accordingly */
        redisLog(REDIS_NOTICE, "Successful partial resynchronization with master.");
        sdsfree(reply);
                                                
        // Promote the cached master back to the current master, ready for PSYNC (partial resync)
        replicationResurrectCachedMaster(fd);
                                                
        return PSYNC_CONTINUE;
    }
                                                
    /* If we reach this point we received either an error since the master does
     * not understand PSYNC, or an unexpected reply from the master.
     * Reply with PSYNC_NOT_SUPPORTED in both cases. */

    // The master replied with an error or with something unexpected
    if (strncmp(reply, "-ERR", 4)) {
        /* If it's not an error, log the unexpected event. */
        redisLog(REDIS_WARNING, "Unexpected reply to PSYNC from master: %s", reply);
    } else {
        redisLog(REDIS_NOTICE,
            "Master does not support PSYNC or is in "
            "error state (reply: %s)", reply);
    }
    sdsfree(reply);
    replicationDiscardCachedMaster();
    return PSYNC_NOT_SUPPORTED;
}
                                                
// Master-side handler for the SYNC and PSYNC commands; tries a partial resync first, then a full resync
/* SYNC and PSYNC command implementation. */
void syncCommand(redisClient *c) {
    ......
                                                
    // The master tries a partial resync; if allowed, it replies +CONTINUE and then sends the backlog
                                                
    /* Try a partial resynchronization if this is a PSYNC command.
     * If it fails, we continue with usual full resynchronization, however
     * when this happens masterTryPartialResynchronization() already
     * replied with:
     *
     * +FULLRESYNC <runid> <offset>
     *
     * So the slave knows the new runid and offset to try a PSYNC later
     * if the connection with the master is lost. */
    if (!strcasecmp(c->argv[0]->ptr, "psync")) {
        // Partial resync
        if (masterTryPartialResynchronization(c) == REDIS_OK) {
            server.stat_sync_partial_ok++;
            return; /* No full resync needed, return. */
        } else {
            // Partial resync failed and a full resync follows;
            // argv[1] holds the run ID sent by the slave
            char *master_runid = c->argv[1]->ptr;

            /* Increment stats for failed PSYNCs, but only if the
             * runid is not "?", as this is used by slaves to force a full
             * resync on purpose when they are not able to partially
             * resync. */
            if (master_runid[0] != '?')
                server.stat_sync_partial_err++;
        }
    } else {
        /* If a slave uses SYNC, we are dealing with an old implementation
         * of the replication protocol (like redis-cli --slave). Flag the client
         * so that we don't expect to receive REPLCONF ACK feedbacks. */
        c->flags |= REDIS_PRE_PSYNC_SLAVE;
    }
                                                
    // 执行全同步:
    ......
}
                                                
// The master checks whether a partial resync is possible
/* This function handles the PSYNC command from the point of view of a
* master receiving a request for partial resynchronization.
*
* On success return REDIS_OK, otherwise REDIS_ERR is returned and we proceed
* with the usual full resync. */
int masterTryPartialResynchronization(redisClient *c) {
    long long psync_offset, psync_len;
    char *master_runid = c->argv[1]->ptr;
    char buf[128];
    int buflen;
                                                
    /* Is the runid of this master the same advertised by the wannabe slave
     * via PSYNC? If runid changed this master is a different instance and
     * there is no way to continue. */
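    if (strcasecmp(master_runid, server.runid)) {
    // When a slave has to drop its connection to the master because of an
    // error, it caches the master's state for a later partial resync.
    // 1) master_runid is the run ID of the cached master, supplied by the slave;
    // 2) server.runid is the run ID of this server (the master).
    // A mismatch means this server is not the master the slave cached, so a
    // partial resync is impossible and only a full resync can be done

        // "?" means the slave is asking for a full resync. It does so when it
        // has no cached master (see slaveTryPartialResynchronization())
        /* Run id "?" is used by slaves that want to force a full resync. */
        if (master_runid[0] != '?') {
            redisLog(REDIS_NOTICE,"Partial resynchronization not accepted: "
                "Runid mismatch (Client asked for '%s', I'm '%s')",
                master_runid, server.runid);
        } else {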
            redisLog(REDIS_NOTICE, "Full resync requested by slave.");
        }
        goto need_full_resync;
    }
                                                
    // Parse the integer argument: the offset requested by the slave
    /* We still have the data our slave is asking for? */
    if (getLongLongFromObjectOrReply(c, c->argv[2], &psync_offset, NULL) != REDIS_OK)
        goto need_full_resync;
                                                
    // Cases in which a partial resync is not possible
    if (!server.repl_backlog || /* no backlog exists */
        psync_offset < server.repl_backlog_off ||  /* psync_offset is too small:
                                                    the slave has missed more updates
                                                    than the backlog still holds, so
                                                    fall back to a full resync */
                                                    /* psync_offset is out of range */
        psync_offset > (server.repl_backlog_off + server.repl_backlog_histlen))
    // The conditions for a partial resync are not met; do a full resync instead
    {
        redisLog(REDIS_NOTICE,
            "Unable to partial resync with the slave for lack of "
            "backlog (Slave request was: %lld).", psync_offset);

        if (psync_offset > server.master_repl_offset) {
            redisLog(REDIS_WARNING,
                "Warning: slave tried to PSYNC with an offset that is greater "
                "than the master replication offset.");
        }
        goto need_full_resync;
    }
                                                
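    // Perform the partial resync:
    // 1) flag the client as a slave
    // 2) tell the slave to get ready for data; it prepares itself on receiving +CONTINUE
    // 3) start sending the data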
    /* If we reached this point, we are able to perform a partial resync:
     * 1) Set client state to make it a slave.
     * 2) Inform the client we can continue with +CONTINUE
     * 3) Send the backlog data (from the offset to the end) to the slave. */
                                                
    // Flag the connected client as a slave
    c->flags |= REDIS_SLAVE;

    // For a partial resync the slave goes straight to the online state
    // #define REDIS_REPL_ONLINE 9 /* RDB file transmitted, sending just
    // updates. */
    c->replstate = REDIS_REPL_ONLINE;

    // Refresh the ACK timestamp
    c->repl_ack_time = server.unixtime;

    // Append the client to the list of slaves
    listAddNodeTail(server.slaves, c);

    // Tell the slave that a partial resync can proceed; on receipt it prepares itself (registers its callback)
    /* We can't use the connection buffers since they are used to accumulate
     * new commands at this stage. But we are sure the socket send buffer is
     * empty so this write will never fail actually. */
    buflen = snprintf(buf, sizeof(buf), "+CONTINUE\r\n");
    if (write(c->fd, buf, buflen) != buflen) {
        freeClientAsync(c);
        return REDIS_OK;
    }
                                                
    // Write the backlog data to the slave; the backlog holds the buffered updates
    psync_len = addReplyReplicationBacklog(c, psync_offset);

    redisLog(REDIS_NOTICE,
        "Partial resynchronization request accepted. Sending %lld bytes of backlog "
        "starting from offset %lld.", psync_len, psync_offset);
                                    
    /* Note that we don't need to set the selected DB at server.slaveseldb
     * to -1 to force the master to emit SELECT, since the slave already
     * has this state from the previous connection with the master. */
    refreshGoodSlavesCount();
    return REDIS_OK; /* The caller can return, no full resync needed. */
                                                
need_full_resync:
    ......
    // Send "+FULLRESYNC runid repl_offset" to the slave
}
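
The slave-side handling of the PSYNC reply in the listing above can be condensed into a small standalone function. This is only a sketch for illustration, not code from replication.c: `parse_psync_reply()` and `RUN_ID_SIZE` are names I made up, and the value 40 is an assumption standing in for REDIS_RUN_ID_SIZE.

```c
#include <stdlib.h>
#include <string.h>

/* Assumption: stands in for REDIS_RUN_ID_SIZE from the listing. */
#define RUN_ID_SIZE 40

enum psync_result { PSYNC_CONTINUE, PSYNC_FULLRESYNC, PSYNC_NOT_SUPPORTED };

/* Hypothetical helper: classify the master's reply to PSYNC.
 * For a well-formed "+FULLRESYNC <runid> <offset>" the run ID and offset
 * are copied out. For a malformed one the run ID stays empty and *offset
 * stays -1, so that a later PSYNC cannot match by accident. */
enum psync_result parse_psync_reply(const char *reply,
                                    char runid[RUN_ID_SIZE + 1],
                                    long long *offset)
{
    runid[0] = '\0';
    *offset = -1;

    if (strncmp(reply, "+FULLRESYNC", 11) == 0) {
        const char *id = strchr(reply, ' ');                /* space before runid  */
        const char *off = id ? strchr(id + 1, ' ') : NULL;  /* space before offset */
        if (id && off && (off - (id + 1)) == RUN_ID_SIZE) {
            memcpy(runid, id + 1, RUN_ID_SIZE);
            runid[RUN_ID_SIZE] = '\0';
            *offset = strtoll(off + 1, NULL, 10);
        }
        return PSYNC_FULLRESYNC;
    }
    if (strncmp(reply, "+CONTINUE", 9) == 0)
        return PSYNC_CONTINUE;

    /* "-ERR ..." or anything unexpected: the caller falls back to SYNC. */
    return PSYNC_NOT_SUPPORTED;
}
```

Note that, as in the listing, a badly formatted +FULLRESYNC still returns PSYNC_FULLRESYNC; only the saved run ID and offset are invalidated.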

6. Caching the Master

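A slave may lose its connection to the master for various reasons, for example network latency (a PING timeout, an ACK timeout and so on). When that happens, the slave tries to save the state of its connection to the master, such as the global replication offset, so that it can attempt a partial resync next time, and it then tries to reconnect to the master. Note that if the disconnection lasts long enough, the updates the slave needs will already have been overwritten in the master's backlog, and the partial resync is bound to fail.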
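
Why a long disconnection rules out a partial resync follows from the offset test in masterTryPartialResynchronization(): the requested offset has to fall inside the range the backlog still holds. A minimal sketch of that test, where `struct backlog` and `can_partial_resync()` are hypothetical names of mine and not Redis identifiers:

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down view of the master's backlog bookkeeping. */
struct backlog {
    bool      exists;   /* stands in for server.repl_backlog != NULL      */
    long long off;      /* replication offset of the oldest byte retained */
    long long histlen;  /* number of valid bytes in the circular buffer   */
};

/* True when psync_offset lies in [off, off + histlen], i.e. the master
 * can still serve every byte the slave is missing. */
bool can_partial_resync(const struct backlog *b, long long psync_offset)
{
    if (!b->exists)
        return false;                 /* no backlog at all             */
    if (psync_offset < b->off)
        return false;                 /* needed data was overwritten   */
    if (psync_offset > b->off + b->histlen)
        return false;                 /* slave is ahead of the master  */
    return true;
}
```

With a backlog that starts at offset 1001 and holds 500 bytes, a request for offset 1001 or 1501 passes, while 1000 (already overwritten) and 1502 (beyond what the master has written) both force a full resync.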

void freeClient(redisClient *c) {
    listNode *ln;
                                                                  
    /* If this is marked as current client unset it */
    if (server.current_client == c) server.current_client = NULL;
                                                                  
    // If this server is a slave connected to a master, the master's state may need to be saved for a later PSYNC
    /* If it is our master that's being disconnected we should make sure
     * to cache the state to try a partial resynchronization later.
     *
     * Note that before doing this we make sure that the client is not in
     * some unexpected state, by checking its flags. */
    if (server.master && c->flags & REDIS_MASTER) {
        redisLog(REDIS_WARNING,"Connection with master lost.");
        if (!(c->flags & (REDIS_CLOSE_AFTER_REPLY|
                          REDIS_CLOSE_ASAP|
                          REDIS_BLOCKED|
                          REDIS_UNBLOCKED)))
        {
            replicationCacheMaster(c);
            return;
        }
    }
    ......
}
                                                                  
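// To make a partial resync possible, the slave saves the master's state before
// dropping the connection; the state is kept in server.cached_master.
// Called from freeClient() to save the connection state for a later PSYNC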
void replicationCacheMaster(redisClient *c) {
    listNode *ln;
                                                                  
    redisAssert(server.master != NULL && server.cached_master == NULL);
    redisLog(REDIS_NOTICE,"Caching the disconnected master state.");
                                                                  
    // Remove the master from the list of clients
    /* Remove from the list of clients, we don't want this client to be
     * listed by CLIENT LIST or processed in any way by batch operations. */
    ln = listSearchKey(server.clients,c);
    redisAssert(ln != NULL);
    listDelNode(server.clients,ln);
                                                                  
    // Save the master's state
    /* Save the master. Server.master will be set to null later by
     * replicationHandleMasterDisconnection(). */
    server.cached_master = server.master;
                                                                  
    // Unregister the events and close the connection
    /* Remove the event handlers and close the socket. We'll later reuse
     * the socket of the new connection with the master during PSYNC. */
    aeDeleteFileEvent(server.el,c->fd,AE_READABLE);
    aeDeleteFileEvent(server.el,c->fd,AE_WRITABLE);
    close(c->fd);
                                                                  
    /* Set fd to -1 so that we can safely call freeClient(c) later. */
    c->fd = -1;
                                                                  
    // Update the replication state and set server.master = NULL
    /* Caching the master happens instead of the actual freeClient() call,
     * so make sure to adjust the replication state. This function will
     * also set server.master to NULL. */
    replicationHandleMasterDisconnection();
}
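
What the cached state buys the slave on its next connection attempt is the pair of arguments it sends with PSYNC. The sketch below mirrors the branch in slaveTryPartialResynchronization(); `struct cached_master` and `build_psync_args()` are hypothetical stand-ins I introduce here, and the 41-byte buffer assumes a 40-character run ID.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the two fields PSYNC needs from the
 * cached master client. */
struct cached_master {
    char      replrunid[41];  /* assumed 40-character run ID plus '\0' */
    long long reploff;        /* last offset the slave has processed   */
};

/* Fill in the two PSYNC arguments. With cached state the slave asks to
 * continue from the byte after its last processed offset; without it
 * the slave sends "?" and "-1", which forces a full resync. */
void build_psync_args(const struct cached_master *cm,
                      char *runid, size_t runid_len,
                      char *offset, size_t offset_len)
{
    if (cm) {
        snprintf(runid, runid_len, "%s", cm->replrunid);
        snprintf(offset, offset_len, "%lld", cm->reploff + 1);
    } else {
        snprintf(runid, runid_len, "?");
        snprintf(offset, offset_len, "-1");
    }
}
```

A slave that had processed up to offset 99 therefore requests offset 100, and the master's backlog check decides whether that byte is still available.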

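7. Summary

In short, a full synchronization is the upload and download of an RDB file; after that, whenever a small amount of data changes on the master, the master propagates the change records to every slave. This article has described the internal protocol and mechanics of Redis master-slave replication in detail. The next few Redis articles will focus mainly on its internal data structures.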

 

posted @ 2015-07-09 14:14  Uncle_Nucky