A look at pgpool-II's health check

In pgpool-II, two configuration file parameters relate to the health check:

health_check_period
health_check_timeout

First, here is how the documentation on the official site explains them:

http://pgpool.projects.postgresql.org/pgpool-II/doc/pgpool-en.html

health_check_period
This parameter specifies the interval between the health checks in seconds. 

Default is 0, which means health check is disabled. You need to reload pgpool.conf if you change health_check_period.

health_check_timeout

pgpool-II periodically tries to connect to the backends to detect any error on the servers or networks. This error check procedure is called "health check". 

If an error is detected, pgpool-II tries to perform failover or degeneration. 

This parameter serves to prevent the health check from waiting for a long time in a case such as an unplugged network cable. The timeout value is in seconds. Default value is 20.

0 disables timeout (waits until TCP/IP timeout). 

This health check requires one extra connection to each backend, so max_connections in the postgresql.conf needs to be incremented as needed. You need to reload pgpool.conf if you change this value.

So what actually happens? Here I take pgpool-II 3.1 as an example (for readability, I have removed some of the less important code):

/*                                    
* pgpool main program                                    
*/                                    
int main(int argc, char **argv)                                    
{                                    
    ……                                
    /*                                
     * This is the main loop                                
     */                                
    for (;;)                                
    {                                
        CHECK_REQUEST;             
        /* do we need health checking for PostgreSQL? */   
        if (pool_config->health_check_period > 0)                            
        {                          
            ……                     
            if (pool_config->health_check_timeout > 0)                        
            {                        
                /*                    
                 * set health checker timeout. we want to detect  
                 * communication path failure much earlier before 
                 * TCP/IP stack detects it.                    
                 */                    
                pool_signal(SIGALRM, health_check_timer_handler); 
                alarm(pool_config->health_check_timeout);                    
            }                        
                                    
            /*                        
             * do actual health check. trying to connect to the backend   
             */                        
            errno = 0;                        
            health_check_timer_expired = 0;                        
            POOL_SETMASK(&UnBlockSig);                        
            sts = health_check();                        
            POOL_SETMASK(&BlockSig);                        
                                    
            if (pool_config->parallel_mode || pool_config->enable_query_cache) 
                sys_sts = system_db_health_check();                    
                                    
            if ((sts > 0 || sys_sts < 0) 
             && (errno != EINTR || (errno == EINTR && health_check_timer_expired)))
            {                        
                if (sts > 0)                    
                {                    
                    sts--;         
                    if (!pool_config->parallel_mode)                
                    {                
                        if (POOL_DISALLOW_TO_FAILOVER(BACKEND_INFO(sts).flag))
                        {            
                            pool_log("health_check: %d failover is canceld because failover is disallowed", sts);
                        }            
                        else            
                        {            
                            pool_log("set %d th backend down status", sts);        
                            Req_info->kind = NODE_DOWN_REQUEST;        
                            Req_info->node_id[0] = sts;        
                            failover();        
                            /* need to distribute this info to children */        
                        }            
                    }                
                    else                
                    {                
                        retrycnt++;            
                        pool_signal(SIGALRM, SIG_IGN); /* Cancel timer */ 
                                    
                        if (retrycnt > NUM_BACKENDS)            
                        {            
                            /* retry count over */        
                            pool_log("set %d th backend down status", sts);        
                            Req_info->kind = NODE_DOWN_REQUEST;        
                            Req_info->node_id[0] = sts;        
                            failover();        
                            retrycnt = 0;        
                        }            
                        else            
                        {            
                            /* continue to retry */        
                            sleep_time = pool_config->health_check_period/NUM_BACKENDS;
                            pool_debug("retry sleep time: %d seconds", sleep_time);
                            pool_sleep(sleep_time);        
                            continue;        
                        }            
                    }                
                }                    
                ……                    
            }                        
                                    
            if (pool_config->health_check_timeout > 0)                        
            {                        
                /* seems ok. cancel health check timer */                    
                pool_signal(SIGALRM, SIG_IGN);                    
            }                        
                                    
            sleep_time = pool_config->health_check_period;                        
            pool_sleep(sleep_time);                        
        }                            
        else                            
        {                            
            for (;;)                        
            {                        
                int r;                    
                struct timeval t = {3, 0};
                POOL_SETMASK(&UnBlockSig);                    
                r = pool_pause(&t);                    
                POOL_SETMASK(&BlockSig);                    
                if (r > 0)                    
                    break;                
            }                        
        }                            
    }                              
    pool_shmem_exit(0);                                
}

Now things are fairly clear.

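First, health_check_period. If it is nonzero, the health check is enabled.
As far as switching the check on goes, every nonzero value behaves the same; the value itself only appears as the argument to pool_sleep() at the bottom of the loop.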
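
To make the first point concrete, here is a rough model of that branch of the main loop. It is only a sketch under two assumptions that are mine, not pgpool-II's: health_check() returns instantly and always succeeds, and the clock is simulated instead of really sleeping. The function name count_health_checks is made up for this illustration.

```c
/*
 * Simplified model of the health-check branch of the main loop.
 * Assumption: health_check() takes no time and never fails, so only
 * the pool_sleep(health_check_period) at the bottom advances the clock.
 * Returns how many checks start within window_sec simulated seconds.
 */
int count_health_checks(int health_check_period, int window_sec)
{
    int checks = 0;
    int now = 0;    /* simulated seconds since the loop started */

    if (health_check_period <= 0)
        return 0;   /* the else branch: no health check at all */

    while (now < window_sec)
    {
        checks++;                       /* sts = health_check(); */
        now += health_check_period;     /* pool_sleep(health_check_period); */
    }
    return checks;
}
```

In this model a value of 0 yields no checks at all, while any nonzero value yields at least one check per window; the value only changes the pause between passes.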

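Second, health_check_timeout. If it is greater than 0, a timer is set with alarm(). When the timer fires, SIGALRM invokes health_check_timer_handler, which affects the health_check() call in progress: afterwards the loop tests errno == EINTR together with health_check_timer_expired.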
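
The timer mechanism can be tried out in isolation. The sketch below is not pgpool-II code: blocked_check_with_timeout and on_alarm are hypothetical names, a read() on an empty pipe stands in for a connection attempt that hangs, and the handler is installed with plain sigaction() without SA_RESTART (the excerpt does not show how pool_signal installs it, so that part is an assumption).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the SIGALRM handler; plays the role of health_check_timer_expired. */
static volatile sig_atomic_t timer_expired = 0;

static void on_alarm(int signo)
{
    (void) signo;
    timer_expired = 1;
}

/*
 * Hypothetical stand-in for a hanging health_check(): blocks in read()
 * on a pipe that never receives data, then gets interrupted by SIGALRM.
 * Returns 1 if the call was cut short by the timer, 0 if it ended for
 * another reason, -1 on bad input or setup failure.
 */
int blocked_check_with_timeout(unsigned int timeout_sec)
{
    int fds[2];
    char buf;
    struct sigaction sa, old_sa;
    ssize_t n;
    int saved_errno;

    if (timeout_sec == 0)
        return -1;              /* without a timer this sketch would block forever */
    if (pipe(fds) != 0)
        return -1;

    /* No SA_RESTART, so the interrupted read() fails with EINTR. */
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old_sa) != 0)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    timer_expired = 0;
    alarm(timeout_sec);         /* like alarm(pool_config->health_check_timeout) */

    errno = 0;
    n = read(fds[0], &buf, 1);  /* blocks until SIGALRM arrives */
    saved_errno = errno;

    alarm(0);                           /* cancel any pending timer */
    sigaction(SIGALRM, &old_sa, NULL);  /* restore the previous handler */
    close(fds[0]);
    close(fds[1]);

    return (n < 0 && saved_errno == EINTR && timer_expired) ? 1 : 0;
}
```

Calling blocked_check_with_timeout(1) is expected to come back after roughly one second with the EINTR-plus-flag combination that the main loop tests for.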

Third, and this is the most frustrating part:

Inside the main loop, as long as health_check_period is nonzero, health_check() is called over and over.
In general, that is far more frequent than once per default health_check_timeout of 20 seconds.
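
The retry branch of the excerpt (parallel mode) pauses for even less than a full period: it computes sleep_time as health_check_period / NUM_BACKENDS. Assuming both operands are plain ints, as the %d in the debug message suggests, this is integer division. The helper below is a hypothetical illustration of that arithmetic only, not a pgpool-II function.

```c
/*
 * Mirrors the expression from the retry branch:
 *     sleep_time = pool_config->health_check_period / NUM_BACKENDS;
 * Assumption: both values are ints, so the division truncates.
 * Returns -1 for a non-positive backend count to avoid dividing by zero.
 */
int retry_sleep_time(int health_check_period, int num_backends)
{
    if (num_backends <= 0)
        return -1;
    return health_check_period / num_backends;
}
```

Under that assumption, a period smaller than the number of backends truncates to a retry sleep of 0 seconds.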

You can see this when actually running the pgpool command with the -d option: pgpool-II keeps calling health_check() to check the state of each node.

One could say that, with health_check being run again and again in the main loop, health_check_timeout exists in name only.

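I just do not know which version this started with. One could also say that the pgpool-II developers were careless and did not keep the code and the documentation in sync. Perhaps this is a common ailment of many open-source projects.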

posted @ 2012-07-27 15:43 健哥的数据花园
