Nacos 2.0 Source Code Analysis: Health Check Mechanism

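Note:
This article is based on my own study of the Nacos 2.0.1 source code. My understanding may be imperfect, so complete correctness is not guaranteed. Corrections are welcome so that we can learn from each other. Please credit the source when reposting.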

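What Is a Health Check?

As I understand it, health checking is how Nacos manages the state of the various connections on the server side: whether the connection between a server and its database is working, whether an HTTP connection is working, whether a TCP connection is working, and whether a registered service is available. Which connections to check, and what to check for, can be extended freely.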

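Overall Design of Health Checking

  • The task responsible for executing health checks is NacosHealthCheckTask. It currently has two implementations, ClientBeatCheckTaskV2 and HealthCheckTaskV2; the former handles heartbeat-related state, the latter handles the state of various connections.
  • While running a health check, ClientBeatCheckTaskV2 passes the task about to run through the interceptor list in InstanceBeatCheckTaskInterceptorChain. That task is an InstanceBeatCheckTask, which maintains an internal list of Checkers so that additional checkers can be added.
  • While running a health check, HealthCheckTaskV2 hands the task to HealthCheckProcessorV2Delegate for processing.

The com.alibaba.nacos.naming.healthcheck package in the nacos-naming module defines everything related to health checking. Looking at the classes in this package, four keywords stand out: Processor, Task, Interceptor, and Checker. They play different roles and together make up the complete health check feature. This section analyzes only the concrete interceptor implementations, not how the interceptors are chained together; for that, see the article "Interceptor Mechanism".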

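Interceptor Series (Interceptor)

Interceptors in the naming module intercept tasks that are about to run and perform some validation on them. The interceptor types covered below are related as follows: NacosNamingInterceptor is the root interface, implemented by the abstract classes AbstractHealthCheckInterceptor (extended by HealthCheckResponsibleInterceptor and HealthCheckEnableInterceptor) and AbstractBeatCheckInterceptor (extended by InstanceEnableBeatCheckInterceptor, ServiceEnableBeatCheckInterceptor, and InstanceBeatCheckResponsibleInterceptor).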

Interceptable

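Every task that needs to be processed by interceptors must implement this interface. It defines what runs when the task is intercepted and what runs when it is not.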

/**
 * Interceptable Interface.
 *
 * @author xiweng.yy
 */
public interface Interceptable {
    
    /**
     * If no {@link NacosNamingInterceptor} intercept this object, this method will be called to execute.
     */
    void passIntercept();
    
    /**
     * If one {@link NacosNamingInterceptor} intercept this object, this method will be called.
     */
    void afterIntercept();
}

InstanceBeatCheckTask

A task that checks the heartbeat state of an instance.

/**
 * Instance beat check task.
 *
 * @author xiweng.yy
 */
public class InstanceBeatCheckTask implements Interceptable {

    /**
     * The list of checkers.
     */
    private static final List<InstanceBeatChecker> CHECKERS = new LinkedList<>();

    private final IpPortBasedClient client;

    private final Service service;

    private final HealthCheckInstancePublishInfo instancePublishInfo;

    static {
        // Load the built-in and SPI-provided checkers during static initialization
        CHECKERS.add(new UnhealthyInstanceChecker());
        CHECKERS.add(new ExpiredInstanceChecker());
        CHECKERS.addAll(NacosServiceLoader.load(InstanceBeatChecker.class));
    }

    public InstanceBeatCheckTask(IpPortBasedClient client, Service service, HealthCheckInstancePublishInfo instancePublishInfo) {
        this.client = client;
        this.service = service;
        this.instancePublishInfo = instancePublishInfo;
    }

    @Override
    public void passIntercept() {
        // Run the checkers when the task was not intercepted
        for (InstanceBeatChecker each : CHECKERS) {
            each.doCheck(client, service, instancePublishInfo);
        }
    }

    @Override
    public void afterIntercept() {
    }

    public IpPortBasedClient getClient() {
        return client;
    }

    public Service getService() {
        return service;
    }

    public HealthCheckInstancePublishInfo getInstancePublishInfo() {
        return instancePublishInfo;
    }
}

NacosNamingInterceptor

The NacosNamingInterceptor interface restricts its implementations to handling Interceptable types. Its main job is to define the interception mechanism.

/**
 * Nacos naming interceptor.
 *
 * @author xiweng.yy
 */
public interface NacosNamingInterceptor<T extends Interceptable> {
    
    /**
     * Judge whether the input type is intercepted by this Interceptor.
     *
     * <p>This method only should judge the object type whether need be do intercept. Not the intercept logic.
     *
     * @param type type
     * @return true if the input type is intercepted by this Interceptor, otherwise false
     */
    boolean isInterceptType(Class<?> type);
    
    /**
     * Do intercept operation.
     *
     * <p>This method is the actual intercept operation.
     *
     * @param object need intercepted object
     * @return true if object is intercepted, otherwise false
     */
    boolean intercept(T object);
    
    /**
     * The order of interceptor. The lower the number, the earlier the execution.
     *
     * @return the order number of interceptor
     */
    int order();
}
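
The interface above only declares the hooks; the chain that calls them is covered in a separate article. As a rough illustration of how the pieces could fit together, here is a minimal sketch. It assumes the chain sorts interceptors by order(), stops at the first interceptor whose intercept() returns true, and calls passIntercept() only if none did. All Sketch* and demo names are hypothetical simplifications, not Nacos classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Interceptable. */
interface SketchInterceptable {
    void passIntercept();

    void afterIntercept();
}

/** Hypothetical stand-in for NacosNamingInterceptor. */
interface SketchInterceptor<T extends SketchInterceptable> {
    boolean isInterceptType(Class<?> type);

    boolean intercept(T object);

    int order();
}

/** Assumed chain behaviour: lowest order first, the first interceptor returning true wins. */
final class SketchInterceptorChain<T extends SketchInterceptable> {

    private final List<SketchInterceptor<T>> interceptors = new ArrayList<>();

    void add(SketchInterceptor<T> interceptor) {
        interceptors.add(interceptor);
        // Integer.compare avoids the overflow that "a - b" would cause near Integer.MIN_VALUE
        interceptors.sort((a, b) -> Integer.compare(a.order(), b.order()));
    }

    void doInterceptor(T task) {
        for (SketchInterceptor<T> each : interceptors) {
            if (!each.isInterceptType(task.getClass())) {
                continue;
            }
            if (each.intercept(task)) {
                task.afterIntercept();
                return;
            }
        }
        task.passIntercept();
    }
}

final class InterceptorChainDemo {

    /** Runs one task through a two-interceptor chain and reports which callback fired. */
    static String run(boolean healthCheckEnabled, boolean responsible) {
        StringBuilder log = new StringBuilder();
        SketchInterceptable task = new SketchInterceptable() {
            @Override
            public void passIntercept() {
                log.append("pass");
            }

            @Override
            public void afterIntercept() {
                log.append("after");
            }
        };
        SketchInterceptorChain<SketchInterceptable> chain = new SketchInterceptorChain<>();
        chain.add(fixed(Integer.MIN_VALUE + 1, !responsible));
        chain.add(fixed(Integer.MIN_VALUE, !healthCheckEnabled));
        chain.doInterceptor(task);
        return log.toString();
    }

    private static SketchInterceptor<SketchInterceptable> fixed(int orderValue, boolean intercepts) {
        return new SketchInterceptor<SketchInterceptable>() {
            @Override
            public boolean isInterceptType(Class<?> type) {
                return SketchInterceptable.class.isAssignableFrom(type);
            }

            @Override
            public boolean intercept(SketchInterceptable object) {
                return intercepts;
            }

            @Override
            public int order() {
                return orderValue;
            }
        };
    }
}
```

With this sketch, InterceptorChainDemo.run(true, true) reaches passIntercept(), while disabling either flag ends in afterIntercept().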

AbstractHealthCheckInterceptor

This abstract class restricts its subclasses to intercepting only tasks of type NacosHealthCheckTask.

/**
 * Abstract health check interceptor.
 *
 * @author xiweng.yy
 */
public abstract class AbstractHealthCheckInterceptor implements NacosNamingInterceptor<NacosHealthCheckTask> {
    
    @Override
    public boolean isInterceptType(Class<?> type) {
        return NacosHealthCheckTask.class.isAssignableFrom(type);
    }
}

HealthCheckResponsibleInterceptor

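Intercepts tasks of type NacosHealthCheckTask and decides whether the current node is responsible for the task; if it is not, the task is intercepted. Its order is Integer.MIN_VALUE + 1, one step below the highest priority.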

/**
 * Health check responsible interceptor.
 *
 * @author xiweng.yy
 */
public class HealthCheckResponsibleInterceptor extends AbstractHealthCheckInterceptor {
    
    @Override
    public boolean intercept(NacosHealthCheckTask object) {
        return !ApplicationUtils.getBean(DistroMapper.class).responsible(object.getTaskId());
    }
    
    @Override
    public int order() {
        return Integer.MIN_VALUE + 1;
    }
}

HealthCheckEnableInterceptor

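Intercepts tasks of type NacosHealthCheckTask and checks whether health checking is enabled on the current node. The task is also intercepted when UpgradeJudgement reports that gRPC features are not in use, or when either lookup throws. It has the highest priority (order Integer.MIN_VALUE).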

/**
 * Health check enable interceptor.
 *
 * @author xiweng.yy
 */
public class HealthCheckEnableInterceptor extends AbstractHealthCheckInterceptor {

    @Override
    public boolean intercept(NacosHealthCheckTask object) {
        try {
            return !ApplicationUtils.getBean(SwitchDomain.class).isHealthCheckEnabled() || !ApplicationUtils
                    .getBean(UpgradeJudgement.class).isUseGrpcFeatures();
        } catch (Exception e) {
            return true;
        }
    }

    @Override
    public int order() {
        return Integer.MIN_VALUE;
    }
}

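These two implementations show that AbstractHealthCheckInterceptor is mainly used to decide whether an intercepted NacosHealthCheckTask should go on to the remaining logic. Clearly the highest-priority interceptor, HealthCheckEnableInterceptor, directly determines whether the task continues at all.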

AbstractBeatCheckInterceptor

This abstract class restricts its subclasses to intercepting only tasks of type InstanceBeatCheckTask.

/**
 * Abstract Beat check Interceptor.
 *
 * @author xiweng.yy
 */
public abstract class AbstractBeatCheckInterceptor implements NacosNamingInterceptor<InstanceBeatCheckTask> {

    @Override
    public boolean isInterceptType(Class<?> type) {
        // Only InstanceBeatCheckTask (and its subtypes) is intercepted
        return InstanceBeatCheckTask.class.isAssignableFrom(type);
    }
}

InstanceEnableBeatCheckInterceptor

Intercepts tasks of type InstanceBeatCheckTask and determines whether heartbeat checking is enabled for the current Instance.

/**
 * Instance enable beat check interceptor.
 *
 * @author xiweng.yy
 */
public class InstanceEnableBeatCheckInterceptor extends AbstractBeatCheckInterceptor {

    @Override
    public boolean intercept(InstanceBeatCheckTask object) {
        NamingMetadataManager metadataManager = ApplicationUtils.getBean(NamingMetadataManager.class);
        HealthCheckInstancePublishInfo instance = object.getInstancePublishInfo();
        Optional<InstanceMetadata> metadata = metadataManager.getInstanceMetadata(object.getService(), instance.getMetadataId());
        if (metadata.isPresent() && metadata.get().getExtendData().containsKey(UtilsAndCommons.ENABLE_CLIENT_BEAT)) {
            return ConvertUtils.toBoolean(metadata.get().getExtendData().get(UtilsAndCommons.ENABLE_CLIENT_BEAT).toString());
        }
        if (instance.getExtendDatum().containsKey(UtilsAndCommons.ENABLE_CLIENT_BEAT)) {
            return ConvertUtils.toBoolean(instance.getExtendDatum().get(UtilsAndCommons.ENABLE_CLIENT_BEAT).toString());
        }
        return false;
    }

    @Override
    public int order() {
        return Integer.MIN_VALUE + 1;
    }
}

ServiceEnableBeatCheckInterceptor

Intercepts tasks of type InstanceBeatCheckTask and determines whether heartbeat checking is enabled for the current Service.

/**
 * Service enable beat check interceptor.
 *
 * @author xiweng.yy
 */
public class ServiceEnableBeatCheckInterceptor extends AbstractBeatCheckInterceptor {

    @Override
    public boolean intercept(InstanceBeatCheckTask object) {
        NamingMetadataManager metadataManager = ApplicationUtils.getBean(NamingMetadataManager.class);
        Optional<ServiceMetadata> metadata = metadataManager.getServiceMetadata(object.getService());
        if (metadata.isPresent() && metadata.get().getExtendData().containsKey(UtilsAndCommons.ENABLE_CLIENT_BEAT)) {
            return Boolean.parseBoolean(metadata.get().getExtendData().get(UtilsAndCommons.ENABLE_CLIENT_BEAT));
        }
        return false;
    }

    @Override
    public int order() {
        return Integer.MIN_VALUE;
    }
}

InstanceBeatCheckResponsibleInterceptor

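Intercepts tasks of type InstanceBeatCheckTask and determines whether the current node is responsible for this Instance's heartbeat check task; if it is not, the task is intercepted and nothing further is done.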

/**
 * Instance responsibility check interceptor.
 *
 * @author gengtuo.ygt
 * on 2021/3/24
 */
public class InstanceBeatCheckResponsibleInterceptor extends AbstractBeatCheckInterceptor {

    @Override
    public boolean intercept(InstanceBeatCheckTask object) {
        return !ApplicationUtils.getBean(DistroMapper.class).responsible(object.getClient().getResponsibleId());
    }

    @Override
    public int order() {
        return Integer.MIN_VALUE + 2;
    }

}

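Note that the metadata-based subclasses of AbstractBeatCheckInterceptor (InstanceEnableBeatCheckInterceptor and ServiceEnableBeatCheckInterceptor) return false by default. Look at where they read their data from: it all comes from metadata, and by default registering a service does not carry this much data. Since false means "not intercepted", the task normally falls through to passIntercept() and the checkers run. This raises a question: if these interceptors do not actually verify that heartbeats are enabled, how is the heartbeat check driven, and is this check still useful?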
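
To make the default-false behaviour concrete, here is a minimal sketch of the lookup order used by InstanceEnableBeatCheckInterceptor: instance metadata first, then the instance's extend data, then false. It is an illustration under assumptions: plain maps replace the Nacos metadata types, Boolean.parseBoolean stands in for ConvertUtils.toBoolean, and the key string is a placeholder for UtilsAndCommons.ENABLE_CLIENT_BEAT.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class EnableBeatLookupSketch {

    /** Placeholder key; the real constant is UtilsAndCommons.ENABLE_CLIENT_BEAT. */
    static final String KEY = "enableClientBeat";

    /** Mirrors the lookup order of the interceptor: metadata, then extend data, then false. */
    static boolean intercept(Optional<Map<String, Object>> instanceMetadata, Map<String, Object> extendDatum) {
        if (instanceMetadata.isPresent() && instanceMetadata.get().containsKey(KEY)) {
            return Boolean.parseBoolean(String.valueOf(instanceMetadata.get().get(KEY)));
        }
        if (extendDatum.containsKey(KEY)) {
            return Boolean.parseBoolean(String.valueOf(extendDatum.get(KEY)));
        }
        // Nothing configured: not intercepted, so the checkers will run
        return false;
    }

    /** Convenience wrapper for experiments; a null argument means "key absent". */
    static boolean demo(String metadataValue, String datumValue) {
        Optional<Map<String, Object>> metadata = Optional.empty();
        if (metadataValue != null) {
            Map<String, Object> map = new HashMap<>();
            map.put(KEY, metadataValue);
            metadata = Optional.of(map);
        }
        Map<String, Object> datum = new HashMap<>();
        if (datumValue != null) {
            datum.put(KEY, datumValue);
        }
        return intercept(metadata, datum);
    }
}
```

An instance registered without this key yields false from both lookups, which is the default path the paragraph above describes.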

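Task Series (Task)

In the com.alibaba.nacos.naming.healthcheck package of the nacos-naming module there are four kinds of Task: NacosTask, HealthCheckTask, NacosHealthCheckTask, and BeatCheckTask. In Nacos every operation is modeled as a Task, which is also an effective way to achieve high performance.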

NacosTask

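NacosTask is the unified interface for Tasks inside Nacos. Essentially all system-level tasks are implemented through its subtypes. These fall into two kinds, AbstractExecuteTask and AbstractDelayTask, representing tasks that run immediately and tasks that run after a delay, which divides the task hierarchy more finely. The interface itself defines whether a task should be processed.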

/**
 * Nacos task.
 *
 * @author xiweng.yy
 */
public interface NacosTask {
    
    /**
     * Judge Whether this nacos task should do.
     *
     * @return true means the nacos task should be done, otherwise false
     */
    boolean shouldProcess();
}

AbstractExecuteTask

A task that must be executed immediately.

/**
 * Abstract task which should be executed immediately.
 *
 * @author xiweng.yy
 */
public abstract class AbstractExecuteTask implements NacosTask, Runnable {
    
    @Override
    public boolean shouldProcess() {
        return true;
    }
}
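
ClientBeatCheckTaskV2

The v2 client beat check task. For every Service published by the client, it builds an InstanceBeatCheckTask and hands it to the interceptor chain.
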
/**
 * Client beat check task of service for version 2.x.
 * @author nkorange
 */
public class ClientBeatCheckTaskV2 extends AbstractExecuteTask implements BeatCheckTask, NacosHealthCheckTask {

    private final IpPortBasedClient client;

    private final String taskId;
    /**
     * The interceptor chain.
     */
    private final InstanceBeatCheckTaskInterceptorChain interceptorChain;

    public ClientBeatCheckTaskV2(IpPortBasedClient client) {
        this.client = client;
        this.taskId = client.getResponsibleId();
        this.interceptorChain = InstanceBeatCheckTaskInterceptorChain.getInstance();
    }

    public GlobalConfig getGlobalConfig() {
        return ApplicationUtils.getBean(GlobalConfig.class);
    }

    @Override
    public String taskKey() {
        return KeyBuilder.buildServiceMetaKey(client.getClientId(), String.valueOf(client.isEphemeral()));
    }

    @Override
    public String getTaskId() {
        return taskId;
    }

    @Override
    public void doHealthCheck() {
        try {
            // Get all Services published by this Client
            Collection<Service> services = client.getAllPublishedService();
            for (Service each : services) {
                // Get the InstancePublishInfo for the Service
                HealthCheckInstancePublishInfo instance = (HealthCheckInstancePublishInfo) client.getInstancePublishInfo(each);
                // Create an InstanceBeatCheckTask and hand it to the interceptor chain
                interceptorChain.doInterceptor(new InstanceBeatCheckTask(client, each, instance));
            }
        } catch (Exception e) {
            Loggers.SRV_LOG.warn("Exception while processing client beat time out.", e);
        }
    }

    @Override
    public void run() {
        doHealthCheck();
    }

    @Override
    public void passIntercept() {
        doHealthCheck();
    }

    @Override
    public void afterIntercept() {
    }
}

ClientBeatUpdateTask

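The v2 task that refreshes all instances under a Client: it updates each instance's last heartbeat time and the Client's last-updated time.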

/**
 * Client beat update task.
 *
 * @author xiweng.yy
 */
public class ClientBeatUpdateTask extends AbstractExecuteTask {

    private final IpPortBasedClient client;

    public ClientBeatUpdateTask(IpPortBasedClient client) {
        this.client = client;
    }

    @Override
    public void run() {
        // Take the current time and use it as the latest active time of the Client and of every Instance under it
        long currentTime = System.currentTimeMillis();
        for (InstancePublishInfo each : client.getAllInstancePublishInfo()) {
            ((HealthCheckInstancePublishInfo) each).setLastHeartBeatTime(currentTime);
        }
        client.setLastUpdatedTime();
    }
}

HealthCheckTaskV2

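The v2 health check task. Extending AbstractExecuteTask means it is executed immediately; implementing NacosHealthCheckTask means it can be handled by interceptors. As analyzed earlier, the interceptors for NacosHealthCheckTask only check whether health checking is enabled and whether the current node is responsible for the task.

Judging by the actual execution logic, the health check task runs in a loop: unless it has been cancelled, it reschedules itself after every run. According to the comment in the class, it still uses the same logic as v1; it remains to be seen what changes in later versions.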

/**
 * Health check task for v2.x.
 *
 * <p>Current health check logic is same as v1.x. TODO refactor health check for v2.x.
 *
 * @author nacos
 */
public class HealthCheckTaskV2 extends AbstractExecuteTask implements NacosHealthCheckTask {

    /**
     * The client object (the client that provides a service for applications to access).
     * As this shows, health check tasks are started per client.
     */
    private final IpPortBasedClient client;

    private final String taskId;

    private final SwitchDomain switchDomain;

    private final NamingMetadataManager metadataManager;

    private long checkRtNormalized = -1;

    /**
     * Best check response time.
     */
    private long checkRtBest = -1;

    /**
     * Worst check response time.
     */
    private long checkRtWorst = -1;

    /**
     * Response time of the last check.
     */
    private long checkRtLast = -1;

    /**
     * Response time of the check before the last one.
     */
    private long checkRtLastLast = -1;

    /**
     * Start time.
     */
    private long startTime;

    /**
     * Whether the task has been cancelled.
     */
    private volatile boolean cancelled = false;

    public HealthCheckTaskV2(IpPortBasedClient client) {
        this.client = client;
        this.taskId = client.getResponsibleId();
        this.switchDomain = ApplicationUtils.getBean(SwitchDomain.class);
        this.metadataManager = ApplicationUtils.getBean(NamingMetadataManager.class);
        // Initialize the response-time (RT) statistics
        initCheckRT();
    }

    /**
     * Initialize the response-time (RT) values.
     */
    private void initCheckRT() {
        // first check time delay
        // 2000 + a random value below 5000
        checkRtNormalized = 2000 + RandomUtils.nextInt(0, RandomUtils.nextInt(0, switchDomain.getTcpHealthParams().getMax()));
        // Best RT starts at the maximum value
        checkRtBest = Long.MAX_VALUE;
        // Worst RT starts at 0
        checkRtWorst = 0L;
    }

    public IpPortBasedClient getClient() {
        return client;
    }

    @Override
    public String getTaskId() {
        return taskId;
    }

    /**
     * Run the health check task.
     */
    @Override
    public void doHealthCheck() {
        try {
            // Iterate over all Services published by this Client
            for (Service each : client.getAllPublishedService()) {
                // Run only if health checking is enabled for the Service
                if (switchDomain.isHealthCheckEnabled(each.getGroupedServiceName())) {
                    // Get the InstancePublishInfo for the Service
                    InstancePublishInfo instancePublishInfo = client.getInstancePublishInfo(each);
                    // Get the cluster metadata
                    ClusterMetadata metadata = getClusterMetadata(each, instancePublishInfo);
                    // Hand the task to the processor delegate
                    ApplicationUtils.getBean(HealthCheckProcessorV2Delegate.class).process(this, each, metadata);
                    if (Loggers.EVT_LOG.isDebugEnabled()) {
                        Loggers.EVT_LOG.debug("[HEALTH-CHECK] schedule health check task: {}", client.getClientId());
                    }
                }
            }
        } catch (Throwable e) {
            Loggers.SRV_LOG.error("[HEALTH-CHECK] error while process health check for {}", client.getClientId(), e);
        } finally {
            // If the task has not been cancelled, schedule it again
            if (!cancelled) {
                HealthCheckReactor.scheduleCheck(this);
                // worst == 0 means never checked
                if (this.getCheckRtWorst() > 0) {
                    // TLog doesn't support float so we must convert it into long
                    long checkRtLastLast = getCheckRtLastLast();
                    this.setCheckRtLastLast(this.getCheckRtLast());
                    if (checkRtLastLast > 0) {
                        long diff = ((this.getCheckRtLast() - this.getCheckRtLastLast()) * 10000) / checkRtLastLast;
                        if (Loggers.CHECK_RT.isDebugEnabled()) {
                            Loggers.CHECK_RT.debug("{}->normalized: {}, worst: {}, best: {}, last: {}, diff: {}",
                                    client.getClientId(), this.getCheckRtNormalized(), this.getCheckRtWorst(),
                                    this.getCheckRtBest(), this.getCheckRtLast(), diff);
                        }
                    }
                }
            }
        }
    }

    @Override
    public void passIntercept() {
        // Not intercepted: run the health check
        doHealthCheck();
    }

    @Override
    public void afterIntercept() {
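        // Intercepted: if the task has not been cancelled, schedule it again. The task therefore keeps looping,
        // and the loop survives rounds in which an interceptor skipped the actual check.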
        if (!cancelled) {
            HealthCheckReactor.scheduleCheck(this);
        }
    }

    @Override
    public void run() {
        // Run the health check
        doHealthCheck();
    }

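    /**
     * Get the cluster metadata.
     *
     * @param service             the service
     * @param instancePublishInfo the publish info (IP and so on) of the service's instance
     * @return the cluster metadata, or an empty ClusterMetadata if none is found
     */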
    private ClusterMetadata getClusterMetadata(Service service, InstancePublishInfo instancePublishInfo) {
        Optional<ServiceMetadata> serviceMetadata = metadataManager.getServiceMetadata(service);
        if (!serviceMetadata.isPresent()) {
            return new ClusterMetadata();
        }
        String cluster = instancePublishInfo.getCluster();
        ClusterMetadata result = serviceMetadata.get().getClusters().get(cluster);
        return null == result ? new ClusterMetadata() : result;
    }

    // ... some getters/setters omitted
}
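
The looping behaviour described above comes from the task rescheduling itself in a finally block. Here is a minimal sketch of that pattern. To keep it deterministic, a plain queue (the hypothetical SketchReactor) stands in for HealthCheckReactor and its scheduled thread pool; none of these names are Nacos classes.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical stand-in for HealthCheckReactor: a queue instead of a scheduled executor. */
final class SketchReactor {

    private final Queue<Runnable> pending = new ArrayDeque<>();

    void scheduleCheck(Runnable task) {
        pending.add(task);
    }

    /** Runs at most maxRuns queued tasks and returns how many actually ran. */
    int drain(int maxRuns) {
        int runs = 0;
        while (runs < maxRuns && !pending.isEmpty()) {
            pending.poll().run();
            runs++;
        }
        return runs;
    }
}

final class SelfReschedulingTask implements Runnable {

    private final SketchReactor reactor;

    private volatile boolean cancelled = false;

    private int executions = 0;

    SelfReschedulingTask(SketchReactor reactor) {
        this.reactor = reactor;
    }

    @Override
    public void run() {
        try {
            // The real task would call the processor delegate here
            executions++;
        } finally {
            // Same shape as doHealthCheck(): reschedule only while not cancelled
            if (!cancelled) {
                reactor.scheduleCheck(this);
            }
        }
    }

    void cancel() {
        cancelled = true;
    }

    int executions() {
        return executions;
    }
}

final class SelfReschedulingDemo {

    /** Lets the task run a few times, cancels it, then drains whatever is still queued. */
    static int runThenCancel(int runsBeforeCancel) {
        SketchReactor reactor = new SketchReactor();
        SelfReschedulingTask task = new SelfReschedulingTask(reactor);
        reactor.scheduleCheck(task);
        reactor.drain(runsBeforeCancel);
        task.cancel();
        reactor.drain(10);
        return task.executions();
    }
}
```

In this sketch, cancelling does not remove a run that is already queued: the task executes one more time and then stops rescheduling itself.
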

AbstractDelayTask

A task whose execution can be delayed.

/**
 * Abstract task which can delay and merge.
 *
 * @author huali
 * @author xiweng.yy
 */
public abstract class AbstractDelayTask implements NacosTask {
    
    /**
     * Task time interval between twice processing, unit is millisecond.
     */
    private long taskInterval;
    
    /**
     * The time which was processed at last time, unit is millisecond.
     */
    private long lastProcessTime;
    
    /**
     * merge task.
     * See the subclass implementations for how tasks are merged.
     * @param task task
     */
    public abstract void merge(AbstractDelayTask task);
    
    public void setTaskInterval(long interval) {
        this.taskInterval = interval;
    }
    
    public long getTaskInterval() {
        return this.taskInterval;
    }
    
    public void setLastProcessTime(long lastProcessTime) {
        this.lastProcessTime = lastProcessTime;
    }
    
    public long getLastProcessTime() {
        return this.lastProcessTime;
    }
    
    @Override
    public boolean shouldProcess() {
        return (System.currentTimeMillis() - this.lastProcessTime >= this.taskInterval);
    }
    
}
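
The shouldProcess() comparison is easy to try out in isolation. The sketch below repeats it with the clock passed in as a parameter; that parameter and the class name are assumptions made for illustration, since the real class reads System.currentTimeMillis() directly.

```java
/** Hypothetical, clock-injected copy of the shouldProcess() comparison. */
final class DelayWindowSketch {

    private final long taskInterval;

    private long lastProcessTime;

    DelayWindowSketch(long taskIntervalMillis, long lastProcessTimeMillis) {
        this.taskInterval = taskIntervalMillis;
        this.lastProcessTime = lastProcessTimeMillis;
    }

    /** True once at least taskInterval milliseconds have passed since the last processing. */
    boolean shouldProcess(long nowMillis) {
        return nowMillis - lastProcessTime >= taskInterval;
    }

    /** Records that the task was processed at the given time. */
    void markProcessed(long nowMillis) {
        lastProcessTime = nowMillis;
    }
}
```

With an interval of 1000 ms and a last processing time of 5000, the task is not due at 5999 but is due at 6000, because the comparison is inclusive.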

Note:
Delayed tasks are not covered further in this health check article, because health check tasks are all executed immediately.

NacosHealthCheckTask

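The Task used for health checking; it defines the basic health check methods. Extending Interceptable means it can be handled by interceptors. Extending Runnable means it can be submitted to a thread executor for scheduling (it is a runnable task, not a thread itself).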

/**
 * Nacos health check task.
 *
 * @author xiweng.yy
 */
public interface NacosHealthCheckTask extends Interceptable, Runnable {
    
    /**
     * Get task id.
     * 
     * @return task id.
     */
    String getTaskId();
    
    /**
     * Do health check.
     */
    void doHealthCheck();
}

ClientBeatCheckTaskV2

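See the ClientBeatCheckTaskV2 subsection under NacosTask above.

HealthCheckTaskV2

See the HealthCheckTaskV2 subsection under NacosTask above.

Functionally, ClientBeatCheckTaskV2 checks the state of heartbeat requests, while HealthCheckTaskV2 checks the state of other connections. Their internal logic also differs clearly: the former is handled by interceptors, the latter by processors.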

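Checker Series (Checker)

As I understand it, Checkers complement interceptors: when a task has not been intercepted but some checks are still needed, a Checker can perform them. The InstanceBeatCheckTask class shows this: it calls its checkers inside passIntercept().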

InstanceBeatChecker

A checker is responsible for checking the instance passed to it.

/**
 * Instance heart beat checker.
 *
 * @author xiweng.yy
 */
public interface InstanceBeatChecker {
    
    /**
     * Do check for input instance.
     *
     * @param client   client
     * @param service  service of instance
     * @param instance instance publish info
     */
    void doCheck(Client client, Service service, HealthCheckInstancePublishInfo instance);
}

ExpiredInstanceChecker

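The expired-instance checker: it checks whether an instance has expired and, if so, removes it from the Client's list of published services.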

/**
 * Instance beat checker for expired instance.
 *
 * <p>Delete the instance if has expired.
 *
 * @author xiweng.yy
 */
public class ExpiredInstanceChecker implements InstanceBeatChecker {

    /**
     * Perform the check.
     *
     * @param client   client
     * @param service  service of instance
     * @param instance instance publish info
     */
    @Override
    public void doCheck(Client client, Service service, HealthCheckInstancePublishInfo instance) {
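        // Whether instance expiry is enabled globally
        boolean expireInstance = ApplicationUtils.getBean(GlobalConfig.class).isExpireInstance();
        // If expiry is enabled and the instance has expired
        if (expireInstance && isExpireInstance(service, instance)) {
            // Remove it from the Client's list of published services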
            deleteIp(client, service, instance);
        }
    }

    /**
     * Check whether the instance has expired.
     * @param service
     * @param instance
     * @return
     */
    private boolean isExpireInstance(Service service, HealthCheckInstancePublishInfo instance) {
        long deleteTimeout = getTimeout(service, instance);
        return System.currentTimeMillis() - instance.getLastHeartBeatTime() > deleteTimeout;
    }

    /**
     * Get the expiry timeout.
     * @param service
     * @param instance
     * @return
     */
    private long getTimeout(Service service, InstancePublishInfo instance) {
        Optional<Object> timeout = getTimeoutFromMetadata(service, instance);
        if (!timeout.isPresent()) {
            timeout = Optional.ofNullable(instance.getExtendDatum().get(PreservedMetadataKeys.IP_DELETE_TIMEOUT));
        }
        return timeout.map(ConvertUtils::toLong).orElse(Constants.DEFAULT_IP_DELETE_TIMEOUT);
    }

    /**
     * Read the expiry timeout from the instance metadata.
     * @param service
     * @param instance
     * @return
     */
    private Optional<Object> getTimeoutFromMetadata(Service service, InstancePublishInfo instance) {
        Optional<InstanceMetadata> instanceMetadata = ApplicationUtils.getBean(NamingMetadataManager.class)
                .getInstanceMetadata(service, instance.getMetadataId());
        return instanceMetadata.map(metadata -> metadata.getExtendData().get(PreservedMetadataKeys.IP_DELETE_TIMEOUT));
    }

    /**
     * Remove the instance and publish a deregistration event.
     * @param client
     * @param service
     * @param instance
     */
    private void deleteIp(Client client, Service service, InstancePublishInfo instance) {
        Loggers.SRV_LOG.info("[AUTO-DELETE-IP] service: {}, ip: {}", service.toString(), JacksonUtils.toJson(instance));
        client.removeServiceInstance(service);
        NotifyCenter.publishEvent(new ClientOperationEvent.ClientDeregisterServiceEvent(service, client.getClientId()));
    }
}

UnhealthyInstanceChecker

Checks whether an instance is healthy; if it is not, updates its status and publishes events.

/**
 * Instance beat checker for unhealthy instances.
 *
 * <p>Mark these instances healthy status {@code false} if beat time out.
 *
 * @author xiweng.yy
 */
public class UnhealthyInstanceChecker implements InstanceBeatChecker {

    /**
     * Perform the check.
     *
     * @param client   client
     * @param service  service of instance
     * @param instance instance publish info
     */
    @Override
    public void doCheck(Client client, Service service, HealthCheckInstancePublishInfo instance) {
        // If the instance is currently marked healthy but its heartbeat has timed out, change its status
        if (instance.isHealthy() && isUnhealthy(service, instance)) {
            changeHealthyStatus(client, service, instance);
        }
    }

    /**
     * Decide from the instance's last heartbeat time whether the heartbeat has timed out.
     * @param service
     * @param instance
     * @return
     */
    private boolean isUnhealthy(Service service, HealthCheckInstancePublishInfo instance) {
        long beatTimeout = getTimeout(service, instance);
        return System.currentTimeMillis() - instance.getLastHeartBeatTime() > beatTimeout;
    }

    /**
     * Get the heartbeat timeout.
     * @param service
     * @param instance
     * @return
     */
    private long getTimeout(Service service, InstancePublishInfo instance) {
        Optional<Object> timeout = getTimeoutFromMetadata(service, instance);
        if (!timeout.isPresent()) {
            timeout = Optional.ofNullable(instance.getExtendDatum().get(PreservedMetadataKeys.HEART_BEAT_TIMEOUT));
        }
        return timeout.map(ConvertUtils::toLong).orElse(Constants.DEFAULT_HEART_BEAT_TIMEOUT);
    }

    /**
     * Read the heartbeat timeout from the instance metadata.
     * @param service
     * @param instance
     * @return
     */
    private Optional<Object> getTimeoutFromMetadata(Service service, InstancePublishInfo instance) {
        Optional<InstanceMetadata> instanceMetadata = ApplicationUtils.getBean(NamingMetadataManager.class)
                .getInstanceMetadata(service, instance.getMetadataId());
        return instanceMetadata.map(metadata -> metadata.getExtendData().get(PreservedMetadataKeys.HEART_BEAT_TIMEOUT));
    }

    /**
     * Update the health status.
     * @param client
     * @param service
     * @param instance
     */
    private void changeHealthyStatus(Client client, Service service, HealthCheckInstancePublishInfo instance) {
        // Mark the instance as unhealthy
        instance.setHealthy(false);
        Loggers.EVT_LOG
                .info("{POS} {IP-DISABLED} valid: {}:{}@{}@{}, region: {}, msg: client last beat: {}", instance.getIp(),
                        instance.getPort(), instance.getCluster(), service.getName(), UtilsAndCommons.LOCALHOST_SITE,
                        instance.getLastHeartBeatTime());
        // Publish the service-changed and client-changed events
        NotifyCenter.publishEvent(new ServiceEvent.ServiceChangedEvent(service));
        NotifyCenter.publishEvent(new ClientEvent.ClientChangedEvent(client));
    }
}

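Note:
Unhealthy means the heartbeat has timed out, but not for long enough to remove the instance immediately. The default heartbeat timeout is 15 seconds.

  • The Unhealthy timeout can be read from metadata under the key PreservedMetadataKeys.HEART_BEAT_TIMEOUT
  • Otherwise the system default is used: Constants.DEFAULT_HEART_BEAT_TIMEOUT

Expired means the instance has reached the maximum allowed time without a heartbeat. Once this limit is reached, the server no longer waits for a heartbeat and removes the instance. The default expiry time is 30 seconds.

  • The Expired timeout can be read from metadata under the key PreservedMetadataKeys.IP_DELETE_TIMEOUT
  • Otherwise the system default is used: Constants.DEFAULT_IP_DELETE_TIMEOUT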
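
The two thresholds can be pictured as one classification over the time since the last heartbeat. The sketch below is a simplification under assumptions: it collapses the two checkers into a single hypothetical function, hard-codes the 15 s and 30 s defaults quoted above, and ignores the GlobalConfig.isExpireInstance() switch.

```java
/** Hypothetical helper that classifies an instance by the time since its last heartbeat. */
final class BeatStateSketch {

    enum State { HEALTHY, UNHEALTHY, EXPIRED }

    /** Defaults quoted in the note above, in milliseconds. */
    static final long DEFAULT_HEART_BEAT_TIMEOUT_MS = 15_000L;

    static final long DEFAULT_IP_DELETE_TIMEOUT_MS = 30_000L;

    static State classify(long nowMs, long lastBeatMs, long beatTimeoutMs, long deleteTimeoutMs) {
        long silence = nowMs - lastBeatMs;
        // Both checkers use a strict "greater than" comparison
        if (silence > deleteTimeoutMs) {
            return State.EXPIRED;
        }
        if (silence > beatTimeoutMs) {
            return State.UNHEALTHY;
        }
        return State.HEALTHY;
    }

    static State classifyWithDefaults(long nowMs, long lastBeatMs) {
        return classify(nowMs, lastBeatMs, DEFAULT_HEART_BEAT_TIMEOUT_MS, DEFAULT_IP_DELETE_TIMEOUT_MS);
    }
}
```

Because the comparison is strict, an instance that has been silent for exactly 15 000 ms still counts as healthy; it becomes unhealthy one millisecond later and is eligible for removal after more than 30 000 ms.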

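Task Processor Series (Processor)

Processors handle HealthCheckTaskV2 tasks and incoming client heartbeats.

BeatProcessor

Handles the instance heartbeats that are received.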

/**
 * Thread to update ephemeral instance triggered by client beat.
 *
 * @author xiweng.yy
 */
public interface BeatProcessor extends Runnable {

}

ClientBeatProcessorV2

The v2 heartbeat processor.

/**
 * Thread to update ephemeral instance triggered by client beat for v2.x.
 *
 * @author nkorange
 */
public class ClientBeatProcessorV2 implements BeatProcessor {

    private final String namespace;

    private final RsInfo rsInfo;

    /**
     * The Client object: each processor works on behalf of exactly one Client.
     */
    private final IpPortBasedClient client;

    public ClientBeatProcessorV2(String namespace, RsInfo rsInfo, IpPortBasedClient ipPortBasedClient) {
        this.namespace = namespace;
        this.rsInfo = rsInfo;
        this.client = ipPortBasedClient;
    }

    @Override
    public void run() {
        if (Loggers.EVT_LOG.isDebugEnabled()) {
            Loggers.EVT_LOG.debug("[CLIENT-BEAT] processing beat: {}", rsInfo.toString());
        }
        // Build the Service from the heartbeat info and look up the instance
        String ip = rsInfo.getIp();
        int port = rsInfo.getPort();
        String serviceName = NamingUtils.getServiceName(rsInfo.getServiceName());
        String groupName = NamingUtils.getGroupName(rsInfo.getServiceName());
        Service service = Service.newService(namespace, groupName, serviceName, rsInfo.isEphemeral());
        HealthCheckInstancePublishInfo instance = (HealthCheckInstancePublishInfo) client.getInstancePublishInfo(service);

        // Only handle the beat if it comes from the instance held by this processor's Client
        if (instance.getIp().equals(ip) && instance.getPort() == port) {
            if (Loggers.EVT_LOG.isDebugEnabled()) {
                Loggers.EVT_LOG.debug("[CLIENT-BEAT] refresh beat: {}", rsInfo.toString());
            }
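            // A heartbeat has arrived: record the current time as the latest heartbeat time
            instance.setLastHeartBeatTime(System.currentTimeMillis());
            // If the instance is marked unhealthy, mark it healthy again. This Client owns the instance and the
            // earlier timeout may just have been network delay; a received heartbeat means the instance is alive.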
            if (!instance.isHealthy()) {
                instance.setHealthy(true);
                Loggers.EVT_LOG.info("service: {} {POS} {IP-ENABLED} valid: {}:{}@{}, region: {}, msg: client beat ok",
                        rsInfo.getServiceName(), ip, port, rsInfo.getCluster(), UtilsAndCommons.LOCALHOST_SITE);
                // Publish the service-changed and client-changed events
                NotifyCenter.publishEvent(new ServiceEvent.ServiceChangedEvent(service));
                NotifyCenter.publishEvent(new ClientEvent.ClientChangedEvent(client));
            }
        }
    }
}
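
The core of run() is a small state transition: every beat refreshes the timestamp, and only an unhealthy instance triggers a status change and events. A minimal sketch of that transition follows, with a hypothetical class in which a counter stands in for the NotifyCenter events.

```java
/** Hypothetical model of the per-instance state that a heartbeat updates. */
final class BeatRefreshSketch {

    private boolean healthy;

    private long lastHeartBeatTime;

    /** Counts status changes; the real code publishes events instead. */
    private int changeEvents = 0;

    BeatRefreshSketch(boolean healthy, long lastHeartBeatTime) {
        this.healthy = healthy;
        this.lastHeartBeatTime = lastHeartBeatTime;
    }

    /** Always refresh the timestamp; flip the status only when recovering from unhealthy. */
    void onBeat(long nowMs) {
        lastHeartBeatTime = nowMs;
        if (!healthy) {
            healthy = true;
            changeEvents++;
        }
    }

    boolean isHealthy() {
        return healthy;
    }

    long getLastHeartBeatTime() {
        return lastHeartBeatTime;
    }

    int getChangeEvents() {
        return changeEvents;
    }
}
```

Repeated beats on a healthy instance therefore only move the timestamp forward and raise no further change events.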

HealthCheckProcessorV2

The v2 health check processor. It can only be used to process tasks of type HealthCheckTaskV2.

/**
 * Health check processor for v2.x.
 *
 * @author nkorange
 */
public interface HealthCheckProcessorV2 {
    
    /**
     * Run check task for service.
     *
     * @param task     health check task v2
     * @param service  service of current process
     * @param metadata cluster metadata of current process
     */
    void process(HealthCheckTaskV2 task, Service service, ClusterMetadata metadata);
    
    /**
     * Get check task type, refer to enum HealthCheckType.
     *
     * @return check type
     */
    String getType();
}

Note:
The processors analyzed here are the v2 ones, located in the com.alibaba.nacos.naming.healthcheck.v2.processor package. Do not confuse them with the v1 versions.

HealthCheckProcessorV2Delegate

Uses the delegate pattern to manage processors of different types.

/**
 * Delegate of health check v2.x.
 *
 * @author nacos
 */
@Component("healthCheckDelegateV2")
public class HealthCheckProcessorV2Delegate implements HealthCheckProcessorV2 {

    /**
     * The registered processors, keyed by type.
     */
    private final Map<String, HealthCheckProcessorV2> healthCheckProcessorMap = new HashMap<>();

    public HealthCheckProcessorV2Delegate(HealthCheckExtendProvider provider) {
        // Initialize SPI extension loading to pick up user-defined processors and checkers
        provider.init();
    }

    @Autowired
    public void addProcessor(Collection<HealthCheckProcessorV2> processors) {
        // Register the processors in the map, keyed by the type they handle
        healthCheckProcessorMap.putAll(processors.stream().filter(processor -> processor.getType() != null)
                .collect(Collectors.toMap(HealthCheckProcessorV2::getType, processor -> processor)));
    }

    @Override
    public void process(HealthCheckTaskV2 task, Service service, ClusterMetadata metadata) {
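        // Read the health check type from the cluster metadata
        String type = metadata.getHealthyCheckType();
        // Look up the processor registered for that type
        HealthCheckProcessorV2 processor = healthCheckProcessorMap.get(type);
        // If none is found, fall back to the default processor, which does nothing
        if (processor == null) {
            processor = healthCheckProcessorMap.get(NoneHealthCheckProcessor.TYPE);
        }
        // Let the processor handle the task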
        processor.process(task, service, metadata);
    }

    @Override
    public String getType() {
        return null;
    }
}
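
The dispatch logic of the delegate can be reduced to a map lookup with a fallback. The sketch below illustrates that idea only: SketchProcessor and SketchDelegate are hypothetical simplifications, the processors return a string instead of running a check, and "NONE" is a placeholder for NoneHealthCheckProcessor.TYPE.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified processor contract. */
interface SketchProcessor {
    String process(String clientId);

    String getType();
}

final class SketchDelegate {

    /** Placeholder for NoneHealthCheckProcessor.TYPE. */
    static final String NONE = "NONE";

    private final Map<String, SketchProcessor> processors = new HashMap<>();

    /** Processors without a type (such as the delegate itself) are skipped. */
    void addProcessor(SketchProcessor processor) {
        if (processor.getType() != null) {
            processors.put(processor.getType(), processor);
        }
    }

    /** Dispatches by type and falls back to the NONE processor for unknown types. */
    String process(String type, String clientId) {
        SketchProcessor processor = processors.get(type);
        if (processor == null) {
            processor = processors.get(NONE);
        }
        return processor.process(clientId);
    }
}

final class DelegateDemo {

    static String dispatch(String type) {
        SketchDelegate delegate = new SketchDelegate();
        delegate.addProcessor(named("TCP"));
        delegate.addProcessor(named("HTTP"));
        delegate.addProcessor(named(SketchDelegate.NONE));
        return delegate.process(type, "client-1");
    }

    private static SketchProcessor named(String type) {
        return new SketchProcessor() {
            @Override
            public String process(String clientId) {
                return type + ":" + clientId;
            }

            @Override
            public String getType() {
                return type;
            }
        };
    }
}
```

As in the original class, the fallback only works if a processor for the none type has actually been registered; otherwise the lookup returns null and the call fails.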

HttpHealthCheckProcessor

The HTTP health check processor. Its logic is currently the same as in v1 and is due to be refactored later.

/**
 * TCP health check processor for v2.x.
 * This is actually the HTTP health check processor; the original comment above looks copy-pasted from the TCP one.
 * <p>Current health check logic is same as v1.x. TODO refactor health check for v2.x.
 *
 * @author xiweng.yy
 */
@Component
public class HttpHealthCheckProcessor implements HealthCheckProcessorV2 {

    /**
     * The check type this processor handles.
     */
    public static final String TYPE = HealthCheckType.HTTP.name();

    /**
     * Async REST template used to send the HTTP requests.
     */
    private static final NacosAsyncRestTemplate ASYNC_REST_TEMPLATE = HttpClientManager.getProcessorNacosAsyncRestTemplate();

    /**
     * Shared health check helper methods.
     */
    private final HealthCheckCommonV2 healthCheckCommon;

    private final SwitchDomain switchDomain;

    public HttpHealthCheckProcessor(HealthCheckCommonV2 healthCheckCommon, SwitchDomain switchDomain) {
        this.healthCheckCommon = healthCheckCommon;
        this.switchDomain = switchDomain;
    }

    @Override
    public void process(HealthCheckTaskV2 task, Service service, ClusterMetadata metadata) {
        // Get the InstancePublishInfo this client published for the service
        HealthCheckInstancePublishInfo instance = (HealthCheckInstancePublishInfo) task.getClient().getInstancePublishInfo(service);
        if (null == instance) {
            return;
        }
        try {
            // TODO handle marked(white list) logic like v1.x.
            if (!instance.tryStartCheck()) {
                SRV_LOG.warn("http check started before last one finished, service: {} : {} : {}:{}",
                        service.getGroupedServiceName(), instance.getCluster(), instance.getIp(), instance.getPort());
                // The previous check is still running: re-evaluate the check RT with twice the normalized value
                healthCheckCommon
                        .reEvaluateCheckRT(task.getCheckRtNormalized() * 2, task, switchDomain.getHttpHealthParams());
                return;
            }
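            // Get the configured HTTP health checker
            Http healthChecker = (Http) metadata.getHealthChecker();
            // Work out the address to probe: the instance port or the cluster-level check port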
            int ckPort = metadata.isUseInstancePortForCheck() ? instance.getPort() : metadata.getHealthyCheckPort();
            URL host = new URL("http://" + instance.getIp() + ":" + ckPort);
            URL target = new URL(host, healthChecker.getPath());
            Map<String, String> customHeaders = healthChecker.getCustomHeaders();
            Header header = Header.newInstance();
            header.addAll(customHeaders);
            // Send the HTTP request asynchronously
            ASYNC_REST_TEMPLATE.get(target.toString(), header, Query.EMPTY, String.class,
                    new HttpHealthCheckCallback(instance, task, service));
            MetricsMonitor.getHttpHealthCheckMonitor().incrementAndGet();
        } catch (Throwable e) {
            instance.setCheckRt(switchDomain.getHttpHealthParams().getMax());
            healthCheckCommon.checkFail(task, service, "http:error:" + e.getMessage());
            healthCheckCommon.reEvaluateCheckRT(switchDomain.getHttpHealthParams().getMax(), task,
                    switchDomain.getHttpHealthParams());
        }
    }

    @Override
    public String getType() {
        return TYPE;
    }

    /**
     * Callback that handles the health check response.
     */
    private class HttpHealthCheckCallback implements Callback<String> {

        private final HealthCheckTaskV2 task;

        private final Service service;

        private final HealthCheckInstancePublishInfo instance;

        private long startTime = System.currentTimeMillis();

        public HttpHealthCheckCallback(HealthCheckInstancePublishInfo instance, HealthCheckTaskV2 task,
                Service service) {
            this.instance = instance;
            this.task = task;
            this.service = service;
        }

        @Override
        public void onReceive(RestResult<String> result) {
            // Record the response time of this check
            instance.setCheckRt(System.currentTimeMillis() - startTime);
            int httpCode = result.getCode();
            // Classify the result by HTTP status code
            if (HttpURLConnection.HTTP_OK == httpCode) {
                healthCheckCommon.checkOk(task, service, "http:" + httpCode);
                healthCheckCommon.reEvaluateCheckRT(System.currentTimeMillis() - startTime, task,
                        switchDomain.getHttpHealthParams());
            } else if (HttpURLConnection.HTTP_UNAVAILABLE == httpCode
                    || HttpURLConnection.HTTP_MOVED_TEMP == httpCode) {
                // server is busy, need verification later
                healthCheckCommon.checkFail(task, service, "http:" + httpCode);
                healthCheckCommon
                        .reEvaluateCheckRT(task.getCheckRtNormalized() * 2, task, switchDomain.getHttpHealthParams());
            } else {
                //probably means the state files has been removed by administrator
                healthCheckCommon.checkFailNow(task, service, "http:" + httpCode);
                healthCheckCommon.reEvaluateCheckRT(switchDomain.getHttpHealthParams().getMax(), task,
                        switchDomain.getHttpHealthParams());
            }
        }

        @Override
        public void onError(Throwable throwable) {
            Throwable cause = throwable;
            instance.setCheckRt(System.currentTimeMillis() - startTime);
            int maxStackDepth = 50;
            for (int deepth = 0; deepth < maxStackDepth && cause != null; deepth++) {
                if (HttpUtils.isTimeoutException(cause)) {
                    healthCheckCommon.checkFail(task, service, "http:" + cause.getMessage());
                    healthCheckCommon.reEvaluateCheckRT(task.getCheckRtNormalized() * 2, task,
                            switchDomain.getHttpHealthParams());
                    return;
                }
                cause = cause.getCause();
            }

            // connection error, probably not reachable
            if (throwable instanceof ConnectException) {
                healthCheckCommon.checkFailNow(task, service, "http:unable2connect:" + throwable.getMessage());
            } else {
                healthCheckCommon.checkFail(task, service, "http:error:" + throwable.getMessage());
            }
            healthCheckCommon.reEvaluateCheckRT(switchDomain.getHttpHealthParams().getMax(), task,
                    switchDomain.getHttpHealthParams());
        }

        @Override
        public void onCancel() {

        }
    }
}
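The three branches of onReceive() amount to a small classification of the HTTP status code. The sketch below isolates that decision; HttpResultSketch and its Outcome enum are hypothetical names, and the mapping of outcomes to checkOk, checkFail and checkFailNow is read off the listing above rather than taken from any Nacos API.

```java
import java.net.HttpURLConnection;

class HttpResultSketch {

    // OK maps to checkOk, FAIL to checkFail, FAIL_NOW to checkFailNow in the listing.
    enum Outcome { OK, FAIL, FAIL_NOW }

    static Outcome classify(int httpCode) {
        if (httpCode == HttpURLConnection.HTTP_OK) {
            return Outcome.OK;
        }
        if (httpCode == HttpURLConnection.HTTP_UNAVAILABLE || httpCode == HttpURLConnection.HTTP_MOVED_TEMP) {
            // 503 or 302: the server may just be busy, so verify again later.
            return Outcome.FAIL;
        }
        // Any other status is treated as a definite failure.
        return Outcome.FAIL_NOW;
    }

    public static void main(String[] args) {
        for (int code : new int[] {200, 302, 503, 404}) {
            System.out.println(code + " -> " + classify(code));
        }
    }
}
```

Judging by the method names, checkFail tolerates a transient failure while checkFailNow acts on it at once; that reading is an assumption, since HealthCheckCommonV2 is not shown here.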

TcpHealthCheckProcessor

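Health check processor for TCP. Internally it uses NIO for network communication.

First, an overview diagram shows how this processor works.

  • The blue parts are the contents of the TcpHealthCheckProcessor class itself
  • The orange parts show the state of the inner class objects while they execute
  • Dashed lines denote asynchronous execution

In this processor the characteristics of NIO and of health checking fit together closely. The two comparisons below describe what each class does when viewed in each role.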

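From the task-processing point of view:

TcpHealthCheckProcessor: starts the work and creates multiple TaskProcessors.
TaskProcessor: handles one heartbeat task and creates one TimeOutTask.
TimeOutTask: handles heartbeat check timeouts.
PostProcessor: checks whether the heartbeat succeeded.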

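From the NIO connection point of view:

TcpHealthCheckProcessor: plays the role of the NIO Selector; it does in fact hold a Selector internally.
TaskProcessor: plays the role of an NIO Channel. Each TaskProcessor has its own Channel. It only creates the Channel and initiates the connection; it does not check the outcome.
TimeOutTask: also tied to an NIO Channel. It is created by a TaskProcessor, holds the same Channel as that TaskProcessor, and checks whether the connection timed out.
PostProcessor: also tied to an NIO Channel. It picks up a Channel that is ready, which is the one created by a TaskProcessor, and checks whether the connection succeeded.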

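Role diagram:

TcpHealthCheckProcessor is a task processor whose entry point is process(), and also a Runnable whose entry point is run(). As a processor it maintains an internal queue and implements a producer/consumer model: process() is called from outside and produces data, while run() consumes it on the thread that the constructor starts by submitting this object to an executor.

The constructor opens a Selector and starts that thread. Each call to process() puts a Beat into taskQueue (producer). In run(), Beats are taken from taskQueue, each one is wrapped in a TaskProcessor, and the accumulated TaskProcessors are invoked as a batch (consumer). When a TaskProcessor runs, it creates its own Channel, registers it with the Selector, and schedules a TimeOutTask for delayed execution.

The TaskProcessors are handed to an executor rather than run inline. Once the batch invocation returns, run() asks the Selector for the Channels those TaskProcessors registered that are now ready, creates a PostProcessor for each one, and executes it asynchronously. A PostProcessor checks the connection state of its Channel and settles the heartbeat result accordingly; a successful connection to the target server counts as a successful heartbeat. When its delay elapses, the TimeOutTask checks the state of the same Channel again and handles the timeout case.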

@Component
public class TcpHealthCheckProcessor implements HealthCheckProcessorV2, Runnable {

    /**
     * This processor handles tasks of type TCP.
     */
    public static final String TYPE = HealthCheckType.TCP.name();

    /**
     * Connection timeout in milliseconds.
     */
    public static final int CONNECT_TIMEOUT_MS = 500;

    /**
     * Number of NIO threads.
     * this value has been carefully tuned, do not modify unless you're confident.
     */
    private static final int NIO_THREAD_COUNT = EnvUtil.getAvailableProcessors(0.5);

    /**
     * because some hosts doesn't support keep-alive connections, disabled temporarily.
     */
    private static final long TCP_KEEP_ALIVE_MILLIS = 0;

    /**
     * Shared helper methods for v2 health checks.
     */
    private final HealthCheckCommonV2 healthCheckCommon;

    private final SwitchDomain switchDomain;

    private final Map<String, BeatKey> keyMap = new ConcurrentHashMap<>();

    /**
     * Blocking queue of TCP heartbeat tasks, used to implement the producer/consumer pattern.
     */
    private final BlockingQueue<Beat> taskQueue = new LinkedBlockingQueue<>();

    /**
     * NIO multiplexer that watches many channels at once; here, one connection per heartbeat check.
     */
    private final Selector selector;

    public TcpHealthCheckProcessor(HealthCheckCommonV2 healthCheckCommon, SwitchDomain switchDomain) {
        this.healthCheckCommon = healthCheckCommon;
        this.switchDomain = switchDomain;
        try {
            // Open the Selector
            selector = Selector.open();
            // Submit this object to an executor, starting it as the consumer: the loop in run() keeps consuming queued tasks
            GlobalExecutor.submitTcpCheck(this);
        } catch (Exception e) {
            throw new IllegalStateException("Error while initializing SuperSense(TM).");
        }
    }

    /**
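     * As a processor, it exposes process(), which handles a task by putting it on the queue for later consumption.
     * As a running thread, it acts as the consumer and keeps taking tasks from the queue to execute.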
     * @param task     health check task v2
     * @param service  service of current process
     * @param metadata cluster metadata of current process
     */
    @Override
    public void process(HealthCheckTaskV2 task, Service service, ClusterMetadata metadata) {

        // Get the health check info of the instance
        HealthCheckInstancePublishInfo instance = (HealthCheckInstancePublishInfo) task.getClient()
                .getInstancePublishInfo(service);
        if (null == instance) {
            return;
        }
        // TODO handle marked(white list) logic like v1.x.
        if (!instance.tryStartCheck()) {
            SRV_LOG.warn("tcp check started before last one finished, service: {} : {} : {}:{}",
                    service.getGroupedServiceName(), instance.getCluster(), instance.getIp(), instance.getPort());
            healthCheckCommon
                    .reEvaluateCheckRT(task.getCheckRtNormalized() * 2, task, switchDomain.getTcpHealthParams());
            return;
        }
        // Producer side: every call to process() puts one Beat on the queue
        taskQueue.add(new Beat(task, service, metadata, instance));
        MetricsMonitor.getTcpHealthCheckMonitor().incrementAndGet();
    }

    @Override
    public String getType() {
        return TYPE;
    }

    /**
     * Process queued TCP health check tasks.
     * @throws Exception if a batched task fails or the thread is interrupted
     */
    private void processTask() throws Exception {

        // One TaskProcessor is created per Beat and collected here
        Collection<Callable<Void>> tasks = new LinkedList<>();

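        /**
         * Drain heartbeat tasks from the queue.
         * The loop continues while:
         * 1. the queue still holds data, and
         * 2. fewer than NIO_THREAD_COUNT * 64 tasks have been collected, i.e. half the CPU cores times 64 (for example 8 * 0.5 * 64 = 256 on an 8-core CPU)
         */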
        do {
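            // Take an element from the queue, waiting at most 250 ms
            Beat beat = taskQueue.poll(CONNECT_TIMEOUT_MS / 2, TimeUnit.MILLISECONDS);
            // Nothing arrived in time: return so that run() can move on to the Selector
            if (beat == null) {
                return;
            }
            // Collect the task so the whole batch can be invoked at once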
            tasks.add(new TaskProcessor(beat));

        } while (taskQueue.size() > 0 && tasks.size() < NIO_THREAD_COUNT * 64);

        // Invoke all collected tasks in one batch and wait for them to finish
        for (Future<?> f : GlobalExecutor.invokeAllTcpSuperSenseTask(tasks)) {
            f.get();
        }
    }

    @Override
    public void run() {
        while (true) {
            try {
                processTask();
                // Non-blocking select: returns the number of channels that are ready for I/O
                int readyCount = selector.selectNow();
                if (readyCount <= 0) {
                    continue;
                }

                // Handle each ready SelectionKey
                Iterator<SelectionKey> iter = selector.selectedKeys().iterator();
                while (iter.hasNext()) {
                    SelectionKey key = iter.next();
                    iter.remove();

                    GlobalExecutor.executeTcpSuperSense(new PostProcessor(key));
                }
            } catch (Throwable e) {
                SRV_LOG.error("[HEALTH-CHECK] error while processing NIO task", e);
            }
        }
    }
}
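The producer/consumer batching done by process() and processTask() can be sketched with nothing but a BlockingQueue. This is a simplified illustration under stated assumptions: BatchDrainSketch, produce and nextBatch are hypothetical names, plain strings stand in for Beat objects, and the batch is returned to the caller instead of being submitted to an executor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class BatchDrainSketch {

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Upper bound on the batch size, like NIO_THREAD_COUNT * 64 in the listing.
    private final int maxBatch;

    BatchDrainSketch(int maxBatch) {
        this.maxBatch = maxBatch;
    }

    // Producer side: corresponds to process() adding a Beat to taskQueue.
    void produce(String beat) {
        queue.add(beat);
    }

    // Consumer side: waits up to waitMillis for the first element, then keeps
    // draining while the queue is non-empty and the batch is not full.
    List<String> nextBatch(long waitMillis) {
        List<String> batch = new ArrayList<>();
        try {
            do {
                String beat = queue.poll(waitMillis, TimeUnit.MILLISECONDS);
                if (beat == null) {
                    return batch;
                }
                batch.add(beat);
            } while (!queue.isEmpty() && batch.size() < maxBatch);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and hand back what was collected so far.
            Thread.currentThread().interrupt();
        }
        return batch;
    }

    public static void main(String[] args) {
        BatchDrainSketch sketch = new BatchDrainSketch(2);
        sketch.produce("beat-1");
        sketch.produce("beat-2");
        sketch.produce("beat-3");
        System.out.println(sketch.nextBatch(250));
        System.out.println(sketch.nextBatch(250));
    }
}
```

The timed poll is what keeps the consumer loop from spinning when the queue is empty, while the size cap bounds how much work is handed to the executor in one go.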
TaskProcessor(Inner Class)

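Each TaskProcessor carries one Beat and runs its call() method when invoked as part of a batch. It creates a Channel for the Beat to connect to the server hosting the instance, and it also schedules a TimeOutTask to run after a delay and check whether the connection timed out. A connection timeout counts as a heartbeat timeout.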

/**
 * Task processor.
 */
private class TaskProcessor implements Callable<Void> {

	/**
	 * Maximum queue wait of 500 ms before a warning is logged.
	 */
	private static final int MAX_WAIT_TIME_MILLISECONDS = 500;

	/**
	 * The heartbeat to process.
	 */
	Beat beat;

	public TaskProcessor(Beat beat) {
		this.beat = beat;
	}

	@Override
	public Void call() {
		// How long this task has been waiting
		long waited = System.currentTimeMillis() - beat.getStartTime();
		// Log a warning if it waited longer than 500 ms
		if (waited > MAX_WAIT_TIME_MILLISECONDS) {
			Loggers.SRV_LOG.warn("beat task waited too long: " + waited + "ms");
		}

		SocketChannel channel = null;
		try {
			HealthCheckInstancePublishInfo instance = beat.getInstance();

			BeatKey beatKey = keyMap.get(beat.toString());
			if (beatKey != null && beatKey.key.isValid()) {
				if (System.currentTimeMillis() - beatKey.birthTime < TCP_KEEP_ALIVE_MILLIS) {
					instance.finishCheck();
					return null;
				}

				beatKey.key.cancel();
				beatKey.key.channel().close();
			}

			channel = SocketChannel.open();
			channel.configureBlocking(false);
			// only by setting this can we make the socket close event asynchronous
			channel.socket().setSoLinger(false, -1);
			channel.socket().setReuseAddress(true);
			channel.socket().setKeepAlive(true);
			channel.socket().setTcpNoDelay(true);

			ClusterMetadata cluster = beat.getMetadata();
			int port = cluster.isUseInstancePortForCheck() ? instance.getPort() : cluster.getHealthyCheckPort();
			channel.connect(new InetSocketAddress(instance.getIp(), port));
			// Register the Channel with the Selector
			SelectionKey key = channel.register(selector, SelectionKey.OP_CONNECT | SelectionKey.OP_READ);
			key.attach(beat);
			keyMap.put(beat.toString(), new BeatKey(key));
			// Record the start time of this heartbeat check
			beat.setStartTime(System.currentTimeMillis());
			// Schedule the timeout task. Passing the SelectionKey lets the TimeOutTask see the connection state of this heartbeat
			GlobalExecutor
					.scheduleTcpSuperSenseTask(new TimeOutTask(key), CONNECT_TIMEOUT_MS, TimeUnit.MILLISECONDS);
		} catch (Exception e) {
			// Mark the check as failed
			beat.finishCheck(false, false, switchDomain.getTcpHealthParams().getMax(),
					"tcp:error:" + e.getMessage());

			if (channel != null) {
				try {
					// Close the channel
					channel.close();
				} catch (Exception ignore) {
				}
			}
		}

		return null;
	}
}
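The core NIO technique used by TaskProcessor and PostProcessor is a non-blocking connect whose completion is observed through a Selector. The sketch below compresses both halves into one blocking helper for clarity. It is an illustration only: NioConnectProbe and probe are hypothetical names, and unlike Nacos it uses a private Selector per probe and waits for the result instead of sharing one Selector across many channels.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

class NioConnectProbe {

    // Returns true if a TCP connection to host:port completes within timeoutMillis.
    // timeoutMillis must be positive: select(0) would block indefinitely.
    static boolean probe(String host, int port, long timeoutMillis) {
        try (Selector selector = Selector.open(); SocketChannel channel = SocketChannel.open()) {
            channel.configureBlocking(false);
            // In non-blocking mode connect() usually returns false and finishes later,
            // but it may complete immediately, for example on a loopback address.
            if (channel.connect(new InetSocketAddress(host, port))) {
                return true;
            }
            channel.register(selector, SelectionKey.OP_CONNECT);
            if (selector.select(timeoutMillis) == 0) {
                // Nothing became ready in time: treat as a timeout.
                return false;
            }
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isValid() && key.isConnectable()) {
                    // finishConnect() throws (e.g. ConnectException) if the attempt failed.
                    return ((SocketChannel) key.channel()).finishConnect();
                }
            }
            return false;
        } catch (IOException e) {
            // Connection refused, unreachable host, and similar errors.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical target; replace with a host and port that exist in your environment.
        System.out.println(probe("127.0.0.1", 8848, 500));
    }
}
```

Splitting the two halves the way Nacos does, with connect in one task and finishConnect in another, lets a single Selector thread supervise many probes at once.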
PostProcessor(Inner Class)

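After TcpHealthCheckProcessor passes in a SelectionKey that is ready, the PostProcessor retrieves the attached Beat and settles the heartbeat state based on the state of the connection.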

public class PostProcessor implements Runnable {

	SelectionKey key;

	public PostProcessor(SelectionKey key) {
		this.key = key;
	}

	@Override
	public void run() {
		Beat beat = (Beat) key.attachment();
		SocketChannel channel = (SocketChannel) key.channel();
		try {
			// If the beat is older than 30 seconds (see Beat.isHealthy()), close its channel
			if (!beat.isHealthy()) {
				//invalid beat means this server is no longer responsible for the current service
				key.cancel();
				key.channel().close();

				beat.finishCheck();
				return;
			}

			// Is the key ready to complete its connection?
			if (key.isValid() && key.isConnectable()) {
				//connected
				// Complete the connection; this throws if the attempt failed
				channel.finishConnect();
				// Report the heartbeat as successful
				beat.finishCheck(true, false, System.currentTimeMillis() - beat.getTask().getStartTime(),
						"tcp:ok+");
			}

			// Is the key's channel ready for reading?
			if (key.isValid() && key.isReadable()) {
				//disconnected
				// Read from the channel into a buffer
				ByteBuffer buffer = ByteBuffer.allocate(128);
				if (channel.read(buffer) == -1) {
					key.cancel();
					key.channel().close();
				} else {
					// not terminate request, ignore
					// Data was read from the channel: ignore it and keep the connection open
					SRV_LOG.warn(
							"Tcp check ok, but the connected server responses some msg. Connection won't be closed.");
				}
			}
		} catch (ConnectException e) {
			// unable to connect, possibly port not opened
			beat.finishCheck(false, true, switchDomain.getTcpHealthParams().getMax(),
					"tcp:unable2connect:" + e.getMessage());
		} catch (Exception e) {
			beat.finishCheck(false, false, switchDomain.getTcpHealthParams().getMax(),
					"tcp:error:" + e.getMessage());

			try {
				// An exception occurred: close the connection
				key.cancel();
				key.channel().close();
			} catch (Exception ignore) {
			}
		}
	}
}
Beat(Inner Class)
/**
 * Heartbeat object.
 *
 * Note the HealthCheckTaskV2 task passed to the constructor:
 * later processing steps call back into this original task.
 */
private class Beat {

	private final HealthCheckTaskV2 task;

	private final Service service;

	private final ClusterMetadata metadata;

	private final HealthCheckInstancePublishInfo instance;

	long startTime = System.currentTimeMillis();

	public Beat(HealthCheckTaskV2 task, Service service, ClusterMetadata metadata,
			HealthCheckInstancePublishInfo instance) {
		this.task = task;
		this.service = service;
		this.metadata = metadata;
		this.instance = instance;
	}

	public void setStartTime(long time) {
		startTime = time;
	}

	public long getStartTime() {
		return startTime;
	}

	public HealthCheckTaskV2 getTask() {
		return task;
	}

	public Service getService() {
		return service;
	}

	public ClusterMetadata getMetadata() {
		return metadata;
	}

	public HealthCheckInstancePublishInfo getInstance() {
		return instance;
	}

	public boolean isHealthy() {
		return System.currentTimeMillis() - startTime < TimeUnit.SECONDS.toMillis(30L);
	}

	/**
	 * finish check only, no ip state will be changed.
	 */
	public void finishCheck() {
		instance.finishCheck();
	}

	public void finishCheck(boolean success, boolean now, long rt, String msg) {
		if (success) {
			healthCheckCommon.checkOk(task, service, msg);
		} else {
			if (now) {
				healthCheckCommon.checkFailNow(task, service, msg);
			} else {
				healthCheckCommon.checkFail(task, service, msg);
			}

			keyMap.remove(toString());
		}

		healthCheckCommon.reEvaluateCheckRT(rt, task, switchDomain.getTcpHealthParams());
	}

	@Override
	public String toString() {
		return service.getGroupedServiceName() + ":" + instance.getCluster() + ":" + instance.getIp() + ":"
				+ instance.getPort();
	}

	@Override
	public int hashCode() {
		return toString().hashCode();
	}

	@Override
	public boolean equals(Object obj) {
		if (!(obj instanceof Beat)) {
			return false;
		}

		return this.toString().equals(obj.toString());
	}
}
BeatKey(Inner Class)

Records when a connection was created.

private static class BeatKey {

	public SelectionKey key;

	public long birthTime;

	public BeatKey(SelectionKey key) {
		this.key = key;
		this.birthTime = System.currentTimeMillis();
	}
}

TimeOutTask(Inner Class)

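Handles connection timeouts. It is created by a TaskProcessor, which also passes in the Channel to operate on (via the SelectionKey); the two have a one-to-one relationship.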

/**
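 * Timeout task. The check has not necessarily timed out when this task is created;
 * the task itself decides whether a timeout occurred and handles it.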
 */
private static class TimeOutTask implements Runnable {

	SelectionKey key;

	public TimeOutTask(SelectionKey key) {
		this.key = key;
	}

	@Override
	public void run() {
		if (key != null && key.isValid()) {
			// Get the channel of this heartbeat check
			SocketChannel channel = (SocketChannel) key.channel();
			// Get the Beat attached at registration time
			Beat beat = (Beat) key.attachment();
			// A connected channel means there was no timeout, so nothing more needs to be done
			if (channel.isConnected()) {
				return;
			}

			try {
				// Try to complete the connection
				channel.finishConnect();
			} catch (Exception ignore) {
			}

			try {
				// Report the check as failed and cancel the key so the Selector stops handling this channel
				beat.finishCheck(false, false, beat.getTask().getCheckRtNormalized() * 2, "tcp:timeout");
				key.cancel();
				key.channel().close();
			} catch (Exception ignore) {
			}
		}
	}
}
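The idea behind TimeOutTask, a delayed task that only acts if the check has not settled by then, can be shown without any networking. The sketch below swaps the channel state for an AtomicReference, so it is an analogy rather than the Nacos code: TimeoutGuardSketch, complete, armTimeout and newDaemonScheduler are hypothetical names.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class TimeoutGuardSketch {

    static final String PENDING = "pending";

    // Holds the outcome of one check; the first writer wins.
    private final AtomicReference<String> state = new AtomicReference<>(PENDING);

    // Called when the real result arrives. Returns false if the check was already settled.
    boolean complete(String result) {
        return state.compareAndSet(PENDING, result);
    }

    // Schedules the delayed timeout check. It marks the check as timed out only if
    // no result has been recorded by the time it runs.
    ScheduledFuture<?> armTimeout(ScheduledExecutorService scheduler, long delayMillis) {
        return scheduler.schedule(() -> {
            state.compareAndSet(PENDING, "tcp:timeout");
        }, delayMillis, TimeUnit.MILLISECONDS);
    }

    String state() {
        return state.get();
    }

    // Daemon threads so that a forgotten shutdown cannot keep the JVM alive.
    static ScheduledExecutorService newDaemonScheduler() {
        return Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread thread = new Thread(runnable, "timeout-guard");
            thread.setDaemon(true);
            return thread;
        });
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = newDaemonScheduler();
        TimeoutGuardSketch check = new TimeoutGuardSketch();
        ScheduledFuture<?> timeout = check.armTimeout(scheduler, 500);
        check.complete("tcp:ok");
        timeout.get();
        System.out.println(check.state());
        scheduler.shutdownNow();
    }
}
```

As in the listing, the timeout task is always scheduled and decides for itself whether it still has anything to do when it finally runs.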

MysqlHealthCheckProcessor

/**
 * TCP health check processor for v2.x.
 * Health check for MySQL clusters.
 * <p>Current health check logic is same as v1.x. TODO refactor health check for v2.x.
 *
 * @author xiweng.yy
 */
@Component
@SuppressWarnings("PMD.ThreadPoolCreationRule")
public class MysqlHealthCheckProcessor implements HealthCheckProcessorV2 {

    /**
     * The check type this processor handles.
     */
    public static final String TYPE = HealthCheckType.MYSQL.name();

    /**
     * Shared health check helper methods.
     */
    private final HealthCheckCommonV2 healthCheckCommon;

    private final SwitchDomain switchDomain;

    /**
     * Connection timeout in milliseconds.
     */
    public static final int CONNECT_TIMEOUT_MS = 500;

    /**
     * SQL statement that reads the read_only variable, used to tell a master from a slave.
     */
    private static final String CHECK_MYSQL_MASTER_SQL = "show global variables where variable_name='read_only'";

    /**
     * read_only value reported by a MySQL slave.
     */
    private static final String MYSQL_SLAVE_READONLY = "ON";

    /**
     * Cache of open database connections.
     */
    private static final ConcurrentMap<String, Connection> CONNECTION_POOL = new ConcurrentHashMap<String, Connection>();

    public MysqlHealthCheckProcessor(HealthCheckCommonV2 healthCheckCommon, SwitchDomain switchDomain) {
        this.healthCheckCommon = healthCheckCommon;
        this.switchDomain = switchDomain;
    }

    @Override
    public String getType() {
        return TYPE;
    }

    @Override
    public void process(HealthCheckTaskV2 task, Service service, ClusterMetadata metadata) {

        // Get the instance this client published for the service
        HealthCheckInstancePublishInfo instance = (HealthCheckInstancePublishInfo) task.getClient()
                .getInstancePublishInfo(service);
        if (null == instance) {
            return;
        }
        SRV_LOG.debug("mysql check, ip:" + instance);
        try {
            // TODO handle marked(white list) logic like v1.x.
            if (!instance.tryStartCheck()) {
                SRV_LOG.warn("mysql check started before last one finished, service: {} : {} : {}:{}",
                        service.getGroupedServiceName(), instance.getCluster(), instance.getIp(), instance.getPort());
                healthCheckCommon
                        .reEvaluateCheckRT(task.getCheckRtNormalized() * 2, task, switchDomain.getMysqlHealthParams());
                return;
            }
            // Create a MySQL check task and execute it
            GlobalExecutor.executeMysqlCheckTask(new MysqlCheckTask(task, service, instance, metadata));
            MetricsMonitor.getMysqlHealthCheckMonitor().incrementAndGet();
        } catch (Exception e) {
            instance.setCheckRt(switchDomain.getMysqlHealthParams().getMax());
            healthCheckCommon.checkFail(task, service, "mysql:error:" + e.getMessage());
            healthCheckCommon.reEvaluateCheckRT(switchDomain.getMysqlHealthParams().getMax(), task,
                    switchDomain.getMysqlHealthParams());
        }
    }

    /**
     * MySQL check task.
     */
    private class MysqlCheckTask implements Runnable {

        private final HealthCheckTaskV2 task;

        private final Service service;

        private final HealthCheckInstancePublishInfo instance;

        private final ClusterMetadata metadata;

        private long startTime = System.currentTimeMillis();

        public MysqlCheckTask(HealthCheckTaskV2 task, Service service, HealthCheckInstancePublishInfo instance,
                ClusterMetadata metadata) {
            this.task = task;
            this.service = service;
            this.instance = instance;
            this.metadata = metadata;
        }

        @Override
        public void run() {

            Statement statement = null;
            ResultSet resultSet = null;

            try {
                String clusterName = instance.getCluster();
                // Build the key used to cache the connection
                String key =
                        service.getGroupedServiceName() + ":" + clusterName + ":" + instance.getIp() + ":" + instance
                                .getPort();
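                // Look up a cached MySQL connection
                Connection connection = CONNECTION_POOL.get(key);
                // Get the configured MySQL health checker
                Mysql config = (Mysql) metadata.getHealthChecker();
                // Create and cache a connection if none is cached or the cached one is closed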
                if (connection == null || connection.isClosed()) {
                    String url = "jdbc:mysql://" + instance.getIp() + ":" + instance.getPort() + "?connectTimeout="
                            + CONNECT_TIMEOUT_MS + "&socketTimeout=" + CONNECT_TIMEOUT_MS + "&loginTimeout=" + 1;
                    connection = DriverManager.getConnection(url, config.getUser(), config.getPwd());
                    CONNECTION_POOL.put(key, connection);
                }

                statement = connection.createStatement();
                statement.setQueryTimeout(1);

                resultSet = statement.executeQuery(config.getCmd());
                int resultColumnIndex = 2;

                // Is the configured command the master-check statement?
                if (CHECK_MYSQL_MASTER_SQL.equals(config.getCmd())) {
                    resultSet.next();
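                    /**
                     * Decide from the query result whether this node is a master or a slave.
                     * CHECK_MYSQL_MASTER_SQL returns a row of the form [read_only : ON/OFF].
                     * MYSQL_SLAVE_READONLY is ON, so a result of ON means the node is a slave (a master is not expected to be read-only).
                     */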
                    if (MYSQL_SLAVE_READONLY.equals(resultSet.getString(resultColumnIndex))) {
                        throw new IllegalStateException("current node is slave!");
                    }
                }
                // Report a successful check
                healthCheckCommon.checkOk(task, service, "mysql:+ok");
                healthCheckCommon.reEvaluateCheckRT(System.currentTimeMillis() - startTime, task,
                        switchDomain.getMysqlHealthParams());
            } catch (SQLException e) {
                // fail immediately
                healthCheckCommon.checkFailNow(task, service, "mysql:" + e.getMessage());
                healthCheckCommon.reEvaluateCheckRT(switchDomain.getHttpHealthParams().getMax(), task,
                        switchDomain.getMysqlHealthParams());
            } catch (Throwable t) {
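                // Walk the cause chain (at most 50 levels): a timeout doubles the check RT, any other error sets it to the maximum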
                Throwable cause = t;
                int maxStackDepth = 50;
                for (int deepth = 0; deepth < maxStackDepth && cause != null; deepth++) {
                    if (cause instanceof SocketTimeoutException || cause instanceof ConnectTimeoutException
                            || cause instanceof TimeoutException || cause.getCause() instanceof TimeoutException) {

                        healthCheckCommon.checkFail(task, service, "mysql:timeout:" + cause.getMessage());
                        healthCheckCommon.reEvaluateCheckRT(task.getCheckRtNormalized() * 2, task,
                                switchDomain.getMysqlHealthParams());
                        return;
                    }

                    cause = cause.getCause();
                }

                // connection error, probably not reachable
                healthCheckCommon.checkFail(task, service, "mysql:error:" + t.getMessage());
                healthCheckCommon.reEvaluateCheckRT(switchDomain.getMysqlHealthParams().getMax(), task,
                        switchDomain.getMysqlHealthParams());
            } finally {
                instance.setCheckRt(System.currentTimeMillis() - startTime);
                if (statement != null) {
                    try {
                        statement.close();
                    } catch (SQLException e) {
                        Loggers.SRV_LOG.error("[MYSQL-CHECK] failed to close statement:" + statement, e);
                    }
                }
                if (resultSet != null) {
                    try {
                        resultSet.close();
                    } catch (SQLException e) {
                        Loggers.SRV_LOG.error("[MYSQL-CHECK] failed to close resultSet:" + resultSet, e);
                    }
                }
            }
        }
    }
}
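The loop in the catch (Throwable t) branch is a bounded walk over the exception cause chain. The sketch below extracts that pattern on its own; CauseChainSketch and causedByTimeout are hypothetical names, and only JDK exception types are checked, whereas the listing also tests for ConnectTimeoutException.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeoutException;

class CauseChainSketch {

    // Returns true if t or one of its causes, looking at most maxDepth levels deep,
    // is a timeout. The depth limit guards against very long or cyclic cause chains.
    static boolean causedByTimeout(Throwable t, int maxDepth) {
        Throwable cause = t;
        for (int depth = 0; depth < maxDepth && cause != null; depth++) {
            if (cause instanceof SocketTimeoutException || cause instanceof TimeoutException) {
                return true;
            }
            cause = cause.getCause();
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException("query failed",
                new IllegalStateException("driver error", new SocketTimeoutException("read timed out")));
        System.out.println(causedByTimeout(wrapped, 50));
        System.out.println(causedByTimeout(new RuntimeException("boom"), 50));
    }
}
```

Drivers and HTTP clients often wrap the original timeout in their own exception types, which is presumably why the code inspects the whole chain rather than only the outermost exception.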

NoneHealthCheckProcessor

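Fallback processor that does nothing with the task. Note that the listing below implements HealthCheckProcessor with process(HealthCheckTask task), which is the v1-style interface rather than HealthCheckProcessorV2.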

/**
 * Health checker that does nothing.
 *
 * @author nkorange
 * @since 1.0.0
 */
@Component
public class NoneHealthCheckProcessor implements HealthCheckProcessor {
    
    public static final String TYPE = "NONE";
    
    @Override
    public void process(HealthCheckTask task) {
    }
    
    @Override
    public String getType() {
        return TYPE;
    }
}

Starting check tasks when a service is registered

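When a service is registered or synchronized, a Client object is created, and creating the Client also creates the corresponding client heartbeat check task. The emphasis is on checking: the task verifies the outcome of heartbeats. Heartbeats update the state of a service, so inspecting that state is an indirect way to detect whether heartbeats are arriving. From a service's point of view the client is the top-level unit that manages all of its services and instances, so starting a heartbeat check task when the client is created is a natural fit.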

public IpPortBasedClient(String clientId, boolean ephemeral) {
	this.ephemeral = ephemeral;
	this.clientId = clientId;
	this.responsibleId = getResponsibleTagFromId();
	if (ephemeral) {
		// Create the heartbeat check task
		beatCheckTask = new ClientBeatCheckTaskV2(this);
		// Hand it to the reactor for scheduling
		HealthCheckReactor.scheduleCheck(beatCheckTask);
	} else {
		healthCheckTaskV2 = new HealthCheckTaskV2(this);
		HealthCheckReactor.scheduleCheck(healthCheckTaskV2);
	}
}

/**
 * Schedule client beat check task with a delay.
 * Schedules a client heartbeat check that starts after a delay.
 * @param task client beat check task
 */
public static void scheduleCheck(BeatCheckTask task) {
	Runnable wrapperTask = task instanceof NacosHealthCheckTask ? new HealthCheckTaskInterceptWrapper((NacosHealthCheckTask) task) : task;
	futureMap.computeIfAbsent(task.taskKey(), k -> GlobalExecutor.scheduleNamingHealth(task, 5000, 5000, TimeUnit.MILLISECONDS));
}

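Taking the default ephemeral Client as the example: its constructor creates a ClientBeatCheckTaskV2 and hands it to HealthCheckReactor for scheduling. The resulting future is cached in futureMap. The first run is delayed by 5 seconds, and the check then repeats every 5 seconds.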

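Tip:
In the listing above, the wrapperTask declared in scheduleCheck(BeatCheckTask task) is not the object passed to the scheduler. The wrapper also manages an interceptor chain, so interceptors could be added later to perform validation; it is one more extension point.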
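The pattern of scheduling a periodic task once per key and remembering its future for later cancellation can be sketched as follows. It is a minimal illustration, not HealthCheckReactor itself: CheckSchedulerSketch and its methods are hypothetical, and scheduleWithFixedDelay is an assumption about how GlobalExecutor schedules the task.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class CheckSchedulerSketch {

    // Daemon thread so the sketch never keeps the JVM alive on its own.
    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(runnable -> {
        Thread thread = new Thread(runnable, "check-scheduler");
        thread.setDaemon(true);
        return thread;
    });

    // One future per task key, so a task can be found again and cancelled.
    private final Map<String, ScheduledFuture<?>> futureMap = new ConcurrentHashMap<>();

    // Schedules the task only if no task is registered under taskKey yet and
    // returns the future that is registered for that key.
    ScheduledFuture<?> schedule(String taskKey, Runnable task, long periodMillis) {
        return futureMap.computeIfAbsent(taskKey,
                key -> executor.scheduleWithFixedDelay(task, periodMillis, periodMillis, TimeUnit.MILLISECONDS));
    }

    // Cancels and forgets the task registered under taskKey. Returns false if there was none.
    boolean cancel(String taskKey) {
        ScheduledFuture<?> future = futureMap.remove(taskKey);
        if (future == null) {
            return false;
        }
        future.cancel(true);
        return true;
    }

    void shutdown() {
        executor.shutdownNow();
    }

    public static void main(String[] args) {
        CheckSchedulerSketch scheduler = new CheckSchedulerSketch();
        scheduler.schedule("client-1", () -> System.out.println("checking client-1"), 5000);
        System.out.println(scheduler.cancel("client-1"));
        scheduler.shutdown();
    }
}
```

Because computeIfAbsent only runs the scheduling lambda when the key is missing, scheduling the same key twice does not start a second periodic task.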

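The check method fetches all services registered under the current Client and checks each one. If a Client has no published services, running the task every 5 seconds would waste resources. This is where the cache filled at scheduling time becomes useful: the check task of a Client without services can be looked up in futureMap and cancelled, freeing thread resources. Who performs the cancellation is left for a later analysis.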

public void doHealthCheck() {

	try {
		// Get all services published by this client
		Collection<Service> services = client.getAllPublishedService();
		for (Service each : services) {
			// Get the instance published for the service and cast it to HealthCheckInstancePublishInfo
			HealthCheckInstancePublishInfo instance = (HealthCheckInstancePublishInfo) client.getInstancePublishInfo(each);
			// Create an InstanceBeatCheckTask for the instance and pass it through the interceptor chain
			interceptorChain.doInterceptor(new InstanceBeatCheckTask(client, each, instance));
		}
	} catch (Exception e) {
		Loggers.SRV_LOG.warn("Exception while processing client beat time out.", e);
	}
}

An InstanceBeatCheckTask is created for every instance found, so one ClientBeatCheckTaskV2 creates multiple InstanceBeatCheckTasks. Three interceptors are loaded for this task; in priority order they are:

  1. ServiceEnableBeatCheckInterceptor
  2. InstanceEnableBeatCheckInterceptor
  3. InstanceBeatCheckResponsibleInterceptor

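The instance heartbeat check task, InstanceBeatCheckTask, uses Checkers to examine the state of the current instance. By default there is a checker for unhealthy instances and a checker for expired instances. The task only triggers the checks; what is checked, and what is done with the result, is up to each checker.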

// InstanceBeatCheckTask.java

// Runs the checks when no interceptor intercepts the task
public void passIntercept() {
	
	// Iterate over all Checkers
	for (InstanceBeatChecker each : CHECKERS) {
		// Delegate the check to the Checker
		each.doCheck(client, service, instancePublishInfo);
	}
}

The Checker calls in InstanceBeatCheckTask live inside passIntercept(), so they run only when none of the interceptors intercepts the task. For the detailed flow, see UnhealthyInstanceChecker and ExpiredInstanceChecker.
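How passIntercept() and afterIntercept() relate to the interceptors can be illustrated with a small chain. This is a sketch under assumptions: it follows the contract stated in the Interceptable Javadoc (the first interceptor that intercepts stops the chain), and InterceptorChainSketch, Interceptor, RecordingTask and fixed(...) are hypothetical names rather than Nacos classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class InterceptorChainSketch {

    // Simplified copy of the Interceptable contract shown earlier.
    interface Interceptable {
        void passIntercept();

        void afterIntercept();
    }

    // Hypothetical interceptor: returns true to intercept (block) the task.
    interface Interceptor {
        boolean intercept(Interceptable task);

        int order();
    }

    // Task that records which callback ran, standing in for InstanceBeatCheckTask.
    static class RecordingTask implements Interceptable {
        String outcome = "not run";

        @Override
        public void passIntercept() {
            outcome = "checked";
        }

        @Override
        public void afterIntercept() {
            outcome = "skipped";
        }
    }

    private final List<Interceptor> interceptors = new ArrayList<>();

    void add(Interceptor interceptor) {
        interceptors.add(interceptor);
        // Lower order value means higher priority.
        interceptors.sort(Comparator.comparingInt(Interceptor::order));
    }

    void doInterceptor(Interceptable task) {
        for (Interceptor each : interceptors) {
            if (each.intercept(task)) {
                task.afterIntercept();
                return;
            }
        }
        // No interceptor objected, so the task does its real work.
        task.passIntercept();
    }

    // Builds an interceptor with a fixed decision (illustration only).
    static Interceptor fixed(boolean intercepts, int order) {
        return new Interceptor() {
            @Override
            public boolean intercept(Interceptable task) {
                return intercepts;
            }

            @Override
            public int order() {
                return order;
            }
        };
    }

    public static void main(String[] args) {
        InterceptorChainSketch chain = new InterceptorChainSketch();
        chain.add(fixed(false, 1));
        RecordingTask task = new RecordingTask();
        chain.doInterceptor(task);
        System.out.println(task.outcome);
    }
}
```

In the heartbeat case, an interceptor that blocks the task presumably corresponds to a service or instance for which the beat check is disabled, or for which this server is not responsible, as the three interceptor names suggest.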

Starting the automatic cleanup task

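Nacos 2.x has a class for automatic cleanup, EmptyServiceAutoCleanerV2. It starts when the Nacos server starts, runs for the first time after a 30-second delay, and then runs every 60 seconds to remove empty services. It mainly cleans the centrally managed Service and Client data (for ServiceManager, ClientServiceIndexesManager and ServiceStorage, see 《Service的存储管理》, the article on Service storage management).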

/**
 * Empty service auto cleaner for v2.x.
 * Automatic cleaner for empty services.
 * @author xiweng.yy
 */
@Component
public class EmptyServiceAutoCleanerV2 extends AbstractNamingCleaner {

    private static final String EMPTY_SERVICE = "emptyService";
	
	// Index manager for the Client/Service relationship
    private final ClientServiceIndexesManager clientServiceIndexesManager;

	// Service storage
    private final ServiceStorage serviceStorage;

    public EmptyServiceAutoCleanerV2(ClientServiceIndexesManager clientServiceIndexesManager, ServiceStorage serviceStorage) {
        this.clientServiceIndexesManager = clientServiceIndexesManager;
        this.serviceStorage = serviceStorage;
        // Delay 30 seconds, then clean empty services every 60 seconds (the interval comes from GlobalConfig)
        GlobalExecutor.scheduleExpiredClientCleaner(this, TimeUnit.SECONDS.toMillis(30), GlobalConfig.getEmptyServiceCleanInterval(), TimeUnit.MILLISECONDS);

    }

    @Override
    public String getType() {
        return EMPTY_SERVICE;
    }

    @Override
    public void doClean() {
		// Get the ServiceManager
        ServiceManager serviceManager = ServiceManager.getInstance();
		// Threshold for parallel processing: a parallel stream is used once there are more than 100 services
        // Parallel flow opening threshold
        int parallelSize = 100;

		// Process the Services of every Namespace
        for (String each : serviceManager.getAllNamespaces()) {
            Set<Service> services = serviceManager.getSingletons(each);
			// Choose a sequential or parallel stream based on the number of Services in this Namespace
            Stream<Service> stream = services.size() > parallelSize ? services.parallelStream() : services.stream();
			// Run cleanEmptyService on each Service
            stream.forEach(this::cleanEmptyService);
        }
    }

    private void cleanEmptyService(Service service) {
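		// Get the IDs of all clients registered for this Service
        Collection<String> registeredService = clientServiceIndexesManager.getAllClientsRegisteredService(service);
		// Clean only if no client is registered AND the service was last updated longer ago than the expiry time (60 seconds)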
        if (registeredService.isEmpty() && isTimeExpired(service)) {
            Loggers.SRV_LOG.warn("namespace : {}, [{}] services are automatically cleaned", service.getNamespace(), service.getGroupedServiceName());
			// Remove the Service/Client index entries
            clientServiceIndexesManager.removePublisherIndexesByEmptyService(service);
			// Remove the Service from its Namespace
            ServiceManager.getInstance().removeSingleton(service);
			// Remove the stored details of the Service
            serviceStorage.removeData(service);
			// Publish the Service expired event
            NotifyCenter.publishEvent(new MetadataEvent.ServiceMetadataEvent(service, true));
        }
    }

    private boolean isTimeExpired(Service service) {
        long currentTimeMillis = System.currentTimeMillis();
        return currentTimeMillis - service.getLastUpdatedTime() >= GlobalConfig.getEmptyServiceExpiredTime();
    }
}
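The two decisions made by the cleaner, when a service counts as removable and when to switch to a parallel stream, are easy to isolate. The sketch below does so with plain values; EmptyServiceCleanSketch, shouldClean and streamFor are hypothetical names, and the expiry time is passed in explicitly instead of being read from GlobalConfig.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.stream.Stream;

class EmptyServiceCleanSketch {

    // Same threshold as parallelSize in the listing.
    static final int PARALLEL_THRESHOLD = 100;

    // A service is cleaned only if it has no registered clients AND
    // its last update is at least expiredMillis in the past.
    static boolean shouldClean(int clientCount, long lastUpdatedMillis, long nowMillis, long expiredMillis) {
        return clientCount == 0 && nowMillis - lastUpdatedMillis >= expiredMillis;
    }

    // Small collections are processed sequentially, large ones in parallel.
    static <T> Stream<T> streamFor(Collection<T> items) {
        return items.size() > PARALLEL_THRESHOLD ? items.parallelStream() : items.stream();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(shouldClean(0, now - 61_000, now, 60_000));
        System.out.println(shouldClean(2, now - 61_000, now, 60_000));
        List<String> services = Arrays.asList("svc-a", "svc-b");
        System.out.println(streamFor(services).isParallel());
    }
}
```

Requiring both conditions avoids removing a service that has only just been created and has not yet received its first instance.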

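Besides the automatic removal of Services there is also ExpiredMetadataCleaner, a task that removes expired ServiceMetaData; interested readers can study its source on their own.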

posted @ 2021-07-21 00:23  不会发芽的种子