9.源码分析---SOFARPC是如何实现故障剔除的?

SOFARPC源码解析系列:

1. 源码分析---SOFARPC可扩展的机制SPI

2. 源码分析---SOFARPC客户端服务引用

3. 源码分析---SOFARPC客户端服务调用

4. 源码分析---SOFARPC服务端暴露

5.源码分析---SOFARPC调用服务

6.源码分析---和dubbo相比SOFARPC是如何实现负载均衡的?

7.源码分析---SOFARPC是如何实现连接管理与心跳?

8.源码分析---从设计模式中看SOFARPC中的EventBus?


在第七讲里面7.源码分析---SOFARPC是如何实现连接管理与心跳?,我讲了客户端是怎么维护服务端的长连接的。但是有一种情况是Consumer 和 Provider的长连接还在,注册中心未下发摘除,但服务器端由于某些原因,例如长时间的 Full GC, 硬件故障(后文中为避免重复,统一描述为机器假死)等场景,处于假死状态。

这个时候 Consumer 应该不调用或少调用该 Provider,可以通过权重的方式来进行控制。目前 SOFARPC 5.3.0 以上的版本支持 RPC 单机故障剔除能力。SOFARPC 通过服务权重控制方式来减少异常服务的调用,将更多流量打到正常服务机器上,提高服务可用性。

接下来我们来讲讲具体的服务权重降级是怎么实现的。在看这篇文章之前我希望读者能读完如下几篇文章:

  1. 8.源码分析---从设计模式中看SOFARPC中的EventBus?,因为SOFARPC的服务权重降级是通过EventBus来调用的。
  2. 3. 源码分析---SOFARPC客户端服务调用,这篇文章里面写了是如何调用服务端的,客户端会在调用服务端的时候触发总线,给订阅者发送一个消息。
  3. 6.源码分析---和dubbo相比SOFARPC是如何实现负载均衡的?,这篇文章里面写的是SOFARPC的负载均衡是怎么实现的,以及如何通过权重控制并发量。

如果你了解了上面的知识,那么可以开始接下来的内容了。

实例

我们首先给出一个服务端和客户端的实例,方便大家去调试。

官方的文档在这里:自动故障剔除

service

public static void main(String[] args) {
    ServerConfig serverConfig = new ServerConfig()
            .setProtocol("bolt") // 设置一个协议,默认bolt
            .setPort(12200) // 设置一个端口,默认12200
            .setDaemon(false); // 非守护线程

    ProviderConfig<HelloService> providerConfig = new ProviderConfig<HelloService>()
        .setInterfaceId(HelloService.class.getName()) // 指定接口
        .setRef(new HelloServiceImpl()) // 指定实现
        .setServer(serverConfig); // 指定服务端

    providerConfig.export(); // 发布服务
}

client

public static void main(String[] args) {

        FaultToleranceConfig faultToleranceConfig = new FaultToleranceConfig();
        faultToleranceConfig.setRegulationEffective(true);
        faultToleranceConfig.setDegradeEffective(true);
        faultToleranceConfig.setTimeWindow(10);
        faultToleranceConfig.setWeightDegradeRate(0.5);

        FaultToleranceConfigManager.putAppConfig("appName", faultToleranceConfig);

        ApplicationConfig applicationConfig = new ApplicationConfig();
        applicationConfig.setAppName("appName");

        ConsumerConfig<HelloService> consumerConfig = new ConsumerConfig<HelloService>()
                .setInterfaceId(HelloService.class.getName()) // 指定接口
                .setProtocol("bolt") // 指定协议
                .setDirectUrl("bolt://127.0.0.1:12200") // 指定直连地址
                .setConnectTimeout(2000 * 1000)
                .setApplication(applicationConfig);

        HelloService helloService = consumerConfig.refer();

        while (true) {
            try {
                LOGGER.info(helloService.sayHello("world"));
            } catch (Exception e) {
                e.printStackTrace();
            }

            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }

        }
    }

自动故障剔除模块的注册

我们在在客户端的例子里面通过FaultToleranceConfigManager注册了FaultToleranceConfig配置。

FaultToleranceConfig faultToleranceConfig = new FaultToleranceConfig();
faultToleranceConfig.setRegulationEffective(true);
faultToleranceConfig.setDegradeEffective(true);
faultToleranceConfig.setTimeWindow(10);
faultToleranceConfig.setWeightDegradeRate(0.5);

FaultToleranceConfigManager.putAppConfig("appName", faultToleranceConfig);

我们先进入到FaultToleranceConfigManager里面看看putAppConfig做了什么。

FaultToleranceConfigManager#putAppConfig

/**
 * All fault-tolerance config of apps
 */
private static final ConcurrentMap<String, FaultToleranceConfig> APP_CONFIGS = new ConcurrentHashMap<String, FaultToleranceConfig>();

public static void putAppConfig(String appName, FaultToleranceConfig value) {
    if (appName == null) {
        if (LOGGER.isWarnEnabled()) {
            LOGGER.warn("App name is null when put fault-tolerance config");
        }
        return;
    }
    if (value != null) {
        APP_CONFIGS.put(appName, value);
        if (LOGGER.isInfoEnabled(appName)) {
            LOGGER.infoWithApp(appName, "Get a new resource, value[" + value + "]");
        }
    } else {
        APP_CONFIGS.remove(appName);
        if (LOGGER.isInfoEnabled(appName)) {
            LOGGER.infoWithApp(appName, "Remove a resource, key[" + appName + "]");
        }
    }
    calcEnable();
}

static void calcEnable() {
    for (FaultToleranceConfig config : APP_CONFIGS.values()) {
        if (config.isRegulationEffective()) {
            aftEnable = true;
            return;
        }
    }
    aftEnable = false;
}

上面的方法写的非常的清楚:

  1. 校验appName,为空的话直接返回
  2. 然后把我们定义的config放到APP_CONFIGS这个变量里面
  3. 调用calcEnable,根据我们配置的config,将aftEnable变量设置为true

到这里就完成了故障剔除的配置设置。

注册故障剔除模块

我们在8.源码分析---从设计模式中看SOFARPC中的EventBus?里面讲了,初始化ConsumerConfig的时候会初始化父类的静态代码块,然后会初始化RpcRuntimeContext的静态代码块。

RpcRuntimeContext

static {
    if (LOGGER.isInfoEnabled()) {
        LOGGER.info("Welcome! Loading SOFA RPC Framework : {}, PID is:{}", Version.BUILD_VERSION, PID);
    }
    put(RpcConstants.CONFIG_KEY_RPC_VERSION, Version.RPC_VERSION);
    // 初始化一些上下文
    initContext();
    // 初始化其它模块
    ModuleFactory.installModules();
    // 增加jvm关闭事件
    if (RpcConfigs.getOrDefaultValue(RpcOptions.JVM_SHUTDOWN_HOOK, true)) {
        Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
            @Override
            public void run() {
                if (LOGGER.isWarnEnabled()) {
                    LOGGER.warn("SOFA RPC Framework catch JVM shutdown event, Run shutdown hook now.");
                }
                destroy(false);
            }
        }, "SOFA-RPC-ShutdownHook"));
    }
}

在这个代码块里面会调用ModuleFactory初始化其他模块

ModuleFactory#installModules

public static void installModules() {
    ExtensionLoader<Module> loader = ExtensionLoaderFactory.getExtensionLoader(Module.class);
    //moduleLoadList 默认是 *
    String moduleLoadList = RpcConfigs.getStringValue(RpcOptions.MODULE_LOAD_LIST);
    for (Map.Entry<String, ExtensionClass<Module>> o : loader.getAllExtensions().entrySet()) {
        String moduleName = o.getKey();
        Module module = o.getValue().getExtInstance();
        // judge need load from rpc option
        if (needLoad(moduleLoadList, moduleName)) {
            // judge need load from implement
            if (module.needLoad()) {
                if (LOGGER.isInfoEnabled()) {
                    LOGGER.info("Install Module: {}", moduleName);
                }
                //安装模板
                module.install();
                INSTALLED_MODULES.put(moduleName, module);
            } else {
                if (LOGGER.isInfoEnabled()) {
                    LOGGER.info("The module " + moduleName + " does not need to be loaded.");
                }
            }
        } else {
            if (LOGGER.isInfoEnabled()) {
                LOGGER.info("The module " + moduleName + " is not in the module load list.");
            }
        }
    }
}

这里会根据SPI初始化四个模块,分别是:
fault-tolerance
sofaTracer-resteasy
lookout
sofaTracer

我们这里只讲解fault-tolerance模块。

然后我们进入到FaultToleranceModule#install方法中

private Regulator                regulator = new TimeWindowRegulator();

public void install() {
    subscriber = new FaultToleranceSubscriber();
    //注册ClientSyncReceiveEvent和ClientAsyncReceiveEvent到总线中
    EventBus.register(ClientSyncReceiveEvent.class, subscriber);
    EventBus.register(ClientAsyncReceiveEvent.class, subscriber);

    String regulatorAlias = RpcConfigs.getOrDefaultValue(RpcOptions.AFT_REGULATOR, "timeWindow");
    regulator = ExtensionLoaderFactory.getExtensionLoader(Regulator.class).getExtension(regulatorAlias);
    //调用TimeWindowRegulator的init方法
    regulator.init();
}

这里我们的订阅者是FaultToleranceSubscriber实例,订阅了两个ClientSyncReceiveEvent和ClientAsyncReceiveEvent事件。

然后会调用regulator的实现类TimeWindowRegulator的初始化方法
TimeWindowRegulator#init


/**
 * 度量策略(创建计算模型, 对计算模型里的数据进行度量,选出正常和异常节点)
 */
private MeasureStrategy                          measureStrategy;

/**
 * 计算策略(根据度量结果,判断是否需要执行降级或者恢复) 
 */
private RegulationStrategy                       regulationStrategy;

/**
 * 降级策略: 例如调整权重 
 */
private DegradeStrategy                          degradeStrategy;

/**
 * 恢复策略:例如调整权重 
 */
private RecoverStrategy                          recoverStrategy;

/**
 * Listener for invocation stat change.
 */
private final InvocationStatListener             listener           = new TimeWindowRegulatorListener();


public void init() {
    String measureStrategyAlias = RpcConfigs
        .getOrDefaultValue(RpcOptions.AFT_MEASURE_STRATEGY, "serviceHorizontal");
    String regulationStrategyAlias = RpcConfigs.getOrDefaultValue(RpcOptions.AFT_REGULATION_STRATEGY,
        "serviceHorizontal");
    String degradeStrategyAlias = RpcConfigs.getOrDefaultValue(RpcOptions.AFT_DEGRADE_STRATEGY, "weight");
    String recoverStrategyAlias = RpcConfigs.getOrDefaultValue(RpcOptions.AFT_RECOVER_STRATEGY, "weight");
    //ServiceHorizontalMeasureStrategy
    measureStrategy = ExtensionLoaderFactory.getExtensionLoader(MeasureStrategy.class).getExtension(
        measureStrategyAlias);
    //ServiceHorizontalRegulationStrategy
    regulationStrategy = ExtensionLoaderFactory.getExtensionLoader(RegulationStrategy.class).getExtension(
            regulationStrategyAlias);
    //WeightDegradeStrategy
    degradeStrategy = ExtensionLoaderFactory.getExtensionLoader(DegradeStrategy.class).getExtension(
            degradeStrategyAlias);
    //WeightRecoverStrategy
    recoverStrategy = ExtensionLoaderFactory.getExtensionLoader(RecoverStrategy.class).getExtension(
        recoverStrategyAlias);

    //TimeWindowRegulatorListener
    InvocationStatFactory.addListener(listener);
}

这里面主要是根据SPI初始化了度量策略,计算策略,降级策略,恢复策略,这些东西有什么用,我们下面讲。

触发权重降级

我们在3. 源码分析---SOFARPC客户端服务调用里面讲到了,客户端在调用的时候最后会调用AbstractCluster#doSendMsg方法,然后根据不同的策略,同步、异步、单向等调用然后返回response实例。

protected SofaResponse doSendMsg(ProviderInfo providerInfo, ClientTransport transport,
                                 SofaRequest request) throws SofaRpcException {
    ....
    // 同步调用
    if (RpcConstants.INVOKER_TYPE_SYNC.equals(invokeType)) {
        long start = RpcRuntimeContext.now();
        try {
            //BoltClientTransport#syncSend
            response = transport.syncSend(request, timeout);
        } finally {
            if (RpcInternalContext.isAttachmentEnable()) {
                long elapsed = RpcRuntimeContext.now() - start;
                context.setAttachment(RpcConstants.INTERNAL_KEY_CLIENT_ELAPSE, elapsed);
            }
        }
    }
    ....
}

因为在故障模块注册的时候订阅了两个ClientSyncReceiveEvent和ClientAsyncReceiveEvent事件。即一个同步事件和一个异步事件,我们这里挑同步调用进行讲解。

在上面的代码片段中,我们看到了会调用到BoltClientTransport#syncSend。

BoltClientTransport#syncSend

public SofaResponse syncSend(SofaRequest request, int timeout) throws SofaRpcException {
    //检查连接
    checkConnection();
    RpcInternalContext context = RpcInternalContext.getContext();
    InvokeContext boltInvokeContext = createInvokeContext(request);
    SofaResponse response = null;
    SofaRpcException throwable = null;
    try {
        //向总线发出ClientBeforeSendEvent事件
        beforeSend(context, request);
        response = doInvokeSync(request, boltInvokeContext, timeout);
        return response;
    } catch (Exception e) { // 其它异常
        throwable = convertToRpcException(e);
        throw throwable;
    } finally {
        //向总线发出ClientAfterSendEvent事件
        afterSend(context, boltInvokeContext, request);
        //向总线发出ClientSyncReceiveEvent事件
        if (EventBus.isEnable(ClientSyncReceiveEvent.class)) {
            //把当前被调用的provider和ConsumerConfig发送到总线中去
            EventBus.post(new ClientSyncReceiveEvent(transportConfig.getConsumerConfig(),
                    transportConfig.getProviderInfo(), request, response, throwable));
        }
    }
}

其实上面这么一大段代码和我们这篇文章有关系的也就只要最后向总线发送ClientSyncReceiveEvent事件而已。

总线发送的时候会触发订阅者FaultToleranceSubscriber的onEvent方法。

我们进入到FaultToleranceSubscriber#onEvent


public void onEvent(Event originEvent) {
    Class eventClass = originEvent.getClass();

    if (eventClass == ClientSyncReceiveEvent.class) {
        //这里会调用aftEnable
        if (!FaultToleranceConfigManager.isEnable()) {
            return;
        }
        // 同步结果
        ClientSyncReceiveEvent event = (ClientSyncReceiveEvent) originEvent;
        ConsumerConfig consumerConfig = event.getConsumerConfig();
        ProviderInfo providerInfo = event.getProviderInfo();
        InvocationStat result = InvocationStatFactory.getInvocationStat(consumerConfig, providerInfo);
        if (result != null) {
            //记录调用次数
            result.invoke();
            Throwable t = event.getThrowable();
            if (t != null) {
                 //记录异常次数
                result.catchException(t);
            }
        }
    }  
    ...
}

这里我们忽略其他的事件,只留下ClientSyncReceiveEvent事件的处理流程。
在这里我们又看到了InvocationStatFactory这个工厂类,在上面TimeWindowRegulator#init也用到了这个类。

在返回result之后会调用invoke方法,记录一下客户端调用服务端的次数,如果有异常,也会调用一下catchException方法,记录一下异常的次数。这两个参数会在做服务剔除的时候异步做统计使用。

InvocationStatFactory#getInvocationStat

public static InvocationStat getInvocationStat(ConsumerConfig consumerConfig, ProviderInfo providerInfo) {
    String appName = consumerConfig.getAppName();
    if (appName == null) {
        return null;
    }
    // 应用开启单机故障摘除功能
    if (FaultToleranceConfigManager.isRegulationEffective(appName)) {
        return getInvocationStat(new InvocationStatDimension(providerInfo, consumerConfig));
    }
    return null;
}


public static InvocationStat getInvocationStat(InvocationStatDimension statDimension) {
    //第一次的时候为空
    InvocationStat invocationStat = ALL_STATS.get(statDimension);
    if (invocationStat == null) {
        //直接new一个实例放入到ALL_STATS变量中
        invocationStat = new ServiceExceptionInvocationStat(statDimension);
        InvocationStat old = ALL_STATS.putIfAbsent(statDimension, invocationStat);
        if (old != null) {
            invocationStat = old;
        }
        //LISTENERS在调用TimeWindowRegulator#init的时候add进来的,只有一个TimeWindowRegulatorListener
        for (InvocationStatListener listener : LISTENERS) {
            listener.onAddInvocationStat(invocationStat);
        }
    }
    return invocationStat;
}

如果是第一次来到这个方法的话,那么会实例化一个ServiceExceptionInvocationStat放入到ALL_STATS变量中,然后遍历InvocationStatFactory的遍历LISTENERS,调用监听器的onAddInvocationStat方法。

LISTENERS里面的实例是我们在TimeWindowRegulator#init方法里面add进去的TimeWindowRegulatorListener。

注意,这里用了两个封装类,都是接下来要用到的。分别是InvocationStatDimension和ServiceExceptionInvocationStat。

InvocationStatDimension

public class InvocationStatDimension {
    /**
     * One provider of service reference
     */
    private final ProviderInfo   providerInfo;

    /**
     * Config of service reference
     */
    private final ConsumerConfig consumerConfig;

    /**
     * cache value: dimensionKey
     */
    private transient String     dimensionKey;
    /**
     * cache value : originWeight
     */
    private transient Integer    originWeight;
}

ServiceExceptionInvocationStat的结构图:

ServiceExceptionInvocationStat

public class ServiceExceptionInvocationStat extends AbstractInvocationStat {

    /**
     * Instantiates a new Service exception invocation stat.
     *
     * @param invocation the invocation
     */
    public ServiceExceptionInvocationStat(InvocationStatDimension invocation) {
        super(invocation);
    }

    @Override
    public long catchException(Throwable t) {
        //统计异常次数
        if (t instanceof SofaRpcException) {
            SofaRpcException exception = (SofaRpcException) t;
            if (exception.getErrorType() == RpcErrorType.CLIENT_TIMEOUT
                    || exception.getErrorType() == RpcErrorType.SERVER_BUSY) {
                return exceptionCount.incrementAndGet();
            }
        }
        return exceptionCount.get();
    }
}

然后直接看它父类的具体参数就好了
AbstractInvocationStat

public abstract class AbstractInvocationStat implements InvocationStat {
    /**
     * 统计维度
     */
    protected final InvocationStatDimension dimension;
    /**
     * 调用次数
     */
    protected final AtomicLong              invokeCount    = new AtomicLong(0L);
    /**
     * 异常次数
     */
    protected final AtomicLong              exceptionCount = new AtomicLong(0L);

    /**
     * when useless in one window, this value increment 1. <br />
     * If this value is greater than threshold, this stat will be deleted.
     */
    private final transient AtomicInteger   uselessCycle   = new AtomicInteger(0);
}

上面的这些参数,我们接下来还会用到。

权重降级具体实现

TimeWindowRegulatorListener是TimeWIndowRegulator的内部类。

class TimeWindowRegulatorListener implements InvocationStatListener {
    @Override
    public void onAddInvocationStat(InvocationStat invocationStat) {
        //度量策略不为空
        if (measureStrategy != null) {
            //ServiceHorizontalMeasureStrategy
            MeasureModel measureModel = measureStrategy.buildMeasureModel(invocationStat);
            if (measureModel != null) {
                measureModels.add(measureModel);
                startRegulate();
            }
        }
    }

    @Override
    public void onRemoveInvocationStat(InvocationStat invocationStat) {
        if (measureStrategy != null) {
            measureStrategy.removeMeasureModel(invocationStat);
        }
    }
}

这个监听器里面就是调用ServiceHorizontalMeasureStrategy#buildMeasureModel,返回调控模型。

我们先看一下MeasureModel里面封装了什么:

MeasureModel

public class MeasureModel {
    /**
     * App name of measure model
     * 服务名
     */
    private final String                            appName;
    /**
     * service name of measure model
     * 被调用的服务
     */
    private final String                            service;
    /**
     * all dimension statics stats of measure model
     * InvokeStat集合
     */
    private final ConcurrentHashSet<InvocationStat> stats = new ConcurrentHashSet<InvocationStat>();
    ....
}

所以根据这几个全局变量,我们可以推测,MeasureModel应该是根据appName+service为维度,里面有很多的InvocationStat。

我们再回到ServiceHorizontalMeasureStrategy#buildMeasureModel

public MeasureModel buildMeasureModel(InvocationStat invocationStat) {
    InvocationStatDimension statDimension = invocationStat.getDimension();
    //AppName + ":" + Service
    String key = statDimension.getDimensionKey();
    MeasureModel measureModel = appServiceMeasureModels.get(key);
    if (measureModel == null) {
        measureModel = new MeasureModel(statDimension.getAppName(), statDimension.getService());
        MeasureModel oldMeasureModel = appServiceMeasureModels.putIfAbsent(key, measureModel);
        if (oldMeasureModel == null) {
            measureModel.addInvocationStat(invocationStat);
            return measureModel;
        } else {
            oldMeasureModel.addInvocationStat(invocationStat);
            return null;
        }
    } else {
        measureModel.addInvocationStat(invocationStat);
        return null;
    }
}

buildMeasureModel方法里面的做法也和我上面说的一样。根据appName+service为维度封装不同的invocationStat在MeasureModel里面。

接着,回到TimeWindowRegulatorListener#onAddInvocationStat中,会往下调用startRegulate方法。


/**
 * 度量线程池
 */
private final ScheduledService                   measureScheduler   = new ScheduledService("AFT-MEASURE",
                                                                        ScheduledService.MODE_FIXEDRATE,
                                                                        new MeasureRunnable(), 1, 1,
                                                                        TimeUnit.SECONDS);

public void startRegulate() {
    if (measureStarted.compareAndSet(false, true)) {
        measureScheduler.start();
    }
}

ScheduledService是一个线程池,measureScheduler变量实例化了一个固定频率执行延迟线程池,会每1秒钟固定调用MeasureRunnable的run方法。

MeasureRunnable是TimeWindowRegulator的内部类:

private class MeasureRunnable implements Runnable {

    @Override
    public void run() {
        measureCounter.incrementAndGet();
        //遍历TimeWindowRegulatorListener加入的MeasureModel实例
        for (MeasureModel measureModel : measureModels) {
            try {
                //时间窗口是10,也就是说默认每过10秒才能进入下面的方法。
                if (isArriveTimeWindow(measureModel)) {
                    //ServiceHorizontalMeasureStrategy
                    MeasureResult measureResult = measureStrategy.measure(measureModel);
                    regulationExecutor.submit(new RegulationRunnable(measureResult));
                }
            } catch (Exception e) {
                LOGGER.errorWithApp(measureModel.getAppName(), "Error when doMeasure: " + e.getMessage(), e);
            }
        }
    }

    private boolean isArriveTimeWindow(MeasureModel measureModel) {
        //timeWindow默认是10
        long timeWindow = FaultToleranceConfigManager.getTimeWindow(measureModel.getAppName());
        return measureCounter.get() % timeWindow == 0;
    }
}

我们先来到ServiceHorizontalMeasureStrategy#measure来看看是怎么判断为异常或正常

如何判断一个节点是异常还是正常

我们首先不看代码的实现,先白话的说明一下是如何实现的。

  1. 首先在FaultToleranceSubscriber#onEvent中收到同步或异步结果事件后,就会从工厂中获取这次调用的 InvokeStat(如果 InvokeStat 已经存在则直接返回,如果没有则创建新的并保持到缓存中)。通过调用 InvokeStat 的 invoke 和 catchException 方法统计调用次数和异常次数。
  2. 然后在MeasureRunnable方法中根据设置的窗口期,在到达窗口期的时候会从 MeasueModel 的各个 InvokeStat 创建一份镜像数据,表示当前串口内的调用情况。
  3. 对所有的节点进行度量,计算出所有节点的平均异常率,如果某个节点的异常率大于平均异常率到一定比例,则判定为异常。

我这里选用官方的例子来进行说明:
假如有三个节点,提供同一服务,调用次数和异常数如表格所示:

invokeCount expCount
invokeStat 1 5 4
invokeStat 2 10 1
invokeStat 3 10 0

结合上述例子,度量策略的大致逻辑如下:

  • 首先统计该服务下所有 ip 的平均异常率,并用 averageExceptionRate 表示。平均异常率比较好理解,即异常总数 / 总调用次数,上例中 averageExceptionRate =(1 + 4) / (5 + 10 + 10) = 0.2.
  • 当某个ip的窗口调用次数小于该服务的最小窗口调用次数( leastWindCount )则忽略并将状态设置为 IGNOGRE。否则进行降级和恢复度量。 如 invokeStat 1 的 invokeCount 为5,如果 leastWindCount 设置为6 则 invokeStat 1 会被忽略。
  • 当某个ip的 时间窗口内的异常率和服务平均异常比例 windowExceptionRate 大于 配置的 leastWindowExceptionRateMultiplte (最小时间窗口内异常率和服务平均异常率的降级比值),那么将该IP设置为 ABNORMAL, 否则设置为 HEALTH.

windowExceptionRate 是异常率和服务平均异常比例,invokeStat 1 的异常率为 4/5 = 0.8, 则其对应的 windowExceptionRate = 0.8 / 0.2 = 4. 假设 leastWindowExceptionRateMultiplte =4, 那么 invokeStat 1 是一次服务,则需要进行降级操作。

接下来我们来看具体的源码实现:
ServiceHorizontalMeasureStrategy#measure

public MeasureResult measure(MeasureModel measureModel) {

    MeasureResult measureResult = new MeasureResult();
    measureResult.setMeasureModel(measureModel);

    String appName = measureModel.getAppName();
    List<InvocationStat> stats = measureModel.getInvocationStats();
    if (!CommonUtils.isNotEmpty(stats)) {
        return measureResult;
    }

    //1
    //这个方法主要是复制出一个当前时间点的调用情况,只统计被复制的InvocationStat
    //如果有被新剔除的InvocationStat,则不会存在于该次获取结果中。
    List<InvocationStat> invocationStats = getInvocationStatSnapshots(stats);
    //FaultToleranceConfig的timeWindow所设置的,时间窗口,默认是10
    long timeWindow = FaultToleranceConfigManager.getTimeWindow(appName);
    /* leastWindowCount在同一次度量中保持不变*/
    //默认InvocationStat如果要参与统计的窗口内最低调用次数,时间窗口内,至少调用的次数.在时间窗口内总共都不足10,认为不需要调控.
    long leastWindowCount = FaultToleranceConfigManager.getLeastWindowCount(appName);
    //最小是1,也就是时间窗口内,只要调用了就进行统计
    leastWindowCount = leastWindowCount < LEGAL_LEAST_WINDOW_COUNT ? LEGAL_LEAST_WINDOW_COUNT
        : leastWindowCount;

    //2.
    /* 计算平均异常率和度量单个ip的时候都需要使用到appWeight*/
    double averageExceptionRate = calculateAverageExceptionRate(invocationStats, leastWindowCount);

    //表示当前机器是平均异常率的多少倍才降级,默认是6
    double leastWindowExceptionRateMultiple = FaultToleranceConfigManager
        .getLeastWindowExceptionRateMultiple(appName);

    for (InvocationStat invocationStat : invocationStats) {
        MeasureResultDetail measureResultDetail = null;
        InvocationStatDimension statDimension = invocationStat.getDimension();

        long windowCount = invocationStat.getInvokeCount();
        //3
        //这里主要是根据Invocation的实际权重计算该Invocation的实际最小窗口调用次数
        long invocationLeastWindowCount = getInvocationLeastWindowCount(invocationStat,
                ProviderInfoWeightManager.getWeight(statDimension.getProviderInfo()),
                leastWindowCount);
        //4
        //当总调用的次数为0的时候,averageExceptionRate =-1,这个时候可以设置为忽略
        if (averageExceptionRate == -1) {
            measureResultDetail = new MeasureResultDetail(statDimension, MeasureState.IGNORE);
        } else {
            if (invocationLeastWindowCount != -1 && windowCount >= invocationLeastWindowCount) {
                //获取异常率
                double windowExceptionRate = invocationStat.getExceptionRate();
                //没有异常的情况,设置状态为健康
                if (averageExceptionRate == 0) {
                    measureResultDetail = new MeasureResultDetail(statDimension, MeasureState.HEALTH);
                } else {
                    //5
                    //这里主要是看这次被遍历到invocationStat的异常率和平均异常率之比
                    double windowExceptionRateMultiple = CalculateUtils.divide(
                            windowExceptionRate, averageExceptionRate);
                    //如果当前的invocationStat的异常是平均异常的6倍,那么就设置状态为异常
                    measureResultDetail = windowExceptionRateMultiple >= leastWindowExceptionRateMultiple ?
                            new MeasureResultDetail(statDimension, MeasureState.ABNORMAL) :
                            new MeasureResultDetail(statDimension, MeasureState.HEALTH);
                }
                measureResultDetail.setAbnormalRate(windowExceptionRate);
                measureResultDetail.setAverageAbnormalRate(averageExceptionRate);
                measureResultDetail.setLeastAbnormalRateMultiple(leastWindowExceptionRateMultiple);
            } else {
                measureResultDetail = new MeasureResultDetail(statDimension, MeasureState.IGNORE);
            }
        }

        measureResultDetail.setWindowCount(windowCount);
        measureResultDetail.setTimeWindow(timeWindow);
        measureResultDetail.setLeastWindowCount(invocationLeastWindowCount);
        measureResult.addMeasureDetail(measureResultDetail);
    }
    //打日志
    logMeasureResult(measureResult, timeWindow, leastWindowCount, averageExceptionRate,
        leastWindowExceptionRateMultiple);

    InvocationStatFactory.updateInvocationStats(invocationStats);
    return measureResult;
}  

上面这个方法有点长,我给这个方法标注了数字,跟着数字标记去看。

  1. getInvocationStatSnapshots
public static List<InvocationStat> getInvocationStatSnapshots(List<InvocationStat> stats) {
    List<InvocationStat> snapshots = new ArrayList<InvocationStat>(stats.size());
    for (InvocationStat stat : stats) {
        //赋值一个InvocationStat出来
        InvocationStat snapshot = stat.snapshot();
        //如果被调用的次数小于0
        if (snapshot.getInvokeCount() <= 0) {
            if (stat.getUselessCycle().incrementAndGet() > 6) {
                // 6 个时间窗口无调用,删除统计
                InvocationStatFactory.removeInvocationStat(stat);
                InvocationStatDimension dimension = stat.getDimension();
                String appName = dimension.getAppName();
                if (LOGGER.isDebugEnabled(appName)) {
                    LOGGER.debugWithApp(appName, "Remove invocation stat : {}, {} because of useless cycle > 6",
                        dimension.getDimensionKey(), dimension.getProviderInfo());
                }
            }
        } else {
            //如果被调用了,那么就从新计数
            stat.getUselessCycle().set(0);
            snapshots.add(snapshot);
        }
    }
    return snapshots;
}

//ServiceExceptionInvocationStat#snapshot
public InvocationStat snapshot() {
    ServiceExceptionInvocationStat invocationStat = new ServiceExceptionInvocationStat(dimension);
    invocationStat.setInvokeCount(getInvokeCount());
    invocationStat.setExceptionCount(getExceptionCount());
    return invocationStat;
}

首先 这个方法里面首先是遍历所有的InvocationStat,然后调用snapshot创建一个新的InvocationStat实例。

其次 校验新的InvocationStat实例调用次数是不是小于等于0,如果是,说明没有在时间窗口内没有被调用过一次,那么就再看是不是在6 个时间窗口无调用,如果是,那么就删除统计数据

然后返回新的InvocationStat集合

  1. calculateAverageExceptionRate
private double calculateAverageExceptionRate(List<InvocationStat> invocationStats, long leastWindowCount) {
    long sumException = 0;
    long sumCall = 0;
    for (InvocationStat invocationStat : invocationStats) {

        long invocationLeastWindowCount = getInvocationLeastWindowCount(invocationStat,
            ProviderInfoWeightManager.getWeight(invocationStat.getDimension().getProviderInfo()),
            leastWindowCount);
        //统计所有的invocationStat被调用的次数,和异常次数
        if (invocationLeastWindowCount != -1
            && invocationStat.getInvokeCount() >= invocationLeastWindowCount) {
            sumException += invocationStat.getExceptionCount();
            sumCall += invocationStat.getInvokeCount();
        }
    }
    if (sumCall == 0) {
        return -1;
    }
    //计算异常比率
    return CalculateUtils.divide(sumException, sumCall);
}


private long getInvocationLeastWindowCount(InvocationStat invocationStat, Integer weight, long leastWindowCount) {
    //目标地址原始权重
    InvocationStatDimension statDimension = invocationStat.getDimension();
    Integer originWeight = statDimension.getOriginWeight();
    if (originWeight == 0) {
        LOGGER.errorWithApp(statDimension.getAppName(), "originWeight is 0,but is invoked. service["
                + statDimension.getService() + "];ip["
                + statDimension.getIp() + "].");
        return -1;
    } else if (weight == null) { //如果地址还未被调控过或者已经恢复。
        return leastWindowCount;
    } else if (weight == -1) { //如果地址被剔除
        return -1;
    }
    //这里主要是根据Invocation的实际权重计算该Invocation的实际最小窗口调用次数
    double rate = CalculateUtils.divide(weight, originWeight);
    long invocationLeastWindowCount = CalculateUtils.multiply(leastWindowCount, rate);
    return invocationLeastWindowCount < LEGAL_LEAST_WINDOW_COUNT ? LEGAL_LEAST_WINDOW_COUNT
            : invocationLeastWindowCount;
}

这个方法总的来说就是遍历所有的InvocationStat,然后求和说有的调用次数和异常次数,然后用(异常次数/调用次数)计算平均异常比率。

getInvocationLeastWindowCount方法主要是用来做校验,如果原始的权重为0,或者为-1,那么就返回-1。
因为当前的InvocationStat的权重可能被降权过,所以我们不能按原来的最小窗口调用次数来算,所以这里需要乘以一个比率,然后看是不是小于LEGAL_LEAST_WINDOW_COUNT,返回际权重计算该Invocation的实际最小窗口调用次数。

  1. if判断

我们在分析calculateAverageExceptionRate方法的时候看了,如果总的调用次数为0,那么averageExceptionRate会为-1。代表所有的InvocationStat没有被调用,我们设置忽略。

那么接着往下走,会发现有一个averageExceptionRate是否为0的判断,由于averageExceptionRate =(异常次数/调用次数),所以如果没有异常的时候设置状态为健康。

  1. windowExceptionRateMultipe
    windowExceptionRateMultipe这个变量主要是用来看这次被遍历到invocationStat的异常率和平均异常率之比。如果当前的(异常率/平均异常率)>=leastWindowExceptionRateMultiple,默认是6倍,那么就设置当前的invocationStat为异常。

根据MeasureResult进行降权或恢复

调用完ServiceHorizontalMeasureStrategy#measure方法后会返回一个MeasureResult,然会新建一个RegulationRunnable实例,丢到regulationExecutor线程池中执行。

RegulationRunnable是TimeWeindowRegulator的内部类。

RegulationRunnable#run

RegulationRunnable(MeasureResult measureResult) {
    this.measureResult = measureResult;
}
 
public void run() {
    List<MeasureResultDetail> measureResultDetails = measureResult.getAllMeasureResultDetails();
    for (MeasureResultDetail measureResultDetail : measureResultDetails) {
        try {
            doRegulate(measureResultDetail);
        } catch (Exception e) {
            LOGGER.errorWithApp(measureResult.getMeasureModel().getAppName(),
                "Error when doRegulate: " + e.getMessage(), e);
        }
    }
}

RegulationRunnable会在run方法里面遍历所有的measureResult,然后调用doRegulate方法进行降权或恢复的处理

void doRegulate(MeasureResultDetail measureResultDetail) {
    MeasureState measureState = measureResultDetail.getMeasureState();
    InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();
    //默认是否进行降级 ,默认为否 ServiceHorizontalRegulationStrategy
    boolean isDegradeEffective = regulationStrategy.isDegradeEffective(measureResultDetail);

    if (isDegradeEffective) {
        measureResultDetail.setLogOnly(false);
        if (measureState.equals(MeasureState.ABNORMAL)) {
            //这里是为了以防对太多节点做了降权,所以默认限制只能最多给两个节点降权
            boolean isReachMaxDegradeIpCount = regulationStrategy.isReachMaxDegradeIpCount(measureResultDetail);
            if (!isReachMaxDegradeIpCount) {
                //降权 WeightDegradeStrategy
                degradeStrategy.degrade(measureResultDetail);
            } else {
                String appName = measureResult.getMeasureModel().getAppName();
                if (LOGGER.isInfoEnabled(appName)) {
                    LOGGER.infoWithApp(appName, LogCodes.getLog(LogCodes.INFO_REGULATION_ABNORMAL_NOT_DEGRADE,
                            "Reach degrade number limit.", statDimension.getService(), statDimension.getIp(),
                            statDimension.getAppName()));
                }
            }
        } else if (measureState.equals(MeasureState.HEALTH)) {
            boolean isExistDegradeList = regulationStrategy.isExistInTheDegradeList(measureResultDetail);
            if (isExistDegradeList) {
                //恢复
                recoverStrategy.recover(measureResultDetail);
                regulationStrategy.removeFromDegradeList(measureResultDetail);
            }
            //没有被降级过,因此不需要被恢复。
        }
    } else {
        measureResultDetail.setLogOnly(true);
        if (measureState.equals(MeasureState.ABNORMAL)) {
            //这个时候调用degrade,主要是打印日志用的
            degradeStrategy.degrade(measureResultDetail);
            String appName = measureResult.getMeasureModel().getAppName();
            if (LOGGER.isInfoEnabled(appName)) {
                LOGGER.infoWithApp(appName, LogCodes.getLog(LogCodes.INFO_REGULATION_ABNORMAL_NOT_DEGRADE,
                        "Degrade switch is off", statDimension.getService(),
                        statDimension.getIp(), statDimension.getAppName()));
            }
        }
    }
}
}

我们分两种情况进行分析。

  1. 如果该节点是异常节点
    首先会调用ServiceHorizontalRegulationStrategy#isReachMaxDegradeIpCount方法。

ServiceHorizontalRegulationStrategy#isReachMaxDegradeIpCount

public boolean isReachMaxDegradeIpCount(MeasureResultDetail measureResultDetail) {
    InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();
    ConcurrentHashSet<String> ips = getDegradeProviders(statDimension.getDimensionKey());

    String ip = statDimension.getIp();
    if (ips.contains(ip)) {
        return false;
    } else {
        //默认一个服务能够调控的最大ip数
        int degradeMaxIpCount = FaultToleranceConfigManager.getDegradeMaxIpCount(statDimension.getAppName());
        ipsLock.lock();
        try {
            if (ips.size() < degradeMaxIpCount) {
                ips.add(ip);
                return false;
            } else {
                return true;
            }
        } finally {
            ipsLock.unlock();
        }
    }
}

这个方法是为了能够控制最多一个服务下面能调控多少个节点。比如一个服务下面只有3个节点,其中2个节点出了问题,通过调控解决了,那么不可能将第三个节点也进行调控了吧,必须要进行人工干预了,为啥会出现这样的问题。

然后会调用WeightDegradeStrategy#degrade对节点进行降权
WeightDegradeStrategy#degrade

public void degrade(MeasureResultDetail measureResultDetail) {
    //调用LogPrintDegradeStrategy方法,打印日志用
    super.degrade(measureResultDetail);

    if (measureResultDetail.isLogOnly()) {
        return;
    }

    InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();
    String appName = statDimension.getAppName();

    ProviderInfo providerInfo = statDimension.getProviderInfo();
    // if provider is removed or provider is warming up
    //如果为空,或是在预热中,则直接返回
    if (providerInfo == null || providerInfo.getStatus() == ProviderStatus.WARMING_UP) {
        return;
    }
    //目前provider权重
    int currentWeight = ProviderInfoWeightManager.getWeight(providerInfo);
    //降权比重
    double weightDegradeRate = FaultToleranceConfigManager.getWeightDegradeRate(appName);
    //最少权重,默认为1
    int degradeLeastWeight = FaultToleranceConfigManager.getDegradeLeastWeight(appName);
    //权重比率 * 目前权重
    int degradeWeight = CalculateUtils.multiply(currentWeight, weightDegradeRate);
    //不能小于最小值
    degradeWeight = degradeWeight < degradeLeastWeight ? degradeLeastWeight : degradeWeight;

    // degrade weight of this provider info
    boolean success = ProviderInfoWeightManager.degradeWeight(providerInfo, degradeWeight);
    if (success && LOGGER.isInfoEnabled(appName)) {
        LOGGER.infoWithApp(appName, "the weight was degraded. serviceUniqueName:["
            + statDimension.getService() + "],ip:["
            + statDimension.getIp() + "],origin weight:["
            + currentWeight + "],degraded weight:["
            + degradeWeight + "].");
    }
}

//ProviderInfoWeightManager
public static boolean degradeWeight(ProviderInfo providerInfo, int weight) {
    providerInfo.setStatus(ProviderStatus.DEGRADED);
    providerInfo.setWeight(weight);
    return true;
}

这个方法实际上就是权重拿出来,然后根据比率进行设值并且不能小于最小的比重。
最后调用ProviderInfoWeightManager把当前的节点设值为DEGRADED,并设值新的权重。

  1. 如果是健康节点

调用ServiceHorizontalRegulationStrategy#isExistInTheDegradeList判断一下当前节点有没有被降级
ServiceHorizontalRegulationStrategy#isExistInTheDegradeList

public boolean isExistInTheDegradeList(MeasureResultDetail measureResultDetail) {
    InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();
    ConcurrentHashSet<String> ips = getDegradeProviders(statDimension.getDimensionKey());
    return ips != null && ips.contains(statDimension.getIp());
}

在调用isReachMaxDegradeIpCount方法的时候会把被降级的ip放入到ips集合中,所以这里只要获取就可以了。

如果该节点已被降级那么调用WeightRecoverStrategy#recover进行恢复
WeightRecoverStrategy#recover

public void recover(MeasureResultDetail measureResultDetail) {
    InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();
    ProviderInfo providerInfo = statDimension.getProviderInfo();
    // if provider is removed or provider is warming up
    if (providerInfo == null || providerInfo.getStatus() == ProviderStatus.WARMING_UP) {
        return;
    }
    Integer currentWeight = ProviderInfoWeightManager.getWeight(providerInfo);
    if (currentWeight == -1) {
        return;
    }

    String appName = statDimension.getAppName();
    //默认2
    double weightRecoverRate = FaultToleranceConfigManager.getWeightRecoverRate(appName);
    //也就是说一次只能恢复到2倍,不会一次性就恢复到originWeight
    int recoverWeight = CalculateUtils.multiply(currentWeight, weightRecoverRate);
    int originWeight = statDimension.getOriginWeight();

    // recover weight of this provider info
    if (recoverWeight >= originWeight) {
        measureResultDetail.setRecoveredOriginWeight(true);
        //将provider状态设置为AVAILABLE,并且设置Weight
        ProviderInfoWeightManager.recoverOriginWeight(providerInfo, originWeight);
        if (LOGGER.isInfoEnabled(appName)) {
            LOGGER.infoWithApp(appName, "the weight was recovered to origin value. serviceUniqueName:["
                + statDimension.getService() + "],ip:["
                + statDimension.getIp() + "],origin weight:["
                + currentWeight + "],recover weight:["
                + originWeight + "].");
        }
    } else {
        measureResultDetail.setRecoveredOriginWeight(false);
        boolean success = ProviderInfoWeightManager.recoverWeight(providerInfo, recoverWeight);
        if (success && LOGGER.isInfoEnabled(appName)) {
            LOGGER.infoWithApp(appName, "the weight was recovered. serviceUniqueName:["
                + statDimension.getService() + "],ip:["
                + statDimension.getIp() + "],origin weight:["
                + currentWeight + "],recover weight:["
                + recoverWeight + "].");
        }
    }
}

这个方法很简单,各位可以看看我上面的注释。

总结

总的来说FaultToleranceModule分为两部分:

  1. FaultToleranceSubscriber订阅事件,负责订阅同步和异步结果事件
  2. 根据调用事件进行统计,以及内置的一些策略完成服务的降级和恢复操作。
posted @ 2019-08-10 21:20  luozhiyun  阅读(1030)  评论(0编辑  收藏  举报