Reading the Dubbo Source: Clustering (Fault-Handling Strategies)

Dubbo cluster overview

The entry points for Dubbo's cluster support are the ReferenceConfig.createProxy method and the Protocol.refer method.

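In ReferenceConfig.createProxy, if the user specifies multiple provider URLs or registry URLs, multiple Invokers are created and grouped together in a StaticDirectory. The matching Cluster implementation then wraps this static service directory into a single Invoker. Each cluster type has its own cluster Invoker wrapper, for example FailoverClusterInvoker, FailbackClusterInvoker, FailfastClusterInvoker, FailsafeClusterInvoker and ForkingClusterInvoker, and all of these wrappers extend the abstract class AbstractClusterInvoker. The abstract class implements the state check at call time, the setup of the Invocation parameters, load balancing and the provider availability check; what happens after a call fails is left to the subclasses.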
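The split described above, shared pre-call work in the base class and the failure policy in the subclasses, is the template method pattern. The following is a minimal sketch of that shape only, not Dubbo's actual classes: `ClusterSketch`, `FailfastSketch`, `FailsafeSketch` and the use of `Function<String, String>` as a "provider" are hypothetical simplifications.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for AbstractClusterInvoker: the shared pre-call
// check lives in invoke(), the failure policy lives in doInvoke().
abstract class ClusterSketch {
    private final List<Function<String, String>> providers;
    private volatile boolean destroyed;

    ClusterSketch(List<Function<String, String>> providers) {
        this.providers = providers;
    }

    void destroy() {
        destroyed = true;
    }

    // Template method: common steps first, then the strategy-specific part.
    final String invoke(String request) {
        if (destroyed) {
            throw new IllegalStateException("cluster invoker is destroyed");
        }
        return doInvoke(request, providers);
    }

    protected abstract String doInvoke(String request, List<Function<String, String>> providers);
}

// Fail-fast: call one provider and let any failure propagate.
final class FailfastSketch extends ClusterSketch {
    FailfastSketch(List<Function<String, String>> providers) {
        super(providers);
    }

    @Override
    protected String doInvoke(String request, List<Function<String, String>> providers) {
        return providers.get(0).apply(request);
    }
}

// Fail-safe: call one provider, swallow the failure and return an empty result.
final class FailsafeSketch extends ClusterSketch {
    FailsafeSketch(List<Function<String, String>> providers) {
        super(providers);
    }

    @Override
    protected String doInvoke(String request, List<Function<String, String>> providers) {
        try {
            return providers.get(0).apply(request);
        } catch (RuntimeException e) {
            return null; // a real implementation would log the error here
        }
    }
}
```

Both subclasses share the same `invoke` entry point; only the reaction to a failing provider differs.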

AbstractClusterInvoker.invoke

Let's start with this method, which is the invocation entry point of the Invoker:

@Override

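// This method mainly does the preparation for a call:

// it checks state, sets parameters, fetches the invoker list from the service directory and obtains the load balancer named by the <method name>.loadbalance parameter,

// and finally calls the template method.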

public Result invoke(final Invocation invocation) throws RpcException {

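    // Check whether this Invoker has already been destroyed.

    // An Invoker that is no longer usable may be destroyed when a registry change is observed and the Invoker list is refreshed.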

    checkWhetherDestroyed();

    // Bind the attachments held in RpcContext into the invocation.

    // Users can pass different parameters on each call through RpcContext.

    Map<String, String> contextAttachments = RpcContext.getContext().getAttachments();

    if (contextAttachments != null && contextAttachments.size() != 0) {

        ((RpcInvocation) invocation).addAttachments(contextAttachments);

    }

    // List all service providers.

    // This method simply calls the list method of the service directory.

    List<Invoker<T>> invokers = list(invocation);

    // Obtain the load balancer named by the loadbalance parameter in the URL; the default is random load balancing, RandomLoadBalance.

    LoadBalance loadbalance = initLoadBalance(invokers, invocation);

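    // Attach an invocation id that uniquely identifies this call (as the method name indicates, only for asynchronous calls).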

    RpcUtils.attachInvocationIdIfAsync(getUrl(), invocation);

    // Template method, implemented by subclasses.

    return doInvoke(invocation, invokers, loadbalance);

}

FailoverClusterInvoker.doInvoke

Taking FailoverClusterInvoker, the Invoker of the default cluster strategy, as an example, let's walk through its doInvoke method.

// This method mainly implements the retry logic, which is the defining feature of this class: failover.

public Result doInvoke(Invocation invocation, final List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {

    // Copy the reference into a local variable, because the invoker list may change.

    List<Invoker<T>> copyInvokers = invokers;

    // Check that the provider list is not empty.

    checkInvokers(copyInvokers, invocation);

    String methodName = RpcUtils.getMethodName(invocation);

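    // Read the retries parameter of the called method. The total number of attempts is that value + 1, because the first call does not count as a retry.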

    int len = getUrl().getMethodParameter(methodName, Constants.RETRIES_KEY, Constants.DEFAULT_RETRIES) + 1;

    if (len <= 0) {

        len = 1;

    }

    // Retry loop.

    // Records the most recent exception.

    RpcException le = null; // last exception.

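    // Records the providers that have already been invoked (while the loop is still running, all of them have failed).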

    List<Invoker<T>> invoked = new ArrayList<Invoker<T>>(copyInvokers.size()); // invoked invokers.

    // Records the addresses of the providers that have been invoked.

    Set<String> providers = new HashSet<String>(len);

    for (int i = 0; i < len; i++) {

        //Reselect before retry to avoid a change of candidate `invokers`.

        //NOTE: if `invokers` changed, then `invoked` also lose accuracy.

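        // On every retry, re-check the state, list the available provider Invokers again and check that the list is not empty,

        // because the state and the provider information can change at any time.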

        if (i > 0) {

            checkWhetherDestroyed();

            copyInvokers = list(invocation);

            // check again

            checkInvokers(copyInvokers, invocation);

        }

        // Select one from the list of available Invokers.

        // The selection takes both "sticky" invocation and load balancing into account.

        Invoker<T> invoker = select(loadbalance, invocation, copyInvokers, invoked);

        // Add it to the list of Invokers already invoked.

        invoked.add(invoker);

        RpcContext.getContext().setInvokers((List) invoked);

        try {

            Result result = invoker.invoke(invocation);

            if (le != null && logger.isWarnEnabled()) {

                logger.warn("Although retry the method " + methodName

                        + " in the service " + getInterface().getName()

                        + " was successful by the provider " + invoker.getUrl().getAddress()

                        + ", but there have been failed providers " + providers

                        + " (" + providers.size() + "/" + copyInvokers.size()

                        + ") from the registry " + directory.getUrl().getAddress()

                        + " on the consumer " + NetUtils.getLocalHost()

                        + " using the dubbo version " + Version.getVersion() + ". Last error is: "

                        + le.getMessage(), le);

            }

            return result;

        } catch (RpcException e) {

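            // A business exception is rethrown as is; it passes straight through the Dubbo framework to the user.

            // Non-business failures such as network problems, dropped connections or a provider going offline can be handled by failover and retry.

            // A business exception is rethrown because the Dubbo framework cannot handle it, so retrying would be pointless.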

            if (e.isBiz()) { // biz exception.

                throw e;

            }

            le = e;

        } catch (Throwable e) {

            le = new RpcException(e.getMessage(), e);

        } finally {

            providers.add(invoker.getUrl().getAddress());

        }

    }

    throw new RpcException(le.getCode(), "Failed to invoke the method "

            + methodName + " in the service " + getInterface().getName()

            + ". Tried " + len + " times of the providers " + providers

            + " (" + providers.size() + "/" + copyInvokers.size()

            + ") from the registry " + directory.getUrl().getAddress()

            + " on the consumer " + NetUtils.getLocalHost() + " using the dubbo version "

            + Version.getVersion() + ". Last error is: "

            + le.getMessage(), le.getCause() != null ? le.getCause() : le);

}

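The logic of this method is fairly clear: it retries. That is the main job of this class, failover: if a call throws an exception, it retries on another available provider. The select method it uses is implemented in the abstract class AbstractClusterInvoker.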
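To make the retry behaviour easier to experiment with, here is a stripped-down, self-contained sketch of the same idea. It is not Dubbo code: `FailoverSketch`, `BizException` (standing in for an RpcException whose isBiz() is true) and `pickUntried` (a trivial replacement for select and the load balancer) are hypothetical, and providers are modelled as plain `Supplier`s.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class FailoverSketch {
    // Hypothetical marker for failures that retrying cannot fix.
    static final class BizException extends RuntimeException {
        BizException(String message) {
            super(message);
        }
    }

    // Makes at most retries + 1 attempts, moving on to another provider after each failure.
    static <T> T invoke(List<Supplier<T>> providers, int retries) {
        if (providers.isEmpty()) {
            throw new IllegalStateException("no provider available");
        }
        int attempts = Math.max(retries + 1, 1);
        RuntimeException last = null;                   // most recent failure
        List<Supplier<T>> tried = new ArrayList<>();    // providers already invoked
        for (int i = 0; i < attempts; i++) {
            Supplier<T> provider = pickUntried(providers, tried, i);
            tried.add(provider);
            try {
                return provider.get();
            } catch (BizException e) {
                throw e;                                // business error: do not retry
            } catch (RuntimeException e) {
                last = e;                               // remember it and try the next provider
            }
        }
        throw new IllegalStateException("failed after " + attempts + " attempts", last);
    }

    // Simplified selection: first provider not tried yet, otherwise wrap around.
    private static <T> Supplier<T> pickUntried(List<Supplier<T>> providers,
                                               List<Supplier<T>> tried, int attempt) {
        for (Supplier<T> p : providers) {
            if (!tried.contains(p)) {
                return p;
            }
        }
        return providers.get(attempt % providers.size());
    }
}
```

With two providers where the first one throws and the second one answers, `FailoverSketch.invoke(providers, 2)` returns the second provider's result after exactly two attempts; a `BizException` from the first provider ends the call immediately.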

AbstractClusterInvoker.select

// This method mainly implements the "sticky" invocation logic.

protected Invoker<T> select(LoadBalance loadbalance, Invocation invocation,

                            List<Invoker<T>> invokers, List<Invoker<T>> selected) throws RpcException {

    if (CollectionUtils.isEmpty(invokers)) {

        return null;

    }

    String methodName = invocation == null ? StringUtils.EMPTY : invocation.getMethodName();

    // Whether "sticky" invocation is enabled is controlled by the sticky parameter in the URL.

    // It is disabled by default.

    boolean sticky = invokers.get(0).getUrl()

            .getMethodParameter(methodName, Constants.CLUSTER_STICKY_KEY, Constants.DEFAULT_CLUSTER_STICKY);

    //ignore overloaded method

    // If the cached sticky Invoker is no longer in the available list, clear it.

    if (stickyInvoker != null && !invokers.contains(stickyInvoker)) {

        stickyInvoker = null;

    }

    //ignore concurrency problem

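    // If sticky invocation is enabled, a sticky Invoker exists and it is not in the list of Invokers that already failed,

    // return the sticky Invoker directly (provided the availability check is on and it reports itself available).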

    if (sticky && stickyInvoker != null && (selected == null || !selected.contains(stickyInvoker))) {

        if (availablecheck && stickyInvoker.isAvailable()) {

            return stickyInvoker;

        }

    }

    // Select an Invoker according to the load-balancing strategy.

    Invoker<T> invoker = doSelect(loadbalance, invocation, invokers, selected);

    // Remember it as the sticky Invoker.

    if (sticky) {

        stickyInvoker = invoker;

    }

    return invoker;

}

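In short, select first tries to reuse the cached sticky Invoker and only falls back to doSelect, that is, to load balancing, when that is not possible.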
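The sticky behaviour can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the Dubbo implementation: `StickySelector` is a hypothetical class, a round-robin cursor stands in for the LoadBalance, and the availability check is left out.

```java
import java.util.List;

// Hypothetical, simplified model of the sticky logic in select().
final class StickySelector<T> {
    private T sticky;   // provider chosen by the previous call, if any
    private int cursor; // round-robin position, standing in for a LoadBalance

    T select(List<T> candidates, List<T> alreadyTried, boolean stickyEnabled) {
        if (candidates.isEmpty()) {
            return null;
        }
        // Forget the sticky provider once it has left the candidate list.
        if (sticky != null && !candidates.contains(sticky)) {
            sticky = null;
        }
        // Reuse it unless it already failed during this invocation.
        if (stickyEnabled && sticky != null && !alreadyTried.contains(sticky)) {
            return sticky;
        }
        T chosen = candidates.get(Math.floorMod(cursor++, candidates.size()));
        if (stickyEnabled) {
            sticky = chosen;
        }
        return chosen;
    }
}
```

With stickiness enabled, repeated calls keep returning the same provider until it either disappears from the candidate list or shows up in the already-tried list; with stickiness disabled, every call goes through the stand-in load balancer.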

AbstractClusterInvoker.doSelect

// Select an Invoker according to the load-balancing strategy.

private Invoker<T> doSelect(LoadBalance loadbalance, Invocation invocation,

                            List<Invoker<T>> invokers, List<Invoker<T>> selected) throws RpcException {

    if (CollectionUtils.isEmpty(invokers)) {

        return null;

    }

    if (invokers.size() == 1) {

        return invokers.get(0);

    }

    // Select an Invoker according to the load-balancing strategy.

    Invoker<T> invoker = loadbalance.select(invokers, getUrl(), invocation);

    //If the `invoker` is in the  `selected` or invoker is unavailable && availablecheck is true, reselect.

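    // The selected Invoker still has to be checked.

    // A reselection is needed in the following cases:

    // 1. the selected Invoker is in the list of Invokers that already failed;

    // 2. the availability check is enabled and the selected Invoker is unavailable.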

    if ((selected != null && selected.contains(invoker))

            || (!invoker.isAvailable() && getUrl() != null && availablecheck)) {

        try {

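            // Reselect an Invoker: first choose among those not in the failed list; if that yields nothing, look in the failed list for a provider that has "come back to life".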

            Invoker<T> rinvoker = reselect(loadbalance, invocation, invokers, selected, availablecheck);

            if (rinvoker != null) {

                invoker = rinvoker;

            } else {

                //Check the index of current selected invoker, if it's not the last one, choose the one at index+1.

                int index = invokers.indexOf(invoker);

                try {

                    //Avoid collision

                    // If the reselection found nothing, simply use the next Invoker in the list.

                    invoker = invokers.get((index + 1) % invokers.size());

                } catch (Exception e) {

                    logger.warn(e.getMessage() + " may because invokers list dynamic change, ignore.", e);

                }

            }

        } catch (Throwable t) {

            logger.error("cluster reselect fail reason is :" + t.getMessage() + " if can not solve, you can set cluster.availablecheck=false in url", t);

        }

    }

    return invoker;

}

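The first selection ignores the list of Invokers that already failed, so the Invoker it picks may be in that list; in that case a reselection is needed.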

AbstractClusterInvoker.reselect

private Invoker<T> reselect(LoadBalance loadbalance, Invocation invocation,

                            List<Invoker<T>> invokers, List<Invoker<T>> selected, boolean availablecheck) throws RpcException {

    //Allocating one in advance, this list is certain to be used.

    List<Invoker<T>> reselectInvokers = new ArrayList<>(

            invokers.size() > 1 ? (invokers.size() - 1) : invokers.size());

    // First, try picking a invoker not in `selected`.

    for (Invoker<T> invoker : invokers) {

        if (availablecheck && !invoker.isAvailable()) {

            continue;

        }

        // Exclude the Invokers that are in the failed list.

        if (selected == null || !selected.contains(invoker)) {

            reselectInvokers.add(invoker);

        }

    }

    // If any Invokers remain, select one according to the load-balancing strategy.

    if (!reselectInvokers.isEmpty()) {

        return loadbalance.select(reselectInvokers, getUrl(), invocation);

    }

    // Just pick an available invoker using loadbalance policy

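    // Nothing usable is left, so look in the failed list for an Invoker that is available,

    // because a provider that failed earlier may have become available again during the retries.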

    if (selected != null) {

        for (Invoker<T> invoker : selected) {

            if ((invoker.isAvailable()) // available first

                    && !reselectInvokers.contains(invoker)) {

                reselectInvokers.add(invoker);

            }

        }

    }

    // Select again.

    if (!reselectInvokers.isEmpty()) {

        return loadbalance.select(reselectInvokers, getUrl(), invocation);

    }

    // There really is no available provider, so return null.

    return null;

}

These selection methods show how much care the Dubbo authors took to give every call the best possible chance of succeeding.
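The two-stage fallback of reselect can be condensed into a small generic sketch. Assumptions: `ReselectSketch` is a hypothetical class, availability is passed in as a `Predicate`, the availability check is always on, and "take the first element" stands in for loadbalance.select.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class ReselectSketch {
    // Returns the candidate to use, or null when nothing is usable.
    static <T> T reselect(List<T> all, List<T> tried, Predicate<T> isAvailable) {
        List<T> pool = new ArrayList<>();
        // Stage 1: available candidates that have not been tried yet.
        for (T candidate : all) {
            if (isAvailable.test(candidate) && !tried.contains(candidate)) {
                pool.add(candidate);
            }
        }
        // Stage 2: nothing fresh is left, so give previously tried candidates
        // a second chance if they report themselves available again.
        if (pool.isEmpty()) {
            for (T candidate : tried) {
                if (isAvailable.test(candidate) && !pool.contains(candidate)) {
                    pool.add(candidate);
                }
            }
        }
        // The first element stands in for the load balancer's choice.
        return pool.isEmpty() ? null : pool.get(0);
    }
}
```

The order of the two stages is the point: an untried provider always wins over a recovered one, and null is returned only when neither stage finds anything.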

FailfastClusterInvoker

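Fail-fast: the call is made only once, and an exception is thrown right away if it fails. The code is simple, so there is not much to add.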

public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {

    checkInvokers(invokers, invocation);

    Invoker<T> invoker = select(loadbalance, invocation, invokers, null);

    try {

        return invoker.invoke(invocation);

    } catch (Throwable e) {

        if (e instanceof RpcException && ((RpcException) e).isBiz()) { // biz exception.

            throw (RpcException) e;

        }

        throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : 0,

                "Failfast invoke providers " + invoker.getUrl() + " " + loadbalance.getClass().getSimpleName()

                        + " select from all providers " + invokers + " for service " + getInterface().getName()

                        + " method " + invocation.getMethodName() + " on consumer " + NetUtils.getLocalHost()

                        + " use dubbo version " + Version.getVersion()

                        + ", but no luck to perform the invocation. Last error is: " + e.getMessage(),

                e.getCause() != null ? e.getCause() : e);

    }

}

FailsafeClusterInvoker

The fail-safe fault-handling strategy: "fail-safe" means that when a call fails, no exception is thrown and the failure is only logged.

@Override

public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {

    try {

        checkInvokers(invokers, invocation);

        Invoker<T> invoker = select(loadbalance, invocation, invokers, null);

        return invoker.invoke(invocation);

    } catch (Throwable e) {

        logger.error("Failsafe ignore exception: " + e.getMessage(), e);

        // Return an empty result; the caller has to check the returned result.

        return new RpcResult(); // ignore

    }

}

FailbackClusterInvoker

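After a failure, the failed call is recorded and then retried at a fixed interval. This strategy suits notification-style service calls well. The retry interval is fixed at 5 seconds; the number of retries can be set through a parameter and defaults to 3.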
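Since no source is shown for this strategy, here is a small sketch of the idea only: fail, return to the caller immediately, and retry in the background at a fixed interval for a bounded number of times. It assumes a plain `ScheduledExecutorService` as the timer and makes no claim about how FailbackClusterInvoker schedules its retries; `FailbackSketch`, `retryLater` and the `CompletableFuture<Boolean>` outcome are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class FailbackSketch {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "failback-retry");
                t.setDaemon(true); // do not keep the JVM alive for pending retries
                return t;
            });
    private final long periodMillis;
    private final int maxRetries;

    FailbackSketch(long periodMillis, int maxRetries) {
        this.periodMillis = periodMillis;
        this.maxRetries = maxRetries;
    }

    // Runs the call once. On failure the method still returns immediately;
    // the future completes with true once a retry succeeds, or with false
    // after the retries are used up.
    CompletableFuture<Boolean> invoke(Runnable call) {
        CompletableFuture<Boolean> outcome = new CompletableFuture<>();
        try {
            call.run();
            outcome.complete(true);
        } catch (RuntimeException e) {
            retryLater(call, maxRetries, outcome);
        }
        return outcome;
    }

    private void retryLater(Runnable call, int retriesLeft, CompletableFuture<Boolean> outcome) {
        if (retriesLeft <= 0) {
            outcome.complete(false); // give up
            return;
        }
        timer.schedule(() -> {
            try {
                call.run();
                outcome.complete(true);
            } catch (RuntimeException e) {
                retryLater(call, retriesLeft - 1, outcome);
            }
        }, periodMillis, TimeUnit.MILLISECONDS);
    }
}
```

`new FailbackSketch(5000, 3)` would mirror the 5-second interval and the default of 3 retries mentioned above; a much shorter period is convenient for trying the sketch out.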

ForkingClusterInvoker

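This strategy is rather interesting: every call starts several threads that run in parallel, and whichever produces a result first wins. I suspect it is rarely used; who is so flush with resources that they can burn them like this?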

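That said, it closely resembles the speculative execution found in some distributed computing frameworks: if a task runs slowly, the same task is also started on another node, and the result of whichever finishes first is used. Spark, for example, has a speculative execution mechanism.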
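The "first result wins" idea maps directly onto the JDK's `ExecutorService.invokeAny`, which returns the result of one task that completed successfully and cancels the rest. The sketch below uses it as a stand-in; it says nothing about how ForkingClusterInvoker is implemented, and `ForkingSketch` and `invokeFirst` are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ForkingSketch {
    // Calls all providers in parallel and returns the first successful result.
    // The provider list must not be empty.
    static <T> T invokeFirst(List<Callable<T>> providers) {
        ExecutorService pool = Executors.newFixedThreadPool(providers.size());
        try {
            // invokeAny blocks until one task succeeds or all of them have failed.
            return pool.invokeAny(providers);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("all forked calls failed", e.getCause());
        } finally {
            pool.shutdownNow(); // interrupt the calls that are still running
        }
    }
}
```

The cost the text jokes about is visible here: every call occupies one thread per provider, and all but one of the results are thrown away.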

Summary

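Different cluster wrapper classes implement different fault-handling strategies. Failover is the default; other commonly used ones include fail-fast, fail-safe, failback (scheduled retry) and merged invocation.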