mshadow的原理--MXNet

这文章主要解释了表达式模板的工作原理（也是mshadow的主要原理），文章的前半部分是翻译自exp-template/README.md。我们会解释它为什么会影响编译代码的性能，表达式模板也是C++矩阵运算库的用到的主要技巧，比如Eigen,GSL,boost.uBLAS。

如何写出机器学习的高效代码？

在开始之前，我们先考一个问题，假如更新的规则如下：（这里是为了达到解释的目的，通常更新规则是这样的：weight += - eta * (grad + lambda * weight) ）

weight =  - eta * (grad + lambda * weight);

这里权重与梯度都是长度为n的向量。当你选择C++作为你的编程语言时，我想你主要考虑是效率。下面这个很重要并且用在大多数C/C++程序中：

预先分配必要的内存，在程序运行的过程中没有临时内存

这里是一个例子：

void UpdateWeight (const float *grad, float eta, float lambda,

                   int n, float *weight) {

  for (int i = 0; i < n; ++i) {

    weight[i] =  - eta * (grad[i] + lambda * weight[i]);

  }

}

这个函数用了预先分配的梯度、权重空间来计算，写这样的一个函数十分简单，然而当我们要重复这样写的时候会十分烦恼。所以问题是如果我们写成以下的样子，能得到和上面代码一样的性能吗？

void UpdateWeight (const Vec& grad, float eta, float lambda, Vec& weight) {

  weight = -eta * (grad + lambda * weight);

}

答案是可以的，但这不是最显然的答案。

一个原始的方法

让我们首先来看一下最直接的解决方法：运算符重载。

// Naive solution for vector operation overloading

struct Vec {

  int len;

  float* dptr;

  Vec(int len) : len(len) {

    dptr = new float[len];

  }

  Vec(const Vec& src) : len(src.len) {

    dptr = new float[len];

    memcpy(dptr, src.dptr, sizeof(float)*len );

  }

  ~Vec(void) {

    delete [] dptr;

  }

};

inline Vec operator+(const Vec &lhs, const Vec &rhs) {

  Vec res(lhs.len);

  for (int i = 0; i < lhs.len; ++i) {

    res.dptr[i] = lhs.dptr[i] + rhs.dptr[i];

  }

  return res;

}

如果我们用同样的方式增加更多的运算符重载，我们就可以得到我们的想要的直接写等式而不是循环的方法。然而，这种方法明显是低效的，因为中间有临时内存被分配与释放，所以我们能做得更好。

有一种更高效的选择是：我们可以仅重载运算符+=,-=，这两个运算符是不用分配临时内存的，但是这会限制我们写等式。等会我们将会讨论为什么我们需要表达式模板，虽然C++11在本教程的末尾提供了移动赋值运算符和右值引用。

延迟计算

在做运算符+时，为什么我们要分配临时内存呢？这是因为我们不知道将在运算符+中分配的目标，否则我们可以直接将结果存入到目标中，而不是放在临时变量中。

但是如果我们知道目标呢？这个结代码的实现在exp_lazy.cpp中：

// Example Lazy evaluation code

// for simplicity, we use struct and make all members public

#include <cstdio>

struct Vec;

// expression structure holds the expression

struct BinaryAddExp {

  const Vec &lhs;

  const Vec &rhs;

  BinaryAddExp(const Vec &lhs, const Vec &rhs)

  : lhs(lhs), rhs(rhs) {}

};

// no constructor and destructor to allocate and de-allocate memory,

//  allocation done by user

struct Vec {

  int len;

  float* dptr;

  Vec(void) {}

  Vec(float *dptr, int len)

      : len(len), dptr(dptr) {}

  // here is where evaluation happens

  inline Vec &operator=(const BinaryAddExp &src) {

    for (int i = 0; i < len; ++i) {

      dptr[i] = src.lhs.dptr[i] + src.rhs.dptr[i];

    }

    return *this;

  }

};

// no evaluation happens here

inline BinaryAddExp operator+(const Vec &lhs, const Vec &rhs) {

  return BinaryAddExp(lhs, rhs);

}

const int n = 3;

int main(void) {

  float sa[n] = {1, 2, 3};

  float sb[n] = {2, 3, 4};

  float sc[n] = {3, 4, 5};

  Vec A(sa, n), B(sb, n), C(sc, n);

  // run expression

  A = B + C;

  for (int i = 0; i < n; ++i) {

    printf("%d:%f==%f+%f\n", i, A.dptr[i], B.dptr[i], C.dptr[i]);

  }

  return 0;

}

我们实现的思想是在运算符+并没有直接的计算，而是返回一个表达的对象（像抽象语法树），当我们重载运算符=时，我们就可以知道目标和所有的操作时，这样我们就可以直接计算而且不需要临时变量。同样地，我们定义DotDxp和在运算符=上定义延迟计算，并将矩阵（向量）的乘法定向的BLAS库上计算。

更长的表达式与表达式模板

使用延迟计算，我们可以很好地避免了临时变量的分配，但是代码的扩展能力被限制了：

我们只能写出A=B+C，不能写出更出的表达式了。
当我们加入表达式时，我们要重载更多的运算符=来计算每一个等式。

这里我们实现了一个魔法模板程序来解决这两个问题，代码（exp_template.cpp）如下，代码虽然有点长，但可以允许你写更多的等式。

// Example code, expression template, and more length equations

// for simplicity, we use struct and make all members public

#include <cstdio>

// this is expression, all expressions must inheritate it,

//  and put their type in subtype

template<typename SubType>

struct Exp {

  // returns const reference of the actual type of this expression

  inline const SubType& self(void) const {

    return *static_cast<const SubType*>(this);

  }

};

// binary add expression

// note how it is inheritates from Exp

// and put its own type into the template argument

template<typename TLhs, typename TRhs>

struct BinaryAddExp: public Exp<BinaryAddExp<TLhs, TRhs> > {

  const TLhs &lhs;

  const TRhs &rhs;

  BinaryAddExp(const TLhs& lhs, const TRhs& rhs)

      : lhs(lhs), rhs(rhs) {}

  // evaluation function, evaluate this expression at position i

  inline float Eval(int i) const {

    return lhs.Eval(i) + rhs.Eval(i);

  }

};

// no constructor and destructor to allocate

// and de-allocate memory, allocation done by user

struct Vec: public Exp<Vec> {

  int len;

  float* dptr;

  Vec(void) {}

  Vec(float *dptr, int len)

      :len(len), dptr(dptr) {}

  // here is where evaluation happens

  template<typename EType>

  inline Vec& operator= (const Exp<EType>& src_) {

    const EType &src = src_.self();

    for (int i = 0; i < len; ++i) {

      dptr[i] = src.Eval(i);

    }

    return *this;

  }

  // evaluation function, evaluate this expression at position i

  inline float Eval(int i) const {

    return dptr[i];

  }

};

// template add, works for any expressions

template<typename TLhs, typename TRhs>

inline BinaryAddExp<TLhs, TRhs>

operator+(const Exp<TLhs> &lhs, const Exp<TRhs> &rhs) {

  return BinaryAddExp<TLhs, TRhs>(lhs.self(), rhs.self());

}

const int n = 3;

int main(void) {

  float sa[n] = {1, 2, 3};

  float sb[n] = {2, 3, 4};

  float sc[n] = {3, 4, 5};

  Vec A(sa, n), B(sb, n), C(sc, n);

  // run expression, this expression is longer:)

  A = B + C + C;

  for (int i = 0; i < n; ++i) {

    printf("%d:%f == %f + %f + %f\n", i,

           A.dptr[i], B.dptr[i],

           C.dptr[i], C.dptr[i]);

  }

  return 0;

}

关键的思想是模板Exp<SubType>将派生的类作为模板参数，这样就可以将这个模板的自身通过self()转换成SubTpye（就是派生类）。BinaryAddExp现在是一个模板类，可以将表达式复合在一起，就像一个复合模式的模板版本一样。计算通过函数Eval完成，它在BinaryAddExp中以递归的方式完成。

由于内联，当函数在运算符=调用src.Eval(i) 会在编译时被编译成B.dptr[i] + C.dptr[i] + C.dptr[i]。
我们可以像循环一样高效地将等式写成逐元素的方式。

使它更灵活

通过上面的例子，模板编程编译时可以强大地更程序更加灵活，最后的例子比较接近mshadow了，可以请允许用户使用双目运算符（exp_template_op.cpp）。

// Example code, expression template

// with binary operator definition and extension

// for simplicity, we use struct and make all members public

#include <cstdio>

// this is expression, all expressions must inheritate it,

// and put their type in subtype

template<typename SubType>

struct Exp{

  // returns const reference of the actual type of this expression

  inline const SubType& self(void) const {

    return *static_cast<const SubType*>(this);

  }

};

// binary operators

struct mul{

  inline static float Map(float a, float b) {

    return a * b;

  }

};

// binary add expression

// note how it is inheritates from Exp

// and put its own type into the template argument

template<typename OP, typename TLhs, typename TRhs>

struct BinaryMapExp: public Exp<BinaryMapExp<OP, TLhs, TRhs> >{

  const TLhs& lhs;

  const TRhs& rhs;

  BinaryMapExp(const TLhs& lhs, const TRhs& rhs)

      :lhs(lhs), rhs(rhs) {}

  // evaluation function, evaluate this expression at position i

  inline float Eval(int i) const {

    return OP::Map(lhs.Eval(i), rhs.Eval(i));

  }

};

// no constructor and destructor to allocate and de-allocate memory

// allocation done by user

struct Vec: public Exp<Vec>{

  int len;

  float* dptr;

  Vec(void) {}

  Vec(float *dptr, int len)

      : len(len), dptr(dptr) {}

  // here is where evaluation happens

  template<typename EType>

  inline Vec& operator=(const Exp<EType>& src_) {

    const EType &src = src_.self();

    for (int i = 0; i < len; ++i) {

      dptr[i] = src.Eval(i);

    }

    return *this;

  }

  // evaluation function, evaluate this expression at position i

  inline float Eval(int i) const {

    return dptr[i];

  }

};

// template binary operation, works for any expressions

template<typename OP, typename TLhs, typename TRhs>

inline BinaryMapExp<OP, TLhs, TRhs>

F(const Exp<TLhs>& lhs, const Exp<TRhs>& rhs) {

  return BinaryMapExp<OP, TLhs, TRhs>(lhs.self(), rhs.self());

}

template<typename TLhs, typename TRhs>

inline BinaryMapExp<mul, TLhs, TRhs>

operator*(const Exp<TLhs>& lhs, const Exp<TRhs>& rhs) {

  return F<mul>(lhs, rhs);

}

// user defined operation

struct maximum{

  inline static float Map(float a, float b) {

    return a > b ? a : b;

  }

};

const int n = 3;

int main(void) {

  float sa[n] = {1, 2, 3};

  float sb[n] = {2, 3, 4};

  float sc[n] = {3, 4, 5};

  Vec A(sa, n), B(sb, n), C(sc, n);

  // run expression, this expression is longer:)

  A = B * F<maximum>(C, B);

  for (int i = 0; i < n; ++i) {

    printf("%d:%f == %f * max(%f, %f)\n",

           i, A.dptr[i], B.dptr[i], C.dptr[i], B.dptr[i]);

  }

  return 0;

}

总结

到这里为止，你应该明白它工作的基本思想：

延迟计算，使得我们能知道所有的操作数和目标。
复合模板和递归计算，使得我们能够计算逐元素操作的任意复合表达式。
由于模板和内联的设计，我们写出来的表达式像用循环实现更新规则的一样高效。

所以在编写机器学习代码时写表达式，并将精力集中在重要的算法部分上。

在MShadow中的表达式模板

在Mshadow的表达式模板用到的上面我们介绍的关键思想，但有几个小的不同点：

我们将评估代码与表达式构建和组成代码分开：
- 在表达中创建Plan类用来替代Exp类的计算函数Eval，用来计算结果。
- 这允许我们在Plan中放置较少的变量，例如，当我们评估数据时，我们不需要数组长度。
- 一个重要的原因是CUDA内核不能使用const引用来接受类。
- 虽然这种设计选择是有争议的，但我们发现迄今为止还是有用的。
延迟还支持复式的表达式，比如矩阵的乘法
- 除了逐元素的表达式，我们还支持比这样A = dot(B.T(), C)的运算，同样延迟表达是不需要分配临时内存的。
类型检查和数组长度检查。

备注

表达式模板与C++11：在C ++ 11中，移动构造函数可以用来保存重复的分配内存，这样就省去了一些需要的表达模板。然后，仍然要分配最少一次的空间。
- 这只是删除了表达式模板中表达式所需的内存，比如dst = A+B+C，dst并没有包括赋值前所分配的空间。
- 如果我们想保留所有的变量预先分配内存的语法，并且表达式执行时没有内存分配（这是我们在mshadow中所做的），我们仍然需要表达式模板。

Mshadow源码解析

Mshadow采用了表达式模板增强了c++矩阵库的性能，四个主要的数据类型的继承关系如下：

Tensor --> TRValue --> RValueExp --> Exp

基类Exp

可以看到基类是Exp，除了一些基本的数据类型（如int、float等），其它的数据类型都是继承于Exp，Exp的设计特殊之处在于可以将它的派生类作为模板参数，这样就可以将这个模板的自身通过self()（返回的是一个不可修改的实例）或者ptrself()（返回的是一个可修改的指针）转换成SubTpye（就是派生类）

template<typename SubType, typename DType, int exp_type>

struct Exp {

 public:

  /*! \return  subtype instance of current class */

  inline const SubType& self(void) const {

    return *static_cast<const SubType*>(this);

  }

  /*! \return reference of subtype instance of current class */

  inline SubType* ptrself(void) {

    return static_cast<SubType*>(this);

  }

};

RValueExp

RValueExp仅定义了一些赋值函数、重载运算符等，要注意的是它将表达式的类型写成默认的kRValue=0，后面所有的数据定义时表达类型都是这个，真正改变表达式类型的是运算符，比如dot、一些重载的运算符。

template<typename Container, typename DType>

class RValueExp: public Exp<Container, DType, type::kRValue> {

	//...

}

来看一下表达式的定义，可以看到的是等级高的表达式包括了等级低的，总结起来就是
- kRValue = 0，直接对应一个数据类，可以用来分配数据，比。
- kMapper = 1，表达式包含元素张量运算，将表达式映射到相同的形状。
- kChainer = 3，表达式可以被写成与其他表达式的链接，通常它具有定义的函数Eval(i,j)中所定义，这个能从输入的表达式中抽出结果(i,j)并和输出到特定位置中。
- kComplex = 7，用在其它运算中，比如dot。

namespace type {

// type expression type are defined as bitmask

// subtype relationshop kRValue < kMapper < kMapper < kComplex

/*!

 * \brief this expression directly correspnds to a data class,

 *   can be used to assign data

 */

const int kRValue = 0;

/*!

 * \brief expression contains element-wise tensor operations,

 *   map a expression to same shape

 */

const int kMapper = 1;

/*!

 * \brief expression that can be chained with other expressiones

 *    Usually it have function Eval(i,j) defined, which pulls the result (i, j) from input

 *    expression and output the result at certain position.

 */

const int kChainer = 3;

/*! \brief othercase: e.g dot product */

const int kComplex = 7;

}  // namespace type

TRValue

TRValue并没有什么实现的内容，但它是所有可能Teson的超类：

template<typename Container, typename Device, int dimension, typename DType>

struct TRValue: public expr::RValueExp<Container, DType> {

};

Tensor

Tensor是我们的计算基本数据类型，在mshadow中，除了基本类型，其它数据结果最终于都要在回到Tensor中计算，包括继承它了类TensorContainer和用于图模型中更抽象灵活的数据结构TBlob。

template<typename Device, int dimension,

         typename DType MSHADOW_DEFAULT_DTYPE>

struct Tensor: public TRValue<Tensor<Device, dimension, DType>,

                              Device, dimension, DType> {

 public:

  //--------------------------------

  // struct memembers

  //--------------------------------

  /*! \brief whether current type lies in cpu */

  static const bool kDevCPU = Device::kDevCPU;

  /*! \brief dimension of subtype */

  static const int  kSubdim = dimension - 1;

  //--------------------------------

  // struct memembers

  //--------------------------------

  /*! \brief pointer to the data */

  DType *dptr_;

  /*! \brief shape of the tensor */

  Shape<dimension> shape_;

  /*!

   * \brief storing the stride information in x dimension

   *    this is used to deal with pitch allocation in gpu or sse(align x dimension to 64bit) for efficiency

   */

  index_t stride_;

  /*!

   * \brief stream where the computation lies

   * stream is a device dependency concept where each computation

   */

  Stream<Device> *stream_;

  //--------------------------------

  // functions

  //--------------------------------

  ...

}

TensorContainer

TensorContainer继承自Tensor,最大的区别是TensorContainer在创建对象的时候会分配相应的空间，使用时相对比较方便。

template<typename Device, int dimension, typename DType = default_real_t>

class TensorContainer: public Tensor<Device, dimension, DType> {

}

TBlob

对于Tensor来说，它的Shape是不能改变的，但对于深度学习的图模型来说，这显然是不合适的。为了适应这个需求，设计了用于图模型中更抽象灵活的数据结构TBlob。通过TBlob的成员函数能获取它在储存的数据并转于相应的Tensor。

class TBlob {

 public:

  /*! \brief pointer to the data */

  void *dptr_;

  /*! \brief shape of the tensor */

  TShape shape_;

  /*!

   * \brief storing the stride information in x dimension

   */

  index_t stride_;

  /*! \brief device mask of the corresponding device */

  int dev_mask_;

  /*! \brief type flag of the tensor blob */

  int type_flag_;

  ...

}

计算的过程

表达式和对应的计算是分开的，而且用了延迟计算的思想，到赋值运算时会一起计算。新构建一个运算符的表达式有如下步骤：

先定义表达类，要继承于Exp，内建构造函数存储相应的数据。
构建运算符，生成表达类。
重载该表达类的Plan类，生成表达类的Plan对象，内建Eval函数计算Plan对象的值。
重载该表达类的MakePlan类，用来生成该表达类的Plan类。
重载该表达类的ExpInfo类，用来存储该表达类的相关信息。

下面我们用basic.cpp内的一段函数来说明上面的过程及原理：

40	TensorContainer<cpu, 2> lhs(Shape2(2, 3)), rhs(Shape2(2, 3)), ret(Shape2(2,2));

41	lhs = 1.0;

42	rhs = 1.0;

43	ret = implicit_dot(lhs, rhs.T());

第40行：

这个是声明变量，正如上面所说，这里为TensorContainer变量分配了相应的空间。

第41、42行：

这两行说的是同一个东西，我们就以41行lhs = 1.0为例说明（过程有省略，因为调用堆栈比较深）。

操作数1已经是一个完整的对象了，不会再有操作，所以直接调用赋值运算符=
=两边的变量，左边个是常用的类型double或者float（看编译器），右边是TensorContainer对象，所以调用Tensor_container.cpp中的函数：

inline Tensor<Device, dimension, DType> &operator=(DType s) {

	return this->__assign(s);

}

__assign在父类RValueExp（在文件Expresion.h）中定义了，查看数据类型，调用的是以下函数，其中saveto是一个运算符（也可以说成操作符）。要注意的是，这个函数内有一个操作scalar<DType>(s)，这个操作将类型从DTpye转变成ScalarExp<Dtype>，而且运算符的类型变成了KMapper。另外，this->ptrself()指的是lhs，scalar<DType>(s)则是1。

inline Container &__assign(DType s) {

	ExpEngine<sv::saveto, Container, DType>::Eval(this->ptrself(), scalar<DType>(s));

	return *(this->ptrself());

}

调用以下函数（Expr_engine-inl.h），SV是操作符saveto：

template<typename E>

inline static void Eval(RV *dst, const Exp<E, DType, type::kMapper> &exp) {

	MapExp<SV>(dst, exp);

}

调用Tensor_cpu-inl.h的MapExp、Map函数:

template<typename Saver, typename R, int dim, typename DType, typename E, int etype>

inline void MapExp(TRValue<R, cpu, dim, DType> *dst, const expr::Exp<E, DType, etype> &exp) {

	...

	MapExpCPUEngine<expr::PacketCheck<E, MSHADOW_DEFAULT_PACKET>::kPass,Saver, R, dim, DType, E, etype>

	::Map(dst->ptrself(), exp);

}

template<bool pass_check, typename Saver, typename R, int dim, typename DType, typename E, int etype>

struct MapExpCPUEngine {

		inline static void Map(TRValue<R, cpu, dim, DType> *dst, const expr::Exp<E, DType, etype> &exp) {

			MapPlan<Saver>(dst, MakePlan(exp.self()));

	}

};

MapPlan有两次调用，先是调用里面的MapPlan，函数在Expr_engine-inl.h中，并且得到Scalar<DType>对象的Plan对象。然后再调用Tensor_cpu-inl.h的MapPlan

template<typename DType>

inline Plan<ScalarExp<DType>, DType> MakePlan(const ScalarExp<DType> &e) {

	return Plan<ScalarExp<DType>, DType>(e.scalar_);

}

template<typename Saver, typename R, int dim, typename DType, typename E>

inline void MapPlan(TRValue<R, cpu, dim, DType> *dst, const expr::Plan<E, DType> &plan) {

	Shape<2> shape = expr::ShapeCheck<dim, R>::Check(dst->self()).FlatTo2D();

	expr::Plan<R, DType> dplan = expr::MakePlan(dst->self());

#if (MSHADOW_USE_CUDA == 0)

	#pragma omp parallel for

#endif

// temp remove openmp, as default setting throttles CPU

	for (openmp_index_t y = 0; y < shape[0]; ++y) {

		for (index_t x = 0; x < shape[1]; ++x) {

  		// trust your compiler! -_- they will optimize it

  			Saver::template Save<DType>(dplan.REval(y, x), plan.Eval(y, x));

		}

	}

}

再调用plan.Eval(y, x)与dplan.REval(y, x)，两个函数都在Expr_engine-inl.h中：

// scalar

template<typename DType>

class Plan<ScalarExp<DType>, DType> {

	public:

	explicit Plan(DType scalar) : scalar_(scalar) {}

	MSHADOW_XINLINE DType Eval(index_t y, index_t x) const {

		return scalar_;

	}

	private:

	DType scalar_;

};

// tensor plan

template <typename Device, int dim, typename DType>

class Plan<Tensor<Device, dim, DType>, DType> {

	public:

	explicit Plan(const Tensor<Device, dim, DType> &t)

  		: dptr_(t.dptr_), stride_(t.stride_) {}

	// for RValue, the return type should be reference

	MSHADOW_XINLINE DType &REval(index_t y, index_t x) {

		return dptr_[y * stride_ + x];

	}

	// const evaluation

	MSHADOW_XINLINE const DType &Eval(index_t y, index_t x) const {

		return dptr_[y * stride_ + x];

	}

	private:

	DType  *dptr_;

	index_t stride_;

};

最后调用的是Saver::template Save<DType>(dplan.REval(y, x), plan.Eval(y, x))，这个Save操作就是我们之前的saveto，调用在Base.h函数：

/*! \brief save to saver: = */

struct saveto {

	 /*! \brief save b to a using save method */

	template<typename DType>

	MSHADOW_XINLINE static void Save(DType &a, DType b) { // NOLINT(*)

		a = b;

	}

    ...

};

到这里，基本的赋值操作就完成了，你会发现用表达式模板是十不直接的操作。

第43行ret = implicit_dot(lhs, rhs.T())：

这个更加复杂，implicit_dot可以看成一个独立的操作符，如果想写一个新的操作符，可以参考Implicit_gemn.h。这个表达式复杂的原因是有三个运算符——.T()、implicit_dot和ret，要用到递归运算了。无论表达式多么的复杂，我们只要记住，除了赋值运算（其实它也没有Plan类）外一个表达式的结果可以用Plan.Eval来获得的。下面是一些这行代码运算的基本过程：

rhs.T()这个操作在Expression.h中，返回一个TransposeExp表达式（注意的是inline的函数会在编译中展开，这里为了方便理解，只是说了调用）。这里要注意的是TransposeExp表达式的类型是kChainer。

inline const TransposeExp<Container, DType> T(void) const {

	return TransposeExp<Container, DType>(this->self());

}

implicit_dot(lhs, rhs.T())函数在Implicit_gemn.h中，返回一个ImplicitGEMMExp表达式。这里要注意的是implicit_dot表达式的类型是kChainer。

template<typename LhsExp, typename RhsExp, typename DType, int e1, int e2>

inline ImplicitGEMMExp<LhsExp, RhsExp, DType>

implicit_dot(const Exp<LhsExp, DType, e1> &lhs, const Exp<RhsExp, DType, e2> &rhs) {

	TypeCheckPass<ExpInfo<LhsExp>::kDim == 2 && ExpInfo<RhsExp>::kDim == 2>

  		::Error_Expression_Does_Not_Meet_Dimension_Req();

	return ImplicitGEMMExp<LhsExp, RhsExp, DType>(lhs.self(), rhs.self());

}

赋值运算符=与上面所以的步骤有相同之处，只是不同的类型，调用的重载函数是不一样的。上面的调用的第一个MapPlan是调用Implicit_gemn.h中的，返回的是一个ImplicitGEMMExp的Plan类，第二个则是一样的。

template<typename LhsExp, typename RhsExp, typename DType>

inline Plan<ImplicitGEMMExp<LhsExp, RhsExp, DType>, DType>

MakePlan(const ImplicitGEMMExp<LhsExp, RhsExp, DType> &exp) {

      return Plan<ImplicitGEMMExp<LhsExp, RhsExp, DType>, DType>(exp);

}

我们来看程序运行到Saver::template Save<DType>(dplan.REval(y, x), plan.Eval(y, x))时，对于dplan.REval(y, x)和Save我们已经说过了，这次要说的是Plan.Eval(y, x)，来看下是如果进行递归调用的。这个函数在Implicit_gemn.h中：

template<typename LhsExp, typename RhsExp, typename DType>

struct Plan<ImplicitGEMMExp<LhsExp, RhsExp, DType>, DType> {

 public:

  explicit Plan(const ImplicitGEMMExp<LhsExp, RhsExp, DType> &e)

      : lhs_(MakePlan(e.lhs_)),

        rhs_(MakePlan(e.rhs_)),

        prod_size_(e.prod_size_),

        prod_size_lower_align_(packet::LowerAlign<DType, MSHADOW_DEFAULT_PACKET>(e.prod_size_)) {

  }

  MSHADOW_XINLINE DType Eval(index_t y, index_t x) const {

    typedef packet::Packet<DType> Packet;

    Packet sum = Packet::Fill(0);

    const size_t packetSize = Packet::Size();

    DType lhs_temp[packetSize], rhs_temp[packetSize];

    for (index_t i = 0; i < prod_size_lower_align_; i += packetSize) {

      // unroll

      for (index_t j = 0; j < packetSize; ++j) {

        lhs_temp[j] = lhs_.Eval(y, i + j);

      }

      for (index_t j = 0; j < packetSize; ++j) {

        rhs_temp[j] = rhs_.Eval(i + j, x);

      }

      sum = sum + Packet::LoadUnAligned(lhs_temp) * Packet::LoadUnAligned(rhs_temp);

    }

    DType ret_result = sum.Sum();

    for (index_t i =  prod_size_lower_align_; i < prod_size_; ++i) {

      ret_result += lhs_.Eval(y, i) * rhs_.Eval(i, x);

    }

    return ret_result;

  }

 private:

  expr::Plan<LhsExp, DType> lhs_;

  expr::Plan<RhsExp, DType> rhs_;

  const index_t prod_size_;

  const index_t prod_size_lower_align_;

};

在进行计算中，要得到lhs_与rhs_表达式的值，而这两个表达式也是Plan类，之前我们说过：只要记得Plan.Eval是得到表达式的值就行了。所以当调用rhs_.Eval一直递归到计算得到值为止，所以rhs_.Eval最后得到的ret = implicit_dot(lhs, rhs.T())中rhs.T()的值。

余下的过程和上述的过程差不多了。

可以看到上述的操作并没有用临时内存。
对于传统的一些操作符+、-、*、/等，会被统一到TernaryMapExp（三目）、BinaryMapExp（双目）、UnaryMapExp（单目）中，参考加文件Expression.h，它的相关类MakePlan与Plan则定义在Exp_engine-inl.h。

其它

对于gpu的编程有它的一些限制，但设计的类和上而是差的多的。
logging.h是日志系统，打印记录一些系统的信息。
io.h是读写文件的操作，与mshadow相对来说是独立的，只是它适配了一些类型的格式（比如Tensor）读写。

【防止爬虫转载而导致的格式问题——链接】：http://www.cnblogs.com/heguanyou/p/7545344.html