XGBoost Source Code Notes
A walk through the official code structure, based on the project's README.MD / Coding Guide.
For regression, XGBoost's loss function is the squared error loss;
for classification, it is the logistic (negative log-likelihood) loss.
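For reference, the gradients and hessians that the objective code later computes for these two losses are the textbook results below (a recap in my own notation, not quoted from the source):

```latex
% squared error (regression): prediction \hat{y}, label y
L = \tfrac{1}{2}(\hat{y} - y)^2, \qquad g = \frac{\partial L}{\partial \hat{y}} = \hat{y} - y, \qquad h = \frac{\partial^2 L}{\partial \hat{y}^2} = 1

% logistic loss (binary classification): p = \sigma(x) = \frac{1}{1 + e^{-x}}
L = -\left[\, y \log p + (1 - y)\log(1 - p) \,\right], \qquad g = p - y, \qquad h = p\,(1 - p)
```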
Coding Guide
======
This file is intended to be notes about the code structure in xgboost.

Project Logical Layout // dependency chain: IO -> LEARNER (computes gradients and passes them to GBM) -> GBM (gradient boosting) -> TREE (tree construction algorithms)
=======
* Dependency order: io->learner->gbm->tree
- All modules depend on data.h
* tree contains the implementations of tree construction algorithms.
* gbm is the gradient boosting interface, which takes trees and other base learners to do boosting.
- gbm only takes gradients as sufficient statistics; it does not compute the gradient.
* learner is the learning module that computes the gradient for a specific objective and passes it to gbm.

File Naming Convention // .h files define data structures and interfaces; -inl.hpp files implement them
=======
* .h files contain data structures and interfaces, which are needed to use the functions in that layer.
* -inl.hpp files are implementations of the interfaces, like .cpp files in most projects.
- You only need to understand the interface file to understand the usage of that layer.
* In each folder, there can be a .cpp file that compiles the module of that layer.

How to Hack the Code // defining and modifying objective functions
======
* Add objective function: add to learner/objective-inl.hpp and register it in learner/objective.h ```CreateObjFunction``` (a sketch follows this list)
- You can also directly do it in python
* Add new evaluation metric: add to learner/evaluation-inl.hpp and register it in learner/evaluation.h ```CreateEvaluator```
* Add a wrapper for a new language: most likely you can do it by taking the functions in python/xgboost_wrapper.h, which is purely C based, and calling these C functions to use xgboost
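To make the first point concrete, here is a minimal sketch of what a custom objective could look like against the IObjFunction interface listed later in this post. The class name MySquaredError and the registration string reg:mycustom are hypothetical, and the code assumes one prediction per label; it is an illustration, not part of the xgboost source.

```cpp
// in learner/objective-inl.hpp (sketch): a hypothetical squared-error objective
class MySquaredError : public IObjFunction {
 public:
  virtual ~MySquaredError(void) {}
  virtual void SetParam(const char *name, const char *val) {}  // this objective has no parameters
  virtual void GetGradient(const std::vector<float> &preds,
                           const MetaInfo &info,
                           int iter,
                           std::vector<bst_gpair> *out_gpair) {
    out_gpair->resize(preds.size());
    for (size_t i = 0; i < preds.size(); ++i) {
      const float w = info.GetWeight(static_cast<unsigned>(i));
      const float diff = preds[i] - info.labels[i];       // assumes preds.size() == labels.size()
      (*out_gpair)[i] = bst_gpair(diff * w,                // first-order gradient of 1/2*(pred-label)^2
                                  1.0f * w);               // second-order gradient (constant)
    }
  }
  virtual const char* DefaultEvalMetric(void) const { return "rmse"; }
};

// in learner/objective.h (sketch): register it inside CreateObjFunction
// if (!strcmp("reg:mycustom", name)) return new MySquaredError();
```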
XGBoost: eXtreme Gradient Boosting
An optimized general purpose gradient boosting library. The library is parallelized, and also provides an optimized distributed version.
It implements machine learning algorithms under the gradient boosting framework, including the generalized linear model and gradient boosted regression trees (GBDT). XGBoost can also be distributed and scales to terascale data.
The UpdateOneIter flow consists mainly of the following steps (a sketch combining them follows the list): 1. LazyInitDMatrix(train);
2. PredictRaw(train, &preds_);
3. obj_->GetGradient(preds_, train->info(), iter, &gpair_);
4. gbm_->DoBoost(train, &gpair_, obj_.get());
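Putting the four steps together, one boosting round looks roughly like the sketch below. The member names follow the calls listed above; the exact signature and surrounding details are simplified assumptions, not the verbatim source.

```cpp
// sketch of one boosting iteration (simplified)
void Learner::UpdateOneIter(int iter, DMatrix *train) {
  LazyInitDMatrix(train);                                    // 1. lazily initialize the training matrix
  PredictRaw(train, &preds_);                                // 2. raw margin predictions of the current model
  obj_->GetGradient(preds_, train->info(), iter, &gpair_);   // 3. objective turns predictions + labels into grad/hess pairs
  gbm_->DoBoost(train, &gpair_, obj_.get());                 // 4. the booster fits new trees to those statistics
}
```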
The objective.h file:
#ifndef XGBOOST_LEARNER_OBJECTIVE_H_
#define XGBOOST_LEARNER_OBJECTIVE_H_
/*!
* \file objective.h
* \brief interface of objective function used for gradient boosting
* \author Tianqi Chen, Kailong Chen
*/
#include "dmatrix.h" namespace xgboost {
namespace learner {
/*! \brief interface of objective function */
class IObjFunction{ /// base class for all objective functions
public:
/*! \brief virtual destructor */
virtual ~IObjFunction(void){} /// virtual destructor, so derived objects are released correctly
/*!
* \brief set parameters from outside
* \param name name of the parameter
* \param val value of the parameter
*/
virtual void SetParam(const char *name, const char *val) = 0; /// parameter name and value
/*!
* \brief get gradient over each of predictions, given existing information
* \param preds prediction of current round
* \param info information about labels, weights, groups in rank
* \param iter current iteration number
* \param out_gpair output of get gradient, saves gradient and second order gradient in
*/
virtual void GetGradient(const std::vector<float> &preds,
const MetaInfo &info,
int iter,
std::vector<bst_gpair> *out_gpair) = 0; /// compute the gradients
/*! \return the default evaluation metric for the objective */
virtual const char* DefaultEvalMetric(void) const = 0; /// default evaluation metric
// the following functions are optional, most of time default implementation is good enough
/*!
* \brief transform prediction values, this is only called when Prediction is called
* \param io_preds prediction values, saves to this vector as well
*/
virtual void PredTransform(std::vector<float> *io_preds){}
/*!
* \brief transform prediction values, this is only called when Eval is called,
* usually it redirect to PredTransform
* \param io_preds prediction values, saves to this vector as well
*/
virtual void EvalTransform(std::vector<float> *io_preds) {
this->PredTransform(io_preds);
}
/*!
* \brief transform probability value back to margin
* this is used to transform user-set base_score back to margin
* used by gradient boosting
* \return transformed value
*/
virtual float ProbToMargin(float base_score) const {
return base_score;
}
};
} // namespace learner
} // namespace xgboost

// these are implementations of objective functions /// the -inl.hpp file contains the objective function implementations
#include "objective-inl.hpp"
// factory function
namespace xgboost {
namespace learner {
/*! \brief factory function to create objective function by name */
inline IObjFunction* CreateObjFunction(const char *name) { /// pick which implemented objective to construct based on the name passed in
using namespace std;
/// implemented by RegLossObj; different loss types are passed to its constructor
if (!strcmp("reg:linear", name)) return new RegLossObj(LossType::kLinearSquare);
if (!strcmp("reg:logistic", name)) return new RegLossObj(LossType::kLogisticNeglik);
if (!strcmp("binary:logistic", name)) return new RegLossObj(LossType::kLogisticClassify);
if (!strcmp("binary:logitraw", name)) return new RegLossObj(LossType::kLogisticRaw); /// PoissonRegression类实现
if (!strcmp("count:poisson", name)) return new PoissonRegression(); /// SoftmaxMultiClassObj 类实现
if (!strcmp("multi:softmax", name)) return new SoftmaxMultiClassObj(0);
if (!strcmp("multi:softprob", name)) return new SoftmaxMultiClassObj(1); /// 分别由LambdaRankObj LambdaRankObjNDCG LambdaRankObjMAP 实现
if (!strcmp("rank:pairwise", name )) return new PairwiseRankObj();
if (!strcmp("rank:ndcg", name)) return new LambdaRankObjNDCG();
if (!strcmp("rank:map", name)) return new LambdaRankObjMAP();
utils::Error("unknown objective function type: %s", name);
return NULL;
}
} // namespace learner
} // namespace xgboost
#endif // XGBOOST_LEARNER_OBJECTIVE_H_

/// .h files define the data structures and interfaces; the -inl.hpp files implement them
/*
/// eight objective names; each selects a different objective implementation
"reg:linear" – linear regression.
"reg:logistic" – logistic regression.
"binary:logistic" – logistic regression for binary classification; outputs a probability.
"binary:logitraw" – logistic regression for binary classification; outputs the raw score w^T x (before the sigmoid).
"count:poisson" – Poisson regression for count data; outputs the mean of the Poisson distribution. For Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization).
"multi:softmax" – multi-class classification with the softmax objective; requires setting num_class (the number of classes).
"multi:softprob" – same as softmax, but outputs a vector of ndata * nclass values, which can be reshaped into an ndata-by-nclass matrix; each row gives the probability of the sample belonging to each class.
"rank:pairwise" – set XGBoost to do ranking by minimizing the pairwise loss (e.g. for AUC-style metrics).
*/
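As a small usage sketch (mine, not from the source): the learner resolves the objective string through this factory and then talks to it only through the IObjFunction interface.

```cpp
// hypothetical direct use of the factory, mirroring what the learner does internally
IObjFunction *obj = CreateObjFunction("binary:logistic");  // -> RegLossObj(kLogisticClassify)
obj->SetParam("scale_pos_weight", "2.0");                  // parameters arrive as name/value strings
// each round: obj->GetGradient(preds, info, iter, &gpair);
// at prediction time: obj->PredTransform(&preds);         // applies the sigmoid for this objective
delete obj;
```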
The objective-inl.hpp file:
#ifndef XGBOOST_LEARNER_OBJECTIVE_INL_HPP_
#define XGBOOST_LEARNER_OBJECTIVE_INL_HPP_
/*!
* \file objective-inl.hpp
* \brief objective function implementations
* \author Tianqi Chen, Kailong Chen
*/
/// On the derivation of the objective functions, see: https://www.cnblogs.com/harvey888/p/7203256.html
/// Algorithm principles: http://wepon.me/files/gbdt.pdf
/// Derivation of the objective function: https://blog.csdn.net/yuxeaotao/article/details/90378782
/// https://blog.csdn.net/a819825294/article/details/51206410
/// Source code walkthrough: https://blog.csdn.net/matrix_zzl/article/details/78699605
/// Main functions in the source: https://blog.csdn.net/weixin_39750084/article/details/83244191

#include <vector>
#include <algorithm>
#include <utility>
#include <cmath>
#include <functional>
#include "../data.h"
#include "./objective.h"
#include "./helper_utils.h"
#include "../utils/random.h"
#include "../utils/omp.h" namespace xgboost {
namespace learner { /// implements some commonly used calculations, defined inline
/*! \brief defines functions to calculate some commonly used functions */
struct LossType {
/*! \brief indicate which type we are using */
int loss_type;
// list of constants
static const int kLinearSquare = 0; /// linear regression (squared error)
static const int kLogisticNeglik = 1; /// logistic regression, outputs probabilities
static const int kLogisticClassify = 2; /// binary classification, outputs probabilities
static const int kLogisticRaw = 3; /// outputs the raw score; applying the sigmoid gives the same probability as the two above
/*!
* \brief transform the linear sum to prediction
* \param x linear sum of boosting ensemble
* \return transformed prediction
*/
inline float PredTransform(float x) const { /// types 0 and 3 produce the same output (raw value); types 1 and 2 produce the same output (sigmoid)
switch (loss_type) {
case kLogisticRaw:
case kLinearSquare: return x;
case kLogisticClassify:
case kLogisticNeglik: return 1.0f / (1.0f + std::exp(-x));
default: utils::Error("unknown loss_type"); return 0.0f;
}
}
/*!
* \brief check if label range is valid
*/
inline bool CheckLabel(float x) const { /// check whether the label is in a valid range
if (loss_type != kLinearSquare) {
return x >= 0.0f && x <= 1.0f;
}
return true;
}
/*!
* \brief error message displayed when check label fail
*/
inline const char * CheckLabelErrorMsg(void) const {
if (loss_type != kLinearSquare) {
return "label must be in [0,1] for logistic regression";
} else {
return "";
}
}
/*!
* \brief calculate first order gradient of loss, given transformed prediction
* \param predt transformed prediction
* \param label true label
* \return first order gradient
*/
inline float FirstOrderGradient(float predt, float label) const { /// first-order gradient of each loss; note kLogisticClassify and kLogisticNeglik return the same value
switch (loss_type) {
case kLinearSquare: return predt - label;
case kLogisticRaw: predt = 1.0f / (1.0f + std::exp(-predt));
case kLogisticClassify:
case kLogisticNeglik: return predt - label;
default: utils::Error("unknown loss_type"); return 0.0f;
}
}
/*!
* \brief calculate second order gradient of loss, given transformed prediction
* \param predt transformed prediction
* \param label true label
* \return second order gradient
*/
inline float SecondOrderGradient(float predt, float label) const { /// compute the second-order gradient
// cap second order gradient to positive value
const float eps = 1e-16f;
switch (loss_type) {
case kLinearSquare: return 1.0f;
case kLogisticRaw: predt = 1.0f / (1.0f + std::exp(-predt));
case kLogisticClassify:
case kLogisticNeglik: return std::max(predt * (1.0f - predt), eps); /// lower-bound the hessian at eps
default: utils::Error("unknown loss_type"); return 0.0f;
}
}
/*!
* \brief transform probability value back to margin
*/
inline float ProbToMargin(float base_score) const { /// transform a probability back to the margin scale
if (loss_type == kLogisticRaw ||
loss_type == kLogisticClassify ||
loss_type == kLogisticNeglik ) {
utils::Check(base_score > 0.0f && base_score < 1.0f,
"base_score must be in (0,1) for logistic loss");
base_score = -std::log(1.0f / base_score - 1.0f);
}
return base_score;
}
/*! \brief get default evaluation metric for the objective */
inline const char *DefaultEvalMetric(void) const { /// default evaluation metric
if (loss_type == kLogisticClassify) return "error";
if (loss_type == kLogisticRaw) return "auc";
return "rmse";
}
};

/*! \brief objective function that only need to */ /// regression / logistic-regression losses
class RegLossObj : public IObjFunction { /// IObjFunction comes from objective.h; note the explicit keyword on the constructor below, which prevents implicit conversions
public:
explicit RegLossObj(int loss_type) { /// in principle all single-argument constructors should be marked explicit; this greatly reduces accidental implicit conversions
loss.loss_type = loss_type;
scale_pos_weight = 1.0f;
}
virtual ~RegLossObj(void) {} /// virtual destructor (prevents leaks when derived objects are destroyed through a base pointer)
virtual void SetParam(const char *name, const char *val) { /// virtual function, overridden for polymorphism
using namespace std;
if (!strcmp("scale_pos_weight", name)) {
scale_pos_weight = static_cast<float>(atof(val));
}
}
virtual void GetGradient(const std::vector<float> &preds,
const MetaInfo &info,
int iter,
std::vector<bst_gpair> *out_gpair) {
utils::Check(info.labels.size() != 0, "label set cannot be empty");
utils::Check(preds.size() % info.labels.size() == 0,
"labels are not correctly provided");
std::vector<bst_gpair> &gpair = *out_gpair;
gpair.resize(preds.size());
// check if label in range
bool label_correct = true;
// start calculating gradient
const unsigned nstep = static_cast<unsigned>(info.labels.size());
const bst_omp_uint ndata = static_cast<bst_omp_uint>(preds.size());
#pragma omp parallel for schedule(static) /// the loop below runs in parallel with OpenMP, static scheduling
for (bst_omp_uint i = 0; i < ndata; ++i) {
const unsigned j = i % nstep;
float p = loss.PredTransform(preds[i]);
float w = info.GetWeight(j);
if (info.labels[j] == 1.0f) w *= scale_pos_weight;
if (!loss.CheckLabel(info.labels[j])) label_correct = false;
gpair[i] = bst_gpair(loss.FirstOrderGradient(p, info.labels[j]) * w,
loss.SecondOrderGradient(p, info.labels[j]) * w);
}
utils::Check(label_correct, loss.CheckLabelErrorMsg());
}
virtual const char* DefaultEvalMetric(void) const {
return loss.DefaultEvalMetric();
}
virtual void PredTransform(std::vector<float> *io_preds) {
std::vector<float> &preds = *io_preds;
const bst_omp_uint ndata = static_cast<bst_omp_uint>(preds.size());
#pragma omp parallel for schedule(static)
for (bst_omp_uint j = 0; j < ndata; ++j) {
preds[j] = loss.PredTransform(preds[j]);
}
}
virtual float ProbToMargin(float base_score) const {
return loss.ProbToMargin(base_score);
}
/// protected members can be accessed by this class, its derived classes and their friends, but not from outside through objects of the class
protected:
float scale_pos_weight;
LossType loss;
};

// poisson regression for count /// Poisson regression
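Before the class body: PoissonRegression works on the log scale, i.e. the raw prediction p is the log of the Poisson mean. Dropping constants, the negative log-likelihood and its derivatives are as below (a standard recap, not quoted from the source); the max_delta_step term is the safeguard used in the code.

```latex
% Poisson negative log-likelihood with log link, \lambda = e^{p}
L = e^{p} - y\,p, \qquad g = \frac{\partial L}{\partial p} = e^{p} - y, \qquad h = \frac{\partial^2 L}{\partial p^2} = e^{p}
% the implementation inflates the hessian to e^{p + \text{max\_delta\_step}} to keep the update step bounded
```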
class PoissonRegression : public IObjFunction {
public:
explicit PoissonRegression(void) {
max_delta_step = 0.0f;
}
virtual ~PoissonRegression(void) {}
virtual void SetParam(const char *name, const char *val) {
using namespace std;
if (!strcmp( "max_delta_step", name )) {
max_delta_step = static_cast<float>(atof(val));
}
}
virtual void GetGradient(const std::vector<float> &preds,
const MetaInfo &info,
int iter,
std::vector<bst_gpair> *out_gpair) {
utils::Check(max_delta_step != 0.0f,
"PoissonRegression: need to set max_delta_step");
utils::Check(info.labels.size() != 0, "label set cannot be empty");
utils::Check(preds.size() == info.labels.size(),
"labels are not correctly provided");
std::vector<bst_gpair> &gpair = *out_gpair;
gpair.resize(preds.size());
// check if label in range
bool label_correct = true;
// start calculating gradient
const long ndata = static_cast<bst_omp_uint>(preds.size());
#pragma omp parallel for schedule(static)
for (long i = 0; i < ndata; ++i) {
float p = preds[i];
float w = info.GetWeight(i);
float y = info.labels[i];
if (y >= 0.0f) {
gpair[i] = bst_gpair((std::exp(p) - y) * w,
std::exp(p + max_delta_step) * w);
} else {
label_correct = false;
}
}
utils::Check(label_correct,
"PoissonRegression: label must be nonnegative");
}
virtual void PredTransform(std::vector<float> *io_preds) {
std::vector<float> &preds = *io_preds;
const long ndata = static_cast<long>(preds.size());
#pragma omp parallel for schedule(static)
for (long j = 0; j < ndata; ++j) {
preds[j] = std::exp(preds[j]);
}
}
virtual void EvalTransform(std::vector<float> *io_preds) {
PredTransform(io_preds);
}
virtual float ProbToMargin(float base_score) const {
return std::log(base_score);
}
virtual const char* DefaultEvalMetric(void) const {
return "poisson-nloglik";
}

 private: /// private members can only be accessed by this class's member functions and friends, not from anywhere else, not even through objects of the class
float max_delta_step;
};

// softmax multi-class classification /// multi-class objective
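Before the class: for the softmax objective the per-class gradients below follow from the cross-entropy loss (standard recap, not quoted from the source). With softmax probabilities p_k and true class y:

```latex
% softmax cross entropy: p_k = \frac{e^{x_k}}{\sum_j e^{x_j}}, \qquad L = -\log p_y
g_k = p_k - \mathbf{1}\{k = y\}, \qquad h_k \approx p_k (1 - p_k)
% the code below uses h_k = 2\,p_k(1 - p_k), i.e. the diagonal hessian scaled by a factor of 2
```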
class SoftmaxMultiClassObj : public IObjFunction {
public:
explicit SoftmaxMultiClassObj(int output_prob)
: output_prob(output_prob) {
nclass = 0;
}
virtual ~SoftmaxMultiClassObj(void) {}
virtual void SetParam(const char *name, const char *val) {
using namespace std;
if (!strcmp( "num_class", name )) nclass = atoi(val);
}
virtual void GetGradient(const std::vector<float> &preds,
const MetaInfo &info,
int iter,
std::vector<bst_gpair> *out_gpair) {
utils::Check(nclass != 0, "must set num_class to use softmax");
utils::Check(info.labels.size() != 0, "label set cannot be empty");
utils::Check(preds.size() % (static_cast<size_t>(nclass) * info.labels.size()) == 0,
"SoftmaxMultiClassObj: label size and pred size does not match");
std::vector<bst_gpair> &gpair = *out_gpair;
gpair.resize(preds.size());
const unsigned nstep = static_cast<unsigned>(info.labels.size() * nclass);
const bst_omp_uint ndata = static_cast<bst_omp_uint>(preds.size() / nclass);
int label_error = 0;
#pragma omp parallel
{
std::vector<float> rec(nclass);
#pragma omp for schedule(static)
for (bst_omp_uint i = 0; i < ndata; ++i) {
for (int k = 0; k < nclass; ++k) {
rec[k] = preds[i * nclass + k];
}
Softmax(&rec);
const unsigned j = i % nstep;
int label = static_cast<int>(info.labels[j]);
if (label < 0 || label >= nclass) {
label_error = label; label = 0;
}
const float wt = info.GetWeight(j);
for (int k = 0; k < nclass; ++k) {
float p = rec[k];
const float h = 2.0f * p * (1.0f - p) * wt;
if (label == k) {
gpair[i * nclass + k] = bst_gpair((p - 1.0f) * wt, h);
} else {
gpair[i * nclass + k] = bst_gpair(p* wt, h);
}
}
}
}
utils::Check(label_error >= 0 && label_error < nclass,
"SoftmaxMultiClassObj: label must be in [0, num_class),"\
" num_class=%d but found %d in label", nclass, label_error);
}
virtual void PredTransform(std::vector<float> *io_preds) {
this->Transform(io_preds, output_prob);
}
virtual void EvalTransform(std::vector<float> *io_preds) {
this->Transform(io_preds, 1);
}
virtual const char* DefaultEvalMetric(void) const {
return "merror";
}

 private:
inline void Transform(std::vector<float> *io_preds, int prob) {
utils::Check(nclass != 0, "must set num_class to use softmax");
std::vector<float> &preds = *io_preds;
std::vector<float> tmp;
const bst_omp_uint ndata = static_cast<bst_omp_uint>(preds.size()/nclass);
if (prob == 0) tmp.resize(ndata);
#pragma omp parallel
{
std::vector<float> rec(nclass);
#pragma omp for schedule(static)
for (bst_omp_uint j = 0; j < ndata; ++j) {
for (int k = 0; k < nclass; ++k) {
rec[k] = preds[j * nclass + k];
}
if (prob == 0) {
tmp[j] = static_cast<float>(FindMaxIndex(rec));
} else {
Softmax(&rec);
for (int k = 0; k < nclass; ++k) {
preds[j * nclass + k] = rec[k];
}
}
}
}
if (prob == 0) preds = tmp;
}
// data field
int nclass;
int output_prob;
};

/*! \brief objective for lambda rank */ /// LambdaRankObj: base class for the ranking objectives
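Before the class: all three ranking objectives reduce each sampled pair to a logistic loss on the score difference. This is a recap of how the code below combines PredTransform / FirstOrderGradient with loss_type = kLogisticRaw; the pair weight w is 1 for rank:pairwise and |ΔNDCG| or |ΔMAP| for the subclasses.

```latex
% for a pair (i, j) where document i has the higher label, with scores s_i, s_j
q = \sigma(s_i - s_j), \qquad g = q - 1, \qquad h = q\,(1 - q)
% the code adds  +g\,w  to the gradient of i,  -g\,w  to the gradient of j, and  2\,w\,h  to both hessians
```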
class LambdaRankObj : public IObjFunction {
public:
LambdaRankObj(void) {
loss.loss_type = LossType::kLogisticRaw;
fix_list_weight = 0.0f;
num_pairsample = 1;
}
virtual ~LambdaRankObj(void) {}
virtual void SetParam(const char *name, const char *val) {
using namespace std;
if (!strcmp( "loss_type", name )) loss.loss_type = atoi(val);
if (!strcmp( "fix_list_weight", name)) fix_list_weight = static_cast<float>(atof(val));
if (!strcmp( "num_pairsample", name)) num_pairsample = atoi(val);
}
virtual void GetGradient(const std::vector<float> &preds,
const MetaInfo &info,
int iter,
std::vector<bst_gpair> *out_gpair) {
utils::Check(preds.size() == info.labels.size(), "label size predict size not match");
std::vector<bst_gpair> &gpair = *out_gpair;
gpair.resize(preds.size());
// quick consistency when group is not available
std::vector<unsigned> tgptr(2, 0); tgptr[1] = static_cast<unsigned>(info.labels.size());
const std::vector<unsigned> &gptr = info.group_ptr.size() == 0 ? tgptr : info.group_ptr;
utils::Check(gptr.size() != 0 && gptr.back() == info.labels.size(),
"group structure not consistent with #rows");
const bst_omp_uint ngroup = static_cast<bst_omp_uint>(gptr.size() - 1);
#pragma omp parallel
{
// parallel construction: declare the random number generator here so that each
// thread uses its own random number generator, seeded by thread id and current iteration
random::Random rnd; rnd.Seed(iter* 1111 + omp_get_thread_num());
std::vector<LambdaPair> pairs;
std::vector<ListEntry> lst;
std::vector< std::pair<float, unsigned> > rec;
#pragma omp for schedule(static)
for (bst_omp_uint k = 0; k < ngroup; ++k) {
lst.clear(); pairs.clear();
for (unsigned j = gptr[k]; j < gptr[k+1]; ++j) {
lst.push_back(ListEntry(preds[j], info.labels[j], j));
gpair[j] = bst_gpair(0.0f, 0.0f);
}
std::sort(lst.begin(), lst.end(), ListEntry::CmpPred);
rec.resize(lst.size());
for (unsigned i = 0; i < lst.size(); ++i) {
rec[i] = std::make_pair(lst[i].label, i);
}
std::sort(rec.begin(), rec.end(), CmpFirst);
// enumerate buckets with same label, for each item in the lst, grab another sample randomly
for (unsigned i = 0; i < rec.size(); ) {
unsigned j = i + 1;
while (j < rec.size() && rec[j].first == rec[i].first) ++j;
// bucket in [i,j), get a sample outside bucket
unsigned nleft = i, nright = static_cast<unsigned>(rec.size() - j);
if (nleft + nright != 0) {
int nsample = num_pairsample;
while (nsample --) {
for (unsigned pid = i; pid < j; ++pid) {
unsigned ridx = static_cast<unsigned>(rnd.RandDouble() * (nleft+nright));
if (ridx < nleft) {
pairs.push_back(LambdaPair(rec[ridx].second, rec[pid].second));
} else {
pairs.push_back(LambdaPair(rec[pid].second, rec[ridx+j-i].second));
}
}
}
}
i = j;
}
// get lambda weight for the pairs
this->GetLambdaWeight(lst, &pairs);
// rescale each gradient and hessian so that the list has a constant total weight
float scale = 1.0f / num_pairsample;
if (fix_list_weight != 0.0f) {
scale *= fix_list_weight / (gptr[k+1] - gptr[k]);
}
for (size_t i = 0; i < pairs.size(); ++i) {
const ListEntry &pos = lst[pairs[i].pos_index];
const ListEntry &neg = lst[pairs[i].neg_index];
const float w = pairs[i].weight * scale;
float p = loss.PredTransform(pos.pred - neg.pred);
float g = loss.FirstOrderGradient(p, 1.0f);
float h = loss.SecondOrderGradient(p, 1.0f);
// accumulate gradient and hessian in both pid, and nid
gpair[pos.rindex].grad += g * w;
gpair[pos.rindex].hess += 2.0f * w * h;
gpair[neg.rindex].grad -= g * w;
gpair[neg.rindex].hess += 2.0f * w * h;
}
}
}
}
virtual const char* DefaultEvalMetric(void) const {
return "map";
}

 protected:
/*! \brief helper information in a list */
struct ListEntry {
/*! \brief the predict score we have in the data */
float pred;
/*! \brief the actual label of the entry */
float label;
/*! \brief row index in the data matrix */
unsigned rindex;
// constructor
ListEntry(float pred, float label, unsigned rindex)
: pred(pred), label(label), rindex(rindex) {}
// comparator by prediction
inline static bool CmpPred(const ListEntry &a, const ListEntry &b) {
return a.pred > b.pred;
}
// comparator by label
inline static bool CmpLabel(const ListEntry &a, const ListEntry &b) {
return a.label > b.label;
}
};
/*! \brief a pair in the lambda rank */
struct LambdaPair {
/*! \brief positive index: this is a position in the list */
unsigned pos_index;
/*! \brief negative index: this is a position in the list */
unsigned neg_index;
/*! \brief weight to be filled in */
float weight;
// constructor
LambdaPair(unsigned pos_index, unsigned neg_index)
: pos_index(pos_index), neg_index(neg_index), weight(1.0f) {}
};
/*!
* \brief get lambda weight for existing pairs
* \param list a list that is sorted by pred score
* \param io_pairs record of pairs, containing the pairs to fill in weights
*/
virtual void GetLambdaWeight(const std::vector<ListEntry> &sorted_list,
std::vector<LambdaPair> *io_pairs) = 0;

 private:
// loss function
LossType loss;
// number of pair samples performed for each instance
int num_pairsample;
// fix weight of each elements in list
float fix_list_weight;
};

class PairwiseRankObj : public LambdaRankObj {
public:
virtual ~PairwiseRankObj(void) {}

 protected:
virtual void GetLambdaWeight(const std::vector<ListEntry> &sorted_list,
std::vector<LambdaPair> *io_pairs) {}
};

// beta version: NDCG lambda rank
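Before the class: the pair weight computed in GetLambdaWeight below is the absolute change in NDCG if the two documents of a pair swapped positions in the predicted order. Written out for reference (it matches the original/changed computation in the code, which uses the natural log):

```latex
% DCG of a predicted order with labels l_i at (0-based) positions i
\mathrm{DCG} = \sum_i \frac{2^{l_i} - 1}{\log(i + 2)}, \qquad \mathrm{IDCG} = \text{DCG of the label-sorted order}
% lambda weight for a pair at positions r_{+}, r_{-} with labels l_{+}, l_{-}
|\Delta \mathrm{NDCG}| = \frac{1}{\mathrm{IDCG}} \left| \left(2^{l_{+}} - 2^{l_{-}}\right) \left( \frac{1}{\log(r_{+} + 2)} - \frac{1}{\log(r_{-} + 2)} \right) \right|
```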
class LambdaRankObjNDCG : public LambdaRankObj {
public:
virtual ~LambdaRankObjNDCG(void) {}

 protected:
virtual void GetLambdaWeight(const std::vector<ListEntry> &sorted_list,
std::vector<LambdaPair> *io_pairs) {
std::vector<LambdaPair> &pairs = *io_pairs;
float IDCG;
{
std::vector<float> labels(sorted_list.size());
for (size_t i = 0; i < sorted_list.size(); ++i) {
labels[i] = sorted_list[i].label;
}
std::sort(labels.begin(), labels.end(), std::greater<float>());
IDCG = CalcDCG(labels);
}
if (IDCG == 0.0) {
for (size_t i = 0; i < pairs.size(); ++i) {
pairs[i].weight = 0.0f;
}
} else {
IDCG = 1.0f / IDCG;
for (size_t i = 0; i < pairs.size(); ++i) {
unsigned pos_idx = pairs[i].pos_index;
unsigned neg_idx = pairs[i].neg_index;
float pos_loginv = 1.0f / std::log(pos_idx + 2.0f);
float neg_loginv = 1.0f / std::log(neg_idx + 2.0f);
int pos_label = static_cast<int>(sorted_list[pos_idx].label);
int neg_label = static_cast<int>(sorted_list[neg_idx].label);
float original =
((1 << pos_label) - 1) * pos_loginv + ((1 << neg_label) - 1) * neg_loginv;
float changed =
((1 << neg_label) - 1) * pos_loginv + ((1 << pos_label) - 1) * neg_loginv;
float delta = (original - changed) * IDCG;
if (delta < 0.0f) delta = - delta;
pairs[i].weight = delta;
}
}
}
inline static float CalcDCG(const std::vector<float> &labels) {
double sumdcg = 0.0;
for (size_t i = 0; i < labels.size(); ++i) {
const unsigned rel = static_cast<unsigned>(labels[i]);
if (rel != 0) {
sumdcg += ((1 << rel) - 1) / std::log(static_cast<float>(i + 2));
}
}
return static_cast<float>(sumdcg);
}
};

// map LambdaRank
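Before the class: here the pair weight is the absolute change in average precision (AP) caused by swapping the two documents of a pair; the MAPStats prefix sums below exist so that this delta can be read off in O(1) per pair. A recap in my notation, not quoted from the source:

```latex
% average precision of a ranked list with binary relevance, H = number of positive documents
\mathrm{AP} = \frac{1}{H} \sum_{k\,:\,y_k = 1} \mathrm{Prec@}k, \qquad
\text{pair weight} = \left| \mathrm{AP}(\text{after swap}) - \mathrm{AP}(\text{before swap}) \right|
```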
class LambdaRankObjMAP : public LambdaRankObj {
public:
virtual ~LambdaRankObjMAP(void) {}

 protected:
struct MAPStats {
/*! \brief the accumulated precision */
float ap_acc;
/*!
* \brief the accumulated precision,
* assuming a positive instance is missing
*/
float ap_acc_miss;
/*!
* \brief the accumulated precision,
* assuming that one more positive instance is inserted ahead
*/
float ap_acc_add;
/* \brief the accumulated positive instance count */
float hits;
MAPStats(void) {}
MAPStats(float ap_acc, float ap_acc_miss, float ap_acc_add, float hits)
: ap_acc(ap_acc), ap_acc_miss(ap_acc_miss), ap_acc_add(ap_acc_add), hits(hits) {}
};
/*!
* \brief Obtain the delta MAP if trying to switch the positions of instances in index1 or index2
* in sorted triples
* \param sorted_list the list containing entry information
* \param index1,index2 the instances switched
* \param map_stats a vector containing the accumulated precisions for each position in a list
*/
inline float GetLambdaMAP(const std::vector<ListEntry> &sorted_list,
int index1, int index2,
std::vector<MAPStats> *p_map_stats) {
std::vector<MAPStats> &map_stats = *p_map_stats;
if (index1 == index2 || map_stats[map_stats.size() - 1].hits == 0) {
return 0.0f;
}
if (index1 > index2) std::swap(index1, index2);
float original = map_stats[index2].ap_acc;
if (index1 != 0) original -= map_stats[index1 - 1].ap_acc;
float changed = 0;
float label1 = sorted_list[index1].label > 0.0f ? 1.0f : 0.0f;
float label2 = sorted_list[index2].label > 0.0f ? 1.0f : 0.0f;
if (label1 == label2) {
return 0.0;
} else if (label1 < label2) {
changed += map_stats[index2 - 1].ap_acc_add - map_stats[index1].ap_acc_add;
changed += (map_stats[index1].hits + 1.0f) / (index1 + 1);
} else {
changed += map_stats[index2 - 1].ap_acc_miss - map_stats[index1].ap_acc_miss;
changed += map_stats[index2].hits / (index2 + 1);
}
float ans = (changed - original) / (map_stats[map_stats.size() - 1].hits);
if (ans < 0) ans = -ans;
return ans;
}
/*
* \brief obtain preprocessing results for calculating delta MAP
* \param sorted_list the list containing entry information
* \param map_stats a vector containing the accumulated precisions for each position in a list
*/
inline void GetMAPStats(const std::vector<ListEntry> &sorted_list,
std::vector<MAPStats> *p_map_acc) {
std::vector<MAPStats> &map_acc = *p_map_acc;
map_acc.resize(sorted_list.size());
float hit = 0, acc1 = 0, acc2 = 0, acc3 = 0;
for (size_t i = 1; i <= sorted_list.size(); ++i) {
if (sorted_list[i - 1].label > 0.0f) {
hit++;
acc1 += hit / i;
acc2 += (hit - 1) / i;
acc3 += (hit + 1) / i;
}
map_acc[i - 1] = MAPStats(acc1, acc2, acc3, hit);
}
}
virtual void GetLambdaWeight(const std::vector<ListEntry> &sorted_list,
std::vector<LambdaPair> *io_pairs) {
std::vector<LambdaPair> &pairs = *io_pairs;
std::vector<MAPStats> map_stats;
GetMAPStats(sorted_list, &map_stats);
for (size_t i = 0; i < pairs.size(); ++i) {
pairs[i].weight =
GetLambdaMAP(sorted_list, pairs[i].pos_index,
pairs[i].neg_index, &map_stats);
}
}
};
} // namespace learner
} // namespace xgboost
#endif // XGBOOST_LEARNER_OBJECTIVE_INL_HPP_