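GBDT Principles: A Worked Example, Part 2
At the start we set F(x), the predicted value of every sample, to 0 (some randomization could also be applied).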
Scores = { 0, 0, 0, 0, 0, 0, 0, 0}
We first compute the gradient values for this initial state.
GetGradientInOneQuery = [this](int query, const Fvec& scores)
{
// Simplified version; differs slightly from the actual code
_gradient[query] = ((2.0 * label) * sigmoidParam) / (1.0 + std::exp(((2.0 * label) * sigmoidParam) * scores[query]));
};
Consider sample 0: its label is 1, the learningRate (which is used as sigmoidParam) is set to 0.1, and scores[query] = 0 because all Scores are currently 0.
2 * 1 * 0.1 / (1 + exp(2 * 1 * 0.1 * 0)) = 0.1
Consider sample 7, whose label is -1:
2 * -1 * 0.1 / (1 + exp(2 * -1 * 0.1 * 0)) = -0.1
So the gradient values computed at this point are
Gradient = {
0.1, 0.1, 0.1, 0.1, -0.1, -0.1, -0.1, -0.1}
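The hand calculation above can be reproduced with a short Python sketch. It assumes the simplified gradient formula from the snippet, labels of +1 for samples 0-3 and -1 for samples 4-7 (as implied by the signs of the gradient), and a sigmoid parameter of 0.1. The function name gradient is only illustrative and is not part of the original code.

```python
import math

def gradient(label, score, sigmoid_param=0.1):
    """Simplified per-sample gradient, mirroring GetGradientInOneQuery."""
    k = 2.0 * label * sigmoid_param
    return k / (1.0 + math.exp(k * score))

# Assumed labels: samples 0-3 are positive, samples 4-7 are negative.
labels = [1, 1, 1, 1, -1, -1, -1, -1]
scores = [0.0] * 8  # F(x) starts at 0 for every sample

gradients = [gradient(y, s) for y, s in zip(labels, scores)]
print(gradients)
```

With all scores at 0 the exponential term is 1, so every gradient is simply label * sigmoid_param.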
So the targets that the current tree's output F(x) has to fit are exactly this Gradient
Targets = {
0.1, 0.1, 0.1, 0.1, -0.1, -0.1, -0.1, -0.1}
RegressionTree tree = TreeLearner->FitTargets(activeFeatures, AdjustTargetsAndSetWeights());
virtual RegressionTree FitTargets(BitArray& activeFeatures, Fvec& targets) override
Now we look at how this gradient is fitted.
gdb ./test_fastrank_train
(gdb) r -in dating.txt -cl gbdt -ntree 2 -nl 3 -lr 0.1 -mil 1 -c train -vl 1 -mjson=1
p Partitioning
$3 = {_documents = std::vector of length 8, capacity 8 = {0, 1, 2, 3, 4, 5, 6, 7}, _initialDocuments = std::vector of length 0, capacity 0, _leafBegin = std::vector of length 3, capacity 3 = {0, 0,
0}, _leafCount = std::vector of length 3, capacity 3 = {8, 0, 0}, _tempDocuments = std::vector of length 0, capacity 0}
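GBDT discretizes every feature into bins, for example 255 bins. The data set here is tiny; for the height feature the raw values are
20, 60, 3, 66, 30, 20, 15, 10
and binning turns them into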
BinMedians = std::vector of length 7, capacity 7 = {3, 10, 15, 20, 30, 60, 66}
p *Feature
$11 = {_vptr.Feature = 0xce8650 <vtable for gezi::Feature+16>, Name = "hight",
BinUpperBounds = std::vector of length 7, capacity 7 = {6.5, 12.5, 17.5, 25, 45, 63, 1.7976931348623157e+308},
BinMedians = std::vector of length 7, capacity 7 = {3, 10, 15, 20, 30, 60, 66},
Bins = {_vptr.TVector = 0xce8670 <vtable for gezi::TVector<int>+16>, indices = std::vector of length 0, capacity 0,
values = std::vector of length 8, capacity 8 = {3, 5, 0, 6, 4, 3, 2, 1}, sparsityRatio = 0.29999999999999999, keepDense = false, keepSparse = false, normalized = false, numNonZeros = 7,
length = 8, _zeroValue = 0}, Trust = 1}
Bins holds the binning result. For example, sample _0 has hight 20, so it falls into bin number 3 (0-based index).
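The bin assignment in the dump can be reproduced with a small Python sketch. It assumes that each upper bound is the midpoint between two neighbouring distinct values, which matches the numbers printed above but is not confirmed as the general rule used by the library. It also uses infinity for the last bound where the dump shows the largest double.

```python
import bisect

heights = [20, 60, 3, 66, 30, 20, 15, 10]

# With so few samples, every distinct value becomes its own bin median.
bin_medians = sorted(set(heights))

# Assumption: an upper bound is the midpoint between neighbouring medians,
# and the last bin is open-ended.
bin_upper_bounds = [(a + b) / 2 for a, b in zip(bin_medians, bin_medians[1:])]
bin_upper_bounds.append(float("inf"))

# A value goes into the first bin whose upper bound is >= the value.
bins = [bisect.bisect_left(bin_upper_bounds, h) for h in heights]

print(bin_medians)
print(bin_upper_bounds)
print(bins)
```

The resulting bins list should agree with the values vector inside Bins in the gdb output.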
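Consider splitting the root node. Before the split all 8 samples sit in a single node; we pick the best feature together with the best split point for that feature.
For the hight feature we have to scan every possible split point.
Here that means 6 different split points: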
for (int t = 0; t < (histogram.NumFeatureValues - 1); t += 1)
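Take the split point 6.5 as an example:
the left subtree gets 1 sample (_2)
and the right subtree gets 7. Using the gain formula (sum of left targets)^2 / (left count) + (sum of right targets)^2 / (right count) - CONSTANT,
the gain is 0.1^2/1 + (-0.1)^2/7 - CONSTANT = 0.01142857142857143 - CONSTANT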
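Split points 12.5, 17.5, ... are evaluated the same way, and the best one is selected.
The money and face features are then examined in the same manner,
and the best (feature, split point) combination is chosen.
Here the best combination is (hight, 45).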
The left side gets
_0,_2,_4,_5,_6, _7 -> 0.1 + 0.1 - 0.1 - 0.1 - 0.1 -0.1
The right side gets
_1,_3 -> 0.1 + 0.1
The gain is
(-0.2)^2 /6 + (0.2)^2 / 2 - CONSTANT = 0.026666666666666665 - CONSTANT
(gdb) p bestShiftedGain
$22 = 0.026666666666666675
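The scan over the six hight thresholds can be sketched in Python as follows. This is a simplified illustration: it works on the raw values instead of bin histograms, and it drops the CONSTANT term because that term is the same for every candidate and does not change which split wins. The function name shifted_gain is hypothetical.

```python
targets = [0.1, 0.1, 0.1, 0.1, -0.1, -0.1, -0.1, -0.1]
heights = [20, 60, 3, 66, 30, 20, 15, 10]
thresholds = [6.5, 12.5, 17.5, 25, 45, 63]  # BinUpperBounds without the last one

def shifted_gain(threshold):
    """Gain of splitting at `threshold`, without the CONSTANT term."""
    left = [t for t, h in zip(targets, heights) if h <= threshold]
    right = [t for t, h in zip(targets, heights) if h > threshold]
    return sum(left) ** 2 / len(left) + sum(right) ** 2 / len(right)

gains = {th: shifted_gain(th) for th in thresholds}
best_threshold = max(gains, key=gains.get)

for th, gain in gains.items():
    print(th, round(gain, 6))
print("best:", best_threshold)
```

For hight, the threshold 45 should come out on top with a gain of about 0.0267, in line with bestShiftedGain in the debugger.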
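The output of the > subtree should therefore be
0.2 / 2 = 0.1. The figure below shows an output of 1 because AdjustOutput is applied afterwards: at the very least we need F_m(x) = F_{m-1}(x) + learning_rate * (the current tree's prediction, i.e. the predicted negative gradient).
The yellow parts are the final output values of this tree.
Next, look at the two groups produced by the split
and pick the best (feature, split) combination -> (face, 57.5)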
(gdb) p tree
$26 = {<gezi::OnlineRegressionTree> = {NumLeaves = 3, _gainPValue = std::vector of length 2, capacity 2 = {0.15304198078836101, 0.27523360741160119},
_lteChild = std::vector of length 2, capacity 2 = {1, -1}, _gtChild = std::vector of length 2, capacity 2 = {-2, -3}, _leafValue = std::vector of length 3, capacity 3 = {-0.10000000000000002,
0.10000000000000002, 0.033333333333333347}, _threshold = std::vector of length 2, capacity 2 = {4, 2}, _splitFeature = std::vector of length 2, capacity 2 = {0, 2},
_splitGain = std::vector of length 2, capacity 2 = {0.026666666666666675, 0.026666666666666679}, _maxOutput = 0.10000000000000002, _previousLeafValue = std::vector of length 2, capacity 2 = {0,
-0.033333333333333333}, _weight = 1, _featureNames = 0x6e6a5a <gezi::FastRank::GetActiveFeatures(std::vector<bool, std::allocator<bool> >&)+34>},
_parent = std::vector of length 3, capacity 3 = {1, -1, -2}}
Now the Output is adjusted.
//GradientDecent.h
virtual RegressionTree& TrainingIteration(BitArray& activeFeatures) override
{
RegressionTree tree = TreeLearner->FitTargets(activeFeatures, AdjustTargetsAndSetWeights());
if (AdjustTreeOutputsOverride == nullptr)
{ // If the base class ObjectiveFunction has no virtual function, dynamic_pointer_cast cannot be used... @TODO
(dynamic_pointer_cast<IStepSearch>(ObjectiveFunction))->AdjustTreeOutputs(tree, TreeLearner->Partitioning, *TrainingScores);
}
{
UpdateAllScores(tree);
}
Ensemble.AddTree(tree);
return Ensemble.Tree();
}
virtual void AdjustTreeOutputs(RegressionTree& tree, DocumentPartitioning& partitioning, ScoreTracker& trainingScores) override
{
//AutoTimer timer("dynamic_pointer_cast<IStepSearch>(ObjectiveFunction))->AdjustTreeOutputs");
for (int l = 0; l < tree.NumLeaves; l++)
{
Float output = 0.0;
if (_bestStepRankingRegressionTrees)
{
output = _learningRate * tree.GetOutput(l);
}
else
{ // This is the branch taken in the current run
output = (_learningRate * (tree.GetOutput(l) + 1.4E-45)) / (partitioning.Mean(_weights, Dataset.SampleWeights, l, false) + 1.4E-45);
}
if (output > _maxTreeOutput)
{
output = _maxTreeOutput;
}
else if (output < -_maxTreeOutput)
{
output = -_maxTreeOutput;
}
tree.SetOutput(l, output);
}
}
(gdb) p _weights
$33 = std::vector of length 8, capacity 8 = {0.010000000000000002, 0.010000000000000002, 0.010000000000000002, 0.010000000000000002, 0.010000000000000002, 0.010000000000000002,
0.010000000000000002, 0.010000000000000002}
_learningRate * tree.GetOutput(1) / partitioning.Mean(_weights..) = 0.1 * 0.1 / 0.01 = 1
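The same adjustment can be applied to all three leaves with a small Python sketch. It takes the raw leaf values and the sample weight from the gdb dumps above and follows the else branch of AdjustTreeOutputs. The default of 100 for max_tree_output is an assumption made for illustration, since the value of _maxTreeOutput is not shown in the post.

```python
def adjust_output(raw_output, mean_weight, learning_rate=0.1, max_tree_output=100.0):
    """Rescale one leaf output and clamp it, as in the else branch above."""
    eps = 1.4e-45  # same tiny constant as in the C++ snippet
    output = learning_rate * (raw_output + eps) / (mean_weight + eps)
    # max_tree_output is an assumed value; the clamp has no effect here.
    return max(-max_tree_output, min(max_tree_output, output))

# Raw leaf values from the first `p tree` dump.
raw_leaf_values = [-0.10000000000000002, 0.10000000000000002, 0.033333333333333347]
# Every sample has the same weight, so the mean weight of any leaf is this value.
mean_weight = 0.010000000000000002

adjusted = [adjust_output(v, mean_weight) for v in raw_leaf_values]
print(adjusted)
```

The three leaves come out at roughly -1, 1 and 0.333, which are the values that later appear in Scores.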
(gdb) p tree
$35 = (gezi::RegressionTree &) @0x7fffffffd480: {<gezi::OnlineRegressionTree> = {
NumLeaves = 3, _gainPValue = std::vector of length 2, capacity 2 = {0.15304198078836101, 0.27523360741160119},
,
0.33333333333333343}, _threshold = std::vector of length 2, capacity 2 = {4, 2}, _splitFeature = std::vector of length 2, capacity 2 = {0, 2},
_splitGain = std::vector of length 2, capacity 2 = {0.026666666666666675, 0.026666666666666679}, _maxOutput = 0.10000000000000002, _previousLeafValue = std::vector of length 2, capacity 2 = {0, -0.033333333333333333}, _weight = 1, _featureNames = 0x6e6a5a <gezi::FastRank::GetActiveFeatures(std::vector<bool, std::allocator<bool> >&)+34>}, _parent = std::vector of length 3, capacity 3 = {1, -1, -2}}
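After that, UpdateAllScores(tree); updates the scores, here the score of each of the 8 samples; in other words it computes F(x). With several trees, the scores accumulate the outputs of all trees.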
virtual void AddScores(RegressionTree& tree, DocumentPartitioning& partitioning, Float multiplier = 1)
{
for (int l = 0; l < tree.NumLeaves; l++)
{
int begin;
int count;
ivec& documents = partitioning.ReferenceLeafDocuments(l, begin, count);
Float output = tree.LeafValue(l) * multiplier;
int end = begin + count;
#pragma omp parallel for
for (int i = begin; i < end; i++)
{
Scores[documents[i]] += output;
}
}
SendScoresUpdatedMessage();
}
After the first tree has been built:
(gdb) p Scores
$7 = std::vector of length 8, capacity 8 = {0.33333333333333343, 1, 0.33333333333333343, 1, -1, -1, 0.33333333333333343, -1}
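The score update amounts to adding each leaf's output to every sample in that leaf. The Python sketch below illustrates this for the first tree. The leaf membership is not printed in the post; it is inferred here from the two splits and from the Scores dump, and the adjusted leaf values are likewise read off that dump, so both should be treated as assumptions.

```python
def add_scores(scores, leaf_values, leaf_documents, multiplier=1.0):
    """Add each leaf's output to the score of every sample in that leaf."""
    for leaf, documents in leaf_documents.items():
        output = leaf_values[leaf] * multiplier
        for doc in documents:
            scores[doc] += output
    return scores

# Adjusted leaf values of the first tree (inferred from the Scores dump).
leaf_values = [-1.0, 1.0, 0.33333333333333343]
# Inferred leaf membership: leaf index -> sample indices.
leaf_documents = {0: [4, 5, 7], 1: [1, 3], 2: [0, 2, 6]}

scores = add_scores([0.0] * 8, leaf_values, leaf_documents)
print(scores)
```

Calling add_scores again with the leaves of a later tree keeps accumulating into the same list, which is how F(x) grows tree by tree.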
At this point the gradient is computed again:
for (int query = 0; query < Dataset.NumDocs; query++)
{
GetGradientInOneQuery(query, scores);
}
_gradient[0] =
2 * 1 * 0.1 / (1 + exp(2 * 1 * 0.1 * 0.33333333333333343))
In [2]: 0.2/(1.0 + math.exp(2*0.1/3))
Out[2]: 0.09666790068611772
The gradient we now need to fit becomes
(gdb) p _gradient
$9 = std::vector of length 8, capacity 8 = {0.096667900686117719, 0.090033200537504438,
0.096667900686117719, 0.090033200537504438, -0.090033200537504438, -0.090033200537504438,
-0.10333209931388229, -0.090033200537504438}
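As a check on the dump above, the following Python sketch recomputes all eight gradients from the Scores after the first tree. It assumes the simplified gradient formula shown earlier, labels of +1 for samples 0-3 and -1 for samples 4-7, and a sigmoid parameter of 0.1.

```python
import math

sigmoid_param = 0.1
labels = [1, 1, 1, 1, -1, -1, -1, -1]  # assumed from the sign of the first gradient
scores = [1 / 3, 1.0, 1 / 3, 1.0, -1.0, -1.0, 1 / 3, -1.0]  # Scores after tree 1

gradients = []
for label, score in zip(labels, scores):
    k = 2.0 * label * sigmoid_param
    gradients.append(k / (1.0 + math.exp(k * score)))

for i, g in enumerate(gradients):
    print(i, round(g, 6))

# Sample with the largest gradient magnitude.
hardest = max(range(len(gradients)), key=lambda i: abs(gradients[i]))
print("largest magnitude:", hardest)
```

Samples whose score already has the same sign as their label get a gradient smaller than 0.1 in magnitude, while sample 6 (label -1, score about +0.333) is on the wrong side and ends up with the largest magnitude, so the second tree is pushed hardest to correct it.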
The second tree
p tree
$10 = {<gezi::OnlineRegressionTree> = {NumLeaves = 3,
_gainPValue = std::vector of length 2, capacity 2 = {0.13944890100441296,
0.02357537149418417}, _lteChild = std::vector of length 2, capacity 2 = {-1, -2},
_gtChild = std::vector of length 2, capacity 2 = {1, -3},
_leafValue = std::vector of length 3, capacity 3 = {-0.9721949587186075,
-0.30312179217966367, 0.94840573799486361},
_threshold = std::vector of length 2, capacity 2 = {1, 1},
_splitFeature = std::vector of length 2, capacity 2 = {1, 2},
_splitGain = std::vector of length 2, capacity 2 = {0.024924858166579064,
0.023238200798742146}, _maxOutput = 0.094456333969913306,
_previousLeafValue = std::vector of length 2, capacity 2 = {0, 0.032222633562039242},
_weight = 1,
_featureNames = 0x6e6a5a <gezi::FastRank::GetActiveFeatures(std::vector<bool, std::allocator<bool> >&)+34>}, _parent = std::vector of length 3, capacity 3 = {0, 1, -2}}
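These are the Scores after the second tree has been added. If there were a third tree, the gradient would be computed again on top of these Scores.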
(gdb) p Scores
$11 = std::vector of length 8, capacity 8 = {1.2817390713281971, 0.69687820782033638,
1.2817390713281971, 1.9484057379948636, -1.3031217921796636, -1.9721949587186076,
-0.63886162538527413, -1.3031217921796636}