数据挖掘中决策树算法实现—

数据挖掘中决策树算法实现——Bash

博客分类：

数据挖掘决策树 bash 非递归实现标准信息熵

一、决策树简介：

关于决策树，几乎是数据挖掘分类算法中最先介绍到的。

决策树，顾名思义就是用来做决定的树，一个分支就是一个决策过程。

每个决策过程中涉及一个数据的属性，而且只涉及一个。然后递归地，贪心地直到满足决策条件(即可以得到明确的决策结果)。

决策树的实现首先要有一些先验（已经知道结果的历史）数据做训练，通过分析训练数据得到每个属性对结果的影响的大小，这里我们通过一种叫做信息增益的理论去描述它，期间也涉及到熵的概念。也可参考文章信息增益与熵.

下面我们结合实例说一下决策树实现过程中的上述关键概念：

假设我们有如下数据：

age	job	house	credit	class
1	0	0	1	0
1	0	0	2	0
1	1	0	2	1
1	1	1	1	1
1	0	0	1	0
2	0	0	1	0
2	0	0	2	0
2	1	1	2	1
2	0	1	3	1
2	0	1	3	1
3	0	1	3	1
3	0	1	2	1
3	1	0	2	1
3	1	0	3	1
3	0	0	1	0

(一)

我们首先要通过计算找到哪个属性的所有属性值能更好地表达class字段的不同。通过计算，我们发现house的属性值最能表现class字段的不同。这个衡量标准其实就是信息增益。计算方法是：首先计算全部数据的熵，然后除class之外的其他属性逐个遍历，找到熵最小的那个属性(house)，然后将全部数据的熵减去按照house属性划分数据之后的数据的熵。

这个值如果满足条件假如(>0.1)，我们认为数据应该按照这个节点进行分裂，也就是说这个属性(house)构成了我们的一次决策过程。

(二)

然后

在按照house分裂的每个数据集上，针对其他属性(house除外)进行与(一)相同的过程，直到信息增益不足以满足数据分裂的条件。

这样，我们就得到了一个关于属性数据划分的一棵树。可以作为class字段未知的数据的决策依据。

二、决策树代码实现：

具体计算代码如下：---假设上述数据我们保存为descision.dat文件，以及需要bash4.0及以上支持运行。

#!/home/admin/bin/bash_bin/bash_4
input=$1;
if [ -z $input ]; then
echo "please input the traning file";
exit 1;
fi
## pre calculate the log2 value for the later calculate operation
declare -a log2;
logi=0;
records=$(cat $input | wc -l);
for i in `awk -v n=$records 'BEGIN{for(i=1;i<n;i++) print log(i)/log(2);}'`
do
((logi+=1));
log2[$logi]=$i;
done
## function for calculating the entropy for the given distribution of the class
function getEntropy {
local input=`echo $1`;
if [[ $input == *" "* ]]; then
local current_entropy=0;
local sum=0;
local i;
for i in $input
do
((sum+=$i));
current_entropy=$(awk -v n=$i -v l=${log2[$i]} -v o=$current_entropy 'BEGIN{print n*l+o}');
done
current_entropy=$(awk -v n=$current_entropy -v b=$sum -v l=${log2[$sum]} 'BEGIN{print n/b*-1+l;}')
eval $2=$current_entropy;
else
eval $2=0;
fi
}
### the header title of the input data
declare -A header_info;
header=$(head -1 $input);
headers=(${header//,/ })
length=${#headers[@]};
for((i=0;i<length;i++))
do
attr=${headers[$i]};
header_info[$attr]=$i;
done
### the data content of the input data
data=${input}_dat;
sed -n '2,$p' $input > $data
## use an array to store the information of a descision tree
## the node structure is {child,slibling,parent,attr,attr_value,leaf,class}
## the root is a virtual node with none used attribute
## only the leaf node has class flag and the "leaf,class" is meaningfull
## the descision_tree
declare -a descision_tree;
## the root node with no child\slibing and anythings else
descision_tree[0]="0:0:0:N:N:0:0";
## use recursive algrithm to build the tree
## so we need a trace_stack to record the call level infomation
declare -a trace_stack;
## push the root node into the stack
trace_stack[0]=0;
stack_deep=1;
## begin to build the tree until the trace_stack is empty
while [ $stack_deep -ne 0 ]
do
((stack_deep-=1));
current_node_index=${trace_stack[$stack_deep]};
current_node=${descision_tree[$current_node_index]};
current_node_struct=(${current_node//:/ });
## select the current data set
## get used attr and their values
attrs=${current_node_struct[3]};
attrv=${current_node_struct[4]};
declare -a grepstra=();
if [ $attrs != "N" ];then
attr=(${attrs//,/ });
attrvs=(${attrv//,/ });
attrc=${#attr[@]};
for((i=0;i<attrc;i++))
do
a=${attr[$i]};
index=${header_info[$a]};
grepstra[$index]=${attrvs[$i]};
done
fi
for((i=0;i<length;i++))
do
if [ -z ${grepstra[$i]} ]; then
grepstra[$i]=".*";
fi
done
grepstrt=${grepstra[*]};
grepstr=${grepstrt// /,};
grep $grepstr $data > current_node_data
## calculate the entropy before split the records
entropy=0;
input=`cat current_node_data | cut -d "," -f 5 | sort | uniq -c | sed 's/^ \+//g' | cut -d " " -f 1`
getEntropy "$input" entropy;
## calculate the entropy for each of the rest attrs
## and select the min one
min_attr_entropy=1;
min_attr_name="";
min_attr_index=0;
for((i=0;i<length-1;i++))
do
## just use the rest attrs
if [[ "$attrs" != *"${headers[$i]}"* ]]; then
## calculate the entropy for the current attr
### get the different values for the headers[$i]
j=$((i+1));
cut -d "," -f $j,$length current_node_data > tmp_attr_ds
dist_values=`cut -d , -f 1 tmp_attr_ds | sort | uniq -c | sed 's/^ \+//g' | sed 's/ /,/g'`;
totle=0;
totle_entropy_attr=0;
for k in $dist_values
do
info=(${k//,/ });
((totle+=${info[0]}));
cur_class_input=`grep "^${info[1]}," tmp_attr_ds | cut -d "," -f 2 | sort | uniq -c | sed 's/^ \+//g' | cut -d " " -f 1`
cur_attr_value_entropy=0;
getEntropy "$cur_class_input" cur_attr_value_entropy;
totle_entropy_attr=$(awk -v c=${info[0]} -v e=$cur_attr_value_entropy -v o=$totle_entropy_attr 'BEGIN{print c*e+o;}');
done
attr_entropy=$(awk -v e=$totle_entropy_attr -v c=$totle 'BEGIN{print e/c;}');
if [ $(echo "$attr_entropy < $min_attr_entropy" | bc) = 1 ]; then
min_attr_entropy=$attr_entropy;
min_attr_name="${headers[$i]}";
min_attr_index=$j;
fi
fi
done
## calculate the gain between the original entropy of the current node
## and the entropy after split by the attribute which has the min_entropy
gain=$(awk -v b=$entropy -v a=$min_attr_entropy 'BEGIN{print b-a;}');
## when the gain is large than 0.1 and then put it as a branch
## and add the child nodes to the current node and push the index to the trace_stack
## otherwise make it as a leaf node and get the class flag
## and do not push trace_stack
if [ $(echo "$gain > 0.1" | bc) = 1 ]; then
### get the attribute values
attr_values_str=`cut -d , -f $min_attr_index current_node_data | sort | uniq`;
attr_values=($attr_values_str);
### generate the node and add to the tree and add their index to the trace_stack
tree_store_length=${#descision_tree[@]};
current_node_struct[0]=$tree_store_length;
parent_node_index=$current_node_index;
attr_value_c=${#attr_values[@]};
for((i=0;i<attr_value_c;i++))
do
tree_store_length=${#descision_tree[@]};
slibling=0;
if [ $i -lt $((attr_value_c-1)) ]; then
slibling=$((tree_store_length+1));
fi
new_attr="";
new_attrvalue="";
if [ $attrs != "N" ]; then
new_attr="$attrs,$min_attr_name";
new_attrvalue="$attrv,${attr_values[$i]}";
else
new_attr="$min_attr_name";
new_attrvalue="${attr_values[$i]}";
fi
new_node="0:$slibling:$parent_node_index:$new_attr:$new_attr_value:0:0";
descision_tree[$tree_store_length]="$new_node";
trace_stack[$stack_deep]=$tree_store_length;
((stack_deep+=1));
done
current_node_update=${current_node_struct[*]};
descision_tree[$current_node_index]=${current_node_update// /:};
else ## current node is a leaf node
current_node_struct[5]=1;
current_node_struct[6]=`cut -d , -f $length current_node_data | sort | uniq -c | sort -n -r | head -1 | sed 's/^ \+[^ ]* //g'`;
current_node_update=${current_node_struct[*]};
descision_tree[$current_node_index]=${current_node_update// /:};
fi
## output the descision tree after every step for split or leaf node generater
echo ${descision_tree[@]};
done

执行代码：

./descision.sh descision.dat

执行结果为：

1:0:0:N:N:0:0 0:2:0:house:0:0:0 0:0:0:house:1:0:0
1:0:0:N:N:0:0 0:2:0:house:0:0:0 0:0:0:house:1:1:1
1:0:0:N:N:0:0 3:2:0:house:0:0:0 0:0:0:house:1:1:1 0:4:1:house,job:0,0:0:0 0:0:1:house,job:0,1:0:0
1:0:0:N:N:0:0 3:2:0:house:0:0:0 0:0:0:house:1:1:1 0:4:1:house,job:0,0:0:0 0:0:1:house,job:0,1:1:1
1:0:0:N:N:0:0 3:2:0:house:0:0:0 0:0:0:house:1:1:1 0:4:1:house,job:0,0:1:0 0:0:1:house,job:0,1:1:1

输出结果中展示了决策树结构生成过程的细节，以及生成过程中树的变化过程

本代码中使用了一维数组结构来存储整棵决策树，输出的先后顺序按照数组下标输出。

输出结果中最后一行表示最终的决策树。它表示的树形结构其实是：

这样看着是不是就爽多了。

说明：

关于上述决策树结果中其实有部分存在误导：

默认根节点是放在数组的第一个位置的，即索引值为0，而子节点中的child与sibling为0时并不表示指向跟节点，而是表示无意义，即没有子节点或兄弟节点。

该决策树所代表的分类规则：

根据该决策树输出，我们挖掘的规则是这样的：

首先根据house属性判断，当house属性为1时，走到索引为2的节点，此时该节点是叶子节点，预测值class为1.

当house属性为0时，接着按照job属性来判断，当job属性为0时走到索引为3的节点，预测值class为0。如果job属性为1时走到索引值为4的节点，此时预测值class为1.

数据挖掘中决策树算法实现——Bash的更多相关文章

sklearn中决策树算法DesiciontTreeClassifier()调用以及sklearn自带的数据包sklearn.datasets.load_iris()的应用
决策树方法的简单调用记录一下 clf=tree.DecisionTreeClassifier() dataMat=[];labelMat=[] dataPath='D:/machinelearning ...
ID3和C4.5分类决策树算法 - 数据挖掘算法（7）
(2017-05-18 银河统计) 决策树(Decision Tree)是在已知各种情况发生概率的基础上,通过构成决策树来判断其可行性的决策分析方法,是直观运用概率分析的一种图解法.由于这种决策分支画 ...
《BI那点儿事》Microsoft 决策树算法
Microsoft 决策树算法是由 Microsoft SQL Server Analysis Services 提供的分类和回归算法,用于对离散和连续属性进行预测性建模.对于离散属性,该算法根据数据 ...
ID3决策树算法原理及C++实现(其中代码转自别人的博客)
分类是数据挖掘中十分重要的组成部分.分类作为一种无监督学习方式被广泛的使用. 之前关于"数据挖掘中十大经典算法"中,基于ID3核心思想的分类算法C4.5榜上有名.所以不难看出ID3 ...
决策树算法的Python实现—基于金融场景实操
决策树是最经常使用的数据挖掘算法,本次分享jacky带你深入浅出,走进决策树的世界基本概念决策树(Decision Tree) 它通过对训练样本的学习,并建立分类规则,然后依据分类规则,对新样本数 ...
吴裕雄--天生自然python机器学习：决策树算法
我们经常使用决策树处理分类问题’近来的调查表明决策树也是最经常使用的数据挖掘算法. 它之所以如此流行,一个很重要的原因就是使用者基本上不用了解机器学习算法,也不用深究它是如何工作的. K-近邻算法可 ...
scikit-learn决策树算法类库使用小结
之前对决策树的算法原理做了总结,包括决策树算法原理(上)和决策树算法原理(下).今天就从实践的角度来介绍决策树算法,主要是讲解使用scikit-learn来跑决策树算法,结果的可视化以及一些参数调参的 ...
4-Spark高级数据分析-第四章用决策树算法预测森林植被
预测是非常困难的,更别提预测未来. 4.1 回归简介随着现代机器学习和数据科学的出现,我们依旧把从“某些值”预测“另外某个值”的思想称为回归.回归是预测一个数值型数量,比如大小.收入和温度,而分类则 ...
就是要你明白机器学习系列--决策树算法之悲观剪枝算法(PEP)
前言在机器学习经典算法中,决策树算法的重要性想必大家都是知道的.不管是ID3算法还是比如C4.5算法等等,都面临一个问题,就是通过直接生成的完全决策树对于训练样本来说是“过度拟合”的,说白了是太精确 ...

随机推荐

[Go] 编码规范
gofmt 大部分的格式问题可以通过 gofmt 解决,gofmt 自动格式化代码,保证所有的Go代码与官方推荐的格式保持一致,于是所有格式有关问题,都以 gofmt 的结果为准. 注释在编码阶段应 ...
jQuery 事件方法大全-超全的总结
jquery经常使用的事件: /* on off hover blur change click dblclick focus ...
C#面向服务编程技术WCF从入门到实战演练
一.WCF课程介绍 1.1.Web Service会被WCF取代吗? 对于这个问题阿笨的回答是:两者在功能特性上却是有新旧之分,但是对于特定的系统,适合自己的就是最好的.不能哪一个技术框架和行业标准作 ...
(Delphi) Using the Disk Cache 使用磁盘缓存
The Chilkat Spider component has disk caching capabilities. To setup a disk cache, create a new dire ...
Java io.netty.util.ReferenceCountUtil 代码实例
原文:https://www.helplib.com/Java_API_Classes/article_64580 以下是展示如何使用io.netty.util.ReferenceCountUtil的 ...
firedac调用ORACLE的存储过程
firedac调用ORACLE的存储过程 EMB官方原文地址:http://docwiki.embarcadero.com/RADStudio/Tokyo/en/Using_Oracle_with_F ...
jQuery - 同时添加click和dblclick事件
添加事件的代码比较简单,有两种方法: $("abc").bind({"click":fn,"dblclick":fn}); $(" ...
Swift - 添加纯净的Alamofire
Swift - 添加纯净的Alamofire 如果你有代码洁癖,不能容忍任何多余的东西,请继续往下看. . 下载Alamofire (https://github.com/Alamofire/Ala ...
mysql查询字段时实现左右补零
右补0:select RPAD(id,8,'0') as pad from tmp; 左补0:select LPAD(id,8,'0') as pad from tmp;
asp.net mvc5 安装
原文地址 http://docs.nuget.org/docs/start-here/using-the-package-manager-console 工具-->NuGet程序包管理器--&g ...

数据挖掘中 决策树算法实现——Bash

数据挖掘中 决策树算法实现——Bash

数据挖掘中 决策树算法实现——Bash的更多相关文章

随机推荐

热门专题

数据挖掘中决策树算法实现——Bash

数据挖掘中决策树算法实现——Bash

数据挖掘中决策树算法实现——Bash的更多相关文章