Data Mining: The ID3 Decision Tree Algorithm (C# Implementation)

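A decision tree is a classic classifier. It works much like a guessing game. Suppose we are trying to guess an animal:

Q: Is this animal a land animal?

A: Yes.

Q: Does this animal have gills?

A: No.

The order of these two questions is somewhat backwards, because land animals generally do not have gills (as far as I remember; corrections are welcome), so the second answer tells us almost nothing new. In this kind of game the order of the questions matters: each question should yield as much information as possible.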
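As a rough illustration of "amount of information", the expected information of a yes/no question can be measured with the binary entropy function. The probabilities below are assumptions chosen for the example, not values from the article:

$$H(p) = -p\log_2 p - (1-p)\log_2(1-p)$$

$$H(0.5) = 1 \text{ bit}, \qquad H(0.99) \approx 0.081 \text{ bits}$$

A question whose answer is a coin flip yields a full bit, while a question whose answer is almost certain (such as asking whether a land animal has gills) yields very little.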

Class-labeled training tuples from the AllElectronics customer database
RID	age	income	student	credit_rating	Class: buys_computer
1	youth	high	no	fair	no
2	youth	high	no	excellent	no
3	middle_aged	high	no	fair	yes
4	senior	medium	no	fair	yes
5	senior	low	yes	fair	yes
6	senior	low	yes	excellent	no
7	middle_aged	low	yes	excellent	yes
8	youth	medium	no	fair	no
9	youth	low	yes	fair	yes
10	senior	medium	yes	fair	yes
11	youth	medium	yes	excellent	yes
12	middle_aged	medium	no	excellent	yes
13	middle_aged	high	yes	fair	yes
14	senior	medium	no	excellent	no

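Take the class-labeled training tuples from the AllElectronics customer database above as an example. We want to use these samples as the training set for a decision tree model, in order to discover the decision pattern that determines whether a customer will buy a computer.

In the ID3 algorithm, the expected information (entropy) needed to classify a tuple in D is computed as follows, where $p_i$ is the proportion of tuples in D that belong to class i:

$$Info(D) = -\sum_{i=1}^m p_i \log_2 p_i$$

The expected information still needed after partitioning D on attribute A into the subsets $D_1, \dots, D_v$ is:

$$Info_A(D) = \sum_{j=1}^v\frac{|D_j|}{|D|} \times Info(D_j)$$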

The information gain is computed as follows:

$$Gain(A) = Info(D) - Info_A(D)$$

According to the formula, the class variable has 5 "no" tuples and 9 "yes" tuples, so the expected information is:

$$Info(D)=-\frac{9}{14}\log_2\frac{9}{14}-\frac{5}{14}\log_2\frac{5}{14}=0.940$$

First compute the expected information for the attribute age (using the convention $0\log_2 0 = 0$):

$$Info_{age}(D)=\frac{5}{14} \times (-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5})+\frac{4}{14} \times (-\frac{4}{4}\log_2\frac{4}{4} - \frac{0}{4}\log_2\frac{0}{4})+\frac{5}{14} \times (-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5})=0.694$$

Therefore, if we partition on age, the information gain is:

$$Gain(age) = Info(D)-Info_{age}(D) = 0.940-0.694=0.246$$
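The arithmetic above can be checked with a few lines of C#. This is only a minimal sketch: the class `InfoGainDemo` and the helpers `Entropy` and `ExpectedInfo` are names made up for this illustration and are not part of the implementation shown later.

```csharp
using System;
using System.Linq;

public static class InfoGainDemo
{
    // Entropy (in bits) of a class distribution given as raw counts.
    // Zero counts are skipped, following the convention 0 * log2(0) = 0.
    public static double Entropy(params int[] counts)
    {
        double total = counts.Sum();
        return counts.Where(c => c > 0)
                     .Sum(c => -(c / total) * Math.Log(c / total, 2.0));
    }

    // Info_A(D): weighted entropy of the partitions induced by an attribute.
    // Each inner array holds the class counts for one attribute value.
    public static double ExpectedInfo(int[][] partitions)
    {
        double total = partitions.Sum(p => p.Sum());
        return partitions.Sum(p => p.Sum() / total * Entropy(p));
    }

    public static void Main()
    {
        double infoD = Entropy(9, 5); // 9 "yes", 5 "no"
        // age: youth (2 yes, 3 no), middle_aged (4 yes, 0 no), senior (3 yes, 2 no)
        double infoAge = ExpectedInfo(new[] { new[] { 2, 3 }, new[] { 4, 0 }, new[] { 3, 2 } });
        Console.WriteLine("Info(D)     = {0:F3}", infoD);
        Console.WriteLine("Info_age(D) = {0:F3}", infoAge);
        Console.WriteLine("Gain(age)   = {0:F3}", infoD - infoAge);
    }
}
```

At full precision the gain comes out at roughly 0.247; the 0.246 in the text results from subtracting the already rounded values 0.940 and 0.694.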

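Compute the information gain of splitting on income, student, and credit_rating in the same way, then choose the attribute with the largest information gain and split the current node on that attribute's values. Applying this procedure recursively generates the decision tree. For more details, see: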

https://en.wikipedia.org/wiki/Decision_tree
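For reference, the remaining gains at the root can be worked out from the table in the same way as for age. The figures below are my own calculation from the 14 tuples, rounded to three decimals; $Info(a,b)$ is a shorthand introduced here for the entropy of a subset with a "yes" tuples and b "no" tuples.

$$Info_{income}(D)=\frac{4}{14}Info(2,2)+\frac{6}{14}Info(4,2)+\frac{4}{14}Info(3,1)\approx 0.911, \qquad Gain(income)\approx 0.029$$

$$Info_{student}(D)=\frac{7}{14}Info(3,4)+\frac{7}{14}Info(6,1)\approx 0.788, \qquad Gain(student)\approx 0.152$$

$$Info_{credit\_rating}(D)=\frac{8}{14}Info(6,2)+\frac{6}{14}Info(3,3)\approx 0.892, \qquad Gain(credit\_rating)\approx 0.048$$

Since age has the largest gain, it is the attribute chosen for the root split.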

The C# implementation is as follows:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    namespace MachineLearning.DecisionTree
    {
        public class DecisionTreeID3<T> where T : IEquatable<T>
        {
            T[,] Data;
            string[] Names;
            int Category;
            T[] CategoryLabels;
            DecisionTreeNode<T> Root;

            public DecisionTreeID3(T[,] data, string[] names, T[] categoryLabels)
            {
                Data = data;
                Names = names;
                Category = data.GetLength(1) - 1; // the class variable must be in the last column
                CategoryLabels = categoryLabels;
            }

            public void Learn()
            {
                int nRows = Data.GetLength(0);
                int nCols = Data.GetLength(1);
                int[] rows = new int[nRows];
                int[] cols = new int[nCols];
                for (int i = 0; i < nRows; i++) rows[i] = i;
                for (int i = 0; i < nCols; i++) cols[i] = i;
                Root = new DecisionTreeNode<T>(-1, default(T));
                Learn(rows, cols, Root);
                DisplayNode(Root);
            }

            public void DisplayNode(DecisionTreeNode<T> Node, int depth = 0)
            {
                const int IndentWidth = 2; // dashes per tree level (assumed value)
                if (Node.Label != -1)
                    Console.WriteLine("{0} {1}: {2}", new string('-', depth * IndentWidth), Names[Node.Label], Node.Value);
                foreach (var item in Node.Children)
                    DisplayNode(item, depth + 1);
            }

            private void Learn(int[] pnRows, int[] pnCols, DecisionTreeNode<T> Root)
            {
                var categoryValues = GetAttribute(Data, Category, pnRows);
                var categoryCount = categoryValues.Distinct().Count();
                if (categoryCount == 1)
                {
                    // All remaining tuples share one class: add a leaf
                    var node = new DecisionTreeNode<T>(Category, categoryValues.First());
                    Root.Children.Add(node);
                }
                else
                {
                    if (pnRows.Length == 0) return;
                    else if (pnCols.Length == 1)
                    {
                        // Only the class column is left: decide by majority vote
                        var Vote = categoryValues.GroupBy(i => i).OrderByDescending(i => i.Count()).First();
                        var node = new DecisionTreeNode<T>(Category, Vote.Key);
                        Root.Children.Add(node);
                    }
                    else
                    {
                        var maxCol = MaxEntropy(pnRows, pnCols);
                        var attributes = GetAttribute(Data, maxCol, pnRows).Distinct();
                        int[] cols = pnCols.Where(i => i != maxCol).ToArray();
                        foreach (var attr in attributes)
                        {
                            int[] rows = pnRows.Where(irow => Data[irow, maxCol].Equals(attr)).ToArray();
                            var node = new DecisionTreeNode<T>(maxCol, attr);
                            Root.Children.Add(node);
                            Learn(rows, cols, node); // recursively build the subtree
                        }
                    }
                }
            }

            // Info_A(D) for the attribute in column attrCol, restricted to pnRows
            public double AttributeInfo(int attrCol, int[] pnRows)
            {
                var tuples = AttributeCount(attrCol, pnRows);
                var sum = (double)pnRows.Length;
                double Entropy = 0.0;
                foreach (var tuple in tuples)
                {
                    int[] count = new int[CategoryLabels.Length];
                    foreach (var irow in pnRows)
                        if (Data[irow, attrCol].Equals(tuple.Item1))
                        {
                            int index = Array.IndexOf(CategoryLabels, Data[irow, Category]);
                            count[index]++; // currently only supports the class variable in the last column
                        }
                    double k = 0.0;
                    for (int i = 0; i < count.Length; i++)
                    {
                        double frequency = count[i] / (double)tuple.Item2;
                        double t = -frequency * Log2(frequency);
                        k += t;
                    }
                    double freq = tuple.Item2 / sum;
                    Entropy += freq * k;
                }
                return Entropy;
            }

            // Info(D) of the class variable, restricted to pnRows
            public double CategoryInfo(int[] pnRows)
            {
                var tuples = AttributeCount(Category, pnRows);
                var sum = (double)pnRows.Length;
                double Entropy = 0.0;
                foreach (var tuple in tuples)
                {
                    double frequency = tuple.Item2 / sum;
                    double t = -frequency * Log2(frequency);
                    Entropy += t;
                }
                return Entropy;
            }

            private static IEnumerable<T> GetAttribute(T[,] data, int col, int[] pnRows)
            {
                foreach (var irow in pnRows)
                    yield return data[irow, col];
            }

            // log2 with the convention 0 * log2(0) = 0
            private static double Log2(double x)
            {
                return x == 0.0 ? 0.0 : Math.Log(x, 2.0);
            }

            // Returns the column with the largest information gain
            public int MaxEntropy(int[] pnRows, int[] pnCols)
            {
                double cateEntropy = CategoryInfo(pnRows);
                int maxAttr = 0;
                double max = double.MinValue;
                foreach (var icol in pnCols)
                    if (icol != Category)
                    {
                        double Gain = cateEntropy - AttributeInfo(icol, pnRows);
                        if (max < Gain)
                        {
                            max = Gain;
                            maxAttr = icol;
                        }
                    }
                return maxAttr;
            }

            // (value, number of occurrences) pairs for one column, restricted to pnRows
            public IEnumerable<Tuple<T, int>> AttributeCount(int col, int[] pnRows)
            {
                var tuples = from n in GetAttribute(Data, col, pnRows)
                             group n by n into i
                             select Tuple.Create(i.First(), i.Count());
                return tuples;
            }
        }
    }

The decision tree node class:

    using System.Collections.Generic;

    namespace MachineLearning.DecisionTree
    {
        public sealed class DecisionTreeNode<T>
        {
            // Column index this node refers to (-1 for the root)
            public int Label { get; set; }
            // Attribute value of the branch, or the class label for a leaf
            public T Value { get; set; }
            public List<DecisionTreeNode<T>> Children { get; set; }

            public DecisionTreeNode(int label, T value)
            {
                Label = label;
                Value = value;
                Children = new List<DecisionTreeNode<T>>();
            }
        }
    }
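The class above only builds and prints the tree; it has no method for classifying a new tuple. The following is a hedged sketch of how such a lookup might walk the node structure. `Node<T>` is a stand-in that mirrors `DecisionTreeNode<T>` so the example compiles on its own, and `TryClassify` is a hypothetical helper, not part of the original code. The small tree in `Main` is built by hand for illustration and is not the program's output.

```csharp
using System;
using System.Collections.Generic;

// Stand-in mirroring DecisionTreeNode<T>, repeated so the sketch is self-contained.
public sealed class Node<T>
{
    public int Label;
    public T Value;
    public List<Node<T>> Children = new List<Node<T>>();
    public Node(int label, T value) { Label = label; Value = value; }
}

public static class TreeClassifier
{
    // Hypothetical helper: follows the branch matching the sample's attribute values.
    // Returns false if no branch matches a value in the sample.
    public static bool TryClassify<T>(Node<T> root, T[] sample, int categoryCol, out T result)
        where T : IEquatable<T>
    {
        var node = root;
        while (true)
        {
            Node<T> next = null;
            foreach (var child in node.Children)
            {
                // A child labeled with the class column is a leaf holding the prediction.
                if (child.Label == categoryCol) { result = child.Value; return true; }
                // Otherwise the child is a branch for one value of attribute child.Label.
                if (child.Value.Equals(sample[child.Label])) { next = child; break; }
            }
            if (next == null) { result = default(T); return false; }
            node = next;
        }
    }

    public static void Main()
    {
        // Hand-built fragment: column 0 = age, column 1 = student, column 2 = class.
        var root = new Node<string>(-1, null);
        var youth = new Node<string>(0, "youth");
        var youthStudent = new Node<string>(1, "yes");
        youthStudent.Children.Add(new Node<string>(2, "yes"));
        var youthNonStudent = new Node<string>(1, "no");
        youthNonStudent.Children.Add(new Node<string>(2, "no"));
        youth.Children.Add(youthStudent);
        youth.Children.Add(youthNonStudent);
        var middle = new Node<string>(0, "middle_aged");
        middle.Children.Add(new Node<string>(2, "yes"));
        root.Children.Add(youth);
        root.Children.Add(middle);

        string label;
        bool ok = TryClassify(root, new[] { "youth", "yes" }, 2, out label);
        Console.WriteLine(ok ? label : "no matching branch"); // prints "yes"
    }
}
```

Returning a bool rather than throwing keeps the caller in control when a sample contains an attribute value that never appeared in the training data.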

It is called as follows:

    using System;
    using MachineLearning.DecisionTree;

    namespace MachineLearning
    {
        class Program
        {
            static void Main(string[] args)
            {
                // Each row: age, income, student, credit_rating, buys_computer
                var da = new string[,]
                {
                    {"youth","high","no","fair","no"},
                    {"youth","high","no","excellent","no"},
                    {"middle_aged","high","no","fair","yes"},
                    {"senior","medium","no","fair","yes"},
                    {"senior","low","yes","fair","yes"},
                    {"senior","low","yes","excellent","no"},
                    {"middle_aged","low","yes","excellent","yes"},
                    {"youth","medium","no","fair","no"},
                    {"youth","low","yes","fair","yes"},
                    {"senior","medium","yes","fair","yes"},
                    {"youth","medium","yes","excellent","yes"},
                    {"middle_aged","medium","no","excellent","yes"},
                    {"middle_aged","high","yes","fair","yes"},
                    {"senior","medium","no","excellent","no"}
                };
                var names = new string[] { "age", "income", "student", "credit_rating", "Class: buys_computer" };
                // The third argument lists the possible class labels
                var tree = new DecisionTreeID3<string>(da, names, new string[] { "yes", "no" });
                tree.Learn(); // builds the tree and prints it to the console
                Console.ReadKey();
            }
        }
    }

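Output: (the screenshot from the original post was not preserved)

Note: I am still learning this material myself and my ability is limited, so corrections are welcome. Please credit the author when reposting.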
