Naive Bayes Theorem and Application - Theorem

Naive Bayes model:
1. Naive Bayes model: discrete attributes with a finite number of values
2. Parameter density estimation
3. Naive Bayes classification algorithm
4. AutoClass clustering algorithm

\(\textbf{1. Naive Bayes model}\)
In this model we want to estimate \(P(X_1,\dots,X_n)\), under the assumption that all attributes are independent of each other. This is the same independence assumption made by k-means, except that here the model is discrete:\[P(X_1,\dots,X_n)=\prod_{i=1}^{n} P(X_i)\]Each \(P(X_i)\) can be any discrete distribution you like, e.g. \(\{red: 0.5,\ blue: 0.2,\ yellow: 0.3\}\).
To simplify the discussion, assume all attributes are Boolean. With no independence assumptions, the joint over \(X_1,\dots,X_n\) has \(2^n\) states and requires \(2^n-1\) independent parameters; with the independence assumption, only \(n\) parameters are needed, one per attribute (and \(2n+1\) in the classification model below, which adds a class variable).
For example, in a classification problem, let \(\theta_C\) be the probability that the class is true. Then we have \(2n+1\) parameters:
\[\begin{align}
&P(C=T)=\theta_C \notag\\
&P(C=F)=1-\theta_C \notag\\
&P(X_i=T\mid C=T)=\theta_{i}^T,\qquad P(X_i=F\mid C=T)=1-\theta_{i}^T \notag\\
&P(X_i=T\mid C=F)=\theta_{i}^F,\qquad P(X_i=F\mid C=F)=1-\theta_{i}^F \notag\\
&\boldsymbol{\theta}=\langle \theta_C,\ \theta_{1}^T,\dots,\theta_{n}^T,\ \theta_{1}^F,\dots,\theta_{n}^F \rangle \notag\\
\end{align}\]
As you can see, this yields an enormous saving in the number of parameters. Representing \(P(X_1,\dots,X_n)\) explicitly suffers from the curse of dimensionality, while \(\prod_{i=1}^n P(X_i)\) does not. The saving comes from very strong independence assumptions. The Naive Bayes model performs very well when these assumptions hold, but can perform badly when the variables are dependent. For example, NB models tend to work well on English text but worse on Chinese, where context matters more for interpreting each token correctly. So we should be cautious about applying such a strong assumption to domains whose variables are substantially dependent.
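To make the saving concrete: with \(n=30\) Boolean attributes the explicit joint needs \(2^{30}-1\approx 1.07\times 10^9\) independent parameters, while the factored form stores only 30 numbers. Below is a minimal Python sketch (the attribute names and distributions are invented for illustration) that stores one small distribution per attribute and evaluates the factored joint \(\prod_i P(X_i)\).

```python
# Factored joint: one small distribution per attribute instead of one
# exponentially large table over all attribute combinations.

# Hypothetical per-attribute distributions P(X_i); each can be any discrete
# distribution, e.g. {red: 0.5, blue: 0.2, yellow: 0.3}.
marginals = [
    {"red": 0.5, "blue": 0.2, "yellow": 0.3},  # P(X_1)
    {True: 0.7, False: 0.3},                   # P(X_2)
    {"small": 0.4, "large": 0.6},              # P(X_3)
]

def joint_prob(assignment):
    """P(X_1=x_1, ..., X_n=x_n) under the independence assumption."""
    p = 1.0
    for dist, value in zip(marginals, assignment):
        p *= dist[value]
    return p

print(joint_prob(["red", True, "large"]))  # 0.5 * 0.7 * 0.6 = 0.21
```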

Naive Bayes classifier
For an NB classification problem we need to learn:
1. \(P(X_1,\dots,X_n\mid C)=\prod_{i=1}^{n}P(X_i\mid C)\), which, for each class, assumes that \(X_i\) and \(X_j\) are conditionally independent of each other given \(C\).
2. \(P(C)\)
To classify: given \(\mathbf{x}\), choose the class \(c\) that maximizes
\[P(c\mid\mathbf{x})~\propto~P(c)\prod_{i=1}^{n}P(x_i\mid c)\]
The NB classifier is a linear separator. The attributes act independently to produce the classification and do not interact, so, like perceptrons, it cannot capture concepts such as XOR.
There is an important point about linear separability. Many real-world domains are not linearly separable, but even for those domains there may be a pretty good linearly separable hypothesis, and we may be better off learning it than learning a richer hypothesis. This comes from a strong inductive bias: linear separators are \(\textbf{easier to learn}\).
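To make the classifier concrete, here is a minimal sketch of a Boolean NB classifier that learns \(P(C)\) and \(P(X_i\mid C)\) by counting and classifies with \(\arg\max_c P(c)\prod_i P(x_i\mid c)\). The tiny data set is invented for illustration, and add-one smoothing is used for the conditional probabilities (a simple stand-in for the priors discussed later) so that no probability is exactly zero.

```python
import math
from collections import defaultdict

def train_nb(examples):
    """examples: list of (attributes, label) with Boolean attributes and labels.
    Returns P(C) and P(X_i = True | C), the latter with add-one smoothing."""
    class_counts = defaultdict(int)
    true_counts = defaultdict(lambda: defaultdict(int))  # true_counts[c][i]
    n_attrs = len(examples[0][0])
    for attrs, c in examples:
        class_counts[c] += 1
        for i, x in enumerate(attrs):
            if x:
                true_counts[c][i] += 1
    prior = {c: class_counts[c] / len(examples) for c in class_counts}
    cond = {c: [(true_counts[c][i] + 1) / (class_counts[c] + 2)
                for i in range(n_attrs)]
            for c in class_counts}
    return prior, cond

def classify(attrs, prior, cond):
    """Choose the class c maximizing log P(c) + sum_i log P(x_i | c)."""
    def score(c):
        s = math.log(prior[c])
        for i, x in enumerate(attrs):
            p_true = cond[c][i]
            s += math.log(p_true if x else 1.0 - p_true)
        return s
    return max(prior, key=score)

# Tiny invented data set: (attributes, class).
data = [((True, False), True), ((True, True), True),
        ((False, False), False), ((False, True), False)]
prior, cond = train_nb(data)
print(classify((True, True), prior, cond))  # expected: True
```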

In the following discussion we will assume that the attributes and the class are Boolean; this is only to keep the notation simple. Everything generalizes to the case where variables have many possible values.

A simple problem:
Suppose there is a soccer team and we have observed a sequence of its games. Based on these observations, we want to estimate the probability that the team will win a future game. The problem is formulated as follows:
* Variable X has states {f, t}(t = win)
* Parameter \(\theta=P(X=t)\)
* Observations \(X^1=t, X^2=f, X^3=f\)
* These comprise the data \(\bf{D}\)
* Task: estimate \(\theta\)
* Use \(\theta\) to estimate \(P(X^4=t)\)
First we introduce the \(\textbf{maximum likelihood (ML)}\) approach:
* \(\textbf{Likelihood:}\) \(\mathbf{L}(\theta)=P(\mathbf{D}|\theta)=P(X^1,X^2,X^3|\theta)\)
* \(\textbf{ML principle:}\) choose \(\theta\) so as to maximize \(\mathbf{L}(\theta)\)
* Since the observations are independent given \(\theta\): \(L(\theta)=P(X^1|\theta)P(X^2|\theta)P(X^3|\theta)\)
* \(\textbf{Log likelihood:}\) \[\mathbf{LL}(\theta)=\log P(X^1|\theta)+\log P(X^2|\theta)+\log P(X^3|\theta)\]
* Equivalent ML principle: choose \(\theta\) so as to maximize \(\mathbf{LL}(\theta)\)
In this example:
\(P(X^i=t|\theta)=\theta\)
\(L(\theta)=P(X^1=t, X^2=f, X^3=f|\theta)=\theta(1-\theta)(1-\theta)\)
\(LL(\theta)=\log\theta+2\log(1-\theta)\)
Setting the derivative to 0: \(\frac{1}{\theta}~-~\frac{2}{1-\theta} = 0\)
Solving gives \(\theta=1/3\).
In this example \(\theta=1/3\) is exactly the fraction of observed games that the team won. This is no coincidence: the ML estimate for the probability of an event is always the fraction of time the event occurred. In other words, the ML estimate is exactly the one most suggested by the data. More generally, given observations \(X^1,X^2,\dots,X^N\), let \(N_t\) be the number of instances with value t and \(N_f\) the number with value f. Then the maximum likelihood estimate of \(\theta\) is: \[\hat{\theta}=\frac{N_t}{N_t+N_f}=\frac{N_t}{N}\]
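A minimal sketch of this computation for the soccer data above: the closed-form estimate \(N_t/N\) is compared against a direct numerical maximization of the log-likelihood over a grid of \(\theta\) values.

```python
import math

observations = [True, False, False]   # X^1 = t, X^2 = f, X^3 = f
n_t = sum(observations)               # N_t: number of wins
n = len(observations)                 # N: total number of observations

# Closed-form ML estimate: the fraction of observations that are true.
theta_ml = n_t / n
print(theta_ml)                       # 0.333...

# Sanity check: LL(theta) = N_t*log(theta) + N_f*log(1 - theta),
# maximized numerically over a fine grid of theta values.
def log_likelihood(theta):
    return n_t * math.log(theta) + (n - n_t) * math.log(1.0 - theta)

best = max((k / 1000 for k in range(1, 1000)), key=log_likelihood)
print(round(best, 3))                 # ~0.333
```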
\(\textbf{Problems with this approach}\)
\(\color{red}{Overfits}:\) it pays too much attention to noise in the data; for example, if the team happened to have played the \(\color{red}{\textbf{Chinese national soccer team}}\) recently, we will overestimate the team's strength.
\(\color{red}{Ignores\,prior\,experience}\): if experts have told you that this is a weak team, you should not become confident just because it beat the Chinese national team.
Events that never occur in the data are deemed impossible; for example, a 1-1 draw would be assigned probability zero if no draw appears in the observations.

\(\textbf{Incorporating a prior}\)
* \(\textbf{Prior}:\) \(P(\theta)\) before seeing any data
* \(\textbf{Posterior:}\) \(P(\theta|\mathbf{D})\)
* \(\textbf{Maximum a Posteriori principle (MAP):}\) choose \(\theta\) to maximize \(P(\theta|\mathbf{D})\), which is proportional to \(P(\theta)L(\theta)\)
For learning the parameter of a Boolean random variable, an appropriate prior over \(\theta\) is the \(\textbf{beta}\) distribution. The beta distribution has two \(\textbf{parameters}\), \(\alpha\) and \(\beta\), which control the shape of the prior: their relative sizes control how likely true and false outcomes are, so if \(\alpha\) is large relative to \(\beta\), large values of \(\theta\) become more likely.
(Figure omitted: beta densities for several values of \(\alpha\) and \(\beta\).)
The \(\color{red}{magnitude}\) of \(\alpha\) and \(\beta\) controls how peaked the beta distribution is: if \(\alpha\) and \(\beta\) are both large, the density is sharply peaked.
(Figure omitted: the effect of the magnitude of \(\alpha\) and \(\beta\) on peakedness.)
Updating the prior
To get the hyperparameters of the posterior, we take the hyperparameters of the prior and add to them the counts of the actual observations.
For example, if the prior is Beta(4, 7) and we observe 1 "+" and 4 "-", then the posterior is Beta(5, 11).
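A small sketch of both points, assuming `scipy` is available: evaluating the beta density at a few values of \(\theta\) shows how the relative size and magnitude of \(\alpha\) and \(\beta\) shape the prior, and the update itself is just adding the observed counts to the hyperparameters.

```python
from scipy.stats import beta

# Shape of the prior: alpha large relative to beta shifts mass toward large theta;
# large alpha and beta together make the density sharply peaked.
for a, b in [(2, 5), (5, 2), (2, 2), (20, 20)]:
    densities = [round(beta.pdf(t, a, b), 2) for t in (0.25, 0.5, 0.75)]
    print(f"Beta({a},{b}) density at theta = 0.25, 0.5, 0.75:", densities)

# Updating the prior: add the observed counts to the hyperparameters.
def update(alpha, beta_, positives, negatives):
    return alpha + positives, beta_ + negatives

print(update(4, 7, positives=1, negatives=4))  # (5, 11), i.e. Beta(5, 11)
```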

Understanding the hyperparameters
Hyperparameter \(\alpha\) represents the number of previous positive observations we have had, plus 1; similarly, \(\beta\) represents the number of previous "-" observations we have had, plus 1. The hyperparameters in the prior can be thought of as imaginary observations from our prior experience. \(\textbf{The more we trust our prior experience, the larger the hyperparameters in the prior.}\)

Mode and mean of the beta distribution
The mode of \(Beta(\alpha,\beta)\) is \(\frac{\alpha-1}{\alpha+\beta-2}\), e.g. the mode of Beta(2,3) is 1/3. The mean of \(Beta(\alpha,\beta)\) is \(\frac{\alpha}{\alpha+\beta}\), e.g. the mean of Beta(2,3) is 2/5.

MAP estimate
The MAP estimate is the mode of the posterior. It is the fraction of observations, real plus imaginary, that are true: for \(m\) positive instances out of \(N\) total, \[\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}\]

Returning to the soccer example with prior Beta(5, 3) and observations \(X^1=t, X^2=f, X^3=f\), the posterior is Beta(6, 5), and the MAP estimate is \[\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}=\frac{5}{9}\]
ML vs MAP
Maximum likelihood estimate: \(\hat{\theta}_{ML}=\frac{m}{N}\); MAP estimate: \(\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}\). Maximum likelihood is equivalent to MAP with a uniform prior, Beta(1, 1), meaning that there are no imaginary observations.
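Plugging \(\alpha=\beta=1\) into the MAP formula confirms the equivalence:
\[\hat{\theta}_{MAP}=\frac{m+1-1}{N+1+1-2}=\frac{m}{N}=\hat{\theta}_{ML}\]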
Drawback of MAP
MAP does not fully consider the range of possible values of \(\theta\); it picks only the single value with greatest posterior probability, which may not be representative of the whole posterior. So we introduce another approach, the \(\textbf{Bayesian approach}\), which makes no point estimate of \(\theta\); instead, the full posterior distribution over \(\theta\) is maintained. For example, given \(X^1,X^2,X^3\) we predict \(X^4\) using the entire distribution over \(\theta\):
\[\begin{align}
P(X^4|X^1,X^2,X^3)&=\int_{0}^1{P(X^4|\theta)P(\theta|X^1,X^2,X^3)}d\theta\notag\\
& = \int_{0}^1\theta{P(\theta|X^1,X^2,X^3)}d\theta\notag\\
& = E\left[ {\theta |X^1, X^2, X^3} \right] \notag\\
& = \mathbf{mean}\,of\,the\,posterior \notag\\
\end{align}\]
For the beta posterior, \(E\left[\theta\right]=\frac{m+\alpha}{N+\alpha+\beta}\).
\(E\left[\theta\right]\) is the estimate of the probability that a new instance is true, obtained by integrating over the posterior distribution.
In the problem above, with prior Beta(5, 3), the MAP estimate is \(\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}=\frac{5}{9}\), while the Bayesian estimate is \(\hat{\theta}_{Bayes}=\frac{m+\alpha}{N+\alpha+\beta}=\frac{6}{11}\); the latter is closer to 1/2.
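A minimal sketch reproducing these numbers for the Beta(5, 3) prior and the observations t, f, f: the ML estimate, the MAP estimate (posterior mode), and the Bayesian estimate (posterior mean).

```python
from fractions import Fraction

def estimates(alpha, beta, observations):
    """ML, MAP, and posterior-mean estimates for a Boolean variable
    with a Beta(alpha, beta) prior and a list of Boolean observations."""
    m = sum(observations)                                  # number of true observations
    n = len(observations)                                  # total number of observations
    ml = Fraction(m, n)
    map_ = Fraction(m + alpha - 1, n + alpha + beta - 2)   # mode of Beta(m+alpha, n-m+beta)
    bayes = Fraction(m + alpha, n + alpha + beta)          # mean of the posterior
    return ml, map_, bayes

ml, map_, bayes = estimates(5, 3, [True, False, False])
print(ml, map_, bayes)  # 1/3 5/9 6/11
```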

\(\textbf{Multi-valued class and attributes}\)
In the first part of this essay we simplified the attributes and the class to Boolean; now we generalize to multi-valued classes and attributes.
Given \(|C| = k\) classes and \(m\) possible values per attribute \(X_i\), the parameters are:
\[\begin{align}
&P(C=c)=\theta_c \notag\\
&P(X_i=x|C=c)=\theta_{i,x}^c \notag\\
\end{align}\]
So the number of parameters grows to \(kmn+k-1\). The instance counts used to estimate them are listed below:
\[\begin{align}
&N=\text{total number of instances} \notag\\
&N_c=\text{number of instances with class }c \notag\\
&N_{c,i,x}=\text{number of instances with class }c\text{ and }X_i=x \notag\\
\end{align}\]
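To tie the counts to the parameter estimates, here is a minimal sketch (the data set and attribute values are invented for illustration) that computes the ML estimates \(\hat\theta_c = N_c/N\) and \(\hat\theta_{i,x}^c = N_{c,i,x}/N_c\) from a list of labelled instances; smoothing via a Dirichlet/beta prior could be added exactly as in the MAP discussion above.

```python
from collections import Counter, defaultdict

def train_multivalued_nb(instances):
    """instances: list of (attributes, class_label) with discrete attribute values.
    Returns theta_c = N_c / N and theta[c][i][x] = N_{c,i,x} / N_c."""
    N = len(instances)
    n_attrs = len(instances[0][0])
    N_c = Counter(c for _, c in instances)
    N_cix = defaultdict(Counter)                 # N_cix[(c, i)][x]
    for attrs, c in instances:
        for i, x in enumerate(attrs):
            N_cix[(c, i)][x] += 1

    theta_c = {c: N_c[c] / N for c in N_c}
    theta = {c: {i: {x: N_cix[(c, i)][x] / N_c[c] for x in N_cix[(c, i)]}
                 for i in range(n_attrs)}
             for c in N_c}
    return theta_c, theta

# Tiny invented data set: two attributes with values in {a, b, c}, two classes.
data = [(("a", "b"), "spam"), (("a", "a"), "spam"), (("c", "b"), "ham")]
theta_c, theta = train_multivalued_nb(data)
print(theta_c)           # {'spam': 0.666..., 'ham': 0.333...}
print(theta["spam"][0])  # {'a': 1.0}
```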
From the analysis above, we can draw a brief conclusion about Naive Bayes:
The advantages are that NB does not suffer from the curse of dimensionality, it is very easy to implement, and it learns fast; in each dimension it makes no assumption about the form of the distribution. On the other hand, Naive Bayes makes a strong independence assumption and may perform poorly when that assumption does not hold (Chinese text, for example), and maximum-likelihood estimation can overfit the data.
\(\textbf{Reference}\)
Note: most of the content of this post comes from a CMU lecture; unfortunately, I have lost the link to the original website.
CMU statistical learning
