Naive Bayes Theorem and Application - Theorem

Naive Bayes model:
1. Naive Bayes model (discrete attributes with a finite number of values)
2. Parameter density estimation
3. Naive Bayes classification algorithm
4. AutoClass clustering algorithm

\(\textbf{1. Naive Bayes model}\)
In this model we want to estimate \(P(X_1,...,X_n)\) under the assumption that all attributes are independent of each other. This is the same assumption that k-means makes, except that here the model is discrete.\[P(X_1,...,X_n)=\prod_{i=1}^n P(X_i)\]Each \(P(X_i)\) can be any distribution you like, e.g. \(\{0.5 : red, 0.2 : blue, 0.3 : yellow\}\).
To simplify the problem, we assume all attributes are Boolean. With no independence assumptions, the model has \(2^n\) states of \(X_1,...,X_n\) and \(2^n-1\) independent parameters. With the independence assumption, the number of parameters drops to n, one \(P(X_i=T)\) per attribute.
For example, in a classification problem with a Boolean class C, let \(\theta_C\) be the probability that the class is true. The attributes are then assumed independent given the class, and we have 2n+1 parameters:
\[\begin{align}
&P(C=T)=\theta_C \notag\\
&P(C=F)=1 - \theta_C \notag\\
&P(X_i=T|C=T)=\theta_{i}^T \notag\\
&P(X_i=F|C=T)=1 - \theta_{i}^T \notag\\
&P(X_i=T|C=F)=\theta_{i}^F \notag\\
&P(X_i=F|C=F)=1 - \theta_{i}^F \notag\\
& \mathbf{\theta}=\langle {\theta_C, \theta_{1}^T,...,\theta_{n}^T,\theta_{1}^F,...,\theta_{n}^F} \rangle \notag\\
\end{align}\]
As you can see above, this gives an enormous saving in the number of parameters. Representing \(P(X_1,..., X_n)\) explicitly suffers from the curse of dimensionality, while \(\prod_{i=1}^n P(X_i)\) does not. The saving results from very strong independence assumptions. Naive Bayes performs very well when the assumptions hold, but can perform very badly when the variables are dependent. For example, an NB model performs well on English text but badly on Chinese, because context is more important for understanding Chinese correctly. So we should be cautious about applying these strong assumptions to models whose variables are substantially related.
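To make the saving concrete, here is a minimal sketch that compares the two parameter counts for Boolean attributes. The function names are made up for illustration; the formulas are the \(2^n-1\) and 2n+1 counts from the text.

```python
def full_joint_params(n: int) -> int:
    """Independent parameters of an unrestricted joint over n Boolean attributes."""
    return 2 ** n - 1


def naive_bayes_params(n: int) -> int:
    """Boolean naive Bayes classifier: 1 class prior + 2 parameters per attribute."""
    return 2 * n + 1


# The full joint grows exponentially, naive Bayes only linearly.
for n in (5, 10, 20, 30):
    print(n, full_joint_params(n), naive_bayes_params(n))
```

For n = 10 the full joint already needs 1023 parameters, against 21 for naive Bayes.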

Naive Bayes classifier
For an NB classification problem we should learn:
1. \(P(X_1,...,X_n|C)=\prod_{i=1}^nP(X_i|C)\) for each class, which assumes that \(X_i\) and \(X_j\) are conditionally independent of each other given C.
2. \(P(C)\)
To classify: given \(\bf{x}\), choose c that maximizes:
\[P(c|\mathbf{x})~\propto~P(c)\prod_{i=1}^n P(x_i|c)\]
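The classification rule can be sketched as follows for Boolean attributes. This is only an illustration: the parameter values are invented, the function names are hypothetical, and the scores are computed in log space to avoid underflow when n is large (a common implementation choice, not something the lecture prescribes).

```python
import math


def nb_log_score(x, class_prior, cond_true):
    """log P(c) + sum_i log P(x_i | c); cond_true[i] is P(X_i = True | c)."""
    score = math.log(class_prior)
    for xi, p in zip(x, cond_true):
        score += math.log(p if xi else 1.0 - p)
    return score


def nb_classify(x, theta_c, theta_t, theta_f):
    """Return True if class T has the larger unnormalized posterior."""
    score_t = nb_log_score(x, theta_c, theta_t)
    score_f = nb_log_score(x, 1.0 - theta_c, theta_f)
    return score_t >= score_f


# Hypothetical parameters for n = 2 attributes (made up for illustration).
theta_c = 0.4           # P(C = T)
theta_t = [0.9, 0.7]    # P(X_i = T | C = T)
theta_f = [0.2, 0.3]    # P(X_i = T | C = F)

print(nb_classify([True, True], theta_c, theta_t, theta_f))
print(nb_classify([False, False], theta_c, theta_t, theta_f))
```

With these numbers, (T, T) scores 0.4 x 0.9 x 0.7 = 0.252 for class T against 0.6 x 0.2 x 0.3 = 0.036 for class F, so it is classified as T; (F, F) goes the other way.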
The NB classifier is a linear separator. Attributes act independently to produce the classification and do not interact, so, just like perceptrons, it cannot capture concepts like XOR.
There is an important point about linearly separable problems. Many real-world domains are not linearly separable, but even for those domains there may be a pretty good linearly separable hypothesis. We may be better off learning a linearly separable hypothesis than learning a richer hypothesis, because the strong inductive bias makes it \(\textbf{easier to learn}\).
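The XOR limitation can be checked directly. The sketch below fits a Boolean naive Bayes model to the four rows of the XOR truth table using plain frequency estimates and computes the posterior for each row. The helper names are hypothetical.

```python
from fractions import Fraction

# XOR truth table: (x1, x2) -> class
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def ml_conditional(i, c):
    """Frequency estimate of P(X_i = 1 | C = c) from the rows with class c."""
    rows = [x for x, label in data if label == c]
    return Fraction(sum(x[i] for x in rows), len(rows))


prior_1 = Fraction(sum(label for _, label in data), len(data))


def posterior_1(x):
    """P(C = 1 | x) under the naive Bayes factorization."""
    def joint(c):
        p = prior_1 if c == 1 else 1 - prior_1
        for i, xi in enumerate(x):
            q = ml_conditional(i, c)
            p *= q if xi == 1 else 1 - q
        return p
    return joint(1) / (joint(0) + joint(1))


xor_posteriors = [posterior_1(x) for x, _ in data]
print(xor_posteriors)
```

Every conditional probability comes out as 1/2, so the posterior is 1/2 for all four inputs: each attribute on its own carries no information about the class, and the model has no way to represent their interaction.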

In the following discussion we assume that the attributes and the class are Boolean. This is only to keep the notation simple; everything generalizes to the case where they have many possible values.

A simple problem:
We have observed a sequence of games played by a soccer team. Based on these, we want to estimate the probability that the team will win a future game. The problem is formulated as follows:
* Variable X has states {f, t} (t = win)
* Parameter \(\theta=P(X=t)\)
* Observations \(X^1=t, X^2=f, X^3=f\)
* These comprise the data \(\bf{D}\)
* Task: estimate \(\theta\)
* Use \(\theta\) to estimate \(P(X^4=t)\)
First we introduce the \(\textbf{maximum likelihood (ML)}\) approach. The likelihood is defined as follows:
* \(\textbf{Likelihood:}\) \(\mathbf{L}(\theta)=P(\mathbf{D}|\theta)=P(X^1,X^2,X^3|\theta)\)
* \(\textbf{ML Principle:}\) Choose \(\theta\) so as to maximize \(\mathbf{L}(\theta)\)
* Assuming the observations are independent given \(\theta\): \(L(\theta)=P(X^1|\theta)P(X^2|\theta)P(X^3|\theta)\)
* \(\textbf{Log likelihood:}\) \[\mathbf{LL}(\theta)=\log P(X^1|\theta)+\log P(X^2|\theta)+\log P(X^3|\theta)\]
* Equivalent ML principle: Choose \(\theta\) so as to maximize \(LL(\theta)\), since the logarithm is monotonically increasing
In this example:
\(P(X^i=t|\theta)=\theta\)
\(L(\theta)=P(X^1=t, X^2=f, X^3=f|\theta)=\theta(1-\theta)(1-\theta)\)
\(LL(\theta)=\log\theta+2\log(1-\theta)\)
Set the derivative to 0: \(\frac{1}{\theta}~-~\frac{2}{1-\theta} = 0\)
Solve to find: \(1-\theta=2\theta\), so \(\theta=1/3\)
In this example, \(\theta=1/3\) is exactly the fraction of observed games that the team won. This is no coincidence: the ML estimate for the probability of an event is always the fraction of times the event happened. In other words, the ML estimate is exactly the one most suggested by the data. More generally, given observations \(X^1,X^2,...,X^N\), let \(N_t\) be the number of instances with value t and \(N_f\) be the number of instances with value f. Then the maximum likelihood estimate for \(\theta\) is: \[\hat{\theta}=\frac{N_t}{N_t+N_f}=\frac{N_t}{N}\]
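As a quick sanity check, the closed-form estimate can be compared with a brute-force maximization of the log likelihood. This is a sketch only: the grid search and the function names are illustrative choices, not part of the lecture.

```python
import math
from fractions import Fraction


def ml_estimate(observations):
    """Closed form: fraction of observations that are True (N_t / N)."""
    return Fraction(sum(1 for x in observations if x), len(observations))


def log_likelihood(theta, observations):
    """LL(theta) for independent Boolean observations."""
    return sum(math.log(theta if x else 1.0 - theta) for x in observations)


games = [True, False, False]  # X^1 = t, X^2 = f, X^3 = f
theta_hat = ml_estimate(games)

# Brute-force check: evaluate LL on a grid strictly inside (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
best = max(grid, key=lambda th: log_likelihood(th, games))

print(theta_hat, best)
```

The grid maximizer lands on the grid point closest to 1/3, matching the closed form.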
\(\textbf{Problems with this approach}\)
\(\color{red}{Overfits}:\) it pays too much attention to noise in the data. For example, if the team has recently played mostly against the \(\color{red}{\textbf{Chinese national soccer team}}\), we will overestimate the team's performance.
\(\color{red}{Ignores\,prior\,experience}\): if experts have told you that the team is a minor one, you should not be confident even if it has beaten the Chinese national soccer team.
Events that do not occur in the data are deemed impossible, for example a match that ends 1-1.

\(\textbf{Incorporating a prior}\)
* \(\textbf{Prior}:\) \(P(\theta)\) before seeing any data
* \(\textbf{Posterior:}\) \(P(\theta|\mathbf{D})\)
* \(\textbf{Maximum a Posteriori principle (MAP):}\) Choose \(\theta\) to maximize \(P(\theta|\mathbf{D})\), where \(P(\theta|\mathbf{D})\) is proportional to \(P(\theta)L(\theta)\)
For learning the parameter of a Boolean random variable, an appropriate prior over \(\theta\) is the \(\textbf{beta}\) distribution. The beta distribution has 2 \(\textbf{parameters:}\) \(\alpha\) and \(\beta\), and these parameters control the shape of the prior. The relative sizes of \(\alpha\) and \(\beta\) control how likely true and false outcomes are relative to each other: if \(\alpha\) is large relative to \(\beta\), \(\theta\) is more likely to be large.
[Figure: beta distributions for different relative values of \(\alpha\) and \(\beta\)]

And the \(\color{red}{magnitude}\) of \(\alpha\) and \(\beta\) controls how peaked the beta distribution is: if \(\alpha\) and \(\beta\) are large, the beta will be sharply peaked.
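One way to see the effect of the magnitude numerically is through the variance of the beta distribution, \(\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\). This is a standard formula that is not stated in the lecture notes, and the function name below is made up for illustration.

```python
from fractions import Fraction


def beta_variance(alpha: int, beta: int) -> Fraction:
    """Variance of Beta(alpha, beta); smaller variance means a sharper peak."""
    total = alpha + beta
    return Fraction(alpha * beta, total * total * (total + 1))


# Same ratio alpha : beta (so the same mean of 1/2), growing magnitude.
for a, b in [(2, 2), (20, 20), (200, 200)]:
    print(a, b, beta_variance(a, b))
```

The variance falls from 1/20 for Beta(2, 2) to 1/164 for Beta(20, 20), so the distribution concentrates around its mean as the hyperparameters grow.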
[Figure: beta distributions for different magnitudes of \(\alpha\) and \(\beta\)]

Updating the prior
To get the hyperparameters of the posterior, we take the hyperparameters of the prior and add to them the actual observations that we get:
for example, if the prior is Beta(4, 7) and we observe 1 "+" and 4 "-", then the posterior is Beta(5, 11).
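The update rule is a simple addition of counts, which the following sketch spells out (the function name is hypothetical):

```python
def update_beta(alpha, beta, positives, negatives):
    """Posterior hyperparameters: prior hyperparameters plus observed counts."""
    return alpha + positives, beta + negatives


# Prior Beta(4, 7), then 1 positive and 4 negative observations.
print(update_beta(4, 7, 1, 4))  # (5, 11)
```

Because the posterior is again a beta distribution, the same function can be applied repeatedly as new observations arrive.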

Understanding the hyperparameters
Hyperparameter \(\alpha\) represents the number of previous positive observations that we have had, plus 1; similarly, \(\beta\) represents the number of previous "-" observations that we have had, plus 1. The hyperparameters in the prior represent imaginary observations from our prior experience. \(\textbf{The more we trust our prior experience, the larger the hyperparameters in the prior.}\)

Mode and mean of the beta distribution
The mode of \(Beta(\alpha,\beta)\) is \(\frac{\alpha-1}{\alpha+\beta-2}\) (for \(\alpha,\beta>1\)), e.g. the mode of Beta(2,3) is 1/3. The mean of \(Beta(\alpha,\beta)\) is \(\frac{\alpha}{\alpha+\beta}\), e.g. the mean of Beta(2,3) is 2/5.
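Both quantities are easy to compute exactly. The sketch below uses `fractions.Fraction` and assumes integer hyperparameters; the function names are made up for illustration, and the mode is only computed when \(\alpha,\beta>1\), where the formula \(\frac{\alpha-1}{\alpha+\beta-2}\) applies.

```python
from fractions import Fraction


def beta_mode(alpha: int, beta: int) -> Fraction:
    """Mode of Beta(alpha, beta), valid when both hyperparameters exceed 1."""
    if alpha <= 1 or beta <= 1:
        raise ValueError("the mode formula requires alpha > 1 and beta > 1")
    return Fraction(alpha - 1, alpha + beta - 2)


def beta_mean(alpha: int, beta: int) -> Fraction:
    """Mean of Beta(alpha, beta)."""
    return Fraction(alpha, alpha + beta)


print(beta_mode(2, 3))  # 1/3
print(beta_mean(2, 3))  # 2/5
```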

MAP estimate
The MAP estimate is the mode of the posterior. It is the fraction of the total number of observations, real and imaginary, that are true. E.g. for m positive instances out of N total, with a \(Beta(\alpha,\beta)\) prior: \[\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}\]

In the soccer example above, suppose the prior is Beta(5,3) and the observations are \(X^1=t, X^2=f, X^3=f\), so that m = 1 and N = 3. The posterior is then Beta(6,5), and the MAP estimate is: \[\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}=\frac{5}{9}\]
ML vs MAP
Maximum likelihood estimate: \(\hat{\theta}_{ML}=\frac{m}{N}\); MAP estimate: \(\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}\). Maximum likelihood is equivalent to MAP with a uniform prior, Beta(1,1), meaning that there are no imaginary observations.
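The two estimators can be compared side by side on the soccer data (m = 1 win out of N = 3 games). This is a minimal sketch with hypothetical function names, assuming integer hyperparameters.

```python
from fractions import Fraction


def ml_estimate(m: int, n_total: int) -> Fraction:
    """Maximum likelihood estimate m / N."""
    return Fraction(m, n_total)


def map_estimate(m: int, n_total: int, alpha: int, beta: int) -> Fraction:
    """MAP estimate under a Beta(alpha, beta) prior: the mode of the posterior."""
    return Fraction(m + alpha - 1, n_total + alpha + beta - 2)


print(ml_estimate(1, 3))          # 1/3
print(map_estimate(1, 3, 5, 3))   # 5/9, prior Beta(5, 3)
print(map_estimate(1, 3, 1, 1))   # 1/3, uniform prior Beta(1, 1)
```

With the uniform prior the MAP estimate reduces to the ML estimate, while the Beta(5, 3) prior pulls the estimate from 1/3 up to 5/9.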
Drawback of MAP
MAP does not fully consider the range of possible values for \(\theta\): it only picks the single most probable value, which may not be representative. So we introduce another approach, the \(\textbf{Bayesian approach}\). This approach does not commit to a point estimate of \(\theta\); instead, the posterior distribution over the values of \(\theta\) is maintained. E.g. given \(X^1,X^2,X^3\), we want to predict \(X^4\) using the entire distribution over \(\theta\).
\[\begin{align}
P(X^4=t|X^1,X^2,X^3)&=\int_{0}^1{P(X^4=t|\theta)P(\theta|X^1,X^2,X^3)}d\theta\notag\\
& = \int_{0}^1\theta{P(\theta|X^1,X^2,X^3)}d\theta\notag\\
& = E\left[ {\theta |X^1, X^2, X^3} \right] \notag\\
& = \text{mean of the posterior} \notag\\
\end{align}\]
For the beta distribution, \(E\left[\theta|\mathbf{D}\right]=\frac{m+\alpha}{N+\alpha+\beta}\).
\(E\left[\theta|\mathbf{D}\right]\) is the estimate of the probability that a new instance is true, obtained by integrating over the posterior distribution.
In the problem above, with prior Beta(5, 3), the MAP estimate is \(\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}=\frac{5}{9}\), while the Bayesian posterior mean is \(E\left[\theta|\mathbf{D}\right]=\frac{m+\alpha}{N+\alpha+\beta}=\frac{6}{11}\). We can see that the latter is closer to 1/2.
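The integral can also be checked numerically. The sketch below compares the closed-form posterior mean with a midpoint-rule approximation of \(\int_0^1\theta P(\theta|\mathbf{D})d\theta\), using the unnormalized beta density \(\theta^{a-1}(1-\theta)^{b-1}\) and dividing by its integral. The function names and the choice of integration rule are illustrative assumptions.

```python
from fractions import Fraction


def posterior_mean(m: int, n_total: int, alpha: int, beta: int) -> Fraction:
    """Closed form: (m + alpha) / (N + alpha + beta)."""
    return Fraction(m + alpha, n_total + alpha + beta)


def posterior_mean_numeric(m, n_total, alpha, beta, steps=100_000):
    """Midpoint-rule approximation of E[theta | D] under a Beta posterior."""
    a = alpha + m                  # posterior hyperparameters
    b = beta + (n_total - m)
    num = 0.0                      # accumulates theta * density
    den = 0.0                      # accumulates density (normalizing constant)
    for k in range(steps):
        theta = (k + 0.5) / steps
        density = theta ** (a - 1) * (1.0 - theta) ** (b - 1)
        num += theta * density
        den += density
    return num / den


exact = posterior_mean(1, 3, 5, 3)
approx = posterior_mean_numeric(1, 3, 5, 3)
print(exact, approx)
```

The closed form gives 6/11, and the numerical value agrees with it to well within the integration error.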

\(\textbf{Multi-valued class and attributes}\)
In the first part of this essay, we simplified the attributes and the class to Boolean. Now we generalize to multi-valued classes and attributes.
Given that |C| = k and each attribute takes m values (\(|X_i|=m\)), the parameters are as follows:
\[\begin{align}
&P(C=c)=\theta_c \notag\\
&P(X_i=x|C=c)=\theta_{i,x}^c \notag\\
\end{align}\]
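Listing one \(\theta_{i,x}^c\) for every attribute, value and class gives kmn conditional entries plus k-1 free class probabilities, i.e. kmn+k-1 parameters in total (only kn(m-1)+k-1 of them are independent, since each conditional distribution sums to 1). The instance counts used to estimate them are:
\[\begin{align}
&- N=\text{total number of instances} \notag\\
&- N_c=\text{number of instances with class } c \notag\\
&- N_{i,x}^c=\text{number of instances with class } c \text{ and } X_i=x \notag\\
\end{align}\]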
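As a sketch of how such counts turn into parameters, the code below applies the frequency (maximum likelihood) estimates \(\theta_c=N_c/N\) and \(\theta_{i,x}^c=N_{i,x}^c/N_c\), which follow from the ML result derived earlier. The toy data set, the attribute meanings and the function name are all invented for illustration.

```python
from collections import Counter
from fractions import Fraction


def fit_counts(rows):
    """rows: list of (attribute_tuple, class_label). Returns ML estimates."""
    n_total = len(rows)
    n_c = Counter(label for _, label in rows)
    n_ixc = Counter(
        (i, x, label) for attrs, label in rows for i, x in enumerate(attrs)
    )
    theta_c = {c: Fraction(cnt, n_total) for c, cnt in n_c.items()}
    theta_ixc = {
        (i, x, c): Fraction(cnt, n_c[c]) for (i, x, c), cnt in n_ixc.items()
    }
    return theta_c, theta_ixc


# Made-up toy data: (colour, size) -> class
rows = [
    (("red", "small"), "a"),
    (("red", "large"), "a"),
    (("blue", "small"), "b"),
    (("yellow", "small"), "a"),
]
theta_c, theta_ixc = fit_counts(rows)
print(theta_c["a"])                # 3/4
print(theta_ixc[(0, "red", "a")])  # 2/3
```

Combinations that never occur in the data, such as colour "blue" with class "a", get no entry at all. Under pure ML estimation that corresponds to probability zero, which is the "unseen events are deemed impossible" problem noted earlier and one motivation for adding a prior.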
From the analysis above, we can draw a brief conclusion about Naive Bayes:
The advantages are that NB does not suffer from the curse of dimensionality, it is very easy to implement, and it learns fast. In each dimension, it makes no assumption about the form of the distribution. On the other hand, Naive Bayes makes a strong independence assumption and may perform poorly if this does not hold (Chinese text, for example), and maximum likelihood estimation can overfit the data.
\(\textbf{Reference}\)
Note: Most of the content of this essay comes from a CMU lecture; I have lost the link to the website.
CMU statistical learning
