Naive Bayes Theorem and Application - Theorem

Naive Bayes model:
1. Naive Bayes model
2. Model: discrete attributes with a finite number of values
3. Parameter density estimation
4. Naive Bayes classification algorithm
5. AutoClass clustering algorithm

\(\textbf{1. Naive Bayes model}\)
In this model we want to estimate \(P(X_1,...,X_n)\) under the assumption that all attributes are independent of each other. This is the same assumption as in k-means, except that this is a discrete model.\[P(X_1,...,X_n)=\prod_{i=1}^n P(X_i)\]Each \(P(X_i)\) can be any distribution you like, e.g. \(\{0.5 : red, 0.2 : blue, 0.3 : yellow\}\).
To simplify the problem, we assume all attributes are Boolean. With no independence assumptions, the model has \(2^n\) states of \(X_1,...,X_n\) and \(2^n-1\) independent parameters. With the independence assumption, the number of parameters drops to \(n\), one \(P(X_i=T)\) per attribute; adding a Boolean class variable, as in the classification setting, gives \(2n+1\) parameters.
For example, in a classification problem, let \(\theta_C\) be the probability that the class is true. Then we have \(2n+1\) parameters:
\[\begin{align}
&P(C=T)=\theta_C \notag\\
&P(C=F)=1 - \theta_C \notag\\
&P(X_i=T|C=T)=\theta_{i}^T \notag\\
&P(X_i=F|C=T)=1 - \theta_{i}^T \notag\\
&P(X_i=T|C=F)=\theta_{i}^F \notag\\
&P(X_i=F|C=F)=1 - \theta_{i}^F \notag\\
& \mathbf{\theta} = \langle {\theta_C, \theta_{1}^T,...,\theta_{n}^T,\theta_{1}^F,...,\theta_{n}^F} \rangle \notag\\
\end{align}\]
As you can see above, this is an enormous saving in the number of parameters. Representing \(P(X_1,.., X_n)\) explicitly suffers from the curse of dimensionality, while \(\prod_{i=1}^n P(X_i)\) does not. The saving results from very strong independence assumptions. In fact, the Naive Bayes model performs very well when the assumptions hold, but can perform very badly when the variables are dependent. For example, an NB model performs well on English text but badly on Chinese, because context is more important for understanding Chinese correctly. So we should be cautious about applying these strong assumptions to models whose variables are substantially related.
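To make the saving concrete, here is a minimal sketch that compares the two parameter counts for Boolean attributes. The function names are illustrative only and do not come from the lecture.

```python
def full_joint_params(n: int) -> int:
    """Independent parameters of an unrestricted joint over n Boolean attributes."""
    return 2 ** n - 1


def naive_bayes_params(n: int) -> int:
    """Parameters of a Boolean naive Bayes classifier:
    theta_C plus theta_i^T and theta_i^F for each of the n attributes."""
    return 2 * n + 1


# The first count grows exponentially in n, the second only linearly.
for n in (5, 10, 20, 30):
    print(n, full_joint_params(n), naive_bayes_params(n))
```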

Naive Bayes classifier
For an NB classification problem we need to learn:
1. \(P(X_1,...,X_n|C)=\prod_{i=1}^nP(X_i|C)\) for each class, assuming that \(X_i\) and \(X_j\) are conditionally independent of each other given C.
2. P(C)
To classify: given \(\mathbf{x}\), choose the class c that maximizes:
\[P(c|\mathbf{x})~\propto~P(c)\prod_{i=1}^n P(x_i|c)\]
The NB classifier is a linear separator. Attributes act independently to produce the classification and do not interact, so, just like perceptrons, they cannot capture concepts like XOR.
There is an important point about linearly separable problems. Many real-world domains are not linearly separable, but even in those domains there may be a pretty good linearly separable hypothesis. We may be better off learning a linearly separable hypothesis than a richer hypothesis. This comes from a strong inductive bias: it is \(\textbf{easier to learn}\).
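The decision rule of this section, choosing the class that maximizes \(P(c)\prod_i P(x_i|c)\), can be sketched for Boolean attributes as follows. This is only an illustration: the parameter values are made up, the function names are hypothetical, and the scores are computed in log space (an implementation choice, not something the lecture prescribes), which requires every probability to be strictly between 0 and 1.

```python
import math


def nb_log_score(x, prior_c, theta_c):
    """Return log P(c) + sum_i log P(x_i | c) for Boolean attributes.

    x       -- list of bools, the attribute values
    prior_c -- P(C = c)
    theta_c -- list with theta_c[i] = P(X_i = True | C = c)
    """
    score = math.log(prior_c)
    for xi, t in zip(x, theta_c):
        score += math.log(t if xi else 1.0 - t)
    return score


def nb_classify(x, theta_C, theta_T, theta_F):
    """Return True if class T scores at least as high as class F."""
    score_t = nb_log_score(x, theta_C, theta_T)
    score_f = nb_log_score(x, 1.0 - theta_C, theta_F)
    return score_t >= score_f


# Hypothetical parameters for n = 2 attributes.
theta_C = 0.5
theta_T = [0.9, 0.8]   # P(X_i = T | C = T)
theta_F = [0.2, 0.3]   # P(X_i = T | C = F)

print(nb_classify([True, True], theta_C, theta_T, theta_F))    # True
print(nb_classify([False, False], theta_C, theta_T, theta_F))  # False
```

Because each attribute contributes its own additive term to the log score, the boundary between the two classes is linear in the attribute values.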

In the following discussion we assume that the attributes and the class are Boolean, only to keep the notation simple. Everything generalizes to the case where they have many possible values.

A simple problem:
We have observed a sequence of games played by a soccer team. Based on this, we want to estimate the probability that the team will win a future game. The problem is formulated as follows:
* Variable X has states {f, t} (t = win)
* Parameter \(\theta=P(X=t)\)
* Observations \(X^1=t, X^2=f, X^3=f\)
* These comprise the data \(\bf{D}\)
* Task: estimate \(\theta\)
* Use \(\theta\) to estimate \(P(X^4=t)\)
First we introduce the \(\textbf{maximum likelihood (ML)}\) approach. The likelihood is defined as follows:
* \(\textbf{Likelihood:}\) \(\mathbf{L}(\theta)=P(\mathbf{D}|\theta)=P(X^1,X^2,X^3|\theta)\)
* \(\textbf{ML Principle:}\) Choose \(\theta\) so as to maximize \(\mathbf{L}(\theta)\)
* Assuming the observations are independent given \(\theta\): \(L(\theta)=P(X^1|\theta)P(X^2|\theta)P(X^3|\theta)\)
* \(\textbf{Log likelihood:}\) \[\mathbf{LL}(\theta)=\log P(X^1|\theta)+\log P(X^2|\theta)+\log P(X^3|\theta)\]
* Equivalent ML principle: Choose \(\theta\) so as to maximize \(LL(\theta)\)
In this example:
\(P(X^i=t|\theta)=\theta\)
\(L(\theta)=P(X^1=t, X^2=f, X^3=f|\theta)=\theta(1-\theta)(1-\theta)\)
\(LL(\theta)=\log\theta+2\log(1-\theta)\)
Set the derivative to 0: \(\frac{1}{\theta}~-~\frac{2}{1-\theta} = 0\)
Solve to find: \(\theta=1/3\)
In this example, \(\theta=1/3\) is exactly the fraction of observed games that the team won. This is no coincidence: the ML estimate for the probability of an event is always the fraction of times the event happened. In other words, the ML estimate is exactly the one most suggested by the data. More generally, given observations \(X^1,X^2,...,X^N\), let \(N_t\) be the number of instances with value t and \(N_f\) the number of instances with value f. Then the maximum likelihood estimate for \(\theta\) is: \[\hat{\theta}=\frac{N_t}{N_t+N_f}=\frac{N_t}{N}\]
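As a quick numerical check of the closed form \(\hat{\theta}=N_t/N\), the sketch below maximizes the log likelihood over a grid of candidate values for the three observed games and compares the result with the counting estimate. The grid resolution and the function names are arbitrary choices made for illustration.

```python
import math


def log_likelihood(theta, n_t, n_f):
    """LL(theta) = N_t * log(theta) + N_f * log(1 - theta), for 0 < theta < 1."""
    return n_t * math.log(theta) + n_f * math.log(1.0 - theta)


def ml_estimate(n_t, n_f):
    """Closed-form maximum likelihood estimate N_t / (N_t + N_f)."""
    return n_t / (n_t + n_f)


observations = [True, False, False]      # X^1 = t, X^2 = f, X^3 = f
n_t = sum(observations)                  # number of wins
n_f = len(observations) - n_t            # number of non-wins

# Brute-force search over theta in {0.001, 0.002, ..., 0.999}.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda th: log_likelihood(th, n_t, n_f))

print(ml_estimate(n_t, n_f), best)
```

The grid maximizer agrees with 1/3 up to the grid spacing.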
\(\textbf{Problem with this approach}\)
\(\color{red}{Overfits}:\) it pays too much attention to noise in the data. For example, if the team has recently been playing mostly against the \(\color{red}{\textbf{Chinese national soccer team}}\), we will overestimate the team's performance.
\(\color{red}{Ignores\,prior\,experience}\): if experts tell you that the team is a weak one, you should not be too confident even if it has beaten the Chinese national soccer team.
Events that do not occur in the data are deemed impossible, for example a match ending in a 1-1 draw.

\(\textbf{Incorporating a prior}\)
* \(\textbf{Prior}:\) \(P(\theta)\) before seeing any data
* \(\textbf{Posterior:}\) \(P(\theta|\mathbf{D})\)
* \(\textbf{Maximum a posteriori principle (MAP):}\) Choose \(\theta\) to maximize \(P(\theta|\mathbf{D})\), which is proportional to \(P(\theta)L(\theta)\)
For learning the parameter of a Boolean random variable, an appropriate prior over \(\theta\) is the \(\textbf{beta}\) distribution. The beta distribution has two \(\textbf{parameters}\), \(\alpha\) and \(\beta\), which control the shape of the prior. Their ratio controls how relatively likely true and false outcomes are: if \(\alpha\) is large relative to \(\beta\), \(\theta\) is more likely to be large.
[Figure not recovered: beta densities for different ratios of \(\alpha\) to \(\beta\)]

The \(\color{red}{magnitude}\) of \(\alpha\) and \(\beta\) controls how peaked the beta distribution is: if \(\alpha\) and \(\beta\) are both large, the beta distribution is sharply peaked.
[Figure not recovered: beta densities for different magnitudes of \(\alpha\) and \(\beta\)]
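The effect of the ratio and the magnitude of the hyperparameters can be checked numerically. The sketch below evaluates the beta density at a few points using only the standard library; the chosen hyperparameter pairs are arbitrary examples, not values from the lecture.

```python
import math


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, for 0 < theta < 1.

    The normalizing constant Gamma(a + b) / (Gamma(a) * Gamma(b)) is
    computed in log space with math.lgamma to avoid overflow.
    """
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    log_density = log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta)
    return math.exp(log_density)


# (2, 2) and (20, 20) share the same ratio but differ in magnitude;
# (8, 2) has alpha large relative to beta.
for a, b in [(2, 2), (20, 20), (8, 2)]:
    row = [round(beta_pdf(t, a, b), 3) for t in (0.1, 0.3, 0.5, 0.7, 0.9)]
    print((a, b), row)
```

Beta(20, 20) puts far more density at 0.5 than Beta(2, 2) does, and Beta(8, 2) concentrates its density near large values of \(\theta\).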

Updating the prior
To get the hyperparameters of the posterior, we take the hyperparameters of the prior and add to them the actual observations that we get.
For example, if the prior is Beta(4, 7) and we observe 1 "+" and 4 "-", then the posterior is Beta(5, 11).
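The update rule is simple enough to write as a one-line function. The name `update_beta` is a hypothetical helper introduced here for illustration.

```python
def update_beta(alpha, beta, n_pos, n_neg):
    """Posterior hyperparameters: add the observed counts to the prior ones."""
    return alpha + n_pos, beta + n_neg


# Prior Beta(4, 7), then 1 positive and 4 negative observations.
print(update_beta(4, 7, 1, 4))  # (5, 11)
```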

Understanding the hyperparameters
Hyperparameter \(\alpha\) represents the number of previous "+" observations that we have had, plus 1; similarly, \(\beta\) represents the number of previous "-" observations that we have had, plus 1. The hyperparameters in the prior thus represent imaginary observations in our prior experience. \(\textbf{The more we trust our prior experience, the larger the hyperparameters in the prior.}\)

Mode and mean of the beta distribution
The mode of \(Beta(\alpha,\beta)\) is \(\frac{\alpha-1}{\alpha+\beta-2}\) (for \(\alpha,\beta>1\)), e.g. the mode of Beta(2,3) is 1/3. The mean of \(Beta(\alpha,\beta)\) is \(\frac{\alpha}{\alpha+\beta}\), e.g. the mean of Beta(2,3) is 2/5.
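The two worked values can be reproduced with exact rational arithmetic. This sketch uses the mode formula \((\alpha-1)/(\alpha+\beta-2)\), which is valid when both hyperparameters exceed 1, and the mean formula \(\alpha/(\alpha+\beta)\); the function names are illustrative.

```python
from fractions import Fraction


def beta_mode(a, b):
    """Mode of Beta(a, b); this formula requires a > 1 and b > 1."""
    return Fraction(a - 1, a + b - 2)


def beta_mean(a, b):
    """Mean of Beta(a, b)."""
    return Fraction(a, a + b)


print(beta_mode(2, 3))  # 1/3
print(beta_mean(2, 3))  # 2/5
```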

MAP estimate
The MAP estimate is the mode of the posterior. It is the fraction of all observations, real and imaginary, that are true. E.g. for m positive instances out of N total: \[\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}\]

In the soccer example above, suppose the prior is Beta(5,3) and the observations are \(X^1=t, X^2=f,X^3=f\), so \(m=1\) and \(N=3\). The posterior is Beta(6,5), and the MAP estimate is: \[\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}=\frac{1+5-1}{3+5+3-2}=\frac{5}{9}\]
ML vs MAP
Maximum likelihood estimate: \(\hat{\theta}_{ML}=\frac{m}{N}\); MAP estimate: \(\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}\). Maximum likelihood is equivalent to MAP with a uniform prior, Beta(1,1), meaning that there are no imaginary observations.
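The two estimates can be compared side by side on the soccer data (m = 1 win out of N = 3 games). The sketch below uses exact fractions and also checks that a Beta(1,1) prior reproduces the ML estimate; the helper names are hypothetical.

```python
from fractions import Fraction


def ml_estimate(m, n):
    """Maximum likelihood estimate m / N."""
    return Fraction(m, n)


def map_estimate(m, n, alpha, beta):
    """MAP estimate (m + alpha - 1) / (N + alpha + beta - 2) under a Beta(alpha, beta) prior."""
    return Fraction(m + alpha - 1, n + alpha + beta - 2)


m, n = 1, 3
print(ml_estimate(m, n))          # 1/3
print(map_estimate(m, n, 5, 3))   # 5/9
print(map_estimate(m, n, 1, 1))   # 1/3
```

With only three real observations, the Beta(5,3) prior pulls the estimate well away from 1/3; the more data arrives, the smaller the influence of the imaginary observations becomes.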
Drawback of MAP
MAP does not fully consider the range of possible values of \(\theta\); it only chooses the single most probable value, which may not be representative. So we introduce another approach, the \(\textbf{Bayesian approach}\), which does not commit to a point estimate of \(\theta\). Instead, the posterior distribution over the values of \(\theta\) is maintained. E.g. given \(X^1,X^2,X^3\) we want to predict \(X^4\) using the entire distribution over \(\theta\).
\[\begin{align}
P(X^4=t|X^1,X^2,X^3)&=\int_{0}^1{P(X^4=t|\theta)P(\theta|X^1,X^2,X^3)}d\theta\notag\\
& = \int_{0}^1\theta{P(\theta|X^1,X^2,X^3)}d\theta\notag\\
& = E\left[ {\theta |X^1, X^2, X^3} \right] \notag\\
& = \text{mean of the posterior} \notag\\
\end{align}\]
For a \(Beta(\alpha,\beta)\) prior and m positive instances out of N, the posterior is \(Beta(m+\alpha, N-m+\beta)\), so \(E\left[\theta|\mathbf{D}\right]=\frac{m+\alpha}{N+\alpha+\beta}\).
\(E\left[\theta|\mathbf{D}\right]\) is the estimate of the probability that a new instance is true, obtained by integrating over the posterior distribution.
In the problem above, with prior Beta(5, 3), the MAP estimate is \(\hat{\theta}_{MAP}=\frac{m+\alpha-1}{N+\alpha+\beta-2}=\frac{5}{9}\), while the Bayesian expectation is \(E\left[\theta|\mathbf{D}\right]=\frac{m+\alpha}{N+\alpha+\beta}=\frac{6}{11}\). The latter is closer to 1/2.
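To see that the predictive probability really is the posterior mean, the sketch below approximates the integral of \(\theta\) times the Beta(6, 5) posterior density with a midpoint rule and compares it with the closed form \((m+\alpha)/(N+\alpha+\beta)\). The midpoint rule and the number of steps are choices made here for illustration only.

```python
import math
from fractions import Fraction


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, for 0 < theta < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta))


def posterior_mean(m, n, alpha, beta):
    """Closed-form Bayesian estimate (m + alpha) / (N + alpha + beta)."""
    return Fraction(m + alpha, n + alpha + beta)


def predictive_by_integration(a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of theta * Beta(theta; a, b) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h          # midpoints never touch 0 or 1
        total += theta * beta_pdf(theta, a, b)
    return total * h


closed = posterior_mean(1, 3, 5, 3)           # prior Beta(5, 3), 1 win in 3 games
numeric = predictive_by_integration(6, 5)     # posterior Beta(6, 5)
print(closed, float(closed), numeric)
```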

\(\textbf{Multi-valued class and attributes}\)
In the first part of this essay we restricted the attributes and the class to Boolean values; now we generalize to multi-valued classes and attributes.
Suppose \(|C| = k\) and each \(|X_i|=m\). The parameters are:
\[\begin{align}
&P(C=c)=\theta_c \notag\\
&P(X_i=x|C=c)=\theta_{i,x}^c \notag\\
\end{align}\]
So the number of parameters grows to \(kmn+k-1\) (\(kmn\) conditional probabilities \(\theta_{i,x}^c\) plus \(k-1\) independent class probabilities). The instance counts used to estimate them are:
\[\begin{align}
&- N=\text{total number of instances} \notag\\
&- N_c=\text{total number of instances with class } c \notag\\
&- N_{i,x}^c=\text{total number of instances with class } c \text{ and } X_i=x \notag\\
&- kmn+k-1\,\text{parameters in total.} \notag\\
\end{align}\]
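By analogy with the Boolean case, the maximum likelihood estimates from these counts are \(\hat{\theta}_c=N_c/N\) and \(\hat{\theta}_{i,x}^c=N_{i,x}^c/N_c\). The sketch below computes them from a small table; the toy data, the attribute values, and the function name `fit_naive_bayes` are all made up for illustration.

```python
from collections import Counter


def fit_naive_bayes(rows, labels):
    """ML estimates theta_c = N_c / N and theta_{i,x}^c = N_{i,x}^c / N_c.

    rows   -- list of tuples, one attribute value per position i
    labels -- list of class labels, one per row
    """
    n_total = len(labels)                       # N
    class_counts = Counter(labels)              # N_c
    joint_counts = Counter()                    # N_{i,x}^c, keyed by (c, i, x)
    for row, c in zip(rows, labels):
        for i, x in enumerate(row):
            joint_counts[(c, i, x)] += 1

    theta_c = {c: cnt / n_total for c, cnt in class_counts.items()}
    theta_ixc = {key: cnt / class_counts[key[0]] for key, cnt in joint_counts.items()}
    return theta_c, theta_ixc


# Hypothetical data: attribute 0 is a color, attribute 1 is a size.
rows = [("red", "small"), ("red", "large"), ("blue", "small"), ("yellow", "large")]
labels = ["a", "a", "b", "b"]

theta_c, theta_ixc = fit_naive_bayes(rows, labels)
print(theta_c)
print(theta_ixc[("a", 0, "red")], theta_ixc[("b", 0, "blue")])
```

Combinations that never occur in the data, such as class "a" with color "blue", get no entry at all, which is the "unseen events are deemed impossible" problem of pure maximum likelihood discussed earlier.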
From the analysis above, we can make a brief conclusion about Naive Bayes:
The advantages are that NB does not suffer from the curse of dimensionality, is very easy to implement, and learns fast. In each dimension it makes no assumption about the form of the distribution. On the other hand, Naive Bayes makes a strong independence assumption and may perform poorly if it does not hold (Chinese text, for example), and the maximum likelihood estimates can overfit the data.
\(\textbf{Reference}\)
Note: Most of the content of this essay is from a CMU lecture; I have lost the link to the website.
CMU statistical learning
