2.1. Binary Variables

1. Bernoulli distribution, p(x = 1|µ) = µ
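In full, the Bernoulli distribution over x ∈ {0, 1} and its moments are

  Bern(x|µ) = µ^x (1 − µ)^(1−x),   E[x] = µ,   var[x] = µ(1 − µ)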

2. Binomial distribution, the distribution of the number m of observations of x = 1 in N independent Bernoulli trials
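Its standard form and moments are

  Bin(m|N, µ) = (N!/(m!(N − m)!)) µ^m (1 − µ)^(N−m),   E[m] = Nµ,   var[m] = Nµ(1 − µ)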


3. Beta distribution (conjugate prior of the Bernoulli distribution)
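Its density and moments are

  Beta(µ|a, b) = (Γ(a + b)/(Γ(a)Γ(b))) µ^(a−1) (1 − µ)^(b−1),   E[µ] = a/(a + b),   var[µ] = ab/((a + b)^2 (a + b + 1))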

The parameters a and b are often called hyperparameters because they control the distribution of the parameter µ.

Given m observations of x = 1 and l observations of x = 0, the posterior over µ is proportional to µ^(m+a−1) (1 − µ)^(l+b−1), i.e. it is again a beta distribution, Beta(µ|a + m, b + l).

the variance goes to zero for a → ∞ or b → ∞. It is a general property of Bayesian learning that, as we observe more and more data, the uncertainty represented by the posterior distribution will steadily decrease.

2.2. Multinomial Variables

1. Consider discrete variables that can take on one of K possible mutually exclusive states. Such a variable can be represented by a K-dimensional vector x in which one of the elements xk equals 1 and all remaining elements equal 0, for example

  x = (0, 0, 1, 0, 0, 0)T

Consider a data set D of N independent observations x1, . . . , xN. The corresponding likelihood function takes the form:
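  p(D|µ) = ∏_n ∏_k µk^(xnk) = ∏_k µk^(mk),   where mk = Σ_n xnk is the number of observations with xk = 1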

To find the maximum likelihood solution for µ we maximize the log of the likelihood, subject to the constraint Σ_k µk = 1, by introducing a Lagrange multiplier λ, i.e. we maximize

  Σ_k mk ln µk + λ(Σ_k µk − 1)

Setting the derivative with respect to µk to zero, we obtain µk = −mk/λ. Enforcing the constraint then gives λ = −N, and the solution takes the form µk^ML = mk/N,

which is the fraction of the N observations for which xk = 1.

Consider the joint distribution of the quantities m1, . . . , mK, conditioned on the parameters µ and on the total number N of observations; this is the multinomial distribution:
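  Mult(m1, . . . , mK|µ, N) = (N!/(m1! m2! · · · mK!)) ∏_k µk^(mk),   with Σ_k mk = N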

2. The Dirichlet distribution (conjugate prior of the multinomial distribution)
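Its density over the simplex Σ_k µk = 1 is

  Dir(µ|α) = (Γ(α0)/(Γ(α1) · · · Γ(αK))) ∏_k µk^(αk−1),   α0 = Σ_k αk

Multiplying the multinomial likelihood by this prior gives a posterior that is again a Dirichlet,

  p(µ|D, α) = Dir(µ|α + m)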

where m = (m1, . . . , mK)T denotes the vector of counts.

we can interpret the parameters αk of the Dirichlet prior as an effective number of observations of xk = 1.

Note that two-state quantities can either be represented as binary variables and modelled using the binomial distribution, or as 1-of-2 variables and modelled using the multinomial distribution with K = 2.

2.3. The Gaussian Distribution

For a D-dimensional vector x:
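  N(x|µ, Σ) = (1/((2π)^(D/2) |Σ|^(1/2))) exp{ −(1/2) (x − µ)^T Σ^(−1) (x − µ) }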

where µ is a D-dimensional mean vector, Σ is a D × D covariance matrix, and |Σ| denotes its determinant.

eigenvector equation for the covariance matrix:

  Σui = λiui

Σ can be expressed as an expansion in terms of its eigenvectors (Exercise 2.19) in the form:
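  Σ = Σ_i λi ui ui^T,   and equivalently   Σ^(−1) = Σ_i (1/λi) ui ui^T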

Define the new coordinates yj = uj^T (x − µ).

in the yj coordinate system, the Gaussian distribution takes the form
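  p(y) = ∏_j (2πλj)^(−1/2) exp{ −yj^2/(2λj) }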

This confirms that the multivariate Gaussian is indeed normalized.

The expectation of the Gaussian distribution is E[x] = µ, and its covariance is cov[x] = Σ.

2.3.1 Conditional Gaussian distributions

An important property of the multivariate Gaussian distribution is that if two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian. Similarly, the marginal distribution of either set is also Gaussian.

The mean and covariance of the conditional distribution p(xa|xb) are
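  µ(a|b) = µa + Σab Σbb^(−1) (xb − µb)

  Σ(a|b) = Σaa − Σab Σbb^(−1) Σba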

Note that the conditional mean is a linear function of xb, while the conditional covariance depends on neither xa nor xb.

2.3.2 Marginal Gaussian distributions

The marginal distribution p(xa) is also Gaussian, with mean and covariance
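  E[xa] = µa,   cov[xa] = Σaa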

2.3.3 Bayes’ theorem for Gaussian variables

Here we shall suppose that we are given a Gaussian marginal distribution p(x) and a Gaussian conditional distribution p(y|x) in which p(y|x) has a mean that is a linear function of x, and a covariance which is independent of x. We wish to find the marginal distribution p(y) and the conditional distribution p(x|y).
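Writing the marginal as p(x) = N(x|µ, Λ^(−1)) and the conditional as p(y|x) = N(y|Ax + b, L^(−1)), the standard results are

  p(y) = N(y|Aµ + b, L^(−1) + A Λ^(−1) A^T)

  p(x|y) = N(x|Σ{A^T L (y − b) + Λµ}, Σ),   where Σ = (Λ + A^T L A)^(−1)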

2.3.4 Maximum likelihood for the Gaussian

The log likelihood function is given by
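  ln p(X|µ, Σ) = −(ND/2) ln(2π) − (N/2) ln|Σ| − (1/2) Σ_n (xn − µ)^T Σ^(−1) (xn − µ)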

We see that the likelihood function depends on the data set only through the two quantities Σ_n xn and Σ_n xn xn^T, the sufficient statistics of the Gaussian.

The maximum likelihood estimates of the mean and covariance matrix are given by
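  µ_ML = (1/N) Σ_n xn

  Σ_ML = (1/N) Σ_n (xn − µ_ML)(xn − µ_ML)^T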

Evaluating the expectations of the maximum likelihood solutions under the true distribution, we obtain the following results:
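  E[µ_ML] = µ,   E[Σ_ML] = ((N − 1)/N) Σ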

We see that the expectation of the maximum likelihood estimate for the mean is equal to the true mean. However, the maximum likelihood estimate for the covariance has an expectation that is less than the true value, and hence it is biased. We can correct this bias by defining a different estimator Σ̃ given by
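  Σ̃ = (1/(N − 1)) Σ_n (xn − µ_ML)(xn − µ_ML)^T,   which satisfies E[Σ̃] = Σ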

2.3.5 Sequential estimation

1. Sequential methods allow data points to be processed one at a time and then discarded. They are important for on-line applications, and also when data sets are so large that batch processing of all points at once is infeasible.
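Separating out the contribution of the final data point xN from the maximum likelihood mean gives the sequential update

  µ_ML^(N) = µ_ML^(N−1) + (1/N)(xN − µ_ML^(N−1))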

This result has a nice interpretation, as follows. After observing N − 1 data points we have estimated µ by µ_ML^(N−1). We now observe data point xN, and we obtain our revised estimate µ_ML^(N) by moving the old estimate a small amount, proportional to 1/N, in the direction of the 'error signal' (xN − µ_ML^(N−1)). Note that, as N increases, the contribution from successive data points gets smaller.
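A minimal NumPy sketch of this update rule (illustrative only; the function name and the synthetic data are not from the book):

```python
import numpy as np

def sequential_mean(xs):
    """Running maximum likelihood estimate of the mean, one point at a time."""
    mu = np.zeros_like(xs[0], dtype=float)
    for n, x in enumerate(xs, start=1):
        mu = mu + (x - mu) / n   # mu^(N) = mu^(N-1) + (x_N - mu^(N-1)) / N
    return mu

rng = np.random.default_rng(0)
xs = rng.normal(loc=3.0, scale=1.0, size=(1000, 2))
print(sequential_mean(xs))   # close to [3, 3]
print(xs.mean(axis=0))       # matches the batch estimate
```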

2. Robbins-Monro algorithm, a general procedure for finding the root θ⋆ of a regression function defined by a pair of random variables θ and z with joint distribution p(z, θ).

The conditional expectation of z given θ defines a deterministic function f(θ) given by f(θ) = E[z|θ]; the root θ⋆ is the value at which f(θ⋆) = 0.

We shall assume that the conditional variance of z is finite, so that E[(z − f)^2 | θ] < ∞.

The Robbins-Monro procedure then defines a sequence of successive estimates of the root θ given by
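  θ^(N) = θ^(N−1) − a_(N−1) z(θ^(N−1))

(taking the convention that f(θ) > 0 for θ > θ⋆ and f(θ) < 0 for θ < θ⋆, so that each step moves towards the root on average)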

where z(θ(N)) is an observed value of z when θ takes the value θ(N). The coefficients {aN} represent a sequence of positive numbers that satisfy the conditions
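  lim_(N→∞) a_N = 0,   Σ_(N=1..∞) a_N = ∞,   Σ_(N=1..∞) a_N^2 < ∞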

The first condition ensures that the successive corrections decrease in magnitude so that the process can converge to a limiting value. The second condition is required to ensure that the algorithm does not converge short of the root, and the third condition is needed to ensure that the accumulated noise has finite variance and hence does not spoil convergence.

2.3.6 Bayesian inference for the Gaussian

The conjugate prior for the precision λ of a Gaussian with known mean is the gamma distribution:
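  Gam(λ|a, b) = (1/Γ(a)) b^a λ^(a−1) exp(−bλ)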

The mean and variance of the gamma distribution are given by
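  E[λ] = a/b,   var[λ] = a/b^2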

2.3.7 Student’s t-distribution

If we have a univariate Gaussian N(x|µ, τ −1) together with a Gamma prior Gam(τ|a, b) and we integrate out the precision, we obtain the marginal distribution of x in the form
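  p(x|µ, a, b) = ∫_0^∞ N(x|µ, τ^(−1)) Gam(τ|a, b) dτ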

By convention we define new parameters ν = 2a and λ = a/b, in terms of which this marginal becomes
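  St(x|µ, λ, ν) = (Γ((ν + 1)/2)/Γ(ν/2)) (λ/(πν))^(1/2) [1 + λ(x − µ)^2/ν]^(−(ν+1)/2)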

which is known as Student’s t-distribution. The parameter λ is sometimes called the precision of the t-distribution, even though it is not in general equal to the inverse of the variance. The parameter ν is called the degrees of freedom.

For the particular case of ν = 1, the t-distribution reduces to the Cauchy distribution, while in the limit ν → ∞ the t-distribution St(x|µ, λ, ν) becomes a Gaussian N(x|µ, λ−1) with mean µ and precision λ.

The result is a distribution that in general has longer 'tails' than a Gaussian. This gives the t-distribution an important property called robustness, which means that it is much less sensitive than the Gaussian to the presence of a few data points which are outliers.

2.3.8 Periodic variables

1 Periodic quantities can conveniently be represented using an angular (polar) coordinate 0 ≤ θ < 2π.

We might be tempted to treat periodic variables by choosing some direction as the origin and then applying a conventional distribution such as the Gaussian. Such an approach, however, would give results that were strongly dependent on the arbitrary choice of origin.

To find an invariant measure of the mean, we note that the observations can be viewed as points on the unit circle and can therefore be described instead by two-dimensional unit vectors x1, . . . , xN where ‖xn‖ = 1 for n = 1, . . . , N.

The Cartesian coordinates of the observations are given by xn = (cos θn, sin θn), and we can write the Cartesian coordinates of the sample mean in the form 
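  x̄ = (1/N) Σ_n xn

and the corresponding mean angle is

  θ̄ = tan^(−1){ (Σ_n sin θn)/(Σ_n cos θn) }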

2 von Mises distribution

we will consider distributions p(θ) that have period 2π. Any probability density p(θ) defined over θ must not only be nonnegative and integrate to one, but it must also be periodic. Thus p(θ) must satisfy the three conditions
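  p(θ) ≥ 0,   ∫_0^(2π) p(θ) dθ = 1,   p(θ + 2π) = p(θ)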

it follows that p(θ + M2π) = p(θ) for any integer M.

Consider a Gaussian distribution over two variables x = (x1, x2) having mean µ = (µ1, µ2) and a covariance matrix Σ = σ^2 I, where I is the 2 × 2 identity matrix, so that
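  p(x1, x2) = (1/(2πσ^2)) exp{ −((x1 − µ1)^2 + (x2 − µ2)^2)/(2σ^2) }

Conditioning this distribution on the unit circle x1^2 + x2^2 = 1 and transforming to polar coordinates gives the von Mises distribution

  p(θ|θ0, m) = (1/(2π I0(m))) exp{ m cos(θ − θ0) }

where I0(m) is the zeroth-order modified Bessel function of the first kind, θ0 is the mean of the distribution, and m is the concentration parameter.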

2.3.9 Mixtures of Gaussians

Consider a superposition of K Gaussian densities of the form

  p(x) = Σ_(k=1..K) πk N(x|µk, Σk)

which is called a mixture of Gaussians. The parameters πk are called mixing coefficients; they satisfy 0 ≤ πk ≤ 1 and Σ_k πk = 1.

One example with K = 3 components is shown in the book's figure; a small numerical sketch of evaluating such a mixture density is given below.
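A minimal NumPy sketch of a one-dimensional mixture-of-Gaussians density (the component parameters are invented for illustration):

```python
import numpy as np

def gauss(x, mu, var):
    """Univariate Gaussian density N(x | mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def mixture_density(x, pis, mus, variances):
    """p(x) = sum_k pi_k N(x | mu_k, var_k); the pi_k must sum to one."""
    return sum(pi * gauss(x, mu, var) for pi, mu, var in zip(pis, mus, variances))

x = np.linspace(-4.0, 8.0, 5)
print(mixture_density(x, pis=[0.5, 0.3, 0.2], mus=[0.0, 3.0, 6.0], variances=[1.0, 0.5, 2.0]))
```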

2.4. The Exponential Family

1 The probability distributions that we have studied so far in this chapter (with the exception of the Gaussian mixture) are specific examples of a broad class of distributions called the exponential family (Duda and Hart, 1973; Bernardo and Smith,1994).

The exponential family of distributions over x, given parameters η, is defined to be the set of distributions of the form
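  p(x|η) = h(x) g(η) exp{ η^T u(x) }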

where x may be scalar or vector, and may be discrete or continuous. Here η are called the natural parameters of the distribution, and u(x) is some function of x. The function g(η) can be interpreted as the coefficient that ensures that the distribution is normalized and therefore satisfies
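  g(η) ∫ h(x) exp{ η^T u(x) } dx = 1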

where the integration is replaced by summation if x is a discrete variable.

2 Conjugate priors

In general, for a given probability distribution p(x|η), we can seek a prior p(η) that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior.
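For the exponential family this conjugate prior can be written as

  p(η|χ, ν) = f(χ, ν) g(η)^ν exp{ ν η^T χ }

where f(χ, ν) is a normalization coefficient. Multiplying by the likelihood of N observations gives a posterior of the same form, with ν replaced by ν + N and νχ replaced by νχ + Σ_n u(xn).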

2.5. Nonparametric Methods

First, to estimate the probability density at a particular location, we should consider the data points that lie within some local neighbourhood of that point.

Second, the value of the smoothing parameter should be neither too large nor too small in order to obtain good results (compare the degree M of the polynomial and the value α of the regularization parameter in Chapter 1).

2.5.1 Kernel density estimators

Consider some small region R containing the point x; in the kernel approach we fix the region and count the data points that fall inside it. We obtain our density estimate in the form
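  p(x) ≃ K/(N V)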

K is the total number of points that lie inside R. V is the volume of R

The kernel function k(u) describes whether a data point lies close to x. For the Parzen window it is a unit cube, k(u) = 1 if |ui| ≤ 1/2 for i = 1, . . . , D and 0 otherwise, so that k((x − xn)/h) equals one if xn lies inside a cube of side h centred on x.

thus the estimated density at x is
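  p(x) = (1/N) Σ_n (1/h^D) k((x − xn)/h)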

where we have used h^D for the volume of a hypercube of side h in D dimensions.

The kernel k can also be chosen to be a Gaussian, which gives the smoother density model
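  p(x) = (1/N) Σ_n (1/(2πh^2)^(D/2)) exp{ −‖x − xn‖^2/(2h^2) }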

h represents the standard deviation of the Gaussian components and plays the role of a smoothing parameter, and there is a trade-off between sensitivity to noise at small h and over-smoothing at large h.
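A short NumPy sketch of this Gaussian kernel density estimator (illustrative only; names are not from the book):

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Evaluate the Gaussian kernel density estimate at the query points x.
    x: (M, D) query points, data: (N, D) observations, h: bandwidth."""
    D = x.shape[1]
    sq = ((x[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)   # (M, N) squared distances
    norm = (2.0 * np.pi * h ** 2) ** (D / 2.0)                   # (2*pi*h^2)^(D/2)
    return np.exp(-sq / (2.0 * h ** 2)).sum(axis=1) / (len(data) * norm)

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 1))
grid = np.linspace(-3.0, 3.0, 7)[:, None]
print(gaussian_kde(grid, data, h=0.3))   # roughly follows the standard normal density
```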

2.5.2 Nearest-neighbour methods

One of the difficulties with the kernel approach to density estimation is that the parameter h governing the kernel width is fixed for all kernels. In regions of high data density, a large value of h may lead to over-smoothing and a washing out of structure that might otherwise be extracted from the data. However, reducing h may lead to noisy estimates elsewhere in data space where the density is smaller.

the optimal choice for h may be dependent on location within the data space. This issue is addressed by nearest-neighbour methods for density estimation.

we consider a fixed value of K and use the data to find an appropriate value for V .

Note that the model produced by K nearest neighbours is not a true density model because the integral over all space diverges.

If we wish to classify a new point x, we draw a sphere centred on x containing precisely K points irrespective of their class. Suppose this sphere has volume V and contains Kk points from class Ck.
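If class Ck contains Nk points in the training set (so Σ_k Nk = N), the density estimates and prior are p(x|Ck) = Kk/(Nk V), p(x) = K/(N V), and p(Ck) = Nk/N, and Bayes' theorem gives the posterior

  p(Ck|x) = p(x|Ck) p(Ck)/p(x) = Kk/K

so a new point is assigned to the class with the most representatives among its K nearest neighbours.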

An interesting property of the nearest-neighbour (K = 1) classifier is that, in the limit N → ∞, the error rate is never more than twice the minimum achievable error rate of an optimal classifier.
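A brute-force sketch of the K-nearest-neighbour classification rule described above (illustrative only, with synthetic data):

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_x, train_y, K=5):
    """Classify point x by a majority vote among its K nearest training points."""
    dists = np.linalg.norm(train_x - x, axis=1)   # distances to every training point
    nearest = np.argsort(dists)[:K]               # indices of the K closest points
    votes = Counter(train_y[i] for i in nearest)  # K_k: count of neighbours in each class
    return votes.most_common(1)[0][0]             # class with the largest K_k

rng = np.random.default_rng(2)
train_x = np.vstack([rng.normal(loc=[0.0, 0.0], size=(50, 2)),
                     rng.normal(loc=[3.0, 3.0], size=(50, 2))])
train_y = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 2.5]), train_x, train_y, K=5))   # expected: 1
```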

Both the K-nearest-neighbour method and the kernel density estimator require the entire training data set to be stored, leading to expensive computation if the data set is large.

This effect can be offset, at the expense of some additional one-off computation, by constructing tree-based search structures to allow (approximate) near neighbours to be found efficiently without doing an exhaustive search of the data set.

