[Math Review] Statistics Basic: Estimation
Two Types of Estimation
- Point Estimate: a single value, computed from a sample statistic, that is used to estimate a population parameter.
Point estimates of average height with multiple samples (Source: Zhihu)
- Confidence Intervals: intervals constructed using a method that contains the population parameter a specified proportion of the time.
95% confidence interval of average height with multiple samples (Source: Zhihu)
Confidence Interval for the Mean
Population Variance is Known
Suppose that M is the mean of N samples X_1, X_2, ..., X_N, i.e.
$$M = \frac{1}{N}\sum_{i=1}^{N} X_i$$
According to the Central Limit Theorem, the sampling distribution of the mean M is approximately
$$M \sim N\left(\mu, \frac{\sigma^2}{N}\right)$$
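where μ and σ² are the mean and variance of the population, respectively. If repeated samples were taken and the 95% confidence interval computed for each sample, 95% of the intervals would contain the population mean. Because M is normally distributed around μ, M falls with probability 0.95 inside the interval that is symmetric about μ and covers an area of 0.95 under the normal curve.
That is,
$$P\left(\mu - 1.96\frac{\sigma}{\sqrt{N}} \le M \le \mu + 1.96\frac{\sigma}{\sqrt{N}}\right) = 0.95$$
Since we don't know the population mean, we center the interval on the sample mean instead, which gives the 95% confidence interval for μ:
$$M - 1.96\frac{\sigma}{\sqrt{N}} \le \mu \le M + 1.96\frac{\sigma}{\sqrt{N}}$$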
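The claim that about 95% of intervals of the form M ± 1.96σ/√N contain μ can be checked by simulation. The following is a minimal sketch using only the Python standard library; the function names, the population parameters (μ = 170, σ = 10) and the sample size are illustrative assumptions, not values from the article.

```python
import random
from statistics import NormalDist, fmean

def z_confidence_interval(sample, sigma, level=0.95):
    """Interval for the population mean when the population std dev is known."""
    n = len(sample)
    m = fmean(sample)
    # Two-sided critical value of the standard normal (about 1.96 for 95%)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n ** 0.5
    return m - half_width, m + half_width

def coverage(mu, sigma, n, trials, seed=0):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lo, hi = z_confidence_interval(sample, sigma)
        hits += lo <= mu <= hi
    return hits / trials

# Hypothetical population: heights with mean 170 and std dev 10
rate = coverage(mu=170.0, sigma=10.0, n=25, trials=2000)
print(f"fraction of intervals containing mu: {rate:.3f}")
```

The printed fraction should land close to 0.95, with some run-to-run noise that depends on the number of trials.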
Population Variance is Unknown
Degrees of Freedom
The degrees of freedom (df) of an estimate is the number of independent pieces of information on which the estimate is based. In general, the degrees of freedom for an estimate is equal to the number of values minus the number of parameters estimated en route to the estimate in question.
If the variance in a sample is used to estimate the variance in a population, we shouldn't calculate the sample variance as
$$\frac{1}{N}\sum_{i=1}^{N} (X_i - M)^2$$
That's because the population mean μ is unknown and has to be replaced by the sample mean M, which is itself estimated from the same N values. That estimate uses up one degree of freedom, so only N-1 remain, and the previous formula underestimates the variance on average. Instead, we should use the following formula
$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (X_i - M)^2$$
where s² is the estimate of the variance and M is the sample mean. The denominator of this formula is the number of degrees of freedom.
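The difference between dividing by N and dividing by N-1 shows up clearly when the two estimators are averaged over many small samples. Below is a small sketch under assumed settings (a normal population with variance 4, samples of size 5); the function names are made up for illustration.

```python
import random
from statistics import fmean

def variance_divide_by_n(xs):
    """Sum of squared deviations from the sample mean, divided by N."""
    m = fmean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def variance_divide_by_n_minus_1(xs):
    """Sum of squared deviations from the sample mean, divided by N - 1."""
    m = fmean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(1)
true_var = 4.0          # assumed population: N(0, 2^2)
n, trials = 5, 20000
biased, unbiased = [], []
for _ in range(trials):
    xs = [rng.gauss(0.0, 2.0) for _ in range(n)]
    biased.append(variance_divide_by_n(xs))
    unbiased.append(variance_divide_by_n_minus_1(xs))

avg_biased = fmean(biased)
avg_unbiased = fmean(unbiased)
print(f"true variance:          {true_var}")
print(f"average with N:         {avg_biased:.3f}")
print(f"average with N - 1:     {avg_unbiased:.3f}")
```

On average the N version comes out near (N-1)/N times the true variance, while the N-1 version comes out near the true variance itself.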
Student's t-Distribution
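Suppose that X is a normally distributed random variable, i.e., X ~ N(μ, σ²), and that X_1, X_2, ..., X_N are N independent samples of X. Then
$$M = \frac{1}{N}\sum_{i=1}^{N} X_i$$
is the sample mean and
$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (X_i - M)^2}$$
is the sample standard deviation.
$$Z = \frac{M - \mu}{\sigma/\sqrt{N}}$$
is a random variable with a standard normal distribution.
$$T = \frac{M - \mu}{s/\sqrt{N}}$$
is a random variable with Student's t distribution with N-1 degrees of freedom.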
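The probability density function of T is
$$f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$$
where ν is the number of degrees of freedom and Γ is the gamma function.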
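The standard density of Student's t distribution with ν degrees of freedom, Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) · (1 + t²/ν)^(-(ν+1)/2), is easy to evaluate directly. The sketch below uses `math.lgamma` rather than `math.gamma` so that large ν does not overflow; the function name `t_pdf` is an assumption for illustration.

```python
import math
from statistics import NormalDist

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    # Ratio of gamma functions computed in log space to avoid overflow
    coef = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    coef /= math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

normal = NormalDist()
for df in (2, 4, 10):
    print(f"df={df:>2}: f(0)={t_pdf(0.0, df):.4f}  f(3)={t_pdf(3.0, df):.4f}")
print(f"normal: f(0)={normal.pdf(0.0):.4f}  f(3)={normal.pdf(3.0):.4f}")
```

Comparing the values at t = 0 and t = 3 shows the pattern described in the text: the t density is lower in the center and higher in the tails than the standard normal, and the gap narrows as ν grows.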
The t distribution is very similar to the normal distribution when the estimate of variance is based on many degrees of freedom, but has relatively more scores in its tails when there are fewer degrees of freedom. Here are t distributions with 2, 4, and 10 degrees of freedom and the standard normal distribution. Notice that the normal distribution has relatively more scores in the center of the distribution and the t distribution has relatively more in the tails.
The t distribution is therefore leptokurtic. The t distribution approaches the normal distribution as the degrees of freedom increase.
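How heavy the tails are can be put in numbers through the excess kurtosis, which is 0 for a normal distribution. For Student's t distribution the known closed form is 6/(ν-4) when ν > 4; it is infinite for 2 < ν ≤ 4 and undefined otherwise. A short sketch (the function name is hypothetical):

```python
import math

def t_excess_kurtosis(df):
    """Excess kurtosis of Student's t distribution with df degrees of freedom."""
    if df > 4:
        return 6.0 / (df - 4)
    if df > 2:
        return math.inf   # fourth moment diverges
    return math.nan       # not defined for df <= 2

for df in (3, 5, 10, 30, 100):
    print(f"df={df:>3}: excess kurtosis = {t_excess_kurtosis(df)}")
```

The value is positive for every ν where it is defined, and it shrinks toward 0 as ν increases, which matches the distribution approaching the normal.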
Confidence Interval Using the t Distribution
Now consider the case in which you have a normal distribution but you do not know the standard deviation. You sample N values and compute the sample mean (M) and estimate the standard error of the mean (σ_M) with s_M. What is the probability that M will be within 1.96 s_M of the population mean (μ)? This is a difficult problem because there are two ways in which M could be more than 1.96 s_M from μ: (1) M could, by chance, be either very high or very low and (2) s_M could, by chance, be very low. Intuitively, it makes sense that the probability of being within 1.96 standard errors of the mean should be smaller than in the case when the standard deviation is known (and cannot be underestimated).
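The intuition that M lies within 1.96 estimated standard errors of μ less than 95% of the time can be tested by simulation. The sketch below assumes a standard normal population and a few arbitrary sample sizes; `within_1_96_se` is an illustrative name.

```python
import random
from statistics import fmean, stdev

def within_1_96_se(mu, sigma, n, trials, seed=0):
    """Fraction of samples whose mean is within 1.96 estimated SEs of mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        m = fmean(xs)
        s_m = stdev(xs) / n ** 0.5   # standard error estimated from the sample
        hits += abs(m - mu) <= 1.96 * s_m
    return hits / trials

rates = {n: within_1_96_se(0.0, 1.0, n, trials=5000) for n in (5, 10, 30, 100)}
for n, r in rates.items():
    print(f"N={n:>3}: fraction within 1.96 s_M = {r:.3f}")
```

For small N the fraction falls noticeably short of 0.95, and it moves toward 0.95 as N grows and the estimate of the standard error becomes more reliable.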
Luckily, however, we can prove that the random variable T follows Student's t distribution. So we can use the t distribution to estimate the mean of a normally distributed population when the sample size is small and the population standard deviation is unknown. The 90% confidence interval can be calculated as
$$M - A\frac{s}{\sqrt{N}} \le \mu \le M + A\frac{s}{\sqrt{N}}$$
where s is the sample standard deviation and A is the value such that 90% of the area of the t distribution with N-1 degrees of freedom lies between -A and A. We can look up A in a t table.
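Instead of a t table, the critical value A can be found numerically. The following sketch integrates the t density with Simpson's rule and locates A by bisection, then builds the interval M ± A·s/√N. This is one possible approach using only the standard library, not the method of any particular package; all function names and the sample data are assumptions for illustration.

```python
import math
from statistics import fmean, stdev

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(t * t / df))

def central_area(a, df, steps=2000):
    """Area under the t density between -a and a (Simpson's rule, steps even)."""
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # The density is symmetric, so double the integral over [0, a]
    return 2 * total * h / 3

def t_critical(level, df):
    """Value A such that the area between -A and A equals level."""
    lo, hi = 0.0, 1.0
    while central_area(hi, df) < level:   # expand until A is bracketed
        hi *= 2
    for _ in range(60):                   # bisection
        mid = (lo + hi) / 2
        if central_area(mid, df) < level:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_confidence_interval(sample, level=0.90):
    """Interval for the mean when the population std dev is unknown."""
    n = len(sample)
    m = fmean(sample)
    s_m = stdev(sample) / math.sqrt(n)    # estimated standard error
    a = t_critical(level, n - 1)
    return m - a * s_m, m + a * s_m

heights = [168.0, 172.5, 165.3, 180.1, 174.8]   # made-up data
low, high = t_confidence_interval(heights, level=0.90)
print(f"A for 90%, df=4: {t_critical(0.90, 4):.3f}")
print(f"90% interval for the mean: ({low:.2f}, {high:.2f})")
```

With only five values the interval is noticeably wider than the one a normal critical value would give, because A for 4 degrees of freedom is larger than the corresponding normal value.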