Correlation and Regression
Sample Covariance
The covariance between two random variables is a statistical measure of the degree to which the two variables move together.
The covariance captures the linear relationship between two variables. A positive covariance indicates that the variables tend to move together; a negative covariance indicates that the variables tend to move in opposite directions.
The sample covariance is calculated as:
cov(X,Y) = Σ(Xi - X̄)(Yi - Ȳ) / (n - 1)
where n is the sample size, Xi and Yi are the ith observations on X and Y, and X̄ and Ȳ are the sample means of X and Y.
The actual value of the covariance is not very meaningful because its measurement is extremely sensitive to the scale of the two variables. Also, the covariance may range from negative to positive infinity, and it is presented in terms of squared units (e.g., percent squared when data are in percent). For these reasons, we take the additional step of calculating the correlation coefficient, which converts the covariance into a standardized measure that is easier to interpret.
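The scale sensitivity described here can be seen in a minimal sketch. The data and the helper name `sample_covariance` are illustrative assumptions, not part of the original notes; the function sums the products of deviations from the means and divides by n - 1.

```python
# Illustrative data (an assumption, not taken from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def sample_covariance(xs, ys):
    """Sum of cross-deviations from the sample means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (n - 1)


cov_xy = sample_covariance(x, y)

# The same observations with X expressed in units 100 times smaller.
x_rescaled = [100.0 * a for a in x]
cov_rescaled = sample_covariance(x_rescaled, y)

print(cov_xy, cov_rescaled)
```

Rescaling X by 100 multiplies the covariance by 100 even though the relationship between the two series has not changed, which is why the raw covariance is hard to interpret.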
Sample Correlation Coefficient
The correlation coefficient, r, is a measure of the strength of the linear relationship (correlation) between two variables. The correlation coefficient has no unit of measurement; it is a "pure" measure of the tendency of two variables to move together.
The sample correlation coefficient for two variables, X and Y, is calculated as:
r = cov(X,Y) / (sX × sY)
where sX and sY are the sample standard deviations of X and Y.
The correlation coefficient is bounded by positive and negative 1 (i.e., -1 <= r <= 1), where a correlation coefficient of +1 indicates that changes in the variables are perfectly positively correlated (i.e., they go up and down together, in lock-step). In contrast, if the correlation coefficient is -1, the changes in the variables are perfectly negatively correlated.
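As a rough illustration, the sketch below computes r as the covariance divided by the product of the two standard deviations. The data and the helper name `sample_correlation` are assumptions made for the example.

```python
import math


def sample_correlation(xs, ys):
    """r = cov(X, Y) / (s_X * s_Y); the (n - 1) divisors cancel out."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    s_xx = sum((a - mean_x) ** 2 for a in xs)
    s_yy = sum((b - mean_y) ** 2 for b in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Illustrative data (an assumption, not taken from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = sample_correlation(x, y)

# Changing the units of X leaves r unchanged.
r_rescaled = sample_correlation([100.0 * a for a in x], y)

# Points lying exactly on an upward-sloping line give r = +1.
r_line = sample_correlation(x, [3.0 * a + 1.0 for a in x])

print(r, r_rescaled, r_line)
```

Here r is about 0.77 for both the original and the rescaled data, and the exact line gives r = 1 even though its slope is 3.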
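The interpretation of the possible correlation values is summarized below:
- r = +1: perfect positive correlation
- 0 < r < +1: positive linear relationship
- r = 0: no linear relationship
- -1 < r < 0: negative linear relationship
- r = -1: perfect negative correlation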
Interpreting a Scatter Plot
A scatter plot is a collection of points on a graph where each point represents the values of two variables (i.e., an X/Y pair).
Note that for r=1 and r=-1 the data points lie exactly on a line, but the slope of that line is not necessarily +1 or -1.
Limitations to Correlation Analysis
Outliers
Outliers represent a few extreme values for sample observations. Relative to the rest of the sample data, the value of an outlier may be extraordinarily large or small. Outliers can result in apparent statistical evidence that a significant relationship exists when, in fact, there is none, or that there is no relationship when, in fact, there is a relationship.
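A small sketch of the first case, where a single outlier creates the appearance of a strong relationship. The data are made up for illustration, and `sample_correlation` is a hypothetical helper.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    s_xx = sum((a - mean_x) ** 2 for a in xs)
    s_yy = sum((b - mean_y) ** 2 for b in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Five made-up observations with no linear relationship.
x_core = [1.0, 2.0, 3.0, 4.0, 5.0]
y_core = [2.0, 4.0, 3.0, 4.0, 2.0]
r_core = sample_correlation(x_core, y_core)

# The same sample plus one extreme observation at (20, 20).
r_with_outlier = sample_correlation(x_core + [20.0], y_core + [20.0])

print(r_core, r_with_outlier)
```

The five core points have a correlation of zero, but adding one extreme point pushes the sample correlation above 0.9.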
Spurious Correlation
Spurious correlation refers to the appearance of a causal linear relationship when, in fact, there is no relation. Certain data items may be highly correlated purely by chance.
Nonlinear Relationship
Correlation measures only the linear relationship between two variables; it does not capture strong nonlinear relationships between them.
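A minimal sketch of this limitation, using made-up data in which Y is an exact function of X (Y = X^2) and yet the sample correlation is zero. The helper name is an assumption.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    s_xx = sum((a - mean_x) ** 2 for a in xs)
    s_yy = sum((b - mean_y) ** 2 for b in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Y depends on X exactly, but the dependence is not linear.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [a ** 2 for a in x]

r_quadratic = sample_correlation(x, y)
print(r_quadratic)
```

Because the positive and negative cross-deviations cancel, r is zero here even though X determines Y completely.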
Testing Whether the Population Correlation Coefficient Equals Zero
The closer the correlation coefficient is to +1 or -1, the stronger the correlation. With the exception of these extremes (i.e., r = +/-1), we cannot really speak of the strength of the relationship indicated by the correlation coefficient without a statistical test of significance.
For our purposes, we want to test whether the population correlation between two variables, ρ, is equal to zero:
H0: ρ = 0 versus Ha: ρ ≠ 0
Assuming that the two populations are normally distributed, we can use a t-test to determine whether the null hypothesis should be rejected. The test statistic is computed using the sample correlation, r, with n-2 degrees of freedom (df):
t = r × sqrt(n - 2) / sqrt(1 - r^2)
To make a decision, the calculated test statistic is compared with the critical t-value for the appropriate degrees of freedom and level of significance. Bearing in mind that we are conducting a two-tailed test, the decision rule can be stated as:
Reject H0 if t > +t-critical or t < -t-critical.
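The test can be sketched as follows, using the standard statistic t = r × sqrt(n - 2) / sqrt(1 - r^2). The sample correlations, sample sizes, and the function name `correlation_t_test` are hypothetical; the critical values are read from a standard t-table for a two-tailed test at the 5% level rather than computed.

```python
import math


def correlation_t_test(r, n, t_critical):
    """Return the t-statistic (df = n - 2) and the two-tailed decision."""
    t_stat = r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)
    reject_null = abs(t_stat) > t_critical
    return t_stat, reject_null


# Hypothetical case 1: r = 0.7746 from only 5 observations (df = 3).
t_small, reject_small = correlation_t_test(r=0.7746, n=5, t_critical=3.182)

# Hypothetical case 2: r = 0.80 from 12 observations (df = 10).
t_large, reject_large = correlation_t_test(r=0.80, n=12, t_critical=2.228)

print(t_small, reject_small)
print(t_large, reject_large)
```

The two cases have similar sample correlations, but only the second rejects the null hypothesis, because the larger sample gives both a larger test statistic and a smaller critical value.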
Simple Linear Regression
The purpose of simple linear regression is to explain the variation in a dependent variable in terms of the variation in a single independent variable. Here, the term "variation" is interpreted as the degree to which a variable differs from its mean value. Don't confuse variation with variance -- they are related but are not the same.
- The dependent variable is the variable whose variation is explained by the independent variable. The dependent variable is also referred to as the explained variable, the endogenous variable, or the predicted variable.
- The independent variable is the variable used to explain the variation of the dependent variable. The independent variable is also referred to as the explanatory variable, the exogenous variable, or the predicting variable.
Assumptions of Linear Regression
Linear regression requires a number of assumptions. As indicated in the following list, most of the major assumptions pertain to the regression model's residual term ε.
- A linear relation exists between the dependent and the independent variable.
- The independent variable is uncorrelated with the residuals.
- The expected value of the residual term is zero: E(ε) = 0.
- The variance of the residual term is constant for all observations.
- The residual term is independently distributed; that is, the residual for one observation is not correlated with that of another observation.
- The residual term is normally distributed.
Simple Linear Regression Model
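The following linear regression model is used to describe the relationship between two variables, X and Y:
Yi = b0 + b1 × Xi + εi, i = 1, ..., n
where Yi is the ith observation of the dependent variable, Xi is the ith observation of the independent variable, b0 is the intercept, b1 is the slope coefficient, and εi is the residual (error) term for the ith observation.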
Based on the regression model stated previously, the regression process estimates an equation for a line through a scatter plot of the data that "best" explains the observed values for Y in terms of the observed values for X.
The linear equation, often called the line of best fit or the regression line, takes the following form:
Yi^ = b0^ + b1^ × Xi
where Yi^ is the estimated value of Yi given Xi, b0^ is the estimated intercept, and b1^ is the estimated slope coefficient.
The regression line is just one of the many possible lines that can be drawn through the scatter plot of X and Y. In fact, the criterion used to estimate this line forms the very essence of linear regression. The regression line is the line for which the estimates of b0 and b1 are such that the sum of the squared differences (vertical distances) between the Y-values predicted by the regression equation and the actual Y-values is minimized. The sum of the squared vertical distances between the estimated and the actual Y-values is referred to as the sum of squared errors (SSE).
Thus, the regression line is the line that minimizes the SSE. This explains why simple linear regression is frequently referred to as ordinary least squares (OLS) regression, and the values estimated by the estimated regression equation are called least squares estimates.
The estimated slope coefficient for the regression line describes the change in Y for a one-unit change in X. It can be positive, negative, or zero, depending on the relationship between the regression variables. The slope term is calculated as:
b1^ = cov(X,Y) / sX^2
where sX^2 is the sample variance of X.
The intercept term is the line's intersection with the Y-axis at X=0. It can be positive, negative, or zero. A property of the least squares method is that the intercept term may be expressed as:
b0^ = Ȳ - b1^ × X̄
where X̄ and Ȳ are the sample means of X and Y.
The intercept equation highlights the fact that the regression line passes through a point with coordinates equal to the mean of the independent and dependent variables.
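A minimal sketch of the least squares estimates, computing the slope as the covariance of X and Y divided by the variance of X and the intercept as the mean of Y minus the slope times the mean of X. The data and the function name `fit_line` are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (n - 1)
    var_x = sum((a - mean_x) ** 2 for a in xs) / (n - 1)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative data (an assumption, not taken from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0_hat, b1_hat = fit_line(x, y)

# The fitted line evaluated at the mean of X returns the mean of Y.
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
fitted_at_mean = b0_hat + b1_hat * mean_x

print(b0_hat, b1_hat, fitted_at_mean, mean_y)
```

For this sample the slope is 0.6 and the intercept is 2.2, and the fitted value at the mean of X (3) equals the mean of Y (4).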
Keep in mind that any conclusion regarding the importance of an independent variable in explaining a dependent variable requires determining the statistical significance of the slope coefficient. Simply looking at the magnitude of the slope coefficient does not address the issue of the importance of the variable. A hypothesis test must be conducted, or a confidence interval must be formed, to assess the importance of the variable.
SEE (Standard Error of Estimate)
The standard error of estimate (SEE) measures the degree of variability of the actual Y-values relative to the estimated Y-values from a regression equation. The SEE gauges the "fit" of the regression line. The smaller the standard error, the better the fit.
The SEE is the standard deviation of the error terms in the regression. As such, SEE is also referred to as the standard error of the residual, or standard error of the regression.
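The SEE can be sketched as the square root of the sum of squared residuals divided by n - 2, the degrees of freedom in a simple linear regression. The data and variable names below are illustrative assumptions.

```python
import math

# Illustrative data (an assumption, not taken from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares slope and intercept.
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_y - slope * mean_x

# Residual = actual Y minus the Y predicted by the regression line.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
sse = sum(e ** 2 for e in residuals)

# Two parameters (intercept and slope) are estimated, hence n - 2.
see = math.sqrt(sse / (n - 2))
print(sse, see)
```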
Coefficient of Determination (R^2)
The coefficient of determination (R^2) is defined as the percentage of the total variation in the dependent variable explained by the independent variable. For example, an R^2 of 0.63 indicates that the variation of the independent variable explains 63% of the variation in the dependent variable.
For simple linear regression (i.e., one independent variable), the coefficient of determination, R^2, may be computed by simply squaring the correlation coefficient, r. In other words, R^2 = r^2 for a regression with one independent variable. This approach is not appropriate when more than one independent variable is used in the regression.
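The equality R^2 = r^2 can be checked numerically. The sketch below, on made-up data, computes R^2 as the explained share of total variation (1 - SSE/SST) and compares it with the squared sample correlation.

```python
import math

# Illustrative data (an assumption, not taken from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
s_xx = sum((a - mean_x) ** 2 for a in x)
s_yy = sum((b - mean_y) ** 2 for b in y)

# Sample correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

# Fit the regression line and measure the unexplained variation.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
sst = s_yy  # total variation of Y around its mean

r_squared = 1.0 - sse / sst
print(r ** 2, r_squared)
```

Both quantities come out to 0.6 for this sample.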
Regression Coefficient Confidence Interval
Hypothesis testing for a regression coefficient may use the confidence interval for the coefficient being tested.
The confidence interval for the regression coefficient, b1, is calculated as:
b1^ ± (tc × Sb1), or equivalently b1^ - (tc × Sb1) < b1 < b1^ + (tc × Sb1)
In this expression, tc is the critical two-tailed t-value for the selected confidence level with the appropriate number of degrees of freedom, which is equal to the number of observations minus 2 (i.e., n-2).
The standard error of the regression coefficient is denoted as Sb1. It is a function of the SEE: as SEE rises, Sb1 also increases, and the confidence interval widens. This makes sense because SEE measures the variability of the data about the regression line, and the more variable the data, the less confidence there is in the regression model's estimate of the coefficient.
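A sketch of the interval on made-up data follows. It assumes the usual simple-regression expression Sb1 = SEE / sqrt(Σ(Xi - X̄)^2), which the text does not state explicitly, and takes the critical value 3.182 (two-tailed, 5% significance, df = 3) from a standard t-table.

```python
import math

# Illustrative data (an assumption, not taken from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
s_xx = sum((a - mean_x) ** 2 for a in x)
slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / s_xx
intercept = mean_y - slope * mean_x

sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
see = math.sqrt(sse / (n - 2))

# Assumed formula for the standard error of the slope coefficient.
s_b1 = see / math.sqrt(s_xx)

# Critical t-value from a t-table: two-tailed, 5% level, df = n - 2 = 3.
t_critical = 3.182

lower = slope - t_critical * s_b1
upper = slope + t_critical * s_b1
print(lower, upper)
```

The interval runs from roughly -0.3 to 1.5. Because it contains zero, the slope is not significantly different from zero at the 5% level for this small sample.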
Hypothesis About a Population Value of a Regression Coefficient
A t-test may also be used to test the hypothesis that the true slope coefficient, b1, is equal to some hypothesized value. Letting b1^ be the point estimate for b1, the appropriate test statistic with n-2 degrees of freedom is:
t = (b1^ - b1) / Sb1
The decision rule for tests of significance for a regression coefficient is:
Reject H0 if t > +tc or t < -tc.
Rejection of the null means that the slope coefficient is different from the hypothesized value of b1.
To test whether an independent variable explains the variation in the dependent variable (i.e., it is statistically significant), the hypothesis that is tested is whether the true slope is zero (b1=0). The appropriate test structure for the null and alternative hypotheses is:
H0: b1 = 0 versus Ha: b1 ≠ 0
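The test can be sketched as a small function that divides the difference between the estimated and hypothesized slope by the standard error of the slope. The function name `slope_t_test` and all input values are hypothetical; the critical value is taken from a standard t-table (two-tailed, 5% level, df = 3).

```python
def slope_t_test(b1_hat, b1_hypothesized, s_b1, t_critical):
    """Return the t-statistic and the two-tailed reject decision."""
    t_stat = (b1_hat - b1_hypothesized) / s_b1
    reject_null = abs(t_stat) > t_critical
    return t_stat, reject_null


# Hypothetical regression output: slope 0.6 with standard error 0.2828,
# estimated from n = 5 observations (df = 3).
t_stat, reject_null = slope_t_test(
    b1_hat=0.6,
    b1_hypothesized=0.0,  # H0: b1 = 0
    s_b1=0.2828,
    t_critical=3.182,
)
print(t_stat, reject_null)
```

The statistic is about 2.12, which is inside the range of -3.182 to +3.182, so the null hypothesis that the slope is zero is not rejected.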
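Confidence Interval for Predicted Values of the Dependent Variable
Confidence intervals for the predicted value of a dependent variable are calculated in a manner similar to the confidence interval for the regression coefficients:
Y^ ± (tc × sf)
where Y^ is the predicted value of Y, tc is the two-tailed critical t-value with n-2 degrees of freedom, and sf is the standard error of the forecast.
The challenge with computing a confidence interval for a predicted value is calculating sf.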
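A sketch of the calculation on made-up data. It assumes the commonly used simple-regression expression for the forecast variance, sf^2 = SEE^2 × [1 + 1/n + (X - X̄)^2 / ((n - 1) × sX^2)], which is not given in the text, and a critical value of 3.182 (two-tailed, 5% level, df = 3) from a t-table.

```python
import math

# Illustrative data (an assumption, not taken from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
x_new = 6.0  # hypothetical value of X for which Y is forecast

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
s_xx = sum((a - mean_x) ** 2 for a in x)  # equals (n - 1) * var(X)
slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / s_xx
intercept = mean_y - slope * mean_x

sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
see_squared = sse / (n - 2)

# Assumed forecast variance: grows as x_new moves away from the mean of X.
s_f = math.sqrt(see_squared * (1.0 + 1.0 / n + (x_new - mean_x) ** 2 / s_xx))

y_hat = intercept + slope * x_new
t_critical = 3.182  # from a t-table: two-tailed, 5% level, df = 3

lower = y_hat - t_critical * s_f
upper = y_hat + t_critical * s_f
print(y_hat, s_f, lower, upper)
```

Under this assumed formula, sf is always larger than the SEE, and the interval widens the further the forecast point lies from the mean of X.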
Analysis of Variance(ANOVA)
Analysis of variance (ANOVA) is a statistical procedure for analyzing the total variability of the dependent variable.
- Total sum of squares (SST) measures the total variation in the dependent variable. SST is equal to the sum of squared differences between the actual Y-values and the mean of Y: SST = Σ(Yi - Ȳ)^2
Note: this is not the same as variance. Variance = SST/(n-1)
- Regression sum of squares (RSS) measures the variation in the dependent variable that is explained by the independent variable. RSS is the sum of the squared distances between the predicted Y-values and the mean of Y: RSS = Σ(Yi^ - Ȳ)^2
- Sum of squared errors (SSE) measures the unexplained variation in the dependent variable. It is also known as the sum of squared residuals or the residual sum of squares. SSE is the sum of the squared vertical distances between the actual Y-values and the predicted Y-values on the regression line: SSE = Σ(Yi - Yi^)^2
Thus, total variation = explained variation + unexplained variation, or:
SST = RSS + SSE
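The decomposition can be verified numerically. The sketch below fits a least squares line to made-up data and computes the three sums of squares from their definitions.

```python
# Illustrative data (an assumption, not taken from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_y - slope * mean_x
y_hat = [intercept + slope * a for a in x]

# Total variation: actual Y around the mean of Y.
sst = sum((b - mean_y) ** 2 for b in y)
# Explained variation: predicted Y around the mean of Y.
rss = sum((p - mean_y) ** 2 for p in y_hat)
# Unexplained variation: actual Y around predicted Y.
sse = sum((b - p) ** 2 for b, p in zip(y, y_hat))

print(sst, rss, sse, rss + sse)
```

For this sample SST is 6.0, split into an explained part of 3.6 and an unexplained part of 2.4. The identity holds for a least squares fit with an intercept; it need not hold for an arbitrary line.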
The output of the ANOVA procedure is an ANOVA table, which is a summary of the variation in the dependent variable. ANOVA tables are included in the regression output of many statistical software packages.
A generic ANOVA table for a simple linear regression (one independent variable) has the following rows:
- Regression (explained): df = 1, sum of squares = RSS, mean sum of squares MSR = RSS / 1
- Error (unexplained): df = n - 2, sum of squares = SSE, mean sum of squares MSE = SSE / (n - 2)
- Total: df = n - 1, sum of squares = SST
The mean regression sum of squares (MSR) and mean squared error (MSE) are simply calculated as the appropriate sum of squares divided by its degrees of freedom.
Calculating R^2 and SEE
The R^2 and the standard error of estimate (SEE) can also be calculated directly from the ANOVA table. The R^2 is the percentage of the total variation in the dependent variable explained by the independent variable:
R^2 = (SST - SSE) / SST = RSS / SST
The SEE is the standard deviation of the regression error terms and is equal to the square root of the mean squared error (MSE):
SEE = sqrt(MSE) = sqrt(SSE / (n - 2))
Note: SSE is the sum of the squared residuals, while SEE is the standard deviation of the residuals.
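As a short illustration, both quantities follow from the sums of squares in an ANOVA table. The numbers below are hypothetical ANOVA output for a regression on n = 5 observations.

```python
import math

# Hypothetical ANOVA output for a simple linear regression.
n = 5
sst = 6.0  # total sum of squares
sse = 2.4  # sum of squared errors
rss = sst - sse  # regression (explained) sum of squares

# Share of total variation explained by the independent variable.
r_squared = rss / sst

# Mean squared error uses n - 2 degrees of freedom.
mse = sse / (n - 2)
see = math.sqrt(mse)

print(r_squared, mse, see)
```

Note the distinction in the last two lines: SSE (2.4) is a sum of squares, while SEE (about 0.89) is a standard deviation.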
The F-Statistic
An F-test assesses how well a set of independent variables, as a group, explains the variation in the dependent variable. In multiple regression, the F-statistic is used to test whether at least one independent variable in a set of independent variables explains a significant portion of the variation of the dependent variable.
The F-statistic is calculated as:
F = MSR / MSE = (RSS / k) / (SSE / (n - k - 1))
where k is the number of independent variables.
In multiple regression, the F-statistic tests all independent variables as a group.
The F-Statistic With One Independent Variable
For simple linear regression, there is only one independent variable, so the F-statistic tests the same hypothesis as the t-test for statistical significance of the slope coefficient:
H0: b1 = 0 versus Ha: b1 ≠ 0
To determine whether b1 is statistically significant using the F-test, the calculated F-statistic is compared with the critical F-value, Fc, at the appropriate level of significance. The degrees of freedom for the numerator and the denominator with one independent variable are:
df numerator = k = 1
df denominator = n - k - 1 = n - 2
The decision rule for the F-test is:
Reject H0 if F > Fc.
Rejection of the null hypothesis at a stated level of significance indicates that the slope coefficient is significantly different from zero, which is interpreted to mean that the independent variable makes a significant contribution to the explanation of the dependent variable. In simple linear regression, it tells us the same thing as the t-test of the slope coefficient. In fact, in simple linear regression with one independent variable, F = t^2, where t is the t-statistic for b1.
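The relationship between the F-statistic and the slope t-statistic can be checked with a small sketch. The ANOVA figures are hypothetical, the standard error of the slope is assumed to be sqrt(MSE / Σ(Xi - X̄)^2), and the critical F-value 10.13 (5% level, 1 and 3 degrees of freedom) is taken from a standard F-table.

```python
import math

# Hypothetical regression output for n = 5 observations, k = 1 regressor.
n = 5
k = 1
rss = 3.6     # regression sum of squares
sse = 2.4     # sum of squared errors
b1_hat = 0.6  # estimated slope
s_xx = 10.0   # sum of squared deviations of X from its mean

msr = rss / k
mse = sse / (n - k - 1)
f_stat = msr / mse

# t-statistic for H0: b1 = 0, using the assumed standard error of the slope.
s_b1 = math.sqrt(mse / s_xx)
t_stat = b1_hat / s_b1

f_critical = 10.13  # from an F-table: 5% level, df = (1, 3)
reject_null = f_stat > f_critical

print(f_stat, t_stat ** 2, reject_null)
```

Both the F-statistic and the squared t-statistic equal 4.5 here, and neither test rejects the null hypothesis for this small sample.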
Limitations of Regression Analysis
Linear relationships can change over time. This means that the estimation equation based on data from a specific time period may not be relevant for forecasts or predictions in another time period. This is referred to as parameter instability.
Even if the regression model actually reflects the historical relationship between the two variables, its usefulness in investment analysis will be limited if other market participants are also aware of and act on this evidence.
If the assumptions underlying regression analysis do not hold, the interpretation and tests of hypotheses may not be valid.