R与数据分析旧笔记（六）多元线性分析上

> x=iris[which(iris$Species=="setosa"),1:4]
> plot(x)

首先是简单的肉眼观察数据之间相关性

多元回归相较于一元回归的最主要困难可能就是变量的选择，如下面的例子

使用Swiss数据集（R内置）

Swiss Fertility and Socioeconomic Indicators(1888) Data

建立多元线性回归

> s=lm(Fertility~.,data=swiss)
> print(s)
 
Call:
lm(formula = Fertility ~ ., data = swiss)
 
Coefficients:
     (Intercept)       Agriculture       Examination         Education
         66.9152           -0.1721           -0.2580           -0.8709
        Catholic  Infant.Mortality
          0.1041            1.0770  
 
> summary(s)#模型汇总信息
 
Call:
lm(formula = Fertility ~ ., data = swiss)
 
Residuals:
     Min       1Q   Median       3Q      Max
-15.2743  -5.2617   0.5032   4.1198  15.3213 
 
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
Agriculture      -0.17211    0.07030  -2.448  0.01873 *
Examination      -0.25801    0.25388  -1.016  0.31546
Education        -0.87094    0.18303  -4.758 2.43e-05 ***
Catholic          0.10412    0.03526   2.953  0.00519 **
Infant.Mortality  1.07705    0.38172   2.822  0.00734 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 7.165 on 41 degrees of freedom
Multiple R-squared:  0.7067,	Adjusted R-squared:  0.671
F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10

可从summary可看出Examination的Pr=0.31546，明显大于0.05，显著性水平有点问题。

将Examination剔除后，看看

> k=lm(Fertility~Agriculture+Education+Catholic+Infant.Mortality,data=swiss)
> print(k)
 
Call:
lm(formula = Fertility ~ Agriculture + Education + Catholic +
    Infant.Mortality, data = swiss)
 
Coefficients:
     (Intercept)       Agriculture         Education          Catholic
         62.1013           -0.1546           -0.9803            0.1247
Infant.Mortality
          1.0784  
 
> summary(k)
 
Call:
lm(formula = Fertility ~ Agriculture + Education + Catholic +
    Infant.Mortality, data = swiss)
 
Residuals:
     Min       1Q   Median       3Q      Max
-14.6765  -6.0522   0.7514   3.1664  16.1422 
 
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
Agriculture      -0.15462    0.06819  -2.267  0.02857 *
Education        -0.98026    0.14814  -6.617 5.14e-08 ***
Catholic          0.12467    0.02889   4.315 9.50e-05 ***
Infant.Mortality  1.07844    0.38187   2.824  0.00722 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 7.168 on 42 degrees of freedom
Multiple R-squared:  0.6993,	Adjusted R-squared:  0.6707
F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10

Pr均小于0.05，表明相关性显著，但R²=0.6993略小，置信度不高。

例题：某大型牙膏制造企业为了更好地扩展产品市场，有效地管理库存，公司董事会要求销售部门根据市场调查，找出公司生产的牙膏销售量与销售价格、广告投入等之间的关系，从而预测出在不同的价格和广告费用下的销售量。为此，销售部的研究人员收集了过去30个销售周期（每个销售周期为4周）公司生产的牙膏的销售量、销售价格、投入的广告费用，以及周期其它厂家生产同类牙膏的市场平均销售价格，如表所示。试根据这些数据建立一个数学模型，分析牙膏销售量与其他因素的关系，为订制价格策略与广告投入策略提供数据依据

销售周期	公司销售价格（元）	其他厂家平均价格（元）	价格差（元）	广告费用（百万元）	销售量（百万支）
1	3.85	3.80	-0.05	5.50	7.38
2	3.75	4.00	0.25	6.75	8.51
3	3.70	4.30	0.60	7.25	9.52
4	3.70	3.70	0.00	5.50	7.50
5	3.60	3.85	0.25	7.00	9.33
6	3.60	3.80	0.20	6.50	8.28
7	3.60	3.75	0.15	6.75	8.75
8	3.80	3.85	0.05	5.25	7.87
9	3.80	3.65	-0.15	5.25	7.10
10	3.85	4.00	0.15	6.00	8.00
11	3.90	4.10	0.20	6.50	7.89
12	3.90	4.00	0.10	6.25	8.15
13	3.70	4.10	0.40	7.00	9.10
14	3.75	4.20	0.45	6.90	8.86
15	3.75	4.10	0.35	6.80	8.90
16	3.80	4.10	0.30	6.80	8.87
17	3.70	4.20	0.50	7.10	9.26
18	3.80	4,30	0.50	7.00	9.00
19	3.70	4.10	0.40	6.80	8.75
20	3.80	3.75	-0.05	6.50	7.95
21	3.80	3.75	-0.05	6.25	7.65
22	3.75	3.65	-0.10	6.00	7.27
23	3.70	3.90	0.20	6.50	8.00
24	3.55	3.65	0.10	7.00	8.50
25	3.60	4.10	0.50	6.80	8.75
26	3.65	4.25	0.60	6.80	9.21
27	3.70	3.65	-0.05	6.50	8.27
28	3.75	3.75	0.00	5.75	7.67
29	3.80	3.85	0.05	5.80	7.93
30	3.70	4.25	0.55	6.80	9.26

分析：由于牙膏是生活必需品，对于大多数顾客来说，在购买同类产品的牙膏时，更多地会关心不同品牌之间的价格差，而不是它们的价格本身。因此，在研究各个因素对销售量的影响时，用价格差代替公司销售价格和其他厂家平均价格更为合适。

模型建立与求解

记牙膏销售量为Y，价格差为X₁，公司的广告费为X₂，假设基本模型为线性模型

Y=β₀+β₁X₁+β₂X₂+ε

输入数据，调用R中的lm()函数求解，并用summary()显示计算结果（程序名：exam0609.R）

> toothpaste<-data.frame(
+ X1=c(-0.05, 0.25,0.60,0,0.25,0.20, 0.15,0.05,-0.15, 0.15,
+ 0.20, 0.10,0.40,0.45,0.35,0.30, 0.50,0.50, 0.40,-0.05,
+ -0.05,-0.10,0.20,0.10,0.50,0.60,-0.05,0,0.05, 0.55),
+ X2=c( 5.50,6.75,7.25,5.50,7.00,6.50,6.75,5.25,5.25,6.00,
+ 6.50,6.25,7.00,6.90,6.80,6.80,7.10,7.00,6.80,6.50,
+ 6.25,6.00,6.50,7.00,6.80,6.80,6.50,5.75,5.80,6.80),
+ Y =c( 7.38,8.51,9.52,7.50,9.33,8.28,8.75,7.87,7.10,8.00,
+ 7.89,8.15,9.10,8.86,8.90,8.87,9.26,9.00,8.75,7.95,
+ 7.65,7.27,8.00,8.50,8.75,9.21,8.27,7.67,7.93,9.26)
+ )
> lm.sol<-lm(Y~X1+X2,data=toothpaste)
> summary(lm.sol)
 
Call:
lm(formula = Y ~ X1 + X2, data = toothpaste)
 
Residuals:
     Min       1Q   Median       3Q      Max
-0.49779 -0.12031 -0.00867  0.11084  0.58106 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.4075     0.7223   6.102 1.62e-06 ***
X1            1.5883     0.2994   5.304 1.35e-05 ***
X2            0.5635     0.1191   4.733 6.25e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.2383 on 27 degrees of freedom
Multiple R-squared:  0.886,	Adjusted R-squared:  0.8776
F-statistic:   105 on 2 and 27 DF,  p-value: 1.845e-13

计算结果通过回归系数检验和回归方程检验，由此得到销售量与价格差与广告费之间的关系为：

Y=4.4075+1.5883X₁+0.5635X₂

模型的进一步分析

为了进一步分析回归模型，我们画出y与x₁和y与x₂散点图，如下图。从散点图上可以看出，对于y与x₁，用直线拟合较好。而对于y与x₂，则用二次曲线拟合较好

> ####绘制x1与y的散点图和回归直线
> attach(toothpaste)
> plot(Y~X1);abline(lm(Y~X1))

> ####绘制x2与y的散点图和回归曲线
> lm2.sol<-lm(Y~X2+I(X2^2))
> x<-seq(min(X2),max(X2),len=200)
> y<-predict(lm2.sol,data.frame(X2=x))
> plot(Y~X2);lines(x,y)

其中I(X2^2)表示模型中X₂的平方项，即X₂²

从两图可看出，将销售量模型改为

Y=β₀+β₁X₁+β₂X₂+β₃X₂²+ε

似乎更加合理。我们作相应的回归分析，

> lm.new<-update(lm.sol, .~.+I(X2^2))
> summary(lm.new)
 
Call:
lm(formula = Y ~ X1 + X2 + I(X2^2), data = toothpaste)
 
Residuals:
     Min       1Q   Median       3Q      Max
-0.40330 -0.14509 -0.03035  0.15488  0.46602 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  17.3244     5.6415   3.071  0.00495 **
X1            1.3070     0.3036   4.305  0.00021 ***
X2           -3.6956     1.8503  -1.997  0.05635 .
I(X2^2)       0.3486     0.1512   2.306  0.02934 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.2213 on 26 degrees of freedom
Multiple R-squared:  0.9054,	Adjusted R-squared:  0.8945
F-statistic: 82.94 on 3 and 26 DF,  p-value: 1.944e-13

此时发现，模型残差的标准差有所下降，相关系数的平方R²有所上升，这说明模型修正是合理的。但这也出现一个问题，就是对应于β₂的P-值>0.05.为进一步分析作β的区间估计。

> source("beta.int.R")
> beta.int(lm.new)
              Estimate        Left      Right
(Intercept) 17.3243685  5.72818421 28.9205529
X1           1.3069887  0.68290927  1.9310682
X2          -3.6955867 -7.49886317  0.1076898
I(X2^2)      0.3486117  0.03786354  0.6593598

β₂的区间估计是[-7.49886317,0.1076898]，包含了0，也就是说，β₂的值可能会是0

去掉X₂的一次项，再进行分析

> lm2.new<-update(lm.new,.~.-X2)
> summary(lm2.new)
 
Call:
lm(formula = Y ~ X1 + I(X2^2), data = toothpaste)
 
Residuals:
    Min      1Q  Median      3Q     Max
-0.4859 -0.1141 -0.0046  0.1053  0.5592 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.07667    0.35531  17.102 5.17e-16 ***
X1           1.52498    0.29859   5.107 2.28e-05 ***
I(X2^2)      0.04720    0.00952   4.958 3.41e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.2332 on 27 degrees of freedom
Multiple R-squared:  0.8909,	Adjusted R-squared:  0.8828
F-statistic: 110.2 on 2 and 27 DF,  p-value: 1.028e-13

此模型虽然通过了F检验和T检验，但与上一模型对比来看，模型残差的标准差上升，R²下降。这又是此模型的不足之处。

再作进一步的修正，考虑x₁与x₂交互作用，即模型为

Y=β₀+β₁X₁+β₂X₂+β₃X₂²+β₄X₁X₂+ε

> lm3.new<-update(lm.new,.~.+X1*X2)
> summary(lm3.new)
 
Call:
lm(formula = Y ~ X1 + X2 + I(X2^2) + X1:X2, data = toothpaste)
 
Residuals:
     Min       1Q   Median       3Q      Max
-0.43725 -0.11754  0.00489  0.12263  0.38410 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  29.1133     7.4832   3.890 0.000656 ***
X1           11.1342     4.4459   2.504 0.019153 *
X2           -7.6080     2.4691  -3.081 0.004963 **
I(X2^2)       0.6712     0.2027   3.312 0.002824 **
X1:X2        -1.4777     0.6672  -2.215 0.036105 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.2063 on 25 degrees of freedom
Multiple R-squared:  0.9209,	Adjusted R-squared:  0.9083
F-statistic: 72.78 on 4 and 25 DF,  p-value: 2.107e-13

模型通过T检验和F检验，并且模型残差的标准差减少，R²增加。因此，最终模型选为

Y=29.1133+11.1342X₁-7.6080X₂+0.6712X₂²-1.4777X₁X₂+ε