UC Berkeley Stat2.3x Inference Study Notes: Section 4 Dependent Samples

Stat2.3x Inference was taught by the University of California, Berkeley on the edX platform in 2014.

Summary

Dependent Variables (paired samples)

For paired variables $X$ and $Y$, the SD of the difference $X-Y$ is $$\sqrt{\sigma_x^2+\sigma_y^2-2\cdot r\cdot\sigma_x\cdot\sigma_y}$$ where $r$ is the correlation between $X$ and $Y$.

The correlation is $$r=\frac{1}{n}\cdot\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{\sigma_x}\cdot\frac{y_i-\bar{y}}{\sigma_y}\right)$$ where $$\sigma_x=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}$$ $$\sigma_y=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\bar{y})^2}$$

R function:
```
cor(x, y)
```
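
As a quick numerical check of the SD formula, the sketch below compares it with the SD of the paired differences computed directly. It uses the baseline and final measurements from Problem 3 of Exercise Set 4; the variable names are illustrative. R's `sd()` divides by $n-1$ rather than $n$, but the identity holds either way as long as both SDs use the same convention.

```r
# Paired measurements (baseline and final values from Exercise Set 4, Problem 3)
x <- c(4.19, 4.52, 4.50, 3.90, 4.33, 4.30, 3.94, 4.35, 4.21, 4.17)
y <- c(4.17, 4.20, 4.53, 3.95, 4.15, 4.19, 3.96, 4.26, 4.07, 3.93)

r <- cor(x, y)

# SD of the difference from the formula
sd_formula <- sqrt(sd(x)^2 + sd(y)^2 - 2 * r * sd(x) * sd(y))

# SD of the difference computed directly from the paired differences
sd_direct <- sd(x - y)

# TRUE if the two agree up to floating-point tolerance
all.equal(sd_formula, sd_direct)
```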

ADDITIONAL PRACTICE PROBLEMS FOR EXERCISE SET 4

PROBLEM 1

To see whether caffeine affects the speed at which mice can run to a "reward", a simple random sample of 50 mice was taken from a large population of mice. Each mouse ran twice; once before the caffeine, and once after. The "before" run times had a mean of 32 seconds and an SD of 3 seconds. The "after" run times had a mean of 30 seconds and an SD of 3.5 seconds. The correlation between the "before" and "after" run times was 0.7. For 32 of the 50 mice, the "after" run time was shorter than the "before". Which of the following is a correct $z$ statistic to test whether mice in this population run faster after caffeine? More than one answer might be correct.

a) $(2 - 0) / \sqrt{0.424^2 + 0.495^2}$

b) $(2 - 0) / 0.362$

c) $(31.5 - 25) / 3.535$

Solution

The two samples are paired (each mouse ran twice), so the independent-samples SE in (a) is incorrect. For the difference in means, we have $$H_0: \mu_1 = \mu_2$$ $$H_A: \mu_1 > \mu_2$$ and the sample values $$n=50, \mu_1=32, \mu_2=30, \sigma_1=3, \sigma_2=3.5, r=0.7$$ Therefore the SD of the differences is $$\sigma=\sqrt{\sigma_1^2+\sigma_2^2-2\cdot r\cdot\sigma_1\cdot\sigma_2}=\sqrt{3^2+{3.5}^2-2\times0.7\times3\times3.5}\approx 2.559$$ $$SE=\frac{\sigma}{\sqrt{n}}\approx 0.362\Rightarrow z=\frac{\mu_1-\mu_2}{SE}=\frac{2-0}{0.362}\approx 5.53$$ Thus, (b) is correct. The P-value is about $1.6\times10^{-8}$, far below 5%, so we reject $H_0$ and conclude that $\mu_1 > \mu_2$: the mice run faster after caffeine. R code:

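```
# SD of the paired differences, then SE of the mean difference
sd_diff = sqrt(3^2 + 3.5^2 - 2 * 0.7 * 3 * 3.5)
se = sd_diff / sqrt(50); z = 2 / se
se
# [1] 0.3619392
# one-sided P-value
1 - pnorm(z)
# [1] 1.640035e-08
```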

(c) is correct, too. This is a sign test, which treats each mouse like a coin toss. $$H_0: p=0.5$$ $$H_A: p > 0.5$$ where $p$ is the proportion of mice in the population that run faster after caffeine. The observed number of heads is 32. If the null were true, we would expect 25, give or take $$SE=\sqrt{\frac{p\cdot(1-p)}{n}}\cdot n=\sqrt{n\cdot p\cdot(1-p)}=3.535$$ Thus, with the continuity correction, $$z=\frac{31.5-25}{3.535}$$ The P-value is below 5%, so we reject $H_0$ and conclude that $p > 0.5$. R code:

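```
p = 0.5; n = 50
# SE of the number of "faster" mice under the null
se = sqrt(n * p * (1 - p))
se
# [1] 3.535534
# z statistic with the continuity correction, as in option (c)
z = (31.5 - n * p) / se
round(1 - pnorm(z), 3)
# [1] 0.033
```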
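
Under the null, the number of faster mice has a binomial distribution, so the normal approximation can be checked against the exact tail probability. The following is a minimal sketch using base R's `pbinom`; the variable names are illustrative and this comparison is not part of the original solution.

```r
n <- 50; p <- 0.5; observed <- 32

# Exact chance of 32 or more "heads" in 50 tosses of a fair coin
p_exact <- 1 - pbinom(observed - 1, n, p)

# Normal approximation with continuity correction
se_count <- sqrt(n * p * (1 - p))
p_normal <- 1 - pnorm((observed - 0.5 - n * p) / se_count)

c(exact = p_exact, normal = p_normal)
```

The two values should be close to each other, which is what justifies using the $z$ statistic in option (c).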

PROBLEM 2

In a study on weight loss, a simple random sample of 500 of the 750 participants was placed in the "Diet 1" group and the remaining 250 in the "Diet 2" group. After the treatment, the average weight loss in the "Diet 1" group was 4.3 pounds with an SD of 1.2 pounds; the average weight lost in the "Diet 2" group was 3.9 pounds with an SD of 1.7 pounds. In the "Diet 1" group, 57% of the participants lost weight, compared to 54% in the "Diet 2" group.

a) To test whether the diet affected the mean amount of weight lost, the $z$ statistic is (fill in the blank): $(0.4 - 0)/( )$

b) To test whether the diet affects the percent of people who lose weight, the $z$ statistic is $(3 - 0)/( )$

Solution

The two diet groups are treated as independent samples.

a. $$H_0: \mu_1=\mu_2$$ $$H_A: \mu_1\neq\mu_2$$ The sample values are $$n_1=500, n_2=250, \mu_1=4.3, \mu_2=3.9, \sigma_1=1.2, \sigma_2=1.7$$ $$\Rightarrow SE=\sqrt{SE_1^2+SE_2^2}=\sqrt{\left(\frac{\sigma_1}{\sqrt{n_1}}\right)^2+\left(\frac{\sigma_2}{\sqrt{n_2}}\right)^2}=0.1201666$$ so the blank is about 0.12 and $z=(0.4-0)/0.12\approx 3.33$. The two-sided P-value is 0.0008724816, which is smaller than 0.05, so we reject $H_0$ and conclude that $\mu_1\neq\mu_2$. R code:

```
mu1 = 4.3; mu2 = 3.9; sd1 = 1.2; sd2 = 1.7; n1 = 500; n2 = 250
# SE of the difference between two independent sample means
se1 = sd1 / sqrt(n1); se2 = sd2 / sqrt(n2); se = sqrt(se1^2 + se2^2)
se
# [1] 0.1201666
z = (mu1 - mu2) / se
# two-sided P-value
(1 - pnorm(z)) * 2
# [1] 0.0008724816
```

b. $$H_0: p_1=p_2$$ $$H_A: p_1\neq p_2$$ The sample values are $$p_1=0.57, p_2=0.54, n_1=500, n_2=250$$ and the pooled estimate is $$\hat{p}=\frac{n_1\cdot p_1+n_2\cdot p_2}{n_1+n_2}=0.56$$ $$\Rightarrow SE=\sqrt{SE_1^2+SE_2^2}=\sqrt{\frac{\hat{p}\cdot(1-\hat{p})}{n_1}+\frac{\hat{p}\cdot(1-\hat{p})}{n_2}}=0.03844997$$ In percent, the blank is about 3.8, and $z=(3-0)/3.845\approx 0.78$. R code:

```
p1 = 0.57; p2 = 0.54; n1 = 500; n2 = 250
# pooled estimate of the common proportion under the null
p = (n1 * p1 + n2 * p2) / (n1 + n2)
se1 = sqrt(p * (1 - p) / n1); se2 = sqrt(p * (1 - p) / n2)
se = sqrt(se1^2 + se2^2)
se
# [1] 0.03844997
z = (p1 - p2) / se
# two-sided P-value
(1 - pnorm(z)) * 2
# [1] 0.4352527
```

The P-value is larger than 0.05, so we fail to reject $H_0$: the data are consistent with $p_1=p_2$.
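
Part (b) can also be cross-checked with base R's `prop.test`. For two groups with `correct = FALSE`, it reports a chi-squared statistic equal to the square of the pooled $z$ statistic. This is a minimal sketch; the counts 285 and 135 are derived from the stated 57% of 500 and 54% of 250, and the variable names are illustrative.

```r
# Number of participants who lost weight in each group
lost <- c(285, 135)
size <- c(500, 250)

# Two-sample test of equal proportions, without continuity correction
res <- prop.test(lost, size, correct = FALSE)

# The chi-squared statistic (1 degree of freedom) is the square of z
z <- sqrt(unname(res$statistic))
z
res$p.value
```

The P-value reported by `prop.test` should agree with the two-sided P-value from the $z$ test.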

EXERCISE SET 4

If a problem asks for an approximation, please use the methods described in the video lecture segments. Unless the problem says otherwise, please give answers correct to one decimal place according to those methods. Some of the problems below are about simple random samples. If the population size is not given, you can assume that the correction factor for standard errors is close enough to 1 that it does not need to be computed. Please use the 5% cutoff for P-values unless otherwise instructed in the problem.

PROBLEM 1

In a study of the effect of a medical treatment, a simple random sample of 300 of the 500 participating patients was assigned to the treatment group; the remaining patients formed the control group. When the patients were assessed at the end of the study, favorable outcomes were observed in 162 patients in the treatment group and 97 patients in the control group. Did the treatment have an effect, or is this just chance variation? Perform a statistical test, following the steps in Problems 1A-1D.

1A The null hypothesis is (pick the best among the options):

a. The treatment has an effect which could be good or bad.

b. The treatment has a good effect.

c. The treatment has no effect.

d. The treatment has a bad effect.

1B Under the null hypothesis, the SE of the difference between the percents of favorable outcomes in the two groups is about( )%.

1C The $z$ statistic is closest to?

1D The conclusion of the test is (pick the better of the two options): The observed difference is due to chance. The treatment has an effect.

Solution

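1A) (c) is correct: the null hypothesis is that the treatment has no effect. $$H_0: p_1=p_2$$ $$H_A: p_1 \neq p_2$$ where $p_1$ and $p_2$ are the chances of a favorable outcome with and without the treatment. The observed proportions are $\frac{162}{300}=0.54$ and $\frac{97}{200}=0.485$.

1B) Under the null hypothesis the two groups share a common proportion, so we use the pooled estimate. $$\hat{p}=\frac{162+97}{300+200}=0.518$$ $$SE=\sqrt{SE_1^2+SE_2^2}=\sqrt{\frac{\hat{p}\cdot(1-\hat{p})}{n_1}+\frac{\hat{p}\cdot(1-\hat{p})}{n_2}}\approx 0.0456$$ that is, about 4.6%. (The unpooled formula gives 0.04557274, which is the same to one decimal place.)

1C) $$z=\frac{0.54-0.485}{SE}=1.205771\approx 1.2$$

1D) The two-sided P-value is about 0.23 (the one-sided P-value is about 0.11), which is larger than 0.05, so we fail to reject $H_0$. The conclusion is that the observed difference is due to chance. R code:

```
p1 = 162 / 300; p2 = 97 / 200; n1 = 300; n2 = 200
# pooled estimate of the common proportion under the null
p = (162 + 97) / (n1 + n2)
se = sqrt(p * (1 - p) / n1 + p * (1 - p) / n2)
z = (p1 - p2) / se; z
# [1] 1.205771
# two-sided P-value
round(2 * (1 - pnorm(z)), 3)
# [1] 0.228
```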
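
Because the treatment group is a simple random sample of the 500 participants, the number of favorable outcomes that land in the treatment group follows a hypergeometric distribution under the null hypothesis. As a cross-check on the normal approximation, here is a sketch using base R's `phyper`. This exact calculation is added for illustration only and is not part of the course solution; the variable names are hypothetical.

```r
favorable <- 162 + 97          # favorable outcomes among all 500 patients
unfavorable <- 500 - favorable
n_treat <- 300                 # size of the treatment group

# Chance that 162 or more of the favorable outcomes fall in the treatment group
p_one_sided <- 1 - phyper(161, favorable, unfavorable, n_treat)
p_one_sided
```

A one-sided chance above 5% here points to the same conclusion as the $z$ test: the observed difference could well be chance variation.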

PROBLEM 2

In a simple random sample of 250 father-son pairs taken from a large population of such pairs, the mean height of the fathers is 68.5 inches and the SD is 2.5 inches; the mean height of the sons is 69 inches and the SD is 3 inches; the correlation between the heights of the fathers and sons is 0.5. In the population, are the sons taller than their fathers, on average? Or is this just chance variation? Follow the steps in Problems 2A-2B.

2A The SE of the mean difference between heights of fathers and sons in the sample is closest to?

2B Which of the following most closely represents the result of the test?

a. The result is not statistically significant, so we conclude that it is due to chance variation.

b. The result is not statistically significant, so we conclude that the sons are taller than their fathers, on average.

c. The result is highly statistically significant, so we conclude that the sons are taller than their fathers, on average.

d. The result is highly statistically significant, so we conclude that it is due to chance variation.

Solution

2A) The samples are paired (father-son pairs), so the two variables are dependent. $$H_0: \mu_1=\mu_2$$ $$H_A: \mu_1 < \mu_2$$ where $\mu_1, \mu_2$ represent the average heights of fathers and sons, respectively. The sample values are $$n=250, \sigma_1=2.5, \sigma_2=3, \mu_1=68.5, \mu_2=69, r=0.5$$ and $$SE_1=\frac{\sigma_1}{\sqrt{n}}, SE_2=\frac{\sigma_2}{\sqrt{n}}$$ Thus $$SE=\sqrt{SE_1^2+SE_2^2-2\cdot r\cdot SE_1\cdot SE_2}=0.1760682$$

2B) (c) is correct. $$z = \frac{\mu_1-\mu_2}{SE}=-2.839809$$ The P-value is $0.002257026 < 0.05$, so the result is highly statistically significant. Therefore, we reject $H_0$ and conclude that $\mu_1 < \mu_2$: the sons are taller than their fathers, on average. R code:

```
n = 250; mu1 = 68.5; mu2 = 69; sigma1 = 2.5; sigma2 = 3; r = 0.5
se1 = sigma1 / sqrt(n); se2 = sigma2 / sqrt(n)
# SE of the mean difference for paired samples
se = sqrt(se1^2 + se2^2 - 2 * r * se1 * se2)
se
# [1] 0.1760682
z = (mu1 - mu2) / se; z
# [1] -2.839809
# one-sided (left-tail) P-value
pnorm(z)
# [1] 0.002257026
```
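
The paired calculation follows the same steps each time, so it can be wrapped in a small helper. This is a sketch under the assumptions of these notes (large sample, summary statistics only); `paired_z` and its argument names are hypothetical, not functions from the course or from base R.

```r
# z test for the mean difference of paired samples, from summary statistics
paired_z <- function(mean_x, mean_y, sd_x, sd_y, r, n) {
  # SD of the differences, using the correlation between the two variables
  sd_diff <- sqrt(sd_x^2 + sd_y^2 - 2 * r * sd_x * sd_y)
  se <- sd_diff / sqrt(n)
  z <- (mean_x - mean_y) / se
  list(se = se, z = z)
}

# Fathers and sons (this problem)
fathers_sons <- paired_z(68.5, 69, 2.5, 3, 0.5, 250)
fathers_sons$se
fathers_sons$z

# Mice before and after caffeine (additional practice Problem 1)
mice <- paired_z(32, 30, 3, 3.5, 0.7, 50)
mice$se
```

Dividing the SD of the differences by $\sqrt{n}$ gives the same SE as combining $SE_1$ and $SE_2$ with the correlation term, since the factor $1/\sqrt{n}$ is common to every term under the square root.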

PROBLEM 3

A group of scientists is studying whether a new medical treatment has an adverse (bad) effect on lung function. Here are data on a simple random sample of 10 patients taken from a large population of patients in the study. Both variables are measurements, in liters, of the amount of air that the patient can blow out (this is a very rough description of a well-defined measure). The bigger a measurement is, the better the lung function. The "baseline" measurement was taken before the treatment, and the "final" measurement was taken after the treatment.

| Baseline | Final |
|---|---|
| 4.19 | 4.17 |
| 4.52 | 4.20 |
| 4.50 | 4.53 |
| 3.90 | 3.95 |
| 4.33 | 4.15 |
| 4.30 | 4.19 |
| 3.94 | 3.96 |
| 4.35 | 4.26 |
| 4.21 | 4.07 |
| 4.17 | 3.93 |

In case you need summary statistics, here are some that are commonly used; the SDs have $n-1 = 9$ in the denominator.

Baseline: mean 4.241, SD 0.2065

Final: mean 4.141, SD 0.1798

Correlation between baseline and final: 0.8055

Perform a one-sided test at the 5% level, following the steps in Problems 3A-3C.

3A Based on the information given, which test should you perform?

a. binomial test for the fairness of a coin

b. one-sample $z$ test for a population mean (quantitative variable; not proportions of zeros and ones)

c. one-sample $t$ test for a population mean

d. two-sample $z$ test for the difference between population means, based on independent samples

e. two-sample $z$ test for the effect of a treatment, applied to the results of a randomized controlled experiment

3B The P-value of the test is:

less than 1%

between 1% and 5%

between 5% and 10%

between 10% and 15%

between 15% and 20%

3C The conclusion of the test is (pick one of the two options): The treatment had a bad effect. The results are due to chance variation.

Solution

3A) (a) is correct. The data are paired, so this will be a one-sample test; this rules out (d) and (e). There are only 10 observations, so the probability distribution of the sample mean need not be normal; this rules out (b). It cannot be a $t$ test either, since nothing is assumed about the normality of the underlying population (the $t$ test requires a roughly normal population with unknown mean and SD); this rules out (c). The only option left is to compare the results to tosses of a coin. Define a "head" to be a patient whose score goes down after treatment. Then we test whether the number of heads is like the result of tossing a coin 10 times, or whether there are too many heads for "coin tossing" to be a reasonable explanation. $$H_0: p=0.5$$ $$H_A: p>0.5$$ where $p$ is the chance of a head; the proportion of heads in this sample is 0.7. The mean, SD and $r$ given in the problem can be reproduced in R:

```
base = c(4.19, 4.52, 4.5, 3.9, 4.33, 4.3, 3.94, 4.35, 4.21, 4.17)
final = c(4.17, 4.2, 4.53, 3.95, 4.15, 4.19, 3.96, 4.26, 4.07, 3.93)
# sd() uses n - 1 in the denominator
mean(base); sd(base); mean(final); sd(final); cor(base, final)
# [1] 4.241
# [1] 0.2064757
# [1] 4.141
# [1] 0.1798425
# [1] 0.8054805
```

3B) In 7 of the 10 pairs, the patient's score went down. So we want the chance of 7 or more heads in 10 tosses of a coin. Under the null, the number of heads has a binomial distribution with $n=10$ and $p=0.5$, so the P-value is $$\sum_{k=7}^{10}C_{10}^{k}\cdot0.5^k\cdot0.5^{10-k}=0.171875$$ which is between 15% and 20%. R code:

```
# chance of 7, 8, 9 or 10 heads in 10 tosses of a fair coin
sum(dbinom(7:10, 10, 0.5))
# [1] 0.171875
```

3C) The P-value is 0.171875, which is larger than 0.05, so we fail to reject $H_0$. The conclusion is that the results are due to chance variation; the data are consistent with $p=0.5$.
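
The whole sign test can also be run directly from the data with base R's `binom.test`. This is a minimal sketch: the data vectors are those given in the problem, and the name `heads` is illustrative.

```r
base <- c(4.19, 4.52, 4.50, 3.90, 4.33, 4.30, 3.94, 4.35, 4.21, 4.17)
final <- c(4.17, 4.20, 4.53, 3.95, 4.15, 4.19, 3.96, 4.26, 4.07, 3.93)

# A "head" is a patient whose measurement went down after treatment
heads <- sum(final < base)

# One-sided exact binomial test of p = 0.5 against p > 0.5
res <- binom.test(heads, length(base), p = 0.5, alternative = "greater")
res$p.value
```

The P-value from `binom.test` is the same binomial tail sum computed in 3B.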
