2017-05-18 21:03:11
统计学中,研究如何根据样本数据去推断总体数量特征的方法。 它是在对样本数据进行描述的基础上,对统计总体的未知数量特征做出以概率形式表述的推断。更概括地说,是在一段有限的时间内,通过对一个随机过程的观察来进行推断的。
– Wikipedia
用样本统计量的来估计相应总体参数,称为参数估计。
回答前面的问题,需要转化推断指标。如:假定收入服从正态分布,可以均数作为收入的估计。
假设检验是一种统计检验,用于确定样本数据中是否有足够证据来推断特定条件对总体是否为真。
基于“无罪推定”的原则,步骤:
研究结论 \ 实际情况 | H0正确 | H0错误 |
---|---|---|
拒绝H0 | I类错误(概率=\(\alpha\)) | 正确(概率=1-\(\alpha\)) |
接受H0 | 正确(概率=1-\(\beta\)) | II类错误(概率=\(\beta\)) |
以t检验为例,假设样本量=100,效应大小为0.3
library(pwr) alpha <- seq(0, 1, 0.01) beta <- 1 - sapply(alpha, function(a) { pwr.t.test(100, d=0.3, sig.level=a)$power}) df <- rbind( data.frame(alpha, class="beta", Value=beta), data.frame(alpha, class="alpha+beta", Value=alpha+beta)) dfmin <- c(alpha[which.min(alpha+beta)], min(alpha+beta)) library(ggplot2) ggplot(df, aes(alpha, Value, color=class)) + geom_point() + geom_line() + geom_vline(xintercept=dfmin[1]) + geom_hline(yintercept=dfmin[2]) + annotate("text", label=dfmin[1], x=dfmin[1]+0.05, y=dfmin[2]-0.02) + ggtitle(expression(paste( "Prob. of Type I Error (", alpha, ") And Type II Error (", beta, ")"))) + xlab(expression(alpha)) + theme_minimal() + theme(legend.position='bottom')
总体均值\(\mu\)的最小方差无偏估计为样本的均值\(\bar{x}\)
Edgar Anderson iris数据Sepal.Width的总体均数?
> x <- iris$Sepal.Width > sum(x)/length(x) # 或 > mean(x) 3.057333
\[s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\]
Sepal.Width的总体方差?
> sum((x-3.057333)^2)/(length(x)-1) # 或 > var(x) 0.1899794
样本均数服从正态分布
\[z=\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}\]
假设总体方差是0.1899794,Sepal.Width均数区间?
> mean(x) + qnorm(c(0.025, 0.975))*0.1899794/ + sqrt(length(x)) [1] 3.026931 3.087736
样本均数服从t分布
\[\bar{x} \pm t_{n - 1, 1 - \alpha / 2}{\frac{s}{\sqrt{n}}}\]
> mean(x) + qt(p=c(0.025, 0.975), df=length(x))* + sd(x)/sqrt(length(x)) [1] 2.987014 3.127653
\[[\frac{(n - 1) s^2}{\chi_{n - 1, \alpha/2}^{2}}, \frac{(n - 1) s^2}{\chi_{n - 1, 1 - \alpha / 2}^{2}}]\]
> (length(x)-1) * var(x) / + qchisq(c(0.025, 0.975), length(x)-1) [1] 0.2417372 0.1532698
样本量必须在3-5000之间
> shapiro.test(rnorm(100)) Shapiro-Wilk normality test data: rnorm(100) W = 0.98802, p-value = 0.5099
> shapiro.test(runif(100, -2, 2)) Shapiro-Wilk normality test data: runif(100, -2, 2) W = 0.94369, p-value = 0.0003262
> ks.test(rnorm(100), "pnorm") One-sample Kolmogorov-Smirnov test data: rnorm(100) D = 0.079472, p-value = 0.5528 alternative hypothesis: two-sided
> ks.test(rnorm(100), "pnorm", 1, 2) One-sample Kolmogorov-Smirnov test data: rnorm(100) D = 0.32728, p-value = 9.941e-10 alternative hypothesis: two-sided
> x <- rnorm(50) > y <- runif(30, -2, 2) > ks.test(x, y) Two-sample Kolmogorov-Smirnov test data: x and y D = 0.23333, p-value = 0.2264 alternative hypothesis: two-sided
t.test
)> t.test(rnorm(100), mu=0) One Sample t-test data: rnorm(100) t = -1.3261, df = 99, p-value = 0.1879 alternative hypothesis: true mean is not equal to 0 95 percent confidence interval: -0.33597711 0.06679886 sample estimates: mean of x -0.1345891
t.test
(一般不这样做)> wilcox.test(rpois(1:100, 1), mu=0) Wilcoxon signed rank test with continuity correction data: rpois(1:100, 1) V = 2016, p-value = 1.282e-12 alternative hypothesis: true location is not equal to 0
> x <- rnorm(100) > y <- rnorm(100, 1) > var.test(x, y) F test to compare two variances data: x and y F = 0.79822, num df = 99, denom df = 99, p-value = 0.2639 alternative hypothesis: true ratio of variances is not equal to 1 95 percent confidence interval: 0.5370739 1.1863377 sample estimates: ratio of variances 0.7982174
t.test
)> t.test(x, y) Welch Two Sample t-test data: x and y t = -5.7058, df = 195.54, p-value = 4.236e-08 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.0780640 -0.5242426 sample estimates: mean of x mean of y 0.2038993 1.0050526
> x <- rexp(1:200, 1) > y <- rexp(1:200, 2) > wilcox.test(x, y) Wilcoxon rank sum test with continuity correction data: x and y W = 26324, p-value = 4.514e-08 alternative hypothesis: true location shift is not equal to 0
t.test
(不要过度变换)> library(forecast) > x1 <- BoxCox(x, lambda=0.3) > y1 <- BoxCox(y, lambda=0.3) > shapiro.test(x1) # p-value=0.1488 > shapiro.test(y1) # p-value=0.1542 > t.test(x1, y1) Welch Two Sample t-test data: x1 and y1 t = 5.9563, df = 380.44, p-value = 5.878e-09 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.3402813 0.6756506 sample estimates: mean of x mean of y -0.4456814 -0.9536474
chickwts数据集,不同饲料(feed)喂养后体重(weight)有无差别?
统计假设
前置条件: 两饲料组样本方差齐不齐?
> bartlett.test(weight~feed, chickwts) Bartlett test of homogeneity of variances data: weight by feed Bartlett's K-squared = 3.2597, df = 5, p-value = 0.66
各组体重方差齐
aov
> summary(aov(weight~feed, chickwts)) Df Sum Sq Mean Sq F value Pr(>F) feed 5 231129 46226 15.37 5.94e-10 *** Residuals 65 195556 3009 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova
(必须配合lm
)> anova(lm(weight~feed, chickwts)) Analysis of Variance Table Response: weight Df Sum Sq Mean Sq F value Pr(>F) feed 5 231129 46226 15.365 5.936e-10 *** Residuals 65 195556 3009 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
哪个饲料组体重更大?哪个更小?
多重比较必须修正\(\alpha\),否则会导致I类错误增高
TukeyHSD
TukeyHSD(x, which, ordered = FALSE, conf.level = 0.95, ...)
pairwise.t.test
pairwise.t.test(x, g, p.adjust.method = p.adjust.methods, pool.sd = !paired, paired = FALSE, alternative = c("two.sided", "less", "greater"), ...)
> TukeyHSD(aov(weight~feed, chickwts)) Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = weight ~ feed, data = chickwts) $feed diff lwr upr p adj horsebean-casein -163.383 -232.347 -94.42 0.00000 linseed-casein -104.833 -170.587 -39.08 0.00021 meatmeal-casein -46.674 -113.906 20.56 0.33246 soybean-casein -77.155 -140.517 -13.79 0.00836 sunflower-casein 5.333 -60.421 71.09 0.99989 linseed-horsebean 58.550 -10.413 127.51 0.14133 meatmeal-horsebean 116.709 46.335 187.08 0.00011 soybean-horsebean 86.228 19.542 152.91 0.00427 sunflower-horsebean 168.717 99.753 237.68 0.00000 meatmeal-linseed 58.159 -9.073 125.39 0.12770 soybean-linseed 27.678 -35.684 91.04 0.79328 sunflower-linseed 110.167 44.412 175.92 0.00009 soybean-meatmeal -30.480 -95.375 34.41 0.73913 sunflower-meatmeal 52.007 -15.224 119.24 0.22070 sunflower-soybean 82.488 19.123 145.85 0.00388
> with(chickwts, pairwise.t.test(weight, feed)) Pairwise comparisons using t tests with pooled SD data: weight and feed casein horsebean linseed meatmeal soybean horsebean 2.9e-08 - - - - linseed 0.00016 0.09435 - - - meatmeal 0.18227 9.0e-05 0.09435 - - soybean 0.00532 0.00298 0.51766 0.51766 - sunflower 0.81249 1.2e-08 8.1e-05 0.13218 0.00298 P value adjustment method: holm
> summary(aov(yield~N*P*K, npk)) Df Sum Sq Mean Sq F value Pr(>F) N 1 189.3 189.28 6.161 0.0245 * P 1 8.4 8.40 0.273 0.6082 K 1 95.2 95.20 3.099 0.0975 . N:P 1 21.3 21.28 0.693 0.4175 N:K 1 33.1 33.14 1.078 0.3145 P:K 1 0.5 0.48 0.016 0.9019 N:P:K 1 37.0 37.00 1.204 0.2887 Residuals 16 491.6 30.72 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
将那些很难人为控制的控制因素作为协变量,在排除协变量对观测变量影响的条件下,分析控制变量对观测变量的作用, 从而更准确地评价控制因素。
> summary(aov(yield~N*P*K+block, npk)) Df Sum Sq Mean Sq F value Pr(>F) N 1 189.3 189.28 12.26 0.0044 ** P 1 8.4 8.40 0.54 0.4749 K 1 95.2 95.20 6.17 0.0289 * block 5 343.3 68.66 4.45 0.0159 * N:P 1 21.3 21.28 1.38 0.2632 N:K 1 33.1 33.14 2.15 0.1687 P:K 1 0.5 0.48 0.03 0.8628 Residuals 12 185.3 15.44 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> kruskal.test(Ozone ~ Month, data = airquality) Kruskal-Wallis rank sum test data: Ozone by Month Kruskal-Wallis chi-squared = 29.267, df = 4, p-value = 6.901e-06
Thank you!