2017-05-20 16:14:53
目的
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. — Wikipedia
码本是数据集的基础说明文档。获得数据后,首先要阅读码本)。
码本描述数据集的内容、结构和版式。
撰写得当的码本"包含了数据文件中每个变量的完整的、自解释(self-explanatory)的信息"1
LEXINGTON-FAYETTE POLICE DATA Police Records for Auto Larceny and Commercial Burglary Codebook (ROARK SYSTEM) January 01, 1999 – Nov. 12, 2000 EVENTNO Indicates the dispatch number for the event, which is unique for each call for service. CASENO Indicates the dispatch number for the event, which is unique for each call for service. DATE Indicates the string form of the date that the call for service took place. For example an event occurring on Jan 2, 1999 would appear as 19990102. year Indicates the year in 4 digits that the call for service took place. For this database, 1999, 2000, or 2001. mon Indicates the month in two digits that the call for service took place. “01” = January “07” = July ... “06” = June “12” = December day Indicates the day in two digits that the call for service took place. txtdate Indicates the abbreviated form of the date that the call for service took place. (mm/dd/yy)
> Hmisc::describe(cars)
cars 2 Variables 50 Observations --------------------------------------- speed n missing distinct Info ... 50 0 19 0.996 ... Value 4 7 8 9 10 ... Frequency 2 2 1 1 3 ... Proportion 0.04 0.04 0.02 0.02 0.06 ... --------------------------------------- dist n missing distinct Info ... 50 0 35 0.999 ... lowest : 2 4 10 , highest: ... ---------------------------------------
> memisc::codebook(cars)
======================== speed ------------------------ Storage mode: double Min.: 4.000 1st Qu.: 12.000 Median: 15.000 Mean: 15.400 3rd Qu.: 19.000 Max.: 25.000 ======================== dist ------------------------ Storage mode: double Min.: 2.000 1st Qu.: 26.000 Median: 36.000 Mean: 42.980 3rd Qu.: 56.000 Max.: 120.000
> library(ggplot2) > ?diamonds
A data frame with 53940 rows and 10 variables: price price in US dollars (\$326–\$18,823) carat weight of the diamond (0.2–5.01) cut quality of the cut (Fair, Good, Very Good, Premium, Ideal) color diamond colour, from J (worst) to D (best) clarity a measurement of how clear the diamond is (I1 (worst), SI1, SI2, VS1, VS2, VVS1, VVS2, IF (best)) x length in mm (0–10.74) y width in mm (0–58.9) z depth in mm (0–31.8) depth total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79) table width of top of diamond relative to widest point (43–95)
> str(diamonds) >
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 53940 obs. of 10 variables: $ price : int 326 326 327 334 335 336 336 337 337 ... $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ... $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ... $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ... $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ... $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 ... $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 ... $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 ... $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 ... $ table : num 55 61 65 58 58 57 57 55 61 61 ...
> mean(c(1:10, NA)) [1] NA > lm(Solar.R~Ozone, airquality, na.action=NULL) Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'x'
optionas("na.action")
):
"na.omit"
, "na.fail", "na.exclude", "na.pass"
library(mice) md.pattern(airquality)
## Wind Temp Month Day Solar.R Ozone ## 111 1 1 1 1 1 1 0 ## 35 1 1 1 1 1 0 1 ## 5 1 1 1 1 0 1 1 ## 2 1 1 1 1 0 0 2 ## 0 0 0 0 7 37 44
> sum(complete.cases(airquality))/ + nrow(airquality) [1] 0.7254902
fluxplot(airquality)
从月份分布看,缺失不随机
> library(reshape2) > o <- dcast(airquality, Month ~ is.na(Ozone), + length, value.var="Month") > chisq.test(o[,2:3]) Pearson's Chi-squared test data: o[, 2:3] X-squared = 44.751, df = 4, p-value = 4.48e-09
但从气温和风速分布上看,缺失是随机的
> m1 <- t.test(Temp~is.na(Ozone), airquality) > m2 <- t.test(Wind~is.na(Ozone), airquality) > t(rbind(unlist(m1[1:4]), unlist(m2[1:4]))) [,1] [,2] statistic.t -0.02683064 -0.6091059 parameter.df 60.44704944 63.6459520 p.value 0.97868320 0.5446224 conf.int1 -3.64330595 -1.6893132 conf.int2 3.54684742 0.8999377
imp <- lapply(list(list("pmm", 5), list("pmm", 10), list("rf", 10)), function(lst){ mice(airquality, method=lst[[1]], maxit=lst[[2]], print=FALSE)}) summary(pool(with(imp[[1]], lm(Ozone~Temp+Wind))))
## est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda ## (Intercept) -67.409 20.7643 -3.246 110.04 1.549e-03 -108.558 -26.259 NA 0.09920 0.08297 ## Temp 1.781 0.2202 8.087 118.13 6.026e-13 1.345 2.217 0 0.08482 0.06946 ## Wind -2.967 0.6716 -4.418 34.94 9.192e-05 -4.330 -1.604 0 0.31569 0.27761
summary(pool(with(imp[[2]], lm(Ozone~Temp+Wind))))
## est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda ## (Intercept) -55.651 19.7734 -2.814 123.15 5.690e-03 -94.791 -16.511 NA 0.07578 0.06090 ## Temp 1.666 0.2085 7.991 136.58 5.005e-13 1.254 2.078 0 0.04943 0.03562 ## Wind -3.227 0.5825 -5.540 95.14 2.679e-07 -4.384 -2.071 0 0.12652 0.10835
summary(pool(with(imp[[3]], lm(Ozone~Temp+Wind))))
## est se t df Pr(>|t|) lo 95 hi 95 nmis fmi lambda ## (Intercept) -68.649 27.6978 -2.478 15.56 2.508e-02 -127.501 -9.796 NA 0.5142 0.4555 ## Temp 1.718 0.2949 5.826 15.80 2.712e-05 1.092 2.344 0 0.5100 0.4516 ## Wind -2.452 0.6780 -3.617 40.33 8.201e-04 -3.822 -1.082 0 0.2858 0.2512
boxplot(airquality$Wind)
> boxplot(airquality$Wind, plot=FALSE)$out [1] 20.1 18.4 20.7
> ub <- quantile(airquality$Wind, 3/4) + + 1.5 * IQR(airquality$Wind) > lb <- quantile(airquality$Wind, 1/4) - + 1.5 * IQR(airquality$Wind) > with(airquality, Wind[Wind>ub | Wind<lb]) [1] 20.1 18.4 20.7
> library(outliers) > grubbs.test(airquality$Wind) Grubbs test for one outlier data: airquality$Wind G = 3.04920, U = 0.93843, p-value = 0.1518 alternative hypothesis: highest value 20.7 is an outlier
最大值20.7不是异常值
处理: 删除、插值、变量变换、更改模型
"RR.engineer"、"minister"两个点对模型有较大影响。
car::influencePlot(lm(prestige ~ income + education, data=car::Duncan))
Anscombe, Francis J. (1973)
# 调整版面参数 op <- par(mfrow=c(2,2), mar=c(4, 4, 2, 2)) # 循环: 绘制密度直方图,并上覆核密度曲线 for (i in 1:4){ hist(airquality[,i], col="lightskyblue", main=paste("Histogram of", names(airquality)[i]), xlab=names(airquality)[i], border="white", prob=TRUE) lines(density(airquality[,i], na.rm=TRUE), lwd=2) } # 重置版面参数 par(op)
# 调整版面参数 op <- par(mfrow=c(1,2), mar=c(4, 4, 2, 2)) # 循环: 绘制条图 for (i in 5:6){ barplot(table(airquality[,i]), col="lightskyblue", main=paste("Count of", names(airquality)[i]), xlab=names(airquality)[i], border="deepskyblue", horiz=TRUE) } # 重置版面参数 par(op)
Thank you!