Data Mining with R
Installing R on Ubuntu
sudo apt-get install r-base r-base-dev
CRAN (Comprehensive R Archive Network) hosts the R releases and the add-on packages.
Install add-on packages available for R at CRAN
For example, to install RMySQL:
> options(repos=c(CRAN='http://cran.r-project.org'))
> install.packages("RMySQL")
Managing your sessions
Run a file that contains R statements:
> source('mycode.R')
In Unix versions you may use the functions getwd() and setwd(<directory path>) to, respectively, check and change the current working directory.
You can also save objects from the current workspace to a file and load them back when needed, which is very convenient.
> save(f,my.dataset,file='mysession.R')
> load('mysession.R')
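Data structures in R
The following sections introduce R's data structures. At first glance they look as convenient as Python's.
With more use they turn out to be very powerful and flexible, and well suited to data analysis.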
R objects
> z <- 5
> w <- z^2
> w
[1] 25
> i <- (z*2 + 45)/2
> i
[1] 27.5
> ls()
[1] "i" "w" "y" "z"
> rm(y)
> rm(z,w,i)
Note that names in R are case sensitive, meaning that Color and color are two distinct objects.
Old versions of R did not allow the underscore character '_' in object names; current versions accept it.
Vectors
The most basic data object in R is a vector. Even when you assign a single number to an object (like in x <- 45.3) you are creating a vector containing a single element.
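Interesting: the most basic structure in R is the vector. Does this hurt efficiency, or was it chosen for consistency in how types are used?
All elements of a vector share the same mode, which can take the values character, logical, numeric or complex.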
> v <- c(4,7,23.5,76.2,80)
> v
[1] 4.0 7.0 23.5 76.2 80.0
> length(v)
[1] 5
> mode(v)
[1] "numeric"
All vectors may contain a special value named NA, which represents a missing value.
> v <- c(NA,"rrr")
> v
[1] NA "rrr"
> u <- c(4,6,NA,2)
> u
[1] 4 6 NA 2
Also note that vector indices start at 1, not at 0 as in most languages.
> v[1] <- 'hello'
> v
[1] "hello" "rrr"
An operation applied directly to a vector is applied to each of its elements.
> v <- c(4,7,23.5,76.2,80)
> x <- sqrt(v)
> x
[1] 2.000000 2.645751 4.847680 8.729261 8.944272
When two vectors of different lengths are combined, the shorter one is recycled to match the longer one.
> v1 <- c(4,6,8,24)
> v2 <- c(10,2,4)
> v1+v2
[1] 14 8 12 34
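The recycling rule can be made explicit with a small sketch. The vectors below are illustrative values, not taken from the book; the point is that recycling gives the same result as repeating the shorter vector by hand. When the longer length is not a multiple of the shorter one, as in the v1+v2 example above, R still recycles but also issues a warning.

```r
# Recycling: the shorter vector is repeated until it matches the longer one.
v1 <- c(4, 6, 8, 24)
v2 <- c(10, 2)

recycled <- v1 + v2                                   # implicit recycling
explicit <- v1 + rep(v2, length.out = length(v1))     # the same thing spelled out

print(recycled)
print(identical(recycled, explicit))
```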
Factors
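Factors are a concept specific to R, and the name is not very self-explanatory.
A factor suits a sequence whose values come from a finite, usually small, set, such as sex (F, M) or age (1 to 100).
In my view, converting such a sequence into a factor has two benefits:
1. Compact storage, because a factor stores its values internally as integer codes.
2. Easy counting, because the table function can be applied to a factor directly.
> g <- c('f','m','m','m','f','m','f','m','f','f')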
> g
[1] "f" "m" "m" "m" "f" "m" "f" "m" "f" "f"
You can transform this vector into a factor by entering,
> g <- factor(g)
> g
[1] f m m m f m f m f f
Levels: f m
Use table to count the occurrences of each level:
> table(g)
g
f m
5 5
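The claim that a factor is stored as integer codes plus a small lookup table of levels can be checked directly. The sketch below reuses the same ten values; as.integer() and levels() are base R functions.

```r
g <- factor(c('f', 'm', 'm', 'm', 'f', 'm', 'f', 'm', 'f', 'f'))

codes <- as.integer(g)   # integer code of each element: 1 for 'f', 2 for 'm'
lv <- levels(g)          # the lookup table of distinct values

print(codes)
print(lv)

# The original strings can be rebuilt from the codes and the levels.
print(identical(lv[codes], as.character(g)))
```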
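More powerfully, table can cross-tabulate several factors. Here g is the sex and a indicates whether the person is an adult.
> g <- factor(c('f','m','m','m','f','m','f','m','f','f'))
> a <- factor(c('adult','adult','juvenile','juvenile','adult','adult','adult','juvenile','adult','juvenile'))
> table(a,g)
          g
a          f m
  adult    4 2
  juvenile 1 3
Several other convenient functions work on tables.
Marginal counts along a single dimension: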
> t <- table(a,g)
> margin.table(t,1)
a
adult juvenile
6 4
> margin.table(t,2)
g
f m
5 5
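Proportions
Proportions within each level of the first factor (each row sums to 1):
> prop.table(t,1)
          g
a                  f         m
  adult    0.6666667 0.3333333
  juvenile 0.2500000 0.7500000
Proportions within each level of the second factor (each column sums to 1):
> prop.table(t,2)
          g
a            f   m
  adult    0.8 0.4
  juvenile 0.2 0.6
Proportions of the overall total:
> prop.table(t)
          g
a            f   m
  adult    0.4 0.2
  juvenile 0.1 0.3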
Generating sequences
Sequences work much like in Python, but are more flexible.
Keep in mind that ':' binds more tightly than the binary arithmetic operators.
You should be careful with the precedence of the operator “:”. The following examples illustrate this danger,
> 10:15-1
[1] 9 10 11 12 13 14
> 10:(15-1)
[1] 10 11 12 13 14
To generate sequences of real numbers you can use the function seq():
> seq(from=1,to=5,length=4)
[1] 1.000000 2.333333 3.666667 5.000000
> seq(from=1,to=5,length=2)
[1] 1 5
> seq(length=10,from=-2,by=.2)
[1] -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2
Using rep():
> rep(5,10)
[1] 5 5 5 5 5 5 5 5 5 5
> rep('hi',3)
[1] "hi" "hi" "hi"
> rep(1:3,2)
[1] 1 2 3 1 2 3
Using gl():
The function gl() can be used to generate sequences involving factors. The syntax of this function is gl(k,n), where k is the number of levels of the factor, and n the number of repetitions of each level.
> gl(3,5)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3
> gl(2,5,labels=c('female','male'))
[1] female female female female female male male male male male
Levels: female male
Generating random sequences
If you want 10 randomly generated numbers from a normal distribution with zero mean and unit standard deviation, type
> rnorm(10)
[1] -0.306202028 0.335295844 1.199523068 2.034668704 0.273439339
[6] -0.001529852 1.351941008 1.643033230 -0.927847816 -0.163297158
while if you prefer a mean of 10 and a standard deviation of 3, you should use
> rnorm(10,mean=10,sd=3)
[1] 7.491544 12.360160 12.879259 5.307659 11.103252 18.431678 9.554603
[8] 9.590276 7.133595 5.498858
Indexing
This section describes some very flexible ways of indexing.
> x <- c(0,-3,4,-1,45,90,-5)
> x
[1] 0 -3 4 -1 45 90 -5
A logical expression can be used directly as the index, which is very convenient.
> x[x>0]
[1] 4 45 90
> x[x <= -2 | x > 5]
[1] -3 45 90 -5
> x[x > 40 & x < 100]
[1] 45 90
> x[c(4,6)]
[1] -1 90
> x[1:3]
[1] 0 -3 4
A negative index works differently from Python: here '-' means exclusion, so -1 drops the first element.
> x[-1]
[1] -3 4 -1 45 90 -5
> x[-c(4,6)]
[1] 0 -3 4 45 -5
> x[-(1:3)]
[1] -1 45 90 -5
Elements can be given names, which then replace abstract numeric indices.
> pH <- c(4.5,7,7.3,8.2,6.3)
> names(pH) <- c('area1','area2','mud','dam','middle')
> pH
area1 area2 mud dam middle
4.5 7.0 7.3 8.2 6.3
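Once a vector has names, they can be used wherever a numeric index would go. The following sketch reuses the pH vector; the variable names mud.value, two and above7 are introduced here only for illustration.

```r
pH <- c(4.5, 7, 7.3, 8.2, 6.3)
names(pH) <- c('area1', 'area2', 'mud', 'dam', 'middle')

mud.value <- pH['mud']              # a single element selected by name
two <- pH[c('area1', 'dam')]        # several elements selected by name
above7 <- names(pH)[pH > 7]         # names of the elements that satisfy a condition

print(mud.value)
print(two)
print(above7)
```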
Matrices and arrays
Data elements can be stored in an object with more than one dimension.
Two-dimensional objects are called matrices.
Objects with more dimensions are called arrays.
> m <- c(45,23,66,77,33,44,56,12,78,23)
> m
[1] 45 23 66 77 33 44 56 12 78 23
> dim(m) <- c(2,5)
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 45 66 33 56 78
[2,] 23 77 44 12 23
> m <- matrix(c(45,23,66,77,33,44,56,12,78,23),2,5)
Lists
R lists consist of an ordered collection of other objects known as its components.
A list can hold components of different types, and each component can be given its own name.
> my.lst <- list(stud.id=34453, stud.name="John", stud.marks=c(14.3,12,15,19))
> my.lst[[1]]
[1] 34453
> my.lst[[3]]
[1] 14.3 12.0 15.0 19.0
> my.lst[1]
$stud.id
[1] 34453
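The difference between single and double brackets is easy to miss in the output above. A brief sketch with the same list: single brackets return a sub-list, double brackets return the component itself, and $ accesses a component by name. The variable names on the left-hand sides are illustrative.

```r
my.lst <- list(stud.id = 34453, stud.name = "John",
               stud.marks = c(14.3, 12, 15, 19))

sub.list <- my.lst[1]            # single brackets: still a list, of length 1
id <- my.lst[[1]]                # double brackets: the component itself
student <- my.lst$stud.name      # $ selects a component by name
avg <- mean(my.lst$stud.marks)   # components can be used like any other object

print(class(sub.list))
print(class(id))
print(student)
print(avg)
```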
Data frames
A data frame is similar to a matrix but with named columns. However, contrary to matrices, data frames may include data of a different type in each column.
In other words, a data frame is a matrix-like structure whose columns are named and may each hold a different type of data.
> my.dataset <- data.frame(site=c('A','B','A','A','B'), season=c('Winter','Summer','Summer','Spring','Fall'), pH = c(7.4,6.3,8.6,7.2,8.9))
> my.dataset
site season pH
1 A Winter 7.4
2 B Summer 6.3
3 A Summer 8.6
4 A Spring 7.2
5 B Fall 8.9
Data frames can be queried flexibly.
> my.dataset[my.dataset$pH > 7,]
site season pH
1 A Winter 7.4
3 A Summer 8.6
4 A Spring 7.2
5 B Fall 8.9
> my.dataset[my.dataset$site == 'A','pH']
[1] 7.4 8.6 7.2
> my.dataset[my.dataset$season == 'Summer',c('site','pH')]
site pH
2 B 6.3
3 A 8.6
Case Study 1: Predicting Algae Blooms
This case study predicts the likelihood of algae blooms from the chemical properties of water samples.
Loading the data into R
> algae <- read.table('Analysis.txt',
+ header=F,
+ dec='.',
+ col.names=c('season','size','speed','mxPH','mnO2','Cl',
+ 'NO3','NH4','oPO4','PO4','Chla','a1','a2','a3','a4',
+ 'a5','a6','a7'),
+ na.strings=c('XXXXXXX'))
Data Visualization and Summarization
> summary(algae)
season size speed mxPH mnO2
autumn:40 large :45 high :84 Min. :5.600 Min. : 1.500
spring:53 medium:84 low :33 1st Qu.:7.700 1st Qu.: 7.725
summer:45 small :71 medium:83 Median :8.060 Median : 9.800
winter:62 Mean :8.012 Mean : 9.118
3rd Qu.:8.400 3rd Qu.:10.800
Max. :9.700 Max. :13.400
NA’s :1.000 NA’s : 2.000
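summary() is very powerful.
Note that summary() reports quantitative and qualitative variables in different forms.
Of the many available plot types most are straightforward, so only the box plot is described here.
Let r be the interquartile range.
The interquartile range is the difference between the upper and lower quartiles, Q3 - Q1, and reflects how spread out the data are.
In a box plot the box spans Q1 to Q3 and the whiskers reach at most 1.5r beyond each edge of the box; anything further out is an outlier.
The circles below or above these small dashes represent observations that are extremely low (high) compared to all others, and are usually considered outliers. This means that box plots give us plenty of information regarding not only the central value and spread of the variable but also on eventual outliers.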
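The 1.5r rule can be reproduced by hand. The sketch below uses made-up numbers, not the algae data. It computes the fences from quantile() and IQR(), and compares the result with boxplot.stats(), the base R helper behind boxplot(). Note that boxplot.stats() works from Tukey's hinges, which can differ slightly from the quartiles returned by quantile(), so the two approaches need not agree on every data set.

```r
# Toy data (an assumption for illustration); 50 is an obvious outlier.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

q <- quantile(x, c(0.25, 0.75))   # lower and upper quartiles
r <- IQR(x)                       # interquartile range, Q3 - Q1

lower <- q[1] - 1.5 * r           # lower fence
upper <- q[2] + 1.5 * r           # upper fence
outliers <- x[x < lower | x > upper]

print(outliers)
print(boxplot.stats(x)$out)       # outliers as identified by the box plot machinery
```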
Unknown values
There are several water samples with unknown variable values. This situation, rather common in real problems, may preclude the use of certain techniques that are not able to handle missing values.
Whenever we are handling a data set with missing values we can follow several strategies. The most common are:
• Remove the cases with unknowns
• Fill in the unknown values by exploring the correlations between variables
• Fill in the unknown values by exploring the similarity between cases
• Use tools that are able to handle these values.
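As a rough sketch of the first strategy, and of a much cruder fill-in than the correlation-based or similarity-based ones listed above, consider the toy data frame below. The data frame d and the median fill are assumptions made for illustration; they are not the procedure used to build clean.algae in the book.

```r
# Hypothetical toy data, not the algae data set.
d <- data.frame(mxPH = c(8.0, NA, 7.5, 8.4),
                Cl   = c(60, 57, NA, 40))

# How many rows contain at least one unknown value?
n.incomplete <- sum(!complete.cases(d))

# Strategy 1: remove the cases with unknowns.
d.removed <- na.omit(d)

# A simple fill-in: replace each unknown with the median of its column.
d.filled <- d
for (col in names(d.filled)) {
  miss <- is.na(d.filled[[col]])
  d.filled[[col]][miss] <- median(d.filled[[col]], na.rm = TRUE)
}

print(n.incomplete)
print(d.removed)
print(d.filled)
```

Removing rows is the safest choice when only a few cases are affected; filling in keeps all cases but introduces values that were never observed.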
Obtaining prediction models
In this section we explore two different predictive models that could be applied to the algae domain: linear regression and regression trees.
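Now we reach the core problem.
For an explanation of regression, see http://blog.csdn.net/vshuang/article/details/5512853
Put simply: given several independent variables, one dependent variable, and training samples that reflect the relationship between them, how do we determine that relationship?
In essence we solve an equation such as y = ax + bz: with a set of samples we estimate a and b, and afterwards we can compute y directly from x and z. That is prediction.
It is not quite that simple, though, because the form of the equation is unknown and has to be guessed. If it happens to be the simplest form above, the problem is one of linear regression.
It may also be one of the more complex forms below, which are not linear in the input variables:
y = a*sqrt(x1) + b*sqrt(x2); y = a*exp(x1) + b*exp(-x2); y = x1*x2
Linear regression here usually means multiple linear regression, that is, regression with several independent variables.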
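The idea of estimating a and b from samples and then predicting y can be sketched on synthetic data. Everything below is an assumption made for illustration: the data are generated from y = 2x + 3z plus a little noise, and lm() is asked to recover the two coefficients.

```r
set.seed(1)

# Synthetic samples generated from y = a*x + b*z with a = 2 and b = 3.
x <- runif(100)
z <- runif(100)
y <- 2 * x + 3 * z + rnorm(100, sd = 0.01)

# '- 1' drops the intercept so the model matches y = a*x + b*z exactly.
fit <- lm(y ~ x + z - 1)
est <- coef(fit)
print(est)

# Prediction: compute y for a new pair (x, z) from the estimated coefficients.
new <- data.frame(x = 0.5, z = 0.5)
pred <- predict(fit, new)
print(pred)
```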
Multiple linear regression
With the data loaded and cleaned, the next step is to build a regression model.
Let us start by learning how to obtain a linear regression model for predicting the frequency of one of the algae.
> lm.a1 <- lm(a1 ~ .,data=clean.algae[,1:12]) #Obtaining a linear regression model
The function lm() obtains a linear regression model.
The first argument of this function indicates the functional form of the model.
In this example, it states that we want a model that predicts the variable a1 using all other variables present in the data, which is the meaning of the dot character. For instance, if we wanted a model to predict a1 as a function of the variables mxPH and NH4, we should have indicated the model as “a1 ~ mxPH + NH4”.
The data parameter sets the data sample to be used to obtain the model.
The model is now built, quite simply. How do we use it?
We may obtain more information about the linear model with the following instruction,
> summary(lm.a1)
Call:
lm(formula = a1 ~ ., data = clean.algae[, 1:12])
Residuals:
Min 1Q Median 3Q Max
-37.582 -11.882 -2.741 7.090 62.143
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.210622 24.042849 1.797 0.07396 .
seasonspring 3.575474 4.135308 0.865 0.38838
seasonsummer 0.645459 4.020423 0.161 0.87263
seasonwinter 3.572084 3.863941 0.924 0.35647
sizemedium 3.321935 3.797755 0.875 0.38288
sizesmall 9.732162 4.175616 2.331 0.02086 *
speedlow 3.965153 4.709314 0.842 0.40090
speedmedium 0.304232 3.243204 0.094 0.92537
mxPH -3.570995 2.706612 -1.319 0.18871
mnO2 1.018514 0.704875 1.445 0.15019
Cl -0.042551 0.033646 -1.265 0.20761
NO3 -1.494145 0.551200 -2.711 0.00736 **
NH4 0.001608 0.001003 1.603 0.11072
oPO4 -0.005235 0.039864 -0.131 0.89566
PO4 -0.052247 0.030737 -1.700 0.09087 .
Chla -0.090800 0.080015 -1.135 0.25796
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17.64 on 182 degrees of freedom
Multiple R-Squared: 0.3737, Adjusted R-squared: 0.3221
F-statistic: 7.24 on 15 and 182 DF, p-value: 2.273e-12
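We first use summary() to inspect the model. The output needs to be explained item by item.
The first issue is the nominal variables: they have to be turned into numbers before they can enter the computation.
How did R handle the three nominal variables?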
Namely, for each factor variable with k levels, R will create k −1 auxiliary variables.
Looking at the summary presented above, we can see that R has created three auxiliary variables for the factor season (seasonspring, seasonsummer and seasonwinter). This means that if we have a water sample with the value “autumn” in the variable season, all these three auxiliary variables will be set to zero.
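So season is split into (seasonspring, seasonsummer and seasonwinter). Why not four variables? Because a fourth would be redundant: if the first three are all 0 the sample must be autumn, and if any one of them is 1 it cannot be.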
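The encoding can be inspected with model.matrix(), the base R function that builds the design matrix for a formula. The small factor below is an illustrative stand-in for the season column, with one sample per level.

```r
# One sample per season, as a stand-in for the algae 'season' column.
season <- factor(c('autumn', 'spring', 'summer', 'winter'))

# Design matrix with the default (treatment) contrasts.
X <- model.matrix(~ season)
print(X)

# The first level, autumn, is the baseline: all three auxiliary variables are 0.
print(X[1, ])
```

With k = 4 levels there are k - 1 = 3 auxiliary columns next to the intercept, which matches the three season rows in the summary output.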
The application of the function summary() to a linear model gives some diagnostic information concerning the obtained model.
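Residuals (i.e. the errors) of the fit of the linear model to the used data. These residuals should have a mean zero and should have a normal distribution (and obviously be as small as possible!).
That is, the residuals are the errors of fitting the linear model to the data. A normal distribution is plausible: most observations should sit close to the model, with only a few unusual ones far away.
The residuals show how well the linear model matches the real data and how large the differences are.
Coefficients
Each regression coefficient measures how much the prediction changes per unit change in the corresponding variable.
Estimate, the estimated coefficient value. A positive value means the prediction rises with the variable, a negative value that it falls.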
Std. Error, standard error (an estimate of the variation of these coefficients)
In order to check the importance of each coefficient, we may test the hypothesis that each of them is null.
In other words, we assume that a given coefficient is 0. If that assumption holds, the variable contributes nothing to the prediction.
t value, to test this hypothesis the t test is normally used. R calculates a t value, which is defined as the ratio between the coefficient value and its standard error, coefficient value/standard error. This t value is used to compute the next column.
Pr(>|t|), associated with each coefficient, is the level at which the hypothesis that the coefficient is null is rejected. Thus a value of 0.0001 has the meaning that we are 99.99% confident that the coefficient is not null.
So the smaller Pr(>|t|) is, the stronger the evidence that the variable matters. The size of the estimate alone is not a reliable guide, because it depends on the units of the variable.
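The relationship between the columns of the coefficient table can be verified on any fitted model. The sketch below fits a small model to synthetic data (an assumption for illustration), then recomputes the t value as estimate divided by standard error and the two-sided Pr(>|t|) from the t distribution with the residual degrees of freedom.

```r
set.seed(2)

# Synthetic data, not the algae data set.
d <- data.frame(x = rnorm(50))
d$y <- 1 + 0.5 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)
tab <- coef(summary(fit))   # matrix with Estimate, Std. Error, t value, Pr(>|t|)

# t value = coefficient value / standard error
t.manual <- tab[, "Estimate"] / tab[, "Std. Error"]

# Two-sided p-value from the t distribution
p.manual <- 2 * pt(abs(t.manual), df = df.residual(fit), lower.tail = FALSE)

print(cbind(tab, t.manual, p.manual))
```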
R-Squared coefficients (Multiple and Adjusted)
The degree of fit of the model to the data, that is the proportion of variance in the data that is explained by the model.
Values near 1 are better (almost 100% explained variance), while the smaller the values the larger the lack of fit.
The adjusted coefficient is more demanding as it takes into account the number of parameters of the regression model.
R-squared also reflects the degree of fit: the closer to 1, the more closely the model fits the data.
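Both coefficients can be computed by hand from the residuals. The sketch below uses synthetic data (an assumption for illustration) with n cases and p predictors, and checks the result against the values that summary() reports.

```r
set.seed(3)

# Synthetic data: y depends on x1 only, x2 is pure noise.
d <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
d$y <- 2 * d$x1 + rnorm(40)

fit <- lm(y ~ x1 + x2, data = d)

rss <- sum(residuals(fit)^2)          # residual sum of squares
tss <- sum((d$y - mean(d$y))^2)       # total sum of squares
n <- nrow(d)                          # number of cases
p <- 2                                # number of predictors (excluding the intercept)

r2 <- 1 - rss / tss                                 # proportion of variance explained
adj.r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)      # penalised for the number of parameters

print(c(r2, summary(fit)$r.squared))
print(c(adj.r2, summary(fit)$adj.r.squared))
```

Applying the same adjustment to the output above, with R-squared 0.3737, 198 cases and 15 predictors, gives 1 - 0.6263 * 197/182, which is about 0.3221, the reported Adjusted R-squared.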
p-value
we can also test the null hypothesis that there is no dependence of the target variable on any of the explanatory variables.
Usually, if the model fails this test it makes no sense to look at the t-tests on the individual coefficients.
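The p-value is the probability of seeing a fit at least this good if the target really had no relationship with any of the variables. If that hypothesis cannot be rejected, the model is meaningless,
and in that case there is no point in looking at the individual coefficients.
Again, smaller is better: a small value means the hypothesis is unlikely to hold.
Now, having gone through the output, what does it tell us?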
The proportion of variance explained by this model is not very impressive (around 32.0%).
Still, we can reject the hypothesis that the target variable does not depend on the predictors (the p value of the F test is very small).
Two conclusions:
The fit is not very good, because the Adjusted R-squared is only about 32%.
The model is still of some use, because the p-value is very small.
The natural question is how to improve it.
The results above suggest that the model includes some variables of little relevance, so we simplify the model by eliminating them backward.
Looking at the significance of some of the coefficients we may question the inclusion of some of them in the model. There are several methods for simplifying regression models. In this section we will explore a method usually known as backward elimination.
We will start our study of simplifying the linear model using the anova() function. When applied to a single linear model this function will give us a sequential analysis of variance of the model fit.
> anova(lm.a1)
Analysis of Variance Table
Response: a1
Df Sum Sq Mean Sq F value Pr(>F)
season 3 85 28 0.0906 0.9651499
size 2 11401 5701 18.3253 5.613e-08 ***
speed 2 3934 1967 6.3236 0.0022126 **
mxPH 1 1322 1322 4.2499 0.0406740 *
mnO2 1 2218 2218 7.1312 0.0082614 **
Cl 1 4451 4451 14.3073 0.0002105 ***
NO3 1 3399 3399 10.9263 0.0011420 **
NH4 1 385 385 1.2376 0.2674000
oPO4 1 4765 4765 15.3168 0.0001283 ***
PO4 1 1423 1423 4.5738 0.0337981 *
Chla 1 401 401 1.2877 0.2579558
Residuals 182 56617 311
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
These results indicate that the variable season is the variable that contributes least to the reduction of the fitting error of the model.
season contributes least to the model: its Sum Sq and Mean Sq are the smallest and its Pr(>F) is the largest, so it has almost no effect.
Let us remove it from the model,
> lm2.a1 <- update(lm.a1, . ~ . - season)
The summary information for this new model is given below,
> summary(lm2.a1)
Call:
lm(formula = a1 ~ size + speed + mxPH + mnO2 + Cl + NO3 + NH4 +
oPO4 + PO4 + Chla, data = clean.algae[, 1:12])
Residuals:
Min 1Q Median 3Q Max
-36.386 -11.899 -2.941 7.338 63.611
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 44.9587170 23.2659336 1.932 0.05484 .
sizemedium 3.3636189 3.7773655 0.890 0.37437
sizesmall 10.3092317 4.1173665 2.504 0.01315 *
speedlow 3.1460847 4.6155216 0.682 0.49632
speedmedium -0.2146428 3.1839011 -0.067 0.94632
mxPH -3.2377235 2.6587542 -1.218 0.22487
mnO2 0.7741679 0.6578931 1.177 0.24081
Cl -0.0409303 0.0333812 -1.226 0.22170
NO3 -1.5126458 0.5475832 -2.762 0.00632 **
NH4 0.0015525 0.0009946 1.561 0.12027
oPO4 -0.0061577 0.0394710 -0.156 0.87620
PO4 -0.0508845 0.0304911 -1.669 0.09684 .
Chla -0.0879751 0.0794655 -1.107 0.26969
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17.56 on 185 degrees of freedom
Multiple R-Squared: 0.369, Adjusted R-squared: 0.3281
F-statistic: 9.016 on 12 and 185 DF, p-value: 1.581e-13
With season removed from the model, the summary shows that the Adjusted R-squared has risen to 32.8%.
The fit has improved a bit (32.8%) but it is still not too impressive. We may carry out a more formal comparison between the two models by using again the anova() function, but this time with both models as arguments.
> anova(lm.a1,lm2.a1)
Analysis of Variance Table
Model 1: a1 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla
Model 2: a1 ~ size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla
Res.Df RSS Df Sum of Sq F Pr(>F)
1 182 56617
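2 185 57043 -3 -425 0.4559 0.7134
anova() can also compare two models.
In this case, although the residual sum of squares differs by 425 between the two models (56617 versus 57043), the comparison shows that the differences are not significant (a value of 0.7134 tells us that with only around 29% confidence we can say they are different).
The two models do not differ significantly, so the model needs further simplification: use anova() again to find the least relevant variable, remove it, and test. R has a function that automates this process.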
The following code creates a linear model that results from applying the backward elimination method to the initial model we have obtained (lm.a1)
> final.lm <- step(lm.a1)
Start: AIC= 1151.85
a1 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla
Step: AIC= 1147.33
a1 ~ size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla
...
...
Step: AIC= 1140.09
a1 ~ size + mxPH + Cl + NO3 + PO4
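After several steps we obtain the final simplified model, but its summary shows Adjusted R-squared: 0.3333.
The improvement is very limited. What does this tell us?
The linearity assumptions of this model are inadequate for the domain.
A linear model cannot adequately describe this problem.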
Regression trees
Let us now look at a different kind of regression model available in R.
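Decision trees come in several forms, and a regression tree is one of them.
- Classification trees are used when the predicted outcome is a discrete class (for example male or female, win or lose).
- Regression trees are used when the predicted outcome is a real number (for example a house price or the length of a hospital stay).
> library(rpart)
> rt.a1 <- rpart(a1 ~ .,data=algae[,1:12])
The single rpart() call is enough to build a regression tree model. As with decision trees in general, the tree can then be improved by pruning.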
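A minimal pruning sketch follows, assuming the rpart package is installed. The data are synthetic, and picking the complexity parameter with the lowest cross-validated error from the cptable is one common choice, not necessarily the procedure the book follows.

```r
library(rpart)
set.seed(5)

# Synthetic data: y jumps by 10 when x1 exceeds 0.5; x2 is irrelevant.
d <- data.frame(x1 = runif(200), x2 = runif(200))
d$y <- ifelse(d$x1 > 0.5, 10, 0) + rnorm(200)

# Grow the tree with the default settings.
rt <- rpart(y ~ ., data = d)
print(rt$cptable)   # one row per candidate subtree, with its cross-validated error

# Choose the complexity parameter with the lowest cross-validated error and prune.
best <- which.min(rt$cptable[, "xerror"])
rt.pruned <- prune(rt, cp = rt$cptable[best, "CP"])
print(rt.pruned)
```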
Model evaluation and selection
The most popular are criteria that calculate the predictive performance of the models.
Still, other criteria exist, like for instance the model interpretability, or even the model computational efficiency, which can be important for very large data mining problems.
The main criterion for choosing a model is its predictive performance, which can be measured by statistics such as the MAE.
The predictive performance of regression models is obtained by comparing the predictions of the models with the real values of the target variables, and calculating some average error measure from this comparison. One of such measures is the mean absolute error (MAE).
Let us see how to obtain this measure for our two models (linear regression and regression trees).
The first step is to obtain the model predictions for the set of cases where we want to evaluate it. To obtain the predictions of any model in R, one uses the function predict().
This general function takes a model and a set of data and retrieves the model predictions,
> lm.predictions.a1 <- predict(final.lm,clean.algae)
> rt.predictions.a1 <- predict(rt.a1,algae)
Having the predictions of the models we can calculate their mean absolute error as follows,
> (mae.a1.lm <- mean(abs(lm.predictions.a1-algae[,'a1'])))
[1] 13.10279
> (mae.a1.rt <- mean(abs(rt.predictions.a1-algae[,'a1'])))
[1] 11.61717
Another popular error measure is the mean squared error (MSE). This measure can be obtained as follows,
> (mse.a1.lm <- mean((lm.predictions.a1-algae[,'a1'])^2))
[1] 295.1097
> (mse.a1.rt <- mean((rt.predictions.a1-algae[,'a1'])^2))
[1] 271.3226
Neither statistic tells us by itself whether a score is good or bad. An alternative statistic that provides a reasonable answer to this question is the normalized mean squared error (NMSE). This statistic calculates a ratio between the performance of our models and that of a baseline predictor, usually taken as the mean value of the target variable,
> (nmse.a1.lm <- mean((lm.predictions.a1-algae[,'a1'])^2)/mean((mean(algae[,'a1'])-algae[,'a1'])^2))
[1] 0.6463594
> (nmse.a1.rt <- mean((rt.predictions.a1-algae[,'a1'])^2)/mean((mean(algae[,'a1'])-algae[,'a1'])^2))
[1] 0.5942601
If your model is performing better than this very simple baseline predictor then the NMSE should be clearly below 1. The smaller the NMSE, the better. Values above 1 mean that your model is performing worse than simply always predicting the average for all cases.
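The three measures can be wrapped in small helper functions. The names mae, mse and nmse and the toy vectors below are introduced here for illustration; they are not defined in the book's code.

```r
# Helper functions (hypothetical names) for the three error measures.
mae  <- function(pred, actual) mean(abs(pred - actual))
mse  <- function(pred, actual) mean((pred - actual)^2)
nmse <- function(pred, actual) mse(pred, actual) / mse(mean(actual), actual)

# Toy values, not model output.
actual <- c(1, 2, 3, 4, 5)
pred   <- c(1.5, 2, 2.5, 4, 6)

print(mae(pred, actual))
print(mse(pred, actual))
print(nmse(pred, actual))

# Always predicting the mean of the target gives an NMSE of exactly 1.
baseline <- rep(mean(actual), length(actual))
print(nmse(baseline, actual))
```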
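On this data set, then, the regression tree does somewhat better than the multiple linear regression model.
Measures like these make it easy to judge models and choose between them.
However, the measures here were computed on the training data, so they are subject to overfitting the training data.
A way to address this is K-fold Cross Validation.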
Obtain K equally sized and random sub-sets of the training data. For each of these K sub-sets, build a model using the remaining K-1 sets and evaluate this model on the Kth sub-set. Store the performance of the model and repeat this process for all remaining sub-sets. In the end we have K performance measures, all obtained by testing a model on data not used for its construction. The K-fold Cross Validation estimate is the average of these K measures.
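The procedure described above can be sketched as a short function. The function name cv.mae, the use of lm() as the model, the MAE as the performance measure and the synthetic data are all assumptions made for illustration; any model with a predict() method could be substituted.

```r
# K-fold cross validation estimate of the MAE of a linear model.
# cv.mae is a hypothetical helper, not part of any package.
cv.mae <- function(formula, data, K = 5) {
  # Randomly assign each case to one of K (nearly) equally sized folds.
  folds <- sample(rep(1:K, length.out = nrow(data)))
  target <- all.vars(formula)[1]      # name of the response variable
  scores <- numeric(K)
  for (k in 1:K) {
    train <- data[folds != k, ]       # the remaining K-1 folds
    test  <- data[folds == k, ]       # the held-out fold
    model <- lm(formula, data = train)
    scores[k] <- mean(abs(predict(model, test) - test[[target]]))
  }
  mean(scores)                        # average of the K performance measures
}

# Synthetic data for a quick check.
set.seed(4)
d <- data.frame(x = runif(100))
d$y <- 3 * d$x + rnorm(100, sd = 0.5)

est <- cv.mae(y ~ x, d, K = 5)
print(est)
```

Because every score comes from cases the model did not see during fitting, the average is a less optimistic estimate than the training-set measures computed earlier.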
Predictions for the 7 algae
In this section we will see how to obtain the predictions for the 7 algae on the 140 test samples.
The details are not covered here.
Case Study 2: Predicting Stock Market Returns
To be continued when the need arises and time allows.