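Compared with other free data-mining tools such as Python, Orange Canvas, Weka, and KNIME, it is easier to pick up, and its statistical graphics look better.
We use the spam data set from the kernlab package. spam is an e-mail data set with 4601 observations and 58 variables. The last variable, type, is binary with the levels "spam" and "nonspam", and our task is to build a model that predicts whether an observation is spam. First, load the packages and the data set: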
> library(caret)
1: package ‘caret’ was built under R version 3.1.1
2: package ‘ggplot2’ was built under R version 3.1.1
> library(kernlab)
package ‘kernlab’ was built under R version 3.1.3
> data(spam)
> head(spam)
make address all num3d our over remove internet order mail
1 0.00 0.64 0.64 0 0.32 0.00 0.00 0.00 0.00 0.00
2 0.21 0.28 0.50 0 0.14 0.28 0.21 0.07 0.00 0.94
3 0.06 0.00 0.71 0 1.23 0.19 0.19 0.12 0.64 0.25
4 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63
5 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63
6 0.00 0.00 0.00 0 1.85 0.00 0.00 1.85 0.00 0.00
receive will people report addresses free business email you
1 0.00 0.64 0.00 0.00 0.00 0.32 0.00 1.29 1.93
2 0.21 0.79 0.65 0.21 0.14 0.14 0.07 0.28 3.47
3 0.38 0.45 0.12 0.00 1.75 0.06 0.06 1.03 1.36
4 0.31 0.31 0.31 0.00 0.00 0.31 0.00 0.00 3.18
5 0.31 0.31 0.31 0.00 0.00 0.31 0.00 0.00 3.18
6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
credit your font num000 money hp hpl george num650 lab labs telnet
1 0.00 0.96 0 0.00 0.00 0 0 0 0 0 0 0
2 0.00 1.59 0 0.43 0.43 0 0 0 0 0 0 0
3 0.32 0.51 0 1.16 0.06 0 0 0 0 0 0 0
4 0.00 0.31 0 0.00 0.00 0 0 0 0 0 0 0
5 0.00 0.31 0 0.00 0.00 0 0 0 0 0 0 0
6 0.00 0.00 0 0.00 0.00 0 0 0 0 0 0 0
num857 data num415 num85 technology num1999 parts pm direct cs
1 0 0 0 0 0 0.00 0 0 0.00 0
2 0 0 0 0 0 0.07 0 0 0.00 0
3 0 0 0 0 0 0.00 0 0 0.06 0
4 0 0 0 0 0 0.00 0 0 0.00 0
5 0 0 0 0 0 0.00 0 0 0.00 0
6 0 0 0 0 0 0.00 0 0 0.00 0
meeting original project re edu table conference charSemicolon
1 0 0.00 0 0.00 0.00 0 0 0.00
2 0 0.00 0 0.00 0.00 0 0 0.00
3 0 0.12 0 0.06 0.06 0 0 0.01
4 0 0.00 0 0.00 0.00 0 0 0.00
5 0 0.00 0 0.00 0.00 0 0 0.00
6 0 0.00 0 0.00 0.00 0 0 0.00
charRoundbracket charSquarebracket charExclamation charDollar
1 0.000 0 0.778 0.000
2 0.132 0 0.372 0.180
3 0.143 0 0.276 0.184
4 0.137 0 0.137 0.000
5 0.135 0 0.135 0.000
6 0.223 0 0.000 0.000
charHash capitalAve capitalLong capitalTotal type
1 0.000 3.756 61 278 spam
2 0.048 5.114 101 1028 spam
3 0.010 9.821 485 2259 spam
4 0.000 3.537 40 191 spam
5 0.000 3.537 40 191 spam
6 0.000 3.000 15 54 spam
> inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
> training <- spam[inTrain, ]
> testing <- spam[-inTrain, ]
> nrow(training)
[1] 3451
> nrow(testing)
[1] 1150
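In the commands above, createDataPartition() is the data-splitting function. Its y argument is the outcome spam$type, p = 0.75 puts 75% of the observations in the training set, and list sets the output format: list = FALSE returns a matrix of row indices rather than a list (the default is list = TRUE). The lines training <- spam[inTrain, ] and testing <- spam[-inTrain, ] then select the training and test data.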
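The idea of a split that preserves the class proportions of the outcome can be sketched in base R. This is only an illustration under stated assumptions, not caret's actual implementation: the toy vector y, the names idx_by_class and in_train, and the use of ceiling() for the per-class sample size are all chosen here for the example.

```r
# Base-R sketch of a stratified 75/25 split (illustrative, not caret's code).
set.seed(1)

# Toy outcome vector standing in for spam$type (hypothetical data).
y <- factor(rep(c("nonspam", "spam"), times = c(60, 40)))

# Group the row indices by class, then draw 75% of each group for training.
idx_by_class <- split(seq_along(y), y)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}), use.names = FALSE))

train_y <- y[in_train]
test_y  <- y[-in_train]

table(train_y)  # 45 nonspam, 30 spam
table(test_y)   # 15 nonspam, 10 spam
```

Because the sampling is done per class, the training and test sets keep the same 60/40 class ratio as the full vector.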
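modelFit <- train(type ~ ., data = training, method = "glm")
train() is the training function. type ~ . is the model formula (type modeled on all other variables), data = training names the data set, and method = "glm" selects the model, here a generalized linear model. Other methods, such as support vector machines (SVM) or nnet neural networks, can be used instead. The fitted model looks like this: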
(Intercept) make address all num3d
-1.989e+00 -5.022e-01 -1.702e-01 1.553e-01 3.368e+00
our over remove internet order
7.554e-01 6.682e-01 2.220e+00 5.586e-01 1.144e+00
mail receive will people report
Degrees of Freedom: 3450 Total (i.e. Null); 3393 Residual
Null Deviance: 4628
Residual Deviance: 1335 AIC: 1451
(output abridged for space)
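For a two-class outcome, a glm fit of this kind is in effect a logistic regression. The following base-R sketch shows the same idea without caret, using the built-in mtcars data as a stand-in for spam. The choice of mtcars, the predictors wt and hp, the 0.5 cutoff, and the class labels are assumptions made for illustration only.

```r
# Logistic regression with base R on a built-in data set (stand-in for spam).
# am is a 0/1 outcome; wt and hp are arbitrary illustrative predictors.
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Predicted probability that am == 1 for each row.
prob <- predict(fit, newdata = mtcars, type = "response")

# Turn probabilities into class labels with a 0.5 cutoff (an assumption here).
pred <- factor(ifelse(prob > 0.5, "manual", "automatic"),
               levels = c("automatic", "manual"))

table(Prediction = pred, Reference = mtcars$am)
```

Printing fit shows coefficients, null deviance, residual deviance, and AIC in the same layout as the output above.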
predictions <- predict(modelFit,newdata=testing)
[1] spam spam spam spam spam spam spam spam spam spam spam
[12] spam spam spam spam spam spam spam spam spam spam spam
[23] nonspam spam spam spam spam spam spam nonspam spam spam spam
[34] spam spam spam spam spam spam spam spam spam spam spam
[45] spam spam spam spam spam spam spam spam spam spam spam
Confusion Matrix and Statistics
Prediction nonspam spam
nonspam 658 47
spam 39 406
Accuracy : 0.9252
95% CI : (0.9085, 0.9398)
No Information Rate : 0.6061
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8429
Mcnemar's Test P-Value : 0.4504
Sensitivity : 0.9440
Specificity : 0.8962
Pos Pred Value : 0.9333
Neg Pred Value : 0.9124
Prevalence : 0.6061
Detection Rate : 0.5722
Detection Prevalence : 0.6130
Balanced Accuracy : 0.9201
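The statistics above follow directly from the four counts in the confusion matrix. The sketch below recomputes a few of them in base R, treating nonspam as the positive class, which is what the reported sensitivity implies. The variable names are chosen here for illustration.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column).
cm <- matrix(c(658, 39, 47, 406), nrow = 2,
             dimnames = list(Prediction = c("nonspam", "spam"),
                             Reference  = c("nonspam", "spam")))

accuracy    <- sum(diag(cm)) / sum(cm)                          # 1064 / 1150
sensitivity <- cm["nonspam", "nonspam"] / sum(cm[, "nonspam"])  # 658 / 697
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])           # 406 / 453
pos_pred    <- cm["nonspam", "nonspam"] / sum(cm["nonspam", ])  # 658 / 705
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, pos_pred = pos_pred,
        balanced = balanced), 4)
# accuracy 0.9252, sensitivity 0.9440, specificity 0.8962,
# pos_pred 0.9333, balanced 0.9201
```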
The second example uses the Sonar data set from the mlbench package:
library(mlbench)
data(Sonar)
inTrain <- createDataPartition(y = Sonar$Class,  ## the outcome data are needed
                               p = .75,          ## the percentage of data in the training set
                               list = FALSE)     ## the format of the results
## The output is a set of integers for the rows of Sonar
## that belong in the training set.
> str(inTrain)
int [1:157, 1] 98 100 101 102 103 105 107 109 110 111 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr "Resample1"
> training <- Sonar[inTrain,]
> testing <- Sonar[-inTrain,]
> nrow(training)
[1] 157
> nrow(testing)
[1] 51
## Center and scale the predictors for the training set and all future samples.
plsFit <- train(Class ~ ., data = training,
                method = "pls",
                preProc = c("center", "scale"))
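## tuneLength = 15 evaluates 15 values of the tuning parameter ncomp.
plsFit <- train(Class ~ ., data = training,
                method = "pls",
                tuneLength = 15,
                preProc = c("center", "scale"))
## Switch the resampling scheme to 10-fold cross-validation repeated 3 times.
ctrl <- trainControl(method = "repeatedcv", repeats = 3)
plsFit <- train(Class ~ ., data = training,
                method = "pls",
                tuneLength = 15,
                trControl = ctrl,
                preProc = c("center", "scale"))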
ctrl <- trainControl(method = "repeatedcv", repeats = 3,
                     classProbs = TRUE,                  ## compute class probabilities, needed for ROC
                     summaryFunction = twoClassSummary)  ## report ROC, sensitivity and specificity
plsFit <- train(Class ~ .,
                data = training,
                method = "pls",
                tuneLength = 15,
                trControl = ctrl,
                metric = "ROC",
                preProc = c("center", "scale"))
> plsFit
Partial Least Squares
157 samples
60 predictor
2 classes: 'M', 'R'
Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 141, 141, 142, 141, 140, 142, ...
Resampling results across tuning parameters:
ncomp Accuracy Kappa Accuracy SD Kappa SD
1 0.729 0.460 0.1291 0.254
2 0.807 0.614 0.0896 0.176
3 0.788 0.577 0.0880 0.176
4 0.780 0.558 0.0783 0.158
5 0.757 0.512 0.0953 0.193
6 0.762 0.524 0.0925 0.185
7 0.752 0.504 0.0943 0.188
8 0.739 0.477 0.0743 0.148
9 0.745 0.491 0.0861 0.170
10 0.747 0.493 0.0791 0.156
11 0.736 0.472 0.0845 0.167
12 0.758 0.514 0.0887 0.177
13 0.730 0.458 0.0883 0.176
14 0.734 0.466 0.0916 0.182
15 0.743 0.483 0.0964 0.193
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was ncomp = 2.
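The "largest value" rule named in the output amounts to picking the ncomp with the highest resampled accuracy. A minimal base-R sketch, with the accuracy column copied from the table above and variable names chosen here for illustration:

```r
# Resampled accuracy for ncomp = 1..15, copied from the table above.
accuracy <- c(0.729, 0.807, 0.788, 0.780, 0.757, 0.762, 0.752, 0.739,
              0.745, 0.747, 0.736, 0.758, 0.730, 0.734, 0.743)
ncomp <- seq_along(accuracy)

# Pick the number of components with the largest accuracy.
best_ncomp <- ncomp[which.max(accuracy)]
best_ncomp  # 2
```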
> plsClasses <- predict(plsFit,newdata = testing)
> str(plsClasses)
Factor w/ 2 levels "M","R": 2 1 1 2 1 2 2 2 2 2 ...
> plsProbs <- predict(plsFit,newdata = testing,type = "prob")
> head(plsProbs)
           M         R
4 0.3762529 0.6237471
5 0.5229047 0.4770953
8 0.5839468 0.4160532
16 0.3660142 0.6339858
20 0.7351013 0.2648987
25 0.2135788 0.7864212
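The class probabilities and the predicted classes are linked: assuming each sample is assigned the class with the higher probability, the six rows above reproduce the first six codes in the str(plsClasses) output (2 1 1 2 1 2). A base-R sketch with the values copied from the output. The object names probs and classes are illustrative.

```r
# First six rows of the class probabilities, copied from the output above.
probs <- data.frame(
  M = c(0.3762529, 0.5229047, 0.5839468, 0.3660142, 0.7351013, 0.2135788),
  R = c(0.6237471, 0.4770953, 0.4160532, 0.6339858, 0.2648987, 0.7864212),
  row.names = c("4", "5", "8", "16", "20", "25")
)

# Assign each row the class whose probability is largest.
classes <- factor(colnames(probs)[max.col(as.matrix(probs))],
                  levels = c("M", "R"))

as.integer(classes)  # 2 1 1 2 1 2
```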
> confusionMatrix(data = plsClasses,testing$Class)
Confusion Matrix and Statistics
Prediction M R
M 20 7
R 7 17
Accuracy : 0.7255
95% CI : (0.5826, 0.8411)
No Information Rate : 0.5294
P-Value [Acc > NIR] : 0.003347
Kappa : 0.4491
Mcnemar's Test P-Value : 1.000000
Sensitivity : 0.7407
Specificity : 0.7083
Pos Pred Value : 0.7407
Neg Pred Value : 0.7083
Prevalence : 0.5294
Detection Rate : 0.3922
Detection Prevalence : 0.5294
Balanced Accuracy : 0.7245
'Positive' Class : M
