Title: | Cross-validation tools for regression models |
---|---|
Description: | Tools that allow developers to write functions for cross-validation with minimal programming effort and assist users with model selection. |
Authors: | Andreas Alfons |
Maintainer: | Andreas Alfons <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.3.3 |
Built: | 2024-11-09 02:35:47 UTC |
Source: | https://github.com/aalfons/cvtools |
Package: | cvTools |
Type: | Package |
Version: | 0.3.3 |
Date: | 2012-05-25 |
Depends: | R (>= 2.11.0), lattice, robustbase |
Imports: | lattice, robustbase, stats |
License: | GPL (>= 2) |
LazyLoad: | yes |
Index:

accessors: Access or set information on cross-validation results
aggregate.cv: Aggregate cross-validation results
bwplot.cv: Box-and-whisker plots of cross-validation results
cost: Prediction loss
cvFit: Cross-validation for model evaluation
cvFolds: Cross-validation folds
cvReshape: Reshape cross-validation results
cvSelect: Model selection based on cross-validation
cvTool: Low-level function for cross-validation
cvTools-package: Cross-validation tools for regression models
cvTuning: Cross-validation for tuning parameter selection
densityplot.cv: Kernel density plots of cross-validation results
dotplot.cv: Dot plots of cross-validation results
plot.cv: Plot cross-validation results
repCV: Cross-validation for linear models
subset.cv: Subsetting cross-validation results
summary.cv: Summarize cross-validation results
xyplot.cv: X-Y plots of cross-validation results
Andreas Alfons
Maintainer: Andreas Alfons <[email protected]>
Retrieve or set the names of cross-validation results, retrieve or set the identifiers of the models, or retrieve the number of cross-validation results or included models.
cvNames(x)
cvNames(x) <- value
fits(x)
fits(x) <- value
ncv(x)
nfits(x)
x | an object inheriting from class "cv" or "cvSelect" that contains cross-validation results. |
value | a vector of replacement values. |
cvNames returns the names of the cross-validation results. The replacement function returns them invisibly.

fits returns the identifiers of the models for objects inheriting from class "cvSelect" and NULL for objects inheriting from class "cv". The replacement function returns those values invisibly.

ncv returns the number of cross-validation results.

nfits returns the number of models included in objects inheriting from class "cvSelect" and NULL for objects inheriting from class "cv".
Andreas Alfons
library("robustbase")
data("coleman")
set.seed(1234)  # set seed for reproducibility

## set up folds for cross-validation
folds <- cvFolds(nrow(coleman), K = 5, R = 10)

## compare raw and reweighted LTS estimators for
## 50% and 75% subsets

# 50% subsets
fitLts50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5)
cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds,
    fit = "both", trim = 0.1)

# 75% subsets
fitLts75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75)
cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds,
    fit = "both", trim = 0.1)

# combine results into one object
cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75)
cvFitsLts

# "cv" object
ncv(cvFitLts50)
nfits(cvFitLts50)
cvNames(cvFitLts50)
cvNames(cvFitLts50) <- c("improved", "initial")
fits(cvFitLts50)
cvFitLts50

# "cvSelect" object
ncv(cvFitsLts)
nfits(cvFitsLts)
cvNames(cvFitsLts)
cvNames(cvFitsLts) <- c("improved", "initial")
fits(cvFitsLts)
fits(cvFitsLts) <- 1:2
cvFitsLts
Compute summary statistics of results from repeated K-fold cross-validation.
## S3 method for class 'cv'
aggregate(x, FUN = mean, select = NULL, ...)

## S3 method for class 'cvSelect'
aggregate(x, FUN = mean, select = NULL, ...)

## S3 method for class 'cvTuning'
aggregate(x, ...)
x | an object inheriting from class "cv" or "cvSelect" that contains cross-validation results. |
FUN | a function to compute the summary statistics. |
select | a character, integer or logical vector indicating the columns of cross-validation results for which to compute the summary statistics. |
... | additional arguments to be passed down. |
The "cv" method returns a vector or matrix of aggregated cross-validation results, depending on whether FUN returns a single value or a vector.

For the other methods, a data frame containing the aggregated cross-validation results for each model is returned. In the case of the "cvTuning" method, the data frame contains the combinations of tuning parameters rather than a column describing the models.
Andreas Alfons
cvFit, cvSelect, cvTuning, aggregate
library("robustbase")
data("coleman")
set.seed(1234)  # set seed for reproducibility

## set up folds for cross-validation
folds <- cvFolds(nrow(coleman), K = 5, R = 10)

## compare raw and reweighted LTS estimators for
## 50% and 75% subsets

# 50% subsets
fitLts50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5)
cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds,
    fit = "both", trim = 0.1)

# 75% subsets
fitLts75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75)
cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds,
    fit = "both", trim = 0.1)

# combine results into one object
cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75)
cvFitsLts

# summary of the results with the 50% subsets
aggregate(cvFitLts50, summary)

# summary of the combined results
aggregate(cvFitsLts, summary)
Produce box-and-whisker plots of results from repeated K-fold cross-validation.
## S3 method for class 'cv'
bwplot(x, data, select = NULL, ...)

## S3 method for class 'cvSelect'
bwplot(x, data, subset = NULL, select = NULL, ...)
x | an object inheriting from class "cv" or "cvSelect" that contains cross-validation results. |
data | currently ignored. |
subset | a character, integer or logical vector indicating the subset of models for which to plot the cross-validation results. |
select | a character, integer or logical vector indicating the columns of cross-validation results to be plotted. |
... | additional arguments to be passed to the "formula" method of bwplot. |
For objects with multiple columns of repeated cross-validation results, conditional box-and-whisker plots are produced.
An object of class "trellis" is returned invisibly. The update method can be used to update components of the object, and the print method (usually called by default) will plot it on an appropriate plotting device.
Andreas Alfons
cvFit, cvSelect, cvTuning, plot, densityplot, xyplot, dotplot
library("robustbase")
data("coleman")
set.seed(1234)  # set seed for reproducibility

## set up folds for cross-validation
folds <- cvFolds(nrow(coleman), K = 5, R = 10)

## compare LS, MM and LTS regression

# perform cross-validation for an LS regression model
fitLm <- lm(Y ~ ., data = coleman)
cvFitLm <- cvLm(fitLm, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an MM regression model
fitLmrob <- lmrob(Y ~ ., data = coleman, k.max = 500)
cvFitLmrob <- cvLmrob(fitLmrob, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an LTS regression model
fitLts <- ltsReg(Y ~ ., data = coleman)
cvFitLts <- cvLts(fitLts, cost = rtmspe, folds = folds, trim = 0.1)

# combine results into one object
cvFits <- cvSelect(LS = cvFitLm, MM = cvFitLmrob, LTS = cvFitLts)
cvFits

# plot results for the MM regression model
bwplot(cvFitLmrob)

# plot combined results
bwplot(cvFits)
Compute the prediction loss of a model.
mspe(y, yHat, includeSE = FALSE)
rmspe(y, yHat, includeSE = FALSE)
mape(y, yHat, includeSE = FALSE)
tmspe(y, yHat, trim = 0.25, includeSE = FALSE)
rtmspe(y, yHat, trim = 0.25, includeSE = FALSE)
y | a numeric vector or matrix giving the observed values. |
yHat | a numeric vector or matrix of the same dimensions as y giving the fitted values. |
trim | a numeric value giving the trimming proportion (the default is 0.25). |
includeSE | a logical indicating whether standard errors should be computed as well. |
mspe and rmspe compute the mean squared prediction error and the root mean squared prediction error, respectively. In addition, mape returns the mean absolute prediction error, which is somewhat more robust.

Robust prediction loss based on trimming is implemented in tmspe and rtmspe. To be more precise, tmspe computes the trimmed mean squared prediction error and rtmspe computes the root trimmed mean squared prediction error. A proportion of the largest squared differences between the observed and fitted values is thereby trimmed.
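As a rough sketch (not the package's exact implementation; in particular the cutoff computed from trim is an assumption here), the trimmed mean squared prediction error can be computed as:

```r
# Illustrative sketch of a trimmed mean squared prediction error;
# the exact trimming cutoff used by cvTools may differ.
tmspeSketch <- function(y, yHat, trim = 0.25) {
  sq <- sort((y - yHat)^2)                    # squared errors, ascending
  h <- length(sq) - floor(length(sq) * trim)  # number of values kept (assumed)
  mean(sq[seq_len(h)])                        # average over the untrimmed part
}
# rtmspe then corresponds to sqrt(tmspeSketch(y, yHat, trim))
```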
Standard errors can be requested via the includeSE argument. Note that standard errors for tmspe are based on a winsorized standard deviation. Furthermore, standard errors for rmspe and rtmspe are computed from the respective standard errors of mspe and tmspe via the delta method.
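For the delta-method step, since rmspe is the square root of mspe, the standard error transforms with the derivative of the square root. A sketch under that relationship, accessing the returned list by position as described in the value section:

```r
# Delta method: for g(m) = sqrt(m), SE(g(m)) ~ SE(m) * |g'(m)| = SE(m) / (2 * sqrt(m))
y <- c(1, 2, 3, 4, 5)
yHat <- c(1.1, 1.9, 3.2, 3.8, 5.3)
out <- mspe(y, yHat, includeSE = TRUE)      # first component: loss; second: its SE
seRmspe <- out[[2]] / (2 * sqrt(out[[1]]))  # approximates the SE reported by rmspe
```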
If standard errors are not requested, a numeric value giving the prediction loss is returned.
Otherwise a list is returned, with the first component containing the prediction loss and the second component the corresponding standard error.
Andreas Alfons
Tukey, J.W. and McLaughlin, D.H. (1963) Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/winsorization. Sankhya: The Indian Journal of Statistics, Series A, 25(3), 331–352
Oehlert, G.W. (1992) A note on the delta method. The American Statistician, 46(1), 27–29.
# fit an MM regression model
library("robustbase")  # provides lmrob() and the coleman data
data("coleman")
fit <- lmrob(Y ~ ., data = coleman)

# compute the prediction loss from the fitted values
# (hence the prediction loss is underestimated in this simple
# example since all observations are used to fit the model)
mspe(coleman$Y, predict(fit))
rmspe(coleman$Y, predict(fit))
mape(coleman$Y, predict(fit))
tmspe(coleman$Y, predict(fit), trim = 0.1)
rtmspe(coleman$Y, predict(fit), trim = 0.1)

# include standard error
mspe(coleman$Y, predict(fit), includeSE = TRUE)
rmspe(coleman$Y, predict(fit), includeSE = TRUE)
mape(coleman$Y, predict(fit), includeSE = TRUE)
tmspe(coleman$Y, predict(fit), trim = 0.1, includeSE = TRUE)
rtmspe(coleman$Y, predict(fit), trim = 0.1, includeSE = TRUE)
Estimate the prediction error of a model via (repeated) K-fold cross-validation. It is thereby possible to supply an object returned by a model fitting function, a model fitting function itself, or an unevaluated function call to a model fitting function.
cvFit(object, ...)

## Default S3 method:
cvFit(object, data = NULL, x = NULL, y, cost = rmspe, K = 5, R = 1,
    foldType = c("random", "consecutive", "interleaved"),
    grouping = NULL, folds = NULL, names = NULL,
    predictArgs = list(), costArgs = list(),
    envir = parent.frame(), seed = NULL, ...)

## S3 method for class 'function'
cvFit(object, formula, data = NULL, x = NULL, y, args = list(),
    cost = rmspe, K = 5, R = 1,
    foldType = c("random", "consecutive", "interleaved"),
    grouping = NULL, folds = NULL, names = NULL,
    predictArgs = list(), costArgs = list(),
    envir = parent.frame(), seed = NULL, ...)

## S3 method for class 'call'
cvFit(object, data = NULL, x = NULL, y, cost = rmspe, K = 5, R = 1,
    foldType = c("random", "consecutive", "interleaved"),
    grouping = NULL, folds = NULL, names = NULL,
    predictArgs = list(), costArgs = list(),
    envir = parent.frame(), seed = NULL, ...)
object | the fitted model for which to estimate the prediction error, a function for fitting a model, or an unevaluated function call for fitting a model (see call). |
formula | a formula describing the model. |
data | a data frame containing the variables required for fitting the models. This is typically used if the model in the function call is described by a formula. |
x | a numeric matrix containing the predictor variables. This is typically used if the function call for fitting the models requires the predictor matrix and the response to be supplied as separate arguments. |
y | a numeric vector or matrix containing the response. |
args | a list of additional arguments to be passed to the model fitting function. |
cost | a cost function measuring prediction loss. It should expect the observed values of the response to be passed as the first argument and the predicted values as the second argument, and must return either a non-negative scalar value, or a list with the first component containing the prediction error and the second component containing the standard error. The default is to use the root mean squared prediction error (see cost). |
K | an integer giving the number of folds into which the data should be split (the default is five). Keep in mind that this should be chosen such that all folds are of approximately equal size. Setting K equal to the number of observations or groups yields leave-one-out cross-validation. |
R | an integer giving the number of replications for repeated K-fold cross-validation. |
foldType | a character string specifying the type of folds to be generated. Possible values are "random" (the default), "consecutive" or "interleaved". |
grouping | a factor specifying groups of observations. If supplied, the data are split according to the groups rather than individual observations such that all observations within a group belong to the same fold. |
folds | an object of class "cvFolds" giving the folds of the data for cross-validation (as returned by cvFolds). If supplied, this takes precedence over K and R. |
names | an optional character vector giving names for the arguments containing the data to be used in the function call (see “Details”). |
predictArgs | a list of additional arguments to be passed to the predict method of the fitted models. |
costArgs | a list of additional arguments to be passed to the prediction loss function cost. |
envir | the environment in which to evaluate the function call for fitting the models (see eval). |
seed | optional initial seed for the random number generator (see .Random.seed). |
... | additional arguments to be passed down. |
(Repeated) K-fold cross-validation is performed in the following way. The data are first split into K previously obtained blocks of approximately equal size. Each of the K data blocks is left out once to fit the model, and predictions are computed for the observations in the left-out block with the predict method of the fitted model. Thus a prediction is obtained for each observation.

The response variable and the obtained predictions for all observations are then passed to the prediction loss function cost to estimate the prediction error. For repeated cross-validation, this process is replicated and the estimated prediction errors from all replications as well as their average are included in the returned object.
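The procedure above can be sketched by hand for a single replication. This is a simplified illustration of what cvFit automates; the use of lm() and the response column name Y are placeholders, not requirements of the package:

```r
# One replication of K-fold cross-validation, written out by hand.
manualCV <- function(data, K = 5) {
  fold <- sample(rep(seq_len(K), length.out = nrow(data)))  # random fold labels
  yHat <- numeric(nrow(data))
  for (k in seq_len(K)) {
    fit <- lm(Y ~ ., data = data[fold != k, ])           # fit on the other K-1 blocks
    yHat[fold == k] <- predict(fit, data[fold == k, ])   # predict the left-out block
  }
  rmspe(data$Y, yHat)  # pass observations and predictions to the cost function
}
```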
Furthermore, if the response is a vector but the predict method of the fitted models returns a matrix, the prediction error is computed for each column. A typical use case for this behavior would be if the predict method returns predictions from an initial model fit and stepwise improvements thereof.
If formula or data are supplied, all variables required for fitting the models are added as one argument to the function call, which is the typical behavior of model fitting functions with a formula interface. In this case, the accepted values for names depend on the method. For the function method, a character vector of length two should be supplied, with the first element specifying the argument name for the formula and the second element specifying the argument name for the data (the default is to use c("formula", "data")). Note that names for both arguments should be supplied even if only one is actually used. For the other methods, which do not have a formula argument, a character string specifying the argument name for the data should be supplied (the default is to use "data").
If x is supplied, on the other hand, the predictor matrix and the response are added as separate arguments to the function call. In this case, names should be a character vector of length two, with the first element specifying the argument name for the predictor matrix and the second element specifying the argument name for the response (the default is to use c("x", "y")). It should be noted that the formula or data arguments take precedence over x.
An object of class "cv" with the following components:
n |
an integer giving the number of observations or groups. |
K |
an integer giving the number of folds. |
R |
an integer giving the number of replications. |
cv |
a numeric vector containing the respective estimated prediction errors. For repeated cross-validation, those are average values over all replications. |
se |
a numeric vector containing the respective estimated standard errors of the prediction loss. |
reps |
a numeric matrix in which each column contains the respective estimated prediction errors from all replications. This is only returned for repeated cross-validation. |
seed |
the seed of the random number generator before cross-validation was performed. |
call |
the matched function call. |
Andreas Alfons
cvTool, cvSelect, cvTuning, cvFolds, cost
library("robustbase")
data("coleman")

## via model fit
# fit an MM regression model
fit <- lmrob(Y ~ ., data = coleman)
# perform cross-validation
cvFit(fit, data = coleman, y = coleman$Y, cost = rtmspe,
    K = 5, R = 10, costArgs = list(trim = 0.1), seed = 1234)

## via model fitting function
# perform cross-validation
# note that the response is extracted from 'data' in
# this example and does not have to be supplied
cvFit(lmrob, formula = Y ~ ., data = coleman, cost = rtmspe,
    K = 5, R = 10, costArgs = list(trim = 0.1), seed = 1234)

## via function call
# set up function call
call <- call("lmrob", formula = Y ~ .)
# perform cross-validation
cvFit(call, data = coleman, y = coleman$Y, cost = rtmspe,
    K = 5, R = 10, costArgs = list(trim = 0.1), seed = 1234)
Split observations or groups of observations into K folds to be used for (repeated) K-fold cross-validation. K should thereby be chosen such that all folds are of approximately equal size.
cvFolds(n, K = 5, R = 1, type = c("random", "consecutive", "interleaved"),
    grouping = NULL)
n | an integer giving the number of observations to be split into folds. This is ignored if grouping is supplied. |
K | an integer giving the number of folds into which the observations should be split (the default is five). Setting K equal to the number of observations or groups yields leave-one-out cross-validation. |
R | an integer giving the number of replications for repeated K-fold cross-validation. |
type | a character string specifying the type of folds to be generated. Possible values are "random" (the default), "consecutive" or "interleaved". |
grouping | a factor specifying groups of observations. If supplied, the data are split according to the groups rather than individual observations such that all observations within a group belong to the same fold. |
An object of class "cvFolds" with the following components:
n |
an integer giving the number of observations or groups. |
K |
an integer giving the number of folds. |
R |
an integer giving the number of replications. |
subsets |
an integer matrix in which each column contains a permutation of the indices of the observations or groups. |
which |
an integer vector giving the fold for each permuted observation or group. |
grouping |
a list giving the indices of the observations belonging to each group. This is only returned if a grouping factor has been supplied. |
Andreas Alfons
set.seed(1234)  # set seed for reproducibility
cvFolds(20, K = 5, type = "random")
cvFolds(20, K = 5, type = "consecutive")
cvFolds(20, K = 5, type = "interleaved")
cvFolds(20, K = 5, R = 10)
Reshape cross-validation results into an object of class "cvSelect" with only one column of results.
cvReshape(x, ...)

## S3 method for class 'cv'
cvReshape(x, selectBest = c("min", "hastie"), seFactor = 1, ...)

## S3 method for class 'cvSelect'
cvReshape(x, selectBest = c("min", "hastie"), seFactor = 1, ...)
x | an object inheriting from class "cv" or "cvSelect" that contains cross-validation results. |
selectBest | a character string specifying a criterion for selecting the best model. Possible values are "min" (the default) or "hastie". |
seFactor | a numeric value giving a multiplication factor of the standard error for the selection of the best model. This is ignored if selectBest is "min". |
... | additional arguments to be passed down. |
An object of class "cvSelect" with the following components:
n |
an integer giving the number of observations. |
K |
an integer giving the number of folds used in cross-validation. |
R |
an integer giving the number of replications used in cross-validation. |
best |
an integer giving the index of the model with the best prediction performance. |
cv |
a data frame containing the estimated prediction errors for the models. For repeated cross-validation, those are average values over all replications. |
se |
a data frame containing the estimated standard errors of the prediction loss for the models. |
selectBest |
a character string specifying the criterion used for selecting the best model. |
seFactor |
a numeric value giving the multiplication factor of the standard error used for the selection of the best model. |
reps |
a data frame containing the estimated prediction errors for the models from all replications. This is only returned if repeated cross-validation was performed. |
Andreas Alfons
Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition.
library("robustbase")
data("coleman")

# perform cross-validation for an LTS regression model
fitLts <- ltsReg(Y ~ ., data = coleman)
cvFitLts <- cvLts(fitLts, cost = rtmspe, K = 5, R = 10,
    fit = "both", trim = 0.1, seed = 1234)

# compare original and reshaped object
cvFitLts
cvReshape(cvFitLts)
Combine cross-validation results for various models into one object and select the model with the best prediction performance.
cvSelect(..., .reshape = FALSE, .selectBest = c("min", "hastie"), .seFactor = 1)
... | objects inheriting from class "cv" or "cvSelect" that contain cross-validation results. |
.reshape | a logical indicating whether objects with more than one column of cross-validation results should be reshaped to have only one column (see “Details”). |
.selectBest | a character string specifying a criterion for selecting the best model. Possible values are "min" (the default) or "hastie". |
.seFactor | a numeric value giving a multiplication factor of the standard error for the selection of the best model. This is ignored if .selectBest is "min". |
Keep in mind that objects inheriting from class "cv" or "cvSelect" may contain multiple columns of cross-validation results. This is the case if the response is univariate but the predict method of the fitted model returns a matrix.
The .reshape argument determines how to handle such objects. If .reshape is FALSE, all objects are required to have the same number of columns and the best model for each column is selected. A typical use case for this behavior would be if the investigated models contain cross-validation results for a raw and a reweighted fit. It might then be of interest to compare the best model for the raw estimators with the best model for the reweighted estimators.
If .reshape is TRUE, objects with more than one column of results are first transformed with cvReshape to have only one column. Then the best overall model is selected.
It should also be noted that the argument names of .reshape, .selectBest and .seFactor start with a dot to avoid conflicts with the argument names used for the objects containing cross-validation results.
An object of class "cvSelect" with the following components:
n |
an integer giving the number of observations. |
K |
an integer vector giving the number of folds used in cross-validation for the respective model. |
R |
an integer vector giving the number of replications used in cross-validation for the respective model. |
best |
an integer vector giving the indices of the models with the best prediction performance. |
cv |
a data frame containing the estimated prediction errors for the models. For models for which repeated cross-validation was performed, those are average values over all replications. |
se |
a data frame containing the estimated standard errors of the prediction loss for the models. |
selectBest |
a character string specifying the criterion used for selecting the best model. |
seFactor |
a numeric value giving the multiplication factor of the standard error used for the selection of the best model. |
reps |
a data frame containing the estimated prediction errors from all replications for those models for which repeated cross-validation was performed. This is only returned if repeated cross-validation was performed for at least one of the models. |
Even though the function allows cross-validation results obtained with different numbers of folds or different numbers of replications to be compared, such comparisons should be made with care, and warnings are issued in those cases. For maximum comparability, the same data folds should be used in cross-validation for all models to be compared.
Andreas Alfons
Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition.
library("robustbase")
data("coleman")
set.seed(1234)  # set seed for reproducibility

# set up folds for cross-validation
folds <- cvFolds(nrow(coleman), K = 5, R = 10)

## compare LS, MM and LTS regression

# perform cross-validation for an LS regression model
fitLm <- lm(Y ~ ., data = coleman)
cvFitLm <- cvLm(fitLm, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an MM regression model
fitLmrob <- lmrob(Y ~ ., data = coleman)
cvFitLmrob <- cvLmrob(fitLmrob, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an LTS regression model
fitLts <- ltsReg(Y ~ ., data = coleman)
cvFitLts <- cvLts(fitLts, cost = rtmspe, folds = folds, trim = 0.1)

# compare cross-validation results
cvSelect(LS = cvFitLm, MM = cvFitLmrob, LTS = cvFitLts)
Basic function to estimate the prediction error of a model via (repeated) K-fold cross-validation. The model is thereby specified by an unevaluated function call to a model fitting function.
cvTool(call, data = NULL, x = NULL, y, cost = rmspe, folds,
       names = NULL, predictArgs = list(), costArgs = list(),
       envir = parent.frame())
call |
an unevaluated function call for fitting a
model (see |
data |
a data frame containing the variables required for fitting the models. This is typically used if the model in the function call is described by a formula. |
x |
a numeric matrix containing the predictor variables. This is typically used if the function call for fitting the models requires the predictor matrix and the response to be supplied as separate arguments. |
y |
a numeric vector or matrix containing the response. |
cost |
a cost function measuring prediction loss. It should expect the observed values of the response to be passed as the first argument and the predicted values as the second argument, and must return either a non-negative scalar value, or a list with the first component containing the prediction error and the second component containing the standard error. The default is to use the root mean squared prediction error (see cost). |
folds |
an object of class "cvFolds" giving the folds of the data for cross-validation. |
names |
an optional character vector giving names for the arguments containing the data to be used in the function call (see “Details”). |
predictArgs |
a list of additional arguments to be passed to the predict method of the fitted models. |
costArgs |
a list of additional arguments to be passed to the prediction loss function cost. |
envir |
the environment in which the function call for fitting the models is evaluated (the default is the calling environment). |
(Repeated) K-fold cross-validation is performed in the following way. The data are first split into K previously obtained blocks of approximately equal size (given by folds). Each of the K data blocks is left out once; the model is fit to the remaining data and predictions are computed for the observations in the left-out block with the predict method of the fitted model. Thus a prediction is obtained for each observation. The response variable and the obtained predictions for all observations are then passed to the prediction loss function cost to estimate the prediction error. For repeated cross-validation (as indicated by folds), this process is replicated and the estimated prediction errors from all replications are returned.
Furthermore, if the response is a vector but the predict method of the fitted models returns a matrix, the prediction error is computed for each column. A typical use case for this behavior is a predict method that returns predictions from an initial model fit and stepwise improvements thereof.
If data is supplied, all variables required for fitting the models are added as one argument to the function call, which is the typical behavior of model fitting functions with a formula interface. In this case, a character string specifying the argument name can be passed via names (the default is to use "data").
If x is supplied, on the other hand, the predictor matrix and the response are added as separate arguments to the function call. In this case, names should be a character vector of length two, with the first element specifying the argument name for the predictor matrix and the second element specifying the argument name for the response (the default is to use c("x", "y")). It should be noted that data takes precedence over x if both are supplied.
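As an illustration of the procedure described above, the following base R sketch performs one replication of K-fold cross-validation by hand (using the built-in mtcars data; the model and variables are chosen purely for illustration and are not part of cvTools):

```r
# Minimal base R sketch of (one replication of) K-fold cross-validation:
# split the data into K blocks, leave each block out once, fit the model
# to the remaining data, and collect predictions for the left-out block.
set.seed(1234)
n <- nrow(mtcars)
K <- 5
block <- sample(rep(seq_len(K), length.out = n))  # random fold assignment
yHat <- numeric(n)
for (k in seq_len(K)) {
  test <- block == k
  fit <- lm(mpg ~ wt + hp, data = mtcars[!test, ])
  yHat[test] <- predict(fit, newdata = mtcars[test, ])
}
# root mean squared prediction error (what the default cost rmspe computes)
sqrt(mean((mtcars$mpg - yHat)^2))
```

cvTool automates exactly this loop for an arbitrary unevaluated function call, with the fold assignment taken from a "cvFolds" object.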
If only one replication is requested and the prediction loss function cost also returns the standard error, a list is returned, with the first component containing the estimated prediction errors and the second component containing the corresponding estimated standard errors. Otherwise the return value is a numeric matrix in which each column contains the respective estimated prediction errors from all replications.
Andreas Alfons
cvFit, cvTuning, cvFolds, cost
library("robustbase")
data("coleman")
set.seed(1234)  # set seed for reproducibility

# set up function call for an MM regression model
call <- call("lmrob", formula = Y ~ .)
# set up folds for cross-validation
folds <- cvFolds(nrow(coleman), K = 5, R = 10)
# perform cross-validation
cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe,
       folds = folds, costArgs = list(trim = 0.1))
Select tuning parameters of a model by estimating the respective prediction errors via (repeated) K-fold cross-validation. It is thereby possible to supply a model fitting function or an unevaluated function call to a model fitting function.
cvTuning(object, ...)

## S3 method for class 'function'
cvTuning(object, formula, data = NULL, x = NULL, y, tuning = list(),
         args = list(), cost = rmspe, K = 5, R = 1,
         foldType = c("random", "consecutive", "interleaved"),
         grouping = NULL, folds = NULL, names = NULL,
         predictArgs = list(), costArgs = list(),
         selectBest = c("min", "hastie"), seFactor = 1,
         envir = parent.frame(), seed = NULL, ...)

## S3 method for class 'call'
cvTuning(object, data = NULL, x = NULL, y, tuning = list(),
         cost = rmspe, K = 5, R = 1,
         foldType = c("random", "consecutive", "interleaved"),
         grouping = NULL, folds = NULL, names = NULL,
         predictArgs = list(), costArgs = list(),
         selectBest = c("min", "hastie"), seFactor = 1,
         envir = parent.frame(), seed = NULL, ...)
object |
a function or an unevaluated function call
for fitting a model (see |
formula |
a formula describing the model. |
data |
a data frame containing the variables required for fitting the models. This is typically used if the model in the function call is described by a formula. |
x |
a numeric matrix containing the predictor variables. This is typically used if the function call for fitting the models requires the predictor matrix and the response to be supplied as separate arguments. |
y |
a numeric vector or matrix containing the response. |
tuning |
a list of arguments giving the tuning parameter values to be evaluated. The names of the list components should thereby correspond to the argument names of the tuning parameters. For each tuning parameter, a vector of values can be supplied. Cross-validation is then applied over the grid of all possible combinations of tuning parameter values. |
args |
a list of additional arguments to be passed to the model fitting function. |
cost |
a cost function measuring prediction loss. It should expect the observed values of the response to be passed as the first argument and the predicted values as the second argument, and must return either a non-negative scalar value, or a list with the first component containing the prediction error and the second component containing the standard error. The default is to use the root mean squared prediction error (see cost). |
K |
an integer giving the number of folds into which the data should be split (the default is five). Keep in mind that this should be chosen such that all folds are of approximately equal size. Setting K equal to the number of observations or groups yields leave-one-out cross-validation. |
R |
an integer giving the number of replications for repeated K-fold cross-validation (the default is one). |
foldType |
a character string specifying the type of folds to be generated. Possible values are "random" (the default), "consecutive" or "interleaved". |
grouping |
a factor specifying groups of observations. If supplied, the data are split according to the groups rather than individual observations such that all observations within a group belong to the same fold. |
folds |
an object of class "cvFolds" giving the folds of the data for cross-validation. |
names |
an optional character vector giving names for the arguments containing the data to be used in the function call (see “Details”). |
predictArgs |
a list of additional arguments to be passed to the predict method of the fitted models. |
costArgs |
a list of additional arguments to be passed to the prediction loss function cost. |
selectBest |
a character string specifying a criterion for selecting the best model. Possible values are "min" (the default) or "hastie". The former selects the model with the smallest prediction error; the latter selects the most parsimonious model whose prediction error is no larger than seFactor standard errors above the smallest prediction error. |
seFactor |
a numeric value giving a multiplication factor of the standard error for the selection of the best model. This is ignored if selectBest is "min". |
envir |
the environment in which the function call for fitting the models is evaluated (the default is the calling environment). |
seed |
optional initial seed for the random number
generator (see |
... |
additional arguments to be passed down. |
(Repeated) K-fold cross-validation is performed in the following way. The data are first split into K previously obtained blocks of approximately equal size. Each of the K data blocks is left out once; the model is fit to the remaining data and predictions are computed for the observations in the left-out block with the predict method of the fitted model. Thus a prediction is obtained for each observation. The response variable and the obtained predictions for all observations are then passed to the prediction loss function cost to estimate the prediction error. For repeated cross-validation, this process is replicated and the estimated prediction errors from all replications as well as their average are included in the returned object.
Furthermore, if the response is a vector but the predict method of the fitted models returns a matrix, the prediction error is computed for each column. A typical use case for this behavior is a predict method that returns predictions from an initial model fit and stepwise improvements thereof.
If formula or data are supplied, all variables required for fitting the models are added as one argument to the function call, which is the typical behavior of model fitting functions with a formula interface. In this case, the accepted values for names depend on the method. For the function method, a character vector of length two should be supplied, with the first element specifying the argument name for the formula and the second element specifying the argument name for the data (the default is to use c("formula", "data")). Note that names for both arguments should be supplied even if only one is actually used. For the call method, which does not have a formula argument, a character string specifying the argument name for the data should be supplied (the default is to use "data").
If x is supplied, on the other hand, the predictor matrix and the response are added as separate arguments to the function call. In this case, names should be a character vector of length two, with the first element specifying the argument name for the predictor matrix and the second element specifying the argument name for the response (the default is to use c("x", "y")). It should be noted that the formula or data arguments take precedence over x.
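The grid of tuning parameter values mentioned for the tuning argument is simply the cross product of the supplied vectors; a minimal base R sketch (the parameter names below are purely illustrative):

```r
# Sketch: cvTuning evaluates all combinations of the supplied tuning
# parameter values, i.e. one cross-validation run per row of the
# expanded grid below (parameter names are illustrative only).
tuning <- list(tuning.psi = c(3.443689, 4.685061), k.max = c(200, 500))
grid <- expand.grid(tuning)
grid   # 4 rows: all combinations of the two parameters
nrow(grid)
```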
If tuning is an empty list, cvFit is called and an object of class "cv" is returned. Otherwise an object of class "cvTuning" (which inherits from class "cvSelect") with the following components is returned:
n |
an integer giving the number of observations or groups. |
K |
an integer giving the number of folds. |
R |
an integer giving the number of replications. |
tuning |
a data frame containing the grid of tuning parameter values for which the prediction error was estimated. |
best |
an integer vector giving the indices of the optimal combinations of tuning parameters. |
cv |
a data frame containing the estimated prediction errors for all combinations of tuning parameter values. For repeated cross-validation, those are average values over all replications. |
se |
a data frame containing the estimated standard errors of the prediction loss for all combinations of tuning parameter values. |
selectBest |
a character string specifying the criterion used for selecting the best model. |
seFactor |
a numeric value giving the multiplication factor of the standard error used for the selection of the best model. |
reps |
a data frame containing the estimated prediction errors from all replications for all combinations of tuning parameter values. This is only returned for repeated cross-validation. |
seed |
the seed of the random number generator before cross-validation was performed. |
call |
the matched function call. |
The same cross-validation folds are used for all combinations of tuning parameter values for maximum comparability.
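The "hastie" option for selectBest implements a one-standard-error rule. A toy base R sketch of that selection step (the numbers are made up, and the candidates are assumed to be ordered from most parsimonious to most complex):

```r
# Toy sketch of the "hastie" selection rule: choose the most parsimonious
# candidate whose estimated prediction error lies within
# seFactor * (standard error) of the minimum (all values are made up).
cv <- c(1.30, 1.10, 1.05, 1.07)  # estimated prediction errors per candidate
se <- c(0.08, 0.06, 0.05, 0.05)  # corresponding standard errors
seFactor <- 1
cutoff <- min(cv) + seFactor * se[which.min(cv)]
which(cv <= cutoff)[1]           # first (most parsimonious) candidate within the cutoff
```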
Andreas Alfons
Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition.
cvTool, cvFit, cvSelect, cvFolds, cost
library("robustbase")
data("coleman")

## evaluate MM regression models tuned for 85% and 95% efficiency
tuning <- list(tuning.psi = c(3.443689, 4.685061))

## via model fitting function
# perform cross-validation
# note that the response is extracted from 'data' in
# this example and does not have to be supplied
cvTuning(lmrob, formula = Y ~ ., data = coleman, tuning = tuning,
         cost = rtmspe, K = 5, R = 10, costArgs = list(trim = 0.1),
         seed = 1234)

## via function call
# set up function call
call <- call("lmrob", formula = Y ~ .)
# perform cross-validation
cvTuning(call, data = coleman, y = coleman$Y, tuning = tuning,
         cost = rtmspe, K = 5, R = 10, costArgs = list(trim = 0.1),
         seed = 1234)
Produce kernel density plots of results from repeated K-fold cross-validation.
## S3 method for class 'cv'
densityplot(x, data, select = NULL, ...)

## S3 method for class 'cvSelect'
densityplot(x, data, subset = NULL, select = NULL, ...)
x |
an object inheriting from class "cv" or "cvSelect" that contains cross-validation results. |
data |
currently ignored. |
subset |
a character, integer or logical vector indicating the subset of models for which to plot the cross-validation results. |
select |
a character, integer or logical vector indicating the columns of cross-validation results to be plotted. |
... |
additional arguments to be passed to the densityplot function from the lattice package. |
For objects with multiple columns of repeated cross-validation results, conditional kernel density plots are produced.
An object of class "trellis" is returned invisibly. The update method can be used to update components of the object, and the print method (usually called by default) will plot it on an appropriate plotting device.
Andreas Alfons
cvFit, cvSelect, cvTuning, plot, bwplot, xyplot, dotplot
library("robustbase")
data("coleman")
set.seed(1234)  # set seed for reproducibility

## set up folds for cross-validation
folds <- cvFolds(nrow(coleman), K = 5, R = 10)

## compare LS, MM and LTS regression

# perform cross-validation for an LS regression model
fitLm <- lm(Y ~ ., data = coleman)
cvFitLm <- cvLm(fitLm, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an MM regression model
fitLmrob <- lmrob(Y ~ ., data = coleman, k.max = 500)
cvFitLmrob <- cvLmrob(fitLmrob, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an LTS regression model
fitLts <- ltsReg(Y ~ ., data = coleman)
cvFitLts <- cvLts(fitLts, cost = rtmspe, folds = folds, trim = 0.1)

# combine results into one object
cvFits <- cvSelect(LS = cvFitLm, MM = cvFitLmrob, LTS = cvFitLts)
cvFits

# plot results for the MM regression model
densityplot(cvFitLmrob)
# plot combined results
densityplot(cvFits)
Produce dot plots of (average) results from (repeated) K-fold cross-validation.
## S3 method for class 'cv'
dotplot(x, data, select = NULL, seFactor = NA, ...)

## S3 method for class 'cvSelect'
dotplot(x, data, subset = NULL, select = NULL,
        seFactor = x$seFactor, ...)
x |
an object inheriting from class "cv" or "cvSelect" that contains cross-validation results. |
data |
currently ignored. |
subset |
a character, integer or logical vector indicating the subset of models for which to plot the cross-validation results. |
select |
a character, integer or logical vector indicating the columns of cross-validation results to be plotted. |
seFactor |
a numeric value giving the multiplication factor of the standard error for displaying error bars. Error bars can be suppressed by setting this to NA. |
... |
additional arguments to be passed to the dotplot function from the lattice package. |
For objects with multiple columns of repeated cross-validation results, conditional dot plots are produced.
An object of class "trellis" is returned invisibly. The update method can be used to update components of the object, and the print method (usually called by default) will plot it on an appropriate plotting device.
Andreas Alfons
cvFit, cvSelect, cvTuning, plot, xyplot, bwplot, densityplot
library("robustbase")
data("coleman")
set.seed(1234)  # set seed for reproducibility

## set up folds for cross-validation
folds <- cvFolds(nrow(coleman), K = 5, R = 10)

## compare LS, MM and LTS regression

# perform cross-validation for an LS regression model
fitLm <- lm(Y ~ ., data = coleman)
cvFitLm <- cvLm(fitLm, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an MM regression model
fitLmrob <- lmrob(Y ~ ., data = coleman, k.max = 500)
cvFitLmrob <- cvLmrob(fitLmrob, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an LTS regression model
fitLts <- ltsReg(Y ~ ., data = coleman)
cvFitLts <- cvLts(fitLts, cost = rtmspe, folds = folds, trim = 0.1)

# combine and plot results
cvFits <- cvSelect(LS = cvFitLm, MM = cvFitLmrob, LTS = cvFitLts)
cvFits
dotplot(cvFits)
Plot results from (repeated) K-fold cross-validation.
## S3 method for class 'cv'
plot(x, method = c("bwplot", "densityplot", "xyplot", "dotplot"), ...)

## S3 method for class 'cvSelect'
plot(x, method = c("bwplot", "densityplot", "xyplot", "dotplot"), ...)
x |
an object inheriting from class "cv" or "cvSelect" that contains cross-validation results. |
method |
a character string specifying the type of plot. Possible values are "bwplot", "densityplot", "xyplot" or "dotplot" (see the respective plot functions for details). |
... |
additional arguments to be passed down. |
For objects with multiple columns of cross-validation results, conditional plots are produced.
An object of class "trellis" is returned invisibly. The update method can be used to update components of the object, and the print method (usually called by default) will plot it on an appropriate plotting device.
Andreas Alfons
cvFit, cvSelect, cvTuning, bwplot, densityplot, xyplot, dotplot
library("robustbase")
data("coleman")
set.seed(1234)  # set seed for reproducibility

# set up folds for cross-validation
folds <- cvFolds(nrow(coleman), K = 5, R = 10)

## compare LS, MM and LTS regression

# perform cross-validation for an LS regression model
fitLm <- lm(Y ~ ., data = coleman)
cvFitLm <- cvLm(fitLm, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an MM regression model
fitLmrob <- lmrob(Y ~ ., data = coleman, k.max = 500)
cvFitLmrob <- cvLmrob(fitLmrob, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an LTS regression model
fitLts <- ltsReg(Y ~ ., data = coleman)
cvFitLts <- cvLts(fitLts, cost = rtmspe, folds = folds, trim = 0.1)

# combine results into one object
cvFits <- cvSelect(LS = cvFitLm, MM = cvFitLmrob, LTS = cvFitLts)
cvFits

# plot results for the MM regression model
plot(cvFitLmrob, method = "bw")
plot(cvFitLmrob, method = "density")

# plot combined results
plot(cvFits, method = "bw")
plot(cvFits, method = "density")
plot(cvFits, method = "xy")
plot(cvFits, method = "dot")
Estimate the prediction error of a linear model via (repeated) K-fold cross-validation. Cross-validation functions are available for least squares fits computed with lm as well as for the following robust alternatives: MM-type models computed with lmrob and least trimmed squares fits computed with ltsReg.
repCV(object, ...)

## S3 method for class 'lm'
repCV(object, cost = rmspe, K = 5, R = 1,
      foldType = c("random", "consecutive", "interleaved"),
      grouping = NULL, folds = NULL, seed = NULL, ...)

## S3 method for class 'lmrob'
repCV(object, cost = rtmspe, K = 5, R = 1,
      foldType = c("random", "consecutive", "interleaved"),
      grouping = NULL, folds = NULL, seed = NULL, ...)

## S3 method for class 'lts'
repCV(object, cost = rtmspe, K = 5, R = 1,
      foldType = c("random", "consecutive", "interleaved"),
      grouping = NULL, folds = NULL,
      fit = c("reweighted", "raw", "both"), seed = NULL, ...)

cvLm(object, cost = rmspe, K = 5, R = 1,
     foldType = c("random", "consecutive", "interleaved"),
     grouping = NULL, folds = NULL, seed = NULL, ...)

cvLmrob(object, cost = rtmspe, K = 5, R = 1,
        foldType = c("random", "consecutive", "interleaved"),
        grouping = NULL, folds = NULL, seed = NULL, ...)

cvLts(object, cost = rtmspe, K = 5, R = 1,
      foldType = c("random", "consecutive", "interleaved"),
      grouping = NULL, folds = NULL,
      fit = c("reweighted", "raw", "both"), seed = NULL, ...)
object |
an object returned from a model fitting function. Methods are implemented for objects of class "lm" computed with lm, objects of class "lmrob" computed with lmrob, and objects of class "lts" computed with ltsReg. |
cost |
a cost function measuring prediction loss. It should expect the observed values of the response to be passed as the first argument and the predicted values as the second argument, and must return either a non-negative scalar value, or a list with the first component containing the prediction error and the second component containing the standard error. The default is to use the root mean squared prediction error for the "lm" method and the root trimmed mean squared prediction error for the "lmrob" and "lts" methods (see cost). |
K |
an integer giving the number of folds into which the data should be split (the default is five). Keep in mind that this should be chosen such that all folds are of approximately equal size. Setting K equal to the number of observations or groups yields leave-one-out cross-validation. |
R |
an integer giving the number of replications for repeated K-fold cross-validation (the default is one). |
foldType |
a character string specifying the type of folds to be generated. Possible values are "random" (the default), "consecutive" or "interleaved". |
grouping |
a factor specifying groups of observations. If supplied, the data are split according to the groups rather than individual observations such that all observations within a group belong to the same fold. |
folds |
an object of class "cvFolds" giving the folds of the data for cross-validation. |
fit |
a character string specifying for which fit to estimate the prediction error. Possible values are "reweighted" (the default) for the prediction error of the reweighted fit, "raw" for the prediction error of the raw fit, or "both" for both. |
seed |
optional initial seed for the random number
generator (see |
... |
additional arguments to be passed to the prediction loss function cost. |
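The default cost for the robust methods, rtmspe, is a trimmed version of the root mean squared prediction error. A rough base R sketch of such a trimmed loss (the exact trimming convention used by cvTools may differ slightly; the function name below is illustrative):

```r
# Rough sketch of a trimmed root mean squared prediction error:
# the largest squared errors are discarded before averaging, so that
# poor predictions for outlying observations do not dominate the loss.
trimmedRMSPE <- function(y, yHat, trim = 0.25) {
  sq <- sort((y - yHat)^2)                        # squared prediction errors
  keep <- length(sq) - floor(trim * length(sq))   # drop the 'trim' largest
  sqrt(mean(sq[seq_len(keep)]))
}
# the huge error for the last (outlying) observation is trimmed away
trimmedRMSPE(c(1, 2, 3, 100), c(1.1, 2.2, 2.9, 3), trim = 0.25)
```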
(Repeated) K-fold cross-validation is performed in the following way. The data are first split into K previously obtained blocks of approximately equal size. Each of the K data blocks is left out once; the model is fit to the remaining data and predictions are computed for the observations in the left-out block with the predict method of the fitted model. Thus a prediction is obtained for each observation. The response variable and the obtained predictions for all observations are then passed to the prediction loss function cost to estimate the prediction error. For repeated cross-validation, this process is replicated and the estimated prediction errors from all replications as well as their average are included in the returned object.
An object of class "cv" with the following components:
n |
an integer giving the number of observations or groups. |
K |
an integer giving the number of folds. |
R |
an integer giving the number of replications. |
cv |
a numeric vector containing the estimated prediction errors. For the "lts" method with fit = "both", this contains one value each for the reweighted and the raw fit. |
se |
a numeric vector containing the estimated standard errors of the prediction loss. For the "lts" method with fit = "both", this contains one value each for the reweighted and the raw fit. |
reps |
a numeric matrix containing the estimated prediction errors from all replications. For the "lts" method with fit = "both", this contains one column each for the reweighted and the raw fit. This is only returned for repeated cross-validation. |
seed |
the seed of the random number generator before cross-validation was performed. |
call |
the matched function call. |
The repCV methods are simple wrapper functions that extract the data from the fitted model and call cvFit to perform cross-validation. In addition, cvLm, cvLmrob and cvLts are aliases for the respective methods.
Andreas Alfons
cvFit, cvFolds, cost, lm, lmrob, ltsReg
library("robustbase")
data("coleman")
set.seed(1234)  # set seed for reproducibility

# set up folds for cross-validation
folds <- cvFolds(nrow(coleman), K = 5, R = 10)

# perform cross-validation for an LS regression model
fitLm <- lm(Y ~ ., data = coleman)
repCV(fitLm, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an MM regression model
fitLmrob <- lmrob(Y ~ ., data = coleman)
repCV(fitLmrob, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an LTS regression model
fitLts <- ltsReg(Y ~ ., data = coleman)
repCV(fitLts, cost = rtmspe, folds = folds, trim = 0.1)
repCV(fitLts, cost = rtmspe, folds = folds, fit = "both", trim = 0.1)
Extract subsets of results from (repeated) K-fold cross-validation.
## S3 method for class 'cv'
subset(x, select = NULL, ...)

## S3 method for class 'cvSelect'
subset(x, subset = NULL, select = NULL, ...)
x |
an object inheriting from class "cv" or "cvSelect" that contains cross-validation results. |
subset |
a character, integer or logical vector indicating the subset of models for which to keep the cross-validation results. |
select |
a character, integer or logical vector indicating the columns of cross-validation results to be extracted. |
... |
currently ignored. |
An object similar to x containing just the selected results.
Andreas Alfons
cvFit, cvSelect, cvTuning, subset
library("robustbase")
data("coleman")
set.seed(1234)  # set seed for reproducibility

## set up folds for cross-validation
folds <- cvFolds(nrow(coleman), K = 5, R = 10)

## compare raw and reweighted LTS estimators for
## 50% and 75% subsets

# 50% subsets
fitLts50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5)
cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)

# 75% subsets
fitLts75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75)
cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)

# combine results into one object
cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75)
cvFitsLts

# extract reweighted LTS results with 50% subsets
subset(cvFitLts50, select = "reweighted")
subset(cvFitsLts, subset = c(TRUE, FALSE), select = "reweighted")
Produce a summary of results from (repeated) K-fold
cross-validation.
## S3 method for class 'cv'
summary(object, ...)

## S3 method for class 'cvSelect'
summary(object, ...)

## S3 method for class 'cvTuning'
summary(object, ...)
object |
an object inheriting from class "cv", "cvSelect" or "cvTuning" that contains cross-validation results. |
... |
currently ignored. |
An object of class "summary.cv", "summary.cvSelect" or "summary.cvTuning", depending on the class of object.
Andreas Alfons
cvFit
, cvSelect
,
cvTuning
, summary
library("robustbase")
data("coleman")
set.seed(1234)  # set seed for reproducibility

## set up folds for cross-validation
folds <- cvFolds(nrow(coleman), K = 5, R = 10)

## compare raw and reweighted LTS estimators for
## 50% and 75% subsets

# 50% subsets
fitLts50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5)
cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds,
    fit = "both", trim = 0.1)

# 75% subsets
fitLts75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75)
cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds,
    fit = "both", trim = 0.1)

# combine results into one object
cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75)
cvFitsLts

# summary of the results with the 50% subsets
summary(cvFitLts50)
# summary of the combined results
summary(cvFitsLts)
Plot the (average) results from (repeated) K-fold
cross-validation on the y-axis against the
respective models on the x-axis.
## S3 method for class 'cv'
xyplot(x, data, select = NULL, seFactor = NA, ...)

## S3 method for class 'cvSelect'
xyplot(x, data, subset = NULL, select = NULL,
    seFactor = x$seFactor, ...)

## S3 method for class 'cvTuning'
xyplot(x, data, subset = NULL, select = NULL,
    seFactor = x$seFactor, ...)
x |
an object inheriting from class "cv" or "cvSelect" that contains cross-validation results. |
data |
currently ignored. |
subset |
a character, integer or logical vector indicating the subset of models for which to plot the cross-validation results. |
select |
a character, integer or logical vector indicating the columns of cross-validation results to be plotted. |
seFactor |
a numeric value giving the multiplication
factor of the standard error for displaying error bars.
Error bars can be suppressed by setting this to
NA. |
... |
additional arguments to be passed to the
default method of xyplot. |
For objects with multiple columns of repeated cross-validation results, conditional plots are produced.
In most situations, the default behavior is to represent the cross-validation results for each model by a vertical line segment (i.e., to call the default method of xyplot with type = "h"). However, the behavior is different for objects of class "cvTuning" with only one numeric tuning parameter. In that situation, the cross-validation results are plotted against the values of the tuning parameter as a connected line (i.e., by using type = "b").

The default behavior can of course be overridden by supplying the type argument (a full list of accepted values can be found in the help file of panel.xyplot).
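As a minimal sketch of what these type values mean, the same override can be illustrated with plain lattice (which this package depends on); the data frame below is hypothetical, not from the package:

```r
library("lattice")

# toy results: prediction error for five hypothetical models
results <- data.frame(Fit = 1:5, CV = c(1.2, 0.9, 1.1, 0.8, 1.0))

# vertical line segments, the default used for most cross-validation objects
p1 <- xyplot(CV ~ Fit, data = results, type = "h")

# points connected by lines, the default for "cvTuning" objects with
# one numeric tuning parameter
p2 <- xyplot(CV ~ Fit, data = results, type = "b")
print(p2)
```

Any other value accepted by panel.xyplot (e.g., type = "p") can be supplied in the same way.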
An object of class "trellis" is returned invisibly. The update method can be used to update components of the object and the print method (usually called by default) will plot it on an appropriate plotting device.
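This update/print mechanism is standard lattice behavior, so it can be sketched with any "trellis" object; the example below uses the built-in iris data rather than cross-validation results:

```r
library("lattice")

# any lattice call returns a "trellis" object invisibly
p <- xyplot(Sepal.Length ~ Sepal.Width, data = iris)

# update components of the stored object without recreating it
p <- update(p, main = "Iris sepals", xlab = "Width", ylab = "Length")

# printing the object draws it on the active plotting device
print(p)
```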
Andreas Alfons
cvFit
, cvSelect
,
cvTuning
, plot
,
dotplot
,
bwplot
,
densityplot
library("robustbase")
data("coleman")
set.seed(1234)  # set seed for reproducibility

## set up folds for cross-validation
folds <- cvFolds(nrow(coleman), K = 5, R = 10)

## compare LS, MM and LTS regression

# perform cross-validation for an LS regression model
fitLm <- lm(Y ~ ., data = coleman)
cvFitLm <- cvLm(fitLm, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an MM regression model
fitLmrob <- lmrob(Y ~ ., data = coleman, k.max = 500)
cvFitLmrob <- cvLmrob(fitLmrob, cost = rtmspe, folds = folds, trim = 0.1)

# perform cross-validation for an LTS regression model
fitLts <- ltsReg(Y ~ ., data = coleman)
cvFitLts <- cvLts(fitLts, cost = rtmspe, folds = folds, trim = 0.1)

# combine and plot results
cvFits <- cvSelect(LS = cvFitLm, MM = cvFitLmrob, LTS = cvFitLts)
cvFits
xyplot(cvFits)