Title: Resampling-Based Prediction Error Estimation for Regression Models
Description: Tools that allow developers to write functions for prediction error estimation with minimal programming effort and assist users with model selection in regression problems.
Authors: Andreas Alfons [aut, cre]
Maintainer: Andreas Alfons <[email protected]>
License: GPL (>= 2)
Version: 0.3.1
Built: 2024-11-21 03:04:08 UTC
Source: https://github.com/aalfons/perry
The DESCRIPTION file:
Package: perry
Type: Package
Title: Resampling-Based Prediction Error Estimation for Regression Models
Version: 0.3.1
Date: 2021-11-03
Depends: R (>= 2.14.1), ggplot2 (>= 0.9.2), parallel
Imports: stats, utils
Suggests: perryExamples, robustbase
Description: Tools that allow developers to write functions for prediction error estimation with minimal programming effort and assist users with model selection in regression problems.
License: GPL (>= 2)
LazyLoad: yes
Authors@R: person("Andreas", "Alfons", email = "[email protected]", role = c("aut", "cre"))
Author: Andreas Alfons [aut, cre]
Maintainer: Andreas Alfons <[email protected]>
Encoding: UTF-8
RoxygenNote: 7.1.2
Repository: https://aalfons.r-universe.dev
RemoteUrl: https://github.com/aalfons/perry
RemoteRef: HEAD
RemoteSha: e368947972f6b8a0a848ddfa27aa2a46f22665b9
Index of help topics:
accessors           Access or set information on resampling-based prediction error results
aggregate.perry     Aggregate resampling-based prediction error results
bootControl         Control object for bootstrap samples
bootSamples         Bootstrap samples
cost                Prediction loss
cvFolds             Cross-validation folds
foldControl         Control object for cross-validation folds
perry               Resampling-based prediction error for fitted models
perry-deprecated    Deprecated functions in package 'perry'
perry-package       Resampling-Based Prediction Error Estimation for Regression Models
perryFit            Resampling-based prediction error for model evaluation
perryPlot           Plot resampling-based prediction error results
perryReshape        Reshape resampling-based prediction error results
perrySelect         Model selection via resampling-based prediction error measures
perrySplits         Data splits for resampling-based prediction error measures
perryTuning         Resampling-based prediction error for tuning parameter selection
randomSplits        Random data splits
reperry             Recompute resampling-based prediction error measures
setupPerryPlot      Set up a plot of resampling-based prediction error results
splitControl        Control object for random data splits
subset.perry        Subsetting resampling-based prediction error results
summary.perry       Summarize resampling-based prediction error results
Andreas Alfons [aut, cre]
Maintainer: Andreas Alfons <[email protected]>
Retrieve or set the names of resampling-based prediction error results, retrieve or set the identifiers of the models, or retrieve the number of prediction error results or included models.
peNames(x)
peNames(x) <- value

fits(x)
fits(x) <- value

npe(x)

nfits(x)
x: an object inheriting from class "perry" or "perrySelect" that contains prediction error results.

value: a vector of replacement values.
peNames returns the names of the prediction error results. The replacement function thereby returns them invisibly.

fits returns the identifiers of the models for objects inheriting from class "perrySelect" and NULL for objects inheriting from class "perry". The replacement function thereby returns those values invisibly.

npe returns the number of prediction error results.

nfits returns the number of models included in objects inheriting from class "perrySelect" and NULL for objects inheriting from class "perry".
Andreas Alfons
perryFit, perrySelect, perryTuning
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare raw and reweighted LTS estimators for ## 50% and 75% subsets # 50% subsets fit50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5) cv50 <- perry(fit50, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # 75% subsets fit75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75) cv75 <- perry(fit75, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # combine results into one object cv <- perrySelect("0.5" = cv50, "0.75" = cv75) cv # "perry" object npe(cv50) nfits(cv50) peNames(cv50) peNames(cv50) <- c("improved", "initial") fits(cv50) cv50 # "perrySelect" object npe(cv) nfits(cv) peNames(cv) peNames(cv) <- c("improved", "initial") fits(cv) fits(cv) <- 1:2 cv
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare raw and reweighted LTS estimators for ## 50% and 75% subsets # 50% subsets fit50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5) cv50 <- perry(fit50, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # 75% subsets fit75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75) cv75 <- perry(fit75, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # combine results into one object cv <- perrySelect("0.5" = cv50, "0.75" = cv75) cv # "perry" object npe(cv50) nfits(cv50) peNames(cv50) peNames(cv50) <- c("improved", "initial") fits(cv50) cv50 # "perrySelect" object npe(cv) nfits(cv) peNames(cv) peNames(cv) <- c("improved", "initial") fits(cv) fits(cv) <- 1:2 cv
Compute summary statistics of resampling-based prediction error results.
## S3 method for class 'perry'
aggregate(x, FUN = mean, select = NULL, ...)

## S3 method for class 'perrySelect'
aggregate(x, FUN = mean, select = NULL, ...)

## S3 method for class 'perryTuning'
aggregate(x, ...)
x: an object inheriting from class "perry" or "perrySelect" that contains prediction error results.

FUN: a function to compute the summary statistics.

select: a character, integer or logical vector indicating the columns of prediction error results for which to compute the summary statistics.

...: for the "perryTuning" method, additional arguments to be passed down to the "perrySelect" method; for the other methods, additional arguments to be passed to FUN.
The "perry"
method returns a vector or matrix of aggregated
prediction error results, depending on whether FUN
returns a single
value or a vector.
For the other methods, a data frame containing the aggregated
prediction error results for each model is returned. In the case of the
"perryTuning"
method, the data frame contains the combinations of
tuning parameters rather than a column describing the models.
Duplicate indices in subset
or select
are removed such
that all models and prediction error results are unique.
Andreas Alfons
perryFit, perrySelect, perryTuning, aggregate
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare raw and reweighted LTS estimators for ## 50% and 75% subsets # 50% subsets fit50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5) cv50 <- perry(fit50, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # 75% subsets fit75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75) cv75 <- perry(fit75, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # combine results into one object cv <- perrySelect("0.5" = cv50, "0.75" = cv75) cv # summary of the results with the 50% subsets aggregate(cv50, summary) # summary of the combined results aggregate(cv, summary)
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare raw and reweighted LTS estimators for ## 50% and 75% subsets # 50% subsets fit50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5) cv50 <- perry(fit50, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # 75% subsets fit75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75) cv75 <- perry(fit75, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # combine results into one object cv <- perrySelect("0.5" = cv50, "0.75" = cv75) cv # summary of the results with the 50% subsets aggregate(cv50, summary) # summary of the combined results aggregate(cv, summary)
Generate an object that controls how to draw bootstrap samples and which bootstrap estimator of prediction error to compute.
bootControl(R = 1, type = c("0.632", "out-of-bag"), grouping = NULL)
R: an integer giving the number of bootstrap samples.

type: a character string specifying a bootstrap estimator. Possible values are "0.632" (the default) or "out-of-bag".

grouping: a factor specifying groups of observations.
An object of class "bootControl"
with the following
components:
R
an integer giving the number of bootstrap samples.
type
a character string specifying the type of bootstrap estimator.
grouping
if supplied, a factor specifying groups of observations. The groups will then be resampled rather than individual observations such that all observations within a group belong either to the bootstrap sample or the test data.
Andreas Alfons
Efron, B. (1983) Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 78(382), 316–331.
perrySplits, bootSamples, foldControl, splitControl
set.seed(1234)  # set seed for reproducibility
perrySplits(20, bootControl())
perrySplits(20, bootControl(R = 10))
Draw bootstrap samples of observations or groups of observations and specify which bootstrap estimator of prediction error to compute.
bootSamples(n, R = 1, type = c("0.632", "out-of-bag"), grouping = NULL)
n: an integer giving the number of observations for which to draw bootstrap samples. This is ignored if grouping is supplied.

R: an integer giving the number of bootstrap samples.

type: a character string specifying a bootstrap estimator. Possible values are "0.632" (the default) or "out-of-bag".

grouping: a factor specifying groups of observations. If supplied, the groups are resampled rather than individual observations such that all observations within a group belong either to the bootstrap sample or the test data.
An object of class "bootSamples"
with the following
components:
n
an integer giving the number of observations or groups.
R
an integer giving the number of bootstrap samples.
subsets
an integer matrix in which each column contains the indices of the observations or groups in the corresponding bootstrap sample.
grouping
a list giving the indices of the observations belonging to each group. This is only returned if a grouping factor has been supplied.
This is a simple wrapper function for perrySplits with a control object generated by bootControl.
Andreas Alfons
Efron, B. (1983) Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 78(382), 316–331.
perrySplits, bootControl, cvFolds, randomSplits
set.seed(1234)  # set seed for reproducibility
bootSamples(20)
bootSamples(20, R = 10)
Compute the prediction loss of a model.
mspe(y, yHat, includeSE = FALSE)

rmspe(y, yHat, includeSE = FALSE)

mape(y, yHat, includeSE = FALSE)

tmspe(y, yHat, trim = 0.25, includeSE = FALSE)

rtmspe(y, yHat, trim = 0.25, includeSE = FALSE)
y: a numeric vector or matrix giving the observed values.

yHat: a numeric vector or matrix of the same dimensions as y giving the fitted or predicted values.

includeSE: a logical indicating whether standard errors should be computed as well.

trim: a numeric value giving the trimming proportion (the default is 0.25).
mspe and rmspe compute the mean squared prediction error and the root mean squared prediction error, respectively. In addition, mape returns the mean absolute prediction error, which is somewhat more robust.

Robust prediction loss based on trimming is implemented in tmspe and rtmspe. To be more precise, tmspe computes the trimmed mean squared prediction error and rtmspe computes the root trimmed mean squared prediction error. A proportion of the largest squared differences of the observed and fitted values are thereby trimmed.

Standard errors can be requested via the includeSE argument. Note that standard errors for tmspe are based on a winsorized standard deviation. Furthermore, standard errors for rmspe and rtmspe are computed from the respective standard errors of mspe and tmspe via the delta method.
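For the root versions, the delta method boils down to a simple relation; the following is a brief sketch with hypothetical variables (not package objects):

## delta method for g(m) = sqrt(m): the standard error of sqrt(m) is
## approximately se(m) / (2 * sqrt(m)); 'm' and 'seM' stand for the
## (trimmed) mean squared prediction error and its standard error
seRoot <- seM / (2 * sqrt(m))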
If standard errors are not requested, a numeric value giving the prediction loss is returned.
Otherwise a list is returned, with the first component containing the prediction loss and the second component the corresponding standard error.
Andreas Alfons
Tukey, J.W. and McLaughlin, D.H. (1963) Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/winsorization. Sankhya: The Indian Journal of Statistics, Series A, 25(3), 331–352.
Oehlert, G.W. (1992) A note on the delta method. The American Statistician, 46(1), 27–29.
# fit an MM-regression model
library("robustbase")
data("coleman")
fit <- lmrob(Y ~ ., data = coleman)

# compute the prediction loss from the fitted values
# (hence the prediction loss is underestimated in this simple
# example since all observations are used to fit the model)
mspe(coleman$Y, predict(fit))
rmspe(coleman$Y, predict(fit))
mape(coleman$Y, predict(fit))
tmspe(coleman$Y, predict(fit), trim = 0.1)
rtmspe(coleman$Y, predict(fit), trim = 0.1)

# include standard error
mspe(coleman$Y, predict(fit), includeSE = TRUE)
rmspe(coleman$Y, predict(fit), includeSE = TRUE)
mape(coleman$Y, predict(fit), includeSE = TRUE)
tmspe(coleman$Y, predict(fit), trim = 0.1, includeSE = TRUE)
rtmspe(coleman$Y, predict(fit), trim = 0.1, includeSE = TRUE)
Split observations or groups of observations into folds to be used for (repeated) K-fold cross-validation. K should thereby be chosen such that all folds are of approximately equal size.
cvFolds(n, K = 5, R = 1, type = c("random", "consecutive", "interleaved"),
        grouping = NULL)
n: an integer giving the number of observations to be split into folds. This is ignored if grouping is supplied.

K: an integer giving the number of folds into which the observations should be split (the default is five). Setting K equal to the number of observations or groups yields leave-one-out cross-validation.

R: an integer giving the number of replications for repeated K-fold cross-validation.

type: a character string specifying the type of folds to be generated. Possible values are "random" (the default), "consecutive" or "interleaved".

grouping: a factor specifying groups of observations. If supplied, the data are split according to the groups rather than individual observations such that all observations within a group belong to the same fold.
An object of class "cvFolds"
with the following components:
n
an integer giving the number of observations or groups.
K
an integer giving the number of folds.
R
an integer giving the number of replications.
subsets
an integer matrix in which each column contains a permutation of the indices of the observations or groups.
which
an integer vector giving the fold for each permuted observation or group.
grouping
a list giving the indices of the observations belonging to each group. This is only returned if a grouping factor has been supplied.
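As a brief illustration of the grouping argument, here is a sketch that is not part of the package examples; the grouping factor is made up:

## observations from the same group always end up in the same fold,
## e.g., repeated measurements of the same subjects
grp <- factor(rep(1:4, each = 5))   # hypothetical grouping of 20 observations
cvFolds(20, K = 2, grouping = grp)  # 'n' is ignored when 'grouping' is supplied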
This is a simple wrapper function for perrySplits with a control object generated by foldControl.
Andreas Alfons
perrySplits, foldControl, randomSplits, bootSamples
set.seed(1234)  # set seed for reproducibility
cvFolds(20, K = 5, type = "random")
cvFolds(20, K = 5, type = "consecutive")
cvFolds(20, K = 5, type = "interleaved")
cvFolds(20, K = 5, R = 10)
Generate an object that controls how to split observations or groups of observations into K folds to be used for (repeated) K-fold cross-validation. K should thereby be chosen such that all folds are of approximately equal size.
foldControl(K = 5, R = 1, type = c("random", "consecutive", "interleaved"),
            grouping = NULL)
K: an integer giving the number of folds into which the observations should be split (the default is five).

R: an integer giving the number of replications for repeated K-fold cross-validation.

type: a character string specifying the type of folds to be generated. Possible values are "random" (the default), "consecutive" or "interleaved".

grouping: a factor specifying groups of observations.
An object of class "foldControl"
with the following
components:
K
an integer giving the number of folds. A value of
K
equal to the number of observations or groups yields
eave-one-out cross-validation.
R
an integer giving the number of replications. This will be ignored for for leave-one-out cross-validation and other non-random splits of the data.
type
a character string specifying the type of folds.
grouping
if supplied, a factor specifying groups of observations. The data will then be split according to the groups rather than individual observations such that all observations within a group belong to the same fold.
Andreas Alfons
perrySplits, cvFolds, splitControl, bootControl
set.seed(1234)  # set seed for reproducibility
perrySplits(20, foldControl(K = 5))
perrySplits(20, foldControl(K = 5, R = 10))
Generic function to estimate the prediction error of a fitted model via (repeated) K-fold cross-validation, (repeated) random splitting (also known as random subsampling or Monte Carlo cross-validation), or the bootstrap.
perry(object, ...)
object: the fitted model for which to estimate the prediction error.

...: additional arguments to be passed down to methods.
The idea is that developers write easy-to-use methods for end users to leverage the prediction error estimation framework for their models. A typical perry method consists of the following two parts: first the data are extracted from the model, then function perryFit is called to perform prediction error estimation. The programming effort of implementing prediction error estimation for a certain model is thus greatly reduced.

Examples for methods are available in package perryExamples (see perry-methods).
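To illustrate those two parts, the following is a minimal sketch of such a method for a hypothetical model class "myModel"; the class and the components it is assumed to store are made up for illustration and not taken from the package.

## sketch of a perry() method for a hypothetical class "myModel",
## assumed to store the model frame in 'data' and the response in 'y'
perry.myModel <- function(object, splits = foldControl(), cost = rmspe,
                          ...) {
  # part 1: extract the data from the fitted model
  data <- object$data  # data frame used to fit the model
  y <- object$y        # response
  # part 2: call perryFit() to do the actual work, re-fitting the
  # model via its (unevaluated) function call in each replication
  perryFit(getCall(object), data = data, y = y, splits = splits,
           cost = cost, ...)
}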
Andreas Alfons
These functions are provided for compatibility with older versions only, and may become defunct as soon as the next release.
## S3 method for class 'perry'
fortify(model, data, select = NULL, reps = model$splits$R > 1,
        seFactor = NA, ...)

## S3 method for class 'perrySelect'
fortify(model, data, subset = NULL, select = NULL,
        reps = model$splits$R > 1, seFactor = model$seFactor, ...)

## S3 method for class 'perryTuning'
fortify(model, data, ...)

## Default S3 method:
perryPlot(object, method = c("box", "density", "dot", "line"),
          mapping, facets = attr(object, "facets"), ...)
model: an object inheriting from class "perry" or "perrySelect" that contains prediction error results.

data: currently ignored.

select: a character, integer or logical vector indicating the columns of prediction error results to be converted.

reps: a logical indicating whether to convert the results from all replications (TRUE) or the aggregated results (FALSE). The default is to convert the results from all replications if available.

seFactor: a numeric value giving the multiplication factor of the standard error for displaying error bars in dot plots or line plots. Error bars in those plots can be suppressed by setting this to NA.

...: for the "perryTuning" method of fortify, additional arguments to be passed down to the "perrySelect" method; for the default method of perryPlot, additional arguments to be passed down to the underlying ggplot2 geoms.

subset: a character, integer or logical vector indicating the subset of models to be converted.

object: a data frame containing all necessary information for plotting, as returned by one of the fortify methods.

method: a character string specifying the type of plot. Possible values are "box" (the default) for a box plot, "density" for a smooth density plot, "dot" for a dot plot, or "line" for a line plot.

mapping: an aesthetic mapping to override the default behavior (see aes).

facets: a faceting formula to override the default behavior. If supplied, facet_wrap or facet_grid is called depending on whether the formula is one-sided or two-sided.
The fortify methods extract all necessary information for plotting from resampling-based prediction error results and store it in a data frame. The default method of perryPlot creates the corresponding plot from the data frame returned by fortify.

The fortify methods return a data frame containing the columns listed below, as well as additional information stored in the attribute "facets" (default faceting formula for the plots).

Fit: a vector or factor containing the identifiers of the models.

Name: a factor containing the names of the prediction error results (not returned in case of only one column of prediction error results with the default name).

PE: the estimated prediction errors.

Lower: the lower end points of the error bars (only returned if reps is FALSE).

Upper: the upper end points of the error bars (only returned if reps is FALSE).

Duplicate indices in subset or select are removed such that all models and prediction error results are unique.
Andreas Alfons
Estimate the prediction error of a model via (repeated) K-fold cross-validation, (repeated) random splitting (also known as random subsampling or Monte Carlo cross-validation), or the bootstrap. It is thereby possible to supply an object returned by a model fitting function, a model fitting function itself, or an unevaluated function call to a model fitting function.
perryFit(object, ...)

## Default S3 method:
perryFit(object, data = NULL, x = NULL, y, splits = foldControl(),
         predictFun = predict, predictArgs = list(), cost = rmspe,
         costArgs = list(), names = NULL, envir = parent.frame(),
         ncores = 1, cl = NULL, seed = NULL, ...)

## S3 method for class 'function'
perryFit(object, formula, data = NULL, x = NULL, y, args = list(),
         splits = foldControl(), predictFun = predict,
         predictArgs = list(), cost = rmspe, costArgs = list(),
         names = NULL, envir = parent.frame(), ncores = 1, cl = NULL,
         seed = NULL, ...)

## S3 method for class 'call'
perryFit(object, data = NULL, x = NULL, y, splits = foldControl(),
         predictFun = predict, predictArgs = list(), cost = rmspe,
         costArgs = list(), names = NULL, envir = parent.frame(),
         ncores = 1, cl = NULL, seed = NULL, ...)
object: the fitted model for which to estimate the prediction error, a function for fitting a model, or an unevaluated function call for fitting a model (see “Details”).

...: additional arguments to be passed down.

data: a data frame containing the variables required for fitting the models. This is typically used if the model in the function call is described by a formula.

x: a numeric matrix containing the predictor variables. This is typically used if the function call for fitting the models requires the predictor matrix and the response to be supplied as separate arguments.

y: a numeric vector or matrix containing the response.

splits: an object of class "cvFolds" (as returned by cvFolds), "randomSplits" (as returned by randomSplits) or "bootSamples" (as returned by bootSamples), or a control object of class "foldControl", "splitControl" or "bootControl" specifying how to generate the data splits.

predictFun: a function to compute predictions for the test data. It should expect the fitted model to be passed as the first argument and the test data as the second argument, and must return either a vector or a matrix containing the predicted values. The default is to use the predict method of the fitted model.

predictArgs: a list of additional arguments to be passed to predictFun.

cost: a cost function measuring prediction loss. It should expect the observed values of the response to be passed as the first argument and the predicted values as the second argument, and must return either a non-negative scalar value, or a list with the first component containing the prediction error and the second component containing the standard error. The default is to use the root mean squared prediction error (see cost).

costArgs: a list of additional arguments to be passed to the prediction loss function cost.

names: an optional character vector giving names for the arguments containing the data to be used in the function call (see “Details”).

envir: the environment in which to evaluate the function call for fitting the models (see eval).

ncores: a positive integer giving the number of processor cores to be used for parallel computing (the default is 1 for no parallelization). If this is set to NA, all available processor cores are used.

cl: a parallel cluster for parallel computing as generated by makeCluster.

seed: optional initial seed for the random number generator (see .Random.seed).

formula: a formula describing the model.

args: a list of additional arguments to be passed to the model fitting function.
(Repeated) K-fold cross-validation is performed in the following way. The data are first split into K previously obtained blocks of approximately equal size (given by splits). Each of the K data blocks is left out once to fit the model, and predictions are computed for the observations in the left-out block with predictFun. Thus a prediction is obtained for each observation. The response and the obtained predictions for all observations are then passed to the prediction loss function cost to estimate the prediction error. For repeated K-fold cross-validation (as indicated by splits), this process is replicated and the estimated prediction errors from all replications are returned.
(Repeated) random splitting is performed similarly. In each replication,
the data are split into a training set and a test set at random. Then the
training data are used to fit the model, and predictions are computed for
the test data. Hence only the response values from the test data and the
corresponding predictions are passed to the prediction loss function
cost
.
For the bootstrap estimator, each bootstrap sample is used as training data
to fit the model. The out-of-bag estimator uses the observations that do
not enter the bootstrap sample as test data and computes the prediction loss
function cost
for those out-of-bag observations. The 0.632 estimator
is computed as a linear combination of the out-of-bag estimator and the
prediction loss of the fitted values of the model computed from the full
sample.
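For reference, the fixed weights of this linear combination go back to Efron (1983); a brief sketch with hypothetical variables:

## 0.632 estimator (Efron, 1983): 'apparent' denotes the prediction
## loss of the fitted values on the full sample and 'oob' the
## out-of-bag estimate (both are hypothetical variables)
err632 <- 0.368 * apparent + 0.632 * oob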
In any case, if the response is a vector but predictFun
returns a
matrix, the prediction error is computed for each column. A typical use
case for this behavior would be if predictFun
returns predictions
from an initial model fit and stepwise improvements thereof.
If formula or data are supplied, all variables required for fitting the models are added as one argument to the function call, which is the typical behavior of model fitting functions with a formula interface. In this case, the accepted values for names depend on the method. For the function method, a character vector of length two should be supplied, with the first element specifying the argument name for the formula and the second element specifying the argument name for the data (the default is to use c("formula", "data")). Note that names for both arguments should be supplied even if only one is actually used. For the other methods, which do not have a formula argument, a character string specifying the argument name for the data should be supplied (the default is to use "data").
If x is supplied, on the other hand, the predictor matrix and the response are added as separate arguments to the function call. In this case, names should be a character vector of length two, with the first element specifying the argument name for the predictor matrix and the second element specifying the argument name for the response (the default is to use c("x", "y")). It should be noted that the formula or data arguments take precedence over x.
An object of class "perry"
with the following components:
pe
a numeric vector containing the respective estimated prediction errors. In case of more than one replication, those are average values over all replications.
se
a numeric vector containing the respective estimated standard errors of the prediction loss.
reps
a numeric matrix in which each column contains the respective estimated prediction errors from all replications. This is only returned in case of more than one replication.
splits
an object giving the data splits used to estimate the prediction error.
y
the response.
yHat
a list containing the predicted values from all replications.
call
the matched function call.
Andreas Alfons
perrySelect, perryTuning, cvFolds, randomSplits, bootSamples, cost
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## via model fit # fit an MM regression model fit <- lmrob(Y ~ ., data=coleman) # perform cross-validation perryFit(fit, data = coleman, y = coleman$Y, splits = foldControl(K = 5, R = 10), cost = rtmspe, costArgs = list(trim = 0.1), seed = 1234) ## via model fitting function # perform cross-validation # note that the response is extracted from 'data' in # this example and does not have to be supplied perryFit(lmrob, formula = Y ~ ., data = coleman, splits = foldControl(K = 5, R = 10), cost = rtmspe, costArgs = list(trim = 0.1), seed = 1234) ## via function call # set up function call call <- call("lmrob", formula = Y ~ .) # perform cross-validation perryFit(call, data = coleman, y = coleman$Y, splits = foldControl(K = 5, R = 10), cost = rtmspe, costArgs = list(trim = 0.1), seed = 1234)
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## via model fit # fit an MM regression model fit <- lmrob(Y ~ ., data=coleman) # perform cross-validation perryFit(fit, data = coleman, y = coleman$Y, splits = foldControl(K = 5, R = 10), cost = rtmspe, costArgs = list(trim = 0.1), seed = 1234) ## via model fitting function # perform cross-validation # note that the response is extracted from 'data' in # this example and does not have to be supplied perryFit(lmrob, formula = Y ~ ., data = coleman, splits = foldControl(K = 5, R = 10), cost = rtmspe, costArgs = list(trim = 0.1), seed = 1234) ## via function call # set up function call call <- call("lmrob", formula = Y ~ .) # perform cross-validation perryFit(call, data = coleman, y = coleman$Y, splits = foldControl(K = 5, R = 10), cost = rtmspe, costArgs = list(trim = 0.1), seed = 1234)
Plot results of resampling-based prediction error measures.
perryPlot(object, ...)

## S3 method for class 'perry'
perryPlot(object, which = c("box", "density", "dot"), select = NULL,
          seFactor = NA, ...)

## S3 method for class 'perrySelect'
perryPlot(object, which = c("box", "density", "dot", "line"),
          subset = NULL, select = NULL, seFactor = object$seFactor, ...)

## S3 method for class 'setupPerryPlot'
perryPlot(object, mapping = object$mapping, facets = object$facets, ...)

## S3 method for class 'perry'
autoplot(object, ...)

## S3 method for class 'perrySelect'
autoplot(object, ...)

## S3 method for class 'perry'
plot(x, ...)

## S3 method for class 'perrySelect'
plot(x, ...)
object, x: an object inheriting from class "perry" or "perrySelect" that contains prediction error results, or an object of class "setupPerryPlot" as returned by setupPerryPlot.

...: additional arguments to be passed down, eventually to the underlying ggplot2 geoms (e.g., geom_boxplot, geom_density, geom_pointrange or geom_line).

which: a character string specifying the type of plot. Possible values are "box" (the default) for a box plot, "density" for a smooth density plot, "dot" for a dot plot, or "line" for a line plot (the latter is only available for the "perrySelect" method).

select: a character, integer or logical vector indicating the columns of prediction error results to be plotted.

seFactor: a numeric value giving the multiplication factor of the standard error for displaying error bars in dot plots or line plots. Error bars in those plots can be suppressed by setting this to NA.

subset: a character, integer or logical vector indicating the subset of models for which to plot the prediction error results.

mapping: an aesthetic mapping to override the default behavior (see aes).

facets: a faceting formula to override the default behavior. If supplied, facet_wrap or facet_grid is called depending on whether the formula is one-sided or two-sided.
For objects with multiple columns of prediction error results, conditional plots are produced.
An object of class "ggplot"
(see ggplot
).
Duplicate indices in subset or select are removed such that all models and prediction error results are unique.
Andreas Alfons
perryFit, perrySelect, perryTuning
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare LS, MM and LTS regression # perform cross-validation for an LS regression model fitLm <- lm(Y ~ ., data = coleman) cvLm <- perry(fitLm, splits = folds, cost = rtmspe, trim = 0.1) # perform cross-validation for an MM regression model fitLmrob <- lmrob(Y ~ ., data = coleman, maxit.scale = 500) cvLmrob <- perry(fitLmrob, splits = folds, cost = rtmspe, trim = 0.1) # perform cross-validation for an LTS regression model fitLts <- ltsReg(Y ~ ., data = coleman) cvLts <- perry(fitLts, splits = folds, cost = rtmspe, trim = 0.1) # combine results into one object cv <- perrySelect(LS = cvLm, MM = cvLmrob, LTS = cvLts, .selectBest = "min") cv # plot results for the MM regression model plot(cvLmrob, which = "box") plot(cvLmrob, which = "density") plot(cvLmrob, which = "dot", seFactor = 2) # plot combined results plot(cv, which = "box") plot(cv, which = "density") plot(cv, which = "dot", seFactor = 2)
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare LS, MM and LTS regression # perform cross-validation for an LS regression model fitLm <- lm(Y ~ ., data = coleman) cvLm <- perry(fitLm, splits = folds, cost = rtmspe, trim = 0.1) # perform cross-validation for an MM regression model fitLmrob <- lmrob(Y ~ ., data = coleman, maxit.scale = 500) cvLmrob <- perry(fitLmrob, splits = folds, cost = rtmspe, trim = 0.1) # perform cross-validation for an LTS regression model fitLts <- ltsReg(Y ~ ., data = coleman) cvLts <- perry(fitLts, splits = folds, cost = rtmspe, trim = 0.1) # combine results into one object cv <- perrySelect(LS = cvLm, MM = cvLmrob, LTS = cvLts, .selectBest = "min") cv # plot results for the MM regression model plot(cvLmrob, which = "box") plot(cvLmrob, which = "density") plot(cvLmrob, which = "dot", seFactor = 2) # plot combined results plot(cv, which = "box") plot(cv, which = "density") plot(cv, which = "dot", seFactor = 2)
Reshape resampling-based prediction error results into an object of class "perrySelect" with only one column of results.
perryReshape(x, selectBest = c("min", "hastie"), seFactor = 1,
             tuning = list(), ...)
x: an object inheriting from class "perry" or "perrySelect" that contains prediction error results.

selectBest: a character string specifying a criterion for selecting the best model. Possible values are "min" (the default) or "hastie". The former selects the model with the smallest prediction error. The latter selects the most parsimonious model whose prediction error is no larger than seFactor standard errors above the prediction error of the best overall model.

seFactor: a numeric value giving a multiplication factor of the standard error for the selection of the best model. This is ignored if selectBest is "min".

tuning: a list of tuning parameter values that correspond to the different prediction error results. The names of the list components should thereby correspond to the argument names of the tuning parameters. For each tuning parameter, a vector of values can be supplied. A data frame containing all possible combinations of tuning parameter values is then added to the reshaped prediction error results.

...: additional arguments to be passed down.
An object of class "perrySelect"
(subclass
"perryTuning"
if a list of tuning parameters is supplied) with the
following components:
pe
a data frame containing the estimated prediction errors for the models. In case of more than one resampling replication, those are average values over all replications.
se
a data frame containing the estimated standard errors of the prediction loss for the models.
reps
a data frame containing the estimated prediction errors for the models from all replications. This is only returned in case of more than one resampling replication.
splits
an object giving the data splits used to estimate the prediction error.
y
the response.
yHat
a list containing the predicted values for the models. Each list component is again a list containing the corresponding predicted values from all replications.
best
an integer giving the index of the model with the best prediction performance.
selectBest
a character string specifying the criterion used for selecting the best model.
seFactor
a numeric value giving the multiplication factor of the standard error used for the selection of the best model.
tuning
a data frame containing the grid of tuning parameter
values that correspond to the different prediction error results (only
subclass "perryTuning"
).
Andreas Alfons
Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition.
perryFit, perrySelect, perryTuning
library("perryExamples") data("coleman") # perform cross-validation for an LTS regression model fit <- ltsReg(Y ~ ., data = coleman) folds <- foldControl(K = 5, R = 10) cv <- perry(fit, splits = folds, fit = "both", cost = rtmspe, trim = 0.1, seed = 1234) # compare original and reshaped object cv perryReshape(cv)
library("perryExamples") data("coleman") # perform cross-validation for an LTS regression model fit <- ltsReg(Y ~ ., data = coleman) folds <- foldControl(K = 5, R = 10) cv <- perry(fit, splits = folds, fit = "both", cost = rtmspe, trim = 0.1, seed = 1234) # compare original and reshaped object cv perryReshape(cv)
Combine resampling-based prediction error results for various models into one object and select the model with the best prediction performance.
perrySelect(..., .list = list(...), .reshape = FALSE,
            .selectBest = c("min", "hastie"), .seFactor = 1)
...: objects inheriting from class "perry" or "perrySelect" that contain prediction error results.

.list: a list of objects inheriting from class "perry" or "perrySelect". If supplied, this is preferred over objects supplied via the ... argument.

.reshape: a logical indicating whether objects with more than one column of prediction error results should be reshaped to have only one column (see “Details”).

.selectBest: a character string specifying a criterion for selecting the best model. Possible values are "min" (the default) or "hastie". The former selects the model with the smallest prediction error. The latter selects the most parsimonious model whose prediction error is no larger than .seFactor standard errors above the prediction error of the best overall model.

.seFactor: a numeric value giving a multiplication factor of the standard error for the selection of the best model. This is ignored if .selectBest is "min".
Keep in mind that objects inheriting from class "perry" or "perrySelect" may contain multiple columns of prediction error results. This is the case if the response is univariate but the function to compute predictions (usually the predict method of the fitted model) returns a matrix.

The .reshape argument determines how to handle such objects. If .reshape is FALSE, all objects are required to have the same number of columns and the best model for each column is selected. A typical use case for this behavior would be if the investigated models contain prediction error results for a raw and a reweighted fit. It might then be of interest to researchers to compare the best model for the raw estimators with the best model for the reweighted estimators.

If .reshape is TRUE, objects with more than one column of results are first transformed with perryReshape to have only one column. Then the best overall model is selected (see the sketch below).

It should also be noted that the argument names of .list, .reshape, .selectBest and .seFactor start with a dot to avoid conflicts with the argument names used for the objects containing prediction error results.
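For instance, building on the accessors example earlier in this manual (where cv50 and cv75 each contain a column of results for the reweighted and the raw LTS fit), a brief sketch:

## with .reshape = FALSE (the default), the best model is selected
## separately for the reweighted and the raw fit
perrySelect("0.5" = cv50, "0.75" = cv75)
## with .reshape = TRUE, each object is first reshaped to one column
## of results and the best overall model is selected
perrySelect("0.5" = cv50, "0.75" = cv75, .reshape = TRUE)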
An object of class "perrySelect"
with the following
components:
pe
a data frame containing the estimated prediction errors for the models. In case of more than one resampling replication, those are average values over all replications.
se
a data frame containing the estimated standard errors of the prediction loss for the models.
reps
a data frame containing the estimated prediction errors for the models from all replications. This is only returned in case of more than one resampling replication.
splits
an object giving the data splits used to estimate the prediction error of the models.
y
the response.
yHat
a list containing the predicted values for the models. Each list component is again a list containing the corresponding predicted values from all replications.
best
an integer vector giving the indices of the models with the best prediction performance.
selectBest
a character string specifying the criterion used for selecting the best model.
seFactor
a numeric value giving the multiplication factor of the standard error used for the selection of the best model.
To ensure comparability, the prediction errors for all models are required to be computed from the same data splits.
Andreas Alfons
Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition.
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare LS, MM and LTS regression # perform cross-validation for an LS regression model fitLm <- lm(Y ~ ., data = coleman) cvLm <- perry(fitLm, splits = folds, cost = rtmspe, trim = 0.1) # perform cross-validation for an MM regression model fitLmrob <- lmrob(Y ~ ., data = coleman) cvLmrob <- perry(fitLmrob, splits = folds, cost = rtmspe, trim = 0.1) # perform cross-validation for an LTS regression model fitLts <- ltsReg(Y ~ ., data = coleman) cvLts <- perry(fitLts, splits = folds, cost = rtmspe, trim = 0.1) # compare cross-validation results perrySelect(LS = cvLm, MM = cvLmrob, LTS = cvLts)
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare LS, MM and LTS regression # perform cross-validation for an LS regression model fitLm <- lm(Y ~ ., data = coleman) cvLm <- perry(fitLm, splits = folds, cost = rtmspe, trim = 0.1) # perform cross-validation for an MM regression model fitLmrob <- lmrob(Y ~ ., data = coleman) cvLmrob <- perry(fitLmrob, splits = folds, cost = rtmspe, trim = 0.1) # perform cross-validation for an LTS regression model fitLts <- ltsReg(Y ~ ., data = coleman) cvLts <- perry(fitLts, splits = folds, cost = rtmspe, trim = 0.1) # compare cross-validation results perrySelect(LS = cvLm, MM = cvLmrob, LTS = cvLts)
Split observations or groups of observations into segments to be used for (repeated) K-fold cross-validation, (repeated) random splitting (also known as random subsampling or Monte Carlo cross-validation), or the bootstrap.
perrySplits(n, control)
n: an integer giving the number of observations to be split.

control: a control object of class "foldControl" (as generated by foldControl), "splitControl" (as generated by splitControl) or "bootControl" (as generated by bootControl).
For the foldControl method, an object of class "cvFolds" giving folds for (repeated) K-fold cross-validation (see cvFolds).

For the splitControl method, an object of class "randomSplits" giving random data splits (see randomSplits).

For the bootControl method, an object of class "bootSamples" giving bootstrap samples (see bootSamples).
Users may prefer the wrapper functions cvFolds, randomSplits and bootSamples.
Andreas Alfons
foldControl, splitControl, bootControl, cvFolds, randomSplits, bootSamples
set.seed(1234)  # set seed for reproducibility

## data folds for K-fold cross-validation
perrySplits(20, foldControl(K = 5))
perrySplits(20, foldControl(K = 5, R = 10))

## random data splits
perrySplits(20, splitControl(m = 5))
perrySplits(20, splitControl(m = 5, R = 10))

## bootstrap samples
perrySplits(20, bootControl())
perrySplits(20, bootControl(R = 10))
Select tuning parameters of a model by estimating the respective prediction errors via (repeated) K-fold cross-validation, (repeated) random splitting (also known as random subsampling or Monte Carlo cross-validation), or the bootstrap. It is thereby possible to supply a model fitting function or an unevaluated function call to a model fitting function.
perryTuning(object, ...)

## S3 method for class 'function'
perryTuning(object, formula, data = NULL, x = NULL, y, tuning = list(),
            args = list(), splits = foldControl(), predictFun = predict,
            predictArgs = list(), cost = rmspe, costArgs = list(),
            selectBest = c("min", "hastie"), seFactor = 1, final = FALSE,
            names = NULL, envir = parent.frame(), ncores = 1, cl = NULL,
            seed = NULL, ...)

## S3 method for class 'call'
perryTuning(object, data = NULL, x = NULL, y, tuning = list(),
            splits = foldControl(), predictFun = predict,
            predictArgs = list(), cost = rmspe, costArgs = list(),
            selectBest = c("min", "hastie"), seFactor = 1, final = FALSE,
            names = NULL, envir = parent.frame(), ncores = 1, cl = NULL,
            seed = NULL, ...)
object: a function or an unevaluated function call for fitting a model (see “Details”).

...: additional arguments to be passed down.

formula: a formula describing the model.

data: a data frame containing the variables required for fitting the models. This is typically used if the model in the function call is described by a formula.

x: a numeric matrix containing the predictor variables. This is typically used if the function call for fitting the models requires the predictor matrix and the response to be supplied as separate arguments.

y: a numeric vector or matrix containing the response.

tuning: a list of arguments giving the tuning parameter values to be evaluated. The names of the list components should thereby correspond to the argument names of the tuning parameters. For each tuning parameter, a vector of values can be supplied. The prediction error is then estimated for all possible combinations of tuning parameter values.

args: a list of additional arguments to be passed to the model fitting function.

splits: an object of class "cvFolds" (as returned by cvFolds), "randomSplits" (as returned by randomSplits) or "bootSamples" (as returned by bootSamples), or a control object of class "foldControl", "splitControl" or "bootControl" specifying how to generate the data splits.

predictFun: a function to compute predictions for the test data. It should expect the fitted model to be passed as the first argument and the test data as the second argument, and must return either a vector or a matrix containing the predicted values. The default is to use the predict method of the fitted model.

predictArgs: a list of additional arguments to be passed to predictFun.

cost: a cost function measuring prediction loss. It should expect the observed values of the response to be passed as the first argument and the predicted values as the second argument, and must return either a non-negative scalar value, or a list with the first component containing the prediction error and the second component containing the standard error. The default is to use the root mean squared prediction error (see cost).

costArgs: a list of additional arguments to be passed to the prediction loss function cost.

selectBest: a character string specifying a criterion for selecting the best model. Possible values are "min" (the default) or "hastie". The former selects the model with the smallest prediction error. The latter selects the most parsimonious model whose prediction error is no larger than seFactor standard errors above the prediction error of the best overall model.

seFactor: a numeric value giving a multiplication factor of the standard error for the selection of the best model. This is ignored if selectBest is "min".

final: a logical indicating whether to fit the final model with the optimal combination of tuning parameters.

names: an optional character vector giving names for the arguments containing the data to be used in the function call (see “Details”).

envir: the environment in which to evaluate the function call for fitting the models (see eval).

ncores: a positive integer giving the number of processor cores to be used for parallel computing (the default is 1 for no parallelization). If this is set to NA, all available processor cores are used.

cl: a parallel cluster for parallel computing as generated by makeCluster.

seed: optional initial seed for the random number generator (see .Random.seed).
(Repeated) K-fold cross-validation is performed in the following way. The data are first split into K previously obtained blocks of approximately equal size (given by splits). Each of the K data blocks is left out once to fit the model, and predictions are computed for the observations in the left-out block with predictFun. Thus a prediction is obtained for each observation. The response and the obtained predictions for all observations are then passed to the prediction loss function cost to estimate the prediction error. For repeated K-fold cross-validation (as indicated by splits), this process is replicated and the estimated prediction errors from all replications are returned.
(Repeated) random splitting is performed similarly. In each replication,
the data are split into a training set and a test set at random. Then the
training data are used to fit the model, and predictions are computed for
the test data. Hence only the response values from the test data and the
corresponding predictions are passed to the prediction loss function
cost
.
For the bootstrap estimator, each bootstrap sample is used as training data
to fit the model. The out-of-bag estimator uses the observations that do
not enter the bootstrap sample as test data and computes the prediction loss
function cost
for those out-of-bag observations. The 0.632 estimator
is computed as a linear combination of the out-of-bag estimator and the
prediction loss of the fitted values of the model computed from the full
sample.
In any case, if the response is a vector but predictFun
returns a
matrix, the prediction error is computed for each column. A typical use
case for this behavior would be if predictFun
returns predictions
from an initial model fit and stepwise improvements thereof.
If formula or data are supplied, all variables required for fitting the models are added as one argument to the function call, which is the typical behavior of model fitting functions with a formula interface. In this case, the accepted values for names depend on the method. For the function method, a character vector of length two should be supplied, with the first element specifying the argument name for the formula and the second element specifying the argument name for the data (the default is to use c("formula", "data")). Note that names for both arguments should be supplied even if only one is actually used. For the call method, which does not have a formula argument, a character string specifying the argument name for the data should be supplied (the default is to use "data").
If x is supplied, on the other hand, the predictor matrix and the response are added as separate arguments to the function call. In this case, names should be a character vector of length two, with the first element specifying the argument name for the predictor matrix and the second element specifying the argument name for the response (the default is to use c("x", "y")). It should be noted that the formula or data arguments take precedence over x.
If tuning is an empty list, perryFit is called to return an object of class "perry".

Otherwise an object of class "perryTuning" (which inherits from class "perrySelect") with the following components is returned:

pe: a data frame containing the estimated prediction errors for all combinations of tuning parameter values. In case of more than one replication, those are average values over all replications.

se: a data frame containing the estimated standard errors of the prediction loss for all combinations of tuning parameter values.

reps: a data frame containing the estimated prediction errors from all replications for all combinations of tuning parameter values. This is only returned in case of more than one replication.

splits: an object giving the data splits used to estimate the prediction error.

y: the response.

yHat: a list containing the predicted values for all combinations of tuning parameter values. Each list component is again a list containing the corresponding predicted values from all replications.

best: an integer vector giving the indices of the optimal combinations of tuning parameters.

selectBest: a character string specifying the criterion used for selecting the best model.

seFactor: a numeric value giving the multiplication factor of the standard error used for the selection of the best model.

tuning: a data frame containing the grid of tuning parameter values for which the prediction error was estimated.

finalModel: the final model fit with the optimal combination of tuning parameters. This is only returned if argument final is TRUE.

call: the matched function call.
The same data splits are used for all combinations of tuning parameter values for maximum comparability.
If a final model with the optimal combination of tuning parameters is computed, class "perryTuning" inherits the coef(), fitted(), predict() and residuals() methods from its component finalModel.
Andreas Alfons
Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition.
perryFit, perrySelect, cvFolds, randomSplits, bootSamples, cost
library("perryExamples") data("coleman") ## evaluate MM regression models tuned for 85% and 95% efficiency tuning <- list(tuning.psi = c(3.443689, 4.685061)) ## via model fitting function # perform cross-validation # note that the response is extracted from 'data' in # this example and does not have to be supplied perryTuning(lmrob, formula = Y ~ ., data = coleman, tuning = tuning, splits = foldControl(K = 5, R = 10), cost = rtmspe, costArgs = list(trim = 0.1), seed = 1234) ## via function call # set up function call call <- call("lmrob", formula = Y ~ .) # perform cross-validation perryTuning(call, data = coleman, y = coleman$Y, tuning = tuning, splits = foldControl(K = 5, R = 10), cost = rtmspe, costArgs = list(trim = 0.1), seed = 1234)
library("perryExamples") data("coleman") ## evaluate MM regression models tuned for 85% and 95% efficiency tuning <- list(tuning.psi = c(3.443689, 4.685061)) ## via model fitting function # perform cross-validation # note that the response is extracted from 'data' in # this example and does not have to be supplied perryTuning(lmrob, formula = Y ~ ., data = coleman, tuning = tuning, splits = foldControl(K = 5, R = 10), cost = rtmspe, costArgs = list(trim = 0.1), seed = 1234) ## via function call # set up function call call <- call("lmrob", formula = Y ~ .) # perform cross-validation perryTuning(call, data = coleman, y = coleman$Y, tuning = tuning, splits = foldControl(K = 5, R = 10), cost = rtmspe, costArgs = list(trim = 0.1), seed = 1234)
Split observations or groups of observations into training and test data to be used for (repeated) random splitting (also known as random subsampling or Monte Carlo cross-validation).
randomSplits(n, m, R = 1, grouping = NULL)
n | an integer giving the number of observations to be split into training and test data. This is ignored if grouping is supplied. |
m | an integer giving the number of observations or groups of observations to be used as test data. |
R | an integer giving the number of random data splits. |
grouping | a factor specifying groups of observations. If supplied, the data are split according to the groups rather than individual observations such that all observations within a group belong either to the training or test data. |
An object of class "randomSplits"
with the following
components:
n
an integer giving the number of observations or groups.
m
an integer giving the number of observations or groups in the test data.
R
an integer giving the number of random data splits.
subsets
an integer matrix in which each column contains the indices of the observations or groups in the test data of the corresponding random data split.
grouping
a list giving the indices of the observations belonging to each group. This is only returned if a grouping factor has been supplied.
This is a simple wrapper function for perrySplits with a control object generated by splitControl.
Andreas Alfons
perrySplits, splitControl, cvFolds, bootSamples
set.seed(1234)  # set seed for reproducibility
randomSplits(20, m = 5)
randomSplits(20, m = 5, R = 10)
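A grouped variant can be sketched as follows (not part of the package examples; the grouping factor below is made up for illustration):

set.seed(1234)
## 20 observations in 4 groups; whole groups end up in either the
## training or the test data, and m = 2 selects two groups as test data
group <- factor(rep(1:4, each = 5))
randomSplits(20, m = 2, R = 3, grouping = group)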
Recompute prediction error measures for previously obtained objects that contain resampling-based prediction error results. This is useful for computing a different measure of prediction loss.
reperry(object, ...)

## S3 method for class 'perry'
reperry(object, cost = rmspe, ...)

## S3 method for class 'perrySelect'
reperry(object, cost = rmspe, ...)
object | an object inheriting from class "perry" or "perrySelect" that contains prediction error results. |
... | for the generic function, additional arguments to be passed down to methods. For the methods, additional arguments to be passed to the prediction loss function cost. |
cost | a cost function measuring prediction loss. It should expect the observed values of the response to be passed as the first argument and the predicted values as the second argument, and must return either a non-negative scalar value, or a list with the first component containing the prediction error and the second component containing the standard error. The default is to use the root mean squared prediction error (see cost). |
An object similar to object containing the results for the new measure of prediction loss.
Andreas Alfons
perryFit, perryTuning, perrySelect
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare raw and reweighted LTS estimators for 50% and 75% ## subsets based on their RTMSPE with 25% trimming # 50% subsets fit50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5) cv50 <- perry(fit50, splits = folds, fit = "both", cost = rtmspe, trim = 0.25) # 75% subsets fit75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75) cv75 <- perry(fit75, splits = folds, fit = "both", cost = rtmspe, trim = 0.25) # combine results into one object cv <- perrySelect("0.5" = cv50, "0.75" = cv75) cv ## recompute the RTMSPE with 10% trimming reperry(cv50, cost = rtmspe, trim = 0.1) reperry(cv, cost = rtmspe, trim = 0.1)
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare raw and reweighted LTS estimators for 50% and 75% ## subsets based on their RTMSPE with 25% trimming # 50% subsets fit50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5) cv50 <- perry(fit50, splits = folds, fit = "both", cost = rtmspe, trim = 0.25) # 75% subsets fit75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75) cv75 <- perry(fit75, splits = folds, fit = "both", cost = rtmspe, trim = 0.25) # combine results into one object cv <- perrySelect("0.5" = cv50, "0.75" = cv75) cv ## recompute the RTMSPE with 10% trimming reperry(cv50, cost = rtmspe, trim = 0.1) reperry(cv, cost = rtmspe, trim = 0.1)
Extract and prepare the relevant information for a plot of resampling-based prediction error results.
setupPerryPlot(object, ...)

## S3 method for class 'perry'
setupPerryPlot(
  object,
  which = c("box", "density", "dot"),
  select = NULL,
  seFactor = NA,
  ...
)

## S3 method for class 'perrySelect'
setupPerryPlot(
  object,
  which = c("box", "density", "dot", "line"),
  subset = NULL,
  select = NULL,
  seFactor = object$seFactor,
  ...
)

## S3 method for class 'perryTuning'
setupPerryPlot(object, ...)
object | an object inheriting from class "perry" or "perrySelect" that contains prediction error results. |
... | for the "perryTuning" method, additional arguments to be passed down to the "perrySelect" method; for the other methods, additional arguments are currently ignored. |
which | a character string specifying the type of plot to prepare. Possible values are "box" (the default) for a box plot, "density" for a smooth density plot, "dot" for a dot plot, or "line" for a line plot (the latter is only available for the "perrySelect" method). |
select | a character, integer or logical vector indicating the columns of prediction error results to be prepared for plotting. |
seFactor | a numeric value giving the multiplication factor of the standard error for displaying error bars in dot plots or line plots. Error bars in those plots can be suppressed by setting this to NA. |
subset | a character, integer or logical vector indicating the subset of models to be prepared for plotting. |
An object of class "setupPerryPlot"
with the following
components:
data
a data frame containing the following columns:
Fit
a vector or factor containing the identifiers of the models.
Name
a factor containing the names of the predictor error results (not returned in case of only one column of prediction error results with the default name).
PE
the estimated prediction errors.
Lower
the lower end points of the error bars (only returned if possible to compute).
Upper
the upper end points of the error bars (only returned if possible to compute).
which
a character string specifying the type of plot.
grouped
a logical indicating whether density plots should
be grouped due to multiple model fits (only returned in case of density
plots for the "perrySelect"
and "perryTuning"
methods).
includeSE
a logical indicating whether error bars based on standard errors are available (only returned in case of dot plots or line plots).
mapping
default aesthetic mapping for the plots.
facets
default faceting formula for the plots (not returned in case of only one column of prediction error results with the default name).
tuning
a data frame containing the grid of tuning parameter
values for which the prediction error was estimated (only returned for the
"perryTuning"
method).
Duplicate indices in subset or select are removed such that all models and prediction error results are unique.
Andreas Alfons
perryFit, perrySelect, perryTuning, perryPlot
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare LS, MM and LTS regression # perform cross-validation for an LS regression model fitLm <- lm(Y ~ ., data = coleman) cvLm <- perry(fitLm, splits = folds, cost = rtmspe, trim = 0.1) # perform cross-validation for an MM regression model fitLmrob <- lmrob(Y ~ ., data = coleman, maxit.scale = 500) cvLmrob <- perry(fitLmrob, splits = folds, cost = rtmspe, trim = 0.1) # perform cross-validation for an LTS regression model fitLts <- ltsReg(Y ~ ., data = coleman) cvLts <- perry(fitLts, splits = folds, cost = rtmspe, trim = 0.1) # combine results into one object cv <- perrySelect(LS = cvLm, MM = cvLmrob, LTS = cvLts, .selectBest = "min") cv ## convert MM regression results to data frame for plotting # all replications for box plot cvLmrobBox <- setupPerryPlot(cvLmrob, which = "box") perryPlot(cvLmrobBox) # aggregated results for dot plot cvLmrobDot <- setupPerryPlot(cvLmrob, which = "dot", seFactor = 2) perryPlot(cvLmrobDot) ## convert combined results to data frame for plotting # all replications for box plot cvBox <- setupPerryPlot(cv, which = "box") perryPlot(cvBox) # aggregated results for dot plot cvDot <- setupPerryPlot(cv, which = "dot", seFactor = 2) perryPlot(cvDot)
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare LS, MM and LTS regression # perform cross-validation for an LS regression model fitLm <- lm(Y ~ ., data = coleman) cvLm <- perry(fitLm, splits = folds, cost = rtmspe, trim = 0.1) # perform cross-validation for an MM regression model fitLmrob <- lmrob(Y ~ ., data = coleman, maxit.scale = 500) cvLmrob <- perry(fitLmrob, splits = folds, cost = rtmspe, trim = 0.1) # perform cross-validation for an LTS regression model fitLts <- ltsReg(Y ~ ., data = coleman) cvLts <- perry(fitLts, splits = folds, cost = rtmspe, trim = 0.1) # combine results into one object cv <- perrySelect(LS = cvLm, MM = cvLmrob, LTS = cvLts, .selectBest = "min") cv ## convert MM regression results to data frame for plotting # all replications for box plot cvLmrobBox <- setupPerryPlot(cvLmrob, which = "box") perryPlot(cvLmrobBox) # aggregated results for dot plot cvLmrobDot <- setupPerryPlot(cvLmrob, which = "dot", seFactor = 2) perryPlot(cvLmrobDot) ## convert combined results to data frame for plotting # all replications for box plot cvBox <- setupPerryPlot(cv, which = "box") perryPlot(cvBox) # aggregated results for dot plot cvDot <- setupPerryPlot(cv, which = "dot", seFactor = 2) perryPlot(cvDot)
Generate an object that controls how to split observations or
groups of observations into training and test data to be used for (repeated)
random splitting (also known as random subsampling or Monte Carlo
cross-validation).
splitControl(m, R = 1, grouping = NULL)
m | an integer giving the number of observations or groups of observations to be used as test data. |
R | an integer giving the number of random data splits. |
grouping | a factor specifying groups of observations. |
An object of class "splitControl"
with the following
components:
m
an integer giving the number of observations or groups of observations to be used as test data.
R
an integer giving the number of random data splits.
grouping
if supplied, a factor specifying groups of observations. The data will then be split according to the groups rather than individual observations such that all observations within a group belong either to the training or test data.
Andreas Alfons
perrySplits, randomSplits, foldControl, bootControl
set.seed(1234)  # set seed for reproducibility
perrySplits(20, splitControl(m = 5))
perrySplits(20, splitControl(m = 5, R = 10))
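As with randomSplits, a grouping factor can be supplied via the control object. This short sketch is not part of the package examples; the grouping factor is made up for illustration:

set.seed(1234)
## 20 observations in 4 groups; m = 2 selects two whole groups as test data
group <- factor(rep(1:4, each = 5))
perrySplits(20, splitControl(m = 2, R = 3, grouping = group))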
Extract subsets of resampling-based prediction error results.
## S3 method for class 'perry'
subset(x, select = NULL, ...)

## S3 method for class 'perrySelect'
subset(x, subset = NULL, select = NULL, ...)
x | an object inheriting from class "perry" or "perrySelect" that contains prediction error results. |
select | a character, integer or logical vector indicating the prediction error results to be extracted. |
... | currently ignored. |
subset | a character, integer or logical vector indicating the subset of models for which to keep the prediction error results. |
An object similar to x containing just the selected results.
Duplicate indices in subset or select are removed such that all models and prediction error results are unique.
Andreas Alfons
perryFit, perrySelect, perryTuning, subset
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare raw and reweighted LTS estimators for ## 50% and 75% subsets # 50% subsets fit50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5) cv50 <- perry(fit50, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # 75% subsets fit75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75) cv75 <- perry(fit75, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # combine results into one object cv <- perrySelect("0.5" = cv50, "0.75" = cv75) cv # extract reweighted LTS results with 50% subsets subset(cv50, select = "reweighted") subset(cv, subset = c(TRUE, FALSE), select = "reweighted")
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare raw and reweighted LTS estimators for ## 50% and 75% subsets # 50% subsets fit50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5) cv50 <- perry(fit50, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # 75% subsets fit75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75) cv75 <- perry(fit75, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # combine results into one object cv <- perrySelect("0.5" = cv50, "0.75" = cv75) cv # extract reweighted LTS results with 50% subsets subset(cv50, select = "reweighted") subset(cv, subset = c(TRUE, FALSE), select = "reweighted")
Produce a summary of resampling-based prediction error results.
## S3 method for class 'perry'
summary(object, ...)

## S3 method for class 'perrySelect'
summary(object, ...)

## S3 method for class 'perryTuning'
summary(object, ...)
object | an object inheriting from class "perry" or "perrySelect" that contains prediction error results. |
... | currently ignored. |
An object of class "summary.perry"
, "summary.perrySelect"
or
"summary.perryTuning"
, depending on the class of object
.
Andreas Alfons
perryFit, perrySelect, perryTuning, summary
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare raw and reweighted LTS estimators for ## 50% and 75% subsets # 50% subsets fit50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5) cv50 <- perry(fit50, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # 75% subsets fit75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75) cv75 <- perry(fit75, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # combine results into one object cv <- perrySelect("0.5" = cv50, "0.75" = cv75) cv # summary of the results with the 50% subsets summary(cv50) # summary of the combined results summary(cv)
library("perryExamples") data("coleman") set.seed(1234) # set seed for reproducibility ## set up folds for cross-validation folds <- cvFolds(nrow(coleman), K = 5, R = 10) ## compare raw and reweighted LTS estimators for ## 50% and 75% subsets # 50% subsets fit50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5) cv50 <- perry(fit50, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # 75% subsets fit75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75) cv75 <- perry(fit75, splits = folds, fit = "both", cost = rtmspe, trim = 0.1) # combine results into one object cv <- perrySelect("0.5" = cv50, "0.75" = cv75) cv # summary of the results with the 50% subsets summary(cv50) # summary of the combined results summary(cv)