Execute an Ensemble Random Forests (ERF) model on a given dataset
ens_random_forests(
df,
var,
covariates,
header = NULL,
out.folder = NULL,
duplicate = TRUE,
n.forests = 10L,
importance = TRUE,
cores = parallel::detectCores() - 2,
save = TRUE,
ntree = 1000,
mtry = 5,
var.q = c(0.1, 0.5, 0.9),
mode = "bin",
weights = NULL
)
df: A data.frame object
var: A character string indicating the column name of the data frame that contains the number of interactions for the ERF to model; the column should be numeric
covariates: A character vector indicating the column name(s) of the data frame that contain the covariates
header: A character vector indicating the column name(s) of the data frame that contain any additional columns you wish to have appended to the output
out.folder: A path to the folder to write output to. If NULL, a folder is generated in the working directory
duplicate: A logical flag that indicates whether to duplicate observations with more than one interaction. Default is TRUE, which duplicates all records that interacted with more than one individual (e.g. a fishing set that caught two individuals of the same species)
n.forests: An integer value indicating how many Random Forests to generate in the ensemble, default is 10
importance: A logical flag for the randomForest model to calculate the variable importance
cores: An integer value that indicates either the number of cores to use for parallel processing or, if negative, the number of cores to leave free. Default is to leave two cores free.
save: A logical flag to save the output as an RData object, default is TRUE.
ntree: The number of decision trees to use in each RF, default is 1000
mtry: The number of covariates to try at each node split, default is 5
var.q: The quantiles for the distribution of the variable importance; only executed if importance=TRUE
mode: The response type: 'bin' for a binary response; any other value is assumed to indicate a multivariate factor. Default is 'bin'.
weights: A vector of weights equal in length to nrow(df), NULL by default
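The duplicate and cores arguments are easiest to follow with a small example. The base R sketch below is only an illustration of the behaviour described above, not the package's actual code: expand_interactions() and resolve_cores() are hypothetical helpers, the PRES column name is borrowed from the example output, and the rule "repeat a record once per interaction" is an assumption about what duplicate = TRUE does.

```r
# Hypothetical helper (assumption, not the package's code): repeat each record
# once per interaction, so a record with k > 1 interactions appears k times.
expand_interactions <- function(df, var) {
  times <- pmax(df[[var]], 1)   # absences (0) and single interactions stay as one row
  out <- df[rep(seq_len(nrow(df)), times = times), , drop = FALSE]
  out$PRES <- as.integer(out[[var]] > 0)   # binary presence/absence
  rownames(out) <- NULL
  out
}

# Hypothetical helper (assumption): a positive value is the number of cores to
# use, a negative value is the number of cores to leave free.
resolve_cores <- function(cores, available = parallel::detectCores()) {
  if (is.na(available)) available <- 1L   # detectCores() can return NA
  n <- if (cores < 0) available + cores else cores
  as.integer(max(1L, min(n, available)))  # keep the result within 1..available
}

# Made-up toy data: one absence, one single interaction, one triple interaction
toy <- data.frame(obs = c(0, 1, 3), cov1 = c(0.2, -0.1, 0.5))
expanded <- expand_interactions(toy, "obs")
nrow(expanded)                     # 5 rows: 1 + 1 + 3
resolve_cores(-2, available = 8)   # 6 cores, leaving two free
```

Under this reading, the default cores = parallel::detectCores() - 2 and cores = -2 would request the same number of workers.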
A list containing the fitted ERF model and associated output.
data: the exact dataset used to fit each Random Forests within the ensemble. At a minimum, it will contain var, covariates, and header if provided. If duplicate==TRUE, then it will also contain the duplicated presence records.
model: the returned fitted individual Random Forests in the ensemble
ens.pred: the ensemble model predictions. This is generally the prediction set to use.
ens.perf: the ensemble model performance metrics including ROC curve.
mu.tr.perf: the mean training set performance across all Random Forests for the AUC, TSS, and RMSE metrics. These are generally of limited use, because performance measured on the training data tends to be optimistic.
mu.te.perf: the mean test set performance across all Random Forests for the AUC, TSS, and RMSE metrics. These can be informative.
roc_train: the training performance metrics for each Random Forests in the ensemble
roc_test: the test performance metrics for each Random Forests in the ensemble
pred: a list with two objects:
p: predictions from each Random Forests in the ensemble to the dataset
resid: residuals between the observed presence/absences and the predictions from each Random Forests in the ensemble to the dataset
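For readers unfamiliar with the three metrics named above, the following base R sketch shows one common way to compute RMSE, AUC, and TSS from predicted probabilities and observed presence/absence. It is not the package's implementation: the toy values are made up, auc_rank() and tss_at() are hypothetical helpers, and taking the maximum TSS over the observed probabilities as candidate thresholds is an assumption. The residual convention (observed minus predicted) matches the resid column in the example output.

```r
# Made-up toy predictions shaped like ens.pred: P.1 is the predicted presence
# probability and PRES is the observed presence/absence.
pred <- data.frame(P.1  = c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1),
                   PRES = c(1,   1,   1,    0,   0,   0))

# Residuals (observed minus predicted) and root mean squared error
resid <- pred$PRES - pred$P.1
rmse <- sqrt(mean(resid^2))

# AUC through the rank-sum (Mann-Whitney) identity
auc_rank <- function(p, y) {
  r <- rank(p)                  # average ranks are used for ties
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# TSS = sensitivity + specificity - 1 at a single threshold
tss_at <- function(p, y, threshold) {
  sens <- mean(p[y == 1] >= threshold)
  spec <- mean(p[y == 0] < threshold)
  sens + spec - 1
}

auc <- auc_rank(pred$P.1, pred$PRES)
# Assumption: report the best TSS over the observed probabilities as thresholds
tss <- max(vapply(pred$P.1,
                  function(t) tss_at(pred$P.1, pred$PRES, t),
                  numeric(1)))
```

On this toy data, 8 of the 9 presence/absence pairs are ranked correctly, so the AUC is 8/9, and the best TSS is 2/3.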
# run an ERF with 10 RFs
ens_rf_ex <- ens_random_forests(df = simData$samples, var = "obs",
                                covariates = grep("cov", colnames(simData$samples), value = TRUE),
                                save = FALSE, cores = 1)
#> rounding n.forests to the nearest one
# view the dataset used in the model
head(ens_rf_ex$data)
#>   obs        cov1        cov2       cov3        cov4        cov5      random
#> 1   0 -0.07158000 -0.27811766  0.5741324  0.01366734 -0.15537696 -0.05694179
#> 2   0 -0.12028137  0.26103341 -0.2298234  0.14750795  0.32482174 -1.42940088
#> 3   0  0.12422049  0.09052045 -0.3070963  0.14493653  0.02303010 -0.73738517
#> 4   0 -0.12481261 -0.29583389  0.6282519  0.06067430 -0.25240678  0.74319261
#> 5   0  0.01361460 -0.04989738  0.1600218  0.42189437  0.13166581  0.38999584
#> 6   0  0.01842766  0.18300638 -0.3421439 -0.33128857  0.07998399 -0.46390889
#view the model predictions
head(ens_rf_ex$ens.pred)
#>      P.0    P.1 PRES   resid
#> 1 0.7255 0.2745    0 -0.2745
#> 2 0.3777 0.6223    0 -0.6223
#> 3 0.5666 0.4334    0 -0.4334
#> 4 0.7278 0.2722    0 -0.2722
#> 5 0.3636 0.6364    0 -0.6364
#> 6 0.9906 0.0094    0 -0.0094
#view the mean test threshold-free performance metrics
ens_rf_ex$mu.te.perf
#>     teAUC    teRMSE     teTSS
#> 0.7846983 0.3885199 0.5415211
#view the threshold-free ensemble performance metrics
unlist(ens_rf_ex$ens.perf[c('auc','rmse','tss')])
#>       auc      rmse       tss
#> 0.9757934 0.3777554 0.8292362