Ensemble Random Forests — ens_random_forests • EnsembleRandomForests

Fit an Ensemble Random Forests (ERF) model to a given dataset

ens_random_forests(
  df,
  var,
  covariates,
  header = NULL,
  out.folder = NULL,
  duplicate = TRUE,
  n.forests = 10L,
  importance = TRUE,
  cores = parallel::detectCores() - 2,
  save = TRUE,
  ntree = 1000,
  mtry = 5,
  var.q = c(0.1, 0.5, 0.9),
  mode = "bin",
  weights = NULL
)

Arguments

df: A data.frame object
var: A character string giving the name of the column in df that contains the number of interactions for the ERF to model; the column should be numeric
covariates: A character vector indicating the column name(s) of the data frame that contain the covariates
header: A character vector indicating the column name(s) of the data frame that contain the additional columns you wish appended to the output
out.folder: A path to the folder to write output to. If NULL, a folder is generated in the working directory
duplicate: A logical flag that indicates whether to duplicate observations with more than one interaction. Default is TRUE, which duplicates all records that interacted with more than one individual (e.g. a fishing set that caught two of the same species)
n.forests: An integer value indicating how many Random Forests to generate in the ensemble, default is 10
importance: A logical flag for the randomForest model to calculate the variable importance
cores: An integer value that either indicates the number of cores to use for parallel processing or, if negative, the number of cores to leave free. Default is to leave two cores free.
save: A logical flag to save the output as an RData object, default is TRUE.
ntree: The number of decision trees to use in each RF, default is 1000
mtry: The number of covariates to try at each node split, default is 5
var.q: The quantiles for the distribution of the variable importance; only executed if importance=TRUE
mode: Either 'bin' for a binary response or any other value, which is then assumed to indicate a multivariate factor response; default is 'bin'
weights: A vector of weights equal in length to nrow(df), default is NULL
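
The effect of duplicate=TRUE can be pictured with a small base R sketch. It is an illustration under stated assumptions, not the package's internal code: the toy data frame, the rule that a record with k > 1 interactions appears k times, and the PRES recoding are all hypothetical.

```r
# Hypothetical toy data: 'obs' holds the number of interactions per record
toy <- data.frame(obs = c(0, 1, 3), cov1 = c(0.2, -0.1, 0.5))

# Assumed rule: a record with k > 1 interactions is repeated k times;
# records with 0 or 1 interactions appear once
times <- pmax(toy$obs, 1)
expanded <- toy[rep(seq_len(nrow(toy)), times = times), ]

# Recode the count as presence/absence, as a binary ('bin') model would need
expanded$PRES <- factor(as.integer(expanded$obs > 0), levels = c(0, 1))

nrow(expanded)        # number of rows after duplication
table(expanded$PRES)  # absences vs. presences
```

Whether the package repeats a record once per interaction or by some other rule is not stated here, so inspect the data element of the returned list to see the records actually used.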

Value

A list containing the fitted ERF model and associated output.

data: the exact dataset used to fit each Random Forests within the ensemble. At a minimum, it will contain var, covariates, and header if provided. If duplicate==TRUE, then it will also contain the duplicated presence records.
model: the returned fitted individual Random Forests in the ensemble
ens.pred: the ensemble model predictions. This is generally the prediction set to use.
ens.perf: the ensemble model performance metrics including ROC curve.
mu.tr.perf: the mean training set performance across all Random Forests for the AUC, TSS, and RMSE metrics. These are generally uninformative.
mu.te.perf: the mean test set performance across all Random Forests for the AUC, TSS, and RMSE metrics. These can be informative.
roc_train: the training performance metrics for each Random Forests in the ensemble
roc_test: the test performance metrics for each Random Forests in the ensemble
pred: a list with two objects:
- p: predictions from each Random Forests in the ensemble to the dataset
- resid: residuals between the observed presence/absences and the predictions from each Random Forests in the ensemble to the dataset
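
For readers unfamiliar with the AUC, TSS, and RMSE metrics named above, the following base R sketch computes each from a vector of observed presences and predicted presence probabilities. It uses the textbook definitions on hypothetical toy values; the fixed 0.5 threshold for TSS is an assumption, and the package may compute its metrics differently.

```r
# Hypothetical observed presence/absence and predicted presence probabilities
pres <- c(0, 0, 1, 1, 0, 1)
p1   <- c(0.10, 0.40, 0.80, 0.35, 0.20, 0.90)

# RMSE: root mean squared difference between observations and predictions
rmse <- sqrt(mean((pres - p1)^2))

# AUC via the rank-sum (Mann-Whitney) formulation
n1  <- sum(pres == 1)
n0  <- sum(pres == 0)
auc <- (sum(rank(p1)[pres == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# TSS = sensitivity + specificity - 1, here at an assumed threshold of 0.5
pred <- as.integer(p1 >= 0.5)
sens <- sum(pred == 1 & pres == 1) / n1
spec <- sum(pred == 0 & pres == 0) / n0
tss  <- sens + spec - 1

c(auc = auc, rmse = rmse, tss = tss)
```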

Examples

# run an ERF with 10 RFs (the default n.forests) on a single core, without saving the output
ens_rf_ex <- ens_random_forests(df = simData$samples, var = "obs",
                                covariates = grep("cov", colnames(simData$samples), value = TRUE),
                                save = FALSE, cores = 1)
#> rounding n.forests to the nearest one

# view the dataset used in the model
head(ens_rf_ex$data) 
#>   obs        cov1        cov2       cov3        cov4        cov5      random
#> 1   0 -0.07158000 -0.27811766  0.5741324  0.01366734 -0.15537696 -0.05694179
#> 2   0 -0.12028137  0.26103341 -0.2298234  0.14750795  0.32482174 -1.42940088
#> 3   0  0.12422049  0.09052045 -0.3070963  0.14493653  0.02303010 -0.73738517
#> 4   0 -0.12481261 -0.29583389  0.6282519  0.06067430 -0.25240678  0.74319261
#> 5   0  0.01361460 -0.04989738  0.1600218  0.42189437  0.13166581  0.38999584
#> 6   0  0.01842766  0.18300638 -0.3421439 -0.33128857  0.07998399 -0.46390889

#view the model predictions
head(ens_rf_ex$ens.pred) 
#>      P.0    P.1 PRES   resid
#> 1 0.7255 0.2745    0 -0.2745
#> 2 0.3777 0.6223    0 -0.6223
#> 3 0.5666 0.4334    0 -0.4334
#> 4 0.7278 0.2722    0 -0.2722
#> 5 0.3636 0.6364    0 -0.6364
#> 6 0.9906 0.0094    0 -0.0094

#view the mean test threshold-free performance metrics
ens_rf_ex$mu.te.perf 
#>     teAUC    teRMSE     teTSS 
#> 0.7846983 0.3885199 0.5415211 

#view the threshold-free ensemble performance metrics
unlist(ens_rf_ex$ens.perf[c('auc','rmse','tss')]) 
#>       auc      rmse       tss 
#> 0.9757934 0.3777554 0.8292362
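
The columns of the ens.pred output printed above appear to be related in a simple way: P.0 and P.1 sum to one, and resid looks like the observed presence minus the predicted presence probability. The sketch below checks both relationships on the six printed rows only. The values are copied by hand from that output, and the interpretation of resid is an inference from those rows rather than a documented guarantee.

```r
# First six rows of ens_rf_ex$ens.pred, copied from the printed output
ens_pred <- data.frame(
  P.0   = c(0.7255, 0.3777, 0.5666, 0.7278, 0.3636, 0.9906),
  P.1   = c(0.2745, 0.6223, 0.4334, 0.2722, 0.6364, 0.0094),
  PRES  = c(0, 0, 0, 0, 0, 0),
  resid = c(-0.2745, -0.6223, -0.4334, -0.2722, -0.6364, -0.0094)
)

# Do the two class probabilities sum to one?
probs_sum_to_one <- isTRUE(all.equal(ens_pred$P.0 + ens_pred$P.1, rep(1, 6)))

# Is resid equal to observed presence minus predicted presence probability?
resid_matches <- isTRUE(all.equal(ens_pred$PRES - ens_pred$P.1, ens_pred$resid))

c(probs_sum_to_one = probs_sum_to_one, resid_matches = resid_matches)
```

The same checks can be run on the full ens_rf_ex$ens.pred data frame to see whether the pattern holds beyond the rows shown.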