Fit an Ensemble Random Forests (ERF) model to a given dataset

ens_random_forests(
  df,
  var,
  covariates,
  header = NULL,
  out.folder = NULL,
  duplicate = TRUE,
  n.forests = 10L,
  importance = TRUE,
  cores = parallel::detectCores() - 2,
  save = TRUE,
  ntree = 1000,
  mtry = 5,
  var.q = c(0.1, 0.5, 0.9),
  mode = "bin",
  weights = NULL
)

Arguments

df

A data.frame object

var

A character string giving the name of the column in df that contains the number of interactions for the ERF to model; the column should be numeric

covariates

A character vector indicating the column name(s) of the data frame that contain the covariates

header

A character vector giving the name(s) of any additional columns in the data frame to append to the output

out.folder

A path to the folder to write output to. If NULL, a folder is generated in the working directory

duplicate

A logical flag that indicates whether to duplicate observations with more than one interaction. Default is TRUE, which duplicates all records that interacted with more than one individual (e.g. a fishing set that caught two of the same species)
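
As a rough illustration of what duplication could look like, the base-R sketch below repeats each row once per interaction and then collapses the counts to presence/absence. This is an assumption about the behavior, not the package's actual implementation; the data frame `sets` and its columns are hypothetical.

```r
# Hypothetical data: 'obs' is the number of interactions per fishing set
sets <- data.frame(obs = c(0, 1, 3), cov1 = c(0.2, -0.5, 1.1))

# Repeat each row once per interaction; rows with zero interactions are kept once
times <- pmax(sets$obs, 1)
dup <- sets[rep(seq_len(nrow(sets)), times = times), ]

# Collapse counts to presence/absence (assumed here for a binary model)
dup$obs <- as.integer(dup$obs > 0)
rownames(dup) <- NULL
```

In this sketch the set with three interactions contributes three presence rows, so it carries more weight in the fit than a set with a single interaction.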

n.forests

An integer value indicating how many Random Forests to generate in the ensemble, default is 10

importance

A logical flag for the randomForest model to calculate the variable importance

cores

An integer value: a positive value gives the number of cores to use for parallel processing, and a negative value gives the number of cores to leave free. Default is to leave two cores free.
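
One way to read the positive/negative convention is sketched below. The helper `resolve_cores` and its `available` argument are hypothetical and are not part of the package; the clamping to at least one worker is likewise an assumption.

```r
# Hypothetical helper (not part of the package): turn 'cores' into a worker count
resolve_cores <- function(cores, available = parallel::detectCores()) {
  # detectCores() can return NA on some platforms; fall back to one core
  if (is.na(available)) available <- 1L
  # A negative value means "leave this many cores free"
  n <- if (cores < 0) available + cores else cores
  # Clamp to at least one worker and at most the available cores
  as.integer(max(1L, min(n, available)))
}

resolve_cores(4, available = 8)   # use four cores
resolve_cores(-2, available = 8)  # leave two cores free
resolve_cores(-2, available = 2)  # never drops below one worker
```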

save

A logical flag to save the output as an RData object, default is TRUE.

ntree

The number of decision trees to use in each RF, default is 1000

mtry

The number of covariates to try at each node split, default is 5

var.q

The quantiles for the distribution of the variable importance; only executed if importance=TRUE
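
To make the role of var.q concrete, the sketch below summarizes a matrix of importance values across forests at the requested quantiles. The matrix `imp` is simulated and its layout (covariates in rows, forests in columns) is an assumption; the package may store and summarize importance differently.

```r
# Hypothetical importance values: rows are covariates, columns are forests
set.seed(1)
imp <- matrix(runif(15), nrow = 3,
              dimnames = list(c("cov1", "cov2", "cov3"), paste0("rf", 1:5)))

var.q <- c(0.1, 0.5, 0.9)

# One row per covariate, one column per requested quantile
imp.q <- t(apply(imp, 1, quantile, probs = var.q))
```

Each row of `imp.q` then describes how variable a covariate's importance is across the ensemble, rather than reporting a single point value.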

mode

Either 'bin' for a binary response or any other value, in which case the response is assumed to be a multivariate factor. Default is 'bin'

weights

A vector of weights equal in length to nrow(df), NULL by default

Value

A list containing the fitted ERF model and associated output.

  • data: the exact dataset used to fit each Random Forests within the ensemble. At a minimum, it will contain var, covariates, and header if provided. If duplicate = TRUE, then it will also contain the duplicated presence records.

  • model: the returned fitted individual Random Forests in the ensemble

  • ens.pred: the ensemble model predictions. This is generally the prediction set to use.

  • ens.perf: the ensemble model performance metrics including ROC curve.

  • mu.tr.perf: the mean training set performance across all Random Forests for the AUC, TSS, and RMSE metrics. These are generally of little use.

  • mu.te.perf: the mean test set performance across all Random Forests for the AUC, TSS, and RMSE metrics. These can be informative.

  • roc_train: the training performance metrics for each Random Forests in the ensemble

  • roc_test: the test performance metrics for each Random Forests in the ensemble

  • pred: a list with two objects:

    • p: predictions from each Random Forests in the ensemble to the dataset

    • resid: residuals between the observed presence/absences and the predictions from each Random Forests in the ensemble to the dataset

Examples

# run an ERF with 10 RFs
ens_rf_ex <- ens_random_forests(
  df = simData$samples,
  var = "obs",
  covariates = grep("cov", colnames(simData$samples), value = TRUE),
  save = FALSE,
  cores = 1
)
#> rounding n.forests to the nearest one

# view the dataset used in the model
head(ens_rf_ex$data) 
#>   obs        cov1        cov2       cov3        cov4        cov5      random
#> 1   0 -0.07158000 -0.27811766  0.5741324  0.01366734 -0.15537696 -0.05694179
#> 2   0 -0.12028137  0.26103341 -0.2298234  0.14750795  0.32482174 -1.42940088
#> 3   0  0.12422049  0.09052045 -0.3070963  0.14493653  0.02303010 -0.73738517
#> 4   0 -0.12481261 -0.29583389  0.6282519  0.06067430 -0.25240678  0.74319261
#> 5   0  0.01361460 -0.04989738  0.1600218  0.42189437  0.13166581  0.38999584
#> 6   0  0.01842766  0.18300638 -0.3421439 -0.33128857  0.07998399 -0.46390889

#view the model predictions
head(ens_rf_ex$ens.pred) 
#>      P.0    P.1 PRES   resid
#> 1 0.7255 0.2745    0 -0.2745
#> 2 0.3777 0.6223    0 -0.6223
#> 3 0.5666 0.4334    0 -0.4334
#> 4 0.7278 0.2722    0 -0.2722
#> 5 0.3636 0.6364    0 -0.6364
#> 6 0.9906 0.0094    0 -0.0094

#view the mean test threshold-free performance metrics
ens_rf_ex$mu.te.perf 
#>     teAUC    teRMSE     teTSS 
#> 0.7846983 0.3885199 0.5415211 

#view the threshold-free ensemble performance metrics
unlist(ens_rf_ex$ens.perf[c('auc','rmse','tss')]) 
#>       auc      rmse       tss 
#> 0.9757934 0.3777554 0.8292362
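
The resid column in the printed ens.pred output above is consistent with observed presence minus predicted presence probability (PRES - P.1). The sketch below checks that reading on the six printed rows only and computes an RMSE from them. The values are copied from the output above; `rmse6` is a hypothetical name and covers only these six rows, so it is not expected to match the full-dataset rmse in ens.perf.

```r
# First six rows of ens.pred, copied from the printed example output
ens.pred <- data.frame(
  P.1  = c(0.2745, 0.6223, 0.4334, 0.2722, 0.6364, 0.0094),
  PRES = c(0, 0, 0, 0, 0, 0)
)

# Residual as observed minus predicted presence probability
resid <- ens.pred$PRES - ens.pred$P.1

# Root mean squared error over these six rows only
rmse6 <- sqrt(mean(resid^2))
```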