library(EnsembleRandomForests)
#> Loading required package: doParallel
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel
#> Loading required package: randomForest
#> randomForest 4.7-1.1
#> Type rfNews() to see new features/changes/bug fixes.

Internal helper functions within Ensemble Random Forests

Reviewing the data preparation

We will use the example dataset simData to run the ERF model. Internally, ens_random_forests uses the functions erf_data_prep and erf_formula_prep to convert the data.frame of observations, the dependent variable of interest, and the covariates of interest into the correct format for the ERF model. Let's first see what these functions do.

erf_data <- erf_data_prep(df = simData$samples, 
                          var = "obs", #dependent variable
                          covariates = grep('cov',
                                            colnames(simData$samples), 
                                            value = TRUE),
                          header = c('prob.raw','prob'), #additional columns to include
                          duplicate = TRUE) #flag to duplicate multiple presences
head(erf_data,4)
#>   obs   prob.raw      prob       cov1        cov2       cov3       cov4
#> 1   0 -0.2525018 0.4372078 -0.0715800 -0.27811766  0.5741324 0.01366734
#> 2   0  2.7156166 0.9379419 -0.1202814  0.26103341 -0.2298234 0.14750795
#> 3   0  0.5609232 0.6366661  0.1242205  0.09052045 -0.3070963 0.14493653
#> 4   0  0.2723824 0.5676777 -0.1248126 -0.29583389  0.6282519 0.06067430
#>         cov5      random
#> 1 -0.1553770 -1.54050531
#> 2  0.3248217 -0.15696616
#> 3  0.0230301  0.07094116
#> 4 -0.2524068  1.15358310

As we can see above, the main purpose of erf_data_prep is to reorganize the data.frame so that the variable of interest is in the first column. Additionally, a random variable is included to provide a cutoff for variable importance downstream.

The header argument allows you to tack on additional columns that you want to include in the model prediction data.frame. This is a useful way to add a column of unique IDs for matching predictions and observations should they get scrambled. It is unwise to include all columns in the header argument; leaving extraneous columns out of the ERF output avoids wasting computational resources.
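For example, one could carry a unique identifier through to the prediction data.frame by adding it to the source data and naming it in header. A minimal sketch, assuming a hypothetical set_id column that we add ourselves (it is not part of simData):

sim_df <- simData$samples
sim_df$set_id <- seq_len(nrow(sim_df)) #hypothetical unique ID per observation
erf_data_id <- erf_data_prep(df = sim_df,
                             var = "obs",
                             covariates = grep('cov', colnames(sim_df), value = TRUE),
                             header = 'set_id', #carry only the ID along
                             duplicate = TRUE)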

The most impactful argument in erf_data_prep is the duplicate argument. This logical flag, when TRUE, copies the covariates for any row that contains more than one presence. For example, say a fishery sets a longline, with each set corresponding to an observation (or row) in our data.frame. On the thousands of hooks on this longline, we could reasonably catch more than one individual of the species of interest. The duplicate argument handles this case by copying the multiple presences on an observation, which effectively upweights covariate space where the species occurs multiple times.

Alternatively, setting duplicate to FALSE allows a user to model the probability of observing at least one individual of the dependent variable of interest. Note that this is different from the probability of presence when the observation unit can generate multiple presences per unit effort. Returning to our longline example, if we used each individual hook as an observation (or row), then we could expect at most one presence per hook, and setting duplicate to TRUE or FALSE would have no impact on the resulting ERF data.frame.
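To see the effect directly, we can compare the number of rows returned with duplicate on and off; a sketch using only the arguments shown above (if no observation holds more than one presence, the difference will be zero):

covs <- grep('cov', colnames(simData$samples), value = TRUE)
erf_dup <- erf_data_prep(df = simData$samples, var = "obs", covariates = covs,
                         header = c('prob.raw','prob'), duplicate = TRUE)
erf_nodup <- erf_data_prep(df = simData$samples, var = "obs", covariates = covs,
                           header = c('prob.raw','prob'), duplicate = FALSE)
nrow(erf_dup) - nrow(erf_nodup) #rows added by duplicating multiple presences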

Reviewing the formula preparation

Next, let’s see what erf_formula_prep does.

erf_form <- erf_formula_prep(var = 'obs',
                             covariates = grep('cov',
                                               colnames(simData$samples), 
                                               value = TRUE))
print(erf_form)
#> obs ~ cov1 + cov2 + cov3 + cov4 + cov5 + random
#> <environment: 0x7fbe0a9f1188>

This function is very simple and is mostly called internally by ens_random_forests. It simply sets up the dependent variable and the covariates for use in the resulting randomForest call.
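For reference, the same formula can be built with base R's reformulate; note from the printed formula above that erf_formula_prep also appends the random covariate created during data prep. This snippet is our own illustration, not the package's internal code:

covs <- grep('cov', colnames(simData$samples), value = TRUE)
reformulate(c(covs, 'random'), response = 'obs') #'random' is the noise column
#> obs ~ cov1 + cov2 + cov3 + cov4 + cov5 + random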

Balancing class imbalance within Ensemble Random Forests

The main wrapper function users should interact with is ens_random_forests. Internally, this function first prepares the data using erf_data_prep, creates a formula using erf_formula_prep, and then calls rf_ens_fn. This last function is the workhorse that fits each Random Forests in the ensemble, and it implements two bagging steps. The first bagging step divides the dataset into a training and a test set (90%:10% by default), retaining the proportion of zeroes and ones in each set. The second bagging step implements downsampling, which balances the proportions of zeroes and ones supplied to each decision tree in a given Random Forests. For example, let's look at the frequency and proportion of zeroes and ones in our simulated dataset simData.

Obs   Freq   Prop.
  0   9819   0.9819
  1    181   0.0181
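These counts can be reproduced directly with base R (not a package function):

obs_freq <- table(simData$samples$obs) #frequency of zeroes and ones
prop.table(obs_freq) #proportion of zeroes and ones
#>      0      1 
#> 0.9819 0.0181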

As we can see, our zeroes and ones are unbalanced: we have far more zeroes than ones. In the first bagging step, the proportion of zeroes and ones is retained when creating the training/test sets. Then, prior to the second bagging step, we determine the maximum number of samples in each bag passed to each decision tree in a given Random Forests in the ensemble. We do this with max_splitter.

(max_split <- max_splitter(erf_data))
#> [1] 143
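Putting the two bagging steps together, here is a minimal sketch of what happens for one member of the ensemble, assuming a stratified 90:10 split and balanced bags of size max_split drawn via randomForest's strata and sampsize arguments; the actual internals of rf_ens_fn may differ:

set.seed(42) #reproducibility of this illustration only
idx0 <- which(erf_data$obs == 0)
idx1 <- which(erf_data$obs >= 1)
#bagging step 1: stratified train/test split retaining the 0/1 proportions
train_idx <- c(sample(idx0, floor(0.9 * length(idx0))),
               sample(idx1, floor(0.9 * length(idx1))))
train <- erf_data[train_idx, ]
train$obs <- factor(as.integer(train$obs >= 1)) #factor response for classification
#bagging step 2: downsampled, balanced bags passed to each tree
rf_bal <- randomForest(erf_form, data = train,
                       ntree = 50,
                       strata = train$obs,
                       sampsize = c(max_split, max_split))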

Running a single RF

Now that we have prepared our data, formula, and maximum split, we can run a single Random Forests. This will be done internally in ens_random_forests multiple times to create the ensemble of Random Forests. Here, a single RF is run with the double bagging procedure discussed above. The resulting list has components mod, preds, roc_train, and roc_test. Internally, ens_random_forests uses these to calculate the ensemble predictions and performance. For our single RF, these components hold the randomForest-class model, a data.frame of the predictions and observations, the ROC-measured performance for the training set, and the same for the test set. The test set metrics are of more interest than the training set metrics: the test set was not seen by the RF model, so it is a better measure of how well the model generalizes. It is still useful to check that \(AUC_{training} > AUC_{test}\), since by random chance the test AUC can exceed the training AUC even when the overall RF fits poorly. In ERF, we do not typically concern ourselves with the out-of-bag performance calculated internally by the randomForest call.

rf_ex <- rf_ens_fn(erf_data, erf_form, max_split, ntree=50)
head(rf_ex$preds)
#>    P.0  P.1 PRES  type
#> 1 0.68 0.32    0 train
#> 2 0.38 0.62    0 train
#> 3 0.62 0.38    0 train
#> 4 0.80 0.20    0 train
#> 5 0.22 0.78    0 train
#> 6 1.00 0.00    0 train
rf_ex$roc_train$auc #training AUC
#> [1] 0.9704141
rf_ex$roc_test$auc #test AUC
#> [1] 0.788375
(rf_ex$roc_train$auc > rf_ex$roc_test$auc)
#> [1] TRUE
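Beyond the AUCs, the preds data.frame can be summarized by hand. For example, a confusion matrix on the held-out rows, assuming they are flagged with type == "test" just as the training rows are flagged with "train", and using an illustrative 0.5 cutoff:

test_preds <- subset(rf_ex$preds, type == "test")
table(observed = test_preds$PRES,
      predicted = as.integer(test_preds$P.1 >= 0.5)) #0.5 cutoff is illustrative only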