Chapter 06: Random Forests

This chapter introduces bagging as a method to improve the performance of trees; a modification of bagging leads to random forests. We explain the main idea of random forests, benchmark their performance against the methods seen so far, show how to quantify the impact of a single feature on the performance of a random forest, and demonstrate how to compute proximities between observations based on random forests.

Chapter 6.1: Bagging Ensembles

Bagging (bootstrap aggregation) is a method for combining many models into a meta-model, which often works much better than its individual components. In this section, we present the basic idea of bagging and explain why and when bagging works.
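To make the bootstrap-and-aggregate idea concrete, here is a minimal sketch of bagged trees in Python, assuming scikit-learn is available; the synthetic dataset, the ensemble size of 100, and all parameter choices are illustrative, not taken from the chapter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# fit each tree on a bootstrap sample (drawn with replacement)
rng = np.random.default_rng(0)
trees = []
for _ in range(100):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    trees.append(DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx]))

# aggregate: majority vote over all bootstrap trees (labels are 0/1 here)
votes = np.stack([tree.predict(X_te) for tree in trees])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("single tree :", accuracy_score(y_te, single.predict(X_te)))
print("bagged trees:", accuracy_score(y_te, y_hat))
```

Each ensemble member sees a slightly different bootstrap sample, so the trees make partly independent errors; the majority vote averages these errors away, which is why the aggregated prediction is usually more stable than that of any single tree.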

Chapter 6.2: Introduction

In this section, we investigate random forests, a modification of bagging for trees. We illustrate the effect of ensemble size and show how to compute out-of-bag error estimates.
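A minimal sketch of the out-of-bag idea, again assuming scikit-learn: because each tree is fit on a bootstrap sample, roughly a third of the observations are left out of that sample and can serve as an internal test set for the tree. The dataset and the ensemble sizes below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# out-of-bag error: each observation is predicted only by the trees
# whose bootstrap sample did not contain it
for n_trees in (25, 100, 500):
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=0).fit(X, y)
    print(f"{n_trees:4d} trees, OOB error: {1 - rf.oob_score_:.3f}")
```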

Chapter 6.3: Benchmarking Trees, Forests, and Bagging K-NN

In this section, we compare the performance of random forests with that of (bagged) CART and (bagged) k-NN.
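A sketch of such a benchmark, assuming scikit-learn; the breast cancer dataset, 10-fold cross-validation, and the ensemble size of 100 are illustrative choices, not the chapter's actual benchmark setup.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

learners = {
    "CART": DecisionTreeClassifier(random_state=0),
    "bagged CART": BaggingClassifier(DecisionTreeClassifier(),
                                     n_estimators=100, random_state=0),
    "k-NN": KNeighborsClassifier(),
    "bagged k-NN": BaggingClassifier(KNeighborsClassifier(),
                                     n_estimators=100, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 10-fold cross-validated accuracy for each learner
for name, learner in learners.items():
    scores = cross_val_score(learner, X, y, cv=10)
    print(f"{name:>13s}: {scores.mean():.3f}")
```

A typical outcome of such comparisons is that bagging helps an unstable learner like CART considerably while barely changing a stable learner like k-NN, with random forests performing at least on par with bagged trees.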

Chapter 6.4: Feature Importance

In a complex machine learning model, the contributions of individual features to model performance are difficult to evaluate. The concept of feature importance allows us to quantify this for random forests.
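As a sketch, assuming scikit-learn: the classical random-forest importance permutes one feature at a time in the out-of-bag samples of each tree and records the resulting drop in performance. scikit-learn's model-agnostic permutation_importance applies the same permutation idea to a held-out set instead, which is what we use here; the diabetes dataset and all parameters are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# permutation importance: how much performance drops when the values of
# one feature are randomly shuffled, breaking its link to the target
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:+.3f}"
          f" +/- {result.importances_std[i]:.3f}")
```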

Chapter 6.5: Proximities

The term "proximity" means the "closeness" between pairs of cases. Proximities are calculated for each pair of observations and can be derived directly from random forests.

Chapter 6.6: Discussion

In this section, we discuss the advantages and disadvantages of random forests and explain that most advantages of trees also apply here.