Python vs R: Random Forests

I heard recently that statisticians still favor R over Python because they’re suspicious of Python’s accuracy… Well, there’s an easy way to check. I’m going to answer three questions:

  1. Are the predictions similar between the languages?
  2. For comparable predictive algorithms, which of the two languages runs faster?
  3. Do variable importance and partial dependence differ between the languages?

Models

I’m going to compare random forest methods.

  1. R’s random forest with default parameters
  2. Python’s random forest with R’s default parameters
  3. Python’s random forest with default parameters
  4. R’s random forest with Python’s default parameters

    Out of curiosity, I'm also throwing in a couple of additional Python models (set up roughly as sketched after this list).

  5. Python’s gradient boosted trees (GBM)
  6. Python’s XGBoost random forest
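
To make these concrete, here's a minimal sketch of how the two extra Python models might be set up. The specific classes (scikit-learn's GradientBoostingRegressor and xgboost's XGBRFRegressor) and parameter values are my assumptions, not necessarily what the post's code uses.

```python
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRFRegressor

# scikit-learn's gradient boosted trees (the "GBM" above)
gbm = GradientBoostingRegressor(n_estimators=500, random_state=0)

# xgboost's random-forest-style ensemble
xgb_rf = XGBRFRegressor(n_estimators=500, n_jobs=10, random_state=0)
```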

Data

I use two datasets which I’m not able to share. One is a large zero-inflated dataset with over 110,000 rows and 55 columns. The other is data from my land surface temperature (LST) project and has 1,200 rows and 75 columns. Both are regression problems (rather than classification).

Code

The code is on GitHub here, so feel free to check it out. I generate 10 holdouts and pass them to both Python and R, so the models are trained and tested on the same observations. I parallelize the code with 10 cores, and the test size is 20%.
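
Here's a rough sketch of how matched holdouts could be generated on the Python side; the file name, output naming, and seeds are placeholders rather than the repository's actual code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")  # placeholder path, not the real dataset

for i in range(10):
    # 20% test set, seeded per holdout so the split is reproducible
    train_idx, test_idx = train_test_split(df.index, test_size=0.2, random_state=i)
    # write the test indices out so the R script can use the identical split
    pd.Series(test_idx).to_csv(f"holdout_{i}_test_idx.csv", index=False)
```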

Random forests are ensemble models (i.e. they average the results of many predictors). As their name suggests, they are the average of many tree models. One of the parameters for RFs is the number of trees: R defaults to 500 whereas Python defaults to 10, so R uses many more trees by default, which takes longer but is likely to be more accurate. The other parameter I ensure is consistent between languages is mtry, the number of variables randomly sampled at each split (when the value is a decimal, it’s the fraction of the total number of variables).
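
To make the parameter matching concrete, here's a minimal sketch of the scikit-learn side; it assumes R's documented regression defaults (ntree = 500, mtry = p/3) and is not the post's actual code.

```python
from sklearn.ensemble import RandomForestRegressor

# random forest with R's default parameters
rf_r_defaults = RandomForestRegressor(
    n_estimators=500,    # R's ntree default
    max_features=1 / 3,  # R's mtry default for regression, as a fraction
    n_jobs=10,           # 10 cores, as in the post
)

# random forest with (older) scikit-learn defaults: only 10 trees
rf_py_defaults = RandomForestRegressor(n_estimators=10, n_jobs=10)
```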

Results

Accuracy

The boxplots show the difference in MAE between the models for both sets of data. Python’s random forest using R’s default parameters is the best for the zero-inflated dataset, and it also slightly outperforms R’s on the LST dataset. The best model for the LST dataset is the GBM, and R’s RF (with Python’s parameters) is off-the-charts bad.

In addition to comparing aggregate predictive accuracy, I wondered how the observation-level predictions compared. That is, for a given data point, do the predictions differ? These figures show that the models do not differ much at all.
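
As an illustration of that observation-level check, something like the following would do; the file and column names are placeholders rather than the post's actual outputs.

```python
import pandas as pd

# predictions for one holdout from each language (placeholder file names)
py_pred = pd.read_csv("python_rf_predictions.csv")["pred"]
r_pred = pd.read_csv("r_rf_predictions.csv")["pred"]

print("correlation:", py_pred.corr(r_pred))
print("mean absolute difference:", (py_pred - r_pred).abs().mean())
```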

Time

How long models take to run can dictate which other interesting approaches can be tested, for example parameter selection (another post coming soon). Parameter selection can require running the models repeatedly, and that’s unpalatable if a single run takes days.

I suspected that Python would be faster, but honestly I was blown away. On the large dataset Python was 84 times faster than R (and had better predictive accuracy). That is, it took R ~13 hours and 43 minutes to complete what Python did in 10 minutes.
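
For anyone wanting to reproduce this kind of comparison, a simple wall-clock timing of the fit is all that's needed; the sketch below uses synthetic stand-in data much smaller than the real dataset, purely for illustration.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic stand-in data, far smaller than the 110,000-row dataset
X = np.random.rand(10_000, 55)
y = np.random.rand(10_000)

start = time.perf_counter()
RandomForestRegressor(n_estimators=500, n_jobs=10).fit(X, y)
print(f"fit took {time.perf_counter() - start:.1f} seconds")
```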

Inference

Partial dependence and variable importance with these nonlinear models can be very useful for understanding the relationship between the explanatory variables and the target variable. They also provide a way to capture the independent effects of different variables and identify their relative importance, which can be useful when prioritizing action or further study. For this section I’ve used the concrete dataset, which is a little cleaner than my own data and which I can share.

Variable importance is defined as

  • R: the mean decrease in node impurity. As these are regression models, node impurity is measured as the residual sum of squares (classification uses the Gini index). The mean decrease in node impurity is the total decrease in RSS from splitting on the variable, averaged over all the trees.
  • Python: poorly documented, but the developer answered a StackOverflow question saying that it is the same as above (based on Breiman, 1984). The difference is that the importances in Python are normalized so they sum to one (see the sketch below).
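
As a minimal sketch of the Python side (using illustrative data, not the concrete dataset), scikit-learn exposes the normalized importances directly; R's IncNodePurity values would need dividing by their total for a like-for-like comparison.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# illustrative regression data
X, y = make_regression(n_samples=500, n_features=8, random_state=0)
X = pd.DataFrame(X, columns=[f"x{i}" for i in range(8)])

rf = RandomForestRegressor(n_estimators=500, max_features=1 / 3).fit(X, y)

# feature_importances_ are already normalized to sum to one
print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))
```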

Partial dependence is defined as

  • R: “Partial dependence plot gives a graphical depiction of the marginal effect of a variable on the class probability (classification) or response (regression).”
  • Python: “[PDPs] show the dependence between the target function and a set of features, marginalizing over the values of all other features” (documentation). Note that Python’s y-axis shows how the prediction changes as the x-axis variable changes, whereas R’s partial dependence shows the prediction itself. Also note that Python defaults to showing the 5th-95th percentile range of the x variable (see the sketch below).
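
Here's a short sketch of how the Python partial dependence plot is produced, assuming scikit-learn's PartialDependenceDisplay and illustrative data rather than the concrete dataset.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

# illustrative regression data
X, y = make_regression(n_samples=500, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=200).fit(X, y)

# by default the x-axis spans the 5th-95th percentile of each feature
PartialDependenceDisplay.from_estimator(rf, X, features=[0, 1])
plt.show()
```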

Summary

  1. The accuracy is relatively similar, although Python does outperform R
  2. Python is faster than R
  3. The variable importance and partial dependence are comparable. So, while R is nice, stop procrastinating and learn Python.

Stay tuned for our forthcoming blog comparing parameter selection approaches in Python.

