I heard recently that statisticians still favor R over Python because they’re suspicious of Python’s accuracy… Well, there’s an easy way to check. I’m going to answer three questions:
- Are the predictions similar between the languages?
- For comparable predictive algorithms, which of the two languages runs faster?
- Do variable importance and partial dependence change?
I’m going to compare random forest methods.
- R’s random forest with default parameters
- Python’s random forest with R’s default parameters
- Python’s random forest with default parameters
- R’s random forest with Python’s default parameters
As a throw-in, I’m also interested in a couple of additional Python models, out of curiosity.
- Python’s gradient boosted random forest (gbm)
- Python’s XGBoost random forest
I use two datasets, which I’m not sharing. One is a large zero-inflated dataset with over 110,000 rows and 55 columns. The other is data from my land surface temperature (LST) project and has 1,200 rows and 75 columns. These are regression models (rather than classification) in both cases.
The code is on GitHub here so feel free to check it out. I generate 10 holdouts and then pass them to both Python and R, so the models are trained and tested on the same observations. I parallelize the code across 10 cores. The test size is 20%.
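As a rough illustration of the shared-holdout idea, here is a minimal sketch, not the code from the repository. The function names, the seed, and the CSV layout are assumptions made for this example; the point is only that the splits are generated once and saved so both languages can read the same test rows.

```python
import csv
import random


def make_holdouts(n_rows, n_holdouts=10, test_size=0.2, seed=0):
    """Return a list of (train_idx, test_idx) pairs, one per holdout."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    n_test = int(round(n_rows * test_size))
    splits = []
    for _ in range(n_holdouts):
        idx = list(range(n_rows))
        rng.shuffle(idx)
        # First n_test shuffled rows are the test set, the rest are training rows.
        splits.append((sorted(idx[n_test:]), sorted(idx[:n_test])))
    return splits


def write_test_indices(splits, path):
    """Write one CSV row per holdout: holdout id, then its 0-based test indices.

    R is 1-based, so an R reader would need to add 1 to each index.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for holdout_id, (_, test_idx) in enumerate(splits):
            writer.writerow([holdout_id] + test_idx)


# 1,200 rows matches the size of the LST dataset described above.
splits = make_holdouts(n_rows=1200)
```

Calling `write_test_indices(splits, "holdouts.csv")` (a hypothetical file name) would produce a file that both the Python and R scripts could load before fitting.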
Random forests are ensemble models (i.e. they average the results of many predictors). As their name suggests, they are the average of many tree models. One of the parameters for RFs is the number of trees. R defaults to 500 whereas Python defaults to 10, so R uses many more trees by default, which takes longer but is likely to be more accurate.
The other parameter that I ensure is consistent between languages is the number of variables randomly sampled at each split (when the value is a decimal, it’s the fraction of the total number of variables).
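To make "Python’s random forest with R’s default parameters" concrete, here is a small sketch of how R’s defaults could be translated. It assumes the R model is the `randomForest` package, whose regression default for the per-split variable count is `max(floor(p / 3), 1)`, and that the Python model is scikit-learn’s `RandomForestRegressor`, which takes `n_estimators` and `max_features`. The helper names and the feature count of 54 are made up for illustration.

```python
def r_default_mtry(n_features):
    """Variables tried per split under randomForest's regression default."""
    return max(n_features // 3, 1)


def r_like_rf_params(n_features):
    """Keyword arguments that mimic R's defaults in a scikit-learn style API.

    An integer max_features is read as a count of features, which avoids
    any rounding that a fractional value might introduce.
    """
    return {
        "n_estimators": 500,  # R's default number of trees
        "max_features": r_default_mtry(n_features),
    }


# Hypothetical: 54 explanatory variables.
params = r_like_rf_params(n_features=54)
```

Under those assumptions the dictionary could be unpacked into the model constructor, for example `RandomForestRegressor(**params)`.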
The boxplots show the difference in mean absolute error (MAE) between the models for both datasets. Python’s random forest using R’s default parameters is the best for the zero-inflated dataset, and it also slightly outperforms R’s on the LST dataset. The best model for the LST dataset is the GBM, and R’s RF (with Python’s parameters) is off-the-charts bad.
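For reference, the MAE for one holdout is just the average absolute gap between the observed and predicted values. The sketch below uses made-up numbers, not results from either dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |observed - predicted| over all test observations."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values purely for illustration.
y_true = [0.0, 2.0, 5.0]
y_pred = [0.5, 2.0, 3.0]
mae = mean_absolute_error(y_true, y_pred)  # (0.5 + 0.0 + 2.0) / 3
```

Computing this once per holdout gives the 10 values per model that a boxplot would summarize.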
In addition to comparing aggregate predictive accuracy, I wondered how the direct observation-level predictions compared. That is, for a given data point, do the predictions differ? These figures show that the models do not differ much at all.
How long models take to run can dictate which other interesting approaches can be tested, for example parameter selection (another post coming soon). Parameter selection can require running the models repeatedly, and that’s unpalatable if a single run takes days.
I suspected that Python would be faster, but honestly I was blown away. On the large dataset Python was 84 times faster than R (and had better predictive accuracy). That is, it took R ~13 hours and 43 minutes to complete what Python did in about 10 minutes.
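The arithmetic behind that ratio can be checked quickly. This sketch only uses the rounded figures quoted above, so it lands near, rather than exactly on, 84.

```python
# Rounded timings quoted in the text.
r_minutes = 13 * 60 + 43   # 13 h 43 min
python_minutes = 10        # "about 10 minutes"

# Speed-up implied by the rounded timings.
speedup = r_minutes / python_minutes

# Python run time that an 84x speed-up would imply, in minutes.
implied_python_minutes = r_minutes / 84
```

With a Python time of exactly 10 minutes the ratio is a little over 82, and an 84x speed-up corresponds to a Python run of just under 10 minutes, so the two quoted figures agree once rounding is taken into account.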
Partial dependence and variable importance can be super useful with these nonlinear models for understanding the relationship between the explanatory variables and the target variable. They also provide a way to capture the independent effects of different variables and to identify their relative importance. This can be useful when prioritizing action or further study. For this section I’ve used the concrete dataset, so the data is a little cleaner than mine and I can share it.
Variable importance is defined as
- R: the mean decrease in node impurity. As these models are regression models, the node impurity is measured as the residual sum of squares (RSS); classification uses the Gini index. The mean decrease in node impurity is the total decrease in the RSS from splitting on the variable, averaged over all the trees.
- Python: this is poorly documented, but the developer answered a StackOverflow question saying that it is the same as above (based on Breiman et al.’s 1984 book). The difference is that the importances in Python are normalized so they sum to one.
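Because the two definitions differ only by that normalization, R’s raw importances can be put on Python’s scale by dividing by their total. The sketch below shows this with made-up numbers; the variable names are borrowed from the concrete dataset, but the values are not real results.

```python
def normalize_importances(raw):
    """Rescale raw impurity decreases so that they sum to one."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("importances must have a positive total")
    return {name: value / total for name, value in raw.items()}


# Made-up mean decreases in RSS, in the style of R's output.
raw_importance = {"cement": 60.0, "age": 30.0, "water": 10.0}
normalized = normalize_importances(raw_importance)
```

After rescaling, the ranking of variables is unchanged and the values can be compared directly with Python’s.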
Partial dependence is defined as
- R: “Partial dependence plot gives a graphical depiction of the marginal effect of a variable on the class probability (classification) or response (regression).”
- Python: “[PDPs] show the dependence between the target function and a set of features, marginalizing over the values of all other features” (documentation). Note here that the y axis shows how the prediction will change as we change the x axis variable. R’s partial dependence shows the prediction itself. Also note that Python defaults to showing the 5th-95th percentile range of the x variable.
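The "marginalizing" in both definitions can be sketched by hand: fix the variable of interest at a grid value for every row, predict, and average. The code below is an illustration with a toy prediction function, not either library’s implementation. The grid spans the 5th to 95th percentiles as described above, and subtracting the mean is just one assumed way to turn the raw curve into a "change in prediction" curve.

```python
def percentile(values, q):
    """Percentile with linear interpolation, q between 0 and 100."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def partial_dependence(predict, rows, feature, n_grid=5):
    """Average prediction with `feature` forced to each grid value in turn."""
    column = [row[feature] for row in rows]
    lo, hi = percentile(column, 5), percentile(column, 95)
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    averages = []
    for value in grid:
        # Overwrite the feature for every row, leaving other features as observed.
        preds = [predict({**row, feature: value}) for row in rows]
        averages.append(sum(preds) / len(preds))
    return grid, averages


def toy_predict(row):
    """Stand-in for a fitted model (hypothetical)."""
    return 2.0 * row["x1"] + row["x2"]


rows = [{"x1": float(i), "x2": float(i % 3)} for i in range(21)]
grid, pd_raw = partial_dependence(toy_predict, rows, "x1")

# Raw curve: the averaged prediction itself.
# Centered curve: the same shape expressed as a change around its mean.
mean_pd = sum(pd_raw) / len(pd_raw)
pd_centered = [value - mean_pd for value in pd_raw]
```

The two curves have the same shape and differ only by a vertical shift, which is why plots from the two languages can look alike while having different y axes.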
- The accuracy is relatively similar, although Python does outperform R
- Python is faster than R
- The variable importance and partial dependence are comparable.
So, while R is nice, stop procrastinating and learn Python.
Stay tuned for our forthcoming blog comparing parameter selection approaches in Python.