Title: | Predictive Inference for Random Forests |
---|---|
Description: | An integrated package for constructing random forest prediction intervals, built on the fast random forest implementation in the 'ranger' package. It implements three methods described in Haozhe Zhang, Joshua Zimmerman, Dan Nettleton, and Daniel J. Nordman (2019) <doi:10.1080/00031305.2019.1585288>: the out-of-bag prediction interval, the split conformal method, and the quantile regression forest. |
Authors: | Haozhe Zhang [aut, cre] |
Maintainer: | Haozhe Zhang <[email protected]> |
License: | GPL-3 |
Version: | 1.0.0 |
Built: | 2024-10-31 22:10:15 UTC |
Source: | https://github.com/haozhestat/rfinterval |
This hourly data set contains PM2.5 measurements from the US Embassy in Beijing, together with meteorological data from Beijing Capital International Airport.
BeijingPM25
A data frame with 8661 rows and 11 variables:
PM2.5 concentration (ug/m^3)
month of observation
day of observation
hour of observation
dew point
temperature
air pressure
combined wind direction
cumulated wind speed
cumulated hours of snow
cumulated hours of rain
Liang, X., Zou, T., Guo, B., Li, S., Zhang, H., Zhang, S., Huang, H. and Chen, S. X. (2015). Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating. Proceedings of the Royal Society A, 471, 20150257.
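As a rough illustration of how this data set might be used with the package, the sketch below fits out-of-bag intervals on an early block of hours and checks coverage on a later block. It is guarded so that it only runs when 'rfinterval' is installed. Two things here are assumptions rather than documented facts: that the first column of BeijingPM25 holds the PM2.5 response (matching the order of the variable list above), and that a 1000/1000 row split is a sensible size for a quick check.

```r
pm_result <- NULL

if (requireNamespace("rfinterval", quietly = TRUE)) {
  data("BeijingPM25", package = "rfinterval")

  # Assumption: the first column is the PM2.5 response.
  # Inspect names(BeijingPM25) before relying on this.
  response <- names(BeijingPM25)[1]
  pm_formula <- as.formula(paste(response, "~ ."))

  # Illustrative split: first 1000 rows for training, next 1000 for testing.
  train_data <- BeijingPM25[1:1000, ]
  test_data <- BeijingPM25[1001:2000, ]

  pm_result <- rfinterval::rfinterval(pm_formula,
                                      train_data = train_data,
                                      test_data = test_data,
                                      method = "oob",
                                      alpha = 0.1)

  # Share of test responses that fall inside the out-of-bag intervals.
  y <- test_data[[response]]
  print(mean(pm_result$oob_interval$lo < y & pm_result$oob_interval$up > y))
}
```

Because the rows are hourly observations, a split by position keeps the test block later in time than the training block; a random split would be an equally valid choice for other purposes.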
The rfinterval function constructs prediction intervals for random forest predictions, using the fast random forest implementation in the 'ranger' package.
rfinterval(formula = NULL, train_data = NULL, test_data = NULL, method = c("oob", "split-conformal", "quantreg"), alpha = 0.1, symmetry = TRUE, seed = NULL, params_ranger = NULL)
formula |
Object of class |
train_data |
Training data of class data.frame, matrix, or dgCMatrix (Matrix). |
test_data |
Test data of class data.frame, matrix, or dgCMatrix (Matrix). |
method |
Method for constructing prediction interval. If method = "oob", compute the out-of-bag prediction intervals; if method = "split-conformal", compute the split conformal prediction interval; if method = "quantreg", use quantile regression forest to compute prediction intervals. |
alpha |
Miscoverage level; the intervals have nominal coverage 1 - alpha. For example, alpha = 0.05 gives a 95% prediction interval. |
symmetry |
TRUE to construct symmetric out-of-bag prediction intervals, FALSE otherwise. Only used when method = "oob". |
seed |
Random seed (only used when method = "split-conformal"). |
params_ranger |
List of further parameters that should be passed to ranger. See |
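To make the roles of alpha and symmetry more concrete, the sketch below builds out-of-bag style intervals from a vector of out-of-bag prediction errors, using base R only. It is an illustration under stated assumptions, not the package's internal code: the helper oob_style_interval() is hypothetical, the errors and predictions are simulated, and empirical quantiles from quantile() stand in for whatever order statistics the package itself uses.

```r
# Hypothetical helper (not part of rfinterval): turn out-of-bag errors
# (observed response minus out-of-bag prediction) into prediction intervals.
oob_style_interval <- function(oob_error, test_pred, alpha = 0.1, symmetry = TRUE) {
  if (symmetry) {
    # Symmetric: one half-width taken from the (1 - alpha) quantile of |error|.
    half_width <- unname(quantile(abs(oob_error), probs = 1 - alpha))
    lo <- test_pred - half_width
    up <- test_pred + half_width
  } else {
    # Asymmetric: shift the prediction by the alpha/2 and 1 - alpha/2
    # quantiles of the signed errors.
    q <- unname(quantile(oob_error, probs = c(alpha / 2, 1 - alpha / 2)))
    lo <- test_pred + q[1]
    up <- test_pred + q[2]
  }
  data.frame(lo = lo, up = up)
}

set.seed(1)
oob_error <- rexp(1000) - 1     # simulated right-skewed errors (assumption)
test_pred <- c(2.0, 3.5, 5.0)   # simulated point predictions (assumption)

sym <- oob_style_interval(oob_error, test_pred, alpha = 0.1, symmetry = TRUE)
asym <- oob_style_interval(oob_error, test_pred, alpha = 0.1, symmetry = FALSE)
print(sym)
print(asym)
```

With skewed errors such as these, the asymmetric intervals extend further above the prediction than below it, while the symmetric intervals stay centred on the prediction.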
oob_interval |
Out-of-bag prediction intervals |
sc_interval |
Split-conformal prediction intervals |
quantreg_interval |
Quantile regression forest prediction intervals |
alpha |
Miscoverage level used for the prediction intervals (nominal coverage is 1 - alpha) |
testPred |
Random forest prediction for test set |
train_data |
Training data |
test_data |
Test data |
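The sc_interval component holds split conformal intervals. The general idea behind that method can be sketched in base R as follows. This is a minimal sketch under stated assumptions and not the package's implementation: the helper split_conformal_lm() and the data generator make_data() are hypothetical, a linear model fitted with lm() is swapped in for the random forest so the example has no dependencies, and an even random split of the training data is assumed. The random split is also why a seed matters for this method.

```r
# Hypothetical helper (not part of rfinterval): split conformal intervals
# with a linear model standing in for the random forest.
split_conformal_lm <- function(formula, train_data, test_data,
                               alpha = 0.1, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # the split is random, hence the seed
  n <- nrow(train_data)
  fit_idx <- sample(n, size = floor(n / 2))
  fit_set <- train_data[fit_idx, , drop = FALSE]
  cal_set <- train_data[-fit_idx, , drop = FALSE]

  fit <- lm(formula, data = fit_set)

  # Sorted absolute residuals on the held-out calibration half.
  response <- all.vars(formula)[1]
  cal_resid <- sort(abs(cal_set[[response]] - predict(fit, newdata = cal_set)))

  # Conformal quantile: the ceiling((n_cal + 1) * (1 - alpha))-th smallest
  # residual. The index is capped at n_cal for simplicity; a strict treatment
  # would return an infinite interval when the index exceeds n_cal.
  n_cal <- length(cal_resid)
  k <- min(ceiling((n_cal + 1) * (1 - alpha)), n_cal)
  half_width <- cal_resid[k]

  pred <- unname(predict(fit, newdata = test_data))
  data.frame(lo = pred - half_width, up = pred + half_width)
}

# Simulated data for illustration only.
set.seed(42)
make_data <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  data.frame(y = 1 + 2 * x1 - x2 + rnorm(n), x1 = x1, x2 = x2)
}
train_data <- make_data(500)
test_data <- make_data(500)

sc <- split_conformal_lm(y ~ ., train_data, test_data, alpha = 0.1, seed = 1)
coverage <- mean(sc$lo < test_data$y & sc$up > test_data$y)
print(coverage)
```

Every test point receives the same half-width here, because a single quantile of the calibration residuals is added to and subtracted from each prediction.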
Zhang, H., Zimmerman, J., Nettleton, D., and Nordman, D. J. (2019). "Random Forest Prediction Intervals." The American Statistician. doi:10.1080/00031305.2019.1585288.
Zhang, H. (2019). "Topics in Functional Data Analysis and Machine Learning Predictive Inference." Ph.D. dissertation. Iowa State University Digital Repository. 17929.
Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2018). "Distribution-free predictive inference for regression." Journal of the American Statistical Association, 113(523), 1094-1111.
Meinshausen, N. (2006). "Quantile regression forests." Journal of Machine Learning Research, 7, 983-999.
Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5-32.
library(rfinterval)

# Simulate training and test sets with 8 predictors
train_data <- sim_data(n = 500, p = 8)
test_data <- sim_data(n = 500, p = 8)

# Compute all three interval types at the 90% nominal level
output <- rfinterval(y ~ ., train_data = train_data, test_data = test_data,
                     method = c("oob", "split-conformal", "quantreg"),
                     symmetry = TRUE, alpha = 0.1)

# Empirical coverage of each interval type on the test responses
y <- test_data$y
mean(output$oob_interval$lo < y & output$oob_interval$up > y)
mean(output$sc_interval$lo < y & output$sc_interval$up > y)
mean(output$quantreg_interval$lo < y & output$quantreg_interval$up > y)
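Coverage on its own does not say how informative the intervals are, so it can help to report the average width alongside it. The sketch below shows one way to do that in base R. The helper interval_summary() is hypothetical (not part of the package), and the toy intervals are made up; the only thing taken from the documentation above is that each interval component has lo and up columns.

```r
# Hypothetical helper (not part of rfinterval): summarise one set of intervals.
interval_summary <- function(interval, y) {
  covered <- interval$lo < y & interval$up > y
  c(coverage = mean(covered), mean_width = mean(interval$up - interval$lo))
}

# Toy intervals and responses, made up for illustration.
toy_interval <- data.frame(lo = c(0, 1, 2, 3), up = c(2, 3, 4, 5))
toy_y <- c(1, 3.5, 2.5, 4)

res <- interval_summary(toy_interval, toy_y)
print(res)  # coverage 0.75, mean_width 2
```

Applied to the output of rfinterval, the same helper could be mapped over the three interval components, for example with sapply(output[c("oob_interval", "sc_interval", "quantreg_interval")], interval_summary, y = test_data$y).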
Simulates data to illustrate the performance of prediction intervals for random forests.
sim_data(n = 500, p = 10, rho = 0.6, predictor_dist = "correlated", mean_function = "nonlinear-interaction", error_dist = "homoscedastic")
n |
Sample size |
p |
Number of features |
rho |
Correlation between predictors |
predictor_dist |
Distribution of predictors: "uncorrelated" or "correlated" |
mean_function |
Mean function: "linear", "nonlinear", or "nonlinear-interaction" |
error_dist |
Distribution of errors: "homoscedastic", "heteroscedastic", or "heavy-tailed" |
A data.frame of simulated data.
train_data <- sim_data(n = 500, p = 10)
test_data <- sim_data(n = 500, p = 10)
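The non-default arguments can be explored in the same way. The sketch below draws one data set for each documented error_dist value and compares the spread of the response. It is guarded so that it only runs when 'rfinterval' is installed. The sample size of 200 and the choice of a linear mean function are arbitrary, and the sketch assumes the response column is named y, as in the rfinterval example above.

```r
sim_sets <- NULL

if (requireNamespace("rfinterval", quietly = TRUE)) {
  error_dists <- c("homoscedastic", "heteroscedastic", "heavy-tailed")

  # One simulated data set per error distribution, other settings held fixed.
  sim_sets <- lapply(error_dists, function(d) {
    rfinterval::sim_data(n = 200, p = 10, rho = 0.6,
                         predictor_dist = "correlated",
                         mean_function = "linear",
                         error_dist = d)
  })
  names(sim_sets) <- error_dists

  # Standard deviation of the response under each error distribution.
  print(sapply(sim_sets, function(d) sd(d$y)))
}
```

Holding the predictors and mean function fixed while varying only error_dist makes it easier to attribute differences in interval coverage or width to the error distribution.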