Variable selection and importance in presence of high collinearity: an application to the prediction of lean body mass from multi-frequency bioelectrical impedance
In prediction problems, both the response and the covariates may be highly
correlated with a second group of influential regressors, which can be
considered as background variables. An important challenge is to
perform variable selection and importance assessment among the
covariates in the presence of these variables. A clinical example is
the prediction of the lean body mass (response) from bioimpedance
(covariates), where anthropometric measures play the role of background
variables. We introduce a reduced dataset in which the response and the
covariates are replaced by their residuals with respect to the background
variables, and perform variable selection and importance assessment in both
linear and random forest models. Using a clinical dataset of multi-frequency
bioimpedance, we show the effectiveness of this method
to select the most relevant predictors of the lean body mass beyond
anthropometry.
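
As an illustration of the residual construction described above, the following is a minimal sketch and not the pipeline used in this work. It assumes that the residuals come from an ordinary least-squares fit on the background variables with an intercept, and it covers only the linear case. The data are synthetic, the names `residualize`, `Z`, `X` and `y` are hypothetical, and the absolute coefficient is used only as a crude importance proxy for covariates on a comparable scale.

```python
import numpy as np


def residualize(background, target):
    """Return residuals of `target` after an OLS fit on `background` plus an intercept."""
    design = np.column_stack([np.ones(len(background)), background])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


# Synthetic stand-in data (assumption): Z plays the role of the background
# variables, X of the covariates, y of the response.
rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=(n, 2))
E = rng.normal(size=(n, 3))                  # covariate variation not explained by Z
X = 3.0 * (Z @ rng.normal(size=(2, 3))) + E  # covariates strongly driven by Z
y = Z @ np.array([2.0, -1.0]) + 1.0 * E[:, 0] + 0.1 * rng.normal(size=n)

# Reduced dataset: response and covariates with the background removed.
X_res = residualize(Z, X)
y_res = residualize(Z, y)

# Linear model on the reduced dataset; rank covariates by absolute coefficient.
beta, *_ = np.linalg.lstsq(X_res, y_res, rcond=None)
ranking = np.argsort(-np.abs(beta))
print("coefficients on residualized covariates:", np.round(beta, 3))
print("covariates ranked by |coefficient|:", ranking)
```

In this toy setting only the first covariate carries information about the response beyond the background, so it should rank first. A random forest could be fitted on `X_res` and `y_res` in the same way to obtain a nonlinear importance measure.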