On the use of Random Forest to impute categorical variables beyond the sample

Authors

  • Ilaria Bombelli Istat
  • Romina Filippini Istat
  • Simona Toti Istat

DOI:

https://doi.org/10.71014/sieds.v80i2.454

Abstract

The new paradigm of official statistical production is based on a system of registers resulting from the integration of administrative and survey data. Administrative data provide a complete enumeration of the units they cover; however, this population usually represents only a specific subset of the statistical target population, and such a subset is typically not obtained through a probabilistic sampling design. Similarly, survey data also cover only a subset of the population of interest, but they are made representative through the use of weights. Another common limitation of administrative sources is the delay in data availability.

In this context, generating a complete and consistent dataset is a critical task, which requires the implementation of specific procedures to account for delayed data and to impute missing values. One possible strategy is to use survey data as the source of the target variable. However, this raises the issue of how to properly incorporate the survey weights into the estimation model.

An example of a mass imputation approach is provided by the official estimates of the Attained Level of Education (ALE) adopted by the Italian National Institute of Statistics (Istat) for all the resident population in Italy. The official procedure is based on the estimation of different log-linear models.

In this application, we focus on the use of Random Forest (RF) to leverage the opportunities offered by Machine Learning (ML) techniques to exploit all available information, including longitudinal data and variables with many categories. This approach makes it possible to capture complex relationships between variables, which is often challenging to incorporate comprehensively using standard methods.

The results are evaluated under different scenarios, each corresponding to a level of available information, and across three population subsets, each characterized by a different pattern of available information.
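As a minimal sketch of the idea described above (toy data and variable names are ours, not the paper's), the core step, fitting an RF classifier on weighted survey records and imputing register units from the predicted class probabilities, can be written with scikit-learn's RandomForestClassifier, whose fit method accepts per-observation sample weights:

```python
# Hedged sketch: mass imputation of a categorical variable with a Random
# Forest, passing survey weights as observation weights. All data below
# are synthetic; the paper's actual variables and settings differ.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy "survey" data: predictors X, a categorical target y (e.g. an
# education-level code), and calibration weights w.
n = 500
X_survey = rng.normal(size=(n, 4))
y_survey = (X_survey[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
w_survey = rng.uniform(1, 10, size=n)  # survey weights

# Toy "register" units whose target value is missing.
X_register = rng.normal(size=(200, 4))

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_survey, y_survey, sample_weight=w_survey)  # weights enter training

# Impute either the modal class or a random draw from the predicted class
# probabilities; the latter better preserves the category distribution.
proba = rf.predict_proba(X_register)
imputed = np.array([rng.choice(rf.classes_, p=p) for p in proba])
```

Drawing from the predicted probabilities rather than taking the modal class is one common choice for imputation, since always predicting the mode tends to over-represent majority categories.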

References

BREIMAN L. 2001. Random Forests, Machine Learning, Vol. 45, No. 1, pp. 5–32.

BREIMAN L. 2002. Manual On Setting Up, Using, And Understanding Random Forests V3.1.

CHEN T., GUESTRIN C. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), pp. 785–794.

DE FAUSTI F., DI ZIO M., FILIPPINI R., TOTI S., ZARDETTO D. 2022. Multilayer perceptron models for the estimation of the attained level of education in the Italian Permanent Census, Statistical Journal of the IAOS, Vol. 38, No. 2, pp. 637–646.

DIETTERICH T.G. 2000. Ensemble Methods in Machine Learning. In Multiple Classifier Systems. MCS 2000. Lecture Notes in Computer Science, Vol. 1857. Springer, Berlin, Heidelberg.

DI ZIO M., FILIPPINI R., ROCCHETTI G. 2019. An imputation procedure for the Italian attained level of education in the register of individuals based on administrative and survey data. Rivista di Statistica Ufficiale, Vol. 2, No. 3, pp. 143–174.

LITTLE R.J.A., RUBIN D.B. 2002. Statistical Analysis with Missing Data. Wiley, New York.

PROBST P., WRIGHT M.N., BOULESTEIX A.-L. 2019. Hyperparameters and Tuning Strategies for Random Forest, WIREs Data Mining and Knowledge Discovery, Vol. 9, No. 3.

PROBST P., BOULESTEIX A., BISCHL B. 2019. Tunability: Importance of Hyperparameters of Machine Learning Algorithms, Journal of Machine Learning Research, Vol. 20, pp. 1–32.

RUMELHART D. E., HINTON G. E., WILLIAMS R. J. 1986. Learning representations by back-propagating errors. Nature, Vol. 323, No. 6088, pp. 533–536.

WRIGHT M.N., ZIEGLER A. 2017. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R, Journal of Statistical Software, Vol. 77, No. 1, pp. 1–17.

Published

2026-02-19

Section

Articles