Optimizing stratification with data-driven approach: a case study

Authors

  • Ilaria Bombelli, Istat
  • Giorgia Sacco, Istat

DOI:

https://doi.org/10.71014/sieds.v80i3.431

Abstract

The stratified sampling design is helpful in official statistics to guarantee the accuracy and efficiency of survey estimates. Two of the most important steps in this procedure are the choice of suitable stratification variables and the subsequent determination of the strata. When the stratification variables are continuous, they must first be converted into categorical variables whose classes define the strata.

The quality of the stratification depends heavily on how these continuous variables are divided, that is, on the choice of intervals or class boundaries. Class intervals are sometimes predetermined based on established procedures or prior knowledge. In other cases, however, the researcher must determine the best partition into classes based on the population's characteristics and the survey's goals, which can be a difficult task.

The R package SamplingStrata provides useful assistance when the researcher is free to define the strata. The package is based on a genetic algorithm and offers a data-driven approach for determining the best stratification boundaries, thereby maximizing sampling design efficiency while satisfying precision requirements.

In this study, we apply SamplingStrata to a household survey dataset and illustrate the advantages of using the package to guide the stratification process, especially when working with continuous auxiliary variables. The findings show that, compared with conventional techniques based on arbitrary or fixed class definitions, data-driven stratification can result in more efficient sample allocations.
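
The effect of class boundaries on efficiency can be illustrated with a minimal Python sketch. It is not the genetic algorithm implemented in SamplingStrata and it does not use the household survey data of the study. It assumes a synthetic, right-skewed population, a single continuous stratification variable that is also the target variable, three strata, Neyman allocation without finite population correction, and a hypothetical target coefficient of variation of 2%. The function name `neyman_sample_size` and the plain grid search over quantile-based candidate boundaries are illustrative choices only.

```python
import bisect
import itertools
import random
import statistics


def neyman_sample_size(values, boundaries, target_cv):
    """Total sample size needed under Neyman allocation to reach a target CV
    of the estimated mean (finite population correction ignored)."""
    cuts = sorted(boundaries)
    strata = [[] for _ in range(len(cuts) + 1)]
    for v in values:
        # A value equal to a cut point falls in the lower class.
        strata[bisect.bisect_left(cuts, v)].append(v)
    n_pop = len(values)
    # Sum over strata of W_h * S_h (population standard deviation used for S_h).
    weighted_sd = sum(len(s) / n_pop * statistics.pstdev(s) for s in strata if s)
    target_var = (target_cv * statistics.fmean(values)) ** 2
    return weighted_sd ** 2 / target_var


# Synthetic skewed population (an assumption, e.g. an income-like variable).
random.seed(0)
population = [random.lognormvariate(10, 0.8) for _ in range(2000)]

# Conventional choice: three equal-width classes over the observed range.
lo, hi = min(population), max(population)
fixed = [lo + (hi - lo) / 3, lo + 2 * (hi - lo) / 3]
fixed_n = neyman_sample_size(population, fixed, target_cv=0.02)

# Data-driven choice: start from the fixed classes and keep any pair of
# candidate boundaries (5%, 10%, ..., 95% quantiles) that needs fewer units.
ordered = sorted(population)
candidates = [ordered[len(ordered) * p // 100] for p in range(5, 100, 5)]
best, best_n = fixed, fixed_n
for pair in itertools.combinations(candidates, 2):
    n = neyman_sample_size(population, pair, target_cv=0.02)
    if n < best_n:
        best, best_n = list(pair), n

print(f"fixed equal-width classes: n = {fixed_n:.0f}")
print(f"searched boundaries:       n = {best_n:.0f}")
```

Because the search starts from the fixed classes and only accepts improvements, the sample size it reports can never exceed that of the fixed classes. A genetic algorithm plays the same role when several stratification and target variables make an exhaustive search impractical.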

References

BARCAROLI G., BALLIN M., ODENDAAL H., PAGLIUCA D., WILLIGHAGEN E., ZARDETTO D. 2020. SamplingStrata: optimal stratification of sampling frames for multipurpose sampling surveys, R package.

BARCAROLI G. 2014. SamplingStrata: An R package for the optimization of stratified sampling. Journal of Statistical Software, Vol. 61, pp. 1–24.

BARCAROLI G., FASULO A., GUANDALINI A., TERRIBILI M. D. 2023. Two Stage Sampling Design and Sample Selection with the R Package R2BEAT. The R Journal, Vol. 15, No. 3, pp. 191–213.

BETHEL J. 1989. Sample allocation in multivariate surveys. Survey Methodology, Vol. 15, No. 1, pp. 47–57.

COCHRAN W. G. 1977. Sampling techniques. John Wiley & Sons.

FALORSI P. D., BALLIN M., DE VITIIS C., SCEPI G. 1998. Principi e metodi del software generalizzato per la definizione del disegno di campionamento nelle indagini sulle imprese condotte dall’ISTAT. Statistica Applicata, Vol. 10, No. 2, pp. 235–257.

KHAN M. G. M., NIRAJ N., NURAIN A. 2008. Determining the optimum strata boundary points using dynamic programming. Survey Methodology, Vol. 34, pp. 205–214.

KHAN M. G., SHARMA S. 2015. Determining optimum strata boundaries and optimum allocation in stratified sampling. Aligarh Journal of Statistics, Vol. 35, pp. 23–40.

LAVALLÉE P. 1988. Two-way optimal stratification using dynamic programming. Proc. Sect. Surv. Res. Methods, pp. 646–651.

NEYMAN J. 1934. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, Vol. 97, No. 4, pp. 558–625.

SÄRNDAL C. E., SWENSSON B., WRETMAN J. 2003. Model assisted survey sampling. Springer Science & Business Media.

TSCHUPROW A. A. 1923. On the mathematical expectation of the moments of frequency distributions in the case of correlated observations. Metron, Vol. 2, pp. 646–683.

WRIGHT T. 2014. A simple method of exact optimal sample allocation under stratification with any mixed constraint patterns. Statistics, Vol. 7.

Published

2026-02-26

Section

Articles