Machine Learning, Regularization, and the Heckman Model

Supervisor: Emmanuel Ogundimu

Project’s research area: Statistics / Data Science / Statistical Learning

General description

Sample selection occurs when the outcome of interest is observed for only part of the study sample. It is typically handled as a two-stage process. In the first stage, a binary mechanism determines whether an observation is available for the second stage. In the second stage, the analysis of the outcome is adjusted for potential selection bias. For example, when someone applies for a loan, the bank uses the applicant’s attributes to decide whether to grant or reject the request. If the request is accepted, the bank can then observe the loan’s performance over time. The two stages are therefore the credit-granting stage (accept or reject) and the loan-performance stage (default or non-default). If a model is developed using only the applicants accepted at the credit-granting stage, the resulting sample may not be a random sample from the target population of all applicants, and this can lead to selection bias. The Heckman selection model is often used to correct for selection bias in this setting.
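The two-stage structure described above can be written compactly. The following is the standard textbook form of the Heckman selection model for a continuous outcome; the symbols (S, Y, w, x, gamma, beta, rho, sigma) are notation chosen here for illustration and do not come from the project description. For a binary outcome such as default or non-default, the outcome equation would instead be a second probit equation.

```latex
\begin{align*}
S_i^{*} &= w_i^{\top}\gamma + u_i, \qquad S_i = \mathbb{1}\{S_i^{*} > 0\}
  && \text{(selection stage)}\\
Y_i^{*} &= x_i^{\top}\beta + \varepsilon_i, \qquad Y_i = Y_i^{*} \text{ observed only if } S_i = 1
  && \text{(outcome stage)}\\
\begin{pmatrix} u_i \\ \varepsilon_i \end{pmatrix}
  &\sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
  \begin{pmatrix} 1 & \rho\sigma \\ \rho\sigma & \sigma^{2} \end{pmatrix} \right)\\
E\left[\,Y_i \mid S_i = 1, x_i, w_i\,\right]
  &= x_i^{\top}\beta + \rho\sigma\,\lambda\!\left(w_i^{\top}\gamma\right),
  \qquad \lambda(z) = \frac{\phi(z)}{\Phi(z)}
\end{align*}
```

The last line shows where the bias comes from: whenever rho is non-zero, a regression fitted to the selected observations alone omits the term involving the inverse Mills ratio lambda.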

Objectives

This project aims to develop and evaluate Regularization and Machine Learning techniques, such as the LASSO (least absolute shrinkage and selection operator) and Random Forest, for variable selection and improved predictive accuracy when the data are subject to sample selection. Potential directions for this project include:

Developing and studying a new LASSO estimator based on an approximation of the L1-norm, together with post-selection inference.
Exploring variable selection through group-wise penalties for flexible covariate effect structures.
Extending other penalty functions, such as MCP (minimax concave penalty) and SCAD (smoothly clipped absolute deviation), to the Heckman model.
Investigating the use of copulas for non-Gaussian marginal distributions, including binary outcomes with few events.
Applying the models to missing-data imputation and to datasets of interest in credit scoring and health research.
Refining and adapting existing Machine Learning approaches for the Heckman model.
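For orientation, the penalties named in the directions above can be written in one generic form. This is a sketch in notation chosen here, not the project's specific estimator: ell(theta) stands for the Heckman log-likelihood, lambda is the tuning parameter (with any sample-size scaling absorbed into it), and penalizing every component of theta is an assumption made for brevity. In practice one would usually penalize only the regression coefficients and leave the variance and correlation parameters unpenalized.

```latex
\begin{align*}
\hat{\theta} &= \arg\min_{\theta}\Big\{ -\ell(\theta) + \sum_{j=1}^{p} p_{\lambda}\big(|\theta_j|\big) \Big\}\\[4pt]
\text{LASSO:}\quad p_{\lambda}(t) &= \lambda t\\[4pt]
\text{SCAD } (a > 2):\quad p_{\lambda}(t) &=
\begin{cases}
\lambda t, & t \le \lambda,\\[2pt]
\dfrac{2a\lambda t - t^{2} - \lambda^{2}}{2(a-1)}, & \lambda < t \le a\lambda,\\[8pt]
\dfrac{(a+1)\lambda^{2}}{2}, & t > a\lambda,
\end{cases}\\[4pt]
\text{MCP } (a > 1):\quad p_{\lambda}(t) &=
\begin{cases}
\lambda t - \dfrac{t^{2}}{2a}, & t \le a\lambda,\\[8pt]
\dfrac{a\lambda^{2}}{2}, & t > a\lambda.
\end{cases}
\end{align*}
```

All three penalties behave like lambda t near zero, which is what produces sparse solutions. SCAD and MCP level off for large t, so large coefficients are shrunk less than under the LASSO.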

Mode of Operation and Evidence of Learning

The project will operate through a combination of reading, mathematical derivation, and programming in R. The student will begin by engaging with the foundational literature on sample selection models and penalized regression, building a working understanding of the Heckman model and how regularization methods can be adapted to this setting. The project will then move into implementation, with the student developing and testing methods through simulation studies and, where appropriate, application to real datasets in credit scoring or health research.
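As an indication of what a first simulation study might look like, the sketch below generates data from a simple selection model, fits a least-squares line to the selected observations only, and compares it with a fit in which the selection term is subtracted using the true parameter values. It is written in Python with only the standard library so that it runs anywhere; the project itself would work in R. All parameter values, function names, and the "oracle" correction are illustrative assumptions. In a real analysis the selection parameters and the correlation are unknown and must be estimated.

```python
import math
import random


def inverse_mills(z):
    """Inverse Mills ratio lambda(z) = phi(z) / Phi(z) for the standard normal."""
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * math.erfc(-z / math.sqrt(2.0))  # erfc stays accurate in the lower tail
    return pdf / cdf


def ols_line(x, y):
    """Return (intercept, slope) of the simple least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def simulate(n=50_000, rho=0.8, seed=1):
    """Simulate selected data; return naive and oracle-adjusted (intercept, slope)."""
    rng = random.Random(seed)
    beta0, beta1 = 1.0, 2.0    # outcome equation (illustrative values)
    gamma0, gamma1 = 0.0, 1.0  # selection equation (illustrative values)
    xs, ys, zs = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        u = rng.gauss(0.0, 1.0)
        # eps has unit variance and correlation rho with u
        eps = rho * u + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        z = gamma0 + gamma1 * x
        if z + u > 0:  # the outcome is observed only for selected units
            xs.append(x)
            ys.append(beta0 + beta1 * x + eps)
            zs.append(z)
    naive = ols_line(xs, ys)
    # Oracle correction: subtract rho * lambda(z) using the TRUE rho and gamma.
    adjusted = ols_line(xs, [y - rho * inverse_mills(z) for y, z in zip(ys, zs)])
    return naive, adjusted


if __name__ == "__main__":
    naive, adjusted = simulate()
    print("naive    (intercept, slope):", naive)
    print("adjusted (intercept, slope):", adjusted)
```

With these settings the naive slope falls clearly below the true value of 2, while the adjusted fit recovers an intercept and slope close to 1 and 2. A full study would replace the oracle step with an estimated correction and repeat the experiment over many replications and parameter settings.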

Evidence of learning will include: the ability to explain the statistical basis of sample selection models and the rationale for regularization; implementation of penalized estimation or machine learning methods in R; the design and interpretation of simulation studies to assess method performance; and a clear, well-structured written dissertation supported by figures, tables, and code.

Prerequisites/skills

MATH1061: Calculus I, MATH1071: Linear Algebra I, Programming in R.

References

Cook, JA., Siddiqui, S. (2020). Random forests and selected samples. Bull Econ Res. 72: 272–287.
Ogundimu, EO. (2022). Regularization and variable selection in Heckman selection model. Statistical Papers. 63(2): 421–439.
Ogundimu, EO. (2022). On Lasso and adaptive Lasso for non-random sample in credit scoring. Statistical Modelling.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B. 58(1): 267–288.