Why is it important to revisit the fundamentals of any field and ask what its pioneers would be thinking today?
A central challenge in supervised machine learning is generalizing model performance out of sample (a.k.a. the "bias-variance tradeoff").
L_2/L_1 regularization is often used toward this goal. However, L_2/L_1 regularization has its origins in Tikhonov regularization rather than in information theory.
With Tyler Ward (NYU), we published a new, information-theoretic approach to regularization, which appears in the special issue on Machine Learning and Information Theory in Entropy.
The key idea is to build on earlier pioneering work of the Japanese mathematicians Akaike and Takeuchi, well known to econometricians and statisticians for their information criteria (AIC and TIC).
We present a new parameter estimation approach that uses these criteria directly in supervised learning, with proofs of their error-generalization properties and computational modifications to make them practical.
There is also R code to reproduce most of the numerical examples in the paper.
Modern computational models in supervised machine learning are often highly parameterized universal approximators. As such, the values of the parameters themselves are unimportant, and only the out-of-sample performance is considered.
On the other hand, much of the literature on model estimation assumes that the parameters themselves have intrinsic value, and is thus concerned with the bias and variance of parameter estimates, which may bear no simple relationship to out-of-sample model performance.
Therefore, within supervised machine learning, heavy use is made of ridge regression (i.e., L2 regularization), which requires the estimation of hyperparameters and can be rendered ineffective by certain model parameterizations.
We introduce an objective function, which we refer to as Information-Corrected Estimation (ICE), that reduces KL-divergence-based generalization error in supervised machine learning.
ICE attempts to directly maximize a corrected likelihood function as an estimator of the out-of-sample KL divergence. This approach is proven, theoretically, to be effective for a wide class of models, under only mild regularity restrictions.
Under finite sample sizes, this corrected estimation procedure is shown experimentally to significantly reduce generalization error compared to maximum likelihood estimation and L2 regularization.
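To make the idea concrete, here is a rough sketch (in Python rather than the paper's R, and not the paper's exact construction) of a TIC-style corrected objective for logistic regression: the average negative log-likelihood is penalized by tr(J⁻¹ I)/n, where I is the empirical Fisher information (outer products of per-sample scores) and J is the Hessian of the average negative log-likelihood. The data, the model, and the specific form of the penalty here are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic logistic-regression data (illustrative only).
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5, 0.25])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

def nll(beta):
    # Average negative log-likelihood, written stably via logaddexp.
    z = X @ beta
    return np.mean(np.logaddexp(0.0, z) - y * z)

def ice_objective(beta):
    z = X @ beta
    p_hat = 1 / (1 + np.exp(-z))
    g = (p_hat - y)[:, None] * X            # per-sample score vectors
    I_hat = g.T @ g / n                     # empirical Fisher information
    w = p_hat * (1 - p_hat)
    J_hat = (X * w[:, None]).T @ X / n      # Hessian of the average NLL
    # TIC-style correction added to the average NLL.
    corr = np.trace(np.linalg.solve(J_hat, I_hat)) / n
    return nll(beta) + corr

beta_mle = minimize(nll, np.zeros(p)).x
beta_ice = minimize(ice_objective, np.zeros(p)).x
```

Unlike ridge regression, this penalty has no hyperparameter to tune: the correction term is computed from the data itself and shrinks as the sample size grows.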
Matthew Francis Dixon is a researcher and innovator in mathematical algorithms for prediction, outlier detection, and risk.
He is the author of a Machine Learning in Finance textbook and several journal papers on machine learning algorithms and models and on blockchain-based technologies with applications in fintech, and a member of the CFA NY Quant Investing Committee.
Matthew is also Editorial Associate for the AIMS Journal of Dynamics & Games, Deputy Editor of the Journal of Machine Learning in Finance, and Co-Founder of Digital Bank Technologies. He chaired the IEEE/ACM Workshop on High Performance Computational Finance (2010-2015) and served as a Google Summer of Code Mentor for the R Statistical Computing Project in 2017. He received the College of Computing Dean's Excellence in Research Award (Junior level) in 2021.