**Machine Learning Challenges**

Machine learning combines computer science, mathematics, and statistics to build programs that learn from data and infer relationships within it. Although machine learning is very popular in today's financial markets, applying it to financial data raises many challenges.

In my experience, the most challenging part is that financial data is very hard to handle.

First, financial data often contains missing values, spurious data points, or outliers. For example, some series report values on holidays, when no trading occurs. It is therefore necessary to clean the data before analysis, and because every data set has its own cleaning issues, this step adds considerable time to any analysis.
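To make this concrete, here is a minimal cleaning sketch with pandas, using a hypothetical daily price series that has an out-of-order row, a weekend observation, and a missing value (the column name and dates are invented for illustration):

```python
import pandas as pd

# Hypothetical daily closes: one row out of order, one Saturday row, one missing value.
prices = pd.DataFrame(
    {"close": [100.0, None, 101.5, 99.8, 102.0]},
    index=pd.to_datetime(
        ["2023-01-02", "2023-01-03", "2023-01-04", "2023-01-07", "2023-01-05"]
    ),
)

prices = prices.sort_index()                    # restore chronological order
prices = prices[prices.index.dayofweek < 5]     # drop weekend rows (no trading)
prices = prices.ffill()                         # carry the last observation forward
```

Real cleaning would also need an exchange holiday calendar and outlier checks; the point is that each of these steps is data-set specific.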

Second, the scales of different features can vary widely, which affects the results of many algorithms, so normalizing the data is often necessary.
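A minimal sketch of z-score normalization, using two hypothetical features on very different scales (an index level and a daily return):

```python
import numpy as np

# Hypothetical feature matrix: column 0 is a price level, column 1 is a return.
X = np.array([[1500.0,  0.01],
              [1520.0, -0.02],
              [1480.0,  0.03]])

# Z-score normalization: subtract each column's mean, divide by its std.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```

After this step every column has mean zero and unit standard deviation, so no single feature dominates purely because of its units.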

Third, financial data is non-stationary: its statistical properties drift over time, and the series carry a great deal of noise, which makes the data less informative.

Based on my own experience, WTI crude oil futures prices are quite noisy, and that noise inflates model variance. Even in a data-rich setting, or with techniques to reduce the noise, the residual noise still degrades the performance of machine learning algorithms.
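One crude noise-reduction technique is a moving average. The sketch below uses a simulated series (a slow trend plus Gaussian noise, standing in for a noisy futures price) rather than real WTI data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated "price": slow upward trend plus substantial noise.
prices = np.linspace(70, 80, 200) + rng.normal(0, 2.0, 200)

# Simple 20-day moving average as a basic noise filter.
window = 20
smoothed = np.convolve(prices, np.ones(window) / window, mode="valid")
```

Averaging shrinks the noise variance by roughly the window length, at the cost of lag; that lag is itself a problem for trading signals, as discussed below.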

Besides the three challenges above, I have run into other data issues. In linear regression, the explanatory variables may be collinear; PCA can reduce the dimension, but the resulting components lose their economic interpretation.
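The collinearity point can be illustrated with PCA computed from an SVD, on two hypothetical predictors that are nearly identical (say, two closely related interest rates):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two nearly collinear hypothetical predictors.
x1 = rng.normal(0, 1, 500)
x2 = x1 + rng.normal(0, 0.05, 500)
X = np.column_stack([x1, x2])

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)   # variance share of each principal component
```

Almost all the variance loads on the first component, so one dimension suffices, but that component is a blend of both inputs and no longer has the direct meaning of either rate.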

When dealing with time series data, the data we receive is often lagged, which can make it impossible to generate the one-step-ahead prediction needed for a trading signal. We may also face a shortage of data, or data of such poor quality that analysis is infeasible.
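Handling lags correctly usually means shifting the features so that the prediction for day t uses only information available by day t-1. A minimal sketch with pandas, on hypothetical returns:

```python
import pandas as pd

# Hypothetical daily returns; a signal for day t may only use data through day t-1.
returns = pd.Series([0.01, -0.02, 0.03, 0.00, 0.01],
                    index=pd.date_range("2023-01-02", periods=5, freq="B"))

frame = pd.DataFrame({
    "target": returns,           # what we want to predict at day t
    "lag1": returns.shift(1),    # yesterday's return: known at day t
    "lag2": returns.shift(2),    # the return two days back
}).dropna()                      # the first rows lack a full lag history
```

Forgetting the shift quietly leaks future information into the features, which is one of the most common ways a backtest looks better than live trading.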

The data problems described above also make it harder to apply machine learning algorithms to financial data.

Machine learning algorithms are designed to achieve specific goals.

Each algorithm has its own advantages and disadvantages, but the key issue is always the bias-variance tradeoff. A simple model tends to have high bias and low variance; it is likely to underfit and miss relevant relationships between the predictors and the response.

An overly complex model, such as a degree-1000 polynomial regression, has low bias and high variance; it will overfit, achieving nearly zero training error but extremely high test error. It is therefore important to strike a balance between bias and variance.

Different models might have different bias and variance on the same data set.

This raises a further challenge: which model should we use?

There are many techniques for model evaluation and model checking; cross-validation is a good way to estimate prediction error and so helps us decide which model to use.
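A minimal hand-rolled k-fold cross-validation sketch, comparing two polynomial models on simulated data whose true relationship is linear (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 100)
y = 2.0 * x + rng.normal(0, 0.5, 100)   # hypothetical linear relationship

def cv_mse(x, y, degree, k=5):
    """Estimate out-of-sample MSE of a degree-d polynomial fit via k-fold CV."""
    idx = np.arange(len(x))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # everything outside the held-out fold
        coef = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coef, x[fold])
        errors.append(np.mean((pred - y[fold]) ** 2))
    return np.mean(errors)

mse_linear = cv_mse(x, y, degree=1)
mse_wiggly = cv_mse(x, y, degree=12)
```

Cross-validation correctly prefers the linear model here; for time series, the folds would need to respect time order (train on the past, validate on the future) rather than being split arbitrarily.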

If we are comparing time series models with different numbers of parameters, we can also check the log-likelihood and AIC, or apply statistical tests to decide which model is better.

For example, in a project where I estimated the parameters of several GARCH-type models to predict volatility, I used the Schwarz Bayesian criterion (BIC) to compare likelihoods and choose the best model.
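Given a fitted log-likelihood, the criteria are simple to compute. The sketch below uses invented log-likelihoods and parameter counts for two hypothetical GARCH-type fits, purely to show the formulas (AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L):

```python
import numpy as np

# Hypothetical fitted models: log-likelihood and number of parameters, n observations.
n = 1000
models = {
    "GARCH(1,1)": {"loglik": -1450.0, "k": 4},
    "GJR-GARCH":  {"loglik": -1448.5, "k": 5},
}

for name, m in models.items():
    m["aic"] = 2 * m["k"] - 2 * m["loglik"]
    m["bic"] = np.log(n) * m["k"] - 2 * m["loglik"]  # Schwarz Bayesian criterion
```

With these made-up numbers the criteria actually disagree: AIC slightly prefers the larger model while BIC, which penalizes parameters more heavily, prefers the smaller one. Lower values are better in both cases.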

In practice, many people prefer simple models, such as linear models, to avoid overfitting and to preserve interpretability in the analysis.

Linear models are also a key building block in other areas. For example, in an active portfolio management project, I computed a residual reversion signal by running a linear regression and using its residuals to construct the signal.
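A minimal sketch of that idea, with simulated data standing in for the real project: regress a stock's returns on a market factor, take the residuals, and bet against the latest one (the factor structure, sign convention, and parameters here are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
# Simulated returns: a stock with beta 1.2 to a market factor plus idiosyncratic noise.
market = rng.normal(0, 0.01, 250)
stock = 1.2 * market + rng.normal(0, 0.005, 250)

# OLS via least squares: stock_t = alpha + beta * market_t + eps_t
A = np.column_stack([np.ones_like(market), market])
coef, *_ = np.linalg.lstsq(A, stock, rcond=None)
residuals = stock - A @ coef

# Reversion signal: bet against the most recent residual (hypothetical convention).
signal = -residuals[-1]
```

The residual is the part of the stock's move not explained by the market, which is the component a reversion strategy expects to mean-revert.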

However, linear models are generally weak learners. To obtain more accurate predictions from linear or other simple models, we may need additional techniques.

Two such techniques are bagging and boosting. In bagging, we bootstrap the training data to obtain, say, k training sets, fit k models, and average their outputs to produce the final prediction.
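A minimal bagging sketch in numpy, using polynomial regressions as the base model on simulated data (base model and parameters chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 80)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 80)   # noisy target

# Bagging: fit k models on bootstrap resamples and average their predictions.
k, degree = 50, 6
x_grid = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * x_grid)
preds = np.zeros((k, x_grid.size))
for i in range(k):
    idx = rng.integers(0, len(x), len(x))            # bootstrap: sample with replacement
    coef = np.polyfit(x[idx], y[idx], degree)
    preds[i] = np.polyval(coef, x_grid)

bagged = preds.mean(axis=0)                          # averaged (bagged) prediction
```

By construction, the averaged prediction cannot have a higher mean squared error than the average error of the individual bootstrap fits; the gap is exactly the variance that bagging removes.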

Bagging can reduce variance and avoid overfitting.

Random forest is an advanced version of bagging: at each split, each tree considers only a random subset of the predictors. Without this restriction, a few very strong predictors would dominate every tree, leaving the trees highly correlated and limiting the variance reduction from averaging.

Boosting is like a weighted version of bagging, but it uses a single training set rather than k bootstrap samples, and the regressors or classifiers are generated sequentially, each one focusing on the errors left by its predecessors. Boosting can reduce bias, but it can also overfit.
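The sequential idea can be sketched with L2 boosting of regression stumps: each round fits a one-split model to the residuals of the current ensemble (the base learner, learning rate, and round count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 100)

def fit_stump(x, y):
    """Find the split threshold minimizing squared error; return (t, left_mean, right_mean)."""
    best = None
    for t in x[1:]:
        left, right = y[x < t], y[x >= t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

# L2 boosting: each stump is fit to the residuals of the ensemble so far.
lr, n_rounds = 0.5, 50
pred = np.zeros_like(y)
for _ in range(n_rounds):
    t, lm, rm = fit_stump(x, y - pred)       # target the remaining error
    pred += lr * np.where(x < t, lm, rm)     # shrink each step by the learning rate

train_mse = np.mean((y - pred) ** 2)
```

A single stump has enormous bias; fifty boosted stumps track the sine curve closely. The risk is the flip side: with enough rounds the ensemble starts fitting the noise, which is why boosting is usually stopped early or regularized.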

In conclusion, I think the challenges of using machine learning for investment management fall into two classes: problems with the financial data itself, and challenges with the algorithms. The two also influence each other.