Machine learning is no longer a buzzword, least of all in financial modelling. Fraud prevention, algorithmic trading, digital assistants and risk management are some of the areas where machine learning has found its niche. Applying ML models in finance is unlike applying them in any other industry, because the decisions they drive either pay dividends immediately or fail catastrophically.
Many financial companies rely on data engineering, statistics and visualization tools to meet their needs. With machine learning, models can be retrained repeatedly until the best solution emerges. That said, there is no universal machine learning solution for every business problem.
The graphic above illustrates the confidence of financial institutions in adopting machine learning methodologies. Speaking of methodologies, there are many statistical methods and frameworks that help in building models. For instance, validation techniques are widely used to assess the accuracy of models, with cross-validation (CV) being the most popular choice.
Cross-validation is used to estimate the generalization error of a machine learning algorithm and thereby prevent overfitting. In financial models, however, overfitting still takes place and can go undetected by CV. Moreover, hyper-parameter tuning can itself contribute to overfitting. An overfit model can therefore pass validation despite fundamental errors, while its forecasting power is reduced to nil.
A Quick Recap Of CV
CV splits the dataset into two sets: a training set and a testing set. Each observation in the complete dataset belongs to one, and only one, set. This is done so as to prevent leakage from one set into the other, since that would defeat the purpose of testing on unseen data.
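As a minimal sketch, such a disjoint split can be produced with scikit-learn's `train_test_split` (the toy data here is illustrative; `shuffle=False` preserves time order, which matters for financial series):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 observations, 2 features
y = np.arange(10) % 2             # toy binary labels

# shuffle=False keeps the chronological order of observations
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False
)

# each observation belongs to one, and only one, set
assert len(X_train) + len(X_test) == len(X)
```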
There are many alternative CV schemes, of which one of the most popular is k-fold CV. It works as follows:
- The dataset is partitioned into k subsets.
- For i = 1, …, k:
  - The ML algorithm is trained on all subsets excluding subset i.
  - The fitted ML algorithm is tested on subset i.
The outcome of k-fold CV is a k x 1 array of cross-validated performance metrics. For example, for a binary classifier, the model is deemed to have learned something if the cross-validated accuracy exceeds 1/2, which is more than we would achieve by tossing a fair coin.
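The procedure above can be sketched with scikit-learn's `KFold` and `cross_val_score`; the synthetic dataset is only illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=5)  # k = 5 disjoint subsets
# one accuracy per fold: a k x 1 array of performance metrics
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

assert scores.shape == (5,)
# has the binary classifier learned more than a fair coin toss?
learned_something = scores.mean() > 0.5
```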
How Well Does CV Fare With Finance?
One reason k-fold CV fails in finance is that observations cannot be assumed to be drawn from an IID (independent and identically distributed) process.
A second reason for CV’s failure is that the testing set is used multiple times in the process of developing a model, leading to multiple testing and selection bias.
So, when there is an overlap between the training and testing sets, some information leaks between them. Such leakage is especially problematic in the presence of irrelevant features, as it leads to false discoveries.
Problems With Sklearn’s Cross-Validation
Scikit-learn is the most popular ML library for implementing cross-validation. One of the many upsides of open-source code is that you can verify everything and adjust it to your needs. In Advances in Financial Machine Learning, Marcos Lopez de Prado lists the following two problems with sklearn:
- Scoring functions do not know classes_, as a consequence of sklearn’s reliance on numpy arrays rather than pandas series: https://github.com/scikit-learn/scikit-learn/issues/6231
- cross_val_score will give inconsistent results because it passes sample weights to the fit method, but not to the log_loss method: https://github.com/scikit-learn/scikit-learn/issues/9144
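One workaround, in the spirit of the custom scorer Lopez de Prado proposes, is to loop over the folds manually and pass the sample weights to both `fit` and `log_loss`. This is a hedged sketch, not sklearn's API; `cv_score` and the toy data are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

def cv_score(clf, X, y, sample_weight, cv):
    """Cross-validated negative log-loss, with weights applied consistently."""
    scores = []
    for train, test in cv.split(X):
        # pass the weights to fit() ...
        fit = clf.fit(X[train], y[train], sample_weight=sample_weight[train])
        prob = fit.predict_proba(X[test])
        # ... and to log_loss(), unlike cross_val_score
        scores.append(-log_loss(y[test], prob,
                                sample_weight=sample_weight[test],
                                labels=clf.classes_))
    return np.array(scores)

X, y = make_classification(n_samples=100, random_state=0)
w = np.ones(len(y))  # illustrative weights; in practice, e.g. label uniqueness
scores = cv_score(LogisticRegression(max_iter=1000), X, y, w, KFold(n_splits=5))
```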
How To Reduce Leakage
- Drop from the training set any observation i whose label Yi is a function of information also used to determine Yj, where j belongs to the testing set.
- Avoid overfitting the classifier, for instance through:
  - early stopping of the base estimators;
  - bagging of classifiers, while controlling for oversampling on redundant examples, so that the individual classifiers are as diverse as possible:
    - set the average uniqueness;
    - apply sequential bootstrap.
An Alternative In The Form Of Purged K-Fold CV
One way to reduce leakage is to remove from the training set all observations whose labels overlap in time with the labels included in the testing set, a procedure called "purging."
If no training observations occur between the first and last testing observation, purging can be accelerated by representing the testing set as a pandas series with a single item spanning the entire testing period.
The larger the number of testing splits, the greater the number of overlapping observations in the training set. In many cases, purging is enough to prevent leakage. The performance will improve when the model is allowed to recalibrate more often.
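The purging idea can be sketched as follows. This is a simplified, hedged adaptation of the purged k-fold scheme described in Advances in Financial Machine Learning, not the book's exact code; `purged_kfold_indices`, the series `t1` (label start times in the index, label end times as values), and the toy data are all illustrative:

```python
import numpy as np
import pandas as pd

def purged_kfold_indices(t1, n_splits=3):
    """Yield (train, test) integer index arrays for contiguous test folds,
    purging training observations whose labels overlap the test set in time.

    t1 : pd.Series indexed by observation start time, values = label end time.
    """
    indices = np.arange(len(t1))
    for test in np.array_split(indices, n_splits):
        t0_test = t1.index[test[0]]    # first test observation starts here
        t1_test = t1.iloc[test].max()  # last test label ends here
        # keep only training observations whose label ends before the test
        # set starts, or which start after the last test label ends
        train = indices[(t1.values < t0_test) | (t1.index > t1_test)]
        yield train, test

# toy example: each label spans 2 days, so neighbouring labels overlap
idx = pd.date_range("2020-01-01", periods=10)
t1 = pd.Series(idx + pd.Timedelta(days=2), index=idx)

for train, test in purged_kfold_indices(t1):
    assert set(train).isdisjoint(test)  # purged folds never overlap
```

Note that the purged training sets are smaller than the naive k-fold ones: the observations adjacent to each test fold are dropped precisely because their labels leak information across the boundary.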
The number of open-source machine learning algorithms and tools for curating financial data is growing fast. And with financial institutions' increasing interest in AI, the funds allocated will grow, which in turn will enable more methodologies to be developed. The advantage of this industry is its quantitative nature and its large repository of historical data, which is exactly what a machine learning model needs. Neglecting these advancements and latching on to conventional methods will prove costly in the future.