The availability of data and increased computational power have been the biggest drivers of artificial intelligence. Google’s TensorFlow played a huge role in revolutionising machine learning, as it allows developers to build neural networks without implementing every underlying operation themselves. It supports multiple languages, so developers can create ML models in Python and use them easily from other languages as well.
This article is based on Andrew Ng’s free ebook Machine Learning Yearning, in which he gives technical direction for machine learning projects. One of the key aspects he discusses is setting up the development and test sets.
In the book, Ng discusses what happens when a team builds a classifier from collected data and then deploys it in an app. For example, you assemble a large training set by downloading pictures of cats (positive examples) and non-cats (negative examples) from different websites. The dataset is then split 70/30 into training and test sets. Using this data, one builds a cat detector that works well on the training and test sets. But when this classifier is deployed in a mobile app, its performance is poor, because the mobile phone images come from a different distribution than the website images it was trained on.
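The 70/30 split described above can be sketched in a few lines. This is a minimal illustration, with the dataset and function names as assumptions rather than Ng's code:

```python
import random

def split_train_test(examples, test_fraction=0.3, seed=42):
    """Shuffle the examples and split them into train and test sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Illustrative data: 1,000 (filename, is_cat) pairs downloaded from websites
data = [("img_%d.jpg" % i, i % 2 == 0) for i in range(1000)]
train_set, test_set = split_train_test(data)
print(len(train_set), len(test_set))  # 700 300
```

Note that both splits here come from the same pool of website images, which is exactly why the resulting test set says nothing about performance on mobile phone photos.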
Setting Up Development And Test Sets
Ng emphasises that working on machine learning applications is hard enough on its own, and mismatched development and test sets add uncertainty about whether improvements on the development set distribution will also improve test set performance. As a lesson for beginners, he states that mismatched development and test sets make it harder to figure out what is and isn’t working.
Ng affirms that it is an important research problem to develop learning algorithms that are trained on one distribution and generalise well to another. But if your goal is to make progress on a specific machine learning application rather than make research progress, he recommends choosing development and test sets that are drawn from the same distribution.
How Large Should The Development/Test Sets Be?
The development set should be large enough to detect differences between the algorithms one is working on, states Ng. He cites an example: if classifier A has an accuracy of 90.0% and classifier B has an accuracy of 90.1%, a development set of 100 examples would not be able to detect this 0.1% difference. Compared to other machine learning problems, a 100-example development set is small. Development sets with sizes from 1,000 to 10,000 examples are common. With 10,000 examples, you stand a good chance of detecting an improvement as small as 0.1%.
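A back-of-the-envelope way to see this (the function name is illustrative): with n dev examples, flipping one example from wrong to right moves measured accuracy by 1/n, which bounds the smallest difference the set can resolve.

```python
def accuracy_resolution(n_examples):
    """Smallest change in accuracy a dev set of this size can register:
    each example shifts the measured accuracy by 1/n."""
    return 1.0 / n_examples

# A 100-example dev set moves in 1% steps, so a 0.1% gap is invisible;
# a 10,000-example set moves in 0.01% steps and can register it.
print(accuracy_resolution(100))     # 0.01
print(accuracy_resolution(10_000))  # 0.0001
```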
For mature and important applications like advertising, web search and product recommendations, the former Baidu and Google chief talks about teams that are highly motivated to eke out even a 0.01% improvement, since it has a direct impact on the company’s profits. In this case, the development set could be much larger than 10,000 examples, in order to pick up even the smallest of improvements.
What should be the size of the test set? It should be large enough to give high confidence in the overall performance of the system. One popular heuristic has been to use 30% of your data for the test set. This works well with a modest number of examples — say 100 to 10,000. But in the age of big data, where machine learning problems sometimes involve more than a billion examples, the fraction of data allocated to dev/test sets has been shrinking, even as the absolute number of examples in the development and test sets has been growing. Ng emphasises that there is no need for development or test sets any larger than what is required to evaluate the performance of your algorithms.
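In the big-data regime, one way to follow this advice is to fix the absolute dev/test sizes and give everything else to training. The 10,000-example sizes below are illustrative assumptions, not a prescription from the book:

```python
def big_data_split(n_total, dev_size=10_000, test_size=10_000):
    """Allocate fixed absolute dev/test sizes instead of a 30% fraction."""
    n_train = n_total - dev_size - test_size
    return n_train, dev_size, test_size

n_train, n_dev, n_test = big_data_split(1_000_000_000)
print(n_train)  # 999980000
# dev + test together use 0.002% of the data rather than 30%
```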
Ng Recommends That Teams Should:
- Choose development and test sets that reflect the data you expect to get in the future and want the system to do well on
- Avoid simply using 30% of the available data as the test set, especially if future data (mobile phone images) will differ in nature from the training data (website images)
- Make the development and test sets large enough to accurately represent the performance of the model
On best practices for splitting test and development datasets, a Stanford tutorial notes that academic datasets often come with a train/test split (so that different models can be compared on a common test set). You will therefore have to build the train/development split yourself before beginning your project.
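Carving a development set out of the shipped training split might look like the sketch below; the function name, dev size and seed are illustrative assumptions. The shipped test set is left untouched so results remain comparable:

```python
import random

def carve_dev_split(train_examples, dev_size, seed=0):
    """Carve a development set out of the provided training split,
    leaving the common test split untouched for final comparison."""
    rng = random.Random(seed)
    shuffled = list(train_examples)
    rng.shuffle(shuffled)
    return shuffled[dev_size:], shuffled[:dev_size]

official_train = list(range(50_000))   # stand-in for an academic train split
train_part, dev_part = carve_dev_split(official_train, dev_size=5_000)
print(len(train_part), len(dev_part))  # 45000 5000
```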
Another key tip is that, as part of the machine learning strategy, teams should define the data collection process. If teams know what they want to predict, it will help them outline what data needs to be mined. By and large, the general recommendation for beginners is to reduce the complexity of the data by understanding exactly what type of data needs to be harnessed. Most business problems, for example, can be solved with a simple segmentation, so it is important to understand the task or business problem and pick the right algorithm for it. ML algorithms fall into five major categories: cluster analysis, classification, ranking, regression and generation. Segmenting an audience, for instance, falls under cluster analysis.
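To make the cluster-analysis point concrete, here is a toy k-means clusterer in pure Python; it is a sketch for intuition only (real projects would use a library implementation), and the customer data is invented:

```python
def kmeans(points, k=2, iters=20):
    """A minimal k-means clusterer: audience segmentation is a
    cluster-analysis task of exactly this shape."""
    # spread the initial centroids across the (ordered) point list
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        # assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # move each centroid to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(coord) / len(cluster)
                                     for coord in zip(*cluster))
    return centroids, clusters

# Illustrative "customers" as (spend, visits) points: two obvious segments
customers = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
             (8.0, 8.2), (8.1, 7.9), (7.9, 8.0)]
centroids, segments = kmeans(customers, k=2)
print(sorted(len(s) for s in segments))  # [3, 3]
```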