
Best Practices On Setting Up Development And Test Sets For ML, According To Andrew Ng


The availability of data and increased computational power have been the biggest drivers of artificial intelligence. Google’s TensorFlow played a huge role in revolutionising machine learning, as it lets developers build neural networks without implementing every underlying operation from scratch. It also supports multiple languages, so developers can build ML models in Python and use them easily from other languages as well.

This article is based on Andrew Ng’s free ebook Machine Learning Yearning, in which he gives technical direction for machine learning projects. One of the key aspects he discusses is setting up the development and test sets.

In the book, Ng discusses what happens when a team deploys a classifier in an app and tests its performance on the data it collects. For example, you assemble a large dataset by downloading pictures of cats (positive examples) and non-cats (negative examples) from different websites, and split it 70/30 into training and test sets. Using this data, you build a cat detector that works well on both the training and test sets. But when the classifier is deployed in a mobile app, its performance is poor: the photos users upload differ in character from the website images it was trained and tested on.
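As a rough sketch of the split described above, using scikit-learn’s train_test_split (the web_images and labels arrays are hypothetical stand-ins for the downloaded data):

```python
from sklearn.model_selection import train_test_split

# web_images: hypothetical array of downloaded website photos
# labels: 1 for cat, 0 for non-cat
X_train, X_test, y_train, y_test = train_test_split(
    web_images, labels, test_size=0.30, random_state=42
)
# Both splits are drawn from the same pool of website images, so strong
# test-set accuracy says nothing about blurrier mobile-phone photos.
```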

Setting Up Development And Test Sets

Ng emphasises that working on machine learning applications is hard enough already; mismatched development and test sets add uncertainty about whether improving performance on the development set distribution will also improve test set performance. As a lesson for beginners, he states that having mismatched development and test sets makes it harder to figure out what is and isn’t working.

Ng acknowledges that developing learning algorithms that are trained on one distribution and generalise well to another is an important research problem. But if your goal is to make progress on a specific machine learning application rather than to make research progress, he recommends choosing development and test sets that are drawn from the same distribution.

How Large Should The Development/Test Sets Be?

The development set should be large enough to detect differences between the algorithms one is working on, states Ng. He cites an example: if classifier A has an accuracy of 90.0% and classifier B has an accuracy of 90.1%, a development set of 100 examples would not be able to detect this 0.1% difference. Compared to other machine learning problems, a 100-example development set is small; development sets of 1,000 to 10,000 examples are common. With 10,000 examples, you stand a good chance of detecting an improvement as small as 0.1%.
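A back-of-the-envelope calculation makes the point concrete: with n examples, measured accuracy can only move in steps of 1/n, and the binomial standard error gives a rough scale for the sampling noise. This sketch uses plain arithmetic and the 90% accuracy quoted above:

```python
import math

def accuracy_resolution(n: int, p: float = 0.90) -> None:
    step = 1.0 / n                   # smallest change one example can cause
    se = math.sqrt(p * (1 - p) / n)  # binomial standard error of the estimate
    print(f"n={n:>6}: one example = {step:.2%}, standard error ≈ {se:.2%}")

accuracy_resolution(100)     # one example = 1.00%; a 0.1% gap cannot even register
accuracy_resolution(10_000)  # one example = 0.01%; a 0.1% gap spans 10 examples
```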

For mature and important applications like advertising, web search and product recommendations, the former Baidu chief scientist and Google Brain founder notes that teams are highly motivated to eke out even a 0.01% improvement, since it has a direct impact on the company’s profits. In this case, the development set could be much larger than 10,000 examples, in order to pick up even the smallest of improvements.

What should be the size of the test set? It should be large enough to give high confidence in the overall performance of the system. One popular heuristic was to use 30% of your data for the test set, which works well with a modest number of examples, say 100 to 10,000. But in the age of big data, where machine learning problems sometimes involve more than a billion examples, the fraction of data allocated to the dev/test sets has been shrinking even as their absolute size has been growing. Ng emphasises that there is no need for development or test sets larger than what is required to evaluate the performance of your algorithms.
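To see how the 30% heuristic breaks down at scale, consider a hypothetical corpus of one billion examples: fixed-size dev/test sets big enough for reliable evaluation occupy only a vanishing fraction of the data. The sizes below are illustrative, not prescriptive:

```python
total = 1_000_000_000          # hypothetical big-data corpus
dev, test = 100_000, 100_000   # fixed sizes chosen for evaluation, not a percentage

train = total - dev - test
print(f"train: {train:,} ({train / total:.2%} of the data)")
print(f"dev/test: {dev + test:,} ({(dev + test) / total:.4%} of the data)")
```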

Ng Recommends That Teams Should:

  • Choose development and test sets that reflect the data you expect to get in the future and want your model to perform well on
  • The test set should not simply be 30% of the available data, especially if the future data (mobile phone images) will differ in nature from the training set (website images); see the sketch after this list
  • The development and test sets should be large enough to accurately represent the performance of the model
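Translating the first two recommendations into code for the cat-app example: rather than taking a random 30% of everything, reserve the mobile-phone images (the data the app will actually face) for the development and test sets, which also guarantees the two sets share one distribution. The mobile_images and web_images lists below are hypothetical:

```python
import random

random.seed(0)
random.shuffle(mobile_images)  # hypothetical list of labelled mobile-phone photos

# Dev and test both come from mobile photos: the distribution the app must handle.
dev_set = mobile_images[:2_500]
test_set = mobile_images[2_500:5_000]

# Everything else, including all the scraped website images, can go into training.
train_set = web_images + mobile_images[5_000:]
```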

When discussing best practices on splitting test and development datasets, a Stanford tutorial notes that academic datasets often come with a train/test split (so that different models can be compared on a common test set) but no development set; you will therefore have to build the train/development split yourself before beginning your project.
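A minimal sketch of carving a development set out of the provided training split (train_examples is a hypothetical list shipped with such a dataset):

```python
import random

random.seed(0)
random.shuffle(train_examples)  # hypothetical list from the academic dataset

dev_examples = train_examples[:2_000]    # held out for tuning and model comparison
train_examples = train_examples[2_000:]  # the official test set stays untouched
```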

Data Collection

Another key tip: as part of the machine learning strategy, teams should define the data collection process. Knowing what they want to predict helps teams outline what data needs to be mined. By and large, the general recommendation for beginners is to reduce the complexity of data by understanding exactly what type of data needs to be harnessed. For example, many business problems can be solved with a simple segmentation, so it is important to understand the task or business problem and pick the right algorithm for it. ML algorithms fall into five major categories: cluster analysis, classification, ranking, regression and generation; segmenting an audience, for instance, falls under cluster analysis.
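As a hedged illustration of that last point, a minimal audience-segmentation sketch using scikit-learn’s KMeans; the customer feature matrix and the choice of three segments are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [age, annual_spend, visits_per_month]
customers = np.array([
    [25, 300.0, 2], [32, 1200.0, 8], [47, 250.0, 1],
    [29, 1100.0, 9], [51, 3000.0, 4], [44, 2800.0, 5],
])

# Group the audience into a hypothetical three segments.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)
print(segments)  # cluster label per customer, e.g. [0 1 2 ...]
```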

 


Richa Bhatia

Richa Bhatia is a seasoned journalist with six years of experience in reportage and news coverage and has had stints at Times of India and The Indian Express. She is an avid reader, mum to a feisty two-year-old and loves writing about the next-gen technology that is shaping our world.