Can Synthetic Data Solve The Bulk Data Problem In Deep Learning?

Synthetic data generation has become a surrogate technique for tackling the large volumes of data needed to train deep learning algorithms. Areas such as computer vision have benefited greatly from advances in deep learning, and generating synthetic data now serves as a good starting point for researchers trying to bridge the data gap. Recent research from the University of Barcelona describes a synthetic image generation algorithm designed to tackle the lack of training data in a fully supervised learning problem. Synthetic data is defined as anonymised data generated to mimic real-world data.

According to Sergey Nikolenko, chief research officer at Neuromation, synthetic data is a more efficient way of getting perfectly labeled data for recognition. In a post, he shared that the synthetic data approach has proven very successful, and that models trained by Neuromation are already being implemented in the retail sector.

Nikolenko Outlined The Following Major Benefits Of Using Synthetic Data:

  • Reduces the manual work required to label data
  • Produces data that is labeled without errors
  • Serves as a useful tool for testing the scalability of algorithms and the performance of new software

Despite the upside, synthetic data comes with its own disadvantages. According to Gautier Krings, chief scientist at Real Impact Analytics, synthetic data can’t be used for research purposes, as it only reproduces specific properties of the data. He further emphasised that producing quality synthetic data is complicated, since it can be difficult to keep track of all the features required to replicate the real data. Other researchers have voiced similar concerns about bias in synthetic data and have said that while it is good for training models, it cannot serve as a base for understanding real-world problems. Another big problem with synthetic images is understanding how far this data can be applied to real-world problems, and whether it introduces any bias into the model.

MIT Research Demonstrates How To Generate Synthetic Data

In a paper titled The Synthetic Data Vault (SDV), MIT researchers Kalyan Veeramachaneni, principal research scientist, and co-authors Neha Patki and Roy Wedge described a system that automatically creates synthetic data. SDV automatically builds machine learning models out of real databases and cranks out synthetic data. The algorithm, recursive conditional parameter aggregation, synthesises artificial data for any relational dataset. Their findings indicated that the SDV successfully modeled relational datasets and used the generative models to synthesise data which data scientists could use effectively.

According to Veeramachaneni, once the database was modeled, the researchers recreated a synthetic version of the data that looked like the original database. Even if the original database featured missing values and noise, that noise was embedded in the synthetic version to produce the right results.
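To make the idea concrete, here is a minimal Python sketch of modeling a single numeric column and sampling a synthetic version that keeps roughly the same share of missing values. It is not the SDV algorithm or its recursive conditional parameter aggregation. The function names `fit_column` and `sample_column`, the Gaussian assumption and the example values are all hypothetical, chosen only for illustration.

```python
import random
import statistics


def fit_column(values):
    """Summarise one numeric column (hypothetical helper, not part of SDV).

    Records the fraction of missing entries (None) and the mean and
    population standard deviation of the observed entries.
    """
    observed = [v for v in values if v is not None]
    return {
        "missing_rate": 1 - len(observed) / len(values),
        "mean": statistics.mean(observed),
        "std": statistics.pstdev(observed),
    }


def sample_column(model, count, rng):
    """Draw `count` synthetic values, reproducing the missing-value rate."""
    synthetic = []
    for _ in range(count):
        if rng.random() < model["missing_rate"]:
            synthetic.append(None)  # keep the gaps, as in the real column
        else:
            # Assumption: observed values are roughly Gaussian.
            synthetic.append(rng.gauss(model["mean"], model["std"]))
    return synthetic


# Made-up example column with two missing entries.
real_column = [1.0, None, 3.0, None]
model = fit_column(real_column)
synthetic_column = sample_column(model, count=1000, rng=random.Random(0))
```

A real relational dataset would also need the dependencies between columns and between tables to be modeled, which this single-column sketch leaves out.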

One of the key advantages of this model, as outlined by Veeramachaneni, is that it tackles the data crunch that companies face. The SDV model also addresses the data privacy problem, as companies can continue designing and testing models without causing a breach of data. Another upside is that the machine learning models can be easily scaled to create either small or large synthetic data sets, which helps in stress tests for big data systems.

Another approach, outlined by Salesforce’s Andrey Karapetov, is to use historical data, sample the probability distribution and generate as many data points as needed. He mentioned that with Maximum Likelihood Estimation, researchers can use samples from historical data to create a model that can be further queried for more data points when required.
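The approach described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the historical data is well described by a Gaussian distribution, which may not hold in practice. The function names `fit_gaussian_mle` and `sample_synthetic` and the sample values are made up for the example and do not come from Karapetov's post.

```python
import math
import random


def fit_gaussian_mle(samples):
    """Return the maximum likelihood estimates (mean, std) for a Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    # The MLE of the variance divides by n, not n - 1.
    variance = sum((x - mean) ** 2 for x in samples) / n
    return mean, math.sqrt(variance)


def sample_synthetic(mean, std, count, seed=None):
    """Query the fitted model for `count` new synthetic data points."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(count)]


# Made-up historical measurements, for illustration only.
historical = [12.1, 9.8, 11.4, 10.3, 10.9, 9.5, 11.0, 10.6]

mean, std = fit_gaussian_mle(historical)
synthetic = sample_synthetic(mean, std, count=1000, seed=0)
```

Once the parameters are estimated, the model can be queried for any number of points, so the same eight historical values could back a data set of a thousand or a million rows. The quality of those rows depends entirely on how well the chosen distribution fits the real data.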

Outlook

With GDPR in force and stricter privacy laws kicking in elsewhere in the world, companies are grappling with tighter regulations, data governance and data collection issues. With restricted access to data, big tech companies are likely to invest more in simulating real data to rapidly test data science models and algorithms. While synthetic data can’t be used for research, it will help companies get rid of the privacy bottleneck and will allow researchers and scientists to continue their work without using any sensitive data, says Veeramachaneni. Over time, synthetic data will play a huge part in scaling business applications and will give data scientists more flexibility than real data.


Richa Bhatia

Richa Bhatia is a seasoned journalist with six years' experience in reportage and news coverage and has had stints at Times of India and The Indian Express. She is an avid reader, mum to a feisty two-year-old and loves writing about the next-gen technology that is shaping our world.