Synthetic data generation has become a popular way to meet the large data volumes needed to train deep learning algorithms. Areas such as computer vision have benefited greatly from advances in deep learning, and generating synthetic data now serves as a good starting point for researchers trying to bridge the data gap. A recent study from the University of Barcelona introduced a synthetic image generation algorithm to tackle the lack of training data in a fully supervised learning problem. Synthetic data is defined as anonymised data generated to mimic real-world data.
According to Sergey Nikolenko, chief research officer at Neuromation, synthetic data is a more efficient way of getting perfectly labeled data for recognition. In a post, he shared that the synthetic data approach has proven to be very successful, and that models trained by Neuromation are already being implemented in the retail sector.
Nikolenko Outlined The Following Major Benefits Of Using Synthetic Data:
- Reduces the manual work required to label data
- The replicated data is labeled perfectly without any errors
- Synthetic data is pegged as a useful tool for testing the scalability of algorithms and the performance of new software
Despite these upsides, synthetic data comes with its own disadvantages. According to Gautier Krings, chief scientist at Real Impact Analytics, synthetic data can’t be used for research purposes, as it only reproduces specific properties of the data. He further emphasised that producing quality synthetic data is complicated, since it can be difficult to keep track of all the features required to replicate the real data. Other researchers have voiced similar concerns about bias in synthetic data, saying that while it is good for training models, it cannot serve as a base for understanding real-world problems. Another big problem with synthetic images is understanding the extent to which this data can be applied to real-world problems, and whether it introduces any bias in the model.
MIT Research Demonstrates How To Generate Synthetic Data
In a paper titled The Synthetic Data Vault, MIT principal research scientist Kalyan Veeramachaneni and co-authors Neha Patki and Roy Wedge described a system that automatically creates synthetic data. The Synthetic Data Vault (SDV) builds machine learning models out of real databases and uses them to generate synthetic data. Its algorithm, called recursive conditional parameter aggregation, synthesises artificial data for any relational dataset. Their findings indicated that the SDV successfully modeled relational datasets and used the generative models to synthesise data that data scientists could use effectively.
According to Veeramachaneni, once the database was modeled, the researchers recreated a synthetic version of the data that looked like the original database. Even if the original database featured missing values and noise, that noise was embedded in the synthetic version to produce the right results.
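As a rough illustration of carrying missing values over into synthetic data, the sketch below models a single numeric column by estimating both its missing-value rate and a normal distribution over the observed values, then samples from both. This is not the SDV implementation. The function names, the `real_ages` column, and the choice of a normal distribution are all assumptions made for this example.

```python
import random
import statistics

def fit_column(values):
    """Fit a toy model of one numeric column that may contain None."""
    observed = [v for v in values if v is not None]
    # Fraction of entries that were missing in the real column.
    missing_rate = 1 - len(observed) / len(values)
    mu = statistics.fmean(observed)
    sigma = statistics.pstdev(observed, mu)
    return {"missing_rate": missing_rate, "mu": mu, "sigma": sigma}

def sample_column(model, n, seed=None):
    """Draw n synthetic values, reproducing the missing-value rate."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        if rng.random() < model["missing_rate"]:
            out.append(None)  # keep the "noise" of missing entries
        else:
            out.append(rng.gauss(model["mu"], model["sigma"]))
    return out

# Hypothetical real column with two missing entries out of ten.
real_ages = [34, None, 29, 41, None, 38, 30, 45, 27, 36]
model = fit_column(real_ages)
synthetic_ages = sample_column(model, n=2000, seed=1)
```

Roughly one in five of the synthetic values comes out missing, matching the real column, so downstream code is exercised against the same kind of gaps it would meet in production data.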
One of the key advantages of this model, as outlined by Veeramachaneni, is that it tackles the data crunch that companies face. The SDV model also addresses the data privacy problem, since companies can continue designing and testing models without causing a data breach. Another upside is that the machine learning models can easily be scaled to create either small or large synthetic data sets, which helps in stress tests for big data systems.
Another approach, outlined by Salesforce’s Andrey Karapetov, is to take historical data, estimate its probability distribution, and sample as many data points from it as needed. He mentioned that with Maximum Likelihood Estimation, researchers can use samples from historical data to create a model that can then be queried for more data points when required.
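The idea of fitting a distribution to historical data and then querying it can be sketched in a few lines. The example below is a minimal illustration, assuming the historical values are roughly normally distributed. The data, function names, and the choice of a Gaussian are assumptions for this sketch, not details from Karapetov's post.

```python
import random
import statistics

def fit_gaussian_mle(samples):
    """Return the maximum-likelihood (mean, std) of a normal distribution."""
    mu = statistics.fmean(samples)
    # pstdev divides by n rather than n - 1, which is the Gaussian MLE.
    sigma = statistics.pstdev(samples, mu)
    return mu, sigma

def sample_synthetic(mu, sigma, n, seed=None):
    """Query the fitted model for n new synthetic data points."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical historical measurements.
historical = [12.1, 9.8, 11.4, 10.3, 10.9, 9.5, 11.7, 10.2]
mu, sigma = fit_gaussian_mle(historical)

# Generate far more points than were originally observed.
synthetic = sample_synthetic(mu, sigma, n=1000, seed=0)
```

Once `mu` and `sigma` are estimated, the original records are no longer needed to generate data, and `n` can be set as small or as large as a test requires. Real data is rarely this well behaved, so in practice the distribution family would need to be chosen and validated against the historical samples.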
With GDPR in force and stricter privacy laws kicking in across other parts of the world, companies are grappling with tighter regulations, data governance and data collection issues. With restricted access to data, big tech companies are likely to invest more in simulating real data so they can rapidly test data science models and algorithms. While synthetic data can’t be used for research, it will help companies get rid of the privacy bottleneck and allow researchers and scientists to continue their work without using any sensitive data, says Veeramachaneni. Over time, synthetic data will play a large part in scaling business applications and will give data scientists more flexibility than real data does.