Data visualisation is increasingly used across businesses and research communities, chiefly because it yields better insights. However, not everyone is comfortable with the level of abstraction that data visualisation tools involve. In data science, for example, basic visualisation and data exploration are often sufficient, and complex, manually built visualisations are not needed at all.
This is where visualisation recommender systems come into play. They suggest a variety of visualisations so that users do not have to specify one from scratch every time they work on a similar project. Not every approach delivers on this, however; rule-based approaches, for example, have their limitations. With improvements in ML, these limitations can be addressed more easily. This article lays out a research study by scholars at MIT, in which they chart out an ML-based approach to visualisation recommendation.
ML-Based Visualisation Recommendations
In contrast to rule-based recommender systems, which are usually sophisticated and encode expert insights about visualisation, ML-based recommender systems learn directly from data: models are trained on it and then embedded into the system.
For example, Data2Vis, an automatic visualisation generation model developed by IBM researchers, uses a recurrent neural network (RNN) built on sequence-to-sequence modelling. Data2Vis generates reasonably accurate visualisations within seconds and can aid complex visualisation tasks without referring to other manually created visualisations.
The researchers’ ultimate goal was to provide a more accessible platform, speed up visualisation and make it easier for anyone to work with data visualisation regardless of their programming skills. In addition, data exploration improves when initial visualisations are readily available for complex tasks.
Likewise, there are other ML-based tools that generate visualisations. But all of these, including Data2Vis, face limitations in areas such as model training, validation methods and integration with other visualisation systems.
A Sui Generis System Called VizML
Kevin Hu and his team at the Massachusetts Institute of Technology have recently worked on a new ML-based visualisation recommendation method called VizML. This technique learns visualisation design choices from a large collection of datasets and their associated visualisations. The design choices, which are selected by the researchers, are predicted by a neural network implemented in PyTorch, while the baseline models are built with scikit-learn. Here, visualisation is defined specifically in terms of design choices.
“We describe visualisation as a process of making design choices that maximise effectiveness, which depends on dataset, task, and context. Then, we formulate visualisation recommendation as a problem of developing models that learn to make design choices. We train and test these models using one million unique dataset-visualisation pairs from the Plotly Community Feed.”
VizML entails five steps in its ML-based approach:
- Problem Formulation: Visualisation is framed as a process of making design choices, as described above.
- Data Processing: It involves data collection, cleaning and feature extraction.
- Predicting Design Choices: The neural network is trained and tested to see how well it predicts the design choices.
- Feature Importance: It involves interpreting which features matter most for prediction.
- Crowdsourced Benchmark: The model's predictions are compared with visualisations evaluated by people to see whether it performs as well or better.
Visualisations are specified through encodings (parameter settings) in tools such as Vega-Lite and Tableau. Design choices are derived from these visualisations and are evaluated by an objective function (referred to as visualisation effectiveness in the study). A set of design choices is considered only if it meets the required criteria, that is, if it matches the preferred design choices; otherwise, it is ignored. The mathematical formulation can be found in the study.
For data collection and cleaning, Hu and the team use the Plotly Community Feed. In total, about 2.3 million dataset-visualisation pairs were collected, with features describing each dataset and column.
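To make the idea of "choosing the design choices that maximise effectiveness" concrete, here is a minimal Python sketch. The design-choice space, the dataset flags and the hand-written `effectiveness` function are all hypothetical and purely illustrative; the study learns this mapping from data rather than hard-coding rules like these.

```python
from itertools import product

# Hypothetical design-choice space (illustrative, not the study's full set).
MARK_TYPES = ["scatter", "line", "bar"]
SHARED_AXIS = [True, False]


def effectiveness(choice, dataset):
    """Toy stand-in for an effectiveness objective (an assumption).

    Scores a (mark_type, shared_axis) pair for a dataset summarised by a
    few hand-made flags. A learned model would replace this function.
    """
    mark_type, shared_axis = choice
    score = 0.0
    if dataset["x_is_ordered"] and mark_type == "line":
        score += 1.0
    if dataset["x_is_categorical"] and mark_type == "bar":
        score += 1.0
    if dataset["n_columns"] > 2 and shared_axis:
        score += 0.5
    return score


def recommend(dataset):
    """Return the candidate design choice with the highest score."""
    candidates = product(MARK_TYPES, SHARED_AXIS)
    return max(candidates, key=lambda c: effectiveness(c, dataset))


# A made-up dataset summary resembling a multi-column time series.
time_series = {"x_is_ordered": True, "x_is_categorical": False, "n_columns": 3}
best = recommend(time_series)
print(best)
```

The exhaustive search over candidates only works because this toy space is tiny; an ML-based recommender instead predicts each design choice directly from dataset features.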
“Using the Plotly API, we collected approximately 2.5 years of public visualisations from the feed, starting from 2015-07-17 and ending at 2018-01-06. We gathered 2,359,175 visualisations in total, 2,102,121 of which contained all three configuration objects, and 1,989,068 of which were parsed without error.”
The ‘Plotly corpus’, as described in the study, is then subjected to feature extraction. The team extracts 81 single-column features and 30 pairwise-column features, which are aggregated into scalar dataset-level values using 16 aggregation functions. After feature processing, 119,815 datasets and 287,416 columns remain for the prediction tasks.
Once the features are ready, a fully-connected feedforward neural network is built in PyTorch. It is tuned with respect to learning rate and weight ratio, and then trained and tested on the processed data.
VizML predicted design choices with an accuracy of 70 to 95 percent. The crowdsourced benchmark run by the researchers showed similar accuracy. These results suggest that ML models can recommend visualisations about as well as people can.
Ultimately, how useful such a system is depends on parameters such as the design choices involved, and on the applications that rely on data visualisation.
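The aggregation step may be easier to picture with a small example. The sketch below, using only the Python standard library, computes a handful of single-column features and then collapses them into fixed-length dataset-level scalars. The four features and three aggregators are an illustrative subset chosen here as an assumption; they are not the study's actual 81 features or 16 aggregation functions.

```python
import statistics


def column_features(values):
    """Extract a few single-column features (illustrative subset)."""
    return {
        "length": float(len(values)),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "is_sorted": float(values == sorted(values)),
    }


# Hypothetical aggregation functions, standing in for the study's 16.
AGGREGATORS = {"min": min, "max": max, "mean": statistics.fmean}


def dataset_features(columns):
    """Aggregate per-column features into one fixed-length feature dict."""
    per_column = [column_features(col) for col in columns]
    features = {}
    for name in per_column[0]:
        column_values = [f[name] for f in per_column]
        for agg_name, agg in AGGREGATORS.items():
            features[f"{name}_{agg_name}"] = agg(column_values)
    return features


# A made-up dataset with two numeric columns.
dataset = [[1.0, 2.0, 3.0, 4.0], [10.0, 8.0, 12.0, 6.0]]
features = dataset_features(dataset)
print(len(features), features["mean_max"])
```

The point of aggregating is that datasets with different numbers of columns all end up with a feature vector of the same length, which is what a fixed-size model input requires.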
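As a rough illustration of this step, the following sketch builds and trains a small fully-connected feedforward classifier in PyTorch. The layer sizes, the number of classes, the optimiser settings and the random stand-in data are all assumptions made for the example; they are not the configuration reported in the study.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Assumed sizes for illustration only: 12 input features, 3 output classes
# (e.g. three candidate visualisation types).
N_FEATURES, N_CLASSES, HIDDEN = 12, 3, 32

# Fully-connected feedforward network with two hidden layers.
model = nn.Sequential(
    nn.Linear(N_FEATURES, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, N_CLASSES),
)

# Synthetic stand-in for dataset-level features and design-choice labels.
X = torch.randn(64, N_FEATURES)
y = torch.randint(0, N_CLASSES, (64,))

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Full-batch training loop; records the loss at every step.
losses = []
for step in range(20):
    optimizer.zero_grad()
    logits = model(X)
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Predict one design choice (class index) per dataset.
with torch.no_grad():
    predictions = model(X).argmax(dim=1)
```

In practice the data would be split into training and test sets so that the reported accuracy reflects unseen datasets; the sketch skips this to stay short.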