MNIST is a name that no machine learning enthusiast can avoid. It is special in several ways: it is highly popular and therefore widely explored and studied, it is open and easily available, and it is not at all complicated. MNIST is one of the very first datasets most ML practitioners turn to when they are starting out.
A recent research paper introduces a new addition to the MNIST family. This new member, which originates in India, is a dataset of handwritten digits in Kannada, one of the 22 scheduled languages of India, spoken by almost 57 million people.
What Is Kannada-MNIST
The dataset consists of images of handwritten Kannada digits, with 60,000 images in the training set and 10,000 images in the test set.
In addition to the training and test sets, there is another set of 10,240 images called the Dig-MNIST dataset. Unlike Kannada-MNIST, whose digits were handwritten by people who use Kannada as a means of communication, the Dig-MNIST digits were handwritten by non-Kannadigas, so it acts as a more challenging test set. The images in Dig-MNIST are also noisier, with smudges and grid borders.
Dataset dimensions:
- Training set: 60,000 × 28 × 28
- Test set: 10,000 × 28 × 28
- Dig-MNIST: 10,240 × 28 × 28
Kannada-MNIST is meant to act as a drop-in replacement for the original MNIST dataset. Although there have been numerous works on Kannada digits in ML, Kannada-MNIST specifically addresses the scarcity of data: its image count matches that of the original MNIST dataset, and it comes with the additional Dig-MNIST dataset.
Kannada-MNIST vs MNIST
The paper also compares Kannada-MNIST with the MNIST dataset, describing how the two differ under both a morphological comparison and a dimensionality-reduction comparison. The morphological comparison looks at the pixel intensities of the images in both datasets. Kannada-MNIST was observed to have a maximal mean pixel intensity of ∼0.3, compared with ∼0.6 for MNIST. The statistics of the morphological traits were obtained using the Morpho-MNIST framework.
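As a rough illustration of the pixel-intensity comparison, the sketch below computes one plausible reading of "maximal mean pixel intensity": average the images pixel by pixel, then take the brightest value of that mean image. This reading, the function name `max_mean_pixel_intensity`, and the random stand-in data are assumptions made for illustration. It does not use the Morpho-MNIST framework and will not reproduce the paper's figures.

```python
import numpy as np

def max_mean_pixel_intensity(images: np.ndarray) -> float:
    """Largest value of the per-pixel mean image, with pixels scaled to [0, 1].

    `images` is assumed to be a uint8 array of shape (n, 28, 28).
    """
    scaled = images.astype(np.float64) / 255.0
    mean_image = scaled.mean(axis=0)  # average over the n images -> shape (28, 28)
    return float(mean_image.max())

# Synthetic stand-in data (NOT Kannada-MNIST): 100 random 28 x 28 images.
rng = np.random.default_rng(0)
demo_images = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)
print(max_mean_pixel_intensity(demo_images))
```

Running the same function on the real training arrays of both datasets would give two numbers that can be compared directly.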
Principal Component Analysis (PCA) was used to examine how the explained variance is spread across components. The top 50 principal components explain 83% of the total variance for the MNIST dataset, but only 63% for Kannada-MNIST.
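The explained-variance figure can be computed along the following lines. This is a minimal sketch using NumPy's SVD on flattened, mean-centred images. The helper name `top_k_explained_variance` and the random stand-in data are assumptions, and the paper may have used a different PCA implementation.

```python
import numpy as np

def top_k_explained_variance(images: np.ndarray, k: int = 50) -> float:
    """Fraction of total variance captured by the top-k principal components."""
    flat = images.reshape(len(images), -1).astype(np.float64)  # (n, 784)
    centered = flat - flat.mean(axis=0)                         # PCA needs centred data
    # Singular values come back sorted in descending order; their squares are
    # proportional to the variance explained by each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return float(variances[:k].sum() / variances.sum())

# Synthetic stand-in data (NOT Kannada-MNIST): 200 random 28 x 28 images.
rng = np.random.default_rng(0)
demo_images = rng.normal(size=(200, 28, 28))
print(top_k_explained_variance(demo_images, k=50))
```

A lower fraction for the same `k` means the variance is spread over more directions, which is the sense in which the 63% figure sets Kannada-MNIST apart from MNIST's 83%.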
The research also studies how a standard Convolutional Neural Network (CNN) performs on the Kannada-MNIST dataset.
With an off-the-shelf Keras CNN using the Adadelta optimizer with a learning rate of 1.0 and ρ = 0.95, the model attained 97.3% accuracy on the test set. The same model returned an accuracy of 76.2% on the Dig-MNIST dataset.
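For readers who want to try a similar setup, here is a minimal sketch of a small Keras CNN compiled with Adadelta at the settings quoted above (learning rate 1.0, ρ = 0.95). The layer stack follows the widely used Keras MNIST CNN example and is an assumption, as the article does not list the exact architecture. The function name `build_cnn` and the random stand-in data are also illustrative only.

```python
import numpy as np
from tensorflow import keras

layers = keras.layers

def build_cnn(num_classes: int = 10) -> keras.Model:
    """Small CNN for 28 x 28 grayscale digits (architecture is an assumption)."""
    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        # Optimizer settings as quoted in the article.
        optimizer=keras.optimizers.Adadelta(learning_rate=1.0, rho=0.95),
        loss="sparse_categorical_crossentropy",  # integer labels 0-9
        metrics=["accuracy"],
    )
    return model

model = build_cnn()

# Synthetic stand-in data (NOT Kannada-MNIST), only to show the calls run.
rng = np.random.default_rng(0)
x_demo = rng.random((64, 28, 28, 1)).astype("float32")
y_demo = rng.integers(0, 10, size=64)
model.fit(x_demo, y_demo, batch_size=32, epochs=1, verbose=0)
```

To evaluate on real data, the training, test and Dig-MNIST arrays would replace `x_demo` and `y_demo`, with pixel values scaled to [0, 1] and a trailing channel dimension added.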
It Is Open
The work by Vinay Uday Prabhu has been open-sourced to promote future studies on Kannada-MNIST as well as on other languages. The paper also sets out some interesting problem statements and challenges for the larger ML community, inviting researchers to use the Kannada-MNIST dataset in their own studies.
What To Expect
MNIST has already become a standard go-to dataset for beginners in machine learning, especially in computer vision. As more studies are carried out and more papers are published on MNIST, we can expect the scripts of many more languages to join the MNIST family, bringing new challenges and new discoveries to ML and setting a new standard.