Deep learning is driving much of today's progress in machine learning, and datasets play a dominant role in shaping the future of the technology. Learning starts with getting the right data, and the best way to master this field is to get your hands dirty by practicing with high-quality datasets.
Here we list 15 open, high-quality datasets for practicing deep learning, covering areas such as image processing and speech processing.
1| ImageNet
This dataset was created in response to the growing need for more data in image and vision research, and it is regarded as the de facto standard dataset for classification algorithms in computer vision. Its images are organized according to the WordNet hierarchy.
Category: Image Processing
Dataset Info: The dataset contains 14,197,122 images in total. Each concept in WordNet is described by a “synonym set” or “synset”. WordNet contains more than 100,000 synsets, and ImageNet aims to provide an average of 1000 images for each synset.
Here is a link to a paper that used this dataset.
2| MNIST
This is one of the most widely used databases in deep learning. Researchers from Microsoft and Google Labs have reportedly contributed to this dataset of handwritten digits. It is constructed from NIST databases that contain binary images of handwritten digits.
Category: Image Processing
Dataset Info: The dataset contains a training set of 60,000 examples and a test set of 10,000 examples. There are four files in this dataset. Here is the link to a paper that used this dataset.
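The four MNIST files (training images, training labels, test images, test labels) are distributed in the IDX binary format. The sketch below shows one way to parse that format with only the standard library. It assumes the file has already been decompressed and read into memory, and it handles only the unsigned-byte type that MNIST uses. The function name `parse_idx` and the tiny synthetic buffers are illustrative, not part of any official loader.

```python
import struct


def parse_idx(data: bytes):
    """Parse an IDX byte string into (shape, flat list of unsigned bytes).

    Sketch only: handles the unsigned-byte type code (0x08) used by MNIST.
    """
    # Header: two zero bytes, a type code, and the number of dimensions.
    zero, dtype_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("only unsigned-byte IDX data is handled in this sketch")
    # Each dimension size is a big-endian 32-bit unsigned integer.
    shape = struct.unpack(">" + "I" * ndim, data[4:4 + 4 * ndim])
    payload = data[4 + 4 * ndim:]
    expected = 1
    for dim in shape:
        expected *= dim
    if len(payload) != expected:
        raise ValueError("payload size does not match the header")
    return shape, list(payload)


# Synthetic stand-ins for real files (assumption): two 2x2 "images" and two labels.
fake_images = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 2) + bytes(range(8))
fake_labels = struct.pack(">HBBI", 0, 0x08, 1, 2) + bytes([7, 3])

image_shape, pixels = parse_idx(fake_images)
label_shape, labels = parse_idx(fake_labels)
print(image_shape, label_shape, labels)
```

With the real files, the image header would report a shape of (60000, 28, 28) for the training set and (10000, 28, 28) for the test set.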
3| LSUN
LSUN, or Large-Scale Scene Understanding, is a dataset intended to benchmark and speed up progress in scene understanding tasks such as scene classification, saliency prediction, and room layout estimation.
Category: Image Processing
Dataset Info: This dataset by Princeton University consists of around one million labeled images for each of its scene and object categories. The test set contains 10,000 images.
4| MS COCO (Common Objects in Context)
MS COCO is a large-scale object detection, segmentation, and captioning dataset whose features include recognition in context, superpixel stuff segmentation, and object segmentation.
Category: Image Processing
Dataset Info: This dataset contains 1.5 million object instances across 80 object categories and 91 stuff categories, and each image is annotated with 5 captions.
5| YouTube-8M
YouTube-8M is a large-scale video dataset that was announced by Google in September 2016. The labels for each video are organized into 24 top-level verticals.
Category: Image Processing
Dataset Info: This dataset contains 6.1 million YouTube video IDs, 2.6 billion audio/visual features, and high-quality annotations covering more than 3800 visual entities.
6| Yelp Reviews
This is an open dataset from Yelp for academic and learning purposes. It is a subset of Yelp's businesses, reviews, and user data.
Category: Natural Language Processing
Dataset Info: This dataset contains over 5 million reviews from 10 metropolitan areas. It also contains 1.4 million business attributes like hours, parking, ambiance, etc.
7| LibriSpeech
This is a large-scale dataset of read English speech, derived from audiobooks in the LibriVox project. It also comes with prepared language-model training data and pre-built language models.
Category: Speech Recognition
Dataset Info: The dataset contains approximately 1000 hours of English speech sampled at 16 kHz. It is hosted on OpenSLR, which also offers other speech resources such as yes-no recordings, a Danish pronunciation dictionary, a list of words in Spanish, and recordings of African-accented French speech.
8| Open Source Biometric Recognition Data
This dataset provides tools to design and evaluate new biometric algorithms and an interface to incorporate biometric technology into end-user applications.
Category: Biometric Recognition
Dataset Info: This dataset contains open source code for facial recognition, age estimation, and gender estimation.
9| Google AudioSet
This dataset is drawn from YouTube videos and is built around an expanding ontology. The ontology is specified as a hierarchical graph of event categories covering human and animal sounds, musical instruments and genres, everyday environmental sounds, and more.
Category: Sound
Dataset Info: The dataset consists of 2.1 million annotated videos covering 527 classes and 5.8 thousand hours of audio.
10| Blogger Corpus
This dataset consists of blogs gathered from blogger.com in 2004. Each blog is stored as a separate file, and each age group contains an equal number of male and female bloggers.
Category: Natural Language Processing
Dataset Info: The dataset consists of the collected posts of 19,320 bloggers, for a total of 681,288 posts and over 140 million words.
11| CIFAR-10
This dataset is a labeled subset of the 80 Million Tiny Images dataset. It was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.
Category: Image Classification
Dataset Info: This dataset consists of 60,000 32x32 color images in 10 classes, with 6000 images per class. The training set and test set contain 50,000 and 10,000 images respectively.
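As a quick sanity check on these figures, the short sketch below verifies that the class and split counts agree, and estimates the raw (uncompressed) size of the pixel data. It assumes three color channels stored as one byte each, which is not stated in the description above.

```python
# Figures taken from the dataset description.
NUM_CLASSES = 10
IMAGES_PER_CLASS = 6000
TRAIN_IMAGES = 50_000
TEST_IMAGES = 10_000

# Assumption: RGB images with one byte per channel.
HEIGHT, WIDTH, CHANNELS = 32, 32, 3

total_images = NUM_CLASSES * IMAGES_PER_CLASS
# The per-class count and the train/test split should describe the same total.
assert total_images == TRAIN_IMAGES + TEST_IMAGES

bytes_per_image = HEIGHT * WIDTH * CHANNELS
raw_bytes = total_images * bytes_per_image
raw_mib = raw_bytes / 2**20

print(f"{total_images} images, {bytes_per_image} bytes each, "
      f"about {raw_mib:.1f} MiB of raw pixels")
```

Under that assumption the whole dataset fits comfortably in memory on a typical laptop, which is part of why it is popular for quick experiments.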
12| Baidu Apolloscapes
This is a large-scale open dataset designed to promote the development of self-driving technologies. It contains high-resolution RGB videos with hundreds of thousands of frames, along with their corresponding pixel-by-pixel semantic annotations, dense point clouds, stereo images, etc.
Category: Self-driving
Dataset Info: The dataset contains 25 different semantic items, such as cars, bicycles, pedestrians, and street lights, organized into 5 groups.
13| WordNet
This is a large lexical database of English from Princeton University in which nouns, verbs, adjectives, and adverbs are grouped into sets of synonyms called synsets. It serves as a useful tool for computational linguistics and natural language processing.
Category: Natural Language Processing
Dataset Info: It contains 117,000 synsets, each of which is linked to other synsets by means of a small number of conceptual relations.
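To illustrate what "linked by conceptual relations" means in practice, here is a minimal sketch of one such relation, the "is-a" (hypernym) link, modeled as a plain dictionary. The graph is a hand-written toy, not real WordNet data, and the helper names `hypernym_path` and `lowest_common_ancestor` are hypothetical.

```python
# Hand-written toy graph (assumption): each entry maps a concept to its
# more general parent. Real WordNet synsets can have several parents.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}


def hypernym_path(synset):
    """Follow 'is-a' links upward until a concept with no parent is reached."""
    path = [synset]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def lowest_common_ancestor(a, b):
    """Return the most specific concept that both a and b fall under."""
    ancestors_a = hypernym_path(a)
    for node in hypernym_path(b):
        if node in ancestors_a:
            return node
    return None


print(hypernym_path("dog"))
print(lowest_common_ancestor("dog", "cat"))
```

The same kind of hierarchy is what ImageNet relies on when it organizes images by synset.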
14| Open Images Dataset
This dataset contains open images that have been annotated with image-level labels and object bounding boxes.
Category: Image Classification
Dataset Info: The dataset is split into a training set of more than 9 million images, a validation set of more than 40k images and a test set of 125,436 images.
15| IMDB Reviews
This dataset is for binary sentiment classification. In addition to the labeled training and test reviews, it includes unlabeled data.
Category: Natural Language Processing
Dataset Info: This dataset consists of 25,000 highly polar movie reviews for training as well as 25,000 reviews for testing.