15 Open Datasets For Deep Learning Enthusiasts

Deep learning is a game changer today, and datasets play a dominant role in shaping the future of the technology. Learning starts with getting the right data, and the best way to master this field is to get your hands dirty by practicing with high-quality datasets.

Here we list 15 open, high-quality datasets for practicing deep learning across areas such as image processing, speech processing and more.

1| ImageNet

This dataset grew out of a rising demand for more data in image and vision research, and it is widely regarded as the de facto benchmark for classification algorithms in computer vision. Its images are organized according to the WordNet hierarchy.

Category: Image Processing

Dataset Info: The dataset contains 14,197,122 images in total. Each concept in WordNet is described by a “synonym set” or “synset”. WordNet has more than 100,000 synsets, and ImageNet aims to provide an average of 1,000 images for each synset.

Here is a link to a paper that used this dataset.

2| MNIST

This is one of the most important databases for deep learning. Researchers from Microsoft and Google Labs have reportedly contributed to this dataset of handwritten digits. It is constructed from NIST's databases, which contain binary images of handwritten digits.

Category: Image Processing

Dataset Info: The dataset contains a training set of 60,000 examples and a test set of 10,000 examples. It is distributed as four files: an image file and a label file for each set. Here is the link to a paper that used this dataset.
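As a minimal sketch of how those four files can be read: MNIST is commonly distributed in the IDX binary format, where a 4-byte magic number encodes the data type and the number of dimensions, followed by big-endian 32-bit dimension sizes and then the raw values. The function name `parse_idx` and the tiny in-memory sample are illustrative assumptions, not part of any official tooling.

```python
import struct

def parse_idx(buf: bytes):
    """Parse an IDX buffer of unsigned bytes into (dims, flat_values)."""
    # Magic number: two zero bytes, a type code, and the number of dimensions.
    zero1, zero2, type_code, ndim = struct.unpack_from(">BBBB", buf, 0)
    if zero1 != 0 or zero2 != 0:
        raise ValueError("not an IDX buffer")
    if type_code != 0x08:
        raise ValueError("this sketch only handles unsigned-byte data")
    # Each dimension size is a big-endian 32-bit unsigned integer.
    dims = struct.unpack_from(">" + "I" * ndim, buf, 4)
    offset = 4 + 4 * ndim
    count = 1
    for d in dims:
        count *= d
    data = buf[offset:offset + count]
    if len(data) != count:
        raise ValueError("buffer is shorter than its header claims")
    return dims, list(data)

# Hypothetical sample: a label file holding three labels (7, 2, 1).
sample = struct.pack(">BBBBI", 0, 0, 0x08, 1, 3) + bytes([7, 2, 1])
dims, labels = parse_idx(sample)
print(dims, labels)
```

For a real download, which is usually gzip-compressed, the bytes could be obtained with `gzip.open(path, "rb").read()` before calling the function.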

3| LSUN

LSUN, or Large-Scale Scene Understanding, is a dataset built to benchmark and speed up progress in scene understanding, which includes scene classification, saliency prediction, room layout estimation, etc.

Category: Image Processing

Dataset Info: This dataset by Princeton University consists of around one million labeled images for each of its scene and object categories, and the test set contains 10,000 images.

4| MS COCO: Common Objects in Context

This is a large-scale object detection, segmentation and captioning dataset with features such as recognition in context, superpixel stuff segmentation, object segmentation, etc.

Category: Image Processing

Dataset Info: This dataset contains 1.5 million object instances across 80 object categories and 91 stuff categories, and each image is annotated with 5 captions.
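To make "object instances" and "categories" concrete, here is a small sketch that counts instances per category. It assumes the commonly used COCO-style JSON layout, with top-level `categories` and `annotations` lists and a `category_id` on each annotation. The toy dictionary and the helper name `instances_per_category` are made up for illustration.

```python
from collections import Counter

# Toy stand-in for a COCO-style annotation file; all values are made up.
coco = {
    "categories": [{"id": 1, "name": "person"}, {"id": 18, "name": "dog"}],
    "annotations": [
        # bbox is [x, y, width, height] in pixels.
        {"image_id": 10, "category_id": 1, "bbox": [5.0, 8.0, 40.0, 90.0]},
        {"image_id": 10, "category_id": 18, "bbox": [60.0, 50.0, 30.0, 20.0]},
        {"image_id": 11, "category_id": 1, "bbox": [0.0, 0.0, 25.0, 70.0]},
    ],
}

def instances_per_category(data):
    """Count annotated object instances for each category name."""
    names = {c["id"]: c["name"] for c in data["categories"]}
    counts = Counter(names[a["category_id"]] for a in data["annotations"])
    return dict(counts)

counts = instances_per_category(coco)
print(counts)
```

With a real annotation file, the dictionary would instead come from `json.load` on the downloaded JSON.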

5| YouTube-8M

YouTube-8M is a large-scale video dataset that was announced by Google in September 2016. The labels for each video are organized into 24 top-level verticals.

Category: Image Processing

Dataset Info: This dataset contains 6.1 million YouTube video IDs and 2.6 billion audio/visual features, with high-quality annotations covering 3,800+ visual entities.

6| Yelp Reviews

This is an open dataset that Yelp provides for academic and learning purposes. It is a subset of Yelp's user data, reviews and businesses.

Category: Natural language processing

Dataset Info: This dataset contains over 5 million reviews from 10 metropolitan areas. It also contains 1.4 million business attributes like hours, parking, ambiance, etc.

7| LibriSpeech

This is a large-scale dataset of English speech derived from read audiobooks from the LibriVox project. It comes with prepared language-model training data and pre-built language models.

Category: Speech Recognition

Dataset Info: The dataset contains 1,000 hours of speech sampled at 16 kHz. It is hosted on OpenSLR, which also lists separate resources such as yes-no recordings, a Danish pronunciation dictionary, a list of words in Spanish and recordings of African Accented French speech.

8| Open Source Biometric Recognition Data

This resource provides tools for designing and evaluating new biometric algorithms, along with an interface for incorporating biometric technology into end-user applications.

Category: Biometric Recognition

Dataset Info: It includes open source code for facial recognition, age estimation and gender estimation.

9| Google AudioSet

This dataset is drawn from YouTube videos and is built around an expanding ontology. The ontology is specified as a hierarchical graph of event categories covering human and animal sounds, musical instruments and genres, everyday environmental sounds, etc.

Category: Sound

Dataset Info: The dataset consists of 2.1 million annotated videos spanning 527 classes and 5.8 thousand hours of audio.
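The video count and the hour count fit together if each annotated video contributes one 10-second clip. That clip length is an assumption not stated in the figures above, and the quick check below is only a rough sanity calculation.

```python
videos = 2_100_000      # annotated videos, from the figures above
clip_seconds = 10       # assumed clip length per video (not stated above)

total_hours = videos * clip_seconds / 3600
print(round(total_hours))  # roughly 5.8 thousand hours
```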

10| Blogger Corpus

This dataset consists of blogs gathered from blogger.com in 2004. Each blog is stored as a separate file, and each age group contains an equal number of male and female bloggers.

Category: Natural Language Processing

Dataset Info: The dataset consists of the collected posts of 19,320 bloggers, for a total of 681,288 posts and over 140 million words.

11| CIFAR-10

This dataset is a labeled subset of the 80 Million Tiny Images dataset and was collected by Alex Krizhevsky, Vinod Nair and Geoffrey Hinton.

Category: Image Classification

Dataset Info: This dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. The training and test sets contain 50,000 and 10,000 images respectively.
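As a rough illustration of what one such image looks like in memory, the sketch below assumes the planar layout used by the commonly distributed Python version of the dataset: 3,072 values per image, with all red values first, then green, then blue, each plane stored row by row. The helper name `to_pixels` and the sample image are hypothetical.

```python
WIDTH = HEIGHT = 32
CHANNELS = 3
VALUES_PER_IMAGE = WIDTH * HEIGHT * CHANNELS  # 32 * 32 * 3 = 3072

def to_pixels(flat):
    """Convert a planar flat row (all R, then G, then B) to rows of (r, g, b)."""
    if len(flat) != VALUES_PER_IMAGE:
        raise ValueError("expected 3072 values")
    plane = WIDTH * HEIGHT
    r, g, b = flat[:plane], flat[plane:2 * plane], flat[2 * plane:]
    return [
        [(r[y * WIDTH + x], g[y * WIDTH + x], b[y * WIDTH + x])
         for x in range(WIDTH)]
        for y in range(HEIGHT)
    ]

# Made-up image: red plane all 255, green all 128, blue all 0.
flat = [255] * 1024 + [128] * 1024 + [0] * 1024
image = to_pixels(flat)

# Split sizes quoted above: 10 classes x 6,000 images = 50,000 + 10,000.
assert 10 * 6000 == 50_000 + 10_000 == 60_000
```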

12| Baidu Apolloscapes

This is a large-scale open dataset designed to promote the development of self-driving technologies. It contains high-resolution RGB videos with hundreds of thousands of frames and their corresponding pixel-by-pixel semantic annotations, dense point clouds, stereo images, etc.

Category: Self-driving

Dataset Info: The dataset contains 25 different semantic items, such as cars, bicycles, pedestrians and street lights, organized into 5 groups.

13| WordNet

This is a large lexical database of English from Princeton University, in which words are grouped into sets of synonyms called synsets. It serves as a useful tool for computational linguistics and natural language processing.

Category: Natural language processing

Dataset Info: It contains 117,000 synsets, each of which is linked to other synsets by a small number of conceptual relations.
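One way to picture synsets linked by conceptual relations is as a graph. The sketch below models a single relation, the "is a kind of" (hypernym) link, as a plain dictionary and walks it upward. The entries and the helper name `hypernym_chain` are illustrative and are not read from WordNet itself.

```python
# Toy graph: each synset name maps to its hypernym (the more general concept).
# These entries are illustrative, not taken from the WordNet database.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
    "animal": None,
}

def hypernym_chain(synset):
    """Follow hypernym links from a synset up to the most general concept."""
    chain = []
    current = HYPERNYM[synset]
    while current is not None:
        chain.append(current)
        current = HYPERNYM[current]
    return chain

chain = hypernym_chain("dog")
print(chain)
```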

14| Open Images Dataset

This dataset contains open images that have been annotated with image-level labels and object bounding boxes.

Category: Image Classification

Dataset Info: The dataset is split into a training set of more than 9 million images, a validation set of more than 40k images and a test set of 125,436 images.

15| IMDB Reviews

This dataset is for binary sentiment classification. In addition to the training and test reviews, it includes unlabeled data.

Category: Natural Language Processing

Dataset Info: This dataset consists of 25,000 highly polar movie reviews for training as well as 25,000 reviews for testing.
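As a hedged sketch of how the labels can be recovered, the snippet below assumes the layout of the commonly distributed archive. In that layout each review is a text file named `<id>_<rating>.txt`, ratings of 7 or more count as positive, ratings of 4 or less count as negative, and unlabeled reviews carry a rating of 0. The function name `parse_review_filename` is hypothetical.

```python
import re

# Assumed naming convention: "<id>_<rating>.txt", e.g. "200_8.txt".
FILENAME = re.compile(r"^(\d+)_(\d+)\.txt$")

def parse_review_filename(name):
    """Return (review_id, rating, label) parsed from a review file name."""
    match = FILENAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    review_id, rating = int(match.group(1)), int(match.group(2))
    if rating >= 7:
        label = "pos"
    elif 1 <= rating <= 4:
        label = "neg"
    else:
        label = None  # unlabeled (rating 0) or outside the labeled ranges
    return review_id, rating, label

print(parse_review_filename("200_8.txt"))
```

The gap between the two rating ranges is what makes the reviews "highly polar": under this assumed convention, mid-range ratings are not part of the labeled sets.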

Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.