The trick to successfully reaching out to a potential employer is to make sure that one’s resume stands out from the rest.
For an aspiring data scientist, it is imperative to do more than just acquire a specialisation in data science. Creating projects and providing innovative solutions arms an aspiring data scientist with the much-needed edge to propel a career in data science.
One of the best ways to build a strong portfolio in data science is to participate in popular data science challenges and, using the wide variety of data sets provided, produce projects that offer solutions to the problems posed.
AIM brings you 11 popular data science projects for aspiring data scientists.
Beginner
As a data scientist taking baby steps towards a career in data science, it is important to start with data sets with small amounts of data. These data sets provide the scope for training and gradually developing proficiency.
1) Titanic Data Set
As the name suggests (no points for guessing), this data set provides data on passengers who were aboard the RMS Titanic when it sank on 15 April 1912 after colliding with an iceberg in the North Atlantic Ocean. It is the data set most commonly used and referred to by beginners in data science. With 891 rows and 12 columns, it provides a combination of variables based on personal characteristics such as age, class of ticket and sex, and tests one’s classification skills.
Objective: Predict the survival of the passengers aboard RMS Titanic.
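As a rough illustration of what a first classification attempt can look like, the sketch below builds a group-wise majority baseline: it predicts the most common outcome for each value of a single column. The column names `Sex` and `Survived` are assumed to match the commonly distributed CSV version of the data, and the handful of rows is invented for illustration rather than taken from the real data set.

```python
from collections import Counter, defaultdict

# Invented toy rows (NOT real passenger records); 1 = survived, 0 = did not.
rows = [
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 1},
]


def fit_majority_by_group(rows, feature, target):
    """Learn the most common target value for each value of `feature`."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[target]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def accuracy(model, rows, feature, target):
    """Share of rows whose predicted label matches the recorded one."""
    hits = sum(model[row[feature]] == row[target] for row in rows)
    return hits / len(rows)


model = fit_majority_by_group(rows, "Sex", "Survived")
print(model, accuracy(model, rows, "Sex", "Survived"))
```

A baseline like this is only a yardstick: any model built later should at least beat it on data it has not seen.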
2) Boston Housing Data Set
Originally published in 1978 in a paper titled ‘Hedonic prices and the demand for clean air’, this data set contains data collected by the U.S. Census Service on housing in Boston, Massachusetts. It was collected for a study that aimed to ascertain whether the availability of clean air influenced the value of houses in Boston.
With only 506 rows and 14 columns, this is a small data set that invites a search for the best explanatory variables. It is very popular in pattern recognition literature and serves as a regression analysis problem.
Objective: Predict the median value of owner-occupied homes.
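To make the regression framing concrete, here is a minimal sketch of ordinary least squares with a single explanatory variable, written without any libraries. The numbers are invented for illustration (rooms per dwelling against home value in thousands of dollars) and are not rows from the actual data set; a real analysis would use several variables and a held-out test split.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Invented values: average rooms per dwelling and median value in $1000s.
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
value = [12.0, 17.0, 22.0, 27.0, 32.0]

intercept, slope = fit_line(rooms, value)
predicted = intercept + slope * 6.5  # estimate for a 6.5-room dwelling
print(intercept, slope, predicted)
```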
3) Walmart Sales Data Set
The retail industry is a front-runner in the large-scale use of data science. Retailers constantly seek to improve areas such as product placement, inventory management and customization of offers through the application of data science. Walmart is one such retailer.
This data set provides historical sales data for 45 Walmart stores, each of which has various departments. The goal is to predict the department-wise sales of each store using historical data spanning 143 weeks.
Walmart is also known for running promotional markdown events before major holidays such as Christmas, Thanksgiving and the Super Bowl, among others. Holiday weeks are weighted differently from regular weeks, and complete historical data is unavailable; together these make it harder to factor in the effect of the markdowns on sales during the holiday weeks. This is a regression analysis problem.
Objectives:
- Predict the sales across various departments in each store.
- Predict the effect of markdowns on the sales during the holiday seasons.
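The different weighting of holiday and regular weeks can be sketched as a weighted mean absolute error, in which mistakes on holiday weeks count for more. The function name, the sample figures and the default `holiday_weight=5.0` are assumptions made for illustration; check the rules of whichever challenge you enter for the exact weighting it uses.

```python
def weighted_mae(actual, predicted, is_holiday, holiday_weight=5.0):
    """Mean absolute error where holiday weeks count `holiday_weight` times."""
    weights = [holiday_weight if h else 1.0 for h in is_holiday]
    total = sum(w * abs(a - p) for w, a, p in zip(weights, actual, predicted))
    return total / sum(weights)


# Invented weekly sales for one department; the last week is a holiday week.
actual = [100.0, 120.0, 300.0]
predicted = [110.0, 120.0, 250.0]
is_holiday = [False, False, True]

score = weighted_mae(actual, predicted, is_holiday)
plain = weighted_mae(actual, predicted, is_holiday, holiday_weight=1.0)
print(score, plain)
```

Comparing the weighted score with the unweighted one shows how heavily a single poor holiday forecast can dominate the result.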
Intermediate
This is where the training wheels come off and it is time to face the open road. These data sets provide a higher level of complexity and difficulty, and help in building upon the solid basics acquired by working with simpler data sets.
4) Trip History Data Set
A well-known example of a trip history project is the Hubway Data Visualization Challenge. This data set comes from the Boston-based bicycle sharing service, Hubway. Originally launched in 2013, the competition sought a visualization of the company’s trip history from its official launch on 28 July 2011 to the end of September 2012. Variables within the data include duration, membership type, gender and destinations, among others.
The data provides an engaging exercise in data wrangling and serves as a classification problem.
Objective: Provide a visualization of the data (answer questions on user patterns).
5) Text Mining Data Set
In simple words, text mining means analysing data within text. Large amounts of unstructured data are found within natural language. Mining this unstructured data from sources such as e-mails, text messages and platforms like Facebook and Twitter can help companies gain business insights about customers, their patterns and their topics of interest.
Data sets from the famous competition, What’s Cooking?, can help you get started in the area of text mining. The goal is to use recipe ingredients to categorize cuisines.
Text mining data sets test skills on classification and clustering. Occasionally, regression analysis may be required.
Objective: Classification and categorisation based on tags or labels.
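One simple way to approach the cuisine task is a naive Bayes classifier over ingredient names. The sketch below is a minimal version with add-one smoothing; the recipes are invented toy examples, not records from the competition data, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Invented toy recipes as (cuisine, ingredients); not competition data.
recipes = [
    ("italian", ["tomato", "basil", "olive oil", "garlic"]),
    ("italian", ["pasta", "tomato", "parmesan"]),
    ("indian", ["cumin", "turmeric", "garlic", "onion"]),
    ("indian", ["garam masala", "onion", "tomato", "cumin"]),
]


def train(recipes):
    """Count recipes per cuisine and ingredient occurrences per cuisine."""
    cuisine_counts = Counter()
    ingredient_counts = defaultdict(Counter)
    for cuisine, ingredients in recipes:
        cuisine_counts[cuisine] += 1
        ingredient_counts[cuisine].update(ingredients)
    vocabulary = {i for c in ingredient_counts.values() for i in c}
    return cuisine_counts, ingredient_counts, vocabulary


def predict(model, ingredients):
    """Return the cuisine with the highest naive Bayes log score."""
    cuisine_counts, ingredient_counts, vocabulary = model
    total_recipes = sum(cuisine_counts.values())
    best, best_score = None, -math.inf
    for cuisine, n in cuisine_counts.items():
        counts = ingredient_counts[cuisine]
        denom = sum(counts.values()) + len(vocabulary)  # add-one smoothing
        score = math.log(n / total_recipes)
        score += sum(math.log((counts[i] + 1) / denom) for i in ingredients)
        if score > best_score:
            best, best_score = cuisine, score
    return best


model = train(recipes)
print(predict(model, ["cumin", "onion"]))
```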
6) Census Income Data Set
This data set contains extracted, weighted census data and has 41 employment- and demographic-related variables.
While the original table contained 199,523 rows and 42 columns, the newer, refined versions of the data set contain between 14 and 16 columns and more than 30,000 rows. It is a commonly cited data set for KNN (k-nearest neighbors) and is a classification problem.
Objective: Predict whether income exceeds $50,000 per year.
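Since KNN is so often paired with this data set, a bare-bones version is sketched below: it labels a new point by majority vote among its `k` closest training points. The two features (age and weekly working hours), the labels and all the points are invented for illustration. In practice the features would need scaling so that no single column dominates the distance.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k training points closest to `query`."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Invented points: (age, weekly hours) -> income bracket label.
train = [
    ((25, 20), "<=50K"),
    ((30, 35), "<=50K"),
    ((28, 40), "<=50K"),
    ((45, 50), ">50K"),
    ((50, 45), ">50K"),
    ((52, 60), ">50K"),
]

print(knn_predict(train, (48, 48)))
```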
7) MovieLens Data Set
Like the Titanic data set mentioned above, this is one of the most popular and commonly quoted data sets in data science. It provides the exciting opportunity to build one’s own movie recommendation engine and is available in many sizes.
The smallest set, meant for education and development, contains 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users, while the largest set meant for the same purpose contains 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users.
There is also a stable benchmark data set of 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users.
Objective: Make movie suggestions for users.
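A recommendation engine can start out very small. The sketch below is one possible user-based collaborative filter: it scores movies a user has not seen by the ratings of other users, weighted by how similar their tastes are. The ratings, user ids and movie ids are invented, and cosine similarity over shared movies is just one of several reasonable similarity choices.

```python
import math

# Invented ratings: user -> {movie: rating on a 1-5 scale}.
ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5},
    "u3": {"A": 1, "C": 5, "D": 4},
    "u4": {"B": 4, "A": 5},
}


def cosine(a, b):
    """Cosine similarity of two users over the movies both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=1):
    """Rank unseen movies by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        if sim <= 0:
            continue  # nothing in common, so this user carries no weight
        for movie, rating in their.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
                weights[movie] = weights.get(movie, 0.0) + sim
    ranked = sorted(scores, key=lambda m: scores[m] / weights[m], reverse=True)
    return ranked[:top_n]


print(recommend(ratings, "u2"))
```

Note that two users who share only one movie always get a similarity of 1 here, which is one reason real systems require a minimum overlap or use mean-centred ratings.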
Advanced
This is where an aspiring data scientist makes the final push into the big leagues. After acquiring the necessary basics and honing them in the first two levels, it is time to confidently play the big game. These data sets provide a platform for putting all that learning to use and taking on new, more complex challenges.
8) Yelp Data Set
This data set is part of the Yelp Dataset Challenge conducted by the crowd-sourced review platform Yelp. It is a subset of Yelp’s data on businesses, reviews and users, provided by the platform for educational and academic purposes.
In 2017, the tenth round of the Yelp Dataset Challenge was held and the data set contained information about local businesses in 12 metropolitan areas across 4 countries.
Rich data comprising 4,700,000 reviews, 156,000 businesses and 200,000 pictures provides an ideal source for multi-faceted data projects. Natural language processing and sentiment analysis, photo classification and graph mining, among others, are some of the projects that can be carried out using this diverse data set. The data set is available in JSON and SQL formats.
Objective: Provide insights for operational improvements using the data available.
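For the JSON version, a reasonable first step is to stream the reviews and aggregate them, rather than loading everything into memory at once. The sketch below assumes one JSON object per line with `business_id` and `stars` fields; that layout, the field names and the sample records are assumptions for illustration, so check them against the files you actually download.

```python
import io
import json
from collections import defaultdict

# Invented records standing in for a review file with one JSON object per line.
sample = io.StringIO(
    '{"business_id": "b1", "stars": 5, "text": "Great coffee"}\n'
    '{"business_id": "b1", "stars": 3, "text": "Slow service"}\n'
    '{"business_id": "b2", "stars": 4, "text": "Nice patio"}\n'
)


def average_stars(lines):
    """Stream JSON lines and return the mean star rating per business."""
    totals = defaultdict(lambda: [0, 0])  # business_id -> [sum, count]
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        review = json.loads(line)
        entry = totals[review["business_id"]]
        entry[0] += review["stars"]
        entry[1] += 1
    return {b: s / n for b, (s, n) in totals.items()}


averages = average_stars(sample)
print(averages)
```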
9) Chicago Crime Data Set
With the increasing demand to analyse large amounts of data within short time frames, organisations prefer working with the full data directly rather than with samples. This presents a herculean task for a data scientist working under time constraints.
Extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system, this data set contains information on reported incidents of crime in the city of Chicago from 2001 to the present, excluding the most recent seven days. Records are kept per incident, except for murders, where data is recorded for each victim.
It contains 6.51 million rows and 22 columns, and is a multi-class classification problem. To achieve mastery over working with abundant data, this data set can serve as an ideal stepping stone towards tackling mountainous data.
Objective: Explore the data, and provide insights and forecasts about crimes in Chicago.
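With millions of rows, it helps to explore the data one row at a time instead of loading the whole file. The sketch below tallies the values of one column while streaming a CSV. The column name `Primary Type` and the sample rows are assumptions for illustration; a real export would be opened with `open(path, newline="")` in place of the in-memory sample.

```python
import csv
import io
from collections import Counter

# Invented rows standing in for a much larger CSV export.
sample = io.StringIO(
    "ID,Date,Primary Type\n"
    "1,01/01/2015 10:00:00 AM,THEFT\n"
    "2,01/01/2015 11:30:00 AM,BATTERY\n"
    "3,01/02/2015 09:15:00 PM,THEFT\n"
)


def count_by_column(handle, column):
    """Tally the values of one column while reading a row at a time."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[row[column]] += 1
    return counts


counts = count_by_column(sample, "Primary Type")
print(counts.most_common(1))
```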
10) KDD Cup
Organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), the KDD Cup is a popular data mining and knowledge discovery competition held annually. It is considered to be the first-ever data science competition and dates back to 1997.
With a different problem every year, the KDD Cup gives data scientists an opportunity to work with data sets across different disciplines. Problems tackled in the past include identifying which authors correspond to the same person, predicting the click-through rate of ads from given query and user information, and developing algorithms for Computer Aided Detection (CAD) of early-stage breast cancer, among others.
The latest edition of the challenge was held in 2017 and required participants to predict the traffic flow through highway tollgates.
Objective: Solve or make predictions for the problem presented every year.
11) ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
ILSVRC makes for a compelling challenge: creating the best algorithm for object detection and image classification at large scale. Held annually, the competition primarily aims to compare progress in image detection and classification, and to bring good research together with more data. It also aims to measure the progress made in indexing for retrieval and annotation by computer vision.
This challenge assesses algorithms for object detection and localization from videos and images, and for scene parsing and classification, on a large scale. Every year, the challenge sees modifications such as the addition of new images and categories. The visual resource available consists of over 475,000 objects for classification from over 450,000 images that have been gathered from Flickr and other search engines.
Since its inception in 2010, the competition had been run by ImageNet. However, the latest edition, in 2017, was hosted on Kaggle.
Objectives:
- Object localization
- Object detection from videos
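Localization and detection results are commonly judged by how much a predicted bounding box overlaps the labelled one. A minimal sketch of the intersection-over-union measure follows, assuming boxes are given as `(x_min, y_min, x_max, y_max)` tuples with positive area; the box format and the sample coordinates are illustrative assumptions, not the challenge's own tooling.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[2], box_b[2])
    bottom = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0, right - left) * max(0, bottom - top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))
print(overlap)
```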