Many experts say that data is as precious as oil in this century, which makes free, simple datasets for analytics projects just as important. As any beginner will tell you, first projects go a long way toward kick-starting a career in analytics.
We have listed 16 dataset sources below, from which any beginner can pick out relevant data for his or her projects. And the best part is that they are all free.
- Government of India: Data.gov.in is a portal that supports the ‘Open Data Initiative’ undertaken by the Government of India. This joint initiative of the Governments of India and the US enables the Ministries to publish the datasets they collect for public use.
- Twitter: The Twitter dataset contains tweets and user details, which makes it a good basis for interesting analysis. It can be used, for example, to identify user communities and to suggest suitable accounts to follow according to the genre of tweets.
- Google N-Grams: If you’re interested in truly massive data, the Google n-grams dataset counts the frequency of words and phrases by year across a huge number of text sources. The resulting file is 2.2 TB.
- Amazon: Amazon Web Services datasets can be analyzed in the cloud using EC2 and Hadoop via EMR. Amazon Web Services provides an entire toolkit for analyzing data at any scale.
- YouTube: This is a video dataset consisting of millions of YouTube video IDs and associated labels from a diverse vocabulary of over 4700 visual entities. Its main purpose is to accelerate research on video understanding.
- BuzzFeed News: Surprisingly, the website famous for its extensive coverage of celebrities and pop culture makes the datasets used in its articles available on GitHub.
- Kaggle: Kaggle has created an array of high-quality public datasets, known as Kaggle Datasets, that can be accessed and analysed without downloading them. Work done in Kaggle is saved and published publicly by default, which enables newcomers to modify the work done by other data scientists.
- Socrata OpenData: This is a platform offering multiple clean datasets that can be explored in the browser or downloaded to work on; no registration is required. It lets you use visualization and exploration tools in the browser and choose from hundreds of open data catalogs.
- Bitbucket: This web-based hosting service, owned by Atlassian, is written in Python and uses the Django web framework. It allows unlimited public repositories for all users, and private repositories are free for up to five users.
- GitHub: GitHub, one of the largest web-based hosting services for development projects, offers its services free of cost for public repositories. In addition to an integrated issue tracker within each project, GitHub supports over 200 programming languages.
- UCI: The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
- Amazon Reviews: This dataset consists of product reviews and metadata from Amazon that can be used by researchers for analytics projects. It contains 142.8 million reviews spanning May 1996 to July 2014.
- Google BigQuery Public Datasets: The public datasets listed in the BigQuery documentation are datasets that Google BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via a project. You pay only for the queries that you perform on the data (the first 1 TB per month is free, subject to query pricing details).
- World Bank: The World Bank is a global development organization that offers loans and advice to developing countries. The World Bank regularly funds programs in developing countries, then gathers data to monitor the success of these programs. You can browse World Bank datasets directly, without registering. The datasets have many missing values, and it sometimes takes several clicks to actually reach the data.
- Reserve Bank of India: The data available from the Reserve Bank of India includes several metrics on money market operations, balance of payments, the use of banking, and several products. A must-visit site if you come from the BFSI domain in India.
- Ministry of Statistics and Programme Implementation: The MOSPI has a collection of varied datasets, ranging from the Statistical Yearbook, ASI summaries, to National Accounts Data, for data analysts to pore over.
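
To make the Google N-Grams entry above more concrete, here is a minimal Python sketch of what "counting the frequency of words and phrases by year" can look like on a tiny scale. The function name `ngram_counts_by_year` and the toy corpus are hypothetical and only for illustration; this is not how Google builds its dataset, just one simple way a beginner might reproduce the idea on their own text.

```python
from collections import Counter


def ngram_counts_by_year(docs, n=2):
    """Count n-grams per year from an iterable of (year, text) pairs.

    Tokenization is a naive lowercase whitespace split (an assumption
    made for brevity; real corpora need more careful handling).
    """
    counts = {}
    for year, text in docs:
        tokens = text.lower().split()
        # Slide a window of length n over the tokens.
        grams = zip(*(tokens[i:] for i in range(n)))
        counts.setdefault(year, Counter()).update(" ".join(g) for g in grams)
    return counts


# Hypothetical toy corpus: (publication year, text)
sample_docs = [
    (1990, "open data is useful"),
    (1990, "open data is free"),
    (2000, "big data is everywhere"),
]

bigrams = ngram_counts_by_year(sample_docs, n=2)
print(bigrams[1990].most_common(2))
```

Each year maps to a `Counter`, so a phrase that never appears in a given year simply reports a count of zero. The same per-year grouping idea carries over to much larger data, although at that scale it would need a distributed tool rather than an in-memory dictionary.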