Last updated October 7, 2021

11 Open Source Datasets That Can Be Used For Health Science Projects

Published on March 28, 2019

by Ram Sagar

Machine learning is now widely deployed across the health sector because it can make real-time predictions and surface insights that usually go unnoticed in voluminous, unstructured datasets. Here are a few repositories that have grown over the years, thanks to researchers' continued efforts to make crucial data publicly available so that anyone can try it out on their own models:

WHO (World Health Organisation)

WHO’s data is as authoritative as it gets when it comes to tracking the health of nations. Its open data source covers categories including child nutrition, neglected diseases and the risk factors associated with certain diseases, among others.

The data is available in Excel format.

OGD Platform India

This website brings together data collected from Indian health agencies and other entities. The categories in the catalogue range from primary health in tribal regions to state-wise health reports.

A keyword search option gives access to numerous well-curated resources.

Kaggle- Health Analytics

The dataset consists of 26 indicators like acute illness, chronic illness, immunisation, mortality and others. These indicators, in turn, have sub-categories which cover all the attributes.

The survey was conducted in the eight Empowered Action Group (EAG) states (Uttarakhand, Rajasthan, Uttar Pradesh, Bihar, Jharkhand, Odisha, Chhattisgarh and Madhya Pradesh) and in Assam.

The dataset covers a population of 21 million and 4.32 million households spread across the rural and urban areas of these 9 states.

These benchmarks should support a better, more holistic understanding and timely monitoring of the determinants of the population's health and well-being, particularly reproductive and child health.

Heart Disease Data Set

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers. The “goal” field refers to the presence of heart disease in the patient. It is integer-valued, from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0).
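The presence-versus-absence framing described above amounts to collapsing the 0 to 4 "goal" value into a binary label. Here is a minimal sketch of that mapping in Python; the function name `binarize_goal` and the sample values are hypothetical and are not taken from the dataset itself.

```python
def binarize_goal(goal):
    """Map the 0-4 'goal' value to 0 (absence) or 1 (presence)."""
    if goal not in range(5):
        raise ValueError(f"goal must be an integer from 0 to 4, got {goal!r}")
    return 1 if goal > 0 else 0


# Hypothetical goal values for illustration; not real patient records.
sample_goals = [0, 2, 1, 0, 4, 3]
binary_labels = [binarize_goal(g) for g in sample_goals]
print(binary_labels)  # [0, 1, 1, 0, 1, 1]
```

Treating the problem as binary discards the severity information in values 1 to 4, which is the simplification the published experiments are described as making.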

Brainomics

Project Brainomics provides the technical foundation for this database, based on a semantic web framework, bringing together imaging, genetics and questionnaire data.

OpenfMRI

OpenfMRI.org is a project dedicated to the free and open sharing of raw magnetic resonance imaging (MRI) datasets.

Number of currently available datasets: 95

Number of subjects across all datasets: 3,372

Mental Disorders

This data was collected via Collaborative Psychiatric Epidemiology Surveys (CPES) which were initiated in recognition of the need for contemporary, comprehensive epidemiological data regarding the distributions, correlates and risk factors of mental disorders.

The objective of the CPES was to collect data about the prevalence of mental disorders, impairments associated with these disorders, and their treatment patterns from representative samples of majority and minority adult populations in the United States.

Pima-indians-diabetes

This dataset describes the medical records for Pima Indians and whether or not each patient will have an onset of diabetes within five years.

The field descriptions follow:

preg = Number of times pregnant

plas = Plasma glucose concentration at 2 hours in an oral glucose tolerance test

pres = Diastolic blood pressure (mm Hg)

skin = Triceps skin fold thickness (mm)

test = 2-Hour serum insulin (mu U/ml)

mass = Body mass index (weight in kg/(height in m)^2)

pedi = Diabetes pedigree function

age = Age (years)

class = Class variable (1:tested positive for diabetes, 0: tested negative for diabetes)
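The field list above can be turned into a small loader. The sketch below uses only the Python standard library and assumes the data arrives as a headerless CSV with columns in the order listed; that layout is an assumption, so check it against the copy you download. The two sample rows are made up for illustration and are not real records.

```python
import csv
import io

# Column names taken from the field descriptions above.
FIELD_NAMES = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]

# Made-up rows for illustration only; not taken from the real dataset.
SAMPLE_CSV = """\
2,120,70,30,100,28.5,0.45,35,0
5,165,82,33,180,34.2,0.62,51,1
"""


def read_records(text):
    """Parse headerless CSV text into a list of dicts keyed by field name."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELD_NAMES)
    records = []
    for row in reader:
        record = {name: float(value) for name, value in row.items()}
        record["class"] = int(record["class"])  # 1 = positive, 0 = negative
        records.append(record)
    return records


records = read_records(SAMPLE_CSV)
# Split each record into the eight input features and the class label.
features = [[r[name] for name in FIELD_NAMES[:-1]] for r in records]
labels = [r["class"] for r in records]
print(labels)  # [0, 1]
```

To read a real file, pass its contents to `read_records`, for example via `open(path).read()`, where `path` is wherever you saved the data.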
