11 Open Source Datasets That Can Be Used For Health Science Projects

Machine learning is now widely deployed across the health sector because it can make real-time predictions and draw out insights that usually go unnoticed in voluminous, unstructured datasets. Here are a few repositories that have been built up over the years, thanks to the sustained efforts of researchers to make crucial data available to the public so that anyone can try it out on their own models:

WHO (World Health Organisation)

WHO’s data is as authoritative as it gets when it comes to tracking the health of nations. Its open data repository covers categories such as child nutrition, neglected diseases and risk factors for specific diseases, among others.

The data is available in Excel format.

OGD Platform India

This website hosts data collected from Indian health agencies and other entities. The categories in the catalogue range from primary health in tribal regions to state-wise health reports.

The catalogue can be searched by keyword to find numerous well-curated resources.

Kaggle: Health Analytics

The dataset consists of 26 indicators like acute illness, chronic illness, immunisation, mortality and others. These indicators, in turn, have sub-categories which cover all the attributes.

The survey was conducted in the eight Empowered Action Group (EAG) states (Uttarakhand, Rajasthan, Uttar Pradesh, Bihar, Jharkhand, Odisha, Chhattisgarh and Madhya Pradesh) and in Assam.

This dataset covers a population of 21 million and 4.32 million households spread across the rural and urban areas of these 9 states.

These benchmarks help build a better, more holistic understanding of the determinants of the population's health and well-being, particularly Reproductive and Child Health, and allow them to be monitored in a timely way.

Heart Disease Data Set

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
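The presence-versus-absence framing described above can be sketched in a few lines of Python. This is only an illustration: the helper name `to_binary_target` and the sample values are hypothetical and are not taken from the dataset itself.

```python
def to_binary_target(goal: int) -> int:
    """Collapse the 0-4 "goal" value into absence (0) or presence (1)."""
    if goal not in range(5):
        raise ValueError(f"goal must be an integer from 0 to 4, got {goal!r}")
    return 0 if goal == 0 else 1


# Hypothetical goal values, made up for illustration only.
sample_goals = [0, 2, 1, 0, 4, 3]
binary_labels = [to_binary_target(g) for g in sample_goals]
print(binary_labels)
```

Values 1 to 4 all map to the same positive class, so the task becomes a plain binary classification problem.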

Brainomics

Project Brainomics provides the technical foundation for this database, based on a semantic web framework, bringing together imaging, genetics and questionnaire data.

OpenfMRI

OpenfMRI.org is a project dedicated to the free and open sharing of raw magnetic resonance imaging (MRI) datasets.

Number of currently available datasets: 95

Number of subjects across all datasets: 3,372

Mental Disorders

This data was collected via the Collaborative Psychiatric Epidemiology Surveys (CPES), which were initiated in recognition of the need for contemporary, comprehensive epidemiological data on the distributions, correlates and risk factors of mental disorders.

The objective of the CPES was to collect data about the prevalence of mental disorders, impairments associated with these disorders, and their treatment patterns from representative samples of majority and minority adult populations in the United States.

Pima-indians-diabetes

This dataset describes the medical records of Pima Indians and whether or not each patient had an onset of diabetes.

The field descriptions follow:

preg = Number of times pregnant

plas = Plasma glucose concentration at 2 hours in an oral glucose tolerance test

pres = Diastolic blood pressure (mm Hg)

skin = Triceps skin fold thickness (mm)

test = 2-Hour serum insulin (mu U/ml)

mass = Body mass index (weight in kg/(height in m)^2)

pedi = Diabetes pedigree function

age = Age (years)

class = Class variable (1:tested positive for diabetes, 0: tested negative for diabetes)
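As a rough sketch of how the fields above might be attached to the data, the snippet below parses a header-less CSV using only the Python standard library. The header-less, comma-separated layout, the column order, the two sample rows and the helper name `load_records` are all assumptions made for illustration; check them against the file you actually download.

```python
import csv
import io

# Field names from the list above, in the order they are described.
FIELDS = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]

# Illustrative rows only, in an assumed header-less CSV layout.
SAMPLE_CSV = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
"""


def load_records(text):
    """Parse header-less CSV text into a list of dicts keyed by FIELDS."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    records = []
    for row in reader:
        record = {name: float(value) for name, value in row.items()}
        record["class"] = int(record["class"])  # 1 = positive, 0 = negative
        records.append(record)
    return records


records = load_records(SAMPLE_CSV)
features = [[r[name] for name in FIELDS[:-1]] for r in records]  # 8 inputs
labels = [r["class"] for r in records]                            # target
print(len(features[0]), labels)
```

Keeping the eight input fields separate from the class variable makes it straightforward to hand `features` and `labels` to whichever model is being tried.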

CT Medical Images

The dataset is designed to let different methods be tested for examining trends in CT image data associated with the use of contrast and with patient age. The data are a tiny subset of the images in The Cancer Imaging Archive (TCIA).

Malaria Datasets

A repository of segmented cells from the thin blood smear slide images from the Malaria Screener research activity.

The dataset contains a total of 27,558 cell images with equal instances of parasitised and uninfected cells.
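Because the two classes are the same size, a train/test split can keep that balance by sampling each class separately. The sketch below works on image indices rather than real files; the function name `stratified_split`, the 20% test fraction and the fixed seed are assumptions chosen for illustration.

```python
import random

TOTAL_IMAGES = 27_558
PER_CLASS = TOTAL_IMAGES // 2  # equal instances of each class


def stratified_split(per_class, test_fraction, seed=0):
    """Split each class's image indices into train and test separately."""
    rng = random.Random(seed)
    n_test = int(per_class * test_fraction)
    split = {}
    for label in ("parasitised", "uninfected"):
        indices = list(range(per_class))
        rng.shuffle(indices)
        split[label] = {"test": indices[:n_test], "train": indices[n_test:]}
    return split


split = stratified_split(PER_CLASS, test_fraction=0.2)
for label, parts in split.items():
    print(label, len(parts["train"]), len(parts["test"]))
```

Each class contributes the same number of images to the test set, so both the training and test portions stay balanced.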

Mental Health in Tech Survey

This data was collected with the aim of measuring mental health in the tech workplace and examining the frequency of mental health disorders among tech workers.

PS: The story was written using a keyboard.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.