Machine learning is now widely deployed across the health sector because it can make real-time predictions and surface insights that usually go unnoticed in large, unstructured datasets. Here are a few repositories that have been built up over the years, thanks to the sustained efforts of researchers to make crucial data available to the public so that anyone can try it out on their own models:
WHO (World Health Organisation)
WHO’s data is as authentic as it gets when it comes to tracking the health of all nations. Its open data source covers categories including child nutrition, neglected diseases, and risk factors for certain diseases, among others.
The data is available in Excel format.
OGD Platform India
This website hosts data collected from Indian health agencies and other entities. The categories in the catalogue range from primary health in tribal regions to state-wise health reports.
A keyword search option helps locate the many well-curated resources.
Kaggle: Health Analytics
The dataset consists of 26 indicators like acute illness, chronic illness, immunisation, mortality and others. These indicators, in turn, have sub-categories which cover all the attributes.
The survey was conducted in the eight Empowered Action Group (EAG) states (Uttarakhand, Rajasthan, Uttar Pradesh, Bihar, Jharkhand, Odisha, Chhattisgarh and Madhya Pradesh) and in Assam.
This dataset covers a population of 21 million and 4.32 million households spread across the rural and urban areas of these 9 states.
These benchmarks support a better, more holistic understanding and timely monitoring of the determinants of the well-being and health of the population, particularly Reproductive and Child Health.
Heart Disease Data Set
This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
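The presence-versus-absence framing described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular published experiment: the function name `binarize_goal` and the sample values are hypothetical, and only the 0 to 4 encoding comes from the dataset description.

```python
def binarize_goal(goal: int) -> int:
    """Collapse the 0-4 'goal' field into absence (0) vs. presence (1)."""
    if goal not in range(5):
        raise ValueError(f"goal must be an integer from 0 to 4, got {goal!r}")
    # 0 means no heart disease; 1, 2, 3 and 4 all count as presence.
    return 0 if goal == 0 else 1


# Hypothetical example values, not taken from the actual dataset.
sample_goals = [0, 2, 4, 1, 0, 3]
binary_labels = [binarize_goal(g) for g in sample_goals]
```

Collapsing the target this way turns the task into ordinary binary classification, at the cost of discarding the distinction between values 1 through 4.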
Brainomics
Project Brainomics provides the technical foundation for this database, based on a semantic web framework, bringing together imaging, genetics and questionnaire data.
OpenfMRI
OpenfMRI.org is a project dedicated to the free and open sharing of raw magnetic resonance imaging (MRI) datasets.
Number of currently available datasets: 95
Number of subjects across all datasets: 3,372
Mental Disorders
This data was collected via the Collaborative Psychiatric Epidemiology Surveys (CPES), which were initiated in recognition of the need for contemporary, comprehensive epidemiological data regarding the distributions, correlates and risk factors of mental disorders.
The objective of the CPES was to collect data about the prevalence of mental disorders, impairments associated with these disorders, and their treatment patterns from representative samples of majority and minority adult populations in the United States.
Pima-indians-diabetes
This dataset describes the medical records of Pima Indians and whether or not each patient will have an onset of diabetes.
Field descriptions follow:
preg = Number of times pregnant
plas = Plasma glucose concentration at 2 hours in an oral glucose tolerance test
pres = Diastolic blood pressure (mm Hg)
skin = Triceps skin fold thickness (mm)
test = 2-Hour serum insulin (mu U/ml)
mass = Body mass index (weight in kg/(height in m)^2)
pedi = Diabetes pedigree function
age = Age (years)
class = Class variable (1:tested positive for diabetes, 0: tested negative for diabetes)
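As a rough sketch of how the fields above might be attached to the raw records, the snippet below assumes the data arrives as headerless comma-separated text with columns in the order listed; that layout is an assumption, so check it against the copy you download. The two sample rows are made-up illustrative values, not real patient records, and the helper names `parse_rows` and `bmi` are hypothetical.

```python
import csv
import io

# Field names in the order given in the description above.
FIELDS = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]
FLOAT_FIELDS = {"mass", "pedi"}  # the remaining fields are treated as integers


def parse_rows(text):
    """Parse headerless CSV text into dicts keyed by the field names."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    return [
        {key: float(value) if key in FLOAT_FIELDS else int(value)
         for key, value in raw.items()}
        for raw in reader
    ]


def bmi(weight_kg, height_m):
    """Body mass index as defined for 'mass': weight in kg / (height in m)^2."""
    return weight_kg / height_m ** 2


# Made-up rows for illustration only.
SAMPLE = "2,120,70,30,100,28.5,0.45,35,0\n5,160,80,35,150,34.2,0.60,50,1\n"
rows = parse_rows(SAMPLE)
positives = [row for row in rows if row["class"] == 1]
```

Keeping the names in a single `FIELDS` list makes it straightforward to reuse the same labels when selecting features or the `class` target later.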
CT Medical Images
The dataset is designed to allow different methods to be tested for examining trends in CT image data associated with contrast use and patient age. The data are a tiny subset of images from The Cancer Imaging Archive (TCIA).
Malaria Datasets
A repository of segmented cells from the thin blood smear slide images collected in the Malaria Screener research activity.
The dataset contains a total of 27,558 cell images with equal instances of parasitised and uninfected cells.
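A quick arithmetic check shows what the equal split implies for each class. This is only a sketch based on the total quoted above; the variable names are illustrative.

```python
TOTAL_IMAGES = 27_558  # total cell images, as stated above
NUM_CLASSES = 2        # parasitised and uninfected

# With equal instances per class, the total divides evenly.
per_class, remainder = divmod(TOTAL_IMAGES, NUM_CLASSES)

# Accuracy of a classifier that always predicts a single class.
constant_baseline = per_class / TOTAL_IMAGES
```

Because the classes are balanced, a model that always predicts one class scores 50% accuracy, so plain accuracy is a reasonable first metric here.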
Mental Health in Tech Survey
This data was collected to measure mental health in the tech workplace and to examine the frequency of mental health disorders among tech workers.