How To Avoid Bias In Data Collection

Data collection is one of the most crucial parts of building a machine learning model, because the model's behaviour depends entirely on the data supplied for training and testing. This is the “Garbage In, Garbage Out” principle: if we feed in bad data, we get bad output.

For instance, in 2016 Microsoft released the Tay.ai bot on Twitter, and it was embroiled in controversy when it began to post offensive tweets from its account. As a result, the tech giant had to shut down the chatbot only 16 hours after its launch. In this article, we list a few points to consider in order to avoid bias.

Understand The Purpose

Knowing what you want to do with your data, and more fundamentally what purpose it serves in your specific project, is crucial. Develop a clear understanding of the data requirements before you take any further step in collecting data.

Collect Data Objectively

Objective data collection is the process of obtaining data relating to the client’s problem, and it depends mainly on the purpose and stage of the characterisation. It should be focused, clear and project-specific, as this will help you determine which type of data you will be using: quantitative or qualitative.

Design An Easy To Use Interface

An easy-to-use interface is needed when collecting data. User interface design focuses on anticipating what users might need to do and on ensuring that the interface contains elements that are easy to understand and access. Keep the interface simple, purposeful and consistent.

Avoid Missing Values

It is crucial to pay attention to missing values while collecting data. Missing data is usually classified as Missing Completely at Random (MCAR), Missing at Random (MAR) or Missing Not at Random (MNAR). It is best to avoid missing values during data collection, but this may not always be an option. For instance, if you remove observations whose values are MNAR, you can introduce bias, so be very careful before removing any observations. Record sampling is a technique in which you remove records with missing values, or records with less representative values, in order to make predictions more accurate.
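
To see why removing MNAR observations can produce bias, here is a minimal sketch in Python. The income figures and the rule that the highest earners decline to answer are invented purely for illustration, and `drop_missing` is a hypothetical helper, not part of any library.

```python
from statistics import mean

def drop_missing(values):
    """Listwise deletion: keep only the observed (non-None) values."""
    return [v for v in values if v is not None]

# Hypothetical survey: true incomes (in thousands) of ten respondents.
true_incomes = [20, 25, 30, 35, 40, 45, 50, 80, 120, 200]

# MNAR: whether an answer is missing depends on the value itself.
# In this made-up case, everyone earning 80 or more declines to answer.
reported = [v if v < 80 else None for v in true_incomes]

true_mean = mean(true_incomes)                 # mean of the full population
observed_mean = mean(drop_missing(reported))   # mean after dropping gaps
missing_rate = reported.count(None) / len(reported)

print(f"missing rate:  {missing_rate:.0%}")
print(f"true mean:     {true_mean}")
print(f"observed mean: {observed_mean}")
```

Dropping the incomplete records here lowers the estimated mean from 64.5 to 35, because the records that disappear are not a random subset of the sample.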

Data Imputation

Data imputation means replacing the missing values with estimated values and then analysing the full data set as if the imputed values were actual observed values. There are different kinds of imputation, such as mean imputation (the mean of the observed values for that variable, taken over all individuals for whom it is non-missing), substitution (the value from a new individual who was not selected to be in the sample), hot-deck imputation (a randomly chosen value from an individual in the sample who has similar values on other variables), cold-deck imputation (a systematically chosen value from an individual who has similar values on other variables), regression imputation (the predicted value obtained by regressing the missing variable on other variables), stochastic regression imputation (the predicted value from a regression plus a random residual value), etc.
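
Two of these methods, mean imputation and hot-deck imputation, can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the sample records and the choice of "same city" as the similarity rule are assumptions made for the example.

```python
import random
from statistics import mean

def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def hot_deck_impute(rows, target, match_on, rng):
    """Fill a missing `target` with the value of a randomly chosen donor
    row that has the same value for `match_on` (the similarity rule)."""
    completed = []
    for row in rows:
        if row[target] is None:
            donors = [r[target] for r in rows
                      if r[target] is not None and r[match_on] == row[match_on]]
            # Leave the gap in place when no similar donor exists.
            value = rng.choice(donors) if donors else None
            row = {**row, target: value}  # copy, so the input is not mutated
        completed.append(row)
    return completed

# Mean imputation: the observed ages 23, 31 and 27 have a mean of 27.
ages = [23, None, 31, 27, None]
ages_filled = mean_impute(ages)

# Hot-deck imputation: the donor must come from the same (hypothetical) city.
rows = [
    {"city": "Pune", "salary": 50},
    {"city": "Pune", "salary": 54},
    {"city": "Delhi", "salary": 70},
    {"city": "Pune", "salary": None},
]
rows_filled = hot_deck_impute(rows, "salary", "city", random.Random(0))
```

Mean imputation is simple, but it shrinks the variance of the variable, since every gap receives the same value. Hot-deck imputation keeps more of the natural spread because each gap is filled with a value that was actually observed.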

Feature Scaling

Feature scaling is applied to the independent variables, or features, of the data in order to bring them within a particular range. It is used to adjust data measured on different scales so that features with larger magnitudes do not dominate and introduce bias. The common techniques are standardisation and normalisation: the former transforms the data to have a mean of 0 and a variance of 1, and the latter is used when the values need to be bounded between two numbers.
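
Both techniques can be sketched with the Python standard library. The function names and the sample heights are assumptions made for illustration, and the sketch uses the population standard deviation, which is one of several reasonable choices.

```python
from statistics import mean, pstdev

def standardise(values):
    """Rescale values to mean 0 and variance 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("cannot standardise a constant feature")
    return [(v - mu) / sigma for v in values]

def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly so they lie between `low` and `high`."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        raise ValueError("cannot scale a constant feature")
    return [low + (v - v_min) * (high - low) / span for v in values]

heights_cm = [150, 160, 170, 180, 190]

z_scores = standardise(heights_cm)
scaled = min_max_scale(heights_cm)

print(z_scores)
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

In practice, the mean, standard deviation, minimum and maximum are usually computed on the training data only and then reused to scale the test data, so that information from the test set does not leak into the model.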

Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.