Data collection is the most crucial part of building a machine learning model, because the model's behaviour depends entirely on the data supplied for training and testing. This is the "garbage in, garbage out" principle: if we feed the model bad data, we will get bad output.
For instance, in 2016 Microsoft's Tay.ai chatbot, released via Twitter, became embroiled in controversy when it began posting offensive tweets through its Twitter account. As a result, the tech giant had to shut the chatbot down only 16 hours after its launch. In this article, we list a few points you can follow in order to avoid bias.
Understand The Purpose
Knowing what you actually want to do with your data, and more fundamentally the purpose it serves in your specific project, is crucial. You should develop a clear understanding of the data requirements before taking any further step towards collecting data.
Collect Data Objectively
Objective data collection is the process of gathering data that relate to the problem at hand, and it depends mainly on the purpose and stage of the project. It should be focused, clear and project-specific, as this will help you determine which type of data you will be using: quantitative or qualitative.
Design An Easy To Use Interface
An easy-to-use interface matters when collecting data, because user interface design focuses on anticipating what users might need to do and ensuring that all the elements of the interface are easy to understand and access. Keep the interface simple, purposeful and consistent.
Avoid Missing Values
It is crucial to watch for missing values in the data while collecting it. Missingness is usually classified as Missing at Random (MAR), Missing Completely at Random (MCAR) or Missing Not at Random (MNAR). Ideally you would avoid missing values during data collection, but this is not always possible. Removing observations that are MNAR, for instance, can introduce bias, so you have to be very careful before removing any observations. Record sampling is the technique of removing records with missing values, or records with less representative values, in order to make predictions more accurate.
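As a minimal sketch of record sampling, the snippet below drops any record that contains a missing value. The data format (a list of dicts with None marking a missing entry) and the field names are hypothetical; note that this kind of listwise deletion is safe under MCAR but can bias results when the data are MNAR.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},   # missing age
    {"age": 29, "income": 61000},
]

# Record sampling via listwise deletion: keep only complete records.
# Reasonable under MCAR, but risky under MNAR, where the missingness
# itself carries information.
complete = [r for r in records if all(v is not None for v in r.values())]
```

Before deleting, it is worth checking what fraction of records you would lose; dropping a large share of the data can hurt more than imputing it.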
Data imputation means replacing missing values with estimated ones and then analysing the full data set as if the imputed values were actual observations. There are several kinds of imputation: mean imputation (replace a missing value with the mean of the observed values of that variable across non-missing individuals), substitution (impute the value from a new individual who was not selected for the sample), hot-deck imputation (a randomly chosen value from an individual in the sample with similar values on other variables), cold-deck imputation (a systematically chosen value from an individual with similar values on other variables), regression imputation (a value predicted by regressing the missing variable on the others), stochastic regression imputation (a regression prediction plus a random residual), etc.
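The simplest of these, mean imputation, can be sketched in a few lines. The toy values below are made up for illustration; each None is replaced by the mean of the observed values of that variable.

```python
from statistics import mean

# Hypothetical variable with missing entries marked as None.
values = [23.0, None, 31.0, None, 26.0]

# Mean imputation: compute the mean over the observed (non-missing)
# values, then substitute it for every missing entry.
observed = [v for v in values if v is not None]
fill = mean(observed)
imputed = [fill if v is None else v for v in values]
```

A known drawback of this approach is that it shrinks the variable's variance, since every imputed entry sits exactly at the mean; stochastic regression imputation addresses this by adding a random residual.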
Apply Feature Scaling
Feature scaling is applied to the independent variables (features) of the data in order to bring them within a particular range. It adjusts features that sit on different scales and so helps avoid bias. The common techniques are standardisation and normalisation: the former transforms the data to have zero mean and unit variance, while the latter is used when the values need to be bound between two numbers.
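Both techniques can be sketched as small functions; the sample data below are invented for illustration, and the population standard deviation is assumed for the z-score.

```python
from statistics import mean, pstdev

def standardise(xs):
    # z-score standardisation: subtract the mean, divide by the
    # (population) standard deviation -> zero mean, unit variance.
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

def normalise(xs, lo=0.0, hi=1.0):
    # Min-max normalisation: linearly rescale so the smallest value
    # maps to lo and the largest to hi (here, 0 and 1 by default).
    mn, mx = min(xs), max(xs)
    return [lo + (x - mn) * (hi - lo) / (mx - mn) for x in xs]

data = [10.0, 20.0, 30.0, 40.0]
z = standardise(data)   # zero mean, unit variance
n = normalise(data)     # values bound between 0 and 1
```

Standardisation is the usual choice when a model assumes roughly Gaussian inputs; min-max normalisation fits cases where a fixed output range matters, though it is sensitive to outliers because the extremes define the scale.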