Data scientists are a rare breed of professionals who can solve the world’s thorniest problems. They are believed to combine statistical and computational ingenuity, but they are also prone to mistakes. While we have dived into the makings of a data scientist and covered the topic extensively, it is time to train the gaze on the six most common statistical mistakes data scientists make. Some of the most common errors involve the type of measurement, the variability of the data and the sample size. Statistics provides answers, but in some cases it confuses too.
Correlation is not causation
According to Tom Fawcett, a leading data science veteran and co-author of Data Science for Business, an underlying principle in statistics and data science is that correlation is not causation: just because two things appear to be related to each other doesn’t mean that one causes the other. This is apparently the most common mistake with time series. Fawcett cites the example of a stock market index and an unrelated time series, the number of times Jennifer Lawrence was mentioned in the media. The lines look amusingly similar, and such charts usually carry a statement like “Correlation = 0.86”. Recall that a correlation coefficient lies between +1 (a perfect positive linear relationship) and -1 (a perfect inverse linear relationship), with zero meaning no linear relationship at all. 0.86 is a high value, which suggests that the statistical relationship between the two time series is strong, yet the two have nothing to do with each other; series that simply trend in the same direction over time can produce a high coefficient. Fawcett goes on to add that when exploring relationships between two time series, what one really wants to know is whether the variations in one series are correlated with variations in the other.
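A small simulation can make this point concrete. The sketch below is illustrative only: the two series are synthetic (an upward trend plus independent noise, with made-up slopes and noise levels), not the stock index or media data from Fawcett's example, and `pearson` is a helper written here for the demonstration. It compares the correlation of the raw levels with the correlation of the step-to-step changes.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Two unrelated synthetic series: each is an upward trend plus its own noise.
series_a = [0.5 * t + rng.gauss(0, 5) for t in range(500)]
series_b = [0.3 * t + rng.gauss(0, 5) for t in range(500)]

# Step-to-step changes (first differences) remove the shared trend.
changes_a = [later - earlier for earlier, later in zip(series_a, series_a[1:])]
changes_b = [later - earlier for earlier, later in zip(series_b, series_b[1:])]

level_corr = pearson(series_a, series_b)
change_corr = pearson(changes_a, changes_b)

print(f"correlation of levels:  {level_corr:.2f}")
print(f"correlation of changes: {change_corr:.2f}")
```

Under these assumptions the levels correlate strongly simply because both series drift upward, while the changes show almost no correlation. The second number is the one that speaks to whether variations in one series track variations in the other.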
Biased sampling
We have heard of biased algorithms, but there is biased data as well. Biased sampling produces unrepresentative samples, which in turn lead to measurement errors. In many cases, biased estimators leave data scientists with results that are close but not accurate. An estimator is the rule for calculating an estimate of a given quantity from the observed data. Non-random samples are generally treated as biased, and their data cannot be used to represent any population beyond themselves.
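The effect of a non-random sample can be sketched with simulated data. Everything here is hypothetical: the population is drawn from a normal distribution with made-up parameters, and the "reachable" subset stands in for any collection process that can only see part of the population. The estimator in both cases is the plain sample mean; only the sampling differs.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 measurements centred on 50.
population = [rng.gauss(50, 10) for _ in range(100_000)]

# Random sample: every member has the same chance of being picked.
random_sample = rng.sample(population, 1000)

# Non-random sample: only members above 50 can ever be picked,
# for instance because the collection channel never reaches the rest.
reachable = [x for x in population if x > 50]
biased_sample = rng.sample(reachable, 1000)

population_mean = mean(population)
random_mean = mean(random_sample)
biased_mean = mean(biased_sample)

print(f"population mean:    {population_mean:.2f}")
print(f"random sample mean: {random_mean:.2f}")
print(f"biased sample mean: {biased_mean:.2f}")
```

The random sample lands near the population mean, while the non-random sample describes only the subgroup it was drawn from, however many points it contains.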
Misspecifying the regression model
In basic linear or logistic regression, mistakes arise from not knowing what to test in the regression table. In regression analysis, one identifies the dependent variable, whose value varies with the value of the independent variable. The first step is to specify the model by defining the response and predictor variables, and this is where most data scientists trip up, by misspecifying the model. To avoid model misspecification, one must first find out whether there is a functional relationship between the variables being considered, and what form it takes.
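As a rough illustration of misspecification, the sketch below generates data whose true relationship is quadratic (an assumption chosen for the example) and fits a straight line twice: once with `x` as the predictor and once with `x ** 2`. The helper `r_squared` is written here for the demonstration and is not part of any library.

```python
import random
from statistics import mean

def r_squared(xs, ys):
    """R^2 of a least-squares straight line fitted to ys as a function of xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical data: y depends on x through x squared, plus noise.
x = [-3 + 6 * i / 199 for i in range(200)]
y = [xi ** 2 + rng.gauss(0, 0.5) for xi in x]

# Misspecified model: y as a straight-line function of x.
r2_straight_line = r_squared(x, y)

# Better-specified model: y as a straight-line function of x squared.
r2_squared_term = r_squared([xi ** 2 for xi in x], y)

print(f"R^2 with x as predictor:    {r2_straight_line:.2f}")
print(f"R^2 with x**2 as predictor: {r2_squared_term:.2f}")
```

The straight line in `x` explains almost none of the variation even though `y` is strongly determined by `x`. Plotting the data or the residuals before settling on the model form is one way to catch this kind of problem.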
Misunderstanding P Values
Long pegged as the ‘gold standard’ of statistical validity, P values are a nebulous concept, and scientists believe that they aren’t as reliable as many researchers assume. P values are used to determine statistical significance in a hypothesis test. According to the American Statistical Association, P values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. Hence, business and organizational decisions should not be based only on whether a P value passes a specific threshold. Many believe that data manipulation and significance chasing can make it impossible to draw the right conclusions from findings.
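Why significance chasing is risky can be seen in a simulation where no real effect exists. In the sketch below, both groups in every experiment come from the same distribution, so any "significant" difference is noise. The setup is an assumption made for illustration: 2,000 experiments, 30 observations per group, a known standard deviation of 1, and a simple two-sided z-test implemented in the helper `two_sided_p_value`.

```python
import random
from statistics import NormalDist, mean

rng = random.Random(7)  # fixed seed so the run is repeatable
standard_normal = NormalDist()

def two_sided_p_value(sample_a, sample_b, sigma=1.0):
    """z-test P value for a difference in means, with known standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    return 2 * (1 - standard_normal.cdf(abs(z)))

n_experiments = 2000
p_values = []
for _ in range(n_experiments):
    # Both groups are drawn from the same distribution: there is no true effect.
    group_a = [rng.gauss(0, 1) for _ in range(30)]
    group_b = [rng.gauss(0, 1) for _ in range(30)]
    p_values.append(two_sided_p_value(group_a, group_b))

# Share of experiments that would be declared significant at the 0.05 level.
false_positive_rate = sum(p < 0.05 for p in p_values) / n_experiments
print(f"share of null experiments with P < 0.05: {false_positive_rate:.3f}")
```

Roughly one comparison in twenty is expected to clear the 0.05 threshold by chance, so an analyst who runs many comparisons and reports only the ones that pass is likely to be reporting noise.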
Inadequate Handling of Outliers and Influential Data Points
Outliers can affect any statistical analysis, so each outlier should be investigated and then deleted, corrected, or explained as appropriate. For auditable work, the decision on how to treat any outliers should be documented. Sometimes the loss of information that comes with removing a point may be a valid trade-off in return for enhanced comprehension.
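One possible way to flag candidates for investigation and keep a record of the decision is sketched below. The 1.5 × IQR fence rule is a common convention chosen here as an assumption; the article does not prescribe a detection method. The data values and the `audit_note` structure are hypothetical.

```python
from statistics import mean, median, quantiles

# Hypothetical measurements; 95 could be a data-entry error or a rare event.
values = [12, 13, 12, 14, 13, 15, 14, 13, 12, 95]

# Quartiles and the interquartile range (IQR).
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag points outside the fences; flagging is not the same as deleting.
flagged = [v for v in values if v < lower_fence or v > upper_fence]
kept = [v for v in values if lower_fence <= v <= upper_fence]

# Record the rule and the outcome so the treatment can be audited later.
audit_note = {
    "rule": "1.5 * IQR fences",
    "fences": (lower_fence, upper_fence),
    "flagged": flagged,
    "action": "investigate before deleting, correcting, or explaining",
}

print(audit_note)
print(f"mean with / without flagged points:   {mean(values):.2f} / {mean(kept):.2f}")
print(f"median with / without flagged points: {median(values)} / {median(kept)}")
```

In this example the mean shifts substantially when the flagged point is set aside while the median does not move, which shows how much influence a single point can have on some summaries and not others.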
Loss of information
The main objective of statistical data analysis is to provide the best business outcome with minimal modeling or human bias. Sometimes a loss of information in individual data points can affect the result and how well it reflects the data set.
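The article does not name a specific mechanism, so the following sketch picks one as an assumption: collapsing a continuous measurement into two categories ("low" and "high"). The data are simulated and `pearson` is a helper written for the demonstration. It compares how strongly the original and the collapsed variable relate to an outcome.

```python
import random
from statistics import mean, median

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical data: the outcome y depends on a continuous measurement x.
x = [rng.gauss(0, 1) for _ in range(2000)]
y = [xi + rng.gauss(0, 1) for xi in x]

# Collapse x into two categories: 1 if above the median, otherwise 0.
cutoff = median(x)
x_binned = [1 if xi > cutoff else 0 for xi in x]

corr_full = pearson(x, y)
corr_binned = pearson(x_binned, y)

print(f"correlation using the full measurement: {corr_full:.2f}")
print(f"correlation after collapsing to 2 bins: {corr_binned:.2f}")
```

The collapsed variable still carries some signal, but the relationship with the outcome appears weaker than it really is, because every point within a category is treated as identical.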