A typical data mining project starts with visual exploration of each variable. The objective is to identify the distribution of the variable. Once a distribution is identified, the next step is to identify the outliers: the observations that fall outside the accepted range for that variable. This range could be one (or more) standard deviations from the expected mean value. The aim is to find a population that behaves uniformly and as such is easier to herd together in a given function or formula.
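To make the standard deviation rule concrete, here is a minimal sketch in Python. The function name `split_outliers`, the multiplier `k`, and the spend figures are all hypothetical and invented for illustration; a real project would pick the band to suit the distribution it has identified.

```python
from statistics import mean, stdev

def split_outliers(values, k=2.0):
    """Split values into (common, outliers) using a mean +/- k * stdev band."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    lower, upper = mu - k * sigma, mu + k * sigma
    common = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return common, outliers

# Hypothetical monthly spend figures, made up for this example
spend = [52, 48, 50, 47, 55, 51, 49, 53, 50, 120]
common, outliers = split_outliers(spend, k=2.0)
print("common:", common)
print("outliers:", outliers)
```

Note that the outlier itself inflates both the mean and the standard deviation, so a narrow band (small `k`) and a wide band (large `k`) can give quite different answers on the same data.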
Let me quote Sherlock Holmes: “What is out of the common is usually a guide rather than a hindrance.” However, in the standard accepted practice of data mining, one is trained to set the outliers aside before proceeding to find the best-fit equation. This is also the arena that provides material for Nassim Taleb’s masterpieces such as The Black Swan or Fooled by Randomness.
I am not suggesting that current data mining projects are doing anything wrong. Business needs answers quickly. Building a model on, say, the 80% of the population that is common is good enough for most business applications. For example, it is common practice to deploy models with accuracy as low as 65% in marketing campaigns. The cost of misclassification is not very high, and the business can always pull back campaigns that do not work well. Very few models are mission critical for the business. I would not advise this approach for the actuarial models in, say, a life insurance business.
Once the model for the common population is delivered, the data scientist moves on to the next business problem. However, it is important that the data scientist revisit the outliers and try to understand any emerging pattern that they may denote. If there is a repeatable pattern identified within even a few observations, the data scientist should raise an alert and define a tracking mechanism for similar occurrences in the future. When the occurrences show an increasing pattern or frequency, we may well have an emerging issue or behavior on our hands.
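One simple way to build such a tracking mechanism is to count the flagged observations per period and raise an alert when the count keeps climbing. The sketch below is only an assumption about how this might look: the helper names `outlier_counts_by_period` and `is_rising`, the threshold of 100, and the records are all hypothetical.

```python
from collections import Counter

def outlier_counts_by_period(records, is_outlier):
    """Count flagged observations per period; records is an iterable of (period, value)."""
    counts = Counter()
    for period, value in records:
        # Adding 0 still registers the period, so quiet periods show up as 0
        counts[period] += 1 if is_outlier(value) else 0
    return dict(sorted(counts.items()))

def is_rising(counts, min_periods=3):
    """True if the counts of the last min_periods periods are strictly increasing."""
    recent = list(counts.values())[-min_periods:]
    return len(recent) == min_periods and all(a < b for a, b in zip(recent, recent[1:]))

# Hypothetical (year, value) observations, made up for this example
records = [
    (2021, 50), (2021, 120),
    (2022, 130), (2022, 140), (2022, 60),
    (2023, 150), (2023, 160), (2023, 170),
]
counts = outlier_counts_by_period(records, is_outlier=lambda v: v > 100)
if is_rising(counts):
    print("Alert: outlier frequency is increasing:", counts)
```

A strictly increasing count over three periods is a deliberately crude trigger; the point is only that the outliers are counted and watched rather than discarded.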
Let us look at the heavy metal scene over the past few decades. The typical heavy metal fan wore denim trousers and a black tee, with long hair and lots of facial hair. Concert stadiums were full of similarly clad individuals. Somewhere in the eighties, one could spot a formally dressed person in a blazer at these concerts. Slowly, the occurrences of such formal dressers increased. In contrast to the rebel culture of the 60s and 70s, the youth of the 80s and 90s were aligned to mainstream life and were also leading the business world. It is no surprise today to see groups of blue suits with loose ties and unbuttoned collars playing air guitar after office hours. Bands that identified this shift in the fan base and adapted their music from the rebel kind to the achiever kind survived, and the others fell by the wayside.
So go back to the models that are deployed in production and identify the outliers. There is a high chance of uncovering an emerging trend by analyzing these outliers.