Last updated February 5, 2015

Fishing where the fish are


Published on February 5, 2015

by William Inmon

There is an old saying – “90% of the fishermen fish where 10% of the fish are”. The saying explains why many fishermen have rotten luck and why a few fishermen have extraordinary luck. The message is simple – if you want to catch a lot of fish, then put your hook and line in the water where there are a lot of fish.

The saying illustrates a phenomenon that is occurring today in the world of Big Data analytics.

In the world of technology today there is no mistaking the momentum of Big Data. It is all the rage, and vendors and customers everywhere are fascinated with it. The focus on Big Data rivals the passion and fervor that once was found for the Dot.com craze.

So what are the issues of Big Data? The vendors, the writers, the speakers of Big Data all focus on the technology of Big Data and the potential worth of Big Data. But no one is focusing on getting business value out of Big Data. And the failure to focus on business value and Big Data is likely to prove to be the Achilles heel of Big Data. If the fad of Big Data fails, as did the Dot.com craze, then it is likely because of the failure of the vendor to understand the issues of Big Data that relate to business value, as was the case with Dot.com.

The understanding of business value in Big Data begins in a strange place – in a place where no vendor and few people have stepped. The understanding of business value in Big Data begins with the observation that Big Data can be divided into two distinct worlds – a world of repetitive data and a world of non repetitive data.

Repetitive data is data that is oftentimes repetitive in terms of both structure and content. There are lots of examples of repetitive data. There are log tapes. There are telephone calls (call detail data). There is metering data. There is click stream data. There are analog records, and so forth. In repetitive data the same structure of data is repeated over and over again. And in many cases the same content of data is repeated as well. With repetitive data there is record after record after record of the same sort of data.

Non repetitive data is data that does not have repetition. There are many types of non repetitive records of data, such as emails, warranty claim data, call center conversations, customer surveys, medical records and so forth. In non repetitive data the structure and content of any one record is different from any other record. In non repetitive data, each record is its own unique thing.
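The distinction can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the function name `shares_structure`, the dictionary representation, and every sample record are hypothetical and do not come from any real system. It checks one narrow property, whether every record in a collection has the same set of fields, which is the structural half of what makes data repetitive.

```python
from typing import Any, Dict, List


def shares_structure(records: List[Dict[str, Any]]) -> bool:
    """Return True when every record has the same set of field names."""
    if not records:
        return True
    first_fields = set(records[0])
    return all(set(record) == first_fields for record in records)


# Hypothetical call detail records: the same fields, record after record.
call_records = [
    {"caller": "555-0101", "callee": "555-0199", "seconds": 62},
    {"caller": "555-0102", "callee": "555-0150", "seconds": 15},
    {"caller": "555-0101", "callee": "555-0199", "seconds": 340},
]

# Hypothetical non repetitive records: each one has its own shape.
mixed_records = [
    {"from": "a@example.com", "subject": "Late delivery", "body": "My order has not arrived."},
    {"claim_id": 17, "product": "dishwasher", "complaint": "Door seal leaks"},
    {"survey": "Q3", "free_text": "Support was slow but helpful"},
]

print(shares_structure(call_records))
print(shares_structure(mixed_records))
```

A field-name check like this says nothing about what the text in a record means. It only shows why repetitive data is mechanically easy to handle in bulk while non repetitive data is not.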

Now what do repetitive data and non repetitive data have to do with business value in Big Data? The answer is simple – there is very little business value to be found in repetitive records of data whereas there is a wealth of business value to be found in non repetitive records of data.

Consider the call detail records found in the world of telephony. There are many, many records that are made in a day’s time. Each time a telephone call is made a new call detail record is created. And out of millions of records, how many of those records have business value? The answer is that only a tiny fraction of repetitive records have business value. Furthermore, finding those records is a very difficult thing to do. It is truly like trying to find the needle in the haystack. The same is true for click stream data. Or metering data. Or log tape data, and so forth.

Now consider non repetitive data, such as email or call center conversations. (Note that call center conversations are not the same thing as call detail records.) There is business value in every call center record. Some call center records have great business value and some call center records have only a little business value. But there is business value in every non repetitive call center record. And the same is true of email, warranty claims, medical records, and the like.

Having established that the business value in Big Data lies in non repetitive data, where are the vast majority of the new startups in Silicon Valley dedicating their energies? The answer is – the new startups in Silicon Valley are firmly dedicated to getting business value out of repetitive data. The venture capitalists and the startups in Silicon Valley don’t even know the difference between repetitive data and non repetitive data. Instead the startups in Silicon Valley are blindly focused on building the technology that will manage massive amounts of repetitive data, even though that data holds very little business value. It is as if there were a collective “blind eye” when it comes to repetitive data.

Stated differently, the venture capitalists and startups in Silicon Valley are dedicated to finding 10% of the fish or less. The majority of the fishermen are truly going after 10% of the fish.

And why is it that this gross strategic mistake is being made? There are lots of reasons. But the primary reason why not many resources are being dedicated to non repetitive data is that trying to do analytical processing on non repetitive data – primarily textual data – is difficult. There have been many previous efforts trying to create the foundation for doing analytical processing from non repetitive data. People have tried blobs. People have tried text tagging. People have tried NLP. People have tried identifying key words.

Each of these approaches has achieved some small degree of success. But until the context of text can be derived, all of these approaches have serious limitations.

But the world of Big Data is too concerned with looking at the volumes of data and the velocity of data to notice that as long as the focus is on something other than business value, Big Data is destined to be another Dot.com disappointment.


William Inmon

William H. Inmon (born 1945) is an American computer scientist, recognized by many as the father of the data warehouse. Bill Inmon wrote the first book, held the first conference (with Arnie Barnett), wrote the first column in a magazine and was the first to offer classes in data warehousing. Bill Inmon created the accepted definition of what a data warehouse is - a subject oriented, nonvolatile, integrated, time variant collection of data in support of management's decisions.