For those who hit the “snooze” button on earlier Big Data wake-up calls, consider this your espresso shot of information: Our society creates as much data in two days—approximately five exabytes of data—as all of civilization did prior to 2003. In light of that statistic, maybe they should change Moore’s Law to A Little Moore’s Law out of respect.
The Current Challenge: Too Much, Too Late
A few examples illustrate the magnitude of today’s data challenge. If an average-sized healthcare insurer (with approximately 30 million customers) wishes to improve the outcomes for diabetic patients, they may need to analyze more than 60,000 medical codes across 10 billion claims and factor a separate silo of pharmacy data into the equation. The challenge is no less daunting in other industries. A national retail chain that wants to improve its product replenishment could be looking at sales data from thousands of separate stock keeping units (SKUs) across hundreds or thousands of stores over the last several years—that’s more than 100 billion rows of data.
For years, data analysis has been, in a sense, a “moving” experience. Enterprises moved the data that they wanted to analyze from their database onto analytic servers in order to break the analytic work into smaller pieces. Many enterprises, in fact, still do this today. The problem with this approach is several-fold. As data gets bigger—a foregone conclusion today—it can take hours (or even longer) to transfer and stage the data on multiple servers, then return it to the database and re-assemble it. For time-sensitive analyses, that’s a deal breaker. As a workaround, enterprises often choose to analyze only a subset of their data, but this sort of data sampling leads to less than ideal analytic models and can ultimately create more data confusion by generating multiple versions of the same information.
Move the Analytics, Not the Data
To solve the current challenges of Big Data, enterprises are turning to a new strategy: in-database analytics. The idea behind this approach can be summed up in one simple concept: Move the analytics, not the data. By bringing the analytics engine into the database and leveraging massively parallel MapReduce technology (popularized by tools like Hadoop), enterprises can perform highly complex analyses directly in the database environment without spreading the problem out across a team of servers. In-database analytics yields a host of benefits over traditional analytics, including:
- Faster analytics, on a scale of 10-100X faster than traditional analytics
- Better analytic models as data scientists can now use full datasets versus sampling
- No data duplication errors caused by moving data between servers
- Stricter security policy enforcement, particularly for industries that regulate the movement of sensitive business data
- Near real-time insights, as opposed to analytic insights that may be days or weeks old
- Capex reduction by eliminating the need for additional hardware servers to process the analytics
- Pervasive analytics that can flow freely to reporting tools and applications throughout the enterprise
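To make "move the analytics, not the data" concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for an enterprise data warehouse. The sales table, its columns, and the toy data are assumptions made up for illustration, and SQLite is not a massively parallel engine. The only point is the contrast between pulling every row out to aggregate it elsewhere and letting the database return just the summary.

```python
import sqlite3


def build_sales_db(n_stores=5, n_days=200):
    """Create an in-memory SQLite table of hypothetical daily sales rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (store_id INTEGER, day INTEGER, units INTEGER)"
    )
    # Deterministic toy data: one row per store per day.
    rows = [
        (s, d, (s * 7 + d * 3) % 11)
        for s in range(n_stores)
        for d in range(n_days)
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def totals_move_data(conn):
    """Traditional style: pull every row out, then aggregate outside the database."""
    rows = conn.execute("SELECT store_id, units FROM sales").fetchall()
    totals = {}
    for store_id, units in rows:
        totals[store_id] = totals.get(store_id, 0) + units
    return totals, len(rows)


def totals_in_database(conn):
    """In-database style: the engine aggregates and only summary rows come back."""
    rows = conn.execute(
        "SELECT store_id, SUM(units) FROM sales GROUP BY store_id"
    ).fetchall()
    return dict(rows), len(rows)


conn = build_sales_db()
moved_totals, moved_rows = totals_move_data(conn)
db_totals, db_rows = totals_in_database(conn)

print("same answer:", moved_totals == db_totals)
print("rows leaving the database:", moved_rows, "vs", db_rows)
```

Both functions produce the same per-store totals, but the second one hands back one row per store instead of one row per sale. At the scale of billions of rows, that difference in data movement is the transfer and staging time the in-database approach aims to avoid.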
It’s Data Science, Not Rocket Science
As data has grown, so has the role of the “data scientist,” who is now expected to be a kind of Atlas of analytics. The data scientist, according to legend, is intimate with machine learning and statistics, can program in low-level languages (often with one hand), is a data domain expert and is an artist where data visualization is concerned. He or she can tease insights from otherwise unrecognizable patterns and, in some cases literally, can predict the future. Not surprisingly, this mythical unicorn comes with a commensurate price tag.
The problem with this model, beyond cost and scarcity, is a tendency to isolate innovation. In-database analytics solves this problem by reducing the complexity of analytics so that teams of “regular” data analysts can access and analyze data using familiar SQL queries. This essentially democratizes data-led discoveries in the enterprise and, to the relief of HR departments everywhere, eliminates the need to slather themselves in unicorn perfume just to attract the right talent.
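As a rough illustration of analysis expressed in familiar SQL, the sketch below fits a straight line by least squares using nothing but standard aggregates (COUNT, SUM, AVG). It again uses Python's sqlite3 module as a stand-in database. The observations table and its made-up data, which lie exactly on y = 2x + 1, are assumptions for illustration; commercial in-database analytics products typically ship their own richer analytic functions, which are not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (x REAL, y REAL)")

# Hypothetical data lying exactly on the line y = 2x + 1.
conn.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [(float(x), 2.0 * x + 1.0) for x in range(10)],
)

# Least-squares slope written with ordinary SQL aggregates:
#   slope = (n * Sxy - Sx * Sy) / (n * Sxx - Sx * Sx)
FIT_SQL = """
SELECT
    (COUNT(*) * SUM(x * y) - SUM(x) * SUM(y))
        / (COUNT(*) * SUM(x * x) - SUM(x) * SUM(x)) AS slope,
    AVG(x) AS mean_x,
    AVG(y) AS mean_y
FROM observations
"""

# Only three numbers leave the database, however many rows the table holds.
slope, mean_x, mean_y = conn.execute(FIT_SQL).fetchone()

# The fitted line passes through the point of means.
intercept = mean_y - slope * mean_x

print("slope:", slope)
print("intercept:", intercept)
```

The query is something an analyst comfortable with GROUP BY and SUM could read and adapt, for example by adding a GROUP BY clause to fit one line per store or per product.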
That’s Nice, But Who Cares?
In-database analytics isn’t ideal for everyone. For example, some of the newer business cases for Big Data that require analysis of large sets of unstructured data are better served by tools designed specifically for those scenarios. But any business where large amounts of structured data need to be analyzed quickly can benefit from in-database analytics. Good candidates for in-database analytics include:
- Healthcare organizations that need to analyze large amounts of patient data securely
- Financial services companies that can benefit from real-time decision making in their investment strategies
- Retail corporations that need to improve supply chain logistics or analyze product performance in a dynamic environment
Over the next 18 months, we expect to see more enterprise analytics forgo the data server farms of the past and move into the data warehouse. Higher performance and lower cost are the most important drivers for the move home, but there are other benefits to consider. As already noted, in-database analytics allows enterprises to use what they have today in terms of IT skills and infrastructure while dramatically improving their analytic capabilities. Also, the relative simplicity of in-database analytics allows enterprises to experiment more with their data analysis and test hypotheses that would have been impractical or excessively expensive in a traditional setting.
From my perspective, at least, all signs point to in-database analytics becoming the next “in” thing for Big Data.