Last updated September 9, 2020

How Data Mining Works

Published on December 4, 2018

by Bharat Adibhatla

There is an abundance of data across various industries, but it only becomes useful when it is transformed into information. The method of extracting information from large volumes of data is known as data mining. Data mining finds application in areas such as market analysis, business management, fraud detection, corporate analysis and risk management, among others. This article takes a short tour of the steps involved in data mining.

Various Aspects Of Data Mining

1. Data Cleaning

The quality of the data is essential to obtaining a reliable final analysis. Any data that is incomplete, noisy or inconsistent can affect the result. Data cleaning is the procedure of identifying and then correcting or removing inaccurate data from a record set, table or database.

Here are some data cleaning techniques:

Ignore the tuple: This is done when the class label is missing. This method is not very effective unless the tuple contains several attributes with missing values.
Fill in the missing values manually: This technique is practical only for a small data set with few missing values.
Replace missing values with a global constant.
Replace missing values with the attribute mean or the most probable value.
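
Three of the techniques above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production routine: the sample records, the use of None to mark a missing value, the "Unknown" constant and the function names are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "city": "Pune", "label": "yes"},
    {"age": None, "city": "Delhi", "label": "no"},
    {"age": 40, "city": None, "label": "yes"},
    {"age": 31, "city": "Pune", "label": None},
]

def drop_unlabelled(rows, label_key="label"):
    """Ignore the tuple: discard rows whose class label is missing."""
    return [row for row in rows if row[label_key] is not None]

def fill_with_constant(rows, key, constant="Unknown"):
    """Replace missing values of one attribute with a global constant."""
    return [{**row, key: constant if row[key] is None else row[key]} for row in rows]

def fill_with_mean(rows, key):
    """Replace missing numeric values with the mean of the observed values."""
    observed = [row[key] for row in rows if row[key] is not None]
    attribute_mean = mean(observed)
    return [{**row, key: attribute_mean if row[key] is None else row[key]} for row in rows]

# Each step returns new rows, so the original records stay untouched.
cleaned = drop_unlabelled(records)
cleaned = fill_with_constant(cleaned, "city")
cleaned = fill_with_mean(cleaned, "age")
```

Note that the order of the steps matters: the mean is computed only over the rows that survive the first step, so dropping tuples first changes the value used to fill the gaps.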

2. Data Integration

Data integration is the technique of merging new information with existing information, typically drawn from several sources. The sources may include multiple databases, data cubes, or flat files. One of the most common implementations of data integration is building an enterprise data warehouse.

Any field that systematically collects information is concerned with data integration, which has two main approaches.

Tight Coupling: In this approach data from different sources are integrated into a single physical location by the process of ETL – Extraction, Transformation, and Loading.

Loose Coupling: In this approach, data remains in the original source databases. An interface takes queries from the user, transforms them into a format the source databases can understand, and then sends them directly to the source databases to obtain the result.
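
The difference between the two approaches can be sketched as follows. Everything here is hypothetical: the two in-memory "sources", their field names, and the Mediator class are stand-ins for real databases and a real query interface.

```python
# Two hypothetical sources that describe the same customers with different schemas.
crm_source = [
    {"cust_id": 1, "full_name": "Asha Rao"},
    {"cust_id": 2, "full_name": "Ravi Kumar"},
]
billing_source = [
    {"customer": 1, "amount_due": 120.0},
    {"customer": 2, "amount_due": 0.0},
]

def etl_into_warehouse(crm_rows, billing_rows):
    """Tight coupling: extract from each source, transform to one schema, load into one store."""
    warehouse = {}
    for row in crm_rows:
        warehouse[row["cust_id"]] = {"id": row["cust_id"], "name": row["full_name"], "due": None}
    for row in billing_rows:
        entry = warehouse.setdefault(
            row["customer"], {"id": row["customer"], "name": None, "due": None}
        )
        entry["due"] = row["amount_due"]
    return warehouse

class Mediator:
    """Loose coupling: data stays in the sources; each query is translated per source."""

    def __init__(self, crm_rows, billing_rows):
        self.crm_rows = crm_rows
        self.billing_rows = billing_rows

    def customer(self, customer_id):
        # Translate the user's query into each source's own field names.
        name = next(
            (r["full_name"] for r in self.crm_rows if r["cust_id"] == customer_id), None
        )
        due = next(
            (r["amount_due"] for r in self.billing_rows if r["customer"] == customer_id), None
        )
        return {"id": customer_id, "name": name, "due": due}

warehouse = etl_into_warehouse(crm_source, billing_source)  # one physical copy
mediator = Mediator(crm_source, billing_source)             # no copy, queried on demand
```

Both paths answer the same question, but the warehouse holds a snapshot taken at load time, while the mediator always reads whatever the sources currently contain.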

3. Data Transformation

The procedure of transforming data or information from one format to another is known as data transformation. It is usually done to convert data from the format of a source system into the format required by a new destination system. The process mostly involves converting documents, but data conversions sometimes involve translating a program from one computer language to another to enable the program to run on a different platform. The usual purpose of such a data migration is the adoption of a new system that is entirely different from the previous one.

There are various strategies to achieve data transformation, such as:

Smoothing: The noise is removed from the data
Aggregation: Summary or aggregation operations are applied to the data
Generalisation: Low-level data is replaced with higher-level data by climbing a concept hierarchy.
Normalisation: Attributes are scaled to make sure that they come under a small specified range, such as 0.0 to 1.0
Attribute Construction: New attributes are created from the given set of attributes.
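
Each of these strategies can be illustrated with a small function. The sketch below makes several assumptions that are not from the article: smoothing is done with a centred moving average, normalisation with min-max scaling, and the sample data, field names and city-to-country hierarchy are invented for the example.

```python
def smooth(values, window=3):
    """Smoothing: centred moving average; the edges use a shorter window."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        neighbours = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed

def aggregate(rows, group_key, value_key):
    """Aggregation: total of value_key for each distinct group_key."""
    totals = {}
    for row in rows:
        totals[row[group_key]] = totals.get(row[group_key], 0) + row[value_key]
    return totals

def generalise(rows, key, hierarchy):
    """Generalisation: replace a low-level value with its parent in the hierarchy."""
    return [{**row, key: hierarchy[row[key]]} for row in rows]

def min_max_normalise(values, new_min=0.0, new_max=1.0):
    """Normalisation: rescale values linearly into [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:  # all values identical, so nothing to scale
        return [new_min] * len(values)
    return [new_min + (v - low) / (high - low) * (new_max - new_min) for v in values]

def construct_area(rows):
    """Attribute construction: derive a new attribute from existing ones."""
    return [{**row, "area": row["width"] * row["height"]} for row in rows]

# Hypothetical sample data.
sales = [
    {"location": "Pune", "amount": 100},
    {"location": "Lyon", "amount": 50},
    {"location": "Pune", "amount": 70},
]
city_to_country = {"Pune": "India", "Lyon": "France"}

by_city = aggregate(sales, "location", "amount")
by_country = aggregate(generalise(sales, "location", city_to_country), "location", "amount")
scaled = min_max_normalise([10, 20, 30])
plots = construct_area([{"width": 2, "height": 3}])
```

Generalisation followed by aggregation gives the roll-up from city totals to country totals, which is the typical reason for combining the two.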

4. Data Discretisation

The technique of splitting the domain of a continuous attribute into intervals is known as data discretisation. The actual attribute values are then replaced by a small number of interval labels. This yields a compact, easy-to-use, knowledge-level representation of mining results. Data discretisation can proceed in one of two ways:

Top-down discretisation: In the top-down process, one or a few points (called split points or cut points) are found first and used to split the entire attribute range; the same step is then repeated on the resulting intervals.

Bottom-up discretisation: In the bottom-up process, all of the continuous values are initially treated as possible split points, and some are then removed by merging neighbouring values to form intervals.
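
The two directions can be contrasted with a small sketch. The splitting and merging criterion used here (the gap between neighbouring values) is an assumption chosen to keep the example short; practical methods often use class information or statistical tests instead. The function names and sample values are hypothetical.

```python
def top_down(values, max_intervals):
    """Start with one interval and keep splitting at the widest gap between neighbours."""
    intervals = [sorted(set(values))]
    while len(intervals) < max_intervals:
        best = None  # (gap, interval index, split position)
        for idx, interval in enumerate(intervals):
            for pos in range(1, len(interval)):
                gap = interval[pos] - interval[pos - 1]
                if best is None or gap > best[0]:
                    best = (gap, idx, pos)
        if best is None:  # every interval holds a single value
            break
        _, idx, pos = best
        interval = intervals[idx]
        intervals[idx: idx + 1] = [interval[:pos], interval[pos:]]
    return [(interval[0], interval[-1]) for interval in intervals]

def bottom_up(values, max_intervals):
    """Start with one interval per value and keep merging the closest neighbours."""
    intervals = [[v] for v in sorted(set(values))]
    while len(intervals) > max(max_intervals, 1):
        gaps = [intervals[i + 1][0] - intervals[i][-1] for i in range(len(intervals) - 1)]
        i = gaps.index(min(gaps))
        intervals[i: i + 2] = [intervals[i] + intervals[i + 1]]
    return [(interval[0], interval[-1]) for interval in intervals]

# Hypothetical attribute values with three visible groups.
readings = [1, 2, 3, 10, 11, 12, 30, 31]
split_result = top_down(readings, 3)
merge_result = bottom_up(readings, 3)
```

On this sample both directions arrive at the same three intervals, (1, 3), (10, 12) and (30, 31), but they get there from opposite ends: one by adding cut points, the other by removing them.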

5. Concept Hierarchies

As an extension of discretisation, concept hierarchies are used to reduce the data by collecting and replacing low-level concepts with higher-level concepts. In a multidimensional model, data is systematically arranged into multiple dimensions, and each dimension has multiple levels of abstraction defined by concept hierarchies. This gives users the flexibility to observe data from different perspectives. The typical methods for concept hierarchy generation for numerical data are:

Binning: It is a top-down, unsupervised splitting technique for discretisation, based on a specified number of bins.
Histogram Analysis: It is an unsupervised discretisation technique which partitions the values of an attribute into disjoint ranges called buckets.
Cluster Analysis: It is a widely used discretisation method in which a clustering algorithm partitions the values of a numerical attribute into clusters or groups.
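
Binning and histogram analysis can be sketched together, since an equal-width histogram is built from the same bin edges. The example below assumes equal-width bins; the sample ages and the three labels are hypothetical and only show how numeric values can be lifted to a higher-level concept.

```python
from collections import Counter

def equal_width_bins(values, n_bins):
    """Return the bin edges and, for each value, the index of the bin it falls into."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    edges = [low + i * width for i in range(n_bins + 1)]
    indices = []
    for v in values:
        if width == 0:  # all values identical
            indices.append(0)
        else:
            # The maximum value is placed in the last bin rather than a new one.
            indices.append(min(int((v - low) / width), n_bins - 1))
    return edges, indices

# Hypothetical ages and higher-level labels for the three bins.
ages = [0, 10, 20, 30, 40, 60]
labels = ["young", "middle-aged", "senior"]

edges, indices = equal_width_bins(ages, n_bins=3)
age_groups = [labels[i] for i in indices]  # low-level values replaced by concepts
histogram = Counter(age_groups)            # bucket counts
```

Equal-width bins are simple but sensitive to outliers: a single extreme value stretches the range and can leave most values crowded into one bin.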

6. Pattern Evaluation And Data Presentation

Representation of data plays a crucial part in any business. Clients or customers can make the best use of the data only if it is presented effectively. Once the data has gone through the above procedures and has been evaluated as sound, it is presented through diagrams and graphs so that it can be understood by people with minimal statistical knowledge. There are various ways in which economic data can be presented: