Cyber attacks are increasing at a rapid pace and are now systematically targeted at vulnerable countries. The growing number of online users, and the data they generate, is worsening the situation further. In addition, digital usage is now interwoven with critical business sectors such as banking, which is giving rise to more cyber crime than ever before.
Big data technologies have emerged extensively as a result of this data burst. However, they have raised security concerns around how data is stored, the systems involved and real-time monitoring, among other challenges. Nonetheless, big data solutions are still preferred because of their ability to handle voluminous amounts of data in a short span of time. In this article, we will discuss a study conducted by academics at Gazi University, Turkey, in which a novel unsupervised anomaly detection approach for networks is designed to address vulnerabilities in a big data network using data from NetFlow.
Why Use NetFlow?
NetFlow is a network protocol first introduced by Cisco to monitor and collect network traffic data among devices as well as computer software and applications. The protocol is primarily used to detect anomalies and patterns through which dangerous traffic can reach devices and compromise security. More importantly, it gives computer security experts a view into the behavior of the traffic flow. Network anomaly detection itself can be carried out using many methods, most of which follow machine learning techniques.
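To make this concrete, the sketch below shows the kind of attributes a flow record typically carries. The field names here are descriptive stand-ins, not the exact protocol field identifiers, and the values are made up.

```python
# Illustrative sketch of a single flow record (hypothetical field
# names and values, chosen to mirror typical NetFlow attributes)
flow = {
    "src_ip": "192.168.1.10",
    "dst_ip": "10.0.0.5",
    "src_port": 51034,
    "dst_port": 53,
    "protocol": "UDP",
    "packets": 4,
    "bytes": 312,
    "duration_s": 0.2,
}

# A monitor can derive simple traffic-behavior signals from such
# records, e.g. the average bytes per packet of a flow:
bytes_per_packet = flow["bytes"] / flow["packets"]
print(bytes_per_packet)  # → 78.0
```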
The Approach To Anomaly Detection
In the study, the clustering side of machine learning is used to detect network anomalies. The academics say that the approach follows these six steps:
- First, the NetFlows are divided into time intervals, since most actions show similar behavior over a span of several minutes.
- The NetFlows are then aggregated by source IP. This reduces the data size for processing, and the aggregated data may reveal new patterns of behavior.
- The aggregated data is standardised with the z-score, z = (x – μ)/σ, where μ is the mean and σ is the standard deviation. This equalises the variability of the data and makes it less affected by outliers.
- The aggregated NetFlows are clustered with a distributed k-means algorithm. Because the unsupervised technique is trained on unlabeled data, it has the ability to detect unfamiliar attacks, and the clusters are expected to form according to normal or abnormal traffic behavior.
- The Euclidean distance of each cluster element to its cluster center is calculated. For a good clustering, the elements should lie close to the center, so the centroids can be used for outlier detection: a histogram of the distances shows their distribution, and elements that stay distant from the concentrated region of the histogram are considered anomalous.
- The actual numbers of normal and abnormal flows are determined from the time intervals in steps 4 and 5. Finally, the success criterion is evaluated.
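The standardisation, clustering and distance-based detection steps can be sketched as follows. This is a minimal illustration on synthetic data, not the study's implementation: the study runs a distributed k-means on Spark, while scikit-learn stands in here, and the percentile cut-off on the distance histogram is an assumption of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Hypothetical aggregated NetFlow features (e.g. flow count and total
# bytes per source IP per interval): mostly normal traffic plus a few
# abnormally heavy senders (synthetic, for illustration only)
normal = rng.normal(loc=[100.0, 50.0], scale=[10.0, 5.0], size=(200, 2))
heavy = rng.normal(loc=[400.0, 250.0], scale=[60.0, 60.0], size=(5, 2))
X = np.vstack([normal, heavy])

# Standardise with the z-score, z = (x - mu) / sigma
z = (X - X.mean(axis=0)) / X.std(axis=0)

# Cluster with k-means (scikit-learn here; the study uses a
# distributed k-means on Spark)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(z)

# Euclidean distance of every element to its own cluster centre
dist = np.linalg.norm(z - km.cluster_centers_[km.labels_], axis=1)

# Elements far from the concentrated region of the distance histogram
# are treated as anomalous; a simple percentile cut-off is assumed here
threshold = np.percentile(dist, 97.5)
anomalous = np.where(dist > threshold)[0]
```

With real aggregated flows, the flagged indices would point back to the (interval, source IP) rows that behave unlike their cluster's bulk.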
Cloud Environment And The Datasets For Implementation
For the study, the approach was implemented on the Apache Spark big data framework with the Azure HDInsight cloud service for processing the NetFlow data, with Python as the main programming language. To detect network attacks, the CTU-13 dataset was investigated, since it provides sample attack scenarios for ascertaining network behavior. Specifically, the 10th scenario in the dataset (UDP DDoS attacks) was the focus of the study, because it covers botnet attacks in addition to being large in size: it has 1,309,792 NetFlows, 106,352 of which are UDP DDoS flows.
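The interval-splitting and source-IP aggregation steps on such data might look like the following. This is a hedged local sketch using pandas on a tiny in-memory sample rather than Spark on the full dataset; the column names follow the CTU-13 bidirectional NetFlow format, but the values are invented.

```python
import pandas as pd

# Tiny in-memory stand-in for CTU-13 NetFlow records (column names
# follow the CTU-13 .binetflow format; the values are made up)
flows = pd.DataFrame({
    "StartTime": pd.to_datetime([
        "2011-08-18 10:00:05", "2011-08-18 10:00:40",
        "2011-08-18 10:01:10", "2011-08-18 10:01:30",
    ]),
    "SrcAddr": ["147.32.84.165", "147.32.84.165",
                "147.32.84.59", "147.32.84.165"],
    "TotPkts": [12, 3, 40, 7],
    "TotBytes": [2400, 180, 61000, 900],
})

# Divide the NetFlows into one-minute intervals
flows["interval"] = flows["StartTime"].dt.floor("1min")

# Aggregate per (interval, source IP); this shrinks the data and
# exposes per-host behavior within each window
agg = (flows.groupby(["interval", "SrcAddr"])
            .agg(flow_count=("TotBytes", "size"),
                 total_pkts=("TotPkts", "sum"),
                 total_bytes=("TotBytes", "sum"))
            .reset_index())
print(agg)
```

On Spark, the same grouping would be expressed with a DataFrame `groupBy` so it scales to the full 1.3 million flows.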
The implementation follows the approach described earlier. The NetFlow data was split into one-minute time intervals to capture anomalies, so that the data is not crowded with anomalies during experimentation. With this, the unsupervised anomaly detection was developed, and its accuracy was found to be 96 percent. To visualise the result, the six features in the dataset were reduced to three dimensions using dimensionality reduction with principal component analysis (PCA).
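The PCA reduction for visualisation can be sketched as below. The feature matrix here is random stand-in data, since the study's six aggregated features are not reproduced; scikit-learn's PCA is used for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Made-up matrix standing in for six aggregated NetFlow features per
# (interval, source IP) row; the real feature values are the study's
X = rng.normal(size=(300, 6))

# Standardise, then project the six features down to three principal
# components so the clusters can be plotted in 3-D
z = (X - X.mean(axis=0)) / X.std(axis=0)
pca = PCA(n_components=3)
X3 = pca.fit_transform(z)

print(X3.shape)  # → (300, 3)
print(pca.explained_variance_ratio_.sum())
```

The three retained components keep the directions of greatest variance, which is usually enough to see whether normal and anomalous flows separate visually.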
As technology advances, cyber crimes are committed with more ease and deception, and they are sometimes harder to detect owing to the anonymity and other evasive methods harbored by cyber-criminals. This study should serve as a useful starting point for future work on countering attacks on computer networks using big data and machine learning.