Anomalies are data points, items, observations or events that deviate from the expected pattern in the data being analyzed. This subset of exceptions to the typical or expected pattern within a given group is termed outliers. Outliers aside, the data points in any given pool are generally correlated with one another in ways known as data patterns. Data Scientists use Anomaly Detection (AD) to find deviations from such data patterns using various statistical models and purpose-built Machine Learning algorithms. From an enterprise perspective, the ability to build systems that detect such anomalies is a powerful capability. While Anomaly Detection is broadly rooted in data mining, AD is increasingly being adopted for specific applications across industry verticals.
Applications range from detecting hackers in security systems to detecting fraud in credit card systems, and from health monitoring to the monitoring of business operations. The technology thus has a wide range of applications across industries, including Banking, Healthcare, Insurance, Retail and CPG.
Before we look at Anomaly Detection techniques, it is important for Data Scientists to understand the broad categories in which anomalies can be classified:
- Point Anomaly: A point anomaly occurs when a single data point deviates from the rest of the data. For example, in credit card fraud monitoring, a single unusually large transaction is a point anomaly and is likely to be flagged as fraudulent.
- Contextual Anomaly: In this case the exception to a given data pattern is context specific. For example, in the U.K., increased electricity consumption during the winter months is expected; a comparable spike during the summer would be considered abnormal. Scotland Yard recently applied this kind of anomaly detection to electricity consumption and built a predictive system to detect drug dealers from their pattern of electricity usage. The telltale pattern was elevated electricity usage throughout the year, because the drugs had to be stored in a temperature-controlled environment.
- Collective Anomaly: In a collective anomaly, a group of related data points is anomalous when taken together, rather than a single data point as in the case of a point anomaly. For instance, a collective anomaly typically applies to a scenario where a hacker tries to break into the website of a high-security national defense agency or a global bank by unlawfully replicating data from the website onto local hosts.
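To make the difference between a point anomaly and a contextual anomaly concrete, here is a minimal Python sketch. The function name `contextual_outliers`, the context labels `peak` and `off_peak`, the readings and the threshold of 2.0 are all illustrative assumptions rather than part of any particular system. The idea is simply that each value is compared only with other values from the same context.

```python
from collections import defaultdict
from statistics import mean, pstdev


def contextual_outliers(readings, threshold=2.0):
    """Flag (context, value) pairs that are unusual within their own context.

    `readings` is a list of (context_label, value) tuples. The threshold is
    a hypothetical cut-off in standard deviations, chosen for illustration.
    """
    by_context = defaultdict(list)
    for context, value in readings:
        by_context[context].append(value)

    flagged = []
    for context, value in readings:
        values = by_context[context]
        spread = pstdev(values)
        if spread == 0:
            continue  # every value in this context is identical
        if abs(value - mean(values)) / spread > threshold:
            flagged.append((context, value))
    return flagged


# Made-up consumption readings: a value of 30 is normal in the peak season
# but unusual in the off-peak season.
readings = (
    [("peak", v) for v in (30, 31, 29, 32, 30, 31, 30, 29)]
    + [("off_peak", v) for v in (10, 11, 9, 10, 12, 10, 11, 30)]
)

print(contextual_outliers(readings))  # [('off_peak', 30)]
```

The same reading of 30 appears in both contexts, but only the off-peak occurrence is flagged, which is what separates a contextual anomaly from a plain point anomaly.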
Commonly Used Anomaly Detection Techniques
- Simple Statistical Method: In this method, descriptive statistics such as the mean, median, mode and standard deviation are calculated. Assuming the data follows a normal distribution, a confidence level (usually 95%) is then set to detect outliers. The confidence level defines the region within which the desired percentage of the data lies; for a normal distribution, about 95% of values fall within roughly two standard deviations of the mean. Data points outside this region are treated as outliers. However, this technique applies only when the data is approximately normally distributed. If the data is heavily skewed or otherwise non-normal, the technique becomes unreliable.
- Density Based Anomaly Detection: This technique rests on the simple assumption that normal data points occur in dense, closely packed clusters, and that points far from those clusters are anomalies. The distance between data points, typically the Euclidean distance, is therefore measured to classify the data.
- Clustering Based Anomaly Detection: In this technique, the local centroids of the different clusters in the data are determined first. Outliers are then detected by calculating the distance of each data point from its local centroid. Data points that lie far from the centroid are termed outliers.
- Support Vector Machine Based Anomaly Detection: This technique is straightforward, but it can be used only when a training data set is available to feed to the machine. By analyzing the training data set, the machine learns the difference between normal and abnormal data and can then classify new data given as input. The training data set thus works as a reference that improves the accuracy of the machine.
- Bayesian Network: A Bayesian Network is a model that represents a set of variables and their conditional dependencies on one another. Using this technique, data points that do not fit these dependencies, and so appear unrelated to the other data points, are detected easily. Such unrelated data points are termed anomalies.
- Hidden Markov Model: This model is also used to detect hidden relationships in data, in this case between successive observations in a sequence that are assumed to be driven by unobserved (hidden) states. It thus works along similar lines to a Bayesian Network and is an effective tool for anomaly detection.
- Replicator Neural Network (RNN): A Replicator Neural Network differs slightly from the usual regression models, in which output variables are mapped from input variables. Here, the input variables are also used as the output variables, so that the network learns to replicate the input patterns in its output. A training data set is needed initially to tune the network to reproduce the patterns in the training data. Patterns representing outliers are reproduced less well by a trained RNN, and this larger reproduction error makes them easy to detect.
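As an illustration of the two simplest techniques in the list above, the following Python sketch flags outliers first with a standard-deviation rule and then by distance from cluster centroids. It is a minimal sketch under stated assumptions: the function names, the sample data, the 1.96 cut-off (the two-sided 95% bound of a normal distribution) and the distance limit are all hypothetical, and the centroids are taken as already computed by some clustering step such as k-means.

```python
from math import dist
from statistics import mean, stdev


def zscore_outliers(values, z_limit=1.96):
    """Simple statistical method: flag values outside mean +/- z_limit * std.

    Only meaningful if the data is roughly normally distributed.
    """
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) > z_limit * sigma]


def centroid_outliers(points, centroids, max_distance):
    """Clustering-based method: flag points far from every centroid.

    `centroids` are assumed to be given; `max_distance` is an illustrative
    threshold on the Euclidean distance to the nearest centroid.
    """
    flagged = []
    for point in points:
        nearest = min(dist(point, c) for c in centroids)
        if nearest > max_distance:
            flagged.append(point)
    return flagged


# Made-up transaction amounts with one unusually large value.
amounts = [52, 48, 50, 47, 53, 49, 51, 50, 46, 54, 500]
print(zscore_outliers(amounts))  # [500]

# Made-up 2-D points around two assumed cluster centres.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.9), (9.0, 1.0)]
centroids = [(1.0, 1.0), (5.0, 5.0)]
print(centroid_outliers(points, centroids, max_distance=1.0))  # [(9.0, 1.0)]
```

In both functions the choice of threshold decides how many points are flagged, so in practice it would be tuned to the data rather than fixed at the values used here.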
Anomaly detection also helps uncover patterns and insights that would otherwise go unnoticed, helping businesses make decisions they would not previously have considered and thereby improve their systems.