Before the word anomaly became so popular, outlier was the more commonly used term. I really like Hawkins's definition of an outlier: an observation which differs so much from other observations that it appears to have been generated by a different mechanism or process. Other names for an outlier are anomaly, novelty, surprise, etc. There is also the more dramatic name 'Black Swan' (there is a very famous Hollywood movie of the same name), a metaphor for a rare event. The motivation for finding anomalies is that they can give very interesting insights depending on the application domain. To keep parity, the normal observations are often called inliers.
Let us borrow an example from Malcolm Gladwell's bestselling book Outliers, where he talks about a place named Roseto, Pennsylvania where there was no death by heart attack below the age of 65, based on 50 years of the town's medical records. Though this story is from the 1950s, there is no doubt that it was indeed a medical aberration, a mystery, an outlier. How the investigators eliminated, one by one, the diet and exercise regime, climatic factors, and heredity of the people and chased down the truth is a different story and a highly recommended read. Nevertheless, this establishes that outliers can indeed provide very interesting insights.
Enough of movies and books. Now, let's take a few examples from business applications:
Intrusion Detection: –
If a system is intruded, there will be some abnormal behavior in terms of system calls, network traffic, or other system characteristics. This detection needs to be almost real time, which necessitates a simple and effective algorithm.
Fraud Detection: –
Another important use case of anomaly detection is financial fraud. If card information is compromised, the purchasing behavior pattern, in terms of the time of purchase, location of purchase, amount of purchase, and merchandise purchased, can be completely different.
Some other important applications are the identification of suspicious activity from surveillance cameras, of critical medical events from electrocardiography (ECG) signals or other body sensors, and of abnormal cells or tumors from CT images.
Score – Outlyingness
Typically, an algorithm will assign some kind of score to each observation. This is often called the outlyingness score. The score can be used to rank observations by their outlying tendency.
Labels – Normal and Outlier
Here, the algorithm outputs a binary label: a 'normal' observation or an outlier. It is quite apparent that the score, coupled with a threshold, can easily be used to build a label-based classifier.
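As a quick illustration, here is a minimal sketch of turning scores into labels; the scores and the threshold of 0.5 are made up for the example, as in practice both come from your detector and your domain.

```python
import numpy as np

# Hypothetical outlyingness scores produced by some detector
scores = np.array([0.1, 0.2, 0.15, 0.9, 0.18, 0.85])

threshold = 0.5  # assumed cutoff; in practice chosen from domain knowledge
labels = np.where(scores > threshold, "outlier", "normal")
print(labels)  # observations scoring above the threshold are flagged
```

Anything scoring above the threshold is flagged as an outlier; everything else is treated as normal.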
Sometimes there are outlying observations which are plain and simple noise. So, one of the questions that often bothers practitioners is how to differentiate the noise from the news. A broad guideline is illustrated in the figure below.
The noise elements generally have a lower outlying score and are called weak outliers.
The anomalous elements have a higher score and are called strong outliers, or novelties.
There are several terminologies for anomaly detection methods, but the most standard classification puts them into the following three categories:
Supervised Techniques – Normal and Outliers in Training Set
The dataset has both the normal and the outlier classes, and classifiers are trained like any regular machine learning problem. The only things to consider are that the classes are highly imbalanced, and that false positives (identifying a normal observation as an outlier) and false negatives (identifying an outlier as an inlier) have different associated costs.
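A minimal sketch of this setup, on synthetic data with an assumed 98/2 class split, using scikit-learn's `class_weight` option so the rare outlier class is not ignored during training:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Simulated imbalanced data: ~2% of samples belong to the outlier class (1)
X, y = make_classification(n_samples=2000, weights=[0.98], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# class_weight="balanced" re-weights errors inversely to class frequency,
# a simple way to encode the asymmetric misclassification costs
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
pred = clf.predict(X_test)
```

Cost-sensitive weights are only one option; resampling (over- or under-sampling) is another common way to handle the imbalance.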
Semi-supervised Techniques – Only Normal class in Training Set
Here, only the normal class is available in the training set. The algorithm tries to learn the parameters of the normal class distribution, and any observation that has a low probability of coming from this distribution can be identified as an outlier. One of the popular techniques in this category is the One-Class SVM. Techniques like Gaussian mixture modeling can be applied to identify the parameters of the normal class.
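A minimal sketch of the Gaussian-mixture approach, assuming the normal class is a single 2-D Gaussian and using an assumed 1st-percentile likelihood cutoff:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Training set contains only "normal" observations (assumed 2-D Gaussian)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Learn the parameters of the normal class distribution
gmm = GaussianMixture(n_components=1, random_state=0).fit(X_normal)

# Score new points: low log-likelihood => unlikely to be normal => outlier
X_new = np.array([[0.1, -0.2], [8.0, 8.0]])
log_prob = gmm.score_samples(X_new)
threshold = np.percentile(gmm.score_samples(X_normal), 1)  # assumed cutoff
is_outlier = log_prob < threshold
```

Here the first test point sits near the center of the normal class, while the second is far away and falls below the likelihood threshold.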
Unsupervised Techniques – No Labels
Here, no labels are available, and distance- or density-based techniques are applied to identify the outliers. Some of the popular unsupervised techniques are Isolation Forest and Local Outlier Factor (LOF).
The above figure illustrates the scenarios. Colored circles indicate labeled observations; circles of the same color belong to the same class. Blue circles indicate normal class observations.
Supervised Technique: – The training set already contains observations labeled as normal (inlier) and anomalous. You train any binary classification model, and when an unlabeled input is given to the model, it can label it as either normal or an outlier.
Semi-supervised Technique: – This is more of a one-class learning setup. We have enough normal data available, but not many outliers. The model learns the parameters of the normal class. When an unlabeled input is given to the model, it can estimate the probability of the observation belonging to the normal class. If that probability is low, the observation can be labeled as an outlier.
Unsupervised Technique: – Here we do not have any labels. Techniques like clustering can be applied, and observations that lie in a very low-density region or far away from the clusters can be identified as outliers.
A few popular algorithms are discussed below:
One Class SVM:
This algorithm tries to find a class boundary around the normal observations. Points falling inside the boundary are classified as normal, and points falling outside it are classified as outliers. The RBF kernel is a popular choice of SVM kernel here.
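A minimal sketch with scikit-learn's `OneClassSVM`, fit on assumed synthetic normal data (the `nu` value, which bounds the fraction of training points treated as errors, is an arbitrary choice here):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 2))  # normal observations only

# RBF kernel; nu=0.05 is an assumed bound on the training-error fraction
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

# predict returns +1 for points inside the learned boundary (inliers)
# and -1 for points outside it (outliers)
pred = ocsvm.predict([[0.0, 0.0], [6.0, 6.0]])
```

A point near the center of the training data lands inside the boundary, while a distant point falls outside and is flagged.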
Isolation Forest:
This is an ensemble based on decision trees. Normal observations require many random splits to be isolated, whereas anomalous observations can be isolated with far fewer splits. As a result, the outlying observations are found near the root of the trees, with much shorter path lengths.
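A minimal sketch with scikit-learn's `IsolationForest`, on assumed synthetic data with one injected anomaly (the `contamination` value, the expected outlier fraction, is an assumption):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(size=(200, 2)),   # dense cluster of normal observations
    np.array([[7.0, 7.0]]),      # one injected far-away anomaly
])

# contamination is the assumed fraction of outliers in the data
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)  # +1 = inlier, -1 = outlier
```

The injected point is isolated by very few random splits, so it receives the -1 (outlier) label.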
Local Outlier Factor:
Here a score is assigned to every observation. If a point is surrounded by neighbors that are quite near, its Local Outlier Factor (LOF) is low. An outlier is expected to come from a less densely populated region.
For a particular 'k', LOF(k) means the following:
- LOF(k) ≈ 1: similar density to neighbours
- LOF(k) < 1: higher density than neighbours (inlier)
- LOF(k) > 1: lower density than neighbours (outlier)
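A minimal sketch with scikit-learn's `LocalOutlierFactor` on assumed synthetic data; note that sklearn exposes the scores through `negative_outlier_factor_`, so they are negated back here to recover LOF(k):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(size=(100, 2)),   # dense cluster
    np.array([[8.0, 8.0]]),      # one isolated point
])

# n_neighbors is the 'k' in LOF(k); 20 is sklearn's default
lof = LocalOutlierFactor(n_neighbors=20)
pred = lof.fit_predict(X)                # +1 = inlier, -1 = outlier
scores = -lof.negative_outlier_factor_   # LOF(k); > 1 suggests an outlier
```

The isolated point sits in a far less dense region than its neighbors, so its LOF(k) is well above 1 and it is flagged.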
All these methods are available in sklearn. Here is the result of an experiment using One-Class SVM, Isolation Forest, and LOF; the diagram below shows the result of these three algorithms on the make_blobs and make_moons datasets.
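A sketch of such an experiment on make_moons, with a couple of far-away points appended by hand; the parameter values are assumptions, and the plotting step is omitted:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Two-moons data plus two manually injected distant outliers
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = np.vstack([X, np.array([[3.0, 3.0], [-3.0, -3.0]])])

detectors = {
    "OneClassSVM": OneClassSVM(nu=0.05, gamma="scale"),
    "IsolationForest": IsolationForest(contamination=0.05, random_state=0),
    "LOF": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
}

results = {}
for name, det in detectors.items():
    # all three support fit_predict: +1 = inlier, -1 = outlier
    results[name] = det.fit_predict(X)
```

Comparing the `results` arrays (or scatter-plotting them, colored by label) shows how differently the three methods draw their normal regions around the two moons.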
Anomaly detection has tremendous applications, especially in fraud detection and intrusion detection. The first step is to understand the availability of labeled outlier data; accordingly, the problem needs to be set up in a supervised, semi-supervised, or unsupervised scheme. Some of the popular algorithms are One-Class SVM, Isolation Forest, and Local Outlier Factor (LOF). One-Class SVM is good for finding strong outliers, or novelties; if there are outliers in the training data, its performance suffers. Isolation Forest and Local Outlier Factor are more general purpose, at the cost of being more computationally expensive. A piece of advice: take domain understanding into consideration, restrict the number of features, and try visualizations. You may also combine the predictions of several detectors using an ensemble scheme.