Outlier Detection using Apache Spark Solution
Sometimes an outlier is defined with respect to a context, and whether a data point should be labeled as an outlier depends on the associated context. For a bank ATM, transactions that are considered normal between 6 AM and 10 PM may be considered anomalous between 10 PM and 6 AM. In this case, the context is the hour of the day.
In this post, we will go through some contextual outlier detection techniques based on statistical modeling of the data. The Spark based implementation is available in an open source project on GitHub: https://github.com/ThirdEyeData/Outlier-Detections-Apache-Spark
Outlier Detection
Outlier detection techniques can be categorized in different ways, depending on how the data is treated and how the outliers are predicted. The categories are as follows; our use case falls under the first one.
- Detecting point data outliers, treating the underlying data as independent point data
- Detecting point data outliers, treating the underlying data as a sequence
- Detecting subsequence outliers, treating the underlying data as a sequence
Sequence data in a temporal context is called time series data. Another facet of the data is that it can be either univariate, i.e. scalar, or multivariate, i.e. a vector.
There are various solution techniques for outlier detection, some of them applicable to sequence data. Here are some of the popular ones:
- Classification
- Statistical Modeling
- Clustering
- SVM
- PCA
- Nearest Neighbor
- Neural Network
- Finite State Machine
- Markov Model
- Forecasting
Statistical Modeling Based Outlier Detection
The underlying assumption in these techniques is that the outliers are not generated from the generative process that generates the normal data. We use the training data to build a statistical model. If the probability of a data point being generated from the model is very low, then it’s marked as an outlier.
The models may be parametric or non-parametric. With parametric modeling, we assume a certain known probability distribution and find the parameters of the model with maximum likelihood techniques. With the non-parametric approach, no such assumption is made about the underlying probability distribution. These are the techniques supported in the Spark implementation:
- Z-score
- Extreme value statistics
- Robust z-score
- Non-parametric univariate distribution
- Non-parametric multivariate distribution
We will be using the z-score based technique; any of the other techniques may be used by setting a configuration parameter appropriately. With the z-score, a normal distribution is assumed for the data, and we learn the parameters of the distribution (mean and standard deviation) from the training data.
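To make the z-score idea concrete, here is a bare-bones sketch in Scala. It is not the project's code; the training values are made up, and the rule of thumb of three standard deviations is just a common default.

```scala
// Minimal z-score illustration (not the project's implementation).
object ZScoreSketch extends App {
  // hypothetical training values, e.g. CPU usage readings
  val training = Seq(40.0, 56.0, 62.0, 58.0, 49.0, 65.0, 53.0)
  val mean = training.sum / training.size
  val variance = training.map(x => (x - mean) * (x - mean)).sum / training.size
  val stdDev = math.sqrt(variance)

  // number of standard deviations a point lies away from the mean
  def zScore(x: Double): Double = math.abs(x - mean) / stdDev

  // a common rule of thumb: more than 3 standard deviations away is an outlier
  println(s"z(99.0) = ${zScore(99.0)}, outlier = ${zScore(99.0) > 3.0}")
}
```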
Depending on the technique selected, different Spark jobs need to be used for building the required statistical model. Here is the list.
| Algorithm | Modeling Spark job | Comment |
| --- | --- | --- |
| zscore | NumericalAttrStats | parametric univariate |
| robustZscore | NumericalAttrMedian | non-parametric univariate |
| estimatedProbablity | MultiVariateDistribution | non-parametric multivariate |
| estimatedAttributeProbablity | NumericalAttrDistrStats | non-parametric univariate |
| extremeValueProbablity | NumericalAttrStats | parametric univariate |
The appropriate statistical modeling Spark job should be run first, before running the outlier detection Spark job.
Data Center Server Metric
Our use case involves a data center monitoring all of its servers and collecting various usage metrics for CPU, RAM, disk, and network. The operators are interested in detecting unusual activity, which might indicate a security breach and intrusion into their network.
Our data set contains CPU usage; some sample data is shown below. There is one record with high CPU usage.
OP5186BZQ4,1535293526,3,40
7HA60RSWDG,1535293826,3,56
AATL8034VH,1535293824,3,69
751W21QH42,1535293825,3,99
OP5186BZQ4,1535293826,3,77
The different fields in the data are as follows (a small parsing sketch follows the list):
- Server ID
- Time stamp
- Day of the week (0 being Monday)
- CPU usage
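As an illustration, each record could be parsed into a case class like the one below; the type and field names are my own and not taken from the project. The weekend rule follows from the stated convention that 0 is Monday.

```scala
// Hypothetical record type for the sample data above (field names are assumptions).
case class ServerMetric(serverId: String, timestamp: Long, dayOfWeek: Int, cpuUsage: Double) {
  // 0 is Monday, so 5 (Saturday) and 6 (Sunday) fall on the weekend
  def isWeekend: Boolean = dayOfWeek >= 5
}

object ServerMetric {
  def parse(line: String): ServerMetric = {
    val f = line.split(",")
    ServerMetric(f(0), f(1).toLong, f(2).toInt, f(3).toDouble)
  }
}

// example: ServerMetric.parse("OP5186BZQ4,1535293526,3,40")
```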
Cyclic Context
The data has a field that indicates the day of the week. Why do we need that? Because our data has cycles: for any server, the usage pattern on weekends is different from that on weekdays. In other words, our data has a temporal context.
The implication is that we have to build separate statistical models for each server, and for each server, one for weekdays and one for weekends. That's what we will do.
In our case we already knew that the data has a weekly cycle. What happens if you don't know what kind of cycles your data has? Autocorrelation is one way of discovering them, as sketched below.
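As a rough sketch (not part of the project), you could compute the autocorrelation of a per-day aggregate of the metric at a few candidate lags; a pronounced peak at lag 7 would point to a weekly cycle. The series loader below is a placeholder.

```scala
// Sketch of autocorrelation over a daily aggregate series (loader is a placeholder).
object AutoCorrelationSketch {
  def autoCorrelation(series: Array[Double], lag: Int): Double = {
    val n = series.length
    val mean = series.sum / n
    val denom = series.map(x => (x - mean) * (x - mean)).sum
    val num = (0 until n - lag).map(i => (series(i) - mean) * (series(i + lag) - mean)).sum
    num / denom
  }

  // placeholder: in practice this would be the per-day average CPU usage of one server
  def loadDailySeries(): Array[Double] = Array.fill(120)(scala.util.Random.nextDouble() * 100)

  def main(args: Array[String]): Unit = {
    val daily = loadDailySeries()
    Seq(2, 3, 7, 14, 30).foreach(lag => println(s"lag $lag -> ${autoCorrelation(daily, lag)}"))
  }
}
```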
Spark Workflow
The Spark workflow involves two Spark jobs. Each job is run twice. Here is the complete flow.
1. NumericalAttrStats: calculates statistics for one or more specified columns on the original data, potentially including outliers
2. StatsBasedOutlierPredictor: predicts outliers for the field, based on the statistics calculated in step 1
3. NumericalAttrStats: recalculates statistics for the same columns after the outlier records have been removed in step 2
4. StatsBasedOutlierPredictor: predicts outliers for the field, based on the revised statistics
With outliers, we have a chicken and egg problem. Outliers are removed based on statistics calculated with the training data, yet the outliers themselves influence those statistics, especially the standard deviation. This is the reason we calculate statistics twice, the second time with the outliers removed.
These additional steps to remove outliers are not necessary if you use the robust z-score, which is based on the median and is not affected by the presence of outliers; one common formulation is sketched below. Ideally, steps 1 and 2 should be repeated until the statistics converge as outliers get removed.
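For reference, here is one common formulation of the robust z-score, built on the median and the median absolute deviation (MAD); the project's robustZscore option may differ in its details.

```scala
// Median/MAD based robust z-score sketch; a single extreme value barely moves the statistics.
object RobustZScoreSketch extends App {
  def median(xs: Seq[Double]): Double = {
    val s = xs.sorted
    val n = s.size
    if (n % 2 == 1) s(n / 2) else (s(n / 2 - 1) + s(n / 2)) / 2.0
  }

  val data = Seq(40.0, 56.0, 62.0, 58.0, 49.0, 65.0, 53.0, 99.0) // 99 is the outlier
  val med = median(data)
  val mad = median(data.map(x => math.abs(x - med)))

  // 0.6745 makes the MAD comparable to a standard deviation for normally distributed data
  def robustZ(x: Double): Double = 0.6745 * math.abs(x - med) / mad

  data.foreach(x => println(f"$x%.1f -> ${robustZ(x)}%.2f"))
}
```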
The data is partitioned by the components listed below; a simplified Spark sketch of the per-partition statistics follows the list. All stats are calculated per partition, and when predicting whether a record is an outlier, the stats for the corresponding partition are used.
- Server ID
- Weekday or weekend index (0 for weekdays and 1 for weekends)
- Field index (column index for field being analyzed)
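Here is a simplified Spark sketch of what per-partition statistics could look like. It is not the NumericalAttrStats job itself; the input and output paths are assumptions, and the weekend rule again follows the stated day-of-week convention.

```scala
import org.apache.spark.sql.SparkSession

// Per-partition mean and standard deviation, keyed by (server ID, weekend flag, field index).
object PartitionStatsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("partitionStatsSketch").getOrCreate()
    val fieldIndex = 3                                     // CPU usage column

    val stats = spark.sparkContext.textFile("hdfs:///data/cpu_usage.csv")   // assumed path
      .map(_.split(","))
      .map { f =>
        val weekendFlag = if (f(2).toInt >= 5) 1 else 0    // 0 = Monday, so 5 and 6 are weekend
        val x = f(fieldIndex).toDouble
        ((f(0), weekendFlag, fieldIndex), (1L, x, x * x))  // (count, sum, sum of squares)
      }
      .reduceByKey { case ((n1, s1, q1), (n2, s2, q2)) => (n1 + n2, s1 + s2, q1 + q2) }
      .mapValues { case (n, s, q) =>
        val mean = s / n
        val variance = q / n - mean * mean
        (mean, math.sqrt(variance))
      }

    stats.saveAsTextFile("hdfs:///model/cpu_stats")         // assumed output path
    spark.stop()
  }
}
```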
Calculating Statistics
The Spark job NumericalAttrStats performs this function. Its output contains as many lines as there are partitions in the data. Here is some sample output:
751W21QH42,weekDayOrWeekendOfWeek,0,3,$,125704.000,8294642.000,2016,62.353,226.487,15.049,24.000,100.000
OP5186BZQ4,weekDayOrWeekendOfWeek,0,3,$,125753.000,8312841.000,2016,62.377,232.483,15.247,25.000,100.000
AATL8034VH,weekDayOrWeekendOfWeek,1,3,$,38337.000,1531879.000,1152,33.279,222.288,14.909,6.000,100.000
7HA60RSWDG,weekDayOrWeekendOfWeek,1,3,$,37913.000,1499513.000,1152,32.911,218.554,14.784,6.000,99.000
The fifth field from the end is the Mean and the third field from the end is the Std Deviation.
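For example, the two values can be pulled out by position from the end of a line (a snippet you could paste into the Scala REPL):

```scala
// Extract mean and standard deviation from a stats output line by position from the end.
val line = "751W21QH42,weekDayOrWeekendOfWeek,0,3,$,125704.000,8294642.000,2016,62.353,226.487,15.049,24.000,100.000"
val fields = line.split(",")
val mean   = fields(fields.length - 5).toDouble   // fifth field from the end
val stdDev = fields(fields.length - 3).toDouble   // third field from the end
println(s"mean = $mean, stdDev = $stdDev")
```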
The first time it runs (step 1), it takes the raw data as input. In the first run, the parameter rem.outliers should be set to true, so that outliers, once identified, are removed from the data set. In the second run (step 3), it takes the clean data, i.e. the raw data with the outlier records removed. When you run it with the clean data, you will notice that the standard deviations drop significantly.
Detecting Outliers
Outliers are detected using the Spark job StatsBasedOutlierPredictor. It provides 5 techniques for statistical modeling based outlier detection; we are using the z-score option. Here is some sample output:
OP5186BZQ4,1535452224,5,70,0.961,O
OP5186BZQ4,1535453122,5,71,0.965,O
7HA60RSWDG,1535454023,5,80,0.993,O
AATL8034VH,1535454925,5,96,0.998,O
OP5186BZQ4,1535455823,5,70,0.961,O
751W21QH42,1535457025,5,67,0.953,O
7HA60RSWDG,1535458525,5,93,0.998,O
7HA60RSWDG,1535459722,5,87,0.997,O
751W21QH42,1535460323,5,95,0.996,O
The output contains outlier records only, because we set the parameter output.outliers to true. Otherwise, all records would have been in the output. The last field is either “O” for outlier records or “N” for normal records.
Even when you are getting only outlier data, you can still get some normal records whose scores are below the outlier threshold but above another threshold defined by score.thresholdNorm. This additional data could be useful if the output is displayed in a dashboard for an administrator, who may interpret some of the normal records as outliers; these correspond to the false negative cases.
The second field from the end is the outlier score. Outlier scores are between 0 and 1; the higher the value, the more likely it is that the record is an outlier. Some algorithms naturally produce outlier scores between 0 and 1, but the z-score can be any positive number, so an exponential scaling is applied to generate a score in the desired range.
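One plausible mapping, consistent with the numbers quoted later in this post (a z-score of 3.0 scaling to about 0.95), is shown below; the project may use a different scaling function.

```scala
// Exponential scaling of a non-negative z-score into the range (0, 1).
def scaledScore(z: Double): Double = 1.0 - math.exp(-z)

println(scaledScore(1.0))  // ~0.63
println(scaledScore(3.0))  // ~0.95
println(scaledScore(5.0))  // ~0.99
```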
Sometimes the polarity of the data needs to be taken into account before identifying a record as an outlier. For example, with CPU usage, extreme high values are true outliers but extreme low values are not. This is controlled through the parameter outlier.polarity, which can take 3 values: all, high, and low.
Multi Variate Data
Multivariate data has many elements, i.e. it's a vector. If your data is a vector, you can calculate an outlier score for each element. The element scores can then be aggregated into a weighted average for the whole record; the weights are provided through configuration.
For example, the metrics for a host may include RAM usage, disk usage, and network bandwidth usage in addition to CPU usage, making the record a vector of 4 elements. You may be interested in detecting outliers at the record level.
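A sketch of the record-level aggregation follows; the element names, scores, and weights are made up for illustration.

```scala
// Weighted average of per-element outlier scores; weights would come from configuration.
val scores  = Map("cpu" -> 0.96, "ram" -> 0.40, "disk" -> 0.35, "network" -> 0.72)
val weights = Map("cpu" -> 0.4,  "ram" -> 0.2,  "disk" -> 0.1,  "network" -> 0.3)

val recordScore = scores.map { case (k, s) => s * weights(k) }.sum / weights.values.sum
println(f"record level outlier score: $recordScore%.3f")   // 0.715 with these numbers
```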
Outlier Score Threshold
We set a threshold value for the outlier score in the configuration through the parameter score.threshold. Any record with a score higher than the threshold is marked as an outlier. But how do we know what the threshold should be? We don't; we make an educated guess. For example, with the exponential scaling of the z-score, a score of 0.95 corresponds to a z-score of 3.0, which is a reasonable threshold.
Generally, outlier records are displayed on some kind of dashboard for human consumption. Setting the proper threshold is a double edged sword. If the threshold is set too low, it will result in too many false positives, a problem also known as alarm fatigue. If it's set too high, it will result in too many false negatives, and true outliers will be ignored.
Finding the proper threshold can be framed as a supervised Machine Learning problem. Relevance feedback from the user can be used as labels for true outliers, and a Machine Learning model can be trained to predict the correct threshold. I will post more on this topic later.
Real Time Outlier Detection
If you have a low latency, real time or near real time requirement for outlier detection, then the Spark job StatsBasedOutlierPredictor should be implemented for Spark Streaming. The statistical models could be held as state in the Spark Streaming application.
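As a rough sketch of the streaming variant, one could read the precomputed per-partition statistics as a static table and join it against the incoming metric stream with Structured Streaming. Everything below, including the paths, column names, socket source, and thresholds, is an assumption for illustration, not the project's streaming implementation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingOutlierSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("streamingOutlierSketch").getOrCreate()
    import spark.implicits._

    // static per-partition stats produced offline (schema and path are assumptions)
    val stats = spark.read.option("header", "true").csv("hdfs:///model/cpu_stats.csv")
      .select($"serverId", $"weekendFlag".cast("int").as("weekendFlag"),
              $"mean".cast("double").as("mean"), $"stdDev".cast("double").as("stdDev"))

    // incoming records: serverId,timestamp,dayOfWeek,cpuUsage
    val records = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 9999).load()
      .select(split($"value", ",").as("f"))
      .select($"f".getItem(0).as("serverId"),
              $"f".getItem(2).cast("int").as("dayOfWeek"),
              $"f".getItem(3).cast("double").as("cpuUsage"))
      .withColumn("weekendFlag", when($"dayOfWeek" >= 5, 1).otherwise(0))

    // stream-static join, z-score, exponential scaling, and thresholding
    val outliers = records.join(broadcast(stats), Seq("serverId", "weekendFlag"))
      .withColumn("zScore", abs($"cpuUsage" - $"mean") / $"stdDev")
      .withColumn("score", lit(1.0) - exp(-$"zScore"))
      .filter($"score" > 0.95)

    outliers.writeStream.format("console").outputMode("append").start().awaitTermination()
  }
}
```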
Wrapping Up
We have gone through some statistical modeling techniques for detecting outliers in data. If your data is not stationary, these models need to be rebuilt periodically with recent data. To execute all the steps for our use case, please follow the steps in the tutorial document.
Originally posted here.