Anomaly Detection with Robust Zscore

Anomaly detection techniques based on statistical modeling are simple and effective. The Zscore based technique is one of them. Zscore is defined as the absolute difference between a data value and the mean, normalized by the standard deviation. A data point with a Zscore above some threshold is considered a potential outlier. One criticism of Zscore is that it is itself influenced by outliers. To remedy that, a technique called robust Zscore can be used, which is much more tolerant of outliers.

In this post, I will go over a robust Zscore based implementation on Hadoop to detect outliers in data. The solution is part of my open source project chombo. As a use case, I will be using IP network data to detect anomalous packets.

Robust Zscore

The normal Zscore is based on the mean and standard deviation, as shown below. It is a measure of how far a data point is from the mean.

z = |x – m(x)| / s(x)
where
z = Zscore
m = Mean
s = Standard deviation
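
As a minimal sketch of the formula above, the following Python computes the normal Zscore for a small list of packet sizes. The sample values and the function name zscores are illustrative assumptions, not part of chombo. With one extreme value among six points, the Zscore of that value stays below a commonly used threshold of 3, because the extreme value itself inflates the mean and standard deviation.

```python
from statistics import mean, pstdev


def zscores(values):
    """Return |x - mean| / standard deviation for every value."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    if s == 0:
        raise ValueError("standard deviation is zero; Zscore is undefined")
    return [abs(x - m) / s for x in values]


# Illustrative packet sizes (assumed data): five ordinary values, one extreme.
sizes = [500, 510, 495, 505, 490, 9973]
scores = zscores(sizes)

for size, score in zip(sizes, scores):
    print(f"{size:>6}  z = {score:.2f}")
```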

As is evident, we have a chicken and egg problem. The outliers that we are trying to detect have already influenced the estimates of the mean and standard deviation, unless the outliers are removed from the data set before calculating them. The removal can be done with an iterative process as below.

1. Calculate the mean and standard deviation
2. Find outliers and remove them from the data set
3. Repeat steps 1 and 2 until some convergence criterion is met
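
The iterative process can be sketched in Python as follows. The function name, the threshold of 3, the iteration cap, and the convergence rule (stop when a pass flags nothing) are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def iterative_zscore_outliers(values, threshold=3.0, max_iter=10):
    """Repeatedly flag and remove values whose Zscore exceeds the threshold.

    Returns (remaining_values, removed_outliers).
    """
    remaining = list(values)
    outliers = []
    for _ in range(max_iter):
        if len(remaining) < 2:
            break
        m = mean(remaining)
        s = pstdev(remaining)
        if s == 0:
            break  # all remaining values are identical
        flagged = [x for x in remaining if abs(x - m) / s > threshold]
        if not flagged:
            break  # assumed convergence criterion: nothing left to remove
        outliers.extend(flagged)
        remaining = [x for x in remaining if abs(x - m) / s <= threshold]
    return remaining, outliers


# Illustrative data (assumed): twenty ordinary packet sizes and one extreme.
data = [500, 510, 495, 505, 490] * 4 + [9973]
kept, removed = iterative_zscore_outliers(data)
print(len(kept), removed)
```

Each pass recomputes the mean and standard deviation from scratch, which on a large data set means one full scan per iteration.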

However, there is a better way, and that is what we will delve into now. Statistical methods that are not unduly affected by outliers are called robust statistical methods. From a robust statistics point of view, we can make the following observations.

The median is a robust measure of central tendency, while the mean is not.
Median absolute deviation (MAD) is a robust measure of statistical dispersion, while standard deviation is not.

Robust Zscore, as a function of the median and median absolute deviation (MAD), is defined as below.

rz = |x – med(x)| / mad(x)
where
rz = Robust Zscore
med(x) = Median
mad = Median absolute deviation

With robust Zscore, we can detect outliers reliably even when outliers are present in the data used to compute the median and median absolute deviation.

The median, as we know, corresponds to the 50th percentile value, i.e., half the data points are below the median and the other half are above it. Median absolute deviation is defined as below.

mad = 1.4826 x med(|x – med(x)|)
where
mad = Median absolute deviation
med = Median

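We take the absolute value of the deviation of each data point from the median, then take the median of those absolute deviations and multiply it by a scaling constant that makes MAD comparable to the standard deviation for normally distributed data. The robust Zscore definition was shown earlier.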
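
The two definitions can be put together in a short Python sketch. It uses 1.4826, the standard consistency constant for normally distributed data. The function names and sample values are assumptions for illustration and do not come from chombo.

```python
from statistics import median

# Standard consistency constant that scales MAD to match the standard
# deviation when the data is normally distributed.
MAD_SCALE = 1.4826


def scaled_mad(values):
    """Scaled median absolute deviation of the values."""
    med = median(values)
    return MAD_SCALE * median([abs(x - med) for x in values])


def robust_zscores(values):
    """Return |x - median| / MAD for every value."""
    med = median(values)
    mad = scaled_mad(values)
    if mad == 0:
        raise ValueError("MAD is zero; robust Zscore is undefined")
    return [abs(x - med) / mad for x in values]


# Illustrative packet sizes (assumed data): five ordinary values, one extreme.
sizes = [500, 510, 495, 505, 490, 9973]
rz = robust_zscores(sizes)

for size, score in zip(sizes, rz):
    print(f"{size:>6}  rz = {score:.2f}")
```

On this sample the extreme value gets a very large robust Zscore while the ordinary values stay small, because neither the median nor MAD is pulled toward the extreme value.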

Map Reduce for Median and Median Absolute Deviation

The map reduce class NumericalAttrMedian calculates either the median or the median absolute deviation (MAD), depending on how a configuration parameter is set. The first run of the map reduce calculates the median. The second run, which calculates MAD, uses the median calculated in the first run.

Generally, median calculation involves sorting the data. However, I have used a bucketing approach, so that only the data in the bucket where the 50th percentile value falls is sorted. We can make this optimization because we know that all values in the buckets before this one are below the 50th percentile mark. By the same token, all values in the following buckets are above it.
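
Here is a single-process Python sketch of the bucketing idea. It is not the NumericalAttrMedian code. The fixed bucket width, the function names, and the handling of an even number of values (averaging the two middle values) are assumptions for illustration.

```python
def _select(buckets, rank):
    """Return the value at the given 0-based rank, sorting only one bucket."""
    seen = 0
    for index in sorted(buckets):  # bucket indexes are in value order
        bucket = buckets[index]
        if rank < seen + len(bucket):
            return sorted(bucket)[rank - seen]
        seen += len(bucket)  # whole bucket lies below the target rank
    raise IndexError("rank out of range")


def bucketed_median(values, bucket_width):
    """Median computed by counting through buckets instead of a full sort."""
    if not values:
        raise ValueError("no values")
    buckets = {}
    for x in values:
        buckets.setdefault(int(x // bucket_width), []).append(x)
    n = len(values)
    low = _select(buckets, (n - 1) // 2)
    high = _select(buckets, n // 2)
    return (low + high) / 2


# Illustrative packet sizes (assumed data) with an assumed bucket width of 100.
print(bucketed_median([500, 510, 495, 505, 490, 9973], 100))
```

Only the bucket counts are needed to locate the bucket that holds the median, so the values in all other buckets are never sorted.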

The mapper key consists of any partitioning fields, the column ordinal of the column for which the median is being computed, and the bucket index. The mapper output value is the list of all values in the corresponding bucket. The sorting is done on the reducer side.

The input consists of the following fields.

Source IP address
Target IP address
Time stamp
Packet size

The pair (source IP address, target IP address) serves as the partitioning fields, assuming that the input data consists of packet size data for many host pair combinations.
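
To make the key structure concrete, here is a small Python simulation of the map and shuffle steps for this input. It is only a sketch of the idea, not the Hadoop mapper in chombo. The bucket width, the function names, and the comma-separated input layout (taken from the sample records in this post) are assumptions.

```python
from collections import defaultdict

PACKET_SIZE_ORDINAL = 3  # zero-based position of packet size in a record
BUCKET_WIDTH = 100       # assumed bucket width, for illustration only


def map_record(line):
    """Emit (key, value) where key = (source, target, column ordinal, bucket)."""
    fields = line.split(",")
    source, target = fields[0], fields[1]
    size = int(fields[PACKET_SIZE_ORDINAL])
    key = (source, target, PACKET_SIZE_ORDINAL, size // BUCKET_WIDTH)
    return key, size


def shuffle(lines):
    """Group mapped values by key, as the framework would before the reducer."""
    grouped = defaultdict(list)
    for line in lines:
        key, value = map_record(line)
        grouped[key].append(value)
    return grouped


records = [
    "165.68.112.84,165.68.103.116,1436192602,9973",
    "165.68.112.84,165.68.103.116,1436194507,514",
]
for key, values in shuffle(records).items():
    print(key, values)
```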

Map Reduce for Outlier Detection with Robust Zscore

With the median and MAD values in hand, outliers can easily be found using the generic data validation map reduce class ValidationChecker. With this map reduce, one or more validators can be configured for any field. Details about this map reduce can be found in an earlier post.

In our case, we are analyzing only one field, which is the packet size. We have configured only one validator for this field, which is robustZscoreBasedRange. The validation checker map reduce generates a report containing details of the records that were found to be invalid according to the validators configured for a field.

In our case, any record whose packet size has a robust Zscore exceeding some user defined threshold is considered invalid. Those are also the outliers we are looking for. Here is the output.

165.68.75.105,165.68.65.106,1436192435,50
field:3
robustZscoreBasedRange
165.68.112.84,165.68.103.116,1436192602,9973
field:3
robustZscoreBasedRange
165.68.112.84,165.68.103.116,1436194507,514
field:3
robustZscoreBasedRange

For each invalid record, the output consists of the list of fields found invalid. For each field, the list of validators that found the field invalid is also part of the output.
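
If the report needs to be processed further, a parser along the following lines could group it by record. This is a sketch based only on the sample output shown above. It assumes that record lines contain commas, that field lines start with "field:", and that any other line is a validator name. The function name parse_report is hypothetical.

```python
def parse_report(lines):
    """Group validator report lines into one entry per invalid record."""
    entries = []
    field = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("field:"):
            if not entries:
                raise ValueError("field line before any record line")
            field = int(line.split(":", 1)[1])
            entries[-1]["invalid"][field] = []
        elif "," in line:
            # Assumed: only record lines contain commas.
            entries.append({"record": line.split(","), "invalid": {}})
            field = None
        else:
            if field is None:
                raise ValueError("validator line before any field line")
            entries[-1]["invalid"][field].append(line)
    return entries


report = [
    "165.68.75.105,165.68.65.106,1436192435,50",
    "field:3",
    "robustZscoreBasedRange",
    "165.68.112.84,165.68.103.116,1436192602,9973",
    "field:3",
    "robustZscoreBasedRange",
]
for entry in parse_report(report):
    print(entry["record"], entry["invalid"])
```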

Summing Up

We have gone through an outlier detection technique based on robust Zscore. Anomaly or outlier detection techniques can be of two types: they are based on either instance data or sequence data. The algorithm discussed in this post is instance data based.

The tutorial document contains step by step instructions for generating data and executing the use case.

Originally posted here.
