Learning Alarm Threshold from User Feedback using Decision Tree on Spark

Alarm fatigue is a phenomenon in which someone who is exposed to a large number of alarms becomes desensitized to them and starts ignoring them. It has been reported that security professionals ignore 32% of alarms because they are thought to be false. This kind of sensory overload can happen with monitoring systems in various domains, e.g., computer systems and networks, industrial monitoring systems, and medical patient monitoring systems.

Typically, alarm flooding happens when alarm threshold levels are not set properly. How do we know what the proper alarm threshold level should be? That is the problem we will be addressing in this post. Assuming user feedback is available for alarms, we will use supervised machine learning to learn a new threshold. The solution is implemented on Spark in my open source project beymani.

Remedy for Alarm Flooding

Alarms are generated when some measured quantity falls outside an acceptable range. The quantity could be, for example, the CPU usage of a computer or the temperature as measured by a sensor. The alarm could be based on the actual value of the quantity or on some function of it. For example, in anomaly detection we could use the z score of the quantity, in which case the threshold could be defined as some maximum value of the z score.
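
To make this concrete, here is a minimal sketch of a z score based alarm check, in Scala. The object and method names are mine, purely for illustration; this is not beymani's actual code.

// Minimal sketch of a z score based alarm check (hypothetical names,
// not beymani's implementation)
object ZScoreAlarm {
  // z score: distance from the mean in units of standard deviation
  def zScore(value: Double, mean: Double, stdDev: Double): Double =
    (value - mean) / stdDev

  // an alarm fires when the z score exceeds the configured threshold
  def isAlarm(value: Double, mean: Double, stdDev: Double, threshold: Double): Boolean =
    zScore(value, mean, stdDev) > threshold

  def main(args: Array[String]): Unit = {
    // e.g., CPU usage of 92% against a historical mean of 55% and std deviation of 12
    println(isAlarm(92.0, 55.0, 12.0, threshold = 2.0))  // prints: true
  }
}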

Setting the threshold is a double-edged sword. If set conservatively, there will be too many alarms, causing alarm fatigue; this is the case of too many false positives. If set leniently, there will be too few alarms, with the potential of a legitimate alarm not being generated, a.k.a. a false negative. This may have serious consequences, including threat to human life in the case of a patient monitoring system.

One of the reasons patient monitoring systems have a false alarm rate of 85% or higher is that the thresholds are set too conservatively, to ensure that no legitimate alarm is missed. The remedy is to properly tune the threshold.

If there were a way to collect feedback from users indicating whether an alarm is relevant or not, that feedback could be used along with a supervised machine learning algorithm to learn the proper threshold level. In a cyber security dashboard, the administrator could click a button to indicate whether an alarm is accepted as real. Similarly, in a patient monitoring system, a nurse could press a button on a handheld device to indicate whether an alarm is legitimate.

Classification for Threshold Tuning

The problem of threshold tuning can be framed as a classification problem, where the quantity based on which an alarm is created is the feature variable and the class label is binary. The class label is true if an alarm is accepted as legitimate by the user, and false otherwise.

Although there are many classification algorithms available, we are going to use a decision tree, because we are only interested in extracting the rule from the trained decision tree model; our goal is not prediction. Since there is only one feature variable involved, our trained model will be a decision stump, i.e., a tree with a single internal node.

Once trained, the model will give us a value of the quantity above which most alarms are labeled true and below which most are labeled false. This newly discovered value should be used as the threshold for the alarm generating system going forward.
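
As an illustration of the decision stump idea, here is a sketch using Spark ML's DecisionTreeClassifier with maxDepth set to 1, trained on (score, label) pairs, with the learned split threshold read off the root node. This is only a sketch of the concept; beymani implements its own learner (covered later), and the sample data here is a handful of labeled records like those shown further below.

// Sketch: decision stump with Spark ML (illustration only; beymani uses
// its own ThresholdLearner, described later in the post)
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tree.{ContinuousSplit, InternalNode}
import org.apache.spark.sql.SparkSession

object StumpSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("stump").master("local[*]").getOrCreate()
    import spark.implicits._

    // (outlier score, class) pairs; 1.0 = alarm accepted (T), 0.0 = rejected (F)
    val pairs = Seq(
      (0.997, 1.0), (0.922, 0.0), (0.914, 0.0), (0.907, 0.0), (0.992, 1.0),
      (0.996, 1.0), (0.984, 1.0), (0.922, 0.0), (0.956, 0.0), (0.904, 0.0)
    ).toDF("score", "label")

    // wrap the single feature in a vector column, as Spark ML expects
    val data = new VectorAssembler()
      .setInputCols(Array("score")).setOutputCol("features")
      .transform(pairs)

    // maxDepth = 1 yields a stump: exactly one split on the score
    val model = new DecisionTreeClassifier().setMaxDepth(1).fit(data)

    // the learned threshold sits in the root node's continuous split
    model.rootNode match {
      case node: InternalNode =>
        val threshold = node.split.asInstanceOf[ContinuousSplit].threshold
        println(s"learned threshold: $threshold")
      case _ => println("no split learned")
    }
    spark.stop()
  }
}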

Server Metric

We will use the same use case as in my earlier post on anomaly detection in server metric data, CPU usage data to be precise. The output from the Spark workflow, which does z score based outlier detection, looks like this.

AATL8034VH,1535421024,5,82,0.992,O
7HA60RSWDG,1535421926,5,62,0.957,O
AATL8034VH,1535423425,5,59,0.929,N
AATL8034VH,1535423726,5,57,0.914,N
751W21QH42,1535424925,5,86,0.991,O
7HA60RSWDG,1535425526,5,54,0.904,N
OP5186BZQ4,1535427924,5,74,0.973,O
751W21QH42,1535428525,5,99,0.997,O
7HA60RSWDG,1535432424,5,56,0.922,N
AATL8034VH,1535432425,5,57,0.914,N
751W21QH42,1535432422,5,59,0.907,N

The fields in the output are as follows:

  1. Server ID
  2. Time stamp
  3. Day of the week
  4. CPU usage
  5. Outlier score
  6. Label (O for outlier and N for normal)
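
For reference, a record like the ones above can be parsed as follows. The case class and field names are mine, purely for illustration.

// Parsing one detector output record (hypothetical case class, not beymani code)
case class MetricRecord(
  serverId: String,      // 1. server ID
  timeStamp: Long,       // 2. epoch time stamp
  dayOfWeek: Int,        // 3. day of the week
  cpuUsage: Int,         // 4. CPU usage
  outlierScore: Double,  // 5. outlier score
  label: String          // 6. O for outlier, N for normal
)

def parse(line: String): MetricRecord = {
  val f = line.split(",")
  MetricRecord(f(0), f(1).toLong, f(2).toInt, f(3).toInt, f(4).toDouble, f(5))
}

// parse("AATL8034VH,1535421024,5,82,0.992,O").outlierScore == 0.992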

The Spark job for outlier detection was configured so that some normal records, with scores below the outlier threshold but above a second, lower threshold, were also included in the output. This was done so that, once the output was displayed, the user could if necessary label as outliers some records that the detection system had deemed normal. This is the false negative case.

The user could also label as normal some records that the detection system had deemed outliers, reflecting the false positive case.

Simulating User Feedback

We will be using a script to simulate user feedback on the outlier and normal records presented to the user. Here is some sample output, with two additional fields appended.

751W21QH42,1535428525,5,99,0.997,O,O,T
7HA60RSWDG,1535432424,5,56,0.922,N,N,F
AATL8034VH,1535432425,5,57,0.914,N,N,F
751W21QH42,1535432422,5,59,0.907,N,N,F
OP5186BZQ4,1535432723,5,89,0.992,O,O,T
7HA60RSWDG,1535433024,5,86,0.996,O,O,T
OP5186BZQ4,1535433626,5,98,0.996,O,O,T
OP5186BZQ4,1535433923,5,80,0.984,O,O,T
751W21QH42,1535434226,5,61,0.922,N,N,F
AATL8034VH,1535434824,5,64,0.956,O,N,F
7HA60RSWDG,1535435124,5,54,0.904,N,N,F

The second-to-last field is the label based on user feedback, which may or may not agree with the system predicted label. The last field is T or F, derived from the system predicted and user feedback based labels as shown in the table below. This is the class or output variable used by the supervised learning model.

System predicted label | User feedback label | Class label | Comment
N                      | N                   | F           | true negative
N                      | O                   | T           | false negative
O                      | N                   | F           | false positive
O                      | O                   | T           | true positive

The classification uses the 5th field as the feature variable and the 8th field as the class variable.
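
Here is a sketch of how such a feedback simulation might work. The hidden "true" threshold of 0.965 comes from the post itself (see the results below); the label noise is my own assumption, added to mimic imperfect human feedback, and the actual script in beymani may differ.

import scala.util.Random

// Sketch of simulating user feedback (the real script in beymani may differ)
object FeedbackSim {
  val trueThreshold = 0.965   // hidden "true" threshold, per the post
  val noise = 0.05            // assumed chance of a flipped user label
  val rand = new Random(42)

  // append user label (7th field) and class label (8th field) to a record
  def addFeedback(line: String): String = {
    val score = line.split(",")(4).toDouble   // 5th field: outlier score
    val trulyOutlier = score > trueThreshold
    // occasionally flip the decision to mimic imperfect feedback
    val userSaysOutlier =
      if (rand.nextDouble() < noise) !trulyOutlier else trulyOutlier
    val userLabel  = if (userSaysOutlier) "O" else "N"
    // per the table above: class label is T exactly for user-confirmed outliers
    val classLabel = if (userSaysOutlier) "T" else "F"
    s"$line,$userLabel,$classLabel"
  }
}

// FeedbackSim.addFeedback("751W21QH42,1535428525,5,99,0.997,O")
//   -> "751W21QH42,1535428525,5,99,0.997,O,O,T" (most of the time)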

Training Classification Model

The decision stump based classification model is implemented in the Scala object ThresholdLearner. In a decision tree, a variable is split in such a way that the population in each partition is as homogeneous as possible with respect to the class variable. Here is the output for various split points and the corresponding information theory based scores. A lower score indicates a better split point.

0.965000,0.033906
0.950000,0.176625
0.930000,0.258033
0.940000,0.222942
0.975000,0.126869
0.925000,0.275430
0.935000,0.234167
0.960000,0.094521
0.955000,0.137005
0.945000,0.211286
0.970000,0.058262

Two information theory based algorithms are supported for calculating the score that characterizes the split quality: entropy and Gini index. These results are based on entropy. The best split point is 0.965, since it has the lowest score (0.033906). The candidate split points are provided through the configuration parameter split.points.
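
To make the scoring concrete, here is a minimal sketch of the entropy based split score, following the standard weighted entropy formula; ThresholdLearner's actual code may be organized differently.

// Weighted entropy score of a binary split (standard formula; a minimal
// sketch, not necessarily ThresholdLearner's exact implementation)
def entropy(labels: Seq[String]): Double = {
  val n = labels.size.toDouble
  labels.groupBy(identity).values.map { group =>
    val p = group.size / n
    -p * math.log(p) / math.log(2)   // log base 2
  }.sum
}

// score = population-weighted average entropy of the two partitions;
// lower means more homogeneous partitions, i.e., a better split
def splitScore(data: Seq[(Double, String)], splitPoint: Double): Double = {
  val (above, below) = data.partition { case (score, _) => score > splitPoint }
  val n = data.size.toDouble
  (above.size / n) * entropy(above.map(_._2)) +
    (below.size / n) * entropy(below.map(_._2))
}

Evaluating splitScore at each configured split point and taking the minimum is what produces a table of scores like the one above.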

When simulating the user feedback, we had used a threshold of 0.965, which corresponds to a false alarm scenario. Our learning model learns the same value of 0.965 for the threshold, which validates the learning model.

This should be the new threshold value for anomaly detection. If you look at the configuration file, you will find that the threshold value used for detecting outliers was 0.950. According to our analysis, the new threshold value should be 0.965, implying that the current threshold value is too low and generates too many alarms. In other words, if the threshold is set to 0.965, users are likely to find most records above the new threshold to be outliers and most records below it to be normal.

Summing Up

We have gone through the steps for solving the alarm flooding problem in any kind of monitoring system, using a supervised learning model with user feedback as the labeled data. Please follow the tutorial if you want to execute the steps.

Originally posted here.
