Use Cases
We describe a new production machine learning solution that monitors events in IT and industrial operations and explains their symptoms. The solution is used in a variety of industrial applications, including proactive monitoring of IT operations infrastructure, monitoring of events from Industrial Internet of Things (IIoT) connected devices, and predictive monitoring of any IT operations management component, such as hyperconverged infrastructure, clouds, virtual infrastructure, applications, networks, and microservices.
The solution is deployed on Google Cloud Platform. It combines research from Google’s corporate engineering with the machine learning and operationalization tools of Google Cloud, brought together by Google Cloud’s professional services.
Key benefits of our approach are:
Google’s solution provides a scalable, unsupervised approach that works on largely unlabeled data to proactively monitor events in data streams and explain its predictions. Our approach is particularly useful when:
data is correlated and multi-modal,
failures are complex,
conditions are unpredictable, and
monitored components are too new to characterize normal and failure modes.
Our solution provides explanations of the predicted failures by using Google Research’s innovative model explainability technology.
Our solution has been deployed in a variety of industrial and IT management applications including:
Power and climate control in commercial buildings and power equipment.
IT infrastructure monitoring and management.
Badge readers and alarms in physical security systems.
Electromechanical components in power plants.
In this blog post, we describe how our solution is deployed to address industry-critical problems in smart IT operations management for Zenoss, an IT service assurance company.
Algorithm
Imagine you’re a technician who maintains thousands of networked devices. These devices can be virtual machines, servers, HVAC units, engines, etc., and each generates a data stream of timestamped, multidimensional measurement updates. Chances are high that at any given time, somewhere in the fleet, there are faulty devices that require your attention. Due to complex device interactions and a dynamic environment, it may be impossible to characterize normal operating conditions with rules, or even to train a machine learning classifier with labeled failure examples. Unsupervised anomaly detectors (trained without labels), such as Isolation Forest and One-Class Support Vector Machines (OC-SVM), are commonly used in those situations, but they provide only a nondescript alarm when the device generates unusual updates.
Detecting a faulty device is only the beginning of a technician’s task, and the repair requires more than that nondescript alarm to:
Determine if the anomaly is a true positive,
Diagnose the problem and estimate the root cause,
Triage and prioritize the problem,
Identify and apply a fix, and
Verify the fix was successful.
In the following paragraphs, we will consider three practical anomaly detection concepts that are essential to accomplishing these tasks: accuracy and explainability, sensitivity to correlation and modes, and deploying at scale.
Accuracy and Explainability
A patient might describe an ailment to a doctor with variable attributions (“my nose is congested and I have a severe headache”) and a contrastive normal (“normally, I can breathe easily and I usually don’t have headaches”). Similarly, we must consider both detection accuracy (false positive and false negative error rates) and explainability. As with human symptoms, an anomaly should be explained by:
variable attributions that assign a “blame” score to the most important variables, and
a nearest contrastive normal point to illustrate how far off the anomaly is from normal.
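As a minimal sketch of these two explanation elements, the Python snippet below picks the nearest contrastive normal point from a small set of known-normal observations and gives each variable a share of the squared, scaled distance to that point. This is a simple distance-based stand-in for illustration only, not the attribution method used in the solution described here. The variable names, scales, and sample values are all hypothetical.

```python
import math

# Hypothetical variables and per-variable scales (assumptions for illustration).
NAMES = ("cpu_load", "temperature_c", "fan_rpm")
SCALES = (0.1, 5.0, 500.0)

# Hypothetical known-normal observations covering two operating modes.
normal_points = [
    (0.30, 45.0, 2000.0),
    (0.35, 47.0, 2100.0),
    (0.80, 60.0, 3500.0),
    (0.75, 58.0, 3400.0),
]

def scaled_squared_diffs(point, reference, scales):
    """Per-variable squared difference after dividing by each variable's scale."""
    return [((p - r) / s) ** 2 for p, r, s in zip(point, reference, scales)]

def nearest_contrastive_normal(anomaly, normals, scales):
    """Return the normal point closest to the anomaly in scaled Euclidean distance."""
    return min(
        normals,
        key=lambda n: math.sqrt(sum(scaled_squared_diffs(anomaly, n, scales))),
    )

def variable_attributions(anomaly, baseline, scales, names):
    """Split the squared scaled distance to the baseline into per-variable shares."""
    squared = scaled_squared_diffs(anomaly, baseline, scales)
    total = sum(squared)
    if total == 0:
        return {name: 0.0 for name in names}
    return {name: value / total for name, value in zip(names, squared)}

# A hypothetical anomalous update: load and fan speed look normal, temperature does not.
anomaly = (0.78, 75.0, 3450.0)
baseline = nearest_contrastive_normal(anomaly, normal_points, SCALES)
attributions = variable_attributions(anomaly, baseline, SCALES, NAMES)

print("nearest contrastive normal:", dict(zip(NAMES, baseline)))
for name, blame in sorted(attributions.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {blame:.3f}")
```

In this toy example nearly all of the blame falls on the temperature reading, and the contrastive point shows what a normal reading looks like at a similar load.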
The chart below compares various anomaly detectors in terms of Detection Accuracy and Explainability.
Univariate statistical methods apply outlier thresholds to each variable independently, and don’t recognize variable correlations or handle multimodal distributions.
Standard multivariate approaches, e.g., clustering, One-Class Support Vector Machine (OC-SVM), Isolation Forest, or Extended Isolation Forest, provide medium to high detection accuracies, but no explanation.
Both DIFFI and Autoencoder+SHAP provide variable attributions and medium detection accuracy.
A supervised classifier trained on failure labels that uses Integrated Gradients provides contrastive explanations, but low detection accuracy.
Our solution combines both high detection accuracy and contrastive explanations.
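To make the limitation of univariate methods concrete, the sketch below builds two strongly correlated variables and scores a point that breaks the correlation while staying inside each variable's individual range. It compares per-variable z-scores against the Mahalanobis distance, a standard multivariate statistic chosen here only for illustration; it is not the detector described in this post. The data, variable names, and the threshold of 3 are assumptions.

```python
import math
from statistics import mean, pstdev

# Hypothetical training data: power draw tracks CPU load almost linearly,
# with a small alternating deviation of +/- 2 watts.
cpu = [0.1 + 0.04 * i for i in range(21)]
power = [100 + 200 * c + (2 if i % 2 == 0 else -2) for i, c in enumerate(cpu)]

def z_scores(point, columns):
    """Standardize each coordinate of the point against its own column only."""
    return [(x - mean(col)) / pstdev(col) for x, col in zip(point, columns)]

def mahalanobis_2d(point, xs, ys):
    """Mahalanobis distance of a 2-D point from the sample mean of (xs, ys)."""
    mx, my = mean(xs), mean(ys)
    sxx = mean([(x - mx) ** 2 for x in xs])
    syy = mean([(y - my) ** 2 for y in ys])
    sxy = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form with the closed-form inverse of the 2x2 covariance matrix.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

THRESHOLD = 3.0  # assumed alarm threshold for both methods

# Low CPU load with high power draw: each value alone is unremarkable.
suspect = (0.2, 260.0)

univariate_alarm = any(abs(z) > THRESHOLD for z in z_scores(suspect, (cpu, power)))
multivariate_alarm = mahalanobis_2d(suspect, cpu, power) > THRESHOLD

print("univariate alarm:", univariate_alarm)
print("multivariate alarm:", multivariate_alarm)
```

The per-variable thresholds stay silent because each value sits well within its own range, while the multivariate score flags the point because the pair violates the learned relationship. Note that a single covariance matrix still assumes one mode, so this sketch does not address multimodal data.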