Anomaly Detection

We will now briefly describe a sample log data and then discuss how to make data structures. Next, we will extract the features from the data to feed into machine learning model. Finally, we will discuss the results obtained by using auto encoder.

Data Set: Here we have considered HDFS (Hadoop Distributed File System) dataset, which has been obtained from a production Amazon EC2 system and is well suited for anomaly detection. Not only does it contain 11,175,629 log messages with 16,838 anomalies, the dataset is also pre-labeled by domain experts, which will be useful for identifying correctness of algorithms. A sample screen shot of the data is given below:

Methodology: We are following a three-step approach – log parsing, event matrix generation, and anomaly detection for finding the anomalies in a sequence of log messages. One thing to note here is that the sequence of log messages is generated by using either sliding window approach (i.e., slicing the data with three hours with one-hour stride) or detecting the identifier in the log message.

As observed in the sample data set, the raw log message contains two parts: constant part and variable part. The constant part constitutes the fixed plain text (wrote by the developer) and remains the same for every event occurrence, which can reveal the event type of the log message. The variable part carries the runtime information of interest (e.g., the IP address and port: 10.251.111.130:49851), which may vary among different event occurrences (variable part can be consider as a parameter). We will now see the three steps disused above more in details one by one:

a. Log Parsing: The goal of log parsing is to extract the event by automatically separating the constant part and variable part of a raw log message, and further transform each log message into a specific event (usually denoted by its constant part). We have used various log parsing algorithms given by [1] and [2], such as Cluster based (LKE, Logsig) and heuristic based (IPLOM, ALCT) for making the raw data structured. Below we draw a diagram of log parsing method and feature generation for anomaly detection

b. Event Matrix Generation: Once we extract events from raw log messages then we form an event count matrix M, which will be fed into the anomaly detection model. This also helps to convert the textual log information to numeric representation which gets easier for the model for feed with this kind of data. In the event count matrix, each row represents a block or time window, while each column indicates one event type. The value in cell M_ij represents how many times event j occurs on block i or time stamp i. Instead of directly detecting anomaly on M, we have used term frequency and inverse document frequency (TFIDF) method, which assign lower weights to common event types, i.e. the events which are less likely to contribute to the anomaly detection process.

c. Anomaly Detection: Finally, the event count matrix can be fed to machine learning models for training, and thus generate a model for anomaly detection. Here we can use any generic AI analytics engine (which is built for doing ML based Traditional Techniques for regression and classification task for any kind of data set) to identify whether the upcoming block contain anomalous logs or not. Apart from generic AI predictive analytics engine we can also use Deep Learning based auto encoder to detect anomalies.

Fig -4 The Analysis Methodology

Once we convert the unstructured textual data to a structured numeric data from the above process now we can make use of machine learning or deep learning model to train our model for anomaly detection. Now here we can train our model in 2 major ways:

Supervised Learning – It is a method in which models are trained using labeled data.
Unsupervised Learning – It is another method in which patterns inferred from the unlabeled input data.

Next section we will see how we can handle this in both the ways.