Modelling network traffic anomalies at large scale

Recent events have shown that benign and malicious large-scale anomalies such as flash crowds, network outages, worms, and denial-of-service attacks have the potential to disrupt critical services and infrastructures. Because detection at the network edge is not well suited to containing such large-scale attacks, several anomaly detection systems for backbone networks have been developed. These systems operate on data aggregated at the flow level, e.g., using Cisco NetFlow, since inspecting individual packets is not feasible on high-speed backbone links.
However, in contrast to the large number of anomaly detection systems proposed, the effectiveness of these systems has not been well investigated. Today, the common practice for evaluating an anomaly detection system is to show that it can detect a few specific anomalies in one or very few traffic traces. Such simplistic evaluations, however, cannot answer more general questions such as: Which metric is best for detecting a specific anomaly? Which components of normal traffic are responsible for false positives? What is the penalty of increasing an anomaly detection system's sensitivity?
We argue that answering these questions requires systematic evaluations based on benchmark traffic traces. Unfortunately, such benchmark traces are not available today, either to research or to industry. Privacy concerns of network providers and their customers usually prevent collected flow traces from being released to the research community. Consequently, only very few traces are available, and these come from research networks. The main problems with these few available traces are i) their limited diversity regarding network traffic characteristics and available anomalies, and ii) that traffic characteristics in research networks do not necessarily reflect the characteristics of real networks. Sound research, however, demands diverse traffic traces: traces from networks with varying traffic characteristics that contain a wide range of anomalies of different types and intensities.
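To make the sensitivity question above concrete, the following minimal sketch shows how a labelled benchmark trace could be used to quantify the penalty of a lower detection threshold. The anomaly scores, the ground-truth labels, and the function name `rates_at_threshold` are hypothetical and serve only as an illustration; they are not part of any existing detection system.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (true positive rate, false positive rate) for one threshold.

    scores: one anomaly score per time interval (hypothetical detector output)
    labels: True if the interval contains an anomaly in the benchmark trace
    """
    tp = fp = fn = tn = 0
    for score, is_anomaly in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_anomaly:
            tp += 1
        elif flagged and not is_anomaly:
            fp += 1
        elif not flagged and is_anomaly:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


# Illustrative data only: six intervals, three of them anomalous.
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [False, False, True, True, True, False]

for threshold in (0.5, 0.3):
    tpr, fpr = rates_at_threshold(scores, labels, threshold)
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

In this toy data, lowering the threshold from 0.5 to 0.3 detects the remaining anomaly but also flags one normal interval. A sweep of this kind is only meaningful when ground truth is known, which is what benchmark traces would provide.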
Today, research focuses mainly on one approach to the problem of traffic trace unavailability: modifying existing traces to achieve a higher degree of diversity. Anomalies are either injected into existing traces of normal traffic, or anomalies already present in existing traces are amplified. However, the degree of diversity gained by this kind of modification is very limited. Anonymization of existing traces would complement the modification approach well, since it can solve the privacy issues with existing traces. However, anonymization removes not only privacy-sensitive information from the traces but also information that is important and valuable for research. Optimizing this anonymization trade-off is still an open research issue.
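The injection variant of the modification approach can be sketched as a timestamp-ordered merge of a normal trace with synthetic anomaly flows. The record layout `(start, src, dst, dport, label)`, the helper names `make_scan` and `inject`, and all addresses and ports below are assumptions chosen for illustration, not a format used by any particular tool.

```python
import heapq


def make_scan(start, n_targets, interval, src="10.0.0.99", dport=445):
    """Build a hypothetical scan: one source contacting many destinations."""
    return [
        (start + i * interval, src, f"192.0.2.{i % 254 + 1}", dport, "anomaly")
        for i in range(n_targets)
    ]


def inject(normal, anomaly):
    """Merge two time-sorted flow lists into one time-sorted trace."""
    return list(heapq.merge(normal, anomaly, key=lambda rec: rec[0]))


# Illustrative normal traffic, already sorted by start time.
normal = [
    (0.0, "10.0.0.1", "198.51.100.7", 80, "normal"),
    (1.0, "10.0.0.2", "198.51.100.7", 443, "normal"),
    (2.0, "10.0.0.1", "198.51.100.9", 53, "normal"),
]

trace = inject(normal, make_scan(start=0.5, n_targets=3, interval=0.5))
for record in trace:
    print(record)
```

The sketch also shows why the diversity gained this way is limited: the background traffic remains exactly that of the original trace, and only the injected part varies.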
In the framework of the INTERSECTION project we will take a different approach and generate diverse, synthetic network traces that contain specific anomalies. Our approach is to develop a constructive model for normal and anomalous backbone traffic at the flow level. Network traffic can be modelled at three levels: sessions, flows, and packets. We decided to model traffic at the flow level since our goal is to generate synthetic traces for backbone anomaly detection and mitigation systems, which usually work at the flow level (e.g., using Cisco NetFlow). Moreover, hardly any packet trace data is available at large scale, both for privacy reasons and because of limited storage capacities (the longest available packet traces span days or at most one week).
The two most common forms of data models are descriptive and constructive models. Descriptive models are essentially a compact summary of a set of measurements. Constructive models, on the other hand, aim at finding a simplified representation of a real system in order to concisely characterize the system's output, i.e., the observed dataset. Both approaches have their advantages and drawbacks. While descriptive models usually match the data better, they cannot adapt to changes in the system and are not as easily interpretable as constructive models. Constructive models are easier to interpret, but they usually have many parameters and do not match real measurements as well as descriptive models. To balance this trade-off, we will combine both modelling approaches.
In a first step, we will derive a constructive Internet model by dividing the complex backbone network system into smaller, more intuitive subsystems (clusters). These subsystems are categories of distinct host or application behaviour such as human-controlled clients, servers, or virus-infected hosts. In a second step, we will develop descriptive models that describe the behaviour of the hosts in each individual cluster using probabilistic models. In a third step, the complete system, i.e., the network traffic observed at a backbone router, is derived as the superposition of the flows generated by the subsystems. This last step also involves estimating the parameters for the complete network, i.e., how many subsystems of each category are generating traffic, and how frequently subsystems appear, disappear, or change from one category to another. In order to meet the requirements of model parsimony and generality, we intend to describe each identified subsystem with a fixed probability distribution.
To derive these distributions we will rely on the SWITCH flow traces, and on data from other networks if available. To estimate the parameters of the complete system model (how many subsystems of each category contribute to the overall traffic, and how frequently subsystems change categories, appear, and disappear under normal and anomalous conditions) we will analyze the SWITCH traces and other available backbone traces.
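The three modelling steps described above can be illustrated with a minimal sketch: a fixed set of host categories, one simple probabilistic flow model per category, and the backbone trace as the time-ordered superposition of all hosts' flows. The category names, the exponential distributions, and every numeric parameter below are placeholder assumptions; the project intends to derive the actual distributions and parameters from measured traces.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    start: float    # flow start time in seconds
    src: str        # hypothetical host identifier
    category: str   # host category (cluster) that produced the flow
    packets: int    # number of packets in the flow


# Placeholder parameters per category:
# (mean flows per second per host, approximate mean packets per flow)
CATEGORY_PARAMS = {
    "client": (0.5, 20.0),
    "server": (2.0, 50.0),
    "infected": (10.0, 2.0),
}


def host_flows(rng, host, category, duration):
    """Descriptive model of one host: Poisson flow arrivals (assumed)."""
    rate, mean_packets = CATEGORY_PARAMS[category]
    flows = []
    t = rng.expovariate(rate)
    while t < duration:
        # At least one packet per flow, exponentially distributed size.
        packets = 1 + int(rng.expovariate(1.0 / mean_packets))
        flows.append(Flow(t, host, category, packets))
        t += rng.expovariate(rate)
    return flows


def generate_trace(population, duration, seed=0):
    """Constructive model: superpose the flows of all hosts, sorted by time."""
    rng = random.Random(seed)
    trace = []
    for category, count in population.items():
        for i in range(count):
            trace.extend(host_flows(rng, f"{category}-{i}", category, duration))
    trace.sort(key=lambda flow: flow.start)
    return trace


# Example: a small population with one infected host, 10 seconds of traffic.
trace = generate_trace({"client": 3, "server": 1, "infected": 1}, duration=10.0)
print(len(trace), "flows generated")
```

In this sketch the population is static. The appearance, disappearance, and category changes of subsystems mentioned above would require an additional model of how the `population` mapping evolves over time.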
Finally, a detailed design and implementation plan for a trace generation system will be developed during the project. The flow trace generation framework will incorporate the descriptive host behaviour models as well as the constructive backbone network model. We envision a scalable system that can generate traces in a reasonable time, which means that the number of operations required per generated flow must be kept as small as possible. The ultimate goal is to make the generated traces available to the research community.

 
