Modelling network traffic anomalies at large scale
Recent events have shown that benign and malicious large-scale
anomalies such as flash crowds, network outages, worms, and
denial-of-service attacks have the potential to disrupt critical
services and infrastructures. Motivated by the observation that
detection at the network edge is not well-suited for containing such
large-scale attacks, several anomaly detection systems for backbone
networks have been developed. These systems operate on data aggregated
at the flow level, e.g., using Cisco NetFlow, since inspecting
individual packets is not feasible on high-speed backbone links.
However, in contrast to the large number of anomaly detection systems
proposed, the effectiveness of these systems has not been well
investigated. Today, common practice for evaluating anomaly detection
systems is to show that a system is capable of detecting a few
specific anomalies using one or very few traffic traces. Such
simplistic evaluations, however, cannot answer more general questions
such as: Which metric is best for detecting a specific
anomaly? Which components of normal traffic are responsible for false
positives? What is the penalty of increasing an anomaly detection
system's sensitivity?
We argue that in order to answer these questions, systematic
evaluations based on benchmark traffic traces are required.
Unfortunately, such benchmark traffic traces are not available today,
either to research or to industry. Usually, privacy concerns of network
providers and their customers prevent widely collected flow traces
from being released to the research community. Consequently, only
very few traces are available, and these come from research networks.
The main problems with these few available traces are i) their limited
diversity regarding network traffic characteristics and available
anomalies, and ii) that traffic characteristics in research networks do
not necessarily reflect the characteristics of real networks. Sound
research, however,
demands diverse traffic traces - traces from networks with varying
traffic characteristics that contain a wide range of anomalies of
different types and intensities.
Today, research focuses mainly on one approach to solve the problem of
traffic trace unavailability: modification of existing traces to
achieve a higher degree of diversity. Anomalies are either injected
into existing traces of normal traffic, or anomalies already present
in existing traces are amplified. However, the degree of diversity
gained by this kind of
modification is very limited. Anonymization of existing traces would
nicely complement the modification approach since it can solve the
privacy issues with existing traces. However, the problem with
anonymization is that it not only removes privacy-sensitive information
from the traces but also information which is important and valuable
for research. Optimizing this anonymization trade-off is still an open
research issue.
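To make the injection variant of trace modification concrete, the following is a minimal Python sketch. It is not taken from any existing system: the dictionary-based flow record, its field names, the scan-like anomaly, and the function name inject_scan are all assumptions chosen for illustration. It also shows why injection is attractive for evaluation: every injected flow carries a ground-truth label.

```python
import random


def inject_scan(flows, start, duration, n_flows, attacker, seed=0):
    """Return a new time-ordered trace with synthetic scan flows merged in.

    `flows` is a list of dicts with at least a "start" key (hypothetical
    record layout, not the NetFlow export format). The input is not modified.
    """
    rng = random.Random(seed)
    injected = [
        {
            "start": start + rng.uniform(0, duration),
            "src": attacker,
            # One small probe per randomly chosen target address.
            "dst": f"10.0.{rng.randrange(256)}.{rng.randrange(256)}",
            "dst_port": 445,
            "packets": 1,
            "bytes": 48,
            "label": "anomaly",
        }
        for _ in range(n_flows)
    ]
    # Copy the original flows and mark them as normal unless already labelled.
    labelled = [dict(f, label=f.get("label", "normal")) for f in flows]
    return sorted(labelled + injected, key=lambda f: f["start"])
```

The background traffic is unchanged by this kind of injection, which illustrates the limitation noted above: only the anomaly varies, while the characteristics of the normal traffic stay those of the original trace.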
In the framework of the INTERSECTION project we will take a different
approach to generate diverse, synthetic network traces with certain
anomalies. Our approach is to develop a constructive model for normal
and anomalous backbone traffic at the flow level. Network traffic can
be modelled at three levels: sessions, flows, and packets. We decided
to model traffic at the flow level since our goal is to generate
synthetic traces for backbone anomaly detection and mitigation
systems, which usually work at the flow level (e.g., using Cisco
NetFlow). Moreover, hardly any large-scale packet trace data is
available, for privacy reasons and because of limited storage
capacities (the longest available packet traces span days or at most
one week).
The two most common forms of data models are descriptive and
constructive models. Descriptive models are essentially a compact
summary of a set of measurements. Constructive models, on the other
hand, aim at finding a simplified representation of real systems in
order to concisely characterize the systems' output, i.e., the observed
dataset.
Both approaches have their advantages and drawbacks. While descriptive
models usually provide a better match of the data, they cannot adapt to
changes in the system and are not as easily interpretable as
constructive models. Constructive models, on the other hand, are easier
to interpret but they usually have many parameters and do not match
real measurements as well as descriptive models. To balance this
trade-off between constructive and descriptive modelling, we will use
a combination of both modelling approaches. In a first step, we will
derive a constructive Internet
model by dividing the complex backbone network system into smaller,
more intuitive subsystems (clusters). These subsystems are categories
of distinct host or application behaviour such as human-controlled
clients, servers, or virus-infected hosts. In a second step, we will
develop descriptive models that describe the behaviour of hosts in each
individual cluster using probabilistic models. In a third step, the
complete system, i.e., the observed network traffic at a backbone
router, is derived as the superposition of flows generated by the
subsystems. This last step also involves estimating the parameters for
the complete network, i.e., how many subsystems of each category are
generating traffic, and how frequently subsystems appear, disappear,
or change from one category to another. In order to meet the
requirements of model parsimony and generality, we intend to describe
each identified subsystem with a fixed probability distribution. To
derive these distributions we will rely on the SWITCH flow traces, and
on data from other networks if available. To estimate the parameters
of the complex system model (how many subsystems of each category
contribute to the overall traffic, and how frequently subsystems
change categories, appear, and disappear under normal and anomalous
conditions), we will analyze the SWITCH traces and other available
backbone traces.
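The three steps above can be illustrated with a small Python sketch. It is only a toy under stated assumptions, not the project's model: the category names follow the examples in the text, but the choice of exponential inter-arrival times and flow sizes, the numeric parameters in CATEGORIES, and the names Flow, host_flows, and generate_trace are all hypothetical. Hosts also stay in one category for the whole trace, so the appearance, disappearance, and category changes described above are not modelled.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    start: float      # seconds since the start of the trace
    host_id: int      # synthetic identifier of the generating host
    category: str     # cluster the host belongs to
    size_bytes: int


# Hypothetical per-category parameters (step 2):
# (mean flows per second per host, mean flow size in bytes).
CATEGORIES = {
    "client": (0.5, 20_000),
    "server": (5.0, 5_000),
    "infected": (50.0, 200),
}


def host_flows(rng, host_id, category, duration):
    """Draw the flows of one host from its category's fixed distributions."""
    rate, mean_bytes = CATEGORIES[category]
    flows = []
    t = rng.expovariate(rate)  # exponential inter-arrival times (assumption)
    while t < duration:
        size = max(1, round(rng.expovariate(1.0 / mean_bytes)))
        flows.append(Flow(t, host_id, category, size))
        t += rng.expovariate(rate)
    return flows


def generate_trace(population, duration, seed=0):
    """Superpose the flows of all hosts (step 3).

    `population` maps a category name to the number of hosts in it.
    """
    rng = random.Random(seed)
    trace = []
    host_id = 0
    for category, count in population.items():
        for _ in range(count):
            trace.extend(host_flows(rng, host_id, category, duration))
            host_id += 1
    trace.sort(key=lambda f: f.start)  # time-ordered view at the router
    return trace
```

For example, generate_trace({"client": 20, "server": 2, "infected": 1}, 60.0) yields one minute of time-ordered flows. Changing the population counts is the kind of network-level parameter the last step is meant to estimate from real traces.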
Finally, a detailed design and implementation plan for a trace
generation system will be developed during the project. The flow trace
generation framework will incorporate the descriptive host behaviour
models as well as the constructive backbone network model. We envision
a scalable system that can generate traces in a reasonable time. This
requires keeping the number of operations per generated flow as low
as possible. The ultimate goal is to make the generated
traces available to the research community.




