Grammar-based adaptable parsing of intrusion detection data

One of the objectives of the INTERSECTION project is to design and develop a set of high-performance adaptable parsers that allow real-time stream processing of data, so that intrusions can be detected promptly.

A recurring issue in intrusion detection is the extraction of information from textual sources, usually log files such as Web server logs [60]. Most of these data sources have an application-dependent format, typically fixed- or variable-column ASCII, or structured log elements expressed in languages such as XML. Defining a standard format for such log elements is not a viable solution, especially when dealing with already deployed systems. A more promising approach is to exploit the highly structured organization of text produced by a computer system, which makes it possible to automate the information extraction process.

A first important step in processing data streams and extracting information from them is to convert the raw information flow (e.g. a stream of characters) into a sequence of lexical units related to a known dictionary. Although a number of different approaches are available to accomplish this task, the most popular one is based on lexical analyzers, or lexers. Alternatives to lexical analyzers include text zoners, segmenters, language guessers, stemmers and lemmatizers [61]. The lexical analyzer produces a sequence of tokens that is forwarded as input to another module, the parser. Parsing a sequence of tokens means computing the structural description that a grammar formally assigns to the sequence, assuming, of course, that the sequence is well-formed.

Grammars play a fundamental role in the definition of formal languages and in the automatic manipulation of formally specified documents, such as computer programs. A limited subset of their potential can be exploited to define log formats and automate their manipulation. This approach retains the well-established theoretical foundation upon which grammar tools and frameworks are built, along with a number of associated advantages, including:

–      a very large degree of expressiveness

–      the availability of well-known tools for the automatic processing of grammar-based artifacts

–      a high level of generality and technology-independence, which decouples the format definition from the underlying technology used for data processing.
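The lexer-then-parser pipeline described above can be illustrated with a minimal sketch. It assumes an Apache-style access log line; the token names, the `lex` and `parse` functions and the sample line are hypothetical and chosen only for illustration, not taken from the INTERSECTION parsers.

```python
import re
from typing import Dict, Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# Hypothetical token dictionary for an Apache-style access log line.
# Order matters: more specific patterns come first.
TOKEN_SPEC = [
    ("IP", r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("TIME", r"\[[^\]]*\]"),
    ("QUOTED", r'"[^"]*"'),
    ("NUMBER", r"\d+"),
    ("DASH", r"-"),
    ("WORD", r'[^\s"\[\]]+'),
    ("SPACE", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{k}>{p})" for k, p in TOKEN_SPEC))


def lex(line: str) -> Iterator[Token]:
    """Convert a raw character stream into a sequence of tokens."""
    pos = 0
    while pos < len(line):
        m = MASTER.match(line, pos)
        if m is None:
            raise ValueError(f"unexpected character at column {pos}")
        pos = m.end()
        if m.lastgroup != "SPACE":  # whitespace only separates tokens
            yield Token(m.lastgroup, m.group())


def parse(tokens: Iterable[Token]) -> Dict[str, object]:
    """Check the token sequence against the rule
    line := IP id id TIME QUOTED NUMBER size
    and return the structured record it describes."""
    toks = list(tokens)
    pos = 0

    def expect(*kinds: str) -> str:
        nonlocal pos
        if pos >= len(toks) or toks[pos].kind not in kinds:
            found = toks[pos].kind if pos < len(toks) else "end of line"
            raise ValueError(f"expected {' or '.join(kinds)}, found {found}")
        text = toks[pos].text
        pos += 1
        return text

    record = {
        "host": expect("IP"),
        "ident": expect("DASH", "WORD"),
        "user": expect("DASH", "WORD"),
        "time": expect("TIME").strip("[]"),
        "request": expect("QUOTED").strip('"'),
        "status": int(expect("NUMBER")),
        "size": expect("NUMBER", "DASH"),
    }
    if pos != len(toks):
        raise ValueError("unexpected trailing tokens")
    return record


SAMPLE = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
record = parse(lex(SAMPLE))
print(record["host"], record["status"], record["request"])
```

A line that is not well-formed with respect to the rule makes `parse` raise a `ValueError`, which is one simple way to surface malformed or unexpected log entries.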

Once a grammar has been defined for a specific log format, a compiler-compiler can be used to automatically generate a parser for that format. This has the fundamental advantage that adding support for a new log format only requires writing a new, generally simple, grammar for it. INTERSECTION partners have already carried out valuable research in this field [62].
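The idea that a new log format only needs a new grammar can be sketched as follows. This is a toy stand-in for a real compiler-compiler, not the tool used in the project: the `build_parser` function, the `TERMINALS` table and both example formats are assumptions made for illustration, and the grammar notation is restricted to flat sequences of literals and named fields, whereas an actual parser generator would accept full context-free grammars.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Terminal classes shared by every grammar (hypothetical names).
TERMINALS = {
    "ip": r"\d{1,3}(?:\.\d{1,3}){3}",
    "int": r"\d+",
    "word": r"\S+",
    "text": r".*",
}

# A grammar is a sequence of ("lit", text) and ("field", "name:terminal") items.
Grammar = List[Tuple[str, str]]
Parser = Callable[[str], Optional[Dict[str, str]]]


def build_parser(grammar: Grammar) -> Parser:
    """Generate a parser function from a declarative grammar."""
    parts = []
    for kind, value in grammar:
        if kind == "lit":
            parts.append(re.escape(value))
        elif kind == "field":
            name, terminal = value.split(":")
            parts.append(f"(?P<{name}>{TERMINALS[terminal]})")
        else:
            raise ValueError(f"unknown grammar item: {kind}")
    pattern = re.compile("".join(parts))

    def parse(line: str) -> Optional[Dict[str, str]]:
        # Return the extracted fields, or None if the line is not well-formed.
        m = pattern.fullmatch(line)
        return m.groupdict() if m else None

    return parse


# Supporting a format means writing only its grammar (both are made up).
FIREWALL: Grammar = [
    ("field", "action:word"),
    ("lit", " src="), ("field", "src:ip"),
    ("lit", " dport="), ("field", "dport:int"),
]
AUTH: Grammar = [
    ("lit", "user="), ("field", "user:word"),
    ("lit", " result="), ("field", "result:word"),
    ("lit", " msg="), ("field", "msg:text"),
]

parse_firewall = build_parser(FIREWALL)
parse_auth = build_parser(AUTH)

print(parse_firewall("DROP src=10.0.0.5 dport=22"))
print(parse_auth("user=root result=FAIL msg=bad password"))
```

The generic engine (`build_parser`) is written once; each additional format is a few lines of data, which mirrors the decoupling of format definition from processing technology described above.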

 

References

[60] B. Krishnamurthy and J. Wang, On network-aware clustering of web clients, in ACM Proceedings of SIGCOMM 2000, 2000

[61] J. Turmo, A. Ageno, and N. Catala, Adaptive Information Extraction, ACM Computing Surveys, vol. 38, no. 2, 2006

[62] F. Campanile, A. Cilardo, L. Coppolino, L. Romano, Adaptable Parsing of Real-Time Data Streams, Proceedings of 15th Euromicro International Conference on Parallel, Distributed, and Network-based Processing (PDP 2007), Feb. 2007

 
