As promised on the mailing list, here comes a short description of the new db-parser functionality of syslog-ng. For an introduction to parsers in general, see my previous blog post here.
The aim of db-parser is twofold:
- extract interesting information from a log message
- attach tags to a log message for later classification.
Consider the following log message:
Feb 24 11:55:22 bzorp sshd[4376]: Accepted password for bazsi \
from 10.50.0.247 port 42156 ssh2
This message states that a user named "bazsi" has logged into the host named "bzorp" using SSH2 from the quoted IP address and port. To a human reader, the event that happened is perfectly clear. A piece of software, however, first has to identify the event (i.e. that a user login has happened) and the additional information associated with the event (e.g. that the client address was 10.50.0.247).
If I wanted to express this as name-value pairs, it would be something like this:
event="user login", protocol="ssh2", \
client="10.50.0.247:42156", method="password"
Surely this latter form is easier to analyze than the first. So the first step of any kind of log analysis is to extract information from messages. At first glance, the easiest way to extract this information is to use regular expressions. For example:
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ sshd\[[0-9]+\]: Accepted \
(gssapi(-with-mic|-keyex)?|rsa|dsa|password|publickey|keyboard-interactive/pam) \
for [^[:space:]]+ from [^[:space:]]+ port [0-9]+( (ssh|ssh2))?
Once you match with the regular expression above (courtesy of the logcheck project), the parentheses mark the variable parts of the message, which you can reference as $1, $2 and so on.
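To make the group referencing concrete, here is a minimal Python sketch. It is not the logcheck expression itself: Python's re module has no POSIX classes such as [:alnum:], so those are replaced, and the extra capture groups for user, client and port are assumptions added for illustration.

```python
import re

# Python-compatible adaptation of the logcheck expression above.
# Assumptions: POSIX classes are rewritten, and capture groups for
# user, client address and port are added for illustration only.
SSHD_LOGIN = re.compile(
    r"^\w{3} [ :0-9]{11} [._A-Za-z0-9-]+ sshd\[[0-9]+\]: Accepted "
    r"(gssapi(?:-with-mic|-keyex)?|rsa|dsa|password|publickey|keyboard-interactive/pam) "
    r"for (\S+) from (\S+) port ([0-9]+)(?: (ssh2?))?$"
)

line = ("Feb 24 11:55:22 bzorp sshd[4376]: Accepted password for bazsi "
        "from 10.50.0.247 port 42156 ssh2")

m = SSHD_LOGIN.match(line)
# Each pair of capturing parentheses becomes one positional group,
# the equivalent of $1, $2, ... in other regexp dialects.
method, user, client, port, protocol = m.groups()
print(method, user, client, port, protocol)
```

Note that the meaning of each group is implied only by its position, which is part of what makes such expressions hard to maintain.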
The problems with regular expressions are severalfold:
- they are difficult to write (just look at the example above)
- they are even more difficult to understand, once written (again, please look at the example)
- they are slow and they scale poorly with the number of regexps that we need to match against the incoming message stream.
Clearly, a different approach is needed, and that is what db-parser in syslog-ng provides.
The db-parser() functionality of syslog-ng has the following objectives:
- use a database to match various messages (and not filters embedded in the configuration file)
- classify events into logcheck-like classes (cracking, violation, ignore, unknown)
- extract variable information from messages, and place those into name-value pairs
- be fast, scale to a high number of events/sec and high number of patterns
- integrate well with the rest of syslog-ng
In the configuration file, db-parser() is referenced like any other parser:
...
parser p_db { db-parser(); };
...
log { source(src); parser(p_db); destination(d_parsed); };
...
The database used by db-parser is an XML file that is read during syslog-ng startup. Here is an example entry from the db-parser() database:
<patterndb>
<ruleset name='sshd'>
<pattern>sshd</pattern>
<rules>
<rule provider='balabit' id='1' class='system'>
<patterns>
<pattern>Accepted rsa for@QSTRING:username: @from\
@QSTRING:client_addr: @port @NUMBER:port:@ ssh2</pattern>
</patterns>
</rule>
...
</rules>
</ruleset>
</patterndb>
As you can see, the database is structured, and the first selection criterion is the name of the application (i.e. the value of $PROGRAM). Each rule is then matched against the message payload (i.e. the value of $MESSAGE) with the syslog header stripped off. The rule specifies the classification ('system' in the example above) and lists one or more patterns. If any of the patterns matches, the rule is considered a match.
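The structure described above can be explored with a short Python sketch. This is only an illustration of the XML layout, not how syslog-ng loads the file; the load_rules helper is hypothetical, and the sketch assumes that the backslash in the listing is just a line wrap, so the pattern is joined into a single line here.

```python
import xml.etree.ElementTree as ET

# The example entry from above; the wrapped pattern is joined into one line
# (assumption: the backslash was only a line-wrap marker in the listing).
PATTERNDB = """\
<patterndb>
  <ruleset name='sshd'>
    <pattern>sshd</pattern>
    <rules>
      <rule provider='balabit' id='1' class='system'>
        <patterns>
          <pattern>Accepted rsa for@QSTRING:username: @from@QSTRING:client_addr: @port @NUMBER:port:@ ssh2</pattern>
        </patterns>
      </rule>
    </rules>
  </ruleset>
</patterndb>
"""

def load_rules(xml_text):
    """Hypothetical helper: return {program: [(rule_id, class, [patterns])]}."""
    db = {}
    for ruleset in ET.fromstring(xml_text).findall("ruleset"):
        # The ruleset-level <pattern> is the program name ($PROGRAM).
        program = ruleset.findtext("pattern")
        rules = db.setdefault(program, [])
        for rule in ruleset.findall("rules/rule"):
            # The rule-level patterns are matched against $MESSAGE.
            patterns = [p.text for p in rule.findall("patterns/pattern")]
            rules.append((rule.get("id"), rule.get("class"), patterns))
    return db

rules_by_program = load_rules(PATTERNDB)
for program, rules in rules_by_program.items():
    for rule_id, rule_class, patterns in rules:
        print(program, rule_id, rule_class, len(patterns))
```

Grouping rules by program first means only the rules of the matching application have to be considered for a given message.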
The variable parts of the pattern are specified using special sequences that start and end with a '@' character. Between the enclosing '@' characters is a colon-separated list of parameters:
- the parser to apply (QSTRING and NUMBER in the example above)
- the name of the value to be extracted from this position
- additional arguments to be passed to the parser
The following parsers are available (you can find them in the radix.c source file):
- IPv4: to parse an IPv4 address
- NUMBER: to parse a number
- STRING: to parse a word
- ESTRING: to parse a sequence of characters ending with a specific character
- QSTRING: to parse a string enclosed within quotes
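To show how the '@' sequences and parsers fit together, here is a toy matcher in Python. It is a simplified sketch, not syslog-ng's radix implementation: the tokenize and match functions are hypothetical, only NUMBER, QSTRING and ESTRING are covered, and their exact semantics (for example that QSTRING uses the same character as opening and closing quote, or that the whole message must be consumed) are assumptions.

```python
def tokenize(pattern):
    """Split a pattern into ('lit', text) and ('var', parser, name, arg) tokens.

    Assumption: the pattern contains no literal '@' characters.
    """
    tokens = []
    for i, part in enumerate(pattern.split("@")):
        if i % 2 == 0:
            if part:
                tokens.append(("lit", part))
        else:
            # parser:name:arg, where name and arg may be missing
            parser, name, arg = (part.split(":", 2) + ["", ""])[:3]
            tokens.append(("var", parser, name, arg))
    return tokens

def match(pattern, message):
    """Return the extracted values as a dict, or None if there is no match."""
    pos, values = 0, {}
    for token in tokenize(pattern):
        if token[0] == "lit":
            if not message.startswith(token[1], pos):
                return None
            pos += len(token[1])
            continue
        _, parser, name, arg = token
        if parser == "NUMBER":
            end = pos
            while end < len(message) and message[end].isdigit():
                end += 1
            if end == pos:
                return None
            value, next_pos = message[pos:end], end
        elif parser == "QSTRING":
            # Assumption: arg is the quote character, used to open and close.
            quote = arg[:1]
            if not quote or message[pos:pos + 1] != quote:
                return None
            end = message.find(quote, pos + 1)
            if end == -1:
                return None
            value, next_pos = message[pos + 1:end], end + 1
        elif parser == "ESTRING":
            # Assumption: arg is the terminator, consumed but not stored.
            end = message.find(arg, pos) if arg else -1
            if end == -1:
                return None
            value, next_pos = message[pos:end], end + len(arg)
        else:
            raise ValueError("parser not covered by this sketch: " + parser)
        values[name] = value
        pos = next_pos
    # Assumption: the pattern has to cover the whole message.
    return values if pos == len(message) else None

PATTERN = ("Accepted rsa for@QSTRING:username: @from"
           "@QSTRING:client_addr: @port @NUMBER:port:@ ssh2")
result = match(PATTERN, "Accepted rsa for bazsi from 10.50.0.247 port 42156 ssh2")
print(result)
```

Unlike the regular expression approach, every extracted value carries a name, and each parser only has to scan forward once over its part of the message.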
If a message matches a rule, db-parser() defines the following values for that message:
- .classifier.class: logcheck-like classification
- .classifier.rule_id: the ID of the database entry that matched
- pattern-specific values: the variable parts extracted from the message by the patterns
# You can use them in a filter:
filter f_class {
match("system" value(".classifier.class"));
};
# but you can also use them in the names of files:
destination d_parsed {
file("/var/log/messages/${.classifier.class}.log");
};
That's a rough skeleton of what db-parser() is. If you are interested, you can find the db-parser() implementation in syslog-ng OSE 3.0:
http://www.balabit.com/network-security/syslog-ng/opensource-logging-system/
You can also find some example pattern databases here:
http://www.balabit.com/downloads/files/patterndb/
We are also thinking about further ideas to enhance db-parser() and make it the foundation of an Open Source log analysis framework. Stay tuned!
Comments
You can get the sample code from older posts by selecting the sample text with the mouse and then copy/pasting it.
Thanks!
@QSTRING:username: @
* QSTRING is the name of the parser to use
* username is the name of the value to be stored
* the last field, after the second colon, specifies further arguments to the parser; in this case, it says to use ' ' as the delimiter
I was excited about this post up until the following line:
The database used by db-parser is an XML file that is read during syslog-ng startup.
Seriously? XML is a great (well, okay) format for exchanging information between *programs*, but it's a rotten format for humans to either read or generate. You thought reading a regular expression with lots of parentheses was tough? Look at all the XML markup you've got for *one line* of configuration! That's just crazy! You've already got a reasonable configuration parser for the syslog-ng configuration file; why not just use that? The amount of markup vs. content is so much lower -- which means fewer chances for syntax errors, and configuration code that is easier to read.
Imagine how much more legible the rule database would be if it looked like:
patterns {
ruleset "sshd" {
pattern "sshd" {
rule("Accepted rsa for...");
rule("Accepted password for...");
};
};
};
(And pardon the formatting...Blogger doesn't like the <pre> tag.)
I think this article is a nice summary of the issues involved in making XML your user-facing configuration language.
If you look at the OSE 3.1 branch, you can see a program called "pdbtool", which will, for example, be able to merge such databases from multiple sources.
It might be possible to define a more human-friendly alternate syntax for the patterndb and implement support for converting the result into XML, but I have not had time for this lately.
See the latest pdbtool work in Marci's branch (there might be some stuff I haven't integrated yet as I was spending my summer holiday the last two weeks):
http://git.balabit.hu/marci/syslog-ng-3.1
Marci's blog can be found here:
http://marci.blogs.balabit.com/