Tuesday, March 03, 2009

An introduction to db-parser()

As promised on the mailing list, here is a short description of the new db-parser() functionality of syslog-ng. For an introduction to parsers in general, see my previous blog post.

The aim for db-parser is two-fold:
  • extract interesting information from a log message
  • attach tags to a log message for later classification.
For instance, here's a log sample (lines broken for readability):

Feb 24 11:55:22 bzorp sshd[4376]: Accepted password for bazsi \
from 10.50.0.247 port 42156 ssh2


This message states that a user named "bazsi" has logged into the host named "bzorp" using SSH2 from the quoted IP address and port. To a human reader, the event that happened is perfectly clear. However, if a piece of software has to make out the meaning of the message, it needs to identify the event (i.e. that a user login has happened) and the additional information associated with the event (e.g. that the user connected from 10.50.0.247).

If I wanted to express this as name-value pairs, it would be something like this:

event="user login", protocol="ssh2", \
client="10.50.0.247:42156", method="password"

Surely this latter form is easier to analyze than the first. So the first step of all kinds of log analysis is to extract information from messages. At first glance, the easiest way to extract this information is to use regular expressions. For example:

^\w{3} [ :0-9]{11} [._[:alnum:]-]+ sshd\[[0-9]+\]: Accepted \
(gssapi(-with-mic|-keyex)?|rsa|dsa|password|publickey|keyboard-interactive/pam) \
for [^[:space:]]+ from [^[:space:]]+ port [0-9]+( (ssh|ssh2))?

Once you match with the regular expression above (courtesy of the logcheck project), the parentheses mark the variable parts of the message, which you can reference as $1, $2 and so on.
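To make the extraction step concrete, here is a minimal Python sketch. It is not the logcheck regexp: Python's re module has no POSIX classes such as [[:alnum:]], so the pattern is a simplified stand-in, and it uses named groups rather than $1, $2. The group names (method, username and so on) are chosen for illustration only.

```python
import re

# Simplified, illustrative pattern (an assumption, not the logcheck regexp).
ACCEPTED = re.compile(
    r"Accepted (?P<method>\S+) for (?P<username>\S+) "
    r"from (?P<client_addr>\S+) port (?P<port>\d+)(?: (?P<protocol>ssh2?))?$"
)

def extract(message):
    """Return the captured fields as a dict, or None if there is no match."""
    m = ACCEPTED.search(message)
    return m.groupdict() if m else None

line = ("Feb 24 11:55:22 bzorp sshd[4376]: Accepted password for bazsi "
        "from 10.50.0.247 port 42156 ssh2")
fields = extract(line)
```

Even this simplified version shows the trade-off: the extraction works, but the pattern is already hard to read at a glance.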

Regular expressions have several problems:
  • they are difficult to write (just look at the example above)
  • they are even more difficult to understand, once written (again, please look at the example)
  • they are slow and they scale poorly with the number of regexps that we need to match against the incoming message stream.
Projects like logcheck use regular expressions, but as the number of patterns increases, the time needed to analyze logs skyrockets, which makes the whole thing infeasible. Also, logcheck does not aim at extracting information from messages; it merely classifies them.

Clearly a different approach is needed. And that's what db-parser in syslog-ng is.

The db-parser() functionality of syslog-ng has the following objectives:
  • use a database to match various messages (and not filters embedded in the configuration file)
  • classify events into logcheck-like classes (cracking, violation, ignore, unknown)
  • extract variable information from messages, and place those into name-value pairs
  • be fast, scale to a high number of events/sec and high number of patterns
  • integrate well with the rest of syslog-ng
db-parser() is a generic parser that fits nicely into the parser framework inside syslog-ng. You can use it just like csv-parser():

...
parser p_db { db-parser(); };
...
log { source(src); parser(p_db); destination(d_parsed); };
...

The database used by db-parser is an XML file that is read during syslog-ng startup. Here is an example entry from the db-parser() database:

<patterndb>
<ruleset name='sshd'>
<pattern>sshd</pattern>
<rules>
<rule provider='balabit' id='1' class='system'>
<patterns>
<pattern>Accepted rsa for@QSTRING:username: @from\
@QSTRING:client_addr: @port @NUMBER:port:@ ssh2</pattern>
</patterns>
</rule>
...
</rules>
</ruleset>
</patterndb>



As you can see, the database is structured, and the first selection criterion is the name of the application (i.e. the value of $PROGRAM). Then each rule matches against the message payload (i.e. the value of $MESSAGE) with the syslog header stripped off. The rule specifies the classification ('system' in the example above) and lists one or more patterns. If any of the patterns match, the rule is considered a match.
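The two-level structure can be explored with a few lines of Python. This is only a sketch for inspecting such a file, not how syslog-ng loads it: the function name rules_for_program is hypothetical, the program name is compared with plain string equality as a simplifying assumption, and the pattern from the example entry is joined onto one line.

```python
import xml.etree.ElementTree as ET

PATTERNDB = """\
<patterndb>
  <ruleset name='sshd'>
    <pattern>sshd</pattern>
    <rules>
      <rule provider='balabit' id='1' class='system'>
        <patterns>
          <pattern>Accepted rsa for@QSTRING:username: @from@QSTRING:client_addr: @port @NUMBER:port:@ ssh2</pattern>
        </patterns>
      </rule>
    </rules>
  </ruleset>
</patterndb>
"""

def rules_for_program(xml_text, program):
    """Yield (rule id, class, patterns) for rulesets selected by program name."""
    root = ET.fromstring(xml_text)
    for ruleset in root.findall("ruleset"):
        # First selection step: the ruleset's own <pattern> names the program.
        if ruleset.findtext("pattern") != program:
            continue
        # Second step: each rule carries a class and one or more patterns.
        for rule in ruleset.findall("rules/rule"):
            patterns = [p.text for p in rule.findall("patterns/pattern")]
            yield rule.get("id"), rule.get("class"), patterns

found = list(rules_for_program(PATTERNDB, "sshd"))
```

Here found holds one entry for the sshd ruleset, and asking for a program with no ruleset yields nothing.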

The variable parts of a pattern are specified using special sequences that start and end with a '@' character. Between the enclosing '@' characters is a colon-separated list of parameters:
  • the parser to apply (QSTRING and NUMBER in the example above)
  • the name of the value to be extracted from this position
  • additional arguments to be passed to the parser
The available parsers are currently not really documented, but here is a list of them (you can find these in the radix.c source file):
  • IPv4: to parse an IPv4 address
  • NUMBER: to parse a number
  • STRING: to parse a word
  • ESTRING: to parse a sequence of characters ending with a specific character
  • QSTRING: to parse a string enclosed within quotes
Of course, further parsers can easily be added to the code. You don't have to specify monstrous regexps to match an IPv4 address anymore. Not to mention IPv6 :)
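To illustrate how the @PARSER:name:argument@ sequences can drive a match, here is a rough Python sketch. It is not the radix.c implementation: it walks the pattern linearly instead of using a radix tree, handles only NUMBER, ESTRING and QSTRING, requires the whole message to be consumed, and assumes that the QSTRING argument is the quote character expected on both sides of the value (a space in the example entry). All function names are hypothetical.

```python
import re

# @PARSER:name:argument@ ; the argument (and its colon) may be absent.
PARSER_REF = re.compile(r"@(\w+):([^:@]*)(?::([^@]*))?@")

def tokenize(pattern):
    """Split a pattern into ('literal', text) and ('parser', type, name, arg)."""
    tokens, pos = [], 0
    for m in PARSER_REF.finditer(pattern):
        if m.start() > pos:
            tokens.append(("literal", pattern[pos:m.start()]))
        tokens.append(("parser", m.group(1), m.group(2), m.group(3) or ""))
        pos = m.end()
    if pos < len(pattern):
        tokens.append(("literal", pattern[pos:]))
    return tokens

def parse_field(kind, arg, text, pos):
    """Try one parser at text[pos:]; return (value, new_pos) or None."""
    if kind == "NUMBER":
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        return (text[pos:end], end) if end > pos else None
    if kind == "ESTRING":
        # Everything up to the terminating string, which is consumed too.
        end = text.find(arg, pos) if arg else -1
        return (text[pos:end], end + len(arg)) if end != -1 else None
    if kind == "QSTRING":
        # Assumption: arg is the quote character on both sides of the value.
        quote = arg[:1]
        if not quote or not text.startswith(quote, pos):
            return None
        end = text.find(quote, pos + 1)
        return (text[pos + 1:end], end + 1) if end != -1 else None
    return None  # parsers not covered by this sketch never match

def match(pattern, message):
    """Return the extracted name-value pairs, or None if there is no match."""
    values, pos = {}, 0
    for token in tokenize(pattern):
        if token[0] == "literal":
            if not message.startswith(token[1], pos):
                return None
            pos += len(token[1])
        else:
            _, kind, name, arg = token
            result = parse_field(kind, arg, message, pos)
            if result is None:
                return None
            value, pos = result
            if name:
                values[name] = value
    return values if pos == len(message) else None

PATTERN = ("Accepted rsa for@QSTRING:username: @from"
           "@QSTRING:client_addr: @port @NUMBER:port:@ ssh2")
values = match(PATTERN, "Accepted rsa for bazsi from 10.50.0.247 port 42156 ssh2")
```

Under the QSTRING assumption above, this also explains why the example pattern has no literal space after "for" or around "from": the spaces are consumed as the quotes of the surrounding values.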

If a message matches a rule, db-parser() defines the following values for that message:
  • .classifier.class: logcheck-like classification
  • .classifier.rule_id: the ID of the database entry that matched
  • pattern-specific values: the variable parts extracted from the message by the patterns
Each of the values defined previously can be referenced inside syslog-ng using a macro, e.g. you can do things like:

# You can use them in a filter:
filter f_class {
match("system" value(".classifier.class"));
};

# but you can also use them in the names of files:
destination d_parsed {
file("/var/log/messages/${.classifier.class}.log");
};
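As a mental model of what happens when such a macro appears in a file name or template, here is a small Python sketch. It is not syslog-ng's template engine: the expand function is hypothetical, it only understands the $name and ${name} forms, and expanding an undefined name to an empty string is an assumption made for the sketch.

```python
import re

# ${name} allows dots and other characters; bare $name is limited to word characters.
MACRO = re.compile(r"\$\{([^}]+)\}|\$(\w+)")

def expand(template, values):
    """Replace $name and ${name} references with entries from `values`."""
    def replace(m):
        name = m.group(1) or m.group(2)
        return values.get(name, "")  # assumption: undefined names expand to ""
    return MACRO.sub(replace, template)

values = {
    ".classifier.class": "system",
    ".classifier.rule_id": "1",
    "username": "bazsi",
    "client_addr": "10.50.0.247",
}
path = expand("/var/log/messages/${.classifier.class}.log", values)
summary = expand("$username from $client_addr", values)
```

With the values above, path becomes /var/log/messages/system.log, so every class ends up in its own file.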

That's a rough skeleton of what db-parser() is. If you are interested, you can find the db-parser() implementation in syslog-ng OSE 3.0:

http://www.balabit.com/network-security/syslog-ng/opensource-logging-system/

You can also find some example pattern databases here:

http://www.balabit.com/downloads/files/patterndb/

We are also thinking about further ideas to enhance db-parser() and make it the foundation of an Open Source log analysis framework. Stay tuned!

7 comments:

Anonymous said...

Actually, other samples get truncated as well.

Bazsi said...

I broke the lines in my last post using '\' in order to make them more readable.

You can get the sample code from older posts by selecting the sample text with the mouse and then copying and pasting it.

Padoo said...

Actually I'm trying to do exactly what you described in your example above. I just can't figure out how to access the custom macros. Could you give me the exact macros I would use e.g. in a template using your example? I placed the file in /var/lib/syslog-ng/patterndb.xml. Is that correct for Ubuntu 8.10?
Thanks!

Bazsi said...

In my example you'd use $username or $client_addr, i.e. the names specified in the match part:

@QSTRING:username: @

* QSTRING is the name of the parser to use
* username is the name of the value to be stored
* the last column after the 2nd colon specifies further arguments to the parser, in this case it specifies to use ' ' as a delimiter

Padoo said...

Thanks for the reply. My problem was that I was working with the wrong files, since after the update from 2.0.x to 3.0.1 in Ubuntu the relevant files are in /opt/syslog-ng and not in /etc or /var. DB-Parser is working beautifully now!

lars-oddbit-com said...

I realize that this post is now almost 6 months old, but you've tripped a pet peeve of mine, and (a) I really like syslog-ng and (b) I hope to use these features some day.

I was excited about this post up until the following line:

The database used by db-parser is an XML file that is read during syslog-ng startup.

Seriously? XML is a great (well, okay) format for exchanging information between *programs*, but it's a rotten format for humans to either read or generate. You thought reading a regular expression with lots of parentheses was tough? Look at all the XML markup you've got for *one line* of configuration! That's just crazy! You've already got a reasonable configuration parser for the syslog-ng configuration file; why not just use that? The amount of markup vs. content is so much lower -- which means fewer chances for syntax errors, and configuration code that is easier to read.

Imagine how much more legible the rule database would be if it looked like:

patterns {
    ruleset "sshd" {
        pattern "sshd" {
            rule("Accepted rsa for...");
            rule("Accepted password for...");
        };
    };
};

(And pardon the formatting...Blogger doesn't like the <pre> tag.)

I think this article is a nice summary of the issues involved in making XML your user-facing configuration language.

Bazsi said...

You might be right about the XML file format being a bit complicated for human consumption; however, in addition to simply creating such databases, I want them to be easily extendable, parseable and validatable by programs.

If you look at the OSE 3.1 branch, you can see a program called "pdbtool", which will be able to merge such databases from multiple sources for example.

It might be possible to define a more human-friendly alternate syntax for the patterndb and implement support for converting the result into XML, but I have not had time for this lately.

See the latest pdbtool work in Marci's branch (there might be some stuff I haven't integrated yet as I was spending my summer holiday the last two weeks):

http://git.balabit.hu/marci/syslog-ng-3.1

Marci's blog can be found here:

http://marci.blogs.balabit.com/