Friday, March 31, 2006

OpenSSL problem found

I have finally tracked down the issue I was writing about in my previous blog post. It turned out to be a problem with OpenSSL on Linux 2.6 and NPTL. The default implementation of CRYPTO_thread_id() assumes that getpid() returns a unique value for each thread. With NPTL, however, all threads of a process share the same pid and only their tid (thread id) values differ.

This made OpenSSL hash the thread-specific error state of every thread to the same memory area, so one thread could overwrite memory that another thread had just freed. The result was heap corruption, which in turn caused crashes every now and then, of course with a backtrace completely unrelated to the original problem.
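To make the collision easier to picture, here is a minimal sketch in Python rather than C. It is not OpenSSL code: collect_ids() and count_state_slots() are hypothetical helper names, and the dictionary only stands in for a per-thread state table. It assumes Python 3.8 or later (for threading.get_native_id()) on a system where all threads of a process share one pid, as they do with NPTL.

```python
import os
import threading


def collect_ids(n_threads=4):
    """Run n_threads concurrently and record (pid, native thread id) for each."""
    barrier = threading.Barrier(n_threads)
    lock = threading.Lock()
    results = []

    def worker():
        ids = (os.getpid(), threading.get_native_id())
        with lock:
            results.append(ids)
        # Keep every thread alive until all have recorded their ids,
        # so the kernel cannot hand a finished thread's id to a new one.
        barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def count_state_slots(ids, key_index):
    """Simulate a per-thread state table keyed by pid (0) or thread id (1)."""
    table = {}
    for entry in ids:
        table.setdefault(entry[key_index], []).append(entry)
    return len(table)


if __name__ == "__main__":
    ids = collect_ids()
    print("slots when keyed by pid:", count_state_slots(ids, 0))
    print("slots when keyed by thread id:", count_state_slots(ids, 1))
```

Keyed by pid, all threads land in a single slot, which is the situation described above. Keyed by the native thread id, each thread gets its own slot.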

The funny part is that I already had a suspicion about this error state allocation (see yesterday's post), I just did not see the obvious reason: I was under the impression that getpid() returns a different value for each thread, just like it did with LinuxThreads. Knowing the exact reason makes the whole issue trivial :).

I have posted this as an email to openssl-dev, and I wonder what the reactions are going to be.
Take care.

Thursday, March 30, 2006

Spending time in gdb...

I have spent the last three days debugging an ugly crash in the upcoming Zorp 3.1. First I had some problems with the core files produced by Linux 2.6.12: the register values proved to be invalid, so the backtrace was even less usable than is usual with heap corruptions.

I could get access to the original register values because Zorp dumps part of its stack when a fatal signal is encountered. Using that information I could locate the stack frame of the signal handler, and luckily Linux passes a "struct sigcontext" containing the register information to each signal handler as a parameter. Nevertheless, it made analyzing the core files difficult.

After a post to the gdb mailing list it turned out to be a kernel problem rather than a gdb problem. With the help of my colleague Krisztián Kovács (of Netfilter ct_sync fame) we solved it by backporting a fix from 2.6.15, so core files are now ok.

The crash itself, however, seems to be a difficult one. I have already studied the libc malloc implementation, disassembled and annotated the _int_malloc and _int_free functions, and I'm now able to read hexdumps of heap areas fluently, but I still don't have a fix. Luckily for us, Zorp restarts itself in this situation and the scenario where the problem occurs is not used frequently.

My suspicion is that the per-thread SSL error state is the cause of the problem, as I have evidence that the freed heap block is overwritten by ERR_clear_state(). This destroys the next and prev pointers in the freed memory block, which results in the crash. The error states are supposed to be thread-specific variables, but the way they are allocated is suspicious.

I hope I can finally find this problem tomorrow.

Tuesday, March 28, 2006

Preparing syslog-ng release

I have started to prepare syslog-ng 1.6.10 for release. The tarball has already been uploaded to the website, but I have not sent an announcement to the mailing lists yet. So if you read this here, you might download a still unannounced version :)

There is nothing really important in the release: a cleanup of the documentation with several fixes, a migration from the SGML flavour of DocBook to DocBook/XML, and a new tunable called time_sleep().

The latter was worked out together with John Morrissey, who did some profiling and found that on hosts with a lot of syslog connections syslog-ng might become a bottleneck. The option does nothing but sleep for a defined amount of time, which makes syslog-ng process incoming messages in batches. This decreases the number of poll() loop iterations, which ranked high (about 67%) in the profiles generated by John.

Setting time_sleep() to about 50 ms decreased the CPU load by 80%, which is quite significant I'd say.
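The batching effect can be illustrated with a toy model in Python. This is not syslog-ng code and says nothing about real CPU load: count_wakeups() is a hypothetical function that assumes poll() returns as soon as one message is pending, that the loop then reads everything that has arrived so far, and that it pauses for time_sleep_ms before polling again.

```python
def count_wakeups(arrival_times_ms, time_sleep_ms=0):
    """Count poll() loop iterations needed to drain the given messages.

    arrival_times_ms: arrival time of each message, in milliseconds.
    time_sleep_ms: pause after each iteration (0 means no pause).
    """
    arrivals = sorted(arrival_times_ms)
    iterations = 0
    i = 0      # index of the next unread message
    now = 0    # simulated clock in milliseconds
    while i < len(arrivals):
        # poll() blocks until the next message is available.
        now = max(now, arrivals[i])
        # Read every message that has arrived by now in one go.
        while i < len(arrivals) and arrivals[i] <= now:
            i += 1
        iterations += 1
        # The pause lets further messages queue up before the next poll().
        now += time_sleep_ms
    return iterations


if __name__ == "__main__":
    arrivals = range(1000)  # one message per millisecond for one second
    print("no pause:   ", count_wakeups(arrivals, 0))
    print("50 ms pause:", count_wakeups(arrivals, 50))
```

In this model, 1000 messages arriving 1 ms apart need 1000 iterations without a pause and 21 iterations with a 50 ms pause, because each wakeup after the first one drains about 50 queued messages.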

As Rusty Russell would say, I have just received a SIGWIFE, so I'm going to bed now :)

Starting my blog

I have been considering starting my own blog for some time now and have finally done something about it. I first tried to set up blosxom, but as I did not want to spend too much time customizing it, I gave up and looked for a nice blogging website which does everything for me. This is blogger.com, and I like what I see so far.

Oops, I should have started by introducing myself: my name is Balázs Scheidler, I live in Budapest, Hungary, and I started this blog because I have some things to publish about the free software projects I am involved in, and it is trendy to have a blog anyway :).

Back to my projects: I am the author of syslog-ng, which you might know as an alternative system logging package for UNIX-based systems, and also of Zorp, an application layer gateway. You can find out more about these at http://www.balabit.com/. I also contribute patches to a couple of other projects (whenever I encounter something I don't like or which simply bugs me) and I sometimes poke into kernel development as well (generally netfilter-related work like transparent proxying support for Linux).

So far, so good. Hopefully I won't give up too soon.

Bazsi