What Is It?

ThreadNinja is a Linux library my team created that tracks pthread_create() and pthread_join() calls in an application. It prints a stacktrace where each thread is created and where it is joined. Any rogue (unjoined) threads are reported when the application exits. ThreadNinja is unobtrusive: it does NOT have to be compiled into the code. This means you can use it on applications you didn’t compile.
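
To picture the bookkeeping behind the rogue-thread report, here is a minimal sketch in C. It is not ThreadNinja's actual source: the names track_create, track_join, and report_rogue_threads, the thread_id type, and the fixed-size table are all hypothetical. A real tracker would call something like this from its intercepted pthread_create() and pthread_join() and would guard the table with a mutex.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_TRACKED 64               /* hypothetical fixed capacity */

typedef unsigned long thread_id;     /* hypothetical stand-in for pthread_t */

static thread_id live[MAX_TRACKED];  /* threads created but not yet joined */
static size_t live_count = 0;

/* Record a thread at creation time.
   Returns 0 on success, -1 if the table is full. */
int track_create(thread_id id)
{
    assert(live_count <= MAX_TRACKED);
    if (live_count == MAX_TRACKED)
        return -1;
    live[live_count++] = id;
    return 0;
}

/* Forget a thread once it has been joined.
   Returns 0 if the thread was being tracked, -1 otherwise. */
int track_join(thread_id id)
{
    for (size_t i = 0; i < live_count; i++) {
        if (live[i] == id) {
            live[i] = live[--live_count];  /* fill the hole with the last entry */
            return 0;
        }
    }
    return -1;
}

/* Print every thread that was never joined and return how many there were. */
size_t report_rogue_threads(FILE* out)
{
    for (size_t i = 0; i < live_count; i++)
        fprintf(out, "rogue thread: %lu\n", live[i]);
    return live_count;
}
```

Calling report_rogue_threads( stderr ) at exit (for example from an atexit() handler) would list whatever is still in the table. This sketch is not thread-safe as written.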

We found it useful and thought we’d share it. It’s by no means production code… just a tool. Hack on it, expand it, change it… whatever. It’s pretty small, so it should be easy to dive right in. We’ve released it under the BSD license.

Cut To The Chase

You can check out the source code from Google Code, or download the version 1.0 tarball directly (threadninja.tar.gz).

To build ThreadNinja, simply untar it and call make:
> tar -zxf threadninja.tar.gz
> make

Now, simply use LD_PRELOAD to run the application:

> LD_PRELOAD=/path/to/threadninja/build/libthreadninja.so.1 TheApplication

If you don’t see function names in the generated stack traces, the application needs to be built so that its function names are available at runtime. For my test app, I had to compile with the -rdynamic option:

> g++ -Wall -rdynamic main.cpp -lpthread

This tells the linker to add all of the executable’s global symbols to its dynamic symbol table, so the application’s function names can be looked up at runtime. For more info, look at the --export-dynamic option on the GNU linker (ld) man page.

The Story Behind ThreadNinja

My team was assigned to stabilize a large video application that runs as a Linux-based appliance. The application consisted of 100,000+ lines of code that were a tangle of build warnings, circular references, and many creative hacks. Our particular task was to fix a persistent set of seg-faults and memory leaks.

Continue reading »


Stop Stealing My File Descriptors!

On June 18, 2010, in Code Monkey, by Tom

We ran into a weird problem the other day where our Linux video display appliance would lose audio support when the process was restarted. The audio was supposed to play through a custom joystick-keyboard that was attached via USB (the keyboard is used by security guards to PTZ cameras, control monitors, etc). The audio could be heard just fine when the box first booted, but if the application restarted, audio would be lost.

Looking at the logs, we found that our audio pipeline was failing to open /dev/dsp on the restart. We then used lsof to list the open file descriptors to see which process currently held /dev/dsp:

# lsof | grep /dev/dsp
ntpd   18857    root   16u    CHR     14,3    180099 /dev/dsp

What!?!?… why the heck is NTP opening the sound device and how did it steal it from us??? After some discussion we started remembering a problem in the past with ntpd stealing our SNMP diagnostics port. This just didn’t make any sense.

Digging into our appliance code, we found this line:

system( "service ntpd restart" );

This would be called each time we were notified by the security system that the NTP server address had changed (which fired once each time the process was started so we could get the initial address). But this still didn’t explain why NTP took over ownership of our file descriptors on restart.

Long story short: system() is essentially fork() followed by an exec call. The child created by fork() gets a copy of all of the parent’s open file descriptors, and by default they stay open across the exec (i.e. the ntpd child process inherited our /dev/dsp file descriptor). To prevent this, you have to set the FD_CLOEXEC flag on the file descriptors you don’t want inherited, so that they are closed when the exec happens.

For example:

int fd = open( "/dev/dsp", O_RDWR );
if( fd != -1 )
   fcntl( fd, F_SETFD, FD_CLOEXEC );  // close this descriptor on exec

Conclusion: setting the FD_CLOEXEC flag on the /dev/dsp file descriptor fixed the problem for audio. However, most of the other file descriptors were still inherited by ntpd. Did we go back and set the FD_CLOEXEC flag on all file descriptors, you ask? Nope. It turns out we had a script monitoring the NTP config file and restarting ntpd for us when the file got updated… we just had to update the config file and remove the system( "service ntpd restart" ) call.

Oh, and the reason audio worked on first boot but not subsequent restarts was due to a weird race condition around when /dev/dsp got opened.


We got an interesting application crash yesterday with a confusing message similar to this:

Fault bucket 42424242, type 1
Event Name: APPCRASH
Response: None
Cab Id: 0

Problem signature:
P1: MyApp.exe
P2: 1.42.42.42
P3: 598773cf
P4: StackHash_ac62
P5: 0.0.0.0
P6: 00000000
P7: c0000007
P8: 00000000
P9:
P10:

We spent some time wondering if our crypto libraries were the problem (we had made some changes to them recently), but concluded that was unlikely. So what the heck is the “StackHash” module? Did our trashed stack cause the kernel to think we were a different module? Nope.

The answer is that the Windows executive couldn’t identify the module we were in when the application crashed (it uses the instruction pointer to determine what code was executing). In this case, the kernel simply takes a hash of the stack so at least we might be able to identify if we’ve seen this exact crash before. Here’s the answer summarized by an engineer from Microsoft:

In the OS when I try to get a faulting module name it is possible that there is no module laoded (sic) at that address. For example in this case the EIP was zero. So in those cases where a module is not loaded and it is not also in the unloaded module list, I take a stack hash of the stack so that we can identify this crash from other crashes where also the module is not known.


So here’s a cool feature of GNU’s implementation of libc: you can get a stack backtrace (as an array of strings) dynamically in your code. This can be really useful when trying to determine the code path taken when an error occurs. Most times, it’s faster to just run the code in a debugger and use it to display a backtrace, but there are instances when doing it programmatically is your best option. For example, you could get a backtrace in your application’s exception handler and use it to augment error log messages.

First, you need to include execinfo.h in your code:

#include <execinfo.h>

Next, call the backtrace() function to get an array of void pointers that represents the current stack (the pointers are the return addresses for each stack frame).

void* tracePtrs[100];
int count = backtrace( tracePtrs, 100 );

The backtrace() function returns the number of entries in the array (read the man pages for more info about the array size).

Finally, you need to resolve the function names associated with the pointers. You have two options: backtrace_symbols() and backtrace_symbols_fd(). Both of these functions resolve the pointers to strings, but the difference is that backtrace_symbols() allocates the strings on the heap while backtrace_symbols_fd() writes the strings directly to a file descriptor you supply. Just keep in mind that backtrace_symbols() may fail if the heap has been trashed.

Here’s an example using backtrace_symbols():

char** funcNames = backtrace_symbols( tracePtrs, count );

// Print the stack trace
for( int ii = 0; ii < count; ii++ )
   printf( "%s\n", funcNames[ii] );

// Free the string pointers
free( funcNames );

NOTE: Make sure you call free() on the array of strings returned from backtrace_symbols().
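
For the other option, backtrace_symbols_fd(), a minimal sketch might look like the following. The function name dump_backtrace and the MAX_FRAMES limit are my own choices for illustration. One caveat from the glibc man page: neither call uses malloc() explicitly, but the first call to backtrace() can trigger dynamic loading of libgcc, which may itself allocate.

```c
#include <assert.h>
#include <execinfo.h>
#include <unistd.h>

#define MAX_FRAMES 100   /* hypothetical limit; deeper stacks are truncated */

/* Write the current call stack to fd, one frame per line, without
   building an array of strings on the heap.
   Returns the number of frames written. */
int dump_backtrace(int fd)
{
    void* frames[MAX_FRAMES];
    int count = backtrace(frames, MAX_FRAMES);

    assert(count >= 0 && count <= MAX_FRAMES);
    backtrace_symbols_fd(frames, count, fd);
    return count;
}
```

Calling dump_backtrace( STDERR_FILENO ) from an error path sends the trace straight to stderr, and there is nothing to free() afterwards.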

For more information, here’s a good article from the Linux Journal.


The Windows development environment provided by Visual Studio has some neat tools for detecting memory leaks in code. You simply #define _CRTDBG_MAP_ALLOC before including your headers, and #include <crtdbg.h> as the last header:

#define _CRTDBG_MAP_ALLOC
// Include header files here
#include <crtdbg.h>

Then, you call _CrtDumpMemoryLeaks() before your application exits. If your program exits at many points, you can alternatively call _CrtSetDbgFlag( _CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF ) at the beginning of your application, which causes the leaks to be printed automatically when it exits. The results are printed to the Debug Window and look like the following:

Detected memory leaks!
Dumping objects ->
C:\PROGRAM FILES\VISUAL STUDIO\MyProjects\leaktest\leaktest.cpp(20) : {18}
normal block at 0x00780E80, 64 bytes long.
Data: < > CD CD CD CD CD CD CD CD CD CD CD CD CD CD CD CD
Object dump complete.

Cool, Huh?! However, some libraries don’t play nice with this, as I explain below.

Continue reading »


Debugging C++ templates is difficult. Debugging C++ templates with GDB can be an act of torture for even seasoned GDB users. I like GDB, but there are some tricks you should know when using it to debug templates. In this post, I deal with setting breakpoints.

Breakpoint Basics:

Setting a breakpoint in GDB is supposed to be simple. Here we set a breakpoint at line 50 in file main.cpp:

(gdb) b main.cpp:50
Breakpoint 1 at 0x804937a: file main.cpp, line 50.

We can also use the function name and GDB will attempt to find the correct location for us:

(gdb) b DoSomething
Breakpoint 2 at 0x8049334: file main.cpp, line 150

Simple, right? Just wait…

Breakpoint Gotchas:

GDB’s breakpoint logic is pretty handy for simple projects, but it can break down fast when things get more complicated.

For example, let’s say your application is plugin-driven, with each plugin being a separate library. Now assume each plugin has a Plugin.cpp file under its own Source directory. Try to set a breakpoint in the Initialize() method of the Plugin class:

(gdb) b Initialize
Breakpoint 3 at 0x8049717: file main.cpp, line 230

Oops! There is an Initialize() method in main.cpp and GDB thought that’s where we wanted to put it: wrong!

Continue reading »
