What Is It?

ThreadNinja is a Linux library my team created that tracks pthread_create() and pthread_join() calls in an application. It prints a stacktrace where each thread is created and where it is joined. Any rogue (unjoined) threads are reported when the application exits. ThreadNinja is unobtrusive: it does NOT have to be compiled into the code. This means you can use it on applications you didn’t compile.

We found it useful and thought we’d share it. It’s by no means production code… just a tool. Hack on it, expand it, change it… whatever. It’s pretty small, so it should be easy to dive right in. We’ve released it under the BSD license.

Cut To The Chase

You can check out the source code from Google Code, or download the version 1.0 tarball directly (threadninja.tar.gz).

To build ThreadNinja, simply untar it and call make:
> tar -zxf threadninja.tar.gz
> make

Now, simply use LD_PRELOAD to run the application:

> LD_PRELOAD=/path/to/threadninja/build/libthreadninja.so.1 TheApplication

If you don’t see function names in the stacktraces that are generated, then the application’s function symbols aren’t visible at runtime. For my test app, I had to compile with the -rdynamic option:

> g++ -Wall -rdynamic main.cpp -lpthread

This adds all of the executable’s global symbols, including the application’s function names, to its dynamic symbol table, where they can be looked up at runtime. For more info, look at the --export-dynamic option on the GNU linker (ld) man page.

The Story Behind ThreadNinja

My team was assigned to stabilize a large video application that runs as a Linux-based appliance. The application consisted of 100,000+ lines of code that were a tangle of build warnings, circular references, and many creative hacks. Our particular task was to fix a persistent set of seg-faults and memory leaks.

One of the first things we noticed was that the application didn’t shut down properly. The shutdown logic was something akin to: signal components to exit, sleep 2 seconds, call Release() a couple of times to clean up extra ref-counts, sleep 2 seconds, and then just call exit() to get the process to terminate (bypassing the remaining cleanup code). Of course, this renders Valgrind useless when trying to find memory leaks, because the application’s own cleanup code gets bypassed by the forced exit, so everything still allocated at that point is reported alongside the real leaks.

As a result, our first priority was to get the app to shut down cleanly. The first issue we ran into was that pthread_join() blocked indefinitely because threads were failing to terminate. We tried using GDB to track the threads, but many hundreds of threads were being created and destroyed dynamically. We needed a way to let the application run for hours, allow 1000+ threads to live and die, and still be able to track rogue threads. Hence, ThreadNinja was born.

How It Works

Through the magic of LD_PRELOAD, ThreadNinja “injects” itself between the application and the pthread library for all calls to pthread_create() and pthread_join(). This works because LD_PRELOAD instructs the loader to load ThreadNinja into memory first, before other libraries (in this case, before the pthread library). The result is that the application uses ThreadNinja’s implementations of pthread_create() and pthread_join() instead of pthread’s own. ThreadNinja records each call and then passes it on to the “real” pthread implementation. From the application’s point of view, the behavior of the thread functions is the same… the tracking is transparent.
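For readers who haven’t used LD_PRELOAD interposition before, here is a minimal sketch of the general technique. It is not ThreadNinja’s actual source: the counters, the next_symbol() helper, and the build line in the comment are all made up for illustration, and it assumes glibc on Linux, where dlsym(RTLD_NEXT, ...) returns the next definition of a symbol in load order.

```cpp
// Sketch only; not ThreadNinja's actual source.
// Assumed build line: g++ -shared -fPIC interpose.cpp -o libinterpose.so -ldl -lpthread
#include <dlfcn.h>    // dlsym, RTLD_NEXT (GNU extension; g++ defines _GNU_SOURCE)
#include <pthread.h>
#include <atomic>
#include <cassert>
#include <cstdio>

namespace {

std::atomic<long> g_created{0};  // successful pthread_create() calls seen
std::atomic<long> g_joined{0};   // successful pthread_join() calls seen

using CreateFn = int (*)(pthread_t*, const pthread_attr_t*, void* (*)(void*), void*);
using JoinFn = int (*)(pthread_t, void**);

// Hypothetical helper: find the next definition of `name` after this
// object in load order, i.e. the real pthread implementation.
template <typename Fn>
Fn next_symbol(const char* name) {
    Fn fn = reinterpret_cast<Fn>(dlsym(RTLD_NEXT, name));
    assert(fn != nullptr && "real pthread symbol not found");
    return fn;
}

}  // namespace

// Same name and signature as the real function, so a preloaded library
// containing this definition is what the application ends up calling.
extern "C" int pthread_create(pthread_t* thread, const pthread_attr_t* attr,
                              void* (*start)(void*), void* arg) {
    static const CreateFn real = next_symbol<CreateFn>("pthread_create");
    const int rc = real(thread, attr, start, arg);  // forward to the real call
    if (rc == 0) {
        ++g_created;
        std::fprintf(stderr, "Thread Created: %lu\n", (unsigned long)*thread);
    }
    return rc;
}

extern "C" int pthread_join(pthread_t thread, void** retval) {
    static const JoinFn real = next_symbol<JoinFn>("pthread_join");
    const int rc = real(thread, retval);  // forward to the real call
    if (rc == 0) {
        ++g_joined;
        std::fprintf(stderr, "Thread Joined: %lu\n", (unsigned long)thread);
    }
    return rc;
}
```

The wrappers return whatever the real functions return, so the application sees unchanged behavior. Logging goes to stderr here only to keep the sketch from mixing with the application’s own stdout.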

Output

Each time pthread_create() is called, a stacktrace and timestamp are printed to stdout:

Thread Created: 3047947120
[bt] Thu Jul 8 16:31:23 2010

[bt] ./a.out(StartService(unsigned long*, int))
[bt] ./a.out(Initialize())
[bt] ./a.out(main+0xb) [0x80489fa]
[bt] /lib/libc.so.6(__libc_start_main+0xe6) [0xb5bbb6]
[bt] ./a.out() [0x8048811]

The first line (“Thread Created”) gives the value of the pthread_t handle, so you can later track where rogue threads were created. The next line is the time of the pthread_create() call. The remaining lines are the call stack that led to the pthread_create() call.
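The “[bt]” lines above look like the output of glibc’s backtrace_symbols(), so a record like this could plausibly be produced along the following lines. This is a rough sketch under that assumption, not ThreadNinja’s actual code; capture_stack() and print_record() are names invented here. The demangled C++ names in the sample suggest an extra step (such as abi::__cxa_demangle) that this sketch leaves out.

```cpp
// Sketch only: print a timestamped stack trace in a "[bt]" style.
#include <execinfo.h>  // backtrace, backtrace_symbols (glibc)
#include <time.h>      // localtime_r (POSIX)
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <string>
#include <vector>

// Hypothetical helper: return the current call stack, one string per frame.
// Function names only appear if the symbols are in the dynamic symbol table
// (for an executable, that is what -rdynamic provides).
std::vector<std::string> capture_stack(int max_frames = 32) {
    assert(max_frames > 0);
    std::vector<void*> frames(max_frames);
    const int n = backtrace(frames.data(), max_frames);
    char** symbols = backtrace_symbols(frames.data(), n);
    std::vector<std::string> out;
    if (symbols == nullptr) return out;  // allocation failed
    for (int i = 0; i < n; ++i) out.emplace_back(symbols[i]);
    std::free(symbols);  // one free() releases the whole array
    return out;
}

// Hypothetical helper: print a label, thread id, timestamp, and stack.
void print_record(const char* label, unsigned long id, std::FILE* out) {
    const std::time_t now = std::time(nullptr);
    std::tm local{};
    localtime_r(&now, &local);  // thread-safe variant of localtime()
    char stamp[64];
    std::strftime(stamp, sizeof stamp, "%a %b %e %H:%M:%S %Y", &local);
    std::fprintf(out, "%s: %lu\n[bt] %s\n\n", label, id, stamp);
    for (const std::string& frame : capture_stack())
        std::fprintf(out, "[bt] %s\n", frame.c_str());
}
```

Calling print_record("Thread Created", id, stdout) from inside a pthread_create() wrapper would show the wrapper itself as the top frames; a real tool would probably skip those.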

Each time pthread_join() is called, similar information is printed to stdout:

Thread Joined: 3047947120
[bt] Thu Jul 8 16:31:31 2010

[bt] ./a.out(Terminate())
[bt] ./a.out(main+0x10) [0x80489ff]
[bt] /lib/libc.so.6(__libc_start_main+0xe6) [0xb5bbb6]
[bt] ./a.out() [0x8048811]

When the application terminates (cleanly or uncleanly), a summary of the current state of the application threads is printed:

exit_handler()
[Thread Summary]
Total Created: 573
Total Joined: 568
Total Running: 5

In this case, you’ll notice that 5 threads were never joined.
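The bookkeeping behind a summary like this can be quite small. The sketch below is one possible shape for it, with a made-up ThreadLedger class rather than ThreadNinja’s real data structures: it remembers where each live thread was created, forgets a thread once it is joined, and reports whatever is left.

```cpp
// Sketch only: track created/joined threads and summarize the leftovers.
#include <cassert>
#include <cstdio>
#include <map>
#include <mutex>
#include <string>

class ThreadLedger {  // hypothetical name, for illustration
public:
    // Record a new thread and a description of where it was created.
    void on_create(unsigned long id, const std::string& where) {
        std::lock_guard<std::mutex> lock(mutex_);
        live_[id] = where;
        ++created_;
    }

    // Record a join. Returns false if the id is not a known live thread.
    // Erasing the entry lets a later thread reuse the same id safely.
    bool on_join(unsigned long id) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (live_.erase(id) == 0) return false;
        ++joined_;
        return true;
    }

    long created() const { std::lock_guard<std::mutex> l(mutex_); return created_; }
    long joined() const { std::lock_guard<std::mutex> l(mutex_); return joined_; }
    long running() const { std::lock_guard<std::mutex> l(mutex_); return created_ - joined_; }

    // Print the totals, then one line per thread that was never joined.
    void print_summary(std::FILE* out) const {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(created_ >= joined_);
        std::fprintf(out, "[Thread Summary]\nTotal Created: %ld\n"
                          "Total Joined: %ld\nTotal Running: %ld\n",
                     created_, joined_, created_ - joined_);
        for (const auto& entry : live_)
            std::fprintf(out, "Unjoined: %lu created at %s\n",
                         entry.first, entry.second.c_str());
    }

private:
    mutable std::mutex mutex_;
    std::map<unsigned long, std::string> live_;  // id -> creation site
    long created_ = 0;
    long joined_ = 0;
};
```

To get the report at shutdown, a global ledger’s print_summary() could be called from a handler registered with std::atexit(). That only covers normal exits, so an unclean exit would need something more.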

Limitations

ThreadNinja only tracks calls to pthread_create() and pthread_join(). This means calls like system(), exec(), and fork() are not tracked. Also, calls to pthread_cancel() are not tracked. We had started adding code to track pthread mutexes and stuff, but it turned out we didn’t need it. Feel free to add support for all this stuff and submit changes to the Google Code site.

Happy coding!

2 Responses to ThreadNinja: Finding Rogue POSIX Threads

  1. Jeff Frontz says:

    Not sure which kernel (or thread implementation) this was written for, but under Linux 2.6.23, the pthread_t ID value can be reused. For processes that create/destroy a lot of threads, this makes the first assert() in DataCollection::AddThread() fail pretty quickly.

    Adding

    _threadInfoMap.erase(threadId);

    at the end of DataCollection::CloseThread() seems to help, except when there is truly a rogue thread and then the assert() fails.

    Tom reply on April 13th, 2011 9:33 am:

    Thanks Jeff! I’ll take a look.
