Today, Jean-François Roche and I were trying to debug a spinning Zope instance.
We first tried the signal handling feature of Zope 2.13: when Zope catches a SIGUSR1 signal, sent by issuing
kill -10 myzope_pid
Zope dumps the stack trace of each thread to stdout. (For the record, if you work with a Zope 2 release older than 2.13, you can add Products.signalstack to get the same feature.)
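To make the mechanism concrete, here is a minimal sketch of a Python-level handler that dumps every thread's stack. It is not the code Zope or Products.signalstack actually uses; the names `format_thread_stacks`, `dump_on_signal` and `install` are made up for illustration.

```python
import signal
import sys
import threading
import traceback


def format_thread_stacks():
    """Return the current Python stack of every thread as one string."""
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    # sys._current_frames() maps each thread id to its topmost frame.
    for thread_id, frame in sys._current_frames().items():
        name = names.get(thread_id, "unknown")
        chunks.append("Thread %s (%s):\n" % (thread_id, name))
        chunks.extend(traceback.format_stack(frame))
    return "".join(chunks)


def dump_on_signal(signum, frame):
    # Python runs this in the main thread, and only once the
    # interpreter gets control back between bytecode instructions.
    sys.stdout.write(format_thread_stacks())
    sys.stdout.flush()


def install(signum):
    """Register the dump handler for the given signal number."""
    signal.signal(signum, dump_on_signal)
```

Calling `install(signal.SIGUSR1)` at startup and then sending the signal with kill would print one block per thread. Because the handler is ordinary Python code, it only runs when the interpreter regains control.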
If you do not have access to stdout, for instance when running Zope in the background, you can also use five.z2monitor, which Jean-François released a few weeks ago. Its README will tell you more.
Back to our problem.
Unfortunately, neither of the two solutions above helped us: we did not get any stack trace. This made us guess that we were stuck in a C extension, which prevents Python from running registered signal handlers or switching to other Python threads.
To debug this, we would need to use gdb, which neither of us had ever used.
We searched for 'start zope with gdb' and found a very old article on old.zope.org: Debug a spinning Zope. Its content is pasted below; it explains this useful trick well.
It was very easy to follow step by step. It allowed us to confirm that Zope was actually spinning in re.search, in other words in a C extension.
"Spinning" is when a request causes a running Zope to consume all available CPU indefinitely. This is usually caused by some kind of infinite loop or deadlock, and is painful to debug. Under Linux, at least, I've been able to use gdb to solve one spinning problem.
I've only tried this on a Mandrake 8.1 Linux installation, with a multi-threaded, zdaemoned Zope 2.5.1 running under Python 2.1.3. I have no experience debugging any other configuration this way.
- Attach to Zope with the GNU Debugger
Don't know how to use gdb? Neither do I, but I was able to muddle through.
- Look in your "var/Z2.pid" file and get the second pid listed there.
- Run gdb with the name of your python executable. For example, with Python 2.1.3, I ran "gdb python2.1".
- At the "(gdb)" prompt, type "attach", using the pid you found earlier.
- If all goes well, you should have to page through several screens worth of "Reading symbols" spew. Hit return until it's done.
- Find the spinning thread
- Type "info threads" at the "(gdb)" prompt.
- Unless your Zope is very busy, most of the threads should be in sigsuspend(), poll(), or select(). You should be able to spot the troublemaker here. Failing that, check "top" for the pid of the thread that's using all the CPU time, and look for "(LWP <pid>)" in the thread list.
- Supposing our culprit is listed as "4 Thread 2051 (LWP 8236) ...", we now switch to thread 4 with the command "thread 4".
- Get a traceback
Now for the fun part, thanks to a post by Barry Warsaw.
- Type the following at the prompt:
call PyRun_SimpleString("import sys, traceback; sys.stderr=open('/tmp/tb','w',0); traceback.print_stack()")
- Look in "/tmp/tb" for a complete Python traceback of the current call stack of the thread.
- Figure out where the loop/deadlock is.
I can't give step-by-step instructions on this one. Try repeating step 3 ("Get a traceback") several times; you should see a pattern. In my case, the thread was always in the __read() method of an NMB connection, and I discovered that it was being called with no timeout value.
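The statement passed to PyRun_SimpleString in the traceback step is Python 2 code: Python 3 refuses to open a text file unbuffered, so open('/tmp/tb','w',0) fails there. Below is a hedged sketch of an equivalent statement that closes the file explicitly instead. The helper name `gdb_payload` is hypothetical, and the sketch covers only the Python statement itself, not whether the gdb "call" succeeds on a given interpreter.

```python
def gdb_payload(path="/tmp/tb"):
    """Build a one-line Python statement that writes the current stack to path.

    The statement uses only single quotes, so it can be embedded in the
    double-quoted C string given to PyRun_SimpleString.
    """
    # %r renders the path as a quoted Python string literal.
    return (
        "import traceback; f=open(%r,'w'); "
        "traceback.print_stack(file=f); f.close()" % path
    )


if __name__ == "__main__":
    # Print the statement so it can be pasted into the gdb command.
    print(gdb_payload())
```

Running the statement in a normal interpreter first (for example with `exec(gdb_payload("/tmp/tb"))`) is a cheap way to check that it writes a traceback before trying it from gdb.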
This was a chance to go back in time and be reminded of Evan Simpson, who used to be famous as one of the authors of the Zope book.
I also want to thank the people who keep old.zope.org running, giving us a chance to keep access to information like the article above.