Interesting way to safely debug multiprocessing Python systems

I have one particular “job” that has 3 subprocesses moving as fast as humanly possible to build a report. The main slowdown is an external data source that isn’t outright terrible, but it’s not great either. The worst possible outcome is when this thing hangs or misses available work, which it was predisposed to do a lot.

Various kill signals usually failed to give me an idea of where the workers were getting hung up, and I wasn’t really excited about putting tracer log messages everywhere. Fortunately I have a dbgp-enabled IDE, and I found this answer on SO: http://stackoverflow.com/a/133384/9908
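For reference, the heart of that answer is a handler that, on SIGUSR1, drops the running process into an interactive console with the offending stack frame in scope. A minimal sketch of the recipe from memory; see the answer itself for the canonical version:

    import code
    import signal
    import traceback

    def debug(sig, frame):
        # Make the interrupted frame's state easy to poke at from the shell.
        d = {'_frame': frame}        # the frame object itself
        d.update(frame.f_globals)    # its globals...
        d.update(frame.f_locals)     # ...shadowed by its locals

        message = "Signal received: entering python shell.\nTraceback:\n"
        message += ''.join(traceback.format_stack(frame))
        code.InteractiveConsole(d).interact(message)

    def listen():
        signal.signal(signal.SIGUSR1, debug)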

Taking that as a starting point, I modified it to look like this:

import sys
import signal
import traceback

# class FeedUserlistWorker, which is managed by a custom
# multiprocessing.Pool implementation.

    @classmethod
    def Create(cls, feed, year_month=None):

        # Arm the panic handler in the worker before any real work starts.
        signal.signal(signal.SIGUSR1, FeedUserlistWorker._PANIC)

        try:
            return cls(year_month=year_month, feed=feed).run()
        except Exception:
            traceback.print_exc()
            sys.stderr.flush()
            sys.stdout.flush()

The print_exc is there because there isn’t a reliable bridge to carry exceptions from a child back to the parent. The flushes are there because stdout/stderr are buffered between the workers and the parent pool manager.
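If you do need the failure itself to make the trip, the usual workaround is to catch in the child and ship the formatted traceback home as plain data, since a string pickles cleanly where many exception objects don’t. A sketch, with a hypothetical work function standing in for the real worker body:

    import traceback
    from multiprocessing import Pool

    def work(item):
        # Hypothetical stand-in for the real worker body.
        return item * 2

    def safe_work(item):
        # Return (ok, payload) instead of raising: a formatted traceback is
        # plain text, so it survives the pickle trip back to the parent.
        try:
            return True, work(item)
        except Exception:
            return False, traceback.format_exc()

    if __name__ == '__main__':
        pool = Pool(3)
        for ok, payload in pool.map(safe_work, range(10)):
            if not ok:
                print("worker failed:\n" + payload)
        pool.close()
        pool.join()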

    @classmethod
    def _PANIC(cls, sig, frame):
        # Merge the interrupted frame's globals and locals into one dict so
        # everything is easy to inspect once the debugger attaches.
        d = {'_frame': frame}
        d.update(frame.f_globals)
        d.update(frame.f_locals)

        # Break out to the dbgp-enabled IDE listening on the dev machine.
        from dbgp.client import brk
        brk("192.168.1.2", 9090)
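Triggering it is just a matter of signaling the stuck worker from another shell, with kill -USR1 <pid> or the Python equivalent (the PID below is a placeholder; find the real one with ps or pgrep):

    import os
    import signal

    # PID of the hung worker; 12345 is a placeholder.
    os.kill(12345, signal.SIGUSR1)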

The only thing that matters is that call to dbgp. Using that tool, I was able to step up the call stack, fire ad-hoc commands to inspect variables in each stack frame, and find the exact blocking call, which turned out to be the validation/authentication part of boto’s S3 support. That was a weird one, as I had assumed the busy loop/block was in my own code (e.g. a while True: that never breaks). Fortunately it has an easy fix: https://groups.google.com/forum/#!msg/boto-users/0osmP0cUl5Y/5NZBfokIyoUJ. That resolved the problem, since my Pool manager doesn’t mark tasks complete and a failure just causes the lost task to be resumed from the last point of success.
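If memory serves, the fix boils down to giving boto’s HTTP connections a socket timeout so a dead S3 connection raises instead of blocking forever; that’s an assumption on my part about what the thread recommends, so check it yourself. Something like this, or the equivalent http_socket_timeout line in ~/.boto:

    import boto

    # Assumed fix: give boto's HTTP layer a socket timeout so a wedged S3
    # connection errors out instead of hanging the worker forever.
    if not boto.config.has_section('Boto'):
        boto.config.add_section('Boto')
    boto.config.set('Boto', 'http_socket_timeout', '10')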