
Sunday, October 23, 2016

Python generators with multiple threads

Sometimes when writing Python there is a CPU-expensive task that needs to be done for every item in an iterator. You might go to Google and search for "python multithreading", which might take you to the threading module. But that isn't very useful, so next you look to the multiprocessing module. That looks more promising, but all the examples have managers or queues or pipes or some other complicated stuff.

I want simple. All I want to do is write a generator that yields something, have that something passed to a thread or process to compute the result, then have a generator yield the result when it is ready. Sounds simple. Unfortunately it is hard to find examples of how to do this in Python.

So here are some examples!

In the example below, increasing_sleeps() is the function that yields something and rand_sleep() is the function that takes a long time to process each thing yielded by increasing_sleeps(). It uses multiprocessing.dummy, which offers the same Pool API as multiprocessing but runs the work in threads instead of full processes.
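A minimal sketch of that pattern follows. The names increasing_sleeps() and rand_sleep() come from the description above; the run() helper, the count and workers arguments, and the scale factor (which shortens the sleeps so the demo finishes quickly) are assumptions added for illustration.

```python
import random
import time
from multiprocessing.dummy import Pool  # same Pool API, backed by threads


def increasing_sleeps(count=10):
    """Yield the maximum sleep time for each task: 0, 1, 2, ..."""
    for max_sleep in range(count):
        yield max_sleep


def rand_sleep(max_sleep, scale=0.01):
    """Stand-in for slow work: sleep a random whole number up to max_sleep.

    scale is a hypothetical knob; with scale=1 the sleeps are real seconds.
    """
    slept = random.randint(0, max_sleep)
    time.sleep(slept * scale)
    return slept


def run(count=10, workers=4):
    """Generator that yields each result as soon as a worker finishes it."""
    pool = Pool(workers)
    try:
        for slept in pool.imap_unordered(rand_sleep, increasing_sleeps(count)):
            yield slept
    finally:
        # Release the worker threads once the results are exhausted.
        pool.close()
        pool.join()


if __name__ == "__main__":
    for slept in run():
        print("Slept %d seconds" % slept)
```

The input generator is consumed by the pool, each item is handed to a worker thread, and the outer loop receives results in whatever order they complete.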


The results will look something like this:

Slept 0 seconds
Slept 0 seconds
Slept 1 seconds
Slept 1 seconds
Slept 4 seconds
Slept 4 seconds
Slept 3 seconds
Slept 5 seconds
Slept 1 seconds
Slept 4 seconds

Cool! But what if I need to know what input was sent to rand_sleep() so I can correlate the results with the input? Simple: just return the input from rand_sleep() along with the result. The example below does that correlation, and also uses full Python processes instead of threads. Threads are cheaper, but for CPU-bound work you will need processes to get around the Global Interpreter Lock (GIL). One big gotcha with full processes is that Python needs to pickle the callable being sent to the worker process (processes don't share memory the way threads do). Module-level functions pickle fine, but lambdas and nested functions do not, and Python 2 can't pickle instance methods either. Instances of custom classes defined at module level do pickle, so you can make a callable class and pass an instance of it instead.
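Here is a hedged sketch of the process-based version using a callable class. RandSleep, run(), and the scale attribute are hypothetical names chosen for illustration; the only library calls relied on are multiprocessing.Pool and its imap_unordered() method.

```python
import random
import time
from multiprocessing import Pool  # real processes this time


class RandSleep(object):
    """Callable class; instances pickle cleanly when defined at module level."""

    def __init__(self, scale=0.01):
        # Hypothetical knob: scale=1 would sleep for real seconds.
        self.scale = scale

    def __call__(self, max_sleep):
        slept = random.randint(0, max_sleep)
        time.sleep(slept * self.scale)
        # Return the input alongside the result so the caller can correlate.
        return slept, max_sleep


def increasing_sleeps(count=10):
    """Yield the maximum sleep time for each task: 0, 1, 2, ..."""
    for max_sleep in range(count):
        yield max_sleep


def run(count=10, workers=4):
    """Yield (slept, max_sleep) pairs as worker processes finish them."""
    pool = Pool(workers)
    try:
        for pair in pool.imap_unordered(RandSleep(), increasing_sleeps(count)):
            yield pair
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    # The guard matters: platforms that spawn workers re-import this module.
    for slept, max_sleep in run():
        print("Slept %d seconds of max %d" % (slept, max_sleep))
```

Because each result carries its own input, it no longer matters which order the processes finish in.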


The results will look something like this:

Slept 0 seconds of max 0
Slept 1 seconds of max 1
Slept 1 seconds of max 2
Slept 0 seconds of max 3
Slept 3 seconds of max 5
Slept 0 seconds of max 6
Slept 4 seconds of max 4
Slept 2 seconds of max 8
Slept 5 seconds of max 7
Slept 8 seconds of max 9

Nice! That makes it clear that the results are coming back out of order. If you want the results to come back in order, use Pool.imap() instead of Pool.imap_unordered().
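To see the difference side by side, here is a small sketch that runs the same inputs through both methods on a thread pool. slow_echo() and compare() are hypothetical helpers, and the delays are arbitrary values picked so that earlier items take longer than later ones.

```python
import time
from multiprocessing.dummy import Pool  # thread-backed Pool


def slow_echo(delay):
    """Sleep for `delay` seconds, then hand the same value back."""
    time.sleep(delay)
    return delay


def compare(delays, workers=4):
    """Return (imap results, imap_unordered results) for the same inputs."""
    pool = Pool(workers)
    try:
        ordered = list(pool.imap(slow_echo, delays))
        unordered = list(pool.imap_unordered(slow_echo, delays))
    finally:
        pool.close()
        pool.join()
    return ordered, unordered


if __name__ == "__main__":
    ordered, unordered = compare([0.3, 0.2, 0.1, 0.0])
    print("imap:           %r" % (ordered,))    # always matches input order
    print("imap_unordered: %r" % (unordered,))  # completion order, may differ
```

The trade-off is that imap() holds back a finished result until every earlier item has also finished, so one slow item can delay everything behind it.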

And finally, here is an example of a simple yet useful use of these methods: