Why should we use concurrency?
When you execute a set of instructions on a CPU, it takes a given amount of time by default.
The major advantage of this concurrency is, that in theory and in a linear example we save 50% of computing time if we use 100% more CPU cores.
To understand how the concept works and which possibilities we have to improve our computation speed, we need a few terms clarified. If you feel confident about these terms, you may skip the first few headings and just select one of the three possibilities I introduced:
- Multithreading in Python using threading.Thread
- Multithreading in Python using joblib
- Multiprocessing in Python using multiprocessing.pool
- Comparing the parallelization techniques
When to apply concurrency?
Generally spoken you can apply concurrency whenever you can execute your code parallelly. Imagine a simple example where you want to add 8 numbers.
To further simplifiy things, let’s assume each addition would take a full second. So it would take 7 seconds to add these 8 number as we have to do 7 operations. How could you compute this in multiple threads? Well you could simple split these operations up into 7 additions of their own:
So in the first phase we would need 4 CPU cores in the second only 2 and in the last only 1. By applying this we theoretically save 4 seconds as this takes only 3 seconds to compute the 7 additions. We have to say theoretically as this is a insanely simplified example that doesn’t cosider various problems that occur when applying this.
Nevertheless, as you might have noticed, you have to know pretty exactly what your actual computation task is to apply concurrency in a meaningful way. You also have to know, if your concurrent computation is mathematically valid. We get into trouble if we just do the same with:
What is a process?
A process or task is a program in execution. It is a specific instantiation of a program on a system and has additional information besides its actual code. To execute it, your program must be translated to instructions your CPU is able to understand. This is something you usually don’t have to care about as compilers and interpreters do this for you.
But to actually execute it, your translated code still has to move to the CPU somehow. This happens by loading the information into the main memory and from there to your CPU (Theres actually happening a lot more but it’s way too much for this article, you may read it up here: https://en.wikipedia.org/wiki/Process_(computing))
Depending on which operating system you use, your system will assign a fixed or variable amount of main memory to this process wherein its information will be stored.
The allocated memory usually is splitted into four crucial segments:
- Heap: Contains dynamically assigned memory for storing relevant information
- Stack: Contains jump labels and local variables
- Data: Contains global variables
- Text: Contains the actual instructions that should be executed on your CPU
Despite the fact that you do not have infinite main memory heap and stack may not increase infinitely either. They usually share an allocated size of memory. You may have heard the term „stack overflow“ and now you know what is meant with it. When your program code is stored in the main memory the so called scheduler will assign CPU computation time to it and then it may finally be executed (Not fully, just as long as its computation time allows).
As this shouldn’t go too much into boring details, I will not elaborate further here. If you want to know more on this, please leave a comment and I will try to write a more detailed version.
What is a thread?
A thread is a part of a program that can be executed concurrently. A process may contain various threads. There are two major classes of threads:
- Kernel threads: the operating system supervises them
- User-Threads: a program must supervise them
A thread shares the data- text-segment with all other threads of the same process. Threads can be executed on multiple CPU cores. However, threads also have some additional information of their own:
- Program Counter
The stack usually is stored in the memory allocated for the stack of the process. Whereas the program counter is independent. These terms clarified we may finally come to the actual concepts of applying parallelization.
One way to apply concurrency we somehow already covered a bit is multithreading. There is some confusion about this term as there is hardware and software based threading.
If we talk about hardware threading we mean a single CPU core that may execute more than one thread at once. This is often called a logical processor or thread. Modern Intel and AMD CPUs mostly have two logical processors on each physical core. So for example an AMD Ryzen Threadripper 3970x (Affiliate Link) with 32 physical cores has 64 logical ones which means it may execute up to 64 threads at once on a single CPU.
If we talk about software based threading we mean that we give specific instructions on how our program may be executed on multiple (logical) processors.
Another way to reach parallelization of computation is to simply start multiple processes instead of threads. An obvious disadvantage of this method is, that this will produce a lot of redundant data (overhead) in the main memory, if you simply produce multiple process instances of the same job as they don’t share their memory anymore. However, this might sometimes be intended as you may see when you for example encounter the global interpreter lock (GIL) in Python.
A problem for concurrency
As a basis for looking deeper into how to apply concurrency in Python we need an example that isn’t too far from reality. Therefore I took a simple crawler that visits websites and returns their content.
To do the actual HTTP-GETs we use the requests library. Our crawl function takes an URL as an argument and will return the first 100 characters of the pages HTML content as a string via get_content().
Next we need some URLs to run the crawler on. To simplify things further, we will take advantage of a Python feature and use a list of 4 URLs and multiply it by 4 to get a list of 16 URLs:
The usual approach to run this would simply be, to call the function with an URL:
As we got a baseline, now we can look further into how to apply the different techniques.
Multithreading in Python using threading.Thread
As elaborated earlier our problem at hand is to speed up a crawler by applying concurrency. I prepared some simplified code to pose how this could possibly be realized via the threading.Thread class:
In this example we will run 4 threads concurrently which is why we need a function that will cut our URL list into tuples of the length 4. This is realized in the function fourwise_iteration() which uses the built-in functions iter and zip.
Next we need our actual Thread class. We reach this by introducing our own class CrawlThread that extends the Thread class and overwrites it’s init() and run() function. In our init function we pass all the variables we require later on. In our run function we define what should be executed when the thread is started. For further information please look into the threading documentation.
Next we have our multithread_crawling function that actually runs 4 threads to crawl 4 URLs nearly simultaneously. Please note that our result list will contain tuples of length 4 too.
Multithreading in Python using joblib
A pretty easy and lucid way to do this is by using the joblib module:
The syntax is straight forward and there is barely anything to explain. To run joblib we need a list of arguments that is provided in URLS[:adresses]. So for example if our variable addresses is 4 this will provide the first 4 URLs from the list URLS.
The n_jobs parameter of Parallel defines how many jobs we want to run concurrently. Next in delayed we provide the function we want to run, please note that we don’t use any brackets there as we want to pass the function and not the result of the function. Following that we do a simple for loop to pass our URLs to the crawl function via Parallel.
The simplicity of this code is beautiful and I personally prefer this approach when dealing with concurrency in Python.
Multiprocessing in Python using multiprocessing.Pool
Another way to reach is to spawn processes instead of threads. This causes overhead to some extend but also leads to sparing the global interpreter lock (GIL).
The straightest way to reach this is by using the Pool function of the built-in multiprocessing module:
As we saw in an earlier example we again need a list of arguments and a factor of concurrency. In this example it is again or list of URLs up to a certain point and 4 processes. First we need to define a Pool() which will define how many processes can be spawned at once. Next all we need to do is simply map our function to our arguments list.
Please note that you have to close these processes, otherwise they will be kept in background as long as your program runs which might even result into system failure if you spawn a very large number of processes.
As always I used this as an opportunity to apply the clean code principles to these code examples. If you think I missed something I would be glad to hear your opinion and improve!
Comparing the Python parallelization techniques
Last but not least and maybe the most interest, we want to look into how our execution time actually changes by applying these techniques. As we covered some basic technical details earlier, you should be able to at least give some educated guesses on how the execution time evolves.
To measure this I wrote a small package “function_timing“, you may unconditionally use this as it is currently available under the MIT-licence. I would enjoy it even more if you even contribute to this package.
Letts jump right into it:
The timer instance takes the functions, the arguments of the runs and the number of runs as construction variables. In this case we will make runs of increasing number of URLs from 1 URL to up to 16 in the end and measure the runtime of the function.
The result is pretty much what we should have expected:
The two fastest concurrent version of the crawler are the threaded ones, as they produce the least overhead. Next at about 7 URLs the multiprocessing version starts also to become faster than the single executed crawler.
This concludes my report about concurrency in Python. Thanks for reading this article. I hope I neither bored you with technical details nor disappointed you with missing information. However, if there is anything that is left unclear or I may improve on, don’t hesitate to leave a comment or contact me directly.
I am a computer scientist and entrepreneur from Germany. I chose to work in computer science because I love building things and improve people’s lives.
Our Book Recommendations on Python (Affiliate)