Concurrency saves time by using multiple CPU cores. Learn about multithreading and multiprocessing in Python for efficient computing

Why should we use concurrency?

When you execute a set of instructions on a CPU, it takes a given amount of time by default. The major advantage of this concurrency is, that in theory and in a linear example we save 50% of computing time if we use 100% more CPU cores.


To understand how the concept works and which possibilities we have to improve our computation speed, we need a few terms clarified. If you feel confident about these terms, you may skip the first few headings and select one of the three possibilities I introduce:

When to apply concurrency?

Generally spoken you can apply concurrency whenever you can execute your code parallelly. Imagine a simple example where you want to add 8 numbers.

To further simplifiy things, let's assume each addition would take a full second. So it would take 7 seconds to add these 8 number as we have to do 7 operations. How could you compute this in multiple threads? Well you could simple split these operations up into 7 additions of their own:

So in the first phase we would need 4 CPU cores in the second only 2 and in the last only 1. By applying this we theoretically save 4 seconds as this takes only 3 seconds to compute the 7 additions. We have to say theoretically as this is a insanely simplified example that doesn't cosider various problems that occur when applying this.

Nevertheless, as you might have noticed, you have to know pretty exactly what your actual computation task is to apply concurrency in a meaningful way. You also have to know, if your concurrent computation is mathematically valid. We get into trouble if we just do the same with:

What is a process?

A process or task is a program in execution. It is a specific instantiation of a program on a system and has additional information besides its actual code. To execute it, your program must be translated to instructions your CPU is able to understand. This is something you usually don't have to care about as compilers and interpreters do this for you.

But to actually execute it, your translated code still has to move to the CPU somehow. This happens by loading the information into the main memory and from there to your CPU Theres actually happening a lot more but it's way too much for this article, you may read it up here:

Depending on which operating system you use, your system will assign a fixed or variable amount of main memory to this process wherein its information will be stored.

Process information

The allocated memory usually is splitted into four crucial segments:

  • Heap: Contains dynamically assigned memory for storing relevant information
  • Stack: Contains jump labels and local variables
  • Data: Contains global variables
  • Text: Contains the actual instructions that should be executed on your CPU

Despite the fact that you do not have infinite main memory heap and stack may not increase infinitely either. They usually share an allocated size of memory. You may have heard the term „stack overflow“ and now you know what is meant with it. When your program code is stored in the main memory the so called scheduler will assign CPU computation time to it and then it may finally be executed (Not fully, just as long as its computation time allows).

As this shouldn't go too much into boring details, I will not elaborate further here. If you want to know more on this, please leave a comment and I will try to write a more detailed version.

What is a thread?

A thread is a part of a program that can be executed concurrently. A process may contain various threads. There are two major classes of threads:

  • Kernel threads: the operating system supervises them
  • User-Threads: a program must supervise them

A thread shares the data- text-segment with all other threads of the same process. Threads can be executed on multiple CPU cores. However, threads also have some additional information of their own:

  • Stack
  • Program Counter

The stack usually is stored in the memory allocated for the stack of the process. Whereas the program counter is independent. These terms clarified we may finally come to the actual concepts of applying parallelization.


One way to apply concurrency we somehow already covered a bit is multithreading. There is some confusion about this term as there is hardware and software based threading.

If we talk about hardware threading we mean a single CPU core that may execute more than one thread at once. This is often called a logical processor or thread. Modern Intel and AMD CPUs mostly have two logical processors on each physical core. So for example an AMD Ryzen Threadripper 3970x (Affiliate Link) with 32 physical cores has 64 logical ones which means it may execute up to 64 threads at once on a single CPU.

If we talk about software based threading we mean that we give specific instructions on how our program may be executed on multiple (logical) processors.


Another way to reach parallelization of computation is to simply start multiple processes instead of threads. An obvious disadvantage of this method is, that this will produce a lot of redundant data (overhead) in the main memory, if you simply produce multiple process instances of the same job as they don't share their memory anymore. However, this might sometimes be intended as you may see when you for example encounter the global interpreter lock (GIL) in Python.

A problem for concurrency

As a basis for looking deeper into how to apply concurrency in Python we need an example that isn't too far from reality. Therefore I took a simple crawler that visits websites and returns their content.

import requests
def get_content(url: str) -> str:
get = requests.get(url)
return get.text[:100] # Returns the first 100 characters
def crawl(url: str) -> str:
content = get_content(url)
# do something with the content
return content

To do the actual HTTP-GETs we use the requests library. Our crawl function takes an URL as an argument and will return the first 100 characters of the pages HTML content as a string via get_content().

Next we need some URLs to run the crawler on. To simplify things further, we will take advantage of a Python feature and use a list of 4 URLs and multiply it by 4 to get a list of 16 URLs:

URLS = ["",
""] * 4

The usual approach to run this would simply be, to call the function with an URL:

from concurrency.crawler import crawl
from concurrency.URLS import URLS
def non_parallel_crawling(adresses):
results = list()
for url in URLS[:adresses]:
return results

As we got a baseline, now we can look further into how to apply the different techniques.

Multithreading in Python using threading.Thread

As elaborated earlier our problem at hand is to speed up a crawler by applying concurrency. I prepared some simplified code to pose how this could possibly be realized via the threading.Thread class:

from threading import Thread
from concurrency.URLS import URLS
from concurrency.crawler import crawl
def fourwise_iterartion(iterable):
x = iter(iterable)
return zip(x, x, x, x)
class CrawlThread(Thread):
def __init__(self, address: str):
self.address = address
self.terminated = False
def run(self):
return crawl(self.address)
def multithread_crawling(adresses: int) -> list:
results = list()
# Results into 4 parallel threads at once
for url1, url2, url3, url4 in fourwise_iterartion(URLS[:adresses]):
threads = [CrawlThread(url1), CrawlThread(url2), CrawlThread(url3), CrawlThread(url4)]
for thread in threads:
for thread in threads:
# Wait for results to synchronize
return results

In this example we will run 4 threads concurrently which is why we need a function that will cut our URL list into tuples of the length 4. This is realized in the function fourwise_iteration() which uses the built-in functions iter and zip.

Next we need our actual Thread class. We reach this by introducing our own class CrawlThread that extends the Thread class and overwrites it's init() and run() function. In our init function we pass all the variables we require later on. In our run function we define what should be executed when the thread is started. For further information please look into the threading documentation.

Next we have our multithread_crawling function that actually runs 4 threads to crawl 4 URLs nearly simultaneously. Please note that our result list will contain tuples of length 4 too.

Multithreading in Python using joblib

A pretty easy and lucid way to do this is by using the joblib module:

from joblib import Parallel, delayed
from concurrency.crawler import crawl
from concurrency.URLS import URLS
def joblib_crawling(adresses: int, n_jobs: int = 4) -> list:
results = Parallel(n_jobs=n_jobs)(delayed(crawl)(url) for url in URLS[:adresses])
return results

The syntax is straight forward and there is barely anything to explain. To run joblib we need a list of arguments that is provided in URLS[:adresses]. So for example if our variable addresses is 4 this will provide the first 4 URLs from the list URLS.

The n_jobs parameter of Parallel defines how many jobs we want to run concurrently. Next in delayed we provide the function we want to run, please note that we don't use any brackets there as we want to pass the function and not the result of the function. Following that we do a simple for loop to pass our URLs to the crawl function via Parallel.

The simplicity of this code is beautiful and I personally prefer this approach when dealing with concurrency in Python.

Multiprocessing in Python using multiprocessing.Pool

Another way to reach is to spawn processes instead of threads. This causes overhead to some extend but also leads to sparing the global interpreter lock (GIL).

The straightest way to reach this is by using the Pool function of the built-in multiprocessing module:

from multiprocessing import Pool
from concurrency.crawler import crawl
from concurrency.URLS import URLS
def multiprocess_crawling(adresses: int, n_processes: int = 4):
process_pool = Pool(processes=n_processes)
results =, URLS[:adresses])
return results

As we saw in an earlier example we again need a list of arguments and a factor of concurrency. In this example it is again or list of URLs up to a certain point and 4 processes. First we need to define a Pool() which will define how many processes can be spawned at once. Next all we need to do is simply map our function to our arguments list.

Please note that you have to close these processes, otherwise they will be kept in background as long as your program runs which might even result into system failure if you spawn a very large number of processes.

As always I used this as an opportunity to apply the clean code principles to these code examples. If you think I missed something I would be glad to hear your opinion and improve!

Comparing the Python parallelization techniques

Last but not least and maybe the most interest, we want to look into how our execution time actually changes by applying these techniques. As we covered some basic technical details earlier, you should be able to at least give some educated guesses on how the execution time evolves.

The result is pretty much what we should have expected:


The two fastest concurrent version of the crawler are the threaded ones, as they produce the least overhead. Next at about 7 URLs the multiprocessing version starts also to become faster than the single executed crawler.

This concludes my report about concurrency in Python. Thanks for reading this article. I hope I neither bored you with technical details nor disappointed you with missing information. However, if there is anything that is left unclear or I may improve on, don't hesitate to leave a comment or contact me directly.