Code optimization in Python
Code optimization is a topic that might seem intimidating, and it can be. However, some optimization tips are easy to implement at a basic level. Here we will focus on for loops, and on how some Python structures and functions can significantly accelerate the runtime of your code.
For this tutorial, we will be using IPython, a Python interpreter with useful profiling features that allow us to evaluate the runtime of functions and code blocks. You can install IPython in a Conda environment with
conda install ipython. If you need help installing and activating a Conda environment, you can follow the tutorial here.
Conda is a system that allows us to install programs, packages, and modules in a project-specific manner. This is useful in our case because the installation is done in our user directory, so we don't need administrator privileges. Moreover, because it is project-specific, we can install several environments without worrying about packages in one project interfering with packages in another.
```shell
# Starting from scratch
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
# Follow the instructions

# Now add channels
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
```
That's it. If you let the installer initialize conda in your
.bashrc (highly recommended), you will be in your newly created environment every time you open your terminal. To check whether you are in it, you can enter
which python in your terminal; you should see the path to the Python interpreter in your environment.
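You can also check from inside Python itself: the standard library's sys.executable holds the path of the interpreter currently running, which should match what which python prints in the shell.

```python
import sys

# Path of the Python interpreter currently running.
# Inside an activated Conda environment, this points into
# that environment's directory tree.
print(sys.executable)
```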
First, let's see how the "time" magic works in IPython (these commands really are called "magics"). With IPython magics, you can run bash code, format output, and much more. In this tutorial, we are only going to use the
%%time magic, which outputs the time it took to run a block of code. Let's see it in action.
About code blocks and IPython
In an interpreter (Python shell, Perl shell, etc.), you usually run one line at a time. You type, you press Enter, it runs (except loops, "if" statements, and other compound statements, but bear with me). In IPython, "block" magics start with
%%, so when you type a
%%magic command, IPython understands that you want to enter a block, and instead of executing a line after you press Enter, it will show you another line. The code will execute after you press Enter two consecutive times on empty lines. Let's try an example. To launch IPython, you can type
ipython in a Linux or macOS terminal, or you can search for the application in the Windows Start menu.
```python
print("Hello world")  # press Enter
+++Hello world
print("Hola mundo")  # press Enter
+++Hola mundo
```
Now, let's run those commands with the %%time magic:

```python
%%time  # press Enter
print("Hello world")  # press Enter
print("Hola mundo")  # press Enter
# press Enter
# press Enter
+++Hello world
+++Hola mundo
+++CPU times: user 17 µs, sys: 4 µs, total: 21 µs
+++Wall time: 22.9 µs
```
In the first example, each line is executed immediately. If you run the second example in your IPython console, you will see that each line starts with
..., which means that you are still in the same block. As you can see, only at the end of the block are the lines executed, and the last two lines are the output of the
%%time magic, with information about the time it took. Now, down to business.
Optimizing the "for" loop
for loops run the same code for each element of a collection of objects. Let's say we need to raise a list of numbers to the second power. So the plan is to:
- Initialize list to store the results
- Implement a function that multiplies a number by itself
- Iterate over the list we want to apply the function to
- Apply the function to each element of the list
- Append the output to the results list
```python
# 1. Initialize list to store the results
results = []

# 2. Implement a function that multiplies a number by itself
def power_2(n):
    return n * n

# 3. In this case, the list is all the odd numbers from 1 to 1000
input_list = list(range(1, 1000, 2))

# 4. Iterate over the list we want to apply the function to
for i in input_list:
    # 5. Apply the function and append the output to the results list
    raised = power_2(i)
    results.append(raised)
```
Now, let's see how fast it runs:
```python
%%timeit
result_for = []
for i in input_list:
    raised = power_2(i)
    result_for.append(raised)
+++47.7 µs ± 562 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Ok, not too bad: 47.7 µs. By the way, here we are using the
%%timeit magic, which runs the block several times and reports the mean and standard deviation of all the runs. This approach gives more accurate timing estimates.
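If you are working outside IPython, the standard library's timeit module provides the same kind of repeated measurement. A minimal sketch, using a simple built-in expression as the timed statement:

```python
import timeit

# timeit.repeat mimics %%timeit: run the statement `number` times
# per measurement, repeat the whole measurement `repeat` times,
# and return the total time (in seconds) of each measurement.
times = timeit.repeat("sum(range(100))", number=10_000, repeat=7)

# Convert to microseconds per loop, as %%timeit reports
per_loop_us = [t / 10_000 * 1e6 for t in times]
print(f"best: {min(per_loop_us):.2f} µs per loop over {len(times)} runs")
```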
The first way of optimizing our code is to use a list comprehension. Basically, it is a Python construct that allows us to build a list from computed values, instead of initializing it first and appending values to it. Let's see it in action as we measure the time it takes to compute the same result.
```python
%%timeit
result_lc = [power_2(x) for x in input_list]
+++34.4 µs ± 381 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
The format of a list comprehension is easy to understand; you can actually read it in English as "compute power_2 of
x, for each element
x in the list
input_list". In more formal terms, you put the expression first, then the
for statement, and wrap everything in square brackets.
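To make the equivalence concrete, here is the same pattern on a tiny list, using a power_2 function like the one defined earlier:

```python
def power_2(n):
    return n * n

nums = [1, 2, 3, 4]

# Loop version: initialize, iterate, append
squares_loop = []
for x in nums:
    squares_loop.append(power_2(x))

# List comprehension: one expression builds the same list
squares_lc = [power_2(x) for x in nums]

print(squares_loop)               # [1, 4, 9, 16]
print(squares_loop == squares_lc) # True
```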
Here, the same calculation using a list comprehension was 13.3 µs faster, which is 28% faster than using a for loop. This can be quite significant, especially with longer input lists. However, we can optimize this code even more.
The map() construct allows Python to apply a function to each element of an iterable, and it returns a map object. We are not going to go into the details of the map object, because for our current application we are going to convert it to a list. Let's see how it performs:
```python
%%timeit
result_map = list(map(power_2, input_list))
+++25.4 µs ± 545 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
The format of the
map construct is quite simple. We just call the
map() function with a function as the first argument, and the list we want to apply the function to as the second argument.
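One detail worth knowing: map() is lazy. It returns an iterator that computes values on demand, which is why we wrap it in list() to materialize the results:

```python
def power_2(n):
    return n * n

m = map(power_2, [1, 2, 3])

# Nothing has been computed yet; m is an iterator, not a list
print(type(m).__name__)  # map

# list() consumes the iterator and computes every value
print(list(m))  # [1, 4, 9]

# The iterator is now exhausted: a second pass yields nothing
print(list(m))  # []
```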
That is quite an improvement! 22.3 µs, or 47% faster than the
for loop implementation, can be quite significant.
Bringing the big guns: multiprocessing in Python
Sometimes, with really big lists (e.g. genomes, entire gene annotations), we might have to use multiple processors or cores in our machine to run the calculations. For simple examples, this is actually quite easy. Let's use it on our power_2 function:
```python
# Before running the timeit magic, load the futures submodule
from concurrent import futures

%%timeit
with futures.ProcessPoolExecutor() as pool:
    r = list(pool.map(power_2, input_list))
+++60.7 ms ± 705 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Wait, what happened? This is actually slower than the
for loop implementation!! Why? This is supposed to be using all the cores in my computer!
Well, what happens is that when using more than one core, Python has to copy the data each worker needs over to that process, and then organize the results as they come back to the main process. Given that this is a relatively simple computation, the overhead of copying and organizing data actually takes longer than the computation itself. However, let's see an application where this approach can be useful.
GC content calculation
In this example, we will calculate the GC content of all the chromosomes and scaffolds of the human genome. For this, we will use the BioPython module, specifically the sequence parsers (SeqIO) and its GC calculator.
```python
from Bio import SeqIO
from Bio.SeqUtils import GC

genome = "/path/to/genome/Homo_sapiens.GRCh38.dna.primary_assembly.fa"

# Extract sequences from the fasta file
chromosome_seqs = [x.seq for x in SeqIO.parse(genome, "fasta")]
```
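As an aside, if Biopython is not available, GC content (the percentage of G and C bases in a sequence) can be sketched in plain Python. This gc_percent helper is a hypothetical stand-in for Biopython's GC, shown only to make the computation concrete:

```python
def gc_percent(sequence):
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = str(sequence).upper()
    if not seq:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)

print(gc_percent("ATGC"))  # 50.0
print(gc_percent("GGCC"))  # 100.0
```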
For loop implementation
```python
%%timeit
# Initialize results list
gc_lst = []
# Go through the chromosomes of the genome
for sequence in chromosome_seqs:
    # Calculate GC for each sequence
    gc = GC(sequence)
    # Add the value to our results
    gc_lst.append(gc)
+++18.2 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
List comprehension implementation
```python
%%timeit
gc_lst_lc = [GC(x) for x in chromosome_seqs]
+++18.4 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
map() implementation

```python
%%timeit
gc_lst_map = list(map(GC, chromosome_seqs))
+++17.9 s ± 298 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Multiprocessing implementation

```python
%%timeit
with futures.ProcessPoolExecutor() as pool:
    res_fut = list(pool.map(GC, chromosome_seqs))
+++6.64 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Ok, now all of a sudden the striking differences between
for, list comprehension, and
map() have disappeared, and the multiprocessing implementation is almost three times as fast. What is going on? Well, the system spends most of its time reading each chromosome-long sequence from the sequence list, so it doesn't really matter how we optimize the calculation itself; we won't see an improvement because the bottleneck is in reading from the sequence list. However, in the multiprocessing implementation, one core can start processing one sequence while another core is reading the next one. That is why the multiprocessing implementation performs so much better.
As a general rule, a plain
for loop implementation is going to be the slowest code, and
map() implementations tend to be faster. However, it is important to keep in mind that list comprehensions can be useful because they are easy to read. On the other hand, if the code to be optimized cannot be neatly put in a function, list comprehensions (and
map()) can be difficult to implement. Multiprocessing is often a good optimization strategy only when the computations to be performed are very expensive, or when there is a lot of reading/writing involved. Finally, there are many libraries (e.g. SciPy, NumPy, scikit-learn) with highly optimized code that can be used directly, but that is outside the scope of this tutorial. My final suggestion is to try to write
map() at the beginning, and if it does not work, move to list comprehensions, and then to plain for loops.
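As a closing sketch, the three single-process approaches can be compared head to head with the standard timeit module (absolute numbers are machine-dependent):

```python
import timeit

def power_2(n):
    return n * n

input_list = list(range(1, 1000, 2))

def with_for():
    out = []
    for i in input_list:
        out.append(power_2(i))
    return out

def with_lc():
    return [power_2(x) for x in input_list]

def with_map():
    return list(map(power_2, input_list))

# All three produce identical results...
assert with_for() == with_lc() == with_map()

# ...so we compare only their speed
for fn in (with_for, with_lc, with_map):
    t = min(timeit.repeat(fn, number=1000, repeat=3))
    print(f"{fn.__name__}: {t / 1000 * 1e6:.1f} µs per call")
```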
Happy optimizing! :)