Code optimization in Python

Code optimization is a topic that might seem intimidating, and it can be. However, there are some optimization tips that are easy to implement at a basic level. Here we will be focusing on for loops, and how some Python structures and functionality can significantly accelerate the runtime of your code.

For this tutorial, we will be using IPython, a Python interpreter with useful profiling features that allow us to evaluate the runtime of functions and code blocks. You can install IPython in a Conda environment with conda install ipython. If you need help installing and activating a Conda environment, see the Conda section below.

Conda

Conda is a system that allows us to install programs, packages, and modules in a project-specific manner. This is useful in our case because the installation is done in our user directory, so we don't need administrator privileges. Moreover, because it is project-specific, we can set up several environments without worrying about packages in one project interfering with packages in another.

Here we are going to install Bioconda, a Conda distribution with repositories specialized for bioinformatics data analysis. The following steps are taken from the Bioconda website.

# Starting from scratch
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
# Follow the instructions

# Now add channels
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

That's it. If you let the installer initialize conda in your .bashrc (highly recommended), you will be in the base Conda environment every time you open your terminal. If you want to check that you are in it, you can enter which python in your terminal, and you should see the path to the Python in your environment.

First, let's see how the "time" magic works in IPython (these commands are really called "magics"). With IPython magics, you can run bash code, format output, and do many other things. In this tutorial, we are only going to use the %%time magic, which outputs the time it took to run a block of code. Let's see it in action.

About code blocks and IPython

In an interpreter (Python shell, Perl shell, etc.), you usually run one line at a time. You type, you press Enter, it runs (except loops, "if" statements, etc., but bear with me). In IPython, "block" magics start with %%, so when you type a %%magic command, IPython understands that you want to enter a block, and instead of executing a line after you press Enter, it will show you another line. The code executes after you press Enter on two consecutive empty lines. Let's try an example. To launch IPython, you can type ipython in a Linux or Mac terminal, or search for the application in the Windows Start menu.


    print("Hello world") # press Enter
    +++Hello world
    print("Hola mundo") # press Enter
    +++Hola mundo
  

Now, let's run those commands with the %%time magic:


    %%time # press Enter
    print("Hello world") # press Enter
    print("Hola mundo") # press Enter
    # press Enter
    # press Enter
    +++Hello world
    +++Hola mundo
    +++CPU times: user 17 µs, sys: 4 µs, total: 21 µs
    +++Wall time: 22.9 µs
  

In the first example, each line is executed immediately. If you run the second example in your IPython console, you will see that every line after the %%time line starts with ..., which means that you are still in the same block. Only at the end of the block are the lines executed; the last two lines are the output of the %%time magic, with information about how long the block took. Now, down to business.

Optimizing the "for" loop

for loops run the same code for each of the elements of a collection of objects. Let's say we need to raise a list of numbers to the second power. The plan is to:

  1. Initialize list to store the results
  2. Implement a function that multiplies a number by itself
  3. Iterate over the list we want to apply the function to
  4. Apply the function to each element of the list
  5. Append the output to the results list

    # 1.
    results = []
    # 2.
    def power_2(n):
      return n * n
    input_list = list(range(1, 1000, 2))  # all the odd numbers from 1 to 999
    # 3. Iterate over the input list
    for i in input_list:
      # 4.
      raised = power_2(i)
      # 5.
      results.append(raised)
  

Now, let's see how fast it runs


    %%timeit
    result_for = []
    for i in input_list:
      raised = power_2(i)
      result_for.append(raised)
    +++47.7 µs ± 562 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  

Ok, not too bad: 47.7 µs. By the way, here we are using the %%timeit magic, which runs the block several times and reports the mean and standard deviation of all the runs. This approach gives more accurate timing estimates.
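If you want the same kind of measurement outside IPython, the standard-library timeit module offers an equivalent. Here is a minimal sketch; the round and loop counts are illustrative, not the defaults %%timeit would pick:

```python
import statistics
import timeit

def power_2(n):
    return n * n

input_list = list(range(1, 1000, 2))

# Like %%timeit's "7 runs": repeat the measurement in 7 rounds,
# each round executing the statement 1000 times
times = timeit.repeat(
    "for i in input_list: power_2(i)",
    globals=globals(),
    repeat=7,
    number=1000,
)
per_loop = [t / 1000 for t in times]
print(f"{statistics.mean(per_loop):.2e} s ± {statistics.stdev(per_loop):.2e} s per loop")
```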

List comprehension

The first way of optimizing our code, is to use list comprehension. Basically, it is a Python construct that allows us to build a list with computed values, instead of initializing it first, and appending values to it. Let's see it in action as we measure the time the same result takes to be computed.


    %%timeit
    result_lc = [power_2(x) for x in input_list]
    +++34.4 µs ± 381 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

The format of a list comprehension is easy to understand; you can read it in English as "apply the function power_2 to x, for each element x in the list input_list". In more formal terms, you put the expression first, then the for clause, and wrap everything in square brackets.
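To make the equivalence concrete, here is the explicit loop next to its comprehension form, using the same power_2 and input_list as above:

```python
def power_2(n):
    return n * n

input_list = list(range(1, 1000, 2))

# Explicit loop: initialize, iterate, append
result_for = []
for x in input_list:
    result_for.append(power_2(x))

# Comprehension: the expression first, then the for clause, in brackets
result_lc = [power_2(x) for x in input_list]

assert result_for == result_lc
```

A comprehension can also carry an optional condition at the end, e.g. [power_2(x) for x in input_list if x > 500] keeps only the squares of elements greater than 500.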

Here, the same calculation using list comprehension was 13.3 µs faster, which is 28% faster than using a for loop. This can be quite significant, especially with longer input lists. However, we can optimize this code even more.

Map

The map() construct applies a function to each element of an iterable and returns a map object. We are not going to go into the details of the map object, because for our current application we are going to convert it to a list. Let's see how it performs:


      %%timeit
      result_map = list(map(power_2, input_list))
      +++25.4 µs ± 545 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

The format of the map construct is quite simple. We just need to call the map() function with the function as the first argument and the list we want to apply it to as the second argument.
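A couple of small examples (independent of the benchmark above) showing the shape of a map() call; note that the function is passed by name, without parentheses:

```python
def power_2(n):
    return n * n

# One iterable: apply power_2 to each element
squares = list(map(power_2, [1, 2, 3]))
print(squares)  # [1, 4, 9]

# map() also accepts several iterables, one per function argument
sums = list(map(lambda a, b: a + b, [1, 2], [10, 20]))
print(sums)  # [11, 22]
```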

That is quite an improvement! 22.3 µs, or 47% faster than the for loop implementation, can be quite significant.

Bringing the big guns: multiprocessing in Python

Sometimes, with really big lists (e.g. genomes, entire gene annotations), we might have to use multiple processors or cores in our machine to run the calculations. For simple examples, this is actually quite easy. Let's use it on our power_2() benchmark.


    # Before running the timeit magic, load the futures submodule
    from concurrent import futures
    %%timeit
    with futures.ProcessPoolExecutor() as pool:
      r = list(pool.map(power_2, input_list))
    +++60.7 ms ± 705 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  

Wait, what happened? This is actually slower than the for loop implementation!! Why? This is supposed to be using all the cores in my computer!

Well, when using more than one core, Python has to copy the data each worker process needs over to that process, and then gather the results back into the main process. Because this is a relatively simple computation, the overhead of copying and organizing data actually takes longer than the computation itself. However, let's see an application where this approach can be useful.

GC content calculation

In this example, we will calculate the GC content of all the chromosomes and scaffolds of the human genome. For this, we will use the BioPython module, specifically the sequence parsers (SeqIO) and its GC calculator.

Preliminary steps


      from Bio import SeqIO
      from Bio.SeqUtils import GC
      genome = "/path/to/genome/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
      # Extract sequences from fasta file
      chromosome_seqs = [x.seq for x in SeqIO.parse(genome, "fasta")]
    

For loop implementation


      %%timeit
      # Initialize results list
      gc_lst = []
      # Go through the chromosomes of the genome
      for sequence in chromosome_seqs:
        # calculate GC for each sequence
        gc = GC(sequence)
        # Add the value to our results
        gc_lst.append(gc)
      +++18.2 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

List comprehension implementation

      
      %%timeit
      gc_lst_lc = [GC(x) for x in chromosome_seqs]
      +++18.4 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
      
    

Map() implementation


      %%timeit
      gc_lst_map = list(map(GC, chromosome_seqs))
      +++17.9 s ± 298 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

Multiprocessing implementation


      %%timeit
      with futures.ProcessPoolExecutor() as pool:
        res_fut = list(pool.map(GC, chromosome_seqs))
      +++6.64 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

Ok, now all of a sudden the striking differences between for, list comprehension, and map() have disappeared, and the multiprocessing implementation is almost three times as fast. What is going on? Well, the system spends most of its time reading each chromosome-long sequence from the sequence list, so it doesn't really matter how we optimize the calculation itself; we won't see an improvement because the bottleneck is reading from the sequence list. In the multiprocessing implementation, however, one core can start processing one sequence while another core is reading the next one. That is why the multiprocessing implementation is so much faster.

Conclusion

As a general rule, a plain for loop implementation is going to be the slowest code, and map() implementations tend to be faster. However, it is important to keep in mind that list comprehensions can be useful because they are easy to read. On the other hand, if the code to be optimized cannot be neatly put in a function, list comprehensions (and map()) can be difficult to implement. Multiprocessing is often a good optimization strategy only when the computations to be performed are very expensive, or when there is a lot of reading/writing involved. Finally, there are many libraries (e.g. SciPy, NumPy, scikit-learn) with highly optimized code that can be used directly, but that is outside the scope of this tutorial. My final suggestion is to try map() first, and if it does not work, move to list comprehensions, and then to for loops.
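As a closing sketch, the three single-process approaches can be compared side by side with the standard-library timeit module (absolute numbers will vary by machine; the loop count here is just chosen to keep the run short):

```python
import timeit

def power_2(n):
    return n * n

input_list = list(range(1, 1000, 2))

def use_for():
    results = []
    for i in input_list:
        results.append(power_2(i))
    return results

def use_lc():
    return [power_2(x) for x in input_list]

def use_map():
    return list(map(power_2, input_list))

# All three produce identical results
assert use_for() == use_lc() == use_map()

for fn in (use_for, use_lc, use_map):
    t = timeit.timeit(fn, number=2000)
    print(f"{fn.__name__}: {t:.4f} s for 2000 runs")
```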
Happy optimizing! :)