Adventures in code performance

Why your code is slow and how to make it faster.

2020-05-06

Preface

Recently I had to solve a very computationally expensive problem at work. Our existing tool got the job done, but far too slowly. A rewrite brought the execution time down from hours to seconds. But how, and why?

In this post, we will explore some of the reasons why your code might not be as performant as you'd like it to be and suggest some methods for speeding it up. The intended audience is scientists and developers who perform large amounts of data processing in interpreted scripting languages, particularly Python and R. In fewer words, "data people".

I'll be sacrificing a tiny bit of accuracy for simplicity. The goal is not to become a performance expert, but to know what to look out for. If you're able to apply what you learn in this post to reduce just one slow script's execution time by an order of magnitude or two, I will be incredibly happy.

Introduction

The mission

As part of a project, I required a mapping of one type of geographical area to another. These weren't basic areas like cities or countries, so let's call them X values and Y values respectively. At the last minute, I was informed that there was no mapping for approximately 500 of the X values.

Thankfully, we had a mapping of postal codes to X values and shapefiles delineating the boundaries for the Y values. That would allow me to approximate the mapping by assigning each X value to the Y value in which most of its postal codes lay. Some X values near boundary lines could be miscategorised, but it would be good enough.

The issue? I got the news on Sunday but the larger project was due by Wednesday morning. Oh, and I had approximately 50,000 postal codes, hundreds of shapefiles, and an absurd number of edges per shape. The problem was solvable, but my biggest restriction was time.

Houston, we have a performance problem

We already had an internal tool, written in Python, to assign geographical coordinates to shapefiles. (I also didn't have geographical coordinates for the postal codes, but getting those was the easy part.) X values changed over time, so it was unfortunate but not rare when the client did not have an X to Y mapping for some of the data.

However, the tool wasn't very efficient. For each row, it read the data from an input CSV into a Python dictionary (a hash table). It then built each output line by concatenating values and commas (not even using str.join), and appended every line to an ever-growing string that was written to a CSV at the end.

Normally we only needed to map a handful of X values so this did the job. But in this case, I was sure it wouldn't finish in time. It also wasn't general enough to handle the specifics of my problem, so I spent a few hours rewriting it in Python using Pandas. My rewrite was much faster, and in the end it only took... Several hours?! You've got to be kidding me.

Why does performance matter?

There are two main reasons why I think data people should care about performance.

Firstly, you don't want to be waiting for your code to finish running if you need results as soon as possible. Maybe this is because you have a tight deadline, but maybe it's because waiting 30 seconds all the time adds up. If your workflow is slower, you do less work; this is an opportunity cost. You also don't want your code to be the bottleneck in a larger workflow if someone or something else is waiting for your code to finish.

Secondly, heavy resource usage limits your computer's ability to handle other tasks. Less CPU availability means less computation can happen at the same time. Less RAM availability means more swapping to disk, which can cause major slowdowns. On shared computers, this can negatively affect other users' work. As well, CPU and RAM requirements can increase the cost of hosted machines.

It sucks to be single

No, I'm not talking about relationships.

Your computer's brain

Processing is done in the Central Processing Unit (CPU). Your computer could have several CPUs, each of which could have several cores. On some Intel CPUs, each physical core can appear as two logical cores to the operating system. This is called hyperthreading. Normally when you execute your code, that process runs on one core.

Want to know how many CPUs and/or cores you're working with?
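
In Python, a quick way to check is the standard library's os module; the third-party psutil package (assuming you have it installed) can additionally distinguish physical from logical cores. A quick sketch:

    import os

    # Logical cores visible to the operating system
    # (includes hyperthreaded cores).
    print(os.cpu_count())

    # psutil is a third-party package; logical=False counts
    # only physical cores.
    import psutil
    print(psutil.cpu_count(logical=False))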

What I'm calling "cores" here are also commonly called CPUs. This is completely valid, but I'm distinguishing the terms to make things simpler. However, you should keep this in mind so you're not confused if you encounter this alternate terminology.

Multithreading and multiprocessing

Multithreading allows a single process to run multiple threads concurrently. This is like writing code while answering emails. You have to switch between them as necessary because you can't do both at once.

Multiprocessing allows multiple processes to be run in parallel. This is like writing code while listening to music. You don't need to switch because you can do both at the exact same time.

Multiprocessing has greater overhead than multithreading. This is because multithreading shares memory between threads but multiprocessing requires a separate memory space for each process. You can also have multiple processes running multiple threads.

Why is your code slow?

Most programs are either I/O-bound or CPU-bound. This means that they're either wasting time waiting to read/write data or they're performing time-consuming computations. Your program can be slow because of both I/O and CPU, but odds are one is limiting you more than the other. At scale, it tends to be the weakest link that really messes things up. But never fear, there are ways to get past this!

I/O-bound programs can be sped up by multithreading. While one thread is blocked waiting for I/O, another thread can make progress, so the waiting overlaps instead of adding up. Multiprocessing will also help, but with more overhead.
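
Here's a minimal sketch of what that looks like in Python (the URLs and fetch function are placeholders, not from the original tool):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    urls = ["https://example.com"] * 10  # stand-in URLs

    def fetch(url):
        with urlopen(url) as response:
            return response.read()

    # While one thread is blocked waiting on the network,
    # the others keep making progress.
    with ThreadPoolExecutor(max_workers=8) as pool:
        pages = list(pool.map(fetch, urls))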

Okay, but we're data people! Sometimes we send lots of requests to gather data, but often we're just math-ing real hard on a bajillion numbers. CPU-bound programs can be sped up by multiprocessing... But in Python, this doesn't work as well as you might expect.

The global interpreter lock (GIL)

The GIL (pronounced "gill") is a lock (or mutex) that prevents more than one thread from executing Python bytecode at the same time. As a result, multithreaded programs are effectively single-threaded whenever they're executing Python bytecode: only the thread holding the GIL can make progress. Not only does each Python process need its own memory space, it needs its own Python interpreter too.

So why have the GIL at all? Python's memory management isn't thread-safe, so the GIL enforces a guarantee that only one thread can be dealing with the memory at a time. The GIL fits right into the language philosophy of Python: easy, safe, single-threaded programming.

Guido van Rossum, who created Python and was its "benevolent dictator for life" (BDFL) until mid-2018, has set very clear requirements for removing the GIL. In his 2007 article "It isn't easy to remove the GIL", he wrote:

I'd welcome [removing the GIL] only if the performance for a single-threaded program (and for a multi-threaded but I/O-bound program) does not decrease.

Let's be honest, a "Gilectomy" isn't happening any time soon, if ever. People have been working on ways to remove the GIL for years and years, but the most recent and ambitious effort hasn't worked out.

Our issue? We're data people! Our code is often CPU-intensive, not I/O-intensive. So while we can speed up our CPU-bound Python code with multiprocessing, the extra overhead means the speedup won't scale linearly with the number of cores. More simply, using twice as many cores might make your code faster, but it won't make it twice as fast. If you can divide your program easily into a small number of separate chunks, multiprocessing might still be worth it for you.
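
For instance, here's a minimal multiprocessing sketch using the standard library (crunch is a stand-in for your real workload):

    from concurrent.futures import ProcessPoolExecutor

    def crunch(n):
        # Stand-in for an expensive, pure-Python computation.
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        # Each chunk runs in its own process, with its own
        # memory space, interpreter, and GIL.
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(crunch, [10_000_000] * 4))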

I must admit that I've been lying by omission. A language isn't the same thing as a language implementation. When I've been saying "Python", I've specifically been referring to the original and most common implementation, CPython. IronPython and Jython have no GIL at all. Cython, which we'll talk about later, allows you to acquire and release the GIL as desired. Most people just use CPython, so I'm going to continue saying "Python" to avoid confusing readers with the distinction between the language and the implementation.

Making the best of it

If multiprocessing doesn't make sense for our problem or we don't want to deal with it, is there anything else we can do?

Interpreted versus compiled languages

Something Python and R have in common is that they're both interpreted languages. An interpreter tries to directly understand each statement step-by-step. This means that if a function is run a million times, it's interpreted a million times. In compiled languages, a compiler translates the high-level code into low-level code before execution. This means that even if a function is run a million times, it's compiled once.

There are various advantages and disadvantages to each approach. When performance matters, you want your code to execute as low-level as reasonable. This is because more information known at compilation means less overhead during execution. As a result, Python and R are relatively slow out of the box. What's "reasonable" will depend on a variety of factors including available development time, knowledge shared by the team, and ease of maintenance, in addition to performance concerns.

The line between interpreted and compiled languages isn't clear. For example, Python is first compiled into bytecode and then that bytecode is interpreted. From this view, Python can be considered both compiled and interpreted, although it's commonly described as interpreted. Similarly, Java is compiled into bytecode which is interpreted by the Java Virtual Machine, even though Java is commonly described as compiled. This is a deep rabbit hole I won't get into in this post, so we'll stick with the "hand-wavey" definition for now.

Python, but make it compiled

Just like Hannah Montana, it's possible to get the best of both worlds with Python, to an extent. There are many projects in the Python ecosystem that enable compiling Python code. I'm going to quickly introduce some of the more stable and easy-to-use ones.

Cython, the Python implementation I briefly mentioned before, is a superset of the Python language that generates C code and lets you call C functions and declare C types. Because it's a superset of Python, it can generate C from both regular Python code and the extended Cython language. This is cool because it means you can start by simply compiling your existing code and add Cython features as you get comfortable. Note that since Cython is an extension to Python, the regular Python interpreter can't understand Cython code.

Numba is a just-in-time (JIT) compiler for basic Python and NumPy code. Like the name suggests, JIT compilation is when code is compiled just in time for execution. The first execution will be slow, but subsequent ones will use the pre-compiled version. If your program is slow because you're executing some intensive code over and over, JIT compilation means the compilation cost is paid once and every later call runs at compiled speed. Numba is awesome because it's super easy to implement. All you need to do is put a decorator above the function you want to JIT compile and, as long as it meets the requirements, it'll happen! It's also part of Anaconda, so if you have that then you already have Numba.
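
A minimal sketch of what using Numba looks like (the function here is a toy example, not my actual assignment code):

    import numpy as np
    from numba import njit

    @njit  # compiled to machine code on first call
    def sum_of_squares(xs):
        total = 0.0
        for x in xs:
            total += x * x
        return total

    data = np.random.rand(10_000_000)
    sum_of_squares(data)  # first call pays the compilation cost
    sum_of_squares(data)  # subsequent calls run at compiled speed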

NumPy's core is implemented in C. The critical parts of Pandas, which is built on top of NumPy, are also implemented in C. These libraries are very useful and powerful for many scientific computing and data processing problems, and are definitely worth exploring. If you're already using them, make sure you're offloading execution to those libraries as much as possible. For example, avoid looping through NumPy arrays and Pandas dataframes, and try to vectorise your code where possible.
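
To illustrate the difference (a toy example):

    import numpy as np

    data = np.random.rand(1_000_000)

    # Slow: a Python-level loop interprets bytecode
    # for every single element.
    total = 0.0
    for x in data:
        total += x * x

    # Fast: one vectorised expression runs entirely
    # in NumPy's compiled C code.
    total = np.sum(data * data)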

R, but make it better

I've talked a decent bit about Python, but what about R? Well, the R language philosophy is to make stats and analysis easy, not to be fast or light. In fact, the R core team has been known to reject proposals that improve performance if they negatively affect ease of use.

A notable performance issue with R is that it stores all data in RAM at once. As a result, in addition to worrying about CPU and I/O bounds, you also need to worry about memory bounds! Processing large datasets on a single machine can be very slow, and impossible if the data is larger than the available RAM. Even if your code runs fine, eating up RAM can force other processes to swap or fail.

I'm not an R developer so I can't give as detailed recommendations as for Python. However, there are some general rules that you can follow. Firstly, try to use R packages that are performant. There are a lot of packages out there in the CRAN repository, but they're not all created equal. For example, dplyr is a more performant alternative to plyr. Otherwise, try to vectorise your code when possible, and avoid copies and appends to minimise moving memory around. R also ships with a bytecode compiler and built-in parallelism support, so consider making use of those features as well.

Identifying pain points

Before you begin optimising your code, you need to know which parts of your code require optimisation. This means figuring out which parts are taking up the most time and/or memory. A timer simply records the amount of time your code takes to run, sometimes distinguishing real time (or "wall time") versus CPU time. A time profiler is basically a more granular timer that records the time spent in each part of the code. A memory profiler is like a time profiler except it records memory usage in each part of the code.
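
In Python, for example, the standard library can measure both kinds of time (a minimal sketch with a stand-in workload):

    import time

    start_wall = time.perf_counter()  # real ("wall") time
    start_cpu = time.process_time()   # CPU time for this process

    sum(i * i for i in range(10_000_000))  # stand-in workload

    print("wall:", time.perf_counter() - start_wall)
    print("cpu: ", time.process_time() - start_cpu)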

If you're a data scientist who writes Python and/or R, chances are you use Jupyter. Jupyter makes timing and profiling individual lines or cells easy! It provides this through "magic commands", built-in special commands which start with %:
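
    %time    times a single run of a statement
    %timeit  runs a statement many times for a more reliable estimate
    %prun    profiles a statement with Python's built-in profiler
    %memit   measures a statement's memory usage (after loading the
             memory_profiler extension with %load_ext memory_profiler)

Prefix a magic with %% instead of % (for example, %%time) to apply it to an entire cell rather than a single line.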

Don't use Jupyter? An easy place to start is with flame graphs. These graphs visualise a profiler's output, showing which call stacks your program spends the most time in and how deep they go. Flame graphs can be generated for any language with stack traces, so there are plenty of tools to choose from.

Optimising my Python code

Before worrying about optimising my code, I needed to clean it up. I'd written it in a hurry and wanted it to be well-structured and commented before pushing it to the relevant repository. While doing so, I noticed a simple change I could make to the area assignment algorithm which would save us from checking some of the incorrect assignments. Sometimes it can be useful to take a second look at your algorithms before you start optimising.

I don't know exactly how long the original Python code took to execute, but I'd estimate it at between 4 and 5 hours. Post-cleaning, it only took approximately 2 hours and 45 minutes. I'll call this the "initial" code.

For the optimisations, I chose to keep things simple. In my specific case, the bottleneck was caused by the assignment function executing hundreds of times on average for each of the 50,000 postal codes. I knew that reducing the amount of computation performed for these assignments would make a big difference. I didn't want to spend more than a few minutes of my work day on this, and I didn't want to make the program more complicated to understand or maintain. Otherwise, I would have considered improving the algorithm.

I used Numba to JIT compile the assignment code so that each execution would be much faster. I also changed the part where I assigned each postal code's Y value from a df.iterrows loop to a df.apply statement so that I was modifying the dataframe once instead of 50,000 times. The reason I didn't do this initially is that the assignment function actually needed to return multiple values per assignment. For the record, zip(*df.apply(lambda x: foo(x['bar']), axis=1)) will structure things correctly.
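
Concretely, the pattern looks something like this (a sketch with hypothetical column names and a toy function standing in for the real assignment logic):

    import pandas as pd

    df = pd.DataFrame({"bar": [1, 2, 3]})

    def foo(bar):
        # Stand-in for the assignment function,
        # which returns two values per row.
        return bar * 2, bar * 3

    # apply returns a Series of tuples; zip(*...) unpacks them
    # into one sequence per output column.
    df["y1"], df["y2"] = zip(*df.apply(lambda x: foo(x["bar"]), axis=1))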

After optimising, my code was nearly 200 times faster than the initial code, taking a little over 50 seconds to run. The optimisations were so simple to implement that there's little argument against trying. However, the optimised code used approximately 480 MB of memory while the initial code only used approximately 360 MB. This is because JIT compilation requires more memory for reasons that I won't get into here. By the way, I ran these programs multiple times each to get better estimates of time and memory usage.

Putting the pedal to the metal

That was a pretty awesome speedup already, and more than good enough for our needs. However, this problem was particularly easy to optimise because it just had one very dramatic bottleneck. What would I do if all my code was slow? I was nerd sniped and took this problem as an opportunity to explore further on my own time. Yes, this is my idea of a fun evening project.

Why not just write compiled code?

JIT compiling the assignment function and making use of a library implemented in C (Pandas) gave me a huge speedup because all the hard stuff was compiled. I found myself wondering how fast this code could be if it was written in a compiled language in the first place. I was also curious how much memory we could save, especially compared to using a JIT compiler.

Optimising interpreted code can take a long time and may not be intuitive or easy. This is especially true if your code has multiple bottlenecks or is simply a little too slow in too many places. And despite optimising, you will still have the overhead of the interpreted language slowing you down. Instead of spending effort trying to optimise parts of the code to bring them close to low-level performance, why not write in a compiled language instead and get that performance out of the box?

The two language problem

When it comes to programming languages, data people have historically had to choose between two options:

  1. Languages that are slow to code but fast to run, such as C, C++, C#, and Java.
  2. Languages that are fast to code but slow to run, such as Python, R, and Matlab.

As a result, they often write their initial code in high-level languages then rewrite that code in low-level languages later. This is known as the "two language problem".

I personally experienced this while working at a scientific research lab. Researchers developed new features in a desktop Python application, while developers like me mimicked these features in C++ that ran on ultrasound machines. We had more researchers than developers, so the Python prototype was light-years ahead of the C++ implementation.

As you can imagine, this is a huge waste of time. We shouldn't have to choose between creativity and performance. Thankfully, more modern languages are bridging the gap! As a bonus, some of these languages have better concurrency and parallelism support, which helps us worry less about finding optimisations.

Other language options

Julia is a faster alternative to Python, R, and Matlab. It also has native parallelism if you need even more speed. It's compiled so it delivers low-level performance out of the box, but it's JIT compiled so it still feels like an interpreted language. In fact, Julia is the "Ju" in Jupyter notebooks alongside Python and R. Julia is specifically aimed at solving the two language problem in scientific computing and data analysis. Its ecosystem includes visualisation, dataframes, machine learning, and web deployment.

Go is a simpler and natively-concurrent alternative to C# and Java. All you need to do is write go before a function call to run it concurrently as a goroutine. Goroutines are much lighter than threads because they have a dynamic stack size which starts at just 2 KB; default thread stack sizes depend on the OS, but are often 2 MB. Goroutines are also managed by Go instead of the OS, so the scheduler won't get in your way. Go has a lot of other concurrency features built in, including channels and wait groups. Additionally, gofmt auto-formats your code to avoid "fingerprints" from individual coding styles.

Rust is a safer and higher-level alternative to C and C++. It has no runtime environment and impressive compiler optimisations, providing C-like speed. It's guaranteed to be memory safe and free of data races. If your code compiles, you can be fairly confident it's correct... Aside from logic errors, which is why you should always test your code anyway. The compiler errors are awesome. They clearly explain the problem and suggest fixes, and visually show you exactly where in a line of code there was an error.

Rewriting my code in Go

I chose to rewrite my code in Go because it's most relevant to me as a backend-ish developer. Go doesn't have as many built-ins as Python, so I found myself implementing lower-level code like determining the most frequent element in a list. Although the code was much longer, development time was quick because the compiler immediately alerted me to issues with my code. Parallelising the code in a way that prevented data races was totally painless.

On one core, the Go code took approximately 15 seconds and used approximately 70 MB of memory. This was over 660 times faster than the initial Python code, and almost 3.5 times faster than the optimised Python code. It also used over 5 times less memory than the initial Python code and almost 7 times less memory than the optimised Python code. I created exactly one goroutine per core, and the overhead on both speed and memory usage was practically unnoticeable. As a result, the run time was roughly 15 seconds divided by the number of cores used, putting the overall speedup in the thousands compared to the initial code.

If your CPU has hyperthreading, the run time will be about twice what you might expect. This is because 4 physical cores on a hyperthreaded CPU appear to the operating system as 8 cores. If you create 8 goroutines, they will be competing for the 4 actual hardware cores: you will have 8 concurrent goroutines, but only 4 running in parallel at any point in time. You could detect hyperthreading and halve the number of goroutines accordingly, but I don't think it's worth it, since two competing threads can switch whenever one stalls on a cache miss, branch misprediction, or data dependency.

No language is perfect for all problems

Just as Python and R have their drawbacks, so do Julia, Go, and Rust. It's important to understand the strengths and weaknesses of each language to know when it is and isn't appropriate to use them. You should be especially careful not to introduce new languages into codebases and teams where they don't make sense or would not be comfortably adopted. For this reason, I added both my Go code and my optimised Python code to our utilities repo.

It's possible for programs in one language to call code in another language by using a foreign function interface (FFI). FFI calls add some overhead, but the mechanism is typically pretty simple. Using an FFI, you'd be able to rewrite just the slow parts of your code in a faster language without diving all the way in. Conversely, you'd be able to call a higher-level language from a lower-level one if you needed that abstraction or interoperability.
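
For example, Python's built-in ctypes module can call into a compiled C library directly (a minimal sketch that assumes a Unix-like system where the C math library can be located):

    import ctypes
    import ctypes.util

    # Locate and load the system's C math library
    # (the library name varies by platform).
    libm = ctypes.CDLL(ctypes.util.find_library("m"))

    # Declare sqrt's C signature so ctypes converts
    # values correctly at the boundary.
    libm.sqrt.argtypes = [ctypes.c_double]
    libm.sqrt.restype = ctypes.c_double

    print(libm.sqrt(2.0))  # 1.4142135623730951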

Conclusion

My motivation in encouraging you to explore other languages is simply for you to expand your language toolset. If all you have is a hammer, everything looks like a nail. If all you know is one language, it's tempting to apply it to every problem even where it isn't optimal. I will probably never be a Julia developer and I haven't had a need for Rust in a work environment yet, but my Go knowledge has been invaluable.

That being said, don't be too quick to rush to a different language. Researching ways to make up for the shortcomings of the language you're already using can make a huge difference. If you don't know where to start optimising, profile your code to find your pain points. Additionally, don't be afraid to try multithreading and multiprocessing in applicable cases if speed is important. Finally, consider using FFIs to leverage the advantages of certain languages when you need them.