# Not-so-casual Performance Optimization in Python

My previous post (which was honestly created to test out the theme for this site), provided a few code snippets that computed $N$ terms of the sum of inverse squares. I wrote the code in my 4 favorite languages—Python, C, Rust, and Haskell—but when I ran the Python code, it was embarrassingly slow. Compared to the $\approx 950$ ms it took sequential Rust, Python took 70 seconds! So, in this post, we’re going to attempt to get Python some more reasonable numbers.

## The Original Solution

```
def basel(N: int) -> float:
    return sum(x**(-2) for x in range(1, N))
```

This is the most *Pythonic* way to do the task: an easy-to-read, single
line of code built on a generator expression. Let’s time it, with $N=10^8$
instead of $10^9$ like the last post (since I don’t want to wait that long):

```
import time

# Time the input function
def time_function(func):
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        execution_time = end_time - start_time
        print(f"Function '{func.__name__}' executed in {execution_time:.6f} seconds.")
        return result
    return wrapper
```

```
>>> f = time_function(basel)
>>> f(100000000) # 10^8
Function 'basel' executed in 6.641589 seconds.
1.644934057834575
```

Let’s try rewriting it, but less *Pythonic*, using a `for` loop:

```
def basel_less_pythonic(N: int) -> float:
    s = 0.0
    for x in range(1, N):
        s += x**(-2)
    return s
```

```
>>> f(100000000) # 10^8
Function 'basel_less_pythonic' executed in 5.908466 seconds.
1.644934057834575
```

Interesting. The less idiomatic way to write it actually improved performance.
Why? Let’s investigate using `dis`, the Python disassembly module.

Here’s the original version. I’ve added some helpful comments:

```
# Load variables
00 LOAD_GLOBAL 0 (sum)
02 LOAD_CONST 1 (<code object <genexpr> at 0x104f42e40>)
04 LOAD_CONST 2 ('basel.<locals>.<genexpr>')
# Create the function
06 MAKE_FUNCTION 0
# Create `range` object
08 LOAD_GLOBAL 1 (range)
10 LOAD_CONST 3 (1)
12 LOAD_FAST 0 (N)
14 CALL_FUNCTION 2
# Convert to iterator
16 GET_ITER
# Call generator
18 CALL_FUNCTION 1
# Call sum function
20 CALL_FUNCTION 1
# Return
22 RETURN_VALUE
Disassembly of <code object <genexpr> at 0x104f42e40>:
# Essentially a `for` loop with yield
00 GEN_START 0
02 LOAD_FAST 0 (.0)
04 FOR_ITER 7 (to 20)
06 STORE_FAST 1 (x)
08 LOAD_FAST 1 (x)
10 LOAD_CONST 0 (-2)
12 BINARY_POWER
14 YIELD_VALUE
16 POP_TOP
18 JUMP_ABSOLUTE 2 (to 4)
20 LOAD_CONST 1 (None)
22 RETURN_VALUE
```

Here’s the `for` loop version:

```
# Load variables
00 LOAD_CONST 1 (0.0)
02 STORE_FAST 1 (s)
# Create `range` object
04 LOAD_GLOBAL 0 (range)
06 LOAD_CONST 2 (1)
08 LOAD_FAST 0 (N)
10 CALL_FUNCTION 2
12 GET_ITER
# Declare `for` loop
14 FOR_ITER 8 (to 32)
16 STORE_FAST 2 (x)
# Power, add, store
18 LOAD_FAST 1 (s)
20 LOAD_FAST 2 (x)
22 LOAD_CONST 3 (-2)
24 BINARY_POWER
26 INPLACE_ADD
28 STORE_FAST 1 (s)
# Go to top of `for` loop
30 JUMP_ABSOLUTE 7 (to 14)
# return `s`
32 LOAD_FAST 1 (s)
34 RETURN_VALUE
```

In this *specific case*, using the generator slowed down the program.
As we can see, the generator’s loop body is identical to the second version’s.
The first one just does extra work creating and driving the generator object,
which requires (slow) `CALL_FUNCTION` instructions. Note that in practice,
generators are often still worthwhile thanks to their lazy, memory-friendly
nature. It’s just that in this case, the extra machinery isn’t worth it.

Regardless, the difference in performance between the two versions ($\approx 1$s) is insignificant compared to the overall difference between Python and C/Rust. Even with the faster version, the Rust code is $\approx 65$ times faster than Python.

Why? Mainly because Python is a dynamically typed, interpreted language. This means that the Python interpreter (CPython) has to translate Python code into something easily executable by your computer, on the fly. It does this by quickly compiling Python code into a generalized form of assembly, called Python bytecode, which we just looked at.

This bytecode is then executed by the interpreter. The compilation step
is the first hit to performance that Python suffers. Since languages
like Rust only require a program to be compiled once, that time isn’t
factored into a program’s *runtime*. But the biggest hit to performance
comes from the bytecode itself being architecture-agnostic and largely
unoptimized, as opposed to natively optimized machine code. This is simply
part of the nature of interpreted languages, since they can’t afford to spend
much time on high-quality compilation.

So, how *can* you write fast Python? Well, you can’t. But you *can* call
into fast C code through heavily optimized libraries like Numpy. These libraries
contain pre-compiled, vectorized C functions that let you bypass the whole
Python interpreter.

## Let’s Use Numpy

```
import numpy as np

def basel_np(N: int) -> float:
    # [1, 1, ..., 1]
    ones = np.ones(N - 1)
    # [1, 2, ..., N-1]
    r = np.arange(1, N)
    # [1, 1/2, ..., 1/(N-1)]
    div = ones / r
    # [1, 1/4, ..., 1/(N-1)^2]
    inv_squares = np.square(div)
    # ~ pi^2/6
    return float(np.sum(inv_squares))
```

```
>>> f(100000000) # 10^8
Function 'basel_np' executed in 0.460317 seconds.
1.6449340568482196
```

Wow, that ran $\approx 13$ times faster! Looking at the bytecode in
this case won’t tell us much, since it’ll just consist of a few
`CALL_FUNCTION`s into Numpy, which does the actual work. Let’s see which
line is taking the largest toll:

```
def basel_np(N: int) -> tuple[float, list[float]]:
    times = []
    # Time a single step
    start = time.perf_counter()
    ones = np.ones(N - 1)
    end = time.perf_counter()
    step_time = end - start
    times.append(step_time)
    # Remaining timing code omitted
    r = np.arange(1, N)
    div = ones / r
    square = np.square(div)
    ret = np.sum(square)
    return ret, times
```

Operation | Time for operation (ms) | Cumulative time (ms) |
---|---|---|
Creating ones | 97 | 97 |
Creating range | 79 | 176 |
Squaring range | 98 | 274 |
Dividing ones/squares | 112 | 387 |
Final sum | 58 | 444 |

Let’s think about these numbers for a minute. The *creating ones/range* steps
don’t involve heavy computational work. However, they take almost the same
amount of time as the “tougher” steps, such as squaring and dividing. Surprisingly,
the sum over the final array is the fastest step! This suggests that the bottleneck
here is not the CPU, but memory access. Let’s see how performance scales,
ranging $N$ from $10^7$ to $10^9$.

In this plot, the edge of the topmost line represents the total time taken, and the space in between consecutive lines represents the time for each step.

We can see that:

1. The total time increases non-linearly, even though the algorithm is $O(n)$.
2. The `divide` step takes the most time, with sharp increases at $\approx 3 \times 10^8$ and $\approx 7 \times 10^8$.

This makes sense, because unlike the other operations, `np.divide` needs to
access two large arrays simultaneously and write the result to a new array.
This means lots of main-memory accesses, and possibly swapping to the SSD, which
is extremely slow.
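The sizes involved are easy to check, since Numpy arrays report their footprint via `nbytes` (a small sketch, scaled down to $N = 10^6$; at the post’s $N = 10^8$ each intermediate array is ~800 MB):

```python
import numpy as np

N = 10**6  # scaled down from the post's 10**8
r = np.arange(1, N, dtype=np.float64)
div = np.ones(N - 1) / r  # the division allocates a second full-size array
print(div.nbytes / 1e6, "MB per intermediate array")
```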

### Broadcasting

There’s actually a built-in Numpy optimization for these kinds of problems called broadcasting, which virtually “projects” smaller-sized operands up to larger sizes for vector-vector operations.

```
def basel_np_broadcast(N: int) -> float:
    ones = 1
    r = np.arange(1, N)
    div = ones / r
    square = np.square(div)
    # As if it's [1, 1, ..., 1] / [1, 4, ..., N^2]
    return float(np.sum(square))
```
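To see what broadcasting buys us, here’s a toy comparison: the scalar `1` is combined elementwise with the array without ever materializing a full-size array of ones:

```python
import numpy as np

r = np.arange(1, 6)            # [1 2 3 4 5]
broadcast = 1 / r              # scalar broadcast against the array
materialized = np.ones(5) / r  # same values, but allocates an extra array
print(np.array_equal(broadcast, materialized))  # True
```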

This lets us save a lot of precious cache space. Let’s run the same metrics as before. With $N=10^8$:

Operation | Time for operation (ms) | Cumulative time (ms) |
---|---|---|
Creating ones | 0.00 | 0.00 |
Creating range | 68.56 | 70.48 |
Squaring range | 105.14 | 180.74 |
Dividing ones/squares | 133.08 | 271.30 |
Final sum | 71.08 | 310.41 |

From now on, I’ll refer to the broadcasted version as “the Numpy solution”.

Although a significant improvement, we still see those latency spikes. Let’s try to fix that.

## Thinking About Memory

To improve the memory use of the function, we could work block-by-block, doing the same operations on small chunks of the range at a time, and adding it all up in the end. This would let us only keep one chunk in memory at once, and hopefully lead to better cache utilization while still benefiting from Numpy’s vectorized code.

Let’s modify the function to take a range of $N$s:

```
# [N1, N2], inclusive
def basel_np_range(N1: int, N2: int) -> float:
    # Timing code omitted
    ones = 1
    r = np.arange(N1, N2 + 1)
    div = ones / r
    squares = np.square(div)
    return float(np.sum(squares))
```

And write a function to sum up all the chunks.

```
def basel_chunks(N: int, chunk_size: int) -> float:
    # Timing code omitted
    s = 0.0
    num_chunks = N // chunk_size
    for i in range(num_chunks):
        s += basel_np_range(i * chunk_size + 1, (i + 1) * chunk_size)
    return s
```

With $N=10^8$:

```
Function 'basel_chunks' executed in 0.108557 seconds.
1.6449340568482258
```

Nice! It runs $\approx 3$ times faster than the Numpy solution.

As a plus, the answer will actually be slightly *more accurate*: summing
in chunks keeps each partial sum small relative to the terms being added,
which reduces *IEEE 754* floating-point rounding error.
Let’s look at how the performance scales in the same way we did before,
keeping the chunk size constant.
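As a quick sanity check that the chunked bookkeeping is right, here’s a self-contained sketch of the two functions above, compared against a direct computation for a small $N$:

```python
import numpy as np

def basel_np_range(N1: int, N2: int) -> float:
    # Sum of 1/x^2 for x in [N1, N2], inclusive
    r = np.arange(N1, N2 + 1)
    return float(np.sum(np.square(1.0 / r)))

def basel_chunks(N: int, chunk_size: int) -> float:
    s = 0.0
    for i in range(N // chunk_size):
        s += basel_np_range(i * chunk_size + 1, (i + 1) * chunk_size)
    return s

# The chunked sum should match a direct computation over [1, 10000]
direct = float(np.sum(1.0 / np.square(np.arange(1.0, 10_001.0))))
print(basel_chunks(10_000, 1_000), direct)
```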

First of all, look at the y-axis. We’re now working on the scale
of seconds, not minutes. We also see that the runtime
increases linearly, consistent with the algorithm’s $O(n)$ time complexity.
We can theorize that with a `chunk_size` of 20000, all of the working data
fits in cache, so there are no runtime spikes for out-of-cache access.

Let’s try varying `chunk_size` in the range $[5 \times 10^2, 10^6]$, keeping $N$ constant at $10^9$.

From the figure, we see performance gains up to a `chunk_size` of $\approx 51000$,
after which there are latency spikes, presumably caused by cache misses.

## Some Napkin Math

Each `float64` uses 64 bits, or 8 bytes. We are working with 3 arrays
of `chunk_size` `float64`s, so memory usage for one call is
$51000 \times 8 \times 3 = 1224000 \approx 1.2$ MB.
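As a sanity check on the arithmetic:

```python
CHUNK = 51_000          # elements per chunk, from the plot above
BYTES_PER_ELEMENT = 8   # one float64
NUM_ARRAYS = 3          # r, div, and squares are live at the same time
print(CHUNK * BYTES_PER_ELEMENT * NUM_ARRAYS)  # 1224000 bytes, i.e. ~1.2 MB
```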

My M1 Pro has the following specs:

Memory type | Size |
---|---|
L1 | 128 KB (data cache, per core) |
L2 | 24 MB (shared between 6 performance cores) |
L3 | 24 MB |
Main | 16 GB |

This suggests the maximum $L1 + L2$ cache space available to the function is $\approx 1.2$ MB, and if we go over that, we have to wait for the $L3$ cache.

## Multiprocessing

Now, let’s try making the computer fit more of the array into cache. One possibility is to run the function on all of the cores, giving our computation more aggregate $L2$ to work with. Let’s try that.

First, let’s modify `basel_chunks` to take a range of $N$s as input.

```
# (N1, N2]
def basel_chunks_range(N1: int, N2: int, chunk_size: int) -> float:
    # Timing code omitted
    s = 0.0
    num_chunks = (N2 - N1) // chunk_size
    for i in range(num_chunks):
        s += basel_np_range(N1 + i * chunk_size + 1, N1 + (i + 1) * chunk_size)
    return s
```

Now for the actual multiprocessing:

```
from multiprocessing import Pool

def basel_multicore(N: int, chunk_size: int) -> float:
    # 10 cores on my machine
    NUM_CORES = 10
    N_PER_CORE = N // NUM_CORES
    Ns = [
        (i * N_PER_CORE, (i + 1) * N_PER_CORE, chunk_size)
        for i in range(NUM_CORES)
    ]
    # Process 10 batches in parallel
    with Pool(NUM_CORES) as p:
        result = p.starmap(basel_chunks_range, Ns)
    return sum(result)
```
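The off-by-one bookkeeping here is worth double-checking: each worker gets the half-open range $(N_1, N_2]$, and `basel_np_range` starts at $N_1 + 1$, so the slices tile $[1, N]$ exactly. A quick pure-Python check of that logic (constants scaled down):

```python
N, NUM_CORES, CHUNK = 10**5, 10, 1_000
N_PER_CORE = N // NUM_CORES

covered = []
for i in range(NUM_CORES):
    N1, N2 = i * N_PER_CORE, (i + 1) * N_PER_CORE
    for c in range((N2 - N1) // CHUNK):
        lo = N1 + c * CHUNK + 1    # first term of this chunk
        hi = N1 + (c + 1) * CHUNK  # last term, inclusive
        covered.append((lo, hi))

# Chunks should tile [1, N] with no gaps or overlaps
print(covered[0], covered[-1], len(covered))
```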

Under the hood, Python’s multiprocessing module pickles the function
and its arguments, spawns 10 new instances of `python3.10` (one per pool
worker), and has each execute `basel_chunks_range` on its own tenth of the
range with `chunk_size = 50000`.

Let’s compare performance.

```
Function 'basel_multicore' executed in 1.156558 seconds.
1.6449340658482263
Function 'basel_chunks' executed in 1.027350 seconds.
1.6449340658482263
```

Oops, it did worse. This is probably because the setup work done by multiprocessing (spawning processes and pickling arguments) is expensive, so it only pays off when the function takes a relatively long time to complete.

Increasing $N$ to $10^{10}$:

```
Function 'basel_multicore' executed in 2.314828 seconds.
1.6449340667482235
Function 'basel_chunks' executed in 10.221904 seconds.
1.6449340667482235
```

There we go! It’s $\approx 5$ times faster. Let’s try $10^{11}$:

```
Function 'basel_multicore' executed in 13.844876 seconds.
multicore 1.6449340668379773
Function 'basel_chunks' executed in 102.480372 seconds.
chunks 1.6449340668379773
```

That’s $\approx 8$ times faster! As $N$ increases, the ratio of single-core to 10-core should approach 10:1.

The x-axis is on a log scale, which is why the linearly increasing runtimes look exponential

We can see that using multiprocessing has a $\approx 1$ second overhead, so there’s only a performance benefit when the chunked runtime is higher. This happens at $\approx N=1.3 \times 10^9$. After that, however, there is a massive difference.

Through experiment (not shown here), I found that a `chunk_size`
around 50000 is also optimal for the multiprocessing code, suggesting
that the OS restricts how much L2 a single core can use.

## Summary of Results

Python function | $N=10^8$ (s) | $N=10^9$ (s) | $N=10^{10}$ (s) |
---|---|---|---|
`basel_multicore` | 1.04 | 1.17 | 2.33 |
`basel_chunks` | 0.09 | 0.91 | 9.18 |
`basel_np_broadcast` | 0.27 | 9.68 | NaN (ran out of memory) |
`basel_less_pythonic` | 5.99 | 59.25 | 592 (est.) |
`basel` (idiomatic) | 7.64 | 68.28 | 682 (est.) |
`basel_np` | 0.618 | 44.27 | NaN (ran out of memory) |

How does it stack up to the languages from the last post?

Language | $N=10^9$ (ms) |
---|---|
Rust (rustc, --release, multicore w/ rayon) | 112.65 |
Python3.10 (`basel_chunks`) | 913 |
Rust (rustc, --release) | 937.94 |
C (clang, -O3) | 995.38 |
Python3.10 (`basel_multicore`) | 1170 |
Python3.10 (`basel_np_broadcast`) | 9680 |
Haskell (ghc, -O3) | 13454 |
Python3.10 (`basel_np`) | 44270 |
Python3.10 (`basel_less_pythonic`) | 59250 |
Python3.10 (`basel`) | 68280 |

Extremely good! The chunked Numpy code is the fastest sequential solution! And remember, Python has a whole interpreter running in the background as it makes calculations.

Comparing Rust to Python multicore:

Language | $N=10^8$ (ms) | $N=10^9$ (ms) | $N=10^{10}$ (ms) | $N=10^{11}$ (ms) |
---|---|---|---|---|
Rust | 12.3 | 111.2 | 1083 | 10970 |
Python | 1040 | 1173 | 2330 | 12629 |

Clearly, Rust’s method of parallelization (through OS threads) is far more efficient for small $N$ than Python’s process-based parallelism. But the difference between the two shrinks as $N$ increases.

## Conclusion

If it wasn’t clear, this entire article was just an exercise to learn
a bit about how Python works under the hood. *Please do not try
to impress your manager by hyper-optimizing your Python*. There’s
almost always a better solution, like:

```
from math import pi

def basel_smart() -> float:
    return pi * pi / 6
```

Ta-da! Orders of magnitude faster and infinitely simpler than all the other functions.
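For the curious, the partial sums really do crawl toward $\pi^2/6$; the error of the $N$-term sum shrinks roughly like $1/N$:

```python
from math import pi

def basel(N: int) -> float:
    return sum(x ** (-2) for x in range(1, N))

target = pi * pi / 6
for N in (10, 100, 1000):
    print(N, target - basel(N))  # error shrinks by ~10x per row
```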

When thinking about performance, you need to be careful. Don’t *assume*
performance is an issue until you measure, and you *know* it is. If so,
always see if there’s a smarter way of solving the problem before
you rush to type `import multiprocessing` and brute-force your way through
an otherwise simple problem.

Also, if your application is really that performance critical, you probably
shouldn’t be writing it in Python. Python was created as a tool that excels
in prototyping, not execution speed. But as you can see, it isn’t *impossible*
to get awesome performance.

Anyway, I hope you enjoyed my first real article on this site. If you find a faster Python “solution” to this “problem”, shoot me an email!

## Update log

Date | Description | Thanks to |
---|---|---|
08/02/2023 | Added the broadcasting section; Python becomes fastest solution | u/mcmcmcmcmcmcmcmcmc_ & u/Allanon001 |
08/02/2023 | Changed `time.time()` to `time.perf_counter()` | u/james_pic |