Python is a very clean and readable language for scientific (and other) development activities, but the same flexibility that makes code easy to write can often make it rather slow. Fortunately, the popularity of Python has created a number of solutions so that you don't have to go too far outside the Python ecosystem (if at all) in order to accelerate your code. In this brief, we will explore several options for accelerating your numerical computations in Python.
Before delving into Python extensions, we should first discuss the one library you should already be using: NumPy.
NumPy, a powerful library for arrays, linear algebra, and more, is extremely popular. If you are using Continuum Analytics' Anaconda to manage your Python packages (and if you aren't, you should be), you can install it easily with

```bash
conda install numpy
```
NumPy is very easy to use, but many of its array functions and linear algebra routines have been written in C, allowing for near-native performance.
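To make this concrete, here is a minimal sketch of ours (not from the NumPy documentation) comparing an interpreted Python loop with the equivalent vectorized NumPy call:

```python
import numpy as np

x = np.random.rand(1000000)

# Pure Python: each loop iteration is dispatched through the interpreter.
total = 0.0
for value in x:
    total += value

# NumPy: the same reduction runs in compiled C, typically orders of
# magnitude faster for arrays of this size.
total_np = np.sum(x)
```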
While NumPy is an excellent resource for most applications, others require more custom computation graphs that can run on specialized hardware like GPUs, and benefit from automatic differentiation (note that this is different from numerical differentiation, which can actually slow down high-dimensional optimization problems). Libraries such as Theano and TensorFlow exist to fill this niche. Interested readers may find this comparison of frameworks useful.
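As a taste of what automatic differentiation provides, here is a minimal sketch of ours using Theano: we build a symbolic expression, and Theano derives and compiles its gradient for us.

```python
import theano
import theano.tensor as T

x = T.dscalar('x')          # symbolic double-precision scalar
y = x ** 2 + 3 * x          # symbolic expression
dy_dx = theano.grad(y, x)   # symbolic derivative: 2*x + 3

f = theano.function([x], dy_dx)  # compile the gradient to a callable
print(f(2.0))                    # prints 7.0
```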
What about when you have Python code written, but it's just not fast enough?
Numba, a library sponsored by Continuum Analytics, may be the solution for you.
As with NumPy, installation via the `conda` package manager is easy:

```bash
conda install numba
```
So, how involved is it to use Numba?
We'll take a look at an example from the Numba website. First, the pure Python version:
```python
from numpy import arange

def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result

a = arange(9).reshape(3, 3)
print(sum2d(a))
```
And now the same function with Numba's JIT decorator applied:

```python
from numba import jit
from numpy import arange

@jit
def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result

a = arange(9).reshape(3, 3)
print(sum2d(a))
```
That's it. Just a decorator. What happens under the hood? The `@jit` decorator takes the bytecode for the function and compiles it to native code. If your code is also packaged and distributed using the `conda` package manager (see PACKAGING AND DEPLOYMENT), it's simple to include `numba` as a dependency and easily gain extra speed.
Of course, there are some caveats. Sometimes, if your code relies on Python's dynamic typing conveniences, the compiled code will just call back into the interpreter, and much of the speed increase will be lost. Guidelines on how to properly use Numba's facilities can be found on the project website.
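One such guideline worth knowing up front: Numba's nopython mode raises an error instead of silently falling back to the interpreter, so you find out immediately when a function cannot be fully compiled. A minimal sketch:

```python
import numpy as np
from numba import jit

# nopython=True forces full native compilation: if Numba would need to
# fall back to (slow) interpreter-backed "object mode", it raises an
# error instead of silently losing the speedup.
@jit(nopython=True)
def dot(a, b):
    result = 0.0
    for i in range(a.shape[0]):
        result += a[i] * b[i]
    return result

print(dot(np.arange(5.0), np.arange(5.0)))  # 30.0
```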
Sometimes, you don't want to add an extra dependency, or the available libraries do not cover your use case. What can you do, short of writing your code directly in C, with all the attendant boilerplate needed to connect it back to Python? Fortunately, there is a solution in the Python ecosystem: Cython. Cython itself is a superset of Python, allowing type declarations and seamless import of C and C++ functions. It is also an optimizing compiler that first translates Cython code into C, and then compiles the C into native code, which can be delivered as a binary.
To install, if you have Anaconda (see above if you don't), it's just:

```bash
conda install cython
```
Cython is also a straightforward way to connect Python code to C code, as Cython can both be readily imported by Python and readily access C libraries. As with Numba, it can require some finesse to ensure that the Cython code you write does not need to call the interpreter very often. To judge this, one can simply run

```bash
cython -a my_code.pyx
```

to generate an annotated HTML file in which lines are highlighted according to how heavily they interact with the Python interpreter.
As an example of code that has been optimized such that no interpreter calls are necessary at all, we have reproduced some code from one of our own projects, YANK, a free energy calculation tool:
```cython
cimport cython
from libc.math cimport exp, isnan
from libc.stdio cimport printf

cdef extern from 'stdlib.h' nogil:
    double drand48()

@cython.wraparound(False)
@cython.cdivision(True)
@cython.boundscheck(False)
cpdef long _mix_replicas_cython(long nswap_attempts, long nstates,
                                long[:] replica_states, double[:,:] u_kl,
                                long[:,:] Nij_proposed,
                                long[:,:] Nij_accepted) nogil:
    cdef long swap_attempt
    cdef long i, j, istate, jstate, tmp_state
    cdef double log_P_accept
    for swap_attempt in range(nswap_attempts):
        # Choose two replicas at random and look up their current states.
        i = <long>(drand48() * nstates)
        j = <long>(drand48() * nstates)
        istate = replica_states[i]
        jstate = replica_states[j]
        # Skip swaps involving un-computed (NaN) reduced potentials.
        if (isnan(u_kl[i, istate]) or isnan(u_kl[i, jstate]) or
                isnan(u_kl[j, istate]) or isnan(u_kl[j, jstate])):
            continue
        log_P_accept = - (u_kl[i, jstate] + u_kl[j, istate]) + (u_kl[j, jstate] + u_kl[i, istate])
        Nij_proposed[istate, jstate] += 1
        Nij_proposed[jstate, istate] += 1
        # Metropolis criterion: accept the swap and exchange the states.
        if log_P_accept >= 0 or drand48() < exp(log_P_accept):
            tmp_state = replica_states[i]
            replica_states[i] = replica_states[j]
            replica_states[j] = tmp_state
            Nij_accepted[i, j] += 1
            Nij_accepted[j, i] += 1
    return 0
```
As you can see, while it shares a lot of Python's nice clean syntax, we also declare variable types and manage them carefully. By doing this, we can ensure that we get the performance of C, while using the code remains as simple as:
```python
from .mixing._mix_replicas import _mix_replicas_cython

# ensure_type checks that each array has the dtype and dimensionality the
# Cython function declared (`md` here is assumed to be the mdtraj package).
replica_states = md.utils.ensure_type(self.replica_states, np.int64, 1, "Replica States")
u_kl = md.utils.ensure_type(self.u_kl, np.float64, 2, "Reduced Potentials")
Nij_proposed = md.utils.ensure_type(self.Nij_proposed, np.int64, 2, "Nij Proposed")
Nij_accepted = md.utils.ensure_type(self.Nij_accepted, np.int64, 2, "Nij accepted")

_mix_replicas_cython(self.nstates**4, self.nstates, replica_states,
                     u_kl, Nij_proposed, Nij_accepted)
```
Note that we have some extra calls to ensure that the type is as we declared. However, it is otherwise as simple as calling a Python function.
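For completeness: before it can be imported like this, the `.pyx` file has to be compiled into an extension module, typically at install time. Here is a minimal `setup.py` sketch of ours using Cython's build machinery (the module path is illustrative):

```python
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="mixing",
    # cythonize() translates the .pyx file to C and compiles it into an
    # importable extension module. The path here is illustrative.
    ext_modules=cythonize("mixing/_mix_replicas.pyx"),
)
```

Running `python setup.py build_ext --inplace` then compiles the module in place so that the import above succeeds.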
Suppose the intensive part of your code would be ideal as a custom GPU kernel, and frameworks like TensorFlow and Theano are not appropriate, but you do not want to write an entire package in C++. There are options: PyCUDA offers a nice wrapper around the NVIDIA CUDA API, and allows you to write only the kernel in C++, leaving everything else in Python (example taken from the PyCUDA homepage):
```python
import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400, 1, 1), grid=(1, 1))

print(dest - a * b)  # prints all zeros if the kernel ran correctly
```
If your application is appropriate for the GPU, this may be an easy option. But what about GPUs from vendors other than NVIDIA? You can use PyOpenCL, which allows you to use any OpenCL-compatible device. An example taken from PyOpenCL's homepage:
```python
from __future__ import absolute_import, print_function
import numpy as np
import pyopencl as cl

a_np = np.random.rand(50000).astype(np.float32)
b_np = np.random.rand(50000).astype(np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)

prg = cl.Program(ctx, """
__kernel void sum(
    __global const float *a_g, __global const float *b_g, __global float *res_g)
{
  int gid = get_global_id(0);
  res_g[gid] = a_g[gid] + b_g[gid];
}
""").build()

res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)
prg.sum(queue, a_np.shape, None, a_g, b_g, res_g)

res_np = np.empty_like(a_np)
cl.enqueue_copy(queue, res_np, res_g)

# Check on CPU with NumPy:
print(res_np - (a_np + b_np))
print(np.linalg.norm(res_np - (a_np + b_np)))
```
Note that these may require substantial knowledge of GPU programming, but automate a large portion of the typical boilerplate, making them an attractive option for many applications.
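For simple elementwise operations you can often avoid writing a kernel at all: PyCUDA's `gpuarray` module provides a NumPy-like array type whose arithmetic is dispatched to the GPU. A minimal sketch of ours:

```python
import numpy as np
import pycuda.autoinit  # initializes a CUDA context
import pycuda.gpuarray as gpuarray

# to_gpu copies a NumPy array to the device; arithmetic between
# GPUArrays is executed on the GPU.
a = gpuarray.to_gpu(np.random.randn(400).astype(np.float32))
b = gpuarray.to_gpu(np.random.randn(400).astype(np.float32))

result = (a * b).get()  # .get() copies the product back to the host
print(result[:5])
```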
Although your application may fit another paradigm, in general, we recommend pursuing acceleration in this order:
- NumPy: Your code should already be using this, but if it isn't, this could be an easy speedup.
- Numba: May require only a simple decorator; in more involved cases, may just require rearranging some variables.
- TensorFlow/Theano: Operate under a different paradigm, but are very useful for some communities (e.g., deep learning) and provide automatic differentiation.
- PyCUDA/PyOpenCL: If you need to write a piece of code for a GPU or other accelerated processing device, these libraries can make it as painless as possible.
- Cython: When nothing else fits your application, Cython is a productive option. Pure Python is still valid Cython, so you can change the code slowly to add static typing and other performance-enhancing features until the desired performance is achieved.
- C/C++ extensions: Sometimes, for very application-specific purposes, this is the best option (especially when communicating with GPUs and other specialized devices). However, it can be very time consuming and should be used as a last resort.
Want to contribute to this repository? Have a suggestion for an improvement? Spot a typo? We're always looking to improve this document for the betterment of all.
- Please feel free to open a new issue with your feedback and suggestions!
- Or make a pull request from your branch or fork!