feat: add a page about compiling + parallel and GPU computing
Saransh-cpp committed Aug 21, 2024
1 parent 8493054 commit 981f678
Showing 2 changed files with 390 additions and 0 deletions.
385 changes: 385 additions & 0 deletions ch08performance/030misc-libs.ipynb.py
@@ -0,0 +1,385 @@
# ---
# jupyter:
# jekyll:
# display_name: Miscellaneous libraries
# jupytext:
# notebook_metadata_filter: -kernelspec,jupytext,jekyll
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.15.2
# ---
#
# %% [markdown]
# # Miscellaneous libraries to improve your code's performance
#
# Over the years, the Python ecosystem has developed more and more solutions
# to combat the relatively slow performance of the language. As seen in the previous
# lessons, the speed of computation matters a lot when a technology is used in a
# scientific setting. This lesson introduces a few such popular libraries that have
# been widely adopted in the scientific use of Python.
#
# ## Compiling Python code
#
# We know that Python is an interpreted and not a compiled language, but is there a way
# to compile Python code? There are a few libraries/frameworks that let you
# *just-in-time* (JIT) or *ahead-of-time* (AOT) compile Python code. Both techniques allow
# users to compile their Python code without writing any explicit low-level language
# bindings, but the two techniques differ from each other.
#
# Just-in-time compilers compile functions/methods at runtime, whereas ahead-of-time
# compilers compile the entire code before runtime. AOT compilers can perform many more
# compiler optimizations than JIT compilers, but AOT compilers produce huge binaries,
# and only for specific platforms. JIT compilers have limited optimization routines,
# but they produce small and platform-independent binaries.
#
# JIT and AOT compilers for Python include, but are not limited to -
# - [Numba](http://numba.pydata.org): a Just-In-Time Compiler for Numerical Functions in Python
# - [mypyc](https://mypyc.readthedocs.io/en/latest/introduction.html): compiles Python modules to C extensions using standard Python type hints
# - [JAX](https://jax.readthedocs.io/en/latest/jit-compilation.html): a Python library for accelerator-oriented array computation and program transformation
#
# Although all of them were built to speed up Python code, each one is a bit
# different from the others when it comes to its use case. We will specifically look into
# Numba in this lesson.
#
# ### Numba
#
# Numba is an open source JIT compiler that translates a subset of Python and NumPy code
# into fast machine code. The good thing about Numba is that it works on plain Python
# loops and one doesn't need to configure any compilers manually to JIT compile Python code.
# Another good thing is that it understands NumPy natively, but as detailed in its
# documentation, it only understands a subset of NumPy functionalities.
#
# Numba provides users with an easy-to-use function decorator - `jit`. Let's start by
# importing the libraries we will use in this lesson.
#

# %%
import math

from numba import jit, njit
import numpy as np

# %% [markdown]
# We can mark a function to be JIT compiled by simply decorating it
# with numba's `@jit`. The decorator takes a `nopython` argument that tells Numba to
# enter the compilation mode and not fall back to usual Python execution.

# %%
def f(x):
    return np.sqrt(x)


@jit(nopython=True)
def jit_f(x):
    return np.sqrt(x)

# %% [markdown]
# It looks like the `jit` decorator should make Numba compile our function
# and make it much faster than the non-JIT version. Let's test this out.
#
# Note that the first function call is not included in the timing because that
# is where Numba compiles the function. Compilation at runtime is called
# just-in-time compilation, and the resultant binaries are cached for future
# runs.

# %%
data = np.random.uniform(low=0.0, high=100.0, size=1_000)

# %%
# %%timeit
f(data)

# %%
_ = jit_f(data)

# %%
# %%timeit
jit_f(data)

# %% [markdown]
# Surprisingly, the JITted function was slower than the plain Python and NumPy
# implementation! Why did this happen? Numba does not provide valuable performance
# gains over pure Python or NumPy code for simple operations and small datasets.
# The JITted function turned out to be slower than the non-JIT implementation
# because of the compilation overhead. Let's try increasing the size of our
# data and performing a non-NumPy list comprehension on the data.
#
# The `jit` decorator with `nopython=True` is so widely used that there exists an
# alias decorator for it - `@njit`.

# %%
data = np.random.uniform(low=0.0, high=100.0, size=1_000_000)


# %%
def f(x):
    return [math.sqrt(elem) for elem in x]


@njit
def jit_f(x):
    return [math.sqrt(elem) for elem in x]

# %%
# %%timeit
f(data)

# %%
# %%timeit
jit_f(data)

# %% [markdown]
# That was way faster than the non-JIT function! But the result was still slower
# than the NumPy implementation. NumPy is still better for relatively simple
# computations, but as the complexity increases, Numba functions start
# outperforming NumPy implementations.

# %% [markdown]
# Let's go back to our plain Python Mandelbrot code from the previous lessons and
# JIT compile it -

# %%
@njit
def mandel1(position, limit=50):
    value = position
    while abs(value) < 2:
        limit -= 1
        value = value**2 + position
        if limit < 0:
            return 0
    return limit


xmin = -1.5
ymin = -1.0
xmax = 0.5
ymax = 1.0
resolution = 300
xstep = (xmax - xmin) / resolution
ystep = (ymax - ymin) / resolution
xs = [(xmin + (xmax - xmin) * i / resolution) for i in range(resolution)]
ys = [(ymin + (ymax - ymin) * i / resolution) for i in range(resolution)]

# %%
# %%timeit
data = [[mandel1(complex(x, y)) for x in xs] for y in ys]

# %% [markdown]
# The compiled code already beats our fastest NumPy implementation! It is not
# guaranteed that compiled code will perform better than NumPy code, but it usually
# gives performance gains for significantly large computations. As always, it is
# good to measure the performance to check if there are any gains.

# %% [markdown]
# Let's try JITting our NumPy code.

# %%
@njit
def mandel_numpy(position, limit=50):
    value = position
    diverged_at_count = np.zeros(position.shape)
    while limit > 0:
        limit -= 1
        value = value**2 + position
        diverging = value * np.conj(value) > 4
        first_diverged_this_time = np.logical_and(diverging, diverged_at_count == 0)
        diverged_at_count[first_diverged_this_time] = limit
        value[diverging] = 2

    return diverged_at_count


ymatrix, xmatrix = np.mgrid[ymin:ymax:ystep, xmin:xmax:xstep]
values = xmatrix + 1j * ymatrix

# %%
# %%timeit
mandel_numpy(values)


# %% [markdown]
# That does not work. The error might be solvable or it might just be out of Numba's
# scope. Numba does not distinguish between plain Python
# and NumPy; both implementations are broken down to the same machine
# code. Therefore, for Numba, it makes sense to write complex computations with the ease
# of Python loops and lists rather than with NumPy functions. Moreover, Numba only understands
# a subset of Python and NumPy, so it is possible that a NumPy snippet does not
# work but a simplified Python loop does.

# %% [markdown]
# Numba also has functionalities to vectorize, parallelize, and strictly type check
# the code. All of these features boost performance even further or help
# Numba avoid falling back to the "object" mode (`nopython=False`). Refer
# to [Numba's documentation](http://numba.pydata.org) for a complete list of features.
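#
# For instance, here is a minimal, hedged sketch of Numba's documented automatic
# parallelization - `parallel=True` combined with `prange`. The function name and
# the reduction are illustrative, not part of this lesson's earlier code -

# %%
from numba import prange


@njit(parallel=True)
def parallel_sum_sqrt(x):
    total = 0.0
    for i in prange(len(x)):  # iterations are distributed across threads
        total += math.sqrt(x[i])  # Numba recognises this scalar reduction
    return total


parallel_sum_sqrt(data)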

# %% [markdown]
# #### Numba support in Scientific Python ecosystem
#
# Most scientific libraries nowadays ship Numba support with them. For example,
# [Awkward Array](https://awkward-array.org) is a library that lets users perform JIT
# compilable operations on non-uniform data, something that NumPy misses. Similarly,
# [Akimbo](https://akimbo.readthedocs.io) provides JIT compilable processing of nested,
# non-uniform data in dataframes. There are other niche libraries, such as
# [vector](https://vector.readthedocs.io), that provide Numba support for high energy
# physics vectors.
#
# ## Compiled Python bindings
#
# Several Python libraries/frameworks are written in a compiled language but provide Python
# bindings for their compiled code. This code does not have to be explicitly marked for
# compilation by developers. A few Python libraries that have their core written in
# compiled languages but provide Python bindings (see the sketch after this list) -
#
# - [Cython](https://cython.org): superset of Python that supports calling C functions and declaring C types on variables and class attributes
# - [SciPy](https://scipy.org): wraps highly-optimized scientific implementations written in low-level languages like Fortran, C, and C++
# - [Polars](https://docs.pola.rs): dataframes powered by a multithreaded, vectorized query engine, written in Rust
# - [boost-histogram](https://boost-histogram.readthedocs.io/en/latest/): Python bindings for the C++14 Boost::Histogram library
# - [Awkward Array](https://awkward-array.org/doc/main/): compiled and fast manipulation of JSON-like data with NumPy-like idioms
# - [Astropy](https://www.astropy.org): astronomy and astrophysics core library
#
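# As a concrete illustration, the SciPy call in this hedged sketch dispatches to
# compiled LAPACK routines under the hood, without the user marking anything for
# compilation (the array sizes are arbitrary) -
#
# ```python
# import numpy as np
# from scipy import linalg
#
# A = np.random.rand(500, 500)
# b = np.random.rand(500)
# x = linalg.solve(A, b)  # runs compiled LAPACK code
# ```
#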
# JIT or AOT compiling libraries and the libraries providing Python bindings for a compiled
# language are not mutually exclusive. For instance, PyTorch and Tensorflow offer users
# "eager" and "graph" execution. Eager execution builds the computational graph at runtime,
# making it slow, easy to debug, and JIT compilable. On the other hand, "graph" execution
# builds the computational graph at the kernel level, making it fast, hard to debug, and
# with no need to JIT compile. Similarly, Awkward Array provides Python bindings for array
# creation, but high-level operations on these arrays can be JIT compiled.
#
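# As a minimal, hedged sketch of the eager-to-compiled idea (assuming PyTorch 2.x,
# where the documented `torch.compile` traces a Python function into an optimized
# graph; the function below is illustrative) -
#
# ```python
# import torch
#
# def f(x):
#     return torch.sin(x) ** 2 + torch.cos(x) ** 2
#
# compiled_f = torch.compile(f)  # eager function -> compiled graph
# print(compiled_f(torch.randn(1000)))
# ```
#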
# We will specifically look into Cython in the next lesson.
#
# ## Distributed and parallel computing with Python
#
# Along with the speed of execution of a program, scientific computation usually requires
# a large amount of memory because of the massive amounts of data produced by scientific
# experiments. This issue can be addressed by employing distributed and parallel
# computing frameworks to spread the computation across multiple machines or multiple
# cores in a single machine. Parallel computing at a smaller level can be achieved by distributing
# a task over multiple threads, but Python's GIL (Global Interpreter Lock) blocks
# the interpreter from doing this for computational tasks. There are ongoing efforts, such
# as the implementation of [PEP 703](https://peps.python.org/pep-0703/), to make the GIL
# optional, but Numba already allows users to bypass the GIL by passing `nogil=True` to `@jit`.
#
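# For instance, here is a minimal, hedged sketch combining `nogil=True` with plain
# Python threads; the four-way split of `data` is arbitrary -

# %%
from concurrent.futures import ThreadPoolExecutor


@njit(nogil=True)
def sum_sqrt(x):
    total = 0.0
    for elem in x:
        total += math.sqrt(elem)
    return total


sum_sqrt(data[:1])  # trigger compilation once, outside the threads

# with the GIL released, the threads can run the compiled function in parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(sum_sqrt, np.array_split(data, 4)))
sum(results)

# %% [markdown]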
# At a larger scale, Dask can be used to parallelize computation over multiple
# machines or "nodes". Dask can even parallelize code marked with `@numba.jit(nogil=True)`
# over multiple threads in a machine, but it does not itself bypass the GIL.
#
# ### Dask
#
# Dask is a Python library for parallel and distributed computing. Dask supports
# arrays, dataframes, and Python objects, as well as tasks on computer clusters and
# nodes. The API of Dask is very similar to that of NumPy, pandas, and plain Python,
# allowing users to quickly parallelize their implementations. Let's create a dummy
# dataframe using dask.

# %%
import dask

df = dask.datasets.make_people() # returns a dask "bag"
df = df.to_dataframe()

# %%
df

# %% [markdown]
# Displaying the dataframe just shows the metadata of the variable, and not any
# actual data. Why is that? This is because of the "lazy" nature of dask. Dask has
# "lazy" execution, which means that it will store the operations on the data and
# create a task graph for them, but will not execute the operations until a
# user explicitly asks for the result. The metadata specifies `npartitions=10`,
# which means that the dataframe is split into 10 parts that will be accessed in parallel.
# We can explicitly tell dask to give us the dataframe values using `.compute()`.

# %%
df.compute()

# %% [markdown]
# We can now perform pandas-like operations on our dataframe and let dask create a
# task graph for them.

# %%
new_df = df.groupby("occupation").age.max()
new_df

# %% [markdown]
# Instead of computing, let's visualize the task graph.

# %%
dask.visualize(new_df, filename="visualization.png")

# %% [markdown]
# We can see that the task graph starts with 10 independent branches because our dataframe
# was split into 10 partitions at the start. Let's compute the answer.

# %%
new_df.compute()

# %% [markdown]
# Similarly, one can perform such computations on arrays and selected Python data structures.
#
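# For instance, here is a minimal sketch using `dask.array`; the array shape and
# chunking below are arbitrary -

# %%
import dask.array as da

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
x.mean().compute()  # partial means of the 100 chunks are computed in parallel

# %% [markdown]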

# #### Dask support in Scientific Python ecosystem
#
# Similar to the adoption of Numba in the scientific Python ecosystem, dask is being
# adopted at an increasing rate. Some examples include
# [dask-awkward](https://dask-awkward.readthedocs.io) for ragged data
# computation, [dask-histogram](https://dask-histogram.readthedocs.io) for
# parallel filling of histograms, dask support in [cuDF](https://docs.rapids.ai/api/cudf)
# for parallelizing CUDA dataframes, and
# [dask-sql](https://dask-sql.readthedocs.io), a distributed SQL engine.
#
# ## GPU accelerated computing in Python
#
# The last method for speeding up Python that we will look at is running your code
# on GPUs instead of CPUs. GPUs are specifically built to speed up computational tasks
# such as doing linear algebra or experimenting with computer graphics. Support for
# GPUs can be added by writing custom kernels or by using Pythonic GPU libraries
# such as CuPy.
#
# ### CuPy
#
# [CuPy](https://cupy.dev) is an array-based library enabling Python code to be scaled to GPUs. CuPy's
# API is very similar to that of NumPy and SciPy, and it was built as their extension to
# GPUs. There are a few differences in the API, and even a few missing functions,
# but it is actively developed and in a stable state. CuPy requires CUDA or
# AMD ROCm to be installed on your machine. After the installation, the majority of
# NumPy and SciPy scripts can be converted to CuPy by simply replacing `numpy`
# and `scipy` with `cupy`.
#
# ```python
# import cupy as cp
#
# print(cp.cuda.runtime.getDeviceCount()) # check if CuPy identifies the CUDA GPU
#
# x = cp.arange(6).reshape(2, 3).astype('f')
# y = cp.arange(6).reshape(2, 3).astype('f')
#
# result = x + y
# ```
#
# The code is not executed in this lesson as the service where these lessons are built
# does not possess a GPU.
#
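# Results living on the GPU can be copied back to the host as NumPy arrays with
# CuPy's documented `cp.asnumpy` helper. A minimal sketch continuing the snippet
# above (again not executed here) -
#
# ```python
# host_result = cp.asnumpy(result)  # device -> host copy, returns numpy.ndarray
# ```
#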
# #### CuPy support in Scientific Python ecosystem
#
# Similar to Numba and Dask, CuPy is being adopted in multiple scientific libraries to
# provide GPU capabilities to users. [Dask](https://docs.dask.org/en/stable/gpu.html),
# along with its ecosystem libraries such as [dask-image](https://image.dask.org/en/latest/index.html),
# provides CuPy and GPU support. Similarly, packages like
# [Awkward Array](https://awkward-array.org/doc/main/),
# [cupy-xarray](https://cupy-xarray.readthedocs.io), and
# [pyxem](https://pyxem.readthedocs.io/en/latest/) use CuPy internally to offer GPU support.
#
# ## Writing custom GPU kernels and binding them to Python
#
# Some Python libraries write their own custom kernels to provide GPU support to users.
# For instance, [cuDF](https://docs.rapids.ai/api/cudf) uses libcudf, a C++/CUDA dataframe
# library, to provide GPU support for dataframes in Python. Similarly, Tensorflow and
# PyTorch have their own GPU kernels and do not depend on CuPy to run their code
# on GPUs. One can even write custom GPU kernels in CuPy, and libraries like
# [Awkward Array](https://awkward-array.org/doc/main/) leverage that for ragged data
# computation.
#
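# As an illustration, CuPy's documented `ElementwiseKernel` lets users define a
# small custom kernel directly from Python. The kernel below (adapted from CuPy's
# documentation) is not executed here for the same GPU reason -
#
# ```python
# import cupy as cp
#
# squared_diff = cp.ElementwiseKernel(
#     'float32 x, float32 y',   # input arguments
#     'float32 z',              # output argument
#     'z = (x - y) * (x - y)',  # CUDA C snippet executed per element
#     'squared_diff')           # kernel name
#
# result = squared_diff(cp.arange(5, dtype='float32'), cp.float32(2))
# ```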

5 changes: 5 additions & 0 deletions requirements.txt
@@ -26,3 +26,8 @@ pycodestyle
pylint
webcolors
ruff
numba
dask
mimesis
dask-expr
graphviz
