From aa4f47c16fb2868b7422732d032a5f08ecd0355c Mon Sep 17 00:00:00 2001 From: Saransh Chopra Date: Thu, 15 Aug 2024 14:10:57 +0100 Subject: [PATCH] chore: port the remaining doctoral course --- ch02data/061internet.ipynb.py | 129 ++-- ch02data/062basic_plotting.ipynb.py | 115 ++++ .../{062csv.ipynb.py => 063tabular.ipynb.py} | 4 +- ch02data/064JsonYamlXML.ipynb.py | 248 +++++--- ch02data/066QuakeExercise.ipynb.py | 87 ++- ch02data/068QuakesSolution.ipynb.py | 229 +++++-- ch02data/072plotting.ipynb.py | 243 ++++---- ch02data/082NumPy.ipynb.py | 582 ++++++++++-------- ch02data/084Boids.ipynb.py | 568 +++++++---------- 9 files changed, 1269 insertions(+), 936 deletions(-) create mode 100644 ch02data/062basic_plotting.ipynb.py rename ch02data/{062csv.ipynb.py => 063tabular.ipynb.py} (99%) diff --git a/ch02data/061internet.ipynb.py b/ch02data/061internet.ipynb.py index 06ac2cc1..05b5e057 100644 --- a/ch02data/061internet.ipynb.py +++ b/ch02data/061internet.ipynb.py @@ -12,7 +12,7 @@ # --- # %% [markdown] -# ## Getting data from the Internet +# # Getting data from the internet # %% [markdown] # We've seen about obtaining data from our local file system. @@ -28,99 +28,133 @@ # We may also want to be able to programmatically *upload* data, for example, to automatically fill in forms. # %% [markdown] -# This can be really powerful if we want to, for example, perform an automated meta-analysis across a selection of research papers. +# This can be really powerful if we want to, for example, do automated meta-analysis across a selection of research papers. # %% [markdown] -# ### URLs +# ## Uniform resource locators # %% [markdown] -# All internet resources are defined by a Uniform Resource Locator. +# All internet resources are defined by a [_uniform resource locator_ (URL)](https://en.wikipedia.org/wiki/URL) which are a particular type of [_uniform resource identifier_ (URI)](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier). 
For example # %% -"https://static-maps.yandex.ru/1.x/?size=400,400&ll=-0.1275,51.51&z=10&l=sat&lang=en_US" +"https://mt0.google.com:443/vt?x=658&y=340&z=10&lyrs=s" # %% [markdown] -# A url consists of: +# A URL consists of: # -# * A *scheme* (`http`, `https`, `ssh`, ...) -# * A *host* (`static-maps.yandex.ru`, the name of the remote computer you want to talk to) -# * A *port* (optional, most protocols have a typical port associated with them, e.g. 80 for http, 443 for https) -# * A *path* (Like a file path on the machine, here it is `1.x/`) -# * A *query* part after a `?`, (optional, usually ampersand-separated *parameters* e.g. `size=400x400`, or `z=10`) +# * A *scheme* (`http` [_hypertext transfer protocol_](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol), `https` [_hypertext transfer protocol secure_ ](https://en.wikipedia.org/wiki/HTTPS), `ssh` [_secure shell_](https://en.wikipedia.org/wiki/Secure_Shell), ...) +# * A *host* (`mt0.google.com`, the name of the remote computer you want to talk to) +# * A *port* (optional, most protocols have a typical port associated with them, e.g. 80 for HTTP, 443 for HTTPS) +# * A *path* (analogous to a file path on the machine, here it is just `vt`) +# * A *query* part after a ?, (optional, usually ampersand `&` separated *parameters* e.g. `x=658` or `z=10`) # %% [markdown] -# **Supplementary materials**: These can actually be different for different protocols, the above is a simplification. You can see more, for example, at -# [the wikipedia article about the URI scheme](https://en.wikipedia.org/wiki/URI_scheme). +# **Supplementary materials**: These can actually be different for different protocols, the above is a simplification, you can see more, for example, at +# [the Wikipedia article on URIs](https://en.wikipedia.org/wiki/URI_scheme). # %% [markdown] -# URLs are not allowed to include all characters; we need to, for example, "escape" a space that appears inside the URL, -# replacing it with `%20`, so e.g. 
a request of `http://some example.com/` would need to be `http://some%20example.com/` +# URLs are not allowed to include all characters; we need to, for example, [_escape_](https://en.wikipedia.org/wiki/Escape_character) a space that appears inside the URL, replacing it with `%20`, so e.g. a request of `http://some example.com/` would need to be `http://some%20example.com/`. # # %% [markdown] # **Supplementary materials**: The code used to replace each character is the [ASCII](http://www.asciitable.com) code for it. # %% [markdown] -# **Supplementary materials**: The escaping rules are quite subtle. See [the wikipedia article for more detail](https://en.wikipedia.org/wiki/Percent-encoding). The standard library provides the [urlencode](https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlencode) function that can take care of this for you. +# **Supplementary materials**: The escaping rules are quite subtle. See [the Wikipedia article on percent-encoding](https://en.wikipedia.org/wiki/Percent-encoding). # %% [markdown] -# ### Requests +# ## Requests # %% [markdown] -# The python [requests](http://docs.python-requests.org/en/latest/) library can help us manage and manipulate URLs. It is easier to use than the `urllib` library that is part of the standard library, and is included with anaconda and canopy. It sorts out escaping, parameter encoding, and so on for us. +# The Python [Requests](http://docs.python-requests.org/en/latest/) library can help us manipulate URLs and requesting the content associated with them. It is easier to use than the `urllib` library that is part of the standard library, and is included with Anaconda and Canopy. It sorts out escaping, parameter encoding, and so on for us. 
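If you ever do need to handle the escaping yourself, the standard library's `urllib.parse` module provides functions for it — a small sketch:

```python
from urllib.parse import quote, unquote, urlencode

# quote percent-escapes characters that are not allowed in a URL;
# a space becomes %20
print(quote("some example"))      # some%20example

# unquote reverses the escaping
print(unquote("some%20example"))  # some example

# urlencode builds an ampersand-separated query string from a dictionary
print(urlencode({"z": 10, "lyrs": "s"}))  # z=10&lyrs=s
```

In practice, as described next, the Requests library does this encoding for us.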
+ +# %% +import requests # %% [markdown] # To request the above URL, for example, we write: # %% -import requests +response = requests.get( + url="https://mt0.google.com:443/vt", + params={'x': 658, 'y': 340, 'lyrs': 's', 'z': 10} +) + +# %% [markdown] +# The returned object is a instance of the `requests.Response` class # %% -response = requests.get("https://static-maps.yandex.ru/1.x/?size=400,400&ll=-0.1275,51.51&z=10&l=sat&lang=en_US", - params={ - 'size': '400,400', - 'll': '-0.1275,51.51', - 'zoom': 10, - 'l': 'sat', - 'lang': 'en_US' - }) +response # %% -response.content[0:50] +isinstance(response, requests.Response) # %% [markdown] -# When we do a request, the result comes back as text. For the png image in the above, this isn't very readable. +# The `Response` class defines various useful attributes associated with the responses, for example we can check the [status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) for our request with a value of 200 indicating a successful request + +# %% +response.status_code # %% [markdown] -# Just as for file access, therefore, we will need to send the text we get to a python module which understands that file format. +# We can also more directly check if the response was successful or not with the boolean `Response.ok` attribute + +# %% +response.ok # %% [markdown] -# Again, it is important to separate the *transport* model (e.g. a file system, or an "http request" for the web) from the data model of the data that is returned. +# We can get the URL that was requested using the `Response.url` attribute + +# %% +response.url # %% [markdown] -# ### Example: Sunspots +# When we do a request, the associated response content, accessible at the `Response.content` attribute, is returned as bytes. 
For the JPEG image in the above, this isn't very readable: + +# %% +type(response.content) + +# %% +response.content[:10] # %% [markdown] -# Let's try to get something scientific: the sunspot cycle data from [SILSO](http://sidc.be/silso/home): +# We can also get the content as a string using the `Response.content` attribute, though this is even less readable here as some of the returned bytes do not have corresponding character encodings + +# %% +type(response.text) + +# %% +response.text[:10] + +# %% [markdown] +# To get a more useful representation of the data, we will therefore need to process the content we get using a Python function which understands the byte-encoding of the corresponding file format. + +# %% [markdown] +# Again, it is important to separate the *transport* model, (e.g. a file system, or a HTTP request for the web), from the data model of the data that is returned. + +# %% [markdown] +# ## Example: sunspots + +# %% [markdown] +# Let's try to get something scientific: the sunspot cycle data from the [Sunspot Index and Long-term Solar Observations website](http://sidc.be/silso/home) # %% spots = requests.get('http://www.sidc.be/silso/INFO/snmtotcsv.php').text # %% -spots[0:80] +spots[-100:] # %% [markdown] -# This looks like semicolon-separated data, with different records on different lines. (Line separators come out as `\n`) +# This looks like semicolon-separated data, with different records on different lines. Line separators come out as `\n` which is the escape-sequence corresponding a newline character in Python. # %% [markdown] # There are many many scientific datasets which can now be downloaded like this - integrating the download into your data # pipeline can help to keep your data flows organised. # %% [markdown] -# ### Writing our own Parser +# ## Writing our own parser # %% [markdown] -# We'll need a python library to handle semicolon-separated data like the sunspot data. 
+# We'll need a Python library to handle semicolon-separated data like the sunspot data. # %% [markdown] # You might be thinking: "But I can do that myself!": @@ -139,31 +173,34 @@ # But **don't**: what if, for example, one of the records contains a separator inside it; most computers will put the content in quotes, # so that, for example, # -# "something; something"; something; something +# "Something; something"; something; something # # has three fields, the first of which is # -# something; something -# -# The naive code above would give four fields, of which the first is -# -# "something +# Something; something +# + +# %% [markdown] +# Our naive code above would however not correctly parse this input: + +# %% +'"Something; something"; something; something'.split(';') # %% [markdown] # You'll never manage to get all that right; so you'll be better off using a library to do it. # %% [markdown] -# ### Writing data to the internet +# ## Writing data to the internet # %% [markdown] # Note that we're using `requests.get`. `get` is used to receive data from the web. # You can also use `post` to fill in a web-form programmatically. # %% [markdown] -# **Supplementary material**: Learn about using `post` with [requests](http://docs.python-requests.org/en/latest/user/quickstart/). +# **Supplementary material**: Learn about using `post` with [Requests](http://docs.python-requests.org/en/latest/user/quickstart/). # %% [markdown] -# **Supplementary material**: Learn about the different kinds of [http request](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods): [Get, Post, Put, Delete](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete)... +# **Supplementary material**: Learn about the different kinds of [HTTP request](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods): [Get, Post, Put, Delete](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete)... 
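As a sketch of what a programmatic form submission might look like — the URL and form field names below are invented for illustration — we can build a POST request with Requests and inspect exactly what would be sent, without contacting any server:

```python
import requests

# A sketch only: https://example.com/submit and the field names are made up.
# Building a Request and calling .prepare() shows what would be sent,
# without actually performing the request.
prepared = requests.Request(
    "POST",
    "https://example.com/submit",
    data={"name": "Grace", "answer": "yes"},
).prepare()

print(prepared.method)  # POST
print(prepared.url)     # https://example.com/submit
print(prepared.body)    # name=Grace&answer=yes
```

Actually sending it would be a single call to `requests.post` with the same arguments.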
# %% [markdown] # This can be used for all kinds of things, for example, to programmatically add data to a web resource. It's all well beyond diff --git a/ch02data/062basic_plotting.ipynb.py b/ch02data/062basic_plotting.ipynb.py new file mode 100644 index 00000000..40006d8b --- /dev/null +++ b/ch02data/062basic_plotting.ipynb.py @@ -0,0 +1,115 @@ +# --- +# jupyter: +# jekyll: +# display_name: Basic plotting +# jupytext: +# notebook_metadata_filter: -kernelspec,jupytext,jekyll +# text_representation: +# extension: .py +# format_name: percent +# format_version: '1.3' +# jupytext_version: 1.15.2 +# --- + +# %% [markdown] +# # Plotting with Matplotlib - the `pyplot` interface +# +# [Matplotlib](https://matplotlib.org/) is a Python library which can be used to produce plots to visualise data. It has support for a wide range of [different plot types](https://matplotlib.org/stable/gallery/index.html), and as well as supporting static outputs it also allows producing [animations](https://matplotlib.org/stable/api/animation_api.html) and [interactive plots](https://matplotlib.org/stable/gallery/index.html#event-handling). As an intial introduction, we will demonstrate how to use Matplotlib's [`pyplot` interface](https://matplotlib.org/stable/api/index.html#the-pyplot-api) (modelled on the plotting functions in MATLAB), to create a simple line plot. If you return for our more advanced course we will illustrate Matplotlib's [object-oriented interface](https://matplotlib.org/stable/api/index.html#id3) which allows more flexibility in creating complex plots and greater control over the appearance of plot elements. +# +# ## Importing Matplotlib +# +# We import the `pyplot` object from Matplotlib, which provides us with an interface for making figures. A common convention is to use the `import ... as ...` syntax to alias `matplotlib.pyplot` to the shorthand name `plt`. 
+ +# %% +import matplotlib.pyplot as plt + +# %% [markdown] +# ## A basic plot +# +# As a first example we create a basic line plot. + +# %% +plt.plot([2, 4, 6, 8, 10], [1, 5, 3, 7, -11]) + +# %% [markdown] +# The `plt.plot` function allows producing line and scatter plots which visualize the relationship between pairs of variables. Here we pass `plt.plot` two lists of five numbers corresponding to respectively the coordinates of the points to plot on the horizontal (*x*) axis and the coordinates of the points to plot on the vertical (*y*) axis. When passed no other arguments by default `plt.plot` will produce a line plot passing through the specified points. The value returned by `plt.plot` is a list of objects corresponding to the plotted line(s): in this case we plotted only one line so the list has only one element. We will for now ignore these return values until we cover object-oriented programminga and return to explain Matplotlib's object-oriented interface. +# +# If passed a single list of numbers, the `plt.plot` function will interpret these as the coordinates of the points to plot on the vertical (*y*) axis, with the horizontal (*x*) axis points in this case implicitly assumed to be the indices of the values in the list. For example, if we plot with just the second list from the previous `plt.plot` call + +# %% +plt.plot([1, 5, 3, 7, -11]) + +# %% [markdown] +# We get a very similar looking plot other than the change in the scale on the horizontal axis. + +# %% [markdown] +# ## Plotting a function +# +# To make things a little more visually interesting, we will illustrate plotting the trigonometric functions *sine* ($\sin$) and *cosine* ($\cos$). We first import implementations of these functions from the in-built `math` module as well as the constant numerical constant `pi` ($\pi$). 

+# %%
+from math import sin, cos, pi
+
+# %% [markdown]
+# The `sin` and `cos` functions both take a single argument corresponding to an angular quantity in [radians](https://en.wikipedia.org/wiki/Radian) and are [periodic](https://en.wikipedia.org/wiki/Periodic_function) with period $2\pi$. We therefore create a list of equally spaced angles in the interval $[0, 2\pi)$ and assign it to a variable `theta`.
+
+# %%
+number_of_points = 100
+theta = [2 * pi * n / number_of_points for n in range(number_of_points)]
+
+# %% [markdown]
+# Using a list comprehension we can now compute the value of the sine function for each value in `theta` and graph this as the vertical coordinates of a line plot.
+
+# %%
+plt.plot(theta, [sin(t) for t in theta])
+
+# %% [markdown]
+# ## Plotting multiple lines
+#
+# We can plot multiple different lines on the same plot by making multiple calls to `plt.plot` within the same cell. For example, in the cell below we plot both the sine and cosine functions.
+
+# %%
+plt.plot(theta, [sin(t) for t in theta])
+plt.plot(theta, [cos(t) for t in theta])
+
+# %% [markdown]
+# By default Matplotlib will cycle through a [sequence of colours](https://matplotlib.org/stable/gallery/color/color_cycle_default.html) as each new plot is added to help distinguish between the different plotted lines.
+#
+# ## Changing the line styles
+#
+# The `plt.plot` function offers various optional keyword arguments that can be used to further customise the plot. One useful argument is `linestyle`, which allows the style of the line used to join the plotted points to be specified - for example, this can be useful to allow plotted lines to be distinguished even when they are printed in monochrome. Matplotlib has [a variety of built-in linestyles with simple string names](https://matplotlib.org/stable/gallery/lines_bars_and_markers/linestyles.html) as well as options for performing further customisation. Here we specify that the cosine curve should be plotted with a dotted line.
+ +# %% +plt.plot(theta, [sin(t) for t in theta]) +plt.plot(theta, [cos(t) for t in theta], linestyle="dotted") + +# %% [markdown] +# ## Adding a legend +# +# Although we can visually distinguish between the two plotted lines, ideally we would have labels to indicate which corresponds to which function. We can add a legend to the plot with the `plt.legend` function. If we pass a list of strings to `plt.legend` these will be interpreted as the labels for each of the lines plotted so far in the order plotted. Matplotlib has [in-built support](https://matplotlib.org/stable/tutorials/text/mathtext.html) for using [TeX markup](https://en.wikibooks.org/wiki/LaTeX/Mathematics) to write mathematical expressions by putting the TeX markup within a pair of dollar signs (`$`). As TeX's use of the backslash character `\` to prefix commands conflicts with Python's interpretation of `\` as an escape character, you should typically use raw-strings by prefixing the string literal with `r` to simplify writing TeX commands. + +# %% +plt.plot(theta, [sin(t) for t in theta]) +plt.plot(theta, [cos(t) for t in theta], linestyle="dotted") +plt.legend([r"$\sin\theta$", r"$\cos\theta$"]) + +# %% [markdown] +# Matplotlib also allows the legend label for a plot to be specified in the `plt.plot` call using the `label` keyword arugment. When plotting many lines this can be more readable than having to create a separate list of labels to pass to a subsequent `plt.legend` call. If we specify the `label` keyword arguments we can call `plt.legend` without any arguments. + +# %% +plt.plot(theta, [sin(t) for t in theta], label=r"$f(\theta) = \sin\theta$") +plt.plot(theta, [cos(t) for t in theta], linestyle="dotted", label=r"$f(\theta) = \cos\theta$") +plt.legend() + +# %% [markdown] +# ## Adding axis labels and a title +# +# The `pyplot` interface also provides functions for adding axis labels and a title to our plot. 
Specifically `plt.xlabel` and `plt.ylabel` are functions which set the labels on respectively the horizontal (*x*) axis and vertical (*y*) axis, both accepting a string argument corresponding to the axis label. The `plt.title` function, as the name suggests, allows setting an overall title for the plot. As for the legend labels, the axis labels and title may all optionally use TeX mathematical notation delimited by dollar `$` signs. + +# %% +plt.plot(theta, [sin(t) for t in theta], label=r"$f(\theta) = \sin\theta$") +plt.plot(theta, [cos(t) for t in theta], linestyle="dotted", label=r"$f(\theta) = \cos\theta$") +plt.legend() +plt.xlabel(r"Angle in radians $\theta$") +plt.ylabel(r"$f(\theta)$") +plt.title("Trigonometric functions") diff --git a/ch02data/062csv.ipynb.py b/ch02data/063tabular.ipynb.py similarity index 99% rename from ch02data/062csv.ipynb.py rename to ch02data/063tabular.ipynb.py index 8de8dc09..fc74ad0a 100644 --- a/ch02data/062csv.ipynb.py +++ b/ch02data/063tabular.ipynb.py @@ -1,7 +1,7 @@ # --- # jupyter: # jekyll: -# display_name: CSV +# display_name: Tabular data # jupytext: # notebook_metadata_filter: -kernelspec,jupytext,jekyll # text_representation: @@ -12,7 +12,7 @@ # --- # %% [markdown] -# ## Field and Record Data (Tabular data) +# # Field and Record Data (Tabular data) # # Tabular data, that is data that is formatted as a table with a fixed number of rows and columns, is very common in a research context. A particularly simple and also popular file format for such data is [_delimited-separated value_ files](https://en.wikipedia.org/wiki/Delimiter-separated_values). 
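As a taster of what a library for such files does for us, here is a small sketch using the standard library's `csv` module — note how it copes with the delimiter appearing inside a quoted field, which a naive `split(",")` would get wrong:

```python
import csv
import io

# A small comma-separated table; the second record has the delimiter
# *inside* a quoted field
text = 'name,comment\nalpha,"fine, thanks"\nbeta,ok\n'

rows = list(csv.reader(io.StringIO(text)))
print(rows)
# [['name', 'comment'], ['alpha', 'fine, thanks'], ['beta', 'ok']]
```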
diff --git a/ch02data/064JsonYamlXML.ipynb.py b/ch02data/064JsonYamlXML.ipynb.py index 397bdbd4..0d04cea5 100644 --- a/ch02data/064JsonYamlXML.ipynb.py +++ b/ch02data/064JsonYamlXML.ipynb.py @@ -12,132 +12,224 @@ # --- # %% [markdown] -# ## Structured Data - -# %% [markdown] -# ### Structured data - -# %% [markdown] -# CSV files can only model data where each record has several fields, and each field is a simple datatype, -# a string or number. - -# %% [markdown] -# We often want to store data which is more complicated than this, with nested structures of lists and dictionaries. -# Structured data formats like JSON, YAML, and XML are designed for this. - -# %% [markdown] -# ### JSON - -# %% [markdown] -# [JSON](https://en.wikipedia.org/wiki/JSON) is a very common open-standard data format that is used to store structured data in a human-readable way. - -# %% [markdown] -# This allows us to represent data which is combinations of lists and dictionaries as a text file which -# looks a bit like a Javascript (or Python) data literal. +# # Structured data +# +# *Comma separated variable* (CSV) files can only store *tabular data* in which all records share the same fields and each field is a simple data type such as a string or number. We often want to store data which has a more complex *hierarchical structure*, for example data in which the fields in each record are themselves structured objects. Structured data formats like [JSON](#JSON), [YAML](#YAML) and [XML](#XML) are designed for this. +# +# ## JSON +# +# A very common structured data format is [*JavaScript Object Notation* (JSON)](https://en.wikipedia.org/wiki/JSON). As the name suggests this a JavaScript based format for storing data objects. 
JSON allows us to represent hierarchical data with a [tree-like structure](https://en.wikipedia.org/wiki/Tree_(data_structure)) by nesting sequence-like *arrays* (analogous to Python `list` objects) and mapping-like *objects* (analogous to Python `dict` objects), with the 'leaves' of the tree being one of a small set of simple data types:
+#
+# * *number*: a signed decimal number potentially using [E notation](https://en.wikipedia.org/wiki/Scientific_notation#E_notation), for example `12`, `0.58`, `6.022e23`. Unlike Python, no distinction is made between integer and floating-point values.
+# * *string*: a sequence of [Unicode characters](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) delimited with double quotation marks, for example `"Hello world"`, `""`, `"πŸ˜€πŸ˜„πŸ˜"`. Unlike Python, single quotation marks cannot be used as delimiters.
+# * *boolean*: one of the literals `true` or `false`. Note the difference in capitalisation from the equivalent Python values.
+# * `null`: an empty value, comparable to `None` in Python.
+#
+# JSON *arrays* are ordered sequences of zero or more elements, and like Python lists are delimited with square brackets with comma-separated values. Also like Python lists, each element in the array can be of a different type, with the allowable types being any of the four simple types described above, arrays or objects (see following).
+#
+# JSON *objects* are *key-value* mappings in which the keys are strings and the values may be any of the simple data types above, arrays or objects. As with Python dictionaries, the keys within each object must be unique, and also like Python dictionaries, objects are delimited with curly braces, with each key-value pair comma-separated and a colon used to separate the key from the following value.
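The syntax rules above combine like this — a short JSON document nesting an array inside an object and using all four simple types, decoded here with Python's standard `json` module:

```python
import json

# A JSON object containing a number, a string, a boolean, null and a
# nested array; note false/null rather than Python's False/None
document = '{"name": "pendulum", "lengths": [0.1, 2.5e-1], "damped": false, "notes": null}'

data = json.loads(document)
print(data)
# {'name': 'pendulum', 'lengths': [0.1, 0.25], 'damped': False, 'notes': None}
```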
+# +# The `json` module in the Python standard library provides functions for encoding / decoding Python data structures to / from JSON format. # %% import json # %% [markdown] -# Any nested group of dictionaries and lists can be saved: - -# %% -mydata = {'key': ['value1', 'value2'], - 'key2': {'key4':'value3'}} +# Specifically `json` uses the following translations between types when decoding and encoding +# +#
+# +# +# +# +# +# +# +# +# +# +# +#
Decoding (JSON to Python)
JSON Python
object dict
array list
string str
number (integer) int
number (real) float
boolean bool
null None
+# +# +# +# +# +# +# +# +# +# +#
Encoding (Python to JSON)
Python JSON
dict object
list, tuple array
str string
int, float number
bool boolean
None null
+# +#
+# +# Note that mapping of types when encoding from Python to JSON is not one-to-one so sequentially encoding a Python object to JSON and then decoding back to a Python object can result in some changes in types. +# +# As a simple example consider the following Python object consisting of nested dictionaries and lists: # %% -json.dumps(mydata) +my_data = {'foo': ['value', 1, True], 'bar': {'spam': 3.4, 'eggs': None}} # %% [markdown] -# If you would like a more readable output, you can use the `indent` argument. +# We can encode this Python object to a JSON formatted string using the [`json.dumps` function](https://docs.python.org/3/library/json.html#json.dumps) # %% -print(json.dumps(mydata, indent=4)) +json_string = json.dumps(my_data) +print(json_string) # %% [markdown] -# Loading data is also really easy: +# The `json.dumps` function has several optional keyword arguments that can be used to produce a more nicely formatted output which can be useful to increase readability when encoding large objects, for instance # %% -# %%writefile myfile.json -{ - "somekey": ["a list", "with values"] -} +json_string = json.dumps(my_data, indent=4, sort_keys=True) +print(json_string) -# %% -with open('myfile.json', 'r') as json_file: - my_data_as_string = json_file.read() +# %% [markdown] +# We can then easily save the JSON formatted string to a file using the `open` function we encountered in a previous lesson. 
# %% -my_data_as_string +with open('my_file.json', 'w') as f: + f.write(json_string) -# %% -mydata = json.loads(my_data_as_string) +# %% [markdown] +# As encoding a Python object to JSON format and writing the result to a file is such a common operation, the `json` module also provides the `json.dump` function which can be used to directly write the JSON encoding of a Python object to a file: # %% -mydata['somekey'] +with open('my_file.json', 'w') as f: + json.dump(my_data, f) # %% [markdown] -# This is a very nice solution for loading and saving Python data structures. +# We can similarly use `open` to read JSON formatted data in to a string -# %% [markdown] -# It's a very common way of transferring data on the internet, and of saving datasets to disk. - -# %% [markdown] -# There's good support in most languages, so it's a nice inter-language file interchange format. +# %% +with open('my_file.json', 'r') as f: + loaded_json_string = f.read() +print(loaded_json_string) # %% [markdown] -# ### YAML +# We can then use the [`json.loads` function](https://docs.python.org/3/library/json.html#json.loads) to decode this JSON formatted string in to a Python object -# %% [markdown] -# [YAML](https://en.wikipedia.org/wiki/YAML) is a very similar data format to JSON, with some nice additions: +# %% +loaded_data = json.loads(loaded_json_string) +print(f"loaded_data = {loaded_data}") +print(f"type(loaded_data) = {type(loaded_data).__name__}") +print(f"type(loaded_data['foo']) = {type(loaded_data['foo']).__name__}") # %% [markdown] -# * You don't need to quote strings if they don't have funny characters in -# * You can have comment lines, beginning with a `#` -# * You can write dictionaries without the curly brackets: it just notices the colons. 
-# * You can write lists like this: +# As with `json.dumps` and `json.dump`, there is also a [`json.load` function](https://docs.python.org/3/library/json.html#json.load) which can be used to directly load a JSON formatted file in to a Python object: # %% -# %%writefile myfile.yaml -somekey: - - a list # Look, this is a list - - with values +with open('my_file.json', 'r') as f: + loaded_data = json.load(f) +print(f"loaded_data = {loaded_data}") + +# %% [markdown] +# JSON is a very useful format for loading and saving Python data structures. It is a common way of transferring data on the internet, and as there is good support in many programming languages, it is a convenient inter-language file interchange format. + +# %% [markdown] +# ## YAML + +# %% [markdown] +# YAML ([originally short for *Yet Another Markup Language*](https://en.wikipedia.org/wiki/YAML#History_and_name)) is another structured data format with many similarities to JSON; in fact [recent versions of YAML are a superset of JSON](https://en.wikipedia.org/wiki/YAML#Comparison_with_JSON). As well as supporting the same types and syntax as JSON, YAML also several additional features which can allow more readable formatting of data, for example +# +# * Similar to Python, whitespace indentation can be used to denote nested structures rather than using explicit delimiters. +# * Strings do not need to be delimited with quotes in most cases other than when escaping special characters. +# * Comments can be included by prefixing with the `#` character, with all subsequent characters up to the end of the line ignored when decoding. +# * Arrays (lists) can be denoted by placing each element on a separate line prefixed with a `-` character. +# * Objects (dictionaries) can be denoted by placing each key-value pair on a separate line with a `:` character separating the key and value. 
+# +# For example, the following text represents an equivalent data object as encountered in the previous JSON section in a YAML compatible format +# +# ```yaml +# foo: +# # The following is a list +# - value +# - 1 +# - true +# bar: +# # The indentation heres indicates the following lines are a nested object +# spam: 3.4 +# eggs: null +# ``` +# +# Unlike for JSON, there is no built-in module for encoding / decoding YAML files in the Python standard library. One third-party option is the [PyYAML](https://pyyaml.org/wiki/PyYAMLDocumentation) library. If PyYAML is installed in the active Python environment (it is included in the Anaconda Python distribution for example), then the `yaml` module can then be imported by running the following # %% -import yaml # This may need installed as pyyaml - -# %% -with open('myfile.yaml') as yaml_file: - my_data = yaml.safe_load(yaml_file) -print(mydata) +import yaml # %% [markdown] -# YAML is a popular format for ad-hoc data files, but the library doesn't ship with default Python (though it is part -# of Anaconda and Canopy), so some people still prefer JSON for its universality. +# Similarly to the `json.dump` and `json.dumps` functions, `yaml` provides a `dump` function which can be used to encode a Python object to a corresponding YAML formatted string *or* directly write the YAML formatted output to a file. 
When called without a `stream` keyword argument the `yaml.dump` function returns a YAML formatted string corresponding to the passed object (analogous to `json.dumps`): + +# %% +yaml_string = yaml.dump(my_data) +print(yaml_string) # %% [markdown] -# Because YAML gives the option of serialising a list either as newlines with dashes or with square brackets, -# you can control this choice: +# If we instead pass a stream-like object such as a file as the second argument, the output will instead be written directly to the stream (analogous to `json.dump`): # %% -print(yaml.safe_dump(mydata)) +with open('my_file.yaml', 'w') as f: + yaml.dump(my_data, stream=f) + +# %% [markdown] +# The `yaml` module also provides a `load` function analogous to the `json.load` function for loading objects from YAML formatted files. Although YAML itself only represents data, it supports language-specific tags which YAML parsers such as PyYAML may use to allow representing arbitrary types. As this means PyYAML can construct Python objects which may execute code on loading, loading YAML files from untrusted sources can be a [security concern](https://pyyaml.org/wiki/PyYAMLDocumentation#loading-yaml). It is therefore recommended to use `yaml.safe_load` to load YAML files when you are unsure about their source as this removes the risk of arbitrary code execution (with the tradeoff of no longer being able to encode any Python object). As the data object we just wrote to file only uses simple types, we can use `yaml.safe_load` here without any issues. 
# %% -print(yaml.safe_dump(mydata, default_flow_style=True)) +with open('my_file.yaml', 'r') as f: + loaded_data = yaml.safe_load(f) +print(f"loaded_data = {loaded_data}") +print(f"type(loaded_data) = {type(loaded_data).__name__}") +print(f"type(loaded_data['foo']) = {type(loaded_data['foo']).__name__}") # %% [markdown] -# `default_flow_style=False` (the default) uses a "block style" (rather than an "inline" or "flow style") to delineate data structures. See [the YAML docs](http://yaml.org/spec/1.2/spec.html) for more details. +# YAML is a very versatile format for ad-hoc data files, however, as YAML encoding / decoding is not part of the Python standard library, JSON is sometimes preferred for its increased ease of use and universality. # %% [markdown] -# ### XML +# ## XML # %% [markdown] -# *Supplementary material*: [XML](http://www.w3schools.com/xml/) is another popular choice when saving nested data structures. -# It's very careful, but verbose. If your field uses XML data, you'll need to learn a [python XML parser](https://docs.python.org/3/library/xml.etree.elementtree.html) -# (there are a few), and about how XML works. +# *Supplementary material* +# +# [*Extensible Markup Language* (XML)](http://www.w3schools.com/xml/) is another popular format for storing hierarchical data structures. XML is very general and flexible, but is also very verbose which can hinder the human readability of XML encoded data and lead to large file sizes. In some scientific fields, XML based formats for data storage are very common. If you want to read and write XML formatted data in Python, a collection of tools for processing XML are available in the standard library within the [XML package](https://docs.python.org/3/library/xml.html). # %% [markdown] -# ### Exercise: Saving and loading data +# ## Exercise: saving and loading a maze # %% [markdown] -# Use YAML or JSON to save your maze data structure to disk and load it again. 
+# Use YAML or JSON to save the maze data structure you designed in the previous *A Maze Model* exercise to disk, and to load it again. If you do not have a solution to hand, use the example solution below.
+
+# %%
+maze = {
+    'living' : {
+        'exits': {
+            'north' : 'kitchen',
+            'outside' : 'garden',
+            'upstairs' : 'bedroom'
+        },
+        'people' : ['James'],
+        'capacity' : 2
+    },
+    'kitchen' : {
+        'exits': {
+            'south' : 'living'
+        },
+        'people' : [],
+        'capacity' : 1
+    },
+    'garden' : {
+        'exits': {
+            'inside' : 'living'
+        },
+        'people' : ['Sue'],
+        'capacity' : 3
+    },
+    'bedroom' : {
+        'exits': {
+            'downstairs' : 'living',
+            'jump' : 'garden'
+        },
+        'people' : [],
+        'capacity' : 1
+    }
+}
diff --git a/ch02data/066QuakeExercise.ipynb.py b/ch02data/066QuakeExercise.ipynb.py
index f799e2ff..de45d54c 100644
--- a/ch02data/066QuakeExercise.ipynb.py
+++ b/ch02data/066QuakeExercise.ipynb.py
@@ -12,42 +12,75 @@
 # ---
 
 # %% [markdown]
-# ## Classroom exercise: the biggest earthquake in the UK this century
+# # Extended exercise: the biggest earthquake in the UK this century
+#
+# ## USGS earthquake catalog
+#
+# [GeoJSON](https://en.wikipedia.org/wiki/GeoJSON) is a JSON-based file format for sharing geographic data. One example dataset available in the GeoJSON format is the [USGS earthquake catalog](https://www.usgs.gov/natural-hazards/earthquake-hazards/earthquakes). A [web service *application programming interface* (API)](https://earthquake.usgs.gov/fdsnws/event/1/) is provided for programmatically accessing events in the earthquake catalog. Specifically, the `query` method allows querying the catalog for events, with the [query parameters](https://earthquake.usgs.gov/fdsnws/event/1/#parameters) passed as `key=value` pairs.
+#
+# We can use the [`requests` Python library](https://docs.python-requests.org/en/latest/) to simplify constructing the appropriate query string to add to the URL and to deal with sending the HTTP request.
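To see concretely what `requests` does with a parameter dictionary, we can build a request without sending it and inspect the URL that would be used (a sketch using `requests.Request` and its `prepare` method; the parameter values are illustrative):

```python
import requests

# Build, but do not send, a GET request with a dictionary of query parameters
request = requests.Request(
    "GET",
    "http://earthquake.usgs.gov/fdsnws/event/1/query",
    params={"format": "geojson", "minmagnitude": "1"},
).prepare()

# requests URL-encodes the parameters into the query string after the `?`
print(request.url)
```

This is exactly the query string we would otherwise have to assemble (and escape) by hand.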
+ +# %% jupyter={"outputs_hidden": false} +import requests + +# %% [markdown] +# We first define a variable URL for the earthquake catalog web service API. + +# %% jupyter={"outputs_hidden": false} +earthquake_catalog_api_url = "http://earthquake.usgs.gov/fdsnws/event/1/query" + +# %% [markdown] +# We now need to define the parameters of our query. We want to get the data in GeoJSON format for all events in the earthquake catalog with date on or after 1st January 2000 and with location within [a bounding box covering the UK](http://bboxfinder.com/#49.877000,-8.182000,60.830000,1.767000). We will filter the events to only request those with magnitude greater or equal to 1 to avoid downloading responses for more frequent small magnitude events. Finally we want the results to be returned in order of ascending time. + +# %% jupyter={"outputs_hidden": false} +query_parameters = { + "format": "geojson", + "starttime": "2000-01-01", + "maxlatitude": "60.830", + "minlatitude": "49.877", + "maxlongitude": "1.767", + "minlongitude": "-8.182", + "minmagnitude": "1", + "orderby": "time-asc" +} # %% [markdown] -# ### The Problem +# We now can execute the API query using the [`requests.get` function](https://docs.python-requests.org/en/latest/api/#requests.get). This takes as arguments the URL to make the request from and, optionally, a dictionary argument `params` containing the parameters to send in the query string for the request. A [`requests.Response` object](https://docs.python-requests.org/en/latest/api/#requests.Response) is returned, which represents the server's response to the HTTP request made. + +# %% jupyter={"outputs_hidden": false} +quakes_response = requests.get(earthquake_catalog_api_url, params=query_parameters) # %% [markdown] -# GeoJSON is a JSON-based file format for sharing geographic data. One example dataset is the USGS earthquake data: +# The response object has various attributes and methods. 
A useful attribute to check is the `ok` attribute which will be `False` if the status code for the response to the request corresponds to a client or server error and `True` otherwise. # %% -import requests -quakes = requests.get("http://earthquake.usgs.gov/fdsnws/event/1/query.geojson", - params={ - 'starttime': "2000-01-01", - "maxlatitude": "58.723", - "minlatitude": "50.008", - "maxlongitude": "1.67", - "minlongitude": "-9.756", - "minmagnitude": "1", - "endtime": "2018-10-11", - "orderby": "time-asc"} - ) +quakes_response.ok + +# %% [markdown] +# We can also check specifically that the status code corresponds to [the expected `200 OK`](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#2xx_success) using the `status_code` attribute # %% -quakes.text[0:100] +quakes_response.status_code == 200 # %% [markdown] -# Your exercise: determine the location of the largest magnitude earthquake in the UK this century. +# The actual content of the response can be accessed in various formats. The `content` attribute gives the content of the response as [bytes](https://docs.python.org/3/library/stdtypes.html#bytes). As here we expect the response content to be Unicode-encoded text, the `text` attribute is more relevant as it gives the response content as a Python string. We can display the first 100 characters of the response content as follows + +# %% +print(quakes_response.text[:100]) # %% [markdown] -# You'll need to: -# * Get the text of the web result -# * Parse the data as JSON -# * Understand how the data is structured into dictionaries and lists -# * Where is the magnitude? -# * Where is the place description or coordinates? 
-# * Program a search through all the quakes to find the biggest quake
-# * Find the place of the biggest quake
-# * Form a URL for an online map service at that latitude and longitude: look back at the introductory example
-# * Display that image
+# ## Task
+#
+# The task for this exercise is to determine the location of the largest magnitude earthquake in the UK this century, using the data from the USGS earthquake catalog.
+#
+# You will need to:
+#
+# * Query the USGS earthquake catalog for the relevant event data (see above).
+# * Parse the data as JSON.
+# * Understand how the data is structured into dictionaries and lists:
+#   * Where is the magnitude?
+#   * Where is the place description or coordinates?
+# * Program a search through all the quakes to find the biggest quake.
+# * Find the place of the biggest quake.
+# * Form a URL for a map tile centered at that latitude and longitude (look back at the introductory example).
+# * Display that map tile image.
diff --git a/ch02data/068QuakesSolution.ipynb.py b/ch02data/068QuakesSolution.ipynb.py
index 03f23deb..ece50423 100644
--- a/ch02data/068QuakesSolution.ipynb.py
+++ b/ch02data/068QuakesSolution.ipynb.py
@@ -12,118 +12,223 @@
 # ---
 
 # %% [markdown]
-# ## Solution to the earthquake exercise
+# # Solution: the biggest earthquake in the UK this century
 #
-# **NOTE:** This is intended as a reference for **after** you have attempted [the problem](./066QuakeExercise.html) [(notebook version)](./066QuakeExercise.ipynb) yourself!
+# ## Import modules
+#
+# We first import the modules we will need to get and plot the data. We will use the `requests` module to query the USGS earthquake catalog, the `math` module to do some coordinate conversion and the `IPython` module to display a map image.
+
+# %% jupyter={"outputs_hidden": false}
+import requests
+import math
+import IPython
+
+# %% [markdown]
+# ## Download the data
+#
+# We can reuse the code provided in the exercise notebook to query the USGS earthquake catalog API using the `requests` module.
+
+# %%
+earthquake_catalog_api_url = "http://earthquake.usgs.gov/fdsnws/event/1/query"
+query_parameters = {
+    "format": "geojson",
+    "starttime": "2000-01-01",
+    "maxlatitude": "60.830",
+    "minlatitude": "49.877",
+    "maxlongitude": "1.767",
+    "minlongitude": "-8.182",
+    "minmagnitude": "1",
+    "orderby": "time-asc"
+}
+quakes_response = requests.get(earthquake_catalog_api_url, params=query_parameters)

 # %% [markdown]
-# ### Download the data
+# We can check that the returned object is a `Response` as expected

-# %%
-import requests
-quakes = requests.get("http://earthquake.usgs.gov/fdsnws/event/1/query.geojson",
-                      params={
-                          'starttime': "2000-01-01",
-                          "maxlatitude": "58.723",
-                          "minlatitude": "50.008",
-                          "maxlongitude": "1.67",
-                          "minlongitude": "-9.756",
-                          "minmagnitude": "1",
-                          "endtime": "2018-10-11",
-                          "orderby": "time-asc"}
-                      )
+# %% jupyter={"outputs_hidden": false}
+type(quakes_response)

 # %% [markdown]
-# ### Parse the data as JSON
+# It is also useful to check whether the status of the returned response corresponds to there not being any client or server errors, with this being indicated by a status code of 200

+# %% jupyter={"outputs_hidden": false}
+assert quakes_response.status_code == 200
+
+# %% [markdown]
+# ## Parse the data as JSON

+# %% [markdown]
+# We saw in the exercise notebook that the `Response` objects returned by `requests.get` have various attributes that allow accessing the response content in different formats including `Response.content` to get the raw `bytes` content and `Response.text` to get the response as a (Unicode) `str` object.
We can print out all of the attributes of an object in Python using the inbuilt `dir` function; typically these will include attributes intended for internal use only, which are conventionally indicated by prefixing with an underscore character `_`. We can display all the attributes without an initial underscore character as follows.

 # %%
-import json
+[attribute for attribute in dir(quakes_response) if attribute[0] != '_']
+
+# %% [markdown]
+# As well as the `content`, `ok`, `status_code` and `text` attributes we already encountered, we can see there is also a `json` attribute, which seems like it could be relevant to our task of decoding the response as JSON. We can find out more about this attribute by using a useful feature of Jupyter / IPython - by adding a question mark `?` to the end of a Python object the documentation string (`docstring`) for that object will be displayed.

 # %%
-requests_json = json.loads(quakes.text)
+# quakes_response.json?

 # %% [markdown]
-# ### Investigate the data to discover how it is structured
+# From this we can see that `quakes_response.json` is a method (a function bound to an object) which returns a Python object corresponding to the JSON encoded response, which is exactly what we need. There are no required arguments so we can call the method by just adding a pair of empty parentheses.

+# %% jupyter={"outputs_hidden": false}
+quakes_json = quakes_response.json()

 # %% [markdown]
-# There is no foolproof way of doing this. A good first step is to see the type of our data!
+# If we had not been aware of the `json` method an alternative would be to use the `json` module directly as we saw previously. For example the following would give an equivalent result to the above.
+#
+# ```Python
+# import json
+# quakes_json = json.loads(quakes_response.text)
+# ```

+# %% [markdown]
+# ## Investigate the data to discover how it is structured.
+#
+# Now that we have queried and decoded the data into a Python object, we need to do some exploration to find out how it is structured. In some cases there may be documentation we can use to help guide our exploration of data - for example [this page on the USGS earthquake catalog website](https://earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php) gives a summary of the GeoJSON format. However, in other cases we may not be so lucky and have to work with data in an undocumented format, so it is also useful to consider how we might explore that data structure ourselves.
+#
+# A potentially useful first step is to check what the type of the `quakes_json` object is.
+
+# %% jupyter={"outputs_hidden": false}
+type(quakes_json)

 # %% [markdown]
-# Now we can navigate through this dictionary to see how the information is stored in the nested dictionaries and lists. The `keys` method can indicate what kind of information each dictionary holds, and the `len` function tells us how many entries are contained in a list. How you explore is up to you!
+# We see that `quakes_json` is a Python `dict` (dictionary) object. We might want to find out what keys are defined in this dictionary - we can do this by calling the `keys` method.

-# %%
-requests_json.keys()
+# %% jupyter={"outputs_hidden": false}
+quakes_json.keys()

-# %%
-len(requests_json['features'])
+# %% [markdown]
+# We see that the dictionary has three keys, all of which are strings. The `features` key in particular here looks potentially promising for our goal of finding the maximum magnitude event in the data (on the rationale that the magnitude is a *feature* of the event). We can check what type the value associated with the `features` key is.

-# %%
-requests_json['features'][0].keys()
+# %% jupyter={"outputs_hidden": false}
+type(quakes_json['features'])
+
+# %% [markdown]
+# We find out that the `features` key contains a list. We can check the length of this list.
+
+# %% jupyter={"outputs_hidden": false}
+len(quakes_json['features'])
+
+# %% [markdown]
+# We could also use a set (which we encountered previously in the lesson on dictionaries) to find out what the set of types of all of the elements in the `quakes_json['features']` list is. Similarly to the list comprehensions we encountered in a previous lesson, we can use a set comprehension here to succinctly construct the set we require.

 # %%
-requests_json['features'][0]['properties'].keys()
+{type(feature) for feature in quakes_json['features']}
+
+# %% [markdown]
+# From this we see that all of the elements in the `quakes_json['features']` list share the same Python `dict` type. We can use a similar approach to find out what keys all these dictionary objects have.

 # %%
-requests_json['features'][0]['properties']['mag']
+{tuple(feature.keys()) for feature in quakes_json['features']}
+
+# %% [markdown]
+# This tells us that as well as all the elements being dictionaries, all of the dictionaries have the same keys. This suggests the list corresponds to a representation of a sequence of objects of the same type, with each dictionary containing the 'features' for a particular object, with, in this case, the objects in question being particular earthquake events.
+#
+# To check this idea, we can look at the value of a particular element in the `features` list - as we know all elements are dictionaries with the same keys, it's reasonable to assume the first element `quakes_json['features'][0]` will be representative of all the other elements in the list. We can start by summarising the keys and types of the values in this dictionary.

 # %%
-requests_json['features'][0]['geometry']
+for key, value in quakes_json['features'][0].items():
+    print(key, type(value).__name__)

 # %% [markdown]
-# Also note that some IDEs display JSON in a way that makes its structure easier to understand. Try saving this data in a text file and opening it in an IDE or a browser.
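This kind of step-by-step exploration can also be partially automated. As a rough sketch, a small recursive helper (a hypothetical function, not part of the notebook or of any library) can print the nested structure of a JSON-like object down to a chosen depth:

```python
def summarise_structure(obj, depth=0, max_depth=2):
    """Print the keys and value types of a nested JSON-like object."""
    indent = "  " * depth
    if isinstance(obj, dict):
        for key, value in obj.items():
            print(f"{indent}{key}: {type(value).__name__}")
            if depth < max_depth:
                summarise_structure(value, depth + 1, max_depth)
    elif isinstance(obj, list):
        element_types = {type(element).__name__ for element in obj}
        print(f"{indent}list of {len(obj)} elements with types {element_types}")


# For example, applied to a small GeoJSON-like fragment:
summarise_structure({"type": "FeatureCollection", "features": [{"properties": {"mag": 2.6}}]})
```

Passing `quakes_json` to such a helper would reproduce much of the exploration performed by hand above.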

 # %% [markdown]
-# ### Find the largest quake
+# We can also view the dictionary directly
+
+# %%
+quakes_json['features'][0]

 # %% [markdown]
+# From this we can see the `properties` and `geometry` keys both themselves map to `dict` objects. Within these inner dictionaries are several keys which look relevant to our task of identifying the highest magnitude earthquake event and displaying its location on a map. Specifically the `mag` key in the `properties` dictionary seems likely to represent the magnitude of the event

 # %%
-quakes = requests_json['features']
+quakes_json['features'][0]['properties']['mag']
+
+# %% [markdown]
+# while the `coordinates` key in the `geometry` dictionary seems to represent the location of the event.

 # %%
-largest_so_far = quakes[0]
-for quake in quakes:
-    if quake['properties']['mag'] > largest_so_far['properties']['mag']:
-        largest_so_far = quake
-largest_so_far['properties']['mag']
+quakes_json['features'][0]['geometry']['coordinates']
+
+# %% [markdown]
+# If we go to the URL listed as the value for the `url` key in the `properties` dictionary,

 # %%
-lat = largest_so_far['geometry']['coordinates'][1]
-long = largest_so_far['geometry']['coordinates'][0]
-print("Latitude: {} Longitude: {}".format(lat, long))
+quakes_json['features'][0]['properties']['url']

 # %% [markdown]
-# ### Get a map at the point of the quake
+# we can confirm that this is indeed a correct interpretation of the data, as the listed magnitude corresponds to the value for the `mag` key, while the longitude (East-West axis) and latitude (North-South axis) coordinates (in degrees) of the event location correspond respectively to the first two elements in the list associated with the `coordinates` key (with the third coordinate corresponding to the depth of the event).

+# %% [markdown]
+# ## Find the largest quake(s)
+#
+# Now that we have a handle on the structure of the data, we are ready to search through the data to identify the largest magnitude earthquake event(s).
We are interested in finding the element (or elements) in a sequence which maximises some property - this operation is termed the [$\arg\max$ in mathematics and computer science](https://en.wikipedia.org/wiki/Arg_max). While there is a built-in `max` function in Python, there is no corresponding `argmax` function, though several external libraries, including the NumPy library which we encounter in a subsequent lesson, do include an `argmax` function.
+#
+# We will therefore keep a list of the event or events with the largest magnitude seen so far, initialised with the first event, and loop over the remaining event details in the `features` list, creating a new list if the magnitude of the current event is larger than the previous largest or adding the event to the previous list if it has an equal magnitude. After iterating through all the events this list should contain the details of the event(s) with the largest magnitude. An example implementation of this approach is as follows.
+
+# %% jupyter={"outputs_hidden": false}
+largest_magnitude_events = [quakes_json['features'][0]]
+
+# start from the second event, as the list is already initialised with the first
+for quake in quakes_json['features'][1:]:
+    if quake['properties']['mag'] > largest_magnitude_events[0]['properties']['mag']:
+        largest_magnitude_events = [quake]
+    elif quake['properties']['mag'] == largest_magnitude_events[0]['properties']['mag']:
+        largest_magnitude_events += [quake]

 # %% [markdown]
-# We saw something similar in the [Greengraph example](../ch01python/010exemplar.html#More-complex-functions) [(notebook version)](../ch01python/010exemplar.ipynb#More-complex-functions) of the previous chapter.
+# We can now check if there was a single event with the maximum magnitude or multiple

 # %%
-import requests
+len(largest_magnitude_events)
+
+# %% [markdown]
+# It turns out there are two events with the same maximal magnitude.
As a sanity check we can print the magnitude of both events to check that they match

-def request_map_at(lat, long, satellite=True,
-                   zoom=10, size=(400, 400)):
-    base = "https://static-maps.yandex.ru/1.x/?"
+# %% jupyter={"outputs_hidden": false}
+print([quake['properties']['mag'] for quake in largest_magnitude_events])

-    params = dict(
-        z=zoom,
-        size="{},{}".format(size[0], size[1]),
-        ll="{},{}".format(long, lat),
-        l="sat" if satellite else "map",
-        lang="en_US"
-    )
-    return requests.get(base, params=params)
+# %% [markdown]
+# ## Get a map at the point of the quakes
+#
+# There are various different web services which can be used to get map imagery; here we will use [OpenStreetMap](https://www.openstreetmap.org/). Specifically we will get a pre-rendered map tile containing a location specified by latitude and longitude coordinates as a *portable network graphic* (PNG) image. [This page](https://wiki.openstreetmap.org/wiki/Slippy_map_tilenames#Python) on the OpenStreetMap wiki gives a Python implementation of a function `deg2num` to convert from a latitude-longitude pair in degrees plus a [zoom level](https://wiki.openstreetmap.org/wiki/Slippy_map_tilenames#Zoom_levels) to a pair of indices specifying a specific map tile.

+# %%
+def deg2num(lat_deg, lon_deg, zoom):
+    lat_rad = math.radians(lat_deg)
+    n = 2.0 ** zoom
+    xtile = int((lon_deg + 180.0) / 360.0 * n)
+    ytile = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
+    return (xtile, ytile)
+
+
+# %% [markdown]
+# There are various [tile servers](https://wiki.openstreetmap.org/wiki/Slippy_map_tilenames#Tile_servers) for the OpenStreetMap data with differing rendering styles and URL templates. Here we use the 'standard' style tiles for which an appropriate URL template is `http://a.tile.openstreetmap.org/{zoom}/{x}/{y}.png` where `{zoom}` indicates the zoom level and `{x}` and `{y}` the two components of the tile number returned by `deg2num`.
We can use Python [formatted string literals](https://docs.python.org/3/tutorial/inputoutput.html#tut-f-strings) to populate the template with the appropriate values and use the `requests.get` function to request the image corresponding to the URL. Finally we can create an [`IPython.display.Image` object](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html#IPython.display.Image) which will display the raw data corresponding to the contents of the image request as an image in the JupyterLab front end. We wrap all of this into a function `get_map_tile_at` to allow us to use it to display images for each of the largest magnitude earthquake events identified earlier.

 # %%
-map_png = request_map_at(lat, long, zoom=10, satellite=False)
+def get_map_tile_at(latitude, longitude, zoom=10):
+    x, y = deg2num(latitude, longitude, zoom)
+    tile_url = f"http://a.tile.openstreetmap.org/{zoom}/{x}/{y}.png"
+    response = requests.get(tile_url)
+    assert response.status_code == 200
+    return IPython.core.display.Image(response.content)
+

 # %% [markdown]
-# ### Display the map
+# As a test we can check the map displayed for the coordinates of the [Prime meridian at the Royal Observatory Greenwich](https://geohack.toolforge.org/geohack.php?pagename=Prime_meridian_(Greenwich)&params=51_28_40.1_N_0_0_5.3_W_type:landmark_globe:earth_region:GB_scale:1000)

 # %%
-from IPython.display import Image
-Image(map_png.content)
+get_map_tile_at(latitude=51.477806, longitude=-0.001472, zoom=14)
+
+# %% [markdown]
+# ## Display the maps
+#
+# We can now finally show the maps for the locations of the maximum magnitude earthquake events. As an additional check we also print the description under the `title` key in the `properties` dictionary for each event to check it tallies with the location shown in the displayed map.
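Since `get_map_tile_at` relies on the tile arithmetic in `deg2num`, we can sanity check that conversion offline before making any tile requests. The following snippet restates the `deg2num` formula from above so that it is self-contained, and only checks properties that must hold by construction (no specific tile numbers are assumed):

```python
import math

def deg2num(lat_deg, lon_deg, zoom):
    # Convert a latitude-longitude pair in degrees to slippy-map tile indices
    lat_rad = math.radians(lat_deg)
    n = 2.0 ** zoom
    xtile = int((lon_deg + 180.0) / 360.0 * n)
    ytile = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return (xtile, ytile)

# Tile indices at a given zoom level must lie in the range [0, 2 ** zoom - 1]
for zoom in range(15):
    x, y = deg2num(51.477806, -0.001472, zoom)
    assert 0 <= x < 2 ** zoom and 0 <= y < 2 ** zoom

# Greenwich is just west of longitude 0, so at zoom level 1 its tile
# falls in the western of the two tile columns
assert deg2num(51.477806, -0.001472, 1)[0] == 0
```

Checks like these catch mistakes in the coordinate conversion without depending on a network connection or a particular tile server.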
+ +# %% jupyter={"outputs_hidden": false} +for quake in largest_magnitude_events: + longitude = quake['geometry']['coordinates'][0] + latitude = quake['geometry']['coordinates'][1] + print(quake['properties']['title']) + display(get_map_tile_at(latitude, longitude, 12)) diff --git a/ch02data/072plotting.ipynb.py b/ch02data/072plotting.ipynb.py index 0414b879..0d54213b 100644 --- a/ch02data/072plotting.ipynb.py +++ b/ch02data/072plotting.ipynb.py @@ -12,202 +12,189 @@ # --- # %% [markdown] -# ## Plotting with Matplotlib +# # Plotting with Matplotlib - object-oriented interface # -# Plotting data is very common and useful in scientific work. Python does not include any plotting functionality in the language itself, but there are various frameworks available for producing plots and visualisations. -# -# In this section, we will look at Matplotlib, one of those frameworks. As the name indicates, it was conceived to provide an interface similar to the MATLAB programming language, but no knowledge of MATLAB is required! +# Now that we are familiar with the basics of object-oriented programming, we revisit Matplotlib to illustrate its powerful object-oriented interface, rather than the more restricted `pylab` interface we encountered previously. -# %% [markdown] -# ### Importing Matplotlib +# %% +import matplotlib.pyplot as plt +import numpy as np # %% [markdown] -# We import the `pyplot` object from Matplotlib, which provides us with an interface for making figures. We usually abbreviate it. +# ## *Aside*: Notebook magics +# +# At the beginning of notebooks you will often see the command +# +# ```Python +# %matplotlib inline +# ``` +# +# This is an example of a [magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html), a special syntax which can only be used when running an [*IPython* (interactive Python)](https://ipython.readthedocs.io/en/stable/overview.html) kernel, for example via a Jupyter notebook or the `IPython` terminal application. 
These commands are prefixed by one or two percent characters, with a single `%` indicating a [line magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html#line-magics) command where only the content on the immediately following line is used by the magic and a double `%%` indicating a [cell magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html#cell-magics) which should be placed at the beginning of a notebook cell and means that all the content in the cell will be interpreted by the magic command. We have already encountered magic commands, for example the `%%writefile` cell magic which writes the cell content to a specified file. +# +# The [`%matplotlib` magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-matplotlib) allows specifying the [backend](https://matplotlib.org/stable/tutorials/introductory/usage.html#backends) used by Matplotlib to display plots, with there various system-dependent [built-in option](https://matplotlib.org/stable/tutorials/introductory/usage.html#the-builtin-backends) as well as additional backends that can be installed as extensions such as [`ipympl`](https://github.com/matplotlib/ipympl) which uses [Jupyter's widget framework](https://ipywidgets.readthedocs.io/en/latest/) to enable interactive features on plots in Jupyter notebooks. A list of the backends current available can be obtained by running # %% -from matplotlib import pyplot as plt +# %matplotlib --list # %% [markdown] -# ### Notebook magics - -# %% [markdown] -# When we write: +# Running `%matplotlib inline` instructs Matplotlib to use a backend which displays plots as static images 'inline' in the cell output area. In recent versions of Matplotlib and Jupyter this backend is automatically set when importing `matplotlib.pyplot` within a Jupyter notebook and so the `%matplotlib inline` command is not strictly necessary in this case, however you will often still see this command in older notebooks. 
As the command will have no additional effect if `matplotlib.pyplot` has been imported it is safe to include it to for example ensure compatibility with older Jupyter / Matplotlib versions # %% # %matplotlib inline # %% [markdown] -# We tell the Jupyter notebook to show figures we generate alongside the code that created it, rather than in a -# separate window. Lines beginning with a single percent are not python code: they control how the notebook deals with python code. - -# %% [markdown] -# Lines beginning with two percents are "cell magics", that tell Jupyter notebook how to interpret the particular cell; -# we've seen `%%writefile`, for example. - -# %% [markdown] -# ### A basic plot - -# %% [markdown] -# When we write: +# ## Basic plot elements - figures and axes +# +# As a first example we recreate the basic line plot we constructed previously using the `pylab` interface. # %% -from math import sin, cos, pi -my_fig = plt.plot([sin(pi * x / 100.0) for x in range(100)]) +fig = plt.figure(figsize=(6, 4)) +ax = fig.add_subplot() +lines = ax.plot([0, 1, 2, 3, 4], [1, 5, 3, 7, -11]) # %% [markdown] -# The plot command *returns* a figure, just like the return value of any function. The notebook then displays this. - -# %% [markdown] -# To add a title, axis labels etc, we need to get that figure object, and manipulate it. -# For convenience, matplotlib allows us to do this just by issuing commands to change the "current figure": +# We have introduced three new functions here and corresponding Python types. The `plt.figure` function instructs Matplotlib to create and return a new [`Figure` object](https://matplotlib.org/stable/tutorials/introductory/usage.html#figure) which holds all of the elements corresponding to a single figure, such as one or more plot axes and other elements such as an overall figure title. 
The `plt.figure` function takes various optional arguments; here we pass a tuple of the figure dimensions in inches `(width, height)` as the `figsize` keyword argument. Note that the actual displayed size of the plot on screen will also depend on the resolution of your display, this can be adjusted for by also specifying the optional `dpi` (dots per inch) keyword argument to `plt.figure`. +# +# A `Figure` object is itself mainly just a container with its main function to allow adding the [`AxesSubplot` objects](https://matplotlib.org/stable/tutorials/introductory/usage.html#axes) on which we will actually produce plots. A simple way to add an axes to a `Figure` instance is using the `add_subplot` method which returns an `AxesSubplot` object. This optionally takes arguments to specify the number and location of plots when adding multiple axes to a single figure, however here we only require a single axes so we just call `add_subplot` on the `Figure` object `fig` we created in the previous line without any arguments. +# +# The `AxesSubplot` object exposes various methods for actually producing plots. One of the more generally useful methods is `plot`, which is almost identical to the `plt.plot` function we encountered previously. The value returned by `ax.plot` is a list of `Line2D` instances corresponding to the plotted line(s): in this case we plotted only one line so the list has only one element. The `Line2D` object can be used to update or access the properties of the plotted line after the initial `ax.plot` call. +# +# As creating a figure and adding a subplot (axes) to it is such a common operation, Matplotlib provides a function `plt.subplots` which combines the functionality of the `plt.figure` and `fig.add_subplot` functions. 
The `plt.subplots` function, as the name suggests, can be used to generate multiple subplots on a single figure (we will demonstrate this in the following section), but with its default arguments it will just generate a figure with a single axes. Keyword arguments to `plt.figure` are passed through by `plt.subplots`, so for example we can set the figure size by specifying the `figsize` argument. The `plt.subplots` function returns a pair of objects, the first of which is the `Figure` object and the second the one or more `AxesSubplot` instances created. We can therefore recreate the above plot as follows.
The first two positional arguments to `plt.subplots` are respectively `nrows`, which specifies the number of rows in the grid of axes, and `ncols`, which specifies the number of columns in the axes grid. For example the following cell will create a figure with a 2×3 grid of (empty) axes.
As always with the notebook, if we make a variable be returned by the last -# line of a code cell, it gets displayed: +# ## Using loops to avoid repetition +# +# While individually accessing axes like this can be useful, in cases such as this where there is a lot of commonality between the plots we can often avoid the repetition in our code by instead using loops to iterate over the axes objects. When we iterate over a NumPy array we return the array *slices* corresponding to indexing only the first dimension. As the first dimension for the `axes` object returned by `plt.subplots` corresponds to the row, a for loops of the form `for axes_row in axes:` will loop over each row of axes. In our case as each row contains three columns, the `axes_row` object will itself be another (one dimensional) NumPy array, with the leading (and only) array dimension now corresponding to the column index. To access the individual axes in each row we can *nest* a further loop which iterates over the elements in `axes_row`. For example, the following cell plots the identity function on all axes. # %% -sine_graph +fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 8)) +x = np.linspace(-1, 1, 100) +for axes_row in axes: + for ax in axes_row: + ax.plot(x, x) # %% [markdown] -# We can add another curve: - -# %% -sine_graph_axes.plot([cos(pi * x / 100.0) for x in range(100)], label='cos(x)') +# Plotting the same six times is not very interesting or useful. A common pattern is that we wish to loop over one or more sequences and create a subplot using for each of the items (or combinations of items) in the sequence(s). 
A handy built-in Python function for iterating over several iterable objects in unison is [`zip`](https://docs.python.org/3/library/functions.html#zip): it accepts one or more iterable objects as arguments and returns an object which iterates over *tuples*, with the *i*th tuple containing the items at the *i*th position in each passed iterable, in the order the iterables were passed to `zip`.
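One behaviour of `zip` worth noting (a detail not covered above): iteration stops as soon as the shortest input is exhausted, with surplus items in longer iterables silently ignored. A quick sketch:

```python
# zip truncates to the length of the shortest iterable
pairs = list(zip('abcde', range(3)))
print(pairs)  # [('a', 0), ('b', 1), ('c', 2)]
```

This truncation means that when zipping a row of axes against a tuple of parameter values, any extra axes beyond the number of values would simply be left unplotted rather than raising an error.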
The `AxesSubplot` object has methods `AxesSubplot.set_xlabel`, `AxesSubplot.set_ylabel` and `AxesSubplot.set_title` which perform the same roles for a particular `AxesSubplot` object. If we wish to set multiple properties of an axis, we can also use the `AxesSubplot.set` batch property setter method, which takes keyword arguments corresponding to properties that can be set by the `AxesSubplot.set_*` methods. For example the following cell adds labels to the *x*- and *y*-axes of each plot and a title describing the plotted function.
-Image(filename='my_graph.png') - -# %% [markdown] -# ### Subplots +fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 8)) +x = np.linspace(-1, 1, 100) +for axes_row, coefficient in zip(axes, (-1, 1)): + for ax, exponent in zip(axes_row, (1, 2, 3)): + ax.plot(x, coefficient * x**exponent) + ax.set(xlabel='$x$', ylabel='$y$', title=f'$y = {coefficient} x^{exponent}$') +fig.tight_layout() # %% [markdown] -# We might have wanted the $\sin$ and $\cos$ graphs on separate axes: - -# %% -double_graph = plt.figure() - -# %% -sin_axes = double_graph.add_subplot(2, 1, 1) # 2 rows, 1 column, 1st subplot - -# %% -cos_axes = double_graph.add_subplot(2, 1, 2) # 2 rows, 1 column, 2nd subplot - -# %% -double_graph - -# %% -sin_axes.plot([sin(pi * x / 100.0) for x in range(100)]) - -# %% -sin_axes.set_ylabel("sin(x)") - -# %% -cos_axes.plot([cos(pi * x / 100.0) for x in range(100)]) +# ## Plotting multiple lines on each plot +# +# Creating a new subplot for each plotted line can sometimes be overkill. For example if rather than wanting plot polynomial functions for two different coefficients, we wished to plot for six different coefficients (and three exponents), we would need a 6×3 grid of plots which would make for either a very large or very cramped figure, while also making it difficult to compare the relationships between the plots for different coefficients. In this case we might to decide to plot multiple lines corresponding to different coefficient values on a single subplot. We can add multiple lines to a subplot by making multiple calls to the `AxesSubplot.plot` method. By default each successive plot [will cycle through a sequence of ten colours](https://matplotlib.org/stable/gallery/color/color_cycle_default.html). For example, the following cell plots a single row of three axes, each corresponding to a different exponent and with a series of lines corresponding to polynomials with six different coefficients linearly spaced in [-1, 1] plotted on each axis. 
# %% -cos_axes.set_ylabel("cos(x)") +fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4)) +x = np.linspace(-1, 1, 100) +for ax, exponent in zip(axes, [1, 2, 3]): + for coefficient in np.linspace(-1, 1, 6): + ax.plot(x, coefficient * (x ** exponent)) +fig.tight_layout() -# %% -cos_axes.set_xlabel("100 x") +# %% [markdown] +# When plotting multiple lines on an axis as here, it is good practice to include a legend to allow uniquely identifying each different plotted line. The `AxesSubplot.legend` method allows adding a legend to an axis; as with the `plt.legend` function encountered previously there are [multiple ways of associating labels with each plot element](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.legend.html). The simplest is often to add a `label` keyword argument to each `AxesSubplot.plot` call with the required string label describing what is being plotted. For example the following cell recreates the plot from the previous cell but with the addition of axis labels, a title and a legend on each subplot. As demonstrated here, the `AxesSubplot.legend` method takes various keyword arguments which can be used to customise the styling and layout of the generated legend. # %% -double_graph +fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4)) +x = np.linspace(-1, 1, 100) +for ax, exponent in zip(axes, [1, 2, 3]): + for coefficient in np.linspace(-1, 1, 6): + ax.plot(x, coefficient * (x ** exponent), label=f'$\\alpha = {coefficient:.2g}$') + ax.set(xlabel='$x$', ylabel='$y$', title=f'$y = \\alpha x^{exponent}$') + ax.legend(ncol=2, frameon=False, handlelength=1, columnspacing=1, labelspacing=0) +fig.tight_layout() # %% [markdown] -# ### Versus plots +# ## Saving figures # %% [markdown] -# When we specify a single `list` to `plot`, the x-values are just the array index number. 
We usually want to plot something -# more meaningful: +# We need to be able to save figures to disk, in order to use them externally, for example in articles. Matplotlib `Figure` objects provide [`Figure.savefig`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html) method for this purpose. This accepts one positional argument corresponding to the file path to save the generated figure image file to; this should include an extension which specifies the file format to use. For example the following cell saves the figure produced in the previous cell into a new [*scalable vector graphics* (SVG)](https://en.wikipedia.org/wiki/Scalable_Vector_Graphics) file `polynomials.svg` in the current directory. # %% -double_graph = plt.figure() -sin_axes = double_graph.add_subplot(2, 1, 1) -cos_axes = double_graph.add_subplot(2, 1, 2) -cos_axes.set_ylabel("cos(x)") -sin_axes.set_ylabel("sin(x)") -cos_axes.set_xlabel("x") +fig.savefig('polynomials.svg') -# %% -sin_axes.plot([x / 100.0 for x in range(100)], [sin(pi * x / 100.0) for x in range(100)]) -cos_axes.plot([x / 100.0 for x in range(100)], [cos(pi * x / 100.0) for x in range(100)]) +# %% [markdown] +# If you are running this notebook interactively in Jupyter Lab, you should be able to see the generated image file in the file browser pane accessible from the left toolbar (and you can preview the file in a new tab within Jupyter Lab by double clicking on it in the file browser). +# +# As well as SVG output, Matplotlib also supports saving figures in various other formats. A dictionary mapping from file extensions to the names of file formats supported by the backend being used locally can be accessed using the code in the following cell. # %% -double_graph +fig.canvas.get_supported_filetypes() # %% [markdown] -# ### Learning More +# ## Learning more # %% [markdown] -# There's so much more to learn about matplotlib: pie charts, bar charts, heat maps, 3-d plotting, animated plots, and so on. 
You can learn all this via the [Matplotlib Website](http://matplotlib.org/). -# You should try to get comfortable with all this, so please use some time in class, or at home, to work your way through a bunch of the [examples](https://matplotlib.org/stable/gallery/index). +# We have only explored a small number of the features and plot types offered by Matplotlib here, with there so much more to learn about Matplotlib: pie charts, bar charts, heat maps, three-dimensional plotting, animated plots, and so on. You can learn all this via the excellent documentation on the [Matplotlib Website](http://matplotlib.org/#). You should try to get comfortable with Matplotlib, so please use some time in class, or at home, to work your way through a bunch of the [examples](http://matplotlib.org/examples/index.html). diff --git a/ch02data/082NumPy.ipynb.py b/ch02data/082NumPy.ipynb.py index 790a582d..ad344ca7 100644 --- a/ch02data/082NumPy.ipynb.py +++ b/ch02data/082NumPy.ipynb.py @@ -12,382 +12,471 @@ # --- # %% [markdown] -# ## NumPy - -# %% [markdown] -# ### The Scientific Python Trilogy - -# %% [markdown] +# # NumPy +# +# ## The scientific Python trilogy +# # Why is Python so popular for research work? - -# %% [markdown] -# MATLAB has typically been the most popular "language of technical computing", with strong built-in support for efficient numerical analysis with matrices (the *mat* in MATLAB is for Matrix, not Maths), and plotting. - -# %% [markdown] -# Other dynamic languages have cleaner, more logical syntax (Ruby, Haskell) - -# %% [markdown] +# +# MATLAB has typically been the most popular "language of technical computing", with strong built-in support for efficient numerical analysis with matrices (the *mat* in MATLAB is for matrix, not maths), and plotting. +# +# Other dynamic languages can be argued to have cleaner, more logical syntax, for example [Ruby](https://www.ruby-lang.org/en/) or [Scheme][scheme]. 
+# +# [scheme]: https://en.wikipedia.org/wiki/Scheme_(programming_language) # But Python users developed three critical libraries, matching the power of MATLAB for scientific work: +# +# * [Matplotlib](https://matplotlib.org/), a plotting library for creating visualizations in Python. +# * [NumPy](https://numpy.org/), a fast numeric computing library offering a flexible *n*-dimensional array type. +# * [IPython](https://ipython.readthedocs.io/en/stable/overview.html), an interactive Python interpreter that later led to the [Jupyter notebook](https://jupyter.org/) interface. +# +# By combining a plotting library, a fast numeric library, and an easy-to-use interface allowing live plotting commands in a persistent environment, the powerful capabilities of MATLAB were matched by a free and open toolchain. +# +# We've learned about Matplotlib and IPython in this course already. NumPy is the last part of the trilogy. +# +# ## Limitations of Python lists +# +# The standard Python list is inherently one dimensional. To make a matrix (two-dimensional array of numbers), we could create a list of lists: -# %% [markdown] -# * Matplotlib, the plotting library created by [John D. Hunter](https://en.wikipedia.org/wiki/John_D._Hunter) -# * NumPy, a fast matrix maths library created by [Travis Oliphant](https://en.wikipedia.org/wiki/Travis_Oliphant) -# * IPython, the precursor of the notebook, created by [Fernando Perez](https://en.wikipedia.org/wiki/Fernando_P%nC3%A9rez_(software_developer)) +# %% +x = [[row_index + col_index for col_index in range(5)] for row_index in range(5)] +x # %% [markdown] -# By combining a plotting library, a matrix maths library, and an easy-to-use interface allowing live plotting commands -# in a persistent environment, the powerful capabilities of MATLAB were matched by a free and open toolchain. +# However, applying an operation to every element is a pain. 
We would like to be able to use an intuitive syntax like the following -# %% [markdown] -# We've learned about Matplotlib and IPython in this course already. NumPy is the last part of the trilogy. +# %% +x + 5 # %% [markdown] -# ### Limitations of Python Lists +# As the `+` operator is used to perform concatenation for lists we instead have to use something more cumbersome such as the following. + +# %% +[[elem + 5 for elem in row] for row in x] # %% [markdown] -# The normal Python list is just one dimensional. To make a matrix, we have to nest Python lists: +# Common useful operations like transposing a matrix or reshaping a 10 by 10 matrix into a 20 by 5 matrix are not easy to perform with nested Python lists. +# +# ## Importing NumPy +# +# The main NumPy *application programming interface* (API) is exposed via the top-level `numpy` module. This is not part of the Python standard library but instead needs to be separately installed, for example using a package manager such as `pip` or `conda`. +# +# As we will typically need to access the names defined in the `numpy` module a lot when working with NumPy in code, it is common in practice to import `numpy` as the shorthand name `np`. While modern editing environments such as the Jupyter Lab interface have tab-completion functionality which makes the reduction in keystrokes necessary less important, the shorter name can still have value in reducing line length and is a very common convention so we will follow suit here. # %% -x = [list(range(5)) for N in range(5)] +import numpy as np -# %% -x +# %% [markdown] +# ## The NumPy array +# +# NumPy's `ndarray` type represents a multidimensional grid of values of a shared type. In NumPy nomenclature each dimension is termed an *axes* and the tuple of sizes of all the dimensions of the array is termed the array *shape*. We can construct a `ndarray` using the `np.array` function. 
The first positional argument to this function should be an *array-like* Python object: this can be another array, or more commonly [a (nested) sequence](https://numpy.org/doc/stable/user/basics.creation.html#converting-python-sequences-to-numpy-arrays), with the constraint that sequences at each level must all be of the same length. For example, to construct a one-dimensional array with 5 integer elements we can pass in a corresponding list of 5 integers.
For example [`np.arange`](https://numpy.org/doc/stable/reference/generated/numpy.arange.html) provides an array equivalent of the built-in `range` function, with `start`, `stop` and `step` arguments with the same semantics as for `range`. When called with integer arguments `np.arange` will return a NumPy array equivalent to passing the corresponding `range` call to `np.array`. For example
Unlike the `stop` argument to `np.arange`, by default the `stop` value in `np.linspace` is an inclusive upper bound on the sequence values, and the `num` argument explicitly states the length of the returned sequence, preventing the inconsistent output lengths for similar inputs encountered above with `np.arange`. For example the following cell constructs an array of 11 evenly spaced floating point values between 15.1 and 16.1 inclusive
evaluate to a valid number) + +# %% +np.empty(50) # %% [markdown] -# NumPy has two methods for quickly defining evenly-spaced arrays of (floating-point) numbers. These can be useful, for example, in plotting. +# ## Array data types # -# The first method is `arange`: +# A Python `list` can contain data of mixed type: # %% -x = np.arange(0, 10, 0.1) # Start, stop, step size +x = ['hello', 2, 3.4, True] +for el in x: + print(type(el)) # %% [markdown] -# This is similar to Python's `range`, although note that we can't use non-integer steps with the latter! +# In *most cases* all the elements of a NumPy array will be of the same type. The [*data type*](https://numpy.org/doc/stable/user/basics.types.html) of an array can be specified by the `dtype` argument to `np.array` (as well as to other array creation routines such as `np.arange` and `np.linspace`). If omitted the default is to use the 'minimum' (least generic) type required to hold the objects in the (nested) sequence, with all the objects cast to this type. The results of this type conversion can sometimes be non-intuitive. For example, in the following # %% -y = list(range(0, 10, 0.1)) +for el in np.array(x): + print(type(el)) # %% [markdown] -# The second method is `linspace`: +# the array data type has been automatically set to the `numpy.str_` string type as the other integer, float and bool objects can all be represented as strings. In contrast if we repeat the same code snippet but exclude the first string entry in the list `x` # %% -import math -values = np.linspace(0, math.pi, 100) # Start, stop, number of steps - -# %% -values +for el in np.array(x[1:]): + print(type(el)) # %% [markdown] -# Regardless of the method used, the array of values that we get can be used in the same way. +# we see that all the array elements are now `numpy.float64` double-precision floating point values, as both integer and bool values can be represented as floats. 
The main takeaway here is that when constructing NumPy arrays with `np.array` it is generally advisable to either always use (nested) sequences containing objects of a uniform type *or* to explicitly specify the data type using the `dtype` argument, to ensure the constructed array has the data type you expect!
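As a brief sketch of this advice in practice (the values here are chosen purely to illustrate the casting rules described above): specifying `dtype` overrides the automatic upcasting, with values cast, and potentially truncated, to the requested type.

```python
import numpy as np

# Mixed int / float / bool input is automatically upcast to float64...
implicit = np.array([1, 2.5, True])
print(implicit.dtype)  # float64

# ...whereas explicitly requesting an integer dtype truncates the float
explicit = np.array([1, 2.5, True], dtype=int)
print(explicit.tolist())  # [1, 2, 1]
```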
+# +# NumPy will also convert Python type names to corresponding data types: # %% -np.zeros([3, 4, 2]) # 3 arrays with 4 rows and 2 columns each +int_array = np.array(x, dtype=int) -# %% [markdown] -# Unlike a list-of-lists in Python, we can reshape arrays: +# %% +float_array = np.array(x, dtype=float) # %% -x = np.array(range(40)) -x +int_array # %% -y = x.reshape([4, 5, 2]) -y +float_array -# %% [markdown] -# And index multiple columns at once: +# %% +int_array.dtype # %% -y[3, 2, 1] +float_array.dtype # %% [markdown] -# Including selecting on inner axes while taking all from the outermost: +# ## Elementwise operations and scalar broadcasting +# +# Most arithmetic operations can be applied directly to NumPy arrays, with the elementwise interpretation of the operations corresponding to what we would intutively expect from the corresponding mathematical notation. # %% -y[:, 2, 1] - -# %% [markdown] -# And subselecting ranges: +my_array = np.arange(5) +my_array + my_array # %% -y[2:, :1, :] # Last 2 axes, 1st row, all columns +my_array * my_array # %% [markdown] -# And [transpose](https://en.wikipedia.org/wiki/Transpose) arrays: +# We can also use unary operations, for example # %% -y.transpose() +-my_array # %% [markdown] -# You can get the dimensions of an array with `shape`: +# As well as binary operations between arrays of the same shape as above, we can also apply binary operations to mixes of arrays and scalars, with binary operations involving a scalar and an array being [*broadcast*](https://numpy.org/doc/stable/user/basics.broadcasting.html) such that the scalar is treated as if it was an array of equal shape to the array operand and the operation then performed elementwise. 
This again gives compact expressions which correspond with how we would typically interpret such expressions in mathematical notation
+#
+# Compared to the list comprehensions used previously these significantly simplify the process of plotting functions using Matplotlib:

# %%
from matplotlib import pyplot as plt
x = np.linspace(-3, 3, 100)
for func in (np.sin, np.cos, np.tanh, np.arctan, np.absolute):
    plt.plot(x, func(x), label=func.__name__)
plt.legend()

# %% [markdown]
# ## Multi-dimensional arrays
#
# A particularly powerful feature of NumPy is its ability to handle arrays of (almost) arbitrary dimension. For example, to create a three-dimensional array of zeros we can use the following

# %%
np.zeros(shape=(3, 4, 2))  # or equivalently np.zeros((3, 4, 2))

# %% [markdown]
# Unlike nested lists in Python, we can change the shape of NumPy arrays using the `np.reshape` function (or `ndarray.reshape` method), provided the total number of elements in the new shape matches that in the old shape. For example

# %%
x = np.arange(12).reshape((3, 4))
x

# %% [markdown]
# We can also reorder the dimensions of arrays using the `np.transpose` function or corresponding `ndarray.transpose` method.
By default this reverses the order of the axes (dimensions)

# %%
x.transpose()

# %% [markdown]
# The shorthand `ndarray.T` property can also be used to access the result corresponding to calling the `transpose` method with its default arguments

# %%
x.T

# %% [markdown]
# We can also pass in a specific permutation of the axes indices to `transpose` to get different reorderings, for example

# %%
y = np.arange(24).reshape((3, 4, 2))
y.transpose((0, 2, 1))

# %% [markdown]
# The shape of a particular array can always be accessed using the `ndarray.shape` attribute

# %%
y.shape

# %%
y.transpose((0, 2, 1)).shape

# %% [markdown]
# The total number of dimensions (axes) of an array can be accessed using the `ndarray.ndim` attribute

# %%
y.ndim

# %% [markdown]
# ## Array indexing and slicing
#
# A multidimensional array accepts a comma-delimited sequence of integer indices or index ranges enclosed in square brackets, of up to the number of array dimensions.
For an array with *n* dimensions if *n* integer indices are specified then the result is a scalar value of the array's data type + +# %% +x = np.arange(40).reshape([4, 5, 2]) +x[2, 1, 0] # %% [markdown] -# NumPy will convert python type names to dtypes: +# If we pass *m < n* indices, then the indices are used to select a [*slice*](https://docs.python.org/3/glossary.html#term-slice) of the array corresponding to using the specified indices to select the first *m* dimensions and selecting all of the remaining *n - m* dimensions # %% -x = [2, 3.4, 7.2, 0] +x[2, 1] # %% -int_array = np.array(x, dtype=int) +x[2] -# %% -float_array = np.array(x, dtype=float) +# %% [markdown] +# Similar to lists, NumPy arrays also support [an extended indexing syntax](https://numpy.org/devdocs/user/basics.indexing.html#slicing-and-striding) to support selecting portions of a particular dimension; somewhat confusingly the term *slice* is used to refer both to the outcome of selecting a portion of a sequence *and* the object and associated [syntactic sugar](https://en.wikipedia.org/wiki/Syntactic_sugar) used to select such portions. Passing a colon separated range `start:stop:step` as the index for an array dimension, will select the elements for which this dimension's index is in the corresponding range object `range(start, stop, step)` for example # %% -int_array +np.arange(10)[1:10:2] -# %% -float_array +# %% [markdown] +# As for lists the `step` component of the range can be omitted; in this case the second colon can also be left out. # %% -int_array.dtype +np.arange(10)[1:3] + +# %% [markdown] +# If the `start` argument is omitted it is implicitly assumed to be zero - that is to start from the beginning of the dimension. If the `end` argument is omitted it is implicitly assumed to be equal to the length of that dimension, that is to stop at the final index along that dimension. 
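A few concrete cases of these omission rules, together with the negative step that NumPy slicing inherits from Python's, on a small one-dimensional array:

```python
import numpy as np

a = np.arange(10)

every_other = a[::2]   # start and stop omitted, step of 2
from_five = a[5:]      # stop and step omitted
up_to_five = a[:5]     # start and step omitted
reversed_a = a[::-1]   # a negative step works through the dimension backwards
```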
Combining these rules together means that, for example, a plain colon `:` will be interpreted as referring to all indices in that dimension

# %%
np.arange(10)[:]

# %% [markdown]
# For multi-dimensional arrays we can slice along multiple dimensions simultaneously

# %%
x[2:, :1, 0]

# %% [markdown]
# As for lists, the [inbuilt `slice` object](https://docs.python.org/3/library/functions.html#slice) can be used to programmatically define slices, for example the following is equivalent to the slicing in the above cell

# %%
x[slice(2, None), slice(None, 1), 0]

# %% [markdown]
# ## Reduction operations
#
# NumPy provides various functions and associated `ndarray` methods which apply an operation along one or more array axes, resulting in an output of reduced dimension. The *reduction* operations include
#
# * `sum`, `prod`: compute the sum or product along one or more dimensions,
# * `min`, `max`: compute the minimum or maximum along one or more dimensions,
# * `argmin`, `argmax`: compute the indices corresponding to the minimum or maximum along one or more dimensions,
# * `mean`, `std`: compute the empirical mean or standard deviation along one or more dimensions.
#
# All of these operations include both a functional form available in the `numpy` module namespace, and a corresponding `ndarray` method. The interface to the `ndarray` methods matches that of the corresponding function, other than the first array positional argument being set to the array the method is called on; for example `np.sum(x)` and `x.sum()` will give equivalent results.
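The other reductions listed above follow the same pattern as `sum`; for example, a small sketch using `mean`, `argmax` and `max` with and without an `axis` argument:

```python
import numpy as np

m = np.array([[1.0, 5.0, 3.0],
              [4.0, 2.0, 6.0]])

col_means = m.mean(axis=0)          # mean of each column, shape (3,)
row_argmax = np.argmax(m, axis=1)   # index of the largest entry in each row
total_max = m.max()                 # maximum over all elements
```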
# %% -np.arange(5) * np.arange(6) +x = np.arange(12).reshape(2, 2, 3) +x # %% -np.zeros([2,3]) * np.zeros([2,4]) +np.sum(x) # Sums along all axes # %% -m1 = np.arange(100).reshape([10, 10]) +x.sum() # Also sums along all axes -# %% -m2 = np.arange(100).reshape([10, 5, 2]) +# %% [markdown] +# All the reduction operations accept an optional `axis` argument which specifies the array axis (dimension) or axes to apply the reduction along. This defaults to `None` corresponding to applying along all axes, with the returned output then a scalar. # %% -m1 + m2 +x.sum(axis=None) # Also sums along all axes # %% [markdown] -# Arrays must match in all dimensions in order to be compatible: +# If `axis` is set to an integer (corresponding to a valid axis index for the array) then the reduction will be applied only along that axis, resulting in a returned output of dimension one less than the original array. # %% -np.ones([3, 3]) * np.ones([3, 3]) #Β Note elementwise multiply, *not* matrix multiply. +x.sum(axis=0) # Sums along the first axis + +# %% +x.sum(axis=1) # Sums along the second axis + +# %% +x.sum(axis=2) # Sums along the third axis # %% [markdown] -# **Except**, that if one array has any Dimension 1, then the data is **REPEATED** to match the other. +# If axis is set a *tuple* of integer axis indices, the reduction is applied along all the corresponding axes, with the returned output then of dimension equal to the original array dimension minus the length of the axis index tuple. # %% -col = np.arange(10).reshape([10, 1]) -col +x.sum(axis=(0, 1)) # Sums along the first and second axes # %% -row = col.transpose() -row +x.sum(axis=(0, 1, 2)) # Also sums along all axes + +# %% [markdown] +# ## Advanced broadcasting +# +# We earlier encountered the concept of *broadcasting* in the context of binary operations on mixes of scalars and arrays. 
A very powerful feature of NumPy is that broadcasting is also extended to apply to operations involving arrays with differing but *compatible* shapes. This allows us to broadcast a smaller array across a larger array without needlessly repeating the data in the smaller array.
+#
+# NumPy's binary operations are usually applied elementwise on pairs of arrays, with the simplest case being when the arrays have exactly matching shapes

# %%
np.arange(0, 5) * np.arange(5, 10)

# %%
np.ones((3, 2)) + np.zeros((3, 2))

# %% [markdown]
# If we apply binary operations to arrays with non-matching shapes we will typically get an error

# %%
np.arange(5) * np.arange(6)

# %%
np.ones((2, 3)) * np.zeros((2, 4))

# %% [markdown]
# However, the condition that array shapes exactly match is relaxed to allow operations to be performed on pairs of arrays for which the shapes are *compatible* under certain rules. The shapes of the two arguments to a binary operation are considered compatible if, working from the rightmost dimension leftwards, each pair of dimensions defined for both arrays is either equal or has one of the pair equal to one. For any dimension in which one array has size one, that array is treated as if its elements were repeated a number of times equal to the size of the corresponding dimension of the other array.
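These compatibility rules can also be checked programmatically with `np.broadcast_shapes` (available in NumPy 1.20 and later), which returns the broadcast result shape for compatible arguments and raises a `ValueError` otherwise:

```python
import numpy as np

# compatible: paired dimensions are 3-vs-1 and 1-vs-4, each pair contains a 1
result_shape = np.broadcast_shapes((3, 1), (1, 4))

# incompatible: the rightmost dimensions are 3 and 4 and neither is 1
try:
    np.broadcast_shapes((2, 3), (2, 4))
    compatible = True
except ValueError:
    compatible = False
```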
+# +# This provides a convenient way for example to perform an outer product operation on vectors (one dimensional arrays) + +# %% +np.arange(5).reshape(5, 1) * np.arange(5, 10).reshape(1, 5) # %% [markdown] -# ### Newaxis +# Importantly arrays do not need to have the same number of dimensions to have compatible shapes, providing the rightmost dimensions of the array with a larger dimension are compatible with the shape of the smaller array, with the missing leftmost dimensions of the smaller array treated as if they were of size one. For example + +# %% +np.arange(6).reshape(3, 2) + np.arange(2) # %% [markdown] -# Broadcasting is very powerful, and numpy allows indexing with `np.newaxis` to temporarily create new one-long dimensions on the fly. +# For a more complete description of NumPy's broadcasting rules including some helpful visualisation [see this article in the official documentation](https://numpy.org/devdocs/user/basics.broadcasting.html). +# +# ## Adding new axes +# +# Broadcasting is very powerful, and NumPy allows indexing with `np.newaxis` to temporarily create new length one dimensions on the fly, rather than explicitly calling `reshape`. # %% -import numpy as np x = np.arange(10).reshape(2, 5) y = np.arange(8).reshape(2, 2, 2) # %% -x - -# %% -y +x.reshape(2, 5, 1, 1).shape # %% x[:, :, np.newaxis, np.newaxis].shape @@ -397,92 +486,67 @@ # %% res = x[:, :, np.newaxis, np.newaxis] * y[:, np.newaxis, :, :] - -# %% res.shape +# %% [markdown] +# This is particularly useful when performing outer product type operations + # %% -np.sum(res) +x = np.arange(5) +y = np.arange(5, 10) +x[:, np.newaxis] * y[np.newaxis, :] # %% [markdown] -# Note that `newaxis` works because a $3 \times 1 \times 3$ array and a $3 \times 3$ array contain the same data, -# differently shaped: +# Note that `newaxis` works because an array with extra length one dimensions has the same overall size (and so can be a view to the same underlying data) just with a different shape. 
In other words, a $3 \times 1 \times 3$ and a $3 \times 3$ array contain the same data, differently shaped: # %% -threebythree = np.arange(9).reshape(3, 3) -threebythree +three_by_three = np.arange(9).reshape(3, 3) +three_by_three # %% -threebythree[:, np.newaxis, :] +three_by_three[:, np.newaxis, :] # %% [markdown] -# ### Dot Products - -# %% [markdown] -# NumPy multiply is element-by-element, not a dot-product: +# ## Matrix multiplications +# +# NumPy interprets the standard `*` operator as elementwise multiplication # %% a = np.arange(9).reshape(3, 3) -a - -# %% b = np.arange(3, 12).reshape(3, 3) -b - -# %% a * b # %% [markdown] -# To get a dot-product, (matrix inner product) we can use a built in function: +# To perform matrix multiplication we use the `@` operator instead # %% -np.dot(a, b) +a @ b # %% [markdown] -# Though it is possible to represent this in the algebra of broadcasting and newaxis: +# As well as matrix-matrix products (that is products of pairs of two dimensional arrays) this can also be used for matrix-vector products (products of a two dimensional array and one dimensional array) and vector-vector products (products of pairs of one dimensional arrays) # %% -a[:, :, np.newaxis].shape +v = np.arange(3) +a @ v # %% -b[np.newaxis, :, :].shape - -# %% -a[:, :, np.newaxis] * b[np.newaxis, :, :] - -# %% -(a[:, :, np.newaxis] * b[np.newaxis, :, :]).sum(1) +v @ v # %% [markdown] -# Or if you prefer: - -# %% -(a.reshape(3, 3, 1) * b.reshape(1, 3, 3)).sum(1) - -# %% [markdown] -# We use broadcasting to generate $A_{ij}B_{jk}$ as a 3-d matrix: - -# %% -a.reshape(3, 3, 1) * b.reshape(1, 3, 3) - -# %% [markdown] -# Then we sum over the middle, $j$ axis, [which is the 1-axis of three axes numbered (0,1,2)] of this 3-d matrix. Thus we generate $\Sigma_j A_{ij}B_{jk}$. +# ## Structured arrays # -# We can see that the broadcasting concept gives us a powerful and efficient way to express many linear algebra operations computationally. 
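As a small check of the distinction drawn here, elementwise `*` and matrix `@` products of the same pair of matrices give different results, and for two-dimensional arrays `@` agrees with the dot product:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

elementwise = a * b     # multiplies matching entries only
matrix_product = a @ b  # rows of a combined with columns of b
```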
- -# %% [markdown] -# ### Record Arrays - -# %% [markdown] -# These are a special array structure designed to match the CSV "Record and Field" model. It's a very different structure -# from the normal NumPy array, and different fields *can* contain different datatypes. We saw this when we looked at CSV files: +# So far we have encountered arrays with 'simple' data types corresponding to a single type. NumPy also offers arrays with [*structured data types*](https://numpy.org/devdocs/user/basics.rec.html#structured-datatypes), sometimes termed *record arrays*, for which each array element is composed of several fields, with each field having the same data type across all array elements. These are a special array structure designed to match the *comma separated variable* (CSV) *record and field* model. We saw this when we looked at CSV files: # %% -x = np.arange(50).reshape([10, 5]) +x = np.arange(50).reshape((10, 5)) # %% -record_x = x.view(dtype={'names': ["col1", "col2", "another", "more", "last"], - 'formats': [int]*5 }) +record_x = x.view( + dtype={ + 'names': ["col1", "col2", "another", "more", "last"], + 'formats': [int] * 5 + } +) # %% record_x @@ -494,68 +558,74 @@ record_x['col1'] # %% [markdown] -# We've seen these already when we used NumPy's CSV parser. 
- -# %% [markdown] -# ### Logical arrays, masking, and selection - -# %% [markdown] -# Numpy defines operators like == and < to apply to arrays *element by element*: +# ## Comparison operators and boolean indexing +# +# Numpy defines comparison operators like `==` and `<` to apply to arrays elementwise and also to broadcast similar to arithmetic operations # %% -x = np.zeros([3, 4]) +x = np.arange(-1, 2)[:, np.newaxis] * np.arange(-2, 2)[np.newaxis, :] x # %% -y = np.arange(-1, 2)[:, np.newaxis] * np.arange(-2, 2)[np.newaxis, :] -y - -# %% -iszero = x == y -iszero +is_zero = (x == 0) +is_zero # %% [markdown] -# A logical array can be used to select elements from an array: +# Boolean arrays can also be used to filter the elements of an array matching some condition. For example # %% -y[np.logical_not(iszero)] +x[is_zero] # %% [markdown] -# Although when printed, this comes out as a flat list, if assigned to, the *selected elements of the array are changed!* +# We can use the unary negation operator `~` to negate conditions # %% -y[iszero] = 5 - -# %% -y +x[~is_zero] # %% [markdown] -# ### Numpy memory - -# %% [markdown] -# Numpy memory management can be tricksy: +# Although when used to get items from an array, boolean indexing results in a new one dimensional array, if the boolean indexing is instead part of an assignment statement the *selected elements of the array are changed in place* # %% -x = np.arange(5) -y = x[:] +x[is_zero] = 5 # %% -y[2] = 0 x # %% [markdown] -# It does **not** behave like lists! +# For more details about boolean array indexing [see the official documentation](https://numpy.org/devdocs/user/basics.indexing.html#boolean-array-indexing). +# +# ## Copies and views +# +# Care needs to be taken when assigning to slices of a NumPy array # %% -x = list(range(5)) -y = x[:] +x = np.arange(6).reshape((3, 2)) +x # %% -y[2] = 0 +y = x[0, :] +y[1] = -99 x # %% [markdown] -# We must use `np.copy` to force separate memory. 
Otherwise NumPy tries its hardest to make slices be *views* on data. +# In general NumPy will try to return *views* to the same underlying array data buffer when performing indexing and slicing operations, where possible. These views share the same underlying data in memory with the original array, and so make such indexing operations cheap in both memory usage and computational cost (by avoiding unnececssary copies). As we saw above however, if we assign to a slice we will therefore also update the original array. We can use the `np.copy` function or corresponding `ndarray.copy` method to force creation of an array referencing a new copy of the underlying data + +# %% +x = np.arange(6).reshape((3, 2)) +y = x[0, :].copy() +y[1] = -99 +x # %% [markdown] -# Now, this has all been very theoretical, but let's go through a practical example, and see how powerful NumPy can be. +# More details are given [in the official documentation](https://numpy.org/devdocs/user/basics.copies.html). +# +# ## Further reading +# +# As well as the links to external resource throughout these notes, some suggestions for further reading for those who wish to learn more about NumPy are +# +# * The [NumPy quickstart guide](https://numpy.org/devdocs/user/quickstart.html) in the official documentation. +# * If you have previous experience with MATLAB, [this guide to NumPy for MATLAB users](https://numpy.org/devdocs/user/numpy-for-matlab-users.html) in the official documentation may also be helpful. +# * [This *Software Carpenty* lesson](https://swcarpentry.github.io/python-novice-inflammation/02-numpy/index.html) also introduces the basics of using NumPy in an applied example. +# * The [NumPy section of this Python tutorial for the Stanford CS231n course](https://cs231n.github.io/python-numpy-tutorial/#numpy) covers a lot of the same topics as we have here. +# +# We will also see some practical usage of NumPy arrays in one of the subsequent notebooks on simulating flocking dynamics. 
diff --git a/ch02data/084Boids.ipynb.py b/ch02data/084Boids.ipynb.py
index d21c976d..1b9c7966 100644
--- a/ch02data/084Boids.ipynb.py
+++ b/ch02data/084Boids.ipynb.py
@@ -12,424 +12,318 @@
# ---

# %% [markdown]
-# ## The Boids!
+# # The Boids!
+#
+# Now that we have covered the basics of both NumPy and Matplotlib, we will go through an extended example of using these libraries to run and visualise a simulation of flocking dynamics.
+#
+# ## Flocking
#
-# This section shows an example of using NumPy to encode a model of how a group of birds or other animals moves. It is based on [a paper by Craig W. Reynolds](http://www.cs.toronto.edu/~dt/siggraph97-course/cwr87/). Reynolds calls the simulated creatures "bird-oids" or "boids", so that's what we'll be calling them here too.
-
-# %% [markdown]
-# ### Flocking
-
-# %% [markdown]
#
# > The aggregate motion of a flock of birds, a herd of land animals, or a school of fish is a beautiful and familiar
# part of the natural world... The aggregate motion of the simulated flock is created by a distributed behavioral model much
-# like that at work in a natural flock; the birds choose their own course. Each simulated bird is implemented as an independent
-# actor that navigates according to its local perception of the dynamic environment, the laws of simulated physics that rule its
-# motion, and a set of behaviors programmed into it... The aggregate motion of the simulated flock is the result of the
-# dense interaction of the relatively simple behaviors of the individual simulated birds.
+# like that at work in a natural flock; the birds choose their own course. Each simulated bird is implemented as an independent actor that navigates according to its local perception of the dynamic environment, the laws of simulated physics that rule its motion, and a set of behaviors programmed into it...
The aggregate motion of the simulated flock is the result of the dense interaction of the relatively simple behaviors of the individual simulated birds. # -# -- Craig W. Reynolds, "Flocks, Herds, and Schools: A Distributed Behavioral Model", *Computer Graphics* **21** _4_ 1987, pp 25-34 - -# %% [markdown] -# The model includes three main behaviours which, together, give rise to "flocking". In the words of the paper: +# — Craig W. Reynolds, "Flocks, Herds, and Schools: A Distributed Behavioral Model", *Computer Graphics* **21** _4_ 1987, pp 25-34 (see the [original paper](http://www.cs.toronto.edu/~dt/siggraph97-course/cwr87/)) # -# * Collision Avoidance: avoid collisions with nearby flockmates -# * Velocity Matching: attempt to match velocity with nearby flockmates -# * Flock Centering: attempt to stay close to nearby flockmates - -# %% [markdown] -# ### Setting up the Boids - -# %% [markdown] -# Our boids will each have an x velocity and a y velocity, and an x position and a y position. - -# %% [markdown] -# We'll build this up in NumPy notation, and eventually, have an animated simulation of our flying boids. +# A basic model of flocking behaviour can be derived from [three basic rules](https://en.wikipedia.org/wiki/Boids) +# +# > * *Collision avoidance*: avoid collisions with nearby flockmates. +# > * *Velocity matching*: attempt to match velocity with nearby flockmates. +# > * *Flock centering*: attempt to stay close to nearby flockmates. +# +# We will use a simple two-dimensional model in which the state of the *n*th *boid* (short for *bird-oid* or flock member) is described by a position vector $\boldsymbol{x}_n = (x_{n,0}, x_{n,1})$ and velocity $\boldsymbol{v}_n = (v_{n,0}, v_{n,1})$. We will represent our flock state as NumPy arrays, implement our simulation dynamics using NumPy array operations and use the [animation capabilities of Matplotlib](https://matplotlib.org/stable/api/animation_api.html) to create animated simulations of our flying boids. 
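As a rough sketch of how one of these rules can be expressed with NumPy array operations (this is an illustration only, not the force functions used later; the use of the global flock mean and the `strength` value are simplifying assumptions), the flock-centering rule might pull each boid towards the mean position of the flock:

```python
import numpy as np

def flock_centering_force(positions, strength=0.01):
    # pull each boid towards the mean position of the whole flock;
    # a fuller implementation would typically use only nearby flockmates
    return strength * (positions.mean(axis=0) - positions)

positions = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
force = flock_centering_force(positions)
```

Note that the forces towards the mean cancel out when summed over the flock, so this rule on its own does not change the flock's overall momentum.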
We first import the relevant modules `numpy`, `matplotlib.pyplot` and `matplotlib.animation` and [set Matplotlib animations to default to being represented as interactive JavaScript widgets in the notebook interface](http://louistiao.me/posts/notebooks/embedding-matplotlib-animations-in-jupyter-as-interactive-javascript-widgets/).

# %%
import numpy as np
from matplotlib import pyplot as plt, animation
plt.rcParams['animation.html'] = 'jshtml'

# %% [markdown]
# ## Initialising the simulation
#
# We will represent the positions and velocities of all the boids as a pair of two-dimensional arrays, both with shape `(num_boids, 2)` where `num_boids` is an integer determining the number of boids. We need to set initial values for these arrays, representing the positions and velocities at the start of the simulation. Here we will use a NumPy random number generator object to sample random initial values for the positions and velocities from some distribution.
#
# Generation of random numbers is a very common task in research software, but a topic with lots of important subtleties. A common pattern in many programming languages is to use a *global* random number generator whose state is updated by every call to random number generation routines. Use of global state like this breaks the usual expectation that for fixed inputs a function will produce fixed outputs, can lead to hard-to-diagnose bugs and makes it difficult to ensure reproducibility of our research.
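The payoff of explicit seeding can be seen directly: two generator objects constructed with the same seed produce identical streams of samples, independent of any global state:

```python
import numpy as np

rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

draws_a = rng_a.uniform(0, 1, size=5)
draws_b = rng_b.uniform(0, 1, size=5)

# identical seeds give bit-for-bit identical samples
identical = bool(np.all(draws_a == draws_b))
```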
+# +# To avoid these issues we will create an instance of [the NumPy `Generator` class](https://numpy.org/doc/stable/reference/random/generator.html), which encapsulates the state of a random number generator plus methods to produce random numbers from specified distributions, and explicitly pass this in to functions in which we wish to generate random numbers. We will also fix the initial state of the random number generator by explicitly specifying a *seed* value, meaning we can reproduce our results. The easiest way to produce a seeded `Generator` object is using the `np.random.default_rng` function and passing in integer `seed` argument # %% -boid_count = 10 +rng = np.random.default_rng(seed=21878533808081494313) -# %% [markdown] -# We'll want to be able to seed our Boids in a random position. # %% [markdown] -# We'd better define the edges of our simulation area: - -# %% -limits = np.array([2000, 2000]) - -# %% -positions = np.random.rand(2, boid_count) * limits[:, np.newaxis] -positions +# The `rng` object we just created provides methods for producing from various different probability distributions. Here we generate random values uniformly distributed over a specified interval using [the `Generator.uniform` method](https://numpy.org/doc/stable/reference/random/generator.html). We can inspect the docstring for this method by running `rng.uniform?` (if in a Jupyter notebook or IPython interpreter) or `help(rng.uniform)` (in any Python interpreter). +# # %% -positions.shape +# rng.uniform? # %% [markdown] -# We used **broadcasting** with np.newaxis to apply our upper limit to each boid. -# `rand` gives us a random number between 0 and 1. We multiply by our limits to get a number up to that limit. 
- -# %% -limits[:, np.newaxis] - -# %% -limits[:, np.newaxis].shape +# Helpfully, `Generator.uniform` accepts a `size` argument allowing us to generate an array of random samples of a specified *shape* (the argument is named `size` rather than `shape` to avoid clashing with the usage of shape as the standard name for a parameter of some probability distributions). We can also specify array-like arguments for the `low` and `high` parameters specifying the lower and upper bounds of the interval the distribution is defined on, these should either match the shape specified by `size` or be of a compatible shape for *broadcasting*. Below we define a function which given a `Generator` object `rng` and optional arguments specifying the number of boids `num_boids`, minimum and maximum of the intervals to draw the positions (`min_position` and `max_position`) and velocities (`min_velocity` and `max_velocity`) from, outputs a pair of arrays `positions` and `velocities`, both of shape `(num_boids, 2)`, corresponding to the sampled position and velocities values respectively. # %% -np.random.rand(2, boid_count).shape - +def initialise_boid_states( + rng, + num_boids=100, + min_position=(100, 900), + max_position=(200, 1100), + min_velocity=(0, -20), + max_velocity=(10, 20) +): + """Generate random initial states for the boids. + + Args: + rng: NumPy random number generator object. + num_boids: Number of boids to generate states for. + min_position: Length 2 sequence defining lower bounds for + interval to uniformly generate positions from. + max_position: Length 2 sequence defining upper bounds for + interval to uniformly generate position from. + min_velocity: Length 2 sequence defining lower bounds for + interval to uniformly generate velocities from. + max_velocity: Length 2 sequence defining upper bounds for + interval to uniformly generate velocities from. 
+ + Returns: + Tuple of two arrays `(positions, velocities)`, both of shape + `(num_boids, 2)` corresponding to respectively the positions and + velocities of boids. + """ + positions = rng.uniform(min_position, max_position, size=(num_boids, 2)) + velocities = rng.uniform(min_velocity, max_velocity, size=(num_boids, 2)) + return positions, velocities -# %% [markdown] -# So we multiply a $2\times1$ array by a $2 \times 10$ array -- and get a $2\times 10$ array. # %% [markdown] -# Let's put that in a function: +# We can call this function with our `rng` random number generator object (and default values for the other arguments) to generate random initial values for the `positions` and `velocities` arrays. # %% -def new_flock(count, lower_limits, upper_limits): - width = upper_limits - lower_limits - return (lower_limits[:, np.newaxis] + np.random.rand(2, count) * width[:, np.newaxis]) +positions, velocities = initialise_boid_states(rng) # %% [markdown] -# For example, let's assume that we want our initial positions to vary between 100 and 200 in the x axis, and 900 and 1100 in the y axis. We can generate random positions within these constraints with: -# ```python -# positions = new_flock(boid_count, np.array([100, 900]), np.array([200, 1100])) -# ``` - -# %% [markdown] -# But each bird will also need a starting velocity. Let's make these random too: +# ## Visualising the boids # -# We can reuse the `new_flock` function defined above, since we're again essentially just generating random numbers from given limits. This saves us some code, but keep in mind that using a function for something other than what its name indicates can become confusing! +# Now that we have initialised the state of our simulation, we are ready to start visualising our boids. As we have assumed our boids exist in two-dimensions, we can use Matplolib's extensive range of plotting functions for two-dimensional data. 
Ideally we want to represent both the instantaneous positions and velocities of the boids in our visualisation. We will use a [Matplotlib *quiver* plot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.quiver.html) to do this, representing each boid as an arrow located on the plot axes at a point corresponding to its simulated position, and with the direction of the arrow representing its current heading. While we could also represent its speed by the size of the arrow, we choose here to keep the arrows all the same size - this requires calculating the *unit vectors* (vectors of length one) corresponding to the velocity vectors, which can be done efficiently with broadcasted array operations in NumPy. We wrap the code for plotting the boids into a function to allow for easy reuse. We return the generated Matplotlib figure and axis (subplot) objects, along with the *artist* object representing the quiver plot arrows returned by the `quiver` call, to allow us to make calls to these objects outside of the function.

# %%
def calculate_unit_vectors(vectors):
    """Calculate the length-one vectors corresponding to an array of vectors.

    Args:
        vectors: Array with last dimension corresponding to vector dimension.

    Returns:
        Array of same shape as `vectors` with Euclidean norm along last axis
        equal to one.
    """
    return vectors / (vectors**2).sum(-1)[..., np.newaxis]**0.5


def plot_boids(positions, velocities, figsize=(8, 8), xlim=(0, 2000), ylim=(0, 2000)):
    """Create visual representation of boids as quiver plot.

    Args:
        positions: Array of shape `(num_boids, 2)` defining positions of boids.
        velocities: Array of shape `(num_boids, 2)` defining velocities of boids.
        figsize: Tuple defining figure dimension `(width, height)` in inches.
        xlim: Tuple `(min, max)` defining extents of horizontal axis.
+ + Returns: + Tuple containing Matplotlib figure, axis and artist corresponding to plot. + """ + fig, ax = plt.subplots(figsize=(8, 8)) + ax.set(xlim=xlim, ylim=ylim, xlabel="$x_0$ coordinate", ylabel="$x_1$ coordinate") + velocity_unit_vectors = calculate_unit_vectors(velocities) + arrows = ax.quiver( + positions[:, 0], # horizontal coordinates for origin points for arrows + positions[:, 1], # vertical coordinates for origin points of arrows + velocity_unit_vectors[:, 0], # horizontal component of arrow vectors + velocity_unit_vectors[:, 1], # vertical component of arrow vectors + scale=40, # size of arrows + angles='xy', # convention used for specifying arrow directions + color='C0', # color of arrows - set to first value in default color cycle + pivot='middle' # plot middle of arrows at specified origin points + ) + return fig, ax, arrows + + +# %% [markdown] +# We can now produce a (rather boring) visualisation of our boids in their initial state. + +# %% +fig, ax, arrows = plot_boids(positions, velocities) + + +# %% [markdown] +# ## Simulating the model dynamics # -# Here, we will let the initial x velocities range over $[0, 10]$ and the y velocities over $[-20, 20]$. - -# %% -velocities = new_flock(boid_count, np.array([0, -20]), np.array([10, 20])) -velocities - -# %% [markdown] -# ### Flying in a Straight Line - -# %% [markdown] -# Now we see the real amazingness of NumPy: if we want to move our *whole flock* according to +# We will model the boids as being subject to forces corresponding to the three rules described above. Newton's second law of motion tells use that the acceleration of a body (rate of change of velocity) is proportional to the net force acting on it, with the velocity in turn being the rate in change of the position of the object. 
In mathematical notation, if $F_k$ is a function which calculates the values of the $k$th force on the boids given the positions and velocities of all the boids, then we have that
 #
-# $\delta_x = \delta_t \cdot \frac{dv}{dt}$
-
-# %% [markdown]
-# we just do:
+#
+# $$
+# \frac{\mathrm{d}\boldsymbol{x}_{0:N}}{\mathrm{d}t} = \boldsymbol{v}_{0:N}, \qquad
+# \frac{\mathrm{d}\boldsymbol{v}_{0:N}}{\mathrm{d}t} = \sum_{k=1}^K F_k(\boldsymbol{x}_{0:N}, \boldsymbol{v}_{0:N}).
+# $$
+#
+# Note that it is not a problem if you are not familiar with the notation here - understanding the mathematics behind the model is non-essential! There are a variety of numerical approaches for approximately simulating dynamics of this form. Here we will use a numerical method that is particularly simple to implement, corresponding to the approximation that, for a small timestep $\delta t$,
+#
+# $$
+# \boldsymbol{x}_{0:N}(t + \delta t) \approx \boldsymbol{x}_{0:N}(t) + \delta t \boldsymbol{v}_{0:N}(t), \qquad
+# \boldsymbol{v}_{0:N}(t + \delta t) \approx \boldsymbol{v}_{0:N}(t) + \delta t \sum_{k=1}^K F_k(\boldsymbol{x}_{0:N}(t + \delta t), \boldsymbol{v}_{0:N}(t)).
+# $$
+#
+# Using NumPy array operations we can implement this numerical scheme efficiently by updating the positions and velocities for all boids simultaneously. We wrap this into a function `simulate_timestep` below that updates the `positions` and `velocities` arrays in-place.

 # %%
-positions += velocities
-
-# %% [markdown]
-# ### Matplotlib Animations
+def simulate_timestep(positions, velocities, forces, timestep):
+    """Simulate model dynamics forward one timestep, updating states in-place.
+
+    Args:
+        positions: Array of shape `(num_boids, 2)` defining positions of boids.
+        velocities: Array of shape `(num_boids, 2)` defining velocities of boids.
+        forces: Sequence of functions computing forces on boids given positions
+            and velocities of all boids.
+        timestep: Scalar timestep to use for numerical integrator.
+ """ + positions += timestep * velocities + velocities += timestep * sum(force(positions, velocities) for force in forces) -# %% [markdown] -# So now we can animate our Boids using the matplotlib animation tools. All we have to do is import the relevant libraries: - -# %% -from matplotlib import animation -from matplotlib import pyplot as plt -# %matplotlib inline # %% [markdown] -# Then, we make a static plot, showing our first frame: +# ## Creating an animation of the simulation +# +# Now that we have a function to update the boid states according to the model dynamics, we are ready to produce animations visualising the simulated dynamics over time. To do this we will use [the `FuncAnimation` class from the `matplotlib.animation` module](https://matplotlib.org/stable/api/_as_gen/matplotlib.animation.FuncAnimation.html). We can inspect the docstring for the initialiser for this class: # %% -# create a simple plot -# initial x position in [100, 200], initial y position in [900, 1100] -# initial x velocity in [0, 10], initial y velocity in [-20, 20] -positions = new_flock(100, np.array([100, 900]), np.array([200, 1100])) -velocities = new_flock(100, np.array([0, -20]), np.array([10, 20])) - -figure = plt.figure() -axes = plt.axes(xlim=(0, limits[0]), ylim=(0, limits[1])) -scatter = axes.scatter(positions[0, :], positions[1, :], - marker='o', edgecolor='k', lw=0.5) -scatter - +# animation.FuncAnimation? # %% [markdown] -# Then, we define a function which **updates** the figure for each timestep +# We see that `FuncAnimation` produces an animation by repeatedly calling a specified function to update the animation 'frame', which corresponds to a Matplotlib figure instance. As well as the figure object to use and function updating the frames, we need to specify the `frames` argument - here we do this with an integer corresponding to the number of frames to animate. 
Finally we specify an optional argument `interval` corresponding to the delay between each frame being shown in milliseconds. Again we wrap all of this into a function to allow for easy reuse. # %% -def update_boids(positions, velocities): - positions += velocities - - -def animate(frame): - update_boids(positions, velocities) - scatter.set_offsets(positions.transpose()) +def animate_flock(positions, velocities, forces=(), timestep=1., num_step=100): + """Visualise the dynamics of the boids as a Matplotlib animation. + + Args: + positions: Array of shape `(num_boids, 2)` defining positions of boids. + velocities: Array of shape `(num_boids, 2)` defining velocities of boids. + forces: Sequence of functions computing forces on boids given positions + and velocities of all boids. + timestep: Scalar timestep to use for numerical integrator. + num_step: Number of timesteps to simulate in animation. + + Returns: + Matplotlib animation of simulated boid dynamics. + """ + fig, ax, arrows = plot_boids(positions, velocities) + def update_frame(frame_index): + simulate_timestep(positions, velocities, forces, timestep) + velocity_unit_vectors = calculate_unit_vectors(velocities) + arrows.set_offsets(positions) + arrows.set_UVC(velocity_unit_vectors[:, 0], velocity_unit_vectors[:, 1]) + return [arrows] + + # Close Matplotlib figure object to avoid displaying static figure as well as animation + plt.close(fig) + return animation.FuncAnimation(fig, update_frame, num_step, interval=50) -# %% [markdown] -# Call `FuncAnimation`, and specify how many frames we want: - -# %% -anim = animation.FuncAnimation(figure, animate, - frames=50, interval=50) # %% [markdown] -# Save out the figure: +# We are now ready to produce our first animation! 
We generate initial positions and velocities using our `initialise_boid_states` function and then pass these to the `animate_flock` function, using the default values for the other arguments; in particular, we do not specify any forces acting on the boids for now.

 # %%
-positions = new_flock(100, np.array([100, 900]), np.array([200, 1100]))
-velocities = new_flock(100, np.array([0, -20]), np.array([10, 20]))
-anim.save('boids_1.mp4')
+positions, velocities = initialise_boid_states(rng)
+animate_flock(positions, velocities)

-# %% [markdown]
-# And download the [saved animation](http://github-pages.ucl.ac.uk/rsd-engineeringcourse/ch02data/boids_1.mp4).

 # %% [markdown]
-# You can even view the results directly in the notebook.
+# Though our boids are now moving, we see that they have the rather uninteresting behaviour of moving perpetually in a straight line. To get more interesting flocking-like behaviour we need to specify some forces.
+#
+# ## Specifying the flocking dynamics
+#
+# Now that we have our simulation and animation framework set up, we are ready to start defining the forces corresponding to the flocking behaviour rules we encountered at the start of the notebook.
+#
+# ### *Cohesion*: staying close to nearby flockmates
+#
+# Rule cohesion
Credit: Craig Reynolds (public domain)
+# +# The first behaviour we consider is the tendency for flocks to remain close to flockmates, which we term *cohesion*. We represent this as a force which pushes the flock members towards the mean point of surrounding flock members, that is the force exerted is proportional to the difference between the mean flock position and the current flock members position. We can implement this efficiently using broadcasted NumPy array operations: # %% -from IPython.display import HTML -HTML(anim.to_jshtml()) +def cohesion_force(positions, velocities, cohesion_strength=0.01): + return cohesion_strength * (positions.mean(axis=0)[np.newaxis] - positions) -# %% [markdown] -# ### Fly towards the middle # %% [markdown] -# Boids try to fly towards the middle: - -# %% -positions = new_flock(4, np.array([100, 900]), np.array([200, 1100])) -velocities = new_flock(4, np.array([0, -20]), np.array([10, 20])) +# Now that we have defined our first force function, we can create a new animation with this force applied: # %% -positions +positions, velocities = initialise_boid_states(rng) +animate_flock(positions, velocities, [cohesion_force]) -# %% -velocities - -# %% -middle = np.mean(positions, 1) -middle - -# %% -direction_to_middle = positions - middle[:, np.newaxis] -direction_to_middle # %% [markdown] -# This is easier and faster than: +# Our boids now show a more interesting dynamic behaviour, staying together as one cohesive unit. We see however that a lot of the time boids appear to be in (near) collisions with each other. # -# ``` python -# for bird in birds: -# for dimension in [0, 1]: -# direction_to_middle[dimension][bird] = positions[dimension][bird] - middle[dimension] +# ### *Separation*: avoiding collisions with nearby flockmates +# +# Rule separation
Credit: Craig Reynolds (public domain)
+#
+# The second behaviour that we consider is the tendency of flockmates to avoid collisions with each other. We represent this as a force which, for each pair of boids within a certain distance of each other, pushes the two boids apart along the line of displacement between them, so that their separation increases. To only sum the displacements over the pairs of boids within a specified distance of each other we can use [the `numpy.where` function](https://numpy.org/doc/stable/reference/generated/numpy.where.html). Calling
+# ```Python
+# values = np.where(conditions, values_if_true, values_if_false)
+# ```
+# creates an array `values` such that for an index `i` (either a single integer for one-dimensional arrays or a tuple of integers for multidimensional arrays) `values[i]` is equal to `values_if_true[i]` if `conditions[i]` is `True` and `values_if_false[i]` otherwise. All of the array arguments `conditions`, `values_if_true` and `values_if_false` should be of compatible shapes - that is, of the same shape, or of shapes that can be broadcast (remember that a scalar is compatible for broadcasting with an array of any shape).
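A quick, self-contained check of this behaviour (the arrays here are illustrative values chosen for this example, not part of the boids model):

```python
import numpy as np

conditions = np.array([True, False, True])
values_if_true = np.array([10, 20, 30])
# the scalar 0 is broadcast against the shape of `values_if_true`
values = np.where(conditions, values_if_true, 0)
print(values)  # -> [10  0 30]
```

Entries where the condition is `False` are taken from the (broadcast) scalar, exactly the masking behaviour the separation force relies on.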
# %% -move_to_middle_strength = 0.01 -velocities = velocities - direction_to_middle * move_to_middle_strength +def separation_force(positions, velocities, separation_strength=1., separation_distance=10.): + displacements = positions[np.newaxis] - positions[:, np.newaxis] + are_close = (displacements**2).sum(-1)**0.5 <= separation_distance + return separation_strength * np.where(are_close[..., None], displacements, 0).sum(0) # %% [markdown] -# Let's update our function, and animate that: - -# %% -def update_boids(positions, velocities): - move_to_middle_strength = 0.01 - middle = np.mean(positions, 1) - direction_to_middle = positions - middle[:, np.newaxis] - velocities -= direction_to_middle * move_to_middle_strength - positions += velocities - +# We can now run and animate a simulation with both the forces we have defined: # %% -def animate(frame): - update_boids(positions, velocities) - scatter.set_offsets(positions.transpose()) +positions, velocities = initialise_boid_states(rng) +animate_flock(positions, velocities, [cohesion_force, separation_force]) -# %% -anim = animation.FuncAnimation(figure, animate, - frames=50, interval=50) - -# %% -positions = new_flock(100, np.array([100, 900]), np.array([200, 1100])) -velocities = new_flock(100, np.array([0, -20]), np.array([10, 20])) -HTML(anim.to_jshtml()) - -# %% [markdown] -# ### Avoiding collisions - -# %% [markdown] -# We'll want to add our other flocking rules to the behaviour of the Boids. - -# %% [markdown] -# We'll need a matrix giving the distances between each bird. This should be $N \times N$. 
- -# %% -positions = new_flock(4, np.array([100, 900]), np.array([200, 1100])) -velocities = new_flock(4, np.array([0, -20]), np.array([10, 20])) - -# %% [markdown] -# We might think that we need to do the X-distances and Y-distances separately: - -# %% -xpos = positions[0, :] - -# %% -xsep_matrix = xpos[:, np.newaxis] - xpos[np.newaxis, :] - -# %% -xsep_matrix.shape - -# %% -xsep_matrix - -# %% [markdown] -# But in NumPy we can be cleverer than that, and make a $2 \times N \times N$ matrix of separations: - -# %% -separations = positions[:, np.newaxis, :] - positions[:, :, np.newaxis] - -# %% -separations.shape - -# %% [markdown] -# And then we can get the sum-of-squares $\delta_x^2 + \delta_y^2$ like this: - -# %% -squared_displacements = separations * separations - -# %% -square_distances = np.sum(squared_displacements, 0) - -# %% -square_distances - -# %% [markdown] -# Now we need to find birds that are too close: - -# %% -alert_distance = 2000 -close_birds = square_distances < alert_distance -close_birds - # %% [markdown] -# Find the direction distances **only** to those birds which are too close: - -# %% -separations_if_close = np.copy(separations) -far_away = np.logical_not(close_birds) - -# %% [markdown] -# Set `x` and `y` values in `separations_if_close` to zero if they are far away: - -# %% -separations_if_close[0, :, :][far_away] = 0 -separations_if_close[1, :, :][far_away] = 0 -separations_if_close - -# %% [markdown] -# And fly away from them: - -# %% -np.sum(separations_if_close, 2) - -# %% -velocities = velocities + np.sum(separations_if_close, 2) - - -# %% [markdown] -# Now we can update our animation: - -# %% -def update_boids(positions, velocities): - move_to_middle_strength = 0.01 - middle = np.mean(positions, 1) - direction_to_middle = positions - middle[:, np.newaxis] - velocities -= direction_to_middle * move_to_middle_strength - - separations = positions[:, np.newaxis, :] - positions[:, :, np.newaxis] - squared_displacements = separations * 
separations - square_distances = np.sum(squared_displacements, 0) - alert_distance = 100 - far_away = square_distances > alert_distance - separations_if_close = np.copy(separations) - separations_if_close[0, :, :][far_away] = 0 - separations_if_close[1, :, :][far_away] = 0 - velocities += np.sum(separations_if_close, 1) - - positions += velocities - +# ### *Alignment*: matching velocity with nearby flockmates +# +# Rule alignment
Credit: Craig Reynolds (public domain)
+# +# The final rule we consider is a little different in that it specifies a relationship between the *velocities* rather than *positions* of the boids. The *alignment* rule specifies that nearby flockmates should tend to match velocities with each other. We can implement this as a force which is negatively proportional to the velocity differences between pairs of boids within a certain distance of each other. We can use a similar implementation to that for `separation_force`: # %% -def animate(frame): - update_boids(positions, velocities) - scatter.set_offsets(positions.transpose()) - - -anim = animation.FuncAnimation(figure, animate, - frames=50, interval=50) - -positions = new_flock(100, np.array([100, 900]), np.array([200, 1100])) -velocities = new_flock(100, np.array([0, -20]), np.array([10, 20])) -HTML(anim.to_jshtml()) +def alignment_force(positions, velocities, alignment_strength=0.125, alignment_distance=100): + displacements = positions[np.newaxis] - positions[:, np.newaxis] + velocity_differences = velocities[np.newaxis] - velocities[:, np.newaxis] + are_close = (displacements**2).sum(-1)**0.5 <= alignment_distance + return -alignment_strength * np.where(are_close[..., None], velocity_differences, 0).mean(0) # %% [markdown] -# ### Match speed with nearby birds - -# %% [markdown] -# This is pretty similar: - -# %% -def update_boids(positions, velocities): - move_to_middle_strength = 0.01 - middle = np.mean(positions, 1) - direction_to_middle = positions - middle[:, np.newaxis] - velocities -= direction_to_middle * move_to_middle_strength - - separations = positions[:, np.newaxis, :] - positions[:, :, np.newaxis] - squared_displacements = separations * separations - square_distances = np.sum(squared_displacements, 0) - alert_distance = 100 - far_away = square_distances > alert_distance - separations_if_close = np.copy(separations) - separations_if_close[0, :, :][far_away] = 0 - separations_if_close[1, :, :][far_away] = 0 - velocities += 
np.sum(separations_if_close, 1) - - velocity_differences = velocities[:, np.newaxis, :] - velocities[:, :, np.newaxis] - formation_flying_distance = 10000 - formation_flying_strength = 0.125 - very_far = square_distances > formation_flying_distance - velocity_differences_if_close = np.copy(velocity_differences) - velocity_differences_if_close[0, :, :][very_far] = 0 - velocity_differences_if_close[1, :, :][very_far] = 0 - velocities -= np.mean(velocity_differences_if_close, 1) * formation_flying_strength - - positions += velocities - +# Now we are finally ready to run and animate a simulation with forces for all three rules defined: # %% -def animate(frame): - update_boids(positions, velocities) - scatter.set_offsets(positions.transpose()) - - -anim = animation.FuncAnimation(figure, animate, - frames=200, interval=50) - - -positions = new_flock(100, np.array([100, 900]), np.array([200, 1100])) -velocities = new_flock(100, np.array([0, -20]), np.array([10, 20])) -HTML(anim.to_jshtml()) +positions, velocities = initialise_boid_states(rng) +animate_flock(positions, velocities, [cohesion_force, separation_force, alignment_force]) # %% [markdown] -# Hopefully the power of NumPy should be pretty clear now. This would be **enormously slower** and harder to understand using traditional lists. +# ## Conclusion and extensions +# +# Hopefully this example has illustrated the power and convenience of NumPy. Implementing an equivalent model using native Python data structures such as lists to represent the states of the boids would both result in code that was slower to run and harder to read. +# +# If you are interested in exploring this model further here are some ideas for extensions +# +# * Currently the cohesion force applies globally rather than only acting locally on boids within a certain distance of each other like the other two forces. Can you alter the implementation to have a similar behaviour of only acting over a finite distance? 
+# * There are various model parameters, such as the relative force strengths `cohesion_strength`, `separation_strength` and `alignment_strength`, which are currently left as the default values specified in the force function definitions. Can you redesign `simulate_timestep` and/or `animate_flock` to accept a dictionary of these parameter values that overrides the defaults, making it easy to visualise the behaviour for different parameter settings?
+# * Currently our boids exist in a two-dimensional world - can you alter the simulation to work in three dimensions? You may find [Matplotlib's three-dimensional plotting toolkit](https://matplotlib.org/stable/tutorials/toolkits/mplot3d.html) useful for the visualisation side.
+#
+#
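For the first extension idea, one possible sketch follows. Everything here is an assumption invented for illustration rather than part of the course materials: the name `local_cohesion_force`, its default parameter values, and the choice to average over all boids (counting pairs further apart than the cutoff as zero contributions, in the same way `alignment_force` does):

```python
import numpy as np

def local_cohesion_force(positions, velocities, cohesion_strength=0.01, cohesion_distance=100.):
    # velocities is unused but kept so the function matches the force interface
    # displacements[i, j] = positions[j] - positions[i]
    displacements = positions[np.newaxis] - positions[:, np.newaxis]
    are_close = (displacements**2).sum(-1)**0.5 <= cohesion_distance
    # pull each boid towards nearby flockmates only, zeroing the contribution
    # of pairs further apart than cohesion_distance (as in separation_force)
    return cohesion_strength * np.where(are_close[..., None], displacements, 0).mean(1)
```

This mirrors the pairwise-masking pattern already used by `separation_force` and `alignment_force`, so it can be passed to `animate_flock` in place of `cohesion_force`.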