Merge pull request UCL#247 from UCL/saransh/port-doctoral-2
chore: port the remaining doctoral course
dpshelio authored Aug 20, 2024
2 parents 8493054 + dcd4a3a commit d78d9d9
Showing 10 changed files with 1,305 additions and 896 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -86,4 +86,4 @@ session04/greetings/doc/
session04/greetings/scripts/
Gemfile.lock
.env/

polynomials.svg
115 changes: 115 additions & 0 deletions ch01python/060basic_plotting.ipynb.py
@@ -0,0 +1,115 @@
# ---
# jupyter:
# jekyll:
# display_name: Basic plotting
# jupytext:
# notebook_metadata_filter: -kernelspec,jupytext,jekyll
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.15.2
# ---

# %% [markdown]
# # Plotting with Matplotlib - the `pyplot` interface
#
# [Matplotlib](https://matplotlib.org/) is a Python library which can be used to produce plots to visualise data. It has support for a wide range of [different plot types](https://matplotlib.org/stable/gallery/index.html), and as well as supporting static outputs it also allows producing [animations](https://matplotlib.org/stable/api/animation_api.html) and [interactive plots](https://matplotlib.org/stable/gallery/index.html#event-handling). As an initial introduction, we will demonstrate how to use Matplotlib's [`pyplot` interface](https://matplotlib.org/stable/api/index.html#the-pyplot-api) (modelled on the plotting functions in MATLAB) to create a simple line plot. Later, we will illustrate Matplotlib's [object-oriented interface](https://matplotlib.org/stable/api/index.html#id3), which allows more flexibility in creating complex plots and greater control over the appearance of plot elements.
#
# ## Importing Matplotlib
#
# We import the `pyplot` module from Matplotlib, which provides us with an interface for making figures. A common convention is to use the `import ... as ...` syntax to alias `matplotlib.pyplot` to the shorthand name `plt`.

# %%
import matplotlib.pyplot as plt

# %% [markdown]
# ## A basic plot
#
# As a first example we create a basic line plot.

# %%
plt.plot([2, 4, 6, 8, 10], [1, 5, 3, 7, -11])

# %% [markdown]
# The `plt.plot` function allows producing line and scatter plots which visualise the relationship between pairs of variables. Here we pass `plt.plot` two lists of five numbers, corresponding respectively to the coordinates of the points on the horizontal (*x*) axis and on the vertical (*y*) axis. When passed no other arguments, `plt.plot` will by default produce a line plot passing through the specified points. The value returned by `plt.plot` is a list of objects corresponding to the plotted line(s): in this case we plotted only one line so the list has only one element. We will ignore these return values for now and come back to them when we explain Matplotlib's object-oriented interface in a later episode.
#
# If passed a single list of numbers, the `plt.plot` function will interpret these as the coordinates of the points to plot on the vertical (*y*) axis, with the horizontal (*x*) axis points in this case implicitly assumed to be the indices of the values in the list. For example, if we plot with just the second list from the previous `plt.plot` call:

# %%
plt.plot([1, 5, 3, 7, -11])

# %% [markdown]
# We get a very similar looking plot other than the change in the scale on the horizontal axis.
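
# %% [markdown]
# As a minimal sketch of where the new horizontal scale comes from, the implicit *x* coordinates can be written out by hand. The names `implicit_x` and `points` are introduced here purely for illustration; no plotting is done.

```python
# Illustrative sketch only: `y` reuses the values from the cell above.
y = [1, 5, 3, 7, -11]

# With a single list, the x coordinates default to the indices 0, 1, ..., N-1.
implicit_x = list(range(len(y)))

# The (x, y) pairs that the plotted line passes through.
points = list(zip(implicit_x, y))
print(points)
```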

# %% [markdown]
# ## Plotting a function
#
# To make things a little more visually interesting, we will illustrate plotting the trigonometric functions *sine* ($\sin$) and *cosine* ($\cos$). We first import implementations of these functions from the built-in `math` module, as well as the numerical constant `pi` ($\pi$).

# %%
from math import sin, cos, pi

# %% [markdown]
# The `sin` and `cos` functions both take a single argument corresponding to an angular quantity in [radians](https://en.wikipedia.org/wiki/Radian) and are [periodic](https://en.wikipedia.org/wiki/Periodic_function) with period $2\pi$. We therefore create a list of equally spaced angles in the interval $[0, 2\pi)$ and assign it to a variable `theta`.

# %%
number_of_points = 100
theta = [2 * pi * n / number_of_points for n in range(number_of_points)]
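
# %% [markdown]
# A quick check, sketched here with only the standard library, that the list really is equally spaced and covers the half-open interval $[0, 2\pi)$. The names `spacings`, `expected_spacing` and `evenly_spaced` are assumptions made for this example.

```python
from math import isclose, pi

number_of_points = 100
theta = [2 * pi * n / number_of_points for n in range(number_of_points)]

# Differences between consecutive angles; each should be 2*pi / number_of_points.
spacings = [b - a for a, b in zip(theta, theta[1:])]
expected_spacing = 2 * pi / number_of_points
evenly_spaced = all(isclose(s, expected_spacing) for s in spacings)

# The interval is half-open: 0 is included, 2*pi itself is not.
print(theta[0], theta[-1] < 2 * pi, evenly_spaced)
```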

# %% [markdown]
# Using a list comprehension we can now compute the value of the sine function for each value in `theta` and graph this as the vertical coordinates of a line plot.

# %%
plt.plot(theta, [sin(t) for t in theta])

# %% [markdown]
# ## Plotting multiple lines
#
# We can plot multiple different lines on the same plot by making multiple calls to `plt.plot` within the same cell. For example, in the cell below we plot both the sine and cosine functions.

# %%
plt.plot(theta, [sin(t) for t in theta])
plt.plot(theta, [cos(t) for t in theta])

# %% [markdown]
# By default Matplotlib will cycle through a [sequence of colours](https://matplotlib.org/stable/gallery/color/color_cycle_default.html) as each new plot is added to help distinguish between the different plotted lines.
#
# ## Changing the line styles
#
# The `plt.plot` function offers various optional keyword arguments that can be used to further customise the plot. One useful argument is `linestyle`, which allows the style of the line used to join the plotted points to be specified. For example, this can be useful to allow plotted lines to be distinguished even when they are printed in monochrome. Matplotlib has [a variety of built-in linestyles with simple string names](https://matplotlib.org/stable/gallery/lines_bars_and_markers/linestyles.html) as well as options for performing further customisation. Here we specify for the cosine curve to be plotted with a dotted line.

# %%
plt.plot(theta, [sin(t) for t in theta])
plt.plot(theta, [cos(t) for t in theta], linestyle="dotted")

# %% [markdown]
# ## Adding a legend
#
# Although we can visually distinguish between the two plotted lines, ideally we would have labels to indicate which corresponds to which function. We can add a legend to the plot with the `plt.legend` function. If we pass a list of strings to `plt.legend` these will be interpreted as the labels for each of the lines plotted so far in the order plotted. Matplotlib has [in-built support](https://matplotlib.org/stable/tutorials/text/mathtext.html) for using [TeX markup](https://en.wikibooks.org/wiki/LaTeX/Mathematics) to write mathematical expressions by putting the TeX markup within a pair of dollar signs (`$`). As TeX's use of the backslash character `\` to prefix commands conflicts with Python's interpretation of `\` as an escape character, you should typically use raw-strings by prefixing the string literal with `r` to simplify writing TeX commands.

# %%
plt.plot(theta, [sin(t) for t in theta])
plt.plot(theta, [cos(t) for t in theta], linestyle="dotted")
plt.legend([r"$\sin\theta$", r"$\cos\theta$"])
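
# %% [markdown]
# To see why raw strings matter for TeX labels, the sketch below compares the same label written with and without the `r` prefix. The variable names are made up for this illustration and nothing is plotted.

```python
# Without the r prefix, Python turns "\t" into a tab character before
# Matplotlib ever sees the string, so the TeX command \theta is lost.
plain = "$\theta$"
raw = r"$\theta$"

print(repr(plain))  # shows a tab escape where the backslash and "t" were
print(repr(raw))    # keeps the backslash

# Doubling the backslash is an equivalent, but noisier, alternative.
escaped = "$\\theta$"
print(raw == escaped)
```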

# %% [markdown]
# Matplotlib also allows the legend label for a plot to be specified in the `plt.plot` call using the `label` keyword argument. When plotting many lines this can be more readable than having to create a separate list of labels to pass to a subsequent `plt.legend` call. If we specify the `label` keyword arguments we can call `plt.legend` without any arguments.

# %%
plt.plot(theta, [sin(t) for t in theta], label=r"$f(\theta) = \sin\theta$")
plt.plot(theta, [cos(t) for t in theta], linestyle="dotted", label=r"$f(\theta) = \cos\theta$")
plt.legend()

# %% [markdown]
# ## Adding axis labels and a title
#
# The `pyplot` interface also provides functions for adding axis labels and a title to our plot. Specifically `plt.xlabel` and `plt.ylabel` are functions which set the labels on respectively the horizontal (*x*) axis and vertical (*y*) axis, both accepting a string argument corresponding to the axis label. The `plt.title` function, as the name suggests, allows setting an overall title for the plot. As for the legend labels, the axis labels and title may all optionally use TeX mathematical notation delimited by dollar `$` signs.

# %%
plt.plot(theta, [sin(t) for t in theta], label=r"$f(\theta) = \sin\theta$")
plt.plot(theta, [cos(t) for t in theta], linestyle="dotted", label=r"$f(\theta) = \cos\theta$")
plt.legend()
plt.xlabel(r"Angle in radians $\theta$")
plt.ylabel(r"$f(\theta)$")
plt.title("Trigonometric functions")
129 changes: 83 additions & 46 deletions ch02data/061internet.ipynb.py
@@ -12,7 +12,7 @@
# ---

# %% [markdown]
# ## Getting data from the Internet
# # Getting data from the internet

# %% [markdown]
# We've seen how to obtain data from our local file system.
@@ -28,99 +28,133 @@
# We may also want to be able to programmatically *upload* data, for example, to automatically fill in forms.

# %% [markdown]
# This can be really powerful if we want to, for example, perform an automated meta-analysis across a selection of research papers.
# This can be really powerful if we want to, for example, do automated meta-analysis across a selection of research papers.

# %% [markdown]
# ### URLs
# ## Uniform resource locators

# %% [markdown]
# All internet resources are defined by a Uniform Resource Locator.
# All internet resources are defined by a [_uniform resource locator_ (URL)](https://en.wikipedia.org/wiki/URL), which is a particular type of [_uniform resource identifier_ (URI)](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier). For example:

# %%
"https://static-maps.yandex.ru/1.x/?size=400,400&ll=-0.1275,51.51&z=10&l=sat&lang=en_US"
"https://mt0.google.com:443/vt?x=658&y=340&z=10&lyrs=s"

# %% [markdown]
# A url consists of:
# A URL consists of:
#
# * A *scheme* (`http`, `https`, `ssh`, ...)
# * A *host* (`static-maps.yandex.ru`, the name of the remote computer you want to talk to)
# * A *port* (optional, most protocols have a typical port associated with them, e.g. 80 for http, 443 for https)
# * A *path* (Like a file path on the machine, here it is `1.x/`)
# * A *query* part after a `?`, (optional, usually ampersand-separated *parameters* e.g. `size=400x400`, or `z=10`)
# * A *scheme* (`http` [_hypertext transfer protocol_](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol), `https` [_hypertext transfer protocol secure_ ](https://en.wikipedia.org/wiki/HTTPS), `ssh` [_secure shell_](https://en.wikipedia.org/wiki/Secure_Shell), ...)
# * A *host* (`mt0.google.com`, the name of the remote computer you want to talk to)
# * A *port* (optional, most protocols have a typical port associated with them, e.g. 80 for HTTP, 443 for HTTPS)
# * A *path* (analogous to a file path on the machine, here it is just `vt`)
# * A *query* part after a `?` (optional, usually ampersand `&` separated *parameters*, e.g. `x=658` or `z=10`)
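
# %% [markdown]
# The parts listed above can be pulled out of the example URL with the standard library's `urllib.parse` module. This is only a sketch for illustration; the variable names `parts` and `parameters` are assumptions made here. Note that `urlsplit` reports the path with its leading slash.

```python
from urllib.parse import parse_qs, urlsplit

url = "https://mt0.google.com:443/vt?x=658&y=340&z=10&lyrs=s"
parts = urlsplit(url)

print(parts.scheme)    # scheme
print(parts.hostname)  # host
print(parts.port)      # port as an integer (None if the URL does not give one)
print(parts.path)      # path
print(parts.query)     # raw query string

# parse_qs maps each parameter name to a list of its string values.
parameters = parse_qs(parts.query)
print(parameters)
```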

# %% [markdown]
# **Supplementary materials**: These can actually be different for different protocols, the above is a simplification. You can see more, for example, at
# [the wikipedia article about the URI scheme](https://en.wikipedia.org/wiki/URI_scheme).
# **Supplementary materials**: These can actually be different for different protocols; the above is a simplification. You can see more, for example, at
# [the Wikipedia article on URIs](https://en.wikipedia.org/wiki/URI_scheme).

# %% [markdown]
# URLs are not allowed to include all characters; we need to, for example, "escape" a space that appears inside the URL,
# replacing it with `%20`, so e.g. a request of `http://some example.com/` would need to be `http://some%20example.com/`
# URLs are not allowed to include all characters; we need to, for example, [_escape_](https://en.wikipedia.org/wiki/Escape_character) a space that appears inside the URL, replacing it with `%20`, so e.g. a request of `http://some example.com/` would need to be `http://some%20example.com/`.
#

# %% [markdown]
# **Supplementary materials**: The code used to replace each character is the [ASCII](http://www.asciitable.com) code for it.

# %% [markdown]
# **Supplementary materials**: The escaping rules are quite subtle. See [the wikipedia article for more detail](https://en.wikipedia.org/wiki/Percent-encoding). The standard library provides the [urlencode](https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlencode) function that can take care of this for you.
# **Supplementary materials**: The escaping rules are quite subtle. See [the Wikipedia article on percent-encoding](https://en.wikipedia.org/wiki/Percent-encoding). The standard library provides the [urlencode](https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlencode) function that can take care of this for you.
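
# %% [markdown]
# A small sketch of percent-encoding using the standard library's `urllib.parse.quote` and `unquote` functions. It escapes only the text containing the space rather than a whole URL, and the variable names are chosen here for illustration.

```python
from urllib.parse import quote, unquote

# A space is character number 32, which is 20 in hexadecimal, hence "%20".
space_code = format(ord(" "), "02X")

escaped = quote("some example")
print(space_code, escaped)

# unquote reverses the escaping.
restored = unquote(escaped)
print(restored)
```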

# %% [markdown]
# ### Requests
# ## Requests

# %% [markdown]
# The python [requests](http://docs.python-requests.org/en/latest/) library can help us manage and manipulate URLs. It is easier to use than the `urllib` library that is part of the standard library, and is included with anaconda and canopy. It sorts out escaping, parameter encoding, and so on for us.
# The Python [Requests](http://docs.python-requests.org/en/latest/) library can help us manipulate URLs and request the content associated with them. It is easier to use than the `urllib` library that is part of the standard library, and is included with Anaconda and Canopy. It sorts out escaping, parameter encoding, and so on for us.

# %%
import requests

# %% [markdown]
# To request the above URL, for example, we write:

# %%
import requests
response = requests.get(
    url="https://mt0.google.com:443/vt",
    params={'x': 658, 'y': 340, 'lyrs': 's', 'z': 10}
)
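
# %% [markdown]
# To get a feel for what passing `params` achieves, the sketch below builds an equivalent query string by hand with the standard library. It is an approximation for illustration, not Requests' actual implementation, and `base_url`, `query` and `full_url` are names assumed here.

```python
from urllib.parse import urlencode

base_url = "https://mt0.google.com:443/vt"
params = {'x': 658, 'y': 340, 'lyrs': 's', 'z': 10}

# urlencode joins the key=value pairs with "&" and escapes them where needed.
query = urlencode(params)
full_url = f"{base_url}?{query}"
print(full_url)
```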

# %% [markdown]
# The returned object is an instance of the `requests.Response` class:

# %%
response = requests.get("https://static-maps.yandex.ru/1.x/?size=400,400&ll=-0.1275,51.51&z=10&l=sat&lang=en_US",
params={
'size': '400,400',
'll': '-0.1275,51.51',
'zoom': 10,
'l': 'sat',
'lang': 'en_US'
})
response

# %%
response.content[0:50]
isinstance(response, requests.Response)

# %% [markdown]
# When we do a request, the result comes back as text. For the png image in the above, this isn't very readable.
# The `Response` class defines various useful attributes associated with the responses, for example we can check the [status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) for our request with a value of 200 indicating a successful request

# %%
response.status_code

# %% [markdown]
# Just as for file access, therefore, we will need to send the text we get to a python module which understands that file format.
# We can also more directly check if the response was successful or not with the boolean `Response.ok` attribute

# %%
response.ok
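
# %% [markdown]
# Status codes can be explored without making a request by using the standard library's `http.HTTPStatus` enumeration. The helper `looks_successful` is hypothetical and only approximates the documented behaviour of `Response.ok`, which is true for status codes below 400.

```python
from http import HTTPStatus


def looks_successful(status_code):
    """Hypothetical helper: treat any status code below 400 as a success."""
    return status_code < 400


# Print each code with its standard reason phrase and our verdict.
for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase, looks_successful(code))
```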

# %% [markdown]
# Again, it is important to separate the *transport* model (e.g. a file system, or an "http request" for the web) from the data model of the data that is returned.
# We can get the URL that was requested using the `Response.url` attribute

# %%
response.url

# %% [markdown]
# ### Example: Sunspots
# When we do a request, the associated response content, accessible at the `Response.content` attribute, is returned as bytes. For the JPEG image in the above, this isn't very readable:

# %%
type(response.content)

# %%
response.content[:10]

# %% [markdown]
# Let's try to get something scientific: the sunspot cycle data from [SILSO](http://sidc.be/silso/home):
# We can also get the content as a string using the `Response.text` attribute, though this is even less readable here as some of the returned bytes do not correspond to any character in the text encoding used:

# %%
type(response.text)

# %%
response.text[:10]

# %% [markdown]
# To get a more useful representation of the data, we will therefore need to process the content we get using a Python function which understands the byte-encoding of the corresponding file format.
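
# %% [markdown]
# The difference between bytes and text can be sketched without any download. The four bytes below are hard-coded as an assumed example of how a JPEG file typically starts; they are not taken from the response above.

```python
# The first bytes of a typical JPEG file (hard-coded here, not downloaded).
jpeg_start = b"\xff\xd8\xff\xe0"

# Bytes are sequences of integers in the range 0-255.
print(list(jpeg_start))

# Strict decoding fails, because these bytes are not valid UTF-8 ...
try:
    jpeg_start.decode("utf-8")
    decoded_strictly = True
except UnicodeDecodeError:
    decoded_strictly = False

# ... while errors="replace" substitutes U+FFFD for the undecodable bytes.
lossy_text = jpeg_start.decode("utf-8", errors="replace")
print(decoded_strictly, repr(lossy_text))
```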

# %% [markdown]
# Again, it is important to separate the *transport* model (e.g. a file system, or an HTTP request for the web) from the data model of the data that is returned.

# %% [markdown]
# ## Example: sunspots

# %% [markdown]
# Let's try to get something scientific: the sunspot cycle data from the [Sunspot Index and Long-term Solar Observations website](http://sidc.be/silso/home)

# %%
spots = requests.get('http://www.sidc.be/silso/INFO/snmtotcsv.php').text

# %%
spots[0:80]
spots[-100:]

# %% [markdown]
# This looks like semicolon-separated data, with different records on different lines. (Line separators come out as `\n`)
# This looks like semicolon-separated data, with different records on different lines. Line separators come out as `\n`, which is the escape sequence corresponding to a newline character in Python.

# %% [markdown]
# There are many many scientific datasets which can now be downloaded like this - integrating the download into your data
# pipeline can help to keep your data flows organised.

# %% [markdown]
# ### Writing our own Parser
# ## Writing our own parser

# %% [markdown]
# We'll need a python library to handle semicolon-separated data like the sunspot data.
# We'll need a Python library to handle semicolon-separated data like the sunspot data.

# %% [markdown]
# You might be thinking: "But I can do that myself!":
@@ -139,31 +173,34 @@
# But **don't**: what if, for example, one of the records contains a separator inside it; most computers will put the content in quotes,
# so that, for example,
#
# "something; something"; something; something
# "Something; something"; something; something
#
# has three fields, the first of which is
#
# something; something
#
# The naive code above would give four fields, of which the first is
#
# "something
# Something; something
#

# %% [markdown]
# Our naive code above would however not correctly parse this input:

# %%
'"Something; something"; something; something'.split(';')

# %% [markdown]
# You'll never manage to get all that right; so you'll be better off using a library to do it.
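
# %% [markdown]
# As one possible illustration of using a library instead, the standard library's `csv` module handles the quoted separator correctly. The choice of `skipinitialspace=True` is an assumption that suits this particular input, where each semicolon is followed by a space.

```python
import csv

line = '"Something; something"; something; something'

# skipinitialspace drops the blank that follows each semicolon in this input.
reader = csv.reader([line], delimiter=";", skipinitialspace=True)
fields = next(reader)
print(fields)
```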

# %% [markdown]
# ### Writing data to the internet
# ## Writing data to the internet

# %% [markdown]
# Note that we're using `requests.get`. `get` is used to receive data from the web.
# You can also use `post` to fill in a web-form programmatically.

# %% [markdown]
# **Supplementary material**: Learn about using `post` with [requests](http://docs.python-requests.org/en/latest/user/quickstart/).
# **Supplementary material**: Learn about using `post` with [Requests](http://docs.python-requests.org/en/latest/user/quickstart/).

# %% [markdown]
# **Supplementary material**: Learn about the different kinds of [http request](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods): [Get, Post, Put, Delete](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete)...
# **Supplementary material**: Learn about the different kinds of [HTTP request](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods): [Get, Post, Put, Delete](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete)...
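
# %% [markdown]
# A rough sketch of what a form submission looks like, built with the standard library and never actually sent. The form fields and the `https://example.com/form` address are hypothetical; with Requests, the analogous call would be along the lines of `requests.post(url, data=form_fields)`.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields and endpoint, used purely for illustration.
form_fields = {"name": "Ada Lovelace", "course": "doctoral"}
body = urlencode(form_fields).encode("ascii")

request = Request("https://example.com/form", data=body, method="POST")

# Nothing has been sent; we only inspect what would be transmitted.
print(request.get_method(), request.data)
```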

# %% [markdown]
# This can be used for all kinds of things, for example, to programmatically add data to a web resource. It's all well beyond
4 changes: 2 additions & 2 deletions ch02data/062csv.ipynb.py → ch02data/063tabular.ipynb.py
@@ -1,7 +1,7 @@
# ---
# jupyter:
# jekyll:
# display_name: CSV
# display_name: Tabular data
# jupytext:
# notebook_metadata_filter: -kernelspec,jupytext,jekyll
# text_representation:
@@ -12,7 +12,7 @@
# ---

# %% [markdown]
# ## Field and Record Data (Tabular data)
# # Field and Record Data (Tabular data)
#
# Tabular data, that is data that is formatted as a table with a fixed number of rows and columns, is very common in a research context. A particularly simple and also popular file format for such data is [_delimiter-separated values_ files](https://en.wikipedia.org/wiki/Delimiter-separated_values).
