---
title: "Analyzing Open Contracting data"
author: "Rodrigo Parra"
subtitle: A manual using the R programming language
output:
  bookdown::html_document2:
    fig_caption: yes
    theme: cosmo
    toc: yes
  bookdown::pdf_document2:
    fig_caption: yes
    toc: yes
  bookdown::word_document2:
    fig_caption: yes
    toc: yes
urlcolor: blue
---
<style>
body {
  text-align: justify
}
.caption {
  text-align: center
}
caption {
  text-align: center
}
</style>
\newpage
# Introduction
The goal of this manual is to provide people who might be interested in pursuing this challenging but worthwhile endeavor with the basic foundations needed to analyze and visualize Open Contracting data using the R programming language. A basic understanding of computer programming will come in handy when reading this document; however, enough guidance is provided to ensure that less tech-savvy readers can also follow along.
As part of this guide, we acquire, clean, analyze and plot Open Contracting data from four members of the partnership: Paraguay, Mexico, Uruguay and Colombia. This document was written using [R Notebooks](https://rmarkdown.rstudio.com/r_notebooks), a nifty feature which allows convenient integration of Markdown and R code, and exporting to HTML and PDF. The source code of this project is available [here](https://github.com/rparrapy/ocds-r-manual).
The rest of this document is organized as follows:
* The remainder of this section introduces the Open Contracting Data Standard, briefly describing its main building blocks, and the R programming language, covering its core features and how to install it.
* Section 2 deals with data acquisition and cleaning, focusing on reading and processing JSON files with R.
* Section 3 presents the [Tidyverse](https://www.tidyverse.org), a collection of R packages designed to perform common data science tasks.
* Finally, section 4 introduces [ggplot2](https://ggplot2.tidyverse.org) and the grammar of graphics it is based on, to draw several plots that describe our previously acquired datasets.
## An introduction to the Open Contracting Data Standard
By requiring data-sharing in a structured, re-usable and machine readable form, Open data opens up new opportunities for analysis and citizen engagement and participation. The [Open Contracting Data Standard](https://www.open-contracting.org/data-standard/) was created to apply these principles to the complete contracting lifecycle including planning, tender, award, contract and implementation.
The data standard, designed and developed through an open process, allows governments and cities around the world to share their contracting data, enabling greater transparency in public contracting, and supporting accessible and in-depth analysis of the efficiency, effectiveness, fairness, and integrity of public contracting systems. Additionally, the help desk team, staffed by Open Data Services Co-operative, is available to assist prospective users in their journey towards adoption of the standard.
The intention of this section is to introduce the reader to the standard, the use cases it was designed for and the basic concepts needed to apply it. Most of the content was taken from the official documentation of the standard; for a more thorough introductory walkthrough, please refer to the [getting started guide](http://standard.open-contracting.org/latest/en/getting_started/).
### Users and use cases
The standard was designed around four main groups of user needs:
* Achieving value for money for government
* Strengthening the transparency, accountability and integrity of public contracting
* Enabling the private sector to fairly compete for public contracts
* Monitoring the effectiveness of service delivery
To find out about who is using OCDS-compliant data around the globe and how they are doing it, have a look at the [Open Contracting Partnership website](http://www.open-contracting.org/). Four potential use cases for open contracting data are:
* Value for money in procurement: helping officials get good value for money during the procurement process, and analyzing whether this goal was achieved afterwards.
* Detecting fraud and corruption: identifying red flags that might indicate corruption by studying individual procurements or networks based on funding, ownership and interests.
* Competing for public contracts: allowing private firms to understand the potential pipeline of procurement opportunities by looking at information related to past and current procurements.
* Monitoring Service Delivery: helping interested actors to leverage traceability in the procurement process for monitoring purposes, linking budgets and donor data to the contracts and results.
### The contracting process
The standard defines a contracting process as:
> All the planning, tendering information, awards, contracts and contract implementation information related to a single initiation process.
The standard covers all the stages of a contracting process, even though some processes might not involve all possible steps. The stages of the procurement process, with example objects that might be associated to each one, are depicted in figure 1.
```{r include=FALSE}
knitr::opts_chunk$set(cache = TRUE, warning = FALSE,
message = FALSE, cache.lazy = FALSE)
```
```{r, echo=FALSE, fig.cap="Stages of the procurement process."}
knitr::include_graphics("procurement-stages.png")
```
For identification purposes, all contracting processes are assigned a unique Open Contracting ID (ocid), which can be used to join data from different stages. In order to avoid ocid clashes between publishers, a publisher can prepend a prefix to locally generated identifiers. Publishers are encouraged to register their prefix [here](http://standard.open-contracting.org/latest/en/implementation/registration/).
### Documents
Contracting processes are represented as **documents** in the Open Contracting Data Standard (OCDS from now on, for brevity). Each document is made up of several **sections**, mentioned below:
* release metadata: contextual information about each release of data;
* parties: information about the organizations and other participants involved in the contracting process;
* planning: information about the goals, budgets and projects a contracting process relates to;
* tender: information about how a tender will take place, or has taken place;
* awards: information on awards made as part of a contracting process;
* contract: information on contracts signed as part of a contracting process;
* implementation: information on the progress of each contract towards completion.
An example JSON snippet compliant with this structure looks as follows:
```json
{
"language": "en",
"ocid": "contracting-process-identifier",
"id": "release-id",
"date": "ISO-date",
"tag": ["tag-from-codelist"],
"initiationType": "tender",
"parties": {},
"buyer": {},
"planning": {},
"tender": {},
"awards": [ {} ],
"contracts":[ {
"implementation":{}
}]
}
```
There are two types of documents defined in the standard:
* **Releases** are immutable and represent updates on the contracting process. For example, they can be used to notify users of new tenders, awards, contracts and other updates. As such, a single contracting process can have many releases.
* **Records** are snapshots of the current state of a contracting process. A record should be updated every time a new release associated to its contracting process is published; hence, there should only be a single record per contracting process.
### Fields
Each section may contain several **fields** specified in the standard, which are used to represent data. These objects can appear several times in different sections of the same document; for example, items can occur in tender (to indicate the items that a buyer wishes to buy), in an award object (to indicate the items that an award has been made for) and in a contract object (to indicate the items listed in the contract). Some example fields, accompanied by corresponding JSON snippets, are presented below.
#### Parties (Organizations)
```json
{
"address": {
"countryName": "United Kingdom",
"locality": "London",
"postalCode": "N11 1NP",
"region": "London",
"streetAddress": "4, North London Business Park, Oakleigh Rd S"
},
"contactPoint": {
"email": "[email protected]",
"faxNumber": "01234 345 345",
"name": "Procurement Team",
"telephone": "01234 345 346",
"url": "http://example.com/contact/"
},
"id": "GB-LAC-E09000003",
"identifier": {
"id": "E09000003",
"legalName": "London Borough of Barnet",
"scheme": "GB-LAC",
"uri": "http://www.barnet.gov.uk/"
},
"name": "London Borough of Barnet",
"roles": [ ... ]
}
```
#### Amounts
```json
{
"amount": 11000000,
"currency": "GBP"
}
```
#### Items
```json
{
"additionalClassifications": [
{
"description": "Cycle path construction work",
"id": "45233162-2",
"scheme": "CPV",
"uri": "http://cpv.data.ac.uk/code-45233162.html"
}
],
"classification": {
"description": "Construction work for highways",
"id": "45233130",
"scheme": "CPV",
"uri": "http://cpv.data.ac.uk/code-45233130"
},
"description": "string",
"id": "0001",
"quantity": 8,
"unit": {
"name": "Miles",
"value": {
"amount": 137000,
"currency": "GBP"
}
}
}
```
#### Time Periods
```json
{
"endDate": "2011-08-01T23:59:00Z",
"startDate": "2010-07-01T00:00:00Z"
}
```
#### Documents
```json
{
"datePublished": "2010-05-10T10:30:00Z",
"description": "Award of contract to build new cycle lanes to AnyCorp Ltd.",
"documentType": "notice",
"format": "text/html",
"id": "0007",
"language": "en",
"title": "Award notice",
"url": "http://example.com/tender-notices/ocds-213czf-000-00001-04.html"
}
```
#### Milestones
```json
{
"description": "A consultation period is open for citizen input.",
"dueDate": "2015-04-15T17:00:00Z",
"id": "0001",
"title": "Consultation Period"
}
```
### Extensions and codelists
In addition to regular fields, the OCDS schema defines some fields that can only be used in certain sections, e.g. *titles* and *descriptions* of tenders, awards and contracts. In some cases, publishers may require fields that are not provided by the core schema; an **extension** allows defining new fields that can be used in these cases. A list of available extensions can be found [here](http://standard.open-contracting.org/latest/en/extensions); if no existing extension addresses a publisher's needs, the publisher is encouraged to collaborate on the creation of a new community extension.
Another concept worth mentioning is that of codelists. Codelists are sets of case sensitive strings with associated labels, available in each language OCDS has been translated into. Publishers should use codelist values whenever possible to map their existing classification systems; if needed, detail fields can be used to provide more detailed classification information. There are two types of codelists:
* **Closed codelists** are fixed sets of values. If a field is associated with a closed codelist, it should only accept an option from the published list.
* **Open codelists** are sets of recommended values. If a field is associated with an open codelist, it accepts options from the list but also other values.
The Open Contracting Data Standard is maintained using [JSON Schema](http://json-schema.org). In this section we have introduced and described the main sections and common objects used in the schema, providing JSON snippets as examples of these basic building blocks. If you are interested in the full JSON schema reference, please refer to the [official documentation](http://standard.open-contracting.org/latest/en/schema/).
## An introduction to the R programming language
R is a multi-paradigm interpreted programming language and software environment focused on statistical computing, commonly used for data analysis. It is published under the [GPL v2 license](https://www.gnu.org/licenses/old-licenses/gpl-2.0.html) and maintained by the [R Foundation](https://www.r-project.org/foundation/), with binaries available for GNU/Linux, macOS and Windows. While the basic installer comes bundled with a command line interface, several graphical integrated development environments are available, of which RStudio and RStudio Server are the most noteworthy[^1].
In this section we will introduce some of R's basic features and syntax; after going through it, the reader should be better prepared to understand the code used for analysis in the rest of the guide. Having said that, several features are omitted, as a complete reference of R is outside the scope of this document. Should the reader feel inclined to learn more about R, out of curiosity or need, a list of manuals maintained by the R Development Core Team is available [here](https://cran.r-project.org/manuals.html).
For completeness and reproducibility purposes, we include an excerpt of system information below.
```{r}
R.version
```
[^1]: The author recommends RStudio as the go-to way to run R code and believes it is one major reason behind R's spike in popularity. Consequently, this guide was written using RStudio v1.1.453 on macOS High Sierra.
### Operators
R's arithmetic and logical operators should be familiar to most programmers, as they are common to other programming languages. It is worth mentioning that arithmetic operators work on both scalars and collections, operating element-wise on the latter.
| Operator | Description |
|:--------:|:-----------|
| **+** | Addition |
| **-** | Subtraction |
| **\*** | Multiplication |
| **/** | Division |
| **\*\*** or **^** | Exponentiation |
| **%%** | Modulus |
Table: Some R arithmetic operators
| Operator | Description |
|:--------:|:-----------|
| **>** | Greater than |
| **>=** | Greater than or equal to|
| **<** | Less than |
| **<=** | Less than or equal to |
| **==** | Equal to |
| **!=** | Not equal to |
Table: Some R logical operators
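As noted above, both kinds of operators also work on whole vectors at once, element-wise. A quick illustration (vectors are covered formally in the next subsection):
```{r}
# Arithmetic operators applied to a vector act element-wise
c(1, 2, 3) * 2
# Logical operators compare element-wise as well
c(1, 5, 10) > 3
```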
### Variables
R supports several data types including scalars, vectors (of numbers, strings, booleans, etc.), matrices, data frames and tibbles, among others. The **<-** operator is used to assign a value to a variable. A few examples of variables of these types are shown below:
```{r}
a_scalar <- 4
a_number_vector <- c(1, 2, 3, 4, 5) # all elements of a vector must be of the same type
a_string_vector <- c("a1", "b2", "c3", "d4")
a_boolean_vector <- c(TRUE, FALSE)
# lists can have elements of different types, associated to a key
a_list <- list(name = "John", last_name = "Deer", age = 42, married = FALSE)
# there are several ways to access an element from a collection
a_number_vector[1]
a_string_vector[[1]]
a_list$name
# a matrix is a special kind of vector, with the number of rows and columns as attributes
m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
m
```
### Data Frames and Tibbles
Data types for storing data tables are core to R's purpose and functionality, hence they deserve their own subsection. The most common data table type in R is the data frame, which can be thought of as a list of vectors of equal length.
```{r}
name <- c("Mike", "Lucy", "John")
age <- c(20, 25, 30)
student <- c(TRUE, FALSE, TRUE)
df <- data.frame(name, age, student)
df
```
The access operators presented for other data types can also be used to get data cells from a data frame.
```{r}
df[1, ] # R has 1-based indexing, i.e. collections start at 1.
df[1, ]$name # Values from a row can be accessed by column name
df$name # Columns can also be accessed directly from the data frame.
```
Another type of data table available is the tibble. Tibbles belong to the Tidyverse, a set of libraries to perform data analysis in R following best practices, which will be discussed at length in another section. For now, just think of tibbles as data frames with some quirks fixed. As an example, printing a data frame displays all of its rows, which may be problematic for large datasets; printing a tibble displays only the first 10 rows by default.
Data frames can be converted to tibbles and vice versa; the latter can be useful when dealing with older libraries that are not Tidyverse-ready. The author advises the use of tibbles and favors (as do many users in the community) leveraging Tidyverse features and adhering to its guidelines as closely as possible.
```{r message=FALSE, warning=FALSE}
# Packrat should have installed dependencies for you,
# you can do it manually by running the command below
# install.packages("tidyverse")
library(tidyverse) # Side note: this is how you import a library in R
```
```{r}
tb <- as_tibble(df)
class(as.data.frame(tb))
```
### Functions
The syntax for function definition in R is pretty straightforward and resembles that of other programming languages. A function receives zero, one or multiple arguments and runs the code included in its body:
> function ( arglist ) {body}
Because R targets a specific niche (statistical computing), it offers a very rich set of built-in functions and libraries readily available for installation. Thus, function calls tend to be much more common in R code than function definitions.
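Still, as a minimal sketch of the definition template above (the function name and logic are invented purely for illustration), here is a small user-defined function and a call to it:
```{r}
# A small example function: computes the area of a circle given its radius
circle_area <- function(radius) {
  return(pi * radius^2) # pi is a built-in constant
}
circle_area(2)
```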
Most of the R ecosystem is focused on dealing with (cleaning, plotting, modeling) tabular data. As a simple display of basic statistical features, we can use the *summary* function to get descriptive statistics from a table.
```{r}
summary(tb)
```
Reading data that is already tabular is also very simple, as shown in the example below which reads a CSV file as a data frame. The file corresponds to the iris dataset, a very popular toy dataset for data science available [online](https://archive.ics.uci.edu/ml/datasets/iris).
```{r}
iris <- as_tibble(read.table("iris.csv", header = TRUE, sep = ","))
iris
```
One last useful operator that is worth mentioning is the pipe *%>%*. Pipes allow function chaining in R, which favors readability and (one might say) elegance when a sequence of function calls is needed.
```{r}
# For now, there is no need to understand what each function of this snippet actually does
iris %>%
group_by(Name) %>%
summarize_if(is.numeric, mean) %>%
ungroup()
```
The pipe operator passes the value of its left-hand-side operand as the first argument of its right-hand-side operand. Using it avoids declaring additional variables to store intermediate results (which clutters the namespace) or nesting function calls (all those parentheses can be hard to read).
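As a minimal illustration of this equivalence, the nested call and the piped version below compute the same value:
```{r}
# Nested function calls, read inside-out
sqrt(sum(c(1, 4, 9)))
# The same computation with pipes, read left to right
c(1, 4, 9) %>% sum() %>% sqrt()
```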
### Style
As a final remark, although there is no official code style guideline for the R programming language, the author (not surprisingly if you have been paying attention) recommends following the [Tidyverse style guide](http://style.tidyverse.org/index.html)[^2]. The [styler](http://styler.r-lib.org) package is a handy linter and code formatter that can help keep source code compliant; it includes an RStudio add-in, the easiest way to re-style existing code.
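For instance, files or snippets of code can be re-styled programmatically (a hedged sketch; the file path is hypothetical):
```{r, eval=FALSE}
library(styler)
style_file("manual.Rmd") # re-style a whole file in place
style_text("a<-c( 1,2 ,3)") # or re-style a snippet and print the result
```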
[^2]: This notebook was written with the Tidyverse code style guide in mind. Please blame any non-conformant snippet on the lack of good coffee at that point of the writing process :)
## Installing R and other dependencies
Assuming a machine using Ubuntu 18.04 Bionic Beaver as its operating system, a user with *sudo* access can install the R programming language and RStudio by following these steps:
1. Update system packages to make sure already installed dependencies are up to date.
```{bash, eval=FALSE}
sudo apt update
```
2. Install the R programming language.
```{bash, eval=FALSE}
sudo apt -y install r-base
```
3. RStudio for Ubuntu comes bundled as a *.deb* package. The easiest way to install a *.deb* package in Ubuntu is by using the *gdebi* command, which is not installed in the system by default. It can be installed by running the following command.
```{bash, eval=FALSE}
sudo apt install gdebi-core
```
4. Go to the official [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download) and download the file corresponding to your operating system. In our current scenario, that would be *RStudio 1.1.456 - Ubuntu 16.04+/Debian 9+ (64-bit)*.
5. From the directory in which you downloaded the package in the previous step, install RStudio using *gdebi*.
```{bash, eval=FALSE}
sudo gdebi rstudio-xenial-1.1.456-amd64.deb
```
6. Profit! You should now be able to access RStudio by running the *rstudio* command or by clicking its icon in the apps menu.
Besides RStudio Desktop, which can be installed by following the previous set of steps, RStudio is also available in a remote access flavor. RStudio Server is an application that can be installed on a web server, offering the same features as the desktop version through a web browser. To find out more about this alternative and how to install it, please refer to the [official documentation](https://www.rstudio.com/products/rstudio/download-server/).
The project this document is generated from uses [Packrat](https://rstudio.github.io/packrat/) to keep track of all the libraries that it depends on. This ensures the portability of the software (on import, the appropriate version for the corresponding OS will be installed) and the reproducibility of the analysis. Therefore, assuming you have Git installed, playing around with this notebook should be as easy as:
1. Cloning the project
```{bash, eval=FALSE}
git clone https://github.com/rparrapy/ocds-r-manual.git
```
2. Opening it with RStudio by using the menu bar option **File -> Open Project...**
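If Packrat does not restore the project's dependencies automatically when the project is first opened, they can be restored manually (a minimal sketch, assuming the cloned project is the active working directory):
```{r, eval=FALSE}
# Install the package versions recorded in the Packrat lockfile
packrat::restore()
```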
We will also be using a Postgres database to run our analysis, hence we need to install Postgres in case it is missing in our system. We can do so by running:
```{bash, eval=FALSE}
sudo apt install postgresql postgresql-contrib
```
The last command installs the Postgres database engine and creates a *postgres* user as a default account to access our databases.
# Data acquisition
This section is focused on data loading from diverse sources, which is the first step for any analysis process. We will read the data from the four countries this document is concerned with, using them as examples for different data acquisition methods:
* Data from Uruguay will be read using their public web API.
* Data from Paraguay will be loaded from local JSON files.
* Data from Mexico and Colombia will be loaded from JSONB columns of a Postgres database. Since these are the largest datasets we have access to, we will use them as an example of how to handle big data with R using JSON streams.
## From a REST API
The open contracting dataset from Uruguay is available via an RSS endpoint, which includes URLs to the latest releases. It is worth noting that since only the latest releases are accessible, not much in-depth analysis can be done using this dataset. Having said that, it is still a perfectly suitable example to showcase R's HTTP fetching capabilities; we start by fetching the RSS feed and creating a data frame where each row corresponds to a release.
```{r}
library(XML)
xml_url <- "http://www.comprasestatales.gub.uy/ocds/rss"
xml_file <- xmlParse(xml_url)
xml_top <- xmlRoot(xml_file)[[1]]
index_tb <- as_tibble(xmlToDataFrame(xmlElementsByTagName(xml_top, "item")))
index_tb
```
In the snippet above, we start by importing R's *XML* library, since Uruguay's API responses are in this data format. The *xmlParse* function fetches the content at the RSS feed URL and parses it, returning a tree-like nested structure. We then get the root of the parsed structure and select every element with the *item* tag by calling the *xmlElementsByTagName* function. We convert the result to a data frame and then to a tibble using *xmlToDataFrame* and *as_tibble* respectively.
We then proceed to fetch the JSON content corresponding to each release, using the values of the **link** column defined above. There are many libraries to handle JSON with R, each with its list of pros and cons; because of its focus on performance and web API consumption, we use [jsonlite](https://cran.r-project.org/web/packages/jsonlite/index.html) in this guide.
```{r message=FALSE, warning=FALSE}
library(jsonlite)
```
```{r, cache=TRUE}
# we run fromJSON (which fetches the url) for every row of the data frame
releases <- apply(index_tb["link"], 1, function(r) {
return(as_tibble(fromJSON(r, flatten = TRUE)[["releases"]]))
})
releases[[1]]
```
The main function in the snippet above is *fromJSON*, which is provided by *jsonlite*. We use it to fetch the JSON files from the URLs contained in the *link* column of *index_tb*. The flatten flag indicates that we want the function to squeeze the nested JSON structure as much as possible, to make it suitable for a tabular format. We repeat the process for each row of the tibble using the *apply* function, which applies the function it receives as a parameter to each row and returns the values as a list.
At this point we have a list of tibbles, each tibble representing a release. As our next step, we can merge all those tibbles, filling missing values (for example, depending on the stage they correspond to, some releases include tender information while some do not) with NA (R's constant for 'Not Available'). Merging several tibbles row-wise is achieved by calling the *bind_rows* function.
```{r, cache=TRUE}
releases_tb <- bind_rows(releases)
```
Because data from Uruguay includes only the latest 500 releases, actual compilation of records is not possible (as not enough data is available). Therefore, this is as far as we will go with their open contracting data.
## From JSON files
We can also read records or releases from a set of JSON files, published by one or more official publishers associated to a country. That is the case of Paraguay, the next country we will be adding to our analysis. Paraguayan open contracting data are published by the National Directorate of Public Tender (*Dirección Nacional de Contrataciones Públicas*, DNCP) and the Ministry of Finance (*Ministerio de Hacienda*); as they share ocids, we should be able to use their published data together.
To continue with this guide, you should now download a copy of these datasets to your local machine, and store them under the *./data* directory (which should be empty if you just cloned this project). That is:
1. Head over to [https://mega.nz/#F!PCQFzAyY!m54DS0hA3pyKXqergJUBFQ](https://mega.nz/#F!PCQFzAyY!m54DS0hA3pyKXqergJUBFQ).
2. Download the *data.zip* file and extract its content inside the *./data* directory.
3. If successful, you should now have two new folders containing .json files on your system: *./data/records_dncp* and *./data/releases_mh*.
Before actually loading the data, let's take a small detour to make our data munging more efficient. The size of the datasets we use in this guide ranges from non-trivial (with processing taking a few minutes) to challenging (with processing taking several hours). This is aggravated by the fact that R runs in a single thread by default, not leveraging the multicore capabilities that most modern computers offer. To change that, we can use the *parallel* library, which allows us to run apply-like functions in parallel.
```{r, warning=FALSE, message=FALSE}
library(glue) # This is just a utility library to concatenate strings
library(parallel)
library(lubridate) # if you ever have to work with dates, do yourself a favor and use this
# Calculate the number of cores
no_cores <- detectCores() - 1
# Initiate cluster
cl <- makeCluster(no_cores)
clusterExport(cl, c("fromJSON", "paste", "as_tibble", "ymd_hms", "select"))
```
To start using parallel, we define a cluster of size equal to the number of cores available in our machine minus one (not leaving a single core available for other computation can lead to our computer freezing) with the *makeCluster* function. Each of the workers of our newly created cluster runs an R interpreter with no other dependencies; to make the functions we are going to need available to them, we use the *clusterExport* function.
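One housekeeping note: once all the parallel work is finished (at the very end of the analysis, since the workers are reused in later sections), the cluster can be released. A hedged sketch, not evaluated here:
```{r, eval=FALSE}
# Release the worker processes once they are no longer needed
stopCluster(cl)
```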
We are now ready to start with DNCP data, which is stored under *data/records_dncp*:
```{r, cache=TRUE}
files <- list.files("data/records_dncp/", "*.json")
records <- parLapply(cl, files, function(r) {
file <- paste("data/records_dncp/", r, sep = "")
parsed <- fromJSON(file, flatten = TRUE)
records <- parsed[["records"]]
publishedDate <- ymd_hms(parsed[['publishedDate']])
records$publishedDate <- rep(publishedDate, nrow(records))
return(as_tibble(records))
})
records[[1]]
```
The code snippet above uses several of our already familiar functions such as *fromJSON*, *as_tibble* and so on. There are three newcomers that demand explanation:
* *list.files* returns a list of paths corresponding to files under the path that it receives as its first parameter, that match the regular expression it receives as its second parameter.
* *paste* simply concatenates every string it receives as a parameter, placing the *sep* string between them.
* *parLapply* is the parallel version of *lapply*, scheduling different executions of the function it receives as a parameter in different workers of our cluster. This can be done since reading different JSON files is an embarrassingly parallel task.
```{r, cache=TRUE}
dncp_records_tb <- bind_rows(records)
remove(records)
dncp_records_tb
```
To free up some space, we use the *remove* function to explicitly clear intermediate results from memory. Data from the Ministry of Finance can be read likewise:
```{r, cache=TRUE}
files <- list.files("data/releases_mh/", "*.json")
releases <- parLapply(cl, files, function(r) {
file <- paste("data/releases_mh/", r, sep = "")
parsed <- fromJSON(file, flatten = TRUE)
releases <- parsed[["releases"]]
publishedDate <- ymd_hms(parsed[['publishedDate']])
# there are some files that don't include any release (?)
if (!is.null(nrow(releases)) && nrow(releases) > 0) {
releases$publishedDate <- rep(publishedDate, nrow(releases))
}
return(as_tibble(releases))
})
mh_releases_tb <- bind_rows(releases)
remove(releases)
mh_releases_tb
```
## From a Postgres database with JSONB columns
For Mexico and Colombia we process data as a stream, i.e. one record at a time. Data frames and tibbles are stored in memory by R, while stream processing only requires one record to be stored in memory. For this reason, this data processing paradigm is particularly useful when dealing with big data.
To continue with this guide, you should now download the backup files of these databases to your local machine, and restore them to your local Postgres instance. To do so:
1. Again, head over to [https://mega.nz/#F!PCQFzAyY!m54DS0hA3pyKXqergJUBFQ](https://mega.nz/#F!PCQFzAyY!m54DS0hA3pyKXqergJUBFQ).
2. Download the files *ocds_colombia.dump* and *ocds_mexico.dump* to your local hard drive.
3. Create the databases by running *sudo -u postgres createdb ocds_colombia* and *sudo -u postgres createdb ocds_mexico* from a terminal session.
4. Within a terminal session, from the directory where you downloaded the files, run *pg_restore -d ocds_colombia ocds_colombia.dump* and *pg_restore -d ocds_mexico ocds_mexico.dump*.
5. Check the databases you just created; they should have several tables which were created by the restore process. In particular, the *data* table should contain multiple records that we will read for our analysis.
The library we are using (jsonlite) includes the *stream_in* function to support [ndjson](http://ndjson.org), a convenient format for storing multiple JSON records in a single file. Since our records are stored in a Postgres database, we have to write a similar feature ourselves. Let's write a function to do precisely that in the next code snippets:
```{r}
library(RPostgreSQL)
db_engine <- "PostgreSQL"
host <- "localhost"
user <- "postgres" # if you are concerned about security
password <- "" # you should probably change this 2 lines
port <- 5433 # set the port of your Postgres instance, default is 5432
query <- "SELECT id, data FROM data"
drv <- dbDriver(db_engine)
con_colombia <- dbConnect(drv, host = host, port = port,
dbname = "ocds_colombia", user = user, password = password)
con_mexico <- dbConnect(drv, host = host, port = port,
dbname = "ocds_mexico", user = user, password = password)
```
We connect and query Postgres through [DBI](https://cran.r-project.org/web/packages/DBI/index.html), a database interface definition for communication between R and relational database management systems. In particular, we use [RPostgreSQL](https://cran.r-project.org/web/packages/RPostgreSQL/index.html), a DBI implementation for PostgreSQL databases.
```{r}
stream_in_db <- function(con, query, page_size = 1000, acc = 0) {
current_id <- 0
return(function(handler) {
repeat{
paged_query <- paste(query, "WHERE id > ", current_id, "ORDER BY id ASC LIMIT", page_size)
data <- dbGetQuery(con, paged_query)
if (dim(data)[1] == 0) {
break
}
acc <- handler(data[['data']], acc)
current_id <- tail(data[['id']], n=1)
}
return(acc)
})
}
stream_in_colombia <- stream_in_db(con_colombia, query)
stream_in_mexico <- stream_in_db(con_mexico, query)
```
Let's recap what we just did:
1. First, we defined a function *stream_in_db* that receives a database connection and a query, and runs the query against the specified database.
2. *stream_in_db* returns a function, which applies a handler received as a parameter to every page of the result set.
3. Finally, we used *stream_in_db* to get functions pointing at the databases of Mexico and Colombia.
Pagination using *LIMIT* and *OFFSET* is known to degrade in performance as we scan pages further apart from the start. To speed things up as much as possible we use an approach known as *keyset pagination*, which leverages the fact that we have an index defined for the primary key. For a more detailed comparison between pagination approaches please read [this article](http://allyouneedisbackend.com/blog/2017/09/24/the-sql-i-love-part-1-scanning-large-table/).
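For contrast, a hedged sketch of what the two paging strategies look like when built as query strings in R (the *OFFSET* variant is shown only for comparison and is not used in this guide):
```{r, eval=FALSE}
page_size <- 1000
# OFFSET pagination: the database still scans and discards all earlier rows
offset_query <- paste("SELECT id, data FROM data ORDER BY id ASC",
                      "LIMIT", page_size, "OFFSET", 50000)
# Keyset pagination: the primary key index lets the database jump straight to the page
keyset_query <- paste("SELECT id, data FROM data WHERE id >", 50000,
                      "ORDER BY id ASC LIMIT", page_size)
```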
Now let's take our streaming functions for a ride. To test them out, we can define a dummy handler to count the number of rows of data that result from our query.
```{r message=FALSE, warning=FALSE, eval=FALSE}
sanity_checker <- function(data, acc) {
m <- parLapply(cl, data, function(e) {
t <- fromJSON(e, flatten = TRUE)
return(1)
})
return(acc + Reduce("+", m))
}
```
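A possible way to invoke the checker, given the streaming functions defined above (not evaluated here, since scanning the full tables can take a while):
```{r, eval=FALSE}
# Count the number of JSON documents streamed from each database
stream_in_colombia(sanity_checker)
stream_in_mexico(sanity_checker)
```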
# Analysis of contracting data
This section introduces a bunch of useful tools to perform basic data analysis using R, accompanied by some basic examples to showcase their features.
## Introducing the Tidyverse
The Tidyverse is a collection of R packages aimed at data science tasks; besides the libraries, it shares a common set of data structures, a style guide, and an underlying philosophy for data analysis.
The core Tidyverse [packages](https://www.tidyverse.org/packages/) are:
* ggplot2: a system for declaratively creating graphics, based on The Grammar of Graphics, a tool that enables the user to concisely describe most of the graphical mappings used in data visualization. You provide the data, tell ggplot2 how to map variables to aesthetics and which graphical primitives to use, and it takes care of the details.
* dplyr: an implementation of a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges.
* tidyr: provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.
* readr: provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf).
* purrr: enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors.
* tibble: a modern reimplementation of R's star data structure: the data frame. At this point, the user should be fairly familiar with tibbles.
* stringr: provides a cohesive set of functions designed to make working with strings as easy as possible.
* forcats: provides a suite of useful tools that solve common problems with factors, which are used by R to handle categorical variables.
For the purposes of this guide, the first three packages above are the most important ones. Hence, tidyr and dplyr will be covered in the remainder of this section and ggplot2 will be described in the next one.
## Data cleaning functions
According to the Tidyverse philosophy, data cleaning is the process of making data tidy. Data being tidy means:
* Each variable must have its own column.
* Each observation must have its own row.
* Each value must have its own cell.
Every package from the Tidyverse is designed to work with tidy data; when dealing with an untidy dataset, the first step of our analysis must therefore be using **tidyr** to clean it up. There are three main verbs provided by this package to help make data tidy: gather, spread and separate.
### Gathering
A frequent issue with untidy data is a dataset where some column names are not variable *names* but variable *values*. For example, consider the code snippet below:
```{r}
world_population <- tibble(
  country = c("Paraguay", "Uruguay", "Colombia", "Mexico"),
  "2017" = c(7000000, 3000000, 45000000, 127000000),
  "2018" = c(7200000, 3200000, 46000000, 128000000)
)
world_population
```
In the example above, the data is untidy because *2017* and *2018* are values of the implicit *year* variable. This means each row corresponds to two observations, not one. We can fix this problem by applying the *gather* function as follows:
```{r}
world_population %>% gather(`2017`, `2018`, key = "year", value = "inhabitants")
```
*Gather* receives as parameters the names of the columns we want to pivot, and the names of the two new columns we want to create. Much better! In the final result, the problematic columns are dropped and our dataset is now 100% tidy.
### Spreading
Gathering is useful when one row corresponds to more than one observation. Spreading works in the opposite scenario, when a single observation is scattered across multiple rows. Consider a simple extension of our example dataset:
```{r}
world_count <- tibble(
  country = c("Paraguay", "Uruguay", "Colombia", "Mexico", "Paraguay", "Uruguay", "Colombia", "Mexico"),
  year = 2018,
  type = c("inhabitants", "inhabitants", "inhabitants", "inhabitants", "cars", "cars", "cars", "cars"),
  count = c(7000000, 3000000, 45000000, 127000000, 1000000, 500000, 10000000, 75000000)
)
world_count
```
In this case we have variables stored as cell values, which is the case of *inhabitants* and *cars*. How do we fix it? Just watch the **spread** function do its magic below:
```{r}
world_count %>% spread(key = type, value = count)
```
*Spread* takes every value in the column given by its *key* parameter and turns it into a separate column, filling in the cells with the corresponding value from the column given by its *value* parameter.
### Separating
Our final tidyr function, **separate**, allows us to solve cases in which multiple variable values are stored in the same cell. As an example, consider the result of parsing the Uruguayan RSS feed that we downloaded in a previous section.
```{r}
index_tb
```
It is clear that the title column contains values for two variables: *id_compra* and *release_id*, which are separated by a comma. Let's use **separate** to make things tidy:
```{r}
index_tb %>%
  separate(title, into = c("id_compra", "release_id"), sep = ",") %>%
  transform(id_compra = str_replace(id_compra, "id_compra:", "")) %>%
  transform(release_id = str_replace(release_id, "release_id:", "")) %>%
  # the two transform calls above just remove the needless prefixes
  head(5)
```
*Separate* splits the value of a single column by the character given by the *sep* parameter; the names of the newly defined columns are given by the *into* parameter, which should be a collection of strings. The last two lines of the pipeline above are just a sneak peek of one of the features of the package we will cover next: dplyr.
## Data analysis functions
Once we have tidied up our data with **tidyr**, the next steps of data analysis usually involve data manipulation in one way or another. Selecting specific columns and rows based on a condition, adding composite columns, summarizing data, etc. can be cited among the operations that are frequently performed as part of the analytical process. The Tidyverse includes **dplyr** as the go-to tool for this kind of data manipulation; we will now cover some of its basic functions.
### Mutating
The **mutate** function allows the user to add a new column to a tibble based on the values of one (or more) already existing columns. Let's use the dataset of releases from Uruguay to illustrate a case where this function might come in handy:
```{r}
releases_tb
```
Let's suppose we need a column with the *name* of the month when a release was made. This value is not readily available in the dataset, but the information is contained in the *date* column and can be extracted with some help from **mutate**:
```{r}
uruguay_releases_with_month <- releases_tb %>% mutate(month = month.name[month(date)])
## month.name is a vector with the name of each month
uruguay_releases_with_month[c('ocid', 'month')]
```
*Mutate* receives a tibble as its first parameter and a variable assignment expression as its second, which is evaluated row-wise to produce the new column.
### Selecting
In our code snippet above we selected certain columns of our tibble using array-indexing notation. **dplyr** provides a function to do the same thing, **select**:
```{r}
select(uruguay_releases_with_month, ocid, month)
```
*Select* receives a variable number of parameters, the first one must be a tibble or data frame and the rest of them are the columns to be included in the selection.
### Filtering
A very common operation when dealing with data is to select a subset of rows that are of interest for our analysis. For example, let's assume we are only interested in releases that are related to awards. We can identify the stage of the contracting process to which a release is related through the *tag* column.
```{r}
releases_tb[1,]$tag
```
Knowing this, it is easy to get the collection of releases that correspond to awards by using **filter** as shown below:
```{r}
awards_tb <- releases_tb %>% filter(tag == "award")
awards_tb
```
```{r}
awards_tb[10, ]$tag
```
*Filter* receives a tibble as its first parameter and a boolean expression as the second one; the boolean expression is evaluated for each row and only rows which evaluate to *TRUE* are included in the final result.
### Summarizing
Summarization allows us to aggregate variables based on a preset grouping. To illustrate this we can revisit our example using the Iris dataset:
```{r}
iris %>%
group_by(Name) %>%
summarize_if(is.numeric, mean)
```
In the snippet above, we start by grouping the dataset by species name and then summarize every numeric column using the mean as the aggregator function.
# Data visualization
Data visualization can be defined as the mapping of data to visual cues such as size, shape and color in order to represent the variation of the values of the variables in a dataset. It is a good way to communicate complex information, since it is much easier to spot patterns and make comparisons than with raw data.
This section introduces the reader to some basic concepts of data visualization, providing a few guidelines on how to choose an appropriate visual representation of a dataset. Although it is definitely helpful to acquire such theoretical background, the good news for the reader is that R gently guides you towards the right choice (if you let it). The last statement will become clearer once we cover **ggplot2**, R's main data visualization library and a very important component of the Tidyverse.
Finally, we will test what we just learned by plotting OCDS data using **ggplot2**.
## A data visualization primer
Whenever we visualize data, we are encoding variables using visual attributes such as size, shape or color. Consider, for example, a quantitative variable: differences among the values associated with each observation can be represented in several ways, as displayed in figure 2:
```{r, echo=FALSE, fig.cap="Potential attribute mapping for a quantitative variable."}
knitr::include_graphics("visual_cues.jpg")
```
As the reader might have noticed, many mappings are possible, but not all of them seem equally appropriate. In fact, statisticians William Cleveland and Robert McGill explored this intuition further by running some experiments with human volunteers to determine which visual attributes encoded quantitative information more accurately. Their findings are summarized in figure 3:
```{r, echo=FALSE, fig.cap="Preattentive attributes sorted by their suitability to encode quantitative variables."}
knitr::include_graphics("ordered_visual_cues.jpg")
```
While this preference order is a good guideline for quantitative variables, other types of data map differently to visual attributes. For example, while color hue is a poor choice to encode a quantitative variable, it works well to encode a categorical one.
Different combinations of encoding choices render different types of plots. In this manual we will use four types of plots, which will be introduced as we use them to describe OCDS data in a later subsection: bar charts, line charts, box plots and histograms. There are several other types of charts, and additional considerations that can be taken into account when visualizing data; for a more in-depth tour of these topics we recommend the great visualization course by Peter Aldhous, which is available [online](http://paldhous.github.io/ucb/2016/dataviz/week2.html) (figures in this section were taken from the course site).
## An introduction to ggplot2
**ggplot2** is the Tidyverse package for data visualization. It is based on the *grammar of graphics*, a formal grammar to declaratively describe most of the common charts used when visualizing data.
Graphs are described in ggplot by using a concise set of elements which can be combined according to a basic structure defined by the grammar of graphics. A simplified version of the grammar can be read below:
```{r, eval=FALSE}
ggplot(data = [DATA]) +
[GEOM_FUNCTION](mapping = aes([MAPPINGS]))
```
You can consider the snippet above as a template for graphics. To make a plot, you simply fill in the elements within brackets with actual values:
* [DATA] is pretty self-explanatory: it is the placeholder for the dataset that we want to plot.
* [GEOM_FUNCTION] should be replaced by any **geom** function provided by ggplot2. geoms are geometrical objects used to represent data such as points, lines, etc.
* Finally [MAPPINGS] should be a collection of ggplot2 **aesthetics**, which are basically visual encoding specifications for the variables we want to visualize.
Let's see this template in action by plotting our well-known Iris dataset as a scatter plot:
```{r}
ggplot(data = iris) + geom_point(mapping = aes(x = PetalLength, y = PetalWidth, color = Name))
```
As the reader might have noticed, **ggplot2** favors convention over configuration and any plot that we draw with it includes many good practices out of the box. For example, in the scatter plot above we got axis labelling and a legend for *free* just by specifying our set of aesthetic mappings.
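Those sensible defaults can be overridden by adding further components to the plot. For instance, a quick sketch that relabels the axes and adds a title to the same scatter plot using the *labs* function:
```{r}
ggplot(data = iris) +
  geom_point(mapping = aes(x = PetalLength, y = PetalWidth, color = Name)) +
  labs(x = "Petal length", y = "Petal width", title = "Iris petal dimensions by species")
```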
There are several other components of the **ggplot2** grammar, but for this brief introduction our simplified template will do. To check which other elements are available, along with an exhaustive list of *geoms* and *aesthetics*, please visit the [official website](https://ggplot2.tidyverse.org) of the library.
## Visualizing Open Contracting data
Now that we know the basics of **ggplot2** we can go ahead and start making some plots ...almost. Since we are mostly interested in comparing amounts spent by year and buyer, we can start off by extracting all rows of our dataset that contain at least one contract. We achieve exactly that by building a boolean index with one element per row of our dataset, indicating whether its corresponding row passes our condition or not.
```{r}
contract_indices <- parApply(cl, dncp_records_tb, 1, function(r) {
contracts <- r['compiledRelease.contracts'][[1]]
return(!is.null(contracts) && !is.null(dim(contracts)))
})
```
Once we have our boolean index, we can use it to filter out the records that we are not interested in and keep the ones that are useful for our analysis; we achieve this by doing `dncp_records_tb[contract_indices, ]` in the snippet below. After that, we extract the contracts and the buyer name from the records we kept.
```{r}
contracts <- parApply(cl, dncp_records_tb[contract_indices, ], 1, function(r) {
result <- r['compiledRelease.contracts'][[1]]
result['buyer.name'] <- r['compiledRelease.buyer.name']
result['publishedDate'] <- r['publishedDate']
return(result)
})
```
The `contracts` object defined above is a nested collection of contracts. To make it a single tibble we call the `bind_rows` function. We explicitly free the memory we are no longer going to use by removing the `contracts` object.
```{r}
contracts_dncp <- bind_rows(contracts)
rm(contracts)
```
We now have a contracts dataset from DNCP ready for plotting and analysis. We can follow similar steps to get one from the Finance Ministry.
```{r}
contracts <- parApply(cl, mh_releases_tb, 1, function(r) {
result <- r['contracts'][[1]]
result['buyer.name'] <- r['buyer.name']
result['publishedDate'] <- r['publishedDate']
return(result)
})
```
```{r}
contracts_hacienda <- bind_rows(contracts)
rm(contracts)
```
We know there is significant overlap between the publications of DNCP and the Ministry of Finance, so naively merging both tibbles would result in duplicated contracts, making our analysis invalid. First, let's make sure the overlap actually exists, by selecting the contracts that are present in the DNCP dataset **and** the Ministry of Finance dataset. The `subset` function selects rows from a tibble based on a condition; in this case, a non-empty subset would confirm the existence of duplicates.
```{r}
overlap <- subset(
contracts_dncp,
(dncpContractCode %in% contracts_hacienda[['dncpContractCode']]))['dncpContractCode']
dim(overlap)
```
Having confirmed the presence of duplicate contracts, we need to come up with a smarter merging strategy: let's merge both tibbles together and group them by *dncpContractCode* (this should group duplicates together), sort the rows within each group by *publishedDate* in ascending order, and finally pick the last row from each group. This preprocessing step, coded in the snippet below, ensures that we only include the most current version of a contract in our analysis.
```{r}
contracts_paraguay <- bind_rows(contracts_dncp, contracts_hacienda) %>%
group_by(dncpContractCode) %>%
arrange(publishedDate) %>%
slice(n()) %>%
ungroup
contracts_paraguay
```
Next, there is the issue of contract currency. Let's have a look at which currencies are used and how many contracts were signed for each one.
```{r}
contracts_paraguay %>%
group_by(value.currency) %>%
summarise(count = n())
```
Results presented in the table above show that the vast majority of contracts were signed with amounts in Paraguayan Guaranies (PYG), a not-so-surprising fact given that we are looking at Paraguayan contracts. We could include contracts with amounts in US dollars by using a given exchange rate, but to keep things simple we will only consider contracts in PYG.
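For reference, a hedged sketch of how the USD contracts could be folded in, assuming a hypothetical fixed exchange rate (a real analysis would need rates from an official source, so this chunk is not evaluated):
```{r, eval=FALSE}
usd_to_pyg <- 5800 # hypothetical exchange rate, for illustration only
contracts_paraguay %>%
  filter(value.currency %in% c("PYG", "USD")) %>%
  mutate(amount_pyg = ifelse(value.currency == "USD",
                             value.amount * usd_to_pyg,
                             value.amount))
```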
To continue with our analysis, let's get a tibble with our contract dataset grouped by year, along with some interesting summaries. First, we extract the year from the *dateSigned* field using the `mutate` function, then we filter out some invalid rows (NA values) and outliers while keeping only the contracts in PYG using the `filter` function. Finally, we group the remaining contracts by year with the `group_by` function and obtain the total count and amount per group using the `summarise` function.
```{r}
by_year <- contracts_paraguay %>%
mutate(signed_year = year(dateSigned)) %>%
filter(!is.na(signed_year), !is.na(value.amount),
signed_year < 2019, signed_year > 2009, value.currency == 'PYG') %>%
group_by(signed_year) %>%
summarise(count = n(), amount = sum(value.amount))
```
With our dataset defined above, we can draw line plots with a few lines of **ggplot2** magic. Let's plot the number of contracts and the total amount for all contracts per year:
```{r}
ggplot(data=by_year, aes(x=signed_year, y=count)) +
geom_line(color="blue") +
geom_point() +
labs(x = 'Year', y = 'Number of contracts')
```
```{r}
ggplot(data=by_year, aes(x=signed_year, y=amount)) +
geom_line(color='springgreen4') +
geom_point() +
labs(x = 'Year', y = 'Total Amount (PYG)')
```
We have drawn some line plots; what about some bars? Let's plot the number of contracts per buyer for a given year (2017). You should be noticing a pattern by now: we start by getting our data into the right shape using **dplyr** verbs, and only then do we do the plotting. Bar charts require an additional intermediate step, at least to give them the typical top-n look: buyer names are character arrays and hence sorted lexicographically by default, so we need to redefine the column using the factor type (used for categorical values) and set a custom ordering using the `order` function.
```{r}
count_by_buyer <- contracts_paraguay %>%
mutate(signed_year = year(dateSigned)) %>%
filter(value.currency == 'PYG', signed_year == 2017) %>%
group_by(buyer.name) %>%
summarise(ccount = n()) %>%
arrange(desc(ccount)) %>%
head(5)
count_by_buyer$buyer.name <- factor(
count_by_buyer$buyer.name,