---
title: "Analyzing Open Contracting data"
author: "Rodrigo Parra"
subtitle: A manual using the R programming language
output:
  bookdown::html_document2:
    fig_caption: yes
    theme: cosmo
    toc: yes
  bookdown::pdf_document2:
    fig_caption: yes
    toc: yes
  bookdown::word_document2:
    fig_caption: yes
    toc: yes
urlcolor: blue
---
<style>
body {
  text-align: justify
}
.caption {
  text-align: center
}
caption {
  text-align: center
}
</style>
\newpage
# Introduction
The goal of this manual is to provide people who might be interested in pursuing this challenging but worthwhile endeavor with the basic foundations needed to analyze and visualize Open Contracting data using the R programming language. A basic understanding of computer programming will come in handy when reading this document; however, enough guidance is provided to ensure that less tech-savvy readers can also follow along.
As part of this guide, we acquire, clean, analyze and plot Open Contracting data from four members of the partnership: Paraguay, Mexico, Uruguay and Colombia. This document was written using [R Notebooks](https://rmarkdown.rstudio.com/r_notebooks), a nifty feature which allows convenient integration of Markdown and R code, and exporting to HTML and PDF. The source code of this project is available [here](https://github.com/rparrapy/ocds-r-manual).
The rest of this document is organized as follows:
* The remainder of this section introduces the Open Contracting Data Standard, briefly describing its main building blocks, and the R programming language, covering its core features and how to install it.
* Section 2 deals with data acquisition and cleaning, focusing on reading and processing JSON files with R.
* Section 3 presents the [Tidyverse](https://www.tidyverse.org), a collection of R packages designed to perform common data science tasks.
* Finally, section 4 introduces [ggplot2](https://ggplot2.tidyverse.org) and the grammar of graphics it is based on, to draw several plots that describe our previously acquired datasets.
## An introduction to the Open Contracting Data Standard
By requiring data-sharing in a structured, re-usable and machine readable form, Open data opens up new opportunities for analysis and citizen engagement and participation. The [Open Contracting Data Standard](https://www.open-contracting.org/data-standard/) was created to apply these principles to the complete contracting lifecycle including planning, tender, award, contract and implementation.
The data standard, designed and developed through an open process, allows governments and cities around the world to share their contracting data, enabling greater transparency in public contracting, and supporting accessible and in-depth analysis of the efficiency, effectiveness, fairness, and integrity of public contracting systems. Additionally, the help desk team, staffed by Open Data Services Co-operative, is available to assist prospective users in their journey towards adoption of the standard.
The intention of this section is to introduce the reader to the standard, the use cases it was designed for and the basic concepts needed to apply it. Most of the content was taken from the official documentation of the standard; for a more thorough introductory walkthrough, please refer to the [getting started guide](http://standard.open-contracting.org/latest/en/getting_started/).
### Users and use cases
The standard was designed around four main groups of user needs:
* Achieving value for money for government
* Strengthening the transparency, accountability and integrity of public contracting
* Enabling the private sector to fairly compete for public contracts
* Monitoring the effectiveness of service delivery
To find out about who is using OCDS-compliant data around the globe and how they are doing it, have a look at the [Open Contracting Partnership website](http://www.open-contracting.org/). Four potential use cases for open contracting data are:
* Value for money in procurement: helping officials get good value for money during the procurement process, and analyzing whether this goal was achieved afterwards.
* Detecting fraud and corruption: identifying red flags that might indicate corruption by studying individual procurements or networks based on funding, ownership and interests.
* Competing for public contracts: allowing private firms to understand the potential pipeline of procurement opportunities by looking at information related to past and current procurements.
* Monitoring Service Delivery: helping interested actors to leverage traceability in the procurement process for monitoring purposes, linking budgets and donor data to the contracts and results.
### The contracting process
The standard defines a contracting process as:
> All the planning, tendering information, awards, contracts and contract implementation information related to a single initiation process.
The standard covers all the stages of a contracting process, even though some processes might not involve all possible steps. The stages of the procurement process, with example objects that might be associated to each one, are depicted in figure 1.
```{r include=FALSE}
knitr::opts_chunk$set(cache = TRUE, warning = FALSE,
message = FALSE, cache.lazy = FALSE)
```
```{r, echo=FALSE, fig.cap="Stages of the procurement process."}
knitr::include_graphics("procurement-stages.png")
```
For identification purposes, all contracting processes are assigned a unique Open Contracting ID (ocid), which can be used to join data from different stages. In order to avoid ocid clashes between publishers, a publisher can prepend a prefix to locally generated identifiers. Publishers are encouraged to register their prefix [here](http://standard.open-contracting.org/latest/en/implementation/registration/).
### Documents
Contracting processes are represented as **documents** in the Open Contracting Data Standard (OCDS from now on, for brevity). Each document is made up of several **sections**, mentioned below:
* release metadata: contextual information about each release of data;
* parties: information about the organizations and other participants involved in the contracting process;
* planning: information about the goals, budgets and projects a contracting process relates to;
* tender: information about how a tender will take place, or has taken place;
* awards: information on awards made as part of a contracting process;
* contract: information on contracts signed as part of a contracting process;
* implementation: information on the progress of each contract towards completion.
An example JSON snippet compliant with this structure looks as follows:
```json
{
"language": "en",
"ocid": "contracting-process-identifier",
"id": "release-id",
"date": "ISO-date",
"tag": ["tag-from-codelist"],
"initiationType": "tender",
"parties": {},
"buyer": {},
"planning": {},
"tender": {},
"awards": [ {} ],
"contracts":[ {
"implementation":{}
}]
}
```
There are two types of documents defined in the standard:
* **Releases** are immutable and represent updates on the contracting process. For example, they can be used to notify users of new tenders, awards, contracts and other updates. As such, a single contracting process can have many releases.
* **Records** are snapshots of the current state of a contracting process. A record should be updated every time a new release associated to its contracting process is published; hence, there should only be a single record per contracting process.
### Fields
Each section may contain several **fields** specified in the standard, which are used to represent data. These objects can appear several times in different sections of the same document; for example, items can occur in tender (to indicate the items that a buyer wishes to buy), in an award object (to indicate the items that an award has been made for) and in a contract object (to indicate the items listed in the contract). Some example fields, accompanied by corresponding JSON snippets, are presented below.
#### Parties (Organizations)
```json
{
"address": {
"countryName": "United Kingdom",
"locality": "London",
"postalCode": "N11 1NP",
"region": "London",
"streetAddress": "4, North London Business Park, Oakleigh Rd S"
},
"contactPoint": {
"email": "[email protected]",
"faxNumber": "01234 345 345",
"name": "Procurement Team",
"telephone": "01234 345 346",
"url": "http://example.com/contact/"
},
"id": "GB-LAC-E09000003",
"identifier": {
"id": "E09000003",
"legalName": "London Borough of Barnet",
"scheme": "GB-LAC",
"uri": "http://www.barnet.gov.uk/"
},
"name": "London Borough of Barnet",
"roles": [ ... ]
}
```
#### Amounts
```json
{
"amount": 11000000,
"currency": "GBP"
}
```
#### Items
```json
{
"additionalClassifications": [
{
"description": "Cycle path construction work",
"id": "45233162-2",
"scheme": "CPV",
"uri": "http://cpv.data.ac.uk/code-45233162.html"
}
],
"classification": {
"description": "Construction work for highways",
"id": "45233130",
"scheme": "CPV",
"uri": "http://cpv.data.ac.uk/code-45233130"
},
"description": "string",
"id": "0001",
"quantity": 8,
"unit": {
"name": "Miles",
"value": {
"amount": 137000,
"currency": "GBP"
}
}
}
```
#### Time Periods
```json
{
"endDate": "2011-08-01T23:59:00Z",
"startDate": "2010-07-01T00:00:00Z"
}
```
#### Documents
```json
{
"datePublished": "2010-05-10T10:30:00Z",
"description": "Award of contract to build new cycle lanes to AnyCorp Ltd.",
"documentType": "notice",
"format": "text/html",
"id": "0007",
"language": "en",
"title": "Award notice",
"url": "http://example.com/tender-notices/ocds-213czf-000-00001-04.html"
}
```
#### Milestones
```json
{
"description": "A consultation period is open for citizen input.",
"dueDate": "2015-04-15T17:00:00Z",
"id": "0001",
"title": "Consultation Period"
}
```
### Extensions and codelists
In addition to regular fields, the OCDS schema defines some fields that can only be used in certain sections, e.g. *titles* and *descriptions* of tenders, awards and contracts. In some cases, publishers may require fields that are not provided by the core schema; an **extension** allows defining new fields that can be used in these cases. A list of available extensions can be found [here](http://standard.open-contracting.org/latest/en/extensions); if no existing extension addresses a publisher's needs, the publisher is encouraged to collaborate on the creation of a new community extension.
Another concept worth mentioning is that of codelists. Codelists are sets of case sensitive strings with associated labels, available in each language OCDS has been translated into. Publishers should use codelist values whenever possible to map their existing classification systems; if needed, detail fields can be used to provide more detailed classification information. There are two types of codelists:
* **Closed codelists** are fixed sets of values. If a field is associated with a closed codelist, it should only accept an option from the published list.
* **Open codelists** are sets of recommended values. If a field is associated with an open codelist, it accepts options from the list but also other values.
The Open Contracting Data Standard is maintained using [JSON Schema](http://json-schema.org). In this section we have introduced and described the main sections and common objects used in the schema, providing JSON snippets as examples of these basic building blocks. If you are interested in the full JSON schema reference, please refer to the [official documentation](http://standard.open-contracting.org/latest/en/schema/).
## An introduction to the R programming language
R is a multi-paradigm interpreted programming language and software environment focused on statistical computing, commonly used for data analysis. It is published under the [GPL v2 license](https://www.gnu.org/licenses/old-licenses/gpl-2.0.html) and maintained by the [R Foundation](https://www.r-project.org/foundation/), with binaries available for GNU/Linux, macOS and Windows. While the basic installer comes bundled with a command line interface, several graphical integrated development environments are available, of which RStudio and RStudio Server are the most noteworthy[^1].
In this section we will introduce some of R's basic features and syntax; after going through it, the reader should be better prepared to understand the code used for analysis in the rest of the guide. Having said that, several features are omitted, as a complete reference of R is outside the scope of this document. Should the reader feel inclined to learn more about R, out of curiosity or need, a list of manuals maintained by the R Development Core Team is available [here](https://cran.r-project.org/manuals.html).
For completeness and reproducibility purposes, we include an excerpt of system information below.
```{r}
R.version
```
[^1]: The author recommends RStudio as the go-to way to run R code and believes it is one major reason behind R's spike in popularity. Consequently, this guide was written using RStudio v1.1.453 on macOS High Sierra.
### Operators
R's arithmetic and logical operators should be familiar to most programmers, as they are common to other programming languages. It is worth mentioning that arithmetic operators work on both scalars and collections, operating element-wise on the latter.
| Operator | Description |
|:--------:|:-----------|
| **+** | Addition |
| **-** | Subtraction |
| **\*** | Multiplication |
| **/** | Division |
| **\*\*** or **^** | Exponentiation |
| **%%** | Modulus |
Table: Some R arithmetic operators
| Operator | Description |
|:--------:|:-----------|
| **>** | Greater than |
| **>=** | Greater than or equal to|
| **<** | Less than |
| **<=** | Less than or equal to |
| **==** | Equal to |
| **!=** | Not equal to |
Table: Some R logical operators
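As noted above, both kinds of operators also work on whole vectors at once, element-wise. A quick illustration (vectors are covered formally in the next subsection):
```{r}
# Arithmetic operators applied to a vector act element-wise
c(1, 2, 3) * 2
# Logical operators compare element-wise as well
c(1, 5, 10) > 3
```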
### Variables
R supports several data types including scalars, vectors (of numbers, strings, booleans, etc.), matrices, data frames and tibbles, among others. The **<-** operator is used to assign a value to a variable. A few examples of variables of these types are shown below:
```{r}
a_scalar <- 4
a_number_vector <- c(1, 2, 3, 4, 5) # all elements of a vector must be of the same type
a_string_vector <- c("a1", "b2", "c3", "d4")
a_boolean_vector <- c(TRUE, FALSE)
# lists can have elements of different types, associated to a key
a_list <- list(name = "John", last_name = "Deer", age = 42, married = FALSE)
# there are several ways to access an element from a collection
a_number_vector[1]
a_string_vector[[1]]
a_list$name
# a matrix is a special kind of vector, with the number of rows and columns as attributes
m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
m
```
### Data Frames and Tibbles
Data types for storing data tables are core to R's purpose and functionality, hence they deserve their own subsection. The most common data table type in R is the data frame, which can be thought of as a list of vectors of equal length.
```{r}
name <- c("Mike", "Lucy", "John")
age <- c(20, 25, 30)
student <- c(TRUE, FALSE, TRUE)
df <- data.frame(name, age, student)
df
```
The access operators presented for other data types can also be used to get data cells from a data frame.
```{r}
df[1, ] # R has 1-based indexing, i.e. collections start at 1.
df[1, ]$name # Values from a row can be accessed by column name
df$name # Columns can also be accessed directly from the data frame.
```
Another type of data table available is the tibble. Tibbles belong to the Tidyverse, a set of libraries to perform data analysis in R following best practices, which will be discussed at length in another section. For now, just think of tibbles as data frames with some quirks fixed. As an example, printing a data frame displays all of its rows, which may be problematic for large datasets; printing a tibble displays only the first 10 rows by default.
Data frames can be converted to tibbles and vice versa; the latter can be useful when dealing with older libraries that are not Tidyverse-ready. The author advises the use of tibbles and favors (as do many users in the community) leveraging Tidyverse features and adhering to its guidelines as closely as possible.
```{r message=FALSE, warning=FALSE}
# Packrat should have installed dependencies for you,
# you can do it manually by running the command below
# install.packages("tidyverse")
library(tidyverse) # Side note: this is how you import a library in R
```
```{r}
tb <- as_tibble(df)
class(as.data.frame(tb))
```
### Functions
The syntax for function definition in R is pretty straightforward and resembles that of other programming languages. A function receives zero, one or multiple arguments and runs the code included in its body:
> function ( arglist ) {body}
Because R targets a specific niche (statistical computing), it offers a very rich set of built-in functions and libraries readily available for installation. Thus, function calls tend to be much more common in R code than function definitions.
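Still, as a minimal sketch of the definition template above (the function name and logic are invented purely for illustration), here is a small user-defined function and a call to it:
```{r}
# A small example function: computes the area of a circle given its radius
circle_area <- function(radius) {
  return(pi * radius^2) # pi is a built-in constant
}
circle_area(2)
```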
Most of the R ecosystem is focused on dealing with (cleaning, plotting, modeling) tabular data. As a simple display of basic statistical features, we can use the *summary* function to get descriptive statistics from a table.
```{r}
summary(tb)
```
Reading data that is already tabular is also very simple, as shown in the example below which reads a CSV file as a data frame. The file corresponds to the iris dataset, a very popular toy dataset for data science available [online](https://archive.ics.uci.edu/ml/datasets/iris).
```{r}
iris <- as_tibble(read.table("iris.csv", header = TRUE, sep = ","))
iris
```
One last useful operator that is worth mentioning is the pipe *%>%*. Pipes allow function chaining in R, which favors readability and (one might say) elegance when a sequence of function calls is needed.
```{r}
# For now, there is no need to understand what each function of this snippet actually does
iris %>%
group_by(Name) %>%
summarize_if(is.numeric, mean) %>%
ungroup()
```
The pipe operator passes the value of its left-hand-side operand as the first argument of its right-hand-side operand. Using it avoids declaring additional variables to store intermediate results (which clutters the namespace) or nesting function calls (all those parentheses can be hard to read).
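As a minimal illustration of this equivalence, the nested call and the piped version below compute the same value:
```{r}
# Nested function calls, read inside-out
sqrt(sum(c(1, 4, 9)))
# The same computation with pipes, read left to right
c(1, 4, 9) %>% sum() %>% sqrt()
```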
### Style
As a final remark, although there is no official code style guideline for the R programming language, the author (not surprisingly if you have been paying attention) recommends following the [Tidyverse style guide](http://style.tidyverse.org/index.html)[^2]. The [styler](http://styler.r-lib.org) package is a handy linter and code formatter that can help keep source code compliant; it includes an RStudio add-in, the easiest way to re-style existing code.
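For instance, files or snippets of code can be re-styled programmatically (a hedged sketch; the file path is hypothetical):
```{r, eval=FALSE}
library(styler)
style_file("manual.Rmd") # re-style a whole file in place
style_text("a<-c( 1,2 ,3)") # or re-style a snippet and print the result
```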
[^2]: This notebook was written with the Tidyverse code style guide in mind. Please blame any non-conformant snippet on the lack of good coffee at that point of the writing process :)
## Installing R and other dependencies
Assuming a machine using Ubuntu 18.04 Bionic Beaver as its operating system, a user with *sudo* access can install the R programming language and RStudio by following these steps:
1. Update system packages to make sure already installed dependencies are up to date.
```{bash, eval=FALSE}
sudo apt update
```
2. Install the R programming language.
```{bash, eval=FALSE}
sudo apt -y install r-base
```
3. RStudio for Ubuntu comes bundled as a *.deb* package. The easiest way to install a *.deb* package in Ubuntu is by using the *gdebi* command, which is not installed in the system by default. It can be installed by running the following command.
```{bash, eval=FALSE}
sudo apt install gdebi-core
```
4. Go to the official [RStudio download page](https://www.rstudio.com/products/rstudio/download/#download) and download the file corresponding to your operating system. In our current scenario, that would be *RStudio 1.1.456 - Ubuntu 16.04+/Debian 9+ (64-bit)*.
5. From the directory in which you downloaded the package in the previous step, install RStudio using *gdebi*.
```{bash, eval=FALSE}
sudo gdebi rstudio-xenial-1.1.456-amd64.deb
```
6. Profit! You should now be able to access RStudio by running the *rstudio* command or by clicking its icon in the apps menu.
Besides RStudio Desktop, which can be installed by following the previous set of steps, RStudio is also available in a remote access flavor. RStudio Server is an application that can be installed on a web server, offering the same features as the desktop version through a web browser. To find out more about this alternative and how to install it, please refer to the [official documentation](https://www.rstudio.com/products/rstudio/download-server/).
The project this document is generated from uses [Packrat](https://rstudio.github.io/packrat/) to keep track of all the libraries that it depends on. This ensures the portability of the software (on import, the appropriate version for the corresponding OS will be installed) and the reproducibility of the analysis. Therefore, assuming you have Git installed, playing around with this notebook should be as easy as:
1. Cloning the project
```{bash, eval=FALSE}
git clone https://github.com/rparrapy/ocds-r-manual.git
```
2. Opening it with RStudio by using the menu bar option **File -> Open Project...**
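If Packrat does not restore the project's dependencies automatically when the project is first opened, they can be restored manually (a minimal sketch, assuming the cloned project is the active working directory):
```{r, eval=FALSE}
# Install the package versions recorded in the Packrat lockfile
packrat::restore()
```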
We will also be using a Postgres database to run our analysis, hence we need to install Postgres in case it is missing in our system. We can do so by running:
```{bash, eval=FALSE}
sudo apt install postgresql postgresql-contrib
```
The last command installs the Postgres database engine and creates a *postgres* user as a default account to access our databases.
# Data acquisition
This section is focused on data loading from diverse sources, which is the first step for any analysis process. We will read the data from the four countries this document is concerned with, using them as examples for different data acquisition methods:
* Data from Uruguay will be read using their public web API.
* Data from Paraguay will be loaded from local JSON files.
* Data from Mexico and Colombia will be loaded from JSONB columns of a Postgres database. Since these are the largest datasets we have access to, we will use them as an example of how to handle big data with R using JSON streams.
## From a REST API
The open contracting dataset from Uruguay is available via an RSS endpoint, which includes URLs to the latest releases. It is worth noting that since only the latest releases are accessible, not much in-depth analysis can be done using this dataset. Having said that, it is still a perfectly suitable example to showcase R's HTTP fetching capabilities; we start by fetching the RSS feed and creating a data frame where each row corresponds to a release.
```{r}
library(XML)
xml_url <- "http://www.comprasestatales.gub.uy/ocds/rss"
xml_file <- xmlParse(xml_url)
xml_top <- xmlRoot(xml_file)[[1]]
index_tb <- as_tibble(xmlToDataFrame(xmlElementsByTagName(xml_top, "item")))
index_tb
```
In the snippet above, we start by importing R's *XML* library, since Uruguay's API responses are in this data format. The *xmlParse* function fetches the content at the RSS feed URL and parses it, returning a tree-like nested structure. We then get the root of the parsed structure and select every element with the *item* tag by calling the *xmlElementsByTagName* function. We convert the result to a data frame and then to a tibble using *xmlToDataFrame* and *as_tibble* respectively.
We then proceed to fetch the JSON content corresponding to each release, using the values of the **link** column defined above. There are many libraries to handle JSON with R, each with its list of pros and cons; because of its focus on performance and web API consumption, we use [jsonlite](https://cran.r-project.org/web/packages/jsonlite/index.html) in this guide.
```{r message=FALSE, warning=FALSE}
library(jsonlite)
```
```{r, cache=TRUE}
# we run fromJSON (which fetches the url) for every row of the data frame
releases <- apply(index_tb["link"], 1, function(r) {
return(as_tibble(fromJSON(r, flatten = TRUE)[["releases"]]))
})
releases[[1]]
```
The main function in the snippet above is *fromJSON*, which is provided by *jsonlite*. We use it to fetch the JSON files from the URLs contained in the *link* column of *index_tb*. The flatten flag indicates that we want the function to squeeze the nested JSON structure as much as possible, to make it suitable for a tabular format. We repeat the process for each row of the tibble using the *apply* function, which applies the function it receives as a parameter to each row and returns the values as a list.
At this point we have a list of tibbles, each tibble representing a release. As our next step, we can merge all those tibbles, filling missing values (for example, depending on the stage they correspond to, some releases include tender information while some do not) with NA (R's constant for 'Not Available'). Merging several tibbles row-wise is achieved by calling the *bind_rows* function.
```{r, cache=TRUE}
releases_tb <- bind_rows(releases)
```
Because data from Uruguay includes only the latest 500 releases, actual compilation of records is not possible (as not enough data is available). Therefore, this is as far as we will go with their open contracting data.
## From JSON files
We can also read records or releases from a set of JSON files, published by one or more official publishers associated to a country. That is the case of Paraguay, the next country we will be adding to our analysis. Paraguayan open contracting data are published by the National Directorate of Public Tender (*Dirección Nacional de Contrataciones Públicas*, DNCP) and the Ministry of Finance (*Ministerio de Hacienda*); as they share ocids, we should be able to use their published data together.
To continue with this guide, you should now download a copy of these datasets to your local machine, and store them under the *./data* directory (which should be empty if you just cloned this project). That is:
1. Head over to [https://mega.nz/#F!PCQFzAyY!m54DS0hA3pyKXqergJUBFQ](https://mega.nz/#F!PCQFzAyY!m54DS0hA3pyKXqergJUBFQ).
2. Download the *data.zip* file and extract its content inside the *./data* directory.
3. If successful, you should now have two new folders containing .json files on your system: *./data/records_dncp* and *./data/releases_mh*.
Before actually loading the data, let's take a small detour to make our data munging more efficient. The size of the datasets we use in this guide ranges from non-trivial (with processing taking a few minutes) to challenging (with processing taking several hours). This is aggravated by the fact that R runs in a single thread by default, not leveraging the multicore capabilities that most modern computers offer. To change that, we can use the *parallel* library, which allows us to run apply-like functions in parallel.
```{r, warning=FALSE, message=FALSE}
library(glue) # This is just a utility library to concatenate strings
library(parallel)
library(lubridate) # if you ever have to work with dates, do yourself a favor and use this
# Calculate the number of cores
no_cores <- detectCores() - 1
# Initiate cluster
cl <- makeCluster(no_cores)
clusterExport(cl, c("fromJSON", "paste", "as_tibble", "ymd_hms", "select"))
```
To start using parallel, we define a cluster of size equal to the number of cores available in our machine minus one (not leaving a single core available for other computation can lead to our computer freezing) with the *makeCluster* function. Each of the workers of our newly created cluster runs an R interpreter with no other dependencies; to make the functions we are going to need available to them, we use the *clusterExport* function.
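One housekeeping note: once all the parallel work is finished (at the very end of the analysis, since the workers are reused in later sections), the cluster can be released. A hedged sketch, not evaluated here:
```{r, eval=FALSE}
# Release the worker processes once they are no longer needed
stopCluster(cl)
```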
We are now ready to start with DNCP data, which is stored under *data/records_dncp*:
```{r, cache=TRUE}
files <- list.files("data/records_dncp/", "*.json")
records <- parLapply(cl, files, function(r) {
file <- paste("data/records_dncp/", r, sep = "")
parsed <- fromJSON(file, flatten = TRUE)
records <- parsed[["records"]]
publishedDate <- ymd_hms(parsed[['publishedDate']])
records$publishedDate <- rep(publishedDate, nrow(records))
return(as_tibble(records))
})
records[[1]]
```
The code snippet above uses several of our already familiar functions such as *fromJSON*, *as_tibble* and so on. There are three newcomers that demand explanation:
* *list.files* returns a list of paths corresponding to files under the path that it receives as its first parameter, that match the regular expression it receives as its second parameter.
* *paste* simply concatenates every string it receives as a parameter, placing the *sep* string between them.
* *parLapply* is the parallel version of *lapply*, scheduling different executions of the function it receives as a parameter in different workers of our cluster. This can be done since reading different JSON files is an embarrassingly parallel task.
```{r, cache=TRUE}
dncp_records_tb <- bind_rows(records)
remove(records)
dncp_records_tb
```
To free up some space, we use the *remove* function to explicitly clear intermediate results from memory. Data from the Ministry of Finance can be read likewise:
```{r, cache=TRUE}
files <- list.files("data/releases_mh/", "*.json")
releases <- parLapply(cl, files, function(r) {
file <- paste("data/releases_mh/", r, sep = "")
parsed <- fromJSON(file, flatten = TRUE)
releases <- parsed[["releases"]]
publishedDate <- ymd_hms(parsed[['publishedDate']])
# there are some files that don't include any release (?)
if (!is.null(nrow(releases)) && nrow(releases) > 0) {
releases$publishedDate <- rep(publishedDate, nrow(releases))
}
return(as_tibble(releases))
})
mh_releases_tb <- bind_rows(releases)
remove(releases)
mh_releases_tb
```
## From a Postgres database with JSONB columns
For Mexico and Colombia we process data as a stream, i.e. one record at a time. Data frames and tibbles are stored in memory by R, while stream processing only requires one record to be stored in memory. For this reason, this data processing paradigm is particularly useful when dealing with big data.
To continue with this guide, you should now download the backup files of these databases to your local machine, and restore them to your local Postgres instance. To do so:
1. Again, head over to [https://mega.nz/#F!PCQFzAyY!m54DS0hA3pyKXqergJUBFQ](https://mega.nz/#F!PCQFzAyY!m54DS0hA3pyKXqergJUBFQ).
2. Download the files *ocds_colombia.dump* and *ocds_mexico.dump* to your local hard drive.
3. Create the databases by running *sudo -u postgres createdb ocds_colombia* and *sudo -u postgres createdb ocds_mexico* from a terminal session.
4. Within a terminal session, from the directory where you downloaded the files, run *pg_restore -d ocds_colombia ocds_colombia.dump* and *pg_restore -d ocds_mexico ocds_mexico.dump*.
5. Check the databases you just created; they should have several tables which were created by the restore process. In particular, the *data* table should contain multiple records that we will read for our analysis.
The library we are using (jsonlite) includes the *stream_in* function to support [ndjson](http://ndjson.org), a convenient format for storing multiple JSON records in a single file. Since our records are stored in a Postgres database, we have to write a similar feature ourselves. Let's write a function to do precisely that in the next code snippets:
```{r}
library(RPostgreSQL)
db_engine <- "PostgreSQL"
host <- "localhost"
user <- "postgres" # if you are concerned about security
password <- "" # you should probably change this 2 lines
port <- 5433 # set the port of your Postgres instance, default is 5432
query <- "SELECT id, data FROM data"
drv <- dbDriver(db_engine)
con_colombia <- dbConnect(drv, host = host, port = port,
dbname = "ocds_colombia", user = user, password = password)
con_mexico <- dbConnect(drv, host = host, port = port,
dbname = "ocds_mexico", user = user, password = password)
```
We connect and query Postgres through [DBI](https://cran.r-project.org/web/packages/DBI/index.html), a database interface definition for communication between R and relational database management systems. In particular, we use [RPostgreSQL](https://cran.r-project.org/web/packages/RPostgreSQL/index.html), a DBI implementation for PostgreSQL databases.
```{r}
stream_in_db <- function(con, query, page_size = 1000, acc = 0) {
current_id <- 0
return(function(handler) {
repeat{
paged_query <- paste(query, "WHERE id > ", current_id, "ORDER BY id ASC LIMIT", page_size)
data <- dbGetQuery(con, paged_query)
if (dim(data)[1] == 0) {
break
}
acc <- handler(data[['data']], acc)
current_id <- tail(data[['id']], n=1)
}
return(acc)
})
}
stream_in_colombia <- stream_in_db(con_colombia, query)
stream_in_mexico <- stream_in_db(con_mexico, query)
```
Let's recap what we just did:
1. First, we defined a function *stream_in_db* that receives a database connection and a query, and runs the query against the specified database.
2. *stream_in_db* returns a function, which applies a handler received as a parameter to every page of the result set.
3. Finally, we used *stream_in_db* to get functions pointing at the databases of Mexico and Colombia.
Pagination using *LIMIT* and *OFFSET* is known to degrade in performance as we scan pages further apart from the start. To speed things up as much as possible we use an approach known as *keyset pagination*, which leverages the fact that we have an index defined for the primary key. For a more detailed comparison between pagination approaches please read [this article](http://allyouneedisbackend.com/blog/2017/09/24/the-sql-i-love-part-1-scanning-large-table/).
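For contrast, a hedged sketch of what the two paging strategies look like when built as query strings in R (the *OFFSET* variant is shown only for comparison and is not used in this guide):
```{r, eval=FALSE}
page_size <- 1000
# OFFSET pagination: the database still scans and discards all earlier rows
offset_query <- paste("SELECT id, data FROM data ORDER BY id ASC",
                      "LIMIT", page_size, "OFFSET", 50000)
# Keyset pagination: the primary key index lets the database jump straight to the page
keyset_query <- paste("SELECT id, data FROM data WHERE id >", 50000,
                      "ORDER BY id ASC LIMIT", page_size)
```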
Now let's take our streaming functions for a ride. To test them out, we can define a dummy handler to count the number of rows of data that result from our query.
```{r message=FALSE, warning=FALSE, eval=FALSE}
sanity_checker <- function(data, acc) {
m <- parLapply(cl, data, function(e) {
t <- fromJSON(e, flatten = TRUE)
return(1)
})
return(acc + Reduce("+", m))
}
```
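A possible way to invoke the checker, given the streaming functions defined above (not evaluated here, since scanning the full tables can take a while):
```{r, eval=FALSE}
# Count the number of JSON documents streamed from each database
stream_in_colombia(sanity_checker)
stream_in_mexico(sanity_checker)
```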
# Analysis of contracting data
This section introduces a bunch of useful tools to perform basic data analysis using R, accompanied by some basic examples to showcase their features.
## Introducing the Tidyverse
The Tidyverse is a collection of R packages aimed at data science tasks; besides the libraries, it shares a common set of data structures, a style guide, and an underlying philosophy for data analysis.
The core Tidyverse [packages](https://www.tidyverse.org/packages/) are:
* ggplot2: a system for declaratively creating graphics, based on The Grammar of Graphics, a tool that enables the user to concisely describe most of the graphical mappings used in data visualization. You provide the data, tell ggplot2 how to map variables to aesthetics and which graphical primitives to use, and it takes care of the details.
* dplyr: an implementation of a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges.
* tidyr: provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.
* readr: provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf).
* purrr: enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors.
* tibble: a modern reimplementation of R's star data structure: the data frame. At this point, the user should be fairly familiar with tibbles.
* stringr: provides a cohesive set of functions designed to make working with strings as easy as possible.
* forcats: provides a suite of useful tools that solve common problems with factors, which are used by R to handle categorical variables.
For the purposes of this guide, the first three packages above are the most important ones. Hence, tidyr and dplyr will be covered in the remainder of this section and ggplot2 will be described in the next one.
## Data cleaning functions
According to the Tidyverse philosophy, data cleaning is the process of making data tidy. Data being tidy means:
* Each variable must have its own column.
* Each observation must have its own row.
* Each value must have its own cell.
Every package from the Tidyverse is designed to work with tidy data; when dealing with an untidy dataset, the first step of our analysis must therefore be using **tidyr** to clean it up. There are three main verbs provided by this package to help make data tidy: gather, spread and separate.
### Gathering
A frequent issue with untidy data is a dataset where some column names are not variable *names* but variable *values*. For example, consider the code snippet below:
```{r}
world_population <- tibble(
  country = c("Paraguay", "Uruguay", "Colombia", "Mexico"),
  "2017" = c(7000000, 3000000, 45000000, 127000000),
  "2018" = c(7200000, 3200000, 46000000, 128000000)
)
world_population
```
In the example above, the data is untidy because *2017* and *2018* are values of the implicit *year* variable. This means each row corresponds to two observations, not one. We can fix this problem by applying the *gather* function as follows:
```{r}
world_population %>% gather(`2017`, `2018`, key = "year", value = "inhabitants")
```
*Gather* receives as parameters the names of the columns we want to pivot, and the names of the two new columns we want to create. Much better! In the final result, the problematic columns are dropped and our dataset is now 100% tidy.
### Spreading
Gathering is useful when one row corresponds to more than one observation. Spreading works in the opposite scenario, when a single observation is scattered across multiple rows. Consider a simple extension of our example dataset:
```{r}
world_count <- tibble(
  country = c("Paraguay", "Uruguay", "Colombia", "Mexico", "Paraguay", "Uruguay", "Colombia", "Mexico"),
  year = 2018,
  type = c("inhabitants", "inhabitants", "inhabitants", "inhabitants", "cars", "cars", "cars", "cars"),
  count = c(7000000, 3000000, 45000000, 127000000, 1000000, 500000, 10000000, 75000000)
)
world_count
```
In this case we have variables stored as cell values, which is the case of *inhabitants* and *cars*. How do we fix it? Just watch the **spread** function do its magic below:
```{r}
world_count %>% spread(key = type, value = count)
```
*Spread* takes every value in the column given by its *key* parameter and turns it into a separate column, filling in the cells with the corresponding value from the column given by its *value* parameter.
### Separating
Our final tidyr function, **separate**, allows us to solve cases in which multiple variable values are stored in the same cell. As an example, consider the result of parsing the Uruguayan RSS feed that we downloaded in a previous section.
```{r}
index_tb
```
It is clear that the title column contains values for two variables: *id_compra* and *release_id*, which are separated by a comma. Let's use **separate** to make things tidy:
```{r}
index_tb %>%
  separate(title, into = c("id_compra", "release_id"), sep = ",") %>%
  transform(id_compra = str_replace(id_compra, "id_compra:", "")) %>%
  transform(release_id = str_replace(release_id, "release_id:", "")) %>%
  # the two transform calls above just remove the needless prefixes
  head(5)
```
*Separate* splits the value of a single column by the character given by the *sep* parameter; the names of the newly defined columns are given by the *into* parameter, which should be a collection of strings. The last two lines of the pipeline above are just a sneak peek of one of the features of the package we will cover next: dplyr.
## Data analysis functions
Once we have tidied up our data with **tidyr**, the next steps of data analysis usually involve data manipulation in one way or another. Selecting specific columns and rows based on a condition, adding composite columns, summarizing data, etc. can be cited among the operations that are frequently performed as part of the analytical process. The Tidyverse includes **dplyr** as the go-to tool for this kind of data manipulation; we will now cover some of its basic functions.
### Mutating
The **mutate** function allows the user to add a new column to a tibble based on the values of one (or more) already existing columns. Let's use the dataset of releases from Uruguay to illustrate a case where this function might come in handy:
```{r}
releases_tb
```
Let's suppose we need a column with the *name* of the month when a release was made. This value is not readily available in the dataset, but the information is contained in the *date* column and can be extracted with some help from **mutate**:
```{r}
uruguay_releases_with_month <- releases_tb %>% mutate(month = month.name[month(date)])
## month.name is a vector with the name of each month
uruguay_releases_with_month[c('ocid', 'month')]
```
*Mutate* receives a tibble as its first parameter and a variable assignment expression as its second, which is evaluated row-wise to produce the new column.
### Selecting
In our code snippet above we selected certain columns of our tibble using array-indexing notation. **dplyr** provides a function to do the same thing, **select**:
```{r}
select(uruguay_releases_with_month, ocid, month)
```
*Select* receives a variable number of parameters, the first one must be a tibble or data frame and the rest of them are the columns to be included in the selection.
### Filtering
A very common operation when dealing with data is to select a subset of rows that are of interest for our analysis. For example, let's assume we are only interested in releases that are related to awards. We can identify the stage of the contracting process to which a release is related through the *tag* column.
```{r}
releases_tb[1,]$tag
```
Knowing this, it is easy to get the collection of releases that correspond to awards by using **filter** as shown below:
```{r}
awards_tb <- releases_tb %>% filter(tag == "award")
awards_tb
```
```{r}
awards_tb[10, ]$tag
```
*Filter* receives a tibble as its first parameter and a boolean expression as the second one; the boolean expression is evaluated for each row and only rows which evaluate to *TRUE* are included in the final result.
### Summarizing
Summarization allows us to aggregate variables based on a preset grouping. To illustrate this we can revisit our example using the Iris dataset:
```{r}
iris %>%
group_by(Name) %>%
summarize_if(is.numeric, mean)
```
In the snippet above, we start by grouping the dataset by species name and then summarize every numeric column using the mean as the aggregator function.
# Data visualization
Data visualization can be defined as the mapping of data to visual cues such as size, shape and color in order to represent the variation of the values of the variables in a dataset. It is a good way to communicate complex information, since it is much easier to spot patterns and make comparisons than with raw data.
This section introduces the reader to some basic concepts of data visualization, providing a few guidelines on how to choose an appropriate visual representation of a dataset. Although it is definitely helpful to acquire such theoretical background, the good news for the reader is that R gently guides you towards the right choice (if you let it). The last statement will become clearer once we cover **ggplot2**, R's main data visualization library and a very important component of the Tidyverse.
Finally, we will test what we just learned by plotting OCDS data using **ggplot2**.
## A data visualization primer
Whenever we visualize data, we are encoding variables using visual attributes such as size, shape or color. Consider, for example, a quantitative variable: differences among the values associated with each observation can be represented in several ways, as displayed in figure 2:
```{r, echo=FALSE, fig.cap="Potential attribute mapping for a quantitative variable."}
knitr::include_graphics("visual_cues.jpg")
```
As the reader might have noticed, many mappings are possible, but not all of them seem equally appropriate. In fact, statisticians William Cleveland and Robert McGill explored this intuition further by running some experiments with human volunteers to determine which visual attributes encoded quantitative information more accurately. Their findings are summarized in figure 3:
```{r, echo=FALSE, fig.cap="Preattentive attributes sorted by their suitability to encode quantitative variables."}
knitr::include_graphics("ordered_visual_cues.jpg")
```
While this preference order is a good guideline for quantitative variables, other types of data map differently to visual attributes. For example, while color hue is a poor choice to encode a quantitative variable, it works well to encode a categorical one.
Different combinations of encoding choices render different types of plots. In this manual we will use four types of plots, which will be introduced as we use them to describe OCDS data in a later subsection: bar charts, line charts, box plots and histograms. There are several other types of charts, and additional considerations that can be taken into account when visualizing data; for a more in-depth tour of these topics we recommend the great visualization course by Peter Aldhous, which is available [online](http://paldhous.github.io/ucb/2016/dataviz/week2.html) (figures in this section were taken from the course site).
## An introduction to ggplot2
**ggplot2** is the Tidyverse package for data visualization. It is based on the *grammar of graphics*, a formal grammar to declaratively describe most of the common charts used when visualizing data.
Graphs are described in ggplot by using a concise set of elements which can be combined according to a basic structure defined by the grammar of graphics. A simplified version of the grammar can be read below:
```{r, eval=FALSE}
ggplot(data = [DATA]) +
[GEOM_FUNCTION](mapping = aes([MAPPINGS]))
```
You can consider the snippet above as a template for graphics. To make a plot, you simply fill in the elements within brackets with actual values:
* [DATA] is pretty self-explanatory: it is the placeholder for the dataset that we want to plot.
* [GEOM_FUNCTION] should be replaced by any **geom** function provided by ggplot2. geoms are geometrical objects used to represent data such as points, lines, etc.
* Finally [MAPPINGS] should be a collection of ggplot2 **aesthetics**, which are basically visual encoding specifications for the variables we want to visualize.
Let's see this template in action by plotting our well-known Iris dataset as a scatter plot:
```{r}
ggplot(data = iris) + geom_point(mapping = aes(x = PetalLength, y = PetalWidth, color = Name))
```
As the reader might have noticed, **ggplot2** favors convention over configuration and any plot that we draw with it includes many good practices out of the box. For example, in the scatter plot above we got axis labelling and a legend for *free* just by specifying our set of aesthetic mappings.
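Those sensible defaults can be overridden by adding further components to the plot. For instance, a quick sketch that relabels the axes and adds a title to the same scatter plot using the *labs* function:
```{r}
ggplot(data = iris) +
  geom_point(mapping = aes(x = PetalLength, y = PetalWidth, color = Name)) +
  labs(x = "Petal length", y = "Petal width", title = "Iris petal dimensions by species")
```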
There are several other components of the **ggplot2** grammar, but for this brief introduction our simplified template will do. To check which other elements are available, along with an exhaustive list of *geoms* and *aesthetics*, please visit the [official website](https://ggplot2.tidyverse.org) of the library.
## Visualizing Open Contracting data
Now that we know the basics of **ggplot2** we can go ahead and start making some plots ...almost. Since we are mostly interested in comparing amounts spent by year and buyer, we can start off by extracting all rows of our dataset that contain at least one contract. We achieve exactly that by building a boolean index with one element per row of our dataset, indicating whether its corresponding row passes our condition or not.
```{r}
contract_indices <- parApply(cl, dncp_records_tb, 1, function(r) {
contracts <- r['compiledRelease.contracts'][[1]]
return(!is.null(contracts) && !is.null(dim(contracts)))
})
```
Once we have our boolean index, we can use it to filter out the records that we are not interested in and keep the ones that are useful for our analysis; we achieve this by doing `dncp_records_tb[contract_indices, ]` in the snippet below. After that, we extract the contracts and the buyer name from the records we kept.
```{r}
contracts <- parApply(cl, dncp_records_tb[contract_indices, ], 1, function(r) {
result <- r['compiledRelease.contracts'][[1]]
result['buyer.name'] <- r['compiledRelease.buyer.name']
result['publishedDate'] <- r['publishedDate']
return(result)
})
```
The `contracts` object defined above is a nested collection of contracts. To make it a single tibble we call the `bind_rows` function. We explicitly free the memory we are no longer going to use by removing the `contracts` object.
```{r}
contracts_dncp <- bind_rows(contracts)
rm(contracts)
```
We now have a contracts dataset from DNCP ready for plotting and analysis. We can follow similar steps to get one from the Finance Ministry.
```{r}
contracts <- parApply(cl, mh_releases_tb, 1, function(r) {
result <- r['contracts'][[1]]
result['buyer.name'] <- r['buyer.name']
result['publishedDate'] <- r['publishedDate']
return(result)
})
```
```{r}
contracts_hacienda <- bind_rows(contracts)
rm(contracts)
```
We know there is significant overlap between the publications of DNCP and the Ministry of Finance, so naively merging both tibbles would result in duplicated contracts, making our analysis invalid. First, let's make sure the overlap actually exists, by selecting the contracts that are present in the DNCP dataset **and** the Ministry of Finance dataset. The `subset` function selects rows from a tibble based on a condition; in this case, a non-empty subset would confirm the existence of duplicates.
```{r}
overlap <- subset(
contracts_dncp,
(dncpContractCode %in% contracts_hacienda[['dncpContractCode']]))['dncpContractCode']
dim(overlap)
```
Having confirmed the presence of duplicate contracts, we need to come up with a smarter merging strategy: let's merge both tibbles together and group them by *dncpContractCode* (this should group duplicates together), sort the rows within each group by *publishedDate* in ascending order, and finally pick the last row from each group. This preprocessing step, coded in the snippet below, ensures that we only include the most current version of a contract in our analysis.
```{r}
contracts_paraguay <- bind_rows(contracts_dncp, contracts_hacienda) %>%
group_by(dncpContractCode) %>%
arrange(publishedDate) %>%
slice(n()) %>%
ungroup
contracts_paraguay
```
Next, there is the issue of contract currency. Let's have a look at which currencies are used and how many contracts were signed for each one.
```{r}
contracts_paraguay %>%
group_by(value.currency) %>%
summarise(count = n())
```
Results presented in the table above show that the vast majority of contracts were signed with amounts in Paraguayan Guaranies (PYG), a not-so-surprising fact given that we are looking at Paraguayan contracts. We could include contracts with amounts in US dollars by using a given exchange rate, but to keep things simple we will only consider contracts in PYG.
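For reference, a hedged sketch of how the USD contracts could be folded in, assuming a hypothetical fixed exchange rate (a real analysis would need rates from an official source, so this chunk is not evaluated):
```{r, eval=FALSE}
usd_to_pyg <- 5800 # hypothetical exchange rate, for illustration only
contracts_paraguay %>%
  filter(value.currency %in% c("PYG", "USD")) %>%
  mutate(amount_pyg = ifelse(value.currency == "USD",
                             value.amount * usd_to_pyg,
                             value.amount))
```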
To continue with our analysis, let's get a tibble with our contract dataset grouped by year, along with some interesting summaries. First, we extract the year from the *dateSigned* field using the `mutate` function, then we filter out some invalid rows (NA values) and outliers while keeping only the contracts in PYG using the `filter` function. Finally, we group the remaining contracts by year with the `group_by` function and obtain the total count and amount per group using the `summarise` function.
```{r}
by_year <- contracts_paraguay %>%
mutate(signed_year = year(dateSigned)) %>%
filter(!is.na(signed_year), !is.na(value.amount),
signed_year < 2019, signed_year > 2009, value.currency == 'PYG') %>%
group_by(signed_year) %>%
summarise(count = n(), amount = sum(value.amount))
```
With our dataset defined above, we can draw line plots with a few lines of **ggplot2** magic. Let's plot the number of contracts and the total amount for all contracts per year:
```{r}
ggplot(data=by_year, aes(x=signed_year, y=count)) +
geom_line(color="blue") +
geom_point() +
labs(x = 'Year', y = 'Number of contracts')
```
```{r}
ggplot(data=by_year, aes(x=signed_year, y=amount)) +
geom_line(color='springgreen4') +
geom_point() +
labs(x = 'Year', y = 'Total Amount (PYG)')
```
We have drawn some line plots; what about some bars? Let's plot the number of contracts per buyer for a given year (2017). You should be noticing a pattern by now: we start by getting our data into the right shape using **dplyr** verbs, and only then do we do the plotting. Bar charts require an additional intermediate step, at least to give them the typical top-n look: buyer names are character arrays and hence sorted lexicographically by default, so we need to redefine the column using the factor type (used for categorical values) and set a custom ordering using the `order` function.
```{r}
count_by_buyer <- contracts_paraguay %>%
mutate(signed_year = year(dateSigned)) %>%
filter(value.currency == 'PYG', signed_year == 2017) %>%
group_by(buyer.name) %>%
summarise(ccount = n()) %>%
arrange(desc(ccount)) %>%
head(5)
count_by_buyer$buyer.name <- factor(
count_by_buyer$buyer.name,