combine.Rmd

---
title: "combining IVY+ dataframes"
output: html_notebook
---

```{r, message=FALSE, warning=FALSE}
library(tidyverse)
library(stringr)
```

## load data

`read_delim` will imputed data types of columns from the first 1000 rows.  Later steps will transform all dataframes to share the same column types.  Four files had project failures and should be investigated.
```{r load-data}
JHU <- read_delim("data/JHU_Normalized.tsv", 
    "\t", escape_double = FALSE, trim_ws = TRUE)

UPenn <- read_delim("data/UPenn_Normalized.tsv", 
    "\t", escape_double = FALSE, trim_ws = TRUE)

MIT <- read_delim("data/MIT_Normalized.tsv", 
    "\t", escape_double = FALSE, trim_ws = TRUE)

#Cornell <- read_delim("data/Cornell_Normalized.tsv", 
#    "\t", escape_double = FALSE, trim_ws = TRUE)

Brown <- read_delim("data/Brown_Normalized.tsv", 
    "\t", escape_double = FALSE, trim_ws = TRUE)

Duke <- read_delim("data/Duke_Normalized.tsv", 
    "\t", escape_double = FALSE, trim_ws = TRUE)

Yale <- read_delim("data/Yale_Normalized.tsv", 
    "\t", escape_double = FALSE, trim_ws = TRUE)

UChicago <- read_delim("data/UChicago_Normalized.tsv", 
    "\t", escape_double = FALSE, trim_ws = TRUE)

Harvard <- read_delim("data/Harvard_Normalized.tsv", 
    "\t", escape_double = FALSE, trim_ws = TRUE)

Princeton <- read_delim("data/Princeton_Normalized.tsv", 
    "\t", escape_double = FALSE, trim_ws = TRUE)

Columbia <- read_delim("data/Columbia_Normalized.tsv", 
    "\t", escape_double = FALSE, trim_ws = TRUE)

```

## Idenitfy Standard data structure

We'll use `glimpse()` and the JHU dataframe to identify the standard data types across all variables in all dataframes.
```{r}
glimpse(JHU)
```


## Transform variables

Three files were imputed with incorrect guesses

1. `read_delim()` imputed *UChicago* `gov_doc` to `character`.  Transformed below to `integer` 
2. `read_delim()` imputed *Princeton* & *Columbia* `pub_year` to `character`. Transformed below to `integer`

One file had a variable name misspelled

1. renamed `UChicago$other_committments` to `UChicago$other_commitments`

```{r}
glimpse(UChicago)
UChicago2 <-  UChicago %>% 
  mutate(gov_doc = as.integer(gov_doc)) %>% 
  mutate(other_commitments = other_committments) 
UChicago2$other_committments <- NULL  
glimpse(UChicago2)
```

```{r}
glimpse(Princeton)
Princeton2 <- Princeton %>% 
  mutate(pub_year = as.integer(pub_year)) %>% 
  mutate(call_no = callno)
Princeton2$callno <- NULL
glimpse(Princeton2)
```

```{r}
glimpse(Columbia)
Columbia2 <- Columbia %>% 
  mutate(pub_year = as.integer(pub_year)) %>% 
  mutate(call_no = callno)
Columbia2$callno <- NULL
glimpse(Columbia2)
```


## Combine all dataframes into one 
```{r}

big_df <- bind_rows(JHU, UPenn, MIT,
                    # Cornell, 
                    Brown, 
                    Yale, Harvard, Duke, 
                    UChicago2,   # as.integer(gov_docs) 
                    Princeton2,  # as.integer(pub_year)
                    Columbia2)   # as.integer(pub_year)

```

## Further Transformations

Per July 6 meeting.  Group decided on the following transformations as part of the data cleaning process...

- Make LC Class all upper case
- Set non-four-digit years to _blank_
- Set all years not between 1440 and 2017 to _blank_
- Compute new column with facet count for resolved_OCLC, i.e., frequency count for each row based on resolved_OCLC

### Make LC Class all uppercase

```{r}
big_df <- big_df %>% 
  mutate(lc_class = str_to_upper(lc_class))
```

### Transform NA characters in pub_country

pub_country has variable values "na" which stand for Netherlands Antilles.  This is confused with the NA_Character_ once analysis is done in Tableau.  Transform "na" to "NethAnt"

```{r}
big_df <- big_df %>% 
  mutate(pub_country = if_else(is.na(pub_country), "NA_character_", pub_country)) %>% 
  mutate(pub_country = if_else(pub_country == "na", "NethAnt", pub_country)) 
```


### pub_year integrity transformation

- Keep only 4-digit years in pub_year
- Set non--four-digit years to blank
- keep only pub_year between 1440 and 2017

```{r}
big_df <- big_df %>% 
  mutate(pub_year = if_else(str_detect(pub_year, 
                                       "\\d{4}"), 
                            as.character(pub_year), 
                            NA_character_)) %>%
  mutate(pub_year = as.integer(pub_year)) %>% 
  mutate(pub_year = if_else(pub_year >= 1440 
                            & pub_year <= 2017, 
                            pub_year, NA_integer_))
```

### Eliminate duplicates

```{r}
big_df <- big_df %>% 
  distinct(primary_key, .keep_all = TRUE)

```


## Priviledge: completeness heuristic

In the case where there is duplication in the `resolved_oclc` key, keep the first item that has data in each of the following variables:  pub_year, pub_country, language, lc_class


```{r}
# works for resolved_oclc = 7459090
# works for resolved_oclc = 890946

big_scored_df <- big_df %>% 
  # filter(resolved_oclc == "890946" | resolved_oclc == 7459090) %>% 
  # select(member_id, primary_key, resolved_oclc, pub_year, pub_country, language, lc_class) %>% 
  mutate(score = 4) %>% 
  mutate(score = if_else(is.na(pub_year), score - 1, score)) %>% 
  mutate(score = if_else(is.na(pub_country), score - 1, score)) %>% 
  mutate(score = if_else(is.na(language), score - 1, score)) %>% 
  mutate(score = if_else(is.na(lc_class), score - 1, score)) %>% 
  arrange(resolved_oclc, -score) %>% 
  distinct(resolved_oclc, .keep_all = TRUE)

```

## Join 

Join the `big_df` data.frame with the scored `big_scored_df` data.frame and note which is the selected (i.e. "scored") row by adding a column `selected` with a cell value of "*True*"

```{r}
big_selected_df <- left_join(big_df, big_scored_df)

big_selected_df <- big_selected_df %>% 
  mutate(selected = if_else(is.na(score), NA_character_, "True"))
```

## Write combined dataframe as one TSV file

```{r}
write.table(big_df, file = "data/combined.tsv", sep = "\t",  row.names = FALSE)
write.table(big_scored_df, file = "data/combined_scored.tsv", sep = "\t",  row.names = FALSE)
write.table(big_selected_df, file = "data/combined_selected.tsv", sep = "\t",  row.names = FALSE)
```


## Problems Detected -- Problem Resolved

The **Problems Dectected -- Problems Resolved** Section does not need to be run to generate the combined data.frame.  This section is only to document the discovery of a problem.  See the **NOTE** below.

###  Frequency of resolved_OCLC Number

Compute new column with facet count for `resolved_OCLC`, i.e., frequency count for each row based on resolved_OCLC

NOTE:  After getting a freshly de-duped file from UPenn and running `distinct()` on the big_data.frame (`big_df`) the data.frame has been reduced by 1% and the largest duplication on resolved OCLC numbers is 7.  This is understood to be correct.


```
big_df %>% 
  count(resolved_oclc) %>% 
  arrange(-n)
```

#### List `member_id`
```
big_df %>% 
  select(member_id) %>% 
  count(member_id) %>% 
  arrange(-n)
```

#### Find examples of duplicates for each institution
```
big_df %>% 
  filter(member_id == "yale") %>% # brown, columbia, cornell
  # duke, harvard, JHU, mit, princeton, uchicago, upenn, yale
  count(primary_key) %>% 
  arrange(-n)
```


#### filter on a specific `resolved_oclc`

resolved_oclc == 883016 has 7 rows  
```
big_df %>% 
  filter(resolved_oclc == 883016)  # 42 ; rows 411-420
```

```
big_df %>% 
  filter(resolved_oclc == 493136) %>% 
  group_by(member_id) %>% # columbia & princeton
  select(member_id, call_no, 
         lc_class, primary_key, resolved_oclc)
  
```


#### filter on a specific `resolved_oclc`

resolved_oclc == 276173 has 116 rows
```
big_df %>% 
  filter(resolved_oclc == 276173) # row 10 ; rows 10
```

#### Distinct primary_key

looks like Princeton (perhaps; more research) has 50K duplicate rows by primary_key.  Use `distinct()` to retain only unique/distinct rows from an input tbl.
```
big_distinct_primary_key_df <-  big_df %>% 
  distinct(primary_key, .keep_all = TRUE) 

big_distinct_primary_key_df
```


## Failures

Note that four dataframes had parsing failures.  See console messages integrated into the project notebook ('combine.nb.html') for details.

1. Cornell had 26 parsing failures out of over 1 Million rows.  The failures appear to be in the pub_year variable.
```
problems(Cornell)
```


1. Brown had 43 parsing failures out of 165,000 rows.  The failures appear to be in the pub_year variable.

```
problems(Brown)
```


1. University of Harvard had 489 parsing failures out of almost 670,000 rows.  The failures appear to be in the pub_year variable

```
problems(Harvard)
```


## Investigating Problem with NA_Character_

Tableau doesn't distinquish case in NA.  This creates a problem, in Tableau, for the `lc_class` field.  So far we've used three versions of NA:

- NA = N/A or empty or null
- NA1 = Architecture
- Na = Architecture

In email from Joyce -- dated 8/10/2017 -- there is a problem where lower case letters exist and cause problems for Tableau.  Tableau automatically combines case (e.g.  “ND” and “Nd” or “Ml” and “ML”) but this is a problem for 'NA', "Na", and "NA1".  I'm assuming this problem is taking place when Tableau analyzes the  `lc_class` field.  However, in code chunk 8 the `lc_class` field is being coerced to uppercase.  I am confused as to how Tableau is seeing some instances with the 2nd characters in lowercase.

```
# big_df %>% 
sort(unique(big_selected_df$lc_class))
```