Commit a53d9f3

update data validation episode
1 parent d5c4de9 commit a53d9f3

1 file changed

+65
-68
lines changed

episodes/validate.Rmd

Lines changed: 65 additions & 68 deletions
@@ -12,7 +12,7 @@ exercises: 2
1212

1313
::::::::::::::::::::::::::::::::::::: objectives
1414

15-
- Demonstrate how to covert case data to `linelist` data
15+
- Demonstrate how to convert case data into `linelist` data
1616
- Demonstrate how to tag and validate data to make analysis more reliable
1717

1818
::::::::::::::::::::::::::::::::::::::::::::::::
@@ -21,50 +21,51 @@ exercises: 2
2121

2222
This episode requires you to:
2323

24-
- Download the [cleaned_data.csv](https://epiverse-trace.github.io/tutorials-early/data/cleaned_data.csv)
25-
- Save it in the `data/` folder.
24+
- Download the [cleaned_data.csv](https://epiverse-trace.github.io/tutorials-early/data/cleaned_data.csv) file
25+
- And save it in the `data/` folder.
2626

2727
:::::::::::::::::::::
2828

2929
## Introduction
3030

31-
In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data,
32-
it's essential to establish an additional foundation layer to ensure the integrity and reliability of subsequent
33-
analyses. Otherwise you might find that your analysis suddenly stops working when specific variables appear or disappear, or their underlying data types (like `<date>` or `<chr>`) change. Specifically, this additional layer involves: 1) verifying the presence and correct data type of certain columns within
34-
your dataset, a process commonly referred to as **tagging**; 2) implementing measures to
35-
check that these tagged columns are not inadvertently deleted during further data processing steps, known as **validation**.
31+
In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it's essential to establish an additional fundamental layer to ensure the integrity and reliability of subsequent analyses. Otherwise you might encounter issues during the analysis process due to creation or removal of specific variables, changes in their underlying data types (like `<date>` or `<chr>`), etc. Specifically, this additional step involves:
3632

33+
1. Verifying the presence and correct data type of certain columns within
34+
your dataset, a process commonly referred to as **tagging**;
35+
2. Implementing measures to make sure that these tagged columns are not inadvertently deleted during further data processing steps, known as **validation**.
36+
37+
38+
This episode focuses on tagging and validating outbreak data using the [linelist](https://epiverse-trace.github.io/linelist/) package. Let's start by loading the package `{rio}` to read data and the `{linelist}` package
39+
to create a linelist object. We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the package `{dplyr}`. For this reason, we will also load the {tidyverse} package.
3740

38-
This episode focuses on tagging and validate outbreak data using the [linelist](https://epiverse-trace.github.io/linelist/)
39-
package. Let's start by loading the package `{rio}` to read data and the package `{linelist}`
40-
to create a linelist object. We'll use the pipe `%>%` to connect some of their functions, including others from
41-
the package `{dplyr}`, so let's also call to the tidyverse package:
4241

4342
```{r,eval=TRUE,message=FALSE,warning=FALSE}
4443
# Load packages
45-
library(tidyverse) # for {dplyr} functions and the pipe %>%
44+
library(tidyverse) # to access {dplyr} functions and the pipe %>% operator from {magrittr}
4645
library(rio) # for importing data
4746
library(here) # for easy file referencing
48-
library(linelist) # for taggin and validating
47+
library(linelist) # for tagging and validating
4948
```
5049

5150
::::::::::::::::::: checklist
5251

53-
### The double-colon
52+
### The double-colon (`::`) operator
5453

55-
The double-colon `::` in R lets you call a specific function from a package without loading the entire package into the
56-
current environment.
54+
The `::` in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important
55+
advantages, including the following:
5756

58-
For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package.
57+
* Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name.
58+
* Allowing you to call a function from a package without loading the whole package
59+
with library().
5960

60-
This help us remember package functions and avoid namespace conflicts.
61+
For example, the command `dplyr::filter(data, condition)` means we are calling
62+
the `filter()` function from the `{dplyr}` package.
6163

6264
:::::::::::::::::::
6365

6466

6567

66-
Import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode.
67-
This involves loading the dataset into the working environment and view its structure and content.
68+
Import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading the dataset into the working environment and viewing its structure and content.
6869

6970
```{r, eval=FALSE}
7071
# Read data
@@ -88,28 +89,27 @@ cleaned_data
8889

8990
:::::::::::::::::::::::: discussion
9091

91-
<!-- Have you ever experienced an unexpected change in the input data set when running an analysis during an emergency? How do you safeguard your analysis from this inconvenience? -->
92+
<!-- Have you ever experienced an unexpected change in the input data set when running an analysis during an outbreak? How do you safeguard your analysis from this inconvenience? -->
9293

9394
### An unexpected change
9495

9596
You are in an emergency response situation. You need to generate daily situation reports. You automated your analysis to read data directly from the online server :grin:. However, the people in charge of the data collection/administration needed to **remove/rename/reformat** one variable you found helpful :disappointed:!
9697

97-
How can you detect if the data input is **still valid** to replicate the analysis code you wrote the day before?
98+
How can you detect if the input data is **still valid** to replicate the analysis code you wrote the day before?
9899

99100
::::::::::::::::::::::::
100101

101102
:::::::::::::::::::::::: instructor
102103

103104
If learners do not have an experience to share, we as instructors can share one.
104105

105-
An scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The later can make decisions about the data structure that can affect downstream processes, impacting the time or the accuracy of the analysis results.
106+
A scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The latter can make decisions about the data structure that can affect downstream processes, impacting the time or the accuracy of the analysis results.
106107

107108
::::::::::::::::::::::::
108109

109-
## Creating a linelist and tagging elements
110+
## Creating a linelist and tagging columns
110111

111-
Once the data is loaded and cleaned, we convert the cleaned case data into a `linelist` object using `{linelist}` package, as in the
112-
below code chunk.
112+
Once the data is loaded and cleaned, we can convert the cleaned case data into a `linelist` object using the `{linelist}` package, as in the code chunk below.
113113

114114
```{r}
115115
# Create a linelist object from cleaned data
@@ -125,26 +125,26 @@ linelist_data
125125
```
126126

127127
The `{linelist}` package supplies tags for common epidemiological variables
128-
and a set of appropriate data types for each. You can view the list of available tags by the variable name
129-
and their acceptable data types for each using `linelist::tags_types()`.
128+
and a set of appropriate data types for each. You can view the list of available tags by the variable name and their acceptable data types using the `linelist::tags_types()` function.
130129

131130

132131
::::::::::::::::::::::::::::::::::::: challenge
133132

134-
Let's **tag** more variables. In new datasets, it will be frequent to have variable names different to the available tag names. However, we can associate them based on how variables were defined for data collection.
133+
Let's **tag** more variables. In some datasets, it is possible to encounter variable names that are different from the available tag names. In such cases, we can associate them based on how variables were defined for data collection.
135134

136135
Now:
137136

138137
- **Explore** the available tag names in {linelist}.
139-
- **Find** what other variables in the cleaned dataset can be associated with any of these available tags.
140-
- **Tag** those variables as above using `linelist::make_linelist()`.
138+
- **Find** what other variables in the input dataset can be associated with any of these available tags.
139+
- **Tag** those variables as shown above using the `linelist::make_linelist()`
140+
function.
141141

142142
:::::::::::::::::::: hint
143143

144144
You can access the list of available tag names in {linelist} using:
145145

146146
```{r, eval=FALSE}
147-
# Get a list of available tags by name and data types
147+
# Get a list of available tag names and data types
148148
linelist::tags_types()
149149
150150
# Get a list of names only
@@ -166,7 +166,7 @@ linelist::make_linelist(
166166
)
167167
```
168168

169-
How these additional tags are visible in the output?
169+
Are these additional tags visible in the output?
170170

171171
<!-- Do you want to see a display of available and tagged variables? You can explore the function `linelist::tags()` and read its [reference documentation](https://epiverse-trace.github.io/linelist/reference/tags.html). -->
172172

@@ -177,32 +177,32 @@ How these additional tags are visible in the output?
177177
## Validation
178178

179179
To ensure that all tagged variables are standardized and have the correct data
180-
types, use the `linelist::validate_linelist()`, as
181-
shown in the example below:
180+
types, use the `linelist::validate_linelist()` function, as shown in the example below:
182181

183-
```r
182+
```{r}
184183
linelist::validate_linelist(linelist_data)
185184
```
186185

187-
<!-- If your dataset requires a new tag, set the argument -->
188-
<!-- `allow_extra = TRUE` when creating the linelist object with its corresponding-->
189-
<!-- datatype. -->
186+
<!-- If your dataset requires a new tag other than those defined in the -->
187+
<!-- {linelist} package, use `allow_extra = TRUE` when creating the -->
188+
<!-- linelist object with its corresponding datatype using the -->
189+
<!-- `linelist::make_linelist()` function. -->
190190

191191

192192

193193
::::::::::::::::::::::::: challenge
194194

195-
Let's **validate** some tagged variables. Let's simulate a situation in an ongoing outbreak. You wake up one day to discover that the data stream you have rely on has a new set of entries (i.e., rows or observations) and one variable that has a change of data type.
195+
Let's assume the following scenario during an ongoing outbreak. You notice at some point that the data stream you have been relying on has a set of new entries (i.e., rows or observations), and the data type of one variable has changed.
196196

197-
For example, let's assume the variable `age` changed from a double (`<dbl>`) variable to character (`<chr>`).
197+
Let's consider the example where the type of the `age` variable has changed from double (`<dbl>`) to character (`<chr>`).
198198

199199
To simulate this situation:
200200

201-
- **Change** the variable data type,
201+
- **Change** the data type of the variable,
202202
- **Tag** the variable into a linelist, and then
203203
- **Validate** it.
204204

205-
Describe how `linelist::validate_linelist()` reacts when input data has a different variable data type.
205+
Describe how `linelist::validate_linelist()` reacts when there is a change in the data type of one variable of the input data.
206206

207207
:::::::::::::::::::::::::: hint
208208

@@ -224,8 +224,6 @@ cleaned_data %>%
224224

225225
> Please run the code line by line, focusing only on the parts before the pipe (`%>%`). After each step, observe the output before moving to the next line.
226226
227-
If the `age` variable changes from double (`<dbl>`) to character (`<chr>`) we get the following:
228-
229227
```{r}
230228
cleaned_data %>%
231229
# simulate a change of data type in one variable
@@ -242,12 +240,12 @@ Why are we getting an `Error` message?
242240

243241
<!-- Should we have a `Warning` message instead? Explain why. -->
244242

245-
Explore other situations to understand this behavior. Let's try these additional changes to variables:
243+
Explore other situations to understand this behavior by converting:
246244

247-
- `date_onset` changes from a `<date>` variable to character (`<chr>`),
248-
- `gender` changes from a character (`<chr>`) variable to integer (`<int>`).
245+
- `date_onset` from `<date>` to character (`<chr>`),
246+
- `gender` from character (`<chr>`) to integer (`<int>`).
249247

250-
Then tag them into a linelist for validation. Does the `Error` message propose to us the solution?
248+
Then tag them into a linelist for validation. Does the `Error` message suggest a fix to the issue?
251249

252250
::::::::::::::::::::::::::
253251

@@ -283,7 +281,7 @@ cleaned_data %>%
283281
linelist::validate_linelist()
284282
```
285283

286-
We get `Error` messages because of the mismatch between the predefined tag type (from `linelist::tags_types()`) and the tagged variable class in the linelist.
284+
We get `Error` messages because the types that `linelist::tags_types()` accepts for these tags differ from the types we converted the variables to.
287285

288286
The `Error` message inform us that in order to **validate** our linelist, we must fix the input variable type to fit the expected tag type. In a data analysis script, we can do this by adding one cleaning step into the pipeline.
289287

@@ -293,17 +291,17 @@ The `Error` message inform us that in order to **validate** our linelist, we mus
293291

294292
::::::::::::::::::::::::: challenge
295293

296-
What step along the `{linelist}` workflow of tagging and validating would response to the absence of a variable?
294+
Beyond tagging and validating the linelist object, what extra step do we need when building the object?
297295

298296
:::::::::::::::::::::::::: solution
299297

300-
About losing variables, you can simulate this scenario:
298+
Let's simulate a scenario about losing a variable:
301299

302300
```{r}
303301
cleaned_data %>%
304-
# simulate a change of data type in one variable
302+
# remove the variable 'age'
305303
select(-age) %>%
306-
# tag one variable
304+
# tag variable 'age' that no longer exists
307305
linelist::make_linelist(
308306
age = "age"
309307
)
@@ -316,35 +314,35 @@ cleaned_data %>%
316314

317315
## Safeguarding
318316

319-
Safeguarding is implicitly built into the linelist objects. If you try to drop any of the tagged
320-
columns, you will receive an error or warning message, as shown in the example below.
317+
Safeguarding is implicitly built into the linelist objects. If you try to drop any of the tagged columns, you will receive an error or warning message, as shown in the example below.
321318

322319
```{r, warning=TRUE}
323320
new_df <- linelist_data %>%
324321
dplyr::select(case_id, gender)
325322
```
326323

327-
This `Warning` message above is the default output option when we lose tags in a `linelist` object. However, it can be changed to an `Error` message using `linelist::lost_tags_action()`.
324+
This `Warning` message above is the default output option when we lose tags in a `linelist` object. However, it can be changed to an `Error` message using the `linelist::lost_tags_action()` function.
328325

329326
::::::::::::::::::::::::::::::::::::: challenge
330327

331328
Let's test the implications of changing the **safeguarding** configuration from a `Warning` to an `Error` message.
332329

333-
- First, run this code to count the frequency per category within a categorical variable:
330+
- First, run this code to count the frequency of each category within a categorical variable:
334331

335332
```{r,eval=FALSE}
336333
linelist_data %>%
337334
dplyr::select(case_id, gender) %>%
338335
dplyr::count(gender)
339336
```
340337

341-
- Set behavior for lost tags in a `linelist` to "error" as follows:
338+
- Set the behavior for lost tags in a `linelist` to "error" as follows:
342339

343340
```{r, eval=FALSE}
344341
# set behavior to "error"
345342
linelist::lost_tags_action(action = "error")
346-
```
347-
- Now, re-run the above code segment with `dplyr::count()`.
343+
```
344+
345+
- Now, re-run the above code chunk with `dplyr::count()`.
348346

349347
Identify:
350348

@@ -368,7 +366,7 @@ linelist::lost_tags_action()
368366

369367
::::::::::::::::::::::::::::::::::::::::::::::::
370368

371-
A `linelist` object resembles a data frame but offers richer features
369+
A `linelist` object resembles a data frame but offers richer features
372370
and functionalities. Packages that are linelist-aware can leverage these
373371
features. For example, you can extract a data frame of only the tagged columns
374372
using the `linelist::tags_df()` function, as shown below:
@@ -377,23 +375,22 @@ using the `linelist::tags_df()` function, as shown below:
377375
linelist::tags_df(linelist_data)
378376
```
379377

380-
This allows, the extraction of use tagged-only columns in downstream analysis, which will be useful for the next episode!
378+
This allows for the use of tagged variables only in downstream analysis, which will be useful for the next episode!
381379

382380
:::::::::::::::::::::::::::::::::::: checklist
383381

384382
### When should I use `{linelist}`?
385383

386384
Data analysis during an outbreak response or mass-gathering surveillance demands a different set of "data safeguards" if compared to usual research situations. For example, your data will change or be updated over time (e.g. new entries, new variables, renamed variables).
387385

388-
`{linelist}` is more appropriate for this type of ongoing or long-lasting analysis.
389-
Check the "Get started" vignette section about
390-
[When you should consider using {linelist}?](https://epiverse-trace.github.io/linelist/articles/linelist.html#should-i-use-linelist) for more information.
386+
`{linelist}` is more appropriate for this type of ongoing or long-lasting analysis. Check the "Get started" vignette section about
387+
[When I should consider using {linelist}?](https://epiverse-trace.github.io/linelist/articles/linelist.html#should-i-use-linelist) for more information.
391388

392389
:::::::::::::::::::::::::::::::::::::::::::
393390

394391

395392
::::::::::::::::::::::::::::::::::::: keypoints
396393

397-
- Use `{linelist}` package to tag, validate, and prepare case data for downstream analysis.
394+
- Use the `{linelist}` package to tag, validate, and prepare case data for downstream analysis.
398395

399396
::::::::::::::::::::::::::::::::::::::::::::::::
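
The workflow the revised episode describes (tag columns, validate them, then see how validation reacts when a column's type changes upstream) can be sketched in a few lines. This is a minimal illustration and not part of the commit: the `cases` data frame and its column names are made up for the example, and it assumes the `{linelist}` package is installed. Only functions named in the episode are used.

```r
# Hypothetical toy data standing in for cleaned_data (not the episode's dataset)
cases <- data.frame(
  case_id    = c(1L, 2L, 3L),
  date_onset = as.Date(c("2023-01-02", "2023-01-05", "2023-01-09")),
  age        = c(34, 51, 8),
  gender     = c("female", "male", "female")
)

# Tag columns using tag names listed by linelist::tags_names()
ll <- linelist::make_linelist(
  cases,
  id         = "case_id",
  date_onset = "date_onset",
  age        = "age",
  gender     = "gender"
)

# Validation succeeds while every tagged column has an accepted type
linelist::validate_linelist(ll)

# Simulate an upstream change: age now arrives as character
changed <- cases
changed$age <- as.character(changed$age)

# Tagging still works, but validation is expected to raise an error,
# which we catch so a script can react instead of stopping
result <- tryCatch(
  {
    changed_ll <- linelist::make_linelist(changed, id = "case_id", age = "age")
    linelist::validate_linelist(changed_ll)
    "valid"
  },
  error = function(e) "invalid"
)
result
```

Wrapping the validation step in `tryCatch()` is one possible design for an automated daily report: the script can log the problem or add a cleaning step rather than fail midway.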
