episodes/validate.Rmd
Lines changed: 65 additions & 68 deletions
Original file line number
Diff line number
Diff line change
@@ -12,7 +12,7 @@ exercises: 2
12
12
13
13
::::::::::::::::::::::::::::::::::::: objectives
14
14
15
-
- Demonstrate how to covert case data to`linelist` data
15
+
- Demonstrate how to convert case data into `linelist` data
16
16
- Demonstrate how to tag and validate data to make analysis more reliable
17
17
18
18
::::::::::::::::::::::::::::::::::::::::::::::::
@@ -21,50 +21,51 @@ exercises: 2
21
21
22
22
This episode requires you to:
23
23
24
-
- Download the [cleaned_data.csv](https://epiverse-trace.github.io/tutorials-early/data/cleaned_data.csv)
25
-
-Save it in the `data/` folder.
24
+
- Download the [cleaned_data.csv](https://epiverse-trace.github.io/tutorials-early/data/cleaned_data.csv) file
25
+
-And save it in the `data/` folder.
26
26
27
27
:::::::::::::::::::::
28
28
29
29
## Introduction
30
30
31
-
In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data,
32
-
it's essential to establish an additional foundation layer to ensure the integrity and reliability of subsequent
33
-
analyses. Otherwise you might find that your analysis suddenly stops working when specific variables appear or disappear, or their underlying data types (like `<date>` or `<chr>`) change. Specifically, this additional layer involves: 1) verifying the presence and correct data type of certain columns within
34
-
your dataset, a process commonly referred to as **tagging**; 2) implementing measures to
35
-
check that these tagged columns are not inadvertently deleted during further data processing steps, known as **validation**.
31
+
In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it's essential to establish an additional fundamental layer to ensure the integrity and reliability of subsequent analyses. Otherwise you might encounter issues during the analysis process due to creation or removal of specific variables, changes in their underlying data types (like `<date>` or `<chr>`), etc. Specifically, this additional step involves:
36
32
33
+
1. Verifying the presence and correct data type of certain columns within
34
+
your dataset, a process commonly referred to as **tagging**;
35
+
2. Implementing measures to make sure that these tagged columns are not inadvertently deleted during further data processing steps, known as **validation**.
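
The two steps above can be sketched on a small example. This is a minimal illustration, assuming the `{linelist}` package is installed; the data frame `toy_cases` and its values are made up for this sketch and are not part of the lesson data.

```r
library(linelist)

# Hypothetical toy data, invented for illustration only
toy_cases <- data.frame(
  case_id = c("c1", "c2", "c3"),
  date_onset = as.Date(c("2024-01-02", "2024-01-05", "2024-01-09")),
  age = c(34, 27, 61)
)

# Step 1 (tagging): link columns to the tag names known to {linelist}
toy_linelist <- make_linelist(
  toy_cases,
  id = "case_id",
  date_onset = "date_onset",
  age = "age"
)

# Step 2 (validation): check that tagged columns exist and have an accepted type;
# this raises an error if a check fails
validate_linelist(toy_linelist)
```

If every check passes, `validate_linelist()` returns the linelist object, so it can sit at the end of a pipeline.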
36
+
37
+
38
+
This episode focuses on tagging and validating outbreak data using the [linelist](https://epiverse-trace.github.io/linelist/) package. Let's start by loading the package `{rio}` to read data and the `{linelist}` package
39
+
to create a linelist object. We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the package `{dplyr}`. For this reason, we will also load the `{tidyverse}` package.
37
40
38
-
This episode focuses on tagging and validate outbreak data using the [linelist](https://epiverse-trace.github.io/linelist/)
39
-
package. Let's start by loading the package `{rio}` to read data and the package `{linelist}`
40
-
to create a linelist object. We'll use the pipe `%>%` to connect some of their functions, including others from
41
-
the package `{dplyr}`, so let's also call to the tidyverse package:
42
41
43
42
```{r,eval=TRUE,message=FALSE,warning=FALSE}
44
43
# Load packages
45
-
library(tidyverse) # for {dplyr} functions and the pipe %>%
44
+
library(tidyverse) # to access {dplyr} functions and the pipe %>% operator from {magrittr}
46
45
library(rio) # for importing data
47
46
library(here) # for easy file referencing
48
-
library(linelist) # for taggin and validating
47
+
library(linelist) # for tagging and validating
49
48
```
50
49
51
50
::::::::::::::::::: checklist
52
51
53
-
### The double-colon
52
+
### The double-colon (`::`) operator
54
53
55
-
The double-colon `::` in R lets you call a specific function from a package without loading the entire package into the
56
-
current environment.
54
+
The `::` operator in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important
55
+
advantages, including the following:
57
56
58
-
For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package.
57
+
* Telling explicitly which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name.
58
+
* Allowing you to call a function from a package without loading the whole package
59
+
with `library()`.
59
60
60
-
This help us remember package functions and avoid namespace conflicts.
61
+
For example, the command `dplyr::filter(data, condition)` means we are calling
62
+
the `filter()` function from the `{dplyr}` package.
61
63
62
64
:::::::::::::::::::
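
As a small illustration of the first advantage, the sketch below uses base R only. The masking function `median()` defined here is a made-up example, not something from the lesson; it mimics what happens when two packages export a function with the same name.

```r
# A user-defined function that masks stats::median() in the global environment
median <- function(x) "masked"

values <- c(1, 5, 9)

# Without the prefix, R finds our masking function first
masked_result <- median(values)

# With the prefix, R always uses the version from the {stats} package
explicit_result <- stats::median(values)
```

Here `masked_result` holds the string `"masked"`, while `explicit_result` holds the actual median, `5`.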
63
65
64
66
65
67
66
-
Import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode.
67
-
This involves loading the dataset into the working environment and view its structure and content.
68
+
Import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading the dataset into the working environment and viewing its structure and content.
68
69
69
70
```{r, eval=FALSE}
70
71
# Read data
@@ -88,28 +89,27 @@ cleaned_data
88
89
89
90
:::::::::::::::::::::::: discussion
90
91
91
-
<!-- Have you ever experienced an unexpected change in the input data set when running an analysis during an emergency? How do you safeguard your analysis from this inconvenience? -->
92
+
<!-- Have you ever experienced an unexpected change in the input data set when running an analysis during an outbreak? How do you safeguard your analysis from this inconvenience? -->
92
93
93
94
### An unexpected change
94
95
95
96
You are in an emergency response situation. You need to generate daily situation reports. You automated your analysis to read data directly from the online server :grin:. However, the people in charge of the data collection/administration needed to **remove/rename/reformat** one variable you found helpful :disappointed:!
96
97
97
-
How can you detect if the data input is **still valid** to replicate the analysis code you wrote the day before?
98
+
How can you detect if the input data is **still valid** to replicate the analysis code you wrote the day before?
98
99
99
100
::::::::::::::::::::::::
100
101
101
102
:::::::::::::::::::::::: instructor
102
103
103
104
If learners do not have an experience to share, we as instructors can share one.
104
105
105
-
An scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The later can make decisions about the data structure that can affect downstream processes, impacting the time or the accuracy of the analysis results.
106
+
A scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The latter can make decisions about the data structure that can affect downstream processes, impacting the time or the accuracy of the analysis results.
106
107
107
108
::::::::::::::::::::::::
108
109
109
-
## Creating a linelist and tagging elements
110
+
## Creating a linelist and tagging columns
110
111
111
-
Once the data is loaded and cleaned, we convert the cleaned case data into a `linelist` object using `{linelist}` package, as in the
112
-
below code chunk.
112
+
Once the data is loaded and cleaned, we can convert the cleaned case data into a `linelist` object using the `{linelist}` package, as in the code chunk below.
113
113
114
114
```{r}
115
115
# Create a linelist object from cleaned data
@@ -125,26 +125,26 @@ linelist_data
125
125
```
126
126
127
127
The `{linelist}` package supplies tags for common epidemiological variables
128
-
and a set of appropriate data types for each. You can view the list of available tags by the variable name
129
-
and their acceptable data types for each using `linelist::tags_types()`.
128
+
and a set of appropriate data types for each. You can view the list of available tags by the variable name and their acceptable data types using the `linelist::tags_types()` function.
130
129
131
130
132
131
::::::::::::::::::::::::::::::::::::: challenge
133
132
134
-
Let's **tag** more variables. In new datasets, it will be frequent to have variable names different to the available tag names. However, we can associate them based on how variables were defined for data collection.
133
+
Let's **tag** more variables. In some datasets, it is possible to encounter variable names that are different from the available tag names. In such cases, we can associate them based on how variables were defined for data collection.
135
134
136
135
Now:
137
136
138
137
-**Explore** the available tag names in {linelist}.
139
-
-**Find** what other variables in the cleaned dataset can be associated with any of these available tags.
140
-
-**Tag** those variables as above using `linelist::make_linelist()`.
138
+
-**Find** what other variables in the input dataset can be associated with any of these available tags.
139
+
-**Tag** those variables as shown above using the `linelist::make_linelist()`
140
+
function.
141
141
142
142
:::::::::::::::::::: hint
143
143
144
144
You can access the list of available tag names in `{linelist}` using:
145
145
146
146
```{r, eval=FALSE}
147
-
# Get a list of available tags by name and data types
147
+
# Get a list of available tag names and data types
148
148
linelist::tags_types()
149
149
150
150
# Get a list of names only
@@ -166,7 +166,7 @@ linelist::make_linelist(
166
166
)
167
167
```
168
168
169
-
How these additional tags are visible in the output?
169
+
Are these additional tags visible in the output?
170
170
171
171
<!-- Do you want to see a display of available and tagged variables? You can explore the function `linelist::tags()` and read its [reference documentation](https://epiverse-trace.github.io/linelist/reference/tags.html). -->
172
172
@@ -177,32 +177,32 @@ How these additional tags are visible in the output?
177
177
## Validation
178
178
179
179
To ensure that all tagged variables are standardized and have the correct data
180
-
types, use the `linelist::validate_linelist()`, as
181
-
shown in the example below:
180
+
types, use the `linelist::validate_linelist()` function, as shown in the example below:
182
181
183
-
```r
182
+
```{r}
184
183
linelist::validate_linelist(linelist_data)
185
184
```
186
185
187
-
<!-- If your dataset requires a new tag, set the argument -->
188
-
<!-- `allow_extra = TRUE` when creating the linelist object with its corresponding-->
189
-
<!-- datatype. -->
186
+
<!-- If your dataset requires a new tag other than those defined in the -->
187
+
<!-- {linelist} package, use `allow_extra = TRUE` when creating the -->
188
+
<!-- linelist object with its corresponding datatype using the -->
189
+
<!-- `linelist::make_linelist()` function. -->
190
190
191
191
192
192
193
193
::::::::::::::::::::::::: challenge
194
194
195
-
Let's **validate** some tagged variables. Let's simulate a situation in an ongoing outbreak. You wake up one day to discover that the data stream you have rely on has a new set of entries (i.e., rows or observations) and one variable that has a change of data type.
195
+
Let's assume the following scenario during an ongoing outbreak. You notice at some point that the data stream you have been relying on has a set of new entries (i.e., rows or observations), and the data type of one variable has changed.
196
196
197
-
For example, let's assume the variable `age` changed from a double (`<dbl>`) variable to character (`<chr>`).
197
+
Let's consider an example where the type of the `age` variable has changed from double (`<dbl>`) to character (`<chr>`).
198
198
199
199
To simulate this situation:
200
200
201
-
-**Change** the variable data type,
201
+
-**Change** the data type of the variable,
202
202
-**Tag** the variable into a linelist, and then
203
203
-**Validate** it.
204
204
205
-
Describe how `linelist::validate_linelist()` reacts when input data has a different variable data type.
205
+
Describe how `linelist::validate_linelist()` reacts when there is a change in the data type of one variable of the input data.
206
206
207
207
:::::::::::::::::::::::::: hint
208
208
@@ -224,8 +224,6 @@ cleaned_data %>%
224
224
225
225
> Please run the code line by line, focusing only on the parts before the pipe (`%>%`). After each step, observe the output before moving to the next line.
226
226
227
-
If the `age` variable changes from double (`<dbl>`) to character (`<chr>`) we get the following:
228
-
229
227
```{r}
230
228
cleaned_data %>%
231
229
# simulate a change of data type in one variable
@@ -242,12 +240,12 @@ Why are we getting an `Error` message?
242
240
243
241
<!-- Should we have a `Warning` message instead? Explain why. -->
244
242
245
-
Explore other situations to understand this behavior. Let's try these additional changes to variables:
243
+
Explore other situations to understand this behavior by converting:
246
244
247
-
-`date_onset`changes from a `<date>` variable to character (`<chr>`),
248
-
-`gender`changes from a character (`<chr>`) variable to integer (`<int>`).
245
+
-`date_onset` from `<date>` to character (`<chr>`),
246
+
-`gender` from character (`<chr>`) to integer (`<int>`).
249
247
250
-
Then tag them into a linelist for validation. Does the `Error` message propose to us the solution?
248
+
Then tag them into a linelist for validation. Does the `Error` message suggest a fix to the issue?
251
249
252
250
::::::::::::::::::::::::::
253
251
@@ -283,7 +281,7 @@ cleaned_data %>%
283
281
linelist::validate_linelist()
284
282
```
285
283
286
-
We get `Error` messages because of the mismatch between the predefined tag type (from`linelist::tags_types()`) and the tagged variable class in the linelist.
284
+
We get `Error` messages because the types accepted for these tags in `linelist::tags_types()` differ from the types we converted the variables to.
287
285
288
286
The `Error` message informs us that in order to **validate** our linelist, we must fix the input variable type to fit the expected tag type. In a data analysis script, we can do this by adding one cleaning step to the pipeline.
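
Such a cleaning step could look like the sketch below. It is a minimal illustration, assuming `{dplyr}` and `{linelist}` are installed; the data frame `raw_cases` is invented here to mimic a data stream in which `age` arrives as character.

```r
library(dplyr)
library(linelist)

# Hypothetical input in which `age` arrives as character instead of numeric
raw_cases <- data.frame(
  case_id = c("c1", "c2"),
  age = c("34", "27"),
  stringsAsFactors = FALSE
)

validated <- raw_cases %>%
  # cleaning step: restore the type expected by the `age` tag
  dplyr::mutate(age = as.numeric(age)) %>%
  # tag the variables
  linelist::make_linelist(
    id = "case_id",
    age = "age"
  ) %>%
  # validation now passes because `age` is numeric again
  linelist::validate_linelist()
```

Placing the conversion before `make_linelist()` keeps the fix in one visible place in the script, so a later change in the input type fails at validation rather than deeper in the analysis.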
289
287
@@ -293,17 +291,17 @@ The `Error` message inform us that in order to **validate** our linelist, we mus
293
291
294
292
::::::::::::::::::::::::: challenge
295
293
296
-
What step along the `{linelist}` workflow of tagging and validating would response to the absence of a variable?
294
+
Beyond tagging and validating the linelist object, what extra step do we need when building the object?
297
295
298
296
:::::::::::::::::::::::::: solution
299
297
300
-
About losing variables, you can simulate this scenario:
298
+
Let's simulate a scenario in which a variable is lost:
301
299
302
300
```{r}
303
301
cleaned_data %>%
304
-
# simulate a change of data type in one variable
302
+
# remove the variable 'age'
305
303
select(-age) %>%
306
-
# tag one variable
304
+
# tag the variable 'age', which no longer exists
307
305
linelist::make_linelist(
308
306
age = "age"
309
307
)
@@ -316,35 +314,35 @@ cleaned_data %>%
316
314
317
315
## Safeguarding
318
316
319
-
Safeguarding is implicitly built into the linelist objects. If you try to drop any of the tagged
320
-
columns, you will receive an error or warning message, as shown in the example below.
317
+
Safeguarding is implicitly built into the linelist objects. If you try to drop any of the tagged columns, you will receive an error or warning message, as shown in the example below.
321
318
322
319
```{r, warning=TRUE}
323
320
new_df <- linelist_data %>%
324
321
dplyr::select(case_id, gender)
325
322
```
326
323
327
-
This `Warning` message above is the default output option when we lose tags in a `linelist` object. However, it can be changed to an `Error` message using `linelist::lost_tags_action()`.
324
+
This `Warning` message above is the default output option when we lose tags in a `linelist` object. However, it can be changed to an `Error` message using the `linelist::lost_tags_action()` function.
328
325
329
326
::::::::::::::::::::::::::::::::::::: challenge
330
327
331
328
Let's test the implications of changing the **safeguarding** configuration from a `Warning` to an `Error` message.
332
329
333
-
- First, run this code to count the frequency per category within a categorical variable:
330
+
- First, run this code to count the frequency of each category within a categorical variable:
334
331
335
332
```{r,eval=FALSE}
336
333
linelist_data %>%
337
334
dplyr::select(case_id, gender) %>%
338
335
dplyr::count(gender)
339
336
```
340
337
341
-
- Set behavior for lost tags in a `linelist` to "error" as follows:
338
+
- Set the behavior for lost tags in a `linelist` to "error" as follows:
342
339
343
340
```{r, eval=FALSE}
344
341
# set behavior to "error"
345
342
linelist::lost_tags_action(action = "error")
346
-
```
347
-
- Now, re-run the above code segment with `dplyr::count()`.
343
+
```
344
+
345
+
- Now, re-run the above code chunk with `dplyr::count()`.
348
346
349
347
Identify:
350
348
@@ -368,7 +366,7 @@ linelist::lost_tags_action()
368
366
369
367
::::::::::::::::::::::::::::::::::::::::::::::::
370
368
371
-
A `linelist` object resembles a data frame but offers richer features
369
+
A `linelist` object resembles a data frame but offers richer features
372
370
and functionalities. Packages that are linelist-aware can leverage these
373
371
features. For example, you can extract a data frame of only the tagged columns
374
372
using the `linelist::tags_df()` function, as shown below:
@@ -377,23 +375,22 @@ using the `linelist::tags_df()` function, as shown below:
377
375
linelist::tags_df(linelist_data)
378
376
```
379
377
380
-
This allows, the extraction of use tagged-only columns in downstream analysis, which will be useful for the next episode!
378
+
This allows you to use only the tagged variables in downstream analysis, which will be useful for the next episode!
381
379
382
380
:::::::::::::::::::::::::::::::::::: checklist
383
381
384
382
### When should I use `{linelist}`?
385
383
386
384
Data analysis during an outbreak response or mass-gathering surveillance demands a different set of "data safeguards" compared to usual research situations. For example, your data will change or be updated over time (e.g. new entries, new variables, renamed variables).
387
385
388
-
`{linelist}` is more appropriate for this type of ongoing or long-lasting analysis.
389
-
Check the "Get started" vignette section about
390
-
[When you should consider using {linelist}?](https://epiverse-trace.github.io/linelist/articles/linelist.html#should-i-use-linelist) for more information.
386
+
`{linelist}` is more appropriate for this type of ongoing or long-lasting analysis. Check the "Get started" vignette section about
387
+
[When should I consider using {linelist}?](https://epiverse-trace.github.io/linelist/articles/linelist.html#should-i-use-linelist) for more information.
391
388
392
389
:::::::::::::::::::::::::::::::::::::::::::
393
390
394
391
395
392
::::::::::::::::::::::::::::::::::::: keypoints
396
393
397
-
- Use `{linelist}` package to tag, validate, and prepare case data for downstream analysis.
394
+
- Use the `{linelist}` package to tag, validate, and prepare case data for downstream analysis.