update data validation episode

Karim-Mane · Karim-Mane · commit a53d9f393ec6 · 2025-07-03T11:18:58.000Z
diff --git a/episodes/validate.Rmd b/episodes/validate.Rmd
@@ -12,7 +12,7 @@ exercises: 2
 
 ::::::::::::::::::::::::::::::::::::: objectives
 
-- Demonstrate how to covert case data to `linelist` data
+- Demonstrate how to convert case data into `linelist` data
 - Demonstrate how to tag and validate data to make analysis more reliable
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
@@ -21,50 +21,51 @@ exercises: 2
 
 This episode requires you to:
 
-- Download the [cleaned_data.csv](https://epiverse-trace.github.io/tutorials-early/data/cleaned_data.csv)
-- Save it in the `data/` folder.
+- Download the [cleaned_data.csv](https://epiverse-trace.github.io/tutorials-early/data/cleaned_data.csv) file
+- Save it in the `data/` folder.
 
 :::::::::::::::::::::
 
 ## Introduction
 
-In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data,
-it's essential to establish an additional foundation layer to ensure the integrity and reliability of subsequent
-analyses. Otherwise you might find that your analysis suddenly stops working when specific variables appear or disappear, or their underlying data types (like `<date>` or `<chr>`) change. Specifically, this additional layer involves: 1) verifying the presence and correct data type of certain columns within
-your dataset, a process commonly referred to as **tagging**; 2) implementing measures to 
-check that these tagged columns are not inadvertently deleted during further data processing steps, known as **validation**.
+In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, it's essential to establish an additional foundational layer to ensure the integrity and reliability of subsequent analyses. Otherwise, you might encounter issues during the analysis when specific variables are created or removed, or when their underlying data types (like `<date>` or `<chr>`) change. Specifically, this additional layer involves:
 
+1. Verifying the presence and correct data type of certain columns within
+your dataset, a process commonly referred to as **tagging**; 
+2. Implementing measures to make sure that these tagged columns are not inadvertently deleted during further data processing steps, known as **validation**.
+
+
+This episode focuses on tagging and validating outbreak data using the [linelist](https://epiverse-trace.github.io/linelist/) package. Let's start by loading the `{rio}` package to read data and the `{linelist}` package
+to create a linelist object. We'll use the pipe operator (`%>%`) to connect some of their functions, including others from the `{dplyr}` package. For this reason, we will also load the `{tidyverse}` package.
 
-This episode focuses on tagging and validate outbreak data using the [linelist](https://epiverse-trace.github.io/linelist/)
- package. Let's start by loading the package `{rio}` to read data and the package `{linelist}` 
-to create a linelist object. We'll use the pipe `%>%` to connect some of their functions, including others from 
-the package `{dplyr}`, so let's also call to the tidyverse package:
 
 ```{r,eval=TRUE,message=FALSE,warning=FALSE}
 # Load packages
-library(tidyverse) # for {dplyr} functions and the pipe %>%
+library(tidyverse) # to access {dplyr} functions and the pipe %>% operator from {magrittr}
 library(rio) # for importing data
 library(here) # for easy file referencing
-library(linelist) # for taggin and validating
+library(linelist) # for tagging and validating
 ```
 
 ::::::::::::::::::: checklist
 
-### The double-colon
+### The double-colon (`::`) operator
 
-The double-colon `::` in R lets you call a specific function from a package without loading the entire package into the 
-current environment. 
+The `::` operator in R lets you access functions or objects from a specific package without attaching the entire package to the search path. It offers several important
+advantages, including the following:
 
-For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package.
+* Making explicit which package a function comes from, reducing ambiguity and potential conflicts when several packages have functions with the same name.
+* Allowing you to call a function from a package without attaching the whole package
+with `library()`.
 
-This help us remember package functions and avoid namespace conflicts.
+For example, the command `dplyr::filter(data, condition)` means we are calling
+the `filter()` function from the `{dplyr}` package.
 
 :::::::::::::::::::
 
 
 
-Import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode.
- This involves loading the dataset into the working environment and view its structure and content. 
+Import the dataset following the guidelines outlined in the [Read case data](../episodes/read-cases.Rmd) episode. This involves loading the dataset into the working environment and viewing its structure and content.
 
 ```{r, eval=FALSE}
 # Read data
@@ -88,28 +89,27 @@ cleaned_data
 
 :::::::::::::::::::::::: discussion
 
-<!-- Have you ever experienced an unexpected change in the input data set when running an analysis during an emergency? How do you safeguard your analysis from this inconvenience? -->
+<!-- Have you ever experienced an unexpected change in the input data set when running an analysis during an outbreak? How do you safeguard your analysis from this inconvenience? -->
 
 ### An unexpected change
 
 You are in an emergency response situation. You need to generate daily situation reports. You automated your analysis to read data directly from the online server :grin:.  However, the people in charge of the data collection/administration needed to **remove/rename/reformat** one variable you found helpful :disappointed:!
 
-How can you detect if the data input is **still valid** to replicate the analysis code you wrote the day before?
+How can you detect if the input data is **still valid** to replicate the analysis code you wrote the day before?
 
 ::::::::::::::::::::::::
 
 :::::::::::::::::::::::: instructor
 
 If learners do not have an experience to share, we as instructors can share one.
 
-An scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The later can make decisions about the data structure that can affect downstream processes, impacting the time or the accuracy of the analysis results.
+A scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The latter can make decisions about the data structure that can affect downstream processes, impacting the time or the accuracy of the analysis results.
 
 ::::::::::::::::::::::::
 
-## Creating a linelist and tagging elements
+## Creating a linelist and tagging columns
 
-Once the data is loaded and cleaned, we convert the cleaned case data into a `linelist` object using `{linelist}` package, as in the 
-below code chunk.
+Once the data is loaded and cleaned, we can convert the cleaned case data into a `linelist` object using the `{linelist}` package, as in the code chunk below.
 
 ```{r}
 # Create a linelist object from cleaned data
@@ -125,26 +125,26 @@ linelist_data
 ```
 
 The `{linelist}` package supplies tags for common epidemiological variables 
-and a set of appropriate data types for each. You can view the list of available tags by the variable name
-and their acceptable data types for each using `linelist::tags_types()`.
+and a set of appropriate data types for each. You can view the list of available tags by the variable name and their acceptable data types using the `linelist::tags_types()` function.
 
 
 ::::::::::::::::::::::::::::::::::::: challenge 
 
-Let's **tag** more variables. In new datasets, it will be frequent to have variable names different to the available tag names. However, we can associate them based on how variables were defined for data collection.
+Let's **tag** more variables. In some datasets, it is possible to encounter variable names that are different from the available tag names. In such cases, we can associate them based on how variables were defined for data collection.
 
 Now:
 
 - **Explore** the available tag names in {linelist}.
-- **Find** what other variables in the cleaned dataset can be associated with any of these available tags.
-- **Tag** those variables as above using `linelist::make_linelist()`.
+- **Find** what other variables in the input dataset can be associated with any of these available tags.
+- **Tag** those variables as shown above using the `linelist::make_linelist()`
+function.
 
 :::::::::::::::::::: hint
 
 Your can get access to the list of available tag names in {linelist} using:
 
 ```{r, eval=FALSE}
-# Get a list of available tags by name and data types
+# Get a list of available tag names and data types
 linelist::tags_types()
 
 # Get a list of names only
@@ -166,7 +166,7 @@ linelist::make_linelist(
 )
 ```
 
-How these additional tags are visible in the output? 
+Are these additional tags visible in the output? 
 
 <!-- Do you want to see a display of available and tagged variables? You can explore the function `linelist::tags()` and read its [reference documentation](https://epiverse-trace.github.io/linelist/reference/tags.html). -->
 
@@ -177,32 +177,32 @@ How these additional tags are visible in the output?
 ## Validation
 
 To ensure that all tagged variables are standardized and have the correct data 
-types, use the `linelist::validate_linelist()`, as 
-shown in the example below:
+types, use the `linelist::validate_linelist()` function, as shown in the example below:
 
-```r
+```{r}
 linelist::validate_linelist(linelist_data)
 ```
 
-<!-- If your dataset requires a new tag, set the argument -->
-<!-- `allow_extra = TRUE` when creating the linelist object with its corresponding-->
-<!-- datatype. -->
+<!-- If your dataset requires a new tag other than those defined in the -->
+<!-- {linelist} package, use `allow_extra = TRUE` when creating the -->
+<!--  linelist object with its corresponding datatype using the  -->
+<!-- `linelist::make_linelist()` function. -->
 
 
 
 ::::::::::::::::::::::::: challenge
 
-Let's **validate** some tagged variables. Let's simulate a situation in an ongoing outbreak. You wake up one day to discover that the data stream you have rely on has a new set of entries (i.e., rows or observations) and one variable that has a change of data type. 
+Let's assume the following scenario during an ongoing outbreak. You notice at some point that the data stream you have been relying on has a set of new entries (i.e., rows or observations), and the data type of one variable has changed. 
 
-For example, let's assume the variable `age` changed from a double (`<dbl>`) variable to character (`<chr>`).
+Let's consider the example where the type of the `age` variable has changed from double (`<dbl>`) to character (`<chr>`).
 
 To simulate this situation:
 
-- **Change** the variable data type,
+- **Change** the data type of the variable,
 - **Tag** the variable into a linelist, and then 
 - **Validate** it.
 
-Describe how `linelist::validate_linelist()` reacts when input data has a different variable data type.
+Describe how `linelist::validate_linelist()` reacts when there is a change in the data type of one variable of the input data.
 
 :::::::::::::::::::::::::: hint
 
@@ -224,8 +224,6 @@ cleaned_data %>%
 
 > Please run the code line by line, focusing only on the parts before the pipe (`%>%`). After each step, observe the output before moving to the next line.
 
-If the `age` variable changes from double (`<dbl>`) to character (`<chr>`) we get the following:
-
 ```{r}
 cleaned_data %>%
   # simulate a change of data type in one variable
@@ -242,12 +240,12 @@ Why are we getting an `Error` message?
 
 <!-- Should we have a `Warning` message instead? Explain why. -->
 
-Explore other situations to understand this behavior. Let's try these additional changes to variables:
+Explore other situations to understand this behavior by converting:
 
-- `date_onset` changes from a `<date>` variable to character (`<chr>`), 
-- `gender` changes from a character (`<chr>`) variable to integer (`<int>`).
+- `date_onset` from `<date>` to character (`<chr>`),
+- `gender` from character (`<chr>`) to integer (`<int>`).
 
-Then tag them into a linelist for validation. Does the `Error` message propose to us the solution?
+Then tag them into a linelist for validation. Does the `Error` message suggest a fix for the issue?
 
 ::::::::::::::::::::::::::
 
@@ -283,7 +281,7 @@ cleaned_data %>%
   linelist::validate_linelist()
 ```
 
-We get `Error` messages because of the mismatch between the predefined tag type (from `linelist::tags_types()`) and the tagged variable class in the linelist.
+We get `Error` messages because the expected type of these variables in `linelist::tags_types()` differs from the type we converted them to.
 
 The `Error` message inform us that in order to **validate** our linelist, we must fix the input variable type to fit the expected tag type. In a data analysis script, we can do this by adding one cleaning step into the pipeline.
 
@@ -293,17 +291,17 @@ The `Error` message inform us that in order to **validate** our linelist, we mus
 
 ::::::::::::::::::::::::: challenge
 
-What step along the `{linelist}` workflow of tagging and validating would response to the absence of a variable?
+Beyond tagging and validating the linelist object, what extra step do we need when building the object?
 
 :::::::::::::::::::::::::: solution
 
-About losing variables, you can simulate this scenario:
+Let's simulate a scenario in which a variable is lost:
 
 ```{r}
 cleaned_data %>%
-  # simulate a change of data type in one variable
+  # remove the variable 'age'
   select(-age) %>%
-  # tag one variable
+  # tag the variable 'age', which no longer exists
   linelist::make_linelist(
     age = "age"
   )
@@ -316,35 +314,35 @@ cleaned_data %>%
 
 ## Safeguarding
 
-Safeguarding is implicitly built into the linelist objects. If you try to drop any of the tagged 
-columns, you will receive an error or warning message, as shown in the example below.
+Safeguarding is implicitly built into linelist objects. If you try to drop any of the tagged columns, you will receive an error or warning message, as shown in the example below.
 
 ```{r, warning=TRUE}
 new_df <- linelist_data %>%
   dplyr::select(case_id, gender)
 ```
 
-This `Warning` message above is the default output option when we lose tags in a `linelist` object. However, it can be changed to an `Error` message using `linelist::lost_tags_action()`. 
+This `Warning` message above is the default output option when we lose tags in a `linelist` object. However, it can be changed to an `Error` message using the `linelist::lost_tags_action()` function. 
 
 ::::::::::::::::::::::::::::::::::::: challenge 
 
 Let's test the implications of changing the **safeguarding** configuration from a `Warning` to an `Error` message.
 
-- First, run this code to count the frequency per category within a categorical variable:
+- First, run this code to count the frequency of each category within a categorical variable:
 
 ```{r,eval=FALSE}
 linelist_data %>%
   dplyr::select(case_id, gender) %>%
   dplyr::count(gender)
 ```
 
-- Set behavior for lost tags in a `linelist` to "error" as follows:
+- Set the behavior for lost tags in a `linelist` to "error" as follows:
 
 ```{r, eval=FALSE}
 # set behavior to "error"
 linelist::lost_tags_action(action = "error")
-```  
-- Now, re-run the above code segment with `dplyr::count()`.
+```
+
+- Now, re-run the above code chunk with `dplyr::count()`.
 
 Identify:
 
@@ -368,7 +366,7 @@ linelist::lost_tags_action()
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
-A  `linelist` object resembles a data frame but offers richer features 
+A `linelist` object resembles a data frame but offers richer features 
 and functionalities. Packages that are linelist-aware can leverage these 
 features. For example, you can extract a data frame of only the tagged columns 
 using the `linelist::tags_df()` function, as shown below:
@@ -377,23 +375,22 @@ using the `linelist::tags_df()` function, as shown below:
 linelist::tags_df(linelist_data)
 ```
 
-This allows, the extraction of use tagged-only columns in downstream analysis, which will be useful for the next episode!
+This allows downstream analysis to use only the tagged variables, which will be useful for the next episode!
 
 :::::::::::::::::::::::::::::::::::: checklist
 
 ### When should I use `{linelist}`?
 
 Data analysis during an outbreak response or mass-gathering surveillance demands a different set of "data safeguards" if compared to usual research situations. For example, your data will change or be updated over time (e.g. new entries, new variables, renamed variables).
 
-`{linelist}` is more appropriate for this type of ongoing or long-lasting analysis.
-Check the "Get started" vignette section about
-[When you should consider using {linelist}?](https://epiverse-trace.github.io/linelist/articles/linelist.html#should-i-use-linelist) for more information.
+`{linelist}` is more appropriate for this type of ongoing or long-lasting analysis. Check the "Get started" vignette section about
+[When should I consider using {linelist}?](https://epiverse-trace.github.io/linelist/articles/linelist.html#should-i-use-linelist) for more information.
 
 :::::::::::::::::::::::::::::::::::::::::::
 
 
 ::::::::::::::::::::::::::::::::::::: keypoints 
 
-- Use `{linelist}` package to tag, validate, and prepare case data for downstream analysis.
+- Use the `{linelist}` package to tag, validate, and prepare case data for downstream analysis.
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
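
As a rough illustration of the kind of check the validation step in this episode performs, the base R sketch below compares the class of each tagged column with a set of accepted classes. The helper `check_tagged_types()`, the `expected` list, and the toy data are hypothetical and made up for this note. They are not part of `{linelist}`, and `linelist::validate_linelist()` does more than this.

```r
# Hypothetical helper (not part of {linelist}): returns the names of tagged
# columns that are missing or whose class is not among the accepted classes.
check_tagged_types <- function(data, expected) {
  bad <- character(0)
  for (column in names(expected)) {
    is_missing <- !column %in% names(data)
    if (is_missing || !any(class(data[[column]]) %in% expected[[column]])) {
      bad <- c(bad, column)
    }
  }
  bad
}

# Toy data with made-up values, for illustration only
cases <- data.frame(
  case_id = 1:3,
  age = c(34, 27, 51),
  date_onset = as.Date(c("2025-01-02", "2025-01-05", "2025-01-09"))
)

# Simplified assumption about which classes are acceptable for each tag
expected <- list(age = c("numeric", "integer"), date_onset = "Date")

# All tagged columns are present with an accepted class
check_tagged_types(cases, expected)

# Simulate an upstream change: 'age' becomes character, 'date_onset' is dropped
cases_changed <- cases
cases_changed$age <- as.character(cases_changed$age)
cases_changed$date_onset <- NULL
problems <- check_tagged_types(cases_changed, expected)
problems
```

Running a check like this at the top of a daily report script makes a changed or missing column visible before any downstream analysis runs, which is the role that tagging and validation play in the episode.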