|
| 1 | +--- |
| 2 | +title: "Introduction to Statistics with R: Reproducible Research and Communication of Results" |
| 3 | +blurb: "Blue and white mean war" |
| 4 | +coverImage: 13 |
| 5 | +author: "Dereck Mezquita" |
| 6 | +date: 2023-10-20 |
| 7 | +tags: [statistics, mathematics, probability, data] |
| 8 | +published: true |
| 9 | +comments: true |
| 10 | +output: |
| 11 | + html_document: |
| 12 | + keep_md: true |
| 13 | +--- |
| 14 | + |
| 15 | +```{r setup, include=FALSE} |
| 16 | +# https://bookdown.org/yihui/rmarkdown-cookbook/hook-html5.html |
| 17 | +if (knitr::is_html_output()) knitr::knit_hooks$set( |
| 18 | + plot = function(x, options) { |
| 19 | + cap <- options$fig.cap |
| 20 | + # x <- paste0("/courses/", x) |
| 21 | + as.character(htmltools::tag( |
| 22 | + "Figure", list(src = x, alt = cap, paste("\n\t", cap, "\n", sep = "")) |
| 23 | + )) |
| 24 | + } |
| 25 | +) |
| 26 | + |
| 27 | +knitr::knit_hooks$set(optipng = knitr::hook_optipng) # optipng = '-o7' |
| 28 | +knitr::opts_chunk$set(dpi = 300, fig.width = 10, fig.height = 7) |
| 29 | +``` |
| 30 | + |
| 31 | +# Reproducible Research and Communication of Results |
| 32 | + |
| 33 | +## Literate Programming and Dynamic Documents |
| 34 | + |
| 35 | +Modern data science and statistical analysis benefit greatly from literate programming[^1], where code, narrative, and results coexist in a single document. This practice enhances reproducibility, transparency, and collaboration by bundling code, outputs, and prose together. Key tools in the R ecosystem for literate programming include **RMarkdown**, **Quarto**, and **knitr**, each enabling you to create dynamic, interactive documents that can be easily shared and updated. |
| 36 | + |
| 37 | +### RMarkdown, Quarto, and knitr |
| 38 | + |
| 39 | +**Conceptual Overview:** |
| 40 | +- **RMarkdown:** A file format (`.Rmd`) that merges code (R and other languages), markdown text, and output. When rendered, it produces outputs like HTML, PDF, or Word documents. |
| 41 | +- **Quarto:** A next-generation tool that extends beyond RMarkdown, supporting multiple languages (R, Python, Julia) and offering more publishing and formatting options. |
| 42 | +- **knitr:** The R package that underpins RMarkdown, facilitating the execution of code chunks and insertion of their results into documents. |
| 43 | + |
| 44 | +These tools allow you to present statistical analysis as a coherent narrative: you can explain your methods, show your code, and present results (tables, figures) inline. This reduces the “copy-paste” cycle, making your workflow more efficient, transparent, and less error-prone. |
| 45 | + |
| 46 | +**Key Advantages:** |
| 47 | +- **Reproducibility:** If someone reruns your `.Rmd` or Quarto document with the same data, package versions, and random seed, they should get the same results, which supports transparency and trust. |
| 48 | +- **Version Control:** Storing `.Rmd` or Quarto source files in Git allows you to track changes in code, text, and data processing steps. |
| 49 | +- **Communication:** By weaving code and outputs together, collaborators and stakeholders can follow the logic of the analysis, making complex methods more accessible. |
| 50 | + |
| 51 | +### Parameterised Reports and Interactive Notebooks |
| 52 | + |
| 53 | +**Parameterised Reports:** |
| 54 | +- Parameterised reports allow you to define parameters (e.g., dataset name, date range, filtering criteria) at the start of your document. |
| 55 | +- When you “knit” or “render” the report, you can supply different values for these parameters without editing the code. This is useful if you need to produce the same analysis for multiple groups, time periods, or scenarios. |
| 56 | + |
| 57 | +**Example:** |
| 58 | +Let’s assume we have a parameter `region` that specifies which subset of data we want to analyse. In an RMarkdown file, you can define parameters in the YAML header: |
| 59 | + |
| 60 | +\`\`\` |
| 61 | +--- |
| 62 | +title: "Sales Report" |
| 63 | +params: |
| 64 | + region: "Asia" |
| 65 | +output: html_document |
| 66 | +--- |
| 67 | +\`\`\` |
| 68 | + |
| 69 | +Within the report, you can refer to `params$region` to filter data accordingly. |
| 70 | + |
| 71 | +**Code Example (Assuming a CSV with sales data):** |
| 72 | +For demonstration, assume a file `sales_data.csv` with columns `Region`, `Product`, and `Sales`. Since no such file ships with this document, the chunk below simulates equivalent data. |
| 73 | + |
| 74 | +\`\`\{r echo=TRUE\} |
| 75 | +# Simulate some sales data if we don't have a real dataset: |
| 76 | +# This is for demonstration; in practice, you'd read a real dataset. |
| 77 | +if(!requireNamespace("data.table", quietly=TRUE)) { |
| 78 | + install.packages("data.table") |
| 79 | +} |
| 80 | +library(data.table) |
| 81 | + |
| 82 | +set.seed(123) |
| 83 | +dt <- data.table( |
| 84 | + Region=sample(c("Asia","Europe","Americas"), 100, replace=TRUE), |
| 85 | + Product=sample(c("Gadget","Widget","Thingamajig"), 100, replace=TRUE), |
| 86 | + Sales=round(runif(100, 10, 1000),2) |
| 87 | +) |
| 88 | + |
| 89 | +# Filter by parameter: |
| 90 | +selected_data <- dt[Region == params$region] |
| 91 | +head(selected_data) |
| 92 | +\`\`\` |
| 93 | + |
| 94 | +Running this report with `region = "Europe"` will produce a similar analysis focusing only on Europe’s data. |
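
Because a parameter value is supplied from outside the document, it can help to check it before filtering. The following is a minimal sketch, not part of the original report: `validate_region()` and `valid_regions` are hypothetical names, and the allowed values are assumed to be the three regions used in the simulated data.

```r
# Hypothetical helper: stop early if the region is not one we expect.
valid_regions <- c("Asia", "Europe", "Americas")  # assumed from the simulated data

validate_region <- function(region, valid = valid_regions) {
  # Expect exactly one non-missing character value.
  if (!is.character(region) || length(region) != 1L || is.na(region)) {
    stop("`region` must be a single character string.")
  }
  # Reject values that would silently produce an empty subset.
  if (!(region %in% valid)) {
    stop("Unknown region '", region, "'. Expected one of: ",
         paste(valid, collapse = ", "))
  }
  invisible(region)
}

# Inside the report this would be called as validate_region(params$region).
validate_region("Europe")
```

Failing early gives a clear error message instead of an empty table or plot further down the report.
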
| 95 | + |
| 96 | +**To render with different parameters:** |
| 97 | +You can run from the R console: |
| 98 | +\`\`\{r, eval=FALSE\} |
| 99 | +rmarkdown::render("report.Rmd", params=list(region="Europe")) |
| 100 | +\`\`\` |
| 101 | + |
| 102 | +This approach makes it easy to produce many customised reports programmatically. |
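
Producing many reports programmatically usually means looping over parameter values. The sketch below assumes a parameterised `report.Rmd` like the one above exists in the working directory; the region list and the output file naming scheme are illustrative assumptions. It skips rendering when the file or the rmarkdown package is missing.

```r
# Regions to report on (assumed; matches the simulated data above).
regions <- c("Asia", "Europe", "Americas")

# One output file per region, e.g. "sales-report-asia.html" (hypothetical naming).
output_files <- paste0("sales-report-", tolower(regions), ".html")

# Render only if the assumed source file and the rmarkdown package are available.
if (file.exists("report.Rmd") && requireNamespace("rmarkdown", quietly = TRUE)) {
  for (i in seq_along(regions)) {
    rmarkdown::render(
      "report.Rmd",
      params = list(region = regions[i]),
      output_file = output_files[i],
      envir = new.env()  # fresh environment so runs do not share objects
    )
  }
}
```

Giving each run its own `output_file` prevents later renders from overwriting earlier ones.
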
| 103 | + |
| 104 | +**Interactive Notebooks:** |
| 105 | +- RMarkdown and Quarto support notebook interfaces. |
| 106 | +- When you open an `.Rmd` file in RStudio, you can run code chunks interactively and see outputs immediately, making exploratory analysis more intuitive. |
| 107 | +- Quarto supports Jupyter-like notebook behaviour for multiple languages. |
| 108 | +- You can also embed `shiny` apps or `htmlwidgets` for truly interactive experiences. |
| 109 | + |
| 110 | +**Example with ggplot2 Plot:** |
| 111 | +\`\`\{r sales-plot, echo=TRUE, message=FALSE, warning=FALSE\} |
| 112 | +if(!requireNamespace("ggplot2", quietly=TRUE)) { |
| 113 | + install.packages("ggplot2") |
| 114 | +} |
| 115 | +library(ggplot2) |
| 116 | + |
| 117 | +ggplot(selected_data, aes(x=Product, y=Sales)) + |
| 118 | + geom_boxplot(fill="steelblue", alpha=0.7) + |
| 119 | + labs(title=paste("Sales Distribution in", params$region), |
| 120 | + x="Product", y="Sales") + |
| 121 | + theme_minimal() |
| 122 | +\`\`\` |
| 123 | + |
| 124 | +This code chunk will produce a boxplot of Sales by Product for the specified region. By changing the parameter `region` in your report’s YAML header or through `rmarkdown::render()`, you instantly get a new plot for a different region—no manual changes needed. |
| 125 | + |
| 126 | +### Quarto vs RMarkdown |
| 127 | + |
| 128 | +**Quarto** builds on the concept of RMarkdown, offering: |
| 129 | +- **More Extensive Language Support:** R, Python, Julia, and Observable JS. |
| 130 | +- **Flexible Output Formats:** Quarto can produce scientific articles, books, websites, and blogs more seamlessly. |
| 131 | +- **Built-In Visual Themes:** Enhanced styling and layout options. |
| 132 | + |
| 133 | +If you’re starting fresh, you might consider Quarto as a more modern choice, but RMarkdown remains widely used and integrated into RStudio. |
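
For comparison, parameters can also be passed to a Quarto document from R. This is a sketch under assumptions: `report.qmd` is a hypothetical Quarto file that declares a `region` parameter in its YAML header, and the `quarto` R package plus the Quarto command-line tool are installed. Rendering is skipped if the file or package is missing.

```r
# Parameter values to pass to the (assumed) Quarto document.
quarto_params <- list(region = "Europe")

# Render only if the hypothetical report.qmd and the quarto package are present.
# quarto::quarto_render() also needs the Quarto command-line tool installed.
if (file.exists("report.qmd") && requireNamespace("quarto", quietly = TRUE)) {
  quarto::quarto_render("report.qmd", execute_params = quarto_params)
}
```
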
| 134 | + |
| 135 | +### knitr Under the Hood |
| 136 | + |
| 137 | +**knitr:** The engine that executes code chunks in `.Rmd` files, capturing outputs and inserting them into the final document. It’s highly customisable: |
| 138 | +- Control code chunk options (`echo`, `eval`, `results`, `warning`). |
| 139 | +- Cache chunk results (`cache = TRUE`) to speed up repeated renders. |
| 140 | +- Manage figure sizes, captions, and layouts easily. |
| 141 | + |
| 142 | +**Example of Chunk Options:** |
| 143 | +\`\`\{r summary, echo=TRUE, results="markup"\} |
| 144 | +summary(selected_data$Sales) |
| 145 | +\`\`\` |
| 146 | + |
| 147 | +Here `echo=TRUE` shows the code and `results="markup"` prints the summary of `Sales` as formatted text directly in the output, so no copy-paste is needed. |
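
Chunk options can also be set once for the whole document instead of on every chunk, typically in a setup chunk near the top. Below is a minimal sketch using `knitr::opts_chunk$set()`; the particular values are illustrative assumptions, not recommendations from the text above.

```r
if (requireNamespace("knitr", quietly = TRUE)) {
  # Defaults applied to every later chunk unless a chunk overrides them.
  knitr::opts_chunk$set(
    echo = TRUE,       # show source code
    warning = FALSE,   # hide warnings in the output
    message = FALSE,   # hide messages such as package startup notes
    cache = TRUE,      # reuse stored results when a chunk is unchanged
    fig.width = 6,     # figure width in inches
    fig.height = 4     # figure height in inches
  )

  # Inspect one of the current defaults.
  knitr::opts_chunk$get("cache")
}
```

An individual chunk can still override any of these defaults in its own header.
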
| 148 | + |
| 149 | +--- |
| 150 | + |
| 151 | +**Key Takeaways:** |
| 152 | +- **Literate Programming:** Combine code, text, and results for reproducible, communicative research documents. |
| 153 | +- **RMarkdown/Quarto:** Flexible formats that let you produce static and interactive documents, enhancing collaboration and reproducibility. |
| 154 | +- **Parameterised Reports:** Automate repetitive analysis tasks by setting parameters at render time. |
| 155 | +- **Interactive Notebooks:** Run code live, see outputs instantly, and refine analysis on the fly. |
| 156 | + |
| 157 | +By using RMarkdown or Quarto (backed by knitr), you create living documents that tell the full story of your data analysis journey. These tools help ensure your work can be validated, reused, and understood by collaborators, reviewers, and future you—a crucial asset in any data science workflow. |
| 158 | + |
| 159 | +[^1]: Knuth, D. E. (1984). Literate Programming. The Computer Journal, 27(2), 97–111. |
| 160 | + |