generated from jhudsl/OTTR_Template
-
Notifications
You must be signed in to change notification settings - Fork 1
/
06-package-management.Rmd
195 lines (127 loc) · 12.8 KB
/
06-package-management.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
```{r, include = FALSE}
ottrpal::set_knitr_image_path()
```
# Managing package versions
## Learning Objectives
```{r, fig.alt="This chapter will demonstrate how to: Understand that versions of software influence analysis outcomes. Find what package versions you are using. Print session info in all of your analyses so it is more clear what packages and versions you are using.", out.width = "100%", echo = FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf62875ddf7_0_394")
```
As we discussed previously, sometimes two different researchers can run the same code and same data and get different results!
```{r, fig.alt="Ruby the researcher and Avi the associate are both very confused and slightly horrified that they both ran the same code and data but received different results. ", out.width = "100%", echo = FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf1accd298e_0_673")
```
What Ruby and Avi may not realize is that although they may have used the same code and data, the software packages that they have on each of their computers might be very different. Even if they have the same software packages, they likely don't have the same versions and versions can influence results! Different computing environments are not only a headache to detangle, they also can influence the reproducibility of your results [ @BeaulieuJones2017].
```{r, fig.alt="Ruby has a particular computing environment she has developed her code from. This computing environment is represented as a bubble above her computer with various hexagons with version numbers as well as Rstudio and R installed on her computer. Her code ran just fine on her particular computing environment. Avi attempted to run Ruby’s code on his very different local computing environment and got an error. His computer runs the same code but came up with a different result!", out.width = "100%", echo = FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf62875ddf7_0_404")
```
There are multiple ways to deal with variations in computing environments so that your analyses will be reproducible and we will discuss a few different strategies for tackling this problem in this course and its follow up course. But for now, we will start with the least intensive to implement: session info.
There are two strategies for dealing with software versions that we will discuss in this chapter. Either of these strategies can be used alone or you can use both. They address different aspects of the computing environment discrepancy problem.
### Strategy 1: Session Info - record a list of your packages
One strategy to combat different software versions is to list the **session info**. This is the easiest (though not most comprehensive) method for handling differences in software versions is to have your code list details about your computing environment.
Session info can lead to clues as to why results weren't reproducible. For example, if both Avi and Ruby ran notebooks and included a session info print out it may look like this:
```{r, fig.alt="Two session info print outs are show side by side. One is labeled as ‘Ruby’s session info print out’ and the other as ‘Avi’s session info print out’. Highlighted we can see that they have different R versions: 4.0.2 vs 4.0.5. They also have different operating systems. The packages they have attached is rmarkdown but they also have different rmarkdown package versions! If Avi and Ruby have discrepancies in their results, the session info print out gives a record which may have clues to why that might be! This can give them items to look into for determining why the results didn’t reproduce as expected.", out.width = "100%", echo = FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.g102dc56db08_0_0")
```
Session info shows us that they have different R versions and different operating systems. The packages they have attached is rmarkdown but they also have different rmarkdown package versions. If Avi and Ruby have discrepancies in their results, the session info print out gives a record which may have clues for any discrepancies. This can give them items to look into for determining why the results didn’t reproduce as expected.
### Strategy 2: Package managers - share a useable snapshot of your environment
**Package managers** can help handle your computing environment for you in a way that you can share them with others. In general, package managers work by capturing a snapshot of the environment and when that environment snapshot is shared, it attempt to rebuild it. For R and Python versions of the exercises, we will be using different managers, but the foundational strategy will be the same: include a file that someone else could replicate your package set up from.
```{r, fig.alt="In general, package managers work by capturing a snapshot of the environment and when that environment snapshot is shared, it attempt to rebuild it. In this example we show that Ruby has an environment, and using a package manager, has taken a snapshot of her computing environment. That snapshot is shared with Avi, who can use the package manager to attempt to build it on his own computer. This will help address some differences in package versions between two individual’s computers.", out.width = "100%", echo = FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.g102dc56db08_49_26")
```
For both exercises, we will download an environment 'snapshot' file we've set up for you, then we will practice adding a new package to the environments we've provided, and add them to your new repository along with the rest of your example project files.
- For Python, we'll use `conda` for package management and store this information in a `environment.yml` file.
- For R, we'll use `renv` for package management and store this information in a `renv.lock` file.
## Get the exercise project files (or continue with the files you used in the previous chapter)
<details> <summary>**Get the Python project example files**</summary>
[Click this link to download](https://raw.githubusercontent.com/jhudsl/Reproducibility_in_Cancer_Informatics/main/chapter-zips/python-heatmap-chapt-6.zip).
```{bash, include = FALSE}
mkdir -p chapter-zips
wget -O chapter-zips/python-heatmap-chapt-6.zip https://raw.githubusercontent.com/jhudsl/Reproducibility_in_Cancer_Informatics/main/chapter-zips/python-heatmap-chapt-6.zip
```
Now double click your chapter zip file to unzip. For Windows you may have to [follow these instructions](https://support.microsoft.com/en-us/windows/zip-and-unzip-files-f6dde0a7-0fec-8294-e1d3-703ed85e7ebc).
```{bash, include = FALSE}
unzip -o chapter-zips/python-heatmap-chapt-6.zip -d chapter-zips/
```
</details>
<details> <summary>**Get the R project example files**</summary>
[Click this link to download](https://raw.githubusercontent.com/jhudsl/Reproducibility_in_Cancer_Informatics/main/chapter-zips/r-heatmap-chapt-6.zip).
```{bash, include = FALSE}
mkdir -p chapter-zips
wget -O chapter-zips/r-heatmap-chapt-6.zip https://raw.githubusercontent.com/jhudsl/Reproducibility_in_Cancer_Informatics/main/chapter-zips/r-heatmap-chapt-6.zip
```
Now double click your chapter zip file to unzip. For Windows you may have to [follow these instructions](https://support.microsoft.com/en-us/windows/zip-and-unzip-files-f6dde0a7-0fec-8294-e1d3-703ed85e7ebc).
```{bash, include = FALSE}
unzip -o chapter-zips/r-heatmap-chapt-6.zip -d chapter-zips/
```
</details>
## Exercise 1: Print out session info
<details> <summary>**Python version of the exercise**</summary>
In your scientific notebook, you'll need to add two items.
1. Add the `import session_info` to a code chunk at the beginning of your notebook.
2. Add `session_info.show()` to a new code chunk at the very end of your notebook.
2. Save your notebook as is. Note it will not run correctly until we address the issues with the code in the next chapter.
</details>
<details> <summary>**R version of the exercise**</summary>
1. In your Rmd file, add a chunk in the very end that looks like this:
`````
```{r}
sessionInfo()
```
`````
2. Save your notebook as is. Note it will not run correctly until we address the issues with the code in the next chapter.
</details>
## Exercise 2: Package management
<details> <summary>**Python version of the exercise**</summary>
1. Download this [starter conda environment.yml file](https://raw.githubusercontent.com/jhudsl/reproducible-python-example/main/environment.yml) by clicking on the link and place it with your example project files directory.
```{bash, include = FALSE}
wget -nc https://raw.githubusercontent.com/jhudsl/reproducible-python-example/main/environment.yml
```
2. Navigate to your example project files directory using [command line](https://towardsdatascience.com/a-quick-guide-to-using-command-line-terminal-96815b97b955).
3. Create your conda environment by using this file in the command.
```
conda env create --file environment.yml
```
4. Activate your conda environment using this command.
```
conda activate reproducible-python
```
5. Now start up JupyterLab again using this command:
```
jupyter lab
```
6. Follow [these instructions to add the environment.yml file to the GitHub repository you created in the previous chapter](https://docs.github.com/en/repositories/working-with-files/managing-files/adding-a-file-to-a-repository#adding-a-file-to-a-repository-on-github). Later we will practice and discuss how to more fully utilize the features of GitHub but for now, just drag and drop it as the instructions linked describe.
### More resources on how to use conda
- [Install Jupyter using your own environment (Mac specific)](https://medium.com/swlh/installing-jupyter-notebook-and-using-your-own-environment-on-mac-fa41efd4639d)
- [Definitive guide to using conda](https://whiteboxml.com/blog/the-definitive-guide-to-python-virtual-environments-with-conda)
</details>
<details> <summary>**R version of the exercise**</summary>
**First install the `renv` package**
1. Go to RStudio and the Console pane:
2. Install `renv` using (you should only need to do this once per your computer or RStudio environment).
```
install.packages("renv")
```
**Now set up `renv` to use in your project**
1. Change to your current directory for your project using `setwd()` in your console window (_don't put this in a script or notebook_).
2. Use this command in your project:
```
renv::init()
```
This will start up `renv` in your particular project
\*[What's `::` about?](https://stackoverflow.com/questions/35240971/what-are-the-double-colons-in-r) -- in brief it allows you to use a function from a package without loading the entire thing with `library()`.
3. Now you can develop your project as you normally would; installing and removing packages in R as you see fit. For the purposes of this exercise, let's install the `styler` package using the following command. (The styler package will come in handy for styling our code in the next chapter).
```
install.packages("styler")
```
Now that we have installed `styler` we will want to add it to our renv snapshot.
4. To add any packages we've installed to our renv snapshot we will use this command:
```
renv::snapshot()
```
This will save whatever packages we are currently using to our environment snapshot file called `renv.lock`. This `renv.lock` file is what we can share with our collaborators so they can replicate our computing environment.
If your package installation attempts are unsuccessful and you'd like to revert to the previous state of your environment, you can run `renv::restore()`. This will restore your `renv.lock` file to what it was before you attempted to install `styler` or whatever packages you tried to install.
5. You should see an `renv.lock` file is now created or updated! You will want to always include this file with your project files. This means we will want to add it to our GitHub!
6. Follow [these instructions to add your renv.lock file to the GitHub repository you created in the previous chapter](https://docs.github.com/en/repositories/working-with-files/managing-files/adding-a-file-to-a-repository#adding-a-file-to-a-repository-on-github). Later we will practice and discuss how to more fully utilize the features of GitHub but for now, just drag and drop it as the instructions linked describe.
</details>
After you've added your computing environment files to your GitHub, you're ready to continue using them with your IDE to actually work on the code in your notebook!
**Any feedback you have regarding this exercise is greatly appreciated; you can fill out [this form](https://forms.gle/ygSSwoGaEATA2S65A)!**