---
title: "Data Strategies for a Future Us"
author: "Andy P. Barrett"
date: last-modified
date-format: iso
format:
  html:
    toc: true
    colorlinks: true
---
## Introduction
We are all data wranglers or data scientists. It is our job to
collect data, process data, analyse data, and generate results in the
form of plots, tables, and eventually new knowledge. In the past,
many of us have done this work in isolation: we store data on our
laptops and share files by email, Dropbox, Google Drive, and the like
as needed. This approach has mostly worked. However, more
collaborative, open, and reproducible ways of working, along with
growing data volumes and new ways of accessing data, require us
to find new, or at least modified, data strategies.
## What are Data Strategies?
Collaborating with a research team or the wider scientific community
requires us to solve problems of data access and to share knowledge about
how data were analyzed. How do we make data available to others? How
do we document the processing steps? Can you remember how a dataset
you processed last week was cleaned? If your computer crashes or
gets stolen, could you reproduce your processing workflow easily?
When it comes to submitting a paper, finishing a project, handing off to someone
else on the team, or picking up a project after vacation, how easy is it
to do so? Data strategies, as used here, is a catchall term
that covers where to process data (locally or on a remote system),
how to document and manage data processing for a future you or for others,
how to make data accessible to others, and how to make the processing
and analysis workflows reproducible.
## Who is a future us?
A future us is, first and foremost, you: tomorrow or next week when you
pick up a project again, next month when you discover
a problem with the data, or six months to a year from now when
"reviewer 2" requests major changes. It is other members of your team
who might want to work with the data and want to know where it came
from and how it was processed. And it is the wider scientific
community, who may want to reproduce your work.
## A simple data workflow
```{mermaid}
flowchart TD
    A["Collect Data<br>Field, Remote Sensing, Model"] --> B["Clean Data<br>Calibration, QC, Reprojection"]
    B --> C["Aggregate<br>Time or spatial means"]
    C --> D["Analysis"]
    D --> E["Make Plots and Tables"]
    E --> F["Write Manuscript"]
    F -- Review --> B
    E -- Iterate --> C
```
## Collaborative and reproducible workflows
### Where to process data?
- Download data to your local machine and process locally
- Do some processing in the cloud to reduce data size
- Do all of your processing in the cloud

Which approach to take depends on data volume and your resources:

- How long will it take to download the data at your connection speed (Mbps)? (See the back-of-envelope sketch after this list.)
- Can you store all of that data (cost and space)?
- Do you have the computing power for processing?
- Does your team need a common computing environment?
- Do you need to share data at each step, or just an end product?
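As a rough, back-of-envelope check, download time follows directly from data volume and sustained link speed. This is only a sketch; the numbers below are illustrative and not from the original text:

```
# Illustrative values only
data_volume_gb = 500        # total size of the dataset
link_speed_mbps = 100       # sustained download speed

# 1 GB is 8,000 megabits; divide by link speed for seconds, then convert to hours
hours = data_volume_gb * 8_000 / link_speed_mbps / 3600
print(f"~{hours:.1f} hours to download")   # ~11.1 hours
```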
## What does this look like in the cloud?
*Placeholder for examples of hybrid and cloud workflows*
## How to future-proof workflows and make them reproducible
**FAIR** (**F**indable, **A**ccessible, **I**nteroperable, **R**eusable) applies to
the future you and your team as well.
## Document the Analysis
Document each step: where did you get the data, which files, which version? Write it down. Anywhere is good, but capturing it in a script is better.
Example tools: `earthaccess`, `wget`, or `curl`.
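For instance, a short script records exactly what was retrieved and from where. A minimal sketch using `earthaccess` is below; the dataset short name, date range, bounding box, and download directory are placeholders, not values from this project:

```
import earthaccess

# Authenticate with NASA Earthdata Login (credentials from ~/.netrc or a prompt)
earthaccess.login()

# Search for granules; short_name, temporal, and bounding_box are illustrative only
results = earthaccess.search_data(
    short_name="EXAMPLE_DATASET",
    temporal=("2023-01-01", "2023-12-31"),
    bounding_box=(-180, 60, 180, 90),
)

# Download into the raw data directory so the retrieval step is recorded in code
earthaccess.download(results, "Data/raw")
```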
Can you (or anyone else) easily reproduce your processing pipeline? Having it scripted, so that a few commands can set the whole process going, is really helpful.
How can this be done with GUI tools such as ArcGIS, QGIS, or Excel? Keep screenshots or a journal of the commands and menu choices you used.
## Make sure data are **F**indable and **A**ccessible
Does everyone know where your data are, and can they access them? *E.g. a
shared machine, Dropbox, Google Drive, or cloud storage.*
## Data Management
Raw data is immutable^[The file cannot be changed after it is
created]!
Save intermediate data, not just the final product; this saves processing time if you
have to redo one or a few steps.
Use consistent folder and file name patterns.
```
(base) nsidc-442-abarrett:data_strategies_for_a_future_us$ tree Data
Data
├── calibrated
├── cleaned
├── figures
├── final
├── monthly_averages
├── raw
└── results
7 directories, 0 files
```
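One way to set this structure up reproducibly is to create the directories from a small script. This is a sketch only; the directory names match the tree above:

```
from pathlib import Path

# Create the standard project data directories if they do not already exist
for name in ["raw", "cleaned", "calibrated", "monthly_averages",
             "results", "figures", "final"]:
    (Path("Data") / name).mkdir(parents=True, exist_ok=True)
```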
Set up config files both to document file paths and naming patterns, and so that
your scripts can use them without hardcoding paths into your analysis code.
```
# filepaths.py - project configuration for data paths
from pathlib import Path

# "path_to_project" is a placeholder for your project directory
DATAPATH = Path.home() / "path_to_project" / "Data"

RAW_DATA_PATH = DATAPATH / "raw"
CLEANED_DATA_PATH = DATAPATH / "cleaned"
FINAL_DATA_PATH = DATAPATH / "final"
MONTHLY_PATH = DATAPATH / "monthly_averages"
RESULTS_PATH = DATAPATH / "results"    # Final results tables and files
FIGURES_PATH = DATAPATH / "figures"    # Plots etc.
```
In any script you can then do something like...
```
from filepaths import FINAL_DATA_PATH
files = FINAL_DATA_PATH.glob('*')
```
## Use standard file formats and include metadata
It can take time to set up, but it will save time in the long run.
Use metadata standards.
### Common File Formats
- GeoTIFF for imagery or 2D raster data
- NetCDF for multi-dimensional data (3D+)
- Shapefiles or GeoJSON for vector data
- CSV for tabular data (make it *tidy*^[A framework for organising
  tabular data. See [Wickham, 2014. *Tidy Data*](https://doi.org/10.18637/jss.v059.i10)]; see the example below)
If you are working in the cloud, *C*loud *O*ptimized *F*ormats can
help.
Avoid Excel and proprietary formats.
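As a quick illustration of tidy tabular data written to plain CSV, here is a minimal sketch using pandas; the column names, values, and output path are made up for this example:

```
import pandas as pd

# A "wide" table with one column per station is awkward to filter and extend
wide = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02"],
    "station_a": [1.2, 0.8],
    "station_b": [2.4, 1.9],
})

# Tidy form: one observation per row, one variable per column
tidy = wide.melt(id_vars="date", var_name="station", value_name="snow_depth_m")

# Write as plain CSV so any tool can read it
tidy.to_csv("Data/final/snow_depth.csv", index=False)
```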
## Metadata, metadata, metadata
Metadata makes your data interoperable and reusable.
Metadata standards and conventions ensure that standard tools can
read and interpret the data, and that there is a common
reference or ontology^[A formal naming convention]. They are not
always perfect, but they are worth following.
- What is the CRS^[Coordinate Reference System] (projection, grid mapping) of the data?
- What are the units?
- What is the variable name?
- What is the source of the data?
- What script produced the data?
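A minimal sketch of recording this kind of metadata with `xarray`, assuming CF-style conventions; the variable name, values, attribute contents, and output path are illustrative only:

```
import numpy as np
import xarray as xr

# Illustrative dataset: two monthly snow depth grids
ds = xr.Dataset(
    {"snow_depth": (("time", "y", "x"), np.zeros((2, 3, 4)))},
    coords={"time": np.array(["2024-01-01", "2024-02-01"], dtype="datetime64[ns]")},
)

# Variable-level metadata: name, CF standard name, and units
ds["snow_depth"].attrs = {
    "long_name": "snow depth",
    "standard_name": "surface_snow_thickness",
    "units": "m",
}

# Global metadata: source of the data and the script that produced it
ds.attrs = {
    "source": "illustrative example data",
    "history": "created by make_monthly_means.py",
    "Conventions": "CF-1.8",
}
# The CRS would normally be recorded too, e.g. via a grid_mapping variable (not shown)

ds.to_netcdf("Data/monthly_averages/snow_depth_monthly.nc")
```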