# Obtaining additional biodiversity data
As part of __Deliverable 1__ of __WP3__, we looked for additional marine biodiversity data related to Europe that were available elsewhere but not on OBIS. This included data published in the literature, in data repositories, and in other biodiversity databases such as [GBIF](https://gbif.org).
::: {.callout-note title="Continuous process"}
This process is still ongoing and will continue until the end of the project.
:::
We used the following procedures for each source:
- Literature/repositories: after finding the appropriate and relevant sources (see below) we ingested the data using the `obisdi` package structure.
- GBIF: after identifying potential datasets available on GBIF, we used the data harvesting procedure adopted by OBIS nodes (details below).
## `obisdi` package
To enable streamlined and standardized data ingestion throughout the project, we developed the `obisdi` package, which is [available on GitHub](https://github.com/iobis/obisdi). The idea behind the package (and essentially all of its structure) came from the [Tracking Invasive Alien Species (TrIAS)](https://github.com/trias-project) project's checklist recipe [(see more here)](https://github.com/trias-project/checklist-recipe), which provides a standard structure for mapping data to the Darwin Core standard. Using this structure, all the mapping is fully documented and can be tracked. It is also possible to ingest the data directly into the IPT from a GitHub repository.
Every project created with the `obisdi` package has the following structure:
- a folder for data, containing two other folders - one for _raw_ data (where the original data files are stored) and one for _processed_ data (where the final edited files are stored).
- a README file containing the basic details about the dataset and the repository
- an RMarkdown file which contains the mapping to the DwC standard.
By _knitting_ the RMarkdown file, it's also possible to generate a docs folder that can be used as a website (through GitHub Pages), giving the general community easy access to the information.
![Workflow for data ingestion using obisdi](images/obisdiexplanation.png){width=90%}
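
The folder layout described above can be sketched with base R alone. This is only an illustration of the structure, not the `obisdi` API: the helper `create_skeleton()` and the file names `README.md` and `dwc_mapping.Rmd` are hypothetical placeholders.

```r
# Hypothetical helper: builds the layout described above using base R only.
# It is NOT part of obisdi; names and file contents are placeholders.
create_skeleton <- function(root) {
  # data/raw holds the original files, data/processed the final edited files
  dir.create(file.path(root, "data", "raw"),
             recursive = TRUE, showWarnings = FALSE)
  dir.create(file.path(root, "data", "processed"),
             recursive = TRUE, showWarnings = FALSE)
  # README with the basic details about the dataset and the repository
  writeLines("# Dataset title (placeholder)", file.path(root, "README.md"))
  # RMarkdown file that will hold the mapping to Darwin Core
  writeLines(c("---", "title: \"Darwin Core mapping\"", "---"),
             file.path(root, "dwc_mapping.Rmd"))
  invisible(root)
}

# Create the skeleton in a temporary folder and list what was generated
project <- create_skeleton(file.path(tempdir(), "example_dataset"))
list.files(project, recursive = TRUE, include.dirs = TRUE)
```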
## Additional data from literature and repositories
### BioTIME
BioTIME is a database containing time series of ecological data from the terrestrial, freshwater and marine realms. We downloaded the full database [(available here)](https://biotime.st-andrews.ac.uk/getFullDownload.php) and, using the metadata, identified the marine studies in the European region that were not available on OBIS. This identification was based on fuzzy matching of the study titles against the OBIS dataset titles. For those that were potentially relevant, we manually checked the datasets to confirm their relevance.
At the end we identified 4 new datasets that could be included, and proceeded with the data ingestion.
::: {.callout-warning}
At this moment, only one of the identified datasets has been ingested. The others are being processed and will be ingested soon.
:::
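
The title-matching step used for BioTIME can be illustrated with a minimal base R sketch. It assumes a normalized edit distance (via `utils::adist`) as the similarity measure, which may differ from the measure actually used in the project; the function `title_similarity()` and the example titles are made up for illustration.

```r
# Similarity in [0, 1] between each candidate title and each reference title,
# computed as 1 - (edit distance / length of the longer string).
title_similarity <- function(candidates, reference) {
  a <- tolower(trimws(candidates))
  b <- tolower(trimws(reference))
  d <- utils::adist(a, b)                     # matrix: candidates x reference
  longest <- outer(nchar(a), nchar(b), pmax)  # pairwise longer string length
  1 - d / pmax(longest, 1)
}

# Toy titles (made up for illustration)
biotime_titles <- c("Fish trawl survey of the North Sea",
                    "Breeding birds of a boreal forest")
obis_titles <- c("North Sea fish trawl survey",
                 "Fish trawl survey of the north sea")

sim <- title_similarity(biotime_titles, obis_titles)

# For each candidate, keep the closest OBIS title and its similarity
best <- data.frame(
  title = biotime_titles,
  best_match = obis_titles[apply(sim, 1, which.max)],
  similarity = apply(sim, 1, max)
)
best
```

Titles with a low best similarity are the ones worth checking manually, since they are the least likely to be on OBIS already.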
### Literature
We searched __Web of Science__ for articles that could potentially contain datasets valuable for our project. We used the following search string: `TS=((marine OR ocean* OR coastal) AND (("biodiversity data") OR (dataset) OR ("time series" OR time-series)) AND (species OR occurrence OR biodiversity OR fauna) AND (europe* OR global))`. From the returned list (~2000 articles) we (1) matched the titles with the dataset names or bibliographic citations from OBIS to verify whether the dataset was already included on OBIS, and (2) manually screened the articles to identify whether the dataset was valuable. Note that this is not a systematic review, but an exploratory search. Because the number of records was large and the screening involves evaluating the data quality and the methods that generated the data, in this first phase of the project we screened the first 100 records (ordered by relevance); we will continue screening in the following months.
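
Step (1) above, checking article titles against OBIS bibliographic citations, could be approximated as follows. This is a sketch under assumptions: it uses base R approximate matching (`agrepl`), the helper `in_citations()` and the 10% tolerance are illustrative choices, and the titles and citation are invented.

```r
# Returns TRUE when an article title is found (approximately) inside any of
# the citation strings. max_dist is the fraction of the title allowed to differ.
in_citations <- function(titles, citations, max_dist = 0.1) {
  vapply(titles, function(tt) {
    any(agrepl(tt, citations, max.distance = max_dist,
               ignore.case = TRUE, fixed = TRUE))
  }, logical(1), USE.NAMES = FALSE)
}

# Toy inputs (made up for illustration)
article_titles <- c("A long-term dataset of benthic fauna in the Baltic Sea",
                    "Global patterns of kelp forest change")
obis_citations <- c(
  "Doe J (2020). A long term dataset of benthic fauna in the Baltic Sea. Example Journal 1: 1-10."
)

# Articles not found in any citation remain on the list for manual screening
already_in_obis <- in_citations(article_titles, obis_citations)
to_screen <- article_titles[!already_in_obis]
```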
### Data repositories
We searched the data repositories __FigShare__, __Zenodo__, and __Dryad__ for datasets containing marine data from our study region. For each repository, a distinct search strategy was applied, based on its structure. Dataset names were fuzzy matched against dataset titles on OBIS, and those identified as not available on OBIS were screened to assess their relevance. In this first phase of the project we screened the first 50 records, and will continue screening in the following months.
Once a dataset is identified for inclusion, it will be ingested using the `obisdi` structure.
The code used to obtain the information from those data repositories is available in the last section.
### Other sources
We also received suggestions of datasets directly from the participants of the project. We checked whether each suggested dataset was already on OBIS and, if not, we ingested it.
## Additional data from GBIF
After we obtained the list of species occurring in the study area, we [downloaded](datadownload.qmd) the occurrence data from GBIF. From the occurrence data, we identified the unique datasets from which the records came. We then counted the number of records each dataset contributed to the final data. We selected the datasets with a high contribution (more than 50000 occurrences) as potential datasets for inclusion in OBIS.
For the datasets with potential for inclusion, we first identified those that are already part of OBIS and excluded them from the search. With the remaining datasets, we screened for relevance.
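
The counting and filtering steps above can be sketched in base R. The occurrence table and the list of datasets already on OBIS are toy values; only the `datasetKey` column name follows GBIF occurrence downloads, and `already_in_obis` is a hypothetical input.

```r
# Toy occurrence table (made up); GBIF downloads carry a datasetKey column
occ <- data.frame(
  datasetKey = c(rep("ds-a", 60000), rep("ds-b", 1200), rep("ds-c", 75000))
)
# Hypothetical keys of datasets that are already available on OBIS
already_in_obis <- c("ds-c")

# Number of occurrence records contributed by each dataset
counts <- as.data.frame(table(datasetKey = occ$datasetKey),
                        responseName = "n_records",
                        stringsAsFactors = FALSE)

# Keep high-contribution datasets (more than 50000 records) not yet on OBIS
candidates <- counts[counts$n_records > 50000 &
                       !counts$datasetKey %in% already_in_obis, ]
candidates
```

The remaining rows are the datasets that would then be screened for relevance.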
The harvesting of the datasets to OBIS is done with the contribution and approval of an OBIS node. To do that, we follow this procedure:
1. An __issue__ is opened on the GitHub repo https://github.com/iobis/obis-network-datasets, indicating the dataset.
2. One of the OBIS nodes reviews the issue and verifies the relevance and quality of the dataset.
3. If the dataset is deemed valuable, the OBIS node approves it and it is harvested into OBIS.
::: {.callout-note}
Only datasets with a CC0, CC-BY or CC-BY-NC license were considered for inclusion. More information is available in the [OBIS manual](https://manual.obis.org/policy.html).
:::
# Code for obtaining information from data repositories
## Zenodo
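```{r eval = FALSE}
# Get records from Zenodo using API connection

# Create a function to retrieve the records for a certain query
get_zenodo <- function(query){
  response <- httr::GET('https://zenodo.org/api/records',
                        query = list(q = query,
                                     size = 2000, page = 1))
  # Stop early with an informative error if the request failed
  httr::stop_for_status(response)
  t_resp <- httr::content(response, "parsed", encoding = "UTF-8")
  # Search results are expected under hits$hits; fall back to the top-level
  # list in case the response has a different shape
  records <- if (is.null(t_resp$hits$hits)) t_resp else t_resp$hits$hits
  results <- lapply(records, function(x){
    # The title may sit at the top level or inside the metadata element
    title <- if (is.null(x$title)) x$metadata$title else x$title
    data.frame(title = if (is.null(title)) NA else title,
               doi = if (is.null(x$doi)) NA else x$doi)
  })
  results <- do.call("rbind", results)
  return(results)
}

zen_results <- get_zenodo("+access_right:open +resource_type.type:dataset +title:marine +title:species")

write.csv(zen_results, paste0("source_lists/zen_", format(Sys.Date(), "%d%m%Y"), ".csv"),
          row.names = FALSE)
```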
## FigShare
```{r eval = FALSE}
# Get records from FigShare using API connection
library(httr)

# Search body; the offset is updated by get_figshare() to page through results
query_fig <- '{
  "item_type": 3,
  "search_for": "(:title: marine OR :title: ocean OR :title: coastal) AND (:title: europe OR :title: global) AND (:title: species OR :title: biodiversity)",
  "limit": 1000,
  "offset": 0
}'

# Create a function to retrieve the records for a certain query
get_figshare <- function(query, maxtry = 7000){
  off <- seq(0, maxtry, by = 1000)
  retnum <- 1000
  k <- 1
  allres <- list()
  # Keep requesting pages while the previous page was full (1000 records)
  while(retnum == 1000 && k <= length(off)) {
    query <- gsub('"offset": [[:digit:]]*', paste0('"offset": ', off[k]), query)
    response <- POST("https://api.figshare.com/v2/articles/search", body = query,
                     httr::add_headers(`accept` = 'application/json'),
                     httr::content_type('application/json'))
    if (response$status_code != 200) {
      # Failed request: add a placeholder row and try the next page
      results <- data.frame(title = NA, doi = NA, resource_title = NA)
      retnum <- 1000
    } else {
      t_resp <- httr::content(response, "parsed", encoding = "UTF-8")
      results <- lapply(t_resp, function(x){
        # Replace missing (NULL) fields with NA so all rows share the same columns
        nn <- function(v) if (is.null(v)) NA else v
        data.frame(title = nn(x$title), doi = nn(x$doi),
                   resource_title = nn(x$resource_title))
      })
      results <- do.call("rbind", results)
      # An empty page yields NULL here, which ends the loop
      retnum <- if (is.null(results)) 0 else nrow(results)
    }
    allres[[k]] <- results
    k <- k + 1
  }
  return(allres)
}

fig_q1 <- get_figshare(query_fig)

# Bind all results
fig_results <- do.call("rbind", fig_q1)

write.csv(fig_results, paste0("source_lists/fig_", format(Sys.Date(), "%d%m%Y"), ".csv"),
          row.names = FALSE)
```
## Dryad
```{r eval = FALSE}
# Get records from Dryad using API connection
library(httr)

# Create a function to retrieve the records for a certain query
get_dryad <- function(query, maxtry = 2000, addstop = TRUE, verbose = TRUE){
  off <- seq(1, ceiling(maxtry/100))
  retnum <- 100
  k <- 1
  allres <- list()
  # Keep requesting pages while the previous page was full (100 records)
  while(retnum == 100 && k <= length(off)) {
    if (verbose) cat("Downloading page", k, "\n")
    response <- httr::GET('https://datadryad.org/api/v2/search',
                          query = list(q = query,
                                       per_page = 100, page = k))
    if (response$status_code != 200) {
      # Failed request: add a placeholder row with the same columns as valid results
      results <- data.frame(title = NA, doi = NA)
      retnum <- 100
    } else {
      t_resp <- httr::content(response, "parsed", encoding = "UTF-8")
      results <- lapply(t_resp$`_embedded`$`stash:datasets`, function(x){
        id <- x$identifier
        title <- x$title
        if (is.null(title)) {
          title <- "NOT FOUND"
        }
        data.frame(title = title, doi = id)
      })
      results <- do.call("rbind", results)
      # An empty page yields NULL here, which ends the loop
      retnum <- if (is.null(results)) 0 else nrow(results)
    }
    allres[[k]] <- results
    k <- k + 1
    if (addstop) {
      # Pause between requests to avoid overloading the API
      Sys.sleep(5)
    }
  }
  return(allres)
}

dry_q1 <- get_dryad("marine species europe")
dry_q1 <- do.call("rbind", dry_q1)

dry_q2 <- get_dryad("marine species global")
dry_q2 <- do.call("rbind", dry_q2)

# Bind all results
dry_results <- rbind(dry_q1, dry_q2)

write.csv(dry_results, paste0("source_lists/dry_", format(Sys.Date(), "%d%m%Y"), ".csv"),
          row.names = FALSE)
```
## Code for fuzzy matching from data repositories
```{r eval = FALSE}
library(readxl)
library(tidyverse)
library(sf)

# Load the lists retrieved from each repository
zen <- read.csv("source_lists/zen_23062023.csv")
dry <- read.csv("source_lists/dry_23062023.csv")
fig <- read.csv("source_lists/fig_23062023.csv")

full <- rbind(
  zen[,c("title", "doi")],
  dry[,c("title", "doi")],
  fig[,c("title", "doi")]
)

# Get OBIS datasets
# Open study area shapefile and keep its bounding box
starea <- st_read("~/Research/mpa_europe/mpaeu_studyarea/data/shapefiles/mpa_europe_starea_v2.shp")
starea <- st_bbox(starea)

# Download list of all obis datasets in the study area
# (the bounding box is converted to a WKT polygon)
datasets <- robis::dataset(
  geometry = st_as_text(st_as_sfc(starea))
)

#### PYTHON IMPLEMENTATION
library(reticulate)
use_python("/usr/local/bin/python3")
fuz <- import("rapidfuzz")

sources <- tolower(full$title)
compare <- tolower(datasets$title)

match_frat <- match_title <- rep(NA, length(sources))

cli::cli_progress_bar("Running fuzzy matching...", total = length(sources))

# For each repository title, find the most similar OBIS dataset title
for (s in seq_along(sources)) {
  frat <- rep(NA, length(compare))
  for (z in seq_along(compare)) {
    frat[z] <- fuz$fuzz$ratio(sources[s], compare[z])
  }
  match_title[s] <- compare[which.max(frat)]
  match_frat[s] <- max(frat, na.rm = TRUE)
  cli::cli_progress_update()
}
cli::cli_progress_done()

cross_check <- full
cross_check$match_titles <- match_title
cross_check$fuzzy_ratio <- match_frat
#### END OF PYTHON IMPLEMENTATION

# Save for external editing
write_csv(cross_check, "final_lists/datarepo_datasets_comparison.csv")
```
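
As a possible follow-up, the comparison table can be triaged before manual screening. The sketch below uses a toy table with the same columns as `cross_check`; the threshold of 90 (on the 0-100 ratio scale) and the status labels are assumptions, not values used in the project.

```r
# Toy version of the comparison table (values made up for illustration)
cross_check <- data.frame(
  title = c("Marine species of the North Sea", "Global kelp records"),
  doi = c("10.0000/example.1", "10.0000/example.2"),
  match_titles = c("marine species of the north sea", "global seabird counts"),
  fuzzy_ratio = c(100, 52.4)
)

# Assumed threshold on the 0-100 ratio scale; tune it after inspecting the data
threshold <- 90
cross_check$status <- ifelse(cross_check$fuzzy_ratio >= threshold,
                             "likely on OBIS", "screen manually")

# Least similar titles first, since they are the most likely to be new
cross_check <- cross_check[order(cross_check$fuzzy_ratio), ]
cross_check
```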