## Workflow Overview -------------------------------------------------------------
# Title: NPS EML Creation Workflow
#
# Summary: This script acts as a template file for end-to-end creation of EML
# metadata in R for DataStore. The metadata generated will be of sufficient
# quality for the Data Package Reference Type and can be used to automatically
# populate the DataStore fields for this reference type. The script utilizes
# multiple R packages and the example inputs are for an EVER Veg Map AA dataset.
# The example script is meant to either be run as a test of the process or to be
# replaced with your own content. This is a step-by-step process where each
# section (indicated by dashed lines) should be reviewed, edited if necessary,
# and run one at a time. After completing a section there is often something to
# do external to R (e.g. open a text file and add content). Several
# EMLassemblyline functions are decision points and may only apply to certain
# data packages. This workflow takes advantage of the NPSdataverse, an R-based
# ecosystem that includes external EML creation tools such as the R packages
# EMLassemblyline and EML. However, these tools were not designed to work with
# DataStore. Therefore, the NPSdataverse and this workflow also incorporate
# steps from NPS-developed R packages such as EMLeditor and DPchecker. You will
# necessarily overwrite some of the information generated by EMLassemblyline.
# That is OK and is expected behavior.
# Good additional references include:
# EMLassemblyline: https://ediorg.github.io/EMLassemblyline/
# EMLeditor: https://nationalparkservice.github.io/EMLeditor/index.html
# NPS EML Script: https://nationalparkservice.github.io/NPS_EML_Script/
# EVER Veg Map AA dataset for testing purposes:
# https://github.com/nationalparkservice/NPS_EML_Script/tree/main/Example_files
# Contributors: Judd Patterson ([email protected]) and Rob Baker
# Last Updated: 23 February, 2023
## Install and Load R Packages -------------------------------------------------
# Install packages. If you have not recently installed packages, please
# re-install them (especially NPSdataverse) as they are under constant
# development. If you run into errors installing packages from GitHub on NPS
# computers you may first need to run:
# options(download.file.method = "wininet")
# If you are on the VPN, you will need to set your CRAN mirror to Texas 1.
# Download the relevant R packages:
install.packages("devtools")
devtools::install_github("nationalparkservice/NPSdataverse")
install.packages(c("lubridate", "tidyverse"))
# Load packages
library(NPSdataverse)
library(lubridate)
library(tidyverse)
# When loading packages, you may be advised to update to more recent versions
# of dependent packages. Most of these updates likely are not critical. However,
# it is important that you update to the latest versions of EMLeditor and
# DPchecker as these NPS packages are under constant development.
## Set Overall Package Details -------------------------------------------------
# All of the following items should be reviewed and updated to fit the package
# at hand. For vectors with more than one item, keep the order the same (i.e.
# item #1 should correspond to the same file in each vector).
# Metadata filename - becomes the filename, so make sure it ends in _metadata to
# comply with data package specifications
metadata_id <- "TEST_EVER_AA_metadata"
# Overall package title
package_title <- "TEST_Everglades National Park Accuracy Assessment (AA) Data Package"
# Description of data collection status - choose from 'ongoing' or 'complete'
data_type <- "complete"
# Path to data file(s)
working_folder <- file.path(getwd(), "Example_files")
# Vector of dataset filenames:
data_files <- c("qry_Export_AA_Points.csv",
"qry_Export_AA_VegetationDetail.csv")
# If the only .csv files in your working_folder are datasets for your data
# package, you can use:
# data_files <- list.files(path = working_folder, pattern = "\\.csv$")
# Vector of dataset names (brief name for each file)
data_names <- c("TEST_AA Point Data",
"TEST_AA Vegetation Data")
# Vector of dataset descriptions (about 10 words describing each file).
# Descriptions will be used in auto-generated tables within the ReadMe and DRR.
# If you need to use more than about 10 words, consider putting that information
# in the abstract, methods, or additional info sections.
data_descriptions <- c("TEST_Everglades Vegetation Map Accuracy Assessment point data",
"TEST_Everglades Vegetation Map Accuracy Assessment vegetation data")
# Tell EMLassemblyline where your files will ultimately be located. Create a
# vector of dataset URLs for DataStore. I recommend setting this to the main
# reference page. All data files from a single data package can be accessed from
# the same page, so the URLs are the same.
# The reference code from the draft DataStore reference you initiated (replace
# 2293181 with your own code):
DSRefCode <- 2293181
# No need to edit this
DSURL <- paste0("https://irma.nps.gov/DataStore/Reference/Profile/", DSRefCode)
# No need to edit this
data_urls <- rep(DSURL, length(data_files))
# Single file or Vector (list) of tables and fields with scientific names that
# can be used to fill the taxonomic coverage metadata. Add additional items as
# necessary. Comment these out and do not run FUNCTION 5 (below) if your data
# package does not contain species information.
data_taxa_tables <- c("qry_Export_AA_VegetationDetail.csv")
# alternatively, if you have multiple files with taxonomic info:
# data_taxa_tables <-c("qry_Export_AA_VegetationDetails1.csv",
# "qry_Export_AA_VegetationDetails2.csv",
# "etc.csv")
# Tell EMLassemblyline the column name where your scientific names are within
# the data files. We suggest using DarwinCore names for your data columns:
# https://dwc.tdwg.org/terms/
data_taxa_fields <- c("Scientific_Name")
# Table and fields that contain geographic coordinates and site names to fill
# the geographic coverage metadata. Comment these out and do not run FUNCTION 4
# (below) if your data package does not contain geographic information. If the
# only geographic information you are supplying is the park units (and their
# bounding boxes), you can skip this step; these data and the corresponding
# GPS coordinates will be automatically added at a later step.
data_coordinates_table <- "qry_Export_AA_Points.csv"
data_latitude <- "decimalLatitude"
data_longitude <- "decimalLongitude"
data_sitename <- "Point_ID"
# Start date and end date.
# These should be the collection dates of the first and last data points in the
# data package (across all files) and should not include any planning, pre-, or
# post-processing time. Use a format that complies with the International
# Organization for Standardization's ISO 8601 standard. The recommended format
# for EML is YYYY-MM-DD, where Y is the four-digit year, M is the two-digit
# month (01-12; e.g., January = 01), and D is the two-digit day of the month
# (01-31).
startdate <- ymd("2010-01-26")
enddate <- ymd("2013-01-04")
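# If collection dates live in a column of one of your data files, you can
# derive the range from the data instead of typing it. This is a sketch only:
# "eventDate" is a hypothetical column name, so substitute the date column
# actually present in your file.

```r
# Read the points file and take the min/max of the (hypothetical) date column
points <- readr::read_csv(file.path(working_folder, "qry_Export_AA_Points.csv"))
startdate <- min(ymd(points$eventDate), na.rm = TRUE)
enddate <- max(ymd(points$eventDate), na.rm = TRUE)
```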
## EMLassemblyline Functions ---------------------------------------------------
# The next set of functions are meant to be considered one by one and run only
# if applicable to a particular data package. The first year will typically see
# all of them run, but if the data format and protocol stay constant over time
# it may be possible to skip some in future years. Additionally, some datasets
# may not have a geographic or taxonomic component.
# FUNCTION 1 - Core Metadata Information
# Creates blank TXT template files for the abstract, additional information,
# custom units, intellectual rights, keywords, methods, and personnel. Be sure
# to edit the personnel text file in Excel, as it has columns. Remember that the
# role "creator" is required! EMLassemblyline will also warn you if you do not
# include a "PI" role, but you can ignore the warning; this role is not
# required. Typically these files can be reused between years.
# We encourage you to craft your abstract in a text editor, NOT Word. Your
# abstract will be forwarded to data.gov, DataCite, google dataset search, etc.
# so it is worth some time to carefully consider what is relevant and important
# information for an abstract. Abstracts must be greater than 20 words. Good
# abstracts tend to be 250 words or less. You may consider including the
# following information: The premise for the data collection (why was it done?),
# why is it important, a brief overview of relevant methods, and a brief
# explanation of what data are included such as the period of time, location(s),
# and type of data collected. Keep in mind that if you have lengthy descriptions
# of methods, provenance, data QA/QC, etc., it may be better to expand on these
# topics in a Data Release Report or similar document uploaded separately to
# DataStore.
# Currently this function inserts a Creative Commons Zero (CC0) license. The
# CC0 license will need to be updated. However, to ensure that the license meets
# NPS specifications and properly coincides with CUI designations, the best way
# to update the license information is during a later step using
# EMLeditor::set_int_rights(). There is no need to edit this .txt file.
template_core_metadata(path = working_folder,
                       license = "CC0") # that '0' is a zero!
# FUNCTION 2 - Data Table Attributes
# Creates an "attributes_datafilename.txt" file for each data file. Open each in
# Excel (we recommend against trying to update these in a text editor) and fill
# in/adjust the columns for attributeDefinition, class, unit, etc. Refer to
# https://ediorg.github.io/EMLassemblyline/articles/edit_tmplts.html
# for helpful hints and to view_unit_dictionary() for potential units. This will
# only need to be run again if the attributes (name, order, or new/deleted
# fields) are modified from the previous year. NOTE that if these files already
# exist from a previous run, they are not overwritten.
template_table_attributes(path = working_folder,
                          data.table = data_files,
                          write.file = TRUE)
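# While filling in the 'unit' column of the attributes files, you can browse
# the standard units EML recognizes with the EMLassemblyline helper mentioned
# above:

```r
# Opens a table of EML standard units; anything not listed here must be
# defined in the custom_units template instead
view_unit_dictionary()
```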
# FUNCTION 3 - Data Table Categorical Variable
# Creates a "catvars_datafilename.txt" file for each data file that has columns
# with a class = categorical. These .txt files will include each unique 'code'
# and allow input of the corresponding 'definition'. NOTE that since the
# list of codes is harvested from the data itself, it's possible that additional
# codes may have been relevant/possible but they are not automatically included
# here. Consider your lookup lists carefully to see if additional options should
# be included (e.g., if your dataset DPL values are all set to "Accepted" this
# function will not include "Raw" or "Provisional" in the resulting file and you
# may want to add those manually). NOTE that if these files already exist from a
# previous run, they are not overwritten.
template_categorical_variables(path = working_folder,
                               data.path = working_folder,
                               write.file = TRUE)
# FUNCTION 4 - Geographic Coverage
# If the only geographic coverage information you plan on using is park
# boundaries, you can skip this step. You can add park unit connections using
# EMLeditor, which will automatically generate properly formatted GPS
# coordinates for the park bounding boxes.
# If you would like to add additional GPS coordinates (such as for specific site
# locations, survey plots, or bounding boxes for locations within a park),
# please do.
# Creates a geographic_coverage.txt file that lists your sites as points, as
# long as your coordinates are in lat/long. If your coordinates are in UTM, it
# is probably easiest to convert them first or create the
# geographic_coverage.txt file another way (see
# https://nationalparkservice.github.io/QCkit/ for R functions that will convert
# UTM to lat/long).
template_geographic_coverage(path = working_folder,
                             data.path = working_folder,
                             data.table = data_coordinates_table,
                             lat.col = data_latitude,
                             lon.col = data_longitude,
                             site.col = data_sitename,
                             write.file = TRUE)
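# As a sketch of that UTM conversion, the following assumes the sf package and
# NAD83 / UTM zone 17N (EPSG:26917, which covers EVER); substitute the EPSG
# code for your zone, or use the QCkit functions instead. The coordinates below
# are hypothetical examples, not real data.

```r
library(sf)
# Hypothetical example coordinates; replace with your own easting/northing values
utm_points <- data.frame(easting = c(540000, 541250),
                         northing = c(2795000, 2796500))
# Attach the UTM coordinate reference system, then transform to WGS84 lat/long
pts <- st_as_sf(utm_points, coords = c("easting", "northing"), crs = 26917)
pts_latlong <- st_transform(pts, crs = 4326)
st_coordinates(pts_latlong)  # columns X (decimalLongitude) and Y (decimalLatitude)
```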
# FUNCTION 5 - Taxonomic Coverage
# Creates a taxonomic_coverage.txt file if you have taxonomic data.
# Currently supported authorities are 3 = ITIS, 9 = WORMS, and 11 = GBIF.
template_taxonomic_coverage(path = working_folder,
                            data.path = working_folder,
                            taxa.table = data_taxa_tables,
                            taxa.col = data_taxa_fields,
                            taxa.authority = c(3, 11),
                            taxa.name.type = "scientific",
                            write.file = TRUE)
## Create an EML File ----------------------------------------------------------
# Run this (it may take a little while) and see if it validates (you should see
# 'Validation passed'). It will generate an R object called "my_metadata".
# The function may also alert you to some issues to review. Run the function
# 'issues()' at the end of the process to get feedback on items that might be
# missing or need attention. Fix these issues and then re-run the make_eml()
# function.
my_metadata <- make_eml(path = working_folder,
                        dataset.title = package_title,
                        data.table = data_files,
                        data.table.name = data_names,
                        data.table.description = data_descriptions,
                        data.table.url = data_urls,
                        temporal.coverage = c(startdate, enddate),
                        maintenance.description = data_type,
                        package.id = metadata_id)
## Check for EML validity ------------------------------------------------------
# This is a good point to pause and test whether your EML is valid.
eml_validate(my_metadata)
# If your EML is valid you should see the following (admittedly cryptic) output:
# [1] TRUE
# attr(,"errors")
# character(0)
# If your EML is not schema valid, the function will notify you of specific
# problems you need to address. We HIGHLY recommend that you use the
# EMLassemblyline and/or EMLeditor functions to fix your EML and do not attempt
# to edit it by hand.
## Add NPS specific fields to EML ----------------------------------------------
# Now that you have valid EML metadata, you need to add NPS-specific elements
# and fields. For instance, unit connections, DOIs, referencing a DRR, etc. More
# information about these functions can be found at:
# https://nationalparkservice.github.io/EMLeditor/.
## Add Controlled Unclassified Information (CUI) codes -------------------------
# This is a required step. It is important to indicate not only that your data
# package contains CUI, but also to inform users if your data package does NOT
# contain CUI because empty fields can be ambiguous (does it not contain CUI or
# did the creators just miss that step?). You can choose from one of five CUI
# dissemination codes. Watch out for the spaces! These are:
# PUBLIC - Does NOT contain CUI.
# FED ONLY - Contains CUI. Only federal employees should have access
# (similar to "internal only" in DataStore).
# FEDCON - Contains CUI. Only federal employees and federal contractors should
# have access (also very much like current "internal only" setting in
# DataStore).
# DL ONLY - Contains CUI. Should only be available to a named list of
# individuals (where and how to list those individuals TBD)
# NOCON - Contains CUI. Federal, state, local, or tribal employees may have
# access, but contractors cannot.
# More information about these codes can be found at:
# https://www.archives.gov/cui/registry/limited-dissemination
my_metadata <- set_cui(my_metadata, "PUBLIC")
# Note that in this case I have added the CUI code to the original R object,
# "my_metadata", but by assigning the result to a new name, e.g. "my_meta2", I
# could have created a new R object. Sometimes creating a new R object is
# preferable because if you make a mistake you don't need to start over again.
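# For example, branching to a new object instead of overwriting:

```r
# Assigning to a new name keeps "my_metadata" untouched, so a mistake here
# does not force you to re-run make_eml()
my_meta2 <- set_cui(my_metadata, "PUBLIC")
```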
## Set the Intellectual Rights--------------------------------------------------
# EMLassemblyline and ezEML provide some attractive looking boilerplate for
# setting the intellectual rights. It looks reasonable and so is easy to just
# keep. However, NPS has some specific regulations about what can and cannot be
# in the intellectualRights tag. Use set_int_rights() to replace the text with
# NPS-approved text. Note: You must first add the CUI dissemination code using
# set_cui() as the dissemination code and license must agree. That is, you
# cannot give a data package with a PUBLIC dissemination code a "restricted"
# license (and vice versa: a restricted data package that contains CUI cannot
# have a public domain or CC0 license). You can choose from one of three
# options:
# "restricted": If the data contains Controlled Unclassified Information (CUI),
# the intellectual rights must read: "This product has been determined to
# contain Controlled Unclassified Information (CUI) by the National Park
# Service, and is intended for internal use only. It is not published under an
# open license. Unauthorized access, use, and distribution are prohibited."
# "public": If the data do not contain CUI, the default is the public domain.
# The intellectual rights must read: "This work is in the public domain. There
# is no copyright or license."
# "CC0": If you need a license, for instance if you are working with a partner
# organization that requires a license, use CC0: "The person who associated a
# work with this deed has dedicated the work to the public domain by waiving all
# of his or her rights to the work worldwide under copyright law, including all
# related and neighboring rights, to the extent allowed by law. You can copy,
# modify, distribute and perform the work, even for commercial purposes, all
# without asking permission."
# The set_int_rights() function will also put the name of your license in a
# field in EML for DataStore harvesting.
# choose from "restricted", "public" or "CC0" (zero), see above:
my_metadata <- set_int_rights(my_metadata, "public")
## Add a data package DOI (optional) -------------------------------------------
# Add your data package's Digital Object Identifier (DOI) to the metadata. The
# set_datastore_doi() function requires that you are logged on to the VPN. It
# initiates a draft data package reference on DataStore and populates the
# reference with a title pulled from your metadata: "[DRAFT] : <your data
# package title>". This temporary title is purely for your tracking purposes and
# can easily be updated later. The set_datastore_doi() function will then insert
# the corresponding DOI for your data package into your metadata. There are a
# few things to keep in mind:
# 1) Your DOI and the data package reference are not yet active and are not
# publicly accessible until after review and activation/publication.
# 2) Be sure to upload your data package to the correct draft reference! It is
# easy to create several draft references with the same draft title so
# check the reference ID number carefully (we are working on making this
# process easier and less error prone).
# There is no need to fill in additional fields in DataStore at this point -
# many of them will be auto-populated based on the metadata you upload. Any
# fields you do populate will be over-written by the content in your metadata.
my_metadata <- set_datastore_doi(my_metadata)
## Add information about a DRR (optional) --------------------------------------
# If you are producing (or plan to produce) a DRR, add links to the DRR
# describing the data package.
# Similar to when you added the data package DOI, you will need the DOI for the
# DRR you are drafting as well as the DRR's Title. Again, go to DataStore and
# initiate a draft DRR, including a title. For the purposes of the data package,
# there is no need to populate any other fields. At this point, you do not need
# to activate the DRR reference and, while a DOI has been reserved for your DRR,
# it will not be activated until after publication so that you have plenty of
# time to construct the DRR.
my_metadata <- set_drr(my_metadata, 7654321, "DRR Title")
## Set the language ------------------------------------------------------------
# This is the human language (as opposed to computer language) that the data
# package and metadata are constructed in. Examples include English, Spanish,
# Navajo, etc. A full list of available languages is available from the Library
# of Congress. Please use the "English Name of Language" as an input. The
# function will then convert your input to the appropriate 3-character ISO
# 639-2 code.
# Available languages: https://www.loc.gov/standards/iso639-2/php/code_list.php
my_metadata <- set_language(my_metadata, "English")
## Add content unit links ------------------------------------------------------
# These are the park units from which data were collected, for instance ROMO,
# not ROMN. If the data package includes data from more than one park, they can
# all be listed. For instance, if data were collected from all park units within
# a network, each unit should be listed separately rather than the network.
# This is because the geographic coordinates corresponding to bounding boxes for
# each park unit listed will automatically be generated and inserted into the
# metadata. Individual park units will be more informative than the bounding box
# for the entire network.
park_units <- c("ROMO", "GRSA", "YELL")
my_metadata <- set_content_units(my_metadata, park_units)
## Add the Producing Unit(s) ---------------------------------------------------
# This is the unit(s) responsible for generating the data package. It may be a
# single park (ROMO) or a network (ROMN). It may be identical to the units
# listed in the previous step, overlapping, or entirely different.
# a single producing unit:
my_metadata <- set_producing_units(my_metadata, "ROMN")
# alternatively, a list of producing units (run only one of the two):
# my_metadata <- set_producing_units(my_metadata, c("ROMN", "GRYN"))
## Validate your EML -----------------------------------------------------------
# Almost done! This is another great time to validate your EML and make sure
# everything is schema valid. Run:
eml_validate(my_metadata)
# If your EML is valid you should see the following (admittedly cryptic) output:
# [1] TRUE
# attr(,"errors")
# character(0)
# If your EML is not schema valid, the function will notify you of specific
# problems you need to address. We HIGHLY recommend that you use the
# EMLassemblyline and/or EMLeditor functions to fix your EML and do not attempt
# to edit it by hand.
## Write your EML to an xml file -----------------------------------------------
# Now it's time to convert your R object to an .xml file and save it. Keep in
# mind that the file name should end with "_metadata.xml".
write_eml(my_metadata, "mymetadatafilename_metadata.xml")
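# One way to avoid filename typos is to build the name from the metadata_id you
# set at the top of this script (it already ends in "_metadata", so the result
# complies with the data package specification):

```r
# Equivalent to typing the filename by hand, but reuses metadata_id
write_eml(my_metadata, paste0(metadata_id, ".xml"))
```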
## Check your .xml file --------------------------------------------------------
# Your EML metadata file should be ready for upload. You can run some
# additional tests on your .xml metadata file alone using:
check_eml()
# This assumes that your .xml file is in your working directory and that it is
# the only .xml file in your working directory.
## Check your data package -----------------------------------------------------
# If your data package is now complete, you can run some tests prior to upload
# to make sure that the package meets a minimal set of requirements and that
# the data and metadata are properly specified and coincide. This assumes that
# your data package is in the root of your R project.
run_congruence_checks()
# Alternatively, you can tell run_congruence_checks() where your data package
# is. The format should look something like:
run_congruence_checks("C:/Users/yourusername/Documents/my_data_package")
## Congratulations -------------------------------------------------------------
# If everything checked out, you should be ready to upload your data package!
# If you initiated a draft reference and inserted a DOI, make sure to upload
# it to the correct draft reference that corresponds to your DOI. Remember, you
# can upload multiple files simultaneously by highlighting them all rather than
# uploading one-by-one.