## Workflow Overview -------------------------------------------------------------
# Title: NPS EML Creation Workflow
#
# Summary: This script acts as a template file for end-to-end creation of EML
# metadata in R for DataStore. The metadata generated will be of sufficient
# quality for the Data Package Reference Type and can be used to automatically
# populate the DataStore fields for this reference type. The script utilizes
# multiple R packages and the example inputs are for an EVER Veg Map AA dataset.
# The example script is meant to either be run as a test of the process or to be
# replaced with your own content. This is a step-by-step process where each
# section (indicated by dashed lines) should be reviewed, edited if necessary,
# and run one at a time. After completing a section there is often something to
# do external to R (e.g. open a text file and add content). Several
# EMLassemblyline functions are decision points and may only apply to certain
# data packages. This workflow takes advantage of the NPSdataverse, an R-based
# ecosystem that includes external EML creation tools such as the R packages
# EMLassemblyline and EML. However, these tools were not designed to work with
# DataStore. Therefore, the NPSdataverse and this workflow also incorporate
# steps from NPS-developed R packages such as EMLeditor and DPchecker. You will
# necessarily overwrite some of the information generated by EMLassemblyline.
# That is OK and is expected behavior.
# Good additional references include:
# EMLassemblyline: https://ediorg.github.io/EMLassemblyline/
# EMLeditor: https://nationalparkservice.github.io/EMLeditor/index.html
# NPS EML Script: https://nationalparkservice.github.io/NPS_EML_Script/
# EVER Veg Map AA dataset for testing purposes:
# https://github.com/nationalparkservice/NPS_EML_Script/tree/main/Example_files
# Contributors: Judd Patterson ([email protected]) and Rob Baker
# Last Updated: 23 February, 2023
## Install and Load R Packages -------------------------------------------------
# Install packages. If you have not recently installed packages, please
# re-install them (especially NPSdataverse) as they are under constant
# development. If you run into errors installing packages from GitHub on NPS
# computers you may first need to run:
# options(download.file.method = "wininet")
# If you are on the VPN, you will need to set your CRAN mirror to Texas 1.
# Download the relevant R packages:
install.packages("devtools")
devtools::install_github("nationalparkservice/NPSdataverse")
install.packages(c("lubridate", "tidyverse"))
# Load packages
library(NPSdataverse)
library(lubridate)
library(tidyverse)
# When loading packages, you may be advised to update to more recent versions
# of dependent packages. Most of these updates likely are not critical. However,
# it is important that you update to the latest versions of EMLeditor and
# DPchecker as these NPS packages are under constant development.
## Set Overall Package Details -------------------------------------------------
# All of the following items should be reviewed and updated to fit the package
# at hand. For vectors with more than one item, keep the order the same (i.e.
# item #1 should correspond to the same file in each vector).
# Metadata filename - becomes the filename, so make sure it ends in _metadata to
# comply with data package specifications
metadata_id <- "TEST_EVER_AA_metadata"
# Overall package title
package_title <- "TEST_Everglades National Park Accuracy Assessment (AA) Data Package"
# Description of data collection status - choose from 'ongoing' or 'complete'
data_type <- "complete"
# Path to data file(s)
working_folder <- file.path(getwd(), "Example_files")
# Vector of dataset filenames:
data_files <- c("qry_Export_AA_Points.csv",
"qry_Export_AA_VegetationDetail.csv")
# If the only .csv files in your working_folder are datasets for your data
# package, you can use:
# data_files <- list.files(path = working_folder, pattern = "\\.csv$")
# Vector of dataset names (brief name for each file)
data_names <- c("TEST_AA Point Data",
"TEST_AA Vegetation Data")
# Vector of dataset descriptions (about 10 words describing each file).
# Descriptions will be used in auto-generated tables within the ReadMe and DRR.
# If you need to use more than about 10 words, consider putting that information
# in the abstract, methods, or additional info sections.
data_descriptions <- c("TEST_Everglades Vegetation Map Accuracy Assessment point data",
"TEST_Everglades Vegetation Map Accuracy Assessment vegetation data")
# Tell EMLassemblyline where your files will ultimately be located. Create a
# vector of dataset URLs for DataStore. I recommend setting this to the main
# reference page. All data files from a single data package can be accessed from
# the same page, so the URLs are the same.
# The reference code from the draft DataStore reference you initiated (replace
# 2293181 with your own code):
DSRefCode <- 2293181
# No need to edit this
DSURL <- paste0("https://irma.nps.gov/DataStore/Reference/Profile/", DSRefCode)
# No need to edit this
data_urls <- rep(DSURL, length(data_files))
# Single file or Vector (list) of tables and fields with scientific names that
# can be used to fill the taxonomic coverage metadata. Add additional items as
# necessary. Comment these out and do not run FUNCTION 5 (below) if your data
# package does not contain species information.
data_taxa_tables <- c("qry_Export_AA_VegetationDetail.csv")
# alternatively, if you have multiple files with taxonomic info:
# data_taxa_tables <-c("qry_Export_AA_VegetationDetails1.csv",
# "qry_Export_AA_VegetationDetails2.csv",
# "etc.csv")
# Tell EMLassemblyline the column name where your scientific names are within
# the data files. We suggest using DarwinCore names for your data columns:
# https://dwc.tdwg.org/terms/
data_taxa_fields <- c("Scientific_Name")
# Table and fields that contain geographic coordinates and site names to fill
# the geographic coverage metadata. Comment these out and do not run FUNCTION 4
# (below) if your data package does not contain geographic information. If the
# only geographic information you are supplying is the park units (and their
# bounding boxes), you can skip this step; these data and the corresponding
# GPS coordinates will be automatically added at a later step.
data_coordinates_table <- "qry_Export_AA_Points.csv"
data_latitude <- "decimalLatitude"
data_longitude <- "decimalLongitude"
data_sitename <- "Point_ID"
# Start date and end date.
# These should be the collection dates of the first and last data points in the
# data package (across all files) and should not include any planning, pre-, or
# post-processing time. Use a format that complies with the International
# Organization for Standardization's ISO 8601 standard. The recommended format
# for EML is YYYY-MM-DD, where Y is the four-digit year, M is the two-digit
# month (01-12; e.g., January = 01), and D is the two-digit day of the month
# (01-31).
startdate <- ymd("2010-01-26")
enddate <- ymd("2013-01-04")
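# If collection dates live in a column of one of your data files, you can
# derive the range from the data instead of typing it. This is a sketch only:
# "eventDate" is a hypothetical column name, so substitute the date column
# actually present in your file.

```r
# Read the points file and take the min/max of the (hypothetical) date column
points <- readr::read_csv(file.path(working_folder, "qry_Export_AA_Points.csv"))
startdate <- min(ymd(points$eventDate), na.rm = TRUE)
enddate <- max(ymd(points$eventDate), na.rm = TRUE)
```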
## EMLassemblyline Functions ---------------------------------------------------
# The next set of functions are meant to be considered one by one and run only
# if applicable to a particular data package. The first year will typically see
# all of them run, but if the data format and protocol stay constant over time
# it may be possible to skip some in future years. Additionally, some datasets
# may not have a geographic or taxonomic component.
# FUNCTION 1 - Core Metadata Information
# Creates blank TXT template files for the abstract, additional information,
# custom units, intellectual rights, keywords, methods, and personnel. Be sure
# to edit the personnel text file in Excel, as it has columns. Remember that the
# role "creator" is required! EMLassemblyline will also warn you if you do not
# include a "PI" role, but you can ignore the warning; this role is not
# required. Typically these files can be reused between years.
# We encourage you to craft your abstract in a text editor, NOT Word. Your
# abstract will be forwarded to data.gov, DataCite, google dataset search, etc.
# so it is worth some time to carefully consider what is relevant and important
# information for an abstract. Abstracts must be greater than 20 words. Good
# abstracts tend to be 250 words or less. You may consider including the
# following information: The premise for the data collection (why was it done?),
# why is it important, a brief overview of relevant methods, and a brief
# explanation of what data are included such as the period of time, location(s),
# and type of data collected. Keep in mind that if you have lengthy descriptions
# of methods, provenance, data QA/QC, etc., it may be better to expand on these
# topics in a Data Release Report or similar document uploaded separately to
# DataStore.
# Currently this function inserts a Creative Commons Zero (CC0) license. The
# CC0 license will need to be updated. However, to ensure that the license meets
# NPS specifications and properly coincides with CUI designations, the best way
# to update the license information is during a later step using
# EMLeditor::set_int_rights(). There is no need to edit this .txt file.
template_core_metadata(path = working_folder,
                       license = "CC0") # that '0' is a zero!
# FUNCTION 2 - Data Table Attributes
# Creates an "attributes_datafilename.txt" file for each data file. Open each in
# Excel (we recommend against trying to update these in a text editor) and fill
# in/adjust the columns for attributeDefinition, class, unit, etc. Refer to
# https://ediorg.github.io/EMLassemblyline/articles/edit_tmplts.html
# for helpful hints and to view_unit_dictionary() for potential units. This will
# only need to be run again if the attributes (name, order, or new/deleted
# fields) are modified from the previous year. NOTE that if these files already
# exist from a previous run, they are not overwritten.
template_table_attributes(path = working_folder,
                          data.table = data_files,
                          write.file = TRUE)
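# While filling in the 'unit' column of the attributes files, you can browse
# the standard units EML recognizes with the EMLassemblyline helper mentioned
# above:

```r
# Opens a table of EML standard units; anything not listed here must be
# defined in the custom_units template instead
view_unit_dictionary()
```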
# FUNCTION 3 - Data Table Categorical Variable
# Creates a "catvars_datafilename.txt" file for each data file that has columns
# with a class = categorical. These .txt files will include each unique 'code'
# and allow input of the corresponding 'definition'. NOTE that since the
# list of codes is harvested from the data itself, it's possible that additional
# codes may have been relevant/possible but they are not automatically included
# here. Consider your lookup lists carefully to see if additional options should
# be included (e.g., if your dataset DPL values are all set to "Accepted" this
# function will not include "Raw" or "Provisional" in the resulting file and you
# may want to add those manually). NOTE that if these files already exist from a
# previous run, they are not overwritten.
template_categorical_variables(path = working_folder,
                               data.path = working_folder,
                               write.file = TRUE)
# FUNCTION 4 - Geographic Coverage
# If the only geographic coverage information you plan on using is park
# boundaries, you can skip this step. You can add park unit connections using
# EMLeditor, which will automatically generate properly formatted GPS
# coordinates for the park bounding boxes.
# If you would like to add additional GPS coordinates (such as for specific site
# locations, survey plots, or bounding boxes for locations within a park),
# please do.
# Creates a geographic_coverage.txt file that lists your sites as points, as
# long as your coordinates are in lat/long. If your coordinates are in UTM, it
# is probably easiest to convert them first or create the
# geographic_coverage.txt file another way (see
# https://nationalparkservice.github.io/QCkit/ for R functions that will convert
# UTM to lat/long).
template_geographic_coverage(path = working_folder,
                             data.path = working_folder,
                             data.table = data_coordinates_table,
                             lat.col = data_latitude,
                             lon.col = data_longitude,
                             site.col = data_sitename,
                             write.file = TRUE)
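# As a sketch of that UTM conversion, the following assumes the sf package and
# NAD83 / UTM zone 17N (EPSG:26917, which covers EVER); substitute the EPSG
# code for your zone, or use the QCkit functions instead. The coordinates below
# are hypothetical examples, not real data.

```r
library(sf)
# Hypothetical example coordinates; replace with your own easting/northing values
utm_points <- data.frame(easting = c(540000, 541250),
                         northing = c(2795000, 2796500))
# Attach the UTM coordinate reference system, then transform to WGS84 lat/long
pts <- st_as_sf(utm_points, coords = c("easting", "northing"), crs = 26917)
pts_latlong <- st_transform(pts, crs = 4326)
st_coordinates(pts_latlong)  # columns X (decimalLongitude) and Y (decimalLatitude)
```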
# FUNCTION 5 - Taxonomic Coverage
# Creates a taxonomic_coverage.txt file if you have taxonomic data.
# Currently supported authorities are 3 = ITIS, 9 = WORMS, and 11 = GBIF.
template_taxonomic_coverage(path = working_folder,
                            data.path = working_folder,
                            taxa.table = data_taxa_tables,
                            taxa.col = data_taxa_fields,
                            taxa.authority = c(3, 11),
                            taxa.name.type = "scientific",
                            write.file = TRUE)
## Create an EML File ----------------------------------------------------------
# Run this (it may take a little while) and see if it validates (you should see
# 'Validation passed'). It will generate an R object called "my_metadata".
# The function may also alert you to some issues to review. Run the function
# 'issues()' at the end of the process to get feedback on items that might be
# missing or need attention. Fix these issues and then re-run the make_eml()
# function.
my_metadata <- make_eml(path = working_folder,
                        dataset.title = package_title,
                        data.table = data_files,
                        data.table.name = data_names,
                        data.table.description = data_descriptions,
                        data.table.url = data_urls,
                        temporal.coverage = c(startdate, enddate),
                        maintenance.description = data_type,
                        package.id = metadata_id)
## Check for EML validity ------------------------------------------------------
# This is a good point to pause and test whether your EML is valid.
eml_validate(my_metadata)
# If your EML is valid you should see the following (admittedly cryptic) output:
# [1] TRUE
# attr(,"errors")
# character(0)
# If your EML is not schema valid, the function will notify you of specific
# problems you need to address. We HIGHLY recommend that you use the
# EMLassemblyline and/or EMLeditor functions to fix your EML and do not attempt
# to edit it by hand.
## Add NPS specific fields to EML ----------------------------------------------
# Now that you have valid EML metadata, you need to add NPS-specific elements
# and fields. For instance, unit connections, DOIs, referencing a DRR, etc. More
# information about these functions can be found at:
# https://nationalparkservice.github.io/EMLeditor/.
## Add Controlled Unclassified Information (CUI) codes -------------------------
# This is a required step. It is important to indicate not only that your data
# package contains CUI, but also to inform users if your data package does NOT
# contain CUI because empty fields can be ambiguous (does it not contain CUI or
# did the creators just miss that step?). You can choose from one of five CUI
# dissemination codes. Watch out for the spaces! These are:
# PUBLIC - Does NOT contain CUI.
# FED ONLY - Contains CUI. Only federal employees should have access
# (similar to "internal only" in DataStore).
# FEDCON - Contains CUI. Only federal employees and federal contractors should
# have access (also very much like current "internal only" setting in
# DataStore).
# DL ONLY - Contains CUI. Should only be available to a named list of
# individuals (where and how to list those individuals TBD)
# NOCON - Contains CUI. Federal, state, local, or tribal employees may have
# access, but contractors cannot.
# More information about these codes can be found at:
# https://www.archives.gov/cui/registry/limited-dissemination
my_metadata <- set_cui(my_metadata, "PUBLIC")
# Note that in this case I have added the CUI code to the original R object,
# "my_metadata", but by assigning the result to a new name, e.g. "my_meta2", I
# could have created a new R object. Sometimes creating a new R object is
# preferable because if you make a mistake you don't need to start over again.
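# For example, branching to a new object instead of overwriting:

```r
# Assigning to a new name keeps "my_metadata" untouched, so a mistake here
# does not force you to re-run make_eml()
my_meta2 <- set_cui(my_metadata, "PUBLIC")
```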
## Set the Intellectual Rights--------------------------------------------------
# EMLassemblyline and ezEML provide some attractive looking boilerplate for
# setting the intellectual rights. It looks reasonable and so is easy to just
# keep. However, NPS has some specific regulations about what can and cannot be
# in the intellectualRights tag. Use set_int_rights() to replace the text with
# NPS-approved text. Note: You must first add the CUI dissemination code using
# set_cui() as the dissemination code and license must agree. That is, you
# cannot give a data package with a PUBLIC dissemination code a "restricted"
# license (and vice versa: a restricted data package that contains CUI cannot
# have a public domain or CC0 license). You can choose from one of three
# options:
# "restricted": If the data contains Controlled Unclassified Information (CUI),
# the intellectual rights must read: "This product has been determined to
# contain Controlled Unclassified Information (CUI) by the National Park
# Service, and is intended for internal use only. It is not published under an
# open license. Unauthorized access, use, and distribution are prohibited."
# "public": If the data do not contain CUI, the default is the public domain.
# The intellectual rights must read: "This work is in the public domain. There
# is no copyright or license."
# "CC0": If you need a license, for instance if you are working with a partner
# organization that requires a license, use CC0: "The person who associated a
# work with this deed has dedicated the work to the public domain by waiving all
# of his or her rights to the work worldwide under copyright law, including all
# related and neighboring rights, to the extent allowed by law. You can copy,
# modify, distribute and perform the work, even for commercial purposes, all
# without asking permission."
# The set_int_rights() function will also put the name of your license in a
# field in EML for DataStore harvesting.
# choose from "restricted", "public" or "CC0" (zero), see above:
my_metadata <- set_int_rights(my_metadata, "public")
## Add a data package DOI (optional) -------------------------------------------
# Add your data package's Digital Object Identifier (DOI) to the metadata. The
# set_datastore_doi() function requires that you are logged on to the VPN. It
# initiates a draft data package reference on DataStore and populates the
# reference with a title pulled from your metadata: "[DRAFT] : <your data
# package title>". This temporary title is purely for your tracking purposes and
# can easily be updated later. The set_datastore_doi() function will then insert
# the corresponding DOI for your data package into your metadata. There are a
# few things to keep in mind:
# 1) Your DOI and the data package reference are not yet active and are not
# publicly accessible until after review and activation/publication.
# 2) Be sure to upload your data package to the correct draft reference! It is
# easy to create several draft references with the same draft title so
# check the reference ID number carefully (we are working on making this
# process easier and less error prone).
# There is no need to fill in additional fields in DataStore at this point -
# many of them will be auto-populated based on the metadata you upload. Any
# fields you do populate will be over-written by the content in your metadata.
my_metadata <- set_datastore_doi(my_metadata)
## Add information about a DRR (optional) --------------------------------------
# If you are producing (or plan to produce) a DRR, add links to the DRR
# describing the data package.
# Similar to when you added the data package DOI, you will need the DOI for the
# DRR you are drafting as well as the DRR's Title. Again, go to DataStore and
# initiate a draft DRR, including a title. For the purposes of the data package,
# there is no need to populate any other fields. At this point, you do not need
# to activate the DRR reference and, while a DOI has been reserved for your DRR,
# it will not be activated until after publication so that you have plenty of
# time to construct the DRR.
my_metadata <- set_drr(my_metadata, 7654321, "DRR Title")
## Set the language ------------------------------------------------------------
# This is the human language (as opposed to computer language) that the data
# package and metadata are constructed in. Examples include English, Spanish,
# Navajo, etc. A full list of available languages is available from the Library
# of Congress. Please use the "English Name of Language" as an input. The
# function will then convert your input to the appropriate 3-character ISO
# 639-2 code.
# Available languages: https://www.loc.gov/standards/iso639-2/php/code_list.php
my_metadata <- set_language(my_metadata, "English")
## Add content unit links ------------------------------------------------------
# These are the park units from which data were collected, for instance ROMO,
# not ROMN. If the data package includes data from more than one park, they can
# all be listed. For instance, if data were collected from all park units within
# a network, each unit should be listed separately rather than the network.
# This is because the geographic coordinates corresponding to bounding boxes for
# each park unit listed will automatically be generated and inserted into the
# metadata. Individual park units will be more informative than the bounding box
# for the entire network.
park_units <- c("ROMO", "GRSA", "YELL")
my_metadata <- set_content_units(my_metadata, park_units)
## Add the Producing Unit(s) ---------------------------------------------------
# This is the unit(s) responsible for generating the data package. It may be a
# single park (ROMO) or a network (ROMN). It may be identical to the units
# listed in the previous step, overlapping, or entirely different.
# a single producing unit:
my_metadata <- set_producing_units(my_metadata, "ROMN")
# alternatively, a list of producing units (run only one of the two):
# my_metadata <- set_producing_units(my_metadata, c("ROMN", "GRYN"))
## Validate your EML -----------------------------------------------------------
# Almost done! This is another great time to validate your EML and make sure
# everything is schema valid. Run:
eml_validate(my_metadata)
# If your EML is valid you should see the following (admittedly cryptic) output:
# [1] TRUE
# attr(,"errors")
# character(0)
# If your EML is not schema valid, the function will notify you of specific
# problems you need to address. We HIGHLY recommend that you use the
# EMLassemblyline and/or EMLeditor functions to fix your EML and do not attempt
# to edit it by hand.
## Write your EML to an xml file -----------------------------------------------
# Now it's time to convert your R object to an .xml file and save it. Keep in
# mind that the file name should end with "_metadata.xml".
write_eml(my_metadata, "mymetadatafilename_metadata.xml")
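# One way to avoid filename typos is to build the name from the metadata_id you
# set at the top of this script (it already ends in "_metadata", so the result
# complies with the data package specification):

```r
# Equivalent to typing the filename by hand, but reuses metadata_id
write_eml(my_metadata, paste0(metadata_id, ".xml"))
```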
## Check your .xml file --------------------------------------------------------
# Your EML metadata file should be ready for upload. You can run some
# additional tests on your .xml metadata file alone using:
check_eml()
# This assumes that your .xml file is in your working directory and that it is
# the only .xml file in your working directory.
## Check your data package -----------------------------------------------------
# If your data package is now complete, you can run some tests prior to upload
# to make sure that the package meets a minimal set of requirements and that
# the data and metadata are properly specified and coincide. This assumes that
# your data package is in the root of your R project.
run_congruence_checks()
# Alternatively, you can tell run_congruence_checks() where your data package
# is. The format should look something like:
run_congruence_checks("C:/Users/yourusername/Documents/my_data_package")
## Congratulations -------------------------------------------------------------
# If everything checked out, you should be ready to upload your data package!
# If you initiated a draft reference and inserted a DOI, make sure to upload
# it to the correct draft reference that corresponds to your DOI. Remember, you
# can upload multiple files simultaneously by highlighting them all rather than
# uploading one-by-one.