EOL Darwin Core Archive #731

Draft
wants to merge 37 commits into
base: main

mo-nathan
Member

Implements the taxon-based Darwin Core Archive (DwCA) that EOL expects. This involves some renaming and refactoring to differentiate between the two forms of DwCA.

To test:

  • Go to a page where you can download a report (e.g., the Download link from a Species List page)
  • Select the 'Taxon Darwin Core Archive for EOL' option
  • Download the resulting zip file and check its contents.
  • Note that this is not expected to pass through the GBIF validator since it has no occurrence data.

@JoeCohen
Member

JoeCohen commented Jan 3, 2021

I ran the manual test; unzipped the downloaded file, resulting in 3 well-formed files.
I don't understand the intended use of the multimedia.csv. In particular, how does one relate the images to the taxa?

@mo-nathan
Member Author

@JoeCohen The first column of the multimedia.csv should match the first column of the taxa.csv file. So if you look at a line in multimedia.csv you should be able to figure out which taxon it is associated with. Similarly if you search for an id from taxa.csv you should find all the images for that taxon in multimedia.csv (as URLs of course).
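The relationship described above can be sketched in Ruby as a simple join on the first column. The sample data, the header names, and the helper names `images_by_taxon` and `taxon_names` are all hypothetical; only the "first column is the taxon id" convention comes from this thread.

```ruby
require "csv"

# Hypothetical sample data; the real files come from the downloaded zip.
TAXA_TEXT = <<~TEXT
  taxonID,scientificName
  1,Amanita muscaria
  2,Boletus edulis
TEXT

MULTIMEDIA_TEXT = <<~TEXT
  taxonID,accessURI
  1,https://example.org/images/10.jpg
  1,https://example.org/images/11.jpg
  2,https://example.org/images/20.jpg
TEXT

# Group image URLs by the value in the first column (the taxon id).
def images_by_taxon(multimedia_text)
  rows = CSV.parse(multimedia_text, headers: true)
  rows.each_with_object({}) do |row, result|
    (result[row[0]] ||= []) << row[1]
  end
end

# Map each taxon id (first column) to its name (second column).
def taxon_names(taxa_text)
  CSV.parse(taxa_text, headers: true).each_with_object({}) do |row, result|
    result[row[0]] = row[1]
  end
end

images = images_by_taxon(MULTIMEDIA_TEXT)
taxon_names(TAXA_TEXT).each do |id, name|
  puts "#{name}: #{images.fetch(id, []).length} image(s)"
end
```

A run of identical ids at the top of multimedia.csv, as noted in the next comment, just means the first taxon has several images.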

@JoeCohen
Member

JoeCohen commented Jan 4, 2021

Thanks @mo-nathan. I wasn't looking far enough down the first column of the multimedia file; I saw a bunch of "1"'s and stopped looking.
Another question:
Which images is it picking? (I exported my observations, but it's not using all of my images for those Observations.)

@mo-nathan
Member Author

Here's the feedback from Jen Hammock @ EOL:

"Looks to me like you're almost there. You do need a couple of things, mostly in your meta, a couple in your media file.

The taxonID column in the taxa file should go to http://rs.tdwg.org/dwc/terms/taxonID

And the first four columns in your media file should go to

http://purl.org/dc/terms/identifier
http://purl.org/dc/terms/type
http://purl.org/dc/terms/format
http://rs.tdwg.org/ac/terms/accessURI

And the values in the http://purl.org/dc/terms/identifier column should be unique in the file. You can re-use the values from the http://rs.tdwg.org/ac/terms/accessURI column for this if you like.

Finally, your http://rs.tdwg.org/dwc/terms/taxonID column should also appear in the media file, to provide the taxon mappings. Does that make sense?"

I sent her another zip file based on the new code changes and will update this PR when I hear back.
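As a rough sketch of one way to read that feedback: the media columns map to the four listed terms, the identifier reuses the accessURI value, and taxonID is carried along for the taxon mapping. Placing taxonID fifth, the `StillImage` and `image/jpeg` values, and every helper name here are assumptions, not something EOL or this PR specifies.

```ruby
# The first four terms are in the order given in the feedback.
# Appending taxonID as the fifth column is an assumption.
MEDIA_TERMS = [
  "http://purl.org/dc/terms/identifier",
  "http://purl.org/dc/terms/type",
  "http://purl.org/dc/terms/format",
  "http://rs.tdwg.org/ac/terms/accessURI",
  "http://rs.tdwg.org/dwc/terms/taxonID"
].freeze

# Build one media row, reusing the access URL as the unique identifier.
# The type and format defaults are placeholder values (assumed).
def media_row(url, taxon_id, type: "StillImage", format: "image/jpeg")
  [url, type, format, url, taxon_id]
end

# Return identifiers that appear more than once; should be empty.
def duplicate_identifiers(rows)
  rows.group_by(&:first).select { |_id, group| group.size > 1 }.keys
end

# Emit meta.xml <field> elements for the given column terms.
def meta_fields(terms)
  terms.each_with_index.map do |term, index|
    %(<field index="#{index}" term="#{term}"/>)
  end
end

rows = [
  media_row("https://example.org/images/10.jpg", 1),
  media_row("https://example.org/images/11.jpg", 1)
]
puts meta_fields(MEDIA_TERMS)
puts "duplicates: #{duplicate_identifiers(rows).inspect}"
```

Checking for duplicate identifiers before writing the file is cheap and catches the case where the same image URL is attached to two taxa.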

query.project(attribute(:images, :id),
attribute(:images, :when),
attribute(:images, :copyright_holder),

Empty line detected around arguments.

attribute(:observations, :long),
attribute(:observations, :alt),
attribute(:observations, :notes),

Empty line detected around arguments.

attribute(:names, :text_name),
attribute(:names, :author),
attribute(:names, :rank),

Empty line detected around arguments.

attribute(:locations, :west),
attribute(:locations, :high),
attribute(:locations, :low),

Empty line detected around arguments.


attribute(:users, :name),
attribute(:users, :login),

Empty line detected around arguments.

@mo-nathan mo-nathan marked this pull request as draft February 8, 2021 01:05
family = extract_name(level) if level.start_with?("Family")
return [family, kingdom] if family && kingdom
end
return ["", kingdom] if kingdom

Add empty line after guard clause.
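For context, here is a minimal reconstruction of how the fragments in this review might fit together, with the blank line after the guard clause that the linter asks for. Only the `Family` line, the two `return` statements, the trailing `nil`, and the body of `extract_name` appear in the diff; the loop, the `Kingdom` handling, and the "Rank: _Name_" line format are assumptions.

```ruby
# Pull the name out of a line such as "Family: _Amanitaceae_".
def extract_name(level)
  level.split(": _")[1].chomp("_")
end

# Returns [family, kingdom], ["", kingdom] when no family is listed,
# or nil when no kingdom is found.
def parse_classification(classification)
  kingdom = family = nil
  classification.to_s.each_line do |line| # assumed: one rank per line
    level = line.strip
    kingdom = extract_name(level) if level.start_with?("Kingdom") # assumed
    family = extract_name(level) if level.start_with?("Family")
    return [family, kingdom] if family && kingdom
  end
  return ["", kingdom] if kingdom

  nil
end

p parse_classification("Kingdom: _Fungi_\nFamily: _Amanitaceae_")
p parse_classification("Kingdom: _Fungi_")
p parse_classification(nil)
```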


private

def parse_genus(name)

Inconsistent indentation detected.

name.split(' ')[0]
end

def higher_taxa(row)

Inconsistent indentation detected.

result
end

def parse_classification(classification)

Inconsistent indentation detected.

nil
end

def extract_name(level)

Inconsistent indentation detected.

@coveralls
Collaborator

coveralls commented Mar 13, 2021

Coverage Status

coverage: 94.426% (-0.1%) from 94.523%
when pulling 4894c73 on eol-dwca
into 0e94b1f on main.

level.split(": _")[1].chomp("_")
end

def genus_classification(genus)

Inconsistent indentation detected.


private

def parse_genus(name)

Use 2 (not 4) spaces for indentation.

name.split(' ')[0]
end

def higher_taxa(row)

Use 2 (not 4) spaces for indentation.

result
end

def parse_classification(classification)

Use 2 (not 4) spaces for indentation.

nil
end

def extract_name(level)

Use 2 (not 4) spaces for indentation.

level.split(": _")[1].chomp("_")
end

def genus_classification(genus)

Use 2 (not 4) spaces for indentation.

private

def parse_genus(name)
name.split(' ')[0]

Prefer double-quoted strings unless you need single quotes to avoid extra backslashes for escaping.

private

def parse_genus(name)
name.split(' ')[0]

Argument ' ' is redundant because it is implied by default.
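Both quoting and redundant-argument complaints about `parse_genus` go away if the argument is dropped, since `String#split` with no argument already splits on whitespace. A minimal sketch of that change, using a made-up sample name:

```ruby
# Take the first whitespace-separated word of a scientific name as the genus.
def parse_genus(name)
  name.split[0]
end

puts parse_genus("Amanita muscaria") # => Amanita
```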

# generate CSV & meta.xml and bundle into a Zip
def render
filename = "#{::Rails.root}/public/dwca/eol_meta.xml"
content << ["meta.xml", File.open(filename).read]

Use File.read.

# generate CSV & meta.xml and bundle into a Zip
def render
filename = "#{::Rails.root}/public/dwca/gbif_meta.xml"
content << ["meta.xml", File.open(filename).read]

Use File.read.
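The point of this suggestion is that `File.open(filename).read` leaves the file handle open until it is garbage collected, while `File.read` opens, reads, and closes in one call. A small sketch of the replacement, where `meta_entry` is a hypothetical helper and the temp file stands in for the real meta.xml:

```ruby
require "tempfile"

# Build the ["meta.xml", contents] pair that gets appended to the zip content.
def meta_entry(filename)
  ["meta.xml", File.read(filename)]
end

Tempfile.create("meta") do |file|
  file.write("<archive/>")
  file.flush
  p meta_entry(file.path)
end
```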

locality = locality.blank? ? county : "#{county}, #{locality}"
county = nil
@country, @state, @county, @locality = val.split(", ", 4)
if @county && !@county.sub!(/ (Co\.|Parish)$/, "")

Use a guard clause (return unless @county && !@county.sub!(/ (Co\.|Parish)$/, "")) instead of wrapping the code inside a conditional expression.
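A sketch of what the guard-clause version could look like. It assumes the condition is a negated `@county.sub!` call (`sub!` returns nil when nothing was replaced, so the negation is true when the third part is not really a county). The `LocationParts` class and `demote_county_to_locality` method are hypothetical names, and the plain-Ruby emptiness check stands in for the `blank?` used in the Rails code.

```ruby
class LocationParts
  attr_reader :country, :state, :county, :locality

  def initialize(val)
    @country, @state, @county, @locality = val.split(", ", 4)
    demote_county_to_locality
  end

  private

  # If the third part does not end in " Co." or " Parish", treat it as
  # part of the locality instead of as a county.
  def demote_county_to_locality
    return unless @county && !@county.sub!(/ (Co\.|Parish)$/, "")

    # Rails code would use @locality.blank? here.
    @locality = @locality.to_s.empty? ? @county : "#{@county}, #{@locality}"
    @county = nil
  end
end

parts = LocationParts.new("USA, California, Alameda Co., Berkeley")
puts "#{parts.county} / #{parts.locality}"
```

Note that `sub!` also strips the suffix in place when it matches, so the stored county is "Alameda" rather than "Alameda Co.".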

@codeclimate
Contributor

codeclimate bot commented Nov 7, 2022

Code Climate has analyzed commit 183e8e7 and detected 16 issues on this pull request.

Here's the issue category breakdown:

Category Count
Style 16

View more on Code Climate.

@mo-nathan
Member Author

mo-nathan commented Feb 17, 2024

This PR also includes changes to support the GBIF Darwin Core Archive. However, there are some issues with the resulting ZIP file.

  1. The new eml.xml file. I don't remember this being required the last time I worked on this code, but it now appears to be required to pass the GBIF validator (https://www.gbif.org/tools/data-validator). It contains some dynamic info (dates) that should match when the archive is created. For now I just dropped in a hard-coded version of the file to get the GBIF validator to pass.
  2. I created an output file by running the code locally. The result is here: https://mushroomobserver.org/gbif.zip. After adding the eml.xml file mentioned above, I created another archive that is here: https://mushroomobserver.org/valid-gbif.zip. This version technically passes the validator, but it has a bunch of warnings that should be reviewed and potentially fixed. The current results of running this file through the validator are:

The file can be indexed by GBIF
Some issues were detected by the validator:
Resource Structure | The EML document does not validate against the schema
Metadata Content |
The description of the dataset is missing or too short
The resource creator is missing or is incomplete

Warning Count
Modified date invalid 97709
Multimedia date invalid 97709
Continent derived from coordinates 97524
Taxon match higherrank 1500
Taxon match none 336
Country coordinate mismatch 213
Presumed negated longitude 205
Country derived from coordinates 184
Country invalid 113
Zero coordinate 28
Presumed negated latitude 17
Presumed swapped coordinate 1
Geodetic datum assumed WGS84 97709

These should be reviewed and cleaned up where reasonable.
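For the dynamic dates in eml.xml mentioned in item 1, one possible approach is to render the file from a template at archive-creation time instead of shipping a hard-coded copy. The sketch below uses ERB from the standard library. The trimmed template, the choice of the `pubDate` element, and the `render_eml` helper are assumptions for illustration and would need to be checked against the EML profile the GBIF validator expects.

```ruby
require "erb"
require "date"

# Hypothetical, heavily trimmed template; a real eml.xml has many more
# elements, and which ones carry dates is an assumption here.
EML_TEMPLATE = <<~XML
  <dataset>
    <pubDate><%= created_on.iso8601 %></pubDate>
  </dataset>
XML

# Render the template with the date the archive is being created.
def render_eml(created_on)
  ERB.new(EML_TEMPLATE).result(binding)
end

puts render_eml(Date.today)
```

Passing the date in as an argument keeps the output reproducible in tests and lets the same value be reused anywhere else in the archive that records the creation date.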
