Add AcTOR query and img function #247

Draft. Wants to merge 2 commits into base: master

@andschar andschar commented May 5, 2020

Pull Request

This is the first part of the PR to include the AcTOR data source in webchem (Issue #209). It's not finished yet (documentation etc. is missing) and is here for discussion.

I haven't found any explicit prohibitions, and the EPA generally has rather open policies about its data, but this is still web scraping rather than an official API. It's probably best to ask them.

Once we have decided on this source, I will update the PR.

PR task list:

  • Update NEWS
  • Add tests (if appropriate)
  • Update documentation with devtools::document()
  • Check package passed

#' @import httr xml2
#'
#' @param query character; search term.
#' @param from character; type of input. Only "cas".
Contributor Author

As far as I know, CAS numbers are the only identifiers that can be used in the query.
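
Since the query accepts CAS numbers only, malformed input could be rejected before any request is sent. The following is a minimal base R sketch of such a check, based on the standard CAS check-digit rule. The helper name `is_valid_cas()` is hypothetical and not part of this PR.

```r
# Hypothetical helper (not part of the PR): TRUE if `x` is a well-formed
# CAS Registry Number with a correct check digit.
is_valid_cas <- function(x) {
  vapply(x, function(cas) {
    # Format: 2 to 7 digits, 2 digits, 1 check digit, separated by hyphens.
    if (is.na(cas) || !grepl("^[0-9]{2,7}-[0-9]{2}-[0-9]$", cas)) {
      return(FALSE)
    }
    digits <- as.integer(strsplit(gsub("-", "", cas, fixed = TRUE), "")[[1]])
    check <- digits[length(digits)]
    body <- rev(digits[-length(digits)])
    # Weight each digit by its position counted from the right, then mod 10.
    sum(body * seq_along(body)) %% 10 == check
  }, logical(1), USE.NAMES = FALSE)
}

is_valid_cas(c("50-00-0", "7732-18-5", "50-00-1", "formaldehyde"))
# TRUE TRUE FALSE FALSE
```

A passing check digit only shows that the string is well formed, not that the substance exists in AcTOR.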

Comment on lines +54 to +65
chemical_name <- trimws(xml_text(xml_nodes(site, ".chemicalNameFont")))
cas_dsstox <- xml_nodes(site, "#dsstoxSubstanceIdContainerId")
cas <- trimws(xml_text(xml_node(cas_dsstox, "#casrnId")))
dsstox <- trimws(xml_text(xml_child(cas_dsstox[[1]], 3))) # error prone
inchi <- trimws(xml_text(xml_node(site, "#inchiContainerId")))
inchi <- trimws(sub("InChi: InChI=", "", inchi, fixed = TRUE))
inchikey <- trimws(xml_text(xml_node(site, "#inchiKeyContainerId")))
inchikey <- trimws(sub("InChi Key:", "", inchikey))
formula <- trimws(xml_text(xml_node(site, "#molFormulaContainerId")))
formula <- trimws(sub("Molecular Formula:", "", formula))
molecularweight <- trimws(xml_text(xml_node(site, "#molWeightContainerId")))
molecularweight <- trimws(sub("Molecular Weight:", "", molecularweight))
Contributor Author

This function retrieves only the basic parameters named here. AcTOR contains a lot more data, but in very unstructured formats that are, IMHO, not easily parsable (links to documents, various data formats, etc.).
Still, I think it's quite useful to retrieve clean common names and the DSSTOX ID, which the CompTox application uses as an identifier.
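
The repeated "extract text, remove label, trim" pattern in the lines above could be factored into a small helper. This is only a sketch: `strip_label()` is a hypothetical name, and the example strings are made up for illustration rather than copied from the AcTOR site.

```r
# Hypothetical helper (not in the PR): drop a label such as
# "Molecular Formula:" from scraped text and trim surrounding whitespace.
strip_label <- function(text, label) {
  trimws(sub(label, "", text, fixed = TRUE))
}

# Example strings are invented for illustration only.
strip_label("  Molecular Formula: CH2O ", "Molecular Formula:")
strip_label("Molecular Weight: 30.026", "Molecular Weight:")
```

Using `fixed = TRUE` throughout avoids surprises if a label ever contains regex metacharacters.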

#'
#' }
#'
actor <- function(query,
Contributor Author

It's a bit like a get_*() function, but it retrieves no actual AcTOR ID (there is none). Hence I stuck to the get_*() style, apart from the function name.

Contributor

Isn't this more like a function we would call after get_*()? There are many common endings, e.g. *_query(), *_compinfo(), *_prop(), *_convert(). Apart from molecular weight, the rest are IDs, so I assume this function would mostly be used for ID conversions?

Contributor Author

Yes, I agree. It's probably better to name it get_actor(). OK? I wouldn't name it get_actorid() since there is no AcTOR ID.

Contributor

I don't think the get_*() style should be used, because there is no ACToR-specific ID, as you said. I was thinking of something more like actor_query() or actor_compinfo().

Contributor Author

OK, I misunderstood you. I'm fine with actor_query() too.
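
To make the naming discussed above concrete, here is a minimal sketch of what an actor_query() skeleton might look like. It is not the PR's implementation: the `fetch` argument is a hypothetical stand-in for the retrieval and parsing code, injected here only so that the example runs without web access.

```r
# Sketch only: `fetch` is a hypothetical stand-in for the code that would
# actually retrieve and parse one AcTOR record.
actor_query <- function(query, from = c("cas"), fetch) {
  from <- match.arg(from)  # errors on anything other than "cas"
  out <- lapply(query, function(q) {
    if (is.na(q)) return(NA)  # missing queries yield NA, no request made
    fetch(q, from)
  })
  names(out) <- query
  out
}

# Dummy fetcher, used here only so that the example runs offline.
dummy_fetch <- function(q, from) list(query = q, from = from)
res <- actor_query(c("50-00-0", NA), from = "cas", fetch = dummy_fetch)
```

Returning a list named by the input queries keeps results easy to match back to the original vector, including missing entries.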

#' actor_img(comp)
#' }
#'
actor_img = function(query,
Contributor Author

I added this functionality simply because it was possible. Let me know whether you think we need such image functions.

Contributor

Images have been mentioned recently in Issue #132 and PR #235. I think there is general agreement that images would add a lot to webchem. I will open a separate issue for images so that we can discuss the design of these functions now and not have to sort out consistency later :)

Contributor

@stitam stitam May 6, 2020

I have opened Issue #249 for this discussion.

andschar (Contributor Author) commented May 5, 2020

Uh-oh, I searched a little more and found a robots.txt here:
https://www.epa.gov/robots.txt
It states:
Disallow: ACToR
However, the robots.txt also says that it aims to prevent crawling, not scraping.
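
Whether a given path is covered by such a rule can be checked programmatically. Below is a minimal base R sketch, with a hypothetical function name, that looks only at Disallow rules in "User-agent: *" groups and ignores Allow rules and wildcards. The robots.txt content in the example is made up and is not the EPA's actual file.

```r
# Minimal sketch: TRUE if `path` falls under a Disallow rule that applies to
# all user agents ("User-agent: *"). Ignores Allow rules, wildcards and
# agent-specific groups, so it is cruder than a real robots.txt parser.
is_disallowed <- function(robots_lines, path) {
  lines <- trimws(sub("#.*$", "", robots_lines))  # drop comments
  lines <- lines[nzchar(lines)]
  field <- tolower(trimws(sub(":.*$", "", lines)))
  value <- trimws(sub("^[^:]*:", "", lines))
  applies <- FALSE     # are we inside a "User-agent: *" group?
  prev_agent <- FALSE  # was the previous line a User-agent line?
  for (i in seq_along(lines)) {
    if (field[i] == "user-agent") {
      # consecutive User-agent lines share one group
      applies <- (prev_agent && applies) || value[i] == "*"
      prev_agent <- TRUE
    } else {
      prev_agent <- FALSE
      if (applies && field[i] == "disallow" && nzchar(value[i]) &&
          startsWith(path, value[i])) {
        return(TRUE)
      }
    }
  }
  FALSE
}

# Made-up robots.txt content for illustration; not the EPA's actual file.
example <- c("User-agent: *", "Disallow: /private/", "",
             "User-agent: somebot", "Disallow: /")
is_disallowed(example, "/private/data.html")  # TRUE
is_disallowed(example, "/public/index.html")  # FALSE
```

For a live check, the lines could be read with `readLines()` from the robots.txt URL. The rOpenSci robotstxt package, with its `paths_allowed()` function, would likely be a more complete alternative to this hand-rolled check.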

Contributor

@stitam stitam left a comment

Thanks @andschar! While looking for more information on ACToR, I found this link: https://actor.epa.gov/actor/download.xhtml and the "Details" document points to a web service! The schema seems quite complex and I haven't looked into it in detail, but maybe it would help us access more structured data?

@stitam stitam mentioned this pull request May 6, 2020
Aariq (Collaborator) commented May 6, 2020

> Uh-oh, I searched a little more and found a robots.txt here:
> https://www.epa.gov/robots.txt
> It states:
> Disallow: ACToR
> However, the robots.txt also says that it aims to prevent crawling, not scraping.

The robots.txt here also disallows all user agents from scraping: https://actor.epa.gov/robots.txt

Since they've gone through the trouble of making the database available in their "download" tab, they probably don't mind web scraping, but I think it would be better to ask.

@andschar andschar changed the title Dev actor comptox Add AcTOR query and img function May 7, 2020
andschar (Contributor Author) commented May 7, 2020

> > Uh-oh, I searched a little more and found a robots.txt here:
> > https://www.epa.gov/robots.txt
> > It states:
> > Disallow: ACToR
> > However, the robots.txt also says that it aims to prevent crawling, not scraping.
>
> The robots.txt here also disallows all user agents from scraping: https://actor.epa.gov/robots.txt
>
> Since they've gone through the trouble of making the database available in their "download" tab, they probably don't mind web scraping, but I think it would be better to ask.

I have sent them an email. Let's wait for the reply.

stitam (Contributor) commented May 7, 2020

> The robots.txt here also disallows all user agents from scraping: https://actor.epa.gov/robots.txt

Do you think this affects the ACToR web service as well? https://actorws.epa.gov/actorws/

andschar (Contributor Author) commented May 7, 2020

> > The robots.txt here also disallows all user agents from scraping: https://actor.epa.gov/robots.txt
>
> Do you think this affects the ACToR web service as well? https://actorws.epa.gov/actorws/

Wow, I completely missed the AcTOR web service! Where did you find it? It doesn't seem to be too well documented ^^.
I guess this makes my function obsolete and we could do everything via the web service.

I don't think the robots.txt applies to a web service.

@andschar andschar mentioned this pull request May 7, 2020
4 tasks
stitam (Contributor) commented May 7, 2020

It was difficult to find, I admit. I found it through this link: https://actor.epa.gov/actor/download.xhtml

andschar (Contributor Author) commented May 7, 2020

> It was difficult to find, I admit. I found it through this link: https://actor.epa.gov/actor/download.xhtml

Now that's really confusing. I have been on this site several times and always thought that one could only download the SQL dump there: actor_2015q3.sql.gz. I never looked into Details. Damn. More eyes definitely see more :)

I wrote them a second email asking about the current state of the web service.
I can change the function afterwards to use the web service.

@Aariq Aariq marked this pull request as draft May 12, 2020 00:37
@Aariq Aariq linked an issue May 15, 2020 that may be closed by this pull request
@stitam stitam added this to the RC2019F milestone Sep 5, 2020
@stitam stitam removed this from the RC2019F milestone Sep 25, 2020
Development

Successfully merging this pull request may close these issues.

Images
3 participants