chenejac left a comment
@bkampe thanks for this. I haven't tested it yet, but I briefly reviewed the code. I have one small comment about the .gitignore file and one about the SPARQL API based approach. Ivan Mrsulja got the SPARQL API based approach working for the DSpace ETL (#63): the main script takes a parameter whose value can be tdb or sparql. Could that approach be adopted in this PR as well?
| /example-scripts/bash-scripts/full-harvest-examples/1.13-1.15-examples/example-openalex/logs/ | ||
| /example-scripts/bash-scripts/full-harvest-examples/1.13-1.15-examples/example-openalex/data/ | ||
| /example-scripts/bash-scripts/full-harvest-examples/1.13-1.15-examples/example-openalex/previous-harvest/ |
I thought lines 3-7 already covered this case?
I think that adding:
**/data
**/logs
**/previous-harvest
to the root-level .gitignore will solve this issue.
Should have been in there, I thought. Will take a look.
ivanmrsulja left a comment
LGTM! Please check out my comments; I believe they can be helpful.
| //log.trace("Adding record: " + fixedkey + "_" + recID); | ||
| //log.trace("data: "+ sb.toString()); | ||
| //log.info("rhOutput: "+ this.rhOutput); | ||
| //log.info("recID: "+recID); |
You can probably remove these commented-out lines (and the other commented-out snippets elsewhere); it will clean up the code slightly.
| sb.append(" <"); | ||
| sb.append(SpecialEntities.xmlEncode(field)); | ||
| sb.append(">"); | ||
| // insert field value | ||
| sb.append(SpecialEntities.xmlEncode(val.toString().trim())); | ||
| // Field END | ||
| sb.append("</"); | ||
| sb.append(SpecialEntities.xmlEncode(field)); | ||
| sb.append(">\n"); | ||
| return sb.toString(); |
Can these appends be chained? StringBuilder.append returns the builder itself, so the calls can be written as one fluent chain.
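A minimal sketch of the chained form, for illustration only. The class name is made up, and `xmlEncode` below is a simplified, hypothetical stand-in for the harvester's `SpecialEntities.xmlEncode`, which may behave differently. The sketch covers only the lines visible in the diff above.

```java
final class TagNameSketch {

    // Hypothetical stand-in for SpecialEntities.xmlEncode: escapes only the
    // five predefined XML entities. The real helper may do more.
    static String xmlEncode(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Same appends as in the reviewed snippet, written as one chained expression.
    static String getTagName(String field, Object val) {
        String tag = xmlEncode(field);
        return new StringBuilder()
                .append(" <").append(tag).append(">")      // field start
                .append(xmlEncode(val.toString().trim()))   // field value
                .append("</").append(tag).append(">\n")     // field end
                .toString();
    }

    public static void main(String[] args) {
        System.out.print(getTagName("title", " Bridges & Tunnels "));
    }
}
```

Encoding the field name once and reusing it also avoids calling the encoder twice for the same string.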
| .replaceAll(" |/", "_") | ||
| .replaceAll("\\(|\\)", "") | ||
| .replaceAll("/", "_"); |
I think you can use replaceAll("[ /]", "_").replaceAll("[()]", "") to make this clearer.
I'm not good at regex. And this was obviously incrementally added. I'm happy to use your suggestion to make it cleaner.
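To check that the suggestion behaves the same as the three calls visible in the diff, a small comparison can help. This is a sketch only: the class name and the sample keys are invented for illustration, and any calls that precede these in the real chain are not covered.

```java
final class KeyCleanupSketch {

    // The calls from the reviewed diff. The last one is redundant because
    // the first call has already replaced every "/".
    static String original(String key) {
        return key.replaceAll(" |/", "_")
                  .replaceAll("\\(|\\)", "")
                  .replaceAll("/", "_");
    }

    // The suggested form using character classes.
    static String suggested(String key) {
        return key.replaceAll("[ /]", "_")
                  .replaceAll("[()]", "");
    }

    public static void main(String[] args) {
        // Invented sample keys, not taken from real OpenAlex responses.
        String[] samples = {"host venue/url", "apc (usd)", "a/b (c) d", "plain"};
        for (String s : samples) {
            System.out.println(s + " -> " + suggested(s)
                    + " (same as original: " + original(s).equals(suggested(s)) + ")");
        }
    }
}
```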
| .replaceAll(" |/", "_") | ||
| .replaceAll("\\(|\\)", "") | ||
| .replaceAll("/", "_"); | ||
| if (!Character.isDigit(fixedkey.charAt(0)) && !fixedkey.equals("abstract_inverted_index")) { |
Can fixedkey ever be null? If so, I think there should be a null check for that edge case.
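One possible shape for such a guard, as a sketch only. The helper name is hypothetical, and skipping null or empty keys is an assumption about the desired behavior. The empty-string check is included because `charAt(0)` throws `StringIndexOutOfBoundsException` on an empty string.

```java
final class FixedKeyGuardSketch {

    // Hypothetical helper mirroring the condition in the reviewed line,
    // with guards for null and empty keys placed in front of it.
    static boolean isUsableKey(String fixedkey) {
        if (fixedkey == null || fixedkey.isEmpty()) {
            return false; // assumption: such keys are simply skipped
        }
        return !Character.isDigit(fixedkey.charAt(0))
                && !"abstract_inverted_index".equals(fixedkey);
    }

    public static void main(String[] args) {
        System.out.println(isUsableKey(null));
        System.out.println(isUsableKey(""));
        System.out.println(isUsableKey("title"));
    }
}
```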
| } | ||
| public String getTagName(String field, Object val) { | ||
| StringBuffer sb = new StringBuffer(); |
Why is StringBuffer used here? If this class will not be used in a multithreaded environment, I think we should switch to StringBuilder everywhere: it has the same API but skips StringBuffer's synchronization, so it is generally faster.
| <!-- <param name="file">https://api.openalex.org/works?filter=concepts.id:C10238366|C118416809|C119823426|C120991184|C122156500|C123657996|C124363303|C127416549|C147176958|C148803439|C154226666|C158049464|C158550234|C1631582|C173560066|C178432105|C190831278|C196316656|C203115093|C203299862|C205300905|C2775926657|C2776009117|C2776081408|C2776136241|C2776161637|C2776311590|C2776445639|C2776748203|C2776825979|C2777231864|C2777364373|C2777800518|C2777831296|C2778206487|C2778647717|C2778684775|C2778753569|C2778906150|C2779054714|C2779201158|C2779265402|C2779331490|C2779635184|C2780021121|C2780113678|C2780344732|C2780886216|C2780933643|C2781052401,authorships.institutions.country_code:de&per-page=200&mailto=forschungsatlas@tib.eu&cursor=*</param>--> | ||
| <param name="output">raw-records.config.xml</param> | ||
| <param name="namespaceBase">http://vivo.example.com/harvest/aims_users/</param> |
What is the purpose of this parameter, and how does it differ from baseURI in *.datamap.xsl? Moreover, the value is hardcoded in the *.datamap.xsl file, so anyone who changes it here also has to update the other file for harvesting to work properly. I found a changenamespace-all.config file in other examples. Any chance of using it for the OpenAlex ETL process?
I just copied it from another already existing config file. I have to find out what its purpose is.
...ripts/full-harvest-examples/1.13-1.15-examples/example-openalex/openalex-to-vivo.datamap.xsl
| xmlns:node-publication='http://vivo.example.com/harvest/aims_users/fields/publication/' | ||
| xmlns:fn='http://www.w3.org/2005/xpath-functions' | ||
| xmlns:functx='http://www.functx.com' | ||
| xmlns:vivo-oa='http://lod.tib.eu/onto/vivo-oa/' |
I will remove all the TIB-specific things.
| xmlns:c4o='http://purl.org/spar/c4o/' > | ||
| <xsl:output method = "xml" indent = "yes"/> | ||
| <xsl:variable name = "baseURI">https://forschungsatlas.fid-bau.de/individual/</xsl:variable> |
Hardcoded to a TIB-specific value.
I will remove all the TIB specific things.
| <xsl:if test="normalize-space( $cited_by_count )"> | ||
| <rdf:Description rdf:about="{$baseURI}gcf_{$oaid}"> | ||
| <rdf:type rdf:resource="http://purl.org/spar/c4o/GlobalCitationCount"/> | ||
| <c4o:hasGlobalCountSource rdf:resource="https://forschungsatlas01.develop.service.tib.eu/individual/n4885"/> |
Have to find out why there is a hardcoded individual URI.
This links to the entity where the count comes from. So this is the URI of OpenAlex itself in the Forschungsatlas.
I will see how to replace this with an object for the data source OpenAlex that is not tied to a localized property.
| --> | ||
| <config> | ||
| # harvesting publications from TIB – Leibniz Information Centre for Science and Technology. exchange the ROR ID to test with your institution. | ||
| <param name="file">https://api.openalex.org/works?filter=authorships.institutions.ror:https://ror.org/04aj4c181&per-page=200&mailto=forschungsatlas@tib.eu&cursor=*</param> |
I would suggest adding an example with a publication date range here. I have used this one:
| <param name="file">https://api.openalex.org/works?filter=authorships.institutions.ror:https://ror.org/04aj4c181&per-page=200&mailto=forschungsatlas@tib.eu&cursor=*</param> | |
| <param name="file">https://api.openalex.org/works?filter=authorships.institutions.ror:https://ror.org/04aj4c181,from_publication_date:2024-12-01,to_publication_date:2024-12-31&per-page=200&mailto=forschungsatlas@tib.eu&cursor=*</param> |
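The `cursor=*` parameter in the URLs above starts OpenAlex cursor paging: each response carries a `meta.next_cursor` value that is passed as `cursor` in the next request, and a null `next_cursor` means there are no more pages. A minimal sketch of building the follow-up URL, assuming Java 10 or later; the class and method names are hypothetical, the cursor value in `main` is invented, and this is not how JSONFetch.java is known to do it.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;

final class CursorPagingSketch {

    // Returns the URL for the next page by swapping the value of the
    // "cursor" query parameter, or null when there is no further page.
    static String nextPageUrl(String currentUrl, String nextCursor) {
        if (nextCursor == null || nextCursor.isEmpty()) {
            return null; // no next_cursor means the last page was reached
        }
        String encoded = URLEncoder.encode(nextCursor, StandardCharsets.UTF_8);
        // "$1" keeps the "?cursor=" or "&cursor=" prefix; the old value is replaced.
        return currentUrl.replaceFirst("([?&]cursor=)[^&]*",
                "$1" + Matcher.quoteReplacement(encoded));
    }

    public static void main(String[] args) {
        String first = "https://api.openalex.org/works?per-page=200&cursor=*";
        // Invented cursor value, for illustration only.
        System.out.println(nextPageUrl(first, "IlsxNjA5Il0="));
    }
}
```

A harvest loop would fetch the current URL, read `meta.next_cursor` from the JSON response, and call the helper until it returns null.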
OpenAlex updated the API, specifically the paging mechanism. "Cursor paging" seems to be new: https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging From their latest mail:
Added an example of fetching publication metadata from OpenAlex, based on the JSON fetch.
Three example queries are available in example-scripts/bash-scripts/full-harvest-examples/1.13-1.15-examples/example-openalex/openAlexfetch.config.xml. Choose one for a first test or modify it to fit your needs.
OpenAlex fetch is already used in the Research Atlas: https://forschungsatlas.fid-bau.de/research
Namespace in example-scripts/bash-scripts/full-harvest-examples/1.13-1.15-examples/example-openalex/openalex-to-vivo.datamap.xsl needs to be adjusted according to your settings in runtime.properties:
<xsl:variable name = "baseURI">https://forschungsatlas.fid-bau.de/individual/</xsl:variable>
JSONFetch.java was extended to handle nested objects. Filtering of unwanted characters was also added to avoid problems with the XSLTranslator (javax.xml.transform).
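The exact filtering in JSONFetch.java is not shown here. As an illustration of one common approach, the sketch below drops every code point that the XML 1.0 `Char` production does not allow, which is a typical cause of transformer errors. The class and method names are hypothetical.

```java
final class XmlCharFilterSketch {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Copies the text code point by code point, skipping invalid ones
    // (control characters, unpaired surrogates, U+FFFE and U+FFFF).
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "abc" + (char) 0 + "def" + (char) 0x0B;
        System.out.println(stripInvalidXmlChars(dirty));
    }
}
```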
Closes #56