Searching on DwC field datasetName #3006

Open
rondlg opened this issue Sep 18, 2020 · 12 comments

Comments

@rondlg

rondlg commented Sep 18, 2020

Hi,
how do I search for occurrences with a particular value in the datasetName field?

I can see the data in the record page but don't know which field it is in the advanced search list.
image

If that field isn't available to search on it would be really helpful if it could be added

Thanks

Sharon

@rukayaj

rukayaj commented Jan 27, 2022

I guess this is similar to the request made in #3026

Here's some more info about our particular use case: We need to group some records in multiple datasets together, and then download them. If I do a free text search like https://www.gbif.org/occurrence/search?q=Artsprosjekt_55-12_PolyNor for "Artsprosjekt_55-12_PolyNor" (one of the project names) then I get the correct records, but no way to download them. We've been publishing the project name for each record under 'datasetName' (e.g. https://www.gbif.org/occurrence/3436305215).
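A possible stopgap until datasetName is searchable, sketched under assumptions: page through the public occurrence search API with the same free-text `q` parameter, then keep only the records whose `datasetName` matches exactly. The helper names (`build_search_url`, `filter_by_dataset_name`, `fetch_matching`) are made up for illustration, and the sketch assumes `datasetName` comes back as a plain string on each result. It is not a replacement for a proper download, as it gives no DOI and is only practical for small result sets.

```python
import json
import urllib.parse
import urllib.request

# Public GBIF occurrence search endpoint (free-text search via `q`).
API_URL = "https://api.gbif.org/v1/occurrence/search"


def build_search_url(term, limit=300, offset=0):
    """Build a free-text occurrence search URL for one page of results."""
    query = urllib.parse.urlencode({"q": term, "limit": limit, "offset": offset})
    return f"{API_URL}?{query}"


def filter_by_dataset_name(records, name):
    """Keep only records whose datasetName equals `name` exactly.

    Assumption: datasetName is a single string on each record.
    """
    return [r for r in records if r.get("datasetName") == name]


def fetch_matching(term, page_size=300, max_pages=10):
    """Page through the free-text search and filter client-side.

    `max_pages` is an arbitrary safety cap for this sketch.
    """
    matches = []
    for page in range(max_pages):
        url = build_search_url(term, page_size, page * page_size)
        with urllib.request.urlopen(url) as resp:
            payload = json.load(resp)
        matches.extend(filter_by_dataset_name(payload.get("results", []), term))
        if payload.get("endOfRecords", True):
            break
    return matches
```

Calling `fetch_matching("Artsprosjekt_55-12_PolyNor")` would then return the matching records as a list of dicts, which could be written out to CSV locally.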

This functionality is necessary for our researchers. If there's no planned work on this maybe @MortenHofft you have another idea for an additional field we can add the datasetName to which is searchable, as a work-around?

@MortenHofft
Member

MortenHofft commented Jan 27, 2022

I cannot really think of any beyond those mentioned in the referenced issue.

Which is essentially using publisher, institution, eventIds and collection (when appropriate of course - they shouldn't be misused just to group records that aren't in fact a collection). Or split into multiple datasets. I cannot think of any other way to group records across or within datasets.

Perhaps others can think of another approach? @ManonGros? If not and there is a strong request, then the danger is that we will see bad data that misuses e.g. collectionCode as a hack to achieve what is needed. That would be a shame.

@rukayaj

rukayaj commented Jan 27, 2022

Is projectID the projectID from the EML? Pity it's not on a record level... I suppose one could argue that the grouped datasets are all part of a 'collection', kind of? Is collectionCode that bad a hack do you think? I see in the definition it says 'identifying the collection or data set from which the record was derived'.

@ManonGros

I cannot think of any alternative (other than the ones listed in the other issue). The collection code hack isn't ideal, especially in the context of specimen records (for observations it would make more sense).

Yes the projectID is from the EML so it is for all the records in a given dataset. This is the same problem as the networks (they include whole datasets). I suppose we could:

  • investigate whether projectID or networks could be at the record level (although this wasn't their intended purpose and it might be difficult to do)
  • or consider making the datasetName field searchable (that might be better)
  • or have/use a new term (I am not sure about that).

@ahahn-gbif do you have any input on the topic? (the question is how to aggregate/download records that are part of several datasets)

@MortenHofft
Member

Should it be possible to be part of multiple "projects/datasets"?

@ahahn-gbif

ProjectID in GBIF (and EML metadata) is presently given preference for projects run by or through GBIF (BID, BIFA, CESP and friends). The term is not (to my knowledge) defined again at record level in Darwin Core, so the limitation is, as recognized, that a) a projectID is applied at dataset level, and that b) not more than one projectID can be assigned to the dataset. In that sense, I would advise against that choice.

Overloading any DwC term to find a work-around for some practical need is not a good idea. https://dwc.tdwg.org/terms/#dwc:datasetName is defined as "The name identifying the data set from which the record was derived.". If that is factually correct in the data, then we would not want to encourage using other terms against their actual definition.

If there is a recognized need in the community to be able to search this term through the user interface, this may be a change request. It is quite possibly not a widespread user demand, so my question would be how often it is used (yearly reporting? regularly?), and by which kind of "customers". Would an API access option possibly satisfy this need?

@rukayaj

rukayaj commented Jan 28, 2022

I would actually think this is quite a common scenario, and that there are many field projects which go out on yearly collection trips, taking specimens which go into several collections. And then of course it's necessary for the individual projects to be able to see only their specimens.

@rondlg
Author

rondlg commented Jan 28, 2022

My 2-penneth is that it's definitely common at my institution to want to do this kind of thing and it's not easy to do right now.

There are a few things that we use the datasetName field for. Usually it is something with funding but not always:

  • The name of a digitization project
  • An expedition
  • A Research Project
  • A Lab
  • etc. etc.

Users have asked us how to retrieve the data associated with one or more of the above. Sometimes it's to show funders that a goal was achieved, either in a single institution or across multiple institutions, or we would like to be able to include/reference GBIF datasets for a particular datasetName on our web properties.

The example I give here is to our Rapid Inventories project that has been going for decades. They would like to be able to retrieve everything from a given expedition and the records cut not only across institutions but also across taxa.

Maybe this is tied up with events, I dunno but if it is we still need something simple for users and providers to work with.

I'll show my ignorance, but is there a place to mint IDs for projects/expeditions? If there is, great; if not, we are stuck with datasetName.

Our CMS allows us to record multiple projects per occurrence.

@ManonGros

I think @albenson-usgs also mentioned the need for aggregating specific occurrences across datasets. If I remember correctly, the collectionCode was/is used for that purpose.

@rondlg
Author

rondlg commented Feb 3, 2022

collectionCode is a problem for us to use in this regard because it is used at a much higher level. For example to distinguish between the "Bird" collection and the "Fossil Herps" collection. These values are also unitary.

@rukayaj

rukayaj commented Feb 8, 2022

Should it be possible to be part of multiple "projects/datasets"?

We've just had a request for this: gbif-norway/helpdesk#90

@timrobertson100
Member

Please see gbif/pipelines#662 where we intend to implement multivalue dataset ID and name search capabilities shortly.


6 participants