-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metadata utililites #591
base: main
Are you sure you want to change the base?
Conversation
src/scippneutron/meta/_model.py
Outdated
corresponding: bool = False | ||
"""Whether the person is the corresponding / contact author.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bit unusual to have this here, since the class is named Person
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you prefer a separate Author
model?
Different representations deal with corresponding authors / contacts differently. Some keep them completely separate from other contributors (SciCat, CIF). Others have a 'role' that can indicate this (NeXus).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong preference, just wanted to make sure you had a reason. But multiple persons in an experiment could be corresponding authors for separate publications, so having it here may be problematic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This model anyway does not encode what the person is associated with. They can be, e.g., and instrument scientist because the model was read from an input file. But that does not mean that they should be represented in the output file.
In general, I think info about people cannot be automated in a robust manner. So I see Person
more as something a user inputs when creating an output file. And it may be behind a domain type like Author
or PrincipalInvestigator
. So maybe that is the solution: Rely on concrete domain types to encode the corresponding
property, not a model field.
src/scippneutron/meta/_model.py
Outdated
If there is no DOI for the concrete version, | ||
a general DOI for the software may be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this also meant for displaying information for users or for publications? If so, do we need to consider:
- Do we want a way to have both general and specific DOI at the same time?
- Do we need a separate URL to link to the docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intention was to use this for archiving and provenance. That is, to indicate which software was used to produce some data. For this, I don't see a need to link to some documentation.
But if you want to create a UI from this, then we should probably add the fields you mentioned.
In general, I think we need to somehow encode URLs or DOIs in our packages so we can extract them here. Or we need a hard coded list. But I don't see a good way for either to get the DOI for a specific release. Do you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the dunder method I mention below?
src/scippneutron/meta/_model.py
Outdated
return cls( | ||
name=package_name, | ||
version=version(package_name), | ||
url=_deduce_package_source_url(package_name), | ||
doi=None, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I somewhere saw software defining __cite__
or something like that. Should we consider supporting that (and defining it in our software)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a larger discussion. So I opened https://github.com/orgs/scipp/discussions/3634
src/scippneutron/meta/_model.py
Outdated
class Source(BaseModel): | ||
"""Information about a neutron source. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then this should be named NeutronSource
?
src/scippneutron/meta/_model.py
Outdated
|
||
frequency: SourceFrequency | ||
"""The source frequency in Hz.""" | ||
pulse_duration: PulseDuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reactor sources do not have pulses or a frequency. So maybe class PulsedNeutronSource
is what this should be named? Or PulsedSource
, since photon sources have a frequency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. I tried to be generic with the different source types and probes. But it currently does not look like we will support other types of sources any time soon. So we could specialise to pulsed neutron sources. Or keep the probe generic but limit ourselves to pulsed sources.
I think I prefer the latter. How about you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the source parameters because Neil noted that they are not so easy to define. Is the name still an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok for me. Names can always be changed later.
src/scippneutron/meta/_model.py
Outdated
|
||
ESS_SOURCE = Source( | ||
frequency=SourceFrequency(sc.scalar(14.0, unit='Hz')), | ||
pulse_duration=PulseDuration(sc.scalar(0.003, unit='s')), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess how to define how long the pulse is is not so clear. The number that is commonly quoted is 2.86 ms.
But how much tail should we include?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This is what I got from ESSspectroscopy. The length might depend on the specific use case. It may be best to stick strictly to metadata here and not specify duration or frequency.
src/scippneutron/meta/_model.py
Outdated
ESS_SOURCE = Source( | ||
frequency=SourceFrequency(sc.scalar(14.0, unit='Hz')), | ||
pulse_duration=PulseDuration(sc.scalar(0.003, unit='s')), | ||
source_type=SourceType.SpallationNeutronSource, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my reply above.
src/scippneutron/meta/_model.py
Outdated
case (facility, site): | ||
return facility, site | ||
case facility: | ||
return facility, None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the site also be 'ESS'
for ESS instead of being None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be. The distinction only exists in SciCat and even there it's not strict. creationLocation
is supposed to be site:facility:instrument
. But I have been using ESS:DMSC:instrument
for simulated data. This would work best if we have site="ESS"
.
src/scippneutron/meta/_orcid.py
Outdated
return ORCIDiD(value) | ||
|
||
|
||
_ORCID_RESOLVER: str = 'https://orcid.org' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really get why the url is part of the id?
Entries should just be 0000-0000-0001-0082
, without the url prefix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://support.orcid.org/hc/en-us/articles/360006897674-Structure-of-the-ORCID-Identifier
With these kinds of ids (same for DOI) it is generally recommended to encode them as a complete URI.
Note that this accepts inputs of the plain id without the resolver. But it adds the resolver when storing the id.
tests/meta/meta_model_test.py
Outdated
with snx.File(scn.data.get_path('PG3_4844_event.nxs')) as f: | ||
experiment = meta.Measurement.from_nexus_entry(f['entry']) | ||
assert experiment.title == 'diamond cw0.533 4.22e12 60Hz [10x30]' | ||
assert experiment.run_number == '4844' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know where would be the place to put it, but do we need a/several field(s) for all the parameters used in the experiment?
E.g. sample rotation angle, chopper info, slit openings, etc.
I'm guessing this is more the role of Scicat? Should we then have a link to the raw data on scicat or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know where would be the place to put it, but do we need a/several field(s) for all the parameters used in the experiment?
E.g. sample rotation angle, chopper info, slit openings, etc.
I think this may be more on the side of auxiliary data as far as our workflows are concerned. I.e., data that is used for processing the main data (events). I'm not sure what a model of these would look like.
Should we then have a link to the raw data on scicat or something?
Yes! I am not sure how best to track inputs. When we load a scicat dataset in a workflow, it is relatively easy to get all required parameters. But when we only specify a file, we can only report the file name in the output. And that is not super helpful because files can be renamed.
ba9c201
to
b9e24c4
Compare
7d0e402
to
7fa45b0
Compare
src/scippneutron/metadata/_model.py
Outdated
|
||
name: str | ||
"""Free form name of the person.""" | ||
orcid: ORCIDiD | None = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if it's now confusing that the class is ORCIDiD, but the member is orcid
. Should it be orcid_id
?
Cf. your comment:
They distinguish between 'ORCID' the service and 'ORCID iD' the identifier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. I got lazy.
src/scippneutron/metadata/_model.py
Outdated
return None | ||
|
||
|
||
class SourceType(str, enum.Enum): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. I changed it to only inherit from Enum
. This should not affect usability because pydantic converts strings to enums anyway.
I also tried enum.StrEnum
instead of inheriting from str
directly. But it leads to the same docs.
8318089
to
7f1a601
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good for a v1 to me, provided that @nvaytet is also happy.
Required by SQW
Inheriting from str leads to noisy docs.
7f1a601
to
71a1fcd
Compare
Fixes #473
This is a first implementation of common metadata models for data reduction. The models are based on an analysis of the file formats we care about: Common Metadata Schemas.md (This is an updated version of the document in the issue: #473 (comment)).
The
ProcessLog
is missing because I don't feel confident that we have nailed down our approach to it sufficiently.Please read the docstrings to make sure you understand the purpose of the different models to check whether this makes sense.
I used Pydantic to make it easy to implement validation (the ORCID type is a more sophisticated version of what I implemented in Scitacean.) and to help convert inputs to the expected types. But almost all fields are plain strings that need no validation. So I can change it to use dataclasses if you think that we should not take on Pydantic as a dependency.
I am going to open PRs in ESSreduce and ESSsans that use part of this new code. See the cross refs below.