Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metadata utililites #591

Open
wants to merge 15 commits into
base: main
Choose a base branch
from
Open

Add metadata utililites #591

wants to merge 15 commits into from

Conversation

jl-wynen
Copy link
Member

Fixes #473

This is a first implementation of common metadata models for data reduction. The models are based on an analysis of the file formats we care about: Common Metadata Schemas.md (This is an updated version of the document in the issue: #473 (comment)).

The ProcessLog is missing because I don't feel confident that we have nailed down our approach to it sufficiently.

Please read the docstrings to make sure you understand the purpose of the different models to check whether this makes sense.

I used Pydantic to make it easy to implement validation (the ORCID type is a more sophisticated version of what I implemented in Scitacean.) and to help convert inputs to the expected types. But almost all fields are plain strings that need no validation. So I can change it to use dataclasses if you think that we should not take on Pydantic as a dependency.

I am going to open PRs in ESSreduce and ESSsans that use part of this new code. See the cross refs below.

src/scippneutron/io/cif.py Outdated Show resolved Hide resolved
Comment on lines 172 to 173
corresponding: bool = False
"""Whether the person is the corresponding / contact author."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit unusual to have this here, since the class is named Person?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you prefer a separate Author model?
Different representations deal with corresponding authors / contacts differently. Some keep them completely separate from other contributors (SciCat, CIF). Others have a 'role' that can indicate this (NeXus).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong preference, just wanted to make sure you had a reason. But multiple persons in an experiment could be corresponding authors for separate publications, so having it here may be problematic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This model anyway does not encode what the person is associated with. They can be, e.g., and instrument scientist because the model was read from an input file. But that does not mean that they should be represented in the output file.
In general, I think info about people cannot be automated in a robust manner. So I see Person more as something a user inputs when creating an output file. And it may be behind a domain type like Author or PrincipalInvestigator. So maybe that is the solution: Rely on concrete domain types to encode the corresponding property, not a model field.

Comment on lines 234 to 235
If there is no DOI for the concrete version,
a general DOI for the software may be used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this also meant for displaying information for users or for publications? If so, do we need to consider:

  • Do we want a way to have both general and specific DOI at the same time?
  • Do we need a separate URL to link to the docs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention was to use this for archiving and provenance. That is, to indicate which software was used to produce some data. For this, I don't see a need to link to some documentation.

But if you want to create a UI from this, then we should probably add the fields you mentioned.

In general, I think we need to somehow encode URLs or DOIs in our packages so we can extract them here. Or we need a hard coded list. But I don't see a good way for either to get the DOI for a specific release. Do you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the dunder method I mention below?

Comment on lines 259 to 264
return cls(
name=package_name,
version=version(package_name),
url=_deduce_package_source_url(package_name),
doi=None,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I somewhere saw software defining __cite__ or something like that. Should we consider supporting that (and defining it in our software)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a larger discussion. So I opened https://github.com/orgs/scipp/discussions/3634

Comment on lines 322 to 323
class Source(BaseModel):
"""Information about a neutron source.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then this should be named NeutronSource?


frequency: SourceFrequency
"""The source frequency in Hz."""
pulse_duration: PulseDuration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reactor sources do not have pulses or a frequency. So maybe class PulsedNeutronSource is what this should be named? Or PulsedSource, since photon sources have a frequency?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I tried to be generic with the different source types and probes. But it currently does not look like we will support other types of sources any time soon. So we could specialise to pulsed neutron sources. Or keep the probe generic but limit ourselves to pulsed sources.
I think I prefer the latter. How about you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the source parameters because Neil noted that they are not so easy to define. Is the name still an issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok for me. Names can always be changed later.

src/scippneutron/io/cif.py Outdated Show resolved Hide resolved
src/scippneutron/meta/__init__.py Outdated Show resolved Hide resolved
src/scippneutron/meta/_orcid.py Outdated Show resolved Hide resolved

ESS_SOURCE = Source(
frequency=SourceFrequency(sc.scalar(14.0, unit='Hz')),
pulse_duration=PulseDuration(sc.scalar(0.003, unit='s')),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess how to define how long the pulse is is not so clear. The number that is commonly quoted is 2.86 ms.
But how much tail should we include?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. This is what I got from ESSspectroscopy. The length might depend on the specific use case. It may be best to stick strictly to metadata here and not specify duration or frequency.

ESS_SOURCE = Source(
frequency=SourceFrequency(sc.scalar(14.0, unit='Hz')),
pulse_duration=PulseDuration(sc.scalar(0.003, unit='s')),
source_type=SourceType.SpallationNeutronSource,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also have the wavelength range of the source? The same question about the tail would also apply for wavelengths...
ess-source

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my reply above.

case (facility, site):
return facility, site
case facility:
return facility, None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the site also be 'ESS' for ESS instead of being None?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be. The distinction only exists in SciCat and even there it's not strict. creationLocation is supposed to be site:facility:instrument. But I have been using ESS:DMSC:instrument for simulated data. This would work best if we have site="ESS".

return ORCIDiD(value)


_ORCID_RESOLVER: str = 'https://orcid.org'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really get why the url is part of the id?
Entries should just be 0000-0000-0001-0082, without the url prefix?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://support.orcid.org/hc/en-us/articles/360006897674-Structure-of-the-ORCID-Identifier
With these kinds of ids (same for DOI) it is generally recommended to encode them as a complete URI.

Note that this accepts inputs of the plain id without the resolver. But it adds the resolver when storing the id.

with snx.File(scn.data.get_path('PG3_4844_event.nxs')) as f:
experiment = meta.Measurement.from_nexus_entry(f['entry'])
assert experiment.title == 'diamond cw0.533 4.22e12 60Hz [10x30]'
assert experiment.run_number == '4844'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know where would be the place to put it, but do we need a/several field(s) for all the parameters used in the experiment?
E.g. sample rotation angle, chopper info, slit openings, etc.

I'm guessing this is more the role of Scicat? Should we then have a link to the raw data on scicat or something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know where would be the place to put it, but do we need a/several field(s) for all the parameters used in the experiment?
E.g. sample rotation angle, chopper info, slit openings, etc.

I think this may be more on the side of auxiliary data as far as our workflows are concerned. I.e., data that is used for processing the main data (events). I'm not sure what a model of these would look like.

Should we then have a link to the raw data on scicat or something?

Yes! I am not sure how best to track inputs. When we load a scicat dataset in a workflow, it is relatively easy to get all required parameters. But when we only specify a file, we can only report the file name in the output. And that is not super helpful because files can be renamed.

@jl-wynen jl-wynen force-pushed the metadata-util branch 3 times, most recently from ba9c201 to b9e24c4 Compare January 24, 2025 12:01
@jl-wynen jl-wynen force-pushed the metadata-util branch 2 times, most recently from 7d0e402 to 7fa45b0 Compare January 24, 2025 13:16

name: str
"""Free form name of the person."""
orcid: ORCIDiD | None = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it's now confusing that the class is ORCIDiD, but the member is orcid. Should it be orcid_id?
Cf. your comment:

They distinguish between 'ORCID' the service and 'ORCID iD' the identifier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I got lazy.

return None


class SourceType(str, enum.Enum):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little annoying that the entire str API ends up in the generated docs (also for RadiationProbe below), but don't know what can be done about it...
Screenshot_20250127_112156

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I changed it to only inherit from Enum. This should not affect usability because pydantic converts strings to enums anyway.

I also tried enum.StrEnum instead of inheriting from str directly. But it leads to the same docs.

Copy link
Member

@SimonHeybrock SimonHeybrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for a v1 to me, provided that @nvaytet is also happy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

Metadata utilities
3 participants