Add metadata utililites #591

jl-wynen · 2025-01-21T15:22:41Z

Fixes #473

This is a first implementation of common metadata models for data reduction. The models are based on an analysis of the file formats we care about: Common Metadata Schemas.md (This is an updated version of the document in the issue: #473 (comment)).

The ProcessLog is missing because I don't feel confident that we have nailed down our approach to it sufficiently.

Please read the docstrings to make sure you understand the purpose of the different models to check whether this makes sense.

I used Pydantic to make it easy to implement validation (the ORCID type is a more sophisticated version of what I implemented in Scitacean.) and to help convert inputs to the expected types. But almost all fields are plain strings that need no validation. So I can change it to use dataclasses if you think that we should not take on Pydantic as a dependency.

I am going to open PRs in ESSreduce and ESSsans that use part of this new code. See the cross refs below.

src/scippneutron/io/cif.py

SimonHeybrock · 2025-01-22T04:28:41Z

src/scippneutron/meta/_model.py

+    corresponding: bool = False
+    """Whether the person is the corresponding / contact author."""


Bit unusual to have this here, since the class is named Person?

Would you prefer a separate Author model?
Different representations deal with corresponding authors / contacts differently. Some keep them completely separate from other contributors (SciCat, CIF). Others have a 'role' that can indicate this (NeXus).

I don't have a strong preference, just wanted to make sure you had a reason. But multiple persons in an experiment could be corresponding authors for separate publications, so having it here may be problematic?

This model anyway does not encode what the person is associated with. They can be, e.g., and instrument scientist because the model was read from an input file. But that does not mean that they should be represented in the output file.
In general, I think info about people cannot be automated in a robust manner. So I see Person more as something a user inputs when creating an output file. And it may be behind a domain type like Author or PrincipalInvestigator. So maybe that is the solution: Rely on concrete domain types to encode the corresponding property, not a model field.

SimonHeybrock · 2025-01-22T04:31:15Z

src/scippneutron/meta/_model.py

+    If there is no DOI for the concrete version,
+    a general DOI for the software may be used.


Is this also meant for displaying information for users or for publications? If so, do we need to consider:

Do we want a way to have both general and specific DOI at the same time?

Do we need a separate URL to link to the docs?

My intention was to use this for archiving and provenance. That is, to indicate which software was used to produce some data. For this, I don't see a need to link to some documentation.

But if you want to create a UI from this, then we should probably add the fields you mentioned.

In general, I think we need to somehow encode URLs or DOIs in our packages so we can extract them here. Or we need a hard coded list. But I don't see a good way for either to get the DOI for a specific release. Do you?

See the dunder method I mention below?

SimonHeybrock · 2025-01-22T04:34:25Z

src/scippneutron/meta/_model.py

+        return cls(
+            name=package_name,
+            version=version(package_name),
+            url=_deduce_package_source_url(package_name),
+            doi=None,
+        )


I somewhere saw software defining __cite__ or something like that. Should we consider supporting that (and defining it in our software)?

This is a larger discussion. So I opened https://github.com/orgs/scipp/discussions/3634

SimonHeybrock · 2025-01-22T04:36:47Z

src/scippneutron/meta/_model.py

+class Source(BaseModel):
+    """Information about a neutron source.


Then this should be named NeutronSource?

SimonHeybrock · 2025-01-22T04:37:18Z

src/scippneutron/meta/_model.py

+
+    frequency: SourceFrequency
+    """The source frequency in Hz."""
+    pulse_duration: PulseDuration


Reactor sources do not have pulses or a frequency. So maybe class PulsedNeutronSource is what this should be named? Or PulsedSource, since photon sources have a frequency?

True. I tried to be generic with the different source types and probes. But it currently does not look like we will support other types of sources any time soon. So we could specialise to pulsed neutron sources. Or keep the probe generic but limit ourselves to pulsed sources.
I think I prefer the latter. How about you?

I removed the source parameters because Neil noted that they are not so easy to define. Is the name still an issue?

Ok for me. Names can always be changed later.

src/scippneutron/io/cif.py

src/scippneutron/meta/__init__.py

src/scippneutron/meta/_orcid.py

nvaytet · 2025-01-23T09:25:58Z

src/scippneutron/meta/_model.py

+
+ESS_SOURCE = Source(
+    frequency=SourceFrequency(sc.scalar(14.0, unit='Hz')),
+    pulse_duration=PulseDuration(sc.scalar(0.003, unit='s')),


I guess how to define how long the pulse is is not so clear. The number that is commonly quoted is 2.86 ms.
But how much tail should we include?

Good point. This is what I got from ESSspectroscopy. The length might depend on the specific use case. It may be best to stick strictly to metadata here and not specify duration or frequency.

nvaytet · 2025-01-23T09:28:21Z

src/scippneutron/meta/_model.py

+ESS_SOURCE = Source(
+    frequency=SourceFrequency(sc.scalar(14.0, unit='Hz')),
+    pulse_duration=PulseDuration(sc.scalar(0.003, unit='s')),
+    source_type=SourceType.SpallationNeutronSource,


Should we also have the wavelength range of the source? The same question about the tail would also apply for wavelengths...

See my reply above.

nvaytet · 2025-01-23T09:30:32Z

src/scippneutron/meta/_model.py

+        case (facility, site):
+            return facility, site
+        case facility:
+            return facility, None


Shouldn't the site also be 'ESS' for ESS instead of being None?

Could be. The distinction only exists in SciCat and even there it's not strict. creationLocation is supposed to be site:facility:instrument. But I have been using ESS:DMSC:instrument for simulated data. This would work best if we have site="ESS".

nvaytet · 2025-01-23T09:36:29Z

src/scippneutron/meta/_orcid.py

+    return ORCIDiD(value)
+
+
+_ORCID_RESOLVER: str = 'https://orcid.org'


I don't really get why the url is part of the id?
Entries should just be 0000-0000-0001-0082, without the url prefix?

See https://support.orcid.org/hc/en-us/articles/360006897674-Structure-of-the-ORCID-Identifier
With these kinds of ids (same for DOI) it is generally recommended to encode them as a complete URI.

Note that this accepts inputs of the plain id without the resolver. But it adds the resolver when storing the id.

nvaytet · 2025-01-23T09:40:27Z

tests/meta/meta_model_test.py

+    with snx.File(scn.data.get_path('PG3_4844_event.nxs')) as f:
+        experiment = meta.Measurement.from_nexus_entry(f['entry'])
+    assert experiment.title == 'diamond cw0.533 4.22e12 60Hz [10x30]'
+    assert experiment.run_number == '4844'


I don't know where would be the place to put it, but do we need a/several field(s) for all the parameters used in the experiment?
E.g. sample rotation angle, chopper info, slit openings, etc.

I'm guessing this is more the role of Scicat? Should we then have a link to the raw data on scicat or something?

I don't know where would be the place to put it, but do we need a/several field(s) for all the parameters used in the experiment?
E.g. sample rotation angle, chopper info, slit openings, etc.

I think this may be more on the side of auxiliary data as far as our workflows are concerned. I.e., data that is used for processing the main data (events). I'm not sure what a model of these would look like.

Should we then have a link to the raw data on scicat or something?

Yes! I am not sure how best to track inputs. When we load a scicat dataset in a workflow, it is relatively easy to get all required parameters. But when we only specify a file, we can only report the file name in the output. And that is not super helpful because files can be renamed.

nvaytet · 2025-01-27T10:25:59Z

src/scippneutron/metadata/_model.py

+
+    name: str
+    """Free form name of the person."""
+    orcid: ORCIDiD | None = None


I don't know if it's now confusing that the class is ORCIDiD, but the member is orcid. Should it be orcid_id?
Cf. your comment:

They distinguish between 'ORCID' the service and 'ORCID iD' the identifier.

Agree. I got lazy.

nvaytet · 2025-01-27T10:27:33Z

src/scippneutron/metadata/_model.py

+        return None
+
+
+class SourceType(str, enum.Enum):


A little annoying that the entire str API ends up in the generated docs (also for RadiationProbe below), but don't know what can be done about it...

Agree. I changed it to only inherit from Enum. This should not affect usability because pydantic converts strings to enums anyway.

I also tried enum.StrEnum instead of inheriting from str directly. But it leads to the same docs.

SimonHeybrock

Looks good for a v1 to me, provided that @nvaytet is also happy.

Required by SQW

Inheriting from str leads to noisy docs.

This was referenced Jan 21, 2025

Use ScippNeutron's metadata utilities scipp/essreduce#166

Draft

Use ScippNeutron's metadata utilities scipp/esssans#195

Draft

SimonHeybrock reviewed Jan 22, 2025

View reviewed changes

nvaytet reviewed Jan 23, 2025

View reviewed changes

jl-wynen force-pushed the metadata-util branch 3 times, most recently from ba9c201 to b9e24c4 Compare January 24, 2025 12:01

jl-wynen mentioned this pull request Jan 24, 2025

Use ScippNeutron's metadata utilities scipp/essdiffraction#124

Draft

jl-wynen force-pushed the metadata-util branch 2 times, most recently from 7d0e402 to 7fa45b0 Compare January 24, 2025 13:16

nvaytet reviewed Jan 27, 2025

View reviewed changes

jl-wynen force-pushed the metadata-util branch from 8318089 to 7f1a601 Compare January 30, 2025 10:16

SimonHeybrock approved these changes Feb 4, 2025

View reviewed changes

nvaytet approved these changes Feb 6, 2025

View reviewed changes

jl-wynen added 14 commits February 6, 2025 15:13

Add ORCID id type

4c63fc9

Add metadata models

462ca5e

Enable pydantic.mypy plugin

96171e5

Use metadata models in CIF

098767c

Add missing dateutil dependency

2d7d9f4

Required by SQW

Update dependencies

8091529

Update deps

7f443a8

Fix doctest problems

975b14a

Rename meta to metadata

ab473b7

Update copyright year

de7edba

Remove source params and add name

bc9a887

Set site=facility if no separate site

9a716bf

Use plain enums

9eddd8a

Inheriting from str leads to noisy docs.

Rename orcid to orcid_id

71a1fcd

jl-wynen enabled auto-merge February 6, 2025 14:18

jl-wynen force-pushed the metadata-util branch from 7f1a601 to 71a1fcd Compare February 6, 2025 14:18

Apply automatic formatting

8d79bcf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add metadata utililites #591

Add metadata utililites #591

jl-wynen commented Jan 21, 2025

SimonHeybrock Jan 22, 2025

jl-wynen Jan 22, 2025

SimonHeybrock Jan 22, 2025

jl-wynen Jan 22, 2025

SimonHeybrock Jan 22, 2025

jl-wynen Jan 22, 2025

SimonHeybrock Jan 22, 2025

SimonHeybrock Jan 22, 2025

jl-wynen Jan 22, 2025

SimonHeybrock Jan 22, 2025

SimonHeybrock Jan 22, 2025

jl-wynen Jan 22, 2025

SimonHeybrock Jan 22, 2025

jl-wynen Jan 24, 2025

SimonHeybrock Jan 27, 2025

nvaytet Jan 23, 2025

jl-wynen Jan 23, 2025

nvaytet Jan 23, 2025

jl-wynen Jan 23, 2025

nvaytet Jan 23, 2025

jl-wynen Jan 23, 2025

nvaytet Jan 23, 2025

jl-wynen Jan 23, 2025

nvaytet Jan 23, 2025

jl-wynen Jan 23, 2025

nvaytet Jan 27, 2025

jl-wynen Jan 30, 2025

nvaytet Jan 27, 2025

jl-wynen Jan 30, 2025

SimonHeybrock left a comment

		corresponding: bool = False
		"""Whether the person is the corresponding / contact author."""

		If there is no DOI for the concrete version,
		a general DOI for the software may be used.

		class Source(BaseModel):
		"""Information about a neutron source.

		return ORCIDiD(value)


		_ORCID_RESOLVER: str = 'https://orcid.org'

Add metadata utililites #591

Are you sure you want to change the base?

Add metadata utililites #591

Conversation

jl-wynen commented Jan 21, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SimonHeybrock left a comment

Choose a reason for hiding this comment