Add CIF writer #477

jl-wynen · 2023-12-07T15:05:13Z

@celinedurniak, @AndrewSazonov any feedback is welcome!

This is a first version of a CIF writer that is still fairly low-level. But it is ready for testing in workflows.

Below there is an example for the POWGEN reduction in our docs. It is possible to add wrappers for encoding the remaining metadata. But I'd wait for #473 before doing that.

Note that the tof coord is basically made up because we have no proper focussing implementation.

import scipp as sc
from scippneutron.io import cif
from datetime import datetime, timezone

da_dspacing = sc.io.load_hdf5("data/powgen_reduced_dspacing.h5")

intensity = da_dspacing.data
dspacing = sc.midpoints(da_dspacing.coords['dspacing']).to(unit='Å', copy=False)
tof = dspacing.copy(deep=False)
tof.unit = 'us'
reduced_data = sc.DataArray(
    intensity,
    coords={'tof': tof}
).rename_dims(dspacing='tof')

cal = sc.DataArray(
    sc.array(dims=['cal'], values=[1.0]),
    coords={'power': sc.array(dims=['cal'], values=[1])}
)

now = datetime.now(tz=timezone.utc)
block = cif.Block(
    "LaB6",
    [
        {
            'diffrn_radiation.probe': 'neutron',
            'diffrn_source.beamline': 'POWGEN',
            'diffrn_source.device': 'spallation',
            'diffrn_source.facility': 'Spallation Neutron Source',
        },
        {
            'pd_proc.info_datetime': now,
            'computing.diffrn_reduction': 'essdiffraction v0',
            'pd_proc.info_data_reduction': '''Normalized by proton charge
Normalized by Vanadium run 4866
Calibrated using FERNS_d4832_2011_08_24''',
        },
        {
            'audit.creation_date': now,
            'audit_contact_author.name': 'Jan-Lukas Wynen',
            'audit_contact_author.id_orcid': 'https://orcid.org/0000-0002-3761-3201',
            'audit_contact_author.email': '[email protected]',
        },
    ],
)
block.add_powder_calibration(cal, comment='This calibration was made up')
block.add_reduced_powder_data(reduced_data)

cif.save_cif('data/powgen_reduced.cif', block)


celinedurniak · 2023-12-11T03:14:01Z

What will be the final structure of the cal DataArray (with DIFA, DIFC...) ?
Will the versioning of essdiffraction use the same convention as Scipp (i.e., month.year)?
Will you provide links to the other files required for reduction, like Vanadium, sample container, empty instrument and the reduction script or Jupyter notebook...?
Will the author name be one of the Scipp developers or the PI of the related proposal or the team taken from the proposal (PI + local contact + collaborators)?

SimonHeybrock

Thorough testing 👍

Some minor questions:

SimonHeybrock · 2023-12-11T05:15:42Z

src/scippneutron/io/cif.py

+            to the file.
+        """
+        self._pairs = dict(pairs) if pairs is not None else {}
+        self._comment = _encode_non_ascii(comment)


Call setter to avoid duplication?
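
A minimal sketch of what this suggestion could look like. The class body below and the stand-in for `_encode_non_ascii` are assumptions made for illustration; they are not the code in `cif.py`. The idea is only that `__init__` assigns through the property, so the encoding step exists in a single place.

```python
from typing import Optional


def _encode_non_ascii(s: Optional[str]) -> str:
    # Stand-in for the real helper (assumption): escape non-ASCII characters.
    if s is None:
        return ''
    return s.encode('ascii', 'backslashreplace').decode('ascii')


class Chunk:
    def __init__(self, pairs=None, *, comment: str = '') -> None:
        self._pairs = dict(pairs) if pairs is not None else {}
        self._comment = ''
        # Assign through the property so that encoding happens in one place.
        self.comment = comment

    @property
    def comment(self) -> str:
        return self._comment

    @comment.setter
    def comment(self, comment: Optional[str]) -> None:
        self._comment = _encode_non_ascii(comment)
```

With this layout, a later change to how comments are encoded only touches the setter.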

SimonHeybrock · 2023-12-11T05:16:27Z

src/scippneutron/io/cif.py

+        self._columns = {}
+        for key, column in columns.items():
+            self[key] = column
+        self._comment = _encode_non_ascii(comment)


Use setter?

SimonHeybrock · 2023-12-11T05:18:15Z

src/scippneutron/io/cif.py

+        sep = (
+            '\n'
+            if any(';' in item for row in formatted_values for item in row)
+            else ' '
+        )


Can you explain this? Maybe also in a comment?

Done. It's about handling multi-line strings as tested in test_write_block_single_loop_multi_line_string:

loop_
_diffrn.ambient_environment
_diffrn.id
; water
and some salt
;
123
sulfur
x6a
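
The separator choice can be sketched in isolation as follows. This is a simplified illustration built around the condition in the diff above; the function names `choose_separator` and `format_rows` are hypothetical and the real writer does more (e.g. quoting). A semicolon-delimited text field has to start and end with `;` at the beginning of a line, so once any item is such a field, every item is written on its own line.

```python
def choose_separator(formatted_values):
    # A text field is opened and closed by ';' at the start of a line,
    # so rows that contain such fields cannot be joined with spaces.
    if any(';' in item for row in formatted_values for item in row):
        return '\n'
    return ' '


def format_rows(formatted_values):
    sep = choose_separator(formatted_values)
    return '\n'.join(sep.join(row) for row in formatted_values)


rows_multi = [['; water\nand some salt\n;', '123'], ['sulfur', 'x6a']]
rows_plain = [['water', '123'], ['sulfur', 'x6a']]
print(format_rows(rows_multi))
print(format_rows(rows_plain))
```

Loops without multi-line strings keep the compact one-row-per-line layout.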

SimonHeybrock · 2023-12-11T05:19:32Z

src/scippneutron/io/cif.py

+        self._name = ''
+        self.name = name
+        self._content = _convert_input_content(content) if content is not None else []
+        self._comment = _encode_non_ascii(comment)


Use setter?

SimonHeybrock · 2023-12-11T05:21:17Z

src/scippneutron/io/cif.py

+          >>> da = sc.DataArray(intensity, coords={'tof': tof})
+          >>> block.add_reduced_powder_data(da)
+        """
+        self.add(_make_reduced_powder_loop(data, comment=comment))


If we have a free helper function anyway, given that this wrapper does nearly nothing, is it even worth having it?

I only added it for convenience since we use method chaining in other interfaces. But I agree that it is not a strong argument here because most 'interesting' blocks won't be created and written in a single expression. Should I remove?

Generally I think having fewer functions and methods is good, but if you prefer keeping it I am fine with that as well.


SimonHeybrock · 2023-12-11T05:29:39Z

src/scippneutron/io/cif.py

+        if value.variance is not None:
+            without_unit = sc.scalar(value.value, variance=value.variance)
+            s = f'{without_unit:c}'
+        else:
+            s = str(value.value)


I have never thought about this before, but is formatting depending on the system locale, i.e., will people in some Germanic countries (like Sweden) get a comma instead of a dot in floating-point numbers? If so, we should probably avoid this, i.e., ensure we always write with dot?

I'm pretty sure this is not the case. I just changed my system's number formatting to German and I still get periods, not commas.
Doing some cursory searching, Python's '{}' uses periods as separators, unless specified otherwise in the braces. C++ uses the 'C locale' by default. So unless someone calls setlocale, we're fine. But this is global, so it's possible that another library that we load into Python does this. But even then, I'm unsure if C++'s streams respect it.
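
The Python side of this can be checked with a short script. It is only a sketch: it tries to switch `LC_NUMERIC` to a German locale (which may not be installed, hence the `try`) and shows that plain `str` and f-string formatting keep the period either way. Only the `n` presentation type and the `locale` module's own formatting functions are locale-aware. It says nothing about what the C++ side does.

```python
import locale

value = 1234.5

# Formatting before touching the locale.
plain = f'{value}'
fixed = f'{value:.2f}'

previous = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
try:
    locale.setlocale(locale.LC_NUMERIC, 'de_DE.UTF-8')
except locale.Error:
    pass  # Locale not installed; the results below are the same either way.

# Formatting after (possibly) switching to a locale with a decimal comma.
after_plain = f'{value}'
after_fixed = f'{value:.2f}'

locale.setlocale(locale.LC_NUMERIC, previous)  # restore the original setting
```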

SimonHeybrock · 2023-12-11T05:30:51Z

tests/io/cif_test.py

+        res
+        == '''data_datetime
+
+_audit.creation_date 2023-12-01T15:09:45+00:00


iirc timezone info was deprecated in NumPy and Scipp does not support it, i.e., everything is UTC. But here you test datetime, so we get the `+00:00` when formatting?

The test is about support for datetime objects so that we don't require wrapping everything in Scipp Variables.
It is actually unfortunate that we don't print the timezone info with Scipp objects as the CIF dictionary kind of hints that they should be present.
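
For reference, the `+00:00` suffix in the test output is what the standard library produces for a timezone-aware `datetime`. The snippet below only demonstrates `datetime.isoformat`; whether the writer calls exactly this method is an assumption.

```python
from datetime import datetime, timezone

aware = datetime(2023, 12, 1, 15, 9, 45, tzinfo=timezone.utc)
naive = datetime(2023, 12, 1, 15, 9, 45)

# An aware datetime carries its UTC offset into the ISO 8601 string;
# a naive one has no offset to print.
aware_text = aware.isoformat()
naive_text = naive.isoformat()
```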

jl-wynen · 2023-12-11T10:16:31Z

> What will be the final structure of the cal DataArray (with DIFA, DIFC...) ?

As shown in the docstring:

cal = sc.DataArray(
    sc.array(dims=['cal'], values=[3.4, 0.2]),
    coords={'power': sc.array(dims=['cal'], values=[0, 1])},
)

which corresponds to tzero = 3.4, DIFC = 0.2.
This is per data block, i.e., for one group of pixels that are focussed in the same way. I don't know yet how to handle focussing into multiple groups because that will need multiple data blocks.
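
One way to read this structure is as the coefficients of a polynomial in d-spacing, `tof = sum(coeff * d**power)`. The sketch below uses a plain dict instead of a Scipp `DataArray` so that it runs without Scipp; the function name `d_to_tof` is hypothetical. That power 2 would correspond to DIFA is an assumption based on the common GSAS-style convention, not something stated in this thread.

```python
def d_to_tof(dspacing, coefficients):
    """Evaluate tof = sum(coeff * d**power) for a dict {power: coeff}."""
    return sum(coeff * dspacing**power for power, coeff in coefficients.items())


# Same numbers as in the docstring example: power 0 is tzero, power 1 is DIFC.
cal = {0: 3.4, 1: 0.2}
tof = d_to_tof(2.0, cal)

# Assumed extension: a quadratic term (DIFA) would be stored under power 2.
cal_with_difa = {0: 3.4, 1: 0.2, 2: 0.0}
tof_with_difa = d_to_tof(2.0, cal_with_difa)
```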

> * Will the versioning of `essdiffraction` use the same convention as Scipp (i.e., month.year)?

Yes. Are you asking because of the computing.diffrn_reduction field in the example? If so, we still need to decide how to encode software because there will be multiple programs that need to be listed.

> * Will you provide links to the other files required for reduction, like Vanadium, sample container, empty instrument and the reduction script or Jupyter notebook...?

I'd very much like to. But I don't think CIF has any way of doing so. Does it?

> * Will the author name be one of the Scipp developers or the PI of the related proposal or the team taken from the proposal (PI + local contact + collaborators)?

It will be whatever the person running the code decides. For autoreduction, we can define authorship ourselves. But otherwise, this will be up to the user. (Same as for SciCat datasets. This is one reason for a unified representation, a la #473.)

rozyczko

Code looks good and produces well structured CIF files for instrumental/experiment data.

rozyczko · 2024-01-12T09:35:18Z

src/scippneutron/io/cif.py

+.. code-block::
+
+  #\\#CIF_1.1
+  data_example


All your keywords are of CIF 2.0 type (e.g. diffrn_radiation.probe) as opposed to the 1.1 standard (diffrn_radiation_probe)
This should likely say #\\#CIF_2.0 as it does in
https://github.com/COMCIFS/Powder_Dictionary/blob/master/cif_pow.dic

As I understand it, this depends on the dictionary, not the format. The 1.1 format already allowed periods in names: https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax (see the grammar). This is one reason why I made it so that it writes the dict version into the file.
But I can change it to use the 2.0 header. The files don't use any CIF 2 features but should be compatible with it. The question is, do the readers care or would they reject the file with this change?
Also, CIF2 allows a byte order mark, should I add it?

No, I was just a bit concerned about versioning. 1.1 seems fine and is surely more prevalent than 2.0.
Even the 2.0 page ( https://www.iucr.org/resources/cif/cif2 ) calls it an alternative format, not replacement.

Let's stick with 1.1, since as you say, it does have support for the new syntax.
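
To make the outcome concrete: a file can carry the 1.1 magic line and still use dotted data names. The helper below is a hypothetical, deliberately minimal sketch (no quoting, no loops, no validation) and is not the writer from this PR.

```python
# The magic line is '#\#CIF_1.1'; the backslash is escaped in Python source.
CIF_1_1_MAGIC = '#\\#CIF_1.1'


def write_minimal_cif(block_name, pairs):
    """Return CIF text with a 1.1 header, one data block, and tag/value pairs."""
    lines = [CIF_1_1_MAGIC, f'data_{block_name}', '']
    for tag, value in pairs.items():
        # Values are written verbatim; this sketch assumes they need no quoting.
        lines.append(f'_{tag} {value}')
    return '\n'.join(lines) + '\n'


text = write_minimal_cif('example', {'diffrn_radiation.probe': 'neutron'})
print(text)
```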

AndrewSazonov · 2024-01-22T14:29:10Z

Everything looks good, at least for the moment. When we have more TOF support in EasyDiffraction, we can check if something needs to be extended or changed. A couple of comments below:

> What will be the final structure of the cal DataArray (with DIFA, DIFC...) ?
>
> As shown in the docstring:
>
>     cal = sc.DataArray(
>         sc.array(dims=['cal'], values=[3.4, 0.2]),
>         coords={'power': sc.array(dims=['cal'], values=[0, 1])},
>     )
>
> which corresponds to tzero = 3.4, DIFC = 0.2. This is per data block, i.e., for one group of pixels that are focussed in the same way. I don't know yet how to handle focussing into multiple groups because that will need multiple data blocks.

The simplest way is to use only one diffraction measurement in a data block. Otherwise, the diffractogram_id label can be used to identify the diffraction measurement to which the data presented in the PD_DATA (PD_CALC, PD_MEAS, PD_PROC) category belong. The same label is also available in the PD_CALIB_D_TO_TOF category to identify the diffractogram to which the calibration relates.

> * Will you provide links to the other files required for reduction, like Vanadium, sample container, empty instrument and the reduction script or Jupyter notebook...?
>
> I'd very much like to. But I don't think CIF has any way of doing so. Does it?

I haven't found any built-in ways to do this in CIF. Perhaps we could create a loop with custom keys for the file description and urls to SciCat or GitHub?
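
A rough sketch of what such a loop could look like as text. Everything specific here is hypothetical: the tag names `x_reduction_file.description` and `x_reduction_file.url` are made up for illustration and do not come from any CIF dictionary, the URLs are placeholders, and the `format_loop` helper is not part of the writer in this PR.

```python
def _quote(item):
    # Values containing spaces need quotes; this sketch ignores embedded quotes.
    return f"'{item}'" if ' ' in item else item


def format_loop(columns):
    """Format a dict {tag: [values, ...]} as a CIF loop, one row per line."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError('All loop columns must have the same length')
    lines = ['loop_']
    lines.extend(f'_{tag}' for tag in columns)
    for row in zip(*columns.values()):
        lines.append(' '.join(_quote(item) for item in row))
    return '\n'.join(lines)


# Hypothetical custom keys and placeholder URLs.
loop_text = format_loop(
    {
        'x_reduction_file.description': ['Vanadium run', 'Reduction notebook'],
        'x_reduction_file.url': [
            'https://example.org/vanadium',
            'https://example.org/notebook',
        ],
    }
)
print(loop_text)
```

Whether readers would accept or ignore such non-standard keys would still have to be tested.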

jl-wynen · 2024-01-22T14:44:19Z

Thanks for your comments, @AndrewSazonov! We should probably revisit these questions when you have a reader in easyDiffraction, because we can then try out different solutions to see what works.

jl-wynen added 22 commits December 6, 2023 16:00

Start cif writer

104ff80

Fix scalar detection

52afafa

Start lower level cif writer

8bbe43e

Write loops

0c38b03

Support comments in loops

68a9300

Remove old code

788cbb7

Support datetime

2edbd4e

Implement basic save_cif

09a48aa

Support comments for blocks

5024f9f

Test save_cif with actual file

a524049

Make cif.Chunk public

a77599a

Add cif.Block.save

ed43230

Write CIF file heading

89d1f8d

Escape non-utf8 chars

1bae27d

Check block name length

c5e9eb3

Begin documenting the cif writer

691088b

Disallow whitespace in block name

ecb84cd

Add docstrings to cif writer

089945d

Encode CIF schema

f8d3a80

Add method to add powder data

c844782

Require matching dims in loop

f26bd65

Add helpers for adding reduced data

9fb263e

jl-wynen requested a review from SimonHeybrock December 7, 2023 15:05

Remove io.table

0f9e3bf

Unused now

jl-wynen force-pushed the cif-writer branch from c07b3ba to 0f9e3bf Compare December 7, 2023 15:18

SimonHeybrock reviewed Dec 11, 2023


jl-wynen added 2 commits December 11, 2023 11:18

Use setter to encode comment consistently

83ff464

Default to cal id = power

434011e

jl-wynen added 2 commits December 11, 2023 11:21

Explain separator choice in loop

19cbab0

Remove cif.Block.save

903e440

SimonHeybrock approved these changes Dec 11, 2023


rozyczko approved these changes Jan 12, 2024


jl-wynen merged commit e5cd680 into main Jan 22, 2024
7 checks passed

jl-wynen deleted the cif-writer branch January 22, 2024 14:44
