
McStas loader and time binning workflow. #5

Closed · wants to merge 25 commits

Conversation

@YooSunYoung (Member) commented Dec 1, 2023

Fixes #6
docs/examples/workflow.ipynb contains an example workflow that loads McStas data and bins it in the time dimension.

TODO:

  • Make a small subset of simulation data for loader testing.
  • Update the workflow example to use generated data or downloaded sample data.
    - [ ] Add another workflow example that uses a measurement data sample. Will be done separately.
  • Add unit tests for loaders and minimal reduction functions.
    - [ ] Add documentation for the loaders.

@YooSunYoung self-assigned this Dec 1, 2023
@YooSunYoung (Member, Author) commented:

Here is the script I used to make a subset of data:

from pathlib import Path

import h5py
import numpy as np

file_path = Path('pulse5_z.h5')
copied_file_path = file_path.with_name(file_path.stem + '_subset.h5')

print(f"Copy subset of {file_path} to {copied_file_path}")

# Remove the target file if it already exists.
if copied_file_path.exists():
    copied_file_path.unlink()

# Copy all fields and a subset of the event data field.
# (If you copy a file and then delete a field, the file size does not shrink,
# so we build a new file instead.)
with h5py.File(file_path, 'r') as file:
    entry = file['entry1']
    dataset = entry['data']
    event_path = 'bank01_events_dat_list_p_x_y_n_id_t'
    events = dataset[event_path]['events'][()]

    print("Original Shape: ", events.shape)
    with h5py.File(copied_file_path, 'w') as copied_file:
        copied_entry = copied_file.create_group('entry1')
        copied_data = copied_entry.create_group('data')

        # Copy all non-data fields.
        for key in entry.keys():
            if key == 'data':
                continue
            entry.copy(key, copied_entry, key)

        # Copy all non-event data fields from the original data group.
        for data_key in dataset.keys():
            if data_key != event_path:
                dataset.copy(data_key, copied_data, data_key)

        # Copy a random ~0.001% subset of the events.
        copied_data.create_group(event_path)
        choices = np.random.random(events.shape[0]) < 0.00001
        subset = events[choices]
        copied_data[event_path].create_dataset('events', data=subset)
        print("Subset Shape: ", subset.shape)

@YooSunYoung marked this pull request as ready for review December 7, 2023 15:26
@nvaytet self-assigned this Dec 8, 2023
docs/examples/workflow.ipynb (outdated review thread)
"This page will show simulation data workflow as an example. <br>\n",
"They are written with ``sciline``, so we will show how to collect ``providers`` and ``parameters`` to build a workflow pipeline and compute the required result. <br>\n",
"\n",
"First, we will set up scipp logging widget in the notebook."
Member (reviewer) commented:
What is our current guideline for the use of logging? I know we have it in other workflows in the ess repo, but with the use of sciline, I personally feel that logging in the way we used to do it in ess is not so useful?

If you have a graph of the pipeline, it's more useful than a list of steps in a log?

Or maybe there was a very good reason you needed to add logging in this notebook?
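
(For reference, the graph-based alternative mentioned here could look roughly like the sketch below; `nmx_workflow` is the sciline pipeline built in the notebook, and `TimeBinned` is a placeholder target type rather than the actual NMX domain type.)

# Display the task graph leading to the requested result instead of a textual log.
# `TimeBinned` is a placeholder target type, not the real NMX symbol.
nmx_workflow.visualize(TimeBinned)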

@YooSunYoung (Member, Author) replied:

It was one of the requests from Justin; he wants to quickly check the range of t_list while computing the result.

Since we want to see the intermediate result, not just the progress, this was the easiest way I could think of...

"\n",
"file_path = small_mcstas_sample() # Replace it with your data file path\n",
"\n",
"nmx_workflow = build_workflow(file_path)\n",
Member (reviewer) commented:

In esssans and essreflectometry, we have kept the explicit use of Pipeline in the notebook, and we didn't hide it behind a build_workflow wrapper.

Should we keep it consistent and not use a wrapper?
Or do you think having to build the pipeline is too much code for users to write?

I personally think that it's good for users to have to build it; it gives them more understanding of what is going on under the hood.

But maybe it was a requirement from the NMX people?
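
(For reference, the "explicit" style referred to here is roughly the following sketch; `providers`, `params`, and `TimeBinned` are placeholders rather than the actual NMX symbols.)

import sciline as sl

# `providers` and `params` stand for the NMX provider list and parameter dict
# assembled earlier in the notebook; `TimeBinned` is a placeholder target type.
pipeline = sl.Pipeline(providers, params=params)
result = pipeline.compute(TimeBinned)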

@YooSunYoung (Member, Author) replied:

But maybe it was a requirement from the NMX people?

No, it was not a requirement.
But since NMX doesn't really do much here, I thought building the pipeline itself was not so important.
But I agree that it should be consistent with the other packages. I will unwrap them again.

src/ess/nmx/detector.py (outdated review thread)
return Events(sc.DataArray(data=weights, coords={'t': t_list, 'id': id_list}))


providers = [
Member (reviewer) commented:

I think in other projects, the providers are now tuples instead of lists.

return NMXProviders([*logging_providers, *loader_providers, *reduction_providers])


def collect_default_parameters() -> NMXParams:
@nvaytet (Member) commented on Dec 8, 2023:

I think I would use the default parameters as in essreflectometry:

params={
    **default_parameters,
    QBins: sc.geomspace(dim='Q', start=0.008, stop=0.075, num=200, unit='1/angstrom'),
    SampleRotation[Sample]: sc.scalar(0.7989, unit='deg'),
    Filename[Sample]: "sample.nxs",
    SampleRotation[Reference]: sc.scalar(0.8389, unit='deg'),
    Filename[Reference]: "reference.nxs",
}

and then explicitly create the pipeline in the notebook?

@YooSunYoung (Member, Author) replied:

But it was a bit annoying to have default_parameters as a dictionary, because it's mutable.
I often accidentally changed the original dictionary, so that is why I made this function instead...
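
(A minimal sketch of the difference, with hypothetical parameter names: a function returns a fresh dict on every call, so callers cannot accidentally mutate a shared module-level `default_parameters`.)

from typing import NewType

TimeBinSteps = NewType('TimeBinSteps', int)              # hypothetical domain types
MaximumProbability = NewType('MaximumProbability', int)


def collect_default_parameters() -> dict:
    # A new dict is created per call, so overriding or deleting entries
    # downstream never changes the defaults seen by later calls.
    return {TimeBinSteps: 50, MaximumProbability: 10000}


params = {**collect_default_parameters(), TimeBinSteps: 100}  # safe to override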

...


class Grouped(sl.Scope[FileType, sc.DataArray], sc.DataArray):
Member (reviewer) commented:

Could you make the name a little less generic? I didn't know what it was immediately. Something like GroupedByDetectorId? (if that is indeed what it is...)

Calculate the distance between two points.
"""
diff = point_b - point_a
return Distance(sc.sqrt(sc.dot(diff, diff)))
Member (reviewer) commented:

Use sc.norm?
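
(i.e., the return could become something like the line below, assuming `point_a` and `point_b` are vector3 Variables as in the snippet above:)

return Distance(sc.norm(point_b - point_a))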

...


def calculate_distance(point_a: Vector3D, point_b: Vector3D) -> Distance:
Member (reviewer) commented:

This function appears to not be used anywhere?

@YooSunYoung (Member, Author) replied:

Yes... they are from the original notebook. I didn't know how much would be included in the loader example. I removed them for now.

RotationAngle = NewType("RotationAngle", sc.Variable)


def rotation_matrix(axis: Vector3D, theta: RotationAngle) -> RotationMatirx:
Member (reviewer) commented:

Also unused?

@YooSunYoung requested a review from nvaytet December 8, 2023 13:23
@SimonHeybrock (Member) commented on Dec 11, 2023:

We should chat about all the providers and how things were split up here. In particular I don't think any of the McStas details should be exposed downstream, and none of the "downstream" types defined, e.g., in reduction.py should depend on the file type.

The domain types and providers in Sciline are intended for encapsulation and hiding implementation details of one part of the pipeline (such as loading data) from other parts (such as data reduction).
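
(For reference, the encapsulation idea reads roughly like the sketch below; the type and provider names are hypothetical, not the actual NMX code.)

from typing import NewType

import sciline as sl
import scipp as sc

FilePath = NewType('FilePath', str)
RawEventData = NewType('RawEventData', sc.DataArray)  # no FileType parameter
TimeBinned = NewType('TimeBinned', sc.DataArray)


def load_mcstas(path: FilePath) -> RawEventData:
    """McStas-specific reading is confined to the loader provider."""
    ...


def bin_in_time(events: RawEventData) -> TimeBinned:
    """Reduction only sees RawEventData and never depends on the file type."""
    ...


pipeline = sl.Pipeline([load_mcstas, bin_in_time], params={FilePath: 'mcstas.h5'})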

@YooSunYoung (Member, Author) replied:

I was not sure exactly what you meant,
so I just tried removing the file-type generic providers here: #7

@YooSunYoung (Member, Author) commented:

Closing this in favor of #9

@YooSunYoung deleted the loader branch December 12, 2023 13:34