WIP: refactor Abins, replace SData with Euphonic data structures #37350

Draft · wants to merge 43 commits into main

Conversation

@ajjackson (Contributor) commented May 13, 2024

Description of work

Summary of work

Remove abins.sdata.SData, use classes from euphonic.spectra instead

Purpose of work

Abins currently manages collections of simulated spectra with an "SData" class. This assumes that the data is structured as a tree with the hierarchy atom -> quantum order. This is not flexible enough to deal with coherent scattering, which might not be decomposable by atom.

Euphonic is an existing dependency and uses Spectrum1D, Spectrum2D and Spectrum1DCollection classes to manage spectra with arbitrary metadata. Effectively this is a flat table structure, amenable to "query" or "pipeline" workflows. This seems to be an appropriate tool, so here we extend it with the missing Spectrum2DCollection and work towards using these classes natively in Abins.
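
As a hedged illustration of that flat-table style, using Euphonic's public API (the metadata keys and values below are invented for the example, not Abins' actual keys):

```python
import numpy as np
from euphonic import ureg
from euphonic.spectra import Spectrum1DCollection

# Four simulated spectra sharing one energy axis, tagged with
# per-row metadata (keys here are invented for the example)
energy = np.linspace(0.0, 100.0, 200) * ureg('meV')
intensities = np.random.random((4, 200)) * ureg('dimensionless')

spectra = Spectrum1DCollection(
    energy, intensities,
    metadata={'line_data': [
        {'atom_index': 0, 'quantum_order': 1},
        {'atom_index': 0, 'quantum_order': 2},
        {'atom_index': 1, 'quantum_order': 1},
        {'atom_index': 1, 'quantum_order': 2}]})

# "Query"-style operations over the flat table:
fundamentals = spectra.select(quantum_order=1)  # filter rows on metadata
per_atom = spectra.group_by('atom_index')       # sum rows within groups
total = spectra.sum()                           # collapse to one Spectrum1D
```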

Further detail of work

  • Add a method to SData to produce an equivalent SpectrumNDCollection object
  • Update abins to use these objects for all "output" management (i.e. summing relevant subsets and creating workspaces)
  • Update abins to simulate data directly into a SpectrumNDCollection object
  • Delete SData class and related unit tests
  • Move patches and extensions to Euphonic classes upstream into a new euphonic release
  • Remove Euphonic subclasses from mantid, update Euphonic dependency version

To test:


Reviewer

Please comment on the points listed below (full description).
Your comments will be used as part of the gatekeeper process, so please comment clearly on what you have checked during your review. If changes are made to the PR during the review process then your final comment will be the most important for gatekeepers. In this comment you should make it clear why any earlier review is still valid, or confirm that all requested changes have been addressed.

Code Review

  • Is the code of an acceptable quality?
  • Does the code conform to the coding standards?
  • Are the unit tests small and test the class in isolation?
  • If there is GUI work does it follow the GUI standards?
  • If there are changes in the release notes then do they describe the changes appropriately?
  • Do the release notes conform to the release notes guide?

Functional Tests

  • Do changes function as described? Add comments below that describe the tests performed.
  • Do the changes handle unexpected situations, e.g. bad input?
  • Has the relevant (user and developer) documentation been added/updated?

Does everything look good? Mark the review as Approve. A member of @mantidproject/gatekeepers will take care of it.

Gatekeeper

If you need to request changes to a PR then please add a comment and set the review status to "Request changes". This will stop the PR from showing up in the list for other gatekeepers.

@ajjackson (Contributor, Author)

This initially led to some severe performance regressions, due to the relatively high overhead of iterating over a Euphonic Spectrum1DCollection/Spectrum2DCollection object, which performs unit conversions while spawning new objects. Some efficiency tweaks can be made to the Euphonic objects, but mostly we can recover the performance by iterating in other ways (over the metadata and underlying arrays).

In addition, we can start to introduce multiprocessing-based parallelism, which plays nicely with loops over "flat" tables.
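
A sketch of that difference, reusing the `spectra` object from the example above (`process_row` is a stand-in for Abins' real per-spectrum work, and `_y_data` is a private attribute as I understand Euphonic's internals):

```python
def process_row(row_metadata, y_row):
    ...  # stand-in for Abins' per-spectrum work

# Slow: every step of the loop builds a fresh Spectrum1D, converting
# units and copying arrays along the way
for spectrum in spectra:
    process_row(spectrum.metadata, spectrum.y_data.magnitude)

# Faster: walk the metadata rows and the raw (private) array together,
# without constructing intermediate Spectrum objects
rows = spectra.metadata.get('line_data', [{}] * len(spectra))
for row_metadata, y_row in zip(rows, spectra._y_data):
    process_row(row_metadata, y_row)
```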

@ajjackson (Contributor, Author)

Very strange test failure there: something is going wrong with the Pychop call inside a multiprocessing starmap, but only on macOS and Windows. I wonder if there is some OS-dependent namespace hackery that runs into issues because both abins and Pychop have an Instrument class 🤔 (Possibly relevant: macOS and Windows default to the "spawn" start method for multiprocessing, which re-imports modules in each worker process, unlike "fork" on Linux.)

To begin migration from SData to euphonic data types, allow existing
SData to produce a Spectrum1DCollection. Then we can start
modifying code that _uses_ SData to use this datatype instead.

So far we modify the _atom_number_s method, which creates output
workspaces from the Abins calculation.

Spectrum2DCollection is not implemented yet, so initially we leave the
old implementation in place for 2D cases.
- Although Spectrum1DCollection provides a .select() method, we
  ended up using a chain of filters here because it can't handle
  numerical comparisons (see the sketch after this list). Perhaps
  this would be a bit cleaner if the select() method accepted our
  filter functions.
- The Spectrum1DCollection group_by() method throws an error if the
  key is in the top level of metadata (i.e. because it applies to all
  rows). It would be good to fix that upstream too.
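
A hypothetical illustration of the filter-chain workaround, reusing the `spectra` collection and import from the earlier sketch (the predicate name and threshold are invented):

```python
def order_at_most(spectrum, max_order=2):
    # select() matches metadata values exactly, so numeric comparisons
    # need an explicit predicate over each spectrum's metadata
    return spectrum.metadata.get('quantum_order', 1) <= max_order

filtered = Spectrum1DCollection.from_spectra(
    [s for s in spectra if order_at_most(s)])
```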
This is not very thoroughly tested here, and should be moved upstream
once everything looks happy on the Abins side.

- Wrap Spectrum1DCollection to deal with the group_by() issue where
  metadata is the same for all items (and so not in line_data)
- Implement Spectrum2DCollection
- Have SData.get_spectrum_collection() dispatch to 1D or 2D as
  appropriate (see the sketch after this list)
- Tweak workspace writers to use either SpectrumNDCollection class
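
One way such a dispatch could look, as a minimal sketch (the `q_bins` marker and helper names are invented, not the branch's actual code):

```python
class SData:
    def get_spectrum_collection(self):
        # q_bins is an invented marker for 2-D (q, energy) data; the
        # real attribute in the branch may differ
        if getattr(self, 'q_bins', None) is None:
            return self._as_spectrum1d_collection()  # hypothetical helper
        return self._as_spectrum2d_collection()      # hypothetical helper
```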
As the relevant information is available in the spectra, this allows
us to drop a redundant parameter and reduce the number of objects
passed around.
- Now the Algorithm classes never deal with SData at all
- SData is still used internally and cached
Move the threshold-check logic to module-level functions, and leave
the existing method as a wrapper. Then we should be able to delete the
class later...

Also remove some useless parameters/logic that make the return value
optional. We can just ignore the return value if we don't care about
it.
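
A sketch of the wrapper pattern described above (names and the threshold value are illustrative):

```python
def check_thresholds(spectra, warning_threshold=0.01):
    """Module-level function carrying the real threshold logic.

    (Body elided; the threshold value is an invented example.)
    """
    ...

class SData:
    def check_thresholds(self, **kwargs):
        # The old method survives as a thin wrapper, so callers keep
        # working until the class itself can be deleted
        return check_thresholds(self.get_spectrum_collection(), **kwargs)
```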
Bin checks on Spectrum2DCollection need adapting to interpret the 3-D
z_data array correctly. This implementation is a bit inefficient as it
needlessly instantiates a Spectrum2D; leave the more verbose/efficient
implementation for upstream.
This functionality is also available in Euphonic, and it would make
sense to remove this entirely from Abins and use it there.

However:
- this will need a bit of tinkering to work with Spectrum2DCollection,
  and doing it upstream would avoid a lot of duplication
- it would be better to apply kinematic constraints (and instrument
  broadening) after spectra have been filtered and summed into output
  groups

So let's defer that for now
- Push SData a bit deeper in the chain of functions

- Stop using SDataByAngle, instead use angle metadata in
  SpectrumCollection objects

- Patch SpectrumCollection so that group_by() and select() work as
  expected (i.e. run successfully) when some keys are at the top level
  of metadata (i.e. common to all lines); see the sketch after this
  list.

- Apply broadening to SpectrumCollection objects. For now we keep
  using Abins implementation of broadening so that tests pass; this
  should be replaced with Euphonic implementation later.
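
One plausible shape for that patch, as a hedged sketch (the helper name is invented): merge the shared top-level keys back into each per-line row before filtering or grouping.

```python
def _rows_with_common_metadata(collection):
    # Top-level metadata applies to every line; re-attach it to each
    # row so key lookups in select()/group_by() cannot miss
    common = {key: value for key, value in collection.metadata.items()
              if key != 'line_data'}
    line_data = collection.metadata.get('line_data',
                                        [{}] * len(collection))
    return [{**common, **row} for row in line_data]
```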
This is a significant step towards removing SData entirely.

For compatibility at this point, Spectrum objects mostly store the bin
centres rather than bin edges. When SData is gone it would be nicer to
move to storing bin edges. (Centres can be computed from edges, but
edges can only be guessed from centres...)
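
A small numpy illustration of that asymmetry (not code from the branch):

```python
import numpy as np

edges = np.linspace(0.0, 10.0, 11)         # 11 edges -> 10 bins
centres = (edges[:-1] + edges[1:]) / 2     # exact, always
# Going the other way requires assuming the spacing, e.g. that the
# outer bins are symmetric about their centres:
guessed_edges = np.concatenate((
    [centres[0] - np.diff(centres)[0] / 2],
    (centres[:-1] + centres[1:]) / 2,
    [centres[-1] + np.diff(centres)[-1] / 2]))
```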
Officially the Spectrum metadata should not include floats, because
they cannot be reliably compared for equality.
Continue dissolving SData: apply the Debye-Waller correction to output
spectra, and drop the now-unused method from the class.
- create a new autoconvolution routine that works with the spectrum
  collection (see the sketch after this list)

- integrate downsampling into the autoconvolution loop over atoms

  - this could save a bit of memory, or benefit from parallelism over
    atoms

  - not clear if the benefit is large at this point, but it doesn't
    add much complexity either. YAGNI?

- apply q2 order corrections, converting 1D initial spectra to 2D maps
  for quantum orders with isotropic Debye-Waller correction (i.e. all
  orders above the fundamental)

- drop _broaden_sdata as this is no longer needed
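
For context, a minimal sketch of the autoconvolution idea (not the branch's implementation): each higher quantum order is approximated by convolving the previous order with the fundamental spectrum.

```python
import numpy as np

def autoconvolve(fundamental: np.ndarray, max_order: int) -> list[np.ndarray]:
    """Build orders 1..max_order by repeated convolution with order 1."""
    orders = [fundamental]
    for _ in range(1, max_order):
        # 'full' convolution doubles the energy range; truncate so
        # every order shares the same bins as the fundamental
        higher = np.convolve(orders[-1], fundamental)[:fundamental.size]
        orders.append(higher)
    return orders
```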
The Spectrum1DCollection/Spectrum2DCollection classes allow arbitrary
metadata, so there is no need for a technical distinction between the
class that collects over atoms and quantum orders and one that
collects over angles.
For now, fall back via SData if there is no isotropic calculation
This test is running very slowly due to extra autoconvolution
calculations, which are not required by the equivalent tests in
AbinsBasicTest. The reason is that the default value of
QuantumOrderEventsNumber has changed.

This was a useful indication that the autoconvolution needs more
optimisation in the ongoing SData->SpectrumCollection refactor, but
still not a lot of fun to figure out! Due to caching, the first of
these three tests to run is slow and then the other two are fast...
- Rather than go through the data transformation on arrays of zeros,
  we can return an almost-empty SpectrumCollection immediately.
- That finally eliminates SData from this module! There is definitely
  room for optimisation, as autoconvolution has become more expensive,
  but the order-1 stuff seems to be running in a similar time. It
  would be a good idea to do some profiling later...
Finally, the cathartic part of the refactor where I get to delete 500
lines of code at once
This was breaking a doctest
Remove innermost loop of Spectrum creation by assembling y_data array
directly.
The .from_spectra() constructor has a lot of overhead as it checks the
types and numerical agreement of axes; if we are comparing Collections
this should only be checked once.

If this tweak works well it should move upstream to euphonic. Here we
refer to a private method on the 1D collection from the 2D collection:
really this can live on a parent class.
Not seeing a big impact on test timings here, but this _should_ scale
better as we add more atoms.
- The euphonic SpectrumCollection select() method doesn't work when
  all lines in the collection match the criterion, because such
  metadata is moved into the top-level metadata rather than the
  per-line section which is checked. The key is therefore treated as
  missing.

- Compounding this, the data container cannot be entirely empty, so
  an error is raised if no matches are found.

- More reasonably, the `from_spectra()` constructor cannot operate on
  an empty list, as this means no (common) axis information would be
  available. In this commit we avoid calling it when no data is
  available.
Spectrum.x_data (etc.) properties are quite expensive to call, as they
trigger a call to ureg to build Units and a call to Quantity.to() to
ensure the output unit is consistent with the stored unit name.

Try to avoid calling them inside a loop! In practice this means using
private attributes a bit more; not ideal, but we pin the Euphonic
version and I maintain the package, so for now it is safe.

Avoid calling `* ureg` where we know we have an array and can build
the Quantity directly. This saves on some type-checking.

Implement an "unsafe" constructor for Spectrum1DCollection from a
series of Spectrum1D objects when we already know that the axes will
be consistent. Again, this saves a lot of time spent on consistency
checks.
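
A hedged sketch of such an "unsafe" constructor; the branch goes further (reading private arrays to skip unit conversion), but this illustrates the check-once idea:

```python
import numpy as np
from euphonic.spectra import Spectrum1DCollection

def from_spectra_unsafe(spectra_list):
    # Hypothetical fast path: the caller guarantees that every
    # Spectrum1D shares the same x axis and units, so nothing is
    # re-validated per item
    first = spectra_list[0]
    y_data = (np.stack([s.y_data.magnitude for s in spectra_list])
              * first.y_data.units)
    metadata = {'line_data': [dict(s.metadata) for s in spectra_list]}
    return Spectrum1DCollection(first.x_data, y_data, metadata=metadata)
```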
This iterator makes it more practical to iterate over the content of a
spectrum collection without (slowly) creating new Spectrum objects.

Profiling shows a dramatic impact in the places changed so far.
As the new data container facilitates a more "functional" programming
style with generators, it is quite simple to incorporate
multiprocessing and get the best performance we can from a typical
IDAaaS instance.

Test calculations show that with a few cores we get a speedup of
around 2x, which seems worth having on what can be a bottleneck step
(if there are a lot of atoms and few k-points, which is typical for
molecular calculations).

I expect that this step will be decoupled from other possible
parallelism over k-points or atom intensity calculations, so it can
use the full available processor count.
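
The pattern, roughly, reusing the `spectra` collection from the earlier sketches (the worker body, chunksize, and use of the private `_y_data` array are illustrative; note the `__main__` guard needed on spawn-based platforms):

```python
from multiprocessing import Pool

def compute_row(row_metadata, y_row):
    # Stand-in for the per-atom intensity calculation
    ...

if __name__ == '__main__':
    rows = spectra.metadata.get('line_data', [])
    with Pool() as pool:
        results = pool.starmap(compute_row,
                               zip(rows, spectra._y_data),
                               chunksize=5)
```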
Instead of deconstructing to individual spectra, filtering and
reconstructing, just look at the metadata and then index the
collection in one go. This is a lot more efficient as it avoids much
copying of x_data etc.

The [spectrum for spectrum in ...] line was previously taking 70% of
the function run time in a test calculation.
Now it is dominated by _calculate_s_isotropic (97% of time) as it
should be.
A simple for-loop over data rows is converted to a Pool.starmap for
an easy speedup.

For a benzene test case on a 30-core EPYC VM, this reduced time in the
function from 1.85s to 0.43s
Iteration over metadata is much cheaper than spawning new spectrum
objects.
With my test case of a benzene cell (one k-point, a fair number of
atoms, 10th-order expansion): by default, parallelising the inner loop
over 2D spectra gives a decent speedup as the Pool is grown to around
15-16 cores, then deteriorates as we work up to the 30 cores
available.

This is significantly improved by setting a chunksize of 5, but we
leave this adjustable in abins.parameters. (There is still a *slight*
edge at some optimal core count, but it is no longer worth the user
effort of checking!)
By converting the data to a 2-D array and reshaping at the end, we get
more data into a single starmap() call and better parallel
efficiency. At this point it is also helpful to increase the chunksize
a bit: at 100, the cost increase for 1-D is slight and 2-D benefits a
lot.

Perhaps they should be different parameters, with more
benchmarking/adaptive setting used to optimise. But for now this is a
lot better than serial operation!

Performance is now satisfactory, so I will not attempt a shared-memory
implementation on the current branch (which is supposed to be focused
on the data refactor...)
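
A sketch of the reshape trick (`z_data` shape and the `broaden_row` worker are invented for illustration):

```python
import numpy as np
from multiprocessing import Pool

def broaden_row(row):
    # Stand-in for per-row instrumental broadening
    return row

z_data = np.zeros((4, 50, 200))   # (n_spectra, n_q, n_energy), invented

if __name__ == '__main__':
    # Flatten to 2-D so a single map() call covers every row of every
    # spectrum, then restore the original shape afterwards
    flat = z_data.reshape(-1, z_data.shape[-1])
    with Pool() as pool:
        broadened = pool.map(broaden_row, flat, chunksize=100)
    z_data = np.asarray(broadened).reshape(z_data.shape)
```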
On a previous attempt I thought this wouldn't work and that starmap
was required.

The partial function seems a cleaner way of making it clear that only
one parameter is actually being mapped over. It also seems to work
with a staticmethod, so we can bring the function definition closer to
where it is used.
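
As a minimal sketch of the `partial()` variant (names and values invented):

```python
import numpy as np
from functools import partial
from multiprocessing import Pool

def broaden_row(row, bins=None, width=None):
    # Stand-in worker; bins and width are the fixed arguments
    return row

bins = np.linspace(0.0, 100.0, 101)   # illustrative fixed arguments
width = 1.5
rows = np.zeros((10, bins.size - 1))

if __name__ == '__main__':
    # partial() fixes the shared arguments once, making it clear that
    # only `rows` is actually being mapped over
    worker = partial(broaden_row, bins=bins, width=width)
    with Pool() as pool:
        results = pool.map(worker, rows, chunksize=10)
```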
Getting strange pychop CI errors on Windows and Mac. See if it is
specific to multiprocessing...
Mac/Win tests are failing when the Pychop resolution is computed in
parallel. We don't really need to do that on every thread though; we
can compute the energy function once before distributing.
@sf1919 added and then removed the "Has Conflicts" label (used by the bot to label pull requests that have conflicts) on Aug 21, 2024
@sf1919 (Contributor) commented Aug 21, 2024

Apologies... wrong PR

@github-actions bot added the "Has Conflicts" label on Sep 4, 2024
github-actions bot commented Sep 4, 2024

👋 Hi, @ajjackson,

Conflicts have been detected against the base branch. Please rebase your branch against the base branch.
