Create Monte Carlo Workspace Algorithm #38341
base: release-next
Conversation
Framework/Algorithms/inc/MantidAlgorithms/CreateMonteCarloWorkspace.h
Sorry for sticking my oar in here - I found this very interesting (I do a similar thing to generate random data in 3D Q for single-crystal peak integration testing).
I'm aware it may be a work in progress/unfinished, so please forgive me, but I had a couple of questions:
(1) What is the advantage of this over using numpy random number generation like so (assuming the bin width is small enough that you don't need to worry about variation in the underlying pdf over the width of a bin)?
# imports assuming the usual Mantid python API locations
import numpy as np
from mantid.api import FunctionFactory
from mantid.simpleapi import CreateWorkspace, FlatBackground, FunctionWrapper

params = np.array([2.5e4, 0.06, 0.015, 30000, 30, 50])
peak_func = FunctionFactory.Instance().createPeakFunction("BackToBackExponential")
[peak_func.setParameter(ipar, params[ipar]) for ipar in range(peak_func.nParams())]
comp_func = FunctionWrapper(peak_func) + FlatBackground(A0=params[-1])
# simulate data
np.random.seed(1)
x = np.linspace(params[3] - 350, params[3] + 500, 61)
y = np.random.poisson(comp_func(x))
e = np.sqrt(y)
ws = CreateWorkspace(DataX=x, DataY=y, DataE=e)
(2) This seems to only support single histogram workspaces, but it doesn't check the number of histograms in the input workspace (unless I missed it?) - would it just make sense to generalise it and add a (parallelised) loop over histograms?
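A rough sketch of the kind of loop I mean is below (the helper name computeNormalizedCDF and the exact fillHistogramWithRandomData signature are assumptions on my part; a real version would use Mantid's usual threading macros to parallelise the outer loop):
// Hypothetical generalisation: process every spectrum of the input workspace
// rather than assuming a single histogram.
const auto numHistograms = static_cast<int>(instWs->getNumberHistograms());
for (int i = 0; i < numHistograms; ++i) {
  const auto &yData = instWs->y(i);                             // counts in this spectrum
  const std::vector<double> cdf = computeNormalizedCDF(yData);  // assumed helper
  outputWs->mutableY(i) = fillHistogramWithRandomData(cdf, numIterations, seed, progress);
}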
int CreateMonteCarloWorkspace::computeNumberOfIterations(const Mantid::HistogramData::HistogramY &yData) {
  double total_counts = std::accumulate(yData.begin(), yData.end(), 0.0);
  return static_cast<int>(std::round(total_counts));
What if this rounded to 0? I suppose you'd get 0s in the data, which might be confusing? Perhaps at least throw a warning?
Would it make more sense to add the number of MC events as an input parameter? If, for example, you wanted to see how counting times/stats affect optimiser performance, you would currently need to scale the input workspace, which seems a bit clunky.
I think having the parameter to specify number of MC events is good, and the default could be the integral of the input.
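A minimal sketch of the property I have in mind (the name, validator and wording here are only illustrative, not necessarily what ends up in the PR):
// In init(): optional number of Monte Carlo events; 0 means "use the integral of the input".
auto mustBeNonNegative = std::make_shared<Kernel::BoundedValidator<int>>();
mustBeNonNegative->setLower(0);
declareProperty("MonteCarloEvents", 0, mustBeNonNegative,
                "Number of Monte Carlo events to simulate. If left at 0, the integral of the input workspace is used.");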
I committed the suggestion above, please have a look and let me know. The scaling function is not perfect: if the number of MC events is too small, the new workspace appears to be a flat line when overplotted onto the original workspace. I am open to any suggestions on ways to improve it.
MatrixWorkspace_sptr outputWs = WorkspaceFactory::Instance().create(instWs);
progress.report();
std::this_thread::sleep_for(std::chrono::milliseconds(100));
Maybe a stupid question, sorry, but why do you need these sleep statements?
Not a stupid question. I found it was the only way to get the progress bar to actually work: if I remove the sleep statements it stops updating. I believe the execution completes so quickly that the progress bar doesn't have enough time to refresh or visually update its state.
If you happen to know a better way to implement it, please feel free to let me know.
I think you might have the progress bar usage a little wrong (although I'm no expert).
If your algorithm, for example, iterated over the spectra of an input workspace and performed some kind of operation on each, you would have the progress bar update after each spectrum is processed. That's the idea anyway: it shows the progress of the main bulk of the computation. For you this would be in fillHistogramWithRandomData.
I think you can also set it to display text (e.g. "writing output"). Maybe have a look at some other algorithms and see how it's used there.
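Something like the following is the pattern I mean (a rough sketch; the step count and message are placeholders, not a prescription):
// Construct Progress with the number of work units, then call report() as each
// unit completes; no sleeps are needed because the bar tracks real work.
API::Progress progress(this, 0.0, 1.0, numIterations);
for (int i = 0; i < numIterations; ++i) {
  // ... draw one random number and increment the bin it falls in ...
  progress.report("Filling histogram with random data");
}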
I updated the code for the progress bar; although I personally prefer the previous look, the new version does not need the sleep statements.
Thank you for taking the time to review my algorithm. To answer your first question, based on my understanding: the main advantage of my approach is that it doesn't assume any specific underlying functional form for the data. By constructing a cumulative distribution function directly from the input data and sampling from it, we capture the actual empirical distribution, including any irregularities or complexities that may not be well represented by a predefined function. This is intended for data where the underlying distribution is unknown, complex, or doesn't fit standard models. Please correct the above if I am wrong; to be honest I hadn't considered using numpy random number generation. Let me know if you still think it would be a better alternative.
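In simplified form, the idea is the following (a sketch of the approach only, not the algorithm's exact code):
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Build a normalised cumulative distribution from the input bin counts.
std::vector<double> computeCdf(const std::vector<double> &yData) {
  std::vector<double> cdf(yData.size());
  std::partial_sum(yData.begin(), yData.end(), cdf.begin());
  const double total = cdf.back();
  for (double &value : cdf)
    value /= total; // last entry becomes 1.0
  return cdf;
}

// Draw numIterations uniform deviates and add a count to the bin each one falls in.
std::vector<double> sampleFromCdf(const std::vector<double> &cdf, int numIterations, int seed) {
  std::mt19937 gen(seed);
  std::uniform_real_distribution<double> uniform(0.0, 1.0);
  std::vector<double> counts(cdf.size(), 0.0);
  for (int i = 0; i < numIterations; ++i) {
    const auto it = std::lower_bound(cdf.begin(), cdf.end(), uniform(gen));
    counts[static_cast<size_t>(std::distance(cdf.begin(), it))] += 1.0;
  }
  return counts;
}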
...source/release/v6.12.0/Inelastic/Algorithms/New_features/38196_CreateMonteCarloWorkspace.rst
// The total counts should be less than numIterations because some random numbers are not counted
auto sumCounts = std::accumulate(outputY.begin(), outputY.end(), 0.0);
TS_ASSERT_LESS_THAN(sumCounts, numIterations);
Do you know that this will always happen? Is there a chance these two values could be equal?
I think these are equal by design in this algorithm. This is in contrast to the method of randomly generating data I posted, where the total counts would itself be Poisson distributed (I believe; I should check...). I've asked @Despiix to check that the distribution of counts in each bin, over multiple simulations, is Poisson distributed (or close enough).
I think having the parameter to specify number of MC events is good, and the default could be the integral of the input.
@thomashampson - what about making the number of MC events equal to a random number drawn from a Poisson distribution with mean equal to the integral of the input? This might help the Poisson-ness of the stats in each bin (if that makes sense...).
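i.e. something along these lines (sketch only; totalCounts and seed are assumed to be already available):
#include <random>

// Draw the total number of MC events from a Poisson distribution whose mean is
// the integral of the input, so the grand total fluctuates the way real counts would.
std::mt19937 gen(seed);
std::poisson_distribution<long long> poissonTotal(totalCounts);
const long long numIterations = poissonTotal(gen);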
//----------------------------------------------------------------------------------------------
Mantid::HistogramData::HistogramY CreateMonteCarloWorkspace::fillHistogramWithRandomData(const std::vector<double> &cdf,
I'm wondering whether one could optionally output an event workspace (which is essentially what you are simulating), or produce an event workspace and then rebin to match the input workspace?
Does this need to move to v6.13?
No, I am waiting for someone to review it. All the additional features will be added through a separate PR.
Thanks for addressing some of the points, in particular setting the errors.
This works well and the tests are good (checking reproducibility with the seed, respects number of MC events etc.).
I checked with the test workspace from the code I posted earlier, and I also checked that the variance of the simulated counts agrees with the average over ~5000 simulations, so I think it does seem reasonable to set e = sqrt(y), at least for large counts. I haven't tested the validity of the Poisson distribution etc.
There are some small changes requested: in particular, there is perhaps some unnecessary normalisation done if the Monte Carlo events are user-specified.
I think there are some additional features I would like (support for workspaces with more than 1 spectrum, perhaps event workspaces as well) but these can be addressed in a separate PR as and when it becomes useful.
@@ -0,0 +1,44 @@
// Mantid Repository : https://github.com/mantidproject/mantid
//
// Copyright © 2024 ISIS Rutherford Appleton Laboratory UKRI,
Probably 2025 now!
src/PolarizationCorrections/PolarizationEfficienciesWildes.cpp
src/PolarizationCorrections/PolarizerEfficiency.cpp
These changes look unrelated to the PR, do you need to rebase?
}

Mantid::HistogramData::HistogramY
CreateMonteCarloWorkspace::scaleInputToMatchMCEvents(const Mantid::HistogramData::HistogramY &yData,
Seems to me like the CDF gets normalized (such that the summation equals 1) anyway, so is this necessary? Is it not possible to just set the total MC events equal to the sum of counts in the workspace, or to the number passed by the user? It seems wasteful to scale the data and then sum it up again later to get the total number of MC events in computeNumberOfIterations.
You’re right that once you’ve normalized to get the CDF, you only deal with ratios. The separate scaling step is mostly to ensure the raw Y data sums to the desired total of MC events before the CDF is computed; computeNumberOfIterations then just picks up that (optionally scaled) total.
Practically, you could skip calling scaleInputToMatchMCEvents if you’ve already decided your total MC events. In that case, you’d just ensure your iteration count equals the sum of Y (or the user value), normalize to get the CDF, and sample from it. The scaling still needs to be used when no MC events are entered, because there are times when the output looks odd without it; I am not exactly sure why.
TS_ASSERT_DELTA(totalScaledCounts, 20.0, 1e-6); // Verify the scaled sum matches targetMCEvents
}

MatrixWorkspace_sptr createInputWorkspace(int numBins, double initialValue) {
Looks like there are a few helper functions here. These are good, but perhaps separate them from the tests at the bottom of the file? It makes the file easier to read (perhaps just personal preference).
fig.show()

.. image:: ../../../images/New.png
Would it be possible to rename this file to something more self-explanatory, e.g. CreateMonteCarloWorkspace_spectrum.png?
Thanks for the changes, in particular renaming the .png and removing the unnecessary additional normalisation when the user supplies the number of events.
I know this PR has been going a while; I'm happy to approve now as I think this algorithm works well and is important for continuing the fitbenchmarking work. There are a few nit-picking comments, but as there are a few other features I'd like added separately after this release, I think these can be addressed later! What do you think @jclarkeSTFC, @thomashampson, @jhaigh0 and @sf1919 ?
 * Determine how many iterations to use for MC sampling.
 * If userMCEvents > 0, use that directly; otherwise use the integral of the input data.
 */
int CreateMonteCarloWorkspace::computeNumberOfIterations(const Mantid::HistogramData::HistogramY &yData,
Stylistically I would prefer to have this if (userMCEvents > 0) in the exec, and then this method would be called integrateYData or similar. But I'm happy to merge as is!
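Roughly what I mean (hypothetical names; integrateYData stands in for the renamed helper and the property name is assumed):
// In exec(): decide the iteration count up front, keep the helper purely numeric.
const int userMCEvents = getProperty("MonteCarloEvents");
const int numIterations = userMCEvents > 0
                              ? userMCEvents
                              : static_cast<int>(std::round(integrateYData(inputWs->y(0))));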
Oh, just noticed the copyright is still 2024 - is that correct or should it be 2025? Again, very minor...
The base branch was changed.
I think we would benefit from a better image for the documentation to showcase the algorithm.
Description of work
Summary of work
To benchmark the Mantid fitting engine, we need a method to generate randomly distributed data based on specific fit functions or workspaces. This will allow us to compare the generated data with known true values, providing insights into the performance of the fitting engine. I implemented a Mantid algorithm that uses a Cumulative Distribution Function (CDF) approach.
Fixes #38196.
Report to: @thomashampson
Further detail of work
**Expected Outcome:** This algorithm allows us to create random distributions based on known values, making it easier to assess the fitting engine’s accuracy and performance across different scenarios.
To test:
Reviewer
Please comment on the points listed below (full description).
Your comments will be used as part of the gatekeeper process, so please comment clearly on what you have checked during your review. If changes are made to the PR during the review process then your final comment will be the most important for gatekeepers. In this comment you should make it clear why any earlier review is still valid, or confirm that all requested changes have been addressed.
Code Review
Functional Tests
Does everything look good? Mark the review as Approve. A member of @mantidproject/gatekeepers will take care of it.
Gatekeeper
If you need to request changes to a PR then please add a comment and set the review status to "Request changes". This will stop the PR from showing up in the list for other gatekeepers.