
Added a facet for array uncertainty propagation #1797

Open · wants to merge 8 commits into master
Conversation

@varchasgopalaswamy (Contributor) commented Jun 6, 2023

I found pint quite useful in my research, but the one feature I've always wanted was units + uncertainty propagation + support for arrays. I made a small package that uses automatic differentiation to propagate uncertainties for all numpy functions, and I figured adding a facet for it in pint might be of interest! Let me know what you think.

@MichaelTiemannOSC (Collaborator)

Fascinating new package! Are you intending to obsolete the uncertainties package? I wrote a PR for supporting that: #1615

What I like about your approach is (1) nice modularity with facets, and (2) it uses ExtensionArray rather than wrappers. I think that since Pint and Pint-Pandas have gone all-in on ExtensionArrays, that's a great way to go.

But I wonder how you deal with parsing uncertainties, which will inevitably be required when Quantities must be parsed, not only constructed. And is your package ready to do all the uncertainties tasks that would be required by Pint-Pandas?

@hgrecco (Owner) commented Jun 7, 2023

Interesting PR. I think that @MichaelTiemannOSC's questions are important, though.

@varchasgopalaswamy (Contributor, Author) commented Jun 7, 2023

@MichaelTiemannOSC

How you deal with parsing uncertainties, which will inevitably be required when Quantities must be parsed

By this I am guessing you mean converting a string "3 +- 5 um" into Uncertainty(3, 5) * unit_registry('um')? If so, right now I don't think I do that. I think that's going to be a need for me as well down the line (my workflow manager serializes Python objects into strings), so I'll likely implement it myself.
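
For what it's worth, a minimal sketch of what such parsing could look like (the regex and function name are illustrative, and the Uncertainty import path is an assumption, not the package's actual parser):

import re

import pint
from auto_uncertainties import Uncertainty  # assumed import path

ureg = pint.UnitRegistry()

def parse_uncertain_quantity(s):
    # Split e.g. "3 +- 5 um" into value, error, and unit substrings
    m = re.match(r"\s*([0-9.eE+-]+)\s*(?:\+/-|\+-|±)\s*([0-9.eE+-]+)\s*(.*)", s)
    value, error, unit = m.groups()
    return Uncertainty(float(value), float(error)) * ureg(unit)

print(parse_uncertain_quantity("3 +- 5 um"))  # (3 +/- 5) micrometer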

And is your package ready to do all the uncertainties tasks that would be required by Pint-Pandas?

I don't think I got the ExtensionArray functionality working, primarily because I was having trouble making it work well with Pint normally - and then I saw the update with facets, and realized it was a lot more straightforward than I had thought! Again, having auto_uncertainties work with Pandas is something that's necessary for me so I expect I'll have that functionality in the near term.

Are you intending to obsolete the uncertainties package?

No, for a few reasons. First off, there are all the features above that you might want that aren't there yet. Second, it feels presumptuous without feedback from other users. Third, my package requires JAX for autodiff, and pip-based binary installs of XLA only work on computers that support AVX instruction sets. AVX has been out for around a decade, so I'm not sure if this is a big deal for the majority of users, but I have had to deal with older HPC systems where this was an issue.

If the PR is approved and users feel that they have everything they need, perhaps we can revisit this question then. Currently I view this PR as adding bare-minimum support for array-based uncertainty propagation with units that we can build on going forward.

@andrewgsavage (Collaborator)

I see you've used a single class for both scalars and arrays. I'd suggest separate classes for each; that will make it easier to write an ExtensionArray, since pandas seems to be written with that in mind. It is possible to make a single class work, as pint-pandas has shown, and it is improving, but there are still some pint-pandas features that don't currently work but would if pint split scalars and arrays. It would have saved many hours of debugging, too.

There are several bits of metadata you'll need to store with the ExtensionArray:

  • the numpy dtype you're storing the values and error (int, float...)
  • the error distribution type
  • the unit

This metadata is in addition to the two bits of data (value and error) per scalar. I haven't seen an ExtensionArray that does this yet.

I'd suggest getting an ExtensionArray without units working as a first step, before dealing with storing and propagating units. I don't believe there's an ExtensionArray for uncertainties yet, so this would be valuable! (A sketch of where that metadata might live follows below.)
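
A rough sketch of where that metadata could live, assuming the standard pandas extension interface (the class and field names here are hypothetical):

import numpy as np
from pandas.api.extensions import ExtensionDtype

class UncertaintyDtype(ExtensionDtype):
    # A pandas ExtensionArray is parametrized by an ExtensionDtype,
    # which is a natural home for the three bits of metadata above.
    type = object       # scalar type of the elements
    na_value = np.nan

    def __init__(self, subdtype="float64", distribution="normal", unit=None):
        self.subdtype = np.dtype(subdtype)  # numpy dtype of value/error storage
        self.distribution = distribution    # error distribution type
        self.unit = unit                    # the unit (or None if pint holds it)

    @property
    def name(self):
        return f"uncertainty[{self.subdtype.name}, {self.distribution}, {self.unit}]"

The array class itself would then hold two plain numpy arrays (values and errors) of that subdtype, which is the "two bits of data per scalar" problem.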

@varchasgopalaswamy (Contributor, Author)

But I wonder how you deal with parsing uncertainties, which will inevitably be required when Quantities must be parsed, not only constructed.

@MichaelTiemannOSC the latest commits should add parsing support for scalar uncertainties. I looked through your PR #1615 to get an idea of how to do it, but I thought this way seemed a bit easier: I simply added a few lines to string_preprocessor to turn '+/-' and '+-' into the unicode "±" so that the existing tokenizer could handle it without needing lookaheads. I think that's fine since all the strings are treated as UTF-8, right?
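
In essence, the preprocessing amounts to a substitution along these lines (a simplified sketch, not the literal diff in the PR):

import re

def normalize_plus_minus(s):
    # Rewrite ASCII spellings of plus/minus to the single character "±",
    # so the existing tokenizer sees a one-character operator.
    return re.sub(r"\+/-|\+-", "±", s)

print(normalize_plus_minus("3 +- 5 um"))   # 3 ± 5 um
print(normalize_plus_minus("3 +/- 5 um"))  # 3 ± 5 um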

I haven't tried to deal with array uncertainties, but I notice that pint can't parse array quantities in any case, so I didn't think that was vital to deal with right now. By this I mean:

import numpy as np
from pint import UnitRegistry

unit_registry = UnitRegistry()

x = np.asarray([2, 3]) * unit_registry('s')
string_rep = str(x)
# This is [2 3] s
y = unit_registry.Quantity(string_rep)
# This is 6 s

I guess that's because the string gets tokenized to (2 * 3) * s?

@MichaelTiemannOSC (Collaborator)

I'm taking a look at your changes now--seeing how hard/easy it is for me to swap out what I did and swap in what you have done.

@varchasgopalaswamy (Contributor, Author)

I'm taking a look at your changes now--seeing how hard/easy it is for me to swap out what I did and swap in what you have done.

Let me know what you think. You'll probably need to install the version on GitHub rather than the released version on PyPI; I've been adding better typing support.
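
Something like the following should work (the repository URL is my assumption of where AutoUncertainties lives):

pip install git+https://github.com/varchasgopalaswamy/AutoUncertainties.git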

We're using it in production internally and I've been pretty pleased with how smooth it's been, but I'd appreciate any feedback!

@MichaelTiemannOSC (Collaborator)

Yes, I am using a GitHub fork of your project (in case I need to suggest changes). So far it's looking very nice, but I'm still getting the wires connected between how I used uncertainties and how I expect to use auto_uncertainties.

@hgrecco (Owner) commented Aug 30, 2023

Awesome that this conversation is moving forward!

@MichaelTiemannOSC (Collaborator)

Here's my attempt to take stock of the various problems and solutions.

Python handles a range of numeric types (integer, float, complex), but math.isnan only works for int and float.

Numpy supports single- and multi-dimensional arrays. np.isnan operates on scalars and arrays. While one can construct numpy arrays holding arbitrary (object) types, numpy functions such as np.isnan raise errors if they cannot safely convert the data type of the scalar or array to a numeric type. (np.isnat similarly works only on datetime64 and timedelta64.)

The uncertainties package wraps both numeric and numpy types. Purely numeric types are overridden with ufunc-like functions that implement uncertainty math. Arrays of uncertainties are object arrays as far as Numpy is concerned, but when array operations are performed (such as adding two arrays together), the implicit map over the elements of the arrays invokes the uncertainty ufunc to wrap the operation and produce an uncertainty result that can be rewrapped in an object Numpy array.

Numpy functions (such as isnan) are not overridden per se, but are implemented via unumpy versions, such as unumpy.isnan, which works on arrays of uncertainty values. And it's ugly, because any code that could touch uncertainties must check unp.isnan; np.isnan cannot be overridden the way wrapped operators can. What's worse, when we consider Pandas arrays, which might use pd.NA instead of np.nan as their null value, we have to check both pd.isna and unp.isnan (because unp.isnan doesn't know about pd.NA).
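
Concretely, using the uncertainties package's actual unumpy module:

import numpy as np
from uncertainties import ufloat
from uncertainties import unumpy as unp

arr = np.array([ufloat(1.0, 0.1), ufloat(float("nan"), 0.1)])  # dtype=object

# np.isnan(arr) raises TypeError: isnan is not supported for object arrays.
# unp.isnan works element-wise on the wrapped values instead:
print(unp.isnan(arr))  # [False True]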

A unified Numpy/uncertainty interface would be nice!

Pint facets combine definitions (classes describing specific unit-related definitions), objects (that encapsulate behavior, e.g. Context), and a registry (implementing a subclass of PlainRegistry, or a class that can be mixed in with it). The PlainRegistry implements the details of Quantity (which can be any numeric type or, if Numpy is available, an array type as well). The NumpyRegistry brings together NumpyQuantity and NumpyUnit to specialize its definitions of Quantity and Unit.

The Measurement facet implements a scalar-only Quantity consisting of an uncertain magnitude and a unit. Quantities combined with Quantities yield Quantities, but Quantities combined with Measurements yield Measurements, except when Quantities contain array values, at which point Measurements are broadcast across the Quantity array:

>>> Q_([1, 2, 3], 'm') + M_(1, 2, 'm')
<Quantity([2.0+/-2.0 3.0+/-2.0 4.0+/-2.0], 'meter')>

Which means that, like it or not, the Quantity facet does (and will) implement uncertainties within its facet.

In a sense, the AutoUncertainties changes in this PR are what one would expect if we extended the Measurement facet to handle arrays itself, rather than punting those arrays back to the Quantity facet. And as we see, there's quite a bit of code and structure that has to be duplicated or reimplemented from the Quantity facet to give Measurements the same functionality.

This gives us the potential niceness of being able to broadcast scalar Quantities to arrays of Measurements, Measurements to arrays of Quantities, and to do array operations on any combination of Quantities and Measurements (provided the shapes are compatible). Nice! But is it really so nice?

If Quantities can, by themselves, do all the work, what's the point of duplicating so much code to create a new thing that can do only half as much? One thing I've noticed is that AutoUncertainties allows uncertain values to be downcast to nominal values, whereas traditional uncertainties requires users to explicitly ask for nominal values. I've seen the same sort of thing with complex numbers: some packages will just drop the imaginary component to automatically convert a complex number to a float, whereas others require the user to explicitly select the real component if that's what they want.
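
To illustrate the difference (the uncertainties behavior below is that package's documented design; the AutoUncertainties lines are a sketch of the implicit downcast described above, not verified API):

from uncertainties import ufloat

u = ufloat(1.0, 0.1)
print(u.nominal_value)  # 1.0: the explicit way to get the nominal value
# float(u) raises TypeError; uncertainties refuses implicit downcasts.

# By contrast, AutoUncertainties as described would let an uncertain value
# decay to its nominal value implicitly, roughly:
# float(Uncertainty(1.0, 0.1)) -> 1.0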

The big change I think I have seen in the interfaces between Pint and Numpy/Pandas is that object arrays (Numpy) and ExtensionArrays (Pandas) are really solid when it comes to the kind of polymorphic behavior we seek to implement. But the more classes we have running around doing roughly the same things (such as Quantities and Measurements), the more we need to flesh out the combinatorial matrix. I can construct a PintArray from a sequence of Quantities, but not Measurements:

>>> PA_._from_sequence([Q_(1, 'm'), Q_(2, 'm')])
<PintArray>
[1, 2]
Length: 2, dtype: pint[meter]
>>> PA_._from_sequence([M_(1, 2, 'm'), M_(2, 3, 'm')])
*** ValueError: No dtype specified and not a sequence of quantities

Now, AutoUncertainties has a mostly-implemented Pandas EA for itself, but then we have to explain how PintArrays (which are grounded in Quantities) and UncertaintyArrays (grounded in Uncertainties) interoperate.

Finally, I made a Pull Request (hgrecco/pint-pandas#198) to start benchmarking PintArrays. The goal of that benchmarking is to demonstrate what we consider to be the best, fastest implementation we can cleanly get using PintPandas, document that as a pattern, and ensure that we keep it running fast. New implementation ideas can compare themselves to that benchmark and see whether they really do make things better, or are instead just different (with possible performance penalties).

I acknowledge that the np.isnan vs. unp.isnan thing is ugly. But everything else about using uncertainties in Quantities looks better to me than creating a parallel universe with Measurement (even if renamed Uncertainty).

@codspeed-hq (bot) commented Sep 13, 2023

CodSpeed Performance Report

Merging #1797 will not alter performance

Comparing varchasgopalaswamy:uncertainties (e006eb6) with master (6c2dda9)

Summary

✅ 421 untouched benchmarks

@hgrecco (Owner) commented Sep 13, 2023

@MichaelTiemannOSC Thanks for the detailed explanation. I do agree with your view!
