-
Notifications
You must be signed in to change notification settings - Fork 1.4k
[skip-ci] Create histv7
, document terminology and design
#19213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Design documents and the implementation will be added over time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, LGTM!
Of the comments I added, the "flow bins" are my biggest concern.
|
||
A *bin index* (plural *indices*) refers to a single bin of a dimension. | ||
A *normal bin* is inside an axis and its index starts from 0. | ||
*Underflow* and *overflow* bins, also called *flow bins*, are outside the axis and their index has a special value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could "flow bins" be an unfortunate abbreviation? I never heard it outside UHI discussions and/or the ROOT team. I find it imprecise, because integers and floating-point numbers under- or overflow, but they don't "flow" (other than through neural networks and the likes). Along the same lines, a value can under- and overflow an axis range, but it doesn't flow around the axis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outside of UHI, it is the terminology employed by boost-histogram as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let me demonstrate my point a bit better 😄
https://www.google.com/search?q=flow-bin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point: 'flow bins' isn't widely used in the histogramming world yet.
But it appears extensively in boost-histogram documentation (e.g., views) and in UHI documentation (e.g., slicing). I think it's gaining traction, and ROOT histograms could benefit from adopting it.
I personally find 'flow bins' short and clear, but 'under/overflow bins' or 'out-of-range bins' work as well I guess 🤷🏻♀️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree on your point, Stephan, I had also not really heard the term before Silia mentioned it. I do think though that we can still establish it now, because there is no clear and concise term yet. Thinking a bit during the weekend, a potential alternative would be "excess bins" but that has the disadvantage that it's completely new...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Very good starting point.
@@ -0,0 +1,73 @@ | |||
# Design and Implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "Interface Design"? Unless that will get extended by more aspects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It already has some aspects that go beyond interface in the "Miscellaneous" section. The way I view this document are the key choices that lead to the histogram package. For example, one aspect still missing is the goal of being able to support concurrent filling without degrading sequential filling, which we get by storing the bin contents as-is and only using atomic instructions on top. I plan to have the "Code Architecture" (ie how it's organized into classes) in a separate document.
What about "Package Design" maybe, in the sense of "software architecture design"?
hist/histv7/doc/Terminology.md
Outdated
The *bin content type* can be an integer type, a floating-point type, the special `DoubleBinWithError`, or a user-defined type. | ||
|
||
A *bin error* is the Poisson error of a bin content. | ||
With the special `DoubleBinWithError`, it is the square root of the sum of weights squared: $\sqrt{\sum w_i^2}$ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For later: can we prevent weighted filling with the wrong bin content type (double instead of DoubleBinWithError
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in the prototype I have static_assert
s failing when trying to use weighted filling on integer bin content types. So far I've opted to allow it for floating-point types because that may be interesting when only looking at the bin content, but not the error. It's possible to reconsider at a later stage.
With the special `DoubleBinWithError`, it is the square root of the sum of weights squared: $\sqrt{\sum w_i^2}$ | ||
Otherwise it is the square root of the bin content, which is only correct with unweighted filling. | ||
|
||
A *bin index* (plural *indices*) refers to a single bin of a dimension. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean for multi-dimensional histograms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a multi-dimensional histogram, you'll need an array of bin indices to refer to a single bin content. It's indeed ambiguous how we define a "bin": Is it per dimension or is it only once you give coordinates for all dimensions, ie what really has the bin content in the end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ambiguity theoretically exists also for the TH1-style histograms, but in practice, it is always clear from the context if you are talking about a bin on an axis (1-dim), or a bin in the histogram (possibly an index tuple). I guess one could add a sentence about bins in histogram and cover all cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's why there is no definition of a "bin" itself 😇 for "bin content" it is clear that it's the bin of a histogram, and for a single "bin index" it makes life easier if that's on a single axis.
Could you concretely propose where you would like to see added what?
hist/histv7/doc/Terminology.md
Outdated
|
||
A *linear index* starts from 0 up to the total number of bins, potentially including flow bins. | ||
For a single axis, it places the flow bins after the normal bins. | ||
The *global index* is a combination of the linear indices from all axes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps give more details here on the flattening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, in the end I would consider this an implementation detail. I plan to have API functions to compute the global index, but in principle I don't want users to do this themselves...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a great start to unify the terminology, thanks!
|
||
A *bin index* (plural *indices*) refers to a single bin of a dimension. | ||
A *normal bin* is inside an axis and its index starts from 0. | ||
*Underflow* and *overflow* bins, also called *flow bins*, are outside the axis and their index has a special value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outside of UHI, it is the terminology employed by boost-histogram as well
No description provided.