Skip to content

[skip-ci] Create histv7, document terminology and design #19213

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

hahnjo
Copy link
Member

@hahnjo hahnjo commented Jun 27, 2025

No description provided.

hahnjo added 2 commits June 27, 2025 12:40
Design documents and the implementation will be added over time.
@hahnjo hahnjo self-assigned this Jun 27, 2025
@hahnjo hahnjo requested a review from bellenot as a code owner June 27, 2025 11:05
@hahnjo hahnjo added the in:Hist label Jun 27, 2025
@hahnjo hahnjo requested review from lmoneta and dpiparo as code owners June 27, 2025 11:05
Copy link
Member

@hageboeck hageboeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, LGTM!

Of the comments I added, the "flow bins" are my biggest concern.


A *bin index* (plural *indices*) refers to a single bin of a dimension.
A *normal bin* is inside an axis and its index starts from 0.
*Underflow* and *overflow* bins, also called *flow bins*, are outside the axis and their index has a special value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could "flow bins" be an unfortunate abbreviation? I never heard it outside UHI discussions and/or the ROOT team. I find it imprecise, because integers and floating-point numbers under- or overflow, but they don't "flow" (other than through neural networks and the likes). Along the same lines, a value can under- and overflow an axis range, but it doesn't flow around the axis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outside of UHI, it is the terminology employed by boost-histogram as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point: 'flow bins' isn't widely used in the histogramming world yet.
But it appears extensively in boost-histogram documentation (e.g., views) and in UHI documentation (e.g., slicing). I think it's gaining traction, and ROOT histograms could benefit from adopting it.
I personally find 'flow bins' short and clear, but 'under/overflow bins' or 'out-of-range bins' work as well I guess 🤷🏻‍♀️

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on your point, Stephan, I had also not really heard the term before Silia mentioned it. I do think though that we can still establish it now, because there is no clear and concise term yet. Thinking a bit during the weekend, a potential alternative would be "excess bins" but that has the disadvantage that it's completely new...

Copy link
Contributor

@jblomer jblomer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Very good starting point.

@@ -0,0 +1,73 @@
# Design and Implementation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "Interface Design"? Unless that will get extended by more aspects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It already has some aspects that go beyond interface in the "Miscellaneous" section. The way I view this document are the key choices that lead to the histogram package. For example, one aspect still missing is the goal of being able to support concurrent filling without degrading sequential filling, which we get by storing the bin contents as-is and only using atomic instructions on top. I plan to have the "Code Architecture" (ie how it's organized into classes) in a separate document.

What about "Package Design" maybe, in the sense of "software architecture design"?

The *bin content type* can be an integer type, a floating-point type, the special `DoubleBinWithError`, or a user-defined type.

A *bin error* is the Poisson error of a bin content.
With the special `DoubleBinWithError`, it is the square root of the sum of weights squared: $\sqrt{\sum w_i^2}$
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For later: can we prevent weighted filling with the wrong bin content type (double instead of DoubleBinWithError?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in the prototype I have static_asserts failing when trying to use weighted filling on integer bin content types. So far I've opted to allow it for floating-point types because that may be interesting when only looking at the bin content, but not the error. It's possible to reconsider at a later stage.

With the special `DoubleBinWithError`, it is the square root of the sum of weights squared: $\sqrt{\sum w_i^2}$
Otherwise it is the square root of the bin content, which is only correct with unweighted filling.

A *bin index* (plural *indices*) refers to a single bin of a dimension.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean for multi-dimensional histograms?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a multi-dimensional histogram, you'll need an array of bin indices to refer to a single bin content. It's indeed ambiguous how we define a "bin": Is it per dimension or is it only once you give coordinates for all dimensions, ie what really has the bin content in the end?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ambiguity theoretically exists also for the TH1-style histograms, but in practice, it is always clear from the context if you are talking about a bin on an axis (1-dim), or a bin in the histogram (possibly an index tuple). I guess one could add a sentence about bins in histogram and cover all cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's why there is no definition of a "bin" itself 😇 for "bin content" it is clear that it's the bin of a histogram, and for a single "bin index" it makes life easier if that's on a single axis.

Could you concretely propose where you would like to see added what?


A *linear index* starts from 0 up to the total number of bins, potentially including flow bins.
For a single axis, it places the flow bins after the normal bins.
The *global index* is a combination of the linear indices from all axes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps give more details here on the flattening.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, in the end I would consider this an implementation detail. I plan to have API functions to compute the global index, but in principle I don't want users to do this themselves...

Copy link
Contributor

@siliataider siliataider left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great start to unify the terminology, thanks!


A *bin index* (plural *indices*) refers to a single bin of a dimension.
A *normal bin* is inside an axis and its index starts from 0.
*Underflow* and *overflow* bins, also called *flow bins*, are outside the axis and their index has a special value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outside of UHI, it is the terminology employed by boost-histogram as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants