[skip-ci] Create histv7, document terminology and design #19213

Merged
merged 4 commits into from
Jul 2, 2025
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -18,6 +18,7 @@
/graf3d/ @couet
/gui/ @bellenot
/hist/ @lmoneta
/hist/histv7/ @hahnjo
/html/ @dpiparo
/icons/ @bellenot
/interpreter/ @dpiparo
9 changes: 6 additions & 3 deletions hist/CMakeLists.txt
@@ -4,11 +4,14 @@
# For the licensing terms see $ROOTSYS/LICENSE.
# For the list of contributors see $ROOTSYS/README/CREDITS.

-add_subdirectory(hist) # special CMakeLists.txt
-add_subdirectory(histpainter) # special CMakeLists.txt
+add_subdirectory(hist)
+add_subdirectory(histpainter)
+if(root7)
+add_subdirectory(histv7)
+endif()
if (spectrum)
add_subdirectory(spectrum)
-add_subdirectory(spectrumpainter) # special CMakeLists.txt
+add_subdirectory(spectrumpainter)
endif()
if(unfold)
add_subdirectory(unfold)
Empty file added hist/histv7/CMakeLists.txt
Empty file.
73 changes: 73 additions & 0 deletions hist/histv7/doc/DesignImplementation.md
@@ -0,0 +1,73 @@
# Design and Implementation

This document describes key design decisions and implementation choices.

## Templating

Classes are only templated if required for data members, in particular the bin content type `T`.
We use member function templates to accept a variable number of arguments (see also below).
Classes are **not** templated for the sake of performance, in particular not on the axis type(s).
This avoids an explosion of types and simplifies serialization.
Instead, axis objects are chosen at run time and stored in a `std::variant`.
With a careful design, this still results in excellent performance.
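
As a minimal sketch of this run-time choice, the following uses two hypothetical axis types stored in a `std::variant` and dispatches with `std::visit`. The names `RegularAxisSketch`, `VariableBinAxisSketch`, and `FindBin` are assumptions made for illustration; they are not the classes of this package.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <variant>
#include <vector>

// Hypothetical axis with equidistant bins in [fLow, fHigh).
struct RegularAxisSketch {
   std::size_t fNBins;
   double fLow;
   double fHigh;
   // Returns the normal bin index; this sketch does not handle flow bins.
   std::size_t FindBin(double x) const
   {
      assert(x >= fLow && x < fHigh);
      return static_cast<std::size_t>((x - fLow) / (fHigh - fLow) * fNBins);
   }
};

// Hypothetical axis with explicit, sorted bin edges (number of bins + 1 entries).
struct VariableBinAxisSketch {
   std::vector<double> fEdges;
   std::size_t FindBin(double x) const
   {
      assert(x >= fEdges.front() && x < fEdges.back());
      // First edge strictly greater than x; the bin is the one before it.
      auto it = std::upper_bound(fEdges.begin(), fEdges.end(), x);
      return static_cast<std::size_t>(it - fEdges.begin()) - 1;
   }
};

// The axis type is a run-time choice; code using it is not templated on it.
using AxisVariantSketch = std::variant<RegularAxisSketch, VariableBinAxisSketch>;

std::size_t FindBin(const AxisVariantSketch &axis, double x)
{
   return std::visit([x](const auto &a) { return a.FindBin(x); }, axis);
}
```

A container such as `std::vector<AxisVariantSketch>` can then hold the axes of all dimensions with a single type.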

## Performance Optimizations

If required, it would be possible to template performance-critical functions on the axis types.
This was shown to be beneficial in microbenchmarks for one-dimensional histograms.
However, it will not be implemented until it is shown to be useful in a real-world application.
In virtually all cases, the cost of filling a (one-dimensional) histogram is negligible compared to reading, decompressing, and processing the data.

The same applies to other optimizations, such as caching the pointer to the axis object stored in the `std::variant`.
Such optimizations should only be implemented when carefully motivated by real-world applications.

## Functions with Variable Number of Arguments

Many member functions have two overloads: one accepting a function parameter pack and one accepting a `std::tuple` or `std::array`.

### Arguments with Different Types

Functions that take arguments with different types expect a `std::tuple`.
An example is `template <typename... A> void Fill(const std::tuple<A...> &args)`.

For user convenience, a variadic function template forwards to the `std::tuple` overload:
```cpp
template <typename... A> void Fill(const A &...args) {
   Fill(std::forward_as_tuple(args...));
}
```
This forwards the arguments as references, so no potentially expensive copy constructors are called.
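
The forwarding behavior can be checked with a small standalone sketch. `CopyCounter` and `CountArgs` are hypothetical names introduced only for this illustration; `CountArgs` stands in for a function such as `Fill` and simply returns the number of arguments it received.

```cpp
#include <cstddef>
#include <tuple>

// Hypothetical argument type that counts how often it is copied.
struct CopyCounter {
   static inline int fgCopies = 0;
   CopyCounter() = default;
   CopyCounter(const CopyCounter &) { ++fgCopies; }
};

// Stand-in for the std::tuple overload; returns the number of tuple elements.
template <typename... A>
std::size_t CountArgs(const std::tuple<A...> &)
{
   return sizeof...(A);
}

// Variadic convenience overload: builds a tuple of references, so no argument is copied.
template <typename... A>
std::size_t CountArgs(const A &...args)
{
   return CountArgs(std::forward_as_tuple(args...));
}
```

Calling `CountArgs(c, 1, 2.5)` with a `CopyCounter c` reaches the `std::tuple` overload with three elements while `CopyCounter::fgCopies` stays at zero.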

### Arguments with Same Type

In this case, the function has a `std::size_t N` template argument and accepts a `std::array`.
An example is `template <std::size_t N> const T &GetBinContent(const std::array<RBinIndex, N> &args)`.

For user convenience, a variadic function template forwards to the `std::array` overload:
```cpp
template <typename... A> const T &GetBinContent(const A &...args) {
   std::array<RBinIndex, sizeof...(A)> a{args...};
   return GetBinContent(a);
}
```
This will copy the arguments, which is fine in this case because `RBinIndex` is small (see below).
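
A self-contained sketch of this pattern is shown below. `BinIndexSketch` is a hypothetical stand-in for `RBinIndex`, and `Grid2DSketch` with its fixed 3x4 row-major storage is an assumption made only to have something to index; neither reflects the actual classes.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for RBinIndex: a small value type wrapping an integer.
struct BinIndexSketch {
   std::size_t fIndex;
   BinIndexSketch(std::size_t index) : fIndex(index) {}
};

// Hypothetical two-dimensional storage with 3 x 4 bins in row-major order.
class Grid2DSketch {
   static constexpr std::size_t kNx = 3;
   static constexpr std::size_t kNy = 4;
   std::array<int, kNx * kNy> fContents{};

public:
   // Overload taking all indices in one std::array.
   template <std::size_t N>
   int &Content(const std::array<BinIndexSketch, N> &indices)
   {
      static_assert(N == 2, "this sketch is two-dimensional");
      return fContents[indices[0].fIndex * kNy + indices[1].fIndex];
   }

   // Variadic convenience overload: copies the (small) indices into an array.
   template <typename... A>
   int &Content(const A &...args)
   {
      std::array<BinIndexSketch, sizeof...(A)> a{args...};
      return Content(a);
   }
};
```

Both call styles reach the same bin: `grid.Content(BinIndexSketch{1}, BinIndexSketch{2})` and `grid.Content(idx)` with a `std::array<BinIndexSketch, 2> idx`.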

### Special Arguments

Special arguments are passed last.
Examples include
```cpp
template <typename... A> void Fill(const std::tuple<A...> &args, RWeight w);
template <std::size_t N> void SetBinContent(const std::array<RBinIndex, N> &args, const T &content);
```
The same works for the variadic function templates, which check the type of the last argument.
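
One possible way to inspect the last argument of a parameter pack is sketched below with `std::tuple_element_t` and `if constexpr`. `WeightSketch`, `LastType`, and `ExtractWeight` are hypothetical names; the actual implementation may perform this check differently.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>

// Hypothetical stand-in for RWeight.
struct WeightSketch {
   double fValue;
};

// Type of the last element of a non-empty parameter pack.
template <typename... A>
using LastType = std::tuple_element_t<sizeof...(A) - 1, std::tuple<A...>>;

// Returns the weight passed as last argument, or 1 if the last argument is not a weight.
template <typename... A>
double ExtractWeight(const A &...args)
{
   static_assert(sizeof...(A) > 0, "need at least one argument");
   if constexpr (std::is_same_v<LastType<A...>, WeightSketch>) {
      auto t = std::forward_as_tuple(args...);
      return std::get<sizeof...(A) - 1>(t).fValue;
   } else {
      return 1.0;
   }
}
```

Because the decision is made with `if constexpr`, the branch is selected at compile time and calls without a weight carry no extra run-time check.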

For profiles, we accept the value with a template type as well to allow automatic conversion to `double`, for example from `int`.

## Miscellaneous

The implementation uses standard [C++17](https://en.cppreference.com/w/cpp/17.html):
* No backports from later C++ versions, such as `std::span`, and
* No ROOT types, to make sure the histogram package can be compiled standalone.

Small objects are passed by value instead of by reference (`RBinIndex`, `RWeight`).
54 changes: 54 additions & 0 deletions hist/histv7/doc/Terminology.md
@@ -0,0 +1,54 @@
# Histogram Terminology

This document collects, defines, and explains terms that are used in ROOT's histogram package.
The goal is to start from a common understanding, which should avoid ambiguities and ease discussions.
It also helps (future) developers to navigate the code because classes and methods are named accordingly.
The list is ordered alphabetically, though dependent terms are kept together with their parent.
It is supposed to be exhaustive; any missing term should be added when needed.

An *axis* is a bin configuration in one dimension.
A *regular axis* has equidistant bins in the interval $[a, b)$.
A *variable bin axis* is configured with explicit bin edges $[e_{n}, e_{n+1})$.
A *categorical axis* has a unique label per bin.
*Axes* is the plural of axis and usually means the bin configurations for all dimensions of a histogram.

A *bin content* is the value of a single bin.
The *bin content type* can be an integer type, a floating-point type, the special `RDoubleBinWithError`, or a user-defined type.

A *bin error* is the Poisson error of a bin content.
With the special `RDoubleBinWithError`, it is the square root of the sum of weights squared: $\sqrt{\sum w_i^2}$.
Otherwise it is the square root of the bin content, which is only correct with unweighted filling.
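
The two definitions can be compared in a short sketch. `BinWithErrorSketch` is a hypothetical stand-in that only tracks the two sums; its members are assumptions and do not describe the layout of `RDoubleBinWithError`.

```cpp
#include <cmath>

// Hypothetical bin that tracks the sum of weights and the sum of weights squared.
struct BinWithErrorSketch {
   double fSum = 0;  // sum of weights, i.e. the bin content
   double fSum2 = 0; // sum of weights squared

   void Add(double w)
   {
      fSum += w;
      fSum2 += w * w;
   }

   // Bin error as the square root of the sum of weights squared.
   double Error() const { return std::sqrt(fSum2); }
};

// Bin error for plain content types: the square root of the bin content.
inline double PoissonError(double content)
{
   return std::sqrt(content);
}
```

For four unweighted entries both definitions give 2. For two entries with weights 3 and 4, the first gives $\sqrt{9 + 16} = 5$ while the second gives $\sqrt{7}$, which illustrates why the square root of the content is only correct for unweighted filling.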

A *bin index* (plural *indices*) refers to a single bin of a dimension; an array of indices refers to a bin in a histogram.
A *normal bin* is inside an axis and its index starts from 0.
*Underflow* and *overflow* bins, also called *flow bins*, are outside the axis and their indices have special values.
The *invalid bin index* is another special value.

A *bin index range* is a range from `begin` (inclusive) to `end` (exclusive).
For this purpose, the underflow bin is ordered before all normal bins while the overflow bin is placed after.
As the `end` is exclusive, the invalid bin index is ordered last to make it possible to include the overflow bin.

*Filling* a histogram means to add an entry to a histogram.
*Concurrent filling* allows the same histogram to be modified without (external) synchronization.

A *histogram* is the combination of an axes configuration and storage of bin contents.
For most use cases, it also includes (global) *histogram statistics*.
On the one hand, these are the number of entries, the sum of weights, and the sum of weights squared.
The number of *effective entries* can be computed as the ratio $\frac{(\sum w_i)^2}{\sum w_i^2}$.
Furthermore, for each dimension the histogram statistics include the sum of weights times value and the sum of weights times value squared.
This allows computing the arithmetic mean and the standard deviation of the values before binning.
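
A sketch of these statistics for one dimension is given below. `StatsSketch` and its members are hypothetical names, and the standard deviation is computed with the weighted population formula $\sqrt{\sum w_i x_i^2 / \sum w_i - \bar{x}^2}$, which is an assumption about the intended definition.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical global statistics for a one-dimensional histogram.
struct StatsSketch {
   std::size_t fNEntries = 0;
   double fSumW = 0;   // sum of weights
   double fSumW2 = 0;  // sum of weights squared
   double fSumWX = 0;  // sum of weights times value
   double fSumWX2 = 0; // sum of weights times value squared

   void Fill(double x, double w = 1.0)
   {
      fNEntries++;
      fSumW += w;
      fSumW2 += w * w;
      fSumWX += w * x;
      fSumWX2 += w * x * x;
   }

   double EffectiveEntries() const { return fSumW * fSumW / fSumW2; }
   double Mean() const { return fSumWX / fSumW; }
   double StdDev() const
   {
      const double mean = Mean();
      return std::sqrt(fSumWX2 / fSumW - mean * mean);
   }
};
```

With unweighted filling, the number of effective entries equals the number of entries. With weights 3 and 1, it is $16 / 10 = 1.6$.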

A *linearized index* runs from 0 up to, but not including, the total number of bins, potentially including flow bins.
For a single axis, it places the flow bins after the normal bins.
The *global index* is a combination of the linearized indices from all axes.
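
The sketch below illustrates one way these indices could fit together. The order of the two flow bins after the normal bins (underflow first, then overflow) and the row-major combination into a global index are assumptions; the text above only states that flow bins come after the normal bins. `AxisSizeSketch` and `GlobalIndex` are hypothetical names.

```cpp
#include <array>
#include <cstddef>

// Hypothetical description of one axis: number of normal bins and whether flow bins exist.
struct AxisSizeSketch {
   std::size_t fNNormalBins;
   bool fFlowBins;

   std::size_t TotalBins() const { return fNNormalBins + (fFlowBins ? 2 : 0); }
   // Normal bins keep their index; flow bins are placed after them (assumed order).
   std::size_t LinearizedUnderflow() const { return fNNormalBins; }
   std::size_t LinearizedOverflow() const { return fNNormalBins + 1; }
};

// Combines per-axis linearized indices into one global index (row-major, assumed).
template <std::size_t N>
std::size_t GlobalIndex(const std::array<AxisSizeSketch, N> &axes, const std::array<std::size_t, N> &linearized)
{
   std::size_t global = 0;
   for (std::size_t d = 0; d < N; d++) {
      global = global * axes[d].TotalBins() + linearized[d];
   }
   return global;
}
```

For a first axis with 3 normal bins plus flow bins and a second axis with 2 normal bins and no flow bins, there are $5 \cdot 2 = 10$ bins in total, and the global indices range from 0 to 9.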

A *profile* is a histogram that computes the arithmetic mean and standard deviation per bin.
During filling, it accepts an additional `double` value and accumulates its sum and sum of squares.
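
A minimal sketch of the per-bin accumulation, assuming unweighted filling and a bin that is already known: `Profile1DSketch` and `ProfileBinSketch` are hypothetical names. The value is accepted through a template parameter and converted to `double`, so that an `int` can be passed directly.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-bin accumulator of a profile.
struct ProfileBinSketch {
   std::size_t fN = 0; // number of entries in this bin
   double fSum = 0;    // sum of the values
   double fSum2 = 0;   // sum of the values squared
};

class Profile1DSketch {
   std::vector<ProfileBinSketch> fBins;

public:
   explicit Profile1DSketch(std::size_t nBins) : fBins(nBins) {}

   // The value type is a template parameter to allow conversion to double.
   template <typename V>
   void Fill(std::size_t bin, const V &value)
   {
      const double v = static_cast<double>(value);
      ProfileBinSketch &b = fBins.at(bin);
      b.fN++;
      b.fSum += v;
      b.fSum2 += v * v;
   }

   double Mean(std::size_t bin) const
   {
      const ProfileBinSketch &b = fBins.at(bin);
      return b.fSum / b.fN;
   }

   double StdDev(std::size_t bin) const
   {
      const ProfileBinSketch &b = fBins.at(bin);
      const double mean = b.fSum / b.fN;
      return std::sqrt(b.fSum2 / b.fN - mean * mean);
   }
};
```

Filling bin 0 with the values 2 and 4 yields a mean of 3 and a standard deviation of 1.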

*Slicing* means to extract a subset of the normal bins in each dimension.
Bin contents of excluded normal bins are added to the flow bins.
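
For one dimension, slicing can be sketched as follows. The sketch assumes that normal bins below the kept range are added to the underflow bin and those above it to the overflow bin; `Hist1DSketch` and `Slice` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical one-dimensional histogram with separate flow bins.
struct Hist1DSketch {
   std::vector<double> fNormal; // contents of the normal bins
   double fUnderflow = 0;
   double fOverflow = 0;
};

// Keeps the normal bins in [begin, end) and moves the excluded contents to the flow bins.
Hist1DSketch Slice(const Hist1DSketch &h, std::size_t begin, std::size_t end)
{
   assert(begin <= end && end <= h.fNormal.size());
   Hist1DSketch sliced;
   sliced.fUnderflow = h.fUnderflow;
   sliced.fOverflow = h.fOverflow;
   for (std::size_t i = 0; i < h.fNormal.size(); i++) {
      if (i < begin) {
         sliced.fUnderflow += h.fNormal[i];
      } else if (i >= end) {
         sliced.fOverflow += h.fNormal[i];
      } else {
         sliced.fNormal.push_back(h.fNormal[i]);
      }
   }
   return sliced;
}
```

The sum over all bins, including flow bins, is the same before and after slicing.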

A *snapshot* is a consistent clone of the histogram during concurrent filling.

A *weight* is an optional floating-point value passed during filling.
It defaults to $1$ if not specified, which is also called unweighted filling.