Order categories and sub-categories #355

JulesVandenbroeck · 2023-11-16T09:58:09Z

JulesVandenbroeck
Nov 16, 2023

Hi everyone,
I have been looking at how Order uses categories and sub-categories. According to the documentation, a sub-category does not inherit the selection of its parent category, which for me would be implied by the prefix "sub". Is there a clear case where sub-categories are more helpful than simply adding further categories?

riga · 2023-11-17T13:19:03Z

riga
Nov 17, 2023
Maintainer

Hi @JulesVandenbroeck ,

thanks for opening this discussion! I think there are two parts to this: one is "order" related, whereas the other one rather has to do with cf itself.

Sub-categories are not inheriting selections

This is a decision made in "order". While subcategories indeed mostly use the selection / phase-space defined by the parent category, there might be scenarios where this is not the case. These scenarios could be highly analysis- or simply use-case dependent and forcing the inheritance of selections within "order" would just seem unreasonably strict. Simultaneously, inheriting selections manually is rather simple:

import order as od

# note: the two cases are alternatives, pass only one "selection" argument

cat_eq4j = od.Category(
    name="eq4j",
    id=1,
    # case 1: ROOT-style selection strings
    selection="n_jets == 4",
    # case 2: function accepting awkward-style events array
    # selection=(lambda events: events.n_jets == 4),
)

cat_eq4j_eq1b = cat_eq4j.add_category(
    name=f"{cat_eq4j.name}_eq1b",
    id=2,
    # case 1: ROOT-style selection strings
    selection=od.util.join_root_selection(cat_eq4j.selection, "n_btags == 1"),
    # case 2: function accepting awkward-style events array
    # selection=(lambda events: cat_eq4j.selection(events) & (events.n_btags == 1)),
)
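As a rough illustration of case 2 that runs without order or awkward, the manual inheritance can be sketched with plain Python callables. The helper all_of and the dictionary-based toy events are hypothetical stand-ins and not part of order or cf; real code would combine boolean awkward masks with "&" as in the snippet above.

```python
from typing import Callable, Dict, List

Events = Dict[str, List[int]]  # toy stand-in for an awkward-style events array
Selection = Callable[[Events], List[bool]]


def all_of(*selections: Selection) -> Selection:
    """Return a selection that accepts events passing every given selection."""
    def combined(events: Events) -> List[bool]:
        masks = [sel(events) for sel in selections]
        return [all(flags) for flags in zip(*masks)]
    return combined


def eq4j(events: Events) -> List[bool]:
    return [n == 4 for n in events["n_jets"]]


def eq1b(events: Events) -> List[bool]:
    return [n == 1 for n in events["n_btags"]]


# the child selection explicitly reuses the parent selection
eq4j_eq1b = all_of(eq4j, eq1b)

events = {"n_jets": [4, 4, 3, 5], "n_btags": [1, 2, 1, 1]}
mask = eq4j_eq1b(events)
```

Only the first toy event has exactly 4 jets and exactly 1 b-tag, so it is the only one accepted by the combined selection.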

Why use sub-categories?

You're absolutely right that - strictly speaking - one does not have to define sub-categories within cf. However, I see one main advantage in using them. Following the example above, imagine you defined 6 "flat" (no parent/child structure) categories

cat_eq4j
cat_eq4j_eq0b
cat_eq4j_eq1b
cat_eq4j_eq2b
cat_eq4j_eq3b
cat_eq4j_eq4b

and you created some plots. Evaluating the "cat_eq4j" selection statement separately is redundant in the first place, since its events are already covered by the other five categories. And after looking at the plots, you would like to see distributions for events with 4 jets and 2 or more b-tags. You could either define a separate "cat_eq4j_ge2b" category and run the histogramming step again (not good since this is often a bottleneck), or you instruct the plotting to merge the cat_eq4j_eq{2,3,4}b categories. But then, you'd have to manually set a title for the merged category (and possibly change some other settings as well).

This is exactly what nested categories do for you. One could start with a structure such as

graph TD
  cat_eq4j --> cat_eq4j_eq0b
  cat_eq4j --> cat_eq4j_eq1b
  cat_eq4j --> cat_eq4j_eq2b
  cat_eq4j --> cat_eq4j_eq3b
  cat_eq4j --> cat_eq4j_eq4b

and create histograms / plots. cf creates histograms only for leaf categories (categories with no children), and the plotting can automatically merge them depending on the nesting structure. Then, staying in this example, you would add "cat_eq4j_ge2b" later on as

graph TD
  cat_eq4j --> cat_eq4j_eq0b
  cat_eq4j --> cat_eq4j_eq1b
  cat_eq4j --> cat_eq4j_ge2b
  cat_eq4j_ge2b --> cat_eq4j_eq2b
  cat_eq4j_ge2b --> cat_eq4j_eq3b
  cat_eq4j_ge2b --> cat_eq4j_eq4b

and you'd only have to rerun the plotting step, since the set of leaf categories did not change.
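The merging described above can be sketched in a few lines of plain Python. This is a toy model and not cf's actual implementation: the category tree is a dictionary, histograms are lists of bin counts, and the helpers leaves and merged are hypothetical names.

```python
from typing import Dict, List


def leaves(tree: Dict[str, List[str]], name: str) -> List[str]:
    """Return the leaf categories below `name` (or `name` itself if it has no children)."""
    children = tree.get(name, [])
    if not children:
        return [name]
    result: List[str] = []
    for child in children:
        result.extend(leaves(tree, child))
    return result


def merged(hists: Dict[str, List[int]], tree: Dict[str, List[str]], name: str) -> List[int]:
    """Sum the per-leaf histograms (toy bin counts) that belong to `name`."""
    selected = [hists[leaf] for leaf in leaves(tree, name)]
    return [sum(bins) for bins in zip(*selected)]


# toy histograms, filled once per leaf category
hists = {
    "cat_eq4j_eq0b": [5, 3],
    "cat_eq4j_eq1b": [4, 2],
    "cat_eq4j_eq2b": [3, 1],
    "cat_eq4j_eq3b": [2, 1],
    "cat_eq4j_eq4b": [1, 0],
}

# initial structure: all five categories directly below cat_eq4j
flat_tree = {"cat_eq4j": list(hists)}

# later: cat_eq4j_ge2b is inserted as an intermediate node, the leaves stay the same
nested_tree = {
    "cat_eq4j": ["cat_eq4j_eq0b", "cat_eq4j_eq1b", "cat_eq4j_ge2b"],
    "cat_eq4j_ge2b": ["cat_eq4j_eq2b", "cat_eq4j_eq3b", "cat_eq4j_eq4b"],
}

total = merged(hists, nested_tree, "cat_eq4j")
ge2b = merged(hists, nested_tree, "cat_eq4j_ge2b")
```

Because both trees have the same leaves, the stored histograms can be reused unchanged, and the new intermediate category is obtained purely by summing existing ones.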

Hope this helps!

0 replies

pkausw · 2023-11-24T12:06:27Z

pkausw
Nov 24, 2023
Maintainer

Hi @JulesVandenbroeck , does this answer your question? If so, feel free to mark this discussion as answered and close it ;)

1 reply

JulesVandenbroeck Nov 24, 2023
Author

Sorry for not closing the discussion, this indeed answered my question :)

maadcoen · 2024-08-01T10:01:54Z

maadcoen
Aug 1, 2024

Hi @riga

I understand the argument of grouping subcategories so that you don't have to rerun the histogramming. But I think your efficiency argument should be put against an intuition argument, which to me has the upper hand in this case. I also have some reservations about the efficiency argument itself.

intuition

of the plotting

When you add the argument --categories cat_eq4j, you expect it to take the selection for 4 jets. But actually, it takes the events of all the subcategories, cat_eq4j_eq0b, cat_eq4j_eq1b, ... That's fine as long as the subcategories really add up to the parent category, but if they don't, you're plotting something other than what you think you're plotting. It gets even worse if (some of) the subcategories have subcategories themselves, because then everything has to add up at multiple levels. This feels like unnecessary bookkeeping forced upon the user.

Also, if you haven't read the categorization manual, you'll likely just not realise this is the way it works. Maybe this is just me, but this is exactly what happened to me. I found out you could define subcategories and found this quite useful to organize myself. I made, for instance, a chain of selections incl > one lepton > 2 jets > 1 btag. Then I plotted the inclusive plots, and only found out much later that columnflow just always showed me the category with one lepton and 2 jets and 2 btag.

of the concept of subcategory

I think the implied inheritance from the prefix "sub" cannot be so easily discarded as "unnecessarily restrictive". I would be interested to know if you have any specific scenarios in mind where this inheritance would be undesirable. To me, a subcategory that actually isn't one should just be a separate category. Calling something by a name that doesn't describe it well only makes things confusing.

efficiency

of the histogramming

I feel like this grouping of categories after histogramming could be achieved by implementing a different class, e.g. CategoryCollection, whose only job is to refer to its subcategories.

of the categorization

Currently, columnflow will for each category evaluate all selections it requires separately. With inheritance, one could avoid this by adding the subcategory selections on top of the already calculated parent selection.

0 replies

pkausw · 2024-08-01T12:37:25Z

pkausw
Aug 1, 2024
Maintainer

Hi @maadcoen ,

just to add my two cents on this from the side:

Intuition

I think this is more a discussion about the philosophy of how one sees categories and how to define phase spaces. If I understand you correctly, your approach is top-down, that is, you start from a big phase space and start subdividing it. The approach we have chosen is bottom-up: we start with small building blocks and stack those to create larger phase spaces. Both approaches are equally valid, and since "intuition" is a rather subjective topic, it's hard to argue which of the two would be better.

Regarding the manual: Sorry, but it's there for a reason 😅 We are aware of course that it's not complete, which is a shortcoming on our side and still work in progress. However, the parts that are there (together with discussions such as this one) should definitely be considered when using columnflow. And since it clearly states that leaf categories need to be orthogonal, I think this argument of yours is not valid. Of course, if we can make this any clearer for the reader/user, we're also happy about feedback!

Efficiency

Since we build histograms for the smallest building blocks, we are able to use the sum operation that is built into Python. I would argue that using a built-in function is always more favorable in terms of efficiency than a custom class on top.

Your second point is again a question of philosophy I think. If you wanted to create different larger phase spaces in your case, you would also need to rerun everything since your top-most selection changes. Doing the categorization with the smallest building blocks allows for a flexible definition of large chunks of phase space without rerunning the chain over and over again. You can simply add histograms, which are fairly far down the line, and thus reduce the amount of computation you need to redo.

If you have additional comments and questions let us know.

Cheers,
Philip

2 replies

maadcoen Aug 1, 2024

Thanks for your take! I have some comments below, though. In general, you refer to philosophy, but to me this is also concrete: I think of my inclusive phase space, in which I want to investigate the 1 lepton subspace, in which I am interested in the 2 jet region and the 3 or more jet region etc. I'd be interested in analysis ideas that justify the current exclusive use of leaf categories without inheritance from parent categories.

intuition

My point is not that the manual should be unnecessary. I am happy to admit that not reading it is not the best idea. But I wanted to point out that if a decision is counterintuitive, explaining it in a manual is not a solution. The manual should be intuitive if that's possible. I think it's not uncommon that people are not very thorough in reading instructions before using something. That's why I consider it a bad idea to create something that works in a way that goes against common sense.

I agree that what's common sense, or what's intuitive, is vague and depends on the person. But I would say that, ultimately, they are determined by majority opinion. That's why I'd like some more discussion about this. Regarding the difference between bottom-up and top-down, I would say that an analysis naturally works top-down: don't you always start from a large phase space, from which you select an interesting subspace which you then divide into signal regions, control regions... ? Maybe you could say that at some point you start enlarging your phase space starting from the signal region (adding control regions). But still, how does that change the hierarchy argument?

efficiency

I didn't have in mind what you're referring to; I had the following in mind. When you specify --categories mycat, the plotting looks up all the leaves descending from mycat and then adds their histograms. I don't like this, because it only works if the descending categories cover the entire phase space of mycat. I would propose that any normal category simply slices the category axis of the histogram at its own ID. Then you could add a class CategoryCollection that instead slices the histograms at all the IDs of the categories it collects and then adds them.

Concretely, if you look at the task PlotVariables1D, you have the code below. It queries the category passed through

# lines 99-100
category_inst = self.config_inst.get_category(self.branch_data.category)
leaf_category_insts = category_inst.get_leaf_categories() or [category_inst]

and further down the line

# lines 161-172
# selections
h = h[{
    "category": [
        hist.loc(c.id)
        for c in leaf_category_insts
        if c.id in h.axes["category"]
    ],
    "shift": [
        hist.loc(s.id)
        for s in plot_shifts
        if s.id in h.axes["shift"]
    ],
}]

My suggestion would be to have a new class CategoryCollection. For a normal Category, one would always have that category_inst.get_leaf_categories() >> [category_inst], while for a CategoryCollection it would be category_inst.get_leaf_categories() >> [category_inst1, category_inst2, category_inst3, ...].

This also avoids the weird situation where a parent category can be defined with any selection function without it having an effect on the category it defines. More specifically, if I currently do the following

twojet = config.add_category("2j", selection="RANDOM_SELECTION", id=3, label="2 jets")
twojet.add_category("2j1b", selection=["exactly_two_jets", "exactly_one_bjet"])
twojet.add_category("2j2b", selection=["exactly_two_jets", "exactly_two_bjet"])

I could put anything in the RANDOM_SELECTION, it doesn't have any effect at all. The CategoryCollection I would propose wouldn't have this selection argument, avoiding such a situation.

My idea of how this all could work:

config.add_category("1mj", selection="one_or_less_jets", label="<= 1 jets", id=1)
threeplusjet = config.add_category("3pj", selection="three_or_more_jets", label=">= 3 jets", id=3)
twojet = config.add_category("2j", selection="exactly_two_jets", label="2 jets", id=2)
# 2j1b and 2j2b inherit the two jet selection.
# Also, when calculating the category ids, both reuse the result from the "exactly_two_jets" categorizer.
twojet.add_category("2j1b", selection="exactly_one_bjet", id=4)
twojet.add_category("2j2b", selection="exactly_two_bjet", id=5)

# I already made the histograms, which now have a category_id axis with bins 1 to 5.
# But now I want the category two or more jets, so I add the following
# (add_category_collection is part of the proposal and does not exist yet)
config.add_category_collection("p2j", categories=["2j", "3pj"], label=">= 2 jets")
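To make the proposed behavior concrete, here is a minimal standalone sketch. Everything in it is hypothetical: the toy Category and CategoryCollection classes, the ids_to_slice method and the yield_for helper are illustrations of the idea only and exist neither in order nor in columnflow.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Category:
    """Toy category: refers to exactly one bin of the category axis, its own id."""
    name: str
    id: int

    def ids_to_slice(self) -> List[int]:
        return [self.id]


@dataclass
class CategoryCollection:
    """Hypothetical collection: has no selection, it only references its members."""
    name: str
    members: List[Category] = field(default_factory=list)

    def ids_to_slice(self) -> List[int]:
        return [member.id for member in self.members]


def yield_for(counts_by_id: Dict[int, int], cat) -> int:
    """Sum the stored counts of all category bins that `cat` refers to."""
    return sum(counts_by_id[i] for i in cat.ids_to_slice())


twojet = Category("2j", 2)
threeplusjet = Category("3pj", 3)
p2j = CategoryCollection("p2j", [twojet, threeplusjet])

# toy event counts per category id, as they would be stored after histogramming
counts_by_id = {1: 50, 2: 30, 3: 20, 4: 12, 5: 8}

n_2j = yield_for(counts_by_id, twojet)
n_p2j = yield_for(counts_by_id, p2j)
```

In this sketch a normal category reads only its own bin, while the collection adds the bins of its members, so the collection can be defined after histogramming without refilling anything.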

mafrahm Aug 6, 2024
Maintainer

Hi @maadcoen, I totally agree that there are issues with the current approach of categorization since it can lead to accidentally removing phase space or double counting. If there are intuitive ways to fix these issues, I'm happy to include them in our framework. However, I feel like your proposal does not solve these issues.

Efficiency

The main idea behind our categorization implementation is being as flexible as possible with the least amount of information stored. In the context of histograms, this means that we need to store information in the smallest units because in that way we can always build larger categories using the smaller units (e.g. the 2j category is built by adding 2j1b and 2j2b). Doing this the other way around (e.g. building 2j1b from 2j and 1b) does not work since we cannot do an OR of two histogram bins.
Of course we could just store the information for each individual category (then we would not need to sum over categories at any point). This would be easy to implement and might be valid for analyses with few categories. However, if you include the full combinatorics of categories in your analysis (which can be done almost automatically with the create_category_combinations function), the number of categories rises drastically (e.g. in my case, having 3x2x2 categories leads to ~70 categories in total, but we only need 12 bins to build any of these categories).

Intuition

I feel like having flexible categorization is more valuable than intuition. If you find a more intuitive way with the same amount of functionality, you are free to open a pull request.

Issues

From my point of view, the main issues are

double counting accidentally
adding phase space cuts accidentally

I currently do not see an easy way to prevent double counting, but I also feel like it is easily avoided if you know what you're doing.

Accidental phase-space cuts (e.g. building 2j by adding 2j1b and 2j2b does not consider 2jge3b) are an issue I'd also like to solve. However, just storing the 2j category separately is a solution that I'm not happy with.
The most obvious solution from my point of view would be to add a "rest" category to each layer of sub-categories that includes all events that are otherwise not considered. Referring to your code snippet above, this might look like this:

twojet = config.add_category("2j", selection="exactly_two_jets", label="2 jets", id=2)
twojet.add_category("2j1b", selection=["exactly_two_jets", "exactly_one_bjet"], id=4)
twojet.add_category("2j2b", selection=["exactly_two_jets", "exactly_two_bjet"], id=5)

# category that includes all events from `2j` but that are not included from `2j1b` or `2j2b`
# NOTE that the "!" is just a suggestion in how to negate a Categorizer, this feature does not currently exist
twojet.add_category("2jrest", selection=["exactly_two_jets", "!exactly_two_bjet", "!exactly_one_bjet"], id=6)
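The logic of such a "rest" category can be sketched with plain boolean masks: an event belongs to the rest if it passes the parent selection but none of the sibling selections. This is only an illustration with toy per-event lists, and rest_mask is a hypothetical helper, not a columnflow function.

```python
from typing import List


def rest_mask(parent: List[bool], children: List[List[bool]]) -> List[bool]:
    """Events that pass the parent selection but none of the child selections."""
    return [
        in_parent and not any(child[i] for child in children)
        for i, in_parent in enumerate(parent)
    ]


# toy per-event quantities
n_jets = [2, 2, 2, 3, 2]
n_bjets = [1, 2, 0, 1, 0]

two_jets = [n == 2 for n in n_jets]
one_bjet = [j and b == 1 for j, b in zip(two_jets, n_bjets)]
two_bjets = [j and b == 2 for j, b in zip(two_jets, n_bjets)]

rest = rest_mask(two_jets, [one_bjet, two_bjets])

# with the rest category included, the leaves add up to the parent again
n_parent = sum(two_jets)
n_leaves = sum(one_bjet) + sum(two_bjets) + sum(rest)
```

Here the two events with 2 jets and no b-jet end up in the rest, so summing all three leaves reproduces the full 2 jet yield.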

Alternatively, if you do not want to rely on building parent categories from leafs, you can just add all your categories directly to the config:

# by adding each category separately, we will create category ids / histograms for each category separately
twojet = config.add_category("2j", selection="exactly_two_jets", label="2 jets", id=2)
config.add_category("2j1b", selection=["exactly_two_jets", "exactly_one_bjet"], id=4)
config.add_category("2j2b", selection=["exactly_two_jets", "exactly_two_bjet"], id=5)

Inheriting selections

I personally prefer the way it is implemented currently

you see exactly how each leaf category is built by just looking at the leaf category itself instead of having to walk up the full tree (which would probably also take longer)
the combination of selections is done automatically anyway when using the create_category_combinations function

CategoryCollection

I am not sure if this is a good idea

It's another type of meta data users will need to consider
We have limited capacities when it comes to implementing new features
This can already be done with the current categorization implementation

threeplusjet = od.Category("3pj", selection="three_or_more_jets", id=3, label=">= 3 jets")
twojet = od.Category("2j", selection="exactly_two_jets", label="2 jets", id=2)
p2j = config.add_category("p2j", label=">= 2 jets", categories=[twojet, threeplusjet])

TL;DR

flexible categorization is complicated
storing the leaf categories in histograms allows us to build parent categories by taking the sum of categories; building leafs from parents does not work on histogram level
our implementation has some issues and inconveniences that I'd like to solve, but solving these while maintaining the same amount of flexibility is difficult

This discussion has already been quite lengthy. If you want to further discuss this, I feel like it would be better to talk in person since this is a pretty complicated topic that is difficult to discuss via text.
