Order categories and sub-categories #355
-
Hi everyone,
-
Hi @JulesVandenbroeck, thanks for opening this discussion! I think there are two parts to this: one is related to "order", whereas the other has more to do with cf itself.
This is a decision made in "order". While subcategories indeed mostly use the selection / phase space defined by the parent category, there might be scenarios where this is not the case. These scenarios could be highly analysis- or simply use-case-dependent, and forcing the inheritance of selections within "order" would seem unreasonably strict. At the same time, inheriting selections manually is rather simple:

    import order as od

    cat_eq4j = od.Category(
        name="eq4j",
        id=1,
        # case 1: ROOT-style selection string
        selection="n_jets == 4",
        # case 2 (alternative): function accepting an awkward-style events array
        # selection=(lambda events: events.n_jets == 4),
    )
    cat_eq4j_eq1b = cat_eq4j.add_category(
        name=f"{cat_eq4j.name}_eq1b",
        id=2,
        # case 1: ROOT-style selection strings, joined with the parent selection
        selection=od.util.join_root_selection(cat_eq4j.selection, "n_btags == 1"),
        # case 2 (alternative): combine the boolean masks element-wise with "&"
        # selection=(lambda events: cat_eq4j.selection(events) & (events.n_btags == 1)),
    )

You're absolutely right that, strictly speaking, one does not have to define sub-categories within cf. However, I see one main advantage in using them. Following the example above, imagine you defined 6 "flat" (no parent/child structure) categories and you created some plots. It's obvious that evaluating the "cat_eq4j" selection statement is redundant in the first place. And after looking at the plots, you would like to see distributions for events with 4 jets and 2 or more b-tags. You could either define a separate "cat_eq4j_ge2b" category and run the histogramming step again (not good, since this is often a bottleneck), or you instruct the plotting to merge the histograms of the existing categories that make up this phase space.
This is exactly what nested categories do for you. One could start with a structure such as
graph TD
cat_eq4j --> cat_eq4j_eq0b
cat_eq4j --> cat_eq4j_eq1b
cat_eq4j --> cat_eq4j_eq2b
cat_eq4j --> cat_eq4j_eq3b
cat_eq4j --> cat_eq4j_eq4b
and create histograms / plots. cf creates histograms only for leaf categories (categories with no children), and the plotting can automatically merge them depending on the nesting structure. Then, staying in this example, you would add "cat_eq4j_ge2b" later on as
graph TD
cat_eq4j --> cat_eq4j_eq0b
cat_eq4j --> cat_eq4j_eq1b
cat_eq4j --> cat_eq4j_ge2b
cat_eq4j_ge2b --> cat_eq4j_eq2b
cat_eq4j_ge2b --> cat_eq4j_eq3b
cat_eq4j_ge2b --> cat_eq4j_eq4b
and you'd only have to rerun the plotting step, since the set of leaf categories did not change. Hope this helps!
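The merging step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not how cf implements it: the nesting is a plain dict, the histograms are toy lists of bin counts with made-up numbers, and `leaves` and `merged_hist` are hypothetical helper names.

```python
# Hypothetical nesting structure: parent name -> list of child names
# (this mirrors the second graph, after "cat_eq4j_ge2b" was added).
children = {
    "cat_eq4j": ["cat_eq4j_eq0b", "cat_eq4j_eq1b", "cat_eq4j_ge2b"],
    "cat_eq4j_ge2b": ["cat_eq4j_eq2b", "cat_eq4j_eq3b", "cat_eq4j_eq4b"],
}

# Toy histograms (3 bins each, made-up counts), filled only for leaf categories.
leaf_hists = {
    "cat_eq4j_eq0b": [5, 3, 1],
    "cat_eq4j_eq1b": [4, 2, 1],
    "cat_eq4j_eq2b": [3, 2, 0],
    "cat_eq4j_eq3b": [1, 1, 0],
    "cat_eq4j_eq4b": [1, 0, 0],
}


def leaves(name):
    """Return the names of all leaf categories below (or equal to) `name`."""
    kids = children.get(name, [])
    if not kids:
        return [name]
    return [leaf for kid in kids for leaf in leaves(kid)]


def merged_hist(name):
    """Sum, bin by bin, the leaf histograms that make up category `name`."""
    hists = [leaf_hists[leaf] for leaf in leaves(name)]
    return [sum(bins) for bins in zip(*hists)]


# "cat_eq4j_ge2b" was never histogrammed itself; it is built from its leaves.
hist_ge2b = merged_hist("cat_eq4j_ge2b")
hist_eq4j = merged_hist("cat_eq4j")
```

Changing the `children` dict changes which leaves are summed, but `leaf_hists` stays untouched, which is why only the plotting step would need to be rerun in this picture.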
-
Hi @JulesVandenbroeck, does this answer your question? If so, feel free to mark this discussion as answered and close it ;)
-
Hi @riga, I understand the argument of grouping subcategories so that you don't have to rerun the histogramming. But I think your efficiency argument should be weighed against an intuition argument, which to me has the upper hand in this case. I also have some reservations about the efficiency argument itself.
intuition
When you add the argument […] Also, if you haven't read the categorization manual, you'll likely just not realise this is the way it works. Maybe this is just me, but this is exactly what happened to me. I found out you could do subcategories and found this quite useful to organize myself. For instance, I set up a chain of selections: incl > one lepton > 2 jets > 1 btag. Then I plotted the inclusive plots, and only found out much later that columnflow always just showed me the category with one lepton and 2 jets and 2 btag.
I think the inheritance implied by the prefix "sub" cannot be so easily dismissed as "unnecessarily restrictive". I would be interested to know if you have any specific scenarios in mind where this inheritance would be undesirable. To me, a subcategory that actually isn't one should just be a separate category. Giving something a name that doesn't describe it well only makes things confusing.
efficiency
I feel like this grouping of categories after histogramming could be achieved by implementing a different class, e.g. CategoryCollection, whose only job is to refer to its subcategories.
Currently, columnflow evaluates, for each category, all the selections it requires separately. With inheritance, one could avoid this by applying the subcategory selections on top of the already calculated parent selection.
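The inheritance idea from the last paragraph could look roughly like the sketch below. It is purely illustrative and not how columnflow or order work: `InheritingCategory` is a hypothetical class, events are plain dicts instead of awkward arrays, and the cache is a simple dict keyed by category name.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InheritingCategory:
    """Hypothetical category whose mask is always combined with its parent's."""

    name: str
    selection: Callable[[dict], bool]  # per-event predicate, illustration only
    parent: Optional["InheritingCategory"] = None

    def mask(self, events, cache):
        # Evaluate each category at most once per set of events.
        if self.name not in cache:
            own = [self.selection(ev) for ev in events]
            if self.parent is not None:
                # Reuse the (possibly cached) parent mask and AND it in.
                parent_mask = self.parent.mask(events, cache)
                own = [p and o for p, o in zip(parent_mask, own)]
            cache[self.name] = own
        return cache[self.name]


# Toy events with made-up values.
events = [
    {"n_jets": 4, "n_btags": 1},
    {"n_jets": 4, "n_btags": 2},
    {"n_jets": 3, "n_btags": 1},
]

eq4j = InheritingCategory("eq4j", lambda ev: ev["n_jets"] == 4)
eq4j_eq1b = InheritingCategory("eq4j_eq1b", lambda ev: ev["n_btags"] == 1, parent=eq4j)
eq4j_eq2b = InheritingCategory("eq4j_eq2b", lambda ev: ev["n_btags"] == 2, parent=eq4j)

cache = {}
mask_eq1b = eq4j_eq1b.mask(events, cache)
mask_eq2b = eq4j_eq2b.mask(events, cache)  # the "eq4j" mask comes from the cache
```

In this sketch the parent selection is evaluated once and shared by both children; whether that saving matters in practice depends on how expensive the individual selections are.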
-
Hi @maadcoen, just to add my two cents on this from the side:
Intuition
I think this is more a discussion about the philosophy of how one sees categories and how to define phase spaces. If I understand you correctly, your approach is top-down, that is, you start from a big phase space and subdivide it. The approach we have chosen is bottom-up: we start with small building blocks and stack those to create larger phase spaces. Both approaches are equally valid, and since "intuition" is a rather subjective topic, it's hard to argue which of the two would be better.
Regarding the manual: Sorry, but it's there for a reason 😅 We are of course aware that it's not complete, which is a shortcoming on our side and still work in progress. However, the parts that are there (together with discussions such as this one) should definitely be considered when using columnflow. And since it clearly states that leaf categories need to be orthogonal, I think this argument of yours is not valid. Of course, if we can make this any clearer for the reader/user, we're also happy about feedback!
Efficiency
Since we build histograms for the smallest building blocks, we are able to use the […]
Your second point is again a question of philosophy, I think. If you wanted to create different larger phase spaces in your case, you would also need to rerun everything, since your top-most selection changes. Doing the categorization with the smallest building blocks allows for a flexible definition of large chunks of phase space without rerunning the chain over and over again - you can simply add histograms, which are fairly far down the line, and thus reduce the amount of computation you need to do again.
If you have additional comments and questions, let us know.
Cheers,
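The requirement that leaf categories be orthogonal matters because leaf histograms are added up: an event selected by two leaves would be counted twice in the merged result. A quick way to check this on a sample of events might look like the sketch below. It assumes per-event boolean masks are available as plain lists, and `overlapping_leaves` is a hypothetical helper, not part of columnflow.

```python
from itertools import combinations


def overlapping_leaves(leaf_masks):
    """Return pairs of leaf names that select at least one common event."""
    overlaps = []
    for (name_a, mask_a), (name_b, mask_b) in combinations(leaf_masks.items(), 2):
        if any(a and b for a, b in zip(mask_a, mask_b)):
            overlaps.append((name_a, name_b))
    return overlaps


# Toy b-tag multiplicities for five events (made-up values).
n_btags = [0, 1, 2, 3, 1]

# Orthogonal leaves: every event falls into exactly one of them.
orthogonal = {
    "eq0b": [n == 0 for n in n_btags],
    "eq1b": [n == 1 for n in n_btags],
    "ge2b": [n >= 2 for n in n_btags],
}

# Non-orthogonal leaves: events with 2 or more b-tags pass both selections.
non_orthogonal = {
    "ge1b": [n >= 1 for n in n_btags],
    "ge2b": [n >= 2 for n in n_btags],
}

clean = overlapping_leaves(orthogonal)
clashes = overlapping_leaves(non_orthogonal)
```

An empty result only shows that no overlap occurred in the events that were checked, so it is a sanity check rather than a proof of orthogonality.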