Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a PPU component #71

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 10 additions & 3 deletions src/extra/components/ppu.py
Original file line number Diff line number Diff line change
Expand Up @@ -155,22 +155,29 @@ def train_ids(
return train_ids

def trains(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you consider a different name for this method? Both in EXtra-data as well as across existing components, we use .trains() to return an iterator over trains. This method however takes something that has train IDs and changes it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Do you have something in mind?

  • filter? filter_trains?
  • train_selection? select? select_trains?
  • pick? pick_trains?
  • take? take_trains?

I'd probably go for pick_trains.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

select and select_trains also have different meanings already (though in the long run, I would love to pass this component to DataCollection.select_trains, i.e. have some kind of TrainSelector protocol)

From your list, I would probably also go for pick_trains, but it's unfortunate one way or the other that it's the "picking-thing" calling on the "thing-to-be-picked" 🤔
Maybe @takluyver can comment on what sounds most natural.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to pass this component to DataCollection.select_trains, i.e. have some kind of TrainSelector protocol

yes, me too!
But do we have an idea of what/how we want to select trains or how this protocol can look like? I can easily think on how to do that specifically for the PPU device, but I can't imagine other cases for now (maybe you have more use cases as part of the data reduction).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we invert the problem: Instead of having a method that selects trains on some object, have a property or method that returns the correspondig extra_data.by_id slice? Then selection becomes: run.select_trains(ppu.train_sel) (name obviously preliminary)

Whenever we decide on a universal TrainSelector protocol, we can simply add it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to something like ppu.train_sel. For the split-sequence option, we could make PPU iterable, giving some sort of TrainSequence or TrainGroup object so you could do [run.select_trains(seq.train_sel) for seq in ppu].

self, split_sequence: bool = False, offset: int = 0
self,
data: Union[DataCollection, SourceData, KeyData] = None,
*,
split_sequence: bool = False,
offset: int = 0,
) -> Union[DataCollection, List[DataCollection]]:
"""Returns a subset of the data only with Trains selected by the PPU.

Args:
data: Data to filter. If set to None (defaut) use the data used at initialization.
split_sequence (bool, optional): Split data per PPU trigger sequence. Defaults to False.
offset (int, optional): offset to apply to train IDs to be selected. Defaults to 0.

Returns:
Union[DataCollection, List[DataCollection]]:
DataCollection(s) containing only trains triggered by the PPU
"""
data = data or self.data
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to make this generic. Strictly speaking, the EXtra component is thus not a PPU component any more, but a "trainSelector" component. I wonder if there should be such class per se. I see that this addition to the trains() method makes the PPU more flexible, but it also seems a bit awkward to initialize by PPU device information, in order to then feed it with something else, except for comparison. In other words, I am not sure if a generic class with specific methods would be better than a specific class with a generic method by argument 🙃

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean here. The line you highlighted here is only to make it convenient to use with KeyData or SourceData. Otherwise one would need to first filter the DataCollection and then make source/keydata object. Here you can instantiate all your object at the start of your script and then filter trains for any object as required.

Copy link
Member Author

@tmichela tmichela Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ppu = PPU(run)
filtered_run = ppu.trains()
source = filtered_run['awesome_source']

vs

ppu = PPU(run)
source = run['awesome_source']
filtered_source = ppu.trains(source)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. My comment was not about that particular line in the first place, more about the way this method accepts other source data i.e. allowing to do train picking based on a pre-selected source. This may be more convenient indeed. The default filtering i.e. trains() method without argument is using the PPU device source, if one is actually interested in another source as in your first code block, it is more "natural" to do it the 2nd way.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, on a much higher level, I thought whether it would be good have a very generic "trainPicker" or "trainSelector" component in EXtra, that is, a class that could take PPU or any other source for filtering, already at instantiation level.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this comment is not affecting the PR, it is more an outlook to the future. (Maybe it does not make sense, because it lacks a real use case?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not, but do we have other filtering sources/methods at this moment?


train_ids = self.train_ids(labelled=True, offset=offset)
if split_sequence:
return [
self.data.select_trains(by_id[seq.values])
data.select_trains(by_id[seq.values])
for _, seq in train_ids.groupby(train_ids.index)
]
return self.data.select_trains(by_id[train_ids.values])
return data.select_trains(by_id[train_ids.values])