
Graph construction before training #781

Open
pweigel opened this issue Jan 21, 2025 · 1 comment
Labels
feature New feature or request

Comments

@pweigel (Collaborator) commented Jan 21, 2025

I've found that the calculations for constructing graphs can be a CPU bottleneck during training (especially when few CPUs are available). It would be very useful if the graphs could be constructed ahead of time, i.e. before training, to streamline dataloading. I suspect this will be even more of an issue for complicated graph encodings.

This could be a step after the data-processing step (e.g., converting i3 files to Parquet) and would depend on the graph definition. A rough sketch of the idea is below.
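A minimal sketch of what this could look like, purely illustrative: `graph_definition`, `events`, and the one-`.pt`-file-per-event cache layout are hypothetical stand-ins, not graphnet API, for whatever graph definition and event iterator the pipeline provides.

```python
# Hypothetical sketch: build each graph once, cache it to disk, and
# train from the cache. Names like `graph_definition` and `events`
# are placeholders, not graphnet API.
import os
import torch

def precompute_graphs(events, graph_definition, out_dir):
    """Run the expensive graph construction once, up front."""
    os.makedirs(out_dir, exist_ok=True)
    for i, event in enumerate(events):
        graph = graph_definition(event)  # the CPU-heavy step
        torch.save(graph, os.path.join(out_dir, f"graph_{i:08d}.pt"))

class PrecomputedGraphDataset(torch.utils.data.Dataset):
    """At training time, __getitem__ is reduced to a file read."""

    def __init__(self, cache_dir):
        self.paths = sorted(
            os.path.join(cache_dir, f)
            for f in os.listdir(cache_dir)
            if f.endswith(".pt")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # weights_only=False because the cached objects are full
        # graph objects (e.g. torch_geometric Data), not bare tensors.
        return torch.load(self.paths[idx], weights_only=False)
```

The trade-off, as discussed below, is that the cache is tied to one specific graph definition: changing the representation means regenerating all cached files.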

@pweigel added the feature New feature or request label Jan 21, 2025
@RasmusOrsoe (Collaborator) commented

Hey @pweigel

The dataloading (especially with SQLite) is quite sensitive to disk speed, sometimes overly so. Some Slurm systems with networked drives are particularly affected. That is something I've wanted to look into for a while. Could that be the case for you? A quick way to check is sketched below.
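One rough way to test this: time the DataLoader on its own, with no model in the loop, and compare a local scratch disk against the networked drive. The helper below is illustrative, not graphnet API; pass it the same loader you use for training.

```python
# Illustrative helper (not graphnet API): measure pure dataloading
# throughput by iterating the loader without running a model.
import time

def time_loader(loader, n_batches=100):
    start = time.perf_counter()
    n = 0
    for _batch in loader:
        n += 1
        if n >= n_batches:
            break
    elapsed = time.perf_counter() - start
    print(f"{n} batches in {elapsed:.2f} s ({n / elapsed:.1f} batches/s)")
```

If the batches-per-second figure changes substantially between local and networked storage, disk speed rather than graph construction is likely the bottleneck.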

The topic of writing specific realizations of data representations to files has come up quite a few times over the years. I'll summarize my own opinion here:

The possible benefit is reduced CPU overhead during training, but at the cost of flexibility in the data files, since the files now contain one specific realization of the data representation. In cases where that flexibility is not valued much - i.e., if you are very sure which representation you need, and that representation is costly to compute - this functionality could make sense. However, I do foresee some problems:

  1. Prohibitively slow construction of the data representation might indicate inefficiencies in the representation itself, or in the code we provide around it. Some folks might choose to write the realization to files instead of fixing the root cause.
  2. It is unclear how the existing data backends in graphnet could be applied to store arbitrary data representations. We would want to avoid a zoo of files in different, representation-specific formats.

I personally think of the computational cost of the data representation as "real" - you'll pay that price at inference regardless. So if constructing the representation is prohibitive during training, that could be read as a sign that the representation itself is prohibitively expensive.
