
Graph construction before training #781

Open
pweigel opened this issue Jan 21, 2025 · 1 comment
Labels
feature New feature or request

Comments

@pweigel (Collaborator) commented Jan 21, 2025

I've found that the calculations for constructing graphs can be a CPU bottleneck during training (especially when few CPUs are available). It would be very useful if the graphs could be constructed ahead of time, i.e. before training, to streamline dataloading. I suspect this will be even more of an issue for complicated graph encodings.

This could be a step after the data-processing step (e.g., converting i3 files to Parquet) and would depend on the graph definition. A rough sketch of the idea is below.
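A minimal sketch of what this could look like, purely illustrative: `graph_definition`, `events`, and the one-`.pt`-file-per-event cache layout are hypothetical stand-ins, not graphnet API, for whatever graph definition and event iterator the pipeline provides.

```python
# Hypothetical sketch: build each graph once, cache it to disk, and
# train from the cache. Names like `graph_definition` and `events`
# are placeholders, not graphnet API.
import os
import torch

def precompute_graphs(events, graph_definition, out_dir):
    """Run the expensive graph construction once, up front."""
    os.makedirs(out_dir, exist_ok=True)
    for i, event in enumerate(events):
        graph = graph_definition(event)  # the CPU-heavy step
        torch.save(graph, os.path.join(out_dir, f"graph_{i:08d}.pt"))

class PrecomputedGraphDataset(torch.utils.data.Dataset):
    """At training time, __getitem__ is reduced to a file read."""

    def __init__(self, cache_dir):
        self.paths = sorted(
            os.path.join(cache_dir, f)
            for f in os.listdir(cache_dir)
            if f.endswith(".pt")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # weights_only=False because the cached objects are full
        # graph objects (e.g. torch_geometric Data), not bare tensors.
        return torch.load(self.paths[idx], weights_only=False)
```

The trade-off, as discussed below, is that the cache is tied to one specific graph definition: changing the representation means regenerating all cached files.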

@pweigel added the feature New feature or request label Jan 21, 2025
@RasmusOrsoe (Collaborator) commented

Hey @pweigel

The dataloading (especially with SQLite) is quite sensitive to disk speed, sometimes overly so. Some Slurm systems with networked drives are particularly affected. That is something I've wanted to look into for a while. Could that be the case for you? A quick way to check is sketched below.
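One rough way to test this: time the DataLoader on its own, with no model in the loop, and compare a local scratch disk against the networked drive. The helper below is illustrative, not graphnet API; pass it the same loader you use for training.

```python
# Illustrative helper (not graphnet API): measure pure dataloading
# throughput by iterating the loader without running a model.
import time

def time_loader(loader, n_batches=100):
    start = time.perf_counter()
    n = 0
    for _batch in loader:
        n += 1
        if n >= n_batches:
            break
    elapsed = time.perf_counter() - start
    print(f"{n} batches in {elapsed:.2f} s ({n / elapsed:.1f} batches/s)")
```

If the batches-per-second figure changes substantially between local and networked storage, disk speed rather than graph construction is likely the bottleneck.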

The topic of writing specific realizations of data representations to files has come up quite a few times over the years. I'll summarize my own opinion here:

The possible benefit is reduced CPU overhead during training, but at the cost of flexibility in the data files, since the files now contain one specific realization of the data representation. In cases where that flexibility is not valued much - i.e., if you are very sure which representation you need, and that representation is costly to compute - this functionality could make sense. However, I do foresee some problems:

  1. Prohibitively slow construction of the data representation might indicate inefficiencies in the representation itself, or in the code we provide around it. Some folks might choose to write the realization to files instead of fixing the root cause.
  2. It is unclear how the existing data backends in graphnet could be applied to store arbitrary data representations. We would want to avoid a zoo of files in different, representation-specific formats.

I personally think of the computational cost of the data representation as "real" - you'll pay that price at inference regardless. So if constructing the representation is prohibitive during training, that could be read as a sign that the representation itself is prohibitively expensive.
