I've found that the calculations for constructing graphs can be a CPU bottleneck during training (especially when few CPUs are available). It would be very useful if the graphs could be constructed ahead of time, before training starts, to streamline dataloading. I suspect this will be even more of an issue for complicated graph encodings.
This could be a step after the data processing step (e.g., i3 files to parquet) and would depend on the graph definition.
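A rough sketch of what such a precompute step could look like, purely illustrative: `build_graph` stands in for whatever callable the graph definition exposes (a hypothetical name, not graphnet API), and each event's graph is serialized with `torch.save` so the training-time `Dataset` only has to read it back.

```python
from pathlib import Path

import torch
from torch.utils.data import Dataset


def precompute_graphs(events, build_graph, out_dir):
    """Build each event's graph once, ahead of training, and write it to disk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, event in enumerate(events):
        graph = build_graph(event)  # the CPU-heavy step, paid once here
        torch.save(graph, out / f"graph_{i:08d}.pt")


class PrecomputedGraphDataset(Dataset):
    """Serves precomputed graphs; no graph construction at training time."""

    def __init__(self, graph_dir):
        self._files = sorted(Path(graph_dir).glob("graph_*.pt"))

    def __len__(self):
        return len(self._files)

    def __getitem__(self, idx):
        # On torch >= 2.6 you may need torch.load(..., weights_only=False)
        # to deserialize graph objects.
        return torch.load(self._files[idx])
```

The obvious trade-off, discussed below, is that the written files are then tied to one specific graph definition.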
The dataloading (especially with SQLite) is quite sensitive to disk speed, more so than it should be. Some Slurm systems with networked drives are particularly affected. That is something I've wanted to look into for a while. Could that be the case for you?
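One way to check is to time a few hundred batches of pure dataloading at different worker counts. This is a generic sketch, not graphnet code, and `my_dataset` is a placeholder for whatever Dataset you actually train on. If throughput barely improves as workers are added, the bottleneck is more likely disk or database latency than graph-building CPU time.

```python
import time

from torch.utils.data import DataLoader


def identity_collate(batch):
    # Skip collation so we time only dataset reads + graph building.
    return batch


def time_batches(dataset, num_workers, n_batches=200, batch_size=128):
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        collate_fn=identity_collate,
    )
    start = time.perf_counter()
    for i, _ in enumerate(loader):
        if i >= n_batches:
            break
    return time.perf_counter() - start


# `my_dataset` is a placeholder for your training Dataset.
for workers in (0, 2, 4, 8):
    print(f"{workers} workers: {time_batches(my_dataset, workers):.1f}s")
```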
The topic of writing specific realizations of data representations to files has been brought up quite a few times over the past years. I'll summarize my own opinion here:
The possible benefit of this is to decrease the CPU-based computational overhead during training, but at the cost of flexibility in the data files, since the files then contain one specific realization of the data representation. In cases where that flexibility is not valued highly, i.e., if you are very sure about which representation you need and the representation is costly to compute, this functionality could make sense. However, I do foresee some problems:
1. A prohibitively slow data representation might indicate inefficiencies in the representation itself, or in the code we provide around it. Some folks might choose to write the realization to files instead of fixing the root cause.
2. It is unclear how the existing data backends in graphnet could be used to store arbitrary data representations. We would want to avoid a zoo of representation-specific file formats.
I personally think of the computational cost of the data representation as "real": you'll be paying that price at inference regardless. So if creating the representation is prohibitively expensive during training, that could be read as a sign that the representation itself is prohibitively expensive.