Is there support for online training, and is it a full update for the embedding of the DLRM model or a little bit of an update? #648


Open
freshduer opened this issue Feb 14, 2025 · 5 comments


@freshduer

Is there support for online training, and is it a full update for the embedding of the DLRM model or a little bit of an update?

@MaxiBoether
Contributor

Hey!

Thanks very much for your interest in Modyn. If I understand your question correctly, you are asking whether we support online training of a DLRM model. If so, yes, this is precisely (one of the use cases) what Modyn is built for. We call it model training on growing datasets, and think of it in terms of finetuning jobs in chunks. For an introduction to our ML pipeline model, I'd recommend checking out at least the first sections of the Modyn paper.

We ran DLRM training for the SIGMOD paper, however only from a throughput perspective, not an accuracy one. We use the DLRM implementation that NVIDIA provides; you can find our version here. I am not sure what you mean by "a little bit of an update". Modyn itself is an execution engine, so to speak. You can configure your pipelines to your requirements, i.e., you can define what finetuning means for you. If you could elaborate on what you want to achieve, I am more than happy to go into more detail.

I think a good first step would be reading the modeling and design sections of the paper to get an understanding of how Modyn operates.

@freshduer
Author


Thanks for the reply. Our school environment does not allow a Docker environment; can Modyn be run with only an Anaconda env?

@freshduer
Author



I have tested it in an Anaconda env and it works, but it needs some modifications in the CMakeLists.

@freshduer
Author


I have read the paper, but I am curious about the model storage part. DLRM online training systems like EasyRec and DeepRec support incremental updates to embedding tables without interrupting model training. Does Modyn support this, or does it just load the entire model and embedding tables into the GPU whenever new data comes in?
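For readers unfamiliar with the term, here is a minimal sketch of what "incremental" embedding updates mean in systems like EasyRec/DeepRec: only the embedding rows referenced by the current batch are modified, rather than rewriting the whole table. This is an illustrative toy, not Modyn, EasyRec, or DeepRec code; the table layout and `incremental_update` helper are invented for the example.

```python
# Illustrative (non-Modyn) example of an incremental embedding update:
# an embedding table maps ids to vectors, and a sparse/incremental update
# only touches the rows that appeared in the current batch.

table = {i: [0.0, 0.0] for i in range(5)}   # tiny 5-row embedding table

def incremental_update(table, batch_ids, grad=0.1):
    """Update only the rows referenced by the batch (sparse update)."""
    for i in set(batch_ids):                # deduplicate ids in the batch
        table[i] = [w + grad for w in table[i]]

incremental_update(table, batch_ids=[1, 3, 3])
# Only rows 1 and 3 changed; rows 0, 2, and 4 are untouched.
```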

@MaxiBoether
Contributor

Hey @freshduer,

you have to think about the model training process in chunks. This is what we tried to formalize using the triggers and trigger training sets.

You start with an empty (random) model. Then you collect N datapoints and run a regular training (forward+backward); this generates model 1. You continue to collect M new datapoints. Depending on your training config (use_previous_model), the trainer server now loads the previously trained model and continues to train it on that data, giving you model 2. This is a continuous process, but every training is started by a trigger.
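The chunked process described above can be sketched as a simple loop. This is a hypothetical illustration of the idea, not Modyn's actual API: `train`, `run_pipeline`, and the trigger condition are invented stand-ins; only the `use_previous_model` flag is named after the config option mentioned above.

```python
# Hypothetical sketch of trigger-driven training on a growing dataset.
# A trigger fires after every `trigger_every` new datapoints, starting a
# finetuning job on that chunk.

def train(model, data):
    """Stand-in for a regular forward+backward training pass.
    Here we just record the size of each chunk the model has seen."""
    return {"trained_on": model["trained_on"] + [len(data)]}

def run_pipeline(stream, trigger_every, use_previous_model=True):
    model = {"trained_on": []}       # model 0: empty/random weights
    buffer, models = [], []
    for point in stream:
        buffer.append(point)
        if len(buffer) >= trigger_every:          # a trigger fires
            start = model if use_previous_model else {"trained_on": []}
            model = train(start, buffer)          # finetune on the new chunk
            models.append(model)
            buffer = []                           # next chunk starts fresh
    return models

# Six datapoints with a trigger every three points yield two trainings;
# with use_previous_model=True, model 2 builds on model 1.
models = run_pipeline(range(6), trigger_every=3)
```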

The model storage that you mention defines how the model gets persisted to disk (e.g., only the diff, or compressed, etc.). That is completely orthogonal to how you train the model.
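To make the "only the diff" storage idea concrete, here is a toy sketch of diff-based persistence: store only the parameters that changed relative to the previous checkpoint, then reconstruct the full model on load. This is an assumption-laden illustration of the general technique, not Modyn's model storage implementation; `make_diff` and `apply_diff` are invented names.

```python
# Hypothetical sketch of diff-based model storage: persist only the
# parameters that changed since the last checkpoint, not the full model.

def make_diff(prev, curr):
    """Return only the entries of `curr` that differ from `prev`."""
    return {k: v for k, v in curr.items() if prev.get(k) != v}

def apply_diff(prev, diff):
    """Reconstruct the full model from the previous checkpoint plus the diff."""
    merged = dict(prev)
    merged.update(diff)
    return merged

model_1 = {"w1": 0.5, "w2": 1.0, "emb_row_3": 0.2}
model_2 = {"w1": 0.5, "w2": 1.1, "emb_row_3": 0.3}

diff = make_diff(model_1, model_2)   # only w2 and emb_row_3 are stored
```

Note that this only changes how checkpoints are persisted; as stated above, it is orthogonal to how the model is trained.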
