Is there support for online training, and is it a full update for the embedding of the DLRM model or a little bit of an update? #648


Open
freshduer opened this issue Feb 14, 2025 · 5 comments


@freshduer

Is there support for online training, and is it a full update for the embedding of the DLRM model or a little bit of an update?

@MaxiBoether
Contributor

Hey!

Thanks very much for your interest in Modyn. If I understand your question correctly, you are asking whether we support online training of a DLRM model. If so, yes, this is precisely (one of the use cases) what Modyn is built for. We call it model training on growing datasets, and think of it in terms of finetuning jobs in chunks. For an introduction to our ML pipeline model, I'd recommend checking out at least the first sections of the Modyn paper.

We ran DLRM training for the SIGMOD paper, however only from a throughput perspective, not an accuracy one. We use the DLRM implementation that NVIDIA provides; you can find our version here. I am not sure what you mean by "a little bit of an update". Modyn itself is an execution engine, so to speak. You can configure your pipelines to your requirements, i.e., you can define what finetuning means for you. If you could elaborate on what you want to achieve, I am more than happy to go into more detail.

I think a good first step would be reading the modeling and design sections of the paper to get an understanding of how Modyn operates.

@freshduer
Author


Thanks for the reply. Our school environment does not allow a Docker environment; can Modyn be run with only an Anaconda env?

@freshduer
Author



I have tested it in an Anaconda env and it works, but it needs some modifications in the CMakeLists.

@freshduer
Author


I have read the paper, but I am curious about the model storage part. DLRM online training systems like EasyRec and DeepRec support incremental updates to embedding tables without interrupting model training. Does Modyn support this, or does it just load the entire model and embedding tables into the GPU whenever new data comes in?
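For readers unfamiliar with the term, here is a minimal sketch of what "incremental" embedding updates mean in systems like EasyRec/DeepRec: only the embedding rows referenced by the current batch are modified, rather than rewriting the whole table. This is an illustrative toy, not Modyn, EasyRec, or DeepRec code; the table layout and `incremental_update` helper are invented for the example.

```python
# Illustrative (non-Modyn) example of an incremental embedding update:
# an embedding table maps ids to vectors, and a sparse/incremental update
# only touches the rows that appeared in the current batch.

table = {i: [0.0, 0.0] for i in range(5)}   # tiny 5-row embedding table

def incremental_update(table, batch_ids, grad=0.1):
    """Update only the rows referenced by the batch (sparse update)."""
    for i in set(batch_ids):                # deduplicate ids in the batch
        table[i] = [w + grad for w in table[i]]

incremental_update(table, batch_ids=[1, 3, 3])
# Only rows 1 and 3 changed; rows 0, 2, and 4 are untouched.
```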

@MaxiBoether
Contributor

Hey @freshduer,

you have to think about the model training process in chunks. This is what we tried to formalize using the triggers and trigger training sets.

You start with an empty (random) model. Then you collect N datapoints and run a regular training (forward+backward); this generates model 1. You continue to collect M new datapoints. Depending on your training config (use_previous_model), the trainer server now loads the previously trained model and continues to train it on that data, giving you model 2. This is a continuous process, but every training is started by a trigger.
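The chunked process described above can be sketched as a simple loop. This is a hypothetical illustration of the idea, not Modyn's actual API: `train`, `run_pipeline`, and the trigger condition are invented stand-ins; only the `use_previous_model` flag is named after the config option mentioned above.

```python
# Hypothetical sketch of trigger-driven training on a growing dataset.
# A trigger fires after every `trigger_every` new datapoints, starting a
# finetuning job on that chunk.

def train(model, data):
    """Stand-in for a regular forward+backward training pass.
    Here we just record the size of each chunk the model has seen."""
    return {"trained_on": model["trained_on"] + [len(data)]}

def run_pipeline(stream, trigger_every, use_previous_model=True):
    model = {"trained_on": []}       # model 0: empty/random weights
    buffer, models = [], []
    for point in stream:
        buffer.append(point)
        if len(buffer) >= trigger_every:          # a trigger fires
            start = model if use_previous_model else {"trained_on": []}
            model = train(start, buffer)          # finetune on the new chunk
            models.append(model)
            buffer = []                           # next chunk starts fresh
    return models

# Six datapoints with a trigger every three points yield two trainings;
# with use_previous_model=True, model 2 builds on model 1.
models = run_pipeline(range(6), trigger_every=3)
```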

The model storage that you mention defines how the model gets persisted to disk (e.g., only the diff, or compressed, etc.). That is completely orthogonal to how you train the model.
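To make the "only the diff" storage idea concrete, here is a toy sketch of diff-based persistence: store only the parameters that changed relative to the previous checkpoint, then reconstruct the full model on load. This is an assumption-laden illustration of the general technique, not Modyn's model storage implementation; `make_diff` and `apply_diff` are invented names.

```python
# Hypothetical sketch of diff-based model storage: persist only the
# parameters that changed since the last checkpoint, not the full model.

def make_diff(prev, curr):
    """Return only the entries of `curr` that differ from `prev`."""
    return {k: v for k, v in curr.items() if prev.get(k) != v}

def apply_diff(prev, diff):
    """Reconstruct the full model from the previous checkpoint plus the diff."""
    merged = dict(prev)
    merged.update(diff)
    return merged

model_1 = {"w1": 0.5, "w2": 1.0, "emb_row_3": 0.2}
model_2 = {"w1": 0.5, "w2": 1.1, "emb_row_3": 0.3}

diff = make_diff(model_1, model_2)   # only w2 and emb_row_3 are stored
```

Note that this only changes how checkpoints are persisted; as stated above, it is orthogonal to how the model is trained.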
