I did another iteration of this. Currently, running LLaMA 7B with the params on the GPU requires 16 GiB of memory. Params on the CPU + lazy transfers require 15.12 GiB, which is an almost negligible saving, and given that it adds something like 4x inference latency, I don't think it's worth mentioning anymore. Side note: lazy transfers don't really change anything here, which is what I would expect, since generation loops over the whole model and therefore all params need to be on the GPU. I'm not sure how keeping the params off the GPU makes any difference at all, since they can't be garbage collected early either, but the difference is very tiny anyway.
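For reference, here is a minimal sketch of the two placements being compared, written with plain PyTorch forward pre-hooks (the actual lazy-transfer mechanism in this PR may work differently); the small `nn.Sequential` is a hypothetical stand-in for the 7B model:

```python
import torch
import torch.nn as nn


def lazy_to_gpu(model: nn.Module, device: torch.device) -> nn.Module:
    """Keep params on the CPU and transfer each submodule to `device` on first use."""

    def pre_hook(mod, args):
        mod.to(device)  # effectively a no-op once the submodule already lives on `device`
        return tuple(a.to(device) if torch.is_tensor(a) else a for a in args)

    for sub in model.children():
        sub.register_forward_pre_hook(pre_hook)
    return model


# Hypothetical stand-in for LLaMA 7B.
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))

if torch.cuda.is_available():
    # Variant A: params resident on the GPU up front (the 16 GiB number above).
    # model = model.to("cuda")

    # Variant B: params stay on the CPU and are transferred lazily (15.12 GiB, ~4x slower).
    # Generation loops over the whole model, so every param ends up on the GPU anyway.
    model = lazy_to_gpu(model, torch.device("cuda"))

with torch.no_grad():
    out = model(torch.randn(1, 4096))
```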
Note that for Stable Diffusion, params on the CPU + lazy transfers have more impact, because the pipeline uses several models: once one model finishes, its params can be garbage collected and the next model's params can be loaded lazily, so there it does make sense.
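Rough illustration of that hand-off, again with hypothetical stand-in modules in plain PyTorch (the real pipeline has a text encoder, a UNet, and a VAE decoder, and this PR's machinery may differ); only one stage's params occupy the GPU at a time:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def run_stage(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Transfer a stage's params to the GPU only when it runs, then release them."""
    model.to(device)        # lazy transfer: params hit the GPU only now
    with torch.no_grad():
        out = model(x.to(device))
    model.to("cpu")         # this stage's GPU copies can now be freed
    torch.cuda.empty_cache()
    return out


# Hypothetical stand-ins for the pipeline stages.
text_encoder = nn.Linear(77, 768)
unet = nn.Linear(768, 768)
vae_decoder = nn.Linear(768, 3 * 64 * 64)

h = run_stage(text_encoder, torch.randn(1, 77))
h = run_stage(unet, h)
image = run_stage(vae_decoder, h)
```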
I also added an example with Mistral.