Enhance Model Loading By Providing Parallelism, Uses Optional Env Flag #36835

inf3rnus · 2025-03-19T18:41:25Z

What does this PR do?

modeling_utils.py has been modified to allow for the parallelized loading of weights. Two env vars have been introduced, HF_ENABLE_PARALLEL_LOADING which allows weights to be loaded in a multiprocessing pool and HF_PARALLEL_LOADING_WORKERS which specifies how many child processes you want to spawn to handle loading.

Only works for sharded torch based weights and safe tensors for now, although the same pattern can be extended to other weights by other contributors.

This can significantly speed up loading large models on big hardware, ~50%.

e.g. facebook/opt-30b on an AWS EC2 g4dn.metal can be made to load in ~30s with this modification vs ~55s without it.

Using napkin math, this increase in efficiency should equate to thousands of dollars monthly if not more in cold boot compute expenses being saved globally for anyone using HF on the cloud.

Please let me know if anything else needs to be changed to move this forward!

Before submitting

Did you read the contributor guideline,
Pull Request section?
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests? It reuses the existing modeling_utils.py tests, but enables what's required to run the processing pool.

@Rocketknight1 @gante

github-actions · 2025-03-19T18:41:37Z

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

…g_utils.py

ArthurZucker

Very interesting!
Indeed we do this "sequentially" today. I think using an even bigger checkpoint like a 100B model or 400B would be nice.

Great splitting, having a separate function for load_shard_file!

missing some API simplification:
multiprocessing.set_start_method("spawn", force=True) should be done by ourselves (because you should have access to the device you are loading on!)

Also needs to be guarded against TP (until stable / tested!)

ArthurZucker · 2025-03-20T09:28:29Z

src/transformers/modeling_utils.py

+        if json.loads(os.environ.get("HF_ENABLE_PARALLEL_LOADING", "false")):
+            num_workers = json.loads(os.environ.get("HF_PARALLEL_LOADING_WORKERS", "8"))
+            logger.info(f"Loading model weights in parallel with {num_workers} workers...")


this is nice BUT! I think we need some guards / good defaults:

depends on the number of shard files

should depend on the number of available threads
This will help us finding good sweetspots!

Fully agree on point 1, we should just min() that on len(args_list).

Second point, I'd argue is an enhancement, for now the benefits are so great we can leave it to the user until itr 2.

(I'll be cooking this up real soon!)

haha okay 🤗

ArthurZucker · 2025-03-20T09:31:14Z

src/transformers/modeling_utils.py

+                    splits = tensor_name.split(".")
+                    module = model_to_load
+
+                    for split in splits[:-1]:
+                        module = getattr(module, split)
+
+                    last_key = splits.pop()


we have a util for that, see the tensor_parallel integration that also needs to get the module from the key name 😉

Appreciate it! Will make the adjustment :)

ArthurZucker · 2025-03-20T09:34:24Z

tests/utils/test_modeling_utils_parallel_loading.py

to improve!

I think we can remove this file! I'll see to add some tests myeelf once this has been merged!

inf3rnus · 2025-03-20T23:01:52Z

Also, quick Q, is it normal for some of the CI tests to fail with the 4.50.0 branch, or is my code running amok?

I checked the outputs and I don't think it's my code.

Also, couldn't find a place where the model_utils.py run, I assumed anything with the test prefix would run...

Locally, wrt to just the tests for model_utils (including the parallel tests), they are passing, so I'm inclined to believe this code is stable, although it could certainly benefit from additional tests.

It's setup to run optionally until we suss out any remaining edge cases.

Also curious to hear what your thoughts are on how to improve the parallel tests, I can dream up some ideas, but I don't want to do anything that will result in waste.

inf3rnus · 2025-03-20T23:12:41Z

No longer needed, I've encapsulated this into the model loading parallelism if block

Any ideas on where we could put this?

missing some API simplification:
multiprocessing.set_start_method("spawn", force=True) should be done by ourselves (because you should have access to the device you are loading on!)

I could run some experiments, but if you know of a great location where env vars are reduced into global configuration settings, that's probably where we want to do it, if there's some place where a bunch of package wide initialization occurs.

…t" filesystem

…Y up the code by leveraging get_submodule.

…he multiprocessing start method are removed, now that the package handles this step internally.

inf3rnus · 2025-03-24T21:53:50Z

@ArthurZucker Alright, I've made the requested changes, pls lmk if anything else needs touching up!

Notable changes based on the feedback provided:

The multiprocessing start method (spawn) is now handled in a try finally in the if block that controls model parallelism.
I've incorporated the use of model.get_submodule() for updating the meta model after process workers load weights.
Docs have been updated to remove mention of the spawn method for multiprocessing now that it's handled internally.
Manually setting the start method to spawn in the model loading parallelism test file has been removed.

…odel-loading

…NG. Prefer simple casting.

ArthurZucker

Thanks!

ArthurZucker · 2025-05-01T14:52:35Z

src/transformers/utils/import_utils.py

+def is_true(value: Optional[str]) -> bool:
+    if value is None:
+        return False
+    return value.upper() in ENV_VARS_TRUE_VALUES
+


not a super fan of this, let's just do this:

os.environ.get("MY_VAR", "").upper() in ENV_VARS_TRUE_VALUES````

ArthurZucker · 2025-05-01T14:53:04Z

tests/utils/test_modeling_utils_parallel_loading.py

+
+
+# Declare the normal model_utils.py test as a sideffect of importing the module
+from .test_modeling_utils import ModelUtilsTest  # noqa


I don't think we need it to run on all models not even sure this tests it all?

…ABLE_PARALLEL_LOADING.

inf3rnus · 2025-05-07T20:39:31Z

@Rocketknight1 bumping for visibility 🙏

inf3rnus · 2025-05-13T21:56:59Z

Hey all, just checking in again, figure you guys are slammed. Very appreciative of the time you've spent reviewing this 🙏

inf3rnus · 2025-05-13T21:58:26Z

@Cyrilvallez Tagging you again for a bump, it's right at the finish line. Very much appreciate everyone's eyes on this so far

Cyrilvallez

ALright, sorry for the delay! This in indeed much cleaner, and I agree that using threads is much better as we share the objects by default. This should be thread safe as the different state dicts are orthogonal to each other, but did you check further by any chance? I.e. in the case of quantization, I did not check that we are not mutating other objects in ways that could go through race conditions, did you make sure?

Also, interested to see how much faster this is, given the GIL 🤔 Not entirely sure if the new thread would trigger new GPU streams, and if those would be run concurrently as well I must say, so hard to tell - did you benchmark a bit? 😁🤗

docs/source/en/reference/environment_variables.md

src/transformers/modeling_utils.py

Cyrilvallez · 2025-05-14T16:15:09Z

Oh also, please run make style with ruff==0.11.2 to pass code quality!

…ollection already occurs.

inf3rnus · 2025-05-15T00:54:46Z

@Cyrilvallez

Alright all changes requested have been made!

ALright, sorry for the delay! This in indeed much cleaner, and I agree that using threads is much better as we share the objects by default. This should be thread safe as the different state dicts are orthogonal to each other, but did you check further by any chance? I.e. in the case of quantization, I did not check that we are not mutating other objects in ways that could go through race conditions, did you make sure?

I just did a check on each of the quantizers, taking a look at _load_state_dict_into_meta_model and create_quantized_param for each quantizer method and we should be thread safe as long as each shard is indeed responsible for a unique set of layers. Which the only doubt in my mind there would be with tying weights...

You'd know best, but if we can guarantee that tying of weights only happens to the layers in a given shard, then we should be safe. Otherwise yeah, we're going to need to introduce locks.

And it's really only one of the quantizers that I can see this happening in, it may be happening in the others, but if it is, I'm ignorant to it.

Quantizer of concern: quantizer_torchao

I don't think this should be a problem though because after looking at the code, if I'm not mistaken models always have loaded their input and output embeddings / encoder and decoder weights before loading the rest of the weights? It also only ties weights for the shard containing the module for the input embeddings.

But outside of tying weights, I don't believe that there are any intersections in terms of assignment operations with respect to state, each quantization method only sets new weight values per module on a given layer via the state_dict that is constructed per thread, so each thread's set of layers of the model's overall state should be modified in isolation. They also all do this synchronously, so there should be no unexpected changes to a given module's state due to concurrent access.

quantizer_torchao below for reference (other comments here after this codeblock fyi):

    def create_quantized_param(
        self,
        model: "PreTrainedModel",
        param_value: "torch.Tensor",
        param_name: str,
        target_device: "torch.device",
        state_dict: Dict[str, Any],
        unexpected_keys: List[str],
    ):
        """
        Each nn.Linear layer that needs to be quantized is processed here.
        First, we set the value the weight tensor, then we move it to the target device. Finally, we quantize the module.
        """
        if self.quantization_config.quant_type == "autoquant":
            return

        from torchao.quantization import quantize_

        module, tensor_name = get_module_from_name(model, param_name)
        if self.pre_quantized:
            module._parameters[tensor_name] = torch.nn.Parameter(
                param_value.to(device=target_device), requires_grad=param_value.requires_grad
            )
            if isinstance(module, nn.Linear):
                module.extra_repr = types.MethodType(_linear_extra_repr, module)
        else:
            assert isinstance(self.quantization_config, TorchAoConfig)
            module._parameters[tensor_name] = torch.nn.Parameter(
                param_value, requires_grad=param_value.requires_grad
            ).to(device=target_device)
            # if we are quantizing tied parameters, to avoid tying the quantized weights
            # the correct order to do it is
            # 1. load the weight to model
            # 2. run tie_weights to populate the weights
            # 3. quantize
            input_embed = model.get_input_embeddings()
            if self.quantization_config.untie_embedding_weights and id(module) == id(input_embed):
                model.tie_weights()
                setattr(model.config.get_text_config(decoder=True), "tie_word_embeddings", False)

            # handle AOPerModuleConfig, introduced in torchao 0.11.0+
            if self.quantization_config._get_ao_version() > version.Version("0.10.0"):
                from torchao.quantization import AOPerModuleConfig

                config = self.quantization_config.get_apply_tensor_subclass()
                if isinstance(config, AOPerModuleConfig):
                    module_fqn, _ = param_name.rsplit(".", 1)
                    c = None
                    if module_fqn in config.module_fqn_to_config:
                        c = config.module_fqn_to_config[module_fqn]
                    else:
                        c = config.module_fqn_to_config.get("_default", None)
                    if c is not None:
                        # filter_fn: not filtering out any modules
                        quantize_(module, c, filter_fn=lambda x, fqn: True)
                    return

            quantize_(module, self.quantization_config.get_apply_tensor_subclass())

Also, interested to see how much faster this is, given the GIL 🤔 Not entirely sure if the new thread would trigger new GPU streams, and if those would be run concurrently as well I must say, so hard to tell - did you benchmark a bit? 😁🤗

Surprisingly, and maybe not so surprisingly the performance is even better. I did do a run on facebook/opt-30b again and you get about the same performance in terms of the time it takes to load the files. However, you see between a 5-7s speed up in not having to clone processes, so that's 7s saved!

Presumably what's happening is a lot of this work's IO bound, the only major part AFAIK that isn't is the deserialization of the weights into system RAM. So while only one thread may deserialize weights, the rest are free to do work while that guy sits on the main thread.

If the disk can provide the throughput and you have vCPUs at your disposal, you will see speed up benefits!

The same can be said about offloading weights from RAM into the GPU, that's all likely handled by system calls and driver code that's divorced from any work that needs to actually happen within the python process itself.

That's my theory at least, if we infinite time, we could certainly drill into it 😄...

Prev results were around 30s to load facebook/opt-30b

New results are around 24sish, although this is from memory as I did not put those numbers somewhere at the time. All I know is it was faster than the previous time of 30s by about 6 seconds.

inf3rnus · 2025-05-19T23:30:45Z

@Cyrilvallez @ArthurZucker Pinging again, I appreciate your patience 🙏

Cyrilvallez · 2025-05-22T16:08:54Z

@bot /style

github-actions · 2025-05-22T16:10:33Z

Style fixes have been applied. View the workflow run here.

Cyrilvallez

Hey! Sorry, things were a bit crazy lately 😬 I was able to run make style for you thanks to our new bot, however could you rebase/merge very quickly to take care of the tiny conflict in the imports? 🤗
Also, let's remove the additional test file as I pointed out, I'll see to add some myself after I merge!

Cyrilvallez · 2025-05-22T16:26:47Z

tests/utils/test_modeling_utils_parallel_loading.py

I think we can remove this file! I'll see to add some tests myeelf once this has been merged!

inf3rnus · 2025-05-22T16:37:05Z

@Cyrilvallez All changes have been made! Big time thank you for looking at this. I figure you guys are absolutely slammed right now

Cyrilvallez · 2025-05-23T16:27:09Z

Alright, merging! Thanks for the great contribution and for bearing with us 🙏🤗

HuggingFaceDocBuilderDev · 2025-05-23T16:40:20Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

inf3rnus · 2025-05-23T16:51:40Z

Thanks folks, viva la Huggingface! 🤗

huggingface#36835) * Get parallel loader working. Include tests. * Update the tests for parallel loading * Rename env variables. * Add docs for parallel model weight loading. * Touch up parallel model loading docs. * Touch up parallel model loading docs again. * Edit comment in test_modeling_utils_parallel_loading.py * Make sure HF_PARALLEL_LOADING_WORKERS is spelled correctly in modeling_utils.py * Correct times for parallelized loading, previous times were for a "hot" filesystem * Update parallel model loading so the spawn method is encapsulated. DRY up the code by leveraging get_submodule. * Update docs on model loading parallelism so that details on setting the multiprocessing start method are removed, now that the package handles this step internally. * Fix style on model loading parallelism changes. * Merge latest version of master's modeling_utils. * Removed unused variable. * Fix argument packing for the parallel loader. * Fix state dict being undefined in the parallel model loader. * Rename variables used in parallel model loading for clarity. Use get_module_from_name(). * Switch to the use of threads for parallel model loading. * Update docs for parallel loading. * Remove the use of json.loads when evaluating HF_ENABLE_PARALLEL_LOADING. Prefer simple casting. * Move parallelized shard loading into its own function. * Remove use of is_true(). Favor checking env var true values for HF_ENABLE_PARALLEL_LOADING. * Update copyright to 2025 in readme for paralell model loading. * Remove garbage collection line in load_shard_file, implicit garbage collection already occurs. * Run formatter on modeling_utils.py * Apply style fixes * Delete tests/utils/test_modeling_utils_parallel_loading.py --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Cyril Vallez <[email protected]>

inf3rnus added 7 commits March 18, 2025 21:18

Get parallel loader working. Include tests.

8fb9b18

Update the tests for parallel loading

27f36f2

Merge branch 'main' into 03-18-25-parallel-model-loading

7e5ecd8

Rename env variables.

e7c3ea5

Add docs for parallel model weight loading.

7599fe2

Touch up parallel model loading docs.

065e102

Touch up parallel model loading docs again.

d31594a

github-actions bot marked this pull request as draft March 19, 2025 18:41

inf3rnus added 2 commits March 19, 2025 14:43

Edit comment in test_modeling_utils_parallel_loading.py

33b3e0f

Merge branch 'main' into 03-18-25-parallel-model-loading

3fb6b65

inf3rnus marked this pull request as ready for review March 19, 2025 18:44

github-actions bot requested review from ArthurZucker and Rocketknight1 March 19, 2025 18:45

Make sure HF_PARALLEL_LOADING_WORKERS is spelled correctly in modelin…

0e22c04

…g_utils.py

ArthurZucker reviewed Mar 20, 2025

View reviewed changes

ArthurZucker added Core: Modeling Internals of the library; Models. from_pretrained labels Mar 20, 2025

inf3rnus mentioned this pull request Mar 21, 2025

Improve Model Download Speeds By ~3x For Large Models #36870

Open

4 tasks

inf3rnus and others added 4 commits March 20, 2025 22:10

Correct times for parallelized loading, previous times were for a "ho…

904bdaf

…t" filesystem

Update parallel model loading so the spawn method is encapsulated. DR…

7e37ba4

…Y up the code by leveraging get_submodule.

Update docs on model loading parallelism so that details on setting t…

a203f6a

…he multiprocessing start method are removed, now that the package handles this step internally.

Fix style on model loading parallelism changes.

14e9eef

inf3rnus requested a review from ArthurZucker March 25, 2025 17:57

inf3rnus added 3 commits April 8, 2025 17:33

Merge remote-tracking branch 'upstream/main' into 03-18-25-parallel-m…

fe1fc0c

…odel-loading

Merge latest version of master's modeling_utils.

d5637e8

Removed unused variable.

e0d37bb

inf3rnus added 2 commits April 30, 2025 17:59

Remove the use of json.loads when evaluating HF_ENABLE_PARALLEL_LOADI…

b8a1470

…NG. Prefer simple casting.

Move parallelized shard loading into its own function.

efb6605

ArthurZucker approved these changes May 1, 2025

View reviewed changes

Remove use of is_true(). Favor checking env var true values for HF_EN…

c66daef

…ABLE_PARALLEL_LOADING.

Cyrilvallez reviewed May 14, 2025

View reviewed changes

docs/source/en/reference/environment_variables.md Outdated Show resolved Hide resolved

src/transformers/modeling_utils.py Outdated Show resolved Hide resolved

inf3rnus and others added 4 commits May 14, 2025 20:17

Remove garbage collection line in load_shard_file, implicit garbage c…

610c5e3

…ollection already occurs.

Run formatter on modeling_utils.py

a9cb54b

Merge branch 'main' into 03-18-25-parallel-model-loading

fc76fbb

Apply style fixes

16f3751

Cyrilvallez reviewed May 22, 2025

View reviewed changes

inf3rnus and others added 2 commits May 22, 2025 12:33

Merge main.

cd0f42e

Delete tests/utils/test_modeling_utils_parallel_loading.py

3b9f458

Merge branch 'main' into 03-18-25-parallel-model-loading

b6bf421

Cyrilvallez enabled auto-merge (squash) May 23, 2025 16:27

Cyrilvallez approved these changes May 23, 2025

View reviewed changes

Cyrilvallez merged commit d5f992f into huggingface:main May 23, 2025
20 checks passed



		# Declare the normal model_utils.py test as a sideffect of importing the module
		from .test_modeling_utils import ModelUtilsTest # noqa

Enhance Model Loading By Providing Parallelism, Uses Optional Env Flag #36835

Enhance Model Loading By Providing Parallelism, Uses Optional Env Flag #36835

Uh oh!

Conversation

inf3rnus commented Mar 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Before submitting

Uh oh!

github-actions bot commented Mar 19, 2025

Uh oh!

ArthurZucker left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

inf3rnus commented Mar 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

inf3rnus commented Mar 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

inf3rnus commented Mar 24, 2025

Uh oh!

ArthurZucker left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

inf3rnus commented May 7, 2025

Uh oh!

inf3rnus commented May 13, 2025

Uh oh!

inf3rnus commented May 13, 2025

Uh oh!

Cyrilvallez left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Cyrilvallez commented May 14, 2025

Uh oh!

inf3rnus commented May 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

inf3rnus commented May 19, 2025

Uh oh!

Cyrilvallez commented May 22, 2025

Uh oh!

github-actions bot commented May 22, 2025

Uh oh!

Cyrilvallez left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

inf3rnus commented May 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Cyrilvallez commented May 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

HuggingFaceDocBuilderDev commented May 23, 2025

Uh oh!

inf3rnus commented May 23, 2025

inf3rnus commented Mar 19, 2025 •

edited

Loading

inf3rnus commented Mar 20, 2025 •

edited

Loading

inf3rnus commented Mar 20, 2025 •

edited

Loading

inf3rnus commented May 15, 2025 •

edited

Loading

inf3rnus commented May 22, 2025 •

edited

Loading

Cyrilvallez commented May 23, 2025 •

edited

Loading