
Restore from single slice #2084


Draft · wants to merge 11 commits into main

Conversation

@findmyway (Contributor) commented Jul 9, 2025:

  1. Note that I'm still using the old (v0.11.15) version of the slice_devices method, which means it returns the devices from a single slice instead of a single replica.
  2. The basic idea is to keep the original implementation almost unchanged; I added another replica dimension along which to broadcast the data.
  3. I also tried the original idea of simply restoring from a single replica, but I get the InvalidShardingError below. The reason is clear: the devices of a single process are distributed across different replicas (see the sketch after the snippet).

if primary_replica_ids != expected_primary_replica_ids:
  raise InvalidShardingError(
      'The provided sharding is not valid. The primary replica has the'
      f' following devices: {primary_replica_ids}, but process indices'
      ' associated with primary replica devices are expected to be:'
      f' {primary_replica_pids}.'
  )
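
For reference, here is a rough way to see the situation (illustrative only; the (2, 4) mesh shape, the axis names, and the assumption that the replica axis is mesh dimension 0 are mine, this is not Orbax code):

import collections

from jax.experimental import mesh_utils
from jax.sharding import Mesh

# Build a (replica, data) mesh the same way production code does.
devices = mesh_utils.create_device_mesh((2, 4))
mesh = Mesh(devices, axis_names=('replica', 'data'))

# Map each process index to the set of replica rows its devices land in,
# assuming the replica axis is mesh dimension 0.
replicas_per_process = collections.defaultdict(set)
for replica_row in range(mesh.devices.shape[0]):
  for d in mesh.devices[replica_row].flatten():
    replicas_per_process[d.process_index].add(replica_row)

# Any process that appears in more than one replica breaks the
# "one process belongs to one replica" assumption behind the check above.
spanning = {p: rows for p, rows in replicas_per_process.items() if len(rows) > 1}
print('processes whose devices span multiple replicas:', spanning)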

My questions:

  1. Are there any obvious errors or potential improvements in my current implementation?
    • One data point from my latest test: ~30s for deserialization plus ~60s for broadcasting (only one broadcast in total).
  2. Any idea how to address the InvalidShardingError above? (My initial thought is that resharding should still work after the sum op even though there's a mismatch; see the sketch after this list.)
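
To make question 2 concrete, this is roughly the zero-then-sum broadcast pattern I have in mind (a minimal sketch with assumed shapes and axis names, running on 8 devices; it is not the actual broadcast_one_replica_to_all implementation):

import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = mesh_utils.create_device_mesh((2, 4))
mesh = Mesh(devices, axis_names=('replica', 'data'))

# Stack the restored values along a leading replica dimension: replica 0
# holds the real data, every other replica holds zeros.
x = jnp.stack([jnp.arange(32.0).reshape(4, 8), jnp.zeros((4, 8))])
x = jax.device_put(x, NamedSharding(mesh, P('replica', 'data', None)))

# Summing over the replica dimension leaves every replica with replica 0's
# values, because all other replicas only contribute zeros.
broadcast = jax.jit(
    lambda a: jnp.sum(a, axis=0),
    out_shardings=NamedSharding(mesh, P('data', None)),
)
y = broadcast(x)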

@cpgaffney1 (Collaborator) left a comment:

> I also tried the original idea of simply restoring from a single replica, but I get the InvalidShardingError. The reason is clear: the devices of a single process are distributed across different replicas.

Devices from one process belonging to different replicas do violate a fundamental assumption. I suppose the validation just needs to be modified to account for this possibility, if we are saying that it is indeed a possibility.

Ideally the unit testing could be improved, but it might be tricky to emulate this situation in a unit test. FWIW, here are some test cases for SingleReplicaArrayHandler - they are private only because they run with an internal TPU-based test harness: https://gist.github.com/cpgaffney1/35161a6e6f6e1bc7bf2ffd3df543efe5
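
Purely as a sketch of one possible relaxation (variable names follow the snippet quoted above; I haven't vetted this against the real SingleReplicaArrayHandler, so treat it as illustrative): instead of requiring exact equality, only require that every primary-replica device is owned by one of the participating processes.

def check_primary_replica_devices(primary_replica_ids, expected_primary_replica_ids):
  # Hypothetical relaxation: a process whose devices also appear in other
  # replicas no longer trips the error, as long as the primary replica's
  # devices are all covered by the participating processes.
  if not set(primary_replica_ids).issubset(expected_primary_replica_ids):
    raise ValueError(  # stand-in for InvalidShardingError
        'The provided sharding is not valid. The primary replica has devices'
        f' {sorted(primary_replica_ids)}, which are not all owned by the'
        f' expected processes (devices: {sorted(expected_primary_replica_ids)}).'
    )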

@@ -224,8 +225,9 @@ def broadcast_one_replica_to_all(
    - pytree with broadcasted data
    - number of broadcasts performed.
  """
  num_replicas = global_mesh.devices.shape[replica_axis_index]
  replica_axis_name = global_mesh.axis_names[replica_axis_index]
  # num_replicas = global_mesh.devices.shape[replica_axis_index]
cpgaffney1 (Collaborator):

I don't quite understand why this was incorrect. Isn't the contract that the replica_axis_index-th dimension of the mesh should be the replica dimension?

cpgaffney1 (Collaborator):

Ah sorry, I realize the intention now. The idea is to always use a single slice to broadcast, even when n_replicas != n_slices?

findmyway (Contributor Author):

Yes
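
For context, the two counts can differ like this (a rough sketch; it assumes TPU devices, which expose slice_index, and that the replica axis is the first mesh dimension):

def count_slices_and_replicas(global_mesh, replica_axis_index=0):
  # Physical slices, from the device attribute (TPU-only in this sketch).
  num_slices = len({d.slice_index for d in global_mesh.devices.flatten()})
  # Logical replicas, from the mesh shape.
  num_replicas = global_mesh.devices.shape[replica_axis_index]
  return num_slices, num_replicas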

  replica_axis_name = global_mesh.axis_names[replica_axis_index]
  # num_replicas = global_mesh.devices.shape[replica_axis_index]
  # replica_axis_name = global_mesh.axis_names[replica_axis_index]
  replica_axis_name = global_mesh.axis_names[0]  # assuming pp dimension is never used
cpgaffney1 (Collaborator):

Unused now?

findmyway (Contributor Author):

Correct

  # Validate merged params.
  if enable_validation:
    await _validate_params(directory, ts_context, use_zarr3=use_zarr3)
  # # Validate merged params.
cpgaffney1 (Collaborator):

Is there also a problem with this check?

findmyway (Contributor Author):

This is unrelated to this PR.

Actually, the check may take a relatively long time (>500s) to finish in some cases, so I disabled it temporarily.

Ideally we'd also have a config option for it.
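
Something like this is what I have in mind (hypothetical names, not an existing Orbax option):

import dataclasses

@dataclasses.dataclass
class MergeOptions:
  # Allow opting out of the post-merge check when it is too slow (>500s here).
  validate_merged_params: bool = True

async def _finalize_merge(directory, ts_context, use_zarr3, options: MergeOptions):
  if options.validate_merged_params:
    # _validate_params is the existing check from the diff above.
    await _validate_params(directory, ts_context, use_zarr3=use_zarr3)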

@findmyway (Contributor Author) commented:
Thanks!

It looks like you are using two processes for testing here.

Could you also add the (2, 4) mesh shape to the test below? Please make sure the mesh is created with jax.experimental.mesh_utils.create_device_mesh so that my assumption can be validated.

https://gist.github.com/cpgaffney1/35161a6e6f6e1bc7bf2ffd3df543efe5#file-type_handlers_test-L291-L295
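
Something along these lines for the mesh construction (hypothetical helper, just to show what I mean):

from jax.experimental import mesh_utils
from jax.sharding import Mesh

def make_mesh(shape=(2, 4), axis_names=('replica', 'data')):
  # Build the mesh the same way production code does, rather than reshaping
  # jax.devices() by hand, so the device-to-replica assignment matches reality.
  devices = mesh_utils.create_device_mesh(shape)
  return Mesh(devices, axis_names=axis_names)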
