deepseek r1 running #102
Conversation
Left a couple of comments.
# self.kv_cache[:bsz, start_pos:end_pos] = self.kv_norm(kv)
# self.pe_cache[:bsz, start_pos:end_pos] = k_pe.squeeze(2)
We have changed how we do caching, as seen in the lines below. Do we have a reason to keep these comments?
# self.k_cache[:bsz, start_pos:end_pos] = k
# self.v_cache[:bsz, start_pos:end_pos] = v
See comment below.
# assert args.n_routed_experts % world_size == 0
# self.n_routed_experts = args.n_routed_experts
# self.n_local_experts = args.n_routed_experts // world_size
# self.n_activated_experts = args.n_activated_experts
# self.experts_start_idx = rank * self.n_local_experts
# self.experts_end_idx = self.experts_start_idx + self.n_local_experts
self.gate = Gate(model_args)
# self.experts = nn.ModuleList(
#     [
#         Expert(args.dim, args.moe_inter_dim)
#         if self.experts_start_idx <= i < self.experts_end_idx
#         else None
#         for i in range(self.n_routed_experts)
#     ]
# )
Do we have a reason to leave these as comments rather than removing them?
# if seqlen > 1:
#     mask = torch.full((seqlen, seqlen), float("-inf"), device=tokens.device).triu_(1)
Do we have a reason to leave these as comments rather than removing them?
@@ -544,6 +543,7 @@ def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
else:
    group_scores = scores.topk(2, dim=-1)[0].sum(dim=-1)
    indices = group_scores.topk(self.topk_groups, dim=-1)[1]
    print('i am here')
This looks like a test print; please remove it. If appropriate, we could use logging at the info level instead.
I see there are a couple of other prints in model functions; I would consider applying the same criteria to those.
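A minimal sketch of the suggested change, assuming the standard-library logging module; the logger setup and message text are illustrative, not taken from this PR:

import logging

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Replaces print('i am here'); info-level messages can be turned on or
# off through the logging configuration instead of being edited out.
logger.info("reached the group-score branch of the gate forward pass")

A module-level logger also tags each message with the module name, which makes it easy to filter these traces per file.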
@@ -26,6 +26,7 @@ WORKDIR /workspaces
# Install torchax
RUN git clone https://github.com/pytorch/xla.git
WORKDIR /workspaces/xla/torchax
RUN git checkout hanq_torchax1
Is this a temporary fix until pytorch/xla@master...hanq_torchax1 is merged?
tokens = name.split(".")
for i, t in enumerate(tokens):
    if is_integer(t):
Why not use something like type(t) is int?
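For reference, a minimal sketch of what an is_integer helper over string tokens might look like; its actual definition isn't shown in this diff, so this is an assumption:

# Assumed shape of the helper; not taken from this PR. Note that
# name.split(".") always yields strings, so the check has to inspect
# the string's content rather than its type.
def is_integer(t: str) -> bool:
    return t.isdigit()

tokens = "layers.3.attention.wq.weight".split(".")
print([t for t in tokens if is_integer(t)])  # prints ['3']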
name0 = "tp0" | ||
# name1 = "tp1" |
Should we add the definition of "name0"?
@pytest.mark.deepseek
def test_single_device_compile():
Is this test no longer useful?
This file contains many helpers for sharding. Should we create a separate file for the sharding tooling written here?
In that case, adding unit tests for these sharding methods could be helpful.
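A hypothetical sketch of what such a unit test could look like; shard_axis_for is an illustrative stand-in, not a helper from this PR:

import pytest

# Illustrative stand-in for one of the sharding helpers discussed above;
# the real helpers may have different names and signatures.
def shard_axis_for(name: str) -> int:
    return 0 if name.endswith("wq.weight") else -1

@pytest.mark.parametrize(
    "name, axis",
    [("layers.0.attn.wq.weight", 0), ("layers.0.norm.weight", -1)],
)
def test_shard_axis_for(name, axis):
    assert shard_axis_for(name) == axis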