Add Doge model #35891
base: main
Conversation
This reverts commit 229cdca.
Hi @LoserCheems, and thanks for the PR! The model looks cool and I like the paper too, but we're trying to add new models using modular in the future. You can see a guide here, and an example modular PR here. If you write …
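(For context, a minimal sketch of what a modular file for this model could look like. The class names and the choice of what to inherit are illustrative assumptions, not the actual code from this PR; the modular converter script under `utils/` then generates the full `modeling_doge.py` from such a file.)

```python
# Illustrative sketch only: what a modular_doge.py could look like if most
# components can be reused from Llama. Names and inheritance choices here are
# assumptions, not the actual Doge implementation in this PR.
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import (
    LlamaForCausalLM,
    LlamaModel,
    LlamaRMSNorm,
)


class DogeConfig(LlamaConfig):
    model_type = "doge"


class DogeRMSNorm(LlamaRMSNorm):
    pass


class DogeModel(LlamaModel):
    # Only components that actually differ from Llama would be redefined here;
    # everything else is expanded from the parent class by the converter.
    pass


class DogeForCausalLM(LlamaForCausalLM):
    pass
```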
Thank you @Rocketknight1, I've written …
Hi @LoserCheems, yes, don't worry, it's a new feature so everyone is a bit confused about it! 😅 Your classes like …
Thank you @Rocketknight1. In fact, because the weight names and config names are different, there aren't many classes that can be directly inherited, a total of …
Hi @LoserCheems, the last code quality error is caused by an unprotected … There are unrelated failing tests under …
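(For context, the usual way transformers protects a backend-specific import is to guard it behind an availability check. A sketch, assuming the offending import was the flex-attention one; the exact symbol isn't shown in the comment above:)

```python
# Sketch of the usual "protected import" pattern in transformers. Assumes the
# helper is_torch_flex_attn_available (present in recent transformers versions)
# and that flex attention was the import in question, which is an assumption.
from transformers.utils import is_torch_flex_attn_available

if is_torch_flex_attn_available():
    # Only import the torch flex-attention kernel when the backend supports it.
    from torch.nn.attention.flex_attention import flex_attention
```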
Sorry @Rocketknight1, I mistakenly imported …
Em 🤓, there seems to be something wrong with …
Thank you @Rocketknight1 for your suggestion, I have finished the modification accordingly!
Yes, looks good to me now! cc @Cyrilvallez @ArthurZucker for core maintainer review. The failing tests are unrelated and should be fixed by rebasing after the review / fixing nits.
Nice work!
Thank you @ArthurZucker for your careful review. I have finished the modification according to your suggestion.
This leads to incorrect weight initialization.
Cool, having a look!
We have to provide `attention_mask` for dynamic mask computation.
Alright, here is a new round of reviews! Basically, a lot of code can be cut by inheriting from the Llama classes 🤗 The model itself would need to change some weight keys, but very few of them. We are happy to include a nice weight-converting script, such as the one found here, in the library as well, to facilitate the weight-name mapping if needed!
I also feel like quite a lot of config arguments could be renamed to what already exists in the library: consistency between models ensures that every user of your model will very quickly understand what's happening with your model and config!
If needed, we are also happy to help with the conversion of your weight names and config args 🤗 But if you don't have that many checkpoints on the hub already, this is usually a pretty straightforward operation!
EDIT: Renaming the weights also enables using modular much more in this case, which in turn means your model will automatically benefit, from day zero, from all refactors/new features in the future!
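(A hedged sketch of what such a weight-converting script could look like. The key fragments in the mapping below are placeholders, not the real Doge weight names:)

```python
# Hypothetical sketch of a key-renaming conversion script; the old/new key
# fragments below are placeholders, not the actual Doge weight names.
import re

import torch
from safetensors.torch import load_file, save_file

# Placeholder mapping from old weight-name fragments to Llama-style names.
ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
    r"old_attn_name": "self_attn",
    r"old_mlp_name": "mlp",
}


def convert_state_dict(state_dict: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    converted = {}
    for key, value in state_dict.items():
        new_key = key
        for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING.items():
            new_key = re.sub(pattern, replacement, new_key)
        converted[new_key] = value
    return converted


if __name__ == "__main__":
    # Rename the keys of a local checkpoint and save the converted file.
    state_dict = load_file("model.safetensors")
    save_file(convert_state_dict(state_dict), "model.converted.safetensors")
```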
    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    def get_decoder(self):
        return self.model

    def set_decoder(self, decoder):
        self.model = decoder

    @add_start_docstrings_to_model_forward(DOGE_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[torch.LongTensor] = None,
        logits_to_keep: Union[int, torch.Tensor] = 0,
        **kwargs: Unpack[LossKwargs],
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
        Args:
            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

            logits_to_keep (`int` or `torch.Tensor`, *optional*):
                If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all
                `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
                token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
                If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
                This is useful when using packed tensor format (single dimension for batch and sequence length).

        Returns:

        Example:

        ```python
        >>> from transformers import AutoTokenizer, AutoModelForCausalLM

        >>> model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-20M")
        >>> tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-20M")

        >>> prompt = "Hey, are you conscious? Can you talk to me?"
        >>> inputs = tokenizer(prompt, return_tensors="pt")

        >>> # Generate
        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
        ```"""
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # decoder output consists of (dec_features, layer_state, dec_hidden, dec_attn)
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            cache_position=cache_position,
            **kwargs,
        )

        hidden_states = outputs[0]
        # only compute necessary logits, and do not upcast them to float if we are not computing the loss
        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
        logits = self.lm_head(hidden_states[:, slice_indices, :])

        loss = None
        if labels is not None:
            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.vocab_size, **kwargs)

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
Here this can all be inherited directly from Llama as well!
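(In modular terms, that would presumably collapse to something like the following, assuming no Doge-specific changes are needed in the causal-LM head; this is a sketch, not the final code:)

```python
from transformers.models.llama.modeling_llama import LlamaForCausalLM


# Sketch: if nothing Doge-specific is needed here, the whole forward() shown
# above can be generated from the parent class by the modular converter.
class DogeForCausalLM(LlamaForCausalLM):
    pass
```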
If I inherit Llama's forward method, the docstring example comes out wrong. What should I do?
>>> from transformers import AutoTokenizer, DogeForCausalLM
>>> model = DogeForCausalLM.from_pretrained("meta-doge/Doge-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-doge/Doge-2-7b-hf")
But there are two problems: 1. `flex_attention_forward` is not updated properly; 2. the `Example` in the forward method of `DogeForCausalLM` is wrong.
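(For reference, a corrected `Example` could simply reuse the Doge checkpoint already used earlier in this PR's docstring. A sketch, with the expected output line omitted since it depends on the checkpoint:)

```python
>>> from transformers import AutoTokenizer, DogeForCausalLM

>>> model = DogeForCausalLM.from_pretrained("SmallDoge/Doge-20M")
>>> tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-20M")

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0]
```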
What does this PR do?
Fixes #35889
Support the Doge-SLM family of small language models.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
to: @ArthurZucker