
Add Doge model #35891

Open · wants to merge 50 commits into main

Conversation

LoserCheems
Contributor

@LoserCheems commented Jan 25, 2025

What does this PR do?

Fixes #35889
Support the Doge-SLM family of small language models.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

to: @ArthurZucker

@Rocketknight1
Member

Hi @LoserCheems, and thanks for the PR! The model looks cool and I like the paper too, but we're trying to add new models using modular in future. You can see a guide here, and an example modular PR here.

If you write modular_doge.py with inheritance then the configuration_ and modeling_ files will be auto-generated. This makes the PR much shorter and easier to review.
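To give a rough idea of what that means in practice, here is a minimal sketch of a modular file (purely illustrative, not the actual Doge definitions):

```python
# modular_doge.py -- illustrative sketch only, not the real Doge code
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaRMSNorm


class DogeConfig(LlamaConfig):
    model_type = "doge"


class DogeRMSNorm(LlamaRMSNorm):
    # Inherited unchanged: the generated modeling_doge.py will contain the full,
    # standalone implementation copied over from Llama.
    pass
```

Anything that genuinely differs from Llama is written out in the modular file, and the converter expands the rest into the generated configuration_ and modeling_ files.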

@LoserCheems
Copy link
Contributor Author

Thank you @Rocketknight1, I've written modular_doge.py, but I'm sorry, I don't quite understand modular and may have made some mistakes...

@Rocketknight1
Member

Hi @LoserCheems yes, don't worry, it's a new feature so everyone is a bit confused about it! 😅

Your modular_doge.py file looks good! The next step is to find code that's copied from other models in transformers, and replace that with inheritance. This will make modular_doge.py much smaller, but the full modeling_doge.py will still be generated without inheritance. You can see some examples in the Qwen2.5 PR:

[screenshot: modular inheritance examples from the Qwen2.5 PR]

Classes like DogeMLP and DogeForSequenceClassification look like they use code from other library classes like Llama, so you could just inherit those instead in the modular file. You can run make fix-copies to regenerate modeling_doge.py and confirm that it still works.
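Concretely, a hedged sketch of what that inheritance looks like (it assumes DogeMLP and DogeForSequenceClassification really do match the Llama implementations; where they differ, the custom code should stay):

```python
# In modular_doge.py: replace copied class bodies with plain inheritance.
from transformers.models.llama.modeling_llama import (
    LlamaForSequenceClassification,
    LlamaMLP,
)


class DogeMLP(LlamaMLP):
    pass


class DogeForSequenceClassification(LlamaForSequenceClassification):
    pass
```

After that, make fix-copies rewrites modeling_doge.py with the full code spelled out, so end users never see the cross-model inheritance.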

@LoserCheems
Contributor Author

Thank you @Rocketknight1. In fact, because the weight names and config names are different, there aren't many classes that can be inherited directly: in total, RMSNorm, RotaryEmbedding, and DogeForSequenceClassification are inherited from Llama.

@Rocketknight1
Member

Hi @LoserCheems, the last code quality error is caused by an unprotected import torch. These need to be guarded by if is_torch_available because some people have JAX-only or TF-only systems, and unguarded imports can make it impossible for them to use the library!
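For reference, the guard pattern looks roughly like this (the helper function below is a made-up example; only the guard itself matters):

```python
from transformers.utils import is_torch_available

if is_torch_available():
    import torch


def build_causal_mask(seq_len: int):
    # Only reached when torch is installed; JAX-only or TF-only environments
    # can still import the module without hitting a missing-torch error.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
```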

There are unrelated failing tests under tests_torch - you can ignore them for now, but once you can get code quality green then let me know and I'll review the PR.

@LoserCheems
Contributor Author

Sorry @Rocketknight1, I mistakenly imported PretrainedConfig from modeling_utils, which is now fixed.

@LoserCheems
Contributor Author

Hmm 🤓, there seems to be something wrong with RotaryEmbedding inherited from Llama.

@LoserCheems
Contributor Author

Thank you @Rocketknight1 for your suggestion, I have finished the modification according to the suggestion!

@Rocketknight1
Member

Yes, looks good to me now! cc @Cyrilvallez @ArthurZucker for core maintainer review. Failing tests are unrelated, should be fixed by rebasing after review / fixing nits

Collaborator

@ArthurZucker left a comment

Nice work!

@LoserCheems
Contributor Author

Thank you @ArthurZucker for your careful review. I have finished the modification according to your suggestion.

@ArthurZucker
Collaborator

Cool, having a look!

Member

@Cyrilvallez left a comment

Alright, here is a new round of reviews! Basically, a lot of code can be cut by inheriting from Llama classes 🤗 The model itself would need to change some weight keys, but very few of them. We are happy to include a nice weight-conversion script, such as the one that can be found here, in the library as well, to facilitate the weight-name mapping if needed!

I also feel like quite a lot of config arguments could be renamed to what already exists in the library: consistency between models ensures that every user of your model will very quickly understand what's happening with your model and config!
If needed, we are also happy to help with the conversion of your weight names and config args 🤗 But if you don't have that many checkpoints on the Hub already, this is usually a pretty straightforward operation!

EDIT: Renaming weights would also allow using modular much more in this case, which in turn means your model will automatically benefit, from day 0, from all future refactors and new features!
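For what it's worth, such a conversion script is usually just a key-renaming pass over the checkpoint's state dict. A minimal sketch (the mapping entries and file names below are made up for illustration; the real ones depend on the final Doge weight names):

```python
import re

import torch

# Hypothetical old-key -> new-key patterns; replace with the real Doge mapping.
KEY_MAPPING = {
    r"\.residual_norm\.": ".post_attention_layernorm.",
    r"\.in_proj\.": ".q_proj.",
}


def convert_state_dict(state_dict):
    converted = {}
    for key, tensor in state_dict.items():
        new_key = key
        for pattern, replacement in KEY_MAPPING.items():
            new_key = re.sub(pattern, replacement, new_key)
        converted[new_key] = tensor
    return converted


if __name__ == "__main__":
    old = torch.load("pytorch_model.bin", map_location="cpu")
    torch.save(convert_state_dict(old), "pytorch_model.converted.bin")
```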

Comment on lines 1059 to 1181
    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    def get_decoder(self):
        return self.model

    def set_decoder(self, decoder):
        self.model = decoder

    @add_start_docstrings_to_model_forward(DOGE_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[torch.LongTensor] = None,
        logits_to_keep: Union[int, torch.Tensor] = 0,
        **kwargs: Unpack[LossKwargs],
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
        Args:
            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

            logits_to_keep (`int`, *optional*):
                If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all
                `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
                token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
                If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
                This is useful when using packed tensor format (single dimension for batch and sequence length).

        Returns:

        Example:

        ```python
        >>> from transformers import AutoTokenizer, AutoModelForCausalLM

        >>> model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-20M")
        >>> tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-20M")

        >>> prompt = "Hey, are you conscious? Can you talk to me?"
        >>> inputs = tokenizer(prompt, return_tensors="pt")

        >>> # Generate
        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
        ```"""
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # decoder output consists of (dec_features, layer_state, dec_hidden, dec_attn)
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            cache_position=cache_position,
            **kwargs,
        )

        hidden_states = outputs[0]
        # only compute necessary logits, and do not upcast them to float if we are not computing the loss
        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
        logits = self.lm_head(hidden_states[:, slice_indices, :])

        loss = None
        if labels is not None:
            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.vocab_size, **kwargs)

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
Member

Here this can all be inherited directly from Llama as well!
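Concretely, in modular_doge.py that would boil down to something like this (a sketch, assuming Doge's causal-LM head really behaves exactly like Llama's):

```python
from transformers.models.llama.modeling_llama import LlamaForCausalLM


class DogeForCausalLM(LlamaForCausalLM):
    # No body needed: get/set_decoder, get/set_output_embeddings and forward are
    # taken from Llama, and the converter writes them out in full in the
    # generated modeling_doge.py.
    pass
```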

Contributor Author

What should I do if I inherit Llama's forward method but it comes with the wrong docstring example?

       >>> from transformers import AutoTokenizer, DogeForCausalLM

       >>> model = DogeForCausalLM.from_pretrained("meta-doge/Doge-2-7b-hf")
       >>> tokenizer = AutoTokenizer.from_pretrained("meta-doge/Doge-2-7b-hf")
