Add Doge model #35891
base: main
Conversation
This reverts commit 229cdca.
Hi @LoserCheems, and thanks for the PR! The model looks cool and I like the paper too, but we're trying to add new models using modular in the future. You can see a guide here, and an example modular PR here. If you write …
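(For context, a minimal sketch of what a modular file for this model could look like. The class names and the choice of what to inherit are illustrative assumptions, not the actual code from this PR; the modular converter script under `utils/` then generates the full `modeling_doge.py` from such a file.)

```python
# Illustrative sketch only: what a modular_doge.py could look like if most
# components can be reused from Llama. Names and inheritance choices here are
# assumptions, not the actual Doge implementation in this PR.
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import (
    LlamaForCausalLM,
    LlamaModel,
    LlamaRMSNorm,
)


class DogeConfig(LlamaConfig):
    model_type = "doge"


class DogeRMSNorm(LlamaRMSNorm):
    pass


class DogeModel(LlamaModel):
    # Only components that actually differ from Llama would be redefined here;
    # everything else is expanded from the parent class by the converter.
    pass


class DogeForCausalLM(LlamaForCausalLM):
    pass
```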
Thank you @Rocketknight1, I've written …
Hi @LoserCheems, yes, don't worry, it's a new feature so everyone is a bit confused about it! 😅 Your classes like …
Thank you @Rocketknight1. In fact, because the weight names and config names are different, there aren't many classes that can be directly inherited, a total of …
Hi @LoserCheems, the last code quality error is caused by an unprotected … There are unrelated failing tests under …
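(For context, the usual way transformers protects a backend-specific import is to guard it behind an availability check. A sketch, assuming the offending import was the flex-attention one; the exact symbol isn't shown in the comment above:)

```python
# Sketch of the usual "protected import" pattern in transformers. Assumes the
# helper is_torch_flex_attn_available (present in recent transformers versions)
# and that flex attention was the import in question, which is an assumption.
from transformers.utils import is_torch_flex_attn_available

if is_torch_flex_attn_available():
    # Only import the torch flex-attention kernel when the backend supports it.
    from torch.nn.attention.flex_attention import flex_attention
```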
Sorry @Rocketknight1, I mistakenly imported …
Em 🤓, there seems to be something wrong with …
Thank you @Rocketknight1 for your suggestion, I have finished the modification accordingly!
Yes, looks good to me now! cc @Cyrilvallez @ArthurZucker for core maintainer review. The failing tests are unrelated and should be fixed by rebasing after the review / fixing nits.
Nice work!
Thank you @ArthurZucker for your careful review. I have finished the modification according to your suggestion.
This leads to incorrect weight initialization.
Cool, having a look!
We have to provide `attention_mask` for dynamic mask computation.
Alright, here is a new round of reviews! Basically, a lot of code can be cut by inheriting from the Llama classes 🤗 The model itself would need to change some weight keys, but very few of them. We are happy to include a nice weight-converting script, such as the one found here, in the library as well, to facilitate the weight-name mapping if needed!
I also feel like quite a lot of config arguments could be renamed to what already exists in the library: consistency between models ensures that every user of your model will very quickly understand what's happening with your model and config!
If needed, we are also happy to help with the conversion of your weight names and config args 🤗 But if you don't have that many checkpoints on the hub already, this is usually a pretty straightforward operation!
EDIT: Renaming the weights also enables using modular much more in this case, which in turn means your model will automatically benefit, from day zero, from all refactors/new features in the future!
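(A hedged sketch of what such a weight-converting script could look like. The key fragments in the mapping below are placeholders, not the real Doge weight names:)

```python
# Hypothetical sketch of a key-renaming conversion script; the old/new key
# fragments below are placeholders, not the actual Doge weight names.
import re

import torch
from safetensors.torch import load_file, save_file

# Placeholder mapping from old weight-name fragments to Llama-style names.
ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
    r"old_attn_name": "self_attn",
    r"old_mlp_name": "mlp",
}


def convert_state_dict(state_dict: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    converted = {}
    for key, value in state_dict.items():
        new_key = key
        for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING.items():
            new_key = re.sub(pattern, replacement, new_key)
        converted[new_key] = value
    return converted


if __name__ == "__main__":
    # Rename the keys of a local checkpoint and save the converted file.
    state_dict = load_file("model.safetensors")
    save_file(convert_state_dict(state_dict), "model.converted.safetensors")
```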
    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    def get_decoder(self):
        return self.model

    def set_decoder(self, decoder):
        self.model = decoder

    @add_start_docstrings_to_model_forward(DOGE_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[torch.LongTensor] = None,
        logits_to_keep: Union[int, torch.Tensor] = 0,
        **kwargs: Unpack[LossKwargs],
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
        Args:
            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

            logits_to_keep (`int` or `torch.Tensor`, *optional*):
                If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all
                `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
                token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
                If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
                This is useful when using packed tensor format (single dimension for batch and sequence length).

        Returns:

        Example:

        ```python
        >>> from transformers import AutoTokenizer, AutoModelForCausalLM

        >>> model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-20M")
        >>> tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-20M")

        >>> prompt = "Hey, are you conscious? Can you talk to me?"
        >>> inputs = tokenizer(prompt, return_tensors="pt")

        >>> # Generate
        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
        ```"""
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # decoder output consists of (dec_features, layer_state, dec_hidden, dec_attn)
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            cache_position=cache_position,
            **kwargs,
        )

        hidden_states = outputs[0]
        # only compute necessary logits, and do not upcast them to float if we are not computing the loss
        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
        logits = self.lm_head(hidden_states[:, slice_indices, :])

        loss = None
        if labels is not None:
            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.vocab_size, **kwargs)

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
Here this can all be inherited directly from Llama as well!
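(In modular terms, that would presumably collapse to something like the following, assuming no Doge-specific changes are needed in the causal-LM head; this is a sketch, not the final code:)

```python
from transformers.models.llama.modeling_llama import LlamaForCausalLM


# Sketch: if nothing Doge-specific is needed here, the whole forward() shown
# above can be generated from the parent class by the modular converter.
class DogeForCausalLM(LlamaForCausalLM):
    pass
```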
If I inherit Llama's forward method, the docstring example comes out wrong. What should I do?
>>> from transformers import AutoTokenizer, DogeForCausalLM
>>> model = DogeForCausalLM.from_pretrained("meta-doge/Doge-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-doge/Doge-2-7b-hf")
But there are two problems: 1. `flex_attention_forward` is not updated properly; 2. the `Example` in the forward method of `DogeForCausalLM` is wrong.
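(For reference, a corrected `Example` could simply reuse the Doge checkpoint already used earlier in this PR's docstring. A sketch, with the expected output line omitted since it depends on the checkpoint:)

```python
>>> from transformers import AutoTokenizer, DogeForCausalLM

>>> model = DogeForCausalLM.from_pretrained("SmallDoge/Doge-20M")
>>> tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-20M")

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0]
```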
What does this PR do?
Fixes #35889
Support the Doge-SLM family of small language models.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
to: @ArthurZucker