Adding adapter support for NeoX #523
base: legacy
Conversation
@calpt, sorry to bother you. Did you by chance check this?
@ajesujoba Sorry for the delay in getting back to you. I've done an initial review of your changes and left some comments. Overall, this looks very promising, thanks again for your efforts.
Before this can be merged, please additionally make sure:
- all checks and tests in the CI pipeline pass
- all points in our contributing guide are addressed
- in `tests_adapters/models`, a new module for GPT-NeoX is added (this is currently not correctly documented in the contributing guide; we'll update it)
Additionally, I've provided a fix for the issue regarding output embeddings you've raised and we've discussed in #537.
src/transformers/__init__.py
Outdated
_import_structure["models.gpt_neox"].extend(
    [
        "TFGPTNeoXForCausalLM",
        "TFGPTNeoXForQuestionAnswering",
        "TFGPTNeoXForSequenceClassification",
        "TFGPTNeoXModel",
        "TFGPTNeoXPreTrainedModel",
    ]
)
Is the addition of these imports related to the changes in this PR?
No, I removed them.
Regarding "all points in our contributing guide are addressed": some of those points are marked optional in the documentation. I tried implementing them but ran into some dimension mismatches. Are they still optional?
Could you clarify what exactly you're referring to from the contributing guide? The Parallel inference and static head conversion points are still optional (although highly recommended). If Parallel support is not implemented, please make sure to remove the test mixin classes starting with "Parallel..." from the model test class.
Yes, I was referring to the Parallel inference and static head conversion points being optional (although highly recommended). As you recommended, I will remove the test mixin classes starting with "Parallel...". I will also remove some of the other test mixins such as IA3TestMixin, LoRATestMixin, PrefixTuningTestMixin, and UniPELTTestMixin, as they require `add_classification_head`, which the GPT-NeoX model in this version does not have. A sketch of the trimmed test module is shown below.
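For reference, a rough sketch of what the trimmed test module under `tests_adapters/models` might look like (mixin and helper names follow the existing tests_adapters conventions; the exact set of mixins kept for GPT-NeoX and the small config values are assumptions):

```python
# tests_adapters/models/test_gpt_neox.py (hypothetical sketch)
import unittest

from tests_adapters.methods import BottleneckAdapterTestMixin, CompacterTestMixin
from tests_adapters.test_adapter import AdapterTestBase, make_config
from transformers import GPTNeoXConfig


class GPTNeoXAdapterTestBase(AdapterTestBase):
    config_class = GPTNeoXConfig
    # Deliberately tiny config so the tests run fast
    config = make_config(
        GPTNeoXConfig,
        hidden_size=32,
        num_hidden_layers=4,
        num_attention_heads=4,
        intermediate_size=64,
    )
    tokenizer_name = "EleutherAI/gpt-neox-20b"


# Parallel..., IA3..., LoRA..., PrefixTuning... and UniPELT... mixins are
# deliberately left out, per the discussion above.
class GPTNeoXAdapterTest(
    BottleneckAdapterTestMixin,
    CompacterTestMixin,
    GPTNeoXAdapterTestBase,
    unittest.TestCase,
):
    pass
```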
        head = QuestionAnsweringHead(self, head_name, num_labels, layers, activation_function, id2label)
        self.add_prediction_head(head, overwrite_ok)


class GPTNeoXModelWithHeads(GPTNeoXAdapterModel):
As XModelWithHeads classes are deprecated, we wouldn't want to add this class for newly supported models.
Sure, we can remove them.
@calpt, one more thing to point out: the GPTNeoX configuration does not include `hidden_dropout_prob` and `attention_probs_dropout_prob`, and running tests_adapters with this results in an error. What do you suggest? Similarly, GPTNeoXTokenizer does not exist, and the current tests_adapters setup does not support that.
Regarding the tokenizer, I think we should allow using fast tokenizers anyway. Could you check if/which issues occur when you switch to loading the fast tokenizer for GPTNeoX?
Regarding the dropout issue, we would probably add some default dropout if the keys are not present in the model config. I can look into this. You could just add the missing keys to the model config temporarily to make the tests pass, and we can change it later on.
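A minimal sketch of that temporary workaround (the 0.1 defaults are an assumption; the permanent fix would live in the library itself):

```python
from transformers import GPTNeoXConfig

config = GPTNeoXConfig()
# Temporary workaround: the adapter test suite reads these BERT-style
# dropout keys, which GPTNeoXConfig does not define.
config.hidden_dropout_prob = 0.1
config.attention_probs_dropout_prob = 0.1
```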
Ok! Sounds good
@@ -95,7 +105,7 @@ def __init__(self, config):
             self.rotary_ndims, config.max_position_embeddings, base=config.rotary_emb_base
         )
         self.norm_factor = torch.sqrt(torch.tensor(self.head_size, dtype=torch.float32)).to(torch.get_default_dtype())
-        self.query_key_value = nn.Linear(config.hidden_size, 3 * config.hidden_size)
+        self.query_key_value = LoRALinear(config.hidden_size, 3 * config.hidden_size, "selfattn", config)
As GPT-NeoX groups all attention matrices together into one linear layer, similar to GPT-2, this should probably use LoRAMergedLinear instead of LoRALinear, if I'm not mistaken (see the GPT-2 implementation for reference).
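A hedged sketch of that change inside `GPTNeoXAttention.__init__` (the import path and the exact signature for an `nn.Linear`-based fused projection like GPT-NeoX's are assumptions based on the GPT-2 integration):

```python
# assumed import path within src/transformers/models/gpt_neox/modeling_gpt_neox.py
from ...adapters.lora import MergedLinear as LoRAMergedLinear

# Wrap the fused QKV projection so LoRA can address the query/key/value
# slices of the merged weight matrix individually, as done for GPT-2.
self.query_key_value = LoRAMergedLinear(
    config.hidden_size, 3 * config.hidden_size, "selfattn", config
)
```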
I fixed this already!
Hey, thanks again for your efforts in contributing new model architectures to adapter-transformers and sorry for the silence on our side. In the last few weeks, we've been working on a large refactoring of our project, which will ultimately result in the release of Adapters, the next-generation adapters library. See #584. As a consequence, we plan to merge any new model integrations directly to the new codebase, which currently can be found on this branch. Unfortunately, this necessitates some changes in the model integration code (detailed here, see already integrated models such as BERT, BART etc. for reference). If you'd be willing to update your model integration to target the new library yourself, we'd be super happy to help you on this. Otherwise, we might look into upgrading and merging some of the open model integration PRs ourselves in the future. For more details, again see #584.
Added adapter support for GPT-NeoX with tests. At the moment, however, training the adapter module also trains the CLM head (which is not the expected behavior). I already raised this here.
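A quick way to reproduce this is to check which parameters remain trainable after activating the adapter (a minimal sketch; the small config values and adapter/head names are placeholders):

```python
from transformers import GPTNeoXConfig
from transformers.adapters import GPTNeoXAdapterModel  # class added in this PR

config = GPTNeoXConfig(
    hidden_size=64, num_hidden_layers=2, num_attention_heads=4, intermediate_size=128
)
model = GPTNeoXAdapterModel(config)
model.add_adapter("neox_adapter")
model.add_causal_lm_head("lm")
model.train_adapter("neox_adapter")

# Expected: only adapter parameters require gradients. The issue reported
# above is that the CLM head parameters show up here as well.
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)
```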