
[feat] Add cache_dir support to CrossEncoder #2784

Merged
merged 9 commits into UKPLab:master on Sep 10, 2024

Conversation

@RoyBA (Contributor) commented Jun 26, 2024

Support for cache_dir Argument to CrossEncoder

Description:

This pull request introduces the cache_dir argument to the CrossEncoder class, enabling users to specify a directory for caching model files. This addition resolves the issue of inconsistent file locations when working with local_files_only=True.

Problem:

When initializing the CrossEncoder class with local_files_only=True, the following methods download and cache files to different directories:

  • AutoConfig.from_pretrained
  • AutoModelForSequenceClassification.from_pretrained
  • AutoTokenizer.from_pretrained

This inconsistency arises because CrossEncoder provides no direct way to pass the cache_dir argument through to AutoConfig.from_pretrained, leading to config files being stored in the HF_HOME cache directory while the PyTorch model files are stored in the specified cache_dir.
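
For illustration, a hedged sketch of the resulting split (not the actual CrossEncoder internals; the model name and path below are placeholders):

    # Hypothetical reproduction of the caching split described above.
    from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

    model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # placeholder model
    cache_dir = "/tmp/crossencoder-cache"                # placeholder path

    # CrossEncoder exposes no way to forward cache_dir here, so the config
    # is cached under the default HF_HOME location ...
    config = AutoConfig.from_pretrained(model_name)

    # ... while the weights and tokenizer can receive cache_dir (via
    # automodel_args / tokenizer_args) and end up in a different directory.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, config=config, cache_dir=cache_dir
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)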

Solution:

To address this issue, I utilized the snapshot_download function from the huggingface_hub library. This function allows for downloading a complete snapshot of a repo's files, including the model, config, and tokenizer files, and stores them in the specified cache_dir. By passing the cache_dir argument to the CrossEncoder class, users can now ensure that all files are consistently stored in the same directory.
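
A minimal sketch of this approach (the helper name and defaults are illustrative, not the exact PR code):

    from typing import Optional

    from huggingface_hub import snapshot_download
    from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

    def load_from_single_cache(model_name: str, cache_dir: Optional[str] = None,
                               local_files_only: bool = False):
        # Download the full repo snapshot (config, tokenizer, weights) into
        # cache_dir, or resolve the existing snapshot path from that cache.
        snapshot_path = snapshot_download(
            repo_id=model_name,
            cache_dir=cache_dir,
            local_files_only=local_files_only,
        )
        # All three loaders now read from the same local snapshot directory.
        config = AutoConfig.from_pretrained(snapshot_path)
        model = AutoModelForSequenceClassification.from_pretrained(snapshot_path, config=config)
        tokenizer = AutoTokenizer.from_pretrained(snapshot_path)
        return config, model, tokenizer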

Changes:

  1. Added the cache_dir argument to the CrossEncoder class __init__ method.
  2. Utilized the snapshot_download function to download the model files or get the path to the relevant snapshot from the cache.
  3. Passed the downloaded snapshot path to AutoConfig.from_pretrained, AutoModelForSequenceClassification.from_pretrained, and AutoTokenizer.from_pretrained methods.

Testing:

  • Verified that the CrossEncoder class correctly loads model files from the specified cache_dir.
  • Ensured backward compatibility by testing without the cache_dir argument, confirming that files are still downloaded to the default directories.

@tomaarsen (Collaborator) commented:

Hello!

I'm trying to think whether it would also be sufficient to pass the cache_dir to the from_pretrained, a bit like we do here: https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/models/Transformer.py#L53

Then we can avoid loading the model ourselves and rely more on transformers instead (which is my preference).

  • Tom Aarsen

@RoyBA (Contributor, Author) commented Jun 28, 2024

Hi!

Yep, you're right, it’s better to rely on transformers to load the model. I’ve updated the CrossEncoder class to include the config_args parameter and passed it to AutoConfig.from_pretrained as you suggested. Now, cache_dir can be passed through config_args, ensuring the same behavior as with AutoModelForSequenceClassification and AutoTokenizer.
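
Roughly, the loading path now looks like this sketch (hypothetical class name and a trimmed signature; the real CrossEncoder takes more parameters):

    from typing import Optional

    from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

    class CrossEncoderSketch:
        def __init__(self, model_name: str, automodel_args: Optional[dict] = None,
                     tokenizer_args: Optional[dict] = None, config_args: Optional[dict] = None):
            automodel_args = automodel_args or {}
            tokenizer_args = tokenizer_args or {}
            config_args = config_args or {}
            # cache_dir (and other loader options) can now reach AutoConfig via
            # config_args, matching what automodel_args/tokenizer_args already allow.
            self.config = AutoConfig.from_pretrained(model_name, **config_args)
            self.model = AutoModelForSequenceClassification.from_pretrained(
                model_name, config=self.config, **automodel_args
            )
            self.tokenizer = AutoTokenizer.from_pretrained(model_name, **tokenizer_args)

The downside, as discussed below, is that keeping everything in one cache still means repeating cache_dir in all three *_args dicts.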

Thank you for the feedback! Please let me know if there are any further adjustments needed.

@tomaarsen (Collaborator) commented Jun 28, 2024

Nice! I think this is a good direction.
That said, perhaps we can do both: in https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/models/Transformer.py I also have a cache_dir parameter, because it's inconvenient to have to pass it in 3 separate args.

Just like

    trust_remote_code: bool = False,
    revision: Optional[str] = None,
    local_files_only: bool = False,

these would otherwise also need to be passed to all 3 configs.


There's one difficult consideration that remains: the naming. @muellerzr, would you be able to help me with that?
There's a lot of parameter naming left over from before I picked up the project, and it doesn't correspond with the naming schemes from transformers.

Here's the setup:
sentence-transformers currently supports 2 model types: SentenceTransformer and CrossEncoder.

I have concerns regarding keyword arguments:

  • SentenceTransformer uses model_kwargs, tokenizer_kwargs, and config_kwargs.
  • CrossEncoder uses automodel_args and tokenizer_args, no config args yet (this PR might change that).
  • Transformer uses model_args and tokenizer_args and config_args.
  • I believe the transformers project uses ..._kwargs because they are just keyword arguments.

And cache directory:

  • SentenceTransformer uses cache_folder
  • CrossEncoder doesn't use it yet (this PR might change that)
  • Transformer uses cache_dir
  • the transformers project uses cache_dir everywhere.

I'm considering a big (but soft) deprecation that introduces the same format for all parameters across the project. The old argument names would still work until e.g. v4.0.0, but they'd give a warning to upgrade to the new parameters instead. Perhaps with a decorator like this and this. I'd love to hear your thoughts @muellerzr.
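
As an illustration of the kind of decorator that could back such a soft deprecation (a hypothetical sketch, not the decorators linked above):

    import functools
    import warnings

    def renamed_kwarg(old: str, new: str):
        """Accept `old` as an alias for `new`, emitting a deprecation warning."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                if old in kwargs:
                    warnings.warn(
                        f"`{old}` is deprecated; use `{new}` instead.",
                        DeprecationWarning,
                        stacklevel=2,
                    )
                    kwargs.setdefault(new, kwargs.pop(old))
                return func(*args, **kwargs)
            return wrapper
        return decorator

    # Hypothetical usage: keep `cache_folder` working while steering users to `cache_dir`.
    # @renamed_kwarg("cache_folder", "cache_dir")
    # def __init__(self, model_name, cache_dir=None, ...): ...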

cc @osanseviero

  • Tom Aarsen

@RoyBA (Contributor, Author) commented Jun 28, 2024

Yeah, I totally agree with you. I've added the cache_dir argument back and passed it to all three from_pretrained methods (AutoConfig, AutoModelForSequenceClassification, and AutoTokenizer).
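
With this in place, usage would look roughly like the following (hedged example; the model name and path are placeholders):

    from sentence_transformers import CrossEncoder

    # A single cache_dir keeps config, tokenizer, and weights in one place.
    model = CrossEncoder(
        "cross-encoder/ms-marco-MiniLM-L-6-v2",
        cache_dir="/tmp/crossencoder-cache",
    )
    scores = model.predict([
        ("How many people live in Berlin?",
         "Berlin has a population of about 3.7 million."),
    ])
    print(scores)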

@tomaarsen (Collaborator) commented:

Much appreciated! Other than the naming things that I mentioned & will wait for Zach to get back to me on, I think this is almost ready to go :)
Thanks for updating the PR so quickly with my recommendations.

  • Tom Aarsen

@tomaarsen changed the title from "Add cache_dir support to CrossEncoder" to "[feat] Add cache_dir support to CrossEncoder" on Sep 10, 2024
Shouldn't be (too) breaking as they're far down the list of kwargs
@tomaarsen merged commit 1cd91ab into UKPLab:master on Sep 10, 2024
11 checks passed
@tomaarsen (Collaborator) commented:

Thanks a bunch! And my apologies for the delay. This will be included in the next release.

  • Tom Aarsen
