
[feat] Add cache_dir support to CrossEncoder #2784

Merged
merged 9 commits into UKPLab:master on Sep 10, 2024

Conversation

@RoyBA (Contributor) commented Jun 26, 2024

Support for cache_dir Argument to CrossEncoder

Description:

This pull request introduces the cache_dir argument to the CrossEncoder class, enabling users to specify a directory for caching model files. This addition resolves the issue of inconsistent file locations when working with local_files_only=True.

Problem:

When initializing the CrossEncoder class with local_files_only=True, the following methods download and cache files to different directories:

  • AutoConfig.from_pretrained
  • AutoModelForSequenceClassification.from_pretrained
  • AutoTokenizer.from_pretrained

This inconsistency arises because CrossEncoder provides no direct way to pass the cache_dir argument through to AutoConfig.from_pretrained, leading to config files being stored in the HF_HOME cache directory while the PyTorch model files are stored in the specified cache_dir.
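
For illustration, a hedged sketch of the resulting split (not the actual CrossEncoder internals; the model name and path below are placeholders):

    # Hypothetical reproduction of the caching split described above.
    from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

    model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # placeholder model
    cache_dir = "/tmp/crossencoder-cache"                # placeholder path

    # CrossEncoder exposes no way to forward cache_dir here, so the config
    # is cached under the default HF_HOME location ...
    config = AutoConfig.from_pretrained(model_name)

    # ... while the weights and tokenizer can receive cache_dir (via
    # automodel_args / tokenizer_args) and end up in a different directory.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, config=config, cache_dir=cache_dir
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)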

Solution:

To address this issue, I utilized the snapshot_download function from the huggingface_hub library. This function allows for downloading a complete snapshot of a repo's files, including the model, config, and tokenizer files, and stores them in the specified cache_dir. By passing the cache_dir argument to the CrossEncoder class, users can now ensure that all files are consistently stored in the same directory.
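
A minimal sketch of this approach (the helper name and defaults are illustrative, not the exact PR code):

    from typing import Optional

    from huggingface_hub import snapshot_download
    from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

    def load_from_single_cache(model_name: str, cache_dir: Optional[str] = None,
                               local_files_only: bool = False):
        # Download the full repo snapshot (config, tokenizer, weights) into
        # cache_dir, or resolve the existing snapshot path from that cache.
        snapshot_path = snapshot_download(
            repo_id=model_name,
            cache_dir=cache_dir,
            local_files_only=local_files_only,
        )
        # All three loaders now read from the same local snapshot directory.
        config = AutoConfig.from_pretrained(snapshot_path)
        model = AutoModelForSequenceClassification.from_pretrained(snapshot_path, config=config)
        tokenizer = AutoTokenizer.from_pretrained(snapshot_path)
        return config, model, tokenizer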

Changes:

  1. Added the cache_dir argument to the CrossEncoder class __init__ method.
  2. Utilized the snapshot_download function to download the model files or get the path to the relevant snapshot from the cache.
  3. Passed the downloaded snapshot path to AutoConfig.from_pretrained, AutoModelForSequenceClassification.from_pretrained, and AutoTokenizer.from_pretrained methods.

Testing:

  • Verified that the CrossEncoder class correctly loads model files from the specified cache_dir.
  • Ensured backward compatibility by testing without the cache_dir argument, confirming that files are still downloaded to the default directories.

@tomaarsen (Collaborator) commented:

Hello!

I'm trying to think whether it would also be sufficient to pass the cache_dir to the from_pretrained, a bit like we do here: https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/models/Transformer.py#L53

Then we can avoid loading the model ourselves and rely more on transformers instead (which is my preference).

  • Tom Aarsen

@RoyBA (Contributor, Author) commented Jun 28, 2024

Hi!

Yep, you're right, it’s better to rely on transformers to load the model. I’ve updated the CrossEncoder class to include the config_args parameter and passed it to AutoConfig.from_pretrained as you suggested. Now, cache_dir can be passed through config_args, ensuring the same behavior as with AutoModelForSequenceClassification and AutoTokenizer.
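
Roughly, the loading path now looks like this sketch (hypothetical class name and a trimmed signature; the real CrossEncoder takes more parameters):

    from typing import Optional

    from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

    class CrossEncoderSketch:
        def __init__(self, model_name: str, automodel_args: Optional[dict] = None,
                     tokenizer_args: Optional[dict] = None, config_args: Optional[dict] = None):
            automodel_args = automodel_args or {}
            tokenizer_args = tokenizer_args or {}
            config_args = config_args or {}
            # cache_dir (and other loader options) can now reach AutoConfig via
            # config_args, matching what automodel_args/tokenizer_args already allow.
            self.config = AutoConfig.from_pretrained(model_name, **config_args)
            self.model = AutoModelForSequenceClassification.from_pretrained(
                model_name, config=self.config, **automodel_args
            )
            self.tokenizer = AutoTokenizer.from_pretrained(model_name, **tokenizer_args)

The downside, as discussed below, is that keeping everything in one cache still means repeating cache_dir in all three *_args dicts.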

Thank you for the feedback! Please let me know if there are any further adjustments needed.

@tomaarsen (Collaborator) commented Jun 28, 2024

Nice! I think this is a good direction.
That said, perhaps we can do both: in https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/models/Transformer.py I also have a cache_dir parameter, because it's inconvenient to have to pass it in 3 separate args.

Just like

    trust_remote_code: bool = False,
    revision: Optional[str] = None,
    local_files_only: bool = False,

these would otherwise also need to be passed to all 3 configs.


There's one difficult consideration that remains: the naming. @muellerzr, would you be able to help me with that?
There's a lot of parameter naming left over from before I picked up the project, and it doesn't correspond with the naming schemes from transformers.

Here's the setup:
sentence-transformers currently supports 2 model types: SentenceTransformer and CrossEncoder.

I have concerns regarding keyword arguments:

  • SentenceTransformer uses model_kwargs, tokenizer_kwargs, and config_kwargs.
  • CrossEncoder uses automodel_args and tokenizer_args, no config args yet (this PR might change that).
  • Transformer uses model_args and tokenizer_args and config_args.
  • I believe the transformers project uses ..._kwargs because they are just keyword arguments.

And cache directory:

  • SentenceTransformer uses cache_folder
  • CrossEncoder doesn't use it yet (this PR might change that)
  • Transformer uses cache_dir
  • the transformers project uses cache_dir everywhere.

I'm considering a big (but soft) deprecation that introduces the same format for all parameters across the project. The old argument names would still work until e.g. v4.0.0, but they'd give a warning to upgrade to the new parameters instead. Perhaps with a decorator like this and this. I'd love to hear your thoughts @muellerzr.
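
As an illustration of the kind of decorator that could back such a soft deprecation (a hypothetical sketch, not the decorators linked above):

    import functools
    import warnings

    def renamed_kwarg(old: str, new: str):
        """Accept `old` as an alias for `new`, emitting a deprecation warning."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                if old in kwargs:
                    warnings.warn(
                        f"`{old}` is deprecated; use `{new}` instead.",
                        DeprecationWarning,
                        stacklevel=2,
                    )
                    kwargs.setdefault(new, kwargs.pop(old))
                return func(*args, **kwargs)
            return wrapper
        return decorator

    # Hypothetical usage: keep `cache_folder` working while steering users to `cache_dir`.
    # @renamed_kwarg("cache_folder", "cache_dir")
    # def __init__(self, model_name, cache_dir=None, ...): ...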

cc @osanseviero

  • Tom Aarsen

@RoyBA (Contributor, Author) commented Jun 28, 2024

Yeah, I totally agree with you. I've added the cache_dir argument back and passed it to all three from_pretrained methods (AutoConfig, AutoModelForSequenceClassification, and AutoTokenizer).
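
With this in place, usage would look roughly like the following (hedged example; the model name and path are placeholders):

    from sentence_transformers import CrossEncoder

    # A single cache_dir keeps config, tokenizer, and weights in one place.
    model = CrossEncoder(
        "cross-encoder/ms-marco-MiniLM-L-6-v2",
        cache_dir="/tmp/crossencoder-cache",
    )
    scores = model.predict([
        ("How many people live in Berlin?",
         "Berlin has a population of about 3.7 million."),
    ])
    print(scores)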

@tomaarsen (Collaborator) commented:

Much appreciated! Other than the naming things that I mentioned & will wait for Zach to get back to me on, I think this is almost ready to go :)
Thanks for updating the PR so quickly with my recommendations.

  • Tom Aarsen

@tomaarsen changed the title from "Add cache_dir support to CrossEncoder" to "[feat] Add cache_dir support to CrossEncoder" on Sep 10, 2024
Shouldn't be (too) breaking as they're far down the list of kwargs
@tomaarsen merged commit 1cd91ab into UKPLab:master on Sep 10, 2024
11 checks passed
@tomaarsen (Collaborator) commented:

Thanks a bunch! And my apologies for the delay. This will be included in the next release.

  • Tom Aarsen
