[WIP] Proof of concept of integration with Hugging Face Hub #3206

osanseviero · 2021-07-29T13:28:31Z

Hi Gensim team! I hereby propose a proof of concept integration with the Hugging Face Hub

As mentioned in https://groups.google.com/g/gensim/c/47hWFeRDJOA and discussed in Twitter, this integration would allow you and your users to freely download and host models from the Hugging Face Hub (https://huggingface.co/). The current implementation just adds downloading models from the Hub, here is an example which downloads https://huggingface.co/Gensim/glove-twitter-25:

import gensim.downloader as api

model = api.load("glove-twitter-25", from_hf=True)
print(model.most_similar("cat"))

Sorry for the lack of tests, I wanted to share an early demo to see if this is a line of work that you're interested. Please let me know your thoughts

Some follow-ups:

Retrieving info on all models from the Hugging Face Hub.
Adding widgets that will allow users to try out the models directly in the browser.
Add datasets to the Hugging Face Hub.
Add different UI items to the Hub, such as filtering for all gensim models and adding automatic code snippets.

FYI @LysandreJik, @julien-c, @lhoestq.

gojomo · 2021-07-29T20:29:01Z

For reasons which predate this integration demo, I believe the gensim.downloader, as currenly implemented, is a security risk that would be better to eliminate rather than expand. For more details, see #2283.

(To the extent there could be value downloading raw models from the HF hub, a purposeful API for narrowly doing that would be preferable to a parameterized from_hf switch layering more custom behavior into gensim.downloader. Is there a huggingface-hub-operation library/package that could just be imported when need?)

piskvorky · 2021-07-30T07:30:59Z

Thanks @osanseviero ! I'm +1 on supporting external data storages in general, including from Hugging Face Hub. Seeing as there hasn't been much activity on our gensim-data storage (none of us have the time), I'll be happy to offload this functionality to a more active hub.

I don't have the same reservations about gensim.downloader that @gojomo has; if a malicious person takes over our Github repos (gensim, gensim-data, smart_open…), they can make our users execute arbitrary code anyway. Same if they take over our PyPI account. No major additional attack surface there.

Some questions:

Who uploads these models to HFH and how?
What's the maintenance process? E.g. when we want to modify something or if there's a problem.
All model releases in gensim-data are immutable (by agreement). So once released, that model stays forever; any updates / fixes must come as a new, separate model release. Can HFH guarantee that also technically?

osanseviero · 2021-07-30T07:56:10Z

Let me answer your questions

Who uploads these models to HFH and how?

For the canonical models (the existing models under Releases), the models would be in the Gensim organization in the Hub. https://huggingface.co/Gensim. Only organization members would be able to upload models.

If it's of interest for the community members to upload their own models, they can do so by uploading to their account (not Gensim organization). I think this might bring security concerns though (running any code), but this can be supported if it's of interest. (In the current PR all downloaded models are from Gensim organization which you would have full control)

What's the maintenance process? E.g. when we want to modify something or if there's a problem.

Answering this and the previous question, all Hugging Face repos are git-based. This means that you can simply do git clone, git commit, git push and your standard git workflows in order to upload new models or update existing ones. There is also commit history, versioning and branches that should help with this. https://huggingface.co/docs/hub/main#whats-a-repository

All model releases in gensim-data are immutable (by agreement). So once released, that model stays forever; any updates / fixes must come as a new, separate model release. Can HFH guarantee that also technically?

I don't think we have guarantees for this, but if model repos are not updated, then there should be no changes. For new models there can be new model repos.

gojomo · 2021-07-30T17:56:02Z

I don't have the same reservations about gensim.downloader that @gojomo has; if a malicious person takes over our Github repos (gensim, gensim-data, smart_open…), they can make our users execute arbitrary code anyway. Same if they take over our PyPI account. No major additional attack surface there.

A key difference is those takeovers would create a record, & require extra build/packaging steps, & pass through package-managers with their own recordkeeping. The current gensim.downloader process causes users to run novel code that (1) has no git-version history at all; (2) is never packaged/released/installed via package manager; (3) is downloaded & executed in response to a .load() method call whose name & docs give no idea is installing/running arbitrary new executable code.

osanseviero · 2021-08-02T08:52:40Z

I'm not an expert on the topic, but my experience with other libraries is that this approach is very frequent.

For example, sklearn (and other libraries) uses joblib and pickle which run arbitrary code (security warning in sklearn, warning in joblib), but are still widely used for model sharing. Installing spacy pipelines is done through installing wheel files (pip install hf.co/.../en_core.whl), which shows similar concerns to what you just shared.

If models are only installed from a Hugging Face organization managed by you, as @piskvorky mentioned, the only risk would be if someone enters your account and changes a repo, which would mean larger security concerns, and there is git-based versioning built in so mitigation should be straightforward.

gojomo · 2021-08-02T17:24:40Z

Of course it's true that any code using unpickle on an untrusted file could inadvertently run arbitrary code from that file, even if it's thought of as mere data. Warnings reminding users of this fact should appear near any APIs using unpickle – the Gensim docs for .load() on each class should be more explicit about this.

But also, this means projects that want to distribute pickle-based files responsibly should carefully version those files, with clear audit logs of the who/how/when of their creation & release. I'm unsure if HuggingFace Hub does this. If it doesn't, and it welcomes & promotes pickle-based models from untrusted sources, then someday a bad actor will use the HuggingFace Hub to compromise its users with a juicy-seeming but trojaned model.

The various data bundles offered via gensim-data aren't version controlled, and as far as I can tell their replacement in Github's 'release' file download area can occur without any notifications to maintainers or externally-visible logging of the changes. That's also a concern! But the inclusion of a __init__.py shim of Python source-code, outside any version-control or release process, that's downloaded & run without any warnings, is still an escalation over the risks of pickle files.

A fix for Gensim, or other projects in the same position, could be for any expansion of the data offerings to be clearly attributed to a responsible contributor, reviewed via normal release processes, and put into a named release with frozen artifacts. This can still pull the bulk of data from elsewhere, when needed later, on-demand - but the code would insist on a specific artifact by secure hash, ensuring the intent of the responsible contributor can't be silently/secretly corrupted, and that any/all users of the same fixed release are running the same vetted code/data (creating "safety in numbers").

piskvorky · 2021-08-02T18:17:24Z

Changing how the gensim-data models and corpora are uploaded / maintained is definitely possible – pending a strong contributor to actually implement it.

If I understand correctly, the options now are:

Rethink and reimplement the whole gensim-data API first, before considering integration with HFH.
Pros: We can architect a better system.
Cons: Non-trivial work; keep in mind no one has stepped up to do it for years.
Accept this PR.
Pros: Useful, to at least N users, with N >= 3 :) (real user base for this HFH-Gensim integration unclear)
Cons: Adding a bit of complexity we may eventually replace via 1) or 3).
Other?

Ideally, we'd sit down one weekend and do 1) = rewrite the gensim-data API properly. Or possibly outsource the whole "models/corpora management" problem altogether, if it turns out something sufficient already exists. With "sufficient" both in terms of technology and long-term stability (avoiding any hard-dependencies on startups, "it's been a great journey kthxbye").

Personally, 1) is too low on my list of priorities to ever happen. So unless there's sufficient interest from you @gojomo or others, I'm in favour of 2). Because 1) can be done later anyway.

gojomo · 2021-08-02T18:53:50Z

I don't get the point of this integration. If HuggingFace Hub offers convenient curated models and/or datasets, my expectation would be that a user should be able to do something like...

% pip install huggingface-hub
% python -m huggingface_hub snapshot_download "glove-twitter-25"

...then work with the explicit file(s) that were downloaded. They'd do this from their own code, perhaps leveraging any example code that's included as part of the bundle.

There's no need to layer extra cryptic, repository-specific downloading flags (from_hf=True) into an already dubious feature – which also would make gensim.downloader harder to deprecate or fix.

It's also a bad idea to scatter responsibility for this download-flow across two packages (Gensim & huggingface-hub) when it'd work just fine in one, with a Huggingface-maintained library fully handling the Huggingface-specific usability, security concerns, documentation, failure-modes, and so forth. It seems to me that all the code in the if from_hf: branch of this PR can more usefully live inside the huggingface_hub package - providing any theorized user benefit, without the tech debt of extra project coupling.

Regarding the independent issuse of what should happen to gensim.downloader, I'll discuss that on #2283.

osanseviero · 2021-08-02T20:59:49Z

I don't get the point of this integration. If HuggingFace Hub offers convenient curated models and/or datasets, [...]

It's also a bad idea to scatter responsibility for this download-flow across two packages (Gensim & huggingface-hub) [...]. It seems to me that all the code in the if from_hf: branch of this PR can more usefully live inside the huggingface_hub package - providing any theorized user benefit, without the tech debt of extra project coupling.

Let me reply some of the points since maybe I was not clear on the benefits of using the Hugging Face Hub. I agree that scattering responsibility might not be the best idea, but this PR is mostly a proof of concept to begin this discussion. I would actually suggest to replace the Gensim data release system in favour of the Hugging Face Hub. You mention that this logic should live in the huggingface-hub library, but if you look at the if from_hf branch, it's only 4 lines of code and it's specific to Gensim.

from huggingface_hub import snapshot_download
name = "Gensim/" + name
download_dir = snapshot_download(name, cache_dir=BASE_DIR)
os.rename(download_dir, data_folder_dir)

I would also like to take a step back on the larger benefits of this integration and how I see this going.

There would be a Gensim organization in which only the core contributors of Gensim have access.
@gojomo, all Hugging Face models are git-based repos. In SECURITY: api.load() recklessly downloads & runs arbitrary python code #2283 you mention the lack of attribution as an issue. With Hugging Face changes to model repositories would be properly authored in the commit. So there is a clear audit log of who/how/when (author, diff, date) of each change. You can find more about HF repos here.
You also mention "There's zero documentation of what kind of object this opaque magic call just delivered.". Model repos usually have Model Cards that document the model (example, which will properly document things from overall usage to intended uses, biases and training procedures.
Given that only members of the Gensim organization would have edit access to the repositories, you can set the documentation standards that you deem correct for the repos.

I think #2283 presents good questions and concerns in regards to what's the best way to share pretrained models for Gensim, I just wanted to make sure there's clear understanding of the potential benefits of this integration.

gojomo · 2021-08-03T01:05:42Z

Let me reply some of the points since maybe I was not clear on the benefits of using the Hugging Face Hub. I agree that scattering responsibility might not be the best idea, but this PR is mostly a proof of concept to begin this discussion. I would actually suggest to replace the Gensim data release system in favour of the Hugging Face Hub. You mention that this logic should live in the huggingface-hub library, but if you look at the if from_hf branch, it's only 4 lines of code and it's specific to Gensim.
from huggingface_hub import snapshot_download
name = "Gensim/" + name
download_dir = snapshot_download(name, cache_dir=BASE_DIR)
os.rename(download_dir, data_folder_dir)

Indeed - it's tiny bolt-on in an odd place, and any benefits for users to do Hub-specific operations would seem better implemented in the huggingface_hub package.

I would also like to take a step back on the larger benefits of this integration and how I see this going.

There would be a Gensim organization in which only the core contributors of Gensim have access.

@gojomo, all Hugging Face models are git-based repos. In SECURITY: api.load() recklessly downloads & runs arbitrary python code #2283 you mention the lack of attribution as an issue. With Hugging Face changes to model repositories would be properly authored in the commit. So there is a clear audit log of who/how/when (author, diff, date) of each change. You can find more about HF repos here.

You also mention "There's zero documentation of what kind of object this opaque magic call just delivered.". Model repos usually have Model Cards that document the model (example, which will properly document things from overall usage to intended uses, biases and training procedures.

Given that only members of the Gensim organization would have edit access to the repositories, you can set the documentation standards that you deem correct for the repos.

I think #2283 presents good questions and concerns in regards to what's the best way to share pretrained models for Gensim, I just wanted to make sure there's clear understanding of the potential benefits of this integration.

Here's the thing: not a single dataset/model in gensim-data was created by the Gensim team. They're all just copied from other places for durability.

The models/formats we have so far aren't unique to Gensim. I don't think any of the existing files even use pickle formatting, so that risky avenue for inadvertently running arbitrary untrusted code isn't even a current consideration. (If we started sharing our own Gensim-native model saves, that'd become a factor.)

The datasets we've imported are a random grab-bag & I don't think we know (or even could know?) which, if any, are getting usage – compared to people who'd more likely go to the original sources, wherever they're still up.

Github hosting has a longer track record of providing stable URLs – learning your system requires a new set of maintainer accounts, provider-specific workflows, and a new dependency on a startup that might go away.

The discipline of documenting things via your 'model cards' might be good, but I have a hard time imagining any Gensim contributor further fleshing out these descriptions, of 3rd-party models, for the benefit of Huggingface Hub users. This is more-or-less an orphaned part of the project, not very central, not very maintained, & not the subject of any upcoming planned improvements. It's not been touched in over 3 years.

So I tend to think you're barking up the wrong tree.

If you like the particular datasets/models in gensim-data, I suppose you could add any or all to the HFHub yourself. (They wouldn't need to – & perhaps should not – have any 'Gensim' branding, as Gensim is not the originator of any of them.)

I'd be happy there'd be a redundant source, and if HFHub proves reliable over time I suppose these download options might eventually get mentioned in our docs/examples as an alternate or even preferred source.

julien-c · 2021-08-04T11:26:45Z

So there is a clear audit log of who/how/when (author, diff, date) of each change.

Note it's on our roadmap @huggingface to add support for GPG-signed commits

versae · 2022-10-04T10:50:50Z

It's been over a year and now HF supports GPG-signed commits. Is there any update or extra thought on this? Is handling separately HF snapshot download and Gensim load still the best way to host user models on the hub for use with Gensim?

gojomo · 2022-10-04T18:06:41Z

All my prior preferences/concerns still apply.

Specifically, these aren't really Gensim's datasets, just public datasets that at one point seemed useful to provide via a 2nd reliable download source. Huggingface may already have them under other names, or if Huggingface would find them useful, could add them – ideally from the original, better-documented, surely-canonical sources rather than Gensim's project-peculiar repackagings.

If the same datasets were reliably downloadable/documented from Huggingface, I'd favorably view any contributions which update Gensim docs/tutorials/examples to grab them from Huggingface, using Huggingface packages/APIs, rather than Gensim's mirrors. (That's because I find the current api/downloader approach unwise from a security-hygeine perspective & unhelpfully opaque in what it delivers. Directing coders to acquire primary files of known formats, then using Gensim filepath APIs to explicitly load them, seems better to me on every relevant dimension: explicitness, security, maintainability, teaching profesional practices, etc.)

But separate from potentially preferring the hub as a download source whenever practical, bolting a from_hf= option onto the Gensim downloader – retaining the opacity/indirectness of the current approach, with a redundant 2nd code path that could break separately from the legacy option – makes no sense to me at all. It adds complexity/coupling/duplication for negligible benefit.

lbourdois · 2023-05-15T11:44:01Z

Hi 👋

As part of a project I have to use pre-trained gensim models from different sources. To name a few: http://vectors.nlpl.eu/repository/, https://wikipedia2vec.github.io/wikipedia2vec/pretrained/, https://github.com/dccuchile/spanish-word-embeddings, https://github.com/Embedding/Chinese-Word-Vectors, https://github.com/3Top/word2vec-api, https://github.com/nathanshartmann/portuguese_word_embeddings, https://fauconnier.github.io/#data, etc.
My inventory is still ongoing but I think the final list should contain something between 600 and 1000 models (I'm at 522 right now and haven't yet tried to reference English models that are the most common).

The sources being varied, there is not a single protocol to load them easily (most of the time you have to download them manually, some must be unzipped, others not, the documentation is not necessarily in English, etc.).
So I tell myself that this should lead to a very complicated pipeline to be able to use any model whenever I want. And that to simplify things, it would be enough to be able to put them in a single place on which it would be possible to download them remotely and then simply use them via gensim.
I naturally thought of the HF Hub but found that there were hardly any word embedding type models on it and gensim was not managed 😢
I see that discussions have taken place here.
The line model = api.load("model_name_on_the_hub", from_hf=True) pleased me because it simply answered my problem but does not seem to convince.
The following code does the job very well (the "lbourdois/43" model is here the 43rd http://vectors.nlpl.eu/repository/ model I added to my HF account to be able to do the test below):

from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download
model = KeyedVectors.load_word2vec_format(hf_hub_download(repo_id="lbourdois/43", filename="model.bin"), binary=True, unicode_errors="ignore")
model.most_similar("exemple")

[('exemple?', 0.764797568321228),
 ('spearphishing', 0.7313995361328125),
 ('analogie', 0.7245253324508667),
 ('exemple.dans', 0.7238249182701111),
 ('exemple.on', 0.7150753140449524),
 ('ptplot', 0.7103664875030518),
 ('mail.contoso.com', 0.7083507776260376),
 ('?prenons', 0.7068799734115601),
 ('l’acrostiche', 0.7055863738059998),
 ('concernen', 0.7036089301109314)]

Proceeding in this way does not bother me for my part but I find it a pity that it is a "hidden solution". In the sense that a simple additional argument from_hf=True as proposed would be more easily found by adding a line in the documentation of gensim explaining what it is for, or via the "Use in gensim" button which would then be available on the HF Hub. I think it would also invite people to put their word2vec type models on the HF Hub and improve their referencement, where currently models are split between various sources that are more or less easy to find. It can also be thought that this would lead to the development of demonstrators using these models with gradio or streamlit applications (which could illustrate the gensim documentation).

In short, all this to say that we can indeed do without a from_hf=True but that it would still be a plus to have this option available by making things more user friendly

osanseviero added 2 commits July 29, 2021 13:36

Proof of concept

bff4816

Cleanup

b8834e2

gojomo mentioned this pull request Aug 2, 2021

SECURITY: api.load() recklessly downloads & runs arbitrary python code #2283

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Proof of concept of integration with Hugging Face Hub #3206

[WIP] Proof of concept of integration with Hugging Face Hub #3206

osanseviero commented Jul 29, 2021

gojomo commented Jul 29, 2021

piskvorky commented Jul 30, 2021 •

edited

Loading

osanseviero commented Jul 30, 2021

gojomo commented Jul 30, 2021 •

edited

Loading

osanseviero commented Aug 2, 2021

gojomo commented Aug 2, 2021

piskvorky commented Aug 2, 2021 •

edited

Loading

gojomo commented Aug 2, 2021 •

edited

Loading

osanseviero commented Aug 2, 2021 •

edited

Loading

gojomo commented Aug 3, 2021

julien-c commented Aug 4, 2021

versae commented Oct 4, 2022

gojomo commented Oct 4, 2022

lbourdois commented May 15, 2023

[WIP] Proof of concept of integration with Hugging Face Hub #3206

Are you sure you want to change the base?

[WIP] Proof of concept of integration with Hugging Face Hub #3206

Conversation

osanseviero commented Jul 29, 2021

gojomo commented Jul 29, 2021

piskvorky commented Jul 30, 2021 • edited Loading

osanseviero commented Jul 30, 2021

gojomo commented Jul 30, 2021 • edited Loading

osanseviero commented Aug 2, 2021

gojomo commented Aug 2, 2021

piskvorky commented Aug 2, 2021 • edited Loading

gojomo commented Aug 2, 2021 • edited Loading

osanseviero commented Aug 2, 2021 • edited Loading

gojomo commented Aug 3, 2021

julien-c commented Aug 4, 2021

versae commented Oct 4, 2022

gojomo commented Oct 4, 2022

lbourdois commented May 15, 2023

piskvorky commented Jul 30, 2021 •

edited

Loading

gojomo commented Jul 30, 2021 •

edited

Loading

piskvorky commented Aug 2, 2021 •

edited

Loading

gojomo commented Aug 2, 2021 •

edited

Loading

osanseviero commented Aug 2, 2021 •

edited

Loading