[WIP] Proof of concept of integration with Hugging Face Hub #3206
For reasons which predate this integration demo, I believe the (To the extent there could be value downloading raw models from the HF hub, a purposeful API for narrowly doing that would be preferable to a parameterized |
Thanks @osanseviero ! I'm +1 on supporting external data storages in general, including from Hugging Face Hub. Seeing as there hasn't been much activity on our I don't have the same reservations about Some questions:
|
Let me answer your questions
For the canonical models (the existing models under Releases), the models would be in the Gensim organization in the Hub. https://huggingface.co/Gensim. Only organization members would be able to upload models. If it's of interest for the community members to upload their own models, they can do so by uploading to their account (not Gensim organization). I think this might bring security concerns though (running any code), but this can be supported if it's of interest. (In the current PR all downloaded models are from Gensim organization which you would have full control)
Answering this and the previous question, all Hugging Face repos are git-based. This means that you can simply do
I don't think we have guarantees for this, but if model repos are not updated, then there should be no changes. For new models there can be new model repos. |
A key difference is those takeovers would create a record, & require extra build/packaging steps, & pass through package-managers with their own recordkeeping. The current |
I'm not an expert on the topic, but my experience with other libraries is that this approach is very frequent. For example, If models are only installed from a Hugging Face organization managed by you, as @piskvorky mentioned, the only risk would be if someone enters your account and changes a repo, which would mean larger security concerns, and there is git-based versioning built in so mitigation should be straightforward. |
Of course it's true that any code using But also, this means projects that want to distribute The various data bundles offered via A fix for Gensim, or other projects in the same position, could be for any expansion of the data offerings to be clearly attributed to a responsible contributor, reviewed via normal release processes, and put into a named release with frozen artifacts. This can still pull the bulk of data from elsewhere, when needed later, on-demand - but the code would insist on a specific artifact by secure hash, ensuring the intent of the responsible contributor can't be silently/secretly corrupted, and that any/all users of the same fixed release are running the same vetted code/data (creating "safety in numbers"). |
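The "insist on a specific artifact by secure hash" idea above can be sketched roughly as follows. This is a minimal illustration, not anything Gensim ships: the helper names `sha256_of` and `verify_artifact` are hypothetical, and it assumes the expected digest was recorded in a reviewed, frozen release.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_sha256):
    """Hypothetical helper: refuse a downloaded file whose hash differs from the pinned one.

    `expected_sha256` is assumed to come from a vetted release, not from the
    same server that hosts the data.
    """
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"hash mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
    return Path(path)
```

With a check like this, the bulk data can live on any host, and a silent change to the remote file surfaces as an error before anything is loaded.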
Changing how the If I understand correctly, the options now are:
Ideally, we'd sit down one weekend and do 1) = rewrite the Personally, 1) is too low on my list of priorities to ever happen. So unless there's sufficient interest from you @gojomo or others, I'm in favour of 2). Because 1) can be done later anyway. |
I don't get the point of this integration. If HuggingFace Hub offers convenient curated models and/or datasets, my expectation would be that a user should be able to do something like...
...then work with the explicit file(s) that were downloaded. They'd do this from their own code, perhaps leveraging any example code that's included as part of the bundle. There's no need to layer extra cryptic, repository-specific downloading flags ( It's also a bad idea to scatter responsibility for this download-flow across two packages (Gensim & huggingface-hub) when it'd work just fine in one, with a Huggingface-maintained library fully handling the Huggingface-specific usability, security concerns, documentation, failure-modes, and so forth. It seems to me that all the code in the Regarding the independent issue of what should happen to |
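One way the two-step flow described above (download with the Hub's own library, then load explicit files with Gensim) might look is sketched below. It assumes `huggingface_hub.snapshot_download` and `gensim.models.KeyedVectors.load`. The wrapper `load_vectors_from_hub`, its `download`/`load` parameters, and the file name used in the usage comment are hypothetical and not taken from this thread.

```python
import os


def load_vectors_from_hub(repo_id, filename, revision=None, download=None, load=None):
    """Hypothetical wrapper: fetch a Hub repo snapshot, then load one file from it.

    `download` and `load` can be overridden (for example in tests); by default
    they resolve to huggingface_hub.snapshot_download and KeyedVectors.load.
    """
    if download is None:
        # Imported lazily so the sketch can be read and tested without the package.
        from huggingface_hub import snapshot_download
        download = snapshot_download
    if load is None:
        from gensim.models import KeyedVectors
        load = KeyedVectors.load

    # Step 1: the Hub library handles transfer and caching. Passing a commit
    # hash as `revision` pins the snapshot to one git revision.
    local_dir = download(repo_id=repo_id, revision=revision)

    # Step 2: Gensim only ever sees an explicit local file path.
    return load(os.path.join(local_dir, filename))


# Usage (needs network access; the file name is an assumption about the repo):
# vectors = load_vectors_from_hub("Gensim/glove-twitter-25", "glove-twitter-25.model")
```

Fetching the whole snapshot rather than a single file is a deliberate choice here, since Gensim may store large arrays in sidecar files next to the main model file.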
Let me reply to some of the points, since maybe I was not clear on the benefits of using the Hugging Face Hub. I agree that scattering responsibility might not be the best idea, but this PR is mostly a proof of concept to begin this discussion. I would actually suggest replacing the Gensim data release system with the Hugging Face Hub. You mention that this logic should live in the
I would also like to take a step back on the larger benefits of this integration and how I see this going.
I think #2283 raises good questions and concerns about the best way to share pretrained models for |
Indeed - it's a tiny bolt-on in an odd place, and any benefits for users to do Hub-specific operations would seem better implemented in the
Here's the thing: not a single dataset/model in The models/formats we have so far aren't unique to Gensim. I don't think any of the existing files even use The datasets we've imported are a random grab-bag & I don't think we know (or even could know?) which, if any, are getting usage – compared to people who'd more likely go to the original sources, wherever they're still up. Github hosting has a longer track record of providing stable URLs – learning your system requires a new set of maintainer accounts, provider-specific workflows, and a new dependency on a startup that might go away. The discipline of documenting things via your 'model cards' might be good, but I have a hard time imagining any Gensim contributor further fleshing out these descriptions, of 3rd-party models, for the benefit of Huggingface Hub users. This is more-or-less an orphaned part of the project, not very central, not very maintained, & not the subject of any upcoming planned improvements. It's not been touched in over 3 years. So I tend to think you're barking up the wrong tree. If you like the particular datasets/models in I'd be happy there'd be a redundant source, and if HFHub proves reliable over time I suppose these download options might eventually get mentioned in our docs/examples as an alternate or even preferred source. |
Note it's on our roadmap @huggingface to add support for GPG-signed commits |
It's been over a year, and HF now supports GPG-signed commits. Is there any update or further thought on this? Is handling the HF snapshot download and the Gensim load separately still the best way to host user models on the Hub for use with Gensim? |
All my prior preferences/concerns still apply. Specifically, these aren't really Gensim's datasets, just public datasets that at one point seemed useful to provide via a 2nd reliable download source. Huggingface may already have them under other names, or if Huggingface would find them useful, could add them – ideally from the original, better-documented, surely-canonical sources rather than Gensim's project-peculiar repackagings. If the same datasets were reliably downloadable/documented from Huggingface, I'd favorably view any contributions which update Gensim docs/tutorials/examples to grab them from Huggingface, using Huggingface packages/APIs, rather than Gensim's mirrors. (That's because I find the current But separate from potentially preferring the hub as a download source whenever practical, bolting a |
Hi 👋 As part of a project I have to use pre-trained Because the sources vary, there is no single protocol for loading them easily (most of the time you have to download them manually, some must be unzipped, others not, the documentation is not necessarily in English, etc.).
Proceeding this way does not bother me personally, but I find it a pity that it is a "hidden solution", in the sense that a simple additional argument In short, all this to say that we can indeed do without a |
Hi Gensim team! I hereby propose a proof-of-concept integration with the Hugging Face Hub.
As mentioned in https://groups.google.com/g/gensim/c/47hWFeRDJOA and discussed on Twitter, this integration would allow you and your users to freely download and host models from the Hugging Face Hub (https://huggingface.co/). The current implementation only adds downloading models from the Hub. Here is an example that downloads https://huggingface.co/Gensim/glove-twitter-25:
Sorry for the lack of tests; I wanted to share an early demo to see if this is a line of work that you're interested in. Please let me know your thoughts.
Some follow-ups:
gensim models and adding automatic code snippets.
FYI @LysandreJik, @julien-c, @lhoestq.