
Can't add vector to pretrained fasttext model via .add_vector #3224

Open
om-hb opened this issue Aug 31, 2021 · 8 comments
Labels
bug Issue described a bug impact HIGH Show-stopper for affected users reach MEDIUM Affects a significant number of users

Comments

@om-hb

om-hb commented Aug 31, 2021

Problem description

I'm trying to add a new vector to a pretrained fastText model via .add_vector. However, the vector does not seem to be added: .has_index_for still returns False afterwards.

Steps/code/corpus to reproduce

>>> from gensim.models import fasttext
>>> import numpy as np
>>> ft_model = fasttext.load_facebook_vectors("fastText/cc.en.300.bin")
>>> ft_model.has_index_for("testtest")
False
>>> ft_model.add_vector("testtest", np.zeros((300,)))
2000000
>>> ft_model.has_index_for("testtest")
False
>>> ft_model.index_to_key[2000000]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range

Versions

Windows-10-10.0.19041-SP0
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Bits 64
NumPy 1.20.3
SciPy 1.6.2
gensim 4.0.1
FAST_VERSION 1
@drvenabili

Hi! Have you found a solution yet? I'm encountering the same issue with a 4.x-trained (cf #3114) Word2Vec:

When adding new vectors:

  • using m.wv.add_vector(word, vector) works, BUT after saving to disk and reloading,

    • m.wv.get_vector(w) does return the vector
    • m.wv.has_index_for(w) does return True
    • m.wv.most_similar(w) returns ValueError: operands could not be broadcast together with shapes (115026,) (115025,)
  • using m.wv.add_vectors(list_of_words, list_of_numpy_arrays, replace=True) works, BUT after saving to disk, the model does not load:

Traceback (most recent call last):
  File "/home/simon/w2v/word2vec-utils/out/models/test_test.py", line 3, in <module>
    m2 = gensim.models.KeyedVectors.load("model_32_5_word2vec_sanitised.model")
  File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/utils.py", line 486, in load
    obj._load_specials(fname, mmap, compress, subname)
  File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/models/word2vec.py", line 1951, in _load_specials
    self.make_cum_table()  # rebuild cum_table from vocabulary
  File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/models/word2vec.py", line 835, in make_cum_table
    count = self.wv.get_vecattr(word_index, 'count')
  File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 368, in get_vecattr
    return self.expandos[attr][index]
IndexError: index 115025 is out of bounds for axis 0 with size 115025

When updating vectors, m.wv.add_vectors(list_of_words, list_of_numpy_arrays, replace = True) works as intended.
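The broadcast error above can be reproduced without gensim: a most_similar-style lookup divides the per-vector dot products by a cached array of vector norms, so if a vector is appended after the norms were cached, the two arrays' lengths disagree. A minimal numpy sketch (hypothetical names, not gensim's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 4)).astype(np.float32)  # 5 stored vectors
norms = np.linalg.norm(vectors, axis=1)               # cached norms, length 5

# A new vector is appended, but the cached norms are not refreshed.
vectors = np.vstack([vectors, np.full((1, 4), 0.5, dtype=np.float32)])

query = vectors[0] / np.linalg.norm(vectors[0])
dots = vectors @ query                                # length 6

caught = False
try:
    sims = dots / norms                               # (6,) vs (5,): mismatch
except ValueError as e:
    caught = True
    print(e)  # operands could not be broadcast together ...

# Recomputing the norms (what fill_norms(force=True) does) fixes it.
norms = np.linalg.norm(vectors, axis=1)
sims = dots / norms
print(sims.shape)  # (6,)
```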

Versions

Ubuntu 21.04 x86_64
Python 3.9.5 (default, May 11 2021, 08:20:37)
Bits 64
NumPy 1.21.2
SciPy 1.6.3
gensim 4.1.2
FAST_VERSION 1

Thanks

@piskvorky
Owner

Thanks for reporting. @gojomo any ideas? IIRC you rewrote this part for Gensim 4.0.

@piskvorky piskvorky added bug Issue described a bug impact HIGH Show-stopper for affected users reach MEDIUM Affects a significant number of users labels Oct 5, 2021
@gojomo
Collaborator

gojomo commented Oct 6, 2021

Although all in the same related add_vector(s) functionality, these may be 3 separate bugs:

  • the initial submission by @om-hb here, add_vector() for one vector to FastTextKeyedVectors not having intended effect (and thus giving an IndexError for an index that should be present) even without any save/load cycle
  • the ValueError for KeyedVectors reported by @faustusdotbe, when doing a .most_similar() after both an add_vector() & a save/reload
  • the IndexError for KeyedVectors reported by @faustusdotbe, preventing a .load() after an .add_vectors() that specifically also added (rather than just changed) existing vectors

While the traceback for that last one specifically implicates some of the pre-4.0 refactorings, the add() methods were sort of an afterthought on types that, when first made, didn't offer growing-the-set-of-word-vectors after original allocation. So some of these may have been issues pre-4.0 as well.

The code clearly needs better test coverage & a deeper look/rework for consistency/correctness, which would probably resolve these (& #3114). I might have some time for this next week. As each of these looks pretty easy to reproduce, anyone wanting to contribute minimal self-contained test cases triggering each of the failures in current code could give any future fixes a running head-start.

@s2terminal

Is this problem still unresolved? I have reproduced the error ValueError: operands could not be broadcast together with shapes with the following program using the source code from the develop branch.

from gensim.models import Word2Vec
import numpy

model = Word2Vec(sentences=[
                            ["this", "is", "test1"],
                            ["that", "is", "test2"],
], vector_size=2, min_count=1)

print(model.wv.most_similar("test1", topn=1)) #=> [('test2', 0.9941185712814331)]

model.wv.add_vectors(["test3"], [numpy.array([0.5, 0.5])])

print(model.wv.most_similar("test1", topn=1)) #=> ValueError: operands could not be broadcast together with shapes (6,) (5,) 
  • Google colab
  • Python 3.7.13
  • gensim 4.1.3.dev0

Do you have a solution? Thanks.

@s2terminal

I have found that this ValueError can be resolved by using fill_norms(force=True).

from gensim.models import Word2Vec
import numpy

model = Word2Vec(sentences=[
                            ["this", "is", "test1"],
                            ["that", "is", "test2"],
], vector_size=2, min_count=1)

print(model.wv.most_similar("test1", topn=1)) #=> [('test2', 0.9941185712814331)]

model.wv.add_vectors(["test3"], [numpy.array([0.5, 0.5])])

model.wv.fill_norms(force=True) # added

print(model.wv.most_similar("test1", topn=1)) #=> [('test2', 0.9941185712814331)]

@gojomo
Collaborator

gojomo commented Apr 9, 2022

Thanks for the fix! Yes, properly finishing/verifying/testing add_vector(s) requires, among other things, that the norms be flushed or updated after such operations. The PR looks reasonable, but I think it's still missing related cases we'd want to fix:

  • if existing vectors are changed, but none net added, the existing len(self.norms) check might not trigger the right updates
  • similarly, the single add_vector has an accommodation for the vector landing in slots that were pre-allocated (to allow loops of sequential single-adds to avoid time/space-expensive grows every time) – and I think that also fails to trigger the necessary update
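To illustrate that second point, here is a hypothetical sketch (a toy class, not gensim's actual implementation) of the pre-allocation pattern: because a single add can land in an already-allocated slot, the backing array's length doesn't change, so a staleness check based on array length alone would miss the add.

```python
import numpy as np

class TinyKV:
    """Toy vector store that grows its backing array in chunks."""
    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.zeros((0, dim), dtype=np.float32)
        self.next_index = 0   # number of slots actually in use
        self.norms = None

    def add_vector(self, vec):
        if self.next_index >= len(self.vectors):
            # grow in chunks of 4 so sequential adds stay cheap
            grown = np.zeros((len(self.vectors) + 4, self.dim), np.float32)
            grown[:len(self.vectors)] = self.vectors
            self.vectors = grown
        self.vectors[self.next_index] = vec
        self.next_index += 1
        # a check like "did len(self.vectors) change?" would miss this add
        # when the slot was pre-allocated; invalidating unconditionally is safe
        self.norms = None

kv = TinyKV(3)
kv.add_vector(np.ones(3))
kv.add_vector(np.ones(3) * 2)
print(len(kv.vectors), kv.next_index)  # 4 2 -- second add reused a slot
```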

The docs for the add methods may also deserve extra warnings about how such expansions to a KeyedVectors that's still a subsidiary component of a larger model may break assumptions the model was counting on. (E.g. adding vectors just to the KV won't update the enclosing model's output layer syn1neg or syn1, or other frequency/encoding tables – so such additions are unsupported, with undefined and likely error-triggering effects, if any further enclosing-model operations, like further training, are attempted.)

@s2terminal

@gojomo Thanks for your comment.
Indeed, this PR #3320 still does not seem to work well when updating vectors.

Does this mean that I should stop checking with len(self.norms) when the vectors are updated and always use fill_norms(force=True)?

Or should the correct specification be that the add method may cause some features of the model to stop working? As an example, should the library user explicitly run fill_norms(force=True) before running most_similar? In this case, I think the add method needs a warning, as you say.

@gojomo
Collaborator

gojomo commented Apr 15, 2022

In general, it's never guaranteed that the norms have been calculated, so any method that requires them calls fill_norms() just in case.

So I think any method that invalidates anything about the current norms can just discard them, self.norms = None, similar to resize_vectors().

This 'lazy' approach means you can do a bunch of invalidating ops (like sequential adds) in a row, without paying the cost of the full norm-refresh until they're needed again.
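A minimal sketch of that lazy approach, using a hypothetical toy class rather than gensim's real KeyedVectors: mutating operations just set norms to None, and any consumer that needs norms calls fill_norms() first.

```python
import numpy as np

class LazyNormsKV:
    """Toy illustration of lazy norm invalidation (not gensim's real code)."""
    def __init__(self, vectors):
        self.vectors = np.asarray(vectors, dtype=np.float32)
        self.norms = None

    def fill_norms(self, force=False):
        if self.norms is None or force:
            self.norms = np.linalg.norm(self.vectors, axis=1)

    def add_vectors(self, new):
        self.vectors = np.vstack([self.vectors, new])
        self.norms = None            # invalidate; recompute lazily later

    def most_similar_scores(self, query):
        self.fill_norms()            # norms guaranteed fresh here
        q = np.asarray(query, dtype=np.float32)
        return (self.vectors @ q) / (self.norms * np.linalg.norm(q))

kv = LazyNormsKV([[1.0, 0.0], [0.0, 1.0]])
kv.most_similar_scores([1.0, 0.0])   # norms cached for 2 vectors
kv.add_vectors(np.array([[0.5, 0.5]], dtype=np.float32))
scores = kv.most_similar_scores([1.0, 0.0])
print(scores.shape)  # (3,) -- no broadcast error after the add
```

Sequential adds between similarity queries then cost nothing extra: the full norm refresh happens once, at the next query.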
