
Can't add vector to pretrained fasttext model via .add_vector #3224

Open
om-hb opened this issue Aug 31, 2021 · 8 comments
Labels
bug Issue described a bug impact HIGH Show-stopper for affected users reach MEDIUM Affects a significant number of users

Comments

@om-hb

om-hb commented Aug 31, 2021

Problem description

I'm trying to add a new vector to a pretrained fastText model via .add_vector. However, the vector does not seem to be added: .has_index_for still returns False afterwards.

Steps/code/corpus to reproduce

>>> from gensim.models import fasttext
>>> import numpy as np
>>> ft_model = fasttext.load_facebook_vectors("fastText/cc.en.300.bin")
>>> ft_model.has_index_for("testtest")
False
>>> ft_model.add_vector("testtest", np.zeros((300,)))
2000000
>>> ft_model.has_index_for("testtest")
False
>>> ft_model.index_to_key[2000000]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range

Versions

Windows-10-10.0.19041-SP0
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Bits 64
NumPy 1.20.3
SciPy 1.6.2
gensim 4.0.1
FAST_VERSION 1
@drvenabili

Hi! Have you found a solution yet? I'm encountering the same issue with a 4.x-trained (cf #3114) Word2Vec:

When adding new vectors:

  • using m.wv.add_vector(word, vector) works, BUT after saving to disk and reloading,

    • m.wv.get_vector(w) does return the vector
    • m.wv.has_index_for(w) does return True
    • m.wv.most_similar(w) returns ValueError: operands could not be broadcast together with shapes (115026,) (115025,)
  • using m.wv.add_vectors(list_of_words, list_of_numpy_arrays, replace=True) works, BUT after saving to disk, the model does not load:

Traceback (most recent call last):
  File "/home/simon/w2v/word2vec-utils/out/models/test_test.py", line 3, in <module>
    m2 = gensim.models.KeyedVectors.load("model_32_5_word2vec_sanitised.model")
  File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/utils.py", line 486, in load
    obj._load_specials(fname, mmap, compress, subname)
  File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/models/word2vec.py", line 1951, in _load_specials
    self.make_cum_table()  # rebuild cum_table from vocabulary
  File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/models/word2vec.py", line 835, in make_cum_table
    count = self.wv.get_vecattr(word_index, 'count')
  File "/home/simon/w2v/word2vec-utils/env2/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 368, in get_vecattr
    return self.expandos[attr][index]
IndexError: index 115025 is out of bounds for axis 0 with size 115025

When updating vectors, m.wv.add_vectors(list_of_words, list_of_numpy_arrays, replace = True) works as intended.
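The broadcast error above can be reproduced without gensim: a most_similar-style lookup divides the per-vector dot products by a cached array of vector norms, so if a vector is appended after the norms were cached, the two arrays' lengths disagree. A minimal numpy sketch (hypothetical names, not gensim's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 4)).astype(np.float32)  # 5 stored vectors
norms = np.linalg.norm(vectors, axis=1)               # cached norms, length 5

# A new vector is appended, but the cached norms are not refreshed.
vectors = np.vstack([vectors, np.full((1, 4), 0.5, dtype=np.float32)])

query = vectors[0] / np.linalg.norm(vectors[0])
dots = vectors @ query                                # length 6

caught = False
try:
    sims = dots / norms                               # (6,) vs (5,): mismatch
except ValueError as e:
    caught = True
    print(e)  # operands could not be broadcast together ...

# Recomputing the norms (what fill_norms(force=True) does) fixes it.
norms = np.linalg.norm(vectors, axis=1)
sims = dots / norms
print(sims.shape)  # (6,)
```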

Versions

Ubuntu 21.04 x86_64
Python 3.9.5 (default, May 11 2021, 08:20:37)
Bits 64
NumPy 1.21.2
SciPy 1.6.3
gensim 4.1.2
FAST_VERSION 1

Thanks

@piskvorky
Owner

Thanks for reporting. @gojomo any ideas? IIRC you rewrote this part for Gensim 4.0.

@piskvorky piskvorky added bug Issue described a bug impact HIGH Show-stopper for affected users reach MEDIUM Affects a significant number of users labels Oct 5, 2021
@gojomo
Collaborator

gojomo commented Oct 6, 2021

Although all in the same related add_vector(s) functionality, these may be 3 separate bugs:

  • the initial submission by @om-hb here, add_vector() for one vector to FastTextKeyedVectors not having intended effect (and thus giving an IndexError for an index that should be present) even without any save/load cycle
  • the ValueError for KeyedVectors reported by @faustusdotbe, when doing a .most_similar() after both an add_vector() & a save/reload
  • the IndexError for KeyedVectors reported by @faustusdotbe, preventing a .load() after an .add_vectors() that specifically also added (rather than just changed) existing vectors

While the traceback for that last one specifically implicates some of the pre-4.0 refactorings, the add() methods were sort of an afterthought on types that, when first made, didn't offer growing-the-set-of-word-vectors after original allocation. So some of these may have been issues pre-4.0 as well.

The code clearly needs better test coverage & a deeper look/rework for consistency/correctness, which would probably resolve these (& #3114). I might have some time for this next week. As each of these looks pretty easy to reproduce, anyone wanting to contribute minimal self-contained test cases triggering each of the failures in current code could give any future fixes a running head-start.

@s2terminal

Is this problem still unresolved? I have reproduced the error ValueError: operands could not be broadcast together with shapes with the following program using the source code from the develop branch.

from gensim.models import Word2Vec
import numpy

model = Word2Vec(sentences=[
                            ["this", "is", "test1"],
                            ["that", "is", "test2"],
], vector_size=2, min_count=1)

print(model.wv.most_similar("test1", topn=1)) #=> [('test2', 0.9941185712814331)]

model.wv.add_vectors(["test3"], [numpy.array([0.5, 0.5])])

print(model.wv.most_similar("test1", topn=1)) #=> ValueError: operands could not be broadcast together with shapes (6,) (5,) 
  • Google colab
  • Python 3.7.13
  • gensim 4.1.3.dev0

Do you have a solution? Thanks.

@s2terminal

I have found that this ValueError can be resolved by using fill_norms(force=True).

from gensim.models import Word2Vec
import numpy

model = Word2Vec(sentences=[
                            ["this", "is", "test1"],
                            ["that", "is", "test2"],
], vector_size=2, min_count=1)

print(model.wv.most_similar("test1", topn=1)) #=> [('test2', 0.9941185712814331)]

model.wv.add_vectors(["test3"], [numpy.array([0.5, 0.5])])

model.wv.fill_norms(force=True) # added

print(model.wv.most_similar("test1", topn=1)) #=> [('test2', 0.9941185712814331)]

@gojomo
Collaborator

gojomo commented Apr 9, 2022

Thanks for the fix! Yes, properly finishing/verifying/testing add_vector(s) requires, among other things, that the norms be flushed or updated after such operations. The PR looks reasonable, but I think it's still missing related cases we'd want to fix:

  • if existing vectors are changed, but none net added, the existing len(self.norms) check might not trigger the right updates
  • similarly, the single add_vector has an accommodation for the vector landing in slots that were pre-allocated (to allow loops of sequential single-adds to avoid time/space-expensive grows every time) – and I think that also fails to trigger the necessary update
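To illustrate that second point, here is a hypothetical sketch (a toy class, not gensim's actual implementation) of the pre-allocation pattern: because a single add can land in an already-allocated slot, the backing array's length doesn't change, so a staleness check based on array length alone would miss the add.

```python
import numpy as np

class TinyKV:
    """Toy vector store that grows its backing array in chunks."""
    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.zeros((0, dim), dtype=np.float32)
        self.next_index = 0   # number of slots actually in use
        self.norms = None

    def add_vector(self, vec):
        if self.next_index >= len(self.vectors):
            # grow in chunks of 4 so sequential adds stay cheap
            grown = np.zeros((len(self.vectors) + 4, self.dim), np.float32)
            grown[:len(self.vectors)] = self.vectors
            self.vectors = grown
        self.vectors[self.next_index] = vec
        self.next_index += 1
        # a check like "did len(self.vectors) change?" would miss this add
        # when the slot was pre-allocated; invalidating unconditionally is safe
        self.norms = None

kv = TinyKV(3)
kv.add_vector(np.ones(3))
kv.add_vector(np.ones(3) * 2)
print(len(kv.vectors), kv.next_index)  # 4 2 -- second add reused a slot
```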

The docs for the add methods may also deserve extra warnings about how such expansions to a KeyedVectors that's still a subsidiary component of a larger model may break assumptions the model was counting on. (E.g. adding vectors just to the KV won't update the enclosing model's output layer syn1neg or syn1, or other frequency/encoding tables – so such additions are unsupported, with undefined and likely error-triggering effects, if any further enclosing-model operations, like further training, are attempted.)

@s2terminal

@gojomo Thanks for your comment.
Indeed, this PR #3320 still does not seem to work well when updating vectors.

Does this mean that I should stop checking with len(self.norms) when the vectors are updated and always use fill_norms(force=True)?

Or should the correct specification be that the add method may cause some features of the model to stop working? As an example, should the library user explicitly run fill_norms(force=True) before running most_similar? In this case, I think the add method needs a warning, as you say.

@gojomo
Collaborator

gojomo commented Apr 15, 2022

In general, it's never guaranteed that the norms have been calculated, so any method that requires them calls fill_norms() just in case.

So I think any method that invalidates anything about the current norms can just discard them, self.norms = None, similar to resize_vectors().

This 'lazy' approach means you can do a bunch of invalidating ops (like sequential adds) in a row, without paying the cost of the full norm-refresh until they're needed again.
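A minimal sketch of that lazy approach, using a hypothetical toy class rather than gensim's real KeyedVectors: mutating operations just set norms to None, and any consumer that needs norms calls fill_norms() first.

```python
import numpy as np

class LazyNormsKV:
    """Toy illustration of lazy norm invalidation (not gensim's real code)."""
    def __init__(self, vectors):
        self.vectors = np.asarray(vectors, dtype=np.float32)
        self.norms = None

    def fill_norms(self, force=False):
        if self.norms is None or force:
            self.norms = np.linalg.norm(self.vectors, axis=1)

    def add_vectors(self, new):
        self.vectors = np.vstack([self.vectors, new])
        self.norms = None            # invalidate; recompute lazily later

    def most_similar_scores(self, query):
        self.fill_norms()            # norms guaranteed fresh here
        q = np.asarray(query, dtype=np.float32)
        return (self.vectors @ q) / (self.norms * np.linalg.norm(q))

kv = LazyNormsKV([[1.0, 0.0], [0.0, 1.0]])
kv.most_similar_scores([1.0, 0.0])   # norms cached for 2 vectors
kv.add_vectors(np.array([[0.5, 0.5]], dtype=np.float32))
scores = kv.most_similar_scores([1.0, 0.0])
print(scores.shape)  # (3,) -- no broadcast error after the add
```

Sequential adds between similarity queries then cost nothing extra: the full norm refresh happens once, at the next query.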
