
index.find() tries to reshape and fails #1822

Open
2 of 6 tasks
nikhilmakan02 opened this issue Oct 12, 2023 · 10 comments
nikhilmakan02 commented Oct 12, 2023

Initial Checks

  • I have read and followed the docs and still think this is a bug

Description

Apologies, the title of this isn't the best. I have a very odd case and can't work out what is causing it. I have also failed to recreate the issue in a simpler example.

I have a DocList where each document has been built with the same process, although the data obviously differs per doc. I am using the hnswlib backend.

The issue is that after building the doc list with no problems, I run .find() on individual elements of the doc list, and some of these calls fail while others don't. The error raised by the failing calls is shown in the traceback below.

Code Snippet:

import time
import logging

from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray

logger = logging.getLogger(__name__)


class AddressDoc(BaseDoc):
    ELID: int
    FULL_ADDRESS: str
    EMBEDDINGS: NdArray[768]


def build_doc_list(data):
    st = time.time()
    dl = DocList[AddressDoc](
        AddressDoc(
            ELID=0,
            FULL_ADDRESS="",
            EMBEDDINGS=d["EMBEDDINGS"],
        )
        for d in data
    )
    logger.info(f"Doc list created... {time.time() - st}")
    return dl

doc_index = HnswDocumentIndex[AddressDoc](work_dir=db_path)
dl = build_doc_list(data)

# This works!
results = doc_index.find(dl[2], search_field="EMBEDDINGS", limit=1)

# This doesn't!
results = doc_index.find(dl[3], search_field="EMBEDDINGS", limit=1)

type(dl[2].EMBEDDINGS) == type(dl[3].EMBEDDINGS) # returns True
type(dl[2].EMBEDDINGS.shape) == type(dl[3].EMBEDDINGS.shape) # returns True

I have compared dl[2] and dl[3] left, right, and center and can't understand what the issue is. The embeddings arrays in both documents are the same shape, which I have checked with numpy (.shape, .ndim, .size). I can't see any difference between the two that would cause the error below.
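For anyone debugging a similar mismatch: the `type(...) == type(...)` checks above only compare Python types, which are the same for any two NdArrays. A more thorough comparison looks at shape, dtype, and the values themselves. A minimal sketch with plain numpy (the two arrays stand in for `dl[2].EMBEDDINGS` and `dl[3].EMBEDDINGS`):

```python
import numpy as np

# Deterministic stand-ins for dl[2].EMBEDDINGS and dl[3].EMBEDDINGS.
a = np.arange(768, dtype=np.float32)
b = np.ones(768, dtype=np.float32)

# Compare metadata and the values themselves, not just the Python types.
same_shape = a.shape == b.shape
same_dtype = a.dtype == b.dtype
same_values = np.array_equal(a, b)

print(same_shape, same_dtype, same_values)  # True True False
```

Two arrays can pass every type check and still differ in content, so `np.array_equal` (or `np.allclose` for floats) is the decisive comparison.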

Traceback below:

File /usr/local/lib/python3.11/site-packages/docarray/index/abstract.py:503, in BaseDocIndex.find(self, query, search_field, limit, **kwargs)
    501     query_vec = query
    502 query_vec_np = self._to_numpy(query_vec)
--> 503 docs, scores = self._find(
    504     query_vec_np, search_field=search_field, limit=limit, **kwargs
    505 )
    507 if isinstance(docs, List) and not isinstance(docs, DocList):
    508     docs = self._dict_list_to_docarray(docs)

File /usr/local/lib/python3.11/site-packages/docarray/index/backends/hnswlib.py:328, in HnswDocumentIndex._find(self, query, limit, search_field)
    324 def _find(
...
--> 197     return cls._docarray_from_native(x.reshape(source.shape))
    198 elif len(source.shape) > 0:
    199     return cls._docarray_from_native(np.zeros(source.shape))

ValueError: cannot reshape array of size 768 into shape (768,768)

Example Code

No response

Python, DocArray & OS Version

0.39.0

Affected Components

@nikhilmakan02
Author

After further digging into the traceback, I am starting to think this may be something to do with the way the scores are processed after the query is made to hnswlib. The traceback appears to point to this method:

@classmethod
def from_protobuf(cls: Type[T], pb_msg: 'NdArrayProto') -> 'T':
    """
    Read ndarray from a proto msg
    :param pb_msg:
    :return: a numpy array
    """
    source = pb_msg.dense
    if source.buffer:
        x = np.frombuffer(bytearray(source.buffer), dtype=source.dtype)
        return cls._docarray_from_native(x.reshape(source.shape))
    elif len(source.shape) > 0:
        return cls._docarray_from_native(np.zeros(source.shape))
    else:
        raise ValueError(f'proto message {pb_msg} cannot be cast to a NdArray')
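The failing step is the `x.reshape(source.shape)` call: the buffer holds 768 values while the stored shape claims (768, 768). The exact numpy error can be reproduced in isolation (a minimal sketch, independent of docarray):

```python
import numpy as np

# The proto buffer holds 768 floats, but the stored shape says (768, 768).
x = np.frombuffer(bytearray(np.zeros(768).tobytes()), dtype=np.float64)

try:
    x.reshape((768, 768))
    reshaped = True
except ValueError as e:
    reshaped = False
    print(e)  # cannot reshape array of size 768 into shape (768,768)
```

So whatever wrote the protobuf message recorded a (768, 768) shape for a 768-element buffer; the mismatch happens at serialization time, not at query time.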

@JoanFM
Member

JoanFM commented Oct 12, 2023

Can you explain how you built the index?

@nikhilmakan02
Author

It's a large index of 3 million plus documents. I have a set of parquet files that contain about 100,000 rows each. These files are read in as DataFrames, converted to dicts, and then the same build_doc_list() function as above is used to create a doc list.

The db is then initialized and indexed as below.

doc_index = HnswDocumentIndex[AddressDoc](work_dir=work_dir)
doc_index.index(dl)

The whole thing runs in a loop, one file at a time.

Is the fact that I initialise the doc index on each loop run potentially the problem? Should that sit outside the loop? It's worth mentioning that this process did crash at one point and I restarted it from where it left off. I may just try rebuilding the whole db index and see where that gets me.

I see you have labeled this as a bug; can you provide a bit more on what your initial thinking is?

@JoanFM
Member

JoanFM commented Oct 13, 2023

I just labeled it as a bug because it looks like one, but I have no theory other than suspecting that something went wrong at indexing time.

I would indeed not initialize the index every time, but initialize it once and keep indexing in a loop.
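The suggested restructuring, sketched with a stand-in class (the real code would use `HnswDocumentIndex[AddressDoc]` and `build_doc_list`; the hypothetical `FakeIndex` below just records calls to illustrate the pattern):

```python
# Sketch: construct the index ONCE, then call .index() per file inside the loop.
# FakeIndex is a hypothetical stand-in for HnswDocumentIndex[AddressDoc].

class FakeIndex:
    def __init__(self, work_dir):
        self.work_dir = work_dir
        self.num_index_calls = 0
        self.num_docs = 0

    def index(self, doc_list):
        self.num_index_calls += 1
        self.num_docs += len(doc_list)


files = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]  # illustrative names

doc_index = FakeIndex(work_dir="./address_index")  # initialized once, outside the loop
for f in files:
    doc_list = [f] * 10  # stands in for build_doc_list(read_parquet(f))
    doc_index.index(doc_list)

print(doc_index.num_index_calls, doc_index.num_docs)  # 3 30
```

The key point is that the `work_dir` is opened a single time; each loop iteration only appends documents.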

@JoanFM
Member

JoanFM commented Oct 15, 2023

Please let us know the result when you reindex.

@nikhilmakan02
Author

nikhilmakan02 commented Oct 15, 2023

@JoanFM thanks for the response. After some debugging over the weekend I managed to nail down the issue. I don't believe it is a bug; it was a subtle error on my part.

I had two scripts: one to create the index and another to search through it.

The script to create the index used the following document class

class AddressDoc(BaseDoc):
    ELID: int
    FULL_ADDRESS: str
    EMBEDDINGS: NdArray[768] = Field(
        max_elements=3500000, space="cosine", num_threads=20
    )

The script to search the index used this document class and, as a result, seemed to throw the error:

class AddressDoc(BaseDoc):
    ELID: int
    FULL_ADDRESS: str
    EMBEDDINGS: NdArray[768]

Maybe as a feature request: I wonder if it's possible for this line to throw an error about an inconsistent class being used to initialise an already existing index.

HnswDocumentIndex[AddressDoc](work_dir=work_dir)
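One way such a check could work (a hypothetical sketch, not docarray's actual API): persist a fingerprint of the schema's field configuration next to the index, and compare it whenever an existing work_dir is reopened.

```python
import hashlib
import json
import os

# Hypothetical helper: fingerprint a schema's field configuration so a
# mismatch (e.g. a missing space="cosine") can be detected on reopen.

def schema_fingerprint(field_config: dict) -> str:
    blob = json.dumps(field_config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def check_or_store(work_dir: str, field_config: dict) -> None:
    path = os.path.join(work_dir, "schema.json")
    fp = schema_fingerprint(field_config)
    if os.path.exists(path):
        with open(path) as f:
            stored = f.read()
        if stored != fp:
            raise ValueError("schema does not match the existing index")
    else:
        os.makedirs(work_dir, exist_ok=True)
        with open(path, "w") as f:
            f.write(fp)

# The indexing script's config vs. the search script's config:
index_cfg = {"EMBEDDINGS": {"dim": 768, "space": "cosine", "max_elements": 3500000}}
search_cfg = {"EMBEDDINGS": {"dim": 768}}  # missing the custom Field settings
```

With this, reopening the index with the stripped-down class would raise immediately instead of failing later with an opaque reshape error.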

I did have a few follow on questions about this which hopefully you can shed some light on.

  1. I was hoping that by setting the number of threads I could improve the performance of writing to the hnswlib backend, as it takes quite long to index. However, from reviewing my PC resources, it's clear it's not really parallelised. Do you have any tips on how to improve this?
  2. When using the hnswlib backend you create two files: a .bin file (for the embeddings, I assume) and an SQLite db file for everything else. I have noticed that if I set the space property, Field(space="cosine"), the size of the SQLite db file blows up quite drastically. Do you have any thoughts on why this is?

If it is of value, the example below can be used to demonstrate both the reshape error and the blown-up SQLite db by uncommenting the various lines pertaining to class Person.

from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray
from pydantic import Field
import numpy as np
import os

###### Build db ######


class Person(BaseDoc):
    name: str
    follower: int
    # embeddings: NdArray[32]
    embeddings: NdArray[32] = Field(space="cosine")


data = [
    {"name": f"Maria_{i}", "follower": 12345, "embeddings": np.zeros(32)}
    for i in range(20)
]

db_dl = DocList[Person](
    [
        Person(
            name=d["name"],
            follower=d["follower"],
            embeddings=d["embeddings"],
        )
        for d in data
    ]
)
print("db data complete")

work_dir = "./person_index"  # hypothetical path; any writable directory works

doc_index = HnswDocumentIndex[Person](work_dir=work_dir)
doc_index.index(db_dl)
print("db indexed")
print(f"num docs: {doc_index.num_docs()}")


###### Search db ######


class Person(BaseDoc):
    name: str
    follower: int
    # embeddings: NdArray[32]
    embeddings: NdArray[32] = Field(space="cosine")


doc_index = HnswDocumentIndex[Person](work_dir=work_dir)

se_dl = DocList[Person](
    [
        Person(
            name=d["name"],
            follower=d["follower"],
            embeddings=d["embeddings"],
        )
        for d in data[:1]
    ]
)
print("search data complete")

doc_index.find_batched(se_dl, search_field="embeddings", limit=1)
print("db search complete")

@JoanFM
Member

JoanFM commented Oct 15, 2023

  1. I have a question: I do not understand why it blew up. Is it because the configuration was not respected? The fields of the class are the same. Without knowing this, it is hard to know how to validate the incompatibility.

  2. I will review, but this is handled by HNSW; I would first check that the flag is properly passed to the hnswlib dependency.

  3. The reason having "cosine" blows up the size is that in that case hnswlib normalizes the vectors, so we cannot rely on hnswlib to reconstruct the original vector to return as a result. We may be able to add a configuration value for users who do not mind that. (Not sure how this could be named; maybe you could easily provide a PR for this.)
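To illustrate point 3: once a vector is L2-normalized for cosine search, its original magnitude is gone, so the vector stored inside hnswlib can no longer reproduce the original. A minimal numpy sketch:

```python
import numpy as np

v = np.array([3.0, 4.0])            # original vector, magnitude 5
v_norm = v / np.linalg.norm(v)      # what a cosine index effectively stores

print(v_norm)                  # [0.6 0.8] -- unit length
print(np.linalg.norm(v_norm))  # 1.0
# The original magnitude (5.0) is not recoverable from v_norm alone,
# which is why the original vectors must be kept in a separate store.
```

That separate copy of the un-normalized vectors is what inflates the SQLite file when space="cosine" is used.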

@JoanFM JoanFM self-assigned this Oct 15, 2023
@JoanFM JoanFM closed this as completed Oct 15, 2023
@JoanFM JoanFM reopened this Oct 15, 2023
@JoanFM
Member

JoanFM commented Oct 18, 2023

@nikhilmakan02 , may I ask you how you configure the num_threads?

@nikhilmakan02
Author

@JoanFM apologies for the late reply on this.

may I ask you how you configure the num_threads?

I did it in the class as below, is that not correct?

class AddressDoc(BaseDoc):
    ELID: int
    FULL_ADDRESS: str
    EMBEDDINGS: NdArray[768] = Field(
        max_elements=3500000, space="cosine", num_threads=20
    )

1. I have a question, I do not understand why it blew, is it because the configuration was not respected? because the fields of the class are the same. Without this, it is hard to know how to validate the incompatibility.

Not sure I understand what you are asking when you say 'why it blew up'. The class has the same name and the fields have the same names, correct, but the class used to build the index had the custom configuration Field(max_elements=3500000, space="cosine", num_threads=20), which was omitted from the class used to initialise the index when searching, and that seemed to cause the reshape error. After reading your answer to question 3 and then also reading this thread, I am starting to think the reshape error is purely because the index was built using the cosine distance (and therefore the vectors were normalised). Reading that index in again and initialising it without this config would cause an issue, since technically you can't change the distance metric once an index has been created with the cosine distance.

3. The reason why having "cosine" blows the size is because in that case, hnswlib normalizes the vectors, and therefore we cannot rely on the hnswlib to reconstruct the original vector to give as result. For this, we may be able to add a configuration value to see if you do not mind about that. (Not sure how this could be named, maybe you could easily provide a PR for this)

Definitely think this would be worth doing to bring that DB size down. Also, some vector DBs give you the option to not return the embeddings in the search to improve performance; this would be preferable, as ultimately I don't need the embeddings after the search is complete. I have never submitted a PR on GitHub before, so let me know if there is something I can do to help here and how I go about doing it.

@JoanFM
Member

JoanFM commented Oct 19, 2023

Hey @nikhilmakan02 ,

There is always a first time contributing to an Open Source project, and I can assure you it is very rewarding.

You just need to make your changes in a fork of yours, and then you can open a PR from a branch via the GUI itself.

Just make sure to follow the CONTRIBUTING guidelines (https://github.com/docarray/docarray/blob/main/CONTRIBUTING.md) and to join our Discord server (https://discord.com/invite/DRa9JpGT) to get assistance and communicate with the community.
