Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: error from hybrid search's bm25 search by field #1692

Draft
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

hsm207
Copy link
Collaborator

@hsm207 hsm207 commented Jul 10, 2023

fixes #1612

@hsm207 hsm207 enabled auto-merge (squash) July 10, 2023 10:52
@hsm207 hsm207 marked this pull request as draft July 10, 2023 10:53
auto-merge was automatically disabled July 10, 2023 10:53

Pull request was converted to draft

@hsm207
Copy link
Collaborator Author

hsm207 commented Jul 10, 2023

@JoanFM @AnneYang720 Commit 3a6a293 is causing the the following tests in the test_subindex.py file to fail:

  1. test_subindex_index
  2. test_find_subindex
  3. test_subindex_del

is the implementation of the subindex in weaviate somehow dependent on the ID field being a 'string' datatype?

@JoanFM
Copy link
Member

JoanFM commented Jul 10, 2023

@JoanFM @AnneYang720 Commit 3a6a293 is causing the the following tests in the test_subindex.py file to fail:

  1. test_subindex_index
  2. test_find_subindex
  3. test_subindex_del

is the implementation of the subindex in weaviate somehow dependent on the ID field being a 'string' datatype?

Maybe implicitly there is. How does it fail?

@JoanFM
Copy link
Member

JoanFM commented Jul 10, 2023

@hsm207 check this comment:

TODO: add more types and figure out how to handle text vs string type on the top of the file, it may be related to this?

@hsm207
Copy link
Collaborator Author

hsm207 commented Jul 10, 2023

How does it fail?

it's always related to the length of the documents returned. E.g. in:

def test_subindex_get(index):
    doc = index['1']
    assert type(doc) == MyDoc
    assert doc.id == '1'

    assert len(doc.docs) == 5

the value of docs.docs is:

0: SimpleDoc(id='docs-1-4', simple_tens=NdArray([5, 5, 5, 5, 5, 5, 5, 5, 5, 5]), simple_text='hello 4')
1: SimpleDoc(id='docs-1-0', simple_tens=NdArray([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), simple_text='hello 0')
2: SimpleDoc(id='docs-1-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')
3: SimpleDoc(id='docs-3-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')
4: SimpleDoc(id='docs-1-3', simple_tens=NdArray([4, 4, 4, 4, 4, 4, 4, 4, 4, 4]), simple_text='hello 3')
5: SimpleDoc(id='docs-0-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')
6: SimpleDoc(id='docs-1-2', simple_tens=NdArray([3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), simple_text='hello 2')
7: SimpleDoc(id='docs-2-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')
8: SimpleDoc(id='docs-4-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')

before it was:

0: SimpleDoc(id='docs-1-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')
1: SimpleDoc(id='docs-1-0', simple_tens=NdArray([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), simple_text='hello 0')
2: SimpleDoc(id='docs-1-4', simple_tens=NdArray([5, 5, 5, 5, 5, 5, 5, 5, 5, 5]), simple_text='hello 4')
3: SimpleDoc(id='docs-1-3', simple_tens=NdArray([4, 4, 4, 4, 4, 4, 4, 4, 4, 4]), simple_text='hello 3')
4: SimpleDoc(id='docs-1-2', simple_tens=NdArray([3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), simple_text='hello 2')

Does _get_items work differently with nested documents? I don't think this is the case based on this part of the codebase, unless it's overriden elsewhere

check this comment
TODO: add more types and figure out how to handle text vs string type on the top of the file, it may be related to this?
it could be due to how the id is tokenized (field (before) vs word (now)) .

Can you confirm that nested documents don't somehow override _get_items?

@JoanFM
Copy link
Member

JoanFM commented Jul 10, 2023

How does it fail?

it's always related to the length of the documents returned. E.g. in:

def test_subindex_get(index):
    doc = index['1']
    assert type(doc) == MyDoc
    assert doc.id == '1'

    assert len(doc.docs) == 5

the value of docs.docs is:

0: SimpleDoc(id='docs-1-4', simple_tens=NdArray([5, 5, 5, 5, 5, 5, 5, 5, 5, 5]), simple_text='hello 4')
1: SimpleDoc(id='docs-1-0', simple_tens=NdArray([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), simple_text='hello 0')
2: SimpleDoc(id='docs-1-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')
3: SimpleDoc(id='docs-3-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')
4: SimpleDoc(id='docs-1-3', simple_tens=NdArray([4, 4, 4, 4, 4, 4, 4, 4, 4, 4]), simple_text='hello 3')
5: SimpleDoc(id='docs-0-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')
6: SimpleDoc(id='docs-1-2', simple_tens=NdArray([3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), simple_text='hello 2')
7: SimpleDoc(id='docs-2-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')
8: SimpleDoc(id='docs-4-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')

before it was:

0: SimpleDoc(id='docs-1-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')
1: SimpleDoc(id='docs-1-0', simple_tens=NdArray([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), simple_text='hello 0')
2: SimpleDoc(id='docs-1-4', simple_tens=NdArray([5, 5, 5, 5, 5, 5, 5, 5, 5, 5]), simple_text='hello 4')
3: SimpleDoc(id='docs-1-3', simple_tens=NdArray([4, 4, 4, 4, 4, 4, 4, 4, 4, 4]), simple_text='hello 3')
4: SimpleDoc(id='docs-1-2', simple_tens=NdArray([3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), simple_text='hello 2')

Does _get_items work differently with nested documents? I don't think this is the case based on this part of the codebase, unless it's overriden elsewhere

check this comment
TODO: add more types and figure out how to handle text vs string type on the top of the file, it may be related to this?
it could be due to how the id is tokenized (field (before) vs word (now)) .

Can you confirm that nested documents don't somehow override _get_items?

Maybe @JohannesMessner knows better

@JoanFM
Copy link
Member

JoanFM commented Jul 11, 2023

Check this:

    def _filter_by_parent_id(self, id: str) -> Optional[List[str]]:
        results = (
            self._client.query.get(self._db_config.index_name, ['docarrayid'])
            .with_where(
                {'path': ['parent_id'], 'operator': 'Equal', 'valueString': f'{id}'}
            )
            .do()
        )

        ids = [
            res['docarrayid']
            for res in results['data']['Get'][self._db_config.index_name]
        ]
        return ids

I guess the fact that the ID column is not a string but text may affect

@JoanFM JoanFM mentioned this pull request Jul 11, 2023
@hsm207
Copy link
Collaborator Author

hsm207 commented Jul 11, 2023

@JoanFM I think I found the reason for the failure. See #1698. I'll investigate further when I'm back from vacation next week

@JoanFM
Copy link
Member

JoanFM commented Jul 11, 2023

@JoanFM I think I found the reason for the failure. See #1698. I'll investigate further when I'm back from vacation next week

So the issue was that the ID is being tokenized then?

@hsm207
Copy link
Collaborator Author

hsm207 commented Jul 12, 2023

@JoanFM I think I found the reason for the failure. See #1698. I'll investigate further when I'm back from vacation next week

So the issue was that the ID is being tokenized then?

assuming that nested documents isn't somehow override _get_items, then yes, I think this is the cause.

@JohannesMessner
Copy link
Member

@JoanFM I think I found the reason for the failure. See #1698. I'll investigate further when I'm back from vacation next week

So the issue was that the ID is being tokenized then?

assuming that nested documents isn't somehow override _get_items, then yes, I think this is the cause.

If there are nested Documents it performs the same __getitem__/__get_items call on the nested docs as it does on the root docs. So there should be no overriding, unless a particular backend does something wonky.

@codecov
Copy link

codecov bot commented Jul 31, 2023

Codecov Report

Patch coverage: 50.00% and project coverage change: -2.37% ⚠️

Comparison is base (587ab5b) 83.74% compared to head (1a381f1) 81.37%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1692      +/-   ##
==========================================
- Coverage   83.74%   81.37%   -2.37%     
==========================================
  Files         134      134              
  Lines        8845     8848       +3     
==========================================
- Hits         7407     7200     -207     
- Misses       1438     1648     +210     
Flag Coverage Δ
docarray 81.37% <50.00%> (-2.37%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
docarray/index/backends/weaviate.py 52.57% <50.00%> (-42.47%) ⬇️

... and 16 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hsm207
Copy link
Collaborator Author

hsm207 commented Jul 31, 2023

@JohannesMessner @JoanFM

I made the commit to set the tokenization for "id" to field and now the tests is failing at this line:

assert len(doc.list_docs[0].docs) == 5

the value of doc.list_docs[0].docs is:

image

What is the correct value and how is it built differently compared to doc.docs and doc.list_docs?

@JoanFM
Copy link
Member

JoanFM commented Jul 31, 2023

@JohannesMessner @JoanFM

I made the commit to set the tokenization for "id" to field and now the tests is failing at this line:

assert len(doc.list_docs[0].docs) == 5

the value of doc.list_docs[0].docs is:

image What is the correct value and how is it built differently compared to `doc.docs` and `doc.list_docs`?

I do not know, @JohannesMessner may be better to judge.

However, I do not like at all that test folder, since each test cannot be run individually, as they all depend on the same fixture. I did a refactoring of that test for HNSWLib, I will see if I can do it for this one

@JoanFM
Copy link
Member

JoanFM commented Aug 24, 2023

Hey @hsm207,

what is the ststus of this issue?

@hsm207
Copy link
Collaborator Author

hsm207 commented Aug 24, 2023

hey @JoanFM

I haven't been able to delve into the nested document feature yet due to current commitments on other projects. I've moved this to my backlog since I don't see an opportunity to work on this until at least mid next month.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Weaviate hybrid example errors: init missing required positional argument 'properties'
3 participants