fix: error from hybrid search's bm25 search by field #1692

hsm207 · 2023-07-10T10:52:44Z

fixes #1612

"string" type is deprecated since v1.19

hsm207 · 2023-07-10T13:23:57Z

@JoanFM @AnneYang720 Commit 3a6a293 is causing the following tests in the test_subindex.py file to fail:

test_subindex_index
test_find_subindex
test_subindex_del

is the implementation of the subindex in weaviate somehow dependent on the ID field being a 'string' datatype?


JoanFM · 2023-07-10T17:41:28Z

@JoanFM @AnneYang720 Commit 3a6a293 is causing the following tests in the test_subindex.py file to fail:

test_subindex_index

test_find_subindex

test_subindex_del

is the implementation of the subindex in weaviate somehow dependent on the ID field being a 'string' datatype?

Maybe implicitly there is. How does it fail?

JoanFM · 2023-07-10T18:10:12Z

@hsm207 check this comment at the top of the file, it may be related to this:

TODO: add more types and figure out how to handle text vs string type

hsm207 · 2023-07-10T22:00:14Z

How does it fail?

it's always related to the length of the documents returned. E.g. in:

def test_subindex_get(index):
    doc = index['1']
    assert type(doc) == MyDoc
    assert doc.id == '1'

    assert len(doc.docs) == 5

the value of doc.docs is:

0: SimpleDoc(id='docs-1-4', simple_tens=NdArray([5, 5, 5, 5, 5, 5, 5, 5, 5, 5]), simple_text='hello 4')
1: SimpleDoc(id='docs-1-0', simple_tens=NdArray([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), simple_text='hello 0')
2: SimpleDoc(id='docs-1-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')
3: SimpleDoc(id='docs-3-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')
4: SimpleDoc(id='docs-1-3', simple_tens=NdArray([4, 4, 4, 4, 4, 4, 4, 4, 4, 4]), simple_text='hello 3')
5: SimpleDoc(id='docs-0-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')
6: SimpleDoc(id='docs-1-2', simple_tens=NdArray([3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), simple_text='hello 2')
7: SimpleDoc(id='docs-2-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')
8: SimpleDoc(id='docs-4-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')

before it was:

0: SimpleDoc(id='docs-1-1', simple_tens=NdArray([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), simple_text='hello 1')
1: SimpleDoc(id='docs-1-0', simple_tens=NdArray([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), simple_text='hello 0')
2: SimpleDoc(id='docs-1-4', simple_tens=NdArray([5, 5, 5, 5, 5, 5, 5, 5, 5, 5]), simple_text='hello 4')
3: SimpleDoc(id='docs-1-3', simple_tens=NdArray([4, 4, 4, 4, 4, 4, 4, 4, 4, 4]), simple_text='hello 3')
4: SimpleDoc(id='docs-1-2', simple_tens=NdArray([3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), simple_text='hello 2')

Does _get_items work differently with nested documents? I don't think this is the case based on this part of the codebase, unless it's overridden elsewhere.

check this comment
TODO: add more types and figure out how to handle text vs string type on the top of the file, it may be related to this?
It could be due to how the id is tokenized (field before vs. word now).

Can you confirm that nested documents don't somehow override _get_items?
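To make the tokenization hypothesis above concrete, here is a minimal pure-Python sketch. It is a toy model, not Weaviate's code: it assumes that an Equal match on a tokenized property succeeds whenever every query token appears among the stored tokens, that "word" tokenization lowercases and splits on non-alphanumeric characters, and that "field" keeps the whole trimmed value as one token. The 5x5 set of stored ids and the helper names (`word_tokens`, `field_tokens`, `equal_matches`, `fetch`) are hypothetical, chosen to mirror the ids in the output above.

```python
import re


def word_tokens(value: str) -> set:
    # "word" tokenization (assumed): lowercase, split on non-alphanumeric characters
    return {t for t in re.split(r"[^0-9a-z]+", value.lower()) if t}


def field_tokens(value: str) -> set:
    # "field" tokenization (assumed): the whole trimmed value is a single token
    return {value.strip()}


def equal_matches(query: str, stored_ids, tokenize) -> set:
    # Assumed matching rule: a stored value matches if it contains every query token
    wanted = tokenize(query)
    return {s for s in stored_ids if wanted <= tokenize(s)}


def fetch(ids, stored_ids, tokenize):
    # Look up each requested id and collect the union of everything that matched
    found = set()
    for query in ids:
        found |= equal_matches(query, stored_ids, tokenize)
    return sorted(found)


# Hypothetical fixture: 5 parents with 5 nested docs each
stored = [f"docs-{i}-{j}" for i in range(5) for j in range(5)]
children_of_1 = [f"docs-1-{j}" for j in range(5)]

by_word = fetch(children_of_1, stored, word_tokens)
by_field = fetch(children_of_1, stored, field_tokens)
print(len(by_word), len(by_field))  # prints "9 5"
```

Under these assumptions the word-tokenized lookup also returns docs-0-1, docs-2-1, docs-3-1 and docs-4-1, which are the four extra ids in the failing output, while the field-tokenized lookup returns exactly the five expected ids.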

JoanFM · 2023-07-10T22:21:22Z

Can you confirm that nested documents don't somehow override _get_items?

Maybe @JohannesMessner knows better

JoanFM · 2023-07-11T15:56:37Z

Check this:

    def _filter_by_parent_id(self, id: str) -> Optional[List[str]]:
        results = (
            self._client.query.get(self._db_config.index_name, ['docarrayid'])
            .with_where(
                {'path': ['parent_id'], 'operator': 'Equal', 'valueString': f'{id}'}
            )
            .do()
        )

        ids = [
            res['docarrayid']
            for res in results['data']['Get'][self._db_config.index_name]
        ]
        return ids

I guess the fact that the ID column is now text rather than string may affect this filter.
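As a hedged illustration of the point above: the filter in the snippet uses `valueString`, which belongs to the deprecated "string" type, and the text counterpart in Weaviate's where-filter syntax is `valueText`. The helper below is hypothetical (`parent_id_filter` is not part of docarray) and only builds the filter dict; whether this PR needs such a change is not established in this thread.

```python
from typing import Any, Dict


def parent_id_filter(parent_id: str) -> Dict[str, Any]:
    # Same shape as the filter in _filter_by_parent_id, but with valueText
    # instead of valueString (assumption: the property is of type text)
    return {
        "path": ["parent_id"],
        "operator": "Equal",
        "valueText": str(parent_id),
    }


print(parent_id_filter("1"))
```

The resulting dict could be passed to `.with_where(...)` in place of the inline literal, assuming the rest of the query stays the same.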

hsm207 · 2023-07-11T19:33:01Z

@JoanFM I think I found the reason for the failure. See #1698. I'll investigate further when I'm back from vacation next week

JoanFM · 2023-07-11T21:15:21Z

@JoanFM I think I found the reason for the failure. See #1698. I'll investigate further when I'm back from vacation next week

So the issue was that the ID is being tokenized then?

hsm207 · 2023-07-12T00:14:27Z

@JoanFM I think I found the reason for the failure. See #1698. I'll investigate further when I'm back from vacation next week

So the issue was that the ID is being tokenized then?

Assuming that nested documents don't somehow override _get_items, then yes, I think this is the cause.

JohannesMessner · 2023-07-12T07:02:02Z

@JoanFM I think I found the reason for the failure. See #1698. I'll investigate further when I'm back from vacation next week

So the issue was that the ID is being tokenized then?

Assuming that nested documents don't somehow override _get_items, then yes, I think this is the cause.

If there are nested Documents it performs the same __getitem__/_get_items call on the nested docs as it does on the root docs. So there should be no overriding, unless a particular backend does something wonky.


codecov · 2023-07-31T09:29:45Z

Codecov Report

Patch coverage: 50.00% and project coverage change: -2.37% ⚠️

Comparison is base (587ab5b) 83.74% compared to head (1a381f1) 81.37%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1692      +/-   ##
==========================================
- Coverage   83.74%   81.37%   -2.37%     
==========================================
  Files         134      134              
  Lines        8845     8848       +3     
==========================================
- Hits         7407     7200     -207     
- Misses       1438     1648     +210

| Flag | Coverage Δ |
| --- | --- |
| docarray | `81.37% <50.00%> (-2.37%)` ⬇️ |


| Files Changed | Coverage Δ |
| --- | --- |
| docarray/index/backends/weaviate.py | `52.57% <50.00%> (-42.47%)` ⬇️ |

... and 16 files with indirect coverage changes


hsm207 · 2023-07-31T09:30:53Z

@JohannesMessner @JoanFM

I made the commit to set the tokenization for "id" to field and now the test is failing at this line:

assert len(doc.list_docs[0].docs) == 5

the value of doc.list_docs[0].docs is:

What is the correct value and how is it built differently compared to doc.docs and doc.list_docs?
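For reference, the tokenization change mentioned in this comment might look roughly like the sketch below. It uses the Weaviate schema property format (`name`, `dataType`, `tokenization`); the property name `docarrayid` is taken from the `_filter_by_parent_id` snippet earlier in the thread, and the helper `id_property` is hypothetical. The actual commit may define the property differently.

```python
from typing import Any, Dict


def id_property(name: str = "docarrayid") -> Dict[str, Any]:
    # A text property whose value is indexed as one token, so that an id such as
    # 'docs-1-4' is not split into 'docs', '1' and '4'
    return {
        "name": name,
        "dataType": ["text"],
        "tokenization": "field",
    }


print(id_property())
```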

JoanFM · 2023-07-31T10:24:45Z

@JohannesMessner @JoanFM

I made the commit to set the tokenization for "id" to field and now the test is failing at this line:
assert len(doc.list_docs[0].docs) == 5
the value of doc.list_docs[0].docs is:
What is the correct value and how is it built differently compared to `doc.docs` and `doc.list_docs`?

I do not know, @JohannesMessner may be better to judge.

However, I do not like at all that test folder, since each test cannot be run individually, as they all depend on the same fixture. I did a refactoring of that test for HNSWLib, I will see if I can do it for this one


JoanFM · 2023-08-24T09:06:10Z

Hey @hsm207,

what is the status of this issue?

hsm207 · 2023-08-24T18:30:53Z

hey @JoanFM

I haven't been able to delve into the nested document feature yet due to current commitments on other projects. I've moved this to my backlog since I don't see an opportunity to work on this until at least mid next month.

hsm207 added 3 commits July 10, 2023 10:48

chore: remove version cap

937d98c

chore: upgrade weaviate client version

8f03b8d

chore: update lock file

fdee1aa

hsm207 enabled auto-merge (squash) July 10, 2023 10:52

hsm207 marked this pull request as draft July 10, 2023 10:53

auto-merge was automatically disabled July 10, 2023 10:53
Pull request was converted to draft

hsm207 added 5 commits July 10, 2023 11:34

chore: always use latest weaviate

7f03387

fix: update hybrid search query builder

bf32862

refactor: update weaviate type mappings

3a6a293

"string" type is deprecated since v1.19

test: update test case to override column type

71bfd10

Merge branch 'main' into fix-hybrid-search

fbd6722

hsm207 added 2 commits July 10, 2023 13:26

Merge branch 'main' of https://github.com/docarray/docarray into fix-hybrid-search

c2bdd29

chore: update lock file

d8828fe

hsm207 added 2 commits July 11, 2023 05:16

Merge branch 'main' into fix-hybrid-search

5ed3a10

chore: update comment

3b1a7ed

JoanFM mentioned this pull request Jul 11, 2023

chore: move ID to text #1697

Draft

hsm207 added 4 commits July 12, 2023 17:28

Merge branch 'main' into fix-hybrid-search

fbe9abd

Merge branch 'main' of https://github.com/docarray/docarray into fix-hybrid-search

1c91581

fix: set DOCUMENTID tokenization to field

fe2d5e1

Merge branch 'fix-hybrid-search' of https://github.com/hsm207/docarray into fix-hybrid-search

727ce90

Merge branch 'main' into fix-hybrid-search

6584db6

hsm207 added 3 commits July 31, 2023 11:01

Merge branch 'main' of https://github.com/docarray/docarray into fix-hybrid-search

c35962f

Merge branch 'fix-hybrid-search' of https://github.com/hsm207/docarray into fix-hybrid-search

3910c1b

Merge branch 'main' into fix-hybrid-search

78d8d64

Merge branch 'main' into fix-hybrid-search

1a381f1
