21 Jun 08:16

github-actions

dc7b681

💫 Release v0.34.0

Release Note (`0.34.0`)

Release time: 2023-06-21 08:15:43

This release contains 2 breaking changes, 3 new features, 11 bug fixes, and 2 documentation improvements.

💣 Breaking Changes

Terminate Python 3.7 support

⚠️ ⚠️ DocArray will now require Python 3.8. We can no longer assure compatibility with Python 3.7.

We decided to drop it for two reasons:

Several dependencies of DocArray require Python 3.8.
Python long-term support for 3.7 is ending this week. This means there will no longer
be security updates for Python 3.7, making this a good time for us to change our requirements.

Changes to `DocVec` Protobuf definition (#1639)

In order to fix a bug in the DocVec protobuf serialization described in #1561,
we have changed the DocVec .proto definition.

This means that DocVec objects serialized with DocArray v0.33.0 or earlier cannot be deserialized with DocArray
v0.34.0 or later, and vice versa.

⚠️ ⚠️ We strongly recommend that everyone using Protobuf with DocVec upgrade to DocArray v0.34.0 or
later.

🆕 Features

Allow users to check if a Document is already indexed in a DocIndex (#1633)

You can now check if a Document has already been indexed by using the `in` keyword:

from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

docs = DocList[MyDoc](
        [MyDoc(text="Example text", embedding=np.random.rand(128))
         for _ in range(2000)])

index = InMemoryExactNNIndex[MyDoc](docs)
assert docs[0] in index
assert MyDoc(text='New text', embedding=np.random.rand(128)) not in index
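
The membership check relies on Python's `in` operator, which calls `__contains__` on the container. The following is a minimal sketch of that mechanism only, assuming a hypothetical `ToyIndex` that tracks documents by `id`. It is an illustration, not DocArray's implementation.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class ToyDoc:
    """Hypothetical document with an auto-generated id."""
    text: str
    id: str = field(default_factory=lambda: uuid4().hex)


class ToyIndex:
    """Hypothetical index that remembers the ids of indexed documents."""

    def __init__(self, docs):
        self._ids = {doc.id for doc in docs}

    def __contains__(self, doc) -> bool:
        # Python dispatches `doc in index` to this method.
        return doc.id in self._ids


toy_docs = [ToyDoc(text="Example text") for _ in range(3)]
toy_index = ToyIndex(toy_docs)
```

With this sketch, `toy_docs[0] in toy_index` is true, while a freshly created `ToyDoc` is not found because its id was never indexed.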

Support subindexes in `InMemoryExactNNIndex` (#1617)

You can now use the find_subindex method with the InMemoryExactNNIndex Document Index.

import numpy as np
from pydantic import Field

from docarray import BaseDoc, DocList
from docarray.index import InMemoryExactNNIndex
from docarray.typing import ImageUrl, VideoUrl, AnyTensor

class ImageDoc(BaseDoc):
    url: ImageUrl
    tensor_image: AnyTensor = Field(space='cosine', dim=64)


class VideoDoc(BaseDoc):
    url: VideoUrl
    images: DocList[ImageDoc]
    tensor_video: AnyTensor = Field(space='cosine', dim=128)


class MyDoc(BaseDoc):
    docs: DocList[VideoDoc]
    tensor: AnyTensor = Field(space='cosine', dim=256)

doc_index = InMemoryExactNNIndex[MyDoc]()
...

# find by the `ImageDoc` tensor when index is populated
root_docs, sub_docs, scores = doc_index.find_subindex(
    np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3
)

Flexible tensor types for protobuf deserialization (#1645)

You can deserialize any DocVec protobuf message to any tensor type,
by passing the tensor_type parameter to from_protobuf.

This means that you can choose at deserialization time if you are working with numpy, PyTorch, or TensorFlow tensors.

from docarray import BaseDoc, DocVec
from docarray.typing import TensorFlowTensor

class MyDoc(BaseDoc):
    tensor: TensorFlowTensor

da = DocVec[MyDoc](...)  # doesn't matter what tensor_type is here

proto = da.to_protobuf()
da_after = DocVec[MyDoc].from_protobuf(proto, tensor_type=TensorFlowTensor)

assert isinstance(da_after.tensor, TensorFlowTensor)

⚙ Refactoring

Add `DBConfig` to `InMemoryExactNNIndex`

InMemoryExactNNIndex used to take index_file_path as a single constructor parameter, unlike the other
Document Indexes, which each accept their own DBConfig. index_file_path is now part of the DBConfig, so the index
can be initialized from it.
This will allow us to extend the config if more parameters are needed.

The parameters of DBConfig can also be passed at construction time as **kwargs, which keeps this change compatible
with the old usage.

These two initializations are equivalent.

from docarray.index import InMemoryExactNNIndex
db_config = InMemoryExactNNIndex.DBConfig(index_file_path='index.bin')

index = InMemoryExactNNIndex[MyDoc](db_config=db_config)
index = InMemoryExactNNIndex[MyDoc](index_file_path='index.bin')
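
One way such an equivalence can be achieved is by folding keyword arguments into a config object inside the constructor. The sketch below assumes hypothetical `ToyDBConfig` and `ConfiguredIndex` classes. Only the `index_file_path` parameter comes from the release note, and DocArray's actual constructor logic may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyDBConfig:
    """Hypothetical config object; only `index_file_path` comes from the text."""
    index_file_path: Optional[str] = None


class ConfiguredIndex:
    """Hypothetical index accepting either a config object or keyword arguments."""

    def __init__(self, db_config: Optional[ToyDBConfig] = None, **kwargs):
        # Keyword arguments are folded into a config object when none is given.
        self.db_config = db_config if db_config is not None else ToyDBConfig(**kwargs)


from_config = ConfiguredIndex(db_config=ToyDBConfig(index_file_path="index.bin"))
from_kwargs = ConfiguredIndex(index_file_path="index.bin")
```

Both objects end up with an equal `db_config`, which is why old call sites that pass `index_file_path` directly can keep working.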

🐞 Bug Fixes

Allow protobuf deserialization of `BaseDoc` with `Union` type (#1655)

Serialization of BaseDoc types that have fields typed as a Union of Python native types is now supported.

from typing import Union

from docarray import BaseDoc, DocList

class MyDoc(BaseDoc):
    union_field: Union[int, str]

docs1 = DocList[MyDoc]([MyDoc(union_field="hello")])
# round-trip through a DataFrame and compare with the original
docs2 = DocList[MyDoc].from_dataframe(docs1.to_dataframe())
assert docs1 == docs2

When these Union types involve other BaseDoc types, an exception is thrown.

from docarray.documents import ImageDoc, TextDoc

class CustomDoc(BaseDoc):
    ud: Union[TextDoc, ImageDoc] = TextDoc(text='union type')

docs = DocList[CustomDoc]([CustomDoc(ud=TextDoc(text='union type'))])

# raises an Exception
DocList[CustomDoc].from_dataframe(docs.to_dataframe())

Cast limit to integer when passed to `HNSWDocumentIndex` (#1657, #1656)

If you call find or find_batched on an HNSWDocumentIndex, the limit parameter will automatically be cast to
integer.
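
A cast like this matters when the limit arrives as a float such as `2.0`, for example after arithmetic on the caller's side. The sketch below uses a hypothetical `top_matches` function and a plain list to show the effect. It does not reproduce the HNSW backend.

```python
candidates = ["doc-a", "doc-b", "doc-c", "doc-d"]


def top_matches(limit):
    """Hypothetical stand-in for a search call that slices its results."""
    limit = int(limit)  # list slicing rejects floats, so cast first
    return candidates[:limit]


first_two = top_matches(2.0)
```

Without the cast, `candidates[:2.0]` would raise a `TypeError`, because slice indices must be integers.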

Moved `default_column_config` from `RuntimeConfig` to `DBConfig` (#1648)

default_column_config contains specific configuration information about the columns and tables inside the backend's
database. This was previously put inside RuntimeConfig which caused an error because this information is required at
initialization time. This information has been moved inside DBConfig so you can edit it there.

from docarray.index import HNSWDocumentIndex
import numpy as np

db_config = HNSWDocumentIndex.DBConfig()
db_config.default_column_config.get(np.ndarray).update({'ef': 2500})
index = HNSWDocumentIndex[MyDoc](db_config=db_config)

Fix issue with Protobuf (de)serialization for DocVec (#1639)

This bug caused raw Protobuf objects to be stored as DocVec columns after they were deserialized from Protobuf, making the
data essentially inaccessible. This has now been fixed, and DocVec objects are identical before and after (de)serialization.

Fix order of returned matches when `find` and `filter` combination used in `InMemoryExactNNIndex` (#1642)

Hybrid search (find+filter) for InMemoryExactNNIndex was prioritizing low similarities (lower scores) for returned
matches. Fixed by adding an option to sort matches in a reverse order based on their scores.

# prepare a query
q_doc = MyDoc(embedding=np.random.rand(128), text='query')

query = (
    db.build_query()
    .find(query=q_doc, search_field='embedding')
    .filter(filter_query={'text': {'$exists': True}})
    .build()
)

results = db.execute_query(query)
# Before: results was sorted from worst to best matches
# Now: It's sorted in the correct order, showing better matches first
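
The ordering fix can be pictured with plain Python. This sketch assumes similarity scores where higher is better and uses made-up `(id, score)` pairs. It illustrates the sort direction only, not the index internals.

```python
matches = [("doc-a", 0.12), ("doc-b", 0.93), ("doc-c", 0.57)]

# For similarity scores, higher is better, so sort in descending order.
ranked = sorted(matches, key=lambda pair: pair[1], reverse=True)
best_first = [doc_id for doc_id, _ in ranked]
```

Sorting the same pairs without `reverse=True` would put the weakest match first, which is the behavior described as the bug.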

Working with external Qdrant collections (#1632)

Using QdrantDocumentIndex to connect to a Qdrant DB initialized outside of DocArray previously raised a KeyError.
This has been fixed, and now you can use QdrantDocumentIndex to connect to externally initialized collections.

Other bug fixes

Update text search to match Weaviate client's new signature (#1654)
Fix DocVec equality (#1641, #1663)
Fix exception when summary() is called for LegacyDocument (#1637)
Fix DocList and DocVec coercion (#1568)
Fix update() on BaseDoc with tensor fields (#1628)

📗 Documentation Improvements

Enhance DocVec section (#1658)
Qdrant in memory usage (#1634)

🤟 Contributors

We would like to thank all contributors to this release:

Johannes Messner (@JohannesMessner)
Nikolas Pitsillos (@npitsillos)
Shukri (@hsm207)
Kacper Łukawski (@kacperlukawski)
Aman Agarwal (@agaraman0)
maxwelljin (@maxwelljin)
samsja (@samsja)
Saba Sturua (@jupyterjazz)
Joan Fontanals (@JoanFM)


06 Jun 14:06

github-actions

v0.33.0

68194f4

💫 Release v0.33.0

Release Note (`0.33.0`)

Release time: 2023-06-06 14:05:56

This release contains 1 new feature, 1 performance improvement, 9 bug fixes and 4 documentation improvements.

🆕 Features

Allow coercion between different Tensor types (#1552) (#1588)

Allow coercing to a TorchTensor from an NdArray or TensorFlowTensor and the other way around.

from docarray import BaseDoc
from docarray.typing import TorchTensor
import numpy as np


class MyTensorsDoc(BaseDoc):
    tensor: TorchTensor


doc = MyTensorsDoc(tensor=np.zeros(512))
doc.summary()

📄 MyTensorsDoc : 0a10f88 ...
╭─────────────────────┬────────────────────────────────────────────────────────╮
│ Attribute           │ Value                                                  │
├─────────────────────┼────────────────────────────────────────────────────────┤
│ tensor: TorchTensor │ TorchTensor of shape (512,), dtype: torch.float64      │
╰─────────────────────┴────────────────────────────────────────────────────────╯

🚀 Performance

Avoid stack embedding for every search (#1586)

We have made a performance improvement for the find interface for InMemoryExactNNIndex that gives a ~2x speedup.

The script used to measure this is as follows:

from torch import rand
from time import perf_counter

from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import TorchTensor


class MyDocument(BaseDoc):
    embedding: TorchTensor
    embedding2: TorchTensor
    embedding3: TorchTensor

def generate_doc_list(num_docs: int, dims: int) -> DocList[MyDocument]:
    return DocList[MyDocument](
        [
            MyDocument(
                embedding=rand(dims),
                embedding2=rand(dims),
                embedding3=rand(dims),
            )
            for _ in range(num_docs)
        ]
    )

num_docs, num_queries, dims = 500000, 1000, 128
data_list = generate_doc_list(num_docs, dims)
queries = generate_doc_list(num_queries, dims)

index = InMemoryExactNNIndex[MyDocument](data_list)

start = perf_counter()
for _ in range(5):
    matches, scores = index.find_batched(queries, search_field='embedding')

print(f"Number of queries: {num_queries} \n"
      f"Number of indexed documents: {num_docs} \n"
      f"Total time: {(perf_counter() - start)/5} seconds")
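
The release note does not describe how the repeated stacking is avoided. One plausible approach is to cache the stacked embeddings and rebuild them only when new documents are indexed. The sketch below shows that idea with a hypothetical `CachedEmbeddingIndex` in pure Python. It is an assumption for illustration, not the code from #1586.

```python
class CachedEmbeddingIndex:
    """Hypothetical index that builds its embedding matrix lazily and reuses it."""

    def __init__(self):
        self._docs = []      # list of (doc_id, embedding) pairs
        self._matrix = None  # cached list of embeddings
        self.rebuilds = 0    # counts how often the cache was rebuilt

    def index(self, doc_id, embedding):
        self._docs.append((doc_id, embedding))
        self._matrix = None  # new data invalidates the cache

    def _embeddings(self):
        if self._matrix is None:
            self._matrix = [emb for _, emb in self._docs]
            self.rebuilds += 1
        return self._matrix

    def find(self, query, limit=1):
        # Dot-product score against every cached embedding.
        scores = [sum(q * e for q, e in zip(query, emb)) for emb in self._embeddings()]
        order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        return [self._docs[i][0] for i in order[:limit]]


idx = CachedEmbeddingIndex()
idx.index("a", [1.0, 0.0])
idx.index("b", [0.0, 1.0])
first = idx.find([0.9, 0.1])
second = idx.find([0.1, 0.9])
```

Two searches trigger only one rebuild here, so the cost of assembling the matrix is paid once per batch of index updates rather than once per query.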

🐞 Bug Fixes

Respect `limit` parameter in `filter` for index backends (#1618)

InMemoryExactNNIndex and HnswDocumentIndex now respect the limit parameter in the filter API.

`HnswDocumentIndex` can search with `limit` greater than number of documents (#1611)

HnswDocumentIndex now allows calling find with a limit parameter larger than the number of indexed documents.

Allow updating `HnswDocumentIndex` (#1604)

HnswDocumentIndex now allows reindexing documents with the same id, updating the original documents.

Dynamically resize internal index to adapt to increasing number of documents (#1602)

HnswDocumentIndex now allows indexing more than max_elements, dynamically adapting the index as it grows.

Fix simple usage of `HnswDocumentIndex` (#1596)

from docarray.index import HnswDocumentIndex
from docarray import DocList, BaseDoc
from docarray.typing import NdArray
import numpy as np

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for i in range(200)]
index = HnswDocumentIndex[MyDoc](work_dir='./tmp', index_name='index')
index.index(docs=DocList[MyDoc](docs))
resp = index.find_batched(queries=DocList[MyDoc](docs[0:3]), search_field='embedding')

Previously, this basic usage threw an exception:

TypeError: ModelMetaclass object argument after ** must be a mapping, not MyDoc

Now, it works as expected.

Fix `InMemoryExactNNIndex` index initialization with nested `DocList` (#1582)

Instantiating an InMemoryExactNNIndex with a Document schema that had a nested DocList previously threw an error:

from docarray import BaseDoc, DocList
from docarray.documents import TextDoc
from docarray.index import InMemoryExactNNIndex

class MyDoc(BaseDoc):
    text: str
    d_list: DocList[TextDoc]

index = InMemoryExactNNIndex[MyDoc]()

TypeError: docarray.index.abstract.BaseDocIndex.__init__() got multiple values for keyword argument 'db_config'

Now it can be successfully instantiated.

Fix summary of document with list (#1595)

Calling summary on a document with a List attribute previously showed the wrong type:

from docarray import BaseDoc, DocList
from typing import List
class TestDoc(BaseDoc):
    str_list: List[str]

dl = DocList[TestDoc]([TestDoc(str_list=[]), TestDoc(str_list=["1"])])
dl.summary()

Previous output:

╭─────── DocList Summary ───────╮
│                               │
│   Type     DocList[TestDoc]   │
│   Length   2                  │
│                               │
╰───────────────────────────────╯
╭─── Document Schema ───╮
│                       │
│   TestDoc             │
│   └── str_list: str   │
│                       │
╰───────────────────────╯

New output:

╭─────── DocList Summary ───────╮
│                               │
│   Type     DocList[TestDoc]   │
│   Length   2                  │
│                               │
╰───────────────────────────────╯
╭────── Document Schema ──────╮
│                             │
│   TestDoc                   │
│   └── str_list: List[str]   │
│                             │
╰─────────────────────────────╯

Solve issues caused by `issubclass` (#1594)

DocArray relies heavily on Python's built-in issubclass function, which caused multiple issues. We now use a safe version that accounts for edge cases and special types.
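
To see why a safe variant helps, note that `issubclass` raises a `TypeError` when its first argument is not a class, which is the case for generic aliases such as `List[str]`. The helper below is a minimal sketch with a hypothetical implementation. DocArray's own helper may handle these cases differently.

```python
from typing import List


def safe_issubclass(candidate, parent) -> bool:
    """Hypothetical helper: like issubclass, but returns False for non-class inputs."""
    try:
        return issubclass(candidate, parent)
    except TypeError:
        # Raised when `candidate` is not a class, e.g. the generic alias List[str].
        return False


plain = safe_issubclass(bool, int)
generic = safe_issubclass(List[str], list)
```

Ordinary classes behave as usual, while the generic alias yields `False` instead of an exception.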

Make example payload a string rather than bytes (#1587)

The example payload of a given document schema with Tensor attribute was previously of bytes type. This has now been changed to str.

from docarray import DocList, BaseDoc
from docarray.documents import TextDoc
from docarray.typing import NdArray
import numpy as np


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

print(f'{type(MyDoc.schema()["properties"]["embedding"]["example"])}')

📗 Documentation Improvements

Add forward declaration steps to example to avoid pickling error (#1615)
Fix n_dim to dim (#1610)
Add "in memory" to documentation as list of supported vector indexes (#1607)
Add a tensor section (#1576)

🤟 Contributors

We would like to thank all contributors to this release:

Mohammad Kalim Akram (@makram93)
samsja (@samsja)
Saba Sturua (@jupyterjazz)
Joan Fontanals (@JoanFM)
maxwelljin (@maxwelljin)


26 May 14:51

github-actions

v0.32.1

8a2e92a

💫 Patch v0.32.1

Release Note (`0.32.1`)

Release time: 2023-05-26 14:50:34

This release contains 4 bug fixes, 1 refactoring and 2 documentation improvements.

⚙ Refactoring

Improve `ElasticDocIndex` logging (#1551)

More debugging logs have been added inside ElasticDocIndex.

🐞 Bug Fixes

Allow `InMemoryExactNNIndex` with `Optional` embedding tensors (#1575)

You can now index Documents where the tensor search_field is Optional. The index will not consider these None embeddings when running a search.

import torch
from typing import Optional

from docarray import BaseDoc, DocList
from docarray.typing import TorchTensor
from docarray.index import InMemoryExactNNIndex


class EmbeddingDoc(BaseDoc):
    embedding: Optional[TorchTensor[768]]

docs = DocList[EmbeddingDoc](
    [EmbeddingDoc(embedding=(torch.rand(768) if i % 2 else None)) for i in range(5)]
)
index = InMemoryExactNNIndex[EmbeddingDoc](docs)
index.find(torch.rand((768,)), search_field="embedding", limit=3)
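
Conceptually, skipping `None` embeddings means filtering them out before scoring. The sketch below shows this with a hypothetical pure-Python `find` over `(id, embedding)` pairs and cosine similarity. It is an illustration under those assumptions, not the index's actual code.

```python
import math
from typing import List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def find(query, docs: List[Tuple[str, Optional[List[float]]]], limit: int = 3):
    """Hypothetical exact search that ignores documents without an embedding."""
    scored = [
        (doc_id, cosine(query, emb)) for doc_id, emb in docs if emb is not None
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


docs = [("a", [1.0, 0.0]), ("b", None), ("c", [0.0, 1.0])]
hits = find([1.0, 0.1], docs)
```

Document `"b"` never appears in `hits`, and the remaining documents are ranked by similarity to the query.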

Safe `is_subclass` check (#1569)

In DocArray, especially when dealing with indexers, field types are checked, which leads to calls to Python's issubclass function.
This call fails under some circumstances, for instance when the checked type is a List or Tuple. Starting with this release, we use a safe version that does not fail for these cases.

This enables the following usage, which would otherwise fail:

from typing import List

from docarray import BaseDoc
from docarray.index import HnswDocumentIndex

class MyDoc(BaseDoc):
    test: List[str]

index = HnswDocumentIndex[MyDoc]()

Fix `AnyDoc` deserialization (#1571)

AnyDoc is a schema-less special Document that adapts to the schema of the data it tries to load. However, in cases where the data contained Dictionaries or Lists, deserialization failed. This is now fixed and you can have this behavior:

from docarray.base_doc import AnyDoc, BaseDoc
from typing import Dict

class ConcreteDoc(BaseDoc):
    text: str
    tags: Dict[str, int]

doc = ConcreteDoc(text='text', tags={'type': 1})

any_doc = AnyDoc.from_protobuf(doc.to_protobuf())
assert any_doc.text == 'text'
assert any_doc.tags == {'type': 1}

`dict` method for Document view (#1559)

Prior to this fix, doc.dict() would return an empty Dictionary if doc.is_view() == True:

from docarray import BaseDoc, DocVec

class MyDoc(BaseDoc):
    foo: int

vec = DocVec[MyDoc]([MyDoc(foo=3)])
# before
doc = vec[0]
assert doc.is_view()
print(doc.dict())
# > {}

# after
doc = vec[0]
assert doc.is_view()
print(doc.dict())
# > {'id': 'f285db406a949a7e7ab084032800f7d8', 'foo': 3}

📗 Documentation Improvements

Update doc building guide (#1566)
Explain the state of DocList in FastAPI (#1546)

🤟 Contributors

We would like to thank all contributors to this release:

aman-exp-infy (@agaraman0)
Johannes Messner (@JohannesMessner)
Joan Fontanals (@JoanFM)
Saba Sturua (@jupyterjazz)
Ge Jin (@maxwelljin)


16 May 11:30

github-actions

v0.32.0

9657d6f

💫 Release v0.32.0

Release Note (`v0.32.0`)

This release contains 4 new features, 0 performance improvements, 5 bug fixes and 4 documentation improvements.

🆕 Features

Subindex for document index (#1428)

The subindex feature allows you to index documents that contain another DocList by automatically creating a separate collection/index for each such DocList:

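import numpy as np

from docarray import BaseDoc, DocList
from docarray.index import ElasticDocIndex
from docarray.typing import NdArray

# create nested document schema
class SimpleDoc(BaseDoc):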
    tensor: NdArray[10]
    text: str


class MyDoc(BaseDoc):
    docs: DocList[SimpleDoc]


# create some docs
my_docs = [
    MyDoc(
        docs=DocList[SimpleDoc](
            [
                SimpleDoc(
                    tensor=np.ones(10) * (j + 1),
                    text=f"hello {j}",
                )
                for j in range(10)
            ]
        ),
    )
]

# index them into Elasticsearch
index = ElasticDocIndex[MyDoc](index_name="idx")
index.index(my_docs)  # indexes named 'idx' and 'idx__docs' will be created

# search on the nested level (subindex)
query = np.random.rand(10)
matches_root, matches_nested, scores = index.find_subindex(
    query, search_field="docs__tensor", limit=5
)

OpenAPI and FastAPI tensor shapes (#1510)

We have enabled shaped tensors to be properly represented in OpenAPI/SwaggerUI, both in examples and the schema.

This means that you can now build web APIs using FastAPI where the SwaggerUI properly communicates tensor shapes to your users:

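from fastapi import FastAPI

from docarray import BaseDoc
from docarray.base_doc import DocArrayResponse
from docarray.typing import TorchTensor


class Doc(BaseDoc):
    embedding_torch: TorchTensor[3, 4]


app = FastAPI()


@app.post("/foo", response_model=Doc, response_class=DocArrayResponse)
async def foo(doc: Doc) -> Doc:
    return Doc(embedding_torch=doc.embedding_torch)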

Generated Swagger UI:

Save and load in-memory index (#1534)

We added a persist method to the InMemoryExactNNIndex class to save the index to disk.

# Save your existing index as a binary file
doc_index.persist('docs.bin')
# Initialize a new document index using the saved binary file
new_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin')
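
The save-and-reload round trip can be sketched with the standard library. The example assumes a hypothetical `PersistableIndex` and uses `pickle` as the on-disk format purely for illustration. The format DocArray actually writes is not described in this note.

```python
import os
import pickle
import tempfile


class PersistableIndex:
    """Hypothetical in-memory index that can be saved to and loaded from a file."""

    def __init__(self, index_file_path=None):
        self.docs = []
        # Load existing data only if a path was given and the file exists.
        if index_file_path is not None and os.path.exists(index_file_path):
            with open(index_file_path, "rb") as f:
                self.docs = pickle.load(f)

    def persist(self, file_path):
        with open(file_path, "wb") as f:
            pickle.dump(self.docs, f)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "docs.bin")
    original = PersistableIndex()
    original.docs.append({"id": "1", "text": "hello"})
    original.persist(path)
    restored = PersistableIndex(index_file_path=path)
```

Because `pickle` can execute code on load, a sketch like this should only ever read files it wrote itself.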

🐞 Bug Fixes

`search_field` should be optional in hybrid text search (#1516)

We have added a sane default to text_search() for the search_field argument that is now Optional.

Check if file path exists for in-memory index (#1537)

We have added an internal check to see if index_file_path exists when passed to InMemoryExactNNIndex.

Add empty judgement to index search (#1533)

We have ensured that empty indices do not fail when find is called.

Detach torch tensors (#1526)

Serializing tensors with gradients no longer fails.

`DocVec` display (#1522)

DocVec display issues have been resolved.

📗 Documentation Improvements

Remove erroneous info (#1531)
Fix link to documentation in readme (#1525)
Flatten structure (#1520)
Fix links (#1518)

🤟 Contributors

We would like to thank all contributors to this release:

Mohammad Kalim Akram (@makram93)
Johannes Messner (@JohannesMessner)
Anne Yang (@AnneYang720)
Zhaofeng Miao (@mapleeit)
Joan Fontanals (@JoanFM)
Kacper Łukawski (@kacperlukawski)
IyadhKhalfallah (@IyadhKhalfallah)
Saba Sturua (@jupyterjazz)


08 May 16:31

github-actions

v0.31.1

8c7eb95

💫 Patch v0.31.1

Release Note (`0.31.1`)

This patch release fixes a small bug that was introduced in the latest minor release (0.31.0).

🐞 Bug Fixes

Calling json or dict on an Optional nested DocList no longer throws an error if the value is set to None (#1512)

🤟 Contributors

We would like to thank all contributors to this release:

samsja (@samsja)


08 May 09:41

github-actions

v0.31.0

67c7e6d

💫 Release v0.31.0

Release Note (`v0.31.0`)

This release contains 4 new features, 11 bug fixes, and several documentation improvements.

💥 Breaking changes

Return type of `DocVec` Optional Tensor (#1472)

Optional tensor fields in a DocVec will return None instead of a list of NaN if the column does not hold any tensor.

This code snippet shows the breaking change:

from typing import Optional

from docarray import BaseDoc, DocVec
from docarray.typing import NdArray

class MyDoc(BaseDoc):
    tensor: Optional[NdArray[10]]

docs = DocVec[MyDoc]([MyDoc() for j in range(2)])

print(docs.tensor)

Version	Return type
0.30.0	`[nan nan]`
0.31.0	`None`

Default index collection names

Most vector databases have a concept similar to a 'table' in a relational database; this concept is usually called 'collection', 'index', 'class' or similar.

In DocArray v0.30.0, every Document Index backend defined its own default name for this, i.e. a default index_name or collection_name.

Starting with DocArray v0.31.0, the default index_name/collection_name will be derived from the document schema name:

from docarray.index.backends.weaviate import WeaviateDocumentIndex
from docarray import BaseDoc

class MyDoc(BaseDoc):
    pass

# With v0.30.0, the line below defaults to `index_name='Document'`.
# This was the default regardless of the Document Index schema.
# With v0.31.0, the line below defaults to `index_name='MyDoc'`
# The default now depends on the schema, i.e. the `MyDoc` class.
store = WeaviateDocumentIndex[MyDoc]()

If you create and persist a Document Index with v0.30.0, and try to access it using v0.31.0 without manually specifying an index name, an Exception will occur.

You can fix this by manually specifying the index name to match the old default:

# Create new Document Index using v0.30.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=...)
# Access it using v0.31.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=..., index_name='Document')

The below table summarizes the change for all DB backends:

Document Index	DBConfig argument	Default in v0.30.0	Default in v0.31.0
WeaviateDocumentIndex	`index_name`	'Document'	Schema class name
QdrantDocumentIndex	`collection_name`	'documents'	Schema class name
ElasticDocIndex	`index_name`	'index__' + a random id	Schema class name
ElasticV7DocIndex	`index_name`	'index__' + a random id	Schema class name
HnswDocumentIndex	n/a	n/a	n/a
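
The new default can be thought of as "explicit name if given, otherwise the schema class name". The sketch below expresses that rule with a hypothetical `resolve_index_name` helper and a stand-in schema base class. Real backends may normalize the name further, which is not modeled here.

```python
from typing import Optional


class ToySchema:
    """Hypothetical stand-in for a document schema base class."""


class MyDoc(ToySchema):
    pass


def resolve_index_name(schema: type, index_name: Optional[str] = None) -> str:
    """Hypothetical helper: an explicit name wins, otherwise use the schema class name."""
    return index_name if index_name is not None else schema.__name__


default_name = resolve_index_name(MyDoc)
legacy_name = resolve_index_name(MyDoc, index_name="Document")
```

Passing the old default explicitly, as in `legacy_name`, mirrors the migration fix shown above for indexes created with v0.30.0.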

🆕 Features

Add `InMemoryExactNNIndex` (#1441)

In this version we have introduced the InMemoryExactNNIndex Document Index which allows you to perform in-memory exact vector search (as opposed to approximate nearest neighbor search in vector databases).

The InMemoryExactNNIndex can be used for prototyping and is suitable for dealing with small-scale documents (1k-10k), as opposed to a vector database that is suitable for larger scales but comes with a performance overhead at smaller scales.

from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import NdArray

import numpy as np

class MyDoc(BaseDoc):
    tensor: NdArray[512]

docs = DocList[MyDoc](MyDoc(tensor=i*np.ones(512)) for i in range(10))

doc_index = InMemoryExactNNIndex[MyDoc]()
doc_index.index(docs)

print(doc_index.find(3*np.ones(512), search_field='tensor', top_k=3))

FindResult(documents=<DocList[MyDoc] (length=10)>, scores=array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))

`DocList` inherits from Python `list` (#1457)

DocList is now a subclass of Python's list. This means that you can now use all the methods that are available to Python lists on DocList objects. For example, you can now use len on DocList objects and tools like Pydantic or FastAPI will be able to work with it more easily.

Add `len` to `DocIndex` (#1454)

You can now perform len(vector_index) which is equivalent to vector_index.num_docs().
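
In Python, `len()` support comes from defining `__len__`. A minimal sketch of the delegation, assuming a hypothetical `CountingIndex` class with a `num_docs()` method as named above:

```python
class CountingIndex:
    """Hypothetical index that exposes its size through num_docs() and len()."""

    def __init__(self):
        self._docs = []

    def index(self, docs):
        self._docs.extend(docs)

    def num_docs(self) -> int:
        return len(self._docs)

    def __len__(self) -> int:
        # len(index) simply delegates to num_docs().
        return self.num_docs()


counting_index = CountingIndex()
counting_index.index(["doc-1", "doc-2", "doc-3"])
size = len(counting_index)
```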

Other minor features

Add a to_json alias to BaseDoc (#1494)

🐞 Bug Fixes

Point to older versions when importing `Document` or `DocumentArray` (#1422)

Trying to load Document or DocumentArray from DocArray would previously raise an error, saying that you needed to downgrade your version of DocArray if you wanted to use these two objects. This behavior has been fixed.

Fix `AnyDoc.from_protobuf` (#1437)

AnyDoc can now read any BaseDoc protobuf file. The same applies to DocList.

Other bug fixes

Fix extend to DocList (#1493)
Fix bug when calling dict() on BaseDoc (#1481)
Fix bug when calling json() on BaseDoc (#1481)
Support Pandas 2.0 by using pd.concat() instead of df.append() in to_dataframe() to avoid warning (#1478)
Add logs to Elasticsearch index (#1427)
Fix a bug in Document Index where Torch tensors that required grad were not able to be converted to ndarray (#1429)
Fix a bug with HNSW (#1426)
Hubble Binary format version bump (#1414)
Save index during creation for hnswlib (#1424)

📗 Documentation Improvements

Fix FastAPI docs (#1453)
Index predefined Documents (#1434)
Clean up data types section (#1412)
Remove duplicate API reference section (#1408)
Docindex URLs (#1433)
Fix Install commands hint (#1421)
Add Google Analytics (#1432)
Add install instructions for hnswlib and elastic document indexes (#1431)
Various fixes (#1436, #1417, #1423, #1418, #1411, #1419)

🤟 Contributors

We would like to thank all contributors to this release:

Alex Cureton-Griffiths (@alexcg1)
samsja (@samsja)
Johannes Messner (@JohannesMessner)
Anne Yang (@AnneYang720)
Scott Martens (@scott-martens)
カレン (@RStar2022)
Aman Agarwal (@agaraman0)
Yanlong Wang (@nomagick)
Charlotte Gerhaher (@anna-charlotte)


18 Apr 07:59

JoanFM

v0.30.0

56e11a3

💫 Release v0.30.0

💫 Release v0.30.0 (a.k.a DocArray v2)

Warning
This version of DocArray is a complete rewrite, so it includes many changes that go well beyond ordinary breaking changes. Be sure to check the documentation to prepare your migration.

Changelog

If you are using DocArray v<0.30.0, you will be familiar with its dataclass API.

DocArray v2 is that idea, taken seriously. Every document is created through a dataclass-like interface, courtesy of Pydantic.

This gives the following advantages:

Flexibility: No need to conform to a fixed set of fields -- your data defines the schema.
Multimodality: Easily store multiple modalities and multiple embeddings in the same Document.
Language agnostic: At their core, Documents are just dictionaries. This makes it easy to create and send them from any language, not just Python.

You may also be familiar with our old Document Stores for vector database integration. They are now called Document Indexes and offer the following improvements:

Hybrid search: You can now combine vector search with text search, and even filter by arbitrary fields.
Production-ready: The new Document Indexes are a much thinner wrapper around the various vector DB libraries, making them more robust and easier to maintain.
Increased flexibility: We strive to support any configuration or setting that you could perform through the DB's first-party client.

For now, Document Indexes support Weaviate, Qdrant, ElasticSearch, and HNSWLib, with more to come.

Changes to `Document`

Document has been renamed to BaseDoc.
BaseDoc cannot be used directly, but instead has to be extended. Therefore, each document class is created through a dataclass-like interface.
Following from the previous point, extending BaseDoc allows for a flexible schema compared to the Document class in v1 which only allowed for a fixed schema, with one of tensor, text and blob, and additional chunks and matches.
Due to the added flexibility, we cannot know in advance which fields your document class will provide. Therefore, various methods from v1 (such as .load_uri_to_image_tensor()) are not supported in v2. Instead, we provide some of those methods on the typing-level.
In v2 we have the LegacyDocument class, which extends BaseDoc while following the same schema as v1's Document. The LegacyDocument can be useful to start migrating your codebase from v1 to v2. Nevertheless, the API is not fully compatible with DocArray v1 Document. Indeed, none of the methods associated with Document are present. Only the schema of the data is similar.

Changes to `DocumentArray`

DocList

The DocumentArray class from v1 has been renamed to DocList, to be more descriptive of its actual functionality, since it is a list of BaseDocs.

DocVec

Additionally, we introduced the class DocVec, which is a column-based representation of BaseDocs. Both DocVec and DocList extend AnyDocArray.
DocVec is a container of Documents suited to computations that require batches of data (e.g. matrix multiplication, distance calculation, deep learning forward pass).
A DocVec has a similar interface as DocList but with an underlying implementation that is column-based instead of row-based. Each field of the schema of the DocVec (the .doc_type which is a BaseDoc) will be stored in a column. If the field is a tensor, the data from all Documents will be stored as a single doc_vec (Torch/TensorFlow/NumPy) tensor. If the tensor field is AnyTensor or a Union of tensor types, the .tensor_type will be used to determine the type of the doc_vec column.
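
The difference between the row-based and column-based layouts can be shown with plain Python containers. This sketch uses dictionaries and lists as stand-ins for documents and tensors. It illustrates the layout idea only, not DocVec's storage.

```python
# Row-based layout: one record per document.
rows = [
    {"title": "a", "embedding": [1.0, 2.0]},
    {"title": "b", "embedding": [3.0, 4.0]},
]

# Column-based layout: one entry per field, holding the values of all documents.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A batch operation can now work on a whole column at once,
# here summing each embedding dimension across all documents.
column_sums = [sum(values) for values in zip(*columns["embedding"])]
```

With a real tensor library the `embedding` column would be a single 2-D array, so the same reduction would be one vectorized call.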

Parameterized DocList

Because document schemas are now flexible, a DocList does not necessarily have to be homogeneous when you initialize it.
If you want a homogeneous DocList, you can parameterize it at initialization time:

from docarray import DocList
from docarray.documents import ImageDoc

docs = DocList[ImageDoc]()

Methods like .from_csv() or .pull() only work with parameterized DocLists.

Access attributes of your DocumentArray

In v1 you could access an attribute of all Documents in your DocumentArray by calling the plural of the attribute's name on your DocumentArray instance.
In v2 you don't have to use the plural, but instead just use the document's attribute name, since AnyDocArray will expose the same attributes as the BaseDocs it contains. This will return a list of type(attribute). However, this only works if (and only if) all the BaseDocs in the AnyDocArray have the same schema. Therefore only this works:

from docarray import BaseDoc, DocList


class Book(BaseDoc):
    title: str
    author: str = None


docs = DocList[Book]([Book(title=f'title {i}') for i in range(5)])
book_titles = docs.title  # returns a list[str]

# this would fail
# docs = DocList([Book(title=f'title {i}') for i in range(5)])
# book_titles = docs.title
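
Attribute access across a whole container can be sketched with `__getattr__`, which Python calls only when normal lookup fails. The `TypedList` class below is a hypothetical stand-in built on the standard library. It does not reproduce how `AnyDocArray` is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Book:
    title: str
    author: Optional[str] = None


class TypedList(list):
    """Hypothetical list that forwards unknown attribute lookups to its items."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup on the list itself fails.
        return [getattr(item, name) for item in self]


books = TypedList(Book(title=f"title {i}") for i in range(3))
titles = books.title
```

This only works when every item has the requested attribute, which mirrors the requirement that all documents share the same schema.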

Changes to Document Store

In v2 the Document Store has been renamed to DocIndex and can be used for fast retrieval using vector similarity. DocArray v2 DocIndex supports Weaviate, Qdrant, ElasticSearch, and HNSWLib.

Instead of creating a DocumentArray instance and setting the storage parameter to a vector database of your choice, in v2 you can initialize a DocIndex object of your choice, such as:

db = HnswDocumentIndex[MyDoc](work_dir='/my/work/dir')

In contrast, DocStore in v2 can be used for simple long-term storage, such as with AWS S3 buckets or Jina AI Cloud.

Thank you to all of the contributors to this release:

Contributors

hsm207, kacperlukawski, and 13 other contributors

17 Jan 09:11

github-actions

v0.21.0

ca2973f

💫 Release v0.21.0

Release Note (`0.21.0`)

Release time: 2023-01-17 09:10:50

This release contains 3 new features, 7 bug fixes and 5 documentation improvements.

🆕 Features

OpenSearch Document Store (#853)

This version of DocArray adds a new Document Store: OpenSearch!

You can use the OpenSearch Document Store to index your Documents and perform ANN search on them:

from docarray import Document, DocumentArray
import numpy as np

# Connect to OpenSearch instance
n_dim = 3

da = DocumentArray(
    storage='opensearch',
    config={'n_dim': n_dim},
)

# Index Documents
with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim))
            for i in range(10)
        ]
    )

# Perform ANN search
np_query = np.ones(n_dim) * 8
results = da.find(np_query, limit=10)

Additionally, the OpenSearch Document Store can perform filter queries, search by text, and search by tags.

Learn more about its usage in the official documentation.

Add color to point cloud display (#961)

You can now include color information in your point cloud data, which can be visualized using display_point_cloud_tensor():

import numpy as np
from docarray import Document, DocumentArray

coords = np.load('a_red_motorbike/coords.npy')
colors = np.load('a_red_motorbike/coord_colors.npy')

doc = Document(
    tensor=coords,
    chunks=DocumentArray([Document(tensor=colors, name='point_cloud_colors')])
)
doc.display()

Add language attribute to Redis Document Store (#953)

The Redis Document Store now supports text search in several languages. To choose one, set the language parameter in the Redis configuration:

da = DocumentArray(
    storage='redis',
    config={
        'n_dim': 128,
        'index_text': True,
        'language': 'chinese',
    },
)

🐞 Bug Fixes

Replace newline with whitespace to fix display in plot embeddings (#963)

Whenever the string "\n" was contained in any Document field, doc.plot() would result in a rendering error. This release fixes those errors by rendering "\n" as whitespace.

Fix unwanted coercion in `to_pydantic_model` (#949)

This bug caused all strings of the form 'Infinity' to be coerced to the string 'inf' when calling to_pydantic_model() or to_dict(). This is fixed now, leaving such strings unchanged.

Calculate relevant docs on index instead of queries (#950)

In the embed_and_evaluate() method, the number of relevant Documents per label used to be calculated based on the Documents in self. This is not generally correct, so after this fix the quantity is calculated based on the Documents in the index data.
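The difference between the two counting strategies can be sketched in plain Python. This is an illustration only, not the library's implementation; the label lists are made-up data.

```python
from collections import Counter

# Hypothetical labels; in DocArray these would come from each Document.
query_labels = ["cat", "dog"]
index_labels = ["cat", "cat", "cat", "dog", "bird"]

# Counting on the queries: each label appears only as often as it is queried.
relevant_from_queries = Counter(query_labels)

# Counting on the index data: the number of Documents that can be retrieved.
relevant_from_index = Counter(index_labels)

for label in query_labels:
    print(label, relevant_from_queries[label], relevant_from_index[label])
```

For the label "cat", the query-based count is 1 while three relevant Documents exist in the index, so a recall computed against the query-based count would be misleading.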

Remove offset index create on list like false (#936)

When a Document Store has list-like behavior disabled, it no longer creates an offset to id mapping, which improves performance.
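As a rough sketch of why skipping the mapping saves work, consider a toy store in plain Python. `TinyStore` and its `list_like` flag are hypothetical names for this illustration and do not mirror DocArray's internals.

```python
class TinyStore:
    """Toy key-value store; `list_like` toggles the offset-to-id mapping."""

    def __init__(self, list_like=True):
        self._docs = {}
        self._offset2id = [] if list_like else None

    def add(self, doc_id, doc):
        self._docs[doc_id] = doc
        if self._offset2id is not None:
            self._offset2id.append(doc_id)  # extra bookkeeping per insert

    def __getitem__(self, key):
        if isinstance(key, int):
            if self._offset2id is None:
                raise TypeError("integer indexing requires list-like behavior")
            key = self._offset2id[key]
        return self._docs[key]


store = TinyStore(list_like=False)
store.add("r0", "first document")
print(store["r0"])  # first document
```

With list-like behavior disabled, access by id still works, but there is no positional order to maintain on every insert.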

Add support for remote audio files (#933)

Loading audio files from a remote URL would cause FileNotFoundError, which is now fixed.

Query operator `$exists` does not work correctly with tags (#911) (#923)

Before this fix, $exists would treat falsy values such as 0 or [] as non-existent. This is now fixed.
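The class of bug described here can be reproduced in a few lines of plain Python. The two helper functions are hypothetical and only illustrate the difference between a truthiness check and a membership check; they are not DocArray's actual query code.

```python
def exists_buggy(tags, key):
    # Truthiness check: 0, [], '' and False are wrongly reported as missing.
    return bool(tags.get(key))


def exists_fixed(tags, key):
    # Membership check: only a missing key counts as non-existent.
    return key in tags


tags = {"price": 0, "sizes": []}
print(exists_buggy(tags, "price"), exists_fixed(tags, "price"))  # False True
```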

Document from dataclass with singleton list (#1018)

When casting from a dataclass to Document, singleton lists were treated like an individual element, even if the corresponding field was annotated with List[...]. Now this case is considered, and accessing such a field will yield a DocumentArray, even for singleton inputs.

📗 Documentation Improvements

Link to Discord (#1010)
Have fewer versions to avoid deployment timeout (#977)
Fix data management section not appearing in Documentation (#967)
Link to OpenSearch docs in sidebar (#960)
Multimodal to datatypes (#934)

🤟 Contributors

We would like to thank all contributors to this release:

Jay Bhambhani (@jay-bhambhani)
Alvin Prayuda (@alphinside)
Johannes Messner (@JohannesMessner)
samsja (@samsja)
Marco Luca Sbodio (@marcosbodio)
Anne Yang (@AnneYang720)
Michael Günther (@guenthermi)
AlaeddineAbdessalem (@alaeddine-13)
Han Xiao (@hanxiao)
Alex Cureton-Griffiths (@alexcg1)
Charlotte Gerhaher (@anna-charlotte)

12 Dec 09:33

github-actions

v0.20.1

59606d8

💫 Patch v0.20.1

Release Note (`0.20.1`)

Release time: 2022-12-12 09:32:37

🐞 Bug Fixes

Make Milvus DocumentArray thread safe and suitable for pytest (#904)

This bug was causing connectivity issues when using multiple DocumentArrays in different threads to connect to the same Milvus instance, e.g. in pytest.

This would produce an error like the following:

E1207 14:59:51.357528591    2279 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers
E1207 14:59:51.367985469    2279 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers
E1207 14:59:51.457061884    3934 ev_epoll1_linux.cc:824]     assertion failed: gpr_atm_no_barrier_load(&g_active_poller) != (gpr_atm)worker
Fatal Python error: Aborted

This fix creates a separate gRPC connection for each MilvusDocumentArray instance, circumventing the issue.

Restore backwards compatibility for (de)serialization (#903)

DocArray v0.20.0 broke (de)serialization backwards compatibility with earlier versions of the library, making it impossible to load DocumentArrays from v0.19.1 or earlier from disk:

# DocArray <= 0.19.1
da = DocumentArray([Document() for _ in range(10)])
da.save_binary('old-da.docarray')
# DocArray == 0.20.0
da = DocumentArray.load_binary('old-da.docarray')
da.extend([Document()])
print(da)

AttributeError: 'DocumentArrayInMemory' object has no attribute '_is_subindex'

This fix restores backwards compatibility by not relying on newly introduced private attributes:

# DocArray <= 0.19.1
da = DocumentArray([Document() for _ in range(10)])
da.save_binary('old-da.docarray')
# DocArray == 0.20.1
da = DocumentArray.load_binary('old-da.docarray')
da.extend([Document()])
print(da)

<DocumentArray (length=11) at 140683902276416>

Process finished with exit code 0
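The general pattern behind this kind of fix can be sketched without DocArray. The `Store` class below is hypothetical; it only shows how reading a newer private attribute with a default keeps objects restored from older state usable.

```python
class Store:
    def __init__(self):
        self.items = []
        self._is_subindex = False  # attribute introduced in a newer version

    def extend(self, new_items):
        # Fall back to a default instead of assuming the attribute exists.
        if not getattr(self, "_is_subindex", False):
            self.items.extend(new_items)


# Simulate loading state saved by an older version: like unpickling,
# __init__ is not called, so the newer attribute is never set.
old_state = {"items": []}
restored = Store.__new__(Store)
restored.__dict__.update(old_state)

restored.extend([1, 2])
print(restored.items)  # [1, 2]
```

Accessing `self._is_subindex` directly in `extend` would raise an AttributeError on `restored`, which is the same kind of failure shown in the traceback above.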

📗 Documentation Improvements

Polish docs throughout (#895)

🤟 Contributors

We would like to thank all contributors to this release:

Anne Yang (@AnneYang720)
Nan Wang (@nan-wang)
anna-charlotte (@anna-charlotte)
Alex Cureton-Griffiths (@alexcg1)

07 Dec 12:16

github-actions

v0.20.0

41297ab

💫 Release v0.20.0

Release Note (`0.20.0`)

Release time: 2022-12-07 12:15:30

This release contains 8 new features, 3 bug fixes and 7 documentation improvements.

🆕 Features

Milvus document store (#587)

This release supports the Milvus vector database as a document store.

da = DocumentArray(storage='milvus', config={'n_dim': 3})

Root_id for document stores (#808)

When working with a vector database you can now retrieve the root document even if you search at a nested level with sub-indices (for example at chunk level).

top_level_matches = da.find(query=np.random.rand(512), on='@.[image]', return_root=True)

To allow this we now store the root_id in the chunks' tags. You can enable this by passing root_id=True in your document store configuration.
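Conceptually, resolving a chunk-level match back to its root works like the following plain-Python sketch. The dictionaries, the tag key "root_id", and the `resolve_roots` helper are assumptions made for illustration; the key DocArray actually writes into the tags may be named differently.

```python
# Hypothetical flat storage: chunks are indexed separately from their roots.
roots = {"doc-1": {"id": "doc-1", "text": "a full product page"}}

# Matches found at chunk level; each chunk carries its root's id in its tags.
chunk_matches = [
    {"id": "chunk-7", "tags": {"root_id": "doc-1"}},
    {"id": "chunk-9", "tags": {"root_id": "doc-1"}},
]


def resolve_roots(matches, root_lookup):
    """Map chunk-level matches to their root documents, without duplicates."""
    seen, result = set(), []
    for match in matches:
        root_id = match["tags"]["root_id"]
        if root_id not in seen:
            seen.add(root_id)
            result.append(root_lookup[root_id])
    return result


top_level_matches = resolve_roots(chunk_matches, roots)
print([doc["id"] for doc in top_level_matches])  # ['doc-1']
```

Dropping duplicate roots is a design choice of this sketch; whether the library does the same is not stated in the release note.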

Filtering based on text keywords for Qdrant (#849)

You can now filter based on text keywords for the Qdrant document store.

filter = {
    'must': [
        {"key": "info", "match": {"text": "shoes"}}
    ]
}

results = da.find(np.random.rand(n_dim), filter=filter)

RGB-D representation of 3D meshes (#753)

DocArray already supports 3D mesh representation in different formats and this release adds support for RGB-D representation.

doc.load_uris_to_rgbd_tensor()

Load multi page tiff files into chunks (#845)

Multi-page TIFF images can now be loaded with load_uri_to_image_tensor(). Each page is loaded into its own chunk.

d = Document(uri="foo.tiff")
d.load_uri_to_image_tensor()
print(d)

<Document ('id', 'uri', 'chunks') at 7f907d786d6c11ec840a1e008a366d49>
  └─ chunks
     ├─ <Document ('id', 'parent_id', 'granularity', 'tensor') at 7aa4c0ba66cf6c300b7f07fdcbc2fdc8>
     ├─ <Document ('id', 'parent_id', 'granularity', 'tensor') at bc94a3e3ca60352f2e4c9ab1b1bb9c22>
     └─ <Document ('id', 'parent_id', 'granularity', 'tensor') at 36fe0d1daf4442ad6461c619f8bb25b7>

Store key frame indices when loading video tensor from uri (#880)

The keyframe_indices are now stored in a Document's tags when loading a video to a tensor. This allows extracting the section of the video between key frames.

d = Document(uri="video.mp4").load_uri_to_video_tensor()
print(d.tags['keyframe_indices'])

[0, 25, 196, ...]
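Given such a list of indices, cutting the video into sections is a matter of slicing. The sketch below uses a plain list as a stand-in for the video tensor, and `split_at_keyframes` is a hypothetical helper, not part of DocArray.

```python
def split_at_keyframes(frames, keyframe_indices):
    """Return one segment of frames per key frame, each starting at it."""
    bounds = list(keyframe_indices) + [len(frames)]
    return [frames[start:end] for start, end in zip(bounds, bounds[1:])]


frames = list(range(10))  # stand-in for a video tensor with 10 frames
segments = split_at_keyframes(frames, [0, 4, 7])
print(segments)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

The same slicing works on the first axis of a NumPy video tensor.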

Better plotting of embeddings for nested and complex data (#891)

You can now choose which meta field parameters to exclude when calling DocumentArray's plot_embeddings() method. This makes it easier to plot embeddings for complex and nested data.

docs.plot_embeddings(exclude_fields_metas=['chunks'])

Better support for information retrieval evaluation (#826)

This release adds a max_rel_per_label parameter to better support metric calculations that require the number of relevant Documents.

metrics = da.evaluate(['recall_at_k'], max_rel_per_label={i: 1 for i in range(3)})
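To see why the number of relevant Documents matters, here is a minimal recall-at-k sketch in plain Python. The function below is a hypothetical illustration of the metric's definition, not DocArray's implementation.

```python
def recall_at_k(retrieved_relevance, max_rel, k):
    """Fraction of the `max_rel` relevant documents found in the top k."""
    if max_rel == 0:
        return 0.0
    hits = sum(retrieved_relevance[:k])
    return hits / max_rel


# 1 marks a relevant match, 0 an irrelevant one, in ranked order.
print(recall_at_k([1, 0, 1, 0], max_rel=4, k=3))  # 0.5
```

Without knowing `max_rel`, the denominator could only be estimated from the matches that were actually returned, which overstates recall whenever relevant Documents are missed.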

🐞 Bug Fixes

Support length calculation independently from list-like behavior (#840)

DocArray 0.19 added the ability to instantiate a document store without list-like behavior for improved performance. However, calculating the length of certain document stores relied on such list-like behavior. This release fixes length calculation for the Redis document store, making it independent from list-like behavior.

Remove cosine similarity field with false assignment (#835)

In the Weaviate document store, cosine distance is no longer mistakenly assigned to the cosine_similarity field.
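The two quantities are easy to confuse because they are complementary. A minimal sketch of the standard definitions in plain Python (the function names are chosen for this example):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def cosine_distance(a, b):
    """Cosine distance is one minus cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


same = [1.0, 0.0]
orthogonal = [0.0, 1.0]
print(cosine_similarity(same, same), cosine_distance(same, same))  # 1.0 0.0
```

Storing a distance in a field named for similarity would invert the ranking for anyone who sorts by that field.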

Rebuild index after clearing storage (#837)

The index for Redis and Elasticsearch document stores is now rebuilt when _clear_storage is called.

📗 Documentation Improvements

Correct Document description (#842)
Minor correction in Document description (#834)
Add username to DocArray pull (#847)
Fix broken docs (#805)
Fix data management section (#801)
Change logic order according to blog (#797)
Move cloud support to integrations (#798)

🤟 Contributors

We would like to thank all contributors to this release:

Delgermurun (@delgermurun)
Anne Yang (@AnneYang720)
anna-charlotte (@anna-charlotte)
Johannes Messner (@JohannesMessner)
Alex Cureton-Griffiths (@alexcg1)
AlaeddineAbdessalem (@alaeddine-13)
dong xiang (@dongxiang123)
coolmian (@coolmian)
Joan Fontanals (@JoanFM)
Nan Wang (@nan-wang)
samsja (@samsja)
Michael Günther (@guenthermi)

