Releases: docarray/docarray
💫 Release v0.33.0
Release Note (0.33.0)
Release time: 2023-06-06 14:05:56
This release contains 1 new feature, 1 performance improvement, 9 bug fixes and 4 documentation improvements.
🆕 Features
Allow coercion between different Tensor types (#1552) (#1588)
Allow coercing to a TorchTensor from an NdArray or TensorFlowTensor and the other way around.
from docarray import BaseDoc
from docarray.typing import TorchTensor
import numpy as np
class MyTensorsDoc(BaseDoc):
    tensor: TorchTensor
doc = MyTensorsDoc(tensor=np.zeros(512))
doc.summary()
📄 MyTensorsDoc : 0a10f88 ...
╭─────────────────────┬────────────────────────────────────────────────────────╮
│ Attribute │ Value │
├─────────────────────┼────────────────────────────────────────────────────────┤
│ tensor: TorchTensor │ TorchTensor of shape (512,), dtype: torch.float64 │
╰─────────────────────┴────────────────────────────────────────────────────────╯
🚀 Performance
Avoid stack embedding for every search (#1586)
We have made a performance improvement to the find interface of InMemoryExactNNIndex that gives a ~2x speedup.
The script used to measure this is as follows:
from torch import rand
from time import perf_counter
from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import TorchTensor
class MyDocument(BaseDoc):
    embedding: TorchTensor
    embedding2: TorchTensor
    embedding3: TorchTensor
def generate_doc_list(num_docs: int, dims: int) -> DocList[MyDocument]:
    return DocList[MyDocument](
        [
            MyDocument(
                embedding=rand(dims),
                embedding2=rand(dims),
                embedding3=rand(dims),
            )
            for _ in range(num_docs)
        ]
    )
num_docs, num_queries, dims = 500000, 1000, 128
data_list = generate_doc_list(num_docs, dims)
queries = generate_doc_list(num_queries, dims)
index = InMemoryExactNNIndex[MyDocument](data_list)
start = perf_counter()
for _ in range(5):
    matches, scores = index.find_batched(queries, search_field='embedding')
print(f"Number of queries: {num_queries} \n"
      f"Number of indexed documents: {num_docs} \n"
      f"Total time: {(perf_counter() - start)/5} seconds")
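As the PR title suggests, the gain comes from not re-stacking every document's embedding into one matrix on each search. The idea can be sketched in plain Python as follows. The class name CachedEmbeddingIndex, the stack_calls counter and the dot-product scoring are all hypothetical, chosen for illustration; this is not the library's implementation.

```python
from typing import Dict, List


class CachedEmbeddingIndex:
    """Toy index that stacks embeddings once per field and reuses the result."""

    def __init__(self) -> None:
        self._docs: List[Dict[str, List[float]]] = []
        self._stacked: Dict[str, List[List[float]]] = {}  # per-field cache
        self.stack_calls = 0  # counts how often stacking actually happens

    def index(self, docs: List[Dict[str, List[float]]]) -> None:
        self._docs.extend(docs)
        self._stacked.clear()  # new documents invalidate the cached matrices

    def _matrix(self, field: str) -> List[List[float]]:
        # Stack only when no cached matrix exists for this field.
        if field not in self._stacked:
            self.stack_calls += 1
            self._stacked[field] = [doc[field] for doc in self._docs]
        return self._stacked[field]

    def find(self, query: List[float], field: str, limit: int = 10) -> List[int]:
        matrix = self._matrix(field)
        scores = [sum(q * v for q, v in zip(query, row)) for row in matrix]
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return order[:limit]


index = CachedEmbeddingIndex()
index.index([{"embedding": [1.0, 0.0]}, {"embedding": [0.0, 1.0]}])
first = index.find([1.0, 0.0], field="embedding", limit=1)
second = index.find([0.0, 1.0], field="embedding", limit=1)
```

Both searches share one stacking pass; only a call to index() forces the matrix to be rebuilt.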
🐞 Bug Fixes
Respect limit parameter in filter for index backends (#1618)
InMemoryExactNNIndex and HnswDocumentIndex now respect the limit parameter in the filter API.
HnswDocumentIndex can search with limit greater than number of documents (#1611)
HnswDocumentIndex now allows calling find with a limit parameter larger than the number of indexed documents.
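One plausible way to tolerate an oversized limit is to clamp it to the number of indexed documents before querying the underlying index. The helper below is a hypothetical sketch of that idea; whether the library does exactly this is an assumption.

```python
def clamp_limit(limit: int, num_docs: int) -> int:
    """Return how many neighbours can actually be requested from the index."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    # Never ask the backend for more results than it holds.
    return min(limit, num_docs)


effective = clamp_limit(limit=10, num_docs=3)
```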
Allow updating HnswDocumentIndex (#1604)
HnswDocumentIndex now allows reindexing documents with the same id, updating the original documents.
Dynamically resize internal index to adapt to increasing number of documents (#1602)
HnswDocumentIndex now allows indexing more than max_elements, dynamically adapting the index as it grows.
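A minimal sketch of growing a fixed-capacity index on demand follows. The doubling policy and the GrowableIndex class are assumptions made for illustration; the release note only states that the index adapts as it grows.

```python
from typing import Any, Iterable, List


class GrowableIndex:
    """Toy fixed-capacity store that doubles its capacity when it would overflow."""

    def __init__(self, max_elements: int = 4) -> None:
        if max_elements < 1:
            raise ValueError("max_elements must be at least 1")
        self.max_elements = max_elements
        self._items: List[Any] = []
        self.resize_count = 0

    def _ensure_capacity(self, incoming: int) -> None:
        needed = len(self._items) + incoming
        # Double until the pending batch fits (assumed growth policy).
        while needed > self.max_elements:
            self.max_elements *= 2
            self.resize_count += 1

    def add(self, items: Iterable[Any]) -> None:
        batch = list(items)
        self._ensure_capacity(len(batch))
        self._items.extend(batch)

    def __len__(self) -> int:
        return len(self._items)


index = GrowableIndex(max_elements=4)
index.add(range(10))  # exceeds the initial capacity, so the index grows
```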
Fix simple usage of HnswDocumentIndex (#1596)
from docarray.index import HnswDocumentIndex
from docarray import DocList, BaseDoc
from docarray.typing import NdArray
import numpy as np
class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]
docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for i in range(200)]
index = HnswDocumentIndex[MyDoc](work_dir='./tmp', index_name='index')
index.index(docs=DocList[MyDoc](docs))
resp = index.find_batched(queries=DocList[MyDoc](docs[0:3]), search_field='embedding')
Previously, this basic usage threw an exception:
TypeError: ModelMetaclass object argument after ** must be a mapping, not MyDoc
Now, it works as expected.
Fix InMemoryExactNNIndex index initialization with nested DocList (#1582)
Instantiating an InMemoryExactNNIndex with a Document schema that had a nested DocList previously threw this error:
from docarray import BaseDoc, DocList
from docarray.documents import TextDoc
from docarray.index import HnswDocumentIndex
class MyDoc(BaseDoc):
    text: str
    d_list: DocList[TextDoc]
index = HnswDocumentIndex[MyDoc]()
TypeError: docarray.index.abstract.BaseDocIndex.__init__() got multiple values for keyword argument 'db_config'
Now it can be successfully instantiated.
Fix summary of document with list (#1595)
Calling summary on a document with a List attribute previously showed the wrong type:
from docarray import BaseDoc, DocList
from typing import List
class TestDoc(BaseDoc):
    str_list: List[str]
dl = DocList[TestDoc]([TestDoc(str_list=[]), TestDoc(str_list=["1"])])
dl.summary()
Previous output:
╭─────── DocList Summary ───────╮
│ │
│ Type DocList[TestDoc] │
│ Length 2 │
│ │
╰───────────────────────────────╯
╭─── Document Schema ───╮
│ │
│ TestDoc │
│ └── str_list: str │
│ │
╰───────────────────────╯
New output:
╭─────── DocList Summary ───────╮
│ │
│ Type DocList[TestDoc] │
│ Length 2 │
│ │
╰───────────────────────────────╯
╭────── Document Schema ──────╮
│ │
│ TestDoc │
│ └── str_list: List[str] │
│ │
╰─────────────────────────────╯
Solve issues caused by issubclass (#1594)
DocArray relies heavily on Python's built-in issubclass function, which caused multiple issues. We now use a safe version that accounts for edge cases and types.
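To make the failure mode concrete: issubclass raises a TypeError when its first argument is not a plain class, for example a parameterized generic such as List[str]. The helper below is a minimal sketch of a guarded check, written for illustration; it is not DocArray's actual implementation, and the edge cases it covers are an assumption.

```python
import inspect
from typing import Any, List, Optional, Tuple, get_origin


def safe_issubclass(candidate: Any, parent: type) -> bool:
    """Like issubclass(), but returns False instead of raising for non-classes."""
    # Parameterized generics such as List[str] or Optional[int] have an origin.
    if get_origin(candidate) is not None:
        return False
    # Anything that is not a class at all (strings, instances, ...) is rejected.
    if not inspect.isclass(candidate):
        return False
    return issubclass(candidate, parent)


plain_class = safe_issubclass(bool, int)
generic_list = safe_issubclass(List[str], str)
optional_int = safe_issubclass(Optional[int], int)
```

Returning False for generics is a design choice of this sketch: callers that need to look inside List[...] would have to inspect the type arguments separately.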
Make example payload a string rather than bytes (#1587)
The example payload of a given document schema with a Tensor attribute was previously of bytes type. This has now been changed to str.
from docarray import DocList, BaseDoc
from docarray.documents import TextDoc
from docarray.typing import NdArray
import numpy as np
class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]
print(f'{type(MyDoc.schema()["properties"]["embedding"]["example"])}')
📗 Documentation Improvements
- Add forward declaration steps to example to avoid pickling error (#1615)
- Fix n_dim to dim (#1610)
- Add "in memory" to documentation as list of supported vector indexes (#1607)
- Add a tensor section (#1576)
🤟 Contributors
We would like to thank all contributors to this release:
- Mohammad Kalim Akram (@makram93)
- samsja (@samsja)
- Saba Sturua (@jupyterjazz)
- Joan Fontanals (@JoanFM)
- maxwelljin (@maxwelljin)
💫 Patch v0.32.1
Release Note (0.32.1)
Release time: 2023-05-26 14:50:34
This release contains 4 bug fixes, 1 refactoring and 2 documentation improvements.
⚙ Refactoring
Improve ElasticDocIndex logging (#1551)
More debugging logs have been added inside ElasticDocIndex.
🐞 Bug Fixes
Allow InMemoryExactNNIndex with Optional embedding tensors (#1575)
You can now index Documents where the tensor search_field is Optional. The index will not consider these None embeddings when running a search.
import torch
from typing import Optional
from docarray import BaseDoc, DocList
from docarray.typing import TorchTensor
from docarray.index import InMemoryExactNNIndex
class EmbeddingDoc(BaseDoc):
    embedding: Optional[TorchTensor[768]]
index = InMemoryExactNNIndex[EmbeddingDoc](DocList[EmbeddingDoc]([EmbeddingDoc(embedding=(torch.rand(768,) if i % 2 else None)) for i in range(5)]))
index.find(torch.rand((768,)), search_field="embedding", limit=3)
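Conceptually, skipping None embeddings means scoring only the documents that have a vector while still reporting each hit's position in the original collection. The sketch below shows this with plain lists and cosine similarity; the function names are hypothetical and the library's internals may differ.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_skipping_none(
    embeddings: List[Optional[Sequence[float]]],
    query: Sequence[float],
    limit: int,
) -> List[Tuple[int, float]]:
    """Return (original_position, score) pairs, ignoring None embeddings."""
    scored = [
        (pos, cosine(emb, query))
        for pos, emb in enumerate(embeddings)
        if emb is not None
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


embeddings = [None, [1.0, 0.0], None, [0.0, 1.0], [1.0, 1.0]]
results = find_skipping_none(embeddings, query=[1.0, 0.0], limit=2)
```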
Safe is_subclass check (#1569)
In DocArray, especially when dealing with indexers, field types are checked, which leads to calls to Python's built-in issubclass function.
This call fails under some circumstances, for instance when the checked type is a List or Tuple. Starting with this release, we use a safe version that does not fail for these cases.
This enables the following usage, which would otherwise fail:
from typing import List
from docarray import BaseDoc
from docarray.index import HnswDocumentIndex
class MyDoc(BaseDoc):
    test: List[str]
index = HnswDocumentIndex[MyDoc]()
Fix AnyDoc deserialization (#1571)
AnyDoc is a schema-less special Document that adapts to the schema of the data it tries to load. However, in cases where the data contained Dictionaries or Lists, deserialization failed. This is now fixed and you can have this behavior:
from docarray.base_doc import AnyDoc, BaseDoc
from typing import Dict
class ConcreteDoc(BaseDoc):
    text: str
    tags: Dict[str, int]
doc = ConcreteDoc(text='text', tags={'type': 1})
any_doc = AnyDoc.from_protobuf(doc.to_protobuf())
assert any_doc.text == 'text'
assert any_doc.tags == {'type': 1}
dict method for Document view (#1559)
Prior to this fix, doc.dict() would return an empty Dictionary if doc.is_view() == True:
from docarray import BaseDoc, DocVec
class MyDoc(BaseDoc):
    foo: int
vec = DocVec[MyDoc]([MyDoc(foo=3)])
# before
doc = vec[0]
assert doc.is_view()
print(doc.dict())
# > {}
# after
doc = vec[0]
assert doc.is_view()
print(doc.dict())
# > {'id': 'f285db406a949a7e7ab084032800f7d8', 'foo': 3}
📗 Documentation Improvements
🤟 Contributors
We would like to thank all contributors to this release:
- aman-exp-infy (@agaraman0)
- Johannes Messner (@JohannesMessner)
- Joan Fontanals (@JoanFM)
- Saba Sturua (@jupyterjazz)
- Ge Jin (@maxwelljin)
💫 Release v0.32.0
Release Note (v0.32.0)
This release contains 4 new features, 0 performance improvements, 5 bug fixes and 4 documentation improvements.
🆕 Features
Subindex for document index (#1428)
The subindex feature allows you to index documents that contain another DocList by automatically creating a separate collection/index for each such DocList:
import numpy as np
from docarray import BaseDoc, DocList
from docarray.index import ElasticDocIndex
from docarray.typing import NdArray
# create nested document schema
class SimpleDoc(BaseDoc):
    tensor: NdArray[10]
    text: str
class MyDoc(BaseDoc):
    docs: DocList[SimpleDoc]
# create some docs
my_docs = [
    MyDoc(
        docs=DocList[SimpleDoc](
            [
                SimpleDoc(
                    tensor=np.ones(10) * (j + 1),
                    text=f"hello {j}",
                )
                for j in range(10)
            ]
        ),
    )
]
# index them into Elasticsearch
index = ElasticDocIndex[MyDoc](index_name="idx")
index.index(my_docs)  # index with name 'idx' and 'idx__docs' will be generated
# search on the nested level (subindex)
query = np.random.rand(10)
matches_root, matches_nested, scores = index.find_subindex(
    query, search_field="docs__tensor", limit=5
)
Openapi and FastAPI tensor shapes (#1510)
We have enabled shaped tensors to be properly represented in OpenAPI/SwaggerUI, both in examples and the schema.
This means that you can now build web APIs using FastAPI where the SwaggerUI properly communicates tensor shapes to your users:
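from fastapi import FastAPI
from docarray import BaseDoc
from docarray.base_doc import DocArrayResponse
from docarray.typing import TorchTensor
class Doc(BaseDoc):
    embedding_torch: TorchTensor[3, 4]
app = FastAPI()
@app.post("/foo", response_model=Doc, response_class=DocArrayResponse)
async def foo(doc: Doc) -> Doc:
    # echo the shaped tensor back under the field name defined in the schema
    return Doc(embedding_torch=doc.embedding_torch)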
Generated Swagger UI:
Save and load inmemory index (#1534)
We added a persist method to the InMemoryExactNNIndex class to save the index to disk.
# Save your existing index as a binary file
doc_index.persist('docs.bin')
# Initialize a new document index using the saved binary file
new_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin')
🐞 Bug Fixes
search_field should be optional in hybrid text search (#1516)
We have added a sane default for the search_field argument of text_search(), which is now Optional.
Check if file path exists for in-memory index (#1537)
We have added an internal check to see if index_file_path exists when passed to InMemoryExactNNIndex.
Add empty judgement to index search (#1533)
We have ensured that empty indices do not fail when find is called.
Detach torch tensors (#1526)
Serializing tensors with gradients no longer fails.
DocVec display (#1522)
DocVec display issues have been resolved.
📗 Documentation Improvements
- Remove erroneous info (#1531)
- Fix link to documentation in readme (#1525)
- Flatten structure (#1520)
- Fix links (#1518)
🤟 Contributors
We would like to thank all contributors to this release:
- Mohammad Kalim Akram (@makram93)
- Johannes Messner (@JohannesMessner)
- Anne Yang (@AnneYang720)
- Zhaofeng Miao (@mapleeit)
- Joan Fontanals (@JoanFM)
- Kacper Łukawski (@kacperlukawski)
- IyadhKhalfallah (@IyadhKhalfallah)
- Saba Sturua (@jupyterjazz)
💫 Patch v0.31.1
Release Note (0.31.1)
This patch release fixes a small bug that was introduced in the latest minor release (0.31.0).
🐞 Bug Fixes
- Calling json or dict on an Optional nested DocList no longer throws an error if the value is set to None (#1512)
🤟 Contributors
We would like to thank all contributors to this release:
- samsja (@samsja)
💫 Release v0.31.0
Release Note (v0.31.0)
This release contains 4 new features, 11 bug fixes, and several documentation improvements.
💥 Breaking changes
Return type of DocVec Optional Tensor (#1472)
Optional tensor fields in a DocVec will return None instead of a list of NaN if the column does not hold any tensor.
This code snippet shows the breaking change:
from typing import Optional
from docarray import BaseDoc, DocVec
from docarray.typing import NdArray
class MyDoc(BaseDoc):
    tensor: Optional[NdArray[10]]
docs = DocVec[MyDoc]([MyDoc() for j in range(2)])
print(docs.tensor)
Version | Return type |
---|---|
0.30.0 | [nan nan] |
0.31.0 | None |
Default index collection names
Most vector databases have a concept similar to a 'table' in a relational database; this concept is usually called 'collection', 'index', 'class' or similar.
In DocArray v0.30.0, every Document Index backend defined its own default name for this, i.e. a default index_name or collection_name.
Starting with DocArray v0.31.0, the default index_name/collection_name will be derived from the document schema name:
from docarray.index.backends.weaviate import WeaviateDocumentIndex
from docarray import BaseDoc
class MyDoc(BaseDoc):
    pass
# With v0.30.0, the line below defaults to `index_name='Document'`.
# This was the default regardless of the Document Index schema.
# With v0.31.0, the line below defaults to `index_name='MyDoc'`
# The default now depends on the schema, i.e. the `MyDoc` class.
store = WeaviateDocumentIndex[MyDoc]()
If you create and persist a Document Index with v0.30.0, and try to access it using v0.31.0 without manually specifying an index name, an Exception will occur.
You can fix this by manually specifying the index name to match the old default:
# Create new Document Index using v0.30.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=...)
# Access it using v0.31.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=..., index_name='Document')
The below table summarizes the change for all DB backends:
Document Index | DBConfig argument | Default in v0.30.0 | Default in v0.31.0 |
---|---|---|---|
WeaviateDocumentIndex | index_name | 'Document' | Schema class name |
QdrantDocumentIndex | collection_name | 'documents' | Schema class name |
ElasticDocIndex | index_name | 'index__' + a random id | Schema class name |
ElasticV7DocIndex | index_name | 'index__' + a random id | Schema class name |
HnswDocumentIndex | n/a | n/a | n/a |
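The new defaulting rule can be pictured as a small resolver: use the explicit name if one is given, otherwise fall back to the schema class name. The function resolve_index_name below is hypothetical, and any backend-specific normalization of the name is ignored in this sketch.

```python
from typing import Optional


def resolve_index_name(schema: type, index_name: Optional[str] = None) -> str:
    """Prefer an explicit name; otherwise derive one from the schema class."""
    return index_name if index_name is not None else schema.__name__


class MyDoc:
    """Stand-in for a document schema class."""


derived = resolve_index_name(MyDoc)
pinned = resolve_index_name(MyDoc, index_name="Document")  # matches the old default
```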
🆕 Features
Add InMemoryExactNNIndex (#1441)
In this version we have introduced the InMemoryExactNNIndex Document Index which allows you to perform in-memory exact vector search (as opposed to approximate nearest neighbor search in vector databases).
The InMemoryExactNNIndex can be used for prototyping and is suitable for dealing with small-scale documents (1k-10k), as opposed to a vector database that is suitable for larger scales but comes with a performance overhead at smaller scales.
from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import NdArray
import numpy as np
class MyDoc(BaseDoc):
    tensor: NdArray[512]
docs = DocList[MyDoc](MyDoc(tensor=i*np.ones(512)) for i in range(10))
doc_index = InMemoryExactNNIndex[MyDoc]()
doc_index.index(docs)
print(doc_index.find(3*np.ones(512), search_field='tensor', top_k=3))
FindResult(documents=<DocList[MyDoc] (length=10)>, scores=array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))
DocList inherits from Python list (#1457)
DocList is now a subclass of Python's list. This means that you can now use all the methods that are available to Python lists on DocList objects. For example, you can now use len on DocList objects and tools like Pydantic or FastAPI will be able to work with it more easily.
Add len to DocIndex (#1454)
You can now perform len(vector_index) which is equivalent to vector_index.num_docs().
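The equivalence between len() and num_docs() follows the usual Python pattern of implementing __len__ by delegation. The sketch below uses a hypothetical CountingIndex class, not the library's own code.

```python
from typing import Any, Iterable, List


class CountingIndex:
    """Toy index whose len() delegates to num_docs()."""

    def __init__(self) -> None:
        self._docs: List[Any] = []

    def index(self, docs: Iterable[Any]) -> None:
        self._docs.extend(docs)

    def num_docs(self) -> int:
        return len(self._docs)

    def __len__(self) -> int:
        # len(index) and index.num_docs() always agree.
        return self.num_docs()


vector_index = CountingIndex()
vector_index.index(["a", "b", "c"])
```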
Other minor features
- Add a to_json alias to BaseDoc (#1494)
🐞 Bug Fixes
Point to older versions when importing Document or DocumentArray (#1422)
Trying to load Document or DocumentArray from DocArray would previously raise an error, saying that you needed to downgrade your version of DocArray if you wanted to use these two objects. This behavior has been fixed.
Fix AnyDoc.from_protobuf (#1437)
AnyDoc can now read any BaseDoc protobuf file. The same applies to DocList.
Other bug fixes
- Fix extend to DocList (#1493)
- Fix bug when calling dict() on BaseDoc (#1481)
- Fix bug when calling json() on BaseDoc (#1481)
- Support Pandas 2.0 by using pd.concat() instead of df.append() in to_dataframe() to avoid warning (#1478)
- Add logs to Elasticsearch index (#1427)
- Fix a bug in Document Index where Torch tensors that required grad were not able to be converted to ndarray (#1429)
- Fix a bug with HNSW (#1426)
- Hubble Binary format version bump (#1414)
- Save index during creation for hnswlib (#1424)
📗 Documentation Improvements
- Fix FastAPI docs (#1453)
- Index predefined Documents (#1434)
- Clean up data types section (#1412)
- Remove duplicate API reference section (#1408)
- Docindex URLs (#1433)
- Fix Install commands hint (#1421)
- Add Google Analytics (#1432)
- Add install instructions for hnswlib and elastic document indexes (#1431)
- Various fixes (#1436, #1417, #1423, #1418, #1411, #1419)
🤟 Contributors
We would like to thank all contributors to this release:
- Alex Cureton-Griffiths (@alexcg1)
- samsja (@samsja)
- Johannes Messner (@JohannesMessner)
- Anne Yang (@AnneYang720)
- Scott Martens (@scott-martens)
- カレン (@RStar2022)
- Aman Agarwal (@agaraman0)
- Yanlong Wang (@nomagick)
- Charlotte Gerhaher (@anna-charlotte)
💫 Release v0.30.0
💫 Release v0.30.0 (a.k.a DocArray v2)
Warning
This version of DocArray is a complete rewrite, so its changes go beyond ordinary breaking changes. Be sure to check the documentation to prepare your migration.
Changelog
If you are using DocArray v<0.30.0, you will be familiar with its dataclass API.
DocArray v2 is that idea, taken seriously. Every document is created through a dataclass-like interface, courtesy of Pydantic.
This gives the following advantages:
- Flexibility: No need to conform to a fixed set of fields -- your data defines the schema.
- Multimodality: Easily store multiple modalities and multiple embeddings in the same Document.
- Language agnostic: At their core, Documents are just dictionaries. This makes it easy to create and send them from any language, not just Python.
You may also be familiar with our old Document Stores for vector database integration. They are now called Document Indexes and offer the following improvements:
- Hybrid search: You can now combine vector search with text search, and even filter by arbitrary fields.
- Production-ready: The new Document Indexes are a much thinner wrapper around the various vector DB libraries, making them more robust and easier to maintain.
- Increased flexibility: We strive to support any configuration or setting that you could perform through the DB's first-party client.
For now, Document Indexes support Weaviate, Qdrant, ElasticSearch, and HNSWLib, with more to come.
Changes to Document
- Document has been renamed to BaseDoc.
- BaseDoc cannot be used directly, but instead has to be extended. Therefore, each document class is created through a dataclass-like interface.
- Following from the previous point, extending BaseDoc allows for a flexible schema compared to the Document class in v1 which only allowed for a fixed schema, with one of tensor, text and blob, and additional chunks and matches.
- Due to the added flexibility, one cannot know what fields your document class will provide. Therefore, various methods from v1 (such as .load_uri_to_image_tensor()) are not supported in v2. Instead, we provide some of those methods on the typing-level.
- In v2 we have the LegacyDocument class, which extends BaseDoc while following the same schema as v1's Document. The LegacyDocument can be useful to start migrating your codebase from v1 to v2. Nevertheless, the API is not fully compatible with DocArray v1 Document. Indeed, none of the methods associated with Document are present. Only the schema of the data is similar.
Changes to DocumentArray
DocList
- The DocumentArray class from v1 has been renamed to DocList, to be more descriptive of its actual functionality, since it is a list of BaseDocs.
DocVec
- Additionally, we introduced the class DocVec, which is a column-based representation of BaseDocs. Both DocVec and DocList extend AnyDocArray.
- DocVec is a container of Documents suited to computations that require batches of data (e.g. matrix multiplication, distance calculation, deep learning forward pass).
- A DocVec has a similar interface to DocList but with an underlying implementation that is column-based instead of row-based. Each field of the schema of the DocVec (the .doc_type, which is a BaseDoc) will be stored in a column. If the field is a tensor, the data from all Documents will be stored as a single doc_vec (Torch/TensorFlow/NumPy) tensor. If the tensor field is AnyTensor or a Union of tensor types, the .tensor_type will be used to determine the type of the doc_vec column.
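The row-based versus column-based distinction can be illustrated with plain dictionaries. This is a conceptual sketch only: to_columns and row_view are hypothetical helpers, and a real DocVec stacks tensor fields into a single tensor rather than a list of lists.

```python
from typing import Any, Dict, List

# Row-based layout: one record per document (analogous to DocList).
rows: List[Dict[str, Any]] = [
    {"title": "a", "embedding": [1.0, 0.0]},
    {"title": "b", "embedding": [0.0, 1.0]},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Column-based layout: one list per field (analogous to DocVec)."""
    if not records:
        return {}
    return {field: [rec[field] for rec in records] for field in records[0]}


def row_view(columns: Dict[str, List[Any]], i: int) -> Dict[str, Any]:
    """Reassemble document i from the columns."""
    return {field: values[i] for field, values in columns.items()}


columns = to_columns(rows)
second_doc = row_view(columns, 1)
```

With the column layout, a batch operation over all embeddings reads a single contiguous field instead of visiting every document.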
Parameterized DocList
- With the added flexibility of your document schema, and therefore endless options to design your document schema, a DocList does not necessarily have to be homogenous when you initialize it.
- If you want a homogenous DocList you can parameterize it at initialization time:
from docarray import DocList
from docarray.documents import ImageDoc
docs = DocList[ImageDoc]()
- Methods like .from_csv() or .pull() only work with parameterized DocLists.
Access attributes of your DocumentArray
- In v1 you could access an attribute of all Documents in your DocumentArray by calling the plural of the attribute's name on your DocArray instance.
- In v2 you don't have to use the plural, but instead just use the document's attribute name, since AnyDocArray will expose the same attributes as the BaseDocs it contains. This will return a list of type(attribute). However, this works if (and only if) all the BaseDocs in the AnyDocArray have the same schema. Therefore only this works:
from docarray import BaseDoc, DocList
class Book(BaseDoc):
    title: str
    author: str = None
docs = DocList[Book]([Book(title=f'title {i}') for i in range(5)])
book_titles = docs.title # returns a list[str]
# this would fail
# docs = DocList([Book(title=f'title {i}') for i in range(5)])
# book_titles = docs.title
Changes to Document Store
In v2 the Document Store has been renamed to DocIndex and can be used for fast retrieval using vector similarity. DocArray v2 DocIndex supports Weaviate, Qdrant, ElasticSearch, and HNSWLib.
Instead of creating a DocumentArray instance and setting the storage parameter to a vector database of your choice, in v2 you can initialize a DocIndex object of your choice, such as:
db = HnswDocumentIndex[MyDoc](work_dir='/my/work/dir')
In contrast, DocStore in v2 can be used for simple long-term storage, such as with AWS S3 buckets or Jina AI Cloud.
Thank you to all of the contributors to this release:
💫 Release v0.21.0
Release Note (0.21.0)
Release time: 2023-01-17 09:10:50
This release contains 3 new features, 7 bug fixes and 5 documentation improvements.
🆕 Features
OpenSearch Document Store (#853)
This version of DocArray adds a new Document Store: OpenSearch!
You can use the OpenSearch Document Store to index your Documents and perform ANN search on them:
from docarray import Document, DocumentArray
import numpy as np
# Connect to OpenSearch instance
n_dim = 3
da = DocumentArray(
    storage='opensearch',
    config={'n_dim': n_dim},
)
# Index Documents
with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim))
            for i in range(10)
        ]
    )
# Perform ANN search
np_query = np.ones(n_dim) * 8
results = da.find(np_query, limit=10)
Additionally, the OpenSearch Document Store can perform filter queries, search by text, and search by tags.
Learn more about its usage in the official documentation.
Add color to point cloud display (#961)
You can now include color information in your point cloud data, which can be visualized using display_point_cloud_tensor():
coords = np.load('a_red_motorbike/coords.npy')
colors = np.load('a_red_motorbike/coord_colors.npy')
doc = Document(
    tensor=coords,
    chunks=DocumentArray([Document(tensor=colors, name='point_cloud_colors')])
)
doc.display()
Add language attribute to Redis Document Store (#953)
The Redis Document Store now supports text search in various supported languages. To set a desired language, change the language parameter in the Redis configuration:
da = DocumentArray(
    storage='redis',
    config={
        'n_dim': 128,
        'index_text': True,
        'language': 'chinese',
    },
)
🐞 Bug Fixes
Replace newline with whitespace to fix display in plot embeddings (#963)
Whenever the string "\n" was contained in any Document field, doc.plot() would result in a rendering error. This fix resolves those errors by rendering "\n" as whitespace.
Fix unwanted coercion in to_pydantic_model (#949)
This bug caused all strings of the form 'Infinity' to be coerced to the string 'inf' when calling to_pydantic_model() or to_dict(). This is fixed now, leaving such strings unchanged.
Calculate relevant docs on index instead of queries (#950)
In the embed_and_evaluate() method, the number of relevant Documents per label used to be calculated based on the Documents in self. This is not generally correct, so after this fix the quantity is calculated based on the Documents in the index data.
Remove offset index create on list like false (#936)
When a Document Store has list-like behavior disabled, it no longer creates an offset to id mapping, which improves performance.
Add support for remote audio files (#933)
Loading audio files from a remote URL would cause FileNotFoundError, which is now fixed.
Query operator $exists does not work correctly with tags (#911) (#923)
Before this fix, $exists would treat false-y values such as 0 or [] as non-existent. This is now fixed.
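The underlying distinction is between testing a value's truthiness and testing a key's presence. The plain-dict illustration below is hypothetical: exists_buggy is an assumed reconstruction of the failure mode, and neither function is the Document Store's query engine.

```python
from typing import Any, Dict


def exists_buggy(tags: Dict[str, Any], key: str) -> bool:
    """Truthiness check: wrongly reports 0, [] or '' as missing."""
    return bool(tags.get(key))


def exists_fixed(tags: Dict[str, Any], key: str) -> bool:
    """Presence check: any stored value counts, however falsy."""
    return key in tags


tags = {"price": 0, "labels": [], "name": "shoe"}
buggy_price = exists_buggy(tags, "price")
fixed_price = exists_fixed(tags, "price")
```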
Document from dataclass with singleton list (#1018)
When casting from a dataclass to Document, singleton lists were treated like an individual element, even if the corresponding field was annotated with List[...]. Now this case is considered, and accessing such a field will yield a DocumentArray, even for singleton inputs.
📗 Documentation Improvements
- Link to Discord (#1010)
- Have less versions to avoid deployment timeout (#977)
- Fix data management section not appearing in Documentation (#967)
- Link to OpenSearch docs in sidebar (#960)
- Multimodal to datatypes (#934)
🤟 Contributors
We would like to thank all contributors to this release:
- Jay Bhambhani (@jay-bhambhani)
- Alvin Prayuda (@alphinside)
- Johannes Messner (@JohannesMessner)
- samsja (@samsja)
- Marco Luca Sbodio (@marcosbodio)
- Anne Yang (@AnneYang720)
- Michael Günther (@guenthermi)
- AlaeddineAbdessalem (@alaeddine-13)
- Han Xiao (@hanxiao)
- Alex Cureton-Griffiths (@alexcg1)
- Charlotte Gerhaher (@anna-charlotte)
💫 Patch v0.20.1
Release Note (0.20.1)
Release time: 2022-12-12 09:32:37
🐞 Bug Fixes
Make Milvus DocumentArray thread safe and suitable for pytest (#904)
This bug was causing connectivity issues when using multiple DocumentArrays in different threads to connect to the same Milvus instance, e.g. in pytest.
This would produce an error like the following:
E1207 14:59:51.357528591 2279 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
E1207 14:59:51.367985469 2279 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
E1207 14:59:51.457061884 3934 ev_epoll1_linux.cc:824] assertion failed: gpr_atm_no_barrier_load(&g_active_poller) != (gpr_atm)worker
Fatal Python error: Aborted
This fix creates a separate gRPC connection for each MilvusDocumentArray instance, circumventing the issue.
Restore backwards compatibility for (de)serialization (#903)
DocArray v0.20.0 broke (de)serialization backwards compatibility with earlier versions of the library, making it impossible to load DocumentArrays from v0.19.1 or earlier from disk:
# DocArray <= 0.19.1
da = DocumentArray([Document() for _ in range(10)])
da.save_binary('old-da.docarray')
# DocArray == 0.20.0
da = DocumentArray.load_binary('old-da.docarray')
da.extend([Document()])
print(da)
AttributeError: 'DocumentArrayInMemory' object has no attribute '_is_subindex'
This fix restores backwards compatibility by not relying on newly introduced private attributes:
# DocArray <= 0.19.1
da = DocumentArray([Document() for _ in range(10)])
da.save_binary('old-da.docarray')
# DocArray == 0.20.1
da = DocumentArray.load_binary('old-da.docarray')
da.extend([Document()])
print(da)
<DocumentArray (length=11) at 140683902276416>
Process finished with exit code 0
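The general technique behind this kind of fix is to read newly introduced attributes defensively, so that objects restored from older files, which never had them, still work. The sketch below uses a hypothetical ToyArray class; from_state stands in for deserialization that bypasses __init__, and none of this is DocArray's actual code.

```python
from typing import Any, Dict, List


class ToyArray:
    def __init__(self, docs: List[Any], is_subindex: bool = False) -> None:
        self._docs = list(docs)
        self._is_subindex = is_subindex  # attribute introduced in the newer version

    @classmethod
    def from_state(cls, state: Dict[str, Any]) -> "ToyArray":
        """Rebuild an instance from saved state without calling __init__."""
        obj = cls.__new__(cls)
        obj.__dict__.update(state)
        return obj

    def extend(self, docs: List[Any]) -> None:
        # getattr with a default tolerates state saved before the attribute existed.
        if getattr(self, "_is_subindex", False):
            raise NotImplementedError("subindex handling is out of scope for this sketch")
        self._docs.extend(docs)

    def __len__(self) -> int:
        return len(self._docs)


old_state = {"_docs": list(range(10))}  # no `_is_subindex` key, as in older files
restored = ToyArray.from_state(old_state)
restored.extend([10])
```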
📗 Documentation Improvements
- Polish docs throughout (#895)
🤟 Contributors
We would like to thank all contributors to this release:
- Anne Yang (@AnneYang720)
- Nan Wang (@nan-wang)
- anna-charlotte (@anna-charlotte)
- Alex Cureton-Griffiths (@alexcg1)
💫 Release v0.20.0
Release Note (0.20.0)
Release time: 2022-12-07 12:15:30
This release contains 8 new features, 3 bug fixes and 7 documentation improvements.
🆕 Features
Milvus document store (#587)
This release supports the Milvus vector database as a document store.
da = DocumentArray(storage='milvus', config={'n_dim': 3})
Root_id for document stores (#808)
When working with a vector database you can now retrieve the root document even if you search at a nested level with sub-indices (for example at chunk level).
top_level_matches = da.find(query=np.random.rand(512), on='@.[image]', return_root=True)
To allow this we now store the root_id in the chunks' tags. You can enable this by passing root_id=True in your document store configuration.
Filtering based on text keywords for Qdrant (#849)
You can now filter based on text keywords for the Qdrant document store.
filter = {
    'must': [
        {"key": "info", "match": {"text": "shoes"}}
    ]
}
results = da.find(np.random.rand(n_dim), filter=filter)
RGB-D representation of 3D meshes (#753)
DocArray already supports 3D mesh representation in different formats and this release adds support for RGB-D representation.
doc.load_uris_to_rgbd_tensor()
Load multi page tiff files into chunks (#845)
Multi page tiff images can now be loaded with load_uri_to_image_tensor().
d = Document(uri="foo.tiff")
d.load_uri_to_image_tensor()
print(d)
<Document ('id', 'uri', 'chunks') at 7f907d786d6c11ec840a1e008a366d49>
└─ chunks
├─ <Document ('id', 'parent_id', 'granularity', 'tensor') at 7aa4c0ba66cf6c300b7f07fdcbc2fdc8>
├─ <Document ('id', 'parent_id', 'granularity', 'tensor') at bc94a3e3ca60352f2e4c9ab1b1bb9c22>
└─ <Document ('id', 'parent_id', 'granularity', 'tensor') at 36fe0d1daf4442ad6461c619f8bb25b7>
Store key frame indices when loading video tensor from uri (#880)
key_frame_indices are now stored in a Document's tags when loading a video to tensor. This allows extracting the section of the video between key frames.
d = Document(uri="video.mp4").load_uri_to_video_tensor()
print(d.tags['keyframe_indices'])
[0, 25, 196, ...]
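Given such a list of key frame indices, the sections between key frames can be derived as half-open ranges and used to slice the frame axis of the video tensor. The helper keyframe_sections below is hypothetical, and the frame count is an assumed input rather than something read from a Document.

```python
from typing import List, Sequence, Tuple


def keyframe_sections(keyframe_indices: Sequence[int], num_frames: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, one per key frame."""
    bounds = list(keyframe_indices) + [num_frames]
    return list(zip(bounds, bounds[1:]))


frames = list(range(250))  # stand-in for the first axis of a video tensor
sections = keyframe_sections([0, 25, 196], num_frames=len(frames))
start, end = sections[1]
middle_section = frames[start:end]  # frames from the second key frame up to the third
```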
Better plotting of embeddings for nested and complex data (#891)
You can now choose which meta field parameters to exclude when calling DocumentArray's plot_embeddings() method. This makes it easier to plot embeddings for complex and nested data.
docs.plot_embeddings(exclude_fields_metas=['chunks'])
Better support for information retrieval evaluation (#826)
This release adds a max_rel_per_label parameter to better support metric calculations that require the number of relevant Documents.
metrics = da.evaluate(['recall_at_k'], max_rel_per_label={i: 1 for i in range(3)})
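Recall needs the total number of relevant Documents for a label, which cannot be inferred from the retrieved matches alone; that is the role a per-label count plays. The sketch below shows the textbook recall@k formula with such a count supplied explicitly. The function and argument names are hypothetical, and the exact formula DocArray uses may differ.

```python
from typing import Hashable, Sequence


def recall_at_k(ranked_labels: Sequence[Hashable], query_label: Hashable, k: int, max_rel: int) -> float:
    """Fraction of all relevant items (max_rel) found among the top-k matches."""
    if max_rel <= 0:
        raise ValueError("max_rel must be positive")
    hits = sum(1 for label in ranked_labels[:k] if label == query_label)
    return hits / max_rel


# Labels of the retrieved matches, best match first; 4 Documents carry label 0 overall.
score = recall_at_k(ranked_labels=[0, 1, 0, 2], query_label=0, k=3, max_rel=4)
```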
🐞 Bug Fixes
Support length calculation independently from list-like behavior (#840)
DocArray 0.19 added the ability to instantiate a document store without list-like behavior for improved performance. However, calculating the length of certain document stores relied on such list-like behavior. This release fixes length calculation for the Redis document store, making it independent from list-like behavior.
Remove cosine similarity field with false assignment (#835)
In the Weaviate document store, cosine distance is no longer mistakenly assigned to the cosine_similarity field.
Rebuild index after clearing storage (#837)
The index for Redis and Elasticsearch document stores is now rebuilt when _clear_storage is called.
📗 Documentation Improvements
- Correct Document description (#842)
- Minor correction in Document description (#834)
- Add username to DocArray pull (#847)
- Fix broken docs (#805)
- Fix data management section (#801)
- Change logic order according to blog (#797)
- Move cloud support to integrations (#798)
🤟 Contributors
We would like to thank all contributors to this release:
- Delgermurun (@delgermurun)
- Anne Yang (@AnneYang720)
- anna-charlotte (@anna-charlotte)
- Johannes Messner (@JohannesMessner)
- Alex Cureton-Griffiths (@alexcg1)
- AlaeddineAbdessalem (@alaeddine-13)
- dong xiang (@dongxiang123)
- coolmian (@coolmian)
- Joan Fontanals (@JoanFM)
- Nan Wang (@nan-wang)
- samsja (@samsja)
- Michael Günther (@guenthermi)
💫 Patch v0.19.1
Release note 0.19.1
This release contains 1 hot fix.
🐞 Hot Fix
Support for new Jina AI Cloud namespace format.
This release introduces namespaces when pushing/pulling DocumentArrays to/from Jina AI Cloud.
from docarray import DocumentArray
DocumentArray.pull('<username>/<da-name>')
DocumentArray.push('<username>/<da-name>')
You should now use a namespace when accessing an artifact. This release fixes a bug related to this namespace in DocArray.
🤟 Contributors
- samsja (@samsja)
- delgermurun (@delgermurun)