
bug(python): Creating an index for float16 vectors takes significantly longer than for float32 vectors #1312

Open
weitianhan opened this issue May 17, 2024 · 2 comments
Labels: bug (Something isn't working)

@weitianhan

weitianhan commented May 17, 2024

LanceDB version

v0.6.13

What happened?

As the title describes, I observe that creating an index on a float16 vector column takes much longer than on a float32 column. Here is a short code snippet to reproduce the problem.

```python
import numpy as np
import lancedb
import time
import pyarrow as pa

# modify the parameters below accordingly
uri = "./example-lancedb"
record_num = 1000  # feel free to increase the dataset size to see the performance diff
part_num = 10
sub_vec_num = 64
vector_dim = 512

# no need to modify anything below
# prepare the tables before building the indices
db = lancedb.connect(uri)
schema_32 = pa.schema([pa.field("vector", pa.list_(pa.float32(), vector_dim))])
tbl_32 = db.create_table("my_table_32", schema=schema_32, mode="overwrite")  # overwrite for demo purposes
tbl_32.add([{"vector": np.random.uniform(-1, 1, size=vector_dim)} for _ in range(record_num)])
print("size of float32 table is now: " + str(tbl_32.count_rows()))

schema_16 = pa.schema([pa.field("vector", pa.list_(pa.float16(), vector_dim))])
tbl_16 = db.create_table("my_table_16", schema=schema_16, mode="overwrite")
tbl_16.add([{"vector": np.random.uniform(-1, 1, size=vector_dim).astype(np.float16)} for _ in range(record_num)])
print("size of float16 table is now: " + str(tbl_16.count_rows()))

# create the index for the float32 table
start_time = time.perf_counter()
tbl_32.create_index(metric="cosine", num_partitions=part_num, num_sub_vectors=sub_vec_num, vector_column_name="vector")
end_time = time.perf_counter()
print("float32 create index time used: %ss" % (end_time - start_time))

# create the index for the float16 table
start_time = time.perf_counter()
tbl_16.create_index(metric="cosine", num_partitions=part_num, num_sub_vectors=sub_vec_num, vector_column_name="vector")
end_time = time.perf_counter()
print("float16 create index time used: %ss" % (end_time - start_time))
```

And the result is:

```
size of float32 table is now: 1000
size of float16 table is now: 1000
float32 create index time used: 1.0481734249997317s
float16 create index time used: 4.043548755000302s
```

When I grow the dataset to 1M rows, the difference is 10 minutes vs. 2 hours. Is this expected behaviour, or am I doing something wrong?

Are there known steps to reproduce?

No response

@weitianhan weitianhan added the bug Something isn't working label May 17, 2024
@wjones127
Contributor

Not sure if this is the reason, but we only compile optimized fp16 kernels for some platforms, and on all others it is expected to be slow. When you installed lancedb, it should have installed pylance too. Do you know what the wheel name was? You can run `pip install -U lancedb` again and you should see somewhere in the logs a string like `pylance-0.10.12-cp38-abi3-macosx_11_0_arm64.whl`.
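For anyone triaging a similar report, a stdlib-only sketch like this shows the two pieces of information without re-running pip: the machine architecture (which decides the platform tag pip selects) and the installed pylance version. The full wheel filename with its platform tag only appears in pip's logs, so this is an approximation:

```python
import platform
import importlib.metadata

# Architecture string, e.g. 'x86_64' or 'aarch64' — this determines
# which wheel (and so which compiled kernels) pip picked.
print(platform.machine())

# Version of the installed pylance package, if present.
try:
    print(importlib.metadata.version("pylance"))
except importlib.metadata.PackageNotFoundError:
    print("pylance is not installed")
```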

@weitianhan
Author

Yes, I am developing an app on an ARM development board. The wheel used to install the Python package is `pylance-0.10.18-cp39-abi3-manylinux_2_24_aarch64.whl`.

But the same thing happens when I run this script on my desktop machine with an AMD64 CPU, where pylance was installed from `pylance-0.10.12-cp38-abi3-manylinux_2_28_x86_64.whl`. So is it really related to the platform?
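Whether optimized fp16 kernels are available may depend on CPU features as well as the wheel's platform tag. The flag names below (`f16c` on x86, `fphp`/`asimdhp` on aarch64) are real half-precision-related CPU feature flags, but it is an assumption that the compiled kernels key on them; a Linux-only sketch for inspecting them:

```python
import re

def cpu_flags(cpuinfo_text):
    """Extract the feature-flag set from /proc/cpuinfo-style text.

    x86 kernels list features under 'flags'; ARM kernels use 'Features'.
    """
    m = re.search(r"^(?:flags|Features)\s*:\s*(.+)$", cpuinfo_text, re.M)
    return set(m.group(1).split()) if m else set()

# On a Linux box you would read the real file:
#   flags = cpu_flags(open("/proc/cpuinfo").read())
# Here, a sample x86 line for illustration:
sample = "flags\t\t: fpu sse2 avx f16c avx2\n"
print("f16c" in cpu_flags(sample))
```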
