Improve MirroredObjectStore copying from non existent source #1180

albertlockett · 2024-04-02T16:17:23Z

Adds CheckedCopyObjectStore wrapper which will check if the source object exists before copying. This ensures that we won't go into an infinite loop due to bug Local object store copy/rename with nonexistent from file loops forever instead of erroring apache/arrow-rs#5503
Adds to MirroredObjectStore the ability to skip copying in the secondary if the source doesn't exist
Adds to MirroredObjectStore the ability to use WrappingObjectStore for secondary store

should fix the error on top of main https://github.com/lancedb/lancedb/actions/runs/7457190471/job/20288985725

Closes #795

addresses #797 Problem: tantivy does not expose option to explicitly Proposed solution here: 1. Add a `.phrase_query()` option 2. Under the hood, LanceDB takes care of wrapping the input in quotes and replace nested double quotes with single quotes I've also filed an upstream issue, if they support phrase queries natively then we can get rid of our manual custom processing here.

Closes #796

instead of starting and stopping the current thread's event loop on every http call, just make an http call.

We use `ruff` in CI and dev workflow now.

Named it Gemini-text for now. Not sure how complicated it will be to support both text and multimodal embeddings under the same class "gemini"..But its not something to worry about for now I guess.

We should now be able to directly ingest polars dataframes and return results as polars dataframes ![image](https://github.com/lancedb/lancedb/assets/759245/828b1260-c791-45f1-a047-aa649575e798)

This pull request adds check for the presence of an environment variable `OPENAI_API_KEY` and removes an unused parameter in `retry_with_exponential_backoff` function.

…cn (#816) ``` UserWarning: Valid config keys have changed in V2: * 'keep_untouched' has been renamed to 'ignored_types' warnings.warn(message, UserWarning) ```

This mimics CREATE TABLE IF NOT EXISTS behavior. We add `db.create_table(..., exist_ok=True)` parameter. By default it is set to False, so trying to create a table with the same name will raise an exception. If set to True, then it only opens the table if it already exists. If you pass in a schema, it will be checked against the existing table to make sure you get what you want. If you pass in data, it will NOT be added to the existing table.

…me (#814)

This PR makes incremental changes to the documentation. * Closes #697 * Closes #698 ## Chores - [x] Add dark mode - [x] Fix headers in navbar - [x] Add `extra.css` to customize navbar styles - [x] Customize fonts for prose/code blocks, navbar and admonitions - [x] Inspect all admonition boxes (remove redundant dropdowns) and improve clarity and readability - [x] Ensure that all images in the docs have white background (not transparent) to be viewable in dark mode - [x] Improve code formatting in code blocks to make them consistent with autoformatters (eslint/ruff) - [x] Add bolder weight to h1 headers - [x] Add diagram showing the difference between embedded (OSS) and serverless (Cloud) - [x] Fix [Creating an empty table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table) section: right now, the subheaders are not clickable. - [x] In critical data ingestion methods like `table.add` (among others), the type signature often does not match the actual code - [x] Proof-read each documentation section and rewrite as necessary to provide more context, use cases, and explanations so it reads less like reference documentation. This is especially important for CRUD and search sections since those are so central to the user experience. ## Restructure/new content - [x] The section for [Adding data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table) only shows examples for pandas and iterables. We should include pydantic models, arrow tables, etc. - [x] Add conceptual tutorial for IVF-PQ index - [x] Clearly separate vector search, FTS and filtering sections so that these are easier to find - [x] Add docs on refine factor to explain its importance for recall. Closes #716 - [x] Add an FAQ page showing answers to commonly asked questions about LanceDB. Closes #746 - [x] Add simple polars example to the integrations section. Closes #756 and closes #153 - [ ] Add basic docs for the Rust API (more detailed API docs can come later). Closes #781 - [x] Add a section on the various storage options on local vs. cloud (S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782 - [x] Revamp filtering docs: add pre-filtering examples and redo headers and update content for SQL filters. Closes #783 and closes #784. - [x] Add docs for data management: compaction, cleaning up old versions and incremental indexing. Closes #785 - [ ] Add a benchmark section that also discusses some best practices. Closes #787 --------- Co-authored-by: Ayush Chaurasia <[email protected]> Co-authored-by: Will Jones <[email protected]>

This PR makes the following aesthetic and content updates to the docs. - [x] Fix max width issue on mobile: Content should now render more cleanly and be more readable on smaller devices - [x] Improve image quality of flowchart in data management page - [x] Fix syntax highlighting in text at the bottom of the IVF-PQ concepts page - [x] Add example of Polars LazyFrames to docs (Integrations) - [x] Add example of adding data to tables using Polars (guides)

@eddyxu

@eddyxu added instructions for linting here: https://github.com/lancedb/lancedb/blob/7af213801a091cd5afb0f0814e184fc0b852de47/python/README.md?plain=1#L45-L50 However, we had a lot of failures and weren't checking this in CI. This PR fixes all lints and adds a check to CI to keep us in compliance with the lints.

* improve the docstring for NodeJS connect functions and `ConnectOptions` parameters. * Simplify `npm run build` steps.

…into tecmie-tecmie/embeddings-openai

#1134) Solves - #968

The previous release failed to release nodejs because the nodejs version wasn't bumped. This should fix that.

Fixes #1163

We aren't yet ready to switch over the examples since almost all JS examples rely on embeddings and we haven't yet ported those over. However, this makes it possible for those that are interested to start using `@lancedb/lancedb`

chebbyChefNEQ · 2024-04-02T17:09:44Z

rust/lancedb/src/io/object_store.rs

 self.primary.copy(from, to).await?;
 Ok(())
 }
 }

 async fn copy_if_not_exists(&self, from: &Path, to: &Path) -> Result<()> {
 if !to.primary_only() {
- self.secondary.copy(from, to).await?;
+ self.secondary_copy(from, to).await?;


could we just do a hard link? and soft fail?

Wouldn't we only be able to create the link on local FS? I was trying to do something that is agnostic to the underlying secondary implementation.

Also by soft-fail, do you mean not return the error if the source doesn't exist? That behaviour is now supported by supplying param secondary_copy_behavior =MirroringSecondaryCopy::SkipIfNotFound

also fyi the local fs object store does do a hard link when copying:
https://github.com/apache/arrow-rs/blob/object_store_0.9.0/object_store/src/local.rs#L601-L616

westonpace

I'll admit, I'm a little confused, but I was already a little confused by the MirroringObjectStore so this isn't something new.

I guess my main question is how a file ends up existing on the destination but not the source? Or why is it safe to silently ignore these "not found" errors?

westonpace · 2024-04-02T18:56:45Z

rust/lancedb/src/io/object_store.rs

+ ) => Ok(()),
+ (_, Err(e)) => Err(e),
+ }
+ }
 }

 impl std::fmt::Display for MirroringObjectStore {
 fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {


Should you also print/format the copy behavior?

westonpace · 2024-04-02T18:57:39Z

rust/lancedb/src/io/object_store.rs

+ // Since the secondary store may not be as durable as the primary, the copy source
+ // may exist on the primary but not on the secondary. If the source is not found,
+ // this skips making the copy
+ SkipIfNotFound,


In what cases would I want Copy and not SkipIfNotFound? How do I decide what value to set here?

Generally you'd use SkipIfNotFound if you can't be sure that the source object won't be in the secondary store, and your use case allows for the destination not to be there either. As a concrete example, you might want to use SkipIfNotFound if you were using the secondary object store as some kind of read through cache. E.g. Maybe your primary object store is s3, but you're mirroring objects to a local ephemeral volume. In this scenario, it could also be that another process is creating objects on the primary store, and if they don't exist on the copy it's not so bad (it's just a cache miss).

Generally you'd use Copy if you had some tighter constraints on about maintaining consistency between what's in your primary and secondary stores, and maybe you want the caller to be notified that the copy object couldn't be created on the secondary store. This option being (being the default), is added backwards compatibility.

westonpace · 2024-04-02T19:00:11Z

rust/lancedb/src/io/object_store.rs

@@ -314,33 +346,152 @@ impl AsyncWrite for MirroringUpload {
 #[derive(Debug)]
 pub struct MirroringObjectStoreWrapper {
 secondary: Arc<dyn ObjectStore>,
+ secondary_copy_behavior: MirroringSecondaryCopy,
+ secondary_wrapper: Option<Arc<dyn WrappingObjectStore>>,


Is the addition of secondary_wrapper essential to the fix? Or is it just something that will make debugging easier in the future?

This is essential to the fix. We need to be able to wrap the secondary with CheckedCopyObjectStoreWrapper so if the secondary is a LocalFileSystem, when we call copy on MirroredObjectStore it won't hang forever when it calls secondary.copy()

westonpace · 2024-04-02T19:03:24Z

rust/lancedb/src/io/object_store.rs

+ (_, Ok(_)) => Ok(()),
+ (
+ MirroringSecondaryCopy::SkipIfNotFound,
+ Err(object_store::Error::NotFound { path: _, source: _ }),


Does this mean the source was not found? Or does it mean the destination directory was not found?

This means the source is not found. Object store copy will create the destination directory if it's not found.

…edb#1270) Closes lancedb#1180

changhiskhan and others added 30 commits January 12, 2024 09:45

fix(rust): not sure why clippy is suddenly unhappy (#794)

f43c06d

should fix the error on top of main https://github.com/lancedb/lancedb/actions/runs/7457190471/job/20288985725

feat(python): add count_rows with filter option (#801)

fcfb458

Closes #795

chore(python): add docstring for limit behavior (#800)

91d64d8

Closes #796

use requests instead of aiohttp for underlying http client (#803)

eda4c58

instead of starting and stopping the current thread's event loop on every http call, just make an http call.

feat(node): align incoming data to table schema (#802)

3f66be6

chore: remove black as dependency (#808)

6e97fad

We use `ruff` in CI and dev workflow now.

upgrade lance (#809)

707df47

[python] Bump version: 0.4.3 → 0.4.4

a7be064

Bump version: 0.4.2 → 0.4.3

88affc1

Updating package-lock.json

cfdbddc

rebase from lancedb/main

f7f9bea

Merge branch 'main' into tecmie/embeddings-openai

a90358a

feat(python): Add gemini text embedding function (#806)

4568df4

Named it Gemini-text for now. Not sure how complicated it will be to support both text and multimodal embeddings under the same class "gemini"..But its not something to worry about for now I guess.

eslint fix

44eba36

feat(python): basic polars integration (#811)

49333e5

We should now be able to directly ingest polars dataframes and return results as polars dataframes ![image](https://github.com/lancedb/lancedb/assets/759245/828b1260-c791-45f1-a047-aa649575e798)

Add openai api key not found help (#815)

ff6f005

This pull request adds check for the presence of an environment variable `OPENAI_API_KEY` and removes an unused parameter in `retry_with_exponential_backoff` function.

chore(python): get rid of Pydantic deprication warning in embedding f…

184d2bc

…cn (#816) ``` UserWarning: Valid config keys have changed in V2: * 'keep_untouched' has been renamed to 'ignored_types' warnings.warn(message, UserWarning) ```

feat(python): allow the entire table to be converted a polars datafra…

af8263a

…me (#814)

chore(python): turn off lazy frame ingestion (#821)

39e14c7

[python] Bump version: 0.4.4 → 0.5.0

8bcdc81

bump lance to 0.9.7 (#826)

7af2138

fix: remote python client closes idle connections (#831)

c89d5e6

doc: improve docs for nodejs connect functions (#833)

9a9fc77

* improve the docstring for NodeJS connect functions and `ConnectOptions` parameters. * Simplify `npm run build` steps.

Merge branch 'tecmie/embeddings-openai' of github.com:tecmie/lancedb …

3ba1618

…into tecmie-tecmie/embeddings-openai

Merge branch 'tecmie-tecmie/embeddings-openai'

a1ab549

Lance Release and others added 22 commits March 25, 2024 15:43

Updating package-lock.json

9992b92

Updating package-lock.json

98c1e63

docs: add the async python API to the docs (#1156)

c37a28a

docs: Add all available HF/sentence transformers embedding models list (

e9e0a37

#1134) Solves - #968

New logo on docs site (#1157)

cae0348

added new logo to vercel example gif (#1158)

0e9ad76

chore: fix clippy (#1162)

ca4fdf5

chore: add nodejs to bumpversion (#1161)

e52dc87

The previous release failed to release nodejs because the nodejs version wasn't bumped. This should fix that.

ci: fix name collision in npm artifacts for vectordb (#1164)

d388ef2

Fixes #1163

Update LanceDB Logo in README (#1167)

e0bf6d9

<img width="1034" alt="image" src="https://github.com/lancedb/lancedb/assets/5846846/5b8aa53c-4d93-4c0e-bed4-80c238b319ba">

chore: bump lance version to 0.10.6 (#1175)

6779c1c

Bump version: 0.4.14 → 0.4.15

e5f40a4

Updating package-lock.json

7e92aa6

[python] Bump version: 0.6.5 → 0.6.6

05f484b

feat: add filterable countRows to remote API (#1169)

1d67615

Updating package-lock.json

5603f1e

Updating package-lock.json

09280bc

chore: pass str instead of String to build table names (#1178)

73810b4

chore: add social link footer (#1177)

46e95f2

add checked copy and new secondary copy behaviour

5b490d0

tests

f6bd785

albertlockett requested review from chebbyChefNEQ and westonpace April 2, 2024 16:59

chebbyChefNEQ reviewed Apr 2, 2024

View reviewed changes

westonpace reviewed Apr 2, 2024

View reviewed changes

albertlockett added 2 commits April 2, 2024 15:16

implement display for copy behavior

ada28d3

linter

a40170c

westonpace force-pushed the main branch from 93c8786 to 9fee384 Compare April 5, 2024 23:40

alexkohler pushed a commit to alexkohler/lancedb that referenced this pull request Apr 20, 2024

feat: add python bindings for the cleanup_old_versions function (lanc…

f042694

…edb#1270) Closes lancedb#1180

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve MirroredObjectStore copying from non existent source #1180

Improve MirroredObjectStore copying from non existent source #1180

albertlockett commented Apr 2, 2024 •

edited

chebbyChefNEQ Apr 2, 2024

albertlockett Apr 2, 2024

albertlockett Apr 2, 2024

westonpace left a comment

westonpace Apr 2, 2024

albertlockett Apr 2, 2024

westonpace Apr 2, 2024

albertlockett Apr 2, 2024

westonpace Apr 2, 2024

albertlockett Apr 2, 2024

westonpace Apr 2, 2024

albertlockett Apr 2, 2024

Improve MirroredObjectStore copying from non existent source #1180

Are you sure you want to change the base?

Improve MirroredObjectStore copying from non existent source #1180

Conversation

albertlockett commented Apr 2, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

westonpace left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

albertlockett commented Apr 2, 2024 •

edited