Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve MirroredObjectStore copying from non existent source #1180

Open
wants to merge 465 commits into
base: main
Choose a base branch
from

Conversation

albertlockett
Copy link
Contributor

@albertlockett albertlockett commented Apr 2, 2024

fixes: #1176

changhiskhan and others added 30 commits January 12, 2024 09:45
addresses #797 

Problem: tantivy does not expose option to explicitly

Proposed solution here: 

1. Add a `.phrase_query()` option
2. Under the hood, LanceDB takes care of wrapping the input in quotes
and replace nested double quotes with single quotes

I've also filed an upstream issue, if they support phrase queries
natively then we can get rid of our manual custom processing here.
instead of starting and stopping the current thread's event loop on
every http call, just make an http call.
We use `ruff` in CI and dev workflow now.
Named it Gemini-text for now. Not sure how complicated it will be to
support both text and multimodal embeddings under the same class
"gemini"..But its not something to worry about for now I guess.
We should now be able to directly ingest polars dataframes and return
results as polars dataframes


![image](https://github.com/lancedb/lancedb/assets/759245/828b1260-c791-45f1-a047-aa649575e798)
This pull request adds check for the presence of an environment variable
`OPENAI_API_KEY` and removes an unused parameter in
`retry_with_exponential_backoff` function.
…cn (#816)

```
UserWarning: Valid config keys have changed in V2:
* 'keep_untouched' has been renamed to 'ignored_types' warnings.warn(message, UserWarning)
```
This mimics CREATE TABLE IF NOT EXISTS behavior.
We add `db.create_table(..., exist_ok=True)` parameter.
By default it is set to False, so trying to create
a table with the same name will raise an exception.
If set to True, then it only opens the table if it
already exists. If you pass in a schema, it will
be checked against the existing table to make sure
you get what you want. If you pass in data, it will
NOT be added to the existing table.
This PR makes incremental changes to the documentation.

* Closes #697 
* Closes #698

## Chores
- [x] Add dark mode
- [x] Fix headers in navbar
- [x] Add `extra.css` to customize navbar styles
- [x] Customize fonts for prose/code blocks, navbar and admonitions
- [x] Inspect all admonition boxes (remove redundant dropdowns) and
improve clarity and readability
- [x] Ensure that all images in the docs have white background (not
transparent) to be viewable in dark mode
- [x] Improve code formatting in code blocks to make them consistent
with autoformatters (eslint/ruff)
- [x] Add bolder weight to h1 headers
- [x] Add diagram showing the difference between embedded (OSS) and
serverless (Cloud)
- [x] Fix [Creating an empty
table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table)
section: right now, the subheaders are not clickable.
- [x] In critical data ingestion methods like `table.add` (among
others), the type signature often does not match the actual code
- [x] Proof-read each documentation section and rewrite as necessary to
provide more context, use cases, and explanations so it reads less like
reference documentation. This is especially important for CRUD and
search sections since those are so central to the user experience.

## Restructure/new content 
- [x] The section for [Adding
data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table)
only shows examples for pandas and iterables. We should include pydantic
models, arrow tables, etc.
- [x] Add conceptual tutorial for IVF-PQ index
- [x] Clearly separate vector search, FTS and filtering sections so that
these are easier to find
- [x] Add docs on refine factor to explain its importance for recall.
Closes #716
- [x] Add an FAQ page showing answers to commonly asked questions about
LanceDB. Closes #746
- [x] Add simple polars example to the integrations section. Closes #756
and closes #153
- [ ] Add basic docs for the Rust API (more detailed API docs can come
later). Closes #781
- [x] Add a section on the various storage options on local vs. cloud
(S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782
- [x] Revamp filtering docs: add pre-filtering examples and redo headers
and update content for SQL filters. Closes #783 and closes #784.
- [x] Add docs for data management: compaction, cleaning up old versions
and incremental indexing. Closes #785
- [ ] Add a benchmark section that also discusses some best practices.
Closes #787

---------

Co-authored-by: Ayush Chaurasia <[email protected]>
Co-authored-by: Will Jones <[email protected]>
This PR makes the following aesthetic and content updates to the docs.

- [x] Fix max width issue on mobile: Content should now render more
cleanly and be more readable on smaller devices
- [x] Improve image quality of flowchart in data management page
- [x] Fix syntax highlighting in text at the bottom of the IVF-PQ
concepts page
- [x] Add example of Polars LazyFrames to docs (Integrations)
- [x] Add example of adding data to tables using Polars (guides)
@eddyxu added instructions for linting here:


https://github.com/lancedb/lancedb/blob/7af213801a091cd5afb0f0814e184fc0b852de47/python/README.md?plain=1#L45-L50

However, we had a lot of failures and weren't checking this in CI. This
PR fixes all lints and adds a check to CI to keep us in compliance with
the lints.
* improve the docstring for NodeJS connect functions and
`ConnectOptions` parameters.
* Simplify `npm run build` steps.
Lance Release and others added 22 commits March 25, 2024 15:43
The previous release failed to release nodejs because the nodejs version
wasn't bumped. This should fix that.
We aren't yet ready to switch over the examples since almost all JS
examples rely on embeddings and we haven't yet ported those over.
However, this makes it possible for those that are interested to start
using `@lancedb/lancedb`
self.primary.copy(from, to).await?;
Ok(())
}
}

async fn copy_if_not_exists(&self, from: &Path, to: &Path) -> Result<()> {
if !to.primary_only() {
self.secondary.copy(from, to).await?;
self.secondary_copy(from, to).await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we just do a hard link? and soft fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we only be able to create the link on local FS? I was trying to do something that is agnostic to the underlying secondary implementation.

Also by soft-fail, do you mean not return the error if the source doesn't exist? That behaviour is now supported by supplying param secondary_copy_behavior =MirroringSecondaryCopy::SkipIfNotFound

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also fyi the local fs object store does do a hard link when copying:
https://github.com/apache/arrow-rs/blob/object_store_0.9.0/object_store/src/local.rs#L601-L616

Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll admit, I'm a little confused, but I was already a little confused by the MirroringObjectStore so this isn't something new.

I guess my main question is how a file ends up existing on the destination but not the source? Or why is it safe to silently ignore these "not found" errors?

) => Ok(()),
(_, Err(e)) => Err(e),
}
}
}

impl std::fmt::Display for MirroringObjectStore {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you also print/format the copy behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

// Since the secondary store may not be as durable as the primary, the copy source
// may exist on the primary but not on the secondary. If the source is not found,
// this skips making the copy
SkipIfNotFound,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what cases would I want Copy and not SkipIfNotFound? How do I decide what value to set here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally you'd use SkipIfNotFound if you can't be sure that the source object won't be in the secondary store, and your use case allows for the destination not to be there either. As a concrete example, you might want to use SkipIfNotFound if you were using the secondary object store as some kind of read through cache. E.g. Maybe your primary object store is s3, but you're mirroring objects to a local ephemeral volume. In this scenario, it could also be that another process is creating objects on the primary store, and if they don't exist on the copy it's not so bad (it's just a cache miss).

Generally you'd use Copy if you had some tighter constraints on about maintaining consistency between what's in your primary and secondary stores, and maybe you want the caller to be notified that the copy object couldn't be created on the secondary store. This option being (being the default), is added backwards compatibility.

@@ -314,33 +346,152 @@ impl AsyncWrite for MirroringUpload {
#[derive(Debug)]
pub struct MirroringObjectStoreWrapper {
secondary: Arc<dyn ObjectStore>,
secondary_copy_behavior: MirroringSecondaryCopy,
secondary_wrapper: Option<Arc<dyn WrappingObjectStore>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the addition of secondary_wrapper essential to the fix? Or is it just something that will make debugging easier in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is essential to the fix. We need to be able to wrap the secondary with CheckedCopyObjectStoreWrapper so if the secondary is a LocalFileSystem, when we call copy on MirroredObjectStore it won't hang forever when it calls secondary.copy()

(_, Ok(_)) => Ok(()),
(
MirroringSecondaryCopy::SkipIfNotFound,
Err(object_store::Error::NotFound { path: _, source: _ }),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the source was not found? Or does it mean the destination directory was not found?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means the source is not found. Object store copy will create the destination directory if it's not found.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug(rust): Infinite loop when resolving version w/ external manifest using MirroredObjectStore