-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve MirroredObjectStore copying from non existent source #1180
base: main
Are you sure you want to change the base?
Conversation
should fix the error on top of main https://github.com/lancedb/lancedb/actions/runs/7457190471/job/20288985725
addresses #797 Problem: tantivy does not expose option to explicitly Proposed solution here: 1. Add a `.phrase_query()` option 2. Under the hood, LanceDB takes care of wrapping the input in quotes and replace nested double quotes with single quotes I've also filed an upstream issue, if they support phrase queries natively then we can get rid of our manual custom processing here.
instead of starting and stopping the current thread's event loop on every http call, just make an http call.
We use `ruff` in CI and dev workflow now.
Named it Gemini-text for now. Not sure how complicated it will be to support both text and multimodal embeddings under the same class "gemini"..But its not something to worry about for now I guess.
We should now be able to directly ingest polars dataframes and return results as polars dataframes ![image](https://github.com/lancedb/lancedb/assets/759245/828b1260-c791-45f1-a047-aa649575e798)
This pull request adds check for the presence of an environment variable `OPENAI_API_KEY` and removes an unused parameter in `retry_with_exponential_backoff` function.
…cn (#816) ``` UserWarning: Valid config keys have changed in V2: * 'keep_untouched' has been renamed to 'ignored_types' warnings.warn(message, UserWarning) ```
This mimics CREATE TABLE IF NOT EXISTS behavior. We add `db.create_table(..., exist_ok=True)` parameter. By default it is set to False, so trying to create a table with the same name will raise an exception. If set to True, then it only opens the table if it already exists. If you pass in a schema, it will be checked against the existing table to make sure you get what you want. If you pass in data, it will NOT be added to the existing table.
This PR makes incremental changes to the documentation. * Closes #697 * Closes #698 ## Chores - [x] Add dark mode - [x] Fix headers in navbar - [x] Add `extra.css` to customize navbar styles - [x] Customize fonts for prose/code blocks, navbar and admonitions - [x] Inspect all admonition boxes (remove redundant dropdowns) and improve clarity and readability - [x] Ensure that all images in the docs have white background (not transparent) to be viewable in dark mode - [x] Improve code formatting in code blocks to make them consistent with autoformatters (eslint/ruff) - [x] Add bolder weight to h1 headers - [x] Add diagram showing the difference between embedded (OSS) and serverless (Cloud) - [x] Fix [Creating an empty table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table) section: right now, the subheaders are not clickable. - [x] In critical data ingestion methods like `table.add` (among others), the type signature often does not match the actual code - [x] Proof-read each documentation section and rewrite as necessary to provide more context, use cases, and explanations so it reads less like reference documentation. This is especially important for CRUD and search sections since those are so central to the user experience. ## Restructure/new content - [x] The section for [Adding data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table) only shows examples for pandas and iterables. We should include pydantic models, arrow tables, etc. - [x] Add conceptual tutorial for IVF-PQ index - [x] Clearly separate vector search, FTS and filtering sections so that these are easier to find - [x] Add docs on refine factor to explain its importance for recall. Closes #716 - [x] Add an FAQ page showing answers to commonly asked questions about LanceDB. Closes #746 - [x] Add simple polars example to the integrations section. Closes #756 and closes #153 - [ ] Add basic docs for the Rust API (more detailed API docs can come later). Closes #781 - [x] Add a section on the various storage options on local vs. cloud (S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782 - [x] Revamp filtering docs: add pre-filtering examples and redo headers and update content for SQL filters. Closes #783 and closes #784. - [x] Add docs for data management: compaction, cleaning up old versions and incremental indexing. Closes #785 - [ ] Add a benchmark section that also discusses some best practices. Closes #787 --------- Co-authored-by: Ayush Chaurasia <[email protected]> Co-authored-by: Will Jones <[email protected]>
This PR makes the following aesthetic and content updates to the docs. - [x] Fix max width issue on mobile: Content should now render more cleanly and be more readable on smaller devices - [x] Improve image quality of flowchart in data management page - [x] Fix syntax highlighting in text at the bottom of the IVF-PQ concepts page - [x] Add example of Polars LazyFrames to docs (Integrations) - [x] Add example of adding data to tables using Polars (guides)
@eddyxu added instructions for linting here: https://github.com/lancedb/lancedb/blob/7af213801a091cd5afb0f0814e184fc0b852de47/python/README.md?plain=1#L45-L50 However, we had a lot of failures and weren't checking this in CI. This PR fixes all lints and adds a check to CI to keep us in compliance with the lints.
* improve the docstring for NodeJS connect functions and `ConnectOptions` parameters. * Simplify `npm run build` steps.
…into tecmie-tecmie/embeddings-openai
The previous release failed to release nodejs because the nodejs version wasn't bumped. This should fix that.
We aren't yet ready to switch over the examples since almost all JS examples rely on embeddings and we haven't yet ported those over. However, this makes it possible for those that are interested to start using `@lancedb/lancedb`
<img width="1034" alt="image" src="https://github.com/lancedb/lancedb/assets/5846846/5b8aa53c-4d93-4c0e-bed4-80c238b319ba">
self.primary.copy(from, to).await?; | ||
Ok(()) | ||
} | ||
} | ||
|
||
async fn copy_if_not_exists(&self, from: &Path, to: &Path) -> Result<()> { | ||
if !to.primary_only() { | ||
self.secondary.copy(from, to).await?; | ||
self.secondary_copy(from, to).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we just do a hard link? and soft fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't we only be able to create the link on local FS? I was trying to do something that is agnostic to the underlying secondary implementation.
Also by soft-fail, do you mean not return the error if the source doesn't exist? That behaviour is now supported by supplying param secondary_copy_behavior
=MirroringSecondaryCopy::SkipIfNotFound
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also fyi the local fs object store does do a hard link when copying:
https://github.com/apache/arrow-rs/blob/object_store_0.9.0/object_store/src/local.rs#L601-L616
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll admit, I'm a little confused, but I was already a little confused by the MirroringObjectStore
so this isn't something new.
I guess my main question is how a file ends up existing on the destination but not the source? Or why is it safe to silently ignore these "not found" errors?
) => Ok(()), | ||
(_, Err(e)) => Err(e), | ||
} | ||
} | ||
} | ||
|
||
impl std::fmt::Display for MirroringObjectStore { | ||
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you also print/format the copy behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
// Since the secondary store may not be as durable as the primary, the copy source | ||
// may exist on the primary but not on the secondary. If the source is not found, | ||
// this skips making the copy | ||
SkipIfNotFound, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what cases would I want Copy
and not SkipIfNotFound
? How do I decide what value to set here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally you'd use SkipIfNotFound
if you can't be sure that the source object won't be in the secondary store, and your use case allows for the destination not to be there either. As a concrete example, you might want to use SkipIfNotFound
if you were using the secondary object store as some kind of read through cache. E.g. Maybe your primary object store is s3, but you're mirroring objects to a local ephemeral volume. In this scenario, it could also be that another process is creating objects on the primary store, and if they don't exist on the copy it's not so bad (it's just a cache miss).
Generally you'd use Copy
if you had some tighter constraints on about maintaining consistency between what's in your primary and secondary stores, and maybe you want the caller to be notified that the copy object couldn't be created on the secondary store. This option being (being the default), is added backwards compatibility.
@@ -314,33 +346,152 @@ impl AsyncWrite for MirroringUpload { | |||
#[derive(Debug)] | |||
pub struct MirroringObjectStoreWrapper { | |||
secondary: Arc<dyn ObjectStore>, | |||
secondary_copy_behavior: MirroringSecondaryCopy, | |||
secondary_wrapper: Option<Arc<dyn WrappingObjectStore>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the addition of secondary_wrapper
essential to the fix? Or is it just something that will make debugging easier in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is essential to the fix. We need to be able to wrap the secondary with CheckedCopyObjectStoreWrapper
so if the secondary is a LocalFileSystem
, when we call copy on MirroredObjectStore
it won't hang forever when it calls secondary.copy()
(_, Ok(_)) => Ok(()), | ||
( | ||
MirroringSecondaryCopy::SkipIfNotFound, | ||
Err(object_store::Error::NotFound { path: _, source: _ }), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean the source was not found? Or does it mean the destination directory was not found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means the source is not found. Object store copy will create the destination directory if it's not found.
CheckedCopyObjectStore
wrapper which will check if the source object exists before copying. This ensures that we won't go into an infinite loop due to bug Local object store copy/rename with nonexistentfrom
file loops forever instead of erroring apache/arrow-rs#5503fixes: #1176