Attempt to multiply with overflow #2577
We currently have a limit after which we don't merge segments. This limit does not take into account multi-values in columns, which could exceed the value space.
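To make that concrete, here is a minimal sketch (hypothetical names, not tantivy's actual code) of why a doc-count cap can miss multi-valued columns: what has to fit in the bit-packed address space is the total value count times the bit width, not the doc count.

```rust
/// Hypothetical illustration: a merge limit expressed in docs does not
/// bound the bit-packed address space when columns hold multiple values.
fn merge_would_overflow(num_docs: u64, avg_values_per_doc: u64, bits_per_value: u64) -> bool {
    // The bit address of the last value is what can exceed u32::MAX,
    // even when `num_docs` alone is under the merge limit.
    let last_bit_addr = num_docs * avg_values_per_doc * bits_per_value;
    last_bit_addr > u64::from(u32::MAX)
}
```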
Does that mean my current index is "toasted" and I should basically reindex (while taking care of not having these large segments)?
I think so, yes; it seems toasted. On average, how many multi-values per doc does your column have?
I don't actually use multi-values at all (if you mean storing the same "key" multiple times per doc).
So there are no arrays in your data? In that case the issue is probably still similar, but I'm not sure what exactly causes it.
Yes, no arrays. Here's my schema:

```rust
let mut document = doc!(
    self.indexer.id => id,
    self.indexer.date => DateTime::from_timestamp_secs(something as i64),
    self.indexer.string1 => String::new(),
    self.indexer.string2 => String::new(),
    self.indexer.string3 => String::new(),
);
```
@PSeitz Would a reproduction be useful? I've been thinking about generating a 1B-doc segment from a minimal repro to see how things go.
@Barre is there a rationale for having such gigantic segments? We recommend around 10 million docs per segment.
I'm afraid yes.
Thanks to the stack trace you attached, we know the problem is coming from here. It computes the address of the value to fetch, but expressed in "bits" (we do bit-packing). If the data is critical and reindexing is not an option for you, you can try to modify tantivy to use u64 in this computation.
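As a rough illustration (a simplified model with hypothetical names, not the actual tantivy code), the failing computation is essentially a doc index multiplied by a per-value bit width, which no longer fits in 32 bits at around a billion docs:

```rust
// Simplified model of a bit-packed column read offset. With ~1B docs and a
// few bits per value, `doc_id * bits_per_value` exceeds u32::MAX: a panic
// ("attempt to multiply with overflow") in debug builds, and a silent
// wrap-around (i.e. reads from the wrong offset) in release builds.
fn bit_offset_u32(doc_id: u32, bits_per_value: u32) -> u32 {
    doc_id * bits_per_value
}

// The suggested workaround: widen to u64 before multiplying.
fn bit_offset_u64(doc_id: u32, bits_per_value: u32) -> u64 {
    u64::from(doc_id) * u64::from(bits_per_value)
}
```

The release-mode behavior reported below (the query runs but ordering/sorting is broken) is consistent with the silent wrap-around variant.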
Thanks for the feedback on segment sizes! In my case, 10M would probably mean too many segments, and the compression ratio wouldn't be as good. There's no particular reason why I "needed" 1B doc segments. I just indexed with default settings and it "happened naturally." Wouldn't it make sense to lower that maximum docs limit if the default merge policy can make things problematic?
I ended up reindexing with a 106M max doc limit.
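For reference, such a cap can be configured through the merge policy. A minimal sketch, assuming tantivy's LogMergePolicy API (method names may vary across versions):

```rust
use tantivy::merge_policy::LogMergePolicy;
use tantivy::IndexWriter;

// Sketch: segments at or above the doc-count threshold stop being merge
// candidates, which bounds the size of merged segments.
fn cap_segment_size(index_writer: &IndexWriter) {
    let mut merge_policy = LogMergePolicy::default();
    merge_policy.set_max_docs_before_merge(10_000_000); // ~10M docs, as recommended above
    index_writer.set_merge_policy(Box::new(merge_policy));
}
```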
I don't think this is true.
This is odd. The default settings do not do this. Do you use a program that merges everything at the end, or something like that?
I was specifically thinking about the FST, which may become more efficient the more entries it contains.
I don't. It's just default settings with the default merge policy.
Here's how I open my index:

```rust
let mut index = IndexBuilder::new()
    .schema(schema.clone())
    .settings(IndexSettings {
        docstore_compression: tantivy::store::Compressor::Lz4,
        docstore_compress_dedicated_thread: true,
        ..Default::default()
    })
    .open_or_create(directory)?;

let index_writer_options = IndexWriterOptions::builder()
    .num_merge_threads(num_cpus::get_physical())
    .num_worker_threads(num_cpus::get_physical())
    .memory_budget_per_thread(1_000_000_000)
    .build();
```

Maybe it's because of this, which makes merging more eager? Not quite default in that case, you are right.
No... This is not it. Can you share the entire main?
```rust
#[derive(Clone)]
pub struct Indexer {
    pub id: Field,
    pub text_indexing: Field,
    pub schema: Schema,
    pub index: Index,
    pub index_reader: IndexReader,
}

impl Indexer {
    pub fn new() -> anyhow::Result<Self> {
        let mut schema_builder = Schema::builder();
        let text_indexing = TextFieldIndexing::default()
            .set_tokenizer("custom_tokenizer")
            .set_index_option(IndexRecordOption::Basic);
        let text_options = TextOptions::default()
            .set_indexing_options(text_indexing)
            .set_stored();
        let date_options = DateOptions::from(INDEXED)
            .set_stored()
            .set_fast()
            .set_precision(tantivy::schema::DateTimePrecision::Seconds);
        let id = schema_builder.add_u64_field("id", FAST | STORED);
        let date = schema_builder.add_date_field("date", date_options);
        let schema = schema_builder.build();
        let directory = tantivy::directory::MmapDirectory::open("/tank/tantivy/")?;
        let mut index = IndexBuilder::new()
            .schema(schema.clone())
            .settings(IndexSettings {
                docstore_compression: tantivy::store::Compressor::Lz4,
                docstore_compress_dedicated_thread: true,
                ..Default::default()
            })
            .open_or_create(directory)?;
        index.set_multithread_executor(num_cpus::get()).unwrap();
        index.tokenizers().register(
            "custom_tokenizer",
            TextAnalyzer::builder(CustomTokenizer)
                .filter(LowerCaser)
                .build(),
        );
        let index_reader = index.reader()?;
        Ok(Self {
            id,
            text_indexing,
            schema,
            index,
            index_reader,
        })
    }

    pub fn get_writer(&self) -> anyhow::Result<IndexWriter> {
        let index_writer_options = IndexWriterOptions::builder()
            .num_merge_threads(num_cpus::get_physical())
            .num_worker_threads(num_cpus::get_physical())
            .memory_budget_per_thread(1_000_000_000)
            .build();
        Ok(self.index.writer_with_options(index_writer_options)?)
    }
}
```
Still nothing special in there.
To get segments that large, you must have overridden the default merge policy or merged the index on your own. You don't have code doing this?
I didn't do anything like this. Though, unless I am missing something, I am not seeing anything preventing merging large segments together in https://github.com/quickwit-oss/tantivy/blob/4aa8cd24707be1255599284f52eb6d388cf86ae8/src/indexer/log_merge_policy.rs
While searching my index of around 3B documents, which looks like this for the biggest segments:

Performing a search such as:

Produces the following in debug mode:

However, my query runs in release mode, but ordering/sorting is broken.

I am running tantivy at rev 4aa8cd24707be1255599284f52eb6d388cf86ae8.