Add support for snapshots with zstd compression #1523


Open · wants to merge 20 commits into main

Conversation

jugal-chauhan (Collaborator)

Description

This PR adds a no-op implementation that ignores Bloom filters and a fallback that skips the Elasticsearch816 codec, both of which are needed because of certain default index settings in ES 8.x. It also registers the new codecs with our Lucene 9 readers so that they can read zstd-compressed Lucene segments.

Issues Resolved

MIGRATIONS-2536

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.

@peternied (Member) left a comment:
Could we rename this PR to be oriented towards the customer problem it solves? Seems like this one would be "Add support for snapshots with zstd compression"?

Let's also add test cases that exercise this behavior. What do you think about adding a new index to the existing end-to-end test cases that enables this level of compression?

@jugal-chauhan changed the title from "Register SPI stub class and add codec for zstd" to "Add support for snapshots with zstd compression" on May 20, 2025
@@ -20,6 +20,7 @@ public class Elasticsearch816CodecFallback extends Codec {

public Elasticsearch816CodecFallback() {
super("Elasticsearch816");
System.out.println(">>>>> Loading stub Elasticsearch816CodecFallback class");
Member:

Can we remove all these prints? We can use logger.debug if you want to retain them.

@peternied (Member) left a comment:

Did I miss where we've added zstd support? I don't see any new libraries or updated compression configuration for our Lucene readers.

def run_test_benchmarks(self, cluster: Cluster):
run_test_benchmarks(cluster=cluster)

def disable_bloom(self, cluster: Cluster, index_name: str):
Member:

Aren't indices with Bloom filters supported via IgnoreBloomFilter? Why would we need to disable them?

Collaborator Author:

So that we do not carry this setting forward to OpenSearch versions during the metadata migration.

Member:

Overall, the metadata process should not move unsupported features from ES8 to OpenSearch. That seems to be the case for this bloom feature, no?

@@ -28,6 +28,7 @@ public class IgnorePsmPostings extends PostingsFormat {

public IgnorePsmPostings() {
super("ES812Postings");
System.out.println(">>>>> Loading stub IgnorePsmPostings class");
Member:

Please remove all usage of System.out.println(...). Does it make sense to include debug-level logging statements for these?

Collaborator Author:

Yes, I will remove the usage of System.out.println and use Log4j wherever I want temporary logging.
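A minimal sketch of the change being agreed on here. The real stub classes extend Lucene's Codec and would use a Log4j logger; java.util.logging stands in below so the sketch is self-contained, and the class body is a hypothetical stand-in, not the PR's actual code:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for the stub codec classes in this PR; the real ones
// extend org.apache.lucene.codecs.Codec and would use a Log4j logger instead.
class Elasticsearch816CodecFallback {
    private static final Logger logger =
        Logger.getLogger(Elasticsearch816CodecFallback.class.getName());

    Elasticsearch816CodecFallback() {
        // Replaces System.out.println(">>>>> Loading stub ..."): emitted only
        // when the logger is configured at FINE (debug) level, silent otherwise.
        logger.log(Level.FINE, "Loading stub Elasticsearch816CodecFallback class");
    }
}
```

The point of the pattern is that debug output stays available when needed but never pollutes stdout in production.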

System.out.println(">>>>> Injecting missing mode BEST_SPEED into segment info");
si.putAttribute("Lucene90StoredFieldsFormat.mode", "BEST_SPEED");
}
return new Lucene90StoredFieldsFormat(Mode.BEST_SPEED).fieldsReader(directory, si, fn, context);
Member:

Why choose BEST_SPEED? Can we move this into a static field? This 'why' is a good comment to include on the field.
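The reviewer's suggestion might look something like the sketch below. The class and field names are hypothetical, and the rationale comment is a placeholder for whatever reason is actually settled on:

```java
// Hypothetical sketch: hoist the injected mode into documented constants
// instead of repeating string literals at the call site.
final class SegmentAttributeDefaults {
    /**
     * Attribute key checked on SegmentInfo before reading stored fields.
     */
    static final String STORED_FIELDS_MODE_KEY = "Lucene90StoredFieldsFormat.mode";

    /**
     * Mode injected when the attribute is missing from the segment. Per the
     * review feedback, the reason BEST_SPEED was chosen belongs here.
     */
    static final String DEFAULT_STORED_FIELDS_MODE = "BEST_SPEED";

    private SegmentAttributeDefaults() {}
}
```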

Collaborator Author:

I do not plan on continuing with this file.

Collaborator Author:

Hence it has not been referenced anywhere else.

@@ -29,6 +29,28 @@
"sonested": {"count": "1000"},
"nyc_taxis": {"count": "1000"}
}
empty_indices_no_taxi = {
Member:

These do not appear to be referenced?

Collaborator Author:

I will refine it. I included them only for testing purposes, as a checkpoint to see whether the test passes without nyc_taxis. I agree that many things in this PR are not clean, and I will clean them up one by one. Thank you for bringing them to my attention.


logger.info(f"Successfully disabled bloom filter on index: {index_name}. Response: {response}")

def refresh(self, cluster: Cluster, index_name: str):
Member:

Thanks for adding this. Please update check_doc_counts_match to use this refresh definition.

Collaborator Author:

Will do.
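The suggested refactor might look like the sketch below. The class layout and the cluster interface (call_api, doc_count) are assumptions for illustration; only refresh and check_doc_counts_match appear in the diff:

```python
# Sketch of reusing the shared refresh helper inside check_doc_counts_match.
# The Cluster methods used here are hypothetical stand-ins.
class ClusterOps:
    def refresh(self, cluster, index_name):
        """Force a refresh so subsequent counts see all indexed documents."""
        return cluster.call_api(f"/{index_name}/_refresh", method="POST")

    def check_doc_counts_match(self, source, target, index_name):
        # Reuse the shared refresh definition on both clusters before counting,
        # instead of duplicating the refresh call inline.
        self.refresh(source, index_name)
        self.refresh(target, index_name)
        return source.doc_count(index_name) == target.doc_count(index_name)
```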

class Test0007EndToEndTestForES8WithOSBenchmarks(MATestBase):
def __init__(self, console_config_path: str, console_link_env: Environment, unique_id: str):
allow_combinations = [
(ElasticsearchV8_X, OpensearchV2_X)
Member:

Can we avoid creating a new test and instead add this matrix value to an existing test case?
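What that might look like, sketched with placeholder version markers. Only ElasticsearchV8_X and OpensearchV2_X come from the snippet above; the existing entries and marker classes are assumptions:

```python
# Hypothetical sketch: extend an existing end-to-end test's version matrix
# rather than adding a new test class. Placeholder classes stand in for the
# real version marker types.
class ElasticsearchV7_X: pass
class ElasticsearchV8_X: pass
class OpensearchV2_X: pass

allow_combinations = [
    (ElasticsearchV7_X, OpensearchV2_X),  # assumed existing entry
    (ElasticsearchV8_X, OpensearchV2_X),  # new pair exercising zstd snapshots
]
```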

public StoredFieldsReader fieldsReader(Directory directory, SegmentInfo si, FieldInfos fn, IOContext context) throws IOException {
System.out.println(">>> Attempting to decode stored fields using ZstdStoredFields814Format for segment: " + si.name);
// TODO: Replace with real reader implementation
return new Lucene90CompressingStoredFieldsFormat(
Member:

I think importing the custom-codecs implementation would be a better path than rolling our own. It looks like Lucene912CustomStoredFieldsFormat might be the file we are looking for (link).

Here it is on Maven: https://mvnrepository.com/artifact/org.opensearch.plugin/opensearch-custom-codecs

Since we are using Lucene 9, make sure to use their 2.19.2 artifact.
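Pulling that artifact in might look like the following Gradle fragment. The configuration name and build-file placement are assumptions; the coordinates and version come from the comment above:

```groovy
dependencies {
    // opensearch-custom-codecs 2.19.2 builds against Lucene 9, matching our readers.
    implementation 'org.opensearch.plugin:opensearch-custom-codecs:2.19.2'
}
```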

Member:

@jugal-chauhan Why aren't we importing the custom codecs?

@peternied (Member):

@jugal-chauhan I'm not sure what the end goal of this pull request is. Can you update the description and describe what you want to accomplish by merging these changes?
