
Compression plugin #18314

Open · wants to merge 28 commits into base: develop2

Conversation

@perseoGI (Contributor) commented May 16, 2025

Changelog: Feature: Allow externalization of compression/decompression tasks to a user defined plugin
Docs: conan-io/docs#4107

Close #18259
Close #5209
Close #18185


For simplicity, this PR keeps the current behaviour in some compression/decompression tasks, such as:

  • Cache save/restore
  • File uploads

It adds an extra layer in charge of determining whether the compression.py plugin is defined, falling back to the existing behaviour if it is not.

The compression.py plugin interface follows this structure:

def tar_extract(archive_path, dest_dir, conf=None, *args, **kwargs) -> None:
    pass
def tar_compress(archive_path, files, recursive, conf=None, ref=None, *args, **kwargs) -> None:
    pass

Plugin example with zstd:

# File: compression.py

import os
import tarfile

import zstandard

from conan.api.output import ConanOutput

# zstd compression
# Original author https://github.com/conan-io/conan/pull/14706
def tar_extract(archive_path, dest_dir, conf=None, *args, **kwargs):
    dctx = zstandard.ZstdDecompressor()
    ConanOutput().info(f"Decompressing {os.path.basename(archive_path)} with compression plugin (ZSTD)")
    with open(archive_path, "rb") as tarfile_obj:
        with dctx.stream_reader(tarfile_obj) as stream_reader:
            # The choice of bufsize=32768 comes from profiling decompression at various
            # values and finding that bufsize value consistently performs well.
            with tarfile.open(
                fileobj=stream_reader, bufsize=32768, mode="r|"
            ) as the_tar:
                the_tar.extractall(
                    path=dest_dir, filter=lambda tarinfo, _: tarinfo
                )


def tar_compress(archive_path, files, recursive, conf=None, ref=None, *args, **kwargs):
    ConanOutput(scope=str(ref or "")).info(
        f"Compressing {os.path.basename(archive_path)} with compression plugin (ZSTD)"
    )
    compresslevel = conf.get("user.zstd:compresslevel", check_type=int) if conf else None
    with open(archive_path, "wb") as tarfile_obj:
        # Only provide level if it was overridden by config.
        zstd_kwargs = {}
        if compresslevel is not None:
            zstd_kwargs["level"] = compresslevel

        cctx = zstandard.ZstdCompressor(write_checksum=True, threads=-1, **zstd_kwargs)

        # Create a zstd stream writer so tarfile writes uncompressed data to
        # the zstd stream writer, which in turn writes compressed data to the
        # output tar.zst file.
        with cctx.stream_writer(tarfile_obj) as stream_writer:
            # The choice of bufsize=32768 comes from profiling compression at various
            # values and finding that bufsize value consistently performs well.
            # The variance in compression times at bufsize<=64KB is small. It is only
            # when bufsize>=128KB that compression times start increasing.
            with tarfile.open(
                mode="w|",
                fileobj=stream_writer,
                bufsize=32768,
                format=tarfile.PAX_FORMAT,
            ) as tar:
                current_frame_bytes = 0
                for filename, abs_path in sorted(files.items()):
                    tar.add(abs_path, filename, recursive=recursive)

                    # Flush the current frame if it has reached a large enough size.
                    # There is no required size, but 128MB is a good starting point
                    # because it allows for faster random access to the file.
                    current_frame_bytes += os.path.getsize(abs_path)
                    if current_frame_bytes >= 134217728:
                        stream_writer.flush(zstandard.FLUSH_FRAME)
                        current_frame_bytes = 0
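
As a usage sketch: the plugin file above would live wherever this PR's compression_plugin_path property points (the exact location below is an assumption, modeled on where other Conan plugins such as profile.py live), and the compression level can then be tuned through the conf key the example reads:

# Assumed layout, for illustration only:
#   <conan_home>/extensions/plugins/compression.py   <- the compression.py above
# Optional tuning via global.conf, using the key read by tar_compress:
#   user.zstd:compresslevel=19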

memsharded added a commit that referenced this pull request May 20, 2025
…xisting function (#18320)

Changelog: omit
Docs: omit

As suggested in
#18314 (comment)

This PR will simplify #18314 by
modifying the compression functionality of the cache save command.
Instead of creating a tar object and adding files one by one (an interface
tightly coupled to the tarfile module), it creates a list of files which is
passed to an already existing Conan method, compress_files.

This method may be moved to another module in a future PR.
@memsharded (Member) left a comment

Looking much better, easier to assess that the risk is low.
Just a small concern about the pkglist.json of conan cache restore.

@perseoGI perseoGI marked this pull request as ready for review May 23, 2025 09:45
@memsharded (Member) left a comment

It is looking good.

Before proceeding, I'd like to see a PoC of using zstd, and how the possible fallbacks and error handling are done when mixing conan_package.tgz packages compressed with gz and with zstd.

@memsharded memsharded self-assigned this May 23, 2025
@AbrilRBS (Member) left a comment

A comment about not re-checking the file, otherwise this looks good :)

@memsharded memsharded added this to the 2.18.0 milestone May 26, 2025
@czoido czoido requested a review from Copilot May 26, 2025 09:18
@Copilot (Copilot AI) left a comment

Pull Request Overview

Introduces a plugin hook for compression/decompression by loading a user-supplied compression.py and delegating to it when present, with fallback to the builtin behavior.

  • Loaded compression.py via a new config.compression_plugin property and propagated it through all compress/uncompress calls.
  • Updated compress_files and uncompress_file to invoke the plugin if available.
  • Added integration tests for plugin loading, fallback behavior, and default output.
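
A minimal sketch of what such a delegation layer could look like (the loader below and the simplified compress_files are assumptions based on this overview, not the PR's exact code):

import importlib.util
import os
import tarfile

def load_compression_plugin(plugin_path):
    # Hypothetical loader: import compression.py if it exists, otherwise return
    # None so that callers fall back to the builtin tgz behaviour.
    if not os.path.exists(plugin_path):
        return None
    spec = importlib.util.spec_from_file_location("compression", plugin_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

def compress_files(files, archive_path, plugin=None, conf=None):
    # files maps archive names to absolute paths, as in the plugin interface.
    if plugin is not None:
        plugin.tar_compress(archive_path, files, recursive=False, conf=conf)
        return
    with tarfile.open(archive_path, "w:gz") as tar:  # builtin fallback
        for name, abs_path in sorted(files.items()):
            tar.add(abs_path, name)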

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

File / Description:

  • test/integration/extensions/test_compression_plugin.py: added tests for plugin validity and end-to-end usage
  • test/integration/command/cache/test_cache_save_restore.py: asserted default compress/decompress outputs when no plugin
  • conan/internal/rest/rest_client_local_recipe_index.py: passed config_api into LocalRecipesIndexApp
  • conan/internal/rest/remote_manager.py: propagated compression_plugin into uncompress_file
  • conan/internal/cache/home_paths.py: added compression_plugin_path property
  • conan/internal/api/uploader.py: forwarded compression_plugin to compress_files
  • conan/internal/conan_app.py: injected config_api into RemoteManager and LocalRecipesIndexApp
  • conan/api/subapi/config.py: implemented compression_plugin loader property
  • conan/api/subapi/cache.py: integrated plugin hook into cache save/restore
Comments suppressed due to low confidence (1)

conan/api/subapi/config.py:248

  • The code references os.path but there’s no import os at the top of this module. Add import os to avoid a NameError.
if not os.path.exists(compression_plugin_path):

@@ -158,9 +158,13 @@ def test_cache_save_excluded_folders():

# exclude source
c.run("cache save * --no-source")
# Check default compression function is being used and not compression.py plugin one
assert "Compressing conan_cache_save.tgz\n" in c.out
@Copilot (Copilot AI) commented May 26, 2025

[nitpick] Instead of matching the newline, assert on "Compressing conan_cache_save.tgz" in c.out to make the test more robust across platforms and output formats.

Suggested change
assert "Compressing conan_cache_save.tgz\n" in c.out
assert "Compressing conan_cache_save.tgz" in c.out


@perseoGI perseoGI changed the title [POC] Compression plugin Compression plugin May 27, 2025
@AbrilRBS (Member) left a comment

I would like to discuss the fact that any enterprise adding this plugin would not be able to use their old packaged libraries as-is.

I think the implementation should be robust enough that when the plugin is activated and the file to decompress was not compressed by the plugin, but by vanilla Conan, the workflow would then still be valid. This way we would avoid forcing users to recompile the world just to change the compression of their packages, which I think would kill the feature for any already-existing Conan user.

Note that this does NOT mean the inverse has to be true. I think it's ok for packages that have been compressed by the plugin to only be decompressable with the plugin too; anyone willing to use the plugin at scale more than likely already has the infra set up to ensure everyone uses the plugin.

@perseoGI (Contributor, Author)

@AbrilRBS

the plugin is activated and the file to decompress was not compressed by the plugin, but by vanilla Conan, the workflow would then still be valid

Yes, that is a great point!
This can be easily addressed by using a constant filename to pass to the compression plugin, so that it creates a tar with that name (file extension apart).
Having a constant filename has two benefits:

  1. Avoiding name conflicts when the compression plugin uses the same extension Conan natively uses (.tgz)
  2. Being able to detect while decompressing, even when we do not have the plugin enabled, whether the decompressed tarfile contains our "constant filename" (in that case it should be passed to the compression plugin) or not (a plain Conan tarfile)

packages that have been compressed by the plugin to only be able to be decompressed with the plugin

Of course, if the plugin is not enabled and Conan finds our "constant filename" while decompressing, it will raise an error, as the client wouldn't know how to decompress that file.
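
A rough sketch of that detection step during restore (the constant name is taken from the cache-save example later in this thread; the helper itself is an illustration, not the PR's code):

from pathlib import Path

# Constant name as shown in the example further below in this conversation.
COMPRESSED_PLUGIN_TAR_NAME = "__conan_plugin_compressed_contents__"

def find_plugin_payload(extracted_dir):
    # After extracting the outer conan_cache_save.tgz wrapper, look for the
    # constant-named file the plugin produced (whatever extension it chose).
    # None means a plain vanilla Conan archive; a match found while the plugin
    # is not enabled should raise an error, as described above.
    matches = list(Path(extracted_dir).glob(f"{COMPRESSED_PLUGIN_TAR_NAME}.*"))
    return matches[0] if matches else None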

@perseoGI perseoGI force-pushed the pgi/plugin/compression branch from 6fab859 to f986651 Compare June 23, 2025 15:51
@memsharded (Member) left a comment

I think there are some minor interface/UI/UX issues, but overall this approach seems much better than the previous iteration.


def _tar_extract_with_plugin(fileobj, destination_dir, compression_plugin, conf):
"""First remove tar.gz wrapper and then call the plugin to extract"""
with tempfile.TemporaryDirectory() as temp_dir:
@memsharded (Member):

Conan doesn't really use system TemporaryDirectories for anything; it tries to do everything in the Conan cache, in cache folders, so if something goes wrong the files are there, and there is no "leakage" of packages, recipes or code anywhere beside the Conan cache, which can be easily wiped.

@perseoGI (Contributor, Author):

Okay, I didn't like the idea of messing up the Conan cache. Do we have a temporary folder in our Conan cache?
Also, there are some other usages of temporaries:

pkglist_path = os.path.join(tempfile.gettempdir(), "pkglist.json")

(do not look at the blame, it was me!) but this case is different, as it is only a pkglist extraction + removal.

I could also move it to the Conan cache, and maybe consider creating a tmp folder in it?

@memsharded (Member):

Yes, I understand the concern.
Let me think about it.

@memsharded memsharded modified the milestones: 2.18.0, 2.19.0 Jun 26, 2025
# Check if the tar was compressed with the compression plugin by checking the existence of
# our constant COMPRESSED_PLUGIN_TAR_NAME (without extension as extension is added by the plugin)
if list(Path(temp_dir).glob(f"{COMPRESSED_PLUGIN_TAR_NAME}.*")):
# Get the only extracted file: the plugin tar
@mrjoel (Contributor):

In reference to #18259, we'd eventually like to have our own additional files added to the top-level. Don't assume here that it's the only file. I don't think this requires anything other than clarifying the comment.

Suggested change
# Get the only extracted file: the plugin tar
# Extract the actual contents from the plugin tar (ignore other files present).

@perseoGI (Contributor, Author):

Greetings @mrjoel,
I understand your concern.
Maybe from the perspective of the git diff it is not clear enough.

With this incoming compression plugin you can do whatever you need with the files.
You just have to implement the tar_compress and tar_extract functions in the compression.py file.
In those functions, you can add whatever metadata you need.

The thing is that, to avoid issues with the compressed file extension (which could be whatever the user decides), we have decided to wrap that compressed result file into a constant conan_cache_save.tgz file name.
This wrapping is done with compress level 0 and adds negligible overhead.

Here's an example.

Without the plugin:

$ conan cache save "zlib/1.3.1:*" 
$ ls -l 
.rw-r--r--@ 1,7M perseo 11 jul 16:19 conan_cache_save.tgz
$ tar xzf conan_cache_save.tgz
drwxr-xr-x@    - perseo 11 jul 16:20 b
.rw-r--r--@ 1,7M perseo 11 jul 16:19 conan_cache_save.tgz
.rw-r--r--@  949 perseo 11 jul 16:19 pkglist.json
drwxr-xr-x@    - perseo 11 jul 16:20 zlib204752602052d

With the plugin, using zstandard:

$ conan cache save "zlib/1.3.1:*" 
$ ls -l 
.rw-r--r--@ 1,7M perseo 11 jul 16:19 conan_cache_save.tgz
$ tar xzf conan_cache_save.tgz
.rw-r--r--@ 1,6M perseo 11 jul 16:21 __conan_plugin_compressed_contents__.zst
.rw-r--r--@ 1,6M perseo 11 jul 16:21 conan_cache_save.tgz
$ 

As you can see, when decompressing the conan_cache_save.tgz file, we will always get the output file of your tar_compress function.
And that file is the one which will be passed to the tar_extract function of your plugin.
In that compressed file, you can have whatever structure and metadata you need. Your tar_extract can work with that metadata, as long as it always decompresses the files in the way Conan expects.

Answering your suggestion: there will always be one file, the unwrapped compressed file from the plugin!
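
A minimal sketch of that constant-name wrapping step (an illustration of the description above, assuming a single plugin output file; not the PR's exact code):

import os
import tarfile

def wrap_plugin_output(plugin_file, wrapper_path="conan_cache_save.tgz"):
    # Wrap the plugin's output (whatever extension it chose) inside a tgz with
    # a constant name, at gzip level 0 so the overhead stays negligible.
    with tarfile.open(wrapper_path, "w:gz", compresslevel=0) as tar:
        tar.add(plugin_file, arcname=os.path.basename(plugin_file))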

@mrjoel (Contributor):

@perseoGI Yep, I'm tracking all of that from the diff.

My main concern was just including the assumption and comment that the outer wrapper would only contain a single file. For #18259 we'd really like it to end up being the following, where we add the custom additional files. We don't expect Conan to know or do anything about it, but do want it to not cause issues.

I'm slightly sad that custom files in the direct wrapper won't be compressed at all, but so be it, we mainly expect files of a few hundred KBs (the HTML graph reports).

$ conan cache save "zlib/1.3.1:*" 
$ ls -l 
.rw-r--r--@ 1,7M perseo 11 jul 16:19 conan_cache_save.tgz
...
# Other things happen externally to add files directly to conan_cache_save.tgz
...
$ tar xzf conan_cache_save.tgz
.rw-r--r--@ 1,6M perseo 11 jul 16:21 __conan_plugin_compressed_contents__.zst
.rw-r--r--@ xxxx perseo 11 jul 16:21 README.txt
.rw-r--r--@ xxxx perseo 11 jul 16:21 Cache-contents-graph-report.html
.rw-r--r--@ xxxx perseo 11 jul 16:21 Some-other-externally-embedded-file.py
.rw-r--r--@ xxxx perseo 11 jul 16:21 Some-other-externally-embedded-file.exe
.rw-r--r--@ 1,6M perseo 11 jul 16:21 conan_cache_save.tgz
$ 

As a separate question, why gzip -0 the generated archive at all instead of just having it be a plain native .tar file? That would certainly make it easier for us to tar -rvf conan_cache_save.tar file1.txt file2.txt file3.html instead of having to ungzip and regzip a level 0 compression layer.

@mrjoel (Contributor) commented Jul 11, 2025

Further, this approach of requiring the compression plugin to be enabled in a user's Conan home means that it is considered for all projects they may be working on. We'd like to be able to provide our Conan package archives in a non-interfering way, not requiring users of our archives to use our compression plugin for everything.

Would you consider an approach where an embedded compression.py (or __conan_plugin_compression_extension__.py or some variant) file at the archive level would be used if present? It would then be used only for that archive, instead of needing to be installed in a user's Conan home.

Comment on lines 300 to 303
# The tar was not compressed using the plugin, copy files to destination
from conan.tools.files import copy
ConanOutput().debug(f"Extracted in {time.time() - t1} time built in")
copy(None, pattern="*", src=temp_dir, dst=destination_dir)
@mrjoel (Contributor) commented Jul 10, 2025

The contents will already be unpacked here and will just be deleted on cleanup of temp_dir. Instead of copying, would you consider moving them directly? That would seem to reduce the I/O required, skip the double disk-space usage, and likely result in much less I/O overhead from antivirus scanning on Windows.

Also, since I/O may be an appreciable contribution to timing, consider putting the time tracing after the copy/move step.
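
A minimal sketch of the move-based variant suggested here (an illustration under the assumption that the extracted entries can simply be relocated to the destination; not the PR's code):

import os
import shutil

def move_contents(temp_dir, destination_dir):
    # Move the already-extracted entries instead of copying them, avoiding a
    # second full write of the data and the double disk-space requirement.
    os.makedirs(destination_dir, exist_ok=True)
    for entry in os.listdir(temp_dir):
        shutil.move(os.path.join(temp_dir, entry),
                    os.path.join(destination_dir, entry))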

  • Add new test case to check those files are correctly pruned after extraction
  • Improve performance by extracting in the actual destination folder even when the plugin is enabled
Successfully merging this pull request may close these issues.

  • [feature] Use subdirectory for conan cache save content
  • [feature] Multithreaded conan cache save
  • Improve Conan untar speed