feat (dataset): allow zipped kwcoco files #181

Open
wants to merge 1 commit into
base: main

Erotemic
Member

@Erotemic Erotemic commented Feb 8, 2025

In kwcoco 0.5.4, support was added for reading and writing data from and to compressed zipfiles. This can greatly reduce the size of kwcoco files on disk, because the json files contain a lot of redundant information. This is especially true when the data contains polygon segmentations.

The way this works is by reading / writing the json file to a standard location inside the zipfile.

Inside kwcoco, reading and writing to json or zip is implicit by default, although it can be controlled. The default rule for reading is: if the path looks like a zipfile (via zipfile.is_zipfile(path), which quickly checks the file for the zip magic number), then check whether there is a single readable json file inside it, and load that instead. For writing with CocoDataset.dump, if compress='auto', it will write to a zipfile if self.fpath ends with .zip.
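The read and write rules described above can be sketched with only the standard library. This is a rough illustration, not kwcoco's actual implementation: the helper names `load_json_or_zip` and `should_compress` are hypothetical, and the single-member check is an assumption based on the description.

```python
import json
import zipfile


def load_json_or_zip(path):
    """Load json from a plain file, or from the single member of a zipfile.

    Hypothetical helper that mimics the implicit read rule described above.
    """
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path, 'r') as zfile:
            members = zfile.namelist()
            if len(members) != 1:
                raise ValueError('expected exactly 1 member in the zipfile')
            text = zfile.read(members[0]).decode('utf8')
        return json.loads(text)
    # Not a zipfile: treat the path as a regular json file.
    with open(path, 'r', encoding='utf8') as file:
        return json.load(file)


def should_compress(fpath, compress='auto'):
    """Mimic the write rule: with 'auto', compress only if the path ends with .zip."""
    if compress == 'auto':
        return str(fpath).endswith('.zip')
    return bool(compress)
```

With these helpers, a caller never has to say which format a path uses: the reader inspects the file contents, while the writer decides from the file name.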

The change here is to allow .zip as a valid candidate coco file extension, and then use the CocoDataset loading itself to check if the file is a coco dataset. I set the autobuild flag to False, which prevents it from building any lookup index, so it only incurs the cost of reading and json parsing. We could be slightly more efficient here by doing something like:

    import zipfile

    # fpath is the candidate coco file being checked
    with open(fpath, 'rb') as file:
        with zipfile.ZipFile(file, 'r') as zfile:
            members = zfile.namelist()
            if len(members) != 1:
                raise Exception(
                    'Currently only zipfiles with exactly 1 '
                    'kwcoco member are supported')
            # Decode the single member so its json text can be inspected
            text = zfile.read(members[0]).decode('utf8')

and then checking the text for required_keys. However, the way this is structured already has an efficiency problem: the dataset is read twice, once to check it and once to actually load it, so I opted for less code.

This does make me think it might be useful to add a classmethod to CocoDataset called is_coco_file that implements heuristic checks, and perhaps even returns a state object that can be used to expedite later loading if the file actually is a coco file, but that's a separate issue.
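As a rough sketch of what such a check could look like, the standalone function below returns a small state object that keeps the parsed json, so a later load would not need to read the file a second time. Everything here is hypothetical: `CocoFileCheck`, the standalone `is_coco_file` function (rather than a classmethod on CocoDataset), and the placeholder `REQUIRED_KEYS` are assumptions for illustration, not existing kwcoco API.

```python
import json
import zipfile
from dataclasses import dataclass
from typing import Optional

# Assumption: placeholder only; the real required keys are defined by the project.
REQUIRED_KEYS = ('images',)


@dataclass
class CocoFileCheck:
    """Hypothetical state object returned by the heuristic check."""
    fpath: str
    is_coco: bool
    parsed: Optional[dict] = None  # parsed json, reusable by a later load


def is_coco_file(fpath, required_keys=REQUIRED_KEYS):
    """Heuristically check if fpath is a (possibly zipped) coco json file."""
    try:
        if zipfile.is_zipfile(fpath):
            with zipfile.ZipFile(fpath, 'r') as zfile:
                members = zfile.namelist()
                if len(members) != 1:
                    return CocoFileCheck(str(fpath), False)
                text = zfile.read(members[0]).decode('utf8')
        else:
            with open(fpath, 'r', encoding='utf8') as file:
                text = file.read()
        data = json.loads(text)
    except (OSError, ValueError, zipfile.BadZipFile):
        # Unreadable, undecodable, or non-json content is not a coco file.
        return CocoFileCheck(str(fpath), False)
    ok = isinstance(data, dict) and all(k in data for k in required_keys)
    return CocoFileCheck(str(fpath), ok, data if ok else None)
```

A caller could then hand `check.parsed` to the loader when `check.is_coco` is true, which would avoid the double read mentioned above.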

@PaulHax PaulHax requested a review from alesgenova February 11, 2025 16:05