feat (dataset): allow zipped kwcoco files #181
In kwcoco 0.5.4, support was added to read/write data from/to compressed zipfiles. This can greatly reduce the size of kwcoco files on disk, as there is a lot of redundant information in the json files. This is especially true when the data contains polygon segmentations.
The way this works is by reading / writing the json file to a standard location inside the zipfile.
Inside kwcoco, reading and writing json or zip is implicit by default, although it can be controlled. The default rule is: if the path looks like a zipfile (via `zipfile.is_zipfile(path)`, which quickly checks for a magic number in the header of the file), then check if there is a single readable json file inside it, and load that instead. For writing with `CocoDataset.dump`, if `compress='auto'`, it will write to a zipfile if `self.fpath` ends with `.zip`.
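For example, a minimal round-trip sketch (assuming kwcoco >= 0.5.4; the exact `dump` keywords, and whether it defaults to writing `self.fpath`, may vary slightly between versions):

```python
import zipfile
import kwcoco

dset = kwcoco.CocoDataset.demo()

# Pointing fpath at a .zip path and using compress='auto' writes the
# json payload to a standard location inside a compressed zipfile.
dset.fpath = 'data.kwcoco.zip'
dset.dump(compress='auto')

# The on-disk result really is a zipfile (it has the zip magic number).
assert zipfile.is_zipfile('data.kwcoco.zip')

# Reading is implicit: the loader detects the zipfile and loads the
# single json file stored inside it.
dset2 = kwcoco.CocoDataset('data.kwcoco.zip')
```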
The change here is to allow `.zip` as a valid candidate coco file extension, and then use the `CocoDataset` loading itself to check if the file is a coco dataset. I set the `autobuild` flag to False, which prevents it from building any lookup index, so it only incurs the cost of reading and json parsing. We could be slightly more efficient here by doing something like the sketch below (reading just the raw text and then checking it for `required_keys`), but the way this is structured already has an efficiency problem: you are going to read the dataset twice, once to check and once to actually read it, so I figured it was better to use less code.
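A hypothetical version of that cheaper check (the helper names and `required_keys` values here are illustrative, not the actual code): scan the raw text, whether it comes from a plain json file or from inside the zipfile, for the expected top-level keys, skipping the json parse entirely.

```python
import zipfile

def _read_raw_text(fpath):
    """Read raw text from a plain json file or a zipped kwcoco file."""
    if zipfile.is_zipfile(fpath):
        with zipfile.ZipFile(fpath) as zfile:
            members = zfile.namelist()
            # Expect exactly one json file inside the zipfile.
            assert len(members) == 1
            return zfile.read(members[0]).decode('utf-8')
    else:
        with open(fpath, 'r') as file:
            return file.read()

def looks_like_coco_text(fpath, required_keys=('"images"', '"annotations"')):
    # Cheap heuristic: the top-level keys should appear as quoted
    # strings somewhere in the json text.
    text = _read_raw_text(fpath)
    return all(key in text for key in required_keys)
```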
It does make me think that adding a classmethod to `CocoDataset` called `is_coco_file`, which implements heuristic checks and perhaps even returns a state object that can be used to expedite later loading if it actually is a coco file, might be useful here, but that's a separate issue.
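For reference, a speculative sketch of that idea (`is_coco_file` is not part of kwcoco; this just reuses the `autobuild=False` trick from above and returns the loaded dataset as the reusable state):

```python
import kwcoco

def is_coco_file(fpath):
    """
    Heuristically check if fpath is a coco file (hypothetical API).

    Returns the loaded CocoDataset on success, which a caller could
    reuse to avoid reading the file a second time, or None on failure.
    """
    try:
        # autobuild=False skips building the lookup index, so this only
        # pays the cost of file reading and json parsing.
        return kwcoco.CocoDataset(fpath, autobuild=False)
    except Exception:
        return None
```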