feat (dataset): allow zipped kwcoco files #181
In kwcoco 0.5.4, support was added to read/write data from/to compressed zipfiles. This can greatly reduce the size of kwcoco files on disk, as there is a lot of redundant information in the json files. This is especially true when the data contains polygon segmentations.
The way this works is by reading / writing the json file to a standard location inside the zipfile.
Inside kwcoco, reading and writing json or zip is implicit by default, although it can be controlled. The default rule is: if the path looks like a zipfile (via `zipfile.is_zipfile(path)`, which quickly checks for a magic number in the header of the file), then check if there is a single readable json file inside it, and load that instead. For writing with `CocoDataset.dump`, if `compress='auto'`, it will write to a zipfile if `self.fpath` ends with `.zip`.
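For example, a minimal round-trip sketch (assuming kwcoco >= 0.5.4; the exact `dump` keywords, and whether it defaults to writing `self.fpath`, may vary slightly between versions):

```python
import zipfile
import kwcoco

dset = kwcoco.CocoDataset.demo()

# Pointing fpath at a .zip path and using compress='auto' writes the
# json payload to a standard location inside a compressed zipfile.
dset.fpath = 'data.kwcoco.zip'
dset.dump(compress='auto')

# The on-disk result really is a zipfile (it has the zip magic number).
assert zipfile.is_zipfile('data.kwcoco.zip')

# Reading is implicit: the loader detects the zipfile and loads the
# single json file stored inside it.
dset2 = kwcoco.CocoDataset('data.kwcoco.zip')
```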
The change here is to allow `.zip` as a valid candidate coco file extension, and then use the `CocoDataset` loading itself to check if the file is a coco dataset. I set the `autobuild` flag to False, which prevents it from building any lookup index, so it only incurs the cost of reading and json parsing. We could be slightly more efficient here by doing something like the sketch below (reading just the raw text and then checking it for `required_keys`), but the way this is structured already has an efficiency problem: you are going to read the dataset twice, once to check and once to actually read it, so I figured it was better to use less code.
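A hypothetical version of that cheaper check (the helper names and `required_keys` values here are illustrative, not the actual code): scan the raw text, whether it comes from a plain json file or from inside the zipfile, for the expected top-level keys, skipping the json parse entirely.

```python
import zipfile

def _read_raw_text(fpath):
    """Read raw text from a plain json file or a zipped kwcoco file."""
    if zipfile.is_zipfile(fpath):
        with zipfile.ZipFile(fpath) as zfile:
            members = zfile.namelist()
            # Expect exactly one json file inside the zipfile.
            assert len(members) == 1
            return zfile.read(members[0]).decode('utf-8')
    else:
        with open(fpath, 'r') as file:
            return file.read()

def looks_like_coco_text(fpath, required_keys=('"images"', '"annotations"')):
    # Cheap heuristic: the top-level keys should appear as quoted
    # strings somewhere in the json text.
    text = _read_raw_text(fpath)
    return all(key in text for key in required_keys)
```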
It does make me think that adding a classmethod to `CocoDataset` called `is_coco_file`, which implements heuristic checks and perhaps even returns a state object that can be used to expedite later loading if it actually is a coco file, might be useful here, but that's a separate issue.
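For reference, a speculative sketch of that idea (`is_coco_file` is not part of kwcoco; this just reuses the `autobuild=False` trick from above and returns the loaded dataset as the reusable state):

```python
import kwcoco

def is_coco_file(fpath):
    """
    Heuristically check if fpath is a coco file (hypothetical API).

    Returns the loaded CocoDataset on success, which a caller could
    reuse to avoid reading the file a second time, or None on failure.
    """
    try:
        # autobuild=False skips building the lookup index, so this only
        # pays the cost of file reading and json parsing.
        return kwcoco.CocoDataset(fpath, autobuild=False)
    except Exception:
        return None
```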