Add better zip comparison functionality #471

Open
Tracked by #418
zschira opened this issue Nov 4, 2024 · 0 comments


Background

#399 tracked fixes to the FERC XBRL archivers, which have been causing automated archive runs to mark XBRL resources as changed even when nothing has changed. Some genuine fixes resolved issues with partitions and taxonomies changing, but every new XBRL archive is still being flagged as changed. When nothing has substantively changed, the new files are exactly the same size as the previous version, so it's still fairly easy to see that there are no real changes and discard the new drafts.

Current state

To identify the source of the issue, I used the tools zipcmp and zipinfo, which detected no changes to the underlying files or metadata. I also tried sorting files by filename before adding them to the zipfile, which made no difference. There's likely something I'm missing that could fix this and make the zipfiles byte-identical, but this issue also highlights the fragility of directly comparing zipfile hashes: a hash comparison will always flag differences we don't necessarily care about, such as headers, compression level, and file order.
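To illustrate the general fragility (not to diagnose the archiver, since zipinfo reportedly showed no metadata differences), here is a minimal sketch using only the Python standard library. The helper `build_zip` and the file names are hypothetical. It builds two archives with identical member contents that differ only in the stored member timestamp, and shows that the archives have the same size but different hashes.

```python
import hashlib
import io
import zipfile


def build_zip(files: dict[str, bytes], date_time: tuple) -> bytes:
    """Build an in-memory zip, stamping every member with `date_time`.

    Hypothetical helper for illustration; not part of the archiver.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(files):
            info = zipfile.ZipInfo(filename=name, date_time=date_time)
            info.compress_type = zipfile.ZIP_DEFLATED
            archive.writestr(info, files[name])
    return buffer.getvalue()


# Made-up contents standing in for XBRL filings.
files = {"a.xbrl": b"<xbrl>same</xbrl>", "b.xbrl": b"<xbrl>data</xbrl>"}
old = build_zip(files, (2024, 11, 1, 0, 0, 0))
new = build_zip(files, (2024, 11, 4, 0, 0, 0))

# Timestamps live in fixed-width header fields, so the size is unchanged,
# but the bytes (and therefore the archive hash) differ.
same_size = len(old) == len(new)
same_hash = hashlib.sha256(old).hexdigest() == hashlib.sha256(new).hexdigest()
print(same_size, same_hash)
```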

Next steps

A better comparison would be to look inside the zipfiles and compare their contents, which would also give us more insight into what specifically has changed between versions. This would not be too difficult to implement, since we could reuse the existing file comparison tooling in the validate.py module. The catch is that we don't have the previous version of the zips available during comparison.
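A content-level comparison could look roughly like the sketch below, assuming both archives are available as bytes. The names `member_hashes`, `diff_zips`, and `make_zip` are hypothetical and are not the existing validate.py API; the sketch hashes each member's decompressed bytes, so headers, compression level, and member order do not affect the result.

```python
import hashlib
import io
import zipfile


def member_hashes(zip_bytes: bytes) -> dict[str, str]:
    """Map each member name to the SHA-256 of its decompressed contents."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            info.filename: hashlib.sha256(archive.read(info)).hexdigest()
            for info in archive.infolist()
            if not info.is_dir()
        }


def diff_zips(old_bytes: bytes, new_bytes: bytes) -> dict[str, list[str]]:
    """Report which members were added, removed, or changed."""
    old, new = member_hashes(old_bytes), member_hashes(new_bytes)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }


def make_zip(files: dict[str, bytes]) -> bytes:
    """Hypothetical helper that builds an in-memory zip for the example."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


old_zip = make_zip({"a.xbrl": b"1", "b.xbrl": b"2"})
new_zip = make_zip({"c.xbrl": b"4", "b.xbrl": b"3", "a.xbrl": b"1"})
result = diff_zips(old_zip, new_zip)
print(result)
```

Because the result names the affected members, it also gives the per-file insight into what changed that a whole-archive hash cannot.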

There are two possible implementations that come to mind to work around this problem:

  1. Just download the previous zipfiles during comparison. We would probably want to move the comparison to right after the new files are downloaded, otherwise we would have to download both the old and new versions.
  2. Attach metadata containing the hash of every file within a zipfile, so we can compare the metadata instead of the archives themselves.
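The second option could be sketched as follows. This is only an illustration under assumptions: the manifest format (a JSON object mapping member names to SHA-256 digests) and the names `build_manifest` and `changed_members` are hypothetical, and where the manifest would be stored alongside the archive is left open.

```python
import hashlib
import io
import json
import zipfile


def build_manifest(zip_bytes: bytes) -> str:
    """Serialize member-name -> SHA-256 of decompressed contents as JSON."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        hashes = {
            info.filename: hashlib.sha256(archive.read(info)).hexdigest()
            for info in archive.infolist()
            if not info.is_dir()
        }
    # sort_keys makes the manifest text itself stable across runs.
    return json.dumps(hashes, sort_keys=True)


def changed_members(old_manifest: str, new_manifest: str) -> list[str]:
    """List members that were added, removed, or whose contents differ."""
    old, new = json.loads(old_manifest), json.loads(new_manifest)
    return sorted(n for n in old.keys() | new.keys() if old.get(n) != new.get(n))


def make_zip(files: dict[str, bytes]) -> bytes:
    """Hypothetical helper that builds an in-memory zip for the example."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


# The old manifest would be read from stored metadata, not rebuilt from the old zip.
old_manifest = build_manifest(make_zip({"a.xbrl": b"1", "b.xbrl": b"2"}))
new_manifest = build_manifest(make_zip({"b.xbrl": b"2", "a.xbrl": b"1"}))
print(changed_members(old_manifest, new_manifest))
```

With this approach only the small manifest from the previous version needs to be fetched, at the cost of having to generate and store it whenever an archive is created.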