Support for optional inclusion of filesize in hash files #65
Comments
I, too, have thought that including the file size in a signature/checksum file would provide an early exit when the length doesn't match, but haven't been able to find a hash utility or file format that captures that metadata. I know that Wireshark signature files, for example, include the file lengths...
...and the format of the hash lines is consistent with the output of running
As for this proposal, are you asking that HashCheck produce hash files that include file length information, or just that it be smart enough to read and verify such information, if present? The former would break compatibility with other checksum verification utilities. |
I've seen at least one system (CallidusCloud, now SAP Commissions) which wanted filesize in its hashfile checks, so it's not unheard of. For compatibility purposes, adding |
LanceUMatthews: Allow me to preface by observing that every hash-checking program mentions the simple caveat that there is NO standardized, agreed-upon format. And I hope that there never will be.
That said, yes, I want HashCheck to both write and read hash files that contain the filesize parameter. The user can choose this as an option. Heck, allow the user to specify their own user-defined format(s) to read and write if you want to get super advanced.
The file formats are easy to convert with a simple script if a future librarian needs to change things to make them compatible with the software being used in 2045. But conversion can only happen IF the filesize exists in the file. You can't convert something that is utterly missing.
HashCheck is the only software that recognizes a *.sha512 file extension. And all of these extensions, including .sfv and .md5, are arbitrary, whimsical formats. Let's not worry about what other people are doing, and become a leader. |
On a side note, I'm fleshing out design of arbitrary metadata files to make my work as an archivist easier, adopting json, xml, yaml, quoted and unquoted csv, tsv, ssv as all valid means of reading and writing, importing and exporting metadata. This will allow me to draw from multiple sources like hash files, ffprobe output, webpage scraped data, etc, into flat / structured files for manipulation and tracking changes to files and folders, comparing duplicates, revisions, etc. Any simple file format that somebody uses or invents can easily be scripted around and imported / exported. |
I raised the issue of compatibility because having HashCheck change a known format and give it the same extension, as you've specified, would certainly make those files unreadable to all the programs out there that already handle that format/extension. Perhaps some people only use HashCheck in isolation, in which case it could use whatever format it wants, but one of the useful things about it is that, like any competent hashing utility, it reads and writes hash files in a format typical of and compatible with other hashing utilities. This isn't about software written in 2045; it's about compatibility with software that exists right now.
Also, I disagree that hash files are "arbitrary" formats. Yes, they may have been arbitrary when someone invented them, but they're well-established now. How often do you encounter
If you're going to use a different, incompatible format then my point is that it should have a different name (extension), too. (Although, at that point you might be better off creating something entirely new without the limitations of an
By the way, I notice that HashCheck as well as GNU
This would have the following benefits for compatibility:
|
Using comment fields for data storage is a dark path. I second (or third by now) the suggestion made earlier that the file size be included by a flag, but I go one step further. The new flag should not just be for the size; it should be for the format of the file. Using a structured format of any kind (json, xml, etc.) with named values would future-proof the file. Instead of PacketSenderPortable.sha256 with
We'd have PacketSenderPortable.checksum with
Then adding any new fields will break nothing. And what they are will be obvious without having to reference a document to figure out what order the fields are in and which computation methods were or were not included. Just my 2¢ |
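To illustrate the "named values" idea above, here is a minimal sketch of what a self-describing entry could look like. The sample content from the comment is not available, so every field name here (`path`, `size`, `algorithm`, `hash`) and the top-level `files` key are hypothetical and not a format proposed in this thread or implemented by HashCheck.

```python
import hashlib
import json


def make_record(name: str, data: bytes) -> dict:
    """Build one self-describing entry; all field names are hypothetical."""
    return {
        "path": name,
        "size": len(data),
        "algorithm": "sha256",
        "hash": hashlib.sha256(data).hexdigest(),
    }


# Serialize a one-entry checksum file as JSON text.
record = make_record("example.bin", b"hello")
text = json.dumps({"files": [record]}, indent=2)

# A reader that only cares about "path" and "hash" can ignore the other keys,
# so adding a new named field later does not change how older readers parse it.
loaded = json.loads(text)["files"][0]
```

The trade-off raised elsewhere in the thread still applies: a structured file like this is harder to process with line-oriented scripts and spreadsheets than a one-line-per-item format.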
I'll throw out there that the STANDARD format for hash files is just a single line containing nothing except the hex hash value of the file that was downloaded. This is true for the md5, sha1, and sha256 files that you might find accompanying a file on a download page, with the filename of the hash file matching that of the download file. ALL OTHER variations to this standard are deviations, including the format HashCheck is using right now. Software can only reasonably expect the hash value, only one hash value, and no file size, and no filename or path.
One of the drawbacks of using structured files is their complexity for simple script parsing. The one-line-per-item format is a lot easier to manipulate in scripts and spreadsheets. I'm not specifically asking for the elimination of SSV, just as I believe against the sanctification of the current
Allow the user to know best and pursue their needs. I agree that using comment lines to convey data is a dark path.
We are talking about 2045, not just 1995 or 2005 or 2015. People are excellent at adopting and adapting to new "standards" when they enter circulation. If somebody is using 20-year-old hash software that can't read HashCheck-generated user-defined files, or refuses to, then the user can decide to upgrade their 20-year-old software, write an interpreter / conversion script for themselves, or email the author of the competing software. |
Embedding extra information in comments is certainly not ideal, but given the lack of extensibility in the pervasive As I said and @ThermoMan describes, an entirely new format that allows that flexibility would be better still than either of those options. At that point I don't think a flag is necessary but just another drop-down type to select in the XML or JSON are what I had in mind, too, although to lighten the parsing requirements it could be CSV...
...(which would have some redundant
Another attribute to work in there, whatever the containing format, might be the specifier for binary vs. text mode hashes, either per-file (e.g. |
@LanceUMatthews what do you consider to be a text-mode hash? Excluding all \r (0x0D) and \n (0x0A) characters from the data stream being hashed? Or just excluding \r characters? I don't know this to be a hashing mechanism used in the wild.
CSV, TSV, and SSV, quoted and unquoted, each have equal merits. Which one to use should be a setting the user can define. SSV, space-separated values, is what HashCheck currently uses. It can probably interpret TSV, but I haven't tested. |
My understanding is text mode ignores any newline characters encountered during hashing. Or maybe it normalizes them to a particular scheme? I've never had a need for it so I'm not sure. I could...possibly...see it being used for code or configuration files that are shared across platforms. Someone must have had a use for it at some point such that it's part of the other hash file format so it would make sense to somehow support it. Maybe such an option should make it clear what it's doing, with
I quoted the values in my sample CSV because otherwise there'd be no way to determine (without some extra logic or finding a delimiter that's an illegal character on all filesystems) where the path ends (with |
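Since the thread is unsure what a text-mode hash would mean, the sketch below shows only one possible interpretation: normalizing CRLF line endings to LF before hashing. This definition is an assumption made for illustration; it is not confirmed as the behavior of HashCheck or of any other tool, and the function names are hypothetical.

```python
import hashlib


def text_mode_sha256(data: bytes) -> str:
    """Hash after normalizing CRLF to LF (an ASSUMED definition of text mode)."""
    normalized = data.replace(b"\r\n", b"\n")
    return hashlib.sha256(normalized).hexdigest()


def binary_mode_sha256(data: bytes) -> str:
    """Hash the bytes exactly as stored."""
    return hashlib.sha256(data).hexdigest()


# The same two lines of text as stored with Windows and Unix line endings.
windows_copy = b"line one\r\nline two\r\n"
unix_copy = b"line one\nline two\n"

same_in_text_mode = text_mode_sha256(windows_copy) == text_mode_sha256(unix_copy)
same_in_binary_mode = binary_mode_sha256(windows_copy) == binary_mode_sha256(unix_copy)
```

Under this assumed definition the two copies agree in text mode and differ in binary mode, which is the cross-platform use case speculated about above.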
As far as the delimiter goes: as long as only one parameter, the filename and path, could potentially contain the delimiter, whether space or comma, it is fine for it to be unquoted as long as that single parameter is the last one on the line. A lot of protocols use this sort of last-parameter delimiter anticipation.
I can imagine text-mode hashes being used in FTP file sharing, where .txt, .html, .cpp, etc. files are converted between \r\n and \n depending on whether the ftpd is on Windows or Linux. It's probably not as big of a deal today, as people prefer exact-copy file transfer modes, and software code tends to be maintained via git / svn versioning systems nowadays, not hashes.
I was recently informed there's a video / audio frame hashing capability in the |
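The last-parameter approach described above can be sketched by capping the number of splits, so that any delimiter appearing inside the final field is left alone. The helper name `split_last_field` and the comma-delimited sample line are hypothetical, chosen only to illustrate the technique.

```python
def split_last_field(line: str, delimiter: str, field_count: int) -> list[str]:
    """Split into field_count fields; only the last field may contain the delimiter."""
    # maxsplit = field_count - 1 stops splitting once the last field begins.
    return line.rstrip("\r\n").split(delimiter, field_count - 1)


# The path contains a comma, yet it survives intact because it is the last field.
fields = split_last_field("ABCDEF,42,dir/name, with comma.txt\n", ",", 3)
```

This only works when every field before the last one is guaranteed never to contain the delimiter, which holds for a hex hash and an integer size.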
I wish to request that HashCheck support both the presence and absence of the filesize (in bytes) stored within hash files (e.g., .sha512). Filesize would be the second of the three columns, between the filehash and the filename. It would be contextually identifiable because it would be a plain integer value, not a fully qualified path, and it would be followed by a fully qualified path. This should solve compatibility issues.
I need this in order to use these hash files in other software, so that a file can be quickly disqualified as a match when its filesize is a mismatch, and to help locate a moved or renamed file by its matching filesize.
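The early exit described here could look roughly like the following sketch: compare the recorded size first, and only hash the file when the sizes agree. The `verify` function and its parameters are hypothetical names for illustration and say nothing about how HashCheck itself is implemented.

```python
import hashlib
import os
import tempfile


def verify(path, expected_hash, expected_size=None, algorithm="sha512"):
    """Return True on a match; skip hashing entirely when the size differs."""
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False  # Cheap rejection: no need to read the file contents.
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest().lower() == expected_hash.lower()


# Usage: a 3-byte sample file checked against a correct and an incorrect size.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "sample.bin")
    with open(sample, "wb") as handle:
        handle.write(b"abc")
    good_hash = hashlib.sha512(b"abc").hexdigest()
    matches = verify(sample, good_hash, expected_size=3)
    rejected_by_size = verify(sample, good_hash, expected_size=4)
```

When `expected_size` is `None` the function falls back to hashing alone, so entries without a recorded size still verify.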
Format as proposed:
hash filesize filepath
FFEDCBA9876543210123456789ABCDEF 1234567890 *path\filename.ext
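A reader that accepts lines with or without the size column might be sketched as follows. It assumes whitespace-separated columns and treats a leading `*` on the path as a marker to strip; `HashEntry` and `parse_line` are hypothetical names, and this is not HashCheck's actual parser.

```python
from typing import NamedTuple, Optional


class HashEntry(NamedTuple):
    digest: str
    size: Optional[int]
    path: str


def parse_line(line: str) -> HashEntry:
    """Parse 'hash [size] [*]path', where the size column is optional."""
    digest, rest = line.rstrip("\r\n").split(None, 1)
    size = None
    parts = rest.split(None, 1)
    # Treat the second column as a size only if it is all ASCII digits
    # AND something (the path) follows it.
    if len(parts) == 2 and parts[0].isascii() and parts[0].isdigit():
        size = int(parts[0])
        rest = parts[1]
    path = rest[1:] if rest.startswith("*") else rest
    return HashEntry(digest, size, path)


with_size = parse_line("FFEDCBA9876543210123456789ABCDEF 1234567890 *path\\filename.ext")
without_size = parse_line("FFEDCBA9876543210123456789ABCDEF *path\\filename.ext")
```

One caveat of this contextual detection: a line without a size and without the `*` marker whose filename starts with digits followed by a space (for example `123 notes.txt`) would be misread as having a size, so the marker or a fully qualified path is what keeps the columns unambiguous.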