sqlite3 by default uses insufficiently unique entries that can lead to skipped file downloads #5746

orion486 · 2024-06-16T23:40:38Z

The Issue

I am not sure if this can be called a bug, but it is a default setting that might not produce the intended results. If extractor.*.skip is true, some files with multiple revisions, such as those from kemonoparty and coomerparty, will not be downloaded when extractor.*.archive-format is left at its default of "{service}_{user}_{id}_{num}", which can be checked using the -E option.

How To Reproduce

For the following URL,

https://coomer.su/fansly/user/307507152082186240/post/577611859612409857

we extract session info using:

gallery-dl -s -j https://coomer.su/fansly/user/307507152082186240/post/577611859612409857

With the conditions above in place, the entries with "filename": "577611769514565632_preview" and "filename": "577608964548603905" both get assigned "num": 1. Because the archive entry is built from num and not from the filename, both files produce the same entry in the sqlite3 archive despite having different filenames: coomerpartyfansly_307507152082186240_577611859612409857_1. As a result, only the first of the two in download order is downloaded and the second one is skipped.
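The collision described above can be modeled in a few lines of Python. This is only a simplified sketch, not gallery-dl's actual code: the one-column table, the `archive_key` and `simulate` helpers, and the "coomerparty" prefix (inferred from the entry quoted above) are assumptions made for illustration, and the order of the two files is arbitrary.

```python
import sqlite3

ARCHIVE_FORMAT = "{service}_{user}_{id}_{num}"
PREFIX = "coomerparty"  # assumed prefix, inferred from the entry quoted above

# Metadata for the two files from the example post (only the relevant fields).
files = [
    {"service": "fansly", "user": "307507152082186240",
     "id": "577611859612409857", "num": 1,
     "filename": "577611769514565632_preview"},
    {"service": "fansly", "user": "307507152082186240",
     "id": "577611859612409857", "num": 1,
     "filename": "577608964548603905"},
]


def archive_key(meta, fmt=ARCHIVE_FORMAT, prefix=PREFIX):
    """Build the archive entry; fields not named in fmt are ignored."""
    return prefix + fmt.format(**meta)


def simulate(files, fmt=ARCHIVE_FORMAT):
    """Return (downloaded, skipped) filenames for a fresh in-memory archive."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE archive (entry TEXT PRIMARY KEY)")  # hypothetical schema
    downloaded, skipped = [], []
    for meta in files:
        key = archive_key(meta, fmt)
        found = db.execute(
            "SELECT 1 FROM archive WHERE entry = ?", (key,)).fetchone()
        if found:
            skipped.append(meta["filename"])  # key already archived: skip
        else:
            db.execute("INSERT INTO archive (entry) VALUES (?)", (key,))
            downloaded.append(meta["filename"])
    db.close()
    return downloaded, skipped


downloaded, skipped = simulate(files)
print(archive_key(files[0]))
print("downloaded:", downloaded)
print("skipped:", skipped)
```

Both files map to the same key, so whichever one is processed second is treated as already archived.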

Workarounds

1. Change the default setting of extractor.*.archive-format to something more unique, like "{service}_{user}_{id}_{filename}_{extension}_{num}".
2. Set extractor.*.skip to false (which should have the same(?) effect as using the --no-skip option). This will download everything again, so it is not the best solution.

The first option will break compatibility with entries already in the sqlite3 archive, since those were written in the old format. Still, if this behavior is indeed unintended, then the first option is probably the best solution.
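A rough illustration of both points (the longer format separates the two files, and keys written in the old format no longer match) might look like the following. The "jpeg" extension values are placeholders, since the real extensions are not given in this thread, and the archive prefix is left out for brevity.

```python
OLD_FORMAT = "{service}_{user}_{id}_{num}"
NEW_FORMAT = "{service}_{user}_{id}_{filename}_{extension}_{num}"

post = {"service": "fansly", "user": "307507152082186240",
        "id": "577611859612409857"}

# "extension" values are placeholders; they are not stated in the thread.
files = [
    dict(post, num=1, filename="577611769514565632_preview", extension="jpeg"),
    dict(post, num=1, filename="577608964548603905", extension="jpeg"),
]

old_keys = {OLD_FORMAT.format(**f) for f in files}
new_keys = {NEW_FORMAT.format(**f) for f in files}

print(len(old_keys), "key(s) with the old format")  # both files collapse
print(len(new_keys), "key(s) with the new format")  # one key per file

# An archive filled with old-format keys recognizes none of the new keys,
# so previously downloaded files would be fetched again.
print("new keys already archived:", new_keys & old_keys)
```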

Other URLs Also Affected


komoreshi · 2024-06-17T12:56:49Z

Doesn't really address the default config issue since it's dependent on the extractor, but with kemono/coomer, the api returns file hashes (iirc SHA256 is used) and can be used as a more specific way to ensure duplicates aren't downloaded like so: "archive-format": "{subcategory}_{user}_{id}_{num}_{hash}"

orion486 · 2024-06-17T18:48:35Z

Yes, I originally thought it may affect more sites, but given that these two websites in question seem to be pretty unique in how they provide multiple revisions of a download target, perhaps addressing this issue is better done on a per-website/extractor basis. I am not sure if other websites use a similar revision system but if they do then a similar solution could be used for their extractor, depending on the info that can be extracted.

And I agree, the file hash for this extractor would be a much better solution to ensure no shared entries in sqlite3. I'll make a new PR.

a84r7a3rga76fg · 2024-06-17T23:33:46Z

"filename": "{hash}.{extension}",
"archive-format": "{subcategory}_{user}_{id}_{hash}"

With these you'll only download unique files. Use Kemono's API to sort the files afterwards. There is literally no point in trying to sort files while downloading from Kemono because of how they handle revisions.
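To see how the two hash-based formats suggested in this thread differ, here is a small sketch. The metadata values ("u1", "p1", "aaaa", "bbbb") are made-up placeholders, not real API output, and `count_entries` is a helper invented for this example.

```python
WITH_NUM = "{subcategory}_{user}_{id}_{num}_{hash}"
HASH_ONLY = "{subcategory}_{user}_{id}_{hash}"

post = {"subcategory": "fansly", "user": "u1", "id": "p1"}  # placeholders

files = [
    dict(post, num=1, hash="aaaa"),  # file from one revision
    dict(post, num=1, hash="bbbb"),  # different file at the same position
    dict(post, num=2, hash="aaaa"),  # same file again, at another position
]


def count_entries(fmt):
    """Number of distinct archive entries the format yields for `files`."""
    return len({fmt.format(**f) for f in files})


print("with num: ", count_entries(WITH_NUM))
print("hash only:", count_entries(HASH_ONLY))
```

With {num} in the key, the same file showing up at a different position counts as a new entry; with the hash alone, each distinct file within a post is archived once.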

sntrenter · 2024-08-06T15:11:11Z

Is there a way to fix this with config changes? I tried some of the suggestions in this thread, but gallery-dl still only pulls the latest "revision"/"imported" rip. I haven't worked much with the gallery-dl config yet.

~~I guess you (or I) could write a script to pull the revision URLs and then manually gallery-dl them, but I think it would create some dupes, i.e. for the post you give as an example:~~

rev2:https://n3.coomer.su/data/f3/ea/f3ea2ffcbe3dda1f889458180707130d2b3e59def266f83c16611ecc8af1507e.jpg?f=543137649662111744.jpeg
rev1:https://n2.coomer.su/data/ec/35/ec35e98166963840edb8beb330e901b6a7ac3690a0b36f20be63107ec4a1da03.jpg?f=543137648286380032_preview.jpeg
rev3:https://n2.coomer.su/data/ec/35/ec35e98166963840edb8beb330e901b6a7ac3690a0b36f20be63107ec4a1da03.jpg?f=543137648286380032_preview.jpeg

so you would end up with two, a "preview" version and a "real" one. (If you are already writing a script to include all the revisions of the page, it should be trivial to remove a "preview" version if a "real" version exists.)
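A minimal sketch of that cleanup step, using the three revision URLs above. It assumes the last path component of each URL is the file hash and that the `f` query parameter carries the original filename; the rule "drop every preview as soon as any non-preview file exists" is just one possible heuristic, and the helper names are made up for this example.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlsplit

REVISION_URLS = [
    "https://n3.coomer.su/data/f3/ea/f3ea2ffcbe3dda1f889458180707130d2b3e59def266f83c16611ecc8af1507e.jpg?f=543137649662111744.jpeg",
    "https://n2.coomer.su/data/ec/35/ec35e98166963840edb8beb330e901b6a7ac3690a0b36f20be63107ec4a1da03.jpg?f=543137648286380032_preview.jpeg",
    "https://n2.coomer.su/data/ec/35/ec35e98166963840edb8beb330e901b6a7ac3690a0b36f20be63107ec4a1da03.jpg?f=543137648286380032_preview.jpeg",
]


def describe(url):
    """Return (hash, original filename) parsed from a file URL."""
    parts = urlsplit(url)
    file_hash = PurePosixPath(parts.path).stem       # assumed to be the hash
    name = parse_qs(parts.query).get("f", [""])[0]   # original filename
    return file_hash, name


def is_preview(name):
    return PurePosixPath(name).stem.endswith("_preview")


def unique_files(urls):
    """Deduplicate by hash, then drop previews if a non-preview file exists."""
    by_hash = {}
    for url in urls:
        file_hash, name = describe(url)
        by_hash.setdefault(file_hash, (url, name))   # keep first URL per hash
    entries = list(by_hash.values())
    real = [url for url, name in entries if not is_preview(name)]
    return real or [url for url, _ in entries]


for url in unique_files(REVISION_URLS):
    print(url)
```

For the URLs above this keeps only the rev2 file; rev1 and rev3 share a hash and are both previews.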

~~the revisions are also in plaintext on the page so generating the URLs to check for new/missing files shouldn't be too hard?~~

~~I'm a little bit of a leech on the project right now so I'm not sure if this would be something better developed as a personal hacky fix or if I should try to get a PR working for it?~~

Edit: After experimenting with it myself, it seems like you might just need:

        "kemonoparty": {
            "revisions": "unique"
        },
        "coomerparty": {
            "revisions": "unique"
        },

or I misunderstood the issue (more likely)

orion486 · 2024-08-19T02:42:09Z

@sntrenter From my own tests, using "revisions": "unique" did not solve the problem; I was already using unique in my config when I noticed this issue, and I also tried disabling it to see if it changed anything. Pull request #5751, which I submitted, fixes this issue but has not been merged yet. For now I have solved the issue locally by using my first workaround, which can be found in my first post in this thread. It can be implemented in the config by adding the following entry to the kemonoparty and coomerparty sections. It essentially does what the pull request would do by default:

"coomerparty": {
    ...
    "archive-format": "{service}_{user}_{id}_{num}_{hash}",
    ...
},

Quick edit: Reminder that, to enable the sqlite3 archive, you also need to set "archive": "~/choose_path/archive-coomerparty.sqlite3".

orion486 mentioned this issue Jun 17, 2024

fixes #5746 #5748

Closed

orion486 linked a pull request Jun 17, 2024 that will close this issue

[kemonoparty] uses more unique entries for the sqlite3 archive to ensure #5751

Open
