sqlite3 by default uses insufficiently unique entries that can lead to skipped file downloads #5746

Open
orion486 opened this issue Jun 16, 2024 · 5 comments · May be fixed by #5751

Comments


orion486 commented Jun 16, 2024

The Issue

I am not sure whether this can be called a bug, but it is a default setting that might not produce the intended results. When extractor.*.skip is true, some files with multiple revisions, such as those from kemonoparty and coomerparty, will not be downloaded if extractor.*.archive-format is left at its default of "{service}_{user}_{id}_{num}" (the default can be checked using the -E option).

How To Reproduce

For the URL in the command below, we extract session info using:

gallery-dl -s -j https://coomer.su/fansly/user/307507152082186240/post/577611859612409857

Under the conditions described above, the object entries with "filename": "577611769514565632_preview" and "filename": "577608964548603905" are both assigned "num": 1. Because the default archive format distinguishes files within a post only by num, both files produce the same sqlite3 archive entry in spite of having different filenames: coomerpartyfansly_307507152082186240_577611859612409857_1. As a result, only one of the files is downloaded, and whichever comes second in the download order is skipped.
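The collision described above can be reproduced with a small stand-alone sketch. This is a simplified model, not gallery-dl's actual code: the `archive_key` and `simulate` helpers, the table layout, and the "coomerparty" prefix are assumptions chosen to match the archive entry quoted above, and the order of the two files in the list is arbitrary.

```python
import sqlite3

# Assumed to mirror the defaults discussed in this issue.
ARCHIVE_PREFIX = "coomerparty"
ARCHIVE_FORMAT = "{service}_{user}_{id}_{num}"

# Metadata for the two files named in the report (only the relevant fields).
files = [
    {"service": "fansly", "user": "307507152082186240",
     "id": "577611859612409857", "num": 1,
     "filename": "577611769514565632_preview"},
    {"service": "fansly", "user": "307507152082186240",
     "id": "577611859612409857", "num": 1,
     "filename": "577608964548603905"},
]


def archive_key(meta, fmt=ARCHIVE_FORMAT):
    """Build the archive entry for one file (hypothetical helper)."""
    return ARCHIVE_PREFIX + fmt.format(**meta)


def simulate(files, fmt=ARCHIVE_FORMAT):
    """Model 'skip if the entry is already in the archive'."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE archive (entry TEXT PRIMARY KEY)")
    downloaded, skipped = [], []
    for meta in files:
        key = archive_key(meta, fmt)
        seen = con.execute(
            "SELECT 1 FROM archive WHERE entry = ?", (key,)
        ).fetchone()
        if seen:
            skipped.append(meta["filename"])
            continue
        con.execute("INSERT INTO archive (entry) VALUES (?)", (key,))
        downloaded.append(meta["filename"])
    con.close()
    return downloaded, skipped


downloaded, skipped = simulate(files)
print("downloaded:", downloaded)
print("skipped:   ", skipped)
```

In this model the second file is skipped purely because its key matches the first one; nothing about the file contents is compared.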

Workarounds

  1. Change the default setting of extractor.*.archive-format to something more unique, like "{service}_{user}_{id}_{filename}_{extension}_{num}".
  2. Set extractor.*.skip to false (which should have the same(?) effect as using the --no-skip option). This downloads everything again, so it is not the best solution.

The first option will break legacy support for previous entries already in the sqlite3 archive. Still, if this behavior is indeed unintended, then the first option is probably the best solution.
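As a rough illustration of the first workaround and its legacy cost, the sketch below formats both files with the default and the proposed format string. It uses plain `str.format`, not gallery-dl's own formatter, and the "jpeg" extension is a placeholder value chosen for the example rather than something taken from the actual metadata.

```python
DEFAULT_FORMAT = "{service}_{user}_{id}_{num}"
WORKAROUND_FORMAT = "{service}_{user}_{id}_{filename}_{extension}_{num}"

# Fields shared by both files of the post from the report.
base = {
    "service": "fansly",
    "user": "307507152082186240",
    "id": "577611859612409857",
    "num": 1,
}

# "extension" is a placeholder; the real value is not given in the report.
files = [
    dict(base, filename="577611769514565632_preview", extension="jpeg"),
    dict(base, filename="577608964548603905", extension="jpeg"),
]


def unique_keys(fmt, files):
    """Return the set of distinct archive keys a format string produces."""
    return {fmt.format(**meta) for meta in files}


default_keys = unique_keys(DEFAULT_FORMAT, files)
workaround_keys = unique_keys(WORKAROUND_FORMAT, files)
print(len(default_keys), "distinct key(s) with the default format")
print(len(workaround_keys), "distinct key(s) with the workaround format")

# Legacy cost: a key stored under the old format never matches a key
# generated under the new one, so previously archived files look new.
old_key = DEFAULT_FORMAT.format(**files[0])
new_key = WORKAROUND_FORMAT.format(**files[0])
print("old entry still recognized:", old_key == new_key)
```

The last comparison is why changing the format on an existing archive would cause already-downloaded files to be fetched again.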

Other URLs Also Affected


komoreshi commented Jun 17, 2024

This doesn't really address the default config issue, since it depends on the extractor, but for kemono/coomer the API returns file hashes (IIRC SHA256 is used), which can be used as a more specific way to ensure duplicates aren't downloaded, like so: "archive-format": "{subcategory}_{user}_{id}_{num}_{hash}"

@orion486
Author

Yes, I originally thought it may affect more sites, but given that these two websites in question seem to be pretty unique in how they provide multiple revisions of a download target, perhaps addressing this issue is better done on a per-website/extractor basis. I am not sure if other websites use a similar revision system but if they do then a similar solution could be used for their extractor, depending on the info that can be extracted.

And I agree, the file hash for this extractor would be a much better solution to ensure no shared entries in sqlite3. I'll make a new PR.


a84r7a3rga76fg commented Jun 17, 2024

"filename": "{hash}.{extension}",
"archive-format": "{subcategory}_{user}_{id}_{hash}"

With these you'll only download unique files. Use Kemono's API to sort the files afterwards. There is literally no point in trying to sort files while downloading from Kemono because of how they handle revisions.


sntrenter commented Aug 6, 2024

Is there a way to fix this with config changes? I tried some of the suggestions in this thread, but gallery-dl still only pulls the latest "revision"/"imported" rip. I haven't worked much with the gallery-dl config yet.

I guess you (or I) could write a script to pull the revision URLs and then run gallery-dl on them manually, but I think it would create some dupes, e.g. for the post you give as an example:

rev2:https://n3.coomer.su/data/f3/ea/f3ea2ffcbe3dda1f889458180707130d2b3e59def266f83c16611ecc8af1507e.jpg?f=543137649662111744.jpeg
rev1:https://n2.coomer.su/data/ec/35/ec35e98166963840edb8beb330e901b6a7ac3690a0b36f20be63107ec4a1da03.jpg?f=543137648286380032_preview.jpeg
rev3:https://n2.coomer.su/data/ec/35/ec35e98166963840edb8beb330e901b6a7ac3690a0b36f20be63107ec4a1da03.jpg?f=543137648286380032_preview.jpeg

So you would end up with two files, a "preview" version and a "real" one. (If you are already writing a script to include all the revisions of the page, it should be trivial to remove a "preview" version if a "real" version exists.)

The revisions are also in plaintext on the page, so generating the URLs to check for new/missing files shouldn't be too hard?
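A rough sketch of that kind of script, using only the three revision URLs listed above. The helper names are hypothetical, and two things are assumptions inferred from the URLs rather than documented behavior: that the last path component is the file hash, and that a name ending in "_preview" marks a preview copy.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlsplit

revision_urls = [
    "https://n3.coomer.su/data/f3/ea/f3ea2ffcbe3dda1f889458180707130d2b3e59def266f83c16611ecc8af1507e.jpg?f=543137649662111744.jpeg",
    "https://n2.coomer.su/data/ec/35/ec35e98166963840edb8beb330e901b6a7ac3690a0b36f20be63107ec4a1da03.jpg?f=543137648286380032_preview.jpeg",
    "https://n2.coomer.su/data/ec/35/ec35e98166963840edb8beb330e901b6a7ac3690a0b36f20be63107ec4a1da03.jpg?f=543137648286380032_preview.jpeg",
]


def parse(url):
    """Split a revision URL into (hash, original filename)."""
    parts = urlsplit(url)
    file_hash = PurePosixPath(parts.path).stem    # assumed: path stem is the hash
    name = parse_qs(parts.query)["f"][0]          # original name from ?f=
    return file_hash, name


def unique_files(urls):
    """Keep the first URL seen for each distinct hash."""
    seen = {}
    for url in urls:
        file_hash, name = parse(url)
        seen.setdefault(file_hash, (name, url))
    return seen


def drop_previews(files):
    """Heuristic: if any non-preview file exists, discard the previews."""
    def is_preview(name):
        return PurePosixPath(name).stem.endswith("_preview")

    full = {h: v for h, v in files.items() if not is_preview(v[0])}
    return full or files  # keep previews if nothing else is available


files = unique_files(revision_urls)
kept = drop_previews(files)
for file_hash, (name, url) in kept.items():
    print(name, url)
```

De-duplicating by hash collapses rev1 and rev3 into one entry; the preview filter is a separate, cruder step, since the preview and the full file in this example do not share an ID in their names.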

I'm a little bit of a leech on the project right now so I'm not sure if this would be something better developed as a personal hacky fix or if I should try to get a PR working for it?

Edit: After experimenting with it myself, it seems like you might just need:

        "kemonoparty": {
            "revisions": "unique"
        },
        "coomerparty": {
            "revisions": "unique"
        },

or I misunderstood the issue (more likely)

Author

orion486 commented Aug 19, 2024

@sntrenter From my own tests, using "revisions": "unique" did not solve the problem; I was already using unique in my config when I noticed this issue, and I also tried disabling it to see if it changed anything. Pull request #5751, which I submitted, fixes this issue but has not yet been merged. For now I have solved the issue locally by using my first workaround, which can be found in my first post in this thread. It can be implemented in the config by adding the following entry to the kemonoparty and coomerparty sections. It essentially does what the pull request would do by default:

"coomerparty": {
    ...
    "archive-format": "{service}_{user}_{id}_{num}_{hash}",
    ...
},

Quick edit: A reminder that, to enable the sqlite3 archive, you also need to set "archive": "~/choose_path/archive-coomerparty.sqlite3".
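Put together, the two settings above might look like the fragment generated below. This is only a sketch: it assumes the category sections live under a top-level "extractor" key, as is usual for gallery-dl config files, and it keeps the placeholder path from this comment as-is. The field listing at the end is just a sanity check on the format string using Python's own `string.Formatter`, not gallery-dl's formatter.

```python
import json
from string import Formatter

# Assumed layout: category sections nested under "extractor".
config = {
    "extractor": {
        "coomerparty": {
            "archive": "~/choose_path/archive-coomerparty.sqlite3",
            "archive-format": "{service}_{user}_{id}_{num}_{hash}",
        }
    }
}

# Render the fragment as it could appear in a JSON config file.
text = json.dumps(config, indent=4)
print(text)

# List the metadata fields the format string relies on.
fmt = config["extractor"]["coomerparty"]["archive-format"]
fields = [name for _, name, _, _ in Formatter().parse(fmt) if name]
print("fields used:", fields)
```

Note that switching an existing archive to this format has the same legacy effect mentioned for the first workaround: entries written under the old format will no longer match.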

4 participants