New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Command for fetching HathiTrust Images #694

Open

laurejt wants to merge 6 commits into develop from feature/fetch-hathi-images

Contributor

laurejt commented Nov 6, 2024 •

edited

Loading

ref: #663

New command for fetching HathiTrust images ("full-size" and thumbnails).

Notes:

This currently depends on the develop branch of ppa-nlp
The width of "full-size" pages is currently set to 800

Questions:

Is a width of 800pixels the "right" size?
I'm having some trouble with displaying two stacked progress bars. Any ideas?
Should we use a different crawl delay?
Thoughts on logging progress?

laurejt added 3 commits

November 5, 2024 16:16


          Updated developer notes and fixed unit test

24047db


          Add ruff to dev requirements

0ff502e


          Initial command for fetching hathi images.

1a64204

Note this currently relies on a feature branch of ppa-nlp

laurejt requested a review from rlskoeser

November 6, 2024 14:44

laurejt self-assigned this

codecov bot commented Nov 6, 2024 •

edited

Loading

Codecov Report

Attention: Patch coverage is 1.21951% with 81 lines in your changes missing coverage. Please review.

Project coverage is 93.34%. Comparing base (44bd84d) to head (9106cd5).
Report is 226 commits behind head on develop.

❗ There is a different number of reports uploaded between BASE (44bd84d) and HEAD (9106cd5). Click for more details.

HEAD has 2 uploads less than BASE

Flag BASE (44bd84d) HEAD (9106cd5)

javascript 3 1

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #694      +/-   ##
===========================================
- Coverage   100.00%   93.34%   -6.66%     
===========================================
  Files            5      139     +134     
  Lines           78     7555    +7477     
  Branches         8        8              
===========================================
+ Hits            78     7052    +6974     
- Misses           0      503     +503

rlskoeser reviewed

View reviewed changes

Contributor

rlskoeser left a comment

This looks fantastic, and the script seems to be working really well.

I think there's a bit more work to do to polish it up and improve the output and reporting, but it seems like the basic functionality is pretty much working as we want it to.

It would be nice to add handling for SIGINT / ctrl-c so you can stop the script but have it end gracefully, finish downloading images for the current volume and report on what ws done, then stop. If you're interested in adding that, let me know, I should be able to find an example where I've done something similar.

ppa/archive/management/commands/hathi_images.py Outdated Show resolved Hide resolved

ppa/archive/management/commands/hathi_images.py Outdated Show resolved Hide resolved

ppa/archive/management/commands/hathi_images.py

+                              )
+                              volume_progress.start()
+                          # Fetch images

Contributor

rlskoeser Nov 6, 2024

I suggest moving the logic for downloading all images from a volume into a separate function. I think it would help with readability and (hopefully) code complexity

ppa/archive/management/commands/hathi_images.py

Comment on lines +57 to +58

		if not kwargs["out"]:
		raise CommandError("An output directory must be specified")

Contributor

rlskoeser Nov 6, 2024

This check will never run because argparse handles it - positional arguments like this are required. (Only exception is if we call the handle method in unit tests without doing the minimal setup, but we shouldn't do that)

Suggested change

      
                    if not kwargs["out"]:
          
                        raise CommandError("An output directory must be specified")
          
                    if not kwargs["out"]:
          
                        raise CommandError("An output directory must be specified")

Contributor Author

laurejt Nov 6, 2024 •

edited

Loading

So, I should remove this? (the suggest change looks confusing)

Contributor

rlskoeser Nov 6, 2024

yes, sorry - please remove

ppa/archive/management/commands/hathi_images.py

Comment on lines +27 to +30

+                      parser.add_argument(
+                          "out",
+                          type=Path,
+                          help="Top-level output directory")

Contributor

rlskoeser Nov 6, 2024

I would probably use a longer and more descriptive parameter (maybe output_dir or at least output), but if you want to keep it short you could at least set a metavar so it's more readable in the script usage.

Suggested change

      
                    parser.add_argument(
          
                        "out",
          
                        type=Path,
          
                        help="Top-level output directory")
          
                    parser.add_argument(
          
                        "out",
          
                        type=Path,
          
                        help="Top-level output directory")
          
                        metavar="OUTPUT_DIR"

Contributor Author

laurejt Nov 6, 2024

I believe I was trying to follow conventions used elsewhere. I'm open to other names.

ppa/archive/management/commands/hathi_images.py

+                              page_image = vol_dir / image_name
+                              if not page_image.is_file():
+                                  image_url = page_image_url(vol_id, page_num, 800)
+                                  #self.download_image(image_url, page_image)

Contributor

rlskoeser Nov 6, 2024

Was this disabled intentionally? I ran the script locally before I noticed this, was wondering why I was only getting thumbnail images!

If it's useful, we could add a flag to only download thumbnail images

Contributor Author

laurejt Nov 6, 2024

Yes, this wasn't supposed to be a "true" PR. Sorry for not explicitly marking this as a draft. It was commented out because I was trying to focus on the progress bar.

Contributor

rlskoeser Nov 6, 2024

oh, that's right - I knew that, I guess I sort of lost track of your main questions for this version. Hopefully the comments are still useful for where you're at with the script.

ppa/archive/management/commands/hathi_images.py

+                              # Fetch thumbnail if file does not exist
+                              page_thumbnail = thumbnail_dir / image_name
+                              if not page_thumbnail.is_file():

Contributor

rlskoeser Nov 6, 2024

I wondered about a volume-level check to skip an entire volume if the expected number of images are present, but glad to see you have this check.

ppa/archive/management/commands/hathi_images.py

+                      overall_progress = None
+                      if self.options["progress"]:
+                          overall_progress = progressbar.ProgressBar(
+                              redirect_stdout=True, max_value=digworks.count(), max_error=False

Contributor

rlskoeser Nov 6, 2024

For progressbar2 when you are using multiple progress bars together you need to specify line_offset . I think this is why the first one, for overall progress, doesn't appear to update - it must be overwritten immediately by the volume-level progress bar.

Rather than create a new volume progress bar for each loop, I think it would work better to create two progress bars with the proper line offsets so they display together nicely, and reset the volume progress bar each volume. It's possible to customize the progressbar output - it would be nice to have the volume id at the beginning, and you could include include skip / download counts too.

If we do that, then details for each volume will be more ephemeral (you'll only see the current one). You could experiment and see how it works with the progressbar, once the line offset is working maybe print statements will always display above the progress bars. Otherwise we could use logging for the more detailed counts.

Contributor Author

laurejt Nov 6, 2024

Ah, good call. It was hard finding example code, since most examples were for independent / multithreaded situations. But I'll look into it. Maybe we can find some time to make some design choices next week?

ppa/archive/management/commands/hathi_images.py

+                      )
+                  def download_image(self, page_url: str, out_file: Path) -> None:
+                      response = requests.get(page_url)

Contributor

rlskoeser Nov 6, 2024

I would be interested in seeing debug logging / verbose output on the crawl tokens / throttling, so we can get a better sense of how that works.

How did you come up with the crawl delay?

Contributor Author

laurejt Nov 6, 2024

HathiTrust's robots

ppa/archive/management/commands/hathi_images.py

+                              # Fetch thumbnail if file does not exist
+                              page_thumbnail = thumbnail_dir / image_name
+                              if not page_thumbnail.is_file():
+                                  thumbnail_url = page_image_url(vol_id, page_num, 250)

Contributor

rlskoeser Nov 6, 2024

I suggest setting the image width as class variables so they are easier to find at the top of the file

laurejt and others added 3 commits

November 6, 2024 13:43


          Update ppa/archive/management/commands/hathi_images.py

9c185e2

Co-authored-by: Rebecca Sutton Koeser <[email protected]>


          Update ppa/archive/management/commands/hathi_images.py

a617d94

Co-authored-by: Rebecca Sutton Koeser <[email protected]>


          Update corppa dependency to develop branch

9106cd5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet