Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Command for fetching HathiTrust Images #694

Open
wants to merge 6 commits into
base: develop
Choose a base branch
from

Conversation

laurejt
Copy link
Contributor

@laurejt laurejt commented Nov 6, 2024

ref: #663

New command for fetching HathiTrust images ("full-size" and thumbnails).

Notes:

  • This currently depends on the develop branch of ppa-nlp
  • The width of "full-size" pages is currently set to 800

Questions:

  • Is a width of 800pixels the "right" size?
  • I'm having some trouble with displaying two stacked progress bars. Any ideas?
  • Should we use a different crawl delay?
  • Thoughts on logging progress?

@laurejt laurejt self-assigned this Nov 6, 2024
Copy link

codecov bot commented Nov 6, 2024

Codecov Report

Attention: Patch coverage is 1.21951% with 81 lines in your changes missing coverage. Please review.

Project coverage is 93.34%. Comparing base (44bd84d) to head (9106cd5).
Report is 226 commits behind head on develop.

❗ There is a different number of reports uploaded between BASE (44bd84d) and HEAD (9106cd5). Click for more details.

HEAD has 2 uploads less than BASE
Flag BASE (44bd84d) HEAD (9106cd5)
javascript 3 1
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #694      +/-   ##
===========================================
- Coverage   100.00%   93.34%   -6.66%     
===========================================
  Files            5      139     +134     
  Lines           78     7555    +7477     
  Branches         8        8              
===========================================
+ Hits            78     7052    +6974     
- Misses           0      503     +503     

Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fantastic, and the script seems to be working really well.

I think there's a bit more work to do to polish it up and improve the output and reporting, but it seems like the basic functionality is pretty much working as we want it to.

It would be nice to add handling for SIGINT / ctrl-c so you can stop the script but have it end gracefully, finish downloading images for the current volume and report on what ws done, then stop. If you're interested in adding that, let me know, I should be able to find an example where I've done something similar.

ppa/archive/management/commands/hathi_images.py Outdated Show resolved Hide resolved
ppa/archive/management/commands/hathi_images.py Outdated Show resolved Hide resolved
)
volume_progress.start()

# Fetch images
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest moving the logic for downloading all images from a volume into a separate function. I think it would help with readability and (hopefully) code complexity

Comment on lines +57 to +58
if not kwargs["out"]:
raise CommandError("An output directory must be specified")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check will never run because argparse handles it - positional arguments like this are required. (Only exception is if we call the handle method in unit tests without doing the minimal setup, but we shouldn't do that)

Suggested change
if not kwargs["out"]:
raise CommandError("An output directory must be specified")
if not kwargs["out"]:
raise CommandError("An output directory must be specified")

Copy link
Contributor Author

@laurejt laurejt Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I should remove this? (the suggest change looks confusing)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, sorry - please remove

Comment on lines +27 to +30
parser.add_argument(
"out",
type=Path,
help="Top-level output directory")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably use a longer and more descriptive parameter (maybe output_dir or at least output), but if you want to keep it short you could at least set a metavar so it's more readable in the script usage.

Suggested change
parser.add_argument(
"out",
type=Path,
help="Top-level output directory")
parser.add_argument(
"out",
type=Path,
help="Top-level output directory")
metavar="OUTPUT_DIR"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I was trying to follow conventions used elsewhere. I'm open to other names.

page_image = vol_dir / image_name
if not page_image.is_file():
image_url = page_image_url(vol_id, page_num, 800)
#self.download_image(image_url, page_image)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this disabled intentionally? I ran the script locally before I noticed this, was wondering why I was only getting thumbnail images!

If it's useful, we could add a flag to only download thumbnail images

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this wasn't supposed to be a "true" PR. Sorry for not explicitly marking this as a draft. It was commented out because I was trying to focus on the progress bar.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, that's right - I knew that, I guess I sort of lost track of your main questions for this version. Hopefully the comments are still useful for where you're at with the script.


# Fetch thumbnail if file does not exist
page_thumbnail = thumbnail_dir / image_name
if not page_thumbnail.is_file():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered about a volume-level check to skip an entire volume if the expected number of images are present, but glad to see you have this check.

overall_progress = None
if self.options["progress"]:
overall_progress = progressbar.ProgressBar(
redirect_stdout=True, max_value=digworks.count(), max_error=False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For progressbar2 when you are using multiple progress bars together you need to specify line_offset . I think this is why the first one, for overall progress, doesn't appear to update - it must be overwritten immediately by the volume-level progress bar.

Rather than create a new volume progress bar for each loop, I think it would work better to create two progress bars with the proper line offsets so they display together nicely, and reset the volume progress bar each volume. It's possible to customize the progressbar output - it would be nice to have the volume id at the beginning, and you could include include skip / download counts too.

If we do that, then details for each volume will be more ephemeral (you'll only see the current one). You could experiment and see how it works with the progressbar, once the line offset is working maybe print statements will always display above the progress bars. Otherwise we could use logging for the more detailed counts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good call. It was hard finding example code, since most examples were for independent / multithreaded situations. But I'll look into it. Maybe we can find some time to make some design choices next week?

)

def download_image(self, page_url: str, out_file: Path) -> None:
response = requests.get(page_url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be interested in seeing debug logging / verbose output on the crawl tokens / throttling, so we can get a better sense of how that works.

How did you come up with the crawl delay?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HathiTrust's robots

# Fetch thumbnail if file does not exist
page_thumbnail = thumbnail_dir / image_name
if not page_thumbnail.is_file():
thumbnail_url = page_image_url(vol_id, page_num, 250)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest setting the image width as class variables so they are easier to find at the top of the file

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

2 participants