-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Command for fetching HathiTrust Images #694
base: develop
Are you sure you want to change the base?
Conversation
Note this currently relies on a feature branch of ppa-nlp
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #694 +/- ##
===========================================
- Coverage 100.00% 93.34% -6.66%
===========================================
Files 5 139 +134
Lines 78 7555 +7477
Branches 8 8
===========================================
+ Hits 78 7052 +6974
- Misses 0 503 +503 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fantastic, and the script seems to be working really well.
I think there's a bit more work to do to polish it up and improve the output and reporting, but it seems like the basic functionality is pretty much working as we want it to.
It would be nice to add handling for SIGINT / ctrl-c so you can stop the script but have it end gracefully, finish downloading images for the current volume and report on what ws done, then stop. If you're interested in adding that, let me know, I should be able to find an example where I've done something similar.
) | ||
volume_progress.start() | ||
|
||
# Fetch images |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest moving the logic for downloading all images from a volume into a separate function. I think it would help with readability and (hopefully) code complexity
if not kwargs["out"]: | ||
raise CommandError("An output directory must be specified") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check will never run because argparse handles it - positional arguments like this are required. (Only exception is if we call the handle method in unit tests without doing the minimal setup, but we shouldn't do that)
if not kwargs["out"]: | |
raise CommandError("An output directory must be specified") | |
if not kwargs["out"]: | |
raise CommandError("An output directory must be specified") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I should remove this? (the suggest change looks confusing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, sorry - please remove
parser.add_argument( | ||
"out", | ||
type=Path, | ||
help="Top-level output directory") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably use a longer and more descriptive parameter (maybe output_dir
or at least output
), but if you want to keep it short you could at least set a metavar so it's more readable in the script usage.
parser.add_argument( | |
"out", | |
type=Path, | |
help="Top-level output directory") | |
parser.add_argument( | |
"out", | |
type=Path, | |
help="Top-level output directory") | |
metavar="OUTPUT_DIR" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe I was trying to follow conventions used elsewhere. I'm open to other names.
page_image = vol_dir / image_name | ||
if not page_image.is_file(): | ||
image_url = page_image_url(vol_id, page_num, 800) | ||
#self.download_image(image_url, page_image) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this disabled intentionally? I ran the script locally before I noticed this, was wondering why I was only getting thumbnail images!
If it's useful, we could add a flag to only download thumbnail images
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this wasn't supposed to be a "true" PR. Sorry for not explicitly marking this as a draft. It was commented out because I was trying to focus on the progress bar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, that's right - I knew that, I guess I sort of lost track of your main questions for this version. Hopefully the comments are still useful for where you're at with the script.
|
||
# Fetch thumbnail if file does not exist | ||
page_thumbnail = thumbnail_dir / image_name | ||
if not page_thumbnail.is_file(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered about a volume-level check to skip an entire volume if the expected number of images are present, but glad to see you have this check.
overall_progress = None | ||
if self.options["progress"]: | ||
overall_progress = progressbar.ProgressBar( | ||
redirect_stdout=True, max_value=digworks.count(), max_error=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For progressbar2
when you are using multiple progress bars together you need to specify line_offset
. I think this is why the first one, for overall progress, doesn't appear to update - it must be overwritten immediately by the volume-level progress bar.
Rather than create a new volume progress bar for each loop, I think it would work better to create two progress bars with the proper line offsets so they display together nicely, and reset the volume progress bar each volume. It's possible to customize the progressbar output - it would be nice to have the volume id at the beginning, and you could include include skip / download counts too.
If we do that, then details for each volume will be more ephemeral (you'll only see the current one). You could experiment and see how it works with the progressbar, once the line offset is working maybe print statements will always display above the progress bars. Otherwise we could use logging for the more detailed counts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good call. It was hard finding example code, since most examples were for independent / multithreaded situations. But I'll look into it. Maybe we can find some time to make some design choices next week?
) | ||
|
||
def download_image(self, page_url: str, out_file: Path) -> None: | ||
response = requests.get(page_url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be interested in seeing debug logging / verbose output on the crawl tokens / throttling, so we can get a better sense of how that works.
How did you come up with the crawl delay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HathiTrust's robots
# Fetch thumbnail if file does not exist | ||
page_thumbnail = thumbnail_dir / image_name | ||
if not page_thumbnail.is_file(): | ||
thumbnail_url = page_image_url(vol_id, page_num, 250) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest setting the image width as class variables so they are easier to find at the top of the file
Co-authored-by: Rebecca Sutton Koeser <[email protected]>
Co-authored-by: Rebecca Sutton Koeser <[email protected]>
ref: #663
New command for fetching HathiTrust images ("full-size" and thumbnails).
Notes:
Questions: