Set a user agent string that matches convention used by libraries/tools #300

Open: wants to merge 1 commit into base: main

ephphatha

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent#library_and_net_tool_ua_strings provides a few examples; see also urllib, which uses "Python-urllib/<version>".

img2dataset does not parse HTML, so it has no reason to send a user agent that claims Mozilla compatibility.
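
For context, here is a minimal sketch of how urllib picks its User-Agent and how a tool-style string can replace it. The URL is a placeholder assumed for illustration, and no request is actually sent; the UA string is the default proposed in this PR.

```python
import urllib.request

# Placeholder URL, assumed for illustration only; nothing is fetched here.
url = "https://example.com/image.jpg"

# A default opener carries a ("User-agent", "Python-urllib/<version>") header.
default_headers = dict(urllib.request.build_opener().addheaders)

# A header set explicitly on the Request takes precedence over the opener default.
request = urllib.request.Request(
    url,
    headers={"User-Agent": "img2dataset/1.x (+https://github.com/rom1504/img2dataset)"},
)

print(default_headers["User-agent"])
# Request stores header names capitalized, hence "User-agent" for the lookup.
print(request.get_header("User-agent"))
```

Neither string claims browser compatibility, which is the convention the MDN page describes for libraries and net tools.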
@@ -38,9 +38,10 @@ def download_image(row, timeout, user_agent_token, disallowed_header_directives)
     """Download an image with urllib"""
     key, url = row
     img_stream = None
-    user_agent_string = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"
+    user_agent_string = "img2dataset/1.x ("
Contributor

Any reason not to use {user_agent_token} rather than hard-coding img2dataset here?

ephphatha (Author), Apr 26, 2023

The reference to the repository was hardcoded previously if any user-agent was specified, so it seemed appropriate to use it as the base tool name with the user-provided string added in the comment section.

edit: on double-checking main(), the default user agent token is None, not "img2dataset" as I had assumed, so the old default UA does not identify the tool at all.

new strings are:
default UA: img2dataset/1.x (+https://github.com/rom1504/img2dataset)
user-provided UA: img2dataset/1.x (compatible; <user-provided>; +https://github.com/rom1504/img2dataset)

previous strings were:
default UA: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
user-provided UA: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0 (compatible; <user-provided>; +https://github.com/rom1504/img2dataset)
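
To make the new format concrete, here is a rough sketch of how the two new strings could be assembled. `build_user_agent` and `REPO_URL` are hypothetical names used only for illustration; this is not necessarily how the code in this PR is written.

```python
from typing import Optional

# Repository URL as it appears in the UA strings above.
REPO_URL = "https://github.com/rom1504/img2dataset"


def build_user_agent(user_agent_token: Optional[str] = None) -> str:
    """Hypothetical helper: build a tool-style UA string with an optional user token."""
    parts = []
    if user_agent_token:
        # The user-provided token goes in the comment section, not the product name.
        parts += ["compatible", user_agent_token]
    parts.append(f"+{REPO_URL}")
    comment = "; ".join(parts)
    return f"img2dataset/1.x ({comment})"


print(build_user_agent())
print(build_user_agent("mybot"))
```

With no token this yields the default string above; with a token such as "mybot" (an arbitrary example) it yields the user-provided form, and the tool name stays in the product position either way.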

rom1504 added this to Needs triage in PR Triage on May 28, 2023
rom1504 moved this from Needs triage to Waiting for user input in PR Triage on Jan 9, 2024