Web crawler + scraper

Scripts

`get_physicists_urls.py`: Get list of URLs to Wikipedia pages of theoretical physicists

Starting from the category page Category:Theoretical physicists, the script collects the relative URLs listed in the section Pages in category "Theoretical physicists", converts them to absolute URLs, and follows the "next page" link until no further page is found.

The script outputs a list of URLs to Wikipedia pages of theoretical physicists, which is saved as a pickle file.
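The crawl described above can be sketched as follows. This is a minimal illustration, not the script's actual code: it uses the standard-library `html.parser` in place of beautifulsoup4 so that it runs without extra packages, and it assumes that the article links sit inside a `<div id="mw-pages">` element and that the pagination link has the text "next page". The names `CategoryPageParser`, `crawl_category` and the injected `fetch` callable are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"  # assumption: English Wikipedia


class CategoryPageParser(HTMLParser):
    """Collect article links and the 'next page' link inside <div id="mw-pages">."""

    def __init__(self):
        super().__init__()
        self.depth = 0            # > 0 while inside the mw-pages div
        self.current_href = None  # href of the <a> tag being read, if any
        self.current_text = []
        self.page_links = []      # relative URLs such as /wiki/Some_Name
        self.next_href = None     # relative URL of the next category page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth or attrs.get("id") == "mw-pages":
                self.depth += 1
        elif tag == "a" and self.depth and "href" in attrs:
            self.current_href = attrs["href"]
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag == "a" and self.current_href is not None:
            text = "".join(self.current_text).strip()
            if text == "next page":
                self.next_href = self.current_href
            elif self.current_href.startswith("/wiki/"):
                self.page_links.append(self.current_href)
            self.current_href = None


def crawl_category(start_url, fetch):
    """Follow 'next page' links from start_url and return absolute article URLs.

    fetch is any callable that takes a URL and returns the page's HTML as a string.
    """
    urls, url = [], start_url
    while url:
        parser = CategoryPageParser()
        parser.feed(fetch(url))
        urls.extend(urljoin(BASE_URL, href) for href in parser.page_links)
        url = urljoin(BASE_URL, parser.next_href) if parser.next_href else None
    return urls
```

Passing `fetch` as a parameter keeps the parsing logic independent of the HTTP client. Note that `urljoin` joins the base and the relative URL with a single slash.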

ℹ️

The Python script can be found at get_physicists_urls.py

The Python script saves the list of URLs as a pickle file

For more information about the script's usage, check the Usage section.

Dependencies for get_physicists_urls.py

This is the environment on which the script get_physicists_urls.py was tested:

Platforms: macOS
Python: versions 3.7 and 3.8
beautifulsoup4: v4.11.1, for screen-scraping

ℹ️ The built-in module urllib is used for sending HTTP requests.

Usage for get_physicists_urls.py

Run the script get_physicists_urls.py

Run the script by specifying the path of the pickle file that will contain the list of URLs:

$ python get_physicists_urls.py ~/Data/wikipedia/list_physicists_urls.pkl -d 3

Showing the first 4 URLs in the list:

ipdb> list_physicists_urls[:4]

['https://en.wikipedia.org//wiki/Alexei_Abrikosov_(physicist)', 'https://en.wikipedia.org//wiki/Vadym_Adamyan', 'https://en.wikipedia.org//wiki/David_Adler_(physicist)', 'https://en.wikipedia.org//wiki/Diederik_Aerts']
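The saved list can also be inspected outside a debugger. The following is a minimal sketch of writing a list of URLs to a pickle file and reading it back with the built-in `pickle` module; the helper names `save_urls` and `load_urls` are hypothetical and are not taken from the script.

```python
import pickle
from pathlib import Path


def save_urls(urls, pickle_path):
    """Write the list of URLs to pickle_path, creating parent directories if needed."""
    path = Path(pickle_path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(urls, f)


def load_urls(pickle_path):
    """Read the list of URLs back from pickle_path."""
    with Path(pickle_path).expanduser().open("rb") as f:
        return pickle.load(f)
```

For example, `load_urls("~/Data/wikipedia/list_physicists_urls.pkl")[:4]` would return the first four URLs of the list. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.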

ℹ️

~/Data/wikipedia/list_physicists_urls.pkl: pickle file that will contain the list of URLs to Wikipedia pages of theoretical physicists

-d 3: three seconds between HTTP requests

List of options for get_physicists_urls.py

To display the script's list of options and their descriptions, use the -h option:

$ python get_physicists_urls.py -h

usage: python get_physicists_urls.py [OPTIONS] {pickle_file}

Get URLs to Wikipedia pages of theoretical physicists

positional arguments:
  pickle_file           Path to the pickle file that will contain the list of URLs to theoretical physicists' Wikipedia pages.

optional arguments:
  -h, --help            show this help message and exit
  -d DELAY, --delay-requests DELAY
                        Delay in seconds between HTTP requests. (default: 2)

⚠️ Don't use too short a delay (-d), e.g. 0.5 seconds between HTTP requests: you will overwhelm the server and your IP address will eventually get banned.
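The effect of the delay option can be sketched as a loop that pauses between consecutive requests. This is only an illustration of the idea, not the script's code: `fetch_all`, its `fetch` callable and the injectable `sleep` parameter are hypothetical names.

```python
import time


def fetch_all(urls, fetch, delay=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no need to wait before the very first request
            sleep(delay)
        results.append(fetch(url))
    return results
```

With `n` URLs the loop sleeps `n - 1` times, so a crawl of 10 category pages with `-d 3` would spend about 27 seconds waiting.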

`download_wiki_pages.py`: Download Wikipedia pages of theoretical physicists

This script takes as input a pickle file containing URLs to Wikipedia pages of theoretical physicists (see the previous script get_physicists_urls.py).

ℹ️

The Python script can be found at download_wiki_pages.py

By default, there is a delay of 2 seconds between HTTP requests.

For more information about the script's usage, check the Usage section.

Here are the general steps used by the script for downloading the Wikipedia pages with the corresponding images:

Load the pickle file containing the list of URLs which was generated from the previous script
For each URL,
1. If the Wikipedia page (HTML only) is not already saved locally, download it with the requests package
2. If the corresponding image is not already saved locally, download it, looking first in the info-box (i.e. in a <td> tag with the infobox-image class): e.g. Alexei Abrikosov
3. If no image is found in the info-box, try to get it as a thumbnail picture (i.e. in a <div> tag with the thumbinner class): e.g. Oriol Bohigas Martí
Every Wikipedia page (HTML) and its corresponding image are saved locally in a directory specified by the user
Information useful to the casual user is printed in the console (important messages are colored, e.g. a warning that an image couldn't be downloaded), while the rest of the information, useful for debugging, is hidden by the logger by default
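The image lookup and the "download only if missing" logic from the steps above can be sketched as follows. This is an illustration under stated assumptions rather than the script's implementation: it uses the standard-library `html.parser` instead of beautifulsoup4 so that it runs without extra packages, and the names `ImageFinder`, `pick_image_url` and `save_if_missing` are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class ImageFinder(HTMLParser):
    """Record the first <img> src inside an info-box cell and inside a thumbnail."""

    CONTAINERS = {("td", "infobox-image"): "infobox", ("div", "thumbinner"): "thumb"}

    def __init__(self):
        super().__init__()
        self.stack = []  # one (tag, kind) entry per open <td>/<div>; kind may be None
        self.found = {}  # kind -> src of the first image seen in that container

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("td", "div"):
            classes = (attrs.get("class") or "").split()
            kind = next(
                (k for (t, c), k in self.CONTAINERS.items() if t == tag and c in classes),
                None,
            )
            self.stack.append((tag, kind))
        elif tag == "img" and attrs.get("src"):
            for _, kind in self.stack:
                if kind:
                    self.found.setdefault(kind, attrs["src"])

    def handle_endtag(self, tag):
        # Close the most recently opened element with this tag name, if tracked.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def pick_image_url(html):
    """Prefer the info-box image, fall back to a thumbnail, else return None."""
    finder = ImageFinder()
    finder.feed(html)
    return finder.found.get("infobox") or finder.found.get("thumb")


def save_if_missing(path, download):
    """Call download() and save its bytes only when path does not exist yet."""
    path = Path(path)
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(download())
    return True
```

Here `download` is any callable returning the file's bytes. With the requests package it could be, for example, `lambda: requests.get(url).content`.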

Dependencies for download_wiki_pages.py

This is the environment on which the script download_wiki_pages.py was tested:

Platforms: macOS
Python: versions 3.7 and 3.8
requests: v2.28.1, for sending HTTP GET requests
beautifulsoup4: v4.11.1, for screen-scraping

Usage for download_wiki_pages.py

Run the script download_wiki_pages.py

Run the script by specifying the paths to the pickle file and the output directory where the downloaded Wikipedia pages will be saved:

$ python download_wiki_pages.py ~/Data/wikipedia/list_physicists_urls.pkl ~/Data/wikipedia/physicists/ --log-format only_msg --log-level debug

ℹ️ Explaining the arguments from the previous command:

~/Data/wikipedia/list_physicists_urls.pkl: pickle file containing the list of URLs to Wikipedia pages of theoretical physicists (see the previous script get_physicists_urls.py)
~/Data/wikipedia/physicists/: output directory where the downloaded Wikipedia pages will be saved
--log-format only_msg: display only the logging message without the timestamp or the logging level
--log-level debug: display all logging messages with the debug logging level

⭐ In order to stop the script at any moment, press ctrl + c.

List of options for download_wiki_pages.py

To display the script's list of options and their descriptions, use the -h option:

$ python download_wiki_pages.py -h

usage: python download_wiki_pages.py [OPTIONS] {input_pickle_file} {output_directory}

General options:

  -h, --help            Show this help message and exit.
  -v, --version         Show program's version number and exit.
  -q, --quiet           Enable quiet mode, i.e. nothing will be printed.
  --verbose             Print various debugging information, e.g. print traceback when there is an exception.
  --log-level           Set logging level: {debug,info,warning,error}. (default: info)
  --log-format          Set logging formatter: {console,only_msg,simple}. (default: simple)

HTTP requests options:

  -u, --user-agent USER_AGENT
                        User Agent. (default: Mozilla/5.0 (X11; Linux x86_64) ...)
  -t, --http-timeout TIMEOUT
                        HTTP timeout in seconds. (default: 120)
  -d, --delay-requests DELAY
                        Delay in seconds between HTTP requests. (default: 2)

⚠️ Don't use too short a delay (-d), e.g. 0.5 seconds between HTTP requests: you will overwhelm the server and your IP address will eventually get banned.
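For illustration, the HTTP options listed above could be declared with `argparse` roughly as follows. This is a sketch, not the script's actual parser: the argument types (`int`, `float`) and the function name `build_parser` are assumptions, and the default user agent is kept exactly as truncated in the help output.

```python
import argparse

# The full default string is elided in the help output, so it is left as-is here.
DEFAULT_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ..."


def build_parser():
    """Build a parser with the two required arguments and the HTTP options group."""
    parser = argparse.ArgumentParser(
        prog="python download_wiki_pages.py",
        description="Download Wikipedia pages of theoretical physicists",
    )
    parser.add_argument("input_pickle_file", help="Pickle file with the list of URLs.")
    parser.add_argument("output_directory", help="Directory for pages and images.")

    http = parser.add_argument_group("HTTP requests options")
    http.add_argument("-u", "--user-agent", metavar="USER_AGENT",
                      default=DEFAULT_USER_AGENT, help="User Agent.")
    http.add_argument("-t", "--http-timeout", metavar="TIMEOUT", type=int,
                      default=120, help="HTTP timeout in seconds.")
    http.add_argument("-d", "--delay-requests", metavar="DELAY", type=float,
                      default=2, help="Delay in seconds between HTTP requests.")
    return parser
```

The parsed values could then be handed to the requests package, for example `requests.get(url, headers={"User-Agent": args.user_agent}, timeout=args.http_timeout)`.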

⭐ The following are required input/output arguments:

input_pickle_file is the path to the pickle file containing the list of URLs to theoretical physicists' Wikipedia pages.
output_directory is the path to the directory where the Wikipedia pages and corresponding images will be saved.

ℹ️ Logging formatters to choose from:

console: %(asctime)s | %(levelname)-8s | %(message)s
only_msg: %(message)s
simple: %(levelname)-8s %(message)s
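The three formatters above are ordinary format strings for Python's built-in `logging` module. A minimal sketch of how a name such as only_msg could be mapped to a configured logger is shown below; the dictionary `LOG_FORMATS` and the function `make_logger` are hypothetical names, not taken from the script.

```python
import logging

# Format strings copied from the list of logging formatters above.
LOG_FORMATS = {
    "console": "%(asctime)s | %(levelname)-8s | %(message)s",
    "only_msg": "%(message)s",
    "simple": "%(levelname)-8s %(message)s",
}


def make_logger(name, log_format="simple", log_level="info"):
    """Return a logger writing to the console with the chosen format and level."""
    logger = logging.getLogger(name)
    logger.setLevel(getattr(logging, log_level.upper()))
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(LOG_FORMATS[log_format]))
    logger.handlers = [handler]  # replace any handlers from a previous call
    logger.propagate = False     # avoid duplicate output through the root logger
    return logger
```

With `make_logger("demo", "only_msg", "debug")`, a call to `logger.debug("page saved")` prints just the message, without timestamp or level.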


raul23/web-crawler
