raul23/web-crawler

Web crawler + scraper

Scripts

get_physicists_urls.py: Get list of URLs to Wikipedia pages of theoretical physicists

Starting from Category:Theoretical physicists, the script collects the absolute URLs of all theoretical physicists' Wikipedia pages. It processes the relative URLs listed in the section Pages in category "Theoretical physicists", then follows the "next page" link repeatedly until there is no next page left.

This script outputs a list of URLs to Wikipedia pages of theoretical physicists, which is saved as a pickle file.

ℹ️

  • The Python script can be found at get_physicists_urls.py
  • The Python script saves the list of URLs as a pickle file
  • For more information about the script's usage, check the Usage section.
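
The crawling step described above can be sketched as follows. This is only an illustration, not the script's actual code: it uses the standard-library html.parser instead of beautifulsoup4 so that it runs without extra dependencies, and it assumes that the category listing sits inside a `<div id="mw-pages">` element and that the pagination link has the text "next page". The names `CategoryParser` and `parse_category_page` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"


class CategoryParser(HTMLParser):
    """Collect article links and the 'next page' link from a category page.

    Assumption: the listing lives inside <div id="mw-pages">.
    """

    def __init__(self):
        super().__init__()
        self.div_depth = 0        # > 0 while inside <div id="mw-pages">
        self.current_href = None  # href of the <a> currently being read
        self.current_text = []
        self.page_links = []      # relative URLs of the listed pages
        self.next_page = None     # relative URL of the next listing page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth or attrs.get("id") == "mw-pages":
                self.div_depth += 1
        elif tag == "a" and self.div_depth and attrs.get("href"):
            self.current_href = attrs["href"]
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "a" and self.current_href is not None:
            text = "".join(self.current_text).strip()
            if text == "next page":
                self.next_page = self.current_href
            elif self.current_href.startswith("/wiki/"):
                self.page_links.append(self.current_href)
            self.current_href = None


def parse_category_page(html):
    """Return (absolute page URLs, absolute next-page URL or None)."""
    parser = CategoryParser()
    parser.feed(html)
    urls = [urljoin(BASE_URL, href) for href in parser.page_links]
    next_url = urljoin(BASE_URL, parser.next_page) if parser.next_page else None
    return urls, next_url
```

A crawl loop would call `parse_category_page` on each fetched listing page, extend the list of URLs, and continue with the returned next-page URL until it is `None`. Building the absolute URLs with `urljoin` keeps a single slash between the host and the `/wiki/...` path.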

Dependencies for get_physicists_urls.py

This is the environment in which the script get_physicists_urls.py was tested:

  • Platforms: macOS
  • Python: versions 3.7 and 3.8
  • beautifulsoup4: v4.11.1, for screen-scraping

ℹ️ The built-in module urllib is used for sending HTTP requests.

Usage for get_physicists_urls.py

Run the script get_physicists_urls.py

Run the script by specifying the path of the pickle file that will contain the list of URLs:

$ python get_physicists_urls.py ~/Data/wikipedia/list_physicists_urls.pkl -d 3

Showing the first 4 URLs in the list:

ipdb> list_physicists_urls[:4]

['https://en.wikipedia.org//wiki/Alexei_Abrikosov_(physicist)', 'https://en.wikipedia.org//wiki/Vadym_Adamyan', 'https://en.wikipedia.org//wiki/David_Adler_(physicist)', 'https://en.wikipedia.org//wiki/Diederik_Aerts']

ℹ️

  • ~/Data/wikipedia/list_physicists_urls.pkl: pickle file that will contain the list of URLs to Wikipedia pages of theoretical physicists
  • -d 3: three seconds between HTTP requests
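
Saving and reloading the list of URLs as a pickle file could look like the sketch below. The helper names `save_urls` and `load_urls` are hypothetical and not taken from the script; only the standard-library pickle module is assumed.

```python
import pickle
from pathlib import Path


def save_urls(urls, pickle_path):
    """Write the list of URLs to a pickle file, creating parent folders."""
    path = Path(pickle_path).expanduser()  # expand "~" as the shell would
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(list(urls), f)


def load_urls(pickle_path):
    """Read the list of URLs back from a pickle file."""
    with Path(pickle_path).expanduser().open("rb") as f:
        return pickle.load(f)
```

For example, `load_urls("~/Data/wikipedia/list_physicists_urls.pkl")[:4]` would return the first four URLs of the saved list. Only load pickle files that you created or trust, since unpickling can execute arbitrary code.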

List of options for get_physicists_urls.py

To display the script's list of options and their descriptions, use the -h option:

$ python get_physicists_urls.py -h

usage: python get_physicists_urls.py [OPTIONS] {pickle_file}

Get URLs to Wikipedia pages of theoretical physicists

positional arguments:
  pickle_file           Path to the pickle file that will contain the list of URLs to theoretical physicists' Wikipedia pages.

optional arguments:
  -h, --help            show this help message and exit
  -d DELAY, --delay-requests DELAY
                        Delay in seconds between HTTP requests. (default: 2)
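
A command-line interface with this help text could be declared with argparse roughly as follows. This is a reconstruction from the help output above, not the script's source: the function name `build_parser`, the `dest` name `delay`, and the `float` type of the delay are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the help text shown above."""
    parser = argparse.ArgumentParser(
        prog="python get_physicists_urls.py",
        usage="%(prog)s [OPTIONS] {pickle_file}",
        description="Get URLs to Wikipedia pages of theoretical physicists",
    )
    parser.add_argument(
        "pickle_file",
        help="Path to the pickle file that will contain the list of URLs "
             "to theoretical physicists' Wikipedia pages.",
    )
    parser.add_argument(
        "-d", "--delay-requests",
        metavar="DELAY",
        dest="delay",   # assumed attribute name
        type=float,     # assumed: fractional seconds are accepted
        default=2,
        help="Delay in seconds between HTTP requests. (default: %(default)s)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.pickle_file, args.delay)
```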

⚠️ Don't set the delay (-d) too short (e.g. 0.5 seconds between HTTP requests): you risk overwhelming the server and eventually getting your IP address banned.
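
Spacing out the requests can be sketched like this. The functions `fetch_html` and `fetch_all` are hypothetical, and the 30-second timeout is an assumed value; the `fetch` and `sleep` parameters exist only so that the loop can be tried without touching the network.

```python
import time
from urllib.request import urlopen


def fetch_html(url, timeout=30):
    """Download one page with the built-in urllib (timeout is an assumption)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, delay=2.0, fetch=fetch_html, sleep=time.sleep):
    """Fetch every URL in order, pausing `delay` seconds between requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # no pause is needed before the very first request
        pages[url] = fetch(url)
    return pages
```

With `delay=3`, fetching three pages pauses twice, for three seconds each time.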

download_wiki_pages.py: Download Wikipedia pages of theoretical physicists

This script takes as input a pickle file containing URLs to Wikipedia pages of theoretical physicists (see the previous script, get_physicists_urls.py).

ℹ️

  • The Python script can be found at download_wiki_pages.py
  • By default, there is a delay of 2 seconds between HTTP requests.
  • For more information about the script's usage, check the Usage section.

The script downloads the Wikipedia pages and their corresponding images in the following general steps:

  1. Load the pickle file containing the list of URLs, which was generated by the previous script
  2. For each URL,
    1. If the Wikipedia page (HTML only) is not already saved locally, download it with the requests package
    2. If the corresponding image is not already saved locally, download it, looking first in the info-box (i.e. in a <td> tag with the infobox-image class): e.g. Alexei Abrikosov
    3. If no image is found in the info-box, try to get it as a thumb picture (i.e. in a <div> tag with the thumbinner class): e.g. Oriol Bohigas Martí
  3. Every Wikipedia page (HTML) and its corresponding image are saved locally in a directory specified by the user
  4. Information useful to the casual user is printed to the console (important messages are colored, e.g. a warning that an image couldn't be downloaded); the rest, which is useful mainly for debugging, is hidden by the logger
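
The image lookup with its fallback (info-box first, then thumb picture) can be sketched as follows. This is an illustration under assumptions, not the script's code: it uses the standard-library html.parser rather than beautifulsoup4 to stay dependency-free, and the names `ImageFinder` and `find_image_url` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageFinder(HTMLParser):
    """Remember the first <img> in the info-box cell and in a thumb box."""

    def __init__(self):
        super().__init__()
        self.infobox_depth = 0  # > 0 while inside <td class="infobox-image">
        self.thumb_depth = 0    # > 0 while inside <div class="thumbinner">
        self.infobox_src = None
        self.thumb_src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "td" and (self.infobox_depth or "infobox-image" in classes):
            self.infobox_depth += 1
        elif tag == "div" and (self.thumb_depth or "thumbinner" in classes):
            self.thumb_depth += 1
        elif tag == "img" and attrs.get("src"):
            if self.infobox_depth and self.infobox_src is None:
                self.infobox_src = attrs["src"]
            elif self.thumb_depth and self.thumb_src is None:
                self.thumb_src = attrs["src"]

    def handle_endtag(self, tag):
        if tag == "td" and self.infobox_depth:
            self.infobox_depth -= 1
        elif tag == "div" and self.thumb_depth:
            self.thumb_depth -= 1


def find_image_url(html, page_url):
    """Return the absolute image URL, preferring the info-box, or None."""
    finder = ImageFinder()
    finder.feed(html)
    src = finder.infobox_src or finder.thumb_src
    # urljoin also resolves protocol-relative sources such as "//host/a.jpg"
    return urljoin(page_url, src) if src else None
```

When `find_image_url` returns `None`, the caller would print the warning that no image could be downloaded for that page.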

Dependencies for download_wiki_pages.py

This is the environment in which the script download_wiki_pages.py was tested:

  • Platforms: macOS
  • Python: versions 3.7 and 3.8
  • requests: v2.28.1, for sending HTTP GET requests
  • beautifulsoup4: v4.11.1, for screen-scraping

Usage for download_wiki_pages.py

Run the script download_wiki_pages.py

Run the script by specifying the paths to the pickle file and the output directory where the downloaded Wikipedia pages will be saved:

$ python download_wiki_pages.py ~/Data/wikipedia/list_physicists_urls.pkl ~/Data/wikipedia/physicists/ --log-format only_msg --log-level debug

ℹ️ Explaining the arguments from the previous command:

  • ~/Data/wikipedia/list_physicists_urls.pkl: pickle file containing the list of URLs to Wikipedia pages of theoretical physicists (see the previous script, get_physicists_urls.py)
  • ~/Data/wikipedia/physicists/: output directory where the downloaded Wikipedia pages will be saved
  • --log-format only_msg: display only the logging message, without the timestamp or the logging level
  • --log-level debug: display all logging messages, down to the debug logging level

To stop the script at any moment, press Ctrl+C.

List of options for download_wiki_pages.py

To display the script's list of options and their descriptions, use the -h option:

$ python download_wiki_pages.py -h

usage: python download_wiki_pages.py [OPTIONS] {input_pickle_file} {output_directory}

General options:

  -h, --help            Show this help message and exit.
  -v, --version         Show program's version number and exit.
  -q, --quiet           Enable quiet mode, i.e. nothing will be printed.
  --verbose             Print various debugging information, e.g. print traceback when there is an exception.
  --log-level           Set logging level: {debug,info,warning,error}. (default: info)
  --log-format          Set logging formatter: {console,only_msg,simple}. (default: simple)

HTTP requests options:

  -u, --user-agent USER_AGENT
                        User Agent. (default: Mozilla/5.0 (X11; Linux x86_64) ...)
  -t, --http-timeout TIMEOUT
                        HTTP timeout in seconds. (default: 120)
  -d, --delay-requests DELAY
                        Delay in seconds between HTTP requests. (default: 2)

⚠️ Don't set the delay (-d) too short (e.g. 0.5 seconds between HTTP requests): you risk overwhelming the server and eventually getting your IP address banned.

The following are the required input/output arguments:

  • input_pickle_file is the path to the pickle file containing the list of URLs to theoretical physicists' Wikipedia pages.
  • output_directory is the path to the directory where the Wikipedia pages and corresponding images will be saved.
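
Skipping pages that are already present in output_directory could be sketched as below. The file naming (last URL segment plus `.html`) and the helper `save_if_missing` are hypothetical and may differ from the layout the script actually uses; the download itself is passed in as a `fetch` callable that returns bytes.

```python
from pathlib import Path


def save_if_missing(url, output_dir, fetch, suffix=".html"):
    """Download `url` into `output_dir` unless the file is already there.

    Returns (path, downloaded), where `downloaded` is False when the file
    already existed and no request was made.
    """
    output_dir = Path(output_dir).expanduser()
    output_dir.mkdir(parents=True, exist_ok=True)
    name = url.rstrip("/").rsplit("/", 1)[-1]  # assumed naming scheme
    path = output_dir / f"{name}{suffix}"
    if path.exists():
        return path, False
    path.write_bytes(fetch(url))
    return path, True
```

With the requests package, `fetch` could be something like `lambda url: requests.get(url, timeout=120).content`, ideally after checking the response status. Rerunning the script then only downloads the pages that are still missing.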

ℹ️ Logging formatters to choose from:

  • console: %(asctime)s | %(levelname)-8s | %(message)s
  • only_msg: %(message)s
  • simple: %(levelname)-8s %(message)s
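
These three format strings plug directly into the standard logging module. The sketch below shows one possible wiring; the function `setup_logger` and the logger name are hypothetical, while the format strings are the ones listed above.

```python
import logging

LOG_FORMATS = {
    "console": "%(asctime)s | %(levelname)-8s | %(message)s",
    "only_msg": "%(message)s",
    "simple": "%(levelname)-8s %(message)s",
}


def setup_logger(name, log_format="simple", log_level="info"):
    """Return a logger that prints to the console with the chosen format."""
    logger = logging.getLogger(name)
    logger.setLevel(getattr(logging, log_level.upper()))
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(LOG_FORMATS[log_format]))
    logger.handlers = [handler]  # replace handlers left by earlier calls
    logger.propagate = False     # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = setup_logger("download_wiki_pages", "only_msg", "debug")
    log.debug("shown because the level is debug")
```

The `-8s` in the console and simple formats left-aligns the level name in an eight-character column, so messages line up regardless of the level.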