The all-in-one Python package for seamless newspaper article indexing, scraping, and processing – supports public and premium content!
While tools like newspaper3k and goose3 can extract articles from news websites, they need a dedicated article URL to reach older articles and do not support paywalled content. This package aims to solve these issues by providing a unified interface for indexing, extracting, and processing newspaper articles.
- Indexing: Index articles from a newspaper website, using the Beautiful Soup package for public articles and Selenium for paywalled content.
- Extraction: Extract article content using the goose3 package.
- Processing: Process articles for NLP features using the spaCy package.
The indexing functionality is based on a dedicated file for each newspaper. A few newspapers are already supported, but it is easy to add new ones.
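To make the idea of indexing concrete, here is a minimal sketch of collecting article URLs from a single archive page. It is not the package's implementation: it uses the standard library's `html.parser` in place of Beautiful Soup to stay dependency-free, and the archive HTML, the `news.example.com` domain, and the `/article/` URL marker are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ArticleLinkParser(HTMLParser):
    """Collect absolute, de-duplicated URLs of all <a href> links in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            url = urljoin(self.base_url, href)
            if url not in self.links:
                self.links.append(url)


def index_archive_page(html, base_url, marker="/article/"):
    """Return article URLs (those containing `marker`) in page order."""
    parser = ArticleLinkParser(base_url)
    parser.feed(html)
    return [url for url in parser.links if marker in url]


# Made-up archive page for one day; real newspaper archives differ per site,
# which is why each newspaper needs its own dedicated indexing logic.
SAMPLE_ARCHIVE = """
<ul>
  <li><a href="/article/a-1.html">First story</a></li>
  <li><a href="/article/a-2.html">Second story</a></li>
  <li><a href="/about">About us</a></li>
  <li><a href="/article/a-1.html">First story (teaser)</a></li>
</ul>
"""

article_urls = index_archive_page(SAMPLE_ARCHIVE, "https://news.example.com")
```

A per-newspaper file would mainly differ in how archive pages are addressed by date and in which links count as articles.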
Newspaper | Country | Time span | Number of articles |
---|---|---|---|
Der Spiegel | Germany | Since 2000 | tbd |
Die Welt | Germany | Since 2000 | tbd |
Bild | Germany | Since 2006 | tbd |
Die Zeit | Germany | Since 1946 | tbd |
Handelsblatt | Germany | Since 2003 | tbd |
Der Tagesspiegel | Germany | Since 2000 | tbd |
Süddeutsche Zeitung | Germany | Since 2001 | tbd |
It is recommended to install the package in a dedicated Python environment.
To install the package via pip, run the following command:
pip install newspaper-scraper
To also include the NLP extraction functionality (via spaCy), run the following command:
pip install newspaper-scraper[nlp]
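After installing, it can be useful to confirm which parts are importable in the active environment. The sketch below relies only on the standard library's `importlib.util.find_spec`; the helper names `is_installed` and `check_install` are illustrative and not part of the package.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return find_spec(module_name) is not None


def check_install():
    """Report which parts of the installation are importable."""
    return {
        "newspaper_scraper": is_installed("newspaper_scraper"),
        # spaCy is the dependency behind the optional [nlp] extra.
        "spacy": is_installed("spacy"),
    }
```

If `check_install()["spacy"]` is `False`, the NLP step is likely unavailable until the `[nlp]` extra is installed.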
To index, extract and process all public and premium articles from Der Spiegel, published in August 2021, run the following code:
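import newspaper_scraper as nps
# credentials is assumed to be a local module of your own defining username and password
from credentials import username, password

with nps.Spiegel(db_file='articles.db') as news:
    # Index all articles published in August 2021
    news.index_articles_by_date_range('2021-08-01', '2021-08-31')
    # Scrape public articles first, then premium ones using the login
    news.scrape_public_articles()
    news.scrape_premium_articles(username=username, password=password)
    # Compute NLP features for the scraped articles
    news.nlp()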
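The `credentials` module imported in the example is not described further here; it is presumably a small local file that you provide yourself. One possible sketch, assuming you prefer to keep the login out of source control, reads it from environment variables. The variable names `NEWSPAPER_USERNAME` and `NEWSPAPER_PASSWORD` and the helper `read_credentials` are hypothetical.

```python
import os

# Hypothetical variable names; choose whatever suits your setup.
REQUIRED_VARS = ("NEWSPAPER_USERNAME", "NEWSPAPER_PASSWORD")


def read_credentials(env=None):
    """Return (username, password) taken from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["NEWSPAPER_USERNAME"], env["NEWSPAPER_PASSWORD"]
```

Saved as `credentials.py` with a final line `username, password = read_credentials()`, this would satisfy the import, and the Der Spiegel example above runs unchanged.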
Running the Der Spiegel example creates a SQLite database file called articles.db in the current working directory. The database contains the following tables:
- tblArticlesIndexed: Contains all indexed articles with their scraping/processing status and whether they are public or premium content.
- tblArticlesScraped: Contains metadata for all parsed articles, provided by goose3.
- tblArticlesProcessed: Contains NLP features of the cleaned article text, provided by spaCy.
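Because the result is an ordinary SQLite file, it can be inspected with Python's built-in `sqlite3` module. The sketch below only assumes the three table names listed above; the column layout is not documented here, so it counts rows rather than selecting specific columns. The helper name `count_rows` is illustrative.

```python
import sqlite3

EXPECTED_TABLES = (
    "tblArticlesIndexed",
    "tblArticlesScraped",
    "tblArticlesProcessed",
)


def count_rows(conn, tables=EXPECTED_TABLES):
    """Return {table: row count} for each expected table that exists."""
    existing = {
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    counts = {}
    for table in tables:
        if table in existing:
            # Only names confirmed via sqlite_master are interpolated here.
            query = f'SELECT COUNT(*) FROM "{table}"'
            counts[table] = conn.execute(query).fetchone()[0]
    return counts
```

For example, `count_rows(sqlite3.connect('articles.db'))` gives a quick view of how many articles have been indexed, scraped, and processed so far. Tables that have not been created yet are simply omitted from the result.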