This project is a Python script that scrapes news articles from news.crunchbase.com. It uses the Playwright library to drive a real web browser, which lets it handle dynamically loaded content and makes it less likely to be blocked by anti-bot measures than plain HTTP requests.
- Powered by Playwright: Ensures reliable data scraping from modern, JavaScript-heavy websites.
- Full Article Scraping: Navigates to each article's page to extract the complete text.
- Data Cleaning: Removes junk elements (e.g., social sharing buttons) from the beginning of the article text.
- Duplicate Prevention: On each run, the script checks for existing articles and only adds new ones, making it ideal for scheduled execution.
- JSON Output: All scraped data is saved to crunchbase_articles_clean.json in a clean, human-readable format.
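The data-cleaning step listed above can be illustrated with a minimal sketch. It is not the script's actual code: the function name `strip_leading_junk` and the `JUNK_LINES` set are hypothetical, and the sketch assumes that junk appears as short standalone lines (such as share-button labels) before the first real line of the article.

```python
# Hypothetical junk labels; the real script's list is not shown in this README.
JUNK_LINES = {"share", "tweet", "email", "linkedin", "facebook"}


def strip_leading_junk(text: str) -> str:
    """Drop blank and junk-only lines from the start of the article text."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        label = line.strip().lower()
        # The first line that is neither blank nor a known junk label
        # marks the start of the real article body.
        if label and label not in JUNK_LINES:
            return "\n".join(lines[index:]).strip()
    # Every line was blank or junk (or the text was empty).
    return ""
```

Matching whole lines rather than prefixes keeps a real sentence such as "Share prices rose." from being discarded.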
- Clone the repository (or use your existing project files):
  git clone <your-repository-url>
  cd <repository-folder>
- Create and activate a virtual environment:
  python3 -m venv .venv
  source .venv/bin/activate
  On Windows, use .venv\Scripts\activate
- Install the required dependencies:
  pip install -r requirements.txt
- Install the Playwright browsers:
  playwright install
To run the scraper, execute the following command in your terminal:
python crunchbase_scraper_playwright.py
The script will start, log its progress to the console, and finish by printing a summary of the articles checked and added.
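The "checked and added" summary follows from the duplicate check. The sketch below shows one way such a merge could work. It assumes that articles are matched by their `link` field and that the output file holds a JSON array; neither detail is confirmed by this README, and the function names `merge_articles` and `update_file` are hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path


def merge_articles(existing: list[dict], scraped: list[dict]) -> int:
    """Append unseen articles to `existing` in place; return how many were added."""
    seen = {article["link"] for article in existing}  # assumed dedupe key
    added = 0
    for article in scraped:
        if article["link"] not in seen:
            existing.append(article)
            seen.add(article["link"])
            added += 1
    return added


def update_file(path: Path, scraped: list[dict]) -> tuple[int, int]:
    """Merge `scraped` into the JSON file at `path`; return (checked, added)."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    added = merge_articles(existing, scraped)
    # indent=2 keeps the file human-readable; ensure_ascii=False preserves non-ASCII text.
    path.write_text(json.dumps(existing, indent=2, ensure_ascii=False), encoding="utf-8")
    return len(scraped), added
```

Calling `update_file(Path("crunchbase_articles_clean.json"), scraped)` would return the two counts for the final summary. Running it twice with the same input adds nothing the second time, which is what makes scheduled execution safe.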
The results are saved in crunchbase_articles_clean.json. Each article is a JSON object with the following fields:
- title: The title of the article.
- link: A direct URL to the article page.
- full_text: The complete text of the article, cleaned of HTML tags and other non-content elements.
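As a rough illustration of consuming the output, the sketch below parses the file contents and checks that each record carries the three documented fields. It assumes the file's top level is a JSON array of article objects, which this README does not state explicitly. The helper `load_articles` and the sample record are placeholders, not real scraped data.

```python
from __future__ import annotations

import json

REQUIRED_FIELDS = ("title", "link", "full_text")


def load_articles(raw_json: str) -> list[dict]:
    """Parse output-file contents and verify each article has the documented fields."""
    articles = json.loads(raw_json)
    for index, article in enumerate(articles):
        missing = [field for field in REQUIRED_FIELDS if field not in article]
        if missing:
            raise ValueError(f"article {index} is missing fields: {missing}")
    return articles


# Placeholder record showing the expected shape of one entry.
SAMPLE = """
[
  {
    "title": "Example headline",
    "link": "https://news.crunchbase.com/<article-path>/",
    "full_text": "Example body text."
  }
]
"""
```

With the real file, the same check could be run as `load_articles(open("crunchbase_articles_clean.json", encoding="utf-8").read())`.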