zaharenok/crunchbase-news-scraper

Crunchbase News Scraper

This project is a Python script that scrapes news articles from news.crunchbase.com. It drives a real browser through the Playwright library, so it can render dynamically loaded content and is less likely to be blocked by basic anti-bot checks than a plain HTTP client.
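As a rough illustration of the browser-driven approach, the sketch below loads a page with Playwright's synchronous API and returns its rendered text. The function name `fetch_page_text`, the `body` selector, and the timeout value are assumptions made for this example; the actual script may use different selectors and waiting logic.

```python
def fetch_page_text(url: str, timeout_ms: int = 30_000) -> str:
    """Return the visible text of a page after the browser has loaded it.

    Hypothetical helper: the real scraper's selectors and waits are not
    documented here, so this reads the whole <body> as a stand-in.
    """
    # Imported lazily so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait for the DOM to be parsed before reading any text.
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            return page.inner_text("body")
        finally:
            browser.close()
```

Calling `fetch_page_text("https://news.crunchbase.com/")` requires the browsers installed by `playwright install` and a working network connection.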

Features

  • Powered by Playwright: Ensures reliable data scraping from modern, JavaScript-heavy websites.
  • Full Article Scraping: Navigates to each article's page to extract the complete text.
  • Data Cleaning: Removes junk elements (e.g., social sharing buttons) from the beginning of the article text.
  • Duplicate Prevention: On each run, the script checks for existing articles and only adds new ones, making it ideal for scheduled execution.
  • JSON Output: All scraped data is saved to crunchbase_articles_clean.json in a clean, human-readable format.
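The duplicate check and JSON output described above could look roughly like the following. This is a minimal sketch, not the script's actual code: it assumes the output file holds a JSON array of article objects and that `link` is the key used to recognize an article that was already saved. The function names are hypothetical.

```python
import json
from pathlib import Path

OUTPUT_FILE = Path("crunchbase_articles_clean.json")


def load_articles(path: Path) -> list[dict]:
    """Return previously saved articles, or an empty list on the first run."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def merge_new_articles(existing: list[dict], scraped: list[dict]) -> tuple[list[dict], int]:
    """Append only articles whose link is not already stored (assumed key)."""
    seen = {article["link"] for article in existing}
    merged = list(existing)
    added = 0
    for article in scraped:
        if article["link"] not in seen:
            seen.add(article["link"])
            merged.append(article)
            added += 1
    return merged, added


def save_articles(path: Path, articles: list[dict]) -> None:
    """Write articles as indented, human-readable JSON."""
    path.write_text(
        json.dumps(articles, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
```

Because the merge keeps existing entries and skips links it has already seen, running it repeatedly (for example from a scheduler) only grows the file when new articles appear.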

Installation and Setup

  1. Clone the repository (or use your existing project files):

    git clone <your-repository-url>
    cd <repository-folder>

  2. Create and activate a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate

    On Windows, use .venv\Scripts\activate

  3. Install the required dependencies:

    pip install -r requirements.txt

  4. Install the Playwright browsers:

    playwright install

Usage

To run the scraper, execute the following command in your terminal:

    python crunchbase_scraper_playwright.py

The script will start, log its progress to the console, and finish by printing a summary of the articles checked and added.

Data Format

The results are saved in crunchbase_articles_clean.json. Each article is a JSON object with the following fields:

  • title: The title of the article.
  • link: A direct URL to the article page.
  • full_text: The complete text of the article, cleaned of HTML tags and other non-content elements.
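To make the record shape concrete, here is a small sketch that parses the output and checks that each record carries the three fields listed above. It assumes the file's top level is a JSON array; the `Article` type, the `parse_articles` helper, and the sample values are invented for illustration and are not real scraped data.

```python
import json
from typing import TypedDict


class Article(TypedDict):
    title: str
    link: str
    full_text: str


REQUIRED_FIELDS = ("title", "link", "full_text")

# Placeholder record showing the expected shape (values are made up).
SAMPLE = """
[
  {
    "title": "Example headline",
    "link": "https://example.com/article",
    "full_text": "Example body text."
  }
]
"""


def parse_articles(raw_json: str) -> list[Article]:
    """Parse a JSON array of articles and verify the documented fields exist."""
    articles: list[Article] = []
    for index, item in enumerate(json.loads(raw_json)):
        missing = [field for field in REQUIRED_FIELDS if field not in item]
        if missing:
            raise ValueError(f"record {index} is missing fields: {missing}")
        articles.append(
            Article(title=item["title"], link=item["link"], full_text=item["full_text"])
        )
    return articles
```

To check a real run, pass the contents of crunchbase_articles_clean.json to `parse_articles` in place of `SAMPLE`.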
