Crunchbase News Scraper

This project is a Python script that scrapes news articles from news.crunchbase.com. It uses the Playwright library to drive a real web browser, which lets it handle dynamically loaded content and get past anti-bot measures.
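
As a rough illustration of the browser-driven approach described above, the sketch below opens a page in headless Chromium and returns the text of one element. It is not taken from crunchbase_scraper_playwright.py: the function name fetch_article_text, the "article" selector, and the timeout value are all assumptions made for this example.

```python
def fetch_article_text(url, selector="article", timeout_ms=30000):
    """Open url in headless Chromium and return the text of the first match.

    The default selector is an assumption; the real page may need another one.
    """
    # Imported here so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait for the DOM to load before looking for the element.
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            # inner_text() waits for the element to appear, up to the timeout.
            return page.locator(selector).first.inner_text(timeout=timeout_ms)
        finally:
            browser.close()
```

Calling fetch_article_text with an article URL should return that page's visible text, provided Playwright and its browsers are installed and the selector matches the page.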

Features

Powered by Playwright: Renders each page in a real browser, so content loaded by JavaScript is available for scraping.
Full Article Scraping: Navigates to each article's page to extract the complete text.
Data Cleaning: Removes junk elements (e.g., social sharing buttons) from the beginning of the article text.
Duplicate Prevention: On each run, the script checks for existing articles and only adds new ones, making it ideal for scheduled execution.
JSON Output: All scraped data is saved to crunchbase_articles_clean.json in a clean, human-readable format.
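
The duplicate check and JSON output can be pictured with a minimal sketch like the one below. It assumes the output file holds a JSON array and that an article is identified by its link. Both are assumptions, as are the helper names load_articles, merge_new_articles, and save_articles; the actual script may work differently.

```python
import json
from pathlib import Path

OUTPUT_FILE = Path("crunchbase_articles_clean.json")


def load_articles(path):
    """Return previously saved articles, or an empty list on the first run."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def merge_new_articles(existing, scraped):
    """Append scraped articles whose link is not already stored.

    Returns the merged list and the number of articles added.
    """
    seen = {article["link"] for article in existing}
    merged = list(existing)
    added = 0
    for article in scraped:
        if article["link"] not in seen:
            merged.append(article)
            seen.add(article["link"])
            added += 1
    return merged, added


def save_articles(path, articles):
    """Write articles as indented, human-readable JSON."""
    with path.open("w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)
```

A run would then call load_articles(OUTPUT_FILE), merge the freshly scraped list into it, and save the result, so repeated runs never store the same link twice.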

Installation and Setup

Clone the repository (or use your existing project files):
```
git clone <your-repository-url>
cd <repository-folder>
```
Create and activate a virtual environment:
```
python3 -m venv .venv
source .venv/bin/activate
```
On Windows, use .venv\Scripts\activate
Install the required dependencies:
```
pip install -r requirements.txt
```
Install the Playwright browsers:
```
playwright install
```

Usage

To run the scraper, execute the following command in your terminal:

python crunchbase_scraper_playwright.py

The script will start, log its progress to the console, and finish by printing a summary of the articles checked and added.

Data Format

The results are saved in crunchbase_articles_clean.json. Each article is a JSON object with the following fields:

title: The title of the article.
link: A direct URL to the article page.
full_text: The complete text of the article, cleaned of HTML tags and other non-content elements.
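
For anyone consuming the output, the following sketch loads the file and checks that every record carries the three fields listed above. It assumes the file's top level is a JSON array of article objects, which the description above does not state explicitly; the function name read_articles is likewise hypothetical.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = ("title", "link", "full_text")


def read_articles(path="crunchbase_articles_clean.json"):
    """Load the output file and verify each record has the documented fields."""
    articles = json.loads(Path(path).read_text(encoding="utf-8"))
    for index, article in enumerate(articles):
        missing = [field for field in REQUIRED_FIELDS if field not in article]
        if missing:
            raise ValueError(f"article {index} is missing fields: {missing}")
    return articles
```

Once the scraper has run, read_articles() returns a list of dictionaries, so a loop such as "for article in read_articles(): print(article['title'])" would list the stored titles.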
