This project integrates a Scrapy spider with a Flask API to scrape articles, store them in Google BigQuery, and expose search functionality through a REST API.
- Python 3.x
- Google Cloud SDK (with BigQuery API enabled)
- Create a project/dataset/table in BigQuery
- Google Cloud credentials JSON file
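The dataset and table can be created from Python instead of the console. The sketch below is a minimal, assumed setup: the project/dataset/table names and the article schema (`title`, `url`, `content`, `published_at`) are placeholders, not values taken from this repository.

```python
# Sketch: create the BigQuery dataset and table this project writes to.
# All names and the schema are assumptions -- adjust to your own project.

PROJECT_ID = "my-gcp-project"    # assumed placeholder
DATASET_ID = "articles_dataset"  # assumed placeholder
TABLE_ID = "articles"            # assumed placeholder

# Plain (name, type) pairs so the schema can be inspected without a client.
ARTICLE_SCHEMA = [
    ("title", "STRING"),
    ("url", "STRING"),
    ("content", "STRING"),
    ("published_at", "TIMESTAMP"),
]


def create_table():
    """Create the dataset and table if they do not already exist."""
    # Imported here so the schema above is usable without GCP libraries.
    from google.cloud import bigquery

    client = bigquery.Client(project=PROJECT_ID)
    client.create_dataset(f"{PROJECT_ID}.{DATASET_ID}", exists_ok=True)
    schema = [bigquery.SchemaField(name, type_) for name, type_ in ARTICLE_SCHEMA]
    table = bigquery.Table(f"{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}", schema=schema)
    return client.create_table(table, exists_ok=True)


if __name__ == "__main__":
    create_table()
```

Both `create_dataset` and `create_table` accept `exists_ok=True`, so the script is safe to re-run.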
- Clone the repository:
git clone https://github.com/Frankson18/scrapy_articles
cd scrapy_articles
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
- Install the dependencies:
pip install -r requirements.txt
- Configure environment variables:
Update the following variables in `settings.py` and `app.py` with your actual Google Cloud project details:
- BIGQUERY_PROJECT_ID
- BIGQUERY_DATASET_ID
- BIGQUERY_TABLE_ID
- GOOGLE_APPLICATION_CREDENTIALS (path to your JSON key file)
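A minimal sketch of what that configuration might look like, assuming the variable names listed above; the concrete values and the key-file path are placeholders you must replace:

```python
# settings.py (excerpt) -- values below are assumed placeholders
import os

BIGQUERY_PROJECT_ID = "my-gcp-project"
BIGQUERY_DATASET_ID = "articles_dataset"
BIGQUERY_TABLE_ID = "articles"

# Point the Google client libraries at your service-account key file.
# setdefault keeps a value already exported in the shell, if any.
os.environ.setdefault(
    "GOOGLE_APPLICATION_CREDENTIALS", "/path/to/service-account.json"
)
```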
- Navigate to the project directory:
cd articlescraper
- Run the Scrapy spider:
scrapy crawl newsscrapper
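The spider's items reach BigQuery through a Scrapy item pipeline. The sketch below shows one plausible shape for such a pipeline, streaming each item with `insert_rows_json`; the class name and the way settings are read are assumptions, not the repository's actual code:

```python
# Sketch of a Scrapy item pipeline that streams articles into BigQuery.
# Class name and settings lookup are assumptions about this project.

class BigQueryPipeline:
    def __init__(self, project_id, dataset_id, table_id):
        self.table_ref = f"{project_id}.{dataset_id}.{table_id}"
        self.client = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the running crawler's settings.
        s = crawler.settings
        return cls(
            s.get("BIGQUERY_PROJECT_ID"),
            s.get("BIGQUERY_DATASET_ID"),
            s.get("BIGQUERY_TABLE_ID"),
        )

    def open_spider(self, spider):
        # Imported lazily so the module loads without GCP libraries installed.
        from google.cloud import bigquery
        self.client = bigquery.Client()

    def process_item(self, item, spider):
        # insert_rows_json appends rows via the streaming API and returns
        # a list of per-row errors; an empty list means success.
        errors = self.client.insert_rows_json(self.table_ref, [dict(item)])
        if errors:
            spider.logger.error("BigQuery insert failed: %s", errors)
        return item
```

To activate a pipeline like this, it would be registered under `ITEM_PIPELINES` in `settings.py`.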
- Set environment variables and run the Flask app:
export FLASK_APP=app.py
export FLASK_ENV=development
flask run
- On Windows (Command Prompt):
set FLASK_APP=app.py
set FLASK_ENV=development
flask run
- Access the API:
http://127.0.0.1:5000/search?q=example
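The endpoint above could be implemented along these lines. This is a hedged sketch, not the repository's actual `app.py`: the table name and the `content` column are assumptions, and the query is parameterized so the user-supplied `q` value cannot inject SQL:

```python
# Sketch of a /search endpoint backed by a parameterized BigQuery query.
# TABLE and the `content`/`title`/`url` columns are assumed placeholders.
from flask import Flask, jsonify, request

app = Flask(__name__)
TABLE = "my-gcp-project.articles_dataset.articles"  # assumed placeholder


def build_search_query(table: str) -> str:
    """Case-insensitive substring search; @q is bound at execution time."""
    return (
        f"SELECT title, url FROM `{table}` "
        "WHERE LOWER(content) LIKE CONCAT('%', LOWER(@q), '%')"
    )


@app.route("/search")
def search():
    q = request.args.get("q", "")
    # Imported lazily so the module loads without GCP libraries installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    job = client.query(
        build_search_query(TABLE),
        job_config=bigquery.QueryJobConfig(
            query_parameters=[bigquery.ScalarQueryParameter("q", "STRING", q)]
        ),
    )
    return jsonify([dict(row) for row in job.result()])
```

Binding `q` through `ScalarQueryParameter` rather than string formatting is what keeps the endpoint safe against SQL injection.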