Exploratory data analysis of book details, with data scraped from several sources, mostly Spanish sites.
I am a book lover and, above all, passionate about fiction, literary fiction in particular, though not exclusively. So I decided this interest of mine could be the spark for a data science pet project, where I could build end-to-end ETL and analysis pipelines and extract some interesting insights about books, apart from reading (and enjoying) them.
There are few book-related datasets around. Some are on Kaggle, such as this one or this other one. They mostly cover genre fiction or Amazon book data... in English. I have found no datasets focused solely on books in Spanish or on the Spanish book market, so I decided to gather the data myself and learn a few new things in the process.
The selection of suitable sites from which to retrieve data about books published in Spain is quite limited, more so since the Goodreads API was deprecated in December 2020. Among the available options, the most straightforward is Todos tus libros, a site put up by Cegal, the Spanish bookshops association, to publicize books and their availability as part of a campaign encouraging readers to buy locally. The site (as stated in its 'Who we are' section) includes information (sometimes incomplete) on more than 4 million books, and counting.
The site allows searches by author, title, publisher, ISBN or publication date, and books are tagged with one or more categories ('materias'). For the exploratory analysis I initially chose the category corresponding to literary fiction, 'Ficción moderna y contemporánea', which contains about 100,000 books. The book details were scraped and saved to a JSON file, including the following information:
- Title
- Authors
- Publisher
- Price (if available)
- Publishing country
- Publishing language
- Original language
- ISBN
- EAN
- Publication date
- Type of binding
- Number of pages
- Number of bookstores where the book is available (at the time of the dataset generation)
- Tags: a compendium of genre, language, style, etc.
- Book cover URL
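To give an idea of what working with the output looks like, here is a minimal sketch of loading such a file into pandas. The record below and its keys are hypothetical, inferred from the field list above; they are not the actual schema produced by the scraper.

```python
import pandas as pd

# A hypothetical sample record; the keys are assumptions based on the
# field list above, NOT the real schema emitted by the crawler.
records = [{
    "title": "Example Novel",
    "authors": ["Jane Doe"],
    "publisher": "Example Press",
    "price": 19.90,
    "publishing_country": "ES",
    "publishing_language": "es",
    "original_language": "en",
    "isbn": "978-84-0000-000-0",
    "ean": "9788400000000",
    "publication_date": "2021-03-15",
    "binding": "paperback",
    "pages": 320,
    "bookstores": 42,
    "tags": ["Ficción moderna y contemporánea"],
    "cover_url": "https://example.com/cover.jpg",
}]

# json_normalize flattens a list of JSON records into a tabular DataFrame;
# for a real file you would first read it with json.load().
df = pd.json_normalize(records)
print(df[["title", "publisher", "pages"]])
```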
For exhaustive details about the scraping process, please check my repo ermine-book-data-scraping.
This repo uses a pipenv virtual environment, so you can either install pipenv and recreate the environment, or install a few Python packages in the Python environment of your choice to run the notebooks and the rest of the code:
- jupyter (or jupyterlab)
- pandas
- matplotlib
- seaborn
- plotly
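If you go the manual route, a quick way to check that the packages above are importable in your environment is a small snippet like this (the import names match the package names listed above):

```python
import importlib.util

# Packages needed to run the notebooks, as listed above
required = ["pandas", "matplotlib", "seaborn", "plotly"]

# find_spec returns None when a package is not importable
missing = [pkg for pkg in required if importlib.util.find_spec(pkg) is None]
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All notebook dependencies are available.")
```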
- Clone this repo (for help see this tutorial).
- Recreate the pipenv virtual environment using:

  ```shell
  pipenv sync --dev
  ```
- Raw data is being kept here within this repo. The data, as a JSON file, is generated by the scrapy crawler in ermine-book-data-scraping.
- Data processing/transformation notebooks are being kept here.
This project is licensed under the Creative Commons License; see the LICENSE.md file for details.