Exploratory data analysis of book details, with data scraped from several sources, mostly Spanish sites.
I am a book lover and, above all, passionate about fiction, literary fiction in particular, though not exclusively. So I decided this interest of mine could be the spark for a data science pet project, where I could build end-to-end ETL and analysis pipelines and extract some interesting insights about books, apart from reading (and enjoying) them.
There are few book-related datasets around. Some are on Kaggle, such as this one or this other one. They mostly cover genre fiction or Amazon book data... in English. I have found no datasets focused solely on books in Spanish or on the Spanish book market, so I decided to gather the data myself and learn a few new things in the process.
The selection of suitable sites from which to retrieve data about books published in Spain is quite limited, more so since the Goodreads API was deprecated in December 2020. Among the available options, the most straightforward is Todos tus libros, a site put up by Cegal, the Spanish bookshops association, to publicize books and their availability as part of a campaign encouraging readers to buy locally. The site (as stated in its 'Who we are' section) includes information (sometimes incomplete) on more than 4 million books, and counting.
The site allows searches by author, title, publisher, ISBN or publication date, and books are tagged with one or more categories ('materias'). For the exploratory analysis I initially chose the category corresponding to literary fiction, 'Ficción moderna y contemporánea', which contains about 100,000 books. The book details were scraped and saved to a JSON file, including the following information:
- Title
- Authors
- Publisher
- Price (if available)
- Publishing country
- Publishing language
- Original language
- ISBN
- EAN
- Publication date
- Type of binding
- Number of pages
- Number of bookstores where the book is available (at the time of the dataset generation)
- Tags: a compendium of genre, language, style, etc.
- Book cover URL
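To give an idea of what working with the output looks like, here is a minimal sketch of loading such a file into pandas. The record below and its keys are hypothetical, inferred from the field list above; they are not the actual schema produced by the scraper.

```python
import pandas as pd

# A hypothetical sample record; the keys are assumptions based on the
# field list above, NOT the real schema emitted by the crawler.
records = [{
    "title": "Example Novel",
    "authors": ["Jane Doe"],
    "publisher": "Example Press",
    "price": 19.90,
    "publishing_country": "ES",
    "publishing_language": "es",
    "original_language": "en",
    "isbn": "978-84-0000-000-0",
    "ean": "9788400000000",
    "publication_date": "2021-03-15",
    "binding": "paperback",
    "pages": 320,
    "bookstores": 42,
    "tags": ["Ficción moderna y contemporánea"],
    "cover_url": "https://example.com/cover.jpg",
}]

# json_normalize flattens a list of JSON records into a tabular DataFrame;
# for a real file you would first read it with json.load().
df = pd.json_normalize(records)
print(df[["title", "publisher", "pages"]])
```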
For exhaustive details about the scraping process, please check my repo ermine-book-data-scraping.
This repo uses a pipenv virtual environment, so you can either install pipenv and recreate the environment, or install a few Python packages in the Python environment of your choice to run the notebooks and the rest of the code:
- jupyter (or jupyterlab)
- pandas
- matplotlib
- seaborn
- plotly
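If you go the manual route, a quick way to check that the packages above are importable in your environment is a small snippet like this (the import names match the package names listed above):

```python
import importlib.util

# Packages needed to run the notebooks, as listed above
required = ["pandas", "matplotlib", "seaborn", "plotly"]

# find_spec returns None when a package is not importable
missing = [pkg for pkg in required if importlib.util.find_spec(pkg) is None]
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All notebook dependencies are available.")
```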
- Clone this repo (for help see this tutorial).
- Recreate the pipenv virtual environment using:

  ```shell
  pipenv sync --dev
  ```
- Raw data is being kept here within this repo. The data, as a JSON file, is generated by the scrapy crawler in ermine-book-data-scraping.
- Data processing/transformation notebooks are being kept here.
This project is licensed under the Creative Commons License; see the LICENSE.md file for details.