A pet project that analyses book data crawled from todostuslibros.com and other sources.

ladywithanermine/ermine-books-analysis

📚 Book data exploratory analysis

Book details exploratory data analysis with data scraped from several sources, mostly Spanish sites.

Business case

I am a book lover and, above all, passionate about fiction: literary fiction in particular, though not exclusively. I decided this interest could be the spark for a data science pet project in which I build the end-to-end ETL and analysis pipelines and extract some interesting insights about books, besides reading (and enjoying) them.

There are few book-related datasets around. Some are on Kaggle, such as this one or this other one. They mostly cover genre fiction or Amazon book data... in English. I found no datasets devoted solely to books in Spanish or to the Spanish book market, so I decided to gather the data myself and learn a few new things in the process.

The selection of suitable sites for retrieving data on books published in Spain is quite limited, more so since the Goodreads API was deprecated in December 2020. Among those available, the most straightforward is Todos tus libros, a site set up by Cegal, the Spanish bookshops association, to publicize books and their availability as part of a campaign to encourage readers to buy locally. According to its 'Who we are' section, the site holds information (sometimes incomplete) on more than 4 million books, and counting.

The site allows searches by author, title, publishing house, ISBN or publication date, and books are tagged with one or more categories ('materias'). For the initial exploratory analysis I have chosen the literary fiction category, 'Ficción moderna y contemporánea', which contains about 100,000 books. The book details were scraped and stored in a JSON file, including the following information:

  • Title
  • Authors
  • Publisher
  • Price (if available)
  • Publishing country
  • Publishing language
  • Original language
  • ISBN
  • EAN
  • Publication date
  • Type of binding
  • Number of pages
  • Number of bookstores where the book is available (at the time of the dataset generation)
  • Tags: a compendium of genre, language, style, etc.
  • Book cover URL
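
As a rough illustration of how records with these fields might be loaded for analysis, here is a minimal sketch using pandas. The key names (`title`, `price`, `num_pages`, `publication_date`, and so on) and the sample records are assumptions made up for this example; the actual keys in the scraped JSON file may differ.

```python
import io
import json

import pandas as pd

# Hypothetical sample that mimics the scraped JSON file.
# The real field names and value formats may differ.
SAMPLE_JSON = """
[
  {"title": "Example novel", "authors": ["Author One"],
   "publisher": "Example Press", "price": "18.90",
   "publication_date": "2019-03-01", "num_pages": "320",
   "bookstores_available": 12,
   "tags": ["Ficción moderna y contemporánea"]},
  {"title": "Another novel", "authors": ["Author Two", "Author Three"],
   "publisher": "Example Press", "price": null,
   "publication_date": "2021-10-15", "num_pages": "210",
   "bookstores_available": 0, "tags": []}
]
"""


def load_books(source):
    """Read a JSON array of book records into a typed DataFrame."""
    records = json.load(source)
    df = pd.DataFrame.from_records(records)
    # Price is optional in the dataset, so missing values become NaN.
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    df["num_pages"] = pd.to_numeric(df["num_pages"], errors="coerce")
    df["publication_date"] = pd.to_datetime(
        df["publication_date"], errors="coerce"
    )
    return df


books = load_books(io.StringIO(SAMPLE_JSON))
print(books.dtypes)
```

To try this on the real data, pass an open file handle for the JSON file instead of the in-memory sample, and adjust the column names to match.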

For exhaustive details about the scraping process, please check my repo ermine-book-data-scraping.

Getting Started

Dependencies

This repo uses a pipenv virtual environment, so you can either install pipenv and recreate the environment, or install the following Python packages in your environment of choice to run the notebooks and the rest of the code:

  • jupyter (or jupyterlab)
  • pandas
  • matplotlib
  • seaborn
  • plotly
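
If you go the manual route, a quick way to confirm the packages above are importable is a check along these lines. This is only a convenience sketch and not part of the repo. It assumes `jupyter_core` as the module to probe for Jupyter, since both jupyter and jupyterlab depend on it.

```python
from importlib.util import find_spec

# Import names for the packages listed above.
# "jupyter_core" is an assumption: it is installed alongside
# both jupyter and jupyterlab.
REQUIRED = ["jupyter_core", "pandas", "matplotlib", "seaborn", "plotly"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```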

Setup

  1. Clone this repo (for help see this tutorial).
  2. Recreate the pipenv virtual environment using:
 pipenv sync --dev
  3. Raw data is kept here within this repo.

    The data is a JSON file generated by the Scrapy crawler in ermine-book-data-scraping.

  4. Data processing/transformation notebooks are kept here.

Authors

@ladywithanermine

Version History

License

This project is licensed under the Creative Commons License; see the LICENSE.md file for details.

Acknowledgments
