Book data crawling with Scrapy from several websites, starting with www.todostuslibros.com and moving on to other complementary book information sources in the near future.
I am a book lover and, most of all, I am passionate about fiction, so I decided this interest of mine could be the spark for a data science pet project: building end-to-end ETL and analysis pipelines and extracting some interesting insights about books, apart from reading (and enjoying) them.
This repo is the first part of the project: it includes the Scrapy code for crawling the data and storing it in JSON files. See my other repo ermine-book-data-analysys for the exploratory data analysis and machine learning models related to book data.
There are few book-related datasets around. Some are on Kaggle, such as this one or this other one, and they mostly cover genre fiction or Amazon book data in English. I have found no datasets devoted solely to books in Spanish or to the Spanish book market, so I decided to gather the data myself and learn a few new things in the process.
The selection of suitable sites for retrieving data on books published in Spain is quite limited, more so since the Goodreads API was deprecated in December 2020. Amazon is of course a great data source, with reviews, categories and other relevant information, but its protection against third-party crawlers makes gathering the data difficult enough to rule it out as a first option for this learning project.
Among the rest of the book-related sites, one of the most straightforward is Todos tus libros, a site put up by Cegal, the Spanish bookshops association, to publicize books and their availability, as part of a campaign to encourage readers to buy locally. The site (as stated in the 'Who we are' section) includes information (sometimes incomplete) on more than 4 million books, and counting.
The Spanish Ministry of Culture has an ISBN database that can be queried by ISBN or by year interval, with results capped at 1000. It does not seem that useful for gathering bulk data, unless you are enriching book information collected elsewhere; one option would be to find some way to auto-page the results and get past that 1000-book limit.
The Spanish National Statistics Institute also publishes information on books published per year, but in this case it is aggregated data, so no details about individual books are available on its site.
Book sales rankings are available not only at www.todostuslibros.com but also at several online bookshops' sites, such as La casa del libro or Fnac, although there is no information on the ranking date or the update frequency. These big online bookshops could be an alternate source for crawling the data or even for additional information, such as book reviews.
Taking everything into account, I have decided to create a first version of the crawler using www.todostuslibros.com and make my way towards other sites, one at a time. In subsequent iterations, I will enrich the information with details from the ISBN database and reviews from La casa del libro, Fnac and possibly other sites. Further developments will include gathering top sales lists from several sites to build a unified ranking that can be refreshed periodically, in order to obtain time series data on sales and the like.
Site details: www.todostuslibros.com
The site allows searches by author, title, publishing house, ISBN or date of publication, and books are tagged with one or more categories/genres ('materias'). The tags are somewhat diverse, mixing genres (mystery, romance), sometimes languages (Español/Castellano, Inglés, etc.) and other custom labels. There appears to be no unified taxonomy, or rather one that has evolved over time: for instance, some books list 'Español' as the original language and others 'Castellano', although both names refer to the same language (Spanish).
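Such inconsistencies are easy to smooth out downstream. For instance, a minimal normalization helper could map the aliases to a canonical label; the alias table below is a hypothetical sketch, not part of this crawler:

```python
# Hypothetical helper: collapse the site's inconsistent language labels
# ('Español' vs. 'Castellano') into a single canonical value before analysis.
LANGUAGE_ALIASES = {
    "español": "Español",
    "castellano": "Español",
    "inglés": "Inglés",
}

def normalize_language(raw: str) -> str:
    """Map a raw language tag to its canonical form (fall back to the input)."""
    cleaned = raw.strip()
    return LANGUAGE_ALIASES.get(cleaned.lower(), cleaned)
```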
Scraping a direct search URL, such as https://todostuslibros.com/busquedas?keyword=novela, is not allowed by the website. Instead, in order to gather information on different genres, I have selected a list of genres based on several 'materias' URLs and grouped them as follows:
- Mystery and crime:
- Terror and suspense:
- Fantasy:
- Science fiction:
- Historical fiction:
- Romance:
It is important to note that any book may be tagged under more than one 'materia' ('mystery' and 'fantasy', for instance), so some books will appear in several of the previous searches. This is not a problem if the output files are analyzed separately, but duplicates must be taken into account if a joint analysis is performed (see my other repo ermine-book-data-analysys for more details).
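When the per-genre JSON files are merged for a joint analysis, a simple way to drop those duplicates is to key the records by ISBN. A minimal sketch, assuming the crawler's default list-of-objects JSON output and an `isbn` field (the file names are placeholders):

```python
import json
from pathlib import Path

def merge_deduplicated(json_files):
    """Merge several crawler output files, keeping one record per ISBN."""
    books = {}
    for path in json_files:
        for book in json.loads(Path(path).read_text(encoding="utf-8")):
            books.setdefault(book["isbn"], book)  # first occurrence wins
    return list(books.values())

# Example usage with placeholder file names:
# merged = merge_deduplicated(["fantasy.json", "mystery.json"])
```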
Every search or 'materias' results list displays pages of 10 books, each with some basic information: author, title, book cover, publisher, ISBN, a few lines of the synopsis and the price (if available), plus pagination links at the bottom of the page:
Clicking on a book displays a detail page with information on tags, original language, country, number of pages, etc.:
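In Scrapy terms, this page structure maps naturally onto a spider that follows each book link to its detail page and then the pagination link at the bottom. The sketch below illustrates that flow; the CSS selectors and extracted fields are illustrative placeholders, not the actual ones used in this repo:

```python
import scrapy


class TodostuslibrosSpider(scrapy.Spider):
    name = "todostuslibros"

    def start_requests(self):
        # Base URLs are read from TODOSTUSLIBROS_URL_LIST in settings.py.
        for url in self.settings.getlist("TODOSTUSLIBROS_URL_LIST"):
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Each results page lists 10 books: follow every book detail link.
        # (The CSS selectors here are placeholders, not the site's real ones.)
        for href in response.css("a.book-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_book)
        # Follow the pagination link at the bottom of the page, if present.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response):
        # The detail page carries the extended fields (language, pages, tags...).
        yield {
            "title": response.css("h1::text").get(),
            "isbn": response.css(".isbn::text").get(),
        }
```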
Every genre query (one or more URLs) is crawled independently and stored in its own JSON file. The base URLs for the crawl and the JSON feed export can be parametrized in the settings.py file through the following variables:
TODOSTUSLIBROS_URL_LIST # List of URLs for the crawling
TODOSTUSLIBROS_BOOK_DETAIL_URL_TEMPLATE # Template for the book details page
FEED_URI # Name of the JSON output file
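As an illustration, a configuration for a single genre could look like the excerpt below; the URL, template and file name are hypothetical placeholders, not values shipped with the repo:

```python
# settings.py (excerpt) -- the values below are illustrative placeholders
TODOSTUSLIBROS_URL_LIST = [
    "https://todostuslibros.com/materias/ciencia-ficcion",  # placeholder URL
]
TODOSTUSLIBROS_BOOK_DETAIL_URL_TEMPLATE = "https://todostuslibros.com/libros/{slug}"  # placeholder
FEED_URI = "science_fiction.json"
```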
The crawler retrieves the following information for the books included in the selected genres:
- Title
- Author(s)
- Publisher
- Synopsis
- Price (if available)
- Publishing country
- Publishing language
- Original language
- ISBN
- EAN
- Publication date
- Type of binding
- Number of pages
- Number of bookstores where the book is available (at the time of the dataset generation)
- Tags: a compendium of genre, language, style, etc.
- Book cover URL
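Modelled as a Scrapy Item, the list above could translate into something like the following sketch (the field names are my own naming; the repo's actual item class may differ):

```python
import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()
    authors = scrapy.Field()
    publisher = scrapy.Field()
    synopsis = scrapy.Field()
    price = scrapy.Field()              # if available
    country = scrapy.Field()
    language = scrapy.Field()           # publishing language
    original_language = scrapy.Field()
    isbn = scrapy.Field()
    ean = scrapy.Field()
    publication_date = scrapy.Field()
    binding = scrapy.Field()
    pages = scrapy.Field()
    bookstores = scrapy.Field()         # availability count at crawl time
    tags = scrapy.Field()               # genre, language, style, etc.
    cover_url = scrapy.Field()
```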
This repo uses a pipenv virtual environment, so you should either install pipenv and recreate the environment, or install a few Python packages in your Python environment of choice in order to run the crawler:
- scrapy
- Clone this repo (for help see this tutorial).
- Recreate the pipenv virtual environment using:
pipenv sync --dev
- Review the settings.py file to set up the custom search you want the crawler to run, the output file name and any other feed export configuration. See more about other feed export formats here in the Scrapy documentation.
- Run the crawler within the pipenv virtual environment:
pipenv run scrapy crawl todostuslibros
If you are instead running the code under a standard Python installation (without a virtual environment), the following is enough:
scrapy crawl todostuslibros
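Note that Scrapy also accepts a feed output override on the command line, which can be handy for one-off runs without editing settings.py (the file name below is a placeholder):

scrapy crawl todostuslibros -o output.json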
- 1.0: Initial Release
This project is licensed under the Creative Commons License - see the LICENSE.md file for details