
A concurrent crawler to retrieve song lyrics from Genius


floscha/genius-lyrics-crawler


Genius Lyrics Crawler


A concurrent crawler to retrieve song lyrics from Genius and store them in a MongoDB database.

Architecture

crawler architecture image

From a high-level perspective, the crawler consists of the following services:

  • Lyrics Crawler Driver: Used to initiate the crawling process.
  • Genius Crawler Spider: Scrapes song lyrics from Genius and stores them in MongoDB.
  • Celery: Keeps concurrent scraping tasks in a message queue with RabbitMQ as the broker.
  • Fluentd: Aggregates logs and sends them to MongoDB in batches to avoid lock contention.
  • MongoDB: Used to store both lyrics and logs.
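
The batching idea behind the Fluentd service can be sketched in a few lines. This is only an illustration of the principle, not Fluentd's implementation and not code from this repository: the class name `BatchingLogBuffer`, the `sink` callback, and the batch size are all hypothetical. The point is that many small log writes are collected and handed to the database as one bulk write, so the database is touched far less often.

```python
from typing import Callable, Dict, List

LogRecord = Dict[str, str]


class BatchingLogBuffer:
    """Collect log records and pass them to a sink in batches.

    Hypothetical helper for illustration; in a real setup the sink
    would perform a single bulk insert into the log collection.
    """

    def __init__(self, sink: Callable[[List[LogRecord]], None],
                 batch_size: int = 100) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._sink = sink
        self._batch_size = batch_size
        self._pending: List[LogRecord] = []

    def emit(self, record: LogRecord) -> None:
        """Queue one record and flush once the batch is full."""
        self._pending.append(record)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send all pending records to the sink as one batch."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._sink(batch)


# Usage: seven records with a batch size of three cause three sink calls.
writes: List[List[LogRecord]] = []
buffer = BatchingLogBuffer(writes.append, batch_size=3)
for i in range(7):
    buffer.emit({"message": f"scraped song {i}"})
buffer.flush()  # push the final, partially filled batch
batch_sizes = [len(batch) for batch in writes]
```

With one write per record the database would be hit seven times here; with batching it is hit three times, which is why batching reduces contention when many workers log at once.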

Usage

Using the provided Docker Compose configuration, the crawler can be run in two steps.

  1. The infrastructure for the crawler (including Celery worker(s), MongoDB, and Fluentd) can be started using:
$ docker-compose up

Appending --scale worker=n starts n instances of the Celery worker; for example, docker-compose up --scale worker=4 starts four. Choose n according to the computational power of your machine.

  2. To start the scraping process, run the driver program:
$ docker-compose run driver

The driver program accepts the following arguments:

usage: driver [-h] [-l LETTERS] [-apl ARTISTS_PER_LETTER]
              [-ppa PAGES_PER_ARTIST] [-spp SONGS_PER_PAGE] [-a ARTIST]
              [-t TITLE]
              {all,popular,artist,song}

Driver program to initiate the scraping process

positional arguments:
  {all,popular,artist,song}
                        defines whether all songs, only popular songs, songs
                        from a single artist, or only a single song should be
                        scraped

optional arguments:
  -h, --help            show this help message and exit
  -l LETTERS, --letters LETTERS
                        only scrape songs from artists beginning with one of
                        the specified letters
  -apl ARTISTS_PER_LETTER, --artists_per_letter ARTISTS_PER_LETTER
                        number of artists per letter to scrape
  -ppa PAGES_PER_ARTIST, --pages_per_artist PAGES_PER_ARTIST
                        number of pages per artist to scrape
  -spp SONGS_PER_PAGE, --songs_per_page SONGS_PER_PAGE
                        number of songs per page to scrape
  -a ARTIST, --artist ARTIST
                        the name of the artist to scrape
  -t TITLE, --title TITLE
                        the title of the song to scrape
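
To make the option layout easier to follow, the help text above can be reconstructed as an argparse parser. This is a sketch derived only from the usage output, not the repository's actual driver code. The destination name `mode` for the positional argument and the `int` type for the three numeric limits are assumptions; the artist name is a placeholder.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the driver's command-line interface from its help text."""
    parser = argparse.ArgumentParser(
        prog="driver",
        description="Driver program to initiate the scraping process",
    )
    # Assumption: "mode" is a made-up name; the help text shows only the choices.
    parser.add_argument(
        "mode",
        choices=["all", "popular", "artist", "song"],
        help="what should be scraped",
    )
    parser.add_argument(
        "-l", "--letters",
        help="only scrape artists beginning with one of these letters",
    )
    # Assumption: the numeric limits are parsed as integers.
    parser.add_argument("-apl", "--artists_per_letter", type=int,
                        help="number of artists per letter to scrape")
    parser.add_argument("-ppa", "--pages_per_artist", type=int,
                        help="number of pages per artist to scrape")
    parser.add_argument("-spp", "--songs_per_page", type=int,
                        help="number of songs per page to scrape")
    parser.add_argument("-a", "--artist",
                        help="the name of the artist to scrape")
    parser.add_argument("-t", "--title",
                        help="the title of the song to scrape")
    return parser


# Usage: scrape two pages of songs for a single (placeholder) artist.
args = build_parser().parse_args(["artist", "-a", "Some Artist", "-ppa", "2"])
```

Assuming the driver container forwards trailing arguments to the program, the equivalent invocation would look like docker-compose run driver artist -a "Some Artist" -ppa 2.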
