reallynotburner/apod-scraper
Introduction

I want to gather a large number of JSON records and image URLs for a development project. I have chosen the dataset behind the Astronomy Picture of the Day (APOD) site, https://apod.nasa.gov/apod/astropix.html, which has brought the wonder of the cosmos through beautiful daily images and videos since 1995. The API that backs APOD is https://api.nasa.gov/planetary/apod; by changing the query parameters you can select any date, and it will return the APOD metadata for that day.
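As a sketch of that query pattern, here is how a single day's metadata can be requested from the APOD API in Node.js (18+, which has a global `fetch`). `DEMO_KEY` is NASA's public test key; substitute your own developer key:

```javascript
const API_ENDPOINT = 'https://api.nasa.gov/planetary/apod';

// Build the request URL for one day's APOD metadata (date as YYYY-MM-DD).
function buildApodUrl(apiKey, date) {
  const url = new URL(API_ENDPOINT);
  url.searchParams.set('api_key', apiKey);
  url.searchParams.set('date', date); // e.g. '1995-06-16', the first APOD
  return url.toString();
}

// Fetch and parse the metadata for that day.
async function fetchApod(apiKey, date) {
  const res = await fetch(buildApodUrl(apiKey, date));
  if (!res.ok) throw new Error(`APOD request failed: ${res.status}`);
  return res.json(); // { date, title, url, explanation, media_type, ... }
}
```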

Method

Individually calling the API for each day of data is burdensome for my apps and for the NASA API. I'm most comfortable with JavaScript, so I use Node.js on the server. This scraper script pulls all the JSON entries from NASA into a Google Firestore NoSQL database table. The APOD API is limited to 2000 requests per API key per hour, so with an empty database it takes about 6 hours to gather everything in the API, all the way back to its start in 1995. After that, run the script once a day: the scraper picks up at the last APOD entry it stored and queries the API for each following day until it has caught up with today's APOD data.
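The catch-up behavior described above can be sketched as follows. The helper names (`getLastStoredDate`, `fetchApod`, `saveEntry`) are hypothetical stand-ins for the scraper's actual functions, not its real API:

```javascript
const APOD_START = '1995-06-16'; // date of the first APOD entry

// Next calendar day after an ISO date string, as YYYY-MM-DD.
function nextDay(isoDate) {
  const d = new Date(isoDate + 'T00:00:00Z');
  d.setUTCDate(d.getUTCDate() + 1);
  return d.toISOString().slice(0, 10);
}

// Resume from the last stored entry and walk forward to today.
async function catchUp(db) {
  const today = new Date().toISOString().slice(0, 10);
  let date = (await getLastStoredDate(db)) ?? APOD_START;
  while (date <= today) {
    const entry = await fetchApod(process.env.NASA_API_KEY, date);
    await saveEntry(db, entry); // write the day's metadata to Firestore
    date = nextDay(date);
  }
}
```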

Installation

yarn

Environmental Variables

You'll need two JSON files in the secrets folder to make this work. They define your credentials for the NASA API and the Google Firebase platform. For NASA, get a developer API key at https://api.nasa.gov/ under the "Generate API Key" navigation tab. The default 'DEMO_KEY' is limited to about 50 requests PER DAY; a developer API key allows ~2000 requests per hour. To get your Firebase credentials, go to your Firebase Console Project Settings, section "Your Apps," where a snippet is conveniently populated with your specific credentials. DO NOT COMMIT the contents of your secrets folder! Doing so would let anyone do anything they want with your account!

You'll need your own values, in JSON format, for your NASA API credentials in the "secrets" folder. Here's an example using mock data:

{
  "apiKey": "1234567890123456789012345678901234567890",
  "apiEndpoint": "https://api.nasa.gov/planetary/apod"
}

You'll need your own values, in JSON format, for your Firebase credentials. Here's an example using mock data:

{
  "type": "service_account",
  "project_id": "your-project",
  "private_key_id": "1234567890123456789012345678901234567890",
  "private_key": "-----BEGIN PRIVATE KEY-----\nPRIVATE_KEY_PRIVATE_KEY_PRIVATE_KEY=\n-----END PRIVATE KEY-----\n",
  "client_email": "[email protected]",
  "client_id": "123456789012345678901",
  "auth_uri": "https://accounts.google.com/auth",
  "token_uri": "https://auth.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/auth/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/your-project.iam.gserviceaccount.com"
}
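For reference, here is a sketch of how a script might load these two secrets files and initialize the Firebase Admin SDK. The file names (`nasa.json`, `firebase.json`) are assumptions for illustration; use whatever names your copy of the scraper expects:

```javascript
const { initializeApp, cert } = require('firebase-admin/app');
const { getFirestore } = require('firebase-admin/firestore');

// Assumed file names -- adjust to match your secrets folder.
const nasa = require('./secrets/nasa.json');           // { apiKey, apiEndpoint }
const serviceAccount = require('./secrets/firebase.json'); // service-account JSON above

initializeApp({ credential: cert(serviceAccount) });
const db = getFirestore(); // Firestore handle the scraper writes into
```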

Run the scraper

yarn start

Create Thumbnails TODO:

This step checks the database for image records that don't have a thumbnail. If a row lacks one, it downloads the regular-resolution image from its URL to the local images folder, then converts it to a thumbnail with a fixed height of 200px in the thumbnails folder. Caveat: the conversion library doesn't support .gif output, so .gif inputs are exported in .webp format.

Since the APOD image servers aren't limited by the daily max of the API, this runs until it's scraped every image.

Warning! The first run downloads every regular-resolution image from NASA APOD, which is almost 3 GB of data as of 2023. Daily scrapes thereafter are much smaller, just a single image of varying size. The initial run takes about an hour on my machine and network conditions.

node utils/grabOriginalImage.js

Suspend Conditions

  • when the rate limit of the API is reached, the session remains open on a timeout and tries again in an hour
  • when the database has caught up with the API, the scraper waits and tries again in 24 hours
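The two suspend conditions above can be sketched as a simple scheduler. `scrapeBatch` is a hypothetical stand-in that reports whether the run was rate-limited, caught up, or has more to fetch:

```javascript
const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// Map a scrape result to how long to wait before the next attempt.
function nextDelay(result) {
  if (result === 'rate-limited') return HOUR_MS; // API limit hit: retry in an hour
  if (result === 'caught-up') return DAY_MS;     // up to date: retry in 24 hours
  return 0;                                      // more to fetch: continue immediately
}

// Keep the session open, sleeping between batches as needed.
async function runForever() {
  while (true) {
    const result = await scrapeBatch();
    const delay = nextDelay(result);
    if (delay > 0) await new Promise((resolve) => setTimeout(resolve, delay));
  }
}
```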

Exit Conditions TODO: CURRENTLY JUST RUNS UNTIL FAILURE

  • unable to connect to any database
  • unable to create a new database if missing
  • unable to create a new table if missing
