
New access patterns for NOAA@NSIDC data #63

Open
@andypbarrett

I've been exploring requests and BeautifulSoup to get a list of files served over HTTPS. I have code that recursively lists the files in a directory: it "walks" the server directory tree and returns a generator that yields the URL for each file (a usage sketch follows the code below). I'm in two minds about whether this should be a tutorial or a how-to. Recursion and generators are hard for many people to get their heads around (they are for me, at least), but the code fills a need.

Ideally, we would have a STAC catalog for these datasets so that we do not need these kinds of access patterns; a rough sketch of what that could look like follows. This might be for my next playtime.
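For comparison, here is a minimal sketch of what discovery could look like with pystac-client. The endpoint URL and collection id are placeholders (NOAA@NSIDC does not publish a STAC catalog today), so this is an illustration of the pattern, not working access code.

from pystac_client import Client

# Placeholder endpoint and collection id; neither exists yet.
catalog = Client.open("https://stac.example.org/noaa-nsidc")
search = catalog.search(collections=["g02135-sea-ice-index"])

for item in search.items():
    for asset in item.assets.values():
        if asset.href.endswith(".nc"):
            print(asset.href)

With a catalog in place, the recursive HTML scraping below collapses to a single search call.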

import time
from http import HTTPStatus
from typing import Iterator

import requests
from requests.exceptions import HTTPError

from bs4 import BeautifulSoup


retry_codes = [
    HTTPStatus.TOO_MANY_REQUESTS,
    HTTPStatus.INTERNAL_SERVER_ERROR,
    HTTPStatus.BAD_GATEWAY,
    HTTPStatus.SERVICE_UNAVAILABLE,
    HTTPStatus.GATEWAY_TIMEOUT,
]


def get_page(url: str, 
             retries: int = 3) -> requests.Response:
    """Gets resonse from requests

    Parameters
    ----------
    url : url to resource
    retries : number of retries before failing

    Returns
    -------
    requests.Response object
    """
    for n in range(retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()

            return response

        except HTTPError as exc:
            code = exc.response.status_code

            # Retry transient errors with a growing delay; on the last
            # attempt, or for non-retryable codes, re-raise the error.
            if code in retry_codes and n < retries - 1:
                time.sleep(n + 1)
                continue

            raise


def get_filelist(url: str,
                 ext: str = ".nc") -> Iterator[str]:
    """Yields the URLs of files in the directory tree below url.

    Parameters
    ----------
    url : url to resource
    ext : file extension of files to search for

    Returns
    -------
    Generator yielding the url of each matching file
    """
    
    def is_subdirectory(href):
        # Subdirectory links end with "/"; skip parent-directory links
        # (already part of url) and hidden entries.
        return (href.endswith("/") and
                href not in url and
                not href.startswith("."))

    def is_file(href, ext):
        return href.endswith(ext)
        
    response = get_page(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for a in soup.find_all('a', href=True):
        if is_subdirectory(a["href"]):
            # Recurse, passing ext along so the filter applies at every level
            yield from get_filelist(url + a["href"], ext=ext)
        elif is_file(a["href"], ext):
            yield url + a["href"]
