I've been exploring requests and BeautifulSoup to get a list of files from an HTTPS server. I have code that recursively lists the files in a directory tree, and I'm in two minds about whether this should be a tutorial or a how-to. The code "walks" the server directory tree and returns a generator yielding the url of each file. Recursion and generators are hard for many people to get their heads around (they are for me, at least), but this fills a need.
Ideally, we would have a STAC catalog for these datasets so that we do not need these kinds of access patterns. That might be my next playtime.
import time
from http import HTTPStatus
from typing import Iterator

import requests
from requests.exceptions import HTTPError
from bs4 import BeautifulSoup
retry_codes = [
    HTTPStatus.TOO_MANY_REQUESTS,
    HTTPStatus.INTERNAL_SERVER_ERROR,
    HTTPStatus.BAD_GATEWAY,
    HTTPStatus.SERVICE_UNAVAILABLE,
    HTTPStatus.GATEWAY_TIMEOUT,
]
def get_page(url: str,
             retries: int = 3) -> requests.Response:
    """Gets a response with requests, retrying on transient HTTP errors.

    Parameters
    ----------
    url : url to resource
    retries : number of retries before failing

    Returns
    -------
    requests.Response object
    """
    for n in range(retries):
        try:
            response = requests.get(url)
            response.raise_for_status()
            return response
        except HTTPError as exc:
            code = exc.response.status_code
            if code in retry_codes and n < retries - 1:
                # transient error: back off briefly, then retry
                time.sleep(n + 1)
                continue
            # non-retryable error, or retries exhausted
            raise
def get_filelist(url: str,
                 ext: str = ".nc") -> Iterator[str]:
    """Yields the urls of files in the directory tree below url.

    Parameters
    ----------
    url : url to resource
    ext : file extension of files to search for

    Yields
    ------
    url of each matching file
    """
    def is_subdirectory(href):
        # child directories end with "/", are not already part of the
        # current url (which would walk back up the tree), and are not hidden
        return (href.endswith("/") and
                href not in url and
                not href.startswith("."))

    def is_file(href, ext):
        return href.endswith(ext)

    response = get_page(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for a in soup.find_all("a", href=True):
        if is_subdirectory(a["href"]):
            # recurse into the subdirectory, keeping the same extension filter
            yield from get_filelist(url + a["href"], ext)
        if is_file(a["href"], ext):
            yield url + a["href"]
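For context, calling it looks something like the sketch below. The base url is just a placeholder, not one of the actual servers.

# Placeholder url: swap in the real server directory you want to walk.
base_url = "https://example.com/data/"

# Nothing is fetched until we start iterating over the generator.
for file_url in get_filelist(base_url, ext=".nc"):
    print(file_url)

Because get_filelist is a generator, the caller can break out early without crawling the whole tree, or pass it to list() to materialise the full listing.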