I've been exploring requests and BeautifulSoup to get a list of files served over HTTPS. I have code to recursively list the files in a directory, and I'm in two minds about whether it should be a tutorial or a how-to. The code "walks" the server's directory tree and returns a generator that yields the URL of each file (a usage sketch follows the code below). Recursion and generators are hard for many people to get their heads around (they are for me, at least), but the code fills a need.

Ideally, we would have a STAC catalog for these datasets so that we would not need these kinds of access patterns. That might be my next playtime.
```python
import time
from http import HTTPStatus

import requests
from requests.exceptions import HTTPError
from bs4 import BeautifulSoup

# HTTP status codes that are worth retrying
retry_codes = [
    HTTPStatus.TOO_MANY_REQUESTS,
    HTTPStatus.INTERNAL_SERVER_ERROR,
    HTTPStatus.BAD_GATEWAY,
    HTTPStatus.SERVICE_UNAVAILABLE,
    HTTPStatus.GATEWAY_TIMEOUT,
]


def get_page(url: str, retries: int = 3) -> requests.Response:
    """Gets response from requests

    Parameters
    ----------
    url : url to resource
    retries : number of retries before failing

    Returns
    -------
    requests.Response object
    """
    for n in range(retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response
        except HTTPError as exc:
            code = exc.response.status_code
            if code in retry_codes:
                # linear back-off: wait 0, 1, 2, ... seconds before retrying
                time.sleep(n)
                continue
            raise
    raise HTTPError(f"Failed to get {url} after {retries} retries")


def get_filelist(url: str, ext: str = ".nc"):
    """Returns a generator yielding the files in the directory tree
    below url.

    Parameters
    ----------
    url : url to resource
    ext : file extension of files to search for

    Returns
    -------
    Generator of file URLs
    """
    def is_subdirectory(href):
        # A subdirectory link ends with "/", is not already part of the
        # current url (e.g. a parent-directory link), and is not hidden
        return (href.endswith("/") and
                href not in url and
                not href.startswith("."))

    def is_file(href, ext):
        return href.endswith(ext)

    response = get_page(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for a in soup.find_all("a", href=True):
        if is_subdirectory(a["href"]):
            # recurse into the subdirectory, propagating ext
            yield from get_filelist(url + a["href"], ext)
        if is_file(a["href"], ext):
            yield url + a["href"]
```
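For example, listing every NetCDF file below a server directory looks something like this (the URL here is a placeholder, not a real server):

```python
# Hypothetical directory URL -- substitute a real HTTPS file server
base_url = "https://data.example.org/model-output/"

# get_filelist is lazy: nothing is fetched until we start iterating
for file_url in get_filelist(base_url, ext=".nc"):
    print(file_url)
```

Because it is a generator, you can also stop early (e.g. `next(...)` to grab the first match) without walking the whole tree.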
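For comparison, here is a rough sketch of what the STAC version of this discovery could look like, using pystac-client; the catalog URL and collection name are made up for illustration:

```python
from pystac_client import Client

# Hypothetical STAC API endpoint and collection name -- both are made up
catalog = Client.open("https://stac.example.org")
search = catalog.search(collections=["model-output"])

# Each item carries assets with direct hrefs, so no directory walking
# or HTML scraping is needed
for item in search.items():
    for asset in item.assets.values():
        print(asset.href)
```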