Commit
Showing 4 changed files with 19 additions and 13 deletions.
11 changes: 0 additions & 11 deletions
_freeze/examples/earthdata-access-demo/execute-results/html.json
This file was deleted.
14 changes: 14 additions & 0 deletions
_freeze/in-development/earthdata-python-r-handoff/execute-results/html.json
@@ -0,0 +1,14 @@
{
  "hash": "4982f88fa604155c7836f194dd8d9119",
  "result": {
"markdown": "---\ntitle: \"`earthdata`: Python-R Handoff\"\n---\n\n\n## The dream\n\nCreate once, use often: using `earthdata` python package for NASA Earthdata authorization and identifying the s3 links (i.e. the locations where the data are stored on Amazon Web Services), then passing those python objects to R through Quarto for analysis by R folks. These notes are a work-in-progress by Julie and Luis and we'll tidy them up as we develop them further.\n\n[Note: this dream is currently not working but we are sharing our progress.]{style=\"color: red\"}\n\n## Python: `earthdata` package for auth & s3 links\n\n`earthdata` gets me the credentials, it gets me the links based on the queries.\n\nIn this example, the data we want is in the Cloud. For this examples we're using this data we identified from the Earthdata Cloud Cookbook's [Multi-File_Direct_S3_Access_NetCDF_Example](https://nasa-openscapes.github.io/earthdata-cloud-cookbook/how-tos/Multi-File_Direct_S3_Access_NetCDF_Example.html), and its `short_name` is `'ECCO_L4_SSH_05DEG_MONTHLY_V4R4'`.\n\n### Identify the s3 links\n\nBelow is our query, pretending that that is the data and the bounding box we want.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n## import DataCollections class from earthdata library\nfrom earthdata import DataGranules\n\n## To find the concept_id from the shortname that we copied: \n# short_name = 'ECCO_L4_SSH_05DEG_MONTHLY_V4R4' \n# collection = DataCollections().short_name(short_name).get()\n# [c.concept_id() for c in collection] ## this returned 'C1990404799-POCLOUD'\n\n# Then we build a Query with spatiotemporal parameters. \nGranuleQuery = DataGranules().concept_id('C1990404799-POCLOUD').bounding_box(-134.7,58.9,-133.9,59.2)\n\n## We get the metadata records from CMR\ngranules = GranuleQuery.get()\n\n## Now it's time to open our data granules list. \ns3_links = [granule.data_links(access='direct') for granule in granules] \ns3_links[0]\n```\n:::\n\n\nNote that `files = Store(auth).open(granules)` would work for Python users but `open` won't work in the R world because it will create some kind of python file handlers from `fsspec`.\n\n### Get the Cloud credentials\n\nPrerequesite: you'll need a functioning .netrc here. `earthdata` expects interactivity and that did not work here with Quarto in the RStudio IDE (and it also did not work for Julie in Jupyter notebook (June 7 2022)). So, we followed the 2021-Cloud-Hackathon's [NASA_Earthdata_Authentication](https://nasa-openscapes.github.io/2021-Cloud-Hackathon/tutorials/04_NASA_Earthdata_Authentication.html), copying and pasting and running that code in a Jupyter notebook. (remember to `rm .netrc` beforehand!)\n\nThen, with a nice .netrc file, the next step is to get Cloud credentials:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n## import the Auth class from the earthdata library\nfrom earthdata import Auth\n\nauth = Auth().login(strategy=\"netrc\")\ncredentials = auth.get_s3_credentials(cloud_provider = \"POCLOUD\") \n```\n:::\n\n\nSo now we have the s3 links and the credentials to download the links, so now we can use the tutorial in R!!\n\n**Notes**\n\n- Luis will update `earthdata` to automatically know the cloud provider so that you don't have to specify for example POCLOUD vs PODAAC\n# credentials you actually don't want to print your credentials, we were just checking that they worked\n- The resulting JSON dictionary is what we'll export to R, and it will be valid for 1 hour. 
When I run into issues, I'll say \"why is this not working\", and it's because it's expired in 1 hour.\n- When we want to identify the bucket level, we'll need to remove the name of the file. For example:\n - <s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/SEA_SURFACE_HEIGHT_mon_mean_1992-01_ECCO_V4r4_latlon_0p50deg.nc> includes the filename\n - <s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/> is only the bucket\n- Expect to run into issues with listing the files in the bucket (because maybe something is restricted or maybe you can access files but not list everything that's inside the bucket)\n\n## R: data access from s3 links!\n\nAnd now I can switch to R, if R is my preferred language.\n\nThe blog post [Using Amazon S3 with R](https://blog.djnavarro.net/posts/2022-03-17_using-aws-s3-in-r/) by Danielle Navarro is hugely informative and describes how to use the [aws.s3](https://github.com/cloudyr/aws.s3) R package.\n\nFirst load libraries: \n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\nlibrary(readr)\nlibrary(purrr)\nlibrary(stringr)\nlibrary(tibble)\nlibrary(aws.s3) # install.packages(\"aws.s3\")\nlibrary(reticulate)\n```\n:::\n\n\nTranslate credentials from python variables (created with `earthdata` above) to R variables using `reticulate`'s `py$` syntax and `purr`'s `pluck()` to isolate a variable from a list:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## translate credentials from python to R, map to dataframe\ncredentials_r_list <- py$credentials #YAY!\ncredentials_r <- purrr::map_df(credentials_r_list, print)\n\n## translate s3 links from python to R, create my_bucket\ns3_links_r_list <- py$s3_links\nmy_link_list <- s3_links_r_list[1] # let's just start with one\nmy_link_chr <- purrr:::map_chr(my_link_list, paste, collapse=\"\")\n#my_link <- as_tibble(my_link_chr)\n#my_link_split <- stringr::str_split(my_link, \"/\")\n#my_bucket <- str_c(\"s3://\", my_link_split[3], my_link_split[4])\nmy_bucket <- \"s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/\"\n```\n:::\n\n\nFrom the [`aws.s3` documentation](https://github.com/cloudyr/aws.s3#aws-s3-client-package), set up system environment variables for AWS:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nSys.setenv(\"AWS_ACCESS_KEY_ID\" = credentials_r$accessKeyId,\n \"AWS_SECRET_ACCESS_KEY\" = credentials_r$secretAccessKey,\n \"AWS_DEFAULT_REGION\" = \"us-west-2\",\n \"AWS_SESSION_TOKEN\" = credentials_r$sessionToken)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# testing by hand: Luis\nSys.setenv(\"AWS_ACCESS_KEY_ID\" = \"ASIATNGJQBXBHRPIKFFB\",\n \"AWS_SECRET_ACCESS_KEY\" = \"zbYP2fueNxLK/joDAcz678mkjjzP6fz4HUN131ID\",\n \"AWS_DEFAULT_REGION\" = \"us-west-2\")\n```\n:::\n\n\nFirst let's test Danielle's code to see if it runs. 
Note to Luis: the following only works when the `Sys.setenv` is not set:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(aws.s3)\n\nbucket_exists(\n bucket = \"s3://herbariumnsw-pds/\", \n region = \"ap-southeast-2\"\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nClient error: (403) Forbidden\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\nattr(,\"x-amz-bucket-region\")\n[1] \"ap-southeast-2\"\nattr(,\"x-amz-request-id\")\n[1] \"0FQ1R57F2VHGFPDF\"\nattr(,\"x-amz-id-2\")\n[1] \"N6RPTKPN3/H9tDuKNHM2ZAcChhkkn2WpfcTzhpxC3fUmiZdNEIiu1xJsQAvFSecYIuWZ28pchQW3sAPAdVU57Q==\"\nattr(,\"content-type\")\n[1] \"application/xml\"\nattr(,\"date\")\n[1] \"Thu, 07 Jul 2022 23:11:30 GMT\"\nattr(,\"server\")\n[1] \"AmazonS3\"\n```\n:::\n:::\n\n\nNow, see if the PODAAC bucket exists:\n\n\n::: {.cell}\n\n```{.r .cell-code}\naws.s3::bucket_exists(\n bucket = \"s3://podaac-ops-cumulus-protected/\", \n region = \"us-west-2\"\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nClient error: (403) Forbidden\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\nattr(,\"x-amz-bucket-region\")\n[1] \"us-west-2\"\nattr(,\"x-amz-request-id\")\n[1] \"M4T3W1JZ93M08AZB\"\nattr(,\"x-amz-id-2\")\n[1] \"hvGLWqGCRB4lLf9pD8f67OsTDulSOgqd+yLWzUTRFz2tlLPVpxHr9mSREL0bQPVyo70j0hvJp+8=\"\nattr(,\"content-type\")\n[1] \"application/xml\"\nattr(,\"date\")\n[1] \"Thu, 07 Jul 2022 23:11:30 GMT\"\nattr(,\"server\")\n[1] \"AmazonS3\"\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nherbarium_files <- get_bucket_df(\n bucket = \"s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/\", \n region = \"us-west-2\",\n max = 20000\n) %>% \n as_tibble()\n```\n:::\n\n\n\nIf forbidden: \n- 1 hour expiration time\n- this bucket is not listable (or protected) (hopefully this error will be clear enough)\n\nIf you get the following error, it's likely because your credentials have expired:\n\n::: callout-important\n Client error: (403) Forbidden\n [1] FALSE\n attr(,\"x-amz-bucket-region\")\n [1] \"us-west-2\"\n attr(,\"x-amz-request-id\")\n [1] \"W2PQV030PDTGDD32\"\n attr(,\"x-amz-id-2\")\n [1] \"S8C0qzL1lAYLufzUupjqplyyS/3fWCKxIELk0OJLVHGzTOqlyhof+IPFYbaRUhmJwXQelfprYCU=\"\n attr(,\"content-type\")\n [1] \"application/xml\"\n attr(,\"date\")\n [1] \"Wed, 08 Jun 2022 03:11:16 GMT\"\n attr(,\"server\")\n [1] \"AmazonS3\"\n:::\n\n## Dev notes\n\n### Chat with Andy May 26\n\nMaybe have a python script that takes arguments, compiled in a way that then in MatLab you can sys.admin that python script. Then he doesn't need to know python\n\nOther approach would be MatLab to re-write earthdata in MatLab\n\nOur dream, revised: the code should be language-agnostic\n\n## Background\n\nThis was Luis' original example code, but it downloads data. The examples above access it in the cloud. *From <https://nasa-openscapes.github.io/earthdata-cloud-cookbook/examples/earthdata-access-demo.html>*\n\n``` python\nfrom earthdata import Auth, DataGranules, Store\n\n# first we authenticate with NASA EDL\nauth = Auth().login(strategy=\"netrc\")\n\n# Then we build a Query with spatiotemporal parameters\nGranuleQuery = DataGranules().concept_id(\"C1575731655-LPDAAC_ECS\").bounding_box(-134.7,58.9,-133.9,59.2)\n\n# We get the metadata records from CMR\ngranules = GranuleQuery.get()\n\n# Now it{s time to download (or open) our data granules list with get()\nfiles = Store(auth).get(granules, local_path='./data')\n```\n", | ||
"supporting": [], | ||
"filters": [ | ||
"rmarkdown/pagebreak.lua" | ||
], | ||
"includes": {}, | ||
"engineDependencies": {}, | ||
"preserve": {}, | ||
"postProcess": true | ||
} | ||
} |
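The R chunk in the frozen document above comments out a string-splitting attempt and simply hardcodes `my_bucket`. As a minimal sketch of that bucket-derivation step (assuming the `my_link_chr` value built in that chunk and the `stringr` package loaded earlier; not part of the commit itself), stripping the trailing filename from an s3 link could look like this:

```r
library(stringr)

# Example s3 link, as returned by granule.data_links(access='direct') in the python chunk above
my_link_chr <- "s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/SEA_SURFACE_HEIGHT_mon_mean_1992-01_ECCO_V4r4_latlon_0p50deg.nc"

# Drop everything after the last "/" to keep only the bucket-level prefix
my_bucket <- str_remove(my_link_chr, "[^/]+$")
my_bucket
#> [1] "s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/"
```

This is only one possible way to replace the hardcoded `my_bucket` assignment; the same prefix could also be rebuilt from the split pieces, as the commented-out lines suggest.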