Commit
Showing 4 changed files with 19 additions and 13 deletions.
11 changes: 0 additions & 11 deletions
_freeze/examples/earthdata-access-demo/execute-results/html.json
This file was deleted.
14 changes: 14 additions & 0 deletions
_freeze/in-development/earthdata-python-r-handoff/execute-results/html.json
@@ -0,0 +1,14 @@
{
  "hash": "4982f88fa604155c7836f194dd8d9119",
  "result": {
"markdown": "---\ntitle: \"`earthdata`: Python-R Handoff\"\n---\n\n\n## The dream\n\nCreate once, use often: using `earthdata` python package for NASA Earthdata authorization and identifying the s3 links (i.e. the locations where the data are stored on Amazon Web Services), then passing those python objects to R through Quarto for analysis by R folks. These notes are a work-in-progress by Julie and Luis and we'll tidy them up as we develop them further.\n\n[Note: this dream is currently not working but we are sharing our progress.]{style=\"color: red\"}\n\n## Python: `earthdata` package for auth & s3 links\n\n`earthdata` gets me the credentials, it gets me the links based on the queries.\n\nIn this example, the data we want is in the Cloud. For this examples we're using this data we identified from the Earthdata Cloud Cookbook's [Multi-File_Direct_S3_Access_NetCDF_Example](https://nasa-openscapes.github.io/earthdata-cloud-cookbook/how-tos/Multi-File_Direct_S3_Access_NetCDF_Example.html), and its `short_name` is `'ECCO_L4_SSH_05DEG_MONTHLY_V4R4'`.\n\n### Identify the s3 links\n\nBelow is our query, pretending that that is the data and the bounding box we want.\n\n\n::: {.cell}\n\n```{.python .cell-code}\n## import DataCollections class from earthdata library\nfrom earthdata import DataGranules\n\n## To find the concept_id from the shortname that we copied: \n# short_name = 'ECCO_L4_SSH_05DEG_MONTHLY_V4R4' \n# collection = DataCollections().short_name(short_name).get()\n# [c.concept_id() for c in collection] ## this returned 'C1990404799-POCLOUD'\n\n# Then we build a Query with spatiotemporal parameters. \nGranuleQuery = DataGranules().concept_id('C1990404799-POCLOUD').bounding_box(-134.7,58.9,-133.9,59.2)\n\n## We get the metadata records from CMR\ngranules = GranuleQuery.get()\n\n## Now it's time to open our data granules list. \ns3_links = [granule.data_links(access='direct') for granule in granules] \ns3_links[0]\n```\n:::\n\n\nNote that `files = Store(auth).open(granules)` would work for Python users but `open` won't work in the R world because it will create some kind of python file handlers from `fsspec`.\n\n### Get the Cloud credentials\n\nPrerequesite: you'll need a functioning .netrc here. `earthdata` expects interactivity and that did not work here with Quarto in the RStudio IDE (and it also did not work for Julie in Jupyter notebook (June 7 2022)). So, we followed the 2021-Cloud-Hackathon's [NASA_Earthdata_Authentication](https://nasa-openscapes.github.io/2021-Cloud-Hackathon/tutorials/04_NASA_Earthdata_Authentication.html), copying and pasting and running that code in a Jupyter notebook. (remember to `rm .netrc` beforehand!)\n\nThen, with a nice .netrc file, the next step is to get Cloud credentials:\n\n\n::: {.cell}\n\n```{.python .cell-code}\n## import the Auth class from the earthdata library\nfrom earthdata import Auth\n\nauth = Auth().login(strategy=\"netrc\")\ncredentials = auth.get_s3_credentials(cloud_provider = \"POCLOUD\") \n```\n:::\n\n\nSo now we have the s3 links and the credentials to download the links, so now we can use the tutorial in R!!\n\n**Notes**\n\n- Luis will update `earthdata` to automatically know the cloud provider so that you don't have to specify for example POCLOUD vs PODAAC\n# credentials you actually don't want to print your credentials, we were just checking that they worked\n- The resulting JSON dictionary is what we'll export to R, and it will be valid for 1 hour. 
When I run into issues, I'll say \"why is this not working\", and it's because it's expired in 1 hour.\n- When we want to identify the bucket level, we'll need to remove the name of the file. For example:\n - <s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/SEA_SURFACE_HEIGHT_mon_mean_1992-01_ECCO_V4r4_latlon_0p50deg.nc> includes the filename\n - <s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/> is only the bucket\n- Expect to run into issues with listing the files in the bucket (because maybe something is restricted or maybe you can access files but not list everything that's inside the bucket)\n\n## R: data access from s3 links!\n\nAnd now I can switch to R, if R is my preferred language.\n\nThe blog post [Using Amazon S3 with R](https://blog.djnavarro.net/posts/2022-03-17_using-aws-s3-in-r/) by Danielle Navarro is hugely informative and describes how to use the [aws.s3](https://github.com/cloudyr/aws.s3) R package.\n\nFirst load libraries: \n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\nlibrary(readr)\nlibrary(purrr)\nlibrary(stringr)\nlibrary(tibble)\nlibrary(aws.s3) # install.packages(\"aws.s3\")\nlibrary(reticulate)\n```\n:::\n\n\nTranslate credentials from python variables (created with `earthdata` above) to R variables using `reticulate`'s `py$` syntax and `purr`'s `pluck()` to isolate a variable from a list:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## translate credentials from python to R, map to dataframe\ncredentials_r_list <- py$credentials #YAY!\ncredentials_r <- purrr::map_df(credentials_r_list, print)\n\n## translate s3 links from python to R, create my_bucket\ns3_links_r_list <- py$s3_links\nmy_link_list <- s3_links_r_list[1] # let's just start with one\nmy_link_chr <- purrr:::map_chr(my_link_list, paste, collapse=\"\")\n#my_link <- as_tibble(my_link_chr)\n#my_link_split <- stringr::str_split(my_link, \"/\")\n#my_bucket <- str_c(\"s3://\", my_link_split[3], my_link_split[4])\nmy_bucket <- \"s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/\"\n```\n:::\n\n\nFrom the [`aws.s3` documentation](https://github.com/cloudyr/aws.s3#aws-s3-client-package), set up system environment variables for AWS:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nSys.setenv(\"AWS_ACCESS_KEY_ID\" = credentials_r$accessKeyId,\n \"AWS_SECRET_ACCESS_KEY\" = credentials_r$secretAccessKey,\n \"AWS_DEFAULT_REGION\" = \"us-west-2\",\n \"AWS_SESSION_TOKEN\" = credentials_r$sessionToken)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# testing by hand: Luis\nSys.setenv(\"AWS_ACCESS_KEY_ID\" = \"ASIATNGJQBXBHRPIKFFB\",\n \"AWS_SECRET_ACCESS_KEY\" = \"zbYP2fueNxLK/joDAcz678mkjjzP6fz4HUN131ID\",\n \"AWS_DEFAULT_REGION\" = \"us-west-2\")\n```\n:::\n\n\nFirst let's test Danielle's code to see if it runs. 
Note to Luis: the following only works when the `Sys.setenv` is not set:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(aws.s3)\n\nbucket_exists(\n bucket = \"s3://herbariumnsw-pds/\", \n region = \"ap-southeast-2\"\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nClient error: (403) Forbidden\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\nattr(,\"x-amz-bucket-region\")\n[1] \"ap-southeast-2\"\nattr(,\"x-amz-request-id\")\n[1] \"0FQ1R57F2VHGFPDF\"\nattr(,\"x-amz-id-2\")\n[1] \"N6RPTKPN3/H9tDuKNHM2ZAcChhkkn2WpfcTzhpxC3fUmiZdNEIiu1xJsQAvFSecYIuWZ28pchQW3sAPAdVU57Q==\"\nattr(,\"content-type\")\n[1] \"application/xml\"\nattr(,\"date\")\n[1] \"Thu, 07 Jul 2022 23:11:30 GMT\"\nattr(,\"server\")\n[1] \"AmazonS3\"\n```\n:::\n:::\n\n\nNow, see if the PODAAC bucket exists:\n\n\n::: {.cell}\n\n```{.r .cell-code}\naws.s3::bucket_exists(\n bucket = \"s3://podaac-ops-cumulus-protected/\", \n region = \"us-west-2\"\n)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nClient error: (403) Forbidden\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\nattr(,\"x-amz-bucket-region\")\n[1] \"us-west-2\"\nattr(,\"x-amz-request-id\")\n[1] \"M4T3W1JZ93M08AZB\"\nattr(,\"x-amz-id-2\")\n[1] \"hvGLWqGCRB4lLf9pD8f67OsTDulSOgqd+yLWzUTRFz2tlLPVpxHr9mSREL0bQPVyo70j0hvJp+8=\"\nattr(,\"content-type\")\n[1] \"application/xml\"\nattr(,\"date\")\n[1] \"Thu, 07 Jul 2022 23:11:30 GMT\"\nattr(,\"server\")\n[1] \"AmazonS3\"\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nherbarium_files <- get_bucket_df(\n bucket = \"s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/\", \n region = \"us-west-2\",\n max = 20000\n) %>% \n as_tibble()\n```\n:::\n\n\n\nIf forbidden: \n- 1 hour expiration time\n- this bucket is not listable (or protected) (hopefully this error will be clear enough)\n\nIf you get the following error, it's likely because your credentials have expired:\n\n::: callout-important\n Client error: (403) Forbidden\n [1] FALSE\n attr(,\"x-amz-bucket-region\")\n [1] \"us-west-2\"\n attr(,\"x-amz-request-id\")\n [1] \"W2PQV030PDTGDD32\"\n attr(,\"x-amz-id-2\")\n [1] \"S8C0qzL1lAYLufzUupjqplyyS/3fWCKxIELk0OJLVHGzTOqlyhof+IPFYbaRUhmJwXQelfprYCU=\"\n attr(,\"content-type\")\n [1] \"application/xml\"\n attr(,\"date\")\n [1] \"Wed, 08 Jun 2022 03:11:16 GMT\"\n attr(,\"server\")\n [1] \"AmazonS3\"\n:::\n\n## Dev notes\n\n### Chat with Andy May 26\n\nMaybe have a python script that takes arguments, compiled in a way that then in MatLab you can sys.admin that python script. Then he doesn't need to know python\n\nOther approach would be MatLab to re-write earthdata in MatLab\n\nOur dream, revised: the code should be language-agnostic\n\n## Background\n\nThis was Luis' original example code, but it downloads data. The examples above access it in the cloud. *From <https://nasa-openscapes.github.io/earthdata-cloud-cookbook/examples/earthdata-access-demo.html>*\n\n``` python\nfrom earthdata import Auth, DataGranules, Store\n\n# first we authenticate with NASA EDL\nauth = Auth().login(strategy=\"netrc\")\n\n# Then we build a Query with spatiotemporal parameters\nGranuleQuery = DataGranules().concept_id(\"C1575731655-LPDAAC_ECS\").bounding_box(-134.7,58.9,-133.9,59.2)\n\n# We get the metadata records from CMR\ngranules = GranuleQuery.get()\n\n# Now it{s time to download (or open) our data granules list with get()\nfiles = Store(auth).get(granules, local_path='./data')\n```\n", | ||
"supporting": [], | ||
"filters": [ | ||
"rmarkdown/pagebreak.lua" | ||
], | ||
"includes": {}, | ||
"engineDependencies": {}, | ||
"preserve": {}, | ||
"postProcess": true | ||
} | ||
} |
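The R chunk in the frozen document above comments out a string-splitting attempt and simply hardcodes `my_bucket`. As a minimal sketch of that bucket-derivation step (assuming the `my_link_chr` value built in that chunk and the `stringr` package loaded earlier; not part of the commit itself), stripping the trailing filename from an s3 link could look like this:

```r
library(stringr)

# Example s3 link, as returned by granule.data_links(access='direct') in the python chunk above
my_link_chr <- "s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/SEA_SURFACE_HEIGHT_mon_mean_1992-01_ECCO_V4r4_latlon_0p50deg.nc"

# Drop everything after the last "/" to keep only the bucket-level prefix
my_bucket <- str_remove(my_link_chr, "[^/]+$")
my_bucket
#> [1] "s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/"
```

This is only one possible way to replace the hardcoded `my_bucket` assignment; the same prefix could also be rebuilt from the split pieces, as the commented-out lines suggest.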