
Missing Documentation for Creating a Data Pump with MetPX Sarracenia to Subscribe Data to S3 Bucket #1379

Open
jjure opened this issue Feb 3, 2025 · 7 comments
Labels
Discussion_Needed (developers should discuss this issue), Documentation (primary deliverable of this item is documentation)

Comments

@jjure

jjure commented Feb 3, 2025

Hi there,

I'm new to using MetPX Sarracenia and I'm trying to set up a data pump to subscribe to some data and store it in an S3 bucket. However, I've found that the documentation for this specific use case is either incomplete or missing.
Steps I Followed:

  • Installed MetPX Sarracenia.
  • Configured the basic setup as per the general documentation; I managed to get the watch and subscribe examples working.
  • Attempted to find instructions on how to configure a data pump to subscribe to data and push it to an S3 bucket.

Expected Behavior:

I expected to find detailed steps or a guide on how to configure MetPX Sarracenia to subscribe to data and store it in an S3 bucket.
Actual Behavior:

The documentation does not cover this specific scenario, making it difficult for a newcomer like myself to proceed.
Environment:

MetPX Sarracenia Version: 3.00.57 
Operating System:  Linux Debian 12
S3 Provider: Ceph S3

Additional Information:

  • I have tried searching through the official documentation and community forums but couldn't find the necessary information.
  • I would greatly appreciate it if a step-by-step guide could be added to the documentation for setting up a data pump that subscribes to data and stores it in an S3 bucket.

I can see that an S3 transfer class exists in the code, but I can't get sr3 to use it.

Best regards,
Jure

@petersilva
Contributor

First, make sure you have the Python driver for the S3 protocol.
Try:

fractal% sr3 features

Status:    feature:   python imports:      Description: 
Installed  amqp       amqp                 can connect to rabbitmq brokers
Absent     azurestorage azure-storage-blob   cannot connect natively to Azure Stoarge accounts
Installed  appdirs    appdirs              place configuration and state files appropriately for platform (windows/mac/linux)
Installed  filetypes  magic                able to set content headers
Absent     ftppoll    dateparser,pytz      not able to poll with ftp
Installed  humanize   humanize,humanfriendly humans numbers that are easier to read.
Installed  jsonlogs   pythonjsonlogger     can write json logs, in addition to text ones.
Installed  mqtt       paho.mqtt.client     can connect to mqtt brokers
Installed  process    psutil               can monitor, start, stop processes:  Sr3 CLI should basically work
Installed  reassembly flufl.lock           can reassemble block segmented files transferred
Installed  redis      redis,redis_lock     can use redis implementations of retry and nodupe
Installed  retry      jsonpickle           can write messages to local queues to retry failed publishes/sends/downloads
Installed  s3         boto3                able to connect natively to S3-compatible locations (AWS S3, Minio, etc..)
Installed  sftp       paramiko             can use sftp or ssh based services
Installed  vip        netifaces            able to use the vip option for high availability clustering
Installed  watch      watchdog             watch directories
Installed  xattr      xattr                will store file metadata in extended attributes

 state dir: /home/peter/.cache/sr3 
 config dir: /home/peter/.config/sr3 

fractal% 

See the s3 line? It means S3 support is enabled because the boto3 module is installed. If it shows Absent instead, install it either with the system package:


sudo apt install python3-boto3

or as a Python module with pip:


pip install boto3
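
Once installed, re-running the command above should show the s3 row as Installed; a quick check:


sr3 features | grep s3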

@petersilva
Contributor

Step 2: put the credentials for the S3 bucket you want to write to in ~/.config/sr3/credentials.conf:


s3://key:password@host/endpoint


  • key == aws_access_key_id
  • password == aws_secret_access_key
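
For example, an entry for a Ceph endpoint might look like the following (every value here is a made-up placeholder):


s3://MYACCESSKEY:MYSECRETKEY@s3.ceph.example.org/mybucket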

Then, in a sender configuration, e.g. ~/.config/sr3/upload_to_sr3.conf:


destination s3://key@host/endpoint

accept .*

So... sr3 does not do direct third-party transfers. If you are trying to transfer from one upstream and push to an S3 bucket, the data has to traverse the local machine and then be pushed out. To do that, you need (see the sketch after this list):

  • a subscriber to download the data.
  • a local broker to post the messages about successful downloads to.
  • a sender configuration that will subscribe to the downloads and send the files to S3.
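
As a rough sketch of the download side (the file name, user, and host names below are made up; adjust them to your setup), a subscriber that downloads locally and re-posts to a local broker could look something like this:


# hypothetical ~/.config/sr3/subscribe/from_upstream.conf
# upstream pump to pull data from
broker amqps://USER@upstream.example.com
subtopic #
accept .*
mirror on
directory /tmp/sarra/download
# re-announce successful downloads on the local broker so a sender can pick them up
post_broker amqp://USER@localhost/
post_exchange xs_USER
post_baseUrl file:/tmp/sarra/download
post_baseDir /tmp/sarra/download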

Documentation.

Yeah, the documentation is really lacking in this area... We need to try out some use cases and develop that.

The first thing we should do is explain the S3 credential fields in https://metpx.github.io/sarracenia/Reference/sr3_credentials.7.html

Somebody did work out a great example here: https://github.com/MetPX/sr3-examples/tree/main/cloud-publisher-s3

but the problem with it is that the S3 support was rewritten afterwards. The example above uses an S3 plugin, whereas the information I gave you is for the native support (no plugin needed). We need to update that example; for now, that's the documentation we have.

@petersilva added the Documentation and Discussion_Needed labels on Feb 4, 2025
@petersilva
Contributor

@tysonkaufmann maybe update the sr3-example to use the built-in driver?

@jjure
Author

jjure commented Feb 5, 2025

Thank you for your prompt reply. I had already managed to get to the second step. Just by looking into the code, I was able to spot the difference between the previous versions of the S3 data flows, implemented with callback functions, and the new developments.

I followed your instructions and it seems I am missing something obvious:

...
Installed  s3         boto3                able to connect natively to S3-compatible locations (AWS S3, Minio, etc..)
...

This is my credentials.conf:

cat ~/.config/sr3/credentials.conf
s3://KEY:SECRET@HOST/BUCKET
amqp://USERRMQ:PASSWORD@HOST/

This is the config for the watch pump, which is working:

post_broker amqp://USERRMQ@HOST
post_exchange xs_probase
path /tmp/sarra/input/
post_baseDir /tmp/sarra/input/
post_baseUrl file:/tmp/sarra/input
fileEvents create|delete|link|modify
mirror on
delete on

And this one does not work (no file is sent to the bucket):

debug on
broker amqp://USERRMQ@HOST
exchange xs_probase
destination s3://KEY@HOST/BUCKET
accept .*
mirror on

In my understanding, the files are already present locally. Am I wrong?

And yes, I am ready to help with some S3 use-case documentation once I have something working. ;-)

@petersilva
Contributor

petersilva commented Feb 5, 2025

  • I made a mistake above... for the sender, the destination keyword should be replaced with sendTo (a corrected sketch follows at the end of this comment).
  • The logs for the running components should be in ~/.cache/sr3/logs.
  • Is there a separate subscriber that is downloading the files? If so, you can just put the post_ settings in the subscriber itself, and you don't need the local watch configuration. One downloader and one uploader is enough.
  • You should be able to see whether it is successfully downloading files to the local machine and where it is putting them...
  • With the mirror on setting... I think you will be putting the files several directories deep. I don't remember if that normally works in S3, since folders don't really exist there... it might work... I don't recall.
    Anyway, it's good to validate the path.

Anyway, that should move things forward a bit... sharing some of the messages you are seeing in the logs would make this easier to diagnose.
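
To make that first point concrete, here is roughly what the sender config above would look like with destination swapped for sendTo (same placeholders as in your snippets; this is a sketch, not a tested configuration):


# hypothetical ~/.config/sr3/sender/send_to_s3.conf
broker amqp://USERRMQ@HOST
exchange xs_probase
sendTo s3://KEY@HOST/BUCKET
accept .*
# mirror on reproduces the local directory tree as key prefixes in the bucket;
# validate that the resulting paths are what you expect
mirror on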

@petersilva
Contributor

petersilva commented Feb 5, 2025

Oh... and delete on is not good there... it will delete the file after download, before the sender can pick it up.
Put that option in the sender instead.

@petersilva
Contributor

petersilva commented Feb 6, 2025

Even clearer would be to put this in the sender configuration:


delete_source on
delete_destination off
flowcb work.delete


This does the same thing as plain old delete, but it is a bit easier to understand.

UPDATE: had delete settings inverted.
