Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): support customize credential provider for write_dataset… #3283

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

yanghua
Copy link
Collaborator

@yanghua yanghua commented Dec 23, 2024

Support customize credential provider for write dataset

@github-actions github-actions bot added enhancement New feature or request python labels Dec 23, 2024
@yanghua yanghua force-pushed the 3275-write_dataset_AwsCredentialsProvider branch 2 times, most recently from ab24a5b to b3219e8 Compare December 23, 2024 09:17
}

#[async_trait]
impl CredentialProvider for UrlBasedCredentialProvider {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @eddyxu May I ask a question? I realized an UrlBasedCredentialProvider for our inner object store service to get a temporal credential via an assume role. It seems it is very similar to the AwsCredentialAdapter you have provided. Based on AwsCredentialAdapter, if I only need to implement the ProvideCredentials trait, that's enough for my purpose?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way you can re-use the AwsCredentialsAdapter? Or is the purpose of this one somehow different?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I need to implement a non-standard AssumeRoleProvider which receives a URL and fetches ak, sk and session_token from it and support TTL. The built-in AssumeRoleProvider can not customize a URL. It seems AwsCredentialAdapter provided a TTL-based credential wrapper. Maybe I need to implement ProvideCredentials trait.

@yanghua yanghua force-pushed the 3275-write_dataset_AwsCredentialsProvider branch from 2653d0d to 596533b Compare December 23, 2024 11:52
Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears this credential provider is custom for you or your environment? Is it based on some kind of specification or standard?

Can we come up with some kind of solution where this is provided as a plugin? I'd prefer to avoid introducing another specification (UrlCredentialProvider).

I'd also like to avoid the reqwest dependency if possible.

Can we do something like this?

#[derive(Debug)]
pub struct PythonAwsCredentialsProvider {
    callback: Arc<PyObject>,
}

impl PythonAwsCredentialsProvider {
    fn extract(py: Python<'_>, py_cred: PyObject) -> PyResult<Arc<AwsCredential>> {
        let py_cred: &Bound<'_, PyDict> = py_cred.downcast_bound(py)?;
        let key_id = py_cred.get_item("key_id")?.ok_or(PyValueError::new_err("key_id is a required property on the dictionary returned by an AWS credential provider"))?;
        let secret_key = py_cred.get_item("secret_key")?.ok_or(PyValueError::new_err("secret_key is a required property on the dictionary returned by an AWS credential provider"))?;
        let token = py_cred.get_item("token")?;

        let key_id: String = key_id.extract()?;
        let secret_key: String = secret_key.extract()?;
        let token: Option<String> = token.map(|t| t.extract()).transpose()?;

        Ok(Arc::new(AwsCredential {
            key_id,
            secret_key,
            token,
        }))
    }

    async fn do_get_credential(&self) -> object_store::Result<Arc<AwsCredential>> {
        let callback = self.callback.clone();
        tokio::task::spawn_blocking(move || {
            Python::with_gil(move |py| {
                callback
                    .call0(py)
                    .and_then(|py_cred| Self::extract(py, py_cred))
            })
        })
        .await
        .unwrap()
        .map_err(|e| object_store::Error::Generic {
            store: "PythonAwsCredentialsProvider",
            source: e.into(),
        })
    }
}

#[async_trait]
impl CredentialProvider for PythonAwsCredentialsProvider {
    type Credential = AwsCredential;

    async fn get_credential(&self) -> object_store::Result<Arc<Self::Credential>> {
        self.do_get_credential().await
    }
}

This way the credentials can be provided by a python function that can do anything (including making an HTTP request to get the fields you need). Then we can make set this as part of the python API:

def write_dataset(
    data_obj: ReaderLike,
    uri: Union[str, Path, LanceDataset],
    schema: Optional[pa.Schema] = None,
    mode: str = "create",
    *,
    max_rows_per_file: int = 1024 * 1024,
    ...
    credentials_provider: Optional[Callable[[], Dict[str, str]]] = None,
    ...
) -> LanceDataset:

Then it can be used like...

def get_credentials():
  # use python requests library or some other mechanism to make
  # HTTP request or do whatever is needed

lance.write_dataset(..., credential_provider=get_credentials)

@yanghua
Copy link
Collaborator Author

yanghua commented Jan 3, 2025

@westonpace Looks good, let me have a try.

Is it based on some kind of specification or standard?

No, just a customized requirements. But I want to introduce a pluggable way to implement it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants