This repository has been archived by the owner on Dec 29, 2021. It is now read-only.

extensibility of data source #38

Open
houqp opened this issue May 3, 2020 · 5 comments

Comments

@houqp
Contributor

houqp commented May 3, 2020

I am experimenting with evaluating a lazy frame against a custom data source. However, it looks like Reader being declared as a struct makes it hard to add support for a custom data source that shouldn't be part of the dataframe core code base.

Would it make sense to change Reader and Writer into traits so that custom data source implementations can be fully decoupled from the core code base?

@nevi-me
Owner

nevi-me commented May 4, 2020

For lazy reads, what matters is being able to get the schema (easy) and being able to express predicate push-down where supported. The DataSourceEval does the first, but I hadn't gotten to the second.

It's fine if we start with just a trait that doesn't express predicate push-down, as long as the solution leaves scope for it being done in future.

Given that DataFrame takes in a &Reader and creates a concrete Arrow reader for the various formats, we could use arrow::record_batch::RecordBatchReader as the trait for custom sources. We ultimately want to consume a bunch of record batches to create the dataframe.
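A minimal sketch of that direction, using hypothetical stand-in types for `Schema` and `RecordBatch` (the real ones would come from the `arrow` crate) and an iterator-style trait modelled on `arrow::record_batch::RecordBatchReader`:

```rust
// Stand-in types; in practice these would be arrow::datatypes::Schema
// and arrow::record_batch::RecordBatch.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

#[derive(Debug, Clone)]
pub struct RecordBatch {
    pub num_rows: usize,
}

// Modelled on arrow::record_batch::RecordBatchReader: a source only needs
// to expose its schema and yield batches until exhausted.
pub trait BatchReader {
    fn schema(&self) -> Schema;
    fn next_batch(&mut self) -> Option<RecordBatch>;
}

// A custom in-memory source, fully decoupled from the dataframe core.
pub struct InMemorySource {
    schema: Schema,
    batches: Vec<RecordBatch>,
}

impl BatchReader for InMemorySource {
    fn schema(&self) -> Schema {
        self.schema.clone()
    }
    fn next_batch(&mut self) -> Option<RecordBatch> {
        if self.batches.is_empty() {
            None
        } else {
            Some(self.batches.remove(0))
        }
    }
}

// The dataframe side only ever depends on the trait object, so custom
// sources can live outside the core crate.
pub fn collect_rows(reader: &mut dyn BatchReader) -> usize {
    let mut rows = 0;
    while let Some(batch) = reader.next_batch() {
        rows += batch.num_rows;
    }
    rows
}
```

The key point is that the dataframe consumes `&mut dyn BatchReader` rather than a concrete `Reader` struct, so nothing in core needs to know about the source's format.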

I'm typing this on my phone (power outage), but I'll update my comment later when I'm on a desktop.

@houqp
Contributor Author

houqp commented May 5, 2020

If we want to consider predicate push-down as part of the initial design, then perhaps it should be a higher-level trait that provides access to both the RecordBatchReader and the predicates?

@nevi-me
Owner

nevi-me commented Jun 6, 2020

> if we want to consider predicate push-down as part of the initial design, then perhaps it should be a higher level trait that provides access to both RecordBatchReader as well as the predicates?

Yes, I'm thinking along the lines of creating a trait that allows data sources to declare their capabilities (projection, filtering, etc), then applying optimisations to them based on such capability.

@nevi-me
Owner

nevi-me commented Jul 27, 2020

Hey @houqp, I spent some time trying to find a solution to this. The current #[derive(Serialize, Deserialize, Debug, Clone)] on crate::expression::Reader prevents me from passing a trait object that implements a reader/writer (I also tried boxing it).

I haven't explored the idea of a datasource registry, but maybe that could work.
I'm thinking of instead using Arrow Flight for custom datasources. The main downside in the short term is that a custom source would need to implement the gRPC protocol + some defined extensions that would allow determining capabilities.

Spark's DataSource API benefits from runtime reflection to get the correct classes, I'm not good enough with Rust's data types to find equivalent solutions that would work :(


I wanted to implement something like:

pub trait DataSource: INSERT_BOUNDS {
    fn get_dataset(&self) -> Result<Dataset>;
    fn source(&self) -> DataSourceType;
    fn format(&self) -> &str;
    fn schema(&self) -> arrow::datatypes::SchemaRef;
    fn next_batch(&mut self) -> Result<Option<RecordBatch>>;

    fn supports_projection(&self) -> bool {
        false
    }
    fn supports_filtering(&self) -> bool {
        false
    }
    fn supports_sorting(&self) -> bool {
        false
    }
    fn supports_limit(&self) -> bool {
        false
    }

    fn limit(&mut self, limit: usize) -> Result<()>;
    fn filter(&mut self, filter: BooleanFilter) -> Result<()>;
    fn project(&mut self, columns: Vec<String>) -> Result<()>;
    fn sort(&mut self, criteria: Vec<SortCriteria>) -> Result<()>;
}
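For illustration, here is a trimmed, self-contained variant of that trait with an in-memory implementation that opts in to `limit` push-down only (the `RecordBatch` stand-in and the `VecSource` name are hypothetical, and the capability methods default to `false` as in the sketch above):

```rust
// Stand-in for arrow::record_batch::RecordBatch.
#[derive(Debug, Clone)]
pub struct RecordBatch {
    pub num_rows: usize,
}

pub type Result<T> = std::result::Result<T, String>;

// Trimmed version of the proposed DataSource trait: capabilities default
// to false, so a source only overrides what it actually supports.
pub trait DataSource {
    fn next_batch(&mut self) -> Result<Option<RecordBatch>>;

    fn supports_limit(&self) -> bool {
        false
    }
    fn limit(&mut self, _limit: usize) -> Result<()> {
        Err("limit push-down not supported".to_string())
    }
}

// An in-memory source that supports limit push-down.
pub struct VecSource {
    batches: Vec<RecordBatch>,
    remaining: Option<usize>,
}

impl DataSource for VecSource {
    fn next_batch(&mut self) -> Result<Option<RecordBatch>> {
        if self.remaining == Some(0) || self.batches.is_empty() {
            return Ok(None);
        }
        let mut batch = self.batches.remove(0);
        if let Some(rem) = self.remaining.as_mut() {
            // Trim the batch so the pushed-down limit is honoured.
            batch.num_rows = batch.num_rows.min(*rem);
            *rem -= batch.num_rows;
        }
        Ok(Some(batch))
    }

    fn supports_limit(&self) -> bool {
        true
    }
    fn limit(&mut self, limit: usize) -> Result<()> {
        self.remaining = Some(limit);
        Ok(())
    }
}
```

An optimiser could then call `supports_limit()` before deciding whether to push the operation down or evaluate it after reading.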

@houqp
Contributor Author

houqp commented Aug 2, 2020

I think Arrow Flight would be overkill for this use case. As for the proposed trait, wouldn't it make more sense to move the responsibility for performing operations like filter and limit into the dataframe struct itself?
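A sketch of that division of responsibility, with hypothetical names throughout (`Batch` stands in for an Arrow RecordBatch): the dataframe asks the source about its capabilities, delegates when push-down is supported, and otherwise applies the operation itself on the batches it reads.

```rust
// Hypothetical, self-contained sketch; Batch stands in for a RecordBatch.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub rows: Vec<i64>,
}

pub trait Source {
    fn read(&mut self) -> Option<Batch>;
    // Capability flag: can the source enforce a row limit itself?
    fn supports_limit(&self) -> bool {
        false
    }
    fn push_limit(&mut self, _n: usize) {}
}

pub struct DataFrame {
    batches: Vec<Batch>,
}

impl DataFrame {
    // The dataframe owns the operation: push it down when the source can
    // handle it, otherwise enforce it in memory after reading.
    pub fn read_with_limit(source: &mut dyn Source, n: usize) -> DataFrame {
        if source.supports_limit() {
            source.push_limit(n);
        }
        let mut rows = Vec::new();
        while let Some(batch) = source.read() {
            rows.extend(batch.rows);
            if rows.len() >= n {
                break;
            }
        }
        rows.truncate(n); // fallback path: the dataframe applies the limit
        DataFrame {
            batches: vec![Batch { rows }],
        }
    }
}

// A naive source with no push-down support; the dataframe compensates.
pub struct PlainSource {
    data: Vec<Batch>,
}

impl Source for PlainSource {
    fn read(&mut self) -> Option<Batch> {
        if self.data.is_empty() {
            None
        } else {
            Some(self.data.remove(0))
        }
    }
}
```

With this split, sources only declare and implement what they can push down, and the dataframe guarantees correct results either way.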
