extensibility of data source #38
For lazy reads, what matters is being able to get the schema (easy) and being able to express predicate push-down if supported. It's fine if we start with just a trait that doesn't express predicate push-down, as long as the solution leaves scope for it to be added in future. I'm typing this on my phone (power outage), so I'll update my comment later when I'm on a desktop.
If we want to consider predicate push-down as part of the initial design, then perhaps it should be a higher-level trait that provides access to both.
Yes, I'm thinking along the lines of creating a trait that allows data sources to declare their capabilities (projection, filtering, etc.), then applying optimisations to them based on those capabilities.
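A minimal sketch of that idea, assuming hypothetical names (`DataSourceCapabilities`, `CsvSource`, `plan_filter` are illustrations, not part of the code base): sources opt in to capabilities via default-`false` methods, and the planner consults them before pushing an operation down.

```rust
// Hypothetical sketch: a data source declares which optimisations it
// supports; the planner pushes an operation down only when supported.
trait DataSourceCapabilities {
    // Defaults are false so each source opts in explicitly.
    fn supports_projection(&self) -> bool { false }
    fn supports_filtering(&self) -> bool { false }
}

struct CsvSource;
impl DataSourceCapabilities for CsvSource {
    // This source can prune columns at read time, but not filter rows.
    fn supports_projection(&self) -> bool { true }
}

// The planner decides where each operation runs based on capability.
fn plan_filter(source: &dyn DataSourceCapabilities) -> &'static str {
    if source.supports_filtering() {
        "push filter down to source"
    } else {
        "evaluate filter in memory after scan"
    }
}

fn main() {
    // CsvSource supports projection but not filtering, so the planner
    // falls back to in-memory filtering.
    println!("{}", plan_filter(&CsvSource));
}
```

The advantage of default methods is that adding a new capability later doesn't break existing source implementations.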
Hey @houqp, I spent some time trying to find a solution to this. I haven't explored the idea of a data source registry, but maybe that could work. Spark's DataSource API benefits from runtime reflection to get the correct classes; I'm not good enough with Rust's data types to find equivalent solutions that would work :( I wanted to implement something like:

```rust
pub trait DataSource: INSERT_BOUNDS {
    fn get_dataset(&self) -> Result<Dataset>;
    fn source(&self) -> DataSourceType;
    fn format(&self) -> &str;
    fn schema(&self) -> arrow::datatypes::SchemaRef;
    fn next_batch(&mut self) -> Result<Option<RecordBatch>>;

    // Capabilities default to false; sources opt in.
    fn supports_projection(&self) -> bool {
        false
    }
    fn supports_filtering(&self) -> bool {
        false
    }
    fn supports_sorting(&self) -> bool {
        false
    }
    fn supports_limit(&self) -> bool {
        false
    }

    fn limit(&mut self, limit: usize) -> Result<()>;
    fn filter(&mut self, filter: BooleanFilter) -> Result<()>;
    fn project(&mut self, columns: Vec<String>) -> Result<()>;
    fn sort(&mut self, criteria: Vec<SortCriteria>) -> Result<()>;
}
```
I think Arrow Flight would be overkill for this use case. As for the proposed trait, wouldn't it make more sense to move the responsibility of performing operations like filter and limit into the dataframe struct itself?
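A sketch of that alternative split, with simplified stand-in types (`Source`, `MemSource`, and the `Vec<i64>` rows are hypothetical, not the actual arrow types): the dataframe owns the operation and only delegates when the source reports support for push-down, so the source trait stays minimal.

```rust
// Hypothetical sketch: the dataframe owns filter/limit logic and only
// delegates to the source when it reports support for push-down.
trait Source {
    fn supports_limit(&self) -> bool { false }
    fn read(&self) -> Vec<i64>;
    // Only called when supports_limit() is true.
    fn read_limited(&self, _n: usize) -> Vec<i64> { unimplemented!() }
}

// A simple in-memory source with no push-down support.
struct MemSource(Vec<i64>);
impl Source for MemSource {
    fn read(&self) -> Vec<i64> { self.0.clone() }
}

struct DataFrame { source: Box<dyn Source> }
impl DataFrame {
    // limit lives on the dataframe; unsupported sources get a fallback.
    fn limit(&self, n: usize) -> Vec<i64> {
        if self.source.supports_limit() {
            self.source.read_limited(n)
        } else {
            let mut rows = self.source.read();
            rows.truncate(n);
            rows
        }
    }
}

fn main() {
    let df = DataFrame { source: Box::new(MemSource(vec![1, 2, 3, 4])) };
    // MemSource can't push the limit down, so the dataframe truncates.
    println!("{:?}", df.limit(2)); // [1, 2]
}
```

With this split, every data source gets correct (if unoptimised) behaviour for free, and push-down becomes a pure optimisation.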
I am experimenting with evaluating a lazy frame with a custom data source. However, it looks like `Reader` being declared as a struct makes it hard to add support for custom data sources that shouldn't be part of the dataframe core code base.
Would it make sense to change `Reader` and `Writer` into traits so that custom data source implementations can be fully decoupled from the core code base?