Thanks for creating this project!
Implement writing multiple dataframes into the same file
The current implementation supports reading a large file into several dataframes; however, there is no way to write a large amount of data that spans several dataframes into a single file. Assuming all the dataframes have the same schema, how can we write them to the same Apache Parquet file?
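For context, something close to this is already possible with ParquetSharp's lower-level API by appending one row group per batch of data; the request here is essentially a DataFrame-level wrapper around that pattern. A minimal sketch, where the file name, column layout and in-memory batches are purely illustrative:

using ParquetSharp;

// Minimal sketch: each batch becomes its own row group in a single Parquet file.
// File name, columns and batches below are illustrative only.
var columns = new Column[]
{
    new Column<long>("id"),
    new Column<double>("value"),
};

using var fileWriter = new ParquetFileWriter("output.parquet", columns);

var batches = new[]
{
    (ids: new long[] { 1, 2, 3 }, values: new[] { 0.1, 0.2, 0.3 }),
    (ids: new long[] { 4, 5 },    values: new[] { 0.4, 0.5 }),
};

foreach (var batch in batches)
{
    // One row group per batch, columns written in schema order.
    using var rowGroup = fileWriter.AppendRowGroup();
    using (var idWriter = rowGroup.NextColumn().LogicalWriter<long>())
    {
        idWriter.WriteBatch(batch.ids);
    }
    using (var valueWriter = rowGroup.NextColumn().LogicalWriter<double>())
    {
        valueWriter.WriteBatch(batch.values);
    }
}

fileWriter.Close();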
Why is this useful?
Whenever you have a large amount of data that is partitionable, each partition should be processed independently. You then want to parallelise the computation and avoid blowing up memory, yet still be able to save the output of every partition to the same file.
Scenario:
- Open a large Parquet file.
- Parallel.For over each row group:
  - read it into a DataFrame
  - process the DataFrame
- Write each DataFrame consecutively to disk.
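As a rough sketch of the read-and-process half of this scenario: it assumes the ToDataFrame extension from this project can read individual row groups (for example through a rowGroupIndices argument), and Process stands in for whatever per-partition work is needed. The consecutive write at the end is exactly the piece this issue asks for.

using System.Collections.Concurrent;
using System.Threading.Tasks;
using Microsoft.Data.Analysis;
using ParquetSharp;
// Assumes the ToDataFrame extension from the ParquetSharp.DataFrame package is in scope.

// Hypothetical per-partition processing step.
DataFrame Process(DataFrame frame) => frame;

int numRowGroups;
using (var metadataReader = new ParquetFileReader("large.parquet"))
{
    numRowGroups = metadataReader.FileMetaData.NumRowGroups;
}

var processed = new ConcurrentDictionary<int, DataFrame>();

Parallel.For(0, numRowGroups, rowGroup =>
{
    // One reader per iteration so no reader instance is shared across threads.
    using var reader = new ParquetFileReader("large.parquet");
    // Assumes ToDataFrame can select row groups; adjust to the actual extension signature.
    var frame = reader.ToDataFrame(rowGroupIndices: new[] { rowGroup });
    processed[rowGroup] = Process(frame);
});

// The missing piece: write processed[0..numRowGroups-1] consecutively into one file.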
Sample Usage of the new Interface:
// get data
IEnumerable<DataFrame> dataframes = GetDataFrames();
dataframes = ProcessOneDataFrameAtATime(dataframes);

// Configure writer properties.
using var propertiesBuilder = new WriterPropertiesBuilder();
propertiesBuilder.Compression(Compression.Snappy);
using var properties = propertiesBuilder.Build();

// Option 1: extension method for IEnumerable<DataFrame>;
// can assume all frames have the same schema.
dataframes.ToParquet(parquet_file_path, properties);

// Option 2: iterative interface, useful when you need to write one frame at a time.
// In general, if you only have one implementation, do this one as it is more general.
using var parquetWriter = new ParquetFileWriter(parquet_file_path);
foreach (var dataframe in dataframes)
{
    parquetWriter.Write(dataframe);
}
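If only the iterative interface is implemented, the IEnumerable<DataFrame> extension could be a thin wrapper over it. The sketch below is only an illustration of that layering: the path-plus-properties ParquetFileWriter constructor and Write(DataFrame) are the API proposed in this issue, not anything that exists today.

using System.Collections.Generic;
using Microsoft.Data.Analysis;
using ParquetSharp;

public static class DataFrameEnumerableExtensions
{
    // Sketch of the proposed extension; assumes every DataFrame shares one schema,
    // inferred by the (proposed) writer from the first frame it receives.
    public static void ToParquet(this IEnumerable<DataFrame> dataframes,
                                 string path,
                                 WriterProperties properties)
    {
        // Proposed constructor and Write(DataFrame) from this issue, not existing API.
        using var parquetWriter = new ParquetFileWriter(path, properties);
        foreach (var dataframe in dataframes)
        {
            parquetWriter.Write(dataframe);
        }
        parquetWriter.Close();
    }
}

Either way, writing each DataFrame as its own row group keeps memory bounded to roughly one partition at a time, which is the point of the scenario above.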