Need to be able to write IEnumerable<DataFrame> to disk #4

@GKrivosheev-rms

Description

Thanks for creating this project!

Implement writing multiple dataframes into the same file

The current implementation supports reading a large file into several dataframes. However, there is no way to write a large amount of data, which might span several dataframes, to a single file. Assuming all the dataframes have the same schema, how can we write them to the same Apache Parquet file?

Why is this useful?

Whenever you have a large amount of data that is partitionable, each partition should be processed independently. This lets you parallelise the computation and avoid blowing up memory. Yet you might still want to be able to save the output of each partition to the same file.

Scenario:

  • Open a large Parquet file.
  • Parallel.For each row group:
    • Read it into a dataframe.
    • Process the dataframe.
    • Write each dataframe consecutively to disk.
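
The scenario above can be sketched without any Parquet dependency. This is only an illustration of the pattern (parallel processing, one sequential writer), not the library's API: `Frame`, `ReadRowGroup`, `Process`, `ProcessAll` and `WriteAll` are hypothetical stand-ins, and the `writeRowGroup` callback stands for whatever ends up appending one dataframe to the open file. It assumes PLINQ with `AsOrdered()` is an acceptable way to keep the output in row-group order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a DataFrame holding the values of one row group.
public sealed class Frame
{
    public Frame(int rowGroup, double[] values)
    {
        RowGroup = rowGroup;
        Values = values;
    }

    public int RowGroup { get; }
    public double[] Values { get; }
}

public static class PartitionedPipeline
{
    // Stand-in for reading one row group of the input file into a frame.
    public static Frame ReadRowGroup(int index)
    {
        var values = Enumerable.Range(0, 4).Select(i => (double)(index * 4 + i)).ToArray();
        return new Frame(index, values);
    }

    // Stand-in for the per-partition computation (here: double every value).
    public static Frame Process(Frame frame)
    {
        return new Frame(frame.RowGroup, frame.Values.Select(v => v * 2).ToArray());
    }

    // Reads and processes row groups in parallel; AsOrdered() makes the
    // resulting sequence come out in row-group order when enumerated.
    public static IEnumerable<Frame> ProcessAll(int rowGroupCount)
    {
        return Enumerable.Range(0, rowGroupCount)
            .AsParallel()
            .AsOrdered()
            .Select(i => Process(ReadRowGroup(i)));
    }

    // Writes frames one at a time from a single thread, so the underlying
    // writer never sees concurrent calls. Returns the number of frames written.
    public static int WriteAll(IEnumerable<Frame> frames, Action<Frame> writeRowGroup)
    {
        var written = 0;
        foreach (var frame in frames)
        {
            writeRowGroup(frame);
            written++;
        }
        return written;
    }
}
```

A call such as `PartitionedPipeline.WriteAll(PartitionedPipeline.ProcessAll(8), frame => { /* append frame to the file */ })` would drive the whole pipeline. Note that PLINQ may buffer some results before yielding them, so this sketch does not by itself bound memory to a single dataframe.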

Sample Usage of the new Interface:

// get data
IEnumerable<DataFrame> dataframes = GetDataFrames();
dataframes = ProcessOneDataFrameAtATime(dataframes);

// Configure the writer properties (Snappy compression)
using var propertiesBuilder = new WriterPropertiesBuilder();
propertiesBuilder.Compression(Compression.Snappy);
using var properties = propertiesBuilder.Build();

// Extension for IEnumerable<DataFrame>
// can assume all frames have same schema.
dataframes.ToParquet(parquet_file_path, properties);

// The iterative interface is useful when you need to write one dataframe at a time. In general,
// if you have only one implementation, do this one as it is more general.
using var parquetWriter = new ParquetFileWriter(parquet_file_path);
foreach (var dataframe in dataframes)
{
    parquetWriter.Write(dataframe);
}

Labels: enhancement (New feature or request)
