Use zero-copy C data interface in Parquet adapter #544

Merged
merged 1 commit into from
Jun 16, 2025

arhamchopra
Collaborator

@arhamchopra arhamchopra commented Jun 10, 2025

We previously used the Arrow IPC interface in the ParquetAdapter to pass data from Python to C++. This serialized the data, which cost extra compute and an additional data copy. We could not use a zero-copy interface before because there was no portable zero-copy method for passing Arrow data.

The PyCapsule interface enables easy zero-copy movement of data through the stable Arrow C data interface. This PR resolves #212.

This PR replaces the IPC transfer logic with the PyCapsule interface, so data passes from Python to C++ without a copy.
It also improves performance on the following benchmark:

from datetime import datetime

import csp
from csp.adapters.parquet import ParquetReader
import pyarrow as pa

class MyStruct(csp.Struct):
    ts: datetime

@csp.graph
def ReadArrowTable(tables: object):
    reader = ParquetReader(tables, time_column="ts", binary_arrow=True, read_from_memory_tables=True, NEW=True)
    records = reader.subscribe_all(MyStruct, MyStruct.default_field_map())
    csp.add_graph_output("o", records)


def generate_pyarrow_table(num_rows):
    import pyarrow as pa
    import numpy as np
    from datetime import datetime, timedelta

    random_data = np.random.rand(num_rows)
    start_time = datetime(2023, 1, 1)
    timestamps = [start_time + timedelta(seconds=i) for i in range(num_rows)]

    timestamp_array = pa.array(timestamps)
    fields = [ pa.field('ts', pa.timestamp('s')) ]
    schema = pa.schema(fields)
    table = pa.Table.from_arrays([timestamp_array], schema=schema)
    return table

# Old logic:
# %timeit csp.run(ReadArrowTable, generate_pyarrow_table(1000000), starttime=datetime(2023,1,1))
# 1.43 s ± 40.4 ms per loop

# New logic:
# %timeit csp.run(ReadArrowTable, generate_pyarrow_table(1000000), starttime=datetime(2023,1,1))
# 1.25 s ± 20.3 ms per loop
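To put the two timings above in perspective, a quick back-of-the-envelope calculation. It uses only the mean values quoted in the benchmark; the variable names are made up for the example, and the per-row figure assumes the whole saving is spread evenly over the 1,000,000 rows.

```python
# Mean wall-clock times reported by %timeit above, in seconds.
old_s = 1.43
new_s = 1.25
num_rows = 1_000_000

saved_s = old_s - new_s                  # absolute time saved per run
reduction_pct = 100 * saved_s / old_s    # relative reduction in runtime
speedup = old_s / new_s                  # old time divided by new time
saved_per_row_ns = saved_s / num_rows * 1e9

print(f"runtime reduction: {reduction_pct:.1f}%")
print(f"speedup: {speedup:.2f}x")
print(f"saved per row: {saved_per_row_ns:.0f} ns")
```

The saving of roughly 180 ms is several times larger than the reported spreads of 40.4 ms and 20.3 ms, so the difference is unlikely to be measurement noise.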

@arhamchopra arhamchopra marked this pull request as ready for review June 10, 2025 17:58
@arhamchopra arhamchopra force-pushed the ac/zero_copy_parquet_adapter branch from 5e1fe14 to 1f5b7fe Compare June 10, 2025 22:37
Collaborator

@AdamGlustein AdamGlustein left a comment

Looks good to me. Please wait for @robambalu to take a look as well.

@arhamchopra arhamchopra merged commit 0e1dad5 into main Jun 16, 2025
27 checks passed
@arhamchopra arhamchopra deleted the ac/zero_copy_parquet_adapter branch June 16, 2025 15:31
Development

Successfully merging this pull request may close these issues.

Move Arrow code to PyCapsule / C API
3 participants