Overhaul of metadata to move away from pandas #415

Draft: wants to merge 21 commits into base: dev


sjvenditto
Collaborator

Pandas DataFrames, while versatile, add a lot of overhead to the initialization of objects with metadata, even when the metadata is an empty DataFrame. Because slicing an existing object often returns a new object, this overhead is paid again each time an object is sliced.

In this PR, I've changed the datatype of the private _metadata attribute from a pandas DataFrame to a custom dictionary. This custom dictionary implements only the minimal set of methods used by _metadata's DataFrame counterpart (e.g. .loc, .iloc, .columns, .index) and is proving to be more lightweight. Rudimentary benchmarking suggests that, compared to objects with DataFrame metadata, slicing IntervalSet and TsdFrame objects with dictionary metadata is 4-8x faster for objects with metadata and 2-3x faster for objects without metadata (i.e. empty metadata). (This speedup is not seen for TsGroup objects, which have a much slower initialization than the other objects and for which metadata initialization is not the primary source of overhead.)
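To illustrate the idea, here is a minimal sketch of a dictionary-backed container exposing a few DataFrame-like accessors. It is not the implementation in this PR: the names `DictMetadata`, `_ILoc`, and `_Loc` are hypothetical, columns are stored as plain Python lists, and only a small subset of `.loc` / `.iloc` behavior is covered.

```python
class _ILoc:
    """Positional indexer (hypothetical helper, loosely mirroring DataFrame.iloc)."""

    def __init__(self, meta):
        self._meta = meta

    def __getitem__(self, key):
        meta = self._meta
        if isinstance(key, int):
            # A single position returns one row as a plain dict.
            return {name: col[key] for name, col in meta.items()}
        if isinstance(key, slice):
            positions = range(len(meta.index))[key]
        else:
            positions = list(key)
        # Multiple positions return a new container of the same type.
        return DictMetadata(
            index=[meta.index[i] for i in positions],
            columns={name: [col[i] for i in positions] for name, col in meta.items()},
        )


class _Loc:
    """Label indexer (hypothetical helper, loosely mirroring DataFrame.loc)."""

    def __init__(self, meta):
        self._meta = meta

    def __getitem__(self, label):
        index = self._meta.index
        if isinstance(label, list):
            return self._meta.iloc[[index.index(lab) for lab in label]]
        return self._meta.iloc[index.index(label)]


class DictMetadata(dict):
    """Column-oriented metadata: maps column name -> list of values (assumed layout)."""

    def __init__(self, index, columns=None):
        super().__init__()
        self.index = list(index)
        for name, values in (columns or {}).items():
            values = list(values)
            if len(values) != len(self.index):
                raise ValueError(f"column {name!r} does not match index length")
            self[name] = values

    @property
    def columns(self):
        return list(self.keys())

    @property
    def iloc(self):
        return _ILoc(self)

    @property
    def loc(self):
        return _Loc(self)


meta = DictMetadata(index=[10, 11, 12],
                    columns={"label": ["a", "b", "c"], "rate": [1.0, 2.0, 3.0]})
sub = meta.iloc[1:]       # new DictMetadata, no DataFrame constructed
row = meta.loc[12]        # plain dict for a single label
```

Because slicing only copies Python lists, constructing a new container avoids the DataFrame constructor entirely, which is the kind of saving the benchmark figures above refer to.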

On the user side, metadata behaves exactly as it did previously: obj.metadata still returns a DataFrame.
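One way to keep the public interface unchanged is to build the DataFrame lazily, only when the user asks for it. The sketch below assumes a hypothetical `TimeObject` class standing in for the real objects; it is meant to show the pattern, not the code in this PR.

```python
import pandas as pd


class TimeObject:
    """Hypothetical stand-in for an object that carries per-row metadata."""

    def __init__(self, index, metadata=None):
        self._index = list(index)
        # Private storage is a plain dict of lists, so initialization is cheap.
        self._metadata = {k: list(v) for k, v in (metadata or {}).items()}

    @property
    def metadata(self):
        # The DataFrame is only constructed on access, not on every slice.
        return pd.DataFrame(self._metadata, index=self._index)

    def __getitem__(self, key):
        # Only slice keys are handled in this sketch.
        return TimeObject(
            self._index[key],
            {k: v[key] for k, v in self._metadata.items()},
        )


obj = TimeObject([0, 1, 2], {"group": ["x", "y", "z"]})
sub = obj[1:]            # slicing touches only the dict
df = sub.metadata        # a pandas DataFrame, as before
```

With this pattern the cost of building a DataFrame is moved from every initialization to the points where a user actually reads `obj.metadata`.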


1 participant