Storing dtype information alongside BIDS tabular files? #1853

Open
psadil opened this issue Jun 8, 2024 · 0 comments

Your idea

This is to continue a conversation that started on Neurostars: https://neurostars.org/t/are-there-recommended-ways-of-storing-dtype-information-alongside-bids-tabular-files/29601.

I'm working with a fairly large dataset (thousands of participants, many of whom have multiple sessions), and at some point most of the information that is stored in a tabular (tsv) format will end up in either a database or a binary table format (something like Postgres or Parquet). I'd like to facilitate that conversion by storing metadata about the datatype of each column in the json sidecars (for example, float16 vs float32 vs int32). Opening this here in case others have an interest in this kind of functionality.

Other details

Most tools with something like a read_csv method do a decent job of guessing type information. But things can break down when A) there are many missing entries in a column, B) one wants to specify a limited numeric type (for example, int8, or even unsigned int8), or C) one wants to distinguish between unordered and ordered categorical information.
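To make cases A through C concrete, here is a minimal sketch using pandas. The tiny table and its column names are made up for illustration; the only thing it relies on is that read_csv accepts an explicit dtype mapping.

```python
import io

import pandas as pd

# Hypothetical three-row table; "n/a" marks a missing value, as in BIDS tsv files.
TSV = (
    "participant_id\tage\tseverity\n"
    "sub-01\t34\tmild\n"
    "sub-02\tn/a\tsevere\n"
    "sub-03\t29\tmoderate\n"
)

# Left to its own devices, pandas turns the integer column with a missing
# entry into float64 and reads the labels as a generic string/object column.
inferred = pd.read_csv(io.StringIO(TSV), sep="\t", na_values=["n/a"])

# With explicit dtypes, the same file yields a nullable unsigned 8-bit integer
# column and an ordered categorical.
severity_dtype = pd.CategoricalDtype(["mild", "moderate", "severe"], ordered=True)
explicit = pd.read_csv(
    io.StringIO(TSV),
    sep="\t",
    na_values=["n/a"],
    dtype={"age": "UInt8", "severity": severity_dtype},
)

print(inferred.dtypes)
print(explicit.dtypes)
```

The explicit mapping is exactly the information that currently has nowhere to live next to the tsv file.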

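Unfortunately, both JSON Schema and OpenAPI offer only a limited range of types. JSON Schema distinguishes just integer from number, and OpenAPI's formats stop at int32/int64 and float/double, so neither can express narrower or unsigned numeric types such as float16 or uint8.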

In general, most tools have at least slightly different sets of data types (for example, pyarrow versus pandas), so it's not obvious to me how to build a list of allowable types. If pursuing this, my first direction would probably be to use the names given by Arrow, excluding the ones that are not BIDS-valid (like time without a date, date without a time, or the null type).
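As a rough sketch of what that could look like, the snippet below checks a sidecar against an allowlist of type names. Everything specific here is an assumption for discussion: the "Dtype" key is hypothetical and not part of BIDS, and the allowlist is just one possible choice, with names modelled on pyarrow's type factory functions (pa.int8(), pa.float32(), and so on).

```python
import json

# Assumed allowlist, not an agreed standard. Date-only, time-only, and null
# types are deliberately left out.
ALLOWED_DTYPES = frozenset({
    "bool",
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64",
    "string", "timestamp",
})


def invalid_dtypes(sidecar):
    """Return {column: dtype} for every declared dtype outside the allowlist.

    Columns without a "Dtype" entry (hypothetical key) are skipped, so
    existing sidecars would remain valid.
    """
    problems = {}
    for column, meta in sidecar.items():
        dtype = meta.get("Dtype") if isinstance(meta, dict) else None
        if dtype is not None and dtype not in ALLOWED_DTYPES:
            problems[column] = dtype
    return problems


# Made-up sidecar content for illustration.
SIDECAR = json.loads("""
{
  "age": {"Description": "Age in years", "Dtype": "uint8"},
  "scan_date": {"Description": "Date of scan", "Dtype": "date32"},
  "handedness": {"Description": "Dominant hand"}
}
""")

print(invalid_dtypes(SIDECAR))  # {'scan_date': 'date32'}
```

Keeping the key optional means a validator could ignore it entirely, while a converter could use it when present.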
