Favorite data-efficient format for tabular data #2483
Hi folks, I used to store data for my experiments in CSV, and now I store it in JSONL. I'm getting sick of how data-wasteful it is. I want my files to be smaller. I know there are compressed formats for tabular data such as Parquet, Feather and HDF5. I've never worked with any of them, so I could benefit from a recommendation of which one you've had a good experience working with, especially in VisiData. I know that VisiData supports Parquet and HDF5 and many other formats.
You might try `jsonla`, which is an alternative version of `jsonl` that stores each row as a JSON list, rather than a JSON object. This means you specify the column headers once on the first row (as a JSON list), then only specify the data on subsequent rows.
So it's more compact than JSONL (closer to CSV), but retains the stronger structure of JSONL.
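To make the difference concrete, here is a minimal Python sketch of the layout described above: a header list on the first line, then one JSON list per row. It is not VisiData's loader or saver code; the sample rows and the helper names `to_jsonl`, `to_jsonla` and `from_jsonla` are made up for illustration.

```python
import json

# Hypothetical sample rows, invented for this illustration.
rows = [
    {"name": "alice", "score": 91, "passed": True},
    {"name": "bob", "score": 78, "passed": False},
]

def to_jsonl(rows):
    """One JSON object per line, so the column names repeat on every row."""
    return "\n".join(json.dumps(row) for row in rows) + "\n"

def to_jsonla(rows):
    """Column names once as a JSON list, then one JSON list per row."""
    columns = list(rows[0])  # assumes every row has the same keys
    lines = [json.dumps(columns)]
    lines += [json.dumps([row.get(col) for col in columns]) for row in rows]
    return "\n".join(lines) + "\n"

def from_jsonla(text):
    """Rebuild the row dicts from the header line plus the data lines."""
    lines = text.splitlines()
    columns = json.loads(lines[0])
    return [dict(zip(columns, json.loads(line))) for line in lines[1:]]

jsonl_text = to_jsonl(rows)
jsonla_text = to_jsonla(rows)

print(jsonla_text, end="")
print(len(jsonl_text), "characters as JSONL,", len(jsonla_text), "as JSON Lines arrays")
```

Numbers and booleans keep their JSON types on the round trip, which is the "stronger structure" compared with CSV. How much space you save depends on how long the column names are relative to the values.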
See this comment: #1726 (comment)
And the PR where I added support for file format `jsonla` (short for JSON Lines Arrays): #1730
See the JSONL documentation: https://jsonlines.org/examples/
There is also `usv`, which is basically CSV, but with Unicode separators for columns and rows, which means you can freely include commas, newlines, tabs, quotes, …