Favorite data-efficient format for tabular data #2483

cool-RR · 2024-08-02T07:25:10Z

Hi folks,

I used to store data for my experiments in CSV, and now I store it in JSONL. I'm getting sick of how data-wasteful it is. I want my files to be smaller. I know there are compressed formats for tabular data such as Parquet, Feather and HDF5. I've never worked with any of them, so I'd appreciate a recommendation for one you've had a good experience with, especially in VisiData. I know that VisiData supports Parquet, HDF5 and many other formats.

Answered by daviewales

daviewales · 2024-09-26T12:50:16Z

You might try jsonla, a variant of jsonl that stores each row as a JSON list rather than a JSON object.
This means you specify the column headers once on the first row (as a JSON list), then only specify the data on subsequent rows.
So it's more compact than JSONL (closer to CSV), but retains the stronger structure of JSONL.
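
To make the difference concrete, here is a minimal Python sketch of the idea, using only the standard library. It is not VisiData's loader code: the function names `records_to_jsonla` and `jsonla_to_records` and the sample records are made up for illustration.

```python
import json


def records_to_jsonla(records):
    """Serialize a list of dicts as text: one header list, then one value list per row."""
    # Collect column names in first-seen order so the header is stable.
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)
    lines = [json.dumps(columns)]
    for rec in records:
        # Keys missing from a record are written as null.
        lines.append(json.dumps([rec.get(col) for col in columns]))
    return "\n".join(lines) + "\n"


def jsonla_to_records(text):
    """Parse header-plus-rows text back into a list of dicts."""
    rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


# Hypothetical sample data, for illustration only.
records = [
    {"name": "Gilbert", "wins": 3},
    {"name": "Alexa", "wins": 5},
]

jsonl_text = "".join(json.dumps(r) + "\n" for r in records)
jsonla_text = records_to_jsonla(records)

print(jsonl_text)
print(jsonla_text)
print(len(jsonl_text), "characters as JSONL vs", len(jsonla_text), "as JSONLA")
```

The saving comes from writing each key once instead of once per row, so it grows with the number of rows and the length of the column names.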

See this comment:
#1726 (comment)

And the PR where I added support for the jsonla file format (short for JSON Lines Arrays):
#1730

See the JSONL documentation:
https://jsonlines.org/examples/

There is also usv, which is basically CSV, but with Unicode separators for columns and rows, which means you can freely include commas, newlines, tabs, quotes, etc. in the raw data without needing to escape them.
Again, it's as compact as CSV, but more robust.
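
A rough Python sketch of why dedicated separators avoid escaping is below. The two separator characters (U+241F between columns, U+241E between rows) are an assumption chosen for illustration; check VisiData's usv loader for the characters it actually uses. `dumps_usv` and `loads_usv` are hypothetical helper names.

```python
# Assumed separators, for illustration only; VisiData's choice may differ.
UNIT_SEP = "\u241f"    # between columns
RECORD_SEP = "\u241e"  # between rows


def dumps_usv(rows, unit_sep=UNIT_SEP, record_sep=RECORD_SEP):
    """Join cells and rows with the separators; no quoting or escaping is applied."""
    return "".join(unit_sep.join(str(v) for v in row) + record_sep for row in rows)


def loads_usv(text, unit_sep=UNIT_SEP, record_sep=RECORD_SEP):
    """Split text back into rows of string cells."""
    records = text.split(record_sep)
    # The text ends with a row separator, which leaves one empty trailing piece.
    if records and records[-1] == "":
        records.pop()
    return [rec.split(unit_sep) for rec in records]


rows = [
    ["id", "note"],
    ["1", 'has, comma and "quotes"'],
    ["2", "line one\nline two\ttabbed"],
]

text = dumps_usv(rows)
print(loads_usv(text) == rows)
```

The trade-off is that a cell containing one of the separator characters itself would break the round trip, which is rare in practice but worth knowing.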

Note also that when saving data, you can add .zip after the file extension, and VisiData will automatically save a compressed copy. You can open the compressed copy directly in VisiData as well.

So, you could save as filetype .jsonla.zip or .usv.zip if you want to save disk space.

Here's a breakdown of the file sizes of a few different formats, saving the VisiData usage data:

$ ls -lhS usage.*
-rw-rw-r-- 1 dwales dwales 1.7M Sep 26 22:44 usage.jsonl
-rw-rw-r-- 1 dwales dwales 695K Sep 26 22:44 usage.jsonla
-rw-rw-r-- 1 dwales dwales 593K Sep 26 22:44 usage.usv
-rw-rw-r-- 1 dwales dwales 431K Sep 26 22:43 usage.tsv
-rw-rw-r-- 1 dwales dwales  79K Sep 26 22:44 usage.jsonla.zip
-rw-rw-r-- 1 dwales dwales  79K Sep 26 22:44 usage.jsonl.zip
-rw-rw-r-- 1 dwales dwales  79K Sep 26 22:45 usage.tsv.zip
-rw-rw-r-- 1 dwales dwales  79K Sep 26 22:44 usage.usv.zip
-rw-rw-r-- 1 dwales dwales  72K Sep 26 22:44 usage.parquet

You can see that tsv is the most compact human-readable plain text format.
But all the text formats come out at the same size (79K) after you zip them.
And parquet is very slightly better (72K).

Obviously this will all be different for different data sets and file sizes.
Parquet is probably fastest to read for big files. But for small ones it won't make a noticeable difference.

4 replies

cool-RR Sep 26, 2024
Author

Thank you David!

cool-RR Sep 26, 2024
Author

Did you invent JSONLA? I can't find any information about it online.

saulpw Sep 26, 2024
Maintainer

@daviewales probably better than .zip is .gz. It can only store one file but then you don't have to mess around with an index sheet first.
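
For anyone producing these files from their own experiment scripts rather than from VisiData, here is a minimal sketch of writing and reading a gzipped JSONL file with Python's standard `gzip` module. The file name `results.jsonl` and the row contents are made up for illustration.

```python
import gzip
import json
import os
import tempfile

# Hypothetical experiment rows, for illustration only.
rows = [
    {"experiment": "run-%d" % (i % 10), "step": i, "loss": 1.0 / (i + 1)}
    for i in range(1000)
]

tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "results.jsonl")
gz_path = plain_path + ".gz"

# Plain JSONL, written only to compare sizes.
with open(plain_path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# The same content, compressed as a single gzip stream.
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Reading back is symmetric: open in text mode and parse line by line.
with gzip.open(gz_path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

plain_size = os.path.getsize(plain_path)
gz_size = os.path.getsize(gz_path)
print(plain_size, "bytes plain,", gz_size, "bytes gzipped")
```

Because a .gz holds exactly one stream, there is no archive listing to go through before reaching the data.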

daviewales Sep 27, 2024

I didn't invent it. The first example on the JSON Lines examples page uses the same format.
However, I did invent the file extension .jsonla for VisiData, as it avoided adding magic to the JSONL reader to detect what kind of JSONL format was used, and made it easier to specify which type of JSONL output you want when writing.

Edit: I checked the issue and PR linked above. I came up with the .jsonla extension here. And Saul suggested keeping jsonl and jsonla more separate here. That said, it looks like VisiData will fall back to loading jsonl if it doesn't detect an array in the first row of a jsonla file.

Thanks Saul for the .gz suggestion!
