Storage Usage: Add benchmark results from v5.10.2 #214

Open: surister wants to merge 1 commit into main from fix/update-storage-usage-guide

surister
Member

@surister surister commented Jun 17, 2025

About

Update the page about "Performance » Storage Usage" regarding storage space improvements with CrateDB 5.10.

Preview

https://cratedb-guide--214.org.readthedocs.build/performance/storage.html

Checklist

coderabbitai bot commented Jun 17, 2025

Walkthrough

The documentation for storage performance has been updated to reflect improvements in CrateDB version 5.10.2, including updated storage size comparisons, detailed shard and replication calculations, a new section on manual table optimization, expanded tables showing storage effects, and various clarifications, formatting corrections, and structural enhancements.

Changes

File(s) Change Summary
docs/performance/storage.md Updated storage size comparisons for CrateDB v5.10.2, clarified shard and replica calculations, added manual OPTIMIZE table guidance, expanded storage effect tables, corrected grammar and formatting, and improved overall clarity and structure.

Poem

Hopping through docs with nimble delight,
Storage shrinks down—oh, what a sight!
Shards and replicas, now clearly explained,
With optimize tips, much knowledge is gained.
CrateDB’s new tricks make data less dense,
This bunny approves—storage now makes more sense!
🐇📦✨

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (13)
docs/performance/storage.md (13)

32-34: Rephrase INFO block wording
The INFO block is a bit wordy and contains repetitive phrasing. Consider tightening “In version 5.10 storage usage was improved…” to avoid redundancy.


36-36: Convert bare URL to markdown link
The plain URL violates MD034. Wrap the CrateDB blog link in [Reducing storage costs in CrateDB v5.10](https://cratedb.com/blog/reducing-storage-costs-in-cratedb-v5.10) to satisfy the linter and improve readability.


65-67: Merge redundant sentences and remove extra blank line
Lines 65–67 repeat “when a table is created…” and introduce an unnecessary blank line. Consider combining the two sentences and deleting the blank line to improve flow.


70-76: Use concise transition and remove blank line
Replace “On top of that,” with “Additionally,” and remove the blank line at 71 to tighten the replication explanation.


122-126: Replace hard tabs with spaces
Lines 122–126 use hard tabs (MD010). Convert them to spaces for consistent formatting.


142-142: Format SQL command in bullet list
Wrap OPTIMIZE table in backticks in the bullet at line 142 for consistency with other code examples.


146-146: Optional: Simplify phrasing
Consider shortening “Disk size improvements can vary depending on the data types, schema and even disk manufacturer.” to something like “Disk size improvements vary by data type, schema, and storage hardware.”


154-154: Optional: Add IF NOT EXISTS
The CREATE TABLE taxi_noindex snippet could include IF NOT EXISTS to prevent errors if the table is already present.


163-164: Clarify index deletion note
“It can’t be deleted” may confuse readers—recommend rephrasing to “the existing index cannot be removed without recreating the table.”


239-246: Fix version notation and extra period
In the disable-columnar-store section, remove the extra period in “v.5.10” and ensure you’re using “v5.10.2” consistently.


307-308: Unify SQL column aliases
In the “All results” query you alias the total size as total_size_mib on line 307 but use total_size_mb in other sections. Choose one alias and apply consistently.

Also applies to: 316-316


383-383: Remove trailing comma
Line 383 ends with a comma (“...data evolves,”). Change it to a period to properly finish the sentence.


393-393: Convert bare Wikipedia URL to markdown link
The bare link should be formatted as [Database Normalization](https://en.wikipedia.org/wiki/Database_normalization) to satisfy MD034 and improve readability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b0c6c30 and a609bc4.

📒 Files selected for processing (1)
  • docs/performance/storage.md (14 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/performance/storage.md

36-36: Bare URL used (MD034, no-bare-urls)
122-122: Hard tabs, column 5 (MD010, no-hard-tabs)
123-123: Hard tabs, column 4 (MD010, no-hard-tabs)
124-124: Hard tabs, column 4 (MD010, no-hard-tabs)
125-125: Hard tabs, column 4 (MD010, no-hard-tabs)
126-126: Hard tabs, column 4 (MD010, no-hard-tabs)
393-393: Bare URL used (MD034, no-bare-urls)

🪛 LanguageTool
docs/performance/storage.md

[style] ~65-~65: ‘taken into account’ might be wordy. Consider a shorter alternative.
Context: ...ed, sharding and replication have to be taken into account. When a table is created with default v...

(EN_WORDINESS_PREMIUM_TAKEN_INTO_ACCOUNT)


[style] ~73-~73: ‘On top of that’ might be wordy. Consider a shorter alternative.
Context: ...odes = 3 max(4, 3 * 2) = 6 shards On top of that, the default replication is the `0-1` r...

(EN_WORDINESS_PREMIUM_ON_TOP_OF_THAT)


[uncategorized] ~356-~356: The verb “are” doesn’t seem to fit in this context, “is” is probably more formally correct.
Context: ...eplica shards) = 768 MiB > The result are slightly off `768 ~= 775` because in th...

(AI_HYDRA_LEO_CPT_ARE_IS)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build docs
🔇 Additional comments (10)
docs/performance/storage.md (10)

9-9: Dataset updated: link and row count
The introduction correctly links to the January 2024 Yellow taxi trip dataset and updates the row count to 2 964 624.


30-30: Section transition text looks good
The “We will dive deeper to really understand what is going on.” segue is clear and helpful.


58-59: Clarify measurement breakdown intro
This introduction to the shard‐by‐shard breakdown is concise and clear. No change needed.


99-99: Code block closure
The closing backticks properly terminate the SQL snippet; no action required.


191-191: Indexes INFO is concise
The INFO note about indexes not being re-addable after table creation is accurate and clear.


195-202: Disable indexing storage results validated
The storage comparison table for disable-indexing in v5.9.3 vs v5.10.2 is correct and clearly formatted.


248-248: Column store recommendation is fine
The note advising against disabling column storage in v5.10 is clear and appropriate.


261-261: OPTIMIZE command snippet
The example OPTIMIZE table <table_name> WITH (max_num_segments=1) is correct and clear.


282-289: Compression results table reviewed
The tables comparing default vs. best_compression in v5.9.3 and v5.10.2 are accurate and well formatted.


294-296: Percentages formatting acceptable
The tilde-marked approximate percentages (~27.86%, ~1%) effectively convey that these are estimates.

@surister surister force-pushed the fix/update-storage-usage-guide branch from a609bc4 to dab3146 Compare June 17, 2025 16:57
@surister surister requested review from amotl and hlcianfagna June 17, 2025 16:57
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (4)
docs/performance/storage.md (4)

9-9: Prevent unintended italic formatting for numeric separators
The plain text 2_964_624 will trigger markdown italic around 964. Wrap this number in backticks or format it without underscores to avoid unintended styling.


65-66: Consider tightening phrasing
“have to be taken into account” could be shortened to “consider sharding and replication.”


356-357: Fix grammatical agreement
The verb “are” should be “is” in this context.

- > The result are slightly off `768 ~= 775` because in this example, decimals are being ignored.
+ > The result is slightly off `768 ~= 775` because in this example, decimals are being ignored.

393-393: Convert bare URL to linked text
Avoid raw URLs for readability. Link the text “Database normalization” instead.

- Read more about it at https://en.wikipedia.org/wiki/Database_normalization
+ Read more about it at [Database normalization](https://en.wikipedia.org/wiki/Database_normalization)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a609bc4 and dab3146.

📒 Files selected for processing (1)
  • docs/performance/storage.md (13 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/performance/storage.md

[uncategorized] ~357-~357: The preposition “of” seems more likely in this position.
Context: ...ored. > The goal is to give you an idea on how tweaking some CrateDB aspect can af...

(AI_EN_LECTOR_REPLACEMENT_PREPOSITION)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build docs
🔇 Additional comments (19)
docs/performance/storage.md (19)

22-28: v5.10.2 storage result added
The new bullet for CrateDB 5.10.2 is correctly formatted and consistent with the approximate storage sizes.


30-30: Heading for deeper dive is clear
The introductory line for the detailed breakdown fits well.


70-72: Shard calculation example is clear
The formula and example values are correct and easy to follow.


95-99: Shard size SQL snippet looks good
The query correctly computes average and total MiB per shard.


142-142: Optimize table bullet is accurate
The note on calling OPTIMIZE table for v5.10+ is valid and concise.


146-147: Retention of disk improvement note
The informational callout about variability in disk improvements is appropriate.


154-154: Code example is correct
The CREATE TABLE snippet for disabling indexing is well-formatted and accurate.


195-195: Baseline v5.9.3 results unchanged
The CrateDB v5.9.3 reference remains correct.


201-201: v5.10.2 effect on indexing is correct
The CrateDB v5.10.2 storage numbers for disabled indexing are accurate.


233-233: Column store v5.9.3 results are fine
The table header and values for v5.9.3 are accurate.


239-239: Column store v5.10.2 results are fine
The v5.10.2 storage numbers for disabling columnar store look correct.


242-243: Table formatting is correct
The separator and data rows align with markdown best practices.


245-245: Baseline reduction note is accurate
The reduction percentage for v5.9 (no index + no column store) is correct.


307-307: Final SQL alias for total_size_mib
The new alias total_size_mib matches the table references.


316-316: Ordering by avg_size_per_shard_in_mib is correct
The ORDER BY clause aligns with the computed metric.


323-330: v5.9.3 table rows validated
The data entries under the results for v5.9.3 are accurate.


332-336: Ensure consistent column header in v5.10.2 table
Verify that the header matches the finalized naming convention (avg_size_in_bytes_per_record).

Please confirm that the header on the v5.10.2 results table matches the naming you’ve aligned for the v5.9.3 table.


336-343: v5.10.2 table rows validated
The data entries for CrateDB v5.10.2 benchmarking are correct.


382-382: Future-proofing note is clear
The remark about future changes is appropriate and well-placed.

| 39 | 471 |

Data was reduced `~13.11%` in v5.9 similar to `no_index`.
Data was increased `~10%` in v.5.10

⚠️ Potential issue

Update version in column store increase note
The note references “v.5.10”; it should be v5.10.2 and remove the extraneous period.

- Data was increased `~10%` in v.5.10
+ Data was increased `~10%` in v5.10.2
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- Data was increased `~10%` in v.5.10
+ Data was increased `~10%` in v5.10.2
🤖 Prompt for AI Agents
In docs/performance/storage.md at line 246, update the version reference from
"v.5.10" to "v5.10.2" by removing the extra period and adding the patch version
to accurately reflect the correct version number.

>
>Read more at https://cratedb.com/blog/reducing-storage-costs-in-cratedb-v5.10
>
> This guide has been updated to reflect the v5.10 improvements.

🛠️ Refactor suggestion

Update improvement note to v5.10.2
The note says “reflect the v5.10 improvements” but it should be v5.10.2.

Apply this diff:

- > This guide has been updated to reflect the v5.10 improvements.
+ > This guide has been updated to reflect the v5.10.2 improvements.

Suggested change
- > This guide has been updated to reflect the v5.10 improvements.
+ > This guide has been updated to reflect the v5.10.2 improvements.
🤖 Prompt for AI Agents
In docs/performance/storage.md at line 38, update the version number in the
improvement note from "v5.10" to "v5.10.2" to accurately reflect the correct
version. Simply replace "v5.10" with "v5.10.2" in the existing text.

Comment on lines +207 to +208
Data was reduced `~13.11%` in v5.9
Data was reduced `~33.87%` in v5.10

⚠️ Potential issue

Align percentage context to v5.10.2
The second reduction metric still labels the version as “v5.10.” Update to v5.10.2.

- Data was reduced `~33.87%` in v5.10
+ Data was reduced `~33.87%` in v5.10.2

Suggested change
- Data was reduced `~13.11%` in v5.9
- Data was reduced `~33.87%` in v5.10
+ Data was reduced `~13.11%` in v5.9
+ Data was reduced `~33.87%` in v5.10.2
🤖 Prompt for AI Agents
In docs/performance/storage.md around lines 207 to 208, update the version label
for the second data reduction metric from "v5.10" to "v5.10.2" to accurately
reflect the correct version context.

Comment on lines +319 to +322
CrateDB v5.9.3

| table_name | records | total_size_mib | avg_size_per_shard_in_mib | avg_bytes_per_record |
|----------------------------------------------|---------|----------------|---------------------------|----------------------|

⚠️ Potential issue

Unify average bytes column header
The first results table uses avg_bytes_per_record while the second uses avg_size_in_bytes_per_record. Align both names.

-| table_name                                   | records | total_size_mib | avg_size_per_shard_in_mib | avg_bytes_per_record |
+| table_name                                   | records | total_size_mib | avg_size_per_shard_in_mib | avg_size_in_bytes_per_record |

Suggested change
- CrateDB v5.9.3
- | table_name | records | total_size_mib | avg_size_per_shard_in_mib | avg_bytes_per_record |
- |----------------------------------------------|---------|----------------|---------------------------|----------------------|
+ CrateDB v5.9.3
+ | table_name | records | total_size_mib | avg_size_per_shard_in_mib | avg_size_in_bytes_per_record |
+ |----------------------------------------------|---------|----------------|---------------------------|------------------------------|
🤖 Prompt for AI Agents
In docs/performance/storage.md around lines 319 to 322, the column header for
average bytes per record is inconsistent between tables, using
`avg_bytes_per_record` in the first and `avg_size_in_bytes_per_record` in the
second. Update the first table's column header to `avg_size_in_bytes_per_record`,
as in the diff above, to unify the naming across both tables.
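
Whichever header name is chosen, the column itself is a simple ratio. A minimal Python sketch of how such a value could be derived, assuming the total size is given in MiB and that size and record count refer to the same table; the function name is illustrative, and the documentation's actual SQL may compute the column differently (for example, by including replica shards):

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def avg_bytes_per_record(total_size_mib: float, records: int) -> float:
    """Average on-disk bytes per record for a table of the given size."""
    if records <= 0:
        raise ValueError("records must be a positive integer")
    return total_size_mib * MIB / records


if __name__ == "__main__":
    # 431 MiB and 2_964_624 rows are figures quoted elsewhere in this thread;
    # pairing them here is an assumption made for illustration only.
    print(round(avg_bytes_per_record(431, 2_964_624), 2))
```

With those two figures the result is roughly 152 bytes per record.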

Comment on lines +32 to +34
> INFO: In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,
> more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary
> depending on use cases.

🛠️ Refactor suggestion

Align version reference to v5.10.2
The guidance still mentions “version 5.10” but should reference v5.10.2 for consistency with the rest of the document.

Apply this diff:

- > INFO: In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,
+ > INFO: In version 5.10.2 storage usage was improved, some users report up to 70% of storage reduction,

Suggested change
- > INFO: In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,
- > more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary
- > depending on use cases.
+ > INFO: In version 5.10.2 storage usage was improved, some users report up to 70% of storage reduction,
+ > more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary
+ > depending on use cases.
🤖 Prompt for AI Agents
In docs/performance/storage.md around lines 32 to 34, update the version
reference from "version 5.10" to "version 5.10.2" to maintain consistency with
the rest of the document. Replace all occurrences of "5.10" in this context with
"5.10.2".

Comment on lines +46 to +49
5. [Optimize table](#upgrade-to-v510-and-optimize-table)
6. [Changing the compression algorithm](#changing-the-compression-algorithm)
7. [All results](#all-results-and-what-to-do)
8. [Data normalization](#extra-data-normalization)

⚠️ Potential issue

Correct TOC anchor for v5.10.2 and ensure link consistency
The TOC entry still references the slug upgrade-to-v510-and-optimize-table. Update it to match v5.10.2 and the updated heading slug.

-5. [Optimize table](#upgrade-to-v510-and-optimize-table)
+5. [Optimize table](#upgrade-to-v5-10-2-and-optimize-table)

Suggested change
- 5. [Optimize table](#upgrade-to-v510-and-optimize-table)
- 6. [Changing the compression algorithm](#changing-the-compression-algorithm)
- 7. [All results](#all-results-and-what-to-do)
- 8. [Data normalization](#extra-data-normalization)
+ 5. [Optimize table](#upgrade-to-v5-10-2-and-optimize-table)
+ 6. [Changing the compression algorithm](#changing-the-compression-algorithm)
+ 7. [All results](#all-results-and-what-to-do)
+ 8. [Data normalization](#extra-data-normalization)
🤖 Prompt for AI Agents
In docs/performance/storage.md around lines 46 to 49, the table of contents
entry for the upgrade section uses the outdated anchor slug
"upgrade-to-v510-and-optimize-table". Update this anchor to reflect the correct
version "v5.10.2" and ensure it matches the corresponding heading slug exactly
for link consistency.
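
Before changing the anchor, it may help to check what slug a heading actually produces. The following Python sketch approximates GitHub-style slug rules (lowercase, punctuation dropped, spaces turned into hyphens). It is an assumption that the docs build follows these rules; the function name is illustrative and the real slugger may differ in edge cases:

```python
import re


def github_style_slug(heading: str) -> str:
    """Approximate a GitHub-style anchor slug for a Markdown heading."""
    slug = heading.strip().lower()
    # Drop everything except word characters, hyphens, and spaces
    # (this removes dots and backticks, among others).
    slug = re.sub(r"[^\w\- ]", "", slug)
    # Each space becomes a hyphen.
    return slug.replace(" ", "-")


if __name__ == "__main__":
    print(github_style_slug("Upgrade to v5.10 and Optimize table"))
```

Under these assumed rules, the heading `Upgrade to v5.10 and Optimize table` yields `upgrade-to-v510-and-optimize-table`, which is the anchor the TOC already uses. A heading containing `v5.10.2` would yield `v5102` rather than `v5-10-2`, so the suggested anchor is worth verifying against the rendered page.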

@@ -47,30 +55,29 @@ CrateDB is a distributed database; nodes, shards, partitions and replicas are ti
When a table is created, data is sharded and distributed among nodes. This
means that the memory footprint depends on our replication and sharding strategy.

-Let's break down how the `775MiB` in CrateDB and the `510MiB` in PostgreSQL were
+Let's break down how the `431MiB` in CrateDB and the `510MiB` in PostgreSQL were

⚠️ Potential issue

Specify CrateDB version in breakdown
Reference the optimized storage as CrateDB v5.10.2 when breaking down the 431MiB.

- Let's break down how the `431MiB` in CrateDB and the `510MiB` in PostgreSQL were obtained.
+ Let's break down how the `431MiB` in CrateDB v5.10.2 and the `510MiB` in PostgreSQL were obtained.

Suggested change
- Let's break down how the `431MiB` in CrateDB and the `510MiB` in PostgreSQL were
+ Let's break down how the `431MiB` in CrateDB v5.10.2 and the `510MiB` in PostgreSQL were obtained.
🤖 Prompt for AI Agents
In docs/performance/storage.md at line 58, specify the CrateDB version as
v5.10.2 when mentioning the optimized storage size of 431MiB. Update the text to
clearly indicate that the 431MiB figure corresponds to CrateDB version 5.10.2
for clarity and accuracy.
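
The percentages quoted throughout this review are relative changes between two size measurements. A small Python sketch of that calculation, using the `775MiB` and `431MiB` figures from the hunk above as inputs; treating them as a before/after pair for the same table is an assumption, and the function name is illustrative:

```python
def percent_change(before_mib: float, after_mib: float) -> float:
    """Relative change from before to after, in percent (negative = reduction)."""
    if before_mib <= 0:
        raise ValueError("before_mib must be positive")
    return (after_mib - before_mib) / before_mib * 100.0


if __name__ == "__main__":
    change = percent_change(775, 431)
    print(f"{change:.2f}%")
```

A negative result means the table shrank; for this pair the change works out to a reduction of roughly 44%.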

@amotl amotl changed the title Add benchmark results from v5.10.2 Performance » Storage Usage: Add benchmark results from v5.10.2 Jun 17, 2025
@amotl amotl changed the title Performance » Storage Usage: Add benchmark results from v5.10.2 Storage Usage: Add benchmark results from v5.10.2 Jun 17, 2025
Member

@amotl amotl left a comment

Thanks a stack for the update, this outcome looks very promising.

I've added a few suggestions about MyST Markdown syntax and friends, and would like to give @hlcianfagna the pleasure of final approval, as he knows the subject matter and the English language better than I do.

Comment on lines +32 to +34
> INFO: In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,
> more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary
> depending on use cases.
Member

FYI: In MyST Markdown, you may want to phrase the admonition like this:

Suggested change
- > INFO: In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,
- > more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary
- > depending on use cases.
+ :::{note}
+ In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,
+ more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary
+ depending on use cases.
+ :::

Comment on lines +65 to +66
In CrateDB, when a table is created, sharding and replication have to be taken into account.
When a table is created with default values, it gets partitioned using this formula: `max(4, num_data_nodes * 2)`.
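
For readers who want to check the shard arithmetic, here is a minimal Python sketch of the formula quoted above. The function names are illustrative, and the assumption that every primary shard ends up with exactly one replica (the upper end of the `0-1` replication range mentioned in the review context) is not confirmed by this thread:

```python
def default_shard_count(num_data_nodes: int) -> int:
    """Default number of primary shards: max(4, num_data_nodes * 2)."""
    if num_data_nodes < 1:
        raise ValueError("a cluster needs at least one data node")
    return max(4, num_data_nodes * 2)


def total_shard_copies(primary_shards: int, replicas_per_shard: int) -> int:
    """Primary shards plus all of their replica copies."""
    return primary_shards * (1 + replicas_per_shard)


if __name__ == "__main__":
    primaries = default_shard_count(3)  # three data nodes, as in the reviewed example
    print(primaries, total_shard_copies(primaries, 1))
```

For three data nodes this gives 6 primary shards, matching the `max(4, 3 * 2) = 6 shards` example cited in the review, and 12 shard copies if each primary has one replica.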
Member

Comment on lines +115 to +116
You can check the actual size of a shard locally; querying `select table_name, path from sys.shards`
shows the file path of the shards.
Member

Suggested change
- You can check the actual size of a shard locally; querying `select table_name, path from sys.shards`
- shows the file path of the shards.
+ You can check the actual size of a shard by inspecting the filesystem.
+ Use `select table_name, path from sys.shards` to show the file path of the individual shards.
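
Once a shard path is known, inspecting the filesystem comes down to summing the file sizes under that directory. A minimal Python sketch, assuming the script runs on the node that holds the shard and has read access to the path returned by `sys.shards`; the function names are illustrative:

```python
import os


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files below path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def bytes_to_mib(size_in_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return size_in_bytes / (1024 * 1024)
```

Calling `bytes_to_mib(directory_size_bytes(shard_path))` with a path taken from the query gives a figure that can be compared with the sizes reported by CrateDB; small differences are plausible because files may change while they are being read.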

Comment on lines -146 to +164
- The index can only be disabled when the table is created, if the table already exists and it cannot
- be deleted it will have to be re-created.
+ The index can only be disabled when the table is created, if the table already exists, and it can’t
+ be deleted, it will have to be re-created.
Member

Is it possible to delete an index afterwards? Doesn't the table always need to be re-created?

@@ -171,15 +188,24 @@ DROP TABLE "taxi_deleteme"

> WARNING: Dropping the table deletes the data, make sure that the copy was done correctly.

- > INFO: Indexes cannot be re-added after table creation.
+ > INFO: Indexes can’t be re-added after table creation.
Member

Suggested change
- > INFO: Indexes can’t be re-added after table creation.
+ :::{attention}
+ Indexes can’t be re-added after table creation.
+ :::

Contributor

Suggested change
- > INFO: Indexes can’t be re-added after table creation.
+ > INFO: Disabled indexes cannot be re-added without recreating the table.

Data was reduced `~13.11%` in v5.9 similar to `no_index`.
Data was increased `~10%` in v.5.10

> NOTE: In v5.10, we don’t recommend that you turn off column storage.
Member

Suggested change
- > NOTE: In v5.10, we don’t recommend that you turn off column storage.
+ :::{note}
+ In v5.10, we don’t recommend that you turn off column storage.
+ :::

Comment on lines +252 to +255
In v5.10 we introduced a new table storage format that can reduce storage up to 50% in comparison
to previous versions.

The easiest way to gain massive storage reduction is to update to the latest version available.
Member

I would suggest removing the additional line break here.

Suggested change
- In v5.10 we introduced a new table storage format that can reduce storage up to 50% in comparison
- to previous versions.
- The easiest way to gain massive storage reduction is to update to the latest version available.
+ In v5.10 we introduced a new table storage format that can reduce storage up to 50% in comparison
+ to previous versions. The easiest way to gain massive storage reduction is to update to the latest version available.

Contributor

Upgrading alone would not be enough: I believe the new format only applies to new tables and partitions, and only after segments are merged.

you would typically see storage usage being automatically reduced as more data gets added to the
table. Nonetheless, you can trigger these optimizations manually by running:

`OPTIMIZE table <table_name> WITH (max_num_segments=1)`
Member

Just a little more uppercasing.

Suggested change
- `OPTIMIZE table <table_name> WITH (max_num_segments=1)`
+ `OPTIMIZE TABLE <table_name> WITH (max_num_segments=1)`


> NOTE: In v5.10, we don’t recommend that you turn off column storage.

## Upgrade to v5.10 and Optimize table
Member

Maybe?

Suggested change
- ## Upgrade to v5.10 and Optimize table
+ ## Upgrade to v5.10 and `OPTIMIZE TABLE`


## Upgrade to v5.10 and Optimize table

In v5.10 we introduced a new table storage format that can reduce storage up to 50% in comparison
Member

Suggested change
- In v5.10 we introduced a new table storage format that can reduce storage up to 50% in comparison
+ CrateDB v5.10 introduced a new table storage format that can reduce storage up to 50% in comparison

Contributor

@hlcianfagna left a comment

Looks good, but please look at the comments on lines 66 and 253.

At first sight, it might look that CrateDB storage takes more than PostgreSQL,
but we need to dive deeper to really understand what is going on.
> INFO: In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,
> more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary
Contributor

Suggested change
- > more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary
+ > more reasonably a 30–50% can be expected for most data, the reduction percentage will vary

but we need to dive deeper to really understand what is going on.
> INFO: In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,
> more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary
> depending on use cases.
Contributor

Suggested change
- > depending on use cases.
+ > depending on the use case.

- For CrateDB when a table is created, sharding and replication has to be taken into account.
- When a table is created with default values, it gets partitioned in `max(4, num_data_nodes * 2)` shards.
+ In CrateDB, when a table is created, sharding and replication have to be taken into account.
+ When a table is created with default values, it gets partitioned using this formula: `max(4, num_data_nodes * 2)`.
Contributor

Suggested change
- When a table is created with default values, it gets partitioned using this formula: `max(4, num_data_nodes * 2)`.
+ When a table is created with default values, it gets a number of shards derived using this formula: `max(4, num_data_nodes * 2)`.
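
For readers following this thread, the formula in the suggestion above can be checked with a minimal Python sketch. The function name `default_shard_count` is hypothetical and used only for illustration; it simply evaluates the quoted expression `max(4, num_data_nodes * 2)` and does not query a cluster.

```python
def default_shard_count(num_data_nodes: int) -> int:
    """Evaluate the quoted default: max(4, num_data_nodes * 2).

    Hypothetical helper for illustration only; it mirrors the formula
    from the reviewed documentation and is not a CrateDB API.
    """
    if num_data_nodes < 1:
        raise ValueError("num_data_nodes must be at least 1")
    return max(4, num_data_nodes * 2)


# Small clusters hit the floor of 4; larger ones scale with the node count.
for nodes in (1, 2, 3, 5):
    print(nodes, default_shard_count(nodes))  # 1 4, 2 4, 3 6, 5 10
```

The floor of 4 is why the wording "a number of shards derived using this formula" is more precise than "partitioned": the result is a shard count, not a partitioning scheme.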


The average size of a shard is ~35MiB, `35 * 12 = ~420`.

You can check the actual size of a shard locally; querying `select table_name, path from sys.shards`
Contributor

This should not differ significantly from what `sys.shards` reports, right?
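
The arithmetic quoted in the hunk above (`35 * 12 = ~420`) can be reproduced with a short Python sketch. The helper names are hypothetical, and the split of the 12 shards into 6 primaries with 1 replica each is an assumption made for illustration; the thread only states the total of 12 and the ~35MiB average.

```python
def total_shard_copies(primary_shards: int, replicas: int) -> int:
    """Every primary shard is stored once plus once per replica."""
    return primary_shards * (1 + replicas)


def estimated_size_mib(avg_shard_mib: float, shard_copies: int) -> float:
    """Rough table footprint: average shard size times number of shard copies."""
    return avg_shard_mib * shard_copies


# Assumed layout (not stated in the thread): 6 primaries, 1 replica each.
copies = total_shard_copies(primary_shards=6, replicas=1)
print(copies)                          # 12
print(estimated_size_mib(35, copies))  # 420
```

This is only an estimate from an average; sizes read from the filesystem or from `sys.shards` may differ per shard.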

@@ -143,8 +160,8 @@ CREATE TABLE taxi
)
```

- The index can only be disabled when the table is created, if the table already exists and it cannot
- be deleted it will have to be re-created.
+ The index can only be disabled when the table is created, if the table already exists, and it can’t
Contributor

Suggested change
- The index can only be disabled when the table is created, if the table already exists, and it can’t
+ The index can only be disabled when the table is created, if the table already exists

- The index can only be disabled when the table is created, if the table already exists and it cannot
- be deleted it will have to be re-created.
+ The index can only be disabled when the table is created, if the table already exists, and it can’t
+ be deleted, it will have to be re-created.
Contributor

Suggested change
- be deleted, it will have to be re-created.
+ it will have to be re-created.

@@ -300,7 +379,7 @@ CREATE TABLE IF NOT EXISTS "doc"."taxi_nocolumnstore_noindex_bestcompresion" (
CrateDB's default settings are optimized for performance.

If some columns will never be used for aggregations or groupings, there will be no performance penalty.
- That might change in the future as your use case and data needs evolve,
+ That might change in the future as your use case and data evolves,
Contributor

Suggested change
- That might change in the future as your use case and data evolves,
+ That might change in the future as your use case and data evolve,

3 participants