Storage Usage: Add benchmark results from v5.10.2 #214

surister · 2025-06-17T13:42:35Z

About

Update the page about "Performance » Storage Usage" regarding storage space improvements with CrateDB 5.10.

Preview

https://cratedb-guide--214.org.readthedocs.build/performance/storage.html

Checklist

Link to issue this PR refers to (if applicable): Fixes https://github.com/crate/tech-content/issues/136

coderabbitai · 2025-06-17T13:42:41Z

Walkthrough

The documentation for storage performance has been updated to reflect improvements in CrateDB version 5.10.2, including updated storage size comparisons, detailed shard and replication calculations, a new section on manual table optimization, expanded tables showing storage effects, and various clarifications, formatting corrections, and structural enhancements.

Changes

| File(s) | Change Summary |
|---------|----------------|
| docs/performance/storage.md | Updated storage size comparisons for CrateDB v5.10.2, clarified shard and replica calculations, added manual `OPTIMIZE table` guidance, expanded storage effect tables, corrected grammar and formatting, and improved overall clarity and structure. |

Poem

Hopping through docs with nimble delight,
Storage shrinks down—oh, what a sight!
Shards and replicas, now clearly explained,
With optimize tips, much knowledge is gained.
CrateDB’s new tricks make data less dense,
This bunny approves—storage now makes more sense!
🐇📦✨

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (13)

docs/performance/storage.md (13)

32-34: Rephrase INFO block wording
The INFO block is a bit wordy and contains repetitive phrasing. Consider tightening “In version 5.10 storage usage was improved…” to avoid redundancy.

36-36: Convert bare URL to markdown link
The plain URL violates MD034. Wrap the CrateDB blog link in [Reducing storage costs in CrateDB v5.10](https://cratedb.com/blog/reducing-storage-costs-in-cratedb-v5.10) to satisfy the linter and improve readability.

65-67: Merge redundant sentences and remove extra blank line
Lines 65–67 repeat “when a table is created…” and introduce an unnecessary blank line. Consider combining the two sentences and deleting the blank line to improve flow.

70-76: Use concise transition and remove blank line
Replace “On top of that,” with “Additionally,” and remove the blank line at 71 to tighten the replication explanation.

122-126: Replace hard tabs with spaces
Lines 122–126 use hard tabs (MD010). Convert them to spaces for consistent formatting.

142-142: Format SQL command in bullet list
Wrap OPTIMIZE table in backticks in the bullet at line 142 for consistency with other code examples.

146-146: Optional: Simplify phrasing
Consider shortening “Disk size improvements can vary depending on the data types, schema and even disk manufacturer.” to something like “Disk size improvements vary by data type, schema, and storage hardware.”

154-154: Optional: Add IF NOT EXISTS
The CREATE TABLE taxi_noindex snippet could include IF NOT EXISTS to prevent errors if the table is already present.

163-164: Clarify index deletion note
“It can’t be deleted” may confuse readers—recommend rephrasing to “the existing index cannot be removed without recreating the table.”

239-246: Fix version notation and extra period
In the disable-columnar-store section, remove the extra period in “v.5.10” and ensure you’re using “v5.10.2” consistently.

307-308: Unify SQL column aliases
In the “All results” query you alias the total size as total_size_mib on line 307 but use total_size_mb in other sections. Choose one alias and apply consistently.

Also applies to: 316-316

383-383: Remove trailing comma
Line 383 ends with a comma (“...data evolves,”). Change it to a period to properly finish the sentence.

393-393: Convert bare Wikipedia URL to markdown link
The bare link should be formatted as [Database Normalization](https://en.wikipedia.org/wiki/Database_normalization) to satisfy MD034 and improve readability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b0c6c30 and a609bc4.

📒 Files selected for processing (1)

docs/performance/storage.md (14 hunks)

🧰 Additional context used

🪛 markdownlint-cli2 (0.17.2)

docs/performance/storage.md

36-36: Bare URL used
null

(MD034, no-bare-urls)

122-122: Hard tabs
Column: 5

(MD010, no-hard-tabs)

123-123: Hard tabs
Column: 4

(MD010, no-hard-tabs)

124-124: Hard tabs
Column: 4

(MD010, no-hard-tabs)

125-125: Hard tabs
Column: 4

(MD010, no-hard-tabs)

126-126: Hard tabs
Column: 4

(MD010, no-hard-tabs)

393-393: Bare URL used
null

(MD034, no-bare-urls)

🪛 LanguageTool

docs/performance/storage.md

[style] ~65-~65: ‘taken into account’ might be wordy. Consider a shorter alternative.
Context: ...ed, sharding and replication have to be taken into account. When a table is created with default v...

(EN_WORDINESS_PREMIUM_TAKEN_INTO_ACCOUNT)

[style] ~73-~73: ‘On top of that’ might be wordy. Consider a shorter alternative.
Context: ...odes = 3 max(4, 3 * 2) = 6 shards On top of that, the default replication is the0-1` r...

(EN_WORDINESS_PREMIUM_ON_TOP_OF_THAT)

[uncategorized] ~356-~356: The verb “are” doesn’t seem to fit in this context, “is” is probably more formally correct.
Context: ...eplica shards) = 768 MiB > The result are slightly off768 ~= 775` because in th...

(AI_HYDRA_LEO_CPT_ARE_IS)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Build docs

🔇 Additional comments (10)

docs/performance/storage.md (10)

9-9: Dataset updated: link and row count
The introduction correctly links to the January 2024 Yellow taxi trip dataset and updates the row count to 2 964 624.

30-30: Section transition text looks good
The “We will dive deeper to really understand what is going on.” segue is clear and helpful.

58-59: Clarify measurement breakdown intro
This introduction to the shard‐by‐shard breakdown is concise and clear. No change needed.

99-99: Code block closure
The closing backticks properly terminate the SQL snippet; no action required.

191-191: Indexes INFO is concise
The INFO note about indexes not being re-addable after table creation is accurate and clear.

195-202: Disable indexing storage results validated
The storage comparison table for disable-indexing in v5.9.3 vs v5.10.2 is correct and clearly formatted.

248-248: Column store recommendation is fine
The note advising against disabling column storage in v5.10 is clear and appropriate.

261-261: OPTIMIZE command snippet
The example OPTIMIZE table <table_name> WITH (max_num_segments=1) is correct and clear.

282-289: Compression results table reviewed
The tables comparing default vs. best_compression in v5.9.3 and v5.10.2 are accurate and well formatted.

294-296: Percentages formatting acceptable
The tilde-marked approximate percentages (~27.86%, ~1%) effectively convey that these are estimates.

docs/performance/storage.md

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (4)

docs/performance/storage.md (4)
9-9: Prevent unintended italic formatting for numeric separators
The plain text 2_964_624 will trigger markdown italic around 964. Wrap this number in backticks or format it without underscores to avoid unintended styling.

65-66: Consider tightening phrasing
“have to be taken into account” could be shortened to “consider sharding and replication.”

356-357: Fix grammatical agreement
The verb “are” should be “is” in this context.
- > The result are slightly off `768 ~= 775` because in this example, decimals are being ignored.
+ > The result is slightly off `768 ~= 775` because in this example, decimals are being ignored.
393-393: Convert bare URL to linked text
Avoid raw URLs for readability. Link the text “Database normalization” instead.
- Read more about it at https://en.wikipedia.org/wiki/Database_normalization
+ Read more about it at [Database normalization](https://en.wikipedia.org/wiki/Database_normalization)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a609bc4 and dab3146.

📒 Files selected for processing (1)

docs/performance/storage.md (13 hunks)

🧰 Additional context used

🪛 LanguageTool

docs/performance/storage.md

[uncategorized] ~357-~357: The preposition “of” seems more likely in this position.
Context: ...ored. > The goal is to give you an idea on how tweaking some CrateDB aspect can af...

(AI_EN_LECTOR_REPLACEMENT_PREPOSITION)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Build docs

🔇 Additional comments (19)

docs/performance/storage.md (19)

22-28: v5.10.2 storage result added
The new bullet for CrateDB 5.10.2 is correctly formatted and consistent with the approximate storage sizes.

30-30: Heading for deeper dive is clear
The introductory line for the detailed breakdown fits well.

70-72: Shard calculation example is clear
The formula and example values are correct and easy to follow.

95-99: Shard size SQL snippet looks good
The query correctly computes average and total MiB per shard.

142-142: Optimize table bullet is accurate
The note on calling OPTIMIZE table for v5.10+ is valid and concise.

146-147: Retention of disk improvement note
The informational callout about variability in disk improvements is appropriate.

154-154: Code example is correct
The CREATE TABLE snippet for disabling indexing is well-formatted and accurate.

195-195: Baseline v5.9.3 results unchanged
The CrateDB v5.9.3 reference remains correct.

201-201: v5.10.2 effect on indexing is correct
The CrateDB v5.10.2 storage numbers for disabled indexing are accurate.

233-233: Column store v5.9.3 results are fine
The table header and values for v5.9.3 are accurate.

239-239: Column store v5.10.2 results are fine
The v5.10.2 storage numbers for disabling columnar store look correct.

242-243: Table formatting is correct
The separator and data rows align with markdown best practices.

245-245: Baseline reduction note is accurate
The reduction percentage for v5.9 (no index + no column store) is correct.

307-307: Final SQL alias for total_size_mib
The new alias total_size_mib matches the table references.

316-316: Ordering by avg_size_per_shard_in_mib is correct
The ORDER BY clause aligns with the computed metric.

323-330: v5.9.3 table rows validated
The data entries under the results for v5.9.3 are accurate.

332-336: Ensure consistent column header in v5.10.2 table
Verify that the header matches the finalized naming convention (avg_size_in_bytes_per_record).

Please confirm that the header on the v5.10.2 results table matches the naming you’ve aligned for the v5.9.3 table.

336-343: v5.10.2 table rows validated
The data entries for CrateDB v5.10.2 benchmarking are correct.

382-382: Future-proofing note is clear
The remark about future changes is appropriate and well-placed.

coderabbitai · 2025-06-17T17:01:43Z

docs/performance/storage.md

+| 39                | 471       |
+
+Data was reduced `~13.11%` in v5.9 similar to `no_index`.
+Data was increased `~10%` in v.5.10


⚠️ Potential issue

Update version in column store increase note
The note references “v.5.10”; it should be v5.10.2 and remove the extraneous period.

- Data was increased `~10%` in v.5.10
+ Data was increased `~10%` in v5.10.2

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Data was increased `~10%` in v.5.10

- Data was increased `~10%` in v.5.10

+ Data was increased `~10%` in v5.10.2

🤖 Prompt for AI Agents

In docs/performance/storage.md at line 246, update the version reference from "v.5.10" to "v5.10.2" by removing the extra period and adding the patch version to accurately reflect the correct version number.

coderabbitai · 2025-06-17T17:01:43Z

docs/performance/storage.md

+> 
+>Read more at https://cratedb.com/blog/reducing-storage-costs-in-cratedb-v5.10
+> 
+> This guide has been updated to reflect the v5.10 improvements.


🛠️ Refactor suggestion

Update improvement note to v5.10.2
The note says “reflect the v5.10 improvements” but it should be v5.10.2.

Apply this diff:

- > This guide has been updated to reflect the v5.10 improvements.
+ > This guide has been updated to reflect the v5.10.2 improvements.

📝 Committable suggestion

Suggested change

> This guide has been updated to reflect the v5.10 improvements.

> This guide has been updated to reflect the v5.10.2 improvements.

🤖 Prompt for AI Agents

In docs/performance/storage.md at line 38, update the version number in the improvement note from "v5.10" to "v5.10.2" to accurately reflect the correct version. Simply replace "v5.10" with "v5.10.2" in the existing text.

coderabbitai · 2025-06-17T17:01:43Z

docs/performance/storage.md

+Data was reduced `~13.11%` in v5.9
+Data was reduced `~33.87%` in v5.10


⚠️ Potential issue

Align percentage context to v5.10.2
The second reduction metric still labels the version as “v5.10.” Update to v5.10.2.

- Data was reduced `~33.87%` in v5.10
+ Data was reduced `~33.87%` in v5.10.2

📝 Committable suggestion

Suggested change

Data was reduced `~13.11%` in v5.9

Data was reduced `~33.87%` in v5.10

Data was reduced `~13.11%` in v5.9

Data was reduced `~33.87%` in v5.10.2

🤖 Prompt for AI Agents

In docs/performance/storage.md around lines 207 to 208, update the version label for the second data reduction metric from "v5.10" to "v5.10.2" to accurately reflect the correct version context.

coderabbitai · 2025-06-17T17:01:43Z

docs/performance/storage.md

+CrateDB v5.9.3
+
+| table_name                                   | records | total_size_mib | avg_size_per_shard_in_mib | avg_bytes_per_record |
+|----------------------------------------------|---------|----------------|---------------------------|----------------------|


⚠️ Potential issue

Unify average bytes column header
The first results table uses avg_bytes_per_record while the second uses avg_size_in_bytes_per_record. Align both names.

-| table_name | records | total_size_mib | avg_size_per_shard_in_mib | avg_bytes_per_record |
+| table_name | records | total_size_mib | avg_size_per_shard_in_mib | avg_size_in_bytes_per_record |

📝 Committable suggestion

Suggested change

CrateDB v5.9.3

| table_name | records | total_size_mib | avg_size_per_shard_in_mib | avg_bytes_per_record |

|----------------------------------------------|---------|----------------|---------------------------|----------------------|

CrateDB v5.9.3

| table_name | records | total_size_mib | avg_size_per_shard_in_mib | avg_size_in_bytes_per_record |

|----------------------------------------------|---------|----------------|---------------------------|------------------------------|

🤖 Prompt for AI Agents

In docs/performance/storage.md around lines 319 to 322, the column header for average bytes per record is inconsistent between tables, using `avg_bytes_per_record` in the first and `avg_size_in_bytes_per_record` in the second. Update the first table's column header to `avg_size_in_bytes_per_record` to unify the naming across both tables.

coderabbitai · 2025-06-17T17:01:43Z

docs/performance/storage.md

+> INFO: In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,
+> more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary
+> depending on use cases. 


🛠️ Refactor suggestion

Align version reference to v5.10.2
The guidance still mentions “version 5.10” but should reference v5.10.2 for consistency with the rest of the document.

Apply this diff:

- > INFO: In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,
+ > INFO: In version 5.10.2 storage usage was improved, some users report up to 70% of storage reduction,

📝 Committable suggestion

Suggested change

> INFO: In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,

> more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary

> depending on use cases.

> INFO: In version 5.10.2 storage usage was improved, some users report up to 70% of storage reduction,

> more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary

> depending on use cases.

🤖 Prompt for AI Agents

In docs/performance/storage.md around lines 32 to 34, update the version reference from "version 5.10" to "version 5.10.2" to maintain consistency with the rest of the document. Replace the occurrences of "5.10" in this context with "5.10.2".

coderabbitai · 2025-06-17T17:01:44Z

docs/performance/storage.md

+5. [Optimize table](#upgrade-to-v510-and-optimize-table)
+6. [Changing the compression algorithm](#changing-the-compression-algorithm)
+7. [All results](#all-results-and-what-to-do)
+8. [Data normalization](#extra-data-normalization)


⚠️ Potential issue

Correct TOC anchor for v5.10.2 and ensure link consistency
The TOC entry still references the slug upgrade-to-v510-and-optimize-table. Update it to match v5.10.2 and the updated heading slug.

-5. [Optimize table](#upgrade-to-v510-and-optimize-table)
+5. [Optimize table](#upgrade-to-v5-10-2-and-optimize-table)

📝 Committable suggestion

Suggested change

5. [Optimize table](#upgrade-to-v510-and-optimize-table)

6. [Changing the compression algorithm](#changing-the-compression-algorithm)

7. [All results](#all-results-and-what-to-do)

8. [Data normalization](#extra-data-normalization)

5. [Optimize table](#upgrade-to-v5-10-2-and-optimize-table)

6. [Changing the compression algorithm](#changing-the-compression-algorithm)

7. [All results](#all-results-and-what-to-do)

8. [Data normalization](#extra-data-normalization)

🤖 Prompt for AI Agents

In docs/performance/storage.md around lines 46 to 49, the table of contents entry for the upgrade section uses the outdated anchor slug "upgrade-to-v510-and-optimize-table". Update this anchor to reflect the correct version "v5.10.2" and ensure it matches the corresponding heading slug exactly for link consistency.
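
To sanity-check which anchor a heading is likely to produce, here is a minimal Python sketch of a GitHub-style slug rule (lowercase, drop punctuation other than hyphens and underscores, turn spaces into hyphens). This is a simplified approximation, and the function name `github_style_slug` is hypothetical; the slug function actually used by the docs build may differ, so treat the output as a hint rather than a guarantee.

```python
import re


def github_style_slug(heading: str) -> str:
    """Approximate a GitHub-style heading anchor (simplified assumption)."""
    slug = heading.strip().lower()
    # Drop everything except word characters, hyphens, and spaces.
    slug = re.sub(r"[^\w\- ]", "", slug)
    # Each space becomes a hyphen.
    return slug.replace(" ", "-")


# Headings as discussed in this thread.
print(github_style_slug("Upgrade to v5.10 and Optimize table"))
print(github_style_slug("Upgrade to v5.10.2 and Optimize table"))
```

Under this simplified rule the dots in a version number are dropped rather than turned into hyphens, so it is worth checking the generated anchor in the rendered preview before changing the TOC link.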

coderabbitai · 2025-06-17T17:01:44Z

docs/performance/storage.md

@@ -47,30 +55,29 @@ CrateDB is a distributed database; nodes, shards, partitions and replicas are ti
 When a table is created, data is sharded and distributed among nodes. This
 means that the memory footprint depends on our replication and sharding strategy.

-Let's break down how the `775MiB` in CrateDB and the `510MiB` in PostgreSQL were
+Let's break down how the `431MiB` in CrateDB and the `510MiB` in PostgreSQL were


⚠️ Potential issue

Specify CrateDB version in breakdown
Reference the optimized storage as CrateDB v5.10.2 when breaking down the 431MiB.

- Let's break down how the `431MiB` in CrateDB and the `510MiB` in PostgreSQL were obtained.
+ Let's break down how the `431MiB` in CrateDB v5.10.2 and the `510MiB` in PostgreSQL were obtained.

📝 Committable suggestion

Suggested change

Let's break down how the `431MiB` in CrateDB and the `510MiB` in PostgreSQL were

Let's break down how the `431MiB` in CrateDB v5.10.2 and the `510MiB` in PostgreSQL were obtained.

🤖 Prompt for AI Agents

In docs/performance/storage.md at line 58, specify the CrateDB version as v5.10.2 when mentioning the optimized storage size of 431MiB. Update the text to clearly indicate that the 431MiB figure corresponds to CrateDB version 5.10.2 for clarity and accuracy.
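
For readers who want to see how far the headline figure moved, the following Python sketch computes the relative reduction between the two totals in the diff above. It assumes that `775MiB` is the pre-5.10 CrateDB total and `431MiB` the v5.10.2 total, which the diff implies but does not state outright; the helper name `percent_reduction` is hypothetical.

```python
def percent_reduction(before_mib: float, after_mib: float) -> float:
    """Relative size reduction, in percent, going from `before_mib` to `after_mib`."""
    return (before_mib - after_mib) / before_mib * 100


# Assumption: 775 MiB (earlier figure) vs. 431 MiB (v5.10.2 figure), both from the diff above.
reduction = percent_reduction(775, 431)
print(f"~{reduction:.2f}%")
```

The same helper can be applied to any pair of before/after sizes in the results tables to cross-check the percentages quoted in the document.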

amotl

Thanks a stack for the update, this outcome looks very promising.

I've added a few suggestions about MyST Markdown syntax and friends, and would like to give @hlcianfagna the pleasure of final approval, as he knows the topic matter and the English language better than me.

amotl · 2025-06-17T17:23:02Z

docs/performance/storage.md

+> INFO: In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,
+> more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary
+> depending on use cases. 


FYI: In MyST Markdown, you may want to phrase the admonition like this:

Suggested change

> INFO: In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,

> more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary

> depending on use cases.

:::{note}

In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,

more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary

depending on use cases.

:::

amotl · 2025-06-17T17:24:45Z

docs/performance/storage.md

+In CrateDB, when a table is created, sharding and replication have to be taken into account.
+When a table is created with default values, it gets partitioned using this formula: `max(4, num_data_nodes * 2)`.


Maybe elegantly refer to the sharding and scaling guideline docs at this spot?

https://cratedb.com/docs/guide/performance/sharding.html

https://cratedb.com/docs/guide/performance/scaling.html#number-of-shards

amotl · 2025-06-17T17:26:27Z

docs/performance/storage.md

+You can check the actual size of a shard locally; querying `select table_name, path from sys.shards`
+shows the file path of the shards.


Suggested change

You can check the actual size of a shard locally; querying `select table_name, path from sys.shards`

shows the file path of the shards.

You can check the actual size of a shard by inspecting the filesystem.

Use `select table_name, path from sys.shards` to show the file path of the individual shards.
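
As a sketch of what "inspecting the filesystem" could look like, the Python snippet below sums the file sizes under a shard directory. The path shown is a placeholder, not a real CrateDB location; substitute a `path` value returned by the `sys.shards` query. The helper name `directory_size_mib` is hypothetical.

```python
import os


def directory_size_mib(path: str) -> float:
    """Sum the sizes of all regular files under `path`, in MiB."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted twice.
            if not os.path.islink(file_path):
                total_bytes += os.path.getsize(file_path)
    return total_bytes / (1024 * 1024)


# Hypothetical placeholder; use a path from
# `select table_name, path from sys.shards` instead.
shard_path = "/path/to/shard"
if os.path.isdir(shard_path):
    print(f"{directory_size_mib(shard_path):.2f} MiB")
```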

amotl · 2025-06-17T17:28:21Z

docs/performance/storage.md

-The index can only be disabled when the table is created, if the table already exists and it cannot
-be deleted it will have to be re-created.
+The index can only be disabled when the table is created, if the table already exists, and it can’t
+be deleted, it will have to be re-created.


Is it possible to delete an index afterwards? Doesn't the table always need to be re-created?

amotl · 2025-06-17T17:29:11Z

docs/performance/storage.md

@@ -171,15 +188,24 @@ DROP TABLE "taxi_deleteme"

 > WARNING: Dropping the table deletes the data, make sure that the copy was done correctly.

-> INFO: Indexes cannot be re-added after table creation.
+> INFO: Indexes can’t be re-added after table creation.


Suggested change

> INFO: Indexes can’t be re-added after table creation.

:::{attention}

Indexes can’t be re-added after table creation.

:::

Suggested change

> INFO: Indexes can’t be re-added after table creation.

> INFO: Disabled indexes cannot be re-added without recreating the table.

amotl · 2025-06-17T17:29:45Z

docs/performance/storage.md

+Data was reduced `~13.11%` in v5.9 similar to `no_index`.
+Data was increased `~10%` in v.5.10
+
+> NOTE: In v5.10, we don’t recommend that you turn off column storage.


Suggested change

> NOTE: In v5.10, we don’t recommend that you turn off column storage.

:::{note}

In v5.10, we don’t recommend that you turn off column storage.

:::

amotl · 2025-06-17T17:30:29Z

docs/performance/storage.md

+In v5.10 we introduced a new table storage format that can reduce storage up to 50% in comparison
+to previous versions.
+
+The easiest way to gain massive storage reduction is to update to the latest version available. 


I would suggest removing the additional line break here.

Suggested change

In v5.10 we introduced a new table storage format that can reduce storage up to 50% in comparison

to previous versions.

The easiest way to gain massive storage reduction is to update to the latest version available.

In v5.10 we introduced a new table storage format that can reduce storage up to 50% in comparison

to previous versions. The easiest way to gain massive storage reduction is to update to the latest version available.

Upgrading alone would not be enough: the new format only applies to new tables and partitions, I believe, and only after segments are merged.

amotl · 2025-06-17T17:31:30Z

docs/performance/storage.md

+you would typically see storage usage being automatically reduced as more data gets added to the
+table. Nonetheless, you can trigger these optimizations manually by running:
+
+`OPTIMIZE table <table_name> WITH (max_num_segments=1)`


Just a little more uppercasing.

Suggested change

`OPTIMIZE table <table_name> WITH (max_num_segments=1)`

`OPTIMIZE TABLE <table_name> WITH (max_num_segments=1)`
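
If the statement needs to be generated for several tables, a small helper like the following Python sketch can fill in the `<table_name>` placeholder. The function name `optimize_statement` is hypothetical, and the table name `taxi` is only borrowed from the examples in the document under review. The helper quotes a single identifier, so a schema-qualified name would need to be split and quoted per part.

```python
def optimize_statement(table_name: str, max_num_segments: int = 1) -> str:
    """Build the OPTIMIZE TABLE statement quoted in this review thread."""
    # Double-quote the identifier and escape embedded double quotes.
    quoted = '"' + table_name.replace('"', '""') + '"'
    return f"OPTIMIZE TABLE {quoted} WITH (max_num_segments={int(max_num_segments)})"


print(optimize_statement("taxi"))
```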

amotl · 2025-06-17T17:31:52Z

docs/performance/storage.md

+
+> NOTE: In v5.10, we don’t recommend that you turn off column storage.
+
+## Upgrade to v5.10 and Optimize table


Maybe?

Suggested change

## Upgrade to v5.10 and Optimize table

## Upgrade to v5.10 and `OPTIMIZE TABLE`

amotl · 2025-06-17T17:32:08Z

docs/performance/storage.md

+
+## Upgrade to v5.10 and Optimize table
+
+In v5.10 we introduced a new table storage format that can reduce storage up to 50% in comparison


Suggested change

In v5.10 we introduced a new table storage format that can reduce storage up to 50% in comparison

CrateDB v5.10 introduced a new table storage format that can reduce storage up to 50% in comparison

hlcianfagna

Looks good, but please look at the comments on lines 66 and 253.

hlcianfagna · 2025-06-18T07:43:18Z

docs/performance/storage.md

-At first sight, it might look that CrateDB storage takes more than PostgreSQL,
-but we need to dive deeper to really understand what is going on.
+> INFO: In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,
+> more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary


Suggested change

> more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary

> more reasonably a 30–50% can be expected for most data, the reduction percentage will vary

hlcianfagna · 2025-06-18T07:43:47Z

docs/performance/storage.md

-but we need to dive deeper to really understand what is going on.
+> INFO: In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,
+> more reasonably a 30–50% can be expected for most data, the reduction % of reduction will vary
+> depending on use cases. 


Suggested change

> depending on use cases.

> depending on the use case.

hlcianfagna · 2025-06-18T07:45:26Z

docs/performance/storage.md

-For CrateDB when a table is created, sharding and replication has to be taken into account.
-When a table is created with default values, it gets partitioned in `max(4, num_data_nodes * 2)` shards.
+In CrateDB, when a table is created, sharding and replication have to be taken into account.
+When a table is created with default values, it gets partitioned using this formula: `max(4, num_data_nodes * 2)`.


Suggested change

When a table is created with default values, it gets partitioned using this formula: `max(4, num_data_nodes * 2)`.

When a table is created with default values, it gets a number of shards derived using this formula: `max(4, num_data_nodes * 2)`.
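
To make the formula concrete, a small sketch of what it yields for a few cluster sizes. The function name `default_shard_count` is made up for illustration; the expression itself is the one quoted in the suggestion above.

```python
def default_shard_count(num_data_nodes: int) -> int:
    """Default shard count for a new table: max(4, num_data_nodes * 2)."""
    if num_data_nodes < 1:
        raise ValueError("a cluster needs at least one data node")
    return max(4, num_data_nodes * 2)


# Clusters with one or two data nodes hit the floor of 4 shards;
# larger clusters get two shards per data node.
for nodes in (1, 2, 3, 5):
    print(f"{nodes} data node(s) -> {default_shard_count(nodes)} shards")
```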

hlcianfagna · 2025-06-18T07:48:53Z

docs/performance/storage.md

+
+The average size of a shard is ~35MiB, `35 * 12 = ~420`.
+
+You can check the actual size of a shard locally; querying `select table_name, path from sys.shards`


This should not differ significantly from what `sys.shards` reports, right?
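
A rough sanity check of the arithmetic in the quoted line, as a sketch. It assumes the `12` is the total number of shard copies (for example 6 primary shards with one replica each, which this excerpt does not spell out) and compares the estimate with the `431MiB` figure the updated page quotes for CrateDB v5.10.2. The function names are hypothetical.

```python
def total_shard_copies(primary_shards: int, replicas: int) -> int:
    """Primaries plus one extra copy of each primary per configured replica."""
    return primary_shards * (1 + replicas)


def estimated_size_mib(avg_shard_mib: float, shard_copies: int) -> float:
    """Estimate total table size from the average shard size."""
    return avg_shard_mib * shard_copies


# Assumption: 6 primaries and 1 replica give the 12 used in `35 * 12`.
copies = total_shard_copies(primary_shards=6, replicas=1)
estimate = estimated_size_mib(35, copies)

reported = 431  # MiB, the CrateDB v5.10.2 total quoted in the updated page
deviation = abs(reported - estimate) / reported * 100

print(f"{copies} shard copies, ~{estimate} MiB estimated")
print(f"deviation from the reported {reported} MiB: {deviation:.1f}%")
```

Because the average is rounded to `~35MiB`, a small gap between the estimate and the reported total is expected.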

hlcianfagna · 2025-06-18T07:50:33Z

docs/performance/storage.md

@@ -143,8 +160,8 @@ CREATE TABLE taxi
 )
 ```

-The index can only be disabled when the table is created, if the table already exists and it cannot
-be deleted it will have to be re-created.
+The index can only be disabled when the table is created, if the table already exists, and it can’t


Suggested change

The index can only be disabled when the table is created, if the table already exists, and it can’t

The index can only be disabled when the table is created, if the table already exists

hlcianfagna · 2025-06-18T07:51:07Z

docs/performance/storage.md

-The index can only be disabled when the table is created, if the table already exists and it cannot
-be deleted it will have to be re-created.
+The index can only be disabled when the table is created, if the table already exists, and it can’t
+be deleted, it will have to be re-created.


Suggested change

be deleted, it will have to be re-created.

it will have to be re-created.

hlcianfagna · 2025-06-18T07:52:09Z

docs/performance/storage.md

@@ -171,15 +188,24 @@ DROP TABLE "taxi_deleteme"

 > WARNING: Dropping the table deletes the data, make sure that the copy was done correctly.

-> INFO: Indexes cannot be re-added after table creation.
+> INFO: Indexes can’t be re-added after table creation.


Suggested change

> INFO: Indexes can’t be re-added after table creation.

> INFO: Disabled indexes cannot be re-added without recreating the table.

hlcianfagna · 2025-06-18T07:54:09Z

docs/performance/storage.md

+In v5.10 we introduced a new table storage format that can reduce storage up to 50% in comparison
+to previous versions.
+
+The easiest way to gain massive storage reduction is to update to the latest version available. 


Just upgrading alone would not be enough, the new format only applies to new tables and partitions I believe, and only after segments are merged.

hlcianfagna · 2025-06-18T07:54:45Z

docs/performance/storage.md

@@ -300,7 +379,7 @@ CREATE TABLE IF NOT EXISTS "doc"."taxi_nocolumnstore_noindex_bestcompresion" (
 CrateDB's default settings are optimized for performance.

 If some columns will never be used for aggregations or groupings, there will be no performance penalty.
-That might change in the future as your use case and data needs evolve,
+That might change in the future as your use case and data evolves,


Suggested change

That might change in the future as your use case and data evolves,

That might change in the future as your use case and data evolve,

coderabbitai bot reviewed Jun 17, 2025

Add benchmark results from v5.10.2

dab3146

surister force-pushed the fix/update-storage-usage-guide branch from a609bc4 to dab3146 Compare June 17, 2025 16:57

surister requested review from amotl and hlcianfagna June 17, 2025 16:57

coderabbitai bot reviewed Jun 17, 2025

amotl changed the title ~~Add benchmark results from v5.10.2~~ Performance » Storage Usage: Add benchmark results from v5.10.2 Jun 17, 2025

amotl changed the title ~~Performance » Storage Usage: Add benchmark results from v5.10.2~~ Storage Usage: Add benchmark results from v5.10.2 Jun 17, 2025

amotl reviewed Jun 17, 2025

hlcianfagna reviewed Jun 18, 2025
