TPC-H compression seems not to work well #2430
Well, there are different things to discuss here. First, a question: is the 12.5 GB for SF 5 the maximum RSS you have seen? We have a rather significant footprint during data ingestion, for TPC-H usually a factor of 2.5 of the actual data. This only applies to data loading, as we create histograms in parallel and (admittedly) are not very resource-sensitive here.

If you care about the data footprint of the running database, I just want to mention that there are various data structures to look at. One of them is secondary indexes, which can be very large (but we don't really use them in TPC-H and they are not created by default). There are data statistics (this includes the mentioned histograms, but that's mostly in the KB/few-MB range after creation). And there are various caches which can grow over time (e.g., the cache for the physical/logical query plans).

Now to the actual data. If you want to examine the table data footprint, you can run a query against Hyrise's SQL meta tables.

We are actively researching this area and it is a super interesting field of research. In case you have any questions, you can also send me an email and I can share the most recent work with you (it's not yet published).
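A minimal sketch of what such a meta-table query could look like (the meta table and column names used here, meta_segments and estimated_size_in_bytes, are assumptions and may differ between Hyrise versions):

```sql
-- Approximate per-table, per-encoding footprint of the stored segments.
-- Check the meta tables available in your build for the exact names and columns.
SELECT table_name, encoding_type, SUM(estimated_size_in_bytes) AS size_in_bytes
FROM meta_segments
GROUP BY table_name, encoding_type;
```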
Oh, btw: it's not yet in the master branch, but we also support FSST (https://www.vldb.org/pvldb/vol13/p2649-boncz.pdf) encoding. It can be directly merged if you're interested in it.
That's the data for SF 5. For TPC-H with dictionary compression (without any further compression, no bit-packing), the string data is often stored in more bytes than the raw data, while many integer columns are compressed by a factor of 10.
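As a rough, illustrative back-of-the-envelope calculation (not measured numbers): lineitem at SF 5 has about 30 million rows. l_quantity has only 50 distinct values, so dictionary encoding replaces each 4-byte integer with a small value ID plus a 50-entry dictionary, easily a large reduction, especially once the value IDs are bit-packed. l_comment, in contrast, is a mostly unique random string of up to 44 characters, so the dictionary ends up containing nearly all of the raw string data, and the per-row value IDs and string offsets come on top, which is how an encoded string column can end up larger than the raw data.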
There are 3 phases:
Yeah, I just verified it and it looks pretty much the same on my machine. Is this overhead of interest to you?
In case you want to check what exactly is going on, colleagues have been successfully using Heaptrack for similar tasks: https://milianw.de/blog/heaptrack-a-heap-memory-profiler-for-linux.html
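In case it helps, a typical Heaptrack invocation looks roughly like this (the profile file name below is just an example; Heaptrack derives it from the binary name and PID):

```sh
# Record all allocations of the console while it loads and queries TPC-H data.
heaptrack ./hyriseConsole

# Afterwards, inspect the recorded profile (GUI or text analysis).
heaptrack_gui heaptrack.hyriseConsole.12345.gz
# or: heaptrack --analyze heaptrack.hyriseConsole.12345.gz
```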
Thanks for the help. Hyrise is an in-memory database, so memory usage is very important for us.
Yes, the memory footprint is definitely of high importance. 1 TB of raw data can easily be stored in 512 GB (assuming TPC-H data), but not with the default settings of Hyrise. We are actually even a bit more space-efficient here than other databases (e.g., Umbra, MonetDB, DuckDB). Using "a bit" more storage for TPC-H appears to be the norm (commercial database systems are better in this regard; SAP HANA uses ~4 GB for SF 10).

What's even more important here is that it's not only about the base data. If you want to support multiple concurrent clients, you also need DRAM capacity for query processing. Most databases thus recommend using only half of the DRAM for data and keeping the rest available for query processing. We don't have a cluster mode, but we are currently working on using disaggregated memory in Hyrise. That is one potential form of data tiering besides tiering to SSD/HDD.
Actually, we want to run TPC-H with SF 1000 on 512 GB of RAM.
There are branches that tier segments to secondary storage (e.g., SSD), but they are not in a state where we recommend using them. That might look different in about 6-8 weeks. I might have a branch that suits your needs (it will take a while until all of that is in master). I'll try to send you some information in mid-January, ok?
OK, thank you very much.
I have to postpone the information to the end of January. I cannot access the test server right now and am going on vacation for a week. I hope that still works for you.
It's ok.
Hey @louishust, can you tell me a little bit about what you are planning to do? We have a branch that should work sufficiently well on a 512 GB machine (still testing it), but depending on what you are planning to do, I'd need to adjust a few more things.
Hi @Bouncner, we are trying ClickHouse to see if TPC-H 1 TB can run on 512 GB of RAM. We use hyriseConsole to generate the TPC-H data and psql to connect to Hyrise and issue the TPC-H queries. When the branch is ready for testing, we can try it. BTW, does Hyrise support the MySQL protocol?
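For reference, roughly the workflow described above, as a sketch (the console command and the server's default port are assumptions and may differ between Hyrise versions):

```sh
# Generate and load the TPC-H tables inside the Hyrise console (scale factor 5 as an example).
./hyriseConsole
> generate_tpch 5

# Or: start the server and issue the TPC-H queries through psql.
./hyriseServer
psql -h localhost -p 5432
```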
We extended the server to encode the data during TPC-H's data generation. I am running a few tests to check if the current approach works fine. I'll keep you posted.
Unfortunately, no. So far, we only support the PostgreSQL protocol.
OK
Just to keep you up to date: I am working on memory leaks. It's a bit trickier than I thought.
It's ok.
I think it should be running now; I haven't seen any leaks. Testing large scale factors now. The approach that works is based on a current branch with which we study encoding schemes (a rather memory-efficient configuration has to be used, as Hyrise's default dictionary encoding does not compress long strings well) and has some modifications for very large scale factors (e.g., encoding the data already while the TPC-H data generation runs).
Glad to hear that.
I guess I have to give up for now. SF 1000 is running on Hyrise, but without reworking the aggregate operator, we will not run on a 512 GB machine. Data loading and compression are not an issue (I have a working branch that gets the data to <400 GB and loads the data concurrently), but when running the queries, the memory consumption rises above 700 GB with many threads. Sorry that it took so long to give this unsatisfying answer.
Could you link that branch here?
The branch is: https://github.com/hyrise/hyrise/tree/martin/sf1000

You can execute TPC-H SF 1000 the following way. The configuration is required to reduce the data set size; it's committed to the branch. The data preparation cores are used to limit the concurrency of encoding. Otherwise, all available cores would encode the 1000 GB of data, which is too much for a 512 GB system when the server has many cores.
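As a rough sketch of what such an invocation could look like (the flag names and the config file name are assumptions; run ./hyriseBenchmarkTPCH --help on the branch for the actual options):

```sh
# Scale factor 1000, memory-efficient encoding configuration from the branch,
# multithreaded scheduler, and only a few cores for data generation/encoding.
./hyriseBenchmarkTPCH -s 1000 -e encoding_config.json \
    --scheduler --cores 64 --data_preparation_cores 4
```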
I want to test the memory usage for TPC-H.
I use hyriseConsole to run TPC-H.
The memory used:
So the data is about 5 GB, but the memory usage is 12.6 GB.
Then I modified the default encoding type from dictionary to LZ4 for the generate_and_store function (roughly along the lines of the sketch at the end of this post).
The memory used:
I want to reduce the memory usage as much as possible.
AFAIK, many column-store databases can reach a high compression ratio.
So is there any best practice for memory tuning?
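Roughly what the change amounts to, as a sketch (the class names are taken from the Hyrise codebase, but the exact headers, namespace (opossum vs. hyrise), and signatures may differ in your checkout; reencode_with_lz4 is a hypothetical helper):

```cpp
#include <memory>
#include <string>

#include "hyrise.hpp"                 // Hyrise::get() singleton with the storage manager
#include "storage/chunk_encoder.hpp"  // ChunkEncoder::encode_all_chunks
#include "storage/encoding_type.hpp"  // EncodingType::LZ4

using namespace opossum;  // assumption: newer Hyrise versions use the `hyrise` namespace

// Re-encode all chunks of an already stored table with LZ4 instead of the default
// dictionary encoding. generate_and_store could pass an equivalent
// SegmentEncodingSpec during loading instead of re-encoding afterwards.
void reencode_with_lz4(const std::string& table_name) {
  const auto table = Hyrise::get().storage_manager.get_table(table_name);
  ChunkEncoder::encode_all_chunks(table, SegmentEncodingSpec{EncodingType::LZ4});
}
```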