What went wrong?
I had an issue using Qbeast with 8 dimensions on a relatively large dataset (~600 GB): the job never finishes, hanging on one specific task. When I listed the files it created, I noticed that their sizes vary wildly, ranging from 1.8 MB to 220 GB.
ls -hl webpage_embedded_OPQ_quantized_qbeast/8dim_10_bites_100kcs
total 376G
-rw-r--r-- 1 qbst363636 qbst01 220G Oct 27 01:59 0c6a8d1f-52a5-4f8f-8ecc-cc72405c9b7b.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.1G Oct 27 01:15 42a62ec2-4d12-41f3-8798-72158ec3628d.parquet
-rw-r--r-- 1 qbst363636 qbst01 3.6G Oct 27 01:16 48723494-c348-4b88-8e87-ff00f0a3d9ee.parquet
-rw-r--r-- 1 qbst363636 qbst01 2.4G Oct 27 01:16 7ac2246a-a0d9-441b-b10f-69fff3b3dc7b.parquet
-rw-r--r-- 1 qbst363636 qbst01 13G Oct 27 01:17 84f4f027-bfef-495b-9414-ee476db909a9.parquet
-rw-r--r-- 1 qbst363636 qbst01 3.6G Oct 27 01:16 8dd83175-b71a-404a-bca5-5936abc3c188.parquet
-rw-r--r-- 1 qbst363636 qbst01 2.3G Oct 27 01:16 937fd444-7292-4ad8-bf91-4dc93fe6f3af.parquet
-rw-r--r-- 1 qbst363636 qbst01 2.8G Oct 27 01:16 9ba58775-3c24-4b7e-b545-e80e257eb5e7.parquet
-rw-r--r-- 1 qbst363636 qbst01 3.3G Oct 27 01:16 a9ca33f0-0e90-4fc0-bcec-79e60ec6c55a.parquet
-rw-r--r-- 1 qbst363636 qbst01 9.2G Oct 27 01:17 ab02f4b8-c26d-4f34-8e6d-09bcf73a144f.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.9G Oct 27 01:15 b283eb6d-63ad-4930-a083-e0dd6175d3aa.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.8M Oct 27 01:15 b2f6e0ef-6c5a-4992-950a-7f6ae7576cae.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.1G Oct 27 01:16 bd95fd41-7042-422f-a19f-85a6262688ab.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.4G Oct 27 01:15 c01d558f-00bd-45c1-9cb3-adfb5d533a73.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.2G Oct 27 01:15 cc5ef11b-ddac-4b8d-8de4-7296dca21dcf.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.2G Oct 27 01:15 df401968-02ca-4428-8060-1a7286e32506.parquet
-rw-r--r-- 1 qbst363636 qbst01 2.0G Oct 27 01:16 ec897000-a272-428f-8020-5fc362af5e0d.parquet
-rw-r--r-- 1 qbst363636 qbst01 2.5G Oct 27 01:16 f5c70de8-dfca-42cf-8116-9333ff68a2f5.parquet
-rw-r--r-- 1 qbst363636 qbst01 106G Oct 27 01:59 f79b8a20-44a0-4d62-86c9-6feddeda86e5.parquet
I suspect this is caused by the roll-up algorithm, which simply pushes data into the parent cube; with 8 dimensions, a single roll-up step can group up to 2^8 + 1 = 257 cubes into one file (see the sketch below).
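For intuition, here is a rough back-of-the-envelope sketch (plain Scala, not Qbeast's actual implementation) of how the roll-up fan-in grows with the number of indexed dimensions, assuming a single level of roll-up and the desired cube size of 1000 used in the reproduction below:

// Back-of-the-envelope only: with d indexed dimensions each parent cube has 2^d
// children, so rolling one level up can merge up to 2^d + 1 cubes into one file.
val desiredCubeSize = 1000
(2 to 8).foreach { d =>
  val cubesPerRollup = (1 << d) + 1
  println(s"d=$d -> up to $cubesPerRollup cubes rolled together, ~${cubesPerRollup * desiredCubeSize} elements in a single file")
}

For d = 8 this gives 257 cubes and roughly 257,000 elements in the worst case, which would explain files far larger than the configured cube size.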
How to reproduce?
Steps to reproduce the problem:
1. Code that triggered the bug, or steps to reproduce:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import io.qbeast.spark.QbeastSparkSessionExtension._
import org.apache.spark.sql.catalyst.TableIdentifier
import io.qbeast.spark.QbeastTable // needed for QbeastTable.forPath below; the package may differ in other Qbeast versions

val spark = SparkSession.builder.appName("SkewedColumnExample").getOrCreate()

// Generate a DataFrame with 1,000,000 rows
val df = spark.range(1000000)

// Define columns with various distributions
val skewedDF = df
  .withColumn("uniform", rand())                      // Uniform distribution [0, 1]
  .withColumn("right_skewed_exp", exp(rand() * 3))    // Exponential distribution (right-skewed)
  .withColumn("left_skewed_log", -log(rand() + 0.01)) // Left-skewed distribution
  .withColumn("normal_dist", randn())                 // Normally distributed values
  .withColumn("bimodal_dist", when(rand() < 0.5, rand() * 2).otherwise(rand() * 5 + 5)) // Bimodal distribution
  .withColumn("power_law", pow(rand(), -1.5))         // Power-law distribution (heavy tail)
  .withColumn("triangular_dist", (rand() + rand() + rand()) / 3) // Triangular distribution
  .withColumn("binary_dist", when(rand() > 0.5, 1).otherwise(0)) // Binary distribution (0 or 1)

// Write the DataFrame in Qbeast format with all columns indexed
(skewedDF.write
  .format("qbeast")
  .option("cubeSize", 1000)
  .option("columnsToIndex", "uniform,right_skewed_exp,left_skewed_log,normal_dist,bimodal_dist,power_law,triangular_dist,binary_dist")
  .mode("overwrite")
  .saveAsTable("tmp_cesare.dim8test"))

val location = spark.sessionState.catalog
  .getTableMetadata(TableIdentifier("dim8test", Some("tmp_cesare")))
  .location
val qt = QbeastTable.forPath(spark, location.toString)
println(qt.getIndexMetrics())
qt.getDenormalizedBlocks().groupBy("filePath").agg(sum("blockElementCount")).summary().show()
And I get these results, where we can see that most of the files contain far more elements than the desired cube size of 1000 (up to 20x).
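As a cross-check that does not rely on the Qbeast API, the per-file row counts can also be computed with plain Spark (a minimal sketch reusing the location value from the snippet above), which shows the same skew as the ls listing:

// Plain-Spark cross-check (not a Qbeast API): count rows per physical file.
// Reads every Parquet file under the table location, ignoring the transaction log.
import org.apache.spark.sql.functions.{input_file_name, desc}
spark.read.parquet(location.toString)
  .withColumn("file", input_file_name())
  .groupBy("file")
  .count()
  .orderBy(desc("count"))
  .show(truncate = false)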
2. Branch and commit id:
3. Spark version:
3.5.3
4. Hadoop version:
3.3.4
5. How are you running Spark?
Spark Standalone, on 4 large nodes (256 GB RAM, 112 CPUs each).