Rollup performance on high dimensional space #454

Open
cugni opened this issue Oct 28, 2024 · 0 comments · May be fixed by #514
cugni commented Oct 28, 2024

What went wrong?

I had an issue using Qbeast with 8 dimensions on a relatively large dataset (~600 GB): the job never finishes, hanging on one specific task. When I listed the files created, I noticed they had wildly different sizes, ranging from 1.8MB to 220GB.

 ls -hl webpage_embedded_OPQ_quantized_qbeast/8dim_10_bites_100kcs
total 376G
-rw-r--r-- 1 qbst363636 qbst01 220G Oct 27 01:59 0c6a8d1f-52a5-4f8f-8ecc-cc72405c9b7b.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.1G Oct 27 01:15 42a62ec2-4d12-41f3-8798-72158ec3628d.parquet
-rw-r--r-- 1 qbst363636 qbst01 3.6G Oct 27 01:16 48723494-c348-4b88-8e87-ff00f0a3d9ee.parquet
-rw-r--r-- 1 qbst363636 qbst01 2.4G Oct 27 01:16 7ac2246a-a0d9-441b-b10f-69fff3b3dc7b.parquet
-rw-r--r-- 1 qbst363636 qbst01  13G Oct 27 01:17 84f4f027-bfef-495b-9414-ee476db909a9.parquet
-rw-r--r-- 1 qbst363636 qbst01 3.6G Oct 27 01:16 8dd83175-b71a-404a-bca5-5936abc3c188.parquet
-rw-r--r-- 1 qbst363636 qbst01 2.3G Oct 27 01:16 937fd444-7292-4ad8-bf91-4dc93fe6f3af.parquet
-rw-r--r-- 1 qbst363636 qbst01 2.8G Oct 27 01:16 9ba58775-3c24-4b7e-b545-e80e257eb5e7.parquet
-rw-r--r-- 1 qbst363636 qbst01 3.3G Oct 27 01:16 a9ca33f0-0e90-4fc0-bcec-79e60ec6c55a.parquet
-rw-r--r-- 1 qbst363636 qbst01 9.2G Oct 27 01:17 ab02f4b8-c26d-4f34-8e6d-09bcf73a144f.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.9G Oct 27 01:15 b283eb6d-63ad-4930-a083-e0dd6175d3aa.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.8M Oct 27 01:15 b2f6e0ef-6c5a-4992-950a-7f6ae7576cae.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.1G Oct 27 01:16 bd95fd41-7042-422f-a19f-85a6262688ab.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.4G Oct 27 01:15 c01d558f-00bd-45c1-9cb3-adfb5d533a73.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.2G Oct 27 01:15 cc5ef11b-ddac-4b8d-8de4-7296dca21dcf.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.2G Oct 27 01:15 df401968-02ca-4428-8060-1a7286e32506.parquet
-rw-r--r-- 1 qbst363636 qbst01 2.0G Oct 27 01:16 ec897000-a272-428f-8020-5fc362af5e0d.parquet
-rw-r--r-- 1 qbst363636 qbst01 2.5G Oct 27 01:16 f5c70de8-dfca-42cf-8116-9333ff68a2f5.parquet
-rw-r--r-- 1 qbst363636 qbst01 106G Oct 27 01:59 f79b8a20-44a0-4d62-86c9-6feddeda86e5.parquet

I suspect this is caused by the roll-up algorithm, which simply pushes data into the parent cube; with 8 dimensions, that means a single roll-up can merge up to 2^8 + 1 = 257 cubes (a parent plus its 256 children).
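To put a number on that, here is a minimal back-of-the-envelope sketch (plain Scala; maxRolledUpSize is a hypothetical helper of mine, not Qbeast's actual roll-up code) of the worst case where a roll-up absorbs a parent cube together with all of its children:

// Worst case: a roll-up merges one parent cube with all 2^d children,
// each holding up to cubeSize elements.
// Hypothetical illustration, not Qbeast's actual roll-up logic.
def maxRolledUpSize(cubeSize: Long, dimensions: Int): Long =
  cubeSize * ((1L << dimensions) + 1)

println(maxRolledUpSize(1000L, 8)) // 257000 elements in a single file

And if roll-ups cascade across several levels of the tree, the factor would compound further, which would be consistent with the 220GB outlier above.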

How to reproduce?

Steps to reproduce the problem:

1. Code that triggered the bug, or steps to reproduce:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.catalyst.TableIdentifier
import io.qbeast.spark.QbeastTable

// The session must be created with the Qbeast extension enabled, e.g.
// --conf spark.sql.extensions=io.qbeast.spark.internal.QbeastSparkSessionExtension
val spark = SparkSession.builder.appName("SkewedColumnExample").getOrCreate()

// Generate a DataFrame with a range of 1,000,000 rows
val df = spark.range(1000000)

// Define columns with various distributions
val skewedDF = df
  .withColumn("uniform", rand()) // Uniform distribution [0, 1]
  .withColumn("right_skewed_exp", exp(rand() * 3)) // Exponential distribution (right-skewed)
  .withColumn("left_skewed_log", -log(rand() + 0.01)) // Left-skewed distribution
  .withColumn("normal_dist", randn()) // Normally distributed values
  .withColumn("bimodal_dist", when(rand() < 0.5, rand() * 2).otherwise(rand() * 5 + 5)) // Bimodal distribution
  .withColumn("power_law", pow(rand(), -1.5)) // Power-law distribution (heavy tail)
  .withColumn("triangular_dist", (rand() + rand() + rand()) / 3) // Triangular distribution
  .withColumn("binary_dist", when(rand() > 0.5, 1).otherwise(0)) // Binary distribution (0 or 1)



// Write the DataFrame in Qbeast format with all columns indexed
(skewedDF.write
  .format("qbeast")
  .option("cubeSize",1000)
  .option("columnsToIndex", "uniform,right_skewed_exp,left_skewed_log,normal_dist,bimodal_dist,power_law,triangular_dist,binary_dist")
  .mode("overwrite")
  .saveAsTable("tmp_cesare.dim8test"))


val location = spark.sessionState.catalog
  .getTableMetadata(TableIdentifier("dim8test", Some("tmp_cesare")))
  .location

val qt = QbeastTable.forPath(spark, location.toString)

println(qt.getIndexMetrics())

qt.getDenormalizedBlocks().groupBy("filePath").agg(sum("blockElementCount")).summary().show()

And I get these results, where we can see that most files contain far more elements than the desired cube size of 1000 (up to 20x); see the sketch after the environment details below for a quick way to list the oversized files.

+-------+--------------------+----------------------+
|summary|            filePath|sum(blockElementCount)|
+-------+--------------------+----------------------+
|  count|                 312|                   312|
|   mean|                NULL|     3205.128205128205|
| stddev|                NULL|     2560.482928268479|
|    min|00f4e8cf-630f-49d...|                    34|
|    25%|                NULL|                  1461|
|    50%|                NULL|                  2152|
|    75%|                NULL|                  4314|
|    max|ff567798-b120-44e...|                 20228|
+-------+--------------------+----------------------+

2. Branch and commit id:

3. Spark version:

3.5.3

4. Hadoop version:

3.3.4

5. How are you running Spark?

Spark Standalone, on 4 large nodes (256 GB RAM, 112 CPU each).
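
As a quick way to surface the worst offenders in a summary like the one above, a minimal follow-up query (reusing the qt handle from the script in step 1; the 2x-cube-size threshold is an arbitrary choice of mine, for illustration only):

// Flag files whose rolled-up element count exceeds twice the desired
// cube size (1000), sorted worst-first.
qt.getDenormalizedBlocks()
  .groupBy("filePath")
  .agg(sum("blockElementCount").as("elements"))
  .filter(col("elements") > 2 * 1000)
  .orderBy(desc("elements"))
  .show(truncate = false)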

@cugni cugni self-assigned this Oct 28, 2024
@fpj fpj changed the title Rollup perfromance on high dimensional space Rollup performance on high dimensional space Nov 11, 2024
@Jiaweihu08 Jiaweihu08 self-assigned this Dec 14, 2024
@Jiaweihu08 Jiaweihu08 linked a pull request Dec 18, 2024 that will close this issue