Rollup performance on high dimensional space #454

Open
cugni opened this issue Oct 28, 2024 · 0 comments · May be fixed by #514
cugni commented Oct 28, 2024

What went wrong?

I had an issue using Qbeast with 8 dimensions on a relatively large dataset (~600 GB): the job never finishes, hanging on one specific task. When I listed the files created, I noticed they had wildly different sizes, ranging from 1.8MB to 220GB.

 ls -hl webpage_embedded_OPQ_quantized_qbeast/8dim_10_bites_100kcs
total 376G
-rw-r--r-- 1 qbst363636 qbst01 220G Oct 27 01:59 0c6a8d1f-52a5-4f8f-8ecc-cc72405c9b7b.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.1G Oct 27 01:15 42a62ec2-4d12-41f3-8798-72158ec3628d.parquet
-rw-r--r-- 1 qbst363636 qbst01 3.6G Oct 27 01:16 48723494-c348-4b88-8e87-ff00f0a3d9ee.parquet
-rw-r--r-- 1 qbst363636 qbst01 2.4G Oct 27 01:16 7ac2246a-a0d9-441b-b10f-69fff3b3dc7b.parquet
-rw-r--r-- 1 qbst363636 qbst01  13G Oct 27 01:17 84f4f027-bfef-495b-9414-ee476db909a9.parquet
-rw-r--r-- 1 qbst363636 qbst01 3.6G Oct 27 01:16 8dd83175-b71a-404a-bca5-5936abc3c188.parquet
-rw-r--r-- 1 qbst363636 qbst01 2.3G Oct 27 01:16 937fd444-7292-4ad8-bf91-4dc93fe6f3af.parquet
-rw-r--r-- 1 qbst363636 qbst01 2.8G Oct 27 01:16 9ba58775-3c24-4b7e-b545-e80e257eb5e7.parquet
-rw-r--r-- 1 qbst363636 qbst01 3.3G Oct 27 01:16 a9ca33f0-0e90-4fc0-bcec-79e60ec6c55a.parquet
-rw-r--r-- 1 qbst363636 qbst01 9.2G Oct 27 01:17 ab02f4b8-c26d-4f34-8e6d-09bcf73a144f.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.9G Oct 27 01:15 b283eb6d-63ad-4930-a083-e0dd6175d3aa.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.8M Oct 27 01:15 b2f6e0ef-6c5a-4992-950a-7f6ae7576cae.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.1G Oct 27 01:16 bd95fd41-7042-422f-a19f-85a6262688ab.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.4G Oct 27 01:15 c01d558f-00bd-45c1-9cb3-adfb5d533a73.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.2G Oct 27 01:15 cc5ef11b-ddac-4b8d-8de4-7296dca21dcf.parquet
-rw-r--r-- 1 qbst363636 qbst01 1.2G Oct 27 01:15 df401968-02ca-4428-8060-1a7286e32506.parquet
-rw-r--r-- 1 qbst363636 qbst01 2.0G Oct 27 01:16 ec897000-a272-428f-8020-5fc362af5e0d.parquet
-rw-r--r-- 1 qbst363636 qbst01 2.5G Oct 27 01:16 f5c70de8-dfca-42cf-8116-9333ff68a2f5.parquet
-rw-r--r-- 1 qbst363636 qbst01 106G Oct 27 01:59 f79b8a20-44a0-4d62-86c9-6feddeda86e5.parquet

I suspect this is caused by the roll-up algorithm, which simply pushes data into the parent cube; with 8 dimensions, that means a single roll-up can merge up to 2^8 + 1 = 257 cubes (a parent plus its 256 children).
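To put a number on that, here is a minimal back-of-the-envelope sketch (plain Scala; maxRolledUpSize is a hypothetical helper of mine, not Qbeast's actual roll-up code) of the worst case where a roll-up absorbs a parent cube together with all of its children:

// Worst case: a roll-up merges one parent cube with all 2^d children,
// each holding up to cubeSize elements.
// Hypothetical illustration, not Qbeast's actual roll-up logic.
def maxRolledUpSize(cubeSize: Long, dimensions: Int): Long =
  cubeSize * ((1L << dimensions) + 1)

println(maxRolledUpSize(1000L, 8)) // 257000 elements in a single file

And if roll-ups cascade across several levels of the tree, the factor would compound further, which would be consistent with the 220GB outlier above.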

How to reproduce?

Steps to reproduce the problem:

1. Code that triggered the bug, or steps to reproduce:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.catalyst.TableIdentifier
import io.qbeast.spark.QbeastTable

// The session must be created with the Qbeast extension enabled, e.g.
// --conf spark.sql.extensions=io.qbeast.spark.internal.QbeastSparkSessionExtension
val spark = SparkSession.builder.appName("SkewedColumnExample").getOrCreate()

// Generate a DataFrame with a range of 1,000,000 rows
val df = spark.range(1000000)

// Define columns with various distributions
val skewedDF = df
  .withColumn("uniform", rand()) // Uniform distribution [0, 1]
  .withColumn("right_skewed_exp", exp(rand() * 3)) // Exponential distribution (right-skewed)
  .withColumn("left_skewed_log", -log(rand() + 0.01)) // Left-skewed distribution
  .withColumn("normal_dist", randn()) // Normally distributed values
  .withColumn("bimodal_dist", when(rand() < 0.5, rand() * 2).otherwise(rand() * 5 + 5)) // Bimodal distribution
  .withColumn("power_law", pow(rand(), -1.5)) // Power-law distribution (heavy tail)
  .withColumn("triangular_dist", (rand() + rand() + rand()) / 3) // Triangular distribution
  .withColumn("binary_dist", when(rand() > 0.5, 1).otherwise(0)) // Binary distribution (0 or 1)



// Write the DataFrame in Qbeast format with all columns indexed
(skewedDF.write
  .format("qbeast")
  .option("cubeSize",1000)
  .option("columnsToIndex", "uniform,right_skewed_exp,left_skewed_log,normal_dist,bimodal_dist,power_law,triangular_dist,binary_dist")
  .mode("overwrite")
  .saveAsTable("tmp_cesare.dim8test"))


val location = spark.sessionState.catalog
  .getTableMetadata(TableIdentifier("dim8test", Some("tmp_cesare")))
  .location

val qt = QbeastTable.forPath(spark, location.toString)

println(qt.getIndexMetrics())

qt.getDenormalizedBlocks().groupBy("filePath").agg(sum("blockElementCount")).summary().show()

And I get these results, where we can see that most files contain far more elements than the desired cube size of 1000 (up to 20x); see the sketch after the environment details below for a quick way to list the oversized files.

+-------+--------------------+----------------------+
|summary|            filePath|sum(blockElementCount)|
+-------+--------------------+----------------------+
|  count|                 312|                   312|
|   mean|                NULL|     3205.128205128205|
| stddev|                NULL|     2560.482928268479|
|    min|00f4e8cf-630f-49d...|                    34|
|    25%|                NULL|                  1461|
|    50%|                NULL|                  2152|
|    75%|                NULL|                  4314|
|    max|ff567798-b120-44e...|                 20228|
+-------+--------------------+----------------------+

2. Branch and commit id:

3. Spark version:

3.5.3

4. Hadoop version:

3.3.4

5. How are you running Spark?

Spark Standalone, on 4 large nodes (256 GB RAM, 112 CPU each).
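
As a quick way to surface the worst offenders in a summary like the one above, a minimal follow-up query (reusing the qt handle from the script in step 1; the 2x-cube-size threshold is an arbitrary choice of mine, for illustration only):

// Flag files whose rolled-up element count exceeds twice the desired
// cube size (1000), sorted worst-first.
qt.getDenormalizedBlocks()
  .groupBy("filePath")
  .agg(sum("blockElementCount").as("elements"))
  .filter(col("elements") > 2 * 1000)
  .orderBy(desc("elements"))
  .show(truncate = false)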

@cugni cugni self-assigned this Oct 28, 2024
@fpj fpj changed the title Rollup perfromance on high dimensional space Rollup performance on high dimensional space Nov 11, 2024
@Jiaweihu08 Jiaweihu08 self-assigned this Dec 14, 2024
@Jiaweihu08 Jiaweihu08 linked a pull request Dec 18, 2024 that will close this issue